A benchmark called 'BS Bench' tests LLMs by asking nonsense questions whose premises are logically incoherent (e.g., relating fire safety codes to curry recipes). Claude models generally refuse to answer such questions, while OpenAI and Google models tend to confidently fabricate detailed answers. Gemini 2.5 (nicknamed
