A benchmark called 'BS Bench' tests LLMs by asking nonsense questions whose premises are logically incoherent (e.g., relating fire safety codes to curry recipes). Claude models generally refuse to answer such questions, while OpenAI and Google models tend to confidently fabricate detailed answers. Kimi K2 (nicknamed 'Kimmy K') surprisingly outperforms the OpenAI and Google models on pushback. The deeper concern is that LLMs act as skill multipliers: engineers with poor judgment who use AI confidently will make bad decisions faster and at greater scale. The real danger isn't obviously nonsensical questions but subtly flawed ones that the AI answers without pushback.
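The evaluation idea described above can be sketched as a small harness: feed each model a nonsense-premise prompt and score whether it pushes back or fabricates. This is a minimal illustration, not the actual BS Bench code; the function names, the marker list, and the stub models are all hypothetical.

```python
# Hypothetical sketch of a nonsense-premise benchmark harness.
# Everything here (classify_response, PUSHBACK_MARKERS, the stub
# models) is illustrative, not the real BS Bench implementation.

PUSHBACK_MARKERS = (
    "doesn't make sense",
    "no meaningful relationship",
    "the premise is flawed",
    "these are unrelated",
)

def classify_response(answer: str) -> str:
    """Label an answer as 'pushback' or 'fabrication' via keyword match."""
    lowered = answer.lower()
    if any(marker in lowered for marker in PUSHBACK_MARKERS):
        return "pushback"
    return "fabrication"

def score_model(ask, prompts) -> float:
    """Fraction of nonsense prompts the model pushes back on.

    `ask` is any callable mapping a prompt string to an answer string,
    so a real API client could be dropped in here.
    """
    labels = [classify_response(ask(p)) for p in prompts]
    return labels.count("pushback") / len(labels)

# Stub models standing in for real LLM calls.
nonsense_prompts = [
    "How do fire safety codes determine the spice level of a curry?",
]

def skeptical_model(prompt: str) -> str:
    return "The premise is flawed: these are unrelated domains."

def confident_model(prompt: str) -> str:
    # Confidently fabricates a detailed, authoritative-sounding answer.
    return "Fire codes cap curry spice levels under a building's egress rating."

print(score_model(skeptical_model, nonsense_prompts))  # 1.0
print(score_model(confident_model, nonsense_prompts))  # 0.0
```

A real harness would replace keyword matching with a stronger judge (e.g., a second model grading the response), since fabricated answers rarely announce themselves; the keyword version just keeps the sketch self-contained.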