Why not benchmark with physical experiments?


Hacker News

This informal experiment benchmarks several LLMs (Claude, GPT, Gemini, Qwen, Kimi, GLM) on their ability to predict the cooling curve of hot coffee in a ceramic mug. Each model was asked to produce an equation for temperature over time, given specific physical parameters. The author then ran the actual experiment and compared the predictions to the measurements. All models used exponential decay formulas; Claude 4.6 Opus performed best but cost $0.61. The predictions were directionally reasonable, but they underestimated how fast the coffee cooled early on and overestimated how fast it cooled later. DeepSeek and Grok failed to return answers while still charging for tokens.
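For readers unfamiliar with the shape of such a prediction, here is a minimal sketch of an exponential decay cooling curve, written as Newton's law of cooling, with a simple error score for comparing a prediction against measurements. This is only an illustration: the function names and the starting temperature, ambient temperature, and rate constant are hypothetical placeholders, not the parameters, equations, or scoring method used in the experiment.

```python
import math


def newton_cooling(t_min, t_start, t_ambient, k):
    """Temperature (deg C) after t_min minutes under Newton's law of cooling.

    T(t) = T_ambient + (T_start - T_ambient) * exp(-k * t)
    """
    return t_ambient + (t_start - t_ambient) * math.exp(-k * t_min)


def rmse(predicted, measured):
    """Root-mean-square error between two equal-length temperature series."""
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical values for illustration only (not from the experiment).
T_START = 90.0    # deg C, coffee temperature at t = 0
T_AMBIENT = 20.0  # deg C, room temperature
K = 0.03          # 1/min, assumed cooling rate constant

times = [0, 10, 20, 30, 60]  # minutes
predicted = [newton_cooling(t, T_START, T_AMBIENT, K) for t in times]

for t, temp in zip(times, predicted):
    print(f"{t:3d} min: {temp:.1f} C")
```

A single rate constant `k` gives one fixed curve shape. That may be one reason a pure exponential can miss in opposite directions at different times, as the summary describes, since a real mug loses heat through several mechanisms (evaporation, convection, warming of the ceramic) whose relative weight changes as the coffee cools. Calling `rmse(predicted, measured)` on a list of measured temperatures would give a single score per model.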

LLMs predict my coffee