This informal experiment benchmarked several LLMs (Claude, GPT, Gemini, Qwen, Kimi, GLM) on how well they could predict the cooling curve of hot coffee in a ceramic mug. Each model was given specific physical parameters and asked to produce an equation for temperature over time. The author then ran the actual experiment and compared the predictions to the measurements. All models used exponential decay formulas; Claude 4.6 Opus performed best, but its answer cost $0.61. The predictions were directionally reasonable, but they underestimated cooling early on and overestimated it later. DeepSeek and Grok failed to return answers while still charging for tokens.
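For readers who want to see the kind of formula involved, here is a minimal sketch of a single-exponential cooling curve (Newton's law of cooling). The function name and all parameter values (starting temperature, room temperature, rate constant `k_per_min`) are illustrative assumptions, not the parameters or results from the experiment.

```python
import math


def coffee_temperature(t_min, t_start_c=90.0, t_ambient_c=20.0, k_per_min=0.03):
    """Newton's law of cooling: T(t) = T_amb + (T0 - T_amb) * exp(-k * t).

    All default values are hypothetical placeholders, not measured data.
    """
    return t_ambient_c + (t_start_c - t_ambient_c) * math.exp(-k_per_min * t_min)


if __name__ == "__main__":
    # Print the modeled temperature at a few sample times (in minutes).
    for minutes in (0, 10, 30, 60):
        print(f"{minutes:3d} min: {coffee_temperature(minutes):.1f} °C")
```

A curve of this form has only one rate constant, so it cannot cool faster than the fit early on and slower than the fit later. That is one possible reading of the mismatch described above, though the summary does not say what caused it.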