An informal experiment benchmarking several LLMs (Claude, GPT, Gemini, Qwen, Kimi, GLM) on their ability to predict the cooling curve of hot coffee in a ceramic mug. Each model was asked to produce a temperature-over-time equation given specific physical parameters. The author then ran the actual experiment and compared predictions to reality. All models used exponential decay formulas; Claude 4.6 Opus performed best but cost $0.61. Predictions were directionally reasonable but underestimated early cooling and overestimated later cooling. DeepSeek and Grok failed to return answers while still charging for tokens.
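
For readers unfamiliar with the exponential decay form mentioned above, here is a minimal sketch of Newton's law of cooling, which is the standard equation of that shape. The function name `coffee_temp` and all default values (starting temperature, room temperature, time constant) are illustrative assumptions, not numbers from the post or from any model's answer.

```python
import math


def coffee_temp(t_min, t_start=90.0, t_room=20.0, tau_min=30.0):
    """Temperature (deg C) after t_min minutes under simple exponential decay.

    T(t) = T_room + (T_start - T_room) * exp(-t / tau)

    All defaults are hypothetical placeholders chosen for illustration only.
    """
    return t_room + (t_start - t_room) * math.exp(-t_min / tau_min)


if __name__ == "__main__":
    # Print the assumed cooling curve at a few time points.
    for minutes in (0, 5, 15, 30, 60):
        print(f"{minutes:3d} min: {coffee_temp(minutes):.1f} C")
```

A single time constant like this cannot cool quickly at first and slowly later, which is consistent with the summary's note that the predictions underestimated early cooling and overestimated later cooling.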

5m read time · From dynomight.net
