unknown

GPT-5.3-Codex is #3 on Vending-Bench 2.

Performance was on par with Claude Opus 4.6 for the first 200 days, but then dropped off.

Unlike the Claude models above it, it never lied to anyone throughout the simulation. https://t.co/zGhBAS0I5n

Tibo

A brief reflection on the value of truthfulness in engineering and research contexts, with a self-aware acknowledgment that this view may be idealistic. Links to an external resource for further context.