From x.com
Truthfulness is useful for engineering & research (and hopefully beyond). But maybe we are naive :) https://t.co/lU7hxzV5Cz

Andon Labs @andonlabs
GPT-5.3-Codex is #3 on Vending-Bench 2. Performance was on par with Claude Opus 4.6 for the first 200 days, but then dropped off. Unlike the Claude models above it, it never lied to anyone throughout the simulation. https://t.co/zGhBAS0I5n