From x.com

Truthfulness is useful for engineering & research (and hopefully beyond). But maybe we are naive :) https://t.co/lU7hxzV5Cz

Andon Labs @andonlabs

GPT-5.3-Codex is #3 on Vending-Bench 2. Performance was on par with Claude Opus 4.6 for the first 200 days, but then dropped off. Unlike the Claude models above it, it never lied to anyone throughout the simulation. https://t.co/zGhBAS0I5n
