LLM Skirmish - An Adversarial In-Context Learning Benchmark


Hacker News

LLM Skirmish is an adversarial benchmark in which frontier LLMs compete in 1v1 real-time strategy games by writing JavaScript code that executes in a Screeps-based game environment. Each tournament runs for five rounds, and models can adapt their strategies based on previous results, which tests in-context learning. Claude Opus 4.5 dominates the results with an 85% win rate and an Elo rating of 1778, followed by GPT 5.2 with a 68% win rate. GPT 5.2 is the more cost-efficient of the two, delivering roughly 1.7x more Elo per dollar than Claude Opus 4.5. A notable anomaly is Gemini 3 Pro, which led round 1 with a 71% win rate but collapsed in later rounds, likely due to context rot from aggressively loading previous match data. Most models improved meaningfully across rounds, which supports the benchmark's claim to measure in-context learning.
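For readers unfamiliar with the Elo ratings mentioned above, here is a minimal sketch of the standard Elo update for a single 1v1 match. The summary does not say how LLM Skirmish actually computes its ratings, so the K-factor of 32, the starting rating of 1500, and the function names `expectedScore` and `updateElo` are all assumptions made for illustration.

```javascript
// Standard Elo model: probability-like expected score of A against B.
function expectedScore(ratingA, ratingB) {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

// Update both ratings after one match.
// scoreA is 1 if A won, 0.5 for a draw, 0 if A lost.
// kFactor = 32 is an assumed value, not taken from the benchmark.
function updateElo(ratingA, ratingB, scoreA, kFactor = 32) {
  const delta = kFactor * (scoreA - expectedScore(ratingA, ratingB));
  // Elo is zero-sum: whatever A gains, B loses.
  return { ratingA: ratingA + delta, ratingB: ratingB - delta };
}

// Two models start level (assumed starting rating), and A wins.
const startRating = 1500;
const afterMatch = updateElo(startRating, startRating, 1);
console.log(afterMatch); // { ratingA: 1516, ratingB: 1484 }
```

Because the update depends on the gap between the two ratings, beating a stronger opponent moves a rating more than beating a weaker one, which is why a win rate and an Elo rating can rank models slightly differently.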