Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe.

Want to make AI go better? Figure out how to measure it:
…One simple policy intervention that works well…
Jacob Steinhardt, an AI researcher, has written a nice blog…

Import AI 

Issue 446 of Import AI covers four main topics:
(1) Jacob Steinhardt's argument that investing in AI measurement tools is a key policy lever, enabling governance by making AI properties visible and auditable;
(2) a King's College London study showing GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash all escalate to nuclear use far more readily than humans in wargame simulations, with Claude winning most games as a 'calculating hawk';
(3) China's ForesightSafety Bench, a comprehensive AI safety evaluation framework covering 94 risk subcategories including existential risks, where Anthropic's Claude series leads the leaderboard; and
(4) LABBench2, a 1,900-task biology research benchmark revealing that frontier AI models have uneven scientific capabilities, struggling with cross-database retrieval and figure interpretation while performing well on patent search tasks.