Issue 446 of Import AI covers four main topics: (1) Jacob Steinhardt's argument that investing in AI measurement tools is a key policy lever, enabling governance by making AI properties visible and auditable; (2) a King's College London study showing GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash all escalate to nuclear use far more readily than humans in wargame simulations, with Claude winning most games as a 'calculating hawk'; (3) China's ForesightSafety Bench, a comprehensive AI safety evaluation framework covering 94 risk subcategories including existential risks, where Anthropic's Claude series leads the leaderboard; and (4) LABBench2, a 1,900-task biology research benchmark revealing that frontier AI models have uneven scientific capabilities, struggling with cross-database retrieval and figure interpretation while performing well on patent search tasks.
