N-Day-Bench is a benchmark that evaluates frontier LLMs on their ability to discover real-world N-Day vulnerabilities disclosed after each model's knowledge cutoff. Finder models run through a standardized, OpenRouter-backed harness, a setup meant to prevent reward hacking and measure genuine cybersecurity capability. The benchmark is updated monthly with new test cases and model versions. The current top performers are GPT-5.4 (83.93), GLM-5.1 (80.13), Claude Opus 4.6 (79.95), Kimi K2.5 (77.18), and Gemini 3.1 Pro Preview (68.50). All traces are publicly browsable.

From ndaybench.winfunc.com