OpenAI launches EVMbench to stress-test AI against smart contract vulnerabilities

EVMbench, a benchmark from OpenAI and Paradigm, tests AI’s ability to detect, exploit, and patch vulnerabilities in smart contracts that secure over $100 billion in crypto.

OpenAI has unveiled EVMbench, a new evaluation framework built to measure how well artificial intelligence can detect, exploit, and fix security flaws in smart contracts, the self-executing code that powers most of decentralized finance.

Developed with crypto investment firm Paradigm and published on February 18, the benchmark draws on 120 real-world vulnerabilities sourced from 40 separate security audits and open code competitions. The goal: establish a clear, reproducible standard for assessing AI in one of crypto’s most high-stakes arenas.

Why smart contract security matters

Smart contracts are the backbone of DeFi, running everything from decentralized exchanges to lending protocols. Because they’re typically immutable once deployed, a single uncaught bug can drain millions — or even billions — in user funds. These contracts collectively secure more than $100 billion in open-source crypto assets, according to OpenAI’s announcement.

“As AI agents improve at reading, writing, and executing code, it becomes increasingly important to measure their capabilities in economically meaningful environments,” the company said, “and to encourage the use of AI systems defensively to audit and strengthen deployed contracts.”

Three modes, one benchmark

EVMbench evaluates AI across three distinct capabilities. In Detect mode, the AI audits a contract repository and is scored on how many genuine vulnerabilities it identifies. In Patch mode, it modifies vulnerable code to close flaws without breaking intended functionality. In Exploit mode, the AI executes a live fund-draining attack against a deployed contract in a sandboxed blockchain environment.
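To make the three modes concrete, here is a minimal Python sketch of how a single task might be scored in each one. It is an illustration only, not EVMbench's actual harness: the names `Mode`, `TaskResult`, and `score`, along with the exact scoring rules (fraction of known vulnerabilities found for Detect, all-or-nothing for Patch and Exploit), are assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    DETECT = "detect"
    PATCH = "patch"
    EXPLOIT = "exploit"


@dataclass
class TaskResult:
    """Hypothetical record of one benchmark task (not EVMbench's real schema)."""
    mode: Mode
    known_vulns: int = 0           # ground-truth vulnerabilities in the task
    found_vulns: int = 0           # detect: genuine vulnerabilities reported
    tests_pass: bool = False       # patch: intended functionality still works
    exploit_blocked: bool = False  # patch: the known attack no longer succeeds
    funds_drained: bool = False    # exploit: attack succeeded in the sandbox


def score(result: TaskResult) -> float:
    """Return a score in [0, 1] under the assumed rules described above."""
    if result.mode is Mode.DETECT:
        if result.known_vulns == 0:
            return 0.0
        # Assumed rule: fraction of known vulnerabilities identified.
        found = min(result.found_vulns, result.known_vulns)
        return found / result.known_vulns
    if result.mode is Mode.PATCH:
        # Assumed rule: the flaw must be closed without breaking behavior.
        return 1.0 if (result.tests_pass and result.exploit_blocked) else 0.0
    # Exploit mode, assumed rule: success means the funds were drained.
    return 1.0 if result.funds_drained else 0.0


if __name__ == "__main__":
    print(score(TaskResult(Mode.DETECT, known_vulns=4, found_vulns=3)))
    print(score(TaskResult(Mode.PATCH, tests_pass=True, exploit_blocked=False)))
    print(score(TaskResult(Mode.EXPLOIT, funds_drained=True)))
```

In this sketch a patch that blocks the attack but breaks the contract's tests scores zero, mirroring the requirement to close flaws without breaking intended functionality.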

Performance is strongest in exploit mode, where the objective is clear and feedback is immediate. In detect and patch modes, AI systems tend to stop early or struggle to eliminate subtle bugs while preserving contract behavior.

GPT-5.3-Codex leads the pack

In exploit mode, OpenAI’s GPT-5.3-Codex scored 72.2%, a significant jump from the 31.9% that GPT-5 posted just six months ago. That is more than double the earlier score (roughly 2.3 times) in half a year, with major implications for both defenders and attackers across the DeFi ecosystem.
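The size of that jump can be checked with simple arithmetic. The short Python sketch below uses only the two reported scores; the helper name `score_change` is made up for this example.

```python
def score_change(old_pct: float, new_pct: float) -> tuple[float, float]:
    """Return (ratio of new to old score, gain in percentage points)."""
    return new_pct / old_pct, new_pct - old_pct


# Reported exploit-mode scores: GPT-5 at 31.9%, GPT-5.3-Codex at 72.2%.
ratio, points = score_change(31.9, 72.2)
print(f"{ratio:.2f}x the earlier score, +{points:.1f} percentage points")
# prints "2.26x the earlier score, +40.3 percentage points"
```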

The benchmark also covers scenarios from the Tempo blockchain, a Layer 1 designed for high-throughput stablecoin payments, extending the evaluation into a domain where AI-driven financial activity is expected to grow.

Bigger picture

EVMbench arrives at a time when DeFi exploits remain a persistent threat. Recent incidents like the Mixin Network hack demonstrate that hundreds of millions can vanish when security fails — and stolen funds can sit dormant for years before resurfacing. The combination of immutable code and real money makes crypto a uniquely demanding test for any AI system.

If models continue improving at this pace, AI-assisted auditing could meaningfully close the gap in DeFi security. But EVMbench also shows the flip side: the same AI improving at patching vulnerabilities is getting faster at exploiting them.