The Rise of Autonomous Vulnerability Research Capabilities in LLMs

December 2, 2025

    Irregular's Frontier AI security evaluation system was built to test how models behave in adversarial contexts. As we expanded our evaluation suite to cover more production-like environments, it started finding real bugs, including three zero-day vulnerabilities across Soft Serve and QuickJS, all now disclosed and patched. That wasn't the original goal, but it reflects a broader shift in what these systems can do.

    A Capability Shift Three Years in the Making

    For the last three years, the security community watched LLM progress with a mix of curiosity and skepticism. The models could explain vulnerabilities, generate plausible-looking exploits, and occasionally assist with code review, but the gap between an impressive demo and a useful tool remained wide. False positives were frequent, context windows were too narrow to fit real-world codebases, and hallucinations were common enough to undermine any meaningful level of trust. That changed in 2025. As reasoning improved, purpose-built tooling matured, and context windows expanded, AI systems started finding real bugs in production software that traditional methods had missed.

    The Signal Is Getting Cleaner

    DARPA's AI Cyber Challenge, a competition to build autonomous systems that find and patch vulnerabilities, saw its finalists detect 86% of the planted test vulnerabilities in the final round, up from 37% a year earlier. Last fall, Google's Big Sleep found its first real-world vulnerability, a stack buffer underflow in SQLite that existing fuzzing infrastructure had missed. In May 2025, Sean Heelan used o3 to find a remote zero-day in the Linux kernel, with a signal-to-noise ratio of roughly 1:50. Noisy, but it worked.

    That was six months ago, using nothing more than the model API: no scaffolding, no agents, no tool use. Since then, specialized agentic systems have emerged alongside stronger base models. OpenAI's Aardvark has been assigned 10 CVEs. DeepMind's CodeMender has upstreamed 72 security fixes in six months. Google's Big Sleep issue tracker now lists more than 70 vulnerabilities across projects like SQLite, FFmpeg, curl, and Redis. The pattern is clear: as the tools get more specialized and the models get smarter, the signal gets cleaner.

    When Evaluation Systems Start Finding Zero-Days

    We have seen the same shift in our own work. Our AI security evaluation system was designed to analyze model behavior in adversarial contexts, not to hunt for bugs. Yet as we expanded our evaluation suite to cover a broader set of production-like codebases, the pipeline flagged findings that were subsequently confirmed to be zero-day vulnerabilities. Two were in Soft Serve, a Git hosting project: CVE-2025-64494, an ANSI escape sequence injection, and CVE-2025-64522, an SSRF in webhooks. A third, CVE-2025-63998, was a use-after-free in QuickJS. We disclosed all three and worked with the maintainers to ship fixes promptly.
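
    To make the first of those bug classes concrete, here is a minimal, hypothetical Go sketch of ANSI escape sequence injection in a terminal-facing Git tool. It is not the Soft Serve code or the disclosed vulnerability; the commit message, regex, and sanitize helper are illustrative assumptions. The underlying issue is that attacker-controlled metadata printed straight to a viewer's terminal can carry escape sequences that rewrite what the viewer sees, unless those sequences are stripped first.

        // Hypothetical illustration of the ANSI escape injection class (not the
        // actual Soft Serve code): a TUI that prints untrusted repository metadata
        // directly to the user's terminal lets an attacker embed escape sequences
        // that move the cursor, clear lines, or spoof displayed content.
        package main

        import (
            "fmt"
            "regexp"
        )

        // ansiSeq matches common CSI and OSC escape sequences.
        var ansiSeq = regexp.MustCompile(`\x1b(\[[0-9;?]*[ -/]*[@-~]|\][^\x07]*(\x07|\x1b\\))`)

        // sanitize strips escape sequences from untrusted text before display.
        func sanitize(s string) string {
            return ansiSeq.ReplaceAllString(s, "")
        }

        func main() {
            // Attacker-controlled commit message: clears the current line and
            // overwrites it with a spoofed "verified" marker.
            commitMsg := "fix typo\x1b[2K\r[verified] signed by maintainer"

            fmt.Println("vulnerable:", commitMsg)           // escape codes reach the terminal
            fmt.Println("sanitized: ", sanitize(commitMsg)) // escape codes stripped before display
        }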

    The dual-use implications are obvious. The same capabilities helping defenders find and patch vulnerabilities can help attackers find them first. This is why we treat evaluation as more than a benchmarking exercise. Irregular partners with major LLM providers on capability evaluations, systematically testing what these models can do offensively so we can identify where safeguards and mitigations are needed before those capabilities are widely deployed.

    Tools built to evaluate AI security are increasingly functioning as part of the security stack rather than as one-off benchmarks. As models become more capable and more autonomous, these systems will be one of the main ways we observe what they actually do in real environments. They give defenders new leverage, but they also lower the cost for capable attackers to probe complex software. How we design, operate, and share these pipelines will go a long way toward determining whether they reduce overall risk or amplify it.

    To cite this article, please credit Irregular with a link to this page, or click to view the BibTeX citation.
