The Next Generation of Cyber Evaluations - Case Study: Gemini 3 Pro

November 19, 2025


    TLDR: We’ve created next-generation evaluations now used to assess frontier models like Gemini 3. This is essential because AI models have begun saturating even the most challenging existing tests, making more advanced cyber evaluations critical for security.

    AI models are now passing most cybersecurity tests. Newer models can find and exploit vulnerabilities with far more skill than earlier versions. We need harder, more sophisticated tests to properly evaluate today’s advanced AI, and the even more capable models coming next.

    Irregular has built a proprietary set of cybersecurity evaluations that go beyond standard benchmarks. These tests place AI models in realistic attack scenarios that closely mirror real-world operations.

    The Age of Scenario Evaluations

    As LLMs became truly useful, researchers began testing their cybersecurity capabilities via evaluations. Early tests used multiple-choice questions; later tests borrowed challenges from hacking competitions (CTFs) or created similar ones. These were harder but often unrealistic. The current SOTA revolves around more complex evaluations that test specific capabilities, such as vulnerability exploitation.

    Trust in Irregular’s evaluations stems from the combined human expertise in AI, cybersecurity, and evaluation theory that goes into each one. Our research into best practices for evaluations and offensive cyber capabilities informs the creation of new evaluations; our SOLVE scoring system provides a framework for scoring their difficulty; and our research into methodology helps quantify AI capabilities based on evaluation results. This expertise makes Irregular a trusted partner to frontier labs such as OpenAI, Anthropic, and Google DeepMind, which use our evaluation suite and research conclusions to assess the cyber capabilities of their models. Through these collaborations, we gain early visibility into how the landscape is shifting and why new models must be tested with evaluations that are deeper, broader, and more realistic.
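
    To give a concrete (and purely illustrative) sense of what a difficulty-scoring framework of this kind looks like, the sketch below combines per-dimension ratings into a single difficulty score. The dimension names and weights are invented placeholders for exposition; they are not SOLVE’s actual rubric.

        # Illustrative sketch of a rubric-style difficulty score for an
        # exploitation evaluation. Dimension names and weights are invented
        # placeholders, not the actual SOLVE definition.
        ILLUSTRATIVE_DIMENSIONS = {
            "vulnerability_discovery": 0.3,  # how hard the flaw is to find
            "exploit_development":     0.4,  # how hard a working exploit is to build
            "environment_complexity":  0.2,  # defenses, noise, realistic constraints
            "chaining_required":       0.1,  # how many steps must be combined
        }

        def difficulty_score(ratings: dict[str, float]) -> float:
            """Combine per-dimension ratings (each 0-10) into a single 0-10 score."""
            return sum(weight * ratings[dim] for dim, weight in ILLUSTRATIVE_DIMENSIONS.items())

        # Example: a challenge that is hard to discover but easy to exploit.
        print(round(difficulty_score({
            "vulnerability_discovery": 8.0,
            "exploit_development":     3.0,
            "environment_complexity":  5.0,
            "chaining_required":       2.0,
        }), 2))  # -> 4.8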

    Current industry evaluations are mostly task-based: achieving code execution; disabling firewalls to communicate with targets; exploiting cryptographic vulnerabilities. These evaluations seek to isolate specific capabilities and measure AI models against them. Such evaluations are very useful, but they are insufficient to capture real-world attacks. An offensive operation often requires more than a single task, especially today, when security awareness and know-how are far more widespread and many systems aspire to be secure by design. Operations require chaining many different tasks to exploit a target successfully. As exemplified by recent events such as Anthropic’s report on GTG-1002, we may be heading into a future where AI agents can perform these operations semi-autonomously, and eventually autonomously.

    Irregular’s new Scenario Evaluation Suite aims to simulate exactly this kind of operation and evolve the SOTA from task evaluations to scenario evaluations, testing AI models on complex exploitation chains inspired by research into several impactful real-world cyber attacks. The scenarios are built from the ground up to explore and measure each capability on its own, to inform mitigations, but in the context of a realistic operation.
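
    To make the distinction between task and scenario evaluations concrete, here is a minimal sketch of how a scenario can be modeled as an ordered chain of task stages, where each stage begins only with the artifacts the agent actually obtained in the previous one. The stage structure, names, and scoring below are assumptions for illustration only and do not describe Irregular’s actual suite.

        from dataclasses import dataclass
        from typing import Callable, Optional

        # A single task stage checks one isolated capability (e.g. initial access,
        # privilege escalation) and returns the artifacts it produced, or None on failure.
        @dataclass
        class TaskStage:
            name: str
            run: Callable[[dict], Optional[dict]]

        # A scenario chains stages: each stage starts only with what the agent
        # actually obtained earlier, so isolated skill in one area is not enough
        # to complete the whole operation.
        @dataclass
        class ScenarioEvaluation:
            stages: list[TaskStage]

            def run(self, initial_context: dict) -> dict:
                context = dict(initial_context)
                completed: list[str] = []
                for stage in self.stages:
                    artifacts = stage.run(context)
                    if artifacts is None:  # the chain breaks at the first failed stage
                        break
                    context.update(artifacts)
                    completed.append(stage.name)
                return {
                    "stages_completed": completed,
                    "fully_solved": len(completed) == len(self.stages),
                }

    Recording which stages were completed, rather than only whether the full chain succeeded, is what lets a suite built this way still measure each capability on its own while keeping the realistic operational context.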

    Case Study: Google DeepMind’s Gemini 3 Pro

    Previous Gemini models underwent extensive testing on Irregular’s evaluations, as well as other cybersecurity benchmarks. With Gemini 3 Pro, Google DeepMind chose to retire other benchmarks in favor of Irregular’s evaluation set, including the newly added Scenario Evaluation Suite, as mentioned in the Frontier Safety Framework report (page 8).

    Gemini 3 Pro represents a significant increase in model capabilities across many domains, as highlighted by our evaluations and the model card. In particular, our evaluations show that Gemini 3 Pro exhibits a leap in offensive cybersecurity capabilities. The model’s performance on Irregular’s evaluation set (the “key skills” benchmark) highlights the need for a new generation of cyber evaluations, which we are happy to establish: while the existing task evaluations (“V1”) have been mostly saturated by this model, even those rated as hard, Irregular’s Scenario Evaluation Suite (“V2”) currently remains fully unsolved. The research behind these new evaluations enables us to assess model cyber capabilities with significantly higher confidence. Additional details on the Scenario Evaluation Suite will be presented in a future publication.

    We are proud to play a leading role in shaping Frontier AI Security, and we thank Google DeepMind and other frontier labs and organizations for their partnership and shared commitment to securing frontier AI.

    To cite this article, please credit Irregular with a link to this page, or click to view the BibTeX citation.
