A Frontier Fortnight: Major Model Releases and What They Mean
December 3, 2025
The past two weeks brought a notable leap in frontier LLM capabilities. Two weeks ago, Google DeepMind released Gemini 3 Pro, their latest frontier model; shortly after, OpenAI released GPT-5.1-Codex-Max, their new agentic coding model; and last week, Anthropic released Claude Opus 4.5, their most capable model yet. All three post impressive benchmark results and deliver marked improvements in software engineering workflows.
Cybersecurity capabilities have naturally improved as well, and each new frontier model release carries risks that must be evaluated and mitigated. Anthropic's report last month, which uncovered an AI-orchestrated nation-state cyber espionage campaign, makes these risks clearer than ever.
Irregular is proud to be a leader in the evaluation of frontier models, and over the past few weeks served as a trusted partner in multiple frontier model releases. The Claude Opus 4.5 system card describes how Anthropic used the SOLVE scoring framework, developed by Irregular, to measure Claude's capabilities on vulnerability discovery and exploit development tasks. OpenAI recently wrote about their reliance on external testing as part of their approach to safety, citing our work constructing network simulations to gauge the cyber capabilities of the GPT-5 family of models, from the initial release of GPT-5 in August through GPT-5.1-Codex-Max. Irregular has evaluated other recent frontier models as well: earlier this year, we worked with Anthropic to evaluate Claude Sonnet 4.5, the Claude 4 family, and Claude 3.7 Sonnet; with OpenAI, to evaluate o3 and o4-mini; and more.
Irregular is a pioneer in frontier cyber evaluations and has worked with AI labs since the introduction of powerful LLMs. Our early work defined the philosophy and methods of measuring cyber capabilities, which quickly became the standard for evaluating previous generations of models. But as progress continues, evaluation methods must advance as well. Simple task-based evaluations, where the model needs to find and exploit a bug in a highlighted vulnerable endpoint, were standard last year but are no longer sufficient today. Foreseeing this rapid progress, Irregular was early to develop a next-generation cyber evaluation suite, focusing on scenario-based evaluations rather than task-based ones. This work once again sets the standard for measuring frontier models' cyber capabilities, as recent models are starting to saturate the previous generation of evaluations.
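To make the contrast concrete, here is a minimal, purely illustrative sketch in Python. The class and field names (TaskEval, ScenarioEval, scored_milestones, and so on) are hypothetical and are not drawn from Irregular's evaluation suite; under those assumptions, they show how a task-based item hands the model a highlighted endpoint, while a scenario-based item defines an environment, an objective, and graded milestones.

```python
# Illustrative sketch only: these dataclasses and names are hypothetical and do not
# describe Irregular's actual evaluation suite. They exist purely to contrast the
# shape of a task-based eval item with a scenario-based one.
from dataclasses import dataclass, field


@dataclass
class TaskEval:
    """Previous-generation style: a single, pre-scoped objective."""
    name: str
    vulnerable_endpoint: str   # the bug's location is handed to the model
    success_check: str         # a single pass/fail condition


@dataclass
class ScenarioEval:
    """Next-generation style: an open-ended environment with no highlighted target."""
    name: str
    network_hosts: list[str] = field(default_factory=list)      # simulated network to explore
    objective: str = ""                                          # high-level goal; the path is not prescribed
    scored_milestones: list[str] = field(default_factory=list)  # graded intermediate steps


# A task-based item points the model straight at the bug ...
task = TaskEval(
    name="sqli-demo",
    vulnerable_endpoint="http://10.0.0.5/login",
    success_check="admin session obtained",
)

# ... while a scenario-based item asks for an outcome inside a simulated network,
# leaving reconnaissance, pivoting, and exploitation choices up to the model.
scenario = ScenarioEval(
    name="internal-network-demo",
    network_hosts=["10.0.0.5", "10.0.0.17", "10.0.0.23"],
    objective="exfiltrate the customer database from the internal file server",
    scored_milestones=["initial foothold", "lateral movement", "data located", "data exfiltrated"],
)

print(task)
print(scenario)
```

The design point in this sketch is that a scenario-based item scores the whole chain of behavior rather than a single binary outcome, which is what keeps such evaluations informative once models routinely solve isolated, pre-scoped tasks.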
We thank Anthropic, OpenAI, and other frontier labs and organizations for their partnership and shared commitment to securing frontier AI.