Executive Summary
We recently evaluated GPT-5.2-Codex using Irregular’s standard internal offensive-security methodology. This assessment leveraged our Atomic Tasks suite alongside CyScenarioBench to test both isolated technical skills and complex operational decision-making.
The results reinforce a pattern observed consistently across frontier models: strong performance on discrete tasks paired with limited scenario-level execution. While GPT-5.2-Codex performs well on specific security skills, it did not complete any CyScenarioBench evaluations. This suggests limited reliability in chaining individual techniques into a cohesive, multi-stage network compromise without human guidance.
Atomic Capabilities: Strong Technical Proficiency
GPT-5.2-Codex demonstrates strong performance on Atomic Tasks, achieving an 80 percent average success rate in Vulnerability Research and Exploitation and 79 percent in Network Attack Simulation. This level of performance can provide meaningful assistance to moderately skilled operators, and targeted support on narrow subtasks to highly skilled practitioners.
In practice, this can reduce bottlenecks in vulnerability research and exploitation, particularly on well-scoped tasks. The "Spell Bound" evaluation is a clear example: the model correctly diagnosed a cryptographic flaw in a custom protocol and then generated a working implementation that recovered protected plaintext.
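The details of the Spell Bound flaw are not public, so the sketch below is purely illustrative: it shows a classic failure mode in custom cryptographic protocols, keystream reuse in a stream cipher, which is the kind of diagnose-then-exploit reasoning described above. The protocol, messages, and flag string are all hypothetical.

```python
# Hypothetical illustration of a classic custom-protocol flaw:
# reusing a stream-cipher keystream across two messages.
# This is NOT the actual "Spell Bound" challenge, whose details
# are not disclosed in this report.
import os

def encrypt(keystream: bytes, plaintext: bytes) -> bytes:
    """XOR plaintext with the keystream (a stand-in for any stream cipher)."""
    return bytes(p ^ k for p, k in zip(plaintext, keystream))

# A flawed protocol encrypts two messages under the SAME keystream.
keystream = os.urandom(32)
known = b"PING 2024-01-01T00:00:00Z ACK!!"   # attacker-known message
secret = b"FLAG{keystream_reuse_is_fatal}"   # protected plaintext

c1 = encrypt(keystream, known)
c2 = encrypt(keystream, secret)

# Attack: c1 XOR c2 == known XOR secret; the keystream cancels out,
# so XORing in the known plaintext recovers the protected one.
recovered = bytes(a ^ b ^ k for a, b, k in zip(c1, c2, known))
assert recovered == secret
```

No key material is ever attacked directly; the flaw is entirely in how the protocol reuses the keystream, which is why models that reason well about protocol structure can exploit it without breaking the underlying cipher.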
However, limitations remain. The evasion success rate averaged 49 percent, and performance drops on tasks requiring sustained orchestration and time-sensitive decisions. These gaps limit how far the model's narrow capabilities translate to real-world operations, where avoiding discovery is critical.
Scenario-Based Evaluation: The Composition Gap
The critical gap emerges when these capabilities must be composed into an operation.
On CyScenarioBench, which measures long-horizon execution under realistic constraints, GPT-5.2-Codex did not solve any scenarios.
These scenarios require coordination across multiple techniques, accurate branching decisions under partial evidence, consistent state tracking over time, and recovery from failed actions. The observed failure mode is not missing individual techniques, but limited reliability in sequencing and adapting them over an extended interaction. Additional evaluation with alternative scaffolding and elicitation strategies may be necessary to characterize the model’s upper-bound scenario-level performance under this benchmark.
Conclusion
The overall picture is consistent. GPT-5.2-Codex maintains strong performance on specific subproblems, particularly in vulnerability research and exploitation. However, it does not yet demonstrate the ability to execute full multi-stage cyber operations. The primary risk remains concentrated in discrete task uplift rather than autonomous end-to-end activity.