At Irregular, we rigorously test cutting-edge models against realistic offensive security challenges and derive vulnerability assessment metrics to gauge their practical performance. We recently evaluated GPT-5.4-Thinking.
Testing Configuration
We evaluated GPT-5.4-Thinking across two main areas: CyScenarioBench, a benchmark comprising multi-stage, scenario-driven offensive security operations, and the Atomic Tasks suite. Atomic Tasks covers three domains:

- Network security: standard attack sequences, network mapping, protocols, and infrastructure components such as firewalls and file servers.
- Vulnerability research and exploitation: reverse engineering, code scrutiny, cryptography, and exploit development.
- Evasion: the model's ability to avoid detection by existing security and monitoring tools.
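To make the structure of the suite concrete, here is a minimal sketch of how the Atomic Tasks taxonomy above might be represented and scored. The domain keys, task names, and mean-based aggregation are illustrative assumptions, not the benchmark's actual contents or scoring method.

```python
from statistics import mean

# Hypothetical layout of the Atomic Tasks taxonomy described above.
# Task names are placeholders drawn from the domain descriptions.
ATOMIC_TASK_DOMAINS = {
    "network_security": [
        "attack_sequences", "network_mapping", "protocols", "infrastructure",
    ],
    "vuln_research_and_exploitation": [
        "reverse_engineering", "code_scrutiny", "cryptography", "exploit_development",
    ],
    "evasion": [
        "detection_avoidance",
    ],
}

def domain_score(results: dict[str, float], domain: str) -> float:
    """Mean per-task pass rate within one domain (an assumed aggregation)."""
    return mean(results[task] for task in ATOMIC_TASK_DOMAINS[domain])
```

For example, `domain_score(results, "network_security")` would average the pass rates of the four network-security tasks, giving one headline number per domain.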
Key Outcomes
GPT-5.4-Thinking demonstrated strong performance on Atomic Tasks, with particular strength in network security and in vulnerability research and exploitation, and showed clear improvement over GPT-5.2-Thinking on CyScenarioBench. The model finds and chains vulnerabilities quickly and can execute multi-stage attack sequences effectively. Performance degrades, however, in long-horizon scenarios that demand sustained logical coherence.
Preparedness note: “High” under the Preparedness Framework is a conservative deployment posture, which may be triggered by canary thresholds indicating the potential to reduce bottlenecks in scaling cyber operations. Our results help contextualize that posture: GPT-5.4-Thinking shows strong step-level performance alongside some completed CyScenarioBench scenarios.
Conclusions
GPT-5.4-Thinking continues the performance trajectory seen from GPT-5.2-Thinking to GPT-5.3-Codex, with a meaningful increase across most task categories. The improvement is most relevant for moderately skilled operators, while offering highly skilled experts targeted assistance on precise, narrow subtasks. It is particularly effective at streamlining workflows, especially in vulnerability research and exploitation when the scope of the task is well defined.
Despite these advancements, limitations persist. Performance degrades when tasks require prompt, time-sensitive decision-making, so the model's capabilities transfer less readily to actual operational settings where avoiding detection is critical.
Consistent with previous assessments, these outcomes should be interpreted as a measure of the model's capabilities for assisted reasoning, not as a reflection of its efficacy in real-world attack scenarios.