At Irregular, we rigorously test cutting-edge models against realistic offensive security challenges and derive vulnerability assessment metrics to gauge their practical performance. We recently evaluated GPT-5.4-Thinking.
Testing Configuration
We evaluated GPT-5.4-Thinking across two main areas: CyScenarioBench, a benchmark comprising multi-stage, scenario-driven offensive security operations, and the Atomic Tasks suite. Atomic Tasks covers three domains:

- Network security: standard attack sequences, network mapping, protocols, and infrastructure components such as firewalls and file servers.
- Vulnerability research and exploitation: reverse engineering, code scrutiny, cryptography, and exploit development.
- Evasion: the model's ability to avoid detection by existing security and monitoring tools.
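To make the structure of the suite concrete, here is a minimal sketch of how the Atomic Tasks taxonomy above might be represented and scored. The domain keys, task names, and mean-based aggregation are illustrative assumptions, not the benchmark's actual contents or scoring method.

```python
from statistics import mean

# Hypothetical layout of the Atomic Tasks taxonomy described above.
# Task names are placeholders drawn from the domain descriptions.
ATOMIC_TASK_DOMAINS = {
    "network_security": [
        "attack_sequences", "network_mapping", "protocols", "infrastructure",
    ],
    "vuln_research_and_exploitation": [
        "reverse_engineering", "code_scrutiny", "cryptography", "exploit_development",
    ],
    "evasion": [
        "detection_avoidance",
    ],
}

def domain_score(results: dict[str, float], domain: str) -> float:
    """Mean per-task pass rate within one domain (an assumed aggregation)."""
    return mean(results[task] for task in ATOMIC_TASK_DOMAINS[domain])
```

For example, `domain_score(results, "network_security")` would average the pass rates of the four network-security tasks, giving one headline number per domain.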
Key Outcomes
GPT-5.4-Thinking demonstrated strong performance on Atomic Tasks, with particular strength in network security and in vulnerability research and exploitation, and showed clear improvement over GPT-5.2-Thinking on CyScenarioBench. The model finds and chains vulnerabilities quickly and can execute multi-stage attack sequences effectively. Performance degrades, however, in long-horizon scenarios that demand sustained logical coherence.
Preparedness note: “High” under the Preparedness Framework is a conservative deployment posture, which may be triggered by canary thresholds indicating the potential to reduce bottlenecks in scaling cyber operations. Our results help contextualize that posture: GPT-5.4-Thinking shows strong step-level performance alongside some completed CyScenarioBench scenarios.
Conclusions
GPT-5.4-Thinking continues the performance trajectory seen from GPT-5.2-Thinking to GPT-5.3-Codex, with a meaningful increase across most task categories. The improvement is most relevant for moderately skilled operators, while offering highly skilled experts targeted assistance on precise, narrow subtasks. It is particularly effective at streamlining workflows, especially in vulnerability research and exploitation when the scope of the task is well defined.
Despite these advancements, limitations persist. Performance degrades when tasks require prompt, time-sensitive decision-making, so the model's capabilities transfer less readily to actual operational settings where avoiding detection is critical.
Consistent with previous assessments, these outcomes should be interpreted as a measure of the model's capabilities for assisted reasoning, not as a reflection of its efficacy in real-world attack scenarios.