Assessing GPT-5.5 Against Offensive Security Benchmarks

    At Irregular, we perform rigorous testing of cutting-edge models against realistic offensive security challenges, and derive vulnerability assessment metrics to properly gauge their practical performance. We worked with OpenAI to evaluate GPT-5.5, released today.

    Testing Configuration

    We evaluated GPT-5.5 across two main areas: CyScenarioBench, a benchmark comprising multi-stage, scenario-driven offensive security operations, and the Atomic Tasks suite. The Atomic Tasks suite covers three domains: network security (standard attack sequences, network mapping, protocols, and infrastructure components such as firewalls and file servers); vulnerability research and exploitation (reverse engineering, code scrutiny, cryptography, and exploit development); and evasion (the model's ability to avoid detection by existing security and monitoring tools).

    Key Outcomes

    GPT-5.5 demonstrated strong performance on Atomic Tasks, solving all atomic challenges, with particular strength in network security and in vulnerability research and exploitation. On CyScenarioBench, GPT-5.5 outperformed GPT-5.4, solving more challenges at a lower cost per success. Taken together, the results across both challenge suites indicate a measurable generation-over-generation improvement in offensive cyber capability.
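    The cost-per-success comparison can be made concrete with a small calculation: total spend on a challenge suite divided by the number of challenges solved. The sketch below uses invented placeholder numbers purely for illustration; they are not figures from our evaluation.

    ```python
    # Hypothetical sketch of the cost-per-success metric.
    # All run data below is invented for illustration only.

    def cost_per_success(total_cost_usd: float, challenges_solved: int) -> float:
        """Average spend per solved challenge; undefined if nothing was solved."""
        if challenges_solved == 0:
            raise ValueError("no solved challenges; cost per success is undefined")
        return total_cost_usd / challenges_solved

    # Invented example runs (not real evaluation data):
    runs = {
        "model_a": {"total_cost_usd": 120.0, "challenges_solved": 8},
        "model_b": {"total_cost_usd": 150.0, "challenges_solved": 6},
    }

    for name, run in runs.items():
        print(f"{name}: ${cost_per_success(**run):.2f} per solved challenge")
    ```

    Under this framing, a model can win on cost per success either by solving more challenges for the same spend or by reaching the same solve count more cheaply.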

    Preparedness note: “High” under the Preparedness Framework is a conservative deployment posture that may be triggered by canary thresholds indicating potential to reduce bottlenecks for scaling cyber operations. Our results help contextualize that posture by showing strong step-level performance alongside successful completion of more CyScenarioBench challenges than GPT-5.4.

    Conclusions

    GPT-5.5 continues the performance trajectory seen in prior model generations, with a meaningful increase across most task categories. The improvement is most relevant for novice and moderately skilled operators and offers targeted assistance to highly skilled experts on precise, narrow subtasks. This proficiency is particularly effective in streamlining workflows, especially for vulnerability research and exploitation when the scope of the task is well-defined.

    In some cases, the model was able to perform complex cyber tasks that require niche knowledge which most expert cyber operators would not possess. These results suggest the model may remove some existing bottlenecks to scaling cyber operations by automating the discovery and exploitation of operationally relevant vulnerabilities.

    Despite these advancements, limitations persist. We still see constraints in translating these capabilities to real-world scenarios, notably in areas such as operational security. Consistent with previous assessments, these outcomes should be interpreted as a measure of the model's capacity for assisted reasoning, not as a reflection of its efficacy in real-world attack scenarios.

    To cite this article, please credit Irregular with a link to this page.