Model Evaluation: GPT-5.3-Codex on Offensive Security Benchmarks

February 5, 2026

    At Irregular, we evaluate frontier models on realistic offensive security tasks and vulnerability reasoning benchmarks to understand performance under practical constraints. We recently evaluated GPT-5.3-Codex.

    Evaluation Setup

    We tested GPT-5.3-Codex on:

    • CyScenarioBench: multi-stage, scenario-based offensive security tasks.

    • Atomic Tasks suite:

      • Network security: common attack flows, reconnaissance, network protocols, and components (e.g., firewalls, file servers).

      • Vulnerability Research and Exploitation (VR&E): reverse engineering, code analysis, cryptography, and exploitation.

      • Evasion benchmarks: avoiding detection by security controls and monitoring systems.

    We ran the model through Codex CLI at xhigh reasoning effort, with a web search tool enabled, and allowed autonomous execution of up to 1,000 tool calls per run. For comparison, we ran GPT-5.2-Codex and GPT-5.2 using our internal Irregular Agent.
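
    To make the tool-call budget concrete, here is a minimal sketch of how an agent harness might enforce such a cap. This is our illustration only: the names below (run_with_budget, RunState, model_step, tool_executor) are hypothetical and are not Codex CLI or Irregular Agent APIs.

        # Hypothetical agent loop with a hard tool-call budget, mirroring
        # the 1,000-calls-per-run cap used in these evaluations. All names
        # are illustrative, not Codex CLI or Irregular Agent APIs.
        from dataclasses import dataclass, field

        MAX_TOOL_CALLS = 1_000  # hard per-run budget

        @dataclass
        class RunState:
            transcript: list = field(default_factory=list)
            tool_calls_used: int = 0
            finished: bool = False

        def run_with_budget(task_prompt, model_step, tool_executor):
            """Drive the model autonomously until it answers or the budget runs out.

            model_step(transcript) returns either a final answer (str) or a
            tool request; tool_executor(request) returns the tool's output.
            """
            state = RunState(transcript=[task_prompt])
            while not state.finished:
                action = model_step(state.transcript)
                if isinstance(action, str):  # model produced a final answer
                    state.transcript.append(action)
                    state.finished = True
                elif state.tool_calls_used < MAX_TOOL_CALLS:
                    state.tool_calls_used += 1
                    state.transcript.append(tool_executor(action))
                else:  # budget exhausted: stop and score partial progress
                    state.finished = True
            return state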

    Results

    CyScenarioBench: GPT-5.3-Codex did not complete any scenarios end-to-end. In many cases, the model demonstrated partial progress, such as identifying plausible initial access vectors or relevant vulnerability classes. However, it consistently failed to maintain coherent reasoning across multiple stages or adapt when earlier assumptions proved incorrect.
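
    For clarity on what "end-to-end" means here: a scenario counts as completed only if every stage succeeds in sequence, with later stages depending on the outcomes of earlier ones. The sketch below is our own illustration of that structure; the class and field names are hypothetical, not the benchmark's actual schema.

        # Illustrative structure of a multi-stage scenario; names are
        # hypothetical, not CyScenarioBench's actual schema.
        from dataclasses import dataclass

        @dataclass
        class Stage:
            objective: str   # e.g., "obtain initial access"
            passed: bool = False

        @dataclass
        class Scenario:
            name: str
            stages: list     # ordered; each stage builds on the last

            def completed_end_to_end(self) -> bool:
                # Partial progress (some stages passed) does not count.
                return all(stage.passed for stage in self.stages)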

    Atomic Tasks scores:

    • Network security: 86%

    • VR&E: 72%

    • Evasion: 53%
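
    Taken together, the unweighted mean of the three category scores is roughly 70%, as the trivial computation below shows. Note that equal category weighting is our simplifying assumption; it is not necessarily how the suite itself aggregates scores.

        # Unweighted mean across Atomic Tasks categories (illustrative;
        # assumes equal weighting, which the suite may not actually use).
        scores = {"network_security": 0.86, "vre": 0.72, "evasion": 0.53}
        overall = sum(scores.values()) / len(scores)
        print(f"Unweighted mean: {overall:.1%}")  # -> 70.3%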

    Overall, GPT-5.3-Codex appears materially similar to GPT-5.2 and GPT-5.2-Codex. Strengths concentrate in bounded, well-defined reasoning tasks, such as identifying misconfigurations and explaining vulnerabilities, while weaknesses persist in long-horizon planning, scenario adaptation, and evasion reasoning.

    Preparedness note: “High” under the Preparedness Framework is a conservative deployment posture that may be triggered by canary thresholds indicating potential to reduce bottlenecks for scaling cyber operations. Our results help contextualize that posture: strong task-level performance alongside zero end-to-end CyScenarioBench completions.

    Takeaways

    Like GPT-5.2-Codex and GPT-5.2 before it, GPT-5.3-Codex offers a notable uplift, particularly for moderately skilled operators, and provides focused support to highly skilled practitioners on specific, narrow subtasks. This is most effective at easing bottlenecks in vulnerability research and exploitation when tasks are clearly defined.

    Despite these strengths, limitations persist. Performance degrades when tasks demand sustained orchestration or time-sensitive decision-making, so these capabilities transfer poorly to real-world operational scenarios where remaining undetected is paramount.

    As with earlier evaluations, these results should be interpreted as a measure of assisted reasoning capability, not real-world attack effectiveness.

    To cite this article, please credit Irregular with a link to this page.