We are happy to share that we have started working with Meta. At Irregular, we test frontier models against realistic offensive security challenges and derive vulnerability metrics to measure their performance. As part of this new collaboration, we recently evaluated Muse Spark, the inaugural model from Meta Superintelligence Labs, across our offensive security evaluation suite.
Testing Configuration
We evaluated Muse Spark across two of our proprietary evaluation suites: CyScenarioBench, a benchmark comprising multi-stage, scenario-driven offensive security operations, and our Atomic Tasks suite. Atomic Tasks covers three domains: network security (standard attack sequences, network mapping, protocols, and infrastructure components such as firewalls and file servers); vulnerability research and exploitation (reverse engineering, code scrutiny, cryptography, and exploit development); and evasion (the model's ability to avoid detection by existing security and monitoring tools).
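To make the distinction between the suites concrete, the toy sketch below shows the flavor of a single, narrowly scoped network-mapping step of the kind the Atomic Tasks suite probes. The host, ports, and code are illustrative inventions on our part, not material drawn from the benchmark itself.

```python
# Illustrative only: a toy "atomic" network-mapping step (hypothetical target
# and port list; not taken from the Atomic Tasks suite).
import socket


def probe_tcp_ports(host: str, ports: list[int], timeout: float = 0.5) -> list[int]:
    """Return the subset of `ports` that accept a TCP connection on `host`."""
    open_ports = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising,
            # which keeps the scan loop simple.
            if sock.connect_ex((host, port)) == 0:
                open_ports.append(port)
    return open_ports


if __name__ == "__main__":
    # Probe a handful of common service ports on the local machine.
    print(probe_tcp_ports("127.0.0.1", [22, 80, 443, 3306, 8080]))
```

A task at this granularity has a single objective, a fixed scope, and an unambiguous success criterion, which is what separates the Atomic Tasks suite from the multi-stage scenarios in CyScenarioBench.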
Findings
Muse Spark demonstrates solid foundational cyber knowledge across the Atomic Tasks suite. The model reliably solves straightforward challenges and handles individual well-known attack techniques, such as timing attacks, race conditions, and padding oracles, provided execution stays within a single well-scoped step. Of the six hard or expert-level Atomic Tasks challenges, four were solved at least once.
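For readers unfamiliar with the techniques named above, the sketch below illustrates the general shape of a timing attack in its simplest textbook form: an early-exit comparison leaks how many leading characters of a guess are correct. The secret, the artificial per-character delay, and the recovery loop are purely didactic assumptions and do not reflect any benchmark task.

```python
# Minimal timing-attack sketch under stated assumptions: the per-character
# delay stands in for a real side channel so the effect is observable,
# and the secret's length is assumed known for brevity.
import time

SECRET = "k3y"  # hypothetical secret leaked one character at a time


def insecure_check(guess: str) -> bool:
    """Non-constant-time comparison: bails out at the first mismatch."""
    for g, s in zip(guess, SECRET):
        if g != s:
            return False
        time.sleep(0.01)  # stand-in for per-character processing cost
    return len(guess) == len(SECRET)


def recover_secret(alphabet: str = "abcdefghijklmnopqrstuvwxyz0123456789") -> str:
    recovered = ""
    for _ in range(len(SECRET)):
        timings = {}
        for ch in alphabet:
            start = time.perf_counter()
            insecure_check(recovered + ch)
            timings[ch] = time.perf_counter() - start
        # The correct character takes longest: one extra comparison survives.
        recovered += max(timings, key=timings.get)
    return recovered


if __name__ == "__main__":
    print(recover_secret())  # expected to print "k3y"
```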
On CyScenarioBench, Muse Spark has not yet reached the capability level required to complete full end-to-end attack scenarios. The model performs competent initial reconnaissance and achieves initial exploitation in some cases, but does not yet consistently compose these into multi-step operations.
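To illustrate what "composing" means here, the schematic skeleton below shows how each phase of a multi-step operation has to consume the previous phase's findings and carry state forward. Every function, host, and value in it is a hypothetical placeholder, not part of our harness or of the CyScenarioBench scenarios.

```python
# Hypothetical skeleton of a multi-step operation: the point is the hand-off
# of state between phases, not the (empty) phases themselves.
from dataclasses import dataclass, field


@dataclass
class OperationState:
    discovered_hosts: list[str] = field(default_factory=list)
    footholds: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)


def recon(state: OperationState) -> None:
    # Placeholder: a real step would map the network reachable from the entry point.
    state.discovered_hosts = ["10.0.0.5", "10.0.0.7"]
    state.notes.append("two hosts reachable from the entry point")


def exploit(state: OperationState) -> None:
    # Placeholder: turn a recon finding into an initial foothold.
    if state.discovered_hosts:
        state.footholds.append(state.discovered_hosts[0])
        state.notes.append(f"foothold obtained on {state.footholds[-1]}")


def pivot(state: OperationState) -> None:
    # Placeholder: use the foothold to reach a new segment; this hand-off is
    # where chaining tends to break down in practice.
    if state.footholds:
        state.notes.append(f"pivoting from {state.footholds[-1]} to the next segment")


if __name__ == "__main__":
    state = OperationState()
    for phase in (recon, exploit, pivot):
        phase(state)  # each phase depends on state produced by the previous one
    print("\n".join(state.notes))
```

Completing a CyScenarioBench scenario requires sustaining this kind of state-carrying loop over many steps, which is where Muse Spark's performance currently falls short.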
Scaling framework note: Meta evaluated Muse Spark under its Advanced AI Scaling Framework, which covers cybersecurity as a frontier risk domain. Our results are consistent with Meta's assessment that the model does not exhibit the autonomous offensive capability needed to realize the cybersecurity threat scenarios defined in that framework.
Conclusion
Muse Spark demonstrates awareness of a broad range of offensive cybersecurity concepts and can execute well-known attack techniques in isolation. This proficiency is most relevant for narrowly scoped tasks (for instance, identifying a known vulnerability class or automating a single exploitation step), where the model may offer targeted assistance to a skilled operator.
However, the model does not yet compose these individual skills into effective multi-step operations. Performance degrades when tasks require sustained planning across network segments, pivoting between hosts, or chaining findings from one attack phase into action on the next.
Consistent with previous assessments, these outcomes should be interpreted as a measure of the model's capabilities for assisted offensive reasoning, not as a reflection of its efficacy in real-world attack scenarios. We assess that Muse Spark does not materially alter the cyber threat landscape in its current form.