When Success Rates Mislead: The Case for Expected Cost as a Metric in AI Evaluation

December 18, 2025


    Executive Summary


    • AI capability evaluations typically report success rates, but this metric alone provides an incomplete picture. A low success rate can seem to indicate a lack of capability even when the task is economically feasible to achieve through repeated attempts.

    • We propose Expected Cost per Success as a complementary metric: the total cost of all attempts divided by the number of successes. It reflects the average amount an actor can expect to spend before achieving success.

    • This metric is particularly relevant in cyber threat models where attackers can target multiple systems, retry without consequence, and need only occasional success. In these cases, economic feasibility becomes a primary barrier.

    • The full analysis includes how Expected Cost per Success changes the interpretation of evaluation results, as well as the advantages and limitations of adopting this metric.

    Success Rates in Current Evaluations

    The recent model release boom (Claude Sonnet 4.5 & Opus 4.5, Gemini 3, GPT-5.1, GPT-5.2, …) has placed the evaluation of their expanding capabilities at the center of industry discourse. As these models push the frontiers of both coding and autonomous behavior, even small percentage gains on benchmarks attract considerable community attention. However, focusing solely on the success rate, the percentage of attempts in which a model successfully completes a task, provides an incomplete picture of the practical implications of these capabilities. This analysis suggests incorporating a cost measure as a complementary metric.

    To evaluate a model’s capabilities, we use challenges that require achieving a specific objective. For cyber capabilities, this could mean compromising a system. Each time a model is run, it takes steps until it either succeeds, indicating the challenge is solved, or reaches a set limit. Because language models produce different outputs each time, evaluators usually conduct multiple "runs" and report the success rate (for example, the model has succeeded in 22 out of 100 attempts).

    The success rate metric suggests a scale that is sometimes misleading. Typically, evaluations are run a few dozen or a hundred times, and a success rate of 1% (one in a hundred attempts) suggests low capability. But, in practice, the cost of a single attempt may be low enough that running even several hundred (or thousand) attempts isn’t prohibitively expensive.

    Looking at cyber capabilities as an example: In many cases, an attacker needs to succeed only once but can make multiple attempts. A “success rate of 1%” implies a lot less danger than “a success would likely cost $100-$200”. If a challenge can be solved in 1% of runs and each run costs $1, an actor repeatedly attempting to solve the challenge can expect to spend an average of $100 before success. Many threat actors can easily afford this.
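
    A quick way to see the arithmetic: with independent runs that each succeed with probability p, the number of attempts until the first success follows a geometric distribution with mean 1/p, so the expected spend is the per-run cost divided by p. The minimal Python sketch below (our illustration, reusing the hypothetical $1-per-run figure from the example above) confirms the $100 estimate by simulation:

```python
import random

def average_spend_to_first_success(p, cost_per_run, trials=20_000, seed=0):
    """Monte Carlo estimate of the expected spend before the first success,
    assuming independent runs that each succeed with probability p."""
    rng = random.Random(seed)
    total_cost = 0.0
    for _ in range(trials):
        attempts = 1
        while rng.random() >= p:  # keep retrying until the first success
            attempts += 1
        total_cost += attempts * cost_per_run
    return total_cost / trials

# 1% success rate at $1 per run: expected spend is $1 / 0.01 = $100.
print(average_spend_to_first_success(p=0.01, cost_per_run=1.0))  # ~100
```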

    In most current AI capability evaluations, this success rate is treated as the primary metric, with little or no attention paid to cost¹. This can obscure our understanding of practical implications. In cybersecurity, LLMs and agentic AI could complete a cyber attack end-to-end, since tasks such as exploiting vulnerabilities and executing network attacks require no physical logistics. If a model can autonomously manage a cyber attack, even rarely, in some cases the only remaining barrier is the cost of trying again.

    Expected Cost per Success 

    To better communicate the implications of capabilities, we suggest using an additional metric to reflect the average amount an actor can expect to spend before achieving success. We define this metric as Expected Cost per Success, estimated as the total cost of all runs divided by the number of successful runs.
    For example, consider these evaluation results of a state-of-the-art model on two of Irregular’s cyber challenges:


    Challenge | Success Rate | Cost/run | Cost of a successful run | Expected Cost per Success
    Incorrect usage of a cryptographic scheme | 1.7% | $3.81 | $2.87² | $224
    Lateral movement and 1-day exploitation | 4.16% | $40.18³ | $7.15 | $965
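
    As a minimal sketch (not the authors’ tooling), the estimator follows directly from the definition; the equivalent per-run form, mean cost per run divided by success rate, reproduces the table values up to rounding:

```python
def expected_cost_per_success(total_cost, num_successes):
    """Total cost of all runs divided by the number of successful runs."""
    if num_successes == 0:
        raise ValueError("no successes observed; see the unsolved-challenge discussion")
    return total_cost / num_successes

# Equivalent per-run form: mean cost per run divided by the success rate.
# Reproduces the table above (up to rounding of the reported inputs):
print(3.81 / 0.017)    # ≈ $224, incorrect usage of a cryptographic scheme
print(40.18 / 0.0416)  # ≈ $966, lateral movement and 1-day exploitation
```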

    These results illustrate why Expected Cost per Success is an important complementary metric for communicating the risk introduced by a frontier model. When a slim chance of success such as 1.7% or 4.16% translates into a manageable cost, the challenge should be treated as practically solved. What matters is not only how likely a model is to succeed, but how much that success could cost.

    The Expected Cost per Success metric also allows evaluators to fairly assess whether a smaller or older model offers better capability per dollar. It makes it possible to check whether a cheaper configuration (agent scaffold, tool use, and other parameters) could reduce token usage and outperform a more expensive one. It also provides a path to compare the relative risk of a new model to that of previous models. Even if all a new model did was become 100x cheaper, so that a cyber attack that once cost $100k now costs $1k, that is a legitimate difference in the real-world consequences of its capabilities.

    The cost dimension is largely missing from current practice. AI capability evaluations, including those published in model system cards, typically report success rates as the primary metric, with limited or no cost analysis. This focus is intuitive because it mirrors how people perform on tests, where repeated failure usually indicates a lack of skill. For models, the analogy breaks down: a model may already have the capability but express it inconsistently. In such cases, a meaningful barrier to success is the cost of running the model enough times, and for some threat models, that barrier is measured mainly in dollars.

    Cost-Sensitive Threat Models

    In certain cyber threat models, the Expected Cost per Success is particularly relevant: When non-state actors with limited resources target numerous systems rather than a specific, well-defended few, and do not need to succeed on every attempt, cost becomes an important barrier to a successful attack. These scenarios could exhibit some combination of the following characteristics:


    • Multiple targets: Attackers can target many systems simultaneously. Even if most attempts fail, attacking at scale could mean some will succeed.

    • Repeated attempts: Even against a single system, attackers could try multiple times. Unlike physical attacks, failed cyberattacks can leave the target unchanged and the attack path open.

    • Limited consequences for failure: Attribution is difficult, evasion tools are available, and public accusations may carry little weight for some actors.

    • Defender asymmetry: Defenders must identify real threats amid noise from both failed attempts and benign traffic. The cost of comprehensive monitoring can exceed the cost of attacking.

    Such threat models are particularly significant for linking AI capability evaluation and risk. High-specificity targeting by expert actors with substantial resources already exists. The potential game-changer is whether AI could enable multiple complex attacks at scale by actors who previously lacked such capability. For threat actors in this scenario, offensive operations that cost $100 or even $1,000 to carry out remain accessible, making Expected Cost per Success the relevant measurement for assessing practical risk.

    Not all cyber threats fit this model. Consider a scenario of a skillful actor, often a state or state-aligned group, targeting a specific live system where detection carries serious consequences. In those operations, a failed attempt can trigger legal or political fallout, alert defenders, harden the target’s security posture, or permanently close that particular attack path. 

    Assessing risk in such cases requires a different analysis. Cost can play a part in this analysis, along with the target's and the attacker's characteristics: how sensitive the target is, how much exposure the attacker can tolerate, the role of humans in the loop, and how many attempts they can realistically make before detection. In these high-stakes settings, whether the attacker can achieve a very high probability of success in only a small number of attempts could play a more significant role than cost.

    By contrast, in the wide-net threat model discussed above, attackers accept low per-attempt success rates and rely on scale. Practical risk in that setting is primarily driven by economic feasibility, making the Expected Cost per Success a relevant metric.

    Implications for Cyber Evaluation Practices 

    The relevance of cost to specific threat models has implications for evaluation practices. Stopping an evaluation run at a fixed iteration limit mixes model capability with arbitrary boundaries. Expected Cost per Success shifts focus to economic feasibility. While any threshold involves judgment, cost can be anchored to the resources available to relevant threat actors, thereby connecting the cutoff to real-world considerations.
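
    As one way to make that anchoring concrete, the sketch below (our assumption; the budget figure is illustrative, not something an evaluation measures) treats a challenge as economically feasible for an actor whenever the Expected Cost per Success fits within that actor’s budget:

```python
def economically_feasible(cost_per_run, success_rate, actor_budget):
    """Sketch: a challenge is 'economically feasible' for an actor when the
    Expected Cost per Success fits within that actor's assumed budget.
    actor_budget is an illustrative assumption, not a measured quantity."""
    if success_rate == 0:
        return False  # no observed successes; see "Rethinking Success" below
    return (cost_per_run / success_rate) <= actor_budget

# ~$966 per success (second table row) fits within a $1,000 budget:
print(economically_feasible(40.18, 0.0416, actor_budget=1_000))  # True
```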

    As cyber evaluations become more advanced, they could also become more expensive. When allocating resources, investment in evaluation should be proportional to the potential cost of the damage that misuse, for example, could cause. For defenders, knowing the cost of particular attack vectors can help prioritize defenses. The linkage between threat model and cost helps guide both kinds of decisions.

    Beyond cutoffs and prioritization, a cost metric also allows evaluators and defenders to account for how attacks might be conducted differently from the test environment. Attackers might use better agents or have superior elicitation techniques, potentially at lower costs. A cost metric should incorporate a buffer to account for such differences in the risk calculation, focusing evaluations on economic feasibility rather than exact replication of the test environment.

    Rethinking "Success" using Expected Cost per Success

    Using Expected Cost per Success changes how we interpret evaluation results. An evaluation task that a model is incapable of solving remains meaningful regardless of cost, because it marks a genuine capability boundary. Partial success, however, has to be read in context. A 5% success rate means something very different depending on whether each attempt costs $0.10 or $100, and depending on the resources available to the relevant threat actors.

    This metric also reframes challenges that were never solved. When we focus on success rates, we may put little emphasis on a change from 0% to 1%. With Expected Cost per Success, having no successful attempts retains its significance as a likely capability threshold⁴, while any nonzero success rate must be evaluated against realistic attack economics.

    For unsolved challenges, this has practical consequences. When reporting that a model never solved a given evaluation task, it's essential to specify how much money was spent before stopping, both on running the model itself and on “helping the model” with tool use or humans-in-the-loop. There is a big difference between a complex task that remains unsolved after $10,000 in tokens and one abandoned after $2.
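
    One standard way to quantify this, which footnote 4 alludes to (the specific method is our suggestion, not prescribed by the article), is the “rule of three”: zero successes in n independent runs gives an approximate 95% upper bound of 3/n on the true success rate, and therefore a lower bound on the Expected Cost per Success:

```python
def min_expected_cost_if_unsolved(num_runs, cost_per_run):
    """For zero successes in num_runs independent runs, the 'rule of three'
    gives an approximate 95% upper bound of 3/num_runs on the true success
    rate, hence a lower bound on the Expected Cost per Success."""
    upper_bound_success_rate = 3 / num_runs
    return cost_per_run / upper_bound_success_rate

# Illustrative: 100 runs at $3.81 each with no successes suggests the true
# success rate is below ~3%, so a success would likely cost at least ~$127.
print(min_expected_cost_if_unsolved(num_runs=100, cost_per_run=3.81))
```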

    Limitations

    Expected Cost per Success is a useful complement to success rate, but it has limitations:


    • Cost structures vary by actor. A sophisticated actor who invests upfront in tooling and infrastructure may achieve lower marginal costs than our evaluation setup assumes. Conversely, an unsophisticated actor who can't replicate the evaluation harness faces higher effective costs.

    • API pricing changes over time. A challenge that costs $500 to solve today might cost $50 in a year. To avoid false reassurance, evaluators could re-run benchmarks regularly or report token counts alongside dollar costs so readers can calibrate over time.

    • Focus on cost where it’s not the most important metric. For example, not all threat models are cost-sensitive. In scenarios where detection carries serious consequences or a failed attempt closes the attack path, success probability matters more than cost. Expected Cost per Success should not obscure how difficult a capability is to achieve in these contexts.

    • Evaluations often don't measure real-world costs. When cost becomes a primary metric, differences between evaluation environments and live systems with active defenses and monitoring affect not just whether a model succeeds, but how much it costs to succeed. This increases the importance of designing evaluations that approximate realistic conditions.

    Conclusion

    The most effective way to communicate capability evaluation results depends on the audience and purpose. Success rates remain intuitive for general communication about model performance, where the public reasonably wants to know how often a model gets things right. But for risk assessment, particularly in cybersecurity, success rate alone can mislead. A 1% success rate sounds reassuring until you learn that success costs $50.

    Empirical evaluation is inherently imperfect, and the Expected Cost per Success metric cannot guarantee precise capability boundaries. But it enables fairer model comparisons, more meaningful thresholds, and risk assessment grounded in dollar cost.

    ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯

    ¹ Notable exceptions: the Claude Sonnet 4.5 system card’s CyberGym evaluation reports success rate alongside a mean cost per trial and an average cost for successful trials. The GPT-5.2 system card update further details the model’s average cost-per-success on Irregular cyber challenges.

    ² Successful runs are cheaper because the model stops at the goal, while the overall mean includes costly failed attempts that run up against the maximum iteration limit.

    ³ The difference in cost scale between the two challenges is due in part to the higher iteration limit per run: the first challenge used a 100-iteration cutoff, while the second used a 1,000-iteration cutoff. For a discussion of cutoffs, see the “Implications for Cyber Evaluation Practices” section.

    ⁴ For evaluations that were never solved, it is still possible to estimate an upper confidence bound on the true success rate (for example, via the rule of three, as sketched above).


    To cite this article, please credit Irregular with a link to this page.