Citizen Hacker

How We Measure

The data sources, the proxy assumption, and what we deliberately don't control for. This page is mandatory reading if you plan to disagree with the numbers.

The Metric

Time-to-Parity (TTP) is the number of days between the first frontier model to cross a SWE-bench Verified threshold and the first open-weight (downloadable, locally runnable, no guardrails) model to match it. We track seven thresholds from 49% to 94%. The trendline shows TTP falling from 440 days to 106 days, a 4.15× compression over roughly two years.
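The definition above reduces to a date subtraction per threshold. A minimal sketch, assuming crossing dates are already known per threshold; the dates below are invented for illustration (chosen so the gap equals the headline 440 days), not the real crossing dates:

```python
from datetime import date

def time_to_parity(threshold, frontier_cross, open_weight_cross):
    """Days between the first frontier model crossing a SWE-bench
    Verified threshold and the first open-weight model matching it."""
    return (open_weight_cross[threshold] - frontier_cross[threshold]).days

# Illustrative dates only; real values come from the tracked leaderboards.
frontier = {0.49: date(2023, 1, 1)}
open_weight = {0.49: date(2024, 3, 16)}
ttp = time_to_parity(0.49, frontier, open_weight)  # 440 days
```

Repeating this for all seven thresholds yields the TTP series the trendline is fit to.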

Forward Projection Methods

We project when non-restricted models will cross the Mythos parity thresholds using two methods.

Leading-Edge (Primary). Tracks the best non-restricted model score at each point in time (the “envelope”). Directly answers the threat question: how fast is the best available model advancing toward Mythos-level capability? Higher R² but smaller sample (N=5–6). This is the method used in the headline projection.

GPQA Diamond — projection to 94.5%

SWE-bench Pro — projection to 77.8%

CyberGym — projection to 83.1%
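The envelope-and-project logic can be sketched briefly. This is our own minimal version, assuming a least-squares linear fit (the site's actual functional form may differ), with an invented score history for illustration:

```python
def leading_edge(points):
    """Keep only (day, score) points that raise the best-so-far score —
    the 'envelope' of the best non-restricted model over time."""
    env, best = [], float("-inf")
    for day, score in sorted(points):
        if score > best:
            env.append((day, score))
            best = score
    return env

def project_crossing(points, threshold):
    """Fit a least-squares line to the envelope and solve for the day
    the fit reaches `threshold`."""
    env = leading_edge(points)
    xs = [d for d, _ in env]
    ys = [s for _, s in env]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return (threshold - intercept) / slope

# Invented history: (days since tracking start, benchmark score %).
history = [(0, 40.0), (60, 55.0), (90, 52.0), (150, 63.0), (210, 70.0)]
```

Note that the envelope drops the (90, 52.0) point, since it is below the earlier best of 55.0; this is what keeps the sample small (N=5–6) but the fit tight.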

We compute everything from five independent benchmark sources across three dimensions (reasoning, software engineering, cybersecurity). The full model roster is available in the Model Explorer.

Access Model Categories

Why three categories matter: On CyberGym, an open-weight model (GLM-5.1, 68.7%) has already surpassed the best frontier models (Opus 4.6, 66.6%; GPT-5.4, 66.3%). The only model ahead is restricted (Mythos, 83.1%). Collapsing frontier and open-weight into one “publicly available” category hides the fact that the most security-relevant capability tier — unmonitored, decentralized — is already ahead of the monitored tier.
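To make the three-way split concrete, a tiny sketch that keeps the best score per access category; the function and data layout are ours, but the scores are the CyberGym numbers quoted above:

```python
def best_per_category(scores):
    """Best (model, score) within each access category."""
    best = {}
    for model, (category, score) in scores.items():
        if category not in best or score > best[category][1]:
            best[category] = (model, score)
    return best

# CyberGym scores from the text, split three ways rather than into a
# single "publicly available" bucket.
cybergym = {
    "GLM-5.1": ("open-weight", 68.7),
    "Opus 4.6": ("frontier", 66.6),
    "GPT-5.4": ("frontier", 66.3),
    "Mythos":  ("restricted", 83.1),
}
```

Querying the result shows the open-weight best (68.7) above the frontier best (66.6), which is exactly the ordering a two-category scheme would hide.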

Data Sources

Confidence Levels

The Proxy Assumption

Known Limitations