Citizen Hacker

How We Measure

The data sources, the proxy assumption, and what we deliberately don't control for. This page is mandatory reading if you plan to disagree with the numbers.

The Metric

Time-to-Parity (TTP) is the number of days between the first frontier model to cross a SWE-bench Verified threshold and the first open-weight (downloadable, locally runnable, no guardrails) model to match it. We track seven thresholds from 49% to 94%. The trendline shows TTP falling from 440 days to 106 days, a 4.15× compression over roughly two years.
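The definition above reduces to a date subtraction per threshold. A minimal sketch, assuming crossing dates are already known per threshold; the dates below are invented for illustration (chosen so the gap equals the headline 440 days), not the real crossing dates:

```python
from datetime import date

def time_to_parity(threshold, frontier_cross, open_weight_cross):
    """Days between the first frontier model crossing a SWE-bench
    Verified threshold and the first open-weight model matching it."""
    return (open_weight_cross[threshold] - frontier_cross[threshold]).days

# Illustrative dates only; real values come from the tracked leaderboards.
frontier = {0.49: date(2023, 1, 1)}
open_weight = {0.49: date(2024, 3, 16)}
ttp = time_to_parity(0.49, frontier, open_weight)  # 440 days
```

Repeating this for all seven thresholds yields the TTP series the trendline is fit to.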

Forward Projection Methods

We project when non-restricted models will cross the Mythos parity thresholds using two methods.

Leading-Edge (Primary). Tracks the best non-restricted model score at each point in time (the “envelope”). Directly answers the threat question: how fast is the best available model advancing toward Mythos-level capability? Higher R² but smaller sample (N=5–6). This is the method used in the headline projection.

GPQA Diamond — projection to 94.5%

SWE-bench Pro — projection to 77.8%

CyberGym — projection to 83.1%
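The envelope-and-project logic can be sketched briefly. This is our own minimal version, assuming a least-squares linear fit (the site's actual functional form may differ), with an invented score history for illustration:

```python
def leading_edge(points):
    """Keep only (day, score) points that raise the best-so-far score —
    the 'envelope' of the best non-restricted model over time."""
    env, best = [], float("-inf")
    for day, score in sorted(points):
        if score > best:
            env.append((day, score))
            best = score
    return env

def project_crossing(points, threshold):
    """Fit a least-squares line to the envelope and solve for the day
    the fit reaches `threshold`."""
    env = leading_edge(points)
    xs = [d for d, _ in env]
    ys = [s for _, s in env]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return (threshold - intercept) / slope

# Invented history: (days since tracking start, benchmark score %).
history = [(0, 40.0), (60, 55.0), (90, 52.0), (150, 63.0), (210, 70.0)]
```

Note that the envelope drops the (90, 52.0) point, since it is below the earlier best of 55.0; this is what keeps the sample small (N=5–6) but the fit tight.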

We compute everything from five independent benchmark sources across three dimensions (reasoning, software engineering, cybersecurity). The full model roster is available in the Model Explorer.

Access Model Categories

Why three categories matter: On CyberGym, an open-weight model (GLM-5.1, 68.7%) has already surpassed the best frontier models (Opus 4.6, 66.6%; GPT-5.4, 66.3%). The only model ahead is restricted (Mythos, 83.1%). Collapsing frontier and open-weight into one “publicly available” category hides the fact that the most security-relevant capability tier — unmonitored, decentralized — is already ahead of the monitored tier.
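To make the three-way split concrete, a tiny sketch that keeps the best score per access category; the function and data layout are ours, but the scores are the CyberGym numbers quoted above:

```python
def best_per_category(scores):
    """Best (model, score) within each access category."""
    best = {}
    for model, (category, score) in scores.items():
        if category not in best or score > best[category][1]:
            best[category] = (model, score)
    return best

# CyberGym scores from the text, split three ways rather than into a
# single "publicly available" bucket.
cybergym = {
    "GLM-5.1": ("open-weight", 68.7),
    "Opus 4.6": ("frontier", 66.6),
    "GPT-5.4": ("frontier", 66.3),
    "Mythos":  ("restricted", 83.1),
}
```

Querying the result shows the open-weight best (68.7) above the frontier best (66.6), which is exactly the ordering a two-category scheme would hide.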

Data Sources

Confidence Levels

The Proxy Assumption

Known Limitations