Citizen Hacker

When does the capability Anthropic refused to release become available to everyone?

Mythos can autonomously discover zero-day vulnerabilities at industrial scale. Anthropic restricted it to eleven organizations. Unrestricted models are closing in.

Cost of the Mythos campaign at parity: from $20,000 today.

Anthropic’s Claude Mythos Preview discovered zero-day vulnerabilities in OpenBSD, FFmpeg, and Firefox — flaws that survived decades of fuzzing and manual audit. Anthropic concluded general release would make large-scale cyberattacks “far more likely” and restricted access to eleven partner organizations.

This project tracks how quickly unrestricted models are closing the gap across three dimensions: reasoning, software engineering, and cybersecurity. The headline number is the last pillar to fall — the date by which an unrestricted model matches Mythos on all three.

Three pillars to parity

Mythos-level capability requires reasoning, software engineering, and cybersecurity. The headline number tracks the last pillar to fall.

01

Reasoning

GPQA Diamond (198 graduate-level science questions) measures deep reasoning that cannot be solved by search. Anthropic identifies this as the root capability behind Mythos's cybersecurity performance. Frontier models have essentially closed the gap already — Gemini 3.1 Pro (94.3%) is 0.2 points from Mythos (94.5%). The best open-weight model (Qwen3.5-397B, 88.4%) is 6.1 points behind.

GPQA Diamond — top models
[Chart: Restricted, Frontier, and Open-weight model scores against the 94.5% Mythos threshold.]

Why reasoning matters

Anthropic's own assessment is that Mythos's cybersecurity capability is a downstream effect of general reasoning improvement, not specialized cyber training. GPQA Diamond is the hardest public general reasoning benchmark.

The causal chain: General reasoning → Code reasoning → Cybersecurity capability. If the reasoning gap closes, the others follow.

Frontier has essentially achieved reasoning parity. Open-weight is 6.1 points behind and closing at ~2 points per quarter. This pillar will not be the bottleneck.

02

Software Engineering

SWE-bench measures autonomous software engineering — the ability to understand, navigate, and modify real codebases. This is the current bottleneck: the best non-restricted model is 18.7 points from Mythos's 77.8% on SWE-bench Pro. The historical Verified variant (left) shows time-to-parity (TTP) compressing from 440 to 106 days.

SWE-bench Verified (historical)
Deprecated Feb 2026 due to confirmed contamination. Historical context only.
[Chart: Restricted (Mythos), Frontier, and Open-weight scores; Mythos at 93.9%.]

SWE-bench Pro — top models
[Chart: Restricted, Frontier, and Open-weight model scores against the 77.8% Mythos threshold.]

03

Cybersecurity

CyberGym (UC Berkeley) tests PoC generation against 1,507 real-world vulnerabilities — the most direct measure of offensive cyber capability. Open-weight has already surpassed frontier (GLM-5.1 68.7% vs Opus 4.6 66.6%). Only Mythos (83.1%) is ahead. The convergence chart shows capability over time.

CyberGym — capability over time
[Chart: Restricted, Frontier, and Open-weight scores over time; 83.1% Mythos threshold marked.]

CyberGym — top models
[Chart: Restricted, Frontier, and Open-weight model scores against the 83.1% Mythos threshold.]

Gap Over Time

The charts below plot the frontier (best commercial API model) and open-weight (best downloadable model) scores over time. When the lines converge, the gap is closing; when the open-weight line crosses above the frontier line, open-weight has surpassed the frontier. On CyberGym, that inversion has already happened.
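The inversion the caption describes is straightforward to compute from the two best-so-far series. A minimal sketch — the function name and the data points are illustrative, not the project's actual code or scores:

```python
from datetime import date

def first_inversion(series):
    """Return the first date at which the open-weight best exceeds the
    frontier best, or None if no inversion has occurred.

    `series` is a list of (date, frontier_score, open_weight_score)
    tuples sorted by date.
    """
    for day, frontier, open_weight in series:
        if open_weight > frontier:
            return day
    return None

# Hypothetical CyberGym-style history for illustration only:
history = [
    (date(2025, 6, 1), 58.0, 49.5),
    (date(2025, 10, 1), 63.2, 61.0),
    (date(2026, 2, 1), 66.6, 68.7),  # open-weight crosses above frontier
]
```

On this toy series, `first_inversion(history)` returns the last date, mirroring the CyberGym inversion described above.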

GPQA Diamond — frontier vs open-weight
[Chart: frontier best vs open-weight best over time; shaded area shows the gap.]

SWE-bench Pro — frontier vs open-weight
[Chart: frontier best vs open-weight best over time; shaded area shows the gap.]

CyberGym — frontier vs open-weight
[Chart: frontier best vs open-weight best over time; shaded area shows the gap.]

Why this matters

Mythos can discover a 27-year-old vulnerability in OpenBSD, write a 20-gadget ROP chain to exploit a FreeBSD NFS server, and escape a browser sandbox by chaining four zero-days together. Today, only eleven organizations have access to it. The question this project answers: how long until anyone with a GPU can do the same thing?

That question has three parts. A model needs deep reasoning to understand the code, software engineering skill to navigate real codebases, and direct cybersecurity capability to find and exploit vulnerabilities. We track each independently. The headline number — the countdown at the top of this page — is the last pillar to fall: the date by which an open-weight model matches Mythos on all three.

The implications compound. Anthropic spent roughly $20,000 on 1,000 runs against OpenBSD and found several dozen vulnerabilities, over 99% of which remain unpatched. Inference costs are falling 10–100× per year. By the time open-weight models reach parity, that same campaign could cost under $5,000 — and be run by anyone, with no audit trail, no guardrails, and no way for the originating lab to intervene.
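The cost arithmetic compounds simply: if inference prices fall by a constant factor each year, the projected campaign cost is a one-line calculation. The sketch below uses the conservative end (10×/year) of the range quoted above:

```python
def projected_cost(cost_today: float, annual_price_drop: float, years: float) -> float:
    """Campaign cost after `years`, assuming inference prices fall by
    a factor of `annual_price_drop` per year (10-100x per the text above)."""
    return cost_today / (annual_price_drop ** years)

# Even at the conservative 10x/year decline, the $20,000 OpenBSD
# campaign falls below $5,000 in well under a year.
cost_after_one_year = projected_cost(20_000, 10, 1.0)
```

At 10×/year, one year takes the campaign from $20,000 to $2,000; at 100×/year, to $200.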

How We Measure details the benchmarks, projection methods, and confidence intervals.

The Convergence tells the story behind the data.

Forward projections fit logistic growth, with a linear fallback, to the leading-edge (best-so-far) trajectory of non-restricted models. 95% confidence intervals come from 1,000 bootstrap resamples. A regression over all models is reported as a cross-check.
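The projection method can be sketched as follows. This is a minimal illustration under stated assumptions — the function names and the synthetic series in the example are ours, not the project's actual pipeline:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, L, k, t0):
    """Saturating growth curve: score approaches ceiling L."""
    return L / (1.0 + np.exp(-k * (t - t0)))

def crossing_date(t, y, threshold, ceiling=100.0, n_boot=1000, seed=0):
    """Fit logistic growth (linear fallback) to a best-so-far score series
    and bootstrap a 95% CI on the date the fit crosses `threshold`."""
    rng = np.random.default_rng(seed)
    t, y = np.asarray(t, float), np.asarray(y, float)

    def fit_cross(ti, yi):
        try:
            p, _ = curve_fit(logistic, ti, yi,
                             p0=[ceiling, 0.01, ti.mean()], maxfev=5000)
            L, k, t0 = p
            if L <= threshold:          # fit saturates below the threshold
                raise RuntimeError("never crosses")
            # Invert: threshold = L / (1 + exp(-k (t - t0)))
            return t0 - np.log(L / threshold - 1.0) / k
        except Exception:               # linear fallback
            m, b = np.polyfit(ti, yi, 1)
            return (threshold - b) / m

    est = fit_cross(t, y)
    boots = [fit_cross(*(lambda i: (t[i], y[i]))(
                 rng.integers(0, len(t), len(t))))
             for _ in range(n_boot)]
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return est, (lo, hi)
```

Fitting a noiseless logistic series recovers the true crossing date, and the bootstrap interval collapses around it; on real, noisy leaderboard data the interval widens accordingly.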