Why Mythos Can't Win the Security Race Alone
Frontier models are remarkable general engines. But high-stakes security work is not won by raw model capability alone. It is won by aiming the right capabilities of various models at the right task, with the right data, benchmarks, feedback loops, and deployment system.
Every powerful technology lives a double life. It is built for one original purpose, then rebuilt by users who find more in it than its makers intended. Rockets, radar, and the internet all followed this pattern: a specific tool became a general platform, then general capability was aimed back at specific high-value tasks.
No contemporary technology illustrates this process more clearly than the graphics chip. In the late 1990s and early 2000s, graphics chips were largely associated with real-time 3D rendering in PC gaming and consoles like the Nintendo 64 and Sony PlayStation.

This general, retail application persisted as the main use case for the hardware until scientific and academic researchers realized that the programmable shader pipeline inside a graphics card was really, underneath the gaming veneer, a cheap parallel supercomputer.
Fluid dynamics modeling, astrophysics n-body problems, and seismic processing for oil and gas all share one structural feature. Each involves a vast array of nearly identical arithmetic operations that don't depend on one another, meaning they can all be done simultaneously rather than sequentially. This makes them ideally suited to GPUs with thousands of small cores, whereas a CPU with a handful of large cores can only work through a few of those computations at a time, and thus much less efficiently.
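To make that structural feature concrete, here is a minimal sketch in Python. It is an illustration only, not code from any of the projects mentioned here: the function names and the choice of the a*x + y operation are assumptions made for the example. Because no output element depends on any other, the work can be split into chunks and handed to separate workers, and the result matches the one-at-a-time loop exactly.

```python
from concurrent.futures import ThreadPoolExecutor


def saxpy_chunk(a, xs, ys):
    """Compute a*x + y for each pair; no element depends on any other."""
    return [a * x + y for x, y in zip(xs, ys)]


def saxpy_sequential(a, xs, ys):
    """Reference version: walk the data one element at a time."""
    out = []
    for x, y in zip(xs, ys):
        out.append(a * x + y)
    return out


def saxpy_parallel(a, xs, ys, workers=4):
    """Split the data into chunks and give each chunk to its own worker.

    Hypothetical helper for illustration. Python threads show the shape of
    the computation, not a real speedup; a GPU runs thousands of such
    independent lanes at once in hardware.
    """
    size = max(1, -(-len(xs) // workers))  # ceiling division
    starts = range(0, len(xs), size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns chunk results in submission order.
        parts = pool.map(
            lambda s: saxpy_chunk(a, xs[s:s + size], ys[s:s + size]), starts
        )
        return [value for part in parts for value in part]
```

The chunks could be computed in any order, or all at once, without changing the answer. That independence is what lets this class of workload spread across thousands of small cores.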
Harnessing the Horsepower
Realizing this was one thing; harnessing it was another. Ian Buck is, in a real sense, the man who turned the gaming card into a general computing platform. As a PhD student at Stanford under graphics pioneer Pat Hanrahan (a future Turing Award winner), he saw that the shader pipeline could function as a data-parallel processor, held back only by the fact that you had to disguise your math as a graphics trick to get at it. His prototype language, Brook, proved the GPU could work as a general parallel computer as long as you built the right architecture over it.

Buck’s prototype of Brook is what led him to NVIDIA, which he joined in 2004 and where he remains today.

The company had the silicon hardware while Buck had the software vision for what it could become. He led the team that built CUDA, launched in 2007, which took Brook's central idea and scaled it up into a C-based environment that let any programmer write parallel code for the GPU directly.
What turned NVIDIA's chips into the substrate of modern computing wasn't the raw silicon alone; it was the layer built on top: the abstraction, the tooling, and years of engineering that focused the beam at the problems people need solved. The engine mattered enormously, but it was never going to win the race without wheels and a chassis. And ideally a steering rack.
The Edge Is in Your Aim
In 2012, three researchers from the University of Toronto – Geoffrey Hinton and his students Alex Krizhevsky and Ilya Sutskever – trained a network called AlexNet on two NVIDIA gaming GPUs. AlexNet won computer vision’s marquee contest by a huge margin, cutting the error rate from 26% to 15% when the field had previously been improving by just one percentage point a year.

Hinton went on to win a Nobel Prize. Sutskever co-founded OpenAI and led the research that produced ChatGPT. The enduring lesson of AlexNet is that general compute, aimed correctly, can change entire fields.
Anthropic, OpenAI, Google, and xAI are all fighting to build the best general intelligence model. But a system that aims that intelligence at a specific task will keep outperforming a general base model with no domain specialization. The companies that can do that consistently, with each new model release, will earn their margin.
We Ran the Race
The easiest way to settle the argument is to run the race. So we took the strongest frontier reasoning models available and ran them (cold, at maximum reasoning effort) against the same codebases Octane analyzes, measuring how many of the known vulnerabilities each one actually surfaced.

On the high-severity vulnerabilities, Octane recovered 86% (25 of 29). OpenAI's GPT-5.5 at its highest reasoning setting found 38% (11 of 29). Anthropic's Opus 4.7 in Max Thinking mode found 31% (9 of 29). Across findings of every severity, Octane recovered 82% of the total while the frontier models landed between 28% and 34%. Weighted by severity, Octane scored 84% while the frontier models scored in the low 40s.
Run at full power against real code, the best general models on the market missed roughly two of every three known high-severity vulnerabilities, while Octane recovered 25 of the 29.
What makes the difference is everything that sits on top of the foundation model: security-specific context, task orchestration, custom evaluation datasets, hallucination filtering with custom-trained models, impact validation, and researcher judgment fed back into the system. That architecture is worth roughly fifty percentage points of recall.
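The high-severity figures can be checked with a few lines of arithmetic. The sketch below uses only the counts reported above (25, 11, and 9 of 29); the variable and function names are ours, chosen for illustration, and it does not attempt the severity-weighted score, since the weights are not given here.

```python
# Known high-severity vulnerabilities recovered, as reported in this article.
TOTAL_HIGH_SEVERITY = 29
found = {
    "Octane": 25,
    "GPT-5.5 (highest reasoning)": 11,
    "Opus 4.7 (Max Thinking)": 9,
}


def recall_pct(hits: int, total: int) -> float:
    """Recall as a percentage: share of known vulnerabilities recovered."""
    return 100.0 * hits / total


recalls = {
    name: recall_pct(hits, TOTAL_HIGH_SEVERITY) for name, hits in found.items()
}

# Octane's lead over each frontier model, in percentage points of recall.
gaps = {
    name: recalls["Octane"] - value
    for name, value in recalls.items()
    if name != "Octane"
}

for name, value in recalls.items():
    print(f"{name}: {value:.0f}% recall")
for name, gap in gaps.items():
    print(f"Octane lead over {name}: {gap:.0f} points")
```

On the high-severity set, the lead works out to about 48 points over GPT-5.5 and about 55 points over Opus 4.7, which is where the "roughly fifty" figure comes from.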
Note what we did not test: Mythos. Anthropic’s release strategy strictly limits any form of independent cybersecurity research using the model. And the well-publicized export controls slapped on Fable days after its release are still in effect.
But the result is structural, not just a snapshot of current capabilities.
We welcome each new frontier model release because it is the latest turn in the same process of hardware-software co-design. The labs will keep raising the floor of raw capability, and we’ll keep raising the ceiling of their cybersecurity-specific prowess. If we don’t keep raising it, attackers and adversaries will, and defenders will lose. That’s the race.
See what that looks like in practice: 0-Days in 72 Hours: How Octane Found Vulnerabilities in the Code Behind 99.7% of Browser Traffic



