Margin of Safety #35: OpenAI’s agentic security researcher 'Aardvark'
Aardvark gives OpenAI access to real-world, complex code reasoning signals. This helps OpenAI improve future models’ ability to understand, debug, and fix large software systems
(source: OpenAI’s announcement, Link)
We’ve previously written about the potential for AI in security research, so last week we couldn’t help but notice OpenAI’s announcement of Aardvark, an “agentic security researcher powered by GPT-5.” While OpenAI has previously invested in some security startups, this is their first public announcement of a core, cybersecurity-focused product. We also think it’s notable because while software security research is a market near and dear to our hearts, it’s not necessarily the first product line expansion that comes to mind for a company eyeing a $1T IPO.
So what could be motivating this announcement? We think it’s a desire for training data more than top-line revenue. It’s clear at this point that code generation is a huge use case for advanced models, and that a major challenge for both models and the tooling around them is reasoning about large codebases. It’s also very difficult to generate synthetic examples of large-codebase reasoning tasks from the ground up. Hunting down vulnerabilities and proposing fixes, however, is exactly the kind of task that requires reasoning about subtle behavior in large codebases. Working with early design partners can be a win-win for all involved: OpenAI gets invaluable training data, partners get high-touch (and presumably carefully monitored) security research, and everyone gets a slightly more secure world.
For the existing security ecosystem, OpenAI is likely concentrating on high-quality code reasoning rather than attempting to capture the entire security research market directly. If you’re building on OpenAI’s platform, this could deliver an upstream capability improvement, but your product needs to be positioned so it doesn’t compete head-on with that core capability. Good candidates for differentiation are services at the enterprise layer, or workflow enhancements that surround a code model, such as integration with existing developer and security workflows, including code change management. Making opinionated bets on which capabilities will remain out of reach for the core model is tempting, but quite risky unless you have the technical depth to form a very sound hypothesis.
Motivations aside, we’d also observe that the capabilities are currently in private preview. To us, this implies one of a few things. In some combination, we suspect:
(A) The capabilities are not yet at the quality bar that OpenAI feels is appropriate for mass enterprise usage (as possible supporting evidence: they appear to have found 10 FOSS vulnerabilities to date, which is simultaneously impressive and a very low number from which to draw broad assumptions about capabilities.)
(B) The capabilities are sufficiently dual-use (e.g., they could also be used to find a bunch of Windows 0-days) that OpenAI is only comfortable making them available to carefully vetted partners.
(C) The capabilities are using so many GPU cycles under the hood that it’s economically infeasible to make them broadly available... yet.
Of these explanations, (B) seems least likely. If it were really the case, the responsible thing to do would be to use the capabilities to stress test core parts of public infrastructure (the Linux kernel, K8s, Python and its package explosion, etc.) and help fix issues before they could be weaponized by the bad guys. We don’t see any evidence of a flood of emergency patches, so we don’t believe a new tier of capability is quietly being rolled out. That leaves (A) and (C) as the most likely options, and in both cases, potential competitors should expect some degree of improvement over time. (C) might not behave intuitively: we’ve seen in other domains that even as per-unit inference costs decline, advances such as reasoning models drive such a large increase in inference volume that total cost rises. The same could easily happen here, but we’d still expect a given level of capability to become cheaper over time. (A) is harder to predict; we expect that for any narrowly scoped issue, OpenAI could likely tune performance improvements into ChatGPT. But the raw fact that they’re prioritizing this capability hints at how difficult it is to drive an overall improvement in model reasoning. It may well be that the domain has sufficiently nuanced and diverse reasoning challenges to make near-term, generalizable gains difficult. Or maybe not! We think this is the biggest point of ambiguity, and time (or ChatGPT 6.0) will tell how it shakes out.
If Aardvark bears fruit, however, the rewards for OpenAI will be best-in-class training data, plus the ability to prove out complex reasoning about real-world, enterprise-class codebases (and for better or worse, ‘enterprise-class’ codebases can pragmatically translate to things that are full of obscure, half-deprecated dependencies, unresolved #TODOs, and the special sort of complexity that forms when software has gone through multiple teams and many years of development). Even if enterprise agreements prevent OpenAI from directly training on the code Aardvark can access, it provides a test ground for OpenAI to discover which techniques – or which model versions – perform well in the real world. Simply learning that a certain model + agent version was able to successfully resolve a large number of vulnerabilities is a valuable signal, assuming OpenAI can gain the traction and scale to experiment with many versions.
And if Aardvark performs well, it will also be a signal to rivals. Anthropic and DeepMind are unlikely to watch quietly if OpenAI shows signs of converting complex code exercises into higher-quality reasoning in future model generations. Expect them to pursue their own versions. For everyone else in the ecosystem, it raises the bar: products built on OpenAI’s stack must now justify their existence beyond that upstream reasoning capability, through workflow integration, vertical specialization, or opinionated focus areas the core model will likely not prioritize.
Have you been experimenting with AI-generated code fixes? If so, we’d love your thoughts on the Aardvark release.



