Margin of Safety #20: Hedging Against AGI
There is a pocket of opportunity in betting not on AGI, but on what it can’t do
One under-discussed trend we’re tracking is how ambiguity around the medium- and long-term capabilities of LLMs and generative AI is creating space for high-risk, high-reward AI bets. Many of these bets implicitly assume that the dominant foundation model providers—OpenAI, Anthropic, Google DeepMind—will not reach well-rounded AGI, at least not soon, and not across all domains.
You can think of this as a more extreme version of the AI wrapper strategic gotcha we wrote about a few weeks ago [link to be added]. This week, we’re diving into a few categories of investments that act either as hedges against AGI or as building blocks toward it, depending on your stance.
Specifically, we’ll look at formal reasoning, retrieval-augmented generation (RAG), and human evaluation—and how these areas intersect (or don’t) with the long-term roadmaps of foundation model providers.
Formal Reasoning
A growing body of research is exploring how to combine formal reasoning systems or other highly deterministic techniques with modern AI models, including transformers. In our view, this represents a thesis bet that LLMs will hit a hard ceiling on problems that demand highly structured, deterministic thinking.
That ceiling might show up as a feasibility barrier (i.e., LLMs just can’t reason that precisely), or an economic one (they can’t do it cost-effectively at scale). Either way, the more you believe LLMs will be able to reason arbitrarily well via heuristic loops, the less compelling these hybrid approaches become. Conversely, if you believe in the marketing of harmonic.fun or symbolica.ai, you implicitly believe that novel techniques are needed for at least some classes of reasoning.
We tend to fall closer to the latter camp than the former and believe that LLMs are fundamentally heuristic engines—which makes them powerful for fast pattern matching, but ill-suited for tasks that require extreme precision in longer-tail contexts. That opens up interesting ground for hybrid approaches: one architecture uses generative models as creative engines and formal logic systems as correctness filters or proof checkers. But it also represents an implicit hedge against the most optimistic forecasts for end-to-end reasoning via transformers alone.
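To make that division of labor concrete, here is a minimal sketch of the generator-plus-verifier pattern. It assumes a hypothetical call_llm stand-in for whatever model API you prefer, and it uses the Z3 SMT solver (the z3-solver package) as the deterministic correctness filter; treat it as an illustration of the architecture, not anyone’s production system.

```python
# Minimal sketch of the "LLM proposes, formal system verifies" architecture.
# Assumptions: call_llm is a hypothetical stand-in for any LLM API, and the
# verifier uses the Z3 SMT solver (pip install z3-solver).
from z3 import Int, Solver, sat


def call_llm(prompt: str) -> dict:
    """Hypothetical model call that proposes a candidate answer.
    In practice this would hit a real provider; here we hard-code a guess."""
    return {"x": 3, "y": 4}


def verify(candidate: dict) -> bool:
    """Deterministic check: does the candidate satisfy x + y == 7 and x * y == 12?"""
    x, y = Int("x"), Int("y")
    solver = Solver()
    solver.add(x == candidate["x"], y == candidate["y"])
    solver.add(x + y == 7, x * y == 12)
    return solver.check() == sat


candidate = call_llm("Find integers x and y with x + y = 7 and x * y = 12.")
print("accepted" if verify(candidate) else "rejected, resample")
```

The design choice worth noticing is that the model is only trusted to propose; acceptance is decided by a component that cannot hallucinate.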
RAG Systems
RAG systems remain compelling because they bridge the gap between static knowledge embedded in weights and dynamic, up-to-date information retrieval. And realistically, there’s likely no future where some form of RAG isn’t useful. But there’s also a growing tension worth watching: the better LLMs get at “needle-in-the-haystack” tasks, the less critical it is to optimize retrieval pipelines.
Needle-in-the-haystack tasks refer to the challenge of locating a single relevant fact in a large, noisy context window. As you can likely imagine, that ability is highly relevant for some tasks (for example, detecting a precise type of gotcha in a 1,000-page contract) and barely needed for others (for example, summarizing a 500-word email). But the more a task requires feeding the model a large quantity of dynamic information relative to its context window, the more a carefully tuned RAG system becomes necessary – the dynamic nature of the information means you can’t fine-tune it in, and the fact that you need a lot of it means you need to carefully select the right information to fit into the context window.
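That selection step is the crux of most RAG pipelines. Below is a minimal sketch of it under deliberately crude assumptions: a toy word-overlap scorer stands in for a real embedding model, and a characters-per-token estimate stands in for a real tokenizer.

```python
# Minimal sketch of RAG chunk selection under a context-window budget.
# The scorer and token estimate are toy stand-ins, not a real retrieval stack.
def score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / max(len(query_words), 1)


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about four characters per token), not a real tokenizer."""
    return max(len(text) // 4, 1)


def select_chunks(query: str, chunks: list, token_budget: int) -> list:
    """Greedily pack the highest-scoring chunks into the available budget."""
    ranked = sorted(chunks, key=lambda chunk: score(query, chunk), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        cost = estimate_tokens(chunk)
        if used + cost <= token_budget:
            selected.append(chunk)
            used += cost
    return selected


chunks = [
    "Clause 14.2: early termination requires 90 days written notice.",
    "The parties agree to meet quarterly to review performance.",
    "Appendix B lists approved subcontractors as of January 2024.",
]
print(select_chunks("termination notice period", chunks, token_budget=30))
```

Almost all of the engineering value in a production RAG system lives in how much better you can make the scoring and the chunking upstream of it, which is exactly the work that becomes less valuable as context windows grow.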
So what happens when a model’s context window is big enough, and its needle-finding capabilities are sharp enough, to handle a context window that’s one giant, poorly curated blob of data? At that point, your engineers can just throw everything that’s even marginally relevant into the window and cross their fingers. Your RAG system still needs to be able to retrieve “everything,” but you don’t need to worry as much about whether you’ve found the perfect stuff.
For startups, this means that some classes of bets on complex knowledge graphs, or other ways of retrieving exactly the right context, are implicitly bets against models ever becoming so good at managing context (and so cheap at doing it) that they can take everything and the kitchen sink into their window.
Human Evaluation
Another area of huge AI investment has been human eval; look no further than Meta’s acquisition of Scale for validation. Many people have already questioned whether the acquisition price can be sustained with large providers canceling Scale contracts, but what about the fundamentals of the business?
Much of Scale’s growth over the last several years can be attributed to an ever-growing need for two types of data in AI training: (1) human-authored synthetic data for model training and (2) human scoring as a quality failsafe in model tuning.
As an example of (1), many types of reasoning aren’t well represented in default training data. For example, people don’t tend to write out things they consider common sense. But if you’re trying to train a model to perform common-sense reasoning and don’t want it to say that a pound of lead is heavier than a pound of feathers, you might need to supplement its training with specific examples of that kind of written reasoning. In some cases, you can use another model to do this (this would be the distillation process that DeepSeek is alleged to have used). But when no model has cracked the problem, you turn to humans and vendors like Scale.
(2) is similar, but instead of generating data for training, you evaluate a model’s responses. Just like the previous example, you can use another model if one can do a great job with the evaluation. But when you’re pushing the boundaries of what a model can do, there definitionally is no such model available. This forces you to turn to a mix of other heuristics and, potentially, humans.
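Here is a minimal sketch of that routing logic. The judge_with_model function and its confidence threshold are hypothetical placeholders rather than any vendor’s API; the point is simply that every response either receives an automatic score or lands in a human review queue.

```python
# Minimal sketch of model-as-judge evaluation with a human fallback.
# judge_with_model is a hypothetical stand-in; thresholds are placeholders.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evaluation:
    score: Optional[float]  # 0.0-1.0 when scored automatically, None when deferred
    needs_human: bool


def judge_with_model(prompt: str, response: str) -> tuple:
    """Hypothetical LLM-as-judge call returning (score, confidence)."""
    return 0.8, 0.55  # placeholder values


def evaluate(prompt: str, response: str, frontier_task: bool,
             confidence_floor: float = 0.7) -> Evaluation:
    # If no existing model is trusted on this class of task, go straight to humans.
    if frontier_task:
        return Evaluation(score=None, needs_human=True)
    score, confidence = judge_with_model(prompt, response)
    # Low-confidence automatic judgments also get escalated.
    if confidence < confidence_floor:
        return Evaluation(score=None, needs_human=True)
    return Evaluation(score=score, needs_human=False)


print(evaluate("Which weighs more, a pound of lead or a pound of feathers?",
               "They weigh the same: a pound is a pound.", frontier_task=False))
```

The vendor bet, in this framing, is on how often responses keep falling through to the human queue as models improve.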
But if you’re investing in a vendor like Scale (let’s say Surge has their first raise and every VC jumps on them), you’re implicitly betting that these use cases will continue to exist, or that new ones will emerge for model training. How does that square with AGI? We’d argue that it partially does not, at least not under a broad definition – the core bet is that models themselves will be unable to generate or score some categories of data, and as a result armies of humans will continue to be needed.
Conclusion
These are just a few examples of areas where valuations and excitement mostly make sense if you believe that foundation model providers will not achieve the broadest definition of AGI in the near future. Given both our own biases and the increasingly hedged language from some of those providers, we think those bets are likely reasonable. But as investors, we always prefer to be clear-eyed about the underlying bet we’re making.