2026 is the year we stop using the wrong denominator

Everyone keeps asking: "Can AI do X yet?"

That's the wrong question, in the same way "How many alumni does this university have?" is the wrong question. The question is always: out of what total?

In 2024–2025, AI was graded on the easiest denominator available: best-case prompts, controlled conditions, with a human babysitter. In 2026, the denominator changes to: all the messy, real tasks done by normal people, under time pressure, with reputational and legal consequences.

This shift isn't coming from research labs. It's coming from the fact that AI is moving out of demos and into production systems where failure is expensive.

The "90% accurate" trap (toy example)

Founders love hearing "90% accuracy." Buyers do not.

Imagine an AI agent that helps a sales team by drafting and sending follow-up emails. It takes 10,000 actions/month (send, update CRM, schedule, etc.). Even a 99% success rate, comfortably above the 90% that founders love to quote, sounds elite until you do the denominator math.

  • 99% success on 10,000 actions = 100 failures/month.

  • If even 10 of those failures are "high-severity" (wrong recipient, wrong pricing, wrong attachment, embarrassing hallucination), that's not a product. That's a recurring incident program.

Now flip the requirement: if the business can tolerate, say, 1 serious incident/month, then the real bar isn't 99%. It might be 99.99% on the subset of actions that can cause damage (and a forced escalation path on everything uncertain). This is why "accuracy" is the wrong headline metric; the real metric is incidents per 1,000 actions, segmented by severity, plus time-to-detect and time-to-recover.
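
To make the denominator math concrete, here is a minimal sketch; the volume, the severity split, and the tolerance are the illustrative assumptions from the example above, not measured data.

```python
# Toy denominator math for an agent that takes 10,000 actions/month.
# All numbers are illustrative assumptions, not benchmarks.

ACTIONS_PER_MONTH = 10_000
SUCCESS_RATE = 0.99                     # the "elite" headline number
HIGH_SEVERITY_SHARE = 0.10              # assumed share of failures that cause real damage
TOLERATED_INCIDENTS = 1                 # serious incidents/month the business will accept

failures = ACTIONS_PER_MONTH * (1 - SUCCESS_RATE)        # 100 failures/month
high_severity = failures * HIGH_SEVERITY_SHARE           # ~10 serious incidents/month

# The metric buyers actually want: incidents per 1,000 actions, by severity.
incidents_per_1k = 1_000 * high_severity / ACTIONS_PER_MONTH

# If every action in the damage-capable subset could trigger an incident,
# tolerating 1 serious incident over 10,000 such actions implies roughly 99.99%.
required_success = 1 - TOLERATED_INCIDENTS / ACTIONS_PER_MONTH

print(f"failures/month: {failures:.0f}")
print(f"high-severity incidents/month: {high_severity:.0f}")
print(f"high-severity incidents per 1,000 actions: {incidents_per_1k:.1f}")
print(f"required success rate on damage-capable actions: {required_success:.2%}")
```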

Most founders still pitch on accuracy. Smart buyers ask for the incident dashboard first.

A founder vignette (postmortem-style)

A founder ships an "autonomous support agent" into production for a mid-market SaaS. The demo crushes: it resolves tickets, updates the CRM, and drafts refunds. Two weeks later, the customer pauses rollout—not because the agent is dumb, but because it's unmeasured. No one can answer: "How often does it silently do the wrong thing?"

The agent handled 3,000 tickets, but three edge cases triggered a nasty pattern: it refunded the wrong plan tier twice and sent one confidently wrong policy explanation that got forwarded to legal. The customer doesn't ask for a bigger model. They ask for logging, evals, and hard controls: "Show me the error distribution, add an approval queue for refunds, and give me an incident dashboard."

The founder realizes the real product isn't "an agent." It's a managed system with guardrails and proof. Everything that came before was a science fair project.

The real metric: evals become the business model

The most valuable AI startups in 2026 won't win by shouting "state of the art." They'll win by making buying safe.

That means being able to say, quickly and credibly:

  • "Here's performance on your distribution (not our demo)."

  • "Here's what it does when uncertain: abstain, ask, escalate."

  • "Here's the weekly report: incident rate, severity mix, and top failure modes."

In other words, evaluation becomes the business model: trust, control, and accountability are what unlock budget.
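
As a rough illustration of what that weekly report might contain, here is a minimal sketch; the field names, severity buckets, and numbers are hypothetical, not a standard schema.

```python
from dataclasses import dataclass, field
from collections import Counter

@dataclass
class WeeklyEvalReport:
    """Hypothetical shape of the weekly report a buyer would ask for."""
    total_actions: int
    incidents_by_severity: Counter       # e.g. Counter({"low": 80, "medium": 15, "high": 5})
    top_failure_modes: list[str] = field(default_factory=list)

    @property
    def incident_rate_per_1k(self) -> float:
        return 1_000 * sum(self.incidents_by_severity.values()) / max(self.total_actions, 1)

    @property
    def severity_mix(self) -> dict[str, float]:
        total = sum(self.incidents_by_severity.values()) or 1
        return {severity: count / total for severity, count in self.incidents_by_severity.items()}

report = WeeklyEvalReport(
    total_actions=10_000,
    incidents_by_severity=Counter({"low": 80, "medium": 15, "high": 5}),
    top_failure_modes=["wrong recipient", "stale CRM record", "hallucinated policy"],
)
print(f"{report.incident_rate_per_1k:.1f} incidents per 1,000 actions")
print(report.severity_mix)
```

The exact shape matters less than the habit: the rate, the severity mix, and the top failure modes ship to the buyer every week.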

Vendors who can't report these metrics weekly aren't ready for revenue. They're still playing.

Agents will grow up: boring, instrumented operations

"Agents" will keep getting marketed as autonomous employees. But founders who actually want revenue will build something more boring and more real:

  • Narrow scope (fewer actions, done reliably).

  • Hard permissions and budgets (prevent expensive mistakes).

  • Full observability (every action logged, queryable, auditable).

  • Explicit escalation paths (humans handle the tail risk).

When the denominator becomes "all actions in production," reliability and containment beat cleverness—every time. The vanity metric is "tickets touched." The real metric is "severity-weighted incident rate per 1,000 actions." Most founders optimize for the first. Smart ones optimize for the second.
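
A minimal sketch of what "hard permissions, budgets, and escalation" can look like at the single-action level; the action types, limits, and confidence threshold are illustrative assumptions.

```python
import logging

logging.basicConfig(level=logging.INFO)

# Narrow scope: the agent may only propose these actions.
ALLOWED_ACTIONS = {"draft_email", "update_crm", "schedule_meeting", "issue_refund"}
REFUND_LIMIT_USD = 0.0        # any refund amount goes to a human approval queue
CONFIDENCE_FLOOR = 0.90       # below this, ask instead of acting

def route_action(action: str, amount_usd: float = 0.0, confidence: float = 1.0) -> str:
    """Return 'execute', 'escalate', or 'reject' for a proposed agent action."""
    if action not in ALLOWED_ACTIONS:
        decision = "reject"                 # outside the permitted scope
    elif action == "issue_refund" and amount_usd > REFUND_LIMIT_USD:
        decision = "escalate"               # humans own the tail risk
    elif confidence < CONFIDENCE_FLOOR:
        decision = "escalate"               # uncertain means ask, not act
    else:
        decision = "execute"
    # Full observability: every proposed action is logged and auditable.
    logging.info("action=%s amount=%.2f confidence=%.2f decision=%s",
                 action, amount_usd, confidence, decision)
    return decision

route_action("issue_refund", amount_usd=49.0, confidence=0.97)   # -> "escalate"
route_action("draft_email", confidence=0.65)                     # -> "escalate"
```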

One founder test for 2026

If someone claims "AI is transforming our customer's business," ask for one number:

"What percentage of their core workflows run with logged evals, measured incident rates, and defined escalation policies?"

If the answer is fuzzy, it's still a prototype. If it's precise and improving week-over-week, it's a product. If you can't report it, you can't scale it.

The Real Unicorn Founder Ranking (Adjusted for Alumni Cohort)

Most unicorn-founder university rankings are really school-size rankings. A more useful view is “conversion efficiency”: unicorn founders per plausible founder cohort, not per total living alumni.

The denominator problem

Ilya Strebulaev’s published unicorn-founder-by-university counts are a strong numerator, but most people implicitly pair them with the wrong denominator (“living alumni”). “Living alumni” mixes retirees (no longer founding) with very recent grads (not enough time to found and scale), which blurs the signal you actually care about.

Founder timelines make this mismatch obvious: unicorn founders skew toward founding in their 30s (average ~35; median ~33), and reaching unicorn status typically takes years after founding. So if the question is “which universities produce unicorn founders,” the denominator should reflect alumni who realistically had time to do it.

The cohort adjustment

The adjustment is deliberately simple: keep the published founder counts, but replace “living alumni” with a working-age cohort proxy. Practically, that means estimating working-age alumni as roughly graduates from 1980–2015 (today’s ~30–65 year-olds), which aligns with the observed founder life cycle.

This doesn’t claim causality or “best university” status. It just separates ecosystem gravity (absolute founder counts) from conversion efficiency (founders per plausible founding cohort).
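
The adjustment itself is one line of arithmetic. A minimal sketch, with placeholder inputs rather than the published counts:

```python
# Cohort-adjusted conversion: unicorn founders per 100,000 working-age alumni.
# Founder counts and alumni figures here are placeholders, not Strebulaev's data.

def founders_per_100k(unicorn_founders: int, working_age_alumni: int) -> float:
    """Conversion efficiency on a working-age cohort (roughly graduates from 1980-2015)."""
    return 100_000 * unicorn_founders / working_age_alumni

examples = {
    "School A": (120, 115_000),   # (founders, estimated working-age alumni)
    "School B": (70, 200_000),
}
for school, (founders, cohort) in examples.items():
    print(f"{school}: {founders_per_100k(founders, cohort):.0f} founders per 100k")
```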

Cohort-adjusted ranking

Metric: unicorn founders per 100,000 working-age alumni (estimated).

Rank University    Working-age alumni (est.)    Unicorn founders per 100k
1    Stanford    ~115,000    106
2    MIT    ~85,000    102
3    Harvard    ~200,000    36
4    Yale    ~140,000    32
5    Cornell    ~150,000    30
6    Princeton    ~120,000    25
7    UC Berkeley    ~270,000    22
8    Tel Aviv University    ~110,000    15
9    Columbia    ~170,000    14
10    University of Pennsylvania    ~180,000    13
11    University of Waterloo    ~130,000    8

What the cohort lens reveals

Stanford and MIT converge at the top on efficiency (106 vs 102 per 100k), even though Stanford leads on absolute count. Harvard and Berkeley “drop” mainly because they are huge; normalization is doing its job by showing that volume and efficiency are different signals. International technical schools (e.g., Tel Aviv University, Waterloo) remain visible on a per-capita basis even without Silicon Valley’s capital density, which suggests institution-level culture and networks can matter even when geography doesn’t help.

For investors, this is actionable because it cleanly splits two sourcing heuristics: go where the gravity is (absolute counts), and also track where the conversion rate is high (cohort-adjusted efficiency). The dropout myth persists because anecdotes are easier to remember than denominators; the cohort denominator forces the analysis to match how unicorns are actually built over time.


Machine Learning Is Having a Midlife Crisis?

The most successful field in computer science right now is also the most anxious. You can feel it in Reddit threads, conference hallways, and DMs: something about how we do ML research is off. The pace is intoxicating, the progress is real—and yet the people building it are quietly asking, “Is this sustainable? Is this still science?”

That tension is the story: a field that went from scrappy outsider to global infrastructure so fast it never upgraded its operating system. Now the bugs are showing.

When “More Papers” Stops Feeling Like Progress

In theory, more research means more discovery. In practice, we’ve hit the point where conference submission graphs look like someone mis-set the y-axis. Flagship venues are drowning in tens of thousands of papers a year, forcing brutal early rejections and weird hacks to keep the system from collapsing.

From the outside, it looks like abundance. From the inside, it feels like spam. Authors optimize for “accepted somewhere, anywhere” instead of “is this result robust and useful?” Reviewers are buried. Organizers are pushed into warehouse logistics instead of deep curation. The whole thing starts to feel like a metrics game, not a knowledge engine.

When accepted papers with solid scores get dropped because there isn’t enough physical space at the venue, that’s not a “nice problem to have.” That’s a signal the model is mis-specified.

Quality Debt and the Reproducibility Hangover

Meanwhile, a quieter crisis has been compounding: reproducibility. Code not released. Data not shared. Baselines mis-implemented. Benchmarks overfit. Half the field has a story about trying to re-run a “state of the art” paper and giving up after a week.

This isn’t just a paperwork problem. If others can’t reproduce your result:

  • No one knows if your idea generalizes.

  • Downstream work might be building on a mirage.

  • Real-world teams burn time and budget chasing ghosts.

As models move into medicine, finance, and public policy, “it sort of worked on this dataset in our lab” is not a pass. Trust in the science behind ML becomes a hard constraint, not a nice-to-have.

Incentives: Optimizing the Wrong Objective

Zoom out, and a pattern appears: the system is rewarding the wrong things.

  • Novelty over reliability.

  • Benchmarks over messy, real problems.

  • Velocity over understanding.

The fastest way to survive in this game is to slice your work into as many publishable units as possible, push to every major conference, and pray the review lottery hits at least once. Deep, slow, high-risk ideas don’t fit neatly into that cadence.

And then there’s the talent flow. The best people are heavily pulled into industry labs with bigger checks and bigger GPUs. Academia becomes more about paper throughput on limited resources. The result: the people with the most time to think have the least compute, and the people with the most compute are often on product timelines. Misalignment everywhere.

The Field’s Growing Self‑Doubt (That’s Actually Healthy)

Here’s the twist: this wave of self-critique is not a sign ML is dying. It’s a sign the immune system is finally kicking in.

Researchers are openly asking:

  • Are we publishing too much, learning too little?

  • Are our benchmarks telling us anything real?

  • Are we building tools that transfer beyond leaderboards into the world?

When people who benefit from the current system start calling it broken, pay attention. That’s not nihilism; that’s care. It’s a field realizing it grew up faster than its institutions did—and deciding to fix that before an AI winter or an external backlash does it for them.

What a Healthier ML Research Culture Could Look Like

If you strip away the institutional inertia, the fixes aren’t mysterious. They’re the research equivalent of “stop pretending the plan is working; start iterating on the process.”

Some levers worth pulling:

  • Less worship of novelty, more respect for rigor. Make “solid, careful, negative-result-rich” a first-class contribution, not a consolation prize.

  • Mandatory openness. If it can be open-sourced, it should be. Code, data, evaluation scripts. No artifacts, no big claims.

  • Different tracks, different values. Separate venues or tracks for (a) theory, (b) benchmarks, (c) applications. Judge each by the right metric instead of forcing everything through the same novelty filter.

  • Incentives that outlast a deadline. Promotion, funding, and prestige that factor in impact over time, not just conference logos on a CV.

None of this is romantic. It’s plumbing. But if you get the plumbing right, the next decade of ML feels very different: fewer hype cycles, fewer brittle “breakthroughs,” more compounding, reliable progress.

If You’re an ML Researcher, Here’s the Move

You can’t fix the whole ecosystem alone—but you can run a different local policy.

  • Treat your own beliefs like models: version them, stress-test them, deprecate them.

  • Aim for “someone else can reproduce this without emailing me” as a hard requirement, not an aspiration (a minimal sketch of what that looks like follows this list).

  • Choose questions that would matter even if they never hit a top-tier conference.

  • Remember that “I don’t know yet” and “we couldn’t replicate it” are signs of seriousness, not weakness.
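
A minimal sketch of the “reproduce without emailing me” bar: pin the randomness and record the environment alongside every run. The filenames and fields are illustrative; extend the same idea to data versions and configs.

```python
import json
import platform
import random
import sys

import numpy as np

SEED = 1234

def set_seeds(seed: int = SEED) -> None:
    """Pin the sources of randomness used in the experiment."""
    random.seed(seed)
    np.random.seed(seed)
    # torch.manual_seed(seed)  # add framework-specific seeding as needed

def environment_snapshot() -> dict:
    """Record enough of the environment that a rerun can be matched to this run."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "numpy": np.__version__,
        "seed": SEED,
    }

if __name__ == "__main__":
    set_seeds()
    with open("run_manifest.json", "w") as f:
        json.dump(environment_snapshot(), f, indent=2)
```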

Machine learning isn’t in crisis because it’s failing. It’s in crisis because it’s succeeding faster than its institutions can adapt. The people who will matter most in the next decade aren’t the ones who ride this wave blindly—they’re the ones who help the field course-correct in public, with less ego and more evidence.


World Models: The $100T AI Bet Founders Must Make Now

World models are quietly transforming AI from text predictors into systems that understand and simulate the real world. Unlike large language models (LLMs) that predict the next word, world models build internal representations of how environments evolve over time and how actions change states. This leap from language to spatial intelligence promises to unlock AI capable of perceiving, reasoning, and interacting with complex 3D spaces.

Fei-Fei Li calls world models "the next frontier of AI," emphasizing spatial intelligence as essential for machines to see and act in the world. Yann LeCun echoes this urgency, arguing that learning accurate world models is key to human-level AI. His approach highlights the need for self-supervised learning architectures that predict world states in compressed representations rather than raw pixels, optimizing efficiency and generalization.

Leading efforts diverge into three camps. OpenAI’s Sora uses video generation transformers to simulate physical environments, showing emergent long-range coherence and object permanence, crucial for world simulation. Meta’s Joint Embedding Predictive Architecture (V-JEPA) models latent representations of videos and robotic interactions to reduce computational waste and improve reasoning. Fei-Fei Li’s World Labs blends multimodal inputs into spatially consistent, editable 3D worlds via Marble, targeting interactive virtual environment generation.
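
To make "predict world states in compressed representations rather than raw pixels" concrete, here is a toy sketch of JEPA-style latent prediction. The shapes are made up, plain MLPs stand in for real video encoders, and this is not Meta's actual V-JEPA code.

```python
# Toy latent-prediction sketch: predict the embedding of a future view,
# and compute the loss in embedding space rather than pixel space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, in_dim=3 * 32 * 32, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, 256),
                                 nn.ReLU(), nn.Linear(256, emb_dim))
    def forward(self, x):
        return self.net(x)

class Predictor(nn.Module):
    """Predicts the embedding of the target view from the context embedding."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))
    def forward(self, z):
        return self.net(z)

encoder, predictor = Encoder(), Predictor()
target_encoder = Encoder()                        # in practice an EMA copy of the encoder
target_encoder.load_state_dict(encoder.state_dict())

context = torch.randn(8, 3, 32, 32)               # e.g. current frames
target = torch.randn(8, 3, 32, 32)                # e.g. future frames

with torch.no_grad():                             # targets carry no gradients
    z_target = target_encoder(target)

z_pred = predictor(encoder(context))
loss = F.mse_loss(z_pred, z_target)               # compare embeddings, not pixels
loss.backward()
```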

The commercial potential looks enormous. Over $2 billion was invested across 15+ world model startups in 2024, with estimates valuing the full market north of $100 trillion if AI masters physical intelligence. Robotics leads near-term value: enabling robots to safely navigate unstructured environments requires world models to predict object interactions and plan multi-step tasks. NVIDIA’s Cosmos infrastructure accelerates physical AI training with synthetic photorealistic data, while companies like Skild AI have raised billions by building massive robotic interaction datasets.

Autonomous vehicles also tap world models to simulate traffic and rare scenarios at scale, cutting down expensive on-road tests and improving safety. Companies like Wayve and Waabi leverage virtual worlds for pre-labeling and scenario generation, critical to achieving full autonomy. Meanwhile, the gaming and entertainment sector is the most mature commercial playground, with startups using world models to generate dynamic game worlds and personalized content that attract millions of users almost overnight.

Specialized industrial applications—engineering simulations, healthcare, city planning—show clear revenue pathways with fewer competitors. PhysicsX’s quantum leap in simulation speed exemplifies how tailored world models can revolutionize verticals where traditional methods falter. Healthcare and urban planning stand to gain precision interventions and predictive modeling unparalleled by current AI.

The funding landscape reveals the importance of founder pedigree and scale. Fei-Fei Li’s World Labs hit unicorn status swiftly with $230 million raised, Luma AI secured a $900 million Series C for supercluster-scale training, and Skild AI amassed over $1.5 billion with a focus on robotics. NVIDIA, while a supplier, remains a kingmaker, providing hardware, software, and foundational models as a platform layer—both opportunity and competition for startups.

Crucially, despite staggering investment, gaps abound—technical, commercial, and strategic. Training world models requires vast, complex multimodal datasets rarely available openly, creating defensive moats for data-rich startups. Models still struggle with physics accuracy, generalization to novel scenarios, and real-time performance needed for robotics or autonomous vehicles. Startups innovating around efficiency, transfer learning, sim-to-real gaps, and safety validation have outsized opportunities.

On the market front, vertical-specific solutions in healthcare, logistics, and defense are underserved, offering fertile ground for founders with domain expertise. Productizing world models requires bridging the gap from lab prototypes to robust, scalable deployments, including integration tooling and certification for safety-critical applications. Startups enabling high-fidelity synthetic data generation are becoming ecosystem enablers.

Strategically, founders must navigate open research—like Meta’s V-JEPA—and proprietary plays exemplified by World Labs. Standardization and interoperability remain open questions critical for ecosystem growth. Handling rare edge cases and ensuring reliable sim-to-real transfer are gating factors for robotic and autonomous systems.

For investors, the thesis is clear but nuanced. Robotics world models, vertical AI for high-value industries, infrastructure and tooling layers, and gaming are high-conviction bets offering manageable risk and clear pathways to market. Foundational model companies with massive compute and data moats present risky but lucrative opportunities, demanding large capital and specialized talent. Efficiency, differentiated data, and agile product-market fit matter more than raw scale alone.

The next 24 months will crystallize market winners as world models shift from research curiosity to mission-critical AI infrastructure. Founders displaying relentless adaptability, technical depth, and deep domain insight will lead the charge. Investors who balance bets across foundation layers and vertical applications, while embracing geographic and stage diversity, stand to capture disproportionate value.

While the industry watches language models, the less flashy but more profound revolution is unfolding quietly in world models—systems that don’t just process language but build a mental map of reality itself. These systems will define the next era of AI, shaping how machines perceive, interact, and augment the physical world for decades.

That’s the state of play. The winners will be those who combine technical innovation with pragmatic business sense, and above all, a ruthlessly adaptive mindset to pivot rapidly as the frontier evolves.