
Why AI Diagnostics Need More Than a Validation Study

January 30, 2026 Dr. Priya Sharma
[Image: AI diagnostic algorithm output displayed on hospital workstation]

We've looked at more than 60 AI diagnostics companies in the past three years. Almost all of them have validation studies. A lot of those studies show genuinely impressive numbers — AUC above 0.90, sensitivity and specificity that outperform the historical standard of care, prospective cohorts with enough size to be statistically convincing. And still, most of those companies are stuck. They can't get health systems to deploy, can't close payor contracts, can't move from pilot to paid. The validation study isn't the problem. The validation study is what they have instead of an answer to the real question.

The real question is: does using this change what happens to patients? Not does the algorithm detect the condition accurately — but does the detection lead to a clinical action, and does that action improve the outcome? Those are three different things, and AI diagnostics companies routinely have evidence on the first while having almost none on the other two.

What the algorithm metric actually measures

AUC, sensitivity, and specificity measure the algorithm's ability to classify. The model is trained on one dataset, and the metrics are computed on a held-out dataset. When the held-out dataset is representative of the population you'll deploy in, these numbers are meaningful. Often it isn't representative, and the metrics degrade in production. That's a well-documented problem in the literature: publication bias toward high-performing models, lack of external validation, dataset shift.
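To make the three metrics concrete, here is a minimal Python sketch that computes them from true labels and model scores. The function names, the toy labels and scores, and the 0.5 threshold are illustrative assumptions, not data from any company we've reviewed.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    labels: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Sensitivity and specificity when scores >= threshold are called positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive case outscores a
    random negative case (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy held-out set (hypothetical): 1 = condition present, 0 = absent.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1]

sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} auc={auc(labels, scores):.2f}")
```

Sensitivity and specificity depend on the threshold you choose, while AUC summarizes performance across all thresholds. That is one reason a single headline AUC says little about how the tool will behave at the operating point a hospital actually deploys.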

But even setting that aside, classification performance is the easiest bar to clear. A diagnostic algorithm that perfectly identifies a condition is still clinically worthless if: the condition was already being caught by other means; the finding arrives at a point in the workflow where no action can be taken; clinicians don't trust the output and override it; or the condition requires a treatment that isn't available or accessible for the identified patients.

We've seen all four of these, across real companies with strong academic papers behind them.

The clinical integration question

Health systems evaluate AI diagnostics with a fundamentally different framework from the one most founders present. The clinical informaticist, the CMIO, the department chair who's being asked to adopt the tool — they're thinking about workflow. Where does the output appear? Who sees it, at what point in the care pathway? What action is the clinician supposed to take? How does that action get documented? What happens when the algorithm fires incorrectly?

These aren't objections to be managed. They're the real product questions. And the companies that are getting deployed are the ones that did this work before they walked into the room. They have implementation documentation, workflow diagrams, EHR integration specifications. They can describe exactly where their alert appears in Epic or Cerner, what the associated order set looks like, and how false positive rates were tuned for the specific clinical context.

Tuning for clinical context is important. A sepsis early warning system deployed in a 20-bed community ICU needs to be calibrated differently than the same system in a 500-bed academic medical center. The baseline population is different, the staffing ratios are different, the alert fatigue dynamics are different. A single validation study conducted at one institution doesn't cover this. The companies that know this and have done multi-site validation in diverse settings are a different category from those that haven't.
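One way to see why a single-site number doesn't transfer: at fixed sensitivity and specificity, the share of alerts that are true positives depends on how common the condition is in the monitored population. The sketch below applies Bayes' rule. The sensitivity, specificity, and prevalence values are hypothetical, chosen only to illustrate the effect, and the site names are placeholders rather than real institutions.

```python
def positive_predictive_value(
    sensitivity: float, specificity: float, prevalence: float
) -> float:
    """Fraction of positive alerts that are true positives (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def false_alerts_per_true_alert(ppv: float) -> float:
    """How many false alerts a clinician sees for each true alert."""
    return (1.0 - ppv) / ppv


# Hypothetical operating point and site prevalences (assumptions).
SENSITIVITY = 0.85
SPECIFICITY = 0.90
site_prevalence = {"site_low_prevalence": 0.02, "site_high_prevalence": 0.10}

for site, prevalence in site_prevalence.items():
    ppv = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
    ratio = false_alerts_per_true_alert(ppv)
    print(f"{site}: PPV={ppv:.2f}, false alerts per true alert={ratio:.1f}")
```

With these made-up inputs, the same model produces roughly six false alerts per true alert at 2% prevalence and about one at 10%. Nothing about the algorithm changed, only the population, which is why alert thresholds usually need to be revisited site by site.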

What payors want to see

Coverage decisions for AI diagnostics — particularly AI-powered SaMD (Software as a Medical Device) — increasingly require evidence of clinical utility, not just analytical performance. The AMA's CPT Editorial Panel has created Category I and III codes for AI-enabled diagnostics, but coverage under those codes by commercial payors depends on published clinical utility evidence.

Clinical utility means: does the test result change clinical management, and does the change in management improve outcomes? This requires prospective data collection, usually in a health system pilot, where you track decision changes and downstream outcomes. It takes 18-36 months to generate compelling clinical utility data. Most companies starting a Series A conversation haven't started this process. That's not a dealbreaker for us, but it becomes a question about roadmap: when does clinical utility data collection start, who's paying for it, and what's the go-to-market before it exists?

The evidentiary bar for AI diagnostics isn't moving toward less rigor. If anything, it's moving toward more — faster than most early-stage companies are prepared for.

What we look for instead of just AUC

When we diligence an AI diagnostics company, we look for several things that go beyond the model performance metrics:

  • External validation on a population not used in training, ideally at multiple sites
  • Evidence of workflow integration design — not a mockup, an actual EHR integration pilot
  • Clinician interviews that confirm the output is actionable in the target workflow
  • Prospective outcome data collection design, even if data is early
  • A realistic regulatory pathway — Software as a Medical Device requirements under FDA's Digital Health framework
  • A commercial strategy that doesn't rely entirely on reimbursement that doesn't yet exist

None of these are unreasonable asks. The companies that have them — even partially — are the ones that have taken clinical deployment seriously. They've usually had a hospital partner involved early, not as a reference but as a genuine implementation site. That partnership is often the most valuable thing on the cap table at the early stage.

The validation study tells us the algorithm works in a controlled setting. That's necessary. It's just nowhere near sufficient for what comes next.