Building a clinical validation harness for ML in healthcare

A model that hits 94% accuracy on a public benchmark is not the same as a model that hits 94% accuracy in your hospital. The benchmark probably came from one institution, one demographic, one EHR system. Your hospital is none of those things.

We've shipped ML into clinical workflows often enough to have learned the same lesson the hard way more than once: the model itself is maybe 30% of the work. The validation harness (the infrastructure that tells you whether the model is doing what you think it's doing, on your data, with your patients) is the other 70%. Here's what we wish every team had on day one.

What a validation harness actually is

A validation harness is the always-on infrastructure that answers four questions, on a continuing basis:

Is the model performing now the way it performed when we shipped it?
Is it performing equally well across the populations we serve, or only well on the dominant one?
When clinicians disagree with the model, what are they disagreeing about, and is that a signal of drift or of edge cases?
If something changes in the upstream data, will we know before the clinicians do?

Notice these aren't "is the model accurate?" questions. Accuracy in aggregate is a comforting number that hides almost everything that goes wrong in production.

The four layers

We structure validation harnesses in four layers. Each layer answers a different category of question, and each one is non-optional for clinical ML.

Layer 1: data validation. Before any input reaches the model, schema and distribution checks run. If you're expecting age in years but your EHR starts sending months for pediatric patients, the data validation layer catches it. If a vendor changes a lab unit from mg/dL to mmol/L, you find out at the data layer, not in the model output.
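A minimal sketch of what a Layer 1 check might look like, assuming each input field arrives as a value with a unit label. The field names, the `RULES` table, and the ranges are illustrative assumptions, not clinically reviewed limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRule:
    """Expected unit and plausible range for one input field."""
    unit: str
    low: float
    high: float


# Hypothetical rules for illustration; real ranges should come from clinical review.
RULES = {
    "age": FieldRule(unit="years", low=0, high=120),
    "glucose": FieldRule(unit="mg/dL", low=20, high=1000),
}


def validate_record(record: dict) -> list[str]:
    """Return human-readable problems; an empty list means the record passed."""
    problems = []
    for name, rule in RULES.items():
        field = record.get(name)
        if field is None:
            problems.append(f"{name}: missing")
            continue
        value, unit = field["value"], field["unit"]
        if unit != rule.unit:
            problems.append(f"{name}: unit {unit!r}, expected {rule.unit!r}")
        elif not (rule.low <= value <= rule.high):
            problems.append(f"{name}: value {value} outside [{rule.low}, {rule.high}]")
    return problems
```

The range check also acts as a backstop when a unit changes but the label does not: a glucose of 6.1 still labeled mg/dL falls outside the plausible range and gets flagged.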

Layer 2: prediction monitoring. Every prediction is logged with: model version, timestamp, input features (or a hash of them, if they contain PHI), the prediction and its confidence, downstream action taken (accepted, modified, or rejected by the clinician), and outcome where available. This is the foundation everything else builds on.
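One way the log record could be shaped, as a sketch. The names `PredictionRecord`, `hash_features`, and `make_record` are hypothetical, and the plain SHA-256 is an assumption: for low-entropy inputs a keyed hash (HMAC) may be the safer choice, since an unsalted hash of a small input space can be brute-forced.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


def hash_features(features: dict) -> str:
    """Stable SHA-256 of the features; key order does not change the hash."""
    canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class PredictionRecord:
    model_version: str
    input_hash: str
    prediction: float
    confidence: float
    logged_at: str                           # ISO 8601 timestamp, UTC
    clinician_action: Optional[str] = None   # "accepted", "modified", or "rejected"
    outcome: Optional[str] = None            # filled in later, if it ever arrives


def make_record(model_version: str, features: dict,
                prediction: float, confidence: float) -> PredictionRecord:
    """Build the record at prediction time; action and outcome are added later."""
    return PredictionRecord(
        model_version=model_version,
        input_hash=hash_features(features),
        prediction=prediction,
        confidence=confidence,
        logged_at=datetime.now(timezone.utc).isoformat(),
    )
```

Leaving `clinician_action` and `outcome` empty at write time reflects the fact that they arrive on different clocks than the prediction itself; `asdict(record)` gives a row ready for whatever store you use.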

Layer 3: subgroup performance tracking. Performance across the demographics that matter to your patient population. If your model is 94% accurate overall but 76% accurate for one age group, you need to know — and the only way to know is to measure it explicitly, ongoing, with alerts when subgroups diverge.

Layer 4: drift detection. Two flavors. Input drift (the distribution of features changing — new disease prevalence, new vendor systems, new patient demographics). Output drift (the distribution of predictions changing even when inputs look stable). Both can be early signals that something is wrong before the outcomes deteriorate.
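As an illustration of input drift detection, here is a sketch using the Population Stability Index (PSI). PSI is one common choice that we are substituting for illustration, not a method this article prescribes. The sketch assumes a single numeric feature, a non-empty baseline sample, and equal-width bins taken from the baseline range. The same function can be pointed at model output scores to watch for output drift.

```python
import math


def psi(baseline: list[float], current: list[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range are clamped into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

Identical distributions give a PSI of zero, and the value grows as the current sample moves away from the baseline. Where to set the alert threshold is a tuning decision that depends on the feature and on how much noise your data normally carries.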

The hardest part: ground truth, lag, and confounding

Most ML platforms outside healthcare can compute model accuracy in near-real-time. In healthcare, ground truth often arrives weeks or months after the prediction — and sometimes never.

A sepsis risk model predicts at hour 0 that a patient is high-risk. Ground truth (did they develop sepsis? did the intervention prevent it?) might not be definitively answerable for days or weeks. And the intervention itself confounds the outcome: if the model predicted high risk and clinicians acted, you can't observe what would have happened without the intervention, so a patient who never develops sepsis may be either a false positive or a successful save.

Practical implication: the validation harness has to track three time horizons separately. Immediate (data validation, prediction logging — minutes). Short-term (clinician agreement, override rates — hours/days). Long-term (clinical outcome correlation — weeks/months). Each horizon has its own dashboard and its own alerting policy.

Subgroup performance — the part teams skip

Every team we've worked with had an aggregate accuracy number. Only about half tracked subgroup performance at all. Almost none tracked it in production with alerts.

The minimum we recommend: track at least age cohort, sex, and race/ethnicity where available. If the model is for a specific condition, also track by relevant clinical subgroups (e.g., diabetic vs. non-diabetic for cardiovascular risk models). For each subgroup, compute the same metrics you track in aggregate, and alert when any subgroup's metric drops more than a threshold you fix in advance (X%) below the aggregate.
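The subgroup alert described above can be sketched roughly as follows. Accuracy stands in for whichever metrics you track, and `max_gap` plays the role of the X threshold. Its default of 0.05, the `min_n` cutoff, and the row layout (a dict with a boolean `correct` field plus subgroup attributes) are all assumptions made for the example.

```python
from collections import defaultdict


def subgroup_alerts(rows, group_key, max_gap=0.05, min_n=30):
    """Flag subgroups whose accuracy trails the aggregate by more than max_gap.

    rows: dicts with a boolean 'correct' field plus subgroup attributes.
    Returns (overall_accuracy, [(group, accuracy, n), ...]).
    """
    if not rows:
        raise ValueError("no rows to evaluate")

    totals = defaultdict(lambda: [0, 0])  # group -> [n_correct, n_total]
    for row in rows:
        t = totals[row.get(group_key, "unknown")]
        t[0] += int(row["correct"])
        t[1] += 1

    n_all = sum(n for _, n in totals.values())
    overall = sum(c for c, _ in totals.values()) / n_all

    alerts = []
    for group, (c, n) in totals.items():
        if n < min_n:
            continue  # too few cases for a stable estimate; review these separately
        acc = c / n
        if overall - acc > max_gap:
            alerts.append((group, round(acc, 3), n))
    return overall, alerts
```

Small subgroups are skipped rather than alerted on because their metrics swing widely on a handful of cases. They still deserve a look, just not an automated page.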

Why it matters: a model that's 92% accurate overall but 78% accurate for one ethnic group, on a population where that group is 15% of patients, will produce systematically worse care for those patients — and you won't see it in the aggregate number.
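Working through those numbers shows how much the aggregate hides. If the subgroup is 15% of patients at 78% accuracy and the overall figure is 92%, the accuracy for everyone else, written here as $a_{\text{rest}}$ (our notation), follows from the weighted average:

```latex
\begin{align*}
0.15 \times 0.78 + 0.85 \times a_{\text{rest}} &= 0.92 \\
a_{\text{rest}} &= \frac{0.92 - 0.117}{0.85} = \frac{0.803}{0.85} \approx 0.945
\end{align*}
```

So the remaining 85% of patients see roughly 94.5% accuracy, and the subgroup's error rate (22%) is about four times theirs (about 5.5%), while the headline number still reads 92%.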

The clinician disagreement signal

When a clinician overrides the model, that's a data point — possibly the most valuable one. We instrument every override with: the prediction, the override reason (from a structured dropdown plus optional free text), the resulting decision, and the eventual outcome.

Useful patterns emerge:

Override rate trending up over weeks → likely drift. Investigate the data layer.
Override rate spiking for one clinician → probably a calibration or trust issue. Talk to them.
Override rate spiking for one shift/unit → workflow problem, not model problem.
Override rate spiking after a specific date → look at deploys, EHR upgrades, or data source changes around that date.
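The first pattern, a rate trending up over weeks, could be checked with something like the sketch below. It assumes override events are available as (date, overridden) pairs and that the weeks are consecutive. The function names are hypothetical, and a least-squares slope is just one simple way to quantify a trend. Running the same two functions per clinician, per unit, or per shift covers the other patterns.

```python
from collections import defaultdict
from datetime import date
from typing import Iterable


def weekly_override_rates(events: Iterable[tuple[date, bool]]):
    """Group events by ISO week; return [(iso_year, iso_week, override_rate)] in time order."""
    buckets = defaultdict(lambda: [0, 0])  # (year, week) -> [overrides, total]
    for day, overridden in events:
        iso = day.isocalendar()
        b = buckets[(iso[0], iso[1])]
        b[0] += int(overridden)
        b[1] += 1
    return [(y, w, o / n) for (y, w), (o, n) in sorted(buckets.items())]


def trend_slope(rates: list[float]) -> float:
    """Least-squares slope of rate against week index (change in rate per week)."""
    n = len(rates)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(rates) / n
    num = sum((i - mean_x) * (r - mean_y) for i, r in enumerate(rates))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den
```

A positive slope sustained over several weeks is the cue to go look at the data layer. A single noisy week is not.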

The override is not a failure of the model. It's the data point that tells you whether the model is converging with clinical judgment over time, or diverging.

What deployment looks like, end-to-end

A typical clinical ML deployment we'd recommend:

Shadow mode (2-4 weeks). Model runs on every relevant case, but predictions aren't shown to clinicians. You measure accuracy, subgroup performance, and a proxy for override rate: how often a small panel of expert reviewers disagrees with the model.
Pilot deployment (4-8 weeks). Model predictions visible to a small clinical group, with structured override capture. Targets: subgroup parity, override rate stable, no novel failure modes.
Phased rollout (4-12 weeks). Expand by specialty, location, or service line. Each phase has explicit go/no-go criteria.
Continuous monitoring (ongoing). All four validation layers running, with alerting tuned to clinical leadership escalation paths.
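Explicit go/no-go criteria are easier to hold to when they are written down as code before the phase starts. The sketch below encodes the three pilot targets named above (subgroup parity, stable override rate, no novel failure modes). The class names and every threshold are placeholder assumptions to be replaced with values agreed with clinical leadership.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseMetrics:
    worst_subgroup_gap: float    # aggregate metric minus the worst subgroup's metric
    override_rate_slope: float   # change in override rate per week
    novel_failure_modes: int     # count found during case review


@dataclass(frozen=True)
class GateCriteria:
    max_subgroup_gap: float = 0.05      # placeholder threshold
    max_override_slope: float = 0.01    # placeholder threshold
    max_novel_failure_modes: int = 0


def go_no_go(m: PhaseMetrics, c: GateCriteria = GateCriteria()) -> tuple[bool, list[str]]:
    """Return (go, reasons); go is True only when no criterion is violated."""
    reasons = []
    if m.worst_subgroup_gap > c.max_subgroup_gap:
        reasons.append(f"subgroup gap {m.worst_subgroup_gap:.3f} exceeds {c.max_subgroup_gap:.3f}")
    if abs(m.override_rate_slope) > c.max_override_slope:
        reasons.append(f"override rate not stable (slope {m.override_rate_slope:+.3f} per week)")
    if m.novel_failure_modes > c.max_novel_failure_modes:
        reasons.append(f"{m.novel_failure_modes} novel failure mode(s) found")
    return (not reasons, reasons)
```

Returning the reasons alongside the decision keeps the gate auditable: a no-go comes with the specific criterion that failed.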

Skip the shadow phase and you ship blind. Skip the pilot and you discover failure modes at scale instead of in a small group.

Common mistakes

Tracking aggregate accuracy only. Hides subgroup divergence, which is where harm happens.
No ground truth tracking. You think you're measuring accuracy, but without outcomes you're only comparing predictions with other predictions, not with what actually happened to the patient.
Override rate logged but not analyzed. The signal is there, but no one is watching it.
Model version not logged with predictions. When you deploy a new version, you can't tell which predictions came from which model.
Treating drift as an alarm rather than a signal. Drift detection should trigger investigation, not panic — most drift has a benign explanation.

What to build first

If you're starting from zero and can only build one thing: log every prediction with model version, input hash, confidence, clinician decision, and timestamp. That single table is the foundation for everything else. You can add subgroup tracking, drift detection, and outcome correlation on top — but you can't reconstruct the data after the fact.

If you can build two: add subgroup performance tracking, with alerts.

If you can build three: add structured clinician override capture.

Everything else can wait. Most teams ship models without these three, and then learn the lesson the expensive way.
