Machine Learning, Explainability, ML Engineering, Calibration, Uncertainty Quantification, Research

Building Trustworthiness into Medical Vision Systems

I recently finished Trustworthy Medical Vision, a project that took me deeper into a question I've been thinking about for years: What does it actually mean to trust an ML model's decision in a high-stakes domain?

The honest version: This project started because I was frustrated. I'd seen papers report 96% accuracy on medical datasets and call it a success. I'd watched conference presentations celebrate AUROC improvements of 0.5 percentage points as if the problem was solved. But something felt off. A doctor I was working with once asked me a simple question—"Okay, but when should I actually use this?"—and I realized I didn't have a good answer. Not because I hadn't trained a good model. Because I'd never actually asked the question seriously.

So I decided to. Not just to improve a number, but to build a framework where trustworthiness is measurable, auditable, and real.

Why medical imaging

Medical imaging is where theory meets reality with real stakes. That's what I needed to build something meaningful.

The gaps in standard ML become impossible to ignore here. Confidence isn't correctness: models can be catastrophically wrong while reporting high confidence. Failures cluster on patterns that aggregate metrics miss. Explanations can look convincing while being fundamentally brittle. And not all errors are equal: an evaluation has to separate genuinely ambiguous cases from systematic blindness.

Addressing these gaps rigorously requires moving beyond single metrics. Calibration, uncertainty, explainability, and failure audits have to be evaluated together.

How I built it

I designed an 8-stage pipeline where each stage answers one question about trustworthiness:

  1. Baseline: DenseNet-121 reached 0.9576 AUROC. Good—but AUROC isn't the story.
  2. Diversity: ResNet-50 achieved 0.9713 AUROC with different calibration properties. Model disagreement became valuable.
  3. Calibration: Did 87% confidence mean 87% accuracy? No. The models were overconfident, which is common for deep networks trained without explicit calibration.
  4. Uncertainty: MC Dropout revealed which predictions reflected genuine uncertainty and which were confident mistakes.
  5. Explanations: Grad-CAM showed what the model attended to, but validating those explanations led to...
  6. Explanation stability: Perturbing inputs slightly caused explanations to shift dramatically—even when predictions didn't. Unstable explanations preceded errors.
  7. Failure taxonomy: Rather than counting errors, I categorized them: clinical ambiguity, annotation noise, model blindness, spurious features. Each type required different handling.
  8. Ensemble + deferral: Combined models with learned abstention. Result: at uncertainty threshold 0.3, failure cost dropped 36.63%—fewer wrong predictions reaching doctors, more ambiguous cases flagged for review.
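
The calibration question in stage 3 is usually answered with expected calibration error (ECE): group predictions by confidence, then compare each group's average confidence with its actual accuracy. The sketch below is a minimal pure-Python version for a binary classifier. The function name, the equal-width binning, and the 0.5 decision threshold are illustrative assumptions, not the project's code.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for a binary classifier (illustrative sketch).

    probs  -- predicted probability of the positive class, one per case
    labels -- ground-truth labels (0 or 1)
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal length")

    # Per-bin accumulators: case count, summed confidence, number correct.
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    correct = [0] * n_bins

    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p        # confidence in the predicted class
        b = min(int(conf * n_bins), n_bins - 1)   # conf == 1.0 falls in the last bin
        counts[b] += 1
        conf_sums[b] += conf
        correct[b] += int(pred == y)

    n = len(probs)
    ece = 0.0
    for c, s, k in zip(counts, conf_sums, correct):
        if c:
            # Weight each bin's |accuracy - mean confidence| gap by its share of cases.
            ece += (c / n) * abs(k / c - s / c)
    return ece


# Ten predictions at 90% confidence, only half of them right: clearly overconfident.
print(expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5))
```

A well-calibrated model drives this value toward zero. A large value at high accuracy is the "87% confidence does not mean 87% accuracy" situation described in stage 3.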

Each stage built on the previous, producing metrics, visualizations, and case studies. By the end, I had a framework I could defend.
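
To make stage 4 concrete: MC Dropout keeps dropout active at inference and runs several stochastic forward passes per image. One common way to summarize those passes, sketched below, is the entropy of the mean prediction (total uncertainty) and the mutual information (how much the passes disagree with each other). The sketch assumes the positive-class probabilities from the passes have already been collected. The function names and the choice of these two summaries are assumptions, not necessarily what the project computes.

```python
import math


def binary_entropy(p):
    """Entropy of a Bernoulli(p) in bits, so the result lies in [0, 1]."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def summarize_mc_samples(samples):
    """samples: positive-class probabilities from T stochastic forward passes."""
    if not samples:
        raise ValueError("need at least one forward pass")
    t = len(samples)
    mean_p = sum(samples) / t
    total = binary_entropy(mean_p)                           # entropy of the averaged prediction
    expected = sum(binary_entropy(p) for p in samples) / t   # average entropy of single passes
    return {
        "mean_prob": mean_p,
        "predictive_entropy": total,
        # High when individual passes are confident but contradict each other.
        "mutual_information": max(0.0, total - expected),
    }


# Passes that agree on "unsure" vs. passes that confidently contradict each other.
print(summarize_mc_samples([0.5, 0.5, 0.5]))
print(summarize_mc_samples([0.99, 0.01]))
```

Under this reading, a confident mistake is a wrong prediction with low entropy on every pass, which is why uncertainty alone cannot catch it and the later audit stages are still needed.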

What surprised me

Explanation stability: Mild perturbations (±5° rotation, brightness shifts) caused explanations to shift dramatically even when predictions stayed the same. This wasn't a bug—unstable explanations often preceded failures. Robustness is about the justification, not just the prediction.
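
One way to put a number on this is to compare the most salient region of the original heatmap with the most salient region of each perturbed heatmap. The sketch below uses intersection-over-union of the top 20% of pixels. It assumes heatmaps are plain 2D lists of saliency values and that geometric perturbations such as rotation have already been undone on the heatmap so pixels line up. The metric, the 20% cutoff, and all names are illustrative assumptions, not the project's implementation.

```python
def top_k_mask(heatmap, fraction=0.2):
    """Return the set of (row, col) positions holding the top `fraction` of values."""
    cells = [(v, r, c) for r, row in enumerate(heatmap) for c, v in enumerate(row)]
    if not cells:
        return set()
    k = max(1, int(len(cells) * fraction))
    cells.sort(key=lambda item: item[0], reverse=True)
    return {(r, c) for _, r, c in cells[:k]}


def explanation_iou(heatmap_a, heatmap_b, fraction=0.2):
    """Overlap of the most salient regions: 1.0 = identical, 0.0 = disjoint."""
    a = top_k_mask(heatmap_a, fraction)
    b = top_k_mask(heatmap_b, fraction)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def stability_score(original, perturbed_heatmaps, fraction=0.2):
    """Mean IoU between the original heatmap and each perturbed variant."""
    scores = [explanation_iou(original, h, fraction) for h in perturbed_heatmaps]
    return sum(scores) / len(scores)


original = [[0.9, 0.1], [0.2, 0.0]]
same_focus = [[0.8, 0.2], [0.1, 0.0]]   # still attends to the top-left cell
moved_focus = [[0.0, 0.1], [0.2, 0.9]]  # attention jumped to the bottom-right cell
print(stability_score(original, [same_focus, moved_focus], fraction=0.25))
```

A low score on a case whose prediction did not change is the pattern described above: the answer held, but the justification did not.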

Model disagreement as information: DenseNet and ResNet disagreed on 32 of 624 cases (about 5%). Rather than noise, I found they had complementary strengths: ResNet ranked better, DenseNet calibrated better. Leaning into disagreement, not averaging it out, enabled the ensemble design and the 36.63% reduction in failure cost.
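
Finding the disagreement set takes only a few lines once both models' probabilities are available. The sketch below assumes each model outputs a positive-class probability per case and that both use a 0.5 decision threshold. The function name and threshold are assumptions for illustration.

```python
def disagreement_indices(probs_a, probs_b, threshold=0.5):
    """Indices of cases where the two models land on different sides of the threshold."""
    if len(probs_a) != len(probs_b):
        raise ValueError("both models must score the same cases")
    return [
        i
        for i, (a, b) in enumerate(zip(probs_a, probs_b))
        if (a >= threshold) != (b >= threshold)
    ]


# Toy scores for four cases from two hypothetical models.
model_a_probs = [0.92, 0.48, 0.10, 0.65]
model_b_probs = [0.88, 0.61, 0.07, 0.40]
flagged = disagreement_indices(model_a_probs, model_b_probs)
print(flagged, len(flagged) / len(model_a_probs))

# The rate reported above: 32 disagreements out of 624 cases.
print(round(32 / 624 * 100, 1), "percent")
```

The flagged indices are the natural starting point for a manual review or for routing those cases to a deferral policy.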

Deferral is a design lever: I started out aiming for a perfect classifier. I ended up aiming for a system that knows when to ask for help. Abstention thresholds become tunable parameters: at uncertainty 0.3 you defer 18% of cases, at 0.5 you defer 35%. Responsible AI isn't perfection; it's honesty about uncertainty.
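
The trade-off behind that lever can be sketched as a threshold sweep. The post does not spell out the deferral rule. Since the reported deferral rate rises with the threshold, this sketch assumes a case is deferred when its confidence margin |2p - 1| falls below the threshold. That rule, the cost weights (a wrong prediction costs 1.0, a deferral costs 0.2), and the function name are assumptions, not the project's implementation.

```python
def evaluate_deferral(probs, labels, threshold, error_cost=1.0, defer_cost=0.2):
    """Deferral rate, error rate on kept cases, and average cost per case."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal length")
    deferred = 0
    errors_kept = 0
    for p, y in zip(probs, labels):
        margin = abs(2.0 * p - 1.0)       # 0 = coin flip, 1 = fully confident
        if margin < threshold:
            deferred += 1                 # routed to a human reviewer
        elif (p >= 0.5) != bool(y):
            errors_kept += 1              # wrong prediction that reaches the doctor
    n = len(probs)
    kept = n - deferred
    return {
        "deferral_rate": deferred / n,
        "errors_kept": errors_kept,
        "kept_error_rate": errors_kept / kept if kept else 0.0,
        "cost": (error_cost * errors_kept + defer_cost * deferred) / n,
    }


# Toy data: sweep the threshold and watch deferrals trade off against kept errors.
probs = [0.95, 0.05, 0.60, 0.45, 0.90]
labels = [1, 0, 0, 1, 0]
for t in (0.0, 0.3, 0.5):
    print(t, evaluate_deferral(probs, labels, t))
```

The useful threshold depends entirely on the cost ratio: the more expensive a missed error is relative to a human review, the more deferral the system can justify.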

Why this matters

Most ML projects treat reliability as optional. This inverts that: reliability is the model. Every decision—architecture, deferral threshold, metric choice—was guided by: does this improve trustworthiness?

What this demonstrates:

Disciplined evaluation: Not one metric, but many used together as design constraints. AUROC for ranking, calibration for honesty, uncertainty for doubt, explanations for reasoning, audit for error patterns. This is mature engineering practice.

Responsible AI in practice: Not statements about ethics, but concrete implementation—failure categorization, abstention policies, limitations documented, explanations validated. Auditable and understandable systems.

ML systems thinking: Models are one component. Reliability emerges from calibration + uncertainty + ensemble diversity + human judgment. This difference—optimizing a model versus architecting a system—separates one-off work from production-grade infrastructure.

Why healthcare needs this: Accuracy claims are cheap; rigorous multi-dimensional evaluation is rare. Doctors need systems they can understand and override. That requires upstream work (calibration, uncertainty, failure audit) most projects skip because it's harder than reporting AUROC.

The framework is reproducible and open-sourced—designed for others to extend, retrain, or apply to different tasks. That's the work I want to do: building infrastructure others build on.

What's next

Code is on GitHub, demo is live. Everything is reproducible and modular.

What matters most isn't the artifacts—it's what they represent: a systematic, measurable framework for trustworthiness grounded in real constraints. I hope it becomes something others extend and critique.

This project changed how I think about ML. The hard problems aren't technical; they're conceptual. How do you define trustworthiness? What trade-offs matter? When does uncertainty mean abstain? These can't be solved by optimizing a loss function. They require deep thinking, domain expertise, and systems honest about their limits.

That's the kind of work I want to do.