Deep Learning, CNN, Medical AI, Explainability, PyTorch, Python, Calibration, Uncertainty Quantification

Trustworthy Medical Vision

Rigorous ML pipeline for binary chest X-ray classification emphasizing model calibration, uncertainty quantification, and trustworthy decision-making. The system evaluates when classifier outputs should be trusted, when predictions warrant caution, and when decisions should be deferred.

TL;DR

Production baseline: DenseNet-121 achieving 0.9576 AUROC, 0.8990 accuracy, 0.9240 F1 on 624 test cases
Comparator model: ResNet-50 with 0.9713 AUROC, 0.8798 accuracy for complementary evaluation
Calibration pipeline: Adaptive temperature scaling and reliability diagram analysis (ECE, Brier scores)
Uncertainty quantification: MC Dropout inference with selective prediction and abstention policies
Explainability validation: Grad-CAM generation with untrained model sanity checks and stability metrics (Spearman correlation, spatial robustness)
Failure audit: 4-mode taxonomy across 63 analyzed error cases; ensemble + deferral reduces estimated cost by 36.63%

What I Built

Binary pneumonia classification on the Kaggle Chest X-Ray Images dataset with a rigorous reliability pipeline answering: when should predictions be trusted?

I trained DenseNet-121 (0.9576 AUROC, 0.8990 accuracy) and ResNet-50 (0.9713 AUROC, 0.8798 accuracy) on deterministic splits. Strong numbers were just the start. The reliability stack audits trustworthiness systematically: temperature scaling for calibration, MC Dropout for uncertainty, and Grad-CAM explanations validated against untrained baselines. I also measured explanation stability under perturbations and found cases where the explanation shifted dramatically while the prediction stayed the same; this kind of instability often preceded errors.
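One way to score explanation stability is the Spearman rank correlation between the Grad-CAM map of an image and the map of a perturbed copy. The sketch below is dependency-free Python written for illustration; the function names and the tiny 3x3 heatmaps are hypothetical, and the project's own implementation may differ.

```python
from typing import List, Sequence


def _average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("inputs must have equal length >= 2")
    ra, rb = _average_ranks(a), _average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant map carries no ranking information
    return cov / (var_a * var_b) ** 0.5


def flatten(heatmap: Sequence[Sequence[float]]) -> List[float]:
    """Flatten a 2-D heatmap row by row."""
    return [v for row in heatmap for v in row]


# Hypothetical Grad-CAM maps for one image and a lightly perturbed copy.
cam_original = [[0.1, 0.2, 0.1], [0.3, 0.9, 0.4], [0.1, 0.5, 0.2]]
cam_perturbed = [[0.2, 0.1, 0.1], [0.4, 0.8, 0.3], [0.1, 0.6, 0.2]]
stability = spearman(flatten(cam_original), flatten(cam_perturbed))
```

A low score on an image whose predicted label did not change is the "same prediction, different explanation" pattern described above.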

I sorted all 63 misclassifications into 4 failure modes: clinical ambiguity, annotation noise, model blindness, and spurious features. I then combined both models with abstention policies; the ensemble + deferral reduced estimated failure cost by 36.63%. Cross-model disagreement (32 of 624 cases) revealed complementary strengths, not noise.
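To make the ensemble + deferral idea concrete, here is a minimal sketch under stated assumptions: the policy defers whenever the two models disagree at a 0.5 threshold, and the cost weights (missed pneumonia = 10, false alarm = 1, deferral = 2) are invented for illustration. Neither the policy nor the weights are taken from the project, so the numbers will not reproduce the 36.63% figure.

```python
from typing import List, Optional, Sequence


def ensemble_with_deferral(
    probs_a: Sequence[float],
    probs_b: Sequence[float],
    threshold: float = 0.5,
) -> List[Optional[int]]:
    """Return the shared label when both models agree, None to defer."""
    decisions: List[Optional[int]] = []
    for pa, pb in zip(probs_a, probs_b):
        label_a, label_b = int(pa >= threshold), int(pb >= threshold)
        decisions.append(label_a if label_a == label_b else None)
    return decisions


def total_cost(
    decisions: Sequence[Optional[int]],
    labels: Sequence[int],
    cost_fn: float = 10.0,    # hypothetical cost of a missed pneumonia case
    cost_fp: float = 1.0,     # hypothetical cost of a false alarm
    cost_defer: float = 2.0,  # hypothetical cost of sending a case to review
) -> float:
    """Sum the cost of every decision; correct predictions cost nothing."""
    cost = 0.0
    for decision, label in zip(decisions, labels):
        if decision is None:
            cost += cost_defer
        elif decision == 0 and label == 1:
            cost += cost_fn
        elif decision == 1 and label == 0:
            cost += cost_fp
    return cost


# Hypothetical pneumonia probabilities for six test images.
labels = [1, 1, 0, 0, 1, 0]
densenet = [0.92, 0.40, 0.10, 0.70, 0.85, 0.20]
resnet = [0.88, 0.65, 0.05, 0.60, 0.95, 0.10]

single_model = [int(p >= 0.5) for p in densenet]
with_deferral = ensemble_with_deferral(densenet, resnet)

baseline_cost = total_cost(single_model, labels)
policy_cost = total_cost(with_deferral, labels)
reduction = (baseline_cost - policy_cost) / baseline_cost
```

In this toy data the deferred case is the one false negative of the single model, which is why deferral pays off; with a high enough `cost_defer` the same policy would cost more than it saves.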

The key insight: high AUROC doesn't mean trustworthy. Calibration, uncertainty, and explanation stability distinguish robust predictions from confident-but-fragile ones. Model disagreement becomes a design lever.

Why It Matters

ML Systems Thinking: Reliability isn't bolted on; it's architected. This project treats disciplined multi-dimensional evaluation (accuracy + calibration + uncertainty + explanation + failure modes) as standard practice.

Production ML Engineering: Temperature scaling, MC Dropout placement, and selective prediction are practical mechanisms. Failure audits convert error patterns into actionable insights. Deferral policies quantify the cost-benefit of abstention.
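As an illustration of the first mechanism, the sketch below fits a single scalar temperature on held-out logits by minimising the negative log-likelihood. It is plain single-parameter temperature scaling, not necessarily the adaptive variant used in the project, and it uses a golden-section search in pure Python where a PyTorch pipeline would more likely use an optimizer. The validation logits are hypothetical.

```python
import math
from typing import Sequence


def nll(logits: Sequence[float], labels: Sequence[int], temperature: float) -> float:
    """Mean binary cross-entropy of sigmoid(logit / temperature)."""
    total = 0.0
    for z, y in zip(logits, labels):
        s = z / temperature
        # softplus(s) - y * s is a numerically stable form of the log loss.
        total += max(s, 0.0) + math.log1p(math.exp(-abs(s))) - y * s
    return total / len(logits)


def fit_temperature(
    logits: Sequence[float],
    labels: Sequence[int],
    lo: float = 0.05,
    hi: float = 20.0,
    iters: int = 60,
) -> float:
    """Golden-section search over log T for the NLL-minimising temperature."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = math.log(lo), math.log(hi)
    for _ in range(iters):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if nll(logits, labels, math.exp(c)) < nll(logits, labels, math.exp(d)):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
    return math.exp((a + b) / 2.0)


# Hypothetical validation logits from an overconfident model: every
# magnitude is large, yet two of the eight cases are on the wrong side.
val_logits = [6.0, 5.0, -7.0, -4.0, 5.5, -6.0, 4.0, -5.0]
val_labels = [1, 1, 0, 0, 0, 1, 1, 0]

temperature = fit_temperature(val_logits, val_labels)
nll_before = nll(val_logits, val_labels, 1.0)
nll_after = nll(val_logits, val_labels, temperature)
```

A fitted temperature above 1 softens the probabilities, which is the expected direction for an overconfident classifier. The temperature should be fitted on validation data and then applied unchanged to the test set.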

Responsible AI in Practice: Not just statements but concrete limitations, disclaimers, and failure taxonomies. Medical AI demands this rigor.

Architecture & Implementation

9-Stage Pipeline

Data preparation (deterministic splits, stratification)
Baseline model training (DenseNet-121, ResNet-50)
Model comparison & cross-validation
Calibration analysis & temperature scaling
Uncertainty quantification (MC Dropout)
Explainability validation (Grad-CAM + sanity checks)
Explanation stability (perturbation robustness)
Failure-case audit & trust synthesis
Ensemble & deferral policies
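For stage 5, MC Dropout keeps dropout active at inference time and runs several stochastic forward passes per image. The sketch below covers only the aggregation step: it turns the per-pass pneumonia probabilities for one image into uncertainty scores. The forward passes themselves are assumed to have been run already, and the function names and sample values are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import Dict, Sequence


def binary_entropy(p: float) -> float:
    """Entropy in nats of a Bernoulli(p) prediction; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def summarize_mc_samples(samples: Sequence[float]) -> Dict[str, float]:
    """Aggregate the probabilities from T stochastic passes on one image."""
    mean_prob = mean(samples)
    # Predictive entropy: entropy of the averaged prediction.
    predictive_entropy = binary_entropy(mean_prob)
    # Expected entropy: average entropy of the individual passes.
    expected_entropy = mean([binary_entropy(p) for p in samples])
    return {
        "mean_prob": mean_prob,
        "std": pstdev(samples),
        "predictive_entropy": predictive_entropy,
        # Mutual information isolates the disagreement between passes.
        "mutual_information": predictive_entropy - expected_entropy,
    }


# Hypothetical per-pass probabilities for two images.
stable_case = summarize_mc_samples([0.97, 0.96, 0.98, 0.97])
unstable_case = summarize_mc_samples([0.90, 0.30, 0.70, 0.45])
```

Any of the returned scores can feed a selective prediction rule; which one ranks errors best is an empirical question per architecture.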

Config-Driven & Reproducible

All artifacts versioned and archived (checkpoints, metrics, explanations, case studies)
Deterministic reproducibility via seeding
Full audit trail from data → models → insights
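Seeding for a Python/PyTorch project typically looks like the sketch below. The helper name `seed_everything` and the default seed are assumptions, not the project's actual code; NumPy and PyTorch are imported lazily so the function also runs where they are not installed.

```python
import os
import random


def seed_everything(seed: int = 42) -> None:
    """Seed every random number generator the pipeline may touch."""
    random.seed(seed)
    # Affects only subprocesses started after this point; the running
    # interpreter fixed its own hash seed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch

        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
        # Trade some speed for repeatable cuDNN kernel selection.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass


seed_everything(7)
first_draw = random.random()
seed_everything(7)
second_draw = random.random()
```

Seeding alone does not guarantee bit-identical GPU results; some CUDA operations remain nondeterministic unless deterministic algorithms are explicitly requested.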

Key Results

Model Performance

Metric	DenseNet-121	ResNet-50
AUROC	0.9576	0.9713
Accuracy	0.8990	0.8798
F1	0.9240	0.9117

Reliability Signals

Temperature scaling improved calibration; MC Dropout sensitivity to model uncertainty varied by architecture
Selective prediction showed 2-3% accuracy gains when abstaining on the 10% most uncertain cases
Explanation stability (Spearman > 0.85) correlated with correct predictions; unstable explanations often preceded errors
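The selective prediction result above can be computed with a rule like the following sketch: rank cases by an uncertainty score, abstain on the most uncertain fraction, and report accuracy on what remains. The function name and the toy data are hypothetical; the uncertainty score could be MC Dropout entropy or any other per-case signal.

```python
from typing import Sequence, Tuple


def selective_accuracy(
    correct: Sequence[bool],
    uncertainty: Sequence[float],
    abstain_fraction: float = 0.10,
) -> Tuple[float, float]:
    """Return (coverage, accuracy) after abstaining on the most uncertain cases."""
    n = len(correct)
    n_abstain = int(n * abstain_fraction)
    if n == 0 or n_abstain >= n:
        raise ValueError("abstention must leave at least one case")
    # Sort from most to least certain and keep the first n - n_abstain.
    order = sorted(range(n), key=lambda i: uncertainty[i])
    kept = order[: n - n_abstain]
    accuracy = sum(1 for i in kept if correct[i]) / len(kept)
    return len(kept) / n, accuracy


# Hypothetical results for ten cases: two errors, one of which has the
# highest uncertainty score in the batch.
correct = [True, True, False, True, True, True, True, False, True, True]
uncertainty = [0.05, 0.10, 0.65, 0.08, 0.12, 0.20, 0.07, 0.15, 0.09, 0.11]

full_coverage, full_accuracy = selective_accuracy(correct, uncertainty, 0.0)
coverage, accuracy = selective_accuracy(correct, uncertainty, 0.10)
```

The gain depends entirely on how well uncertainty ranks the errors: in the toy data one error is caught and the other, which has a low uncertainty score, is not.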

Failure Analysis

Cross-model disagreement: 32 of 624 cases (5.1%)
Oracle ensemble accuracy: 0.9151 (upper bound, reached if every disagreement were resolved in favor of the correct model)
Deferral policy reduced estimated failure cost by 36.63%
Failure modes: clinical ambiguity (40%), annotation noise (25%), model blindness (20%), spurious features (15%)
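The reported figures can be cross-checked with a little arithmetic. The sketch assumes both models were scored on the same 624 test cases at one fixed decision threshold each, and that the oracle counts a case as correct whenever at least one model is correct. Under those assumptions the 63 audited errors match DenseNet-121's accuracy and the oracle accuracy follows from the disagreement count.

```python
N_TEST = 624
DISAGREEMENTS = 32

# Error counts implied by the reported accuracies.
densenet_errors = round(N_TEST * (1 - 0.8990))  # 63
resnet_errors = round(N_TEST * (1 - 0.8798))    # 75

# With two classes, the models disagree exactly when one is right and the
# other is wrong, so:
#   DISAGREEMENTS = (densenet_errors - both_wrong) + (resnet_errors - both_wrong)
both_wrong = (densenet_errors + resnet_errors - DISAGREEMENTS) // 2  # 53

# An oracle that always sides with the correct model fails only when both
# models are wrong.
oracle_accuracy = 1 - both_wrong / N_TEST       # 0.9151 after rounding
disagreement_rate = DISAGREEMENTS / N_TEST      # 0.051 after rounding
```

The oracle figure is a ceiling, not an achievable policy: it requires knowing which model is right on each disagreement.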

System-Level Insights

The project demonstrates that asking when the model should be trusted yields more value than asking how accurate the model is. Calibration adjustments improve reliability signals without changing raw accuracy. Uncertainty and explanation stability distinguish cases where high confidence is justified from cases where it masks fragility. Ensemble disagreement reveals complementary model strengths. Deferral policies quantify the cost-benefit of abstention.
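The claim that calibration leaves raw accuracy untouched follows from the math: dividing a logit by a positive temperature never changes its sign, so every decision at the 0.5 threshold stays the same while the confidence shrinks. The sketch below illustrates this with hypothetical logits and a hypothetical temperature of 4. It assumes equal-width confidence bins for ECE, one common definition that may differ from the project's.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def accuracy(probs: Sequence[float], labels: Sequence[int]) -> float:
    return sum(int(p >= 0.5) == y for p, y in zip(probs, labels)) / len(labels)


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """ECE over equal-width bins of the predicted-class confidence."""
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = int(p >= 0.5)
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, float(pred == y)))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            avg_acc = sum(a for _, a in members) / len(members)
            ece += len(members) / len(probs) * abs(avg_acc - avg_conf)
    return ece


# Hypothetical test logits: confident everywhere, wrong on two of eight.
logits = [6.0, 5.0, -7.0, -4.0, 5.5, -6.0, 4.0, -5.0]
labels = [1, 1, 0, 0, 0, 1, 1, 0]
T = 4.0  # hypothetical temperature, assumed fitted on validation data

raw_probs = [sigmoid(z) for z in logits]
scaled_probs = [sigmoid(z / T) for z in logits]

acc_raw, acc_scaled = accuracy(raw_probs, labels), accuracy(scaled_probs, labels)
ece_raw = expected_calibration_error(raw_probs, labels)
ece_scaled = expected_calibration_error(scaled_probs, labels)
```

Because scaling is monotonic, threshold-free ranking metrics such as AUROC are unchanged as well; only the probability values move.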

Responsible AI & Safety

Important Disclaimers

Research/educational project for methodological validation only
Not a medical diagnostic tool or approved for clinical use
Requires medical professional review before any healthcare application

Ethical Considerations

The model inherits the demographic patterns of its training data; external validation on diverse populations is required
Explanations are model-derived signals, not ground truth
Calibration and stability metrics are dataset-dependent and may not generalize

Limitations

Single-source dataset; external validation on independent cohorts needed before clinical claims
Failures due to dataset bias, labeling noise, or domain shift cannot be fully ruled out
Model decisions should never override radiologist judgment

Workflow

9-stage modular pipeline with config-driven training, evaluation, and deployment. Clean separation of data, modeling, calibration, uncertainty, explainability, failure audit, and ensemble analysis. Full artifact versioning and reproducibility.
