Trustworthy Medical Vision
Rigorous ML pipeline for binary chest X-ray classification emphasizing model calibration, uncertainty quantification, and trustworthy decision-making. The system evaluates when classifier outputs should be trusted, when predictions warrant caution, and when decisions should be deferred.
TL;DR
- Production baseline: DenseNet-121 achieving 0.9576 AUROC, 0.8990 accuracy, 0.9240 F1 on 624 test cases
- Comparator model: ResNet-50 with 0.9713 AUROC, 0.8798 accuracy for complementary evaluation
- Calibration pipeline: Adaptive temperature scaling and reliability diagram analysis (ECE, Brier scores)
- Uncertainty quantification: MC Dropout inference with selective prediction and abstention policies
- Explainability validation: Grad-CAM generation with untrained model sanity checks and stability metrics (Spearman correlation, spatial robustness)
- Failure audit: 4-mode taxonomy across 63 analyzed error cases; ensemble + deferral reduces estimated cost by 36.63%
What I Built
Binary pneumonia classification on the Kaggle Chest X-Ray Images dataset with a rigorous reliability pipeline answering: when should predictions be trusted?
I trained DenseNet-121 (0.9576 AUROC, 0.8990 accuracy) and ResNet-50 (0.9713 AUROC, 0.8798 accuracy) on deterministic splits. Strong numbers were just the start. The reliability stack (temperature scaling for calibration, MC Dropout for uncertainty, and Grad-CAM explanations validated against untrained baselines) audits trustworthiness systematically. I also measured explanation stability under perturbations: in some cases the explanation shifted dramatically while the prediction stayed the same, and this instability often preceded errors.
I categorized all 63 misclassifications into 4 failure modes: clinical ambiguity, annotation noise, model blindness, and spurious features. I then combined both models with abstention policies: the ensemble + deferral reduced estimated failure cost by 36.63%. Cross-model disagreement (32 of 624 cases) revealed complementary strengths, not noise.
The key insight: high AUROC doesn't mean trustworthy. Calibration, uncertainty, and explanation stability distinguish robust predictions from confident-but-fragile ones. Model disagreement becomes a design lever.
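To make the calibration step concrete, here is a minimal sketch of plain (single-parameter) temperature scaling for a binary classifier. It is not the project's adaptive variant, and the function names, search bounds, and toy validation data are assumptions made for illustration. The temperature `T` is fitted on held-out logits by minimizing negative log-likelihood; dividing logits by `T` never changes which class is predicted, only how confident the output is.

```python
import math

def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def nll(logits, labels, temperature):
    """Mean binary cross-entropy of sigmoid(logit / T) against 0/1 labels."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)

def fit_temperature(logits, labels, lo=0.05, hi=20.0, iters=80):
    """Golden-section search (in log T) for the T minimizing validation NLL."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = math.log(lo), math.log(hi)
    for _ in range(iters):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if nll(logits, labels, math.exp(c)) < nll(logits, labels, math.exp(d)):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return math.exp((a + b) / 2)

if __name__ == "__main__":
    # Hypothetical validation set: logits of magnitude 4 (about 98% confidence)
    # but only 6 of 8 predictions are correct, so the model is overconfident.
    val_logits = [4.0, 4.0, 4.0, 4.0, -4.0, -4.0, -4.0, -4.0]
    val_labels = [1, 1, 1, 0, 0, 0, 0, 1]
    T = fit_temperature(val_logits, val_labels)
    print(f"fitted T = {T:.3f}")
```

On this toy data the fitted temperature is greater than 1, which softens every probability toward the observed 75% hit rate. A real pipeline would fit `T` on a validation split and apply it unchanged to test logits.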
Why It Matters
ML Systems Thinking: Reliability isn't bolted on—it's architected. This project shows disciplined multi-dimensional evaluation (accuracy + calibration + uncertainty + explanation + failure modes) as standard practice.
Production ML Engineering: Temperature scaling, MC Dropout placement, and selective prediction are practical mechanisms. Failure audits convert error patterns into actionable insights. Deferral policies quantify the cost-benefit of abstention.
Responsible AI in Practice: Not just statements but concrete limitations, disclaimers, and failure taxonomies. Medical AI demands this rigor.
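Of the mechanisms named above, MC Dropout with abstention reduces to a small amount of logic once the stochastic forward passes are available. The sketch below assumes a hypothetical `stochastic_predict` callable that returns one pneumonia probability per call with dropout active; the entropy threshold and the number of passes are illustrative values, not the project's settings.

```python
import math
import statistics

def binary_entropy(p: float) -> float:
    # Entropy (in nats) of a Bernoulli distribution with parameter p.
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def mc_dropout_summary(stochastic_predict, n_passes=30):
    """Aggregate repeated stochastic forward passes for one image."""
    samples = [stochastic_predict() for _ in range(n_passes)]
    mean_p = statistics.fmean(samples)
    return {
        "mean": mean_p,                       # averaged predictive probability
        "std": statistics.pstdev(samples),    # spread across dropout masks
        "entropy": binary_entropy(mean_p),    # uncertainty of the averaged prediction
    }

def decide(summary, max_entropy=0.5):
    """Abstain when predictive entropy exceeds a (hypothetical) threshold."""
    if summary["entropy"] > max_entropy:
        return "defer"
    return "pneumonia" if summary["mean"] >= 0.5 else "normal"

if __name__ == "__main__":
    # Stand-ins for a model with dropout kept active at inference time.
    print(decide(mc_dropout_summary(lambda: 0.97)))
    print(decide(mc_dropout_summary(lambda: 0.50)))
```

In a deep-learning framework, the "placement" question is which layers stay stochastic at inference: typically only the dropout layers are kept active while normalization layers remain in evaluation mode.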
Architecture & Implementation
9-Stage Pipeline
- Data preparation (deterministic splits, stratification)
- Baseline model training (DenseNet-121, ResNet-50)
- Model comparison & cross-validation
- Calibration analysis & temperature scaling
- Uncertainty quantification (MC Dropout)
- Explainability validation (Grad-CAM + sanity checks)
- Explanation stability (perturbation robustness)
- Failure-case audit & trust synthesis
- Ensemble & deferral policies
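As one illustration of the explanation-stability stage, stability can be scored as the Spearman rank correlation between the Grad-CAM heatmap of an image and the heatmap of a perturbed copy. The sketch below uses nested lists as stand-ins for heatmaps; the helper names are hypothetical, and the 0.85 cutoff is taken from the reliability results reported in this README.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant heatmap carries no ranking information
    return cov / math.sqrt(vx * vy)

def spearman(heatmap_a, heatmap_b):
    """Spearman correlation between two equally sized 2D heatmaps."""
    flat_a = [v for row in heatmap_a for v in row]
    flat_b = [v for row in heatmap_b for v in row]
    return pearson(average_ranks(flat_a), average_ranks(flat_b))

def is_stable(rho, threshold=0.85):
    return rho > threshold

if __name__ == "__main__":
    original = [[0.1, 0.2], [0.7, 0.9]]
    perturbed = [[0.2, 0.4], [1.4, 1.8]]  # same ordering of pixel importance
    print(is_stable(spearman(original, perturbed)))
```

Rank correlation is a reasonable choice here because it ignores overall intensity changes and only asks whether the same regions remain the most important.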
Config-Driven & Reproducible
- All artifacts versioned and archived (checkpoints, metrics, explanations, case studies)
- Deterministic reproducibility via seeding
- Full audit trail from data → models → insights
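A deterministic, stratified split of the kind described above can be sketched with a locally seeded random generator. The function name, the 20% validation fraction, and the seed value are assumptions for illustration; the project's actual split logic is not shown in this README.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_indices, val_indices) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    rng = random.Random(seed)  # local RNG: reproducible, no global side effects
    train, val = [], []
    for y in sorted(by_class):  # fixed class order keeps the result deterministic
        idxs = by_class[y]
        rng.shuffle(idxs)
        n_val = round(len(idxs) * val_fraction)
        val.extend(idxs[:n_val])
        train.extend(idxs[n_val:])
    return sorted(train), sorted(val)

if __name__ == "__main__":
    toy_labels = [0] * 30 + [1] * 70  # hypothetical class balance
    train_idx, val_idx = stratified_split(toy_labels)
    print(len(train_idx), len(val_idx))
```

Calling the function twice with the same seed returns identical index lists, which is what makes downstream metrics comparable across runs. If training uses a numerical or deep-learning library, that library's own random generators would need seeding as well.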
Key Results
Model Performance
| Metric | DenseNet-121 | ResNet-50 |
|---|---|---|
| AUROC | 0.9576 | 0.9713 |
| Accuracy | 0.8990 | 0.8798 |
| F1 | 0.9240 | 0.9117 |
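The table shows ResNet-50 with the higher AUROC but the lower accuracy. That is possible because AUROC measures ranking quality across all thresholds, while accuracy and F1 depend on one chosen threshold. The sketch below computes all three from scores and labels; the toy data and the 0.5 threshold are assumptions. As a side check on the reported numbers, 0.8990 × 624 ≈ 561 correct predictions, which leaves 63 errors and matches the 63 audited error cases if the audit covered the DenseNet-121 test errors.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy_and_f1(scores, labels, threshold=0.5):
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    acc = (tp + tn) / len(labels)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return acc, f1

if __name__ == "__main__":
    # Perfect ranking (every positive scores above every negative),
    # yet two negatives sit above the 0.5 threshold.
    scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.2]
    labels = [1, 1, 1, 0, 0, 0]
    print(auroc(scores, labels), accuracy_and_f1(scores, labels))
```

In this toy case the ranking is perfect while accuracy at the default threshold is well below 1, which is the same kind of gap the two models exhibit.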
Reliability Signals
- Temperature scaling improved calibration; MC Dropout sensitivity to model uncertainty varied by architecture
- Selective prediction yielded accuracy gains of 2-3% when abstaining on the 10% most uncertain cases
- Explanation stability (Spearman > 0.85) correlated with correct predictions; unstable explanations often preceded errors
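The selective-prediction result above can be reproduced in outline as follows: rank cases by uncertainty, abstain on the most uncertain fraction, and measure accuracy on what remains. The function name and toy data are hypothetical; any per-case uncertainty score (for example, MC Dropout entropy) can be used.

```python
def selective_accuracy(uncertainty, correct, abstain_fraction=0.10):
    """Return (coverage, accuracy) after abstaining on the most uncertain cases.

    uncertainty: per-case uncertainty scores (higher = less certain)
    correct:     per-case 0/1 flags, 1 if the prediction was right
    """
    n = len(correct)
    order = sorted(range(n), key=lambda i: uncertainty[i])  # most certain first
    n_keep = n - int(round(abstain_fraction * n))
    if n_keep == 0:
        return 0.0, float("nan")  # everything was deferred
    kept = order[:n_keep]
    return n_keep / n, sum(correct[i] for i in kept) / n_keep

if __name__ == "__main__":
    # Toy case: the single error is also the most uncertain prediction.
    unc = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    ok = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
    print(selective_accuracy(unc, ok, abstain_fraction=0.0))
    print(selective_accuracy(unc, ok, abstain_fraction=0.10))
```

Sweeping `abstain_fraction` from 0 upward traces an accuracy-versus-coverage curve, which shows how much accuracy each additional deferred case buys.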
Failure Analysis
- Cross-model disagreement: 32 of 624 cases (5.1%)
- Oracle ensemble accuracy: 0.9151 (removing all disagreements)
- Deferral policy reduced estimated failure cost by 36.63%
- Failure modes: clinical ambiguity (40%), annotation errors (25%), model blindness (20%), spurious features (15%)
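The disagreement and deferral figures above can be illustrated with a small cost model. The sketch below defers whenever the two models disagree and compares total cost with and without deferral. The cost values (a missed pneumonia costing more than a false alarm, and a deferral costing less than either) are hypothetical; the 36.63% reduction reported here depends on the project's own cost assumptions, which this README does not state.

```python
def disagreement_rate(preds_a, preds_b):
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

def defer_on_disagreement(preds_a, preds_b):
    """Keep the shared prediction when models agree; None marks a deferral."""
    return [a if a == b else None for a, b in zip(preds_a, preds_b)]

def total_cost(preds, labels, cost_fn=5.0, cost_fp=1.0, cost_defer=0.5):
    """Sum of per-case costs; all three cost values are illustrative assumptions."""
    cost = 0.0
    for p, y in zip(preds, labels):
        if p is None:
            cost += cost_defer      # case routed to a human reviewer
        elif p == 0 and y == 1:
            cost += cost_fn         # missed pneumonia
        elif p == 1 and y == 0:
            cost += cost_fp         # false alarm
    return cost

if __name__ == "__main__":
    labels = [1, 1, 0, 0, 1, 0]
    model_a = [1, 0, 0, 1, 1, 0]
    model_b = [1, 1, 0, 0, 0, 0]
    combined = defer_on_disagreement(model_a, model_b)
    baseline = total_cost(model_a, labels)
    with_deferral = total_cost(combined, labels)
    print(disagreement_rate(model_a, model_b), baseline, with_deferral)
```

Note that deferral also removes some correct predictions (the fifth case here), so whether it pays off depends on how cheap a deferral is relative to an error.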
System-Level Insights
The project demonstrates that asking when the model should be trusted yields more value than asking how accurate the model is. Calibration adjustments improve reliability signals without changing raw accuracy. Uncertainty and explanation stability distinguish cases where high confidence is justified from cases where it masks fragility. Ensemble disagreement reveals complementary model strengths. Deferral policies quantify the cost-benefit of abstention.
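The claim that calibration changes reliability signals without changing accuracy can be checked with the two calibration metrics used in this project, ECE and the Brier score. The sketch below is a minimal binary version with equal-width confidence bins; the bin count and toy probabilities are assumptions.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE over confidence in the predicted class, with equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(a for _, a in members) / len(members)
        ece += len(members) / len(probs) * abs(acc - avg_conf)
    return ece

def accuracy(probs, labels):
    return sum((p >= 0.5) == (y == 1) for p, y in zip(probs, labels)) / len(probs)

if __name__ == "__main__":
    labels = [1] * 9 + [0]           # 9 of 10 positive predictions are right
    overconfident = [0.99] * 10      # claims 99% confidence
    calibrated = [0.90] * 10         # claims 90% confidence
    for name, probs in [("overconfident", overconfident), ("calibrated", calibrated)]:
        print(name, accuracy(probs, labels),
              expected_calibration_error(probs, labels), brier_score(probs, labels))
```

Both probability sets make identical decisions, so accuracy is the same, but only the second set reports a confidence that matches its hit rate.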
Responsible AI & Safety
Important Disclaimers
- Research/educational project for methodological validation only
- Not a medical diagnostic tool or approved for clinical use
- Requires medical professional review before any healthcare application
Ethical Considerations
- Models inherit the demographic patterns of the training data; external validation on diverse populations is required
- Explanations are model-derived signals, not ground truth
- Calibration and stability metrics are dataset-dependent and may not generalize
Limitations
- Single-source dataset; external validation on independent cohorts needed before clinical claims
- Failures due to dataset bias, labeling noise, or domain shift cannot be fully ruled out
- Model decisions should never override radiologist judgment
Workflow
9-stage modular pipeline with config-driven training, evaluation, and deployment. Clean separation of data, modeling, calibration, uncertainty, explainability, failure audit, and ensemble analysis. Full artifact versioning and reproducibility.