Validation Case Study

What Your AI Misses After Deployment

A post-deployment validation analysis of a commercial CT lung nodule detection algorithm under six real-world imaging perturbations — the conditions your vendor never tested for.

Dataset: LIDC-IDRI (154 cases)
Algorithm: MONAI RetinaNet / LUNA16
Conditions Tested: 6 imaging perturbations
Preprint: arXiv:2603.26785
This applies to your site if:
You've deployed a CT detection AI
Lung nodule, pulmonary embolism, incidental finding — any model running on live scanner data that came with a vendor sensitivity claim.
Your protocols aren't standardized across scanners
Mixed kernel selections, variable slice thickness, or protocols that have drifted since the scanner was installed — common at multi-scanner sites.
You're evaluating a vendor update or new algorithm
Model updates reset the performance baseline. Without independent measurement, you have no way to know whether performance improved, degraded, or shifted by subgroup.
01 Summary Statistics

The Performance Gap Hidden in Plain Sight

Vendor benchmarks are measured under controlled acquisition protocols. Your scanners don't run under controlled acquisition protocols. This is what the gap looks like.

84.8%
Baseline sensitivity under reference protocol
Standard thin-slice, standard kernel
−13.2pp
Sensitivity drop at 5mm slice thickness
vs. thin-slice baseline
−10.5pp
Sensitivity drop with soft reconstruction kernel
Single parameter change
~−4pp
Sensitivity impact from dose reduction alone
Lowest-risk perturbation tested
Critical Finding

Effects were most pronounced in the 3–6mm nodule size range — the exact range where clinical detection decisions carry the most ambiguity and where missed findings have the most downstream consequence.

02 Performance Profiles

Sensitivity by Condition & Nodule Size

Sensitivity by Imaging Condition
Absolute sensitivity (%) under each perturbation, shown against the reference baseline. The y-axis begins at 55% — all conditions retain meaningful sensitivity; the chart shows the relative magnitude of degradation, not proximity to zero.
[Bar chart: Baseline 84.8% · Dose Reduction 80.8% · Soft Kernel 74.3% · 5mm Slice 71.6% · Combined (est.) ~66.5%]
Sensitivity by Nodule Size Group
Baseline vs. 5mm slice thickness, across all size cohorts. Values are directional, derived from published analysis. Effect is most pronounced in the 3–6mm range.
Size Group | Baseline | 5mm Slice
<3mm | 58% | 41%
3–6mm | 79% | 61%
6–10mm | 91% | 82%
>10mm | 97% | 93%
Imaging Condition | Sensitivity (%) | Δ from Baseline | Clinical Risk Flag
Baseline (thin-slice, standard kernel) | 84.8% | – | REFERENCE
Dose Reduction (~25% mAs) | ~80.8% | −4.0pp | LOW
Soft Reconstruction Kernel | 74.3% | −10.5pp | HIGH
5mm Slice Thickness | 71.6% | −13.2pp | HIGH
Combined: 5mm + Soft Kernel | ~65–68% | −17 to −20pp (est.) | CRITICAL
Clinical Consequence

A 13-point sensitivity drop is not an abstract metric. In the 3–6mm nodule range — where these effects are largest — the AI flag rate falls, meaning cases the algorithm would have surfaced are not surfaced. In a workflow where radiologists rely on AI triage, that gap translates directly into increased miss risk for indeterminate nodules. Missed findings in that size range drive delayed diagnosis, upstaged disease at follow-up, and the defensibility exposure that follows. The physics finding and the clinical consequence are the same event.

03 Client Deliverables

What a GammaMetric Engagement Produces

Every engagement is structured around reproducible, documentable outputs — not a vendor sales conversation. These are the artifacts your team receives.

Protocol Sensitivity Mapping
Systematic performance profiling of your deployed AI across the acquisition conditions present in your actual scanner fleet — not a vendor reference configuration. A minimal sketch of a slice-thickness sweep follows this list.
  • Sensitivity & FP rate per protocol variant
  • Slice thickness sweep
  • Kernel characterization
  • Dose condition testing
  • Scanner-specific baseline establishment
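To make the deliverable concrete, here is roughly what a slice-thickness sweep can look like in code. This is an illustration under stated assumptions, not GammaMetric's production harness: load_site_cases, run_detector, and match_to_ground_truth are hypothetical stand-ins for site-specific data access, model inference, and nodule matching, while MONAI's Spacingd transform resamples thin-slice source volumes to each test thickness.

```python
# Minimal sketch of a slice-thickness sensitivity sweep.
# load_site_cases(), run_detector(), and match_to_ground_truth()
# are hypothetical site-specific hooks, not a published API.
from monai.transforms import Compose, LoadImaged, Spacingd

SLICE_THICKNESSES_MM = [1.0, 2.5, 5.0]  # axial spacing variants to test

def sensitivity_at_thickness(cases, thickness_mm):
    """Resample each case to the target slice thickness, run the
    deployed detector, and count matched ground-truth nodules."""
    resample = Compose([
        LoadImaged(keys=["image"]),
        # Hold in-plane spacing fixed; vary only the axial (z) spacing.
        Spacingd(keys=["image"], pixdim=(1.0, 1.0, thickness_mm)),
    ])
    hits = total = 0
    for case in cases:
        volume = resample({"image": case["image_path"]})["image"]
        detections = run_detector(volume)                      # hypothetical
        matched = match_to_ground_truth(detections, case["nodules"])
        hits += len(matched)
        total += len(case["nodules"])
    return hits / total

cases = load_site_cases()  # hypothetical: de-identified retrospective cases
for t in SLICE_THICKNESSES_MM:
    print(f"{t:.1f} mm slices: sensitivity = {sensitivity_at_thickness(cases, t):.1%}")
```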
Statistical Validation Report
A peer-review–grade analysis document using established statistical methodology. Suitable for regulatory response, accreditation, or QA documentation. A minimal sketch of the core paired test appears after this list.
  • McNemar tests for paired comparisons
  • Confidence intervals per condition
  • Size-stratified subgroup analysis
  • Effect size and clinical significance framing
  • Pre-/post-update change detection
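Because the same nodules are scored under both the baseline and a perturbed condition, the outcomes are paired, and McNemar's test on the discordant pairs is the appropriate comparison. The sketch below shows the shape of that analysis with statsmodels; the outcome arrays are toy data (1 = detected, 0 = missed), not study results.

```python
# Minimal sketch: McNemar test on paired per-nodule outcomes, plus a
# Wilson confidence interval per condition. Arrays are toy data
# (1 = detected, 0 = missed), one entry per ground-truth nodule.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.proportion import proportion_confint

baseline  = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])  # reference protocol
perturbed = np.array([1, 0, 1, 0, 1, 0, 0, 1, 1, 1])  # e.g., 5mm slices

# 2x2 table of paired outcomes: rows = baseline, columns = perturbed.
table = np.array([
    [np.sum((baseline == 1) & (perturbed == 1)),
     np.sum((baseline == 1) & (perturbed == 0))],
    [np.sum((baseline == 0) & (perturbed == 1)),
     np.sum((baseline == 0) & (perturbed == 0))],
])

# Exact (binomial) McNemar test, appropriate for small discordant counts.
result = mcnemar(table, exact=True)
print(f"McNemar p-value: {result.pvalue:.4f}")

# Wilson 95% interval for sensitivity under each condition.
for name, hits in [("baseline", baseline), ("perturbed", perturbed)]:
    lo, hi = proportion_confint(hits.sum(), hits.size, method="wilson")
    print(f"{name}: {hits.mean():.1%} (95% CI {lo:.1%}–{hi:.1%})")
```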
Risk Stratification Summary
A concise, radiologist-readable summary flagging which protocol conditions in your workflow carry the highest detection risk — and what to do about them.
  • High/medium/low risk protocol flags
  • Recommended protocol guard rails
  • Vendor comparison if multiple systems deployed
  • Patient population impact estimate
Performance Monitoring Dashboard
Ongoing drift detection for longitudinal AI monitoring — identifying when algorithm updates or scanner changes have altered performance from your established baseline. One form such a check can take is sketched after this list.
  • Baseline sensitivity anchor point
  • Update-triggered re-validation protocol
  • Scanner fleet change tracking
  • Monthly/quarterly report cadence
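As one hedged illustration of the underlying check: each monitoring window's sensitivity is compared against the baseline anchor with a one-sample proportion test, and the window is flagged only when the drop is both statistically significant and clinically meaningful. The window counts and the 5pp threshold below are illustrative assumptions, not recommendations.

```python
# Minimal sketch of a drift check against a baseline sensitivity anchor.
# The 5pp clinical threshold and the toy window counts are assumptions
# for illustration only.
from statsmodels.stats.proportion import proportions_ztest

BASELINE_SENSITIVITY = 0.848   # anchor from the initial validation
CLINICAL_THRESHOLD_PP = 5.0    # flag drops larger than 5 percentage points

def check_window(detected, total):
    """Flag a monitoring window whose sensitivity has drifted from the
    baseline anchor both statistically and clinically."""
    observed = detected / total
    drop_pp = (BASELINE_SENSITIVITY - observed) * 100
    _, pvalue = proportions_ztest(detected, total, value=BASELINE_SENSITIVITY)
    flagged = drop_pp > CLINICAL_THRESHOLD_PP and pvalue < 0.05
    return observed, drop_pp, pvalue, flagged

# Toy quarterly windows: (detected nodules, total ground-truth nodules).
for quarter, (det, tot) in enumerate([(120, 142), (104, 139)], start=1):
    obs, drop, p, flag = check_window(det, tot)
    status = "DRIFT FLAGGED" if flag else "within baseline"
    print(f"Q{quarter}: {obs:.1%} (drop {drop:.1f}pp, p={p:.3f}) -> {status}")
```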
CT Dose Protocol Review
Independent review of your CT acquisition protocols with respect to both dose optimization and AI detection performance — identifying where you're leaving sensitivity on the table unnecessarily. The SSDE arithmetic behind the dose assessment is sketched after this list.
  • DLP / CTDIvol benchmarking by scanner
  • SSDE-based dose assessment
  • Protocol-AI interaction mapping
  • Leapfrog compliance check (optional)
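For context, the SSDE arithmetic itself is compact: the scanner-reported CTDIvol is scaled by a size-dependent conversion factor computed from the patient's water-equivalent diameter. The sketch below uses the exponential-fit coefficients for the 32 cm body phantom from AAPM Report 204 (extended to water-equivalent diameter in Report 220); the input values are illustrative.

```python
# Minimal SSDE sketch: scanner-reported CTDIvol scaled by the
# size-dependent factor for the 32 cm body phantom. Fit coefficients
# are from AAPM Report 204; example inputs are illustrative only.
import math

A_32, B_32 = 3.704369, 0.03671937  # 32 cm phantom exponential fit

def ssde_mgy(ctdivol_mgy: float, water_equiv_diam_cm: float) -> float:
    """Size-specific dose estimate (mGy) from CTDIvol and Dw."""
    conversion = A_32 * math.exp(-B_32 * water_equiv_diam_cm)
    return ctdivol_mgy * conversion

# Example: adult chest, Dw = 24 cm, reported CTDIvol = 8.0 mGy.
print(f"SSDE = {ssde_mgy(8.0, 24.0):.1f} mGy")  # ~12.3 mGy
```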
Vendor Accountability Package
A structured data package for vendor contract negotiations, update accountability, and SLA documentation — grounded in your deployment's measured performance, not their marketing claims.
  • Vendor benchmark vs. site-measured delta
  • Update impact quantification
  • Performance regression documentation
  • Independent physicist attestation letter
04 Engagement Process

How a Validation Engagement Works

01
Site Assessment & Protocol Audit
Review of current scanner protocols, AI vendor documentation, deployed model version, and acquisition parameter ranges actually in use. No PHI required at this stage.
1–2 weeks
02
Phantom & Retrospective Data Collection
Structured data collection plan using site-appropriate phantom sets and de-identified retrospective cases. Collection protocol designed to avoid workflow disruption.
2–4 weeks
03
Statistical Analysis & Findings Generation
Systematic perturbation analysis, condition-by-condition sensitivity profiling, and statistical comparison across protocol variants. All analysis is independent of vendor input.
2–3 weeks
04
Report Delivery & Stakeholder Briefing
Delivery of full validation report, risk stratification summary, and optional presentation to radiology leadership, QA committee, or AI vendor representatives.
1 week
05
Ongoing Monitoring (Optional)
Retainer-based longitudinal monitoring for algorithm update tracking, protocol change impact assessment, and annual re-validation. Structured to meet emerging ACR and CMS AI oversight guidance.
Ongoing / Quarterly
05 Independence

Why Independent Validation Matters

Every AI vendor publishes performance numbers. None of them were generated on your scanners, with your protocols, on your patient population. GammaMetric has no equity stake in any AI vendor, no exclusivity agreements, and no financial incentive to certify anything.

On Vendor Benchmarks

A 13-point sensitivity drop from a single protocol parameter change is not an edge case — it is the predictable consequence of deploying a model trained on curated data into an uncurated environment. The benchmark is not a lie; it is simply not a measurement of your site. And no vendor is incentivized to measure that for you.

GammaMetric
  • No vendor relationships
  • ABR board-eligible physicist
  • Published validation methodology
  • Dozens of scanner installations across CT, MRI, and fluoroscopy
  • Site-specific, reproducible analysis
Vendor Self-Reporting
  • Financial interest in outcome
  • Curated benchmark datasets
  • Protocol conditions not disclosed
  • Update impacts not measured
  • Site-specific validation absent
Find out what your AI is actually doing on your scanners.
gammametric.com