Automated Clinical Remediation: A Modular Python Pipeline

Enhancing Fidelity in Longitudinal Diabetes Records

This project implements an automated Python pipeline that remediates systemic defects in the Diabetes 130-US Hospitals dataset. Using a modular stack (Pandas, NumPy, Scikit-Learn), the team achieved a 25% increase in the Data Quality Index (DQI), turning noisy raw clinical records into high-fidelity data assets.

📊 Key Performance Indicators (KPIs)

25% Aggregate Improvement in data health metrics.
101,766 Patient Encounters retained, preserving statistical power.
11.2% Readmission Rate, with readmission identified as the target clinical variable.
97% Weight Missingness addressed via MICE imputation to prevent data loss.

🛠 Technical Stack & Methodology

The project follows a structured Software Development Lifecycle (SDLC):

Phase 1: Clinical Audit: Quantitative profiling using Pandas and NumPy to identify “sentinel” null values (e.g., “?”).
Phase 2: Advanced Remediation: Multivariate Imputation by Chained Equations (MICE) via Scikit-Learn for missing values, plus regular expressions (RE) for ICD-9 code normalization.
Phase 3: Visual Validation: Fidelity checks using Kernel Density Estimate (KDE) plots to ensure statistical stability post-remediation.
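
The first two phases can be sketched roughly as follows. This is a minimal illustration on a toy frame, not the project's actual code: the column names, sample values, the `normalize_icd9` helper, and the zero-padding rule for ICD-9 codes are all assumptions made for the example. Scikit-Learn's `IterativeImputer` is used as a MICE-style imputer; note that it returns a single imputed dataset rather than the multiple imputations of classical MICE.

```python
import re

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

SENTINEL = "?"  # placeholder string that the audit treats as a null

# Toy stand-in for the raw data; columns and values are illustrative only.
raw = pd.DataFrame({
    "age_years": [45, 62, 71, 58, 66, 39],
    "weight_kg": ["?", "82.5", "?", "90.1", "?", "70.3"],
    "num_medications": [8, 15, 21, 12, 18, 6],
    "diag_1": ["250.8", " v57", "8", "?", "e909", "428"],
})

# Phase 1: audit the share of sentinel values per column.
sentinel_rate = (raw.astype(str) == SENTINEL).mean()

# Convert sentinels to real missing values before any remediation.
clean = raw.replace(SENTINEL, np.nan)
clean["weight_kg"] = pd.to_numeric(clean["weight_kg"])

# Phase 2a: regex-based ICD-9 normalization (assumed rule: trim, uppercase,
# zero-pad the major part to 3 digits, or 2 digits for V codes).
ICD9_PATTERN = re.compile(r"^(?P<prefix>[VE]?)(?P<major>\d{1,3})(?:\.(?P<minor>\d{1,2}))?$")


def normalize_icd9(code):
    """Return a normalized ICD-9 string, or NaN if the code cannot be parsed."""
    if pd.isna(code):
        return np.nan
    match = ICD9_PATTERN.match(str(code).strip().upper())
    if match is None:
        return np.nan  # unparseable codes are flagged as missing, not guessed
    prefix, major, minor = match.group("prefix", "major", "minor")
    major = major.zfill(2 if prefix == "V" else 3)
    return prefix + major + (f".{minor}" if minor else "")


clean["diag_1"] = clean["diag_1"].map(normalize_icd9)

# Phase 2b: MICE-style imputation of the numeric columns.
numeric_cols = ["age_years", "weight_kg", "num_medications"]
imputer = IterativeImputer(max_iter=10, random_state=0)
clean[numeric_cols] = imputer.fit_transform(clean[numeric_cols])

print(sentinel_rate)
print(clean)
```

For the Phase 3 check, one option is to overlay the density of the observed values against the imputed column, for example with `Series.plot.kde()` in Pandas (which relies on Matplotlib and SciPy), and confirm that the two curves have a similar shape.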

👥 Research Team (Milestone 3 Roles)

Ashley Love: Project Manager (SDLC Management & Documentation)
Christian Shannon: Data Wrangler (Clinical Data Profiling & Audit)
Kirsten Livingston: Data Scientist (MICE Imputation & RE Logic)
Mugtaba Awad: Data Visualizer (Distribution Analysis & Presentation)