Automated Clinical Remediation: A Modular Python Pipeline
Enhancing Fidelity in Longitudinal Diabetes Records
This project implements an automated Python-based pipeline designed to remediate systemic defects in the Diabetes 130-US Hospitals dataset. By utilizing a modular technical stack, the team achieved a 25% increase in the Data Quality Index (DQI), transforming raw clinical "noise" into high-fidelity data assets.
Key Performance Indicators (KPIs)
- 25% Aggregate Improvement in data health metrics.
- 101,766 Patient Encounters retained, preserving statistical power.
- 11.2% Readmission Rate, with readmission identified as the target clinical variable.
- 97% Weight Missingness successfully addressed via MICE to prevent data loss.
Technical Stack & Methodology
The project follows a structured Software Development Lifecycle (SDLC):
- Phase 1: Clinical Audit: Quantitative profiling using Pandas and NumPy to identify "sentinel" null values (e.g., "?").
- Phase 2: Advanced Remediation: Multivariate Imputation by Chained Equations (MICE) via scikit-learn, plus Regular Expressions (RE) for ICD-9 code normalization.
- Phase 3: Visual Validation: Fidelity checks using Kernel Density Estimate (KDE) plots to ensure statistical stability post-remediation.
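
The first two phases can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the rows are invented, `age_years` is a hypothetical numeric column, the helpers `sentinel_share`, `sentinels_to_nan`, and `normalize_icd9` are hypothetical names, and "normalization" is assumed to mean reducing each ICD-9 code to its category prefix. scikit-learn's `IterativeImputer` is used as a MICE-style imputer; it returns a single imputation rather than the multiple imputations of full MICE.

```python
import re

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Invented toy rows standing in for a few encounters (not real data).
raw = pd.DataFrame({
    "diag_1": ["250.00", "V45", "?", "8", "e909"],
    "age_years": [55.0, 72.0, 64.0, np.nan, 81.0],
    "time_in_hospital": [3.0, 7.0, 5.0, 4.0, np.nan],
    "num_medications": [12.0, 21.0, np.nan, 15.0, 25.0],
})


# Phase 1 (audit): measure how often the sentinel appears, then make it a real null.
def sentinel_share(series, sentinel="?"):
    """Fraction of entries equal to the sentinel string."""
    return float((series == sentinel).mean())


def sentinels_to_nan(series, sentinel="?"):
    """Replace sentinel entries with NaN so pandas treats them as missing."""
    return series.mask(series == sentinel)


# Phase 2a (regex): assumed rule, keep only the ICD-9 category prefix.
# Numeric categories are zero-padded to three digits; V and E codes keep their letter.
ICD9_PATTERN = re.compile(r"^(E\d{3}|V\d{2}|\d{1,3})(?:\.\d{1,2})?$")


def normalize_icd9(code):
    """Return the category prefix of an ICD-9 code, or None if it does not parse."""
    if not isinstance(code, str):
        return None
    match = ICD9_PATTERN.match(code.strip().upper())
    if match is None:
        return None
    category = match.group(1)
    return category.zfill(3) if category[0].isdigit() else category


missing_share = sentinel_share(raw["diag_1"])
diag_clean = sentinels_to_nan(raw["diag_1"]).map(normalize_icd9)

# Phase 2b (MICE-style imputation): each numeric column with gaps is modelled
# from the other numeric columns in round-robin fashion.
numeric_cols = ["age_years", "time_in_hospital", "num_medications"]
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(
    imputer.fit_transform(raw[numeric_cols]),
    columns=numeric_cols,
    index=raw.index,
)

print(f"share of '?' in diag_1: {missing_share:.0%}")
print(diag_clean.tolist())
print(imputed.round(1))
```

Observed values pass through the imputer unchanged; only the gaps are filled. A Phase 3 check would then compare the distribution of each imputed column against its observed values.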
Quick Links
Research Team (Milestone 3 Roles)
- Ashley Love: Project Manager (SDLC Management & Documentation)
- Christian Shannon: Data Wrangler (Clinical Data Profiling & Audit)
- Kirsten Livingston: Data Scientist (MICE Imputation & RE Logic)
- Mugtaba Awad: Data Visualizer (Distribution Analysis & Presentation)