Data Quality Index (DQI) Framework
The DQI serves as the primary quantitative metric for evaluating the efficacy of the automated remediation pipeline.
📐 The DQI Formula
The aggregate health of the dataset is expressed through the following composite KPI:
\[\mathrm{DQI} = \frac{\text{Completeness} + \text{Validity} + \text{Consistency}}{3}\]
🔍 The Three Pillars of Fidelity
1. Completeness
- Action: Replaced listwise deletion with Multiple Imputation by Chained Equations (MICE), preserving the full sample of 101,766 encounters.
- Impact: Resolved the 97% missingness found in patient weight data.
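To illustrate why listwise deletion was worth replacing, here is a minimal sketch on toy data. It does not implement MICE itself (in Python that is commonly done with a library such as scikit-learn's `IterativeImputer`); it only shows how many rows survive when every row with a missing field is dropped. The field names `age` and `weight_kg`, the toy values, and the cell-level `completeness` measure are assumptions made for illustration, not taken from the pipeline.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def listwise_delete(rows: List[Row]) -> List[Row]:
    """Keep only rows in which every field is present (no None values)."""
    return [row for row in rows if all(v is not None for v in row.values())]


def completeness(rows: List[Row]) -> float:
    """Fraction of cells that are non-missing (an assumed definition)."""
    cells = [v for row in rows for v in row.values()]
    return sum(v is not None for v in cells) / len(cells) if cells else 1.0


# Toy data: 100 encounters, weight recorded for only 3 of them,
# mirroring the 97% missingness quoted in the text.
rows: List[Row] = [
    {"age": 60.0, "weight_kg": 80.0 if i < 3 else None} for i in range(100)
]

kept = listwise_delete(rows)
print(len(kept), "of", len(rows), "rows survive listwise deletion")
print(f"cell-level completeness before imputation: {completeness(rows):.3f}")
```

At the scale quoted above, a field that is 97% missing would leave roughly 3,053 of 101,766 encounters after listwise deletion, which is the loss that imputation avoids.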
2. Validity
- Action: Programmatically enforced clinical constraints via DataAuditor.py.
- Impact: Ensured lab results (e.g., HbA1c) fall within medically plausible ranges.
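A range check of this kind might look like the sketch below. This is not the contents of DataAuditor.py, which the text does not show: the `RangeRule` class, the `hba1c_pct` field name, and the 3.0 to 20.0 bounds are hypothetical placeholders, and real bounds should come from clinical guidance.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[float]]


@dataclass(frozen=True)
class RangeRule:
    """Inclusive plausible range for one numeric field (hypothetical helper)."""
    field: str
    low: float
    high: float


# Illustrative bounds only; NOT taken from DataAuditor.py or a clinical source.
RULES = [RangeRule("hba1c_pct", 3.0, 20.0)]


def audit(records: List[Record], rules: List[RangeRule]) -> List[Tuple[int, str, float]]:
    """Return (row_index, field, value) for every value outside its range."""
    violations = []
    for i, rec in enumerate(records):
        for rule in rules:
            value = rec.get(rule.field)
            if value is None:
                continue  # missing values are a completeness issue, not a validity one
            if not (rule.low <= value <= rule.high):
                violations.append((i, rule.field, value))
    return violations


def validity_score(records: List[Record], rules: List[RangeRule]) -> float:
    """Share of checked (non-missing) values that pass their rule."""
    checked = sum(
        1 for rec in records for rule in rules if rec.get(rule.field) is not None
    )
    if checked == 0:
        return 1.0
    return 1 - len(audit(records, rules)) / checked


records: List[Record] = [
    {"hba1c_pct": 6.5},
    {"hba1c_pct": 45.0},   # implausible under the illustrative bounds
    {"hba1c_pct": None},   # skipped: nothing to validate
    {"hba1c_pct": 9.1},
]
print(audit(records, RULES))
print(round(validity_score(records, RULES), 3))
```

Returning the violations rather than silently clipping values keeps the audit separate from the decision about how to remediate each flagged record.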
3. Consistency
- Action: Used regular expressions to standardize fragmented ICD-9 diagnosis codes.
- Impact: Prevented dataset fragmentation across longitudinal hospital records.
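The standardization step could be sketched as follows. The text does not say which variants count as "fragmented", so this example assumes three: stray whitespace, lowercase V/E prefixes, and numeric categories with leading zeros dropped (for example `8` for `008`). The function name `normalize_icd9` and the choice to return `None` for unrecognized input are likewise assumptions.

```python
import re
from typing import Optional

# Three ICD-9 shapes: numeric (NNN[.N[N]]), V codes (VNN[.N[N]]), E codes (ENNN[.N]).
_ICD9 = re.compile(
    r"""^\s*
    (?:
        (?P<num>\d{1,3})(?:\.(?P<num_sub>\d{1,2}))?   # numeric, e.g. 250.01 or 8
      | [Vv](?P<v>\d{2})(?:\.(?P<v_sub>\d{1,2}))?     # supplementary, e.g. V45.81
      | [Ee](?P<e>\d{3})(?:\.(?P<e_sub>\d))?          # external cause, e.g. E909
    )
    \s*$""",
    re.VERBOSE,
)


def normalize_icd9(raw: str) -> Optional[str]:
    """Return a canonical ICD-9 string, or None if the input is not recognized."""
    m = _ICD9.match(raw)
    if m is None:
        return None
    if m["num"] is not None:
        # Assumption: short numeric categories lost their leading zeros.
        category, sub = m["num"].zfill(3), m["num_sub"]
    elif m["v"] is not None:
        category, sub = "V" + m["v"], m["v_sub"]
    else:
        category, sub = "E" + m["e"], m["e_sub"]
    return f"{category}.{sub}" if sub else category


for raw in [" 250.01 ", "8", "v45.81", "e909", "unknown"]:
    print(repr(raw), "->", normalize_icd9(raw))
```

Mapping every variant to one canonical spelling means the same diagnosis groups together across a patient's encounters; unrecognized values come back as `None` so they can be reviewed rather than guessed at.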