
Technical Stakeholder Q&A (Milestone 3)

Q1: Who are the primary stakeholders for this pipeline? A: Clinical administrators, hospital data scientists, and preventative care teams focused on reducing readmission rates.

Q2: What was the primary “remedy” applied to the data? A: A combination of Multivariate Imputation by Chained Equations (MICE) for missingness and Regular Expression (RE) patterns for clinical code normalization.
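
The Q&A does not list the patterns or name the code system, so the following is only a minimal sketch of regex-based code normalization. It assumes ICD-9-style diagnosis codes; `CODE_PATTERN` and `normalize_code` are hypothetical names introduced for illustration.

```python
import re

# Assumed shape: optional V/E prefix, 2-3 digits, optional 1-2 digit suffix
# (e.g. "250.83", "V57", "E909"). The project's real patterns are not shown.
CODE_PATTERN = re.compile(r"^([VE]?)(\d{2,3})(?:\.(\d{1,2}))?$")


def normalize_code(raw):
    """Return a trimmed, upper-cased code, or None if the value is not code-like."""
    if raw is None:
        return None
    candidate = raw.strip().upper()
    match = CODE_PATTERN.match(candidate)
    if match is None:
        return None  # placeholders such as "?" fall through here
    prefix, major, minor = match.groups()
    return f"{prefix}{major}.{minor}" if minor else f"{prefix}{major}"


for value in [" v57 ", "250.83", "e909", "?"]:
    print(repr(value), "->", normalize_code(value))
```

Returning `None` for anything that does not match keeps malformed codes visible as missing values, so they can be handled by the imputation step instead of passing through silently.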

Q3: Why was MICE chosen over simple mean/median imputation? A: Clinical markers are interdependent. MICE models each variable from the others, preserving these relationships (e.g., the correlation between weight and lab results), whereas mean imputation shrinks variance and weakens correlations between variables, which biases downstream estimates.
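
To make the contrast concrete, here is a small sketch on synthetic data (not the project's data). It uses scikit-learn's `IterativeImputer`, an experimental estimator inspired by MICE that returns a single imputed dataset; the Q&A does not say which library the team used, so treat that choice, the marker names, and the 40% missingness rate as assumptions.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the next import)
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for two interdependent clinical markers.
marker_a = rng.normal(80, 15, n)
marker_b = 0.5 * marker_a + rng.normal(0, 5, n)
complete = np.column_stack([marker_a, marker_b])

# Hide 40% of marker_b at random to simulate missingness.
with_gaps = complete.copy()
with_gaps[rng.random(n) < 0.4, 1] = np.nan

mean_filled = SimpleImputer(strategy="mean").fit_transform(with_gaps)
# sample_posterior=True draws imputations from the predictive distribution
# instead of placing every imputed value exactly on the regression line.
mice_filled = IterativeImputer(sample_posterior=True, random_state=0).fit_transform(with_gaps)


def correlation(table):
    """Pearson correlation between the two columns."""
    return np.corrcoef(table[:, 0], table[:, 1])[0, 1]


for label, table in [("complete", complete), ("mean", mean_filled), ("chained", mice_filled)]:
    print(f"{label:>8}: corr={correlation(table):.2f}  var(marker_b)={table[:, 1].var():.1f}")
```

On data like this, the mean-filled column should show lower variance and a weaker correlation than the complete data, while the chained-equations fill should stay closer to both.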

Q4: How did the team handle the 97% missingness in patient weight? A: Rather than deleting the records, we used auxiliary variables (gender, age, and diagnoses) to impute weight values, maintaining the full sample size of 101,766 encounters.

Q5: How does the pipeline handle the high concentration of elderly patients? A: The imputation logic accounts for the age skew by including “medication counts” and “time in hospital” as key predictors, so that imputed values reflect geriatric clinical patterns.

Q6: How do DQI dimensions (Completeness, Validity, Consistency) relate to patient safety? A: High-fidelity data helps ensure that clinical decision support systems do not base recommendations on “hallucinated” or fragmented data, reducing the risk of medical error.

Q7: How did the team ensure remediation didn’t introduce “data drift”? A: We used Kernel Density Estimate (KDE) plots to overlay pre- and post-remediation distributions. The matching “shape” of the two curves indicated that remediation preserved the original distributions.
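
The overlay check can be complemented with a number. The sketch below is an illustration on synthetic data, not the team's code: it evaluates `scipy.stats.gaussian_kde` for a column before and after filling, then integrates the absolute difference between the two curves. The `density_gap` helper and the gamma-distributed sample are assumptions made for the example.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)

# Synthetic stand-ins (not project data): observed values plus two candidate fills.
observed = rng.gamma(shape=2.0, scale=2.0, size=3000)
model_fill = rng.gamma(shape=2.0, scale=2.0, size=1000)  # follows the observed distribution
mean_fill = np.full(1000, observed.mean())               # piles every fill at the mean


def density_gap(before, after, points=512):
    """Integrated absolute difference between two KDE curves (0 means identical)."""
    low = min(before.min(), after.min())
    high = max(before.max(), after.max())
    grid = np.linspace(low, high, points)
    diff = np.abs(gaussian_kde(before)(grid) - gaussian_kde(after)(grid))
    return float(diff.sum() * (grid[1] - grid[0]))  # simple Riemann sum


gap_model = density_gap(observed, np.concatenate([observed, model_fill]))
gap_mean = density_gap(observed, np.concatenate([observed, mean_fill]))
print(f"model-based fill gap: {gap_model:.3f}")
print(f"mean fill gap:        {gap_mean:.3f}")
```

Plotting the two density curves on the same axes gives the overlay described in the answer. The gap value adds a threshold that can be tracked across pipeline runs.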

Q8: What was the primary obstacle in the Data Acquisition phase? A: “Sentinel Unmasking”: identifying that characters like “?” were placeholders for null values rather than actual data, which required a manual clinical audit before automation.
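
Once a sentinel has been confirmed by audit, the unmasking itself is small. The sketch below uses the `na_values` parameter of pandas `read_csv`; the sample rows and column names are hypothetical and stand in for the real extract.

```python
import io

import pandas as pd

# Hypothetical sample rows; "?" is the placeholder confirmed by the audit.
raw = io.StringIO(
    "encounter_id,weight,race\n"
    "1,?,Caucasian\n"
    "2,82,?\n"
    "3,?,AfricanAmerican\n"
)

naive = pd.read_csv(raw)                      # "?" is read as an ordinary string
raw.seek(0)
unmasked = pd.read_csv(raw, na_values=["?"])  # "?" is read as a missing value

print("missing before unmasking:")
print(naive.isna().sum())
print("missing after unmasking:")
print(unmasked.isna().sum())
```

Without `na_values`, the completeness check reports no missing data at all, which is why the sentinel has to be identified before any automated profiling is trusted.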

Q9: Is the remediation pipeline department-specific? A: No. The architecture is department-agnostic and can be scaled to Cardiology or Oncology by updating the feature constraints in the DataAuditor class.
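
The real `DataAuditor` interface is not shown in this Q&A, so the following is a hypothetical stand-in that illustrates the idea: the audit logic stays the same and only the per-department feature constraints change. The feature names and numeric ranges are invented for the example and are not clinical guidance.

```python
from dataclasses import dataclass, field


@dataclass
class DataAuditor:
    """Illustrative stand-in; not the project's actual class."""

    # Maps feature name -> (lowest allowed value, highest allowed value).
    constraints: dict = field(default_factory=dict)

    def audit(self, record):
        """Return a list of human-readable constraint violations for one record."""
        violations = []
        for feature, (low, high) in self.constraints.items():
            value = record.get(feature)
            if value is None:
                violations.append(f"{feature}: missing")
            elif not low <= value <= high:
                violations.append(f"{feature}: {value} outside [{low}, {high}]")
        return violations


# Same class, different constraints per department (all values hypothetical).
cardiology = DataAuditor({"ejection_fraction": (5, 90), "time_in_hospital": (1, 14)})
oncology = DataAuditor({"tumor_size_mm": (0, 300), "time_in_hospital": (1, 14)})

print(cardiology.audit({"ejection_fraction": 120, "time_in_hospital": 3}))
print(oncology.audit({"time_in_hospital": 2}))
```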

Q10: How does this project demonstrate “Role Playing Variety”? A: Per the DSC450 handout, our team rotated roles for this milestone (Project Manager, Wrangler, Scientist, Visualizer) to demonstrate cross-functional mastery of the data science process.