BACKGROUND: Housing and income are important social determinants of health (SDoH). Primary care providers often lack information about these determinants, even though it could support equitable health system planning and care delivery. The aim of this study was to use primary care electronic medical record (EMR) data to test two approaches (machine learning and regular expression searches) for identifying patients' housing instability and low income status.
METHODS: We used de-identified EMR data from the St. Michael's Hospital Academic Family Health Team (Toronto, Ontario, Canada). The reference standard was a Health Equity Questionnaire that is routinely distributed to patients and includes questions about income and housing status. The first approach was a regular expression (REGEX) classifier built from key text terms and codes; the second used supervised machine learning models (XGBoost). Discrimination and calibration metrics for each approach were calculated against the patient-reported responses.
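As an illustration only, the REGEX approach and its comparison against questionnaire responses might be sketched as follows. The search terms, the example notes, and the function names (`regex_flag`, `sensitivity_and_ppv`) are hypothetical and are not taken from the study; the actual term and code list is not given here.

```python
import re

# Hypothetical key terms for housing instability (NOT the study's term list).
HOUSING_PATTERN = re.compile(
    r"\b(homeless(ness)?|shelter|evict(ed|ion)|no fixed address)\b",
    re.IGNORECASE,
)


def regex_flag(note: str) -> bool:
    """Return True if the free-text note contains any key term."""
    return HOUSING_PATTERN.search(note) is not None


def sensitivity_and_ppv(predicted, reference):
    """Compare classifier flags with the reference standard.

    sensitivity = TP / (TP + FN); PPV = TP / (TP + FP).
    Returns NaN for a metric whose denominator is zero.
    """
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)
    fp = sum(1 for p, r in pairs if p and not r)
    fn = sum(1 for p, r in pairs if not p and r)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


# Made-up notes and made-up questionnaire answers, for illustration only.
notes = [
    "Patient currently staying in a shelter.",
    "Follow-up for hypertension; no concerns.",
    "Discussed eviction notice received last week.",
    "Works at a homeless shelter as a volunteer.",
    "Annual physical, stable housing.",
]
reference = [True, True, True, False, False]

predicted = [regex_flag(note) for note in notes]
sensitivity, ppv = sensitivity_and_ppv(predicted, reference)
print(f"sensitivity={sensitivity:.3f}, PPV={ppv:.3f}")
```

In this toy data the second note shows how a patient with a housing need can go unflagged when the chart never mentions it (a false negative), and the fourth shows a key term appearing in an unrelated context (a false positive).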
RESULTS: The housing cohort included 11,794 eligible patients and the income cohort included 10,454. Both approaches had poor sensitivity for identifying housing instability (XGBoost: 3.1%, REGEX: 29.0%) and low income status (XGBoost: 41.7%, REGEX: 17.6%). Positive predictive value (PPV) was satisfactory for the machine learning approach (83.3% for housing, 72.9% for income).
CONCLUSION: Although the machine learning approach demonstrated reasonable PPV, overall performance was poor, and neither approach is likely to be useful in a clinical setting for identifying patients with housing or economic needs. More robust analytic methods could be explored, but continued collection of patient-reported SDoH information remains necessary.