OBJECTIVE: This scoping review aimed to (1) map current applications of transformers and large language models (LLMs) for extracting social drivers of health (SDOH) from clinical text, (2) benchmark model performance across SDOH domains, and (3) evaluate methodological rigor to identify research gaps and inform clinical deployment.
MATERIALS AND METHODS: We searched PubMed, Web of Science, Embase, Scopus, and IEEE Xplore for studies applying transformers or LLMs to detect SDOH in clinical narratives. We developed a novel methodological framework integrating (1) hierarchical classification of SDOH domains and transformer/LLM architectures, (2) systematic synthesis of performance metrics, and (3) a 7-domain instrument assessing internal validity, external validity, and reporting transparency.
RESULTS: Forty-two studies met inclusion criteria. Performance varied substantially across SDOH domains: Behavioral Factors achieved the highest performance (median F1 = 0.87), while Health Care Access and Quality showed the lowest performance and greatest variability (median F1 = 0.59). Research concentrated in the United States (85.7%), relied predominantly on private institutional datasets (69%), and focused primarily on critical care populations (45.2%). Methodological assessment revealed critical gaps: only 29% of studies provided annotation guidelines, 24% assessed fairness across demographic groups, and 21% performed external validation.
DISCUSSION: Smaller open-source transformer models show promise for democratizing SDOH detection by achieving competitive performance at lower costs while enabling secure local deployment in resource-limited settings. Advancing clinical readiness requires standardized reporting practices, diverse benchmark datasets across care settings, and systematic equity evaluation to prevent perpetuating health disparities.
CONCLUSION: Transformer and LLM performance for SDOH detection varied substantially across domains, with encoder-based models excelling at structured extraction tasks and decoder-only models at linguistically complex ones. Critical gaps in fairness assessment, external validation, and dataset diversity restrict generalizability and readiness for widespread clinical deployment.