Automated Extraction of Mortality Information From Publicly Available Sources Using Large Language Models: Development and Evaluation Study

Bibliographic Details
Main Authors: Mohammed Al-Garadi, Michele LeNoue-Newton, Michael E Matheny, Melissa McPheeters, Jill M Whitaker, Jessica A Deere, Michael F McLemore, Dax Westerman, Mirza S Khan, José J Hernández-Muñoz, Xi Wang, Aida Kuzucan, Rishi J Desai, Ruth Reeves
Format: Article
Language: English
Published: JMIR Publications 2025-08-01
Series: Journal of Medical Internet Research
Online Access: https://www.jmir.org/2025/1/e71113
Description
Summary:
Background: Mortality is a critical variable in health care research, especially for evaluating medical product safety and effectiveness. However, inconsistencies in the availability and timeliness of death date and cause of death (CoD) information present significant challenges. Conventional sources such as the National Death Index and electronic health records often experience data lags, missing fields, or incomplete coverage, limiting their utility in time-sensitive or large-scale studies. With the growing use of social media, crowdfunding platforms, and web-based memorials, publicly available digital content has emerged as a potential supplementary source for mortality surveillance. Despite this potential, accurate tools for extracting mortality information from such unstructured data sources remain underdeveloped.
Objective: The aim of the study is to develop scalable approaches using natural language processing (NLP) and large language models (LLMs) for the extraction of mortality information from publicly available web-based data sources, including social media platforms, crowdfunding websites, and web-based obituaries, and to evaluate their performance across various sources.
Methods: Data were collected from public posts on X (formerly known as Twitter), GoFundMe campaigns, memorial websites (EverLoved and TributeArchive), and web-based obituaries from 2015 to 2022, focusing on US-based content relevant to mortality. We developed an NLP pipeline using transformer-based models to extract key mortality information such as decedent names, dates of birth, and dates of death. We then used a few-shot learning (FSL) approach with LLMs to identify primary and secondary CoDs. Model performance was assessed using precision, recall, F1 [...]
Results: The best-performing model obtained a microaveraged F1 [...]
Conclusions: This study demonstrates the feasibility of using advanced NLP and LLM techniques to extract mortality data from publicly available web-based sources. These methods can significantly enhance the timeliness, completeness, and granularity of mortality surveillance, offering a valuable complement to traditional data systems. By enabling earlier detection of mortality signals and improving CoD classification across large populations, this approach may support more responsive public health monitoring and medical product safety assessments. Further work is needed to validate these findings in real-world health care settings and facilitate the integration of digital data sources into national public health surveillance systems.
ISSN:1438-8871