Automated Extraction of Mortality Information From Publicly Available Sources Using Large Language Models: Development and Evaluation Study

Abstract Background: Mortality is a critical variable in health care research, especially for evaluating medical product safety and effectiveness. However, inconsistencies in the availability and timeliness of death date and cause of death (CoD) information present significant c...

Bibliographic Details
Main Authors: Mohammed Al-Garadi, Michele LeNoue-Newton, Michael E Matheny, Melissa McPheeters, Jill M Whitaker, Jessica A Deere, Michael F McLemore, Dax Westerman, Mirza S Khan, José J Hernández-Muñoz, Xi Wang, Aida Kuzucan, Rishi J Desai, Ruth Reeves
Format: Article
Language: English
Published: JMIR Publications 2025-08-01
Series: Journal of Medical Internet Research
Online Access: https://www.jmir.org/2025/1/e71113
author Mohammed Al-Garadi
Michele LeNoue-Newton
Michael E Matheny
Melissa McPheeters
Jill M Whitaker
Jessica A Deere
Michael F McLemore
Dax Westerman
Mirza S Khan
José J Hernández-Muñoz
Xi Wang
Aida Kuzucan
Rishi J Desai
Ruth Reeves
collection DOAJ
description Abstract
Background: Mortality is a critical variable in health care research, especially for evaluating medical product safety and effectiveness. However, inconsistencies in the availability and timeliness of death date and cause of death (CoD) information present significant challenges. Conventional sources such as the National Death Index and electronic health records often experience data lags, missing fields, or incomplete coverage, limiting their utility in time-sensitive or large-scale studies. With the growing use of social media, crowdfunding platforms, and web-based memorials, publicly available digital content has emerged as a potential supplementary source for mortality surveillance. Despite this potential, accurate tools for extracting mortality information from such unstructured data sources remain underdeveloped.
Objective: The aim of the study is to develop scalable approaches using natural language processing (NLP) and large language models (LLMs) for the extraction of mortality information from publicly available web-based data sources, including social media platforms, crowdfunding websites, and web-based obituaries, and to evaluate their performance across various sources.
Methods: Data were collected from public posts on X (formerly known as Twitter), GoFundMe campaigns, memorial websites (EverLoved and TributeArchive), and web-based obituaries from 2015 to 2022, focusing on US-based content relevant to mortality. We developed an NLP pipeline using transformer-based models to extract key mortality information such as decedent names, dates of birth, and dates of death. We then used a few-shot learning (FSL) approach with LLMs to identify primary and secondary CoDs. Model performance was assessed using precision, recall, F1 [...]
Results: The best-performing model obtained a microaveraged F1 [...]
Conclusions: This study demonstrates the feasibility of using advanced NLP and LLM techniques to extract mortality data from publicly available web-based sources. These methods can significantly enhance the timeliness, completeness, and granularity of mortality surveillance, offering a valuable complement to traditional data systems. By enabling earlier detection of mortality signals and improving CoD classification across large populations, this approach may support more responsive public health monitoring and medical product safety assessments. Further work is needed to validate these findings in real-world health care settings and facilitate the integration of digital data sources into national public health surveillance systems.
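The abstract names precision, recall, and microaveraged F1 as the evaluation metrics, but this record does not include the scores or the scoring rules. As an illustration only, the sketch below shows how microaveraged scores are commonly computed: true positives, false positives, and false negatives are pooled across all documents before the ratios are taken. The function name `micro_prf`, the exact-match rule on (field, value) pairs, and the toy records are all assumptions made for this example and are not taken from the study.

```python
def micro_prf(gold, pred):
    """Microaveraged precision, recall, and F1.

    gold, pred: parallel lists with one set of (field, value) pairs per
    document. Exact match on the pair is an assumption of this sketch.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # extracted and correct
        fp += len(p - g)   # extracted but not in the reference
        fn += len(g - p)   # in the reference but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up toy records, not data from the study.
gold = [
    {("name", "Jane Doe"), ("date_of_death", "2020-01-05")},
    {("name", "John Roe")},
]
pred = [
    {("name", "Jane Doe"), ("date_of_death", "2020-01-06")},  # wrong date
    {("name", "John Roe")},
]
precision, recall, f1 = micro_prf(gold, pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because counts are pooled before dividing, documents or fields with many entities weigh more heavily in the microaverage than they would in a macroaverage, which averages per-class scores instead.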
format Article
id doaj-art-a67d35ffb6864075bded53749eaf666b
institution Kabale University
issn 1438-8871
language English
publishDate 2025-08-01
publisher JMIR Publications
record_format Article
series Journal of Medical Internet Research
spelling doaj-art-a67d35ffb6864075bded53749eaf666b
2025-08-25T13:20:53Z
eng
JMIR Publications
Journal of Medical Internet Research
1438-8871
2025-08-01
27
e71113
10.2196/71113
Automated Extraction of Mortality Information From Publicly Available Sources Using Large Language Models: Development and Evaluation Study
Mohammed Al-Garadi http://orcid.org/0000-0002-6991-2687
Michele LeNoue-Newton http://orcid.org/0000-0003-3469-3784
Michael E Matheny http://orcid.org/0000-0003-3217-4147
Melissa McPheeters http://orcid.org/0000-0002-4423-797X
Jill M Whitaker http://orcid.org/0009-0000-5451-2667
Jessica A Deere http://orcid.org/0009-0003-0677-0364
Michael F McLemore http://orcid.org/0009-0001-4772-4810
Dax Westerman http://orcid.org/0000-0002-9547-7789
Mirza S Khan http://orcid.org/0000-0001-7007-9437
José J Hernández-Muñoz http://orcid.org/0000-0002-2553-3159
Xi Wang http://orcid.org/0000-0001-9478-0199
Aida Kuzucan http://orcid.org/0000-0003-0893-7028
Rishi J Desai http://orcid.org/0000-0003-0299-7273
Ruth Reeves http://orcid.org/0000-0003-4260-2707
https://www.jmir.org/2025/1/e71113
title Automated Extraction of Mortality Information From Publicly Available Sources Using Large Language Models: Development and Evaluation Study
url https://www.jmir.org/2025/1/e71113