AI in Home Care—Evaluation of Large Language Models for Future Training of Informal Caregivers: Observational Comparative Case Study

Bibliographic Details
Main Authors: Clara Pérez-Esteve, Mercedes Guilabert, Valerie Matarredona, Einav Srulovici, Susanna Tella, Reinhard Strametz, José Joaquín Mira
Format: Article
Language:English
Published: JMIR Publications 2025-04-01
Series:Journal of Medical Internet Research
Online Access:https://www.jmir.org/2025/1/e70703
_version_ 1850177172540489728
author Clara Pérez-Esteve
Mercedes Guilabert
Valerie Matarredona
Einav Srulovici
Susanna Tella
Reinhard Strametz
José Joaquín Mira
author_facet Clara Pérez-Esteve
Mercedes Guilabert
Valerie Matarredona
Einav Srulovici
Susanna Tella
Reinhard Strametz
José Joaquín Mira
author_sort Clara Pérez-Esteve
collection DOAJ
description Background: The aging population represents an achievement for society but also poses significant challenges for governments, health care systems, and caregivers. Elevated rates of functional limitations among older adults, primarily caused by chronic conditions, necessitate adequate and safe care, including in-home settings. Traditionally, informal caregiver training has relied on verbal and written instructions. The advent of digital resources, however, has introduced videos and interactive platforms, offering more accessible and effective training. Large language models (LLMs) have emerged as potential tools for personalized information delivery. While LLMs exhibit the capacity to mimic clinical reasoning and support decision-making, their potential to serve as an alternative to evidence-based professional instruction remains unexplored. Objective: We aimed to evaluate the appropriateness of home care instructions generated by LLMs (including GPTs) in comparison to a professional gold standard. We also sought to identify the specific domains where LLMs show the most promise and where improvements are necessary to optimize their reliability for caregiver training. Methods: An observational, comparative case study evaluated 3 LLMs (GPT-3.5, GPT-4o, and Microsoft Copilot) in 10 home care scenarios. A rubric assessed the models against a reference (gold) standard created by health care professionals. Independent reviewers rated variables including specificity, clarity, and self-efficacy. In addition to comparing each LLM to the gold standard, the models were compared against each other across all study domains to identify relative strengths and weaknesses. Statistical analyses compared LLM performance to the gold standard to ensure consistency and validity and examined differences between LLMs across all evaluated domains.
Results: While no LLM achieved the precision of the professional gold standard, GPT-4o outperformed GPT-3.5 and Copilot in specificity (4.6 vs 3.7 and 3.6), clarity (4.8 vs 4.1 and 3.9), and self-efficacy (4.6 vs 3.8 and 3.4). However, the models exhibited significant limitations: GPT-4o and Copilot omitted relevant details in 60% (6/10) of cases, and GPT-3.5 did so in 80% (8/10). Compared with the gold standard, only 10% (2/20) of GPT-4o responses were rated as equally specific, 20% (4/20) included comparable practical advice, and just 5% (1/20) provided a justification as detailed as the professional guidance. Error frequency did not differ significantly across models (P=.65), though Copilot had the highest rate of incorrect information (20%, 2/10 vs 10%, 1/10 for GPT-4o and 0%, 0/10 for GPT-3.5). Conclusions: LLMs, particularly the subscription-based GPT-4o, show potential as tools for training informal caregivers by providing tailored guidance and reducing errors. Although they do not yet match the quality of professional instruction, these models offer a flexible and accessible alternative that could enhance home safety and care quality. Further research is necessary to address their limitations and optimize performance. Future implementation of LLMs may alleviate health care system burdens by reducing common caregiver errors.
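The rubric-based comparison described in the Methods can be illustrated with a minimal sketch: compute each model's mean rubric score per domain and its gap to the gold-standard ceiling. This is not the authors' analysis code; the score values, the 1-5 scale, and the `summarize` helper are hypothetical, for demonstration only.

```python
# Illustrative sketch (not the study's code): aggregating hypothetical
# rubric ratings per model/domain and measuring the gap to a
# gold-standard ceiling. All numbers below are made up.
from statistics import mean

GOLD = 5.0  # assumed gold-standard ceiling on a 1-5 rubric scale

# scores[model][domain] -> rubric ratings across scenarios (hypothetical)
scores = {
    "GPT-4o":  {"specificity": [5, 4, 5, 4], "clarity": [5, 5, 4, 5]},
    "GPT-3.5": {"specificity": [4, 3, 4, 4], "clarity": [4, 4, 4, 4]},
    "Copilot": {"specificity": [4, 3, 3, 4], "clarity": [4, 4, 4, 4]},
}

def summarize(scores):
    """Return (mean score, gap to gold standard) per model and domain."""
    return {
        model: {d: (mean(v), GOLD - mean(v)) for d, v in domains.items()}
        for model, domains in scores.items()
    }

summary = summarize(scores)
for model, domains in summary.items():
    for domain, (avg, gap) in domains.items():
        print(f"{model:8s} {domain:12s} mean={avg:.2f} gap_to_gold={gap:.2f}")
```

A real analysis would additionally test whether the per-domain differences between models are statistically significant, as the study reports for error frequency (P=.65).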
format Article
id doaj-art-fa386a82371249d9bd1afe6ca047c429
institution OA Journals
issn 1438-8871
language English
publishDate 2025-04-01
publisher JMIR Publications
record_format Article
series Journal of Medical Internet Research
spelling doaj-art-fa386a82371249d9bd1afe6ca047c429 2025-08-20T02:19:03Z eng JMIR Publications Journal of Medical Internet Research 1438-8871 2025-04-01 27 e70703 10.2196/70703 AI in Home Care—Evaluation of Large Language Models for Future Training of Informal Caregivers: Observational Comparative Case Study
Clara Pérez-Esteve https://orcid.org/0009-0008-8009-0507
Mercedes Guilabert https://orcid.org/0009-0008-8009-0507
Valerie Matarredona https://orcid.org/0009-0008-0419-3600
Einav Srulovici https://orcid.org/0000-0003-1291-8284
Susanna Tella https://orcid.org/0000-0003-1291-8284
Reinhard Strametz https://orcid.org/0000-0002-9920-8674
José Joaquín Mira https://orcid.org/0000-0001-6497-083X
https://www.jmir.org/2025/1/e70703
spellingShingle Clara Pérez-Esteve
Mercedes Guilabert
Valerie Matarredona
Einav Srulovici
Susanna Tella
Reinhard Strametz
José Joaquín Mira
AI in Home Care—Evaluation of Large Language Models for Future Training of Informal Caregivers: Observational Comparative Case Study
Journal of Medical Internet Research
title AI in Home Care—Evaluation of Large Language Models for Future Training of Informal Caregivers: Observational Comparative Case Study
title_full AI in Home Care—Evaluation of Large Language Models for Future Training of Informal Caregivers: Observational Comparative Case Study
title_fullStr AI in Home Care—Evaluation of Large Language Models for Future Training of Informal Caregivers: Observational Comparative Case Study
title_full_unstemmed AI in Home Care—Evaluation of Large Language Models for Future Training of Informal Caregivers: Observational Comparative Case Study
title_short AI in Home Care—Evaluation of Large Language Models for Future Training of Informal Caregivers: Observational Comparative Case Study
title_sort ai in home care evaluation of large language models for future training of informal caregivers observational comparative case study
url https://www.jmir.org/2025/1/e70703
work_keys_str_mv AT claraperezesteve aiinhomecareevaluationoflargelanguagemodelsforfuturetrainingofinformalcaregiversobservationalcomparativecasestudy
AT mercedesguilabert aiinhomecareevaluationoflargelanguagemodelsforfuturetrainingofinformalcaregiversobservationalcomparativecasestudy
AT valeriematarredona aiinhomecareevaluationoflargelanguagemodelsforfuturetrainingofinformalcaregiversobservationalcomparativecasestudy
AT einavsrulovici aiinhomecareevaluationoflargelanguagemodelsforfuturetrainingofinformalcaregiversobservationalcomparativecasestudy
AT susannatella aiinhomecareevaluationoflargelanguagemodelsforfuturetrainingofinformalcaregiversobservationalcomparativecasestudy
AT reinhardstrametz aiinhomecareevaluationoflargelanguagemodelsforfuturetrainingofinformalcaregiversobservationalcomparativecasestudy
AT josejoaquinmira aiinhomecareevaluationoflargelanguagemodelsforfuturetrainingofinformalcaregiversobservationalcomparativecasestudy