AI in Home Care—Evaluation of Large Language Models for Future Training of Informal Caregivers: Observational Comparative Case Study

Bibliographic Details
Main Authors: Clara Pérez-Esteve, Mercedes Guilabert, Valerie Matarredona, Einav Srulovici, Susanna Tella, Reinhard Strametz, José Joaquín Mira
Format: Article
Language:English
Published: JMIR Publications 2025-04-01
Series:Journal of Medical Internet Research
Online Access:https://www.jmir.org/2025/1/e70703
_version_ 1850177172540489728
author Clara Pérez-Esteve
Mercedes Guilabert
Valerie Matarredona
Einav Srulovici
Susanna Tella
Reinhard Strametz
José Joaquín Mira
author_facet Clara Pérez-Esteve
Mercedes Guilabert
Valerie Matarredona
Einav Srulovici
Susanna Tella
Reinhard Strametz
José Joaquín Mira
author_sort Clara Pérez-Esteve
collection DOAJ
description Background: The aging population represents an achievement for society but also poses significant challenges for governments, health care systems, and caregivers. Elevated rates of functional limitations among older adults, primarily caused by chronic conditions, necessitate adequate and safe care, including in-home settings. Traditionally, informal caregiver training has relied on verbal and written instructions. The advent of digital resources, however, has introduced videos and interactive platforms, offering more accessible and effective training. Large language models (LLMs) have emerged as potential tools for personalized information delivery. While LLMs exhibit the capacity to mimic clinical reasoning and support decision-making, their potential to serve as an alternative to evidence-based professional instruction remains unexplored. Objective: We aimed to evaluate the appropriateness of home care instructions generated by LLMs (including GPTs) in comparison to a professional gold standard. We also sought to identify the specific domains where LLMs show the most promise and where improvements are necessary to optimize their reliability for caregiver training. Methods: An observational, comparative case study evaluated 3 LLMs (GPT-3.5, GPT-4o, and Microsoft Copilot) in 10 home care scenarios. A rubric assessed the models against a reference (gold) standard created by health care professionals. Independent reviewers rated variables including specificity, clarity, and self-efficacy. In addition to comparing each LLM to the gold standard, the models were compared against each other across all study domains to identify relative strengths and weaknesses. Statistical analyses compared LLM performance to the gold standard to ensure consistency and validity and examined differences between LLMs across all evaluated domains.
Results: While no LLM achieved the precision of the professional gold standard, GPT-4o outperformed GPT-3.5 and Copilot in specificity (4.6 vs 3.7 and 3.6), clarity (4.8 vs 4.1 and 3.9), and self-efficacy (4.6 vs 3.8 and 3.4). However, the models exhibited significant limitations: GPT-4o and Copilot omitted relevant details in 60% (6/10) of cases, and GPT-3.5 did so in 80% (8/10). Compared with the gold standard, only 10% (2/20) of GPT-4o responses were rated as equally specific, 20% (4/20) included comparable practical advice, and just 5% (1/20) provided a justification as detailed as the professional guidance. Error frequency did not differ significantly across models (P=.65), though Copilot had the highest rate of incorrect information (20%, 2/10 vs 10%, 1/10 for GPT-4o and 0%, 0/10 for GPT-3.5). Conclusions: LLMs, particularly the subscription-based GPT-4o, show potential as tools for training informal caregivers by providing tailored guidance and reducing errors. Although they do not yet match the quality of professional instruction, these models offer a flexible and accessible alternative that could enhance home safety and care quality. Further research is necessary to address their limitations and optimize performance. Future implementation of LLMs may alleviate health care system burdens by reducing common caregiver errors.
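The rubric-based comparison described in the Methods can be illustrated with a minimal sketch: compute each model's mean rubric score per domain and its gap to the gold-standard ceiling. This is not the authors' analysis code; the score values, the 1-5 scale, and the `summarize` helper are hypothetical, for demonstration only.

```python
# Illustrative sketch (not the study's code): aggregating hypothetical
# rubric ratings per model/domain and measuring the gap to a
# gold-standard ceiling. All numbers below are made up.
from statistics import mean

GOLD = 5.0  # assumed gold-standard ceiling on a 1-5 rubric scale

# scores[model][domain] -> rubric ratings across scenarios (hypothetical)
scores = {
    "GPT-4o":  {"specificity": [5, 4, 5, 4], "clarity": [5, 5, 4, 5]},
    "GPT-3.5": {"specificity": [4, 3, 4, 4], "clarity": [4, 4, 4, 4]},
    "Copilot": {"specificity": [4, 3, 3, 4], "clarity": [4, 4, 4, 4]},
}

def summarize(scores):
    """Return (mean score, gap to gold standard) per model and domain."""
    return {
        model: {d: (mean(v), GOLD - mean(v)) for d, v in domains.items()}
        for model, domains in scores.items()
    }

summary = summarize(scores)
for model, domains in summary.items():
    for domain, (avg, gap) in domains.items():
        print(f"{model:8s} {domain:12s} mean={avg:.2f} gap_to_gold={gap:.2f}")
```

A real analysis would additionally test whether the per-domain differences between models are statistically significant, as the study reports for error frequency (P=.65).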
format Article
id doaj-art-fa386a82371249d9bd1afe6ca047c429
institution OA Journals
issn 1438-8871
language English
publishDate 2025-04-01
publisher JMIR Publications
record_format Article
series Journal of Medical Internet Research
spelling doaj-art-fa386a82371249d9bd1afe6ca047c429 2025-08-20T02:19:03Z eng JMIR Publications Journal of Medical Internet Research 1438-8871 2025-04-01 27 e70703 10.2196/70703 AI in Home Care—Evaluation of Large Language Models for Future Training of Informal Caregivers: Observational Comparative Case Study
Clara Pérez-Esteve https://orcid.org/0009-0008-8009-0507
Mercedes Guilabert https://orcid.org/0009-0008-8009-0507
Valerie Matarredona https://orcid.org/0009-0008-0419-3600
Einav Srulovici https://orcid.org/0000-0003-1291-8284
Susanna Tella https://orcid.org/0000-0003-1291-8284
Reinhard Strametz https://orcid.org/0000-0002-9920-8674
José Joaquín Mira https://orcid.org/0000-0001-6497-083X
https://www.jmir.org/2025/1/e70703
spellingShingle Clara Pérez-Esteve
Mercedes Guilabert
Valerie Matarredona
Einav Srulovici
Susanna Tella
Reinhard Strametz
José Joaquín Mira
AI in Home Care—Evaluation of Large Language Models for Future Training of Informal Caregivers: Observational Comparative Case Study
Journal of Medical Internet Research
title AI in Home Care—Evaluation of Large Language Models for Future Training of Informal Caregivers: Observational Comparative Case Study
title_full AI in Home Care—Evaluation of Large Language Models for Future Training of Informal Caregivers: Observational Comparative Case Study
title_fullStr AI in Home Care—Evaluation of Large Language Models for Future Training of Informal Caregivers: Observational Comparative Case Study
title_full_unstemmed AI in Home Care—Evaluation of Large Language Models for Future Training of Informal Caregivers: Observational Comparative Case Study
title_short AI in Home Care—Evaluation of Large Language Models for Future Training of Informal Caregivers: Observational Comparative Case Study
title_sort ai in home care evaluation of large language models for future training of informal caregivers observational comparative case study
url https://www.jmir.org/2025/1/e70703
work_keys_str_mv AT claraperezesteve aiinhomecareevaluationoflargelanguagemodelsforfuturetrainingofinformalcaregiversobservationalcomparativecasestudy
AT mercedesguilabert aiinhomecareevaluationoflargelanguagemodelsforfuturetrainingofinformalcaregiversobservationalcomparativecasestudy
AT valeriematarredona aiinhomecareevaluationoflargelanguagemodelsforfuturetrainingofinformalcaregiversobservationalcomparativecasestudy
AT einavsrulovici aiinhomecareevaluationoflargelanguagemodelsforfuturetrainingofinformalcaregiversobservationalcomparativecasestudy
AT susannatella aiinhomecareevaluationoflargelanguagemodelsforfuturetrainingofinformalcaregiversobservationalcomparativecasestudy
AT reinhardstrametz aiinhomecareevaluationoflargelanguagemodelsforfuturetrainingofinformalcaregiversobservationalcomparativecasestudy
AT josejoaquinmira aiinhomecareevaluationoflargelanguagemodelsforfuturetrainingofinformalcaregiversobservationalcomparativecasestudy