Comparing the performance of a large language model and naive human interviewers in interviewing children about a witnessed mock-event.

Purpose: The present study compared the performance of a Large Language Model (LLM; ChatGPT) and human interviewers in interviewing children about a mock-event they witnessed.

Methods: Children aged 6-8 (N = 78) were randomly assigned to the LLM condition (n = 40) or the human interviewer condition (n = 38). The children watched a video, filmed by the researchers, that depicted behavior including elements that could be misinterpreted as abusive in other contexts, and then answered questions posed either by an LLM (presented by a human researcher) or by a human interviewer.

Results: Irrespective of condition, recommended (vs. not recommended) questions elicited more correct information. The LLM posed fewer questions overall, but there was no difference between conditions in the proportion of questions recommended by the literature. The LLM and human interviewers did not differ in the total amount of unique correct information elicited, but questions posed by the LLM (vs. humans) elicited more unique correct information per question. The LLM (vs. humans) also elicited less false information overall, with no difference in false information elicited per question.

Conclusions: The findings show that the LLM was competent in formulating questions that adhere to best-practice guidelines, whereas human interviewers asked more questions following up on the children's responses in trying to find out what the children had witnessed. The results indicate that LLMs could possibly be used to support child investigative interviewers. However, substantial further investigation is warranted to ascertain the utility of LLMs in more realistic investigative interview settings.
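
Illustration (not part of the article): the kind of interviewer support suggested in the Conclusions could, in principle, be prototyped along the lines of the minimal Python sketch below. It assumes the openai client library; the model name, guideline text, prompt wording, and the suggest_question helper are placeholders invented for this example, not the study's actual procedure.

```python
# Hypothetical sketch of LLM-assisted question suggestion for a child
# investigative interview. The model name, system prompt, and helper are
# illustrative assumptions; they do not reproduce the study's setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GUIDELINES = (
    "You help an interviewer question a child about an event they witnessed. "
    "Follow best-practice guidelines: prefer open-ended invitations "
    "('Tell me more about...'), avoid suggestive, option-posing, or repeated "
    "yes/no questions, and never introduce details the child has not mentioned."
)

def suggest_question(transcript_so_far: str) -> str:
    """Return one suggested next question given the interview so far."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": GUIDELINES},
            {"role": "user", "content": (
                f"Interview so far:\n{transcript_so_far}\n"
                "Suggest the single next question."
            )},
        ],
    )
    return response.choices[0].message.content.strip()

# Example use:
# suggest_question("Child: A man came into the room and picked up the bag.")
```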

Bibliographic Details
Main Authors: Yongjie Sun, Haohai Pang, Liisa Järvilehto, Ophelia Zhang, David Shapiro, Julia Korkman, Shumpei Haginoya, Pekka Santtila
Format: Article
Language: English
Published: Public Library of Science (PLoS), 2025-01-01
Series: PLoS ONE, 20(2): e0316317
ISSN: 1932-6203
Online Access: https://doi.org/10.1371/journal.pone.0316317