Estimation of minimal data sets sizes for machine learning predictions in digital mental health interventions

Bibliographic Details
Main Authors: Kirsten Zantvoort, Barbara Nacke, Dennis Görlich, Silvan Hornstein, Corinna Jacobi, Burkhardt Funk
Format: Article
Language: English
Published: Nature Portfolio 2024-12-01
Series:npj Digital Medicine
Online Access:https://doi.org/10.1038/s41746-024-01360-w
_version_ 1850133017445531648
author Kirsten Zantvoort
Barbara Nacke
Dennis Görlich
Silvan Hornstein
Corinna Jacobi
Burkhardt Funk
author_sort Kirsten Zantvoort
collection DOAJ
description Abstract Artificial intelligence promises to revolutionize mental health care, but small dataset sizes and lack of robust methods raise concerns about result generalizability. To provide insights on minimal necessary data set sizes, we explore domain-specific learning curves for digital intervention dropout predictions based on 3654 users from a single study (ISRCTN13716228, 26/02/2016). Prediction performance is analyzed based on dataset size (N = 100–3654), feature groups (F = 2–129), and algorithm choice (from Naive Bayes to Neural Networks). The results substantiate the concern that small datasets (N ≤ 300) overestimate predictive power. For uninformative feature groups, in-sample prediction performance was negatively correlated with dataset size. Sophisticated models overfitted in small datasets but maximized holdout test results in larger datasets. While N = 500 mitigated overfitting, performance did not converge until N = 750–1500. Consequently, we propose minimum dataset sizes of N = 500–1000. As such, this study offers an empirical reference for researchers designing or interpreting AI studies on Digital Mental Health Intervention data.
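The learning-curve design summarized in the abstract can be illustrated with a short sketch. The snippet below is not the study's code: it substitutes synthetic data for the trial records and uses scikit-learn classifiers (Naive Bayes and a small neural network as stand-ins for the simpler and more flexible models compared in the paper) to show how in-sample and holdout AUC can be tracked as the training-set size N grows.

```python
# Minimal learning-curve sketch, loosely following the study design described above:
# vary the training-set size N and compare in-sample vs. holdout AUC for a simple
# and a more flexible classifier. Data are synthetic (the trial data are not public).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
# Hypothetical stand-in for the DMHI data: 3654 users, 129 features, binary dropout label.
X, y = make_classification(n_samples=3654, n_features=129, n_informative=20,
                           weights=[0.6, 0.4], random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3,
                                                  stratify=y, random_state=0)

models = {"naive_bayes": GaussianNB(),
          "neural_net": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                                      random_state=0)}

for n in [100, 300, 500, 750, 1000, 1500, len(X_pool)]:
    idx = rng.choice(len(X_pool), size=n, replace=False)   # subsample N training users
    X_tr, y_tr = X_pool[idx], y_pool[idx]
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        auc_in = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
        auc_out = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        print(f"N={n:4d} {name:12s} in-sample AUC={auc_in:.2f} holdout AUC={auc_out:.2f}")
```

In a setup like this, the gap between in-sample and holdout AUC typically narrows as N grows, which is the pattern the study quantifies for dropout prediction in digital mental health intervention data.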
format Article
id doaj-art-cc6529e5e0a044289dc5d4fae2b01b36
institution OA Journals
issn 2398-6352
language English
publishDate 2024-12-01
publisher Nature Portfolio
record_format Article
series npj Digital Medicine
spelling doaj-art-cc6529e5e0a044289dc5d4fae2b01b36 | 2025-08-20T02:32:04Z | eng | Nature Portfolio | npj Digital Medicine | 2398-6352 | 2024-12-01 | 7 | 1 | 1 | 10 | 10.1038/s41746-024-01360-w
Estimation of minimal data sets sizes for machine learning predictions in digital mental health interventions
Kirsten Zantvoort (Institute of Information Systems, Leuphana University)
Barbara Nacke (Department of Clinical Psychology and Psychotherapy, Faculty of Psychology, Technische Universität Dresden)
Dennis Görlich (Institute of Biostatistics and Clinical Research, University Münster)
Silvan Hornstein (Department of Psychology, Humboldt-Universität zu Berlin)
Corinna Jacobi (Department of Clinical Psychology and Psychotherapy, Faculty of Psychology, Technische Universität Dresden)
Burkhardt Funk (Institute of Information Systems, Leuphana University)
Abstract: as given in the description field above.
https://doi.org/10.1038/s41746-024-01360-w
title Estimation of minimal data sets sizes for machine learning predictions in digital mental health interventions
title_sort estimation of minimal data sets sizes for machine learning predictions in digital mental health interventions
url https://doi.org/10.1038/s41746-024-01360-w