Semantic search helper: A tool based on the use of embeddings in multi-item questionnaires as a harmonization opportunity for merging large datasets – A feasibility study

Abstract. Background: Recent advances in natural language processing (NLP) have opened new avenues in semantic data analysis. A promising application of NLP is data harmonization in questionnaire-based cohort studies, where it can be used as an additional...

Bibliographic Details
Main Authors: Karl Gottfried, Karina Janson, Nathalie E. Holz, Olaf Reis, Johannes Kornhuber, Anna Eichler, Tobias Banaschewski, Frauke Nees, IMAC-Mind Consortium
Format: Article
Language: English
Published: Cambridge University Press 2025-01-01
Series: European Psychiatry
Subjects: natural language processing, harmonization, semantic, questionnaires, big data
Online Access: https://www.cambridge.org/core/product/identifier/S092493382401808X/type/journal_article
author Karl Gottfried
Karina Janson
Nathalie E. Holz
Olaf Reis
Johannes Kornhuber
Anna Eichler
Tobias Banaschewski
Frauke Nees
IMAC-Mind Consortium
collection DOAJ
description Abstract
Background: Recent advances in natural language processing (NLP) have opened new avenues in semantic data analysis. A promising application of NLP is data harmonization in questionnaire-based cohort studies, where it can serve as an additional method, specifically when only different instruments are available for one construct, and for evaluating potentially new construct constellations. The present article therefore explores the potential of embedding models to detect opportunities for semantic harmonization.
Methods: Using models such as SBERT and OpenAI’s ADA, we developed a prototype application (“Semantic Search Helper”) that facilitates harmonization by detecting semantically similar items within extensive health-related datasets. The approach’s feasibility and applicability were evaluated in a use case analysis involving four large cohort studies whose heterogeneous data were obtained with different sets of instruments for common constructs.
Results: With the prototype, we effectively identified potential harmonization pairs, which significantly reduced manual evaluation efforts. Expert ratings of semantic similarity candidates showed high agreement with model-generated pairs, confirming the validity of our approach.
Conclusions: This study demonstrates the potential of embeddings for matching semantically similar items as a promising add-on tool to assist the harmonization of multiplex datasets collected with different instruments but similar content, within and across studies.
format Article
id doaj-art-f5fc7e1f2ad24fc7bfc9a9be77ea4be8
institution Kabale University
issn 0924-9338
1778-3585
language English
publishDate 2025-01-01
publisher Cambridge University Press
record_format Article
series European Psychiatry
spelling doaj-art-f5fc7e1f2ad24fc7bfc9a9be77ea4be8
Record timestamp: 2025-01-20T10:29:12Z
Language: eng
Publisher: Cambridge University Press
Journal: European Psychiatry
ISSN: 0924-9338, 1778-3585
Publication date: 2025-01-01
Volume: 68
DOI: 10.1192/j.eurpsy.2024.1808
Title: Semantic search helper: A tool based on the use of embeddings in multi-item questionnaires as a harmonization opportunity for merging large datasets – A feasibility study
Authors and affiliations:
Karl Gottfried (https://orcid.org/0000-0002-2100-3409): Institute of Applied Medical Informatics, University Hospital Center Hamburg-Eppendorf, Hamburg, Germany
Karina Janson (https://orcid.org/0000-0002-0902-0628): Department of Child and Adolescent Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Baden-Württemberg, Germany; Institute of Medical Psychology and Medical Sociology, University Medical Center Schleswig-Holstein, Kiel University, Preußerstraße 1-9, Kiel, Schleswig-Holstein, Germany
Nathalie E. Holz: Department of Child and Adolescent Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Baden-Württemberg, Germany; German Center for Mental Health (DZPG), Partnersite Mannheim-Heidelberg-Ulm, Germany
Olaf Reis (https://orcid.org/0000-0001-6480-6431): Department of Child and Adolescent Psychiatry, Neurology, Psychosomatics and Psychotherapy, Rostock University Medical Centre, Rostock, Germany
Johannes Kornhuber (https://orcid.org/0000-0002-8096-3987): Department of Psychiatry and Psychotherapy, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
Anna Eichler (https://orcid.org/0000-0001-5584-0961): Department of Child and Adolescent Mental Health, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
Tobias Banaschewski (https://orcid.org/0000-0003-4595-1144): Department of Child and Adolescent Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Baden-Württemberg, Germany; German Center for Mental Health (DZPG), Partnersite Mannheim-Heidelberg-Ulm, Germany
Frauke Nees (https://orcid.org/0000-0002-7796-8234): Department of Child and Adolescent Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Baden-Württemberg, Germany; Institute of Medical Psychology and Medical Sociology, University Medical Center Schleswig-Holstein, Kiel University, Preußerstraße 1-9, Kiel, Schleswig-Holstein, Germany
IMAC-Mind Consortium
URL: https://www.cambridge.org/core/product/identifier/S092493382401808X/type/journal_article
Keywords: natural language processing, harmonization, semantic, questionnaires, big data
title Semantic search helper: A tool based on the use of embeddings in multi-item questionnaires as a harmonization opportunity for merging large datasets – A feasibility study
topic natural language processing
harmonization
semantic
questionnaires
big data
url https://www.cambridge.org/core/product/identifier/S092493382401808X/type/journal_article
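
The Methods summary in the abstract describes detecting semantically similar questionnaire items by comparing embeddings. The record does not describe the prototype's implementation, so the following is only a minimal sketch of the general idea, under stated assumptions: every function name, the threshold, and the example items are hypothetical, and `toy_embed` is a bag-of-words stand-in used so the sketch runs without any model. It captures word overlap only, not meaning; a real sentence-embedding model such as those named in the abstract would take its place.

```python
import math
import re
from collections import Counter
from itertools import product


def toy_embed(text):
    """Hypothetical stand-in embedder: lowercase word counts.

    NOT a semantic model; it only lets the sketch run offline.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def candidate_pairs(items_a, items_b, embed=toy_embed, threshold=0.5):
    """Return (score, item_a, item_b) tuples with score >= threshold,
    sorted from most to least similar. The threshold is an assumption."""
    vecs_a = [embed(t) for t in items_a]
    vecs_b = [embed(t) for t in items_b]
    pairs = []
    for (i, va), (j, vb) in product(enumerate(vecs_a), enumerate(vecs_b)):
        score = cosine(va, vb)
        if score >= threshold:
            pairs.append((score, items_a[i], items_b[j]))
    return sorted(pairs, reverse=True)


# Invented example items from two hypothetical instruments.
study_a = ["I feel sad most of the time", "I have trouble falling asleep"]
study_b = ["Most of the time I feel sad", "I enjoy meeting friends"]

pairs = candidate_pairs(study_a, study_b)
for score, a, b in pairs:
    print(f"{score:.2f}  {a!r}  <->  {b!r}")
```

Pairs above the threshold are candidates for expert review rather than confirmed matches, in line with the abstract's framing of the tool as an add-on that reduces manual evaluation effort.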