Leveraging computational linguistics and machine learning for detection of ultra-high risk of mental health disorders in youths

Abstract Mental illnesses often manifest through behavioral changes, with speech serving as a key medium for expressing thoughts and emotions. The use of computational linguistics on speech data in mental illnesses is a promising approach to uncover objective biomarkers for the early detection of me...

Bibliographic Details
Main Authors: Jordon Junyang Kho, Shangzheng Song, Samuel Ming Xuan Tan, Nur Hikmah Fitriyah, Matheus Calvin Lokadjaja, Jie Yin Yee, Zixu Yang, Eric Yu Hai Chen, Jimmy Lee, Wilson Wen Bin Goh
Format: Article
Language: English
Published: Nature Portfolio 2025-07-01
Series: Schizophrenia
Online Access: https://doi.org/10.1038/s41537-025-00649-3
_version_ 1849342676577550336
author Jordon Junyang Kho
Shangzheng Song
Samuel Ming Xuan Tan
Nur Hikmah Fitriyah
Matheus Calvin Lokadjaja
Jie Yin Yee
Zixu Yang
Eric Yu Hai Chen
Jimmy Lee
Wilson Wen Bin Goh
author_sort Jordon Junyang Kho
collection DOAJ
description Abstract Mental illnesses often manifest through behavioral changes, with speech serving as a key medium for expressing thoughts and emotions. The use of computational linguistics on speech data in mental illnesses is a promising approach to uncover objective biomarkers for the early detection of mental illnesses. This study analyzed speech transcripts from 80 youths at ultra-high risk of psychosis (UHR) and 329 healthy controls, examining text features such as sentiment variability, cohesion, lexical sophistication, morphology, syntactic sophistication, and lexical diversity. Factor analysis revealed five key linguistic themes: Sentiment Intensity and Variability, Linguistic Register Alignment, Phonographic Uniqueness and Recognizability, Morphological Complexity and Imageability, and Lexical Richness and Typicalness. Regression analysis indicated UHR speech is characterized by diminished sentiment variability (β = –0.07), deviation from linguistic registers (β = –0.16), fewer phonographic neighbors (β = –0.11), lower morphological complexity (β = –0.36), and more predictable lexical structures (β = 0.05). Optimized machine learning (ML) models trained on Boruta-selected features achieved a mean AUC of 0.70. Our findings highlight the potential of sentiment and linguistic analyses in speech for training ML models to aid in early detection and monitoring of mental health conditions.
format Article
id doaj-art-3fc8a8861ee84107a2d570c66f34d3ce
institution Kabale University
issn 2754-6993
language English
publishDate 2025-07-01
publisher Nature Portfolio
record_format Article
series Schizophrenia
spelling doaj-art-3fc8a8861ee84107a2d570c66f34d3ce
2025-08-20T03:43:16Z
eng
Nature Portfolio
Schizophrenia
2754-6993
2025-07-01
11119
10.1038/s41537-025-00649-3
Leveraging computational linguistics and machine learning for detection of ultra-high risk of mental health disorders in youths
Jordon Junyang Kho 0
Shangzheng Song 1
Samuel Ming Xuan Tan 2
Nur Hikmah Fitriyah 3
Matheus Calvin Lokadjaja 4
Jie Yin Yee 5
Zixu Yang 6
Eric Yu Hai Chen 7
Jimmy Lee 8
Wilson Wen Bin Goh 9
Lee Kong Chian School of Medicine, Nanyang Technological University
Lee Kong Chian School of Medicine, Nanyang Technological University
Lee Kong Chian School of Medicine, Nanyang Technological University
School of Biological Sciences, Nanyang Technological University
Lee Kong Chian School of Medicine, Nanyang Technological University
Institute of Mental Health
Institute of Mental Health
Lee Kong Chian School of Medicine, Nanyang Technological University
Lee Kong Chian School of Medicine, Nanyang Technological University
Lee Kong Chian School of Medicine, Nanyang Technological University
Abstract Mental illnesses often manifest through behavioral changes, with speech serving as a key medium for expressing thoughts and emotions. The use of computational linguistics on speech data in mental illnesses is a promising approach to uncover objective biomarkers for the early detection of mental illnesses. This study analyzed speech transcripts from 80 youths at ultra-high risk of psychosis (UHR) and 329 healthy controls, examining text features such as sentiment variability, cohesion, lexical sophistication, morphology, syntactic sophistication, and lexical diversity. Factor analysis revealed five key linguistic themes: Sentiment Intensity and Variability, Linguistic Register Alignment, Phonographic Uniqueness and Recognizability, Morphological Complexity and Imageability, and Lexical Richness and Typicalness. Regression analysis indicated UHR speech is characterized by diminished sentiment variability (β = –0.07), deviation from linguistic registers (β = –0.16), fewer phonographic neighbors (β = –0.11), lower morphological complexity (β = –0.36), and more predictable lexical structures (β = 0.05). Optimized machine learning (ML) models trained on Boruta-selected features achieved a mean AUC of 0.70. Our findings highlight the potential of sentiment and linguistic analyses in speech for training ML models to aid in early detection and monitoring of mental health conditions.
https://doi.org/10.1038/s41537-025-00649-3
title Leveraging computational linguistics and machine learning for detection of ultra-high risk of mental health disorders in youths
title_sort leveraging computational linguistics and machine learning for detection of ultra high risk of mental health disorders in youths
url https://doi.org/10.1038/s41537-025-00649-3