Taking a look at your speech: identifying diagnostic status and negative symptoms of psychosis using convolutional neural networks
| Main Authors: | , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer, 2025-07-01 |
| Series: | NPP-Digital Psychiatry and Neuroscience |
| Online Access: | https://doi.org/10.1038/s44277-025-00040-1 |
| Summary: | Abstract Speech-based indices are promising objective biomarkers for identifying schizophrenia and monitoring symptom burden. Static acoustic features show potential but often overlook the time-varying acoustic cues that clinicians naturally evaluate during clinical interviews, such as negative symptoms. A similarly dynamic, unfiltered approach can be applied using speech spectrograms, which preserve acoustic-temporal nuances. Here, we investigate whether this method has the potential to assist in determining diagnostic and symptom severity status. Speech recordings from 319 participants (227 with schizophrenia spectrum disorders, 92 healthy controls) were segmented into 10 s fragments of uninterrupted audio (n = 110,246) and transformed into log-Mel spectrograms to preserve both acoustic and temporal features. Participants were partitioned into training (70%), validation (15%), and test (15%) datasets without overlap. Modified ResNet-18 convolutional neural networks (CNNs) performed three classification tasks: (1) schizophrenia-spectrum vs healthy controls; (2) within 179 clinically rated patients, individuals with more severe vs less severe negative symptom burden; and (3) clinically obvious vs subtle blunted affect. Grad-CAM was used to visualize the salient regions of the spectrograms that contributed to classification. CNNs distinguished schizophrenia-spectrum participants from healthy controls with 87.8% accuracy (AUC = 0.86). The classifier trained on negative symptom burden performed with somewhat less accuracy (80.5%; AUC = 0.73), but the model detecting blunted affect above a predefined clinical threshold achieved 87.8% accuracy (AUC = 0.79). Importantly, the acoustic information contributing to diagnostic classification was distinct from that identifying blunted affect. Grad-CAM visualization indicated that the CNN targeted regions consistent with human speech signals at the utterance level, highlighting clinically relevant vocal patterns.
Our results suggest that spectrogram-based CNN analyses of short conversational segments can robustly detect schizophrenia-spectrum disorders and ascertain the burden of negative symptoms. This interpretable framework underscores how time–frequency feature maps of natural speech may facilitate more nuanced tracking and detection of negative symptoms in schizophrenia. |
| ISSN: | 2948-1570 |
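The summary's preprocessing step, turning 10 s audio fragments into log-Mel spectrograms, can be sketched in plain NumPy. This is an illustrative sketch only: the sample rate (16 kHz), FFT size, hop length, and 64 Mel bands are assumptions, since the abstract does not report the study's actual parameters.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=64):
    # Frame the signal with a Hann window and compute the power spectrum.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2  # (n_frames, n_fft//2+1)

    # Triangular Mel filterbank spanning 0 Hz to the Nyquist frequency.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)

    mel = power @ fbank.T                # (n_frames, n_mels)
    return np.log(mel + 1e-10).T         # (n_mels, n_frames), log compression

# A 10 s fragment at the assumed 16 kHz rate (random noise as stand-in audio).
x = np.random.default_rng(0).standard_normal(10 * 16000)
S = log_mel_spectrogram(x)
```

The resulting (n_mels, n_frames) array is the kind of 2-D time-frequency image a ResNet-style CNN can consume directly.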
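The summary also notes that the 70/15/15 split was made "without overlap", i.e. at the participant level rather than the fragment level, so no speaker's audio leaks across sets. A minimal stdlib sketch of such a split (function name and seed are hypothetical, not from the article):

```python
import random

def split_participants(participant_ids, seed=0, frac=(0.70, 0.15, 0.15)):
    """Partition participants (not fragments) into train/val/test sets so
    that no speaker appears in more than one set."""
    ids = sorted(set(participant_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_train = int(frac[0] * len(ids))
    n_val = int(frac[1] * len(ids))
    return (set(ids[:n_train]),
            set(ids[n_train:n_train + n_val]),
            set(ids[n_train + n_val:]))

# 319 participants, as in the study; each fragment would then inherit the
# split of its speaker.
train, val, test = split_participants(range(319))
```

Splitting by participant avoids the optimistic bias that would arise if fragments from the same speaker landed in both training and test data.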