A comparative study of deep End-to-End Automatic Speech Recognition models for doctor-patient conversations in Polish in a real-life acoustic environment

Bibliographic Details
Main Authors: Karolina Pondel-Sycz, Piotr Bilski, Piotr Bobiński, Leszek Morzyński, Marcin Lewandowski, Emil Kozłowski, Grzegorz Szczepański, Maciej Jasiński, Grzegorz Makarewicz, Agnieszka Paula Pietrzak, Andrzej Buchowicz, Paweł Mazurek, Adrian Bilski, Jacek Olejnik, Iwona Olejnik
Format: Article
Language: English
Published: Polish Academy of Sciences 2025-07-01
Series: International Journal of Electronics and Telecommunications
Online Access: https://journals.pan.pl/Content/135730/2-5157-Pondel-Sycz_sk.pdf
Description
Summary: This paper presents research on Automatic Speech Recognition (ASR) methods for building a system that automatically transcribes medical interviews conducted in Polish during a clinic visit. The performance of four ASR models based on Deep Neural Networks (DNN) was evaluated: XLSR-53 large, Quartznet15x5, FastConformer Hybrid Transducer-CTC and Whisper large. The study was conducted on a self-developed speech dataset. Models were evaluated using Word Error Rate (WER), Character Error Rate (CER), Match Error Rate (MER), Word Accuracy (WAcc), Word Information Preserved (WIP), Word Information Lost (WIL), Levenshtein distance, Jaro-Winkler similarity and Jaccard index. The results show that the Whisper model outperformed the other tested solutions in the vast majority of the conducted tests. Whisper achieved a WER of 20.84%, compared with 46.30% for FastConformer, 67.96% for XLSR-53 and 76.25% for Quartznet15x5. Even so, Whisper needs further adaptation for medical conversations, as the current volume of transcription errors is not acceptable in practice (too many mistakes in the description of the patient's health).
ISSN: 2081-8491, 2300-1933
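
The word-level metrics named in the summary (WER, MER, WAcc, WIP, WIL) and CER can all be derived from the hit, substitution, deletion and insertion counts of a Levenshtein alignment between a reference transcript and a model's output. The sketch below illustrates the commonly used definitions; it is not the authors' evaluation code. The names `Counts`, `align_counts`, `word_metrics` and `char_error_rate` are hypothetical, WAcc is assumed to be 1 - WER, and the Polish sentence pair is an invented example rather than data from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    hits: int
    subs: int
    dels: int
    ins: int


def align_counts(ref, hyp):
    """Count hits/substitutions/deletions/insertions of a minimum-cost
    Levenshtein alignment. Among equally cheap alignments, the one with
    the most hits is chosen so the result is deterministic."""
    # Each cell holds (cost, hits, subs, dels, ins) for a prefix pair.
    prev = [(j, 0, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        cur = [(i, 0, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            c, h, s, d, n = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (c, h + 1, s, d, n)       # hit
            else:
                diag = (c + 1, h, s + 1, d, n)   # substitution
            c, h, s, d, n = prev[j]
            delete = (c + 1, h, s, d + 1, n)     # reference token missing
            c, h, s, d, n = cur[j - 1]
            insert = (c + 1, h, s, d, n + 1)     # extra hypothesis token
            cur.append(min(diag, delete, insert,
                           key=lambda t: (t[0], -t[1])))
        prev = cur
    _, h, s, d, n = prev[-1]
    return Counts(h, s, d, n)


def word_metrics(reference, hypothesis):
    """Word-level metrics for one reference/hypothesis pair."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    c = align_counts(ref, hyp)
    errors = c.subs + c.dels + c.ins
    n_ref = c.hits + c.subs + c.dels   # reference length
    n_hyp = c.hits + c.subs + c.ins    # hypothesis length
    wer = errors / n_ref
    mer = errors / (c.hits + errors)
    wip = (c.hits / n_ref) * (c.hits / n_hyp) if n_hyp else 0.0
    return {"WER": wer, "MER": mer, "WAcc": 1.0 - wer,  # WAcc: assumed 1 - WER
            "WIP": wip, "WIL": 1.0 - wip}


def char_error_rate(reference, hypothesis):
    """CER: the same error ratio computed over characters (spaces included)."""
    if not reference:
        raise ValueError("reference must not be empty")
    c = align_counts(list(reference), list(hypothesis))
    return (c.subs + c.dels + c.ins) / len(reference)


if __name__ == "__main__":
    # Invented example: one misspelled word and one dropped word.
    ref = "pacjent zgłasza ból głowy od trzech dni"
    hyp = "pacjent zgłasza bul głowy od dni"
    for name, value in word_metrics(ref, hyp).items():
        print(f"{name}: {value:.4f}")
    print(f"CER: {char_error_rate(ref, hyp):.4f}")
```

WER can exceed 100% when the hypothesis contains many insertions, whereas MER, WIP and WIL stay within [0, 1]; this is one reason a study might report them side by side. Scores computed over a whole test set are usually obtained by pooling the counts across utterances rather than averaging per-utterance ratios, although the summary does not state which convention was used.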