Software complex for simulation modelling of single nucleotide genetic polymorphism sites

Bibliographic Details
Main Authors: M. M. Yatskou, D. D. Sarnatski, V. V. Skakun, V. V. Grinev
Format: Article
Language: Russian
Published: National Academy of Sciences of Belarus, the United Institute of Informatics Problems, 2025-07-01
Series: Informatika
Online Access: https://inf.grid.by/jour/article/view/1355
Description
Summary: Objectives. High-throughput sequencing methods have recently become widely used in fundamental and applied research on human diseases. Sequencing functionally significant regions of the human genome makes it possible to simultaneously identify multiple genetic polymorphism sites that have diagnostic and/or prognostic significance for human genetic diseases. One of the key goals in this area is to develop efficient software tools for processing genomic data and identifying single nucleotide polymorphism sites using computer modelling and big data analysis methods.
Methods. A software complex has been developed for simulation modelling and identification of single nucleotide polymorphism sites using machine learning methods. Simulation modelling and analysis of single nucleotide polymorphism sites in DNA molecules are based on beta or normal distributions, whose parameters are determined from the available experimental data, and on machine learning models that are trained on the simulated data and then used to accurately identify single nucleotide polymorphism sites. The software complex includes an R package, a web application, and auxiliary computational tools for processing experimental genomic sequencing data.
Results. The performance of the developed software complex was tested on sets of simulated and experimental data from human cell genomic sequencing. A comparative analysis of the most effective algorithms for identifying single nucleotide polymorphism sites was performed. The best results were obtained with machine learning models.
Conclusion. The use of the software complex increases the accuracy of identifying genetic polymorphism sites in the analysis of big genomic sequencing data. The software can be used to generate synthetic data, either based on experimental data or independently, for comprehensive testing and selection of the best algorithms for identifying single nucleotide polymorphisms, as well as for generative data modelling used to train identification algorithms based on machine learning methods.
ISSN: 1816-0301
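
Illustrative sketch (not the authors' implementation). The summary describes simulating polymorphism sites from beta or normal distributions whose parameters come from experimental data, then training identification models on the simulated data. The Python example below shows one possible reading of that idea for the beta case only. Everything in it is an assumption made for illustration: the function names, the three genotype classes, the beta parameters, the read depth, the method-of-moments fit, and the nearest-centroid classifier, which merely stands in for the machine learning models mentioned in the summary. The published software complex is an R package and web application and may work quite differently.

```python
import random
from statistics import mean, pvariance


def fit_beta_moments(values):
    """Method-of-moments estimate of Beta(alpha, beta) from fractions in (0, 1)."""
    m = mean(values)
    v = pvariance(values)
    if not 0.0 < m < 1.0 or v <= 0.0 or v >= m * (1.0 - m):
        raise ValueError("values are not compatible with a beta distribution")
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common


def simulate_sites(params_by_class, n_per_class, depth, rng):
    """Simulate (observed alternative-allele fraction, class label) pairs.

    For each site, a true allele fraction is drawn from the class's beta
    distribution, then `depth` reads are sampled to add counting noise.
    """
    sites = []
    for label, (alpha, beta) in params_by_class.items():
        for _ in range(n_per_class):
            p = rng.betavariate(alpha, beta)
            alt_reads = sum(rng.random() < p for _ in range(depth))
            sites.append((alt_reads / depth, label))
    return sites


def train_centroids(sites):
    """Nearest-centroid 'model': the mean observed fraction of each class."""
    by_label = {}
    for fraction, label in sites:
        by_label.setdefault(label, []).append(fraction)
    return {label: mean(xs) for label, xs in by_label.items()}


def classify(fraction, centroids):
    """Assign the class whose centroid is closest to the observed fraction."""
    return min(centroids, key=lambda label: abs(fraction - centroids[label]))


rng = random.Random(0)

# Stand-in for experimental heterozygous-site fractions (hypothetical data,
# generated here only so that the fit has something to work on).
observed_het = [rng.betavariate(20, 20) for _ in range(5000)]
het_params = fit_beta_moments(observed_het)

# Hypothetical class parameters; only "het" is fitted from the stand-in data.
params = {"hom_ref": (1, 99), "het": het_params, "hom_alt": (99, 1)}

train = simulate_sites(params, n_per_class=300, depth=100, rng=rng)
test = simulate_sites(params, n_per_class=300, depth=100, rng=rng)

centroids = train_centroids(train)
accuracy = mean(classify(x, centroids) == label for x, label in test)
print("fitted het parameters:", het_params)
print("accuracy on fresh simulated sites:", accuracy)
```

Separating the fitting step from the simulation step mirrors the workflow in the summary: synthetic sites can be generated from parameters estimated on experimental data or from parameters chosen independently, and the same simulated sets can then be used to compare identification algorithms.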