Forensic Genetics: A Machine Learning Algorithm for Mutation Modelling

Bibliographic Details
Main Authors: Sofia Antão Sousa, Leonor Gusmão, Marisa Faustino, António Amorim, Nádia Pinto
Format: Article
Language: English
Published: Rede Académica das Ciências da Saúde da Lusofonia - RACS 2025-06-01
Series: RevSALUS
Subjects: Y chromosome, mutation, microsatellites, Y-STRs
Online Access: https://revsalus.com/index.php/RevSALUS/article/view/999
_version_ 1850115763946389504
author Sofia Antão Sousa
Leonor Gusmão
Marisa Faustino
António Amorim
Nádia Pinto
collection DOAJ
description Microsatellites, or short tandem repeats (STRs), are the most widely used markers in population and forensic genetics because of their high polymorphism, which is a consequence of high germinal mutation rates. Mutation modelling has been a topic of intense research because proper estimation of mutation rates is crucial for a wide range of problems in forensic genetics. The objective of this work is to obtain a statistical framework for mutation modelling that can accommodate as predictors the parental allele length and age, both known to be correlated with the underlying biological mechanism. Because of their haploid mode of transmission, Y-chromosomal markers provide invaluable insight into germinal mutation modelling, since they allow inference of which parental allele gave rise to which filial one [1]. In contrast, for diploid and haplodiploid markers, not only can hidden mutations occur, but multistep mutations can also be misinterpreted as single-step ones, which biases the modelling of the phenomenon [2]. Mutation rates of STRs are known to be correlated with parental sex and age, and with allele size and the sequence of the repetitive motif [3]. Nonetheless, the corresponding estimates are generally computed simply as the marker-specific ratio between the number of Mendelian incompatibilities and the number of transmissions observed. This naïve approach hides the variation in germinal mutation rates within each marker, which depends on the allele and on the sex and age of the individual. Within the framework of a working commission of the Spanish and Portuguese Speaking Working Group of the International Society for Forensic Genetics (GHEP-ISFG), father-son segregation data for 28 Y-STRs were analyzed and a machine-learning model was developed, in which logistic regression analyses were computed to estimate marker-specific mutation rates depending on paternal age and/or allele length [4].
Statistical significance was reached for both predictors for three of the 25 markers analyzed, with allele length contributing more than age (from 5 to 16 times more). Larger subsets of data could be analyzed when only allele length was considered as a predictor, which allowed statistical significance to be reached for 18 of the 28 Y-STRs analyzed. For each case, algebraic expressions were provided for estimating marker-specific mutation rates depending on paternal age and/or allele length. These results support the use of machine learning algorithms to improve mutation modelling, with statistical significance depending on the data available for use as training and test sets. As for any other rare event, a huge amount of data is needed for the proper estimation of mutation parameters. Therefore, interlaboratory studies are crucial to produce and gather substantial amounts of data, in parallel with the establishment of publication guidelines to ensure that data are released with the proper level of detail. To circumvent the limitation inherent in the scarce data available and to increase its potential, in this work we evaluate the possibility of pooling data from different markers with the same repetitive motif structure for modelling mutation rates, again considering the parental allele and/or age as predictors.
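The contrast drawn in the description, between a single marker-level ratio and a logistic regression on allele length, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the counts are invented toy data, the function names are hypothetical, and the Newton-Raphson fit is a generic way to estimate a logistic model, not the procedure or software used by the authors. Paternal age is omitted to keep the example short.

```python
import math

# Hypothetical toy data for one Y-STR (NOT from the study):
# paternal allele length in repeats, father-son transmissions, mutations.
LENGTHS = [10, 11, 12, 13, 14]
TRANSMISSIONS = [400, 500, 600, 400, 200]
MUTATIONS = [0, 1, 2, 3, 4]


def naive_rate(mutations, transmissions):
    """Marker-level estimate: incompatibilities divided by transmissions."""
    return mutations / transmissions


def logistic(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def fit_logistic(lengths, transmissions, mutations, iterations=25):
    """Fit logit(rate) = b0 + b1 * (length - centre) by Newton-Raphson."""
    centre = sum(lengths) / len(lengths)
    x = [length - centre for length in lengths]
    # Start from the naive rate (intercept only, no length effect).
    r = naive_rate(sum(mutations), sum(transmissions))
    b0, b1 = math.log(r / (1.0 - r)), 0.0
    for _ in range(iterations):
        p = [logistic(b0 + b1 * xi) for xi in x]
        # Gradient of the binomial log-likelihood.
        g0 = sum(m - n * pi for m, n, pi in zip(mutations, transmissions, p))
        g1 = sum((m - n * pi) * xi
                 for m, n, pi, xi in zip(mutations, transmissions, p, x))
        # Entries of the 2x2 information matrix X'WX.
        w = [n * pi * (1.0 - pi) for n, pi in zip(transmissions, p)]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1, centre


def predicted_rate(b0, b1, centre, length):
    """Allele-specific mutation rate implied by the fitted expression."""
    return logistic(b0 + b1 * (length - centre))


b0, b1, centre = fit_logistic(LENGTHS, TRANSMISSIONS, MUTATIONS)
print("naive rate:", naive_rate(sum(MUTATIONS), sum(TRANSMISSIONS)))
for length in LENGTHS:
    print(length, predicted_rate(b0, b1, centre, length))
```

In this toy setting the naive estimate assigns the same rate to every allele, whereas the fitted expression yields a different rate for each paternal allele length. Adding paternal age would mean one more coefficient and a 3x3 system in place of the 2x2 one.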
format Article
id doaj-art-893ec1835fa04a3b9611dc3f2e0e5e37
institution OA Journals
issn 2184-4860
2184-836X
language English
publishDate 2025-06-01
publisher Rede Académica das Ciências da Saúde da Lusofonia - RACS
record_format Article
series RevSALUS
spelling doaj-art-893ec1835fa04a3b9611dc3f2e0e5e37 2025-08-20T02:36:30Z eng Rede Académica das Ciências da Saúde da Lusofonia - RACS RevSALUS 2184-4860 2184-836X 2025-06-01 7 Sup 10.51126/revsalus.v7isup.999 Forensic Genetics: A Machine Learning Algorithm for Mutation Modelling
Sofia Antão Sousa: Instituto de Investigação e Inovação em Saúde (i3S), Porto, Portugal; Faculty of Sciences of the University of Porto (FCUP), Porto, Portugal; Centre of Mathematics of the University of Porto, Porto, Portugal
Leonor Gusmão: DNA Diagnostic Laboratory (LDD), State University of Rio de Janeiro (UERJ), Rio de Janeiro, Brazil
Marisa Faustino: Instituto de Investigação e Inovação em Saúde (i3S), Porto, Portugal; Faculty of Sciences of the University of Porto (FCUP), Porto, Portugal
António Amorim: Instituto de Investigação e Inovação em Saúde (i3S), Porto, Portugal; Faculty of Sciences of the University of Porto (FCUP), Porto, Portugal
Nádia Pinto: Instituto de Investigação e Inovação em Saúde (i3S), Porto, Portugal; Faculty of Sciences of the University of Porto (FCUP), Porto, Portugal
https://revsalus.com/index.php/RevSALUS/article/view/999
Y chromosome, mutation, microsatellites, Y-STRs
title Forensic Genetics: A Machine Learning Algorithm for Mutation Modelling
topic Y chromosome, mutation, microsatellites, Y-STRs
url https://revsalus.com/index.php/RevSALUS/article/view/999