Integrating Biological Domain Knowledge with Machine Learning for Identifying Colorectal-Cancer-Associated Microbial Enzymes in Metagenomic Data

Advances in metagenomics have revolutionized our ability to elucidate links between the microbiome and human diseases. Colorectal cancer (CRC), a leading cause of cancer-related mortality worldwide, has been associated with dysbiosis of the gut microbiome. This study aims to develop a method for ide...

Bibliographic Details
Main Authors: Burcu Bakir-Gungor, Nur Sebnem Ersoz, Malik Yousef
Format: Article
Language: English
Published: MDPI AG 2025-03-01
Series: Applied Sciences
Subjects:
Online Access: https://www.mdpi.com/2076-3417/15/6/2940
_version_ 1850090138419331072
author Burcu Bakir-Gungor
Nur Sebnem Ersoz
Malik Yousef
author_facet Burcu Bakir-Gungor
Nur Sebnem Ersoz
Malik Yousef
author_sort Burcu Bakir-Gungor
collection DOAJ
description Advances in metagenomics have revolutionized our ability to elucidate links between the microbiome and human diseases. Colorectal cancer (CRC), a leading cause of cancer-related mortality worldwide, has been associated with dysbiosis of the gut microbiome. This study aims to develop a method for identifying CRC-associated microbial enzymes by incorporating biological domain knowledge into the feature selection process. Conventional feature selection techniques often evaluate features individually and fail to leverage biological knowledge during metagenomic data analysis. To address this gap, we propose the enzyme commission (EC)-nomenclature-based Grouping-Scoring-Modeling (G-S-M) method, which integrates biological domain knowledge into feature grouping and selection. The proposed method was tested on a CRC-associated metagenomic dataset collected from eight different countries. Community-level relative abundance values of enzymes were considered as features and grouped based on their EC categories to provide biologically informed groupings. Our findings in randomized 10-fold cross-validation experiments imply that glycosidases, CoA-transferases, hydro-lyases, oligo-1,6-glucosidase, crotonobetainyl-CoA hydratase, and citrate CoA-transferase enzymes can be associated with CRC development as part of different molecular pathways. These enzymes are mostly synthesized by <i>Escherichia coli</i>, <i>Salmonella enterica</i>, <i>Klebsiella pneumoniae</i>, <i>Staphylococcus aureus</i>, <i>Streptococcus pneumoniae</i>, and <i>Clostridioides difficile</i>. Comparative evaluation experiments showed that the proposed model consistently outperforms traditional feature selection methods paired with various classifiers.
format Article
id doaj-art-45a8e643535044d68deec038ea5ece0d
institution DOAJ
issn 2076-3417
language English
publishDate 2025-03-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj-art-45a8e643535044d68deec038ea5ece0d
2025-08-20T02:42:38Z
eng
MDPI AG
Applied Sciences
2076-3417
2025-03-01
15
6
2940
10.3390/app15062940
Integrating Biological Domain Knowledge with Machine Learning for Identifying Colorectal-Cancer-Associated Microbial Enzymes in Metagenomic Data
Burcu Bakir-Gungor, Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri 38080, Türkiye
Nur Sebnem Ersoz, Department of Bioengineering, Graduate School of Engineering and Science, Abdullah Gul University, Kayseri 38080, Türkiye
Malik Yousef, Department of Information Systems, Zefat Academic College, Zefat 1320611, Israel
https://www.mdpi.com/2076-3417/15/6/2940
metagenomic analysis of colorectal cancer
machine learning
feature grouping
functional profiling of metagenomes
community-level enzyme commission (EC) abundances
spellingShingle Burcu Bakir-Gungor
Nur Sebnem Ersoz
Malik Yousef
Integrating Biological Domain Knowledge with Machine Learning for Identifying Colorectal-Cancer-Associated Microbial Enzymes in Metagenomic Data
Applied Sciences
metagenomic analysis of colorectal cancer
machine learning
feature grouping
functional profiling of metagenomes
community-level enzyme commission (EC) abundances
title Integrating Biological Domain Knowledge with Machine Learning for Identifying Colorectal-Cancer-Associated Microbial Enzymes in Metagenomic Data
title_sort integrating biological domain knowledge with machine learning for identifying colorectal cancer associated microbial enzymes in metagenomic data
topic metagenomic analysis of colorectal cancer
machine learning
feature grouping
functional profiling of metagenomes
community-level enzyme commission (EC) abundances
url https://www.mdpi.com/2076-3417/15/6/2940
work_keys_str_mv AT burcubakirgungor integratingbiologicaldomainknowledgewithmachinelearningforidentifyingcolorectalcancerassociatedmicrobialenzymesinmetagenomicdata
AT nursebnemersoz integratingbiologicaldomainknowledgewithmachinelearningforidentifyingcolorectalcancerassociatedmicrobialenzymesinmetagenomicdata
AT malikyousef integratingbiologicaldomainknowledgewithmachinelearningforidentifyingcolorectalcancerassociatedmicrobialenzymesinmetagenomicdata