Integrating Biological Domain Knowledge with Machine Learning for Identifying Colorectal-Cancer-Associated Microbial Enzymes in Metagenomic Data
Advances in metagenomics have revolutionized our ability to elucidate links between the microbiome and human diseases. Colorectal cancer (CRC), a leading cause of cancer-related mortality worldwide, has been associated with dysbiosis of the gut microbiome. This study aims to develop a method for ide...
| Main Authors: | Burcu Bakir-Gungor, Nur Sebnem Ersoz, Malik Yousef |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-03-01 |
| Series: | Applied Sciences |
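| Subjects: | metagenomic analysis of colorectal cancer; machine learning; feature grouping; functional profiling of metagenomes; community-level enzyme commission (EC) abundances |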
| Online Access: | https://www.mdpi.com/2076-3417/15/6/2940 |
| _version_ | 1850090138419331072 |
|---|---|
| author | Burcu Bakir-Gungor; Nur Sebnem Ersoz; Malik Yousef |
| author_facet | Burcu Bakir-Gungor; Nur Sebnem Ersoz; Malik Yousef |
| author_sort | Burcu Bakir-Gungor |
| collection | DOAJ |
| description | Advances in metagenomics have revolutionized our ability to elucidate links between the microbiome and human diseases. Colorectal cancer (CRC), a leading cause of cancer-related mortality worldwide, has been associated with dysbiosis of the gut microbiome. This study aims to develop a method for identifying CRC-associated microbial enzymes by incorporating biological domain knowledge into the feature selection process. Conventional feature selection techniques often evaluate features individually and fail to leverage biological knowledge during metagenomic data analysis. To address this gap, we propose the enzyme commission (EC)-nomenclature-based Grouping-Scoring-Modeling (G-S-M) method, which integrates biological domain knowledge into feature grouping and selection. The proposed method was tested on a CRC-associated metagenomic dataset collected from eight different countries. Community-level relative abundance values of enzymes were considered as features and grouped based on their EC categories to provide biologically informed groupings. Our findings in randomized 10-fold cross-validation experiments imply that glycosidases, CoA-transferases, hydro-lyases, oligo-1,6-glucosidase, crotonobetainyl-CoA hydratase, and citrate CoA-transferase enzymes can be associated with CRC development as part of different molecular pathways. These enzymes are mostly synthesized by <i>Escherichia coli</i>, <i>Salmonella enterica</i>, <i>Klebsiella pneumoniae</i>, <i>Staphylococcus aureus</i>, <i>Streptococcus pneumoniae</i>, and <i>Clostridioides difficile</i>. Comparative evaluation experiments showed that the proposed model consistently outperforms traditional feature selection methods paired with various classifiers. |
| format | Article |
| id | doaj-art-45a8e643535044d68deec038ea5ece0d |
| institution | DOAJ |
| issn | 2076-3417 |
| language | English |
| publishDate | 2025-03-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Applied Sciences |
| spelling | doaj-art-45a8e643535044d68deec038ea5ece0d; 2025-08-20T02:42:38Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2025-03-01; 15; 6; 2940; 10.3390/app15062940; Integrating Biological Domain Knowledge with Machine Learning for Identifying Colorectal-Cancer-Associated Microbial Enzymes in Metagenomic Data; Burcu Bakir-Gungor (Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri 38080, Türkiye); Nur Sebnem Ersoz (Department of Bioengineering, Graduate School of Engineering and Science, Abdullah Gul University, Kayseri 38080, Türkiye); Malik Yousef (Department of Information Systems, Zefat Academic College, Zefat 1320611, Israel); https://www.mdpi.com/2076-3417/15/6/2940; metagenomic analysis of colorectal cancer; machine learning; feature grouping; functional profiling of metagenomes; community-level enzyme commission (EC) abundances |
| title | Integrating Biological Domain Knowledge with Machine Learning for Identifying Colorectal-Cancer-Associated Microbial Enzymes in Metagenomic Data |
| topic | metagenomic analysis of colorectal cancer; machine learning; feature grouping; functional profiling of metagenomes; community-level enzyme commission (EC) abundances |
| url | https://www.mdpi.com/2076-3417/15/6/2940 |
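
The description in this record states that community-level enzyme abundances were grouped by their EC categories before feature selection. The following is a minimal sketch of such a grouping step only, assuming features are keyed by four-part EC numbers and that groups are formed at the third EC level. The function names, the choice of level, the use of summation, and the abundance values are all hypothetical illustrations and are not taken from the article; the scoring and modeling steps of G-S-M are not reproduced here.

```python
from collections import defaultdict


def ec_group(ec_number: str, level: int = 3) -> str:
    """Return the EC prefix at the given level, e.g. '3.2.1.10' -> '3.2.1'."""
    parts = ec_number.split(".")
    if len(parts) != 4 or not 1 <= level <= 4:
        raise ValueError(f"unexpected EC number or level: {ec_number!r}, {level}")
    return ".".join(parts[:level])


def group_abundances(sample: dict[str, float], level: int = 3) -> dict[str, float]:
    """Sum the abundances of all enzymes that share an EC prefix (assumed aggregation)."""
    totals: dict[str, float] = defaultdict(float)
    for ec_number, abundance in sample.items():
        totals[ec_group(ec_number, level)] += abundance
    return dict(totals)


# Made-up relative abundances for a single sample (illustration only).
sample = {
    "3.2.1.10": 0.25,  # oligo-1,6-glucosidase, in EC 3.2.1 (glycosidases)
    "3.2.1.23": 0.25,  # beta-galactosidase, also in EC 3.2.1
    "2.8.3.10": 0.5,   # citrate CoA-transferase, in EC 2.8.3 (CoA-transferases)
}
grouped = group_abundances(sample)
print(grouped)
```

Grouping at a coarser level (for example `level=2`) would merge more enzymes into each group; which level suits a given dataset is a design choice that this sketch does not settle.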