Using rule-based machine learning for candidate disease gene prioritization and sample classification of cancer gene expression data.

Bibliographic Details
Main Authors: Enrico Glaab, Jaume Bacardit, Jonathan M Garibaldi, Natalio Krasnogor
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2012-01-01
Series: PLoS ONE
Online Access: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0039932&type=printable
collection DOAJ
description Microarray data analysis has been shown to provide an effective tool for studying cancer and genetic diseases. Although classical machine learning techniques have successfully been applied to find informative genes and to predict class labels for new samples, common restrictions of microarray analysis such as small sample sizes, a large attribute space and high noise levels still limit its scientific and clinical applications. Increasing the interpretability of prediction models while retaining a high accuracy would help to exploit the information content in microarray data more effectively. For this purpose, we evaluate our rule-based evolutionary machine learning systems, BioHEL and GAssist, on three public microarray cancer datasets, obtaining simple rule-based models for sample classification. A comparison with other benchmark microarray sample classifiers based on three diverse feature selection algorithms suggests that these evolutionary learning techniques can compete with state-of-the-art methods like support vector machines. The obtained models reach accuracies above 90% in two-level external cross-validation, with the added value of facilitating interpretation by using only combinations of simple if-then-else rules. As a further benefit, a literature mining analysis reveals that prioritizations of informative genes extracted from BioHEL's classification rule sets can outperform gene rankings obtained from a conventional ensemble feature selection in terms of the pointwise mutual information between relevant disease terms and the standardized names of top-ranked genes.
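The pointwise mutual information mentioned in the description is commonly defined as PMI(x, y) = log( p(x, y) / (p(x) p(y)) ). The sketch below shows how a gene ranking could be scored against a disease term under that textbook definition, with probabilities estimated from document counts. It is an illustration only: the gene names, counts, and the `pmi` helper are hypothetical, and the article's exact scoring procedure (for example, any normalization or term matching) may differ.

```python
import math


def pmi(n_xy: int, n_x: int, n_y: int, n_total: int) -> float:
    """Pointwise mutual information in bits, estimated from document counts.

    n_xy: documents mentioning both terms
    n_x, n_y: documents mentioning each term on its own
    n_total: size of the document collection
    """
    if min(n_xy, n_x, n_y, n_total) <= 0:
        # No co-occurrence (or an empty count): log of zero, reported as -inf.
        return float("-inf")
    # p(x,y) / (p(x) * p(y)) simplifies to n_xy * n_total / (n_x * n_y).
    return math.log2(n_xy * n_total / (n_x * n_y))


# Hypothetical counts, not taken from the article:
# gene -> (documents with gene AND disease term, documents with gene)
gene_counts = {
    "GENE_A": (40, 200),
    "GENE_B": (5, 500),
    "GENE_C": (0, 50),
}
n_disease = 1_000    # documents mentioning the disease term (assumed)
n_total = 100_000    # documents in the collection (assumed)

scores = {
    gene: pmi(n_xy, n_x, n_disease, n_total)
    for gene, (n_xy, n_x) in gene_counts.items()
}
# Genes most strongly associated with the disease term come first.
ranking = sorted(scores, key=scores.get, reverse=True)

for gene in ranking:
    print(f"{gene}: {scores[gene]:.2f}")
```

A positive score means the gene name and the disease term co-occur more often than independence would predict; a score of zero means they co-occur exactly as often as chance would suggest.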
id doaj-art-2b4cc20643824aa79d5f669a5bc8d47b
institution DOAJ
issn 1932-6203
citation PLoS ONE 7(7): e39932 (2012). doi:10.1371/journal.pone.0039932