Revisiting pangenome openness with k-mers

Pangenomics is the study of related genomes collectively, usually from the same species or closely related taxa. Originally, pangenomes were defined for bacterial species. After the concept was extended to eukaryotic genomes, two definitions of pangenome evolved in parallel: the gene-based approach,...

Bibliographic Details
Main Authors: Parmigiani, Luca; Wittler, Roland; Stoye, Jens
Format: Article
Language: English
Published: Peer Community In 2024-04-01
Series: Peer Community Journal
Online Access: https://peercommunityjournal.org/articles/10.24072/pcjournal.415/
author Parmigiani, Luca
Wittler, Roland
Stoye, Jens
collection DOAJ
description Pangenomics is the study of related genomes collectively, usually from the same species or closely related taxa. Originally, pangenomes were defined for bacterial species. After the concept was extended to eukaryotic genomes, two definitions of pangenome evolved in parallel: the gene-based approach, which defines the pangenome as the union of all genes, and the sequence-based approach, which defines the pangenome as the set of all nonredundant genomic sequences. Estimating the total size of the pangenome for a given species has been a subject of study since the very first mention of pangenomes. Traditionally, this is done by predicting the ratio at which new genes are discovered, referred to as the openness of the species. Here, we abstract each genome as a set of items, an abstraction that is entirely agnostic of the two approaches (gene-based, sequence-based). Genes are a viable choice of items, but other choices are also feasible, e.g., genome sequence substrings of fixed length k (k-mers). In the present study, we investigate the use of k-mers as an alternative to genes for estimating openness, and compare the results. An efficient implementation is also provided.
format Article
id doaj-art-68df90d3a0f645e78c7da3d68f0f288a
institution Kabale University
issn 2804-3871
language English
publishDate 2024-04-01
publisher Peer Community In
record_format Article
series Peer Community Journal
spelling doaj-art-68df90d3a0f645e78c7da3d68f0f288a
2025-02-07T10:17:18Z
eng
Peer Community In
Peer Community Journal
2804-3871
2024-04-01
4
10.24072/pcjournal.415
Revisiting pangenome openness with k-mers
Parmigiani, Luca (0), https://orcid.org/0000-0002-2139-3259
Wittler, Roland (1), https://orcid.org/0000-0002-2249-9880
Stoye, Jens (2), https://orcid.org/0000-0002-4656-7155
(0) Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University – Bielefeld, Germany; Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Bielefeld University – Bielefeld, Germany; Graduate School “Digital Infrastructure for the Life Sciences” (DILS), Bielefeld University – Bielefeld, Germany
(1) Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University – Bielefeld, Germany; Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Bielefeld University – Bielefeld, Germany
(2) Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University – Bielefeld, Germany; Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Bielefeld University – Bielefeld, Germany
https://peercommunityjournal.org/articles/10.24072/pcjournal.415/
title Revisiting pangenome openness with k-mers
url https://peercommunityjournal.org/articles/10.24072/pcjournal.415/
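
The description in this record abstracts each genome as a set of items and names k-mers as one possible choice of item. The following minimal Python sketch illustrates that idea on toy data; it is not the authors' implementation. The function names, the toy sequences, and the value k = 3 are all hypothetical. The power-law (log-log) fit at the end is an assumption: it is one common way to summarize how quickly new items appear, and the record does not state which fitting method the article uses. The sketch also ignores reverse complements and the order in which genomes are added, both of which a real analysis would likely need to handle.

```python
import math


def kmer_set(sequence, k):
    """Return the set of all substrings of length k (k-mers) of one sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def growth_curve(item_sets):
    """Size of the union of items after adding each genome, in the given order."""
    seen = set()
    sizes = []
    for items in item_sets:
        seen |= items
        sizes.append(len(seen))
    return sizes


def new_items_per_genome(sizes):
    """Number of previously unseen items contributed by each added genome."""
    return [sizes[0]] + [b - a for a, b in zip(sizes, sizes[1:])]


def loglog_slope(sizes):
    """Least-squares slope of log(size) against log(number of genomes).

    Assumption: a power-law growth model. Needs at least two genomes and
    strictly positive sizes.
    """
    xs = [math.log(n) for n in range(1, len(sizes) + 1)]
    ys = [math.log(s) for s in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Hypothetical toy "genomes"; real input would be full genome sequences.
genomes = ["ACGTACGT", "ACGTTCGT", "GGGTACGA"]
item_sets = [kmer_set(g, 3) for g in genomes]
sizes = growth_curve(item_sets)
print(sizes)                        # [4, 7, 10]
print(new_items_per_genome(sizes))  # [4, 3, 3]
print(loglog_slope(sizes))
```

Because the functions only see sets of items, the same code would work with gene identifiers in place of k-mers, which is the sense in which the set abstraction is agnostic of the gene-based and sequence-based approaches.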