KPop: accurate and scalable comparative analysis of microbial genomes by sequence embeddings

Bibliographic Details
Main Authors: Xavier Didelot, Paolo Ribeca
Format: Article
Language: English
Published: BMC 2025-06-01
Series: Genome Biology
Online Access: https://doi.org/10.1186/s13059-025-03585-8
Description
Summary: Here we introduce KPop, a novel versatile method based on full k-mer spectra and dataset-specific transformations, through which thousands of assembled or unassembled microbial genomes can be quickly compared. Unlike MinHash-based methods that produce distances and have lower resolution, KPop is able to accurately map sequences onto a low-dimensional space. Extensive validation on simulated and real-life viral and bacterial datasets shows that KPop can correctly separate sequences at both species and sub-species levels even when the overall genomic diversity is low. KPop also rapidly identifies related sequences and systematically outperforms MinHash-based methods.
ISSN: 1474-760X