An integrated strain-level analytic pipeline utilizing longitudinal metagenomic data
ABSTRACT With the development of sequencing technology and analytic tools, studying within-species variations enhances the understanding of microbial biological processes. Nevertheless, most existing methods designed for strain-level analysis lack the capability to concurrently assess both strain pr...
| Main Authors: | Boyan Zhou, Chan Wang, Gregory Putzel, Jiyuan Hu, Menghan Liu, Fen Wu, Yu Chen, Alejandro Pironti, Huilin Li |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | American Society for Microbiology, 2024-11-01 |
| Series: | Microbiology Spectrum |
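| Subjects: | microbiome; longitudinal metagenomic data; strain-level analysis; genomic variants; strain dynamics |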
| Online Access: | https://journals.asm.org/doi/10.1128/spectrum.01431-24 |
| _version_ | 1850151230373888000 |
|---|---|
| author | Boyan Zhou, Chan Wang, Gregory Putzel, Jiyuan Hu, Menghan Liu, Fen Wu, Yu Chen, Alejandro Pironti, Huilin Li |
| author_sort | Boyan Zhou |
| collection | DOAJ |
| description | ABSTRACT With the development of sequencing technology and analytic tools, studying within-species variations enhances the understanding of microbial biological processes. Nevertheless, most existing methods designed for strain-level analysis lack the capability to concurrently assess both strain proportions and genome-wide single nucleotide variants (SNVs) across longitudinal metagenomic samples. In this study, we introduce LongStrain, an integrated pipeline for the analysis of large-scale metagenomic data from individuals with longitudinal or repeated samples. In LongStrain, we first utilize two efficient tools, Kraken2 and Bowtie2, for the taxonomic classification and alignment of sequencing reads, respectively. Subsequently, we propose to jointly model strain proportions and shared haplotypes across samples within individuals. This approach specifically targets tracking a primary strain and a secondary strain for each subject, providing their respective proportions and SNVs as output. With extensive simulation studies of a microbial community and single species, our results demonstrate that LongStrain is superior to two genotyping methods and two deconvolution methods across a majority of scenarios. Furthermore, we illustrate the potential applications of LongStrain in the real data analysis of The Environmental Determinants of Diabetes in the Young study and a gastric intestinal metaplasia microbiome study. In summary, the proposed analytic pipeline demonstrates marked statistical efficiency over the same type of methods and has great potential in understanding the genomic variants and dynamic changes at the strain level. LongStrain and its tutorial are freely available online at https://github.com/BoyanZhou/LongStrain. IMPORTANCE The advancement in DNA-sequencing technology has enabled the high-resolution identification of microorganisms in microbial communities. Since different microbial strains within species may contain extreme phenotypic variability (e.g., nutrition metabolism, antibiotic resistance, and pathogen virulence), investigating within-species variations holds great scientific promise in understanding the underlying mechanism of microbial biological processes. To fully utilize the shared genomic variants across longitudinal metagenomics samples collected in microbiome studies, we develop an integrated analytic pipeline (LongStrain) for longitudinal metagenomics data. It concurrently leverages the information on proportions of mapped reads for individual strains and genome-wide SNVs to enhance the efficiency and accuracy of strain identification. Our method helps to understand strains’ dynamic changes and their association with genome-wide variants. Given the fast-growing longitudinal studies of microbial communities, LongStrain, which streamlines analyses of large-scale raw sequencing data, should be of great value in microbiome research communities. |
| format | Article |
| id | doaj-art-291455979dc0431b8efb2e35e9fc6709 |
| institution | OA Journals |
| issn | 2165-0497 |
| language | English |
| publishDate | 2024-11-01 |
| publisher | American Society for Microbiology |
| record_format | Article |
| series | Microbiology Spectrum |
| spelling | doaj-art-291455979dc0431b8efb2e35e9fc6709; 2025-08-20T02:26:20Z; eng; American Society for Microbiology; Microbiology Spectrum; 2165-0497; 2024-11-01; 12; 11; 10.1128/spectrum.01431-24; An integrated strain-level analytic pipeline utilizing longitudinal metagenomic data; Boyan Zhou (Division of Biostatistics, Department of Population Health, New York University School of Medicine, New York, New York, USA); Chan Wang (Division of Biostatistics, Department of Population Health, New York University School of Medicine, New York, New York, USA); Gregory Putzel (Department of Microbiology, New York University School of Medicine, New York, New York, USA); Jiyuan Hu (Division of Biostatistics, Department of Population Health, New York University School of Medicine, New York, New York, USA); Menghan Liu (Department of Biological Sciences, Columbia University in the City of New York, New York, New York, USA); Fen Wu (Division of Epidemiology, Department of Population Health, New York University School of Medicine, New York, New York, USA); Yu Chen (Division of Epidemiology, Department of Population Health, New York University School of Medicine, New York, New York, USA); Alejandro Pironti (Department of Microbiology, New York University School of Medicine, New York, New York, USA); Huilin Li (Division of Biostatistics, Department of Population Health, New York University School of Medicine, New York, New York, USA); https://journals.asm.org/doi/10.1128/spectrum.01431-24; microbiome; longitudinal metagenomic data; strain-level analysis; genomic variants; strain dynamics |
| title | An integrated strain-level analytic pipeline utilizing longitudinal metagenomic data |
| title_sort | integrated strain level analytic pipeline utilizing longitudinal metagenomic data |
| topic | microbiome; longitudinal metagenomic data; strain-level analysis; genomic variants; strain dynamics |
| url | https://journals.asm.org/doi/10.1128/spectrum.01431-24 |