Multi-ancestry genome- and phenome-wide association studies of diverticular disease in electronic health records with natural language processing enriched phenotyping algorithm.

<h4>Objective</h4>Diverticular disease (DD) is one of the most prevalent conditions encountered by gastroenterologists, affecting ~50% of Americans before the age of 60. Our aim was to identify genetic risk variants and clinical phenotypes associated with DD, leveraging multiple electron...

Bibliographic Details
Main Authors: Yoonjung Yoonie Joo, Jennifer A Pacheco, William K Thompson, Laura J Rasmussen-Torvik, Luke V Rasmussen, Frederick T J Lin, Mariza de Andrade, Kenneth M Borthwick, Erwin Bottinger, Andrew Cagan, David S Carrell, Joshua C Denny, Stephen B Ellis, Omri Gottesman, James G Linneman, Jyotishman Pathak, Peggy L Peissig, Ning Shang, Gerard Tromp, Annapoorani Veerappan, Maureen E Smith, Rex L Chisholm, Andrew J Gawron, M Geoffrey Hayes, Abel N Kho
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2023-01-01
Series: PLoS ONE
Online Access: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0283553&type=printable
collection DOAJ
description <h4>Objective</h4>Diverticular disease (DD) is one of the most prevalent conditions encountered by gastroenterologists, affecting ~50% of Americans before the age of 60. Our aim was to identify genetic risk variants and clinical phenotypes associated with DD, leveraging multiple electronic health record (EHR) data sources of 91,166 multi-ancestry participants with a Natural Language Processing (NLP) technique.
<h4>Materials and methods</h4>We developed an NLP-enriched phenotyping algorithm that incorporated colonoscopy or abdominal imaging reports to identify patients with diverticulosis and diverticulitis from multicenter EHRs. We performed genome-wide association studies (GWAS) of DD in European, African and multi-ancestry participants, followed by phenome-wide association studies (PheWAS) of the risk variants to identify their potential comorbid/pleiotropic effects in clinical phenotypes.
<h4>Results</h4>Our algorithm significantly improved patient classification performance for DD analysis (algorithm PPVs ≥ 0.94), identifying up to 3.5-fold more patients than the traditional method. Ancestry-stratified analyses of diverticulosis and diverticulitis in the identified subjects replicated the well-established associations between ARHGAP15 loci and DD, with overall stronger GWAS signals in diverticulitis patients than in diverticulosis patients. Our PheWAS analyses identified significant associations between the DD GWAS variants and circulatory system, genitourinary, and neoplastic EHR phenotypes.
<h4>Discussion</h4>In this first multi-ancestry GWAS-PheWAS study, we showed that heterogeneous EHR data can be mapped through an integrative analytical pipeline to reveal significant genotype-phenotype associations with clinical interpretation.
<h4>Conclusion</h4>A systematic framework to process unstructured EHR data with NLP could advance deep and scalable phenotyping for better patient identification and facilitate etiological investigation of a disease with multilayered data.
id doaj-art-78f28682110648cfbe41e76d4bf5e28c
institution OA Journals
issn 1932-6203
spelling doaj-art-78f28682110648cfbe41e76d4bf5e28c | 2025-08-20T02:36:18Z | eng | Public Library of Science (PLoS) | PLoS ONE | 1932-6203 | 2023-01-01 | 18 | 5 | e0283553 | 10.1371/journal.pone.0283553