High resolution models of transcription factor-DNA affinities improve in vitro and in vivo binding predictions.

Accurately modeling the DNA sequence preferences of transcription factors (TFs), and using these models to predict in vivo genomic binding sites for TFs, are key pieces in deciphering the regulatory code. These efforts have been frustrated by the limited availability and accuracy of TF binding site...

Bibliographic Details
Main Authors: Phaedra Agius, Aaron Arvey, William Chang, William Stafford Noble, Christina Leslie
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2010-09-01
Series: PLoS Computational Biology
Online Access: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1000916&type=printable
collection DOAJ
description Accurately modeling the DNA sequence preferences of transcription factors (TFs), and using these models to predict in vivo genomic binding sites for TFs, are key pieces in deciphering the regulatory code. These efforts have been frustrated by the limited availability and accuracy of TF binding site motifs, usually represented as position-specific scoring matrices (PSSMs), which may match large numbers of sites and produce an unreliable list of target genes. Recently, protein binding microarray (PBM) experiments have emerged as a new source of high-resolution data on in vitro TF binding specificities. PBM data has been analyzed either by estimating PSSMs or via rank statistics on probe intensities, so that individual sequence patterns are assigned enrichment scores (E-scores). This representation is informative but unwieldy because every TF is assigned a list of thousands of scored sequence patterns. Meanwhile, high-resolution in vivo TF occupancy data from ChIP-seq experiments is also increasingly available. We have developed a flexible discriminative framework for learning TF binding preferences from high-resolution in vitro and in vivo data. We first trained support vector regression (SVR) models on PBM data to learn the mapping from probe sequences to binding intensities. We used a novel k-mer based string kernel called the di-mismatch kernel to represent probe sequence similarities. The SVR models are more compact than E-scores, more expressive than PSSMs, and can be readily used to scan genomic regions to predict in vivo occupancy. Using a large data set of yeast and mouse TFs, we found that our SVR models can better predict probe intensity than the E-score method or PBM-derived PSSMs. Moreover, by using SVRs to score yeast, mouse, and human genomic regions, we were better able to predict genomic occupancy as measured by ChIP-chip and ChIP-seq experiments. Finally, we found that by training kernel-based models directly on ChIP-seq data, we greatly improved in vivo occupancy prediction, and by comparing a TF's in vitro and in vivo models, we could identify cofactors and disambiguate direct and indirect binding.
id doaj-art-ca3cccfba20f4d749a137a9a04a5c70f
institution OA Journals
issn 1553-734X
1553-7358
spelling doaj-art-ca3cccfba20f4d749a137a9a04a5c70f; 2025-08-20T02:14:37Z; eng; Public Library of Science (PLoS); PLoS Computational Biology; ISSN 1553-734X, 1553-7358; 2010-09-01; volume 6, issue 9, e1000916; doi:10.1371/journal.pcbi.1000916
title High resolution models of transcription factor-DNA affinities improve in vitro and in vivo binding predictions.