Ph.D. Defense

Predicting Transcript Production Rates in Yeast With Sparse Linear Models

Speaker: Yezhou Huang
yhuang at
Date: Monday, June 6, 2016
Time: 2:00pm - 5:00pm
Location: D344 LSRC, Duke


To provide biological insight into transcriptional regulation, a couple of groups have recently presented models relating promoter-bound transcription factors (TFs) to the mean transcript level or transcript production rate of downstream genes over time. However, transcript production is dynamic, responding to changes in TF concentrations over time. TFs are also not the only factors that bind promoters: other DNA-binding factors (DBFs), especially nucleosomes, bind as well, resulting in competition among DBFs for binding at the same genomic location. Nor are TFs the only elements that regulate transcription. Within the core promoter, various regulatory elements influence RNA polymerase II (RNAPII) recruitment, pre-initiation complex (PIC) formation, the RNAPII search for the transcription start site (TSS), and transcription initiation. It has also been proposed that nucleosomes downstream of the TSS resist RNAPII elongation.

Here, we provide a machine learning framework to predict transcript production rates from promoter DNA sequences. We applied this framework to the yeast S. cerevisiae in two scenarios: a) predicting the dynamic transcript production rate during the cell cycle for native promoters; b) predicting the mean transcript production rate over time for synthetic promoters. As far as we know, our framework is the first successful attempt to model dynamic transcript production and transcriptional regulation. On the cell cycle data set, we obtained a Pearson correlation coefficient Cp = 0.751 and a coefficient of determination r² = 0.564 on the test set for predicting the dynamic transcript production rate over time. On the DREAM6 Gene Promoter Expression Prediction challenge, our fitted model outperformed all participating teams, including the best-performing team, as well as a model combining the best team's k-mer based sequence features with another paper's biologically mechanistic features, on almost all scoring metrics.

Moreover, interpreting these highly predictive models shows that our framework can identify generalizable features and thereby provide support for the associated hypothesized mechanisms. The learned sparse linear models yielded results supporting the following biological insights: a) TFs govern the probability of RNAPII recruitment and initiation, possibly through interactions with PIC components and transcription cofactors; b) the core promoter amplifies transcript production, probably by influencing PIC formation, RNAPII recruitment, DNA melting, the RNAPII search for and selection of the TSS, and the release of RNAPII from general transcription factors, and thereby initiation; c) there is strong transcriptional synergy between TFs and core promoter regulatory elements, which very likely represent the respective DNA sequence signals for recruiting general transcription factors and transcription cofactors, and for TSS scanning and selection; d) the regulatory elements within the core promoter region extend beyond the TATA box and the nucleosome-free region, suggesting the existence of still unidentified TAF-dependent and cofactor-dependent core promoter elements in the yeast S. cerevisiae; e) the nucleosome occupancy profile helps capture the regulatory roles of the -1 and +1 nucleosomes in transcription.

Advisor(s): Alexander Hartemink
Committee: Sayan Mukherjee, Raluca Gordan, Ryan Baugh