Here, we present a machine learning framework for predicting transcript production rates from promoter DNA sequences. We applied the framework to the yeast S. cerevisiae in two scenarios: a) predicting the dynamic transcript production rate of native promoters during the cell cycle; b) predicting the mean transcript production rate over time of synthetic promoters. To our knowledge, this framework is the first successful attempt to model dynamic transcript production and transcriptional regulation. On the cell cycle data set, it achieved a Pearson correlation coefficient Cp = 0.751 and a coefficient of determination r² = 0.564 on the test set for predicting the dynamic transcript production rate over time. On the DREAM6 Gene Promoter Expression Prediction challenge, our fitted model outperformed, on almost all scoring metrics, every participating team, including the best-performing one, as well as a model that combines the best team's k-mer-based sequence features with the biologically mechanistic features of another paper.
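For readers unfamiliar with the two test-set metrics quoted above, the following is a minimal sketch of how a Pearson correlation coefficient and a coefficient of determination are commonly computed. The function names and the toy `observed`/`predicted` values are hypothetical and are not taken from this work; the exact definition of the coefficient of determination used by the authors is not stated here. Note that the two reported values are numerically consistent with the second being the square of the first (0.751 × 0.751 ≈ 0.564), which would suggest, without confirming, that it was obtained as a squared correlation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot.

    This is one common definition; squaring the Pearson correlation is
    another, and the two agree only in special cases.
    """
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy values standing in for measured and predicted rates.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.2, 1.9, 3.3, 3.8, 5.1]

print("Pearson correlation:", pearson(observed, predicted))
print("Squared correlation:", pearson(observed, predicted) ** 2)
print("1 - SS_res/SS_tot:  ", r_squared(observed, predicted))
```

On held-out data the two definitions of the coefficient of determination can differ, so it is worth stating which one is meant when comparing results across studies.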
Moreover, interpreting the highly predictive models shows that our framework can identify generalizable features and thereby provide support for the associated hypothesized mechanisms. The learned sparse linear models support the following biological insights: a) TFs govern the probability of RNAPII recruitment and initiation, possibly through interactions with PIC components and transcription cofactors; b) the core promoter amplifies transcript production, probably by influencing PIC formation, RNAPII recruitment, DNA melting, RNAPII scanning for and selection of the TSS, and the release of RNAPII from general transcription factors, and thereby initiation; c) there is strong transcriptional synergy between TFs and core promoter regulatory elements, which very likely represent the respective DNA sequence signals for recruiting general transcription factors and transcription cofactors, and for TSS scanning and selection; d) the regulatory elements within the core promoter region are not limited to the TATA box and the nucleosome-free region, suggesting the existence of still unidentified TAF-dependent and cofactor-dependent core promoter elements in the yeast S. cerevisiae; e) the nucleosome occupancy profile is helpful for representing the regulatory roles of the -1 and +1 nucleosomes in transcription.
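To illustrate why a sparse linear model lends itself to this kind of interpretation, the sketch below fits an L1-regularized (lasso) linear regression by coordinate descent on a tiny synthetic data set. This is an assumption made for illustration only: the abstract does not say which sparse estimator or which feature encoding was used, and the helper names, the penalty `alpha`, and the toy feature matrix are all hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimize (1 / 2n) * ||y - Xw||^2 + alpha * ||w||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n) if z > 0 else 0.0
    return w


# Hypothetical toy data: only the first feature drives the response.
X = [
    [1.0, 0.5, -1.0],
    [2.0, -0.3, 0.4],
    [3.0, 0.8, 0.1],
    [4.0, -0.6, -0.2],
    [5.0, 0.1, 0.7],
    [6.0, -0.4, -0.5],
]
y = [2.0 * row[0] for row in X]

weights = lasso_coordinate_descent(X, y, alpha=0.1)
selected = [j for j, w_j in enumerate(weights) if w_j != 0.0]
print("weights:", weights)
print("selected feature indices:", selected)
```

The L1 penalty drives the weights of uninformative features exactly to zero, so the surviving nonzero weights can be read directly as the features the model relies on, which is the property that makes feature-level interpretation of such models possible.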