Duke Computer Science Colloquium

Taming Big Sequencing Data for RNA Biology: From Transcript Abundance Estimation to ‘Epitranscriptomic’ Mark Detection

Speaker: Bo Li
Date: Wednesday, April 5, 2017
Time: 12:00pm - 1:00pm
Location: D106 LSRC, Duke
Pizza will be served at 11:45.


Next-generation sequencing (NGS) technology is one of the most significant genomics innovations of the past few decades. It promises a future of genomic diagnosis and personalized treatment of human disease. The power of NGS lies in its ability to measure almost any molecular signal of interest using the following paradigm: 1) signal embedding; 2) sequencing; 3) signal extraction. This talk introduces two of my projects on extracting signal from big sequencing data.

The first project addresses the problem of accurately estimating transcript abundance from RNA-sequencing (RNA-Seq) data. Transcript abundance is a relative measure of transcript copy number in cells. It is a fundamental quantity in biology and has a major impact on human health: studies have shown that transcript abundances are often altered in disease conditions. To extract the abundance “signal” from RNA-Seq data, we developed RSEM, one of the most accurate transcript abundance estimation tools, using modern statistical learning techniques. RSEM has been used extensively around the world since its release: the RSEM papers have been cited over 2,300 times, and RSEM is used in large consortium projects such as TCGA and ENCODE.

The second project introduces PROBer, a statistical learning tool for accurate epitranscriptomic mark detection. Epitranscriptomics, also known as RNA epigenetics, is a new field focusing on the study of RNA structure, RNA modification and RNA-protein interaction at the transcriptome scale. These three aspects are key to understanding the mechanism of alternative splicing, whose disruption often results in severe disease. Epitranscriptomic sequencing data often contain background noise and ambiguous position information, which together reduce the detection accuracy of epitranscriptomic marks. We therefore need to solve the problems of signal separation and ambiguity resolution simultaneously. Existing analysis methods rely heavily on ad hoc heuristics, which cannot handle these two problems well. PROBer incorporates both background noise and position information into its generative probabilistic model and learns them from data automatically. We compare PROBer with existing methods for detecting epitranscriptomic marks. Results on both simulated and real data show that PROBer outperforms all of them.

In recent years, epitranscriptomics has become a hot research topic. Nature Methods recently selected epitranscriptome analysis as its Method of the Year. In addition, several newly approved NIH grants suggest that substantial funding will be available in the future. In the last part of my talk, I will discuss how my future work fits into this emerging trend in epitranscriptomics research.


Dr. Bo Li is a postdoctoral researcher in the Center for RNA Systems Biology at the University of California, Berkeley. His research focuses on RNA-centric systems biology and next-generation sequencing data analysis using modern statistical learning techniques. He received his Ph.D. in computer science from the University of Wisconsin-Madison under the supervision of Colin Dewey. He then completed postdoctoral training with Lior Pachter at the University of California, Berkeley. He is best known for his work on RSEM, a popular transcript quantification tool for RNA-Seq data, which has been cited over 2,300 times and adopted in large consortium projects such as TCGA and ENCODE.

Hosted by:
Raluca Gordan