学位論文要旨



No 128695
著者(漢字) 桑,飞
著者(英字) SANG,FEI
著者(カナ) サング,フェイ
標題(和) RNAシーケンシングデータからのスプライシング変異産物の検出と定量に関する包括的アプローチ
標題(洋) A global approach for identification and quantification of splicing variants using RNA-Seq data
報告番号 128695
報告番号 甲28695
学位授与日 2012.09.27
学位種別 課程博士
学位種類 博士(工学)
学位記番号 博工第7869号
研究科 工学系研究科
専攻 先端学際工学専攻
論文審査委員 主査: 東京大学 教授 油谷,浩幸
 東京大学 教授 浜窪,隆雄
 東京大学 特任教授 井原,茂男
 東京大学 特任准教授 金田,篤志
 東京大学 准教授 石川,俊平
内容要旨 要旨を表示する

Abstract

Alternative Splicing (AS) plays a major role in generating the diversity of transcriptome and proteome, and affects development and disease by producing structurally and functionally distinct mRNA and protein variants. Transcripts from approximately 95% of multi-exon human genes are spliced in more than one way, and in most cases the resulting transcripts are variably expressed between different cell and tissue types. The second generation sequencing technology is now opening unprecedented routes to address the analysis of entire transcriptome. However, accurate quantification of the abundances of transcripts still remains a big challenge. We have developed a global approach that allows the identification and quantification of splicing variants derived solely from the read count and the base count in RNA-Seq data. It is based on an explicit statistical model and enables the identification of splicing variants using the known gene annotation or the assembled transcript structure information, as well as the relative quantification. To test our method, we generated and analyzed 120 million paired 75-bp artificial mouse RNA-Seq reads from about 18,000 various transcripts.

Analysis Pipeline

Our analysis can be summarized into 4 main phases: Mapping Phase, Counting Phase, Identifying Phase and Quantifying Phase. In Mapping Phase, the align tools, eg. Tophat, are applied to map the short RNA-Seq reads to a genome reference to generate ordinary reads and splice reads; In Counting Phase, two different calculations are used to count the number of mapped reads and the number of mapped bases for a given location respectively. Moreover, in order to simply all the alternative splicing events into exon skipping, a special subexon matrix was generated. If an exon appears in two or more isoforms at different lengths, we divide this exon into several different parts and treat each part as a subexon. Then according to this subexon matrix the structure of each transcript is identified and described, which means if this transcript contains a subexon, the value of this subexon in the matrix can be set to 1, otherwise to 0. The count calculation is preformed depending on the subexon matrix. During the counting phase, the base count is calculated separately for ordinary reads and splice reads in order to maintain the valuable splicing information which is of much importance to estimate isoform expression level; In Identifying Phase, two strict assumptions help us filter noisy incorrectly expressed transcripts caused by mapping errors or incomplete express transcripts due to low read coverage. One is "If one exon was not expressed, the whole transcript was not expressed." The other one is "If one junction between two exons was not expressed, the whole transcript was not expressed." Perhaps these two assumptions are too strict to succeed capturing the low expressed transcripts; In Quantifying Phase, we developed a novel global approach, a probabilistic model that uses information of both normal reads and splice reads in RNA-Seq data to perform more accurate estimations of transcript expressions.

Results

Analysis with a known gene reference

FluxSimulator totally selected 15,333 genes and 18,143 isoforms for mouse 20M dataset, 15,475 genes and 18,330 isoforms for mouse 40M dataset, 15,524 genes and 18,411 isoforms for mosue 60M dataset, 13,531 genes and 18,425 isoforms for human 60M dataset. The total number of mouse expressed transcripts selected by FluxSimulator did not rise along with the increasing sequenced reads, which indicated that the expression of each transcript grew up. Totally we identified 7,279 genes and 8,090 isoforms in mouse 20M dataset, 8,014 genes and 8,928 isoforms in mouse 40M dataset, 8,399 genes and 9,452 isoforms in mouse 60M dataset respectively, as well as 5,308 genes and 6,535 isoforms in human 60M dataset. Cufflinks identified 11,602 genes and 12,448 isoforms in mouse 20M dataset, 12,401 genes and 13,317 isoforms in mouse 40M dataset, 12,660 genes and 13,579 isoforms in mouse 60M dataset respectively, as well as 10,496 genes and 11,946 isoforms in human 60M dataset.

The comparison results of mouse 60M dataset reported that most cufflinks results were underestimated the expression level of isoforms, with only 325 accurate predictions under the error rate below 10%. In addition, 6,176 non-expressed transcripts were incorrectly predicted by Cufflinks giving estimated expression values, while 1,276 false positive results existed in Cufflinks results. Our results showed higher correlation value than cufflinks, which means the results were more similar with the true expression levels. 9,316 non-expressed transcripts were incorrectly predicted by our method. Although more false negative estimations existed in our results due to the strict assumptions in the Identifying Phase, only 406 false positive estimations are included in our results. Furthermore 4,515 out of 9,049 transcripts were properly estimated under the error rate below 10%, 49.8% accuracy, much better than Cufflinks. However Cufflinks captured more transcripts with a few mapped reads than our method. Genes with multiple isoforms were specially derived from the mouse 60M results to access the expression ratio estimations, 8,126 isoforms obtained from Cufflinks results, 4,699 isoforms from our results. Cufflinks had 4,125 (about 51%) well-estimated results with error rate below 10% while 4,150 (about 88%) by our method. Besides, our expression level estimations were also more accurate with 0.814 correlation value than Cufflinks estimations with 0.320 correlation value which contained some well-estimated expression percentage but with inaccurate expression levels. For example, Rasgrf1 had one long isoform and one short isoform. Both our method and Cufflinks gave the accurate expression percentage estimation, however the expression levels that Cufflinks produced were much lower than our estimations and true expression levels.

Analysis with a de novo assembled reference

In order to compare with Cufflinks results, we also used Trinity, a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data, to help assemble short RNA-Seq reads. First, Trinity was applied to perform assembly; Then, those assembled transcripts were mapping to the genome reference by Blat; At last, the identification and quantification were analyzed by our global approach. Here only 60M dataset were analyzed. In the first assembly step, Trinity generated more transcripts (23,781) than the simulated transcripts (18,793), indicating the results of Trinity included much more false positive data than Cufflinks which reconstructed 12,058 transcripts. After filtering the transcripts which were not included in the gene annotation reference, Cufflinks only kept 2,139 transcripts while Trinity kept 7,204. Their distributions of expression levels were almost similar with the true expressions. Still most cufflinks results were underestimated the expression level of isoforms, with only 132 accurate predictions under the error rate below 10%. 5,231 false negative estimations and 51 false positive estimations remained in the Cufflinks results. Our results showed 3,594 transcripts were properly estimated under the error rate below 10%. 294 false negative estimations and 58 false positive estimations remained in the results. Although Trinity generated much more false positive noises, it captured many isoforms that cufflinks failed to detect, about 5,000 transcripts in mouse 60M dataset. Genes with multiple isoforms were specially derived from the 60M results to access the expression ratio estimations, 485 isoforms obtained from Cufflinks results, 645 isoforms from our results. Cufflinks had 303 (about 62.4%) well-estimated results while 416 (about 64.5%) by our method. The accuracies of percentage estimations were almost the same, however, our method provided more accurate expression level estimations than Cufflinks.

Conclusion and Discussion

Through the comparison between Cufflinks estimations and true expression levels, it was clear that Cufflinks underestimated the expressions of most transcripts. Two possible factors may lead to the underestimation of Cufflinks. One is the mapping problem, which Cufflinks estimation depends on. In this study, we use Tophat as our main mapping tool that was developed by the same group of Cufflinks. Since Cufflinks requires the mapped fragments to estimate the transcript abundance, the case of only one fragment read mapped to the genome reference cannot be used in Cufflinks abundance estimation. As we discuss above, only 36% of total simulated reads in mouse 60M dataset were perfect fragment alignments, which may cause the underestimation of Cufflinks. However, when we only use these perfect fragment alignments to estimate transcript abundance by Cufflinks, the results were still underestimated although the estimations turned better. The other reason that caused the underestimation may be the Poisson Model applied by Cufflinks. However, the authors of Cufflinks did not discuss about how they calculated the fragment count, therefore, we cannot find the real reason of underestimation. In our study, we encouraged to apply the base count instead of the read count. Read Count overestimates the counts of short exon, because the read length is longer than the exon size. Because the subexon matrix was applied in our statistical model, the frequency of short exons became much higher. Base Count played an essential role in our estimation of transcript expression levels. Our results illuminate that our global approach was more accurate than Cufflinks wide-used currently. Our results provide a probabilistic model for RNA-seq analysis, offer more accurate isoform expression estimation and develop a pipeline for studies of gene and isoform expression.

Table 1. Summary of simulated results and estimated results

Figure 1. Expression level comparison between estimated values calculated by Cufflinks and our method and true values calculated by simulated expressions when using a known gene reference. Epression level raito between estimated values calculated by Cufflinks and our method and true values calculated by simulated expressions when using a known gene reference.

審査要旨 要旨を表示する

本論文は「A global approach for identification and quantification of splicing variants using RNA-Seq data(RNAシーケンシングデータからのスプライシング変異産物の検出と定量に関する包括的アプローチ)」と題し、5つの章から構成されている。

第1章「Introduction」では本論文の背景と目的及び構成について述べている。ゲノムDNAからRNAが転写されるプロセスにおいて、選択的スプライシングはトランスクリプトームやプロテオームの多様性の生成に主要な役割を担っており、構造、機能的に異なるmRNAとタンパク質変異は、発生、疾患に影響を与えている。ヒト遺伝子の転写産物の約95%は複数のスプライスパターンを持ち、細胞、組織間で異なった発現を示す。しかしながら、哺乳類の選択的転写産物のレパートリーとその制御機構に関する知見は現在では乏しい。次世代シーケンシング技術は膨大な量のトランスクリプトーム情報をもたらし、その解析に道筋を示したと言える。

膨大な量の転写産物の配列決定とそれらのゲノムへのマッピングにより、RNA-Seq 技術は新規転写産物の発見とその発現量の推定を可能とした。これまでのRNA-seqデータ解析手法はmRNAのアセンブリとアノテーション、新規エクソンや遺伝子の発見、発現レベルの推定などに焦点が当てられていた。現在RNA-Seq解析で頻繁に使用されるCufflinks,やScripture等の手法はde novo アノテーションが行えるが、転写産物の正確な定量は難しく大きな課題として未だに残っている。

第2章「Data resource」および第3章「methodology」では本研究で用いたデータおよび開発した解析手法について述べている。RNA-Seqリードおよびベースカウントによるスプライスバリアントの同定と定量が可能な網羅的アプローチを開発した。既知遺伝子アノテーション若しくはアセンブル後の転写産物構造の情報を用いた同定と相対的定量が可能な統計モデルである。本手法の検証は、転写産物の量、種類を事前に得ることができるように、FluxSimulator により生成した人工データを解析することにより行った。1億2000万の75塩基のペアエンドリードから、18,000のトランスクリプトバリアントを得た。アノテーションレファレンスの利用の有無で得られた2種類の結果をCufflinksの結果と比較した。

第4章「Result」では開発した手法を用いた解析結果について述べている。アノテーションを利用した場合には、新規手法ではリード数の異なる3つのデータセットで、8,090, 8,928, 9,452の転写産物を同定した。 一方Cufflinksはそれぞれ12,448, 13,317,13,579個を同定し、偽陽性を含むと考えられた。Cufflinksと比較して真の発現レベルと高い相関が得られた。発現比においては88%以上が適正に推定されたのに対して、Cufflinksでは51%が良好に推定された。さらに発現比の推定だけでなく、発現レベルにおいてもCufflinksより正確であった。

de novo assembly解析も新手法として適用したところ、57%正確であったのに対しCufflinksは9%であった。Poisson model がこの様な過少推定を行っていると考えられる。シーケンシング過程が単純なランダムサンプリングと仮定すれば、サンプル中の全塩基から一様且つ独立に全てのリードが得られるはずである。しかしながら、リードは真にランダムではなく一様でもない場合も多く、これはRNA-Seq実験の技術的なバイアスによるものと考えられる。他の理由として、アイソフォームの発現量推定にリードカウントを使用するのは短いエクソンのカウントにおいて過大推定になるかもしれない。

第5章「Conclusion and Discussion」では以上の成果を要約し,議論している。本研究では既存の方法より正確に速くRNAアイソフォームの同定と定量を正確におこなうために幾つかの独創的な開発を行った。第一に、リードカウントの代わりにベースカウントを適用した。第二に、Poissonモデルに変わる新規の確率的アルゴリズムを開発した。第三に、更に正確に速く推定するためのアルゴリズムを追加した。Subexon Matrix、スプライシングリード情報を用いることなど。本研究で新規アプローチを開発したが、その推定には自身の制限による不正確さが含まれている。一つ目の不利点として、仮定が含まれる同定フェイズ(Identifying Phase)が厳密過ぎると発現が低いスプライシングバリアントを検出できない。しかし、これらの2つの仮定を除けばより多くの偽陽性を含むことになる。もう一つの不利点として遺伝子構造が複雑すぎる場合にはアイソフォーム検出に失敗することが多い。しかし、複雑な構造の場合、現行のRNA-seqデータからはどんな手法を使っても難しい。

以上のことを勘案して審査員5 人の合議による最終結論は、本論文の提出者が自立して研究活動を行い、高度な専門的業務に従事するために必要な能力およびその基礎となる豊かな学識を有していることを示すものであると判断した。

よって本論文は博士(工学)の学位請求論文として合格と認められる。

UTokyo Repositoryリンク