学位論文要旨



No 125081
著者(漢字) 曲,薇
著者(英字)
著者(カナ) チュウ,ウェイ
標題(和) 次世代シーケンサのリードエラーを取り除くための、頻度情報を用いた短いリードを線形時間にデノボクラスタリングするアルゴリズム
標題(洋) Efficient frequency-based de novo short read clustering for error trimming in next-generation sequencing
報告番号 125081
報告番号 甲25081
学位授与日 2009.03.23
学位種別 課程博士
学位種類 博士(科学)
学位記番号 博創域第499号
研究科 新領域創成科学研究科
専攻 情報生命科学専攻
論文審査委員 主査: 東京大学 教授 浅井,潔
 東京大学 教授 森下,真一
 東京大学 教授 高木,利久
 東京大学 教授 伊藤,隆司
 東京大学 教授 服部,正平
 東京大学 准教授 鈴木,穣
内容要旨 要旨を表示する

Novel massively parallel sequencing technologies provide a highly detailed structure of transcriptome and genome by yielding deep coverage of short reads, though their utility is interfered due to a considerable sequencing quality problem and short length of reads.Sequencing-error trimming in short reads is therefore a vital process which could improve the successful rate of reference mapping as well as polymorphorism detection.

There are several major drawbacks in the previous computational methodologies of correcting sequencing errors in short reads. First, a common feature of these methods is that reads have to be mapped uniquely to the reference, which may fail to detect the short erroneous reads aligned to unique but false positions (Figure 1B). Thus, a better detection of sequencing error should be performed before alignment to the reference genome. Second, another popular procedure is setting a minimum threshold on frequency in an ad-hoc manner to remove erroneous sequences originated in highly expressed sequences; however, this approach ignores sequences of low abundance at once and allows erroneous sequences with high frequency. Third, quality value selections such as Neighborhood Quality Standard (NQS) windows used for capillary sequencing do not match the next-generation sequencing strategy, which outputs bases at a same position in millions of reads simultaneously by an independent sequencing cycle. We need to exploit another sequencing error model tailored to the characteristics of next-generation sequencing.

Figure 1. Illustration of the major benefit of de novo clustering. A real cDNA is shown as a brown bar and short reads originating in it are merged into non-redundant reads with unique sequences and presented by grey bars, whose contrasting densities are proportional to their frequencies. The reference genome is shown as a long arrowed line flagged with the corresponding locus of the cDNA by a brown bar. Alignments of best hits are highlighted by blue dashed lines. Red dots emphasize base positions where sequences disagree with the original cDNA sequence. A. A short cDNA is deeply sequenced with stochastically arising sequencing-error. B. Direct alignment. Besides the correct alignments, some short reads have multiple best hits, as illustrated by the leftmost read; some fail in alignment due to too many sequencing errors, as shown by the aslant read; some are aligned to a false positive position. C. De novo clustering before the direct alignment. Erroneous reads are organized into the tree in the lower portion. The root indicates the representative sequence of the cluster, which is the darkest,most abundant read labeled with an asterisk in the upper part. In the tree, parent-children relationships are depicted by dashed lines.

In this thesis, I have addressed these issues and provide an effective solution. Before the alignment to the reference genome, erroneous 'child' sequences are clustered into a group represented by its 'parent' sequence, where a child sequence is considered to be stochastically originating in its more abundant parent sequence due to sequencing errors in the same experiment (Figure 1C). Indeed, mapping results show that broad parent-child relations inherently exist among reads generated from a same experiment. Subsequently, we integrate the parent-child relations among reads into trees such that the sequences at the root nodes are the most frequent sequences in individual trees and are treated as the representatives of all erroneous sequences in the trees. As illustrated in Figure 1, erroneous short reads that may be aligned to wrong positions or failing in mapping are clustered so that we map these representative sequences to the genome to anchor their locations. This approach is effective in resolving the low quality problem in short read sequencing because it avoids mapping erroneous reads to false-positive positions in the reference genome, and it eludes using an ad-hoc frequency threshold but outputs trees with reliable representative sequences regardless of abundance.

Since massive reads have to be clustered in a reasonable amount of time, we attempted to minimize the computational time of the clustering process. The crucial part of the algorithm is illustrated in Figure 2. The following three steps are repeated until the list of non-redundant sequences sorted by their frequencies in the descending order becomes empty:

1. Eliminate the bottom sequence of the lowest abundance, S, from the list.

2. Select the most frequent sequence S" such that

Frequency(S") = max{ Frequency(S') | Hamming(S,S') = 1, Frequency(S') > Frequency(S) },

where the Hamming distance between two sequences of equal length S and S' Hamming(S,S') is the number of positions where their bases are different.

3. Set the parent of S to S" if S is proved to be derived from S" due to sequencing errors according to a non-sequencing error statistical test.

Additionally, we developed an error model adjusted for next-generation sequencing by refining the traditional random model of error rate evaluation used in PolyBayes (Marth, G. T. et al.1999) to involve substitution patterns arising due to fluorophore cross-talk noise factor. Our two experiments on small RNAs and 5'-end serial analysis of gene expression (SAGE) sequenced by Illumina/Solexa proved that our de novo clustering remarkably reduced sequencing errors. A remarkable increase (~5%) of short reads aligned to reference sequence was confirmed,especially a significant increase (relative raise of ~200%) is observed in the rate of reads with perfect match to reference sequence.

We proposed a frequency-based detection of parent-child relations accelerated with hash-based sequence searches that runs in time linear to the number of given reads. This method complements the ability of base callers and the error correction by direct alignment with the reference genome, and is able to improve the overall accuracy of short read alignment by consulting inherent relationships of the entire set of reads. As a perspective on its universal application, de novo clustering of color space reads of ABI/SOLiD short reads is considered to be viable as well.

Figure 2. Algorithm of de novo clustering. The green list coupled with four sequences is a schematic view of millions of sequences.

審査要旨 要旨を表示する

近年の技術革新により、高度に並列化された超高速DNA配列解読装置が開発され、ゲノムDNA配列の読み取り速度が100倍以上高速化された。これにより、ゲノムDNA配列、発現情報を反映したmRNA配列などが網羅的に取得できるようになり、生命科学に大きな変革がもたらされつつある。

現在の超高速DNA配列解読装置は、その並列性、高速性を利点として持つ一方、1本のリードが短いという問題点を抱えている。リード長が短く、かつ読み取り誤りを含むため、そのクラスタリングや参照ゲノム配列への貼り付けが困難で、貴重な大量配列データを十分に活用することが難しいことが大きな問題となっている。

参照ゲノム配列がある場合、従来の手法では、個々のリードを直接参照配列と比較していたため、一意に張り付かなかったり、張り付く場所がなかったりするリードが大量に捨てられる結果となり、超高速DNA配列解読装置の性能を十分に引き出すことができなかった。

本論文では、ゲノムヘアラインメントする前に配列の頻度情報を利用してクラスタリングする手法を提案した。頻度がより高いリードと類似の頻度の低いリードは、その多くが読み取り誤りである可能性があるため、配列情報と頻度情報から読み取り誤りを反映した配列の親子関係をモデル化し、クラスタリングを行っている。クラスタリングと代表配列の検出においては、ハッシュ法を活用し、高速な線形時間アルゴリズムの構成に成功した。この結果、転写開始点やsmall RNA配列中のミスを補正し、全配列の約5%を新たに正確な位置ヘアラインメントできることを示した。

本論文が提案した手法は、現在の超高速DNA配列解読装置を活用するために必要な技術であると認められる。将来のDNA配列解読装置の進歩によってリード長の問題は解決されるが、短い非コードRNAの解析においては短いリードの扱いは今後も重要な課題であるから、本論文の成果の重要性は一時的なものではないと考えることができる。

以上の点から、本論文は生命情報科学に貢献する重要な貢献であると認めることができる。

UTokyo Repositoryリンク