Dissertation Abstract



No 124018
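Author (kanji) マルチン,レオナルド ドリベラ
Author (English) Martins,Leonardo de Oliveira
Author (kana) マルチン,レオナルド ドリベラ
Title (Japanese) ウイルスゲノム組換えのベイズ推定 : DNA部分配列の間のトポロジー距離とその分布
Title (English) Bayesian Inference of Viral Recombination : Topology distance between DNA segments and its distribution
Report number 124018
Report number 甲24018
Date conferred 2008.07.04
Degree category Course doctorate (課程博士)
Degree type Doctor (Agriculture)
Diploma number 博農第3350号
Graduate school Graduate School of Agricultural and Life Sciences
Department Department of Agricultural and Environmental Biology
Dissertation committee Chair: The University of Tokyo, Professor 岸野,洋久
 The University of Tokyo, Professor 白子,幸男
 The University of Tokyo, Professor 嶋田,透
 The University of Tokyo, Associate Professor 大森,裕浩
 The University of Tokyo, Lecturer 高野,泰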
Abstract

Phylogenetic inference is the problem of reconstructing the ancestry among a group of DNA or protein sequences, which is classically represented by a phylogenetic tree. These sequences may represent different species, different genes from the same species, or both, and the underlying assumption is that they share a common ancestor. To achieve consistency (the certainty that we approach the true phylogeny as more data become available) we would like to collect and analyze large genomic sequences. The complication is that, besides the natural limitation of genome sizes, organisms can exchange genetic material among themselves, rendering the topological interpretation inaccurate. One example of such an exchange is recombination.

In HIV-1, the reverse transcriptase switches RNA templates on average 3 times per replication cycle, yielding an average of about one recombinational strand transfer event per 3000 base pairs. A similar rate is also found in HIV-2 and murine leukemia viruses. Recombination has also been found to play a role in severe acute respiratory syndrome coronaviruses, hepatitis viruses, enteroviruses and other primate lentiviruses. Recombination leads to the emergence of mutants resistant to multiple drugs and may increase the chance that mutation-free individuals arise in a population of individuals carrying deleterious mutations. Reassortment is a similar type of genetic exchange in RNA viruses, in which whole RNA molecules of a segmented viral genome are swapped between individuals; it is responsible for antigenic shift in influenza A viruses.

In the case of HIV-1, it was observed that some sequences always clustered together, and this was used to classify HIV-1 into subtypes. As more data were collected, it became evident that disagreements with this classification appeared depending on the gene used for subtyping (inference of the subtype). This discordance was then attributed to recombination, and sequences with a similar mosaic structure (region-dependent clustering) present in unrelated patients started being classified as Circulating Recombinant Forms (CRF). These recombinants are nowadays routinely detected by phylogenetic methods based on local sequence similarity between the putative recombinant and all possible parental sequences. These so-called parental sequences are reference sequences from the original subtype classification.

Genomic regions involved in recombination may support distinct topologies, and phylogenetic analyses should incorporate this heterogeneity. In such a scenario of sporadic recombination, phylogenetic methods can be employed: recombination is detected by comparing inconsistencies in topology between adjacent segments, taking into account the uncertainty in the phylogenetic inference. On the other hand, when recombination is more common than substitution, this phylogenetic signal may be completely lost, so that every site would follow a distinct phylogenetic tree. In such cases we should give up the topological description and focus on population-level parameters (such as the recombination rate, population expansion, or divergence times).

So far, inference of recombination under the phylogenetic approach has been restricted to the presence or absence of recombination breakpoints between sites, and detection of recombination hot-spots has relied on unusual clustering patterns of these breakpoints along the genome. Many recombination detection techniques are based on sliding-window procedures that compare the topology of one segment against neighbouring segments or the whole alignment. These methods are sensitive to ancestral recombination events and to a moderate contribution of recombination. Variation in selective pressure should be considered when estimating recombination events, since it may also lead to conflicting spatial phylogenetic signal. Bayesian change-point models identify recombination breakpoints and differentiated substitution rates as change points of topologies and evolutionary rate parameters. Short segments may not have enough phylogenetic signal to discriminate between competing topologies, and large segments may miss the recombination breakpoints.

We developed a distance measure between unrooted topologies that closely resembles the number of recombinations. Although the relation between the amount of recombination and a particular metric between topologies, the Subtree Prune-and-Regraft (SPR) distance, is well known, there is still no definitive way of calculating this distance. We therefore devised an approximation to it, which gives a conservative estimate of the number of recombinations between two segments based on the distance between their inferred topologies. By introducing a prior distribution on these recombination distances, a Bayesian hierarchical model was devised to detect phylogenetic inconsistencies caused by recombination. Our procedure assumes that recombination is moderate, and we focus on detectable changes in the phylogeny. An attractive argument in favor of Bayesian procedures is that instead of a single point estimate of the parameter of interest, we obtain its distribution, posterior to observing the data. Other advantages include the possibility of exploiting arbitrarily complex models and of choosing the prior distributions to achieve a manageable level of abstraction. The disadvantage is the complexity of implementing the algorithm that draws samples from this posterior distribution. Since these samples should not be correlated, our algorithm creates the posterior samples by running heated chains serially and in parallel.

In our model the topological distance between segments (where one segment may be one or a few sites) is modelled according to a modified Poisson distribution. By modelling the recombination distance between segments we penalize recombination scenarios in which neighboring regions can only be explained by an excessive number of recombinations. This model relaxes the assumption of known parental sequences, still common in HIV analysis, allowing the entire dataset to be analyzed at once. We furthermore remove one possible source of noise from the phylogenetic inference, namely the individual branch lengths (the amount of evolution along the tree). This removal is achieved by averaging the topology over all possible branch lengths, assuming they are independent realizations of an exponential distribution. This marginalization over individual branches and the assumption of independence among segments should accommodate rate heterogeneity among lineages and sites.
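The penalizing effect of such a prior can be illustrated with a minimal sketch. The abstract does not specify how the Poisson distribution is modified, so the code below uses a plain Poisson as a stand-in; the function name `log_poisson_prior`, the rate value, and the two distance configurations are assumptions made for illustration only.

```python
import math


def log_poisson_prior(spr_distances, rate):
    """Log prior of the SPR distances between adjacent segments.

    Assumption: each distance is an independent Poisson(rate) draw. This
    plain Poisson only stands in for the modified Poisson of the thesis.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return sum(d * math.log(rate) - rate - math.lgamma(d + 1)
               for d in spr_distances)


# Hypothetical configurations for five segments (four segment boundaries).
sparse = [0, 1, 0, 0]   # a single recombination in total
dense = [2, 3, 1, 2]    # many recombinations between neighbours

rate = 0.2  # assumed small expected distance per boundary
# The sparse scenario receives the higher prior probability.
print(log_poisson_prior(sparse, rate) > log_poisson_prior(dense, rate))  # prints True
```

With a small rate, each additional SPR move between neighbouring segments lowers the prior, so a scenario needs support from the sequence data to justify many recombinations.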

On simulated datasets with up to 16 taxa, our method correctly detected the recombination breakpoints and the number of recombination events at each breakpoint. With this correlation between sites, even a single breakpoint carries information about the minimum number of recombinations between the segments it separates. This not only has biological support but also makes the topology sampling problem computationally tractable, since sampling from the topological space is not trivial for more than a few taxa.

Our Bayesian hierarchical procedure not only detects the recombination breakpoints but also quantifies the disagreement between the trees. It therefore provides information about regions where recombination occurs frequently. We also compared the results of our procedure with other Bayesian methods, providing them with the true recombination breakpoints. The chance of correctly inferring the true tree is higher with our procedure than with other Bayesian procedures that neglect the similarity between trees in neighboring regions. Our simulated datasets contained variability of substitution rates along the trees for each site and across sites, and assuming a model of independent rates for each site and averaging over individual branch lengths proved useful in distinguishing recombination from non-random rate heterogeneity.

Distinguishing one ancestral recombination (shared among many sequences) from a recombination hotspot (many recombinations arising independently) can be difficult. The robustness of our procedure comes from the fact that a breakpoint cannot be pinpointed with arbitrary precision, and the prior on the SPR distance accommodates this compromise. The amount of recombination over a region can, therefore, be quantified regardless of the number of breakpoints just by looking at the sum of the SPR distances over this region. In the Bayesian framework, once we obtain the posterior distribution of the variables of interest it is straightforward to obtain point estimates ("best" configuration) and credibility intervals ("best" ensemble of configurations), and to test hypotheses (likeliness of a given configuration).
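The summaries mentioned above can be sketched from a set of posterior samples. This is an illustrative sketch, not the thesis software: the function `summarize_region`, the simple empirical-quantile interval, and the toy samples are all assumptions. Each sample lists the SPR distance at every boundary between adjacent segments.

```python
import math
from statistics import mean


def summarize_region(samples, start, end, level=0.95):
    """Summarize the total SPR distance over boundaries [start, end)."""
    # One total per posterior sample, sorted for the empirical quantiles.
    totals = sorted(sum(sample[start:end]) for sample in samples)
    tail = (1.0 - level) / 2.0
    last = len(totals) - 1
    lower = totals[math.floor(tail * last)]
    upper = totals[math.ceil((1.0 - tail) * last)]
    # Posterior probability that the region contains any recombination.
    prob_any = sum(t > 0 for t in totals) / len(totals)
    return {"mean": mean(totals),
            "interval": (lower, upper),
            "prob_recombination": prob_any}


# Toy posterior samples, made up for illustration (not results from the thesis).
toy_samples = [
    [0, 1, 0, 2],
    [0, 1, 1, 1],
    [0, 2, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 0, 2],
]
summary = summarize_region(toy_samples, start=1, end=4)
print(summary)
```

Summing within each sample before summarizing keeps the uncertainty about breakpoint location inside the regional total, which is why the total does not depend on where exactly each breakpoint is placed.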

Applying our method to the HIV-1 dataset, we detected a higher number of recombination breakpoints than are detected when parental sequences are assumed. This dataset was constructed by a systematic analysis of near-full-genome sequences of putative recombinants from Brazil, Argentina and other South American countries. All of them were pre-analysed by bootscanning and determined to be variants of CRF12_BF. We thus chose the recombinant sequences to include in our analysis by selecting sequences with the same recombination mosaic pattern, since in this case we can directly infer the monophyly of the recombinant sequences. We compared each putative recombinant sequence independently against reference subtypes F, B and C using the software DualBrothers. We used one reference parental sequence from each subtype to increase the detection power, avoiding contradicting signals. The sequences with the most similar mosaic structures, as inferred by a hierarchical cluster analysis, were then analyzed by our software, and the results are shown in Figure 1. In such a scenario we could confirm that all recombinations represented by the mosaic were reconstructed by our procedure, and that the differences between the procedures reflected de novo recombination that did not involve the reference parental subtypes.

The average of two recombinations per breakpoint, inferred from the number of SPR moves being twice the number of breakpoints, is indeed an indication of the existence of hot-spots. A scenario of one ancestral recombination giving rise to the diversity of a new recombinant subtype assumes that, irrespective of intra-subtype recombination, these recombinants should share a most recent common ancestor along all non-recombinant regions. Our results do not support a common ancestral origin for these recombinant sequences, at least for the chosen reference parental sequences, since the putative recombinants do not form a monophyletic group across segments.

We conclude that even for datasets displaying an identical recombination mosaic pattern, it is imperative to check for phylogenetic incongruences within the dataset. We must not rely only on the breakpoints defined by the mosaic, since they are based on an arbitrary definition of sequences free from recombination.

Figure 1: Posterior distribution of SPR distances among HIV-1 sequences. Below is the genomic mosaic structure of each putative recombinant, where red indicates clustering with subtype B and blue indicates subtype F ancestry.

Summary of the Examination

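Viral genomes undergo recombination frequently. As a result, the polymorphism of genome sequences cannot be represented by a single phylogenetic tree, and different regions have different phylogenetic relationships (topologies). Analyses that ignore genomic recombination therefore risk drawing wrong inferences about the molecular evolution of viral genomes. There are broadly two approaches to inferring recombination in viral genomes. One is the population-genetic approach. Based on information about linkage disequilibrium between sites in the genome, it represents the history of recombination as an Ancestral Recombination Graph and estimates the recombination rate. The strength of linkage disequilibrium, however, is affected by the history of the population and by the selective pressure acting on the genome. Because models that account for these effects are complex, current programs for estimating recombination assume neutral evolution and a population at equilibrium.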

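The other is the molecular phylogenetic approach. This approach detects recombination from the disagreement of topologies between genomic regions, and it requires no assumptions about selective pressure or population history. It is effective when genomic recombination is relatively infrequent and the genome fragments between recombination breakpoints retain phylogenetic signal. When the recombination parents are known in advance, their sequences are used as reference sequences, and the target sequence is divided into regions, each of which is associated with a reference sequence. To increase detection power, hidden Markov models with topologies as states and Bayesian change-point models for the recombination breakpoints have been developed. When the recombination parents are unknown, all possible topologies must be examined. Because of the resulting degrees of freedom (with only 10 sequences the number of possible topologies already exceeds two million), the weakness is that the number of sequences that can be handled is limited to a few. This study overcomes this weakness by introducing a prior distribution on recombination distances.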
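The figure of more than two million topologies for 10 sequences can be checked with the standard count of unrooted, fully resolved trees on n labelled taxa, (2n-5)!!. The sketch below assumes this is the count the summary refers to; the function name is chosen for illustration.

```python
def count_unrooted_topologies(n_taxa):
    """Number of unrooted, fully resolved topologies on n labelled taxa."""
    if n_taxa < 3:
        raise ValueError("at least three taxa are needed")
    count = 1  # three taxa admit a single unrooted topology
    # The k-th taxon can be attached to any of the 2k-5 branches of a
    # tree that already holds k-1 taxa.
    for k in range(4, n_taxa + 1):
        count *= 2 * k - 5
    return count


print(count_unrooted_topologies(10))  # prints 2027025
```

The count grows as a double factorial, which is why exhaustive examination of topologies becomes infeasible beyond a handful of sequences.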

1. Approximation algorithm for the SPR distance

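The method proposed in this thesis penalizes the component of the topological disagreement between adjacent regions that is due to recombination. That is, it exploits the fact that errors arising from topology estimation are irregular, whereas topological disagreements caused by recombination have regularity. By keeping the effective degrees of freedom in topology space low, the method is relieved of the heavy computational burden, and the power to detect recombination is expected to improve markedly. Known distances between topologies include the Robinson-Foulds distance and the complementary maximum agreement subtree (cMAST) distance. The former is defined as the number of branches that have no counterpart in the other tree, and the latter by the complement of the leaf set of the maximum agreement subtree. Unfortunately, neither distance is closely related to the number of recombinations. Topological disagreement caused by recombination is instead related to the subtree prune-and-regraft (SPR) distance. This thesis develops a reduced representation of a pair of topologies and a parsimonious update algorithm for branch removal, which make it possible to compute approximately the number of recombinations required to explain the disagreement between topologies.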
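To make the Robinson-Foulds definition concrete, the sketch below counts the internal branches (bipartitions of the taxa) present in one tree but not in the other. It illustrates only this well-known distance, not the SPR approximation developed in the thesis. The nested-tuple tree encoding and all function names are assumptions for illustration.

```python
def leaf_set(tree):
    """Collect the taxon labels of a tree written as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def nontrivial_splits(tree):
    """Return the internal branches of a tree as canonical bipartitions."""
    taxa = leaf_set(tree)
    reference = min(taxa)  # each split is stored as the side without this taxon
    found = set()

    def visit(node):
        below = leaf_set(node)
        # Keep only splits with at least two taxa on each side.
        if 1 < len(below) < len(taxa) - 1:
            found.add(taxa - below if reference in below else below)
        if not isinstance(node, str):
            for child in node:
                visit(child)

    visit(tree)
    return found


def robinson_foulds(tree_a, tree_b):
    """Number of internal branches found in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must share the same taxa")
    return len(nontrivial_splits(tree_a) ^ nontrivial_splits(tree_b))


# Two hypothetical five-taxon trees that differ in the placement of B and C.
tree_1 = (("A", "B"), "C", ("D", "E"))
tree_2 = (("A", "C"), "B", ("D", "E"))
print(robinson_foulds(tree_1, tree_2))  # prints 2
```

Moving a single subtree can change many branches at once when it is regrafted far from its origin, which is one way to see why this branch count tracks the number of recombinations poorly.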

2. Bayesian inference of recombination

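Modeling molecular evolution as a Markov process allows the likelihood at each site to be written down. This likelihood is expressed in terms of the topology and the transition probabilities between adjacent nodes. To make the inference robust to heterogeneity of the evolutionary rate among sites and to heterotachy, in which the branches that accelerate or decelerate differ among sites, randomness was introduced into the branch lengths, which are treated as random variables. By placing a Poisson prior on the SPR distance that measures the topological mismatch between adjacent sites, a penalty on recombination frequency was implemented within the Bayesian framework. Each site is allowed to have its own, distinct topology. The strength of the penalty on recombination and the heterogeneity among sites are not fixed in advance; instead, randomness is introduced into the hyperparameters that specify the prior distributions, giving a hierarchical Bayesian model. Markov chain Monte Carlo (MCMC) is used to obtain the posterior distributions of the hyperparameters, the topology at each site, and the SPR distances between adjacent sites. This makes it possible to estimate the recombination breakpoints and the recombination patterns simultaneously.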
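As a reading aid, the structure of the hierarchical model described above can be written schematically. The notation is an assumption introduced here and is not taken from the thesis: $X_i$ is the data at site (or segment) $i$, $T_i$ its topology, $b$ the branch lengths, $d_{\mathrm{SPR}}$ the approximate SPR distance, $\pi$ the Poisson-type prior on that distance, and $\lambda$ and $\theta$ the hyperparameters for the recombination penalty and the rate heterogeneity. The exact form used in the thesis may differ.

```latex
\begin{align*}
P(T_1,\dots,T_n,\lambda,\theta \mid X)
  &\propto P(\lambda)\,P(\theta)\,
     \prod_{i=1}^{n} P(X_i \mid T_i,\theta)\,
     \prod_{i=1}^{n-1} \pi\bigl(d_{\mathrm{SPR}}(T_i,T_{i+1}) \mid \lambda\bigr), \\
P(X_i \mid T_i,\theta)
  &= \int P(X_i \mid T_i, b)\, p(b \mid \theta)\, \mathrm{d}b .
\end{align*}
```

The second line expresses the treatment of branch lengths as random variables that are integrated out, so that only the topologies, their pairwise distances, and the hyperparameters remain to be sampled by MCMC.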

3. Evaluation by simulation and analysis of the South American HIV-1 population

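The effectiveness of the developed software was examined through simulation, and the South American HIV-1 population was analyzed. In the simulations, the nucleotide substitution rates were set with the analysis of HIV-1 genomes in mind, treating 10 sites as one unit of analysis. Analyses with 8, 12 and 16 sequences showed in every case that the recombination breakpoints and recombination patterns were estimated without bias, and without picking up false positives even when the sequences contained highly variable sites. The method was further compared with MrBayes, which performs Bayesian inference of molecular phylogenies. Because that program does not estimate recombination breakpoints, MrBayes was given the breakpoints as known and estimated the topologies of the regions between them. Surprisingly, the present method was more accurate even though it treats the breakpoints as unknown. This demonstrates the effectiveness of the prior distribution on the disagreement between adjacent regions. Analysis of 16 genomes of BF recombinants sampled from the South American HIV-1 population detected 37±7 recombination breakpoints and 65±10 recombination events. Many of these are recombinations within subtypes, and it was confirmed that they are not detected by analyses that take the recombination parents as given.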

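Genomic recombination is thought to play an important role in adaptive evolution, yet it has so far been difficult to infer. This thesis developed a hierarchical Bayesian model that places a prior distribution on the SPR distance between adjacent regions, and confirmed its effectiveness. The analysis of the South American HIV-1 population detected numerous recombinations across the whole genome, prompting a re-examination of the analyses of HIV-1 populations accumulated to date. The robust method for inferring genomic recombination proposed in this thesis is expected to meet the needs of many researchers in evolution. At present the number of sequences that can be analyzed in a practical amount of time is limited to a few tens, but this computational constraint is likely to be lifted in the future. The results obtained here therefore make substantial contributions both academically and in application. Accordingly, the examination committee unanimously found this thesis to be of sufficient merit for the degree of Doctor (Agriculture).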
