学位論文要旨詳細

学位論文要旨


No		127565
著者（漢字）		ロツキカミルマレック
著者（英字）
著者（カナ）		ロツキカミルマレック
標題（和）		GPUによる大規模木探索
標題（洋）		Large Scale Tree Search on GPU
報告番号		127565
報告番号		甲27565
学位授与日		2011.09.27
学位種別		課程博士
学位種類		博士(情報理工学)
学位記番号		博情第350号
研究科		情報理工学系研究科
専攻		コンピュータ科学専攻
論文審査委員		主査：　東京大学　教授　平木,敬　東京大学　教授　今井,浩　東京大学　教授　石川,裕　東京大学　准教授　渋谷,哲郎　東京工業大学　特任准教授　遠藤,敏夫
内容要旨		要旨を表示する Monte Carlo Tree Search (MCTS) is a method for making optimal decisions in artificial intelligence (AI) problems, typically move planning in combinatorial games. It combines the generality of random simulation with the precision of tree search. Research interest in MCTS has risen sharply due to its spectacular success with computer Go and potential application to a number of other difficult problems. Its application extends beyond games, and MCTS can theoretically be applied to any domain that can be described in terms of state, action pairs and simulation used to forecast outcomes such as decision support, control, delayed reward problems or complex optimization. The main advantages of the MCTS algorithm are that it does not require any strategic or tactical knowledge about the given domain to make reasonable decisions and algorithm can be halted at any time to return the current best estimate. So far, current research has shown that the algorithm can be parallelized on multiple CPUs. The motivation behind this work was caused by the emerging GPU- based systems and their high computational potential combined with relatively low power usage compared to CPUs. As a problem to be solved I chose developing an AI GPU(Graphics Processing Unit)-based agent in the game of Reversi (Othello) and SameGame puzzle which provide sufficiently complex problems for tree searching with non-uniform structure. The importance of this research is that if the MCTS algorithm can be efficiently parallelized on GPU(s) it can also be applied to other similar problems on modern multi-CPU/GPU systems such as the TSUBAME 2.0 supercomputer. Tree searching algorithms are hard to parallelize, especially when GPU is considered. Finding an algorithm which is suitable for GPUs is crucial if tree search has to be performed on recent supercomputers. Conventional ones do not provide good performance, because of the limitations of the GPUs' architecture and the programming scheme, threads' communication boundaries. One of the problems is the SIMD execution scheme within GPU for a group of threads. It means that a standard CPU parallel implementation such as root-parallelism fail. The other problem is a difficulty in generating pseudo-random numbers on GPU which is important in Monte Carlo methods. Available methods are usually very time consuming. Third of all, no current research work discusses scalability of the algorithm for millions of threads (when multiple GPUs are considered), so it is important to estimate to what extent the parallelism can be increased.In this thesis I present an efficient parallel GPU MCTS implementa- tion based on the introduced 'block-parallelism' scheme which combines GPU SIMD thread groups and performs independent searches without any need of intra-GPU or inter-GPU communication. I compare it with a simple leaf parallel scheme which implies certain performance limitations. The obtained results show that using my GPU MCTS implementation on the TSUBAME 2.0 system one GPU's performance can be compared to 50-100 CPU threads depending on factors such as the search time and other MCTS parameters. The block-parallel algorithm provides better results than the naive leaf-parallel scheme which fail to scale well beyond 1000 threads on a single GPU. The block-parallel algorithm is approximately 4 times more efficient in terms of the number of CPU threads' results comparable with the GPU implementation. In order not to generate random numbers on GPU I introduce an algorithm, where the numbers are transferred from the CPU for each GPU block accessible as a look-up table. This approach makes the time needed for random-sequence generation insignificantly small. In this thesis for the first time I discuss scalability of the algorithm for millions of threads. The program is designed in the way that it can be run on many nodes using Message Passing Interface (MPI) standard. As a method of evaluating my results I compared the results of multiple CPU cores and GPUs playing against the standard sequential CPU implementation. Therefore the algorithm's scalability is analyzed for multiple CPUs and GPUs. My results show that this algorithm implies almost no inter-node communication overhead and it scales linearly in terms of the number of simulation performed in a given time period. However, beyond a certain number of running threads, a lack of performance improvement was observed. I concluded that this limit is affected by the algorithm's implementation and it can be improved to some extent by tuning the parameters or adjusting the algorithm itself. The improvements I propose and analyze are variance-based error estimation and simultaneous CPU/GPU execution. Using these two methods modifying the MCTS algorithm the overall effectiveness can be increased by 10-15% further, compared to the basis block-parallel implementation. Also, another factor considered is the criteria of estimating the performance is the overall score of the game (win percentage or the score). Not all the parameters in the MCTS algorithm are analyzed thoroughly in regarding the GPU's implementation and their importance considering scalability. This is caused by the certain limitations of the proposed evaluation method. As it is based on the average score, multiple games have to be played to get accurate results and time needed to aquire them is relatively long.
審査要旨		要旨を表示する GPU (Graphic Processing Unit) は多数の SIMD (Single Instruction Multiple Data) 型の並列性を持つプロセッサコアを搭載し,またストリーム処理に適した専用の高バンド幅メモリを連結することにより,CPU よりも非常に高い演算スループットを実現している.GPU はもともと画面描画用の専用プロセッサであるが,近年はアーキテクチャおよびプログラミング言語が拡張され,汎用的な計算が可能になっている.GPU を用いた汎用計算は GPGPU (General Purpose GPU) 計算と呼ばれ,GPU の高い処理性能を引き出すソフトウェア技術として研究開発が精力的に進められている.GPU は SIMD 型の並列アーキテクチャのため,行列計算などのデータ並列処理との親和性が高く,これらの問題では高い性能を相対的に容易に達成できる.これに対し,探索問題などの離散アルゴリズムは SIMD 型の並列処理との親和性が低く,GPU を用いて高速化することは容易ではない.このため,探索問題の求解アルゴリズムの GPU 上での高性能な実装技術の開発は,GPU の高い処理性能をより多様な計算に活かすための突破口を開く重要な課題である.さらに,GPU を多数搭載した計算機クラスタは GPU クラスタと呼ばれ,極めて高い性能を実現する.実際,トップエンドのスーパーコンピュータの多くが GPU クラスタアーキテクチャを採用している.本研究は,高並列な探索アルゴリズムであるモンテカルロ木探索において,GPU クラスタの高い性能を最大限に引き出す手法を開発した研究である. 本論文は以下に概略を示す5つの章から構成されている. 1章では,まず木探索の一般的フレームワークが論じられた後,本論文で実際の例として取り上げる木探索問題であるゲーム木探索について論じられている.取り上げられるゲームは,二人ゲームとして reversi とも呼ばれるオセロ,および一人ゲーム(パズル)としてサメガメ samegame である.ゲーム木探索問題は,これらのゲームの与えられた盤面に対して,勝率あるいは得点を最大化する次の一手を与える問題である.これらは完全情報ゲームであり,盤面のすべての展開を完全に把握することにより,理論的には最適解を見出すことができる.しかし可能な盤面の数は手数に対して指数関数的であり,スーパーコンピュータを用いてもゲームの終盤以外は最適解を求めることは現実的に不可能である.このため実際には近似アルゴリズムが必要である.本章では,さらに GPU のアーキテクチャが詳細に論じられ,また MPI による GPU クラスタ並列処理が論じられる.最後に,古典的な木探索アルゴリズムでは GPU クラスタアーキテクチャの非常に高い並列性を活かすことが困難であることが示されている. 2章では関連研究および本研究の貢献が述べられている.まず,古典的な木探索アルゴリズムの並列化の既存の研究とその限界が示されている.次に,本研究で採用する木探索アルゴリズムである,モンテカルロ木探索とその並列化の既存研究が論じられている.そして,モンテカルロ木探索を GPU クラスタに実装する際に生じる問題点と,本研究が提案する解決が提示されている. 3章ではモンテカルロ木探索の GPU クラスタへの実装手法が論じられている.モンテカルロ木探索の既存の並列化手法としては,根並列化,葉並列化,大域的排他制御による木並列化,局所的排他制御による木並列化に基づく4つの手法が知られている.このうち排他制御は GPU では効率的に実行できない.葉並列化は GPU での実装が容易であるが,並列化に伴って並列化効率(同じ演算量でどれだけゲームが強くなるか)が顕著に低下する.根並列化は並列化効率の低下は少ないが,メモリ消費量の多さおよび SIMD 並列性との親和性の低さのために GPU の性能を十分に引き出すことが困難である.そこで本研究では根並列化と葉並列化を組み合わせたブロック並列化を提案する.ブロック並列化は,GPU の SIMD コアに葉並列性を,GPU コア間に根並列性を割り当てることにより,高い実装効率と高い並列化効率を両立する.ブロック並列化は GPU 上で効率的に実装できるのに加え,MPI を用いた GPU クラスタも有効に活用できるという利点がある.しかし単にブロック並列化を適用しただけではいくつかの弱点が生じる.弱点の一つ目は,GPU コアの演算遅延の遅さのために,探索する木の深さが浅くなってしまうことである.これを克服するため,本研究ではハイブリッド CPU-GPU アプローチを提案している.本アプローチでは低遅延の CPU コアによる深い木探索と,高演算帯域の GPU コアによる詳細な木探索の両方の情報を結合させることにより,両者の長所をともに引き出すことに成功している.これは単に CPU と GPU で負荷分散を行うものではなく,CPU と GPU の提供する異質な情報を結合することにより両者の弱点を克服する特徴を持つ.弱点の二つ目は,ブロック並列化が内包する葉並列性のために並列化効率が低下することである.従来の葉並列化では並列計算した評価値の平均値のみを用いていたが,本研究では分散も参照する手法を提案している.本手法により,葉並列化がもたらす詳細な情報をより深く活用することができる.また,オンチップメモリが CPU に比べて小さい GPU において問題になる疑似乱数について,本研究で用いる手法を示している. 4章では,提案手法の実装において得られた結果およびその分析が論じられている.本研究では東工大の TSUBAME 2.0 スーパーコンピュータを用いた大規模並列実装が実現されている.また,根並列性,葉並列性,およびブロック並列性の大規模並列処理におけるスケーラビリティーを示し,その違いの原因について論じている. 最後に5章では,研究成果を総括し,今後の研究の展望を述べて,論文を締めくくっている. 以上のように,本論文はモンテカルロ木探索の GPU クラスタへの高性能な実装方式を示し,スーパーコンピュータによる大規模並列探索を実現することを通じて,高性能並列計算の分野において顕著な貢献をしたものと言える.また,本研究は GPU クラスタを用いた離散アルゴリズムの高性能実装技術の開拓に向けて先端的な貢献をしたものであることから,審査委員会は,本研究は博士号に十分値するものであると判断した. よって本論文は博士(情報理工学)の学位請求論文として合格と認められる。
UTokyo Repositoryリンク