学位論文要旨詳細

学位論文要旨


No		126806
著者（漢字）		ラトナ,クリシュナムルティ
著者（英字）		Ratna,Krishnamoorthy
著者（カナ）		ラトナ,クリシュナムルティ
標題（和）		粗粒度再構成可能アーキテクチャのためのコンパイラ最適化
標題（洋）		Compiler Optimizations for Coarse Grained Reconfigurable Architectures (CGRAs)
報告番号		126806
報告番号		甲26806
学位授与日		2011.03.24
学位種別		課程博士
学位種類		博士(工学)
学位記番号		博工第7447号
研究科		工学系研究科
専攻		電子工学専攻
論文審査委員		主査：　東京大学　教授　藤田,昌宏　東京大学　教授　若原,恭　東京大学　教授　相田,仁　東京大学　教授　坂井,修一　東京大学　准教授　中山,雅哉　東京大学　准教授　五島,正裕
内容要旨		要旨を表示する Reconfigurable Architectures are those which exploit a higher degree of spatial computation. Typical examples of these include Field Programmable Gate Arrays (FPGA). However, to run an application on an FPGA the number of resources on the FPGA must be sufficient to spatially accommodate them at the same time. Thus emerged a new architecture called Coarse Grained Reconfigurable Architecture (CGRA). CGRA was designed from scratch to have low reconfiguration overhead. This was achieved partially through the use of Coarse Grained building blocks instead of LUTs. Also CGRAs supported loading multiple contexts simultaneously to reduce the reconfiguration overhead. This meant that the fabric of the CGRA performed spatial execution of instructions within one configuration and was temporally multiplexed between several configurations. Further, unlike FPGAs, CGRA supports programming in high level languages viz. C. The spatio-temporal nature of execution in CGRAs makes the task of application compilation quite challenging, since they need to address both the spatial and temporal aspects of execution. In this thesis, we develop a few compiler techniques which can be applied in the context of CGRAs, with an aim to reduce the total execution time. The total execution time on a CGRA includes the reconfiguration time i.e. the time spent in sending the configuration data before the execution starts and the instruction execution time i.e. the actual time it takes to execute a task on the fabric. We propose three different techniques to reduce the total execution time, which includes the reconfiguration time and the instruction execution time. Our work primarily focuses on those aspects of compilation which are unique to the spatio-temporal execution paradigm. We present the results for all these compiler optimizations in the context of CGRA called REDEFINE, which we use as our testbed. REDEFINE uses static dataflow on the execution fabric and dynamic dataflow to orchestrate various application sub-structures on to the execution fabric. The use of dynamic dataflow is different from previous CGRAs and makes orchestration logic simpler in hardware. REDEFINE also supports the use of custom function units to accelerate kernels. This is supported programmatically through the use of extern function calls. The performance of this CGRA is good and performs 10-15x times better than a General Purpose Processor. Compilation on CGRAs involves two different aspects. One to determine the application sub-structures which are to be temporally multiplexed on to the fabric. The application substructures then need to be executed on the fabric through appropriate spatial placement, so as to reduce communication overhead. In REDEFINE, the application sub-structure too is spatio-temporally executed on the fabric. This is done since over-allocating resources, more than the available instruction level parallelism is wasteful. So application substructures are partitioned, such that each partition containing several instructions are mapped to the same compute element. The compute element executes these instructions sequentially based on the availability of data. The task of partitioning is governed by the opposing forces of being able to expose as much parallelism as possible and reducing communication time. We extend Edge-Betweenness Centrality scheme, originally used for detecting community structures in social and biological networks, for partitioning instructions of a dataflow graph. We also implement several other partitioning algorithms from literature and compare the execution time obtained by each of these partitioning algorithms in the context of REDEFINE. Centrality based partitioning scheme outperforms several other schemes with 6-20% execution time speedup for various Cryptographic kernels. Centrality performs well on account of the appropriate choice of edges and use of a larger number of partitions. Further, we observe that these partitions tend to be unbalanced. Unbalanced partitions help reduce the perceived reconfiguration time by overlapping it with instruction execution. However, the use of unbalanced partitions is not beneficial in all cases and this observation was made in the context of two applications. We extend centrality based partitioning scheme to produce balanced partitions. We observe that the new scheme has the ability to improve performance up to 15% (over and above the Centrality scheme). REDEFINE using centrality based partitioning performs 9x times better than a General Purpose Processor, as opposed to 7.76x times better without using centrality based partitioning. Similarly, centrality improves the execution time comparison of AES-128 Decryption from 11x times to 13.2x times. After the application sub-structures are partitioned, these partitions need to be appropriately placed on the fabric to minimize communication latency. This task is carried out by the mapping algorithm. We propose to demonstrate an interconnection independent mapping algorithm based on greedy and local search based heuristics. Unlike previous work, we experiment with several objective functions and evaluate the best possible mapping algorithm in the context of balanced centrality and centrality based partitioning algorithms. It is observed that the orchestration of application substructures can take as much time as the execution of the instructions. These application substructures are sequenced based on data and control dependences. Substructure prefetching is used in CGRAs to hide the reconfiguration time while another substructure executes. In REDEFINE, these application substructures are referred to as HyperOps. Determining the successor for a HyperOp requires merging information from the control flow graph and the HyperOp dataflow graph. A new data structure called Control-Data Interaction Graph has been developed. The CDIG captures control and data interactions across HyperOps. Succession in many cases is data dependent. Since hardware branch predictors cannot be applied due to the non-binary branches, we employ a speculative prefetch unit together with a profile based prediction scheme. The results of profiling are annotated on the CDIG. Based on the results the successor HyperOp is selected. The prefetch unit was implemented in the simulator as a look up table. Every time a HyperOp is scheduled, a look up is performed to determine the next most likely successor. The most likely successor is prefetched, but instruction execution is delayed until the point the decision is made. If the prediction is correct then the data is transferred. In case of a misprediction the HyperOp's allocation on the fabric is revoked. The misprediction penalty is typically four clock cycles. Simulation results show around 7-33% reduction in overall execution time, when compared to the overall execution time without prefetching. In conclusion, we have proposed three different techniques to improve the performance of CGRAs. All of these techniques have been shown to give good results in the context of REDEFINE. However, these techniques are generic and can be applied to any CGRA.
審査要旨		要旨を表示する本論文は、Compiler Optimizations for Coarse Grained Reconfigurable Architectures (CGRAs)(粗粒度再構成可能アーキテクチャのためのコンパイラ最適化)と題し、多数の計算ユニットを有する動的再構成可能アーキテクチャに基づくプロセッサ向けコンパイラ最適化技術が示されており、英文で7章から構成されている。第1章は、Introduction(序章)であり、研究の背景と目的を述べている。動的再構成可能アーキテクチャ研究開発の現状と汎用プロセッサとの比較、特にそれらに対するコンパイラ技術の差異を整理し、動的再構成可能アーキテクチャ向けコンパイラ最適化技術を新たに研究開発することにより、実行性能を大幅に改善できる可能性のあることが示されている。第2章は、Background and Related Work(研究の背景と関連研究)であり、動的再構成可能アーキテクチャのためのコンパイラ技術の現状を整理している。また、動的再構成可能アーキテクチャの具体的実現例として、REDEFINEプロセッサアーキテクチャの概要が説明されている。REDEFINEは本論文の提案手法を評価する際に利用されている。第3章は、Control-Data Interaction Graph(制御・データ関連グラフ)であり、動的再構成可能アーキテクチャでは、多数の計算ユニットがあり、かつそれらの構成を動的に変更できるため、制御のみ、あるいはデータのみから最適なスケジューリングや計算ユニットの割当てを行うことができない。そこで両者を統合し、両者の関連を明示的に示すグラフを新たに提案するとともに、それを利用することで、効率的にスケジューリングや計算ユニットの割当てが可能となることが示されている。また、C言語などのプログラムからの生成手法についても述べている。第4章は、Speculative Prefetch Mechanism for Temporal Partitions(時間軸上の計算の分割のための投機的プリフェッチ機構)であり、動的再構成可能アーキテクチャで問題となる再構成時のオーバヘッドを削減するためのプリフェッチ機構が提案・評価されている。計算を単純に実行すると、再構成した後実行することになり、再構成時間がオーバヘッドとなる。しかし、プリフェッチ機構を導入すれば、これらをオーバラップすることができ、再構成のための処理時間を見かけ上隠すことができる。多数の計算ユニット持つ動的再構成可能アーキテクチャの場合、条件文の選択肢が多数ある場合があり、従来のプリフェッチ機構を拡張する必要がある。ここでは、その拡張法を示すとともに、それを実際のアーキテクチャであるREDEFINEアーキテクチャに適用し、ハードウェア設計も含めて評価を行い、その有用性を示している。第5章は、Spatial Partitioning(空間的分割)であり、REDEFINEアーキテクチャを例として、多数の計算ユニットで並列に実行するためのプログラム分割手法について述べている。従来から提案されている各種アルゴリズムをREDEFINE向けに修正するとともに、Edge Betweenness Centralityと呼ばれる評価関数に基づく新規分割手法を提案している。ベンチマークプログラムによりこれらの評価を行い、Edge Betweenness Centralityに基づく分割手法を適用することで、最も実行速度が向上できることが示されている。第6章は、Mapping and Consolidated Results(マッピング手法と総合的結果)であり、分割されたプログラムを具体的に計算ユニットにマッピングする手法を示すとともに、今までに提案した手法全体としての評価を行っている。総合的な性能評価として、具体的にAES暗号を実装したプログラムについて評価を行っている。提案手法に基づいてコンパイルしREDEFINEアーキテクチャに基づくプロセッサで実行すると、先端汎用プロセッサ上の実行と比較し、10倍から13倍程度高速に実行できることが示されている。第7章は、Conclusion and Future Work(結論と今後の課題)と題し、本論文の研究成果をまとめるとともに、今後の発展方向について議論している。以上、多数の計算ユニットを有する動的再構成可能アーキテクチャに基づくプロセッサ向けコンパイラ最適化技術に関し、多数のプロセッサを空間的・時間的に有効利用するための手法として、その解析のための制御・データ関連グラフの提案、多数の計算ユニットを有効利用できるプリフェチ機構、動的再構成可能アーキテクチャ向けの新規分割・マッピング手法が提案・評価され、また具体的応用で汎用プロセッサでの実行と比較し10倍以上の高速化が達成されており、電子工学の発展に寄与する点は少なくない。よって本論文は博士(工学)の学位請求論文として合格したものと認められる。
UTokyo Repositoryリンク