Dissertation Abstract



No 128854
Author (kanji): グロート,スヴェン
Author (roman): Groot, Sven
Author (kana): グロート,スヴェン
Title (Japanese): データインテンシブ分散システムにおける高効率計算機資源利用に関する研究
Title (English): Research on efficient resource utilization in data intensive distributed systems
Report number: 甲28854
Date of conferral: 2013.03.11
Degree type: Doctorate by course (課程博士)
Degree: Doctor of Information Science and Technology
Diploma number: 博情第407号
Graduate school: Graduate School of Information Science and Technology
Department: Department of Information and Communication Engineering
Thesis committee: Chair: Jun Adachi, Professor, The University of Tokyo
 Masaru Kitsuregawa, Professor, The University of Tokyo
 Shuichi Sakai, Professor, The University of Tokyo
 Masahiro Goshima, Associate Professor, The University of Tokyo
 Kenjiro Taura, Associate Professor, The University of Tokyo
 Masashi Toyoda, Associate Professor, The University of Tokyo
Abstract

In recent years there has been an unprecedented growth in the amount of data being gathered worldwide. Data may include business data, sensor data, web data, log data, and social data. Data volumes of terabytes are common, petabytes are not unusual anymore, and it is almost certainly not long before exabyte scale data becomes the norm.

Big data presents an opportunity for in-depth analysis. Trends and patterns can be derived from the data, enabling a wide variety of business and research opportunities. As a result, the ability to process large amounts of data, often too large to be stored and processed by traditional relational database systems, has become very important. Processing big data requires very large scale resources, such as large clusters of commodity machines that allow the data to be processed in a distributed fashion. Google's MapReduce framework and its open source implementation Hadoop have emerged as the de facto standard data processing platform for such environments.

At the same time cloud computing has emerged, enabling users to dynamically provision computing resources and pay for them on a pay-as-you-go basis. Cloud computing enables anyone to gain access to large scale computing resources without the need to build their own infrastructure, drastically driving down costs and making it possible for even individuals or small organizations to afford the resources necessary for processing big data. Cloud computing adoption has been driven by advances in virtualization and the widespread availability of high-bandwidth network connectivity.

When dealing with resources at this scale, it becomes increasingly difficult to utilize them all efficiently. With the billing model used by most cloud providers, provisioning a single node for ten hours costs the same as provisioning ten nodes for one hour, so it would be highly desirable if distributed applications could achieve a linear speed-up when more nodes, or nodes with higher hardware specifications, are provisioned. However, a number of factors limit the scalability of these applications, so they cannot fully use the provisioned resources. In the cloud, inefficient resource utilization directly leads to higher costs for no additional benefit.

In this thesis, I investigate two major causes of the inability of data intensive distributed applications to efficiently utilize resources in cloud environments and other large scale distributed systems, and propose methods to mitigate their effects.

The first major cause of inefficient resource utilization is workload imbalance. This occurs when certain workers have a longer execution time than others, leading to stragglers: a handful of tasks that hold up an entire application. When a workload has stragglers, some computing resources provisioned for the workload may be idle, or partially idle, while waiting for the stragglers to finish. Because these resources are not being utilized they are unable to assist in speeding up the processing of the workload.

Workload imbalance and stragglers can have several causes, such as data skew, processing skew, and performance heterogeneity. Data skew occurs when the data to be processed is not divided evenly among the workers. Processing skew occurs when certain records in the data, even if they are not larger than others, take more processing time. Performance heterogeneity can be caused by differences in hardware between nodes, but also by environmental factors such as background processes active on certain machines, or interference between virtual machines sharing the same physical host.

It is possible to mitigate data skew, and to a lesser extent processing skew, by sampling the data beforehand, but these approaches cannot deal with performance heterogeneity. Performance heterogeneity is more complicated to account for beforehand, because precise knowledge about the hardware configuration is usually not available in the cloud, and even identical instances can have very different performance. Although measurements can be used to establish the performance of each node, they still cannot account for environmental factors that may change during the execution of the workload.
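As a concrete illustration of the sampling approach (a minimal sketch, not taken from the thesis; all names are illustrative), partition boundaries can be chosen from a random sample of the keys so that each range holds roughly the same number of records. This counters data skew but, as noted above, says nothing about how fast each worker will process its range:

```python
import random
from bisect import bisect_right

def sample_boundaries(keys, num_partitions, sample_size=10000):
    # Sort a random sample of the keys and take evenly spaced elements
    # as partition boundaries, so that each key range should hold
    # roughly the same number of records.
    sample = sorted(random.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    return [sample[int(i * step)] for i in range(1, num_partitions)]

def partition_of(key, boundaries):
    # A key belongs to the partition whose boundary interval contains it.
    return bisect_right(boundaries, key)
```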

I propose a method called Dynamic Partition Assignment, which dynamically adjusts the data distribution among workers whenever imbalance is detected. The workload is divided into many more partitions than there are workers, and partitions are dynamically reassigned to workers that have already finished their share. Dynamic Partition Assignment avoids the overhead of many small transfers by assigning partitions in groups that can be transferred in one operation whenever possible. Because imbalance is detected lazily, by monitoring the completion times of workers, Dynamic Partition Assignment can handle data skew, processing skew, and performance heterogeneity without any prior knowledge about the data or the hardware environment.
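The scheme can be sketched as follows (illustrative Python only; the class name and the stealing policy are my assumptions, not Jumbo's actual implementation). The workload is over-partitioned, each worker starts with a round-robin share, and a worker that drains its queue lazily takes a contiguous group of unstarted partitions from the most loaded worker, so reassignment happens in bulk rather than one partition at a time:

```python
from collections import deque

class DynamicPartitionAssigner:
    def __init__(self, partitions, workers):
        # Over-partitioned workload: many more partitions than workers,
        # dealt out round-robin as the initial assignment.
        self.queues = {w: deque() for w in workers}
        for i, p in enumerate(partitions):
            self.queues[workers[i % len(workers)]].append(p)

    def next_partition(self, worker):
        # Called by a worker each time it finishes a partition.
        if self.queues[worker]:
            return self.queues[worker].popleft()
        # The worker is idle: imbalance is detected lazily, without any
        # prior knowledge of the data or hardware. Steal half of the
        # largest remaining backlog as one contiguous group, so the
        # transfer can happen in a single operation.
        victim = max(self.queues, key=lambda w: len(self.queues[w]))
        backlog = self.queues[victim]
        if not backlog:
            return None  # everything has been assigned
        group = [backlog.pop() for _ in range(max(1, len(backlog) // 2))]
        self.queues[worker].extend(reversed(group))
        return self.queues[worker].popleft()
```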

Dynamic Partition Assignment was implemented in Jumbo, a MapReduce-style experimental data intensive distributed processing platform created for the purpose of experimenting with workload imbalance, and evaluated both on workloads with data skew and on a heterogeneous environment. Dynamic Partition Assignment successfully reduced the effects of stragglers, in some cases improving processing times by 30 to 50% and bringing the processing time to within 10% of the optimally balanced processing time.

The second major cause of inefficient resource utilization is I/O interference. When multiple applications access the same resource simultaneously, they can interfere with each other, causing performance degradation. This is particularly a problem for I/O resources such as disk storage, which are often shared between processes and can suffer a dramatic reduction in performance under contention. For example, when parallel processes are run to exploit the many CPU cores of modern server nodes, these processes contend for the same, more limited, I/O resources. The resulting interference leads to sub-optimal utilization of both the node's CPU and disk resources.

The nature of I/O interference means that it is very difficult to mitigate once it has been detected. Reassigning work often requires additional I/O to move data to new locations, exacerbating the problem. It is therefore desirable to predict the effects of I/O interference before it actually occurs, so that scheduling and resource provisioning can be adjusted accordingly.

I propose a cost model that predicts the effects of I/O interference when multiple MapReduce tasks running on the same node contend for that node's I/O resources. The model uses several workload parameters measured directly from running a subset of the workload, and a number of hardware parameters derived from micro-benchmarks, including hardware-specific interference functions that describe how the storage devices behave under contention. These parameters and functions feed an analytical model of MapReduce, which uses knowledge of MapReduce's processing flow and I/O patterns to predict the performance of the workload for a specified number of parallel processes.
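The shape of such a model can be pictured with a deliberately simplified sketch (a toy example of my own, not the thesis's actual equations): per-task I/O time is scaled by a measured interference function giving the fraction of device bandwidth each of n concurrent streams attains, while CPU work is assumed to scale across cores:

```python
def predicted_task_time(bytes_read, bytes_written, cpu_seconds,
                        n_tasks, read_bw, write_bw, interference):
    # read_bw / write_bw: sequential device bandwidth (bytes/s) measured
    # by micro-benchmarks; interference(n) in (0, 1] is the measured
    # fraction of that bandwidth each of n concurrent streams attains.
    f = interference(n_tasks)
    io_time = bytes_read / (read_bw * f) + bytes_written / (write_bw * f)
    return io_time + cpu_seconds

# Toy interference function: aggregate throughput drops 10% for every
# extra stream and is split evenly among streams (illustrative numbers,
# standing in for the measured hardware-specific function).
interference = lambda n: (0.9 ** (n - 1)) / n
```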

The I/O interference model was evaluated against several representative MapReduce workloads and was able to predict their performance to within 5 to 10% even for highly I/O intensive workloads or workloads that use a combination of I/O and CPU intensive processing. The information provided by this model can be utilized, for example, to determine how many nodes to provision with how many CPUs, and how many tasks to run simultaneously on any given node. Additionally, this model can be used by a scheduler to decide how to place tasks in the cluster based on their expected effect on I/O performance.
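For instance (a hypothetical helper, not Mariom's actual API), the predicted per-task time can be turned into such a provisioning decision by minimizing the estimated makespan over candidate per-node task counts:

```python
from math import ceil

def best_slot_count(tasks_per_node, per_task_time, max_slots=16):
    # per_task_time(n): predicted duration of one task when n tasks run
    # concurrently on the node (e.g. predicted_task_time above with the
    # workload and hardware parameters filled in). More slots mean fewer
    # waves of tasks but slower tasks; pick the best trade-off.
    def makespan(slots):
        return ceil(tasks_per_node / slots) * per_task_time(slots)
    return min(range(1, max_slots + 1), key=makespan)
```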

Improving resource utilization in the cloud helps to reduce application execution time, and reduces costs for both cloud providers and users. In this thesis, I address two aspects of this problem: I propose Dynamic Partition Assignment to mitigate the effect of stragglers and improve workload balancing, and I propose a cost model that addresses the problem of I/O interference. While both of these are targeted at MapReduce, the methods used in this thesis are not specific to MapReduce and can be applied to other data intensive applications in the cloud.

Summary of the review

This thesis, titled "Research on efficient resource utilization in data intensive distributed systems (データインテンシブ分散システムにおける高効率計算機資源利用に関する研究)", consists of seven chapters written in English. Its aim is to establish methods for highly efficient allocation of computing resources to data intensive applications in large scale distributed processing environments. Using MapReduce, a representative large scale distributed processing platform, it proposes Dynamic Partition Assignment, a runtime load balancing method that divides the data into finer partitions and assigns these partial data sets to tasks dynamically. Since concurrently issued I/O operations interfere with one another and reduce the benefit of parallelism, it also proposes a processing cost prediction model that accounts for I/O interference, so that a resource allocation that preserves parallel speed-up can be chosen before an application runs. By comparing the model's predictions with measured results for representative data intensive applications, the thesis shows that the proposed model accurately reflects I/O interference. It further constructs a resource allocation framework based on the cost model and shows that it can derive an efficient allocation of computing resources for a given application and system configuration.

Chapter 1, "Introduction", surveys the background and objectives of the thesis and outlines its structure.

Chapter 2, "Cloud computing", summarizes the concepts, usage models, and characteristics of cloud computing, the way it provides computing resources, and related work.

Chapter 3, "Data intensive processing on the cloud", reviews the Google File System and related work as the distributed data storage used in the cloud, and MapReduce and related work as distributed data processing frameworks.

Chapter 4, "Runtime workload balancing for data intensive applications", demonstrates concretely, by running FP-Growth on real machines, that data skew and heterogeneous system configurations unbalance the runtime load distribution, exposing the limits of MapReduce's task distribution. To restore balance, it proposes Dynamic Partition Assignment (DPA), which performs at runtime the task assignment that standard MapReduce fixes in advance. DPA is realized on Jumbo, a MapReduce-style platform implemented so that the scheduling can be modified, and the chapter confirms that the proposed method is effective when the data is skewed.

Chapter 5, "I/O interference in data intensive distributed processing", shows that even when parallel processing across multiple nodes is intended, concurrently issued I/O operations interfere with one another and the expected parallel speed-up is not obtained. Basic I/O costs are measured on real machines with micro-benchmarks, the loss of I/O performance under concurrent access in a multi-threaded environment is examined, and an I/O cost prediction model based on these measurements is proposed. The chapter then runs the representative data intensive applications WordCount, TeraSort, FP-Growth, and PageRank on real machines, analyzes their processing costs in detail at the task level, and shows that the proposed model predicts the processing cost with high accuracy.
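The concurrent-access measurement can be pictured with a small sketch (file paths are placeholders; a real benchmark must use files larger than the page cache, or drop caches between runs):

```python
import os, time, threading

def drain(path, block=4 * 2**20):
    # Read a file sequentially in large blocks and discard the data.
    with open(path, 'rb', buffering=0) as f:
        while f.read(block):
            pass

def aggregate_read_bandwidth(paths):
    # One reader thread per file; running this for 1, 2, 4, ... files
    # on the same disk traces how throughput degrades under contention.
    threads = [threading.Thread(target=drain, args=(p,)) for p in paths]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    total = sum(os.path.getsize(p) for p in paths)
    return total / (time.perf_counter() - start)  # bytes per second
```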

Chapter 6, "Mariom: MapReduce I/O interference model framework", builds on the interference-aware I/O cost prediction model of Chapter 5 to construct a framework, consisting of tools for measuring and analyzing the target application's workload, a performance prediction system, and micro-benchmarks for measuring hardware performance, that lets users determine a system configuration appropriate for their application's load in a MapReduce environment. Comparing Mariom's output with the results of running representative applications on real machines shows that it correctly predicts trends in processing cost, such as the loss of parallel speed-up caused by I/O interference as the number of threads grows.

Chapter 7, "Conclusion", summarizes the results of the thesis and discusses future work.

In summary, this thesis proposes a method for dynamically assigning finer-grained data partitions for runtime load balancing in the large scale distributed processing environment MapReduce, and constructs an I/O cost model that predicts how concurrently issued I/O operations interfere with one another and reduce parallel speed-up. Experiments on real machines show that the proposed model can determine an appropriate allocation of computing resources for a given application and system configuration, demonstrating its effectiveness. The thesis thus makes a substantial contribution to information and communication engineering.

The thesis is therefore judged acceptable as a dissertation for the degree of Doctor of Information Science and Technology.
