Doctoral Dissertation Abstract



No. 122732
Author (Japanese): デシルヴァ ガムヘワゲ チャミンダ
Author (English): De Silva Gamhewage Chaminda
Title (Japanese): ユビキタスホームにおける体験情報処理と検索
Title (English): Multimedia Experience Retrieval in a Ubiquitous Home
Report number: 122732
Report number: 甲22732
Date of degree conferral: 2007.03.22
Degree type: Doctorate by coursework
Degree name: Doctor of Philosophy (Science)
Diploma number: 博創域第269号
Graduate school: Graduate School of Frontier Sciences
Department: Department of Frontier Informatics
Thesis committee: Chief examiner: Professor Kiyoharu Aizawa, The University of Tokyo
 Professor Hiroshi Harashima, The University of Tokyo
 Professor Tadashi Shibata, The University of Tokyo
 Professor Takashi Chikayama, The University of Tokyo
 Professor Hitoshi Aida, The University of Tokyo
 Associate Professor Yoichi Sato, The University of Tokyo

Abstract

Automated capture and retrieval of multimedia experiences at home is interesting for a number of reasons. A system with this capability can serve the residents by capturing experiences without requiring them to step away from those experiences to shoot photos or video. It can "entertain" the residents by allowing them to recall happy moments and experiences, and to discover things that were previously unknown to them. It can act as a "memory aid" or a "healthcare assistant," making life more comfortable for the elderly. Used over a long period of time, it can also help the residents identify their behavioral patterns and take corrective action if necessary.

 However, this is a difficult task that poses challenges in several respects. The number of sensors required to completely capture the experiences taking place in a home-like environment is quite large. Continuous capture is necessary to avoid missing experiences that the residents are not prepared for, and it produces a large amount of multimedia content that is far less structured than content from most other environments. Recognizing actions, events and experiences in such data is extremely difficult. Queries for retrieval are posed at a high semantic level and at different levels of granularity; a resident might simply want to know the number of visitors to the house on a certain day, or want to see video of what he was doing during the afternoon of a selected day. Different places in the home have different levels of privacy, restricting the types of data that can be captured in some locations.

 In this research, we focus on the capture and retrieval of personal experiences in a ubiquitous environment that simulates a house, with the objective of creating an electronic chronicle that enables the residents to retrieve the captured video using simple, interactive queries. A large number of cameras and microphones continuously record video and audio in the desired areas of the house. Pressure-based sensors mounted on the floor record context data corresponding to the residents' footsteps. A given region of the house may contain none, some or all of these types of sensors, depending on its level of privacy. One day of continuous capture in this house (17 cameras and 25 microphones recording around the clock) results in 408 hours of video and 600 hours of audio, which amounts to about 500 GB of disk space, making manual retrieval impractical.
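
 These volumes can be checked with simple arithmetic; the sketch below (Python) uses the camera and microphone counts given in Chapter 3, while the bitrates are purely illustrative assumptions chosen to show how roughly 500 GB per day can arise.

    # Back-of-the-envelope check of the daily capture volume. Sensor
    # counts are from Chapter 3; the bitrates are assumptions chosen
    # only to illustrate how ~500 GB/day can arise.
    HOURS_PER_DAY = 24
    cameras, microphones = 17, 25
    video_hours = cameras * HOURS_PER_DAY      # 408 hours of video
    audio_hours = microphones * HOURS_PER_DAY  # 600 hours of audio
    video_mbps, audio_mbps = 2.5, 0.128        # assumed average bitrates
    gigabytes = (video_hours * 3600 * video_mbps
                 + audio_hours * 3600 * audio_mbps) / 8 / 1000
    print(video_hours, audio_hours, round(gigabytes))  # 408 600 494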

 Our approach in this work is to use context data to select the sources that convey the most information. Only the selected sources are queried to retrieve data, and only these data are analyzed further for retrieval, thereby minimizing the computational effort spent on content analysis. At the same time, the redundancy created by the large number of sensors is exploited to improve the accuracy of retrieval.
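
 A minimal sketch of this selection step, assuming a simple rectangular coverage region per camera; the function name and the region representation are hypothetical, not taken from the thesis.

    # Sketch of context-driven source selection (hypothetical names;
    # the thesis does not specify this interface). Only cameras whose
    # coverage region is currently occupied, according to the floor
    # sensors, are passed on to the expensive content-analysis stage.
    from typing import Dict, List, Tuple

    Region = Tuple[float, float, float, float]  # x1, y1, x2, y2

    def select_cameras(footsteps: List[Tuple[float, float]],
                       coverage: Dict[str, Region]) -> List[str]:
        """Return ids of cameras whose region contains a recent footstep."""
        return [cam for cam, (x1, y1, x2, y2) in coverage.items()
                if any(x1 <= x <= x2 and y1 <= y <= y2 for x, y in footsteps)]

    # Footsteps in the kitchen select only the kitchen camera.
    print(select_cameras([(1.0, 2.0), (1.2, 2.1)],
                         {"kitchen": (0, 0, 3, 4), "bedroom": (5, 0, 8, 4)}))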

 Data from the floor sensors are clustered using a hierarchical approach to segment the footstep sequences of different persons. Algorithms for automatic video and audio handover use these sequences to create video clips, automatically switching cameras and microphones to keep the person in view and to capture the sounds in his or her surroundings. Key frames are extracted from these videos to create summaries, allowing the users to preview their content quickly. An adaptive algorithm based on time, location and the person's rate of activity is used to create complete yet compact summaries.
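
 The following is a simplified sketch of these two steps, not the thesis's exact algorithm: footsteps are clustered on position and (scaled) time, so that steps close in space and time fall into one walking sequence, and handover then switches to whichever camera is nearest the person's current position. The time scale and distance threshold are illustrative.

    # Sketch of footstep segmentation plus camera handover (simplified;
    # the thesis's hierarchical procedure has more stages).
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def segment_footsteps(steps, time_scale=0.5, max_dist=1.5):
        """steps: iterable of (x, y, t). Scaling t lets one distance
        threshold act on space and time together."""
        feats = np.array(steps, dtype=float)
        feats[:, 2] *= time_scale
        tree = linkage(feats, method='single')
        return fcluster(tree, t=max_dist, criterion='distance')

    def handover_camera(position, cameras):
        """Pick the camera closest to the person's current position."""
        return min(cameras, key=lambda c: np.hypot(c[1][0] - position[0],
                                                   c[1][1] - position[1]))[0]

    steps = [(0.0, 0.0, 0.0), (0.4, 0.1, 0.7), (0.9, 0.2, 1.4),  # person A
             (6.0, 3.0, 0.2), (6.3, 3.2, 0.9)]                   # person B
    print(segment_footsteps(steps))  # two sequence labels, e.g. [1 1 1 2 2]
    print(handover_camera((0.9, 0.2),
                          [("cam1", (1.0, 0.0)), ("cam2", (6.0, 3.0))]))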

 Audio data from each microphone are segmented at two levels to retrieve audio events. First, data corresponding to silence and small noises are removed. Thereafter, sounds heard from regions other than the one where the microphone is located are removed using a sound source localization algorithm. The resulting audio segments are classified into different categories of sounds, making it possible to retrieve both the sounds themselves and video showing the locations where they were heard.
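
 A minimal sketch of the first segmentation level, assuming plain short-time energy thresholding; the frame length and threshold are illustrative, and the thesis's actual silence-elimination procedure may differ.

    # First-level audio segmentation: drop frames whose short-time
    # energy stays below a threshold (simplified sketch of silence and
    # small-noise elimination; all parameter values are illustrative).
    import numpy as np

    def nonsilent_segments(signal, rate, frame_ms=20, threshold=1e-4):
        frame = int(rate * frame_ms / 1000)
        n = len(signal) // frame
        frames = signal[:n * frame].reshape(n, frame)
        energy = (frames ** 2).mean(axis=1)   # short-time energy
        keep = energy > threshold
        # Convert the boolean mask into (start, end) sample ranges.
        segments, start = [], None
        for i, k in enumerate(keep):
            if k and start is None:
                start = i * frame
            elif not k and start is not None:
                segments.append((start, i * frame))
                start = None
        if start is not None:
            segments.append((start, n * frame))
        return segments

    rate = 16000
    t = np.arange(rate) / rate
    sig = np.where(t > 0.5, 0.1 * np.sin(2 * np.pi * 440 * t), 0.0)
    print(nonsilent_segments(sig, rate))  # roughly [(8000, 16000)]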

 Basic analysis of image data is used to detect selected events that take place inside the house. Floor sensor data are analyzed in combination with other sensory modalities to recognize some common actions inside the house. The results are written to a central relational database, where they can be fused for accurate detection of activities.
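
 A sketch of such a central store; the schema and the overlap-based fusion query are assumptions for illustration, since the thesis only states that detector outputs are written to one relational database and fused there.

    # Sketch of a central event store (hypothetical schema). Fusion by
    # co-occurrence: pairs of events from different modalities that
    # overlap in time at the same location.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE events (
                      t_start REAL, t_end REAL, location TEXT,
                      modality TEXT, label TEXT)""")
    rows = [(10.0, 12.0, "kitchen", "floor", "walking"),
            (10.5, 11.5, "kitchen", "audio", "footsteps"),
            (11.0, 11.2, "kitchen", "video", "light_on")]
    db.executemany("INSERT INTO events VALUES (?,?,?,?,?)", rows)
    for pair in db.execute("""SELECT a.label, b.label FROM events a
                              JOIN events b
                              ON a.location = b.location
                              AND a.modality < b.modality
                              AND a.t_start < b.t_end
                              AND b.t_start < a.t_end"""):
        print(pair)  # co-occurring evidence to fuse into one activity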

 The users, who are also the residents, retrieve their experiences from the database through a graphical user interface by submitting interactive queries. This interface is designed around the concepts of hierarchical media segmentation and interactive retrieval, to facilitate effective retrieval with a minimal amount of manual data input using only a pointing device. Visualizations of different types of data at various levels of detail help the user retrieve the required media and understand the results.
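
 The hierarchical, point-and-click style of retrieval can be sketched as successive narrowing of one result set, so that the user never types a query; the data model and function below are hypothetical.

    # Sketch of hierarchical interactive retrieval: each selection the
    # user makes with the pointing device narrows the result set one
    # level (day -> room -> event type). Names are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class Event:
        day: str
        room: str
        t_start: float
        label: str

    def drill_down(events, day=None, room=None, label=None):
        """Each non-None argument narrows the result set one level."""
        return [e for e in events
                if (day is None or e.day == day)
                and (room is None or e.room == room)
                and (label is None or e.label == label)]

    log = [Event("2004-11-01", "kitchen", 10.0, "visitor"),
           Event("2004-11-01", "study", 14.0, "light_on"),
           Event("2004-11-02", "kitchen", 9.0, "visitor")]
    step1 = drill_down(log, day="2004-11-01")   # whole-house summary
    step2 = drill_down(step1, room="kitchen")   # per-room view
    print([e.label for e in step2])             # ['visitor']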

 We evaluated the system using a two-pronged approach. Each functional component was evaluated individually, to ensure that it provides accurate results both to the user and to the other components that consume its output. We used standard accuracy measures and experiments where available, and designed experiments and defined new accuracy measures where necessary. We also conducted a user study to gather system requirements and to evaluate the overall system. A set of "real-life experiments," in each of which a family actually lived in the house for a period of 7-14 days, was conducted for data collection. One of these families took part in a user study, in which they suggested system requirements, used the system to retrieve their experiences, and provided feedback.

 Segmentation of floor sensor data followed by video handover enabled the creation of personalized video clips from a large number of cameras. Using audio handover, it was possible to dub these clips with audio of reasonably good quality. Adaptive key frame extraction retrieved more than 80% of the key frames required for a complete summary of the video. Silence elimination and false-positive removal from the audio data produced results with a high accuracy of 98%. The scaled template matching algorithm we propose localizes sound sources with an accuracy of about 90%, despite the absence of microphone arrays or a beam-forming setup. The accuracy of audio classification using only time-domain features is above 83%. Basic image analysis enabled the detection of events that are useful in understanding the activities taking place inside the house. Action detection using multiple sensory modalities yielded an average accuracy of approximately 78%.

 The residents who evaluated the system found it useful and enjoyed using it. They discovered events that they had not been aware of before using the system. The residents wanted to keep some of the video they were able to retrieve, demonstrating the system's applicability. They found the system easy to learn and usable. The requirements they identified and the feedback they provided were valuable in improving the system.

Summary of the Review

 This thesis, entitled "Multimedia Experience Retrieval in a Ubiquitous Home (ユビキタスホームにおける体験情報処理と検索)," consists of ten chapters and is written in English. Advances in information environments are opening the way to recording people's experiences and daily lives in detail and putting those records to use. In the "house" called the Ubiquitous Home, a large number of cameras, microphones, floor sensors and other devices can continuously record, at all times, what happens inside the house. Many different events occur in a home, and reviewing the video records often leads to new discoveries. However, although only a limited number of cameras capture any object of interest, the volume of data is extremely large, and manual retrieval is far from easy. How to search the recorded events efficiently is therefore a very important problem. This thesis devises methods that process multimodal sensor data, automatically extract events, and present them as continuous video, and implements a system for this purpose.

 Chapter 1, "Introduction," discusses the purpose and background of this thesis.

 Chapter 2, "State of the Art," reviews the state of the related technologies: ubiquitous information environments, multimedia retrieval, retrieval of data captured in ubiquitous environments, and the capture and retrieval of personal experiences.

 Chapter 3, "Ubiquitous Home," describes the configuration of the Ubiquitous Home used as the platform of this research and the sensor data it captures. The Ubiquitous Home is a house with two bedrooms and a living-dining-kitchen area (2LDK), equipped with 17 cameras, 25 microphones and floor pressure sensors, whose data are recorded continuously.

 Chapter 4, "System Overview," outlines the processing and retrieval of the sensor data, presenting a system that consists of floor sensor processing, audio data processing, image data processing, action classification, and retrieval processing.

 Chapter 5, "Personalized Video Retrieval Using Floor Sensor Data," discusses the tracking and display of video of a person using floor sensor data. Hierarchical clustering is applied to the floor sensor data to detect footsteps, detect trajectory segments, and construct trajectories. The resulting person tracking is used to select the best camera view from the person's position, realizing handover that switches cameras automatically as the person moves. The chapter also discusses the extraction of key frames from the tracking video, and shows through experiments that key frame summarization by adaptive sampling combining time, location, and step count (activity level) is the most effective.
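
 The adaptive sampling idea can be sketched as follows (a simplification; the exact combination of time, location and step count evaluated in the chapter is not reproduced): frames are taken more densely when the activity level is high, and at least once per newly visited location.

    # Sketch of adaptive key frame sampling: the sampling interval
    # shrinks as the step rate (activity) rises, and every newly
    # visited location contributes at least one frame. Parameter
    # values are illustrative.
    def keyframe_times(trajectory, base_interval=10.0):
        """trajectory: list of (t, location, steps_per_sec)."""
        frames, next_t, seen = [], 0.0, set()
        for t, loc, activity in trajectory:
            interval = base_interval / (1.0 + activity)  # denser when active
            if t >= next_t or loc not in seen:
                frames.append(t)
                next_t = t + interval
                seen.add(loc)
        return frames

    traj = [(0, "corridor", 0.2), (4, "corridor", 0.2),
            (8, "kitchen", 1.5), (10, "kitchen", 1.8), (14, "kitchen", 0.1)]
    print(keyframe_times(traj))  # [0, 8, 14]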

 Chapter 6, "Audio Analysis for Multimedia Retrieval," discusses the processing of the audio signals from the many microphones. For identifying the region in which a sound source is located, a new method called the energy distribution template method is devised, and its accuracy is shown to be high.
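
 The sketch below assumes the basic idea suggested by the method's name, matching the normalized per-microphone energies against a stored template per region; it is an illustration, not the thesis's exact formulation.

    # Sketch of template-based sound source localization (assumed
    # formulation): normalize the vector of per-microphone energies
    # and pick the region whose stored template is nearest.
    import numpy as np

    templates = {                      # assumed per-region energy patterns
        "kitchen": np.array([0.7, 0.2, 0.1]),
        "bedroom": np.array([0.1, 0.2, 0.7]),
    }

    def localize(energies):
        v = np.asarray(energies, dtype=float)
        v = v / v.sum()                # scale-invariant matching
        return min(templates, key=lambda r: np.linalg.norm(v - templates[r]))

    print(localize([5.0, 1.5, 0.6]))   # -> 'kitchen'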

 Chapter 7, "Event and Action Detection Using Multiple Modalities," discusses the detection of events using the other data. It shows that by analyzing luminance changes in the images in detail, changes in ambient lighting can be detected, and lights-on/off events can be recognized from the video of auto-exposure cameras. It also discusses the automatic detection of actions from floor sensor data, evaluating the detection accuracy for six actions.
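
 A minimal sketch of luminance-based lighting-change detection, assuming a simple step threshold on the mean frame brightness; handling the re-adaptation of auto-exposure cameras, which the chapter addresses, is omitted here.

    # Sketch of lighting-change detection from mean frame luminance:
    # a large step in average brightness between consecutive frames is
    # flagged as a light switched on or off. Threshold is illustrative.
    import numpy as np

    def light_events(frames, threshold=30.0):
        """frames: sequence of 2-D grayscale arrays (0-255)."""
        means = [float(np.mean(f)) for f in frames]
        events = []
        for i in range(1, len(means)):
            diff = means[i] - means[i - 1]
            if abs(diff) > threshold:
                events.append((i, "light_on" if diff > 0 else "light_off"))
        return events

    dark = np.full((4, 4), 20.0)
    bright = np.full((4, 4), 120.0)
    print(light_events([dark, dark, bright, bright, dark]))
    # -> [(2, 'light_on'), (4, 'light_off')]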

 Chapter 8, "User Interaction," discusses an interactive retrieval interface that makes use of the various detection results obtained from the Ubiquitous Home sensor data. It presents an interface that leads from a summary of the whole house, through per-room summaries, to the presentation of detailed video selected through events such as people's movements, sound sources, and lighting changes.

 Chapter 9, "User Study," reports a user study of the retrieval system using the data of a family who lived in the Ubiquitous Home for 12 days. Specifically, about six months after the real-life experiment, the family used the retrieval system and answered a questionnaire. The system was rated highly for ease of use and related aspects, and the family also valued highly that, by navigating the data from the experiment, they were able to make new discoveries they had not noticed at the time.

 Chapter 10, "Conclusion and Future Work," summarizes the contributions of this thesis and the remaining challenges.

 In summary, for the vast amount of experience data that can be captured in the new information environment of the Ubiquitous Home, this thesis presents a retrieval strategy based on the processing of multimodal sensor information, constructs an interactive retrieval system, and evaluates it through real-life experiments. This work can be expected to open up a new area of media technology, and it contributes substantially to the foundations of informatics.

 The committee therefore recognizes that the candidate qualifies for the degree of Doctor of Philosophy (Science).

UTokyo Repository link: http://hdl.handle.net/2261/9285