Dissertation Abstract



No 123067
Author (kanji): 森,純一郎
Author (romanized):
Author (kana): モリ,ジュンイチロウ
Title (Japanese): 検索エンジンを利用したウェブからのエンティティ情報抽出手法と応用に関する研究
Title (English): Entity Information Extraction from the Web Using Search Engine: Methodology and Application
Report number: 123067
Report number: 甲23067
Date of conferral: 2007.09.28
Degree category: Course doctorate
Degree: Doctor of Information Science and Technology
Diploma number: 博情第156号
Graduate school: Graduate School of Information Science and Technology
Department: Department of Information and Communication Engineering
Thesis committee: Chair: Professor 安達,淳, The University of Tokyo
 Professor 石塚,満, The University of Tokyo
 Professor 浅見,徹, The University of Tokyo
 Professor 近山,隆, The University of Tokyo
 Professor 喜連川,優, The University of Tokyo
 Professor 坂井,修一, The University of Tokyo
Abstract

The current development of Internet infrastructure, such as broadband and wireless networks, enables users to access the Web easily. Nearly 87 million people in Japan currently use the Internet. Moreover, the development of Web applications enables users to easily create and disseminate their own content on the Web. For example, using Blogs, diary-like sites that include multimedia content such as photos and videos, users can easily publish their information. Nearly 8.68 million people in Japan currently use Blog services.

With the rapidly growing content on the Web, the recent Web has witnessed a shift of emphasis from the quality to the quantity of information. A few years ago, when people tried to find information on the Web, they relied on a few "authority" sites that aggregate and disseminate valuable information. Algorithms for ranking Web sites, such as HITS and PageRank, were developed and applied to such sites. However, the recent explosion and distribution of information, in which users can easily publish their own information on the Web, has made it difficult to find valuable information using such hub- and authority-based algorithms alone. As the content on the Web rapidly increases, the quantity of information is becoming ever more important on the Web.

The importance of the quantity of information is explained by the recent notion of "collective intelligence" on the Web. Collective intelligence is the capacity of communities to cooperate intellectually in creation, innovation, and invention. For example, Wikipedia, an online encyclopedia based on the notion that every user can add an entry, is a successful site built on the idea of collective intelligence. Folksonomy, a style of collaborative categorization of Web sites using freely chosen keywords (or tags), is another example of collective intelligence. As seen in Wikipedia, every single user contributes to creating a large quantity of information, and, as seen in Folksonomy, that information is then organized and curated by user communities.

Collective intelligence is also emerging in the huge language resource of Web documents, which contain hundreds of billions of words of text. Search engines play an important role in accessing this resource. The simplest way to access the language resources of the Web is to use search-engine hit counts as word frequencies. For example, when checking a spelling, speculater or speculator, Google gives 4,700 hits for the former and 1,210,000 for the latter. As this example shows, the collective intelligence of majority decision on the Web can be obtained simply by exploiting Google hit counts. With its large quantity of information, the Web has become a huge corpus, an easily accessible source of language material via search engines, which in turn opens new possibilities for handling vast amounts of relevant information and mining important structures and knowledge.
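The hit-count comparison above can be sketched in a few lines. This is a minimal illustration, not the thesis's method; the counts are the figures quoted in the text, hard-coded as stand-ins for live search-engine queries.

```python
# A minimal sketch of the "majority decision" idea: treat search-engine hit
# counts as word frequencies and prefer the more frequent spelling. The
# counts below are the figures quoted above, hard-coded in place of real
# search-engine queries.
hit_counts = {"speculater": 4_700, "speculator": 1_210_000}

def preferred_spelling(candidates, counts):
    """Return the candidate spelling with the highest (pseudo) hit count."""
    return max(candidates, key=lambda word: counts.get(word, 0))

print(preferred_spelling(["speculater", "speculator"], hit_counts))
# prints "speculator"
```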

In addition to the trend of "Web as corpus", another important aspect of the current Web is that our daily life is reflected in it. For example, social networking services (SNSs) have recently received considerable attention on the Web. SNSs enable users to maintain an online network of friends or associates for social or business purposes. There, users can create content such as profiles and Blogs and communicate with their friends. Information about tens of millions of people and their relationships is published on several SNSs. For example, more than 10 million users use mixi, the largest SNS in Japan.

As users publish their daily activities and social relationships in Blogs and SNSs, the Web increasingly reflects information from the real world, and that information is constantly updated through the content that users create online. Communication and information sharing in the real world are also reflected in the Web. Using communication tools such as Email, Instant Messenger, and SNSs, users can communicate with each other and share information online as they do in the real world. As real-world information and communication are reflected in the Web, the Web is becoming another form of our society.

With the current trends of "Web as Corpus" and "Web as Society", large amounts of information originating from our daily activities in the real world are available on the Web. In line with these trends, there is a new tendency in information retrieval: users try to find "entity-based" information rather than documents. Here, an entity is defined as an object in the real world, such as a person, location, or organization. In addition to information about single entities, as the recent trend of social networks (which essentially represent the structure of relations among entities) shows, relational information among entities on the Web (e.g., the relation between two persons, or between a person and an organization) is also becoming important information for users to retrieve.

For example, when a user wants to know about "Prof. Mitsuru Ishizuka", he might put the query "Mitsuru Ishizuka" into a search engine and try to find information about Prof. Ishizuka in the search results. The user's final goal is not to find documents that include descriptions of Prof. Ishizuka but to find information related to him, such as his students, research fields, affiliations, and projects. In other words, what the user wants to know is the information or attributes of Prof. Ishizuka as a person (or, more precisely, a researcher) entity. To learn more about him, the user might try to find the relation between Prof. Ishizuka and a student, co-author, or colleague; the user might also be interested in the relation between Prof. Ishizuka and his affiliation. As this example shows, users now search for entity-based information and entity relations on top of existing document-based Web information.

The Semantic Web is one approach to realizing entity-based information retrieval. In the Semantic Web, every resource is annotated with metadata using ontologies. For example, "Prof. Ishizuka" is explicitly represented as an instance of the Person class, and related information about him, such as affiliations and research fields, is described with metadata. Users can then easily search for and find information about Prof. Ishizuka using the annotated metadata. However, because data must be annotated with metadata in advance for Semantic Web technologies to be fully used, metadata annotation is a major obstacle to realizing the Semantic Web. There is therefore still a huge gap between the current Web, where most data is unstructured, and the Semantic Web.

Aiming to realize information services based on entity information and entity relations as a next stage of current information retrieval, in this thesis we propose methods for extracting entity information and entity relations from the Web. The key feature of our approach is to leverage existing search engines to obtain Web-scale statistics such as hit counts and snippets in order to assess entity-related information. Applying text-processing technologies such as named entity recognition and clustering to the information obtained from a search engine, our methods extract entity information, entity relations, and social networks. The extracted information can be applied to several entity-based applications. We first develop a researcher search system in which information about researchers and their relationships is automatically extracted from the Web. We also develop an information sharing system and an expert finding system that use the extracted social networks.

Overall, in this thesis we address two major research questions for extracting entity information from the Web: (1) how a search engine can be used to access the Web corpus and extract entity information from the Web, and (2) how the extracted entity information can be used to support users in entity-based information services.

For the first question, we propose a basic method of using a search engine to obtain Web-scale statistics such as hit counts, co-occurrences, and snippets. Building on this basic method, we develop algorithms for extracting entity information, entity relations, and social networks from the Web.

To extract entity information, we propose a keyword extraction method based on the statistical features of word co-occurrence obtained from a search engine. The basic idea is the following: if a word co-occurs with an entity in many Web pages, the word is likely to be a relevant keyword for that entity. Importantly, our method extracts relevant keywords depending on the context of the entity. Our evaluation shows better performance than existing keyword extraction methods.

To extract entity relations, we propose a method that automatically extracts descriptive labels of relations among entities, such as affiliations, roles, locations, part-whole relations, and social relationships. Fundamentally, the method clusters similar entity pairs according to their collective contexts in Web documents; the descriptive labels for relations are obtained from the clustering results. The proposed method is entirely unsupervised and can easily be combined with existing social network extraction methods. Our experiments on entities in researcher and political social networks achieved clustering with high precision and recall, showing that our method can extract appropriate relation labels to represent relations among entities in those social networks.

To extract social networks, we propose a method that leverages a search engine to build a social network by merging information distributed across the Web. We describe basic algorithms that extract social networks from co-occurrence information, as well as advanced algorithms that distinguish classes of relations using supervised learning. We also address further aspects of social networks: the same-name problem, scalability, and keyword extraction.
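The co-occurrence-based step can be sketched roughly as follows. This is an illustrative simplification, not the thesis's exact algorithm: all names, hit counts, and the threshold are invented stand-ins for real search-engine queries.

```python
# A toy sketch of co-occurrence-based social network extraction: for each
# pair of names, compare (hypothetical) hit counts and add an edge when a
# Jaccard-style co-occurrence ratio exceeds a threshold.
from itertools import combinations

def cooccurrence_strength(hits_a, hits_b, hits_ab):
    """Jaccard-style ratio of joint occurrence to combined occurrence."""
    denom = hits_a + hits_b - hits_ab
    return hits_ab / denom if denom else 0.0

def extract_network(names, hits, pair_hits, threshold=0.01):
    """Return weighted edges between names whose co-occurrence is strong enough."""
    edges = []
    for a, b in combinations(names, 2):
        strength = cooccurrence_strength(
            hits[a], hits[b], pair_hits.get(frozenset((a, b)), 0))
        if strength >= threshold:
            edges.append((a, b, strength))
    return edges

# Invented hit counts standing in for live search-engine queries.
names = ["Mitsuru Ishizuka", "Junichiro Mori", "Some Stranger"]
hits = {"Mitsuru Ishizuka": 20_000, "Junichiro Mori": 8_000, "Some Stranger": 50_000}
pair_hits = {frozenset(("Mitsuru Ishizuka", "Junichiro Mori")): 1_500}

print(extract_network(names, hits, pair_hits))
```

A real system would also handle the same-name problem and low-count pairs, which this sketch omits.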

For the second question, we develop three systems that leverage the extracted entity information and social networks: a researcher search system, an information sharing system, and an expert finding system. The systems aim to support users by using the extracted entity information.

We develop a researcher search system: a Web-based system for an academic community that facilitates communication and mutual understanding based on a social network extracted from the Web. The system provides various types of retrieval on the social network: users can search for researchers by name, affiliation, keyword, and research field; researchers related to a retrieved researcher are listed; and the shortest path between two researchers can be retrieved.

We also develop a real-world-oriented information sharing system that uses social networks. The system automatically obtains users' social relationships by mining various sources on the Web. It also enables users to analyze their social networks to gain awareness of the information dissemination process. Users can determine who has access to particular information based on these social relationships and network analysis.

Finally, we propose a method that leverages the entity information and social networks of Web communities to find experts who have the appropriate expertise and are likely to be able to answer an information request. We develop the system using data from an actual social networking service and provide a service for locating relevant and socially close experts for information seekers.

Thesis Review Summary

This thesis, entitled "Entity Information Extraction from the Web Using Search Engine: Methodology and Application", is written in English and consists of nine chapters.

Chapter 1, "Introduction", describes the background: the WWW (Web) has become an important information infrastructure for society; the vast amount of Web information can be regarded as a corpus of collective intelligence; and, as seen in recent Blogs and SNSs (social networking services), the Web has become a medium that reflects the state of the real world. It argues that information extraction from the Web, particularly of information related to entities (concretely, persons and organizations), is valuable for new ways of exploiting Web information.

Chapter 2, "Background and Related Work", surveys related work on information extraction from the Web and Web mining, on the Semantic Web as a next-generation Web in which computers can grasp the meaning of Web content, and on Web-related social networks. It then states the distinguishing features of this research relative to existing work.

Chapter 3, "Modeling Entity Information From Web", describes the basic model for representing entities on the Web and a method for constructing that model using a search engine.

Chapter 4, "Entity Information Extraction from Web", presents a method for extracting keywords about persons from the Web. This capability is also important from the viewpoint of metadata creation for the Semantic Web. The core of the method is the use of co-occurrence statistics between a person's name and words, demonstrated concretely mainly for researchers. For example, if an AND search on a search engine gives 3,100 hits for "Alfred Kobsa AND User Modeling" and 450 hits for "Alfred Kobsa AND Software Engineering", the researcher can be judged to be more related to "User Modeling". A term extraction tool is used to segment keyword candidates, including compound terms. As the scoring measure based on co-occurrence with the person's name, the Jaccard coefficient, which expresses the ratio of co-occurrence, is used. The chapter also presents a way of handling persons who appear on the Web in multiple contexts, for example a researcher who is also active as an artist. Compared with the widely used TFIDF (term frequency * inverse document frequency) keyword extraction method, the proposed method is shown experimentally to give superior results.
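The Jaccard scoring described here can be sketched as follows. The AND-query counts (3,100 and 450) are the figures quoted in the text; the single-term counts are invented stand-ins for live search-engine queries.

```python
# Jaccard-coefficient keyword scoring, as described above:
#   J(p, w) = hits(p AND w) / (hits(p) + hits(w) - hits(p AND w))
def jaccard(hits_person, hits_word, hits_both):
    """Co-occurrence ratio of a person name and a candidate keyword."""
    return hits_both / (hits_person + hits_word - hits_both)

hits_person = 40_000  # hits("Alfred Kobsa") -- invented figure
candidates = {
    # keyword: (hits(w) -- invented, hits(p AND w) -- from the text)
    "User Modeling": (300_000, 3_100),
    "Software Engineering": (5_000_000, 450),
}
for word, (hits_word, hits_both) in candidates.items():
    print(word, round(jaccard(hits_person, hits_word, hits_both), 6))
```

Because the Jaccard coefficient normalizes by the combined occurrence of both terms, a very common keyword such as "Software Engineering" is not favored merely for being frequent.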

Chapter 5, "Entities Relation Extraction from Web", presents a method for extracting the types of relations among entities on the Web, focusing mainly on persons. The method basically represents the context in which each entity pair appears as a word vector (bag-of-words), clusters this set of contexts bottom-up, and, for each resulting cluster, extracts common vocabulary as a descriptive label representing the type of relation between the entities. For example, from a cluster containing pairs such as "Junichiro Koizumi, Japan" and "Yoshiro Mori, Japan", "Prime Minister" is extracted as the descriptive label of the relation. Because this is an unsupervised learning method, it has the advantage of requiring no training data. Experiments were conducted on relations between politicians and geographic entities and on relations among researchers in the artificial intelligence field (co-authors of conference or journal papers, co-authors of books, co-editors of books, researchers in the same research project, researchers with the same affiliation, and so on); a context window of about 30 terms was found to be appropriate, and the proposed method was shown to give good results.
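A toy version of this pipeline might look like the sketch below. The one-line contexts are invented stand-ins for real Web snippets, and a single-pass greedy grouping replaces the thesis's bottom-up clustering; the similarity threshold is likewise invented.

```python
# Toy sketch: represent each entity pair's context as a bag of words,
# group pairs with similar contexts, and take the dominant term of a
# group as its relation label.
from collections import Counter
from math import sqrt

contexts = {  # invented one-line stand-ins for real Web snippets
    ("Junichiro Koizumi", "Japan"): "prime minister of japan cabinet minister",
    ("Yoshiro Mori", "Japan"): "former prime minister japan politics",
    ("Mitsuru Ishizuka", "University of Tokyo"): "professor at the university of tokyo",
}

def cosine(c1, c2):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

bags = {pair: Counter(text.split()) for pair, text in contexts.items()}

# Greedy single pass: join the first sufficiently similar cluster, else start one.
clusters = []
for pair, bag in bags.items():
    for cluster in clusters:
        if cosine(bag, cluster["bag"]) > 0.4:
            cluster["pairs"].append(pair)
            cluster["bag"] += bag
            break
    else:
        clusters.append({"pairs": [pair], "bag": bag + Counter()})

for cluster in clusters:
    cluster["label"] = cluster["bag"].most_common(1)[0][0]
    print(cluster["pairs"], "->", cluster["label"])
```

With these toy contexts, the two politician pairs share enough vocabulary to merge, and "minister" emerges as their label, mirroring the "Prime Minister" example above.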

Chapter 6, "Social Network Extraction from Web", describes, as a concrete example, the construction of a social network of the authors and participants of the annual conference of the Japanese Society for Artificial Intelligence (JSAI). The network is built with persons as nodes and relations as arcs; the existence of a relation is determined from the co-occurrence ratio of the two names on the Web, and when the number of Web occurrences is small, the top 10 pages in which both names co-occur are examined closely to judge whether a relation exists. Relation types are assigned by the method of Chapter 5, and persons are annotated with keywords obtained by the method of Chapter 4 so that their attributes can be seen. By applying to this network an algorithm that, like the Web's PageRank algorithm, repeatedly propagates authority scores to adjacent nodes, persons with high authority can be identified. The social network system Polyphonet, built with collaborators, has actually been operated as a support system for conveying related information to participants at the JSAI annual conferences (2003, 2004, 2005, 2006, 2007) and at the international conference Ubicomp 2005. In particular, its researcher search function can retrieve the research topics a given researcher works on and present acquaintance paths from one researcher to another.
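The authority-propagation step can be sketched as a plain PageRank-style iteration. The tiny acquaintance graph below is invented for illustration and is not from the thesis.

```python
# PageRank-style authority propagation: repeatedly distribute each node's
# score to its neighbours until the scores settle.
def authority_scores(adj, damping=0.85, iterations=50):
    """Iteratively propagate authority over adjacency lists."""
    nodes = list(adj)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            neighbours = adj[n]
            if not neighbours:
                continue
            share = damping * score[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        score = nxt
    return score

# Invented undirected acquaintance graph, encoded as symmetric adjacency lists.
adj = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A", "B"],
    "D": ["A"],
}
scores = authority_scores(adj)
top = max(scores, key=scores.get)
print(top)  # "A", the best-connected node, accumulates the most authority
```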

Chapter 7, "Information Sharing using Social Networks", proposes a method for appropriately controlling the scope of information disclosure and sharing using social networks. In this method, users extract their own social networks from the Web, email, and other sources, and, by editing them, specify the disclosure scope for each type of their information. Network analysis techniques are also used to present indicators such as each person's centrality, so that, for example, highly central persons within the disclosure scope can be singled out. As a concrete example, this functionality is realized for a network of researchers.
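As a toy illustration of the network-analysis side (not the thesis's exact criterion), degree centrality over a contact graph could be used to flag highly central contacts before deciding a disclosure scope; the graph and threshold are invented.

```python
# Toy sketch: compute degree centrality over a user's contact network and
# flag contacts whose centrality exceeds a threshold, since information
# shared with them is more likely to spread widely.
def degree_centrality(adj):
    """Fraction of other nodes each node is directly connected to."""
    n = len(adj)
    return {v: len(neigh) / (n - 1) for v, neigh in adj.items()}

# Invented contact graph centred on "alice".
adj = {
    "alice": ["bob", "carol", "dave"],
    "bob": ["alice", "carol"],
    "carol": ["alice", "bob"],
    "dave": ["alice"],
}
centrality = degree_centrality(adj)
hubs = [v for v in adj if v != "alice" and centrality[v] >= 0.5]
print(hubs)  # the highly central contacts to review before sharing
```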

Chapter 8, "Expert Finding using Social Networks", takes an online community about cooking recipes as its concrete target, forms a social network together with each person's profile and recipe information, and proposes a method for finding persons who have appropriate recipes under given conditions and who are socially close. By searching not only for persons knowledgeable about a recipe but also for socially close persons via the social network, it becomes possible to ask them directly when necessary. An experimental system implementing this method was built, and, although limited, its problems and an evaluation are presented.

Chapter 9, "Conclusion", summarizes the results of this thesis.

In summary, toward opening new ways of exploiting Web information, this thesis devises a method that uses search-engine functionality to extract entity information on the Web (concretely, keywords about persons) and a method that extracts relations, including their types, among entities on the Web (concretely, persons), and presents a way of constructing social networks of people from these. As a concrete system incorporating these methods, a social network of the authors and participants of the JSAI annual conference was built, and its feasibility and utility were demonstrated empirically through actual operation. It is further shown that, by using such human networks (social networks), functions can be realized for identifying highly authoritative persons in a community, for specifying an appropriate scope of information disclosure, and for finding persons who are both knowledgeable about a given matter and socially close; these functions are also concretely implemented and evaluated. Through these research results, this thesis makes a substantial contribution to information and communication engineering.

The thesis is therefore judged to qualify for the degree of Doctor of Information Science and Technology.

UTokyo Repository link: http://hdl.handle.net/2261/8143