Dissertation Abstract


No 122800
Author (Kanji): 岡崎,直観
Author (Kana): オカザキ,ナオアキ
Title (Japanese): 複数文書自動要約における情報の集約と統合に関する研究
Title (English): RESEARCH ON INFORMATION AGGREGATION AND INTEGRATION FOR MULTI-DOCUMENT SUMMARIZATION
Report number: 甲22800
Date of degree conferral: 2007.03.22
Degree type: Doctorate by coursework (課程博士)
Degree: Doctor of Information Science and Technology
Diploma number: 博情第130号
Graduate school: Graduate School of Information Science and Technology
Department: Department of Information and Communication Engineering
Thesis committee: Chief examiner: Professor 喜連川,優, The University of Tokyo
 Examiner: Professor 石塚,満, The University of Tokyo
 Examiner: Professor 広瀬,啓吉, The University of Tokyo
 Examiner: Professor 浅見,徹, The University of Tokyo
 Examiner: Professor 近山,隆, The University of Tokyo
 Examiner: Professor 坂井,修一, The University of Tokyo
Abstract

 Numerous computerized documents are accessible on-line. In November 2006, the Netcraft Web Server Survey reported that more than 100 million web sites with domain names existed on the Internet; the Internet has doubled in size since May 2004, when the survey hit 50 million. The grand total of pages on the Web is estimated to be much larger than this figure. These facts illustrate the information explosion: a tremendous increase in the amount of published documents. Meanwhile, search engines on the Web have achieved moderate success in keeping up with the growth of computerized documents. Major search engines (e.g., Google and Yahoo!) claim to have indexed more than several billion pages on the Web. Using a Web search engine has become the prevailing way to obtain information at a reasonable cost and in a reasonable time.

 Notwithstanding, we are often overwhelmed by the quantity of retrieved documents, despite having narrowed the range of documents of interest. For instance, Google retrieves as many as 10,300 documents for the query "hijacking all nippon airways 61" (as of November 2006); and 144 documents are found, with the same query, in the Mainichi newspaper articles published in 1999. The situation of having too much information to utilize is called information overload, and has been regarded as a major problem.

 The information explosion also results in a major shift in how information is structured. The more important a role information retrieval systems play in the information society, the more often we deal with documents gathered dynamically, without inter-document structure, e.g., search results. Unlike human-made document collections such as books and journals, we must sort out the structure of retrieved documents ourselves: what is common across the document set, what distinguishes one document from the others, what is the optimal order in which to read the documents, etc. In addition, more and more information is being published in unstructured styles. We cannot expect a coherent story in a series of blog entries, as a blogger may ramble across subjects such as products, news events, or local events. Thus, a mechanism that aggregates information published in different documents is key to solving the information overload problem.

 Automatic text summarization is one answer to the information overload problem, allowing users to control the amount of text to be read. The goal of automatic summarization is, given an information source, to present the most important content to the user in a condensed form and in a manner sensitive to the user's needs. Multi-Document Summarization (MDS), a summarization task specialized in dealing with related documents (e.g., a collection of news stories on the same topic), has attracted much attention in recent years. In addition to the conventional information retrieval process, a user forwards the retrieved documents to the MDS system and obtains a summary. If the summary satisfies the user's information demand directly, the user saves the time and effort of reading the retrieved documents. The summary can also help the user decide whether some of the documents deserve intensive reading.

 Each component of an MDS system also presents research challenges in text mining. This thesis addresses methodologies for aggregating information and knowledge across documents, focusing on three research topics essential to an MDS system: sentence extraction, sentence ordering, and acronym recognition. The thesis consists of seven chapters. The first chapter describes the background, motivation, and goal of this study. The subsequent chapter (Chapter 2) provides a review of automatic text summarization. Chapter 3 presents the task definition and evaluation methodology of the 3rd Text Summarization Challenge (TSC), as this study makes use of its evaluation corpus. Chapter 4 describes a method for sentence extraction in an MDS system, which is formalized as a combinatorial optimization problem that determines a set of sentences containing as much important information as possible. Chapter 5 addresses two approaches to text structuring for extracts from multiple documents: a novel method that refines the conventional method for arranging sentences, and a machine learning approach that aggregates multiple criteria for further improvement. Chapter 6 presents a methodology for building an abbreviation dictionary from a large corpus. Chapter 7 remarks on future directions of this work and concludes the thesis.

 Chapter 4 of this thesis presents a methodology for sentence extraction. Passage extraction is the most basic technology for building an automatic summarization system; excluding a few exceptions, most summarization systems employ some kind of extractive technique. Passage extraction finds important passages in the source documents to produce a summary. In general, the extraction problem is formalized as the computer-friendly task of assigning relevance scores to textual passages in the source documents. Passage extraction has thus been the low-cost yet mainstream approach in summarization research.

 Current NLP technology cannot deal with the meaning of a text as flexibly and efficiently as humans do. This study therefore assumes that a human reader breaks a sentence into a set of information fragments to which the sentence refers; that information fragments are independent of each other; and that each information fragment has an importance score. Thus, a sentence is approximated by a set of information fragments, each of which conveys a piece of atomic information in the sentence. Among various sentence representations, such as bag-of-words, bi-grams, tri-grams, n-grams, and FrameNet, this study proposes the use of the dependency relations between terms in a sentence. The advantage of this representation is that dependency relations refer to what the original sentence is saying (with a certain degree of human interpretation), and are therefore useful for keeping track of the information conveyed by the sentences.
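 To make this representation concrete, here is a minimal sketch (not code from the thesis; the hand-written dependency pairs are illustrative stand-ins for a parser's output):

```python
# Sketch of the dependency-pair representation: a sentence is approximated
# by the set of (head, dependent) content-word pairs that its parse yields.
# The pairs below are hand-written; a real system would obtain them from a
# dependency parser.

Fragment = tuple[str, str]  # (head word, dependent word)

def to_fragments(dependency_pairs: list[Fragment]) -> frozenset[Fragment]:
    """Treat each dependency relation as one atomic information fragment."""
    return frozenset(dependency_pairs)

# "Flight 61 was hijacked" -> {("hijacked", "Flight"), ("Flight", "61")}
fragments = to_fragments([("hijacked", "Flight"), ("Flight", "61")])
print(fragments)
```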

 Based on this sentence representation, the problem of sentence extraction is formalized as a combinatorial optimization problem that determines a set of sentences containing as many important information fragments as possible under the constraint of the summarization ratio. The presented algorithm chooses sentences from the source documents one by one. Because source documents often contain redundant information, the algorithm reduces the importance of information fragments that are already included in previously chosen sentences. Since the extraction problem belongs to the NP-hard class, the implemented system employs a beam search to find a quasi-optimal solution at a reduced search cost.
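 The following is a minimal, self-contained sketch of the greedy special case of this procedure (a beam of width 1; the fragment weights are toy values, not the thesis's scoring scheme):

```python
# Greedy sentence extraction with redundancy discounting: pick the sentence
# whose fragments carry the most remaining weight, then zero the weight of
# the fragments it covers, and repeat up to the length constraint.

def extract(sentences: list, weights: dict, max_sentences: int) -> list:
    remaining = dict(weights)
    chosen = []
    while len(chosen) < max_sentences:
        best, best_score = None, 0.0
        for i, frags in enumerate(sentences):
            if i in chosen:
                continue
            score = sum(remaining.get(f, 0.0) for f in frags)
            if score > best_score:
                best, best_score = i, score
        if best is None:
            break
        chosen.append(best)
        for f in sentences[best]:
            remaining[f] = 0.0  # covered fragments no longer add importance
    return chosen

# Example: three sentences over fragments a..d with toy weights.
sents = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
w = {"a": 3.0, "b": 2.0, "c": 2.0, "d": 1.0}
print(extract(sents, w, 2))  # -> [0, 2]: after picking 0, "b" is worth 0
```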

 The presented system achieved a good result on the TSC-3 evaluation corpus: for both high and low summarization ratios, the system took 3rd place among the systems that participated in TSC-3. The result was far better than that of a baseline system featuring the lead extraction strategy. A comparison among sentence representations demonstrated that the proposed representation using pair-wise dependency relations performed better than bag-of-words and co-occurrence representations.

 Chapter 5 examines methods for arranging the sentences extracted by important sentence extraction. A summary with improperly ordered sentences confuses the reader and degrades the quality and reliability of the summary itself. However, ordering a set of sentences into a coherent text is a non-trivial task. For example, identifying rhetorical relations in an ordered text is already difficult for computers, and the ordering task is more complicated still: such relations must be reconstructed from an unordered set of sentences. Source documents for MDS may have been written by different authors, in different writing styles, on different dates, and based on different background knowledge. We cannot expect a set of sentences extracted from such diverse documents to be coherent on its own.

 When asked to arrange sentences, a human may perform the task without difficulty, just as we write out our thoughts as text. However, we must consider what accomplishes this task, because computers are, by their nature, unaware of ordering. The most common strategy for sentence ordering is chronological ordering, which arranges sentences in the order of their publication dates. This study identifies a problem with chronological ordering, however: some sentences may lose the presuppositions assumed in their original documents. Thus, chronological ordering sometimes obscures what a sentence is intended to convey.

 In order to deal with these problematic cases of chronological ordering, this study proposes the use of precedence relations for coherent arrangement. The proposed method improves chronological ordering by resolving the precedent information of the sentences being arranged. This study also proposes an evaluation metric that measures sentence continuity, as well as an amendment-based evaluation task. The proposed method achieved good results in a rating task, raising poor chronological orderings to an acceptable level in 20% of the cases. Amendment-based evaluation outperformed an evaluation that compares an ordering with an answer made by a human. The sentence continuity metric, when applied to the amendment-based task, showed good agreement with the rating results.
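 As an illustration of the precedence idea, here is a rough sketch under assumed data structures (the `precedents` field, holding the indices of extracted sentences that a sentence presupposes, is a hypothetical stand-in for the automatic analysis performed in the thesis):

```python
# Sketch of chronological ordering refined by precedence information: before
# a sentence is emitted, any extracted sentence it presupposes is emitted
# first. Cycles among precedents are assumed not to occur.

from dataclasses import dataclass, field

@dataclass
class Sent:
    text: str
    date: str                                     # publication date, "YYYY-MM-DD"
    precedents: set = field(default_factory=set)  # indices of presupposed sentences

def order(sents: list) -> list:
    emitted = []
    def emit(i):
        if i in emitted:
            return
        for p in sorted(sents[i].precedents):     # place presupposed info first
            emit(p)
        emitted.append(i)
    for i in sorted(range(len(sents)), key=lambda i: sents[i].date):
        emit(i)
    return emitted
```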

 Although several strategies for deciding a sentence ordering have been proposed in previous work, the appropriate way to combine these strategies to achieve more coherent summaries remains unsolved. This chapter therefore formalizes four criteria that capture the association between sentences. These criteria are integrated into a single criterion by a supervised learning approach. The chapter also proposes a bottom-up approach to arranging sentences, which repeatedly concatenates textual segments until it obtains an overall segment with all sentences arranged. Our experimental results showed a significant improvement over existing sentence ordering strategies.
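 A minimal sketch of the bottom-up concatenation, with the learned combination of the four criteria abstracted into a single assumed pairwise function `prefer(a, b)`:

```python
# Sketch of the bottom-up arrangement: repeatedly concatenate the pair of
# segments with the strongest ordering preference until a single segment
# containing all sentences remains. `prefer(a, b)` stands in for the learned
# combination of the four criteria (strength of "a should precede b").

def arrange(sentences: list, prefer) -> list:
    segments = [[s] for s in sentences]           # start from singleton segments
    while len(segments) > 1:
        best = None                               # (score, i, j)
        for i in range(len(segments)):
            for j in range(len(segments)):
                if i == j:
                    continue
                score = prefer(segments[i], segments[j])
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        merged = segments[i] + segments[j]
        segments = [s for k, s in enumerate(segments) if k not in (i, j)]
        segments.append(merged)
    return segments[0]
```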

 Chapter 6 addresses abbreviation recognition for MDS. Abbreviations result from a highly productive type of term variation that substitutes fully expanded terms (e.g., European Union) with shortened term forms (e.g., EU). Abbreviations hinder automatic text summarization because an acronym can be expressed in different forms across documents, e.g., European Union (EU); European Union; or EU. An MDS system should be aware of the associations between shortened term forms and fully expanded terms so that a summary uses a consistent term form when describing the same concept or entity.

 In practice, no generic rules or exact patterns have been established for abbreviation creation. Abbreviation recognition thus aims to extract pairs of short forms (acronyms or abbreviations) and long forms (their expanded forms or definitions) occurring in text. Except for a few studies, most work focuses on parenthetical expressions and locates the textual fragment containing an abbreviation definition by using a letter-matching algorithm. However, the letter-matching approach cannot deal with a long form whose short form is arranged in a different word order, e.g., water activity (AW). In addition, the letter-matching approach is not applicable to Japanese acronyms, which do not necessarily share the same letters between a short/long-form pair because of their foreign origin, e.g., Yakan Ri-chakuriku Kunren (NLP).
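 To make the limitation concrete, the following toy backward letter-matching check, in the spirit of common parenthetical-definition extractors (a simplification, not the thesis's algorithm), accepts European Union (EU) but rejects water activity (AW):

```python
# Toy backward letter matching: each letter of the short form must appear,
# in order, in the candidate long form, scanning from the right. Shuffled
# word orders break this assumption.

def letter_match(short: str, long: str) -> bool:
    pos = len(long)
    for ch in reversed(short.lower()):
        pos = long.lower().rfind(ch, 0, pos)
        if pos < 0:
            return False
    return True

print(letter_match("EU", "European Union"))  # True
print(letter_match("AW", "water activity"))  # False: letters out of order
```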

 This study assumes that a word sequence is a possible long form if it co-occurs frequently with a specific abbreviation but not with other surrounding words. If it satisfies a validation rule for being a long form, the word sequence is stored in the abbreviation dictionary. This approach detects the starting point of the long form without using letter matching. In order to validate a short/long-form pair in English, this study uses a refined letter-matching algorithm that can recognize shuffled abbreviations. For validating short/long-form pairs in Japanese, this study examines the number of paraphrase instances in the source documents. The proposed method outperformed the baseline systems, achieving 99% precision and 82-95% recall on the MEDLINE evaluation corpus.
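 Here is a rough sketch of the co-occurrence idea under simplified assumptions (toy counts and a single frequency threshold; the validation rules in the thesis are more elaborate):

```python
# Rough sketch of co-occurrence-based long-form detection: grow the candidate
# long form leftward from a parenthesized short form, keeping words that
# co-occur with the short form in a large fraction of its mentions.

from collections import Counter

def likely_long_form(contexts: list, max_len: int = 6,
                     threshold: float = 0.8) -> list:
    """contexts: for each mention of '(SHORT)', the words immediately
    preceding the parenthesis, listed right-to-left. Returns the detected
    long form, left-to-right."""
    n = len(contexts)
    long_form = []
    for k in range(max_len):
        words = Counter(c[k] for c in contexts if len(c) > k)
        word, freq = words.most_common(1)[0] if words else (None, 0)
        if freq / n < threshold:        # word does not co-occur reliably:
            break                       # starting point of the long form found
        long_form.append(word)
    return list(reversed(long_form))

# Toy mentions of "(NLP)": the three words before the parenthesis recur,
# earlier words vary, so the boundary is detected without letter matching.
ctxs = [["practice", "landing", "night", "conducted"],
        ["practice", "landing", "night", "the"],
        ["practice", "landing", "night", "begins"]]
print(likely_long_form(ctxs))  # -> ['night', 'landing', 'practice']
```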

 In conclusion, this thesis reports novel methods for sentence extraction, sentence ordering, and acronym recognition, and demonstrates their effectiveness with experimental evidence. Even though this thesis targets multi-document summarization as a problem exemplar, the outcomes of this study will contribute to various NLP applications.

Thesis Review Summary

This thesis, entitled "RESEARCH ON INFORMATION AGGREGATION AND INTEGRATION FOR MULTI-DOCUMENT SUMMARIZATION (複数文書自動要約における情報の集約と統合に関する研究)," is written in English and consists of seven chapters.

 Chapter 1, "Introduction," describes the background of this research: with the development of the WWW (Web) and related infrastructure, large volumes of electronic information are circulated and shared, and multi-document summarization technology therefore becomes important for making efficient use of that information.

 Chapter 2, "Automatic Text Summarization," reviews related text summarization technology. It first presents the basic ideas and points to consider, and distinguishes between indicative and informative summaries. For multi-document summarization, it notes that the types of inter-document relations must be taken into account and that redundancy must be eliminated when aggregating information, and it introduces several concrete systems.

 Chapter 3, "Text Summarization Challenge (TSC)," first describes the evaluation workshops held around the world to compare the performance of summarization technologies objectively and to promote their development. The multi-document summarization system of this research participated in, and was evaluated through, the multi-document summarization task of NTCIR (NII-NACSIS Test Collection for IR Systems), an evaluation workshop organized by the National Institute of Informatics (NII) in Japan; the chapter shows the overall architecture of the system developed for participation. The source texts for summarization are the 1998 and 1999 Mainichi and Yomiuri newspaper articles adopted in this evaluation corpus, with 5 to 19 articles per topic (11.7 on average). The main measures used to evaluate summarization performance in this contest are the precision of the sentences output by the system, the coverage of the sentences that a summary should contain, and quality questions (a questionnaire survey on the quality of the summaries).

 Chapter 4, "Sentence Extraction," first reviews related methods for important sentence extraction, a major component of summarization, and then describes the method devised in this research and used in the system. This research formulates important sentence extraction as an optimization problem of finding the combination of sentences that packs as many of the important information fragments contained in the multiple documents as possible into the summary, and develops an efficient method for it. Specifically, pairs of content words in a dependency relation serve as the basic information elements; their importance is computed, and the importance of each sentence is computed from these values. Sentences are then extracted in descending order of importance: after each extraction, the importance of the basic information elements contained in the extracted sentences is set to 0, the importance of each sentence is recomputed, and the extraction of the most important sentence is repeated until the permissible summary length is reached. The chapter shows that this extraction method, based on dependency-related content-word pairs, achieves better extraction performance than the widely used method based on individual content words (the bag-of-words method) and a method based on content-word pairs co-occurring within a sentence. The system based on this method took 3rd place among the 10 participating systems at the aforementioned NTCIR evaluation workshop, which included systems from companies investing considerable effort. The chapter also shows that the system excels in producing summaries containing little redundant information.

 Chapter 5, "Structuring Extracts," presents a new method for ordering the important sentences extracted from multiple documents. Previous work often used chronological ordering, which arranges the extracted sentences from the oldest publication date onward. This research first proposes an algorithm that improves chronological ordering by analyzing what information each sentence presupposed in its original article. It then proposes a method that integrates four factors (chronology, precedence, succession, and topical relevance) with a supervised machine learning approach, and constructs the overall sentence ordering in a manner similar to hierarchical clustering, based on the direction and strength of the ordering relation between pairs of sentences. As measures for evaluating sentence orderings, the chapter proposes a continuity metric in addition to Spearman's and Kendall's rank correlation coefficients, and shows that the continuity metric is superior for evaluating sentence orderings. Using these metrics, the proposed method is evaluated and shown quantitatively to be superior to chronological ordering.
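 For reference, here are minimal implementations of two kinds of ordering metrics mentioned above: Kendall's rank correlation, and a simple continuity measure counting preserved adjacencies (a common variant, not necessarily the exact definition used in the thesis):

```python
# Minimal versions of two ordering metrics. `pred` and `ref` are the same
# sentence identifiers as two permutations.

from itertools import combinations

def kendall_tau(pred: list, ref: list) -> float:
    """Kendall's rank correlation between a predicted and a reference order."""
    pos = {s: i for i, s in enumerate(pred)}
    concordant = sum(1 for a, b in combinations(ref, 2) if pos[a] < pos[b])
    pairs = len(ref) * (len(ref) - 1) / 2
    return 2 * concordant / pairs - 1

def continuity(pred: list, ref: list) -> float:
    """Fraction of adjacent reference pairs that stay adjacent in pred."""
    pos = {s: i for i, s in enumerate(pred)}
    kept = sum(1 for a, b in zip(ref, ref[1:]) if pos[b] - pos[a] == 1)
    return kept / (len(ref) - 1)

print(kendall_tau([2, 0, 1, 3], [0, 1, 2, 3]))  # ~0.33
print(continuity([2, 0, 1, 3], [0, 1, 2, 3]))   # 1/3: only (0, 1) preserved
```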

 Chapter 6, "Abbreviation," proposes a new approach to abbreviation recognition. As problems that abbreviations pose for summarization systems, the chapter notes that abbreviations are expressions frequently used in newspaper articles, and that the relation between short forms and fully expanded forms must be recognized so that expressions can be unified. As a method for acquiring pairs of English short forms and their expanded forms, it then proposes a new technique that extracts short/long-form pairs based on the co-occurrence strength between the expressions inside and outside parentheses, in contrast to previous methods that rely on string matching. Using MEDLINE, a database of medical literature, as the evaluation corpus, a precision-recall evaluation shows that the proposed method performs far better than prior work (99% precision and 85-95% recall).

 As a problem in applying the proposed method to Japanese newspaper articles, the chapter notes that the expressions inside and outside parentheses often do not stand in a short-form/long-form relation even when they co-occur strongly. It therefore newly proposes a technique that checks whether a paraphrase relation exists between the expressions inside and outside the parentheses, and reports that this achieves a performance of about 60% in F-measure.

 Chapter 7, "Conclusion," summarizes the results of this thesis.

 In summary, with respect to multi-document summarization, this thesis devises a technique for extracting important sentences using dependency-related content-word pairs as basic information elements, and a new method for deciding the order of the extracted sentences within a summary by considering, among other things, what information each sentence presupposed in its original article; it builds systems based on these methods and objectively evaluates their performance through evaluation contests and other means, demonstrating their superiority. For abbreviation recognition, it devises a new method that extracts the short and fully expanded forms of abbreviations based on the co-occurrence strength between the expressions inside and outside parentheses, and experimentally demonstrates its performance. Through these results, this thesis makes a substantial contribution to information and communication engineering.

 Therefore, this thesis is judged acceptable as a doctoral dissertation for the degree of Doctor of Information Science and Technology.

UTokyo Repository link: http://hdl.handle.net/2261/25843