Retrieval Evaluation

Modern Information Retrieval, Chapter 3, Ricardo Baeza-Yates and Berthier Ribeiro-Neto

圖書與資訊學刊, No. 29 (May 1999); master's thesis, Graduate Institute of Library and Information Science, National Taiwan University; 江玉婷, 陳光華

Outline

• Introduction
• Retrieval Performance Evaluation
    • Recall and precision
    • Alternative measures
• Reference Collections
    • TREC Collection
    • CACM & ISI Collection
    • CF Collection
• Trends and Research Issues

Introduction

Types of evaluation
• Functional analysis phase and error analysis phase
• Performance evaluation

Performance evaluation
• Response time and space required

Retrieval performance evaluation
• Evaluates how precise the answer set is

Retrieval Performance Evaluation

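Evaluation of IR systems that mainly process batch queries

For a given information request over a document collection:
• R: the set of relevant documents, of size |R|
• A: the answer set returned by the system, of size |A|
• Ra: the relevant documents in the answer set (the intersection of R and A), of size |Ra|

Recall = |Ra| / |R|

Precision = |Ra| / |A|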
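The two ratios above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `precision_recall` and the document identifiers in the toy data are hypothetical.

```python
def precision_recall(relevant: set, answer: set) -> tuple:
    """Return (precision, recall) for one query.

    relevant: the set R of relevant documents
    answer:   the answer set A returned by the system
    """
    ra = relevant & answer  # Ra: relevant documents in the answer set
    precision = len(ra) / len(answer) if answer else 0.0
    recall = len(ra) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy data (not from the slides)
relevant = {"d1", "d2", "d3", "d4"}
answer = {"d2", "d4", "d7"}
p, r = precision_recall(relevant, answer)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the three retrieved documents are relevant, and two of the four relevant documents are retrieved.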

Sorted by relevance

Precision versus recall curve

Rq={d3,d5,d9,d25,d39,d44,d56,d71,d89,d123}

Ranking for query q (* marks a relevant document):

 1. d123*
 2. d84
 3. d56*
 4. d6
 5. d8
 6. d9*
 7. d511
 8. d129
 9. d187
10. d25*
11. d38
12. d48
13. d250
14. d11
15. d3*

• P=100% at R=10%
• P= 66% at R=20%
• P= 50% at R=30%

Usually based on 11 standard recall levels: 0%, 10%, ..., 100%
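As a rough sketch, the recall and precision pairs for this example can be computed by walking down the ranking and recording a point each time a relevant document is seen. The function name `recall_precision_points` is an illustrative choice; the data is the Rq set and ranking of the example.

```python
def recall_precision_points(ranking, relevant):
    """Return (recall, precision) after each relevant document in the ranking."""
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d11", "d3"]

points = recall_precision_points(ranking, rq)
for recall, precision in points:
    print(f"P={precision:.0%} at R={recall:.0%}")
```

Only five of the ten relevant documents appear in the ranking, so the observed recall never exceeds 50% for this query.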

[Fig. 3.2: Precision versus recall curve for a single query]

Average Over Multiple Queries

$\overline{P}(r) = \sum_{i=1}^{N_q} \frac{P_i(r)}{N_q}$

where
• $\overline{P}(r)$ = average precision at recall level r
• $N_q$ = number of queries used
• $P_i(r)$ = precision at recall level r for the i-th query
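The averaging over queries can be sketched as follows, assuming each query already has a precision value at the same list of recall levels. The function name `average_precision_curve` and the two per-query curves are hypothetical, chosen only for illustration.

```python
def average_precision_curve(per_query_precision):
    """Average the precision of N_q queries at each recall level.

    per_query_precision: one list per query, all of the same length,
    holding P_i(r) at the same recall levels.
    """
    n_q = len(per_query_precision)
    if n_q == 0:
        raise ValueError("need at least one query")
    return [sum(values) / n_q for values in zip(*per_query_precision)]


# Hypothetical precision values at the 11 standard recall levels
query_1 = [1.0, 1.0, 0.8, 0.8, 0.6, 0.6, 0.4, 0.4, 0.2, 0.2, 0.2]
query_2 = [0.5, 0.5, 0.5, 0.5, 0.5, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25]

avg_curve = average_precision_curve([query_1, query_2])
for j, value in enumerate(avg_curve):
    print(f"R={j * 10:3d}%  average P={value:.3f}")
```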

Interpolated precision

Rq={d3,d56,d129}

Ranking for query q (* marks a relevant document):

 1. d123
 2. d84
 3. d56*
 4. d6
 5. d8
 6. d9
 7. d511
 8. d129*
 9. d187
10. d25
11. d38
12. d48
13. d250
14. d11
15. d3*

• P=33% at R=33%
• P= 25% at R=66%
• P= 20% at R=100%

Let $r_j$, $j \in \{0, 1, 2, \ldots, 10\}$, be a reference to the j-th standard recall level. The interpolated precision at that level is

$P(r_j) = \max_{r_j \le r \le r_{j+1}} P(r)$

For this example:
• R=30%: the maximum precision between $r_3$ and $r_4$ is 33%
• R=40%, 50%, 60%: the interpolated precision is 25%
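A minimal sketch of the interpolation for the Rq={d3,d56,d129} example follows. It assumes the common convention that the interpolated precision at a standard recall level is the highest precision observed at any recall at or above that level; this convention is an assumption of the sketch, chosen because it reproduces the 33% and 25% values quoted above. The function name `interpolated_precision` is illustrative.

```python
def interpolated_precision(ranking, relevant, levels=None):
    """Interpolated precision at the standard recall levels 0%, 10%, ..., 100%.

    Assumption: the value at level r_j is the maximum precision observed at
    any recall >= r_j (0.0 if no such point exists).
    """
    if levels is None:
        levels = [j / 10 for j in range(11)]
    points = []  # (recall, precision) after each relevant document
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]


rq = {"d3", "d56", "d129"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d11", "d3"]

interp = interpolated_precision(ranking, rq)
for j, value in enumerate(interp):
    print(f"R={j * 10:3d}%  P={value:.0%}")
```

Under this convention the levels 0% to 30% all take 33%, the levels 40% to 60% take 25%, and the remaining levels take 20%.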

Average recall vs. precision figure

Single Value Summaries

Average precision versus recall: compares retrieval algorithms over a set of example queries

Sometimes we need to compare performance on individual queries:
• Averaging precision over many queries may hide anomalous behavior in an algorithm
• We may need to know how two algorithms perform on one specific query

Need a single value summary
• The single value should be interpreted as a summary of the corresponding precision versus recall curve

Single Value Summaries

Average Precision at Seen Relevant Documents
• Average the precision figures obtained after each new relevant document is observed.
• Example: Figure 3.2, (1+0.66+0.5+0.4+0.3)/5=0.57
• This measure favors systems that retrieve relevant documents quickly (the earlier the relevant documents are ranked, the higher the precision)

R-Precision
• The precision at the R-th position in the ranking
• R: the total number of relevant documents of the current query (total number in Rq)
• Fig. 3.2: R=10, value=0.4
• Fig. 3.3: R=3, value=0.33
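Both single value summaries can be sketched as below, using the two example rankings (Rq with ten documents and Rq={d3,d56,d129}). The function names are illustrative. Note that with unrounded terms the first average comes out at 0.58; the 0.57 quoted above results from using the truncated values 0.66 and 0.3.

```python
def avg_precision_at_seen_relevant(ranking, relevant):
    """Mean of the precision values observed at each relevant document seen."""
    precisions = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def r_precision(ranking, relevant):
    """Precision at position R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranking[:r] if doc in relevant) / r


ranking_1 = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
             "d187", "d25", "d38", "d48", "d250", "d11", "d3"]
rq_1 = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
rq_2 = {"d3", "d56", "d129"}  # same ranking, second example

avg_1 = avg_precision_at_seen_relevant(ranking_1, rq_1)
rp_1 = r_precision(ranking_1, rq_1)
rp_2 = r_precision(ranking_1, rq_2)
print(f"{avg_1:.2f} {rp_1:.2f} {rp_2:.2f}")
```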

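Precision Histograms

Use R-precision measures to compare the retrieval history of two algorithms through visual inspection.

$RP_{A/B}(i) = RP_A(i) - RP_B(i)$

where $RP_A(i)$ and $RP_B(i)$ are the R-precision values of algorithms A and B for the i-th query.

[Figure: precision histogram; x-axis: query number (1 to 10); y-axis: R-precision A/B (-1.5 to 1.5)]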
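The per-query difference behind the histogram can be sketched as below. The R-precision values for algorithms A and B are hypothetical numbers made up for illustration, and `r_precision_differences` is an illustrative name.

```python
def r_precision_differences(rp_a, rp_b):
    """Return RP_A(i) - RP_B(i) for each query i."""
    if len(rp_a) != len(rp_b):
        raise ValueError("both algorithms must be run on the same queries")
    return [a - b for a, b in zip(rp_a, rp_b)]


# Hypothetical R-precision values for five queries
rp_a = [0.4, 0.5, 0.2, 0.9, 0.3]
rp_b = [0.3, 0.5, 0.6, 0.1, 0.3]

diffs = r_precision_differences(rp_a, rp_b)
for i, d in enumerate(diffs, start=1):
    better = "A" if d > 0 else "B" if d < 0 else "tie"
    print(f"query {i}: {d:+.2f} ({better})")
```

A positive value means algorithm A did better on that query, a negative value means B did better, and zero means both performed the same.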

Summary Table Statistics

Single value summaries for all queries can be collected in a table, for example:
• the number of queries
• the total number of documents retrieved by all queries
• the total number of relevant documents effectively retrieved when all queries are considered
• the total number of relevant documents that could have been retrieved by all queries
• …

Applicability of Precision and Recall
• Estimating the maximum recall requires background knowledge about all the documents in the collection.
• Recall and precision are related measures; they are best used in combination.
• Measures which quantify the informativeness of the retrieval process might now be more appropriate.
• Recall and precision are easy to define when a linear ordering of the retrieved documents is enforced.

Alternative Measures

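The Harmonic Mean F(j), with values in [0, 1]:

$F(j) = \frac{2}{\frac{1}{r(j)} + \frac{1}{P(j)}}$

The E Measure, which adds a user preference weight b:

$E(j) = 1 - \frac{1 + b^2}{\frac{b^2}{r(j)} + \frac{1}{P(j)}}$

where r(j) and P(j) are the recall and precision at the j-th document in the ranking.
• b = 1: E(j) = 1 - F(j), the complement of the harmonic mean
• b > 1: recall weighs more, since E(j) approaches 1 - r(j) as b grows
• b < 1: precision weighs more, since E(j) approaches 1 - P(j) as b approaches 0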
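As a small numerical check, the two measures can be sketched as below, assuming the forms F = 2 / (1/r + 1/P) and E = 1 - (1 + b^2) / (b^2/r + 1/P). The function names and the sample recall and precision values are illustrative. Under these forms, E with b = 1 equals 1 - F, a large b drives E toward 1 - r, and a small b drives E toward 1 - P.

```python
def harmonic_mean_f(recall, precision):
    """F = 2 / (1/r + 1/P); defined as 0 when either value is 0."""
    if recall == 0 or precision == 0:
        return 0.0
    return 2 / (1 / recall + 1 / precision)


def e_measure(recall, precision, b=1.0):
    """E = 1 - (1 + b^2) / (b^2/r + 1/P); defined as 1 when either value is 0."""
    if recall == 0 or precision == 0:
        return 1.0
    return 1 - (1 + b ** 2) / (b ** 2 / recall + 1 / precision)


# Hypothetical values at some position j in a ranking
r, p = 0.5, 0.25
f = harmonic_mean_f(r, p)
e_b1 = e_measure(r, p, b=1.0)
e_large_b = e_measure(r, p, b=10.0)  # close to 1 - r = 0.5
e_small_b = e_measure(r, p, b=0.1)   # close to 1 - p = 0.75
print(f, e_b1, e_large_b, e_small_b)
```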

User-Oriented Measure

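Assumption: relevance depends on the user; different users may have different sets of relevant documents for the same query.

• U: the relevant documents known to the user
• Rk: the relevant documents in the answer set that were known to the user
• Ru: the relevant documents in the answer set that were previously unknown to the user

Coverage = |Rk| / |U|

Novelty = |Ru| / (|Ru| + |Rk|)

The higher the coverage, the more of the documents the user expected to see are found by the system.

The higher the novelty, the more relevant documents previously unknown to the user are revealed by the system.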
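The two user-oriented ratios can be sketched as below, taking U as the relevant documents the user already knows, Rk as the retrieved relevant documents the user knew, and Ru as the retrieved relevant documents the user did not know. The function name and the toy sets are hypothetical.

```python
def coverage_and_novelty(answer, relevant, known):
    """Return (coverage, novelty) for one user and one query.

    answer:   documents returned by the system
    relevant: all documents relevant for this user
    known:    U, the relevant documents the user already knows
    """
    retrieved_relevant = answer & relevant
    rk = retrieved_relevant & known   # retrieved, relevant, known to the user
    ru = retrieved_relevant - known   # retrieved, relevant, new to the user
    coverage = len(rk) / len(known) if known else 0.0
    total = len(ru) + len(rk)
    novelty = len(ru) / total if total else 0.0
    return coverage, novelty


# Hypothetical toy data
relevant = {"d1", "d2", "d3", "d4", "d5", "d6"}
known = {"d1", "d2", "d3", "d4"}
answer = {"d1", "d2", "d5", "d9"}

coverage, novelty = coverage_and_novelty(answer, relevant, known)
print(f"coverage={coverage:.2f} novelty={novelty:.2f}")
```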

Reference Collection

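Reference test collections used to evaluate IR systems:
• TIPSTER/TREC: large, used for experimentation
• CACM, ISI: of historical importance
• Cystic Fibrosis: a small collection; relevant documents were determined through discussion among experts

Criticisms of IR systems:
• Lacks a solid formal framework as a basic foundation
    • There is no solution: whether a document is relevant to a query is highly subjective.
• Lacks robust and consistent testbeds and benchmarks
    • Early work relied on small, experimental test collections.
    • After 1990, TREC was established; it gathered tens of thousands of documents and made them available to research groups for evaluating IR systems.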

TREC (Text REtrieval Conference)

Initiated under the National Institute of Standards and Technology (NIST)

Goals:
• Providing a large test collection
• Uniform scoring procedures
• Forum

7th TREC conference in 1998:
• Document collection: test collections, example information requests (topics), relevant docs
• The benchmark tasks

The Documents Collection

Documents are marked up in SGML:

<doc>
<docno>WSJ880406-0090</docno>
<hl>AT&T Unveils Services to Upgrade Phone Networks Under Global Plan</hl>
<author>Janet Guyon (WSJ Staff)</author>
<dateline>New York</dateline>
<text>
American Telephone & Telegraph Co. introduced the first of a new generation of phone service with broad…
</text>
</doc>


TREC-1 to TREC-6 Documents

Disk  Contents         Size (Mb)  Number of docs  Words/doc (median)  Words/doc (mean)
1     WSJ, 1987-1989   267        98,732          245                 434
1     AP, 1989         254        84,678          446                 473.9
1     ZIFF             242        75,180          200                 473
1     FR, 1989         260        25,960          391                 1315.9
1     DOE              184        226,087         111                 120.4
2     WSJ, 1990-1992   242        74,520          301                 508.4
2     AP, 1988         237        79,919          438                 468.7
2     ZIFF             175        56,920          182                 451.9
2     FR, 1988         209        19,860          396                 1378.1
3     SJMN, 1991       287        90,257          379                 453
3     AP, 1990         237        78,321          451                 478.4
3     ZIFF             345        161,021         122                 295.4
3     PAT, 1993        243        6,711           4,445               5391
4     FT, 1991-1994    564        210,158         316                 412.7
4     FR, 1994         395        55,630          588                 644.7
4     CR, 1993         235        27,922          288                 1373.5
5     FBIS             470        130,471         322                 543.6
5     LAT              475        131,896         351                 526.5
6     FBIS             490        120,653         348                 581.3

The Example Information Requests (Topics)

Information needs are described in natural language. Topic number: assigned to topics of different types

<top>

<num> Number:168

<title>Topic:Financing AMTRAK

<desc>Description:

…..

<nar>Narrative:A …..

</top>

TREC: Topics

Word counts per topic field (stopwords included)

Topics              Field        Min  Max  Mean
TREC-1 (51-100)     Total        44   250  107.4
                    Title        1    11   3.8
                    Description  5    41   17.9
                    Narrative    23   209  64.5
                    Concepts     4    111  21.2
TREC-2 (101-150)    Total        54   231  130.8
                    Title        2    9    4.9
                    …
                    Concepts     3    88   28.5
TREC-3 (151-200)    Total        49   180  103.4
                    Title        2    20   6.5
                    …
TREC-4 (201-250)    Total        8    33   16.3
                    Description  8    33   16.3
TREC-5 (251-300)    Total        29   213  82.7
                    Title        2    10   3.8
                    …
TREC-6 (301-350)    Total        47   156  88.4
                    Title        1    5    2.7
                    …

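Topic structure and length
Topic construction
Topic selection

TREC-6 topic selection procedure:
1. Pre-search: enter keywords in the PRISE system and run the search to estimate the number of relevant documents.
2. Check how many of the top 25 documents are relevant:
   • 0: the topic is rejected
   • 1-5: continue reading retrieved documents 26-100 and judge their relevance
   • 6-20: use relevance feedback or similar techniques to enter more queries, run the search again, and judge the relevance of the top 100 documents
   • ≧ 20: the topic is rejected
3. Record the number of relevant documents.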

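TREC: Relevance Judgments

Judgment method
• Pooling method
• Manual assessment

Judgment criterion: binary (relevant or not relevant)

Quality of relevance judgments
• Completeness
• Consistency

Pooling method
• For each topic, the top n (=100) documents are taken from the results returned by each participating system and merged into a pool.
• The pool is treated as the candidate set of possibly relevant documents for the topic. After duplicate documents are removed, it is sent back to the original author of the topic for relevance judgment.
• The idea is to use many different systems and retrieval techniques to capture as many potentially relevant documents as possible, and in this way reduce the load of manual judgment.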
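The pooling step can be sketched as below: take the top n documents from each system's ranking for a topic and merge them into a duplicate-free set. The function name `build_pool`, the system names, and the rankings are hypothetical, and the toy example uses n=2 instead of 100 to stay small.

```python
def build_pool(runs, n=100):
    """Merge the top-n documents of every system's ranking into one pool.

    runs: mapping from system name to its ranked list of document ids
    """
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:n])  # the set removes duplicates across systems
    return pool


# Hypothetical rankings returned by two systems for one topic
runs = {
    "system_a": ["d1", "d2", "d3"],
    "system_b": ["d2", "d4", "d1"],
}

pool = build_pool(runs, n=2)
submitted = sum(len(ranking[:2]) for ranking in runs.values())
print(f"{submitted} documents submitted, {len(pool)} unique in the pool")
```

Only the documents in the pool are then judged by hand, which is why the gap between submitted and unique documents matters for the assessors' workload.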

TREC candidate pools versus actual relevant documents

Ad hoc
          Documents submitted   Documents in the pool     Actual relevant
          to the pool           (duplicates removed)      documents
TREC-1    8800                  1279 (39%)                277 (22%)
TREC-2    4000                  1106 (28%)                210 (19%)
TREC-3    2700                  1005 (37%)                146 (15%)
TREC-4    7300                  1711 (24%)                130 (08%)
TREC-5    10100                 2671 (27%)                110 (04%)
TREC-6    8480                  1445 (42%)                92 (6.4%)

Routing
          Documents submitted   Documents in the pool     Actual relevant
          to the pool           (duplicates removed)      documents
TREC-1    2200                  1067 (49%)                371 (35%)
TREC-2    4000                  1466 (37%)                210 (14%)
TREC-3    2300                  703 (31%)                 146 (21%)
TREC-4    3800                  957 (25%)                 132 (14%)
TREC-5    3100                  955 (31%)                 113 (12%)
TREC-6    4400                  1306 (30%)                140 (11%)

The (Benchmark) Tasks at the TREC Conferences

Ad hoc task:
• Receive new requests and execute them on a pre-specified document collection

Routing task:
• Receive test information requests and two document collections
• First collection: training and tuning the retrieval algorithm
• Second collection: testing the tuned retrieval algorithm

Other tasks:
• *Chinese
• Filtering
• Interactive
• *NLP (natural language processing)
• Cross languages
• High precision
• Spoken document retrieval
• Query task (TREC-7)

TREC: Evaluation Tasks/Tracks (TREC-1 to TREC-7)

Main tasks
• Routing
• Ad hoc

Tracks
• Confusion
• Spoken Document Retrieval
• Database Merging
• Filtering
• High Precision
• Interactive
• Multilingual: Spanish, Chinese, Cross Language
• Natural Language Processing
• Query
• Very Large Corpus

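TREC: Criticisms and Negative Assessments

Test collection
• Topics
    • They are not real user needs and are too artificial
    • They lack a description of the context of the need
• Relevance judgments
    • Binary relevance judgments are unrealistic
    • The pooling method misses relevant documents, which makes recall inaccurate
    • Quality and consistency

Effectiveness measurement
• Focuses only on quantitative measures
• Problems with recall
• Suitable for comparing systems, but not for evaluating them

Evaluation procedure
• Interactive retrieval
    • Lacks user involvement
    • Static information needs are unrealistic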

Evaluation Measures at the TREC Conferences

• Summary table statistics
• Recall-precision
• Document level averages*
• Average precision histogram

The CACM Collection

A small collection of computer science literature

• Text of the documents
• Structured subfields:
    • word stems from the title and abstract sections
    • categories
    • direct references between articles: a list of pairs of documents [da, db]
    • bibliographic coupling connections: a list of triples [d1, d2, ncited]
    • number of co-citations for each pair of articles: [d1, d2, nciting]

A unique environment for testing retrieval algorithms which are based on information derived from cross-citing patterns

The ISI Collection

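The ISI test collection was assembled from earlier work by Small at ISI (Institute of Scientific Information).

Most of its documents were selected from the cross-citation study in Small's original project.

It supports research on similarity based on both terms and cross-citation patterns.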

The Cystic Fibrosis Collection

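Documents about cystic fibrosis

Topics and relevant documents were produced by experts with clinical or research experience in the field

Relevance scores
• 0: non-relevance
• 1: marginal relevance
• 2: high relevance

Characteristics of the CF collection
• All relevance scores were assigned by experts
• Good number of information requests (relative to the collection size)
• The respective query vectors present overlap among themselves
• Previous queries can be used to improve retrieval efficiency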

Trends and Research Issues

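Interactive user interface
• It is generally believed that retrieval with feedback improves performance
• How should the evaluation measures be chosen in this setting?

Research on evaluation measures other than precision and recall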
