Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval


DESCRIPTION

Reliable evaluation of Information Retrieval systems requires large amounts of relevance judgments. Producing these annotations is complex and tedious for many Music Information Retrieval tasks, so thorough evaluations often require too much effort. A low-cost alternative is the application of Minimal Test Collection algorithms, which offer quite reliable results while significantly reducing the annotation effort. The idea is to incrementally select which documents to judge so that we can compute estimates of the effectiveness differences between systems with a certain degree of confidence. In this paper we show a first approach towards their application to the evaluation of the Audio Music Similarity and Retrieval task, run by the annual MIREX evaluation campaign. An analysis with the MIREX 2011 data shows that the judging effort can be reduced to about 35% to obtain results with 95% confidence.

Transcript

AdMIRe 2012 · Lyon, France · April 17th

(Picture by ERdi43, Wikipedia)

Towards Minimal Test Collections for Evaluation of

Audio Music Similarity and Retrieval

@julian_urbano University Carlos III of Madrid

@m_schedl Johannes Kepler University

Problem

evaluation of IR systems is costly

Annotations

time-consuming, expensive and boring

(Bad) Consequence

small and biased test collections, unlikely to change from year to year

Solution

apply low-cost evaluation methodologies

[Timeline of evaluation campaigns, 1960-2011]

ISMIR (2000-today)

MIREX (2005-today)

TREC (1992-today)

CLEF (2000-today)

NTCIR (1999-today)

Cranfield 2 (1962-1966)

MEDLARS (1966-1967)

SMART (1961-1995)

nearly 2 decades of Meta-Evaluation in Text IR

a lot of things have happened here!

some good practices inherited from here

Minimal Test Collections (MTC) [Carterette et al.]

estimate the ranking of systems with very few judgments (high incompleteness)

Application in Audio Music Similarity (AMS)

dozens of volunteers required by MIREX every year to make thousands of judgments

Year  Teams  Systems  Queries  Results  Judgments  Overlap
2006      5        6       60    1,800      1,629      10%
2007      8       12      100    6,000      4,832      19%
2009      9       15      100    7,500      6,732      10%
2010      5        8      100    4,000      2,737      32%
2011     10       18      100    9,000      6,322      30%

evaluation with incomplete judgments

Basic Idea

treat similarity scores as random variables that can be estimated with uncertainty

gain of an arbitrary document: $G_i \sim \mathrm{multinomial}$

$$E[G_i] = \sum_{l \in \mathcal{L}} P(G_i = l) \cdot l$$

$$\mathcal{L}_{BROAD} = \{0, 1, 2\} \qquad \mathcal{L}_{FINE} = \{0, 1, \ldots, 100\}$$

whenever document i is judged with label l:

$$E[G_i] = l \qquad Var[G_i] = 0$$

*all variance formulas are given in the paper
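To make this concrete, here is a minimal sketch (not from the paper) of how the expected gain and its variance can be computed; the uniform prior over the BROAD scale is an illustrative assumption:

```python
# Sketch: moments of the gain G_i of a document over a similarity scale.
# The uniform prior is an illustrative assumption, not the paper's model.

BROAD = [0, 1, 2]  # L_BROAD; L_FINE would be list(range(101))

def gain_moments(probs, labels=BROAD):
    """Return E[G_i] and Var[G_i] for a distribution over labels."""
    e = sum(p * l for p, l in zip(probs, labels))
    var = sum(p * (l - e) ** 2 for p, l in zip(probs, labels))
    return e, var

print(gain_moments([1/3, 1/3, 1/3]))  # unjudged, uniform: E = 1.0, Var = 2/3
print(gain_moments([0, 0, 1]))        # judged with l = 2: E = 2.0, Var = 0.0
```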

AG@k is also treated as a random variable

$$E[AG@k] = \frac{1}{k} \sum_{i \in \mathcal{D}} E[G_i] \cdot I(A_i \le k)$$

the sum iterates over all documents (in practice, only the top k retrieved); $A_i$ is the rank at which system A retrieved document i
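A minimal sketch of this estimate for one query; the function and data names are illustrative:

```python
# Sketch: E[AG@k] for one query. Only documents ranked within the top k
# contribute; judged documents carry their true gain, unjudged ones E[G_i].

def expected_ag_at_k(ranking, expected_gain, k=5):
    """ranking: doc ids in the order retrieved by the system.
    expected_gain: doc id -> E[G_i] (the true gain once judged)."""
    return sum(expected_gain[d] for d in ranking[:k]) / k

# Hypothetical query: d1 and d2 judged (gains 2 and 1), the rest unjudged
# with uniform-prior expectation 1.0 on the BROAD scale.
e_gain = {"d1": 2, "d2": 1, "d3": 1.0, "d4": 1.0, "d5": 1.0}
print(expected_ag_at_k(["d1", "d2", "d3", "d4", "d5"], e_gain))  # 1.2
```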

Ultimate Goal

compute a good estimate with the least effort

Comparing Two Systems

$$E[\Delta AG@k] = \frac{1}{k} \sum_{i \in \mathcal{D}} E[G_i] \cdot \left( I(A_i \le k) - I(B_i \le k) \right)$$

what really matters is the sign of the difference

Evaluating Several Queries

$$E[\Delta AG@k] = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} E[\Delta AG@k_q]$$

the sum iterates over all queries
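Putting the last two formulas together, a sketch (with illustrative names) of E[ΔAG@k] for one query and its average over queries:

```python
# Sketch: E[ΔAG@k] between systems A and B. A document in both top-k lists
# contributes E[G_i] * (1 - 1) = 0, so it cancels out of the estimate.

def expected_delta_ag(top_a, top_b, expected_gain, k=5):
    """top_a, top_b: doc ids retrieved by A and B for one query."""
    a, b = set(top_a[:k]), set(top_b[:k])
    return sum(expected_gain[d] * ((d in a) - (d in b)) for d in a | b) / k

def expected_delta_over_queries(queries, k=5):
    """queries: one (top_a, top_b, expected_gain) tuple per query."""
    deltas = [expected_delta_ag(ta, tb, eg, k) for ta, tb, eg in queries]
    return sum(deltas) / len(deltas)
```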

The Rationale

if $\alpha < P(\Delta AG@k \le 0) < 1 - \alpha$ then judge another document, else stop judging

Distribution of AG@k

$$P(AG@k = z) := \sum_{\gamma_k \in \Gamma_k} P(AG@k = z \mid \gamma_k) \cdot P(\gamma_k)$$

what are the possible assignments of similarity?

iterate all possible permutations of k similarity assignments

ultimately depends on the distribution of Gi

Plain English

the proportion of similarity assignments such that AG@k = z

For Complex Measures or Large Similarity Scales

run Monte Carlo simulation
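For instance, a sketch of such a simulation with Python's standard library; the uniform FINE-scale prior is an illustrative assumption:

```python
# Sketch: Monte Carlo approximation of the distribution of AG@k when exact
# enumeration over all similarity assignments is infeasible (e.g. FINE scale).

import random

def sample_ag_at_k(labels, weights, k=5, trials=100_000):
    """Draw gains for the k top documents and record the resulting AG@k."""
    return [sum(random.choices(labels, weights=weights, k=k)) / k
            for _ in range(trials)]

# Hypothetical: uniform FINE scale {0..100} for all 5 unjudged top documents.
samples = sample_ag_at_k(list(range(101)), [1] * 101)
print(sum(s <= 40 for s in samples) / len(samples))  # estimate of P(AG@5 <= 40)
```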

Actually, AG@k is a Special Case

let G be the similarity of the top k for all queries

1. take a sample of k documents. Mean = $X_1$

2. take a sample of k documents. Mean = $X_2$

...

Q. take a sample of k documents. Mean = $X_Q$

Mean of sample means = $\bar{X}$

Central Limit Theorem

each $X_q$ is the AG@k of a single query; $\bar{X}$ is the mean AG@k over all queries

regardless of the distribution of G, as Q → ∞, $\bar{X}$ approaches a normal distribution

AG@k is Normally Distributed

use the normal cumulative distribution function Φ

$$P(\Delta AG@k \le 0) = \Phi\left( \frac{-E[\Delta AG@k]}{\sqrt{Var[\Delta AG@k]}} \right)$$
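A sketch of this computation (assuming SciPy is available); the numbers are made up for illustration:

```python
# Sketch: confidence on the sign of ΔAG@k from its estimated mean and
# variance, using the normal CDF.

from math import sqrt
from scipy.stats import norm

def prob_delta_leq_zero(e_delta, var_delta):
    return norm.cdf(-e_delta / sqrt(var_delta))

# Hypothetical estimates: E[ΔAG@k] = 0.05, Var[ΔAG@k] = 0.001.
print(prob_delta_leq_zero(0.05, 0.001))  # ≈ 0.057, i.e. ≈ 94% confidence that A > B
```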

[Figure: density of AG@5 under the BROAD scale (0 to 2) and under the FINE scale (0 to 100).]

Confidence as a Function of # Judgments

[Figure: confidence in the ranking of systems (50% to 100%) as a function of the percent of judgments made (0% to 100%).]

what documents should we judge? those that maximize the confidence

at some point we can stop judging; or keep judging to be really confident; or waste our time

The Trick

documents retrieved by both systems are useless: there is no need to judge them

whatever Gi is, it is added and then subtracted

Comparing Several Systems

compute a weight $w_i$ for each query-document pair and judge the document with the largest effect

wi in the Original MTC

$w_i$ = the largest weight across all system pairs, which reduces to the number of system pairs affected by query-document i

wi Dependent on Confidence

if we are highly confident about a pair of systems, we do not need to judge another of their documents, even if it has the largest weight

$$w_i = \sum_{(A,B) \in \mathcal{S} - \mathcal{R}} (1 - C_{A,B}) \cdot \left( I(A_i \le k) - I(B_i \le k) \right)^2$$

iterate only the system pairs still ranked with low confidence; the weight is inversely proportional to the confidence

better results than the traditional weights
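A sketch of this confidence-dependent weight; the data structures are illustrative:

```python
# Sketch: confidence-weighted document weight w_i. Only system pairs not yet
# confidently ranked contribute, each inversely to its current confidence.

def doc_weight(doc, unranked_pairs, confidence, in_top_k):
    """unranked_pairs: (A, B) pairs still below the confidence target (S - R).
    confidence[(A, B)]: current confidence on that pair's comparison.
    in_top_k[system][doc]: True if doc is in that system's top k."""
    w = 0.0
    for a, b in unranked_pairs:
        diff = int(in_top_k[a][doc]) - int(in_top_k[b][doc])
        w += (1.0 - confidence[(a, b)]) * diff ** 2
    return w
```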

MTC for AMS

with AG@k

MTC for ΔAG@k

while $\frac{1}{|\mathcal{S}|} \sum_{(A,B) \in \mathcal{S}} C_{A,B} \le 1 - \alpha$ do    (average confidence on the ranking)

  $i^* \leftarrow \arg\max_i w_i$ over all unjudged query-documents    (select the best document)

  judge query-document $i^*$ and obtain its true $gain_{i^*}$

  $E[G_{i^*}] \leftarrow gain_{i^*}$, $Var[G_{i^*}] \leftarrow 0$    (update: confidence increases)

end while
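In code, the loop might look like the following sketch; all helper functions are assumptions standing in for the estimates described above, not a real API:

```python
# Sketch of the MTC loop: judge the highest-weight query-document until the
# average confidence over all system pairs exceeds 1 - alpha.
# avg_confidence, weight, ask_annotator, set_judgment and update_confidences
# are hypothetical helpers.

def mtc_loop(unjudged, pairs, alpha=0.05):
    while avg_confidence(pairs) <= 1 - alpha:       # confidence on the ranking
        i_star = max(unjudged, key=weight)          # select the best document
        gain = ask_annotator(i_star)                # obtain the true gain
        set_judgment(i_star, e_gain=gain, var=0.0)  # E[G] <- gain, Var[G] <- 0
        unjudged.remove(i_star)
        update_confidences(pairs)                   # confidence increases
```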

MTC in MIREX AMS 2011

Why MIREX 2011

largest edition so far: 18 systems (153 pairwise comparisons), 100 queries and 6,322 judgments

Distribution of Gi

let us work with a uniform distribution for now

Confidence as Judgments are Made

correct bins: estimated sign is correct or not significant anyway


high confidence with considerably less effort

Accuracy as Judgments are Made

estimated bins always better than expected

estimated signs highly correlated with confidence

rankings with tau = 0.9 traditionally considered equivalent (same as 95% accuracy)

high confidence and high accuracy with considerably less effort

Statistical Significance

MTC allows us to accurately estimate the ranking, but only for the current set of queries

can we generalize beyond this particular set of queries?

Not Trivial

we have the variance of the estimates but not the sample variance

Work with Upper and Lower Bounds of ΔAG@k

Upper bound: best case for A. Lower bound: best case for B.

$$\Delta AG@k^{+} = \frac{1}{k} \sum_{i \in \pi} G_i \cdot \left( I(A_i \le k) - I(B_i \le k) \right) + \frac{1}{k} \sum_{i \in \bar{\pi}} l^{+} \cdot I(A_i \le k) - \frac{1}{k} \sum_{i \in \bar{\pi}} l^{-} \cdot I(B_i \le k \wedge A_i > k)$$

the first sum runs over the known judgments; the second assumes the best similarity score $l^+$ for the unknown judgments retrieved by A; the third assumes the worst similarity score $l^-$ for the unknown judgments retrieved by B but not by A

*same for the lower bound (the best case for B)

3 Rules

1. Assume the best case for A (upper bound): if even then A <<< B, conclude A <<< B.

2. Assume the best case for B (lower bound): if even then B <<< A, conclude B <<< A.

3. If in the best case for A we do not have A >>> B, and in the best case for B we do not have B >>> A, conclude they are not significantly different.

Problem: upper and lower bounds are very unrealistic

Incorporate a Heuristic

4. If the estimated difference is larger than a threshold t, naively conclude significance.
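A sketch of how rules 1 to 4 combine for a single pair of systems; the threshold t comes from the power analysis below, and all names are illustrative:

```python
# Sketch: significance decision for a pair (A, B). 'upper' is ΔAG@k in A's
# best case, 'lower' in B's best case, 'e_delta' the estimated difference.

def significance(upper, lower, e_delta, t):
    if upper < 0:          # rule 1: even in A's best case, A loses
        return "B significantly better"
    if lower > 0:          # rule 2: even in B's best case, A wins
        return "A significantly better"
    if abs(e_delta) > t:   # rule 4: heuristic on the estimated difference
        return "naively significant"
    return "not significantly different"  # rule 3
```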

Choose t Based on Power Analysis

t = effect size detectable by a t-test with:

• sample variance σ² = 0.0615 (from previous MIREX editions)

• sample size n = 100

• Type I Error rate α = 0.05 (typical value)

• Type II Error rate β = 0.15 (typical value)

t ≈ 0.067
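These numbers can be reproduced with a normal approximation to a one-sided paired t-test, a sketch of which follows (SciPy assumed; the paper's exact test may differ):

```python
# Sketch: smallest detectable difference t by power analysis, using the
# normal approximation to a one-sided paired t-test.

from math import sqrt
from scipy.stats import norm

def detectable_difference(var, n, alpha=0.05, beta=0.15):
    return (norm.ppf(1 - alpha) + norm.ppf(1 - beta)) * sqrt(var / n)

print(detectable_difference(0.0615, 100))  # ≈ 0.067
```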

Accuracy of the Significance Estimates

rule 4 (heuristic) ends up overestimating significance

pretty good around 95% confidence


rules 1 to 3 begin to apply and correct overestimations


closer to expected, never under 90%

significance can be estimated fairly well too

what we did

Introduce MTC to the MIR folks

Work out the Math for MTC with AG@k

See How Well it would have Done in AMS 2011: quite well, actually!

what now

Learn the True Distribution of Similarity Judgments: it is clearly not uniform, and a better model would give more accurate estimates with less effort (use previous AMS data, or fit a model as we judge)

Significance Testing with Incomplete Judgments: the best-case scenarios are very unrealistic

Study Low-Cost Methodologies for other MIR Tasks

what for

MTC Greatly Reduces the Effort for AMS (and SMS)

have MIREX volunteers incrementally create brand new test collections for other tasks

Better Yet

study low-cost methodologies for the other tasks

Not Only for MIREX

private collections for in-house evaluations, with no possibility of gathering large pools of annotators

low-cost becomes paramount

the MIR community needs a paradigm shift

from a priori to a posteriori evaluation methods

to reduce cost and gain reliability
