Design and Implementation of Relevance Assessments Using Crowdsourcing
Omar Alonso (Microsoft Corp., Mountain View, California, US) and Ricardo Baeza-Yates (Yahoo! Research, Barcelona, Spain)
European Conference on Information Retrieval, 2011
Jan 20, 2015
Outline of presentation
• What is crowdsourcing?
  – Amazon Mechanical Turk (AMT)
• Overview of the paper
• Details of AMT experimental design
• Results
• Recommendations
Crowdsourcing
The term "crowdsourcing" is a portmanteau of "crowd" and "outsourcing," first coined by Jeff Howe in a June 2006 Wired
magazine article "The Rise of Crowdsourcing".
"Crowdsourcing is the act of taking a job traditionally
performed by a designated agent (usually an employee) and
outsourcing it to an undefined, generally large group of people
in the form of an open call." - Jeff Howe
Mechanical Turk
The original Mechanical Turk was a fake chess-playing machine constructed in 1770
Amazon Mechanical Turk Workflow
• Design and build HITs
• Put HITs on Amazon Mechanical Turk
• Collect evaluation results from MTurk Workers
• Approve evaluation results according to designed approval rules
• Pay Workers whose inputs have been approved
• Result: collected binary relevance assessments for TREC-8
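The approve-and-pay step depends on approval rules the requester defines. The paper gives no code for this, but a minimal majority-vote sketch (hypothetical helper names, not the authors' implementation) might look like:

```python
from collections import Counter

def majority_label(judgments):
    """Aggregate one document's Worker judgments by majority vote.
    judgments: list of (worker_id, label) pairs."""
    counts = Counter(label for _, label in judgments)
    return counts.most_common(1)[0][0]

def split_by_agreement(judgments):
    """Approve Workers who agreed with the majority; flag the rest
    for manual review rather than auto-rejecting them."""
    consensus = majority_label(judgments)
    approved = [w for w, lab in judgments if lab == consensus]
    flagged = [w for w, lab in judgments if lab != consensus]
    return approved, flagged

# Five Workers judging one document (1 = relevant, 0 = not relevant)
hit = [("w1", 1), ("w2", 1), ("w3", 0), ("w4", 1), ("w5", 1)]
```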
Objective of the paper
To introduce a methodology for crowdsourcing binary relevance assessments using Amazon Mechanical Turk
Methodology
• Data preparation
  – Document collection, topics (queries), documents per topic, number of people that will evaluate one HIT
• Interface design
  – Most important part of AMT experiment design
  – Keep HITs simple
  – Instructions should be clear, concise, specific, free from jargon, and easy to read
  – Include examples in HITs
  – Use UI elements to specify formatting
  – Don't ask for "all" or "every"
  – Explain what will not be accepted, to avoid conflicts later on
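As an illustration of the data-preparation step, the batch size and cost fall out of the parameters directly. A sketch (hypothetical function; the numbers mirror the setup reported later in this deck):

```python
def plan_batches(n_topics=50, docs_per_topic=10, workers_per_hit=5,
                 pay_per_assignment=0.04):
    """Size an assessment batch and estimate its cost.
    Defaults mirror this paper's setup ($0.02 judgment + $0.02 comment)."""
    hits = n_topics * docs_per_topic        # one HIT per topic-document pair
    assignments = hits * workers_per_hit    # redundant judgments per HIT
    return hits, assignments, assignments * pay_per_assignment

hits, assignments, cost = plan_batches()  # 500 HITs, 2500 assignments, $100
```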
Methodology
• Filtering the workers
  – Approval rate: provided by AMT
  – Qualification test: better quality filter, but involves more development cycles
  – Honey pots: interleave questions with known answers to check for spamming
• Scheduling the tasks
  – Split tasks into small chunks: helps avoid Worker fatigue
  – Submit shorter tasks first
  – Incorporate any implicit or explicit feedback into the experimental design
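A honey-pot filter can be sketched as scoring each Worker by the fraction of interleaved known-answer questions they miss (hypothetical names; not from the paper):

```python
def honeypot_miss_rate(worker_answers, honeypots):
    """Fraction of interleaved known-answer questions a Worker got wrong.
    worker_answers: {question_id: label}; honeypots: {question_id: gold_label}."""
    misses = sum(1 for qid, gold in honeypots.items()
                 if worker_answers.get(qid) != gold)
    return misses / len(honeypots)

# A Worker who fails one of two hidden checks is a candidate spammer
answers = {"q1": 1, "q2": 0, "hp1": 1, "hp2": 1}
gold = {"hp1": 1, "hp2": 0}
```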
Experimental Setup
• TREC-8: LA Times and FBIS sub-collections
• 50 topics (queries)
• 10 documents per query
• 5 Workers per HIT
• Budget = $100
  – $0.02 for a binary assessment + $0.02 for a comment/feedback
• Agreement between raters is measured using Cohen's kappa (κ)
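Cohen's kappa can be computed directly from two raters' judgment lists: observed agreement minus chance agreement, normalized. A minimal sketch (illustrative data, not the paper's):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[l] * freq_b[l]
                   for l in set(freq_a) | set(freq_b)) / n ** 2
    return (observed - expected) / (1 - expected)

# Illustrative judgments from two Workers over 10 documents (1 = relevant)
a = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
kappa = cohens_kappa(a, b)  # ≈ 0.58, moderate agreement
```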
Effect of Highlighting
• Two UIs
  – One with query terms highlighted
  – Other with no highlighting of query terms
• With a couple of exceptions, highlighting contributed to higher relevance scores compared to the plain UI
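Highlighting query terms in a document snippet can be done with a simple regex substitution; a sketch of the idea (hypothetical, not the authors' implementation):

```python
import re

def highlight(snippet, query_terms):
    """Wrap whole-word query term matches in <b> tags, case-insensitively."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, query_terms)) + r")\b",
                         re.IGNORECASE)
    return pattern.sub(r"<b>\1</b>", snippet)

marked = highlight("Relevance of the document to the query",
                   ["relevance", "query"])
```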
Experiment with comments
• E1–E3 had optional comments
• From experiment E4 onwards, comments were made mandatory
• Re-launched E5 to see the effect of a bonus on the length and quality of comments
Recommendations
• Iterative approach in designing UI: ability to incorporate feedback
• Split tasks into small chunks and submit smaller tasks first
• Provide detailed feedback for rejected HITs to build Worker trust through word of mouth
• Look out for very fast work, as it might come from a robot
• Bonus payments can help generate better comments
Questions?
Thank you!