Design and Implementation of Relevance Assessments Using Crowdsourcing
Omar Alonso (Microsoft Corp., Mountain View, California, US) and Ricardo Baeza-Yates (Yahoo! Research, Barcelona, Spain)
European Conference on Information Retrieval, 2011
Transcript
Page 1

Design and Implementation of Relevance Assessments Using

Crowdsourcing

Omar Alonso¹ and Ricardo Baeza-Yates²
¹ Microsoft Corp., Mountain View, California, US
² Yahoo! Research, Barcelona, Spain

European Conference on Information Retrieval, 2011

Page 2

Outline of presentation

• What is Crowdsourcing?
  – Amazon Mechanical Turk (AMT)
• Overview of the paper
• Details of the AMT experimental design
• Results
• Recommendations

Page 3

Crowdsourcing

The term "crowdsourcing" is a portmanteau of "crowd" and "outsourcing," first coined by Jeff Howe in a June 2006 Wired

magazine article "The Rise of Crowdsourcing".

"Crowdsourcing is the act of taking a job traditionally

performed by a designated agent (usually an employee) and

outsourcing it to an undefined, generally large group of people

in the form of an open call." - Jeff Howe

Page 5

Mechanical Turk

The original Mechanical Turk was a fake chess-playing machine constructed in 1770.

Page 7

Amazon Mechanical Turk Workflow

• Design and build HITs
• Put the HITs on Amazon Mechanical Turk
• Collect evaluation results from the MTurk Workers
• Approve evaluation results according to the designed approval rules
• Pay Workers whose inputs have been approved
• Output: collected binary relevance assessments of TREC-8
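In code, this workflow maps onto a handful of requester API calls. Below is a minimal sketch in Python using today's boto3 MTurk client (which postdates the 2011 paper; the original experiments used the earlier MTurk API). The reward split, durations, and the is_acceptable callback are illustrative assumptions rather than the authors' actual setup, and question_xml is assumed to hold a valid HIT question payload (see the interface sketch on Page 10).

# Sketch of the AMT workflow: create HITs, collect results, approve/reject, pay.
# Assumes AWS credentials are configured for the MTurk requester account.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

def publish_hit(question_xml, topic_id, doc_id):
    """Create one binary-relevance HIT for a (topic, document) pair."""
    response = mturk.create_hit(
        Title=f"Judge the relevance of a document to topic {topic_id}",
        Description="Read the document and decide whether it is relevant to the topic.",
        Keywords="relevance, judgment, TREC",
        Reward="0.04",                      # $0.02 judgment + $0.02 comment (assumed split)
        MaxAssignments=5,                   # 5 Workers per HIT, as in the paper
        AssignmentDurationInSeconds=600,
        LifetimeInSeconds=7 * 24 * 3600,
        Question=question_xml,
        RequesterAnnotation=f"{topic_id}:{doc_id}",
    )
    return response["HIT"]["HITId"]

def collect_and_approve(hit_id, is_acceptable):
    """Fetch submitted assignments and approve those that pass the approval rules."""
    assignments = mturk.list_assignments_for_hit(
        HITId=hit_id, AssignmentStatuses=["Submitted"]
    )["Assignments"]
    for a in assignments:
        if is_acceptable(a):                # approval rules supplied by the requester
            mturk.approve_assignment(AssignmentId=a["AssignmentId"])
        else:
            mturk.reject_assignment(
                AssignmentId=a["AssignmentId"],
                RequesterFeedback="The answer did not follow the HIT instructions.",
            )
    return assignments

Approved assignments are paid the base reward automatically by AMT; bonuses (see Page 21) are granted separately.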

Page 10

Methodology

• Data preparation
  – Document collection, topics (queries), documents per topic, number of people that will evaluate one HIT
• Interface design (an example HIT form is sketched after this list)
  – The most important part of AMT experiment design
  – Keep HITs simple
  – Instructions should be clear, concise, specific, free from jargon, and easy to read
  – Include examples in HITs
  – Use UI elements to specify formatting
  – Don't ask for "all" or "every"
  – Explain what will not be accepted, to avoid conflicts later on
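To make these guidelines concrete, here is a hypothetical HIT question: a short instruction block with an example, radio buttons for the binary judgment (rather than free text), and a comment box. The wording, field names, and the HTMLQuestion wrapper are assumptions for illustration, not the authors' actual HIT; a production form would also need the assignmentId field and the external-submit action required by MTurk.

# Hypothetical HIT layout illustrating the interface-design guidelines above.
HIT_FORM = """
<p><b>Topic:</b> {topic}</p>
<p><b>Instructions:</b> Read the document below and decide whether it is relevant
to the topic. Example: for the topic "airport security", an article about new
baggage-screening rules is relevant; an article about airline food is not.</p>
<div style="border:1px solid #ccc; padding:8px">{document}</div>
<p>
  <label><input type="radio" name="relevance" value="relevant"> Relevant</label>
  <label><input type="radio" name="relevance" value="not_relevant"> Not relevant</label>
</p>
<p>Briefly explain your choice (one or two sentences):<br>
   <textarea name="comment" rows="3" cols="60"></textarea></p>
"""

def build_question_xml(topic, document):
    """Wrap the form in MTurk's HTMLQuestion envelope for use with create_hit."""
    html = HIT_FORM.format(topic=topic, document=document)
    return (
        '<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/'
        'AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">'
        f"<HTMLContent><![CDATA[{html}]]></HTMLContent>"
        "<FrameHeight>500</FrameHeight></HTMLQuestion>"
    )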

Page 11

Methodology

• Filtering the Workers
  – Approval rate: provided by AMT
  – Qualification test: a better quality filter, but it involves more development cycles
  – Honey pots: interleave known-answer assignments to check for spamming (see the sketch after this list)
• Scheduling the tasks
  – Split tasks into small chunks: helps avoid Worker fatigue
  – Submit shorter tasks first
  – Incorporate any implicit or explicit feedback into the experimental design
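One common way to implement the honey-pot idea is to interleave a few documents with known judgments into the HITs and track each Worker's accuracy on just those items. The data layout and the 0.7 threshold below are illustrative assumptions, not values from the paper.

# Sketch: flag Workers who fail the interleaved honey-pot (known-answer) items.
from collections import defaultdict

def honeypot_accuracy(judgments, gold):
    """judgments: iterable of (worker_id, topic_id, doc_id, label);
    gold: dict mapping (topic_id, doc_id) -> known label for trap items."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for worker_id, topic_id, doc_id, label in judgments:
        key = (topic_id, doc_id)
        if key in gold:                       # only the trap items are scored
            total[worker_id] += 1
            correct[worker_id] += int(label == gold[key])
    return {w: correct[w] / total[w] for w in total}

def likely_spammers(judgments, gold, threshold=0.7):
    """Workers below the (assumed) accuracy threshold are candidates for rejection."""
    return {w for w, acc in honeypot_accuracy(judgments, gold).items() if acc < threshold}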

Page 12

Experimental Setup

• TREC-8: LA Times and FBIS sub-collections
• 50 topics (queries)
• 10 documents per query
• 5 Workers per HIT
• Budget = $100
  – $0.02 for the binary assessment + $0.02 for the comment/feedback
• Agreement between raters is measured using Cohen's kappa (κ)
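As a sanity check on the numbers, the budget works out exactly: 50 topics × 10 documents × 5 Workers = 2,500 assessments, at $0.02 + $0.02 = $0.04 each, gives $100. Cohen's kappa corrects observed agreement for agreement expected by chance, κ = (P_o − P_e) / (1 − P_e). Below is a minimal two-rater implementation for illustration; the paper compares several rater pairings, and the toy data here is invented.

# Cohen's kappa for two raters judging the same items (binary labels here,
# though the formula works for any finite label set).
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n   # observed agreement
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # chance agreement: probability both raters pick the same label independently
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in set(labels_a) | set(labels_b))
    if p_e == 1.0:                      # degenerate case: both raters always give one label
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Toy example: two Workers judging the same 10 documents (1 = relevant)
w1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
w2 = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]
print(round(cohen_kappa(w1, w2), 2))    # 0.58: moderate agreement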

Page 19

Effect of Highlighting

• Two UIs
  – One with the query terms highlighted
  – One with no highlighting of query terms
• With a couple of exceptions, highlighting led to higher relevance judgments than the plain UI (a highlighting sketch follows).
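A minimal way to produce the highlighted variant is to wrap the query terms in a highlight tag before the document is rendered in the HIT. The regex approach below is an illustrative sketch, not the authors' implementation.

# Sketch: highlight query terms in the document text shown to Workers.
import re
from html import escape

def highlight(document_text, query):
    """Wrap each query term in <mark> tags (case-insensitive, whole words)."""
    text = escape(document_text)
    for term in query.split():
        pattern = re.compile(rf"\b({re.escape(term)})\b", re.IGNORECASE)
        text = pattern.sub(r"<mark>\1</mark>", text)
    return text

print(highlight("Airport security screening rules changed.", "airport security"))
# -> <mark>Airport</mark> <mark>security</mark> screening rules changed.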

Page 21

Experiment with comments

• Experiments E1–E3 had optional comments
• From experiment E4 onwards, comments were made mandatory
• E5 was re-launched to see the effect of a bonus on the length and quality of comments (a bonus-payment sketch follows)
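Granting the bonus can be scripted against the collected assignments. The sketch below uses boto3's send_bonus call with a crude comment-length heuristic; the threshold, the bonus amount, and the idea of automating the decision are assumptions for illustration, not the paper's procedure (mturk is the boto3 MTurk client from the workflow sketch, and the free-text comment is assumed to be already extracted from the assignment's Answer XML).

# Sketch: grant a small bonus for substantive comments (assumed heuristic).
def bonus_for_good_comments(mturk, assignments, min_words=15, amount="0.02"):
    """assignments: dicts carrying AssignmentId, WorkerId, and a 'Comment' field."""
    for a in assignments:
        if len(a.get("Comment", "").split()) >= min_words:
            mturk.send_bonus(
                WorkerId=a["WorkerId"],
                AssignmentId=a["AssignmentId"],
                BonusAmount=amount,
                Reason="Thank you for the detailed explanation of your judgment.",
            )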

Page 23

Recommendations

• Take an iterative approach to designing the UI: it allows feedback to be incorporated

• Split tasks into small chunks and submit smaller tasks first

• Provide detailed feedback for rejected HITs to build Worker trust through word of mouth

• Look out for very fast work, as it may come from a robot (see the sketch after this list)

• Bonus payments can help generate better comments
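The "very fast work" check can be automated from assignment metadata: each assignment record returned by the MTurk API carries AcceptTime and SubmitTime, so unusually short working times can be flagged for manual review. The 20-second cutoff below is an arbitrary illustrative value.

# Sketch: flag assignments completed suspiciously fast (possible robots).
def too_fast(assignments, min_seconds=20):
    """assignments: records from list_assignments_for_hit, whose AcceptTime
    and SubmitTime fields are datetime objects."""
    flagged = []
    for a in assignments:
        elapsed = (a["SubmitTime"] - a["AcceptTime"]).total_seconds()
        if elapsed < min_seconds:
            flagged.append((a["WorkerId"], a["AssignmentId"], elapsed))
    return flagged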

Page 24

Questions?

Thank you!