Design and Implementation of Relevance Assessments Using Crowdsourcing
Omar Alonso (Microsoft Corp., Mountain View, California, US) and Ricardo Baeza-Yates (Yahoo! Research, Barcelona, Spain)
European Conference on Information Retrieval, 2011
Jan 20, 2015
Outline of presentation
• What is crowdsourcing?
  – Amazon Mechanical Turk (AMT)
• Overview of the paper
• Details of AMT experimental design
• Results
• Recommendations
Crowdsourcing
The term "crowdsourcing" is a portmanteau of "crowd" and "outsourcing," first coined by Jeff Howe in a June 2006 Wired
magazine article "The Rise of Crowdsourcing".
"Crowdsourcing is the act of taking a job traditionally
performed by a designated agent (usually an employee) and
outsourcing it to an undefined, generally large group of people
in the form of an open call." - Jeff Howe
Mechanical Turk
The original Mechanical Turk was a fake chess-playing machine constructed in 1770
Amazon Mechanical Turk Workflow
• Design and build HITs
• Put HITs on Amazon Mechanical Turk
• Collect evaluation results from MTurk Workers
• Approve evaluation results according to designed approval rules
• Pay Workers whose inputs have been approved
• Result: collected binary relevance assessments for TREC-8
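The approve-and-pay step depends on approval rules the requester defines. The paper gives no code for this, but a minimal majority-vote sketch (hypothetical helper names, not the authors' implementation) might look like:

```python
from collections import Counter

def majority_label(judgments):
    """Aggregate one document's Worker judgments by majority vote.
    judgments: list of (worker_id, label) pairs."""
    counts = Counter(label for _, label in judgments)
    return counts.most_common(1)[0][0]

def split_by_agreement(judgments):
    """Approve Workers who agreed with the majority; flag the rest
    for manual review rather than auto-rejecting them."""
    consensus = majority_label(judgments)
    approved = [w for w, lab in judgments if lab == consensus]
    flagged = [w for w, lab in judgments if lab != consensus]
    return approved, flagged

# Five Workers judging one document (1 = relevant, 0 = not relevant)
hit = [("w1", 1), ("w2", 1), ("w3", 0), ("w4", 1), ("w5", 1)]
```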
Objective of the paper
To introduce a methodology for crowdsourcing binary relevance assessments using Amazon Mechanical Turk
Methodology
• Data preparation
  – Document collection, topics (queries), documents per topic, number of people that will evaluate one HIT
• Interface design
  – Most important part of AMT experiment design
  – Keep HITs simple
  – Instructions should be clear, concise, specific, free from jargon, and easy to read
  – Include examples in HITs
  – Use UI elements to specify formatting
  – Don't ask for "all" or "every"
  – Explain what will not be accepted, to avoid conflicts later on
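As an illustration of the data-preparation step, the batch size and cost fall out of the parameters directly. A sketch (hypothetical function; the numbers mirror the setup reported later in this deck):

```python
def plan_batches(n_topics=50, docs_per_topic=10, workers_per_hit=5,
                 pay_per_assignment=0.04):
    """Size an assessment batch and estimate its cost.
    Defaults mirror this paper's setup ($0.02 judgment + $0.02 comment)."""
    hits = n_topics * docs_per_topic        # one HIT per topic-document pair
    assignments = hits * workers_per_hit    # redundant judgments per HIT
    return hits, assignments, assignments * pay_per_assignment

hits, assignments, cost = plan_batches()  # 500 HITs, 2500 assignments, $100
```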
Methodology
• Filtering the workers
  – Approval rate: provided by AMT
  – Qualification test: better quality filter, but involves more development cycles
  – Honey pots: interleave questions with known answers to check for spamming
• Scheduling the tasks
  – Split tasks into small chunks: helps avoid Worker fatigue
  – Submit shorter tasks first
  – Incorporate any implicit or explicit feedback into the experimental design
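A honey-pot filter can be sketched as scoring each Worker by the fraction of interleaved known-answer questions they miss (hypothetical names; not from the paper):

```python
def honeypot_miss_rate(worker_answers, honeypots):
    """Fraction of interleaved known-answer questions a Worker got wrong.
    worker_answers: {question_id: label}; honeypots: {question_id: gold_label}."""
    misses = sum(1 for qid, gold in honeypots.items()
                 if worker_answers.get(qid) != gold)
    return misses / len(honeypots)

# A Worker who fails one of two hidden checks is a candidate spammer
answers = {"q1": 1, "q2": 0, "hp1": 1, "hp2": 1}
gold = {"hp1": 1, "hp2": 0}
```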
Experimental Setup
• TREC-8: LA Times and FBIS sub-collections
• 50 topics (queries)
• 10 documents per query
• 5 Workers per HIT
• Budget = $100
  – $0.02 for a binary assessment + $0.02 for a comment/feedback
• Agreement between raters is measured using Cohen's kappa (κ)
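Cohen's kappa can be computed directly from two raters' judgment lists: observed agreement minus chance agreement, normalized. A minimal sketch (illustrative data, not the paper's):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[l] * freq_b[l]
                   for l in set(freq_a) | set(freq_b)) / n ** 2
    return (observed - expected) / (1 - expected)

# Illustrative judgments from two Workers over 10 documents (1 = relevant)
a = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
kappa = cohens_kappa(a, b)  # ≈ 0.58, moderate agreement
```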
Effect of Highlighting
• Two UIs
  – One with query terms highlighted
  – Other with no highlighting of query terms
• With a couple of exceptions, highlighting contributed to higher relevance scores compared to the plain UI
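Highlighting query terms in a document snippet can be done with a simple regex substitution; a sketch of the idea (hypothetical, not the authors' implementation):

```python
import re

def highlight(snippet, query_terms):
    """Wrap whole-word query term matches in <b> tags, case-insensitively."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, query_terms)) + r")\b",
                         re.IGNORECASE)
    return pattern.sub(r"<b>\1</b>", snippet)

marked = highlight("Relevance of the document to the query",
                   ["relevance", "query"])
```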
Experiment with comments
• E1–E3 had optional comments
• From experiment E4 onwards, comments were made mandatory
• Re-launched E5 to see the effect of a bonus on the length and quality of comments
Recommendations
• Iterative approach in designing UI: ability to incorporate feedback
• Split tasks into small chunks and submit smaller tasks first
• Provide detailed feedback for rejected HITs to build Worker trust through word of mouth
• Look out for very fast work, as it might come from a robot
• Bonus payments can help generate better comments
Questions?
Thank you!