Task and Workflow Design II
KSE 652 Social Computing System Design and Analysis
Uichin Lee
Contents
• Turkomatic: a divide-and-conquer strategy for performing more challenging tasks on M-Turk
• TurKontrol: a decision-theoretic approach to workflow control (e.g., how many improve/vote tasks?)
• Turkalytics: monitoring workers’ behavior remotely
Turkomatic: Automatic Recursive Task and Workflow Design for Mechanical Turk
CHI'11 WIP
Turkomatic
• The Turkomatic interface accepts task requests written in natural language
• Subdivide phase:
– For each request, it posts a HIT to M-Turk asking workers to break the task down into a set of logical subtasks
– Each subtask is then automatically reposted to M-Turk; a subtask can be broken down further
• Merge phase:
– Once all subtasks are completed, HITs are posted asking workers to combine the subtask solutions into a coherent whole
• The end result is then delivered to the requester
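The subdivide/merge recursion above can be sketched as follows. Here `post_hit` is a hypothetical stand-in for posting a HIT to M-Turk and collecting a worker's answer (simulated locally); it is not Turkomatic's actual API.

```python
# Sketch of Turkomatic's recursive divide-and-conquer workflow.
# post_hit() is a hypothetical stand-in for posting a HIT to M-Turk
# and collecting a worker's answer; here the worker is simulated.

def post_hit(kind, payload):
    """Simulated worker response for a 'subdivide', 'solve', or 'merge' HIT."""
    if kind == "subdivide":
        # A worker splits the task into two logical subtasks (simulated).
        return [payload + " (part 1)", payload + " (part 2)"]
    if kind == "solve":
        return "solution to: " + payload
    if kind == "merge":
        # A worker combines subtask solutions into a coherent whole.
        return " + ".join(payload)

def turkomatic(task, depth=0, max_depth=2):
    # Base case: task is simple enough (or deep enough) to solve directly.
    if depth >= max_depth:
        return post_hit("solve", task)
    # Subdivide phase: workers break the task into subtasks, each of which
    # is reposted and may be subdivided further.
    subtasks = post_hit("subdivide", task)
    solutions = [turkomatic(s, depth + 1, max_depth) for s in subtasks]
    # Merge phase: workers merge subtask solutions once all are complete.
    return post_hit("merge", solutions)

result = turkomatic("write a five-paragraph essay", max_depth=1)
print(result)
```

With real workers, each `post_hit` call would block until the corresponding HIT is answered, which is why full solutions took up to 72 hours in the evaluation.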
Subdivide Phase
• Decomposition of tasks, and the creation of solution elements
Divide and Merge
Evaluation
• Tasks:
– Producing a written essay in response to a prompt: “Please write a five-paragraph essay on the topic of your choice”
– Solving an example SAT test: “Please solve the 16-question SAT located at http://bit.ly/SATexam”
– Payment: $0.10 to $0.40 per HIT
• Each “subdivide” or “merge” HIT received answers within 4 hours; solutions to the initial task were completed within 72 hours
• Essay: the final essay (on “university legacy admissions”) displayed a reasonably good understanding of the topic, but the writing quality was mixed
• SAT: the task was divided into 12 subtasks (containing 1-3 questions each); the score was 12/17
Decision-Theoretic Control of Crowd-Sourced Workflows
Peng Dai, Mausam, and Daniel S. Weld. AAAI 2010
Motivation
• The iterative workflow (i.e., improve and vote) used in TurKit raises the following questions:
– What is the optimal number of iterations?
– How many ballots (votes) should we use?
– How do answers change if the workers are more/less skilled?
Iterative workflow
TurKontrol: Computation Model
• Text α is improved to text α’ (after an improve task)
• Given a pair (α, α’), a series of votes (b_k) can be received to judge which one is better
TurKontrol: Computation Model
• Text α: quality density function f_Q(q) (the prior)
• A worker x takes an improvement job and submits α’
• Text α’ done by worker x: quality density function f_Q’|q,x(q’) (the posterior)
• Quality density function of text α’ (marginalizing over q): f_Q’(q’) = ∫ f_Q’|q,x(q’) f_Q(q) dq
TurKontrol: Computation Model
• Voting:
– A series of n votes: b = b1, b2, …, bn where bi ∈ {1, 0}
– Posterior probabilities after n votes: f_Q|b(q) and f_Q’|b(q’)
• Difficulty:
– The closer the two results, the more difficult they are to judge
– d(q, q’) = 1 − |q − q’|^M where M is a constant, and d ∈ [0, 1]
• Accuracy (of a worker x):
– a_x(d) = ½[1 + (1 − d)^r] where r is a knob controlling the accuracy distribution
– If the i-th worker x_i has accuracy a_xi(d), then x_i votes for the truly better artifact with probability a_xi(d)
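The difficulty and accuracy definitions are easy to express directly. The constant M and the knob r are the model parameters from the slide; the values below are illustrative defaults, not the paper's fitted values.

```python
# Difficulty and worker-accuracy model from TurKontrol (sketch).
# M and r are model parameters; the defaults here are illustrative only.

def difficulty(q, q_prime, M=1.0):
    """d(q, q') = 1 - |q - q'|^M; closer qualities are harder to judge."""
    return 1.0 - abs(q - q_prime) ** M

def accuracy(d, r=1.0):
    """a_x(d) = 1/2 * (1 + (1 - d)^r): probability of a correct vote."""
    return 0.5 * (1.0 + (1.0 - d) ** r)

# Identical qualities: maximal difficulty, accuracy drops to chance.
print(accuracy(difficulty(0.6, 0.6)))  # 0.5
# Extreme qualities: zero difficulty, a guaranteed correct vote.
print(accuracy(difficulty(0.0, 1.0)))  # 1.0
```

Note that accuracy never falls below ½: even on the hardest comparison a worker is no worse than a coin flip.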
TurKontrol: Computation Model
• For a given pair (α, α’), the posterior probabilities of (Q, Q’) after votes b are f_Q|b(q) and f_Q’|b(q’)
• Given that we don’t know which worker will vote, an average worker’s accuracy is used
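The vote-based update is ordinary Bayesian inference. Below is a minimal discretized sketch, not the paper's implementation: qualities live on a grid, the joint prior over (q, q′) is uniform, and each ballot multiplies in the average-worker likelihood (M = r = 1 assumed).

```python
# Minimal discretized sketch of the vote-based Bayesian update (not the
# paper's exact implementation): joint posterior over (q, q') on a grid.

GRID = [i / 10 for i in range(11)]  # quality grid 0.0 .. 1.0

def accuracy(q, qp):
    d = 1.0 - abs(q - qp)           # difficulty d(q, q') with M = 1
    return 0.5 * (1.0 + (1.0 - d))  # average-worker accuracy with r = 1

def vote_likelihood(b, q, qp):
    """P(vote b | q, q'): b = 1 means the worker voted for alpha'."""
    a = accuracy(q, qp)
    correct = (qp > q)  # a vote for alpha' is correct iff q' > q
    return a if (b == 1) == correct else 1.0 - a

def posterior(votes):
    # Uniform joint prior over (q, q'), then multiply in each ballot.
    post = {(q, qp): 1.0 for q in GRID for qp in GRID}
    for b in votes:
        for (q, qp) in post:
            post[(q, qp)] *= vote_likelihood(b, q, qp)
    z = sum(post.values())
    return {k: v / z for k, v in post.items()}

# Three votes for alpha' shift posterior mass toward q' > q.
post = posterior([1, 1, 1])
p_better = sum(p for (q, qp), p in post.items() if qp > q)
print(round(p_better, 3))
```

The marginals f_Q|b(q) and f_Q’|b(q’) would be obtained by summing this joint posterior over the other coordinate.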
TurKontrol: Computation Model
[Figure: the iterative workflow. An improve task (cost c_imp) turns α into α’; each vote task (cost c_b) compares α and α’. The priors f_Q(q) and f_Q’(q’) are updated to the posteriors f_Q|b(q) and f_Q’|b(q’) as ballots arrive. A utility function maps quality to utility.]
TurKontrol: Computation Model
• Utility estimation of a pair (α, α’), for (1) an improve task and (2) a vote task:
– (2) utility of a vote task
– (1) utility of an improve task
• Decision making:
– Three options: (a) vote, (b) improve, or (c) accept
– k-step lookahead: evaluate all sequences of k decisions and find the sub-sequence with the highest utility
U: utility function; c_b: vote cost; c_imp: improve cost
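The k-step lookahead can be illustrated with a deliberately simplified sketch: the state is collapsed to a scalar quality estimate, and the expected effects of "improve" and "vote" are toy assumptions, not the paper's equations. The control logic (enumerate all length-k action sequences, take the first action of the best one) is the part being demonstrated.

```python
# Highly simplified sketch of k-step lookahead control (not the paper's
# exact equations): state is a scalar quality estimate; 'improve' raises
# it at cost c_imp, 'vote' slightly sharpens it at cost c_b, 'accept' ends.
from itertools import product

C_IMP, C_B = 30.0, 10.0

def utility(quality):
    return 1000.0 * quality ** 2  # convex utility with max 1000

def simulate(quality, actions):
    """Expected net utility of a fixed action sequence (toy dynamics)."""
    cost = 0.0
    for act in actions:
        if act == "accept":
            break
        if act == "improve":
            quality += (1.0 - quality) * 0.3   # assumed expected improvement
            cost += C_IMP
        elif act == "vote":
            quality += (1.0 - quality) * 0.02  # assumed small information gain
            cost += C_B
    return utility(quality) - cost

def k_lookahead(quality, k=2):
    """Evaluate all length-k action sequences; return the best first action."""
    best = max(product(["accept", "improve", "vote"], repeat=k),
               key=lambda seq: simulate(quality, seq))
    return best[0]

print(k_lookahead(0.2, k=2))   # low quality: improving pays off
print(k_lookahead(0.99, k=2))  # near-max quality: paying c_imp no longer pays
```

The real controller compares expected utilities computed from the full posterior densities rather than a point estimate, but the branching structure of the decision is the same.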
Numerical Results
• Convex utility function with max 1000
• Fixed costs: (improve, vote) = (30, 10)
• Net utility: utility of the submitted artifact minus payment to workers
• TurKit: performs as many iterations as possible (max allowance 400)
• TurKontrol(2): lookahead of 2
• cf. worker accuracy a_x(d) = ½[1 + (1 − d)^r]
Turkalytics: Real-time Analytics for Human Computation
Paul Heymann and Hector Garcia-Molina. WWW'11
Basic Buyer human programming
• A human program generates forms, advertised through a marketplace.
• Workers look at posts and then complete the forms for compensation.
Game Maker human programming
• The programmer writes a human program and a game.
• The game implements features to make it fun and difficult to cheat.
• The human program loads and dumps data from the game.
Human Processing programming
Human Processing programming
• Task description:
– Input, output, web forms, human driver, and other information
– Human task instance
• Human drivers: interact with workers
– Functions: initialization (forms, games), retrieving results
– The “human program” accesses workers via “human drivers”
• Recruiters: post task instances into the marketplaces (by working with marketplace drivers)
– A marketplace driver provides an interface to a marketplace
Turkalytics
• Challenge: collecting reliable data about the workers and the tasks they perform
• Why?
– If a task is not being completed, is it because no workers are seeing it? Is it because the task is currently offered at too low a price?
– How does the task completion time break down?
– Do workers spend more time previewing tasks or doing them?
– Do they take long breaks?
– Which are the more “reliable” workers?
Interaction Model
• Search-Preview-Accept (SPA) model
Interaction Model
• Search-Continue-RapidAccept-Accept-Preview (SCRAP) model
– Continue: resume completing a task that was accepted but not submitted
– RapidAccept: accept the next task in a HITGroup without previewing it
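The interaction model is effectively a small state machine over worker click-streams. The transition table below is a plausible reading of SPA plus the Continue and RapidAccept states described here, not the paper's exact automaton; the `work`/`submit` states are added for illustration.

```python
# Sketch of a validator for worker click-streams under the SCRAP model.
# The transition table is a plausible reading of the slides (SPA plus
# Continue and RapidAccept), not the paper's exact automaton.

TRANSITIONS = {
    "start":       {"search", "continue"},
    "search":      {"search", "preview", "rapidaccept"},
    "preview":     {"preview", "accept", "search"},
    "accept":      {"work"},
    "rapidaccept": {"work"},   # accept the next task in a HITGroup w/o preview
    "continue":    {"work"},   # resume a task accepted but not yet submitted
    "work":        {"submit", "search"},
    "submit":      {"search", "rapidaccept"},
}

def valid_stream(events):
    """Check whether a sequence of worker events obeys the transition table."""
    state = "start"
    for e in events:
        if e not in TRANSITIONS.get(state, set()):
            return False
        state = e
    return True

print(valid_stream(["search", "preview", "accept", "work", "submit"]))          # SPA path
print(valid_stream(["search", "rapidaccept", "work", "submit", "rapidaccept"])) # SCRAP path
print(valid_stream(["preview", "accept"]))  # invalid: preview without a search
```

Casting the model this way makes the experimental question concrete: counting which transitions actually occur (e.g., how often RapidAccept replaces Preview-Accept) is exactly the states/actions analysis later in the paper.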
Turkalytics Data Models
Turkalytics Architecture
[Figure: ta.js, the client-side JavaScript embedded in HIT pages, sends log messages (Ajax POST, JSON) to the Log Server; the Analysis Server pulls new log messages (JSON) from the Log Server.]
Implementation: client-side JavaScript
• The requester embeds the Turkalytics script (ta.js) into a HIT (when designing the HIT)
– Monitoring: detect relevant worker data and actions
– Sending: log events by making image requests to the log server (Ajax POST)
Implementation: ta.js -- client-side JavaScript
• ta.js’s monitoring activities:
– Client information: the worker’s screen resolution, which plugins are supported, whether ta.js can set cookies
– DOM events: over the course of a page view, the browser emits various events (e.g., load, submit, beforeunload, and unload)
– Activity: listens on a second-by-second basis for mousemove, scroll, and keydown events to determine whether the worker is active or inactive
– Form contents: examines forms on the page and their contents; logs initial form contents, incremental updates, and final state
Implementation: log/analysis
• Log Server:
– A simple web app built on Google’s App Engine
– Receives logging events from clients running ta.js and saves them to a data store (IP address, user agent, referer, etc.)
• Analysis Server:
– Periodically polls the log server to download any new events that have been received
– Events are inserted into the DB, considering the following:
• Time constraints: data availability to the analysis server
• Dependencies: whether events depend on one another
• Incomplete input: whether all events have been received yet
• Unknown input: what if unexpected input is received?
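The dependency and incomplete-input concerns above amount to a buffering policy: hold back events whose prerequisites have not arrived, and insert them once they are satisfied. A minimal sketch, with a hypothetical event structure (`id` / `depends_on`) rather than the paper's actual schema:

```python
# Sketch of the analysis server's insertion logic: hold back events whose
# dependencies have not yet arrived, insert them once they are satisfied.
# The event structure (id / depends_on) is hypothetical.

def insert_ready(events, db):
    """Repeatedly insert events whose dependencies are already in the DB;
    return the events that remain pending (incomplete input)."""
    pending = list(events)
    progress = True
    while progress:
        progress = False
        for e in list(pending):
            if all(dep in db for dep in e["depends_on"]):
                db[e["id"]] = e          # dependency-respecting insert
                pending.remove(e)
                progress = True
    return pending

db = {}
events = [
    {"id": "formupdate-1", "depends_on": ["pageload-1"]},  # arrived early
    {"id": "pageload-1", "depends_on": []},
    {"id": "submit-1", "depends_on": ["formupdate-1", "pageload-1"]},
    {"id": "formupdate-2", "depends_on": ["pageload-2"]},  # pageload-2 missing
]
still_pending = insert_ready(events, db)
print(sorted(db))                         # inserted in dependency order
print([e["id"] for e in still_pending])   # waiting on incomplete input
```

Events left in the pending list after each polling round are exactly the "incomplete input" case; anything that never resolves would eventually be flagged as unknown input.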
Implementation: analysis
[Figure: an example log message, annotated with the event type (what kind of data is sent), the actual data for that type, detailed task information, and the session ID.]
Experiments
• Tasks:
– Named Entity Recognition (NER): posted in groups of 200 by a researcher in Natural Language Processing; asks workers to label words in a Wikipedia article that correspond to people, organizations, locations, or demonyms (2,000 HITs, 1 HIT Type, more than 500 workers)
– Turker Count (TC): posted once a week by a professor of business at U.C. Berkeley; asks workers to push a button, and is designed to gauge how many workers are present in the marketplace (2 HITs, 1 HIT Type, more than 1,000 workers each)
– Create Diagram (CD): posted by the authors; asked workers to draw diagrams for the paper based on hand-drawn sketches
Experiments: origin of workers
• The GeoLite City DB from MaxMind was used to geolocate all remote users by IP address
Experiments: worker characteristics
Experiments: states/actions
• RapidAccept is quite popular (Continue is rare)
Experiments: # previews
• Artificial recency for NER/CD (repeated posting keeps them near the top of the task list):
– NER and CD exhibit a less severe drop in previews than TC
Experiments: activity vs. delay
• Average active and total seconds for each worker who completed the NER task (correlation 0.88)
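The reported 0.88 is a plain Pearson correlation between per-worker active seconds and total seconds. A small sketch with made-up numbers (the data below is illustrative, not the NER dataset):

```python
# Pearson correlation between per-worker active seconds and total seconds,
# as used to compare activity with wall-clock working time. The numbers
# below are made up for illustration; the paper reports 0.88 for NER.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

active = [40, 90, 120, 30, 200]   # seconds with mouse/keyboard activity
total  = [60, 110, 180, 70, 240]  # total seconds on the task page
print(round(pearson(active, total), 2))  # 0.98
```

A high correlation here means active time is a good proxy for total working time, which bears directly on the multi-tasking question raised in the Discussion.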
Discussion
• Multi-tasking users? Activity vs. working time
• Privacy?
– We can collect as much data as we can…
– How about Google Analytics? Any web page we visit can collect such information…
• False data injection?
• How can we better utilize the dataset?
– Re-designing existing tasks, pricing, etc. (or mining user behavior?)
Summary
• Turkomatic: a divide-and-conquer strategy for performing more challenging tasks on M-Turk
• TurKontrol: a decision-theoretic approach to workflow control (e.g., how many improve/vote tasks?)
• Turkalytics: monitoring workers’ behavior remotely