Crowdsourcing Step-by-Step Information Extraction to Enhance Existing How-to Videos
Juho Kim, Phu Nguyen, Sarah Weir, Philip J. Guo, Robert C. Miller, Krzysztof Z. Gajos
How-to videos online
Learning from how-to videos is limited by video player interfaces.
Watching Example
Problem in Watching
It’s difficult to navigate to
specific parts you’re interested in.
find
repeat
skip
How-to Video: Step-by-Step Nature
Example step: "Apply gradient map"
Completeness & detail of step-by-step instructions are integral to task performance. (Eiriksdottir and Catrambone, 2011)
Proactive & random access and semantic indices in instructional videos lead to better task performance and learner satisfaction. (Zhang et al., 2006)
Interactivity can help overcome the difficulties of perception and comprehension. Stopping, starting, and replaying an animation can allow reinspection. (Tversky et al., 2002)
Design Insight: Enable step-by-step navigation with high interactivity.
ToolScape: Step-aware video player (work in progress)
• work-in-progress images
• parts with no visual progress
• step labels & links
Goal: enhance existing how-to videos with step-level interactivity & annotation.
Research Questions
Does step-by-step navigation help learners?
Preliminary user study
How can we annotate an existing how-to
video with step-by-step information?
Crowdsourcing annotation workflow
Study: Photoshop Design Tasks
12 novice Photoshop users
manually annotated videos
Conditions: Baseline vs. ToolScape
With ToolScape, learners will…
H1. feel more confident about their design skills.
- self-efficacy gain
H2. believe they produced better designs.
- self-rating on designs produced
H3. actually produce better designs.
- external rating on designs produced
H1. Higher self-efficacy gain with ToolScape
– Four 7-point Likert scale questions
– Mann-Whitney U test (Z=2.06, p<0.05); error bars: standard error
Self-efficacy gain: ToolScape 1.4 vs. Baseline 0.1
H2. Higher self-rating with ToolScape
– One 7-point Likert scale question
– Mann-Whitney U test (Z=2.70, p<0.01); error bars: standard error
Self-rating: ToolScape 5.3 vs. Baseline 3.5
H3. External raters rank ToolScape designs higher (lower rank is better)
– Wilcoxon signed-rank test (W=317, Z=-2.79, p<0.01, r=0.29); error bars: standard error
– Krippendorff's alpha = 0.753
Mean rank: ToolScape 5.7 vs. Baseline 7.3
Non-sequential navigation of the video
Step-level navigation: clicked 8.9 times per task
"It is great for skipping straight to relevant portions of the tutorial."
"It was also easier to go back to parts I missed."
Research Questions
Does step-by-step navigation help learners?
Preliminary user study
How can we annotate an existing how-to
video with step-by-step information?
Crowdsourcing annotation workflow
Annotations for Step-Aware Video Player
• step time
• step label
• before/after results
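The three annotation types above can be sketched as a single record per step. This is an illustrative schema, not ToolScape's actual data model; the field names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class StepAnnotation:
    """One annotated step in a how-to video (illustrative schema)."""
    time: float        # seconds into the video where the step occurs
    label: str         # short description, e.g. "Apply gradient map"
    before_frame: str  # path/URL of a frame just before the step
    after_frame: str   # path/URL of a frame just after the step

step = StepAnnotation(12.5, "Apply gradient map",
                      "frames/012_before.png", "frames/013_after.png")
```

A step-aware player can then render the timeline links and thumbnails directly from a list of such records.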
Design Goals for Annotation Method
• domain-independent
• existing videos
• untrained annotators
Crowdsourcing
Multi-stage crowdsourcing workflow
Input video
FIND: When & what are the steps?
VERIFY: Vote & improve
EXPAND: Before/after the steps?
Output timeline
Stage 1. FIND candidate steps
Labeling a step
Time-based Clustering
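Multiple workers submit candidate (time, label) pairs, and nearby timestamps likely refer to the same step. A minimal single-pass sketch of such time-based clustering; the 5-second gap and the grouping rule are assumptions, not the paper's exact parameters:

```python
def cluster_by_time(candidates, gap=5.0):
    """Group candidate steps whose timestamps fall within `gap` seconds.

    candidates: (time_sec, label) pairs collected from multiple workers.
    Returns a list of clusters; each cluster is one likely step.
    """
    clusters = []
    for t, label in sorted(candidates):
        # Open a new cluster when this candidate is far from the previous one.
        if not clusters or t - clusters[-1][-1][0] > gap:
            clusters.append([])
        clusters[-1].append((t, label))
    return clusters

raw = [(10.2, "open gradient map"), (11.0, "apply gradient map"),
       (34.5, "set blend mode"), (35.1, "change blending")]
groups = cluster_by_time(raw)
# → two clusters: one near 10s, one near 35s
```

Each resulting cluster becomes one candidate step whose label set is passed to the VERIFY stage.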
Stage 2. VERIFY steps by voting/improving
Quality control for Stage 2
• Majority voting
• Breaking ties:
  – String matching to combine "similar enough" labels
  – Prefer the longer string:
    "grate three cups of cheese" > "grate cheese"
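These tie-breaking rules can be sketched as: pool near-duplicate labels before counting votes, then prefer the longer, more detailed string. The use of `difflib.SequenceMatcher` and the 0.6 similarity threshold are assumptions standing in for whatever string-matching rule the system actually uses:

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.6):
    """True if two labels are 'similar enough' (assumed measure/threshold)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def pick_label(labels):
    """Choose one step label from several workers' suggestions.

    Near-duplicate labels are pooled into one group (majority voting);
    the largest group wins, and ties are broken by preferring the
    longer, more detailed string.
    """
    groups = []  # each group holds mutually similar labels
    for lab in labels:
        for g in groups:
            if similar(lab, g[0]):
                g.append(lab)
                break
        else:
            groups.append([lab])
    best = max(groups, key=lambda g: (len(g), max(len(l) for l in g)))
    return max(best, key=len)

winner = pick_label(["grate cheese",
                     "grate three cups of cheese",
                     "slice the tomato"])
# → "grate three cups of cheese"
```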
Stage 3. EXPAND with before/after images
Quality control for Stage 3
• Majority voting
• Breaking ties:
  – Pixel diff to combine "similar enough" frames
  – Choose the frame closer to the step
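The frame tie-break mirrors the label one: pool frames whose pixel diff is small, take the most popular pool, and on ties pick the frame nearest the step's time. A minimal sketch with frames as flat lists of grayscale pixel values (a real system would compare decoded video frames; the threshold is an assumption):

```python
def pixel_diff(frame_a, frame_b):
    """Mean absolute per-pixel difference between two same-size frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

def pick_frame(submissions, step_time, diff_threshold=10.0):
    """Choose one frame from workers' (time, frame) submissions.

    Frames with a small pixel diff are pooled together (majority
    voting); ties between equally popular pools go to the pool, and
    within it the frame, whose timestamp is closest to the step.
    """
    groups = []  # each group: list of (time, frame)
    for t, frame in submissions:
        for g in groups:
            if pixel_diff(frame, g[0][1]) <= diff_threshold:
                g.append((t, frame))
                break
        else:
            groups.append([(t, frame)])
    best = max(groups,
               key=lambda g: (len(g), -min(abs(t - step_time) for t, _ in g)))
    return min(best, key=lambda item: abs(item[0] - step_time))[1]

dark = [0, 0, 0, 0]
chosen = pick_frame([(9.5, dark), (11.0, [2, 1, 0, 0]),
                     (30.0, [200, 200, 200, 200])], step_time=10.0)
# → the dark frame: its pool has two votes and it is nearest t=10
```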
Evaluation
• Generalizable? 75 Photoshop / cooking / makeup videos
• Accurate? Precision and recall against trained annotators' labels
Across all domains, ~80% precision and recall:
Domain      Precision  Recall
Cooking     0.77       0.84
Makeup      0.74       0.77
Photoshop   0.79       0.79
All         0.77       0.81
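Precision and recall against trained annotators can be computed by matching each extracted step to at most one ground-truth step within a time tolerance. A sketch under that assumption; the 5-second tolerance and greedy matching order are illustrative, not the paper's exact procedure:

```python
def precision_recall(extracted, ground_truth, tol=5.0):
    """Match extracted step times to ground-truth times within `tol` sec.

    Each ground-truth step can absorb at most one extracted step.
    Returns (precision, recall).
    """
    unmatched_gt = sorted(ground_truth)
    hits = 0
    for t in sorted(extracted):
        for gt in unmatched_gt:
            if abs(t - gt) <= tol:
                unmatched_gt.remove(gt)  # each GT step matched at most once
                hits += 1
                break
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall

p, r = precision_recall([10, 31, 70], [12, 30, 50, 68])
# 3/3 extracted steps match → precision 1.0; 3/4 GT steps found → recall 0.75
```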
Conceptual Level Differences
One annotator's single step can be another's several:
• "Now apply the bronzer to your face evenly"
vs.
• "Apply the bronzer to the forehead"
• "Apply the bronzer to the cheekbones"
• "Apply the bronzer to the jawline"
Timing is 2.7 seconds off on average
(ground truth: one step every 17.3 seconds)
Cost: $1.07 per minute of video
• 111 HITs / video (3 workers / task)
• $2.50 / video (Find + Verify)
• $4.85 / video (Find + Verify + Expand)
• $0.32 / step (time + label + before/after)
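Two averages are implied by these figures but not stated on the slide; a quick back-of-envelope check, assuming the per-video, per-step, and per-minute costs all describe the same Find+Verify+Expand runs:

```python
# Figures from the slide (USD).
cost_per_video = 4.85   # Find + Verify + Expand, per video
cost_per_step = 0.32    # time + label + before/after, per step
cost_per_minute = 1.07  # per minute of video

# Implied averages (assumptions, not slide-stated numbers).
steps_per_video = cost_per_video / cost_per_step      # ≈ 15.2 steps
minutes_per_video = cost_per_video / cost_per_minute  # ≈ 4.5 minutes
```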
Contributions
• Study: increased interactivity improved task performance & self-efficacy
• Crowd video annotation method & Find-Verify-Expand design pattern
• Evaluation: fully annotated 75 existing videos across 3 domains with ~80% accuracy
Ongoing Work: Beyond Low-Level Steps
• hierarchical solution structure extraction
Catrambone, R. The subgoal learning model: Creating better examples so that students can solve novel problems. Journal of Experimental Psychology: General, 127 (1998).
Learnersourcing: learners as a crowd
• Motivated, qualified
• Feedback loop between learners & system
Future of How-to Video Learning
What if we had 1000s of
fully annotated videos?
• Flexible learning paths with multiple videos
• Step-level search, recommendation
• Patterns from multiple solutions
Crowdsourcing Step-by-Step Information Extraction to Enhance Existing How-to Videos
Juho Kim
MIT CSAIL
juhokim.com
Acknowledgement: This work was supported in part by
Quanta Computer & the Samsung Fellowship.