An Investigation of Evaluation Metrics for Analytic Question Answering
ARDA Metrics Challenge: PI Meeting
Emile Morse, October 7, 2004
Outline
• Motivation & Goals
• Hypothesis-driven development of metrics
• Design – collection, subjects, scenarios
• Data Collection & Results
• Summary and issues
Motivation
Much progress in IR has been attributed to community evaluations using metrics of precision and recall with common tasks and common data.
There is no corresponding set of evaluation criteria for the interaction of users with these systems.
While system performance is crucial, the utility of the system to the user is equally critical.
The lack of evaluation criteria prevents the comparison of systems based on utility. Acquisition of new systems is therefore based on system performance alone, which frequently does NOT reflect how systems will work in the user's actual process.
Goals of Workshop
To develop metrics for process and products that reflect the interaction of users and information systems.
To develop the metrics based on:
• Cognitive task analyses of intelligence analysts
• Previous experience in AQUAINT and NIMD evaluations
• Expert consultation
To deliver an evaluation package consisting of:
• Process and product metrics
• An evaluation methodology
• A data set to use in the evaluation
Hypothesis-driven development of metrics
Hypotheses – QA systems should …
Candidate metrics – what could we measure that would provide evidence to support or refute each hypothesis?
Measures – implementation of a metric; depends on the specific collection method
Collection methods:
• Questionnaires
• Mood meter
• System logs
• Report evaluation method
• System surveillance tool
• Cognitive workload instrument
• …
Hypotheses: Question answering systems should …
H1 Support information gathering with lower cognitive workload
H2 Assist in exploring more paths/hypotheses
H3 Enable production of higher quality reports
H4 Provide useful suggestions to the analyst
H5 Provide more good surprises than bad
H6 Enable more focus on Analysis than data collection
H7 Enable analysts to collect more data in less time
H8 Reduce the time spent reading
H9 Identify gaps in the knowledge base
H10 Help the analyst recognize gaps in their thinking
H11 Provide context for information
H12 Provide context, continuity and coherence of dialogue
H13 Let analysts relocate previously seen materials
H14 Be easy to use
H15 Increase an analyst’s confidence in exploration and report
Examples of candidate metrics
H1: Support gathering the same type of information with a lower cognitive workload
• # queries/questions
• % of interactions where the analyst takes the initiative
• Number of non-content interactions with the system (clarifications)
• Cognitive workload measurement
H7: Enable analysts to collect more data in less time
• Growth of shoebox over time
• Subjective assessment
H12: QA systems should provide context and continuity for the user – coherence of dialogue
• Similarity between queries – calculate shifts in dialog trails (see the sketch below)
• Redundancy of documents – count how often a snippet is found more than one time
• Subjective assessment
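To make the H12 query-similarity metric concrete, here is a minimal sketch of one way to count shifts in a dialog trail. The tokenization, the Jaccard measure, and the 0.2 threshold are illustrative assumptions, not the implementation used in the workshop.

```python
# Illustrative sketch of the H12 "shifts in dialog trails" metric.
# Jaccard overlap of query terms and the 0.2 threshold are assumed choices.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two term sets (1.0 = identical, 0.0 = disjoint)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def dialog_shifts(queries: list[str], threshold: float = 0.2) -> int:
    """Count topic shifts: consecutive query pairs whose term overlap
    falls below the threshold."""
    terms = [set(q.lower().split()) for q in queries]
    return sum(1 for prev, cur in zip(terms, terms[1:])
               if jaccard(prev, cur) < threshold)

session = [
    "libya chemical weapons production sites",
    "libya chemical weapons delivery systems",
    "rabta plant pharmaceutical cover story",
]
print(dialog_shifts(session))  # 1: the third query starts a new subtopic
```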
Top-level Design of the Workshop
• Systems
• Domain scenarios
• Document collection
• Subjects
• On-site team
• Block design and on-site plan
Design – Systems
HITIQA – Tomek Strzalkowski
Ferret – Sanda Harabagiu
GINKO – Stefano Bertolo
GNIST
Design – Scenarios
ID Topic
A Indian Chemical Weapons Production and Delivery Systems
B Libyan Chemical Weapons Program
C Iranian Chemical Weapons Development and Impact
D North Korean Chemical and Biological Weapons Research
E Pakistani Chemical Agent Production
F Current Status of Russia’s Chemical Weapons Program
G South African Chemical Agents Program Status
H Assessment of Egypt’s Biological Weapons
Design – Document Collection
Design – Subjects
8 reservists (7 Navy; 1 Army)
Age: 30-54 yrs (M=40.8)
Educational background: 1 PhD; 4 Masters; 2 Bachelors; 1 HS
Military service: 2.5-31 yrs (M=18.3)
Analysis experience: 0-23 yrs (M=10.8)
Design – On-Site Team
Name | Affiliation | Roles (Trainers, Observers, Focus Groups, Technical)
Emile Morse NIST X X X X
Paul Kantor Rutgers X
Diane Kelly UNC X X X
Robert Rittman Rutgers X
Aleksandra Sarcevic Rutgers X
Ying Sun Rutgers X
Joe Konczal NIST X
Stefano Bertolo Cycorp X X
Andy Hickl LCC X
Sharon Small/Hilda Hardy Albany X
Sean Ryan Albany X
8 Analysts 7 Navy/1 Army
Battelle Support: Antonio Sanfilippo, Ben Barnett, Laura Curtis, Troy Juntenen, Trina Pitcher
Other Visitors: Lynn Franklin, Peter LaMonica
2-day Block Schedule
Day 1
Start | Duration | Activity
9:30 | 1:00 | System Training
10:30 | 0:30 | Check Ride
11:00 | 0:30 | Exploration
11:30 | 1:15 | LUNCH
12:45 | 0:15 | Scenario Discussion
1:00 | 2:30 | Scenario Task
3:30 | 1:00 | Questionnaires & Debriefing
Day 2
Start | Duration | Activity
8:15 | 0:15 | Scenario Discussion
8:30 | 2:30 | Scenario Task
11:00 | 1:10 | Questionnaires & Debriefing
12:10 | 1:00 | LUNCH
1:10 | 0:50 | Questionnaires & Debriefing (cont'd)
2:00 | 2:25 | Cross-Evaluation
Data Collection Instruments
• Questionnaires: Post-scenario (SCE), Post-session (SES), Post-system (SYS)
• Cross-evaluation of reports
• Cognitive workload (NASA TLX)
• Glass Box
• System logs
• Mood indicator
• Status reports
• Debriefing
• Observer notes
• Scenario difficulty assessment
Question answering systems should … (columns: Questionnaires, Cross-evaluation, NASA TLX, Glass Box, System Logs)
Support information gathering with lower cognitive workload X X X
Assist in exploring more paths/hypotheses X
Enable production of higher quality reports X X
Provide useful suggestions to the analyst X X X
Provide more good surprises than bad X
Enable more focus on analysis than data collection X
Enable analysts to collect more data in less time X X
Reduce the time spent reading X X
Identify gaps in the knowledge base X X X
Help the analyst recognize gaps in their thinking X
Provide context for information X X
Provide context, continuity and coherence of dialogue X X X
Let analysts relocate previously seen materials X
Be easy to use X X
Increase an analyst’s confidence in exploration and report X
Questionnaires
Type | # Questions | # Hypotheses covered
Post-Scenario (SCE) 6 --
Post-Session (SES) 15 8
Post-System (SYS) 27 12
• Coverage for 14/15 hypotheses
• Other question types:
• All SCE questions relate to scenario content
• 3 SYS questions on Readiness
• 3 SYS questions on Training
Questions for Hypothesis 7
Enable analysts to collect more data in less time
SES Q2: In comparison to other systems that you normally use for work tasks, how would you assess the length of time that it took to perform this task using the [X] system? [less … same … more]
SES Q13: If you had to perform a task like the one described in the scenario at work, do you think that having access to the [X] system would help increase the speed with which you find information? [not at all … a lot]
SYS Q23: Having the system at work would help me find information faster than I can currently find it.
SYS Q6: The system slows down my process of finding information.
Question answering systems should … (columns: Questionnaires [SES, SYS], Cross-evaluation, NASA TLX, Glass Box, System Logs)
Support information gathering with lower cognitive workload X X X
Assist in exploring more paths/hypotheses 1/1 3/5
Enable production of higher quality reports 2/2 3/3 X
Provide useful suggestions to the analyst 1/1 X X
Provide more good surprises than bad 1/2
Enable more focus on analysis than data collection 0/1
Enable analysts to collect more data in less time 2/2 1/2 X
Reduce the time spent reading 0/1 X
Identify gaps in the knowledge base 0/1 X X
Help the analyst recognize gaps in their thinking 4/4
Provide context for information 1/1 X
Provide context, continuity and coherence of dialogue 1/1 2/3 X X
Be easy to use 2/2 2/6 X
Increase an analyst’s confidence in exploration and report 2/2 0/1
Additional Analysis for Questionnaire Data – Factor Analysis
Four factors emerged:
• Factor 1: most questions
• Factor 2: time, navigation, training
• Factor 3: novel information, new way of searching
• Factor 4: skill in using the system improved
These factors distinguished between the four systems, with each system being most distinguished from the others (positively or negatively) on one factor.
GNIST was related to Factor 2: positive for navigation and training; negative for time. A sketch of this kind of analysis follows.
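As a rough illustration of how such a factor structure could be recovered, the sketch below fits a four-factor model with scikit-learn. The input file name and the one-row-per-analyst-session layout are assumptions; this is not the analysis code used in the workshop.

```python
# Illustrative sketch: four-factor analysis of questionnaire responses.
# "questionnaires.csv" (one row per analyst-session, one column per
# question) is a hypothetical file, not the workshop's data set.
import numpy as np
from sklearn.decomposition import FactorAnalysis

responses = np.loadtxt("questionnaires.csv", delimiter=",")

fa = FactorAnalysis(n_components=4, random_state=0)
scores = fa.fit_transform(responses)   # factor scores, one row per session
loadings = fa.components_              # shape (4, n_questions)

# The questions loading most strongly on a factor suggest its interpretation
# (e.g., "time, navigation, training" for Factor 2 above).
for i, row in enumerate(loadings, start=1):
    top = np.argsort(-np.abs(row))[:5]
    print(f"Factor {i}: strongest questions {top.tolist()}")
```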
Cross Evaluation Criteria
Subjects rated the reports (including their own) on seven characteristics (see the aggregation sketch after the list):
• Covers the important ground
• Avoids the irrelevant materials
• Avoids redundant information
• Includes selective information
• Is well organized
• Reads clearly and easily
• Overall rating
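A minimal sketch of how the seven ratings might be aggregated per report, keeping self-judgments separate (they enter the analysis as the Self*judge effect). The criterion keys and the unweighted mean are assumed simplifications, not the workshop's scoring scheme.

```python
# Illustrative sketch: aggregating the seven cross-evaluation criteria.
# Criterion names and the unweighted mean are assumptions for illustration.
from statistics import mean

CRITERIA = ["coverage", "avoids_irrelevant", "avoids_redundant",
            "selective", "organized", "readable", "overall"]

def report_score(ratings: dict[str, float]) -> float:
    """Unweighted mean of the seven criterion ratings for one judgment."""
    return mean(ratings[c] for c in CRITERIA)

def peer_vs_self(judgments: list[tuple[str, str, dict]]) -> tuple[float, float]:
    """judgments: (judge, author, ratings) triples. Returns the mean score
    from peer judgments and from self-judgments, since analysts also
    rated their own reports."""
    peer = [report_score(r) for judge, author, r in judgments if judge != author]
    self_ = [report_score(r) for judge, author, r in judgments if judge == author]
    return mean(peer), mean(self_)
```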
Cross-evaluation Results
Effect | F | Sig.
Judge | 15.32 | 0.00 **
Author | 12.26 | 0.00 **
System | 6.33 | 0.00 **
Scenario | 2.94 | 0.01 **
Observer | 1.76 | 0.15
Self*judge | 5.59 | 0.00 **

Pairwise system comparisons (row vs. column):
        α       β       γ
β       0.06
γ       0.37*   0.30
GNIST   -0.07   -0.14   -0.44**
(A sketch of this decomposition follows.)
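For readers who want to reproduce this kind of decomposition on their own data, a sketch using statsmodels follows. The file name and column names are assumptions; the model form (categorical main effects plus a self-judgment term) mirrors the effects listed above.

```python
# Illustrative sketch: ANOVA over cross-evaluation scores with statsmodels.
# "cross_eval_ratings.csv" and its columns (judge, author, system, scenario,
# observer, score) are hypothetical names.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv("cross_eval_ratings.csv")
df["self"] = (df["judge"] == df["author"]).astype(int)  # Self*judge term

model = ols("score ~ C(judge) + C(author) + C(system) + C(scenario)"
            " + C(observer) + self:C(judge)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # F and p-value per effect
```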
NASA TLX – Cognitive Load
Subscale | α | β | γ | GNIST | Sig.
Mental | 9.93 | 14.94 | 19.13 | 7.80 | γ > β > {α, GNIST}
Physical | 6.50 | 2.50 | 2.13 | 2.20 | α > {β, γ, GNIST}
Temporal | 13.36 | 17.50 | 9.31 | 25.50 | GNIST > β > α > γ
Performance | 10.79 | 7.06 | 12.38 | 14.80 | {GNIST, γ} > α > β
Frustration | 16.57 | 13.88 | 8.81 | 12.30 | {α, β} > GNIST > γ
Effort | 16.50 | 15.31 | 11.19 | 21.60 | GNIST > {α, β} > γ
TLX Score | 4.91 | 4.75 | 4.20 | 5.61 | {GNIST, α, β} > γ
(The standard TLX computation is sketched below.)
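For reference, the standard weighted NASA TLX computation is sketched below; the subscale ratings and pairwise-comparison weights in the example are invented, and the scaling evidently differs from the scores reported in the table above.

```python
# Illustrative sketch: the textbook weighted NASA TLX score.
# Ratings are 0-100 per subscale; weights (0-5, summing to 15) come from
# 15 pairwise comparisons. Example values are invented.
SUBSCALES = ["mental", "physical", "temporal",
             "performance", "effort", "frustration"]

def tlx_score(ratings: dict[str, float], weights: dict[str, int]) -> float:
    """Weighted workload: sum(rating * weight) / 15, on the 0-100 scale."""
    assert sum(weights.values()) == 15, "weights from 15 pairwise comparisons"
    return sum(ratings[s] * weights[s] for s in SUBSCALES) / 15

ratings = {"mental": 70, "physical": 20, "temporal": 55,
           "performance": 40, "effort": 60, "frustration": 35}
weights = {"mental": 4, "physical": 1, "temporal": 3,
           "performance": 2, "effort": 3, "frustration": 2}
print(tlx_score(ratings, weights))  # 53.0
```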
Glass Box Data
Types of data captured (see the focus-time sketch after the list):
• Keystrokes
• Mouse moves
• Session start/stop times
• Task times
• Application focus time
• Copy/paste events
• Screen capture & audio track
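A minimal sketch, under an assumed event format, of how application focus time (the basis of the session-time allocation chart on the next slide) can be derived from Glass Box focus-change events.

```python
# Illustrative sketch: per-application focus time from focus-change events.
# The (timestamp_seconds, app_name) event format is an assumption; the real
# Glass Box capture is far richer (keystrokes, copy/paste, screen video).
from collections import defaultdict

def focus_minutes(events: list[tuple[float, str]],
                  session_end: float) -> dict[str, float]:
    """events must be sorted by time; returns minutes of focus per app."""
    totals: dict[str, float] = defaultdict(float)
    for (t, app), (t_next, _) in zip(events, events[1:] + [(session_end, "")]):
        totals[app] += (t_next - t) / 60.0
    return dict(totals)

events = [(0, "QA system"), (900, "Word"), (1500, "QA system")]
print(focus_minutes(events, session_end=2.5 * 3600))
# {'QA system': 140.0, 'Word': 10.0} for a 2.5-hour scenario task
```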
Glass Box Data – Allocation of session time
[Chart: time in minutes (0–100) each session spent in each application: the QA system, Word, and GNIST.]
System log data
• # queries/questions
• # 'good' queries/questions
• Total documents delivered
• # unique documents delivered
• % unique documents delivered
• # documents copied from
• # copies
(A sketch of these derived measures follows.)
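The derived measures above are simple counts; the sketch below computes them from a minimal assumed log model. Treating a question as 'good' when the analyst copied from its results is an illustrative proxy, not the workshop's definition.

```python
# Illustrative sketch: derived system-log measures from a minimal log model.
# Field names and the 'good question' proxy (copies > 0) are assumptions.

def log_metrics(queries: list[dict]) -> dict[str, float]:
    """queries: one dict per query/question with keys
    'docs_delivered' (list of doc ids) and 'copies' (int)."""
    delivered = [d for q in queries for d in q["docs_delivered"]]
    unique = set(delivered)
    return {
        "n_questions": len(queries),
        "n_good_questions": sum(1 for q in queries if q["copies"] > 0),
        "total_docs_delivered": len(delivered),
        "unique_docs_delivered": len(unique),
        "pct_unique": 100 * len(unique) / len(delivered) if delivered else 0.0,
        "n_copies": sum(q["copies"] for q in queries),
    }

log = [{"docs_delivered": ["d1", "d2"], "copies": 1},
       {"docs_delivered": ["d2", "d3"], "copies": 0}]
print(log_metrics(log))  # 2 questions, 1 'good', 4 delivered, 3 unique (75%)
```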
# Questions vs. # 'Good' Questions
[Chart: number of questions vs. number of 'good' questions per system (α, β, γ, GNIST); proportion of questions rated 'good': 66%, 38%, 60%, 35%.]
Summary (matrix as above) with the relative cost of each collection method: Questionnaires $, Cross-evaluation $$, NASA TLX $, Glass Box $$$, System Logs $$$.
What Next?
Query trails are being worked on by LCC, Rutgers and others; available as part of the deliverable.
Scenario difficulty has become an independent effort with potential impact on both NIMD and AQUAINT.
Thinking about an alternative implementation of the mood indicator.
Adoption options:
• Each project team employs the metrics and methodology
• AQUAINT sponsors large-scale group evaluations using the metrics and methodology
• Or something in between
Issues to be Addressed
What constitutes a replication of the method? The whole thing? A few hypotheses with all data methods? All hypotheses with a few data methods?
Costs associated with data collection methods.
Is a comparison needed?
• Baseline – if so, is Google the right one? Maybe the 'best so far' to keep the bar high.
• Past results – can measure progress over time, but requires iterative application.
'Currency' of data and scenarios: analysts are sensitive to staleness. What is the effect of updating on repeatability?
Backups
Report Cross-Evaluation Results
Effect | Levels | F | Sig.
Judge | 7* analysts (as judges): 1-5, 7, 8 | 15.32 | 0.00 **
Author | 7* analysts (as authors): 1-5, 7, 8 | 12.26 | 0.00 **
System | α, β, γ, GNIST | 6.33 | 0.00 **
Scenario | 8 scenarios: A–H | 2.94 | 0.01 **
Observer | 4 observers: A, D, E, R | 1.76 | 0.15
Self*judge | Self judgment | 5.59 | 0.00 **
Summary of Findings
Question answering systems should … (columns: Questionnaires, Cross-evaluation, NASA TLX, Glass Box, System Logs)
H1 Support information gathering with lower cognitive workload X X X
H2 Assist in exploring more paths/hypotheses X
H3 Enable production of higher quality reports X X
H4 Provide useful suggestions to the analyst X X X
H5 Provide more good surprises than bad X
H6 Enable more focus on analysis than data collection X
H7 Enable analysts to collect more data in less time X X
H8 Reduce the time spent reading X X
H9 Identify gaps in the knowledge base X X X
H10 Help the analyst recognize gaps in their thinking X
H11 Provide context for information X X
H12 Provide context, continuity and coherence of dialogue X X X
H14 Be easy to use X X
H15 Increase an analyst's confidence in exploration and report X
Relative cost: Questionnaires $, Cross-evaluation $$, NASA TLX $, Glass Box $$$, System Logs $$$