An Investigation of Evaluation Metrics for Analytic Question Answering
ARDA Metrics Challenge: PI Meeting
Emile Morse, October 7, 2004
Transcript
Page 1

An Investigation of Evaluation Metrics for Analytic Question Answering

ARDA Metrics Challenge: PI Meeting

Emile Morse, October 7, 2004

Page 2

Outline

• Motivation & Goals
• Hypothesis-driven development of metrics
• Design – collection, subjects, scenarios
• Data Collection & Results
• Summary and issues

Page 3

Motivation

• Much of the progress in IR has been attributed to community evaluations using precision and recall metrics with common tasks and common data.

• There is no corresponding set of evaluation criteria for the interaction of users with these systems.

• While system performance is crucial, the utility of the system to the user is equally critical.

• The lack of such evaluation criteria prevents comparing systems on the basis of utility. Acquisition of new systems is therefore based on system performance alone, which frequently does NOT reflect how systems will work in the user's actual process.

Page 4

Goals of Workshop

• To develop metrics for process and products that will reflect the interaction of users and information systems.

• To develop the metrics based on:
  - Cognitive task analyses of intelligence analysts
  - Previous experience in AQUAINT and NIMD evaluations
  - Expert consultation

• To deliver an evaluation package consisting of:
  - Process and product metrics
  - An evaluation methodology
  - A data set to use in the evaluation

Page 5

Hypothesis-driven development of metrics

Hypotheses – QA systems should …

Candidate metrics – What could we measure that would provide evidence to support or refute this hypothesis?

Measures – implementation of metric; depends on specific collection method

Collection methods:

• Questionnaires

• Mood meter

• System logs

• Report evaluation method

• System surveillance tool

• Cognitive workload instrument

• …
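
To make the hypothesis → candidate metric → measure chain concrete, here is a minimal sketch of one way that mapping could be recorded. The hypothesis text is from the slides; the data structure, example measures and collection methods below are illustrative assumptions, not the workshop's actual deliverable.

```python
# Illustrative only: one way to record the hypothesis -> candidate metric -> measure chain.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Metric:
    description: str        # the candidate metric ("what could we measure?")
    measure: str            # a concrete implementation of that metric
    collection: str         # questionnaire, system log, Glass Box, ...

@dataclass
class Hypothesis:
    hid: str
    statement: str          # "QA systems should ..."
    metrics: List[Metric] = field(default_factory=list)

h7 = Hypothesis(
    hid="H7",
    statement="Enable analysts to collect more data in less time",
    metrics=[
        Metric("Growth of shoebox over time",
               "saved snippets counted per 15-minute interval",       # assumed implementation
               "system logs"),
        Metric("Subjective assessment",
               "post-session questionnaire item on a 5-point scale",  # assumed implementation
               "questionnaire"),
    ],
)

for m in h7.metrics:
    print(f"{h7.hid}: {m.description} -> {m.measure} [{m.collection}]")
```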

Page 6

Hypotheses

Question answering systems should …

H1 Support information gathering with lower cognitive workload

H2 Assist in exploring more paths/hypotheses

H3 Enable production of higher quality reports

H4 Provide useful suggestions to the analyst

H5 Provide more good surprises than bad

H6 Enable more focus on analysis than data collection

H7 Enable analysts to collect more data in less time

H8 Reduce the time spent reading

H9 Identify gaps in the knowledge base

H10 Help the analyst recognize gaps in their thinking

H11 Provide context for information

H12 Provide context, continuity and coherence of dialogue

H13 Let analysts relocate previously seen materials

H14 Be easy to use

H15 Increase an analyst’s confidence in exploration and report

Page 7

Examples of candidate metrics

H1: Support gathering the same type of information with a lower cognitive workload
• # queries/questions
• % interactions where analyst takes initiative
• Number of non-content interactions with system (clarifications)
• Cognitive workload measurement

H7: Enable analysts to collect more data in less time
• Growth of shoebox over time
• Subjective assessment

H12: QA systems should provide context and continuity for the user – coherence of dialogue
• Similarity between queries – calculate shifts in dialog trails
• Redundancy of documents – count how often a snippet is found more than once
• Subjective assessment
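
A minimal sketch of two of the H12 candidate metrics above, under assumed representations: queries as plain strings and delivered snippets carrying stable IDs. The Jaccard similarity measure and the 0.2 shift threshold are illustrative choices, not the workshop's definitions.

```python
# Sketch of two H12 candidate metrics: query-to-query similarity as a proxy for shifts in
# the dialog trail, and snippet redundancy counting. Representations are assumptions.
from collections import Counter

def query_similarity(q1: str, q2: str) -> float:
    """Jaccard overlap of query terms: a simple proxy for continuity in a dialog trail."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def dialog_shifts(queries, threshold=0.2):
    """Count consecutive query pairs whose similarity falls below the (assumed) threshold."""
    return sum(1 for q1, q2 in zip(queries, queries[1:])
               if query_similarity(q1, q2) < threshold)

def snippet_redundancy(delivered_ids):
    """How many distinct snippets were delivered more than once during the session."""
    return sum(1 for count in Counter(delivered_ids).values() if count > 1)

queries = ["libya chemical weapons program",
           "libya chemical weapons production sites",
           "rabta plant current status"]
print(dialog_shifts(queries))                    # 1: the third query breaks the trail
print(snippet_redundancy(["d12", "d7", "d12"]))  # 1: snippet d12 was delivered twice
```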

Page 8

Top-level Design of the Workshop

• Systems
• Domain scenarios
• Collection
• Subjects
• On-site team
• Block design and on-site plan

Page 9

Design – Systems

• HITIQA – Tomek Strzalkowski
• Ferret – Sanda Harabagiu
• GINKO – Stefano Bertolo
• GNIST

Page 10

Design – Scenarios

ID   Topic
A    Indian Chemical Weapons Production and Delivery Systems
B    Libyan Chemical Weapons Program
C    Iranian Chemical Weapons Development and Impact
D    North Korean Chemical and Biological Weapons Research
E    Pakistani Chemical Agent Production
F    Current Status of Russia's Chemical Weapons Program
G    South African Chemical Agents Program Status
H    Assessment of Egypt's Biological Weapons

Page 11

Design – Document Collection

Page 12

Design – Subjects

• 8 reservists (7 Navy, 1 Army)
• Age: 30-54 yrs (M = 40.8)
• Educational background: 1 PhD, 4 Masters, 2 Bachelors, 1 HS
• Military service: 2.5-31 yrs (M = 18.3)
• Analysis experience: 0-23 yrs (M = 10.8)

Page 13

Design – On-Site Team

Name                        Affiliation       Roles
Emile Morse                 NIST              Trainer, Observer, Focus Groups, Technical
Paul Kantor                 Rutgers           Focus Groups
Diane Kelly                 UNC               Observer, Focus Groups, Technical
Robert Rittman              Rutgers           Observer
Aleksandra Sarcevic         Rutgers           Observer
Ying Sun                    Rutgers           Technical
Joe Konczal                 NIST              Technical
Stefano Bertolo             Cycorp            Trainer, Technical
Andy Hickl                  LCC               Trainer
Sharon Small / Hilda Hardy  Albany            Trainer
Sean Ryan                   Albany            Technical
8 Analysts                  7 Navy / 1 Army

Battelle Support: Antonio Sanfilippo, Ben Barnett, Laura Curtis, Troy Juntenen, Trina Pitcher
Other Visitors: Lynn Franklin, Peter LaMonica

Page 14

2-day Block Schedule

Day 1
Start Time   Duration   Activity
9:30         1:00       System Training
10:30        0:30       Check Ride
11:00        0:30       Exploration
11:30        1:15       LUNCH
12:45        0:15       Scenario Discussion
1:00         2:30       Scenario Task
3:30         1:00       Questionnaires & Debriefing

Day 2
Start Time   Duration   Activity
8:15         0:15       Scenario Discussion
8:30         2:30       Scenario Task
11:00        1:10       Questionnaires & Debriefing
12:10        1:00       LUNCH
1:10         0:50       Questionnaires & Debriefing (cont'd)
2:00         2:25       Cross-Evaluation

Page 15

Data Collection Instruments

• Questionnaires
  - Post-scenario (SCE)
  - Post-session (SES)
  - Post-system (SYS)
• Cross-evaluation of reports
• Cognitive workload
• Glass Box
• System logs
• Mood indicator
• Status reports
• Debriefing
• Observer notes
• Scenario difficulty assessment

Page 16

Question answering systems should … (coverage by instrument: Questionnaires, Cross-evaluation, NASA TLX, Glass Box, System Logs)

• Support information gathering with lower cognitive workload – NASA TLX, Glass Box, System Logs
• Assist in exploring more paths/hypotheses – Questionnaires
• Enable production of higher quality reports – Questionnaires, Cross-evaluation
• Provide useful suggestions to the analyst – Questionnaires, Glass Box, System Logs
• Provide more good surprises than bad – Questionnaires
• Enable more focus on analysis than data collection – Questionnaires
• Enable analysts to collect more data in less time – Questionnaires, Glass Box
• Reduce the time spent reading – Questionnaires, Glass Box
• Identify gaps in the knowledge base – Questionnaires, Glass Box, System Logs
• Help the analyst recognize gaps in their thinking – Questionnaires
• Provide context for information – Questionnaires, System Logs
• Provide context, continuity and coherence of dialogue – Questionnaires, Glass Box, System Logs
• Let analysts relocate previously seen materials – Questionnaires
• Be easy to use – Questionnaires, NASA TLX
• Increase an analyst's confidence in exploration and report – Questionnaires

Page 17

Questionnaires

Type                  # Questions   # Hypotheses covered
Post-Scenario (SCE)   6             --
Post-Session (SES)    15            8
Post-System (SYS)     27            12

• Coverage for 14/15 hypotheses
• Other question types:
  - All SCE questions relate to scenario content
  - 3 SYS questions on Readiness
  - 3 SYS questions on Training

Page 18

Questions for Hypothesis 7

Enable analysts to collect more data in less time

SES Q2: In comparison to other systems that you normally use for work tasks, how would you assess the length of time that it took to perform this task using the [X] system? [less … same … more]

SES Q13: If you had to perform a task like the one described in the scenario at work, do you think that having access to the [X] system would help increase the speed with which you find information? [not at all … a lot]

SYS Q23: Having the system at work would help me find information faster than I can currently find it.

SYS Q6: The system slows down my process of finding information.
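
A hedged sketch of how responses to these four items could be rolled up into a single H7 indicator. Mapping every item onto a common 1-5 scale and reverse-scoring SYS Q6 (because of its negative wording) are assumptions; the workshop's actual scoring rules are not stated on the slide.

```python
# Hedged sketch: rolling the four H7 items into one indicator.
def h7_support(responses, reverse_keyed=("SYS_Q6",)):
    """Mean of the H7-related items, with reverse-keyed items flipped (higher = more support)."""
    scores = [(6 - v) if item in reverse_keyed else v for item, v in responses.items()]
    return sum(scores) / len(scores)

print(h7_support({"SES_Q2": 4, "SES_Q13": 5, "SYS_Q23": 4, "SYS_Q6": 2}))  # 4.25
```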

Page 19

Question answering systems should … (results by instrument; SES = post-session questionnaire, SYS = post-system questionnaire)

• Support information gathering with lower cognitive workload – NASA TLX, Glass Box, System Logs
• Assist in exploring more paths/hypotheses – SES 1/1, SYS 3/5
• Enable production of higher quality reports – SES 2/2, SYS 3/3, Cross-evaluation
• Provide useful suggestions to the analyst – SES 1/1, Glass Box, System Logs
• Provide more good surprises than bad – SES 1/2
• Enable more focus on analysis than data collection – SYS 0/1
• Enable analysts to collect more data in less time – SES 2/2, SYS 1/2, Glass Box
• Reduce the time spent reading – SYS 0/1, Glass Box
• Identify gaps in the knowledge base – SES 0/1, Glass Box, System Logs
• Help the analyst recognize gaps in their thinking – SES 4/4
• Provide context for information – SYS 1/1, System Logs
• Provide context, continuity and coherence of dialogue – SES 1/1, SYS 2/3, Glass Box, System Logs
• Be easy to use – SES 2/2, SYS 2/6, NASA TLX
• Increase an analyst's confidence in exploration and report – SES 2/2, SYS 0/1

Page 20

Additional Analysis for Questionnaire Data – Factor Analysis

Four factors emerged:
• Factor 1: most questions
• Factor 2: time, navigation, training
• Factor 3: novel information, new way of searching
• Factor 4: skill in using the system improved

These factors distinguished between the four systems, with each system being most distinguished from the others (positively or negatively) on one factor.

GNIST was related to Factor 2. Positive for navigation and training; negative for time.
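
A sketch of this kind of factor analysis, assuming the questionnaire responses form a sessions-by-questions matrix. The slide does not say which implementation or rotation was used; the scikit-learn call and the fake data below are purely illustrative.

```python
# Illustrative factor analysis of questionnaire responses (fake data, assumed layout).
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(32, 27)).astype(float)   # fake Likert data: 32 sessions x 27 items

fa = FactorAnalysis(n_components=4, random_state=0)
scores = fa.fit_transform(X)       # per-session scores on the four factors
loadings = fa.components_          # 4 x 27: how strongly each question loads on each factor

for k, row in enumerate(loadings, start=1):
    top_questions = np.argsort(np.abs(row))[::-1][:3] + 1
    print(f"Factor {k}: strongest-loading questions {top_questions}")
```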

Page 21

Cross-Evaluation Criteria

Subjects rated the reports (including their own) on seven characteristics:

• Covers the important ground
• Avoids the irrelevant materials
• Avoids redundant information
• Includes selective information
• Is well organized
• Reads clearly and easily
• Overall rating

Page 22

Cross-Evaluation Results

Effect       F      Sig.
Judge        15.32  0.00 **
Author       12.26  0.00 **
System       6.33   0.00 **
Scenario     2.94   0.01 **
Observer     1.76   0.15
Self*judge   5.59   0.00 **

System-by-system comparison:
          α        β        γ
β         0.06
γ         0.37*    0.30
GNIST    -0.07    -0.14    -0.44**
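
The general shape of the effects analysis behind this table can be sketched as an ordinary least squares model with categorical effects for judge, author, system, scenario and observer, assuming the ratings are in a long-format table with one row per (judge, report) pair. The file name, column names, and treating the Self*judge term as a simple indicator are assumptions for the sketch, not the study's documented model.

```python
# Sketch only: effects analysis of cross-evaluation ratings under an assumed data layout.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv("cross_eval_ratings.csv")   # hypothetical file: rating, judge, author, system,
                                             # scenario, observer, is_self (judge == author)

model = ols(
    "rating ~ C(judge) + C(author) + C(system) + C(scenario) + C(observer) + is_self",
    data=df,
).fit()
print(sm.stats.anova_lm(model, typ=2))       # F and significance per effect, as in the table
```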

Page 23

(Hypothesis/instrument results matrix repeated; see Page 19.)

Page 24

NASA TLX – Cognitive Load

             α      β      γ      GNIST   Sig.
Mental       9.93   14.94  19.13  7.80    γ > β > {α, GNIST}
Physical     6.50   2.50   2.13   2.20    α > {β, γ, GNIST}
Temporal     13.36  17.50  9.31   25.50   GNIST > β > α > γ
Performance  10.79  7.06   12.38  14.80   {GNIST, γ} > α > β
Frustration  16.57  13.88  8.81   12.30   {α, β} > GNIST > γ
Effort       16.50  15.31  11.19  21.60   GNIST > {α, β} > γ
TLX Score    4.91   4.75   4.20   5.61    {GNIST, α, β} > γ
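
For reference, the textbook weighted NASA TLX composite for one participant is computed as below. How the "TLX Score" row above was scaled from the six subscales is not stated on the slide, so treat this only as the standard formula, not a reproduction of the table.

```python
# Standard weighted NASA TLX composite (reference sketch, not the slide's scaling).
def tlx_composite(ratings, weights):
    """ratings: subscale -> workload rating; weights: subscale -> pairwise-comparison tally."""
    assert sum(weights.values()) == 15, "weights come from 15 pairwise comparisons"
    return sum(ratings[s] * weights[s] for s in ratings) / 15.0

ratings = {"Mental": 12, "Physical": 3, "Temporal": 14,
           "Performance": 8, "Effort": 13, "Frustration": 9}
weights = {"Mental": 4, "Physical": 0, "Temporal": 3,
           "Performance": 2, "Effort": 4, "Frustration": 2}
print(tlx_composite(ratings, weights))   # 11.73...
```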

Page 25

(NASA TLX table repeated from Page 24.)

Page 26

(Hypothesis/instrument results matrix repeated; see Page 19.)

Page 27

Glass Box Data

Types of data captured:

• Keystrokes

• Mouse moves

• Session start/stop times

• Task times

• Application focus time

• Copy/paste events

• Screen capture & audio track
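
A small sketch of one way the Glass Box focus events could be turned into the per-application time allocation shown on the next slide, assuming a simple (timestamp, application) event stream. The real Glass Box schema is richer and is not described on the slide.

```python
# Sketch: per-application focus time from an assumed (seconds, app) focus-change stream.
def focus_time(events, session_end):
    """events: chronologically sorted (seconds, app) focus changes -> app: seconds in focus."""
    totals = {}
    for (t, app), (t_next, _) in zip(events, events[1:] + [(session_end, None)]):
        totals[app] = totals.get(app, 0) + (t_next - t)
    return totals

events = [(0, "QA system"), (600, "Word"), (900, "QA system"), (2400, "Word")]
print(focus_time(events, session_end=3000))   # {'QA system': 2100, 'Word': 900}
```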

Page 28

Glass Box Data – Allocation of session time

[Chart: allocation of session time in minutes (0-100) across application focus, e.g., the QA system, Word, and other applications.]

Page 29

(Hypothesis/instrument results matrix repeated; see Page 19.)

Page 30

System log data

• # queries/questions
• 'Good' queries/questions
• Total documents delivered
• # unique documents delivered
• % unique documents delivered
• # documents copied from
• # copies
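
A minimal sketch of the derived metrics in this list, assuming the raw per-session counts have already been extracted from the system logs. The 'good'-question counts are taken as given here; that judgment is not something the log itself records.

```python
# Sketch: derived per-session log metrics from assumed raw counts.
def log_metrics(n_questions, n_good, docs_delivered, unique_docs, docs_copied_from, n_copies):
    return {
        "pct_good_questions": 100.0 * n_good / n_questions if n_questions else 0.0,
        "pct_unique_docs": 100.0 * unique_docs / docs_delivered if docs_delivered else 0.0,
        "copies_per_source_doc": n_copies / docs_copied_from if docs_copied_from else 0.0,
    }

print(log_metrics(n_questions=20, n_good=12, docs_delivered=150,
                  unique_docs=90, docs_copied_from=25, n_copies=40))
# {'pct_good_questions': 60.0, 'pct_unique_docs': 60.0, 'copies_per_source_doc': 1.6}
```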

Page 31

# Questions vs. # 'Good' Questions

[Chart: number of questions asked vs. number judged 'good' for each system (scale 0-20). Percentage of 'good' questions per system: 66%, 38%, 60%, 35%.]

Page 32

(Hypothesis/instrument results matrix repeated; see Page 19, with a relative-cost row added.)

Relative cost of the instruments – Questionnaires: $, Cross-evaluation: $$, NASA TLX: $, Glass Box: $$$, System Logs: $$$

Page 33

What Next?

• Query trails are being worked on by LCC, Rutgers and others; available as part of the deliverable.

• Scenario difficulty has become an independent effort with potential impact on both NIMD and AQUAINT.

• Thinking about an alternative implementation of the mood indicator.

• Each project team employs the metrics and methodology, or AQUAINT sponsors large-scale group evaluations using the metrics and methodology, or something in between.

Page 34

Issues to be Addressed

• What constitutes a replication of the method? The whole thing? A few hypotheses with all data methods? All hypotheses with a few data methods?

• Costs associated with data collection methods

• Is a comparison needed?
  - Baseline: if so, is Google the right one? Maybe the 'best so far', to keep the bar high.
  - Past results: can measure progress over time, but requires iterative application.

• 'Currency' of data and scenarios: analysts are sensitive to staleness. What is the effect of updating on repeatability?

Page 35

Backups

Page 36

Report Cross-Evaluation Results

Effect       Levels                                   F      Sig.
Judge        7* analysts (as judges): 1-5, 7, 8       15.32  0.00 **
Author       7* analysts (as authors): 1-5, 7, 8      12.26  0.00 **
System       α, β, γ, GNIST                           6.33   0.00 **
Scenario     8 scenarios: A-H                         2.94   0.01 **
Observer     4 observers: A, D, E, R                  1.76   0.15
Self*judge   Self judgment                            5.59   0.00 **

Page 37

Summary of Findings

Question answering systems should … (evidence by instrument)
• H1 Support information gathering with lower cognitive workload – NASA TLX, Glass Box, System Logs
• H2 Assist in exploring more paths/hypotheses – Questionnaires
• H3 Enable production of higher quality reports – Questionnaires, Cross-evaluation
• H4 Provide useful suggestions to the analyst – Questionnaires, Glass Box, System Logs
• H5 Provide more good surprises than bad – Questionnaires
• H6 Enable more focus on analysis than data collection – Questionnaires
• H7 Enable analysts to collect more data in less time – Questionnaires, Glass Box
• H8 Reduce the time spent reading – Questionnaires, Glass Box
• H9 Identify gaps in the knowledge base – Questionnaires, Glass Box, System Logs
• H10 Help the analyst recognize gaps in their thinking – Questionnaires
• H11 Provide context for information – Questionnaires, System Logs
• H12 Provide context, continuity and coherence of dialogue – Questionnaires, Glass Box, System Logs
• H14 Be easy to use – Questionnaires, NASA TLX
• H15 Increase an analyst's confidence in exploration and report – Questionnaires

Relative cost of the instruments – Questionnaires: $, Cross-evaluation: $$, NASA TLX: $, Glass Box: $$$, System Logs: $$$