An Investigation of Evaluation Metrics for Analytic Question Answering
ARDA Metrics Challenge: PI Meeting
Emile Morse, October 7, 2004
Outline
• Motivation & Goals
• Hypothesis-driven development of metrics
• Design – collection, subjects, scenarios
• Data Collection & Results
• Summary and issues
Motivation
Much progress in IR has been attributed to community evaluations using metrics of precision and recall with common tasks and common data.
There is no corresponding set of evaluation criteria for the interaction of users with these systems.
While system performance is crucial, the utility of the system to the user is equally critical.
The lack of evaluation criteria prevents the comparison of systems based on utility. Acquisition of new systems is therefore based on system performance alone, which frequently does NOT reflect how systems will work in the user's actual process.
Goals of Workshop
To develop metrics for process and products that reflect the interaction of users and information systems.
To develop the metrics based on:
• Cognitive task analyses of intelligence analysts
• Previous experience in AQUAINT and NIMD evaluations
• Expert consultation
To deliver an evaluation package consisting of:
• Process and product metrics
• An evaluation methodology
• A data set to use in the evaluation
Hypothesis-driven development of metrics
Hypotheses – QA systems should …
Candidate metrics – what could we measure that would provide evidence to support or refute each hypothesis?
Measures – implementation of a metric; depends on the specific collection method
Collection methods:
• Questionnaires
• Mood meter
• System logs
• Report evaluation method
• System surveillance tool
• Cognitive workload instrument
• …
Hypotheses: Question answering systems should …
H1 Support information gathering with lower cognitive workload
H2 Assist in exploring more paths/hypotheses
H3 Enable production of higher quality reports
H4 Provide useful suggestions to the analyst
H5 Provide more good surprises than bad
H6 Enable more focus on Analysis than data collection
H7 Enable analysts to collect more data in less time
H8 Reduce the time spent reading
H9 Identify gaps in the knowledge base
H10 Help the analyst recognize gaps in their thinking
H11 Provide context for information
H12 Provide context, continuity and coherence of dialogue
H13 Let analysts relocate previously seen materials
H14 Be easy to use
H15 Increase an analyst’s confidence in exploration and report
Examples of candidate metrics
H1: Support gathering the same type of information with a lower cognitive workload
• # queries/questions
• % of interactions where the analyst takes the initiative
• Number of non-content interactions with the system (clarifications)
• Cognitive workload measurement
H7: Enable analysts to collect more data in less time
• Growth of shoebox over time
• Subjective assessment
H12: QA systems should provide context and continuity for the user – coherence of dialogue
• Similarity between queries – calculate shifts in dialog trails (see the sketch below)
• Redundancy of documents – count how often a snippet is found more than one time
• Subjective assessment
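To make the H12 query-similarity metric concrete, here is a minimal sketch of one way to count shifts in a dialog trail. The tokenization, the Jaccard measure, and the 0.2 threshold are illustrative assumptions, not the implementation used in the workshop.

```python
# Illustrative sketch of the H12 "shifts in dialog trails" metric.
# Jaccard overlap of query terms and the 0.2 threshold are assumed choices.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two term sets (1.0 = identical, 0.0 = disjoint)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def dialog_shifts(queries: list[str], threshold: float = 0.2) -> int:
    """Count topic shifts: consecutive query pairs whose term overlap
    falls below the threshold."""
    terms = [set(q.lower().split()) for q in queries]
    return sum(1 for prev, cur in zip(terms, terms[1:])
               if jaccard(prev, cur) < threshold)

session = [
    "libya chemical weapons production sites",
    "libya chemical weapons delivery systems",
    "rabta plant pharmaceutical cover story",
]
print(dialog_shifts(session))  # 1: the third query starts a new subtopic
```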
Top-level Design of the Workshop
• Systems
• Domain scenarios
• Document collection
• Subjects
• On-site team
• Block design and on-site plan
Design – Systems
HITIQA – Tomek Strzalkowski
Ferret – Sanda Harabagiu
GINKO – Stefano Bertolo
GNIST
Design – Scenarios
ID Topic
A Indian Chemical Weapons Production and Delivery Systems
B Libyan Chemical Weapons Program
C Iranian Chemical Weapons Development and Impact
D North Korean Chemical and Biological Weapons Research
E Pakistani Chemical Agent Production
F Current Status of Russia’s Chemical Weapons Program
G South African Chemical Agents Program Status
H Assessment of Egypt’s Biological Weapons
Design – Document Collection
Design – Subjects
8 reservists (7 Navy; 1 Army)
Age: 30-54 yrs (M=40.8)
Educational background: 1 PhD; 4 Masters; 2 Bachelors; 1 HS
Military service: 2.5-31 yrs (M=18.3)
Analysis experience: 0-23 yrs (M=10.8)
Design – On-Site Team
Name | Affiliation | Roles (Trainers, Observers, Focus Groups, Technical)
Emile Morse NIST X X X X
Paul Kantor Rutgers X
Diane Kelly UNC X X X
Robert Rittman Rutgers X
Aleksandra Sarcevic Rutgers X
Ying Sun Rutgers X
Joe Konczal NIST X
Stefano Bertolo Cycorp X X
Andy Hickl LCC X
Sharon Small/Hilda Hardy Albany X
Sean Ryan Albany X
8 Analysts 7 Navy/1 Army
Battelle Support: Antonio Sanfilippo, Ben Barnett, Laura Curtis, Troy Juntenen, Trina Pitcher
Other Visitors: Lynn Franklin, Peter LaMonica
2-day Block Schedule
Day 1
Start | Duration | Activity
9:30 | 1:00 | System Training
10:30 | 0:30 | Check Ride
11:00 | 0:30 | Exploration
11:30 | 1:15 | LUNCH
12:45 | 0:15 | Scenario Discussion
1:00 | 2:30 | Scenario Task
3:30 | 1:00 | Questionnaires & Debriefing
Day 2
Start | Duration | Activity
8:15 | 0:15 | Scenario Discussion
8:30 | 2:30 | Scenario Task
11:00 | 1:10 | Questionnaires & Debriefing
12:10 | 1:00 | LUNCH
1:10 | 0:50 | Questionnaires & Debriefing (cont'd)
2:00 | 2:25 | Cross-Evaluation
Data Collection Instruments
• Questionnaires: Post-scenario (SCE), Post-session (SES), Post-system (SYS)
• Cross-evaluation of reports
• Cognitive workload (NASA TLX)
• Glass Box
• System logs
• Mood indicator
• Status reports
• Debriefing
• Observer notes
• Scenario difficulty assessment
Question answering systems should … (columns: Questionnaires, Cross-evaluation, NASA TLX, Glass Box, System Logs)
Support information gathering with lower cognitive workload X X X
Assist in exploring more paths/hypotheses X
Enable production of higher quality reports X X
Provide useful suggestions to the analyst X X X
Provide more good surprises than bad X
Enable more focus on analysis than data collection X
Enable analysts to collect more data in less time X X
Reduce the time spent reading X X
Identify gaps in the knowledge base X X X
Help the analyst recognize gaps in their thinking X
Provide context for information X X
Provide context, continuity and coherence of dialogue X X X
Let analysts relocate previously seen materials X
Be easy to use X X
Increase an analyst’s confidence in exploration and report X
Questionnaires
Type | # Questions | # Hypotheses covered
Post-Scenario (SCE) 6 --
Post-Session (SES) 15 8
Post-System (SYS) 27 12
• Coverage for 14/15 hypotheses
• Other question types:
• All SCE questions relate to scenario content
• 3 SYS questions on Readiness
• 3 SYS questions on Training
Questions for Hypothesis 7
Enable analysts to collect more data in less time
SES Q2: In comparison to other systems that you normally use for work tasks, how would you assess the length of time that it took to perform this task using the [X] system? [less … same … more]
SES Q13: If you had to perform a task like the one described in the scenario at work, do you think that having access to the [X] system would help increase the speed with which you find information? [not at all … a lot]
SYS Q23: Having the system at work would help me find information faster than I can currently find it.
SYS Q6: The system slows down my process of finding information.
Question answering systems should … (columns: Questionnaires [SES, SYS], Cross-evaluation, NASA TLX, Glass Box, System Logs)
Support information gathering with lower cognitive workload X X X
Assist in exploring more paths/hypotheses 1/1 3/5
Enable production of higher quality reports 2/2 3/3 X
Provide useful suggestions to the analyst 1/1 X X
Provide more good surprises than bad 1/2
Enable more focus on analysis than data collection 0/1
Enable analysts to collect more data in less time 2/2 1/2 X
Reduce the time spent reading 0/1 X
Identify gaps in the knowledge base 0/1 X X
Help the analyst recognize gaps in their thinking 4/4
Provide context for information 1/1 X
Provide context, continuity and coherence of dialogue 1/1 2/3 X X
Be easy to use 2/2 2/6 X
Increase an analyst’s confidence in exploration and report 2/2 0/1
Additional Analysis for Questionnaire Data – Factor Analysis
Four factors emerged:
• Factor 1: most questions
• Factor 2: time, navigation, training
• Factor 3: novel information, new way of searching
• Factor 4: skill in using the system improved
These factors distinguished between the four systems, with each system being most distinguished from the others (positively or negatively) on one factor.
GNIST was related to Factor 2: positive for navigation and training; negative for time. A sketch of this kind of analysis follows.
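As a rough illustration of how such a factor structure could be recovered, the sketch below fits a four-factor model with scikit-learn. The input file name and the one-row-per-analyst-session layout are assumptions; this is not the analysis code used in the workshop.

```python
# Illustrative sketch: four-factor analysis of questionnaire responses.
# "questionnaires.csv" (one row per analyst-session, one column per
# question) is a hypothetical file, not the workshop's data set.
import numpy as np
from sklearn.decomposition import FactorAnalysis

responses = np.loadtxt("questionnaires.csv", delimiter=",")

fa = FactorAnalysis(n_components=4, random_state=0)
scores = fa.fit_transform(responses)   # factor scores, one row per session
loadings = fa.components_              # shape (4, n_questions)

# The questions loading most strongly on a factor suggest its interpretation
# (e.g., "time, navigation, training" for Factor 2 above).
for i, row in enumerate(loadings, start=1):
    top = np.argsort(-np.abs(row))[:5]
    print(f"Factor {i}: strongest questions {top.tolist()}")
```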
Cross Evaluation Criteria
Subjects rated the reports (including their own) on seven characteristics (see the aggregation sketch after the list):
• Covers the important ground
• Avoids the irrelevant materials
• Avoids redundant information
• Includes selective information
• Is well organized
• Reads clearly and easily
• Overall rating
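A minimal sketch of how the seven ratings might be aggregated per report, keeping self-judgments separate (they enter the analysis as the Self*judge effect). The criterion keys and the unweighted mean are assumed simplifications, not the workshop's scoring scheme.

```python
# Illustrative sketch: aggregating the seven cross-evaluation criteria.
# Criterion names and the unweighted mean are assumptions for illustration.
from statistics import mean

CRITERIA = ["coverage", "avoids_irrelevant", "avoids_redundant",
            "selective", "organized", "readable", "overall"]

def report_score(ratings: dict[str, float]) -> float:
    """Unweighted mean of the seven criterion ratings for one judgment."""
    return mean(ratings[c] for c in CRITERIA)

def peer_vs_self(judgments: list[tuple[str, str, dict]]) -> tuple[float, float]:
    """judgments: (judge, author, ratings) triples. Returns the mean score
    from peer judgments and from self-judgments, since analysts also
    rated their own reports."""
    peer = [report_score(r) for judge, author, r in judgments if judge != author]
    self_ = [report_score(r) for judge, author, r in judgments if judge == author]
    return mean(peer), mean(self_)
```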
Cross-evaluation Results
Effect | F | Sig.
Judge | 15.32 | 0.00 **
Author | 12.26 | 0.00 **
System | 6.33 | 0.00 **
Scenario | 2.94 | 0.01 **
Observer | 1.76 | 0.15
Self*judge | 5.59 | 0.00 **

Pairwise system comparisons (row vs. column):
        α       β       γ
β       0.06
γ       0.37*   0.30
GNIST   -0.07   -0.14   -0.44**
(A sketch of this decomposition follows.)
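For readers who want to reproduce this kind of decomposition on their own data, a sketch using statsmodels follows. The file name and column names are assumptions; the model form (categorical main effects plus a self-judgment term) mirrors the effects listed above.

```python
# Illustrative sketch: ANOVA over cross-evaluation scores with statsmodels.
# "cross_eval_ratings.csv" and its columns (judge, author, system, scenario,
# observer, score) are hypothetical names.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv("cross_eval_ratings.csv")
df["self"] = (df["judge"] == df["author"]).astype(int)  # Self*judge term

model = ols("score ~ C(judge) + C(author) + C(system) + C(scenario)"
            " + C(observer) + self:C(judge)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # F and p-value per effect
```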
NASA TLX – Cognitive Load
Subscale | α | β | γ | GNIST | Sig.
Mental | 9.93 | 14.94 | 19.13 | 7.80 | γ > β > {α, GNIST}
Physical | 6.50 | 2.50 | 2.13 | 2.20 | α > {β, γ, GNIST}
Temporal | 13.36 | 17.50 | 9.31 | 25.50 | GNIST > β > α > γ
Performance | 10.79 | 7.06 | 12.38 | 14.80 | {GNIST, γ} > α > β
Frustration | 16.57 | 13.88 | 8.81 | 12.30 | {α, β} > GNIST > γ
Effort | 16.50 | 15.31 | 11.19 | 21.60 | GNIST > {α, β} > γ
TLX Score | 4.91 | 4.75 | 4.20 | 5.61 | {GNIST, α, β} > γ
(The standard TLX computation is sketched below.)
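For reference, the standard weighted NASA TLX computation is sketched below; the subscale ratings and pairwise-comparison weights in the example are invented, and the scaling evidently differs from the scores reported in the table above.

```python
# Illustrative sketch: the textbook weighted NASA TLX score.
# Ratings are 0-100 per subscale; weights (0-5, summing to 15) come from
# 15 pairwise comparisons. Example values are invented.
SUBSCALES = ["mental", "physical", "temporal",
             "performance", "effort", "frustration"]

def tlx_score(ratings: dict[str, float], weights: dict[str, int]) -> float:
    """Weighted workload: sum(rating * weight) / 15, on the 0-100 scale."""
    assert sum(weights.values()) == 15, "weights from 15 pairwise comparisons"
    return sum(ratings[s] * weights[s] for s in SUBSCALES) / 15

ratings = {"mental": 70, "physical": 20, "temporal": 55,
           "performance": 40, "effort": 60, "frustration": 35}
weights = {"mental": 4, "physical": 1, "temporal": 3,
           "performance": 2, "effort": 3, "frustration": 2}
print(tlx_score(ratings, weights))  # 53.0
```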
Glass Box Data
Types of data captured (see the focus-time sketch after the list):
• Keystrokes
• Mouse moves
• Session start/stop times
• Task times
• Application focus time
• Copy/paste events
• Screen capture & audio track
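A minimal sketch, under an assumed event format, of how application focus time (the basis of the session-time allocation chart on the next slide) can be derived from Glass Box focus-change events.

```python
# Illustrative sketch: per-application focus time from focus-change events.
# The (timestamp_seconds, app_name) event format is an assumption; the real
# Glass Box capture is far richer (keystrokes, copy/paste, screen video).
from collections import defaultdict

def focus_minutes(events: list[tuple[float, str]],
                  session_end: float) -> dict[str, float]:
    """events must be sorted by time; returns minutes of focus per app."""
    totals: dict[str, float] = defaultdict(float)
    for (t, app), (t_next, _) in zip(events, events[1:] + [(session_end, "")]):
        totals[app] += (t_next - t) / 60.0
    return dict(totals)

events = [(0, "QA system"), (900, "Word"), (1500, "QA system")]
print(focus_minutes(events, session_end=2.5 * 3600))
# {'QA system': 140.0, 'Word': 10.0} for a 2.5-hour scenario task
```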
Glass Box Data – Allocation of session time
[Chart: time in minutes (0–100) each session spent in each application: the QA system, Word, and GNIST.]
System log data
• # queries/questions
• # 'good' queries/questions
• Total documents delivered
• # unique documents delivered
• % unique documents delivered
• # documents copied from
• # copies
(A sketch of these derived measures follows.)
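The derived measures above are simple counts; the sketch below computes them from a minimal assumed log model. Treating a question as 'good' when the analyst copied from its results is an illustrative proxy, not the workshop's definition.

```python
# Illustrative sketch: derived system-log measures from a minimal log model.
# Field names and the 'good question' proxy (copies > 0) are assumptions.

def log_metrics(queries: list[dict]) -> dict[str, float]:
    """queries: one dict per query/question with keys
    'docs_delivered' (list of doc ids) and 'copies' (int)."""
    delivered = [d for q in queries for d in q["docs_delivered"]]
    unique = set(delivered)
    return {
        "n_questions": len(queries),
        "n_good_questions": sum(1 for q in queries if q["copies"] > 0),
        "total_docs_delivered": len(delivered),
        "unique_docs_delivered": len(unique),
        "pct_unique": 100 * len(unique) / len(delivered) if delivered else 0.0,
        "n_copies": sum(q["copies"] for q in queries),
    }

log = [{"docs_delivered": ["d1", "d2"], "copies": 1},
       {"docs_delivered": ["d2", "d3"], "copies": 0}]
print(log_metrics(log))  # 2 questions, 1 'good', 4 delivered, 3 unique (75%)
```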
# Questions vs. # 'Good' Questions
[Chart: number of questions vs. number of 'good' questions per system (α, β, γ, GNIST); proportion of questions rated 'good': 66%, 38%, 60%, 35%.]
Summary (matrix as above) with the relative cost of each collection method: Questionnaires $, Cross-evaluation $$, NASA TLX $, Glass Box $$$, System Logs $$$.
What Next?
Query trails are being worked on by LCC, Rutgers and others; available as part of the deliverable.
Scenario difficulty has become an independent effort with potential impact on both NIMD and AQUAINT.
Thinking about an alternative implementation of the mood indicator.
Adoption options:
• Each project team employs the metrics and methodology
• AQUAINT sponsors large-scale group evaluations using the metrics and methodology
• Or something in between
Issues to be Addressed
What constitutes a replication of the method? The whole thing? A few hypotheses with all data methods? All hypotheses with a few data methods?
Costs associated with data collection methods.
Is a comparison needed?
• Baseline – if so, is Google the right one? Maybe the 'best so far' to keep the bar high.
• Past results – can measure progress over time, but requires iterative application.
'Currency' of data and scenarios: analysts are sensitive to staleness. What is the effect of updating on repeatability?
Backups
Report Cross-Evaluation Results
Effect | Levels | F | Sig.
Judge | 7* analysts (as judges): 1-5, 7, 8 | 15.32 | 0.00 **
Author | 7* analysts (as authors): 1-5, 7, 8 | 12.26 | 0.00 **
System | α, β, γ, GNIST | 6.33 | 0.00 **
Scenario | 8 scenarios: A–H | 2.94 | 0.01 **
Observer | 4 observers: A, D, E, R | 1.76 | 0.15
Self*judge | Self judgment | 5.59 | 0.00 **
Summary of Findings
Question answering systems should … (columns: Questionnaires, Cross-evaluation, NASA TLX, Glass Box, System Logs)
H1 Support information gathering with lower cognitive workload X X X
H2 Assist in exploring more paths/hypotheses X
H3 Enable production of higher quality reports X X
H4 Provide useful suggestions to the analyst X X X
H5 Provide more good surprises than bad X
H6 Enable more focus on analysis than data collection X
H7 Enable analysts to collect more data in less time X X
H8 Reduce the time spent reading X X
H9 Identify gaps in the knowledge base X X X
H10 Help the analyst recognize gaps in their thinking X
H11 Provide context for information X X
H12 Provide context, continuity and coherence of dialogue X X X
H14 Be easy to use X X
H15 Increase an analyst's confidence in exploration and report X
Relative cost: Questionnaires $, Cross-evaluation $$, NASA TLX $, Glass Box $$$, System Logs $$$