Is Question Answering an Acquired Skill?

Soumen Chakrabarti
IIT Bombay

With Ganesh Ramakrishnan, Deepa Paranjpe, Pushpak Bhattacharyya

QA Chakrabarti

The query-response gap

Language models for Web corpus and Web queries radically different (Church, 2003-4)
Not surprising, because
• Users are conditioned to drop verbs, prepositions and articles (anything interesting)
• Queries inherently seek to express a “missing piece”, documents don’t
IR vs. DB
• DB queries clearly indicate what’s given and what’s missing in a query
• IR systems do not (yet)

Web search and QA

Information need – words relating “things” + “thing” aliases = telegraphic Web queries
• Cheapest laptop with wireless → best price laptop 802.11
• Why is the sky blue? → sky blue because
• When was the Space Needle built? → “Space Needle” history
People are used to asking telegraphic queries
• Fix keywords you are sure of
• Guess document features that will answer the missing piece in your query

Factoid QA

Specialize given domain to a token related to ground constants in the query
• What animal is Winnie the Pooh?
  • hyponym(“animal”) NEAR “Winnie the Pooh”
• When was television invented?
  • instance-of(“time”) NEAR “television” NEAR synonym(“invented”)
FIND x “NEAR” GroundConstants(question) WHERE x IS-A Atype(question)
• Ground constants: Winnie the Pooh, television
• Atypes: animal, time
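
The FIND x “NEAR” GroundConstants(question) WHERE x IS-A Atype(question) template can be illustrated with a minimal sketch. The IS_A table, the window size and the example passage are all hypothetical stand-ins chosen for illustration; a real system would consult WordNet or a similar hierarchy and a proper tokenizer.

```python
# Hypothetical toy IS-A table (assumption); not taken from the slides.
IS_A = {
    "bear": {"animal"},
    "honey": {"food"},
    "1926": {"time"},
}

def find_answers(tokens, ground_constants, atype, window=5):
    """Return tokens that IS-A `atype` and lie within `window` tokens
    of some ground constant from the question."""
    anchors = [i for i, tok in enumerate(tokens) if tok in ground_constants]
    hits = []
    for i, tok in enumerate(tokens):
        is_a_match = atype in IS_A.get(tok, set())
        is_near = any(abs(i - j) <= window for j in anchors)
        if is_a_match and is_near:
            hits.append(tok)
    return hits

# Hypothetical passage for "What animal is Winnie the Pooh?"
passage = "Winnie-the-Pooh is a bear who loves honey".split()
answers = find_answers(passage, {"Winnie-the-Pooh"}, "animal")
print(answers)
```

Here the hard IS-A test and the fixed window are the simplest possible choices; the later slides replace both with learned, soft versions.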

A relational view of QA

Entity class or atype may be expressed by
• A finite IS-A hierarchy (e.g. WordNet, TAP)
• A surface pattern matching infinitely many strings (e.g. “digit+”, “Xx+”, “preceded by a preposition”)
Match selectors, specialize atype to answer tokens
[Figure: the question is split into atype clues and selectors. Labels: atype clues, IS-A, entity class, attribute or column name (“locate which column to read”); selectors, direct syntactic match, question words in the answer passage (“limit search to certain rows”); “answer zone”.]

But who provides is-a info?

Compiled knowledge bases (WordNet, CYC)
Automatic “soft” compilations
• Google sets
• KnowItAll
• BioText
Basic tricks
• Do jordan and basketball cooccur more often than you’d expect?
• Small phrase probes like “actor Willis”
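
One common way to quantify “cooccur more often than you’d expect” is pointwise mutual information (PMI). The slides do not say which statistic was used, so the sketch below is only an illustration of the idea; the counts are hypothetical.

```python
import math

def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information (base 2) from document counts.
    Positive values mean x and y co-occur more often than independence predicts."""
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: documents containing both terms, each term, and all documents.
score = pmi(n_xy=50, n_x=200, n_y=1000, n_total=1_000_000)
print(score > 0)
```

A score near zero would suggest the two words are unrelated; a large positive score is weak evidence for an association such as is-a.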

Benefits of the relational view

“Scaling up by dumbing down”
• Next stop after vector-space
• Far short of real knowledge representation and inference
• Barely getting practical at (near) Web scale
Can set up as a learning problem: train with questions (query logs) and answers in context
Transparent, self-tuning, easy to deploy
• Feature extractors used in entity taggers
• Relational/graphical learning on features

Broad strategy

Learn soft patterns of correlation between question features and answer context
Use models to index corpus with atype annotations
Given query, assign a soft reward to all atype patterns
Search efficiently for passages containing promising tokens
Score passages and report best token sequences

What TREC QA feels like

How to assemble chunker, parser, POS and NE tagger, WordNet, WSD, … into a QA system?
Experts get much insight from old QA pairs
• Matching an upper-cased term adds a 60% bonus … for multi-words terms and 30% for single words
• Matching a WordNet synonym … discounts by 10% (lower case) and 50% (upper case)
• Lower-case term matches after Porter stemming are discounted 30%; upper-case matches 70%

Talk outline

Relational interpretation of QA
Motivation for a “clean-room” IE+ML system
Learning to map between questions and answers using is-a hierarchies and IE-style surface patterns
• Can handle prominent finite set of atypes: person, place, time, measurements,…
Extending to arbitrary atype specializations
• Required for what… and which… questions
Ongoing work and concluding remarks

Feature + Soft match

FIND x “NEAR” GroundConstants(question) WHERE x IS-A Atype(question)
No fixed question or answer type system
Convert “x IS-A Atype(question)” to a soft match “DoesAtypeMatch(x, question)”
[Figure: the question passes through IE-style surface feature extractors to give a question feature vector; answer tokens in the passage pass through WordNet hypernym feature extractors and IE-style surface feature extractors to give a snippet feature vector; a joint distribution is learned over the two vectors.]

Feature extraction: Intuition

How fast can a cheetah run?
• A cheetah can chase its prey at up to 90 km/h
How fast does light travel?
• Nothing moves faster than 186,000 miles per hour, the speed of light
[Figure: question prefixes (how fast, how many, how far, how rich, who wrote, who first) linked to answer-token features: rate#n#2, abstraction#n#6, NNS, magnitude_relation#n#1, mile#n#3, linear_unit#n#1, measure#n#3, definite_quantity#n#1, paper_money#n#1, currency#n#1; writer, composer, artist, musician; NNP, person; explorer.]

Feature extractors

Question features: 1, 2, 3-token sequences starting with standard wh-words
Passage surface features: hasCap, hasXx, isAbbrev, hasDigit, isAllDigit, lpos, rpos,…
Passage WordNet features: all noun hypernym ancestors of all senses of token
Get top 300 passages from IR engine
For each token invoke feature extractors
Label = 1 if token is in answer span, 0 o/w
Question vector xq, passage vector xp
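
The question and surface feature extractors described above could look roughly like the following sketch. The feature names come from the slide, but their exact definitions are not given there, so the regular expressions and the wh-word list below are assumptions; lpos, rpos and the WordNet hypernym features are omitted.

```python
import re

# Assumed list of "standard wh-words"; the slides do not enumerate them.
WH_WORDS = {"what", "which", "who", "when", "where", "how", "why", "name"}

def question_features(question):
    """1-, 2- and 3-token sequences starting at the first wh-word."""
    tokens = re.findall(r"\w+", question.lower())
    for i, tok in enumerate(tokens):
        if tok in WH_WORDS:
            return ["_".join(tokens[i:i + n]) for n in (1, 2, 3) if i + n <= len(tokens)]
    return []

def surface_features(token):
    """Boolean surface features of one passage token (definitions are assumptions)."""
    return {
        "hasCap": any(c.isupper() for c in token),
        "hasXx": bool(re.fullmatch(r"[A-Z][a-z]+", token)),
        "isAbbrev": bool(re.fullmatch(r"(?:[A-Za-z]\.)+", token)),
        "hasDigit": any(c.isdigit() for c in token),
        "isAllDigit": token.isdigit(),
    }

print(question_features("How fast can a cheetah run?"))
print(surface_features("90"))
```

Each token of each retrieved passage would get one such feature dictionary, paired with the question features and a 0/1 label.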

Preliminary likelihood ratio tests

[Figure: two panels, “Surface patterns” and “WordNet hypernyms”.]

Joint feature-vector design

Obvious “linear” juxtaposition x = (xp, xq)
• Does not expose pairwise dependencies
“Quadratic” form $x = x_q \otimes x_p$
• All pairwise product of elements
Model has param for every pair
Can discount for redundancy in pair info
If xq (xp) is fixed, what xp (xq) will yield the largest Pr(Y=1|x)? (linear iceberg query)

$$\Pr(Y=1 \mid x) = \frac{1}{1+\exp(-w \cdot x)}$$

[Figure: parameter matrix with question features (how_far, when, what_city) against passage features (region#n#3, entity#n#1).]
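
The “quadratic” joint vector and the logistic model can be sketched as below. This is an illustration only: the feature names in the comments and the weight values are hypothetical, and no training step is shown.

```python
import math

def quadratic_features(xq, xp):
    """All pairwise products xq[i] * xp[j], flattened row by row."""
    return [qi * pj for qi in xq for pj in xp]

def prob_positive(w, x):
    """Logistic model: Pr(Y=1 | x) = 1 / (1 + exp(-w . x))."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

xq = [1, 0]       # hypothetical question features, e.g. [how_far, when]
xp = [0, 1, 1]    # hypothetical passage features, e.g. [region, hasDigit, linear_unit]
x = quadratic_features(xq, xp)

# One hypothetical weight per (question feature, passage feature) pair.
w = [-1.0, 0.5, 2.0, 1.5, 0.2, -0.3]
p = prob_positive(w, x)
print(x, round(p, 3))
```

Because every weight is tied to one (question feature, passage feature) pair, the learned parameters can be read directly as “this question pattern likes this token pattern”.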

Classification accuracy

[Figure: two plots comparing the Linear and Quadratic models: true positive rate vs. false positive rate, and precision vs. recall.]
Pairing more accurate than linear model
Steep learning curve; linear never “gets it” beyond “prior” atypes like proper nouns (common in TREC)
Are the estimated w parameters meaningful?

Parameter anecdotes

Surface and WordNet features complement each other
General concepts get negative params: use in predictive annotation
Learning is symmetric (Q↔A)

Query-driven information extraction

“Basis” of atypes $A$; each $a \in A$ could be a synset, a surface pattern, feature of a parse tree
Question q “projected” to vector $(w_a : a \in A)$ in atype space via learning conditional model
• E.g. if q is “when…” or “how long…”, $w_{\mathrm{hasDigit}}$ and $w_{\mathrm{time\_period\#n\#1}}$ are large, $w_{\mathrm{region\#n\#1}}$ is small
Each corpus token t has associated indicator features $\phi_a(t)$ for every a
• E.g. $\phi_{\mathrm{hasDigit}}(\text{3,000}) = \phi_{\text{is-a}(\mathrm{region\#n\#1})}(\text{Japan}) = 1$
Can also learn [0,1] value of is-a proximity

Single token scoring

A token t is a candidate answer if

$$\sum_{a \in A} \phi_a(t)\, w_a(q) > 0$$

• $\phi_a(t)$: atype indicator features of the token
• $w_a(q)$: projection of question to “atype space”
Hq(t): Reward tokens appearing “near” selectors matched from question
• 0/1: appears within fixed window with selector/s
• Activation in linear token sequence model
• Proximity in chunk sequences, parse trees,…
Order tokens by decreasing

$$H_q(t) \sum_{a \in A} \phi_a(t)\, w_a(q)$$

Example passage: …the armadillo, found in Texas, is covered with strong horny plates
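
The single-token scoring rule can be sketched as follows, with each token represented by the set of atype indicators that fire on it. The weights, proximity rewards and the second candidate are hypothetical values for an imagined “where…” question about the armadillo passage; they are not taken from the talk.

```python
def atype_score(token_feats, w_q):
    """Sum of w_a(q) over the atype indicators a that fire on the token."""
    return sum(w_q.get(a, 0.0) for a in token_feats)

def rank_tokens(candidates, w_q):
    """candidates: list of (token, set of atype features, proximity reward H_q(t)).
    Keeps tokens whose atype score is positive, ordered by H_q(t) * score."""
    scored = []
    for tok, feats, h in candidates:
        s = atype_score(feats, w_q)
        if s > 0:
            scored.append((h * s, tok))
    return [tok for _, tok in sorted(scored, reverse=True)]

# Hypothetical projection of a "where..." question into atype space.
w_q = {"region#n#1": 2.0, "hasCap": 0.5, "hasDigit": -1.0}
candidates = [
    ("Texas", {"region#n#1", "hasCap"}, 1.0),
    ("plates", set(), 0.5),
    ("Mexico", {"region#n#1", "hasCap"}, 0.2),
]
ranking = rank_tokens(candidates, w_q)
print(ranking)
```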

Mean reciprocal rank (MRR)

nq = smallest rank among answer passages

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{n_q}$$

• Dropping passage from #1 to #2 as bad as dropping it from #2 to ∞
TREC requires MRR5: round up nq > 5 to ∞
• Improving rank from 20 to 6 as useless as improving it from 20 to 15
Aggregate score influenced by many complex subsystems
• Complete description rarely available
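
MRR and the MRR5 variant are simple to compute; the sketch below follows the definitions on this slide, using math.inf for a question with no answer passage. The example ranks are hypothetical.

```python
import math

def mrr(first_answer_ranks, cutoff=math.inf):
    """Mean reciprocal rank. Ranks above `cutoff` are treated as infinity,
    so they contribute 0 (cutoff=5 gives MRR5)."""
    total = 0.0
    for n in first_answer_ranks:
        if n <= cutoff:
            total += 1.0 / n
    return total / len(first_answer_ranks)

ranks = [1, 2, 6, 20]          # hypothetical n_q for four questions
plain = mrr(ranks)             # every rank counts
mrr5 = mrr(ranks, cutoff=5)    # ranks 6 and 20 both contribute nothing
print(plain, mrr5)
```

The two bullets above show up directly: moving a passage from rank 1 to 2 costs 1/2, the same as losing a rank-2 passage altogether, and under MRR5 ranks 6 and 20 are indistinguishable.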

Effect of eliminating non-answers

300 top IR score hits
If Pr(Y=1|token) < threshold reject token
All tokens rejected then reject passage
Present survivors in IR order
[Figure: rank after filtering vs. IR rank (0 to 300), TREC 2000 and TREC 2002.]
[Figure: MRR and MRR5 vs. acceptance threshold, against the baseline, TREC 2000; labeled values 0.491 and 0.336.]
[Figure: MRR and MRR5 vs. acceptance threshold, against the baseline, TREC 2002; labeled values 0.334 and 0.224.]
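
The filtering step above amounts to a few lines of code. In this sketch the per-token probabilities are assumed to come from the learned conditional model; the passage ids and numbers are hypothetical.

```python
def filter_passages(passages, threshold):
    """passages: list of (passage_id, [Pr(Y=1|token), ...]) in IR rank order.
    A passage survives if at least one token reaches the threshold;
    survivors keep their original IR order."""
    survivors = []
    for pid, token_probs in passages:
        if any(p >= threshold for p in token_probs):
            survivors.append(pid)
    return survivors

# Hypothetical IR hits with per-token answer probabilities.
hits = [("p1", [0.10, 0.05]), ("p2", [0.20, 0.70]), ("p3", [0.40])]
kept = filter_passages(hits, threshold=0.3)
print(kept)
```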

Drill-down and ablation studies

Scale average MRR improvement to 1
• What, Which < average
• Who > average
Atype of what… and which… not captured well by 3-grams starting at wh-words
Atype ranges over essentially infinite set with relatively little training data
[Figure: relative MRR gain (0.8 to 1.2) by question type (what, which, name, where, how, when, who), TREC 2002.]

Talk outline

Relational interpretation of QA
Motivation for a “clean-room” IE+ML system
Learning to map between questions and answers using is-a hierarchies and IE-style surface patterns
• Can handle prominent finite set of atypes: person, place, time, measurements,…
Extending to arbitrary atype specializations
• Required for what… and which… questions
Ongoing work and concluding remarks

What…, which…, name… atype clues

Assumption: Question sentence has a wh-word and a main/auxiliary verb
Observation: Atype clues are embedded in a noun phrase (NP) adjoining the main or auxiliary verb
Heuristic: Atype clue = head of this NP
• Use a shallow parser and apply rule
Head can have attributes
• Which (American (general)) is buried in Salzburg?
• Name (Saturn’s (largest (moon)))
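
A rough sketch of the head-of-NP heuristic, assuming the shallow parser has already produced a list of labeled chunks. The chunk labels, the chunkings shown, and the rule that the head is the last token of the NP are simplifying assumptions for illustration, not the rule actually used in the system.

```python
def atype_clue(chunks):
    """chunks: list of (label, [tokens]) from a shallow parser.
    Find the first verb chunk, then return the last token (assumed head)
    of the NP just before it, or failing that, just after it."""
    for i, (label, _) in enumerate(chunks):
        if label == "VP":
            for j in (i - 1, i + 1):
                if 0 <= j < len(chunks) and chunks[j][0] == "NP":
                    return chunks[j][1][-1]
            return None
    return None

# Hypothetical chunker output for two questions.
q1 = [("WH", ["Which"]), ("NP", ["American", "general"]),
      ("VP", ["is", "buried"]), ("PP", ["in"]), ("NP", ["Salzburg"])]
q2 = [("WH", ["What"]), ("VP", ["is"]), ("NP", ["the", "capital"]),
      ("PP", ["of"]), ("NP", ["France"])]
print(atype_clue(q1), atype_clue(q2))
```

The remaining tokens of the chosen NP (here “American”) are the attributes of the head mentioned above.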

Atype clue extraction stats

Question type   #Questions   #Extracted correctly
what            630          612
which           29           28
name            23           20

Simple heuristic quite effective
If successful, extracted atype is mapped to WordNet synset (moon → celestial body etc.)
If no atype of this form available, try the “self-evident” atypes (who, when, where, how_X etc.)
New boolean feature for candidate token: is token hyponym of atype synset?

The last piece: Learning selectors

Which question words are likely to appear (almost) unchanged in an answer passage?
• Constants in select-clauses of SQL queries
• Guides backoff policy for keyword query
Arises in Web search sessions too
• Opera login fails
• Opera problem with login
• Opera login accept password
• Opera account authentication
• …

Features for identifying selectors

Local and global features
• POS of word, POS of adjacent words, case info, proximity to wh-word
• Suppose word is associated with synset set S
  • NumSense: size of S (how polysemous is the word?)
  • NumLemma: average #lemmas describing s ∈ S
Model as a sequential learning problem
• Each token has local context and global features
[Figure: token sequence with positional features POS@-1, POS@0, POS@1.]
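
The two global features NumSense and NumLemma can be computed from any word-to-synsets mapping. The sketch below uses a tiny hypothetical table in place of WordNet, so the entries and resulting numbers are purely illustrative.

```python
# Hypothetical toy table: word -> list of synsets, each synset a list of lemmas.
SYNSETS = {
    "login": [["login", "logon"]],
    "problem": [["problem", "job"], ["problem"], ["trouble", "problem"]],
}

def num_sense(word):
    """NumSense: size of the word's synset set S."""
    return len(SYNSETS.get(word, []))

def num_lemma(word):
    """NumLemma: average number of lemmas per synset s in S (0.0 if S is empty)."""
    synsets = SYNSETS.get(word, [])
    if not synsets:
        return 0.0
    return sum(len(s) for s in synsets) / len(synsets)

print(num_sense("problem"), num_lemma("problem"))
print(num_sense("login"), num_lemma("login"))
```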

Selector results

Global features (IDF, NumSense, NumLemma) essential for accuracy
• Best F1 accuracy with local features alone: 71-73%
• With local and global features: 81%
Decision trees better than logistic regression
• F1=81% as against LR F1=75%
• Intuitive decision branches
• But logistic regression gives scores for query backoff
[Figure: decision tree with branches on NumLemma@0 (<=2.5 / >2.5 and <=1.82 / >1.82), NumSense@0 (<=9 / >9), POS@-1=Noun, POS@0=Adj, POS@0=Verb.]

Putting together a QA system

[Figure: a QA system assembled from WordNet, POS tagger, training corpus, shallow parser, learning tools, N-E tagger.]

Putting together a QA system

[Figure: system architecture. Components: sentence splitter, passage indexer, tokenizer, POS tagger, shallow parser, atype extractor, selector learner, keyword query generator, entity extractor, logistic regression. Data: corpus, passage index, question, tagged question, noun and verb markers, atype clues, keyword query, candidate passage, tagged passage, “Is QA pair?”, reranked passages.]
Learning to rerank passages
Sample features:
• Do selectors match? How many?
• Is some non-selector passage token a specialization of the question’s atype clue?
• Min, avg, linear token distance between candidate token and matched selectors

Learning to re-rank passages

Remove passage tokens matching selectors
• User already knows these are in passage
Find passage token/s specializing atype
For each candidate token collect
• Atype of question, original rank of passage
• Min, avg linear distances to matched selectors
• POS and entity tag of token if available
Example
• Question: How many inhabitants live in the town of Ushuaia
• Passage: Ushuaia, a port of about 30,000 dwellers set between the Beagle Channel and …
[Figure annotations: selector match; surface pattern hasDigits; WordNet match; 5 tokens apart; 1.]
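
The distance features for a candidate token can be sketched as below. Token positions come from a naive whitespace split with punctuation dropped, which is an assumption made for this illustration; the real tokenizer is not described in the slides.

```python
def distance_features(candidate_pos, selector_positions):
    """Min and average linear token distance from a candidate token
    to the matched selectors; None when no selector matched."""
    if not selector_positions:
        return {"min_dist": None, "avg_dist": None}
    dists = [abs(candidate_pos - s) for s in selector_positions]
    return {"min_dist": min(dists), "avg_dist": sum(dists) / len(dists)}

# Passage fragment from the slide, punctuation removed (assumption).
passage = "Ushuaia a port of about 30,000 dwellers".split()
selector_positions = [i for i, tok in enumerate(passage) if tok == "Ushuaia"]
candidate_pos = passage.index("30,000")
feats = distance_features(candidate_pos, selector_positions)
print(feats)
```

These numeric features would sit alongside the categorical ones (question atype, POS and entity tag) in the input to the re-ranking classifier.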

Re-ranking results

Categorical and numeric attributes
Logistic regression
Good precision, poor recall
Use logit score to re-rank passages
Rank of first correct passage shifts substantially
[Figure: frequency (log scale, 1 to 1000) of the first correct answer at ranks 1 to 10, Baseline vs. Rerank; data label “194479” as extracted.]

MRR gains from what, which, name

Substantial gain in MRR
What/which now show above-average MRR gains
TREC 2000 top MRRs: 0.76 0.71 0.46 0.46 0.31

Ranking strategy             TREC 2000   TREC 2002
IR score (Lucene)            0.377       0.249
Conditional model            0.491       0.334
Atype for what/which/name    0.71        0.565

[Figure: MRR (0 to 0.8) by question type (when, what, where, how, which, how many, how much), pre-reranking vs. post-reranking.]

Generalization across corpora

Across-year numbers close to train/test split on a single year

Features and model seem to capture corpus-independent linguistic Q+A artifacts


Conclusion

Clean-room QA = feature extraction + learning
• Recover structure info from question
• Learn correlations between question structure and passage features
Competitive accuracy with negligible domain expertise or manual intervention
Ongoing work
• Use model coefficients for predictive annotation
• Combine token scores to better passage scores
• Treat all question types uniformly
• Use redundancy available from the Web
