Meaning from Text: Teaching Computers to Read
Steven Bethard
University of Colorado

Transcript
Page 1

Meaning from Text: Teaching Computers to Read

Steven Bethard

University of Colorado

Page 2

Query: “Who is opposing the railroad through Georgia?”

Result 1: en.wikipedia.org/wiki/Sherman's_March_to_the_Sea

…they destroyed the railroads and the manufacturing and agricultural infrastructure of the state… Henry Clay Work wrote the song Marching Through Georgia…

Result 3: www.ischool.berkeley.edu/~mkduggan/politics.html

While the piano piece "Marching Through Georgia" has no words... Party of California (1882) has several verses opposing the "railroad robbers"...

Result 71: www.azconsulatela.org/brazaosce.htm

Azerbaijan, Georgia and Turkey plan to start construction of Kars-Akhalkalaki-Tbilisi-Baku railroad in May, 2007…However, we’ve witnessed a very strong opposition to this project both in Congress and White House. President George Bush signed a bill prohibiting financing of this railroad…

Page 3

What went wrong?

Didn’t find some similar word forms (Morphology)
  Finds opposing but not opposition
  Finds railroad but not railway

Didn’t know how words should be related (Syntax)
  Looking for: opposing railroad
  Finds: opposing the “railroad robbers”

Didn’t know that “is opposing” means current (Semantics/Tense)
  Looking for: recent documents
  Finds: Civil War documents

Didn’t know that “who” means a person (Semantics/Entities)
  Looking for: <person> opposing
  Finds: several verses opposing

Page 4

Teaching Linguistics to Computers

Natural Language Processing (NLP)
  Symbolic approaches
  Statistical approaches

Machine learning overview

Statistical NLP
  Example: Identifying people and places
  Example: Constructing timelines

Page 5

Early Natural Language Processing

Symbolic approaches
Small domains

Example: SHRDLU block world
  Vocabulary of ~50 words
  Simple word combinations
  Hand-written rules to understand sentences

Person: WHAT DOES THE BOX CONTAIN?
Comp:   THE BLUE PYRAMID.
Person: WHAT IS THE PYRAMID SUPPORTED BY?
Comp:   THE BOX.
Person: HOW MANY BLOCKS ARE NOT IN THE BOX?
Comp:   SEVEN OF THEM.

Page 6

Recent Natural Language Processing

Large scale linguistic corpora
  e.g. Penn TreeBank: a million words of syntax

Statistical machine learning
  e.g. Charniak parser
    Trained on the TreeBank
    Builds new trees with 90% accuracy

Parse tree for “George Bush signed the bill”:

(sentence
  (noun-phrase (proper-noun George) (proper-noun Bush))
  (verb-phrase signed
    (noun-phrase (determiner the) (noun bill))))
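As a brief aside not on the slide itself, here is one way to represent the bracketed parse above in code. This is a minimal sketch assuming NLTK; it only holds and displays a parse, it does not reimplement the Charniak parser.

```python
# A minimal sketch, assuming NLTK is installed.
from nltk import Tree

parse = Tree.fromstring(
    "(sentence"
    "  (noun-phrase (proper-noun George) (proper-noun Bush))"
    "  (verb-phrase signed"
    "    (noun-phrase (determiner the) (noun bill))))"
)
parse.pretty_print()   # draws the tree as ASCII art
print(parse.leaves())  # ['George', 'Bush', 'signed', 'the', 'bill']
```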

Page 7

Machine Learning

General approach
  Analyze data
  Extract preferences
  Classify new examples using learned preferences

Supervised machine learning
  Data have human-annotated labels
    e.g. each sentence in the TreeBank has a syntactic tree
  Learns human preferences

Page 8


Supervised Machine Learning Models

Given:
  An N dimensional feature space
  Points in that space
  A human-annotated label for each point

Goal:
  Learn a function to assign labels to points

Methods:
  K-nearest-neighbors, support vector machines, etc.

[Figure: A Two-Dimensional Space, with labeled points and two unlabeled points marked “?”]
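To make this setup concrete, here is a minimal sketch in Python, assuming scikit-learn (a library choice for illustration, not one the slide names): labeled points in a two-dimensional feature space, a support vector machine fit to them, and a label predicted for a new point.

```python
# A minimal sketch of supervised learning, assuming scikit-learn.
from sklearn.svm import SVC

# Points in a 2-dimensional feature space, each with a human-annotated label.
points = [[0.0, 0.1], [0.2, 0.3], [0.9, 0.8], [1.0, 0.9]]
labels = ["circle", "circle", "square", "square"]

model = SVC(kernel="linear")
model.fit(points, labels)           # learn a function from points to labels

print(model.predict([[0.1, 0.2]]))  # -> ['circle']
```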

Page 9

Machine Learning Examples

Character Recognition
  Feature space: 256 pixels (0 = black, 1 = white)
  Labels: A, B, C, …

Cardiac Arrhythmia
  Feature space: age, sex, heart rate, …
  Labels: has arrhythmia, doesn’t have arrhythmia

Mushrooms
  Feature space: cap shape, gill color, stalk surface, …
  Labels: poisonous, edible

… and many more: http://www.ics.uci.edu/~mlearn/MLRepository.html

Page 10

Machine Learning and Language

Example: Identifying people, places, organizations (named entities)

  However, we’ve witnessed a very strong opposition to this project both in [ORG Congress] and [ORG White House]. President [PER George Bush] signed a bill prohibiting financing of this railroad.

This doesn’t look like that lines-and-dots example!
  What’s the classification problem?
  What’s the feature space?

Page 11

Named Entities: Classification

Word-by-word classification
  Is the word at the beginning, inside, or outside of a named entity?

Word       Label
in         Outside
Congress   Begin-ORG
and        Outside
White      Begin-ORG
House      Inside-ORG
.          Outside
President  Outside
George     Begin-PER
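A minimal sketch of this labeling scheme (my illustration; the span indices are hypothetical annotations, not a real corpus format):

```python
# Turn entity spans into word-by-word Begin/Inside/Outside labels.
words = ["in", "Congress", "and", "White", "House", ".", "President", "George"]
# (start index, end index exclusive, entity type) -- hypothetical annotations
entities = [(1, 2, "ORG"), (3, 5, "ORG"), (7, 8, "PER")]

labels = ["Outside"] * len(words)
for start, end, etype in entities:
    labels[start] = f"Begin-{etype}"
    for i in range(start + 1, end):
        labels[i] = f"Inside-{etype}"

for word, label in zip(words, labels):
    print(f"{word:<10} {label}")
```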

Page 12

Named Entities: Clues

The word itself
  U.S. is always a Location (though Turkey is not)

Part of speech
  The Locations Turkey and Georgia are nouns (though the White of White House is not)

Is the first letter of the word capitalized?
  Bush and Congress are capitalized (though the von of von Neumann is not)

Is the word at the start of the sentence?
  In the middle of a sentence, Will is likely a Person (but at the start it could be an auxiliary verb)

Page 13

Named Entities: Clues as Features

Each clue defines part of the feature space

Word       Part of Speech  Starts Sent  Initial Caps  Label
in         preposition     False        False         Outside
Congress   noun            False        True          Begin-ORG
and        conjunction     False        False         Outside
White      adjective       False        True          Begin-ORG
House      noun            False        True          Inside-ORG
.          punctuation     False        False         Outside
President  noun            True         True          Outside
George     noun            False        True          Begin-PER
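A minimal sketch of extracting these clues as features, assuming part-of-speech tags already come from a separate tagger (hardcoded here):

```python
# Compute the clue features from the table above for one word.
words = ["in", "Congress", "and", "White", "House", ".", "President", "George"]
tags  = ["preposition", "noun", "conjunction", "adjective", "noun",
         "punctuation", "noun", "noun"]

def clue_features(i):
    return {
        "word": words[i],
        "part_of_speech": tags[i],
        # the snippet starts mid-sentence, so only words after "." start one
        "starts_sentence": i > 0 and words[i - 1] == ".",
        "initial_caps": words[i][0].isupper(),
    }

print(clue_features(6))  # "President": starts a sentence and is capitalized
```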

Page 14

Named Entities: String Features

But machine learning models need numeric features!
  True → 1    False → 0    Congress → ?    adjective → ?

Solution: a binary feature for each word

String Feature   Numeric Features
destroyed        1 0 0 0 0
the              0 1 0 0 0
railroads        0 0 1 0 0
and              0 0 0 1 0
the              0 1 0 0 0
manufacturing    0 0 0 0 1
and              0 0 0 1 0
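A minimal sketch of this encoding, building one binary column per vocabulary word:

```python
# One binary feature per word: a 1 in the column for the word itself.
words = ["destroyed", "the", "railroads", "and", "the", "manufacturing", "and"]

vocabulary = []          # one column per distinct word, in first-seen order
for w in words:
    if w not in vocabulary:
        vocabulary.append(w)

for w in words:
    vector = [1 if v == w else 0 for v in vocabulary]
    print(f"{w:<15}", *vector)
```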

Page 15

Named Entities: Review

…[ORG Congress] and [ORG White House]…

Word      Part of Speech  Starts Sent  Initial Caps  Label
Congress  noun            False        True          Begin-ORG
and       conjunction     False        False         Outside
White     adjective       False        True          Begin-ORG
House     noun            False        True          Inside-ORG

As numeric feature vectors (word, part of speech, initial caps):

1 0 0 0  1 0 0 0  1  Begin-ORG
0 1 0 0  0 1 0 0  0  Outside
0 0 1 0  0 0 1 0  1  Begin-ORG
0 0 0 1  1 0 0 0  1  Inside-ORG

Page 16

Named Entities: Features and Models

String features
  word itself
  part of speech
  starts sentence
  has initial capitalization

How many numeric features?
  N = Nwords + Nparts-of-speech + 1 + 1
  Nwords ≈ 10,000 and Nparts-of-speech ≈ 50, so N ≈ 10,052

Need efficient implementations, e.g. TinySVM

Page 17

Named Entities in Use

We know how to:
  View named entity recognition as classification
  Convert clues to an N-dimensional feature space
  Train a machine learning model

How can we use the model?

Page 18

Named Entities in Search Engines

Page 19

Named Entities in Research

TREC-QA
  Factoid question answering
  Various research systems compete
  All use named entity matching

State of the art performance: ~90%
  That’s 10% wrong! But good enough for real use
  Named entities are a “solved” problem

So what’s next?

Page 20

Learning Timelines

The top commander of a Cambodian resistance force said Thursday he has sent a team to recover the remains of a British mine removal expert kidnapped and presumed killed by Khmer Rouge guerrillas almost two years ago.


Page 22

Learning Timelines

[Figure: a timeline graph linking the events said, sent, recover, kidnapped, and killed and the times Thursday and “almost two years ago” with before/includes relations]

The top commander of a Cambodian resistance force said Thursday he has sent a team to recover the remains of a British mine removal expert kidnapped and presumed killed by Khmer Rouge guerrillas almost two years ago.

Page 23

Why Learn Timelines?

Timelines are summarization
  1996  Khmer Rouge kidnapped and killed British mine removal expert
  1998  Cambodian commander sent recovery team

Timelines allow reasoning
  Q: When was the expert kidnapped?
  A: Almost two years ago.
  Q: Was the team sent before the expert was killed?
  A: No, afterwards.

Page 24

Learning Timelines: Classification

Standard questions:
  What’s the classification problem?
  What’s the feature space?

Three different problems
  Identify times
  Identify events
  Identify links (temporal relations)

Page 25

Times and Events: Classification

Word-by-word classification

Time features:
  word itself
  has digits
  …

Event features:
  word itself
  suffixes (e.g. -ize, -tion)
  root (e.g. evasion → evade)
  …

Word        Part of Speech  Label
The         determiner      Outside
company     noun            Outside
’s          possessive      Outside
sales       noun            Outside
force       noun            Outside
applauded   verb            Begin-Event
the         determiner      Outside
shake       noun            Begin-Event
up          particle        Inside-Event
.           punctuation     Outside
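A minimal sketch of a few of these word-level clues (a simplified feature set of mine; a real system would also compute a morphological root, e.g. evasion → evade):

```python
# Simple time/event clue features for one word.
def time_event_features(word):
    return {
        "word": word.lower(),
        "has_digits": any(c.isdigit() for c in word),
        "suffix_ize": word.endswith("ize"),
        "suffix_tion": word.endswith("tion"),
    }

print(time_event_features("2007"))          # digits suggest a time
print(time_event_features("construction"))  # -tion suffix suggests an event
```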

Page 26

Times and Events: State of the Art

Performance:
  Times: ~90%
  Events: ~80%

Mr Bryza, it's been [Event reported] that Azerbaijan, Georgia and Turkey [Event plan] to [Event start] [Event construction] of Kars Akhalkalaki Tbilisi Baku railroad in [Time May], [Time 2007].

Why are events harder?
  No orthographic cues (capitalization, digits, etc.)
  More parts of speech (nouns, verbs and adjectives)

Page 27

Temporal Links

Everything so far looked like:

  Aaaa [X bb] ccccc [Y dd eeeee] fff [Z gggg]

But now we want this:

  [Figure: the same string “Aaaa bb ccccc dd eeeee fff ggg” with arcs labeled X and Y linking pairs of spans]

Word-by-word classification won’t work!

Page 28

Temporal Links: Classification

Pairwise classification
  Each event with each time

Saddam Hussein [Time today] [Event sought] [Event peace] on another front by [Event promising] to [Event withdraw] from Iranian territory and [Event release] soldiers [Event captured] during the Iran-Iraq [Event war].

Event      Time   Label
sought     today  During
peace      today  After
promising  today  During
withdraw   today  After
release    today  After
captured   today  Before
war        today  Before

Page 29

Temporal Links: Clues

Tense of the event
  said (past tense) is probably Before today
  says (present tense) is probably During today

Nearby temporal expression
  In “said today”, said is During today
  In “captured in 1989”, captured is During 1989

Negativity
  In “People believe this”, believe is During today
  In “People don’t believe this any more”, believe is Before today

Page 30

Temporal Links: Features

Saddam Hussein [Time today] [Event sought] [Event peace] on another front by [Event promising] to [Event withdraw] from Iranian territory…

Event      Time   Tense       Nearby Time  Negativity  Label
sought     today  past        today        positive    During
peace      today  none        none         positive    After
promising  today  present     none         positive    During
withdraw   today  infinitive  none         positive    After

As numeric feature vectors (event word, tense, nearby time):

1 0 0 0  0 1 0 0 0  1 0 0  During
0 1 0 0  0 0 1 0 0  0 1 0  After
0 0 1 0  0 0 0 1 0  0 1 0  During
0 0 0 1  0 0 0 0 1  0 1 0  After
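One way to go from string-valued features to binary vectors like these is scikit-learn's DictVectorizer; this is an assumed tool for illustration, not one the slides name.

```python
# A minimal sketch, assuming scikit-learn: one binary column per
# feature=value pair, mirroring the table above.
from sklearn.feature_extraction import DictVectorizer

instances = [
    {"event": "sought",    "tense": "past",       "nearby_time": "today"},
    {"event": "peace",     "tense": "none",       "nearby_time": "none"},
    {"event": "promising", "tense": "present",    "nearby_time": "none"},
    {"event": "withdraw",  "tense": "infinitive", "nearby_time": "none"},
]
labels = ["During", "After", "During", "After"]

vectorizer = DictVectorizer()
vectors = vectorizer.fit_transform(instances)
print(vectorizer.get_feature_names_out())  # e.g. ['event=peace', ...]
print(vectors.toarray())                   # binary rows, one per instance
```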

Page 31

Temporal Links: State of the Art

Corpora with temporal links:
  PropBank: verbs and subjects/objects
  TimeBank: certain pairs of events (e.g. reporting event and event reported)
  TempEval A: events and times in the same sentence
  TempEval B: events in a document and the document time

Performance on TempEval data:
  Same-sentence links (A): ~60%
  Document time links (B): ~80%

Page 32

What will make timelines better?

Larger corpora
  TempEval is only ~50 documents; the TreeBank is ~2400

More types of links
  Event-time pairs for all events (TempEval only considers high-frequency events)
  Event-event pairs in the same sentence

Page 33

Summary

Statistical NLP asks:
  What’s the classification problem? Word-by-word? Pairwise?
  What’s the feature space? What are the linguistic clues? What does the N-dimensional space look like?

Statistical NLP needs:
  Learning algorithms efficient when N is very large
  Large-scale corpora with linguistic labels

Page 34

Future Work: Automate this!

Page 35

References

Symbolic NLP
  Terry Winograd. 1972. Understanding Natural Language. Academic Press.

Statistical NLP
  Daniel M. Bikel, Richard Schwartz, and Ralph M. Weischedel. 1999. “An Algorithm that Learns What’s in a Name.” Machine Learning.
  Kadri Hacioglu, Ying Chen, and Benjamin Douglas. 2005. “Automatic Time Expression Labeling for English and Chinese Text.” In Proceedings of CICLing-2005.
  Ellen M. Voorhees and Hoa Trang Dang. 2005. “Overview of the TREC 2005 Question Answering Track.” In Proceedings of The Fourteenth Text REtrieval Conference.

Page 36

References

Corpora
  Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. “Building a Large Annotated Corpus of English: The Penn Treebank.” Computational Linguistics, 19:313-330.
  Martha Palmer, Dan Gildea, and Paul Kingsbury. 2005. “The Proposition Bank: A Corpus Annotated with Semantic Roles.” Computational Linguistics, 31(1).
  James Pustejovsky, Patrick Hanks, Roser Saurí, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro, and Marcia Lazo. 2003. “The TIMEBANK Corpus.” In Proceedings of Corpus Linguistics 2003: 647-656.


Page 38

Feature Windowing (1)

Problem: Word-by-word classification gives no context
Solution: Include surrounding features

Word        Part of Speech  Label
The         determiner      Outside
company     noun            Outside
’s          possessive      Outside
sales       noun            Outside
force       noun            Outside
applauded   verb            Begin-Event
the         determiner      Outside
shake       noun            Begin-Event
up          particle        Inside-Event
.           punctuation     Outside

Page 39

Feature Windowing (2)

From previous word: features, label
From current word: features
From following word: features

Need special values like !START! and !END!

Word-1  Word0  Word+1  POS-1  POS0  POS+1  Label-1  Label
the     shake  up      DT     NN    PRT    Outside  Begin
shake   up     .       NN     PRT   O      Begin    Inside
up      .      !END!   PRT    O     !END!  Inside   Outside
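A minimal sketch of building the windowed features above, with !START! and !END! as the special boundary values:

```python
# Windowed features: each word's instance also sees its neighbors.
words = ["the", "shake", "up", "."]
tags  = ["DT", "NN", "PRT", "O"]

def at(seq, j):
    """Item j of seq, or a boundary marker outside the sequence."""
    if j < 0:
        return "!START!"
    if j >= len(seq):
        return "!END!"
    return seq[j]

def windowed_features(i):
    return {
        "word-1": at(words, i - 1), "word0": words[i], "word+1": at(words, i + 1),
        "pos-1":  at(tags,  i - 1), "pos0":  tags[i],  "pos+1":  at(tags,  i + 1),
    }

print(windowed_features(3))  # features for ".", with !END! to its right
```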

Page 40

Evaluation: Precision, Recall, F

  precision = (# correctly predicted entities) / (# predicted entities)

  recall = (# correctly predicted entities) / (# entities actually present)

  F = (2 × precision × recall) / (precision + recall)
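A minimal sketch of these measures over hypothetical predicted and gold entity sets (the entities here are made-up examples, not real system output):

```python
# Precision, recall, and F over (entity text, type) pairs.
predicted = {("Congress", "ORG"), ("White House", "ORG"), ("Georgia", "PER")}
actual    = {("Congress", "ORG"), ("White House", "ORG"), ("George Bush", "PER")}

correct = len(predicted & actual)           # correctly predicted entities
precision = correct / len(predicted)
recall = correct / len(actual)
f = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} F={f:.2f}")
```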