Meaning from Text: Teaching Computers to Read
Steven Bethard, University of Colorado
Dec 19, 2015
Query: “Who is opposing the railroad through Georgia?”

Result 1: en.wikipedia.org/wiki/Sherman's_March_to_the_Sea
…they destroyed the railroads and the manufacturing and agricultural infrastructure of the state… Henry Clay Work wrote the song Marching Through Georgia…

Result 3: www.ischool.berkeley.edu/~mkduggan/politics.html
While the piano piece "Marching Through Georgia" has no words… Party of California (1882) has several verses opposing the "railroad robbers"…

Result 71: www.azconsulatela.org/brazaosce.htm
Azerbaijan, Georgia and Turkey plan to start construction of Kars-Akhalkalaki-Tbilisi-Baku railroad in May, 2007… However, we’ve witnessed a very strong opposition to this project both in Congress and White House. President George Bush signed a bill prohibiting financing of this railroad…
What went wrong?
- Didn’t find some similar word forms (Morphology): finds "opposing" but not "opposition"; finds "railroad" but not "railway"
- Didn’t know how words should be related (Syntax): looking for opposing a railroad; finds verses opposing the "railroad robbers"
- Didn’t know that “is opposing” means current (Semantics/Tense): looking for recent documents; finds Civil War documents
- Didn’t know that “who” means a person (Semantics/Entities): looking for <person> opposing; finds "several verses opposing"
Teaching Linguistics to Computers: Natural Language Processing (NLP)
- Symbolic approaches
- Statistical approaches
  - Machine learning overview
  - Statistical NLP
    - Example: identifying people and places
    - Example: constructing timelines
Early Natural Language Processing
- Symbolic approaches
- Small domains
- Example: SHRDLU block world
  - Vocabulary of ~50 words
  - Simple word combinations
  - Hand-written rules to understand sentences

Person: WHAT DOES THE BOX CONTAIN?
Comp: THE BLUE PYRAMID.
Person: WHAT IS THE PYRAMID SUPPORTED BY?
Comp: THE BOX.
Person: HOW MANY BLOCKS ARE NOT IN THE BOX?
Comp: SEVEN OF THEM.
Recent Natural Language Processing
- Large-scale linguistic corpora, e.g. the Penn TreeBank: a million words annotated with syntax
- Statistical machine learning, e.g. the Charniak parser: trained on the TreeBank, builds new trees with ~90% accuracy

Example parse tree for "George Bush signed the bill":
(sentence
  (noun-phrase (proper-noun George) (proper-noun Bush))
  (verb-phrase signed
    (noun-phrase (determiner the) (noun bill))))
Machine Learning
General approach:
- Analyze data
- Extract preferences
- Classify new examples using the learned preferences
Supervised machine learning:
- Data have human-annotated labels, e.g. each sentence in the TreeBank has a syntactic tree
- Learns human preferences
Supervised Machine Learning Models
Given:
- An N-dimensional feature space
- Points in that space
- A human-annotated label for each point
Goal:
- Learn a function to assign labels to points
Methods:
- K-nearest-neighbors, support vector machines, etc.

[Figure: a two-dimensional feature space with labeled points, plus unlabeled points marked "?"]
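As a concrete sketch of one such method, here is k-nearest-neighbors on a two-dimensional feature space. The points, labels, and function name are invented for illustration; real feature spaces have many more dimensions.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Label a query point by majority vote among its k nearest
    labeled training points (Euclidean distance)."""
    nearest = sorted(train, key=lambda point: math.dist(point[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented 2-D points with human-annotated labels "A" and "B"
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify(train, (2, 2)))  # near the "A" cluster -> "A"
```

The learned "function" here is implicit: the training points themselves define the decision, which is why methods like support vector machines, which compress the data into an explicit boundary, scale better.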
Machine Learning Examples
- Character recognition. Feature space: 256 pixels (0 = black, 1 = white). Labels: A, B, C, …
- Cardiac arrhythmia. Feature space: age, sex, heart rate, … Labels: has arrhythmia, doesn’t have arrhythmia
- Mushrooms. Feature space: cap shape, gill color, stalk surface, … Labels: poisonous, edible
- … and many more: http://www.ics.uci.edu/~mlearn/MLRepository.html
Machine Learning and Language
Example: identifying people, places, organizations (named entities)

However, we’ve witnessed a very strong opposition to this project both in [ORG Congress] and [ORG White House]. President [PER George Bush] signed a bill prohibiting financing of this railroad.

This doesn’t look like that lines-and-dots example! What’s the classification problem? What’s the feature space?
Named Entities: Classification
Word-by-word classification: is the word at the beginning, inside, or outside of a named entity?

Word       Label
in         Outside
Congress   Begin-ORG
and        Outside
White      Begin-ORG
House      Inside-ORG
.          Outside
President  Outside
George     Begin-PER
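The Begin/Inside/Outside labels above can be decoded back into entity spans. A minimal sketch, with the function name and label spellings chosen for illustration:

```python
def decode_bio(words, labels):
    """Recover (type, text) entity spans from word-by-word
    Begin-X / Inside-X / Outside labels."""
    entities, current = [], None
    for word, label in zip(words, labels):
        if label.startswith("Begin-"):
            if current:
                entities.append(current)
            current = (label[len("Begin-"):], [word])  # start a new entity
        elif label.startswith("Inside-") and current:
            current[1].append(word)                    # extend the entity
        else:  # Outside
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, " ".join(ws)) for etype, ws in entities]

words = ["in", "Congress", "and", "White", "House", "."]
labels = ["Outside", "Begin-ORG", "Outside", "Begin-ORG", "Inside-ORG", "Outside"]
print(decode_bio(words, labels))  # [('ORG', 'Congress'), ('ORG', 'White House')]
```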
Named Entities: Clues
- The word itself: U.S. is always a Location (though Turkey is not)
- Part of speech: the Locations Turkey and Georgia are nouns (though the White of White House is not)
- Is the first letter of the word capitalized? Bush and Congress are capitalized (though the von of von Neumann is not)
- Is the word at the start of the sentence? In the middle of a sentence, Will is likely a Person (but at the start it could be an auxiliary verb)
Named Entities: Clues as Features
Each clue defines part of the feature space.

Word       Part of Speech  Starts Sent  Initial Caps  Label
in         preposition     False        False         Outside
Congress   noun            False        True          Begin-ORG
and        conjunction     False        False         Outside
White      adjective       False        True          Begin-ORG
House      noun            False        True          Inside-ORG
.          punctuation     False        False         Outside
President  noun            True         True          Outside
George     noun            False        True          Begin-PER
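A sketch of how the clue columns above might be computed for each word. The function and feature names are hypothetical; the part of speech would come from a separate tagger.

```python
def extract_features(word, pos, index):
    """One feature dict per word, mirroring the clue table:
    the word itself, its part of speech (from a tagger),
    sentence-start position, and initial capitalization."""
    return {
        "word": word,
        "part_of_speech": pos,
        "starts_sentence": index == 0,
        "initial_caps": word[:1].isupper(),
    }

sentence = [("President", "noun"), ("George", "noun")]
features = [extract_features(w, p, i) for i, (w, p) in enumerate(sentence)]
print(features[0])
# {'word': 'President', 'part_of_speech': 'noun',
#  'starts_sentence': True, 'initial_caps': True}
```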
Named Entities: String Features
But machine learning models need numeric features!
- True → 1, False → 0
- Congress → ? adjective → ?
Solution: a binary feature for each word.

String Feature  Numeric Features
destroyed       1 0 0 0 0
the             0 1 0 0 0
railroads       0 0 1 0 0
and             0 0 0 1 0
the             0 1 0 0 0
manufacturing   0 0 0 0 1
and             0 0 0 1 0
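The table above is a one-hot encoding; a minimal sketch of how it could be computed (names are illustrative):

```python
def one_hot_encode(words):
    """Assign each distinct word its own binary feature, so every
    word maps to a vector with exactly one 1 (one-hot encoding)."""
    vocab = []
    for w in words:          # build the vocabulary in first-seen order
        if w not in vocab:
            vocab.append(w)
    vectors = []
    for w in words:
        vec = [0] * len(vocab)
        vec[vocab.index(w)] = 1
        vectors.append(vec)
    return vocab, vectors

words = "destroyed the railroads and the manufacturing and".split()
vocab, vectors = one_hot_encode(words)
print(vectors[1])  # "the" -> [0, 1, 0, 0, 0]
print(vectors[4])  # the second "the" gets the same vector
```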
Named Entities: Review
…[ORG Congress] and [ORG White House]…

Word      Part of Speech  Starts Sent  Initial Caps  Label
Congress  noun            False        True          Begin-ORG
and       conjunction     False        False         Outside
White     adjective       False        True          Begin-ORG
House     noun            False        True          Inside-ORG

As numeric feature vectors:
1 0 0 0  1 0 0 0  1  → Begin-ORG
0 1 0 0  0 1 0 0  0  → Outside
0 0 1 0  0 0 1 0  1  → Begin-ORG
0 0 0 1  1 0 0 0  1  → Inside-ORG
Named Entities: Features and Models
String features: the word itself, part of speech, starts sentence, has initial capitalization
How many numeric features? N = Nwords + Nparts-of-speech + 1 + 1, with Nwords ≈ 10,000 and Nparts-of-speech ≈ 50, so N ≈ 10,052.
Need efficient implementations, e.g. TinySVM
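Since only a handful of those ~10,000+ dimensions are non-zero for any one word, efficient implementations typically store just the active features. A hypothetical sparse representation (names are illustrative, not TinySVM's actual format):

```python
def sparse_features(word, pos, starts_sentence, initial_caps):
    """Store only the active features as name -> 1 entries;
    all other dimensions are implicitly zero."""
    features = {"word=" + word: 1, "pos=" + pos: 1}
    if starts_sentence:
        features["starts_sentence"] = 1
    if initial_caps:
        features["initial_caps"] = 1
    return features

print(sparse_features("Congress", "noun", False, True))
# {'word=Congress': 1, 'pos=noun': 1, 'initial_caps': 1}
```

Each word activates at most four features, so the model never touches the other ~10,000 zero entries.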
Named Entities in Use
We know how to:
- View named entity recognition as classification
- Convert clues to an N-dimensional feature space
- Train a machine learning model
How can we use the model?
Named Entities in Research
TREC-QA:
- Factoid question answering
- Various research systems compete
- All use named entity matching
State-of-the-art performance: ~90%. That’s 10% wrong! But good enough for real use; named entities are a “solved” problem.
So what’s next?
Learning Timelines
The top commander of a Cambodian resistance force said Thursday he has sent a team to recover the remains of a British mine removal expert kidnapped and presumed killed by Khmer Rouge guerrillas almost two years ago.
Learning Timelines
[Timeline figure: the events said, sent, recover, kidnapped, and killed, and the times Thursday and almost two years ago, connected by "before" and "includes" relations]
Why Learn Timelines?
Timelines are summarization:
1996  Khmer Rouge kidnapped and killed British mine removal expert
1998  Cambodian commander sent recovery team
…
Timelines allow reasoning:
Q: When was the expert kidnapped? A: Almost two years ago.
Q: Was the team sent before the expert was killed? A: No, afterwards.
Learning Timelines: Classification
Standard questions:
- What’s the classification problem?
- What’s the feature space?
Three different problems:
- Identify times
- Identify events
- Identify links (temporal relations)
Times and Events: Classification
Word-by-word classification
- Time features: word itself, has digits, …
- Event features: word itself, suffixes (e.g. -ize, -tion), root (e.g. evasion → evade), …

Word       Part of Speech  Label
The        determiner      Outside
company    noun            Outside
’s         possessive      Outside
sales      noun            Outside
force      noun            Outside
applauded  verb            Begin-Event
the        determiner      Outside
shake      noun            Begin-Event
up         particle        Inside-Event
.          punctuation     Outside
Times and Events: State of the Art
Performance: times ~90%, events ~80%

Mr Bryza, it's been [Event reported] that Azerbaijan, Georgia and Turkey [Event plan] to [Event start] [Event construction] of Kars Akhalkalaki Tbilisi Baku railroad in [Time May], [Time 2007].

Why are events harder?
- No orthographic cues (capitalization, digits, etc.)
- More parts of speech (nouns, verbs and adjectives)
Temporal Links
Everything so far looked like labeling spans:
  Aaaa [X bb] ccccc [Y dd eeeee] fff [Z gggg]
But now we want a link between two spans:
  Aaaa [X bb] ccccc [Y dd eeeee] fff ggg, with a relation connecting X and Y
Word-by-word classification won’t work!
Temporal Links: Classification
Pairwise classification: each event with each time.

Saddam Hussein [Time today] [Event sought] [Event peace] on another front by [Event promising] to [Event withdraw] from Iranian territory and [Event release] soldiers [Event captured] during the Iran-Iraq [Event war].

Event      Time   Label
sought     today  During
peace      today  After
promising  today  During
withdraw   today  After
release    today  After
captured   today  Before
war        today  Before
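Pairwise classification means one instance per (event, time) pair. A minimal sketch (names illustrative; in a real system each instance would also carry clue features, and the Before/During/After label is what the model predicts):

```python
from itertools import product

def pairwise_instances(events, times):
    """Build one classification instance for each (event, time) pair,
    unlike word-by-word classification which yields one per word."""
    return [{"event": e, "time": t} for e, t in product(events, times)]

events = ["sought", "peace", "promising", "withdraw",
          "release", "captured", "war"]
times = ["today"]
instances = pairwise_instances(events, times)
print(len(instances))  # 7 events x 1 time = 7 instances
```

Note the cost: with E events and T times the model must classify E x T pairs, so the number of instances grows much faster than the document length.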
Temporal Links: Clues
- Tense of the event: said (past tense) is probably Before today; says (present tense) is probably During today
- Nearby temporal expression: in “said today”, said is During today; in “captured in 1989”, captured is During 1989
- Negativity: in “People believe this”, believe is During today; in “People don’t believe this any more”, believe is Before today
Temporal Links: Features
Saddam Hussein [Time today] [Event sought] [Event peace] on another front by [Event promising] to [Event withdraw] from Iranian territory…

Event      Time   Tense       Nearby Time  Negativity  Label
sought     today  past        today        positive    During
peace      today  none        none         positive    After
promising  today  present     none         positive    During
withdraw   today  infinitive  none         positive    After

As numeric feature vectors:
1 0 0 0 0 1 0 0 0 1 0 0  → During
0 1 0 0 0 0 1 0 0 0 1 0  → After
0 0 1 0 0 0 0 1 0 0 1 0  → During
0 0 0 1 0 0 0 0 1 0 1 0  → After
Temporal Links: State of the Art
Corpora with temporal links:
- PropBank: verbs and subjects/objects
- TimeBank: certain pairs of events (e.g. a reporting event and the event reported)
- TempEval A: events and times in the same sentence
- TempEval B: events in a document and the document time
Performance on TempEval data:
- Same-sentence links (A): ~60%
- Document-time links (B): ~80%
What will make timelines better?
Larger corpora:
- TempEval is only ~50 documents; the TreeBank is ~2400
More types of links:
- Event-time pairs for all events (TempEval only considers high-frequency events)
- Event-event pairs in the same sentence
Summary
Statistical NLP asks:
- What’s the classification problem? Word-by-word? Pairwise?
- What’s the feature space? What are the linguistic clues? What does the N-dimensional space look like?
Statistical NLP needs:
- Learning algorithms that stay efficient when N is very large
- Large-scale corpora with linguistic labels
References
Symbolic NLP:
- Terry Winograd. 1972. Understanding Natural Language. Academic Press.
Statistical NLP:
- Daniel M. Bikel, Richard Schwartz, and Ralph M. Weischedel. 1999. “An Algorithm that Learns What's in a Name.” Machine Learning.
- Kadri Hacioglu, Ying Chen, and Benjamin Douglas. 2005. “Automatic Time Expression Labeling for English and Chinese Text.” In Proceedings of CICLing-2005.
- Ellen M. Voorhees and Hoa Trang Dang. 2005. “Overview of the TREC 2005 Question Answering Track.” In Proceedings of the Fourteenth Text REtrieval Conference.
References (continued)
Corpora:
- Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. “Building a Large Annotated Corpus of English: The Penn Treebank.” Computational Linguistics, 19:313-330.
- Martha Palmer, Dan Gildea, and Paul Kingsbury. 2005. “The Proposition Bank: A Corpus Annotated with Semantic Roles.” Computational Linguistics, 31(1).
- James Pustejovsky, Patrick Hanks, Roser Saurí, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro, and Marcia Lazo. 2003. “The TIMEBANK Corpus.” In Proceedings of Corpus Linguistics 2003: 647-656.
Feature Windowing (1)
Problem: word-by-word classification gives no context.
Solution: include surrounding features.

Word       Part of Speech  Label
The        determiner      Outside
company    noun            Outside
’s         possessive      Outside
sales      noun            Outside
force      noun            Outside
applauded  verb            Begin-Event
the        determiner      Outside
shake      noun            Begin-Event
up         particle        Inside-Event
.          punctuation     Outside
Feature Windowing (2)
- From the previous word: features and label
- From the current word: features
- From the following word: features
Need special values like !START! and !END!.

Word-1  Word0  Word+1  POS-1  POS0  POS+1  Label-1  Label
the     shake  up      DT     NN    PRT    Outside  Begin
shake   up     .       NN     PRT   O      Begin    Inside
up      .      !END!   PRT    O     !END!  Inside   Outside
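A sketch of extracting the windowed features above. Names are illustrative; the previous-label column is omitted because at test time it would come from the model's own earlier predictions rather than from the data.

```python
def windowed_features(words, tags, index):
    """Features for one word plus its immediate neighbors, padding
    the sentence edges with special !START! / !END! values."""
    def word_at(i):
        if i < 0:
            return "!START!"
        if i >= len(words):
            return "!END!"
        return words[i]

    def tag_at(i):
        if i < 0:
            return "!START!"
        if i >= len(tags):
            return "!END!"
        return tags[i]

    return {
        "word-1": word_at(index - 1), "word0": word_at(index),
        "word+1": word_at(index + 1),
        "pos-1": tag_at(index - 1), "pos0": tag_at(index),
        "pos+1": tag_at(index + 1),
    }

words = ["the", "shake", "up", "."]
tags = ["DT", "NN", "PRT", "O"]
print(windowed_features(words, tags, 3))
# {'word-1': 'up', 'word0': '.', 'word+1': '!END!',
#  'pos-1': 'PRT', 'pos0': 'O', 'pos+1': '!END!'}
```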