Relation Extraction
Slides from Dan Jurafsky, Rion Snow, Jim Martin, Chris Manning and William Cohen
Dec 28, 2015
Background: Information Extraction
Extract information from text
• Sometimes called text analytics commercially
• Extract entities (the people, organizations, locations, times, dates, genes, diseases, medicines, etc. in a text)
• Extract the relations between entities
• Figure out the larger events that are taking place
Information Extraction
• Creating knowledge bases and ontologies
• Implications for cognitive modeling
• Digital libraries
  • Google Scholar and CiteSeer need to extract the title, author and references
• Bioinformatics
• Patent analysis
• Specific market segments for stock analysis
• SEC filings
• Intelligence analysis
Paradigms: Data Mining vs Text Mining
Data mining is a set of techniques that aims to extract knowledge by analyzing large amounts of data, e.g. from databases, using simple patterns and statistical methods.
In contrast, the goal of text mining (or machine reading) is to extract the relationships between entities that are stated in the examined text, which is semantically deeper than what data mining does.
The difference is in the source information: the relationships already exist in the source text, so we do not need to guess them, only to discover them!
Outline
• Reminder: Named Entity Tagging
• Relation Extraction
  • Hand-built patterns
  • Seed (bootstrap) methods
  • Supervised classification
  • Distant supervision
What is “Information Extraction”
Slide from William Cohen
As a task: filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…
IE output:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
What is “Information Extraction”
Slide from William Cohen
As a family of techniques:
Information Extraction = segmentation + classification + clustering + association
Segments found in the example text (“named entity extraction”): Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
Extracting Structured Knowledge
Each article can contain hundreds or thousands of items of knowledge...
“The Lawrence Livermore National Laboratory (LLNL) in Livermore, California is a scientific research laboratory founded by the University of California in 1952.”
• LLNL EQ Lawrence Livermore National Laboratory
• LLNL LOC-IN California
• Livermore LOC-IN California
• LLNL IS-A scientific research laboratory
• LLNL FOUNDED-BY University of California
• LLNL FOUNDED-IN 1952
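One simple way to hold the items of knowledge from the LLNL sentence in a program is as (subject, relation, object) tuples. The sketch below is only an illustration; the helper names `objects_of` and `subjects_with` are hypothetical.

```python
# Facts from the LLNL sentence, stored as (subject, relation, object) triples.
TRIPLES = [
    ("LLNL", "EQ", "Lawrence Livermore National Laboratory"),
    ("LLNL", "LOC-IN", "California"),
    ("Livermore", "LOC-IN", "California"),
    ("LLNL", "IS-A", "scientific research laboratory"),
    ("LLNL", "FOUNDED-BY", "University of California"),
    ("LLNL", "FOUNDED-IN", "1952"),
]

def objects_of(subject, relation, triples=TRIPLES):
    """Return every object o such that (subject, relation, o) is a stored fact."""
    return [o for s, r, o in triples if s == subject and r == relation]

def subjects_with(relation, obj, triples=TRIPLES):
    """Return every subject s such that (s, relation, obj) is a stored fact."""
    return [s for s, r, o in triples if r == relation and o == obj]
```

For example, `objects_of("LLNL", "FOUNDED-IN")` looks up the founding year, and `subjects_with("LOC-IN", "California")` lists everything located in California.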
Goal: Machine-readable summaries
Subject    Relation      Object
p53        is_a          protein
Bax        is_a          protein
p53        has_function  apoptosis
Bax        has_function  induction
apoptosis  involved_in   cell_death
Bax        is_in         mitochondrial outer membrane
Bax        is_in         cytoplasm
apoptosis  related_to    caspase activation
...        ...           ...
Textual abstract: summary for human
Structured knowledge extraction: summary for machine
From Unstructured Text to Structured Knowledge
slide from Rion Snow
Unstructured text:
• News articles...
• Blog posts...
• Scientific journal articles...
• Tweets, instant messages, chat logs...
Relation Extraction
What is relation extraction?
“Founded in 1801 as South Carolina College, USC is the flagship institution of the University of South Carolina System and offers more than 350 programs of study leading to bachelor's, master's, and doctoral degrees from fourteen degree-granting colleges and schools to an enrollment of approximately 45,251 students, 30,967 on the main Columbia campus. …” [wiki]
• complex relation = summarization
• focus on binary relations: predicate(subject, object), or triples <subj predicate obj>
Wiki Info Box – structured data
• template: standard things about universities
  • Established
  • type
  • faculty
  • students
  • location
  • mascot
Focus on extracting binary relations
• predicate(subject, object) from predicate logic
• triples <subj relation object>
• Directed graphs
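A set of binary relations can be viewed as a directed, edge-labeled graph: each triple <subj relation object> is an edge from subj to object. The sketch below builds such a graph from three facts taken from the USC paragraph; the relation names (`established`, `flagship_of`, `main_campus`) are made up for illustration and are not infobox field names.

```python
from collections import defaultdict

def build_graph(triples):
    """Map each subject to a list of (relation, object) outgoing edges."""
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((rel, obj))
    return graph

def reachable(graph, start):
    """All nodes reachable from start by following edges in their direction."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Hypothetical relation names; the facts come from the USC paragraph.
usc_triples = [
    ("USC", "established", "1801"),
    ("USC", "flagship_of", "University of South Carolina System"),
    ("USC", "main_campus", "Columbia"),
]
usc_graph = build_graph(usc_triples)
```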
Why relation extraction?
• Create new structured KBs
• Augment existing ones: words -> WordNet, facts -> Freebase or DBpedia
• Support question answering: Jeopardy
Which relations?
• Automated Content Extraction (ACE): http://www.itl.nist.gov/iad/mig//tests/ace/
  • 17 relations
ACE examples
Unified Medical Language System (UMLS)
UMLS: 134 entities, 54 relations
http://www.nlm.nih.gov/research/umls/
UMLS semantic network
Current Relations in the UMLS Semantic Network
isa, associated_with, physically_related_to, part_of, consists_of, contains, connected_to, interconnects, branch_of, tributary_of, ingredient_of, spatially_related_to, location_of, adjacent_to, surrounds, traverses, functionally_related_to, affects, …
… temporally_related_to, co-occurs_with, precedes, conceptually_related_to, evaluation_of, degree_of, analyzes, assesses_effect_of, measurement_of, measures, diagnoses, property_of, derivative_of, developmental_form_of, method_of, …
Databases of Wikipedia Relations
• DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and to make this information readily available
• DBpedia allows you to make sophisticated queries
http://dbpedia.org/About
English version of the DBpedia knowledge base
• 3.77 million things
• 2.35 million are classified in an ontology, including:
  • 764,000 persons
  • 573,000 places (including 387,000 populated places)
  • 333,000 creative works (including 112,000 music albums, 72,000 films and 18,000 video games)
  • 192,000 organizations (including 45,000 companies and 42,000 educational institutions)
  • 202,000 species
  • 5,500 diseases
Freebase
google (freebase wiki): http://wiki.freebase.com/wiki/Main_Page
Ontological relations
• IS-A (hypernym)
• Instance-of
• has-Part
• hyponym (opposite of hypernym)
Reminder: Task 1: Named Entity Tagging
Slide from Chris Manning
General NER or Biomedical NER
<PER> John Hennessy</PER> is a professor at <ORG> Stanford University </ORG>, in <LOC> Palo Alto </LOC>.
<RNA> TAR </RNA> independent transactivation by <PROTEIN> Tat </PROTEIN> in cells derived from the <CELL> CNS </CELL> - a novel mechanism of <DNA> HIV-1 gene </DNA> regulation.
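Downstream relation extraction needs the tagged entities as data rather than as inline markup. A minimal sketch, assuming the tagger's output uses exactly the `<TYPE> ... </TYPE>` inline format shown in the two examples (the function name `entities` is hypothetical):

```python
import re

# Matches <TYPE> text </TYPE>, tolerating spaces just inside the tags.
TAGGED = re.compile(r"<(\w+)>\s*(.*?)\s*</\1>")

def entities(tagged_text):
    """Return (type, text) for every inline-tagged entity, in order of appearance."""
    return [(m.group(1), m.group(2)) for m in TAGGED.finditer(tagged_text)]

sentence = ("<PER> John Hennessy</PER> is a professor at "
            "<ORG> Stanford University </ORG>, in <LOC> Palo Alto </LOC>.")
```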
Reminder: Maximum Entropy Markov Model
$$P(t \mid h) = \frac{\exp\left(\sum_{j=1}^{m} \lambda_j f_j(h, t)\right)}{\sum_{k=1}^{K} \exp\left(\sum_{j=1}^{m} \lambda_j f_j(h, t_k)\right)}$$
Slide from Chris Manning
Example tagging: of/O HIV−1/DNA gene/DNA regulation/O
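In a maximum entropy Markov model, the probability of a tag t given the history h is a normalized exponential (softmax) of weighted feature sums over the K candidate tags. The sketch below computes that local distribution; the two feature functions and their weights are invented for illustration and are not trained values.

```python
import math

def tag_distribution(history, tags, features, weights):
    """P(t | h) for each candidate tag: softmax over weighted feature sums."""
    scores = {t: sum(w * f(history, t) for f, w in zip(features, weights)) for t in tags}
    z = sum(math.exp(s) for s in scores.values())  # normalizer over all K tags
    return {t: math.exp(s) / z for t, s in scores.items()}

# Two hypothetical binary features f_j(h, t).
features = [
    lambda h, t: 1.0 if h["word"] == "gene" and t == "DNA" else 0.0,
    lambda h, t: 1.0 if h["prev_tag"] == "DNA" and t == "DNA" else 0.0,
]
weights = [1.5, 0.8]  # made-up weights, not learned from data

probs = tag_distribution({"word": "gene", "prev_tag": "DNA"}, ["DNA", "O"], features, weights)
```

With these made-up weights the tag DNA gets the higher probability for the word "gene" following a DNA tag, which matches the example tagging of "HIV-1 gene".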
Task II: Relation Extraction
Relations between words
Language understanding applications need word meaning!
• Question answering
• Conversational agents
• Summarization
One key meaning component: word relations
• Hierarchical (ontological) relations
  • “San Francisco” ISA “city”
• Other relations between words
  • “alternator” is a part of a “car”
Relation Prediction
How can we capture the variability of expression of a relation in natural text from a large, unannotated corpus?
“...works by such authors as Herrick, Goldsmith, and Shakespeare.”
“If you consider authors like Shakespeare...”
“Shakespeare, author of The Tempest...”
“Some authors (including Shakespeare)...”
“Shakespeare was the author of several...”
Shakespeare IS-A author (0.87)
Hyponymy
One sense is a hyponym of another if the first sense is more specific, denoting a subclass of the other
• car is a hyponym of vehicle
• dog is a hyponym of animal
• mango is a hyponym of fruit
Conversely
• vehicle is a hypernym/superordinate of car
• animal is a hypernym of dog
• fruit is a hypernym of mango
superordinate  vehicle  fruit  furniture  mammal
hyponym        car      mango  chair      dog
WordNet relations
• X is-a-kind-of Y (hyponym / hypernym)
• X is-a-part-of Y (meronym / holonym)
WordNet is incomplete
In WordNet 2.1                       Not in WordNet
“insulin”, “progesterone”            “leptin”, “pregnenolone”
“combustibility”, “navigability”     “affordability”, “reusability”
“HTML”                               “XML”
“Google”, “Yahoo”                    “Microsoft”, “IBM”
Ontological relations are missing for many words
Especially true for specific domains (restaurants, auto parts, finance)
Other kinds of Relations: Disease Outbreaks
Extract structured information from text
Slide from Eugene Agichtein
May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis…
Information Extraction System (e.g., NYU’s Proteus)
Disease Outbreaks in The New York Times
Date       Disease Name     Location
Jan. 1995  Malaria          Ethiopia
July 1995  Mad Cow Disease  U.K.
Feb. 1995  Pneumonia        U.S.
May 1995   Ebola            Zaire
More relations: Protein Interactions
“We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex.”
[Figure: CBF-A and CBF-C linked by "interact" / "complex"; CBF-B linked to the CBF-A-CBF-C complex by "associates"]
Slide from Rosario and Hearst
Yet More Relations
CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.
Slide from Jim Martin
Relation Types
For generic news texts...
Slide from Jim Martin
More relations: UMLS
Unified Medical Language System
• integrates linguistic, terminological and semantic information
• Semantic Network consists of 134 semantic types and 54 relations between types
Pharmacologic Substance affects Pathologic Function
Pharmacologic Substance causes Pathologic Function
Pharmacologic Substance complicates Pathologic Function
Pharmacologic Substance diagnoses Pathologic Function
Pharmacologic Substance prevents Pathologic Function
Pharmacologic Substance treats Pathologic Function
Slide from Paul Buitelaar
Relations in Ontologies: GO (Gene Ontology)
• Aligns descriptions of gene products in different databases, including plant, animal and microbial genomes
• Organizing principles are molecular function, biological process and cellular component
Accession: GO:0009292
Ontology: biological process
Synonyms: broad: genetic exchange
Definition: In the absence of a sexual life cycle, the processes involved in the introduction of genetic information to create a genetically different individual.
Term Lineage
all : all (164142)
  GO:0008150 : biological process (115947)
    GO:0007275 : development (11892)
      GO:0009292 : genetic transfer (69)
Slide from Paul Buitelaar
Relations in Ontologies: geographical
[Figure: geographical ontology in F-Logic. Concepts: Geographical Entity (GE), Natural GE, Inhabited GE, country, river, mountain, city. Instances: Neckar, Zugspitze, Germany, Berlin, Stuttgart. Relations: is-a, instance_of, similar, flow_through, located_in, capital_of. Attributes: length (km) 367, height (m) 2962.]
Design: Philipp Cimiano
Slide from Paul Buitelaar
MeSH (Medical Subject Headings) Thesaurus
[Figure: a MeSH descriptor with its definition and synonym set]
Slide from Illhoi Yoo, Xiaohua (Tony) Hu,and Il-Yeol Song
MeSH Tree / MeSH Ontology
• Hierarchically arranged from most general to most specific.
• Actually a graph rather than a tree
  • terms normally appear in more than one place in the tree
[Figure: MeSH Tree]
Slide from Yoo, Hu, Song
Types of ACE Relations, 2003
• ROLE - relates a person to an organization or a geopolitical entity
  • Subtypes: member, owner, affiliate, client, citizen
• PART - generalized containment
  • Subtypes: subsidiary, physical part-of, set membership
• AT - permanent and transient locations
  • Subtypes: located, based-in, residence
• SOCIAL - social relations among persons
  • Subtypes: parent, sibling, spouse, grandparent, associate
Slide from Doug Appelt
Frequent Freebase Relations
Predicting the “is-a” relation
How can we capture the variability of expression of a relation in natural text from a large, unannotated corpus?
“...works by such authors as Herrick, Goldsmith, and Shakespeare.”
“If you consider authors like Shakespeare...”
“Shakespeare, author of The Tempest...”
“Some authors (including Shakespeare)...”
“Shakespeare was the author of several...”
Shakespeare IS-A author (0.87)
Why this is hard: Ambiguity!
Treatment <-> Disease: Cure? Prevent? Side Effect?
Which relations hold between 2 entities?
Different relations between Disease (Hepatitis) and Treatment
• Cure
  • These results suggest that con A-induced hepatitis was ameliorated by pretreatment with TJ-135.
• Prevent
  • A two-dose combined hepatitis A and B vaccine would facilitate immunization programs
• Vague
  • Effect of interferon on hepatitis B
Slide from B. Rosario and M. Hearst
5 easy methods for relation extraction
1. Hand-built patterns
2. Supervised methods
3. Bootstrapping (seed) methods
4. Unsupervised methods
5. Distant supervision
A complex hand-built extraction rule [NYU Proteus]
Goal: Add hyponyms to WordNet directly from text.
Intuition from Hearst (1992)
“Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use”
• What does Gelidium mean?
• How do you know?
Hearst’s Hand-Designed Lexico-Syntactic Patterns
(Hearst, 1992): Automatic Acquisition of Hyponyms
• “Y such as X ((, X)* (, and/or) X)”
• “such Y as X…”
• “X… or other Y”
• “X… and other Y”
• “Y including X…”
• “Y, especially X…”
Hearst’s hand-built patterns for Relation Extraction
Hearst pattern   Example occurrences
X and other Y    ...temples, treasuries, and other important civic buildings.
X or other Y     Bruises, wounds, broken bones or other injuries...
Y such as X      The bow lute, such as the Bambara ndang...
Such Y as X      ...such authors as Herrick, Goldsmith, and Shakespeare.
Y including X    ...common-law countries, including Canada and England...
Y, especially X  European countries, especially France, England, and Spain...
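Some of these patterns can be approximated with regular expressions over raw text. The sketch below is a simplification, not Hearst's implementation: it covers only "such Y as X", "Y such as X", "Y including X" and "Y, especially X", treats the single word before the cue as the hypernym Y, and assumes every hyponym X is a capitalized single token. A real system would match over noun-phrase chunks instead.

```python
import re

WORD = r"[A-Za-z][\w-]*"   # stand-in for the head of the hypernym noun phrase
NAME = r"[A-Z][\w-]*"      # assumption: hyponyms are capitalized single tokens
NAME_LIST = rf"{NAME}(?:, {NAME})*(?:,? (?:and|or) {NAME})?"

PATTERNS = [
    re.compile(rf"such (?P<hyper>{WORD}) as (?P<hypos>{NAME_LIST})"),
    re.compile(rf"(?P<hyper>{WORD}),? (?:such as|including|especially) (?P<hypos>{NAME_LIST})"),
]

def hearst_pairs(text):
    """Return (hyponym, hypernym) pairs matched by the simplified patterns."""
    pairs = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            for hypo in re.findall(NAME, m.group("hypos")):
                pairs.append((hypo, m.group("hyper")))
    return pairs
```

Because of the capitalization assumption, an occurrence such as "The bow lute, such as the Bambara ndang..." is not matched by this sketch.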
Problem with hand-built patterns
Requires that we hand-build patterns for each relation!
• don’t want to have to do this for all possible relations!
• we’d like better accuracy
2. Supervised Relation Extraction
Sometimes done in 3 steps
1. Find all pairs of named entities
2. Decide if the 2 entities are related
3. If yes, classify the relation
Why the extra step?
• Cuts down on training time for classification by eliminating most pairs
• Allows separate feature sets that are appropriate for each task
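The three steps can be sketched as a small pipeline. The two classifier functions below are hypothetical rule-based stand-ins for trained models, included only so the control flow runs; the entity list reuses the tagged example from the named entity slide.

```python
from itertools import combinations

def extract_relations(entities, is_related, classify):
    """entities: list of (text, type). Returns (arg1, relation, arg2) triples."""
    relations = []
    for e1, e2 in combinations(entities, 2):   # step 1: all pairs of named entities
        if not is_related(e1, e2):             # step 2: filter out unrelated pairs
            continue
        relations.append((e1[0], classify(e1, e2), e2[0]))  # step 3: label the relation
    return relations

# Hypothetical stand-ins for the two trained classifiers.
def is_related(e1, e2):
    return {e1[1], e2[1]} == {"PER", "ORG"}

def classify(e1, e2):
    return "EMP-ORG"

ents = [("John Hennessy", "PER"), ("Stanford University", "ORG"), ("Palo Alto", "LOC")]
found = extract_relations(ents, is_related, classify)
```

Only the pairs that pass the step 2 filter reach the step 3 classifier, which is where the saving in classification work comes from.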
Relation Analysis
Usually just run on named entities within the same sentence
Slide from Jim Martin
Relation Extraction
Task definition: to label the semantic relation between a pair of entities in a sentence (fragment)
Slide from Jing Jiang
…[leader arg-1] of a minority [government arg-2]…
Candidate labels: PHYS, PER-SOC, EMP-ORG, NIL
• PHYS: Physical
• PER-SOC: Personal / Social
• EMP-ORG: Employment / Membership / Subsidiary
Supervised Learning
Supervised machine learning (e.g. [Zhou et al. 2005], [Bunescu & Mooney 2005], [Zhang et al. 2006], [Surdeanu & Ciaramita 2007])
Training data is needed for each relation type
…[leader arg-1] of a minority [government arg-2]…
Features:
• arg-1 word: leader
• arg-2 type: ORG
• dependency: arg-1 of arg-2
Label: EMP-ORG (other candidates: PHYS, PER-SOC, NIL)
Slide from Jing Jiang
ACE 2008 Six Relations
Features: Words
• Headwords of M1 and M2, and combination
  • George Washington Bridge
• Bag of words and bigrams in M1 and M2
• Words or bigrams in particular positions to the left and right of M1 and M2
  • +/- 1, 2, 3
• Bag of words or bigrams between the two entities
Features: Named Entity Type and Mention Level
• Named-entity types (ORG, LOC, etc)
• Concatenation of the types
• Entity level of M1 and M2 (NAME, NOMINAL, PRONOUN)
Features: Parse Tree and Base Phrases
Syntactic environment
• Constituent path through the tree from one to the other
• Base syntactic chunk sequence from one to the other
• Dependency path
Slide from Jim Martin
Features: Gazetteers and trigger words
• Personal relative trigger list
  • from WordNet: parent, wife, husband, grandparent, etc
• Country name list
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
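A few of the word and entity-type features can be computed directly from token spans. The sketch below works on the American Airlines sentence with M1 = "American Airlines" (ORG) and M2 = "Tim Wagner" (PER). The feature names are made up, and taking the last token of a mention as its headword is a simplifying assumption, not a real head-finding rule.

```python
def pair_features(tokens, m1, m2):
    """m1, m2: (start, end, entity_type) token spans, m1 before m2, end exclusive."""
    s1, e1, t1 = m1
    s2, e2, t2 = m2
    head1, head2 = tokens[e1 - 1], tokens[e2 - 1]  # assumption: head = last token
    return {
        "head_m1": head1,
        "head_m2": head2,
        "head_pair": head1 + "_" + head2,
        "types": t1 + "-" + t2,                     # concatenation of entity types
        "m1_left1": tokens[s1 - 1] if s1 > 0 else "<s>",
        "m2_left1": tokens[s2 - 1] if s2 > 0 else "<s>",
        "m2_right1": tokens[e2] if e2 < len(tokens) else "</s>",
        "between_bow": sorted(set(tokens[e1:s2])),  # bag of words between mentions
    }

tokens = ("American Airlines , a unit of AMR , immediately matched "
          "the move , spokesman Tim Wagner said .").split()
feats = pair_features(tokens, (0, 2, "ORG"), (14, 16, "PER"))
```

Here the word immediately to the left of M2 is "spokesman", the kind of cue a classifier can pick up on for an employment relation.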
Classifiers for supervised methods
Now you can use any classifier you like
• SVM
• Logistic regression
• Naïve Bayes
• etc
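As one concrete instance, here is a minimal multinomial Naïve Bayes with add-one smoothing over string-valued features, written with the standard library only. The training examples and feature strings are a toy, made-up data set meant to show the mechanics, not real annotated data.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (feature_list, label). Returns the counts used by predict_nb."""
    label_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for feats, label in examples:
        label_counts[label] += 1
        feat_counts[label].update(feats)
        vocab.update(feats)
    return label_counts, feat_counts, vocab

def predict_nb(model, feats):
    """Return the label with the highest log prior plus smoothed log likelihood."""
    label_counts, feat_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        denom = sum(feat_counts[label].values()) + len(vocab)
        for f in feats:
            if f in vocab:  # features never seen in training are ignored
                score += math.log((feat_counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set (invented for illustration).
train = [
    (["types=PER-ORG", "between=of"], "EMP-ORG"),
    (["types=PER-ORG", "between=spokesman"], "EMP-ORG"),
    (["types=PER-PER", "between=wife"], "PER-SOC"),
    (["types=ORG-LOC", "between=in"], "PHYS"),
]
model = train_nb(train)
```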
Summary
• Can get high accuracies with enough hand-labeled training data
  • If test data looks exactly like the training data
• But
  • labeling 5000 relations (and named entities) is expensive
  • the approach doesn’t generalize to different genres
Bootstrapping Approaches
What if you don’t have enough annotated text to train on?
• But you might have some seed tuples
• Or you might have some patterns that work pretty well
Can you use those seeds to do something useful?
• Co-training and active learning use the seeds to train classifiers to tag more data to train better classifiers...
• Bootstrapping tries to learn directly (populate a relation) through direct use of the seeds
Slide from Jim Martin
Bootstrapping Example: Seed Tuple
<Mark Twain, Elmira> Seed tuple
• Grep (google)
  • “Mark Twain is buried in Elmira, NY.”
    • X is buried in Y
  • “The grave of Mark Twain is in Elmira”
    • The grave of X is in Y
  • “Elmira is Mark Twain’s final resting place”
    • Y is X’s final resting place.
Use those patterns to grep for new tuples that you don’t already know
Slide from Jim Martin
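The seed-tuple idea can be sketched in two small functions: one turns a sentence containing the seed pair into a template, the other greps new text with that template. This is a rough illustration under the assumption that arguments are runs of capitalized words; the Emily Dickinson sentence is a made-up test input, not from the slides.

```python
import re

NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"  # assumption: arguments are runs of capitalized words

def induce_pattern(sentence, x, y):
    """Turn a sentence containing the seed pair into a template such as 'The grave of X is in Y'."""
    if x not in sentence or y not in sentence:
        return None
    return sentence.rstrip(".").replace(x, "X").replace(y, "Y")

def pattern_to_regex(template):
    """Compile a template into a regex with named groups X and Y."""
    parts = re.split(r"\b([XY])\b", template)
    pieces = [f"(?P<{p}>{NAME})" if p in ("X", "Y") else re.escape(p) for p in parts]
    return re.compile("".join(pieces))

def find_tuples(templates, text):
    """Grep text with each template and return the (X, Y) pairs found."""
    found = []
    for template in templates:
        for m in pattern_to_regex(template).finditer(text):
            found.append((m.group("X"), m.group("Y")))
    return found

template = induce_pattern("The grave of Mark Twain is in Elmira", "Mark Twain", "Elmira")
new_tuples = find_tuples([template], "The grave of Emily Dickinson is in Amherst.")
```

Note that whole-sentence templates are very specific; the first slide example would yield "X is buried in Y, NY" here, so a practical system would trim the context around the arguments.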
Hearst (1992) proposal for bootstrapping
• Choose lexical relation R
• Gather a set of pairs that have this relation
• Find places in the corpus where these expressions occur near each other and record the environment
• Find the commonalities among these environments and hypothesize that common ones yield patterns that indicate the relation of interest
Bootstrapping Relations
Slide from Jim Martin
DIPRE (Brin 1998)
Extract <author, book> pairs
Start with these 5 seeds
Learn these patterns:
Now iterate, using these patterns to get more instances and patterns…
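The iteration step can be sketched as a loop that alternates between inducing patterns from known tuples and applying them to find new tuples, stopping when nothing new appears. This is a simplified sketch, not Brin's actual algorithm: a pattern here is only (author_first, middle_text), whereas DIPRE patterns also carry URL and prefix/suffix context, and the corpus, the NAME regex and the function names are assumptions for illustration.

```python
import re

NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"  # crude stand-in for entity recognition

# Toy corpus, assumed for illustration.
corpus = [
    "Charles Dickens wrote Great Expectations in 1861.",
    "William Shakespeare wrote Hamlet around 1600.",
    "Hamlet, written by William Shakespeare, is often staged.",
    "Emma, written by Jane Austen, appeared in 1815.",
]

def induce_patterns(tuples, sentences):
    """Record (author_first, text between the arguments) for each co-occurrence."""
    patterns = set()
    for author, book in tuples:
        for s in sentences:
            if author in s and book in s:
                a, b = s.index(author), s.index(book)
                if a < b:
                    patterns.add((True, s[a + len(author):b]))
                else:
                    patterns.add((False, s[b + len(book):a]))
    return patterns

def apply_patterns(patterns, sentences):
    """Find new (author, book) pairs wherever a pattern's middle text occurs."""
    found = set()
    for author_first, middle in patterns:
        first, second = ("A", "B") if author_first else ("B", "A")
        regex = f"(?P<{first}>{NAME}){re.escape(middle)}(?P<{second}>{NAME})"
        for s in sentences:
            for m in re.finditer(regex, s):
                found.add((m.group("A"), m.group("B")))
    return found

def bootstrap(seeds, sentences, max_iters=10):
    tuples, patterns = set(seeds), set()
    for _ in range(max_iters):
        new_patterns = induce_patterns(tuples, sentences)
        new_tuples = apply_patterns(new_patterns, sentences)
        if new_patterns <= patterns and new_tuples <= tuples:
            break  # fixed point: nothing new was learned
        patterns |= new_patterns
        tuples |= new_tuples
    return tuples, patterns

tuples, patterns = bootstrap({("Charles Dickens", "Great Expectations")}, corpus)
```

Starting from one seed, the first pass learns the " wrote " pattern and finds the Shakespeare pair; the second pass uses that pair to learn ", written by " and finds the Austen pair. The sketch has no confidence scoring, so a single bad pattern would pollute every later round.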
Snowball [Agichtein & Gravano 2000]
Exploit duality between patterns and tuples
- find tuples that match a set of patterns
- find patterns that match a lot of tuples
- bootstrapping approach
Loop: Initial Seed Tuples → Occurrences of Seed Tuples → Tag Entities → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table → (repeat)
Snowball
Initial seed tuples:
ORGANIZATION    LOCATION
MICROSOFT       REDMOND
IBM             ARMONK
BOEING          SEATTLE
INTEL           SANTA CLARA
Occurrences of seed tuples:
Computer servers at Microsoft’s headquarters in Redmond…
In mid-afternoon trading, shares of Redmond-based Microsoft fell…
The Armonk-based IBM introduced a new line…
The combined company will operate from Boeing’s headquarters in Seattle.
Intel, Santa Clara, cut prices of its Pentium processor.
Snowball
Require that X and Y be named entities of particular types
ORGANIZATION {<’s 0.7> <headquarters 0.7> <in 0.7>} LOCATION
LOCATION {<- 0.75> <based 0.75>} ORGANIZATION
Patterns
An (extraction) pattern has format <left, tag1, middle, tag2, right>, where tag1, tag2 are named-entity tags and left, middle, and right are vectors of weighted terms
• patterns derived directly from occurrences are too specific
Example: ORGANIZATION 's central headquarters in LOCATION is home to...
tag1 = ORGANIZATION
middle = {<'s 0.5>, <central 0.5>, <headquarters 0.5>, <in 0.5>}
tag2 = LOCATION
right = {<is 0.75>, <home 0.75>}
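A pattern of this form can be compared with a text occurrence by summing the dot products of the three weighted-term vectors, and returning zero when the entity tags disagree. The sketch below uses the weights shown on these slides; treating the left contexts as empty, representing vectors as dicts, and the names dot and match are assumptions for illustration, and Snowball's own weighting and normalization may differ.

```python
def dot(u, v):
    """Dot product of two sparse term-weight vectors stored as dicts."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())

def match(pattern, occurrence):
    """Similarity of two 5-tuples; zero unless the entity tags agree."""
    if (pattern["tag1"], pattern["tag2"]) != (occurrence["tag1"], occurrence["tag2"]):
        return 0.0
    return float(sum(dot(pattern[k], occurrence[k]) for k in ("left", "middle", "right")))

# Pattern: ORGANIZATION {'s, headquarters, in} LOCATION (left/right assumed empty).
pattern = {
    "left": {}, "tag1": "ORGANIZATION",
    "middle": {"'s": 0.7, "headquarters": 0.7, "in": 0.7},
    "tag2": "LOCATION", "right": {},
}

# Occurrence: "ORGANIZATION 's central headquarters in LOCATION is home to..."
occurrence = {
    "left": {}, "tag1": "ORGANIZATION",
    "middle": {"'s": 0.5, "central": 0.5, "headquarters": 0.5, "in": 0.5},
    "tag2": "LOCATION", "right": {"is": 0.75, "home": 0.75},
}

score = match(pattern, occurrence)

# Same contexts but with the tags swapped: no match at all.
swapped = dict(occurrence, tag1="LOCATION", tag2="ORGANIZATION")
no_score = match(pattern, swapped)
```

Three middle terms overlap, so the score is 3 × 0.7 × 0.5 = 1.05 (up to floating-point rounding). The extra word "central" costs nothing here, which is the point of soft matching: an overly specific occurrence can still match a more general pattern. A system would typically accept an occurrence only when the score clears some threshold.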
Pattern Clusters
Cluster patterns; cluster centroids define patterns
Cluster 1:
{<servers 0.75> <at 0.75>} ORGANIZATION {<’s 0.5> <central 0.5> <headquarters 0.5> <in 0.5>} LOCATION
{<operate 0.75> <from 0.75>} ORGANIZATION {<’s 0.7> <headquarters 0.7> <in 0.7>} LOCATION
Cluster 2:
{<shares 0.75> <of 0.75>} LOCATION {<- 0.75> <based 0.75>} ORGANIZATION {<fell 1>}
{<the 1>} LOCATION {<- 0.75> <based 0.75>} ORGANIZATION {<introduced 0.75> <a 0.75>}
5 easy methods for relation extraction
1. Hand-built patterns
2. Supervised methods
3. Bootstrapping (seed) methods
4. Unsupervised methods
5. Distant supervision
Distant supervision paradigm
Instead of hand-creating 5 seed examples, use a large database to get our seed examples
• lots of examples
• supervision from a database, not a corpus!
• not genre-dependent!
Create lots and lots of noisy features from all these examples
Combine in a classifier
Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17
Mintz, Bills, Snow, Jurafsky (2009) Distant supervision for relation extraction without labeled data. ACL-2009.
Distant supervision paradigm
Has advantages of supervised classification:
• use of rich hand-created knowledge
Has advantages of unsupervised classification:
• can use very large amounts of data
• allows for a very large number of weak features
• not sensitive to training corpus
Relation Classification with “Distant Supervision”
We construct a noisy training set consisting of occurrences from our corpus that contain an IS-A pair according to WordNet.
Slide from Rion Snow
Suppose scientists could erase memories with a single substance in the brain.
Relation Classification with “Distant Supervision”
This leads to high-signal examples like:
“...consider authors like Shakespeare...”
“Shakespeare, author of The Tempest...”
“Some authors (including Shakespeare)...”
“Shakespeare was the author of several...”
But noisy examples like:
“The author of Shakespeare in Love...”
“...authors at the Shakespeare Festival...”
Training set (TREC and Wikipedia): 14,000 hypernym pairs, ~600,000 total pairs
Slide from Rion Snow
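The labeling step can be sketched as follows: every sentence that contains both members of a known IS-A pair is marked as a positive training example, with no human checking. This is a minimal sketch under stated assumptions: the isa_pairs set stands in for a WordNet lookup, the plural handling is deliberately crude, the last sentence is an invented negative, and distant_label is a hypothetical helper name.

```python
import re

# Known IS-A pairs (hyponym, hypernym); stands in for WordNet lookups.
isa_pairs = {("shakespeare", "author")}

sentences = [
    "...consider authors like Shakespeare...",
    "Shakespeare, author of The Tempest...",
    "Some authors (including Shakespeare)...",
    "Shakespeare was the author of several...",
    "The author of Shakespeare in Love...",        # noisy: a film title
    "...authors at the Shakespeare Festival...",   # noisy: an event name
    "The festival opens in June.",                 # invented: no pair present
]

def distant_label(sentences, pairs):
    """Label a sentence positive whenever both words of a known pair occur in it."""
    labeled = []
    for s in sentences:
        words = set(re.findall(r"[a-z]+", s.lower()))
        for hypo, hyper in pairs:
            # Crude plural handling: "authors" also counts as "author".
            if hypo in words and (hyper in words or hyper + "s" in words):
                labeled.append((s, hypo, hyper))
    return labeled

labeled = distant_label(sentences, isa_pairs)
```

All six Shakespeare sentences come back labeled positive, including the two noisy ones, which is exactly the noise the slide describes. The hope is that a classifier trained on many such examples gives low weight to patterns that only appear in the noisy cases.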
How to learn patterns
1. Take corpus sentences
2. Collect noun pairs: 752,311 pairs from 6M words of newswire
3. Is pair an IS-A in WordNet? 14,387 yes, 737,924 no
4. Parse the sentences
5. Extract patterns: 69,592 dependency paths (>5 pairs)
6. Train classifier on these patterns: logistic regression with 70K features (actually converted to 974,288 bucketed binary features)
Example: “… doubly heavy hydrogen atom called deuterium…” → (atom, deuterium): YES
Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17
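The final step, training a classifier whose features are patterns, can be sketched with a tiny logistic regression written from scratch. Everything concrete here is an assumption for illustration: the six training rows and their pattern names are invented, raw counts are used instead of the bucketed binary features mentioned above, and the learning rate and epoch count are arbitrary.

```python
import math

# Invented toy data: pattern counts for one noun pair, label 1 = hypernym pair.
data = [
    ({"HYPER called HYPO": 2}, 1),
    ({"HYPER such as HYPO": 1, "HYPER called HYPO": 1}, 1),
    ({"HYPER such as HYPO": 2}, 1),
    ({"A and B": 3}, 0),
    ({"A of B": 1, "A and B": 1}, 0),
    ({"A of B": 2}, 0),
]

def predict(weights, bias, feats):
    """Probability that a noun pair with these pattern counts is a hypernym pair."""
    z = bias + sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(data, lr=0.1, epochs=300):
    """Batch gradient descent on the logistic (cross-entropy) loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        grad_w, grad_b = {}, 0.0
        for feats, label in data:
            err = predict(weights, bias, feats) - label
            grad_b += err
            for f, v in feats.items():
                grad_w[f] = grad_w.get(f, 0.0) + err * v
        bias -= lr * grad_b
        for f, g in grad_w.items():
            weights[f] = weights.get(f, 0.0) - lr * g
    return weights, bias

weights, bias = train(data)
p_called = predict(weights, bias, {"HYPER called HYPO": 1})
p_and = predict(weights, bias, {"A and B": 1})
```

After training, patterns seen only with positive pairs get positive weights and those seen only with negative pairs get negative weights, so an unseen pair linked by "called" scores higher than one linked by "and". With tens of thousands of real pattern features, regularization would matter far more than it does in this toy.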
One of 70,000 patterns
“<superordinate> ‘called’ <subordinate>”
Learned from cases such as:
“sarcoma / cancer”: …an uncommon bone cancer called osteogenic sarcoma and to…
“deuterium / atom”: …heavy water rich in the doubly heavy hydrogen atom called deuterium.
New pairs discovered:
“efflorescence / condition”: …and a condition called efflorescence are other reasons for…
“o’neal_inc / company”: …The company, now called O'Neal Inc., was sole distributor of E-Ferol…
“hat_creek_outfit / ranch”: …run a small ranch called the Hat Creek Outfit.
“hiv-1 / aids_virus”: …infected by the AIDS virus, called HIV-1.
“bateau_mouche / attraction”: …local sightseeing attraction called the Bateau Mouche...
“kibbutz_malkiyya / collective_farm”: …an Israeli collective farm called Kibbutz Malkiyya…
Hypernym Precision / Recall for all Features
Slide from Rion Snow
Idea: use each pattern as a feature!
Precision/Recall for Hypernym Classification: 10-fold Cross Validation on 14,000 WordNet-Labeled Pairs
[Figure: precision/recall curve for logistic regression]
Slide from Rion Snow
Extracting more relations with distant supervision
Training set: 2700 relations (> 10 instances), 5.2 million instances, 3.7 million entities
Corpus: 2.3 million articles, 31.5 million sentences
Mintz, Bills, Snow, Jurafsky (2009) Distant supervision for relation extraction without labeled data. ACL-2009.
Frequent Freebase Relations
Algorithm: Distant Supervision
A kind of weakly supervised learning
• Use a large database to get seed examples
• Create lots and lots of noisy pattern features from all these examples
• Combine in a classifier
Extract parse and other features
“Astronomer Edwin Hubble was born in Marshfield, Missouri”
“Named entities”:
• Edwin Hubble is a PERSON
• Marshfield is a LOCATION
Lexical items nearby…
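A lexical feature for an entity pair can be sketched as the two entity types plus the words between them and a small window on either side. The pre-tagged token list, the one-word window, and the name lexical_feature are assumptions for illustration; the exact feature templates used in the cited work may differ.

```python
# Pre-tagged tokens (text, named-entity tag); the tagging is assumed as given.
tokens = [
    ("Astronomer", "O"),
    ("Edwin Hubble", "PERSON"),
    ("was", "O"),
    ("born", "O"),
    ("in", "O"),
    ("Marshfield", "LOCATION"),
    (",", "O"),
    ("Missouri", "LOCATION"),
]

def lexical_feature(tokens, e1, e2, window=1):
    """Build one conjoined feature for the entities at token indices e1 < e2."""
    left = tuple(text for text, _ in tokens[max(0, e1 - window):e1])
    between = tuple(text for text, _ in tokens[e1 + 1:e2])
    right = tuple(text for text, _ in tokens[e2 + 1:e2 + 1 + window])
    return (left, tokens[e1][1], between, tokens[e2][1], right)

feature = lexical_feature(tokens, 1, 5)
```

Because the whole tuple is treated as a single feature, it is precise but sparse: it fires only when the same words appear in the same positions, which is why very large amounts of text are needed.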
New relations learned
Montmartre IS-IN Paris
Fort Erie IS-IN Ontario
Fyodor Kamensky DIED-IN Clearwater
Upton Sinclair WROTE Lanny Budd
Vince McMahon FOUNDED WWE
Thomas Mellon HAS-PROFESSION Judge
Human evaluation: precision using Mechanical Turk labelers
Feature     Precision
Syntactic   .67
Lexical     .66
Both        .69
Where syntactic knowledge helps
Back Street is a 1932 film made by Universal Pictures, directed by John M. Stahl, and produced by Carl Laemmle Jr.
Back Street and John M. Stahl are very far apart in surface string
But are close together in dependency parse
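The contrast between surface distance and dependency distance can be made concrete with a short sketch. The dependency edges below are written by hand for illustration (with prepositions collapsed and multiword names merged into single tokens); they are an assumption, not the output of any particular parser.

```python
from collections import deque

tokens = ["Back_Street", "is", "a", "1932", "film", "made", "by",
          "Universal_Pictures", ",", "directed", "by", "John_M._Stahl",
          ",", "and", "produced", "by", "Carl_Laemmle_Jr."]

# Hand-written (head, dependent) edges; assumed, not parser output.
edges = [
    ("film", "Back_Street"), ("film", "is"), ("film", "a"), ("film", "1932"),
    ("film", "made"), ("made", "Universal_Pictures"),
    ("film", "directed"), ("directed", "John_M._Stahl"),
    ("film", "produced"), ("produced", "Carl_Laemmle_Jr."),
]

def path_length(edges, start, goal):
    """Number of edges on the shortest undirected path, via breadth-first search."""
    graph = {}
    for head, dep in edges:
        graph.setdefault(head, set()).add(dep)
        graph.setdefault(dep, set()).add(head)
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# Tokens strictly between the two entities in the surface string.
surface_gap = tokens.index("John_M._Stahl") - tokens.index("Back_Street") - 1
dep_distance = path_length(edges, "Back_Street", "John_M._Stahl")
```

Under these assumed edges, ten tokens separate the two entities on the surface, but the dependency path Back_Street, film, directed, John_M._Stahl has only three edges. A syntactic feature built from that short path generalizes across sentences where the intervening words differ.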
Unsupervised relation extraction
Banko et al 2007 “Open information extraction from the Web”
Extracting relations from the web with
• no training data
• no predetermined list of relations
The Open Approach
1. Use parse data to train a “trustworthy” classifier
2. Extract trustworthy relations among NPs
3. Rank relations based on text redundancy
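Step 3 can be sketched with plain counting: a relation extracted from many distinct sentences is ranked above one seen only once. This is a deliberate simplification, since the cited system may use a probabilistic redundancy model rather than raw counts, and the extraction list below is invented for illustration.

```python
from collections import Counter

# Hypothetical extractor output: (sentence_id, (arg1, relation, arg2)).
extractions = [
    (1, ("Edison", "invented", "the phonograph")),
    (2, ("Edison", "invented", "the phonograph")),
    (2, ("Edison", "invented", "the phonograph")),  # same sentence: not new evidence
    (3, ("Edison", "invented", "the phonograph")),
    (4, ("Paris", "is capital of", "France")),
    (5, ("Paris", "is capital of", "France")),
    (6, ("Edison", "invented", "France")),          # spurious one-off extraction
]

# Count distinct supporting sentences per triple, then sort by support.
support = Counter(triple for _, triple in set(extractions))
ranked = support.most_common()
```

The spurious triple ends up last because only one sentence supports it, while the duplicate from sentence 2 is not double-counted. Counting distinct sentences rather than raw extractions keeps a single repeated page from inflating a relation's rank.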