Extracting Relations from Large Text Collections
Eugene Agichtein, Eleazar Eskin and Luis Gravano
Department of Computer Science
Columbia University
Combining Strategies for
Eugene Agichtein Columbia University•2
Extracting Relations From Documents
• Extract all tuples that appear in the document collection
• Require minimal training
• Resolve conflicting information
• Exploit redundancy of information in documents
Text Documents hide valuable structured information
Example Task: Organization/Location
Apple's programmers "think different" on a "campus" in Cupertino, Cal.
Nike employees "just do it" at what the company refers to as its "World Campus," near Portland, Ore.
Microsoft's central headquarters in Redmond is home to almost every product group and division.

Organization      Location
Microsoft         Redmond
Apple Computer    Cupertino
Nike              Portland

Brent Barlow, 27, a software analyst and beta-tester at Apple Computer headquarters in Cupertino, was fired Monday for "thinking a little too different."
Related Work
• Traditional Information Extraction
  – MUC
• Bootstrapping
  – Riloff et al. (’99), Collins & Singer (’99)
Extracting Relations from Text
Seed tuples:

ORGANIZATION    LOCATION
MICROSOFT       REDMOND
IBM             ARMONK
INTEL           SANTA CLARA
BOEING          SEATTLE

Occurrences of seed tuples in text:
Computer servers at Microsoft's headquarters in Redmond...
Exxon, Irving, said it will boost its stake in the...
In midafternoon trading, shares of Irving-based Exxon fell…
The Armonk-based IBM has introduced a new line ...
The combined company will operate from Boeing's headquarters in Seattle.
Intel, Santa Clara, cut prices of its Pentium...
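
The occurrence-finding step can be sketched in a few lines of Python. This is only an illustration, not the system's implementation: the function name `find_occurrences`, the whole-word regular-expression matching and the mixed-case spelling of the seed values are assumptions made here.

```python
import re

# Seed tuples from the slide: (organization, location).
SEEDS = [
    ("Microsoft", "Redmond"),
    ("IBM", "Armonk"),
    ("Intel", "Santa Clara"),
    ("Boeing", "Seattle"),
]

# Text segments from the slide (trailing ellipses dropped).
SEGMENTS = [
    "Computer servers at Microsoft's headquarters in Redmond",
    "Exxon, Irving, said it will boost its stake in the",
    "In midafternoon trading, shares of Irving-based Exxon fell",
    "The Armonk-based IBM has introduced a new line",
    "The combined company will operate from Boeing's headquarters in Seattle.",
    "Intel, Santa Clara, cut prices of its Pentium",
]


def find_occurrences(seeds, segments):
    """Return (org, loc, segment) for every segment that mentions both values of a seed."""
    hits = []
    for segment in segments:
        for org, loc in seeds:
            # Assumption: a case-sensitive whole-word match counts as a mention.
            org_found = re.search(r"\b" + re.escape(org) + r"\b", segment)
            loc_found = re.search(r"\b" + re.escape(loc) + r"\b", segment)
            if org_found and loc_found:
                hits.append((org, loc, segment))
    return hits


occurrences = find_occurrences(SEEDS, SEGMENTS)
for org, loc, segment in occurrences:
    print(f"{org} / {loc}: {segment}")
```

The two Exxon segments are not returned because Exxon/Irving is not among the seed tuples.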
Tag Entities:
Generate patterns:
• <ORGANIZATION>’s headquarters in <LOCATION>
• <LOCATION>-based <ORGANIZATION>
• <ORGANIZATION>, <LOCATION>
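
As a rough sketch of how such patterns could be derived from tagged occurrences, the Python below replaces each seed value with its tag and keeps the text from the first tag to the last. The helper `make_pattern` is hypothetical, and producing exact string patterns is a simplification assumed here; the slides do not say how the system generalizes across occurrences.

```python
import re


def make_pattern(segment, org, loc):
    """Replace the seed values with tags and keep the span from the first tag to the last."""
    tagged = segment.replace(org, "<ORGANIZATION>").replace(loc, "<LOCATION>")
    match = re.search(
        r"<(?:ORGANIZATION|LOCATION)>.*<(?:ORGANIZATION|LOCATION)>", tagged
    )
    return match.group(0) if match else None


# (segment, organization, location) triples taken from the slide text.
examples = [
    ("Computer servers at Microsoft's headquarters in Redmond", "Microsoft", "Redmond"),
    ("The Armonk-based IBM has introduced a new line", "IBM", "Armonk"),
    ("Intel, Santa Clara, cut prices of its Pentium", "Intel", "Santa Clara"),
]

patterns = [make_pattern(segment, org, loc) for segment, org, loc in examples]
for pattern in patterns:
    print(pattern)
```

On these three segments the sketch reproduces the three patterns listed above, up to whitespace and apostrophe style.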
Generate new seed tuples:
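
A minimal sketch of this step, assuming the patterns are applied as literal regular expressions: each tag becomes a capture group and every match yields a candidate tuple. The `ENTITY` expression (a run of capitalized words) is only a stand-in for a real named-entity tagger, and `pattern_to_regex` and `extract_tuples` are hypothetical names.

```python
import re

# Stand-in for a named-entity tagger: one or more capitalized words (an assumption).
ENTITY = r"[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*"

PATTERNS = [
    "<ORGANIZATION>'s headquarters in <LOCATION>",
    "<LOCATION>-based <ORGANIZATION>",
    "<ORGANIZATION>, <LOCATION>",
]


def pattern_to_regex(pattern):
    """Turn a tagged pattern into a compiled regex with named groups 'org' and 'loc'."""
    regex = re.escape(pattern)
    # Escape the tags the same way as the pattern so the replacement always finds them.
    regex = regex.replace(re.escape("<ORGANIZATION>"), f"(?P<org>{ENTITY})")
    regex = regex.replace(re.escape("<LOCATION>"), f"(?P<loc>{ENTITY})")
    return re.compile(regex)


def extract_tuples(patterns, segments):
    """Apply every pattern to every segment and collect (organization, location) pairs."""
    found = set()
    for regex in map(pattern_to_regex, patterns):
        for segment in segments:
            for match in regex.finditer(segment):
                found.add((match.group("org"), match.group("loc")))
    return found


# Segments from the slide that do not involve any seed tuple.
new_text = [
    "Exxon, Irving, said it will boost its stake in the",
    "In midafternoon trading, shares of Irving-based Exxon fell",
]
new_tuples = extract_tuples(PATTERNS, new_text)
print(new_tuples)
```

Here two different patterns both yield the Exxon/Irving pair, which is the kind of redundancy the approach exploits.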
A tuple T will have high confidence if it is extracted by multiple high-quality patterns, scaled by the similarity of actual text segments to patterns.
• Estimated precision using sampling (see paper)
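
One way to turn the description of tuple confidence into a number is a noisy-or combination: each supporting pattern contributes its own confidence scaled by how well the text matched it, and the contributions accumulate. Whether this is the exact formula the system uses is an assumption here; the function name and the example values are made up for illustration.

```python
def tuple_confidence(supports):
    """Combine evidence for one tuple.

    supports: iterable of (pattern_confidence, match_similarity) pairs, each in [0, 1].
    Returns 1 - prod(1 - pattern_confidence * match_similarity), a noisy-or score.
    """
    miss = 1.0  # probability-like score that every supporting pattern is wrong
    for pattern_conf, similarity in supports:
        miss *= 1.0 - pattern_conf * similarity
    return 1.0 - miss


# Hypothetical values: one strong exact match versus the same plus a weaker partial match.
single = tuple_confidence([(0.8, 1.0)])
multiple = tuple_confidence([(0.8, 1.0), (0.5, 0.5)])
print(single, multiple)
```

With this form, extra supporting patterns can only raise the score, and a low-quality or poorly matched pattern adds little.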
The Ideal metric
[Diagram labels: All tuples mentioned in the collection; Hoover’s directory (13K+ organizations); Ideal]
• Precision: Fraction of Ideal tuples that the system extracted that have correct values.
• Recall: Fraction of Ideal tuples extracted from
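
The two measures can be sketched as follows. Several things are assumed: the organization acts as the key of the relation, precision is computed only over extracted tuples whose organization appears in Ideal, and recall is the fraction of Ideal tuples extracted with the correct location (the slide text for recall is cut off, so this reading is a guess). The example tables are made up.

```python
def precision_recall(extracted, ideal):
    """Score an extracted table against the Ideal table.

    Both arguments map organization -> location (organization assumed to be the key).
    """
    shared = [org for org in extracted if org in ideal]
    correct = sum(1 for org in shared if extracted[org] == ideal[org])
    precision = correct / len(shared) if shared else 0.0
    recall = correct / len(ideal) if ideal else 0.0
    return precision, recall


# Made-up toy tables, for illustration only.
ideal = {
    "Microsoft": "Redmond",
    "Apple Computer": "Cupertino",
    "Nike": "Portland",
    "Exxon": "Irving",
}
extracted = {
    "Microsoft": "Redmond",   # correct
    "Nike": "Seattle",        # in Ideal, wrong location
    "Exxon": "Irving",        # correct
    "Acme": "Springfield",    # not in Ideal, ignored by both measures
}

precision, recall = precision_recall(extracted, ideal)
print(f"precision={precision:.2f} recall={recall:.2f}")
```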
[Figure: Precision (a) and Recall (b) of the Intersection, Union and Mixture strategies compared to Snowball-VS]
Conclusions
• System works after minimal training (handful of examples)
• Able to extract 80% of test tuples
• Combining models results in a better table
Future Work
• Efficiency
• Evaluate on other extraction tasks
  – non-binary relations
  – relations with no key
• HTML documents
  – link structure
  – document structure
References
• Blum & Mitchell. Combining labeled and unlabeled data with co-training. Proceedings of the 1998 Conference on Computational Learning Theory.
• Brin. Extracting patterns and relations from the World-Wide Web. Proceedings of the 1998 International Workshop on Web and Databases (WebDB’98).
• Collins & Singer. Unsupervised models for named entity classification. EMNLP 1999.
• Riloff & Jones. Learning dictionaries for information extraction by multi-level bootstrapping. AAAI’99.
• Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. ACL’95.
Snowball: Extracting Relations from Large Plain-Text Collections
Eugene Agichtein and Luis Gravano
Department of Computer Science
Columbia University
• Computes the conditional probability P(Positive | W_1, W_2, W_3, W_4), where W_i is the i-th word in the context between entities.
• SMTs compute a mixture of sparse prediction trees.