Snowball: Extracting Relations from Large Plain-Text Collections
Eugene Agichtein, Luis Gravano
Department of Computer Science, Columbia University
Eugene Agichtein Columbia University•2
Extracting Relations from Documents
Text documents hide valuable structured information.
If we manage to extract this information:
• We can answer user queries more accurately
• We can run data mining tasks (e.g., finding trends)
GOAL: Extract All Tuples “Hidden” in the Document Collection
System must:
• Require minimal training for each new task
• Recover from noise
• Exploit redundancy of information in documents
Example Task: Organization/Location
Apple's programmers "think different" on a "campus" in Cupertino, Cal.
Nike employees "just do it" at what the company refers to as its "World Campus," near Portland, Ore.
Microsoft's central headquarters in Redmond is home to almost every product group and division.
Organization     Location
Microsoft        Redmond
Apple Computer   Cupertino
Nike             Portland
Brent Barlow, 27, a software analyst and beta-tester at Apple Computer headquarters in Cupertino, was fired Monday for "thinking a little too different."
Redundancy: the Apple/Cupertino tuple appears in more than one sentence.
Extracting Relations from Text Collections
• Related Work
• The Snowball System
• Evaluation Metrics
• Experimental Results
Related Work
• Traditional Information Extraction
– MUCs (Message Understanding Conferences)
• Significant (manual) training for each new task
• Bootstrapping
– Riloff et al. (’99), Collins & Singer (’99)
Extracting Relations from Text: DIPRE
Initial Seed Tuples:
ORGANIZATION   LOCATION
MICROSOFT      REDMOND
IBM            ARMONK
BOEING         SEATTLE
INTEL          SANTA CLARA
Processing loop: Initial Seed Tuples → Occurrences of Seed Tuples → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table
Extracting Relations from Text: DIPRE
Occurrences of seed tuples:
Computer servers at Microsoft’s headquarters in Redmond…
In mid-afternoon trading, shares of Redmond-based Microsoft fell…
The Armonk-based IBM introduced a new line…
The combined company will operate from Boeing’s headquarters in Seattle.
Intel, Santa Clara, cut prices of its Pentium processor.
Extracting Relations from Text: DIPRE
DIPRE Patterns:
• <STRING1>’s headquarters in <STRING2>
• <STRING2>-based <STRING1>
• <STRING1>, <STRING2>
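As a rough illustration, the three DIPRE-style patterns can be rendered as regular expressions. This is only a sketch: the `CAP` shape for a candidate string (a run of capitalized words), the `PATTERNS` list, and the `extract` helper are assumptions made for this example and are not taken from DIPRE itself.

```python
import re

# Assumed shape of a candidate string: one or more capitalized words.
CAP = r"[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*"

# Hypothetical regex renderings of the three patterns on the slide.
# "org" stands in for <STRING1>, "loc" for <STRING2>.
PATTERNS = [
    re.compile(rf"(?P<org>{CAP})[’']s headquarters in (?P<loc>{CAP})"),
    re.compile(rf"(?P<loc>{CAP})-based (?P<org>{CAP})"),
    re.compile(rf"(?P<org>{CAP}), (?P<loc>{CAP})"),
]

def extract(sentence):
    """Return the set of (organization, location) pairs matched by any pattern."""
    found = set()
    for pattern in PATTERNS:
        for m in pattern.finditer(sentence):
            found.add((m.group("org"), m.group("loc")))
    return found

print(extract("Computer servers at Microsoft’s headquarters in Redmond…"))
print(extract("Intel, Santa Clara, cut prices of its Pentium processor."))
```

Under this naive capitalized-word assumption, the sentence "The Armonk-based IBM introduced a new line…" yields the location "The Armonk", a small example of how purely string-based patterns can produce noisy tuples.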
Extracting Relations from Text: DIPRE
Generate new seed tuples; start new iteration
ORGANIZATION    LOCATION
AG EDWARDS      ST LUIS
157TH STREET    MANHATTAN
7TH LEVEL       RICHARDSON
3COM CORP       SANTA CLARA
3DO             REDWOOD CITY
JELLIES         APPLE
MACWEEK         SAN FRANCISCO
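The bootstrapping loop walked through on the DIPRE slides can be sketched in a few lines. This is a simplified illustration, not DIPRE's actual implementation: patterns here are just the text found between the two strings of a seed occurrence, and `CAP`, `find_patterns`, `apply_patterns`, and `bootstrap` are hypothetical names.

```python
import re

# Assumed shape of a candidate string: one or more capitalized words.
CAP = r"[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*"

def find_patterns(seeds, sentences):
    """Turn each seed occurrence into an (order, middle-text) pattern."""
    patterns = set()
    for org, loc in seeds:
        for s in sentences:
            for first, second, order in ((org, loc, "org-first"), (loc, org, "loc-first")):
                m = re.search(re.escape(first) + r"(.{1,30}?)" + re.escape(second), s)
                if m:
                    patterns.add((order, m.group(1)))
    return patterns

def apply_patterns(patterns, sentences):
    """Scan the sentences with each pattern and collect candidate tuples."""
    tuples = set()
    for order, middle in patterns:
        rx = re.compile(rf"({CAP}){re.escape(middle)}({CAP})")
        for s in sentences:
            for a, b in rx.findall(s):
                tuples.add((a, b) if order == "org-first" else (b, a))
    return tuples

def bootstrap(seeds, sentences, iterations=2):
    """Augment the table: every new tuple becomes a seed for the next iteration."""
    table = set(seeds)
    for _ in range(iterations):
        table |= apply_patterns(find_patterns(table, sentences), sentences)
    return table

sentences = [
    "Computer servers at Microsoft's headquarters in Redmond.",
    "The combined company will operate from Boeing's headquarters in Seattle.",
    "Intel, Santa Clara, cut prices of its Pentium processor.",
    "Exxon, Irving, said it will boost its stake.",
]
seeds = {("Microsoft", "Redmond"), ("Intel", "Santa Clara")}
print(sorted(bootstrap(seeds, sentences)))
```

Because every extracted tuple is fed back as a seed without any filtering, a single bad tuple can generate bad patterns in later iterations.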
Extracting Relations from Text: Potential Pitfalls
• Invalid tuples generated
– Degrade quality of tuples on subsequent iterations
– Must have automatic way to select high quality tuples to use as new seed
• Pattern representation
– Patterns must generalize
Extracting Relations from Text Collections
• Related Work
– DIPRE
• The Snowball System:
– Pattern representation and generation
– Tuple generation
– Automatic pattern and tuple evaluation
• Evaluation Metrics
• Experimental Results
Extracting Relations from Text: Snowball
Initial Seed Tuples:
ORGANIZATION   LOCATION
MICROSOFT      REDMOND
IBM            ARMONK
BOEING         SEATTLE
INTEL          SANTA CLARA
Processing loop: Initial Seed Tuples → Occurrences of Seed Tuples → Tag Entities → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table
Extracting Relations from Text: Snowball
Occurrences of seed tuples:
Computer servers at Microsoft’s headquarters in Redmond…
In mid-afternoon trading, shares of Redmond-based Microsoft fell…
The Armonk-based IBM introduced a new line…
The combined company will operate from Boeing’s headquarters in Seattle.
Intel, Santa Clara, cut prices of its Pentium processor.
Extracting Relations from Text: Snowball
Tag Entities
Use MITRE’s Alembic Named Entity tagger
Today's merger with McDonnell Douglas positions Seattle-based Boeing to make major money in space.
Computer servers at Microsoft’s headquarters in Redmond…
In mid-afternoon trading, shares of Redmond-based Microsoft fell…
The Armonk-based IBM introduced a new line…
The combined company will operate from Boeing’s headquarters in Seattle.
Intel, Santa Clara, cut prices of its Pentium processor.
Extracting Relations from Text
Computer servers at Microsoft's headquarters in Redmond...
Exxon, Irving, said it will boost its stake in the...
In midafternoon trading, shares of Irving-based Exxon fell…
The Armonk-based IBM has introduced a new line ...
The combined company will operate from Boeing's headquarters in Seattle.
Intel, Santa Clara, cut prices of its Pentium...
• <ORGANIZATION>’s headquarters in <LOCATION>
• <LOCATION>-based <ORGANIZATION>
• <ORGANIZATION>, <LOCATION>
PROBLEM: Patterns too specific: have to match text exactly.
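The problem is easy to reproduce. In the sketch below, a pattern that requires the literal text "'s headquarters in" matches the Microsoft sentence but misses a near-identical phrasing from the Apple example, where the possessive is absent. The regex and the `match` helper are illustrative assumptions, not part of Snowball.

```python
import re

# Exact-match rendering of "<ORGANIZATION>'s headquarters in <LOCATION>".
exact = re.compile(
    r"(?P<org>[A-Z]\w*(?: [A-Z]\w*)*)'s headquarters in (?P<loc>[A-Z]\w*)"
)

seen = "Computer servers at Microsoft's headquarters in Redmond..."
variant = "a software analyst and beta-tester at Apple Computer headquarters in Cupertino"

def match(sentence):
    """Return (organization, location) if the exact pattern matches, else None."""
    m = exact.search(sentence)
    return (m.group("org"), m.group("loc")) if m else None

print(match(seen))     # the sentence the pattern came from
print(match(variant))  # same relation, slightly different wording
```

The second call returns `None` even though the sentence expresses the same organization/location relation, which is why the pattern representation has to tolerate small variations.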
– Estimation of Probability(Correct(Tuple))
– A tuple will have high confidence if generated by multiple high-confidence patterns (P_i).
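One way to turn this statement into a number is a noisy-or combination: treat each supporting pattern as independent evidence and compute one minus the probability that all of them are wrong. The transcript does not show the formula itself, so the form below is an assumption chosen to be consistent with the bullet above; `tuple_confidence` is a hypothetical name.

```python
from math import prod

def tuple_confidence(pattern_confidences):
    """Noisy-or combination: 1 - prod(1 - Conf(P_i)) over the supporting patterns.

    Assumes each pattern's confidence is a probability in [0, 1] and that
    patterns provide independent evidence (a simplification).
    """
    return 1.0 - prod(1.0 - c for c in pattern_confidences)

print(tuple_confidence([0.5]))        # a single mediocre pattern
print(tuple_confidence([0.5, 0.5]))   # two such patterns reinforce each other
print(tuple_confidence([0.8, 0.5]))   # a stronger pattern raises confidence further
```

With this form, a tuple supported by several high-confidence patterns approaches 1, while a tuple produced by a single weak pattern stays low and can be kept out of the next round of seeds.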
Apple's programmers "think different" on a "campus" in Cupertino, Cal.
Brent Barlow, 27, a software analyst and beta-tester at Apple Computer headquarters in Cupertino, was fired Monday for "thinking a little too different."
ORGANIZATION LOCATION CONF
AG EDWARDS ST LUIS 0.93
AIR CANADA MONTREAL 0.89
7TH LEVEL RICHARDSON 0.88
3COM CORP SANTA CLARA 0.8
3DO REDWOOD CITY 0.8
3M MINNEAPOLIS 0.8
MACWORLD SAN FRANCISCO 0.7
157TH STREET MANHATTAN 0.52
15TH CENTURY EUROPE NAPOLEON 0.3
15TH PARTY CONGRESS CHINA 0.3
MAD SMITH 0.3
Extracting Relations from Text Collections
• Related Work
• The Snowball System:
– Pattern representation and generation
– Tuple generation
– Automatic pattern and tuple evaluation
FROM 0.01 <S 0.52> <' 0.52> <IN 0.24> <HEADQUARTERS 0.22> <AND 0.01> 0.69
Convergence of Snowball and DIPRE
[Figure: Precision (a) and Recall (b) of DIPRE and Snowball with increased iterations]
Approximate Matching of Organizations
• Use Whirl (W. Cohen @ AT&T) to match similar organization names
• Self-join the Extracted table on the Organization attribute
• Join resulting table with the Test table, and compare values of Location attributes
Extracted:
Organization            Location
Microsoft Corp          Wash.
Microsoft Corporation   Redmond
Microsoft Corp.         WA
Apple Computer          Calif.
Apple Corp              Cupertino
Apple Computer Corp.    US
Location        Organization
Redmond         Microsoft
Armonk          IBM
Santa Clara     Intel
Mountain View   Netscape
Cupertino       Apple
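The join on similar organization names can be approximated with a simple token-based cosine similarity. This is a stand-in for illustration only: it is not Whirl, it skips the self-join step, and the `tokens`, `cosine`, and `match_locations` helpers and the 0.5 threshold are assumptions. The second table on the slide is assumed here to play the role of the Test table.

```python
import math
import re
from collections import Counter

def tokens(name):
    """Lowercase word tokens; punctuation such as '.' is dropped."""
    return Counter(re.findall(r"[a-z0-9]+", name.lower()))

def cosine(a, b):
    """Cosine similarity between the token-count vectors of two names."""
    ta, tb = tokens(a), tokens(b)
    dot = sum(ta[t] * tb[t] for t in ta)
    norm = math.sqrt(sum(v * v for v in ta.values())) * math.sqrt(sum(v * v for v in tb.values()))
    return dot / norm if norm else 0.0

def match_locations(extracted, test, threshold=0.5):
    """Join extracted (org, loc) rows to test rows with a similar organization name.

    The last field of each joined row records whether the locations agree.
    """
    joined = []
    for org, loc in extracted:
        for test_org, test_loc in test:
            if cosine(org, test_org) >= threshold:
                joined.append((org, loc, test_org, test_loc, loc == test_loc))
    return joined

extracted = [("Microsoft Corp.", "WA"), ("Microsoft Corporation", "Redmond"), ("Apple Corp", "Cupertino")]
test = [("Microsoft", "Redmond"), ("IBM", "Armonk"), ("Apple", "Cupertino")]
for row in match_locations(extracted, test):
    print(row)
```

In this sketch "Microsoft Corp." and "Microsoft Corporation" both join to "Microsoft", but only the row whose extracted location is "Redmond" agrees with the test value; "WA" is counted as a mismatch.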
References
• Blum & Mitchell. Combining labeled and unlabeled data with co-training. Proceedings of 1998 Conference on Computational Learning Theory.
• Brin. Extracting patterns and relations from the World-Wide Web. Proceedings on the 1998 International Workshop on Web and Databases (WebDB’98).
• Collins & Singer. Unsupervised models for named entity classification. EMNLP 1999.
• Riloff & Jones. Learning dictionaries for information extraction by multi-level bootstrapping. AAAI’99.
• Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. ACL’95.