
1 Snowball : Extracting Relations from Large Plain-Text Collections Eugene Agichtein Luis Gravano Department of Computer Science Columbia University.

Dec 18, 2015


Lester Byrd
Transcript
Page 1: 1 Snowball : Extracting Relations from Large Plain-Text Collections Eugene Agichtein Luis Gravano Department of Computer Science Columbia University.

Snowball: Extracting Relations from Large Plain-Text Collections

Eugene Agichtein, Luis Gravano

Department of Computer Science, Columbia University

Page 2:

Eugene Agichtein Columbia University•2

Extracting Relations from Documents

Text documents hide valuable structured information.

If we manage to extract this information:
• We can answer user queries more accurately
• We can run data mining tasks (e.g., finding trends)

Page 3:

GOAL: Extract All Tuples “Hidden” in the Document Collection

System must:
• Require minimal training for each new task
• Recover from noise
• Exploit redundancy of information in documents

Page 4:

Example Task: Organization/Location

Apple's programmers "think different" on a "campus" in Cupertino, Cal. Nike employees "just do it" at what the company refers to as its "World Campus," near Portland, Ore.

Microsoft's central headquarters in Redmond is home to almost every product group and division.

Brent Barlow, 27, a software analyst and beta-tester at Apple Computer headquarters in Cupertino, was fired Monday for "thinking a little too different."

Organization    Location
Microsoft       Redmond
Apple Computer  Cupertino
Nike            Portland

Redundancy

Page 5:

Extracting Relations from Text Collections

• Related Work
• The Snowball System
• Evaluation Metrics
• Experimental Results

Page 6:

Related Work

• Traditional Information Extraction
  – MUCs (Message Understanding Conferences)
    • Significant (manual) training for each new task
• Bootstrapping
  – Riloff et al. (’99), Collins & Singer (’99)
    • (Named-entity recognition)
  – Brin (DIPRE) (’98)

Page 7:

Extracting Relations from Text: DIPRE

Initial Seed Tuples:

ORGANIZATION  LOCATION
MICROSOFT     REDMOND
IBM           ARMONK
BOEING        SEATTLE
INTEL         SANTA CLARA

[Diagram: Initial Seed Tuples → Occurrences of Seed Tuples → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table]

Page 8:

Extracting Relations from Text: DIPRE

ORGANIZATION  LOCATION
MICROSOFT     REDMOND
IBM           ARMONK
BOEING        SEATTLE
INTEL         SANTA CLARA

Occurrences of seed tuples:

Computer servers at Microsoft’s headquarters in Redmond…

In mid-afternoon trading, shares of Redmond-based Microsoft fell…

The Armonk-based IBM introduced a new line…

The combined company will operate from Boeing’s headquarters in Seattle.

Intel, Santa Clara, cut prices of its Pentium processor.

Page 9:

Extracting Relations from Text: DIPRE

DIPRE Patterns:

• <STRING1>’s headquarters in <STRING2>
• <STRING2>-based <STRING1>
• <STRING1>, <STRING2>
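As a rough illustration of how literal patterns like the three on this slide could be applied to text, the sketch below renders each one as a Python regular expression. The `WORDS` expression (a run of capitalized words), the `PATTERNS` list, and the `extract` helper are assumptions made for this example; they are not taken from DIPRE itself.

```python
import re

# Hypothetical regex renderings of the three slide patterns.
# STRING1 is treated as the organization, STRING2 as the location.
WORDS = r"[A-Z]\w*(?: [A-Z]\w*)*"  # one or more capitalized words

PATTERNS = [
    # <STRING1>'s headquarters in <STRING2>
    re.compile(rf"(?P<org>{WORDS})[’']s headquarters in (?P<loc>{WORDS})"),
    # <STRING2>-based <STRING1>
    re.compile(rf"(?P<loc>{WORDS})-based (?P<org>{WORDS})"),
    # <STRING1>, <STRING2>
    re.compile(rf"(?P<org>{WORDS}), (?P<loc>{WORDS})"),
]


def extract(text):
    """Return the set of (organization, location) pairs matched by any pattern."""
    found = set()
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            found.add((m.group("org"), m.group("loc")))
    return found


print(extract("Intel, Santa Clara, cut prices of its Pentium processor."))
```

Because the slots accept any capitalized words, a sentence-initial "The" would be swallowed into a slot; the sketch makes no attempt to handle that.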

Page 10:

Extracting Relations from Text: DIPRE

Generate new seed tuples; start new iteration

ORGANIZATION  LOCATION
AG EDWARDS    ST LUIS
157TH STREET  MANHATTAN
7TH LEVEL     RICHARDSON
3COM CORP     SANTA CLARA
3DO           REDWOOD CITY
JELLIES       APPLE
MACWEEK       SAN FRANCISCO

Page 11:

Extracting Relations from Text: Potential Pitfalls

• Invalid tuples generated
  – Degrade quality of tuples on subsequent iterations
  – Must have automatic way to select high quality tuples to use as new seed
• Pattern representation
  – Patterns must generalize

Page 12:

Extracting Relations from Text Collections

• Related Work
  – DIPRE
• The Snowball System:
  – Pattern representation and generation
  – Tuple generation
  – Automatic pattern and tuple evaluation
• Evaluation Metrics
• Experimental Results

Page 13:

Extracting Relations from Text: Snowball

Initial Seed Tuples:

ORGANIZATION  LOCATION
MICROSOFT     REDMOND
IBM           ARMONK
BOEING        SEATTLE
INTEL         SANTA CLARA

[Diagram: Initial Seed Tuples → Occurrences of Seed Tuples → Tag Entities → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table]

Page 14:

Extracting Relations from Text: Snowball

ORGANIZATION  LOCATION
MICROSOFT     REDMOND
IBM           ARMONK
BOEING        SEATTLE
INTEL         SANTA CLARA

Occurrences of seed tuples:

Computer servers at Microsoft’s headquarters in Redmond…

In mid-afternoon trading, shares of Redmond-based Microsoft fell…

The Armonk-based IBM introduced a new line…

The combined company will operate from Boeing’s headquarters in Seattle.

Intel, Santa Clara, cut prices of its Pentium processor.

Page 15:

Problem: Patterns Excessively General

Pattern: <STRING2>-based <STRING1>

Today's merger with McDonnell Douglas positions Seattle-based Boeing to make major money in space.

…, a producer of apple-based jelly, ...

<jelly, apple>

Incorrect!
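The failure on this slide is easy to reproduce. The sketch below applies the untyped pattern as a plain regular expression, then repeats the extraction while requiring each slot to hold an entity of the expected type. The `ORGANIZATIONS` and `LOCATIONS` sets are a hand-made, hypothetical stand-in for a real named-entity tagger, and the function names are illustrative.

```python
import re

# Literal pattern "<STRING2>-based <STRING1>" with no type constraints.
BASED = re.compile(r"(?P<string2>\w+)-based (?P<string1>\w+)")


def extract_untyped(text):
    """Return (STRING1, STRING2) pairs for every textual match of the pattern."""
    return [(m.group("string1"), m.group("string2")) for m in BASED.finditer(text)]


# Hypothetical stand-in for a named-entity tagger: tiny hand-made entity lists.
ORGANIZATIONS = {"Boeing"}
LOCATIONS = {"Seattle"}


def extract_typed(text):
    """Keep only matches whose slots hold an ORGANIZATION and a LOCATION."""
    return [
        (org, loc)
        for org, loc in extract_untyped(text)
        if org in ORGANIZATIONS and loc in LOCATIONS
    ]


text = (
    "Today's merger positions Seattle-based Boeing to make major money in space. "
    "..., a producer of apple-based jelly, ..."
)
print(extract_untyped(text))  # includes the invalid pair ("jelly", "apple")
print(extract_typed(text))    # only the Boeing/Seattle pair survives
```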

Page 16:

Extracting Relations from Text: Snowball

Tag Entities

Use MITRE’s Alembic Named Entity tagger

Computer servers at Microsoft’s headquarters in Redmond…

In mid-afternoon trading, shares of Redmond-based Microsoft fell…

The Armonk-based IBM introduced a new line…

The combined company will operate from Boeing’s headquarters in Seattle.

Intel, Santa Clara, cut prices of its Pentium processor.

Page 17:

Extracting Relations from Text

Computer servers at Microsoft's headquarters in Redmond...
Exxon, Irving, said it will boost its stake in the...
In midafternoon trading, shares of Irving-based Exxon fell…
The Armonk-based IBM has introduced a new line ...
The combined company will operate from Boeing's headquarters in Seattle.
Intel, Santa Clara, cut prices of its Pentium...

• <ORGANIZATION>’s headquarters in <LOCATION>
• <LOCATION>-based <ORGANIZATION>
• <ORGANIZATION>, <LOCATION>

PROBLEM: Patterns too specific: have to match text exactly.

Page 18:

Snowball: Pattern Representation

A Snowball pattern vector is a 5-tuple <left, tag1, middle, tag2, right>, where:

– tag1, tag2 are named-entity tags
– left, middle, and right are vectors of weighted terms.

Example text: ORGANIZATION 's central headquarters in LOCATION is home to...

Its representation (tag1, middle, tag2, right):

ORGANIZATION  {<'s 0.5>, <central 0.5>, <headquarters 0.5>, <in 0.5>}  LOCATION  {<is 0.75>, <home 0.75>}
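One way to hold such a 5-tuple in code is a small record with two tags and three sparse term-weight vectors. The class name `SnowballTuple`, the field names, and the use of dictionaries for the vectors are assumptions made for this sketch. The example instance copies the weights shown on the slide; the slide gives no left context for it, so `left` is left empty here.

```python
from dataclasses import dataclass, field
from typing import Dict

TermVector = Dict[str, float]  # term -> weight


@dataclass
class SnowballTuple:
    """A <left, tag1, middle, tag2, right> context (illustrative layout)."""

    tag1: str
    tag2: str
    left: TermVector = field(default_factory=dict)
    middle: TermVector = field(default_factory=dict)
    right: TermVector = field(default_factory=dict)


# "ORGANIZATION 's central headquarters in LOCATION is home to..."
example = SnowballTuple(
    tag1="ORGANIZATION",
    tag2="LOCATION",
    middle={"'s": 0.5, "central": 0.5, "headquarters": 0.5, "in": 0.5},
    right={"is": 0.75, "home": 0.75},
)
print(example)
```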

Page 19:

Snowball: Pattern Generation

Tagged occurrences of seed tuples:

Computer servers at Microsoft’s central headquarters in Redmond…

In mid-afternoon trading, shares of Redmond-based Microsoft fell…

The Armonk-based IBM introduced a new line…

The combined company will operate from Boeing’s headquarters in Seattle.

Page 20:

Snowball Pattern Generation: Cluster Similar Occurrences

Occurrences of seed tuples converted to Snowball representation:

{<servers 0.75> <at 0.75>}  ORGANIZATION  {<’s 0.5> <central 0.5> <headquarters 0.5> <in 0.5>}  LOCATION

{<shares 0.75> <of 0.75>}  LOCATION  {<- 0.75> <based 0.75>}  ORGANIZATION  {<fell 1>}

{<the 1>}  LOCATION  {<- 0.75> <based 0.75>}  ORGANIZATION  {<introduced 0.75> <a 0.75>}

{<operate 0.75> <from 0.75>}  ORGANIZATION  {<’s 0.7> <headquarters 0.7> <in 0.7>}  LOCATION

Page 21:

Similarity Metric

P = < Lp, tag1, Mp, tag2, Rp >

S = < Ls, tag1, Ms, tag2, Rs >

$$
\mathrm{Match}(P, S) =
\begin{cases}
L_p \cdot L_s + M_p \cdot M_s + R_p \cdot R_s & \text{if the tags match} \\
0 & \text{otherwise}
\end{cases}
$$
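The metric on this slide can be sketched directly. The code below assumes each 5-tuple is a Python tuple `(left, tag1, middle, tag2, right)` whose three vectors are dictionaries mapping terms to weights; that layout and the helper names `dot` and `match` are choices made for this example.

```python
def dot(u, v):
    """Inner product of two sparse term-weight vectors stored as dicts."""
    return sum(weight * v[term] for term, weight in u.items() if term in v)


def match(p, s):
    """Match(P, S): sum of the three inner products if the tags agree, else 0."""
    lp, p_tag1, mp, p_tag2, rp = p
    ls, s_tag1, ms, s_tag2, rs = s
    if (p_tag1, p_tag2) != (s_tag1, s_tag2):
        return 0.0
    return dot(lp, ls) + dot(mp, ms) + dot(rp, rs)


pattern = ({}, "ORGANIZATION",
           {"'s": 0.7, "headquarters": 0.7, "in": 0.7}, "LOCATION", {})
occurrence = ({"servers": 0.75, "at": 0.75}, "ORGANIZATION",
              {"'s": 0.5, "central": 0.5, "headquarters": 0.5, "in": 0.5},
              "LOCATION", {})
reversed_tags = ({}, "LOCATION", {"-": 0.75, "based": 0.75}, "ORGANIZATION", {})

print(match(pattern, occurrence))     # three shared middle terms contribute
print(match(pattern, reversed_tags))  # tags differ, so the score is 0
```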

Page 22:

Snowball Pattern Generation: Clustering

{<servers 0.75> <at 0.75>}  ORGANIZATION  {<’s 0.5> <central 0.5> <headquarters 0.5> <in 0.5>}  LOCATION

{<shares 0.75> <of 0.75>}  LOCATION  {<- 0.75> <based 0.75>}  ORGANIZATION  {<fell 1>}

{<the 1>}  LOCATION  {<- 0.75> <based 0.75>}  ORGANIZATION  {<introduced 0.75> <a 0.75>}

{<operate 0.75> <from 0.75>}  ORGANIZATION  {<’s 0.7> <headquarters 0.7> <in 0.7>}  LOCATION

[Diagram: the occurrences are grouped into Cluster 1 and Cluster 2]

Page 23:

Snowball: Pattern Generation

Patterns are formed as centroids of the clusters. Filtered by minimum number of supporting tuples.

Patterns (labeled Pattern1 and Pattern2 on the slide):

• ORGANIZATION  {<’s 0.7> <in 0.7> <headquarters 0.7>}  LOCATION
• LOCATION  {<- 0.75> <based 0.75>}  ORGANIZATION
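A minimal sketch of forming a pattern as a cluster centroid follows. It assumes plain term-by-term averaging of the weights and uses the cluster size as the support count; both are simplifications chosen for illustration, so the numbers it produces will not necessarily reproduce the weights shown on the slide. The names `centroid`, `make_pattern`, and `min_support` are hypothetical.

```python
from collections import defaultdict


def centroid(vectors):
    """Average a non-empty list of sparse term-weight dicts term by term."""
    totals = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            totals[term] += weight
    return {term: total / len(vectors) for term, total in totals.items()}


def make_pattern(cluster, min_support=2):
    """Build a pattern from a cluster of (left, tag1, middle, tag2, right) tuples.

    Returns None when the cluster has fewer than min_support occurrences.
    """
    if len(cluster) < min_support:
        return None
    _, tag1, _, tag2, _ = cluster[0]
    return (
        centroid([c[0] for c in cluster]),
        tag1,
        centroid([c[2] for c in cluster]),
        tag2,
        centroid([c[4] for c in cluster]),
    )


cluster = [
    ({"servers": 0.75, "at": 0.75}, "ORGANIZATION",
     {"'s": 0.5, "central": 0.5, "headquarters": 0.5, "in": 0.5}, "LOCATION", {}),
    ({"operate": 0.75, "from": 0.75}, "ORGANIZATION",
     {"'s": 0.7, "headquarters": 0.7, "in": 0.7}, "LOCATION", {}),
]
pattern = make_pattern(cluster)
print(pattern[2])  # averaged middle vector
```

Terms that appear in only one occurrence, such as "central", end up with a reduced weight, which is how the averaged pattern stays more general than any single occurrence.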

Page 24:

Snowball: Tuple Extraction

Using the patterns, scan the collection to generate new seed tuples:

Page 25:

Snowball: Tuple Extraction

Represent each new text segment in the collection as the context 5-tuple:

Netscape 's flashy headquarters in Mountain View is near

ORGANIZATION  {<'s 0.5>, <flashy 0.5>, <headquarters 0.5>, <in 0.5>}  LOCATION  {<is 0.75>, <near 0.75>}

Find most similar pattern (if any):

ORGANIZATION  {<'s 0.7>, <headquarters 0.7>, <in 0.7>}  LOCATION
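The "find most similar pattern (if any)" step on this slide might look like the sketch below. It scores the new context against every pattern with a sum of inner products over the left, middle, and right vectors, and keeps the best score only if it reaches a minimum similarity. The tuple layout, the function names, and the threshold value `0.8` are assumptions for illustration.

```python
def dot(u, v):
    """Inner product of two sparse term-weight vectors stored as dicts."""
    return sum(weight * v[term] for term, weight in u.items() if term in v)


def match(p, s):
    """Similarity of two (left, tag1, middle, tag2, right) tuples; 0 if tags differ."""
    if (p[1], p[3]) != (s[1], s[3]):
        return 0.0
    return dot(p[0], s[0]) + dot(p[2], s[2]) + dot(p[4], s[4])


def best_pattern(context, patterns, threshold):
    """Return (index, score) of the most similar pattern, or None if none qualifies."""
    best = None
    for i, pattern in enumerate(patterns):
        score = match(pattern, context)
        if score >= threshold and (best is None or score > best[1]):
            best = (i, score)
    return best


patterns = [
    ({}, "ORGANIZATION", {"'s": 0.7, "headquarters": 0.7, "in": 0.7}, "LOCATION", {}),
    ({}, "LOCATION", {"-": 0.75, "based": 0.75}, "ORGANIZATION", {}),
]
# "Netscape 's flashy headquarters in Mountain View is near"
context = ({}, "ORGANIZATION",
           {"'s": 0.5, "flashy": 0.5, "headquarters": 0.5, "in": 0.5},
           "LOCATION", {"is": 0.75, "near": 0.75})

print(best_pattern(context, patterns, threshold=0.8))
```

The extra term "flashy" simply contributes nothing to the score, so the context still matches the headquarters pattern even though the text is not identical.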

Page 26:

Snowball: Automatic Pattern Evaluation

Automatically estimate probability of a pattern generating valid tuples:

Pattern Confidence:

$$
\mathrm{Conf}(\mathit{Pattern}) = \frac{\mathit{Positive}}{\mathit{Positive} + \mathit{Negative}}
$$

e.g., Conf(Pattern) = 2/3 ≈ 67%

Seed tuples:

ORGANIZATION  LOCATION
MICROSOFT     REDMOND
IBM           ARMONK
BOEING        SEATTLE
INTEL         SANTA CLARA

Pattern “ORGANIZATION, LOCATION” in action:

Boeing, Seattle, said…                                        Positive
Intel, Santa Clara, cut prices…                               Positive
invest in Microsoft, New York-based analyst Jane Smith said   Negative
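The slide's example can be worked through in a few lines. In the sketch below a pattern's extracted pair counts as positive when it agrees with a seed tuple and negative when it contradicts one. Skipping organizations that are not among the seeds is an assumption of this sketch, as are the names `SEEDS` and `pattern_confidence`.

```python
SEEDS = {
    "MICROSOFT": "REDMOND",
    "IBM": "ARMONK",
    "BOEING": "SEATTLE",
    "INTEL": "SANTA CLARA",
}


def pattern_confidence(extracted, seeds):
    """Conf(Pattern) = positive / (positive + negative)."""
    positive = negative = 0
    for org, loc in extracted:
        known = seeds.get(org.upper())
        if known is None:
            continue  # assumption: organizations outside the seed table are not counted
        if known == loc.upper():
            positive += 1
        else:
            negative += 1
    total = positive + negative
    return positive / total if total else 0.0


# Pairs produced by the pattern "ORGANIZATION, LOCATION" in the slide's example.
extracted = [
    ("Boeing", "Seattle"),      # positive
    ("Intel", "Santa Clara"),   # positive
    ("Microsoft", "New York"),  # negative: the seed table says Redmond
]
print(pattern_confidence(extracted, SEEDS))
```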

Page 27:

Snowball: Automatic Tuple Evaluation

$$
\mathrm{Conf}(\mathit{Tuple}) = 1 - \prod_{i} \bigl(1 - \mathrm{Conf}(P_i)\bigr)
$$

– Estimation of Probability(Correct(Tuple))
– A tuple will have high confidence if generated by multiple high-confidence patterns (P_i).

Apple's programmers "think different" on a "campus" in Cupertino, Cal.

Brent Barlow, 27, a software analyst and beta-tester at Apple Computer headquarters in Cupertino, was fired Monday for "thinking a little too different."

<Apple Computer, Cupertino>
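Read as "one minus the product, over the patterns that generated the tuple, of one minus each pattern's confidence", the tuple confidence is a short computation. The function name `tuple_confidence` and the sample confidence values below are illustrative only.

```python
def tuple_confidence(pattern_confs):
    """Conf(Tuple) = 1 - product over i of (1 - Conf(P_i))."""
    remaining = 1.0
    for conf in pattern_confs:
        remaining *= 1.0 - conf
    return 1.0 - remaining


# A tuple supported by two patterns is more trustworthy than either pattern alone.
print(tuple_confidence([0.5]))
print(tuple_confidence([0.5, 0.5]))
```

Each additional supporting pattern can only raise the result, which matches the slide's point that a tuple generated by multiple high-confidence patterns gets high confidence.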

Page 28:

Snowball: Filtering Seed Tuples

Generate new seed tuples:

ORGANIZATION         LOCATION       CONF
AG EDWARDS           ST LUIS        0.93
AIR CANADA           MONTREAL       0.89
7TH LEVEL            RICHARDSON     0.88
3COM CORP            SANTA CLARA    0.8
3DO                  REDWOOD CITY   0.8
3M                   MINNEAPOLIS    0.8
MACWORLD             SAN FRANCISCO  0.7

157TH STREET         MANHATTAN      0.52
15TH CENTURY EUROPE  NAPOLEON       0.3
15TH PARTY CONGRESS  CHINA          0.3
MAD                  SMITH          0.3
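Filtering candidates by confidence before reusing them as seeds can be sketched as a simple threshold test. The cutoff `0.6` below is a hypothetical value picked so that it falls between the two groups of rows on this slide; the slide does not state the threshold used, and `select_seeds` is an illustrative name.

```python
def select_seeds(candidates, threshold):
    """Keep (organization, location) pairs whose confidence is at least threshold."""
    return [(org, loc) for org, loc, conf in candidates if conf >= threshold]


# A few rows copied from the slide's table.
candidates = [
    ("AG EDWARDS", "ST LUIS", 0.93),
    ("MACWORLD", "SAN FRANCISCO", 0.7),
    ("157TH STREET", "MANHATTAN", 0.52),
    ("MAD", "SMITH", 0.3),
]
print(select_seeds(candidates, threshold=0.6))
```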

Page 29:

Extracting Relations from Text Collections

• Related Work
• The Snowball System:
  – Pattern representation and generation
  – Tuple generation
  – Automatic pattern and tuple evaluation
• Evaluation Metrics
• Experimental Results

Page 30:

Task Evaluation Methodology

• Data: Large collection, extracted tables contain many tuples (> 80,000)
• Need scalable methodology:
  – Ideal set of tuples
  – Automatic recall/precision estimation
• Estimated precision using sampling

Page 31:

Collections used in Experiments

More than 300,000 real newspaper articles

Collection  Source                   Year
Training    The New York Times       1996
            The Wall Street Journal  1996
            The Los Angeles Times    1996
Test        The New York Times       1995
            The Wall Street Journal  1995
            The Los Angeles Times    1995, ’97

Page 32:

The Ideal Metric (1)

Creating the Ideal set of tuples

[Diagram: two overlapping sets, “All tuples mentioned in the collection” and “Hoover’s directory (13K+ organizations)”; their overlap is labeled Ideal*]

* A perfect (ideal) system would be able to extract all these tuples

Page 33:

The Ideal Metric (2)

• Precision:

$$
\mathrm{Precision} = \frac{|\mathrm{Correct}(\mathit{Extracted} \cap \mathit{Ideal})|}{|\mathit{Extracted} \cap \mathit{Ideal}|}
$$

• Recall:

$$
\mathrm{Recall} = \frac{|\mathrm{Correct}(\mathit{Extracted} \cap \mathit{Ideal})|}{|\mathit{Ideal}|}
$$

[Diagram: overlapping sets Extracted and Ideal, with the region “Correct location found” marked]
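A small sketch of these two ratios follows. It assumes both tables are dictionaries from organization to location, treats the overlap of Extracted and Ideal as the organizations present in both, and counts a tuple as correct when the two locations agree. That reading, the function name `ideal_metrics`, and the sample data are assumptions for illustration.

```python
def ideal_metrics(extracted, ideal):
    """Return (precision, recall) for two organization -> location dicts."""
    shared = extracted.keys() & ideal.keys()
    correct = sum(1 for org in shared if extracted[org] == ideal[org])
    precision = correct / len(shared) if shared else 0.0
    recall = correct / len(ideal) if ideal else 0.0
    return precision, recall


extracted = {
    "Microsoft": "Redmond",
    "IBM": "Armonk",
    "Intel": "New York",  # wrong location
    "Jellies": "Apple",   # not in Ideal, so it affects neither ratio
}
ideal = {
    "Microsoft": "Redmond",
    "IBM": "Armonk",
    "Intel": "Santa Clara",
    "Boeing": "Seattle",  # never extracted, so it lowers recall
}
print(ideal_metrics(extracted, ideal))
```

Under this reading, extracted tuples for organizations missing from the Ideal set cannot be judged automatically, which is one motivation for also estimating precision by manual sampling.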

Page 34:

Estimate Precision by Sampling

• Sample extracted table
  – Random samples, each 100 tuples
• Manually check validity of tuples in each sample

Page 35:

Extracting Relations from Text Collections

• Related Work
• The Snowball System:
  – Pattern representation and generation
  – Tuple generation
  – Automatic pattern and tuple validation
• Evaluation Metrics
• Experimental Results

Page 36:

Experimental results: Test Collection

[Figure: panels (a) and (b)]

Recall (a) and precision (b) using the Ideal metric, plotted against the minimal number of occurrences of test tuples in the collection

Page 37:

Experimental results: Sample and Check

[Figure: panels (a) and (b)]

Recall (a) and precision (b) for varying minimum confidence threshold Tt.

NOTE: Recall is estimated using the Ideal metric; precision is estimated by manually checking random samples of the result table.

Page 38:

Conclusions

We presented

• Our Snowball system:
  – Requires minimal training (handful of seed tuples)
  – Uses a flexible pattern representation
  – Achieves high recall/precision: > 80% of test tuples extracted
• Scalable evaluation methodology

Page 39:

Recent and Future Work

• Recent (presented in DMKD’00 workshop)
  – Alternative pattern representation
  – Combining representations
• Future Work
  – Evaluation on other extraction tasks
  – Extensions:
    • Non-binary relations
    • Relations with no key
    • HTML documents

Page 40:

Snowball: Extracting Relations from Large Plain-Text Collections

Eugene Agichtein ([email protected]), Luis Gravano

Department of Computer Science, Columbia University

Page 41:

Backup Slides

Page 42:

Snowball Solutions

• Flexible pattern representation
• Pattern generation
• Automatic pattern and tuple evaluation
  – Able to recover from noise
  – Keeps only high quality tuples as new seed

Page 43:

Experimental Results: Training

[Figure: panels (a) and (b)]

Recall (a) and precision (b) using the Ideal metric (training collection)

Page 44:

Sampling Results: Error Analysis

                       Correct  Incorrect  Type of Error
                                           Location  Organization  Relationship
DIPRE                  74       26         3         28            5
Snowball (all tuples)  52       48         6         41            1
Snowball (t = 0.8)     93       7          3         4             0
Baseline               25       75         8         62            5

The tuples in the random samples were checked by hand to pinpoint the “culprits” responsible for incorrect tuples. Sample size is 100.

Page 45:

Sample Discovered Patterns

Left  Middle  Right  Conf
<NEAR 0.01> <IN 0.79> <HEADQUARTERS 0.03> <, 0.20>  0.36
<OF 0.61> <, 0.61> <, 0.15>  0.37
<- 0.53> <BASED 0.53> <, 0.25> <SAID 0.1>  1
<WHILE 0.01> <BASED 0.52> <IN 0.52> <, 0.43> <, 0.28>  0.96
<- 0.70> <, 0.08>  0.63
<FROM 0.01> <S 0.52> <' 0.52> <IN 0.24> <HEADQUARTERS 0.22> <AND 0.01>  0.69

Page 46:

Convergence of Snowball and DIPRE

[Figure: panels (a) and (b)]

Precision (a) and recall (b) of DIPRE and Snowball with increased iterations

Page 47:

Approximate Matching of Organizations

• Use Whirl (W. Cohen @ AT&T) to match similar organization names
• Self-join the Extracted table on the Organization attribute
• Join resulting table with the Test table, and compare values of Location attributes

Extracted:

Organization          Location
Microsoft Corp        Wash.
Microsoft Corporation Redmond
Microsoft Corp.       WA

Apple Computer        Calif.
Apple Corp            Cupertino
Apple Computer Corp.  US

Extracted′:

Location       Organization
Redmond        Microsoft
Armonk         IBM
Santa Clara    Intel
Mountain View  Netscape
Cupertino      Apple
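To give a feel for approximate name matching, the sketch below compares organization names after light normalization. It does not reproduce Whirl: it swaps in `difflib.SequenceMatcher` from the Python standard library as a simple character-level stand-in. The helper names and the `0.8` threshold are assumptions chosen for this example.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase a name and drop punctuation so 'Corp.' and 'Corp' compare equal."""
    return re.sub(r"[^\w\s]", "", name).lower().strip()


def similarity(a, b):
    """Character-level similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def similar(a, b, threshold=0.8):
    """True when two organization names look alike under the chosen threshold."""
    return similarity(a, b) >= threshold


print(similar("Microsoft Corp.", "Microsoft Corp"))
print(similar("Microsoft Corp", "Apple Computer"))
```

A self-join in this style would compare every pair of organization names in the extracted table and group the pairs that pass the test before the locations are compared.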

Page 48:

References

• Blum & Mitchell. Combining labeled and unlabeled data with co-training. Proceedings of the 1998 Conference on Computational Learning Theory.
• Brin. Extracting patterns and relations from the World-Wide Web. Proceedings of the 1998 International Workshop on Web and Databases (WebDB’98).
• Collins & Singer. Unsupervised models for named entity classification. EMNLP 1999.
• Riloff & Jones. Learning dictionaries for information extraction by multi-level bootstrapping. AAAI’99.
• Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. ACL’95.