
Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Page 1: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Unsupervised Named-Entity Extraction from the Web: An Experimental Study

By: Ms. Shaima Salam, Mr. Amrut Budihal

Page 2: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

KnowItAll.

• The KnowItAll system aims to automate the tedious process of extracting large collections of facts from the Web in an unsupervised, domain-independent, and scalable manner.

• The distinctive feature of KnowItAll is that it does not require any hand-labelled training examples.

• KnowItAll introduces a novel, generate-and-test architecture that extracts information in two stages.

Page 3: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Problem Definition.

• What is Recall in Information Extraction?
• What is Precision in Information Extraction?
• What is Extraction Rate?

To improve KnowItAll’s recall and extraction rate without sacrificing precision.

Page 4: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Motivation.

• Improve on the challenge above: raise recall and extraction rate without sacrificing precision.

How to achieve it?

• Pattern Learning (PL)

• Subclass Extraction (SE)

• List Extraction (LE)

Page 5: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Agenda.

• Overview of the KnowItAll system.

• Challenges faced by this system.

• Distinct techniques to address these challenges.

• Evaluation of these techniques.

Page 6: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Architecture.

Page 7: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Bootstrapping.

• It takes as input a domain-dependent predicate, for example, “City”.

• It then generates an extraction rule and discriminators for the predicate.

•Each extraction rule consists of a predicate, an extraction pattern, constraints, bindings and keywords.

Page 8: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Bootstrapping.

Page 9: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Extraction Rules.

• The extraction pattern is applied to a sentence and has a sequence of alternating context strings and slots, where each slot represents a string from the sentence.

• Rules may set constraints on a slot, and may bind it to one of the predicate arguments as a phrase to be extracted.

• The constraints of a rule can specify the entire phrase that matches the slot, the head of the phrase, or the head of each simple NP in an NPList slot.

• The rule bindings specify which slots or slot heads are extracted for each argument of the predicate.

• Keywords are created from the context strings and any slots that have an exact word constraint.
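To make the rule anatomy concrete, here is a minimal Python sketch of how such a rule might be represented. The class and field names are illustrative choices of ours, not KnowItAll’s actual data structures:

```python
from dataclasses import dataclass, field

@dataclass
class ExtractionRule:
    """Illustrative stand-in for a KnowItAll extraction rule."""
    predicate: str                 # e.g. "City"
    pattern: list                  # alternating context strings and slots
    constraints: dict = field(default_factory=dict)  # per-slot constraints
    bindings: dict = field(default_factory=dict)     # slot -> predicate argument
    keywords: list = field(default_factory=list)     # used to form search queries

city_rule = ExtractionRule(
    predicate="City",
    pattern=["cities such as", "<NPList>"],          # context string, then slot
    constraints={"<NPList>": "head of each simple NP is a proper noun"},
    bindings={"<NPList>": "instance"},               # slot heads become extractions
    keywords=["cities such as"],
)
```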

Page 10: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Discriminators.

• The Assessor module uses discriminators that apply to search engine indices.

• A discriminator consists of an extraction pattern with alternating context strings and slots. There are no explicit or implicit constraints on the slots, and the pattern matches Web pages where the context strings and slots are immediately adjacent, ignoring punctuation, whitespace, or HTML tags.

Page 11: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Extractor.

• The Extractor creates a query from keywords in each rule, sends the query to a Web search engine, and applies the rule (shallow syntactic analysis) to extract information from the resulting Web pages.

• The shallow syntactic analysis involves checking capitalization and identifying proper nouns.

Example: “The tour includes major cities such as New York, central Los Angeles, and Dallas”

Here the head of “central Los Angeles” is “Los Angeles”.
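A rough sketch of this capitalization check, under the simplifying assumption that the head of a noun phrase is its trailing run of capitalized words (the helper name is ours):

```python
def proper_noun_head(np: str) -> str:
    """Return the trailing run of capitalized words in a noun phrase,
    e.g. "central Los Angeles" -> "Los Angeles" (a simplification of
    the shallow syntactic analysis described above)."""
    head = []
    for word in reversed(np.split()):
        if word[:1].isupper():
            head.insert(0, word)
        else:
            break
    return " ".join(head)

nps = ["New York", "central Los Angeles", "Dallas"]
print([proper_noun_head(np) for np in nps])
# ['New York', 'Los Angeles', 'Dallas']
```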

Page 12: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Assessor.

• The Assessor computes a probability that each extraction is correct before adding the extraction to KnowItAll’s knowledge base.

• Specifically, the Assessor uses a form of pointwise mutual information (PMI) between words and phrases that is estimated from Web search engine hit counts.

• The Assessor computes the PMI between each extracted instance and multiple, automatically generated discriminator phrases associated with the class.

Page 13: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Assessor.

• For example, in order to estimate the likelihood that “Paris” is the name of a city, the Assessor might check to see if there is a high PMI between “Paris” and phrases such as “Paris is a city”.
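The paper estimates PMI(I, D) as Hits(D + I) / Hits(I) from search-engine hit counts. A minimal sketch of that computation, with hits() as a stand-in for a real hit-count lookup and canned counts in place of Web queries:

```python
# Canned hit counts stand in for real search-engine queries (illustrative only).
FAKE_HITS = {"Paris": 1_000_000, "Paris is a city": 15_000}

def hits(query: str) -> int:
    """Placeholder for a Web search-engine hit-count lookup."""
    return FAKE_HITS.get(query, 0)

def pmi(instance: str, discriminator: str) -> float:
    """PMI(I, D) = Hits(discriminator with I substituted in) / Hits(I alone)."""
    return hits(discriminator.replace("<city>", instance)) / hits(instance)

print(pmi("Paris", "<city> is a city"))  # 0.015 -- a relatively high PMI,
                                         # evidence that "Paris" names a city
```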

Page 14: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Challenges.

Problems with using PMI:

• Sparse data problem: if an instance is found on only a few thousand Web pages, the expected number of hits for a positive instance will be less than 1 for such a discriminator.

• Homonyms: words that have the same spelling but different meanings. For example, “Georgia” refers to both a state and a country.

Page 15: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Challenges.

• What is the effect on performance of the Assessor if the automatically selected training seeds include errors?

• Results: There is some degradation of performance from 10% noise, and a sharp drop in performance from 30% noise. The Assessor can tolerate 10% noise in bootstrapped training seeds up to recall 0.75, but performance degrades sharply after that.

• This motivates bootstrap training that is robust and noise-tolerant.

Page 16: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Challenges: Resource Allocation.

• KnowItAll needs a policy that dictates when to stop looking for more instances of a predicate, because searching past the point of diminishing returns reduces efficiency and precision.

• Two metrics are used to determine the utility of searching for further instances of a predicate: Signal-to-Noise ratio (STN) and Query Yield Ratio (QYR).

• KnowItAll computes the STN ratio by dividing the number of high-probability new extractions by the number of low-probability ones over the most recent n Web pages examined for that predicate.

Page 17: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Resource Allocation.

• QYR is defined as the ratio of query yield over the most recent n Web pages examined, divided by the initial query yield over the first n Web pages, where query yield is the number of new extractions divided by the number of Web pages examined.

• If this ratio falls below a cutoff point, the Extractor has reached a point of diminishing returns where it is hardly finding any new extractions and halts the search for that predicate.
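A sketch of both halting metrics as defined above. The 0.5 probability threshold and the 0.1 cutoff value are our assumptions for illustration; the slides do not give the actual values:

```python
def signal_to_noise(recent_probs):
    """STN over the most recent n pages: the count of high-probability new
    extractions divided by the count of low-probability ones (the 0.5
    threshold is our assumption)."""
    high = sum(1 for p in recent_probs if p >= 0.5)
    low = sum(1 for p in recent_probs if p < 0.5)
    return high / low if low else float("inf")

def query_yield_ratio(recent_new, recent_pages, first_new, first_pages):
    """QYR: yield over the most recent n pages divided by the initial yield
    over the first n pages, where yield = new extractions / pages examined."""
    return (recent_new / recent_pages) / (first_new / first_pages)

print(signal_to_noise([0.9, 0.8, 0.2, 0.1, 0.05]))  # 2 high / 3 low ~= 0.67
# Halt once QYR falls below a cutoff (0.1 here is illustrative).
if query_yield_ratio(3, 100, 80, 100) < 0.1:
    print("diminishing returns -- stop searching this predicate")
```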


Page 18: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Results with Baseline KnowItAll.

Conclusions:

• The Assessor is able to do a good job of assigning high probabilities to correct instances with only a few false positives.

• The Assessor has more trouble with false negatives than with false positives.

Page 19: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Extending KnowItAll with Pattern Learning.

Many of the best extraction rules do not match the general patterns. For example:

“the film <film> starring”
“headquartered in <city>”

Arming KnowItAll with pattern learning can significantly improve coverage and accuracy.

Working:
1) Start with instances generated by the domain-independent extractors.
2) For every instance queried, record “n” words before and after the class instance.
3) Choose the best patterns.

Why only the best patterns?

Page 20: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Learned Patterns as Extractors.

• The best patterns are selected using the following heuristics:

1) Select those patterns which occur with multiple seed instances. (Results show this eliminates 96% of the unnecessary patterns!)

2) The remaining patterns are sorted by their estimated precision.
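A compact sketch of the PL procedure with heuristic 1 applied: record the n words before and after each seed occurrence as a candidate pattern, then keep patterns supported by at least two distinct seeds. Sentence tokenization and the precision-based ranking of heuristic 2 are omitted, and the example sentences are ours:

```python
from collections import defaultdict

def learn_patterns(sentences, seeds, n=2):
    """Record the n words before/after each seed mention as a candidate
    pattern, then keep patterns that occur with multiple seed instances."""
    support = defaultdict(set)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            if word in seeds:
                before = " ".join(words[max(0, i - n):i])
                after = " ".join(words[i + 1:i + 1 + n])
                support[(before, after)].add(word)
    # Heuristic 1: a pattern seen with >= 2 distinct seeds survives.
    return {p for p, s in support.items() if len(s) >= 2}

seeds = {"Chicago", "Boston"}
sentences = ["firm headquartered in Chicago said today",
             "company headquartered in Boston said today",
             "I visited Chicago in May"]
print(learn_patterns(sentences, seeds))
# {('headquartered in', 'said today')} -- the single-seed pattern
# ('I visited', 'in May') is discarded
```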

Page 21: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Learned Patterns as Discriminators.

• How does Pattern Learning help here?

1) Consider a discriminator such as “cities such as <city>”, which appears only a few times on the Web.

2) Executing the same set of discriminators on every extraction is inefficient; learned patterns enlarge the pool of discriminators to choose from.

Related Work:
1) Bootstrap learning.
2) Using Web search engines.
3) Wrapper induction algorithms.
4) Rule learning schemes.

Page 22: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Extending KnowItAll with Subclass Extraction.

• For example, not all scientists are found in sentences that identify them as “scientists”.

• Extracting candidate subclasses: how do we determine whether the extracted information is an instance or a subclass? Instances are identified by proper nouns; subclasses are identified by common nouns.

• Assessing candidate subclasses:
- Check the morphology of the candidate term, e.g. “microbiologist” is a subclass of “biologist”.
- Check whether the subclass is a hyponym of the class in WordNet; if so, assign it a very high probability.
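A hedged sketch of these two assessment checks using NLTK’s WordNet interface. The numeric scores are illustrative stand-ins, not the paper’s probabilities:

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def assess_subclass(candidate: str, cls: str) -> float:
    """Plausibility that `candidate` names a subclass of `cls` -- a sketch
    of the two checks above; the scores are illustrative, not the paper's."""
    # Morphology check: e.g. "microbiologist" ends with "biologist".
    if candidate != cls and candidate.endswith(cls):
        return 0.9
    # WordNet check: is some noun sense of the candidate a hyponym of the class?
    class_senses = set(wn.synsets(cls, pos=wn.NOUN))
    for sense in wn.synsets(candidate, pos=wn.NOUN):
        if class_senses & set(sense.closure(lambda s: s.hypernyms())):
            return 0.95  # "very high probability" per the slide
    return 0.1

print(assess_subclass("microbiologist", "biologist"))  # 0.9 (morphology)
print(assess_subclass("physicist", "scientist"))       # 0.95 (WordNet hyponym)
```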

Page 23: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Improving Subclass Extraction.

Results: SE is successful on general classes such as “Scientist” and least successful on specific classes such as “City”.

Need for improvement: this technique did not produce every expected subclass. For example, given “Biologists, physicists and chemists have convened at this inter-disciplinary conference.”, enumeration rules identify “chemists” as a possible sibling of “biologists” and “physicists” via the pattern C1 {“,”} C2 {“,”} “and” CN, as sketched below.
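A quick regex sketch of that enumeration pattern; real noun-phrase chunking would replace the bare \w+ groups:

```python
import re

# The enumeration pattern C1 {","} C2 {","} "and" CN, simplified to single words.
enum = re.compile(r"(\w+), (\w+),? and (\w+)")

sentence = ("Biologists, physicists and chemists have convened "
            "at this inter-disciplinary conference.")
match = enum.search(sentence)
if match:
    print(match.groups())  # ('Biologists', 'physicists', 'chemists')
    # each item is a candidate sibling class of the others
```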

Page 24: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Improving Subclass Extraction.

Two methods:
1) SEiter: an extraction matching a large number of enumeration rules is a good candidate for a subclass.
2) SEself: adjusts the probabilities assigned to extractions based on confidence scores assigned to the enumeration rules, in a recursive fashion.

Results:
• Improvement in recall at the cost of a loss in precision.
• SEself is more robust to noise than SEiter.
• The overall system improved by a factor of 5.

Page 25: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Extending KnowItAll with List Extractor.

List Extractor (LE) searches for lists of items on Web pages, learns a wrapper on the fly for each list, and extracts items from these lists.

Input: the name of the class and a set of positive seeds.
Output: a list of candidate tokens for that class.

Working:
1) Obtain a set of documents from the input.
2) Iterate through each document, partitioning it based on HTML tags.
3) Apply learning to induce a wrapper.
4) Once a wrapper is learnt, add it to the wrapper tree.
5) Once all the wrappers are added, select the best ones to get extractions.

Page 26: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Example: List Extractor.

Keywords: Italy, Japan, Spain, Brazil

1 <html>
2 <body>
3 My favorite countries:
4 <table>
5 <tr><td><a>Italy</a></td><td><a>Japan</a></td><td><a>France</a></td></tr>
6 <tr><td><a>Israel</a></td><td><a>Spain</a></td><td><a>Brazil</a></td></tr>
7 </table>
8 My favorite pets:
9 <table>
10 <tr><td><a>Dog</a></td><td><a>Cat</a></td><td><a>Alligator</a></td></tr>
11 </table>
12 </body>
13 </html>

Wrappers (at least 2 keywords match):
w1 (1 - 13): <td><a>TOKEN</a></td>
w2 (2 - 12): <td><a>TOKEN</a></td>
w3 (4 - 7): <td><a>TOKEN</a></td>
w4 (5 - 5): <td><a>TOKEN</a></td><td><a>
w5 (6 - 6): </a></td><td><a>TOKEN</a></td>
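Below is a minimal Python sketch of steps 1–5 on this toy page. A “wrapper” is simplified here to the fixed-width tag context around a seed occurrence, which is an assumption of ours rather than LE’s actual wrapper language:

```python
import re
from collections import defaultdict

html = ("<html><body>My favorite countries:<table>"
        "<tr><td><a>Italy</a></td><td><a>Japan</a></td><td><a>France</a></td></tr>"
        "<tr><td><a>Israel</a></td><td><a>Spain</a></td><td><a>Brazil</a></td></tr>"
        "</table>My favorite pets:<table>"
        "<tr><td><a>Dog</a></td><td><a>Cat</a></td><td><a>Alligator</a></td></tr>"
        "</table></body></html>")
seeds = {"Italy", "Japan", "Spain", "Brazil"}

# 1) Propose wrappers: the tag context immediately around each seed occurrence.
support = defaultdict(set)
for seed in seeds:
    for m in re.finditer(re.escape(seed), html):
        context = (html[max(0, m.start() - 7):m.start()],  # e.g. "<td><a>"
                   html[m.end():m.end() + 9])              # e.g. "</a></td>"
        support[context].add(seed)

# 2) Keep wrappers matched by at least two distinct seed keywords.
wrappers = [w for w, s in support.items() if len(s) >= 2]

# 3) Apply the surviving wrappers to extract candidate tokens.
candidates = set()
for prefix, suffix in wrappers:
    candidates.update(re.findall(
        re.escape(prefix) + r"([^<>]+)" + re.escape(suffix), html))
print(candidates)  # all table cells, pets included: in LE proper, wrappers are
                   # scoped to document regions (the line ranges above), which is
                   # how step 5's best-wrapper selection can exclude the pet table
```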

Page 27: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Advantages of List Extractor.

• Helps the Assessor by providing it with a subset of candidate tokens.

• Detects rare cities present in HTML <select> tags.

Page 28: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Related Work.

Several similar projects exist:

• Google’s Froogle.

• Whizbang’s Flipdog.

• Elion.

Page 29: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Future Work.

• Testing for n-ary predicates.

• Generalize KNOWITALL’s bootstrapping and assessment modules, as well as its recall-enhancing methods, to handle n-ary predicates.

• Tackling tricky extraction problems such as word sense disambiguation and the extraction of temporally changing facts.

• Investigate EM and related co-training techniques.

• Improving KNOWITALL’s scalability.

• Creating a multi-lingual version of KNOWITALL.

Page 30: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Results and Conclusions.

• The three techniques greatly improved the extraction rate of the baseline KnowItAll.

• The List Extraction technique made the highest contribution to the system’s improvement. Experimental results show that LE’s extraction rate was over forty times greater than that of the other methods.

• SE extracted the most new Scientists.

• The work points to futuristic possibilities such as massive Web-based information extraction and the automatic accumulation of large collections of facts to support knowledge-based AI systems.

Page 31: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Questions?

Page 32: Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Thank You.