ENTITY EXTRACTION: RULE-BASED METHODS
“I’m booked on the train leaving from Paris at 6 hours 31”
Rule: Location (Token string = “from”) ({DictionaryLookup=location}):loc -> Location=:loc
Feb 24, 2016
Extraction through Rules
Rules are useful for:
• Creating wrappers
• Extracting information from controlled and well-behaved data
Examples:
• Email address: “My email address is [email protected]”
• Phone number: “For information you should call 801-111-2222”
Rule-Based System:
• Collection of rules
• Policies to control the firing of multiple rules
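As a rough illustration of rules over well-behaved data, the email and phone examples above can be approximated with two regular expressions. This is only a sketch: the patterns, the `extract` helper, and the address `jane.doe@example.org` are assumptions made for illustration, and real email and phone formats are far more varied.

```python
import re

# Simplified, assumed patterns; they do not cover every valid format.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def extract(text):
    """Return (label, matched_text) pairs found in text."""
    found = [("Email", m.group()) for m in EMAIL.finditer(text)]
    found += [("Phone", m.group()) for m in PHONE.finditer(text)]
    return found

print(extract("For information you should call 801-111-2222"))
print(extract("My email address is jane.doe@example.org"))  # hypothetical address
```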
Rule Representation
Different rule-based systems use different formats for rule representation:
• CPSL: Common Pattern Specification Language
• Rapier: Robust Automated Production of Information Extraction Rules
• WHISK: supervised algorithm to learn regular expressions
• Avatar: SQL expressions
Form of a Basic Rule
Contextual Pattern -> Action
• Contextual pattern: one or more labeled patterns capturing properties of one or more entities and the context in which they appear in the text
  • Each pattern is a regular expression defined over features of tokens in the text, with an optional label
  • A feature can be any property of the token, of its context, or of the document in which the token appears
• Action: a tagging action. Examples:
  • Assigning an entity label
  • Inserting a start/end entity tag at a position
  • Assigning multiple entity tags
Features of Tokens
A token is associated with a bag of features obtained through one or more criteria:
• String representing the token
• Orthography type of the token
• Part of speech of the token, e.g. “Heat water in a large vessel” -> V N P DET ADJ N
• List of dictionaries in which the token appears, e.g. Locations: Rome, Paris, Greece
• Annotations attached by previous processing steps, e.g. <location> ….. </location>
Example tokens: Kitchen; Jones & Jones
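A minimal sketch of how such a bag of features might be computed per token. The tokenizer, the orthography categories, and the `DICTIONARIES` contents are assumptions chosen for illustration; part-of-speech tags and annotations from earlier processing steps are left out because they would need an external tagger.

```python
import re

# Hypothetical dictionaries, loosely based on the examples above.
DICTIONARIES = {
    "location": {"Rome", "Paris", "Greece"},
    "title": {"Dr", "Mr", "Prof"},
}

def orthography(token):
    """Coarse orthography type of a token (assumed category names)."""
    if token.isdigit():
        return "number"
    if token.isalpha() and token.istitle():
        return "capitalized"
    if token.isalpha() and token.islower():
        return "lowercase"
    return "other"

def featurize(text):
    """Split text into word and punctuation tokens and attach features."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [
        {
            "string": tok,
            "orth": orthography(tok),
            "dicts": {name for name, words in DICTIONARIES.items() if tok in words},
        }
        for tok in tokens
    ]

for features in featurize("leaving from Paris at 6"):
    print(features)
```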
Rules to Identify a Single Entity
An entity-recognizing rule is made up of three patterns:
• An optional pattern to capture the context before the start of an entity
• A pattern matching the tokens in the entity
• An optional pattern to capture the context after the end of the entity
Example: Identify person names of the form “Dr. Yair Weiss”
({DictionaryLookup = Titles} {String = “.”} {Orthography type = capitalized word}{2}) -> Person names
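The person-name rule above can be sketched over a token list as follows. The `TITLES` set, the tokenizer, and the function names are assumptions for illustration, not part of any particular rule engine.

```python
import re

TITLES = {"Dr", "Mr", "Mrs", "Prof"}  # hypothetical Titles dictionary

def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

def is_capitalized(tok):
    return tok.isalpha() and tok.istitle()

def find_person_names(text):
    """Match: title, ".", then exactly two capitalized words."""
    toks = tokenize(text)
    names = []
    for i in range(len(toks) - 3):
        if (toks[i] in TITLES and toks[i + 1] == "."
                and is_capitalized(toks[i + 2])
                and is_capitalized(toks[i + 3])):
            names.append(toks[i:i + 4])  # the whole matched span is the entity
    return names

print(find_person_names("The talk by Dr. Yair Weiss starts at noon."))
```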
Rules to Identify a Single Entity
Example: rules for identifying company names in GATE, a popular entity recognition system
Rules to Mark Entity Boundaries
• Entity boundaries are useful for marking long entities
• Separate rules are defined to mark the start and the end of an entity
• Each rule leads to the insertion of an SGML tag in the text
Example: Insert a <journal> tag to mark the start of a journal name in a citation record
({String=“to”} {String=“appear”} {String=“in”}):jstart ({Orthography type = Capitalized word}{2-5}) -> insert <journal> after :jstart
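A minimal sketch of the boundary rule above, operating on a list of tokens. The function name is an assumption. Note that for inserting the start tag it is enough that at least two capitalized words follow “to appear in”; the upper bound of five only limits how many tokens the pattern would consume.

```python
def is_capitalized(tok):
    return tok.isalpha() and tok.istitle()

def mark_journal_start(tokens):
    """Insert "<journal>" after "to appear in" when capitalized words follow."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i >= 2 and tokens[i - 2:i + 1] == ["to", "appear", "in"]:
            following = tokens[i + 1:i + 3]
            if len(following) == 2 and all(is_capitalized(t) for t in following):
                out.append("<journal>")
    return out

citation = ["to", "appear", "in", "Machine", "Learning", "Journal", ",", "2004"]
print(mark_journal_start(citation))
```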
Rules for Multiple Entities
Rules for multiple-entity recognition
• A regular expression with multiple slots, each representing a different entity, identifies more than one entity simultaneously
• Useful for extracting information from structured records: medical records, classified ads, etc.
Example: Extract the number of bedrooms and the rent from an apartment rental ad
({Orthography type = Digit}):Bedroom ({String=“BR”}) ({}*) ({String=“$”}) ({Orthography type = Number}):Price -> Number of Bedrooms = :Bedroom, Rent = :Price
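A character-level analogue of the two-slot rule above, using named groups in a Python regular expression. This is a sketch under assumptions: the original rule is defined over token features, while this version works on raw characters, and the sample ad text is invented for illustration.

```python
import re

AD_RULE = re.compile(
    r"(?P<bedrooms>\d)\s*BR\b"           # slot 1: a digit followed by "BR"
    r".*?"                               # any intervening text (non-greedy)
    r"\$\s*(?P<price>\d+(?:,\d{3})*)"    # slot 2: a number after "$"
)

def extract_ad(text):
    """Fill both slots from one match, or return None if the rule does not fire."""
    m = AD_RULE.search(text)
    if m is None:
        return None
    return {"Number of Bedrooms": m.group("bedrooms"), "Rent": m.group("price")}

print(extract_ad("Sunny 2 BR apt near campus, $1200 per month"))
```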
Organizing Collections of Rules
Rule-based systems consist of very large collections of rules
• Problem: Spans demarcated by different rules overlap, leading to conflicting actions
• Solution: A component that organizes the rules and controls the order in which they are applied, to eliminate or resolve conflicts
• How: Heuristics and special-case handling, since rule management is a non-standardized, custom-tuned part of a rule-based system
Resolving Rule Conflicts
Use of special/custom policies
• Sample policies:
  • Prefer rules that mark larger spans of text as an entity type
  • Merge spans of text that overlap, only if the action portion of the two applied rules is the same
• Popular strategy, since it allows flexibility in defining rules
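The two sample policies can be sketched on spans represented as `(start, end, label)` tuples with an exclusive end. The representation, the function name, and the order in which the two policies are applied are assumptions for illustration; an actual system may combine them differently.

```python
def resolve(spans):
    """Apply two conflict policies to (start, end, label) spans."""
    # Policy: merge overlapping spans only if their action (label) is the same.
    merged = []
    for start, end, label in sorted(spans):
        for k, (s, e, l) in enumerate(merged):
            if l == label and start < e and s < end:
                merged[k] = (min(s, start), max(e, end), l)
                break
        else:
            merged.append((start, end, label))
    # Policy: among spans that still overlap, prefer the larger one.
    kept = []
    for span in sorted(merged, key=lambda x: x[1] - x[0], reverse=True):
        if all(span[1] <= k[0] or k[1] <= span[0] for k in kept):
            kept.append(span)
    return sorted(kept)

spans = [(0, 2, "Person"), (1, 4, "Person"), (3, 5, "Location"), (10, 12, "Location")]
print(resolve(spans))
```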
Resolving Rule Conflicts
Arrange rules as an ordered set
• Define a priority order over all the rules and, when rules conflict, favor the one with higher priority
• The priority of a rule is fixed by some function of the precision and coverage of the rule on the training data
“It is an open question whether a good rule-based theory should consist of rules that cover many examples at the expense of a certain number of misclassifications or whether one should prefer rules that cover only few examples, but appear to be more precise” (Fürnkranz, 2003)
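One way to picture such a priority function is a weighted mix of precision and coverage. The formula, the `alpha` weight, and the rule statistics below are all assumptions made up for illustration; the slides do not specify which function is used, and the quotation above is precisely about how that trade-off should be set.

```python
def priority(precision, coverage, alpha=0.7):
    """Hypothetical score; alpha weights precision against coverage."""
    return alpha * precision + (1 - alpha) * coverage

def order_rules(rules, alpha=0.7):
    """Return rule names sorted from highest to lowest priority."""
    ranked = sorted(
        rules,
        key=lambda r: priority(r["precision"], r["coverage"], alpha),
        reverse=True,
    )
    return [r["name"] for r in ranked]

# Toy statistics, not measured on any real data.
rules = [
    {"name": "R1", "precision": 0.95, "coverage": 0.10},
    {"name": "R2", "precision": 0.80, "coverage": 0.40},
    {"name": "R3", "precision": 0.60, "coverage": 0.90},
]
print(order_rules(rules))             # precision-heavy weighting
print(order_rules(rules, alpha=0.0))  # coverage only
```

Changing `alpha` changes the order, which is one concrete form of the trade-off between rules that cover many examples and rules that are more precise.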
Resolving Rule Conflicts
Based on a complete order
• A later rule can be defined on the actions of earlier rules
• Example: insert an end tag based on the result of an earlier rule that inserted a start tag
• Since R1 has precedence, R2 can assume that the <journal> tag is already present and can use it as part of its pattern
Resolving Rule Conflicts
Finite State Machines
• A full automaton is explicitly defined to control the exact sequence in which rules are applied
• Nodes (entities) are connected via directed edges
• Each edge is associated with a rule on the input tokens that must be satisfied for the edge to be taken
• Each rule corresponds to a path in the finite state machine
• There is no ambiguity about the order in which rules are applied, as long as there is a unique path from the start state to the sink state for each sequence of tokens
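A small sketch of the idea: states connected by edges, each guarded by a rule on the current token, with a sequence accepted only if it traces a unique path from the start state to the sink state. The state names, the edge table, and the person-name pattern it encodes are assumptions for illustration.

```python
def is_capitalized(tok):
    return tok.isalpha() and tok.istitle()

# Each edge: (source state, rule on the input token, target state).
EDGES = [
    ("start", lambda t: t in {"Dr", "Prof"}, "title"),
    ("title", lambda t: t == ".", "dot"),
    ("dot", is_capitalized, "first"),
    ("first", is_capitalized, "sink"),
]

def accepts(tokens):
    """True if the tokens follow exactly one path from "start" to "sink"."""
    state = "start"
    for tok in tokens:
        targets = [dst for src, rule, dst in EDGES if src == state and rule(tok)]
        if len(targets) != 1:  # no edge can be taken, or the choice is ambiguous
            return False
        state = targets[0]
    return state == "sink"

print(accepts(["Dr", ".", "Yair", "Weiss"]))
```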
How Rules are Created
A typical entity extraction system depends on a finely tuned set of rules, which are either:
• Manually coded by domain experts
• Automatically learned from labeled examples of entities in unstructured text
Rule Learning Algorithm
Create a set of rules R1, R2, …, Rk such that the action of each rule either:
• Identifies a single entity
• Marks entity boundaries
• Identifies multiple entities
The rules are learned from a training set consisting of an unstructured set of documents in which all occurrences of entities are marked correctly
Rule Learning Algorithm
Goal
• Cover all the segments that contain an annotation with one or more rules
• Ensure that the precision of each rule is high
Definitions
• S(R) is the set of data segments in the training documents matched by rule R; the coverage of R is the fraction of segments that fall in S(R)
• S’(R) is the subset of S(R) for which the action specified by R is correct; the precision of R is the ratio |S’(R)| / |S(R)|
The overall set of rules must have good recall and precision on new documents
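The two measures can be computed directly from labeled segments. In this sketch a rule is assumed to be a function that returns a label when it fires and `None` otherwise, and each training segment is paired with its gold label; the toy segments and the "from" rule are made up for illustration.

```python
def coverage_and_precision(rule, segments):
    """segments: list of (segment, gold_label); rule(segment) -> label or None."""
    matched = [(seg, gold) for seg, gold in segments if rule(seg) is not None]  # S(R)
    correct = [(seg, gold) for seg, gold in matched if rule(seg) == gold]       # S'(R)
    coverage = len(matched) / len(segments)
    precision = len(correct) / len(matched) if matched else 0.0
    return coverage, precision

# Toy segments: (previous token, token) with a gold label.
segments = [
    (("from", "Paris"), "Location"),
    (("from", "Monday"), "Date"),
    (("in", "Rome"), "Location"),
    (("at", "6"), "Time"),
]

def from_rule(seg):
    """Label the token as a Location when the previous token is "from"."""
    return "Location" if seg[0] == "from" else None

print(coverage_and_precision(from_rule, segments))
```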
Rule Learning Algorithm
Learned rules must generalize to new documents
• Aim for the smallest set of rules that covers the maximum number of training cases with high precision
• Finding such an optimal rule set is intractable
• Rule-learning algorithms therefore follow a greedy hill-climbing strategy, learning one rule at a time
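The greedy strategy can be sketched as a covering loop: repeatedly pick the sufficiently precise candidate rule that correctly covers the most still-uncovered training examples. The candidate representation (sets of example ids), the `min_precision` threshold, and the toy data are assumptions for illustration; actual learners also generate the candidates themselves.

```python
def greedy_cover(candidates, examples, min_precision=0.8):
    """candidates: name -> (matched ids, correctly matched ids)."""
    uncovered = set(examples)
    chosen = []
    while uncovered:
        best, best_gain = None, 0
        for name, (matched, correct) in candidates.items():
            if name in chosen or not matched:
                continue
            if len(correct) / len(matched) < min_precision:
                continue  # rule is not precise enough
            gain = len(correct & uncovered)
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # no remaining rule covers anything new
            break
        chosen.append(best)
        uncovered -= candidates[best][1]
    return chosen, uncovered

candidates = {
    "A": ({1, 2, 3}, {1, 2, 3}),
    "B": ({3, 4, 6, 7, 8}, {3, 4}),  # broad but imprecise
    "C": ({4}, {4}),
    "D": ({1, 2}, {1, 2}),
}
print(greedy_cover(candidates, {1, 2, 3, 4, 5}))
```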
Rule Learning Algorithm
The main challenge is to create a new rule that achieves high overall coverage and has high precision. This can be achieved using heuristics or existing strategies, classified as:
• Bottom-up approach: Specific Rule -> Generalized Rule
• Top-down approach: General Rule -> Specialized Rule
Bottom-Up
Bottom-up rule formation
• Start with a rule with minimal coverage but 100% precision
• Gradually make the rule more general to increase coverage, even if some precision is lost
Example: rule learning using (LP)² for each tag type
Bottom-Up
Creation of a seed rule
Example: seed rule to insert tag T = <PER> before a position pstart
({String=“According”} {String=“to”}):pstart {String=“Robert”} {String=“Callahan”} -> insert <PER> at :pstart
Bottom-Up
Generalizing seed rules
Example: replace a token with a more general feature, or drop it
Seed rule:
({String=“According”} {String=“to”}):pstart {String=“Robert”} {String=“Callahan”} -> insert <PER> at :pstart
Generalizations:
({String=“According”} {String=“to”}):pstart {Orthography type = “Capitalized word”} {Orthography type = “Capitalized word”} -> insert <PER> after :pstart
({DictionaryLookup = Person}):pstart ({DictionaryLookup = Person}) -> insert <PER> before :pstart
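A minimal sketch of one generalization step: relax a single `String` constraint on a capitalized word to an orthography constraint. The pattern representation (a list of `(feature, value)` pairs) and the function names are assumptions for illustration; (LP)² itself considers a wider range of generalizations.

```python
def is_capitalized(tok):
    return tok.isalpha() and tok.istitle()

def matches(constraint, tok):
    kind, value = constraint
    if kind == "String":
        return tok == value
    if kind == "Orthography":
        return value == "Capitalized" and is_capitalized(tok)
    return False

def generalize(pattern):
    """Yield copies of pattern with one capitalized String constraint relaxed."""
    for i, (kind, value) in enumerate(pattern):
        if kind == "String" and is_capitalized(value):
            yield pattern[:i] + [("Orthography", "Capitalized")] + pattern[i + 1:]

def fires(pattern, tokens):
    """True if the pattern matches some contiguous window of tokens."""
    n = len(pattern)
    return any(
        all(matches(c, t) for c, t in zip(pattern, tokens[i:i + n]))
        for i in range(len(tokens) - n + 1)
    )

seed = [("String", "According"), ("String", "to"),
        ("String", "Robert"), ("String", "Callahan")]
tokens = ["According", "to", "Mary", "Callahan"]
print(fires(seed, tokens))
print([fires(g, tokens) for g in generalize(seed)])
```

The seed rule does not fire on the new sentence, while the generalization that relaxes “Robert” does, which is how coverage grows at some risk to precision.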
Bottom-Up
Generalizations retained, starting from a single seed rule
• The top-k rules are selected sequentially, in decreasing order of precision over uncovered instances
• (LP)² also considers several measures of rule quality, such as:
  • Precision
  • Overall coverage
  • Coverage of instances not covered by other rules
Top-Down
Top-down rule formation
• Start with a rule that covers all possible instances, i.e., 100% coverage and poor precision
• Specialize the rule to increase precision
• Select the set of the k most precise rules
• User-provided threshold for the coverage of each rule
Example: rule learning using (LP)² for each tag type
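A single top-down specialization step can be sketched as follows: start from the empty rule, which matches every instance, add one required feature at a time, discard candidates below a coverage threshold, and keep the k most precise. The feature-set representation, the feature names, and the toy data are assumptions for illustration only.

```python
def stats(rule, data):
    """rule: set of required features; data: list of (features, is_entity)."""
    matched = [gold for feats, gold in data if rule <= feats]
    coverage = len(matched) / len(data)
    precision = sum(matched) / len(matched) if matched else 0.0
    return coverage, precision

def specialize(data, all_features, k=2, min_coverage=0.2):
    """Add one feature to the most general rule and keep the k most precise."""
    general = frozenset()  # requires nothing, so it covers every instance
    scored = []
    for feature in sorted(all_features):
        rule = general | {feature}
        coverage, precision = stats(rule, data)
        if coverage >= min_coverage:  # user-provided coverage threshold
            scored.append((precision, coverage, rule))
    scored.sort(key=lambda x: (-x[0], -x[1], sorted(x[2])))
    return [set(rule) for _, _, rule in scored[:k]]

data = [
    ({"cap", "after_from", "in_loc_dict"}, True),
    ({"cap", "in_loc_dict"}, True),
    ({"cap"}, False),
    ({"cap", "after_from"}, True),
    ({"lower"}, False),
]
features = {"cap", "after_from", "in_loc_dict", "lower"}
print(stats(frozenset(), data))
print(specialize(data, features))
```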
Rule Learning Algorithm
• Problem: Because labeled data is scarce, purely automated, data-driven methods for rule induction (Labeled Data -> Rules) are not adequate
• Solution: A hybrid of automated and manual methods to improve rule-based systems
Summary
• Rule-based methods for entity extraction
  “I’m booked on the train leaving from Paris at 6 hours 31”
  Rule: Location (Token string = “from”) ({DictionaryLookup=location}):loc -> Location=:loc
• Conflicts need to be resolved
• How sets of rules are created