ENTITY EXTRACTION: RULE-BASED METHODS

“I’m booked on the train leaving from Paris at 6 hours 31”

Rule: Location(Token string = “from”)({DictionaryLookup=location}):loc -> Location=:loc

Extraction through Rules

Rules are useful for:

• Creating wrappers

• Extracting information from controlled and well-behaved data

“My email address is [email protected]”

“For information you should call 801-111-2222”

Email address

Phone number
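For well-behaved data like the two sentences above, a rule can be as small as one regular expression per entity type. The sketch below is only an illustration: the patterns `EMAIL_RE` and `PHONE_RE`, the helper `extract_contacts`, and the address `jane.doe@example.org` are assumptions made for this example and are not part of the original slides.

```python
import re

# Deliberately simple patterns for controlled text; real email addresses and
# phone formats are far more varied than this sketch handles.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def extract_contacts(text):
    """Return (entity_type, matched_string) pairs found in text."""
    found = [("Email address", m.group()) for m in EMAIL_RE.finditer(text)]
    found += [("Phone number", m.group()) for m in PHONE_RE.finditer(text)]
    return found


print(extract_contacts("For information you should call 801-111-2222"))
# Hypothetical address, used only to exercise the email pattern.
print(extract_contacts("Write to jane.doe@example.org for details"))
```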

Rule-Based System:

• Collection of rules

• Policies to control the firing of multiple rules

Rule Representation

Different rule-based systems employ distinct formats for rule representation:

• CPSL: Common Pattern Specification Language

• Rapier: Robust Automated Production of Information Extraction Rules

• WHISK: supervised algorithm to learn regular expressions

• Avatar: SQL expressions

Form of a Basic Rule

Contextual Pattern -> Action

Contextual Pattern: 1 or more labeled patterns capturing properties of 1 or more entities and the context in which they appear in the text

Labeled pattern: a regular expression defined over features of tokens in the text, plus an optional label

Features: any property of the token, the context, or the document in which the token appears

Action: tagging actions. Examples:

• Assigning an entity label

• Inserting a start/end of an entity tag at a position

• Assigning multiple entity tags

Features of Tokens

A token is associated with a bag of features obtained through 1 or more criteria:

String representing the token

Orthography type of the token

Part of Speech of the token

List of dictionaries in which the token appears

Annotations attached by previous processing steps

Examples:

Annotation from a previous step: <location> ….. </location>

Part of speech: Heat/V water/N in/P a/DET large/ADJ vessel/N

Kitchen Jones & Jones

Dictionary “Locations”: Rome, Paris, Greece
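The bag of features attached to each token can be pictured as a small record per token. The following is a minimal sketch under stated assumptions: the `DICTIONARIES` gazetteers, the orthography categories, and the function names are invented for illustration, and part-of-speech tags and earlier annotations are left out because they would need an external tagger.

```python
import re

# Hypothetical gazetteers; a real system would load much larger dictionaries.
DICTIONARIES = {
    "location": {"rome", "paris", "greece"},
    "titles": {"dr", "mr", "mrs", "prof"},
}


def orthography_type(token):
    """Classify the surface shape of a token (illustrative categories only)."""
    if token.isdigit():
        return "number"
    if token.isalpha():
        if token.isupper():
            return "all caps"
        if token[0].isupper():
            return "capitalized word"
        return "lowercase word"
    return "other"


def token_features(text):
    """Split text into tokens and attach a feature record to each one."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [
        {
            "string": tok,
            "orthography": orthography_type(tok),
            # Names of every dictionary that contains the token.
            "dictionaries": sorted(
                name for name, entries in DICTIONARIES.items()
                if tok.lower() in entries
            ),
        }
        for tok in tokens
    ]


for feats in token_features("leaving from Paris at 6"):
    print(feats)
```

Rules can then be written as conditions over these records instead of over raw strings.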

Rules to Identify a Single Entity

Patterns followed by entity-recognizing rules:

• An optional pattern to capture the context before the start of an entity

• A pattern matching the tokens in the entity

• An optional pattern to capture the context after the end of the entity

Example: Identify person names of the form “Dr. Yair Weiss”

({DictionaryLookup = Titles} {String = “.”} {Orthography type = capitalized word}{2}) -> Person names
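One way to read the person-name rule is as a four-token window slid over the text. The sketch below assumes a tiny `TITLES` dictionary and a naive tokenizer, both invented for this example; it is not how GATE or any particular system implements the rule.

```python
import re

TITLES = {"Dr", "Mr", "Mrs", "Ms", "Prof"}  # hypothetical Titles dictionary


def tokenize(text):
    """Split into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def is_capitalized(token):
    return token.isalpha() and token[0].isupper()


def find_person_names(text):
    """Apply: ({DictionaryLookup = Titles} {String = "."}
    {Orthography type = capitalized word}{2}) -> Person names."""
    tokens = tokenize(text)
    names = []
    i = 0
    while i + 3 < len(tokens):
        title, dot, first, last = tokens[i:i + 4]
        if (title in TITLES and dot == "."
                and is_capitalized(first) and is_capitalized(last)):
            names.append(f"{title}. {first} {last}")
            i += 4  # skip past the matched entity
        else:
            i += 1
    return names


print(find_person_names("We met Dr. Yair Weiss in Paris."))
```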

Rules to Identify a Single Entity

Examples:

Rules for identifying company names in GATE, a popular entity recognition system

Rules to Mark Entity Boundaries

• Entity boundaries are useful to mark long entities

• Separate rules are defined to mark the start/end of an entity boundary

• Each rule leads to the insertion of an SGML tag in the text

Example: Insert <journal> tag to mark the start of a journal name in a citation record

({String = “to”} {String = “appear”} {String = “in”}):jstart ({Orthography type = Capitalized word}{2-5}) -> insert <journal> after :jstart
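The boundary rule above can be sketched as a scan for the trigger phrase followed by a check on the right-hand context. The function name `insert_journal_start`, the whitespace tokenization, and the sample citation are assumptions made for illustration.

```python
TRIGGER = ["to", "appear", "in"]


def is_capitalized(token):
    return token.isalpha() and token[0].isupper()


def insert_journal_start(text):
    """Insert <journal> after "to appear in" when 2 to 5 capitalized
    words follow, mirroring the {2-5} repetition in the rule."""
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        if [t.lower() for t in tokens[i:i + 3]] == TRIGGER:
            # Count the run of capitalized words in the next 5 tokens,
            # ignoring trailing punctuation such as "Journal,".
            run = 0
            for tok in tokens[i + 3:i + 8]:
                if not is_capitalized(tok.strip(".,;:")):
                    break
                run += 1
            out.extend(tokens[i:i + 3])
            if run >= 2:
                out.append("<journal>")
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


# Hypothetical citation record, used only to exercise the rule.
print(insert_journal_start("Smith, J. To appear in Machine Learning Journal, 2005."))
```

A separate rule of the same shape would insert the closing </journal> tag.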

Rules for Multiple Entities

Rules for multiple-entity recognition:

• Regular expression with multiple slots, each representing a different entity, to simultaneously identify more than one entity

• Useful to extract information from structured records: medical records, classified ads, etc.

Example: Extract the number of bedrooms and rent from an apt. rental ad

({Orthography type = Digit}):Bedroom ({String = “BR”}) ({})* ({String = “$”}) ({Orthography type = Number}):Price -> Number of Bedrooms =: Bedroom, Rent =: Price
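A multi-slot rule maps naturally onto a regular expression with named groups, one group per entity. The sketch below works on characters rather than tokens, which is a simplification of the rule above; `RENTAL_RE`, `extract_rental`, and the sample ad are assumptions for this example.

```python
import re

RENTAL_RE = re.compile(
    r"(?P<bedroom>\d)\s*BR"      # ({Orthography type = Digit}):Bedroom ({String = "BR"})
    r".*?"                       # ({})* : any material in between
    r"\$\s*(?P<price>\d[\d,]*)"  # ({String = "$"}) ({Orthography type = Number}):Price
)


def extract_rental(ad):
    """Fill both slots at once, or return None if the pattern does not match."""
    m = RENTAL_RE.search(ad)
    if m is None:
        return None
    return {
        "Number of Bedrooms": int(m.group("bedroom")),
        "Rent": int(m.group("price").replace(",", "")),
    }


# Hypothetical classified ad.
print(extract_rental("Sunny 2 BR apt near campus, $1,200 per month"))
```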

Organizing Collections of Rules

Rule-based systems consist of very large collections of rules

Problem: Spans demarcated by different rules overlap, leading to conflicting actions

Solution: A component that organizes rules and controls the order in which they are applied, to eliminate or resolve conflicts

How: Use of heuristics and special-case handling, since rule management is a nonstandardized and custom-tuned part of a rule-based system

Resolving Rule Conflicts

Use of special/custom policies

Sample policies:

Prefer rules that mark larger spans of text as an entity type

Merge spans of text that overlap, only if the action portion of the two applied rules is the same

Popular strategy since it allows flexibility in defining rules
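The two sample policies can be combined in a small post-processing step over the spans that the rules produced. This is a sketch under assumptions: spans are `(start, end, label)` tuples with an exclusive end, the label stands in for the rule's action, and the order in which the policies run is a choice made here, not something the slides prescribe.

```python
def resolve_conflicts(spans):
    """Resolve overlaps among (start, end, label) spans, end exclusive."""
    # Policy: merge overlapping spans only if their actions (labels) agree.
    merged = []
    for start, end, label in sorted(spans):
        for idx, (ms, me, ml) in enumerate(merged):
            if ml == label and start < me and ms < end:
                merged[idx] = (min(ms, start), max(me, end), label)
                break
        else:
            merged.append((start, end, label))

    # Policy: among the overlaps that remain, prefer the larger span.
    kept = []
    for span in sorted(merged, key=lambda s: s[1] - s[0], reverse=True):
        if all(span[1] <= k[0] or k[1] <= span[0] for k in kept):
            kept.append(span)
    return sorted(kept)


# Hypothetical rule output over token positions.
spans = [(0, 2, "Person"), (1, 4, "Person"), (3, 6, "Organization"), (8, 9, "Location")]
print(resolve_conflicts(spans))
```

Here the two Person spans merge into one, and the shorter Organization span that still overlaps it is discarded.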


Arrange Rules as an Ordered Set

• Prioritize the order on all the rules and favor the one with higher priority

• Priority of a rule is fixed by some function of the precision and coverage of the rule on the training data

“It is an open question whether a good rule-based theory should consist of rules that cover many examples at the expense of a certain number of misclassifications or whether one should prefer rules that cover only few examples, but appear to be more precise” (Fürnkranz, 2003)


Based on Complete Order

A later rule can be defined on actions of earlier rules

Example: Insert an end tag on the results of an earlier rule used for inserting a start tag

Since R1 has precedence, R2 can assume that <journal> can be used as part of the rule


Finite State Machines

• A full automaton is explicitly defined to control the exact sequence in which rules are applied

• Nodes (entities) are connected via directed edges

• Each edge is associated with a rule on the input tokens that must be satisfied for the edge to be taken

• Each rule corresponds to a path in the FST

• There is no ambiguity about the order in which rules are applied, as long as there is a unique path from the start to the sink state for each sequence of tokens

How Rules are Created

A typical entity extraction system depends on a finely tuned set of rules

Rules manually coded by domain experts

Rules automatically learnt from labeled examples of entities in unstructured text

Rule Learning Algorithm

Create a set of rules R1, R2, … Rk such that the action of each rule either:

• Identifies a single entity

• Marks entity boundaries

• Identifies multiple entities

From a training set consisting of an unstructured set of documents where all the occurrences of entities are marked correctly


Goal

Cover all the segments that contain an annotation by 1 or more rules

Ensure that the precision of each rule is high

Coverage of a rule R: S(R) denotes the data segments matched by R in the training documents, and coverage is the fraction of all segments that S(R) represents

Precision of R is the ratio |S’(R)| / |S(R)|, where S’(R) is the subset of segments in S(R) for which the action specified by R is correct

The overall set of rules must have good recall and precision on new documents
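Coverage and precision as defined above can be computed directly from labeled segments. In this sketch a rule is modeled as a function that returns a label or `None`; that representation, the toy rule, and the four labeled segments are assumptions made for illustration.

```python
def coverage_and_precision(rule, segments):
    """rule: callable mapping segment text to a predicted label or None.
    segments: list of (text, gold_label) pairs from the training documents."""
    # S(R): segments on which the rule fires.
    matched = [(text, gold) for text, gold in segments if rule(text) is not None]
    # S'(R): matched segments on which the rule's action is correct.
    correct = [(text, gold) for text, gold in matched if rule(text) == gold]
    coverage = len(matched) / len(segments) if segments else 0.0
    precision = len(correct) / len(matched) if matched else 0.0
    return coverage, precision


def title_rule(text):
    """Toy rule: anything starting with "Dr." is a Person."""
    return "Person" if text.startswith("Dr.") else None


# Hypothetical labeled segments.
segments = [
    ("Dr. Yair Weiss", "Person"),
    ("Dr. Pepper", "Product"),
    ("Paris", "Location"),
    ("Robert Callahan", "Person"),
]
print(coverage_and_precision(title_rule, segments))
```

The toy rule fires on two of the four segments and is right on one of them, which shows how a rule can gain coverage while losing precision.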


Generalizability of learnt rules is required

• Define the smallest set of rules that covers the maximum number of training cases with high precision

• Finding the optimal size for a rule set is intractable

• Rule-learning algorithms follow a greedy hill-climbing strategy, learning one rule at a time


The main challenge is to create a new rule that achieves high overall coverage and has high precision

This can be achieved using heuristics or existing strategies, classified as:

• Bottom-up approach: Specific Rule -> Generalized Rule

• Top-down approach: General Rule -> Specialized Rule

Bottom-Up

Bottom-up rule formation

• Start with a rule with minimal coverage but 100% precision

• Gradually make the rule more general to increase coverage, even if some precision is lost

Example: Rule learning using (LP)2 for each tag type

Bottom-Up

Creation of a seed rule

Example:

Seed rule to insert tag T=<PER> before a position pstart

({String =“According”} {String =“to”}):pstart { String=“Robert”} {String =“Callahan”} -> insert <PER> at :pstart

Bottom-Up

Generalizing seed rules

Example: Replace a token by a more general feature, or drop it

Seed rule:

({String = “According”} {String = “to”}):pstart {String = “Robert”} {String = “Callahan”} -> insert <PER> at :pstart

Generalizations:

({String = “According”} {String = “to”}):pstart {Orthography type = “Capitalized word”} {Orthography type = “Capitalized word”} -> insert <PER> after :pstart

({DictionaryLookup = Person}):pstart ({DictionaryLookup = Person}) -> insert <PER> before :pstart
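The generalization step can be sketched as an enumeration of every rule obtained by relaxing one constraint of the seed. The list-of-constraints representation, the `generalize` and `matches` helpers, and the test name "Maria Lopez" are assumptions for this example; (LP)2 itself uses a richer feature set and search procedure.

```python
def orthography_type(token):
    if token.isalpha() and token[0].isupper():
        return "Capitalized word"
    if token.isalpha() and token.islower():
        return "Lowercase word"
    return "Other"


def generalize(rule):
    """Yield every rule obtained by relaxing exactly one String constraint."""
    for i, (feature, value) in enumerate(rule):
        if feature != "String":
            continue
        # Replace the literal string with its orthography type.
        yield rule[:i] + [("Orthography type", orthography_type(value))] + rule[i + 1:]
        # Drop the constraint entirely, so any token is accepted here.
        yield rule[:i] + [("Any", None)] + rule[i + 1:]


def matches(rule, tokens):
    """Check a token sequence against a rule of the same length."""
    if len(rule) != len(tokens):
        return False
    for (feature, value), tok in zip(rule, tokens):
        if feature == "String" and tok != value:
            return False
        if feature == "Orthography type" and orthography_type(tok) != value:
            return False
    return True


seed = [("String", "According"), ("String", "to"),
        ("String", "Robert"), ("String", "Callahan")]
candidates = list(generalize(seed))
print(len(candidates), "one-step generalizations")

# Two relaxation steps on the name tokens give the second rule shown above.
generalized = seed[:2] + [("Orthography type", "Capitalized word")] * 2
print(matches(seed, ["According", "to", "Maria", "Lopez"]))
print(matches(generalized, ["According", "to", "Maria", "Lopez"]))
```

The seed matches only the exact name it was built from, while the generalized rule also fires on unseen capitalized names, at the risk of matching non-persons.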

Bottom-Up

Generalizations retained starting from a single seed rule

• Top-K rules are selected sequentially in decreasing order of precision over uncovered instances

• (LP)2 also considers a number of measures of rule quality, such as:

  • Precision

  • Overall coverage

  • Coverage of instances not covered by other rules
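Sequential selection by precision over uncovered instances can be sketched as a greedy loop. This is loosely inspired by the description above and is not the actual (LP)2 procedure; the data layout (each candidate maps to the instance ids it matches and the ids it gets right), the tie-breaking by name, and the rule names R1 to R3 are all assumptions.

```python
def select_top_k(candidates, k):
    """candidates: dict rule_name -> (matched_ids, correct_ids), both sets.
    Returns up to k rule names, chosen greedily."""
    selected, covered = [], set()
    remaining = dict(candidates)

    def precision_on_uncovered(name):
        matched, correct = remaining[name]
        new_matched = matched - covered
        if not new_matched:
            return 0.0
        return len(correct - covered) / len(new_matched)

    while remaining and len(selected) < k:
        # Ties are broken alphabetically because max keeps the first maximum.
        best = max(sorted(remaining), key=precision_on_uncovered)
        if precision_on_uncovered(best) == 0.0:
            break  # no remaining rule adds correct, uncovered instances
        selected.append(best)
        covered |= remaining.pop(best)[0]
    return selected


# Hypothetical candidate generalizations of one seed rule.
candidates = {
    "R1": ({1, 2, 3}, {1, 2, 3}),
    "R2": ({1, 2, 3, 4, 5}, {1, 2, 3, 4}),
    "R3": ({4, 6}, {4, 6}),
}
print(select_top_k(candidates, 2))
```

Once R1 is chosen, R2 is judged only on instances 4 and 5, so the more precise R3 is preferred even though R2 has higher overall coverage.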

Top-Down

Top-down rule formation

• Start with a rule that covers all possible instances, i.e., 100% coverage and poor precision

• Specialize the rule to increase precision

• Select the set of the k most precise rules

Example: Rule learning using (LP)2 for each tag type

User-provided threshold for the coverage of each rule


Problem: Due to the limited availability of labeled data, purely automated data-driven methods for rule induction are not adequate

Labeled Data

Rules

Solution: Hybrid of automated and manual methods to improve rule-based systems

Summary

Rule-based methods for entity extraction

Conflicts need to be resolved

“I’m booked on the train leaving from Paris at 6 hours 31”

Rule: Location(Token string = “from”)({DictionaryLookup=location}):loc -> Location=:loc

How sets of rules are created
