Information Extraction with Tree Automata Induction Raymond Kosala 1, Jan Van den Bussche 2, Maurice Bruynooghe 1, Hendrik Blockeel 1 1 Katholieke Universiteit.

Information Extraction with Tree Automata Induction

Raymond Kosala1, Jan Van den Bussche2,

Maurice Bruynooghe1, Hendrik Blockeel1

1 Katholieke Universiteit Leuven, Belgium2 University of Limburg, Belgium

2

Outline

• Introduction: – information extraction (IE) – grammatical inference

• Approach

• k-testable and g-testable algorithms

• Preliminary result

• Further work

3

IE from unstructured documents

• Extract certain fields of interest from a text

• Learner is trained with (positive) examples

• Each learner focuses on a single field marked with ‘x’

Requirement 2

Requirement 1

Job Title

Company Name

4

Grammatical inference

: finite alphabet• Regular language L *• Given: set of examples (pos. or neg.)• Task: infer a DFA compatible with examples• Quality criterion:

– Exact learning in the limit– PAC– etc.

• Large body of work

5

IE with grammatical inference

• Mark field ‘x’ as special token

• Infer DFA for the language L =

{S over ( x)* | the field to be extracted is marked by x}

• Only positive examples

6

IE from structured documents

• Previous works learn string language

• XML or HTML data: tree structured

• Natural extension is to learn a tree language

“x has a b-brother”

• Extraction of a field can depend on structural context

a

a

b

b x a c

c c c c

x

a

b

b a a a

c c c c c c

7

Learning process

< > …..……….…… . < >

Structured document

Parsed and annotated

Transformed

Tree automatalearner

8

Testing process

< > …..……….…… . < >

Structured document

Parsed Transformed

With each text nodes replaced, run the tree automaton

Output

9

Page examples

10

Why do we need the context?

• Not enough to differentiate the fields of interest that depend on the structural context.

• Can be chosen automatically.

|-- tr |-- td |-- td |-- lastupdate (CDATA) |-- td |-- b |-- 12/4/98 |-- td |-- tr |-- td |-- tr |-- td |-- td |-- organization (CDATA) |-- td |-- b |-- ABC |-- td |-- tr |-- td

11

Tree automata

• Ranked alphabet : finite set of function symbols with arities.

E.g. = {a(2), b(2), c(0)}

• Tree ground term over

a(c,a(b(c,c),b(c,c))) : tree with depth 3

• Tree automaton: M = (, Q, , F). is a set of transitions of the form: v(q1, …, qn) q

Where v , n is the arity of v , qi and q Q

c

a

a

b b

c c c c

12

Example

• Given an automaton M with the following transitions:

1 : c q0

2 : a(q0 , q0) q0

3 : b(q0 , q0) accept

• M accepts a tree t t has b-node as the root

a

b

a

a c

c c

c c

c

b

a

c c

a

a

c

c c

13

Unranked trees

• XML/HTML: bib

… paper report book paper …

• The number of children is not fixed by the label• Two approaches:

1. Generalize notion of tree automaton to unranked trees. L transition rules:

v(e) q , where e is regular expression over Q

2. Encode as ranked tree

14

Encoding of unranked trees

• There are well-known methods of encoding to binary trees, we use:– encode(T) = encodef(T)

v if F1 = F2 = vright(encodef(F2)) if F1 = , F2

– encodef(v(F1), F2) = vleft(encodef(F1)) if F1 , F2 = v(encodef(F1), encodef(F2)) otherwise

Where: T := v(F), v F :=

F := T, F

• Example:

becomes |

b_right

a_left

c

a d

b

a

dc

a

15

k-testable tree languages

• Languages in which membership can be checked by just looking at subtrees of length k-1 that appear in the tree.

• k-roots:

• k-forks:

• k-subtrees:

1))...((Ø

))...((1

11

1

{)())...((

kttvdepthif

otherwisettvr

m

jjkmk

m

mk

tfttvf

1))...((Ø

)...(1

11

1

{)())...((

kttvdepthif

otherwisettv

m

jjkmk

m

m

tsttvs

1))()...((1 111

{))...((

kifvotherwisetrtrvmk mkk

ttvr

16

k-testable tree languages (cont.)

• An example t: html head body title h1 table

f2(t): html head body

head body title h1 table

r2(t): html head body

s2(t): body head table title h1 h1 table title

17

k-testable algorithm [Rico-Juan, et al.]

Given: a set of positive examples T, a positive integer k

Q, FS, = Ø

For each t T,– Let R = rk-1(t); F = fk(t); S = sk-1(t);

– Q = Q R rk-1( F ) S ;

– FS = FS R; v(t1, …,tm) S : = {m(v(t1, …,tm)) = v(t1, …,tm)}

v(t1, …,tm) F : = {m(v(t1, …,tm)) = rk-1(v(t1, …,tm))}

18

g-testable algorithm

• Idea: generalize state transitions from forks that are not important for the extraction.

• Important forks are those that contain ‘x’ and (possibly) the distinguishing context.

a

b c

d e

* *

* *

a

b c

* *

at: gen(t,1): gen(t,2):

19

g-testable algorithmGiven: a set of positive examples T, positive integer k and l FS, , Ss, CF, OF, OF’ = ØFor each t T,

– Let R = rk-1(t); F = fk(t); S = sk-1(t);– Ss = Ss S– FS = FS R v(t1, …,tm) S : = {m(v(t1, …,tm)) = v(t1, …,tm)} – CF = CF {f | f F, f contains x}– OF = OF {f | f F, f does not contain x}

For each of OF,– of’ = gen(of,l)– If of’ covers one of CF then

OF’ = OF’ of else OF’ = OF’ of’

Let F’ = OF’ CF

Q = Q F rk-1( F’ ) Ss

v(t1, …,tm) F’ : = {m(v(t1, …,tm)) = rk-1(v(t1, …,tm))}

20

g-testable algorithm example• A set of examples T (with k = 2 and l = 1): html html head body head body title h1 x table title h1 x • F = f2(T):

html head body body head body title h1 x h1 x table

• FS = {html}• OF = {html(head, body), head(title)} • CF = {body(h1, x), body(h1, x, table)}• OF’ = {html(*, *), head(*)}

• R = r1(T): html

• Ss = s1(T): table , title , h1, x

21

g-testable algorithm example (cont.)

• F’ = {html(*, *), head(*), body(h1, x), body(h1, x, table)}• Transitions from the trees in the subtrees s1(T):

(table) = table (title) = title (h1) = h1 (x) = x

• Transitions from the trees in the generalized forks F’: (html(*, *)) = html (head(*)) = head (body(h1, x)) = body (body(h1, x, table)) = body

• Q = {html ; head ; body ; table ; title ; h1; x}

22

Experiment

• Two benchmark datasets: Internet Address Finder (IAF) and Quote Server (QS).

• Comparison with: HMM, Stalker, and BWI.• The highlights of our method:

– More expressive.– Doesn’t require:

• manual specifications of windows length of the prefix and suffix of the target field (HMM and BWI)

• special tokens of the delimiters such as “:” “>” (Stalker and BWI) • embedded catalog tree (Stalker)

• The limitations:– The field that can be extracted limited to whole node– Slower when extracting

23

Experiment results

The results in % are:

Dataset: IAF-altname IAF-org QS-date QS-vol Shakespeare Prec Rec F1 Prec Rec F1 Prec Rec F1 Prec Rec F1 Prec Rec F1--------------------------------------------------------------------------------------------------------------------------HMM 1.7 90 3.4 16.8 89.7 28.4 36.3 100 53.3 18.4 96.2 30.9 Stalker 100 - - 48.0 - - 0 - - 0 - -BWI 90.9 43.5 58.8 77.5 45.9 57.7 100 100 100 100 61.9 76.5k-testable 100 73.9 85 100 57.9 73.3 100 60.5 75.4 100 73.6 84.8 56.2 90 69.2g-testable 100 73.9 85 100 82.6 90.5 100 60.5 75.4 100 73.6 84.8 69.2 90 78.2Parameters 4 and (5,2) 2 and (3,2) 2 and (3,2) 5 and (6,5) 3 and (4,2)

* F1 is the harmonic mean of recall and precision * The results of HMM, Stalker and BWI are adopted from [Freitag & Kushmerick]

24

Further work

• More generalization while using bigger context is achieved, but sometimes the binarisation makes the context far from the field of interest the generalization cannot go very far

• Work on the algorithm that can work directly with unranked trees.

Information Extraction with Tree Automata Induction Raymond Kosala 1, Jan Van den Bussche 2, Maurice Bruynooghe 1, Hendrik Blockeel 1 1 Katholieke Universiteit.

Documents

belgium slide

positive examples stalker

target field

single field

large body of work slide

x requirement

freitag kushmerick slide

bwi special tokens