Information Extraction with Tree Automata Induction Raymond Kosala 1 , Jan Van den Bussche 2 , Maurice Bruynooghe 1 , Hendrik Blockeel 1 1 Katholieke Universiteit Leuven, Belgium 2 University of Limburg, Belgium
Dec 15, 2015
Information Extraction with Tree Automata Induction
Raymond Kosala1, Jan Van den Bussche2,
Maurice Bruynooghe1, Hendrik Blockeel1
1 Katholieke Universiteit Leuven, Belgium2 University of Limburg, Belgium
2
Outline
• Introduction: – information extraction (IE) – grammatical inference
• Approach
• k-testable and g-testable algorithms
• Preliminary result
• Further work
3
IE from unstructured documents
• Extract certain fields of interest from a text
• Learner is trained with (positive) examples
• Each learner focuses on a single field marked with ‘x’
Requirement 2
Requirement 1
Job Title
Company Name
4
Grammatical inference
: finite alphabet• Regular language L *• Given: set of examples (pos. or neg.)• Task: infer a DFA compatible with examples• Quality criterion:
– Exact learning in the limit– PAC– etc.
• Large body of work
5
IE with grammatical inference
• Mark field ‘x’ as special token
• Infer DFA for the language L =
{S over ( x)* | the field to be extracted is marked by x}
• Only positive examples
6
IE from structured documents
• Previous works learn string language
• XML or HTML data: tree structured
• Natural extension is to learn a tree language
“x has a b-brother”
• Extraction of a field can depend on structural context
a
a
b
b x a c
c c c c
x
a
b
b a a a
c c c c c c
7
Learning process
< > …..……….…… . < >
Structured document
Parsed and annotated
Transformed
Tree automatalearner
8
Testing process
< > …..……….…… . < >
Structured document
Parsed Transformed
With each text nodes replaced, run the tree automaton
Output
10
Why do we need the context?
• Not enough to differentiate the fields of interest that depend on the structural context.
• Can be chosen automatically.
|-- tr |-- td |-- td |-- lastupdate (CDATA) |-- td |-- b |-- 12/4/98 |-- td |-- tr |-- td |-- tr |-- td |-- td |-- organization (CDATA) |-- td |-- b |-- ABC |-- td |-- tr |-- td
11
Tree automata
• Ranked alphabet : finite set of function symbols with arities.
E.g. = {a(2), b(2), c(0)}
• Tree ground term over
a(c,a(b(c,c),b(c,c))) : tree with depth 3
• Tree automaton: M = (, Q, , F). is a set of transitions of the form: v(q1, …, qn) q
Where v , n is the arity of v , qi and q Q
c
a
a
b b
c c c c
12
Example
• Given an automaton M with the following transitions:
1 : c q0
2 : a(q0 , q0) q0
3 : b(q0 , q0) accept
• M accepts a tree t t has b-node as the root
a
b
a
a c
c c
c c
c
b
a
c c
a
a
c
c c
13
Unranked trees
• XML/HTML: bib
… paper report book paper …
• The number of children is not fixed by the label• Two approaches:
1. Generalize notion of tree automaton to unranked trees. L transition rules:
v(e) q , where e is regular expression over Q
2. Encode as ranked tree
14
Encoding of unranked trees
• There are well-known methods of encoding to binary trees, we use:– encode(T) = encodef(T)
v if F1 = F2 = vright(encodef(F2)) if F1 = , F2
– encodef(v(F1), F2) = vleft(encodef(F1)) if F1 , F2 = v(encodef(F1), encodef(F2)) otherwise
Where: T := v(F), v F :=
F := T, F
• Example:
becomes |
b_right
a_left
c
a d
b
a
dc
a
15
k-testable tree languages
• Languages in which membership can be checked by just looking at subtrees of length k-1 that appear in the tree.
• k-roots:
• k-forks:
• k-subtrees:
1))...((Ø
))...((1
11
1
{)())...((
kttvdepthif
otherwisettvr
m
jjkmk
m
mk
tfttvf
1))...((Ø
)...(1
11
1
{)())...((
kttvdepthif
otherwisettv
m
jjkmk
m
m
tsttvs
1))()...((1 111
{))...((
kifvotherwisetrtrvmk mkk
ttvr
16
k-testable tree languages (cont.)
• An example t: html head body title h1 table
f2(t): html head body
head body title h1 table
r2(t): html head body
s2(t): body head table title h1 h1 table title
17
k-testable algorithm [Rico-Juan, et al.]
Given: a set of positive examples T, a positive integer k
Q, FS, = Ø
For each t T,– Let R = rk-1(t); F = fk(t); S = sk-1(t);
– Q = Q R rk-1( F ) S ;
– FS = FS R; v(t1, …,tm) S : = {m(v(t1, …,tm)) = v(t1, …,tm)}
v(t1, …,tm) F : = {m(v(t1, …,tm)) = rk-1(v(t1, …,tm))}
18
g-testable algorithm
• Idea: generalize state transitions from forks that are not important for the extraction.
• Important forks are those that contain ‘x’ and (possibly) the distinguishing context.
a
b c
d e
* *
* *
a
b c
* *
at: gen(t,1): gen(t,2):
19
g-testable algorithmGiven: a set of positive examples T, positive integer k and l FS, , Ss, CF, OF, OF’ = ØFor each t T,
– Let R = rk-1(t); F = fk(t); S = sk-1(t);– Ss = Ss S– FS = FS R v(t1, …,tm) S : = {m(v(t1, …,tm)) = v(t1, …,tm)} – CF = CF {f | f F, f contains x}– OF = OF {f | f F, f does not contain x}
For each of OF,– of’ = gen(of,l)– If of’ covers one of CF then
OF’ = OF’ of else OF’ = OF’ of’
Let F’ = OF’ CF
Q = Q F rk-1( F’ ) Ss
v(t1, …,tm) F’ : = {m(v(t1, …,tm)) = rk-1(v(t1, …,tm))}
20
g-testable algorithm example• A set of examples T (with k = 2 and l = 1): html html head body head body title h1 x table title h1 x • F = f2(T):
html head body body head body title h1 x h1 x table
• FS = {html}• OF = {html(head, body), head(title)} • CF = {body(h1, x), body(h1, x, table)}• OF’ = {html(*, *), head(*)}
• R = r1(T): html
• Ss = s1(T): table , title , h1, x
21
g-testable algorithm example (cont.)
• F’ = {html(*, *), head(*), body(h1, x), body(h1, x, table)}• Transitions from the trees in the subtrees s1(T):
(table) = table (title) = title (h1) = h1 (x) = x
• Transitions from the trees in the generalized forks F’: (html(*, *)) = html (head(*)) = head (body(h1, x)) = body (body(h1, x, table)) = body
• Q = {html ; head ; body ; table ; title ; h1; x}
22
Experiment
• Two benchmark datasets: Internet Address Finder (IAF) and Quote Server (QS).
• Comparison with: HMM, Stalker, and BWI.• The highlights of our method:
– More expressive.– Doesn’t require:
• manual specifications of windows length of the prefix and suffix of the target field (HMM and BWI)
• special tokens of the delimiters such as “:” “>” (Stalker and BWI) • embedded catalog tree (Stalker)
• The limitations:– The field that can be extracted limited to whole node– Slower when extracting
23
Experiment results
The results in % are:
Dataset: IAF-altname IAF-org QS-date QS-vol Shakespeare Prec Rec F1 Prec Rec F1 Prec Rec F1 Prec Rec F1 Prec Rec F1--------------------------------------------------------------------------------------------------------------------------HMM 1.7 90 3.4 16.8 89.7 28.4 36.3 100 53.3 18.4 96.2 30.9 Stalker 100 - - 48.0 - - 0 - - 0 - -BWI 90.9 43.5 58.8 77.5 45.9 57.7 100 100 100 100 61.9 76.5k-testable 100 73.9 85 100 57.9 73.3 100 60.5 75.4 100 73.6 84.8 56.2 90 69.2g-testable 100 73.9 85 100 82.6 90.5 100 60.5 75.4 100 73.6 84.8 69.2 90 78.2Parameters 4 and (5,2) 2 and (3,2) 2 and (3,2) 5 and (6,5) 3 and (4,2)
* F1 is the harmonic mean of recall and precision * The results of HMM, Stalker and BWI are adopted from [Freitag & Kushmerick]