Another approach to Information Extraction using Extended Ontologies (Marek Nekvasil, xnekm06@vse.cz)

Dec 27, 2015

Page 1: Another approach to Information Extraction Marek Nekvasil xnekm06@vse.cz using Extended Ontologies.

Another approach to Information Extraction using Extended Ontologies

Marek Nekvasil
xnekm06@vse.cz

agenda

gathering information with wrappers
ways to build a wrapper
using and extending an ontology
templates and patterns
suggesting a simple wrapper induction method

wrapping up a document

wrapping up a document is synonymous with identifying the relevant information in it

there are many ways to wrap a document up

wrapper classes

string-based wrappers: Kushmerick's wrapper classes
tree-based wrappers: XPath, Elog
finite automata
methods comparison

LR class

basic class (LR stands for Left-Right)
2n parameters (2 for every part of the extracted tuple)
example:

<HTML>
<TITLE>Ceny pobytů</TITLE>
<BODY>
<B>Řecko - Lefkada</B> <I>16 299 Kč</I><BR>
<B>Mallorca - Santa Ponsa</B> <I>21 100 Kč</I><BR>
<B>Egypt - Sharm El Sheikh</B> <I>18 500 Kč</I><BR>
<B>Egypt - Ghiza</B> <I>19 049 Kč</I><BR>
</BODY>
</HTML>

suitable wrapper: LR(<B>; </B>; <I>; </I>)
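The LR wrapper above can be sketched in a few lines of Python. This is only an illustration of the idea, not Kushmerick's implementation; the function name `lr_extract` and the representation of the wrapper as a list of (left, right) delimiter pairs are assumptions made for this example.

```python
def lr_extract(page, delimiters):
    """Apply an LR wrapper: delimiters is a list of (left, right) string
    pairs, one pair per attribute of the extracted tuple."""
    rows = []
    pos = 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows          # no further tuple begins here
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows          # unterminated value, stop scanning
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))


PAGE = (
    "<HTML> <TITLE>Ceny pobytů</TITLE> <BODY> "
    "<B>Řecko - Lefkada</B> <I>16 299 Kč</I><BR> "
    "<B>Mallorca - Santa Ponsa</B> <I>21 100 Kč</I><BR> "
    "<B>Egypt - Sharm El Sheikh</B> <I>18 500 Kč</I><BR> "
    "<B>Egypt - Ghiza</B> <I>19 049 Kč</I><BR> </BODY></HTML>"
)

# LR(<B>; </B>; <I>; </I>) from the slide
rows = lr_extract(PAGE, [("<B>", "</B>"), ("<I>", "</I>")])
```

Each pair contributes two parameters, which is where the 2n in the class definition comes from.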

other classes derived from LR

Nicholas Kushmerick's classes:
HLRT (Head-Left-Right-Tail)
OCLR (Opening-Closing-Left-Right)
HOCLRT (…)
N-LR or N-HLRT (Nested-…)
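As a rough illustration of what the Head and Tail delimiters add, the sketch below restricts the LR scan to the region between a head string and a tail string. It is a simplification written for this example (the exact semantics of HLRT in Kushmerick's work are more detailed); the function name, the sample page and its delimiters are all hypothetical.

```python
def hlrt_extract(page, head, tail, delimiters):
    """Simplified HLRT: skip everything up to `head`, ignore everything
    from `tail` on, and run a plain LR scan on what lies in between."""
    start = page.find(head)
    if start == -1:
        return []
    start += len(head)
    end = page.find(tail, start)
    body = page[start:] if end == -1 else page[start:end]

    rows, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            i = body.find(left, pos)
            if i == -1:
                return rows
            i += len(left)
            j = body.find(right, i)
            if j == -1:
                return rows
            row.append(body[i:j])
            pos = j + len(right)
        rows.append(tuple(row))


# Hypothetical page: the <B> tags in the header and footer would confuse
# a plain LR wrapper, so <P> and <HR> are used as head and tail.
PAGE = ("<HTML><BODY><B>Offers</B><P>"
        "<B>Egypt - Ghiza</B> <I>19 049 Kč</I><BR>"
        "<HR><B>End</B></BODY></HTML>")

rows = hlrt_extract(PAGE, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")])
```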

XPath wrappers

using XPath queries to identify data in the tree representation of a document

often using just the very basic features of the XPath language

usually building queries from the root of a document
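A minimal sketch of such a wrapper, using only the limited XPath subset supported by Python's standard `xml.etree.ElementTree`. The page is a shortened, well-formed (XHTML-like) variant of the price list from the LR slide, which is an assumption of this example, since ElementTree cannot parse arbitrary HTML.

```python
import xml.etree.ElementTree as ET

PAGE = """<html><head><title>Ceny pobytů</title></head><body>
<b>Řecko - Lefkada</b> <i>16 299 Kč</i><br/>
<b>Egypt - Ghiza</b> <i>19 049 Kč</i><br/>
</body></html>"""

root = ET.fromstring(PAGE)

# Both queries start at the document root and use only child steps,
# i.e. the very basic features of XPath.
names = [b.text for b in root.findall("./body/b")]
prices = [i.text for i in root.findall("./body/i")]

rows = list(zip(names, prices))
```

Pairing the two result lists by position works here only because every name has exactly one price; a real wrapper would need a more careful alignment.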

Elog

declarative language similar to Prolog
uses predicates to generate instances
used in the Lixto tool
example of an Elog wrapper (figure)

finite automata

finite-state machines (FSMs) can be used for wrapping in various ways

they are usually used for searching in the linear representation of a document

Carme shows it is possible to use FSMs for searching in the tree structure
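To make the linear-representation case concrete, here is a small hand-written automaton that walks over the token stream of a page. The state names and the tokenizer are assumptions of this sketch; it is not taken from any of the cited tools.

```python
import re

def fsm_extract(page):
    """Four-state automaton over a stream of tags and text chunks."""
    tokens = re.findall(r"<[^>]+>|[^<]+", page)
    state = "SEEK_NAME"
    name, rows = None, []
    for tok in tokens:
        is_tag = tok.startswith("<")
        if state == "SEEK_NAME" and tok == "<B>":
            state = "IN_NAME"
        elif state == "IN_NAME" and not is_tag:
            name, state = tok.strip(), "SEEK_PRICE"
        elif state == "SEEK_PRICE" and tok == "<I>":
            state = "IN_PRICE"
        elif state == "IN_PRICE" and not is_tag:
            rows.append((name, tok.strip()))
            state = "SEEK_NAME"
        # every other token leaves the state unchanged
    return rows


rows = fsm_extract("<B>Egypt - Ghiza</B> <I>19 049 Kč</I><BR>"
                   "<B>Řecko - Lefkada</B> <I>16 299 Kč</I><BR>")
```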

methods comparison

Tree-based wrappers are more error-prone than linear string-based wrappers

Elog and N-LR allow extraction not only from tabular data structure but also from a general hierarchical data structure

XPath wrappers reuse a well-defined standard

agenda

gathering information with wrappers
ways to build a wrapper
using and extending an ontology
templates and patterns
suggesting a simple wrapper induction method

building a wrapper

by hand
Oracle and PAC analysis
interactive visual pattern design
tree-fragment queries
tree traversal pattern generalization
and many others …

PAC analysis

uses an abstract function called Oracle to gather enough example instances of the extracted class (assuming the Oracle is embodied by a human)

gathers examples until it has enough of them ($N$) to suggest a wrapper class with a designated error $e$ at a given probability level $1 - d$, using the formula:

$N > \frac{2}{e} \log\left(\frac{R}{d}\right)$

finally searches for the first set of wrapper parameters that matches all the examples
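For orientation, the general shape of such a stopping rule can be sketched with the textbook PAC bound for a learner that outputs a hypothesis consistent with all examples from a finite hypothesis class: $N \ge \frac{1}{\varepsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$. This is the generic textbook bound, not necessarily the exact formula on the slide; `hypothesis_count` (the number of candidate wrappers) is a hypothetical parameter of this example.

```python
import math

def pac_sample_size(error, delta, hypothesis_count):
    """Number of examples after which a consistent hypothesis has error
    at most `error` with probability at least 1 - delta (finite class)."""
    if not (0 < error < 1 and 0 < delta < 1) or hypothesis_count < 1:
        raise ValueError("need 0 < error, delta < 1 and at least one hypothesis")
    bound = (math.log(hypothesis_count) + math.log(1.0 / delta)) / error
    return math.ceil(bound)


# e.g. 1000 candidate wrappers, error 0.1, confidence level 0.95
n_examples = pac_sample_size(error=0.1, delta=0.05, hypothesis_count=1000)
```

The Oracle would be queried until this many examples are available; tightening either the error or the probability level raises the required number.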

interactive visual pattern design

used in the Lixto tool to craft wrappers in the Elog language

first the user points out example instances, which yields a generating rule (a pattern)

then the user forms conditions (filters) on the patterns to restrict them, which is done visually

interactive condition building in Lixto

tree-fragment queries

searching for a minimal XPath query that forms a tree-prefix of all examples

tree-prefix examples (figure)

tree traversal pattern generalization

application of graph theory to the generalized document tree

searching for the shortest path through the document tree and thus forming an efficient XPath query

agenda

gathering information with wrappers
ways to build a wrapper
using and extending an ontology
templates and patterns
suggesting a simple wrapper induction method

ontologies and wrappers

an ontology is a knowledge model

we can make a knowledge model that summarizes what information we are going to extract

with a nifty extension we can use the ontology to identify examples of what we are going to extract

these examples can be used to build a wrapper with any method

ontology in OWL

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Ontology rdf:about="">
    <owl:imports rdf:resource="http://www.somedomain.com/x"/>
  </owl:Ontology>
  <owl:Class rdf:ID="class_A">
    <owl:disjointWith rdf:resource="#class_B"/>
  </owl:Class>
  <owl:Class rdf:ID="class_C">
    <rdfs:subClassOf rdf:resource="#class_A"/>
  </owl:Class>
  <owl:DatatypeProperty rdf:ID="property_A">
    <rdfs:domain rdf:resource="#class_A"/>
  </owl:DatatypeProperty>
</rdf:RDF>

extending OWL

in terms of ontologies, we extract values of datatype properties

therefore we need some technique to identify (and rank) possible instances of these values

we suggest a way to define complex templates of typical values of a datatype property

placing a template into the ontology

we establish a new namespace: xmlns:ot="http://st.vse.cz/~XNEKM06/ontologytemplates#"

in the new namespace we use an element <ot:Template> to write a template down

such a template can only be attached to a datatype property:

<owl:DatatypeProperty rdf:ID="property_A">
  <rdfs:domain rdf:resource="#class_B"/>
  <ot:Template ...> ... </ot:Template>
</owl:DatatypeProperty>

agenda

gathering information with wrappers
ways to build a wrapper
using and extending an ontology
templates and patterns
suggesting a simple wrapper induction method

patterns

pattern – a general rule that can be evaluated against any continuous part of a document to determine the degree to which that part matches

template

template – a set of rules that can be evaluated as a whole against any continuous part of a document to determine the degree to which that part matches

a template is a special case of a pattern

thus a template can contain other templates

simple patterns

a pattern has an internal algorithm that can (given some parameters) identify possible matches throughout the document, with a pattern match degree as its output

moreover, we need to infer a degree of evidence certainty, which should express our confidence that the match really is a value the pattern was meant to identify

deriving the degree of evidence certainty 1

let us define two propositions:
A – the pattern algorithm identified a given part of a document
E – the part really should have been identified by that pattern

A and E are logical propositions, and in fuzzy logic their truth value is a real number from the interval $[0, 1]$

deriving the degree of evidence certainty 2

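intuitively there should be a relation $A \Rightarrow E$

thanks to the modus ponens rule we can write in basic logic

$(A \,\&\, (A \Rightarrow E)) \Rightarrow E$

from that we can derive

$\mathrm{val}(E) \ge \mathrm{val}(A \,\&\, (A \Rightarrow E))$

and, not wanting to overestimate the evidence certainty, we set

$\mathrm{val}(E) = \mathrm{val}(A \,\&\, (A \Rightarrow E))$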

deriving the degree of evidence certainty 3

now we introduce a parameter of the pattern: $\mathrm{val}(A \Rightarrow E) = p$

we call it pattern precision

using, for example, Łukasiewicz logic we can derive

$e = \max(0,\ a + p - 1)$

where $e$ stands for $\mathrm{val}(E)$ and $a$ for $\mathrm{val}(A)$
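The step above rests on the standard definition of the strong conjunction in Łukasiewicz logic, $\mathrm{val}(X \,\&\, Y) = \max(0,\ \mathrm{val}(X) + \mathrm{val}(Y) - 1)$. Written out with $a = \mathrm{val}(A)$ and $p = \mathrm{val}(A \Rightarrow E)$, and with illustrative numbers that are not from the slides:

```latex
\begin{align*}
e &= \mathrm{val}\bigl(A \,\&\, (A \Rightarrow E)\bigr) \\
  &= \max\bigl(0,\ \mathrm{val}(A) + \mathrm{val}(A \Rightarrow E) - 1\bigr) \\
  &= \max(0,\ a + p - 1) \\[4pt]
\text{e.g. } a = 0.9,\ p = 0.7:\quad e &= \max(0,\ 0.9 + 0.7 - 1) = 0.6
\end{align*}
```

A full match ($a = 1$) therefore yields $e = p$, which is why $p$ can be read as the precision of the pattern.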

deriving the degree of evidence certainty 4

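without doubt it is true that $(E \,\&\, A) \Rightarrow E$ and $(\neg A \,\&\, E) \Rightarrow E$

while in Łukasiewicz logic we can derive from the above

$(\neg A \,\&\, E) \Leftrightarrow \neg(E \Rightarrow A)$

and therefore $\neg(E \Rightarrow A) \Rightarrow (\neg A \,\&\, E)$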

deriving the degree of evidence certainty 5

when we substitute $\neg(E \Rightarrow A)$ for $(\neg A \,\&\, E)$ we can derive

$\neg(E \Rightarrow A) \Rightarrow E$

and we introduce a second parameter, $\mathrm{val}(E \Rightarrow A) = c$, which we call pattern completeness

deriving the degree of evidence certainty 6

combining the two rules above we can derive an ultimate rule

$\bigl((A \,\&\, (A \Rightarrow E)) \lor \neg(E \Rightarrow A)\bigr) \Rightarrow E$

and, still not wanting to overestimate the evidence certainty, we can write down (in Łukasiewicz logic)

$e = \max(\max(0,\ a + p - 1),\ 1 - c)$

simple patterns summary

a pattern identifies a given place in the document with a pattern match degree denoted as a

every pattern has two parameters: p – precision and c – completeness

the degree of pattern evidence certainty can then be calculated as

$e = \max(a + p - 1,\ 1 - c)$

(the inner maximum with 0 can be dropped because $1 - c \ge 0$)
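The summary formula translates directly into code. The function name and the range check are assumptions of this sketch; the formula itself is the one stated above.

```python
def evidence_certainty(a, p, c):
    """Evidence certainty e = max(a + p - 1, 1 - c) for a pattern with
    match degree a, precision p and completeness c, all in [0, 1]."""
    for name, value in (("a", a), ("p", p), ("c", c)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return max(a + p - 1.0, 1.0 - c)


full_match = evidence_certainty(a=1.0, p=0.95, c=0.8)   # driven by precision
no_match = evidence_certainty(a=0.0, p=0.95, c=0.8)     # driven by completeness
```

With a full match the result is governed by the precision, while with no match at all the certainty never drops below 1 − c.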

composite patterns

to form a template we can combine the fragmentary simple patterns together

computing the evidence certainty is the same as in the case of simple patterns; however, we have to derive the pattern match degree somehow

deriving the composite pattern match degree

joining the evidences of two patterns can be viewed as joining two fuzzy sets

for this we can use either a set union (associated with disjunction) or a set intersection (associated with conjunction)

therefore we compute the composite pattern match degree as the conjunction or disjunction of the evidence certainties of all component patterns

so we get two kinds of templates: conjoint and disjoint

the nature of templates

for the calculations we use the formulae of min-conjunction and max-disjunction

the parameters p and c of component patterns now get a new meaning

in a disjoint template a high value of p means that the pattern forms a sufficient condition

in a conjoint template a high value of c means that the pattern forms a necessary condition
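A small sketch of how a template's certainty could be computed from its components under the min/max formulae above. The function names and the representation of a component as an (a, p, c) triple are assumptions of this example; in particular, the slides do not say what value p or c takes when an attribute is omitted, so the sketch requires both explicitly.

```python
def evidence_certainty(a, p, c):
    # e = max(a + p - 1, 1 - c), as in the simple-pattern summary
    return max(a + p - 1.0, 1.0 - c)

def template_certainty(components, kind, p, c):
    """components: list of (a, p, c) triples of the component patterns;
    kind: 'conjoint' (min-conjunction) or 'disjoint' (max-disjunction)."""
    evidences = [evidence_certainty(*comp) for comp in components]
    if kind == "conjoint":
        match_degree = min(evidences)
    elif kind == "disjoint":
        match_degree = max(evidences)
    else:
        raise ValueError("kind must be 'conjoint' or 'disjoint'")
    # the composite match degree plays the role of a for the template
    return evidence_certainty(match_degree, p, c)


matched = (1.0, 0.7, 1.0)     # component that matched, precision 0.7
unmatched = (0.0, 1.0, 1.0)   # component that did not match, completeness 1

disjoint = template_certainty([matched, unmatched], "disjoint", p=1.0, c=1.0)
conjoint = template_certainty([matched, unmatched], "conjoint", p=1.0, c=1.0)
```

In the disjoint case the single matched component is enough to carry the template, while in the conjoint case the unmatched component with high completeness pulls the result down to zero, which mirrors the sufficient and necessary readings of p and c.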

writing down the templates

we write the template down so that it fits into the ontology as shown before:

<ot:Template ot:p="0.95" ot:c="0.8" ot:type="disjoint">

...

</ot:Template>

the component patterns are written in the form of nested XML tags

a few kinds of patterns

<ot:String ot:p="0.7">Egypt</ot:String>
<ot:Stringlist ot:source="c:\temp\zeme.txt" ot:c="0.62"/>
<ot:Concatenation> ..</..>
<ot:Context ot:side="left" ot:maxdistance="1" ot:c="0.5">..</..>
<ot:Number ot:min="1" ot:max="10"/>
<ot:Distribution ot:type="gauss" ot:mean="10900" ot:variance="9200000"/>
<ot:Regexp> ..</..>
…

example template

<ot:Template ot:type="disjoint" ot:c="0.9">
  <ot:Concatenation>
    <ot:Distribution ot:type="gauss" ot:mean="10900" ot:variance="9200000"/>
    <ot:Stringlist>
      <ot:String ot:case="any">kc</ot:String>
      <ot:String ot:case="any">kč</ot:String>
      <ot:String ot:case="same">,-</ot:String>
    </ot:Stringlist>
  </ot:Concatenation>
  <ot:Context ot:side="left" ot:maxdistance="2" ot:p="0.6">
    <ot:Template>
      <ot:String ot:case="any">cena</ot:String>
      <ot:String ot:case="any">cena:</ot:String>
    </ot:Template>
  </ot:Context>
</ot:Template>

agenda

gathering information with wrappers
ways to build a wrapper
using and extending an ontology
templates and patterns
suggesting a simple wrapper induction method

annotating the document

first of all, we can use the ontology as a model of the extracted data

then we use the templates included in the ontology to identify possible example instances of the extracted values

these examples can be used with any wrapper induction method

purifying the evidences

since every pattern has the precision attribute, we can say that up to (1 − p) · 100 % of the template evidences can be false

we can group the evidences into segments based on their absolute XPath

then we calculate the sum of the confidences of all evidences in each segment and ignore the (1 − p) · 100 % of segments with the lowest sums
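The purification step might look as follows. Two details are assumptions of this sketch rather than statements from the slides: evidences are put into the same segment when their absolute XPaths agree after positional indices are stripped, and the number of dropped segments is rounded down.

```python
import math
import re
from collections import defaultdict

def segment_key(xpath):
    # Assumption: /html/body/i[3] and /html/body/i[4] share one segment
    return re.sub(r"\[\d+\]", "", xpath)

def purify(evidences, p):
    """evidences: list of (absolute_xpath, certainty) pairs;
    p: precision of the template that produced them.
    Returns the surviving segments as {segment_key: [evidence, ...]}."""
    segments = defaultdict(list)
    for xpath, certainty in evidences:
        segments[segment_key(xpath)].append((xpath, certainty))
    # rank segments by the sum of their confidences, weakest first
    ranked = sorted(segments.items(),
                    key=lambda item: sum(c for _, c in item[1]))
    n_drop = math.floor((1.0 - p) * len(ranked))
    return dict(ranked[n_drop:])


evidences = [
    ("/html/body/i[1]", 0.9),
    ("/html/body/i[2]", 0.9),
    ("/html/body/i[3]", 0.8),
    ("/html/head/title[1]", 0.3),   # a stray, probably false, evidence
]
kept = purify(evidences, p=0.5)
```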

generalizing the segments

we generalize the segment by using a variable index in the XPath

by comparing the number of elements of this generalized segment with the original, we can use the completeness parameter to measure the probable error of such a generalization
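One way to read the generalization step is sketched below: positional indices that differ between the evidences of a segment are dropped, so the step matches every sibling with that tag, and the share of generalized elements that were actually supported by an evidence is reported. How exactly this share is compared with the completeness parameter is not specified on the slide, so the sketch only computes it; all names are hypothetical.

```python
import re

STEP = re.compile(r"/([^/\[\]]+)(?:\[(\d+)\])?")

def generalize(xpaths):
    """Drop every positional index that varies across the given absolute
    XPaths; indices shared by all of them are kept."""
    parsed = [STEP.findall(x) for x in xpaths]
    tag_paths = [[tag for tag, _ in steps] for steps in parsed]
    if any(tags != tag_paths[0] for tags in tag_paths):
        raise ValueError("evidences do not share one tag path")
    out = []
    for column in zip(*parsed):
        tag = column[0][0]
        indices = {index for _, index in column}
        if len(indices) == 1 and column[0][1]:
            out.append(f"/{tag}[{column[0][1]}]")   # constant index, keep it
        else:
            out.append(f"/{tag}")                   # variable index, drop it
    return "".join(out)

def coverage(n_evidences, n_generalized):
    # share of elements selected by the generalized query that were evidenced
    return n_evidences / n_generalized


query = generalize(["/html/body/table[1]/tr[2]/td[1]",
                    "/html/body/table[1]/tr[5]/td[1]"])
share = coverage(n_evidences=2, n_generalized=4)
```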

matching the segments

we can match the segments of patterns belonging to several datatype properties and thus form complex rules for extracting the instances of ontology classes

the matching can be based on the number of their elements or on the conformity of their XPaths

future work suggestions

integration with some wrapper generation tool

automatic learning of the patterns

using other properties of ontologies, such as cardinalities

thank you for your time