Fusing automatically extracted annotations for the Semantic Web Andriy Nikolov Knowledge Media Institute The Open University
Page 1: Fusing semantic data

Fusing automatically extracted annotations for the Semantic Web

Andriy Nikolov
Knowledge Media Institute
The Open University

Page 2: Fusing semantic data

Outline

• Motivation
• Handling fusion subtasks
  – Problem-solving method approach
• Processing inconsistencies
  – Applying uncertainty reasoning
• Overcoming schema heterogeneity
  – Linked Data scenario

Page 3: Fusing semantic data

Database scenario

• Classical scenario (database domain)
  – Merging information from datasets containing partially overlapping information

Name         Year of birth   E-mail             Address
H. Schmidt   1972            [email protected]
J. Smith     1983            [email protected]

Name            Year of birth   E-mail             Job position
Wen, Zhao       1980            [email protected]
Schmidt, Hans   1973            [email protected]

Page 4: Fusing semantic data

Database scenario

• Coreference resolution (record linkage)
  – Resolving ambiguous identities

Name         Year of birth   E-mail             Address
H. Schmidt   1972            [email protected]
J. Smith     1983            [email protected]

Name            Year of birth   E-mail             Job position
Wen, Zhao       1980            [email protected]
Schmidt, Hans   1973            [email protected]

Page 5: Fusing semantic data

Database scenario

• Inconsistency resolution
  – Handling contradictory pieces of data

Name         Year of birth   E-mail             Address
H. Schmidt   1972            [email protected]
J. Smith     1983            [email protected]

Name            Year of birth   E-mail             Job position
Wen, Zhao       1980            [email protected]
Schmidt, Hans   1973            [email protected]

Page 6: Fusing semantic data

Semantic data scenario

• Database domain:
  – A record belongs to a single table
  – Table structure defines relevant attributes
  – Inconsistency of values
• Semantic data:
  – Classes are organised into hierarchies
  – One individual may belong to several classes
  – Available properties depend on the level of granularity
  – Other types of inconsistencies are possible
    • E.g., class disjointness

[Figure: example schema. foaf:Person has foaf:name and foaf:mbox (both xsd:string); sweto:Person has sweto:lives_in (sweto:Place) and sweto:affiliated_with (sweto:Organization); sweto:Researcher has some:has_degree (xsd:string).]

Page 7: Fusing semantic data

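Motivating scenario – X-Media

[Figure: X-Media pipeline. Text, images and other data from internal corporate reports (Intranet) and pre-defined public sources (WWW) pass through annotation, producing RDF against the domain ontology; fusion (KnoFuss) integrates the annotations into the knowledge base.]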

Page 8: Fusing semantic data

Outline

• Motivation
• Handling fusion subtasks
  – Problem-solving method approach
• Processing inconsistencies
  – Applying uncertainty reasoning
• Overcoming schema heterogeneity
  – Linked Data scenario

Page 9: Fusing semantic data

Handling fusion subtasks

• For each subtask, several methods are available
• Example: coreference resolution
  – Aggregated attribute similarity [Fellegi and Sunter 1969]
  – String similarity
    • Levenshtein, Jaro, Jaro-Winkler
  – Machine learning
    • Clustering
    • Classification
  – Rule-based
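
As an illustration of the string-similarity family, here is a minimal sketch of Levenshtein-based matching. The normalisation by the longer string, the 0.8 threshold and the function names are assumptions made for this example, not settings taken from the slides.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def is_candidate_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical decision rule: accept a pair whose similarity reaches the threshold."""
    return similarity(a.lower(), b.lower()) >= threshold


print(is_candidate_match("J. Smith", "J. Smyth"))
print(is_candidate_match("H. Schmidt", "Schmidt, Hans"))
```

The second pair is rejected even though both labels may denote the same person, which is one reason a plain edit distance usually needs to be combined with other evidence.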

Page 10: Fusing semantic data

Handling fusion subtasks

• All methods have their pros and cons
  – Rule-based
    • High precision
    • Restricted to a specific domain
  – Machine learning
    • Requires sufficient training data
  – String similarity
    • Lower precision
    • Still needs configuration (e.g., distance metric, threshold, set of attributes to include)
• Trade-off between the quality of results and the range of applicability: better precision requires more domain-specific knowledge

Page 11: Fusing semantic data

Problem-solving method approach

• Fusion task is decomposed into subtasks
• Algorithms are defined as methods solving a particular task
• Each method is formally described using the fusion ontology
  – Task handled by the method
  – Applicability criteria
  – Domain knowledge required
  – Reliability of output
• Methods are selected based on their capabilities

Page 12: Fusing semantic data

KnoFuss architecture

[Figure: KnoFuss architecture. New data enters KnoFuss, which uses the method library (CoreferenceResolutionMethod, ConflictDetectionMethod, ConflictResolutionMethod) and the fusion ontology; intermediate data is kept in the fusion KB and results go to the main KB.]

• Method library
  – Contains an implementation of each technique for specific subtasks (problem-solving method [Motta 1999])
• Fusion ontology
  – Describes method capabilities
  – Defines intermediate structures (mappings, conflict sets, etc.)

Page 13: Fusing semantic data

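Task decomposition

[Figure: task decomposition tree. Knowledge fusion takes a source KB and a target KB and produces a fused target KB; it is decomposed into coreference resolution and knowledge base updating, with the further subtasks model configuration, link discovery, dependency identification and dependency resolution.]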

Page 14: Fusing semantic data

Method selection

• Depends on:
  – Range of applicability
  – Reliability
• Configuration parameters
  – Generic (for individuals of unknown types)
  – Context-dependent

Example: adaptive learning matcher
  – Generic: rdf:type owl:Thing; datatypeProperty ?x; reliability = 0.4
  – Application context Publication: rdf:type sweto:Publication; rdfs:label ?x; sweto:year ?y; reliability = 0.8
  – Application context Journal Article: rdf:type sweto:Article; rdfs:label ?x; sweto:year ?y; sweto:journal ?z; sweto:volume ?a; reliability = 0.9
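
A minimal sketch of how such a selection could work, using the three configurations of the adaptive learning matcher. The `MethodConfig` type, the `SUPERCLASSES` lookup (including the foaf:Person entry) and the "highest reliability among applicable configurations" rule are assumptions made for illustration; the slides do not specify the actual KnoFuss selection procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MethodConfig:
    name: str           # illustrative label for the configuration
    applies_to: str     # class that delimits the application context
    reliability: float  # reliability value attached to the configuration


# Hypothetical fragment of the class hierarchy: each class with its ancestors.
SUPERCLASSES = {
    "sweto:Article": ["sweto:Article", "sweto:Publication", "owl:Thing"],
    "sweto:Publication": ["sweto:Publication", "owl:Thing"],
    "foaf:Person": ["foaf:Person", "owl:Thing"],
}

CONFIGS = [
    MethodConfig("generic", "owl:Thing", 0.4),
    MethodConfig("publication", "sweto:Publication", 0.8),
    MethodConfig("journal-article", "sweto:Article", 0.9),
]


def select_config(individual_class: str, configs=CONFIGS) -> MethodConfig:
    """Pick the most reliable configuration whose context covers the class."""
    ancestors = SUPERCLASSES.get(individual_class, ["owl:Thing"])
    applicable = [c for c in configs if c.applies_to in ancestors]
    return max(applicable, key=lambda c: c.reliability)


print(select_config("sweto:Article").name)
print(select_config("foaf:Person").name)
```

Individuals of an unknown type fall back to the generic configuration, which mirrors the "generic" versus "context-dependent" distinction above.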

Page 15: Fusing semantic data

Using class hierarchy

• Configuring machine-learning methods:
  – Using training instances for a subclass to learn generic models for superclasses

[Figure: class hierarchy owl:Thing > foaf:Person (name), foaf:Document (label) > sweto:Publication (year) > sweto:Article (journal_name, volume), sweto:Article_in_Proceedings (book_title). Training instances Ind1 to Ind3 are described by {label, year, book_title} at the level of sweto:Article_in_Proceedings, by {label, year} at the level of sweto:Publication, and by {label} at the level of foaf:Document.]

Page 16: Fusing semantic data

Using class hierarchy

• Configuring machine-learning methods:
  – Combining training instances for subclasses to learn a generic model for a superclass

[Figure: sweto:Publication (label, year) with subclasses sweto:Article (journal_name, volume) and sweto:Article_in_Proceedings (book_title). Instances Ind1 to Ind3 {label, year, book_title} and Ind4 to Ind6 {label, year, journal_name, volume} are combined at the level of sweto:Publication as Ind1 to Ind6 {label, year}.]
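
The combination step can be sketched as projecting every subclass instance onto the attributes available at the superclass level. In this sketch the superclass attribute set is approximated by the attributes shared by all instances, which is an assumption; the attribute values are invented sample data.

```python
def generic_training_set(instances_by_subclass):
    """Merge subclass training instances, keeping only attributes they all share."""
    instances = [inst for group in instances_by_subclass.values() for inst in group]
    if not instances:
        return []
    shared = set(instances[0])
    for inst in instances[1:]:
        shared &= set(inst)
    return [{attr: inst[attr] for attr in sorted(shared)} for inst in instances]


# Hypothetical training instances for the two subclasses of sweto:Publication.
training = {
    "sweto:Article_in_Proceedings": [
        {"label": "Paper A", "year": 2006, "book_title": "Proc. X"},
        {"label": "Paper B", "year": 2007, "book_title": "Proc. Y"},
    ],
    "sweto:Article": [
        {"label": "Paper C", "year": 2005, "journal_name": "Journal Z", "volume": 3},
        {"label": "Paper D", "year": 2007, "journal_name": "Journal Z", "volume": 5},
    ],
}

for row in generic_training_set(training):
    print(row)
```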

Page 17: Fusing semantic data

Outline

• Motivation
• Handling fusion subtasks
  – Problem-solving method approach
• Processing inconsistencies
  – Applying uncertainty reasoning
• Overcoming schema heterogeneity
  – Linked Data scenario

Page 18: Fusing semantic data

Data quality problems

• Causes of inconsistency
  – Data errors
    • Obsolete data
    • Mistakes of manual annotators
    • Errors of information extraction algorithms
  – Coreference resolution errors
    • Automatic methods are not 100% reliable
• Applying uncertainty reasoning
  – Estimated reliability of separate pieces of data
  – Domain knowledge defined in the ontology

Page 19: Fusing semantic data

Refining fused data

• Additional evidence:
  – Ontological schema restrictions
    • Disjointness
    • Cardinality
    • …
  – Neighborhood graph
    • Mappings between related entities
  – Provenance
    • Uncertainty of candidate mappings
    • Uncertainty of data statements
    • “Cleanness” of data sources

Page 20: Fusing semantic data

Dempster-Shafer theory of evidence

• Bayesian probability theory: assigning probabilities to atomic alternatives
  – p(true) = 0.6 ⇒ p(false) = 0.4
  – Sometimes hard to assign
  – Negative bias: an extraction uncertainty of less than 0.5 counts as negative evidence rather than insufficient evidence
• Dempster-Shafer theory: assigning confidence degrees (masses) to sets of alternatives
  – m({true}) = 0.6
  – m({false}) = 0.1
  – m({true; false}) = 0.3

[Figure: belief interval, with the probability lying between support and plausibility]
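
For the mass assignment above, support and plausibility follow from the standard Dempster-Shafer definitions: support sums the masses of all subsets of a hypothesis, plausibility sums the masses of all sets that intersect it. A small sketch (the function names are chosen for this example):

```python
def support(masses, hypothesis):
    """Sum of masses of all focal sets contained in the hypothesis (belief)."""
    return sum(m for focal, m in masses.items() if focal <= hypothesis)


def plausibility(masses, hypothesis):
    """Sum of masses of all focal sets that intersect the hypothesis."""
    return sum(m for focal, m in masses.items() if focal & hypothesis)


masses = {
    frozenset({"true"}): 0.6,
    frozenset({"false"}): 0.1,
    frozenset({"true", "false"}): 0.3,  # mass left on "unknown"
}

statement_holds = frozenset({"true"})
print(support(masses, statement_holds))       # mass committed to "true"
print(plausibility(masses, statement_holds))  # mass not committed to "false"
```

With these numbers the statement has support 0.6 and plausibility 0.9 (up to floating-point rounding), so the 0.3 assigned to {true; false} is treated as missing evidence rather than as evidence against the statement.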

Page 21: Fusing semantic data

Dependency detection

• Identifying and localizing conflicts
  – Using formal diagnosis [Reiter 1987] in combination with standard ontological reasoning

[Figure: Paper_10 has rdf:type Article and rdf:type Proceedings, which are related by owl:disjointWith; hasYear 2007 and hasYear 2006, where hasYear is an owl:FunctionalProperty; and hasAuthor E. Motta and V.S. Uren.]

Page 22: Fusing semantic data

Belief networks (cont)

• Valuation networks [Shenoy and Shafer 1990]
• Network nodes – OWL axioms
  – Variable nodes
    • ABox statements (I ∈ X, R(I1, I2))
    • One variable: the statement itself
  – Valuation nodes
    • TBox axioms (X ⊔ Y)
    • Mass distribution between several variables (I ∈ X, I ∈ Y, I ∈ X ⊔ Y)

Page 23: Fusing semantic data

Belief networks (cont)

• Belief network construction
  – Using translation rules
  – Rule antecedents:
    • Existence of specific OWL axioms (one rule per OWL construct)
    • Existence of network nodes
  – Example rules:
    • Explicit ABox statements: IF I ∈ X THEN CREATE N1(I ∈ X)
    • TBox inferencing: IF Trans(R) AND EXIST N1(R(I1, I2)) AND EXIST N2(R(I2, I3)) THEN CREATE N3(Trans(R)) AND CREATE N4(R(I1, I3))
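
The transitivity rule can be sketched as a fixpoint loop over statement nodes. The tuple encoding of nodes, the function name and the `partOf` sample relation are assumptions made for this illustration; the slides do not show how KnoFuss represents the network internally.

```python
def apply_transitivity_rule(statements, transitive_relations):
    """Apply: IF Trans(R) AND R(I1, I2) AND R(I2, I3)
    THEN create a valuation node for Trans(R) and a variable node R(I1, I3).

    statements: iterable of (relation, subject, object) tuples (variable nodes).
    Returns (variable_nodes, valuation_nodes).
    """
    variable_nodes = set(statements)
    valuation_nodes = set()
    changed = True
    while changed:  # repeat until no new statement node is created
        changed = False
        for rel, i1, i2 in list(variable_nodes):
            if rel not in transitive_relations:
                continue
            for rel2, j2, i3 in list(variable_nodes):
                if rel2 != rel or j2 != i2:
                    continue
                inferred = (rel, i1, i3)
                # The valuation node links both premises to the conclusion.
                valuation_nodes.add(("Trans", rel, (i1, i2), (i2, i3), (i1, i3)))
                if inferred not in variable_nodes:
                    variable_nodes.add(inferred)
                    changed = True
    return variable_nodes, valuation_nodes


# Hypothetical input: two statements over a transitive relation.
nodes, valuations = apply_transitivity_rule(
    [("partOf", "x", "y"), ("partOf", "y", "z")],
    transitive_relations={"partOf"},
)
print(sorted(nodes))
print(sorted(valuations))
```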

Page 24: Fusing semantic data

Example

[Figure: #Paper_10 has rdf:type Article and rdf:type Proceedings; Article owl:disjointWith Proceedings.]

Page 25: Fusing semantic data

Example

[Figure: #Paper_10 has rdf:type Article and rdf:type Proceedings; Article owl:disjointWith Proceedings.]

Variable nodes:
• #Paper_10 ∈ Article
• #Paper_10 ∈ Proceedings

Page 26: Fusing semantic data

Example

[Figure: #Paper_10 has rdf:type Article and rdf:type Proceedings; Article owl:disjointWith Proceedings.]

Variable nodes:
• #Paper_10 ∈ Article
• #Paper_10 ∈ Proceedings

Valuation node:
• Article ⊑ ¬Proceedings

Page 27: Fusing semantic data

Example

Variable nodes:
• #Paper_10 ∈ Article: m(true) = 0.8, m(false) = 0, m({true; false}) = 0.2
• #Paper_10 ∈ Proceedings: m(true) = 0.6, m(false) = 0, m({true; false}) = 0.4

Valuation node Article ⊑ ¬Proceedings, over the pair (#Paper_10 ∈ Article, #Paper_10 ∈ Proceedings):
• m({(false, false), (false, true), (true, false)}) = 1.0
• m({(true, true)}) = 0.0

Page 28: Fusing semantic data

Example

Variable nodes:
• #Paper_10 ∈ Article: m(true) = 0.8, m(false) = 0, m({true; false}) = 0.2
• #Paper_10 ∈ Proceedings: m(true) = 0.6, m(false) = 0, m({true; false}) = 0.4

Valuation node Article ⊑ ¬Proceedings after combination with Dempster’s rule, over the pair (#Paper_10 ∈ Article, #Paper_10 ∈ Proceedings):
• m({(false, false), (false, true), (true, false)}) = 0.15
• m({(false, true)}) = 0.23
• m({(true, false)}) = 0.62

Page 29: Fusing semantic data

Example

Variable nodes after propagation:
• #Paper_10 ∈ Article: m(true) = 0.62, m(false) = 0.23, m({true; false}) = 0.15
• #Paper_10 ∈ Proceedings: m(true) = 0.23, m(false) = 0.62, m({true; false}) = 0.15

Valuation node Article ⊑ ¬Proceedings, over the pair (#Paper_10 ∈ Article, #Paper_10 ∈ Proceedings):
• m({(false, false), (false, true), (true, false)}) = 0.15
• m({(false, true)}) = 0.23
• m({(true, false)}) = 0.62
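
The numbers in this example can be reproduced with a direct implementation of Dempster's rule of combination over the joint frame of the two statements. This is a sketch for checking the arithmetic, not the valuation-network propagation algorithm itself; the data layout (worlds as `(is_article, is_proceedings)` pairs) and the helper names are assumptions.

```python
from itertools import product

# Joint frame: every truth assignment to (Paper_10 in Article, Paper_10 in Proceedings).
FRAME = frozenset(product([True, False], repeat=2))


def combine(m1, m2):
    """Dempster's rule: multiply masses, intersect focal sets, renormalise conflict."""
    combined, conflict = {}, 0.0
    for (s1, w1), (s2, w2) in product(m1.items(), m2.items()):
        common = s1 & s2
        if common:
            combined[common] = combined.get(common, 0.0) + w1 * w2
        else:
            conflict += w1 * w2  # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


def marginal(joint, index):
    """Project a joint mass assignment onto one of the two statements."""
    out = {"true": 0.0, "false": 0.0, "unknown": 0.0}
    for worlds, w in joint.items():
        values = {world[index] for world in worlds}
        key = "true" if values == {True} else "false" if values == {False} else "unknown"
        out[key] += w
    return out


article_true = frozenset(w for w in FRAME if w[0])
proceedings_true = frozenset(w for w in FRAME if w[1])
not_both = frozenset(w for w in FRAME if not (w[0] and w[1]))  # disjointness axiom

m_article = {article_true: 0.8, FRAME: 0.2}
m_proceedings = {proceedings_true: 0.6, FRAME: 0.4}
m_disjoint = {not_both: 1.0}

joint = combine(combine(m_article, m_proceedings), m_disjoint)
print({k: round(v, 2) for k, v in marginal(joint, 0).items()})  # Article
print({k: round(v, 2) for k, v in marginal(joint, 1).items()})  # Proceedings
```

The conflicting mass 0.8 × 0.6 = 0.48 (both statements true) is discarded and the remainder is divided by 1 − 0.48 = 0.52, which gives 0.32/0.52 ≈ 0.62, 0.12/0.52 ≈ 0.23 and 0.08/0.52 ≈ 0.15.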

Page 30: Fusing semantic data

Belief propagation

• Translating subontology into a belief network
  – Using provenance and confidence values of data statements
  – Coreferencing algorithm precision for owl:sameAs mappings
• Data refinement:
  – Detecting spurious mappings
  – Removing unreliable data statements

[Figure: belief network. Valuation nodes: Article ⊑ ¬in_Proc, Ind1 = Ind2, Functional(year). Variable nodes: Article(Ind1), in_Proc(Ind2), in_Proc(Ind1), year(Ind1, 2007), year(Ind2, 2007), year(Ind1, 2006). Nodes carry pairs of belief intervals: (0.99; 1.0)/(0.97; 0.98), (0.9; 1.0)/(0.74; 0.82), (0.92; 1.0)/(0.2; 0.21), (0.85; 1.0)/(0.72; 0.85), (0.95; 1.0)/(0.91; 0.96).]

Page 31: Fusing semantic data

Neighborhood graph

• Non-functional relations: varying impact

[Figure, first graph: Paper_10 (rdf:type Proceedings) hasAuthor H. Schmidt (rdf:type Person); Paper_11 (rdf:type Proceedings) hasAuthor Schmidt, Hans (rdf:type Person); Paper_10 owl:sameAs Paper_11 (0.9); H. Schmidt owl:sameAs Schmidt, Hans (0.3).]

[Figure, second graph: H. Schmidt (rdf:type Person) citizen_of Germany (rdf:type Country); Schmidt, Hans (rdf:type Person) citizen_of Germany (rdf:type Country); Germany owl:sameAs Germany (1.0); H. Schmidt owl:sameAs Schmidt, Hans (0.3).]

Page 32: Fusing semantic data

Neighborhood graph

• Implicit relations: set co-membership

[Figure: belief network with mapping nodes Person11 = Person12 (“Bard, J.B.L.” = “Jonathan Bard”) and Person21 = Person22 (“Webber, B.L.” = “Bonnie L. Webber”), and statement nodes Coauthor(Person12, Person22), Coauthor(Person11, Person22) and Coauthor(Person21, Person22). Node annotations: 0.84/(0.86; 1.0), 0.16/(0.83; 1.0), 1.0/(1.0; 1.0), 1.0/(1.0; 1.0).]

Page 33: Fusing semantic data

Provenance

• Initial belief assignments:
  – Data statements (source AND/OR extractor confidence)
  – Candidate mappings (precision of attribute similarity algorithms)
  – Source “cleanness”: contains duplicates or not

[Figure: candidate mappings Arlington = Arl_Va (Arlington, Virginia) and Arlington = Arl_Tx (Arlington, Texas), where Arl_Va and Arl_Tx come from the same source and are distinct. Node annotations: 1.0/(1.0; 1.0), 0.9/(0.31; 0.35), 0.95/(0.65; 0.69).]

Page 34: Fusing semantic data

Experiments

• Datasets:
  – Publication 1
    • AKT
    • Rexa
    • SWETO-DBLP
  – Cora
    • Database community benchmark
    • Translated into RDF
    • 2 versions used
      – Different structure
      – Different gold standard

Page 35: Fusing semantic data

Experiments

• Publication individuals
  – Ontological restrictions mainly influence precision

Publication

Dataset     No   Matcher           Prec.   Recall  F1      Prec.   Recall  F1
AKT/Rexa    1    Jaro-Winkler      0.950   0.833   0.887   0.969   0.832   0.895
            2    L2 Jaro-Winkler   0.879   0.956   0.916   0.923   0.956   0.939
AKT/DBLP    3    Jaro-Winkler      0.922   0.952   0.937   0.992   0.952   0.971
            4    L2 Jaro-Winkler   0.389   0.984   0.558   0.838   0.983   0.905
Rexa/DBLP   5    Jaro-Winkler      0.899   0.933   0.916   0.944   0.932   0.938
            6    L2 Jaro-Winkler   0.546   0.982   0.702   0.823   0.981   0.895
Cora (I)    7    Monge-Elkan       0.735   0.931   0.821   0.939   0.836   0.884
Cora (II)   8    Monge-Elkan       0.698   0.986   0.817   0.958   0.956   0.957

Page 36: Fusing semantic data

Experiments

• Person individuals
  – Evidence coming from the neighborhood graph
  – Mainly influences recall

Person

Dataset     No   Matcher           Prec.   Recall  F1      Prec.   Recall  F1
AKT/Rexa    7    L2 Jaro-Winkler   0.738   0.888   0.806   0.788   0.935   0.855
AKT/DBLP    8    L2 Jaro-Winkler   0.532   0.746   0.621   0.583   0.921   0.714
Rexa/DBLP   9    Jaro-Winkler      0.965   0.755   0.846   0.968   0.876   0.920
Cora (I)    10   L2 Jaro-Winkler   0.983   0.879   0.928   0.981   0.895   0.936
Cora (II)   11   L2 Jaro-Winkler   0.999   0.994   0.997   0.999   0.994   0.997

Page 37: Fusing semantic data

Outline

• Motivation
• Handling fusion subtasks
  – Problem-solving method approach
• Processing inconsistencies
  – Applying uncertainty reasoning
• Overcoming schema heterogeneity
  – Linked Data scenario

Page 38: Fusing semantic data


Advanced scenario

• Linked Data cloud: network of public RDF repositories [Bizer et al. 2009]

• Added value: coreference links (owl:sameAs)

Page 39: Fusing semantic data

Data linking: current state

• Automatic instance matching algorithms
  – SILK, ODDLinker, KnoFuss, …
• Pairwise matching of datasets
  – Requires significant configuration effort
• Transitive closure of links
  – Use of “reference” datasets

Page 40: Fusing semantic data


Reference datasets

Page 41: Fusing semantic data

Problems

• Transitive closures often incomplete
  – Reference dataset is incomplete
  – Missing intermediate links
  – Direct comparison of relevant datasets is desirable
• Schema heterogeneity
  – Which instances to compare?
  – Which properties are relevant?

[Figure: datasets A and B linked through a reference dataset]

Page 42: Fusing semantic data

Schema matching

• Interpretation mismatches
  – dbpedia:Actor = professional actor
  – movie:actor = anybody who participated in a movie
• Class interpretation “as used” vs “as designed”
  – FOAF: foaf:Person = any person
  – DBLP: foaf:Person = computer scientist
• Instance-based ontology matching

Class           Repository   Richard Nixon   David Garrick
dbpedia:Actor   DBPedia      -               +
movie:Actor     LinkedMDB    +               -

Page 43: Fusing semantic data

KnoFuss - enhanced

[Figure: enhanced task decomposition. Knowledge fusion over a source KB and a target KB comprises ontology integration (ontology matching, instance transformation) and knowledge base integration (coreference resolution, dependency resolution); SPARQL query translation is also shown.]

Page 44: Fusing semantic data

Schema matching

• Step 1: inferring schema mappings from pre-existing instance mappings
• Step 2: utilizing schema mappings to produce new instance mappings

[Figure: two panels, one per step, each showing Ontology 1 and Ontology 2 above Dataset 1 and Dataset 2]

Page 45: Fusing semantic data

Overview

• Background knowledge:
  – Data-level (intermediate repositories)
  – Schema-level (datasets with more fine-grained schemas)

Page 46: Fusing semantic data

Algorithm

• Step 1:
  – Obtaining transitive closure of existing mappings

[Figure: movie:music_contributor/2490 (LinkedMDB) = music:artist/a16…9fdf (MusicBrainz) = dbpedia:Ennio_Morricone (DBPedia)]
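
One common way to obtain such a transitive closure is a union-find (disjoint-set) structure over the existing owl:sameAs links; whether KnoFuss does it this way is not stated in the slides. A minimal sketch, with the function name chosen for this example:

```python
def identity_clusters(same_as_links):
    """Group URIs into identity clusters: the transitive closure of sameAs links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in same_as_links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())


# The two existing links from the figure; the MusicBrainz URI is abbreviated as in the source.
links = [
    ("movie:music_contributor/2490", "music:artist/a16…9fdf"),
    ("music:artist/a16…9fdf", "dbpedia:Ennio_Morricone"),
]
print(identity_clusters(links))
```

The LinkedMDB and DBPedia URIs end up in the same cluster although no direct link between them exists.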

Page 47: Fusing semantic data

Algorithm

• Step 2: Inferring class and property mappings
  – ClassOverlap and PropertyOverlap mappings
  – Confidence(classes A, B) = |c(A) ∩ c(B)| / min(|c(A)|, |c(B)|) (overlap coefficient)
  – Confidence(properties r1, r2) = |c(X)| / |c(Y)|
    • X: identity clusters with equivalent values of r1 and r2
    • Y: all identity clusters which have values for both r1 and r2

[Figure: movie:music_contributor/2490 (LinkedMDB) = music:artist/a16…9fdf (MusicBrainz) = dbpedia:Ennio_Morricone (DBPedia); movie:music_contributor/2490 is_a movie:music_contributor and dbpedia:Ennio_Morricone is_a dbpedia:Artist.]
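
Both confidence measures can be sketched in a few lines. The sketch assumes that c(A) denotes the set of identity clusters containing at least one instance of class A, and that two property values are "equivalent" when the value sets share an element; both readings are assumptions, as are the sample cluster identifiers.

```python
def class_overlap_confidence(clusters_a, clusters_b):
    """Overlap coefficient |c(A) ∩ c(B)| / min(|c(A)|, |c(B)|)."""
    if not clusters_a or not clusters_b:
        return 0.0
    return len(clusters_a & clusters_b) / min(len(clusters_a), len(clusters_b))


def property_overlap_confidence(cluster_values):
    """|X| / |Y| for a list of (values of r1, values of r2) pairs, one per identity cluster.

    Y: clusters with values for both properties; X: those of them whose values agree.
    """
    both = [(v1, v2) for v1, v2 in cluster_values if v1 and v2]
    if not both:
        return 0.0
    agreeing = [pair for pair in both if pair[0] & pair[1]]
    return len(agreeing) / len(both)


# Hypothetical identity-cluster ids for two classes.
music_contributor = {1, 2, 3, 4}
artist = {3, 4, 5, 6, 7, 8}
print(class_overlap_confidence(music_contributor, artist))

# Hypothetical property values per cluster: (r1 values, r2 values).
values = [({"1928"}, {"1928"}), ({"1950"}, {"1951"}), ({"1960"}, set())]
print(property_overlap_confidence(values))
```

Dividing by the smaller class keeps the score high when one class is a specialisation of the other, which suits mappings between a narrow class and a broad one.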

Page 48: Fusing semantic data

Algorithm

• Step 3: Inferring data patterns
• Functionality restrictions
• IF 2 equivalent movies do not have overlapping actors AND have different release dates THEN break the equivalence link
• Note:
  – Only usable if not taken into account at the initial instance matching stage
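
The movie pattern above can be sketched as a filter over existing equivalence links. The `Movie` record, its fields and the sample data are hypothetical; the rule body follows the IF/THEN statement on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Movie:
    uri: str
    actors: frozenset
    release_date: str


def is_potentially_spurious(a: Movie, b: Movie) -> bool:
    """True if the two linked movies share no actor AND have different release dates."""
    no_shared_actors = not (a.actors & b.actors)
    different_dates = a.release_date != b.release_date
    return no_shared_actors and different_dates


def filter_links(links):
    """Keep only the equivalence links that the pattern does not flag."""
    return [(a, b) for a, b in links if not is_potentially_spurious(a, b)]


# Hypothetical records from two repositories.
m1 = Movie("ex1:film/1", frozenset({"Actor A", "Actor B"}), "1966")
m2 = Movie("ex2:Film_1", frozenset({"Actor A"}), "1966")
m3 = Movie("ex2:Film_1_remake", frozenset({"Actor C"}), "2001")

kept = filter_links([(m1, m2), (m1, m3)])
print([(a.uri, b.uri) for a, b in kept])
```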

Page 49: Fusing semantic data

Algorithm

• Step 4: utilizing mappings and patterns
  – Run instance-level matching for individuals of strongly overlapping classes
  – Use patterns to filter out existing mappings

• LinkedMDB

  SELECT ?uri
  WHERE {
    ?uri rdf:type movie:music_contributor .
  }

• DBPedia

  SELECT ?uri
  WHERE {
    ?uri rdf:type dbpedia:Artist .
  }

Page 50: Fusing semantic data

Results

• Class mappings:
  – Improvement in recall
    • Previously omitted mappings were discovered after direct comparison of instances
• Data patterns
  – Improved precision
    • Filtered out spurious mappings
    • Identified 140 mappings between movies as “potentially spurious”
    • 132 identified correctly

[Figure: three bar charts of precision, recall and F1-measure (axis from 0 to 1) for the Existing, KnoFuss (only) and Combined link sets, one chart each for DBPedia/DBLP, DBPedia/LinkedMDB and DBPedia/BookMashup]

Page 51: Fusing semantic data

Future work

• From the pairwise scenario to the network of repositories
• Combining schema and data integration in an efficient way
• Evaluating data sources
  – Which data source(s) to link to?
  – Which data source(s) to select data from?

Page 52: Fusing semantic data


Questions?

Thanks for your attention

Page 53: Fusing semantic data

References

[Bizer et al. 2009] C. Bizer, T. Heath, T. Berners-Lee. Linked Data - the story so far. International Journal on Semantic Web and Information Systems, 5(3):1-22, 2009

[Fellegi and Sunter 1969] I. P. Fellegi, A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183-1210, 1969

[Motta 1999] E. Motta. Reusable components for knowledge modelling. Amsterdam: IOS Press, 1999

[Reiter 1987] R. Reiter. A theory of diagnosis from first principles. Artificial Intelligence, 32(1):57-95, 1987

[Shenoy and Shafer 1990] P. Shenoy, G. Shafer. Axioms for probability and belief-function propagation. In: Readings in uncertain reasoning. San Francisco: Morgan Kaufmann, pp. 575-610, 1990