A Tutorial on Instance Matching Benchmarks

Evangelia Daskalaki, Institute of Computer Science – FORTH, Greece
Tzanina Saveta, Institute of Computer Science – FORTH, Greece
Irini Fundulaki, Institute of Computer Science – FORTH, Greece
Melanie Herschel, Universitaet Stuttgart

ESWC 2016, May 30th, Anissaras – Crete, Greece
http://www.ics.forth.gr/isl/BenchmarksTutorial/
Teaser Slide
• We will talk about benchmarks.
• A benchmark is, generally, a set of tests to assess a computer system's performance.
• Specifically, we will talk about Instance Matching (IM) benchmarks for Linked Data.
Overview
• Introduction to Linked Data
• Instance Matching
• Benchmarks for Linked Data
– Why Benchmarks?
– Benchmarks Characteristics
– Benchmarks Dimensions
• Benchmarks in the literature
– Benchmark Systems
– Synthetic Benchmarks
– Real Benchmarks
• Summary & Conclusions
Linked Data - The LOD Cloud
Media
Government
Geographic
Publications
User-generated
Life sciences
Cross-domain
Linked Data – The LOD Cloud
*Adapted from Suchanek & Weikum tutorial @ SIGMOD 2013
The same entity can be described in different sources
Different Descriptions of Same Entity in Different Sources
"Riva del Garda description in GeoNames"
"Riva del Garda description in DBpedia"
Overview
• Introduction to Linked Data
• Instance Matching
• Benchmarks for Linked Data
– Why Benchmarks?
– Benchmarks Characteristics
– Benchmarks Dimensions
• Benchmarks in the literature
– Benchmark Generators
– Synthetic Benchmarks
– Real Benchmarks
• Summary & Conclusions
Instance Matching: the cornerstone for Linked Data
data acquisition
data evolution
data integration
open/social data

How can we automatically recognize multiple mentions of the same entity, across or within sources?
= Instance Matching
Instance Matching
• The problem has been considered for more than half a century in Computer Science [EIV07]
• Traditional instance matching over relational data (known as record linkage)
Title | Genre   | Year | Director
Troy  | Action  | 2004 | Petersen
Troj  | History |      | Petersen

The Genre values contradict each other, the second record's Year is a missing value, and "Troy" vs. "Troj" is a value variation.

Nicely and homogeneously structured data, with value variations.
Typically few sources compared.
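As a rough illustration (not part of the original tutorial), a minimal record-linkage sketch: the two relational records above are compared attribute by attribute with a string-similarity measure, so the value variation "Troy" vs. "Troj" still contributes a high score. The records, the attribute names, and the averaging scheme are illustrative assumptions, not any specific system's method.

```python
from difflib import SequenceMatcher

# The two movie records from the table above, with a value variation
# ("Troy" vs. "Troj"), a Genre contradiction, and a missing Year.
rec1 = {"title": "Troy", "genre": "Action", "year": "2004", "director": "Petersen"}
rec2 = {"title": "Troj", "genre": "History", "year": None, "director": "Petersen"}

def similarity(a, b):
    """String similarity in [0, 1]; a missing value contributes nothing."""
    if a is None or b is None:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(r1, r2):
    """Average attribute similarity over the shared schema."""
    keys = r1.keys() & r2.keys()
    return sum(similarity(r1[k], r2[k]) for k in keys) / len(keys)

score = match_score(rec1, rec2)
print(round(score, 2))
```

A real matcher would then declare a match when the score exceeds a tuned threshold; picking that threshold is exactly what benchmarks help evaluate.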
Web Data Instance Matching – « The Early Days »
• IM algorithms for the semi-structured XML model, used to represent and exchange data.
Two XML descriptions of the same movie (nodes as id, label: value):

m1, movie
  t1, title: "Troy"
  s1, set
    a11, actor: "Brad Pitt"
    a12, actor: "Eric Bana"
  y1, year: "2004"

m2, movie
  t2, title: "Troja"
  s2, set
    a21, actor: "Brad Pit"
    a22, actor: "Erik Bana"
    a23, actor: "Brian Cox"
  y2, year: "04"

Solutions assume one common schema.
Structural variation.
Instance Matching Today
Sets of RDF/OWL triples
*Adapted from Suchanek & Weikum tutorial @ SIGMOD 2013
Many sources to match
Rich semantics
Value, structural, and logical variations
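To make this concrete, a minimal sketch of matching instances represented as sets of RDF-style (property, value) pairs, echoing the earlier Riva del Garda example. The triples below are invented for illustration, not taken from GeoNames or DBpedia.

```python
# Each instance is reduced to a set of (property, value) pairs from its
# RDF description; two descriptions of the same entity from different
# sources are then compared by set overlap.
riva_geonames = {("name", "Riva del Garda"), ("country", "IT"), ("feature", "town")}
riva_dbpedia  = {("name", "Riva del Garda"), ("country", "IT"), ("region", "Trentino")}

def jaccard(a, b):
    """Jaccard similarity of two sets, in [0, 1]."""
    return len(a & b) / len(a | b) if a | b else 1.0

print(jaccard(riva_geonames, riva_dbpedia))  # 0.5
```

Set overlap alone ignores the rich semantics (e.g. subclass relations) that Linked Data matchers must also exploit, which is precisely why logical variations are hard.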
Need for IM techniques
• People interconnect their datasets with existing ones.
– These links are often manually curated (or semi-automatically generated).
• The size and number of datasets are huge, so it is vital to automatically detect additional links, making the graph denser.
Benchmarking
Instance matching research has led to the development of various systems.
– How can we compare these systems?
– How can we assess their performance?
– How can we push the systems to get better?
These systems need to be benchmarked!
Overview
• Introduction to Linked Data
• Instance Matching
• Benchmarks for Linked Data
– Why Benchmarks?
– Benchmarks Characteristics
– Benchmarks Dimensions
• Benchmarks in the literature
– Benchmark Generators
– Synthetic Benchmarks
– Real Benchmarks
• Summary & Conclusions
Benchmarking
“A benchmark specifies a workload characterizing typical applications in the specific domain. The performance of various computer systems on this workload gives a rough estimate of their relative performance on that problem domain.”
[G92]
Instance Matching Benchmark Ingredients [FLM08]
Organized into test cases, each addressing a different kind of requirement:
• Datasets
The raw material of the benchmark: the source and target datasets that will be matched to find the links.
• Gold Standard (Ground Truth / Reference Alignment)
The “correct answer sheet” used to judge the completeness and soundness of the instance matching algorithms.
• Metrics
The performance metric(s) that determine a system's behavior and performance.
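A hedged sketch of how these ingredients typically come together: the system's output links are compared against the gold standard to compute precision, recall, and F-measure. The URI pairs below are placeholders, not data from any benchmark.

```python
# System output and gold standard (reference alignment), each a set of
# (source URI, target URI) link pairs.
system_links  = {("src:a", "tgt:1"), ("src:b", "tgt:2"), ("src:c", "tgt:9")}
gold_standard = {("src:a", "tgt:1"), ("src:b", "tgt:2"), ("src:d", "tgt:4")}

true_positives = len(system_links & gold_standard)
precision = true_positives / len(system_links)      # soundness of the output
recall = true_positives / len(gold_standard)        # completeness of the output
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```

Precision measures soundness and recall measures completeness, which is why an error-prone reference alignment (as with real datasets) directly distorts both numbers.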
Datasets Characteristics
• Nature of data (real vs. synthetic)
• Schema (same vs. different)
• Domain (dependent vs. independent)
• Language (one vs. multiple)
Real vs. Synthetic Datasets
Real datasets:
– Realistic conditions for heterogeneity problems
– Realistic distributions
– Error-prone reference alignments
Synthetic datasets:
– Fully controlled test conditions
– Accurate gold standards
– Unrealistic distributions
– Systematic heterogeneity problems
Data Variations in Datasets
Value Variations
Structural Variations
Logical Variations
Combination of the variations
Multilingual variations
Variations
• Value
– Name-style abbreviation
– Typographical errors
– Format changes (date/gender/number)
– Synonym changes
– Multilingualism
• Structural
– Change property depth
– Delete/add a property
– Split property values
– Transform an object property into a datatype property
– Transform a datatype property into an object property
• Logical
– Delete/modify class assertions
– Invert property assertions
– Change the property hierarchy
– Assert disjoint classes
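The value variations above are what synthetic benchmark generators inject programmatically into a source dataset to produce the target. A minimal sketch of two such transformations; the function names and details are illustrative assumptions, not any specific generator's API.

```python
import random

def swap_typo(value, rng):
    """Typographical error: swap two adjacent characters at a random position."""
    if len(value) < 2:
        return value
    i = rng.randrange(len(value) - 1)
    chars = list(value)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def abbreviate_name(value):
    """Name-style abbreviation: 'Brad Pitt' -> 'B. Pitt'."""
    parts = value.split()
    if len(parts) < 2:
        return value
    return parts[0][0] + ". " + " ".join(parts[1:])

rng = random.Random(0)  # seeded, so a generated benchmark is reproducible
print(abbreviate_name("Brad Pitt"))  # B. Pitt
print(swap_typo("Petersen", rng))
```

Because the generator knows exactly which transformations it applied to which instance, the gold standard of a synthetic benchmark is accurate by construction, as noted above.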
cogito-first_sentence (uri1, “George Wheeler Dryden (August 31, 1892 in London - September 30, 1957 in Los Angeles) was an English actor and film director, the son of Hannah Chaplin and” ...)

cogito-first_sentence (uri2, uri3)
hasDataValue (uri3, “George Wheeler Dryden (August 31, 1892 in London - September 30, 1957 in Los Angeles) was an English actor and film director, the son of Hannah Chaplin and” ...)

cogito-tag (uri1, “Actor”)
cogito-tag (uri2, uri4)
hasDataValue (uri4, “Actor”)

*Triples in the form property (subject, object)
Logical Variations IIMB 2009
Property name | Original instance | Transformed instance
type | “Sportsperson” | owl:Thing
wikipedia-name | “Sammy Lee” | “Sammy Lee”
cogito-first_sentence | “Dr. Sammy Lee (born August 1, 1920 in Fresno, California) is the first Asian American to win an Olympic gold…” | “Dr. Sammy Lee (born August 1, 1920 in Fresno, California) is the first Asian American to win an Olympic gold…”
cogito-tag | “Sportperson” | “Sportperson”
cogito-domain | “Sport” | “Sport”

Sportsperson subClassOf Thing
*Triples in the form property, object
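To recognize that owl:Thing in the transformed instance is still compatible with the original Sportsperson assertion, a matcher must consult the class hierarchy rather than compare strings. A minimal sketch; the dictionary-based hierarchy is an illustrative simplification, not a real OWL reasoner.

```python
# Single-parent class hierarchy: Sportsperson subClassOf Thing.
SUBCLASS_OF = {"Sportsperson": "Thing"}

def superclasses(cls):
    """All ancestors of cls, including cls itself."""
    seen = {cls}
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        seen.add(cls)
    return seen

def compatible(cls_a, cls_b):
    """True if one class subsumes the other, so the assertions can co-refer."""
    return cls_b in superclasses(cls_a) or cls_a in superclasses(cls_b)

print(compatible("Sportsperson", "Thing"))  # True
print("Sportsperson" == "Thing")            # False: plain string matching fails
```

This is the essence of why logical variations are harder than value variations: handling them needs (some) reasoning over the ontology, not just string similarity.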
1. Good results from all the systems
2. Well-known domain and datasets
3. No logical variations
Overview DI 2011

Characteristics:
– Systematic procedure
– Quality
– Equity
– Volume
– Dissemination
– Availability
– Ground truth

Variations:
– Value variations
– Structural variations
– Logical variations
– Multilinguality
Comparison of Real Benchmarks
Overview
• Introduction to Linked Data
• Instance Matching
• Benchmarks for Linked Data
– Why Benchmarks?
– Benchmarks Characteristics
– Benchmarks Dimensions
• Benchmarks in the literature
– Benchmark Systems
– Synthetic Benchmarks
– Real Benchmarks
• Summary and Conclusions
Wrapping up: Benchmarks
Which benchmarks included multilingual datasets?
OAEI RDFT 2013 (French-English)
ID-REC 2014 (English-Italian)
Author Task (English-Italian)
Wrapping up: Benchmarks
Which benchmarks included value variations in the test cases?
OAEI IIMB 2009
OAEI IIMB 2010
OAEI Persons-Restaurants 2010
ONTOBI
OAEI IIMB 2011
Sandbox 2012
OAEI IIMB 2012
OAEI RDFT 2013
ID-REC 2014
SPIMBENCH 2015
Author Task 2015
ARS
DI 2010
DI 2011
Wrapping up: Benchmarks
Which benchmarks included structural variations in the test cases?
OAEI IIMB 2009
OAEI IIMB 2010
OAEI Persons-Restaurants 2010
ONTOBI
OAEI IIMB 2011
OAEI IIMB 2012
OAEI RDFT 2013
ID-REC 2014
SPIMBENCH 2015
Author Task 2015
ARS
DI 2010
DI 2011
Wrapping up: Benchmarks
Which benchmarks included logical variations in the test cases?
OAEI IIMB 2009
OAEI IIMB 2010
OAEI IIMB 2011
OAEI IIMB 2012
SPIMBENCH 2015
Wrapping up: Benchmarks
Which benchmarks included a combination of the variations in the test cases?
IIMB 2009
IIMB 2010
IIMB 2011
IIMB 2012
RDFT 2013
ID-REC 2014
SPIMBENCH 2015
Author Task 2015
Wrapping up: Benchmarks
Which benchmarks are the most voluminous?
SPIMBENCH 2015
ARS
DI 2011
Wrapping up: Benchmarks
Which benchmark included a combination of the variations and was voluminous at the same time?
SPIMBENCH 2015
Open Issues
• Issue 1:
Only one benchmark tackles both a combination of variations and scalability issues.
• Issue 2:
Not enough IM benchmarks use the full expressiveness of the RDF/OWL languages.
Wrapping Up: Systems for Benchmarks
Outcomes as far as systems are concerned:
• Systems can handle value variations, structural variations, and simple logical variations separately.
• More work is needed for complex variations (combinations of value, structural, and logical variations).
• More work is needed for structural variations.
• Systems need enhancements to cope with the clustering of mappings (1-n mappings).
Conclusion
• Many instance matching benchmarks have been proposed.
• Each of them answers some of the needs of instance matching systems.
• It is high time to start creating benchmarks that will “show the way to the future”.
• Extend the limits of existing systems.
Questions? Comments?
Thank you!
References (1)
# Reference Abbreviation
1 J. L. Aguirre, K. Eckert, A. Ferrara, J. Euzenat, W. R. van Hage, L. Hollink, C. Meilicke, A. Nikolov, D. Ritze, F. Scharffe, P. Shvaiko, O. Svab-Zamazal, C. Trojahn, E. Jimenez-Ruiz, B. C. Grau, and B. Zapilko. Results of the Ontology Alignment Evaluation Initiative 2012. In OM, 2012. [AEE+12]
2 I. Bhattacharya and L. Getoor. Entity Resolution in Graphs. Mining Graph Data. Wiley and Sons, 2006. [BG06]
3 J. Euzenat, A. Ferrara, L. Hollink, A. Isaac, C. Joslyn, V. Malaise, C. Meilicke, A. Nikolov, J. Pane, M. Sabou, F. Scharffe, P. Shvaiko, H. Stuckenschmidt, O. Svab-Zamazal, V. Svatek, C. Trojahn, G. Vouros, and S. Wang. Results of the Ontology Alignment Evaluation Initiative 2009. In OM, 2009. [EFH+09]
4 J. Euzenat, A. Ferrara, C. Meilicke, J. Pane, F. Scharffe, P. Shvaiko, H. Stuckenschmidt, O. Svab-Zamazal, V. Svatek, and C. Trojahn. Results of the Ontology Alignment Evaluation Initiative 2010. In OM, 2010. [EFM+10]
5 J. Euzenat, A. Ferrara, W. R. van Hage, L. Hollink, C. Meilicke, A. Nikolov, D. Ritze, F. Scharffe, P. Shvaiko, H. Stuckenschmidt, O. Svab-Zamazal, and C. Trojahn. Results of the Ontology Alignment Evaluation Initiative 2011. In OM, 2011. [EHH+11]
6 A. K. Elmagarmid, P. Ipeirotis, and V. Verykios. Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering, 19(1), 2007. [EIV07]
7 J. Euzenat and P. Shvaiko, editors. Ontology Matching. Springer-Verlag, 2007. [ES07]
8 A. Ferrara, D. Lorusso, S. Montanelli, and G. Varese. Towards a Benchmark for Instance Matching. In OM, 2008. [FLM08]
9 A. Ferrara, S. Montanelli, J. Noessner, and H. Stuckenschmidt. Benchmarking Matching Applications on the Semantic Web. In ESWC, 2011. [FMN+11]
10 J. Gray, editor. The Benchmark Handbook for Database and Transaction Systems. Morgan Kaufmann, 1993. [G93]
References (2)
# Reference Abbreviation
11 B. C. Grau, Z. Dragisic, K. Eckert, J. Euzenat, A. Ferrara, R. Granada, V. Ivanova, E. Jimenez-Ruiz, A. O. Kempf, P. Lambrix, A. Nikolov, H. Paulheim, D. Ritze, F. Scharffe, P. Shvaiko, C. Trojahn, and O. Zamazal. Results of the Ontology Alignment Evaluation Initiative 2013. In OM, 2013. [GDE+13]
12 A. J. G. Gray, P. Groth, A. Loizou, et al. Applying Linked Data Approaches to Pharmacology: Architectural Decisions and Implementation. Semantic Web, 2012. [GGL+12]
13 P. Hayes. RDF Semantics. www.w3.org/TR/rdf-mt, February 2004. [H04]
14 R. Isele and C. Bizer. Learning Linkage Rules Using Genetic Programming. In OM, 2011. [IB11]
15 A. Isaac, L. van der Meij, S. Schlobach, and S. Wang. An Empirical Study of Instance-Based Ontology Matching. In ISWC/ASWC, 2007. [IMS07]
16 E. Ioannou, N. Rassadko, and Y. Velegrakis. On Generating Benchmark Data for Entity Matching. Journal of Data Semantics, 2012. [IRV12]
17 A. Jentzsch, J. Zhao, O. Hassanzadeh, K.-H. Cheung, M. Samwald, and B. Andersson. Linking Open Drug Data. In Linking Open Data Triplification Challenge, I-SEMANTICS, 2009. [JZH+09]
18 C. Li, L. Jin, and S. Mehrotra. Supporting Efficient Record Linkage for Large Data Sets Using Mapping Techniques. In WWW, 2006. [LJM06]
19 D. L. McGuinness and F. van Harmelen. OWL Web Ontology Language. http://www.w3.org/TR/owl-features/, 2004. [MH04]
20 F. Manola and E. Miller. RDF Primer. www.w3.org/TR/rdf-primer, February 2004. [MM04]
21 M. Cheatham, Z. Dragisic, J. Euzenat, et al. Results of the Ontology Alignment Evaluation Initiative 2015. In OM, 2015. [CDE15]
References (3)
# Reference Abbreviation
22 J. Noessner, M. Niepert, C. Meilicke, and H. Stuckenschmidt. Leveraging Terminological Structure for Object Reconciliation. In ESWC, 2010. [NNM10]
23 A. Nikolov, V. Uren, E. Motta, and A. de Roeck. Refining Instance Coreferencing Results Using Belief Propagation. In ASWC, 2008. [NUM+08]
24 M. Perry. TOntoGen: A Synthetic Data Set Generator for Semantic Web Applications. AIS SIGSEMIS, 2(2), 2005. [P05]
25 E. Prud'hommeaux and A. Seaborne. SPARQL Query Language for RDF. www.w3.org/TR/rdf-sparql-query, January 2008. [PS08]
26 S. Wang, G. Englebienne, and S. Schlobach. Learning Concept Mappings from Instance Similarity. In ISWC, 2008, pp. 339–355. [WES08]
27 A. J. Williams, L. Harland, P. Groth, S. Pettifer, C. Chichester, E. L. Willighagen, C. T. Evelo, N. Blomberg, G. Ecker, C. Goble, and B. Mons. Open PHACTS: Semantic Interoperability for Drug Discovery. Drug Discovery Today, 17, 1188–1198, 2012. [WHG+12]
28 K. Zaiss, S. Conrad, and S. Vater. A Benchmark for Testing Instance-Based Ontology Matching Methods. In KMIS, 2010. [Z10]
29 J. Gray. Benchmark Handbook: For Database and Transaction Processing Systems. ISBN 1558601597, 1992. [G92]
30 T. Saveta, E. Daskalaki, G. Flouris, I. Fundulaki, M. Herschel, and A.-C. Ngonga Ngomo. Pushing the Limits of Instance Matching Systems: A Semantics-Aware Benchmark for Linked Data. In WWW, 2015. [SDF+15]
31 T. Saveta, E. Daskalaki, G. Flouris, I. Fundulaki, M. Herschel, and A.-C. Ngonga Ngomo. LANCE: Piercing to the Heart of Instance Matching Tools. In ISWC, 2015, pp. 375–391. [SDFF+15]
32 Z. Dragisic, K. Eckert, J. Euzenat, D. Faria, A. Ferrara, R. Granada, V. Ivanova, E. Jimenez-Ruiz, A. Oskar Kempf, P. Lambrix, S. Montanelli, H. Paulheim, D. Ritze, P. Shvaiko, A. Solimando, C. Trojahn, O. Zamazal, and B. Cuenca Grau. Results of the Ontology Alignment Evaluation Initiative 2014. In OM, 2014. [DEE14]