Large Knowledge Collider (LarKC): A Platform for Web Scale Reasoning
Ning Zhong 1,3, Frank van Harmelen 2, Yi Zeng 3, Zhisheng Huang 2
1 Maebashi Institute of Technology, Japan
2 Vrije Universiteit Amsterdam, the Netherlands
3 International WIC Institute, Beijing University of Technology, China
http://www.larkc.eu
Late-breaking news: Google Video is now also annotated with RDFa (using vocabularies from Yahoo and Facebook)
The world is creating Linked Data every day!
four million documents per day
http://www.zemanta.com/
− toxic releases
− consumer expenditure
− recent earthquakes
− consumer price index
− crime statistics
− tornado reports
− assaults on police
− trade statistics
− social benefits
− river elevations
− unemployment rates
− energy consumption
Things to do with data.gov
<rdf:RDF>
  <rdf:Description rdf:about="/music/artists/584c04d2-4acc-491b-8a0a-e63133f4bfc4.rdf">
    <rdfs:label>Description of the artist Yeah Yeah Yeahs</rdfs:label>
    <foaf:primaryTopic rdf:resource="/music/artists/584c04d2-4acc-491b-8a0a-e63133f4bfc4#a…"/>
  </rdf:Description>
  <mo:MusicArtist rdf:about="/music/artists/584c04d2-4acc-491b-8a0a-e63133f4bfc4#a…">
    <rdf:type rdf:resource="http://purl.org/ontology/mo/MusicGroup"/>
    <foaf:name>Yeah Yeah Yeahs</foaf:name>
    <ov:sortLabel>Yeah Yeah Yeahs</ov:sortLabel>
    <bio:event>
      <bio:Birth>
        <bio:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">…</bio:date>
      </bio:Birth>
    </bio:event>
  </mo:MusicArtist>
</rdf:RDF>
Different parallel computing models:
− Peer-to-peer (MaRVIN)
− MapReduce (Reasoning-Hadoop)
The MaRVIN Way: Divide-Conquer-Swap
[Figure: input data is spread over many compute nodes, which exchange partial results until the output data is complete]
Eyal Oren, Spyros Kotoulas
MARVIN (Massive RDF Versatile Inference Network)
… is:
− a distributed technique for computing the RDFS/OWL closure
… scales by:
− distributing computation over many nodes
− approximate (sound but incomplete) reasoning
− anytime convergence (more complete over time)
… runs on:
− in principle: any grid, using Ibis middleware
− the DAS-3 distributed supercomputer (300 nodes)
Divide-Conquer-Swap: SPLIT → COMPUTE → JOIN → repeat
Current performance
200 Million triples in 7.2 minutes on 64 nodes.
Reasoning-Hadoop!
RDFS/OWL reasoning with the MapReduce framework.
The MapReduce Distributed Programming Model
Initially designed and developed at Google in 2004 for large-scale data processing [Dean & Ghemawat 2004].
The computation is expressed with two functions: map and reduce.
MapReduce on 64 machines:
− peak inference rate: 8M triples/sec
− sustained inference rate: 4M triples/sec
[Figure: MapReduce example — input triples (A p C), (A q B), (D r D), (E r D), (F r C) flow through map tasks, are grouped by term, and reduce tasks emit per-term counts: C 2, p 1, r 3, q 1, D 3, F 1]
Map-Reduce — Jacopo Urbani
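In the same spirit as the slide's toy example, here is a minimal single-process MapReduce in plain Python (not Hadoop): map emits (term, 1) for every term of every triple, a shuffle step groups by key, and reduce sums the counts per term:

```python
from collections import defaultdict

def map_phase(triples):
    # map: emit (term, 1) for the subject, predicate and object of each triple
    for s, p, o in triples:
        yield (s, 1)
        yield (p, 1)
        yield (o, 1)

def shuffle(pairs):
    # group emitted pairs by key, as the MapReduce framework would
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts for each term
    return {k: sum(vs) for k, vs in groups.items()}

triples = [("A", "p", "C"), ("A", "q", "B"), ("D", "r", "D"),
           ("E", "r", "D"), ("F", "r", "C")]
counts = reduce_phase(shuffle(map_phase(triples)))
# counts["C"] == 2, counts["r"] == 3, counts["D"] == 3
```

Reasoning-Hadoop applies the same pattern to RDFS/OWL rule application rather than counting, but the map/shuffle/reduce skeleton is identical.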
What to do about the problem of success: cognitive heuristics
On very large datasets, incompleteness is the rule: we must stop before we are finished. But when to stop? Stopping rules are important; they determine
− the length of the computation (don't stop too late)
− the quality of the result (don't stop too early)
Stopping Rules
Take inspiration from economics, biology, psychology
Lael Schooler
Humans have good heuristics for when to stop problem solving.
"Name capital cities in Europe": London, Paris, Berlin, Rome, Amsterdam, … Milan, Madrid, …, …, Paris, …
Cues for stopping: the time between solutions, wrong answers, repetitions.
When to switch between tasks? Humans (and animals) are very good at finding this optimum.
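As an illustration of such a heuristic, here is a repetition-based stopping rule in Python (the patience threshold and the answer stream are invented for the example, not taken from LarKC):

```python
def run_with_stopping_rule(answer_stream, patience=3):
    """Consume (timestamp, answer) pairs; stop once `patience` consecutive
    answers are repetitions, mimicking the 'capital cities' heuristic."""
    seen, results = set(), []
    stale = 0
    for t, ans in answer_stream:
        if ans in seen:
            stale += 1           # repetition: evidence we are near exhaustion
            if stale >= patience:
                break            # stop: don't stop too late
        else:
            stale = 0            # a fresh answer resets the counter
            seen.add(ans)
            results.append(ans)
    return results

stream = [(0, "London"), (1, "Paris"), (2, "Berlin"), (3, "Rome"),
          (5, "Paris"), (6, "London"), (7, "Rome"), (9, "Madrid")]
result = run_with_stopping_rule(stream)
# stops after three repetitions in a row; "Madrid" is never reached
```

Note the trade-off the slide describes: stopping early keeps the computation short, at the price of missing a late answer ("Madrid" here). A time-between-solutions cue could be added by also tracking the gaps between timestamps.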
Where do the axioms come from? Which subset to use? We need relevance measures.
Example: syntactic relevance:
• δ(α,β) = 1 if α and β share a concept symbol
• δ(α,β) = k if δ(α,γ) = k−1 and β and γ share a concept symbol
A very simple and syntactically unstable measure, but it gives a high-quality sound approximation (> 90% recall, 100% precision for small k).
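The measure above can be sketched as a shortest-chain search over axioms that share concept symbols. In this toy sketch, axioms are modelled simply as sets of concept symbols (the axiom names and symbols are invented):

```python
from collections import deque

def syntactic_relevance(axioms, a, b):
    """delta(a, b): 1 if the two axioms share a concept symbol, otherwise the
    length of the shortest chain of axioms linked by shared symbols (BFS)."""
    if axioms[a] & axioms[b]:
        return 1
    frontier, seen = deque([(a, 1)]), {a}
    while frontier:
        cur, k = frontier.popleft()
        for nxt in axioms:
            if nxt in seen or not (axioms[cur] & axioms[nxt]):
                continue                       # not linked to the chain so far
            if axioms[nxt] & axioms[b]:
                return k + 1                   # chain reaches b via nxt
            seen.add(nxt)
            frontier.append((nxt, k + 1))
    return None  # unrelated axioms

axioms = {
    "ax1": {"Bird", "Animal"},
    "ax2": {"Penguin", "Bird"},
    "ax3": {"Penguin", "Flightless"},
}
# ax1 and ax2 share "Bird" -> delta = 1; ax1 reaches ax3 via ax2 -> delta = 2
```

Selecting only axioms with δ ≤ k then yields the sound-but-incomplete subset the slide reports results for.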
Zhisheng Huang
Take data-selection seriously
exploit the grounding of logical symbols in natural language
• Google distance as relevance measure
  = symmetric conditional probability of co-occurrence
  = estimate of semantic distance

NGD(x,y) = ( max{log f(x), log f(y)} − log f(x,y) ) / ( log M − min{log f(x), log f(y)} )

where f(x) is the number of pages containing x, f(x,y) the number containing both, and M the total number of pages indexed.
Gives an almost perfect "forgetting function" for matching class definitions in two vocabularies.
Zhisheng Huang
Take identifiers seriously
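The Normalized Google Distance can be computed directly from hit counts. In this sketch the count table is a stand-in for real search-engine counts (the terms and numbers are invented):

```python
import math

def ngd(x, y, f, M):
    """Normalized Google Distance. f(term) and f((term, term)) are hit
    counts; M is the (estimated) total number of indexed pages."""
    fx, fy, fxy = math.log(f(x)), math.log(f(y)), math.log(f((x, y)))
    return (max(fx, fy) - fxy) / (math.log(M) - min(fx, fy))

# toy hit counts: terms that co-occur often end up at a small distance
counts = {
    "car": 10000, "automobile": 4000, "banana": 5000,
    ("car", "automobile"): 3000, ("car", "banana"): 30,
}
f = lambda t: counts[t]
M = 10_000_000
# "car"/"automobile" co-occur frequently -> smaller NGD than "car"/"banana"
```

Ranking candidate matches by NGD and discarding distant ones gives the "forgetting function" behavior described above.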
Unifying Search and Reasoning from the Viewpoint of Granularity
Unifying Search and Reasoning (ReaSearch) [Fensel2007]
A comparative study of TI during 1990-2008 and IR in 2009
Differences in the contribution values of papers published in different years
A comparative study of the predicted and actual publication numbers under the power-law model
A comparative study of the predicted and actual publication numbers under the exponential-law model
Evaluations and the Released Dataset
• Interest retention vs. future interests: for the 1226 authors with ≥ 100 publications, using their top 9 interests from 2000 to 2007, the model correctly predicts 3 out of 9 interests for 49.54% of them.
• 615,124 computer scientists in the SwetoDBLP dataset.
• http://wiki.larkc.eu/csri-rdf
DBLP-SSE : DBLP Search Support Engine
Recent interests are extracted using the power law interest retention model.
Terms with high frequency do not necessarily have high interest retention. (e.g. “Knowledge”)
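The idea can be sketched by weighting each past mention of a term with a power-law retention factor, so recent mentions dominate. The decay exponent and the publication years below are invented for illustration; the actual model parameters are fitted in the study:

```python
def retained_interest(mentions, now, b=1.0):
    """Power-law retention: each past mention of a term contributes
    (age)^(-b), so recent mentions dominate. `b` is an illustrative
    decay exponent, not the value fitted on DBLP."""
    return sum((now - year + 1) ** (-b) for year in mentions)

now = 2009
knowledge = [1998, 1999, 2000, 2001, 2002]   # frequent but mostly old mentions
semantic = [2007, 2008, 2009]                # fewer but recent mentions

# "semantic" has fewer raw mentions, yet higher retained interest
```

This is exactly why a high-frequency term like "Knowledge" need not surface as a current interest: its mentions are old, so its retention-weighted score is low.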
DBLP-SSE : DBLP Search Support Engine
Query: Artificial Intelligence
Top 9 interests: Web, Service, Semantic, Architecture, Model, Ontology, Knowledge, Computing, Language
List 1: without current-interest constraints (top 5 results)
* PROLOG Programming for Artificial Intelligence, Second Edition.
* Artificial Intelligence Architectures for Composition and Performance Environment.
* Artificial Intelligence in Music Education: A Critical Review.
* Music, Intelligence and Artificiality. Artificial Intelligence and Music Education.
* Musical Knowledge: What can Artificial Intelligence Bring to the Musician?
* …
List 2: with current-interest constraints (top 5 results)
* Web Intelligence and Artificial Intelligence in Education.
* Artificial Intelligence Exchange and Service Tie to All Test Environments (AI-ESTATE) - A New Standard for System Diagnostics.
* Semantic Model for Artificial Intelligence Based on Molecular Computing.
* Open Information Systems Semantics for Distributed Artificial Intelligence.
* Artificial Intelligence and Financial Services.
* …
Dieter Fensel
Multi-level Completeness Strategy
With limited time, settle for low completeness; with more time available, aim for high completeness.
One practical question: how to choose the nodes to be reasoned over?
Choose the pivotal nodes in the network first!
Another question: if I stop here, what is the completeness now?
Multi-level Completeness Strategy
degree(n, Pcn) to stop   Satisfied authors   AI authors
70                       2885                151
30                       17121               579
11                       78868               1142
4                        277417              1704
1                        575447              2225
0                        615124              2355
Comparison of predicted and actual completeness value.
Unifying search and reasoning with multilevel completeness and anytime behavior.
Completeness Prediction Function:
PC(i), a ratio defined over |Nrel(i)|, |Nsub(i)|, |Nsub(i)'| and |N| (the relevant nodes, the selected sub-nodes, and all nodes).
“Who are authors in Artificial Intelligence?”
Nodes are grouped together by node degree under a given perspective.
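The degree-first, anytime behavior can be sketched as follows. The graph, the matching predicate, and the thresholds are toy stand-ins (the real task runs over SwetoDBLP with the degree cut-offs from the table above):

```python
def anytime_by_degree(nodes, degree, matches, thresholds):
    """Process nodes in decreasing-degree groups; after each threshold,
    report (threshold, answers so far, completeness vs. full answer set).
    `matches` decides whether a node answers the query."""
    total = sum(1 for n in nodes if matches(n))
    found, report = [], []
    ordered = sorted(nodes, key=degree, reverse=True)
    for cut in thresholds:                 # e.g. degree >= 70, >= 30, ...
        for n in ordered:
            if degree(n) >= cut and n not in found and matches(n):
                found.append(n)
        report.append((cut, len(found), len(found) / total))
    return report

# toy graph: node -> degree; "a", "c", "d" are the nodes that match the query
nodes = {"a": 80, "b": 40, "c": 12, "d": 2, "e": 75}
report = anytime_by_degree(nodes, nodes.get,
                           lambda n: n in {"a", "c", "d"},
                           thresholds=[70, 30, 11, 0])
# completeness grows monotonically toward 1.0 as the threshold drops
```

Stopping after any threshold yields a partial answer set together with the completeness figure the slide asks for ("if I stop here, what is the completeness now?").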
Multi-level Specificity Strategy
With limited time, give general answers; with more time available, give more specific answers.
A Case Study on the Multi-level Specificity Strategy
[Table: Specificity / Relevant Keywords / Number of Authors]
Answers to “Who are the authors in Artificial Intelligence?” in multiple levels of specificity according to the hierarchical ontology of Artificial Intelligence.
Specificity    Number of authors   Completeness
Level 1        2355                0.85%
Level 1,2      207468              75.11%
Level 1,2,3    276205              100%
A comparative study of the answers at different levels of specificity.
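Multi-level answering can be sketched over a toy topic hierarchy (the hierarchy, authors, and topics below are invented; the study itself uses an Artificial Intelligence ontology over SwetoDBLP): level 1 uses only the root label, and each deeper level adds subtopic keywords, so the answer set grows toward full completeness:

```python
def authors_at_specificity(hierarchy, author_topics, root, max_level):
    """Collect authors whose topics match any keyword within `max_level`
    levels of the hierarchy rooted at `root` (level 1 = the root label)."""
    keywords, frontier = set(), {root}
    for _ in range(max_level):
        keywords |= frontier                                  # absorb level
        frontier = {c for t in frontier for c in hierarchy.get(t, [])}
    return {a for a, topics in author_topics.items() if topics & keywords}

hierarchy = {
    "Artificial Intelligence": ["Machine Learning", "Knowledge Representation"],
    "Machine Learning": ["Neural Networks"],
}
author_topics = {
    "alice": {"Artificial Intelligence"},
    "bob": {"Machine Learning"},
    "carol": {"Neural Networks"},
}
# level 1 -> {alice}; level 2 adds bob; level 3 adds carol
```

This mirrors the table above: shallow levels answer quickly but incompletely, and deepening the level trades time for completeness.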
The Multi-perspective Strategy
Multiple representations of knowledge [Minsky2006]
User needs differ from each other <--> they expect answers from different perspectives.
Normalized Degree Distribution of predicates in SwetoDBLP dataset
The Multi-perspective Strategy
Fig. 2. Coauthor number distribution in the SwetoDBLP dataset.
Fig. 3. Log-log diagram of Figure 2.
Fig. 4. A zoomed-in version of Figure 2.
Fig. 5. A zoomed-in version of the coauthor distribution for "Artificial Intelligence".
Fig. 6. Publication number distribution in the SwetoDBLP dataset.
Fig. 7. Log-log diagram of Figure 6.
Under different perspectives, the distribution characteristics are different!
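The effect can be illustrated with a toy ranking (the author records and numbers are invented for the example, not taken from SwetoDBLP): sorting the same authors by publication count and by coauthor count yields different orderings, i.e. different answers under different perspectives:

```python
def rank(authors, key):
    """Order author names by a perspective-specific key, highest first."""
    return [name for name, rec in sorted(authors.items(),
                                         key=lambda kv: key(kv[1]),
                                         reverse=True)]

# toy records: name -> (publication count, coauthor count)
authors = {
    "huang": (387, 271),
    "kesselman": (80, 312),
    "mylopoulos": (261, 245),
}
by_pubs = rank(authors, key=lambda rec: rec[0])      # publication perspective
by_coauth = rank(authors, key=lambda rec: rec[1])    # coauthor perspective
# the two perspectives produce different top authors
```
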
Comparison of Results from Different Perspectives
Publication number perspective   Coauthor number perspective
Thomas S. Huang (387)            Carl Kesselman (312)
John Mylopoulos (261)            Thomas S. Huang (271)
Hsinchun Chen (260)              Edward A. Fox (269)
Henri Prade (252)                Lei Wang (250)
Didier Dubois (241)              John Mylopoulos (245)
Thomas Eiter (219)               Ewa Deelman (237)
…                                …
A partial result of the multilevel specificity reasoning task: the list of authors in "Artificial Intelligence" at level 1, from two perspectives.
Summarizing
The Semantic Web is rapidly becoming real
Scale is becoming a real problem
Different ways of scaling up:
− parallelization
− exploiting cognitive heuristics (stopping rules, cognitive memory retention, etc.)
− data-selection for incomplete reasoning
− new forms of reasoning
LarKC Chinese Forum
Acknowledgement
The slides for this talk are mainly from 3 previous talks:
Frank van Harmelen. Large Scale Reasoning on the Semantic Web or: When success is becoming a problem. Invited talk at the 2009 International Joint Conferences on Active Media Technology and Brain Informatics.
Yi Zeng. Unifying Web-scale Search and Reasoning from the viewpoint of Granularity. the 2009 International Joint Conferences on Active Media Technology and Brain Informatics.
Spyros Kotoulas. Marvin and the Billion Triple Challenge. Super Computing Seminar, University of Amsterdam, 2008.
References
[Berners-Lee1999] Berners-Lee, T., Fischetti, M.: Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor. HarperSanFrancisco (1999)
[Fensel2007] Fensel, D., van Harmelen, F.: Unifying reasoning and search to web scale. IEEE Internet Computing 11(2) (2007) 94-96
[Minsky2006] Minsky, M.: The Emotion Machine: Commonsense Thinking, Artificial Intelligence, and the Future of the Human Mind. Simon & Schuster (2006)
[Rogers2007] Rogers, T., Patterson, K.: Object categorization: Reversals and explanations of the basic-level advantage. Journal of Experimental Psychology: General 136(3) (2007) 451-469
[Wickelgren1976] Wickelgren, W.: Memory storage dynamics. In: Handbook of Learning and Cognitive Processes. Lawrence Erlbaum Associates, Hillsdale, NJ (1976) 321-361
[Aleman-Meza2007] Aleman-Meza, B., Hakimpour, F., Arpinar, I., Sheth, A.: SwetoDBLP ontology of computer science publications. Web Semantics: Science, Services and Agents on the World Wide Web 5(3) (2007) 151-155
[Ebbinghaus1913] Ebbinghaus, H.: Memory: A Contribution to Experimental Psychology. Teachers College, Columbia University (1913)