Large Knowledge Collider (LarKC): A Platform for Web Scale Reasoning
Ning Zhong 1,3, Frank van Harmelen 2, Yi Zeng 3, Zhisheng Huang 2
1 Maebashi Institute of Technology, Japan
2 Vrije Universiteit Amsterdam, the Netherlands
3 International WIC Institute, Beijing University of Technology, China
http://www.larkc.eu
Late-breaking news: Google Video is now also annotated with RDFa (using vocabularies from Yahoo and Facebook)
The world is creating Linked Data every day!
four million documents per day
http://www.zemanta.com/
− toxic releases
− consumer expenditure
− recent earthquakes
− consumer price index
− crime statistics
− tornado reports
− assaults on police
− trade statistics
− social benefits
− river elevations
− unemployment rates
− energy consumption
Things to do with data.gov
<rdf:RDF>
  <rdf:Description rdf:about="/music/artists/584c04d2-4acc-491b-8a0a-e63133f4bfc4.rdf">
    <rdfs:label>Description of the artist Yeah Yeah Yeahs</rdfs:label>
    <foaf:primaryTopic rdf:resource="/music/artists/584c04d2-4acc-491b-8a0a-e63133f4bfc4#a…"/>
  </rdf:Description>
  <mo:MusicArtist rdf:about="/music/artists/584c04d2-4acc-491b-8a0a-e63133f4bfc4#a…">
    <rdf:type rdf:resource="http://purl.org/ontology/mo/MusicGroup"/>
    <foaf:name>Yeah Yeah Yeahs</foaf:name>
    <ov:sortLabel>Yeah Yeah Yeahs</ov:sortLabel>
    <bio:event>
      <bio:Birth>
        <bio:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">…</bio:date>
      </bio:Birth>
    </bio:event>
  </mo:MusicArtist>
</rdf:RDF>
Different parallel computing models:
− Peer-to-peer (MaRVIN)
− MapReduce (Reasoning-Hadoop)
The MaRVIN Way: Divide-Conquer-Swap
[Figure: input data is spread over many compute nodes, which exchange partial results until the output data is complete]
Eyal Oren, Spyros Kotoulas
MARVIN (Massive RDF Versatile Inference Network)
… is:
− a distributed technique for computing the RDFS/OWL closure
… scales by:
− distributing computation over many nodes
− approximate (sound but incomplete) reasoning
− anytime convergence (more complete over time)
… runs on:
− in principle: any grid, using Ibis middleware
− the DAS-3 distributed supercomputer (300 nodes)
Divide-Conquer-Swap: SPLIT → COMPUTE → JOIN → repeat
Current performance
200 Million triples in 7.2 minutes on 64 nodes.
Reasoning-Hadoop!
RDFS/OWL reasoning with the MapReduce framework.
The MapReduce Distributed Programming Model
Initially designed and developed at Google in 2004 for large-scale data processing [Dean & Ghemawat 2004].
The computation is expressed with two functions: map and reduce.
MapReduce on 64 machines:
− peak inference rate: 8M triples/sec
− sustained inference rate: 4M triples/sec
[Figure: MapReduce example — input triples (A p C), (A q B), (D r D), (E r D), (F r C) flow through map tasks, are grouped by term, and reduce tasks emit per-term counts: C 2, p 1, r 3, q 1, D 3, F 1]
Map-Reduce — Jacopo Urbani
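In the same spirit as the slide's toy example, here is a minimal single-process MapReduce in plain Python (not Hadoop): map emits (term, 1) for every term of every triple, a shuffle step groups by key, and reduce sums the counts per term:

```python
from collections import defaultdict

def map_phase(triples):
    # map: emit (term, 1) for the subject, predicate and object of each triple
    for s, p, o in triples:
        yield (s, 1)
        yield (p, 1)
        yield (o, 1)

def shuffle(pairs):
    # group emitted pairs by key, as the MapReduce framework would
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts for each term
    return {k: sum(vs) for k, vs in groups.items()}

triples = [("A", "p", "C"), ("A", "q", "B"), ("D", "r", "D"),
           ("E", "r", "D"), ("F", "r", "C")]
counts = reduce_phase(shuffle(map_phase(triples)))
# counts["C"] == 2, counts["r"] == 3, counts["D"] == 3
```

Reasoning-Hadoop applies the same pattern to RDFS/OWL rule application rather than counting, but the map/shuffle/reduce skeleton is identical.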
What to do about the problem of success: cognitive heuristics
On very large datasets, incompleteness is the rule: we must stop before we are finished. But when to stop? Stopping rules are important; they determine
− the length of the computation (don't stop too late)
− the quality of the result (don't stop too early)
Stopping Rules
Take inspiration from economics, biology, psychology
Lael Schooler
Humans have good heuristics for when to stop problem solving.
"Name capital cities in Europe": London, Paris, Berlin, Rome, Amsterdam, … Milan, Madrid, …, …, Paris, …
Cues for stopping: the time between solutions, wrong answers, repetitions.
When to switch between tasks? Humans (and animals) are very good at finding this optimum.
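As an illustration of such a heuristic, here is a repetition-based stopping rule in Python (the patience threshold and the answer stream are invented for the example, not taken from LarKC):

```python
def run_with_stopping_rule(answer_stream, patience=3):
    """Consume (timestamp, answer) pairs; stop once `patience` consecutive
    answers are repetitions, mimicking the 'capital cities' heuristic."""
    seen, results = set(), []
    stale = 0
    for t, ans in answer_stream:
        if ans in seen:
            stale += 1           # repetition: evidence we are near exhaustion
            if stale >= patience:
                break            # stop: don't stop too late
        else:
            stale = 0            # a fresh answer resets the counter
            seen.add(ans)
            results.append(ans)
    return results

stream = [(0, "London"), (1, "Paris"), (2, "Berlin"), (3, "Rome"),
          (5, "Paris"), (6, "London"), (7, "Rome"), (9, "Madrid")]
result = run_with_stopping_rule(stream)
# stops after three repetitions in a row; "Madrid" is never reached
```

Note the trade-off the slide describes: stopping early keeps the computation short, at the price of missing a late answer ("Madrid" here). A time-between-solutions cue could be added by also tracking the gaps between timestamps.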
Where do the axioms come from? Which subset to use? We need relevance measures.
Example: syntactic relevance:
• δ(α,β) = 1 if α and β share a concept symbol
• δ(α,β) = k if δ(α,γ) = k−1 and β and γ share a concept symbol
A very simple and syntactically unstable measure, but it gives a high-quality sound approximation (> 90% recall, 100% precision for small k).
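The measure above can be sketched as a shortest-chain search over axioms that share concept symbols. In this toy sketch, axioms are modelled simply as sets of concept symbols (the axiom names and symbols are invented):

```python
from collections import deque

def syntactic_relevance(axioms, a, b):
    """delta(a, b): 1 if the two axioms share a concept symbol, otherwise the
    length of the shortest chain of axioms linked by shared symbols (BFS)."""
    if axioms[a] & axioms[b]:
        return 1
    frontier, seen = deque([(a, 1)]), {a}
    while frontier:
        cur, k = frontier.popleft()
        for nxt in axioms:
            if nxt in seen or not (axioms[cur] & axioms[nxt]):
                continue                       # not linked to the chain so far
            if axioms[nxt] & axioms[b]:
                return k + 1                   # chain reaches b via nxt
            seen.add(nxt)
            frontier.append((nxt, k + 1))
    return None  # unrelated axioms

axioms = {
    "ax1": {"Bird", "Animal"},
    "ax2": {"Penguin", "Bird"},
    "ax3": {"Penguin", "Flightless"},
}
# ax1 and ax2 share "Bird" -> delta = 1; ax1 reaches ax3 via ax2 -> delta = 2
```

Selecting only axioms with δ ≤ k then yields the sound-but-incomplete subset the slide reports results for.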
Zhisheng Huang
Take data-selection seriously
exploit the grounding of logical symbols in natural language
• Google distance as relevance measure
  = symmetric conditional probability of co-occurrence
  = estimate of semantic distance

NGD(x,y) = ( max{log f(x), log f(y)} − log f(x,y) ) / ( log M − min{log f(x), log f(y)} )

where f(x) is the number of pages containing x, f(x,y) the number containing both, and M the total number of pages indexed.
Gives an almost perfect "forgetting function" for matching class definitions in two vocabularies.
Zhisheng Huang
Take identifiers seriously
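The Normalized Google Distance can be computed directly from hit counts. In this sketch the count table is a stand-in for real search-engine counts (the terms and numbers are invented):

```python
import math

def ngd(x, y, f, M):
    """Normalized Google Distance. f(term) and f((term, term)) are hit
    counts; M is the (estimated) total number of indexed pages."""
    fx, fy, fxy = math.log(f(x)), math.log(f(y)), math.log(f((x, y)))
    return (max(fx, fy) - fxy) / (math.log(M) - min(fx, fy))

# toy hit counts: terms that co-occur often end up at a small distance
counts = {
    "car": 10000, "automobile": 4000, "banana": 5000,
    ("car", "automobile"): 3000, ("car", "banana"): 30,
}
f = lambda t: counts[t]
M = 10_000_000
# "car"/"automobile" co-occur frequently -> smaller NGD than "car"/"banana"
```

Ranking candidate matches by NGD and discarding distant ones gives the "forgetting function" behavior described above.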
Unifying Search and Reasoning from the Viewpoint of Granularity
Unifying Search and Reasoning (ReaSearch) [Fensel2007]
A comparative study of TI during 1990-2008 and IR in 2009
Differences in the contribution values of papers published in different years
A comparative study of the predicted and actual publication numbers under the power-law model
A comparative study of the predicted and actual publication numbers under the exponential-law model
Evaluations and the Released Dataset
• Interest retention vs. future interests: for the 1226 authors with ≥ 100 publications, using their top 9 interests from 2000 to 2007, the model correctly predicts 3 out of 9 interests for 49.54% of them.
• 615,124 computer scientists in the SwetoDBLP dataset.
• http://wiki.larkc.eu/csri-rdf
DBLP-SSE : DBLP Search Support Engine
Recent interests are extracted using the power law interest retention model.
Terms with high frequency do not necessarily have high interest retention. (e.g. “Knowledge”)
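The idea can be sketched by weighting each past mention of a term with a power-law retention factor, so recent mentions dominate. The decay exponent and the publication years below are invented for illustration; the actual model parameters are fitted in the study:

```python
def retained_interest(mentions, now, b=1.0):
    """Power-law retention: each past mention of a term contributes
    (age)^(-b), so recent mentions dominate. `b` is an illustrative
    decay exponent, not the value fitted on DBLP."""
    return sum((now - year + 1) ** (-b) for year in mentions)

now = 2009
knowledge = [1998, 1999, 2000, 2001, 2002]   # frequent but mostly old mentions
semantic = [2007, 2008, 2009]                # fewer but recent mentions

# "semantic" has fewer raw mentions, yet higher retained interest
```

This is exactly why a high-frequency term like "Knowledge" need not surface as a current interest: its mentions are old, so its retention-weighted score is low.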
DBLP-SSE : DBLP Search Support Engine
Query: Artificial Intelligence
Top 9 interests: Web, Service, Semantic, Architecture, Model, Ontology, Knowledge, Computing, Language
List 1: without current-interest constraints (top 5 results)
* PROLOG Programming for Artificial Intelligence, Second Edition.
* Artificial Intelligence Architectures for Composition and Performance Environment.
* Artificial Intelligence in Music Education: A Critical Review.
* Music, Intelligence and Artificiality. Artificial Intelligence and Music Education.
* Musical Knowledge: What can Artificial Intelligence Bring to the Musician?
* …
List 2: with current-interest constraints (top 5 results)
* Web Intelligence and Artificial Intelligence in Education.
* Artificial Intelligence Exchange and Service Tie to All Test Environments (AI-ESTATE) - A New Standard for System Diagnostics.
* Semantic Model for Artificial Intelligence Based on Molecular Computing.
* Open Information Systems Semantics for Distributed Artificial Intelligence.
* Artificial Intelligence and Financial Services.
* …
Dieter Fensel
Multi-level Completeness Strategy
With limited time, settle for low completeness; with more time available, aim for high completeness.
One practical question: how to choose the nodes to be reasoned over?
Choose the pivotal nodes in the network first!
Another question: if I stop here, what is the completeness now?
Multi-level Completeness Strategy
degree(n, Pcn) to stop   Satisfied authors   AI authors
70                       2885                151
30                       17121               579
11                       78868               1142
4                        277417              1704
1                        575447              2225
0                        615124              2355
Comparison of predicted and actual completeness value.
Unifying search and reasoning with multilevel completeness and anytime behavior.
Completeness Prediction Function:
PC(i), a ratio defined over |Nrel(i)|, |Nsub(i)|, |Nsub(i)'| and |N| (the relevant nodes, the selected sub-nodes, and all nodes).
“Who are authors in Artificial Intelligence?”
Nodes are grouped together by node degree under a given perspective.
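The degree-first, anytime behavior can be sketched as follows. The graph, the matching predicate, and the thresholds are toy stand-ins (the real task runs over SwetoDBLP with the degree cut-offs from the table above):

```python
def anytime_by_degree(nodes, degree, matches, thresholds):
    """Process nodes in decreasing-degree groups; after each threshold,
    report (threshold, answers so far, completeness vs. full answer set).
    `matches` decides whether a node answers the query."""
    total = sum(1 for n in nodes if matches(n))
    found, report = [], []
    ordered = sorted(nodes, key=degree, reverse=True)
    for cut in thresholds:                 # e.g. degree >= 70, >= 30, ...
        for n in ordered:
            if degree(n) >= cut and n not in found and matches(n):
                found.append(n)
        report.append((cut, len(found), len(found) / total))
    return report

# toy graph: node -> degree; "a", "c", "d" are the nodes that match the query
nodes = {"a": 80, "b": 40, "c": 12, "d": 2, "e": 75}
report = anytime_by_degree(nodes, nodes.get,
                           lambda n: n in {"a", "c", "d"},
                           thresholds=[70, 30, 11, 0])
# completeness grows monotonically toward 1.0 as the threshold drops
```

Stopping after any threshold yields a partial answer set together with the completeness figure the slide asks for ("if I stop here, what is the completeness now?").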
Multi-level Specificity Strategy
With limited time, give general answers; with more time available, give more specific answers.
A Case Study on the Multi-level Specificity Strategy
[Table: Specificity / Relevant Keywords / Number of Authors]
Answers to “Who are the authors in Artificial Intelligence?” in multiple levels of specificity according to the hierarchical ontology of Artificial Intelligence.
Specificity    Number of authors   Completeness
Level 1        2355                0.85%
Level 1,2      207468              75.11%
Level 1,2,3    276205              100%
A comparative study of the answers at different levels of specificity.
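Multi-level answering can be sketched over a toy topic hierarchy (the hierarchy, authors, and topics below are invented; the study itself uses an Artificial Intelligence ontology over SwetoDBLP): level 1 uses only the root label, and each deeper level adds subtopic keywords, so the answer set grows toward full completeness:

```python
def authors_at_specificity(hierarchy, author_topics, root, max_level):
    """Collect authors whose topics match any keyword within `max_level`
    levels of the hierarchy rooted at `root` (level 1 = the root label)."""
    keywords, frontier = set(), {root}
    for _ in range(max_level):
        keywords |= frontier                                  # absorb level
        frontier = {c for t in frontier for c in hierarchy.get(t, [])}
    return {a for a, topics in author_topics.items() if topics & keywords}

hierarchy = {
    "Artificial Intelligence": ["Machine Learning", "Knowledge Representation"],
    "Machine Learning": ["Neural Networks"],
}
author_topics = {
    "alice": {"Artificial Intelligence"},
    "bob": {"Machine Learning"},
    "carol": {"Neural Networks"},
}
# level 1 -> {alice}; level 2 adds bob; level 3 adds carol
```

This mirrors the table above: shallow levels answer quickly but incompletely, and deepening the level trades time for completeness.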
The Multi-perspective Strategy
Multiple representations of knowledge [Minsky2006]
User needs differ from each other <--> they expect answers from different perspectives.
Normalized Degree Distribution of predicates in SwetoDBLP dataset
The Multi-perspective Strategy
Fig. 2. Coauthor number distribution in the SwetoDBLP dataset.
Fig. 3. Log-log diagram of Figure 2.
Fig. 4. A zoomed-in version of Figure 2.
Fig. 5. A zoomed-in version of the coauthor distribution for "Artificial Intelligence".
Fig. 6. Publication number distribution in the SwetoDBLP dataset.
Fig. 7. Log-log diagram of Figure 6.
Under different perspectives, the distribution characteristics are different!
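The effect can be illustrated with a toy ranking (the author records and numbers are invented for the example, not taken from SwetoDBLP): sorting the same authors by publication count and by coauthor count yields different orderings, i.e. different answers under different perspectives:

```python
def rank(authors, key):
    """Order author names by a perspective-specific key, highest first."""
    return [name for name, rec in sorted(authors.items(),
                                         key=lambda kv: key(kv[1]),
                                         reverse=True)]

# toy records: name -> (publication count, coauthor count)
authors = {
    "huang": (387, 271),
    "kesselman": (80, 312),
    "mylopoulos": (261, 245),
}
by_pubs = rank(authors, key=lambda rec: rec[0])      # publication perspective
by_coauth = rank(authors, key=lambda rec: rec[1])    # coauthor perspective
# the two perspectives produce different top authors
```
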
Comparison of Results from Different Perspectives
Publication number perspective   Coauthor number perspective
Thomas S. Huang (387)            Carl Kesselman (312)
John Mylopoulos (261)            Thomas S. Huang (271)
Hsinchun Chen (260)              Edward A. Fox (269)
Henri Prade (252)                Lei Wang (250)
Didier Dubois (241)              John Mylopoulos (245)
Thomas Eiter (219)               Ewa Deelman (237)
…                                …
A partial result of the multilevel specificity reasoning task: the list of authors in "Artificial Intelligence" at level 1, from two perspectives.
Summarizing
The Semantic Web is rapidly becoming real
Scale is becoming a real problem
Different ways of scaling up:
− parallelization
− exploiting cognitive heuristics (stopping rules, cognitive memory retention, etc.)
− data-selection for incomplete reasoning
− new forms of reasoning
LarKC Chinese Forum
Acknowledgement
The slides for this talk are mainly from 3 previous talks:
Frank van Harmelen. Large Scale Reasoning on the Semantic Web or: When success is becoming a problem. Invited talk at the 2009 International Joint Conferences on Active Media Technology and Brain Informatics.
Yi Zeng. Unifying Web-scale Search and Reasoning from the viewpoint of Granularity. the 2009 International Joint Conferences on Active Media Technology and Brain Informatics.
Spyros Kotoulas. Marvin and the Billion Triple Challenge. Super Computing Seminar, University of Amsterdam, 2008.
References
[Berners-Lee1999] Berners-Lee, T., Fischetti, M.: Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor. HarperSanFrancisco (1999)
[Fensel2007] Fensel, D., van Harmelen, F.: Unifying reasoning and search to web scale. IEEE Internet Computing 11(2) (2007) 94-96
[Minsky2006] Minsky, M.: The Emotion Machine: Commonsense Thinking, Artificial Intelligence, and the Future of the Human Mind. Simon & Schuster (2006)
[Rogers2007] Rogers, T., Patterson, K.: Object categorization: Reversals and explanations of the basic-level advantage. Journal of Experimental Psychology: General 136(3) (2007) 451-469
[Wickelgren1976] Wickelgren, W.: Memory storage dynamics. In: Handbook of Learning and Cognitive Processes. Lawrence Erlbaum Associates, Hillsdale, NJ (1976) 321-361
[Aleman-Meza2007] Aleman-Meza, B., Hakimpour, F., Arpinar, I., Sheth, A.: SwetoDBLP ontology of computer science publications. Web Semantics: Science, Services and Agents on the World Wide Web 5(3) (2007) 151-155
[Ebbinghaus1913] Ebbinghaus, H.: Memory: A Contribution to Experimental Psychology. Teachers College, Columbia University (1913)