Interactive Reasoning in Large and Uncertain RDF Knowledge Bases
Martin Theobald
Joint work with: Maximilian Dylla, Timm Meiser, Ndapa Nakashole, Christina Teflioudi, Yafang Wang, Mohamed Yahya, Mauro Sozio, and Fabian Suchanek
Max Planck Institute for Informatics
French Marriage Problem
...
marriedTo: person × person
marriedTo_French: person × person
∀x,y,z: marriedTo(x,y) ∧ marriedTo(x,z) ⇒ y=z
French Marriage Problem
Facts in KB:
marriedTo(Hillary, Bill)
marriedTo(Carla, Nicolas)
marriedTo(Angelina, Brad)
New facts or fact candidates:
marriedTo(Cecilia, Nicolas)
marriedTo(Carla, Benjamin)
marriedTo(Carla, Mick)
marriedTo(Michelle, Barack)
marriedTo(Yoko, John)
marriedTo(Kate, Leonardo)
marriedTo(Carla, Sofie)
marriedTo(Larry, Google)
1) for recall: pattern-based harvesting
2) for precision: consistency reasoning
∀x,y,z: marriedTo(x,y) ∧ marriedTo(x,z) ⇒ y=z
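The consistency check behind step 2 can be sketched in a few lines. This is only an illustration of the constraint ∀x,y,z: marriedTo(x,y) ∧ marriedTo(x,z) ⇒ y=z applied to the facts on this slide; the names `KB`, `CANDIDATES` and `fd_violations` are hypothetical and not part of URDF.

```python
from collections import defaultdict

# Facts from the slide, written as (x, y) pairs of marriedTo(x, y).
KB = [("Hillary", "Bill"), ("Carla", "Nicolas"), ("Angelina", "Brad")]
CANDIDATES = [
    ("Cecilia", "Nicolas"), ("Carla", "Benjamin"), ("Carla", "Mick"),
    ("Michelle", "Barack"), ("Yoko", "John"), ("Kate", "Leonardo"),
    ("Carla", "Sofie"), ("Larry", "Google"),
]

def fd_violations(pairs):
    """Return every x that has more than one distinct y in marriedTo(x, y)."""
    spouses = defaultdict(set)
    for x, y in pairs:
        spouses[x].add(y)
    return {x: sorted(ys) for x, ys in spouses.items() if len(ys) > 1}

if __name__ == "__main__":
    print(fd_violations(KB + CANDIDATES))
```

The check only flags the conflict; deciding which of the conflicting facts to keep is the job of the reasoning step. A candidate such as marriedTo(Larry, Google) passes this check and would need a separate type constraint.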
Agenda
– URDF: Reasoning in Uncertain Knowledge Bases
• Resolving uncertainty at query-time
• Lineage of answers
• Propositional vs. probabilistic reasoning
• Temporal reasoning extensions
– UViz: The URDF Visualization Frontend
• Demo!
URDF: Reasoning in Uncertain KBs
• Knowledge harvesting from the Web may yield knowledge bases which are
– Incomplete: bornIn(Albert_Einstein,?x) → {}
– Incorrect: bornIn(Albert_Einstein,?x) → {Stuttgart}
– Inconsistent: bornIn(Albert_Einstein,?x) → {Ulm, Stuttgart}
• Combine grounding of first-order logic rules with an additional step of consistency reasoning
– Propositional: Constrained Weighted MaxSat
– Probabilistic: Lineage & Possible Worlds Semantics
• At query time!
[Theobald,Sozio,Suchanek,Nakashole: MPII Tech-Report‘10]
0.7 0.2
Soft Rules vs. Hard Constraints
(Soft) Inference Rules vs. (Hard) Consistency Constraints
• People may live in more than one place
livesIn(x,y) ∧ marriedTo(x,z) ⇒ livesIn(z,y)
livesIn(x,y) ∧ hasChild(x,z) ⇒ livesIn(z,y)
• People are not born in different places/on different dates
bornIn(x,y) ∧ bornIn(x,z) ⇒ y=z
• People are not married to more than one person (at the same time, in most countries?)
marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y≠z ⇒ disjoint(t1,t2)
[0.6] [0.2]
Soft Rules vs. Hard Constraints (ct’d)
Enforce FDs (e.g., mutual exclusion) as hard constraints:
hasAdvisor(x,y) ∧ hasAdvisor(x,z) ⇒ y=z
Combine soft and hard constraints
– No longer regular MaxSat
– Constrained (weighted) MaxSat instead
Generalize to other forms of constraints:
Hard constraint:
hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t
Soft constraint:
firstPaper(x,p) ∧ firstPaper(y,q) ∧ author(p,x) ∧ author(p,y) ∧ inYear(p) > inYear(q)+5years ⇒ hasAdvisor(x,y) [0.6]
Datalog-style grounding (deductive & potentially recursive soft rules):
livesIn(x,y) ∧ type(y,City) ∧ locatedIn(y,z) ∧ type(z,Country) ⇒ livesIn(x,z)
Deductive Grounding (SLD Resolution/Datalog)
Query: livesIn(Bill, ?x)
First-Order Rules (Horn clauses)
R1: livesIn(?x, ?y) :- marriedTo(?x, ?z), livesIn(?z, ?y)
R2: livesIn(?x, ?y) :- represents(?x, ?y)
R3: livesIn(?x, ?y) :- governorOf(?x, ?y)
RDF Base Facts
F1: marriedTo(Bill, Hillary)
F2: represents(Hillary, New_York)
F3: governorOf(Bill, Arkansas)
[Figure: SLD resolution tree with OR (\/) and AND (/\) nodes over R1, R2, R3; leaves F1, F2, F3; failed branches marked X]
Answers (derived facts):
livesIn(Bill, Arkansas)
livesIn(Bill, New_York)
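To see how the two answers arise from R1 to R3 and F1 to F3, here is a minimal sketch. Note the swap: it uses naive bottom-up fixpoint evaluation instead of the top-down SLD resolution named on the slide, because that is shorter to write and reaches the same answers on this example. The rules are hard-coded, and all function names are hypothetical.

```python
# Base facts F1-F3 from the slide, as (predicate, subject, object) triples.
FACTS = {
    ("marriedTo", "Bill", "Hillary"),
    ("represents", "Hillary", "New_York"),
    ("governorOf", "Bill", "Arkansas"),
}

def derive_once(facts):
    """Apply R1, R2 and R3 once and return the facts they produce."""
    new = set()
    for pred, x, y in facts:
        if pred in ("represents", "governorOf"):      # R2 and R3
            new.add(("livesIn", x, y))
        if pred == "marriedTo":                       # R1: join on ?z
            for pred2, z, place in facts:
                if pred2 == "livesIn" and z == y:
                    new.add(("livesIn", x, place))
    return new

def fixpoint(facts):
    """Repeat rule application until no new fact is derived."""
    facts = set(facts)
    while True:
        new = derive_once(facts) - facts
        if not new:
            return facts
        facts |= new

def query_lives_in(person):
    """Answer livesIn(person, ?x) over the closure of the base facts."""
    return sorted(o for p, s, o in fixpoint(FACTS) if p == "livesIn" and s == person)

if __name__ == "__main__":
    print(query_lives_in("Bill"))
```

Unlike SLD resolution, this derives all consequences rather than only those relevant to the query, so it illustrates the semantics but not the query-time efficiency argument.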
URDF: Reasoning Example
Rules
hasAdvisor(x,y) ∧ worksAt(y,z) ⇒ graduatedFrom(x,z) [0.4]
graduatedFrom(x,y) ∧ graduatedFrom(x,z) ⇒ y=z
KB: Base Facts
[Figure: KB graph. Nodes: Jeff, Surajit, David, Stanford, Princeton, University, Computer Scientist. Edges: worksAt[0.9], hasAdvisor[0.8], hasAdvisor[0.7], graduatedFrom[0.6], graduatedFrom[0.7], graduatedFrom[0.9], type[1.0]]
Derived Facts
gradFr(Surajit,Stanford)
gradFr(David,Stanford)
graduatedFrom[?]
URDF: CNF Construction & MaxSat Solving
[Theobald,Sozio,Suchanek,Nakashole: MPII Tech-Report‘10]
Query: graduatedFrom(?x,?y)
CNF
(¬graduatedFrom(Surajit, Stanford) ∨ ¬graduatedFrom(Surajit, Princeton))
∧ (¬graduatedFrom(David, Stanford) ∨ ¬graduatedFrom(David, Princeton))
∧ (¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford)) [0.4]
∧ (¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford)) [0.4]
∧ worksAt(Jeff, Stanford) ∧ hasAdvisor(Surajit, Jeff) ∧ hasAdvisor(David, Jeff) ∧ graduatedFrom(Surajit, Princeton) ∧ graduatedFrom(Surajit, Stanford) ∧ graduatedFrom(David, Princeton) ∧ graduatedFrom(David, Stanford)
Weights of the base facts, in the order listed: 0.9, 0.8, 0.7, 0.6, 0.7, 0.9, 0.0
1) Deductive Grounding
– Yields only facts and rules which are relevant for answering the query (dependency graph D)
2) Boolean Formula in CNF consisting of
– Grounded hard rules
– Grounded soft rules (weighted)
– Base facts (weighted)
3) Propositional Reasoning
– Compute truth assignment for all facts in D such that the sum of weights is maximized
– Compute “most likely” possible world
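Step 3 can be illustrated on this small example by brute force. The sketch below enumerates all 2^7 truth assignments, discards those violating the two hard mutual-exclusion clauses, and maximizes the weight of satisfied soft clauses and base facts. It is exhaustive search, not the MaxSat algorithm used in URDF. It assumes the seven weights on the slide attach to the base facts in the order they are listed; that alignment is a reading of the slide layout, and the names `VARS`, `score` and `solve` are hypothetical.

```python
from itertools import product

VARS = [
    "worksAt(Jeff,Stanford)", "hasAdvisor(Surajit,Jeff)", "hasAdvisor(David,Jeff)",
    "graduatedFrom(Surajit,Princeton)", "graduatedFrom(Surajit,Stanford)",
    "graduatedFrom(David,Princeton)", "graduatedFrom(David,Stanford)",
]
# ASSUMPTION: weights aligned with VARS in slide order.
WEIGHTS = [0.9, 0.8, 0.7, 0.6, 0.7, 0.9, 0.0]
RULE_WEIGHT = 0.4  # weight of each grounded soft rule clause

def hard_ok(w, hs, hd, sp, ss, dp, ds):
    """Hard constraints: nobody graduated from both Stanford and Princeton."""
    return not (sp and ss) and not (dp and ds)

def score(assign):
    """Sum the weights of true base facts and satisfied soft rule clauses."""
    w, hs, hd, sp, ss, dp, ds = assign
    total = sum(wt for wt, val in zip(WEIGHTS, assign) if val)
    if (not hs) or (not w) or ss:   # ¬hasAdvisor(S,J) ∨ ¬worksAt ∨ gradFrom(S,Stanford)
        total += RULE_WEIGHT
    if (not hd) or (not w) or ds:   # ¬hasAdvisor(D,J) ∨ ¬worksAt ∨ gradFrom(D,Stanford)
        total += RULE_WEIGHT
    return total

def solve():
    """Return the best feasible assignment and its total weight."""
    feasible = (a for a in product([False, True], repeat=len(VARS)) if hard_ok(*a))
    best = max(feasible, key=score)
    return dict(zip(VARS, best)), score(best)

if __name__ == "__main__":
    assignment, total = solve()
    for name, value in assignment.items():
        print(f"{value!s:5} {name}")
    print("total weight:", round(total, 3))
```

Under the assumed weights, the best world keeps graduatedFrom(Surajit, Stanford) and graduatedFrom(David, Princeton) and gives up one grounded soft rule. Exhaustive search is only feasible because the dependency graph here has seven facts.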
URDF: Lineage & Possible Worlds
Query: graduatedFrom(Surajit,?y)
1) Deductive Grounding
– Same as before, but trace lineage of query answers
2) Lineage DAG (not CNF!) consisting of
– Grounded hard rules
– Grounded soft rules
– Base facts
plus: derivation structure
3) Probabilistic Inference
– Marginalization: aggregate probabilities of all possible worlds where the answer is “true”
– Drop “impossible worlds”
Lineage of the answers:
hasAdvisor(Surajit,Jeff)[0.8] ∧ worksAt(Jeff,Stanford)[0.9]: 0.8 × 0.9 = 0.72
graduatedFrom(Surajit, Stanford) = graduatedFrom(Surajit,Stanford)[0.6] ∨ (hasAdvisor(Surajit,Jeff) ∧ worksAt(Jeff,Stanford)): 1 − (1 − 0.72) × (1 − 0.6) = 0.888
graduatedFrom(Surajit, Princeton) = graduatedFrom(Surajit,Princeton)[0.7]
graduatedFrom(Surajit, Princeton) ∧ ¬graduatedFrom(Surajit, Stanford): 0.7 × (1 − 0.888) = 0.078
graduatedFrom(Surajit, Stanford) ∧ ¬graduatedFrom(Surajit, Princeton): (1 − 0.7) × 0.888 = 0.266
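The numbers on this slide can be reproduced by enumerating possible worlds over the four independent base facts and summing the probabilities of the worlds in which an answer holds. This is a sketch for checking the arithmetic, not the URDF inference code; the short keys in `BASE` and the function names are hypothetical.

```python
from itertools import product

# Independent base facts with their confidences, as on the lineage slide.
BASE = {
    "hasAdvisor": 0.8,      # hasAdvisor(Surajit, Jeff)
    "worksAt": 0.9,         # worksAt(Jeff, Stanford)
    "gradStanford": 0.6,    # base fact graduatedFrom(Surajit, Stanford)
    "gradPrinceton": 0.7,   # base fact graduatedFrom(Surajit, Princeton)
}

def world_prob(world):
    """Probability of one possible world under independence."""
    p = 1.0
    for name, is_true in world.items():
        p *= BASE[name] if is_true else 1.0 - BASE[name]
    return p

def marginal(event):
    """Sum the probabilities of all worlds in which `event` holds."""
    names = list(BASE)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        if event(world):
            total += world_prob(world)
    return total

def stanford(w):
    """Lineage: base fact OR (hasAdvisor AND worksAt)."""
    return w["gradStanford"] or (w["hasAdvisor"] and w["worksAt"])

def princeton(w):
    """Lineage: the base fact alone."""
    return w["gradPrinceton"]

if __name__ == "__main__":
    print(marginal(stanford))
    print(marginal(lambda w: princeton(w) and not stanford(w)))
    print(marginal(lambda w: stanford(w) and not princeton(w)))
```

Worlds in which both answers hold are the "impossible worlds" ruled out by the hard constraint, which is why each answer is paired with the negation of the other. Enumeration is exponential in the number of base facts, which motivates the sampling methods on the following slides.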
Classes & Complexities
Grounding first-order Horn formulas (Datalog)
– Decidable
– EXPTIME-complete, PSPACE-complete (including recursion, but in P w/o recursion)
Max-Sat (Constrained & Weighted)
– NP-complete
Probabilistic inference in graphical models
– #P-complete
[Figure: nested logic classes: FOL, OWL, OWL-DL/lite, Horn]
Monte Carlo Simulation (I)
[Karp,Luby,Madras: J.Alg.’89]
Boolean formula:
F = X1X2 ∨ X1X3 ∨ X2X3
Naïve sampling:
cnt = 0
repeat N times
  randomly choose X1, X2, X3 ∈ {0,1}
  if F(X1, X2, X3) = 1 then cnt = cnt+1
P = cnt/N
return P /* ≈ Pr(F) */
Zero/One-estimator theorem: If N ≥ (1/Pr(F)) × (4 ln(2/δ)/ε²) then Pr[ |P/Pr(F) − 1| > ε ] < δ
– N may be very big for small Pr(F)
– Works for any F (not in PTIME)
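A direct transcription of the naïve sampler for F = X1X2 ∨ X1X3 ∨ X2X3 might look as follows. It assumes each variable is true with probability 0.5, which the slide does not state, and the seed parameter is added only to make runs repeatable.

```python
import random

def formula(x1, x2, x3):
    """F = X1X2 v X1X3 v X2X3 from the slide."""
    return (x1 and x2) or (x1 and x3) or (x2 and x3)

def naive_estimate(n_samples, seed=0):
    """Estimate Pr(F) as the fraction of random assignments satisfying F."""
    rng = random.Random(seed)
    cnt = 0
    for _ in range(n_samples):
        # ASSUMPTION: each variable is true with probability 0.5.
        x1, x2, x3 = (rng.random() < 0.5 for _ in range(3))
        if formula(x1, x2, x3):
            cnt += 1
    return cnt / n_samples

if __name__ == "__main__":
    print(naive_estimate(20000))
```

With uniform variables, F is true for 4 of the 8 assignments, so the estimate should approach 0.5. The sampler needs no structure in F, but the required N grows as 1/Pr(F), so rare events are expensive to estimate.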
Monte Carlo Simulation (II)
[Karp,Luby,Madras: J.Alg.’89]
Boolean formula in DNF:
F = C1 ∨ C2 ∨ . . . ∨ Cm
Improved sampling:
cnt = 0; S = Pr(C1) + … + Pr(Cm)
repeat N times
  randomly choose i ∈ {1,2,…, m}, with prob. Pr(Ci)/S
  randomly choose X1, …, Xn ∈ {0,1} s.t. Ci = 1
  if C1=0 and C2=0 and … and Ci-1=0 then cnt = cnt+1
P = S × cnt/N
return P /* ≈ Pr(F) */
Theorem: If N ≥ m × (4 ln(2/δ)/ε²) then Pr[ |P/Pr(F) − 1| > ε ] < δ
– Now it’s better: the bound no longer depends on Pr(F)
– Only for F in DNF, in PTIME
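The improved sampler can be sketched for a monotone DNF over independent variables. The estimate returned is S × cnt/N, the standard form of the Karp-Luby estimator, since cnt/N on its own estimates Pr(F)/S. The variable probabilities in `P`, the restriction to positive literals, and the function names are assumptions made for this illustration.

```python
import random

# ASSUMPTION: independent variables, each true with the given probability.
P = {"X1": 0.5, "X2": 0.5, "X3": 0.5}
# F = X1X2 v X1X3 v X2X3; each clause is a conjunction of positive literals.
CLAUSES = [("X1", "X2"), ("X1", "X3"), ("X2", "X3")]

def clause_prob(clause):
    """Pr(Ci) for a conjunction of independent positive literals."""
    p = 1.0
    for var in clause:
        p *= P[var]
    return p

def karp_luby(n_samples, seed=0):
    """Estimate Pr(F) by sampling (clause, world) pairs and counting each world once."""
    rng = random.Random(seed)
    probs = [clause_prob(c) for c in CLAUSES]
    s = sum(probs)
    cnt = 0
    for _ in range(n_samples):
        # Choose clause i with probability Pr(Ci)/S.
        i = rng.choices(range(len(CLAUSES)), weights=probs)[0]
        # Sample a world conditioned on Ci being true.
        world = {var: rng.random() < p for var, p in P.items()}
        for var in CLAUSES[i]:
            world[var] = True
        # Count only if no earlier clause is also satisfied.
        if not any(all(world[v] for v in CLAUSES[j]) for j in range(i)):
            cnt += 1
    return s * cnt / n_samples

if __name__ == "__main__":
    print(karp_luby(20000))
```

Because every sampled world already satisfies F, the fraction counted is at least 1/m, which is what keeps the number of samples independent of how small Pr(F) is.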
Learning “Soft” Rules
Extend Inductive Logic Programming (ILP) techniques to large and incomplete knowledge bases
Goal: learn livesIn(?x,?y) ⇐ bornIn(?x,?y)
Positive Examples: livesIn(?x,?y) ∧ bornIn(?x,?y)
Negative Examples: ¬livesIn(?x,?y) ∧ bornIn(?x,?y) ∧ livesIn(?x,?z)
Background knowledge
[Figure: overlapping sets livesIn(x,y), bornIn(x,y), livesIn(x,z)]
Software tools: alchemy.cs.washington.edu, http://www.doc.ic.ac.uk/~shm/progol.html, http://dtai.cs.kuleuven.be/ml/systems/claudien
More Variants of Consistency Reasoning
• Propositional Reasoning
– Constrained Weighted MaxSat solver
• Lineage & Possible Worlds (independent base facts)
– Monte Carlo simulations (Luby-Karp)
• First-Order Logic & Probabilistic Graphical Models
– Markov Logic (currently via interface to Alchemy*) [Richardson & Domingos: ML’06]
– Even more general: Factor Graphs [McCallum et al. 2008]
– MCMC sampling for probabilistic inference
*Alchemy – Open-Source AI: http://alchemy.cs.washington.edu/
Experiments
• YAGO Knowledge Base: 2 million entities, 20 million facts
• Basic query answering: SLD grounding & MaxSat solving of 10 queries over 16 soft rules (partly recursive) & 5 hard rules (bornIn, diedIn, marriedTo, …)
• Asymptotic runtime checks: runtime comparisons for synthetic soft rule expansions
• URDF: SLD grounding & MaxSat solving
|C| - # literals in soft rules
|S| - # literals in hard rules
• URDF vs. Markov Logic (MAP inference & MC-SAT)
French Marriage Problem (Revisited)
Facts in KB:
1: marriedTo(Hillary, Bill)
2: marriedTo(Carla, Nicolas)
3: marriedTo(Angelina, Brad)
New fact candidates:
4: marriedTo(Cecilia, Nicolas)
5: marriedTo(Carla, Benjamin)
6: marriedTo(Carla, Mick)
7: divorced(Madonna, Guy)
8: domPartner(Angelina, Brad)
Temporal annotations:
validFrom(2, 2008)
validFrom(4, 1996) validUntil(4, 2007)
validFrom(5, 2010)
validFrom(6, 2006)
validFrom(7, 2008)
[Figure: timeline, JAN to DEC]
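With time annotations, the marriage constraint becomes marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y≠z ⇒ disjoint(t1,t2). The sketch below applies it to the annotated marriedTo facts on this slide. Two assumptions are made for illustration: a fact without validUntil is treated as open-ended, and marriedTo is treated as symmetric so that either spouse can play the role of x. The names `MARRIAGES`, `disjoint` and `conflicts` are hypothetical.

```python
from itertools import combinations

# (fact id, spouse a, spouse b, validFrom, validUntil)
# ASSUMPTION: None for validUntil means the interval is open-ended.
MARRIAGES = [
    (2, "Carla", "Nicolas", 2008, None),
    (4, "Cecilia", "Nicolas", 1996, 2007),
    (5, "Carla", "Benjamin", 2010, None),
    (6, "Carla", "Mick", 2006, None),
]

def disjoint(from1, until1, from2, until2):
    """True if the two validity intervals share no year."""
    end1 = float("inf") if until1 is None else until1
    end2 = float("inf") if until2 is None else until2
    return end1 < from2 or end2 < from1

def conflicts(facts):
    """Return id pairs of marriages that share a person and overlap in time."""
    found = []
    for a, b in combinations(facts, 2):
        shares_person = bool({a[1], a[2]} & {b[1], b[2]})
        different_couple = {a[1], a[2]} != {b[1], b[2]}
        if shares_person and different_couple and not disjoint(a[3], a[4], b[3], b[4]):
            found.append((a[0], b[0]))
    return found

if __name__ == "__main__":
    print(conflicts(MARRIAGES))
```

Facts 2 and 4 both involve Nicolas but are compatible because their intervals are disjoint, whereas the three open-ended facts about Carla conflict pairwise. Without validUntil information the constraint cannot tell a divorce from an extraction error, which is the motivation for harvesting end dates as well.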
Challenge: Temporal Knowledge Harvesting
For all people in Wikipedia (100,000s), gather all spouses, incl. divorced & widowed, and corresponding time periods!
>95% accuracy, >95% coverage, in one night!
Difficult Dating
(Even More Difficult) Implicit Dating
– vague dates
– relative dates
– narrative text
– relative order
TARSQI: Extracting Time Annotations
Hong Kong is poised to hold the first election in more than half <TIMEX3 tid="t3" TYPE="DURATION" VAL="P100Y">a century</TIMEX3> that includes a democracy advocate seeking high office in territory controlled by the Chinese government in Beijing. A pro-democracy politician, Alan Leong, announced <TIMEX3 tid="t4" TYPE="DATE" VAL="20070131">Wednesday</TIMEX3> that he had obtained enough nominations to appear on the ballot to become the territory’s next chief executive. But he acknowledged that he had no chance of beating the Beijing-backed incumbent, Donald Tsang, who is seeking re-election. Under electoral rules imposed by Chinese officials, only 796 people on the election committee – the bulk of them with close ties to mainland China – will be allowed to vote in the <TIMEX3 tid="t5" TYPE="DATE" VAL="20070325">March 25</TIMEX3> election. It will be the first contested election for chief executive since Britain returned Hong Kong to China in <TIMEX3 tid="t6" TYPE="DATE" VAL="1997">1997</TIMEX3>. Mr. Tsang, an able administrator who took office during the early stages of a sharp economic upturn in <TIMEX3 tid="t7" TYPE="DATE" VAL="2005">2005</TIMEX3>, is popular with the general public. Polls consistently indicate that three-fifths of Hong Kong’s people approve of the job he has been doing. It is of course a foregone conclusion – Donald Tsang will be elected and will hold office for <TIMEX3 tid="t9" beginPoint="t0" endPoint="t8" TYPE="DURATION" VAL="P5Y">another five years</TIMEX3>, said Mr. Leong, the former chairman of the Hong Kong Bar Association.
[Verhagen et al: ACL‘05] http://www.timeml.org/site/tarsqi/
extraction errors!
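To use annotations like these downstream, the TIMEX3 tags have to be pulled out of the text. A minimal sketch with regular expressions follows; it is not part of TARSQI, it assumes attribute values are enclosed in straight double quotes, and `SAMPLE` is a shortened excerpt of the annotated text rather than the full passage.

```python
import re

# One TIMEX3 element: attribute string, then the annotated text span.
TIMEX = re.compile(r'<TIMEX3\s+([^>]*)>(.*?)</TIMEX3>', re.DOTALL)
# One attribute of the form name="value".
ATTR = re.compile(r'(\w+)="([^"]*)"')

def extract_timex(text):
    """Return one dict per TIMEX3 tag with its attributes and text span."""
    records = []
    for attrs, span in TIMEX.findall(text):
        record = dict(ATTR.findall(attrs))
        record["text"] = span.strip()
        records.append(record)
    return records

# Shortened excerpt of the annotated news text (illustration only).
SAMPLE = (
    'announced <TIMEX3 tid="t4" TYPE="DATE" VAL="20070131">Wednesday</TIMEX3> '
    'and returned Hong Kong to China in '
    '<TIMEX3 tid="t6" TYPE="DATE" VAL="1997">1997</TIMEX3>.'
)

if __name__ == "__main__":
    for rec in extract_timex(SAMPLE):
        print(rec["tid"], rec["TYPE"], rec["VAL"], repr(rec["text"]))
```

The extractor only reads what the tagger produced; errors in the annotations themselves pass through unchanged and have to be caught by the consistency reasoning.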
13 Relations between Time Intervals
[Allen, 1984; Allen & Hayes, 1989]
A Before B / B After A
A Meets B / B MetBy A
A Overlaps B / B OverlappedBy A
A Starts B / B StartedBy A
A During B / B Contains A
A Finishes B / B FinishedBy A
A Equal B
[Figure: one diagram per row showing the relative position of intervals A and B]
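The 13 relations can be decided from the four endpoints alone. The following sketch classifies a pair of intervals; it assumes proper intervals given as (start, end) with start < end, and the function name `allen` is hypothetical.

```python
def allen(a, b):
    """Return the Allen relation of interval a to interval b.

    Intervals are (start, end) pairs with start < end (assumed, not checked).
    """
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end < b_start:
        return "Before"
    if b_end < a_start:
        return "After"
    if a_end == b_start:
        return "Meets"
    if b_end == a_start:
        return "MetBy"
    if a_start == b_start and a_end == b_end:
        return "Equal"
    if a_start == b_start:
        return "Starts" if a_end < b_end else "StartedBy"
    if a_end == b_end:
        return "Finishes" if a_start > b_start else "FinishedBy"
    if b_start < a_start and a_end < b_end:
        return "During"
    if a_start < b_start and b_end < a_end:
        return "Contains"
    # Remaining cases: the intervals properly overlap.
    return "Overlaps" if a_start < b_start else "OverlappedBy"

if __name__ == "__main__":
    print(allen((1, 5), (4, 6)))
    print(allen((4, 6), (1, 5)))
```

The order of the tests matters: the endpoint-equality cases are handled before the containment and overlap cases, so the final branch can rely on all four endpoints being distinct.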
Possible Worlds in Time (I)
[Wang,Yahya,Theobald: VLDB/MUD Workshop ‘10]
Base Facts (State Relation):
playsFor(Beckham,Real)
playsFor(Ronaldo,Real)
Rule:
playsFor(Beckham, Real, T1) ∧ playsFor(Ronaldo, Real, T2) ∧ overlaps(T1,T2) ⇒ teamMates(Beckham, Ronaldo, T3)
Derived Facts (State):
teamMates(Beckham, Ronaldo, T3)
[Figure: time histograms of the base facts and the derived fact. Time axis labels: ‘00, ‘02, ‘03, ‘04, ‘05, ‘07. Values shown: 0.08, 0.12, 0.16, 0.36; 0.4, 0.6; 1.0; 0.2, 0.2, 0.1, 0.4; 0.9]
Possible Worlds in Time (II)
[Wang,Yahya,Theobald: VLDB/MUD Workshop ‘10]
Base Facts:
playsFor(Beckham, United) (State)
wonCup(United, ChampionsLeague) (Event)
Rule:
playsFor(Beckham, United, T1) ∧ wonCup(United, ChampionsL, T2) ∧ overlaps(T1,T2) ⇒ won(Beckham, ChampionsL, T3)
Derived Facts (Event):
won(Beckham, ChampionsL, T3)
[Figure: time histograms of the base facts and the derived fact, with bins marked as independent or non-independent. Time axis labels: ‘95, ‘96, ‘98, ‘99, ‘00, ‘01, ‘02. Values shown: 0.06, 0.30, 0.12; 0.2, 0.3, 0.6; 0.3, 0.5; 0.06; 0.54; 0.9, 1.0; 0.12]
Need Lineage!
• Closed and complete representation model (incl. lineage)
Stanford Trio project [Widom: CIDR’05, Benjelloun et al: VLDB’06]
• Interval computation remains linear in the number of bins
• Confidence computation per bin is #P-complete
In general requires possible-worlds-based sampling techniques (Luby-Karp, Gibbs sampling, etc.)
Agenda
– URDF: Reasoning in Uncertain Knowledge Bases
• Resolving uncertainty at query-time
• Lineage of answers
• Propositional vs. probabilistic reasoning
• Temporal reasoning extensions
– UViz: The URDF Visualization Frontend
• Demo!
UViz: The URDF Visualization Engine
• UViz System Architecture
– Flash client
– Tomcat server (JRE)
– Relational backend (JDBC)
– Remote Method Invocation & Object Serialization (BlazeDS)
UViz: The URDF Visualization Engine
Demo!