
10 Years of Probabilistic Querying – What Next?

Martin Theobald, University of Antwerp

Joint work with Maximilian Dylla, Sairam Gurajada, Angelika Kimmig,

Andre Melo, Iris Miliaraki, Luc de Raedt, Mauro Sozio, Fabian Suchanek

“The important thing is to not stop questioning ... One cannot help but be in awe when contemplating the mysteries of eternity, of life, of the marvelous structure of reality. It is enough if one tries merely to comprehend a little of this mystery every day.”

- Albert Einstein, 1936

“The Marvelous Structure of Reality”
Joseph M. Hellerstein, Keynote at WebDB 2003, San Diego

Look, There is Structure!

The important thing is to not stop questioning


Text is not just “unstructured data” (C1)

Plethora of natural-language-processing techniques & tools:
• Part-Of-Speech (POS) Tagging
• Named-Entity Recognition & Disambiguation (NERD)
• Dependency Parsing
• Semantic Role Labeling


But:

• Even the best NLP tools frequently yield errors
• Facts found on the Web are logically inconsistent
• Web-extracted knowledge bases are inherently incomplete

Information Extraction

YAGO/DBpedia et al.: >120 M facts for YAGO2 (mostly from Wikipedia infoboxes)
bornOn(Jeff, 09/22/42)
gradFrom(Jeff, Columbia)
hasAdvisor(Jeff, Arthur)
hasAdvisor(Surajit, Jeff)
knownFor(Jeff, Theory)

New fact candidates: hundreds of millions of additional facts from Wikipedia free-text
type(Jeff, Author) [0.9]
author(Jeff, Drag_Book) [0.8]
author(Jeff, Cind_Book) [0.6]
worksAt(Jeff, Bell_Labs) [0.7]
type(Jeff, CEO) [0.4]

YAGO Knowledge Base
http://www.mpi-inf.mpg.de/yago-naga/

[Figure: excerpt of the YAGO graph. Entities (Max_Planck, Erwin_Planck, Angela Merkel, Kiel, Schleswig-Holstein, Germany, Nobel Prize, Max_Planck Society) and literals (“Max Planck”, “Max Karl Ernst Ludwig Planck”, “Angela Merkel”, “Angela Dorothea Merkel”, Apr 23, 1858, Oct 4, 1947, Oct 23, 1944) are connected by the relations bornOn, diedOn, bornIn, fatherOf, hasWon, citizenOf, locatedIn and means. Entities are typed via instanceOf, and the classes Entity, Person, Scientist, Physicist, Biologist, Politician, Location, City, State, Country and Organization are linked via subclass.]

3 M entities, 120 M facts, 100 relations, 200k classes, accuracy 95%

Linked Open Data

As of Sept. 2011:
• >200 linked-data sources
• >30 billion RDF triples
• >400 million owl:sameAs links

http://linkeddata.org/

Maybe Even More Importantly:

Linked Vocabularies!

Source: http://en.wikipedia.org/wiki/Linked_data

• LinkedData.org: Instance & class links between DBpedia, WordNet, OpenCyc, GeoNames, and many more…
• Schema.org: Common vocabulary released by Google, Yahoo!, BING to annotate Web pages, incl. links to DBpedia.

Micro-Formats: RDFa (W3C)

<html xmlns="http://www.w3.org/1999/xhtml"

xmlns:dc="http://purl.org/dc/elements/1.1/"

version="XHTML+RDFa 1.0" xml:lang="en">

<head><title>Martin's Home Page</title>

<base href="http://adrem.ua.ac.be/~tmartin/" />

<meta property="dc:creator" content= "Martin" /> </head>


As of Sept. 2011: > 5 million owl:sameAs links between DBpedia/YAGO/Freebase

Application I: Enrichment of Search Results

“Recent Advances in Structured Data and the Web.” Alon Y. Halevy, Keynote at ICDE 2013, Brisbane

Application II: Machine Reading

It’s about the disappearance forty years ago of Harriet Vanger, a young scion of one of the wealthiest families in Sweden, and about her uncle, determined to know the truth about what he believes was her murder.

Blomkvist visits Henrik Vanger at his estate on the tiny island of Hedeby. The old man draws Blomkvist in by promising solid evidence against Wennerström. Blomkvist agrees to spend a year writing the Vanger family history as a cover for the real assignment: the disappearance of Vanger's niece Harriet some 40 years earlier. Hedeby is home to several generations of Vangers, all part owners in Vanger Enterprises. Blomkvist becomes acquainted with the members of the extended Vanger family, most of whom resent his presence. He does, however, start a short-lived affair with Cecilia, the niece of Henrik.

After discovering that Salander has hacked into his computer, he persuades her to assist him with research. They eventually become lovers, but Blomkvist has trouble getting close to Lisbeth, who treats virtually everyone she meets with hostility. Ultimately the two discover that Harriet's brother Martin, CEO of Vanger Industries, is secretly a serial killer.

A 24-year-old computer hacker sporting an assortment of tattoos and body piercings supports herself by doing deep background investigations for Dragan Armansky, who, in turn, worries that Lisbeth Salander is “the perfect victim for anyone who wished her ill.”

[Figure: entity mentions in the passages above are linked by “same” edges and connected by the extracted relations uncleOf, owns, hires, headOf, affairWith, enemyOf.]

Etzioni, Banko, Cafarella: Machine Reading. AAAI’06
Mitchell, Carlson et al.: Toward an Architecture for Never-Ending Language Learning. AAAI’10

Application III: Natural-Language Question Answering

evi.com (formerly trueknowledge.com)

Application III: Natural-Language Question Answering

wolframalpha.com
• >10 trillion(!) facts
• >50,000 search algorithms
• >5,000 visualizations

IBM Watson: Deep Question Answering

99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain

This town is known as "Sin City" & its downtown is "Glitter Gulch"

William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel

As of 2010, this is the only former Yugoslav republic in the EU

www.ibm.com/innovation/us/watson/index.htm

Knowledge back-ends

Question classification & decomposition

D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine, Fall 2010.


http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/

Multilingual Question Answering over Linked Data (QALD-3), CLEF 2011-13

Natural-Language QA over Linked Data

<question id="4" answertype="resource" aggregation="false" onlydbo="true">
  <string lang="en">Which river does the Brooklyn Bridge cross?</string>
  <string lang="de">Welchen Fluss überspannt die Brooklyn Bridge?</string>
  <string lang="es">¿Por qué río cruza la Brooklyn Bridge?</string>
  <string lang="it">Quale fiume attraversa il ponte di Brooklyn?</string>
  <string lang="fr">Quelle cours d'eau est traversé par le pont de Brooklyn?</string>
  <string lang="nl">Welke rivier overspant de Brooklyn Bridge?</string>
  <keywords lang="en">river, cross, Brooklyn Bridge</keywords>
  <keywords lang="de">Fluss, überspannen, Brooklyn Bridge</keywords>
  <keywords lang="es">río, cruza, Brooklyn Bridge</keywords>
  <keywords lang="it">fiume, attraversare, ponte di Brooklyn</keywords>
  <keywords lang="fr">cours d'eau, pont de Brooklyn</keywords>
  <keywords lang="nl">rivier, Brooklyn Bridge, overspant</keywords>
  <query>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX res: <http://dbpedia.org/resource/>
    SELECT DISTINCT ?uri
    WHERE {
      res:Brooklyn_Bridge dbo:crosses ?uri .
    }
  </query>
</question>

<topic id="2012374" category="Politics">
  <jeopardy_clue>Which German politician is a successor of another politician who stepped down before his or her actual term was over, and what is the name of their political ancestor?</jeopardy_clue>
  <keyword_title>German politicians successor other stepped down before actual term name ancestor</keyword_title>
  <sparql_ft>
    SELECT ?s ?s1 WHERE {
      ?s rdf:type <http://dbpedia.org/class/yago/GermanPoliticians> .
      ?s1 <http://dbpedia.org/property/successor> ?s .
      FILTER FTContains (?s, "stepped down early") .
    }
  </sparql_ft>
</topic>

https://inex.mmci.uni-saarland.de/tracks/lod/

INEX Linked Data Track, CLEF 2012-13

Natural-Language QA over Linked Data



Outline

• Probabilistic Databases
  – Stanford’s Trio System: Data, Uncertainty & Lineage
  – Handling Uncertain RDF Data: URDF (Max-Planck-Institute/U-Antwerp)
• Probabilistic & Temporal Databases
  – Sequenced vs. Non-Sequenced Semantics
  – Interval Alignment & Probabilistic Inference
• Probabilistic Programming
  – Statistical Relational Learning
  – Learning “Interesting” Deduction Rules
• Summary & Challenges

Probabilistic Databases: A Panacea to All of the Aforementioned Tasks?

Probabilistic databases combine first-order logic and probability theory in an elegant way (C2):

• Declarative: Queries formulated in SQL/Relational Algebra/Datalog, support for updates, transactions, etc.
• Deductive: Well-studied resolution algorithms for SQL/Relational Algebra/Datalog (top-down/bottom-up), indexes, automatic query optimization
• Scalable (?): Polynomial data complexity (SQL), but #P-complete for the probabilistic inference

Probabilistic Database

A probabilistic database Dp (compactly) encodes a probability distribution over a finite set of deterministic database instances Di.

Possible instances of WorksAt(Sub, Obj) and their probabilities:

D1 = { (Jeff, Stanford), (Jeff, Princeton) }    0.42
D2 = { (Jeff, Stanford) }                       0.18
D3 = { (Jeff, Princeton) }                      0.28
D4 = { }                                        0.12

Query Semantics (“Marginal Probabilities”): Run query Q against each instance Di; for each answer tuple t, sum up the probabilities of all instances Di where t exists.

Special Cases:

(I) Tuple-independent PDB

WorksAt(Sub, Obj)    p
Jeff  Stanford       0.6
Jeff  Princeton      0.7

(II) Block-independent PDB

WorksAt(Sub, Obj)    p
Jeff  Stanford       0.6
      Princeton      0.4

Note: (I) and (II) are not equivalent!
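The possible-worlds semantics of the tuple-independent case can be sketched in a few lines of Python. This is a minimal brute-force illustration, not how a probabilistic database engine evaluates queries; the function names `possible_worlds` and `marginal` and the example query are assumptions made for this sketch.

```python
from itertools import product

# Tuple-independent relation WorksAt(Sub, Obj); probabilities taken from the slide.
works_at = {("Jeff", "Stanford"): 0.6, ("Jeff", "Princeton"): 0.7}

def possible_worlds(relation):
    """Yield (instance, probability) for every subset of the uncertain tuples."""
    tuples = list(relation)
    for keep in product([True, False], repeat=len(tuples)):
        prob = 1.0
        for t, k in zip(tuples, keep):
            prob *= relation[t] if k else 1.0 - relation[t]
        yield frozenset(t for t, k in zip(tuples, keep) if k), prob

def marginal(relation, query):
    """Sum the probabilities of all instances in which the Boolean query holds."""
    return sum(p for world, p in possible_worlds(relation) if query(world))

worlds = list(possible_worlds(works_at))
for world, prob in worlds:
    print(sorted(world), round(prob, 2))

# Hypothetical Boolean query: "does Jeff work anywhere?"
p_any = marginal(works_at, lambda world: len(world) > 0)
print(round(p_any, 2))
```

The enumeration is exponential in the number of tuples, which is why it only serves to illustrate the semantics.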

Stanford Trio System

Uncertainty-Lineage Databases (ULDBs) [Widom: CIDR 2005]

1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidence values
4. Lineage

Trio’s Data Model

1. Alternatives: uncertainty about value

Saw (witness, color, car)
Amy    red, Honda ∥ red, Toyota ∥ orange, Mazda

Three possible instances

Trio’s Data Model

1. Alternatives
2. ‘?’ (Maybe): uncertainty about presence

Saw (witness, color, car)
Amy    red, Honda ∥ red, Toyota ∥ orange, Mazda
Betty  blue, Acura    ?

Six possible instances

Trio’s Data Model

1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidences: weighted uncertainty

Saw (witness, color, car)
Amy    red, Honda 0.5 ∥ red, Toyota 0.3 ∥ orange, Mazda 0.2
Betty  blue, Acura 0.6    ?

Still six possible instances, each with a probability
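The six weighted instances can be enumerated directly. The sketch below assumes that the alternatives of one x-tuple are mutually exclusive, that different x-tuples are independent, and that an x-tuple is a ‘?’ tuple exactly when its confidences sum to less than 1; the data layout and the name `instances` are made up for illustration and are not Trio's API.

```python
from itertools import product

# One dict per x-tuple: alternative -> confidence (values from the slide).
saw = [
    {("Amy", "red", "Honda"): 0.5, ("Amy", "red", "Toyota"): 0.3, ("Amy", "orange", "Mazda"): 0.2},
    {("Betty", "blue", "Acura"): 0.6},
]

def instances(xtuples):
    """Yield (instance, probability) for every combination of alternatives."""
    choices = []
    for alternatives in xtuples:
        options = list(alternatives.items())
        missing = 1.0 - sum(alternatives.values())
        if missing > 1e-12:  # '?' x-tuple: it may be absent altogether
            options.append((None, missing))
        choices.append(options)
    for combo in product(*choices):
        world = frozenset(t for t, _ in combo if t is not None)
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob

all_instances = list(instances(saw))
for world, prob in all_instances:
    print(sorted(world), round(prob, 2))
```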

So Far: Model is Not Closed

Saw (witness, car)
Cathy    Honda ∥ Mazda

Drives (person, car)
Jimmy, Toyota ∥ Jimmy, Mazda
Billy, Honda ∥ Frank, Honda
Hank, Honda

Suspects = π_person(Saw ⋈ Drives)

Suspects
Jimmy            ?
Billy ∥ Frank    ?
Hank             ?

CANNOT correctly capture the possible instances in the result

Example with Lineage

ID   Saw (witness, car)
11   Cathy    Honda ∥ Mazda

ID   Drives (person, car)
21   Jimmy, Toyota ∥ Jimmy, Mazda
22   Billy, Honda ∥ Frank, Honda
23   Hank, Honda

ID   Suspects
31   Jimmy            ?
32   Billy ∥ Frank    ?
33   Hank             ?

λ(31) = (11,2) ∧ (21,2)
λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2)
λ(33) = (11,1) ∧ 23

With lineage, the Suspects relation correctly captures the 4 possible instances in the result.
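The effect of lineage can be checked by brute force: pick one alternative per base x-tuple, keep a Suspects alternative only if all alternatives in its lineage were picked, and collect the distinct results. This is an illustrative sketch under the slide's example data; the dictionary layout and the function names are assumptions, not Trio's implementation.

```python
from itertools import product

# Base x-tuples keyed by ID; each maps alternative index -> tuple (values from the slide).
saw = {11: {1: ("Cathy", "Honda"), 2: ("Cathy", "Mazda")}}
drives = {
    21: {1: ("Jimmy", "Toyota"), 2: ("Jimmy", "Mazda")},
    22: {1: ("Billy", "Honda"), 2: ("Frank", "Honda")},
    23: {1: ("Hank", "Honda")},
}

def suspects_with_lineage():
    """Join Saw and Drives on car; attach the base alternatives each result depends on."""
    derived = []
    for sid, s_alts in saw.items():
        for s_idx, (_, s_car) in s_alts.items():
            for did, d_alts in drives.items():
                for d_idx, (person, d_car) in d_alts.items():
                    if s_car == d_car:
                        derived.append((person, {(sid, s_idx), (did, d_idx)}))
    return derived

def possible_results():
    """Enumerate base instances and evaluate which derived alternatives survive."""
    base = {**saw, **drives}
    ids = sorted(base)
    derived = suspects_with_lineage()
    results = set()
    for pick in product(*(base[i].keys() for i in ids)):
        chosen = set(zip(ids, pick))
        results.add(frozenset(p for p, lineage in derived if lineage <= chosen))
    return results

for instance in sorted(possible_results(), key=sorted):
    print(sorted(instance))
```

Without the lineage, the three Suspects x-tuples would be treated as independent and would admit combinations such as {Jimmy, Hank}, which no base instance produces.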

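Operational Semantics

[Diagram: Dp is expanded into its possible instances D1, D2, …, Dn; Q is run on each instance, yielding D1’, D2’, …, Dm’; Dp′ is the representation of these result instances (up-arrow). The direct implementation maps Dp to Dp′ without enumerating the instances.]

• Closure: up-arrow always exists
• Completeness: any (finite) set of possible instances can be represented

But: data complexity is #P-complete!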

Summary on Trio’s Data Model

Uncertainty-Lineage Databases (ULDBs)

1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidence values
4. Lineage

Theorem: ULDBs are closed and complete.

Formally studied properties like minimization, equivalence, approximation and membership based on lineage. [Benjelloun, Das Sarma, Halevy, Widom, Theobald: VLDB-J. 2008]

Basic Complexity Issue
[Suciu & Dalvi: SIGMOD’05 Tutorial on "Foundations of Probabilistic Answers to Queries"]

Theorem [Valiant: 1979]: For a Boolean expression E, computing Pr(E) is #P-complete.

NP = class of problems of the form “is there a witness?” (SAT)
#P = class of problems of the form “how many witnesses?” (#SAT)

The decision problem for 2CNF is in PTIME.
The counting problem for 2CNF is already #P-complete.

(will be coming back to this later again…)

…back to Information Extraction

bornIn(Barack, Honolulu)
bornIn(Barack, Kenya)

Uncertain RDF (URDF): Facts & Rules

Extensional Knowledge (the “facts”): Large “Probabilistic Database” of RDF facts
• High-confidence facts: existing knowledge base (“ground truth”)
• New fact candidates: extracted fact candidates with confidences
• Linked-Data & integration of various knowledge sources: Ontology merging or explicitly linked facts (owl:sameAs, owl:equivProp.)

Intensional Knowledge (the “rules”): At query-time!
• Soft rules: deductive grounding & lineage (Datalog/SLD resolution)
• Hard rules: consistency constraints (more general FOL rules)
• Propositional & probabilistic inference

Soft Rules vs. Hard Rules

(Soft) Deduction Rules:

People may live in more than one place
livesIn(x,y) ⇐ marriedTo(x,z) ∧ livesIn(z,y) [0.8]
livesIn(x,y) ⇐ hasChild(x,z) ∧ livesIn(z,y) [0.5]

(Hard) Consistency Constraints:

People are not born in different places/on different dates
bornIn(x,y) ∧ bornIn(x,z) ⇒ y=z
bornOn(x,y) ∧ bornOn(x,z) ⇒ y=z

People are not married to more than one person (at the same time, in most countries?)
marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y≠z ⇒ disjoint(t1,t2)

• (Soft) Deduction Rules, Deductive Database: Datalog, core of SQL & Relational Algebra, RDF/S, OWL2-RL, etc.
• (Hard) Consistency Constraints, More General FOL Constraints: Datalog plus constraints, X-tuples in PDB’s, owl:FunctionalProperty, owl:disjointWith, etc.

URDF Running Example

KB: RDF Base Facts
[Figure: RDF graph over Jeff, Surajit and David (each of type Computer Scientist [1.0]) and Stanford and Princeton (each of type University [1.0]), with the edges worksAt [0.9], hasAdvisor [0.8], hasAdvisor [0.7], graduatedFrom [0.6], graduatedFrom [0.7], graduatedFrom [0.9]; derived edges are drawn as graduatedFrom[?].]

Derived Facts
gradFr(Surajit, Stanford)
gradFr(David, Stanford)

Rules
hasAdvisor(x,y) ∧ worksAt(y,z) ⇒ graduatedFrom(x,z) [0.4]
graduatedFrom(x,y) ∧ graduatedFrom(x,z) ⇒ y=z

Basic Types of Inference

MAP Inference
• Find the most likely assignment to query variables y under a given evidence x.
• Compute: arg max_y P(y | x) (NP-complete for MaxSAT)

Marginal/Success Probabilities
• Probability that query y is true in a random world under a given evidence x.
• Compute: ∑_y P(y | x) (#P-complete already for conjunctive queries)

General Route: Grounding & MaxSAT Solving
[Theobald,Sozio,Suchanek,Nakashole: VLDS’12]

Query: graduatedFrom(x, y)

1) Grounding
– Consider only facts (and rules) which are relevant for answering the query

2) Propositional formula in CNF, consisting of
– Grounded soft & hard rules
– Weighted base facts

CNF:
(¬graduatedFrom(Surajit, Stanford) ∨ ¬graduatedFrom(Surajit, Princeton))    1000
(¬graduatedFrom(David, Stanford) ∨ ¬graduatedFrom(David, Princeton))    1000
(¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford))    0.4
(¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford))    0.4
worksAt(Jeff, Stanford)    0.9
hasAdvisor(Surajit, Jeff)    0.8
hasAdvisor(David, Jeff)    0.7
graduatedFrom(Surajit, Princeton)    0.7
graduatedFrom(Surajit, Stanford)    0.6
graduatedFrom(David, Princeton)    0.9

3) Propositional Reasoning
– Find truth assignment to facts such that the total weight of the satisfied clauses is maximized

MAP inference: compute “most likely” possible world
Find: arg max_y P(y | x); resolves to a variant of MaxSAT for propositional formulas
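For a formula this small, the weighted MaxSAT objective can be checked by exhaustive search. The sketch below is not URDF's MaxSAT algorithm; it simply enumerates all truth assignments. The short variable names and the pairing of base-fact weights with facts (taken from the base-fact list of the running example) are assumptions of this sketch.

```python
from itertools import product

# W = worksAt(Jeff,Stanford), AS/AD = hasAdvisor(Surajit|David, Jeff),
# SP/SS = graduatedFrom(Surajit, Princeton|Stanford), DP/DS = graduatedFrom(David, Princeton|Stanford)
facts = ["W", "AS", "AD", "SP", "SS", "DP", "DS"]
HARD = 1000.0

# Clause = (weight, literals); a literal (var, polarity) is satisfied if assign[var] == polarity.
clauses = [
    (HARD, [("SS", False), ("SP", False)]),             # mutex: Surajit graduated from one place
    (HARD, [("DS", False), ("DP", False)]),             # mutex: David graduated from one place
    (0.4, [("AS", False), ("W", False), ("SS", True)]),  # grounded soft rule for Surajit
    (0.4, [("AD", False), ("W", False), ("DS", True)]),  # grounded soft rule for David
    (0.9, [("W", True)]), (0.8, [("AS", True)]), (0.7, [("AD", True)]),
    (0.7, [("SP", True)]), (0.6, [("SS", True)]), (0.9, [("DP", True)]),
]

def satisfied_weight(assign):
    """Total weight of all clauses with at least one satisfied literal."""
    return sum(w for w, lits in clauses if any(assign[v] == pol for v, pol in lits))

assignments = (dict(zip(facts, bits)) for bits in product([False, True], repeat=len(facts)))
best = max(assignments, key=satisfied_weight)
print(best, round(satisfied_weight(best), 2))
```

Under these weights the search keeps graduatedFrom(Surajit, Stanford), which is supported by both a base fact and the soft rule, and drops graduatedFrom(Surajit, Princeton).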

URDF: MaxSAT Solving with Soft & Hard Rules

Special case: Horn-clauses as soft rules & mutex-constraints as hard rules

S: Mutex-const.
{ graduatedFrom(Surajit, Stanford), graduatedFrom(Surajit, Princeton) }
{ graduatedFrom(David, Stanford), graduatedFrom(David, Princeton) }

C: Weighted Horn clauses (CNF)
(¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford))    0.4
(¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford))    0.4
worksAt(Jeff, Stanford)    0.9
hasAdvisor(Surajit, Jeff)    0.8
hasAdvisor(David, Jeff)    0.7
graduatedFrom(Surajit, Princeton)    0.7
graduatedFrom(Surajit, Stanford)    0.6
graduatedFrom(David, Princeton)    0.9

MaxSAT Alg.

Compute W_0 = ∑_{clauses C} w(C) · P(C is satisfied);
For each hard constraint S_t {
  For each fact f in S_t {
    Compute W_t^{f+} = ∑_{clauses C} w(C) · P(C is sat. | f = true);
  }
  Compute W_t^{S−} = ∑_{clauses C} w(C) · P(C is sat. | S_t = false);
  Choose truth assignment to f in S_t that maximizes W_t^{f+}, W_t^{S−};
  Remove satisfied clauses C;
  t++;
}

• Runtime: O(|S||C|)
• Approximation guarantee of 1/2

Experiment (I): MAP Inference

• YAGO Knowledge Base: 2 million entities, 20 million facts
• Query Answering: Deductive grounding & MaxSAT solving for 10 queries over 16 soft rules (partly recursive) & 5 hard rules (bornIn, diedIn, marriedTo, …)
• Asymptotic runtime checks via synthetic (random) soft rule expansions

[Plots: “URDF: Grounding & MaxSAT solving” and “URDF MaxSAT vs. Markov Logic (MAP inference & MC-SAT)”; |C| = # literals in grounded soft rules, |S| = # literals in grounded hard rules.]

[Yahya,Theobald: RuleML’11; Dylla,Miliaraki,Theobald: ICDE’13]

Deductive Grounding with Lineage (SLD Resolution in Datalog/Prolog)

Query: graduatedFrom(Surajit, y)

Rules
hasAdvisor(x,y) ∧ worksAt(y,z) ⇒ graduatedFrom(x,z) [0.4]
graduatedFrom(x,y) ∧ graduatedFrom(x,z) ⇒ y=z

Base Facts
graduatedFrom(Surajit, Princeton) [0.7]
graduatedFrom(Surajit, Stanford) [0.6]
graduatedFrom(David, Princeton) [0.9]
hasAdvisor(Surajit, Jeff) [0.8]
hasAdvisor(David, Jeff) [0.7]
worksAt(Jeff, Stanford) [0.9]
type(Princeton, University) [1.0]
type(Stanford, University) [1.0]
type(Jeff, Computer_Scientist) [1.0]
type(Surajit, Computer_Scientist) [1.0]
type(David, Computer_Scientist) [1.0]

Lineage
A = graduatedFrom(Surajit, Princeton) [0.7]
B = graduatedFrom(Surajit, Stanford) [0.6]
C = hasAdvisor(Surajit, Jeff) [0.8]
D = worksAt(Jeff, Stanford) [0.9]

Q1: graduatedFrom(Surajit, Princeton) = A ∧ ¬(B ∨ (C ∧ D))
Q2: graduatedFrom(Surajit, Stanford) = ¬A ∧ (B ∨ (C ∧ D))

Lineage & Possible Worlds
[Das Sarma,Theobald,Widom: ICDE’08; Dylla,Miliaraki,Theobald: ICDE’13]

1) Deductive Grounding
– Dependency graph of the query
– Trace lineage of individual query answers

2) Lineage DAG (not in CNF), consisting of
– Grounded soft & hard rules
– Probabilistic base facts

3) Probabilistic Inference
– Compute marginals:
  P(Q): sum up the probabilities of all possible worlds that entail the query answers’ lineage
  P(Q|H): drop “impossible worlds”

Confidence computation along the lineage, with A = graduatedFrom(Surajit, Princeton) [0.7], B = graduatedFrom(Surajit, Stanford) [0.6], C = hasAdvisor(Surajit, Jeff) [0.8], D = worksAt(Jeff, Stanford) [0.9]:

P(C ∧ D) = 0.8 × 0.9 = 0.72
P(B ∨ (C ∧ D)) = 1 − (1 − 0.72) × (1 − 0.6) = 0.888
P(A ∧ ¬(B ∨ (C ∧ D))) = 0.7 × (1 − 0.888) = 0.078
P(¬A ∧ (B ∨ (C ∧ D))) = (1 − 0.7) × 0.888 = 0.266

Possible Worlds Semantics

A: 0.7, B: 0.6, C: 0.8, D: 0.9

Q2: ¬A ∧ (B ∨ (C ∧ D))
Hard rule H: ¬(A ∧ (B ∨ (C ∧ D)))

A  B  C  D  Q2  P(W)
1  1  1  1  0   0.7×0.6×0.8×0.9 = 0.3024
1  1  1  0  0   0.7×0.6×0.8×0.1 = 0.0336
1  1  0  1  0   … = 0.0756
1  1  0  0  0   … = 0.0084
1  0  1  1  0   … = 0.2016
1  0  1  0  0   … = 0.0224
1  0  0  1  0   … = 0.0504
1  0  0  0  0   … = 0.0056
0  1  1  1  1   0.3×0.6×0.8×0.9 = 0.1296
0  1  1  0  1   0.3×0.6×0.8×0.1 = 0.0144
0  1  0  1  1   0.3×0.6×0.2×0.9 = 0.0324
0  1  0  0  1   0.3×0.6×0.2×0.1 = 0.0036
0  0  1  1  1   0.3×0.4×0.8×0.9 = 0.0864
0  0  1  0  0   … = 0.0096
0  0  0  1  0   … = 0.0216
0  0  0  0  0   … = 0.0024

Sum over all worlds: 1.0

P(Q2) = 0.2664
P(Q2|H) = 0.2664 / 0.412 = 0.6466
P(Q1) = 0.0784
P(Q1|H) = 0.0784 / 0.412 = 0.1903
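The unconditioned marginals P(Q1) and P(Q2) can be reproduced both by enumerating the 16 possible worlds and by the bottom-up independent-and/independent-or computation over the lineage. This is a small illustrative sketch; the function names are assumptions, and the bottom-up shortcut is only valid here because every variable occurs once in the formula.

```python
from itertools import product

# Base-fact probabilities from the slide.
p = {"A": 0.7, "B": 0.6, "C": 0.8, "D": 0.9}

def q1(w):
    return w["A"] and not (w["B"] or (w["C"] and w["D"]))

def q2(w):
    return (not w["A"]) and (w["B"] or (w["C"] and w["D"]))

def marginal(formula):
    """Sum the probabilities of all possible worlds that satisfy the lineage formula."""
    total = 0.0
    for bits in product([True, False], repeat=len(p)):
        world = dict(zip(p, bits))
        weight = 1.0
        for var, val in world.items():
            weight *= p[var] if val else 1.0 - p[var]
        if formula(world):
            total += weight
    return total

# Bottom-up over the lineage: independent-and, then independent-or.
p_cd = p["C"] * p["D"]
p_b_or_cd = 1 - (1 - p["B"]) * (1 - p_cd)
p_q1_tree = p["A"] * (1 - p_b_or_cd)
p_q2_tree = (1 - p["A"]) * p_b_or_cd

print(round(marginal(q1), 4), round(marginal(q2), 4))
print(round(p_q1_tree, 4), round(p_q2_tree, 4))
```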

Inference in Probabilistic Databases

• Safe query plans [Dalvi,Suciu: VLDB-J’07]: Can propagate confidences along with relational operators.
• Read-once functions [Sen,Deshpande,Getoor: PVLDB’10]: Can factorize Boolean formula (in polynomial time) into read-once form, where every variable occurs at most once.
• Knowledge compilation [Olteanu et al.: ICDT’10, ICDT’11]: Can decompose Boolean formula into ordered binary decision diagram (OBDD), such that inference resolves to independent-and and independent-or operations over the decomposed formula.
• Top-k pruning [Ré,Dalvi,Suciu: ICDE’07; Karp,Luby,Madras: J-Alg.’89]: Can return top-k answers based on lower and upper bounds, even without knowing their exact marginal probabilities.
  – Multi-Simulation: run multiple Markov-Chain-Monte-Carlo (MCMC) simulations in parallel.

Monte Carlo Simulation (I)
[Suciu & Dalvi: SIGMOD’05 Tutorial on "Foundations of Probabilistic Answers to Queries"; Karp,Luby,Madras: J-Alg.’89]

Boolean formula: E = X1X2 ∨ X1X3 ∨ X2X3

Naïve sampling:

cnt = 0
repeat N times
  randomly choose X1, X2, X3 ∈ {0,1}
  if E(X1, X2, X3) = 1 then cnt = cnt+1
P = cnt/N
return P /* estimate for true Pr(E) */

Zero/One-Estimator Theorem: If N ≥ (1/Pr(E)) × (4 ln(2/δ)/ε²) then: Pr[ |P/Pr(E) − 1| > ε ] < δ

• Works for any E (not in PTIME)
• N may be very big for small Pr(E)
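The naïve sampler translates almost line by line into Python. The sketch below assumes that each variable is true with probability 0.5 and uses a fixed seed for reproducibility; both choices, and the function names, are assumptions of this example.

```python
import random

def formula(x1, x2, x3):
    """E = X1X2 v X1X3 v X2X3 from the slide."""
    return (x1 and x2) or (x1 and x3) or (x2 and x3)

def naive_estimate(n_samples, p_true=0.5, seed=0):
    """Zero/one estimator: fraction of random assignments that satisfy E."""
    rng = random.Random(seed)
    cnt = 0
    for _ in range(n_samples):
        x = [rng.random() < p_true for _ in range(3)]
        if formula(*x):
            cnt += 1
    return cnt / n_samples

estimate = naive_estimate(20_000)
print(estimate)  # close to the exact value 0.5 for p_true = 0.5
```

With uniform variables, E is true exactly when at least two of the three variables are true, so the exact probability is 4/8 = 0.5.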

Monte Carlo Simulation (II)
[Suciu & Dalvi: SIGMOD’05 Tutorial on "Foundations of Probabilistic Answers to Queries"; Karp,Luby,Madras: J-Alg.’89]

Boolean formula in DNF: E = C1 ∨ C2 ∨ … ∨ Cm

Importance sampling:

cnt = 0; S = Pr(C1) + … + Pr(Cm)
repeat N times
  randomly choose i ∈ {1,2,…, m}, with prob. Pr(Ci)/S
  randomly choose X1, …, Xn ∈ {0,1} s.t. Ci = 1
  if C1 = 0 and C2 = 0 and … and Ci-1 = 0 then cnt = cnt+1
P = (cnt/N) × S
return P /* estimate for true Pr(E) */

Theorem: If N ≥ (1/m) × (4 ln(2/δ)/ε²) then: Pr[ |P/Pr(E) − 1| > ε ] < δ

• This is better!
• Only for E in DNF in PTIME
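A compact Python version of the DNF importance sampler, as a sketch under stated assumptions: clauses are conjunctions of positive literals over independent variables, and the estimate rescales the acceptance rate by S = Pr(C1) + … + Pr(Cm). The function name `karp_luby` and the data layout are made up for this example.

```python
import random

def karp_luby(clauses, probs, n_samples, seed=0):
    """Estimate Pr(C1 v ... v Cm); each clause is a set of positive variables."""
    rng = random.Random(seed)
    variables = sorted(probs)
    clause_pr = []
    for clause in clauses:
        pr = 1.0
        for v in clause:
            pr *= probs[v]
        clause_pr.append(pr)
    s = sum(clause_pr)
    cnt = 0
    for _ in range(n_samples):
        # Choose clause i with probability Pr(Ci)/S.
        i = rng.choices(range(len(clauses)), weights=clause_pr)[0]
        # Sample a world conditioned on Ci being true.
        world = {v: (v in clauses[i]) or (rng.random() < probs[v]) for v in variables}
        # Count the sample only if no earlier clause is satisfied.
        if not any(all(world[v] for v in clauses[j]) for j in range(i)):
            cnt += 1
    return s * cnt / n_samples

dnf = [{"X1", "X2"}, {"X1", "X3"}, {"X2", "X3"}]
uniform = {"X1": 0.5, "X2": 0.5, "X3": 0.5}
estimate = karp_luby(dnf, uniform, 20_000)
print(estimate)  # close to the exact value 0.5
```

Every sample lands in a satisfying world, so the acceptance rate is at least 1/m regardless of how small Pr(E) is; this is the reason the required sample size no longer depends on 1/Pr(E).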

Top-k Ranking by Marginal Probabilities
[Dylla,Miliaraki,Theobald: ICDE’13]

[Figure: partially grounded lineage. Q1 has lineage A = graduatedFrom(Surajit, Princeton) [0.7]; Q2 = graduatedFrom(Surajit, y=Stanford) is a disjunction (\/) of B = graduatedFrom(Surajit, Stanford) [0.6] and a conjunction (/\) of the subgoals C [0.8] and D.]

• Datalog/SLD resolution: Top-down grounding allows us to compute lower and upper bounds on the marginal probabilities of answer candidates before rules are fully grounded.
• Subgoals may represent sets of answer candidates.
• First-order lineage formulas:
  Φ(Q1) = A
  Φ(Q2) = B ∨ ∃y gradFrom(Surajit,y)
• Prune entire set of answer candidates represented by Φ.

Bounds for First-Order Formulas
[Dylla,Miliaraki,Theobald: ICDE’13]

Theorem 1: Given a (partially grounded) first-order lineage formula Φ:

Φ(Q2) = B ∨ ∃y gradFrom(S,y)

• Lower bound Plow (for all query answers that can be obtained from grounding Φ): Substitute ∃y gradFrom(S,y) with false (or true if negated).
  Plow(Q2) = P(B ∨ false) = P(B) = 0.6
• Upper bound Pup (for all query answers that can be obtained from grounding Φ): Substitute ∃y gradFrom(S,y) with true (or false if negated).
  Pup(Q2) = P(B ∨ true) = P(true) = 1.0

Proof (sketch): Substitution of a subformula with false reduces the number of models (possible worlds) that satisfy Φ; substitution with true increases them.

Convergence of Bounds
[Dylla,Miliaraki,Theobald: ICDE’13]

Theorem 2: Let Φ1,…, Φn be a series of first-order lineage formulas obtained from grounding Φ via SLD resolution, and let φ be the propositional lineage formula of an answer obtained from this grounding procedure. Then rewriting each Φi according to Theorem 1 into Pi,low and Pi,up creates a monotonic series of lower and upper bounds that converges to P(φ).

0 = P(false) ≤ P(B ∨ false) = 0.6 ≤ P(B ∨ (C ∧ D)) = 0.888 ≤ P(B ∨ true) = P(true) = 1

Proof (sketch, via induction): Substitution of true with a formula reduces the number of models that satisfy Φ; substitution of false with a formula increases this number.
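The bound computation can be illustrated with the running example: the not-yet-grounded subgoal is replaced by false for the lower bound and by true for the upper bound, and by its actual grounding C ∧ D once SLD resolution has expanded it. This is a brute-force sketch; passing the unresolved subgoal as a callable is an assumption of this example, not the representation used in the cited paper.

```python
from itertools import product

# Base-fact probabilities from the running example.
p = {"B": 0.6, "C": 0.8, "D": 0.9}

def phi(world, subgoal):
    """Phi(Q2) = B v <unresolved subgoal>; the subgoal is passed in as a callable."""
    return world["B"] or subgoal(world)

def prob(subgoal):
    """Probability of phi under a given substitution for the unresolved subgoal."""
    total = 0.0
    for bits in product([True, False], repeat=len(p)):
        world = dict(zip(p, bits))
        weight = 1.0
        for var, val in world.items():
            weight *= p[var] if val else 1.0 - p[var]
        if phi(world, subgoal):
            total += weight
    return total

p_low = prob(lambda world: False)                     # subgoal replaced by false
p_up = prob(lambda world: True)                       # subgoal replaced by true
p_exact = prob(lambda world: world["C"] and world["D"])  # subgoal fully grounded

print(round(p_low, 3), round(p_exact, 3), round(p_up, 3))
```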

Top-k Pruning (“Fagin’s Algorithm”)
[Fagin et al.’01; Balke,Kießling’02; Dylla,Miliaraki,Theobald: ICDE’13]

• Maintain two disjoint queues: Top-k queue sorted by Plow and Candidates sorted by Pup
• Return the top-k queue at the t’th grounding step when:
  Pi,low(Qk) | Qk ∈ Top-k > Pi,up(Qj) | Qj ∈ Candidates
• Drop Qj from the Candidates queue.

[Figure: marginal probability (0 to 1) vs. #SLD steps t, showing the bounds P1,low(Qj)/P1,up(Qj), P2,low(Qj)/P2,up(Qj), …, Pn,low(Qj)/Pn,up(Qj) of a candidate Qj relative to the k-th lower bound.]

Top-k Stopping Condition (“Fagin’s Algorithm”)
[Fagin et al.’01; Balke,Kießling’02; Dylla,Miliaraki,Theobald: ICDE’13]

• Maintain two disjoint queues: Top-k queue sorted by Plow and Candidates sorted by Pup
• Return the top-k queue at the t’th grounding step when:
  Pi,low(Qk) | Qk ∈ Top-k > Pi,up(Qj) | Qj ∈ Candidates
• Stop and return the top-2 query answers.

[Figure: marginal probability (0 to 1) at SLD step t for k = 2, showing the bounds Pt,low(Q1)/Pt,up(Q1), Pt,low(Q2)/Pt,up(Q2) and Pt,low(Qm)/Pt,up(Qm) relative to the 2nd lower bound.]
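The pruning and stopping test can be sketched as a single function over the current bounds. This is a simplified illustration, not the algorithm from the cited papers: the function name `prune`, the answer names and all bound values below are hypothetical.

```python
def prune(bounds, k):
    """bounds: answer -> (p_low, p_up). Returns (top_k, candidates, done)."""
    by_low = sorted(bounds, key=lambda a: bounds[a][0], reverse=True)
    top_k = by_low[:k]
    # k-th lower bound; with fewer than k answers nothing can be pruned yet.
    threshold = bounds[top_k[-1]][0] if len(top_k) == k else 0.0
    # Drop every candidate whose upper bound is below the k-th lower bound.
    candidates = sorted(
        (a for a in by_low[k:] if bounds[a][1] >= threshold),
        key=lambda a: bounds[a][1],
        reverse=True,
    )
    return top_k, candidates, not candidates

# Hypothetical bounds after some grounding step.
step_1 = {"Q1": (0.6, 0.9), "Q2": (0.5, 0.7), "Q3": (0.1, 0.4), "Q4": (0.2, 0.55)}
result_1 = prune(step_1, k=2)
print(result_1)

# Hypothetical bounds after further grounding has tightened Q4.
step_2 = dict(step_1, Q4=(0.3, 0.45))
result_2 = prune(step_2, k=2)
print(result_2)
```

In the first step Q3 is dropped but Q4 still overlaps the 2nd lower bound, so grounding continues; in the second step the candidate queue is empty and the top-2 answers can be returned.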

Experiment (II): Computing Marginals

• IMDB data with 26 million facts about movies, directors, actors, etc.
• 4 query patterns, each instantiated to 1,000 queries (showing runtime averages)
  – Q1: safe, non-repeating hierarchical
  – Q2: unsafe, repeating hierarchical
  – Q3: unsafe, head-hierarchical
  – Q4: general unsafe

[Chart: runtime in ms (log scale, 10 to 100,000) for Non-Rep. Hierarchical Q1, Rep. Hierarchical Q2, Head-Hierarchical Q3 and General Unsafe Q4; series: Top-10, Top-20, Top-50, MultiSim Top-10, MultiSim Top-20, MultiSim Top-50, Postgres, MayBMS, Trio.]

Experiment (II): Computing Marginals

IMDB data set, 26 million facts

[Plots: runtime vs. number of top-k results for a single join query; percentage of tuples scanned from input relations.]

Probabilistic & Temporal Database
[Dignös, Gamper, Böhlen: SIGMOD’12] [Dylla,Miliaraki,Theobald: PVLDB’13]

A temporal-probabilistic database DTp (compactly) encodes a probability distribution over a finite set of deterministic database instances Di and a finite time domain T.

BornIn(Sub, Obj)         T               p
DeNiro   Greenwich       [1943, 1944)    0.9
DeNiro   Tribeca         [1998, 1999)    0.6

Wedding(Sub, Obj)        T               p
DeNiro   Abbott          [1936, 1940)    0.3
                                         0.7

Divorce(Sub, Obj)        T               p
                                         0.8

Sequenced Semantics & Snapshot Reducibility:
• Built-in semantics: reduce temporal-relational operators to their non-temporal counterparts at each snapshot of the database.
• Coalesce/split tuples with consecutive time intervals based on their lineages.

Non-Sequenced Semantics:
• Queries can freely manipulate timestamps just like regular attributes.
• Single temporal operator ≤T supports all of Allen’s 13 temporal relations.
• Deduplicate tuples with overlapping time intervals based on their lineages.

Temporal Alignment & Deduplication

Non-Sequenced Semantics:

MarriedTo(X,Y)[Tb1,tmax) ⇐ Wedding(X,Y)[Tb1,Te1) ∧ ¬Divorce(X,Y)[Tb2,Te2)
MarriedTo(X,Y)[Tb1,Te2) ⇐ Wedding(X,Y)[Tb1,Te1) ∧ Divorce(X,Y)[Tb2,Te2) ∧ Te1 ≤T Tb2

[Figure: timeline T from tmin to tmax with the time points 1936, 1976, 1988.
Base Facts: f1 = Wedding(DeNiro,Abbott), f2 = Wedding(DeNiro,Abbott), f3 = Divorce(DeNiro,Abbott).
Deduced Facts, with lineages: f1 ∧ ¬f3; f2 ∧ ¬f3; f1 ∧ f3; f2 ∧ f3.
Dedupl. Facts, with lineages: (f1 ∧ f3) ∨ (f1 ∧ ¬f3); (f1 ∧ f3) ∨ (f1 ∧ ¬f3) ∨ (f2 ∧ f3) ∨ (f2 ∧ ¬f3); (f1 ∧ f3) ∨ (f2 ∧ ¬f3).]

Inference in Temporal-Probabilistic Databases
[Wang,Yahya,Theobald: MUD’10; Dylla,Miliaraki,Theobald: PVLDB’13]

playsFor(Beckham, Real, T1) ∧ playsFor(Ronaldo, Real, T2) ∧ overlaps(T1,T2,T3) ⇒ teamMates(Beckham, Ronaldo, T3)

[Figure: timelines of the Base Facts playsFor(Beckham, Real, T1) (time points ‘03, ‘05, ‘07; probabilities 0.4, 0.6) and playsFor(Ronaldo, Real, T2) (time points ‘00, ‘02, ‘04, ‘05, ‘07; probabilities 0.2, 0.2, 0.1, 0.4), and of the Derived Facts teamMates(Beckham, Ronaldo, t3) (time points ‘03, ‘04, ‘05, ‘07; probabilities 0.08, 0.12, 0.16).]

[Figure: the same timelines extended by a third base fact playsFor(Zidane, Real, T3). The Base Facts are independent; the Derived Facts teamMates(Beckham, Zidane, T5) and teamMates(Ronaldo, Zidane, T6) are non-independent.]

Need Lineage!

• Closed and complete representation model (incl. lineage)
• Temporal alignment is linear in the number of input intervals
• Confidence computation per interval remains #P-hard
• In general requires Monte Carlo approximations (Luby-Karp for DNF, MCMC-style sampling), decompositions, or top-k pruning

Experiment (III): Temporal Alignment & Probabilistic Inference

• 1,827 base facts with temporal annotations
  – Extracted from free-text biographies from Wikipedia, IMDB.com, biography.com
• 11 handcrafted temporal deduction rules, e.g.:
  MarriedTo(X,Y)[Tb1,Te2) ⇐ Wedding(X,Y)[Tb1,Te1) ∧ Divorce(X,Y)[Tb2,Te2) ∧ Te1 ≤T Tb2
• 21 handcrafted temporal consistency constraints, e.g.:
  BornIn(X,Y)[Tb1,Te1) ∧ MarriedTo(X,Y)[Tb2,Te2) ⇒ Te1 ≤T Tb2

Statistical Relational Learning & Probabilistic Programming

• SRL combines first-order logic and probabilistic inference
• Employs relational data as input, but with a focus also on learning the relations (facts, rules & weights)
• Knowledge compilation for probabilistic inference
  – Including recent techniques for “lifted inference”
• Markov Logic Networks (U-Washington)
  – Grounding of weighted first-order rules over a function-free Herbrand base into an undirected graphical model (→ Markov Random Field)
• Probabilistic Programming (ProbLog, KU-Leuven)
  – Deductive grounding over a set of base facts into a directed graphical model (SLD proofs → Bayesian Net)

Learning Soft Deduction Rules

Goal: Inductively learn soft rule S: livesIn(x,y) :- bornIn(x,y)

• Inductive learning algorithm based on dynamic programming
• A-priori-style pre-filtering & pruning of low-support join patterns
• Adaptation of confidence and support measures from data mining
• Learning “interesting” rules with constants and type constraints

[Venn diagram with three sets:]
G: Ground truth for livesIn (only partially known)
KB: Knowledge base for livesIn (known positive examples)
R: Facts inferred for livesIn from the body of the rule bornIn (only partially correct)

$$\mathit{confidence}(S) = P(\mathit{Head} \mid \mathit{Body}) = \frac{|\mathit{Head} \cap \mathit{Body}|}{|\mathit{Body}|}$$
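For the rule livesIn(x,y) :- bornIn(x,y), the confidence measure reduces to a set intersection between the facts predicted by the body and the known head facts. The sketch below uses a tiny, entirely hypothetical knowledge base; the names and facts are made up for illustration.

```python
# Hypothetical toy knowledge base; none of these facts come from the talk.
born_in = {("Al", "Rome"), ("Bo", "Oslo"), ("Cy", "Lima"), ("Di", "Kiel")}
lives_in = {("Al", "Rome"), ("Bo", "Oslo"), ("Cy", "Quito")}

def support(head_facts, body_facts):
    """Number of body groundings whose predicted head fact is known to be true."""
    return len(head_facts & body_facts)

def confidence(head_facts, body_facts):
    """P(Head | Body) = |Head and Body| / |Body|."""
    return support(head_facts, body_facts) / len(body_facts) if body_facts else 0.0

# For livesIn(x,y) :- bornIn(x,y), each bornIn(x,y) fact predicts livesIn(x,y).
rule_support = support(lives_in, born_in)
rule_confidence = confidence(lives_in, born_in)
print(rule_support, rule_confidence)
```

Because the knowledge base for livesIn is incomplete, this estimate treats unknown facts as false and therefore tends to underestimate the rule's true confidence.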

Learning “Interesting” Deduction Rules (I)

Plots for the distribution of income versus quarterOfBirth and educationLevel over actual US census data from Oct. 2009 (>1 billion RDF facts).

Divergence from “Overall population” shows strong correlation of income with educationLevel but not with quarterOfBirth.

[Plots: relative frequency vs. income for the join patterns income(x, y), quarterOfBirth(x, z) (series: Overall population, QOB-1st-quarter, QOB-2nd-quarter, QOB-3rd-quarter, QOB-4th-quarter) and income(x, y), educationLevel(x, z).]

Learning “Interesting” Deduction Rules (II)

Divergence of “Nursery school to Grade 4” and “Professional school degree” from “Overall population”, measured using Kullback-Leibler or χ² over the discretized income domain.

[Plot: relative frequency vs. income (low, medium, high); series: Overall population, Nursery school to Grade 4, Professional school degree.]

income(x, y) :- educationLevel(x, z)
income(x, “low”) :- educationLevel(x, “Nursery school to Grade 4”)
income(x, “medium”) :- educationLevel(x, “Professional school degree”)
income(x, “high”) :- educationLevel(x, “Professional school degree”)
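The Kullback-Leibler divergence of a subgroup's income distribution from the overall distribution can be computed as follows. The three distributions below are hypothetical placeholders over the bins (low, medium, high), not the census figures from the talk.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions over the same bins (q must be > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical relative frequencies over the income bins (low, medium, high).
overall = [0.5, 0.3, 0.2]
nursery_to_grade_4 = [0.8, 0.15, 0.05]
professional_degree = [0.1, 0.3, 0.6]

for name, dist in [("nursery", nursery_to_grade_4), ("professional", professional_degree)]:
    print(name, round(kl_divergence(dist, overall), 3))
```

A subgroup whose divergence from the overall population is close to zero, as with quarterOfBirth in the talk's plots, gives no reason to specialize the rule with that constant.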

Summary & Challenges (I): Web-Scale Information Extraction

[Chart axes: ontological rigor (Names & Patterns → Entities & Relations) vs. human effort (Open-Domain & Unsupervised → Domain-Oriented Training Data/Facts).]

Names & Patterns:
< “N. Portman”, “honored with”, “Academy Award” >
< “Jeff Bridges”, “expected to win”, “Oscar” >
< “Bridges”, “nominated for”, “Academy Award” >

Entities & Relations:
wonAward: Person × Prize
type (Meryl_Streep, Actor)
wonAward (Meryl_Streep, Academy_Award)
wonAward (Natalie_Portman, Academy_Award)
wonAward (Ethan_Coen, Palme_d‘Or)

[Chart, same axes (ontological rigor vs. human effort), with systems placed on it: TextRunner, ReadTheWeb / NELL, Probase, Freebase, YAGO2, DBpedia 3.8, Sofie / Prospera, StatSnowball / EntityCube, WebTables / FusionTables; one region is marked “?”.]

Summary & Challenges (II): RDF is Not Enough!

• HMM’s, CRF’s, PCFG’s (not in this talk) yield much richer output structures than just triplets.
• Extraction of facts → beliefs, modifiers, modalities, etc. → intensional knowledge (“rules”)
• More expressive but canonical representation of natural language: trees, graphs, objects, frames (F-logic, KL-one, CycL, OWL, etc.)
• All combined with structured probabilistic inference

Summary & Challenges (III): Scalable Probabilistic Inference

“Domain-liftable” FO formula:
∀X,Y ∈ People: smokes(X) ∧ friends(X,Y) ⇒ smokes(Y)

[Figure: corresponding FO d-DNNF circuit]

• Exact lifted inference via Weighted-First-Order-Model-Counting (WFOMC)
  – Probability of a query depends only on the size(s) of the domain(s), a weight function for the first-order predicates, and the weighted model count over the FO d-DNNF.
• [Van den Broeck’11]: Compilation rules and inference algorithms for FO d-DNNF’s
• [Jha & Suciu’11]: Classes of SQL queries which admit polynomial-size (propositional) d-DNNF’s
• Approximate inference via Belief Propagation, MCMC-style sampling, etc.
• Scale-out via distributed grounding & inference: TrinityRDF (MSR), GraphLab2 (CMU)

Final Summary

C1: Text is not just unstructured data.

C2: Probabilistic databases combine first-order logic and probability theory in an elegant way.

C3: Natural-Language-Processing people, Database guys, and Machine-Learning folks: it’s about time to join your forces!

Demo!

urdf.mpi-inf.mpg.de

http://infao5501.ag5.mpi-sb.mpg.de:8080/urdf/UViz.html

References

• Maximilian Dylla, Iris Miliaraki, and Martin Theobald: A Temporal-Probabilistic Database Model for Information Extraction. PVLDB 6(14), 2013 (to appear)
• Maximilian Dylla, Iris Miliaraki, and Martin Theobald: Top-k Query Processing in Probabilistic Databases with Non-Materialized Views. ICDE 2013
• Ndapandula Nakashole, Mauro Sozio, Fabian Suchanek, Martin Theobald: Query-Time Reasoning in Uncertain RDF Knowledge Bases with Soft and Hard Rules. VLDS 2012: 15-20
• Mohamed Yahya, Martin Theobald: D2R2: Disk-Oriented Deductive Reasoning in a RISC-Style RDF Engine. RuleML America 2011: 81-96
• Timm Meiser, Maximilian Dylla, Martin Theobald: Interactive Reasoning in Uncertain RDF Knowledge Bases. CIKM 2011: 2557-2560
• Ndapandula Nakashole, Martin Theobald, Gerhard Weikum: Scalable Knowledge Harvesting with High Precision and High Recall. WSDM 2011: 227-236
• Maximilian Dylla, Mauro Sozio, Martin Theobald: Resolving Temporal Conflicts in Inconsistent RDF Knowledge Bases. BTW 2011: 474-493
• Yafang Wang, Mohamed Yahya, Martin Theobald: Time-aware Reasoning in Uncertain Knowledge Bases. MUD 2010: 51-65
• Ndapandula Nakashole, Martin Theobald, Gerhard Weikum: Find your Advisor: Robust Knowledge Gathering from the Web. WebDB 2010
• Anish Das Sarma, Martin Theobald, Jennifer Widom: LIVE: A Lineage-Supported Versioned DBMS. SSDBM 2010: 416-433
• Anish Das Sarma, Martin Theobald, Jennifer Widom: Exploiting Lineage for Confidence Computation in Uncertain and Probabilistic Databases. ICDE 2008: 1023-1032
• Omar Benjelloun, Anish Das Sarma, Alon Y. Halevy, Martin Theobald, Jennifer Widom: Databases with uncertainty and lineage. VLDB J. 17(2): 243-264 (2008)
