10 Years of Probabilistic Querying – What Next? Martin Theobald University of Antwerp Joint work with Maximilian Dylla, Sairam Gurajada, Angelika Kimmig, Andre Melo, Iris Miliaraki, Luc de Raedt, Mauro Sozio, Fabian Suchanek

Mar 29, 2015

Transcript
Page 1:

10 Years of Probabilistic Querying – What Next?

Martin Theobald, University of Antwerp

Joint work with Maximilian Dylla, Sairam Gurajada, Angelika Kimmig, Andre Melo, Iris Miliaraki, Luc de Raedt, Mauro Sozio, Fabian Suchanek

Page 2:

“The important thing is to not stop questioning ... One cannot help but be in awe when contemplating the mysteries of eternity, of life, of the marvelous structure of reality. It is enough if one tries merely to comprehend a little of this mystery every day.”

- Albert Einstein, 1936

“The Marvelous Structure of Reality”, Joseph M. Hellerstein, Keynote at WebDB 2003, San Diego

Page 3:

Look, There is Structure!

The important thing is to not stop questioning

Page 4:

Look, There is Structure!

Plethora of natural-language-processing techniques & tools:
• Part-Of-Speech (POS) Tagging
• Named-Entity Recognition & Disambiguation (NERD)
• Dependency Parsing
• Semantic Role Labeling

Text is not just “unstructured data”

Page 5:

Look, There is Structure!

Plethora of natural-language-processing techniques & tools:
• Part-Of-Speech (POS) Tagging
• Named-Entity Recognition & Disambiguation (NERD)
• Dependency Parsing
• Semantic Role Labeling

Text is not just “unstructured data”

But:
• Even the best NLP tools frequently yield errors
• Facts found on the Web are logically inconsistent
• Web-extracted knowledge bases are inherently incomplete

Page 6:

Information Extraction

YAGO/DBpedia et al.: >120 M facts for YAGO2 (mostly from Wikipedia infoboxes)

bornOn(Jeff, 09/22/42)
gradFrom(Jeff, Columbia)
hasAdvisor(Jeff, Arthur)
hasAdvisor(Surajit, Jeff)
knownFor(Jeff, Theory)

New fact candidates: 100’s M additional facts from Wikipedia free-text

type(Jeff, Author) [0.9]
author(Jeff, Drag_Book) [0.8]
author(Jeff, Cind_Book) [0.6]
worksAt(Jeff, Bell_Labs) [0.7]
type(Jeff, CEO) [0.4]

Page 7:

YAGO Knowledge Base

http://www.mpi-inf.mpg.de/yago-naga/

[Figure: excerpt of the YAGO knowledge graph. Entities such as Max_Planck, Erwin_Planck, Angela Merkel, Kiel, Schleswig-Holstein, Germany, Nobel Prize and Max_Planck Society are connected by relations such as bornOn, diedOn, bornIn, fatherOf, hasWon, citizenOf and locatedIn. instanceOf edges link entities to classes (Person, Scientist, Physicist, Biologist, Politician, City, State, Country, Location, Organization, Entity), which are arranged in a subclass hierarchy. means edges link names such as “Max Planck”, “Max Karl Ernst Ludwig Planck”, “Angela Merkel” and “Angela Dorothea Merkel” to entities. Dates shown: Apr 23, 1858; Oct 4, 1947; Oct 23, 1944.]

3 M entities, 120 M facts, 100 relations, 200k classes, accuracy 95%

Page 8:

Linked Open Data

http://linkeddata.org/

As of Sept. 2011:
• >200 linked-data sources
• >30 billion RDF triples
• >400 million owl:sameAs links

Page 9:

Maybe Even More Importantly: Linked Vocabularies!

Source: http://en.wikipedia.org/wiki/Linked_data

• LinkedData.org: Instance & class links between DBpedia, WordNet, OpenCyc, GeoNames, and many more…
• Schema.org: Common vocabulary released by Google, Yahoo!, Bing to annotate Web pages, incl. links to DBpedia.
• Micro-Formats: RDFa (W3C)

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      version="XHTML+RDFa 1.0" xml:lang="en">
  <head><title>Martin's Home Page</title>
    <base href="http://adrem.ua.ac.be/~tmartin/" />
    <meta property="dc:creator" content="Martin" /> </head>

Page 10:

As of Sept. 2011: > 5 million owl:sameAs links between DBpedia/YAGO/Freebase

Page 11:

Application I: Enrichment of Search Results

“Recent Advances in Structured Data and the Web.” Alon Y. Halevy, Keynote at ICDE 2013, Brisbane

Page 12:

It’s about the disappearance forty years ago of Harriet Vanger, a young scion of one of the wealthiest families in Sweden, and about her uncle, determined to know the truth about what he believes was her murder.

Blomkvist visits Henrik Vanger at his estate on the tiny island of Hedeby. The old man draws Blomkvist in by promising solid evidence against Wennerström. Blomkvist agrees to spend a year writing the Vanger family history as a cover for the real assignment: the disappearance of Vanger's niece Harriet some 40 years earlier. Hedeby is home to several generations of Vangers, all part owners in Vanger Enterprises. Blomkvist becomes acquainted with the members of the extended Vanger family, most of whom resent his presence. He does, however, start a short-lived affair with Cecilia, the niece of Henrik.

After discovering that Salander has hacked into his computer, he persuades her to assist him with research. They eventually become lovers, but Blomkvist has trouble getting close to Lisbeth, who treats virtually everyone she meets with hostility. Ultimately the two discover that Harriet's brother Martin, CEO of Vanger Industries, is secretly a serial killer.

A 24-year-old computer hacker sporting an assortment of tattoos and body piercings supports herself by doing deep background investigations for Dragan Armansky, who, in turn, worries that Lisbeth Salander is “the perfect victim for anyone who wished her ill.”

Application II: Machine Reading

[Figure: the passages above annotated with coreference links (“same”) between mentions of the same entity and with extracted relations: uncleOf, owns, hires, headOf, affairWith, enemyOf.]

Etzioni, Banko, Cafarella: Machine Reading. AAAI’06
Mitchell, Carlson et al.: Toward an Architecture for Never-Ending Language Learning. AAAI’10

Page 13:

Application III: Natural-Language Question Answering

evi.com (formerly trueknowledge.com)

Page 14:

Application III: Natural-Language Question Answering

wolframalpha.com
• >10 trillion(!) facts
• >50,000 search algorithms
• >5,000 visualizations

Page 15:

IBM Watson: Deep Question Answering

99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain

This town is known as "Sin City" & its downtown is "Glitter Gulch"

William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel

As of 2010, this is the only former Yugoslav republic in the EU

www.ibm.com/innovation/us/watson/index.htm

Knowledge back-ends

Question classification & decomposition

D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine, Fall 2010.

Page 16:

Natural-Language QA over Linked Data

Multilingual Question Answering over Linked Data (QALD-3), CLEF 2011-13

http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/

<question id="4" answertype="resource" aggregation="false" onlydbo="true">
  <string lang="en">Which river does the Brooklyn Bridge cross?</string>
  <string lang="de">Welchen Fluss überspannt die Brooklyn Bridge?</string>
  <string lang="es">¿Por qué río cruza la Brooklyn Bridge?</string>
  <string lang="it">Quale fiume attraversa il ponte di Brooklyn?</string>
  <string lang="fr">Quelle cours d'eau est traversé par le pont de Brooklyn?</string>
  <string lang="nl">Welke rivier overspant de Brooklyn Bridge?</string>
  <keywords lang="en">river, cross, Brooklyn Bridge</keywords>
  <keywords lang="de">Fluss, überspannen, Brooklyn Bridge</keywords>
  <keywords lang="es">río, cruza, Brooklyn Bridge</keywords>
  <keywords lang="it">fiume, attraversare, ponte di Brooklyn</keywords>
  <keywords lang="fr">cours d'eau, pont de Brooklyn</keywords>
  <keywords lang="nl">rivier, Brooklyn Bridge, overspant</keywords>
  <query>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX res: <http://dbpedia.org/resource/>
    SELECT DISTINCT ?uri
    WHERE { res:Brooklyn_Bridge dbo:crosses ?uri . }
  </query>
</question>

Page 17:

Natural-Language QA over Linked Data

INEX Linked Data Track, CLEF 2012-13

https://inex.mmci.uni-saarland.de/tracks/lod/

<topic id="2012374" category="Politics">
  <jeopardy_clue>Which German politician is a successor of another politician who stepped down before his or her actual term was over, and what is the name of their political ancestor?</jeopardy_clue>
  <keyword_title>German politicians successor other stepped down before actual term name ancestor</keyword_title>
  <sparql_ft>
    SELECT ?s ?s1 WHERE {
      ?s rdf:type <http://dbpedia.org/class/yago/GermanPoliticians> .
      ?s1 <http://dbpedia.org/property/successor> ?s .
      FILTER FTContains (?s, "stepped down early") .
    }
  </sparql_ft>
</topic>

Page 18:

Outline

• Probabilistic Databases
  – Stanford’s Trio System: Data, Uncertainty & Lineage
  – Handling Uncertain RDF Data: URDF (Max-Planck-Institute/U-Antwerp)
• Probabilistic & Temporal Databases
  – Sequenced vs. Non-Sequenced Semantics
  – Interval Alignment & Probabilistic Inference
• Probabilistic Programming
  – Statistical Relational Learning
  – Learning “Interesting” Deduction Rules
• Summary & Challenges

Page 19:

Probabilistic Databases: A Panacea to All of the Aforementioned Tasks?

Probabilistic databases combine first-order logic and probability theory in an elegant way:

• Declarative: Queries formulated in SQL/Relational Algebra/Datalog, support for updates, transactions, etc.
• Deductive: Well-studied resolution algorithms for SQL/Relational Algebra/Datalog (top-down/bottom-up), indexes, automatic query optimization
• Scalable (?): Polynomial data complexity (SQL), but #P-complete for the probabilistic inference

Page 20:

Probabilistic Database

A probabilistic database Dp (compactly) encodes a probability distribution over a finite set of deterministic database instances Di.

Possible instances of WorksAt(Sub, Obj):

D1 = { (Jeff, Stanford), (Jeff, Princeton) }   P = 0.42
D2 = { (Jeff, Stanford) }                      P = 0.18
D3 = { (Jeff, Princeton) }                     P = 0.28
D4 = { }                                       P = 0.12

Query Semantics (“Marginal Probabilities”): Run query Q against each instance Di; for each answer tuple t, sum up the probabilities of all instances Di where t exists.

Special Cases:

(I) Tuple-independent PDB

WorksAt(Sub, Obj)    p
Jeff   Stanford      0.6
Jeff   Princeton     0.7

(II) Block-independent PDB

WorksAt(Sub, Obj)    p
Jeff   Stanford      0.6
       Princeton     0.4

Note: (I) and (II) are not equivalent!
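The possible-worlds semantics of the tuple-independent case (I) can be sketched in a few lines of Python. This is only an illustration under the assumption that every tuple is present independently with its listed probability; the helper names `possible_worlds` and `marginal` and the example query are hypothetical and not part of any system mentioned on the slides.

```python
from itertools import product

# Tuple-independent table (I) from the slide: each tuple exists independently.
works_at = {("Jeff", "Stanford"): 0.6, ("Jeff", "Princeton"): 0.7}

def possible_worlds(table):
    """Yield (instance, probability) for every subset of the tuples."""
    tuples = list(table)
    for mask in product([True, False], repeat=len(tuples)):
        prob = 1.0
        instance = set()
        for t, present in zip(tuples, mask):
            prob *= table[t] if present else 1.0 - table[t]
            if present:
                instance.add(t)
        yield frozenset(instance), prob

def marginal(table, query):
    """Sum the probabilities of all instances in which the Boolean query holds."""
    return sum(p for inst, p in possible_worlds(table) if query(inst))

worlds = dict(possible_worlds(works_at))
# Hypothetical Boolean query: does Jeff work anywhere at all?
p_any = marginal(works_at, lambda inst: len(inst) > 0)
```

Enumerating all instances is exponential in the number of tuples, so this only serves to make the definition concrete; it reproduces the four instance probabilities 0.42, 0.18, 0.28 and 0.12 listed above.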

Page 21:

Stanford Trio System

Uncertainty-Lineage Databases (ULDBs) [Widom: CIDR 2005]

1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidence values
4. Lineage

Page 22:

Trio’s Data Model

1. Alternatives: uncertainty about value

Saw (witness, color, car)
Amy     red, Honda ∥ red, Toyota ∥ orange, Mazda

Three possible instances

Page 23:

Trio’s Data Model

1. Alternatives
2. ‘?’ (Maybe): uncertainty about presence

Saw (witness, color, car)
Amy     red, Honda ∥ red, Toyota ∥ orange, Mazda
Betty   blue, Acura   ?

Six possible instances

Page 24:

Trio’s Data Model

1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidences: weighted uncertainty

Saw (witness, color, car)
Amy     red, Honda 0.5 ∥ red, Toyota 0.3 ∥ orange, Mazda 0.2
Betty   blue, Acura 0.6   ?

Still six possible instances, each with a probability
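The six weighted instances can be enumerated with a small sketch. It assumes that the x-tuples are mutually independent and that a tuple whose confidences sum to less than 1 is absent with the remaining probability (the ‘?’ case); the function names are illustrative only and are not Trio's API.

```python
from itertools import product

# Each x-tuple lists (alternative, confidence). Assumption: if the confidences
# sum to less than 1, the remainder is the probability that the tuple is absent.
saw = {
    "Amy": [(("red", "Honda"), 0.5), (("red", "Toyota"), 0.3), (("orange", "Mazda"), 0.2)],
    "Betty": [(("blue", "Acura"), 0.6)],
}

def choices(alternatives):
    """Return the alternatives plus an explicit 'absent' option if needed."""
    options = list(alternatives)
    missing = 1.0 - sum(conf for _, conf in alternatives)
    if missing > 1e-9:
        options.append((None, missing))
    return options

def possible_instances(xtuples):
    """Yield (rows, probability) for every combination of choices."""
    witnesses = list(xtuples)
    for combo in product(*(choices(xtuples[w]) for w in witnesses)):
        prob = 1.0
        rows = []
        for witness, (alt, conf) in zip(witnesses, combo):
            prob *= conf
            if alt is not None:
                rows.append((witness,) + alt)
        yield tuple(rows), prob

instances = list(possible_instances(saw))
```

Amy's three alternatives combine with Betty's tuple being present or absent, which gives the 3 × 2 = 6 instances stated on the slide, with probabilities that sum to 1.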

Page 25:

So Far: Model is Not Closed

Saw (witness, car)
Cathy   Honda ∥ Mazda

Drives (person, car)
Jimmy, Toyota ∥ Jimmy, Mazda
Billy, Honda ∥ Frank, Honda
Hank, Honda

Suspects = πperson(Saw ⋈ Drives)

Suspects
Jimmy           ?
Billy ∥ Frank   ?
Hank            ?

Does not (CANNOT) correctly capture the possible instances in the result

Page 26:

Example with Lineage

ID   Saw (witness, car)
11   Cathy   Honda ∥ Mazda

ID   Drives (person, car)
21   Jimmy, Toyota ∥ Jimmy, Mazda
22   Billy, Honda ∥ Frank, Honda
23   Hank, Honda

Suspects = πperson(Saw ⋈ Drives)

ID   Suspects
31   Jimmy           ?
32   Billy ∥ Frank   ?
33   Hank            ?

λ(31) = (11,2) ∧ (21,2)
λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2)
λ(33) = (11,1) ∧ 23

Page 27:

Example with Lineage

(Same Saw, Drives and Suspects relations and the same lineage λ as on Page 26.)

Correctly captures the possible instances in the result (4)
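The count of four result instances can be checked by brute force: enumerate every combination of alternatives in Saw and Drives, evaluate the query on each deterministic instance, and collect the distinct results. The sketch below assumes that x-tuples choose their alternatives independently; the function names are illustrative.

```python
from itertools import product

# Base relations with tuple IDs; each ID maps to its list of alternatives.
saw = {11: [("Cathy", "Honda"), ("Cathy", "Mazda")]}
drives = {
    21: [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    22: [("Billy", "Honda"), ("Frank", "Honda")],
    23: [("Hank", "Honda")],
}

def instances(relation):
    """Yield every deterministic instance (one alternative per x-tuple)."""
    ids = sorted(relation)
    for picks in product(*(range(len(relation[i])) for i in ids)):
        yield {i: relation[i][k] for i, k in zip(ids, picks)}

def suspects(saw_inst, drives_inst):
    """Suspects = projection on person of (Saw join Drives) on car."""
    cars_seen = {car for _, car in saw_inst.values()}
    return frozenset(p for p, car in drives_inst.values() if car in cars_seen)

results = {suspects(s, d) for s in instances(saw) for d in instances(drives)}
```

The distinct results are {Billy, Hank}, {Frank, Hank}, {Jimmy} and the empty set. A Suspects table with three independent maybe-tuples would also admit instances such as {Jimmy, Hank}, which never occur; the lineage is what rules them out.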

Page 28:

Operational Semantics

Dp  --(direct implementation)-->  Dp′
|                                  ^
| possible instances               | rep. of instances
v                                  |
D1, D2, …, Dn  --(Q on each instance)-->  D1’, D2’, …, Dm’

Closure: up-arrow always exists
Completeness: any (finite) set of possible instances can be represented

But: data complexity is #P-complete!

Page 29:

Summary on Trio’s Data Model

Uncertainty-Lineage Databases (ULDBs)

1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidence values
4. Lineage

Theorem: ULDBs are closed and complete.

Formally studied properties like minimization, equivalence, approximation and membership based on lineage. [Benjelloun, Das Sarma, Halevy, Widom, Theobald: VLDB-J. 2008]

Page 30:

Basic Complexity Issue

Theorem [Valiant: 1979]: For a Boolean expression E, computing Pr(E) is #P-complete.

NP = class of problems of the form “is there a witness?” (SAT)
#P = class of problems of the form “how many witnesses?” (#SAT)

The decision problem for 2CNF is in PTIME.
The counting problem for 2CNF is already #P-complete.

(will be coming back to this later again…)

[Suciu & Dalvi: SIGMOD’05 Tutorial on "Foundations of Probabilistic Answers to Queries"]

Page 31:

…back to Information Extraction

bornIn(Barack, Honolulu)
bornIn(Barack, Kenya)

Page 32:

Uncertain RDF (URDF): Facts & Rules

Extensional Knowledge (the “facts”)
• High-confidence facts: existing knowledge base (“ground truth”)
• New fact candidates: extracted fact candidates with confidences
• Linked-Data & integration of various knowledge sources: Ontology merging or explicitly linked facts (owl:sameAs, owl:equivProp.)
Large “Probabilistic Database” of RDF facts

Intensional Knowledge (the “rules”)
• Soft rules: deductive grounding & lineage (Datalog/SLD resolution)
• Hard rules: consistency constraints (more general FOL rules)
Propositional & probabilistic inference at query-time!

Page 33:

Soft Rules vs. Hard Rules

(Soft) Deduction Rules vs. (Hard) Consistency Constraints

(Soft) Deduction Rules
People may live in more than one place
  livesIn(x,y) ∧ marriedTo(x,z) → livesIn(z,y)   [0.8]
  livesIn(x,y) ∧ hasChild(x,z) → livesIn(z,y)    [0.5]

(Hard) Consistency Constraints
People are not born in different places/on different dates
  bornIn(x,y) ∧ bornIn(x,z) → y=z
  bornOn(x,y) ∧ bornOn(x,z) → y=z
People are not married to more than one person (at the same time, in most countries?)
  marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y≠z → disjoint(t1,t2)

Page 34:

Soft Rules vs. Hard Rules

(Soft) Deduction Rules vs. (Hard) Consistency Constraints

(Soft) Deduction Rules
People may live in more than one place
  livesIn(x,y) ∧ marriedTo(x,z) → livesIn(z,y)   [0.8]
  livesIn(x,y) ∧ hasChild(x,z) → livesIn(z,y)    [0.5]
Deductive Database: Datalog, core of SQL & Relational Algebra, RDF/S, OWL2-RL, etc.

(Hard) Consistency Constraints
People are not born in different places/on different dates
  bornIn(x,y) ∧ bornIn(x,z) → y=z
  bornOn(x,y) ∧ bornOn(x,z) → y=z
People are not married to more than one person (at the same time, in most countries?)
  marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y≠z → disjoint(t1,t2)
More General FOL Constraints: Datalog plus constraints, X-tuples in PDB’s, owl:FunctionalProperty, owl:disjointWith, etc.

Page 35:

URDF Running Example

KB: RDF Base Facts

[Figure: RDF graph over the entities Jeff, Surajit, David, Stanford and Princeton and the classes University and Computer Scientist. Edge labels: type[1.0] (five edges), worksAt[0.9], graduatedFrom[0.6], graduatedFrom[0.7], graduatedFrom[0.9], hasAdvisor[0.8], hasAdvisor[0.7]. Derived edges are drawn as graduatedFrom[?].]

Derived Facts
  gradFr(Surajit, Stanford)
  gradFr(David, Stanford)

Rules
  hasAdvisor(x,y) ∧ worksAt(y,z) → graduatedFrom(x,z)   [0.4]
  graduatedFrom(x,y) ∧ graduatedFrom(x,z) → y=z

Page 36:

Basic Types of Inference

MAP Inference
• Find the most likely assignment to query variables y under a given evidence x.
• Compute: arg max_y P(y | x)   (NP-complete for MaxSAT)

Marginal/Success Probabilities
• Probability that query y is true in a random world under a given evidence x.
• Compute: ∑y P(y | x)   (#P-complete already for conjunctive queries)

Page 37:

General Route: Grounding & MaxSAT Solving

Query: graduatedFrom(x, y)

1) Grounding
   – Consider only facts (and rules) which are relevant for answering the query
2) Propositional formula in CNF, consisting of
   – Grounded soft & hard rules
   – Weighted base facts
3) Propositional Reasoning
   – Find truth assignment to facts such that the total weight of the satisfied clauses is maximized

MAP inference: compute “most likely” possible world

CNF (weight, clause):
1000   (¬graduatedFrom(Surajit, Stanford) ∨ ¬graduatedFrom(Surajit, Princeton))
1000   (¬graduatedFrom(David, Stanford) ∨ ¬graduatedFrom(David, Princeton))
0.4    (¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford))
0.4    (¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford))
0.9    worksAt(Jeff, Stanford)
0.8    hasAdvisor(Surajit, Jeff)
0.7    hasAdvisor(David, Jeff)
0.7    graduatedFrom(Surajit, Princeton)
0.6    graduatedFrom(Surajit, Stanford)
0.9    graduatedFrom(David, Princeton)
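For a formula this small, the three steps can be illustrated by exhaustive search over all truth assignments. The sketch below is a brute-force weighted MaxSAT, not the URDF solver; the clause weights follow the slide (1000 for hard rules, 0.4 for soft rules), and the fact weights are taken from the base-fact confidences of the running example, which is an assumption about how the weights on this slide pair with the facts.

```python
from itertools import product

WA, HS, HD = "worksAt(Jeff,Stanford)", "hasAdvisor(Surajit,Jeff)", "hasAdvisor(David,Jeff)"
SP, SS = "gradFrom(Surajit,Princeton)", "gradFrom(Surajit,Stanford)"
DP, DS = "gradFrom(David,Princeton)", "gradFrom(David,Stanford)"
VARS = [WA, HS, HD, SP, SS, DP, DS]

# A clause is (weight, literals); a literal is (variable, required truth value).
CLAUSES = [
    (1000, [(SS, False), (SP, False)]),            # hard: Surajit graduated from one place
    (1000, [(DS, False), (DP, False)]),            # hard: David graduated from one place
    (0.4, [(HS, False), (WA, False), (SS, True)]),  # grounded soft rule for Surajit
    (0.4, [(HD, False), (WA, False), (DS, True)]),  # grounded soft rule for David
    (0.9, [(WA, True)]), (0.8, [(HS, True)]), (0.7, [(HD, True)]),
    (0.7, [(SP, True)]), (0.6, [(SS, True)]), (0.9, [(DP, True)]),
]

def weight(world):
    """Total weight of the clauses satisfied by a truth assignment."""
    return sum(w for w, lits in CLAUSES if any(world[v] == pol for v, pol in lits))

def map_world():
    """Exhaustively search all 2^7 assignments for the maximum-weight world."""
    worlds = (dict(zip(VARS, bits)) for bits in product([True, False], repeat=len(VARS)))
    return max(worlds, key=weight)

best = map_world()
```

Under these weights the best world keeps graduatedFrom(Surajit, Stanford) and drops graduatedFrom(Surajit, Princeton): giving up the Princeton fact costs 0.7, whereas giving up the Stanford fact costs 0.6 plus the violated soft rule (0.4). Exhaustive search is exponential in the number of facts, which is why an approximation algorithm is used in practice.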

Page 38:

URDF: MaxSAT Solving with Soft & Hard Rules

[Theobald, Sozio, Suchanek, Nakashole: VLDS’12]

Find: arg max_y P(y | x)
Resolves to a variant of MaxSAT for propositional formulas

Special case: Horn-clauses as soft rules & mutex-constraints as hard rules

S: Mutex-const.
{ graduatedFrom(Surajit, Stanford), graduatedFrom(Surajit, Princeton) }
{ graduatedFrom(David, Stanford), graduatedFrom(David, Princeton) }

C: Weighted Horn clauses (CNF)
0.4   (¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford))
0.4   (¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford))
Weighted base facts: worksAt(Jeff, Stanford), hasAdvisor(Surajit, Jeff), hasAdvisor(David, Jeff), graduatedFrom(Surajit, Princeton), graduatedFrom(Surajit, Stanford), graduatedFrom(David, Princeton)
Weights as listed on the slide: 0.9 0.8 0.7 0.6 0.7 0.9

MaxSAT Alg.

Compute W_0 = ∑_{clauses C} w(C) P(C is satisfied);
For each hard constraint S_t {
  For each fact f in S_t {
    Compute W_t^{f+} = ∑_{clauses C} w(C) P(C is sat. | f = true);
  }
  Compute W_t^{S-} = ∑_{clauses C} w(C) P(C is sat. | S_t = false);
  Choose truth assignment to f in S_t that maximizes W_t^{f+}, W_t^{S-};
  Remove satisfied clauses C;
  t++;
}

• Runtime: O(|S||C|)
• Approximation guarantee of 1/2

Page 39:

Experiment (I): MAP Inference

• YAGO Knowledge Base: 2 Mio entities, 20 Mio facts
• Query Answering: Deductive grounding & MaxSAT solving for 10 queries over 16 soft rules (partly recursive) & 5 hard rules (bornIn, diedIn, marriedTo, …)
• Asymptotic runtime checks via synthetic (random) soft rule expansions

[Figure panels: “URDF: Grounding & MaxSAT solving” and “URDF MaxSAT vs. Markov Logic (MAP inference & MC-SAT)”.]

|C| - # literals in grounded soft rules
|S| - # literals in grounded hard rules

Page 40:

Basic Types of Inference

✔ MAP Inference
• Find the most likely assignment to query variables y under a given evidence x.
• Compute: arg max_y P(y | x)   (NP-complete for MaxSAT)

Marginal/Success Probabilities
• Probability that query y is true in a random world under a given evidence x.
• Compute: ∑y P(y | x)   (#P-complete already for conjunctive queries)

Page 41:

Deductive Grounding with Lineage (SLD Resolution in Datalog/Prolog)

[Yahya, Theobald: RuleML’11; Dylla, Miliaraki, Theobald: ICDE’13]

Query: graduatedFrom(Surajit, y)

Rules
  hasAdvisor(x,y) ∧ worksAt(y,z) → graduatedFrom(x,z)   [0.4]
  graduatedFrom(x,y) ∧ graduatedFrom(x,z) → y=z

Base Facts
  graduatedFrom(Surajit, Princeton) [0.7]
  graduatedFrom(Surajit, Stanford) [0.6]
  graduatedFrom(David, Princeton) [0.9]
  hasAdvisor(Surajit, Jeff) [0.8]
  hasAdvisor(David, Jeff) [0.7]
  worksAt(Jeff, Stanford) [0.9]
  type(Princeton, University) [1.0]
  type(Stanford, University) [1.0]
  type(Jeff, Computer_Scientist) [1.0]
  type(Surajit, Computer_Scientist) [1.0]
  type(David, Computer_Scientist) [1.0]

Lineage variables
  A = graduatedFrom(Surajit, Princeton) [0.7]
  B = graduatedFrom(Surajit, Stanford) [0.6]
  C = hasAdvisor(Surajit, Jeff) [0.8]
  D = worksAt(Jeff, Stanford) [0.9]

Query answers and their lineage
  Q1 = graduatedFrom(Surajit, Princeton):   A ∧ ¬(B ∨ (C ∧ D))
  Q2 = graduatedFrom(Surajit, Stanford):    ¬A ∧ (B ∨ (C ∧ D))

Page 42:

Lineage & Possible Worlds

[Das Sarma, Theobald, Widom: ICDE’08; Dylla, Miliaraki, Theobald: ICDE’13]

1) Deductive Grounding
   – Dependency graph of the query
   – Trace lineage of individual query answers
2) Lineage DAG (not in CNF), consisting of
   – Grounded soft & hard rules
   – Probabilistic base facts
3) Probabilistic Inference
   Compute marginals:
   – P(Q): sum up the probabilities of all possible worlds that entail the query answers’ lineage
   – P(Q|H): drop “impossible worlds”

Query: graduatedFrom(Surajit, y)

Lineage variables
  A = graduatedFrom(Surajit, Princeton) [0.7]
  B = graduatedFrom(Surajit, Stanford) [0.6]
  C = hasAdvisor(Surajit, Jeff) [0.8]
  D = worksAt(Jeff, Stanford) [0.9]

Probabilities computed bottom-up over the lineage DAG
  C ∧ D:                                                        0.8 × 0.9 = 0.72
  B ∨ (C ∧ D):                                                  1 − (1 − 0.72) × (1 − 0.6) = 0.888
  Q1 = graduatedFrom(Surajit, Princeton): A ∧ ¬(B ∨ (C ∧ D)):   0.7 × (1 − 0.888) = 0.078
  Q2 = graduatedFrom(Surajit, Stanford): ¬A ∧ (B ∨ (C ∧ D)):    (1 − 0.7) × 0.888 = 0.266

Page 43:

Possible Worlds Semantics

A: 0.7   B: 0.6   C: 0.8   D: 0.9

Q2: ¬A ∧ (B ∨ (C ∧ D))

A  B  C  D  Q2   P(W)
1  1  1  1  0    0.7x0.6x0.8x0.9 = 0.3024
1  1  1  0  0    0.7x0.6x0.8x0.1 = 0.0336
1  1  0  1  0    … = 0.0756
1  1  0  0  0    … = 0.0084
1  0  1  1  0    … = 0.2016
1  0  1  0  0    … = 0.0224
1  0  0  1  0    … = 0.0504
1  0  0  0  0    … = 0.0056
0  1  1  1  1    0.3x0.6x0.8x0.9 = 0.1296
0  1  1  0  1    0.3x0.6x0.8x0.1 = 0.0144
0  1  0  1  1    0.3x0.6x0.2x0.9 = 0.0324
0  1  0  0  1    0.3x0.6x0.2x0.1 = 0.0036
0  0  1  1  1    0.3x0.4x0.8x0.9 = 0.0864
0  0  1  0  0    … = 0.0096
0  0  0  1  0    … = 0.0216
0  0  0  0  0    … = 0.0024

Sum over all worlds: 1.0

P(Q2) = 0.2664
P(Q2|H) = 0.2664 / 0.412 = 0.6466

P(Q1) = 0.0784
P(Q1|H) = 0.0784 / 0.412 = 0.1903

Hard rule H: A (B (CD))
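The unconditioned marginals P(Q1) and P(Q2) in the table can be reproduced by summing over all 16 worlds. The sketch below assumes independent base facts; the logical connectives of Q1 and Q2 are inferred from the arithmetic on the preceding lineage slide (0.7 × (1 − 0.888) and (1 − 0.7) × 0.888), and the function names are illustrative.

```python
from itertools import product

# Base-fact probabilities from the slide.
P = {"A": 0.7, "B": 0.6, "C": 0.8, "D": 0.9}

def q1(w):
    """Lineage of Q1 (Princeton): A and not (B or (C and D))."""
    return w["A"] and not (w["B"] or (w["C"] and w["D"]))

def q2(w):
    """Lineage of Q2 (Stanford): not A and (B or (C and D))."""
    return (not w["A"]) and (w["B"] or (w["C"] and w["D"]))

def prob(formula):
    """Sum the probabilities of all possible worlds that satisfy the formula."""
    total = 0.0
    for bits in product([True, False], repeat=len(P)):
        world = dict(zip(P, bits))
        weight = 1.0
        for var, value in world.items():
            weight *= P[var] if value else 1.0 - P[var]
        if formula(world):
            total += weight
    return total

p_q1 = prob(q1)
p_q2 = prob(q2)
```

The enumeration agrees with the closed-form values obtained from independent-and and independent-or on the lineage DAG, which is possible here because every variable occurs only once in each formula.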

Page 44:

Inference in Probabilistic Databases

• Safe query plans [Dalvi, Suciu: VLDB-J’07]
  Can propagate confidences along with relational operators.
• Read-once functions [Sen, Deshpande, Getoor: PVLDB’10]
  Can factorize Boolean formula (in polynomial time) into read-once form, where every variable occurs at most once.
• Knowledge compilation [Olteanu et al.: ICDT’10, ICDT’11]
  Can decompose Boolean formula into ordered binary decision diagram (OBDD), such that inference resolves to independent-and and independent-or operations over the decomposed formula.
• Top-k pruning [Ré, Dalvi, Suciu: ICDE’07; Karp, Luby, Madras: J-Alg.’89]
  Can return top-k answers based on lower and upper bounds, even without knowing their exact marginal probabilities.
  Multi-Simulation: run multiple Markov-Chain-Monte-Carlo (MCMC) simulations in parallel.

Page 45:

Monte Carlo Simulation (I)

[Suciu & Dalvi: SIGMOD’05 Tutorial on "Foundations of Probabilistic Answers to Queries"; Karp, Luby, Madras: J-Alg.’89]

Boolean formula:
  E = X1X2 ∨ X1X3 ∨ X2X3

Naïve sampling:
  cnt = 0
  repeat N times
    randomly choose X1, X2, X3 ∈ {0,1}
    if E(X1, X2, X3) = 1 then cnt = cnt+1
  P = cnt/N
  return P   /* estimate for true Pr(E) */

Zero/One-Estimator Theorem:
  If N ≥ (1/Pr(E)) × (4 ln(2/δ)/ε²) then: Pr[ |P/Pr(E) − 1| > ε ] < δ

Works for any E (not in PTIME)
N may be very big for small Pr(E)
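The naïve sampler translates almost line by line into Python. The sketch assumes that each variable is drawn uniformly from {0,1}, as in the pseudocode; the fixed seed and the sample size of 20000 are arbitrary choices for illustration.

```python
import random

def e(x1, x2, x3):
    """E = X1X2 or X1X3 or X2X3 (true when at least two variables are true)."""
    return (x1 and x2) or (x1 and x3) or (x2 and x3)

def naive_estimate(n, rng):
    """Zero/one estimator: fraction of uniform samples that satisfy E."""
    cnt = 0
    for _ in range(n):
        x1, x2, x3 = (rng.random() < 0.5 for _ in range(3))
        if e(x1, x2, x3):
            cnt += 1
    return cnt / n

estimate = naive_estimate(20000, random.Random(0))
```

For this formula exactly 4 of the 8 assignments satisfy E, so the estimate should be close to 0.5. When Pr(E) is very small, almost no sample hits a satisfying assignment, which is the weakness the theorem's 1/Pr(E) factor expresses.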

Page 46:

Monte Carlo Simulation (II)

[Suciu & Dalvi: SIGMOD’05 Tutorial on "Foundations of Probabilistic Answers to Queries"; Karp, Luby, Madras: J-Alg.’89]

Boolean formula in DNF:
  E = C1 ∨ C2 ∨ . . . ∨ Cm

Importance sampling:
  cnt = 0; S = Pr(C1) + … + Pr(Cm)
  repeat N times
    randomly choose i ∈ {1,2,…, m}, with prob. Pr(Ci)/S
    randomly choose X1, …, Xn ∈ {0,1} s.t. Ci = 1
    if C1 = 0 and C2 = 0 and … and Ci-1 = 0 then cnt = cnt+1
  P = (cnt/N) × S
  return P   /* estimate for true Pr(E) */

Theorem:
  If N ≥ m × (4 ln(2/δ)/ε²) then: Pr[ |P/Pr(E) − 1| > ε ] < δ

This is better!
Only for E in DNF (in PTIME)
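A minimal Python sketch of this importance sampler is given below. It assumes independent variables and clauses that contain only positive literals; these restrictions, the function name `karp_luby` and the seed are illustrative choices. Each sample is counted with probability Pr(E)/S, so the sketch scales the counted fraction by S to obtain the estimate of Pr(E).

```python
import random

def karp_luby(clauses, p, n, rng):
    """Estimate Pr(E) for a DNF E = C1 or ... or Cm.

    clauses: list of sets of variable indices (positive literals only).
    p: list with p[j] = Pr(Xj = 1), variables assumed independent.
    """
    clause_prob = []
    for clause in clauses:
        pr = 1.0
        for j in clause:
            pr *= p[j]
        clause_prob.append(pr)
    s = sum(clause_prob)
    cnt = 0
    for _ in range(n):
        # Pick clause i with probability Pr(Ci)/S.
        i = rng.choices(range(len(clauses)), weights=clause_prob)[0]
        # Sample an assignment conditioned on Ci being true.
        x = [True if j in clauses[i] else rng.random() < p[j] for j in range(len(p))]
        # Count the sample only if no earlier clause is satisfied.
        if not any(all(x[j] for j in clauses[k]) for k in range(i)):
            cnt += 1
    return cnt / n * s

dnf = [{0, 1}, {0, 2}, {1, 2}]  # E = X1X2 or X1X3 or X2X3
estimate = karp_luby(dnf, [0.5, 0.5, 0.5], 20000, random.Random(0))
```

On the formula from the previous slide, S = 0.75 and about two thirds of the samples are counted, so the estimate again lands near 0.5. Because the counted fraction is at least 1/m, the number of samples needed no longer depends on how small Pr(E) is.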

Page 47:

Top-k Ranking by Marginal Probabilities

[Dylla, Miliaraki, Theobald: ICDE’13]

Query: graduatedFrom(Surajit, y)

[Figure: partially grounded lineage. Q1 = graduatedFrom(Surajit, Princeton) depends on A = graduatedFrom(Surajit, Princeton) [0.7]. Q2 = graduatedFrom(Surajit, Stanford) is a disjunction (∨) of B = graduatedFrom(Surajit, Stanford) [0.6] and the open subgoal graduatedFrom(Surajit, y=Stanford), below which the conjunction (∧) of C = hasAdvisor(Surajit, Jeff) [0.8] and D = worksAt(Jeff, Stanford) [0.9] is shown.]

Datalog/SLD resolution
• Top-down grounding allows us to compute lower and upper bounds on the marginal probabilities of answer candidates before rules are fully grounded.
• Subgoals may represent sets of answer candidates.
• First-order lineage formulas:
    Φ(Q1) = A
    Φ(Q2) = B ∨ ∃y gradFrom(Surajit,y)
• Prune entire set of answer candidates represented by Φ.

Page 48: 10 Years of Probabilistic Querying – What Next? Martin Theobald University of Antwerp Joint work with Maximilian Dylla, Sairam Gurajada, Angelika Kimmig,

Bounds for First-Order Formulas

Theorem 1: Given a (partially grounded) first-order lineage formula Φ:

Φ(Q2) = B ∨ ∃y gradFrom(S,y)

Lower bound Plow (for all query answers that can be obtained from grounding Φ): Substitute ∃y gradFrom(S,y) with false (or true if negated).

Plow(Q2) = P(B ∨ false) = P(B) = 0.6

Upper bound Pup (for all query answers that can be obtained from grounding Φ): Substitute ∃y gradFrom(S,y) with true (or false if negated).

Pup(Q2) = P(B ∨ true) = P(true) = 1.0

Proof: (sketch)

Substitution of a subformula with false reduces the number of models (possible worlds) that satisfy Φ; substitution with true increases them.

[Dylla,Miliaraki,Theobald: ICDE’13]
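The substitution rule of Theorem 1 can be tried out on the running example with a small Python sketch. It is an illustration under assumptions rather than the ICDE’13 implementation: formulas are encoded as nested tuples, an ungrounded subgoal is marked with the hypothetical tag "open", and probabilities are computed by brute-force enumeration of possible worlds over independent base facts. The intermediate formula B ∨ (C ∧ open) is an assumed partial grounding step added for illustration.

```python
import itertools


def substitute(formula, value):
    """Replace each ungrounded subgoal by a constant; flip it under negation."""
    kind = formula[0]
    if kind == "open":
        return ("const", value)
    if kind in ("var", "const"):
        return formula
    if kind == "not":
        return ("not", substitute(formula[1], not value))
    return (kind,) + tuple(substitute(f, value) for f in formula[1:])


def holds(formula, world):
    """Evaluate a fully propositional formula in one possible world."""
    kind = formula[0]
    if kind == "const":
        return formula[1]
    if kind == "var":
        return world[formula[1]]
    if kind == "not":
        return not holds(formula[1], world)
    results = [holds(f, world) for f in formula[1:]]
    return all(results) if kind == "and" else any(results)


def probability(formula, probs):
    """Sum the weights of all possible worlds that satisfy the formula."""
    names = sorted(probs)
    total = 0.0
    for bits in itertools.product([False, True], repeat=len(names)):
        world = dict(zip(names, bits))
        if holds(formula, world):
            weight = 1.0
            for n in names:
                weight *= probs[n] if world[n] else 1.0 - probs[n]
            total += weight
    return total


def bounds(formula, probs):
    """(Plow, Pup): substitute open subgoals with false, then with true."""
    return (probability(substitute(formula, False), probs),
            probability(substitute(formula, True), probs))


probs = {"B": 0.6, "C": 0.8, "D": 0.9}
B, C, D = ("var", "B"), ("var", "C"), ("var", "D")
step1 = ("or", B, ("open", "exists y gradFrom(S,y)"))   # from the slide
step2 = ("or", B, ("and", C, ("open", "worksAt(Jeff,y)")))  # assumed step
step3 = ("or", B, ("and", C, D))                        # fully grounded
series = [bounds(f, probs) for f in (step1, step2, step3)]
```

With these inputs the interval [Plow, Pup] goes from [0.6, 1.0] to [0.6, 0.92] and finally collapses to 0.888 once the formula is fully grounded.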

Page 49:

Convergence of Bounds

Theorem 2: Let Φ1, …, Φn be a series of first-order lineage formulas obtained from grounding Φ via SLD resolution, and let φ be the propositional lineage formula of an answer obtained from this grounding procedure. Then rewriting each Φi according to Theorem 1 into Pi,low and Pi,up creates a monotonic series of lower and upper bounds that converges to P(φ).

0 = P(false) ≤ P(B ∨ false) = 0.6 ≤ P(B ∨ (C ∧ D)) = 0.888 ≤ P(B ∨ true) = P(true) = 1

Proof: (sketch, via induction)

Substitution of true with a formula reduces the number of models that satisfy Φ; substitution of false with a formula increases this number.

[Dylla,Miliaraki,Theobald: ICDE’13]

Page 50:

Top-k Pruning (“Fagin’s Algorithm”)

Maintain two disjoint queues: Top-k queue sorted by Plow and Candidates sorted by Pup

Return the top-k queue at the t’th grounding step when:

Pi,low(Qk) | Qk ∈ Top-k > Pi,up(Qj) | Qj ∈ Candidates ⇒ Drop Qj from the Candidates queue.

[Figure: marginal probability (0 to 1) over #SLD steps t, showing the bounds P1,low(Qj)/P1,up(Qj), P2,low(Qj)/P2,up(Qj), …, Pn,low(Qj)/Pn,up(Qj) of a candidate Qj against the k-th lower bound.]

[Fagin et al.’01; Balke,Kießling’02; Dylla,Miliaraki,Theobald: ICDE’13]

Page 51:

Top-k Stopping Condition (“Fagin’s Algorithm”)

Maintain two disjoint queues: Top-k queue sorted by Plow and Candidates sorted by Pup

Return the top-k queue at the t’th grounding step when:

Pi,low(Qk) | Qk ∈ Top-k > Pi,up(Qj) | Qj ∈ Candidates ⇒ Stop and return the top-2 query answers.

[Figure: marginal probability (0 to 1) at SLD step t for k = 2, showing the bounds Pt,low/Pt,up of Q1, Q2, and Qm against the 2nd lower bound.]

[Fagin et al.’01; Balke,Kießling’02; Dylla,Miliaraki,Theobald: ICDE’13]
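The pruning and stopping conditions of the last two slides can be sketched as one step function in Python. This is a simplified reading, not the ICDE’13 algorithm itself: the function name top_k_step is hypothetical, both queues are rebuilt from scratch at every grounding step instead of being maintained incrementally, and the bounds of Q3 and Q4 are made-up values (Q1 = 0.7 and Q2 = 0.888 follow the running example).

```python
def top_k_step(bounds, k):
    """One pruning step over a dict: answer -> (p_low, p_up).

    Returns (top_k, candidates, done). An answer is dropped from the
    candidates once the k-th lower bound exceeds its upper bound; the
    search can stop when no candidate is left.
    """
    ranked = sorted(bounds, key=lambda q: bounds[q][0], reverse=True)
    top_k = ranked[:k]
    if len(top_k) < k:
        return top_k, [], False
    kth_low = bounds[top_k[-1]][0]
    # Keep only candidates that could still overtake the k-th answer.
    candidates = [q for q in ranked[k:] if bounds[q][1] >= kth_low]
    candidates.sort(key=lambda q: bounds[q][1], reverse=True)
    return top_k, candidates, not candidates


# Bounds after an early grounding step (Q3 and Q4 are hypothetical).
step_a = {"Q1": (0.7, 0.7), "Q2": (0.6, 1.0), "Q3": (0.1, 0.65), "Q4": (0.0, 0.5)}
top_a, cand_a, done_a = top_k_step(step_a, 2)

# Bounds after further grounding; Q4 was already pruned above.
step_b = {"Q1": (0.7, 0.7), "Q2": (0.888, 0.888), "Q3": (0.2, 0.4)}
top_b, cand_b, done_b = top_k_step(step_b, 2)
```

In the first step only Q4 is dropped, since its upper bound 0.5 lies below the 2nd lower bound 0.6. In the second step Q3 is dropped as well and the top-2 answers can be returned.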

Page 52:

Experiment (II): Computing Marginals

IMDB data with 26 million facts about movies, directors, actors, etc.

4 query patterns, each instantiated to 1,000 queries (showing runtime averages):
Q1 – safe, non-repeating hierarchical
Q2 – unsafe, repeating hierarchical
Q3 – unsafe, head-hierarchical
Q4 – general unsafe

[Figure: runtime in ms (log scale, 10 to 100,000) for Non-Rep. Hierarchical Q1, Rep. Hierarchical Q2, Head-Hierarchical Q3, and General Unsafe Q4; series: Top-10, Top-20, Top-50, MultiSim Top-10, MultiSim Top-20, MultiSim Top-50, Postgres, MayBMS, Trio.]

Page 53:

Experiment (II): Computing Marginals

[Figure: runtime vs. number of top-k results; single join query]

[Figure: percentage of tuples scanned from input relations]

IMDB data set, 26 million facts

Page 54:

Basic Types of Inference

MAP Inference

Find the most likely assignment to query variables y under a given evidence x.

Compute: arg max_y P(y | x) (NP-complete for MaxSAT)

Marginal/Success Probabilities

Probability that query y is true in a random world under a given evidence x.

Compute: ∑y P( y | x) (#P-complete already for conjunctive queries)

Page 55:

Probabilistic & Temporal Database

A temporal-probabilistic database DTp (compactly) encodes a probability distribution over a finite set of deterministic database instances Di and a finite time domain T.

BornIn(Sub, Obj)      T               p
DeNiro  Greenwich     [1943, 1944)    0.9
DeNiro  Tribeca       [1998, 1999)    0.6

Wedding(Sub, Obj)     T               p
DeNiro  Abbott        [1936, 1940)    0.3
DeNiro  Abbott        [1976, 1977)    0.7

Divorce(Sub, Obj)     T               p
DeNiro  Abbott        [1988, 1989)    0.8

Sequenced Semantics & Snapshot Reducibility:
Built-in semantics: reduce temporal-relational operators to their non-temporal counterparts at each snapshot of the database.
Coalesce/split tuples with consecutive time intervals based on their lineages.

Non-Sequenced Semantics:
Queries can freely manipulate timestamps just like regular attributes.
Single temporal operator ≤T supports all of Allen’s 13 temporal relations.
Deduplicate tuples with overlapping time intervals based on their lineages.

[Dignös, Gamper, Böhlen: SIGMOD’12]

[Dylla,Miliaraki,Theobald: PVLDB’13]

Page 56:

Temporal Alignment & Deduplication

Non-Sequenced Semantics:

MarriedTo(X,Y)[Tb1,tmax) ← Wedding(X,Y)[Tb1,Te1) ∧ ¬Divorce(X,Y)[Tb2,Te2)

MarriedTo(X,Y)[Tb1,Te2) ← Wedding(X,Y)[Tb1,Te1) ∧ Divorce(X,Y)[Tb2,Te2) ∧ Te1 ≤T Tb2

[Figure: timeline T from tmin to tmax with marks at 1936, 1976, 1988.
Base facts: f1 = Wedding(DeNiro,Abbott), f2 = Wedding(DeNiro,Abbott), f3 = Divorce(DeNiro,Abbott).
Deduced facts: f1 ∧ ¬f3, f2 ∧ ¬f3, f1 ∧ f3, f2 ∧ f3.
Deduplicated facts: (f1 ∧ f3) ∨ (f1 ∧ ¬f3); (f1 ∧ f3) ∨ (f1 ∧ ¬f3) ∨ (f2 ∧ f3) ∨ (f2 ∧ ¬f3); (f1 ∧ f3) ∨ (f2 ∧ ¬f3).]
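Deduplication of overlapping deduced intervals can be sketched as a split of the timeline at every interval endpoint. The Python code below is an illustration under assumptions, not the PVLDB’13 algorithm: the function name deduplicate and the constant TMAX are hypothetical, lineages are kept as plain strings, and the deduced intervals are obtained by applying the two MarriedTo rules as transcribed to the Wedding and Divorce facts of the previous slide. The resulting segments are not checked against the original figure. The scan over all inputs per segment is quadratic; it is kept simple for readability.

```python
TMAX = 9999  # hypothetical stand-in for the end of the time domain


def deduplicate(deduced):
    """Merge deduced intervals [begin, end) of one fact into disjoint segments.

    Each segment carries the set of lineages (read as a disjunction) of all
    deduced intervals that cover it; adjacent segments with the same set
    are coalesced.
    """
    points = sorted({t for begin, end, _ in deduced for t in (begin, end)})
    segments = []
    for lo, hi in zip(points, points[1:]):
        lineage = frozenset(l for b, e, l in deduced if b <= lo and hi <= e)
        if not lineage:
            continue
        if segments and segments[-1][1] == lo and segments[-1][2] == lineage:
            segments[-1] = (segments[-1][0], hi, lineage)
        else:
            segments.append((lo, hi, lineage))
    return segments


# MarriedTo(DeNiro, Abbott) as deduced from f1, f2 (Wedding) and f3 (Divorce).
deduced = [
    (1936, TMAX, "f1 & ~f3"),  # first rule applied to f1
    (1976, TMAX, "f2 & ~f3"),  # first rule applied to f2
    (1936, 1989, "f1 & f3"),   # second rule applied to f1 and f3
    (1976, 1989, "f2 & f3"),   # second rule applied to f2 and f3
]
segments = deduplicate(deduced)
```

Each returned segment is a triple (begin, end, lineages), so the confidence of a segment can afterwards be computed from the disjunction of its lineages.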

Page 57:

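Inference in Temporal-Probabilistic Databases

[Figure: probability histograms over time.
Base facts: playsFor(Beckham, Real, T1) with probabilities 0.4, 0.6 over the interval boundaries ’03, ’05, ’07; playsFor(Ronaldo, Real, T2) with probabilities 0.2, 0.2, 0.1, 0.4 over the interval boundaries ’00, ’02, ’04, ’05, ’07.
Derived facts: teamMates(Beckham, Ronaldo, T3) with probabilities 0.08, 0.12, 0.16 over the interval boundaries ’03, ’04, ’05, ’07.]

playsFor(Beckham, Real, T1) ∧ playsFor(Ronaldo, Real, T2) ∧ overlaps(T1,T2, T3)

t3 teamMates(Beckham, Ronaldo, t3)

teamMates(Beckham, Ronaldo, T3)

[Wang,Yahya,Theobald: MUD’10; Dylla,Miliaraki,Theobald: PVLDB’13]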

Page 58:

Inference in Temporal-Probabilistic Databases

[Figure: probability histograms over time.
Base facts: playsFor(Beckham, Real, T1) with probabilities 0.4, 0.6 over ’03, ’05, ’07; playsFor(Ronaldo, Real, T2) with probabilities 0.2, 0.2, 0.1, 0.4 over ’00, ’02, ’04, ’05, ’07; playsFor(Zidane, Real, T3).
Derived facts: teamMates(Beckham, Ronaldo, T4) with probabilities 0.08, 0.12, 0.16 over ’03, ’04, ’05, ’07; teamMates(Beckham, Zidane, T5); teamMates(Ronaldo, Zidane, T6).
Labels: Non-independent, Independent.]

[Wang,Yahya,Theobald: MUD’10; Dylla,Miliaraki,Theobald: PVLDB’13]

Page 59:

Inference in Temporal-Probabilistic Databases

[Figure: base facts playsFor(Beckham, Real, T1), playsFor(Ronaldo, Real, T2), playsFor(Zidane, Real, T3); derived facts teamMates(Beckham, Ronaldo, T4), teamMates(Beckham, Zidane, T5), teamMates(Ronaldo, Zidane, T6); labels: Non-independent, Independent.]

Need Lineage!

Closed and complete representation model (incl. lineage)

Temporal alignment is linear in the number of input intervals

Confidence computation per interval remains #P-hard

In general requires Monte Carlo approximations (Luby-Karp for DNF, MCMC-style sampling), decompositions, or top-k pruning

[Wang,Yahya,Theobald: MUD’10; Dylla,Miliaraki,Theobald: PVLDB’13]

Page 60:

Experiment (III): Temporal Alignment & Probabilistic Inference

1,827 base facts with temporal annotations, extracted from free-text biographies from Wikipedia, IMDB.com, biography.com

11 handcrafted temporal deduction rules, e.g.:
MarriedTo(X,Y)[Tb1,Te2) ← Wedding(X,Y)[Tb1,Te1) ∧ Divorce(X,Y)[Tb2,Te2) ∧ Te1 ≤T Tb2

21 handcrafted temporal consistency constraints, e.g.:
BornIn(X,Y)[Tb1,Te1) ∧ MarriedTo(X,Y)[Tb2,Te2) → Te1 ≤T Tb2

Page 61:

Statistical Relational Learning & Probabilistic Programming

SRL combines first-order logic and probabilistic inference

Employs relational data as input, but with a focus also on learning the relations (facts, rules & weights)

Knowledge compilation for probabilistic inference, including recent techniques for “lifted inference”

Markov Logic Networks (U-Washington): Grounding of weighted first-order rules over a function-free Herbrand base into an undirected graphical model (→ Markov Random Field)

Probabilistic Programming (ProbLog, KU-Leuven): Deductive grounding over a set of base facts into a directed graphical model (SLD proofs → Bayesian Net)

Page 62:

Learning Soft Deduction Rules

Inductive learning algorithm based on dynamic programming

A-priori-style pre-filtering & pruning of low-support join patterns

Adaptation of confidence and support measures from data mining

Learning “interesting” rules with constants and type constraints

[Figure: three overlapping sets G, KB, and R]
G: Ground truth for livesIn (only partially known)
KB: Knowledge base for livesIn (known positive examples)
R: Facts inferred for livesIn from the body of the rule bornIn (only partially correct)

Goal: Inductively learn soft rule S: livesIn(x,y) :- bornIn(x,y)

confidence(S) = P(Head | Body) = |Head ∩ Body| / |Body|
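The confidence measure for a soft rule such as livesIn(x,y) :- bornIn(x,y) can be computed from plain sets of argument pairs. The following Python sketch assumes that confidence is |Head ∩ Body| / |Body| and that support is the number of pairs shared by head and body; the function name rule_stats and all facts are made up for illustration.

```python
def rule_stats(body_pairs, head_pairs):
    """Support and confidence of head(x, y) :- body(x, y).

    support    = number of (x, y) pairs that occur in both body and head
    confidence = support / |body|, an estimate of P(Head | Body)
    """
    body = set(body_pairs)
    head = set(head_pairs)
    support = len(body & head)
    confidence = support / len(body) if body else 0.0
    return support, confidence


# Hypothetical knowledge base: four bornIn facts, three livesIn facts.
born_in = {("a", "Paris"), ("b", "Rome"), ("c", "Oslo"), ("d", "Lima")}
lives_in = {("a", "Paris"), ("b", "Rome"), ("c", "Berlin")}
support, confidence = rule_stats(born_in, lives_in)
```

Here two of the four bornIn pairs are confirmed by livesIn, which gives a support of 2 and a confidence of 0.5. Because the knowledge base contains only known positive examples, a pair missing from livesIn is counted against the rule even though it may be true in the ground truth.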

Page 63:

Learning “Interesting” Deduction Rules (I)

Plots for the distribution of income versus quarterOfBirth and educationLevel over actual US census data from Oct. 2009 (>1 billion RDF facts).

Divergence from “Overall population” shows strong correlation of income with educationLevel but not with quarterOfBirth.

[Figure: two plots of rel. freq. over income; series: Overall population, QOB-1st-quarter, QOB-2nd-quarter, QOB-3rd-quarter, QOB-4th-quarter; panels: income(x, y), quarterOfBirth(x, z) and income(x, y), educationLevel(x, z).]

Page 64:

Learning “Interesting” Deduction Rules (II)

Divergence measured using Kullback-Leibler or χ² between “Overall population” and each of “Nursery school to Grade 4” and “Professional school degree” over the discretized income domain.

income(x, y) :- educationLevel(x, z)

income(x, “low”) :- educationLevel(x, “Nursery school to Grade 4”)

income(x, “medium”) :- educationLevel(x, “Professional school degree”)

income(x, “high”) :- educationLevel(x, “Professional school degree”)

[Figure: rel. freq. over income (low, medium, high) for Overall population, Nursery school to Grade 4, and Professional school degree.]
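Both divergence measures are easy to state over a discretized income domain. The Python sketch below uses invented relative frequencies for the buckets (low, medium, high), not the census numbers behind the plots, and assumes the reference distribution has no empty bucket. The function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) of two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def chi_square(p, q):
    """Chi-square distance of p from the reference distribution q."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))


# Invented relative frequencies over (low, medium, high) income.
overall = [0.50, 0.30, 0.20]
nursery_to_grade_4 = [0.80, 0.15, 0.05]
first_quarter_birth = [0.49, 0.31, 0.20]

kl_education = kl_divergence(nursery_to_grade_4, overall)
kl_quarter = kl_divergence(first_quarter_birth, overall)
chi_education = chi_square(nursery_to_grade_4, overall)
chi_quarter = chi_square(first_quarter_birth, overall)
```

A subgroup whose distribution stays close to the overall population yields a divergence near zero under both measures, whereas a strongly shifted subgroup yields a clearly larger value. A rule learner can use such a threshold to decide which constants are worth specializing a rule for.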

Page 65:

Summary & Challenges (I): Web-Scale Information Extraction

[Figure: spectrum of extraction approaches along the axes “ontological rigor” and “human effort”, from Names & Patterns (Open-Domain & Unsupervised) to Entities & Relations (Domain-Oriented, Training Data/Facts).]

Names & Patterns:
< „N. Portman“, „honored with“, „Academy Award“>
< „Jeff Bridges“, „expected to win“, „Oscar“ >
< „Bridges“, „nominated for“, „Academy Award“>

Entities & Relations:
wonAward: Person × Prize
type (Meryl_Streep, Actor)
wonAward (Meryl_Streep, Academy_Award)
wonAward (Natalie_Portman, Academy_Award)
wonAward (Ethan_Coen, Palme_d‘Or)

Page 66:

Summary & Challenges (I): Web-Scale Information Extraction

[Figure: the same spectrum along the axes “ontological rigor” and “human effort” (Names & Patterns, Open-Domain & Unsupervised; Entities & Relations, Domain-Oriented, Training Data/Facts), populated with systems: TextRunner, ReadTheWeb / NELL, Probase, Freebase, YAGO2, DBpedia 3.8, Sofie / Prospera, StatSnowball / EntityCube, WebTables / FusionTables, and an open position marked “?”.]

Page 67:

Summary & Challenges (II): RDF is Not Enough!

HMMs, CRFs, PCFGs (not in this talk) yield much richer output structures than just triplets.

Extraction of facts, beliefs, modifiers, modalities, etc., and intensional knowledge (“rules”)

More expressive but canonical representation of natural language: trees, graphs, objects, frames (F-logic, KL-one, CycL, OWL, etc.)

All combined with structured probabilistic inference

Page 68:

Summary & Challenges (III): Scalable Probabilistic Inference

“Domain-liftable” FO formula:

∀X,Y ∈ People: smokes(X) ∧ friends(X,Y) ⇒ smokes(Y)

[Figure: corresponding FO d-DNNF circuit]

Exact lifted inference via Weighted-First-Order-Model-Counting (WFOMC): Probability of a query depends only on the size(s) of the domain(s), a weight function for the first-order predicates, and the weighted model count over the FO d-DNNF.

[Van den Broeck’11]: Compilation rules and inference algorithms for FO d-DNNF’s

[Jha & Suciu’11]: Classes of SQL queries which admit polynomial-size (propositional) d-DNNF’s

Approximate inference via Belief Propagation, MCMC-style sampling, etc.

Scale-out via distributed grounding & inference: TrinityRDF (MSR), GraphLab2 (MIT)
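The claim that the count depends only on the domain size can be checked on the smokes/friends formula for tiny domains. The Python sketch below is not FO d-DNNF compilation; it compares a brute-force enumeration of all groundings with a closed form that is assumed here for the unweighted case (all weights equal to 1): choose the k smokers, forbid friends(X,Y) for every smoker X and non-smoker Y, and leave the remaining n² − k(n − k) friends atoms free. Both function names are hypothetical.

```python
import itertools
from math import comb


def brute_force_count(n):
    """Count models of: forall X,Y. smokes(X) and friends(X,Y) => smokes(Y)."""
    people = range(n)
    pairs = [(x, y) for x in people for y in people]
    count = 0
    for smokes in itertools.product([False, True], repeat=n):
        for bits in itertools.product([False, True], repeat=len(pairs)):
            friends = dict(zip(pairs, bits))
            if all(not (smokes[x] and friends[(x, y)]) or smokes[y]
                   for x, y in pairs):
                count += 1
    return count


def lifted_count(n):
    """Same count from the domain size alone, in time polynomial in n."""
    return sum(comb(n, k) * 2 ** (n * n - k * (n - k)) for k in range(n + 1))


counts = [(brute_force_count(n), lifted_count(n)) for n in (1, 2, 3)]
```

The brute-force loop visits 2^(n + n²) worlds and is only feasible for very small n, while the lifted sum has n + 1 terms. Both agree on 48 models for n = 2 and 1792 models for n = 3.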

Page 69:

Final Summary

Text is not just unstructured data.

Probabilistic databases combine first-order logic and probability theory in an elegant way.

Natural-Language-Processing people, Database guys, and Machine-Learning folks: it’s about time to join your forces!


Page 71:

References

Maximilian Dylla, Iris Miliaraki, Martin Theobald: A Temporal-Probabilistic Database Model for Information Extraction. PVLDB 6(14), 2013 (to appear)

Maximilian Dylla, Iris Miliaraki, Martin Theobald: Top-k Query Processing in Probabilistic Databases with Non-Materialized Views. ICDE 2013

Ndapandula Nakashole, Mauro Sozio, Fabian Suchanek, Martin Theobald: Query-Time Reasoning in Uncertain RDF Knowledge Bases with Soft and Hard Rules. VLDS 2012: 15-20

Mohamed Yahya, Martin Theobald: D2R2: Disk-Oriented Deductive Reasoning in a RISC-Style RDF Engine. RuleML America 2011: 81-96

Timm Meiser, Maximilian Dylla, Martin Theobald: Interactive Reasoning in Uncertain RDF Knowledge Bases. CIKM 2011: 2557-2560

Ndapandula Nakashole, Martin Theobald, Gerhard Weikum: Scalable Knowledge Harvesting with High Precision and High Recall. WSDM 2011: 227-236

Maximilian Dylla, Mauro Sozio, Martin Theobald: Resolving Temporal Conflicts in Inconsistent RDF Knowledge Bases. BTW 2011: 474-493

Yafang Wang, Mohamed Yahya, Martin Theobald: Time-aware Reasoning in Uncertain Knowledge Bases. MUD 2010: 51-65

Ndapandula Nakashole, Martin Theobald, Gerhard Weikum: Find your Advisor: Robust Knowledge Gathering from the Web. WebDB 2010

Anish Das Sarma, Martin Theobald, Jennifer Widom: LIVE: A Lineage-Supported Versioned DBMS. SSDBM 2010: 416-433

Anish Das Sarma, Martin Theobald, Jennifer Widom: Exploiting Lineage for Confidence Computation in Uncertain and Probabilistic Databases. ICDE 2008: 1023-1032

Omar Benjelloun, Anish Das Sarma, Alon Y. Halevy, Martin Theobald, Jennifer Widom: Databases with uncertainty and lineage. VLDB J. 17(2): 243-264 (2008)