
Querying Graph-Structured Data

Thomas Neumann

Technische Universität München

November 4, 2016

Motivation

Many interesting data sets have a graph structure.

• very flexible

• easy to model

• but difficult to query

• often very large

• no obvious structure

• how to store and process?

Linked Datasets as of August 2014

[Figure: Linked Open Data cloud diagram. Each node is one dataset (for example DBpedia, GeoNames, Freebase, YAGO, UniProt, Bio2RDF), grouped by category: Linguistics, Social Networking, Life Sciences, Cross-Domain, Government, User-Generated Content, Publications, Geographic, Media, Identifiers.]

The Linked Open Data cloud is in use and contains data sets with billions of entries.


Graph-structured data

One way to model graph-structured data is to use RDF (Resource Description Framework).

• conceptually a directed graph with edge labels

• each edge represents a fact (triple in RDF notation)

• triples have the form (subject, predicate, object)

Example:

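• <obj1> <cityName> 'Berlin'

• <obj1> <isCapitalOf> <obj2>

• <obj2> <countryName> 'Germany'

[Figure: the three triples drawn as a graph. Node obj1 has a cityName edge to Berlin and an isCapitalOf edge to obj2; node obj2 has a countryName edge to Germany. The graph continues beyond these nodes.]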

Everything is encoded as triples, queries operate on triples.


SPARQL Protocol and RDF Query Language

All capitals in Europe:

SELECT ?capital ?country
WHERE {
  ?x <cityName> ?capital.
  ?x <isCapitalOf> ?y.
  ?y <countryName> ?country.
  ?y <isInContinent> <Europe>.
}

• querying via pattern matching in RDF graph

• queries are sets of triple patterns

• variable occurrences imply joins

Problem: huge graph, many variable bindings possible
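The pattern-matching semantics above can be sketched in a few lines of Python. This is a deliberately naive illustration, not how an RDF engine evaluates queries: every pattern is matched against every triple, and a variable that occurs in two patterns must receive the same value in both, which is exactly the implied join. The `isInContinent` triple and the helper names (`match_pattern`, `evaluate`) are assumptions added for the example.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]


def is_var(term: str) -> bool:
    """Variables are written with a leading '?', as in SPARQL."""
    return term.startswith("?")


def match_pattern(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    result = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable that is already bound must agree with the new value.
            if result.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return result


def evaluate(patterns: List[Triple], graph: List[Triple]) -> List[Binding]:
    """Nested-loop evaluation: one pass over the graph per pattern and binding."""
    bindings: List[Binding] = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for b in bindings
            for t in graph
            if (extended := match_pattern(pattern, t, b)) is not None
        ]
    return bindings


graph = [
    ("obj1", "cityName", "Berlin"),
    ("obj1", "isCapitalOf", "obj2"),
    ("obj2", "countryName", "Germany"),
    ("obj2", "isInContinent", "Europe"),  # assumed triple, not on the slide
]
query = [
    ("?x", "cityName", "?capital"),
    ("?x", "isCapitalOf", "?y"),
    ("?y", "countryName", "?country"),
    ("?y", "isInContinent", "Europe"),
]
results = [(b["?capital"], b["?country"]) for b in evaluate(query, graph)]
```

The intermediate list `bindings` is where the problem named above shows up: on a huge graph it can grow very large before later patterns filter it down.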


How to process SPARQL queries?

• we could use a (relational) database

• load the graph as triples into a table

• patterns form filters and joins

• produces the correct answer

• but very inefficient

• the database does not “understand” the graph structure

• a specialized RDF engine is more efficient

• I will talk about RDF-3X here (open source)
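The relational approach from the first bullets can be tried out with Python's built-in `sqlite3` module. The sketch below assumes a single table `triples(s, p, o)`; the table name, the sample rows, and the alias names are illustrative only. Each triple pattern becomes one scan of the same table, constants become filters, and shared variables become join predicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("obj1", "cityName", "Berlin"),
        ("obj1", "isCapitalOf", "obj2"),
        ("obj2", "countryName", "Germany"),
        ("obj2", "isInContinent", "Europe"),
        ("obj3", "cityName", "Tokyo"),
        ("obj3", "isCapitalOf", "obj4"),
        ("obj4", "countryName", "Japan"),
        ("obj4", "isInContinent", "Asia"),
    ],
)

# One alias per triple pattern of the "capitals in Europe" query.
rows = conn.execute(
    """
    SELECT t1.o AS capital, t3.o AS country
    FROM triples t1, triples t2, triples t3, triples t4
    WHERE t1.p = 'cityName'
      AND t2.p = 'isCapitalOf'   AND t2.s = t1.s   -- shared variable ?x
      AND t3.p = 'countryName'   AND t3.s = t2.o   -- shared variable ?y
      AND t4.p = 'isInContinent' AND t4.s = t2.o AND t4.o = 'Europe'
    """
).fetchall()
```

The answer is correct, but the query is a four-way self-join of one table, and the database has no statistics that reflect how the four patterns relate to each other.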


Indexing RDF Graphs

Primary data structure: clustered B+-trees

• stores triples in lexicographical order

• allows for good compression (differences are small)

• sequential disk accesses, fast lookups

Example: sort order (S,P,O), triple pattern (obj1, pred, ?x) ⇒ read range (obj1, pred, −∞) to (obj1, pred, ∞) in the B+-tree

Which sort order to choose?

• index is heavily compressed, space consumption not that critical

• 3! = 6 possible orderings ⇒ 6 B+-trees

• always the 'right' sort order available, efficient merge joins

e.g. ?x <cityName> ?capital. ?x <isCapitalOf> ?y. ⇒ (cityName, ?x, ?capital)_PSO ⋈ (isCapitalOf, ?x, ?y)_PSO
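The range scan and the merge join from this slide can be imitated with a sorted Python list standing in for the clustered B+-tree. This is only a sketch under that assumption: there is no compression and no disk layout, and the function names `scan_predicate` and `merge_join_on_subject` are made up for the example.

```python
from bisect import bisect_left

triples = [
    ("obj1", "cityName", "Berlin"),
    ("obj1", "isCapitalOf", "obj2"),
    ("obj3", "cityName", "Paris"),
    ("obj3", "isCapitalOf", "obj4"),
    ("obj5", "cityName", "Munich"),
]

# PSO "index": triples permuted to (P, S, O) and kept in lexicographic order.
pso = sorted((p, s, o) for s, p, o in triples)


def scan_predicate(index, predicate):
    """Range scan over all entries whose first component equals `predicate`."""
    lo = bisect_left(index, (predicate,))
    # predicate + "\0" is the smallest string greater than predicate,
    # so this is the position just past the last matching entry.
    hi = bisect_left(index, (predicate + "\0",))
    return index[lo:hi]


def merge_join_on_subject(left, right):
    """Merge join of two lists of (S, O) pairs, both sorted by S."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            s = left[i][0]
            i_end, j_end = i, j
            while i_end < len(left) and left[i_end][0] == s:
                i_end += 1
            while j_end < len(right) and right[j_end][0] == s:
                j_end += 1
            # Emit all combinations for this subject, then skip both groups.
            out.extend((s, a[1], b[1]) for a in left[i:i_end] for b in right[j:j_end])
            i, j = i_end, j_end
    return out


city = [(s, o) for _, s, o in scan_predicate(pso, "cityName")]
capital_of = [(s, o) for _, s, o in scan_predicate(pso, "isCapitalOf")]
joined = merge_join_on_subject(city, capital_of)
```

Because both scans come out of the PSO order already sorted by subject, the join needs no extra sort step, which is the point of keeping all six orderings.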


Runtime Improvements

RDF-3X uses many techniques to improve runtime performance:

• compressed B-trees reduce size and improve I/O performance

• exhaustive indexing often allows for cheap merge joins

• sideways information passing skips over large parts of the data

• works on compressed/encoded data as much as possible

• ...

Optimize performance and minimize disk I/O.


Indexing is Not Enough

select *
where {
  ?s yago:created ?product.
  ?s yago:hasLatitude ?lat.
  ?s yago:hasLongitude ?long
}

Suboptimal plan: (hasLongitude ⋈1 hasLatitude) ⋈2 created, |⋈1| = 140 Mln, runtime: 65 ms

Optimal plan: (created ⋈1 hasLatitude) ⋈2 hasLongitude, |⋈1| = 14 K, runtime: 20 ms

Query optimization has a huge impact, sometimes orders of magnitude.


Cardinality Estimation

Traditional estimation:

• estimates for individual predicates and joins

• combined assuming independence

• statistical synopses

Not well suited for RDF data


Why are Standard Histograms not Enough?

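Some numbers from the Yago data set:

$\mathrm{sel}(\sigma_{P=\mathit{isCitizenOf}}) = 1.06 \cdot 10^{-4}$

$\mathrm{sel}(\sigma_{O=\text{United States}}) = 6.41 \cdot 10^{-4}$

$\mathrm{sel}(\sigma_{P=\mathit{isCitizenOf} \wedge O=\text{United States}}) = 4.86 \cdot 10^{-5}$

$\mathrm{sel}(\sigma_{P=\mathit{isCitizenOf}}) \cdot \mathrm{sel}(\sigma_{O=\text{United States}}) = 6.80 \cdot 10^{-8}$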

• independence assumption does not hold

• leads to severe underestimation

• multi-dimensional histograms would help (expensive)

• looking at individual triples is not enough

For RDF data, correlation is the norm!
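To see how severe the underestimation is, the selectivities from this slide can be plugged into a short calculation. The variable names are chosen for the example; the three input numbers are the ones quoted above.

```python
# Selectivities quoted on the slide (Yago data set).
sel_predicate = 1.06e-4   # P = isCitizenOf
sel_object = 6.41e-4      # O = United States
sel_joint = 4.86e-5       # both conditions on the same triple

# Estimate under the independence assumption: multiply the two selectivities.
sel_independent = sel_predicate * sel_object

# Factor by which the independence estimate is too small.
underestimation_factor = sel_joint / sel_independent
```

The product is about 6.8e-8, so the independence assumption is off by a factor of roughly 700 for a single triple pattern, before any join is involved.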


Why is Correlation a Problem?

Correlation occurs across triples:

• some triples are closely related

• independence does not hold

Very common:

• soft functional dependencies

• if one triple pattern is bound, the others become unselective

• not captured by attribute histograms

Example Triples

<o1> <title> "The Tree and I".
<o1> <author> <R. Pecker>.
<o1> <author> <D. Owl>.
<o1> <year> "1996".


Why Not Sampling?

RDF is very unfriendly for sampling

• no schema

• one huge ”relation”

• billions of tuples

• very diverse

Yago sample

<wikicategory Wilderness Areas of Illinois> rdfs:label "Wilderness Areas of Illinois" .
<Telephone numbers in Cameroon> rdfs:label "\u002b237" .
<Washington Park Race Track> rdfs:label "Washington Park" .
<Seth R.J.J. High School> rdfs:label "Sett R\u002eJ\u002eJ\u002e High School" .
<Tengasu> rdfs:label "Tengasu" .
<Immaculate Heart Academy> rdfs:label "Immaculate Heart Academy" .
<Sion, Switzerland> rdfs:label "Sion\u002c Switzerland" .
<wordnet heroism 104857738> rdfs:label "gallantry" .
<Khyber Pakhtunkhwa> rdfs:label "Khyber\u002dPakhtunkhwa" .
<J%C3%A1nos Pap> rdfs:label "Janos Pap" .
<wikicategory Jan Smuts> rdfs:label "Jan Smuts" .
...

Sample would have to be huge to be useful.


Capturing Correlations

We classify the tuples using characteristic sets

• compact data structure

• groups triples by ”behavior”

• within a group, triples are more homogeneous

• groups are annotated with occurrence statistics

• allows for deriving estimates for whole query fragments

• captures correlations within tuples and across tuples

Allows for very accurate cardinality estimates.


Characteristic Sets

Observation: nodes are characterized by outgoing edges

$S_C(s) := \{p \mid \exists o : (s, p, o) \in R\}$

$S_C(R) := \{S_C(s) \mid \exists p, o : (s, p, o) \in R\}$

Example

<o1> <title> "The Tree and I".
<o1> <author> <R. Pecker>.
<o1> <author> <D. Owl>.
<o1> <year> "1996".
<o2> <title> "Emma".
<o2> <author> <J. Austen>.
<o2> <year> "1815".
<J. Austen> <hasName> "Jane Austen".
<J. Austen> <bornIn> <Steventon>.

$S_C(o_1) = \{\mathit{title}, \mathit{author}, \mathit{year}\}$

$S_C(o_2) = \{\mathit{title}, \mathit{author}, \mathit{year}\}$

$S_C = \{\{\mathit{title}, \mathit{author}, \mathit{year}\}^2, \{\mathit{hasName}, \mathit{bornIn}\}^1\}$
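The definition translates almost directly into code. The following sketch computes the characteristic set of every subject and counts how often each set occurs, using the example triples from this slide; the function name and the return shape are choices made for the illustration.

```python
from collections import Counter, defaultdict

triples = [
    ("o1", "title", "The Tree and I"),
    ("o1", "author", "R. Pecker"),
    ("o1", "author", "D. Owl"),
    ("o1", "year", "1996"),
    ("o2", "title", "Emma"),
    ("o2", "author", "J. Austen"),
    ("o2", "year", "1815"),
    ("J. Austen", "hasName", "Jane Austen"),
    ("J. Austen", "bornIn", "Steventon"),
]


def characteristic_sets(triples):
    """Return (subject -> characteristic set, characteristic set -> #subjects)."""
    predicates = defaultdict(set)
    for s, p, _ in triples:
        predicates[s].add(p)  # duplicates such as two authors collapse here
    per_subject = {s: frozenset(ps) for s, ps in predicates.items()}
    counts = Counter(per_subject.values())
    return per_subject, counts


per_subject, cs_counts = characteristic_sets(triples)
```

Note that `o1` has two `author` triples but `author` appears only once in its characteristic set; the multiplicity is what the occurrence annotations on a later slide record.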


Estimating Distinct Subjects

We can use characteristic sets for cardinality estimation

query: select distinct ?e where { ?e <author> ?a. ?e <title> ?t. }

cardinality: $\sum_{S \in \{S \mid S \in S_C(R) \wedge \{\mathit{author}, \mathit{title}\} \subseteq S\}} \mathrm{count}(S)$

• the computation is exact! (only for distinct, though)

• can estimate a large number of joins in one step

• number of characteristic sets is surprisingly low

Number of Characteristic Sets

               triples        characteristic sets
Yago           40,114,899     9,788
LibraryThing   36,203,751     6,834
UniProt        845,074,885    613
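The sum on this slide is a filter over the characteristic sets followed by an addition of their counts. A minimal sketch, assuming the counts are kept in a dictionary keyed by `frozenset` (the two entries are the ones from the book example; the function name is illustrative):

```python
# Characteristic sets of the book example with their subject counts.
cs_counts = {
    frozenset({"title", "author", "year"}): 2,
    frozenset({"hasName", "bornIn"}): 1,
}


def distinct_subjects(cs_counts, query_predicates):
    """Number of distinct subjects that have all the query predicates."""
    wanted = frozenset(query_predicates)
    return sum(count for cs, count in cs_counts.items() if wanted <= cs)


books_with_author_and_title = distinct_subjects(cs_counts, {"author", "title"})
```

With only a few thousand characteristic sets per data set, as in the table above, this loop is cheap enough to run for every star-shaped fragment an optimizer considers.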


Occurrence Annotations

Without distinct we need occurrence annotations

distinct     $|\{s \mid \exists p, o : (s, p, o) \in R \wedge S_C(s) = S\}|$
count(p1)    $|\{(s, p_1, o) \mid (s, p_1, o) \in R \wedge S_C(s) = S\}|$
count(p2)    $|\{(s, p_2, o) \mid (s, p_2, o) \in R \wedge S_C(s) = S\}|$
...          ...

Example

select ?a ?t where { ?e <author> ?a. ?e <title> ?t. }

distinct   author   title   year
1000       2300     1010    1090

Estimate: $1000 \cdot \frac{2300}{1000} \cdot \frac{1010}{1000} = 2323$

• no longer exact, but very accurate in practice
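The estimate above multiplies the number of distinct subjects by the average number of triples per subject for each predicate in the query. A small sketch of that computation follows; the `AnnotatedSet` class and the `estimate` function are names invented for the example, and the numbers are the ones from the slide.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class AnnotatedSet:
    """One characteristic set with its occurrence annotations."""
    distinct: int                 # number of subjects with this set
    occurrences: Dict[str, int]   # predicate -> number of triples


def estimate(sets: List[AnnotatedSet], query_predicates: Iterable[str]) -> float:
    """Estimated result size of a star query without DISTINCT."""
    wanted = set(query_predicates)
    total = 0.0
    for cs in sets:
        if not wanted.issubset(cs.occurrences):
            continue
        cardinality = float(cs.distinct)
        for p in wanted:
            # Average multiplicity of predicate p per subject in this set.
            cardinality *= cs.occurrences[p] / cs.distinct
        total += cardinality
    return total


book_set = AnnotatedSet(distinct=1000,
                        occurrences={"author": 2300, "title": 1010, "year": 1090})
result = estimate([book_set], {"author", "title"})
```

The value is an estimate because it assumes that the multiplicities of `author` and `title` are independent within the set, which is why it is no longer exact.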


Using Characteristic Sets

• characteristic sets accurately describe individual subjects

• but a query touches more than one subject

• combine characteristic sets to cover whole queries

General strategy:

• exploit as much information about correlation as possible

• ignore the join order ("holistic" estimates)

• avoids "fleeing to ignorance"

• cover the query with characteristic sets


Example

select ?a ?t where {
  ?b <author> ?a. ?b <title> ?t. ?b <year> "2009".
  ?b <publishedBy> ?p. ?p <name> "ACM". }

[Figure: two views of the query. RDF query graph: nodes ?b, ?a, ?t, 2009, ?p, ACM connected by the edges author, title, year, publishedBy, name. Traditional query graph: one node per triple pattern, (?b, author, ?a), (?b, title, ?t), (?b, year, 2009), (?b, publishedBy, ?p), (?p, name, ACM).]

• we cover the query with characteristic sets

• prefer large sets over small sets

• assume independence for the rest









Challenges of SPARQL query optimization

Query Optimization:

                 Query Compilation                     ⇒  Query Execution
                 (dominated by query optimization)
RDF-3X           78 s                                     2 s
Virtuoso 7       1.3 s                                    384 s
(next slides)    1.2 s                                    2 s

We ran a query with 17 joins on the YAGO dataset (100 Mln triples)









Why does it happen?

Properties of the model:

• RDF is a very verbose format

• TPC-H Q5: 5 joins in SQL vs 26 joins in SPARQL (assuming a triplestore storage)

• Dynamic Programming (RDF-3X) becomes too expensive

Properties of the data:

• Lots of correlations, including structural

• If an entity has a LastName, it is likely to have a FirstName

• Greedy Algorithm (Virtuoso) often makes wrong choices in the beginning


Combining Estimation and Optimization

Given a SPARQL query:

[Figure: query graph with the nodes ?p, German novelist, Nobel Prize, ?place, ?book, ?city, Italy, ?long, ?lat and the edges type, wonPrize, bornIn, created, linksTo, locatedIn, hasLong, hasLat.]

• How to optimize star-shaped subqueries?

• How to capture selectivities between subqueries?

• How to optimize arbitrary-shaped queries?








Optimizing star-shaped subqueries

[Figure: star-shaped query around ?p with the edges livedIn, type, bornIn, created leading to the nodes ?place1, ?type, ?place2, ?s.]

• {type, livedIn, bornIn, created} → 1025 entities

• Characteristic Set
  • Count all distinct Char.Sets with number of occurrences
  • Accurate estimation of cardinalities of star-shaped queries

• One step beyond: what is the rarest subset of the given CS?
  • {type, livedIn, bornIn} → 13304 entities
  • {type, livedIn, created} → 6593 entities
  • {type, bornIn, created} → 6800 entities
  • {livedIn, bornIn, created} → 2399 entities

• type is not present in the rarest subset; we want to join it last



Example

{type, livedIn, bornIn, created}, ID: 154

{livedIn, bornIn, created}, ID: 27

{livedIn, created}, ID: 6

Resulting plan: (((?p, created, ?o1) ⋈_{ID: 6} (?p, livedIn, ?o3)) ⋈_{ID: 27} (?p, bornIn, ?o2)) ⋈_{ID: 154} (?p, type, ?o4)


Properties of the algorithm

• Linear time, top-down, greedy

• Does not assume independence between predicates (unlike bottom-up greedy)
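The top-down greedy idea from the last three slides can be sketched as follows: starting from the full predicate set, repeatedly pick the rarest subset with one predicate fewer, and schedule the dropped predicate to be joined last. This is an illustration under stated assumptions, not the RDF-3X implementation. The counts for the four-element set and the three-element subsets are taken from the slide; the counts for the two-element subsets are hypothetical, and `greedy_star_order` is a made-up name.

```python
def greedy_star_order(predicates, count):
    """Return the predicates in join order (earliest join first)."""
    current = frozenset(predicates)
    joined_last = []
    while len(current) > 2:
        # Candidate subsets: drop exactly one predicate from the current set.
        candidates = [current - {p} for p in sorted(current)]
        rarest = min(candidates, key=count)
        (dropped,) = current - rarest
        joined_last.append(dropped)   # the dropped predicate is joined later
        current = rarest
    return sorted(current) + joined_last[::-1]


fs = frozenset
counts = {
    # From the slide.
    fs({"type", "livedIn", "bornIn"}): 13304,
    fs({"type", "livedIn", "created"}): 6593,
    fs({"type", "bornIn", "created"}): 6800,
    fs({"livedIn", "bornIn", "created"}): 2399,
    # Hypothetical values for the next level.
    fs({"livedIn", "bornIn"}): 15000,
    fs({"bornIn", "created"}): 7400,
    fs({"livedIn", "created"}): 6900,
}

order = greedy_star_order({"type", "livedIn", "bornIn", "created"},
                          lambda subset: counts[subset])
```

With these numbers the order is created, livedIn, bornIn, type, matching the plan in the example slide. Every decision uses a count for a whole set of predicates, so no independence between predicates is assumed.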


Cardinality estimates in arbitrary queries

[Figure: query graph with example bindings. Nodes: ?p (Thomas Mann), German novelist, Nobel Prize, ?place (Zurich), ?city (Lübeck), Germany, ?long (10° E), ?lat (53° N). Edges: type, wonPrize, livedIn, bornIn, locatedIn, hasLong, hasLat.]

• How to estimate the cardinality of this query?

• Two subqueries depend on each other: every person is likely to have one birthplace in the data

• Just multiplying their frequencies is a big underestimation

• We will construct lightweight statistics over the dataset

• Count how frequently these two star-shaped subgraphs appear together




Characteristic Pairs

• Characteristic Pair: two Characteristic Sets that appear connected via an edge in the dataset

• Identifying CPs: one scan over the data once the Char.Sets are computed

• In the worst case, the number of CPs grows quadratically with the number of distinct Char.Sets

• But we are only interested in very frequent ones

• If the pair is rare, the independence assumption holds


Char.Pairs: Estimating the cardinalities

select distinct ?s ?o
where { ?s p1 ?x1. ?s p2 ?x2. ?s p3 ?o. ?o p4 ?y1. }

• {S_i} ← Char.Sets with {p1, p2, p3}

• {S'_i} ← Char.Sets with {p4}

• Form all the Char.Pairs between {S_i} and {S'_i}

• Get their counts, sum up
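The four steps above can be sketched on the small book data set used earlier. One assumption goes beyond the slide: each pair is keyed by the connecting predicate as well, so that the count can be restricted to edges labelled p3. The function names are illustrative, and no frequency threshold for rare pairs is applied.

```python
from collections import Counter, defaultdict

triples = [
    ("o1", "title", "The Tree and I"),
    ("o1", "author", "R. Pecker"),
    ("o1", "author", "D. Owl"),
    ("o1", "year", "1996"),
    ("o2", "title", "Emma"),
    ("o2", "author", "J. Austen"),
    ("o2", "year", "1815"),
    ("J. Austen", "hasName", "Jane Austen"),
    ("J. Austen", "bornIn", "Steventon"),
]


def characteristic_sets(triples):
    """Map every subject to the set of its outgoing predicates."""
    predicates = defaultdict(set)
    for s, p, _ in triples:
        predicates[s].add(p)
    return {s: frozenset(ps) for s, ps in predicates.items()}


def characteristic_pairs(triples, cs_of):
    """One scan: count edges whose object has a characteristic set of its own."""
    pairs = Counter()
    for s, p, o in triples:
        if o in cs_of:
            pairs[(cs_of[s], p, cs_of[o])] += 1
    return pairs


def estimate_distinct_pairs(pairs, subject_preds, link_pred, object_preds):
    """Sum the counts of all pairs that cover both stars of the query."""
    sp, op = frozenset(subject_preds), frozenset(object_preds)
    return sum(
        n for (cs_s, p, cs_o), n in pairs.items()
        if p == link_pred and sp <= cs_s and op <= cs_o
    )


cs_of = characteristic_sets(triples)
pairs = characteristic_pairs(triples, cs_of)
# ?s title ?x1. ?s year ?x2. ?s author ?o. ?o hasName ?y1.
estimate = estimate_distinct_pairs(pairs, {"title", "year", "author"},
                                   "author", {"hasName"})
```

Only `o2` links to an author that has outgoing edges of its own, so the sketch reports one (?s, ?o) pair. The two authors of `o1` never appear as subjects and therefore form no pair.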







Query simplification

[Figure: the query graph from before with two added nodes ?P1 and ?P2 that stand for the star-shaped subqueries; the edges created and linksTo appear a second time with the labels s1 and s2.]

• We start with identifying optimal plans for subqueries

• Now, we remove them from the SPARQL query graph, and run the Dynamic Programming algorithm

• We know the selectivities between the subqueries

Entities      Partial Plan                                  Cost
{P1}          (wonPrize ⋈ type) ⋈ bornIn                    3000
{P2}          (locatedIn ⋈ hasLong) ⋈ hasLat                5000
{book}        IndexScan(P = linksTo, S = ?book)             4500
{P1, book}    ((wonPrize ⋈ type) ⋈ bornIn) ⋈ wrote          7500
...           ...                                           ...
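Once the star-shaped subqueries are collapsed into single nodes, the remaining graph is small enough for dynamic programming. The sketch below runs a textbook-style DP over connected subsets of three nodes (P1, book, P2). Everything numeric is hypothetical: the cardinalities, the selectivities s1 and s2, and the cost model (cost of both inputs plus estimated output size) are assumptions made for the illustration, so the costs do not reproduce the table above. The graph is assumed to be connected.

```python
from itertools import combinations


def dp_join_order(card, base_cost, selectivity):
    """Best plan for all nodes; returns (cost, estimated cardinality, plan)."""
    nodes = sorted(card)
    best = {frozenset([n]): (base_cost[n], card[n], n) for n in nodes}
    for size in range(2, len(nodes) + 1):
        for subset in map(frozenset, combinations(nodes, size)):
            for k in range(1, size):
                for left in map(frozenset, combinations(sorted(subset), k)):
                    right = subset - left
                    if left not in best or right not in best:
                        continue  # one side cannot be built without a cross product
                    sel, connected = 1.0, False
                    for a in left:
                        for b in right:
                            s = selectivity.get(frozenset((a, b)))
                            if s is not None:
                                sel *= s
                                connected = True
                    if not connected:
                        continue  # no join edge between the two sides
                    l_cost, l_card, l_plan = best[left]
                    r_cost, r_card, r_plan = best[right]
                    out_card = l_card * r_card * sel
                    cost = l_cost + r_cost + out_card
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, out_card, (l_plan, r_plan))
    return best[frozenset(nodes)]


# Hypothetical statistics for the simplified query graph P1 - book - P2.
card = {"P1": 100, "book": 4000, "P2": 50}
base_cost = {"P1": 3000, "book": 4500, "P2": 5000}
selectivity = {
    frozenset(("P1", "book")): 0.001,  # s1 on the created edge
    frozenset(("book", "P2")): 0.005,  # s2 on the linksTo edge
}
cost, out_card, plan = dp_join_order(card, base_cost, selectivity)
```

With these numbers the cheapest plan joins P1 with book first and adds P2 afterwards. The DP table has only a handful of entries here, whereas the unsimplified query would need entries for subsets of all individual triple patterns.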





Compile and Runtime for YAGO

Total runtime (optimization time) by query size (number of joins)

Algo          [10, 20)         [20, 30)      [30, 40)      [40, 50]
DP            7745 (7130)      -             -             -
DP-CS         65767 (65223)    -             -             -
Greedy        857 (133)        1236 (413)    2204 (838)    4145 (1194)
HSP           1025 (2)         3189 (3)      4102 (4)      10720 (5)
Char.Pairs    660 (150)        967 (315)     1211 (348)    2174 (890)


Other Challenges

• complex paths (transitivity etc.)

• complex aggregates

• updates

• transactions

• ...

Many hard problems, need careful analysis and tests.


Conclusion

Graph Data Processing is hard

• complex, no schema, correlations, etc.

• requires efficient storage and indexing

• query optimization is essential

• powerful techniques pay off very quickly

Many interesting problems still open.

