Querying Graph-Structured Data
Thomas Neumann
Technische Universität München
November 4, 2016
Motivation
Many interesting data sets have a graph structure.
• very flexible
• easy to model
• but difficult to query
• often very large
• no obvious structure
• how to store and process?
Linked Datasets as of August 2014
[Figure: Linked Open Data cloud diagram. Legend: Linguistics, Social Networking, Life Sciences, Cross-Domain, Government, User-Generated Content, Publications, Geographic, Media, Identifiers.]
The Linked Open Data cloud is in use. It contains data sets with billions of entries.
Thomas Neumann Querying Graph-Structured Data 2 / 32
Graph-structured data
One way to model graph-structured data is to use RDF (Resource Description Framework).
• conceptually a directed graph with edge labels
• each edge represents a fact (triple in RDF notation)
• triples have the form (subject, predicate, object)
Example:
• <obj1> <cityName> 'Berlin'
• <obj1> <isCapitalOf> <obj2>
• <obj2> <countryName> 'Germany'
[Figure: the same facts drawn as a graph: obj1 --cityName--> Berlin, obj1 --isCapitalOf--> obj2, obj2 --countryName--> Germany, ...]
Everything is encoded as triples, queries operate on triples.
SPARQL Protocol and RDF Query Language
All capitals in Europe:
SELECT ?capital ?country
WHERE {
?x <cityName> ?capital.
?x <isCapitalOf> ?y.
?y <countryName> ?country.
?y <isInContinent> <Europe>.
}
• querying via pattern matching in RDF graph
• queries are sets of triple patterns
• variable occurrences imply joins
Problem: huge graph, many variable bindings possible
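The pattern-matching view of the query above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the triples are the three facts from the RDF example plus one assumed <isInContinent> fact, and the names `match_pattern` and `evaluate` are made up for this sketch. It uses naive nested-loop joins, not the techniques discussed later.

```python
# Toy triple store: the three facts from the RDF example plus one assumed
# <isInContinent> triple so that the "capitals in Europe" query has a match.
TRIPLES = [
    ("obj1", "cityName", "Berlin"),
    ("obj1", "isCapitalOf", "obj2"),
    ("obj2", "countryName", "Germany"),
    ("obj2", "isInContinent", "Europe"),  # assumed, not part of the example
]


def match_pattern(pattern, binding):
    """Yield every extension of `binding` under which `pattern` matches a triple."""
    for triple in TRIPLES:
        candidate = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Variable: bind it, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                # Constant: must be equal to the triple's component.
                ok = False
                break
        if ok:
            yield candidate


def evaluate(patterns):
    """Evaluate a list of triple patterns; shared variables act as joins."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match_pattern(pattern, b)]
    return bindings


query = [
    ("?x", "cityName", "?capital"),
    ("?x", "isCapitalOf", "?y"),
    ("?y", "countryName", "?country"),
    ("?y", "isInContinent", "Europe"),
]
results = [(b["?capital"], b["?country"]) for b in evaluate(query)]
print(results)
```

Every pattern scans all triples for every partial binding, which is exactly the "many variable bindings" problem on a large graph.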
How to process SPARQL queries?
• we could use a (relational) database
• load the graph as triples into a table
• patterns form filters and joins
• produces the correct answer
• but very inefficient
• the database does not “understand” the graph structure
• a specialized RDF engine is more efficient
• I will talk about RDF-3X here (open source)
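The relational approach can be tried out with Python's built-in sqlite3 module. The sketch below is an assumption-laden toy: the table name `triples`, its columns, and the extra <isInContinent> fact are chosen for illustration. It shows how each triple pattern turns into one scan of the same table, with shared variables as join conditions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("obj1", "cityName", "Berlin"),
        ("obj1", "isCapitalOf", "obj2"),
        ("obj2", "countryName", "Germany"),
        ("obj2", "isInContinent", "Europe"),  # assumed extra fact
    ],
)

# Four triple patterns become a four-way self-join of the triples table:
# constants turn into filters, shared variables (?x, ?y) into join conditions.
rows = conn.execute(
    """
    SELECT t1.o AS capital, t3.o AS country
    FROM triples t1, triples t2, triples t3, triples t4
    WHERE t1.p = 'cityName'
      AND t2.p = 'isCapitalOf'   AND t2.s = t1.s
      AND t3.p = 'countryName'   AND t3.s = t2.o
      AND t4.p = 'isInContinent' AND t4.s = t2.o AND t4.o = 'Europe'
    """
).fetchall()
print(rows)
```

The answer is correct, but the database sees only one generic table and has no statistics about how predicates relate to each other.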
Indexing RDF Graphs
Primary data structure: clustered B+-trees
• stores triples in lexicographical order
• allows for good compression (differences are small)
• sequential disk accesses, fast lookups
Example: Sort order (S,P,O), triple pattern (obj1, pred, ?x) ⇒ read range (obj1, pred, −∞) to (obj1, pred, ∞) in the B+-tree
Which sort order to choose?
• index is heavily compressed, space consumption not that critical
• 3! = 6 possible orderings ⇒ 6 B+-trees
• always the 'right' sort order available, efficient merge joins
e.g. ?x <cityName> ?capital. ?x <isCapitalOf> ?y. ⇒ (cityName, ?x, ?capital)_PSO ⋈ (isCapitalOf, ?x, ?y)_PSO
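The idea of keeping all six sort orders can be sketched with plain sorted lists standing in for the clustered B+-trees. This is a simplification, not how RDF-3X is implemented: there is no compression and no paging, and the names `INDEXES` and `prefix_scan` are invented for this example. A triple pattern with bound components becomes a contiguous range in the matching ordering.

```python
from bisect import bisect_left
from itertools import permutations

TRIPLES = [
    ("obj1", "cityName", "Berlin"),
    ("obj1", "isCapitalOf", "obj2"),
    ("obj2", "countryName", "Germany"),
]

# One sorted list per ordering of (S, P, O): "SPO", "SOP", "PSO", ...
POS = {"S": 0, "P": 1, "O": 2}
INDEXES = {
    "".join(order): sorted(tuple(t[POS[c]] for c in order) for t in TRIPLES)
    for order in permutations("SPO")
}


def prefix_scan(order, prefix):
    """Return all entries of index `order` that start with `prefix`."""
    index = INDEXES[order]
    start = bisect_left(index, prefix)  # first entry >= prefix
    result = []
    for entry in index[start:]:
        if entry[: len(prefix)] != prefix:
            break  # left the contiguous range
        result.append(entry)
    return result


# Pattern (obj1, isCapitalOf, ?x): S and P are bound, so use the SPO index.
print(prefix_scan("SPO", ("obj1", "isCapitalOf")))
# Pattern (?x, cityName, ?capital): only P is bound, so use the PSO index.
print(prefix_scan("PSO", ("cityName",)))
```

Because both scans of a PSO range return entries sorted by subject, two such ranges can be merge-joined on ?x without extra sorting.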
Runtime Improvements
RDF-3X uses many techniques to improve runtime performance:
• compressed B-trees reduce size and improve I/O performance
• exhaustive indexing often allows for cheap merge joins
• sideways information passing skips over large parts of the data
• works on compressed/encoded data as much as possible
• ...
Optimize performance and minimize disk I/O.
Indexing is Not Enough
select *
where {
?s yago:created ?product.
?s yago:hasLatitude ?lat.
?s yago:hasLongitude ?long
}
[Figure: two join trees for the query above]
Suboptimal: (hasLongitude ⋈1 hasLatitude) ⋈2 created, |⋈1| = 140 Mln, Runtime: 65 ms
Optimal: (created ⋈1 hasLatitude) ⋈2 hasLongitude, |⋈1| = 14 K, Runtime: 20 ms
Query optimization has a huge impact, sometimes orders of magnitude.
Cardinality Estimation
Traditional estimation:
• estimates for individual predicates and joins
• combined assuming independence
• statistical synopses
Not well suited for RDF data
Why are Standard Histograms not Enough?
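Some numbers from the Yago data set:
$\mathrm{sel}(\sigma_{P=\text{isCitizenOf}}) = 1.06 \cdot 10^{-4}$
$\mathrm{sel}(\sigma_{O=\text{United States}}) = 6.41 \cdot 10^{-4}$
$\mathrm{sel}(\sigma_{P=\text{isCitizenOf} \wedge O=\text{United States}}) = 4.86 \cdot 10^{-5}$
$\mathrm{sel}(\sigma_{P=\text{isCitizenOf}}) \cdot \mathrm{sel}(\sigma_{O=\text{United States}}) = 6.80 \cdot 10^{-8}$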
• independence assumption does not hold
• leads to severe underestimation
• multi-dimensional histograms would help (expensive)
• looking at individual triples is not enough
For RDF data, correlation is the norm!
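The size of the error can be checked directly from the Yago numbers above. The snippet below only redoes that arithmetic; the variable names are chosen for illustration. Multiplying the rounded selectivities gives about 6.79e-8, which agrees with the reported 6.80e-8 up to rounding of the inputs.

```python
sel_p = 1.06e-4       # sel(P = isCitizenOf)
sel_o = 6.41e-4       # sel(O = United States)
sel_actual = 4.86e-5  # sel(P = isCitizenOf AND O = United States)

# Independence assumption: multiply the individual selectivities.
sel_independent = sel_p * sel_o

# How far off is the independence-based estimate?
underestimation = sel_actual / sel_independent

print(f"independence estimate: {sel_independent:.2e}")
print(f"actual selectivity:    {sel_actual:.2e}")
print(f"underestimated by a factor of roughly {underestimation:.0f}")
```

The estimate is off by a factor of several hundred, which is enough to push an optimizer toward a very different join order.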
Why is Correlation a Problem?
Correlation occurs across triples:
• some triples are closely related
• independence does not hold
Very common:
• soft functional dependencies
• once one triple pattern is bound, the others become unselective
• not captured by attribute histograms
Example Triples
<o1> <title> "The Tree and I".
<o1> <author> <R. Pecker>.
<o1> <author> <D. Owl>.
<o1> <year> "1996".
Why Not Sampling?
RDF is very unfriendly for sampling
• no schema
• one huge ”relation”
• billions of tuples
• very diverse
Yago sample
<wikicategory Wilderness Areas of Illinois> rdfs:label "Wilderness Areas of Illinois" .
<Telephone numbers in Cameroon> rdfs:label "\u002b237" .
<Washington Park Race Track> rdfs:label "Washington Park" .
<Seth R.J.J. High School> rdfs:label "Sett R\u002eJ\u002eJ\u002e High School" .
<Tengasu> rdfs:label "Tengasu" .
<Immaculate Heart Academy> rdfs:label "Immaculate Heart Academy" .
<Sion, Switzerland> rdfs:label "Sion\u002c Switzerland" .
<wordnet heroism 104857738> rdfs:label "gallantry" .
<Khyber Pakhtunkhwa> rdfs:label "Khyber\u002dPakhtunkhwa" .
<J%C3%A1nos Pap> rdfs:label "Janos Pap" .
<wikicategory Jan Smuts> rdfs:label "Jan Smuts" .
...
Sample would have to be huge to be useful.
Capturing Correlations
We classify the tuples using characteristic sets
• compact data structure
• groups triples by ”behavior”
• within a group, triples are more homogeneous
• groups are annotated with occurrence statistics
• allows for deriving estimates for whole query fragments
• captures correlations within tuples and across tuples
Allows for very accurate cardinality estimates.
Characteristic Sets
Observation: nodes are characterized by outgoing edges
$S_C(s) := \{p \mid \exists o : (s, p, o) \in R\}$
$S_C(R) := \{S_C(s) \mid \exists p, o : (s, p, o) \in R\}$
Example
<o1> <title> "The Tree and I".
<o1> <author> <R. Pecker>.
<o1> <author> <D. Owl>.
<o1> <year> "1996".
<o2> <title> "Emma".
<o2> <author> <J. Austen>.
<o2> <year> "1815".
<J. Austen> <hasName> "Jane Austen".
<J. Austen> <bornIn> <Steventon>.
$S_C(o_1) = \{title, author, year\}$
$S_C(o_2) = \{title, author, year\}$
$S_C(R) = \{\{title, author, year\}^2, \{hasName, bornIn\}^1\}$
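Computing characteristic sets from the example triples takes one pass over the data. The following is a minimal sketch, assuming triples are plain Python tuples; the function name `characteristic_sets` is made up for this illustration and does not refer to any RDF-3X API.

```python
from collections import Counter, defaultdict

TRIPLES = [
    ("o1", "title", "The Tree and I"),
    ("o1", "author", "R. Pecker"),
    ("o1", "author", "D. Owl"),
    ("o1", "year", "1996"),
    ("o2", "title", "Emma"),
    ("o2", "author", "J. Austen"),
    ("o2", "year", "1815"),
    ("J. Austen", "hasName", "Jane Austen"),
    ("J. Austen", "bornIn", "Steventon"),
]


def characteristic_sets(triples):
    """Map each characteristic set to the number of subjects that have it."""
    predicates = defaultdict(set)
    for s, p, _o in triples:
        predicates[s].add(p)  # S_C(s): the set of outgoing predicates of s
    return Counter(frozenset(ps) for ps in predicates.values())


char_sets = characteristic_sets(TRIPLES)
for preds, count in char_sets.items():
    print(sorted(preds), count)
```

The two authors of o1 collapse into a single `author` entry, so o1 and o2 share one characteristic set even though they have different numbers of triples.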
Estimating Distinct Subjects
We can use characteristic sets for cardinality estimation
query: select distinct ?e where { ?e <author> ?a. ?e <title> ?t. }
cardinality: $\sum_{S \in \{S \mid S \in S_C(R) \wedge \{author, title\} \subseteq S\}} count(S)$
• the computation is exact! (only for distinct, though)
• can estimate a large number of joins in one step
• number of characteristic sets is surprisingly low
Number of Characteristic Sets
               triples        characteristic sets
Yago           40,114,899     9,788
LibraryThing   36,203,751     6,834
UniProt        845,074,885    613
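The sum in the cardinality formula is a filter over the stored characteristic sets. Here is a small sketch with hypothetical counts (the numbers in `CHAR_SETS` are invented, not taken from Yago or any other data set) and a made-up helper name `distinct_subjects`.

```python
# Hypothetical characteristic sets with the number of subjects in each.
CHAR_SETS = {
    frozenset({"title", "author", "year"}): 2,
    frozenset({"title", "author"}): 5,
    frozenset({"title", "year"}): 3,
    frozenset({"hasName", "bornIn"}): 1,
}


def distinct_subjects(query_predicates, char_sets):
    """Sum count(S) over all characteristic sets S that contain the query predicates."""
    q = frozenset(query_predicates)
    return sum(count for s, count in char_sets.items() if q <= s)


# select distinct ?e where { ?e <author> ?a. ?e <title> ?t. }
n = distinct_subjects({"author", "title"}, CHAR_SETS)
print(n)
```

Only the first two sets contain both `author` and `title`, so they alone contribute to the result. With a few thousand characteristic sets, as in the table above, such a scan stays cheap.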
Occurrence Annotations
Without distinct we need occurrence annotations
distinct: $|\{s \mid \exists p, o : (s, p, o) \in R \wedge S_C(s) = S\}|$
count(p1): $|\{(s, p_1, o) \mid (s, p_1, o) \in R \wedge S_C(s) = S\}|$
count(p2): $|\{(s, p_2, o) \mid (s, p_2, o) \in R \wedge S_C(s) = S\}|$
. . .
Example
select ?a ?t where { ?e <author> ?a. ?e <title> ?t. }
distinct   author   title   year
1000       2300     1010    1090
Estimate: $1000 \cdot \frac{2300}{1000} \cdot \frac{1010}{1000} = 2323$
• no longer exact, but very accurate in practice
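The estimate above multiplies the number of distinct subjects by the average number of triples per subject for each queried predicate. A sketch for a single characteristic set follows; the dictionary layout and the name `estimate_star` are assumptions of this example, and a full estimator would sum this quantity over every characteristic set that contains the queried predicates.

```python
def estimate_star(stats, query_predicates):
    """Estimate result size for one characteristic set with occurrence annotations."""
    distinct = stats["distinct"]
    estimate = distinct
    for p in query_predicates:
        # Average number of p-triples per subject in this characteristic set.
        estimate *= stats["count"][p] / distinct
    return estimate


# Annotations from the example: distinct, count(author), count(title), count(year).
stats = {"distinct": 1000, "count": {"author": 2300, "title": 1010, "year": 1090}}

# select ?a ?t where { ?e <author> ?a. ?e <title> ?t. }
est = estimate_star(stats, ["author", "title"])
print(round(est))
```

The result is no longer exact because the per-subject multiplicities of `author` and `title` are assumed to be independent within the set.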
Using Characteristic Sets
• characteristic sets accurately describe individual subjects
• but a query touches more than one subject
• combine characteristic sets to form whole queries
General strategy:
• exploit as much information about correlation as possible
• ignore the join order ("holistic" estimates)
• avoids ”fleeing to ignorance”
• cover the query with characteristic sets
Example
select ?a ?t where {
  ?b <author> ?a.
  ?b <title> ?t.
  ?b <year> "2009".
  ?b <publishedBy> ?p.
  ?p <name> "ACM".
}
[Figure: the RDF query graph (nodes ?b, ?a, ?t, 2009, ?p, ACM; edges author, title, year, publishedBy, name) next to the traditional query graph, whose nodes are the triple patterns (?b, author, ?a), (?b, title, ?t), (?b, year, 2009), (?b, publishedBy, ?p), (?p, name, ACM)]
• we cover the query with characteristic sets
• prefer large sets over small sets
• assume independence for the rest
Challenges of SPARQL query optimization
Query Optimization:
                 Query Compilation ⇒                   Query Execution
                 (dominated by query optimization)
RDF-3X           78 s                                  2 s
Virtuoso 7       1.3 s                                 384 s
(next slides)    1.2 s                                 2 s
We ran a query with 17 joins on YAGO dataset (100 Mln triples)
Why does it happen?
Properties of the model:
• RDF is a very verbose format
• TPC-H Q5: 5 joins in SQL vs 26 joins in SPARQL (assuming a triplestore storage)
• Dynamic Programming (RDF-3X) becomes too expensive
Properties of the data:
• Lots of correlations, including structural
• If an entity has a LastName, it is likely to have a FirstName
• Greedy Algorithm (Virtuoso) often makes wrong choices in the beginning
Combining Estimation and Optimization
Given a SPARQL query:
[Figure: query graph with variables ?p, ?place, ?book, ?city, ?long, ?lat and constants German novelist, Nobel Prize, Italy; edge labels: type, wonPrize, bornIn, created, linksTo, locatedIn, hasLong, hasLat]
• How to optimize star-shaped subqueries?
• How to capture selectivities between subqueries?
• How to optimize arbitrary-shaped queries?
Optimizing star-shaped subqueries
[Figure: star-shaped query with nodes ?p, ?place1, ?type, ?place2, ?s and edge labels livedIn, type, bornIn, created]
• {type, livedIn, bornIn, created} → 1025 entities
• Characteristic Set
  • Count all distinct Char.Sets with number of occurrences
  • Accurate estimation of cardinalities of star-shaped queries
• One step beyond: what is the rarest subset of the given CS?
  • {type, livedIn, bornIn} → 13304 entities
  • {type, livedIn, created} → 6593 entities
  • {type, bornIn, created} → 6800 entities
  • {livedIn, bornIn, created} → 2399 entities
• type is not present in the rarest subset; we want to join it last
Example
{type, livedIn, bornIn, created}, ID: 154
{livedIn, bornIn, created}, ID: 27
{livedIn, created}, ID: 6
[Figure: resulting join tree, top-down]
⋈ (ID 154): (⋈ ID 27) ⋈ (?p, type, ?o4)
⋈ (ID 27): (⋈ ID 6) ⋈ (?p, bornIn, ?o2)
⋈ (ID 6): (?p, created, ?o1) ⋈ (?p, livedIn, ?o3)
Properties of the algorithm
• Linear time, top-down, greedy
• Does not assume independence between predicates (unlike bottom-up greedy)
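The top-down greedy idea for star-shaped subqueries can be sketched as follows. This is one possible reading of the slides, not the actual implementation. The counts for the four-element and three-element sets are the ones given earlier; the two-element counts are hypothetical, and the function name `join_order` is made up.

```python
# Entity counts per predicate subset. The four- and three-element counts come
# from the star-shaped example; the two-element counts are hypothetical.
COUNTS = {
    frozenset({"type", "livedIn", "bornIn", "created"}): 1025,
    frozenset({"type", "livedIn", "bornIn"}): 13304,
    frozenset({"type", "livedIn", "created"}): 6593,
    frozenset({"type", "bornIn", "created"}): 6800,
    frozenset({"livedIn", "bornIn", "created"}): 2399,
    frozenset({"livedIn", "bornIn"}): 9000,   # hypothetical
    frozenset({"livedIn", "created"}): 4100,  # hypothetical
    frozenset({"bornIn", "created"}): 5200,   # hypothetical
}


def join_order(predicates, counts):
    """Top-down greedy: repeatedly drop the predicate whose removal
    leaves the rarest subset; dropped predicates are joined last."""
    current = frozenset(predicates)
    joined_last_first = []
    while len(current) > 2:
        # All subsets obtained by removing exactly one predicate.
        rarest = min(
            (current - {p} for p in current),
            key=lambda subset: counts[subset],
        )
        (removed,) = current - rarest
        joined_last_first.append(removed)
        current = rarest
    # The two remaining predicates form the first join.
    return sorted(current) + joined_last_first[::-1]


order = join_order(["type", "livedIn", "bornIn", "created"], COUNTS)
print(order)
```

With these numbers the order starts with `created` and `livedIn`, then `bornIn`, and joins `type` last, matching the sets with IDs 6, 27, and 154 in the example. No step multiplies per-predicate selectivities, so no independence assumption is needed.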
Cardinality estimates in arbitrary queries
[Figure: query graph with example bindings: ?p (Thomas Mann), ?place (Zurich), ?city (Lübeck), ?long (10° E), ?lat (53° N) and constants German novelist, Nobel Prize, Germany; edge labels: type, wonPrize, livedIn, bornIn, locatedIn, hasLong, hasLat]
• How to estimate the cardinality of this query?
• Two subqueries depend on each other: every person is likely to have one birthplace in the data
• Just multiplying their frequencies is a big underestimation
• We will construct lightweight statistics of the dataset
• Count how frequently these two star-shaped subgraphs appear together
Characteristic Pairs
• Characteristic Pair: Two Characteristic Sets that appear connected via an edge in the dataset
• Identifying CP: one scan over the data once the Char.Sets are computed
• In the worst case, the number of CP grows quadratically with different Char.Sets
• But we are only interested in very frequent ones
• If the pair is rare, the independence assumption holds
Char.Pairs: Estimating the cardinalities
select distinct ?s ?o
where {
  ?s p1 ?x1.
  ?s p2 ?x2.
  ?s p3 ?o.
  ?o p4 ?y1.
}
• $\{S_i\}$ ← Char.Sets with {p1, p2, p3}
• $\{S'_i\}$ ← Char.Sets with {p4}
• Form all the Char.Pairs between $\{S_i\}$ and $\{S'_i\}$
• Get their counts, sum up
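The four steps above can be sketched with dictionaries. Everything in this example is hypothetical: the set IDs, the pair counts, and the name `estimate_pairs`. As a simplification, a pair is keyed only by the two set IDs and does not record which predicate connects them.

```python
# Hypothetical statistics: characteristic sets by ID, and counts of
# characteristic pairs keyed by (subject-side set ID, object-side set ID).
CHAR_SETS = {
    1: frozenset({"p1", "p2", "p3"}),
    2: frozenset({"p1", "p2", "p3", "p5"}),
    3: frozenset({"p4"}),
    4: frozenset({"p4", "p6"}),
    5: frozenset({"p1", "p3"}),
}
CHAR_PAIRS = {(1, 3): 120, (1, 4): 30, (2, 3): 50, (5, 4): 999}


def estimate_pairs(subject_preds, object_preds, char_sets, char_pairs):
    """Sum the counts of all characteristic pairs that cover both stars."""
    subj_q = frozenset(subject_preds)
    obj_q = frozenset(object_preds)
    # {S_i}: sets containing the predicates of the ?s star.
    subj = {i for i, s in char_sets.items() if subj_q <= s}
    # {S'_i}: sets containing the predicates of the ?o star.
    obj = {i for i, s in char_sets.items() if obj_q <= s}
    return sum(c for (a, b), c in char_pairs.items() if a in subj and b in obj)


est = estimate_pairs({"p1", "p2", "p3"}, {"p4"}, CHAR_SETS, CHAR_PAIRS)
print(est)
```

The pair (5, 4) is ignored because set 5 lacks p2, so only the pairs (1, 3), (1, 4), and (2, 3) contribute.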
Outline
Given a SPARQL query:
[Figure: query graph with variables ?p, ?place, ?book, ?city, ?long, ?lat and constants German novelist, Nobel Prize, Italy; edge labels: type, wonPrize, bornIn, created, linksTo, locatedIn, hasLong, hasLat]
• How to optimize star-shaped subqueries?
• How to capture selectivities between subqueries?
• How to optimize arbitrary-shaped queries?
Query simplification
[Figure: the query graph with the star-shaped subqueries around ?p and ?city collapsed into nodes ?P1 and ?P2; the remaining edges created and linksTo are annotated with selectivities s1 and s2]
• We start with identifying optimal plans for subqueries
• Now, we remove them from the SPARQL query graph, and run the Dynamic Programming algorithm
• We know the selectivities between the subqueries
Entities       Partial Plan                                 Cost
{P1}           (wonPrize ⋈ type) ⋈ bornIn                   3000
{P2}           (locatedIn ⋈ hasLong) ⋈ hasLat               5000
{book}         IndexScan(P = linksTo, S = ?book)            4500
{P1, book}     ((wonPrize ⋈ type) ⋈ bornIn) ⋈ wrote         7500
. . .          . . .                                        . . .
Compile and Runtime for YAGO
Columns: query size (number of joins); cells: total runtime (optimization time)
Algo          [10, 20)         [20, 30)      [30, 40)      [40, 50]
DP            7745 (7130)      -             -             -
DP-CS         65767 (65223)    -             -             -
Greedy        857 (133)        1236 (413)    2204 (838)    4145 (1194)
HSP           1025 (2)         3189 (3)      4102 (4)      10720 (5)
Char.Pairs    660 (150)        967 (315)     1211 (348)    2174 (890)
Other Challenges
• complex paths (transitivity etc.)
• complex aggregates
• updates
• transactions
• ...
Many hard problems, need careful analysis and tests.
Conclusion
Graph Data Processing is hard
• complex, no schema, correlations, etc.
• requires efficient storage and indexing
• query optimization is essential
• powerful techniques pay off very quickly
Many interesting problems still open.