Collaborative Project
LOD2 – Creating Knowledge out of Interlinked Data
Deliverable 5.1.4
LOD2 GeoBench v2.0 Evaluation
Dissemination Level: Public
Due Date of Deliverable: Month 36, 31/08/2013
Actual Submission Date: Month 36, 31/08/2013
Work Package: WP5 - Linked Data Browsing, Visualization and Authoring Interfaces
Task: T5.1
Type: Report
Approval Status: Approved
Version: 1.0
Number of Pages: 49
Filename: LOD2_D5_1_4_GEO_Benchmark_Evaluation.pdf
Abstract: This report describes the evaluation of the LOD2 Geo Benchmark, developed to ensure that RDF storage engines provide the proper level of functionality and performance to facilitate the needs of Linked Data Browsing, Visualization and Authoring Interfaces.
The information in this document reflects only the author’s views and the European Community is not liable for any use that may be made of the information contained therein. The information in this document is provided “as is” without guarantee or warranty of any kind, express or implied, including but not limited to the fitness of the information for a particular purpose. The user thereof uses the information at his/her sole risk and liability.
Project co-funded by the European Commission within the Seventh Framework Programme (2007 – 2013)
Project Number: 257943
Start Date of Project: 01/09/2010
Duration: 48 months
This benchmark is not intended as a purely scientific deliverable; rather, it focuses on addressing practical challenges in the Geo Browsing components developed by the University of Leipzig (browser.linkedgeodata.org). In particular, it highlights performance problems encountered when laying out linked objects on a map, which may be viewed at highly different zoom levels. The challenge is to make sure that performance always remains interactive, irrespective of the zoom level or facet selections.
This report coincides with the open-source release of v2.0 of the LOD2 GeoBench. The evaluation presented here goes beyond the one in the initial specification in D5.1.2, which was run on just one system (an alpha pre-release version of Virtuoso 7). Here we add benchmarking on multiple systems, on large data sizes (scale factor 100), and on cluster hardware instead of just a single machine.
The overall message coming out of these experiments is that creating high-performance (interactive) geospatial faceted browsing interfaces requires specific pre-computation and indexing effort (embodied by the “quad” implementation). This means that, on the one hand, application designers need to think about their data access strategy. On the other hand, RDF database systems need more hooks for physical tuning to make this possible.
1. Introduction
Geographic information management is a generally well-understood task in data management. Relational database systems support geographical data, sometimes by incorporating multi-dimensional indexing structures like the RTree, or by using simple uni-dimensional BTrees (in conjunction with a space-filling curve). In RDF data management, many RDF stores support spatial data management, providing functions to test geospatial predicates, sometimes supported by data structures such as the RTree. These system-specific extensions are being replaced by general adoption of the GeoSPARQL standard proposed by the Open Geospatial Consortium. As such, application development and deployment where the data involves geography should be supportable with RDF database systems. This activity in LOD2 puts that to the test.
In the past deliverable D5.1.2, a new database and application benchmark for faceted geographic querying was introduced, called the LOD2 GeoBench (v1.0). The underlying goal of this benchmark is to improve the user experience of the Geospatial Browser developed by AKSW in the context of the LOD2 project (browser.linkedgeodata.org), both by influencing the design of the application and by measuring and improving the raw power for geographical query execution in RDF database systems.
In this deliverable we report on a series of experiments running the LOD2 GeoBench on four different systems: OWLIM 5.3, OpenLink Virtuoso V6 (open source), OpenLink Virtuoso V7 (open source) and OpenLink Virtuoso V7 Cluster Edition. The hardware platform used was the SCILENS database compute cluster at CWI. This hand-built cluster consists of three different layers of nodes, of which we used the highest “bricks” layer, built out of 16 large servers (16 cores, 256GB RAM). This same platform was used to create the record-breaking runs with 150 billion triples on the BSBM Explore and Business Intelligence benchmarks (see deliverable D2.1.4 and 1).
1.1 Outline
In Section 2, we describe the LOD2 GeoBench benchmark in its v2.0 version, released as open source in conjunction with this deliverable. The benchmark can currently be implemented by RDF database systems in four different ways (basic, rtree, rtree++ and quad), which we describe in detail.
In Section 3, we provide and discuss the results of running the benchmark at scale factors 1, 10 and 100 on the platforms described above. When using the “quad” implementation, which provides imprecise answers, RDF database systems turn out to be capable of sustaining tens of concurrent client requests on a single machine. Considering that real users of the Geospatial Browser would spend significant think time between queries, this means that a single machine could support hundreds of concurrent users. If precise answers are required, these experiments show that RDF-based geographical support (“rtree++”) provides high performance in queries that are moderately to strongly zoomed in, while queries on large geographical areas (zoomed out) still have low performance. It is evident, though, that this problem cannot be eliminated inside RDF database systems; only application redesign can overcome it. In all, the experimental results show clear improvements over the situation 18 months ago, as documented in D5.1.2.
In Section 4 we make some forward-looking statements and recommendations, both for application design in geographical faceted browsing and for RDF database technology. In short, application designers should think ahead and create additional (indexing) data structures in order to ensure interactive performance at all times. Such physical database design is very common in relational database systems, but almost completely undeveloped in RDF database systems. For their part, RDF systems should expose more features to enable such additional (indexing) opportunities.
The LOD2 GeoBench is an RDF database/application benchmark for faceted geographical querying. In particular, its queries use a combination of geographical selection with grouping and counting by facets. Such faceted querying in its mainstream use (outside RDF, e.g. using relational technology) is known to be a hard problem. The problem is that grouping and counting by facet requires a lot of computational effort if many facet instances qualify for the selection, yet due to the unbounded number of possible selection predicates it is hard to prepare the system for this. Thus, queries involving millions of instances must really group and count millions of tuples (or triples), and making this part of an interactive system that should render a result screen within 0.2 seconds is a challenge. Also, faceted browsing servers on the web may be used by many clients simultaneously. As such, the database system answering the queries should be capable of providing this interactive experience to many users at the same time.
The goal of the LOD2 GeoBench result metric (queries per second per $) is to highlight the performance and architecture problems faced by the Linked Geodata Browser application (browser.linkedgeodata.org), which is being developed at the University of Leipzig as part of the LOD2 project. Specifically, it is intended to stimulate both (i) technical progress in RDF database technology, improving both the query execution and query optimization support for geographical queries in SPARQL backends, and (ii) thinking about a possible redesign of RDF-based applications like the Linked Geodata Browser. This suggestion for redesign points toward an opportunity to redesign physical RDF databases, where, for specific access patterns and queries, the application architect and DBA could decide to pre-create certain indexes and materialized views (note that this is phrased in relational database terms; in practice this could take the form of additional synthetic triples).
The LOD2 GeoBench was developed as deliverable D5.1.2 in the LOD2 project, 18 months earlier. Coinciding with this report, we have released version v2.0 of the benchmark, whose software and documentation are available as open source:
http://svn.aksw.org/lod2/LOD2-GeoBench
We therefore continue with a recap of the benchmark design and description, including a description of what has changed in v2.0.
We call this core dataset the SF1 dataset. It contains roughly 10M geographic objects. The number of triples (130M) is significantly higher, and the uncompressed size in bytes is 20GB. For
The benchmark therefore scales this core of real data to any cardinal factor x*SF by copying all triples in all datasets x times, appending the string “_y” (for all y: 0<y<x) to all URIs starting with http://linkedgeodata.org/. This means we get many more facets in the ontology, and every facet is duplicated x times in the dataset, belonging to new copies of the instances. This kind of scaling is highly similar to the one proposed in the DBpedia benchmark, and mimics what would happen if more properties of OpenStreetMap were included in the http://linkedgeodata.org/ dump.
The v1.0 version of the LOD2 GeoBench would just make the y copies of the same data instance, with different subject URIs, replicating the data. The geographic feature (point, polygon, polyline) would be the same among the copies. This replication strategy backfires in systems that only create RTree geographical search accelerator structures on the unique set of literals, Virtuoso being such an example. That is, because the geographic features were copied and remained equal, the unique set of geographic literals would not grow, and hence the size of the RTree would not grow.
The v2.0 version of the LOD2 GeoBench, now released, changes the scaling procedure to shift each replicated geographical feature by a tiny random (lat,long) delta (encompassing a few meters). This way, all geographical features are unique, yet the set of such features is still realistic in its size and position distribution. This was the main reason to start with a “real” core dataset in the first place, since it is very hard to create synthetic, randomly generated geographical data that “makes sense” and conforms to real-world distributions.
Since April 2011, there have been new releases of the core dataset in April and August 2013, which contain roughly the same data, but updated from OpenStreetMap and with the raw triple data files split by data facet category (they used to be together). However, in the LOD2 GeoBench v2.0 we have not moved to this new core dataset. The rationale has been to keep v1.0 and v2.0 of the LOD2 GeoBench as compatible as possible. Having (only slightly) more triples and having them updated from OpenStreetMap is of limited value for our purposes here. It is, however, possible that a future version of this benchmark will start using new LinkedGeoData dataset releases, if only for the reason that the benchmark specification relies on the data release being online and downloadable.
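The v2.0 scaling step can be illustrated with a minimal Python sketch for point features. This is not the benchmark's own generator: the function names, the seeding scheme and the 5e-5 degree bound (about 5.5 m of latitude) are assumptions chosen only to show the idea of suffixing URIs and shifting each copy slightly.

```python
import random

LGD_PREFIX = "http://linkedgeodata.org/"

def suffix_uri(uri, y):
    """Append "_y" to URIs under the linkedgeodata.org prefix; leave other URIs untouched."""
    return f"{uri}_{y}" if uri.startswith(LGD_PREFIX) else uri

def make_copy(subject, lat, lon, y, seed=0, max_delta_deg=5e-5):
    """Return copy y of a point feature: suffixed URI plus a slightly shifted position.

    max_delta_deg is an assumed bound; 5e-5 degrees of latitude is roughly 5.5 m.
    """
    # Seeding per (seed, subject, copy) keeps the generated dataset reproducible.
    rng = random.Random(f"{seed}:{subject}:{y}")
    dlat = rng.uniform(-max_delta_deg, max_delta_deg)
    dlon = rng.uniform(-max_delta_deg, max_delta_deg)
    return suffix_uri(subject, y), lat + dlat, lon + dlon
```

Because every copy receives its own delta, the geographic literals of the copies differ, so an index built on the unique set of literals grows with the scale factor.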
random seed, deterministically picks 10 center points, and executes 12 steps, each step consisting of two queries: the Facet Count Query (FCQ) and either an Instance Retrieval Query (IRQ) or an Instance Aggregation Query (IAQ). Thus the workload in total consists of 240 queries. The sequence of 12 steps is as follows:
The power query workload executes a query run directly after data load. It is immediately followed by the throughput workload. In the power workload, the queries in the query run are executed strictly one after another. In the throughput workload, multiple query runs (generated with different parameters) run concurrently on the system. The typical concurrency levels to test are 2, 4, 8 and 16.
visible window. This is an aggregation query that counts all occurrences of each facet in the query window, be it a currently selected (active) facet or not. The query parameters here are the query center point (LATITUDE, LONGITUDE) and the window HEIGHT and WIDTH in degrees.
facets. To render a screen, the benchmark will always select 4 facets. This is a pure selection query (rectangular geographic window and facets); there is no grouping or aggregation involved. In addition to the parameters LATITUDE, LONGITUDE, HEIGHT and WIDTH, this query hence also receives four URI parameters FACET1, FACET2, FACET3, FACET4 identifying the facets of interest.
D5.1.4–v1.0
Figure 1: The Linked Geodata Browser mis‐handling situations with too many results: queries get
can cause performance and usability problems. For instance, try to imagine visualizing all street lights in all of Germany as markers on a map on a computer screen. This would mean that millions of lamppost icons need to be placed on the screen, which does not even have enough pixels for that. The resulting drawing is bound to be judged as convoluted by average users. Further, even arriving at such a drawn map is a performance challenge, since the query returns many results, which need to be processed (and, depending on the architecture of the application, might also need to be sent to the client, e.g. a web browser).
The Instance Aggregation Query deals with the problem of too many instances by summarizing the instances geographically. This query is used in the LOD2 GeoBench instead of the Instance Retrieval Query on the first four zoom levels (the first six steps). For this purpose, it divides the map into 40x20 conceptual square tiles, and allows just one marker per active facet inside one tile. It counts how many instances fall in a tile, and it displays the most relevant marker in the tile (in the benchmark, we do not really choose the most relevant marker, but the one with the largest subject URI, i.e. a random one) together with a count of occurrences.
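The tile-based summarization described above can be sketched in Python as follows. This is an illustrative client-side version, not the SPARQL the benchmark runs; the function name and the tuple layout of the input are assumptions. It uses 40 columns along longitude and 20 rows along latitude and keeps, per facet and tile, the largest subject URI and a count.

```python
def aggregate_markers(instances, lat_min, lon_min, height, width, rows=20, cols=40):
    """Summarize instances on a cols x rows grid laid over the query window.

    instances: iterable of (subject_uri, facet, lat, lon) tuples inside the window.
    Returns {(facet, col, row): (marker_subject, count)}.
    """
    cells = {}
    for subject, facet, lat, lon in instances:
        # Clamp so that points on the upper window edge fall into the last tile.
        col = min(int(cols * (lon - lon_min) / width), cols - 1)
        row = min(int(rows * (lat - lat_min) / height), rows - 1)
        marker, count = cells.get((facet, col, row), ("", 0))
        # The marker shown is simply the largest subject URI seen in the tile.
        cells[(facet, col, row)] = (max(marker, subject), count + 1)
    return cells
```

With a 9 by 4.5 degree window, each of the 40x20 tiles is 0.225 degrees on both sides, which is why the tiles are square in degree space.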
Width and Height. The zoom level Z at scale factor SF corresponds to a longitude width of 9/2^Z degrees and a latitude height of 4.5/2^Z degrees. Note that the lowest zoom level (Z=0) selects 9 degrees longitude and 4.5 degrees latitude, which roughly corresponds to an area like Germany minus Bavaria. At zoom level 7, the window is down to 0.07 by 0.03 degrees, a small downtown area.
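As a small worked example of the window formula, the following Python sketch computes the window size per zoom level. The function name is ours; only the 9/2^Z and 4.5/2^Z expressions come from the text.

```python
def window_size(zoom):
    """Return (width_deg, height_deg) of the query window at the given zoom level."""
    return 9.0 / 2 ** zoom, 4.5 / 2 ** zoom

if __name__ == "__main__":
    for z in range(8):
        width, height = window_size(z)
        print(f"zoom {z}: {width:.4f} x {height:.4f} degrees")
```

Each zoom step halves both dimensions, so the window area shrinks by a factor of four per level.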
These facet categories were chosen by analyzing the frequency of the various facets in OSM. Concretely, the above facets are chosen from the facets that have the highest frequency of occurrence. These were chosen in order (i) to make the queries challenging when zoomed out, as they will select many instances, and (ii) to guarantee that at the highest zoom level there is still a nonzero number of instances in the window.
Further, from the set of very frequent facets (which is larger than the above), we selected groups of facets that have quite similar frequencies and put them in the above four groups. That is, there are roughly 1 million places, parkings and villages, and 200,000 sport, postbox and supermarket features. Each query in the LOD2 GeoBench workload picks one facet from each category, e.g. (Parking, School, Tourism, Sport). That way, the queries always have a highly similar frequency characteristic. This in turn helps to create more stable performance among runs of the same query with different parameter bindings (this is something that e.g. BSBM does not do, making it very hard to understand how well or badly a system behaves on a certain query, as this may vary enormously with the chosen parameter).
At scale factor x*SF (with x>1), these facets are suffixed with a random “_y”, with y: 0<=y<x. Recall that when scaling the dataset to a larger size, the LOD2 GeoBench not only creates copies of all geographic features with a different subject URI, but also uses different property URIs, i.e. suffixed with _y. As mentioned, the facets used are relatively frequent facets; their frequency in the core dataset is indicated in parentheses. At zoom level 0 we expect roughly 70K instances in total belonging to any of the four selected facets; the expected number decreases at each zoom level, to just a hundred at zoom level 7. Note that as we are focusing on high-density areas (European city centers), the number of instances in a 4x smaller sub-window (zoom-in) is in fact less than 4x smaller.
Browser, which is the sum of the facet count query and the instance (aggregation) query; but this is reported as the inverse, hence PagePerSec. From a benchmark run, which executes each step 10 times, we derive an overall PagePerSec score at that step by averaging the 10 results (query latency in seconds). For multi-stream runs, we add the PagePerSec metric results for each stream to get a combined PagePerSec result.
2.2.3.2 PagePerSecond Per $1000 (PagePerSec/K$)
To take into account the cost of the hardware used in various implementations, we divide the PagePerSec metric by the monetary cost of the hardware and software used: PagePerSec/K$. If the RDF system is a commercial software product, the price for software must be the dollar list price (no discounts). The price quoted for hardware must be the publicly available end-user price of the hardware at an online merchant on the date the benchmark was run.
high zoom levels. For this reason, two different sub-metrics are reported, where the LowZoomScore is derived from steps 1 to 6, and the HighZoomScore is derived from steps 6 to 12. We use the geometric mean as the method to combine the PagePerSec scores from the various steps, because this rewards relative improvements at any step equally in the overall score, even if the individual scores at the various steps are quite diverse. Similarly, the LOD2 GeoBench Total Score (LGB-TS) is the geometric mean of the LowZoomScore (LGB-LS) and HighZoomScore (LGB-HS).
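The metrics can be summarized in a short Python sketch. It rests on stated assumptions: page latency is taken as the sum of the two query latencies of a step, the per-step score averages the latencies before inverting (the text could also be read as averaging the inverses), and the step ranges are passed in exactly as given above (steps 1 to 6 and steps 6 to 12). All function names are hypothetical.

```python
import math

def page_per_sec(fcq_latencies, second_query_latencies):
    """Per-step score: average page latency (FCQ + IRQ/IAQ, in seconds), inverted."""
    pages = [a + b for a, b in zip(fcq_latencies, second_query_latencies)]
    return 1.0 / (sum(pages) / len(pages))

def geometric_mean(values):
    """Geometric mean of positive values, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def lgb_scores(step_scores, low_steps=range(1, 7), high_steps=range(6, 13)):
    """Return (LGB-LS, LGB-HS, LGB-TS) from the 12 per-step PagePerSec scores."""
    low = geometric_mean([step_scores[s - 1] for s in low_steps])
    high = geometric_mean([step_scores[s - 1] for s in high_steps])
    return low, high, geometric_mean([low, high])

def per_kdollar(score, cost_usd):
    """PagePerSec/K$: divide a score by the system cost in thousands of dollars."""
    return score / (cost_usd / 1000.0)
```

Using the geometric mean means that doubling the score of any single step multiplies the sub-score by the same factor, regardless of how large that step's score was to begin with.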
file and produces x output files 0<=y<x with _y suffixes in the URIs. It should be used on all core dataset files. These files can then be imported into the RDF database system. Generating the copies of the core dataset files should not be included in the database load time.
Query Generator. The benchmark comes with a query generator (geoqgen.c) that, given a run number and a scale factor (SF), generates 240 textual queries. The run number is:
the number of facet instances in the rectangular query window. The basic strategy is not to assume any geographical support in the RDF backend and to perform the selection on the (lat,long) values, which leads to the following SPARQL 1.1 text:
select ?f as ?facet count(?s) as ?cnt
where { ?s <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?a;
Typically, RDF stores will evaluate this query by using range scans on the POS or OPS index for the latitude and longitude predicates respectively, and intersect the resulting triple streams on subject. This means that if (say) the query selects 1/10 of the full latitude range and 1/10 of the full longitude range, and hence (say) 1/100 of the total database, the intermediate result before the intersection is in the range of 1/10 of the dataset. Hence, it is 10x larger than strictly necessary. Still, this approach is simple and portable (it will work on any SPARQL 1.1 backend).
The second query in each step is the Instance Retrieval Query or the Instance Aggregation Query. We start with the Instance Retrieval Query. This query retrieves all the facet instances inside (or overlapping with) the query window, for four chosen facets.
The map displayed by the Linked Geodata Browser shows markers for all instances of the selected facets. To render a screen, the benchmark will always select 4 facets, so there are four different FACET parameters, FACET1, FACET2, FACET3, FACET4:
sparql select ?s as ?instance ?f as ?facet ?a as ?lat ?o as ?lon
Arguably, the selection on any of the four facets could also be done in a filter; however, it is believed that the current syntax and the one with disjunctive expressions would usually lead to the same physical query plan anyway. It should be noted that, if desired, such an alternative yet equivalent query syntax would be permissible in a LOD2 GeoBench result.
2.3.2 RTree and RTree++ Implementations
If an RDF database system supports efficient evaluation of geographical predicates (e.g. by creating an RTree index in advance), this is very relevant for the LOD2 GeoBench. We allow reasonable query variants: for instance, if the RDF database system being tested has specific geographic support, this can be used.
For instance, Virtuoso v6 provides RTree-based indexing that allows testing spatial intersection within a radius. It is possible to draw a circle around the query window and use the radius of this circle and the center point of the window in this syntax. This was the first RTree syntax variant implemented by the LOD2 GeoBench (in v1.0) and therefore carries the name “rtree”:
select ?f as ?facet count(?s) as ?cnt
where { ?s <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?a;
This allows direct translation of the LOD2 GeoBench window queries into a geographical predicate. Note that the previous query for Virtuoso v6 would combine a query with a radius (circle query) with a subsequent (lat,lon) filter. In the LOD2 GeoBench, this direct BOX comparison, supported from v2.0 on, is denoted “rtree++”.
only RDF database technology, but also the application design itself. Taking the analogy of Google Maps, one can be assured that rather than querying a single data collection for all zoom settings, the result screens are rendered from a (pre-generated) separate dataset for each different zoom level. Even though Google Maps likely does not rely on relational database technology, this approach would be like having different tables store the geographical data of the various zoom levels. The advantage is that these tables can be designed such that when the zoom window is very large (low zoom level), irrelevant data that would be too big to show is pruned, or frequency counts are summarized (e.g. keep the number of lampposts in Germany for each zipcode, rather than all individual lampposts). This way, these lower zoom levels have to operate on much less data, allowing the application to always exhibit interactive performance.
The quad approach, described here, is formally not a valid implementation of the LOD2 GeoBench, as it will provide slightly incorrect query answers, but it has the potential to achieve much better performance, with only a minor quality reduction in the query answers provided. Its performance can be measured with the LOD2 GeoBench.
The main idea is to create additional indexing triples that (i) accelerate geospatial data access at multiple zoom resolutions, even on systems that do not provide specific geospatial support, and (ii) precompute certain subquery results in order to accelerate query results, for all three types of queries (facet count, instance, and instance aggregation).
QuadTiles. The geospatial acceleration comes from partitioning the 2D space according to QuadTiles, which is a Z-ordering of the (LONGITUDE, LATITUDE) space into 32-bit numbers, where LONGITUDE and LATITUDE get discretized from their normal double precision ranges [-180,180] resp. [-90,90] to the short integer range [0,65535]. The pictures below, from the OpenStreetMap wiki, illustrate this:
QuadTile annotations can be exploited by adding extra RDF triples that annotate a subject that has a geography with those rectangles it overlaps with (one QuadTile triple for each). Each such annotation for a geographical subject would add one triple with a property, e.g. http://linkedgeodata.org/intersects/quadtile, and a value which would be the integer QuadTile number. It should be noted that this works fine for points, but large polygons might need many triples if their surface is large. In the OSM core dataset, this does not seem to be an issue, though.
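A QuadTile number for a point can be computed along the following lines. This Python sketch is illustrative only: the function names are ours, and the bit order (the longitude bit placed above the latitude bit in each pair) is an assumption about the interleaving, not something the text specifies.

```python
def discretize(value, lo, hi, bits=16):
    """Map a coordinate in [lo, hi] to an integer cell in [0, 2**bits - 1]."""
    cells = 1 << bits
    return min(int((value - lo) / (hi - lo) * cells), cells - 1)

def interleave(x, y, bits=16):
    """Z-order: interleave the bits of x and y (x bit above y bit in each pair)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z

def quadtile(lon, lat):
    """32-bit QuadTile number of a (longitude, latitude) point."""
    return interleave(discretize(lon, -180.0, 180.0), discretize(lat, -90.0, 90.0))

def truncate_tile(tile, bits):
    """Keep only the top `bits` bits of a 32-bit QuadTile number (coarser tile)."""
    return tile >> (32 - bits)
```

Truncating the low bits yields the number of the enclosing coarser tile, which is what makes the same numbering usable at several zoom resolutions.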
So we have a 52-bit approach with a 32-bit QuadTile number in the minor bits and a 20-bit facet integer in the major bits. We baptize these combinations of QuadTile and facet numbers “FacetTiles”. In case of an equi-selection on FACET such as found in the Instance Queries, the number ranges will have the same major bits (facet part) in the Min and Max values of all ranges and only vary in the lower bits (QuadTiles). Hence, in such situations, FacetTiles share with QuadTiles all their nice geospatial locality aspects.
It is relatively easy to map a geospatial query window into a (series of) range restrictions on the QuadTile numbers. This usually gives a limited number of conjunctive ranges, but still it is often a good idea to use the SPARQL 1.1 subquery feature and enclose in the SPARQL query a subquery that simply has a single range consisting of the MIN and MAX value of the multiple ranges we are after. This idea of presenting a basic query with only one selection range is a workaround for weaknesses in SPARQL query optimizers, which would otherwise not recognize the opportunity to use the POS index on the http://linkedgeodata.org/intersects/facettile property. Similarly, given that we query for four FACETs, it may work best to use the above query model to retrieve all data for one facet, and write a query that unions four such subqueries.
Note that in principle, given that the Instance Query is used on the high zoom levels only, where result sets are not very large, this will lead to four local index lookups in the POS index. This may work better than a normal RTree would, because in the RTree one would have all instances of all facets, not only the four facets of interest. This means that an RTree selection query will find only a low percentage of the data in the leaf nodes it visits to be relevant for the query. One would need a kind of partitioned RTree (partitioned on facet) to get the same kind of locality as FacetTile. An example Instance Retrieval Query is below, shortened by having it only query two facets (http://linkedgeodata.org/ontology/Village, http://linkedgeodata.org/ontology/Supermarket) rather than four:
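The FacetTile packing and the single MIN/MAX range can be sketched in Python as follows. The helper names, the integer facet identifier and the bit order of the Z-ordering are assumptions made for illustration. The sketch relies on the fact that a Z-order number never decreases when either coordinate increases, so the lower-left and upper-right corners of the window bound every tile inside it.

```python
def discretize(value, lo, hi, bits=16):
    """Map a coordinate in [lo, hi] to an integer cell in [0, 2**bits - 1]."""
    cells = 1 << bits
    return min(int((value - lo) / (hi - lo) * cells), cells - 1)

def quadtile(lon, lat, bits=16):
    """32-bit Z-order number (assumed order: longitude bit above latitude bit)."""
    x = discretize(lon, -180.0, 180.0, bits)
    y = discretize(lat, -90.0, 90.0, bits)
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z

def facettile(facet_id, tile):
    """52-bit FacetTile: 20-bit facet in the major bits, 32-bit QuadTile below."""
    if not (0 <= facet_id < (1 << 20) and 0 <= tile < (1 << 32)):
        raise ValueError("facet must fit in 20 bits and tile in 32 bits")
    return (facet_id << 32) | tile

def facettile_range(facet_id, lon_min, lat_min, lon_max, lat_max):
    """Single (MIN, MAX) FacetTile range covering the whole window for one facet."""
    return (facettile(facet_id, quadtile(lon_min, lat_min)),
            facettile(facet_id, quadtile(lon_max, lat_max)))
```

The single range is a superset of the window: it also covers tiles outside the rectangle, so the exact window test still has to be applied to the rows it returns.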
select ?s as ?instance ?f as ?facet ?a as ?lat ?o as ?lon
where
{ #where-start
{ #union-start
{ #subquery-start
select ?s <http://linkedgeodata.org/ontology/Village> as ?f
The Facet Count Query, as said, does not have locality on facet, so it can better exploit a TileFacet numbering than a FacetTile numbering. We can thus also add TileFacet annotations to all instances they intersect with. This speeds up the query, certainly on systems without built-in geospatial support (RTrees), as the geographical predicate can now be translated into a range restriction that will work well on a POS index. Furthermore, we could pre-aggregate the retrieved tuples on the facet number (lower bits) before even joining them to other triples.
However, especially at the lower zoom levels, where areas the size of Germany fall in the visible window, such queries will have to aggregate hundreds of thousands of triples, even at the smallest SF=1, and linearly more at higher scale factors. Aggregating this much data, even if delivered fast by a POS index, is still heavy CPU work that can take several seconds at least and that will make this query non-interactive at higher scale factors.
Therefore we do not add TileFacet annotations to instances, but use pre-computation for the Facet Count Query. We do this at various resolutions in the range of 12-26 bits, because the lowest zoom level selects 1/40^2 of the data (roughly 2^6, so corresponding to 6 bits for both dimensions, so 12 bits), whereas the deepest zoom level is 7 steps deeper, so at 26 bits. Hence, we propose TileFacet count pre-computation at 7 granularities: 12, 14, 16, 18, 20, 22 and 24 bits.
It is now a matter of determining a proper bit granularity for evaluating a query, depending on the zoom level. A good heuristic is to use the lowest granularity level at which at least one tile is fully enclosed by the query window (and if no such level exists, use the highest bit granularity), and then translate the window selection predicate into a series of range predicates on TileFacets, like before.
The extra triples we keep hold the pre-computed counts at the various resolutions for any rectangle for each TileFacet at that resolution (e.g. 16 bits). For all facet instances, we generate two triples with a subject in the form of http://linkedgeodata.org/facetcount/0000XXXXXX and as:
property http://linkedgeodata.org/facetcount/tilefacet16, with as value its TileFacet number, with the QuadTile number part truncated to 16 bits in this case. This thus represents a certain rectangle in the 2D space.
property http://linkedgeodata.org/facetcount/count, and as value the number of occurrences of a facet. Note that we only need to generate http://linkedgeodata.org/facetcount triples for facets that have a non-zero count in a certain rectangle. As such, the number of these pre-computed triples is always significantly lower than the number of TileFacet annotations we added before.
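The granularity heuristic can be sketched in Python as follows. This is our reading of the rule, not the query generator's code: a granularity of b bits is taken to mean b/2 bits per dimension, tiles are aligned to the [-180,180] by [-90,90] grid, and the function names are hypothetical.

```python
import math

GRANULARITIES = (12, 14, 16, 18, 20, 22, 24)

def encloses_tile(lo, hi, origin, tile):
    """True if some grid-aligned tile of the given size lies fully inside [lo, hi]."""
    first = math.ceil((lo - origin) / tile)  # first tile boundary at or after lo
    return origin + (first + 1) * tile <= hi

def pick_granularity(lon_min, lat_min, lon_max, lat_max):
    """Coarsest granularity with a fully enclosed tile, else the finest one."""
    for bits in GRANULARITIES:
        per_dim = 1 << (bits // 2)
        if (encloses_tile(lon_min, lon_max, -180.0, 360.0 / per_dim)
                and encloses_tile(lat_min, lat_max, -90.0, 180.0 / per_dim)):
            return bits
    return GRANULARITIES[-1]
```

For a zoom level 0 window of 9 by 4.5 degrees, the 12-bit tiles (5.625 by 2.8125 degrees) already fit inside the window, so the coarsest level is selected and only a handful of pre-computed tiles need to be read.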
Property http://linkedgeodata.org/facetcount/facet stores the facet URI (i.e. the http://www.w3.org/1999/02/22-rdf-syntax-ns#type value). It could be derived from the tilefacet16 number, but having this as a triple simplifies application development.
The Facet Count Query can now be formulated by selecting all tiles at some granularity that overlap with the query window, and summing up these pre-computed counts. Here is an example:
select ?f as ?facet xsd:integer(sum(?c * 0.512)) as ?cnt
The downside of this approach is that the facet counts provided will be an overestimation of the real facet counts, since the tiles from which the precomputed counts originate may (will) extend beyond the visible window. Users may tolerate such inaccuracies, but especially for the lower counts they might be annoying. One could envision a system that, when a user wants the real count for an infrequent facet, computes the exact value (with a separate query exploiting the FacetTile annotations, as in the previous section).
The current query generator tries to correct for overestimating by normalizing the precomputed result to the size of the query box, by dividing by the size of the box used for answering the query (which is equal or larger). In the above example, this leads to the 0.512 constant in the first line, as only slightly over half of the re-used precomputed results is inside the query box.
The problem of large query windows at low zoom levels also occurs in the Instance Aggregation Query. Recall that this query tackles the information overload problem of way too many markers by combining markers that are near to each other into a single marker, and visualizes a count of how many instances fall under it. Similar to pre-computing counts per tile, we observe that this aggregation per facet per tile can also be pre-computed. Note that here we again need markers for only a few facets, so using the FacetTile numbers works best. Since the Instance Aggregation Query is only used at the lower zoom levels, we can just index this at granularities 12, 14, 16 and 18 bits. Thus, for each tile at all granularities (e.g. 16 bits) in which a facet occurs at least once, we generate an artificial new subject http://linkedgeodata.org/facetmap/0000YYYYYYYY in three triples with as:
property http://linkedgeodata.org/facetmap/facettile16 and as value its FacetTile number (identifying a rectangle in which the clustered marker lies).
properties holding the position, http://linkedgeodata.org/facetmap/latitude and http://linkedgeodata.org/facetmap/longitude, of the marker.
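The normalization constant can be derived along the following lines. This Python sketch is an assumption about how such a factor could be computed (area of the query box divided by the area of the tile-aligned box that covers it); the function names are ours, and truncating to an integer mirrors the xsd:integer cast only approximately.

```python
import math

def covering_interval(lo, hi, origin, tile):
    """Smallest grid-aligned interval of whole tiles that contains [lo, hi]."""
    return (origin + math.floor((lo - origin) / tile) * tile,
            origin + math.ceil((hi - origin) / tile) * tile)

def correction_factor(lon_min, lat_min, lon_max, lat_max, bits):
    """Query box area divided by the area of the tiles used to answer it (<= 1)."""
    per_dim = 1 << (bits // 2)
    x0, x1 = covering_interval(lon_min, lon_max, -180.0, 360.0 / per_dim)
    y0, y1 = covering_interval(lat_min, lat_max, -90.0, 180.0 / per_dim)
    window_area = (lon_max - lon_min) * (lat_max - lat_min)
    return window_area / ((x1 - x0) * (y1 - y0))

def corrected_count(precomputed_counts, factor):
    """Scale the summed pre-computed counts and truncate to an integer."""
    return int(sum(c * factor for c in precomputed_counts))
```

The correction assumes that instances are spread evenly over the covering tiles, which is why the result remains an estimate rather than an exact count.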
property http://linkedgeodata.org/facetmap/count, and as value the number of occurrences of a facet in that 16x8 cell inside the tile. Again we only add such pre-computed triples if a facet occurs in a certain cell, so the number of generated http://linkedgeodata.org/facetmap/ triples is significantly lower than the number of TileFacet annotations we added before.
An implementation of the Instance Aggregation Query exploiting these pre-computed triples first chooses an appropriate bit granularity for the zoom level. Then all above tiles that overlap with the query window are fetched; next the real (latitude, longitude) values from the example markers in them are fetched and filtered again with the query window. This map is then presented. Just like in the case of the Facet Count Query, the problem was having to aggregate hundreds of thousands of instances; because we use pre-aggregated data, this is guaranteed to be avoided, as any tile contains at most 128 points and we access only a few tiles.
An example Instance Aggregation Query is below, shortened by having it only query two facets (http://linkedgeodata.org/ontology/Village, http://linkedgeodata.org/ontology/Supermarket) rather than four:
select ?f as ?facet ?latlon ?cnt
where
{ #where-start
{ #subquery-start
select ?f ?x ?y max(concat(xsd:string(?a)," ",xsd:string(?o))) as ?latlon count(*) as ?cnt
where
{ #where-start
{ #subquery-start
select ?f ?a ?o xsd:integer(20*(?a - 43.5141)/4.5) as ?y
SincethequerywindowwillnotperfectlyalignwithQuadTilesboundariesattheresolutionused,andforthemarkercombinationinthepre‐computedtilesweuselesscells(16x8;becauseaquerywillbeanswered frommultiplecells), theclustercombinationwillgivedifferentresults thantheofficialLOD2 GeoBench Instance Aggregation Query, even if we later re‐aggregatemarkers on the desired40x20grid.Fortheuserexperience,theeffectofthisislikelytobeofminorimportance.
SCILENSisanewkindofhardwareclusterthathasbeendesignedfromthegrounduptoservelarge‐scaledatamanagement. Themachines in the SCILENS cluster areorganized in threedifferentlevels,called‘pebbles’,‘rocks’,and‘bricks’.Eachleveldecreasesinamountofnodesbuttheindividualmachinesusedinthelevelincreaseincomputationalanddiskresources(andpricetag).TheSCILENSclusterusescheapconsumerhardware,optimizedtopackasmuchpowerinaslittlespace,makinguseofconsumerhome‐threatermini‐PCcases(‘Shuttlebox’),connectedbyhigh‐performanceInfinibandnetwork.
Due to the negative performance impact of network traffic during SPARQL query processing on large clusters (where joins tend to be 'communicating' joins in which all machines need to exchange data, and where network usage volume increases super-linearly with more nodes), it is generally better in RDF stores to work with fewer nodes with more (RAM) resources than with many nodes with few resources. Thus, we chose as our experimental platform the 'bricks' layer of SCILENS, which consists of sixteen 256GB RAM machines, each with 16 cores running at 2.4GHz (dual-socket Intel servers, worth $8K). The cluster runs Fedora Linux. The price tag of the eight machines involved in the experiments, including the InfiniBand network infrastructure, is roughly $100K.
The SCILENS cluster contains many more I/O resources per CPU core than is usual in compute clusters. The relation between CPU power and I/O resources is captured by the Amdahl number. This number is the amount of I/O bytes per core CPU cycle the system can deliver. In the case of the SCILENS cluster this number is close to 1.0, whereas typical clusters at supercomputing facilities, such as LISA at SARA, only get to 0.2 (1 byte per 5 cycles). We do confess that, while all this I/O power is interesting, in the workloads presented so far most data is RAM resident. One reason was that the high-performance multi-SSD I/O subsystem of the bricks layer was not yet operational at the time of testing. This provides ground for a follow-up experiment using this fast I/O layer. We expect this to accelerate the load phase, and also to allow even larger data sets to be addressed efficiently on the same hardware.
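As a small worked example of the definition above, the Amdahl number can be computed as delivered I/O bytes per second divided by total core cycles per second. The sketch below assumes this reading of "I/O bytes per core CPU cycle"; the function names are hypothetical, and the resulting bandwidth figures are only what the definition implies for a 16-core 2.4GHz machine, not measurements from this report.

```python
def amdahl_number(io_bytes_per_sec: float, cores: int, clock_hz: float) -> float:
    """I/O bytes delivered per core CPU cycle (assumed reading of the definition)."""
    return io_bytes_per_sec / (cores * clock_hz)


def io_bytes_needed(target_amdahl: float, cores: int, clock_hz: float) -> float:
    """I/O bandwidth (bytes/s) a machine needs to reach a given Amdahl number."""
    return target_amdahl * cores * clock_hz


# A 16-core machine at 2.4 GHz executes 38.4e9 core cycles per second.
# Amdahl number 1.0 would require about 38.4 GB/s of I/O,
# while 0.2 (1 byte per 5 cycles) corresponds to about 7.68 GB/s.
needed_at_1_0 = io_bytes_needed(1.0, cores=16, clock_hz=2.4e9)
needed_at_0_2 = io_bytes_needed(0.2, cores=16, clock_hz=2.4e9)
```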
OWLIM-SE v5.3: we used the non-cluster version of Ontotext's OWLIM, which efficiently supports geographical querying, as it stores geographic features in an RTree. OWLIM 5.3 with geographic extension is proprietary software, but we have no hard information on the cost of OWLIM at the time of preparation of this document, so we omitted the scores per $ for this system.
Figure 2: The 'rocks' and 'pebbles' layers of the SCILENS cluster are hand-built from 384 Shuttle boxes, packing CPU and ample disk resources in little space.
Virtuoso V6 open source is still the most widely used RDF store around (V7 open source binary builds have only been distributed since August 2013). This OpenLink product has specific support for geographical predicates, albeit somewhat limited. As discussed, direct BOX (rectangular window) selections on latitude, longitude are not possible, so we use the RADIUS pre-filtering approach.
Virtuoso V7 open source: this major new release has been strongly influenced by the LOD2 project, wherein CWI advised OpenLink on the introduction of numerous architectural enhancements. Specifically, V7 introduces columnar storage for RDF triples as well as vectorized execution, patterned after CWI research database system prototypes. Virtuoso V7 was released in 2013 and generally offers significant storage savings, reduced memory usage, and improved computational performance over V6.
Virtuoso V7 Cluster Edition: this major new release of the (non open-source) cluster edition has been documented in D2.1.6 and introduces a new vectorized cluster-based execution paradigm that allows any (complex) SPARQL query to be parallelized over a cluster of compute nodes. As a result, it can handle complex SPARQL queries, such as the Business Intelligence workload of BSBM, but also the LOD2 GeoBench, much more efficiently than the cluster version of V6 ever did (if V6 could handle them at all). The monetary cost at the time of writing of the enterprise version for department servers is $25K.
the data size. These triples are then bulk loaded into the systems. In the case of OWLIM, after bulk loading, the RTree geographical index needs to be created. The time needed for this is included in the table below (and is always a small part of the real load time).
Loading Virtuoso6 was done using a single loading process, since parallel loading would consistently hang the system. As already mentioned, loading the scaled data sizes also hit an error message in the RTree loading code (a bug), even in single-threaded mode, which prevented us from testing Virtuoso V6 on the larger data sizes.
Virtuoso7 used the native parallel loading procedure, running 14 loading processes in parallel. Loading Virtuoso V7 Cluster Edition was done by running 2 processes per node, giving 32 loading processes in total (2 processes x 2 nodes/machine x 8 machines).
The results at SF1 have the approximate implementation "quad" in front, especially in Virtuoso V7; however, the improved RTree support "rtree++" comes quite close. Notably, high-zoom quad queries worked better in V6, so there seems to be either an optimizer performance regression, or an undesired effect of the vectorized columnar execution in V7. Because the low-zoom (compute-intensive) quad queries are much faster on V7, its overall score is higher.
Another interesting comparison is the impact of the LOD2 R&D activities in the past few years, at least in this benchmark, for Virtuoso versus its strongest competitor, OWLIM. Whereas OWLIM was generally equivalent to or faster than Virtuoso V6 (compare owlim rtree++ with V6 rtree), the rtree-based score in Virtuoso improved by a factor of 7 from V6 to V7, creating a significant performance advantage. Besides improvements to the RTree functionality, this is very likely caused by the columnar vectorized execution model that Virtuoso adopted in V7, inspired by CWI research in that area.
Moving to SF10, though throughput drops by a factor of 3, we see the relative advantage of the quad approach improve dramatically:
Whereas the previous experiments used a single server, we now move to results obtained with 8 servers. One strategy that is applicable to any technology in a read-only workload like this is to replicate the database on multiple servers and divide the queries among them. We keep using just 8 query streams, such that each server gets a single stream. In order to use all cores, the systems must now parallelize the individual queries. This explains the lack of linear scale-up in all systems. Note that replication still is a very powerful technique in heavy read-only workloads: we can safely expect that when using 8 replicated servers with 64 concurrent query streams, the results will be 8-fold those in Figure 4 (for example, 8*virtuoso7 would then reach a throughput score of 128 instead of just 12).
We also tested the "true" cluster database system provided by OpenLink, i.e. Virtuoso V7 Cluster Edition. Data here is not replicated on all servers, but partitioned among them at load time. This causes all queries to be parallelized, which explains the superior scores (with quad overall being the best) obtained on this system under light load. Bearing in mind the theoretical peak of at least 128, the platform utilization of v7cluster, at a score of 15, can still be significantly improved.
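The gap between the observed cluster score and the theoretical peak can be put into numbers with a short calculation. The sketch below assumes ideal linear scale-up under replication; the per-server score of 16 is inferred from the quoted peak of 128 on 8 servers rather than stated directly, and the function names are hypothetical.

```python
def replicated_peak(single_server_score: float, servers: int) -> float:
    """Throughput expected from ideal replication (assumes linear scale-up)."""
    return single_server_score * servers


def utilization(observed_score: float, peak_score: float) -> float:
    """Fraction of the theoretical peak that was actually reached."""
    return observed_score / peak_score


peak = replicated_peak(16, servers=8)   # 128, the peak quoted in the text
share = utilization(15, peak)           # v7cluster reaches about 12% of it
```

On these numbers the cluster edition under light load uses roughly one eighth of what the hardware could deliver under heavy load, which is the headroom the text refers to.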
Overall, the absolute peak throughput drops from 50 PagePerSec at SF1 to 15 PagePerSec at SF10. This could partly be explained by the loss of data locality at SF10, but it could also indicate some query optimization problems. Namely, in the Virtuoso V7 quad implementation, the plans do not have a heavy computational load (as this has been precomputed) and in principle the complexity of all queries should be logarithmic in the data size. In this sense, a drop by a factor of 3 may indicate that the query optimizer does not find the optimal plans yet.
Finally, we computed the PagePerSec/K$ score for all benchmarked products, both single-server and 8-node cluster setups, excluding OWLIM 5.3 (for which we lack pricing information). It turns out that
The last overall benchmark results are the scores at SF100. Here, the trends continue, though performance of the different query variants is stable and the different results are nearer to each other.
SF1#8/1: At SF1 on a single server with 8 concurrent query streams (which should keep at least 8 cores busy), the results show that for Virtuoso V7 the highest performance is achieved with the new RTree functionality (rtree++); however, the performance improves linearly with higher zoom level, and is quite poor at the lower zoom levels. At the lower zoom levels, the approximate "quad" approach is much better. Interestingly, V6 achieves better performance than V7.
SF10#8/1: At SF10 on a single server with 8 query streams, the advantage of the quad approach in Virtuoso increases considerably. In contrast to SF1, at SF10 the quad performance on V7 considerably exceeds that of V7 rtree++ and of V6 quad, the latter because of improvements made in the query optimizer that favour the Instance Retrieval Query. At SF10, the rtree++ performance is no longer very good. This can be explained by the fact that the RTree stores instances belonging to all facets, not just the four facets required by the Instance Retrieval Query. At larger scale factors, the increased data size causes I/O to start playing a role. In this experiment at SF10, Virtuoso V6 could not be tested due to the data loading bug mentioned earlier.
In this experiment we also see Virtuoso V7 Cluster Edition results on eight identical machines. It is striking that the basic variant performs very close to rtree. If we further compare basic between the cluster (8 machines) and a single machine, scalability for low zoom levels is near linear (factor 8). These queries perform a lot of work, which gets parallelized. However, the gains at the high zoom levels, which access less data, are more limited. It is an open question why rtree does not provide much benefit in a cluster setting.
[Figure: per-step benchmark results at SF10 (x-axis: steps 1-12; y-axis range: 0-50). Series: owlim5.3 basic, owlim5.3 quad, owlim5.3 rtree++, v7cluster/8 basic, v7cluster/8 quad, v7cluster/8 rtree, virtuoso7 basic, virtuoso7 quad, virtuoso7 rtree, virtuoso7 rtree++.]
SF10#8/8: We now move to experiments at SF10 with 8 query streams on 8 servers. In the following we compare the Virtuoso V7 Cluster Edition approach with simple replication. The former, following a "true" cluster approach, partitions all data across all servers, hence each server stores 1/8th of the data, and queries get spread out over all servers (parallelized). Simple replication, in contrast, loads the same data independently on eight different machines, and then executes the 8-stream benchmark test by running the single-stream test independently on all 8 machines. As such, the results of this experiment are roughly 8-fold higher than the single-stream single-server test. Note that the hardware is more than 8 times as expensive ($8K vs $100K, due to the cost of the InfiniBand switch, one InfiniBand network card per server, and cabling). This added price difference makes the replication strategy less attractive in the PagePerSecond/$ scores, presented later. The replication experiments are marked with a star in the legend.
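The effect of the extra interconnect cost on price/performance can be illustrated with a short calculation. The sketch below takes the hardware prices quoted in the text ($8K for one server, $100K for the eight-machine setup) and assumes ideal 8-fold scale-up; it ignores software licence costs, and the function names are hypothetical.

```python
def pages_per_sec_per_kusd(pages_per_sec: float, cost_kusd: float) -> float:
    """Throughput per thousand dollars of hardware."""
    return pages_per_sec / cost_kusd


def replication_price_performance_ratio(servers: int = 8,
                                        single_cost_kusd: float = 8.0,
                                        cluster_cost_kusd: float = 100.0) -> float:
    """Score per K$ of the replicated setup relative to a single server.

    Assumes throughput scales exactly linearly with the number of servers,
    so the single-server throughput cancels out of the ratio.
    """
    single = pages_per_sec_per_kusd(1.0, single_cost_kusd)
    replicated = pages_per_sec_per_kusd(float(servers), cluster_cost_kusd)
    return replicated / single


ratio = replication_price_performance_ratio()  # about 0.64
```

Even under perfect scale-up, the replicated setup would deliver only about 64% of the single server's throughput per dollar, because the price grows 12.5-fold while throughput grows 8-fold.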
Replication: This experiment shows replicated OWLIM (owlim5.3*) to compete at the higher zoom levels thanks to its RTree support. The replicated Virtuoso V7 (virtuoso7*) with the quad approach scores high, though it is vulnerable in the Instance Retrieval Query at the lower zoom levels where it is used (steps 7-9) and when there are many query results. The overall winner in terms of performance is Cluster Edition with quads (the green dashes) thanks to more reliable performance at steps 7-9, even though it loses out to replication at steps 10-12.
In the experiments until now, we have shown the performance per step; however, please recall that each step is the combination of two queries. For steps 1-6, it is a Facet Count Query with an Instance Aggregation Query, and for steps 7-12 it is a Facet Count Query followed by an Instance Retrieval Query. Also, steps 6, 8, 10 and 12 just pan (to a partially overlapping area at the same level), whereas the other steps zoom in. In the above this is visible in steps 6, 8, 10 and 12 scoring above the two trend lines that one can construct in the step 1-6 and 7-12 segments.
Analysis of Individual Query Performance. However, it is also interesting to look at the individual queries. Each query stream consists of 24 queries, two per step: first the Facet Count Query, then the Instance Aggregation or Retrieval Query.
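The structure of a query stream as described here can be written down compactly. The following Python sketch only encodes what the text states (12 steps, two queries per step, Instance Aggregation for steps 1-6 and Instance Retrieval for steps 7-12, panning at steps 6, 8, 10 and 12); the tuple layout and the name `build_query_stream` are illustrative assumptions, not part of the benchmark driver.

```python
PAN_STEPS = (6, 8, 10, 12)  # these steps pan at the same level; the others zoom in


def build_query_stream(steps: int = 12):
    """Return the sequence of (step, action, query_type) tuples of one stream."""
    stream = []
    for step in range(1, steps + 1):
        action = "pan" if step in PAN_STEPS else "zoom"
        # Every step starts with a Facet Count Query ...
        stream.append((step, action, "FacetCount"))
        # ... followed by an aggregation (steps 1-6) or a retrieval (steps 7-12).
        second = "InstanceAggregation" if step <= 6 else "InstanceRetrieval"
        stream.append((step, action, second))
    return stream


stream = build_query_stream()
```

Separating the two query types per step in this way is what allows the per-query analysis that follows, since the step-level score alone hides which of the two queries is the bottleneck.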
[Figure: per-step benchmark results at SF100 on the 8-machine cluster (x-axis: steps 1-12; y-axis range: 0-12). Series: v7cluster/8 rtree, v7cluster/8 quad, v7cluster/8 basic.]
[Figure: Queries-Per-Second per step at SF1#8/1. Left panel: Instance Aggregation Query (steps 1-6) and Instance Retrieval Query (steps 7-12). Right panel: Facet Count Query. Series recovered from the legend (possibly incomplete): owlim5.3 basic, owlim5.3 quad, owlim5.3 rtree++, virtuoso6 basic, virtuoso6 rtree.]
The above figures show, at SF1#8/1, the Queries-Per-Second achieved by the Facet Count Query (right) and by the Instance Aggregation Query (steps 1-6) and the Instance Retrieval Query (steps 7-12) (left). If we examine the scale, at each step the Instance queries are the bottleneck. In fact, on Virtuoso7 the Facet Count Query does not need the quad approximation, as rtree++ is among the best. Here we confirm that the performance dip at steps 7-9 is caused by the Instance Retrieval Query. The reason for this is the large number of instances at these zoom levels. As such, these results point to the fact that in the benchmark the switch-over from Instance Aggregation to Instance Retrieval would be better made at a deeper zoom level.
At SF10#8/1 (above) the cost balance between Instance Queries (left graph) and Facet Count Queries (right graph) shifts, as they become more comparable. Still, the bottleneck is in the first three Instance Retrieval Queries (steps 7-9, left). It is remarkable in these results that for the Instance Retrieval Queries, owlim5.3 does a good job in steps 8-12 (left), in fact beating the Virtuoso7 rtree++ approach.
At SF100#8/8 the Facet Count Queries (right graph) and Instance Queries (left graph) have roughly the same cost. Here, the quad approach really wins in the Facet Count Query (right). The Instance Queries (left) generally have lower performance, especially in queries 7-12 (i.e., the Instance Retrieval Query).
The benchmark itself. The LOD2 GeoBench is a challenging benchmark; specifically, the Instance Aggregation and Retrieval Queries pose an intense workload to the system. We see that exact implementations (i.e. basic, rtree, rtree++ but not quad) have a hard time scaling the Instance Aggregation Query well at the higher zoom levels. We also see that the Instance Retrieval Query at the first zoom levels where it is used (7-9) causes a dip in performance, because such retrieval queries yield many instances and access many data pages in the database subsystem. On the one hand, this tells us that the benchmark is interesting. Publishing about this benchmark will put emphasis on finding better solutions to e.g. the Instance Retrieval Query, e.g. by pushing the envelope in query optimization. Further, the inherent problems at the lower zoom levels may help the RDF server vendors to provide better hooks to perform indexing and pre-computation. As ideas for a v3.0 of the benchmark, we should consider moving the switch-over point from Instance Aggregation Query to Instance Retrieval Query to a deeper zoom level. This would be a natural reaction in a real-life application to ensure dependable latencies across queries. Further, in the future we need to test on larger data, and with many more concurrent query streams. Finally, a better analysis of the performance stability of the results is needed. Because we are working on real data, the cardinalities of the selections are not fully predictable and can vary considerably, potentially introducing noise in the benchmark scores. This could be addressed by making the query generator even more intelligent in generating query patterns, so that it generates properly and evenly balanced parameter bindings.
The state of RDF database technology. The three rightmost result groups in Figure 3 are an example of the achievements in the LOD2 project, where academic research performed by CWI on columnar and vectorized query execution has measurably improved the performance of the OpenLink Virtuoso product from V6 to V7, in this case by a factor of 7, creating a competitive advantage. In general, geographical index technology is shown by the LOD2 GeoBench to be quite effective at the high zoom levels. The plans do show some unexpected results, with certain quad Virtuoso V7 query plans becoming slower than in V6, which is likely down to query optimizer issues. Query optimization remains one of the biggest challenges in SPARQL query execution, which in the LOD2 GeoBench shows in failures to properly handle the disjunctive queries (the four FACET selections) and the complex quad expressions.
Even though the thinking in the RDF community may be that RDF database support is closing in on the industry readiness of relational technology, the LOD2 GeoBench shows some very significant conceptual holes. For instance, in relational technology there are important physical design concepts, such as materialized views and clustered indexes, explicitly created for certain predicates. These concepts are not possible to express in the RDF world. For instance, in a multi-resolution map situation, a relational DBA or database designer would likely develop multiple tables at multiple resolutions, and create separate (RTree) indexes for these. Such tables, or materialized views, store precomputed expressions (like facet counts at a certain granularity). This means that queries at a low zoom level would only access the materialized view relevant to them, from which most of the detailed data (the individual lamp posts) could have been pruned. Accessing that materialized view through its RTree index will be efficient. If all materialized views for the different resolutions were unified into one big data structure, all information for other zoom levels would end up intermingled in the same disk blocks of the RTree, such that most of the data scanned would be irrelevant (because it belongs to a different resolution). This unifying of all data in one big bucket is what the RDF model does. What is needed are mechanisms to create materialized views (maybe by constructing derived data in a special kind of triple graph) and to allow certain indexes (such as RTree) to be built separately for such a triple graph. That way the RTree will only contain relevant information. Currently, RDF database technology does not offer such database design concepts.
RDF geographical browsing application design: faceted browsing on large datasets needs pre-computation. There is no way a Google Maps experience can be created straight from the raw base data (triples) in a dataset. The quad approach described and benchmarked here specifically transforms the application database needs in such a way that precomputation of expressions becomes possible. In this case, the quad approach precomputes facet instance counts for all tiles, at multiple different granularities. Queries then use these precomputed counts to avoid having to go to the base data. It cannot be stressed enough that without precomputation, queries at a high zoom level would never perform well, nor would they ever produce nice-looking results (just millions of lamp posts that cannot be sensibly drawn on a map). There is also little hope that such precomputation and indexing could be arranged fully automatically. This means that application designers need to take the database design issue very seriously.
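The idea of precomputing facet counts per tile at several granularities, and answering count queries from them without touching the base data, can be sketched as follows. This is an illustration only: the grid layout (`2**bits` subdivisions per axis over the whole globe), the key layout, and the names `tile_key`, `precompute_facet_counts` and `facet_count` are assumptions, and the actual quad approach stores its counts as RDF triples rather than in a Python dictionary.

```python
from collections import Counter


def tile_key(lat: float, lon: float, bits: int):
    """Tile (bits, row, col) of a point on a hypothetical 2**bits x 2**bits grid."""
    n = 1 << bits
    row = min(int((lat + 90.0) / 180.0 * n), n - 1)
    col = min(int((lon + 180.0) / 360.0 * n), n - 1)
    return bits, row, col


def precompute_facet_counts(instances, granularities):
    """One pass over the base data: count instances per (facet, bits, row, col)."""
    counts = Counter()
    for facet, lat, lon in instances:
        for bits in granularities:
            counts[(facet,) + tile_key(lat, lon, bits)] += 1
    return counts


def facet_count(counts, facet, bits, rows, cols):
    """Answer a count query from whole tiles only (hence approximate at the edges)."""
    return sum(counts.get((facet, bits, r, c), 0) for r in rows for c in cols)


lamp_posts = [("LampPost", 52.0, 4.0), ("LampPost", 52.001, 4.001),
              ("LampPost", -10.0, 100.0)]
counts = precompute_facet_counts(lamp_posts, granularities=(2, 8))
```

A query at a coarse zoom level reads a handful of counters at a small `bits` value instead of scanning every instance, which is why its cost no longer depends on how many lamp posts lie inside the window.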
The latest version of the LOD2 Geographical Browser adds significant new features that make the association between geographical information and RDF data flexible to specify. The older version, of which a screenshot is shown in Figure 1, just assumed that the geographical literal (point, polyline, polygon) would be a direct property of a facet instance subject. It is, however, also possible to associate facet instances over long(er) join paths to geographical literals. The consequence of such longer join paths is that geographical queries will experience less locality from the RTree join path, but more importantly, an interface where such join paths could be varied flexibly at run-time would make it much more difficult to generate materialized views (such as our pre-generated quad triples). Creating a Browsing Interface that flexibly allows users to specify these associations, yet renders result pages in interactive time on very large-scale data, is extremely challenging (if not impossible). Another issue is whether ordinary users, accessing (RDF) data via graphical interfaces, are looking for the flexibility to associate join paths through a complex data model that is likely unknown to them. It seems more probable that if relevant complex join paths between instances and their geography exist, it would be the task of an application designer to identify these. In such a case, the aforementioned materialized view functionality that is called for in RDF database systems would come in handy to pre-materialize these geographies as direct properties and accelerate them in separate RTree indexes.
Virtuoso7 (for both the single and the cluster version): We used a development version of OpenLink Virtuoso Universal Server: Version 07.00.3203-pthreads for Linux as of Aug 18 2013
5.2 Hardware
We used the CWI Scilens (www.scilens.org) cluster for the benchmark experiment. This cluster is designed for high I/O bandwidth, and consists of multiple layers of machines. In order to get large amounts of RAM, we used only the "bricks" layer, which contains its most powerful machines. The machines were connected by Mellanox MCX353A-QCBT ConnectX3 VPI HCA cards (QDR IB 40Gb/s and 10GigE) through an InfiniScale IV QDR InfiniBand Switch (Mellanox MIS5025Q). Each machine has the following specification.
Each database has a virtuoso.ini file as the configuration file. For the cluster version, in addition to the virtuoso.ini file, there are three other configuration files in each node: cluster.ini, virtuoso.global.ini, clusterglobal.ini.
- The virtuoso.ini file reads:
[Database]
for scale 10 with Virtuoso7 takes 1 h and 32 minutes.
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
02:30:24 PL LOG: No more files to load. Loader has finished,
02:31:02 PL LOG: No more files to load. Loader has finished,
02:31:58 PL LOG: No more files to load. Loader has finished,
02:32:29 PL LOG: No more files to load. Loader has finished,
02:33:47 PL LOG: No more files to load. Loader has finished,
02:36:08 PL LOG: No more files to load. Loader has finished,
02:39:10 PL LOG: No more files to load. Loader has finished,
02:40:21 PL LOG: No more files to load. Loader has finished,
02:40:21 PL LOG: No more files to load. Loader has finished,
02:45:59 PL LOG: No more files to load. Loader has finished,
02:46:06 PL LOG: No more files to load. Loader has finished,
02:47:06 PL LOG: No more files to load. Loader has finished,
02:47:39 PL LOG: No more files to load. Loader has finished,
03:10:47 PL LOG: No more files to load. Loader has finished,
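The quoted load time can be derived from such a log by taking the span between the first "Loader started" line and the last "Loader has finished" line. The Python sketch below does this for log lines in the format shown above; the function name `load_duration` is hypothetical, and it assumes that all timestamps fall within the same day, since the log carries no date.

```python
from datetime import datetime, timedelta


def load_duration(log_lines) -> timedelta:
    """Elapsed time from the earliest loader start to the latest loader finish."""
    starts, ends = [], []
    for line in log_lines:
        stamp, _, message = line.strip().partition(" PL LOG: ")
        if not message:
            continue  # not a loader log line
        t = datetime.strptime(stamp, "%H:%M:%S")
        if message.startswith("Loader started"):
            starts.append(t)
        elif "Loader has finished" in message:
            ends.append(t)
    if not starts or not ends:
        raise ValueError("log contains no complete load")
    return max(ends) - min(starts)


sample = [
    "01:38:19 PL LOG: Loader started",
    "02:30:24 PL LOG: No more files to load. Loader has finished,",
    "03:10:47 PL LOG: No more files to load. Loader has finished,",
]
elapsed = load_duration(sample)  # 1 h 32 min 28 s for the log above
```

The last loader finishes well after the others (03:10:47 versus 02:47:39 for the one before it), so the total load time is determined by the slowest of the parallel processes.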
V7cluster
- Bulk-loading was run with 2 loading processes in each node (thus, 32 loading processes in all 16 nodes). For example, bulk-loading for scale 100 with V7cluster takes 5 h and 11 minutes on the Master Node.
17:13:50 PL LOG: Loader started
17:16:07 PL LOG: Loader started
22:24:00 PL LOG: No more files to load. Loader has finished
22:24:02 PL LOG: No more files to load. Loader has finished
[duc@bricks13 data]$ du -s -h openrdf-sesame/repositories/olgeo1
24G openrdf-sesame/repositories/olgeo1
5.4.2 Bulk Load Script
Virtuoso6 and Virtuoso7
The bulk loading script for Virtuoso is applied on an empty database. First, register_load_files.sql is run to register the list of files to load. Then, the loading process is run using the script rdfload.sh. For Virtuoso7, 14 "rdf_loader_run()" calls were executed.
The dataset files are divided equally among the machines. On each machine, register_load_files_GEO.sql is used for registering the list of files to load on that machine.