Query Expansion Methods and Performance Evaluation for Reusing Linking Open Data of the European Public Procurement Notices Code: TSI-020100-2010-919 José María Álvarez Rodríguez WESO-Universidad de Oviedo http://purl.org/weso/moldeas/ Tecnologías de Linked Data y sus aplicaciones en España (TLDE) CAEPIA 2011-Tenerife (Spain) 8th of November, 2011
Page 1: WESO CAEPIA-20111108

Query Expansion Methods and Performance Evaluation

for Reusing Linking Open Data of the

European Public Procurement Notices

Code: TSI-020100-2010-919

José María Álvarez Rodríguez
WESO-Universidad de Oviedo
http://purl.org/weso/moldeas/

Tecnologías de Linked Data y sus aplicaciones en España (TLDE)
CAEPIA 2011-Tenerife (Spain)

8th of November, 2011

Page 2: WESO CAEPIA-20111108

Overview

Use case & Context

SPARQL & Performance

Next Steps

Page 3: WESO CAEPIA-20111108

Objective

Creation of a pan-European e-procurement platform

Page 4: WESO CAEPIA-20111108

Covering almost all public procurement notices of the European regions

Page 5: WESO CAEPIA-20111108

E-procurement Long Tail

TED

BOE (official bulletin of the Spanish Government)

BOPA (official bulletin of the Asturian Government)

Page 6: WESO CAEPIA-20111108

To be able to answer…

Which public procurement notices are relevant to Dutch companies (SMEs only) that want to tender for contracts announced by local authorities with a total value lower than 170K € to procure “Road bridge construction work” with a two-year duration in the Dutch-speaking region of Flanders (Belgium)?

Page 7: WESO CAEPIA-20111108

[Figure: architecture diagram. XML sources (TED, BOE, BOPA, …) are RDFized and linked with CPV, NUTS, Organizations and Eurovoc. The resulting Linked Data is exposed through an API and Pubby+Snorql, and feeds services (e.g. searching, matchmaking and prediction) built on semantic methods.]

1. Structuring public procurement notices

2. Transforming government classifications

3. LOD enrichment

4. Providing new semantic-based services

5. Easing the access to the published data using the LOD approach

Page 8: WESO CAEPIA-20111108

1, 2, 3 Preliminary Results

Information                                      Triples
Common Procurement Vocabulary (2003 and 2008)    ~300,00
Organizations                                    ~5,000,000
NUTS                                             36,219
Public procurement notices (2008-2011)           677,058; 2,398,601; 2,590,880; 402,264
Total                                            ~11 million RDF triples

Page 9: WESO CAEPIA-20111108

4 Semantic-based Services

Problem of «Query Expansion», depending on the kind of information variable

Page 10: WESO CAEPIA-20111108

4 Methods of «Query Expansion»

Expansion

Individual

Taxonomy-based

Directly

Syntactic Search

Spreading Activation

Recommending engine

Location

Georeasoning

User-based

Numeric

Fuzzy Logic

History-based

Correlation

Group

Recommending engine

Page 11: WESO CAEPIA-20111108

Remembering…

Which public procurement notices are relevant to Dutch companies (SMEs only) that want to tender for contracts announced by local authorities with a total value lower than 170K € to procure “Road bridge construction work” with a two-year duration in the Dutch-speaking region of Flanders (Belgium)?

Page 12: WESO CAEPIA-20111108

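Query…

[Figure: query graph centred on ?ppn, with these edges and values:
  ppn:nutsCode        NUTS-B3 (figure label: 300 RÉG. WALLONNE)
  cpv:CodeIn2008      cpv:45221111-3
  org:classification  SME
  ppn:hasAmount       170,000 €
  ppn:hasDuration     2 years
The label "NL" also appears next to cpv:45221111-3.]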

Page 13: WESO CAEPIA-20111108

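Applying Query Expansion…

[Figure: the query graph centred on ?ppn after expansion, with these edges and values:
  ppn:nutsCode        NUTS-B3, NUTS-NL326, NUTS-1025, NUTS-BE2
  cpv:CodeIn2008      cpv:45221111-3, cpv:45221110-6, cpv:45221113-7, cpv:45221114-4
  org:classification  SME
  ppn:hasAmount       170,000-200,000 €
  ppn:hasDuration     2-3 years
The label "NL" also appears next to cpv:45221111-3.]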

Page 14: WESO CAEPIA-20111108

4 Example of SPARQL query

SELECT DISTINCT * WHERE {
  ?ppn rdf:type <http://purl.org/weso/ppn/def#ppn> .
  ?ppn ppn:nutsCode ?nutsCode .
  ?ppn cpv:codeIn2008 ?cpvCode .
  ?ppn ppn:hasDuration ?duration .
  ?ppn dc:identifier ?id .
  ?ppn dc:date ?date .
  ?ppn ppn:hasAmount ?amount .
  FILTER(?cpvCode = cpv:45221111-3 ... ) .
  FILTER((xsd:double(?amount) >= xsd:long(170000)) && (xsd:double(?amount) <= xsd:long(200000))) .
  FILTER(?nutsCode = nuts:B3 ... ) .
  FILTER((xsd:long(?duration) >= xsd:long(2)) && (xsd:long(?duration) <= xsd:long(3))) .
}

Page 15: WESO CAEPIA-20111108

Context

Performance of SPARQL Queries

~30 sec.

Page 16: WESO CAEPIA-20111108

Hardware & Software

DELL PC, 2 GB RAM and 30 GB hard disk

VirtualBox (version 4.0.6)

Linux 2.6.35-22-server #33-Ubuntu 2 SMP x86_64 GNU/Linux

Ubuntu 10.10

OpenLink Virtuoso Opensource-6-20110218

Page 17: WESO CAEPIA-20111108

Question?

How can query execution time be decreased without modifying the hardware or using any vendor-specific feature?

Page 18: WESO CAEPIA-20111108

TripleStore

25 graphs

20 M RDF triples

But…

8 graphs

11 M RDF triples

Page 19: WESO CAEPIA-20111108

Focus on…

The generation of SPARQL queries

Page 20: WESO CAEPIA-20111108

Let’s start …

9 SPARQL Queries

3 executions

Page 21: WESO CAEPIA-20111108

Ti Simple Enhanced LIMIT FILTER GRAPHS Split Parallel Total queries

T1 * 1

T2 * * 1

T3 * 1

T4 * * 1

T5 * * * 1

T6 * * * * * 4

T6-1 * * * * * * 4

T7 * * * * 5

T7-1 * * * * * 5

T8 * * * * * 20

T8-1 * * * * * * 20

T9 * * * * * 15

T9-1 * * * * * 15

T10 * * * * * 60

T10-1 * * * * * * 60

Page 22: WESO CAEPIA-20111108

Simple SPARQL query

SELECT DISTINCT * WHERE {
  ?ppn rdf:type <http://purl.org/weso/ppn/def#ppn> .
  ?ppn ppn:nutsCode ?nutsCode .
  ?ppn cpv:codeIn2008 ?cpvCode .
  ?ppn ppn:hasDuration ?duration .
  ?ppn dc:identifier ?id .
  ?ppn dc:date ?date .
  ?ppn ppn:hasAmount ?amount .
  FILTER(?cpvCode = cpv:15331137) .
  FILTER(?nutsCode = nuts:UK) .
}

Page 23: WESO CAEPIA-20111108

1 Simple Query

1 CPV code, 1 NUTS code

Time: ~3.29 sec.

Page 24: WESO CAEPIA-20111108

T1

Rewrite SPARQL queries:

Match triples from specific to general

Filter as soon as possible
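A minimal sketch, in Python, of how a query generator might apply both rules. The function name `build_query` and its arguments are hypothetical and not taken from the slides; only the predicates, codes and the LIMIT value (the T2 setting) come from the deck. The caller supplies the triple patterns already ordered from most to least selective, and each FILTER is emitted right after the pattern that binds its variable.

```python
def build_query(patterns, filters, limit=None):
    """Assemble a SELECT query from (predicate, variable) pairs.

    patterns: list of (predicate, variable), in the order they should be matched.
    filters:  dict mapping a variable to the single value it must equal.
    limit:    optional LIMIT value appended to the query.
    """
    lines = [
        "SELECT DISTINCT * WHERE {",
        "  ?ppn rdf:type <http://purl.org/weso/ppn/def#ppn> .",
    ]
    for predicate, var in patterns:
        lines.append(f"  ?ppn {predicate} {var} .")
        if var in filters:
            # Filter as soon as the variable is bound.
            lines.append(f"  FILTER({var} = {filters[var]}) .")
    lines.append("}")
    if limit is not None:
        lines.append(f"LIMIT {limit}")
    return "\n".join(lines)


query = build_query(
    patterns=[
        ("cpv:codeIn2008", "?cpvCode"),
        ("ppn:nutsCode", "?nutsCode"),
        ("ppn:hasDuration", "?duration"),
        ("dc:identifier", "?id"),
        ("dc:date", "?date"),
        ("ppn:hasAmount", "?amount"),
    ],
    filters={"?cpvCode": "cpv:15331137", "?nutsCode": "nuts:UK"},
    limit=10000,
)
print(query)
```

Whether this ordering actually helps depends on the triple store's own optimizer, which may reorder patterns regardless.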

Page 25: WESO CAEPIA-20111108

T2

Use the LIMIT clause

Value set to 10,000

Page 26: WESO CAEPIA-20111108

Rewrite SPARQL query

SELECT DISTINCT * WHERE {
  ?ppn rdf:type <http://purl.org/weso/ppn/def#ppn> .
  ?ppn cpv:codeIn2008 ?cpvCode .
  FILTER(?cpvCode = cpv:15331137) .
  ?ppn ppn:nutsCode ?nutsCode .
  FILTER(?nutsCode = nuts:UK) .
  ?ppn ppn:hasDuration ?duration .
  ?ppn dc:identifier ?id .
  ?ppn dc:date ?date .
  ?ppn ppn:hasAmount ?amount .
} LIMIT 10000

Page 27: WESO CAEPIA-20111108

2 Results T2

1 CPV code, 1 NUTS code

Time: ~3.26 sec.

Page 28: WESO CAEPIA-20111108

Evaluation

There are no significant changes in execution time or gain…

and we are interested in “enhanced queries”

Page 29: WESO CAEPIA-20111108

T3

Execution of enhanced queries

Page 30: WESO CAEPIA-20111108

Enhanced SPARQL query

SELECT DISTINCT * WHERE {
  ?ppn rdf:type <http://purl.org/weso/ppn/def#ppn> .
  ?ppn ppn:nutsCode ?nutsCode .
  ?ppn cpv:codeIn2008 ?cpvCode .
  ?ppn ppn:hasDuration ?duration .
  ?ppn dc:identifier ?id .
  ?ppn dc:date ?date .
  ?ppn ppn:hasAmount ?amount .
  FILTER(?cpvCode = cpv:15331137 || ?cpvCode = cpv:48611000 || ?cpvCode = cpv:48611000 || ?cpvCode = cpv:50531510 || ?cpvCode = cpv:15871210) .
  FILTER(?nutsCode = nuts:B3 || ?nutsCode = nuts:PL || ?nutsCode = nuts:RO) .
}
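One way to express "any of these codes" in standard SPARQL is a disjunction of equality tests inside a single FILTER. The sketch below builds such a filter in Python; the helper name `equality_filter` is an assumption made for illustration, while the NUTS codes are the ones on the slide.

```python
def equality_filter(var, values):
    """Return a FILTER that accepts any of the given values for var."""
    alternatives = " || ".join(f"{var} = {value}" for value in values)
    return f"FILTER({alternatives})"


nuts_filter = equality_filter("?nutsCode", ["nuts:B3", "nuts:PL", "nuts:RO"])
print(nuts_filter)
```

The `||` form is used here because it is valid in both SPARQL 1.0 and 1.1; the `IN` operator would be shorter but needs SPARQL 1.1 support.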

Page 31: WESO CAEPIA-20111108

3 Results T3

5 CPV codes, 3 NUTS codes

1 query

Time: ~20.65 sec.

Page 32: WESO CAEPIA-20111108

T4

Rewrite SPARQL queries+

Use the LIMIT clause

Page 33: WESO CAEPIA-20111108

4 Results T4 wrt T3

5 CPV codes, 3 NUTS codes

1 query

Time: ~20.55 sec.

Page 34: WESO CAEPIA-20111108

Info

8 graphs

11 M of RDF Triples

Page 35: WESO CAEPIA-20111108

T5

Rewrite SPARQL queries+

Use the LIMIT clause+

Named Graphs (FROM)
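A small Python sketch of how FROM clauses might be added to a generated query so that only selected graphs are searched. The function name `with_named_graphs` and the graph IRIs are hypothetical; the slides do not give the names of the graphs in the triple store.

```python
def with_named_graphs(where_block, graph_iris, limit=10000):
    """Build a SELECT query whose dataset is restricted with FROM clauses."""
    from_clauses = "\n".join(f"FROM <{iri}>" for iri in graph_iris)
    return f"SELECT DISTINCT *\n{from_clauses}\nWHERE {where_block}\nLIMIT {limit}"


# Placeholder graph IRIs, for illustration only.
graphs = [
    "http://example.org/graph/ppn-2008",
    "http://example.org/graph/ppn-2009",
]
where = "{ ?ppn rdf:type <http://purl.org/weso/ppn/def#ppn> . }"
query = with_named_graphs(where, graphs)
print(query)
```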

Page 36: WESO CAEPIA-20111108

5 Results T5 wrt T3

5 CPV codes, 3 NUTS codes

1 query

Time: ~20.65 sec.

Page 37: WESO CAEPIA-20111108

T6

Rewrite SPARQL queries+

Use the LIMIT clause+

Named Graphs (FROM)+

Split into simple queries

Page 38: WESO CAEPIA-20111108

6 Results T6 wrt T3

5 CPV codes, 3 NUTS codes

4 graphs, 4 simple queries

Time: ~20.60 sec.

Page 39: WESO CAEPIA-20111108

T6-1

Rewrite SPARQL queries+

Use the LIMIT clause+

Named Graphs (FROM)+

Split enhanced query into simple queries+

Parallelization of query execution (ad-hoc map/reduce)
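The slides do not show how the ad-hoc map/reduce was implemented. The following Python sketch is one possible reading, under stated assumptions: the map step runs every simple query concurrently in a thread pool, and the reduce step merges the partial results while dropping duplicate rows. `run_parallel` is a hypothetical name, and `fake_run_query` is a stand-in for a real call to the SPARQL endpoint.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(queries, run_query, max_workers=4):
    """Map: run each query concurrently. Reduce: merge rows without duplicates."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the results in the same order as the input queries.
        partial_results = list(pool.map(run_query, queries))
    merged, seen = [], set()
    for rows in partial_results:
        for row in rows:
            if row not in seen:
                seen.add(row)
                merged.append(row)
    return merged


def fake_run_query(query):
    """Stand-in for an endpoint call: returns canned rows for each query."""
    canned = {
        "q1": [("ppn1",), ("ppn2",)],
        "q2": [("ppn2",), ("ppn3",)],
    }
    return canned[query]


result = run_parallel(["q1", "q2"], fake_run_query)
print(result)
```

Threads suit this case because each query spends most of its time waiting on the triple store rather than computing locally.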

Page 40: WESO CAEPIA-20111108

6-1 Results T6-1 wrt T3

5 CPV codes, 3 NUTS codes

4 graphs, 4 simple queries

Time: ~11.93 sec.

Page 41: WESO CAEPIA-20111108

T7

Rewrite SPARQL queries+

Use the LIMIT clause+

Split enhanced query into simple queries
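A hedged Python sketch of the splitting step as T7 appears to use it: one simple query per CPV code, each still carrying all the NUTS codes. The function name `split_by_cpv` and the CPV placeholders `cpv:A` to `cpv:E` are assumptions for illustration; the NUTS codes and the LIMIT value come from the slides.

```python
def split_by_cpv(cpv_codes, nuts_codes, limit=10000):
    """Return one simple query per CPV code, each filtering on all NUTS codes."""
    nuts_filter = " || ".join(f"?nutsCode = {code}" for code in nuts_codes)
    queries = []
    for cpv in cpv_codes:
        queries.append(
            "SELECT DISTINCT * WHERE {\n"
            "  ?ppn rdf:type <http://purl.org/weso/ppn/def#ppn> .\n"
            "  ?ppn cpv:codeIn2008 ?cpvCode .\n"
            f"  FILTER(?cpvCode = {cpv}) .\n"
            "  ?ppn ppn:nutsCode ?nutsCode .\n"
            f"  FILTER({nuts_filter}) .\n"
            f"}} LIMIT {limit}"
        )
    return queries


# Placeholder CPV codes; the NUTS codes are those of the enhanced query.
cpv_codes = ["cpv:A", "cpv:B", "cpv:C", "cpv:D", "cpv:E"]
nuts_codes = ["nuts:B3", "nuts:PL", "nuts:RO"]
queries = split_by_cpv(cpv_codes, nuts_codes)
print(len(queries))
print(queries[0])
```

Note that the LIMIT applies to each simple query separately, so the merged result can hold more rows than a single limited query would return.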

Page 42: WESO CAEPIA-20111108

7 Results T7 wrt T3

1 CPV code (5), 3 NUTS codes

5 simple queries

Time: ~15.81 sec.

Page 43: WESO CAEPIA-20111108

T7-1

Rewrite SPARQL queries+

Use the LIMIT clause+

Split enhanced query into simple queries+

Parallelization of query execution (ad-hoc map/reduce)

Page 44: WESO CAEPIA-20111108

7-1 Results T7-1 wrt T3

1 CPV code (5), 3 NUTS codes

5 simple queries

Time: ~10.55 sec.

Page 45: WESO CAEPIA-20111108

T8

Rewrite SPARQL queries+

Use the LIMIT clause+

Named Graphs (FROM)+

Split into simple queries

Page 46: WESO CAEPIA-20111108

8 Results T8 wrt T3

1 CPV code (5), 3 NUTS codes

4 graphs, 20 simple queries

Time: ~32.34 sec.

Page 47: WESO CAEPIA-20111108

T8-1

Rewrite SPARQL queries+

Use the LIMIT clause+

Named Graphs (FROM)+

Split enhanced query into simple queries+

Parallelization of query execution (ad-hoc map/reduce)

Page 48: WESO CAEPIA-20111108

8-1 Results T8-1 wrt T3

1 CPV code (5), 3 NUTS codes

4 graphs, 20 simple queries

Time: ~18.45 sec.

Page 49: WESO CAEPIA-20111108

T9

Rewrite SPARQL queries+

Use the LIMIT clause+

Split enhanced query into simple queries (1 CPV code + 1 NUTS code)

Page 50: WESO CAEPIA-20111108

9 Results T9 wrt T3

1 CPV code (5), 1 NUTS code (3)

15 simple queries

Time: ~22.62 sec.

Page 51: WESO CAEPIA-20111108

T9-1

Rewrite SPARQL queries+

Use the LIMIT clause+

Split enhanced query into simple queries (1 CPV code + 1 NUTS code)+

Parallelization of query execution (ad-hoc map/reduce)

Page 52: WESO CAEPIA-20111108

9-1 Results T9-1 wrt T3

1 CPV code (5), 1 NUTS code (3)

15 simple queries

Time: ~12.77 sec.

Page 53: WESO CAEPIA-20111108

T10

Rewrite SPARQL queries+

Use the LIMIT clause+

Named Graphs (FROM)+

Split into simple queries (1 CPV code + 1 NUTS code)
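The query counts reported for the split strategies are consistent with simple Cartesian products of the inputs: 5 CPV codes × 3 NUTS codes gives 15 queries, and adding 4 graphs gives 60. The Python sketch below only checks that arithmetic; the code and graph names are placeholders, and the slides do not confirm that the queries were enumerated this way.

```python
from itertools import product

cpv_codes = [f"cpv:code{i}" for i in range(1, 6)]  # 5 placeholder CPV codes
nuts_codes = ["nuts:B3", "nuts:PL", "nuts:RO"]     # 3 NUTS codes from the slides
graphs = [f"graph{i}" for i in range(1, 5)]        # 4 placeholder graph names

# One simple query per (CPV code, NUTS code) pair.
per_pair = list(product(cpv_codes, nuts_codes))
# One simple query per (graph, CPV code, NUTS code) triple.
per_graph_pair = list(product(graphs, cpv_codes, nuts_codes))
# One simple query per (graph, CPV code) pair, each keeping all NUTS codes.
per_graph_cpv = list(product(graphs, cpv_codes))

print(len(per_pair), len(per_graph_cpv), len(per_graph_pair))
```

The three counts printed match the totals given for T9, T8 and T10 respectively.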

Page 54: WESO CAEPIA-20111108

10 Results T10 wrt T3

1 CPV code (5), 1 NUTS code (3)

4 graphs, 60 simple queries

Time: ~71.63 sec.

Page 55: WESO CAEPIA-20111108

T10-1

Rewrite SPARQL queries+

Use the LIMIT clause+

Named Graphs (FROM)+

Split enhanced query into simple queries (1 CPV code + 1 NUTS code)+

Parallelization of query execution (ad-hoc map/reduce)

Page 56: WESO CAEPIA-20111108

10-1 Results T10-1 wrt T3

1 CPV code (5), 1 NUTS code (3)

4 graphs, 60 simple queries

Time: ~35.13 sec.

Page 57: WESO CAEPIA-20111108

Table of Results

Ti      Time (sec.)   Gain (%)
T1      3.29          N/A
T2      3.26          0.93
T3      20.65         N/A
T4      20.55         0.49
T5      20.65         0
T6      20.6          0.24
T6-1    11.93         73.09
T7      15.81         30.61
T7-1    10.55         95.73
T8      32.34         -36.15
T8-1    18.45         11.92
T9      22.62         -8.71
T9-1    12.77         61.71
T10     71.63         -71.17
T10-1   35.13         -41.22
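The slides do not state how the gain is computed. The reported values for T4 to T10-1 are consistent with gain = (T3 − Ti) / Ti × 100, where T3 is the enhanced-query baseline; this formula is inferred from the numbers, not given by the author. A short Python check on a few rows:

```python
def gain(baseline, time):
    """Relative gain in percent of a run against the baseline time."""
    return (baseline - time) / time * 100


T3 = 20.65  # baseline: enhanced query without any optimization
times = {"T4": 20.55, "T6-1": 11.93, "T7-1": 10.55, "T8": 32.34, "T10-1": 35.13}
for name, seconds in times.items():
    print(name, round(gain(T3, seconds), 2))
```

A negative gain means the run was slower than the baseline, as with T8, T9, T10 and T10-1.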

Page 58: WESO CAEPIA-20111108

Discussion

• The number of queries is a key factor

• More CPV codes imply more execution time

• Parallelization improves execution time

• T7-1 is the best execution in terms of time:

  • Rewrite SPARQL queries

  • Use the LIMIT clause

  • Split enhanced query into simple queries

  • Parallelization of query execution

Page 59: WESO CAEPIA-20111108

Further Steps

• Distribute graphs in different nodes (HW improvement)

• Use of other triple stores (SW comparison)

• Add SPARQL 1.1 new features (Expressiveness improvement)

• Cache of queries (SW improvement)

