Introduction to Big Data

Supported by EU projects

29/11/2013, Athens, Greece

Open Data for Agriculture

Joint offering by

Intro to Big Data

Antonis Koukourikos, NCSR “Demokritos”


Presentation Outline

• What is Big Data?

• Semantic Web Technologies

• What Semantic Web brings into the picture

WHAT IS BIG DATA?

Part 1


Big Data Is…

Data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it


Big Data Sources

• Biomedical Information

• Sensor Data

• Logs

• E-mails

• Satellite images

• Audio and Video Streams

• Social Networks


Big Data Challenges – “The Three Vs”

• Volume

• Velocity

• Variety

…or is it 4…?

• Veracity

…or is it 6…?

• Visualization

• Value


Big Data demand…

• Storage
  – Impractical or impossible to use centralized storage
    • Distribution
    • Federation
  – Indexing is a problem in itself

• Computational power
  – For discovering
  – For searching / retrieving
  – For joining

• Human effort and expertise
  – Querying can become complex
  – Are you sure you exploit all this information?

SEMANTIC WEB TECHNOLOGIES

Part 2


The Syntactic and the Semantic Web

• The World Wide Web represents information using natural language, graphics, multimedia...
  – Humans can process and combine this information easily
  – However, machines are ignorant!

• The Semantic Web is a Web with a meaning
  – A web of data that is understandable by machines


Semantic Web Technologies

• Common formats for integration and combination of data drawn from diverse sources, whereas the original Web mainly concentrated on the interchange of documents.

• For defining
  – RDFS http://www.w3.org/TR/rdf-schema/
  – OWL http://www.w3.org/TR/owl2-overview/

• For describing
  – RDF http://www.w3.org/RDF/

• For querying
  – SPARQL http://www.w3.org/TR/2013/REC-sparql11-query-20130321/

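To make the roles of RDF (describing) and SPARQL (querying) more concrete, here is a minimal sketch in plain Python. It is not a real RDF library: data is held as subject-predicate-object triples, and a query is a triple pattern in which None stands for a variable. The example.org URIs and the `match` helper are assumptions made purely for illustration.

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
# All example.org URIs below are made up for illustration.
EX = "http://example.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

triples = {
    (EX + "report42", RDF_TYPE, EX + "Report"),
    (EX + "report42", EX + "topic", EX + "CropYield"),
    (EX + "dataset7", RDF_TYPE, EX + "Dataset"),
    (EX + "dataset7", EX + "topic", EX + "CropYield"),
}

def match(pattern, store):
    """Return all triples matching a pattern; None acts as a variable."""
    return sorted(
        t for t in store
        if all(p is None or p == v for p, v in zip(pattern, t))
    )

# Roughly what SPARQL expresses as:
#   SELECT ?s WHERE { ?s ex:topic ex:CropYield }
about_crop_yield = [
    s for s, _, _ in match((None, EX + "topic", EX + "CropYield"), triples)
]
print(about_crop_yield)
```

A real deployment would use an RDF store and a SPARQL engine; the sketch only shows why a uniform triple model makes data from different sources easy to combine and query.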


What SW can do

• Handle heterogeneity

• Handle evolution / variability

• Elicit inferred knowledge

• Volume is still the challenge

WHAT SEMANTIC WEB BRINGS IN THE BIG DATA PICTURE

Part 3


Moving Forward with “Old” Technologies

[Figure: two architecture diagrams for the case of agricultural infrastructures.

NOW (2012): OAI-PMH Service Providers #1 … #n (Schema #1 … #n), Harvester, Aggregated XML Repository, Indexer, Web Portals (Open AGRIS (FAO), AgLR/GLN (ARIADNE), Organic.Edunet (UAH), VOA3R (UAH), ...), with the AGRIS AP, IEEE LOM and DC schemas, ...

2015 (AgINFRA): SPARQL endpoints (Data Source #1 … #n), RDF Triple Store, Common Schema, Indexer, Web Portals, SPARQL endpoint.

Annotations: “How Many?”, “Is it feasible?”, “BigData Problem!”]


What Semantic Web can bring into the picture

• One Data Access Point for the entire Data Cloud
  – Enabling Service-Data level agreements with Data providers

• Application-level Vocabularies / Thesauri / Ontologies
  – Enabling different application facets for different communities of users over the SAME data pool

• Going beyond existing Distributed Triple Store Implementations
  – Link Heterogeneous but Semantically Connected Data
  – Index Extremely Large Information Volumes (Peta Sizes)
  – Improve Information Retrieval response

• Data (+Metadata) physically stored in Data Provider
  – No need for harvesting

• Vocabularies / Thesauri / Ontologies of Data Provider choice
  – No need for aligning according to common schemas


The SemaGrow Solution

• Use POWDER to mass-annotate large subspaces
  – Exploit naming convention regularities to compress the indexes used by the system

• Partition triple patterns in the original query

• Annotate each fragment with an ordered list of data sources most likely to contain relevant data

• Distribute and transform the query fragments

• Collect and align the results
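The steps above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the SemaGrow implementation: each triple pattern is treated as its own fragment, the source index and the in-memory "endpoints" are hypothetical stand-ins, and the query transformation step (schema mapping) is left out entirely.

```python
from collections import defaultdict

# Hypothetical index: predicate -> data sources ordered by estimated relevance.
SOURCE_INDEX = {
    "ex:cropYield": ["source_A", "source_B"],
    "ex:rainfall": ["source_B"],
}

# Hypothetical in-memory stand-ins for remote SPARQL endpoints.
SOURCES = {
    "source_A": [("ex:field1", "ex:cropYield", "3.2")],
    "source_B": [("ex:field1", "ex:rainfall", "410"),
                 ("ex:field2", "ex:cropYield", "2.7")],
}

def annotate(query_patterns):
    """Pair each triple pattern with its ordered list of candidate sources."""
    return [(p, SOURCE_INDEX.get(p[1], [])) for p in query_patterns]

def distribute(annotated):
    """Evaluate each fragment against its candidate sources, in ranked order."""
    results = defaultdict(list)
    for pattern, candidates in annotated:
        for source in candidates:
            for triple in SOURCES[source]:
                if triple[1] == pattern[1]:
                    results[pattern].append((source, triple))
    return results

def collect(results):
    """Merge per-fragment results into one record per subject."""
    merged = defaultdict(dict)
    for hits in results.values():
        for _source, (subject, predicate, value) in hits:
            merged[subject][predicate] = value
    return dict(merged)

# The original query, already partitioned into two triple patterns.
query = [("?f", "ex:cropYield", "?y"), ("?f", "ex:rainfall", "?r")]
answer = collect(distribute(annotate(query)))
print(answer)
```

The point of the annotation step is that only sources likely to hold relevant data are contacted at all: here the rainfall fragment is never sent to source_A.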


The POWDER W3C Recommendation

• Exploits natural groupings of URIs to annotate all resources in a subset of the URI space

• Regular expression based grouping

• Allows properties and their values to be associated with an arbitrary number of subjects within a fully-defined semantic framework

• POWDER Description Resources: http://www.w3.org/TR/powder-dr/

• POWDER Formal Semantics: http://www.w3.org/TR/powder-formal/

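The idea of regular-expression-based grouping can be illustrated with a short sketch. It deliberately does not use POWDER's actual XML syntax or formal semantics; the URI patterns, the property names and the `describe` helper are assumptions chosen only to show how one rule can annotate every resource in a slice of the URI space.

```python
import re

# Hypothetical description resources:
# (URI pattern, properties asserted for every URI that matches it).
DESCRIPTIONS = [
    (re.compile(r"^http://example\.org/soil/.*$"), {"ex:topic": "soil"}),
    (re.compile(r"^http://example\.org/[a-z]+/2013/.*$"), {"ex:year": "2013"}),
]

def describe(uri):
    """Collect the properties of every URI group the given URI falls into."""
    properties = {}
    for pattern, props in DESCRIPTIONS:
        if pattern.match(uri):
            properties.update(props)
    return properties

print(describe("http://example.org/soil/2013/sample-17"))
print(describe("http://example.org/weather/2012/obs-3"))
```

Two rules are enough to describe an arbitrary number of resources, which is what makes this style of annotation attractive for compressing indexes over very large data collections.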


The SemaGrow Stack

• Integrates the components in order to offer a single SPARQL endpoint that federates a number of heterogeneous data sources

• Targets the federation of independently provided data sources


SemaGrow Architecture

[Figure: SemaGrow architecture diagram.

Component labels: Client; SemaGrow SPARQL endpoint; Query Manager; Query Decomposition; Query Decomposer; Data Source(s) Selector; Query Pattern Discovery Service; Query Transformation Service; Schema Mappings; Resource Discovery; Resource Selector; POWDER Inference Layer; P-Store; Federated Endpoint Wrapper; Query Results Merger; SPARQL endpoints (#1 … #n); Data Summaries Endpoint.

Flow labels: SPARQL query; reactivity parameters; query pattern; set of query patterns; equivalent patterns; Candidate Source(s) List (Instance Statistics, Load Info, Semantic Proximity); Data Summaries; query fragment, target source; transformed query; query request #1 … #n; query results; query results schema; transformed schema; Ctrl.

Highlighted in turn: Resource Discovery, Query Decomposition, Federated Endpoint Wrapper, Data Summaries Endpoint.]
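The diagram labels a Candidate Source(s) List informed by Instance Statistics, Load Info and Semantic Proximity, but does not say how these signals are combined. The sketch below assumes a simple weighted sum purely for illustration; the `Candidate` class, the weights and the endpoint names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    instance_count: int        # taken from instance statistics
    load: float                # current load, 0.0 (idle) to 1.0 (saturated)
    semantic_proximity: float  # 0.0 (unrelated) to 1.0 (same vocabulary)

def score(c, max_instances):
    """Hypothetical weighted score; the weights are arbitrary illustration values."""
    coverage = c.instance_count / max_instances if max_instances else 0.0
    return 0.5 * coverage + 0.3 * c.semantic_proximity + 0.2 * (1.0 - c.load)

def rank(candidates):
    """Return candidate source names, most promising first."""
    max_instances = max((c.instance_count for c in candidates), default=0)
    ordered = sorted(candidates, key=lambda c: score(c, max_instances), reverse=True)
    return [c.name for c in ordered]

candidates = [
    Candidate("endpoint_1", instance_count=1000, load=0.9, semantic_proximity=0.4),
    Candidate("endpoint_2", instance_count=800, load=0.1, semantic_proximity=0.9),
    Candidate("endpoint_3", instance_count=50, load=0.0, semantic_proximity=0.2),
]
print(rank(candidates))
```

With these made-up numbers a lightly loaded, semantically close endpoint outranks a larger but saturated one, which is the kind of trade-off the three signals are presumably meant to capture.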


Use Cases (DLO)

Heterogeneous Data Collections & Streams

Big data:
  – Sensor data: soil data, weather
  – GIS data: land usage, forest and natural resources management data
  – Historical data: crop yield, economic data
  – Forecasts: climate change models

Problem:
  – Combine heterogeneous sources to analyze past food production and forecast future trends
  – Cannot clone and translate: large scale, live data streams
  – Cannot immediately and directly effect a radical re-design of all sensing and processing currently in place

3rd Plenary & ESG Meeting 21/10/2013


Use Cases (FAO)

Reactive Data Analysis

Big data:
  – Document collections: past experiences, analysis and research results
  – Databases: climate conditions and crop yield observations, economic data (land and food prices)

Problem:
  – Retrieving complete and accurate information to compile reports
    • Raw data and reports, scientific publications, etc.
  – Wastes human resources that could analyze data and synthesize useful knowledge and advice for food production
    • Too much time spent cross-relating responses from different sources
  – Too many different organizations and processes rely on the different schemas to make re-design viable
  – Cloning is inefficient: large and constantly updated stores


Use Cases (AK)

Reactive Resource Discovery

Big data:
  – Multimedia content about agriculture and biodiversity

Problem:
  – Real-time retrieval of relevant content
  – Used to compile educational activities
  – Schema heterogeneity:
    • Different providers (Organic.Edunet, Europeana, VOA3R, etc.)
  – Too many different organizations and processes rely on the different schemas to make re-design viable
  – Cloning is inefficient: large and constantly updated stores


Project Info

• SemaGrow: Data intensive techniques to boost the real-time performance of global agricultural data infrastructures

• FP7-ICT-2011.4.4 (Intelligent Information Management)

No.  Name
1    Universidad de Alcala
2    NCSR “Demokritos”
3    Universita Degli Studi di Roma Tor Vergata
4    Semantic Web Company
5    Institut Za Fiziku
6    Stichting Dienst Landbouwkundig Onderzoek
7    Food and Agriculture Organization of the UN
8    Agroknow Technologies

Thank you!

Antonis Koukourikos

NCSR “Demokritos”

