Cosmos, Big Data GE Implementation

Open APIs for Open Minds

Building your first application using FI-WARE


Big Data:

What is it and how much data is there

What is big data?

> small data

What is big data?

> big data

http://commons.wikimedia.org/wiki/File:Interior_view_of_Stockholm_Public_Library.jpg

How much data is there?

Data growing forecast

                                      2012    2017
Global users (billions)                2.3     3.6
Global networked devices (billions)     12      19
Global broadband speed (Mbps)         11.3      39
Global traffic (zettabytes)            0.5     1.4

http://www.cisco.com/en/US/netsol/ns827/networking_solutions_sub_solution.html#~forecast

It is not only about storing big data but using it!


> tools

> big data

http://commons.wikimedia.org/wiki/File:Interior_view_of_Stockholm_Public_Library.jpg


How to deal with it:

The Hadoop reference

Hadoop was created by Doug Cutting at Yahoo!...

… based on the MapReduce paper published by Google

Well, MapReduce was really invented by Julius Caesar

Divide et impera*

* Divide and conquer

An example

How many pages are written in Latin among the books in the Ancient Library of Alexandria?

[Figure, five animation frames: three Mappers take books from the shelf and a single Reducer adds up the page counts they emit]

Books (language, reference, pages):
  LATIN  REF1  P45
  GREEK  REF2  P128
  EGYPT  REF3  P12
  LATIN  REF4  P73
  LATIN  REF5  P34
  EGYPT  REF6  P10
  GREEK  REF7  P20
  GREEK  REF8  P230

Reducer state, frame by frame:
  Frame 1: 45 (ref 1); one Mapper is still reading
  Frame 2: 45 (ref 1); one Mapper is still reading
  Frame 3: 45 (ref 1) + 73 (ref 4) + 34 (ref 5)
  Frame 4: 45 (ref 1) + 73 (ref 4) + 34 (ref 5); one Mapper is idle
  Frame 5: all Mappers are idle; 152 TOTAL
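The library example can be sketched in a few lines of plain Java. This is only an illustration of the map and reduce steps, not Hadoop code: the class and method names (LatinPages, Book, mapper, reducer) are made up for this sketch, and everything runs sequentially in one process instead of on a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

class LatinPages {
    /** One book on the shelf: language, reference and number of pages. */
    static final class Book {
        final String language;
        final String ref;
        final int pages;

        Book(String language, String ref, int pages) {
            this.language = language;
            this.ref = ref;
            this.pages = pages;
        }
    }

    /** The eight books shown on the slides. */
    static final List<Book> SHELF = List.of(
        new Book("LATIN", "REF1", 45), new Book("GREEK", "REF2", 128),
        new Book("EGYPT", "REF3", 12), new Book("LATIN", "REF4", 73),
        new Book("LATIN", "REF5", 34), new Book("EGYPT", "REF6", 10),
        new Book("GREEK", "REF7", 20), new Book("GREEK", "REF8", 230));

    /** Map step: emit the page count of a Latin book, nothing otherwise. */
    static Optional<Integer> mapper(Book book) {
        return "LATIN".equals(book.language)
            ? Optional.of(book.pages)
            : Optional.empty();
    }

    /** Reduce step: add up every page count emitted by the mappers. */
    static int reducer(List<Integer> emitted) {
        int total = 0;
        for (int pages : emitted) {
            total += pages;
        }
        return total;
    }

    /** Runs the map step over the whole shelf, then the reduce step. */
    static int countLatinPages(List<Book> shelf) {
        List<Integer> emitted = new ArrayList<>();
        for (Book book : shelf) {
            mapper(book).ifPresent(emitted::add);
        }
        return reducer(emitted);
    }

    public static void main(String[] args) {
        System.out.println(countLatinPages(SHELF) + " TOTAL"); // prints "152 TOTAL"
    }
}
```

On a real cluster each Mapper would work on its own share of the shelf in parallel; the sum is the same because addition does not depend on the order in which the values arrive.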

Hadoop architecture

[Figure: Hadoop architecture diagram; the only label that survives is "head node"]

FI-WARE proposal:

Cosmos Big Data

What is Cosmos?

• Cosmos is Telefónica's Big Data platform
  • Dynamic creation of private computing clusters as a service
  • Infinity, a cluster for persistent storage
• Cosmos is Hadoop ecosystem-based
  • HDFS as its distributed file system
  • Hadoop core as its MapReduce engine
  • HiveQL and Pig for querying the data
  • Oozie as a remote launcher for MapReduce jobs and Hive queries
• Plus other proprietary features
  • Infinity protocol (secure WebHDFS)
  • Cygnus, an injector for context data coming from Orion CB

Cosmos architecture


What can be done with Cosmos?

What                            | Locally (ssh'ing into the Head Node) | Remotely (connecting your app)
Clusters operation              | Cosmos CLI                           | REST API
I/O operation                   | 'hadoop fs' command                  | REST API (WebHDFS, HttpFS, Infinity protocol)
Querying tools (basic analysis) | Hive CLI                             | JDBC, Thrift*
MapReduce (advanced analysis)   | 'hadoop jar' command                 | Oozie REST API

Clusters operation:

Getting your own Roman legion

Using the RESTful API (1)


Using the Python CLI

• Creating a cluster
  $ cosmos create --name <STRING> --size <INT>
• Listing all the clusters
  $ cosmos list
• Showing the details of a cluster
  $ cosmos show <CLUSTER_ID>
• Connecting to the Head Node of a cluster
  $ cosmos ssh <CLUSTER_ID>
• Terminating a cluster
  $ cosmos terminate <CLUSTER_ID>
• Listing available services
  $ cosmos list-services
• Creating a cluster with specific services
  $ cosmos create --name <STRING> --size <INT> --services <SERVICES_LIST>

How to exploit the data:

Commanding your Roman legion

1. Hadoop filesystem commands

• Hadoop general command
  $ hadoop
• Hadoop file system subcommand
  $ hadoop fs
• Hadoop file system options
  $ hadoop fs -ls
  $ hadoop fs -mkdir <hdfs-dir>
  $ hadoop fs -rmr <hdfs-file>
  $ hadoop fs -cat <hdfs-file>
  $ hadoop fs -put <local-file> <hdfs-dir>
  $ hadoop fs -get <hdfs-file> <local-dir>
• http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html

2. WebHDFS/HttpFS REST API

• List a directory
  GET http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS
• Create a new directory
  PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=MKDIRS[&permission=<OCTAL>]
• Delete a file or directory
  DELETE http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=DELETE[&recursive=<true|false>]
• Rename a file or directory
  PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=RENAME&destination=<PATH>
• Concat files
  POST http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CONCAT&sources=<PATHS>
• Set permission
  PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=SETPERMISSION[&permission=<OCTAL>]
• Set owner
  PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=SETOWNER[&owner=<USER>][&group=<GROUP>]
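All the operations above share the same URL pattern, so a remote application can build them with one small helper. The following is a minimal sketch: the class WebHdfsUrls, the host name namenode.example.org and the port 50070 are assumptions made for illustration, and parameter values are assumed to be URL-safe already (no encoding is done).

```java
import java.util.LinkedHashMap;
import java.util.Map;

class WebHdfsUrls {
    /**
     * Builds http://HOST:PORT/webhdfs/v1/PATH?op=OP[&name=value...].
     * The path is given without a leading slash; parameters keep the
     * order in which they were inserted into the map.
     */
    static String build(String host, int port, String path, String op,
                        Map<String, String> params) {
        StringBuilder url = new StringBuilder("http://")
            .append(host).append(':').append(port)
            .append("/webhdfs/v1/").append(path)
            .append("?op=").append(op);
        for (Map.Entry<String, String> param : params.entrySet()) {
            url.append('&').append(param.getKey())
               .append('=').append(param.getValue());
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // hypothetical NameNode address, used only for this example
        String host = "namenode.example.org";
        int port = 50070;

        Map<String, String> none = new LinkedHashMap<>();
        System.out.println("GET " + build(host, port, "user/frb/input", "LISTSTATUS", none));

        Map<String, String> mkdirs = new LinkedHashMap<>();
        mkdirs.put("permission", "755");
        System.out.println("PUT " + build(host, port, "user/frb/newdir", "MKDIRS", mkdirs));

        Map<String, String> rename = new LinkedHashMap<>();
        rename.put("destination", "/user/frb/renamed");
        System.out.println("PUT " + build(host, port, "user/frb/newdir", "RENAME", rename));
    }
}
```

The helper only produces the URL; the HTTP method (GET, PUT, POST or DELETE) still has to match the operation, as listed above.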

2. WebHDFS/HttpFS REST API (cont.)

• Create a new file with initial content (2-step operation)
  PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE
      [&overwrite=<true|false>][&blocksize=<LONG>][&replication=<SHORT>]
      [&permission=<OCTAL>][&buffersize=<INT>]

  HTTP/1.1 307 TEMPORARY_REDIRECT
  Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...
  Content-Length: 0

  PUT -T <LOCAL_FILE> http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...

• Append to a file (2-step operation)
  POST http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=APPEND[&buffersize=<INT>]

  HTTP/1.1 307 TEMPORARY_REDIRECT
  Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=APPEND...
  Content-Length: 0

  POST -T <LOCAL_FILE> http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=APPEND...
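The two-step creation could be driven from a Java application roughly as follows. This is a sketch under several assumptions: it uses the java.net.http client shipped with Java 11 or later, the class name WebHdfsCreate and its methods are made up for the example, the full CREATE URL is passed as the first command-line argument, and authentication is left out.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

class WebHdfsCreate {
    /** Step 1 must answer 307 with a Location header; returns that location. */
    static URI redirectTarget(int status, Optional<String> location) {
        if (status != 307 || !location.isPresent()) {
            throw new IllegalStateException(
                "expected 307 with a Location header, got " + status);
        }
        return URI.create(location.get());
    }

    /** Runs both steps and returns the HTTP status code of the second one. */
    static int create(HttpClient client, URI createUrl, byte[] content)
            throws IOException, InterruptedException {
        // step 1: PUT without data, only to learn where the data must be sent
        HttpRequest step1 = HttpRequest.newBuilder(createUrl)
            .PUT(HttpRequest.BodyPublishers.noBody())
            .build();
        HttpResponse<Void> redirect =
            client.send(step1, HttpResponse.BodyHandlers.discarding());
        URI target = redirectTarget(
            redirect.statusCode(), redirect.headers().firstValue("Location"));

        // step 2: PUT the file content to the location given in step 1
        HttpRequest step2 = HttpRequest.newBuilder(target)
            .PUT(HttpRequest.BodyPublishers.ofByteArray(content))
            .build();
        return client.send(step2, HttpResponse.BodyHandlers.discarding())
            .statusCode();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("usage: WebHdfsCreate <full ?op=CREATE url>");
            return;
        }
        // redirects are handled by hand, so the client must not follow them
        HttpClient client = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NEVER)
            .build();
        byte[] content = "hello cosmos\n".getBytes(StandardCharsets.UTF_8);
        System.out.println("step 2 status: "
            + create(client, URI.create(args[0]), content));
    }
}
```

Appending follows the same shape with POST and op=APPEND instead of PUT and op=CREATE.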

2. WebHDFS/HttpFS REST API (cont.)

• Open and read a file (2-step operation)
  GET http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN
      [&offset=<LONG>][&length=<LONG>][&buffersize=<INT>]

  HTTP/1.1 307 TEMPORARY_REDIRECT
  Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=OPEN...
  Content-Length: 0

  GET http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=OPEN...

• http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
• HttpFS does not redirect to the Datanode but to the HttpFS server, hiding the Datanodes (and saving tens of public IP addresses)
  • The API is the same
  • http://hadoop.apache.org/docs/current/hadoop-hdfs-httpfs/index.html

3. Local Hive CLI

• Hive is a querying tool
• Queries are expressed in HiveQL, a SQL-like language
  • https://cwiki.apache.org/confluence/display/Hive/LanguageManual
• Hive uses pre-defined MapReduce jobs for
  • Column selection
  • Fields grouping
  • Table joining
  • …
• All the data is loaded into Hive tables

3. Local Hive CLI (cont.)

• Log on to the Master node
• Run the hive command
• Type your SQL-like sentence!

$ hive
Hive history file=/tmp/myuser/hive_job_log_opendata_XXX_XXX.txt
hive> select column1,column2,otherColumns from mytable where column1='whatever' and column2 like '%whatever%';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Starting Job = job_201308280930_0953, Tracking URL = http://cosmosmaster-gi:50030/jobdetails.jsp?jobid=job_201308280930_0953
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=cosmosmaster-gi:8021 -kill job_201308280930_0953
2013-10-03 09:15:34,519 Stage-1 map = 0%, reduce = 0%
2013-10-03 09:15:36,545 Stage-1 map = 67%, reduce = 0%
2013-10-03 09:15:37,554 Stage-1 map = 100%, reduce = 0%
2013-10-03 09:15:44,609 Stage-1 map = 100%, reduce = 33%
…

4. Remote Hive client

• Hive CLI is OK for human-driven testing purposes
  • But it is not usable by remote applications
• Hive has no REST API
• Hive has several drivers and libraries
  • JDBC for Java
  • Python
  • PHP
  • ODBC for C/C++
  • Thrift for Java and C++
  • https://cwiki.apache.org/confluence/display/Hive/HiveClient
• A remote Hive client usually performs:
  • A connection to the Hive server (TCP/10000)
  • The query execution

4. Remote Hive client – Get a connection

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

private Connection getConnection(
        String ip, String port, String user, String password) {
    try {
        // dynamically load the Hive JDBC driver
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    } catch (ClassNotFoundException e) {
        System.out.println(e.getMessage());
        return null;
    } // try catch

    try {
        // return a connection based on the Hive JDBC driver, default DB
        return DriverManager.getConnection("jdbc:hive://" + ip + ":" +
            port + "/default?user=" + user + "&password=" + password);
    } catch (SQLException e) {
        System.out.println(e.getMessage());
        return null;
    } // try catch
} // getConnection

https://github.com/telefonicaid/fiware-connectors/tree/develop/resources/hive-basic-client


4. Remote Hive client – Do the query

// con is assumed to be the Connection obtained through getConnection()
private void doQuery() {
    try {
        // from here on, everything is SQL!
        Statement stmt = con.createStatement();
        ResultSet res = stmt.executeQuery("select column1,column2," +
            "otherColumns from mytable where column1='whatever' and " +
            "column2 like '%whatever%'");

        // iterate on the result
        while (res.next()) {
            String column1 = res.getString(1);
            int column2 = res.getInt(2); // ResultSet offers getInt, not getInteger
            // whatever you want to do with this row, here
        } // while

        // close everything
        res.close(); stmt.close(); con.close();
    } catch (SQLException ex) {
        // report the failure and exit with a non-zero status
        System.out.println(ex.getMessage());
        System.exit(1);
    } // try catch
} // doQuery

https://github.com/telefonicaid/fiware-connectors/tree/develop/resources/hive-basic-client


4. Remote Hive client – Plague Tracker demo

https://github.com/telefonicaid/fiware-connectors/tree/develop/resources/plague-tracker

5. MapReduce applications

• MapReduce applications are commonly written in Java
  • Can be written in other languages through Hadoop Streaming
• They are executed in the command line
  $ hadoop jar <jar-file> <main-class> <input-dir> <output-dir>
• A MapReduce job consists of:
  • A driver, a piece of software that defines inputs, outputs, formats, etc. and is the entry point for launching the job
  • A set of Mappers, given by a piece of software defining their behaviour
  • A set of Reducers, given by a piece of software defining their behaviour
• There are 2 APIs
  • org.apache.hadoop.mapred, the old one
  • org.apache.hadoop.mapreduce, the new one
• Hadoop is distributed with MapReduce examples
  • [HADOOP_HOME]/hadoop-examples.jar

5. MapReduce applications – Map

/* org.apache.hadoop.mapred example */
public static class MapClass extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        /* use the input value, the input key is the offset within the
           file and it is not necessary in this example */
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);

        /* iterate on the string, getting each word */
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            /* emit an output (key,value) pair based on the word and 1 */
            output.collect(word, one);
        } // while
    } // map
} // MapClass

5. MapReduce applications – Reduce

public static class ReduceClass extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;

        /* iterate on all the values and add them */
        while (values.hasNext()) {
            sum += values.next().get();
        } // while

        /* emit an output (key,value) pair based on the word and its count */
        output.collect(key, new IntWritable(sum));
    } // reduce
} // ReduceClass

5. MapReduce applications – Driver

package my.org;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {
    // MapClass and ReduceClass from the previous slides go here, as nested classes

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(MapClass.class);
        conf.setCombinerClass(ReduceClass.class);
        conf.setReducerClass(ReduceClass.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        // args[0] is the input directory, args[1] the output directory
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    } // main
} // WordCount

6. Launching tasks with Oozie

• Oozie is a workflow scheduler system to manage Hadoop jobs
  • Java map-reduce
  • Pig and Hive
  • Sqoop
  • System specific jobs (such as Java programs and shell scripts)
• Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions
• Writing Oozie applications is about including in a package
  • The MapReduce jobs, Hive/Pig scripts, etc. (executable code)
  • A Workflow
  • Parameters for the Workflow
• Oozie can be used locally or remotely
• https://oozie.apache.org/docs/4.0.0/index.html#Developer_Documentation
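To make the DAG idea concrete, the sketch below models a workflow as a map from each action to the actions that may start once it finishes, and derives one valid execution order. It illustrates the concept only and is not how Oozie is implemented or configured: the class WorkflowDag and the action names are hypothetical, and the ordering uses Kahn's topological sort.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class WorkflowDag {
    /**
     * Returns one valid execution order for the actions, or throws if the
     * transitions contain a cycle (that is, the workflow is not a DAG).
     */
    static List<String> executionOrder(Map<String, List<String>> next) {
        // count, for every action, how many actions must finish before it
        Map<String, Integer> pending = new LinkedHashMap<>();
        for (String action : next.keySet()) {
            pending.putIfAbsent(action, 0);
        }
        for (List<String> targets : next.values()) {
            for (String target : targets) {
                pending.merge(target, 1, Integer::sum);
            }
        }

        // actions with nothing pending can start right away
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : pending.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String action = ready.remove();
            order.add(action);
            for (String target : next.getOrDefault(action, List.of())) {
                if (pending.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }
        if (order.size() != pending.size()) {
            throw new IllegalArgumentException("not a DAG: cycle detected");
        }
        return order;
    }

    public static void main(String[] args) {
        // hypothetical workflow: one MapReduce job feeding a Hive and a Pig step
        Map<String, List<String>> next = new LinkedHashMap<>();
        next.put("mr-wordcount", List.of("hive-report", "pig-cleanup"));
        next.put("hive-report", List.of("notify"));
        next.put("pig-cleanup", List.of("notify"));
        System.out.println(executionOrder(next));
    }
}
```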

6. Launching tasks with Oozie – Java client

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

OozieClient client = new OozieClient("http://130.206.80.46:11000/oozie/");

// create a workflow job configuration and set the workflow application path
Properties conf = client.createConfiguration();
conf.setProperty(OozieClient.APP_PATH, "hdfs://cosmosmaster-gi:8020/user/frb/mrjobs");
conf.setProperty("nameNode", "hdfs://cosmosmaster-gi:8020");
conf.setProperty("jobTracker", "cosmosmaster-gi:8021");
conf.setProperty("outputDir", "output");
conf.setProperty("inputDir", "input");
conf.setProperty("examplesRoot", "mrjobs");
conf.setProperty("queueName", "default");

// submit and start the workflow job
String jobId = client.run(conf);

// wait until the workflow job finishes, printing the status every 10 seconds
while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
    System.out.println("Workflow job running ...");
    Thread.sleep(10 * 1000);
} // while

System.out.println("Workflow job completed");

Useful references

• Hive resources:
  • HiveQL language https://cwiki.apache.org/confluence/display/Hive/LanguageManual
  • How to create Hive clients https://cwiki.apache.org/confluence/display/Hive/HiveClient
  • Hive client example https://github.com/telefonicaid/fiware-connectors/tree/develop/resources/hive-basic-client
  • Plague Tracker demo https://github.com/telefonicaid/fiware-livedemoapp/tree/master/cosmos/plague-tracker
  • Plague Tracker instance http://130.206.81.65/plague-tracker/
• Hadoop filesystem commands:
  • http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html
• WebHDFS and HttpFS REST APIs:
  • http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
  • http://hadoop.apache.org/docs/current/hadoop-hdfs-httpfs/index.html
• Oozie:
  • https://oozie.apache.org/docs/4.0.0/index.html#Developer_Documentation

Cosmos place in FI-WARE:

Typical scenarios

General IoT platform

[Figure: General IoT platform architecture. Labelled blocks: IoT Backend, Device Management, Context Broker (measures / commands), Context Adapters, CEP, GIS, Short Term Historic, Real Time Processing, COSMOS (Big Data) with Data Processing and Data Querying, BI, ETL, CKAN (Open Data, IoT/Sensor Open Data), Rules Definition, Operational Dashboard, KPI Governance, Open Data Portals, Service Orchestration, City Services, Accounting & Payment & Billing, IDM & Auth]

Real time context data persistence (architecture)

https://forge.fi-ware.eu/plugins/mediawiki/wiki/fiware/index.php/How_to_persist_Orion_data_in_Cosmos

https://github.com/telefonicaid/fiware-connectors/tree/develop/flume

Real time context data persistence (detail)


Real time context data persistence (examples)

• Information coming from city sensors
  • Presence map gradients, agglomerations…
  • Services usage distributions, top users (if available), top POIs, unused resources…
• Information generated by smartphones
  • Geolocation routes, map gradients, agglomerations…
  • Issue reports: top neighbourhoods by incidents, criminality, noise, garbage, plagues…
• Any other real time information
  • Depending on your app, this could be product likes, product consumption, user-2-user feedback… recommendations, advertisement…

Roadmap:

More functionalities and integrations

Roadmap

• Integrate cluster creation with the cloud portal
  • No more REST API work
• Streaming analysis capabilities
  • Not every analysis can wait for batch processing
• Geolocation analysis capabilities
  • An important source of data nowadays
• Integrate with CKAN
  • As a source of batch data
• Integrate with the Marketplace
  • Selling datasets
  • Selling analysis results
  • Selling applications and algorithms


http://fi-ppp.eu

http://fi-ware.eu

Follow @Fiware on Twitter!

Thanks!
