Jul 16, 2015
What is big data?
(Image: interior view of the Stockholm Public Library — http://commons.wikimedia.org/wiki/File:Interior_view_of_Stockholm_Public_Library.jpg)
Data growing forecast
(Cisco forecast, 2012 vs. 2017)
– Global users (billions): 2.3 → 3.6
– Global networked devices (billions): 12 → 19
– Global broadband speed (Mbps): 11.3 → 39
– Global traffic (zettabytes): 0.5 → 1.4
http://www.cisco.com/en/US/netsol/ns827/networking_solutions_sub_solution.html#~forecast
… and you create an index
“The Avengers”, 1-100, shelf 1
“The Avengers”, 101-125, shelf 2
“Superman”, 1-50, shelf 2
“X-Men”, 1-100, shelf 3
“X-Men”, 101-200, shelf 4
“X-Men”, 201-225, shelf 5
(Figure: the indexed volumes of The Avengers, Superman and X-Men distributed across the shelves.)
Hadoop Distributed File System (HDFS)
• Based on the Google File System
• Large files are stored across multiple machines (Datanodes) by splitting them into blocks that are distributed
• Metadata is managed by the Namenode
• Scalable by simply adding more Datanodes
• Fault-tolerant, since HDFS replicates each block (3 replicas by default; see the example below)
• Security based on authentication (Kerberos) and authorization (permissions, HACLs)
• It is managed like a Unix-like file system
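The replication factor can be inspected and tuned per file from the command line; a minimal sketch (the path is illustrative, reusing the webinar files shown later in this deck):

# the second column of the listing is the replication factor (3 by default)
$ hadoop fs -ls /user/frb/webinar/abriefhistoryoftime_page1
# lower it to 2 replicas and wait until re-replication completes
$ hadoop fs -setrep -w 2 /user/frb/webinar/abriefhistoryoftime_page1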
HDFS architecture
(Diagram: a Namenode managing Datanode0 … DatanodeN across Rack 1 and Rack 2, with blocks 1, 2 and 3 replicated among the Datanodes.)

Namenode metadata:
Path                     Replicas  Block IDs
/user/user1/data/file0   2         1,3
/user/user1/data/file1   3         2,4,5
…                        …         …
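The Namenode's metadata can be checked against the actual block placement with the stock fsck tool; a minimal sketch using the illustrative path from the table above:

# list the file's blocks, their IDs, and the Datanodes holding each replica
$ hdfs fsck /user/user1/data/file0 -files -blocks -locations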
File System Shell
• The File System Shell includes various shell-like commands that directly interact with HDFS
• The FS shell is invoked by any of these scripts:
  – bin/hadoop fs
  – bin/hdfs dfs
• All FS Shell commands take URI paths as arguments:
  – scheme://authority/path
  – Available schemes: file (local FS), hdfs (HDFS)
  – If no scheme is specified, hdfs is assumed
• It is necessary to connect to the cluster via SSH
• Full commands reference:
  – http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
File System Shell examples
$ hadoop fs -cat webinar/abriefhistoryoftime_page1
CHAPTER 1
OUR PICTURE OF THE UNIVERSE
A well-known scientist (some say it was Bertrand Russell) once gave a public lecture on astronomy. He described how the earth orbits around the sun and how the sun, in turn, orbits around the center of a vast
$ hadoop fs -mkdir webinar/afolder
$ hadoop fs -ls webinar
Found 4 items
-rw-r--r--   3 frb cosmos 3431 2014-12-10 14:00 /user/frb/webinar/abriefhistoryoftime_page1
-rw-r--r--   3 frb cosmos 1604 2014-12-10 14:00 /user/frb/webinar/abriefhistoryoftime_page2
-rw-r--r--   3 frb cosmos 5257 2014-12-10 14:00 /user/frb/webinar/abriefhistoryoftime_page3
drwxr-xr-x   - frb cosmos    0 2015-03-10 11:09 /user/frb/webinar/afolder
$ hadoop fs -rmr webinar/afolder
Deleted hdfs://cosmosmaster-gi/user/frb/webinar/afolder
HTTP REST API
• The HTTP REST API supports the complete File System interface for HDFS
• It relies on the webhdfs scheme for URIs
  – webhdfs://<HOST>:<HTTP_PORT>/<PATH>
• HTTP URLs are built as:
  – http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=…
• Full API specification:
  – http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
HTTP REST API examples
$ curl -X GET "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar/abriefhistoryoftime_page1?op=open&user.name=frb"
CHAPTER 1
OUR PICTURE OF THE UNIVERSE
A well-known scientist (some say it was Bertrand Russell) once gave a public lecture on astronomy. He described how the earth orbits around the sun and how the sun, in turn, orbits around the center of a vast
$ curl -X PUT "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar/afolder?op=mkdirs&user.name=frb"
{"boolean":true}
$ curl -X GET "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar?op=liststatus&user.name=frb"
{"FileStatuses":{"FileStatus":[{"pathSuffix":"abriefhistoryoftime_page1","type":"FILE","length":3431,"owner":"frb","group":"cosmos","permission":"644","accessTime":1425995831489,"modificationTime":1418216412441,"blockSize":67108864,"replication":3},{"pathSuffix":"abriefhistoryoftime_page2","type":"FILE","length":1604,"owner":"frb","group":"cosmos","permission":"644","accessTime":1418216412460,"modificationTime":1418216412500,"blockSize":67108864,"replication":3},{"pathSuffix":"abriefhistoryoftime_page3","type":"FILE","length":5257,"owner":"frb","group":"cosmos","permission":"644","accessTime":1418216412515,"modificationTime":1418216412551,"blockSize":67108864,"replication":3},{"pathSuffix":"afolder","type":"DIRECTORY","length":0,"owner":"frb","group":"cosmos","permission":"755","accessTime":0,"modificationTime":1425995941361,"blockSize":0,"replication":0}]}}
$ curl -X DELETE "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar/afolder?op=delete&user.name=frb"
{"boolean":true}
Cygnus FAQ
• What is it for?
  – Cygnus is a connector in charge of persisting Orion context data in certain configured third-party storages, creating a historical view of such data. In other words, Orion only stores the last value of each entity attribute; if older values are required, you have to persist them in another storage, value by value, using Cygnus.
• How does it receive context data from Orion Context Broker?
  – Cygnus uses the subscription/notification feature of Orion. A subscription is made in Orion on behalf of Cygnus, detailing which entities we want to be notified about when an update occurs on any of their attributes (see the sketch below).
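A minimal sketch of such a subscription, assuming Orion's NGSI v1 API on an illustrative host and the Cygnus notification port (5050) configured later in this deck; the entity and attribute names are hypothetical:

$ curl -X POST "http://<orion_host>:1026/v1/subscribeContext" \
    -H "Content-Type: application/json" -H "Accept: application/json" -d '
{
    "entities": [ { "type": "Room", "isPattern": "false", "id": "Room1" } ],
    "attributes": [ "temperature" ],
    "reference": "http://<cygnus_host>:5050/notify",
    "duration": "P1M",
    "notifyConditions": [ { "type": "ONCHANGE", "condValues": [ "temperature" ] } ],
    "throttling": "PT5S"
}'

Every time Room1's temperature changes, Orion notifies Cygnus' /notify endpoint, at most once every 5 seconds.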
Cygnus FAQ
• Which storages is it able to integrate?
  – The current stable release is able to persist Orion context data in:
    • HDFS, the Hadoop distributed file system.
    • MySQL, the well-known relational database manager.
    • CKAN, an Open Data platform.
• Which is its architecture?
  – Internally, Cygnus is based on Apache Flume. In fact, Cygnus is a Flume agent, which is basically composed of a source in charge of receiving the data, a channel where the source puts the data once it has been transformed into a Flume event, and a sink, which takes Flume events from the channel in order to persist the data within its body into a third-party storage.
Configure a basic Cygnus agent
• Edit /usr/cygnus/conf/agent_<id>.conf
• List of sources, channels and sinks:
cygnusagent.sources = http-source
cygnusagent.sinks = hdfs-sink
cygnusagent.channels = hdfs-channel
• Channels configuration:
cygnusagent.channels.hdfs-channel.type = memory
cygnusagent.channels.hdfs-channel.capacity = 1000
cygnusagent.channels.hdfs-channel.transactionCapacity = 100
Configure a basic Cygnus agent
• Sources configuration:
cygnusagent.sources.http-source.channels = hdfs-channel
cygnusagent.sources.http-source.type = org.apache.flume.source.http.HTTPSource
cygnusagent.sources.http-source.port = 5050
cygnusagent.sources.http-source.handler = es.tid.fiware.fiwareconnectors.cygnus.handlers.OrionRestHandler
cygnusagent.sources.http-source.handler.notification_target = /notify
cygnusagent.sources.http-source.handler.default_service = def_serv
cygnusagent.sources.http-source.handler.default_service_path = def_servpath
cygnusagent.sources.http-source.handler.events_ttl = 10
cygnusagent.sources.http-source.interceptors = ts de
cygnusagent.sources.http-source.interceptors.ts.type = timestamp
cygnusagent.sources.http-source.interceptors.de.type = es.tid.fiware.fiwareconnectors.cygnus.interceptors.DestinationExtractor$Builder
cygnusagent.sources.http-source.interceptors.de.matching_table = /usr/cygnus/conf/matching_table.conf
Configure a basic Cygnus agent
• Sinks configuration:
cygnusagent.sinks.hdfs-sink.channel = hdfs-channel
cygnusagent.sinks.hdfs-sink.type = es.tid.fiware.fiwareconnectors.cygnus.sinks.OrionHDFSSink
cygnusagent.sinks.hdfs-sink.cosmos_host = cosmos.lab.fi-ware.org
cygnusagent.sinks.hdfs-sink.cosmos_port = 14000
cygnusagent.sinks.hdfs-sink.cosmos_default_username = cosmos_username
cygnusagent.sinks.hdfs-sink.cosmos_default_password = xxxxxxxxxxxxx
cygnusagent.sinks.hdfs-sink.hdfs_api = httpfs
cygnusagent.sinks.hdfs-sink.attr_persistence = column
cygnusagent.sinks.hdfs-sink.hive_host = cosmos.lab.fi-ware.org
cygnusagent.sinks.hdfs-sink.hive_port = 10000
cygnusagent.sinks.hdfs-sink.krb5_auth = false
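Once sources, channels and sinks are configured, the agent can be started with the standard Apache Flume launcher; a minimal sketch, assuming a stock Flume layout (Cygnus packages may ship their own wrapper script):

$ flume-ng agent --conf /usr/cygnus/conf -f /usr/cygnus/conf/agent_<id>.conf -n cygnusagent -Dflume.root.logger=INFO,console

Note that the name passed with -n must match the prefix used in the configuration file (cygnusagent).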
HDFS details regarding Cygnus persistence
• By default, for each entity Cygnus stores the data at:
  – /user/<your_user>/<service>/<service-path>/<entity-id>-<entity-type>/<entity-id>-<entity-type>.txt
• Within each HDFS file, the data format may be json-row or json-column:
– json-row
{
    "recvTimeTs":"13453464536",
    "recvTime":"2014-02-27T14:46:21",
    "entityId":"Room1",
    "entityType":"Room",
    "attrName":"temperature",
    "attrType":"centigrade",
    "attrValue":"26.5",
    "attrMd":[
        …
    ]
}
– json-column
{
    "recvTime":"2014-02-27T14:46:21",
    "temperature":"26.5",
    "temperature_md":[
        …
    ],
    "pressure":"90",
    "pressure_md":[
        …
    ]
}
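Once notifications start flowing, the persisted file can be read back with the FS shell; a minimal sketch, assuming the default service (def_serv), service path (def_servpath) and HDFS user (cosmos_username) from the previous configuration, and the Room1/Room entity of the examples above:

$ hadoop fs -cat /user/cosmos_username/def_serv/def_servpath/Room1-Room/Room1-Room.txt

Each line of the file holds one JSON document like the json-row or json-column samples above.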
Advanced features (0.7.1)
• Round-Robin channel selection
• Pattern-based context data grouping
• Kerberos authentication
• Management Interface (roadmap)
• Multi-tenancy support (roadmap)
• Entity model-based persistence (roadmap)
Round Robin channel selection
• It is possible to configure more than one channel-sink pair for each storage, in order to increase performance
• A custom ChannelSelector is needed
• https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/operation/performance_tuning_tips.md
RoundRobinChannelSelector configuration
cygnusagent.sources = mysource
cygnusagent.sinks = mysink1 mysink2 mysink3
cygnusagent.channels = mychannel1 mychannel2 mychannel3
cygnusagent.sources.mysource.type = ...
cygnusagent.sources.mysource.channels = mychannel1 mychannel2 mychannel3
cygnusagent.sources.mysource.selector.type = es.tid.fiware.fiwareconnectors.cygnus.channelselectors.RoundRobinChannelSelector
cygnusagent.sources.mysource.selector.storages = N
cygnusagent.sources.mysource.selector.storages.storage1 = <subset_of_cygnusagent.sources.mysource.channels>
...
cygnusagent.sources.mysource.selector.storages.storageN = <subset_of_cygnusagent.sources.mysource.channels>
Pattern-based Context Data Grouping
• The default destination (HDFS file, MySQL table or CKAN resource) is obtained as a concatenation:
  – destination=<entity_id>-<entityType>
• It is possible to group different context data thanks to this regex-based feature, implemented as a Flume interceptor:
cygnusagent.sources.http-source.interceptors = ts de
cygnusagent.sources.http-source.interceptors.ts.type = timestamp
cygnusagent.sources.http-source.interceptors.de.type = es.tid.fiware.fiwareconnectors.cygnus.interceptors.DestinationExtractor$Builder
cygnusagent.sources.http-source.interceptors.de.matching_table = /usr/cygnus/conf/matching_table.conf
Matching table for pattern-based grouping
• CSV file ('|' field separator) containing rules:
  – <id>|<comma-separated_fields>|<regex>|<destination>|<destination_dataset>
• For instance:
1|entityId,entityType|Room\.(\d*)Room|numeric_rooms|rooms
2|entityId,entityType|Room\.(\D*)Room|character_rooms|rooms
3|entityType,entityId|RoomRoom\.(\D*)|character_rooms|rooms
4|entityType|Room|other_rooms|rooms
• The listed fields are concatenated and matched against the regex; e.g. an entity with entityId=Room.1 and entityType=Room yields "Room.1Room", which matches rule 1, so its data goes to the numeric_rooms destination within the rooms dataset
• https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/design/interceptors.md#destinationextractor-interceptor
Kerberos authentication
• HDFS may be secured with Kerberos for authentication purposes
• Cygnus is able to persist on a kerberized HDFS if the configured HDFS user has a registered Kerberos principal and this configuration is added:
cygnusagent.sinks.hdfs-sink.krb5_auth = true
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_user = krb5_username
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_password = xxxxxxxxxxxx
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_login_file = /usr/cygnus/conf/krb5_login.conf
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_conf_file = /usr/cygnus/conf/krb5.conf
• https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/operation/hdfs_kerberos_authentication.md
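A quick way to verify that the configured user actually has a registered principal is the standard Kerberos client tooling; a minimal sketch (the EXAMPLE.COM realm is hypothetical):

$ kinit krb5_username
Password for krb5_username@EXAMPLE.COM:
$ klist
Default principal: krb5_username@EXAMPLE.COM
…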
Current HDFS cluster vs. Infinity
• Currently:
  – Hadoop cluster combining HDFS storage and MapReduce computing in the same nodes
  – 10 virtual nodes
  – 1 TB capacity (333 GB real capacity, replicas=3)
  – Default security
  – http://cosmos.lab.fi-ware.org/cosmos-gui
• Infinity:
  – Specific HDFS cluster for storage
  – 6 physical nodes (+ another 6 planned)
  – 20 TB capacity (6.7 TB real capacity, replicas=3)
  – Fully IdM-integrated (OAuth2)
  – Support for CKAN for Open Big Data
Further reading
• Cosmos@FIWARE catalogue
  – http://catalogue.fiware.org/enablers/bigdata-analysis-cosmos
• Cygnus@github
  – https://github.com/telefonicaid/fiware-connectors
• This presentation@slideshare
  – http://es.slideshare.net/FranciscoRomeroBueno/fiwarefdwbigdataday1v1
• Exercises@slideshare
  – http://es.slideshare.net/FranciscoRomeroBueno/fiwarefdwbigdataexercisesday1v1