FIWARE Developer’s Week Big Data GE (day 1) [email protected]
Page 1: fiware_fdw_big_data_day_1_v1

FIWARE Developer’s Week
Big Data GE (day 1)

[email protected]

Page 2: fiware_fdw_big_data_day_1_v1


Big Data:

What is it and how much data is there

Page 3: fiware_fdw_big_data_day_1_v1

What is big data?

> small data

Page 4: fiware_fdw_big_data_day_1_v1

What is big data?

> big data

http://commons.wikimedia.org/wiki/File:Interior_view_of_Stockholm_Public_Library.jpg

Page 5: fiware_fdw_big_data_day_1_v1

How much data is there?


Page 6: fiware_fdw_big_data_day_1_v1

Data growing forecast

(figure: Cisco forecast, 2012 vs. 2017)

• Global users (billions): 2.3 (2012) → 3.6 (2017)

• Global networked devices (billions): 12 → 19

• Global broadband speed (Mbps): 11.3 → 39

• Global traffic (zettabytes): 0.5 → 1.4

http://www.cisco.com/en/US/netsol/ns827/networking_solutions_sub_solution.html#~forecast

Page 7: fiware_fdw_big_data_day_1_v1


How to deal with it:

Distributed storage and computing

Page 8: fiware_fdw_big_data_day_1_v1

What happens if one shelf is not enough?

You buy more shelves…

Page 9: fiware_fdw_big_data_day_1_v1

… and you create an index


“The Avengers”, 1-100, shelf 1

“The Avengers”, 101-125, shelf 2

“Superman”, 1-50, shelf 2

“X-Men”, 1-100, shelf 3

“X-Men”, 101-200, shelf 4

“X-Men”, 201-225, shelf 5

(illustration: copies of “The Avengers”, “Superman” and “X-Men” spread across the shelves)

Page 10: fiware_fdw_big_data_day_1_v1

What about distributed computing?


Page 11: fiware_fdw_big_data_day_1_v1


Distributed storage:

The Hadoop reference

Page 12: fiware_fdw_big_data_day_1_v1

Hadoop Distributed File System (HDFS)

• Based on the Google File System

• Large files are stored across multiple machines (Datanodes) by splitting them into blocks that are distributed

• Metadata is managed by the Namenode

• Scalable by simply adding more Datanodes

• Fault-tolerant, since HDFS replicates each block (3 replicas by default)

• Security based on authentication (Kerberos) and authorization (permissions, HACLs)

• It is managed like a Unix-like file system
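The block-splitting and replication idea above can be sketched in a few lines of Python. This is a toy illustration only, not the real HDFS implementation: the block size is tiny for demonstration and the placement policy is a simple round-robin, whereas real HDFS is rack-aware.

```python
def split_into_blocks(data, block_size):
    """Split a file's contents into fixed-size blocks, HDFS-style."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_replicas(num_blocks, datanodes, replication=3):
    """Toy round-robin replica placement (real HDFS placement is rack-aware)."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
    return placement

# A 200-byte "file" with 64-byte blocks yields 4 blocks (64+64+64+8 bytes)
blocks = split_into_blocks(b"x" * 200, block_size=64)
print(len(blocks))  # 4
```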

Page 13: fiware_fdw_big_data_day_1_v1

HDFS architecture

(diagram: a Namenode managing Datanode0, Datanode1 … DatanodeN, distributed across Rack 1 and Rack 2, with block replicas spread over the Datanodes)

Namenode metadata:

Path                   | Replicas | Block IDs
/user/user1/data/file0 | 2        | 1,3
/user/user1/data/file1 | 3        | 2,4,5
…                      | …        | …

Page 14: fiware_fdw_big_data_day_1_v1


Managing HDFS:

File System Shell

HTTP REST API

Page 15: fiware_fdw_big_data_day_1_v1

File System Shell

• The File System Shell includes various shell-like commands that directly interact with HDFS

• The FS shell is invoked by any of these scripts:
– bin/hadoop fs
– bin/hdfs dfs

• All FS Shell commands take URI paths as arguments:
– scheme://authority/path
– Available schemes: file (local FS), hdfs (HDFS)
– If no scheme is specified, hdfs is assumed

• It is necessary to connect to the cluster via SSH

• Full commands reference:
– http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html

Page 16: fiware_fdw_big_data_day_1_v1

File System Shell examples

$ hadoop fs -cat webinar/abriefhistoryoftime_page1
CHAPTER 1
OUR PICTURE OF THE UNIVERSE
A well-known scientist (some say it was Bertrand Russell) once gave a public lecture on astronomy. He described how the earth orbits around the sun and how the sun, in turn, orbits around the center of a vast
$ hadoop fs -mkdir webinar/afolder
$ hadoop fs -ls webinar
Found 4 items
-rw-r--r--   3 frb cosmos 3431 2014-12-10 14:00 /user/frb/webinar/abriefhistoryoftime_page1
-rw-r--r--   3 frb cosmos 1604 2014-12-10 14:00 /user/frb/webinar/abriefhistoryoftime_page2
-rw-r--r--   3 frb cosmos 5257 2014-12-10 14:00 /user/frb/webinar/abriefhistoryoftime_page3
drwxr-xr-x   - frb cosmos    0 2015-03-10 11:09 /user/frb/webinar/afolder
$ hadoop fs -rmr webinar/afolder
Deleted hdfs://cosmosmaster-gi/user/frb/webinar/afolder

Page 17: fiware_fdw_big_data_day_1_v1

HTTP REST API

• The HTTP REST API supports the complete File System interface for HDFS

• It relies on the webhdfs scheme for URIs:
– webhdfs://<HOST>:<HTTP_PORT>/<PATH>

• HTTP URLs are built as:
– http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=…

• Full API specification:
– http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
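URLs of that shape are easy to build programmatically. A minimal helper sketch (`webhdfs_url` is a hypothetical name, not part of any Hadoop library; the operation name follows the lowercase style used in the examples on the next slide):

```python
def webhdfs_url(host, port, path, op, user):
    """Build a WebHDFS/HttpFS operation URL: http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=..."""
    return f"http://{host}:{port}/webhdfs/v1/{path.lstrip('/')}?op={op}&user.name={user}"

url = webhdfs_url("cosmos.lab.fi-ware.org", 14000,
                  "user/frb/webinar", "liststatus", "frb")
print(url)
```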

Page 18: fiware_fdw_big_data_day_1_v1

HTTP REST API examples

$ curl -X GET "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar/abriefhistoryoftime_page1?op=open&user.name=frb"

CHAPTER 1

OUR PICTURE OF THE UNIVERSE

A well-known scientist (some say it was Bertrand Russell) once gave a public lecture on astronomy. He described how the earth orbits around the sun and how the sun, in turn, orbits around the center of a vast

$ curl -X PUT "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar/afolder?op=mkdirs&user.name=frb"

{"boolean":true}

$ curl -X GET "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar?op=liststatus&user.name=frb"

{"FileStatuses":{"FileStatus":[{"pathSuffix":"abriefhistoryoftime_page1","type":"FILE","length":3431,"owner":"frb","group":"cosmos","permission":"644","accessTime":1425995831489,"modificationTime":1418216412441,"blockSize":67108864,"replication":3},{"pathSuffix":"abriefhistoryoftime_page2","type":"FILE","length":1604,"owner":"frb","group":"cosmos","permission":"644","accessTime":1418216412460,"modificationTime":1418216412500,"blockSize":67108864,"replication":3},{"pathSuffix":"abriefhistoryoftime_page3","type":"FILE","length":5257,"owner":"frb","group":"cosmos","permission":"644","accessTime":1418216412515,"modificationTime":1418216412551,"blockSize":67108864,"replication":3},{"pathSuffix":"afolder","type":"DIRECTORY","length":0,"owner":"frb","group":"cosmos","permission":"755","accessTime":0,"modificationTime":1425995941361,"blockSize":0,"replication":0}]}}

$ curl -X DELETE "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar/afolder?op=delete&user.name=frb"

{"boolean":true}
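The liststatus response shown above is plain JSON, so it can be consumed with nothing but the standard library. A minimal sketch over a shortened fragment of that response:

```python
import json

# A fragment of the liststatus response shown above
response = '''{"FileStatuses":{"FileStatus":[
  {"pathSuffix":"abriefhistoryoftime_page1","type":"FILE","length":3431,"replication":3},
  {"pathSuffix":"afolder","type":"DIRECTORY","length":0,"replication":0}]}}'''

statuses = json.loads(response)["FileStatuses"]["FileStatus"]
for s in statuses:
    print(f'{s["pathSuffix"]}: {s["type"]}, {s["length"]} bytes')
```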

Page 19: fiware_fdw_big_data_day_1_v1


Feeding HDFS:

Cygnus

Page 20: fiware_fdw_big_data_day_1_v1

Cygnus FAQ

• What is it for?
– Cygnus is a connector in charge of persisting Orion context data in certain configured third-party storages, creating a historical view of such data. In other words, Orion only stores the last value of an entity's attribute; if older values are required, they have to be persisted in another storage, value by value, using Cygnus.

• How does it receive context data from Orion Context Broker?
– Cygnus uses the subscription/notification feature of Orion. A subscription is made in Orion on behalf of Cygnus, detailing which entities we want to be notified about when an update occurs on any of their attributes.

Page 21: fiware_fdw_big_data_day_1_v1

Cygnus FAQ

• Which storages is it able to integrate?
– The current stable release is able to persist Orion context data in:
• HDFS, the Hadoop distributed file system.
• MySQL, the well-known relational database manager.
• CKAN, an Open Data platform.

• Which is its architecture?
– Internally, Cygnus is based on Apache Flume. In fact, Cygnus is a Flume agent, which is basically composed of a source in charge of receiving the data, a channel where the source puts the data once it has been transformed into a Flume event, and a sink, which takes Flume events from the channel in order to persist the data within its body into a third-party storage.
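That source → channel → sink flow can be sketched as a toy pipeline. This is illustrative Python only; the real Cygnus/Flume components are Java classes, and the event structure here is deliberately simplified.

```python
from queue import Queue

class ToyChannel:
    """A bounded buffer standing in for a Flume memory channel."""
    def __init__(self, capacity=1000):
        self.q = Queue(maxsize=capacity)
    def put(self, event):
        self.q.put(event)
    def take(self):
        return self.q.get()

def source(channel, notification):
    # Transform the incoming Orion notification into a "Flume event" (headers + body)
    event = {"headers": {"content-type": "application/json"}, "body": notification}
    channel.put(event)

def sink(channel, storage):
    # Take events from the channel and persist their body into the storage
    storage.append(channel.take()["body"])

storage = []
ch = ToyChannel()
source(ch, '{"entityId": "Room1", "temperature": "26.5"}')
sink(ch, storage)
print(storage)
```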

Page 22: fiware_fdw_big_data_day_1_v1

Basic Cygnus agent


Page 23: fiware_fdw_big_data_day_1_v1

Configure a basic Cygnus agent

• Edit /usr/cygnus/conf/agent_<id>.conf

• List of sources, channels and sinks:
cygnusagent.sources = http-source
cygnusagent.sinks = hdfs-sink
cygnusagent.channels = hdfs-channel

• Channels configuration:
cygnusagent.channels.hdfs-channel.type = memory
cygnusagent.channels.hdfs-channel.capacity = 1000
cygnusagent.channels.hdfs-channel.transactionCapacity = 100

Page 24: fiware_fdw_big_data_day_1_v1

Configure a basic Cygnus agent

• Sources configuration:
cygnusagent.sources.http-source.channels = hdfs-channel

cygnusagent.sources.http-source.type = org.apache.flume.source.http.HTTPSource

cygnusagent.sources.http-source.port = 5050

cygnusagent.sources.http-source.handler = es.tid.fiware.fiwareconnectors.cygnus.handlers.OrionRestHandler

cygnusagent.sources.http-source.handler.notification_target = /notify

cygnusagent.sources.http-source.handler.default_service = def_serv

cygnusagent.sources.http-source.handler.default_service_path = def_servpath

cygnusagent.sources.http-source.handler.events_ttl = 10

cygnusagent.sources.http-source.interceptors = ts de

cygnusagent.sources.http-source.interceptors.ts.type = timestamp

cygnusagent.sources.http-source.interceptors.de.type = es.tid.fiware.fiwareconnectors.cygnus.interceptors.DestinationExtractor$Builder

cygnusagent.sources.http-source.interceptors.de.matching_table = /usr/cygnus/conf/matching_table.conf

Page 25: fiware_fdw_big_data_day_1_v1

Configure a basic Cygnus agent

• Sinks configuration:
cygnusagent.sinks.hdfs-sink.channel = hdfs-channel

cygnusagent.sinks.hdfs-sink.type = es.tid.fiware.fiwareconnectors.cygnus.sinks.OrionHDFSSink

cygnusagent.sinks.hdfs-sink.cosmos_host = cosmos.lab.fi-ware.org

cygnusagent.sinks.hdfs-sink.cosmos_port = 14000

cygnusagent.sinks.hdfs-sink.cosmos_default_username = cosmos_username

cygnusagent.sinks.hdfs-sink.cosmos_default_password = xxxxxxxxxxxxx

cygnusagent.sinks.hdfs-sink.hdfs_api = httpfs

cygnusagent.sinks.hdfs-sink.attr_persistence = column

cygnusagent.sinks.hdfs-sink.hive_host = cosmos.lab.fi-ware.org

cygnusagent.sinks.hdfs-sink.hive_port = 10000

cygnusagent.sinks.hdfs-sink.krb5_auth = false
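Agent files like the above are plain `key = value` properties, so they are easy to inspect from a script. A minimal parser sketch (`parse_flume_config` is a hypothetical helper, not part of Cygnus or Flume):

```python
def parse_flume_config(text):
    """Parse 'key = value' lines from a Flume agent configuration into a dict."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments and non-property lines
        key, value = line.split("=", 1)
        conf[key.strip()] = value.strip()
    return conf

sample = """
cygnusagent.sinks.hdfs-sink.cosmos_port = 14000
cygnusagent.sinks.hdfs-sink.krb5_auth = false
"""
conf = parse_flume_config(sample)
print(conf["cygnusagent.sinks.hdfs-sink.cosmos_port"])
```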

Page 26: fiware_fdw_big_data_day_1_v1

HDFS details regarding Cygnus persistence

• By default, for each entity Cygnus stores the data at:

– /user/<your_user>/<service>/<service-path>/<entity-id>-<entity-type>/<entity-id>-<entity-type>.txt

• Within each HDFS file, the data format may be json-row or json-column:

– json-row

{
  "recvTimeTs": "13453464536",
  "recvTime": "2014-02-27T14:46:21",
  "entityId": "Room1",
  "entityType": "Room",
  "attrName": "temperature",
  "attrType": "centigrade",
  "attrValue": "26.5",
  "attrMd": []
}

– json-column

{
  "recvTime": "2014-02-27T14:46:21",
  "temperature": "26.5",
  "temperature_md": [],
  "pressure": "90",
  "pressure_md": []
}
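The relation between the two formats can be illustrated with a small converter that folds several json-row events for one entity into a single json-column record. This is a sketch for clarity, not Cygnus code:

```python
import json

# Two json-row events for the same entity (abbreviated fields)
rows = [
    {"recvTime": "2014-02-27T14:46:21", "attrName": "temperature", "attrValue": "26.5", "attrMd": []},
    {"recvTime": "2014-02-27T14:46:21", "attrName": "pressure", "attrValue": "90", "attrMd": []},
]

def rows_to_column(rows):
    """Fold json-row events into a json-column record: one column per attribute."""
    column = {"recvTime": rows[0]["recvTime"]}
    for r in rows:
        column[r["attrName"]] = r["attrValue"]
        column[r["attrName"] + "_md"] = r["attrMd"]
    return column

print(json.dumps(rows_to_column(rows)))
```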

Page 27: fiware_fdw_big_data_day_1_v1

Advanced features (0.7.1)

• Round-Robin channel selection

• Pattern-based context data grouping

• Kerberos authentication

• Management Interface (roadmap)

• Multi-tenancy support (roadmap)

• Entity model-based persistence (roadmap)

Page 28: fiware_fdw_big_data_day_1_v1

Round Robin channel selection

• It is possible to configure more than one channel-sink pair for each storage, in order to increase performance

• A custom ChannelSelector is needed

• https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/operation/performance_tuning_tips.md

Page 29: fiware_fdw_big_data_day_1_v1

RoundRobinChannelSelector configuration

cygnusagent.sources = mysource
cygnusagent.sinks = mysink1 mysink2 mysink3
cygnusagent.channels = mychannel1 mychannel2 mychannel3
cygnusagent.sources.mysource.type = ...
cygnusagent.sources.mysource.channels = mychannel1 mychannel2 mychannel3
cygnusagent.sources.mysource.selector.type = es.tid.fiware.fiwareconnectors.cygnus.channelselectors.RoundRobinChannelSelector
cygnusagent.sources.mysource.selector.storages = N
cygnusagent.sources.mysource.selector.storages.storage1 = <subset_of_cygnusagent.sources.mysource.channels>
...
cygnusagent.sources.mysource.selector.storages.storageN = <subset_of_cygnusagent.sources.mysource.channels>
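The behaviour of such a selector is simple to picture: each incoming event goes to the next channel in turn. A toy sketch of that idea (illustrative only; the real RoundRobinChannelSelector is a Flume ChannelSelector written in Java):

```python
import itertools

class RoundRobinSelector:
    """Toy round-robin selector: hands out channels in cyclic order."""
    def __init__(self, channels):
        self._cycle = itertools.cycle(channels)
    def select(self):
        # Each call returns the next channel, wrapping around at the end
        return next(self._cycle)

sel = RoundRobinSelector(["mychannel1", "mychannel2", "mychannel3"])
print([sel.select() for _ in range(4)])  # ['mychannel1', 'mychannel2', 'mychannel3', 'mychannel1']
```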

Page 30: fiware_fdw_big_data_day_1_v1

Pattern-based Context Data Grouping

• The default destination (HDFS file, MySQL table or CKAN resource) is obtained as a concatenation:
– destination=<entity_id>-<entityType>

• It is possible to group different context data thanks to this regex-based feature, implemented as a Flume interceptor:
cygnusagent.sources.http-source.interceptors = ts de
cygnusagent.sources.http-source.interceptors.ts.type = timestamp
cygnusagent.sources.http-source.interceptors.de.type = es.tid.fiware.fiwareconnectors.cygnus.interceptors.DestinationExtractor$Builder
cygnusagent.sources.http-source.interceptors.de.matching_table = /usr/cygnus/conf/matching_table.conf

Page 31: fiware_fdw_big_data_day_1_v1

Matching table for pattern-based grouping

• CSV file (‘|’ field separator) containing rules:

– <id>|<comma-separated_fields>|<regex>|<destination>|<destination_dataset>

• For instance:

1|entityId,entityType|Room\.(\d*)Room|numeric_rooms|rooms
2|entityId,entityType|Room\.(\D*)Room|character_rooms|rooms
3|entityType,entityId|RoomRoom\.(\D*)|character_rooms|rooms
4|entityType|Room|other_rooms|rooms

• https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/design/interceptors.md#destinationextractor-interceptor
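How a rule fires can be sketched in a few lines: the listed fields are concatenated and tested against the rule's regex, and on a match the rule's destination replaces the default one. This is an illustrative reimplementation of rule 1 above, not the actual interceptor code:

```python
import re

# Rule 1 from the matching table above
rule = {"fields": ["entityId", "entityType"], "regex": r"Room\.(\d*)Room",
        "destination": "numeric_rooms", "dataset": "rooms"}

def match_destination(context, rule):
    """Concatenate the rule's fields and test them against the rule's regex."""
    concatenated = "".join(context[f] for f in rule["fields"])
    if re.match(rule["regex"], concatenated):
        return rule["destination"]
    return None  # no match: fall back to the default <entity_id>-<entity_type>

# "Room.12" + "Room" -> "Room.12Room" matches Room\.(\d*)Room
print(match_destination({"entityId": "Room.12", "entityType": "Room"}, rule))  # numeric_rooms
```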

Page 32: fiware_fdw_big_data_day_1_v1

Kerberos authentication

• HDFS may be secured with Kerberos for authentication purposes

• Cygnus is able to persist on a kerberized HDFS if the configured HDFS user has a registered Kerberos principal and this configuration is added:
cygnusagent.sinks.hdfs-sink.krb5_auth = true
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_user = krb5_username
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_password = xxxxxxxxxxxx
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_login_file = /usr/cygnus/conf/krb5_login.conf
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_conf_file = /usr/cygnus/conf/krb5.conf

• https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/operation/hdfs_kerberos_authentication.md

Page 33: fiware_fdw_big_data_day_1_v1


HDFS at FIWARE LAB

Beyond Infinity

Page 34: fiware_fdw_big_data_day_1_v1

Current HDFS cluster vs. Infinity

• Currently:
– Hadoop cluster combining HDFS storage and MapReduce computing in the same nodes
– 10 virtual nodes
– 1 TB capacity (333 GB real capacity, replicas=3)
– Default security
– http://cosmos.lab.fi-ware.org/cosmos-gui

• Infinity:
– Specific HDFS cluster for storage
– 6 physical nodes (+ 6 more planned)
– 20 TB capacity (6.7 TB real capacity, replicas=3)
– Fully IdM-integrated (OAuth2)
– Support for CKAN for Open Big Data

Page 35: fiware_fdw_big_data_day_1_v1

Further reading

• Cosmos@FIWARE catalogue
– http://catalogue.fiware.org/enablers/bigdata-analysis-cosmos

• Cygnus@github
– https://github.com/telefonicaid/fiware-connectors

• This presentation@slideshare
– http://es.slideshare.net/FranciscoRomeroBueno/fiwarefdwbigdataday1v1

• Exercises@slideshare
– http://es.slideshare.net/FranciscoRomeroBueno/fiwarefdwbigdataexercisesday1v1

Page 36: fiware_fdw_big_data_day_1_v1

Thanks!