Jul 16, 2015
What is big data?
(Image: interior view of the Stockholm Public Library — http://commons.wikimedia.org/wiki/File:Interior_view_of_Stockholm_Public_Library.jpg)
Data growing forecast
(Cisco forecast, 2012 vs. 2017)
– Global users (billions): 2.3 → 3.6
– Global networked devices (billions): 12 → 19
– Global broadband speed (Mbps): 11.3 → 39
– Global traffic (zettabytes): 0.5 → 1.4
http://www.cisco.com/en/US/netsol/ns827/networking_solutions_sub_solution.html#~forecast
… and you create an index
“The Avengers”, 1-100, shelf 1
“The Avengers”, 101-125, shelf 2
“Superman”, 1-50, shelf 2
“X-Men”, 1-100, shelf 3
“X-Men”, 101-200, shelf 4
“X-Men”, 201-225, shelf 5
(Figure: the indexed volumes of The Avengers, Superman and X-Men distributed across the shelves.)
Hadoop Distributed File System (HDFS)
• Based on the Google File System
• Large files are stored across multiple machines (Datanodes) by splitting them into blocks that are distributed
• Metadata is managed by the Namenode
• Scalable by simply adding more Datanodes
• Fault-tolerant, since HDFS replicates each block (3 replicas by default; see the example below)
• Security based on authentication (Kerberos) and authorization (permissions, HACLs)
• It is managed like a Unix-like file system
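The replication factor can be inspected and tuned per file from the command line; a minimal sketch (the path is illustrative, reusing the webinar files shown later in this deck):

# the second column of the listing is the replication factor (3 by default)
$ hadoop fs -ls /user/frb/webinar/abriefhistoryoftime_page1
# lower it to 2 replicas and wait until re-replication completes
$ hadoop fs -setrep -w 2 /user/frb/webinar/abriefhistoryoftime_page1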
HDFS architecture
(Diagram: a Namenode managing Datanode0 … DatanodeN across Rack 1 and Rack 2, with blocks 1, 2 and 3 replicated among the Datanodes.)

Namenode metadata:
Path                     Replicas  Block IDs
/user/user1/data/file0   2         1,3
/user/user1/data/file1   3         2,4,5
…                        …         …
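The Namenode's metadata can be checked against the actual block placement with the stock fsck tool; a minimal sketch using the illustrative path from the table above:

# list the file's blocks, their IDs, and the Datanodes holding each replica
$ hdfs fsck /user/user1/data/file0 -files -blocks -locations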
File System Shell
• The File System Shell includes various shell-like commands that directly interact with HDFS
• The FS shell is invoked by any of these scripts:
  – bin/hadoop fs
  – bin/hdfs dfs
• All FS Shell commands take URI paths as arguments:
  – scheme://authority/path
  – Available schemes: file (local FS), hdfs (HDFS)
  – If no scheme is specified, hdfs is assumed
• It is necessary to connect to the cluster via SSH
• Full commands reference:
  – http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
File System Shell examples
$ hadoop fs -cat webinar/abriefhistoryoftime_page1
CHAPTER 1
OUR PICTURE OF THE UNIVERSE
A well-known scientist (some say it was Bertrand Russell) once gave a public lecture on astronomy. He described how the earth orbits around the sun and how the sun, in turn, orbits around the center of a vast
$ hadoop fs -mkdir webinar/afolder
$ hadoop fs -ls webinar
Found 4 items
-rw-r--r--   3 frb cosmos 3431 2014-12-10 14:00 /user/frb/webinar/abriefhistoryoftime_page1
-rw-r--r--   3 frb cosmos 1604 2014-12-10 14:00 /user/frb/webinar/abriefhistoryoftime_page2
-rw-r--r--   3 frb cosmos 5257 2014-12-10 14:00 /user/frb/webinar/abriefhistoryoftime_page3
drwxr-xr-x   - frb cosmos    0 2015-03-10 11:09 /user/frb/webinar/afolder
$ hadoop fs -rmr webinar/afolder
Deleted hdfs://cosmosmaster-gi/user/frb/webinar/afolder
HTTP REST API
• The HTTP REST API supports the complete File System interface for HDFS
• It relies on the webhdfs scheme for URIs
  – webhdfs://<HOST>:<HTTP_PORT>/<PATH>
• HTTP URLs are built as:
  – http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=…
• Full API specification:
  – http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
HTTP REST API examples
$ curl -X GET "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar/abriefhistoryoftime_page1?op=open&user.name=frb"
CHAPTER 1
OUR PICTURE OF THE UNIVERSE
A well-known scientist (some say it was Bertrand Russell) once gave a public lecture on astronomy. He described how the earth orbits around the sun and how the sun, in turn, orbits around the center of a vast
$ curl -X PUT "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar/afolder?op=mkdirs&user.name=frb"
{"boolean":true}
$ curl -X GET "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar?op=liststatus&user.name=frb"
{"FileStatuses":{"FileStatus":[{"pathSuffix":"abriefhistoryoftime_page1","type":"FILE","length":3431,"owner":"frb","group":"cosmos","permission":"644","accessTime":1425995831489,"modificationTime":1418216412441,"blockSize":67108864,"replication":3},{"pathSuffix":"abriefhistoryoftime_page2","type":"FILE","length":1604,"owner":"frb","group":"cosmos","permission":"644","accessTime":1418216412460,"modificationTime":1418216412500,"blockSize":67108864,"replication":3},{"pathSuffix":"abriefhistoryoftime_page3","type":"FILE","length":5257,"owner":"frb","group":"cosmos","permission":"644","accessTime":1418216412515,"modificationTime":1418216412551,"blockSize":67108864,"replication":3},{"pathSuffix":"afolder","type":"DIRECTORY","length":0,"owner":"frb","group":"cosmos","permission":"755","accessTime":0,"modificationTime":1425995941361,"blockSize":0,"replication":0}]}}
$ curl -X DELETE "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar/afolder?op=delete&user.name=frb"
{"boolean":true}
Cygnus FAQ
• What is it for?
  – Cygnus is a connector in charge of persisting Orion context data in certain configured third-party storages, creating a historical view of such data. In other words, Orion only stores the last value of each entity attribute; if older values are required, you have to persist them in another storage, value by value, using Cygnus.
• How does it receive context data from Orion Context Broker?
  – Cygnus uses the subscription/notification feature of Orion. A subscription is made in Orion on behalf of Cygnus, detailing which entities we want to be notified about when an update occurs on any of their attributes (see the sketch below).
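A minimal sketch of such a subscription, assuming Orion's NGSI v1 API on an illustrative host and the Cygnus notification port (5050) configured later in this deck; the entity and attribute names are hypothetical:

$ curl -X POST "http://<orion_host>:1026/v1/subscribeContext" \
    -H "Content-Type: application/json" -H "Accept: application/json" -d '
{
    "entities": [ { "type": "Room", "isPattern": "false", "id": "Room1" } ],
    "attributes": [ "temperature" ],
    "reference": "http://<cygnus_host>:5050/notify",
    "duration": "P1M",
    "notifyConditions": [ { "type": "ONCHANGE", "condValues": [ "temperature" ] } ],
    "throttling": "PT5S"
}'

Every time Room1's temperature changes, Orion notifies Cygnus' /notify endpoint, at most once every 5 seconds.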
Cygnus FAQ
• Which storages is it able to integrate?
  – The current stable release is able to persist Orion context data in:
    • HDFS, the Hadoop distributed file system.
    • MySQL, the well-known relational database manager.
    • CKAN, an Open Data platform.
• Which is its architecture?
  – Internally, Cygnus is based on Apache Flume. In fact, Cygnus is a Flume agent, which is basically composed of a source in charge of receiving the data, a channel where the source puts the data once it has been transformed into a Flume event, and a sink, which takes Flume events from the channel in order to persist the data within its body into a third-party storage.
Configure a basic Cygnus agent
• Edit /usr/cygnus/conf/agent_<id>.conf
• List of sources, channels and sinks:
cygnusagent.sources = http-source
cygnusagent.sinks = hdfs-sink
cygnusagent.channels = hdfs-channel
• Channels configuration:
cygnusagent.channels.hdfs-channel.type = memory
cygnusagent.channels.hdfs-channel.capacity = 1000
cygnusagent.channels.hdfs-channel.transactionCapacity = 100
Configure a basic Cygnus agent
• Sources configuration:
cygnusagent.sources.http-source.channels = hdfs-channel
cygnusagent.sources.http-source.type = org.apache.flume.source.http.HTTPSource
cygnusagent.sources.http-source.port = 5050
cygnusagent.sources.http-source.handler = es.tid.fiware.fiwareconnectors.cygnus.handlers.OrionRestHandler
cygnusagent.sources.http-source.handler.notification_target = /notify
cygnusagent.sources.http-source.handler.default_service = def_serv
cygnusagent.sources.http-source.handler.default_service_path = def_servpath
cygnusagent.sources.http-source.handler.events_ttl = 10
cygnusagent.sources.http-source.interceptors = ts de
cygnusagent.sources.http-source.interceptors.ts.type = timestamp
cygnusagent.sources.http-source.interceptors.de.type = es.tid.fiware.fiwareconnectors.cygnus.interceptors.DestinationExtractor$Builder
cygnusagent.sources.http-source.interceptors.de.matching_table = /usr/cygnus/conf/matching_table.conf
Configure a basic Cygnus agent
• Sinks configuration:
cygnusagent.sinks.hdfs-sink.channel = hdfs-channel
cygnusagent.sinks.hdfs-sink.type = es.tid.fiware.fiwareconnectors.cygnus.sinks.OrionHDFSSink
cygnusagent.sinks.hdfs-sink.cosmos_host = cosmos.lab.fi-ware.org
cygnusagent.sinks.hdfs-sink.cosmos_port = 14000
cygnusagent.sinks.hdfs-sink.cosmos_default_username = cosmos_username
cygnusagent.sinks.hdfs-sink.cosmos_default_password = xxxxxxxxxxxxx
cygnusagent.sinks.hdfs-sink.hdfs_api = httpfs
cygnusagent.sinks.hdfs-sink.attr_persistence = column
cygnusagent.sinks.hdfs-sink.hive_host = cosmos.lab.fi-ware.org
cygnusagent.sinks.hdfs-sink.hive_port = 10000
cygnusagent.sinks.hdfs-sink.krb5_auth = false
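Once sources, channels and sinks are configured, the agent can be started with the standard Apache Flume launcher; a minimal sketch, assuming a stock Flume layout (Cygnus packages may ship their own wrapper script):

$ flume-ng agent --conf /usr/cygnus/conf -f /usr/cygnus/conf/agent_<id>.conf -n cygnusagent -Dflume.root.logger=INFO,console

Note that the name passed with -n must match the prefix used in the configuration file (cygnusagent).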
HDFS details regarding Cygnus persistence
• By default, for each entity Cygnus stores the data at:
  – /user/<your_user>/<service>/<service-path>/<entity-id>-<entity-type>/<entity-id>-<entity-type>.txt
• Within each HDFS file, the data format may be json-row or json-column:
– json-row
{
    "recvTimeTs":"13453464536",
    "recvTime":"2014-02-27T14:46:21",
    "entityId":"Room1",
    "entityType":"Room",
    "attrName":"temperature",
    "attrType":"centigrade",
    "attrValue":"26.5",
    "attrMd":[
        …
    ]
}
– json-column
{
    "recvTime":"2014-02-27T14:46:21",
    "temperature":"26.5",
    "temperature_md":[
        …
    ],
    "pressure":"90",
    "pressure_md":[
        …
    ]
}
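Once notifications start flowing, the persisted file can be read back with the FS shell; a minimal sketch, assuming the default service (def_serv), service path (def_servpath) and HDFS user (cosmos_username) from the previous configuration, and the Room1/Room entity of the examples above:

$ hadoop fs -cat /user/cosmos_username/def_serv/def_servpath/Room1-Room/Room1-Room.txt

Each line of the file holds one JSON document like the json-row or json-column samples above.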
Advanced features (0.7.1)
• Round-Robin channel selection
• Pattern-based context data grouping
• Kerberos authentication
• Management Interface (roadmap)
• Multi-tenancy support (roadmap)
• Entity model-based persistence (roadmap)
Round Robin channel selection
• It is possible to configure more than one channel-sink pair for each storage, in order to increase performance
• A custom ChannelSelector is needed
• https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/operation/performance_tuning_tips.md
RoundRobinChannelSelector configuration
cygnusagent.sources = mysource
cygnusagent.sinks = mysink1 mysink2 mysink3
cygnusagent.channels = mychannel1 mychannel2 mychannel3
cygnusagent.sources.mysource.type = ...
cygnusagent.sources.mysource.channels = mychannel1 mychannel2 mychannel3
cygnusagent.sources.mysource.selector.type = es.tid.fiware.fiwareconnectors.cygnus.channelselectors.RoundRobinChannelSelector
cygnusagent.sources.mysource.selector.storages = N
cygnusagent.sources.mysource.selector.storages.storage1 = <subset_of_cygnusagent.sources.mysource.channels>
...
cygnusagent.sources.mysource.selector.storages.storageN = <subset_of_cygnusagent.sources.mysource.channels>
Pattern-based Context Data Grouping
• The default destination (HDFS file, MySQL table or CKAN resource) is obtained as a concatenation:
  – destination=<entity_id>-<entityType>
• It is possible to group different context data thanks to this regex-based feature, implemented as a Flume interceptor:
cygnusagent.sources.http-source.interceptors = ts de
cygnusagent.sources.http-source.interceptors.ts.type = timestamp
cygnusagent.sources.http-source.interceptors.de.type = es.tid.fiware.fiwareconnectors.cygnus.interceptors.DestinationExtractor$Builder
cygnusagent.sources.http-source.interceptors.de.matching_table = /usr/cygnus/conf/matching_table.conf
Matching table for pattern-based grouping
• CSV file ('|' field separator) containing rules:
  – <id>|<comma-separated_fields>|<regex>|<destination>|<destination_dataset>
• For instance:
1|entityId,entityType|Room\.(\d*)Room|numeric_rooms|rooms
2|entityId,entityType|Room\.(\D*)Room|character_rooms|rooms
3|entityType,entityId|RoomRoom\.(\D*)|character_rooms|rooms
4|entityType|Room|other_rooms|rooms
• The listed fields are concatenated and matched against the regex; e.g. an entity with entityId=Room.1 and entityType=Room yields "Room.1Room", which matches rule 1, so its data goes to the numeric_rooms destination within the rooms dataset
• https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/design/interceptors.md#destinationextractor-interceptor
Kerberos authentication
• HDFS may be secured with Kerberos for authentication purposes
• Cygnus is able to persist on a kerberized HDFS if the configured HDFS user has a registered Kerberos principal and this configuration is added:
cygnusagent.sinks.hdfs-sink.krb5_auth = true
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_user = krb5_username
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_password = xxxxxxxxxxxx
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_login_file = /usr/cygnus/conf/krb5_login.conf
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_conf_file = /usr/cygnus/conf/krb5.conf
• https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/operation/hdfs_kerberos_authentication.md
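A quick way to verify that the configured user actually has a registered principal is the standard Kerberos client tooling; a minimal sketch (the EXAMPLE.COM realm is hypothetical):

$ kinit krb5_username
Password for krb5_username@EXAMPLE.COM:
$ klist
Default principal: krb5_username@EXAMPLE.COM
…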
Current HDFS cluster vs. Infinity
• Currently:
  – Hadoop cluster combining HDFS storage and MapReduce computing in the same nodes
  – 10 virtual nodes
  – 1 TB capacity (333 GB real capacity, replicas=3)
  – Default security
  – http://cosmos.lab.fi-ware.org/cosmos-gui
• Infinity:
  – Specific HDFS cluster for storage
  – 6 physical nodes (+ another 6 planned)
  – 20 TB capacity (6.7 TB real capacity, replicas=3)
  – Fully IdM-integrated (OAuth2)
  – Support for CKAN for Open Big Data
Further reading
• Cosmos@FIWARE catalogue
  – http://catalogue.fiware.org/enablers/bigdata-analysis-cosmos
• Cygnus@github
  – https://github.com/telefonicaid/fiware-connectors
• This presentation@slideshare
  – http://es.slideshare.net/FranciscoRomeroBueno/fiwarefdwbigdataday1v1
• Exercises@slideshare
  – http://es.slideshare.net/FranciscoRomeroBueno/fiwarefdwbigdataexercisesday1v1