Sahara presentation latest - Codemotion Rome 2015

ROME 27-28 March 2015

Dive into Sahara

Davide Del Vecchio Francesco Vollero Matteo Bernacchi

March 27, 2015


Davide Del Vecchio

•Principal Domain Architect Middleware

•Previous experience with analytics and Big Data

•Background in Science

•Passionate about technology

Who are we

Francesco Vollero

● OpenStack Technical Specialist in EMEA

● Developer background - in OpenStack since Grizzly

● Core contributor in packstack, openstack-puppet

● Snooping other OpenStack components like Sahara

● Functional programming brain oriented :)

Matteo Bernacchi

•Senior Infrastructure Consultant

•Experienced in cloud solutions deployment

•Supporter of FOSS technologies since 2003


•An introduction to Big Data

•An overview of the OpenStack components

•A (Moderately) Brief Introduction to Sahara

•Sahara in action

Agenda


Everything You Ever Wanted to Know About Big Data But Only Had About 20 Minutes to Learn


Insert some very Big Data here …

What is it

•Something you cannot drag 'n' drop

•Something you cannot hope to process in a reasonable amount of time on your own machines

•Something that needs purpose-built algorithms to work with


It is not just a matter of volume ...

There are many other key aspects

•Data must be processed in a small time frame

•Data sets differ from traditional relational and non-relational data, and include machine and social data

•The large availability of open source computational and mathematical tools goes beyond academia

•It's the second iteration of the feedback process of open source tools that are now available as a commodity

•Data visualization tools are an accelerator for the movement


How do I commoditize Big Data


-2004: MapReduce Whitepaper (Google)

- Described the MapReduce algorithm

- Kind of a big deal

-Many were already doing this; it's a very basic prescription

-Specification for easy extensibility

-THIS was the big deal

-Google's vision for clean extension points and design drove the Big Data movement

A Bit of History: MapReduce


-2007: Apache Hadoop

-First and still most significant OSS Big Data engine

-Originally built by Yahoo!

-“Hadoop” now used to refer both to Hadoop itself and the large ecosystem of supporting technologies

-Dominant in the market now, but there are new contenders

-Named after a developer's son's stuffed elephant

A Bit of History: Hadoop


MapReduce: What Does It Do

•MAP

•Iterate over records

•Emit (0, 1, or n) key-value pairs for each

•Word Count:

•Input: “Let's reduce map reduce”

•Output: (“Let's”: 1), (“reduce”: 1), (“map”: 1), (“reduce”: 1)

•REDUCE

•Gather all the KVPs for each key together

•Apply some function to all of each key's values and emit something for each key

•Word Count:

•Input: {“Let's”: [1], “map”: [1], “reduce”: [1, 1]}

•Output: {“Let's”: 1, “map”: 1, “reduce”: 2}
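The word count above can be sketched in a few lines of Python. This is a minimal single-process illustration, not Hadoop code; the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are made up for this example.

```python
from collections import defaultdict


def map_words(record):
    """MAP: emit one (word, 1) key-value pair per word in the record."""
    for word in record.split():
        yield (word, 1)


def shuffle(pairs):
    """Gather all the values emitted for each key into one list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(key, values):
    """REDUCE: apply a function (here, sum) to all of a key's values."""
    return (key, sum(values))


def word_count(records):
    """Run map, shuffle and reduce in sequence over all records."""
    pairs = [pair for record in records for pair in map_words(record)]
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


counts = word_count(["Let's reduce map reduce"])
# counts == {"Let's": 1, "reduce": 2, "map": 1}
```

In a real cluster the map calls run on many nodes and the shuffle moves data over the network; only the shape of the computation is the same.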


So... It's... GROUP BY.

•Yes, it is kinda GROUP BY.

•You are now authorized to laugh at Big Data engineers.

•It is, however, VERY easy to parallelize.

•M Mappers can be run against any amount of data on any number of nodes, in small chunks

•N Reducers only have to deal with the data for any one key at a time


MapReduce Extension Points(Per Hadoop MapReduce Interface)

•An Input Reader

•Divides data into “splits” (1 per mapper)

•Usually 16-128MB

•A Map Function

•A Combiner Function

•Just a reduce function within a mapper process

•With a combiner, mappers only emit one KVP per key

•A Partition Function

•Determines which key goes to which reducer

•Default is hash(key) % len(reducers)

•(Optional) A Compare Function

•Orders final output

•A Reduce Function

•An Output Writer

•By default, writes one file per reducer and just dumps text
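The combiner and partition extension points can be illustrated with a small Python sketch. It is not the Hadoop interface: `partition`, `map_with_combiner` and `run` are hypothetical names, and `zlib.crc32` stands in for `hash(key)` only because Python's built-in string hash changes between interpreter runs.

```python
import zlib
from collections import Counter


def partition(key, num_reducers):
    """Partition function: stable stand-in for hash(key) % len(reducers)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_with_combiner(split):
    """Map one split, then combine locally so only one pair per key is emitted."""
    return list(Counter(split.split()).items())


def run(splits, num_reducers):
    """Route combined pairs to reducers, then sum the partial counts per key."""
    reducer_inputs = [{} for _ in range(num_reducers)]
    for split in splits:
        for key, partial_count in map_with_combiner(split):
            target = reducer_inputs[partition(key, num_reducers)]
            target.setdefault(key, []).append(partial_count)
    # Each reducer writes its own output, like one file per reducer.
    return [{key: sum(parts) for key, parts in inputs.items()}
            for inputs in reducer_inputs]


outputs = run(["map reduce reduce", "reduce map"], num_reducers=2)
```

Because the partition depends only on the key, every partial count for a given word ends up at the same reducer, whichever mapper produced it.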


MapReduce Abstraction Layers

•Hive (SQL-like)

•DROP TABLE IF EXISTS words;

•CREATE TABLE words( text string ) row format delimited fields terminated by '\n' stored as textfile;

•LOAD DATA LOCAL INPATH 'data_path' OVERWRITE INTO TABLE words;

•SELECT word, COUNT(*) FROM words LATERAL VIEW explode(split(text,' ')) lTable AS word GROUP BY word;

•Pig (relational flow)

•raw_input = LOAD './input.txt';

•words = FOREACH raw_input GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;

•grouped = GROUP words BY word;

•counted = FOREACH grouped GENERATE group, COUNT(words);

•STORE counted INTO './wordcount';


Hadoop: HDFS

•Hadoop Distributed File System

•Large block size

•128MB default

•Replication

•3 default, 512 max

•Strictly separate from logic: can be used with any algorithm

•Giraph: Graph Processing

•Mahout: Machine Learning

•The name node tracks data blocks and replication

•Data nodes hold data
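To get a feel for what the default block size and replication factor mean in practice, here is a small back-of-the-envelope calculation in Python. The function name `hdfs_footprint` is hypothetical; the 128MB and 3 defaults are the ones quoted above, and the sketch assumes the last block only occupies the bytes it actually holds.

```python
import math

BLOCK_SIZE_MB = 128  # default block size quoted on the slide
REPLICATION = 3      # default replication factor quoted on the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Estimate how many blocks the name node tracks and how much raw disk is used."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": blocks,                             # entries tracked by the name node
        "block_replicas": blocks * replication,       # copies spread over the data nodes
        "raw_storage_mb": file_size_mb * replication,
    }


footprint = hdfs_footprint(1000)
# A 1000 MB file -> 8 blocks, 24 block replicas, 3000 MB of raw storage
```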


Hadoop: Data Processing

•Master node tasks (the JobTracker daemon, often co-located with the Namenode)

•Breaks jobs (whole dataset) into tasks (one mapper or reducer)

•Assigns tasks to data nodes

•Tracks progress to completion

•Retries failed tasks a configurable number of times

•Allows Hadoop clusters to be run on error-prone commodity hardware

•Data node tasks (the TaskTracker daemon running on each Datanode)

•Tracks its own map and reduce tasks

•Transfers data to other nodes as needed

•Each data node has slots for map and reduce tasks (to be run in JVMs)
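The "retry failed tasks a configurable number of times" behaviour is what makes unreliable hardware tolerable. A minimal Python sketch of the idea follows; `run_job` and `make_flaky_task` are hypothetical helpers for illustration, and the `max_attempts` value is just an example setting, not a statement about any particular Hadoop configuration.

```python
def make_flaky_task(failures):
    """Build a task that fails `failures` times before succeeding (simulated node errors)."""
    state = {"calls": 0}

    def task():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise IOError("simulated node failure")
        return f"done after {state['calls']} attempts"

    return task


def run_job(tasks, max_attempts=4):
    """Run every task, retrying each one up to max_attempts times before giving up."""
    results = {}
    for task_id, task in tasks.items():
        for attempt in range(1, max_attempts + 1):
            try:
                results[task_id] = task()
                break  # task succeeded, move on to the next one
            except Exception as exc:
                if attempt == max_attempts:
                    raise RuntimeError(
                        f"task {task_id} failed {max_attempts} times") from exc
    return results


job_results = run_job({"map-0": make_flaky_task(2), "reduce-0": make_flaky_task(0)})
```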


Hadoop: The Ecosystem

•Oozie: Workflow manager (chained jobs)

•Data pipelining: Flume, Scribe, Kafka

•RDBMS integration: Sqoop

•Tabular interface for unstructured data: HCatalog

•M/R Abstraction: Pig, Hive

•SO MANY OTHERS


OpenStack: take a look at the best place to host your Big Data platform




Why does the world need OpenStack?

● Cloud is widely seen as the next-generation IT delivery model

o Agile & Flexible

o Utility-based on-demand consumption

o Self-service driving down administrative overhead and maintenance

● Public clouds are setting the benchmark of how IT could be delivered to users

o Not all organisations are ready for public cloud

● Applications are being written differently today

o More tolerant of failure

o Making use of scale-out architecture


● Our data is too large

o Volumes of data are being generated at unprecedented levels

o Most of this data is unstructured

● Service requests are too large

o More and more devices are coming online

o Tablets, phones, laptops, BYOD generation…

● Crucially, applications weren’t written to cope with the demand!

o Traditional infrastructure capabilities are being exhausted

o Service uptime, QoS, KPIs and SLAs are slipping

Major issues with traditional infrastructure…


Workloads are evolving…

Traditional workloads

● Typically each tier resides on a single machine

● Doesn’t tolerate any downtime

● Relies on underlying infrastructure for availability

● Applications scale-up, not out

Cloud-enabled workloads

● Workload resides across multiple machines

● Applications built to tolerate failure

● Does not rely on underlying infrastructure

● Applications scale-out, not up


Or an easier analogy...

PETS = TRADITIONAL WORKLOADS

● Pets are given names like lasy.internal.redhat.com

● They are unique, lovingly hand raised and cared for

● When they get ill you nurse them back to health

FARM ANIMALS = CLOUD WORKLOADS

● Farm animals have tag numbers like piggie242.redhat.com

● They are almost identical to each other

● When they get ill you get another one


OpenStack is typically suitable for the following use cases:

● A public cloud-like Infrastructure-as-a-Service cloud platform

o Internal “Infrastructure on Demand” - private cloud

o Test and Development environments - e.g. sandbox

o Cloud service provider platform - reselling compute, network & storage

● Building a scale-out platform for cloud-enabled workloads

o Web-scale applications, e.g. Netflix-like, photo/video-streaming

o Academic or pharma workloads, e.g. genetic sequencing

So, how does OpenStack fit in?


•OpenStack is made up of individual autonomous components

•All of which are designed to scale-out to accommodate throughput and availability

•OpenStack is considered more of a framework, that relies on drivers and plugins

•Largely written in Python and is heavily dependent on Linux

OpenStack Architecture


• Keystone provides a common authentication and authorisation store for OpenStack

• Responsible for users, their roles, and the project(s) to which they belong

• Provides a catalogue of all other OpenStack services

• All OpenStack services typically rely on Keystone to verify a user’s request

OpenStack Identity Service (Keystone)


• Nova is responsible for the lifecycle of running instances within OpenStack

• Manages multiple different hypervisor types via drivers, e.g.

•Red Hat Enterprise Linux (+KVM)

•VMware vSphere

OpenStack Compute (Nova)


•Glance provides a mechanism for the storage and retrieval of disk images/templates

•Supports a wide variety of image formats, including qcow2, vmdk, ami, vhd and ova

•Many different backend storage options for images, including Swift…

OpenStack Image Service (Glance)


• Swift provides a mechanism for storing and retrieving arbitrary unstructured data

• Provides an object based interface via a RESTful/HTTP-based API

• Highly fault-tolerant with replication, self-healing, and load-balancing

• Architected to be implemented using commodity compute and storage

OpenStack Object Store (Swift)


• Neutron is responsible for providing networking to running instances within OpenStack

• Provides an API for defining, configuring, and using networks

• Relies on a plugin architecture for implementation of networks, examples include:

•Open vSwitch (default in Red Hat’s distribution)

•Cisco, PLUMgrid, VMware NSX, Arista, Mellanox, Brocade, etc.

OpenStack Networking (Neutron)


• Cinder provides block storage to instances running within OpenStack

• Used for providing persistent and/or additional storage

• Relies on a plugin/driver architecture for implementation, examples include:

• Red Hat Storage (GlusterFS), IBM XIV, HP LeftHand, 3PAR, etc.

OpenStack Volume Service (Cinder)


• Heat facilitates the creation of ‘application stacks’ made from multiple resources

• Stacks are imported as templates written in a descriptive template language

• Heat manages the automated orchestration of resources and their dependencies

• Allows for dynamic scaling of applications based on configurable metrics

OpenStack Orchestration (Heat)


• Ceilometer is a central collection point for metering and monitoring data

• Primarily used for chargeback of resource usage

• Ceilometer consumes data from the other components - e.g. via agents

• Architecture is completely extensible - meter what you want to - expose via API

OpenStack Telemetry (Ceilometer)


• Horizon is OpenStack’s web-based self-service portal

• Sits on-top of all of the other OpenStack components via API interaction

• Provides a subset of underlying functionality

• Examples include: instance creation, network configuration, block storage attachment

• Exposes an administrative extension for basic tasks, e.g. user creation

OpenStack Dashboard (Horizon)


• All OpenStack components expose a RESTful API for communication

• A stateless, shared-nothing API service provides scalability and fault-tolerance

• Keystone manages a list of these API endpoints in its catalog
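The combination of an endpoint catalog and stateless API services can be sketched as follows. This is a toy model, not Keystone or any real load balancer: `ServiceCatalog` and `RoundRobinBalancer` are hypothetical classes, and the host names and port are simply the ones shown on the accompanying diagram.

```python
import itertools


class ServiceCatalog:
    """Toy stand-in for the catalog of API endpoints that Keystone manages."""

    def __init__(self):
        self._endpoints = {}

    def register(self, service, url):
        self._endpoints[service] = url

    def lookup(self, service):
        return self._endpoints[service]


class RoundRobinBalancer:
    """Spread requests over identical, stateless API servers."""

    def __init__(self, backends):
        self._backends = itertools.cycle(backends)

    def pick(self):
        return next(self._backends)


catalog = ServiceCatalog()
catalog.register("compute", "http://server0:8773")  # the address clients are given
balancer = RoundRobinBalancer(["server1:8773", "server2:8773", "server3:8773"])

endpoint = catalog.lookup("compute")
first_four = [balancer.pick() for _ in range(4)]
```

Because the API servers share no state, any backend can answer any request, so adding capacity is a matter of adding backends behind the published endpoint.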

Common OpenStack Architecture



[Diagram: “Where’s Nova?” resolves to http://server0:8773, a load balancer (LB) in front of server1:8773, server2:8773 and server3:8773]


• In addition to providing API services, each component has a set of workers

• These workers actually do the heavy lifting behind the scenes

• Workers (and API services) scale-out and communicate using a message bus (RabbitMQ)

• Example with Nova:


[Diagram: Nova API and three Nova Compute workers connected through RabbitMQ (AMQP)]
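The API-plus-workers pattern can be imitated in a single process with Python's standard library. This sketch uses an in-memory `queue.Queue` as a stand-in for RabbitMQ and threads as stand-ins for Nova Compute workers; `compute_worker` and `dispatch` are hypothetical names, and nothing here talks to a real message broker.

```python
import queue
import threading


def compute_worker(name, bus, results):
    """Worker loop: take messages off the bus until a None sentinel arrives."""
    while True:
        message = bus.get()
        if message is None:
            break
        results.append((name, message))  # the "heavy lifting" would happen here


def dispatch(requests, num_workers=3):
    """API side: publish requests on the bus and let the workers consume them."""
    bus = queue.Queue()
    results = []
    workers = [
        threading.Thread(target=compute_worker, args=(f"nova-compute-{i}", bus, results))
        for i in range(num_workers)
    ]
    for worker in workers:
        worker.start()
    for request in requests:
        bus.put(request)
    for _ in workers:
        bus.put(None)  # one shutdown sentinel per worker
    for worker in workers:
        worker.join()
    return results


handled = dispatch(["boot vm-a", "boot vm-b", "boot vm-c", "boot vm-d"])
```

The API side never addresses a specific worker; scaling out means starting more consumers on the same bus.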






• OpenStack services store state information in a SQL-based database, default is MySQL

• Each service can use its own database infrastructure or share a common platform

• For resilience and throughput, replicated multi-master databases can be implemented

• Example with Keystone:


[Diagram: Keystone Server behind a load balancer (LB), with multi-master replication using Galera]


• OpenStack services check a user's request with Keystone for both authentication and authorisation



[Diagram: on a “Launch an Instance” request, Nova API asks the Keystone Server: 1) Are they authenticated? 2) Are they allowed to launch an instance? Keystone answers Success/Fail]
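The two questions a service asks of Keystone can be written down as a tiny Python sketch. It is a deliberately simplified model with in-memory dictionaries; `USERS`, `POLICY`, `authenticate`, `authorize` and `launch_instance` are hypothetical names and do not reflect Keystone's real token or policy mechanisms.

```python
# Hypothetical in-memory identity store and policy, for illustration only.
USERS = {
    "alice": {"password": "s3cret", "roles": {"member"}},
    "bob": {"password": "hunter2", "roles": {"auditor"}},
}
POLICY = {"launch_instance": {"member", "admin"}}


def authenticate(username, password):
    """Question 1: are they who they claim to be?"""
    user = USERS.get(username)
    return user is not None and user["password"] == password


def authorize(username, action):
    """Question 2: does any of their roles allow this action?"""
    allowed_roles = POLICY.get(action, set())
    return bool(USERS[username]["roles"] & allowed_roles)


def launch_instance(username, password):
    """What the compute API does before acting on the request."""
    if not authenticate(username, password):
        return "Fail"
    if not authorize(username, "launch_instance"):
        return "Fail"
    return "Success"
```

For example, `launch_instance("alice", "s3cret")` succeeds, while `bob` authenticates correctly but fails the authorisation step.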


OpenStack Architecture



OpenStack Sahara, or what we were supposed to talk about today


Hadoop without Sahara: the challenges

•Hadoop clusters are difficult to configure and few have the expert knowledge to do it well

•Commodity hardware is cheap but requires frequent (costly, expert) maintenance

•Demand for data processing varies over time, even with sophisticated scheduling

•Baremetal Hadoop cluster nodes can fail, leading to a loss of service

•Many public Big Data services don't give you flexibility


Hadoop with Sahara: beat the challenges

•OpenStack Sahara lets you:

•Deploy Hadoop clusters (predictable and repeatable)

•Scale the deployed clusters

•Define and run jobs

•Use a programmatic API and a web console

•Furthermore:

•It supports many Hadoop distributions

•It is well integrated with other OpenStack services

•It enables the use of Hadoop even with little knowledge about it


Sahara: the project

History:

•Started at Portland Summit

•Incubated in Icehouse

•Integrated in Juno

Main components:

•Sahara REST API

•Python REST Client and Sahara Pages (Integrated with Horizon)

•Elastic Data Processing

•Provisioning Engine

•Vendor Plugins (Vanilla, Intel, Hortonworks, Cloudera, MapR)


Sahara: Architecture


Sahara: Use cases

•Cluster Management (API V1.0)

•On-demand, scalable, persistent clusters

•Supports multiple plugins

•Integrates with Heat, Glance, Nova, Neutron, and Cinder

•EDP (Elastic Data Processing) (API V1.1)

•Supports multiple job types (Java, MR, Hive, Pig, Spark...)

•Supports transient clusters (spin up, process, shut down) or persistent clusters

•Integrates with Swift (optionally) and services on VMs
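The transient cluster idea (spin up, process, shut down) maps naturally onto a context manager. The sketch below only records the lifecycle steps in a list; `transient_cluster` and `run_job` are hypothetical helpers, the cluster and job names are invented, and no Sahara API is called.

```python
from contextlib import contextmanager


@contextmanager
def transient_cluster(name, node_count, log):
    """Spin a cluster up for the duration of the block, then always shut it down."""
    log.append(f"spin up {name} ({node_count} nodes)")
    try:
        yield {"name": name, "nodes": node_count}
    finally:
        log.append(f"shut down {name}")


def run_job(cluster, job_type, log):
    """Pretend to run one EDP-style job (e.g. Pig or Hive) on the cluster."""
    log.append(f"run {job_type} job on {cluster['name']}")


events = []
with transient_cluster("wordcount-cluster", node_count=4, log=events) as cluster:
    run_job(cluster, "Pig", events)
# events now lists the three lifecycle steps in order
```

A persistent cluster would simply skip the shutdown step and accept further jobs; the `finally` clause is what guarantees that a transient one never outlives its job.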


Sahara: end-user workflow



Questions ?
