ROME 27-28 march 2015 - Speaker’s name Dive into Sahara Davide Del Vecchio Francesco Vollero Matteo Bernacchi March 27, 2015
Page 1: Sahara presentation latest - Codemotion Rome 2015

ROME 27-28 march 2015 - Speaker’s name

Dive into Sahara

Davide Del Vecchio Francesco Vollero Matteo Bernacchi

March 27, 2015

Page 2: Sahara presentation latest - Codemotion Rome 2015

Who are we

Davide Del Vecchio

•Principal Domain Architect Middleware

•Previous experience with analytics and Big Data

•Background in Science

•Passionate about technology

Francesco Vollero

● OpenStack Technical Specialist in EMEA

● Developer background - in OpenStack since Grizzly

● Core contributor in packstack, openstack-puppet

● Snooping other OpenStack components like Sahara

● Functional programming brain oriented :)

Matteo Bernacchi

•Senior Infrastructure Consultant

•Experienced in cloud solutions deployment

•Supporter of FOSS technologies since 2003

Page 3: Sahara presentation latest - Codemotion Rome 2015

Agenda

•An introduction to Big Data

•An overview of the OpenStack components

•A (Moderately) Brief Introduction to Sahara

•Sahara in action

Page 4: Sahara presentation latest - Codemotion Rome 2015

Everything You Ever Wanted to Know About Big Data But Only Had About 20 Minutes to Learn

Page 5: Sahara presentation latest - Codemotion Rome 2015

Insert some very Big Data here …

What is it

•Something you cannot drag'n drop

•Something you cannot hope to process in a reasonable amount of time on your machines

•Something that needs purpose-built algorithms to work with

Page 6: Sahara presentation latest - Codemotion Rome 2015

It is not just a matter of volume ...

There are many other key aspects

•Data must be processed in a small time frame

•Data sets differ from traditional relational and non-relational ones, and include machine and social data

•The wide availability of open source computational and mathematical tools now extends beyond academia

•It's the second iteration of the feedback process of open source tools that are now available as a commodity

•Data visualization tools are an accelerator of the movement

Page 7: Sahara presentation latest - Codemotion Rome 2015

How do I commoditize Big Data

Page 8: Sahara presentation latest - Codemotion Rome 2015

A Bit of History: MapReduce

-2004: MapReduce Whitepaper (Google)

  -Described the MapReduce algorithm

    -Kind of a big deal

    -Many were already doing this; it's a very basic prescription

  -Specification for easy extensibility

    -THIS was the big deal

    -Google's vision for clean extension points and design drove the Big Data movement

Page 9: Sahara presentation latest - Codemotion Rome 2015

A Bit of History: Hadoop

-2007: Apache Hadoop

  -First and still most significant OSS Big Data engine

  -Originally built by Yahoo!

  -“Hadoop” now used to refer both to Hadoop itself and the large ecosystem of supporting technologies

  -Dominant in the market now, but there are new contenders

  -Named after a developer's son's stuffed elephant

Page 10: Sahara presentation latest - Codemotion Rome 2015

MapReduce: What Does It Do

•MAP

  •Iterate over records

  •Emit (0, 1, or n) key-value pairs for each

  •Word Count:

    •Input: “Let's reduce map reduce”

    •Output: (“Let's”: 1), (“reduce”: 1), (“map”: 1), (“reduce”: 1)

•REDUCE

  •Gather all the KVPs for each key together

  •Apply some function to all of each key's values and emit something for each key

  •Word Count:

    •Input: {“Let's”: [1], “map”: [1], “reduce”: [1, 1]}

    •Output: {“Let's”: 1, “map”: 1, “reduce”: 2}
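
The word count on this slide can be sketched in a few lines of plain Python. This is a minimal, single-process illustration and not Hadoop code; the function names `map_words`, `shuffle` and `reduce_counts` are made up for the example.

```python
from collections import defaultdict


def map_words(record):
    """MAP: emit one (word, 1) key-value pair per word in the record."""
    return [(word, 1) for word in record.split()]


def shuffle(pairs):
    """Gather all the values emitted for each key together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(grouped):
    """REDUCE: apply sum() to each key's values and emit one result per key."""
    return {key: sum(values) for key, values in grouped.items()}


pairs = map_words("Let's reduce map reduce")
grouped = shuffle(pairs)
counts = reduce_counts(grouped)
print(counts)
```

The intermediate `grouped` dictionary corresponds to the reducer input shown on the slide, and `counts` to the reducer output.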

Page 11: Sahara presentation latest - Codemotion Rome 2015

So... It's... GROUP BY.

•Yes, it is kinda GROUP BY.

•You are now authorized to laugh at Big Data engineers.

•It is, however, VERY easy to parallelize.

  •M Mappers can be run against any amount of data on any number of nodes, in small chunks

  •N Reducers only have to deal with the data for any one key at a time
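
To make the parallelism point concrete, here is a rough sketch in which threads stand in for nodes. That substitution is an assumption made purely for illustration; `count_chunk` and `parallel_word_count` are hypothetical names, and a real cluster distributes chunks across machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_chunk(lines):
    """One mapper: count the words in its own small chunk of the data."""
    return Counter(word for line in lines for word in line.split())


def parallel_word_count(lines, mappers=4, chunk_size=2):
    # Split the input into small chunks that can be processed independently.
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with ThreadPoolExecutor(max_workers=mappers) as pool:
        partials = list(pool.map(count_chunk, chunks))
    # Reduce step: merge the per-chunk counts; each key is summed on its own.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


lines = ["map reduce", "reduce map reduce", "group by", "map"]
result = parallel_word_count(lines)
print(result)
```

Because no mapper needs to see another mapper's chunk, the number of workers can grow with the data without changing the logic.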

Page 12: Sahara presentation latest - Codemotion Rome 2015

MapReduce Extension Points (Per Hadoop MapReduce Interface)

•An Input Reader

  •Divides data into “splits” (1 per mapper)

  •Usually 16-128MB

•A Map Function

•A Combiner Function

  •Just a reduce function within a mapper process

  •With a combiner, mappers only emit one KVP per key

•A Partition Function

  •Determines which key goes to which reducer

  •Default is hash(key) % len(reducers)

•(Optional) A Compare Function

  •Orders final output

•A Reduce Function

•An Output Writer

  •By default, writes one file per reducer and just dumps text
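
Two of these extension points, the combiner and the partition function, are easy to sketch in Python. This is an illustration only and not the Hadoop interface. The use of `zlib.crc32` as the hash is an assumption of this sketch, chosen because Python's built-in `hash()` for strings varies between interpreter runs.

```python
import zlib
from collections import Counter


def combine(pairs):
    """Combiner: a local reduce inside the mapper, leaving one KVP per key."""
    combined = Counter()
    for key, value in pairs:
        combined[key] += value
    return sorted(combined.items())


def partition(key, num_reducers):
    """Pick a reducer for a key: hash(key) % number of reducers.

    crc32 is an assumption of this sketch; it gives a stable hash across runs.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


mapper_output = [("reduce", 1), ("map", 1), ("reduce", 1)]
combined = combine(mapper_output)

# Route each combined pair to the reducer chosen by the partition function.
buckets = {reducer: [] for reducer in range(3)}
for key, value in combined:
    buckets[partition(key, 3)].append((key, value))
print(combined, buckets)
```

The same key always lands on the same reducer, which is what lets each reducer handle one key's data in isolation.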

Page 13: Sahara presentation latest - Codemotion Rome 2015

MapReduce Abstraction Layers

•Hive (SQL-like)

  •DROP TABLE IF EXISTS words;

  •CREATE TABLE words( text string ) row format delimited fields terminated by '\n' stored as textfile;

  •LOAD DATA LOCAL INPATH 'data_path' OVERWRITE INTO TABLE words;

  •SELECT word, COUNT(*) FROM words LATERAL VIEW explode(split(text,' ')) lTable AS word GROUP BY word;

•Pig (relational flow)

  •raw_input = LOAD './input.txt';

  •words = FOREACH raw_input GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;

  •grouped = GROUP words BY word;

  •counted = FOREACH grouped GENERATE group, COUNT(words);

  •STORE counted INTO './wordcount';

Page 14: Sahara presentation latest - Codemotion Rome 2015

Hadoop: HDFS

Hadoop Distributed File System

•Large block size

  •128MB default

•Replication

  •3 default, 512 max

•Strictly separate from logic; can be used with any algo

  •Giraph: Graph Processing

  •Mahout: Machine Learning

•The name node tracks data blocks and replication

•Data nodes hold data
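
As a quick worked example of what the block size and replication defaults imply, the sketch below estimates how many blocks a file is split into and how much raw disk it consumes. The helper `hdfs_footprint` and the 1000 MB file are hypothetical. The calculation assumes the last, partially filled block only occupies the bytes it actually holds.

```python
import math

BLOCK_SIZE_MB = 128   # default block size quoted on the slide
REPLICATION = 3       # default replication factor quoted on the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block_replicas, raw_storage_mb) for a single file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # Raw storage is file size times replication, not blocks times block size,
    # because the final block is only as large as the data left over.
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb


blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)
```

For the 1000 MB file this gives 8 blocks, 24 block replicas for the name node to track, and 3000 MB of raw storage spread over the data nodes.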

Page 15: Sahara presentation latest - Codemotion Rome 2015

Hadoop: Data Processing

•Namenode tasks

  •Breaks jobs (whole dataset) into tasks (one mapper or reducer)

  •Assigns tasks to data nodes

  •Tracks progress to completion

  •Retries failed tasks a configurable number of times

  •Allows Hadoop clusters to be run on error-prone commodity hardware

•Datanode tasks

  •Tracks its own map and reduce tasks

  •Transfers data to other nodes as needed

  •Each data node has slots for map and reduce tasks (to be run in JVMs)
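
The retry behaviour is what makes error-prone hardware tolerable, and it can be sketched as follows. This is not Hadoop's scheduler: `run_with_retries` and `make_flaky_task` are hypothetical helpers, and the default of 4 attempts is an arbitrary value chosen for the sketch.

```python
def run_with_retries(task, max_attempts=4):
    """Run task() until it succeeds or max_attempts is reached.

    max_attempts plays the role of the configurable retry limit.
    """
    failures = []
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except RuntimeError as exc:
            failures.append(str(exc))
    raise RuntimeError(f"task failed {max_attempts} times: {failures}")


def make_flaky_task(fail_times):
    """Hypothetical task that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def task():
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise RuntimeError(f"node error on call {state['calls']}")
        return "done"

    return task


result, attempts = run_with_retries(make_flaky_task(fail_times=2))
print(result, attempts)
```

A task that fails twice still completes on the third attempt; only a task that exhausts the limit fails the job.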

Page 16: Sahara presentation latest - Codemotion Rome 2015

Hadoop: The Ecosystem

•Oozie: Workflow manager (chained jobs)

•Data pipelining: Flume, Scribe, Kafka

•RDBMS integration: Sqoop

•Tabular interface for unstructured data: HCatalog

•M/R Abstraction: Pig, Hive

•SO MANY OTHERS

Page 17: Sahara presentation latest - Codemotion Rome 2015

OpenStack: take a look at the best place to host your Big Data platform

Page 18: Sahara presentation latest - Codemotion Rome 2015

Page 19: Sahara presentation latest - Codemotion Rome 2015

Why does the world need OpenStack?

● Cloud is widely seen as the next-generation IT delivery model

  o Agile & Flexible

  o Utility-based on-demand consumption

  o Self-service driving down administrative overhead and maintenance

● Public clouds are setting the benchmark of how IT could be delivered to users

  o Not all organisations are ready for public cloud

● Applications are being written differently today

  o More tolerant of failure

  o Making use of scale-out architecture

Page 20: Sahara presentation latest - Codemotion Rome 2015

Major issues with traditional infrastructure…

● Our data is too large

  o Volumes of data are being generated at unprecedented levels

  o Most of this data is unstructured

● Service requests are too large

  o More and more devices are coming online

  o Tablets, phones, laptops, BYOD generation…

● Crucially, applications weren’t written to cope with the demand!

  o Traditional infrastructure capabilities are being exhausted

  o Service uptime, QoS, KPIs and SLAs are slipping

Page 21: Sahara presentation latest - Codemotion Rome 2015

Workloads are evolving…

Traditional workloads

● Typically each tier resides on a single machine

● Doesn’t tolerate any downtime

● Relies on underlying infrastructure for availability

● Applications scale-up, not out

Cloud-enabled Workloads

● Workload resides across multiple machines

● Applications built to tolerate failure

● Does not rely on underlying infrastructure

● Applications scale-out, not up

Page 22: Sahara presentation latest - Codemotion Rome 2015

Or an easier analogy...

PETS = TRADITIONAL WORKLOADS

● Pets are given names like lasy.internal.redhat.com

● They are unique, lovingly hand raised and cared for

● When they get ill you nurse them back to health

FARM ANIMALS = CLOUD WORKLOADS

● Farm animals have tag numbers like piggie242.redhat.com

● They are almost identical to each other

● When they get ill you get another one

Page 23: Sahara presentation latest - Codemotion Rome 2015

So, how does OpenStack fit in?

OpenStack is typically suitable for the following use cases:

● A public cloud-like Infrastructure-as-a-Service cloud platform

  o Internal “Infrastructure on Demand” - private cloud

  o Test and Development environments - e.g. sandbox

  o Cloud service provider platform - reselling compute, network & storage

● Building a scale-out platform for cloud-enabled workloads

  o Web-scale applications, e.g. NetFlix-like, photo/video-streaming

  o Academic or pharma workloads, e.g. genetic sequencing

Page 24: Sahara presentation latest - Codemotion Rome 2015

OpenStack Architecture

•OpenStack is made up of individual autonomous components

•All of which are designed to scale-out to accommodate throughput and availability

•OpenStack is considered more of a framework, that relies on drivers and plugins

•Largely written in Python and is heavily dependent on Linux

Page 25: Sahara presentation latest - Codemotion Rome 2015

OpenStack Identity Service (Keystone)

• Keystone provides a common authentication and authorisation store for OpenStack

• Responsible for users, their roles, and the project(s) to which they belong

• Provides a catalogue of all other OpenStack services

• All OpenStack services typically rely on Keystone to verify a user’s request

Page 26: Sahara presentation latest - Codemotion Rome 2015

OpenStack Compute (Nova)

• Nova is responsible for the lifecycle of running instances within OpenStack

• Manages multiple different hypervisor types via drivers, e.g.

  •Red Hat Enterprise Linux (+KVM)

  •VMware vSphere
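
The driver idea can be illustrated with a small sketch: the service core calls one interface, and a pluggable driver does the hypervisor-specific work. The classes below are hypothetical and are not Nova's real driver API.

```python
from abc import ABC, abstractmethod


class HypervisorDriver(ABC):
    """Hypothetical driver interface; not Nova's real driver API."""

    @abstractmethod
    def spawn(self, name):
        """Start an instance and return a description of it."""


class KvmDriver(HypervisorDriver):
    def spawn(self, name):
        return f"kvm:{name}"


class VsphereDriver(HypervisorDriver):
    def spawn(self, name):
        return f"vsphere:{name}"


# Registry of available drivers, selected by a configuration value.
DRIVERS = {"kvm": KvmDriver, "vsphere": VsphereDriver}


def boot_instance(driver_name, instance_name):
    """The calling code stays the same; only the driver changes."""
    driver = DRIVERS[driver_name]()
    return driver.spawn(instance_name)


booted = [boot_instance("kvm", "web-1"), boot_instance("vsphere", "web-2")]
print(booted)
```

The same pattern applies to the other components in this talk that rely on plugins or drivers.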

Page 27: Sahara presentation latest - Codemotion Rome 2015

OpenStack Image Service (Glance)

•Glance provides a mechanism for the storage and retrieval of disk images/templates

•Supports a wide variety of image formats, including qcow2, vmdk, ami, vhd and ova

•Many different backend storage options for images, including Swift…

Page 28: Sahara presentation latest - Codemotion Rome 2015

OpenStack Object Store (Swift)

• Swift provides a mechanism for storing and retrieving arbitrary unstructured data

• Provides an object based interface via a RESTful/HTTP-based API

• Highly fault-tolerant with replication, self-healing, and load-balancing

• Architected to be implemented using commodity compute and storage

Page 29: Sahara presentation latest - Codemotion Rome 2015

OpenStack Networking (Neutron)

• Neutron is responsible for providing networking to running instances within OpenStack

• Provides an API for defining, configuring, and using networks

• Relies on a plugin architecture for implementation of networks, examples include:

  •Open vSwitch (default in Red Hat’s distribution)

  •Cisco, PLUMgrid, VMware NSX, Arista, Mellanox, Brocade, etc.

Page 30: Sahara presentation latest - Codemotion Rome 2015

OpenStack Volume Service (Cinder)

• Cinder provides block storage to instances running within OpenStack

• Used for providing persistent and/or additional storage

• Relies on a plugin/driver architecture for implementation, examples include:

  • Red Hat Storage (GlusterFS), IBM XIV, HP LeftHand, 3PAR, etc.

Page 31: Sahara presentation latest - Codemotion Rome 2015

OpenStack Orchestration (Heat)

• Heat facilitates the creation of ‘application stacks’ made from multiple resources

• Stacks are imported as templates written in a descriptive template language

• Heat manages the automated orchestration of resources and their dependencies

• Allows for dynamic scaling of applications based on configurable metrics
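
Orchestrating resources "and their dependencies" comes down to creating things in a valid order. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9 or later). The stack contents are hypothetical, and this is not how Heat itself is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical stack: each resource maps to the resources it depends on.
stack = {
    "network": set(),
    "volume": set(),
    "server": {"network", "volume"},
    "floating_ip": {"server"},
}


def creation_order(resources):
    """Return an order in which every dependency is created first."""
    return list(TopologicalSorter(resources).static_order())


order = creation_order(stack)
print(order)
```

The server is only created once its network and volume exist, and the floating IP comes last because it depends on the server.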

Page 32: Sahara presentation latest - Codemotion Rome 2015

OpenStack Telemetry (Ceilometer)

• Ceilometer is a central collection point for metering and monitoring data

• Primarily used for chargeback of resource usage

• Ceilometer consumes data from the other components - e.g. via agents

• Architecture is completely extensible - meter what you want to - expose via API
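
Chargeback from metering data is essentially an aggregation, as the sketch below suggests. The sample tuples, meter names and prices are all invented for illustration and do not reflect Ceilometer's data model.

```python
from collections import defaultdict

# Hypothetical metering samples: (project, meter, volume).
samples = [
    ("analytics", "instance_hours", 10),
    ("analytics", "storage_gb_hours", 200),
    ("web", "instance_hours", 4),
]

# Hypothetical price list per metered unit.
rates = {"instance_hours": 0.05, "storage_gb_hours": 0.001}


def chargeback(samples, rates):
    """Sum the metered usage per project and price it."""
    bills = defaultdict(float)
    for project, meter, volume in samples:
        bills[project] += volume * rates[meter]
    return {project: round(total, 2) for project, total in bills.items()}


bills = chargeback(samples, rates)
print(bills)
```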

Page 33: Sahara presentation latest - Codemotion Rome 2015

OpenStack Dashboard (Horizon)

• Horizon is OpenStack’s web-based self-service portal

• Sits on top of all of the other OpenStack components via API interaction

• Provides a subset of underlying functionality

• Examples include: instance creation, network configuration, block storage attachment

• Exposes an administrative extension for basic tasks, e.g. user creation

Page 34: Sahara presentation latest - Codemotion Rome 2015

Common OpenStack Architecture

• All OpenStack components expose a RESTful API for communication

• A stateless, shared-nothing API service provides scalability and fault-tolerance

• Keystone manages a list of these API endpoints in its catalog

Page 35: Sahara presentation latest - Codemotion Rome 2015

Common OpenStack Architecture

[Diagram: “Where’s Nova?” lookup returning http://server0:8773, with a load balancer (LB) and API servers server1:8773, server2:8773 and server3:8773]
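
One reading of this diagram is that the catalog publishes a single endpoint and a load balancer spreads requests over several stateless API servers. The sketch below illustrates that reading only. The `catalog` dictionary and `RoundRobinBalancer` class are hypothetical, and round-robin is an assumed balancing policy.

```python
from itertools import cycle

# Hypothetical catalog: service name -> the single published endpoint.
catalog = {"nova": "http://server0:8773"}


class RoundRobinBalancer:
    """Toy load balancer that rotates requests across stateless backends."""

    def __init__(self, backends):
        self._backends = cycle(backends)

    def route(self):
        return next(self._backends)


def find_endpoint(service):
    """Answer "where is this service?" from the catalog."""
    return catalog[service]


lb = RoundRobinBalancer(["server1:8773", "server2:8773", "server3:8773"])
endpoint = find_endpoint("nova")
routed = [lb.route() for _ in range(4)]
print(endpoint, routed)
```

Because the API servers share no state, any of them can answer any request, so adding a backend adds capacity.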

Page 36: Sahara presentation latest - Codemotion Rome 2015

Common OpenStack Architecture

• In addition to providing API services, each component has a set of workers

• These workers actually do the heavy lifting behind the scenes

• Workers (and API services) scale-out and communicate using a message bus (RabbitMQ)

• Example with Nova:

[Diagram: Nova API and three Nova Compute workers connected through RabbitMQ (AMQP)]
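
The API-plus-workers pattern can be mimicked in-process with a queue standing in for the message bus. This is a toy model under that assumption: it uses Python's `queue.Queue` and threads rather than RabbitMQ, and the function names are hypothetical.

```python
import queue
import threading

bus = queue.Queue()       # stands in for the message bus
results = queue.Queue()   # collects what the workers report back


def compute_worker(worker_id):
    """Worker: take requests off the bus until told to stop."""
    while True:
        message = bus.get()
        if message is None:   # sentinel: shut this worker down
            break
        results.put((worker_id, f"booted {message}"))


def api_boot(instance_name):
    """API service: accept the request and hand it to the bus."""
    bus.put(instance_name)


workers = [threading.Thread(target=compute_worker, args=(i,)) for i in range(3)]
for worker in workers:
    worker.start()
for name in ["vm-a", "vm-b", "vm-c", "vm-d"]:
    api_boot(name)
for _ in workers:
    bus.put(None)             # one stop signal per worker
for worker in workers:
    worker.join()

booted = []
while not results.empty():
    booted.append(results.get()[1])
booted.sort()
print(booted)
```

The API never talks to a specific worker; whichever worker is free picks up the next message, so workers can be added without touching the API.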

Page 39: Sahara presentation latest - Codemotion Rome 2015

Common OpenStack Architecture

• OpenStack services store state information in a SQL-based database, default is MySQL

• Each service can use its own database infrastructure or share a common platform

• For resilience and throughput, replicated multi-master databases can be implemented

• Example with Keystone:

[Diagram: Keystone Server behind a load balancer (LB), with multi-master replication using Galera]

Page 40: Sahara presentation latest - Codemotion Rome 2015

Common OpenStack Architecture

• OpenStack services check a user’s request with Keystone for both authentication and authorisation

• Example with Nova:

[Diagram: a “Launch an Instance” request reaches Nova API, which asks the Keystone Server: 1) Are they authenticated? 2) Are they allowed to launch an instance? The answer is Success/Fail]
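
The two questions in this example, authentication and then authorisation, can be sketched as two separate checks. Everything here is hypothetical: the token table, the roles and the policy are stand-ins, and Keystone's real API is not used.

```python
# Hypothetical identity data; a real deployment keeps this in Keystone.
TOKENS = {"token-alice": "alice", "token-bob": "bob"}
ROLES = {"alice": {"member"}, "bob": set()}
POLICY = {"launch_instance": "member"}   # action -> required role


def authenticate(token):
    """Question 1: is the caller who they claim to be?"""
    return TOKENS.get(token)


def authorize(user, action):
    """Question 2: does one of the user's roles allow this action?"""
    return POLICY[action] in ROLES.get(user, set())


def launch_instance(token):
    user = authenticate(token)
    if user is None:
        return "Fail: not authenticated"
    if not authorize(user, "launch_instance"):
        return "Fail: not authorised"
    return f"Success: instance launched for {user}"


outcomes = [launch_instance(t) for t in ("token-alice", "token-bob", "bogus")]
print(outcomes)
```

Keeping the two checks apart shows why a valid login alone is not enough to launch an instance.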

Page 41: Sahara presentation latest - Codemotion Rome 2015

OpenStack Architecture

Page 42: Sahara presentation latest - Codemotion Rome 2015

Page 43: Sahara presentation latest - Codemotion Rome 2015

OpenStack Sahara, or what we are supposed to talk about today

Page 44: Sahara presentation latest - Codemotion Rome 2015

Hadoop without Sahara: the challenges

•Hadoop clusters are difficult to configure and few have the expert knowledge to do it well

•Commodity hardware is cheap but requires frequent (costly, expert) maintenance

•Demand for data processing varies over time, even with sophisticated scheduling

•Baremetal Hadoop cluster nodes can fail, leading to a loss of service

•Many public Big Data services don't give you flexibility

Page 45: Sahara presentation latest - Codemotion Rome 2015

Hadoop with Sahara: beat the challenges

•OpenStack Sahara lets you:

  •Deploy Hadoop clusters (predictable and repeatable)

  •Scale the deployed clusters

  •Define and run jobs

  •Use a programmatic API interface and a web console

•Furthermore:

  •It supports many Hadoop distributions

  •It is well integrated with other OpenStack services

  •It enables you to use Hadoop even with little knowledge about it
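
"Predictable and repeatable" deployment and scaling both follow from describing a cluster as data. The sketch below shows the idea with a plain dictionary. The field names and the `deploy` and `scale` helpers are illustrative only and are not taken from the Sahara API.

```python
import copy

# Hypothetical cluster description; field names are illustrative only
# and are not taken from the Sahara API.
cluster_template = {
    "name": "wordcount-cluster",
    "plugin": "vanilla",
    "node_groups": {"master": 1, "worker": 3},
}


def deploy(template):
    """Repeatable deploy: the same template always yields the same layout."""
    return copy.deepcopy(template)


def scale(cluster, group, count):
    """Return a new cluster description with one node group resized."""
    if count < 1:
        raise ValueError("a node group needs at least one node")
    scaled = copy.deepcopy(cluster)
    scaled["node_groups"][group] = count
    return scaled


cluster = deploy(cluster_template)
bigger = scale(cluster, "worker", 6)
print(cluster["node_groups"], bigger["node_groups"])
```

Scaling changes only a count in the description; the rest of the layout stays as the template defined it.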

Page 46: Sahara presentation latest - Codemotion Rome 2015

Sahara: the project

History:

•Started at Portland Summit

•Incubated in Icehouse

•Integrated in Juno

Main components:

•Sahara REST API

•Python REST Client and Sahara Pages (Integrated with Horizon)

•Elastic Data Processing

•Provisioning Engine

•Vendor Plugins (Vanilla, Intel, Hortonworks, Cloudera, MapR)

Page 47: Sahara presentation latest - Codemotion Rome 2015

Sahara: Architecture

Page 48: Sahara presentation latest - Codemotion Rome 2015

Sahara: Use cases

•Cluster Management (API V1.0)

  •On-demand, scalable, persistent clusters

  •Supports multiple plugins

  •Integrates with Heat, Glance, Nova, Neutron, and Cinder

•EDP (Elastic Data Processing) (API V1.1)

  •Supports multiple job types (Java, MR, Hive, Pig, Spark...)

  •Supports transient clusters (spin up, process, shut down) or persistent clusters

  •Integrates with Swift (optionally) and services on VMs
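
The transient cluster lifecycle (spin up, process, shut down) maps naturally onto a context manager. The sketch below is a local simulation under that framing: `transient_cluster` and `run_job` are hypothetical, nothing is provisioned, and the "job" is just a word count.

```python
from contextlib import contextmanager

events = []   # records what happened, in order


@contextmanager
def transient_cluster(name, workers):
    """Spin a cluster up, hand it to the caller, and always shut it down."""
    events.append(f"spin up {name} with {workers} workers")
    try:
        yield {"name": name, "workers": workers}
    finally:
        events.append(f"shut down {name}")


def run_job(cluster, job_type, data):
    """Hypothetical job runner; here the 'job' is a local word count."""
    events.append(f"run {job_type} job on {cluster['name']}")
    return len(data.split())


with transient_cluster("nightly-report", workers=3) as cluster:
    word_total = run_job(cluster, "MapReduce", "let's reduce map reduce")
print(word_total, events)
```

The `finally` clause means the cluster is shut down even if the job raises, which is the property that keeps transient clusters from lingering.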

Page 49: Sahara presentation latest - Codemotion Rome 2015

Sahara: end-user workflow

Page 50: Sahara presentation latest - Codemotion Rome 2015

Page 51: Sahara presentation latest - Codemotion Rome 2015

Questions?