DevOps for Big Data: Enabling Continuous Delivery for data analytics applications based on Hadoop, Vertica, and Tableau
Max Martynov, VP of Technology, Grid Dynamics

DevOps for Big Data - Data 360 2014 Conference

Nov 22, 2014

Transcript
Page 1: DevOps for Big Data - Data 360 2014 Conference

DevOps for Big Data: Enabling Continuous Delivery for data analytics applications based on Hadoop, Vertica, and Tableau

Max Martynov, VP of Technology, Grid Dynamics

Page 2: DevOps for Big Data - Data 360 2014 Conference

Introductions

• Grid Dynamics
─ Solutions company, specializing in eCommerce
─ Experts in mission-critical applications (IMDGs, Big Data)
─ Implementing Continuous Integration and Continuous Delivery for 5+ years

• Qubell
─ Enterprise DevOps platform
─ Focused on self-service environments, service orchestration, and continuous upgrades
─ Targets web-scale and big data applications

Page 3: DevOps for Big Data - Data 360 2014 Conference


State of DevOps and Continuous Delivery

Continuous Delivery Value

• Agility

• Transparency

• Efficiency

• Consistency

• Quality

• Control

Findings from The 2014 State of DevOps Report

• Strong IT performance is a competitive advantage

• DevOps practices improve IT performance

• Organizational culture matters

• Job satisfaction is the No. 1 predictor of organizational performance

Page 4: DevOps for Big Data - Data 360 2014 Conference

Continuous Delivery Infrastructure

• Environments
─ Reliable and repeatable deployment automation
─ Database schema management
─ Data management
─ Application properties management
─ Dynamic environments

• Quality
─ Test automation
─ Test data management (again)
─ Code analysis and review

• Process
─ Source code management, branching strategy
─ Agile requirements and project management
─ CI/CD pipeline

* Big Data applications bring additional challenges in all of these areas due to the large volumes of data, complex business logic, and large-scale environments involved.
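One of the areas above, application properties management, can be sketched as a layered merge of shared defaults with per-environment overrides. The property names and layering scheme below are illustrative assumptions, not the project's actual configuration:

```python
# Sketch: layered application-properties management. Shared defaults are
# overlaid with environment-specific values, so each environment gets a
# complete, consistent property set. Property names are hypothetical.

def merge_properties(base, overrides):
    """Overlay environment-specific values on top of shared defaults."""
    merged = dict(base)       # copy so the shared defaults stay untouched
    merged.update(overrides)  # environment-specific values win
    return merged

base = {"hadoop.replication": "3", "db.host": "prod-vertica"}
qa_overrides = {"db.host": "qa-vertica", "hadoop.replication": "1"}

qa_props = merge_properties(base, qa_overrides)
```

The same merge can be applied per zone or per dynamic environment, which keeps one source of truth for defaults while making environment differences explicit and reviewable.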

Page 5: DevOps for Big Data - Data 360 2014 Conference

Implementing Continuous Delivery for Big Data: Initial State of the Project

• Medium-sized distributed development team

• Diverse technology stack: Hadoop + Vertica + Tableau

• Only one environment existed, and it was production

• Delivery pipeline: Development Team → Production

• Procurement of hardware for a new environment was taking months

Page 6: DevOps for Big Data - Data 360 2014 Conference


Development in Production

It is fun until somebody misses the nail

Page 7: DevOps for Big Data - Data 360 2014 Conference

Hadoop Analytical Application

[Diagram: Manager and Master nodes, Database, Slaves 1-N]

10+ TB of data; 10+ nodes in production; 10+ applications; manually pre-deployed on hardware servers.
How to quickly reproduce this environment for dev-test purposes?

Page 8: DevOps for Big Data - Data 360 2014 Conference

1. Stop-Gap Measure

• Same hardware, different logical “zones” implemented on the file system

• Automated build and deployment

• Delivery pipeline: Development Team → Production cluster, partitioned into zones /test1-N, /stage, and /prod
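The zone idea above can be sketched as a small path-resolution helper: every application path is prefixed by its logical zone, so the same code can run against /test1-N, /stage, or /prod data on the shared cluster. The helper and its names are illustrative assumptions, not the project's actual code:

```python
# Sketch: logical "zones" on a shared cluster file system. The zone set
# mirrors the slide (/test1-N with N assumed to be 10, /stage, /prod).

ZONES = {"prod", "stage"} | {f"test{i}" for i in range(1, 11)}

def zone_path(zone, relative_path):
    """Prefix an application path with its logical zone directory."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"/{zone}/{relative_path.lstrip('/')}"

# Example: the same relative path resolves differently per zone.
staging_output = zone_path("stage", "etl/output")   # "/stage/etl/output"
prod_output = zone_path("prod", "etl/output")       # "/prod/etl/output"
```

Because the zone is just a path prefix, promoting code between zones needs no code changes, only a different zone parameter at deploy time.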

Page 9: DevOps for Big Data - Data 360 2014 Conference

1. Stop-Gap Measure: Pros and Cons

Pros

• Better than before: code can be tested before it goes to production

• All logical environments have access to the same production data

• Zero additional environment costs

Cons

• Stability, security, and compliance issues: dev, test, and prod environments share the same hardware

• Performance issues: tests affect production performance

• Impossible to run “destructive” tests that affect shared production data

• Impossible to test upgrades of middleware (new versions of H* components)

Page 10: DevOps for Big Data - Data 360 2014 Conference

2. Hadoop Dynamic Environments

[Diagram: Dev/QA/Ops request an environment; the platform orchestrates environment provisioning and application deployment, assembling components (Data, Custom, Application), services, and environment policies into Dev, QA, Stage, and Prod environments]

Page 11: DevOps for Big Data - Data 360 2014 Conference

2. Hadoop Dynamic Environments (continued)

• Dev/QA/Ops teams got a self-service portal to
─ provision environments
─ deploy applications

• A new environment can be created from scratch in 2-3 hours
─ single-node dev sandbox
─ multi-node QA
─ big clusters for scalability and performance

• An application can be deployed to an environment within 10 minutes
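The self-service flow above can be sketched as an ordered sequence of provisioning steps that an orchestrator runs on request. The step names below are assumptions for illustration; the actual orchestration is driven by the platform's own manifests:

```python
# Sketch: environment provisioning as an ordered step sequence. In a real
# orchestrator each step would call cloud and configuration-management
# APIs; here each step simply records its completion.

PROVISION_STEPS = [
    "allocate_vms",
    "install_hadoop",
    "deploy_application",
    "load_test_data",
    "run_smoke_tests",
]

def provision_environment(name, steps=PROVISION_STEPS):
    """Run each provisioning step in order; return the completion log."""
    log = []
    for step in steps:
        log.append(f"{name}:{step}")
    return log

# Example: a QA engineer requests a fresh environment named "qa1".
log = provision_environment("qa1")
```

Encoding the steps as data rather than a script is what makes the portal self-service: the same sequence produces a single-node sandbox or a multi-node QA cluster, varying only the sizing parameters.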

Page 12: DevOps for Big Data - Data 360 2014 Conference

3. Vertica and Tableau Dynamic Environments

[Diagram: same provisioning flow as for Hadoop; Dev/QA/Ops request an environment, and the platform orchestrates provisioning and deployment of components (Data, UDF, VSQL, Config) into Dev, QA, Stage, and Prod environments, alongside a shared service]

Page 13: DevOps for Big Data - Data 360 2014 Conference

4. Tests & Test Data

• Dev and QA teams implemented automated tests

• Two options to handle data on dev-test environments:
1. Tests generate data for themselves
2. A reduced, representative snapshot of obfuscated production data (10 TB -> 10 GB)

Test pyramid:
─ Unit Tests: Java code, auto-generated data; build-time validation
─ Component Tests: auto tests on the “API” level, testing job output; test-generated data
─ Integration Tests (integration with data): auto tests on the “API” level, validating job output; snapshot of production data
─ Exploratory Tests: manual tests; snapshot of production data
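Option 2 above can be sketched roughly as deterministic sampling plus one-way masking of sensitive fields. The sampling rate, record shape, and field names below are assumptions for illustration:

```python
import hashlib

# Sketch: build a reduced, obfuscated snapshot of production data.
# Sampling keeps roughly 1 in every `keep_every` records; sensitive
# fields are replaced with a stable one-way hash so joins on the masked
# field still work across tables.

def obfuscate(value):
    """Replace a sensitive value with a short, stable one-way hash."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def snapshot(records, keep_every=1000):
    """Sample records deterministically and mask the 'email' field."""
    out = []
    for i, rec in enumerate(records):
        if i % keep_every == 0:
            out.append(dict(rec, email=obfuscate(rec["email"])))
    return out
```

Deterministic sampling (rather than random) makes the snapshot reproducible, and the stable hash preserves referential integrity between datasets, which is what makes the 10 GB copy representative enough for integration tests.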

Page 14: DevOps for Big Data - Data 360 2014 Conference

5. CI/CD Pipeline

With all the components ready, implementing a CI/CD pipeline is easy:

1. Develop & experiment (dev sandbox)
2. Commit (GitHub Flow)
3. Build & unit test
4. Deploy (QA environment)
5. Test
6. Release
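The steps above can be sketched as a gated sequence in which each stage must succeed before the next runs. The stage names are taken from the slide; the runner itself is a placeholder, not the project's real pipeline:

```python
# Sketch: a CI/CD pipeline as a gated stage sequence. A failure at any
# stage halts the pipeline, so "release" is only reached when every
# earlier gate has passed.

PIPELINE = ["commit", "build_and_unit_test", "deploy_to_qa", "test", "release"]

def run_pipeline(stages, fail_at=None):
    """Execute stages in order; stop at the first failing stage."""
    completed = []
    for stage in stages:
        if stage == fail_at:
            return completed, False  # halted; later stages never run
        completed.append(stage)
    return completed, True
```

The gating is the point: a failing QA test blocks the release stage automatically, which is what replaces manual "is it safe to ship?" judgment calls.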

Page 15: DevOps for Big Data - Data 360 2014 Conference

6. Release Button

[Diagram: Ops/RE promote a Release Candidate to a Release and push it to Production]

Page 16: DevOps for Big Data - Data 360 2014 Conference


Assembly Line

Page 17: DevOps for Big Data - Data 360 2014 Conference

Results

• Reduced risk and higher quality
─ No more development in production
─ Developers have sandboxes; tests are run on separate environments
─ Features are deployed to production only after validation

• Increased efficiency
─ A new environment can be provisioned within 2 hours
─ Developers can freely experiment with new changes
─ No resource contention

• Reduced costs
─ No need to procure in-house hardware or manage an in-house datacenter
─ Dynamic environments save money because they run only when they are needed

Page 18: DevOps for Big Data - Data 360 2014 Conference

Enabling Technologies

• Agile Software Factory: Software Engineering Assembly Line (griddynamics.com)

• Qubell: Enterprise DevOps Platform (qubell.com)

Page 19: DevOps for Big Data - Data 360 2014 Conference


Thank You

Max Martynov, VP of Technology, Grid Dynamics ([email protected])

Victoria Livschitz, CEO and Founder, Qubell ([email protected])