DevOps for Big Data: Enabling Continuous Delivery for data analytics applications based on Hadoop, Vertica, and Tableau
Max Martynov, VP of Technology, Grid Dynamics

DevOps for Big Data - Data 360 2014 Conference

Nov 22, 2014

Transcript
Page 1: DevOps for Big Data - Data 360 2014 Conference

DevOps for Big Data: Enabling Continuous Delivery for data analytics applications based on Hadoop, Vertica, and Tableau

Max Martynov, VP of Technology, Grid Dynamics

Page 2: DevOps for Big Data - Data 360 2014 Conference

Introductions

• Grid Dynamics
─ Solutions company, specializing in eCommerce
─ Experts in mission-critical applications (IMDGs, Big Data)
─ Implementing Continuous Integration and Continuous Delivery for 5+ years

• Qubell
─ Enterprise DevOps platform
─ Focused on self-service environments, service orchestration, and continuous upgrades
─ Targets web-scale and big data applications

Page 3: DevOps for Big Data - Data 360 2014 Conference


State of DevOps and Continuous Delivery

Continuous Delivery Value

• Agility

• Transparency

• Efficiency

• Consistency

• Quality

• Control

Findings from The 2014 State of DevOps Report

• Strong IT performance is a competitive advantage

• DevOps practices improve IT performance

• Organizational culture matters

• Job satisfaction is the No. 1 predictor of organizational performance

Page 4: DevOps for Big Data - Data 360 2014 Conference

Continuous Delivery Infrastructure

• Environments
─ Reliable and repeatable deployment automation
─ Database schema management
─ Data management
─ Application properties management
─ Dynamic environments

• Quality
─ Test automation
─ Test data management (again)
─ Code analysis and review

• Process
─ Source code management, branching strategy
─ Agile requirements and project management
─ CI/CD pipeline

* Big Data applications bring additional challenges in all of these areas due to the large volumes of data, complex business logic, and large-scale environments involved.
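One of the areas above, application properties management, can be sketched as a layered merge of shared defaults with per-environment overrides. The property names and layering scheme below are illustrative assumptions, not the project's actual configuration:

```python
# Sketch: layered application-properties management. Shared defaults are
# overlaid with environment-specific values, so each environment gets a
# complete, consistent property set. Property names are hypothetical.

def merge_properties(base, overrides):
    """Overlay environment-specific values on top of shared defaults."""
    merged = dict(base)       # copy so the shared defaults stay untouched
    merged.update(overrides)  # environment-specific values win
    return merged

base = {"hadoop.replication": "3", "db.host": "prod-vertica"}
qa_overrides = {"db.host": "qa-vertica", "hadoop.replication": "1"}

qa_props = merge_properties(base, qa_overrides)
```

The same merge can be applied per zone or per dynamic environment, which keeps one source of truth for defaults while making environment differences explicit and reviewable.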

Page 5: DevOps for Big Data - Data 360 2014 Conference

Implementing Continuous Delivery for Big Data: Initial State of the Project

• Medium-sized distributed development team

• Diverse technology stack: Hadoop + Vertica + Tableau

• Only one environment existed, and it was production

• Delivery pipeline: Development Team → Production

• Procurement of hardware for a new environment was taking months

Page 6: DevOps for Big Data - Data 360 2014 Conference


Development in Production

It is fun until somebody misses the nail

Page 7: DevOps for Big Data - Data 360 2014 Conference

Hadoop Analytical Application

[Diagram: Manager and Master nodes, Database, Slaves 1-N]

10+ TB of data; 10+ nodes in production; 10+ applications; manually pre-deployed on hardware servers.
How to quickly reproduce this environment for dev-test purposes?

Page 8: DevOps for Big Data - Data 360 2014 Conference

1. Stop-Gap Measure

• Same hardware, different logical “zones” implemented on the file system

• Automated build and deployment

• Delivery pipeline: Development Team → Production cluster, partitioned into zones /test1-N, /stage, and /prod
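The zone idea above can be sketched as a small path-resolution helper: every application path is prefixed by its logical zone, so the same code can run against /test1-N, /stage, or /prod data on the shared cluster. The helper and its names are illustrative assumptions, not the project's actual code:

```python
# Sketch: logical "zones" on a shared cluster file system. The zone set
# mirrors the slide (/test1-N with N assumed to be 10, /stage, /prod).

ZONES = {"prod", "stage"} | {f"test{i}" for i in range(1, 11)}

def zone_path(zone, relative_path):
    """Prefix an application path with its logical zone directory."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"/{zone}/{relative_path.lstrip('/')}"

# Example: the same relative path resolves differently per zone.
staging_output = zone_path("stage", "etl/output")   # "/stage/etl/output"
prod_output = zone_path("prod", "etl/output")       # "/prod/etl/output"
```

Because the zone is just a path prefix, promoting code between zones needs no code changes, only a different zone parameter at deploy time.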

Page 9: DevOps for Big Data - Data 360 2014 Conference

1. Stop-Gap Measure: Pros and Cons

Pros

• Better than before: code can be tested before it goes to production

• All logical environments have access to the same production data

• Zero additional environment costs

Cons

• Stability, security, and compliance issues: dev, test, and prod environments share the same hardware

• Performance issues: tests affect production performance

• Impossible to run “destructive” tests that affect shared production data

• Impossible to test upgrades of middleware (new versions of H* components)

Page 10: DevOps for Big Data - Data 360 2014 Conference

2. Hadoop Dynamic Environments

[Diagram: Dev/QA/Ops request an environment; the platform orchestrates environment provisioning and application deployment, assembling components (Data, Custom, Application), services, and environment policies into Dev, QA, Stage, and Prod environments]

Page 11: DevOps for Big Data - Data 360 2014 Conference

2. Hadoop Dynamic Environments (continued)

• Dev/QA/Ops teams got a self-service portal to
─ provision environments
─ deploy applications

• A new environment can be created from scratch in 2-3 hours
─ single-node dev sandbox
─ multi-node QA
─ big clusters for scalability and performance

• An application can be deployed to an environment within 10 minutes
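The self-service flow above can be sketched as an ordered sequence of provisioning steps that an orchestrator runs on request. The step names below are assumptions for illustration; the actual orchestration is driven by the platform's own manifests:

```python
# Sketch: environment provisioning as an ordered step sequence. In a real
# orchestrator each step would call cloud and configuration-management
# APIs; here each step simply records its completion.

PROVISION_STEPS = [
    "allocate_vms",
    "install_hadoop",
    "deploy_application",
    "load_test_data",
    "run_smoke_tests",
]

def provision_environment(name, steps=PROVISION_STEPS):
    """Run each provisioning step in order; return the completion log."""
    log = []
    for step in steps:
        log.append(f"{name}:{step}")
    return log

# Example: a QA engineer requests a fresh environment named "qa1".
log = provision_environment("qa1")
```

Encoding the steps as data rather than a script is what makes the portal self-service: the same sequence produces a single-node sandbox or a multi-node QA cluster, varying only the sizing parameters.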

Page 12: DevOps for Big Data - Data 360 2014 Conference

3. Vertica and Tableau Dynamic Environments

[Diagram: same provisioning flow as for Hadoop; Dev/QA/Ops request an environment, and the platform orchestrates provisioning and deployment of components (Data, UDF, VSQL, Config) into Dev, QA, Stage, and Prod environments, alongside a shared service]

Page 13: DevOps for Big Data - Data 360 2014 Conference

4. Tests & Test Data

• Dev and QA teams implemented automated tests

• Two options to handle data on dev-test environments:
1. Tests generate data for themselves
2. A reduced, representative snapshot of obfuscated production data (10 TB -> 10 GB)

Test pyramid:
─ Unit Tests: Java code, auto-generated data; build-time validation
─ Component Tests: auto tests on the “API” level, testing job output; test-generated data
─ Integration Tests (integration with data): auto tests on the “API” level, validating job output; snapshot of production data
─ Exploratory Tests: manual tests; snapshot of production data
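Option 2 above can be sketched roughly as deterministic sampling plus one-way masking of sensitive fields. The sampling rate, record shape, and field names below are assumptions for illustration:

```python
import hashlib

# Sketch: build a reduced, obfuscated snapshot of production data.
# Sampling keeps roughly 1 in every `keep_every` records; sensitive
# fields are replaced with a stable one-way hash so joins on the masked
# field still work across tables.

def obfuscate(value):
    """Replace a sensitive value with a short, stable one-way hash."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def snapshot(records, keep_every=1000):
    """Sample records deterministically and mask the 'email' field."""
    out = []
    for i, rec in enumerate(records):
        if i % keep_every == 0:
            out.append(dict(rec, email=obfuscate(rec["email"])))
    return out
```

Deterministic sampling (rather than random) makes the snapshot reproducible, and the stable hash preserves referential integrity between datasets, which is what makes the 10 GB copy representative enough for integration tests.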

Page 14: DevOps for Big Data - Data 360 2014 Conference

5. CI/CD Pipeline

With all the components ready, implementing a CI/CD pipeline is easy:

1. Develop & experiment (dev sandbox)
2. Commit (GitHub Flow)
3. Build & unit test
4. Deploy (QA environment)
5. Test
6. Release
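The steps above can be sketched as a gated sequence in which each stage must succeed before the next runs. The stage names are taken from the slide; the runner itself is a placeholder, not the project's real pipeline:

```python
# Sketch: a CI/CD pipeline as a gated stage sequence. A failure at any
# stage halts the pipeline, so "release" is only reached when every
# earlier gate has passed.

PIPELINE = ["commit", "build_and_unit_test", "deploy_to_qa", "test", "release"]

def run_pipeline(stages, fail_at=None):
    """Execute stages in order; stop at the first failing stage."""
    completed = []
    for stage in stages:
        if stage == fail_at:
            return completed, False  # halted; later stages never run
        completed.append(stage)
    return completed, True
```

The gating is the point: a failing QA test blocks the release stage automatically, which is what replaces manual "is it safe to ship?" judgment calls.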

Page 15: DevOps for Big Data - Data 360 2014 Conference

6. Release Button

[Diagram: Ops/RE promote a Release Candidate to a Release and push it to Production]

Page 16: DevOps for Big Data - Data 360 2014 Conference


Assembly Line

Page 17: DevOps for Big Data - Data 360 2014 Conference

Results

• Reduced risk and higher quality
─ No more development in production
─ Developers have sandboxes; tests are run on separate environments
─ Features are deployed to production only after validation

• Increased efficiency
─ A new environment can be provisioned within 2 hours
─ Developers can freely experiment with new changes
─ No resource contention

• Reduced costs
─ No need to procure in-house hardware or manage an in-house datacenter
─ Dynamic environments save money because they run only when they are needed

Page 18: DevOps for Big Data - Data 360 2014 Conference

Enabling Technologies

• Agile Software Factory: Software Engineering Assembly Line (griddynamics.com)

• Qubell: Enterprise DevOps Platform (qubell.com)

Page 19: DevOps for Big Data - Data 360 2014 Conference


Thank You

Max Martynov, VP of Technology, Grid Dynamics ([email protected])

Victoria Livschitz, CEO and Founder, Qubell ([email protected])