Data Lake and the rise of the Microservices

Datalake and the rise of the microservices - Meetupfiles.meetup.com/4533812/Datalake and the rise of... · Data Lake and the rise of the Microservices. About Me ... • Orchestration

May 20, 2020

dariahiddleston
Transcript
Page 1:

Data Lake and the rise of the Microservices

Page 2:

About Me

• Developer (forcibly) turned Product Manager
• Has been designing system architectures as well as products for the hosting market for the last 8 years
• Passionate about distributed computing, HPC environments and supercomputers

@alexandrubordei

@bigstepinc

[email protected]

Page 3:

About Bigstep

• High performance, bare metal cloud purpose built for big data
• Automatically deployed (managed and unmanaged) big data software stacks
• HDFS as a Service Offering
• Managed Docker platform (coming soon)
• Spark clusters as a service (coming soon)
• Purely on-demand: bare metal instances get deployed in 2 minutes, can be deleted anytime
• Locally attached drives support
• SDN controlled Layer 2 networking (40Gbps per instance, cut through)
• Distributed SSD based storage fabric

Page 4:

Big Data technologies for mainstream and vice-versa

• Due to the cap on CPU frequency, the horizontal is the only dimension left to grow into.
• Client-server architecture outdated.
• All components of an application must be as independent as possible and as scalable as possible.
• Big data technologies increasingly used in general purpose applications
• In-memory technologies are orders of magnitude faster than the others.

• Docker promotes and simplifies large scale application management using low-overhead containers instead of VMs

• Mesos used with Docker and some additional services creates a Distributed OS

Page 5:

Source: Tori Randall, Ph.D. prepares a 550-year old Peruvian child mummy for a CT scan

Data as artefacts

• Just like archeological artefacts, old data can yield new insights if correlated in a novel way or analysed with a new technology.

• Throwing away data because it is of no use today might cripple the business tomorrow.

Page 6:

The Data Lake - A paradigm, not a technology

• Store unstructured data in its original format
• Store structured data along with the structure (schema) so it can be distributed onto multiple machines
• Ingest massive amounts of data - go to petabyte scale if needed
• Stream in or batch import data from any source
• Perform new, deeper analytics by focusing on correlations between diverse data sources: clickstream, social media, machine data, documents, audio/video, etc.
• Store anonymised data (keep IDs and not names or other personally identifiable information)
• Promote data exploration
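
The anonymisation bullet above can be illustrated with a minimal Python sketch. The field names in `PII_FIELDS`, the `user_id` column and the optional keyed hash are assumptions made for illustration; the slide only says to keep IDs and drop names and other personally identifiable information.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical list of PII columns; a real dataset defines its own.
PII_FIELDS = {"name", "email", "phone"}


def anonymise(record: dict,
              secret_key: Optional[bytes] = None,
              id_field: str = "user_id") -> dict:
    """Return a copy of `record` without PII fields.

    The ID is kept as-is by default. If `secret_key` is given, the ID is
    replaced by a keyed hash (an extra step that is not on the slide), so
    records can still be correlated without exposing the original ID.
    """
    cleaned = {k: v for k, v in record.items() if k not in PII_FIELDS}
    if secret_key is not None and id_field in cleaned:
        raw_id = str(cleaned[id_field]).encode("utf-8")
        cleaned[id_field] = hmac.new(secret_key, raw_id, hashlib.sha256).hexdigest()
    return cleaned


if __name__ == "__main__":
    event = {"user_id": 42, "name": "Ada", "email": "ada@example.com", "clicks": 7}
    print(anonymise(event))
```

Running the anonymisation at import time, before data lands in the lake, keeps downstream data services from ever seeing the dropped fields.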

Page 7:

Data Services

• A data service provides data to other services

[Diagram: example data services (Timetable, Service center load predictor, Driver's path optimiser), shown alongside boxes labelled "Service Cluster" and "Datalake"]

Page 8:

Data Services - It’s about the teams and not the technology

• Conway's law: “[…] organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations”

[Diagram: an Application split into View, Controller and Model layers, with UX specialists, Backend specialists and DB specialists each in charge of one layer; the arrangement is labelled "poor communication"]

Page 9:

Per data microservice teams

• Data teams are independent
• A data service has its own release cycle
• Ultra-specialisation is reduced
• Communication among members of the same team is better

[Diagram: three groups of Apps, each behind its own API, each group labelled "better communication"]

Page 10:

Monolith vs. Microservices approach

[Diagram: Apps placed on Servers under the "Monolithic approach", compared with Apps placed on Servers under the "Microservices approach"]

Page 11:

Polyglot persistence

• The data does not have to reside in the same place (e.g.: same HDFS cluster)
• But it has to be always available for any team, microservice, or data application authorized to use it

[Diagram: a "Single DB" with a "Single DB (slave)", holding pieces 1 to 4, contrasted with separate databases DB 1 to DB 4, each shown more than once]
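
One way to picture the two bullets on this slide is a small catalogue that maps each dataset to whichever store holds it and checks authorisation on every read. This is a hypothetical Python sketch: `DataCatalog`, `InMemoryStore` and the dataset and client names are invented for illustration, and the in-memory store merely stands in for a real backend such as HDFS or a NoSQL database.

```python
from typing import Dict, Protocol, Set, Tuple


class DataStore(Protocol):
    """Minimal interface every backend is assumed to offer."""

    def read(self, key: str) -> bytes: ...

    def write(self, key: str, value: bytes) -> None: ...


class InMemoryStore:
    """Stand-in for a real backend (HDFS, NoSQL, ...)."""

    def __init__(self, label: str) -> None:
        self.label = label
        self._data: Dict[str, bytes] = {}

    def read(self, key: str) -> bytes:
        return self._data[key]

    def write(self, key: str, value: bytes) -> None:
        self._data[key] = value


class DataCatalog:
    """Routes reads to the store that holds a dataset, for authorised clients only."""

    def __init__(self) -> None:
        self._datasets: Dict[str, Tuple[DataStore, Set[str]]] = {}

    def register(self, dataset: str, store: DataStore, allowed: Set[str]) -> None:
        self._datasets[dataset] = (store, set(allowed))

    def read(self, dataset: str, key: str, client: str) -> bytes:
        store, allowed = self._datasets[dataset]
        if client not in allowed:
            raise PermissionError(f"{client} may not read {dataset}")
        return store.read(key)
```

Callers only name the dataset, so a team can move its data to a different technology without changing the services that consume it.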

Page 13:

Microservices orientated architecture

• Components, not layers.
• Each component can scale horizontally and is masterless
• Each component can be unit tested independently
• Each component can be deployed independently to production
• Multiple versions of the same component can coexist for a short amount of time
• Use APIs to integrate components as opposed to direct method calls
• Use natively backward compatible API designs and implementations
• Use distributed locking (e.g.: Zookeeper) instead of file based locking
• Use queuing with evolving schemas (e.g.: Kafka with Avro serialiser) instead of blocking calls
• Use distributed databases (e.g.: Couchbase) instead of master-slave oriented ones. Avoid immutable schemas.
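
The queuing and backward-compatibility bullets can be sketched with the Python standard library alone. This is not the Kafka or Avro API: `queue.Queue` stands in for the message broker, JSON stands in for the serialiser, and the `order` message, its fields and the `FIELD_DEFAULTS` table are hypothetical. The point is only that a consumer built for the newer schema keeps reading messages written with the older one.

```python
import json
import queue
from typing import Dict, List

# Hypothetical defaults for fields that were added in later schema versions.
FIELD_DEFAULTS = {"currency": "EUR"}


def encode_order_v1(order_id: int, amount: float) -> str:
    """Old producer: knows nothing about `currency`."""
    return json.dumps({"schema_version": 1, "order_id": order_id, "amount": amount})


def encode_order_v2(order_id: int, amount: float, currency: str) -> str:
    """New producer: adds the `currency` field."""
    return json.dumps({"schema_version": 2, "order_id": order_id,
                       "amount": amount, "currency": currency})


def decode_order(message: str) -> Dict:
    """Consumer side: fill in defaults so old and new messages look alike."""
    order = dict(FIELD_DEFAULTS)
    order.update(json.loads(message))
    return order


def drain(q: "queue.Queue[str]") -> List[Dict]:
    """Read every message currently queued, without blocking the caller."""
    orders = []
    while True:
        try:
            message = q.get_nowait()
        except queue.Empty:
            break
        orders.append(decode_order(message))
    return orders
```

Because producers only enqueue and consumers only dequeue, old and new versions of a component can run side by side during a rollout, as the slide's bullet on coexisting versions requires.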

Page 14:

Docker

• A Docker container is neither a VM nor a VPS
• Application level virtualisation
• Same kernel
• Little to no performance overhead
• Instant deployment
• Usually a single app per container
• Built on Linux kernel namespaces (including network namespaces) and cgroups, originally through the LXC engine
• Git-like deployment method with branches and repositories.

[Diagram: several Containers sharing one Kernel, each with a LAN vNIC and a WAN vNIC attached to the LAN and WAN networks]

Page 15:

Docker Persistency

• Docker is designed for services that do not need persistency, but it does support it
• By default, every container has a unique clone of the filesystem in the image
• All changes to this clone are stored in unique per-container directories that do not get garbage collected
• A new container has a new tree, so replacing a container with a new one (rather than restarting the same container) without an explicit mapping appears to have destroyed the data
• Docker achieves persistency by mapping directories from the host machine to the container
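
The host-directory mapping in the last bullet corresponds to the `-v host_dir:container_dir` option of `docker run`. The Python sketch below only assembles such a command line; it does not call Docker. The helper name, the host path and the container name are hypothetical.

```python
from typing import Dict, List


def docker_run_command(image: str,
                       volumes: Dict[str, str],
                       name: str = "") -> List[str]:
    """Build a `docker run` command that maps host directories into the container.

    `volumes` maps a host directory to the directory it should appear as
    inside the container.
    """
    cmd = ["docker", "run", "-d"]
    if name:
        cmd += ["--name", name]
    for host_dir, container_dir in volumes.items():
        cmd += ["-v", f"{host_dir}:{container_dir}"]
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    # Hypothetical host path; data written under /var/lib/cassandra in the
    # container then lives in /data/cassandra on the host.
    print(" ".join(docker_run_command(
        "cassandra", {"/data/cassandra": "/var/lib/cassandra"}, name="cassandra-1")))
```

With the mapping in place, a replacement container started with the same `-v` option sees the data written by its predecessor.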

Page 16:

Docker Networking

[Diagram: instances instance-1001.bigstep.io, instance-1002.bigstep.io ... instance-100n.bigstep.io, each running several containers behind its eno1 interface on a layer 2 LAN, with private container addresses such as 172.167.1.2:80, 172.167.1.3:80 and 172.167.1.200:80. One haproxy per instance faces the WAN at addresses such as 31.00.62.211:80, 31.00.62.212:80 and 31.00.62.213:80. A client on the internet reaches Instancearray01.bigstep.io through DNS load balancing.]

• Uses network namespaces
• Needs Layer 2 or software overlay network
• Each container gets a private IP
• Bigstep Automatic DNS load-balancing
• Automatic HAProxy load-balancing
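
As a rough illustration of what a load balancer in front of the containers does, here is a minimal round-robin sketch in Python. It is not HAProxy configuration and makes no claim about how Bigstep configures HAProxy or DNS; the class name is hypothetical and the backend addresses are the private container addresses shown on the slide.

```python
import itertools
from typing import Iterable


class RoundRobinBalancer:
    """Hands out backends in turn, wrapping around at the end of the list."""

    def __init__(self, backends: Iterable[str]) -> None:
        backend_list = list(backends)
        if not backend_list:
            raise ValueError("at least one backend is required")
        self._cycle = itertools.cycle(backend_list)

    def next_backend(self) -> str:
        return next(self._cycle)


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["172.167.1.2:80", "172.167.1.3:80", "172.167.1.200:80"])
    for _ in range(4):
        print(balancer.next_backend())
```

The same idea applies one level up: DNS load balancing rotates between the public haproxy addresses, and each haproxy rotates between the containers behind it.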

Page 17:

Docker vs Native - Latency

[Chart: average response time, smaller is better (axis labelled in ms, from 0 to 40; series labelled in us). Series: INSERT AVG response time, UPDATE AVG response time. Configurations: 1 node native; 1 node native 1 Docker container; 1 node native with 2 Docker containers; 1 node native with 4 Docker containers. Bar labels (run together in the transcript): 2830, 40, 13, 2620, 11, 1921, 10, 1819.]

Source: Bigstep’s Cassandra Benchmark 2015

Page 18:

Docker vs Native - Throughput

[Chart: throughput in KReq/s, bigger is better (axis from 0 to 180). Series: INSERT throughput (k), SELECT throughput (k), UPDATE throughput (k). Configurations: 1 node native; 1 node native 1 Docker container; 1 node native with 2 Docker containers; 1 node native with 4 Docker containers. Bar labels (run together in the transcript): 566045, 816878, 149, 9282, 168, 9690.]

Source: Bigstep’s Cassandra Benchmark 2015

Page 19:

Mesos & Marathon

• Allows an app’s environment to be software defined.
• Docker (currently) knows only about 1 host
• Orchestration layer for Docker containers
• Out of the box load-balancing
• Monitors and restarts containers if failed
• API driven
• Useful for creating high performance, distributed, fault tolerant architectures.

[Diagram: a grid of containers (C) arranged in groups of three]
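
The "monitors and restarts containers if failed" behaviour can be pictured as a reconcile loop that compares the desired number of instances with the healthy ones. The Python sketch below is a toy model under that assumption, not Marathon's API or implementation; `Supervisor`, `Container` and the app name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Container:
    name: str
    healthy: bool = True


@dataclass
class Supervisor:
    """Keeps `desired[app]` healthy containers running for each app."""

    desired: Dict[str, int]
    running: Dict[str, List[Container]] = field(default_factory=dict)
    _counter: int = 0

    def _start(self, app: str) -> Container:
        # In a real system this would launch a container on some host.
        self._counter += 1
        container = Container(name=f"{app}-{self._counter}")
        self.running.setdefault(app, []).append(container)
        return container

    def reconcile(self) -> List[str]:
        """Drop failed containers and start replacements; return the new names."""
        started = []
        for app, wanted in self.desired.items():
            self.running[app] = [c for c in self.running.get(app, []) if c.healthy]
            while len(self.running[app]) < wanted:
                started.append(self._start(app).name)
        return started
```

Calling `reconcile()` periodically is enough to converge back to the desired state after a failure, which is what makes the environment "software defined": the desired counts are data that an API can change.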

Page 20:

Streaming versus batch

• Resource usage patterns for streaming resemble those of web-centric systems, and need consolidation for efficiency as well as high availability

[Charts: resource usage (%) over time. "Resource usage pattern of a production system", marked at 25%; "typical resource usage pattern of a big data analytics system", marked at 100%.]

Page 21:

Spark with Mesos

• Spark & Spark Streaming are great candidates for building data microservices as they are very fast and easy to use
• Spark can use Mesos as a resource manager
• Spark needs YARN to access Secure HDFS
• YARN on Mesos: Myriad
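
Pointing Spark at Mesos as its resource manager comes down to the master URL, which takes the form `mesos://host:port`. The Python sketch below only builds a `spark-submit` command line and does not run Spark; the helper name, the master address and the application file are hypothetical.

```python
from typing import List


def spark_submit_command(mesos_master: str,
                         app_path: str,
                         executor_memory: str = "2g") -> List[str]:
    """Build a spark-submit command that uses a Mesos master as resource manager."""
    return [
        "spark-submit",
        "--master", f"mesos://{mesos_master}",
        "--executor-memory", executor_memory,
        app_path,
    ]


if __name__ == "__main__":
    # Hypothetical master address and job file.
    print(" ".join(spark_submit_command("mesos-master.example.com:5050", "load_predictor.py")))
```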

Page 22:

Is it hard to build a Data Lake?

• Use flexible infrastructure: workloads are very difficult to predict as data volumes and types of analysis change all the time
• Polyglot Persistency promotes the idea that data must be always available, but that it can be stored in any technology that fits, e.g. Hadoop, NoSQL.
• Polyglot Programming advocates the use of the right tool for the right job. Docker-based deployment makes environment setup more or less irrelevant.
• Mesos is more complicated to set up on-premise. Mesosphere offers a commercial product for this. Bigstep also automates a scalable Mesos (with Docker) deployment on bare metal.
• Data import services could be tricky to set up. The problem is the organisation structure and security. Anonymisation is required.
• A service discovery solution is required: use mesos-dns
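
For the service discovery bullet, Mesos-DNS publishes each task under a name of the form `task.framework.domain`, with `mesos` as the default domain, plus SRV records of the form `_task._protocol.framework.domain` that carry the port. The Python sketch below only composes those names; the task name is hypothetical and `marathon` is assumed as the framework.

```python
def mesos_dns_name(task: str, framework: str = "marathon", domain: str = "mesos") -> str:
    """Host name under which Mesos-DNS publishes a task's address records."""
    return f"{task}.{framework}.{domain}"


def mesos_dns_srv_name(task: str, protocol: str = "tcp",
                       framework: str = "marathon", domain: str = "mesos") -> str:
    """Name of the SRV record that also carries the task's port."""
    return f"_{task}._{protocol}.{framework}.{domain}"


if __name__ == "__main__":
    # Hypothetical data service launched through Marathon.
    print(mesos_dns_name("timetable"))
    print(mesos_dns_srv_name("timetable"))
```

A data service can then be addressed by a stable name while its containers move between hosts.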

Page 23:

Conclusions

• Data (micro-)services allow building a data ecosystem within your organisation. A team is a provider of data to other teams.
• An agile data environment enables an agile business. New tools must be inserted quickly into the mix (e.g. found out about Looker today, why not try it on the data).
• There are methods to improve consolidation ratios by 40% while preserving performance of data services

[Diagram: Data analysis + Business modelling = Business understanding. Production Systems provide machine data to a prediction model, which feeds Visualization & Reports.]