Stream-processing with Samza
Operating samza at skyscanner

Jan 19, 2017

Data & Analytics

Joseph Francis
Transcript
Page 1: Operating samza at skyscanner

Stream-processing with Samza

Page 2: Operating samza at skyscanner

Agenda

01 Introduction
02 Unified log in Skyscanner
03 Basics of Samza
04 Use cases for stream processing in Skyscanner
05 Deployment & Local development environment
06 Monitoring
07 Future

Page 3: Operating samza at skyscanner

Introduction

Page 4: Operating samza at skyscanner

Introduction

• Skyscanner is a travel search company with over 50m unique monthly visitors (UMVs) and over 700 employees globally.

• Joseph Francis, Senior Software Engineer at Skyscanner

• Some use cases in Skyscanner

• Make Samza jobs easily deployable and operable in a multi-tenant cluster

Page 5: Operating samza at skyscanner

Unified log in Skyscanner

Page 6: Operating samza at skyscanner

Past

• One (big) monolithic SQL database for reporting and monitoring

• Central team to deliver data needs for the organization

• Had not yet jumped on the bandwagon of large-scale batch processing

Page 7: Operating samza at skyscanner

Unified Log & Eco-system

Page 8: Operating samza at skyscanner

Basics of Samza

Page 9: Operating samza at skyscanner

Key Points

• Samza consumes one message at a time with an at-least-once delivery guarantee

• Single thread of execution

• API offers init(), process() and window() methods

• State management with embedded key-value store
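The points above can be sketched as a small simulation. This is not Samza's API: it is a plain-Python model under stated assumptions. The names `CountingTask` and `run`, the commit-every-N policy, and the idea that the store survives a restart are all hypothetical stand-ins chosen to show how one-at-a-time processing, a keyed store and at-least-once replay interact.

```python
from collections import defaultdict


class CountingTask:
    """Hypothetical task with the init/process/window shape from the slide."""

    def init(self):
        # A plain dict stands in for the embedded key-value store.
        self.store = defaultdict(int)
        self.emitted = []

    def process(self, message):
        # Called once per message, on a single thread.
        self.store[message["user"]] += 1

    def window(self):
        # Called periodically: emit a snapshot of the current counts.
        self.emitted.append(dict(self.store))


def run(task, log, start=0, crash_after=None, commit_every=2):
    """Feed messages one at a time from offset `start`.

    Returns the last committed offset. Offsets are committed only every
    `commit_every` messages, so a crash in between means the uncommitted
    messages are delivered again on restart (at-least-once).
    """
    committed = start
    for offset in range(start, len(log)):
        task.process(log[offset])
        if crash_after is not None and offset == crash_after:
            return committed  # simulated crash before the next commit
        if (offset + 1) % commit_every == 0:
            task.window()
            committed = offset + 1
    return committed


log = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
task = CountingTask()
task.init()
checkpoint = run(task, log, crash_after=2)  # crash after offset 2, uncommitted
run(task, log, start=checkpoint)            # restart replays offset 2
```

Because the store in this sketch is assumed to survive the restart, the replayed message is counted twice. That is the practical meaning of at-least-once: downstream logic has to tolerate duplicates or be idempotent.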

Page 10: Operating samza at skyscanner

Configuration

Page 11: Operating samza at skyscanner

Samza Job with State

Page 12: Operating samza at skyscanner

Use cases in Skyscanner

Page 13: Operating samza at skyscanner

Use Cases

• Building a user timeline

• Data enrichment downstream

• Stream join and windowed aggregations
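As an illustration of the stream-join use case, here is a minimal sketch in plain Python, not Samza code. The stream names (`search`, `click`), the event tuples and the function `stream_join` are assumptions made for the example; two dicts stand in for the per-stream keyed state a job would keep.

```python
def stream_join(events):
    """Join two interleaved streams on a shared key.

    `events` is an iterable of (stream, key, value) tuples, where stream is
    "search" or "click". The latest value per key is buffered for each side,
    and a joined record is emitted as soon as both sides have been seen.
    """
    searches, clicks = {}, {}  # stand-ins for keyed state
    joined = []
    for stream, key, value in events:
        if stream == "search":
            searches[key] = value
            if key in clicks:
                joined.append((key, value, clicks[key]))
        else:
            clicks[key] = value
            if key in searches:
                joined.append((key, searches[key], value))
    return joined


events = [
    ("search", "u1", "EDI-LHR"),
    ("click", "u2", "ad-7"),
    ("click", "u1", "ad-3"),
    ("search", "u2", "GLA-JFK"),
]
result = stream_join(events)
```

In this sketch the buffers grow without bound; a real job would need some expiry policy for keys that never find a match.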

Page 14: Operating samza at skyscanner

Use Cases

• Indicative pricing for car hire users

Page 15: Operating samza at skyscanner

Use Cases

• Real-time metrics computation off streams
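One common way to compute metrics off a stream is a tumbling (fixed, non-overlapping) window. The slides do not say which windowing scheme Skyscanner used, so the following is only a sketch under that assumption; `tumbling_counts`, the 60-second window and the event shape are hypothetical.

```python
from collections import Counter


def tumbling_counts(events, window_seconds=60):
    """Count events per metric name in fixed, non-overlapping windows.

    `events` is an iterable of (timestamp_seconds, metric_name) tuples and is
    assumed to arrive in timestamp order. Returns a list of
    (window_start, {metric_name: count}) pairs.
    """
    results = []
    current_window = None
    counts = Counter()
    for ts, name in events:
        window = ts - ts % window_seconds  # start of the window containing ts
        if current_window is not None and window != current_window:
            # The window has closed: emit it and start a fresh one.
            results.append((current_window, dict(counts)))
            counts = Counter()
        current_window = window
        counts[name] += 1
    if current_window is not None:
        results.append((current_window, dict(counts)))  # flush the last window
    return results


metrics = tumbling_counts(
    [(0, "search"), (30, "search"), (59, "click"), (60, "search"), (125, "click")]
)
```

Out-of-order or late events are not handled here; they would be counted in whichever window is open when they arrive.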

Page 16: Operating samza at skyscanner

Deployment & Local Development

Page 17: Operating samza at skyscanner

Current Deployment Pipelines

Page 18: Operating samza at skyscanner

Current Deployment

• No centralised configuration

• Restrictive source folder structure

• Ansible deployment scripts were embedded in each Samza job

Page 19: Operating samza at skyscanner

Local Development Environment

Page 20: Operating samza at skyscanner

New Deployment Configuration

Page 21: Operating samza at skyscanner

Centralised global config

Page 22: Operating samza at skyscanner

Drone Plugin

Drone reads the .drone.yml file

Reduced per-environment configuration
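The idea of a centralised global config with a small per-environment layer can be sketched as a simple merge. This is not the Drone plugin's actual behaviour, which the slides do not describe; `resolve_config`, the environment names and the host names are hypothetical, and the property values are illustrative only.

```python
def resolve_config(global_config, env_overrides, environment):
    """Return the global config with one environment's overrides applied.

    Keys absent from the override layer fall back to the global value, so
    each environment only has to declare what actually differs.
    """
    merged = dict(global_config)  # copy, so the shared defaults stay untouched
    merged.update(env_overrides.get(environment, {}))
    return merged


# Shared defaults kept in one central place (values are illustrative).
GLOBAL_CONFIG = {
    "job.factory.class": "org.apache.samza.job.yarn.YarnJobFactory",
    "systems.kafka.producer.bootstrap.servers": "kafka.dev.example:9092",
}

# Per-environment layer: only the differences (host name is hypothetical).
ENV_OVERRIDES = {
    "prod": {"systems.kafka.producer.bootstrap.servers": "kafka.prod.example:9092"},
}

prod_config = resolve_config(GLOBAL_CONFIG, ENV_OVERRIDES, "prod")
dev_config = resolve_config(GLOBAL_CONFIG, ENV_OVERRIDES, "dev")
```

With this layering, an environment that declares no overrides simply inherits the global settings.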

Page 23: Operating samza at skyscanner

Monitoring

Page 24: Operating samza at skyscanner

Metrics Pipeline

Page 25: Operating samza at skyscanner

Job Metrics

Page 26: Operating samza at skyscanner

Job Alerts

Page 27: Operating samza at skyscanner

Application Logs

• Application logs are forwarded to Elasticsearch through Logstash

• Requires a shared format for logging (log4j.xml)

• The YARN UI is not the most intuitive!

Page 28: Operating samza at skyscanner

Future

Page 29: Operating samza at skyscanner

Future

• More generic jobs

• Developers should only worry about writing code

• Fully automated production deployment

• Cross the boundaries of Batch vs Streaming?

Page 30: Operating samza at skyscanner

Question Time

Questions?

[email protected]

Page 31: Operating samza at skyscanner

Thank you