Streaming ETL for All

Apr 12, 2017

Joey Echeverria
Transcript
Page 1: Streaming ETL for All

© Rocana, Inc. All Rights Reserved. | 1

Joey Echeverria, Platform Technical Lead

San Francisco Hadoop Users Group, June 14th 2016

San Francisco, CA

Streaming ETL for All

Page 3: Streaming ETL for All

Context

Page 4: Streaming ETL for All

Joey

• Where I work: Rocana – Platform Technical Lead

• Where I used to work: Cloudera (’11-’15), NSA

• Distributed systems, security, data processing, big data

Page 5: Streaming ETL for All

Page 6: Streaming ETL for All

History

Page 7: Streaming ETL for All

“Legacy” data architecture

[Diagram labels: Data Producers, Flume/Sqoop, HDFS, Avro/Parquet Files, MapReduce, Spark, Impala, Visualization/Query]

Page 8: Streaming ETL for All

Stream data architecture

[Diagram labels: Data Producers, Kafka, Avro Serialized Records, Spark Streaming, Storm, Flink, Real-time Visualization, Kafka Consumers, HDFS, Avro/Parquet Files]

Page 10: Streaming ETL for All

Stream processing

A primer

Page 11: Streaming ETL for All

Stream processing

• Filter

• Extract

• Project

• Aggregate

• Join

• Model

Page 13: Streaming ETL for All

Stream processing

• Filter

• Extract

• Project

• Aggregate

• Join

• Model

• Data transformation

Page 14: Streaming ETL for All

Apache Storm

• "Distributed real-time computation system"

• Applications packaged into topologies (think MapReduce job)

• Topologies operate over streams of tuples

• Spout: source of a stream

• Bolt: arbitrary operation such as filtering, aggregating, joining, or executing arbitrary functions
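
The spout and bolt roles can be illustrated with plain Python generators. This is only a conceptual sketch, not Storm's API: the function names, the sample lines, and the chaining of generators into a "topology" are all assumptions made for illustration.

```python
def request_spout():
    """Spout: the source of a stream of tuples (a fixed list here, by assumption)."""
    for line in ["GET /a 200", "GET /b 404", "GET /c 200"]:
        yield (line,)


def parse_bolt(stream):
    """Bolt: turn each raw line into a (path, status) tuple."""
    for (line,) in stream:
        _method, path, status = line.split()
        yield (path, int(status))


def error_filter_bolt(stream):
    """Bolt: keep only tuples whose status code signals an error."""
    for path, status in stream:
        if status >= 400:
            yield (path, status)


# The "topology" wires the spout into a chain of bolts.
topology = error_filter_bolt(parse_bolt(request_spout()))
results = list(topology)
```

A real topology runs distributed across workers; the sketch only shows how tuples flow from a source through successive operations.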

Page 15: Streaming ETL for All

Apache Spark

• Supports batch and stream processing

• Continuous stream of records discretized into a DStream

• DStream: a sequence of RDDs (batches of records)

• Micro-batch
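
The micro-batch idea can be sketched without Spark: an ordered stream is cut into batches by time interval. This is not Spark's API. The `micro_batches` name is hypothetical, and the sketch batches on a timestamp carried in each record, whereas Spark Streaming batches on arrival time and also emits empty batches.

```python
from itertools import groupby


def micro_batches(records, interval_ms):
    """Group an ordered stream of (ts, value) records into per-interval batches."""
    def batch_start(record):
        # Floor the timestamp to the start of its interval.
        return record[0] // interval_ms * interval_ms

    for start, group in groupby(records, key=batch_start):
        yield start, [value for _, value in group]


records = [(0, "a"), (400, "b"), (1000, "c"), (1999, "d"), (3000, "e")]
batches = list(micro_batches(records, 1000))
```

Each yielded batch plays the role of one RDD in the DStream sequence; intervals with no records simply produce no batch in this sketch.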

Page 16: Streaming ETL for All

Apache Flink

• Supports batch and stream processing

• DataStream: unbounded collection of records

• Operations can apply to individual records or windows of records

• Supports record-at-a-time processing (like Storm)
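
The difference between per-record and windowed operations can be sketched with two small generators. This is not Flink's DataStream API; `per_record` and `count_windows` are hypothetical names, and the window here is a simple count-based tumbling window that only fires when full.

```python
def per_record(stream, fn):
    """Apply fn to each record as it arrives (record-at-a-time)."""
    for record in stream:
        yield fn(record)


def count_windows(stream, size):
    """Emit a window every `size` records; a trailing partial window is dropped."""
    window = []
    for record in stream:
        window.append(record)
        if len(window) == size:
            yield window
            window = []


doubled = list(per_record([1, 2, 3], lambda x: x * 2))
window_sums = [sum(w) for w in count_windows(range(5), 2)]
```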

Page 17: Streaming ETL for All

Apache Kafka

• Pub-sub messaging system implemented as a distributed commit log

• Popular as a source and sink for data streams

• Scalability, durability, and easy-to-understand delivery guarantees

• Can do stream processing directly in Kafka consumers

• Kafka Streams
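
The commit-log model is what makes the delivery guarantees easy to reason about: records are appended at increasing offsets, and each consumer tracks its own position. A single-process sketch of that idea follows; it is not the Kafka client API, and `CommitLog` and `Consumer` are hypothetical classes with no partitions, replication, or persistence.

```python
class CommitLog:
    """Append-only list of records addressed by offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset, max_records):
        return self._records[offset:offset + max_records]


class Consumer:
    """Reads from a log and remembers its own offset."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = CommitLog()
for record in ["e1", "e2", "e3"]:
    log.append(record)

fast, slow = Consumer(log), Consumer(log)
```

Because reading never removes records, two consumers can process the same log independently and at different speeds.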

Page 18: Streaming ETL for All

Data transformation

Page 19: Streaming ETL for All

Filter

filter

Page 20: Streaming ETL for All

Extract

127.0.0.1 Mozilla/5.0 laura [31/Mar/2016] "GET /index.html HTTP/1.0" 200 2326

ts: 1436576671000
body: <binary blob>
event_type_id: 100
...

extract

ts: 1436576671000
body: <binary blob>
event_type_id: 100
attributes: {
  ip: "127.0.0.1"
  user_agent: "Mozilla/5.0"
  user_id: "laura"
  date: "[31/Mar/2016]"
  request: "GET /index.html HTTP/1.0"
  status_code: "200"
  size: "2326"
}
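
The extract step on this slide can be approximated with a regular expression over the event body. This is a minimal sketch, not Rocana's implementation: the pattern is written only for the sample line shown, the `extract` function is hypothetical, and returning `None` for a non-matching body is an assumption.

```python
import re

# Hypothetical pattern for the sample log line; group names follow the
# attribute names shown on the slide.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) (?P<user_agent>\S+) (?P<user_id>\S+) '
    r'(?P<date>\[[^\]]+\]) "(?P<request>[^"]*)" '
    r'(?P<status_code>\d{3}) (?P<size>\d+)$'
)


def extract(event):
    """Return a copy of the event with attributes parsed from its body."""
    match = LOG_PATTERN.match(event["body"].decode("utf-8"))
    if match is None:
        return None  # assumption: unparseable events are dropped
    enriched = dict(event)
    enriched["attributes"] = match.groupdict()
    return enriched


event = {
    "ts": 1436576671000,
    "body": b'127.0.0.1 Mozilla/5.0 laura [31/Mar/2016] '
            b'"GET /index.html HTTP/1.0" 200 2326',
    "event_type_id": 100,
}
extracted = extract(event)
```

Every extracted value stays a string at this stage, which matches the quoted values in the attributes map above.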

Page 21: Streaming ETL for All

Project

ts: 1436576671000
body: <binary blob>
event_type_id: 100
attributes: {
  ip: "127.0.0.1"
  user_agent: "Mozilla/5.0"
  user_id: "laura"
  date: "[31/Mar/2016]"
  request: "GET /index.html HTTP/1.0"
  status_code: "200"
  size: "2326"
}

project

ts: 1459444413000
ip: "127.0.0.1"
user_agent: "Mozilla/5.0"
user_id: "laura"
request: "GET /index.html HTTP/1.0"
status_code: 200
size: 2326
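
A projection like the one on this slide selects fields, promotes them to the top level, and converts types. The sketch below assumes a hypothetical `project` function over a dictionary-shaped event. It carries `ts` through unchanged; the slide's output shows a different `ts`, and how that value was derived is not stated, so the sketch does not try to reproduce it.

```python
def project(event):
    """Flatten selected attributes to top-level fields with numeric types."""
    attrs = event["attributes"]
    return {
        "ts": event["ts"],
        "ip": attrs["ip"],
        "user_agent": attrs["user_agent"],
        "user_id": attrs["user_id"],
        "request": attrs["request"],
        "status_code": int(attrs["status_code"]),  # string -> integer
        "size": int(attrs["size"]),                # string -> integer
    }


event = {
    "ts": 1436576671000,
    "body": b"<binary blob>",
    "event_type_id": 100,
    "attributes": {
        "ip": "127.0.0.1",
        "user_agent": "Mozilla/5.0",
        "user_id": "laura",
        "date": "[31/Mar/2016]",
        "request": "GET /index.html HTTP/1.0",
        "status_code": "200",
        "size": "2326",
    },
}
projected = project(event)
```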

Page 22: Streaming ETL for All

Problem

Page 23: Streaming ETL for All

Who

• Developers

• Data engineers

• Sysadmins

• Analysts

Page 24: Streaming ETL for All

Tools

Page 25: Streaming ETL for All

The dark art of data science

• Feature engineering

• “Getting a mess of raw data that can be used as input to a machine learning algorithm” - @josh_wills

• Video from Midwest.io 2014

Page 26: Streaming ETL for All

Data transformation for all

Page 27: Streaming ETL for All

Rocana Transform

• Library

• Java

• Rocana configuration
  • JSON + comments + specific numeric types - excess quoting

Page 28: Streaming ETL for All

Data model

• Event schema
  • id: A globally unique identifier for this event
  • ts: Epoch timestamp in milliseconds
  • event_type_id: ID indicating the type of the event
  • location: Location from which the event was generated
  • host: Hostname, IP, or other device identifier from which the event was generated
  • service: Service or process from which the event was generated
  • body: Raw event content in bytes
  • attributes: Event type-specific key/value pairs
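
As a rough illustration, the event schema could be mirrored by a small record type. This is only a stand-in: the `Event` dataclass and its Python types are assumptions inferred from the field descriptions, not the actual schema definition.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Hypothetical in-memory mirror of the event schema described above."""
    id: str                 # globally unique identifier
    ts: int                 # epoch timestamp in milliseconds
    event_type_id: int      # type of the event
    location: str           # where the event was generated
    host: str               # hostname, IP, or other device identifier
    service: str            # service or process that generated the event
    body: bytes             # raw event content
    attributes: Dict[str, str] = field(default_factory=dict)


event = Event(
    id="example-id",
    ts=1436576671000,
    event_type_id=100,
    location="aws/us-west-2a",
    host="example01.rocana.com",
    service="dhclient",
    body=b"raw bytes",
)
```

Keeping `attributes` as a string-to-string map reflects the "event type-specific key/value pairs" bullet: anything that does not fit the fixed fields lands there until a transformation gives it a type.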

Page 29: Streaming ETL for All

Example event

{
  "id": "JRHAIDMLCKLEAPMIQDHFLO3MXYXV7NVBEJNDKZGS2XVSEINGGBHA====",
  "event_type_id": 100,
  "ts": 1436576671000,
  "location": "aws/us-west-2a",
  "host": "example01.rocana.com",
  "service": "dhclient",
  "body": "<36>Jul 10 18:04:31 gs09.example.com dhclient[865] DHCPACK from …",
  "attributes": {
    "syslog_timestamp": "1436576671000",
    "syslog_process": "dhclient",
    "syslog_pid": "865",
    "syslog_facility": "3",
    "syslog_severity": "6",
    "syslog_hostname": "example01",
    "syslog_message": "DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)"
  }
}

Page 30: Streaming ETL for All

Filter, extract, and flatten

Page 31: Streaming ETL for All

Filter, extract, and flatten

• Filter out events without type id 100

• Filter out events without hostname prefix "ex"

• Extract a numeric prefix from the syslog message

• Flatten syslog attributes to top-level fields in a different Avro schema

Page 32: Streaming ETL for All

Filter, extract, and flatten

{
  load-event: {},
  // Filter by event_type_id
  filter: { expression: "${event_type_id == 100}" },
  // Extract hostname prefix
  regex: { ... },
  filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },
  // Extract a numeric prefix from the syslog message
  regex: { ... },
  // Build flattened record
  build-avro-record: { ... },
  // Accumulate output record
  accumulate-output: { value: "${output_record}" }
}

Page 33: Streaming ETL for All

Extract hostname prefix

{
  load-event: {},
  filter: { expression: "${event_type_id == 100}" },
  regex: {
    pattern: "^(.{2}).*$",
    value: "${attr.syslog_hostname}",
    destination: "host_prefix"
  },
  filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },
  ...
}

Page 34: Streaming ETL for All

Extract numeric prefix

  ...
  filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },
  regex: {
    pattern: "^([0-9]*)",
    value: "${attributes['syslog_message']}",
    destination: "msg",
    match-actions: {
      set-values: { extracted_field: "${msg.match.group.1}" }
    },
    no-match-actions: {
      set-values: { extracted_field: "" }
    }
  },
  ...

Page 35: Streaming ETL for All

Build flattened record

  ...
  build-avro-record: {
    schema-uri: "resource:avro-schemas/flattened-syslog.avsc",
    destination: "output_record",
    field-mapping: {
      ts: "${ts}",
      event_type_id: "${event_type_id}",
      source: "${source}",
      syslog_facility: "${convert:toInt(attributes['syslog_facility'])}",
      syslog_severity: "${convert:toInt(attributes['syslog_severity'])}",
      ...
      syslog_message: "${attributes['syslog_message']}",
      syslog_pid: "${convert:toInt(attributes['syslog_pid'])}",
      extracted_field: "${extracted_field}"
    },
  },
  ...
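
For readers who want to see the whole filter, extract, and flatten pipeline as ordinary code, here is a rough Python equivalent. It is a sketch under assumptions: the `transform` function is hypothetical, Python's `re.match` may not behave exactly like the configuration's regex action, and only fields present in the example syslog event are mapped.

```python
import re


def transform(event):
    """Filter by type and hostname prefix, then build a flattened record."""
    # Filter out events without type id 100.
    if event["event_type_id"] != 100:
        return None
    attrs = event["attributes"]

    # Extract the two-character hostname prefix and filter on it.
    host_prefix = re.match(r"^(.{2}).*$", attrs["syslog_hostname"])
    if host_prefix is None or host_prefix.group(1) != "ex":
        return None

    # Extract a numeric prefix from the syslog message, defaulting to "".
    msg = re.match(r"^([0-9]*)", attrs["syslog_message"])
    extracted_field = msg.group(1) if msg else ""

    # Flatten attributes to top-level fields, converting numeric strings.
    return {
        "ts": event["ts"],
        "event_type_id": event["event_type_id"],
        "syslog_facility": int(attrs["syslog_facility"]),
        "syslog_severity": int(attrs["syslog_severity"]),
        "syslog_message": attrs["syslog_message"],
        "syslog_pid": int(attrs["syslog_pid"]),
        "extracted_field": extracted_field,
    }


event = {
    "event_type_id": 100,
    "ts": 1436576671000,
    "attributes": {
        "syslog_pid": "865",
        "syslog_facility": "3",
        "syslog_severity": "6",
        "syslog_hostname": "example01",
        "syslog_message": "DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)",
    },
}
flat = transform(event)
```

The point of the configuration approach is that the same steps are expressed declaratively, so this kind of function does not have to be hand-written for every pipeline.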

Page 36: Streaming ETL for All

Extract metrics from log data

Page 37: Streaming ETL for All

Extract metrics

• Input: HTTP status logs

• Extract request latency

• Extract counts by HTTP status code

• Metric types
  • Gauge: A value that varies over time (think latency, CPU %, etc.)
  • Counter: A value that accumulates over time (think event volume, status codes, etc.)

Page 38: Streaming ETL for All

Example metric event

{
  "id": "JRHAIDMLCKLEAPMIQDHFLO3MXBBQ7NVBEJNDKZGS2XVSEINGGBHA====",
  "event_type_id": 107,
  "ts": 1436576671000,
  "location": "aws/us-west-2a",
  "host": "web01.rocana.com",
  "service": "httpd",
  "attributes": {
    "m.http.request.latency": "4.2000000000E1|g",
    "m.http.status.401.count": "1.0000000000E0|c"
  }
}
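
The attribute values in this example appear to pack a number and a metric type into one string. The sketch below decodes that form under assumptions inferred only from the example: an `m.` prefix marks a metric attribute, and the suffix after `|` is `g` for gauge or `c` for counter. The `parse_metric` helper is hypothetical.

```python
METRIC_KINDS = {"g": "gauge", "c": "counter"}  # inferred from the example


def parse_metric(name, encoded):
    """Decode a 'value|type' string into a name, a float, and a metric kind."""
    value_text, type_code = encoded.rsplit("|", 1)
    metric_name = name[2:] if name.startswith("m.") else name
    return {
        "name": metric_name,
        "value": float(value_text),
        "kind": METRIC_KINDS[type_code],
    }


latency = parse_metric("m.http.request.latency", "4.2000000000E1|g")
status = parse_metric("m.http.status.401.count", "1.0000000000E0|c")
```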

Page 39: Streaming ETL for All

Extract metrics

{
  load-event: {},
  build-metric: {
    gauge-mapping: {
      http.request.latency: "${convert:toDouble(attributes['latency'])}"
    },
    destination: "latency_metric"
  },
  accumulate-output: { value: "${latency_metric}" },
  build-metric: {
    dynamic-counter-mapping: [
      "${string:format('http.status.%s.count', attributes['sc_status'])}", 1D
    ],
    destination: "status_metric"
  },
  accumulate-output: { value: "${status_metric}" }
}
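
In ordinary code, this configuration amounts to building one gauge from the latency attribute and one counter whose name is derived from the status code. The sketch below assumes a hypothetical `build_metrics` function and a simple dictionary shape for the output; the real metric record format is not shown in the slides.

```python
def build_metrics(attributes):
    """Return a latency gauge and a per-status-code counter for one log event."""
    latency_metric = {
        "gauges": {"http.request.latency": float(attributes["latency"])},
    }
    # The counter name is computed per event, like dynamic-counter-mapping.
    counter_name = "http.status.%s.count" % attributes["sc_status"]
    status_metric = {"counters": {counter_name: 1.0}}
    return [latency_metric, status_metric]


metrics = build_metrics({"latency": "42", "sc_status": "401"})
```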

Page 40: Streaming ETL for All

Architecture

Page 41: Streaming ETL for All

Architecture

[Diagram labels: Configuration file, Java action objects, Driver, Context, Variables]

1. Parse config

2. Initialize context

3. Execute actions

4. Read/write variables

5. Copy output
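
The five numbered steps can be sketched as a small driver loop. Everything here is a hypothetical stand-in for the Java library: the class names, the `set-value` action, the use of Python callables in place of `${...}` expressions, and the convention that an action returning `False` stops processing of the current event.

```python
class Context:
    """Per-event scratch space: named variables plus accumulated output."""

    def __init__(self, event):
        self.variables = dict(event)
        self.output = []


class FilterAction:
    def __init__(self, params):
        self.predicate = params["predicate"]

    def execute(self, context):
        # False tells the driver to stop processing this event.
        return bool(self.predicate(context.variables))


class SetValueAction:
    def __init__(self, params):
        self.name, self.compute = params["name"], params["compute"]

    def execute(self, context):
        context.variables[self.name] = self.compute(context.variables)
        return True


class AccumulateOutputAction:
    def __init__(self, params):
        self.variable = params["variable"]

    def execute(self, context):
        context.output.append(context.variables[self.variable])
        return True


ACTION_TYPES = {
    "filter": FilterAction,
    "set-value": SetValueAction,
    "accumulate-output": AccumulateOutputAction,
}


def run(config, events):
    # 1. Parse config: turn (name, params) pairs into action objects.
    actions = [ACTION_TYPES[name](params) for name, params in config]
    results = []
    for event in events:
        context = Context(event)             # 2. Initialize context
        for action in actions:               # 3. Execute actions
            if not action.execute(context):  # 4. Actions read/write variables
                break
        results.extend(context.output)       # 5. Copy output
    return results


config = [
    ("filter", {"predicate": lambda v: v["event_type_id"] == 100}),
    ("set-value", {"name": "output_record",
                   "compute": lambda v: {"ts": v["ts"], "host": v["host"]}}),
    ("accumulate-output", {"variable": "output_record"}),
]
events = [
    {"event_type_id": 100, "ts": 1436576671000, "host": "example01.rocana.com"},
    {"event_type_id": 107, "ts": 1436576671000, "host": "web01.rocana.com"},
]
output = run(config, events)
```

Because the driver only needs a list of action objects and a stream of events, the same configuration can be run from any host process, whether a stream processor or a batch job.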

Page 42: Streaming ETL for All

Custom actions

• Actions loaded at runtime using Java services framework

• Add your jar to the classpath

• Custom actions appear as top-level keywords just like regular actions

• Implement the execute() method of the Action interface

• Implement the build() method of the ActionBuilder interface
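
The builder-plus-action pattern described above can be sketched as follows. The real interfaces are Java and their method signatures are not shown in the slides, so the parameter lists, the `keyword` attribute, the `uppercase` action, and the use of `__subclasses__()` as a stand-in for runtime service discovery are all assumptions.

```python
from abc import ABC, abstractmethod


class Action(ABC):
    @abstractmethod
    def execute(self, context):
        """Read and write variables in the context."""


class ActionBuilder(ABC):
    keyword = None  # hypothetical: the top-level keyword used in configuration

    @abstractmethod
    def build(self, params):
        """Create a configured Action from the keyword's parameters."""


class UppercaseAction(Action):
    def __init__(self, source, destination):
        self.source, self.destination = source, destination

    def execute(self, context):
        context[self.destination] = context[self.source].upper()


class UppercaseActionBuilder(ActionBuilder):
    keyword = "uppercase"

    def build(self, params):
        return UppercaseAction(params["value"], params["destination"])


def discover_builders():
    """Stand-in for runtime discovery: index every known builder by keyword."""
    return {cls.keyword: cls() for cls in ActionBuilder.__subclasses__()}


builders = discover_builders()
action = builders["uppercase"].build({"value": "host", "destination": "host_upper"})
context = {"host": "web01"}
action.execute(context)
```

Once a builder is discoverable, its keyword can be used in configuration like any built-in action, which is what lets custom parsers and lookups plug in without changes to the driver.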

Page 43: Streaming ETL for All

Custom actions

• Parse custom log formats
  • Cisco ACS
  • Citrix
  • Juniper
  • Customer-specific formats

• Look up IP addresses in the MaxMind GeoIP2 database

• Reference dataset lookups
  • Device id to device name

Page 44: Streaming ETL for All

Putting it all together

• Stream processing is causing us to re-think how we analyze data

• Limiting the accessibility of the data transformation side increases costs and decreases velocity

• Reduce your reliance on developers to code custom pipelines

• Re-use transformation configuration in any stream processing framework or batch job

Page 45: Streaming ETL for All

Coming soon

• Rocana Transform will be released under the ASL 2.0

• The configuration library is available today:
  • https://github.com/scalingdata/rocana-configuration

Page 46: Streaming ETL for All

Questions?