Open Source Data Collection/Ingestion Treasure Data, Inc. www.treasuredata.com
Transcript
Page 1: Open source data ingestion

Open Source Data Collection/Ingestion

Treasure Data, Inc.
www.treasuredata.com

Page 2: Open source data ingestion

Hello!

- “Committer” of Fluentd

- Treasure Data, Inc.

- Former Algorithmic Trader

- Stanford Math and CS

Page 3: Open source data ingestion

Table of Contents

1. Why you should care
2. Data Collection v. Data Ingestion
3. Examples: Data Collection Tools
4. Examples: Data Ingestion Tools
5. Case Study: Async App Logging

Links to be added after the talk.

Page 4: Open source data ingestion

Data Collection/Ingestion is HARD

Page 5: Open source data ingestion

(Big) Data Pipeline

[Diagram: Data Sources -> Raw Data Storage -> Processed Data -> Analysis Environment. Labeled steps: Data Collection and Ingestion, Data Pre-processing, Data Fetching. Labeled role: Data Engineers.]

Page 6: Open source data ingestion

If Data Collection Goes Awry...

[Diagram: Data Sources -> Raw Data Storage -> Processed Data -> Analysis Environment. Labeled steps: Data Collection and Ingestion, Data Pre-processing, Data Fetching. Labeled role: Data Engineers.]

Page 7: Open source data ingestion

Collection v. Ingestion

Page 8: Open source data ingestion

Data Collection

- Happens where data originates

- “logging code”

- Batch v. Streaming

- Pull v. Push

log.error("FUUUUU....WHY!?")

cln.send({"uid":1,"action":"died"})

200 GET a.com/?utm=big%20data

Page 9: Open source data ingestion

Data Ingestion

- Receives data

- Sometimes coupled with storage

- Routing data

[Diagram label: Data Ingestion Layer]

Page 10: Open source data ingestion

ex. Data Collection Tools

Page 11: Open source data ingestion

rsyslog

- The grandfather of data collectors

- Streaming

- Installed by default, widely understood

- Not as easy to extend/configure

Page 12: Open source data ingestion

rsyslog

https://github.com/rsyslog/rsyslog/blob/master/ChangeLog

Page 13: Open source data ingestion

Scribe

- Written originally at Facebook

- Streaming

- Fast (C++)

- Nightmare to build, largely abandoned

Page 14: Open source data ingestion

Flume-ng

- Written and maintained by Cloudera (successor to Flume)

- Commercial support by Cloudera. Track record for Hadoop

- Java can be heavy-handed for some orgs/cases

Page 15: Open source data ingestion

Logstash

- Pluggable architecture, rich ecosystem

- The “L” of the ELK stack by Elastic

- JRuby

- HA uses Redis as a queue

http://apuntesdetrabajo.es/?p=263

Page 16: Open source data ingestion

Heka

- Developed at Mozilla

- Written in Go, extensible w/ Lua

- Plugin system, but compilation needed (Go’s limitation, may change)

Page 17: Open source data ingestion

Fluentd

- Plugin architecture

- Built-in HA

- CRuby (JRuby on the roadmap)

- google-fluentd, td-agent

- Lightweight multi-source, multi-destination log routing

Page 18: Open source data ingestion

Embulk

- Plugin architecture

- Focuses on Batch workloads

- Java/JRuby

- Very new! (looking for contributors!)

Page 19: Open source data ingestion

ex. Data Ingestion Tools

Page 20: Open source data ingestion

RabbitMQ

- Written in Erlang, supported by Pivotal

- Implements AMQP

Page 21: Open source data ingestion

Kafka

- Begun at LinkedIn, now Confluent

- Topic-based Message Broker: Producer/Broker/Consumer

- Distributed design

- Provides at-least-once or at-most-once delivery, depending on how consumers commit offsets
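The difference between the two guarantees comes down to whether the consumer records its position before or after it processes a message. Below is a minimal sketch in plain Python that uses no Kafka client at all. The `consume` function, its `crash_at` parameter, and the in-memory list standing in for a topic are all hypothetical, chosen only to show the ordering.

```python
def consume(log, commit_first, crash_at):
    """Simulate one consumer that crashes once, then restarts from its committed offset.

    commit_first=True  -> offset is saved before processing (at most once)
    commit_first=False -> offset is saved after processing  (at least once)
    crash_at is the offset at which the simulated crash happens,
    always between the two steps (hypothetical parameter).
    """
    processed = []
    committed = 0      # offset the consumer resumes from after a restart
    crashed = False
    offset = 0
    while offset < len(log):
        record = log[offset]
        # Step 1: either commit or process, depending on the mode.
        if commit_first:
            committed = offset + 1
        else:
            processed.append(record)
        # Simulated crash between step 1 and step 2 (happens only once).
        if offset == crash_at and not crashed:
            crashed = True
            offset = committed   # restart from the last committed offset
            continue
        # Step 2: the other half of the work.
        if commit_first:
            processed.append(record)
        else:
            committed = offset + 1
        offset += 1
    return processed


at_most_once = consume(["a", "b", "c"], commit_first=True, crash_at=1)
at_least_once = consume(["a", "b", "c"], commit_first=False, crash_at=1)
```

With commit-first, the record at the crash point is never processed (a loss). With process-first, it is processed twice (a duplicate).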

Page 22: Open source data ingestion

Fluentd!?

- Used (abused?) as a bus/MQ

- tag-based event routing

- Can be combined with RabbitMQ/Kafka, etc.
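Tag-based routing can be pictured as an ordered list of (pattern, destination) rules, where the first rule whose pattern matches an event's tag receives the event. The sketch below is a simplified illustration and is not Fluentd code. The `Router` class is hypothetical, and it uses Python's shell-style `fnmatch` wildcards, which are looser than Fluentd's own match-pattern syntax.

```python
from fnmatch import fnmatchcase


class Router:
    """Hypothetical first-match-wins router keyed on dotted event tags."""

    def __init__(self):
        self.rules = []  # (pattern, handler) pairs in declaration order

    def match(self, pattern, handler):
        self.rules.append((pattern, handler))

    def emit(self, tag, record):
        """Deliver the record to the first matching rule; return False if none match."""
        for pattern, handler in self.rules:
            if fnmatchcase(tag, pattern):
                handler(tag, record)
                return True
        return False


to_search, to_archive = [], []
router = Router()
router.match("app.error.*", lambda tag, rec: to_search.append((tag, rec)))
router.match("app.*", lambda tag, rec: to_archive.append((tag, rec)))

router.emit("app.error.api", {"msg": "boom"})        # goes to the first rule only
router.emit("app.access", {"path": "/foobar"})       # falls through to the second rule
unrouted = router.emit("sys.kernel", {"msg": "ignored"})  # no rule matches
```

Rule order matters here: the more specific pattern is declared first, so error events never reach the catch-all destination.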

Page 23: Open source data ingestion

case study: Async App Logging

Page 24: Open source data ingestion

Application Logging

- Common ask: “How’s our new feature doing?”

[Diagram: a client sends GET /foobar to the API Server, which responds 200 {...}]

Page 25: Open source data ingestion

Application Logging

- What NOT to do: synchronous logging

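[Diagram: a client sends GET /foobar to the API Server. The API Server writes to the Data Backend and receives an ack as part of handling the request, then responds 200 {...}]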

Page 26: Open source data ingestion

Application Logging

- What to do instead: asynchronous logging through a local data collector

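[Diagram: a client sends GET /foobar to the API Server, which writes to a Local Data Collector and responds 200 {...}. The collector holds records in a Buffer, flushes them to the Data Backend, and receives the ack.]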
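The buffer-and-flush arrangement can be sketched with a queue and a background thread: the request path only enqueues, and a worker sends batches to the backend. This is an illustration only, not how any particular collector is implemented. `BufferedLogger`, `backend_write`, `batch_size`, and `capacity` are hypothetical names, the buffer lives only in memory, and a real collector would typically also flush on a timer.

```python
import queue
import threading


class BufferedLogger:
    """Hypothetical local collector: log() only enqueues; a worker thread flushes batches."""

    def __init__(self, backend_write, batch_size=3, capacity=1000):
        self._buffer = queue.Queue(maxsize=capacity)
        self._backend_write = backend_write   # callable that receives a list of records
        self._batch_size = batch_size
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def log(self, record):
        """Called on the request path; never waits on the backend."""
        try:
            self._buffer.put_nowait(record)
            return True
        except queue.Full:
            return False   # buffer full: drop the record rather than block the request

    def _run(self):
        batch = []
        while True:
            record = self._buffer.get()
            if record is None:             # sentinel from close(): stop reading
                break
            batch.append(record)
            if len(batch) >= self._batch_size:
                self._backend_write(batch)
                batch = []
        if batch:                          # flush whatever is left on shutdown
            self._backend_write(batch)

    def close(self):
        self._buffer.put(None)
        self._worker.join()


backend = []   # stands in for the data backend; each element is one flushed batch
logger = BufferedLogger(backend.append, batch_size=3)
for i in range(7):
    logger.log({"uid": i, "action": "viewed"})
logger.close()
```

Dropping on a full buffer is one possible policy; blocking or spilling to disk are alternatives with different trade-offs.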

Page 27: Open source data ingestion

But wait...

- Is writing to a local log collector safe?

- What if the log collector retries on error?

- A lot of problems to think about!
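One concrete version of the retry problem: if the backend stores a record but its ack is lost, the collector retries and the record may be stored twice. A common mitigation, which the slides do not prescribe, is to attach a unique ID to each record so that a repeated write overwrites instead of duplicating. The sketch below assumes that approach; `FlakyBackend` and `send_with_retry` are hypothetical, and backoff between attempts is omitted for brevity.

```python
import uuid


class FlakyBackend:
    """Hypothetical backend whose first ack is lost: the write lands, but the caller sees an error."""

    def __init__(self):
        self.stored = {}   # keyed by record id, so a repeated write overwrites
        self.calls = 0

    def write(self, record):
        self.calls += 1
        self.stored[record["id"]] = record
        if self.calls == 1:
            raise TimeoutError("ack lost")


def send_with_retry(backend, record, max_attempts=3):
    """Retry on timeout; return the number of attempts that were needed."""
    for attempt in range(1, max_attempts + 1):
        try:
            backend.write(record)
            return attempt
        except TimeoutError:
            continue
    raise RuntimeError("gave up after retries")


backend = FlakyBackend()
record = {"id": str(uuid.uuid4()), "uid": 1, "action": "died"}
attempts = send_with_retry(backend, record)
```

The write is attempted twice, but because the backend is keyed on the record ID it ends up holding a single copy.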

Page 28: Open source data ingestion

“Much of the blame, little of the glory”

(Just kidding. The entire data team relies on YOU!)

Page 29: Open source data ingestion

Thank you! (...and we are hiring!)

www.treasuredata.com/careers