Hello!
- “Committer” of Fluentd
- Treasure Data, Inc.
- Former Algorithmic Trader
- Stanford Math and CS
Table of Contents
1. Why you should care
2. Data Collection v. Data Ingestion
3. Examples: Data Collection Tools
4. Examples: Data Ingestion Tools
5. Case Study: Async App Logging
Links to be added after the talk.
(Big) Data Pipeline
[Figure: Data Sources -> Raw Data Storage -> Processed Data -> Analysis Environment. Stage labels: Data Collection and Ingestion, Data Pre-processing, Data Fetching. Also labeled: Data Engineers.]
If Data Collection Goes Awry...
[Figure: the pipeline again: Data Sources -> Raw Data Storage -> Processed Data -> Analysis Environment. Stage labels: Data Collection and Ingestion, Data Pre-processing, Data Fetching. Also labeled: Data Engineers.]
Data Collection
- Happens where data originates
- “logging code”
- Batch v. Streaming
- Pull v. Push
log.error("FUUUUU....WHY!?")
cln.send({"uid":1,"action":"died"})
200 GET a.com/?utm=big%20data
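The three sample lines above are unstructured text, an application event, and an access-log line. A minimal sketch of turning the last two into structured records at the point of collection follows. The function names and the "STATUS METHOD URL" line format are assumptions for illustration, not any collector's real API.

```python
import json
from urllib.parse import urlsplit, parse_qs


def event_to_line(event: dict) -> str:
    """Serialize one application event as a single JSON line."""
    return json.dumps(event, sort_keys=True)


def parse_access_line(line: str) -> dict:
    """Parse a simplified 'STATUS METHOD URL' line (hypothetical format)."""
    status, method, url = line.split(maxsplit=2)
    # The sample URL has no scheme; prefixing '//' lets urlsplit find the host.
    parts = urlsplit("//" + url)
    # parse_qs also percent-decodes values, so 'big%20data' becomes 'big data'.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "status": int(status),
        "method": method,
        "host": parts.netloc,
        "path": parts.path,
        "query": query,
    }


event_line = event_to_line({"uid": 1, "action": "died"})
access_record = parse_access_line("200 GET a.com/?utm=big%20data")
print(event_line)
print(access_record)
```

Emitting one self-describing record per line keeps the collector simple: it can forward lines without knowing the application's schema.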
rsyslog
- The grandfather of data collectors
- Streaming
- Installed by default, widely understood
- Not as easy to extend/configure
Scribe
- Written originally at Facebook
- Streaming
- Fast (C++)
- Nightmare to build, largely abandoned
Flume-ng
- Written and maintained by Cloudera (successor to Flume)
- Commercial support by Cloudera. Track record for Hadoop
- Java can be heavy-handed for some orgs/cases
Logstash
- Pluggable architecture, rich ecosystem
- The “L” of the ELK stack by Elastic
- JRuby
- HA uses Redis as a queue
http://apuntesdetrabajo.es/?p=263
Heka
- Developed at Mozilla
- Written in Go, extensible w/ Lua
- Plugin system, but compilation needed (Go’s limitation, may change)
Fluentd
- Plugin architecture
- Built-in HA
- CRuby (JRuby on the roadmap)
- google-fluentd, td-agent
- Lightweight multi-source, multi-destination log routing
Embulk
- Plugin architecture
- Focuses on Batch workloads
- Java/JRuby
- Very new! (looking for contributors!)
Kafka
- Started at LinkedIn; now commercially backed by Confluent
- Topic-based Message Broker: Producer/Broker/Consumer
- Distributed design
- Delivery semantics (at-least-once or at-most-once) are determined by the consumer
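Why the consumer decides the delivery guarantee can be shown with a toy simulation. This is plain Python, not the Kafka client API. The log, the offsets, and the `consume` function are hypothetical stand-ins. The only thing that differs between the two modes is whether the offset is committed before or after the message is processed, with one simulated crash between the two steps.

```python
def consume(log, commit_first, crash_at):
    """Replay `log` from the last committed offset, crashing once at `crash_at`.

    commit_first=True  models at-most-once  (commit, then process).
    commit_first=False models at-least-once (process, then commit).
    Returns the messages whose processing actually ran.
    """
    committed = 0          # offset a restarted consumer resumes from
    delivered = []
    crashed = False
    while committed < len(log):
        offset = committed
        message = log[offset]
        crash_now = offset == crash_at and not crashed
        if commit_first:
            committed = offset + 1          # step 1: commit
            if crash_now:
                crashed = True
                continue                    # crash before processing: message lost
            delivered.append(message)       # step 2: process
        else:
            delivered.append(message)       # step 1: process
            if crash_now:
                crashed = True
                continue                    # crash before commit: message replayed
            committed = offset + 1          # step 2: commit
    return delivered


log = ["a", "b", "c"]
at_most_once = consume(log, commit_first=True, crash_at=1)
at_least_once = consume(log, commit_first=False, crash_at=1)
print(at_most_once)
print(at_least_once)
```

In this model at-most-once drops the message in flight during the crash, while at-least-once delivers it twice, so downstream code has to tolerate duplicates.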
Fluentd!?
- Used (abused?) as a bus/MQ
- tag-based event routing
- Can be combined with RabbitMQ/Kafka, etc.
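Tag-based event routing can be sketched as a small first-match router. This is a simplified model written for illustration, not Fluentd's implementation. It assumes dot-separated tags, `*` matching exactly one tag part, `**` matching zero or more parts, and rules evaluated top to bottom. The rule list and destination names are hypothetical.

```python
def _match(pattern_parts, tag_parts):
    """Recursively match pattern parts against tag parts."""
    if not pattern_parts:
        return not tag_parts
    head, rest = pattern_parts[0], pattern_parts[1:]
    if head == "**":
        # '**' may consume zero or more tag parts.
        return any(_match(rest, tag_parts[i:]) for i in range(len(tag_parts) + 1))
    if not tag_parts:
        return False
    return (head == "*" or head == tag_parts[0]) and _match(rest, tag_parts[1:])


def tag_matches(pattern, tag):
    """True if a dot-separated tag matches the pattern."""
    return _match(pattern.split("."), tag.split("."))


def route(rules, tag):
    """Return the destination of the first rule whose pattern matches, else None."""
    for pattern, destination in rules:
        if tag_matches(pattern, tag):
            return destination
    return None


# Hypothetical routing table: most specific rules first, catch-all last.
rules = [
    ("app.error", "alerts"),
    ("app.*", "search"),
    ("**", "archive"),
]

for tag in ["app.error", "app.access", "sys.kernel.warn"]:
    print(tag, "->", route(rules, tag))
```

Because the first match wins, rule order matters: putting the catch-all first would send every event to the archive.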
Application Logging
- What NOT to do: synchronous logging
[Figure: GET /foobar -> API Server -> write -> Data Backend; the Data Backend acks the write and the API Server returns 200 {...}.]
Application Logging
- What to do instead: asynchronous logging via a local collector
[Figure: GET /foobar -> API Server -> write -> Local Data Collector (Buffer) -> Flush -> Data Backend, with an ack on the write path; the API Server returns 200 {...}.]
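The buffered path in the diagram can be sketched in a few lines. This is an in-process stand-in for a local collector, under stated assumptions: the `BufferedLogger` class, its `flush_fn` callback, and the batch and buffer sizes are all hypothetical. The point is that the request path only enqueues, and a background worker flushes batches to the backend.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the worker to drain and exit


class BufferedLogger:
    def __init__(self, flush_fn, batch_size=100, max_buffered=10000):
        self._queue = queue.Queue(maxsize=max_buffered)
        self._flush_fn = flush_fn        # called with a list of events
        self._batch_size = batch_size
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def log(self, event):
        """Request path: enqueue without waiting on the backend.

        Returns False when the buffer is full, so the caller can decide
        whether to drop the event or apply back-pressure.
        """
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            return False

    def _run(self):
        while True:
            item = self._queue.get()     # block until there is work
            if item is _STOP:
                return
            batch = [item]
            stopping = False
            while len(batch) < self._batch_size:
                try:
                    item = self._queue.get_nowait()
                except queue.Empty:
                    break
                if item is _STOP:
                    stopping = True
                    break
                batch.append(item)
            self._flush_fn(batch)        # one backend write per batch
            if stopping:
                return

    def close(self):
        """Flush everything enqueued so far, then stop the worker."""
        self._queue.put(_STOP)
        self._worker.join()


backend = []                             # stand-in for the data backend
logger = BufferedLogger(backend.extend, batch_size=2)
accepted = [logger.log({"n": n}) for n in range(5)]
logger.close()
print(backend)
```

A real collector would also persist the buffer to disk and retry failed flushes; this sketch keeps everything in memory and loses buffered events if the process dies.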
But wait...
- Is writing to a local log collector safe?
- What if the log collector retries on error?
- A lot of problems to think about!
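One common answer to the retry question is to make backend writes idempotent. The sketch below is a swapped-in technique, not something the slides prescribe. It assumes every event carries a unique `id` and that the backend remembers the ids it has stored. `Backend`, `send_with_retry`, and the simulated lost acks are hypothetical.

```python
import uuid


class Backend:
    """Toy backend that ignores events whose id it has already stored."""

    def __init__(self):
        self.rows = []
        self._seen_ids = set()

    def write(self, event):
        if event["id"] in self._seen_ids:
            return                      # duplicate from a retry: ignore
        self._seen_ids.add(event["id"])
        self.rows.append(event)


def send_with_retry(backend, event, acks_lost, max_attempts=5):
    """Write the event, pretending the first `acks_lost` acks never arrive.

    Returns the number of attempts it took to see an ack.
    """
    for attempt in range(1, max_attempts + 1):
        backend.write(event)            # the write itself succeeds every time
        if attempt > acks_lost:
            return attempt              # ack received
    raise RuntimeError("no ack after %d attempts" % max_attempts)


backend_store = Backend()
event = {"id": uuid.uuid4().hex, "uid": 1, "action": "died"}
attempts = send_with_retry(backend_store, event, acks_lost=2)
print(attempts, len(backend_store.rows))
```

The collector retried twice because acks were lost, yet the backend holds a single row. Without the id check the same event would be stored three times.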
Bibliography
- Software
  - www.fluentd.org
  - hekad.readthedocs.org
  - logstash.org
  - kafka.apache.org
  - Embulk.org
  - www.rabbitmq.com
- Ideas
  - https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  - http://radar.oreilly.com/2015/04/the-log-the-lifeblood-of-your-data-pipeline.html