May 06, 2015
Log Everything! @DC13
Stefan & Mike
Mike Lohmann Co-Founder / Software Engineer
Dr. Stefan Schadwinkel Co-Founder / Analytics Engineer
ABOUT DECK36 Who We Are
– DECK36 is a young spin-off from ICANS
– Small team of 7 engineers
– Longstanding expertise in designing, implementing and operating complex web systems
– Developing own data intelligence-focused tools and web services
– Offering our expert knowledge in Automation & Operations, Architecture & Engineering, Analytics & Data Logistics
WHAT WE WILL TALK ABOUT Topics
– Log everything! – The Data Pipeline
– Tackling the Leviathan – Realtime Stream Processing with Storm
– JS Client DataCollector: Live Demo
– Storm Processing with PHP: Live Demo
Log everything! The Data Pipeline
THE DATA PIPELINE Requirements
Background: Building and operating multiple education communities
Baseline: PokerStrategy.com KPIs
– 6M registered users, 700k posts/month, 2.8M page impressions/day, 7.6M requests/day
New products → new business models → new questions
– Extendable generic solution
– Storage and accessibility are more important than specific, optimized applications
[Pipeline diagram: Producer → Transport → Storage → Analytics, plus Realtime Stream Processing]
Producer
– Monolog Plugin, JS Client (a handler sketch follows below)
Transport
– Flume 0.9.4 (facepalm) → RabbitMQ, Erlang consumer
– Evaluated Apache Kafka
Storage
– Hadoop HDFS (our very own) → Amazon S3
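As an illustration of the producer side, a minimal Monolog handler that publishes each record to RabbitMQ might look like the sketch below. The exchange name, routing key, and connection details are assumptions for illustration, not the actual DECK36 plugin code.

<?php
// Illustrative sketch only: ship every Monolog record to RabbitMQ.
// Exchange name and routing key are assumptions, not the real plugin.
use Monolog\Logger;
use Monolog\Handler\AbstractProcessingHandler;
use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

class RabbitMqLogHandler extends AbstractProcessingHandler
{
    private $channel;

    public function __construct($host, $port, $user, $pass, $level = Logger::DEBUG)
    {
        parent::__construct($level);
        $connection = new AMQPStreamConnection($host, $port, $user, $pass);
        $this->channel = $connection->channel();
        // Durable topic exchange so consumers can bind by channel/level.
        $this->channel->exchange_declare('logs', 'topic', false, true, false);
    }

    protected function write(array $record)
    {
        $message = new AMQPMessage(
            json_encode($record),
            array('content_type' => 'application/json', 'delivery_mode' => 2)
        );
        // Routing key like "frontend.INFO" lets consumers filter streams.
        $this->channel->basic_publish($message, 'logs', $record['channel'] . '.' . $record['level_name']);
    }
}

// Usage: attach the handler and log as usual.
$logger = new Logger('frontend');
$logger->pushHandler(new RabbitMqLogHandler('localhost', 5672, 'guest', 'guest'));
$logger->info('page_impression', array('path' => '/home'));

The real LoggingComponent and LoggingBundle (linked later in this deck) wrap this idea behind interfaces, filters and handlers.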
THE DATA PIPELINE Logging Pipeline
Analytics
- Hadoop MapReduce → Amazon EMR, Python, R
- Exports to Excel (CSV), QlikView → Amazon Redshift
Realtime Stream Processing
- Twitter Storm
THE DATA PIPELINE Unified Message Format
- Fixed, guaranteed envelope
- Processing driven by message content
- A single message compresses (LZOP) to about 70% of its original size (1184 B → 817 B)
- Message bulks compress to about 12-14% of the original size (measured at 42k and 325k messages)
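For illustration, assembling a message with a fixed envelope might look like this sketch. The field names are hypothetical, not the actual UMF spec, and gzip stands in for LZOP, which has no stock PHP binding.

<?php
// Hypothetical envelope: fixed header fields, free-form payload.
$message = array(
    'version'   => 1,
    'type'      => 'icans.content',   // event type: drives processing and partitioning
    'timestamp' => gmdate('c'),
    'source'    => 'www.example.com',
    'payload'   => array('action' => 'click', 'path' => '/home'),
);
$json = json_encode($message);

// Single messages compress modestly; bulks compress far better because the
// envelopes repeat. (Identical repeats like this toy bulk compress even
// further than the 12-14% quoted above for real, varying messages.)
$bulk = str_repeat($json . "\n", 42000);
printf("single: %d B -> %d B\n", strlen($json), strlen(gzcompress($json)));
printf("bulk: %.1f%% of original size\n", 100 * strlen(gzcompress($bulk)) / strlen($bulk));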
THE DATA PIPELINE Compaction
The RabbitMQ consumer (Erlang) stores data to the cloud:
- Relatively large number of files
- Mixed messages
We want:
- A few files
- Messages grouped by "event type" and "time partition"
- Data transformation
s3://[BUCKET]/icanslog/[WEBSITE]/icans.content/year=2012/month=10/day=01/part-00000.lzo
The year=/month=/day= layout is Hive partitioning; event type and time partition are determined by the message content (see the sketch below).
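A minimal sketch of that "determined by message content" mapping, reusing the hypothetical UMF fields from the envelope sketch above:

<?php
// Illustrative only: derive the Hive-partitioned S3 prefix from a message.
function partitionPath(array $message)
{
    $time = strtotime($message['timestamp']);
    return sprintf(
        's3://[BUCKET]/icanslog/%s/%s/year=%s/month=%s/day=%s/',
        $message['source'],               // [WEBSITE]
        $message['type'],                 // event type, e.g. icans.content
        gmdate('Y', $time),
        gmdate('m', $time),
        gmdate('d', $time)
    );
}

echo partitionPath(array(
    'source'    => 'www.example.com',
    'type'      => 'icans.content',
    'timestamp' => '2012-10-01T12:34:56Z',
));
// s3://[BUCKET]/icanslog/www.example.com/icans.content/year=2012/month=10/day=01/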
THE DATA PIPELINE Compaction Using Cascalog
- Based on Clojure (LISP) and Cascading
- Provides a Datalog-like query language
- Don't LISP? → JCascalog
Very handy features (unavailable in Hive or Pig):
- Cascading output taps can be parameterized by data records
- Trap location for corrupted records (the job finishes for all correct messages)
- Runs within the JVM → large available codebase, arbitrary processing is simple
Cascalog Query Syntax
Cascalog is Clojure, Clojure is Lisp:
(?<- (stdout) [?person] (age ?person ?age) … (< ?age 30))
- ?<- is the query operator; (stdout) is the Cascading output tap
- [?person] names the columns of the dataset generated by the query
- (age ?person ?age) is a "generator", (< ?age 30) a "predicate": use as many as you want, both can be any Clojure function, and Clojure can call anything available within a JVM
Run the Cascalog processing on Amazon EMR:
./elastic-mapreduce [standard parameters omitted]
--jar s3://[BUCKET]/mapreduce/compaction/icans-cascalog.jar
--main-class icans.cascalogjobs.processing.compaction
--args "s3://[BUCKET]/incoming/*/*/*/","s3://[BUCKET]/icanslog","s3://[BUCKET]/icanslog-error"
THE DATA PIPELINE Data Queries with Hive
Hive is table-based and provides an SQL-like syntax
- Assumes one storage location (directory) per table
- Simple to use if you know SQL
- Widely used, rapid development for "simple" queries
Hive @ Amazon
- Table locations can be S3
- "Cluster on demand" → requires rebuilding the Hive metadata
- CREATE TABLE for source and target S3 locations
- Import table metadata (auto-discovery for partitions)
- INSERT OVERWRITE to query source table(s) and store to the target S3 location
[Slides: Hive @ Amazon (1), Hive @ Amazon (2)]
We can now simply copy the data from S3 and import it into any local analytical tool, e.g. Excel, Redshift, QlikView, R, etc.
Further Reading
- More details in the Log Everything! ebook
- Available at Amazon and DeveloperPress
THE DATA PIPELINE Still: It’s Batch Processing
- While quite efficient in flight, the logistics of getting the job started are significant.
- Only cost-efficient for long distance travel.
THE DATA PIPELINE Instant Insight through Stream Processing
- Often, only updates for the recent day, week, or month are necessary
- Time matters when direct feedback or user interaction is desired
More Wind In The Sails With Storm
- Distributed realtime processing framework
- Battle-proven by Twitter
- All *BINGO-Abilities fulfilled!
- Hadoop = data batch processing; Storm = realtime data processing
- More (and maybe new) *BINGO: DRPC, ETL, RTET, Spouts, Bolts, Tuple, Topology
- Easy to use (Really!)
REALTIME STREAM PROCESSING Instant Insight through Stream Processing
Realtime Stream Processing Infrastructure with Storm
[Diagram: apps & servers plus the NodeJS collector act as producers; a queue handles transport; the Storm cluster (Nimbus master, Zookeeper, Supervisors running Workers) processes the realtime data stream; results flow to storage (S3, DB), analytics, and monitoring (Zabbix, Graylog).]
REALTIME STREAM PROCESSING JS Client Features
- Event system
- Master/slave tabs
- Local queuing of data
- Ability to use node modules
- Easy to extend
- Complete development suite
- Deliver bundles with or without vendor libraries
Realtime Stream Processing - Loading the JS Client
<script .. src="https://cdn.tradimo.com/js/starlog-client.min.js?5193e1ba0325c756b78d87384d2f80e9"></script>
[Sequence between browser and NodeJS:]
1. The browser requests https://../starlog-client.min.js; NodeJS creates a signed cookie and answers with Set-Cookie: UUID and the client script.
2. The client requests /socket.io/1/websockets (Upgrade: websockets, Cookie: UUID); NodeJS checks the cookie.
3. NodeJS answers HTTP 101 - Protocol Change (Connection: Upgrade, Upgrade: websocket); the connection is established.
4. While collecting data, the client sends it in UMF; NodeJS queues it for the backend magic, and counts flow back to the client over the same connection.
Realtime Stream Processing - JS Client in action
Use case: if the number of clicks on a domain % 10 == 0, send a "Star Trek Commander" badge.
[Diagram: the ClickEvent collector registers an onclick handler; clicked data goes to localstorage, is observed from there, and is sent over the socket connection to NodeJS as a Clicked-Data UMF message.]
Realtime Stream Processing - JS Client in action
function ClickFetcher() {
    this.collectData = function (callback) {
        var clicked = 1;
        logger.debug('ClickFetcher - collectData called!');
        // Count every click and stage it in local storage.
        window.onclick = function () {
            var collectedData = {
                key: window.location.host.toString() + window.location.pathname.toString(),
                value: {
                    payload: clicked,
                    timestamp: +new Date()
                }
            };
            localstorage.set(collectedData, function (storageResult) {
                logger.debug("err = " + storageResult.hasError());
                logger.debug("storageResult = " + storageResult);
            }, false, true, true);
            clicked++;
        };
    };
}

// Register the fetcher with the starlog client's event system.
var clickFetcher = new ClickFetcher();
starlogclient.on(starlogclient.COLLECTINGDATA, clickFetcher.collectData);
Client Live Demo
https://localhost:3001/test/1-page-stub.html
REALTIME STREAM PROCESSING Producer Libraries
- LoggingComponent (provides interfaces, filters and handlers): https://github.com/ICANS/IcansLoggingComponent
- LoggingBundle (glues it all together for Symfony2): https://github.com/ICANS/IcansLoggingBundle
- Drupal Logging Module (uses the LoggingComponent): https://github.com/ICANS/drupal-logging-module
- JS Frontend Client (LogClient framework for browsers): https://github.com/DECK36/starlog-js-frontend-client
Realtime Stream Processing - PHP & Storm
Use case: if the number of clicks on a domain % 10 == 0, send a "Star Trek Commander" badge.
Using PHP for that: https://github.com/Lazyshot/storm-php/blob/master/lib/storm.php
[Diagram: Clicked-Data UMF → queue → Storm → event: "Star Trek Commander" badge; a bolt sketch follows below.]
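A sketch of what that bolt could look like on top of storm-php. The BasicBolt and Tuple names assume the multilang wrapper in the linked storm.php; check the library for the exact API.

<?php
// Sketch of the badge bolt; class/method names assume storm-php's
// multilang wrapper (see storm.php in the linked repository).
require_once 'storm.php';

class BadgeBolt extends BasicBolt
{
    // In-memory counts per domain; a worker restart loses this state,
    // so production code would keep counts in Redis or similar instead.
    private $counts = array();

    public function process(Tuple $tuple)
    {
        $domain = $tuple->values[0];   // assumed tuple layout: [domain]
        if (!isset($this->counts[$domain])) {
            $this->counts[$domain] = 0;
        }
        $this->counts[$domain]++;

        // Every 10th click on a domain triggers the badge event.
        if ($this->counts[$domain] % 10 === 0) {
            $this->emit(array($domain, 'Star Trek Commander'));
        }
    }
}

$bolt = new BadgeBolt();
$bolt->run();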
Storm & PHP Live Demo
REALTIME STREAM PROCESSING Get Inspired!
Powered by Storm: https://github.com/nathanmarz/storm/wiki/Powered-By
- 50+ companies (Twitter, Yahoo, Groupon, Ooyala, Baidu, Wayfair, …)
- Ads & real-time bidding, data-centric systems (economic, environmental, health), user interactions
Language-agnostic backend systems (operate Storm, develop in PHP)
Streaming "counts": Sentiment Analysis, Frequent Items, Multi-armed Bandits, …
DRPC: custom user feeds, complex queries (e.g. tracing graph links)
Realtime, distributed ETL
- Buffering / retries
- Integrate data: third-party APIs, machine learning
- Store to DBs, search engines, etc.
Questions?
Thanks a lot!