Page 1: Insight Data Engineering - Week 4 DRAFT

TRACKING LIVE WIKIPEDIA CHANGES

[email protected]

Insight Data Engineering Week 4 - January 2015

DRAFT WIKIWATCH.ANDREWMO.COM

Page 2: Insight Data Engineering - Week 4 DRAFT

MOTIVATION

• Raw dumps of Wikipedia data are available for analysis on a monthly basis, but what about the changes made between dumps?

• Data Collection: Live edits for Wikimedia projects are broadcast on up to 882 IRC channels

• Goals: Collect, filter, format, transform, and produce information about live edit data

Page 3: Insight Data Engineering - Week 4 DRAFT

Realtime Telemetry wikiwatch.andrewmo.com

Page 4: Insight Data Engineering - Week 4 DRAFT

SOMETHING ELSE

Other statistics

Page 5: Insight Data Engineering - Week 4 DRAFT

DATA PIPELINE ENGINEERING

Capture + Fusion + Analysis

[Diagram: IRC, Kafka, Storm, MySQL, API; HDFS, Hadoop, NoSQL]

Page 6: Insight Data Engineering - Week 4 DRAFT

CAPTURE

Up to 882 Simultaneous Channels

~660 events/min avg across all channels
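At roughly 660 events per minute (about 11 per second), each captured IRC line has to be turned into a structured record before anything downstream can use it. The sketch below shows one way that could look. It assumes the line layout `[[title]] flags url * user * (+delta) comment` wrapped in mIRC colour codes; the sample line, the field names, and `parse_edit` are illustrative and not taken from the project's code.

```python
import re

# mIRC colour code (\x03 plus optional fg[,bg] digits) and other format bytes.
_CONTROL = re.compile(r"\x03(?:\d{1,2}(?:,\d{1,2})?)?|[\x02\x0f\x16\x1d\x1f]")

# Assumed layout once control codes are stripped:
#   [[title]] flags url * user * (+delta) comment
_EDIT = re.compile(
    r"^\[\[(?P<title>.+?)\]\]\s*(?P<flags>[!NMB]*)\s+(?P<url>\S+)\s+\*\s+"
    r"(?P<user>.+?)\s+\*\s+\((?P<delta>[+-]\d+)\)\s?(?P<comment>.*)$"
)

# Made-up sample line in the assumed layout (not captured from a real channel).
SAMPLE = (
    "\x0314[[\x0307Example page\x0314]]\x034 N\x0310 "
    "\x0302https://en.wikipedia.org/w/index.php?oldid=1\x03 \x035*\x03 "
    "\x0303ExampleUser\x03 \x035*\x03 (+120) \x0310created page\x03"
)


def parse_edit(raw):
    """Return a dict describing one edit, or None if the line does not match."""
    text = _CONTROL.sub("", raw).strip()
    match = _EDIT.match(text)
    if match is None:
        return None  # e.g. log entries, which use a different layout
    event = match.groupdict()
    event["delta"] = int(event["delta"])      # signed size change in bytes
    event["is_new"] = "N" in event["flags"]   # assumed flag for a new page
    event["is_bot"] = "B" in event["flags"]   # assumed flag for a bot edit
    return event


if __name__ == "__main__":
    print(parse_edit(SAMPLE))
```

Returning `None` for anything that does not match keeps the capture loop simple: unparsed lines can still be forwarded raw and inspected later.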

Page 7: Insight Data Engineering - Week 4 DRAFT

INGEST

Kafka + Hadoop

[Diagram: IRC channels (#de, #fr, #en), logBot, Kafka (Topic, Topic, Topic), HDFS]
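One reading of the ingest diagram is that each IRC channel feeds its own Kafka topic. A minimal sketch of that mapping follows. The `wiki.` prefix, the function names, and the `send` callable are assumptions for illustration; `send` stands in for whatever Kafka client the bot uses, so the block runs without a broker.

```python
import re

# Kafka topic names may only contain ASCII letters, digits, '.', '_' and '-',
# so the leading '#' of an IRC channel name has to go.
_ILLEGAL = re.compile(r"[^A-Za-z0-9._-]")


def topic_for_channel(channel, prefix="wiki."):
    """Map an IRC channel such as '#en.wikipedia' to a legal topic name."""
    return prefix + _ILLEGAL.sub("_", channel.lstrip("#"))


def publish(send, channel, line):
    """Hand one raw IRC line to send(topic, payload_bytes); return the topic."""
    topic = topic_for_channel(channel)
    send(topic, line.encode("utf-8"))
    return topic


if __name__ == "__main__":
    sent = []
    publish(lambda topic, payload: sent.append((topic, payload)),
            "#en.wikipedia", "raw line")
    print(sent)
```

A real producer method could be passed as `send`, for example kafka-python's `KafkaProducer.send(topic, value)`, assuming that client is the one in use.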

Page 8: Insight Data Engineering - Week 4 DRAFT

We tried Spark Streaming.

Scala, it’s not you - it’s me.

Next sprint, maybe?

- Andrew Mo

Page 9: Insight Data Engineering - Week 4 DRAFT

STREAM PROCESSING

Multiple Topologies (10 sec, 10 min, 1 hr)

Multiple Metrics (events, size, new pages, topics, users)

Python + Storm (Pyleus)

MySQL
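The three window lengths and five metrics above can be illustrated with a tumbling-window aggregator in plain Python. This is a sketch of the idea only: it uses no Storm or Pyleus APIs, and the event fields (`title`, `user`, `delta`, `is_new`), the reading of "size" as total absolute bytes changed, and the reading of "topics" as page titles are all assumptions.

```python
# Window lengths from the slide, in seconds.
WINDOWS = {"10s": 10, "10m": 600, "1h": 3600}


def window_start(ts, width):
    """Start (epoch seconds) of the tumbling window that contains ts."""
    ts = int(ts)
    return ts - ts % width


def _empty_bucket():
    return {"events": 0, "size": 0, "new_pages": 0,
            "topics": set(), "users": set()}


class WindowedMetrics:
    """Accumulate per-window metrics for every configured window length."""

    def __init__(self, windows=None):
        self.windows = dict(windows or WINDOWS)
        self.buckets = {}  # (window name, window start) -> metrics dict

    def add(self, ts, event):
        for name, width in self.windows.items():
            key = (name, window_start(ts, width))
            bucket = self.buckets.setdefault(key, _empty_bucket())
            bucket["events"] += 1
            bucket["size"] += abs(event.get("delta", 0))
            if event.get("is_new"):
                bucket["new_pages"] += 1
            if event.get("title") is not None:
                bucket["topics"].add(event["title"])
            if event.get("user") is not None:
                bucket["users"].add(event["user"])


if __name__ == "__main__":
    metrics = WindowedMetrics()
    metrics.add(1000, {"title": "A", "user": "u1", "delta": 5, "is_new": True})
    metrics.add(1012, {"title": "A", "user": "u2", "delta": 10})
    for key in sorted(metrics.buckets):
        print(key, metrics.buckets[key]["events"])
```

In a streaming topology each completed bucket would be flushed to the store (MySQL on this slide) once its window closes; the flush step is omitted here.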

Page 10: Insight Data Engineering - Week 4 DRAFT

API ACCESS

Time Series Summary Metrics for Multiple Windows

New Pages

Detailed User Activity

Detailed Topic Activity

Top Topics, Top Users, Top Bots, etc
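A "top N" query such as Top Users or Top Bots reduces to counting one field over a set of events. The sketch below shows that with `collections.Counter`. The event dicts and the `top_n` helper are hypothetical; the real API presumably reads precomputed results from the database rather than raw events.

```python
from collections import Counter


def top_n(events, key, n=10, only_bots=None):
    """Return the n most frequent values of event[key] as (value, count) pairs.

    only_bots=True keeps bot edits, False keeps non-bot edits, None keeps all.
    """
    counts = Counter()
    for event in events:
        if only_bots is not None and bool(event.get("is_bot")) != only_bots:
            continue
        value = event.get(key)
        if value is not None:
            counts[value] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    sample = [
        {"user": "alice", "title": "A"},
        {"user": "alice", "title": "B"},
        {"user": "fixbot", "title": "A", "is_bot": True},
    ]
    print(top_n(sample, "user", n=2))
    print(top_n(sample, "user", only_bots=True))
```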

Page 11: Insight Data Engineering - Week 4 DRAFT

Thanks

Apache Software Foundation

Wikimedia Foundation

Insight Data Science

LinkedIn (Kafka)

Twitter (Storm)

Yelp (Pyleus)

Page 12: Insight Data Engineering - Week 4 DRAFT

ABOUT MO

A Project Manager that Writes Code!

Worked at: RAND Corporation, Booz Allen Hamilton

Studied at: Pardee RAND Graduate School, UC San Diego - Electrical Engineering

Alphabet Soup: PMP, PMI-ACP, CISSP, ISSEP, CSEP, CSEP-ACQ

[email protected]

GitHub: https://github.com/moandcompany

LinkedIn: http://linkedin.com/in/andrewmo

Page 13: Insight Data Engineering - Week 4 DRAFT

BONUS CHARTS

Page 14: Insight Data Engineering - Week 4 DRAFT

BATCH PROCESSING

MapReduce + Hadoop Streaming

Hive

MR Job

… and more …

Visual
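Hadoop Streaming lets the map and reduce steps be ordinary programs that read lines on stdin and write tab-separated key/value lines on stdout. As a sketch of what a Python job over the archived events might look like, the functions below count edits per user. The one-JSON-object-per-line input format and the `user` field are assumptions; the deck does not show the actual job.

```python
import json
from itertools import groupby


def mapper(lines):
    """Emit 'user<TAB>1' for every JSON event line that names a user."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except ValueError:
            continue  # skip malformed records
        if isinstance(event, dict) and event.get("user"):
            yield "%s\t1" % event["user"]


def reducer(sorted_lines):
    """Sum the counts per user; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for user, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (user, sum(int(count) for _, count in group))


if __name__ == "__main__":
    sample = ['{"user": "b"}', '{"user": "a"}', '{"user": "b"}']
    # sorted() stands in for Hadoop's shuffle-and-sort between the two phases.
    for out in reducer(sorted(mapper(sample))):
        print(out)
```

In an actual job each function would sit in its own script that iterates over `sys.stdin`, and the two scripts would be passed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options.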

Page 15: Insight Data Engineering - Week 4 DRAFT

FIREHOSE

Multiplex all sensors to a firehose topic

[Diagram: IRC channels (#de, #fr, #en), three logBots, Kafka (Omnichannel)]
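When every channel is multiplexed onto a single topic, the topic name no longer says where a message came from, so each message needs to carry its source. A minimal sketch of such an envelope follows. The topic name `firehose`, the envelope fields, and both helper functions are hypothetical and chosen only for illustration.

```python
import json

FIREHOSE_TOPIC = "firehose"  # hypothetical topic name


def to_firehose(channel, line, ts):
    """Wrap one raw IRC line in an envelope that records its source channel."""
    envelope = {"channel": channel, "ts": ts, "raw": line}
    return FIREHOSE_TOPIC, json.dumps(envelope, sort_keys=True).encode("utf-8")


def demux(payloads):
    """Group raw lines by source channel on the consumer side."""
    by_channel = {}
    for payload in payloads:
        record = json.loads(payload.decode("utf-8"))
        by_channel.setdefault(record["channel"], []).append(record["raw"])
    return by_channel


if __name__ == "__main__":
    messages = [
        to_firehose("#en.wikipedia", "line 1", 1421539200),
        to_firehose("#de.wikipedia", "line 2", 1421539201),
    ]
    print(demux(payload for _, payload in messages))
```

The trade-off is the usual one: a single topic gives consumers one subscription for everything, at the cost of filtering by channel after the fact.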

Page 16: Insight Data Engineering - Week 4 DRAFT

VELOCITY AND OUR NEXT SPRINT

Sprint 1 (MVP Development)

18 Jan - 31 Jan 2015

Address the need + Simplify

API-query elicitation and discovery

Novel feature focus - Realtime

Maximize common-code (Python)

Sprint 2 (MVP Validation)

Engage users + Complete Features

API enhancement

Batch Integration

NoSQL Optimization

Preempt Technical Debt - Refactoring

Velocity Chart

Page 17: Insight Data Engineering - Week 4 DRAFT

TECHNOLOGY TO EVALUATE

• Presto

• Samza

• Hive + Tez

• Kafka on YARN (KOYA)

• Kafka Security (Authentication)

• Spark + Spark Streaming (1.2+ Python)