Analytics Drives Big Data Drives Infrastructure
Confessions of Storage turned Analytics Geeks
Dr. Aloke Guha
29th IEEE Conference on Massive Data Storage
May 8th, 2013
What’s Common Between a Sensor That Could Distinguish a Fine Cognac and Predicting Movies You’d Like on Netflix?
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
The Sommelier “Robot”
Predicting What Movies You’d Watch
(Analytics, BigData, DataStore)+
Many Analytics Techniques . . .
• Statistics: Linear Regression, Time-Series, Decision Trees, R
• AI (McCarthy, 1956): Expert Systems
• Machine Learning: Neural Networks, SVM, LDA, Naïve Bayes, K-nearest neighbor, Random Forests, Genetic Algorithms, . . .
• Milestones: SNARC (Minsky, 1951), Dendral (Feigenbaum, 1965), Fraser and Burnell (1970), Vapnik (1992), Ihaka and Gentleman (1993)
Common Analytics Processing pre-2000
• Sources: Local
• Data: Numeric, Homogeneous
• Processing: Local
• Consumer: Local
• Analytics: Linear/Non-Linear Regression, Neural Networks, SVM, LDA, LSA, Decision Trees, Monte Carlo, Lin-Ops, Expert Systems . . .
Flavor Predictor – Neural Networks
USPTO #5,373,452 (1994) 1988
Pattern Recognition – Genetic Algorithms
USPTO #5,140,530 (1992)
Small to Big
http://article.wn.com/view/2013/04/04/Big_data_forefather_Michael_Stonebraker_shows_no_signs_of_sl/#/related_news
Typical Analytics: 2000-2006
• Sources: Global, Social Networks
• Data: Heterogeneous, Numeric, Text
• Processing: Hosted/Scale
• Consumer: Global
• Analytics: Batch Mode, Social Media Marketing, Churn Detection, Sentiment Analysis, etc.
2007- : Internet Data Analytics
Financial Risk Scoring: Detect
Risk Scoring: detect the incremental change in the number of occurrences where corporate officers mention “risk” (or equivalent terms) during the earnings call
Financial Risk Scoring: Listen
Risk Scoring: detect the incremental change in the number of occurrences where corporate officers mention “risk” (or semantically equivalent terms) during the corporate earnings call
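The risk-scoring idea can be illustrated with a minimal Python sketch. It is not the method used in the talk: the term list RISK_TERMS, the function names, and the sample transcripts are all assumptions made for illustration, and simple pattern matching stands in for whatever semantic matching the real system applied.

```python
import re
from typing import List

# Hypothetical term list: the slides mention "risk" or semantically
# equivalent terms but do not enumerate them.
RISK_TERMS = ["risk", "uncertainty", "exposure", "headwind"]

# Match a term at a word boundary, allowing suffixes such as "risks".
_PATTERN = re.compile(r"\b(?:" + "|".join(RISK_TERMS) + r")\w*", re.IGNORECASE)


def count_risk_mentions(transcript: str) -> int:
    """Count occurrences of risk-related terms in one earnings-call transcript."""
    return len(_PATTERN.findall(transcript))


def incremental_changes(transcripts: List[str]) -> List[int]:
    """Return the change in mention count between consecutive calls."""
    counts = [count_risk_mentions(t) for t in transcripts]
    return [later - earlier for earlier, later in zip(counts, counts[1:])]


# Made-up sample transcripts, ordered oldest to newest.
calls = [
    "We see limited risk this quarter.",
    "Currency risk and regulatory uncertainty are risks we are watching.",
]
print(incremental_changes(calls))  # [2]
```

A positive value means the officers mentioned risk-related terms more often than in the previous call, which is the incremental signal the slide describes.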
Banking: Credit Worthiness – remember 2008?
Analyze bank reports to assess loans, payments, recoveries, etc. for key bank
indexes, groups of banks, or individual banks
Share of Voice: Online Buzz
Sentiment Analysis
Analytics Processing: 2007-
• Sources: Global, Mobile, New Social (Instagram, . . .)
• Data: Multi-Dimensional, Heterogeneous, Audio/Video
• Processing: Hosted/Scale
• Consumer: Global
• Analytics: Batch, Streaming, . . .
2008 - : Real-Time/Streaming Analytics
Brand Marketing
Brand Management
Customer Support
Customer Support
Lead Generation
. . . More Data, Faster
http://www.cioinsight.com/it-strategy/big-data/data-analytics-allows-pg-to-turn-on-a-dime/?kc=CIOMINUTE05062013CIOA
“Internet of Things”
http://www.news-sap.com/survey-by-sap-and-harris-interactive-finds-brazil-china-germany-and-india-most-ready-for-m2m-technology-to-drive-connected-smarter-cities/
Message Queuing Telemetry Transport
Machine-to-Machine
AumniData: Batch Processing
• Sources: Twitter, Blog/Web Site, YouTube, RSS/ATOM
• Ingest: Data Collectors (Batch Scheduled), Feed Requestor / URL Scanner
• Detection: NLP + Cruxly Intent Detection (AWS), multiple parallel instances
• Classification: NLP Stack + AumniData Classifier + Analytics* (RackSpace VM)
• Storage: Content Store, Content / Metadata Index (MySQL), Dashboard Store (SQL Server)
• Dashboard: Configuration (TomCat), Custom Analytics, Display, Ad-Hoc Query, Summary
• Application (3rd party App)
Cruxly: Stream Processing
• Streaming API Clients (Heroku Worker, 24x7): send Request (Keywords), receive Tweets (Keywords)
• NLP (NER, etc.) + Cruxly Intent Detection (AWS), multiple parallel instances
• Tweet ID + Intent Signal (Heroku PostgreSQL)
• Tweets Content Store (DynamoDB)
• Reports / Dashboard
• Tracker Editor (web app - Heroku)
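The stream-processing flow on this slide (keyword-matched tweets in, intent signals and stored content out) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the in-memory stores, and the toy rule inside detect_intent are assumptions, and the rule is a stand-in for the Cruxly intent detection, whose actual logic the slides do not describe.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Tweet = Tuple[str, str]  # (tweet_id, text)


def keyword_filter(tweets: Iterable[Tweet], keywords: List[str]) -> Iterator[Tweet]:
    """Pass through only tweets that mention a tracked keyword."""
    lowered = [k.lower() for k in keywords]
    for tweet_id, text in tweets:
        if any(k in text.lower() for k in lowered):
            yield tweet_id, text


def detect_intent(text: str) -> str:
    """Toy stand-in for intent detection (hypothetical rules)."""
    lowered = text.lower()
    if "buy" in lowered or "want" in lowered:
        return "purchase"
    if "help" in lowered or "broken" in lowered:
        return "support"
    return "none"


def process_stream(
    tweets: Iterable[Tweet],
    keywords: List[str],
    content_store: Dict[str, str],
    signals: List[Tuple[str, str]],
) -> None:
    """Handle tweets one at a time: store the content, record ID + intent."""
    for tweet_id, text in keyword_filter(tweets, keywords):
        content_store[tweet_id] = text  # plays the role of the content store
        signals.append((tweet_id, detect_intent(text)))  # tweet ID + intent signal


content_store: Dict[str, str] = {}
signals: List[Tuple[str, str]] = []
incoming = [
    ("1", "I want to buy a new phone"),
    ("2", "Nice weather today"),
    ("3", "My phone is broken, help!"),
]
process_stream(incoming, ["phone"], content_store, signals)
print(signals)  # [('1', 'purchase'), ('3', 'support')]
```

Because keyword_filter is a generator, each tweet is handled as it arrives rather than after a scheduled batch collection, which is the main contrast with the batch architecture.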
Data Analytics Demands . . .
• Stages: Store, Process, Analyze, View
• Ingest: Data Collector (Text / Sensor Data / Stream . . .)
• Processing: NLP, Classify, Index
• Query: Query / RT Query, Ad Hoc / Search / SQL
• Analysis: Custom Analytics, Machine Learning Library, Stats Library, R
• Views: Dashboards, Chart, Report
• Platforms: Storm, Yarn
Storage Implications: Back to the Future
MB/s – Batch
IOPS – Stream
Both?
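The tension between the two metrics follows from a simple relation: throughput equals IOPS multiplied by the I/O request size. The short Python sketch below works through it. The function names and all the numbers (request sizes, IOPS figures) are assumptions chosen for illustration, not measurements from the talk.

```python
def throughput_mb_per_s(iops: float, io_size_kb: float) -> float:
    """Throughput in MB/s delivered by `iops` requests of `io_size_kb` each."""
    return iops * io_size_kb / 1024.0


def required_iops(throughput_mb_s: float, io_size_kb: float) -> float:
    """IOPS needed to sustain a target throughput at a given request size."""
    return throughput_mb_s * 1024.0 / io_size_kb


# Batch-style access (assumed): large 1 MB sequential requests.
print(throughput_mb_per_s(iops=200, io_size_kb=1024))   # 200.0 MB/s from only 200 IOPS

# Stream-style access (assumed): small 4 KB requests.
print(throughput_mb_per_s(iops=20000, io_size_kb=4))    # 78.125 MB/s despite 20,000 IOPS

# Matching the batch bandwidth with 4 KB requests needs far more IOPS.
print(required_iops(throughput_mb_s=200, io_size_kb=4))  # 51200.0
```

Under these assumed numbers, a system tuned for batch bandwidth can be starved by small streaming requests, and one tuned for IOPS may still fall short on bulk bandwidth, which is why the slide asks whether storage must support both.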
Storage Implications: Back to the Future II, III
• Nodes: Master, Slave #1 . . . Slave #N, Mgmt Node
• MapReduce: Job Tracker, Task trackers
• HDFS: Name Node, Data Nodes, HDFS client
• Tools: Zookeeper, Hive, Pig, Oozie, HUE
Storage Capacity Scaling?
Storage Tiering?
Import/Export Data?
A More General Data Analytics Framework?
• Data Ingesters (Basic), Data Ingesters (Smart)
• Sensor Processing: Data Integration
• Content Store, Metadata / In-Mem Store
• Processing: Stream and Batch
• Analytics Processing
• Visualization Library / Interactive Query
• Storage: Local Storage / Flash / DAS, SAN
• MapReduce / Distributed Data Store
Conclusion
• Data Analytics → Big Data → Scale-Out Infrastructure
• Variety
• Volume → Bandwidth Support
• Velocity → Streaming Support
• We Solved the Processing Problem
• We Need to Solve the Larger Storage Problem
Grateful Acknowledgements
• Kapil Tundwal
• Dr. Kirill Kireyev
• Dr. Andrew Lampert
• Venky Madireddy
• Dr. Shumin Wu
• Joan Wrabetz