Big and Fast Data Strategy 2017
Jonathan Raspaud
AVP - Big Data Architecture
February 2017
© Antuit 2016 Proprietary & Confidential; Not for circulation 2
Executive Summary
2017 Data Landscape
Vision
Strategy
Roadmap
Key Initiatives
High Level Architecture
High Level Data Flow
Data Validity Vendor Comparison
About Jonathan Raspaud:
• 2017 – AVP, Big Data Architecture
• 2012 – Senior Principal Data Architect
• 2011 – Mobility Practice Lead
• 2006 – Manager, Business Intelligence
• 2000 – Data Warehouse Engineer
• 1999 – Software Engineer
• 1998 – Software Engineer, Teamlog
• 1997 – IAE Grenoble, Master of Science in Management of Information Systems
2017 Data Landscape (1): The Four V's
• Data Volume: billions of rows
• Data Velocity: real time, streaming (sensors, markets, networks, transportation, IoT, social)
• Data Variety: structured, semi-structured, unstructured (weblogs, clickstreams, IoT, text, call center, chat, social)
• Data Validity: format, process
2017 Data Landscape (2): Legacy RDBMS Databases are poor at:
• Scalability
• Fast streaming data
• Unstructured data
• Schema flexibility
• Search
2017 Data Landscape (3): MPP/Column-Store Databases:

The Good:
• SQL based, wide compatibility with BI tools
• Good performance
• Full support for aggregation and ad hoc filtering

The Bad:
• Need to move the data from operational systems
• Data loses freshness
• Ultimate scale limitations
• Hard to adapt schema
• Can be expensive
2017 Data Landscape (4): Hadoop:

The Good:
• Distributed storage and processing of massive data sets
• Low-cost clusters built from commodity hardware

The Bad:
• SQL interfaces are improving but still not speed-of-thought
2017 Data Landscape (5): NoSQL Databases:

The Good:
• Storage and retrieval of data modeled in means other than the tabular relations used in RDBMSs
• More and more application developers choose NoSQL databases as operational databases
• Scalability, schema-less flexibility, and fast response time for short-request queries

The Bad:
• Traditional BI tools lack native compatibility
• Not optimized for analytic queries
• Some don't support aggregation or ad hoc filtering on arbitrary fields
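The trade-off above can be sketched with a toy example: a schema-less store happily accepts heterogeneous documents, but grouping on an arbitrary field then falls to application code rather than a query engine (a minimal Python sketch; the documents and field names are invented for illustration):

```python
from collections import defaultdict

# Heterogeneous "documents": a schema-less store accepts all of them,
# even though they do not share the same fields.
docs = [
    {"user": "a", "channel": "chat", "msgs": 3},
    {"user": "b", "channel": "social", "likes": 10},
    {"user": "a", "channel": "social", "likes": 4},
    {"user": "c", "channel": "chat"},  # missing fields are simply absent
]

def group_count(documents, field):
    """Client-side group-by: the work a SQL engine would normally do for us."""
    counts = defaultdict(int)
    for d in documents:
        counts[d.get(field, "unknown")] += 1
    return dict(counts)

print(group_count(docs, "channel"))  # {'chat': 2, 'social': 2}
```

The flexibility on insert is real, but every ad hoc aggregation becomes bespoke application code, which is the gap traditional BI tools cannot bridge natively.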
2017 Data Landscape (6): Search Databases:

The Good:
• Using search index technology is a great way to enable access to big data in the enterprise
• Delivers fast access to unstructured or semi-structured information: blog posts and comments, customer product reviews, machine logs, JSON documents…
• Very effective with structured data too

The Bad:
• Lacks a SQL interface, so traditional BI tools are incompatible
• Native APIs required to access data
2017 Data Landscape (7): Cloud Big Data Stores:

The Good:
• Storing massive amounts of data in the cloud
• Low cost
• Easy to manage
• Range of storage options: file system, SQL database, Hadoop, Spark…

The Bad:
• Traditional BI tools lack performance-optimized native integration
2017 Data Landscape (8): Fast Data:

The Good:
• Fast inserts/updates
• Fast analytics

The Bad:
• Traditional BI tools lack integration
• Traditional BI tools are not architected for streaming data
• Limited or no SQL interface
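The "fast analytics" side can be illustrated with a minimal sketch: a streaming aggregate that is updated per event in O(1), never re-scanning history (a pure-Python illustration; the RunningStats class and sample readings are invented, not a specific product's API):

```python
class RunningStats:
    """Streaming aggregate: update per event, O(1), no re-scan of history."""

    def __init__(self):
        self.n = 0
        self.total = 0.0

    def add(self, value):
        """Fold one incoming event into the aggregate."""
        self.n += 1
        self.total += value

    @property
    def mean(self):
        return self.total / self.n if self.n else 0.0

stats = RunningStats()
for reading in [10.0, 20.0, 30.0]:  # stand-in for a stream of sensor events
    stats.add(reading)

print(stats.n, stats.mean)  # 3 20.0
```

Fast-data engines apply this idea at scale: aggregates stay continuously current as events arrive, which is exactly the access pattern batch-oriented BI tools were not architected for.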
2017 Data Landscape (9): Conclusion
• Legacy BI was not designed for modern data:
– Hard to use: designed in an age of specialized skills
  • Focus on the power user
  • Complicated workbench interfaces
  • Quickly require SQL coding
– Cannot scale: deployed on desktops or monolithic servers
  • Limited user scalability
  • Poor performance
  • Not built for embedding in other applications
– Performance problems: designed for relational data only
  • Loss of functionality
  • Poor performance
  • Limited data scalability
Modern Big and Fast Data Platform Requirements: 5 V's
• Volume: 1. Immediate visualization & interaction regardless of size of data; 2. Don't move or copy data
• Variety: 1. Support a broad range of modern sources without lock-in; 2. Blend multi-source data on-the-fly; 3. Extensible data connectors for different types of data
• Velocity: 1. Support fast data (streaming); 2. Integrate streaming & historical data in a single view
• Veracity: 1. Master Data Management; 2. Definitions
• Value: 1. Business insight, monetization, optimization, new customers
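The Variety requirement to blend multi-source data on-the-fly can be sketched as a join computed at read time, without first copying either source into a shared warehouse schema (a hypothetical Python illustration; the two sources and their fields are invented):

```python
# Two hypothetical sources keyed by customer id: one relational-style
# (CRM records), one from a JSON web-analytics feed.
crm = {1: {"name": "Acme"}, 2: {"name": "Globex"}}
web = {1: {"visits": 12}, 3: {"visits": 4}}

def blend(left, right):
    """Full outer join on the shared key, computed on the fly at read time."""
    keys = set(left) | set(right)
    return {k: {**left.get(k, {}), **right.get(k, {})} for k in keys}

view = blend(crm, web)
print(view[1])  # {'name': 'Acme', 'visits': 12}
print(view[3])  # {'visits': 4}
```

The point of the sketch is the requirement, not the mechanism: the blended view exists only at query time, so neither source is moved or copied, satisfying the Volume requirement above as well.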
Vision (Example):
“Business Insights at the Speed of Light”.
Strategy (Example):
• Speed is our main strategic asset
• Spark is the engine that powers all our data initiatives
• Set the context and get out of the way
• Build proofs of concept that are ready for production
• Public cloud only
• Leverage key vendors as needed: Paxata, Cloudera, ZoomData, Google, Amazon…
Roadmap (Example):
[Timeline chart, Q1 2017 through Q1 2018, with swimlanes for Insights, Infrastructure, Ingestion, Big BI, Strategy, and Procurement. Items include: Lambda Architecture; Deskside; People; Workday; Oracle Financials; ServiceNow; Human Resources; Telecom TEM; From BI to Big Data; IoT Real Time; Data Science Training; EDL; Mobile BI; Data Science (Real Time, Self Healing, AI Aware); Transportation; Real Time ML. Procurement vendors: ZoomData, PrestoDB, Paxata, IBM DS Platform.]
Enterprise Data Lake – Ingestion (Example):

Q1:
• Data Ingestion: Facebook ✅, Twitter ✅, Pinterest ✅, Youtube ✅, Instagram ✅, DCM ✅
• Other Source Systems: Adobe Analytics, Salesforce Marketing
• "Near Real Time" Update (Spark batch): Facebook

Q2:
• Data Ingestion: LinkedIn ✅, Google Maps ✅, Waze
• Other Source Systems: GSA, Salesforce ✅
• "Near Real Time" Update (Spark batch): Youtube ✅

Q3:
• Data Ingestion: Wikipedia, STAT
• Real Time Update (Spark Streaming): Twitter

Q4:
• Data Ingestion: Snapchat
• Other Source Systems: Billz, Workday
• "Near Real Time" Update (Spark batch): Instagram
• More than once per day update: Pinterest
Enterprise Data Lake – Infrastructure (Example):

Q1:
• Cloudera Upgrade ✅
• Disaster Recovery ✅
• Configuration Data Base ✅
• Kafka Cluster (Test Cluster complete, Sprint 190 ✅)
• Subnet Migration
• Cluster resource upgrade, scaled out ✅

Q2:
• Scalable Database for Data Marts: RedShift vs. BigQuery
• Security: Kerberos authentication; configure external authentication for Cloudera Manager using AD
• Cluster Scaling
• DB migration for Hive Metastore
• Configure high availability for Hive

Q3:
• Scalable Database for Big BI Data Marts: RedShift vs. BigQuery
• Configuration Data Base
• Kafka Cluster

Q4:
• Security: configure Sentry in Production cluster
• Configure external database for Cloudera Manager
• Hue DB migration to external database
Key Initiatives (Example):
• Focus on high impact/high dollar
• Machine Learning/Deep Learning
• Big BI
• Big MDM
High Level Streaming Architecture (Example):
[Architecture diagram: Device Events feed a Real Time Pipeline and a Batch Pipeline into the Big and Fast Data Stream and Data Store, which serves Grid and Pivot views for Data Visualization & Reporting.]
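The split between the real-time and batch pipelines follows the Lambda Architecture pattern named in the roadmap: queries are served by merging a precomputed batch view with a small real-time (speed) view covering only events since the last batch run. A minimal Python sketch (the views, counts, and device IDs are invented for illustration):

```python
# Batch view: totals precomputed over all historical data (complete, but stale).
batch_view = {"device-1": 100, "device-2": 250}

# Speed view: totals for events that arrived after the last batch recompute.
speed_view = {"device-1": 7, "device-3": 2}

def query(device_id):
    """Serve a query by merging the batch and real-time views."""
    return batch_view.get(device_id, 0) + speed_view.get(device_id, 0)

print(query("device-1"))  # 107: historical total plus fresh events
print(query("device-3"))  # 2: a device seen only since the last batch run
```

Each batch recompute absorbs the speed view's events into the batch view and resets the speed view, so the merged answer stays both complete and fresh.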
High Level Data Flow (Example):
[Data flow diagram: Data Sources (relational data as CSV; schema-free nested data as JSON) feed Ingestion into the Big Data Store, which serves Big BI (SQL, interactive reporting) to Tableau, PowerBI, and Looker over ODBC/JDBC, enabling Data Visualization and Exploration and, ultimately, Data Driven Decisions.]
The Enterprise Data Lake is the one source of truth for all reports.
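The SQL-over-mixed-data idea in this flow can be sketched with SQLite standing in for the Big BI SQL layer (an illustration only, assuming a SQLite build with the JSON1 functions, which ships with modern Python; the table and payloads are invented):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (source TEXT, payload TEXT)")

# Schema-free nested JSON is stored as text; json_extract flattens it
# at query time, so BI tools can still reach it through plain SQL.
events = [
    ("web", json.dumps({"user": {"id": 1}, "clicks": 5})),
    ("web", json.dumps({"user": {"id": 2}, "clicks": 3})),
]
conn.executemany("INSERT INTO events VALUES (?, ?)", events)

total = conn.execute(
    "SELECT SUM(json_extract(payload, '$.clicks')) FROM events"
).fetchone()[0]
print(total)  # 8
```

In the actual flow, the same role is played by the data lake's SQL engine behind ODBC/JDBC: relational and nested sources land once, and one SQL surface serves every reporting tool.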
Big and Fast Data Validity: Vendor Comparison

Alteryx
• Primary user: Technical data developer
• Strengths: Data integration; data mapping; advanced analytics
• Weaknesses: Data cleansing; data manipulation; ease of use
• Analysis: Alteryx is a full-stack BI tool, and it includes a layer of data integration capabilities. Introducing another BI tool (in addition to Tableau, Qlik, Excel) is not ideal, particularly since it would only be able to address data migration use cases. It overlaps with Snaplogic, which Yahoo! already owns.

Paxata
• Primary user: Non-technical business analyst
• Strengths: Data integration and quality; comprehensive governance model; centralized collaboration workbench; no coding or scripting required
• Weaknesses: Limited enrichment today
• Analysis: Paxata has the most robust capabilities to address the broadest set of data preparation use cases. Their model for data governance is far above anything else on the market. They also appear to ingest the widest range of data sources and can scale to a billion rows. True enterprise capabilities for security and scale.

Trifacta
• Primary user: Technical data scientist
• Strengths: Visualization; batch processing
• Weaknesses: Only works with information loaded into Hadoop; only works with samples of data; feedback is not in real time; minimal data quality capabilities
• Analysis: Trifacta is not a good fit for our users, since they are all business analysts and it is very complex to make changes. Also, the information for these use cases comes from multiple data sources, many of which are not in Hadoop. Trifacta does not have the data quality capabilities needed for the broadest number of use cases.