Big Data @ Microsoft Raghu Ramakrishnan CTO for Data, Technical Fellow Microsoft
Jan 08, 2017
Data and Analytics – 3 Pillars
SQL 2016
Azure SQL DB
Azure SQL DW
SQL Server R services
On-prem and cloud
(Windows, Linux)
Cortana Intelligence Suite
Hadoop, Data Lake, Machine learning, Power BI, Data Factory, Streaming,
Perceptual Intelligence
On-prem connectivity
Microsoft R Server
Hadoop
Teradata
On-prem and cloud
(Windows, Linux)
SQL Server 2016: Everything Built-In
The above graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request from Microsoft. Gartner does not endorse any
vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research
organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.
Consistent experience from on-premises to cloud
Self-service BI per user: Microsoft $120, Tableau $480, Oracle $2,230
In-memory across all workloads
TPC-H non-clustered 10TB
SQL Server is #1, #2, and #3; Oracle is #4
[Chart: SQL Server, Oracle, MySQL, and SAP HANA compared by year, 2010 to 2015]
TPC-H non-clustered results as of 04/06/15, 5/04/15, 4/15/14 and 11/25/13, respectively. http://www.tpc.org/tpch/results/tpch_perf_results.asp?resulttype=noncluster
at massive scale
National Institute of Standards and Technology Comprehensive Vulnerability Database update 5/4/2015
In-Database Advanced Analytics
No need to move the data
Open source R with in-memory & massive scale – multi-threading & massive parallel processing
Data Scientist
Interact directly with data
R built-in to SQL Server
Data Developer/DBA
Manage data and analytics together
Example Solutions
• Sales forecasting
• Warehouse efficiency
• Predictive maintenance
• Credit risk protection
Extensibility
R Integration
Relational data
Analytic Library
T-SQL interface
New R scripts
Microsoft Azure Marketplace
Real-time operational analytics without moving the data
End-to-end mobile BI
Advanced Analytics
Mission critical OLTP
High-performance open source R plus:
Enterprise Scale & Performance
– Scales from workstations to large clusters
– Scales to large data sizes
– Growing portfolio of Parallelized algorithms
Secure, Scalable R Deployment/Operationalization
Write Once Deploy Anywhere for multiple platforms
IDE for data scientists and developers
Enterprise Class Support
DistributedR
DeployR DevelopR
ScaleR
ConnectR
Cloud – SQL Server/SQL Azure
Shifting how you purchase and manage machines
Increased focus on Total Cost of Ownership and continuous improvements
Built from the same code base
We increased surface area compatibility with V12 Azure SQL Database
We’re learning how to run our own code – the good and the bad
We’re using that to improve both product and service
Microsoft is the only provider both on-premises and in the cloud
Order history
Name SSN Date
Jane Doe cm61ba906fd 2/28/2005
Jim Gray ox7ff654ae6d 3/18/2005
John Smith i2y36cg776rg 4/10/2005
Bill Brown nx290pldo90l 4/27/2005
Sue Daniels ypo85ba616rj 5/12/2005
Customer data
Product data
Order History
Stretch to cloud
Stretch SQL Server into Azure
Stretch warm and cold tables to Azure with remote query processing
App
Query
Microsoft Azure
Jim Gray ox7ff654ae6d 3/18/2005
SQL Server 2016
Azure SQL DW
Fully managed relational data warehouse-as-a-service
First elastic cloud data warehouse with proven SQL Server capabilities
Support your smallest to your largest data storage needs
Scales to petabytes of data
Massively Parallel Processing
Instant-on compute scales in seconds
Query Relational / Non-Relational
SaaS
Azure
Public Cloud
Office 365
Get started in minutes
Integrated with Azure ML, PowerBI & ADF
Simple billing compute & storage
Pay for what you need, when you need it with dynamic pause
Azure
Store any data: relations
Do any analysis: SQL queries, Hive
At any speed: batch
At any scale … elastic!
Anywhere
Data to Intelligent Action
Web Logs, Omniture logs
On-Premise SQL Server
(customer and product data)
In-Store Activity with
Kinect sensors
Social Data
Diagnostic streaming
Event hubs
Machine Learning
Stream Analytics
Azure DataLake
Data Factory: Move Data, Orchestrate, Schedule, and Monitor
HDInsight HDInsight Machine Learning
Azure SQL Data Warehouse
Power BI
INGEST PREPARE ANALYZE PUBLISH
Stream Analytics
CONSUME
DATA SOURCES
Cortana
Web/LOB Dashboards
Azure Data Analytics Stack
REEF library
STORAGE
YARN
HDFS/WebHDFS API
Compute-tier Cache Clusters (Local ENs + CSM): RAM / SSD / HDD
WAS-based Remote Storage
Cosmos Store API
CLUSTER-WIDE RM (YARN++)
YARN + Federation
YARN + Rayon (Capacity reservation)
YARN + Mercury
Shared micro-services for all
metadata (extent map, logical name space, secure
store) based on Hekaton/RSL
rings
Application Engines
Per-job RM and runtime
M/R
U-SQL
Batch
Spark
Tez
Spark
Runtime
Spark
Hive
U-SQL
Azure ML
Azure SA
COMPUTE TIER
SQL-DW HDInsight IaaS
Services
Windows
SMSG
LiveAds
CRM/Dynamics
Windows Phone
Xbox Live
Office365
STB Malware Protection
Microsoft Stores
STB Commerce Risk
Messenger
LCA
Exchange
Yammer
Skype
Bing
data managed: EBs
cluster sizes: 10s of Ks
# machines: 100s of Ks
daily I/O: >100 PBs
# internal developers: 1000s
# daily jobs: 100s of Ks
Observation
Pattern
Theory
Hypothesis
What will happen?
How can we make it happen?
Predictive
Analytics
Prescriptive
Analytics
What happened?
Why did it happen?
Descriptive
Analytics
Diagnostic
Analytics
Confirmation
Theory
Hypothesis
Observation
Implement Data Warehouse
Physical Design
ETL
Development
Reporting &
Analytics
Development
Install and Tune
Reporting & Analytics Design
Dimension Modelling
ETL Design
Setup Infrastructure
Understand Corporate Strategy
Data sources
ETL
BI and analytic
Data warehouse
Gather Requirements
Business Requirements
Technical Requirements
Ingest all data regardless of requirements
Store all data in native format without
schema definition
Do analysis
Using analytic engines like Hadoop
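The "store in native format without schema definition" idea above can be sketched as schema-on-read: the raw text is kept untouched and a schema is applied only when a query reads it. This is a minimal Python sketch, not how any Azure service is implemented; the sample rows, `ORDERS_SCHEMA`, and both helper functions are made up for illustration.

```python
import csv
import io

# Raw data is stored exactly as it arrived; no schema is declared at ingest time.
# The sample rows are invented for this sketch.
RAW_FILE = "1,101,2005-02-28,19.99\n2,102,2005-03-18,5.00\n3,101,2005-04-10,42.50\n"

# A schema is just a list of (column name, parser) pairs chosen by the reader.
ORDERS_SCHEMA = [("oid", int), ("cid", int), ("odate", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Apply a schema while reading, yielding one typed dict per row."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: parse(value) for (name, parse), value in zip(schema, fields)}


def total_by_customer(rows):
    """Aggregate order amounts per customer id."""
    totals = {}
    for row in rows:
        totals[row["cid"]] = totals.get(row["cid"], 0.0) + row["amount"]
    return totals


totals = total_by_customer(read_with_schema(RAW_FILE, ORDERS_SCHEMA))
print(totals)
```

A different reader could apply a different schema to the same `RAW_FILE` without rewriting the stored data, which is the point of deferring the schema.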
Interactive queries
Batch queries
Machine Learning
Data warehouse
Real-time analytics
Devices
What happened?
What is happening?
Why did it happen?
What are key relationships?
What will happen?
What if?
How risky is it?
What should happen?
What is the best option?
How can I optimize?
Data sources
Handling failures
Sharing data, resources
Parallelism
Data-aware Optimization
Security, Compliance, Governance
Enterprise
Forrester Wave
Big Data Hadoop
Cloud Solutions
Q2 2016
• Interactive and Real-Time Analytics requires i
• Massive data volumes require scale-out stores using commodity servers, even archival storage
Tiered Storage
Seamlessly move data across tiers, mirroring life-cycle and usage patterns
Schedule compute near low-latency copies of data
How can we manage this trade-off without moving data across
different storage systems (and governance boundaries)?
• Many different analytic engines (OSS and vendors; SQL, ML; batch, interactive, streaming)
• Many users’ jobs (across these job types) run on the same machines (where the data lives)
Resource Management with Multitenancy and SLAs
Policy-driven management of vast compute pools co-located with data
Schedule computation “near” data
How can we manage this multi-tenanted heterogeneous job mix
across tens of thousands of machines?
Azure Data Lake Store
Fully managed cloud data store designed for analytics
Supports HDFS compliant analytics applications and tools
Petabyte files, unlimited account size
High throughput for analytics performance
Low latency ingestion with read as you write
AAD-based authentication, access auditing
File and folder-level ACLs, Encryption at rest
ADLS Security: Encryption-at-Rest
Transparently encrypts data flowing
to and from public networks as well
as at rest
Transparent server-side encryption
User can manage their own
encryption keys or let Azure Data
Lake Store manage the key using
Azure Key Vault
ADLS Security: Role-Based Access Control
Each file and directory is associated
with an owner and a group
Files or directories have separate
permissions (read(r), write(w),
execute(x)) for owners, members of
the group, and for all other users
Fine-grained access control lists
(ACLs) can be specified for specific
named users or named groups
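The owner / group / other permissions and the named-user and named-group ACL entries described above can be sketched as a small permission check. This is a simplified illustration in Python, assuming POSIX-like evaluation order (owner, then named user, then groups, then other) and ignoring details such as ACL masks; the `Entry` class, `is_allowed`, and the sample users are hypothetical and are not the Azure Data Lake Store API.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """A file or directory with an owner, a group, and permission strings like 'rw-'."""
    owner: str
    group: str
    owner_perms: str
    group_perms: str
    other_perms: str
    named_users: dict = field(default_factory=dict)   # user name -> permission string
    named_groups: dict = field(default_factory=dict)  # group name -> permission string


def is_allowed(entry, user, user_groups, wanted):
    """Return True if `user` holds permission `wanted` ('r', 'w' or 'x') on `entry`."""
    if user == entry.owner:
        return wanted in entry.owner_perms
    if user in entry.named_users:
        return wanted in entry.named_users[user]
    # Collect every group entry that applies to this user.
    group_perms = []
    if entry.group in user_groups:
        group_perms.append(entry.group_perms)
    group_perms += [p for g, p in entry.named_groups.items() if g in user_groups]
    if group_perms:
        return any(wanted in p for p in group_perms)
    return wanted in entry.other_perms


# Made-up example: alice owns the file, finance may read it, bob has a named-user entry.
report = Entry(owner="alice", group="finance", owner_perms="rw-",
               group_perms="r--", other_perms="---", named_users={"bob": "rw-"})
print(is_allowed(report, "carol", {"finance"}, "r"))
```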
ADL Store: Ingress
Data can be ingested into Azure Data Lake Store from a variety of sources
Server logs
Azure Event Hub
Apache Flume
Azure Storage Blobs
Custom programs
.NET SDK
JavaScript CLI
Azure Portal
Azure PowerShell
Azure Data Factory
Apache Sqoop
Azure SQL DB
Azure SQL DW
Azure tables
Table Storage
On-premises databases
SQL
ADL Store
Built-in
copy service
ADL Store: Egress
Data can be exported from Azure Data Lake Store into numerous targets/sinks
Azure SQL DB
SQL
Azure SQL DW
Azure Tables
Table Storage
On-premises databases
Azure Data Factory
Apache Sqoop
Azure Storage Blobs
Custom programs
.NET SDK
JavaScript CLI
Azure Portal
Azure PowerShell
Built-in
copy service
ADL Store
COMPUTE TIER
1) Filename Translation (Naming Service)
2) Azure Access Keys (Secret Store)
3) Find Extents (Extent Metadata)
4) Data access (Remote Storage)
Remote storage tier builds securely on WAS
Secure Store Service
Secure
Works with YARN!
Intelligent ingest
Massively parallel
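The four numbered steps in the diagram (filename translation, access keys, find extents, data access) can be read as a lookup chain. The sketch below is only an illustration of that chain in Python; every dictionary and function name is a hypothetical in-memory stand-in, and nothing here reflects the actual service interfaces.

```python
# Hypothetical stand-ins for the services named in the diagram.
NAMING_SERVICE = {"/logs/web.txt": "file-001"}                # path -> internal file id
SECRET_STORE = {"storage-a": "key-a", "storage-b": "key-b"}   # storage account -> access key
EXTENT_METADATA = {"file-001": [("storage-a", "ext-1"), ("storage-b", "ext-2")]}
REMOTE_STORAGE = {
    ("storage-a", "ext-1"): b"first extent, ",
    ("storage-b", "ext-2"): b"second extent",
}


def fetch_extent(account, key, extent_id):
    """Pretend remote read that refuses to serve data without the right key."""
    if SECRET_STORE.get(account) != key:
        raise PermissionError(f"bad key for {account}")
    return REMOTE_STORAGE[(account, extent_id)]


def read_file(path):
    file_id = NAMING_SERVICE[path]        # 1) filename translation
    keys = dict(SECRET_STORE)             # 2) obtain access keys
    extents = EXTENT_METADATA[file_id]    # 3) find extents
    # 4) data access: read each extent in order and concatenate
    return b"".join(fetch_extent(acct, keys[acct], ext) for acct, ext in extents)


print(read_file("/logs/web.txt"))
```

Because each extent is fetched independently, step 4 is the part that could be issued in parallel, which fits the "massively parallel" label on the slide.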
• Interactive and Real-Time Analytics requires i
• Massive data volumes require scale-out stores using commodity servers, even archival storage
Tiered Storage
Scale storage independently of compute
Seamlessly move data across tiers, mirroring life-cycle and usage patterns
Schedule compute near low-latency copies of data
Data Lifecycle Management
How can we manage this trade-off without moving data across
different storage systems (and governance boundaries)?
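One way to picture "move data across tiers, mirroring life-cycle and usage patterns" is a policy that maps how recently a file was used to a tier. This is a minimal Python sketch under assumed rules: the tier names come from the stack slide (RAM / SSD / HDD plus remote storage), but the idle-day thresholds, function names, and sample files are invented for illustration.

```python
# (max idle days, tier) pairs, hottest first. Thresholds are hypothetical.
TIER_POLICY = [(1, "RAM"), (7, "SSD"), (30, "HDD")]
COLDEST_TIER = "remote storage"


def choose_tier(days_since_last_access):
    """Pick the hottest tier whose idle-time limit the file still satisfies."""
    for max_idle_days, tier in TIER_POLICY:
        if days_since_last_access <= max_idle_days:
            return tier
    return COLDEST_TIER


def plan_moves(files):
    """files: name -> (current tier, days idle). Return (name, from, to) for files that should move."""
    moves = []
    for name, (current, idle_days) in sorted(files.items()):
        target = choose_tier(idle_days)
        if target != current:
            moves.append((name, current, target))
    return moves


files = {
    "clicks-today.log": ("SSD", 0),
    "orders-2016.csv": ("SSD", 45),
    "users.tbl": ("HDD", 20),
}
print(plan_moves(files))
```

All tiers sit behind one namespace in this sketch, so a move changes placement only and the file name the user sees stays the same.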
[Same diagram as above, now with a Local Storage tier in addition to Remote Storage]
Azure HDInsight—Linux and Windows
Managed, Monitored, Supported
• Cluster customization – Install your favorite project
• Harness existing .NET & Java skills to write customer extensions
• Supports broad ecosystem of ISVs (Hadoop and Traditional)
Full Apache Hadoop
• Batch – MapReduce, Pig, Hive, Spark
• Stream Processing and Analytics – Storm, Spark Streaming
• Interactive SQL – Hive (Tez), and SparkSQL
• Table Serving – HBase
• Machine Learning – SparkML, Mahout
Batch: MapReduce, Pig, Hive, Spark
Interactive SQL: Hive (Tez), SparkSQL
Stream Analytics: Storm, Spark Streaming
Machine Learning: SparkML, Mahout
Table Serving: HBase
Exploratory Visualization: Jupyter, Zeppelin
Interactive SQL: SQL DW
Stream Analytics: Azure Stream Analytics
Machine Learning: Azure ML
Table Serving: Azure SQL DB
Exploratory Visualization: Power BI
Azure Data Lake Analytics Service
A new distributed analytics service
Built on Apache YARN
Scales dynamically with a dial
Pay by the query
Supports Azure AD for access control, roles, and integration with on-premises identity systems
U-SQL language unifies the benefits of SQL with the power of C#
Hive etc. will be added over time
Processes data across Azure
Get started
1. Log in to Azure
2. Create an ADLA account
3. Write and submit an ADLA job with U-SQL (or Hive/Pig)
4. The job reads and writes data from storage
30 seconds
ADLS
Azure Blobs
Azure DB
…
ADLA Complements HDInsight
HDInsight
Dedicated managed clusters for developers familiar with open source tools: Java, Eclipse, Hive, etc.
Clusters offer customization, control, and flexibility in a managed Hadoop cluster
ADLA
Enables customers to leverage
existing experience with C#, SQL &
PowerShell
Offers convenience, efficiency, and
automatic scale in a “job service”
form factor over a system-managed
shared resource pool
U-SQL A hyper-scalable, highly extensible
language for preparing, transforming
and analyzing all data
Allows users to focus on the what—
not the how—of business problems
Built on familiar languages (SQL and
C#) and supported by a fully integrated
development environment
Built for data developers & scientists
U-SQL Language Philosophy
Declarative query and transformation language:
• Uses SQL’s SELECT FROM WHERE with GROUP BY/aggregation, joins, SQL Analytics functions
• Optimizable, scalable
Operates on unstructured & structured data
• Schema on read over files
• Relational metadata objects (e.g. database, table)
Extensible from ground up:
• Type system is based on C#
• Expression language is C#
User-defined functions (U-SQL and C#)
User-defined types (U-SQL/C#) (future)
User-defined aggregators (C#)
User-defined operators (UDO) (C#)
U-SQL provides the parallelization and scale-out framework for user code
• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINERS
Expression-flow programming style:
• Easy to use functional lambda composition
• Composable, globally optimizable
Federated query across distributed data sources (soon)
// Make user code (MyNamespace.MyFunction, MyData.Write) available to the script.
REFERENCE ASSEMBLY MyDB.MyAssembly;

CREATE TABLE T( cid int, first_order DateTime
              , last_order DateTime, order_count int, order_amount float );

// Schema on read: column names and types are applied while extracting the files.
@o = EXTRACT oid int, cid int, odate DateTime, amount float
     FROM "/input/orders.txt"
     USING Extractors.Csv();

@c = EXTRACT cid int, name string, city string
     FROM "/input/customers.txt"
     USING Extractors.Csv();

// Per-customer order summary; the WHERE clause uses C# expressions.
@j = SELECT c.cid, MIN(o.odate) AS firstorder
          , MAX(o.odate) AS lastorder, COUNT(o.oid) AS ordercnt
          , SUM(o.amount) AS totalamount
     FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
     WHERE c.city.StartsWith("New")
        && MyNamespace.MyFunction(o.odate) > 10
     GROUP BY c.cid;

// Write the result to a file with a user-defined outputter, then load it into T.
OUTPUT @j TO "/output/result.txt" USING new MyData.Write();

INSERT INTO T SELECT * FROM @j;
Federated Queries: Query Data Where It Lives
Easily query data in multiple Azure data stores without moving it to a single store
Benefits
Avoid moving large amounts of data across the network between stores
Single view of data irrespective of physical location
Minimize data proliferation issues caused by maintaining multiple copies
Single query language for all data
Each data store maintains its own sovereignty
Design choices based on the need
U-SQL
Query
Result
Query
Azure Storage Blobs
Azure SQL in VMs
Azure SQL DB
Azure Data Lake Analytics
Join Local (ADLS) and External Data
1. Create two tables.
• An external table ‘PurchaseOrders’ that refers to the PurchaseOrders table in the external Azure SQL DB.
• A ‘local’ table ‘UserIdsTable’ created by ‘extracting’ the user ID and region fields from the WebLogRecords.txt file stored in Azure Data Lake.
2. Join the PurchaseOrders table with UserIdsTable on the common UserId column.
Purchase orders table
Azure SQL DB
External
purchase orders
table
Local
user IDs
table
JOIN
(on User IDs)
Azure Data Lake
Analytics
Find sum of all purchases by users in the ‘en-us’ region
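The local-plus-external join can be sketched outside U-SQL to show the shape of the computation. In this Python sketch an in-memory SQLite database stands in for the external Azure SQL DB and a string stands in for WebLogRecords.txt; the `Amount` column, the log layout, and all sample rows are assumptions made for illustration. Only the filter on user IDs is sent to the external store, so the purchase rows themselves are not copied out wholesale.

```python
import csv
import io
import sqlite3

# Stand-in for the external database holding the PurchaseOrders table.
external_db = sqlite3.connect(":memory:")
external_db.execute("CREATE TABLE PurchaseOrders (UserId INTEGER, Amount REAL)")
external_db.executemany(
    "INSERT INTO PurchaseOrders VALUES (?, ?)",
    [(1, 10.0), (2, 25.0), (1, 5.5), (3, 40.0)],
)

# Stand-in for WebLogRecords.txt: made-up rows of UserId, Region, Url.
WEB_LOG_RECORDS = "1,en-us,/home\n2,en-gb,/cart\n3,en-us,/checkout\n1,en-us,/cart\n"


def extract_user_ids(raw_text):
    """Build the 'local' table: distinct (UserId, Region) pairs from the log."""
    return {(int(row[0]), row[1]) for row in csv.reader(io.StringIO(raw_text))}


def sum_purchases_for_region(region):
    """Join local user IDs with external purchases and sum the amounts."""
    user_ids = sorted(uid for uid, reg in extract_user_ids(WEB_LOG_RECORDS) if reg == region)
    if not user_ids:
        return 0.0
    placeholders = ",".join("?" for _ in user_ids)
    row = external_db.execute(
        f"SELECT COALESCE(SUM(Amount), 0) FROM PurchaseOrders WHERE UserId IN ({placeholders})",
        user_ids,
    ).fetchone()
    return row[0]


total = sum_purchases_for_region("en-us")
print(total)
```

User IDs are de-duplicated before the join so that a user who appears in several log lines is not counted more than once.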
Query 9
WebLogRecords.txt
Concepts: Jobs, Stages and Vertexes
Each job is broken into a number
of vertexes
Each vertex is some work that
needs to be done
Input
Output
Output
6 Stages
8 Vertexes
Vertexes are organized into stages
– Vertexes in each stage do the same work on different parts of the same data
– A vertex in one stage may depend on a vertex in an earlier stage
Stages themselves are organized into an acyclic graph
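The job / stage / vertex structure can be sketched as a small dependency graph. The Python example below is an illustration only: the stage names, vertex counts, and dependencies are invented (chosen to add up to 6 stages and 8 vertexes like the slide's picture), and the scheduling uses the standard library's `graphlib` rather than anything specific to Azure Data Lake Analytics.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job: stage name -> (number of vertexes, stages it depends on).
JOB = {
    "extract_orders": (2, []),
    "extract_customers": (2, []),
    "join": (1, ["extract_orders", "extract_customers"]),
    "aggregate": (1, ["join"]),
    "write_file": (1, ["aggregate"]),
    "write_table": (1, ["aggregate"]),
}


def schedule(job):
    """Return stages in waves; every stage in a wave has all its dependencies finished."""
    sorter = TopologicalSorter({stage: deps for stage, (_, deps) in job.items()})
    sorter.prepare()  # raises graphlib.CycleError if the stage graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = schedule(JOB)
total_vertexes = sum(count for count, _ in JOB.values())
print(len(JOB), "stages,", total_vertexes, "vertexes")
for number, wave in enumerate(waves, start=1):
    print("wave", number, wave)
```

Stages in the same wave have no dependency on each other, so their vertexes could run at the same time if enough containers are available.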
Resource Managers for Big Data
Allocate compute containers to competing jobs
Multiple job engines share a pool of containers
YARN: Resource manager for Hadoop 2.x
Corona, Mesos, Omega
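To make "allocate compute containers to competing jobs" concrete, here is a minimal Python sketch of max-min fair sharing, one common textbook policy. It is not a description of how YARN, Corona, Mesos, or Omega actually schedule; the function name, the job names, and the assumption that all containers are identical are illustrative simplifications.

```python
def allocate_containers(total_containers, demands):
    """Max-min fair allocation of identical containers.

    demands maps a job name to the number of containers it asks for.
    Small demands are fully met first; leftover capacity is split evenly
    among the jobs that still want more.
    """
    allocation = {job: 0 for job in demands}
    remaining = total_containers
    unsatisfied = {job for job, need in demands.items() if need > 0}
    while remaining > 0 and unsatisfied:
        share = max(remaining // len(unsatisfied), 1)
        for job in sorted(unsatisfied):  # sorted only to make ties deterministic
            if remaining == 0:
                break
            grant = min(share, demands[job] - allocation[job], remaining)
            allocation[job] += grant
            remaining -= grant
        unsatisfied = {job for job in unsatisfied if allocation[job] < demands[job]}
    return allocation


# Three engines compete for a pool of 10 containers.
allocation = allocate_containers(10, {"hive": 2, "spark": 8, "usql": 6})
print(allocation)
```

A production resource manager also has to deal with heterogeneous container sizes, data locality, queues, and preemption, none of which this sketch models.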
Shared Data and Compute
Tiered Storage
Relational Query Engine
Machine Learning
Compute Fabric (Resource Management)
Multiple analytic engines sharing same
resource pool
Compute and store/cache on same machines
What’s Behind a U-SQL Query
YARN Gaps
• Resource allocation SLOs
• Scalability limitations
• High allocation latency
• Support for specialized execution frameworks
  • Interactive environments, long-running services
• Amoeba, Rayon
  • Status: shipping in Apache Hadoop 2.6
• Mercury and Yaq
  • Status: now in Apache Hadoop trunk
• Federation
  • Status: prototype and JIRA
• Framework-level Pooling
  • Enable frameworks that want to take over resource allocation to support millisecond-level response and adaptation times
  • Status: spec
Microsoft Contributions to OSS Apache YARN
REEF
http://ww.reef-project.org http://reef.incubator.apache.org
http://aka.ms/adltechblog/