Big Data on Azure
David Giard
Jan 13, 2017
Transcript
Page 1: Big Data on Azure

Big Data on Azure

Page 2: Big Data on Azure

David Giard
Microsoft Technical Evangelist
[email protected]
@davidgiard

Page 3: Big Data on Azure

Cloud Computing
Host some or all of your data or applications on third-party servers, in a highly scalable, highly reliable way.

Page 4: Big Data on Azure

Advantages of Cloud Computing
• Lower capital costs
• Flexible operating cost (rent vs. buy)
• Platform as a Service
• Freedom from infrastructure / hardware
• Redundancy
• Automatic monitoring and failover

Page 5: Big Data on Azure

[Chart: Demand and Capacity, plotted by month (Jan-Dec)]

Page 6: Big Data on Azure

[Chart: Demand and Capacity by month, with over-provisioned capacity labeled "Waste"]

Page 7: Big Data on Azure

[Chart: Demand and Capacity by month, with over-provisioning labeled "Waste" and under-provisioning labeled "Lost Opportunity"]

Page 8: Big Data on Azure

[Chart: Demand and Capacity by month]

Page 9: Big Data on Azure

[Chart: Demand and Capacity by day of week (Mon-Fri across two weeks)]

Page 10: Big Data on Azure

[Chart: Demand and Capacity by hour (1:00-12:00)]

Page 11: Big Data on Azure

[Chart: Big Data Demand by hour (1:00-12:00)]

Page 12: Big Data on Azure

Cost Factors
• Service
• VM size
• Number of VMs
• Time
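A quick back-of-the-envelope model of how these factors combine. This is a sketch, not published pricing: the function and the $0.50/VM-hour rate are hypothetical, and in practice the rate is driven by the service chosen and the VM size.

# Hypothetical HDInsight cost sketch: hourly rate (set by service + VM
# size) times the number of VMs times the hours the cluster exists.
def cluster_cost(rate_per_vm_hour, vm_count, hours):
    return rate_per_vm_hour * vm_count * hours

# e.g., an 8-node cluster kept up for a 6-hour nightly batch:
print(cluster_cost(0.50, 8, 6))  # 24.0 (dollars per run)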

Page 13: Big Data on Azure

HDInsight

Page 14: Big Data on Azure

Azure HDInsight
• Microsoft Azure's big-data solution, built on Hadoop
• Hadoop is an open-source framework for storing and analyzing massive amounts of data on clusters built from commodity hardware
  • Uses the Hadoop Distributed File System (HDFS) for storage
• Employs the open-source Hortonworks Data Platform implementation of Hadoop
  • Includes HBase, Hive, Pig, Storm, Spark, and more
• Integrates with popular BI tools
  • Power BI, Excel, SSAS, SSRS, Tableau

Page 15: Big Data on Azure

Apache Hadoop on Azure
• Automatic cluster provisioning and configuration
  • Bypasses an otherwise labor-intensive manual process
• Cluster scaling
  • Change the number of nodes without deleting and re-creating the cluster
• High availability and reliability
  • Managed solution with a 99.9% SLA
  • HDInsight includes a secondary head node
• Reliable and economical storage
  • HDFS mapped over Azure Blob Storage
  • Accessed through the "wasb://" protocol prefix (example below)
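Because HDFS is mapped over Blob Storage, cluster jobs address blobs with ordinary file-system paths. A minimal sketch of what that looks like from a cluster head node, assuming Python 3.5+ and the standard HDInsight sample-data directory; the container and account names in the second path are hypothetical:

import subprocess

# List the default container's contents through the wasb:// scheme.
subprocess.run(['hdfs', 'dfs', '-ls', 'wasb:///example/data'], check=True)

# The fully qualified form names the container and storage account explicitly.
uri = 'wasb://mycontainer@myaccount.blob.core.windows.net/logs/2017/01/'
subprocess.run(['hdfs', 'dfs', '-ls', uri], check=True)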

Page 16: Big Data on Azure

Lambda Architecture
• Batch Layer
• Speed Layer
• Serving Layer
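To make the three layers concrete, here is a toy sketch (not from the talk): the batch layer periodically recomputes a view over all history, the speed layer keeps a cheap real-time delta, and the serving layer answers queries by merging the two.

from collections import Counter

historical_events = ['a', 'b', 'a', 'c']   # stand-in for data at rest

batch_view = Counter(historical_events)    # batch layer: full recompute
speed_view = Counter()                     # speed layer: recent events only

def on_new_event(event):
    speed_view[event] += 1                 # incremental, low-latency update

def serve(key):
    # serving layer: merge the (possibly stale) batch view with the delta
    return batch_view[key] + speed_view[key]

on_new_event('a')
print(serve('a'))  # 3: two from the batch view, one from the speed layer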

Page 17: Big Data on Azure

[Diagram: HDInsight clusters]

Page 18: Big Data on Azure

[Diagram: HDInsight clusters backed by Blob Storage]

Page 19: Big Data on Azure

HDInsight Cluster Types
• Hadoop: query workloads
  • Reliable data storage, simple MapReduce
• HBase: NoSQL workloads
  • Distributed database offering random access to large amounts of data
• Apache Storm: stream workloads
  • Real-time analysis of moving data streams
• Apache Spark: high-performance workloads
  • In-memory parallel processing

Page 20: Big Data on Azure

Cluster Creation

Page 21: Big Data on Azure

Cluster Creation

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "clusterName": {
      "type": "string",
      "metadata": { "description": "The name of the HDInsight cluster to create." }
    },
    "clusterLoginUserName": {
      "type": "string",
      "defaultValue": "admin",
      "metadata": { "description": "These credentials can be used to submit jobs to the cluster and to log into cluster dashboards." }
    },
    "clusterLoginPassword": {
      "type": "securestring",
      "metadata": { "description": "The password must be at least 10 characters in length and must contain at least one digit, one non-alphanumeric character, and one upper or lower case letter." }
    },
    "sshUserName": {
      "type": "string",
      "defaultValue": "sshuser",
      "metadata": { "description": "These credentials can be used to remotely access the cluster." }
    },
    "sshPassword": {
      "type": "securestring",
      "metadata": { "description": "The password must be at least 10 characters in length and must contain at least one digit, one non-alphanumeric character, and one upper or lower case letter." }
    },
    "location": {
      "type": "string",
      "defaultValue": "East US",
      "allowedValues": [ "East US", "East US 2", "North Central US", "South Central US", "West US", "North Europe", "West Europe", "East Asia", "Southeast Asia", "Japan East", "Japan West", "Australia East", "Australia Southeast" ],
      "metadata": { "description": "The location where all azure resources will be deployed." }
    },
    "clusterType": {
      "type": "string",
      "defaultValue": "storm",
      "allowedValues": [ "hadoop", "hbase", "storm", "spark" ],
      "metadata": { "description": "The type of the HDInsight cluster to create." }
    },
    "clusterWorkerNodeCount": {
      "type": "int",
      "defaultValue": 2,
      "metadata": { "description": "The number of nodes in the HDInsight cluster." }
    }
  },
  "variables": {
    "defaultApiVersion": "2015-05-01-preview",
    "clusterApiVersion": "2015-03-01-preview",
    "clusterVersion": "3.4",
    "clusterOSType": "Linux",
    "clusterStorageAccountName": "[concat(parameters('clusterName'),'store')]",
    "clusterStorageAccountType": "Standard_LRS"
  },
  "resources": [
    {
      "name": "[variables('clusterStorageAccountName')]",
      "type": "Microsoft.Storage/storageAccounts",
      "location": "[parameters('location')]",
      "apiVersion": "[variables('defaultApiVersion')]",
      "dependsOn": [],
      "tags": {},
      "properties": {
        "accountType": "[variables('clusterStorageAccountType')]"
      }
    },
    {
      "name": "[parameters('clusterName')]",
      "type": "Microsoft.HDInsight/clusters",
      "location": "[parameters('location')]",
      "apiVersion": "[variables('clusterApiVersion')]",
      "dependsOn": [
        "[concat('Microsoft.Storage/storageAccounts/',variables('clusterStorageAccountName'))]"
      ],
      "tags": {},
      "properties": {
        "clusterVersion": "[variables('clusterVersion')]",
        "osType": "[variables('clusterOSType')]",
        "clusterDefinition": {
          "kind": "[parameters('clusterType')]",
          "configurations": {
            "gateway": {
              "restAuthCredential.isEnabled": true,
              "restAuthCredential.username": "[parameters('clusterLoginUserName')]",
              "restAuthCredential.password": "[parameters('clusterLoginPassword')]"
            }
          }
        },
        "storageProfile": {
          "storageaccounts": [
            {
              "name": "[concat(variables('clusterStorageAccountName'),'.blob.core.windows.net')]",
              "isDefault": true,
              "container": "[parameters('clusterName')]",
              "key": "[listKeys(resourceId('Microsoft.Storage/storageAccounts', variables('clusterStorageAccountName')), variables('defaultApiVersion')).key1]"
            }
          ]
        },
        "computeProfile": {
          "roles": [
            {
              "name": "headnode",
              "targetInstanceCount": "2",
              "hardwareProfile": { "vmSize": "Standard_D3" },
              "osProfile": {
                "linuxOperatingSystemProfile": {
                  "username": "[parameters('sshUserName')]",
                  "password": "[parameters('sshPassword')]"
                }
              }
            },
            {
              "name": "workernode",
              "targetInstanceCount": "[parameters('clusterWorkerNodeCount')]",
              "hardwareProfile": { "vmSize": "Standard_D3" },
              "osProfile": {
                "linuxOperatingSystemProfile": {
                  "username": "[parameters('sshUserName')]",
                  "password": "[parameters('sshPassword')]"
                }
              }
            },
            {
              "name": "zookeepernode",
              "targetInstanceCount": "3",
              "hardwareProfile": { "vmSize": "Standard_A1" },
              "osProfile": {
                "linuxOperatingSystemProfile": {
                  "username": "[parameters('sshUserName')]",
                  "password": "[parameters('sshPassword')]"
                }
              }
            }
          ]
        }
      }
    }
  ],
  "outputs": {
    "cluster": {
      "type": "object",
      "value": "[reference(resourceId('Microsoft.HDInsight/clusters',parameters('clusterName')))]"
    }
  }
}

Page 22: Big Data on Azure

Demo

Page 23: Big Data on Azure
Page 24: Big Data on Azure

Storm
• Apache Storm is a distributed, fault-tolerant, open-source computation system that lets you process data in real time with Hadoop.
• Apache Storm on HDInsight lets you create distributed, real-time analytics solutions in the Azure environment by using Apache Hadoop.
• Storm solutions can provide guaranteed processing of data, with the ability to replay data that was not successfully processed the first time.
• Storm components can be written in C#, Java, and Python.
• Clusters scale up or down in Azure without impacting running Storm topologies.
• Easy to provision and use from the Azure portal.
• Visual Studio project templates for Storm apps.

Page 25: Big Data on Azure

Storm
• Apache Storm apps are submitted as topologies.
  • A topology is a graph of computation that processes streams.
• Stream: an unbounded collection of tuples. Streams are produced by spouts and bolts, and consumed by bolts.
• Tuple: a named list of dynamically typed values.
• Spout: consumes data from a data source and emits one or more streams.
• Bolt: consumes streams, processes tuples, and may emit new streams. Bolts are also responsible for writing data to external storage, such as a queue, HDFS, HBase, a blob, or another data store.
• Nimbus: analogous to the JobTracker in Hadoop; distributes jobs and monitors failures.

Page 26: Big Data on Azure

[Diagram: an Apache Storm topology. An Event Source feeds a Spout, which emits a stream of tuples such as {"timestamp": 1234567890, "measurement": "123", "location": "ABC123"}; Bolts consume each stream, transform the tuples (adding or dropping keys), and emit new streams to downstream Bolts.]
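The spout/bolt roles above can be sketched in plain Python. This is purely conceptual, not the real Storm API (actual components would target the Java API, SCP.NET for C#, or Storm's multi-lang protocol for Python):

import random
import time

def sensor_spout():
    """Spout: consume an event source, emit an unbounded stream of tuples."""
    while True:
        yield {'timestamp': time.time(),
               'measurement': random.randint(0, 200),
               'location': 'ABC123'}

def threshold_bolt(stream):
    """Bolt: consume a stream, process each tuple, emit a new stream."""
    for tup in stream:
        if tup['measurement'] > 100:
            yield tup

def storage_bolt(stream):
    """Terminal bolt: write results to external storage (printed here)."""
    for tup in stream:
        print('ALERT', tup)

# The "topology" is the graph wiring spout to bolts (a simple chain here).
# storage_bolt(threshold_bolt(sensor_spout()))  # runs until interrupted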

Page 27: Big Data on Azure

Demo

Page 28: Big Data on Azure

@DavidGiard

Page 29: Big Data on Azure

HBase
• Apache HBase is an open-source, NoSQL database that is built on Hadoop and modeled after Google BigTable.
• HBase provides random access and strong consistency for large amounts of unstructured and semi-structured data, in a schemaless database organized by column families.
• Data is stored in the rows of a table, and data within a row is grouped by column family.
• The open-source code scales linearly to handle petabytes of data on thousands of nodes. It can rely on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop ecosystem.

Page 30: Big Data on Azure

HBase
• HBase commands:
  • create - equivalent to CREATE TABLE in T-SQL
  • get - equivalent to a SELECT statement in T-SQL
  • put - equivalent to UPDATE / INSERT statements in T-SQL
  • scan - equivalent to SELECT with no WHERE clause in T-SQL
  • delete / deleteall - equivalent to DELETE in T-SQL
• The HBase shell is the query tool for executing CRUD commands against an HBase cluster.
• Data can also be managed using the HBase C# API, which provides a client library on top of the HBase REST API.
• An HBase database can also be queried by using Hive.
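For a feel of those CRUD verbs from code rather than the shell, here is a sketch using the third-party happybase client. It assumes an HBase Thrift gateway is reachable - an assumption, since HDInsight primarily exposes the REST API that the C# library wraps - and the host and table names are hypothetical:

import happybase

connection = happybase.Connection('hbase-thrift-host', port=9090)

# create: a table with a single column family 'a'
connection.create_table('Sensors', {'a': dict()})
table = connection.table('Sensors')

# put: insert/update cells for one row key
table.put(b'982069', {b'a:1': b'10', b'a:2': b'20'})

# get: read a single row
print(table.row(b'982069'))

# scan: iterate every row (like SELECT with no WHERE clause)
for key, data in table.scan():
    print(key, data)

# delete: remove a row
table.delete(b'982069')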

Page 31: Big Data on Azure

HBase

RowKey   a:1   a:2   a:3   a:4   b:1   b:2   c:numA
982069    10    20    30    40     5     7     4
926025     9    11    21           4     9     3
254114    11    15    22    35     7    11     4
881514     8    14                 2     3     2

Columns a:1-a:4 belong to column family "a", b:1-b:2 to family "b", and c:numA to family "c"; empty cells are simply absent, since HBase stores sparse rows.

Page 32: Big Data on Azure

@DavidGiard

Page 33: Big Data on Azure

Hive
• Apache Hive is a data warehouse system for Hadoop that enables data summarization, querying, and analysis using HiveQL (a query language similar to SQL).
• Hive understands how to work with structured and semi-structured data, such as text files in which the fields are delimited by specific characters.
• Hive also supports custom serializers/deserializers for complex or irregularly structured data.
• Hive can also be extended through user-defined functions (UDFs).
  • A UDF allows you to implement functionality or logic that isn't easily modeled in HiveQL (streaming sketch below).
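One lightweight way to get UDF-like behavior is Hive's TRANSFORM clause, which streams rows through an external script over stdin/stdout. A minimal sketch in Python; the www_access columns match the next slide, but the script name and derived column are hypothetical:

#!/usr/bin/env python
# top_page.py - Hive streams tab-separated input rows to stdin and reads
# transformed tab-separated rows back from stdout.
import sys

for line in sys.stdin:
    ip, url = line.rstrip('\n').split('\t')
    # Emit the IP plus a derived flag: did this request hit the top page?
    print('\t'.join([ip, '1' if url == '/' else '0']))

It would be invoked with something like: ADD FILE top_page.py; SELECT TRANSFORM(ip, url) USING 'python top_page.py' AS (ip, is_top_page) FROM www_access;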

Page 34: Big Data on Azure

HiveQL

# Number of Records
SELECT COUNT(1) FROM www_access;

# Number of Unique IPs
SELECT COUNT(1) FROM (
  SELECT DISTINCT ip FROM www_access
) t;

# Number of Unique IPs that Accessed the Top Page
SELECT COUNT(DISTINCT ip) FROM www_access
WHERE url = '/';

# Number of Accesses per Unique IP
SELECT ip, COUNT(1) FROM www_access
GROUP BY ip LIMIT 30;

# Unique IPs Sorted by Number of Accesses
SELECT ip, COUNT(1) AS cnt FROM www_access
GROUP BY ip ORDER BY cnt DESC LIMIT 30;

# Number of Accesses After a Certain Time
SELECT COUNT(1) FROM www_access
WHERE TD_TIME_RANGE(time, "2011-08-19", NULL, "PDT");

# Number of Accesses Each Day
SELECT
  TD_TIME_FORMAT(time, "yyyy-MM-dd", "PDT") AS day,
  COUNT(1) AS cnt
FROM www_access
GROUP BY TD_TIME_FORMAT(time, "yyyy-MM-dd", "PDT");

(The TD_TIME_RANGE and TD_TIME_FORMAT functions are UDFs from Treasure Data rather than built-in HiveQL - an example of the UDF extensibility described on the previous slide.)

Page 35: Big Data on Azure

@DavidGiard

Page 36: Big Data on Azure

Apache Spark
• Interactive manipulation and visualization of data
  • Scala, Python, and R interactive shells
  • Jupyter Notebook with PySpark (Python) and Spark (Scala) kernels provides in-browser interaction
• Unified platform for processing multiple workloads
  • Real-time processing, machine learning, stream analytics, interactive querying, graphing
• Leverages in-memory processing for really big data
  • Resilient distributed datasets (RDDs)
  • APIs for processing large datasets
  • Up to 100x faster than MapReduce (word-count sketch below)
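The classic RDD demonstration is a word count. A sketch assuming a Spark cluster where the SparkContext sc already exists (as in an HDInsight Jupyter PySpark notebook); the input path is the Gutenberg sample that HDInsight clusters typically ship with:

# Transformations (flatMap/map/reduceByKey) build the RDD lineage lazily;
# the in-memory work only happens when an action like takeOrdered runs.
lines = sc.textFile('wasb:///example/data/gutenberg/davinci.txt')

counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.takeOrdered(10, key=lambda kv: -kv[1]))  # ten most frequent words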

Page 37: Big Data on Azure

Spark Components on HDInsight
• Spark Core
  • Includes Spark SQL, Spark Streaming, GraphX, and MLlib
• Anaconda
• Livy
• Jupyter Notebooks
• ODBC driver for connecting from BI tools (Power BI, Tableau)

Page 38: Big Data on Azure

Jupyter Notebooks on HDInsight
• Browser-based interface for working with text, code, equations, plots, graphics, and interactive controls in a single document.
• Includes preset Spark and Hive contexts (sc and sqlContext) - see the example below.
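So a first notebook cell can use those contexts directly, with nothing to construct. A sketch; the JSON path is hypothetical, and the calls follow the Spark 1.x sqlContext that clusters of this era preset:

# `sc` and `sqlContext` already exist inside an HDInsight PySpark notebook.
df = sqlContext.read.json('wasb:///example/data/people.json')
df.printSchema()

df.registerTempTable('people')
sqlContext.sql('SELECT COUNT(*) AS n FROM people').show()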

Page 39: Big Data on Azure

Demo

Page 40: Big Data on Azure

Items of Note About HDInsight
• There is no "suspend" for HDInsight clusters.
  • Provision the cluster, do the work, then delete the cluster to avoid unnecessary charges.
• Storage can be decoupled from the cluster and reused across deployments.
• You can deploy from the portal, but deployment is usually scripted in practice.
  • Easier, repeatable creation and deletion (see the teardown sketch below).
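The provision-work-delete rhythm is why scripting matters. Continuing the earlier Python SDK sketch (same client object; it assumes the cluster and its ephemeral resources live in their own resource group, which is an assumption of this example):

# Tearing down the resource group deletes the cluster and stops the
# per-hour charges; a storage account kept in a separate resource group
# survives and can back the next cluster.
delete_poller = client.resource_groups.delete('my-resource-group')
delete_poller.wait()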

Page 41: Big Data on Azure

@DavidGiard

Thank you!