Big Data-BI Fusion: Microsoft HDInsight & MS BI Level: Intermediate March 28, 2013 Andrew Brust CEO and Founder Blue Badge Insights

Big Data and NoSQL for Database and BI Pros

Jan 27, 2015


Technology

Andrew Brust

Big Data and NoSQL for Database and BI Pros - Visual Studio Live! March 2013
Transcript
Page 1: Big Data and NoSQL for Database and BI Pros

Big Data-BI Fusion: Microsoft HDInsight & MS BI

Level: Intermediate

March 28, 2013

Andrew Brust

CEO and Founder

Blue Badge Insights

Page 2: Big Data and NoSQL for Database and BI Pros

Meet Andrew

• CEO and Founder, Blue Badge Insights
• Big Data blogger for ZDNet
• Microsoft Regional Director, MVP
• Co-chair VSLive! and 18 years as a speaker
• Founder, MS BI and Big Data User Group of NYC
  – http://www.msbigdatanyc.com
• Co-moderator, NYC .NET Developers Group
  – http://www.nycdotnetdev.com
• “Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News
• brustblog.com, Twitter: @andrewbrust

Page 3: Big Data and NoSQL for Database and BI Pros

Andrew’s New Blog (bit.ly/bigondata)

Page 4: Big Data and NoSQL for Database and BI Pros

Read all about it!

Page 5: Big Data and NoSQL for Database and BI Pros

What is Big Data?

• 100s of TB into PB and higher
• Involving data from: financial data, sensors, web logs, social media, etc.
• Parallel processing often involved
  – Hadoop is emblematic, but other technologies are Big Data too
• Processing of data sets too large for transactional databases
  – Analyzing interactions, rather than transactions
  – The three V’s: Volume, Velocity, Variety
• Big Data tech sometimes imposed on small data problems

Page 6: Big Data and NoSQL for Database and BI Pros

The Hadoop Stack

MapReduce, HDFS

Database

RDBMS Import/Export

Query: HiveQL and Pig Latin

Machine Learning/Data Mining

Log file integration

Page 7: Big Data and NoSQL for Database and BI Pros

What’s MapReduce?

• Divide and conquer approach to “Big” data processing

• Partition the data and send to mappers (nodes in cluster)

• Mappers pre-process the data into key-value pairs, then all output for a given key (or set of keys) goes to the same reducer

• Reducer performs aggregations; one output per key, with value

• Map and Reduce code natively written as Java functions
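
The flow described above can be sketched in a few lines of Python. This is a single-process illustration only, not Hadoop code: the function names and the word-count task are assumptions chosen for the example, and the inner lists stand in for the partitions sent to separate mapper nodes.

```python
from collections import defaultdict

def map_words(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group mapper output so all values for a key reach the same reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(key, values):
    """Reducer: aggregate the values for one key into a single output."""
    return key, sum(values)

def run_job(partitions):
    # Each partition stands in for the slice of data sent to one mapper.
    mapped = [pair
              for partition in partitions
              for line in partition
              for pair in map_words(line)]
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())

partitions = [["big data big"], ["data refinery"]]
result = run_job(partitions)
print(result)
```

On a real cluster the shuffle step is performed by the framework between the map and reduce phases; only the mapper and reducer are user code.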

Page 8: Big Data and NoSQL for Database and BI Pros

MapReduce, in a Diagram

[Diagram: Input blocks feed a row of mappers, each emitting Output; that output is grouped by key (K1, K2, K3) and passed as Input to the reducers, each of which produces Output.]

Page 9: Big Data and NoSQL for Database and BI Pros

HDFS

• File system whose data gets distributed over commodity disks on commodity servers

• Data is replicated
• If one box goes down, no data lost
  – “Shared Nothing”
  – Except the name node
• BUT: Immutable
  – Files can only be written to once
  – So updates require drop + re-write (slow)
  – You can append though
  – Like a DVD/CD-ROM
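
Why replication means "no data lost" can be shown with a toy model. The sketch below is a simplification under stated assumptions: the node and block names are hypothetical, and the round-robin placement is a stand-in for HDFS's real (rack-aware) placement policy. The replication factor of 3 matches the HDFS default.

```python
def place_blocks(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = {nodes[(i + r) % len(nodes)]
                            for r in range(replication)}
    return placement

def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica elsewhere."""
    return all(replicas - {failed_node} for replicas in placement.values())

nodes = ["n1", "n2", "n3", "n4"]          # hypothetical data nodes
blocks = ["b0", "b1", "b2", "b3", "b4"]   # hypothetical file blocks
placement = place_blocks(blocks, nodes)
survives_any_single_failure = all(
    readable_after_failure(placement, node) for node in nodes
)
print(survives_any_single_failure)
```

The model covers data nodes only; it says nothing about the name node, which the slide calls out as the exception.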

Page 10: Big Data and NoSQL for Database and BI Pros

HBase

• A Wide-Column Store, NoSQL database
• Modeled after Google BigTable
• HBase tables are HDFS files
  – Therefore, Hadoop-compatible
• Hadoop often used with HBase
  – But you can use either without the other
• HDInsight (more on next slide) does not (yet) include HBase

Page 11: Big Data and NoSQL for Database and BI Pros

Microsoft HDInsight

• Developed with Hortonworks and incorporates Hortonworks Data Platform (HDP) for Windows
• Windows Azure HDInsight and Microsoft HDInsight Server
  – Single node preview runs on Windows client
• Includes ODBC Driver for Hive
  – And Excel add-in that uses it
• JavaScript MapReduce framework
• Contribute it all back to open source Apache Project

Page 12: Big Data and NoSQL for Database and BI Pros

Azure HDInsight Provisioning

• HDInsight preview now public, so…
• Go to Windows Azure portal
• Sign up for the public preview
• Select HDInsight from left navbar
• Click “+ NEW” button @ lower-left
• Specify cluster name, number of nodes, admin password, storage account
  – Credentials used for browser login, RDP and ODBC
  – During preview, you will be billed 50% of Azure compute rates for nodes in cluster. Will be 100% at GA.
• Click “CREATE HDINSIGHT CLUSTER”
• Wait for provisioning to complete
• Navigate to http://clustername.azurehdinsight.net

New!

Page 13: Big Data and NoSQL for Database and BI Pros

Azure HDInsight Provisioning

New!

Page 14: Big Data and NoSQL for Database and BI Pros

Submitting, Running and Monitoring Jobs

• Upload a JAR
• Use Streaming
  – Use other languages (i.e. other than Java) to write MapReduce code
  – Python is a popular option
  – Any executable works, even C# console apps
  – On HDInsight, JavaScript works too
  – Still uses a JAR file: streaming.jar
• Run at command line (passing JAR name and params) or use GUI
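
As an illustration of the Streaming option, here is a minimal Python sketch of a word-count mapper and reducer in the line-oriented, tab-separated style that Streaming executables use. The function names and sample input are assumptions for the example; the last two statements simulate locally the sort-by-key step that Hadoop performs between the two phases.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a Streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum the counts per word; input must arrive sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local simulation: sorting stands in for Hadoop's shuffle/sort phase.
mapped = sorted(mapper(["Big data", "big BI"]))
reduced = list(reducer(mapped))
print(reduced)
```

In an actual Streaming job each function would live in its own script that reads `sys.stdin` and prints to standard output, and the two scripts would be named in the streaming JAR's `-mapper` and `-reducer` options.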

Page 15: Big Data and NoSQL for Database and BI Pros

Amenities for Visual Studio/.NET

Hortonworks Data Platform for Windows

MRLib (NuGet Package)

LINQ to Hive

OdbcClient + Hive ODBC Driver

Deployment

Debugging

MR code in C#, HadoopJob, MapperBase, ReducerBase

Page 16: Big Data and NoSQL for Database and BI Pros

Running MapReduce Jobs

Page 17: Big Data and NoSQL for Database and BI Pros

The “Data-Refinery” Idea

• Use Hadoop to “on-board” unstructured data, then extract manageable subsets

• Load the subsets into conventional DW/BI servers and use familiar analytics tools to examine them

• This is the current rationalization of Hadoop + BI tools’ coexistence

• Will it stay this way?

Page 18: Big Data and NoSQL for Database and BI Pros

Hive

• Used by most BI products which connect to Hadoop

• Provides a SQL-like abstraction over Hadoop
  – Officially HiveQL, or HQL
• Works on own tables, but also on HBase
• Query generates MapReduce job, output of which becomes result set
• Microsoft has Hive ODBC driver
  – Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular Mode (only)
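
To give a feel for the "SQL-like abstraction," the sketch below runs a simple aggregate query through a Python DB-API connection. Several things here are assumptions: the `words` table and its column are hypothetical, and an in-memory SQLite database stands in for Hive so the example is runnable anywhere. The statement itself is written to be valid in both HiveQL and SQLite; against a real cluster the connection would instead come from an ODBC library using the Hive ODBC driver.

```python
import sqlite3

# Query text valid as HiveQL and as SQLite SQL; names are hypothetical.
TOP_WORDS = """
SELECT word, COUNT(*) AS occurrences
FROM words
GROUP BY word
ORDER BY occurrences DESC, word
LIMIT 2
"""

def run_query(connection, sql):
    """Run a query through any DB-API connection and return all rows."""
    cursor = connection.cursor()
    cursor.execute(sql)
    return cursor.fetchall()

# Stand-in for a Hive connection (assumption: SQLite used for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [("big",), ("data",), ("big",), ("hive",), ("data",), ("big",)],
)
rows = run_query(conn, TOP_WORDS)
print(rows)
```

The difference in Hive is what happens behind `execute`: the query is compiled into a MapReduce job, so it runs in batch rather than returning interactively.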

Page 20: Big Data and NoSQL for Database and BI Pros

HDInsight Data Sources

• Files in HDFS
• Azure Blob Storage (Azure HDInsight only)
  – Use asv:// URLs (“Azure Storage Vault”)
• Hive tables
• HBase?

Page 21: Big Data and NoSQL for Database and BI Pros

Just-in-time Schema

• When looking at unstructured data, schema is imposed at query time

• Schema is context specific
  – If scanning a book, are the values words, lines, or pages?
  – Are notes a single field, or is each word a value?
  – Are date and time two fields or one?
  – Are street, city, state, zip separate or one value?
  – Pig and Hive let you determine this at query time
  – So does the Map function in MapReduce code
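
A small Python sketch makes the point concrete: the same raw bytes yield different "values" depending on the schema applied when reading them. The sample text, the `|` delimiter, and the field names are all assumptions made up for the illustration.

```python
raw_text = "It was the best of times\nit was the worst of times"

def as_lines(text):
    """Schema A: each line is one value."""
    return text.splitlines()

def as_words(text):
    """Schema B: each word is one value."""
    return text.split()

raw_record = "2013-03-28 09:15:00|follow up on Hive demo"

def timestamp_as_one_field(record):
    """Date and time kept together; notes kept as a single field."""
    stamp, notes = record.split("|", 1)
    return {"timestamp": stamp, "notes": notes}

def date_and_time_as_two_fields(record):
    """Date and time split apart; each word of the notes is a value."""
    stamp, notes = record.split("|", 1)
    date, time = stamp.split(" ", 1)
    return {"date": date, "time": time, "words": notes.split()}

print(len(as_lines(raw_text)), len(as_words(raw_text)))
print(timestamp_as_one_field(raw_record))
print(date_and_time_as_two_fields(raw_record))
```

Nothing about the stored data changes between the two readings; only the function applied at read time does, which is the role the Map function plays in MapReduce code.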

Page 22: Big Data and NoSQL for Database and BI Pros

How Does MS BI Fit In?

• Excel, PowerPivot: can query via Hive ODBC driver
• Analysis Services (SSAS) Tabular Mode
  – Also compatible with Hive ODBC Driver
  – Multidimensional mode is not
• Power View
  – Works against PowerPivot and SSAS Tabular
• RDBMS + Parallel Data Warehouse (PDW)
  – Sqoop connectors
  – Columnstore Indexes (Enterprise Edition and PDW only)
• PDW: PolyBase

Page 23: Big Data and NoSQL for Database and BI Pros

Excel, PowerPivot

• Excel and PowerPivot use the BI Semantic Model (BISM), which can query Hadoop via Hive and its ODBC driver

• Excel also features “Data Explorer” (currently in Beta) which can query HDFS directly and insert the results into a BISM repository

• Excel BISM accommodates millions of rows through compression. Not petabyte scale, but sufficient to store and analyze output of Hadoop queries.

Page 24: Big Data and NoSQL for Database and BI Pros

PowerPivot, SSAS Tabular

• SQL Server Analysis Services Tabular mode is the enterprise server implementation of BISM

• Features partitioning and role-based security

• Can store billions of rows. So even better for Hadoop output analysis.

• Excel-based BISM repositories can be upsized to SSAS Tabular

Page 25: Big Data and NoSQL for Database and BI Pros

Querying Hadoop from Microsoft BI

Page 26: Big Data and NoSQL for Database and BI Pros

Sqoop

• Acronym for “SQL to Hadoop”
• Essentially a technology for moving data between data warehouses and Hadoop
• Command line utility; allows specification of source/target HDFS file and relational server, database and table
• Sqoop connectors available for SQL Server and PDW
• Sqoop generates MapReduce job to extract data from, or insert data into, HDFS
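
The command-line shape can be sketched by assembling the argument lists in Python. The host, database, table, and HDFS paths below are hypothetical placeholders; the options shown (`--connect`, `--table`, `--target-dir`, `--export-dir`, `--num-mappers`) are standard Sqoop options, and credential options are left out for brevity.

```python
import shlex

def sqoop_import_args(jdbc_url, table, target_dir, mappers=4):
    """Arguments to copy a relational table into an HDFS directory."""
    return ["sqoop", "import",
            "--connect", jdbc_url,
            "--table", table,
            "--target-dir", target_dir,
            "--num-mappers", str(mappers)]

def sqoop_export_args(jdbc_url, table, export_dir):
    """Arguments to load HDFS files back into a relational table."""
    return ["sqoop", "export",
            "--connect", jdbc_url,
            "--table", table,
            "--export-dir", export_dir]

# Hypothetical server, database, table and paths, for illustration only.
url = "jdbc:sqlserver://dwserver:1433;databaseName=Sales"
import_cmd = sqoop_import_args(url, "Orders", "/data/orders")
export_cmd = sqoop_export_args(url, "OrderSummary", "/data/order_summary")

print(shlex.join(import_cmd))
print(shlex.join(export_cmd))
```

The lists could be handed to `subprocess.run` on a machine with Sqoop installed; the number of mappers controls how many parallel tasks the generated MapReduce job uses.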

Page 27: Big Data and NoSQL for Database and BI Pros

PDW, PolyBase

• SQL Server Parallel Data Warehouse (PDW) is a Massively Parallel Processing (MPP) data warehouse appliance version of SQL Server
• MPP manages a grid of relational database servers for divide-and-conquer processing of large data sets.
• PDW v2 includes “PolyBase,” a component which allows PDW to query data in Hadoop directly.
  – Bypasses MapReduce; addresses data nodes directly and orchestrates parallelism itself
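
The divide-and-conquer idea behind MPP can be illustrated generically in Python. This is not PDW's actual mechanism: the rows, the column names, and the modulo-based distribution are assumptions for the sketch. Rows are spread across nodes, each node aggregates its own share, and a final step merges the partial results.

```python
from collections import Counter

def distribute(rows, node_count):
    """Assign each row to a node; modulo on the key stands in for a hash."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[row["customer_id"] % node_count].append(row)
    return nodes

def local_totals(rows):
    """Work done independently on one node: total amount per region."""
    totals = Counter()
    for row in rows:
        totals[row["region"]] += row["amount"]
    return totals

def combine(partials):
    """Final step: merge the per-node partial results."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return dict(merged)

rows = [
    {"customer_id": 1, "region": "East", "amount": 100},
    {"customer_id": 2, "region": "West", "amount": 50},
    {"customer_id": 3, "region": "East", "amount": 25},
    {"customer_id": 4, "region": "West", "amount": 75},
]
totals = combine(local_totals(node) for node in distribute(rows, node_count=3))
print(totals)
```

The answer is the same however many nodes are used; only the amount of work each node does changes.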

Page 28: Big Data and NoSQL for Database and BI Pros

PolyBase Versus Hive, Sqoop

• Hive and Sqoop generate MapReduce jobs, and work in batch mode

• PolyBase addresses HDFS data itself
• This is true SQL over Hadoop.
• Competitors:
  – Cloudera Impala
  – Teradata Aster SQL-H
  – EMC/Greenplum Pivotal HD
  – Hadapt

Page 29: Big Data and NoSQL for Database and BI Pros

Usability Impact

• PowerPivot makes analysis much easier, self-service

• Power View is great for discovery and visualization; also self-service

• Combine with the Hive ODBC driver and suddenly Hadoop is accessible to business users

• Caveats
  – Someone has to write the HiveQL
  – Can query Big Data, but must have smaller result

Page 30: Big Data and NoSQL for Database and BI Pros

Resources

• Big On Data blog
  – http://www.zdnet.com/blog/big-data
• Apache Hadoop home page
  – http://hadoop.apache.org/
• Hive & Pig home pages
  – http://hive.apache.org/
  – http://pig.apache.org/
• Hadoop on Azure home page
  – https://www.hadooponazure.com/
• SQL Server 2012 Big Data
  – http://bit.ly/sql2012bigdata