Hadoop for DBAs. Michael Naumov, DB Expert, [email protected]


Jan 27, 2015


sqlserver.co.il

Big Data and New Challenges for DBAs (Michael Naumov, LivePerson)
Hadoop has become a popular platform for managing large datasets of structured and unstructured data. It does not replace existing infrastructures, but instead augments them. Most companies will still use relational databases for transactional processing and low-latency queries, but can benefit from Hadoop for reporting, machine learning or ETL. This session will cover:
What is Hadoop and why do I care?
What do people do with Hadoop?
How can SQL Server DBAs add Hadoop to their architecture?
Transcript
Page 1

Hadoop for DBAs

Michael Naumov, DB Expert, [email protected]

Page 2

LP Facts

8500+ customers

450M unique visitors per month

1.3B visits per month

60TB of new data every month

Page 3

LP Facts – Databases in LivePerson

• Oracle – B2B
• SQL Server – B2C
• Hadoop – Raw Data
• Vertica – Forecasting / BI
• Cassandra – Application HA
• MySQL – Segmentation
• MySQL NDB – ETL
• MongoDB – Predictive Targeting

Page 4

Hadoop in LivePerson

• 2TB of data streamed into Hadoop each day
• 100+ DataNodes serving our data needs
• DataNodes have 36GB RAM, 12 x 1TB SATA disks, and 2 quad-core CPUs
• Dozens (and growing) of daily MR jobs
• 5 different projects (and growing) based on the Hadoop ecosystem

Page 5

Hadoop

What is Hadoop?

• An open source project from Apache
• Able to store and process large amounts of data, including not only structured data but also complex, unstructured data
• Hadoop is not actually a single product but a collection of several components
• Runs on commodity hardware – low software and hardware costs
• Shared-nothing machines – scalability

Page 6

Example Comparison: RDBMS vs. Hadoop

                      Traditional RDBMS         Hadoop
Data Size             Gigabytes                 Petabytes
Access                Interactive and Batch     Batch – NOT Interactive
Updates               Read / Write many times   Write once, Read many times
Structure             Static Schema             Dynamic Schema
Scaling               Nonlinear                 Linear
Query Response Time   Can be near immediate     Has latency (due to batch processing)

Page 7

Hadoop Distributed File System - HDFS

HDFS has three types of nodes:

• NameNode (master node) – distributes files in the cluster; responsible for replication between the DataNodes and for file block locations
• DataNodes – responsible for the actual file storage; serve file data to clients
• BackupNode (version 0.23 and up) – a backup of the NameNode
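The division of labor above – NameNode tracks block locations, DataNodes hold the blocks – can be illustrated with a small sketch. The block size, replication factor, round-robin placement, and node names here are simplifying assumptions for illustration, not HDFS's actual placement policy:

```python
# Illustrative sketch: how HDFS splits a file into fixed-size blocks
# and replicates each block across several DataNodes.
BLOCK_SIZE = 128  # MB (HDFS default in recent versions; older versions used 64)
REPLICATION = 3   # default replication factor

def split_into_blocks(file_size_mb):
    """Return the sizes of the blocks a file of file_size_mb occupies."""
    full, rest = divmod(file_size_mb, BLOCK_SIZE)
    return [BLOCK_SIZE] * full + ([rest] if rest else [])

def place_blocks(blocks, datanodes):
    """Toy round-robin placement: each block goes to REPLICATION distinct nodes."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [datanodes[(i + r) % len(datanodes)]
                        for r in range(REPLICATION)]
    return placement

blocks = split_into_blocks(300)           # a 300 MB file
nodes = ["dn1", "dn2", "dn3", "dn4"]
print(blocks)                             # [128, 128, 44]
print(place_blocks(blocks, nodes)[0])     # block 0 lives on 3 distinct DataNodes
```

Losing one DataNode therefore loses at most one replica of any block, which is why the NameNode can re-replicate and keep the data available.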

Page 8

Hadoop Ecosystem

• MapReduce – framework for writing/executing algorithms
• HBase – distributed, versioned, column-oriented database
• Hive – SQL-like interface for large datasets stored in HDFS
• Pig – consists of a high-level language (Pig Latin) for expressing data analysis programs
• Sqoop – ("SQL-to-Hadoop") a straightforward tool, able to import/export tables or databases to/from HDFS/Hive/HBase

Page 9

MapReduce

What is MapReduce?

• Runs programs (jobs) across many computers
• Protects against single-server failure by re-running failed steps
• MR jobs can be written in Java, C, Python, Ruby, etc.
• Users only write Map and Reduce functions
• MAP – takes a large problem and divides it into sub-problems; performs the same function on all sub-problems
• REDUCE – combines the output from all sub-problems

Example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input myInputDirs \
  -output myOutputDir \
  -mapper /bin/cat \
  -reducer /bin/wc
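The Map / shuffle / Reduce pipeline behind that streaming command can be sketched in pure Python as a word count. In real Hadoop Streaming the two functions would be separate scripts reading stdin and emitting "key\tvalue" lines; here the framework's group-by-key shuffle is simulated in-process:

```python
# Minimal word-count sketch of the MapReduce model.
from collections import defaultdict

def map_phase(line):
    """MAP: split one input line into (word, 1) pairs."""
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    """REDUCE: combine all counts emitted for one word."""
    return word, sum(counts)

def run_job(lines):
    # Map: every line is processed independently (this is what parallelizes).
    pairs = [kv for line in lines for kv in map_phase(line)]
    # Shuffle: group values by key, as the framework does between the phases.
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    # Reduce: one call per distinct key.
    return dict(reduce_phase(w, c) for w, c in groups.items())

print(run_job(["hello world", "hello hadoop"]))
# {'hello': 2, 'world': 1, 'hadoop': 1}
```

Because each map call sees only its own line and each reduce call only its own key, the framework is free to scatter both phases across the cluster.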

Page 10

HBase

What is HBase and why should you use it?

• Huge volumes of randomly accessed data
• There are no restrictions on the number of columns per row – it's dynamic
• Consider HBase when you're loading data by key, searching data by key (or range), serving data by key, querying data by key, or when storing data by row that doesn't conform well to a schema

HBase don'ts:

• It doesn't talk SQL, doesn't have an optimizer, and doesn't support transactions or joins. If you don't use any of these in your database application, then HBase could very well be the perfect fit.

Example:

create 'blogposts', 'post', 'image'                          --- create table
put 'blogposts', 'id1', 'post:title', 'Hello World'          --- insert value
put 'blogposts', 'id1', 'post:body', 'This is a blog post'   --- insert value
put 'blogposts', 'id1', 'image:header', 'image1.jpg'         --- insert value
get 'blogposts', 'id1'                                       --- select record
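HBase's data model – a row key mapping to dynamic "family:qualifier" columns – can be mirrored with nested dicts. This toy class only illustrates the data model behind the shell example above; it is not an HBase client, and the class name is invented for this sketch:

```python
# Toy model of HBase's storage layout: a table maps a row key to
# {"column_family:qualifier": value}. Column families are fixed at
# table creation; qualifiers within a family are fully dynamic.
from collections import defaultdict

class ToyHBaseTable:
    def __init__(self, *column_families):
        self.families = set(column_families)
        self.rows = defaultdict(dict)  # row_key -> {"cf:qual": value}

    def put(self, row_key, column, value):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError("unknown column family: " + family)
        self.rows[row_key][column] = value  # no schema check on qualifiers

    def get(self, row_key):
        return dict(self.rows.get(row_key, {}))

# Mirrors: create 'blogposts', 'post', 'image'  /  put ... / get 'blogposts', 'id1'
blogposts = ToyHBaseTable("post", "image")
blogposts.put("id1", "post:title", "Hello World")
blogposts.put("id1", "post:body", "This is a blog post")
blogposts.put("id1", "image:header", "image1.jpg")
print(blogposts.get("id1"))
```

Note how "id1" carries three columns across two families with no schema declared for them – exactly the "dynamic columns" point above.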

Page 11

Hive

What is Hive?

• Built on the MapReduce framework, so it generates M/R jobs behind the scenes
• Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for large datasets stored in HDFS/HBase
• Supports partitioning and partition swapping
• Good for random sampling

Example:

CREATE EXTERNAL TABLE vs_hdfs (
  site_id string,
  session_id string,
  time_stamp bigint,
  visitor_id bigint,
  row_unit string,
  evts string,
  biz string,
  plne string,
  dims string
)
PARTITIONED BY (site string, day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE
LOCATION '/home/data/';

select session_id,
       get_json_object(concat(tttt, "}"), '$.BY'),
       get_json_object(concat(tttt, "}"), '$.TEXT')
from (
  select session_id,
         concat("{", regexp_replace(event, "\\[\\{|\\}\\]", ""), "}") tttt
  from (
    select session_id, get_json_object(plne, '$.PLine.evts[*]') pln
    from vs_hdfs_v1
    where site='6964264' and day='20120201' and plne!='{}'
    limit 10
  ) t LATERAL VIEW explode(split(pln, "\\},\\{")) adTable AS event
) t2

Page 12

Pig

What is Pig?

• Data flow processing
• Uses the Pig Latin query language
• Highly parallel, in order to distribute data processing across many servers
• Combines multiple data sources (files, HBase, Hive)
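The slide's original example did not survive extraction; as a sketch, a typical Pig Latin dataflow (word count, with illustrative file and alias names) looks roughly like this:

```pig
-- Load raw lines, split them into words, group, and count.
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO 'wordcount_out';
```

Each statement defines a relation in the dataflow, and Pig compiles the whole script into MapReduce jobs, much as Hive does with SQL.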

Page 13

Sqoop

What is Sqoop?

• A command-line tool for moving data between HDFS and relational database systems
• You can download SQL Server drivers for Sqoop from Microsoft
• Imports data/query results from SQL Server to Hadoop
• Exports data from Hadoop to SQL Server
• It's like BCP

Example:

$ bin/sqoop import \
    --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' \
    --table lineitem --hive-import

$ bin/sqoop export \
    --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' \
    --table lineitem --export-dir /data/lineitemData

Page 14

Other projects in Hadoop

There are many other projects we didn't talk about:

• Chukwa
• Mahout
• Avro
• ZooKeeper
• Fuse
• Flume
• Oozie
• Hue
• Hiho
• …

Page 15

Hadoop and Microsoft

• Bring Hive data directly into Excel through the Microsoft Hive Add-in for Excel
• Build a PowerPivot/Power View model on top of Hive
• Instead of manually refreshing a Hive-based PowerPivot workbook on their desktop, users can use PowerPivot for SharePoint to schedule a data refresh that updates a central copy shared with others, without worrying about the time or resources it takes
• BI professionals can build a BI Semantic Model or Reporting Services reports on Hive in SQL Server Data Tools

Page 16

HOW DO I GET STARTED

• Microsoft
  • https://www.hadooponazure.com/
  • http://www.microsoft.com/download/en/details.aspx?id=27584 (Sqoop driver)
• Open Source: http://hadoop.apache.org
• Vendors
  • http://www.cloudera.com
  • http://hortonworks.com
  • http://mapr.com

Page 17

Demo

Page 18

Questions?