Indic threads pune12-comparing hadoop data storage

Comparing Hadoop Data Storage (HDFS, HBase, Hive and Pig)

Rakesh Jadhav, SAS

Agenda

• Hadoop Ecosystem

• HDFS

• HBase

• Hive

• Pig

Hadoop Ecosystem

Hadoop Ecosystem Components

• HDFS: Hadoop Distributed File System

• MapReduce: Hadoop Distributed Programming Paradigm

• HBase: Hadoop Column-Oriented Database for Random Access Read/Write of Smaller Data

• Hive: Hadoop Petabyte-Scalable Data Warehousing Infrastructure

• Pig: Hadoop Data Flow/Analysis Infrastructure

• ZooKeeper: Hadoop Co-ordination Service, Configuration Service Infrastructure

• Chukwa: Hadoop Monitoring Service

• Avro: Hadoop Data Serialization/De-Serialization Infrastructure

• Mahout: Hadoop Scalable Machine Learning Library

HDFS (Data Storage)

Design Features

• Failure Is the Norm

• Designed for Large Datasets Rather Than Small Ones

• Designed for Batch Processing Rather Than Interactive Use

• Supports Write Once, Read Many

• Provides Interfaces to Move Processing Closer to Data

HDFS: Application Areas

• Large Log Processing

• Web Search Indexing

HDFS: Limitations

• Small Files Problem

• Single Point of Failure (NameNode)

• No Random Access

• No In-Place Updates (Write Once Only)
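The small-size limitation is usually explained by NameNode memory: every file and every block is tracked as an in-memory object, so many small files cost far more metadata than a few large ones holding the same data. The following Python sketch shows the arithmetic only. The 150 bytes per object and the 64 MB block size are commonly quoted rules of thumb assumed here, not figures from the slides, and `namenode_bytes` is a hypothetical helper.

```python
import math

BYTES_PER_OBJECT = 150          # assumed rule of thumb, not a measured value
BLOCK_SIZE = 64 * 1024 * 1024   # assumed HDFS block size (64 MB)


def namenode_bytes(num_files: int, file_size: int) -> int:
    """Estimate NameNode memory for num_files files of file_size bytes each."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    # One metadata object per file plus one per block of that file.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


TOTAL = 10 * 1024**3            # the same 10 GB of data, stored two ways
SMALL_FILE = 100 * 1024         # 100 KB files
LARGE_FILE = 1024**3            # 1 GB files

small = namenode_bytes(TOTAL // SMALL_FILE, SMALL_FILE)
large = namenode_bytes(TOTAL // LARGE_FILE, LARGE_FILE)
print(f"small files: {small} bytes, large files: {large} bytes")
```

Under these assumptions the small-file layout needs roughly a thousand times more NameNode memory for the same amount of data.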

HBase (Data Storage)

Design Features

• Key-Value Store (Like a Map)

• Semi-Structured Data

• Column Family, Time Stamp

• Key = RowKey.ColumnFamily.ColumnName.TimeStamp

• De-normalized Data

• Faster Data Retrieval Using Column Families

• Static Column Families, Dynamic Columns

RDBMS v/s HBase: Example

RDBMS

ID  Name  Age  Birth-Place  Marital Status  Location  Weight  Employer
1   Sam   35   Mumbai       Married         Pune      76      XYZ
2   Bob   56   Chicago      Married         New York  79      PQR

HBase

Column Families: Personal Information, Other Information

Row Key  Column          Versions (Timestamp=Value)
1        Name            T1=Sam
1        Age             T2=35, T1=25
1        Birth-Place     T1=Mumbai
1        Marital Status  T2=Married, T1=Unmarried
1        Weight          T2=76, T1=65
1        Location        T2=Pune, T1=Mumbai
1        Employer        T1=XYZ
2        …               …
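The composite key (row key, column family, column name, timestamp) can be illustrated with a plain in-memory map. This is a minimal Python sketch of the data model, not the HBase client API. The `put` and `get` helpers are hypothetical, timestamps T1 and T2 are modeled as the integers 1 and 2, and the assignment of columns to the "personal" and "other" families is assumed for illustration.

```python
from typing import Dict, Optional, Tuple

# (row key, column family, column name, timestamp) -> value
CellKey = Tuple[str, str, str, int]
table: Dict[CellKey, str] = {}


def put(row: str, family: str, column: str, ts: int, value: str) -> None:
    """Store one version of a cell; older versions are kept, not overwritten."""
    table[(row, family, column, ts)] = value


def get(row: str, family: str, column: str, ts: Optional[int] = None) -> Optional[str]:
    """Return the version written at ts, or the newest version when ts is None."""
    versions = {
        key[3]: value
        for key, value in table.items()
        if key[:3] == (row, family, column)
    }
    if not versions:
        return None
    if ts is None:
        ts = max(versions)
    return versions.get(ts)


# Row 1 from the example, with two versions of Age and Location.
put("1", "personal", "Name", 1, "Sam")
put("1", "personal", "Age", 1, "25")
put("1", "personal", "Age", 2, "35")
put("1", "other", "Location", 1, "Mumbai")
put("1", "other", "Location", 2, "Pune")
put("1", "other", "Employer", 1, "XYZ")

print(get("1", "personal", "Age"))        # newest version
print(get("1", "personal", "Age", ts=1))  # older version
```

Because the timestamp is part of the key, an update adds a new cell version instead of replacing the old one, which is why the table keeps both Age values for row 1.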

HBase: Application Areas

• Applications That Need to Store/Access/Search by Key

• Applications That Need Fast Random Access/Update to Scalable Structured Data

• Applications Needing a Flexible Table Schema

• Applications Needing Range-Search Capabilities Supported by Key Ordering
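Range search works because rows are kept sorted by key, so a scan only has to find the start of the range and read forward. The following Python sketch shows the idea with a sorted list and binary search. `SortedStore` and the `emp-` keys are hypothetical, and the half-open range (start inclusive, stop exclusive) mirrors how HBase scans are usually described.

```python
import bisect
from typing import Dict, List, Tuple


class SortedStore:
    """Toy key-value store that keeps its keys in sorted order."""

    def __init__(self) -> None:
        self._keys: List[str] = []
        self._values: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)  # insert while preserving order
        self._values[key] = value

    def scan(self, start: str, stop: str) -> List[Tuple[str, str]]:
        """Return all (key, value) pairs with start <= key < stop."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


store = SortedStore()
for key, name in [("emp-0003", "Carol"), ("emp-0001", "Sam"),
                  ("emp-0004", "Dave"), ("emp-0002", "Bob")]:
    store.put(key, name)

print(store.scan("emp-0002", "emp-0004"))
```

The same property is why row key design matters: keys that sort together are read together.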

HBase: Limitations

• Expensive Full Row Read

• No Secondary Keys

• No SQL Support

• Not Efficient for Big Cell Values

Hive (Data Access)

Design Features

• Scalable data warehouse on top of Hadoop, developed by Facebook

• SQL-like query language (HiveQL)

• Limited JDBC support

• Support for rich data types

• Ability to insert custom map-reduce jobs

Hive: Application Areas

• Ad hoc analysis on huge structured data, with no low-latency requirement

• Log processing

• Text mining

• Document indexing

• Customer-facing business intelligence (Google Analytics)

• Predictive modeling, hypothesis testing

Hive: Limitations

• No Support to Update Data

• Only Bulk Load Support

• Not Efficient for Small Data

Hive: Example

• CREATE TABLE employee (id BIGINT, name STRING, age INT…) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

• LOAD DATA LOCAL INPATH '/sas/employee.txt' OVERWRITE INTO TABLE employee;

• INSERT OVERWRITE TABLE oldest_employee SELECT * FROM employee SORT BY age DESC LIMIT 100;

Pig (Data Access)

• Pig Latin: a high-level data flow language

• Client side library, no server side deployment needed.

• Batch processing large unstructured data

• Procedural language

• Runtime schema creation, checkpointing, split pipeline support

• Custom code support

• Rich data types

• Support for Joins

Pig: Application Areas

• Extract Transform Load (ETL)• Unstructured Data Analysis

Pig: Limitations

• Not efficient for processing small datasets

Pig: Example

Load employee data from a text file, filter it by age and joining year, group it by joining year, and report the maximum age in each group.

-- Load the file (tab-delimited by default) with a declared schema.
records = LOAD 'sas/input/files/employee.txt'
    AS (joiningYear:int, employeeId:int, age:int);
-- Keep employees older than 30 who joined between 2000 and 2012.
filtered_records = FILTER records BY age > 30 AND
    (joiningYear >= 2000 AND joiningYear <= 2012);
-- Group by joining year and compute the maximum age per group.
grouped_records = GROUP filtered_records BY joiningYear;
max_age = FOREACH grouped_records GENERATE group,
    MAX(filtered_records.age);
DUMP max_age;
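For readers new to Pig Latin, the same data flow (load, filter, group, aggregate) can be sketched in plain Python. This is only an illustration of what each step computes, not how Pig executes it. The sample records are invented, and the filter assumes the intended condition is an age above 30 and a joining year between 2000 and 2012.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (joiningYear, employeeId, age); invented sample data for illustration
records: List[Tuple[int, int, int]] = [
    (2001, 1, 45),
    (2001, 2, 52),
    (2005, 3, 28),
    (2005, 4, 39),
    (1998, 5, 60),
]

# FILTER: keep employees older than 30 who joined between 2000 and 2012.
filtered_records = [
    r for r in records if r[2] > 30 and 2000 <= r[0] <= 2012
]

# GROUP ... BY joiningYear: collect the ages of each joining year.
grouped_records: Dict[int, List[int]] = defaultdict(list)
for joining_year, _employee_id, age in filtered_records:
    grouped_records[joining_year].append(age)

# FOREACH ... GENERATE group, MAX(age): one result per joining year.
max_age = {year: max(ages) for year, ages in grouped_records.items()}

for year, age in sorted(max_age.items()):
    print(year, age)
```

Each Pig statement maps to one step here, which is the sense in which Pig Latin is procedural: the script spells out the pipeline stage by stage.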

Conclusion

Organizations

• Revisit data strategy

• Evaluate Hadoop Ecosystem

• Build economical, scalable solutions for Big Data problems



Thank You
