Comparing Hadoop Data Storage (HDFS, HBase, Hive and Pig) Rakesh Jadhav SAS
May 15, 2015
Agenda
• Hadoop Ecosystem
• HDFS
• HBase
• Hive
• Pig
Hadoop Ecosystem
Hadoop Ecosystem Components
• HDFS: Hadoop Distributed File System
• MapReduce: Hadoop distributed programming paradigm
• HBase: Hadoop column-oriented database for random-access read/write of smaller data
• Hive: Hadoop petabyte-scalable data warehousing infrastructure
• Pig: Hadoop data flow/analysis infrastructure
• ZooKeeper: Hadoop coordination and configuration service infrastructure
• Chukwa: Hadoop monitoring service
• Avro: Hadoop data serialization/deserialization infrastructure
• Mahout: Hadoop scalable machine learning library
HDFS (Data Storage)
Design Features
• Failure is the norm
• Designed for large datasets rather than small ones
• Designed for batch processing rather than interactive use
• Supports write-once, read-many access
• Provides interfaces to move processing closer to the data
HDFS
APPLICATION AREAS
• Large log processing
• Web search indexing
LIMITATIONS
• Small files problem (inefficient for many small files)
• Single point of failure (the NameNode)
• No random access
• No in-place update support (files are written once)
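The small files limitation can be illustrated with a rough sketch. It assumes a 128 MB block size (configurable in practice) and assumes the NameNode tracks roughly one object per file plus one per block; the function names are hypothetical and the counts are only indicative.

```python
import math

# Assumed block size; 128 MB is a common default, but it is configurable.
BLOCK_SIZE_BYTES = 128 * 1024 * 1024


def block_count(file_size_bytes, block_size=BLOCK_SIZE_BYTES):
    """Number of blocks needed to store one file (an empty file needs none)."""
    return math.ceil(file_size_bytes / block_size)


def namespace_objects(file_sizes, block_size=BLOCK_SIZE_BYTES):
    """Rough count of tracked objects: one per file plus one per block."""
    return sum(1 + block_count(size, block_size) for size in file_sizes)


one_large_file = [1024 ** 3]               # a single 1 GiB file
many_small_files = [1024 ** 2] * 1024      # the same 1 GiB as 1024 files of 1 MiB

print(namespace_objects(one_large_file))    # 1 file + 8 blocks = 9
print(namespace_objects(many_small_files))  # 1024 files + 1024 blocks = 2048
```

The same volume of data costs far more metadata when it arrives as many small files, which is why HDFS favors large datasets.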
HBase (Data Storage)
Design Features
• Key-value store (like a map)
• Semi-structured data
• Column families and timestamps
• Key = RowKey.ColumnFamily.ColumnName.TimeStamp
• De-normalized data
• Faster data retrieval using column families
• Static column families, dynamic columns
RDBMS v/s HBase: Example
RDBMS
ID | Name | Age | Birth-Place | Marital Status | Location | Weight | Employer
1  | Sam  | 35  | Mumbai      | Married        | Pune     | 76     | XYZ
2  | Bob  | 56  | Chicago     | Married        | New York | 79     | PQR
HBase
Row Key | Personal Information (Column Family) | Other Information (Column Family)
1       | Name: T1=Sam                         | Location: T2=Pune
        | Age: T2=35                           | Location: T1=Mumbai
        | Age: T1=25                           | Employer: T1=XYZ
        | Birth-Place: T1=Mumbai               |
        | Marital Status: T2=Married           |
        | Marital Status: T1=Unmarried         |
        | Weight: T2=76                        |
        | Weight: T1=65                        |
2       | …                                    | …
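One way to picture the HBase side of this example is as a nested map: row key, then column family, then column, then timestamp. The sketch below is a hypothetical in-memory model, not the HBase client API. It assumes T2 is newer than T1 and that Weight belongs to the Personal Information family.

```python
# Stand-in timestamps from the example; T2 is assumed to be the newer one.
T1, T2 = 1, 2

# row key -> column family -> column -> {timestamp: value}
table = {
    "1": {
        "Personal Information": {
            "Name": {T1: "Sam"},
            "Age": {T1: 25, T2: 35},
            "Birth-Place": {T1: "Mumbai"},
            "Marital Status": {T1: "Unmarried", T2: "Married"},
            "Weight": {T1: 65, T2: 76},
        },
        "Other Information": {
            "Location": {T1: "Mumbai", T2: "Pune"},
            "Employer": {T1: "XYZ"},
        },
    }
}


def get_cell(table, row_key, family, column, timestamp=None):
    """Return one cell version; the newest version when no timestamp is given."""
    versions = table[row_key][family][column]
    if timestamp is None:
        timestamp = max(versions)
    return versions[timestamp]


print(get_cell(table, "1", "Personal Information", "Age"))             # 35
print(get_cell(table, "1", "Other Information", "Location", T1))       # Mumbai
```

Each lookup uses the full key RowKey.ColumnFamily.ColumnName.TimeStamp, and older versions stay readable alongside the current value.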
HBase: Application Areas
• Applications that need to store, access, and search data by key
• Applications that need fast random access and updates to scalable structured data
• Applications needing a flexible table schema
• Applications needing range-search capabilities supported by key ordering
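Range search follows from keeping row keys sorted: a scan seeks to the start key and reads forward until the stop key. The toy store below is a hypothetical sketch of that idea, not HBase code; the class name and the sample keys are made up, and the start-inclusive, stop-exclusive bounds are a choice made here.

```python
from bisect import bisect_left


class SortedKeyStore:
    """Hypothetical toy store that keeps its row keys in sorted order."""

    def __init__(self):
        self._keys = []   # row keys, always kept sorted
        self._rows = {}   # row key -> value

    def put(self, key, value):
        if key not in self._rows:
            self._keys.insert(bisect_left(self._keys, key), key)
        self._rows[key] = value

    def scan(self, start_key, stop_key):
        """Yield (key, value) pairs with start_key <= key < stop_key, in key order."""
        i = bisect_left(self._keys, start_key)
        while i < len(self._keys) and self._keys[i] < stop_key:
            yield self._keys[i], self._rows[self._keys[i]]
            i += 1


store = SortedKeyStore()
for key in ["user#0042", "user#0007", "order#0001", "user#0100"]:
    store.put(key, {"id": key})

print([k for k, _ in store.scan("user#0000", "user#0050")])
# ['user#0007', 'user#0042']
```

Because keys compare as strings, the row key design (prefixes, zero padding) determines which ranges can be scanned efficiently.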
HBase: Limitations
• Expensive full-row reads
• No secondary keys
• No SQL support
• Not efficient for big cell values
Hive (Data Access)
Design Features
• Scalable data warehouse on top of Hadoop, developed by Facebook
• SQL-like query language (HiveQL)
• Limited JDBC support
• Support for rich data types
• Ability to insert custom map-reduce jobs
Hive: Application Areas
• Ad hoc analysis on huge structured data where low latency is not required
• Log processing
• Text mining
• Document indexing
• Customer-facing business intelligence (Google Analytics)
• Predictive modeling, hypothesis testing
Hive: Limitations
• No support for updating data
• Only bulk load support
• Not efficient for small data
Hive: Example
• create table employee (id bigint, name string, age int…) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
• LOAD DATA LOCAL INPATH '/sas/employee.txt' OVERWRITE INTO TABLE employee;
• INSERT OVERWRITE TABLE oldest_employee SELECT * FROM employee SORT BY age DESC LIMIT 100;
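As a rough in-memory analogue of the last statement, the sketch below picks the oldest employees from a small list. The rows are hypothetical sample data (the name Ann is made up), the real table would be loaded from the text file, and Hive would run this as distributed jobs rather than a local sort.

```python
# Hypothetical sample rows standing in for the employee table.
employees = [
    {"id": 1, "name": "Sam", "age": 35},
    {"id": 2, "name": "Bob", "age": 56},
    {"id": 3, "name": "Ann", "age": 41},
]


def oldest(rows, limit=100):
    """Return up to `limit` rows, ordered by age from oldest to youngest."""
    return sorted(rows, key=lambda row: row["age"], reverse=True)[:limit]


oldest_employee = oldest(employees)
print([row["name"] for row in oldest_employee])  # ['Bob', 'Ann', 'Sam']
```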
Pig (Data Access)
• Pig Latin: a high-level data flow language
• Client-side library; no server-side deployment needed
• Batch processing of large unstructured data
• Procedural language
• Runtime schema creation, checkpointing, split-pipeline support
• Custom code support
• Rich data types
• Support for joins
Pig: Application Areas
• Extract, Transform, Load (ETL)
• Unstructured data analysis
Pig: Limitations
• Not efficient for processing small datasets
Pig: Example
Load employee data from a text file, filter it by age and joining year, group it by joining year, and report the maximum age in each group.
-- Load tab-delimited records with a declared schema
records = LOAD 'sas/input/files/employee.txt'
    AS (joiningYear:int, employeeId:int, age:int);
-- Keep employees older than 30 who joined between 2000 and 2012
filtered_records = FILTER records BY age > 30 AND
    (joiningYear >= 2000 AND joiningYear <= 2012);
-- Group by joining year and compute the maximum age per group
grouped_records = GROUP filtered_records BY joiningYear;
max_age = FOREACH grouped_records GENERATE group,
    MAX(filtered_records.age);
DUMP max_age;
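The same data flow can be traced on a handful of rows in plain Python. This is only an analogue for following the steps, not how Pig executes them; the sample rows are hypothetical, and the filter assumes the intent is a joining year between 2000 and 2012 inclusive.

```python
from collections import defaultdict

# Hypothetical sample rows: (joiningYear, employeeId, age)
records = [
    (1998, 1, 52),
    (2005, 2, 35),
    (2005, 3, 44),
    (2010, 4, 28),
    (2012, 5, 31),
]

# FILTER: age above 30 and joining year within 2000..2012 (assumed intent)
filtered_records = [r for r in records if r[2] > 30 and 2000 <= r[0] <= 2012]

# GROUP: collect ages by joining year
grouped_records = defaultdict(list)
for joining_year, employee_id, age in filtered_records:
    grouped_records[joining_year].append(age)

# FOREACH ... GENERATE group, MAX(age)
max_age = {year: max(ages) for year, ages in grouped_records.items()}

print(sorted(max_age.items()))  # [(2005, 44), (2012, 31)]
```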
Conclusion
Organizations should:
• Revisit their data strategy
• Evaluate the Hadoop ecosystem
• Build economical, scalable solutions for Big Data problems
Thank You