Hadoop, HBase & MapReduce
Post on 07-May-2015
What is Big Data ?
● How big is “Big Data”?
● Is 30-40 terabytes big data?
● …
● Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools
● Today: terabytes, petabytes, exabytes
● Tomorrow?
Enterprises & Big Data
● Most companies are currently using traditional tools to store data
● Big data: The next frontier for innovation, competition, and productivity
● The use of big data will become a key basis of competition
● Organisations across the globe need to take the rising importance of big data more seriously
Hadoop is an ecosystem, not a single product.
When you deal with BigData, the data center is your computer.
• A Brief History of Hadoop
• Contributors and Development
• What is Hadoop
• Why Hadoop
• Hadoop Ecosystem
A Brief History of Hadoop
• Hadoop has its origins in Apache Nutch
• Nutch was started in 2002
• Challenge: how to handle the billions of pages on the Web?
• 2003: Google published the GFS (Google File System) paper
• 2004: NDFS (Nutch Distributed File System)
• 2004: Google published the MapReduce paper
• 2005: Nutch developers started implementing MapReduce
Contributors and Development
Lifetime patches contributed to all Hadoop-related projects: community members by current employer (* source: JIRA tickets)
* Source: Kerberos Konference (Yahoo), 2010
Development in ASF/Hadoop
● Resources
● Mailing lists
● Wiki pages, blogs
● Issue tracking – JIRA
● Version control – SVN, Git
What is Hadoop
• Open-source project administered by the ASF
• Data-intensive storage and massively parallel processing (MPP)
• Enables applications to work with thousands of nodes and petabytes of data
• Suitable for applications with large data sets
What is Hadoop ?
• Scalable
• Fault-tolerant
• Reliable data storage using the Hadoop Distributed File System (HDFS)
• High-performance parallel data processing using a technique called MapReduce
What is Hadoop ?
• Hadoop is becoming the de facto standard for large-scale data processing
• Becoming more than just MapReduce
• The ecosystem is growing rapidly, with lots of great tools around it
What is Hadoop ?
Yahoo Hadoop Cluster
SGI Hadoop Cluster
38,000 machines distributed across 20 different clusters (source: Yahoo, 2010)
50,000 machines as of January 2012 (source: http://www.computerworlduk.com/in-depth/applications/3329092/hadoop-could-save-you-money-over-a-traditional-rdbms/)
Why Hadoop?
• Can process Big Data (petabytes and more)
• Unlimited data storage and analysis
• No licence cost – Apache License 2.0
• Can be built out of commodity hardware
• IT cost reduction
• Results
• Be one step ahead of the competition
• Stay there
Is Hadoop an alternative to RDBMSs?
• At the moment, Apache Hadoop is not a substitute for a database
• No relations
• Key-value pairs
• Big Data
• unstructured (text)
• semi-structured (sequence/binary files)
• structured (HBase ≈ Google BigTable)
• Works fine together with RDBMSs
Hadoop Ecosystem
[Diagram: the Hadoop stack – HDFS (Hadoop Distributed File System) at the base; HBase (key-value store) and MapReduce (job scheduling/execution system) on top of it; Pig (data flow) and Hive (SQL) above; BI reporting and ETL tools at the top; Sqoop connects the stack to external RDBMSs]
Hadoop Ecosystem
Important components of Hadoop
• HDFS: a distributed, fault-tolerant file system
• MapReduce: a parallel data processing framework
• Hive: a SQL-like query framework
• Pig: a data-flow scripting tool
• HBase: real-time read/write access to your Big Data
Hadoop Ecosystem
Hadoop is a distributed data computing platform
HDFS
NameNode/DataNode interaction in HDFS. The NameNode keeps track of the file metadata: which files are in the system and how each file is broken down into blocks. The DataNodes provide backup store of the blocks and constantly report to the NameNode to keep the metadata current.
Hadoop Cluster
Writing Files To HDFS
• Client consults the NameNode
• Client writes the block directly to one DataNode
• DataNode replicates the block
• Cycle repeats for the next block
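In code, the write path might look like this toy simulation (class and method names here are made up for illustration; this is not the real HDFS API):

```python
# Toy simulation of the HDFS write path (illustrative only, not the real API).
REPLICATION = 3

class NameNode:
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.metadata = {}   # filename -> list of (block_id, [DataNodes])

    def allocate_block(self, filename, block_id):
        # The NameNode picks target DataNodes for the new block
        # (the real policy is rack-aware; here we just take the first N).
        targets = self.datanodes[:REPLICATION]
        self.metadata.setdefault(filename, []).append((block_id, targets))
        return targets

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data, pipeline):
        # Store the block, then forward it down the replication pipeline.
        self.blocks[block_id] = data
        if pipeline:
            pipeline[0].write(block_id, data, pipeline[1:])

def client_write(namenode, filename, blocks):
    for block_id, data in enumerate(blocks):
        targets = namenode.allocate_block(filename, block_id)  # consult NameNode
        targets[0].write(block_id, data, targets[1:])          # write to one DataNode
        # ...cycle repeats for the next block

datanodes = [DataNode(f"DN{i}") for i in range(1, 5)]
nn = NameNode(datanodes)
client_write(nn, "file.txt", ["block-A", "block-B"])
print(datanodes[2].blocks)   # the third DataNode also holds replicas of both blocks
```

Note how the client only writes each block once; the DataNodes themselves replicate it down the pipeline.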
Reading Files From HDFS
• Client consults the NameNode
• Client receives a DataNode list for each block
• Client picks the first DataNode for each block
• Client reads the blocks sequentially
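A matching toy sketch of the read path (the metadata tables and names below are made up for illustration, not real HDFS structures):

```python
# Toy simulation of the HDFS read path (illustrative only, not the real API).

# The NameNode's metadata: for each block of the file, an ordered
# list of DataNodes holding a replica (closest one first).
block_locations = {
    "file.txt": [
        ("block-0", ["DN1", "DN5", "DN6"]),
        ("block-1", ["DN1", "DN2", "DN9"]),
    ],
}

# The DataNodes' local block stores.
datanode_blocks = {
    "DN1": {"block-0": b"hello ", "block-1": b"world"},
    "DN2": {"block-1": b"world"},
    "DN5": {"block-0": b"hello "},
}

def client_read(filename):
    data = b""
    # 1. Client consults the NameNode and receives a DataNode list per block.
    for block_id, replicas in block_locations[filename]:
        # 2. Client picks the first DataNode for each block...
        first = replicas[0]
        # 3. ...and reads the blocks sequentially.
        data += datanode_blocks[first][block_id]
    return data

print(client_read("file.txt"))   # b'hello world'
```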
Rack Awareness & Fault Tolerance
[Diagram: the NameNode's rack-aware metadata – Rack 1: DN1, DN2, DN3, DN5; Rack 5: DN5, DN6, DN7, DN8; … Rack N. File.txt → Blk A: DN1, DN5, DN6; Blk B: DN1, DN2, DN9; Blk C: DN5, DN9, DN10]
• Never lose all data if an entire rack fails
• In-rack traffic has higher bandwidth and lower latency
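HDFS's default policy places the second and third replicas together on one remote rack, so a whole-rack failure never takes out every copy. A simplified sketch (node and rack names are illustrative):

```python
# Sketch of HDFS's rack-aware replica placement (simplified; illustrative only):
# 1st replica on the writer's node, 2nd on a different rack,
# 3rd on the same rack as the 2nd but on a different node.

racks = {
    "rack1": ["DN1", "DN2", "DN3", "DN4"],
    "rack2": ["DN5", "DN6", "DN7", "DN8"],
}
rack_of = {dn: r for r, dns in racks.items() for dn in dns}

def place_replicas(writer):
    first = writer
    # Second replica: any node on a different rack than the first.
    other_rack = next(r for r in racks if r != rack_of[first])
    second = racks[other_rack][0]
    # Third replica: same rack as the second, but a different node.
    third = next(dn for dn in racks[other_rack] if dn != second)
    return [first, second, third]

print(place_replicas("DN1"))   # ['DN1', 'DN5', 'DN6']
# Losing all of rack1 still leaves two replicas alive on rack2.
```

This reproduces the Blk A placement from the diagram above (DN1, DN5, DN6): one local copy, two on a single remote rack.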
Cluster Health
MapReduce-Paradigm
• Simplified data processing on large clusters
• Splitting a big problem/data set into little pieces
• Key-value pairs
MapReduce-Batch Processing
• Phases
• Map
• Sort/Shuffle
• Reduce (Aggregation)
• Coordination
• JobTracker
• TaskTracker
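The three phases can be sketched as a word count in plain Python (a conceptual model of Map, Sort/Shuffle, and Reduce, not Hadoop's actual API):

```python
# Word count expressed as the three MapReduce phases (pure-Python sketch).
from itertools import groupby

def map_phase(lines):
    # MAP: emit a (word, 1) pair for every word.
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    # SORT/SHUFFLE: sort by key and group equal keys together.
    pairs = sorted(pairs)
    return [(key, [v for _, v in group])
            for key, group in groupby(pairs, key=lambda kv: kv[0])]

def reduce_phase(grouped):
    # REDUCE: aggregate the grouped values per key.
    return {key: sum(values) for key, values in grouped}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle_phase(map_phase(lines))))
# {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

In a real cluster the map tasks run on the DataNodes holding the input blocks, and the shuffle moves each key group to the node running its reduce task; here everything runs in one process.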
MapReduce-Map
[Diagram: map tasks running on DataNodes 1–3 each emit (key, 1) pairs]
MapReduce-Sort/Shuffle
[Diagram: the emitted (key, 1) pairs are sorted and shuffled so that all pairs with the same key end up on the same DataNode]
MapReduce-Reduce
[Diagram: reduce tasks on DataNodes 1–3 sum the grouped 1s into per-key (K, V) totals, e.g. 4, 3, 3, 2]
MapReduce-All Phases
[Diagram: all phases combined – MAP emits (key, 1) pairs, SORT/SHUFFLE groups them by key, REDUCE aggregates them into totals 4, 3, 3, 2]
MapReduce-Job & Task Tracker
JobTracker and TaskTracker interaction. After a client calls the JobTracker to begin a data processing job, the JobTracker partitions the work and assigns different map and reduce tasks to each TaskTracker in the cluster.
Summary of HDFS and MR
Hive
• Data warehousing package built on top of Hadoop
• It began its life at Facebook, processing large amounts of user and log data
• Hadoop subproject with many contributors
• Ad hoc queries, summarization, and data analysis on Hadoop-scale data
• Directly query data in different data formats (text/binary) and file formats (flat/sequence)
• HiveQL – a SQL-like query language
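A small HiveQL sketch in the spirit of the Pig and HBase examples elsewhere in this deck (the table, columns, and path below are made up for illustration):

```sql
-- Hypothetical table of web-server log lines.
CREATE TABLE logs (ts STRING, user_id INT, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/logs/2012-01-01.tsv' INTO TABLE logs;

-- Ad hoc summarization; Hive compiles this into MapReduce jobs.
SELECT url, COUNT(*) AS hits
FROM logs
GROUP BY url
ORDER BY hits DESC
LIMIT 10;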
Hive Components
[Diagram: the Hive CLI and a management web UI issue DDL, queries, and browsing requests; HiveQL goes through the Parser and Planner, then Execution runs MapReduce jobs over HDFS; table metadata lives in the MetaStore, exposed via the Thrift API (*Thrift: an interface definition language)]
Pig
• The language used to express data flows is called Pig Latin
• Pig Latin can be extended using UDFs (user-defined functions)
• Pig was originally developed at Yahoo! Research
• PigPen is an Eclipse plug-in that provides an environment for developing Pig programs
• Running Pig programs:
• Script: a file that contains Pig commands
• Grunt: an interactive shell
• Embedded: from Java
Pig
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray, temperature:int, quality:int);
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
grunt> DESCRIBE records;
records: {year: chararray, temperature: int, quality: int}
grunt> filtered_records = FILTER records BY temperature != 22;
grunt> DUMP filtered_records;
grunt> grouped_records = GROUP records BY year;
grunt> DUMP grouped_records;
(1949,{(1949,111,1),(1949,78,1)})
(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})
HBase
• Random, real-time read/write access to your Big Data
• Billions of rows × millions of columns
• Column-oriented store modeled after Google's BigTable
• Provides Bigtable-like capabilities on top of Hadoop and HDFS
• HBase is not a column-oriented database in the typical RDBMS sense, but utilizes an on-disk column storage format
HBase-Datamodel
• Think of tags. Values any length, no predefined names or widths
• Column names carry info (just like tags)
• (Table, RowKey, Family,Column, Timestamp) → Value
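The (Table, RowKey, Family:Column, Timestamp) → Value mapping above can be mimicked with a plain dictionary (a conceptual sketch of the data model, not the HBase client API):

```python
# The HBase data model as a plain mapping (conceptual sketch only):
# (table, row_key, "family:qualifier", timestamp) -> value
cells = {}

def put(table, row, column, timestamp, value):
    cells[(table, row, column, timestamp)] = value

def get_latest(table, row, column):
    # Each cell is versioned by timestamp; a read returns
    # the newest version by default.
    versions = {ts: v for (t, r, c, ts), v in cells.items()
                if (t, r, c) == (table, row, column)}
    return versions[max(versions)]

put("test", "row1", "cf:a", 100, "value11")
put("test", "row1", "cf:a", 200, "value12")   # same cell, newer version
print(get_latest("test", "row1", "cf:a"))     # value12
```

This mirrors the shell session on the next slide: two `put`s to the same cell, with the scan showing the newest version.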
Create a Sample Table
hbase(main):003:0> create 'test', 'cf'
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value11'
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value12'
hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
hbase(main):007:0> scan 'test'
ROW COLUMN+CELL
row1 column=cf:a, timestamp=1288380727188, value=value12
row2 column=cf:b, timestamp=1288380738440, value=value2
row3 column=cf:c, timestamp=1288380747365, value=value3
hbase(main):007:0> scan 'test', { VERSIONS => 3 }
ROW COLUMN+CELL
row1 column=cf:a, timestamp=1288380727188, value=value12
row1 column=cf:a, timestamp=1288380718000, value=value11
row2 column=cf:b, timestamp=1288380738440, value=value2
row3 column=cf:c, timestamp=1288380747365, value=value3
HBase-Architecture
• Splits
• Auto-Sharding
• Master
• Region Servers
• HFile
Splits & RegionServers
• Rows are grouped into regions and served by different servers
• A table is dynamically split into “regions”
• Each region contains values in [startKey, endKey)
• Regions are hosted on a RegionServer
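The [startKey, endKey) lookup can be sketched like this (region boundaries and server names are made up for illustration):

```python
# Sketch of how a row key maps to a region: each region covers
# [start_key, end_key) and is hosted on one RegionServer.
import bisect

# Regions of one table, sorted by start key (first region starts at "").
regions = [
    ("",  "g",    "regionserver-1"),   # keys in ["",  "g")
    ("g", "p",    "regionserver-2"),   # keys in ["g", "p")
    ("p", "\xff", "regionserver-3"),   # keys in ["p", ...)
]
start_keys = [r[0] for r in regions]

def region_for(row_key):
    # Find the last region whose start key is <= the row key.
    i = bisect.bisect_right(start_keys, row_key) - 1
    return regions[i][2]

print(region_for("apple"))   # regionserver-1
print(region_for("hbase"))   # regionserver-2
print(region_for("row1"))    # regionserver-3
```

When a region grows too large it is split in two (auto-sharding), and the new regions can move to other RegionServers; the lookup logic stays the same.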
HBase-Architecture
Other Components
• Flume
• Sqoop
Commercial Products
• Oracle Big Data Appliance
• Microsoft Azure + Excel + MapReduce
• Cloud computing: Amazon Elastic Compute Cloud
• IBM Hadoop-based InfoSphere BigInsights
• VMware Spring for Apache Hadoop
• Toad for Cloud Databases
• MapR, Cloudera, Hortonworks, Datameer
Thank You
Faruk Berksöz – fberksoz@gmail.com