Hadoop for Java Professionals

Nov 11, 2014

Edureka!

With the surge in Big Data, organizations have begun to implement Big Data technologies as part of their systems. This has led to a strong need to update existing skill sets with Hadoop. Java professionals are among those who need to add Hadoop skills.
Transcript
Page 1: Hadoop for Java Professionals

Slide 1

Hadoop for Java Professionals

View Hadoop Courses at : www.edureka.in/hadoop

Twitter @edurekaIN, Facebook /edurekaIN, use #AskEdureka for Questions

Page 2: Hadoop for Java Professionals

Objectives of this Session

• Un
• Big Data and Hadoop
• Why Hadoop?
• Job Trends: Hadoop and Java
• Hadoop ecosystem
• MapReduce Programming and Java
• User Defined Functions (UDF) in Pig and Hive
• HBase and Java

For queries during the session and class recording:
• Post on Twitter @edurekaIN: #askEdureka
• Post on Facebook /edurekaIN

Page 3: Hadoop for Java Professionals

Big Data

Lots of Data (Terabytes or Petabytes)

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

(Figure: Big Data word cloud with terms such as cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, processing, mobile.)

Page 4: Hadoop for Java Professionals

Unstructured Data is Exploding

2,500 exabytes of new information in 2012, with the internet as the primary driver.

“Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 zettabytes” this year

Page 5: Hadoop for Java Professionals

Big Data - Challenges

Increasing Data Volumes

New data sources and types

Email and documents

Social Media, Web Logs

Machine Device (Scientific)

Transactions, OLTP, OLAP

Page 6: Hadoop for Java Professionals

Job Trends: Hadoop and Java

Page 7: Hadoop for Java Professionals

Job Trends: Hadoop and Java

Page 8: Hadoop for Java Professionals

Jobs in Hadoop

Big Data has opened up the door to new job opportunities, to name a few:
• Hadoop Developer
• Hadoop Architect
• Hadoop Engineer
• Hadoop Application Developer
• Data Analyst
• Data Scientist
• Business Intelligence (BI) Architect
• Big Data Engineer

Page 9: Hadoop for Java Professionals

Hadoop for Java Professionals

Hadoop is red-hot as it:
• Allows distributed processing of large data sets across clusters of computers using a simple programming model.
• Has become the de facto standard for storing, processing, and analyzing hundreds of terabytes and petabytes of data.
• Is cheaper to use than traditional proprietary technologies such as those from Oracle and IBM. It can run on low-cost commodity hardware.
• Can handle all types of data from disparate systems, such as server logs, emails, sensor data, pictures, and videos.

Page 10: Hadoop for Java Professionals

Hadoop for Java Professionals (Contd.)

Hadoop is a natural career progression for Java professionals. It is a Java-based framework, written almost entirely in Java.

The combination of Hadoop and Java skills is the number one combination in demand among all Hadoop jobs.

Java skills come in handy while writing code for the following in Hadoop:
• MapReduce programming using Java
• User Defined Functions (UDFs) in Pig and Hive scripts of Hadoop applications
• Client applications in HBase

Page 11: Hadoop for Java Professionals

Hadoop for Big Data

Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.

It is an open-source data management framework with scale-out storage and distributed processing.

Page 12: Hadoop for Java Professionals

Hadoop and MapReduce

Hadoop is a system for large-scale data processing.

It has two main components:

HDFS – Hadoop Distributed File System (Storage)
• Highly fault-tolerant
• High-throughput access to application data
• Suitable for applications that have large data sets
• Natively redundant

MapReduce (Processing)
• Software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) in a reliable, fault-tolerant manner
• Splits a task across processors
• Works on key-value pairs

Page 13: Hadoop for Java Professionals

Important Hadoop Eco-System components

• Pig Latin (Data Analysis)
• Hive (DW System)
• HBase
• MapReduce Framework
• HDFS (Hadoop Distributed File System)

Page 14: Hadoop for Java Professionals

What is Map - Reduce?

• MapReduce is a programming model
• It is neither platform- nor language-specific
• Record-oriented data processing (key and value)
• Task distributed across multiple nodes
• Where possible, each node processes data stored on that node
• Consists of two phases: Map and Reduce
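
The two phases can be sketched in plain Java, without any Hadoop dependency. This is only an in-memory illustration of the key/value model, not how Hadoop itself runs a job: the class name MiniMapReduce, its method names, and the word-count task are assumptions chosen for the example, and everything runs on one machine instead of being distributed across nodes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of the map -> group -> reduce flow.
class MiniMapReduce {

    // Map phase: turn one input record (a line) into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: combine all values that share one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: run map on every record, group values by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {and=2, big=1, data=1, hadoop=2, java=1}
        System.out.println(run(Arrays.asList("big data and hadoop", "hadoop and java")));
    }
}
```

In a real cluster the grouping step is performed by the framework between the two phases, and the map calls run on the nodes that hold the data.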

Page 15: Hadoop for Java Professionals

What is Map - Reduce? (Contd.)

The process can be considered similar to a Unix pipeline:

cat /my/log | grep '\.html' | sort | uniq -c > /my/outfile

Here grep plays the role of MAP, sort is the SORT step, and uniq -c is the REDUCE.
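
The same pipeline can be approximated in plain Java with the streams API. This is a sketch for comparison only: the class name PipelineSketch, the method name, and the sample log lines are made up for the example, and the input is a small in-memory list rather than a file such as /my/log.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory equivalent of: grep '\.html' | sort | uniq -c
class PipelineSketch {

    static Map<String, Long> countHtmlLines(List<String> logLines) {
        return logLines.stream()
                // MAP: keep only the records of interest (the grep step)
                .filter(line -> line.contains(".html"))
                // SORT + REDUCE: group identical lines in sorted key order
                // and count each group (the sort | uniq -c steps)
                .collect(Collectors.groupingBy(
                        line -> line,
                        TreeMap::new,
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "/index.html", "/logo.png", "/about.html", "/index.html");
        // prints {/about.html=1, /index.html=2}
        System.out.println(countHtmlLines(log));
    }
}
```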

Page 16: Hadoop for Java Professionals

A Sample MapReduce program in Java

Page 17: Hadoop for Java Professionals

Problem – Data Processing

Page 18: Hadoop for Java Professionals

Problem - Data Processing

Input: huge raw XML files with unstructured data like reviews

Processing: MapReduce, with the data stored in HDFS

Output columns: Category, hash, url, +tive, -tive, total
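
A rough idea of the aggregation behind such an output can be sketched in plain Java. Nearly everything here is an assumption made for illustration: the slides do not show the XML layout, how a review is judged positive or negative, or what the hash column contains, so the sketch starts from already-parsed records (a hypothetical Review class), groups them by category and url, and leaves the hash column out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: count positive, negative and total reviews per (category, url).
class ReviewSummary {

    // Assumed shape of one parsed review; the real input format is not shown in the slides.
    static class Review {
        final String category;
        final String url;
        final boolean positive;

        Review(String category, String url, boolean positive) {
            this.category = category;
            this.url = url;
            this.positive = positive;
        }
    }

    // Running counts for one output row.
    static class Counts {
        int positive;
        int negative;

        int total() {
            return positive + negative;
        }
    }

    static Map<String, Counts> summarize(List<Review> reviews) {
        Map<String, Counts> byKey = new TreeMap<>();
        for (Review r : reviews) {
            // Map step: derive the grouping key from the record.
            String key = r.category + "\t" + r.url;
            // Reduce step: fold the record into the counts for its key.
            Counts c = byKey.computeIfAbsent(key, k -> new Counts());
            if (r.positive) {
                c.positive++;
            } else {
                c.negative++;
            }
        }
        return byKey;
    }

    public static void main(String[] args) {
        List<Review> reviews = Arrays.asList(
                new Review("books", "http://example.com/a", true),
                new Review("books", "http://example.com/a", false),
                new Review("books", "http://example.com/a", true),
                new Review("music", "http://example.com/b", false));
        // One tab-separated row per key: category, url, positive, negative, total
        for (Map.Entry<String, Counts> e : summarize(reviews).entrySet()) {
            Counts c = e.getValue();
            System.out.println(e.getKey() + "\t" + c.positive + "\t" + c.negative + "\t" + c.total());
        }
    }
}
```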

Page 19: Hadoop for Java Professionals

Other Applications of Java Skills in Hadoop – UDFs

Page 20: Hadoop for Java Professionals

User Defined Functions (UDFs) in Pig

Pig is a high-level, declarative data flow language.

It runs on top of Hadoop and makes it possible to create complex jobs that process large volumes of data quickly and efficiently.

It is similar to a SQL query in that the user specifies the “what” and leaves the “how” to the underlying processing engine.

Page 21: Hadoop for Java Professionals

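Pig Latin – Creating UDF

A program to create a UDF:

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

// Filter UDF: keeps a record only if its first field is one of the listed ages.
public class IsOfAge extends FilterFunc {
    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        // Reject missing or empty input.
        if (tuple == null || tuple.size() == 0) {
            return false;
        }
        try {
            Object object = tuple.get(0);
            if (object == null) {
                return false;
            }
            int i = (Integer) object;
            return i == 18 || i == 19 || i == 21 || i == 23 || i == 27;
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}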

Page 22: Hadoop for Java Professionals

Pig and UDF

How to call a UDF?

register myudf.jar;

X = filter A by IsOfAge(age);

Page 23: Hadoop for Java Professionals

Slide 23

Questions?

Buy Complete Course at: www.edureka.in/hadoop

Interested in learning “Big-Data & Hadoop”? Let us know by mailing us at [email protected]