Hadoop Jessica Miller March 17, 2017

Hadoop - Nc State Universitypost/jess/Hadoop.pdf · Beyond the Fundamentals Hadoop YARN : job prioritization, cluster management Apache Hive : data querying, summarization, analysis

May 30, 2020


Problems with Big Data

Storage Processing

https://www2.cisl.ucar.edu/resources/storage-and-file-systems/hpss

https://ncar.ucar.edu/community-resources/computational-resources


Welcome to Hadoop

Storage (HDFS)
■ Divides data
■ Stores across multiple nodes

Processing (MapReduce)
■ Dimension reduction
■ Parallel computing

http://www.forbes.com/sites/danwoods/2011/11/03/explaining-hadoop-to-your-ceo/2/#6d30f55d4b54


Hadoop Distributed File System

Scalability:

Locality: Multi-node cluster common servers shell + Java

Fail capability: Block of data multiple servers

Figure: ALL DATA divided into Block 1, Block 2, ..., Block n
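
The block diagram above can be illustrated with a small sketch. This is plain Python, not the HDFS API: the node names, the 32-byte block size, and the round-robin placement are assumptions made for illustration. Real HDFS uses far larger blocks, and placement is decided by the NameNode rather than by a simple rotation.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes (round-robin, an assumption)."""
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(placement: dict[int, list[str]], failed_node: str) -> bool:
    """True if every block still has at least one replica on a surviving node."""
    return all(any(node != failed_node for node in replicas)
               for replicas in placement.values())


# Hypothetical cluster and data, chosen only to keep the example small.
nodes = ["node-a", "node-b", "node-c", "node-d"]
data = b"ALL DATA " * 10  # 90 bytes
blocks = split_into_blocks(data, block_size=32)
placement = place_blocks(len(blocks), nodes, replication=3)

for block_id, replicas in placement.items():
    print(f"Block {block_id + 1}: {len(blocks[block_id])} bytes on {replicas}")
print("Readable if node-a fails:", readable_after_failure(placement, "node-a"))
```

Because each block lives on several servers, losing one server leaves every block readable, which is the "fail capability" the slide refers to.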


MapReduce

Mapping: Dataset → mapping task → output

Reduction: Map output → reduce task → output

Figure: Input 1, Input 2, ..., Input n → MAPPED 1, MAPPED 2, ..., MAPPED n → REDUCED, REDUCED → Output


Analogy: Roman Empire Census

DATA: Roman Empire

HDFS BLOCKS: City 1, City 2, ..., City n

MAP TASK: COUNT

MAPS: Population of City 1, Population of City 2, ..., Population of City n

REDUCTION TASK: SUM

OUTPUT: Population of Roman Empire

https://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/
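
The census analogy can be written out as a minimal sketch. This is plain Python rather than a Hadoop job, and the city names and residents are made-up sample data. The map tasks run one after another here, whereas Hadoop would run them in parallel on the nodes that hold each block.

```python
from functools import reduce

# Hypothetical census records: one (city, resident) pair per person.
records = [
    ("Rome", "Marcus"), ("Rome", "Livia"), ("Rome", "Gaius"),
    ("Ostia", "Julia"), ("Ostia", "Titus"),
    ("Pompeii", "Claudia"),
]


def split_by_city(records):
    """Group records by city, mirroring data split into HDFS blocks."""
    blocks = {}
    for city, resident in records:
        blocks.setdefault(city, []).append(resident)
    return blocks


def map_count(city, residents):
    """Map task: count the residents of one city."""
    return city, len(residents)


def reduce_sum(city_counts):
    """Reduce task: sum the per-city counts into one total."""
    return reduce(lambda total, pair: total + pair[1], city_counts, 0)


blocks = split_by_city(records)
city_counts = [map_count(city, residents) for city, residents in blocks.items()]
empire_population = reduce_sum(city_counts)

print(city_counts)        # per-city populations (the "MAPS" column)
print(empire_population)  # population of the whole empire (the "OUTPUT")
```

Each call to `map_count` needs only its own city's records, which is what lets the counting be spread across many machines before a single sum combines the results.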


Beyond the Fundamentals

■ Hadoop YARN: job prioritization, cluster management

■ Apache Hive: data querying, summarization, analysis (like SQL)

■ Apache Spark: computation over an application

■ Apache Pig: parallel computing execution

✓ Open source
✓ Affordable
✓ Projects for easier execution

Large infrastructure needed
Straightforward analysis not as easy


Who uses Hadoop?

Search/content optimization:

Data storage:

Image conversion:


Case Study: Sears

Motivation:
■ Know the customers better
■ Individual customer personalization
■ Better retail, higher profit

Results:
■ Use of Hadoop framework
■ Frontrunner in big data technologies


Before using Hadoop: Sears

Gridded data: only 10% analyzed

■ Business insights
■ Inefficient use of time and money
■ Sales not improving

Boldly stepping forward
■ Relatively new technology
■ Replacement of infrastructure
■ Trial and error

http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?page_number=1


After using Hadoop: Sears

100% of data available for analysis

■ 1 node → 300+ nodes
■ Time management
■ Money management
■ Retail computing and analytics: MetaScale

Individual transactions

https://www.edureka.co/blog/big-data-applications-sears-case-study/


Example: Sears

■ Query: All items priced more than $29,999.00
■ Tools: Ruby MapReduce, Pig, Hive
■ Data: 15 billion records
■ Solution: Pig can provide quick execution
■ Input: 15,274,430,951 records searched
■ Output: 28 records returned
■ Time: 53 seconds

http://www.metascale.com/resources/blogs/156-big-data-case-study-hadoop-first-usage-in-production-at-sears-holdings.html#.WFtg2_krLIU
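
The shape of the query above, a filter applied to every partition of the data with only the matches kept, can be sketched as follows. This is plain Python, not the Pig script Sears ran: the item names, prices, and three-partition layout are hypothetical, and only the price threshold comes from the slide.

```python
from decimal import Decimal

PRICE_THRESHOLD = Decimal("29999.00")  # threshold taken from the query on the slide

# Hypothetical item records split across partitions, standing in for HDFS blocks.
partitions = [
    [("sofa", Decimal("899.00")), ("diamond ring", Decimal("34500.00"))],
    [("lawn tractor", Decimal("2999.99")), ("watch", Decimal("29999.00"))],
    [("home theater", Decimal("41250.50"))],
]


def filter_partition(partition, threshold):
    """Per-partition filter: keep items priced strictly above the threshold."""
    return [(name, price) for name, price in partition if price > threshold]


def run_query(partitions, threshold):
    """Apply the filter to every partition and concatenate the matches."""
    scanned = 0
    matches = []
    for partition in partitions:
        scanned += len(partition)
        matches.extend(filter_partition(partition, threshold))
    return scanned, matches


scanned, matches = run_query(partitions, PRICE_THRESHOLD)
print(f"Input: {scanned} records searched")
print(f"Output: {len(matches)} records returned")
for name, price in matches:
    print(f"  {name}: ${price}")
```

Each partition is filtered independently, so on a cluster the scan can run on all nodes at once and only the few matching records need to be collected. The item priced at exactly $29,999.00 is excluded because the query asks for prices above that amount.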


Further Application: Apache Spark

Data processing specialist

Up to 100 times faster (in memory) than Hadoop MapReduce

Built for ease of application usage

http://www.infoworld.com/article/3014440/big-data/five-things-you-need-to-know-about-hadoop-v-apache-spark.html


Step by step: Spark

Open source

File management system

Parallel operations

Java

Python

R

Machine learning

SQL

Streaming

Operates in memory

http://spark.apache.org/


Spark vs. Hadoop

Spark
■ Needs data storage system
■ In-memory operations
■ Reads, operates, writes all at once
■ Resilient distributed datasets
■ Built-in SQL, machine learning

Hadoop
■ Built-in data storage system, HDFS
■ Hard drive operations
■ Part of the data at a time
■ More secure failure capabilities
■ Needs advanced analytics add-ons

http://www.forbes.com/sites/bernardmarr/2015/06/22/spark-or-hadoop-which-is-the-best-big-data-framework/#21f1d563532c


From the eyes of a business owner: Spark

Do I need advanced analytics?

Is it more cost effective?

Is it easier to execute?

“[...] everyday business owners are finding increasingly innovative uses for their stored data.” Bernard Marr, Forbes


Thank you!

Google “Hadoop SAS”

https://www.sas.com/en_us/insights/big-data/hadoop.html