Page 1: Introduction to Apache Hadoop

Apache™ Hadoop®

Student: Omar Jaber

Dr. Aiman AbuSamra

31/12/2014

Islamic University-Gaza, Faculty of Engineering, Computer Engineering Department

Page 2: Introduction to Apache Hadoop

What is Apache Hadoop?

Hadoop, formally called Apache Hadoop, is an Apache Software Foundation project and open-source software platform for scalable, distributed computing. Hadoop can provide fast and reliable analysis of both structured and unstructured data. Given its ability to handle large data sets, it is often associated with the phrase Big Data.

The Apache Hadoop software library is essentially a framework that allows for the distributed processing of large datasets across clusters of computers using a simple programming model. Hadoop can scale up from single servers to thousands of machines, each offering local computation and storage.

Page 3: Introduction to Apache Hadoop

What is Apache Hadoop?

Apache Hadoop™ was born out of a need to process an avalanche of Big Data. The web was generating more and more information on a daily basis, and it was becoming very difficult to index over one billion pages of content. To cope, Google invented a new style of data processing known as MapReduce. A year after Google published a white paper describing the MapReduce framework, Doug Cutting and Mike Cafarella, inspired by it, created Hadoop to apply these concepts to an open-source software framework supporting distribution for the Nutch search engine project. Given the original use case, Hadoop was designed with a simple write-once storage infrastructure.

Hadoop has moved far beyond its beginnings in web indexing and is now used in many industries for a huge variety of tasks that all share a common theme: high variety, volume, and velocity of data, both structured and unstructured. It is now widely used in finance, media and entertainment, government, healthcare, information services, retail, and other industries with Big Data requirements, but the limitations of the original storage infrastructure remain.

Page 4: Introduction to Apache Hadoop

Who uses Hadoop?

Page 5: Introduction to Apache Hadoop

The base Apache Hadoop framework consists of the following modules:

• Hadoop Common – contains libraries and utilities needed by other Hadoop modules.

• Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster (a short usage sketch follows this list).

• Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and scheduling users' applications on them.

• Hadoop MapReduce – a programming model for large-scale data processing.
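
To make the HDFS module concrete, here is a minimal Java sketch that writes and reads a file through Hadoop's FileSystem API. The NameNode address and file path below are placeholder values, not details from the slides; a real deployment normally supplies fs.defaultFS via core-site.xml.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHello {
        public static void main(String[] args) throws Exception {
            // Point the client at the cluster. The NameNode address is a
            // placeholder; real deployments set it in core-site.xml.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; HDFS splits files into blocks and
            // replicates each block across the cluster.
            Path path = new Path("/user/demo/hello.txt");
            try (PrintWriter out = new PrintWriter(fs.create(path))) {
                out.println("hello from HDFS");
            }

            // Read the same file back through the same API.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(path)))) {
                System.out.println(in.readLine());
            }
            fs.close();
        }
    }

The same API is what MapReduce jobs use underneath when they read input splits from, and write results back to, HDFS.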

Page 6: Introduction to Apache Hadoop

How is Hadoop Different from Past Techniques?

• Hadoop can handle data in a very fluid way. Hadoop is more than just a faster, cheaper database and analytics tool. Unlike databases, Hadoop doesn’t insist that you structure your data. Data may be unstructured and schemaless. Users can dump their data into the framework without needing to reformat it. By contrast, relational databases require that data be structured and schemas be defined before storing the data.

• Hadoop has a simplified programming model. Hadoop’s simplified programming model allows users to quickly write and test distributed systems. Performing computation on large volumes of data has been done before, usually in a distributed setting, but writing distributed systems is notoriously hard. By trading away some programming flexibility, Hadoop makes it much easier to write distributed programs.

Page 7: Introduction to Apache Hadoop

How is Hadoop Different from Past Techniques?

• Hadoop is easy to administer. Alternative high-performance computing (HPC) systems allow programs to run on large collections of computers, but they typically require rigid program configuration and generally require that data be stored on a separate storage area network (SAN). Schedulers on HPC clusters require careful administration, and program execution is sensitive to node failure; by contrast, administering a Hadoop cluster is much simpler.

• Hadoop is agile. Relational databases are good at storing and processing data sets with predefined, rigid data models. For unstructured data, however, they lack the agility and scalability that are needed. Apache Hadoop makes it possible to cheaply process and analyze huge amounts of both structured and unstructured data together, and to process data without defining all of its structure ahead of time.

Page 8: Introduction to Apache Hadoop

Hadoop Creation History

Page 9: Introduction to Apache Hadoop

SQL on Hadoop

SQL is one of the most widely used languages to access, analyze, and manipulate structured data. As Hadoop gains traction within enterprise data architectures across industries, the need for SQL for both structured and loosely-structured data on Hadoop is growing rapidly. Key organizational drivers include the ability to:

• Leverage existing SQL skills in the organization

• Reuse BI, ETL, and analytics infrastructure investments with Hadoop

MapR delivers maximum flexibility for SQL access in Hadoop by ensuring that its users can run the widest variety of both open-source and proprietary SQL technologies on its secure and high-performance distribution for Hadoop.

Page 10: Introduction to Apache Hadoop

SQL on Hadoop

MapR supports SQL as a key use case along with the other types of processing on Hadoop. MapR takes an open approach to SQL, supporting the broadest set of SQL-on-Hadoop (also called "SQL-in-Hadoop") projects and technologies on the enterprise-grade MapR Distribution for Hadoop.
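
To make "SQL on Hadoop" concrete: Apache Hive is one widely used open-source SQL-on-Hadoop engine (the slides name the category rather than this particular project), and it exposes a standard JDBC interface through HiveServer2. The sketch below is a minimal illustration; the host, port, and the web_logs table are hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // Load the HiveServer2 JDBC driver (shipped with Hive).
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Host, port, and table name below are hypothetical.
            String url = "jdbc:hive2://hive-server:10000/default";
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement();
                 // Ordinary SQL; Hive compiles the query into jobs that
                 // run over files stored in the cluster.
                 ResultSet rs = stmt.executeQuery(
                         "SELECT status, COUNT(*) AS hits "
                       + "FROM web_logs GROUP BY status")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }

Engines differ in how they execute the query (Hive traditionally compiles it into MapReduce jobs; newer engines use their own runtimes), but the SQL surface stays familiar, which is exactly the organizational draw described above.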

Page 11: Introduction to Apache Hadoop

What is Hadoop MapReduce?

Hadoop MapReduce (Hadoop Map/Reduce) is a software framework for distributed processing of large data sets on compute clusters of commodity hardware. It is a sub-project of the Apache Hadoop project. The framework takes care of scheduling tasks, monitoring them and re-executing any failed tasks. 

According to The Apache Software Foundation, the primary objective of Map/Reduce is to split the input data set into independent chunks that are processed in a completely parallel manner. The Hadoop MapReduce framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system.
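
The classic first MapReduce program is word count, reproduced below in lightly commented form after the standard example in the Hadoop documentation. The map tasks each tokenize one chunk of input and emit (word, 1) pairs; the framework sorts them by word; the reduce tasks sum the counts.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: each input split is processed in parallel; emit (word, 1).
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: the framework has sorted map output by key,
        // so each call sees every count emitted for one word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // pre-aggregate per node
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, a job like this is typically launched with hadoop jar wordcount.jar WordCount <input dir> <output dir>, where both paths refer to directories in HDFS.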

Page 12: Introduction to Apache Hadoop

The MapR Advantage

MapR allows you to do more with Hadoop by combining Apache Hadoop with architectural innovations focused on operational excellence in the data center. MapR is the only distribution that is built from the ground up for business-critical production applications.

MapR is a complete distribution for Apache Hadoop that packages more than a dozen projects from the Hadoop ecosystem to provide you with a broad set of big data capabilities. The MapR platform not only provides enterprise-grade features such as high availability, disaster recovery, security, and full data protection but also allows Hadoop to be easily accessed as traditional network attached storage (NAS) with read-write capabilities.

Page 13: Introduction to Apache Hadoop

Why use Apache Hadoop?

• It’s cost-effective. Apache Hadoop controls costs by storing data more affordably per terabyte than other platforms. Instead of thousands to tens of thousands of dollars per terabyte, Hadoop delivers compute and storage for hundreds of dollars per terabyte.

• It’s fault-tolerant. Fault tolerance is one of the most important advantages of using Hadoop. Data is replicated across the cluster, so even when individual nodes fail while running jobs on a large cluster, the data can be recovered easily in the face of disk, node, or rack failures (a small sketch of the replication API follows).
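
As a small illustration of that replication, the sketch below reads and raises the replication factor of a single HDFS file through the Java FileSystem API. The NameNode address and path are placeholders; the cluster-wide default factor (three, set by dfs.replication) applies when nothing is requested per file.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            // NameNode address and file path are placeholders.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/user/demo/hello.txt");

            // Each HDFS file carries its own replication factor
            // (cluster default is 3, set by dfs.replication).
            FileStatus status = fs.getFileStatus(path);
            System.out.println("current replication: " + status.getReplication());

            // Ask the NameNode to keep one more copy of every block in
            // this file; re-replication happens in the background.
            fs.setReplication(path, (short) (status.getReplication() + 1));
            fs.close();
        }
    }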

Page 14: Introduction to Apache Hadoop

Why use Apache Hadoop?

• It’s flexible. The flexible way that data is stored in Apache Hadoop is one of its biggest assets – enabling businesses to generate value from data that was previously considered too expensive to be stored and processed in traditional databases. With Hadoop, you can use all types of data, both structured and unstructured, to extract more meaningful business insights from more of your data.

• It’s scalable. Hadoop is a highly scalable storage platform, because it can store and distribute very large data sets across clusters of hundreds of inexpensive servers operating in parallel. The problem with traditional relational database management systems (RDBMS) is that they can’t scale to process massive volumes of data.

Page 15: Introduction to Apache Hadoop

How did Hadoop get its name? Why the elephant? (Co-creator Doug Cutting named the project after his son's toy elephant, the inspiration for the logo.)

Page 16: Introduction to Apache Hadoop

References

• http://hadoop.apache.org/

• http://ar.wikipedia.org/

• http://hortonworks.com/

• www.mapr.com

Page 17: Introduction to Apache Hadoop

Thanks ☺