Top Banner

Click here to load reader

Apache hadoop technology : Beginners

Apr 14, 2017



Apache Hadoop Technology


Apache Hadoop Technology


Content :Introduction to HadoopHadoop architectureWhat is Apache HadoopData flowMapReduceHDFSYARN FrameworkWho uses HadoopHadoop in enterprisesAdvantage Conclusion

What is Hadoop :Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of theApacheproject sponsored by the Apache Software Foundation.At its core, Hadoop has two major layers namely: (a) Processing/Computation layer (MapReduce), and (b) Storage layer (Hadoop Distributed File System).

Hadoop Architecture :

What is Apache Hadoop :The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage..

Data flow :

Web ServersScribe ServersNetwork StorageHadoop ClusterOracle RACMySQL

MapReduce :Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.A MapReducejobusually splits the input data-set into independent chunks which are processed by themap tasksin a completely parallel manner. The framework sorts the outputs of the maps, which are then input to thereduce tasks.

Cont.. Job A full program - an execution of a Mapper and Reducer across a data setTask An execution of a Mapper or a Reducer on a slice of data a.k.a. Task-In-Progress (TIP)Task Attempt A particular instance of an attempt to execute a task on a machine

MapReduce High level :

HDFS :A file system, that stores data in a very efficient manner, which can be used easily. A distributed file system that provides high throughput access to application.Features :It is suitable for the distributed storage and processing.Hadoop provides a command interface to interact with HDFS.The built-in servers of namenode and datanode help users to easily check the status of cluster.Streaming access to file system data.HDFS provides file permissions and authentication.

Architecture :

YARN Framework :Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management technology.YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture.It provides resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters.It provides, a consistent framework for writing data access applications that run IN Hadoop, to the developers.


Cont. :Some features are :Multi TangencyCluster UtilizationScalability Compatibility

Architecture :

Who Uses Hadoop :Amazon/A9FacebookGoogleIBMJoostLast.fmNew York TimesPowerSetVeohYahoo!

Hadoop in the EnterpriseAccelerate nightly batch business processes Storage of extremely high volumes of dataCreation of automatic, redundant backupsImproving the scalability of applicationsUse of Java for data processing instead of SQLProducing JIT feeds for dashboards and BIHandling urgent, ad hoc request for dataTurning unstructured data into relational dataTaking on tasks that require massive parallelismMoving existing algorithms, code, frameworks, and components to a highly distributed computing environment

Advantage :

Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatic distributes the data and work across the machines and in turn, utilizes the underlying parallelism of the CPU cores. Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA), rather Hadoop library itself has been designed to detect and handle failures at the application layer.

Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without interruption. Another big advantage of Hadoop is that apart from being open source, it is compatible on all the platforms since it is Java based.

Conclusion :Apache Hadoop is a fast-growing data frameworkApache Hadoop offers a free, cohesive platform that encapsulates: Data integration Data processing Workflow scheduling Monitoring




MapReduce job submitted by client computer

Master node


Slave node

Task instance


Slave node

Task instance


Slave node

Task instance

Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.