HBase
- Bulk Loading in HBase
- Create, Insert & Read Tables in HBase
- HBase Admin APIs
- HBase Security
- HBase vs Hive
- Backup & Restore in HBase
- Apache HBase External
COURSE CURRICULUM
BIG DATA HADOOP
FULL
Pre-requisites for the Big Data Hadoop Training Course?
There are no pre-requisites. Knowledge of Java/Python, SQL and Linux is beneficial, but not mandatory. Ducat provides a crash course covering the pre-requisites required to begin Big Data training.
Apache Hadoop on AWS Cloud
This module will help you understand how to configure a Hadoop Cluster on AWS Cloud:
- Introduction to Amazon Elastic MapReduce
- AWS EMR Cluster
- AWS EC2 Instance: Multi Node Cluster Configuration
- AWS EMR Architecture
- Web Interfaces on Amazon EMR
- Amazon S3
- Executing a MapReduce Job on EC2 & EMR
- Apache Spark on AWS, EC2 & EMR
- Submitting a Spark Job on AWS
- Hive on EMR
- Available Storage Types: S3, RDS & DynamoDB
- Apache Pig on AWS EMR
- Processing NY Taxi Data using Spark on Amazon EMR
Learning Big Data and Hadoop
This module will help you understand Big Data:
- Common Hadoop Ecosystem Components
- Hadoop Architecture
- HDFS Architecture
- Anatomy of File Write and Read
- How the MapReduce Framework Works
- Hadoop High-Level Architecture
- MR2 Architecture
- Hadoop YARN
- Hadoop 2.x Core Components
- Hadoop Distributions
- Hadoop Cluster Formation
Hadoop Architecture and HDFS
This module will help you understand Hadoop & HDFS Cluster Architecture:
- Configuration Files in a Hadoop Cluster (FSImage & edit log file)
- Setting up a Single & Multi Node Hadoop Cluster
- HDFS File Permissions
- HDFS Installation & Shell Commands
- Daemons of HDFS
- NameNode
- DataNode
- Secondary NameNode
- YARN Daemons
- Resource Manager
- Node Manager
- HDFS Read & Write Commands
- NameNode & DataNode Architecture
- HDFS Operations
- Hadoop MapReduce Job
- Executing a MapReduce Job
Hadoop MapReduce Framework
This module will help you understand the Hadoop MapReduce framework:
- How MapReduce Works on HDFS Data Sets
- MapReduce Algorithm
- MapReduce Hadoop Implementation
- Hadoop 2.x MapReduce Architecture
- MapReduce Components
- YARN Workflow
- MapReduce Combiners
- MapReduce Partitioners
- MapReduce Hadoop Administration
- MapReduce APIs
- Input Split & String Tokenizer in MapReduce
- MapReduce Use Cases on Data Sets
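The map, shuffle and reduce phases covered in this module can be sketched as a toy word count in plain Python. This is an illustration of the data flow only, not an actual Hadoop job:

```python
from collections import defaultdict

def map_phase(documents):
    """Mapper: emit a (word, 1) pair for every word in every input record."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle/sort: group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducer: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data hadoop", "hadoop mapreduce on big data"]
result = reduce_phase(shuffle_phase(map_phase(docs)))
```

In a real job each phase runs distributed across the cluster; here the same three steps simply run in sequence in one process.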
Advanced MapReduce Concepts
This module will help you learn:
- Job Submission & Monitoring
- Counters
- Distributed Cache
- Map & Reduce Join
- Data Compressors
- Job Configuration
- Record Reader
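The Distributed Cache and map-side join topics above fit together: a small side table is shipped to every mapper and held in memory, so the large input can be joined record by record without a reduce-side shuffle. A minimal Python sketch of that pattern, with invented names and data:

```python
# Small side table, distributed to each mapper (the Distributed Cache role).
dept_lookup = {
    10: "Engineering",
    20: "Sales",
}

# Records from the large input split, processed one at a time by the mapper.
employees = [
    ("alice", 10),
    ("bob", 20),
    ("carol", 10),
]

def map_side_join(records, lookup):
    """Join each record against the in-memory lookup table inside the mapper."""
    for name, dept_id in records:
        yield (name, lookup.get(dept_id, "UNKNOWN"))

joined = list(map_side_join(employees, dept_lookup))
```

A reduce-side join would instead tag and shuffle both datasets by the join key; the map-side variant avoids that cost when one side is small enough to fit in memory.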
Pig
This module will help you understand Pig concepts:
- Pig Architecture
- Pig Installation
- Pig Grunt Shell
- Pig Running Modes
- Pig Latin Basics
- Pig LOAD & STORE Operators
- Diagnostic Operators
- DESCRIBE Operator
- EXPLAIN Operator
- ILLUSTRATE Operator
- DUMP Operator
- Grouping & Joining
- GROUP Operator
- COGROUP Operator
- JOIN Operator
- CROSS Operator
- Combining & Splitting
- UNION Operator
- SPLIT Operator
- Filtering
- FILTER Operator
- DISTINCT Operator
- FOREACH Operator
- Sorting
- ORDER BY Operator
- LIMIT Operator
- Built-in Functions
- EVAL Functions
- LOAD & STORE Functions
- Bag & Tuple Functions
- String Functions
- Date-Time Functions
- MATH Functions
- Pig UDFs (User Defined Functions)
- Pig Scripts in Local Mode
- Pig Scripts in MapReduce Mode
- Analysing XML Data using Pig
- Pig Use Cases (Data Analysis on Social Media Sites, Banking, Stock Market & Others)
- Analysing JSON Data using Pig
- Testing Pig Scripts
Hive
This module will build your concepts in learning:
- Hive Installation
- Hive Data Types
- Hive Architecture & Components
- Hive Metastore
- Hive Tables (Managed Tables and External Tables)
- Hive Partitioning & Bucketing
- Hive Joins & Subqueries
- Running Hive Scripts
- Hive Indexing & Views
- Hive Queries (HQL): Order By, Group By, Distribute By, Cluster By, with Examples
- Hive Functions: Built-in & UDF (User Defined Functions)
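The bucketing topic above can be pictured with a small Python sketch: Hive assigns each row to bucket = hash(bucketing column) mod bucket count, so rows with the same key always land in the same file. The column names and values below are invented, and Hive's real hash function differs; for an integer column the hash is effectively the value itself, which is what this toy uses:

```python
NUM_BUCKETS = 4  # as in CLUSTERED BY (user_id) INTO 4 BUCKETS

def bucket_for(user_id):
    # Hive-style assignment: hash(column) modulo the bucket count.
    return user_id % NUM_BUCKETS

# Hypothetical (user_id, name) rows being written into a bucketed table.
rows = [(101, "alice"), (102, "bob"), (103, "carol"), (105, "eve")]

buckets = {}
for user_id, name in rows:
    buckets.setdefault(bucket_for(user_id), []).append(name)
```

Because the bucket of a key is deterministic, joins on the bucketing column can match bucket-to-bucket instead of scanning whole tables.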
Data Processing with Apache Spark
Spark executes in-memory data processing, and this module shows how a Spark job runs faster than a Hadoop MapReduce job. The course will also help you understand the Spark ecosystem and its related APIs (Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX) as well as Spark Core concepts. It will help you understand Data Analytics and apply Machine Learning algorithms to various datasets to process and analyse large amounts of data.
- Spark RDDs
- Spark RDD Actions & Transformations
- Spark SQL: Connectivity with various relational sources and converting them into DataFrames using Spark SQL
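The distinction between RDD transformations and actions listed above is that transformations (map, filter) are lazy: they only describe a computation, and nothing executes until an action (collect, count) is called. Python generators give a rough analogue of that behaviour; this is an analogy, not PySpark code:

```python
data = [1, 2, 3, 4, 5]

# "Transformations": each generator only records what to do; no work happens yet.
doubled = (x * 2 for x in data)
big_only = (x for x in doubled if x > 4)

# "Action": materialising the result forces the whole pipeline to run.
result = list(big_only)
```

In Spark this laziness lets the scheduler fuse a chain of transformations into one pass over the data before the action triggers execution.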
Project #1: Working with MapReduce, Pig, Hive & Flume
Problem Statement: Fetch structured & unstructured data sets from various sources, such as social media sites and web servers, and from structured sources like MySQL, Oracle & others; dump them into HDFS; and then analyse the same datasets using Pig, HQL queries & MapReduce to gain proficiency in the Hadoop stack & its ecosystem tools.
Data Analysis Steps:
- Dump XML & JSON datasets into HDFS.
- Convert semi-structured data formats (JSON & XML) into structured format using Pig, Hive & MapReduce.
- Push the data sets into the Pig & Hive environments for further analysis.
- Write Hive queries and push the output into a relational database (RDBMS) using Sqoop.
- Render the results as box plots, bar graphs & others using R & Python integration with Hadoop.
Project #2: Analyze Stock Market Data
Industry: Finance
Data: The data set contains stock information from the New York Stock Exchange, such as daily quotes, the stock's highest price and the stock's opening price.
Problem Statement: Calculate covariance for the stock data to solve storage & processing problems related to huge volumes of data.
- Positive covariance: if investment instruments or stocks tend to be up or down during the same time periods, they have positive covariance.
- Negative covariance: if returns move inversely, i.e. one investment tends to be up while the other is down, this shows negative covariance.
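The two cases can be checked numerically with a short sketch; the price series below are made up for illustration:

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length price (or return) series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Made-up closing prices for three hypothetical stocks.
stock_a = [10.0, 11.0, 12.0, 13.0]
stock_b = [20.0, 22.0, 24.0, 26.0]  # moves with stock_a   -> positive covariance
stock_c = [26.0, 24.0, 22.0, 20.0]  # moves against it     -> negative covariance
```

In the project itself the same per-pair computation is distributed across the cluster, with MapReduce or Spark producing the mean and deviation sums for each stock pair.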
Project #3: Hive, Pig & MapReduce with New York City Uber Trips
Problem Statement:
- What was the busiest dispatch base by trips for a particular day or the entire month?
- Which day had the most active vehicles?
- Which days had the most trips, sorted from most to fewest?
Data fields: Dispatching_Base_Number is the NYC Taxi & Limousine Commission code of the base that dispatched the Uber; active_vehicles is the number of active Uber vehicles for a particular date & company (base); trips is the number of trips for a particular base & date.
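A minimal sketch of the first question, assuming invented sample rows in the field layout described above:

```python
from collections import Counter

# Hypothetical rows: (dispatching_base_number, date, active_vehicles, trips)
rows = [
    ("B02512", "2015-01-01", 190, 1132),
    ("B02765", "2015-01-01", 225, 1765),
    ("B02512", "2015-01-02", 175, 982),
    ("B02765", "2015-01-02", 230, 1705),
]

# Aggregate trip counts per dispatch base, then take the maximum.
trips_per_base = Counter()
for base, date, vehicles, trips in rows:
    trips_per_base[base] += trips

busiest_base, total_trips = trips_per_base.most_common(1)[0]
```

The same aggregation expressed in the course's tools would be a GROUP in Pig or a GROUP BY with SUM(trips) in HiveQL.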
BIG DATA PROJECTS

Partners:
- PITAMPURA (DELHI): Plot No. 366, 2nd Floor, Kohat Enclave, Pitampura (Near Kohat Metro Station), Above Allahabad Bank, New Delhi-110034. Ph: 70-70-90-50-90
- NOIDA: A-43 & A-52, Sector-16, Noida-201301 (U.P.), INDIA. Ph: 70-70-90-50-90, +91 99-9999-3213
- GHAZIABAD: 1, Anand Industrial Estate, Near ITS College, Mohan Nagar, Ghaziabad (U.P.). Ph: 70-70-90-50-90
- GURGAON: 1808/2, 2nd Floor, Old DLF, Near Honda Showroom, Sec.-14, Gurgaon (Haryana). Ph: 70-70-90-50-90
- SOUTH EXTENSION (DELHI): D-27, South Extension-1, New Delhi-110049. Ph: +91 98-1161-2707
www.facebook.com/ducateducation
Project #4: Analyze Tourism Data
Data: The tourism data comprises: city pair, seniors travelling, children travelling, adults travelling, car booking price & air booking price.
Problem Statement: Analyze the tourism data to find:
- The top 20 destinations tourists frequently travel to, based on the number of trips booked for each destination.
- The top 20 high air-revenue destinations, i.e. the 20 cities that generate the highest airline revenues, so that discount offers can be given to attract more bookings for these destinations.
- The top 20 locations from which most trips start, based on booked trip count.
Project #5: Airport Flight Data Analysis
Industry: Aviation
We will analyze Airport Information System data that gives information regarding flight delays, source & destination details, diverted routes & others.
Problem Statement: Analyze the flight data to find:
- The list of delayed flights.
- Flights with zero stops.
- The list of active airlines across all countries.
- Source & destination details of flights.
- Reasons why flights get delayed.
- Times in different formats.
Project #6: Analyze Movie Ratings
Industry: Media
Data: Movie data from sites like Rotten Tomatoes, IMDb, etc.
Problem Statement: Analyze the movie ratings given by different users to:
- Get the user who has rated the most movies.
- Get the user who has rated the fewest movies.
- Get the count of movies rated by users belonging to a specific occupation.
- Get the number of underage users.
Project #7: Analyze Social Media Channels: Facebook, Twitter, Instagram & YouTube
Industry: Social Media
Data: The data set columns are: VideoId, Uploader, interval between the day of establishment of YouTube and the date of uploading of the video, Category, Length, Rating and Number of Comments.
Problem Statement: Identify the top 5 categories in which the most videos are uploaded, the top 10 rated videos, and the top 10 most viewed videos.
Apart from these, there are some twenty more use cases to choose from, e.g. Twitter Data Analysis.