User Guide V5.2015-05-22.final HDFS
Aug 12, 2015
Cox Automotive
User Guide HDFS 2
Executive Statement
This document is meant for those new to Hadoop who will be
working with Hadoop Distributed File System (HDFS). The
audience is anyone utilizing Hadoop within their company. The
goal is for the user to have a greater understanding of how to
access their files and where their files will be stored using HDFS.
Project Owner(s)
Person(s) of Contact
Michael Gay [email protected]
Mike Tacker [email protected]
Table of Contents
Chapter 1: What is HDFS?
  Introduction
  - What Makes It Work
  - What You Gain
  Understanding HDFS Data
Chapter 2: How Can I Get Started?
  Hue vs. Command Prompt
  What You Can Access
Chapter 3: How Can I Interact With My Files?
  Shell Commands
  With Hue
  - Rename
  - Move
  - Copy
  - Change Permissions
  - Move to Trash/Download
Chapter 4: How Is My Data Stored?
  Cluster Set Up
  NameNodes
  DataNodes
  MapReduce
  Data Housing and Hardware
Chapter 5: Conclusion
Chapter 1
What is HDFS?
Welcome to HDFS!
Introduction: Hadoop
Apache™ Hadoop® creates open-source software for developers and computer architects to build frameworks
for your use! With the help of Hadoop, collecting and reading one terabyte of data would take less than five
minutes. This guide will discuss specifically the Hadoop Distributed File System (HDFS) and what makes it
unique from other filing systems we currently use.
First and foremost: What exactly is HDFS? The system is used for filing data on low-cost commodity hardware.
It can store large data sets that are sometimes terabytes in size. Recovery of your files come quickly and stress
free due to the HDFS fault-tolerant recovery. The cluster system HDFS uses is able to store data in a more
collective format by duplicating your files.
What Makes it Work
The system is built around its data architecture. HDFS runs on top of native file systems such as Ext3, Ext4, and XFS, and it is based on Google's file system (GFS). Because your files are stored across the data cluster, a kind of mapping system is used. Take that one terabyte of data mentioned earlier: it is split into smaller blocks that are scattered across many drives. Breaking a single file down this way makes it easier to open, read, search, et cetera.
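The block idea above can be sketched with ordinary shell tools. This is only a local illustration; the file names and the 1 KB piece size are made up (HDFS itself uses much larger blocks and spreads them across machines automatically):

```shell
# Create a 4 KB "large" file, then break it into fixed-size 1 KB pieces,
# the way HDFS breaks a big file into blocks spread across drives.
head -c 4096 /dev/urandom > bigfile
split -b 1024 bigfile block_
ls block_*    # four pieces: block_aa block_ab block_ac block_ad
# Reassembling the pieces in order recovers the original file.
cat block_aa block_ab block_ac block_ad > rebuilt
cmp -s bigfile rebuilt && echo "files match"
```

Reading the pieces back in order reconstructs the file exactly, which is what HDFS does transparently whenever you open a file.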
What You Gain
You will have access to user resources for your files, in an environment that keeps them safe and secure. Because you own your files, HDFS lets you go into a file's data to retrieve or change any of it. For example, if you want to find how many times a word is repeated in your file, that can be done in no time. The recovery system keeps your files safe if an error occurs: because duplicates of your file are stored across the cluster, a single failure does not corrupt or lose your information.
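That word-count example can be done in one line at the command prompt. The file name and the word below are invented for illustration (on a real cluster you would first stream the file out of HDFS, for example with hdfs dfs -cat):

```shell
# Count how many times the word "alpha" appears in a file.
printf 'alpha beta alpha\nalpha gamma beta\n' > sample.txt
grep -o 'alpha' sample.txt | wc -l    # prints 3
```

grep -o emits each match on its own line, so wc -l gives the total number of occurrences.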
Understanding the HDFS Data
Hadoop has four layers of data, ordered from bottom to top. If you are the main user and owner of a project, you can access and change anything in that project. At the bottom, the raw materials are system processes: true data without any modifications made by you or the enterprise as a whole. These are the production areas where you will find your data.
Now that you have been introduced to HDFS, we can go through the details of how your files are stored, how you can access them, and the hardware that makes it all work!
Projects (/data/proj/): The main files that have been or are currently being worked on; they can be accessed by the user. (Business Unit/User; Read/Write Access)
Core (/data/core/): The more refined data for enterprise business use. (Business Unit/User; Read/Write Access)
Published (/data/pub/): Data that has been lightly modified; it can be accessed for business use. (IT Owned; Business Unit/User Read Only)
Raw (/data/raw/): The raw materials used in the database, meant for processing your data. (IT Owned; Business Unit/User Read Only)
Group (/grp/[bu_team]/env [dev or qa]/, containing proj, core, temp, and, if needed, pub and raw): Houses any group data in use or development, including projects, data assets, etc.
Sandbox (/sandbox): Volatile, shared area for testing new concepts.
Training (/training): Safe area for end users to learn the environment.
Chapter 2
How Can I Get Started?
Hue vs. Command Prompt
The latest version in use is hadoop-2.3.0+cdh5.1.5+854.
There are two ways you can view HDFS. Hue is a Hadoop interface that lets you work with your data directly. It is much simpler than the command prompt because all of your files are organized in a table-like format. To use Hue, simply go to your company's login page (such as http://hue.autotrader.com). There you will see a field where you enter your Active Directory login (Figure 1).
Once you are logged in, you will see the files you can read and/or write using HDFS (Figure 2). The columns are labeled Name, Size, User, Group, Permissions, and Date. When logged in, your home directory is /user/LOGINID.
Figure 1
Figure 2
If you would prefer to use the command prompt, simply open your prompt and use the command:
hdfs dfs
After doing so, you'll have your command prompt brought up. Similar to Hue, your files are grouped in their respective areas. However, the interface is text based rather than a table: you enter commands to reach specific files (Figure 3).
Figure 3
What You Can Access
Any file you create can be accessed through HDFS, and you can view different aspects of its data. The areas you as a user can access are:
Sandbox: /sandbox
Training: /training
Group: /grp/[bu_team]
Users Tree: /user/[yourname]
Because everything in HDFS is considered a file, every bit of information can be opened like a file, right down to the raw data mentioned in the previous chapter. You can work with files directly, whether reading and/or writing them, and you will see the information going in and out. As a user, you sometimes will not be able to access certain files because of the group you are in. For example, if your business group (/grp/[bu]) has not added you, or you do not own it, no data will show. You need to be granted rights to the correct groups before you have access.
Chapter 3
How Can I Interact With My Files?
Shell Commands
HDFS provides shell commands through the command prompt. These commands interact with HDFS to copy, delete, or change the group of files, and to make many other changes or additions to your file data. They can only be run in the command prompt. To start a shell command, enter:
hdfs dfs <command>
count: hdfs dfs -count [-q] [-h] <paths>
Counts the directories, files, and bytes under the paths matching the file pattern you are looking for.
cp: hdfs dfs -cp [-f] URI [URI ...] <dest>
Copies files from a source to a destination. If you copy files from multiple sources, the destination must be a directory.
get: hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst>
Copies files to the local file system.
ls: hdfs dfs -ls <args>
Lists files. Example: hdfs dfs -ls /user/hadoop/test.txt
lsr: hdfs dfs -lsr <args>
Recursive version of ls.
mkdir: hdfs dfs -mkdir [-p] <paths>
Creates directories from the path URIs given as arguments.
mv: hdfs dfs -mv URI [URI ...] <dest>
Moves files from source to destination:
hdfs dfs -mv /user/hadoop/test.txt /user/hadoop/test2.txt
put: hdfs dfs -put <localsrc> ... <dst>
Copies one or more sources from the local file system into the destination file system. It can also read input from stdin and write it to the destination file system.
rm: hdfs dfs -rm [-skipTrash] URI [URI ...]
Deletes the files specified as arguments. With -skipTrash, files bypass the trash and are removed immediately, so they cannot be restored.
text: hdfs dfs -text <src>
Outputs a source file in text format.
With Hue
If you are not comfortable using the command prompt, Hue makes it much easier to navigate, download, move, and copy files. As discussed in Chapter 2, Hue's File Browser is laid out as a table where you can access different files. We are going to break this down a little so it is easier to understand.
First, there is the main File Browser screen. In Figure 4, you will notice files listed under your /user name, along with columns showing the User, Group, Permissions, and the Date each file was accessed.
What if you do not have any files in the system just yet? All you have to do is go to the far right of your screen and click the Upload drop-down box. It gives you the option of uploading Files or Zip/Tgz files (Figure 5). You can upload any kind of file you want. For this example, I am going to upload a .docx file (a Word document).
Now that I have a file uploaded, let's start at the very top of the File Browser. There is a search box labeled "Search for file name". Type in the file you want to find, and it will appear in your user table (Figure 6). You can then check the file you searched for and begin modifying it (Figure 7).
Note: Remember, there are some files you may not have access to because you have not been given the proper permission.
Figure 4
Figure 5
Figure 6 Figure 7
You now have the option to change and manage your files with different modifications (Figure 8).
Rename
To rename your file, click the Rename tab; a field will pop up where you can enter a different name for your file (Figure 9). Renaming a file will not corrupt it or change its format.
Move
A file can be moved to a typed-in destination, to a folder you already have under your /user directory, or to a new folder you create. If you have a specific destination (such as a folder for another user or group), type it in as /user/[username]/[destination name] and click Move. If you would like to move it to another folder, select the folder of your choice and click Move (Figure 10).
Figure 8
Figure 9
Figure 10
Copy
Copy allows you to copy a file to a new destination, such as a different folder (Figure 11).
Figure 11
Change Permissions
You have authority over who can Read, Write, and Execute your files. Change Permissions brings up a table where you can set who is allowed to access different areas of your files. For example, if you would like the entire group, rather than just you, to be able to write to your file, check the Group box in the Write column (Figure 12).
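HDFS permissions follow the familiar owner/group/other, read/write/execute model, so the Hue checkboxes map onto the same permission bits you can set with chmod. This local sketch uses a made-up file name and mirrors the "check the Group box in the Write column" example:

```shell
# Create a file, then grant the group write access:
# owner and group get read/write, everyone else read-only.
echo 'quarterly numbers' > report.txt
chmod 664 report.txt
ls -l report.txt    # mode column shows -rw-rw-r--
```

The mode string -rw-rw-r-- reads in groups of three: owner, group, other, each with read/write/execute slots.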
Download and Move to Trash
You can download your own files, or other users' files if you have permission to do so. Move to Trash lets you delete the selected file (Figure 13).
Now you have a working knowledge of HDFS using both the Command Prompt and Hue!
Figure 12
Figure 13
Chapter 4
How Is My Data Stored?
Cluster Set Up
Your files live in a cluster made up of nodes. The two kinds of nodes are called NameNodes and DataNodes, and they rely on each other to work properly. Together, the nodes make up the cluster as a whole. This is how your files are backed up and loaded whenever you open them.
NameNode
The NameNode is the master node: it keeps track of where the data lives.
DataNode
The DataNodes are where the data is actually stored. Data is kept in blocks, generally in sets of three: each block is duplicated onto DataNode2 and DataNode3.
Because the data is stored in duplicates, it can be split, recovered, and found far more easily. If a DataNode holding part of your Test.txt file gets corrupted or goes down, two more copies remain to recover your data. Think of it as a secure backup system.
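The three-copy idea can be mimicked on a local disk. The dn1, dn2, and dn3 directories below are stand-ins for DataNodes, and the block name is invented; HDFS manages all of this for you automatically:

```shell
# One block of a file, replicated onto three "DataNodes".
mkdir -p dn1 dn2 dn3
echo 'contents of Test.txt, block 0' > dn1/blk_0
cp dn1/blk_0 dn2/blk_0
cp dn1/blk_0 dn3/blk_0
# A DataNode fails...
rm -rf dn1
# ...but the block is still readable from a surviving replica.
cat dn2/blk_0
```

After the failure, HDFS would also re-replicate the block onto a healthy node to get back to three copies.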
Mapping The Cluster
Within this cluster you have MapReduce. This programming model makes it easier to filter out certain aspects of a file. For example, say you had a large library file and needed to find data containing the word "Contact": MapReduce can find that word for you, with options to delete, modify, or duplicate it. As another example, given a set of similar titles, MapReduce can show you how many times each word is used within the file.
Data Housing and Hardware
All of your data is hosted on racks of stacked servers in a data center. Many machines are used so your data can be read quickly, rather than relying on one or two machines to scan through it at a slower pace. For example, a single machine might take 40 minutes to read 1 TB of data, while the same terabyte split across many machines can be read in about 4 minutes. Both setups read the same amount of data, but because the second one spreads the work more widely, you can read and write your data much more quickly; even with many terabytes, HDFS keeps the process fast by using many servers.
This allows users to work with HDFS without a supercomputer or a highly advanced laptop. You can buy cheap machines and still receive the same quality and speed of information as you would with a more expensive computer.
Word-count example (from the figure): given the titles "Rose Tile," "Rose Quartz," and "Rose Bud," MapReduce returns Rose, 3; Tile, 1; Quartz, 1; Bud, 1.
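The same word count can be reproduced with a classic shell pipeline that mirrors the MapReduce stages; this is only a local analogy, not how the cluster actually runs the job:

```shell
# map: split each title into one word per line;
# shuffle: sort so identical words sit next to each other;
# reduce: count each run of identical words.
printf 'Rose Tile\nRose Quartz\nRose Bud\n' | tr ' ' '\n' | sort | uniq -c
# output (alphabetical): 1 Bud, 1 Quartz, 3 Rose, 1 Tile
```

MapReduce distributes exactly these steps across the cluster, so files far too large for one machine can be counted the same way.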
Chapter 5
Conclusion
The goal of this guide was to teach you how to access and interact with your files using HDFS. We went over what you as the user can read and write, as well as the different interfaces you can use: the Command Prompt and Hue. You now know where your data is stored, how it is managed, and how you can manage it. We hope this guide was helpful in your learning process. If you would like more information, please see the Useful Resources for HDFS section at the end of this guide.
Thank you for choosing the Cox Automotive HDFS User Guide!
Updated: 5.18.15
HDFS User Guide
Anna Ellis, Cox Automotive
Useful Resources HDFS
"Apache Hadoop 2.6.0 - HDFS Users Guide." Apache, 2014. Web. 6 May 2015.
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#HDFS_Users_Guide
"CDH." Packaging and Tarball Information. Cloudera, 2015. Web. 13 May 2015.
http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5-1-x/CDH-Version-and-Packaging-Information/cdhvd_cdh_package_tarball.html
Farooqui, Sameer. "Hadoop Tutorial: Intro to HDFS." YouTube, 31 Oct. 2012. Web. 8 May 2015.
https://www.youtube.com/watch?v=ziqx2hJY8Hg
Hall, Marty. "Hadoop Tutorial: HDFS Part 1 -- Overview." Slideshare, 30 Mar. 2013. Web. 11 May 2015.
http://www.slideshare.net/martyhall/hadoop-tutorial-hdfs-part-1-overview
"HDFS Architecture." Apache, 2014. Web. 6 May 2015.
http://archive.cloudera.com/cdh4/cdh/4/hadoop/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#HDFS_Architecture