Page 1: UserGuideHDFS_FinalDocument

User Guide V5.2015-05-22.final HDFS

Page 2: UserGuideHDFS_FinalDocument

Cox Automotive

User Guide HDFS 2

Executive Statement

This document is intended for those new to Hadoop who will be working with the Hadoop Distributed File System (HDFS). The audience is anyone using Hadoop within their company. The goal is for the user to gain a greater understanding of how to access their files and where those files will be stored using HDFS.

Project Owner(s) / Person(s) of Contact:

Michael Gay [email protected]
Mike Tacker [email protected]

Page 3: UserGuideHDFS_FinalDocument


Table of Contents

Chapter 1: What is HDFS? ... 4
  Introduction ... 4
  - What Makes it Work ... 4
  - What You Gain ... 4
  Understanding HDFS Data ... 5
Chapter 2: How Can I Get Started? ... 6
  Hue vs. Command Prompt ... 6
  What You Can Access ... 7
Chapter 3: How Can I Interact With My Files? ... 8
  Shell Commands ... 8-9
  With Hue ... 10
  - Rename ... 11
  - Move ... 11
  - Copy ... 12
  - Change Permissions ... 12
  - Move to Trash/Download ... 12
Chapter 4: How Is My Data Stored? ... 13
  Cluster Set Up ... 13
  NameNodes ... 13
  DataNodes ... 13
  MapReduce ... 14
  Data Housing and Hardware ... 14
Chapter 5: Conclusion ... 15

Page 4: UserGuideHDFS_FinalDocument


Chapter 1

What is HDFS?

Welcome to HDFS!

Introduction: Hadoop

Apache™ Hadoop® is open-source software that lets developers and computer architects build data-processing frameworks for your use! With the help of Hadoop, collecting and reading one terabyte of data can take less than five minutes. This guide focuses specifically on the Hadoop Distributed File System (HDFS) and what makes it unique among the file systems we currently use.

First and foremost: what exactly is HDFS? It is a system for storing data on low-cost commodity hardware. It can store large data sets that are sometimes terabytes in size. Recovery of your files comes quickly and stress-free thanks to HDFS's fault-tolerant recovery. The cluster system HDFS uses stores data more reliably by duplicating your files.

What Makes it Work

The system is built around its data architecture. HDFS runs on top of local file systems such as ext3, ext4, and XFS, and is based on the Google File System (GFS). Because your files are stored across the data cluster, a mapping system keeps track of where each piece lives. Take the one terabyte of data mentioned earlier: that data is split up and scattered across many smaller drives. Breaking a single file down this way makes it easier to open, read, search, et cetera.
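The block-splitting idea can be sketched in a few lines of Python. This is an illustration only, not how HDFS is implemented (real HDFS block sizes are much larger, commonly 64 or 128 MB, and the system is written in Java); a tiny block size is used here so the output stays readable.

```python
# Illustrative sketch: a file's bytes are split into fixed-size blocks,
# and those blocks are what get scattered across the cluster's drives.
# The 8-byte block size is purely for readability.

def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Split a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"one terabyte of data", block_size=8)
print(blocks)       # [b'one tera', b'byte of ', b'data']
print(len(blocks))  # 3
```

Joining the blocks back together in order reproduces the original file, which is exactly what happens (transparently) when you open a file stored in HDFS.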

What You Gain

You gain user-level access to resources for your files, in an environment designed to keep them safe and secure. Because you own your files, HDFS allows you to go into the data of a file to retrieve or change any of it. For example, if you want to find how many times a word is repeated in your file, that can be done in no time. The recovery system keeps your files secure if an error occurs: because duplicates of your file are stored across the cluster, a single failure does not corrupt or lose your information.
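The word-repetition example above can be sketched with a few lines of Python. This runs on any local text, not inside HDFS itself; with Hadoop you would typically fetch the file first (for example with hdfs dfs -get) or run the count as a job on the cluster.

```python
# Find how many times a word is repeated in a file -- the example
# mentioned above. Runs on ordinary local text for illustration.

def count_word(text: str, word: str) -> int:
    """Count case-insensitive whole-word occurrences of `word` in `text`."""
    return text.lower().split().count(word.lower())

sample = "Rose Tile Rose Quartz Rose Bud"
print(count_word(sample, "Rose"))  # 3
```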

Page 5: UserGuideHDFS_FinalDocument


Understanding the HDFS Data

Hadoop data is organized in four layers, described below from bottom (raw) to top (projects). If you are the main user and owner of a project, you are able to access and change anything in that project. At the bottom, the raw materials are system processes holding true data, without any modifications made by you or the enterprise as a whole. These are the production areas where you'll find your data.

Now that you've been introduced to HDFS, we can go through the details of how your files are stored, how you can access them, and the hardware used to make it all work!

Projects (/data/proj/) - The main files that have been or are currently being worked on; can be accessed by the user. (Business Unit/User; Read/Write access)

Core (/data/core/) - The more refined data for enterprise business use. (Business Unit/User; Read/Write access)

Published (/data/pub/) - Data that has been lightly modified; can be accessed for business use. (IT owned; Business Unit/User read only)

Raw (/data/raw/) - The raw materials used in the database; they are meant to process your data. (IT owned; Business Unit/User read only)

Group (/grp/[bu_team]/[env: dev or qa]/, with proj, core, temp, and, if needed, pub and raw subdirectories) - Houses any group data in use or in development, including projects, data assets, etc.

Sandbox (/sandbox) - Volatile, shared area for testing new concepts.

Training (/training) - Safe area for end users to learn the environment.

Page 6: UserGuideHDFS_FinalDocument


Chapter 2

How can I get started?

Hue vs. Command Prompt

The latest version in use is: hadoop-2.3.0+cdh5.1.5+854

There are two different ways you can view HDFS. Hue is a Hadoop interface that lets you work with your data directly. It is much simpler than using the command prompt because all of your files are organized in a table-like format. If you would like to use Hue, simply go to your company's login page (such as: http://hue.autotrader.com). From there, you will see an area where you enter your Active Directory login (Figure 1).

Once you are logged in, you will see files you can read and/or write using HDFS (Figure 2). Each file is listed under columns such as Name, Size, User, Group, Permissions, and Date. When logged in, your home directory will be /user/LOGINID.

Figure 1

Figure 2

Page 7: UserGuideHDFS_FinalDocument


If you would prefer to use the command prompt, simply open your prompt and use the command:

hdfs dfs

After doing so, you'll have your command prompt brought up. Similar to Hue, your files will be grouped in their respective areas. However, the view is text based rather than a table. For example, you will need to enter a command to reach certain files (Figure 3).

Figure 3

What You Can Access

Any file you create can be accessed through HDFS, and you can view different aspects of the data. The areas you as a user can access are:

Sandbox: /sandbox
Training: /training
Group: /grp/[bu_team]
Users Tree: /user/[yourname]

Because everything in HDFS is considered a file, every bit of information can be opened like a file, right down to the raw data mentioned in the previous chapter. You can interact with the files directly, whether reading and/or writing them, so you will know the information input and output. As a user, you sometimes will not be able to access certain files because of the group you are in. For example, if your business group's area /grp/[bu] does not have you added or is not owned by you, no data will show. You need to be granted rights to the correct groups before you have access.

Page 8: UserGuideHDFS_FinalDocument


Chapter 3

How can I interact with my files?

Shell Commands

HDFS provides shell commands used from the command prompt. These commands interact with HDFS to copy, delete, or change the group of files, among many other changes or additions you may want to make to your file data. These commands can only be run from the command prompt. To start a shell command, enter:

hdfs dfs <command>

count: hdfs dfs -count [-q] [-h] <paths>
Counts the directories, files, and bytes under the paths matching the file pattern you are looking for.

cp: hdfs dfs -cp [-f] URI [URI ...] <dest>
Copies files from a source to a destination. If you copy files from multiple sources, the destination must be a directory.

get: hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst>
Copies files to the local file system.

ls: hdfs dfs -ls <args>
Lists files. Example: hdfs dfs -ls /user/hadoop/test.txt

lsr: hdfs dfs -lsr <args>
Recursive version of ls.

Page 9: UserGuideHDFS_FinalDocument


mkdir: hdfs dfs -mkdir [-p] <paths>
Creates directories at the given path URIs.

mv: hdfs dfs -mv URI [URI ...] <dest>
Moves files from source to destination:
hdfs dfs -mv /user/hadoop/test.txt /user/hadoop/test2.txt

put: hdfs dfs -put <localsrc> ... <dst>
Copies one or more sources from the local file system to the destination file system. It can also read input and write it to the destination file system.

rm: hdfs dfs -rm [-skipTrash] URI [URI ...]
Deletes the files specified as arguments (use a recursive delete for directories with contents). With the [-skipTrash] option, files bypass the trash and are deleted immediately.

text: hdfs dfs -text <src>
Outputs a source file in text format (for example, decompressing a zipped file).

Page 10: UserGuideHDFS_FinalDocument


With Hue If you’re going to use Hue, it is much easier to navigate, download, move, and copy files if you are not

comfortable using the command prompt. As discussed in Chapter 2, Hue’s File Browser is set up like a table

format where you can have access to different files. We are going to break this down a little so it is easier to

understand.

First, you have your main File Browser screen. You will notice in Figure 4, there are

files set up with your /user name. You also have a set of files that give you User,

Group, Permissions, and the Date the file was accessed.

What if you do not have any files into the system just yet? All you have to do is

go to the very right of your screen and click on the Upload drop down box. It

will give you the option of uploading Files or Zip/Tgz files (Figure 5). You can

choose to upload any kind of file you want. For this test, I am going to upload a

docx file (or a Word Doc file).

Now that I have a file uploaded, let’s start at the very top of the File Browser.

There is a search box called Search for file name. You can type in a file you

want to find, and it will pop up under your user table (Figure 6). You can

then check the file you searched for and can begin modifying it (Figure 7).

Note: Remember

there are some files

you may not have

access to because

you have not been

given the proper

permission.

Figure 4

Figure 5

Figure 6 Figure 7

Page 11: UserGuideHDFS_FinalDocument


You now have the option to change and manage your files with different modifications (Figure 8).

Rename

To rename your file, click on the Rename tab, and a field will pop up where you can enter a different name for your file (Figure 9). Note: renaming your file will not corrupt it or change its format.

Move

When you move a file, it can be moved to a typed-in destination, a folder you already have under your /user directory, or a new folder you want to create. If you have a specific destination (such as a folder for another user or group), type it in as /user/[username]/[destination name] and click Move. If you would like to move it to another folder, select the folder of your choice and click Move (Figure 10).

Figure 8

Figure 9

Figure 10

Page 12: UserGuideHDFS_FinalDocument


Copy

Copy allows you to copy a file to a new destination, such as a different folder (Figure 11).

Figure 11

Change Permissions

You have authority over who can Read, Write, and Execute your files. Change Permissions brings up a table where you can set who is allowed to access different areas of your files. For example, if you would like the entire group, rather than just you, to be able to write to your file, check the Group box in the Write column (Figure 12).

Download and Move to Trash

You can download your own files, or the files of other users if you have permission to do so. Move to Trash allows you to delete your selected file (Figure 13).

Now you have a working knowledge of HDFS using both the Command Prompt and Hue!

Figure 12

Figure 13

Page 13: UserGuideHDFS_FinalDocument


Chapter 4

How is my data stored?

Cluster Set Up

Your files are stored in a cluster made up of nodes. There are two kinds of nodes: NameNodes and DataNodes. They rely on each other in order to work properly, and together the nodes make up the cluster as a whole. This is how your files are backed up and loaded whenever you open them.

NameNode

The NameNode is the master node: it keeps track of where the data lives.

DataNode

The DataNode is where the data is actually stored. It keeps the data in blocks, generally in sets of three: each block is duplicated onto DataNode2 and DataNode3.

Because the data is stored in duplicates, it can be split, recovered, and found a lot more easily. If a DataNode holding your Test.txt file gets corrupted or goes down, there are two more copies from which to recover your data. Think of it as a secure backup system.
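The three-copy recovery idea can be sketched as a small Python simulation. Everything here is illustrative: the DataNode names and the placement rule are made up for the example, and real HDFS placement also considers racks and network topology.

```python
# Simulate HDFS-style replication: each block is written to three
# DataNodes, so losing one node still leaves two copies to read from.
# Node names and the placement rule are illustrative only.

REPLICATION = 3
nodes = {"DataNode1": {}, "DataNode2": {}, "DataNode3": {}, "DataNode4": {}}

def write_block(block_id: str, data: bytes) -> list[str]:
    """Store a block on REPLICATION nodes (here: the least-loaded ones)."""
    chosen = sorted(nodes, key=lambda n: len(nodes[n]))[:REPLICATION]
    for n in chosen:
        nodes[n][block_id] = data
    return chosen

def read_block(block_id: str, down: set[str]) -> bytes:
    """Read a block from any live node that holds a copy."""
    for name, store in nodes.items():
        if name not in down and block_id in store:
            return store[block_id]
    raise IOError(f"all replicas of {block_id} lost")

write_block("blk_001", b"Test.txt contents")
# Even with one replica's node down, the data is still readable:
print(read_block("blk_001", down={"DataNode1"}))  # b'Test.txt contents'
```

Only if every node holding a replica fails at once is the block unrecoverable, which is why three copies make a good balance between safety and storage cost.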

Page 14: UserGuideHDFS_FinalDocument


Mapping The Cluster

Within the cluster, you have MapReduce. This programming model makes it easier to filter out certain aspects of a file. For example, suppose you had a large library file and needed to find data containing the word "Contact." MapReduce can find that word for you, with options to delete, modify, or duplicate it. As another example, given a set of similar titles, MapReduce can show you how many times each word was used within the file.

Data Housing and Hardware

All of your data is stored on racked servers in a data center. Multiple machines are used so that your data can be read quickly, rather than relying on one or two machines to scan through your data at a slower pace. For example:

(Figure: reading 1 TB of data - a single machine takes 40 minutes, while a cluster of machines reading in parallel takes 4 minutes.)

Both setups read the same amount of data. However, because the work on the right is split across many machines, you can read and write your data much more quickly. Even with many terabytes of data, HDFS keeps the process fast by spreading it over many servers.

This allows users to work with HDFS without a supercomputer or a highly advanced laptop. You can buy cheap devices and still receive the same quality and speed of information as you would with a more expensive computer.
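The 40-minute vs. 4-minute figure follows from simple scaling: in the ideal case, read time is inversely proportional to the number of machines reading in parallel. Here the per-machine scan rate is derived from the guide's own "1 TB in 40 minutes" figure, not measured. A quick sanity check:

```python
# Sanity-check the 40-minute vs. 4-minute figure: ideal parallel read
# time scales with 1 / number-of-machines. The per-machine rate is
# derived from the guide's own numbers, not a benchmark.

TB = 10**12  # bytes

def read_minutes(data_bytes: float, machines: int, bytes_per_min: float) -> float:
    """Ideal parallel read time: data split evenly across machines."""
    return data_bytes / (machines * bytes_per_min)

rate = TB / 40  # bytes/minute implied by "1 TB in 40 minutes"
print(read_minutes(TB, machines=1, bytes_per_min=rate))   # 40.0
print(read_minutes(TB, machines=10, bytes_per_min=rate))  # 4.0
```

Real clusters fall short of this ideal (network, scheduling, and stragglers), but the intuition holds: more machines, proportionally less time to scan the same data.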

(Figure: word-count example - the input lines "Rose Tile", "Rose Quartz", "Rose Bud" produce the counts Rose, 3; Tile, 1; Quartz, 1; Bud, 1.)

Page 15: UserGuideHDFS_FinalDocument


Chapter 5

Conclusion

The goal of this guide was to teach you how to access and interact with your files using HDFS. We went over what you as the user can read and write, as well as the different interfaces you can use: the Command Prompt and Hue. You now know where your data is stored, how it is managed, and how you can manage it. We hope this guide was helpful in your learning process. If you would like more information, please see the next page for Useful Resources for HDFS.

Thank you for choosing the Cox Automotive HDFS User Guide!

HDFS User Guide - Anna Ellis, Cox Automotive. Updated: 5.18.15

Page 16: UserGuideHDFS_FinalDocument


Useful Resources HDFS

"Apache Hadoop 2.6.0 - HDFS Users Guide." Apache, 2014. Web. 6 May 2015.
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#HDFS_Users_Guide

"CDH: Packaging and Tarball Information." Cloudera, 2015. Web. 13 May 2015.
http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5-1-x/CDH-Version-and-Packaging-Information/cdhvd_cdh_package_tarball.html

Farooqui, Sameer. "Hadoop Tutorial: Intro to HDFS." YouTube, 31 Oct. 2012. Web. 8 May 2015.
https://www.youtube.com/watch?v=ziqx2hJY8Hg

Hall, Marty. "Hadoop Tutorial: HDFS Part 1 -- Overview." Slideshare, 30 Mar. 2013. Web. 11 May 2015.
http://www.slideshare.net/martyhall/hadoop-tutorial-hdfs-part-1-overview

"HDFS Architecture." Apache, 2014. Web. 6 May 2015.
http://archive.cloudera.com/cdh4/cdh/4/hadoop/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#HDFS_Architecture