Hadoop Distributed File System (HDFS) 01/16/2018 1

Hadoop Distributed File System (HDFS)eldawy/18WCS226/slides/CS226-03-HDFS.pdf · A distributed file system Built on the architecture of Google File System (GS) Shares a similar architecture

Aug 08, 2020




Survey Results

Total: 19 responses

18 CS and 1 CEN

15 Master (79%) and 4 PhD (21%)


How many hours did you spend in the first week for the reading assignment?

1 hour (2 responses)

2 hours (7 responses)

3-5 hours (6 responses)

6 and more hours (4 responses)

How many hours per week do you plan to spend studying for the course?

0-5 hours: 5 responses

6-10 hours: 11 responses

> 10 hours:

Additional Comments

No final exam; give higher weight to assignments and the project

More programming assignments and hands-on experience

Solve real problems in big data using cloud platforms, e.g., AWS or Google Cloud Platform

One review per week and increase the word limit to 1000 words

Suggest a book or reference for further reading

Show how big data is used in other fields such as machine learning

HDFS Overview

A distributed file system

Built on the architecture of the Google File System (GFS)

Shares a similar architecture with many other common distributed storage engines such as Amazon S3 and Microsoft Azure

HDFS is a stand-alone storage engine and can be used independently of the query processing engine

HDFS Architecture

[Figure: a name node and several data nodes, each data node holding blocks (B)]

What is where?

Name node: file and directory names, block ordering and locations, capacity of data nodes, architecture of data nodes

Data nodes: block data, name node location

Analogy to Unix FS

The logical view is similar

[Figure: a directory tree rooted at / with entries user, mary, chu, etc, and hadoop]

Analogy to Unix FS

The physical model is comparable

Unix: File1 -> list of inodes -> Block 1, Block 2, Block 3

HDFS: File1 -> list of block locations (metadata) -> blocks (B) on the data nodes
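The file-to-blocks mapping kept by the name node can be pictured with a small in-memory model. This is only a sketch under stated assumptions: the class name, method names, node names, and the path /user/mary/File1 are all hypothetical and are not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of name node metadata; none of these names come from Hadoop.
final class NameNodeMetadataSketch {
    // File path -> ordered list of block IDs (block ordering).
    private final Map<String, List<Long>> fileToBlocks = new HashMap<>();
    // Block ID -> data nodes holding a replica of that block (block locations).
    private final Map<Long, List<String>> blockToNodes = new HashMap<>();

    // Appends a block to the end of a file and records where its replicas live.
    void addBlock(String path, long blockId, List<String> dataNodes) {
        fileToBlocks.computeIfAbsent(path, p -> new ArrayList<>()).add(blockId);
        blockToNodes.put(blockId, new ArrayList<>(dataNodes));
    }

    // For each block of the file, in file order, returns the nodes storing it.
    List<List<String>> locate(String path) {
        List<List<String>> result = new ArrayList<>();
        for (long blockId : fileToBlocks.getOrDefault(path, new ArrayList<>())) {
            result.add(blockToNodes.get(blockId));
        }
        return result;
    }

    public static void main(String[] args) {
        NameNodeMetadataSketch nn = new NameNodeMetadataSketch();
        nn.addBlock("/user/mary/File1", 1L, Arrays.asList("dn1", "dn2", "dn4"));
        nn.addBlock("/user/mary/File1", 2L, Arrays.asList("dn2", "dn3", "dn5"));
        System.out.println(nn.locate("/user/mary/File1"));
    }
}
```

Note that the block data itself never appears in this structure; only names, ordering, and locations do.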

HDFS Create

[Figure: the file creator, the name node, and the data nodes]

The creator process calls the create function, Create(…), which translates to an RPC call at the name node

The name node creates three replicas of the first block:

1. The first replica is assigned to a random machine

2. The second replica is assigned to another random machine in the same rack as the first machine

3. The third replica is assigned to a random machine in another rack
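The three placement steps above can be sketched as a small function. This is a minimal illustration, assuming a hypothetical map from data node name to rack name; the class and method names are made up. It follows the rule exactly as the slide states it. The default policy documented for current Hadoop releases puts the second replica on a different rack instead, so read this as an illustration of the slide rather than of a specific Hadoop version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of the placement rule on the slide; not Hadoop code.
final class ReplicaPlacementSketch {
    // rackOf maps each data node name to its rack name (assumed topology).
    static List<String> chooseReplicas(Map<String, String> rackOf, Random rnd) {
        List<String> nodes = new ArrayList<>(rackOf.keySet());
        // 1. First replica: a random machine.
        String first = nodes.get(rnd.nextInt(nodes.size()));
        String firstRack = rackOf.get(first);

        List<String> sameRack = new ArrayList<>();
        List<String> otherRacks = new ArrayList<>();
        for (String node : nodes) {
            if (node.equals(first)) {
                continue;
            }
            if (rackOf.get(node).equals(firstRack)) {
                sameRack.add(node);
            } else {
                otherRacks.add(node);
            }
        }
        if (sameRack.isEmpty() || otherRacks.isEmpty()) {
            throw new IllegalArgumentException("topology too small for this rule");
        }
        // 2. Second replica: another random machine in the same rack as the first.
        String second = sameRack.get(rnd.nextInt(sameRack.size()));
        // 3. Third replica: a random machine in another rack.
        String third = otherRacks.get(rnd.nextInt(otherRacks.size()));
        return Arrays.asList(first, second, third);
    }

    public static void main(String[] args) {
        Map<String, String> rackOf = new LinkedHashMap<>();
        rackOf.put("dn1", "rack1");
        rackOf.put("dn2", "rack1");
        rackOf.put("dn3", "rack2");
        rackOf.put("dn4", "rack2");
        System.out.println(chooseReplicas(rackOf, new Random()));
    }
}
```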

[Figure sequence: the file creator receives an OutputStream and calls OutputStream#write; the data reaches replicas 1, 2, and 3 on the data nodes]

When a block is filled up, the creator contacts the name node to create the next block
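A toy writer makes the block-by-block behavior concrete: it fills the current block and asks for a new one only when the current block is full. This is a sketch under stated assumptions. The class is hypothetical, allocateBlock stands in for the RPC to the name node, and the block size of 128 bytes is chosen only to keep the numbers small.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical writer that tracks how many bytes land in each block.
final class BlockWriterSketch {
    private final long blockSize;
    private final List<Long> blockLengths = new ArrayList<>();

    BlockWriterSketch(long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("block size must be positive");
        }
        this.blockSize = blockSize;
    }

    // Stands in for the call that asks the name node to create the next block.
    private void allocateBlock() {
        blockLengths.add(0L);
    }

    // Writes numBytes, requesting a new block whenever the current one is full.
    void write(long numBytes) {
        long remaining = numBytes;
        while (remaining > 0) {
            int last = blockLengths.size() - 1;
            if (last < 0 || blockLengths.get(last) == blockSize) {
                allocateBlock();
                last = blockLengths.size() - 1;
            }
            long room = blockSize - blockLengths.get(last);
            long chunk = Math.min(room, remaining);
            blockLengths.set(last, blockLengths.get(last) + chunk);
            remaining -= chunk;
        }
    }

    List<Long> blockLengths() {
        return blockLengths;
    }

    public static void main(String[] args) {
        BlockWriterSketch writer = new BlockWriterSketch(128);
        writer.write(300);
        System.out.println(writer.blockLengths()); // prints [128, 128, 44]
    }
}
```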

Notes about writing to HDFS

Data transfers of replicas are pipelined

The data does not go through the name node

Random writing is not supported

Appending to a file is supported but it creates a new block
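Pipelining means the creator sends each packet once, to the first data node, and every data node forwards it to the next one in the chain. The sketch below models that idea with hypothetical classes; real data nodes stream packets over the network and send acknowledgements back, which this toy omits.

```java
// Hypothetical model of a replication pipeline; not Hadoop code.
final class PipelineSketch {
    static final class DataNode {
        final String name;
        final DataNode next; // next node in the pipeline, or null for the last replica
        final StringBuilder stored = new StringBuilder();

        DataNode(String name, DataNode next) {
            this.name = name;
            this.next = next;
        }

        // Stores the packet locally, then forwards it downstream.
        void receive(String packet) {
            stored.append(packet);
            if (next != null) {
                next.receive(packet);
            }
        }
    }

    public static void main(String[] args) {
        DataNode third = new DataNode("dn3", null);
        DataNode second = new DataNode("dn2", third);
        DataNode first = new DataNode("dn1", second);

        // The creator only ever talks to the first node; the name node is not involved.
        first.receive("packet-1;");
        first.receive("packet-2;");
        System.out.println(third.stored);
    }
}
```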

Self-writing

[Figure: the file creator running on one of the data nodes]

If the file creator is running on one of the data nodes, the first replica is always assigned to that node

Reading from HDFS

Reading is simpler than writing

No replication is needed

Replication can be exploited

Random reading is allowed


HDFS Read

[Figure: the file reader, the name node, and the data nodes]

The reader process calls the open function, open(…), which translates to an RPC call at the name node

The name node locates the first block of that file and returns the address of one of the nodes that store that block

The name node returns an input stream (InputStream) for the file

[Figure: the file reader calls InputStream#read(…)]

When an end-of-block is reached, the name node locates the next block

seek(pos): the InputStream#seek operation locates a block and positions the stream accordingly
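Locating the block for a seek amounts to walking the block lengths until the target position falls inside one of them. The following is a minimal sketch of that arithmetic, assuming the block lengths of the file are known in file order; the class and method names are hypothetical and this is not how the Hadoop client is implemented internally.

```java
import java.util.Arrays;

// Hypothetical helper that maps a file position to a block and an offset.
final class SeekSketch {
    // Returns {blockIndex, offsetWithinBlock} for the byte at position pos,
    // given the length of each block in file order.
    static long[] locate(long pos, long[] blockLengths) {
        if (pos < 0) {
            throw new IllegalArgumentException("negative position");
        }
        long blockStart = 0;
        for (int i = 0; i < blockLengths.length; i++) {
            if (pos < blockStart + blockLengths[i]) {
                return new long[] {i, pos - blockStart};
            }
            blockStart += blockLengths[i];
        }
        throw new IllegalArgumentException("position past end of file");
    }

    public static void main(String[] args) {
        long[] blockLengths = {128, 128, 44};
        System.out.println(Arrays.toString(locate(200, blockLengths))); // prints [1, 72]
    }
}
```

Using per-block lengths rather than a single fixed block size keeps the sketch valid for files whose blocks are not all full.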

Self-reading

When the file reader opens a file or seeks, a replica is chosen as follows:

1. If the block is stored locally on the reader's machine, this replica is chosen

2. If not, a replica on another machine in the same rack is chosen

3. Otherwise, any other replica is chosen at random

When self-reading occurs, HDFS can make it much faster through a feature called short-circuit reads
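The replica choice for reading can be sketched as a three-step preference. This is an illustration only, assuming a hypothetical map from node name to rack name; the class and method names are invented here and do not correspond to Hadoop classes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of locality-aware replica selection for reads.
final class ReadReplicaSketch {
    static String chooseReplica(String reader, List<String> replicas,
                                Map<String, String> rackOf, Random rnd) {
        if (replicas.isEmpty()) {
            throw new IllegalArgumentException("block has no replicas");
        }
        // 1. Prefer a replica stored on the reader's own machine.
        if (replicas.contains(reader)) {
            return reader;
        }
        // 2. Otherwise prefer a replica in the reader's rack.
        String readerRack = rackOf.get(reader);
        if (readerRack != null) {
            for (String node : replicas) {
                if (readerRack.equals(rackOf.get(node))) {
                    return node;
                }
            }
        }
        // 3. Otherwise pick any replica at random.
        return replicas.get(rnd.nextInt(replicas.size()));
    }

    public static void main(String[] args) {
        Map<String, String> rackOf = new HashMap<>();
        rackOf.put("dn1", "rack1");
        rackOf.put("dn2", "rack1");
        rackOf.put("dn3", "rack2");
        System.out.println(chooseReplica("dn1", Arrays.asList("dn3", "dn2"), rackOf, new Random()));
    }
}
```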

Notes About Reading

The API is much richer than the simple open/seek/close API

You can retrieve block locations

You can choose a specific replica to read

The same API is generalized to other file systems including the local FS and S3

Review question: Compare random access read in local file systems to HDFS

HDFS Special Features

Node decommission

Load balancer

Cheap concatenation


Node Decommission

[Figure: data nodes and their blocks (B) during a node decommission]

Load Balancing

[Figure: blocks (B) distributed across the data nodes]

Start the load balancer

Cheap Concatenation

[Figure: the name node holding the metadata of File 1, File 2, and File 3]

Concatenate File 1 + File 2 + File 3 -> File 4

Rather than creating new blocks, HDFS can just change the metadata in the name node to delete File 1, File 2, and File 3, and assign their blocks to a new File 4 in the right order.
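The metadata-only nature of concatenation can be shown with a toy file table. This is a sketch that follows the description above (the sources disappear and a new file takes over their blocks); the class, its methods, and the paths are hypothetical, and no block data is modeled at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical file table: path -> ordered block IDs. No block data is stored here.
final class ConcatSketch {
    private final Map<String, List<Long>> fileToBlocks = new HashMap<>();

    void putFile(String path, List<Long> blockIds) {
        fileToBlocks.put(path, new ArrayList<>(blockIds));
    }

    boolean exists(String path) {
        return fileToBlocks.containsKey(path);
    }

    List<Long> blocksOf(String path) {
        return fileToBlocks.get(path);
    }

    // Metadata-only concatenation: block IDs are reassigned, nothing is copied.
    void concat(String target, List<String> sources) {
        if (new HashSet<>(sources).size() != sources.size()) {
            throw new IllegalArgumentException("sources must be distinct");
        }
        // Validate first so a failure leaves the table unchanged.
        for (String src : sources) {
            if (!fileToBlocks.containsKey(src)) {
                throw new IllegalArgumentException("no such file: " + src);
            }
        }
        List<Long> merged = new ArrayList<>();
        for (String src : sources) {
            merged.addAll(fileToBlocks.remove(src)); // delete the source entry, keep its blocks
        }
        fileToBlocks.put(target, merged);
    }

    public static void main(String[] args) {
        ConcatSketch table = new ConcatSketch();
        table.putFile("/file1", Arrays.asList(1L, 2L));
        table.putFile("/file2", Arrays.asList(3L));
        table.putFile("/file3", Arrays.asList(4L, 5L));
        table.concat("/file4", Arrays.asList("/file1", "/file2", "/file3"));
        System.out.println(table.blocksOf("/file4")); // prints [1, 2, 3, 4, 5]
    }
}
```

The cost of the operation depends on the number of blocks listed, not on the number of bytes they hold, which is why the slide calls it cheap.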

HDFS API

[Figure: class diagram with FileSystem and its implementations DistributedFileSystem, LocalFileSystem, and S3FileSystem, alongside Path and Configuration]

Create the file system:

Configuration conf = new Configuration();
Path path = new Path("…");
FileSystem fs = path.getFileSystem(conf);

// To get the local FS
fs = FileSystem.getLocal(conf);

// To get the default FS
fs = FileSystem.get(conf);

Create a new file:

FSDataOutputStream out = fs.create(path, …);

Delete a file:

fs.delete(path, recursive);
fs.deleteOnExit(path);

Rename a file:

fs.rename(oldPath, newPath);

Open a file:

FSDataInputStream in = fs.open(path, …);

Seek to a different location:

in.seek(pos);
in.seekToNewSource(pos);

Concatenate:

fs.concat(destination, src[]);

Get file metadata:

fs.getFileStatus(path);

Get block locations (the range is given as a start offset and a length):

fs.getFileBlockLocations(path, start, len);