Page 1: MapReduce Over Lustre

MapReduce over Lustre report

David Luan, Simon Huang, GaoShengGong 2008.10~2009.6

Page 2: MapReduce Over Lustre

Outline

• Early research, analysis

• Platform design & improvement

• Test cases, test process design

• Result analysis

• Related jobs (GFS-like redundancy)

• White paper & conclusion

Page 3: MapReduce Over Lustre

Early research, analysis

• HDFS and Lustre overall benchmark tests: IOzone, IOR; WebDAV (an indirect way to mount HDFS) ★

• Hadoop platform overview: MapReduce, the three kinds of Hadoop I/O, shortcomings & bottlenecks

• Lustre platform: module analysis, shortcomings

Page 4: MapReduce Over Lustre

Early research, analysis

Overall Benchmark tests

Page 5: MapReduce Over Lustre

Early research, analysis

[Figure: MapReduce Flow]

• Split the input into key-value pairs; for each K-V pair, call Map.

• Each Map produces a new set of K-V pairs.

• Sort, then for each distinct key call Reduce(K, V[…]), producing one K-V pair per distinct key.

• The output is a set of key-value pairs.

Page 6: MapReduce Over Lustre

Early research, analysis

Hadoop I/O phases

[Figure: Hadoop I/O flow across nodes]

• Submit the job; split it and distribute the tasks: MapTask {1, 2, ... n}, ReduceTask {1, ... m}.

• HDFS read (Map Read): each MapTask reads its InputSplit from HDFS.

• Local R/W: map outputs are written to, and read back from, the local Linux FS.

• HTTP: each ReduceTask shuffles the map results over HTTP.

• HDFS write (Reduce write): each ReduceTask writes its result back to HDFS.

Page 7: MapReduce Over Lustre

Early research, analysis

• Hadoop + HDFS: job/task-level parallelism; compute and storage tightly coupled; HDFS prefers huge files; applications are limited (jobs can be hard to split)
  – Typical apps: distributed grep, distributed sort, log processing, data warehousing

• Lustre: I/O-level parallelism; compute and storage loosely coupled; POSIX-compatible apps
  – Typical apps: supercomputing

Platform comparison

Page 8: MapReduce Over Lustre

Early research, analysis

• HDFS shortcomings:
  – Metadata design
  – No parallel I/O
  – Not general-purpose (designed for MapReduce)

• Lustre shortcomings:
  – Inadequate reliability
  – Inadequate stability
  – No native redundancy

Shortcomings comparison

Page 9: MapReduce Over Lustre

Outline

• Early research, analysis

• Platform design & improvement

• Test cases, test process design

• Result analysis

• Related jobs (GFS-like redundancy)

• White paper & conclusion

Page 10: MapReduce Over Lustre

Platform design & improvement

Two ways:

① Java wrapper for liblustre (without the Lustre client)
Motivation: design a method to merge the two systems. Implement Hadoop’s FileSystem interface with a Java wrapper over liblustre, so MapReduce can work without the Lustre client (a sketch of such a wrapper follows below). This approach reached an impasse (see next slide).

② Use the Lustre client
Design and improvement.
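The idea behind way ①, as a minimal sketch: subclass org.apache.hadoop.fs.FileSystem and delegate each call to liblustre through JNI. The LiblustreNative class, its method names, and the configuration key below are hypothetical placeholders for the JNI bindings (not real liblustre symbols or Hadoop settings), and most FileSystem methods are omitted, so the class stays abstract.

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Sketch only: the remaining abstract FileSystem methods (open, create,
// listStatus, mkdirs, ...) would delegate to natives in the same way.
public abstract class LustreWrapperFileSystem extends FileSystem {

  /** Hypothetical JNI binding to liblustre; these are not real liblustre symbols. */
  static final class LiblustreNative {
    static { System.loadLibrary("hadoop_liblustre"); } // imagined wrapper library
    static native void init(String mountTarget);       // attach to the filesystem
    static native long open(String path, int flags);
    static native int pread(long fd, byte[] buf, int len, long pos);
    static native void close(long fd);
  }

  private URI uri;

  @Override
  public void initialize(URI name, Configuration conf) throws IOException {
    super.initialize(name, conf);
    this.uri = name;
    // "fs.lustre.target" is an assumed configuration key for this sketch.
    LiblustreNative.init(conf.get("fs.lustre.target", "mgsnode@tcp0:/lustre"));
  }

  @Override
  public URI getUri() {
    return uri;
  }
}
```

As the next slide explains, this path was abandoned because JNI mis-links liblustre symbols whose names collide with system calls.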

Page 11: MapReduce Over Lustre

Platform design & improvement

The Java wrapper reached an impasse -_-

• JNI call into liblustre.so fails: Java’s JNI mis-links any function whose name collides with a system call (e.g. mount, read, write). If we instead use C to call the static library (liblustre.a) and compile it into an executable program, it works fine.

• liblustre’s other problems: the Lustre wiki does not recommend using liblustre; when it is used, liblustre.a should be preferred over liblustre.so; liblustre also depends on the gcc version.

Page 12: MapReduce Over Lustre

Platform design & improvement

Platform design (1) advantages:

Advantages for each task (with Lustre):
• Decentralized I/O
• Lustre can write in parallel
• Lustre is a general-purpose file system
• Great for non-splittable jobs

Page 13: MapReduce Over Lustre

Platform design & improvement

Platform design (2) modules

[Figure: platform modules]

• Master: JobTracker / application launcher plus a Lustre client; jobs are submitted here, and it distributes tasks and heartbeats to the slaves.

• Slaves: each runs a TaskTracker plus a Lustre client.

• Lustre servers: MGS/MDS for metadata, and OSSes serving OSTs backed by disks.

Page 14: MapReduce Over Lustre

Platform design & improvement

Platform design (3) read/write

[Figure: read/write flow with Lustre]

• Submit the job; split it and distribute the tasks: MapTask {1, 2, ... n}, ReduceTask {1, ... m}.

• Every MapTask and ReduceTask node runs a Lustre client; all task reads and writes go through Lustre (Lustre R/W).

Page 15: MapReduce Over Lustre

Platform design & improvement

• Use hardlinks instead of the HTTP shuffle before ReduceTask starts [1]: decentralizes network bandwidth usage and delays the ReduceTask’s actual read/write.

• Use Lustre block (stripe) location info to distribute tasks [2]: “move the compute to its data” to save network bandwidth. A Java child thread runs a shell command to fetch the location info (details in the white paper; a sketch of both ideas follows below).

Platform improvement 1
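A rough sketch of the two mechanisms in modern Java, not the original patch: the shell-based location fetch is shown here with ProcessBuilder around the standard Lustre `lfs getstripe` command, and the hardlink step uses java.nio. The class name, paths, and parsing are assumptions, and the scheduler integration is not shown.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class LustreShuffleHelpers {

  /** Run "lfs getstripe" on a file and return its raw stripe-layout output. */
  static List<String> fetchStripeInfo(String file) throws IOException, InterruptedException {
    Process p = new ProcessBuilder("lfs", "getstripe", file)
        .redirectErrorStream(true)
        .start();
    List<String> lines = new ArrayList<>();
    try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
      String line;
      while ((line = r.readLine()) != null) {
        lines.add(line); // the obdidx/objid lines identify which OSTs hold the data
      }
    }
    p.waitFor();
    return lines;
  }

  /** Link a map output file into a reduce task's directory instead of copying it over HTTP. */
  static void linkMapOutput(String mapOutput, String reduceDir) throws IOException {
    Path src = Paths.get(mapOutput);
    Path dst = Paths.get(reduceDir).resolve(src.getFileName());
    Files.createLink(dst, src); // hardlink: same data on the shared FS, no network transfer
  }
}
```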

Page 16: MapReduce Over Lustre

Platform design & improvement

Platform improvement 2

[Figure: platform modules, as in Platform design (2), with two changes]

• Add stripe location info as a scheduling parameter.

• Use hardlinks to delay the shuffle phase.

Page 17: MapReduce Over Lustre

Outline

• Early research, analysis

• Platform design & improvement

• Test cases, test process design

• Result analysis

• Related jobs

• White paper & conclusion

Page 18: MapReduce Over Lustre

Test case design (two kinds of apps)

① Statistical apps (search, log processing, etc.): fine-grained tasks; the MapTask intermediate result is small.

② Poorly splittable, highly complex apps: coarse-grained tasks; the MapTask intermediate result is big; each task is compute-heavy and needs big I/O.

Test cases, test process design

Page 19: MapReduce Over Lustre

Platform design & improvement

Apps that are highly complex and poorly splittable:

• the intermediate result is big

• each task is compute-heavy

Page 20: MapReduce Over Lustre

Test cases, test process design

Test cases:

• Statistical app: WordCount. This test reads text files and counts each word; the output contains a word and its count, separated by a tab (a sketch is shown after this list).

• Poorly splittable app: BigMapOutput. A map/reduce program that works on a very big non-splittable file; its map and reduce tasks simply read the input and output the same data without doing anything else.
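For reference, the WordCount used here is essentially the standard example shipped with Hadoop; a sketch of its mapper and reducer in the old org.apache.hadoop.mapred API (the API current in 2008/2009) is below, with the JobConf driver omitted. The default TextOutputFormat is what separates the word and its count with a tab.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one); // emit (word, 1) for every token
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get(); // add up the 1s for this word
      }
      output.collect(key, new IntWritable(sum)); // written as "word<TAB>count"
    }
  }
}
```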

Page 21: MapReduce Over Lustre

Test cases, test process design

Test results :

• Overall execution time
• Time of each phase:
  – Map Read phase (the most time-consuming on Lustre)
  – Local read/write and HTTP phase
  – Reduce write phase

Page 22: MapReduce Over Lustre

Test cases, test process design

Test scenarios:

• No optimization
• Use hardlinks
• Use hardlinks and location info
• Lustre tuning
  – Stripe size = ?
  – Stripe count = ?

Page 23: MapReduce Over Lustre

Outline

• Early research, analysis

• Platform design & improvement

• Test cases, test process design

• Result analysis

• Related jobs (GFS-like redundancy)

• White paper & conclusion

Page 24: MapReduce Over Lustre

Result analysis

• Result analysis

• Conclusion

Page 25: MapReduce Over Lustre

Result analysis

• Test 1: WordCount with one big file
  – process one big text file (6 GB)
  – block size = 32 MB
  – reduce tasks = 0.95 (or 1.75) × 2 × 7 = 13
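The reduce-task count matches Hadoop's usual sizing heuristic (assuming 7 slave nodes with mapred.tasktracker.reduce.tasks.maximum = 2): reduces ≈ 0.95 (or 1.75) × nodes × reduce slots per node = 0.95 × 7 × 2 ≈ 13.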

Page 26: MapReduce Over Lustre

Result analysis

• Test 2: WordCount with many small files
  – process a large number of small files (10,000)
  – reduce tasks = 0.95 × 2 × 7 = 13

Page 27: MapReduce Over Lustre

Result analysis

• Test 3: BigMapOutput with one big file

• Result 1

• Result 2 (fresh memory)

• Result 3 (mapred.local.dir set to its default value)

Page 28: MapReduce Over Lustre

Result analysis

• Test 4: BigMapOutput with hardlinks

• Test 5: BigMapOutput with hardlinks & location information

Page 29: MapReduce Over Lustre

Result analysis

• Test 6: BigMapOutput, Map Read phase

• Conclusion: Map Read is the most time-consuming part ★

Page 30: MapReduce Over Lustre

Result analysis

Conclusion 1: Hadoop + HDFS

[Figure: Hadoop I/O phases, as on page 6]

• Map Read from HDFS, local read/write of intermediate data, HTTP shuffle, then Reduce write back to HDFS.

Page 31: MapReduce Over Lustre

Result analysis

Conclusion 2: Hadoop + Lustre

[Figure: read/write flow with Lustre, as on page 14]

• HDFS block location fits Hadoop's task-distribution algorithm better than Lustre stripe info does.

• This makes Map Read the most time-consuming part.

Page 32: MapReduce Over Lustre

Result analysis

• Dig into the logs (each task's execution time, especially map-read)

Page 33: MapReduce Over Lustre

Outline

• Early research, analysis

• Platform design & improvement

• Test cases, test process design

• Result analysis

• Related jobs (GFS-like redundancy)

• White paper & conclusion

Page 34: MapReduce Over Lustre

Related jobs

GFS-like redundancy design

• Motivation: Lustre has no native redundancy; RAID is expensive; this would be a new feature for Lustre.

• Code analysis & HLD design

• Challenges for design

Page 35: MapReduce Over Lustre

Related jobs

Lustre's inode: inode (*, *, *, …, {obj1, obj2, obj3, …})

Page 36: MapReduce Over Lustre

Related jobs

Raw HLD thinking 1

• Modified inode structure: make the inode contain 3 object arrays:
  inode (*, *, *, …, {obj11, obj12, obj13, …}, {obj21, obj22, obj23, …}, {obj31, obj32, obj33, …})

• File read: the client reads the first group; if the first is damaged, it reads the second, and so on (a sketch of this fallback follows below).

• File write: the client writes the three object arrays one by one.

• File consistency (all done by the client).

• Streaming replication (-_-): Client → {OST, …} → {OST, …} → {OST, …}, like a write chain; some work shifts to the OSTs.

• Automatic recovery (-_-): if one of a file's object groups is damaged, the system automatically recovers it from another backup.
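As a conceptual illustration of the read fallback only (plain Java, not Lustre client code; ReplicaGroup and its methods are hypothetical):

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical view of one replicated object group of a file. */
interface ReplicaGroup {
  boolean isDamaged();
  byte[] read(long offset, int length) throws IOException;
}

class RedundantReader {
  /** Try each object group in order until one serves the read. */
  static byte[] read(List<ReplicaGroup> groups, long offset, int length) throws IOException {
    IOException last = null;
    for (ReplicaGroup g : groups) {
      if (g.isDamaged()) {
        continue;                        // skip groups already known to be bad
      }
      try {
        return g.read(offset, length);   // first healthy group wins
      } catch (IOException e) {
        last = e;                        // damaged mid-read: fall through to the next group
      }
    }
    throw last != null ? last : new IOException("all object groups damaged");
  }
}
```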

Page 37: MapReduce Over Lustre

Related jobs

Raw HLD thinking 2 : challenges

• No rack information (where should redundant copies be placed?)
• An OST cannot write to another OST (so no streaming redundancy chain?)
• File consistency
• Lustre is changing fast (pools, etc.)
• The internship is time-limited

Page 38: MapReduce Over Lustre

• Early research, analysis

• Platform design & improvement

• Test cases, test process design

• Result analysis

• Related jobs

• White paper & conclusion

Outline

Page 39: MapReduce Over Lustre

White paper & conclusion

• White paper (hadoop_lustre_wp_v0.4.2.pdf)

• Thanks to

Page 40: MapReduce Over Lustre

White paper & conclusion

Thanks a lot to our mentors and manager

• Mentors: WangDi, HuangHua
• Manager: Nirant Puntambekar

Page 41: MapReduce Over Lustre

White paper & conclusion

Q&A!

Email me: [email protected]