Page 1: MapReduce Over Lustre

MapReduce over Lustre report

David Luan, Simon Huang, GaoShengGong 2008.10~2009.6

Page 2: MapReduce Over Lustre

Outline

• Early research, analysis

• Platform design & improvement

• Test cases, test process design

• Result analysis

• Related jobs (GFS-like redundancy)

• White paper & conclusion

Page 3: MapReduce Over Lustre

Early research, analysis

• HDFS and Lustre overall benchmark tests: IOzone, IOR; WebDAV (an indirect way to mount HDFS) ★

• Hadoop platform overview: MapReduce, the three kinds of Hadoop I/O, shortcomings & bottlenecks

• Lustre platform: module analysis, shortcomings

Page 4: MapReduce Over Lustre

Early research, analysis

Overall Benchmark tests

Page 5: MapReduce Over Lustre

Early research, analysis

[Figure: MapReduce Flow]

• Split the input into key-value pairs; for each K-V pair, call Map.

• Each Map produces a new set of K-V pairs.

• Sort, then for each distinct key call Reduce(K, V[…]), producing one K-V pair per distinct key.

• The output is a set of key-value pairs.

Page 6: MapReduce Over Lustre

Early research, analysis

Hadoop I/O phases

[Figure: Hadoop I/O flow across nodes]

• Submit the job; split it and distribute the tasks: MapTask {1, 2, ... n}, ReduceTask {1, ... m}.

• HDFS read (Map Read): each MapTask reads its InputSplit from HDFS.

• Local R/W: map outputs are written to, and read back from, the local Linux FS.

• HTTP: each ReduceTask shuffles the map results over HTTP.

• HDFS write (Reduce write): each ReduceTask writes its result back to HDFS.

Page 7: MapReduce Over Lustre

Early research, analysis

• Hadoop + HDFS: job/task-level parallelism; compute and storage tightly coupled; HDFS prefers huge files; applications are limited (jobs can be hard to split)
  – Typical apps: distributed grep, distributed sort, log processing, data warehousing

• Lustre: I/O-level parallelism; compute and storage loosely coupled; POSIX-compatible apps
  – Typical apps: supercomputing

Platform comparison

Page 8: MapReduce Over Lustre

Early research, analysis

• HDFS shortcomings:
  – Metadata design
  – No parallel I/O
  – Not general-purpose (designed for MapReduce)

• Lustre shortcomings:
  – Inadequate reliability
  – Inadequate stability
  – No native redundancy

Shortcomings comparison

Page 9: MapReduce Over Lustre

Outline

• Early research, analysis

• Platform design & improvement

• Test cases, test process design

• Result analysis

• Related jobs (GFS-like redundancy)

• White paper & conclusion

Page 10: MapReduce Over Lustre

Platform design & improvement

Two ways:

① Java wrapper for liblustre (without the Lustre client)
Motivation: design a method to merge the two systems. Implement Hadoop’s FileSystem interface with a Java wrapper over liblustre, so MapReduce can work without the Lustre client (a sketch of such a wrapper follows below). This approach reached an impasse (see next slide).

② Use the Lustre client
Design and improvement.
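The idea behind way ①, as a minimal sketch: subclass org.apache.hadoop.fs.FileSystem and delegate each call to liblustre through JNI. The LiblustreNative class, its method names, and the configuration key below are hypothetical placeholders for the JNI bindings (not real liblustre symbols or Hadoop settings), and most FileSystem methods are omitted, so the class stays abstract.

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Sketch only: the remaining abstract FileSystem methods (open, create,
// listStatus, mkdirs, ...) would delegate to natives in the same way.
public abstract class LustreWrapperFileSystem extends FileSystem {

  /** Hypothetical JNI binding to liblustre; these are not real liblustre symbols. */
  static final class LiblustreNative {
    static { System.loadLibrary("hadoop_liblustre"); } // imagined wrapper library
    static native void init(String mountTarget);       // attach to the filesystem
    static native long open(String path, int flags);
    static native int pread(long fd, byte[] buf, int len, long pos);
    static native void close(long fd);
  }

  private URI uri;

  @Override
  public void initialize(URI name, Configuration conf) throws IOException {
    super.initialize(name, conf);
    this.uri = name;
    // "fs.lustre.target" is an assumed configuration key for this sketch.
    LiblustreNative.init(conf.get("fs.lustre.target", "mgsnode@tcp0:/lustre"));
  }

  @Override
  public URI getUri() {
    return uri;
  }
}
```

As the next slide explains, this path was abandoned because JNI mis-links liblustre symbols whose names collide with system calls.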

Page 11: MapReduce Over Lustre

Platform design & improvement

The Java wrapper reached an impasse -_-

• JNI call into liblustre.so fails: Java’s JNI mis-links any function whose name collides with a system call (e.g. mount, read, write). If we instead use C to call the static library (liblustre.a) and compile it into an executable program, it works fine.

• liblustre’s other problems: the Lustre wiki does not recommend using liblustre; when it is used, liblustre.a should be preferred over liblustre.so; liblustre also depends on the gcc version.

Page 12: MapReduce Over Lustre

Platform design & improvement

Platform design (1) advantages:

Advantages for each task (with Lustre):
• Decentralized I/O
• Lustre can write in parallel
• Lustre is a general-purpose file system
• Great for non-splittable jobs

Page 13: MapReduce Over Lustre

Platform design & improvement

Platform design (2) modules

[Figure: platform modules]

• Master: JobTracker / application launcher plus a Lustre client; jobs are submitted here, and it distributes tasks and heartbeats to the slaves.

• Slaves: each runs a TaskTracker plus a Lustre client.

• Lustre servers: MGS/MDS for metadata, and OSSes serving OSTs backed by disks.

Page 14: MapReduce Over Lustre

Platform design & improvement

Platform design (3) read/write

[Figure: read/write flow with Lustre]

• Submit the job; split it and distribute the tasks: MapTask {1, 2, ... n}, ReduceTask {1, ... m}.

• Every MapTask and ReduceTask node runs a Lustre client; all task reads and writes go through Lustre (Lustre R/W).

Page 15: MapReduce Over Lustre

Platform design & improvement

• Use hardlinks instead of the HTTP shuffle before ReduceTask starts [1]: decentralizes network bandwidth usage and delays the ReduceTask’s actual read/write.

• Use Lustre block (stripe) location info to distribute tasks [2]: “move the compute to its data” to save network bandwidth. A Java child thread runs a shell command to fetch the location info (details in the white paper; a sketch of both ideas follows below).

Platform improvement 1
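A rough sketch of the two mechanisms in modern Java, not the original patch: the shell-based location fetch is shown here with ProcessBuilder around the standard Lustre `lfs getstripe` command, and the hardlink step uses java.nio. The class name, paths, and parsing are assumptions, and the scheduler integration is not shown.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class LustreShuffleHelpers {

  /** Run "lfs getstripe" on a file and return its raw stripe-layout output. */
  static List<String> fetchStripeInfo(String file) throws IOException, InterruptedException {
    Process p = new ProcessBuilder("lfs", "getstripe", file)
        .redirectErrorStream(true)
        .start();
    List<String> lines = new ArrayList<>();
    try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
      String line;
      while ((line = r.readLine()) != null) {
        lines.add(line); // the obdidx/objid lines identify which OSTs hold the data
      }
    }
    p.waitFor();
    return lines;
  }

  /** Link a map output file into a reduce task's directory instead of copying it over HTTP. */
  static void linkMapOutput(String mapOutput, String reduceDir) throws IOException {
    Path src = Paths.get(mapOutput);
    Path dst = Paths.get(reduceDir).resolve(src.getFileName());
    Files.createLink(dst, src); // hardlink: same data on the shared FS, no network transfer
  }
}
```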

Page 16: MapReduce Over Lustre

Platform design & improvement

Platform improvement 2

[Figure: platform modules, as in Platform design (2), with two changes]

• Add stripe location info as a scheduling parameter.

• Use hardlinks to delay the shuffle phase.

Page 17: MapReduce Over Lustre

Outline

• Early research, analysis

• Platform design & improvement

• Test cases, test process design

• Result analysis

• Related jobs

• White paper & conclusion

Page 18: MapReduce Over Lustre

Test case design (two kinds of apps)

① Statistical apps (search, log processing, etc.): fine-grained tasks; the MapTask intermediate result is small.

② Poorly splittable, highly complex apps: coarse-grained tasks; the MapTask intermediate result is big; each task is compute-heavy and needs big I/O.

Test cases, test process design

Page 19: MapReduce Over Lustre

Platform design & improvement

Apps that are highly complex and poorly splittable:

• the intermediate result is big

• each task is compute-heavy

Page 20: MapReduce Over Lustre

Test cases, test process design

Test cases:

• Statistical app: WordCount. This test reads text files and counts each word; the output contains a word and its count, separated by a tab (a sketch is shown after this list).

• Poorly splittable app: BigMapOutput. A map/reduce program that works on a very big non-splittable file; its map and reduce tasks simply read the input and output the same data without doing anything else.
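For reference, the WordCount used here is essentially the standard example shipped with Hadoop; a sketch of its mapper and reducer in the old org.apache.hadoop.mapred API (the API current in 2008/2009) is below, with the JobConf driver omitted. The default TextOutputFormat is what separates the word and its count with a tab.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one); // emit (word, 1) for every token
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get(); // add up the 1s for this word
      }
      output.collect(key, new IntWritable(sum)); // written as "word<TAB>count"
    }
  }
}
```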

Page 21: MapReduce Over Lustre

Test cases, test process design

Test results :

• Overall execution time
• Time of each phase:
  – Map Read phase (the most time-consuming on Lustre)
  – Local read/write and HTTP phase
  – Reduce write phase

Page 22: MapReduce Over Lustre

Test cases, test process design

Test scenarios:

• No optimization
• Use hardlinks
• Use hardlinks and location info
• Lustre tuning
  – Stripe size = ?
  – Stripe count = ?

Page 23: MapReduce Over Lustre

Outline

• Early research, analysis

• Platform design & improvement

• Test cases, test process design

• Result analysis

• Related jobs (GFS-like redundancy)

• White paper & conclusion

Page 24: MapReduce Over Lustre

Result analysis

• Result analysis

• Conclusion

Page 25: MapReduce Over Lustre

Result analysis

• Test 1: WordCount with one big file
  – process one big text file (6 GB)
  – block size = 32 MB
  – reduce tasks = 0.95 (or 1.75) × 2 × 7 = 13
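The reduce-task count matches Hadoop's usual sizing heuristic (assuming 7 slave nodes with mapred.tasktracker.reduce.tasks.maximum = 2): reduces ≈ 0.95 (or 1.75) × nodes × reduce slots per node = 0.95 × 7 × 2 ≈ 13.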

Page 26: MapReduce Over Lustre

Result analysis

• Test 2: WordCount with many small files
  – process a large number of small files (10,000)
  – reduce tasks = 0.95 × 2 × 7 = 13

Page 27: MapReduce Over Lustre

Result analysis

• Test 3: BigMapOutput with one big file

• Result 1

• Result 2 (fresh memory)

• Result 3 (mapred.local.dir set to its default value)

Page 28: MapReduce Over Lustre

Result analysis

• Test 4: BigMapOutput with hardlinks

• Test 5: BigMapOutput with hardlinks & location information

Page 29: MapReduce Over Lustre

Result analysis

• Test 6: BigMapOutput, Map Read phase

• Conclusion: Map Read is the most time-consuming part ★

Page 30: MapReduce Over Lustre

Result analysis

Conclusion 1: Hadoop + HDFS

[Figure: Hadoop I/O phases, as on page 6]

• Map Read from HDFS, local read/write of intermediate data, HTTP shuffle, then Reduce write back to HDFS.

Page 31: MapReduce Over Lustre

Result analysis

Conclusion 2: Hadoop + Lustre

[Figure: read/write flow with Lustre, as on page 14]

• HDFS block location fits Hadoop's task-distribution algorithm better than Lustre stripe info does.

• This makes Map Read the most time-consuming part.

Page 32: MapReduce Over Lustre

Result analysis

• Dig into the logs (each task's execution time, especially map-read)

Page 33: MapReduce Over Lustre

Outline

• Early research, analysis

• Platform design & improvement

• Test cases, test process design

• Result analysis

• Related jobs (GFS-like redundancy)

• White paper & conclusion

Page 34: MapReduce Over Lustre

Related jobs

GFS-like redundancy design

• Motivation: Lustre has no native redundancy; RAID is expensive; this would be a new feature for Lustre.

• Code analysis & HLD design

• Challenges for design

Page 35: MapReduce Over Lustre

Related jobs

Lustre's inode: inode (*, *, *, …, {obj1, obj2, obj3, …})

Page 36: MapReduce Over Lustre

Related jobs

Raw HLD thinking 1

• Modified inode structure: make the inode contain 3 object arrays:
  inode (*, *, *, …, {obj11, obj12, obj13, …}, {obj21, obj22, obj23, …}, {obj31, obj32, obj33, …})

• File read: the client reads the first group; if the first is damaged, it reads the second, and so on (a sketch of this fallback follows below).

• File write: the client writes the three object arrays one by one.

• File consistency (all done by the client).

• Streaming replication (-_-): Client → {OST, …} → {OST, …} → {OST, …}, like a write chain; some work shifts to the OSTs.

• Automatic recovery (-_-): if one of a file's object groups is damaged, the system automatically recovers it from another backup.
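As a conceptual illustration of the read fallback only (plain Java, not Lustre client code; ReplicaGroup and its methods are hypothetical):

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical view of one replicated object group of a file. */
interface ReplicaGroup {
  boolean isDamaged();
  byte[] read(long offset, int length) throws IOException;
}

class RedundantReader {
  /** Try each object group in order until one serves the read. */
  static byte[] read(List<ReplicaGroup> groups, long offset, int length) throws IOException {
    IOException last = null;
    for (ReplicaGroup g : groups) {
      if (g.isDamaged()) {
        continue;                        // skip groups already known to be bad
      }
      try {
        return g.read(offset, length);   // first healthy group wins
      } catch (IOException e) {
        last = e;                        // damaged mid-read: fall through to the next group
      }
    }
    throw last != null ? last : new IOException("all object groups damaged");
  }
}
```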

Page 37: MapReduce Over Lustre

Related jobs

Raw HLD thinking 2 : challenges

• No rack information (where should redundant copies be placed?)
• An OST cannot write to another OST (so no streaming redundancy chain?)
• File consistency
• Lustre is changing fast (pools, etc.)
• The internship is time-limited

Page 38: MapReduce Over Lustre

• Early research, analysis

• Platform design & improvement

• Test cases, test process design

• Result analysis

• Related jobs

• White paper & conclusion

Outline

Page 39: MapReduce Over Lustre

White paper & conclusion

• White paper (hadoop_lustre_wp_v0.4.2.pdf)

• Thanks to

Page 40: MapReduce Over Lustre

White paper & conclusion

Thanks a lot to our mentors and manager

• Mentors: WangDi, HuangHua
• Manager: Nirant Puntambekar

Page 41: MapReduce Over Lustre

White paper & conclusion

Q&A!

Email me: [email protected]