PARALLEL TREE MANIPULATION Islam Atta

PARALLEL TREE MANIPULATION Islam Atta. Sources Islam Atta, Hisham El-Shishiny. System and method for parallel processing. 20110016153 TX, US, 2010. Experimental.

Jan 17, 2018


Copyright © 2010-2012 Islam Atta 2

Sources

• Islam Atta, Hisham El-Shishiny. System and method for parallel processing. 20110016153 TX, US, 2010.

• Experimental evaluation was done as part of course work at UofT (ECE1749H, ECE1755H).


This talk is about …

• Manipulation algorithm for Tree data structures

• Parallel Programming domain

• Novel Tree representation using Linear Arrays


Trees…?

• Widely used
  • Hierarchical data (financial data, NLP, machine vision, GIS, DNA, protein sequences…)
  • Indexing/hashing (search engines…)

• Tree Manipulation context
  • Full traversal of a tree (or sub-tree) for read or write accesses.
  • E.g. CBIR, DNA sequence alignment


Problem

• Categorized as Non-Uniform Random Access structures
  • Bad spatial locality
  • Incurring high miss rates

• Worse for multiprocessing (Berkeley, 2006)
  • Requires high on-/off-chip bandwidth


Tree Representation

[Figure: example tree with nodes A to V, shown level by level]

A
B C D
E F G H I J K
L M N O P Q R
S T U V

Memory Layout

[Figure: nodes A to V placed in memory; annotation: "No GAPS"]


Multiprocessing Platforms

• Non-shared memory architectures (Cell BE, Blue Gene, Intel SCC)
  • Explicit message passing: many small messages

• Shared memory architectures (Intel Quad-core)
  • Coherent cache banks: cache blocks grabbed when referenced

[Figure: memory layout of nodes A to V]

Optimal Solution: Reallocate tree elements in memory to form contiguous memory regions


Organization & Representation

Evaluation

Discussion & Conclusion

GOAL: Allocate tree elements to promote Spatial Locality

Organization & Representation


Polish Notation

Arithmetic example
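
The arithmetic figure from this slide did not survive in the transcript. As an illustrative sketch only (the expression and the helper name `to_polish` are assumptions, not taken from the talk), Polish (prefix) notation writes each operator before its operands, which is the same order as a pre-order walk of the expression tree:

```python
# Expression tree as nested tuples: (operator, left, right); leaves are strings.
# Hypothetical example expression: (a + b) * (c - d)
expr = ("*", ("+", "a", "b"), ("-", "c", "d"))

def to_polish(node):
    """Return the prefix (Polish) token list via a pre-order walk."""
    if isinstance(node, str):
        return [node]
    op, left, right = node
    return [op] + to_polish(left) + to_polish(right)

tokens = to_polish(expr)
print(" ".join(tokens))  # * + a b - c d
```

The point of the example is that a tree can be written as a flat sequence without losing its structure, which is the idea the linear tree layout builds on.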


Linear Tree: Depth-First Ordering

[Figure: example tree with nodes A to V, shown level by level]

A
B C D
E F G H I J K
L M N O P Q R
S T U V
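
A minimal sketch of depth-first linearization, assuming the tree is available as a mapping from each node to its ordered children. The six-node tree and the function name `linearize` are hypothetical; the edges of the slide's A to V tree cannot be recovered from the transcript.

```python
# Hypothetical tree: each key maps to its ordered list of children.
children = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [], "E": [], "F": [],
}

def linearize(root, children):
    """Return node labels in depth-first (pre-order) order, using an explicit stack."""
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(children[node]))
    return order

linear = linearize("A", children)
print(linear)  # ['A', 'B', 'D', 'E', 'C', 'F']
```

Each node is written once, immediately followed by all of its descendants, so the resulting array has no gaps between a parent and its sub-tree.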


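Recursive Contiguity

[Figure: example tree with nodes A to V, shown level by level]

A
B C D
E F G H I J K
L M N O P Q R
S T U V

Linear array: A B C D E F G H I J K L M N O P Q R S T U V

Sub-tree examples:

• Sub-tree N (N; S T) as a linear array: N S T
• Sub-tree C (C; G H I; O P Q) as a linear array: C G H I O P Q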
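
The property can be sketched as follows: in a depth-first array, the sub-tree rooted at index `i` is the slice starting at `i` whose length is the sub-tree's node count. The edges used here are an assumption (C has children G, H, I and I has children O, P, Q), chosen so the array matches the slide's `C G H I O P Q`; the helper names are hypothetical.

```python
# Pre-order array for sub-tree C. Assumed edges: C -> G, H, I and I -> O, P, Q.
data = ["C", "G", "H", "I", "O", "P", "Q"]
parent = [-1, 0, 0, 0, 3, 3, 3]  # index of each node's parent, -1 for the root

def subtree_sizes(parent):
    """Count nodes per sub-tree; works because children follow parents in pre-order."""
    size = [1] * len(parent)
    for i in range(len(parent) - 1, 0, -1):
        size[parent[i]] += size[i]
    return size

def subtree(data, size, i):
    """The sub-tree rooted at index i is the contiguous slice [i, i + size[i])."""
    return data[i:i + size[i]]

size = subtree_sizes(parent)
print(size)                    # [7, 1, 1, 4, 1, 1, 1]
print(subtree(data, size, 3))  # ['I', 'O', 'P', 'Q']
```

Because the same rule applies at every node, any sub-tree can be handed to a worker as a single (start, length) pair.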


Spatial Locality

• Non-shared memory architectures
  • Explicit message passing: few large messages

• Shared memory architectures
  • Minimal false sharing

[Figure: linear array layout of nodes A to V]


Representation

• Data Array

• Parents Reference Array

• Children Reference Array
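
The transcript lists the three arrays but not their exact contents. One plausible reading, shown here as a sketch under that assumption, is that all three are indexed by a node's position in the depth-first array: `data[i]` holds the payload, `parent[i]` the parent's index, and `children[i]` the indices of the node's children. The tree and the function name `build_arrays` are hypothetical.

```python
# Hypothetical tree given as child lists (edges are made up for illustration).
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": [], "D": [], "E": []}

def build_arrays(root, tree):
    """Lay nodes out in depth-first order and record parent/children indices."""
    data, parent, children = [], [], []

    def visit(label, parent_index):
        index = len(data)
        data.append(label)
        parent.append(parent_index)
        children.append([])
        if parent_index >= 0:
            children[parent_index].append(index)
        for child in tree[label]:
            visit(child, index)

    visit(root, -1)
    return data, parent, children

data, parent, children = build_arrays("A", tree)
print(data)      # ['A', 'B', 'D', 'E', 'C']
print(parent)    # [-1, 0, 1, 1, 0]
print(children)  # [[1, 4], [2, 3], [], [], []]
```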


First-Child/Sibling

• Data Array

• Parents Reference Array

• First-Child Reference Array

• Siblings' Reference Array
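
A minimal sketch of the first-child/sibling variant, assuming each reference array stores an index into the data array and uses -1 for "no such node". The five-node tree and the helper `children_of` are hypothetical. The benefit over per-node child lists is that every array has exactly one fixed-size entry per node.

```python
# A hypothetical five-node tree in depth-first order; -1 means "no such node".
# Edges: A -> B, C and B -> D, E.
data         = ["A", "B", "D", "E", "C"]
parent       = [-1, 0, 1, 1, 0]
first_child  = [1, 2, -1, -1, -1]
next_sibling = [-1, 4, 3, -1, -1]

def children_of(i):
    """Yield child indices of node i by following first-child then sibling links."""
    child = first_child[i]
    while child != -1:
        yield child
        child = next_sibling[child]

print([data[c] for c in children_of(0)])  # ['B', 'C']
print([data[c] for c in children_of(1)])  # ['D', 'E']
```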


Scheduling Algorithm

• Designed for Cell BE and Blue Gene/L

• Message-passing (DMA, MPI, mailboxes)

• Challenges
  • Unbalanced trees with varying computation complexity
  • Limited local storage
  • Larger data chunks, 128 byte aligned

• Algorithm properties
  • Master-slave
  • Dynamic scheduling of sub-workloads
  • Work-stealing: coordinated by the master
  • Double buffering
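
The dynamic scheduling idea can be sketched with a shared task queue: a master hands out contiguous sub-tree ranges and idle workers pull the next one. This is a simplified stand-in using Python threads, not the Cell BE or Blue Gene/L implementation; it omits DMA, alignment, work-stealing, and double buffering, and the chunk boundaries and payload values are hypothetical.

```python
import queue
import threading

# Linear tree payload (hypothetical values) and hypothetical contiguous
# sub-tree ranges [start, end) that the master hands out as work items.
data = list(range(1, 23))            # 22 nodes, like A..V
chunks = [(0, 4), (4, 11), (11, 18), (18, 22)]

def worker(tasks, results):
    """Pull ranges until the master's sentinel arrives; process each range once."""
    while True:
        item = tasks.get()
        if item is None:             # sentinel: no more work
            break
        start, end = item
        results.put(sum(data[start:end]))  # stand-in for per-node computation

def run(num_workers=3):
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for chunk in chunks:             # master: workers pull chunks when idle
        tasks.put(chunk)
    for _ in threads:                # one sentinel per worker
        tasks.put(None)
    for t in threads:
        t.join()
    return sum(results.get() for _ in chunks)

total = run()
print(total)  # 253
```

Because each work item is just a (start, end) pair over one array, a worker with limited local storage could fetch its whole chunk in a single large transfer.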


Organization & Representation

Evaluation

Discussion & Conclusion

Implementation and comparison

Evaluation


Methodology

• Application: Sequence Alignment problem
  • DNA, RNA, protein, NLP, Financial data

• Implementation: pthreads on x86 Intel machines
  • UG: Quad-core
  • Kodos: 2-socket quad-core
  • Kang: 4-socket dual-core

• Data Cache Simulation

• In-memory Trees


Memory Access Time

• Naïve sequence alignment consists of only read/write operations.

• Random:
  • Sub-linear increase up to 4 threads.
  • Saturates after 4 threads

• Linear:
  • Sequential: 2.7X gain
  • Hit memory-wall

[Figure: Time per node (usec/node) vs. Threads (1 to 32) for UG-Random and UG-Linear]


Quad-core vs. 2-socket Quad-core

• Kodos-Random:
  • Sub-linear increase before 8 threads
  • Saturates after 8 threads

• Kodos-Linear:
  • Similar to UG-Linear

[Figure: Time per node (usec/node) vs. Threads (1 to 32) for UG-Random, UG-Linear, Kodos-Random, and Kodos-Linear]


Interconnect Effect

• Assume that UG has a perfect interconnect and compare against it.

$$\text{Ratio} = \frac{\mathit{Time}_{\mathit{Kang/Kodos}} - \mathit{Time}_{\mathit{UG}}}{\mathit{Time}_{\mathit{UG}}}$$

[Figure: Ratio vs. Threads (1 to 32) for Kodos and Kang]
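
As a worked illustration of the ratio plotted on this slide, the sketch below computes the relative overhead of a multi-socket machine against the UG baseline. The function name and the timings are hypothetical, not measurements from the talk.

```python
def interconnect_overhead(time_multi_socket, time_ug):
    """Relative slowdown of a multi-socket machine against the UG baseline."""
    return (time_multi_socket - time_ug) / time_ug

# Hypothetical timings in usec/node, not measurements from the talk.
overhead = interconnect_overhead(3.0, 1.5)
print(overhead)  # 1.0
```

A value of 0 means the machine matches UG; a value of 1.0 means it takes twice as long per node.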


Effective Bandwidth

$$\text{Bandwidth} = \frac{\text{MissCount} \times \text{CacheLineSize}}{\text{TotalAccessTime}}$$

[Figure: Bandwidth (Gbps) vs. Threads (1 to 32) for Random and Linear; three panels: Kang, Kodos, UG]
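
A small sketch of the bandwidth formula, assuming the cache line size is given in bytes and that "Gbps" on the axis means gigabits per second. The function name and the input values are hypothetical.

```python
def effective_bandwidth_gbps(miss_count, cache_line_bytes, total_access_time_s):
    """Bandwidth = MissCount x CacheLineSize / TotalAccessTime, in gigabits/s."""
    bits_moved = miss_count * cache_line_bytes * 8
    return bits_moved / total_access_time_s / 1e9

# Hypothetical inputs: 10 million misses, 64-byte lines, 2 seconds.
bw = effective_bandwidth_gbps(10_000_000, 64, 2.0)
print(round(bw, 2))  # 2.56
```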


Modeling Real Computation

• Computation scaling modeled as SQRT(number of threads)

• 1.6 – 3.05 X for 1 – 4 threads

[Figure: Execution Time (sec) vs. Threads (1 to 32) for UG-Random and UG-Linear]


Other Experimental Results

Metric                          Random    Linear
Miss Rate (L2)                  14%       1.6%
Sequential/Parallel fractions   Sequential is 10% with minor improvement for Linear
Load balancing                  Maximum 4% deviation (no work-stealing required)
Stalling on Locks               No difference
Memory size ratio               1         0.32


Organization & Representation

Evaluation

Discussion & Conclusion

Practical considerations & Potential work

Discussion & Conclusion


Discussion

• Limitations:
  • Only shared memory architectures
  • Max tree size: 4G Bytes, 47M nodes

• Compression
  • First-child references can be reduced to 1-bit per node.
  • Use Delta distance instead of full address.
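
One reading of the two compression bullets, sketched under stated assumptions since the talk does not spell out the encoding: in a depth-first array a first child, when present, sits at index `i + 1`, so a single flag bit can replace the reference; and a sibling reference can be stored as a forward distance from the node instead of a full index. The arrays and decoder names are hypothetical.

```python
# Hypothetical depth-first layout; -1 means "no such node".
# Edges: node 0 -> 1, 4 and node 1 -> 2, 3.
first_child  = [1, 2, -1, -1, -1]
next_sibling = [-1, 4, 3, -1, -1]

# A first child, when present, is always the next array element,
# so one bit per node is enough to recover the reference.
has_child = [fc != -1 for fc in first_child]

# Store siblings as a forward distance from the node itself (0 = no sibling).
sibling_delta = [0 if s == -1 else s - i for i, s in enumerate(next_sibling)]

def decode_first_child(i):
    return i + 1 if has_child[i] else -1

def decode_next_sibling(i):
    return i + sibling_delta[i] if sibling_delta[i] else -1

print(has_child)      # [True, True, False, False, False]
print(sibling_delta)  # [0, 3, 1, 0, 0]
```

A sibling always lies at least one position ahead of the node, so 0 is free to act as the "no sibling" marker, and small deltas fit in fewer bits than full addresses.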


Discussion

• Insertion/deletion complexity
  • Random
  • Linear
  • Scope: trees with rare modifications

• Relaxed model: partitions
  • Shared memory: cache line
  • Non-shared memory: message size


Next…

• Path #1:
  • Implement and evaluate a commercial/scientific workload
  • Develop a library/framework for parallel tree manipulation

• Path #2:
  • Algorithm evaluation for non-shared memory architectures
  • E.g. Blue Gene, Intel SCC

• Both


Conclusion

• Tree manipulation using typical data representation is not well suited for parallel processing.

• Propose and evaluate a technique for parallel tree manipulation
  • Performance gain for sequential and parallel processing
  • Saves memory and bandwidth
  • Scalable

• For our experiments, on-chip communication with fewer cores is superior to off-chip communication.


Acknowledgment

• Special thanks go to:

Prof. Natalie Enright Jerger

Prof. Greg Steffan


QUESTIONS

Fact: 42,270 runs were executed in the experimentation using 91 TB of data.

Thank You.

Please send me your comments, [email protected]