PARALLEL TREE MANIPULATION Islam Atta
Jan 17, 2018
Copyright © 2010-2012 Islam Atta 2
Sources
• Islam Atta, Hisham El-Shishiny. System and method for parallel processing. 20110016153 TX, US, 2010.
• Experimental evaluation was done as part of course work at UofT (ECE1749H, ECE1755H).
This talk is about …
• Manipulation algorithm for Tree data structures
• Parallel Programming domain
• Novel Tree representation using Linear Arrays
Trees…?
• Widely used
  • Hierarchical data (financial data, NLP, machine vision, GIS, DNA, protein sequences…)
  • Indexing/hashing (search engines…)
• Tree manipulation context
  • Full traversal of a tree (or sub-tree) for read or write accesses
  • E.g. CBIR, DNA sequence alignment
Problem
• Categorized as non-uniform random access structures
  • Bad spatial locality
  • Incurring high miss rates
• Worse for multiprocessing (Berkeley, 2006)
  • Requires high on-/off-chip bandwidth
Tree Representation
[Figure: an example tree (root A; children B, C, D; further levels E–K, L–R, S–V) next to its flat memory layout A B C D E F G … with no gaps between nodes.]
Multiprocessing Platforms
• Non-shared memory architectures (Cell BE, Blue Gene, Intel SCC)
  • Explicit message passing → many small messages
• Shared memory architectures (Intel Quad-core)
  • Coherent cache banks → cache blocks grabbed when referenced
[Figure: the flat memory layout with tree nodes scattered across processors.]
Optimal solution: reallocate tree elements in memory to form contiguous memory regions.
Organization & Representation
Evaluation
Discussion & Conclusion
GOAL: Allocate tree elements to promote Spatial Locality
Organization & Representation
Polish Notation
Arithmetic example
Linear Tree: Depth-First Ordering
[Figure: the example tree with nodes visited in depth-first order, mapped onto the linear array.]
Recursive Contiguity
[Figure: in the depth-first layout A B C D E F … S T U V, every subtree (e.g. N with children S, T; or C with its descendants G, H, I, O, P, Q) occupies a contiguous run of the array.]
Spatial Locality
• Non-shared memory architectures
  • Explicit message passing → few large messages
• Shared memory architectures
  • Minimal false sharing
[Figure: contiguous subtree regions of the linear array assigned to processors.]
Representation
• Data array
• Parents reference array
• Children reference array
First-Child/Sibling
• Data array
• Parents reference array
• First-child reference array
• Siblings’ reference array
Scheduling Algorithm
• Designed for Cell BE and Blue Gene/L
  • Message passing (DMA, MPI, mailboxes)
• Challenges
  • Unbalanced trees with varying computation complexity
  • Limited local storage
  • Larger data chunks, 128-byte aligned
• Algorithm properties
  • Master-slave
  • Dynamic scheduling of sub-workloads
  • Work-stealing, coordinated by the master
  • Double buffering
Organization & Representation
Evaluation
Discussion & Conclusion
Implementation and comparison
Evaluation
Methodology
• Application: sequence alignment problem
  • DNA, RNA, protein, NLP, financial data
• Implementation: pthreads on x86 Intel machines
  • UG: quad-core
  • Kodos: 2-socket quad-core
  • Kang: 4-socket dual-core
• Data cache simulation
• In-memory trees
Memory Access Time
• Naïve sequence alignment consists of only read/write operations.
• Random:
  • Sub-linear increase up to 4 threads
  • Saturates after 4 threads
• Linear:
  • Sequential: 2.7× gain
  • Hits the memory wall
[Figure: time per node (µsec/node) vs. thread count (1–32) for UG-Random and UG-Linear.]
Quad-core vs. 2-socket Quad-core
• Kodos-Random:
  • Sub-linear increase before 8 threads
  • Saturates after 8 threads
• Kodos-Linear:
  • Similar to UG-Linear
[Figure: time per node (µsec/node) vs. thread count (1–32) for UG-Random, UG-Linear, Kodos-Random, and Kodos-Linear.]
Interconnect Effect
• Assume that UG has a perfect interconnect and compare against it:
  Ratio = (Time[Kang or Kodos] − Time[UG]) / Time[UG]
[Figure: this ratio vs. thread count (1–32) for Kodos and Kang.]
Effective Bandwidth
  Bandwidth = (MissCount × CacheLineSize) / TotalAccessTime
[Figure: bandwidth (Gbps) vs. thread count (1–32) for Random and Linear on UG, Kodos, and Kang.]
Modeling Real Computation
• Computation scaling modeled as SQRT(number of threads)
• 1.6–3.05× for 1–4 threads
[Figure: execution time (sec) vs. thread count (1–32) for UG-Random and UG-Linear.]
Other Experimental Results
Metric | Random | Linear
Miss rate (L2) | 14% | 1.6%
Sequential/parallel fractions | Sequential is 10%, with minor improvement for Linear
Load balancing | Maximum 4% deviation (no work-stealing required)
Stalling on locks | No difference
Memory size ratio | 1 | 0.32
Organization & Representation
Evaluation
Discussion & Conclusion
Practical considerations & Potential work
Discussion & Conclusion
Discussion
• Limitations:
  • Only shared memory architectures
  • Max tree size: 4 GB, 47M nodes
• Compression:
  • First-child references can be reduced to 1 bit per node.
  • Use delta distance instead of full address.
Discussion
• Insertion/deletion complexity:
  • Random vs. Linear
  • Scope: trees with rare modifications
• Relaxed model: partitions
  • Shared memory: cache line
  • Non-shared memory: message size
Next…
• Path #1:
  • Implement and evaluate a commercial/scientific workload
  • Develop a library/framework for parallel tree manipulation
• Path #2:
  • Algorithm evaluation for non-shared memory architectures
  • E.g. Blue Gene, Intel SCC
• Or both
Conclusion
• Tree manipulation using the typical data representation is not well suited for parallel processing.
• We propose and evaluate a technique for parallel tree manipulation:
  • Performance gain for both sequential and parallel processing
  • Saves memory and bandwidth
  • Scalable
• For our experiments, on-chip communication with fewer cores is superior to off-chip communication.
Acknowledgment
• Special thanks go to:
Prof. Natalie Enright Jerger
Prof. Greg Steffan
QUESTIONS
Fact: 42,270 runs were executed in the experimentation, using 91 TB of data.
Thank You.
Please send me your comments, [email protected]