PARALLEL TREE MANIPULATION Islam Atta
Jan 17, 2018
Copyright © 2010-2012 Islam Atta 2
Sources
• Islam Atta, Hisham El-Shishiny. System and method for parallel processing. 20110016153 TX, US, 2010.
• Experimental evaluation was done as part of course work at UofT (ECE1749H, ECE1755H).
This talk is about …
• Manipulation algorithm for Tree data structures
• Parallel Programming domain
• Novel Tree representation using Linear Arrays
Trees…?
• Widely used
  • Hierarchical data (financial data, NLP, machine vision, GIS, DNA, protein sequences…)
  • Indexing/hashing (search engines…)
• Tree manipulation context
  • Full traversal of a tree (or sub-tree) for read or write accesses
  • E.g. CBIR, DNA sequence alignment
Problem
• Categorized as non-uniform random access structures
  • Bad spatial locality
  • Incurring high miss rates
• Worse for multiprocessing (Berkeley, 2006)
  • Requires high on-/off-chip bandwidth
Tree Representation
[Figure: an example tree (root A; children B, C, D; further levels E–K, L–R, S–V) next to its flat memory layout A B C D E F G … with no gaps between nodes.]
Multiprocessing Platforms
• Non-shared memory architectures (Cell BE, Blue Gene, Intel SCC)
  • Explicit message passing → many small messages
• Shared memory architectures (Intel Quad-core)
  • Coherent cache banks → cache blocks grabbed when referenced
[Figure: the flat memory layout with tree nodes scattered across processors.]
Optimal solution: reallocate tree elements in memory to form contiguous memory regions.
Organization & Representation
Evaluation
Discussion & Conclusion
GOAL: Allocate tree elements to promote Spatial Locality
Organization & Representation
Polish Notation
Arithmetic example
Linear Tree: Depth-First Ordering
[Figure: the example tree with nodes visited in depth-first order, mapped onto the linear array.]
Recursive Contiguity
[Figure: in the depth-first layout A B C D E F … S T U V, every subtree (e.g. N with children S, T; or C with its descendants G, H, I, O, P, Q) occupies a contiguous run of the array.]
Spatial Locality
• Non-shared memory architectures
  • Explicit message passing → few large messages
• Shared memory architectures
  • Minimal false sharing
[Figure: contiguous subtree regions of the linear array assigned to processors.]
Representation
• Data array
• Parents reference array
• Children reference array
First-Child/Sibling
• Data array
• Parents reference array
• First-child reference array
• Siblings’ reference array
Scheduling Algorithm
• Designed for Cell BE and Blue Gene/L
  • Message passing (DMA, MPI, mailboxes)
• Challenges
  • Unbalanced trees with varying computation complexity
  • Limited local storage
  • Larger data chunks, 128-byte aligned
• Algorithm properties
  • Master-slave
  • Dynamic scheduling of sub-workloads
  • Work-stealing, coordinated by the master
  • Double buffering
Organization & Representation
Evaluation
Discussion & Conclusion
Implementation and comparison
Evaluation
Methodology
• Application: sequence alignment problem
  • DNA, RNA, protein, NLP, financial data
• Implementation: pthreads on x86 Intel machines
  • UG: quad-core
  • Kodos: 2-socket quad-core
  • Kang: 4-socket dual-core
• Data cache simulation
• In-memory trees
Memory Access Time
• Naïve sequence alignment consists of only read/write operations.
• Random:
  • Sub-linear increase up to 4 threads
  • Saturates after 4 threads
• Linear:
  • Sequential: 2.7× gain
  • Hits the memory wall
[Figure: time per node (µsec/node) vs. thread count (1–32) for UG-Random and UG-Linear.]
Quad-core vs. 2-socket Quad-core
• Kodos-Random:
  • Sub-linear increase before 8 threads
  • Saturates after 8 threads
• Kodos-Linear:
  • Similar to UG-Linear
[Figure: time per node (µsec/node) vs. thread count (1–32) for UG-Random, UG-Linear, Kodos-Random, and Kodos-Linear.]
Interconnect Effect
• Assume that UG has a perfect interconnect and compare against it:
  Ratio = (Time[Kang or Kodos] − Time[UG]) / Time[UG]
[Figure: this ratio vs. thread count (1–32) for Kodos and Kang.]
Effective Bandwidth
  Bandwidth = (MissCount × CacheLineSize) / TotalAccessTime
[Figure: bandwidth (Gbps) vs. thread count (1–32) for Random and Linear on UG, Kodos, and Kang.]
Modeling Real Computation
• Computation scaling modeled as SQRT(number of threads)
• 1.6–3.05× for 1–4 threads
[Figure: execution time (sec) vs. thread count (1–32) for UG-Random and UG-Linear.]
Other Experimental Results
Metric | Random | Linear
Miss rate (L2) | 14% | 1.6%
Sequential/parallel fractions | Sequential is 10%, with minor improvement for Linear
Load balancing | Maximum 4% deviation (no work-stealing required)
Stalling on locks | No difference
Memory size ratio | 1 | 0.32
Organization & Representation
Evaluation
Discussion & Conclusion
Practical considerations & Potential work
Discussion & Conclusion
Discussion
• Limitations:
  • Only shared memory architectures
  • Max tree size: 4 GB, 47M nodes
• Compression:
  • First-child references can be reduced to 1 bit per node.
  • Use delta distance instead of full address.
Discussion
• Insertion/deletion complexity:
  • Random vs. Linear
  • Scope: trees with rare modifications
• Relaxed model: partitions
  • Shared memory: cache line
  • Non-shared memory: message size
Next…
• Path #1:
  • Implement and evaluate a commercial/scientific workload
  • Develop a library/framework for parallel tree manipulation
• Path #2:
  • Algorithm evaluation for non-shared memory architectures
  • E.g. Blue Gene, Intel SCC
• Or both
Conclusion
• Tree manipulation using the typical data representation is not well suited for parallel processing.
• We propose and evaluate a technique for parallel tree manipulation:
  • Performance gain for both sequential and parallel processing
  • Saves memory and bandwidth
  • Scalable
• For our experiments, on-chip communication with fewer cores is superior to off-chip communication.
Acknowledgment
• Special thanks go to:
Prof. Natalie Enright Jerger
Prof. Greg Steffan
QUESTIONS
Fact: 42,270 runs were executed in the experimentation, using 91 TB of data.
Thank You.
Please send me your comments, [email protected]