An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura, Toshio Endo, Akinori Yonezawa Department of Information Science, Faculty of Science, University of Tokyo
An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation
on Shared Memory Parallel Computer
Yoshihiro Oyama, Kenjiro Taura,
Toshio Endo, Akinori Yonezawa
Department of Information Science, Faculty of Science,
University of Tokyo
Background
“Irregular” parallel applications
• Tasks are not identified until runtime
• Synchronization structure is complicated
Languages with fine-grain threads
• Promising approach to handle the complexity
Motivation
Q: Are fine-grain threads really effective?
• Easy to describe irregular parallelism?
• Scalable?
• Fast?
Many sophisticated designs and implementation techniques have been proposed so far, but case studies that answer this question are few.
Goal
Case study to better understand the effectiveness of fine-grain threads
C + Solaris threads (approach without fine-grain threads)
vs.
our language Schematic (approach with fine-grain threads)
in terms of
• program description cost
• speed on 1 PE
• scalability on a 64-PE SMP
Overview
Applications ( RNA & CKY )
Solutions without fine-grain threads
Solutions with fine-grain threads
Performance evaluation
Case Study 1: RNA - protein secondary structure prediction -
Algorithm: simple node traversal + pruning
Finding a path
• satisfying a certain condition
• with the largest weight
Unbalanced tree
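The traversal-plus-pruning idea on this slide can be sketched sequentially as follows. This is a minimal illustration, not the authors' RNA code: the `Node` class, the per-node `bound` field (an assumed upper bound on the weight still reachable below a node), and the `feasible` predicate standing in for the "certain condition" are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    weight: int        # weight this node adds to a path
    bound: int = 0     # assumed upper bound on the weight reachable below this node
    children: List["Node"] = field(default_factory=list)


def best_path(root: Node,
              feasible: Callable[[List[int]], bool] = lambda path: True
              ) -> Tuple[Optional[int], List[int]]:
    """Depth-first search for the heaviest feasible root-to-leaf path."""
    best_weight: Optional[int] = None
    best: List[int] = []
    path: List[int] = []

    def visit(node: Node, total: int) -> None:
        nonlocal best_weight, best
        path.append(node.weight)
        total += node.weight
        if feasible(path):  # an infeasible prefix cuts off the whole subtree
            if not node.children:
                if best_weight is None or total > best_weight:
                    best_weight, best = total, list(path)
            elif best_weight is None or total + node.bound > best_weight:
                for child in node.children:
                    visit(child, total)
            # otherwise the subtree cannot beat the best path so far: pruned
        path.pop()

    visit(root, 0)
    return best_weight, best


# Unbalanced example tree: the right branch is deeper but gets pruned.
tree = Node(1, 10, [
    Node(5, 4, [Node(4), Node(1)]),
    Node(2, 3, [Node(2, 1, [Node(1)])]),
])
print(best_path(tree))  # (10, [1, 5, 4])
```

Because pruning depends on the best weight found so far, the amount of work under each node is not known in advance, which is what makes the tree hard to partition statically.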
Case Study 2: CKY - context-free grammar parser -
calculation of matrix elements
depends on all s
She is a girl whose mother is a teacher.
Calculation time varies significantly from element to element
Actual size ≒ 100
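For readers unfamiliar with CKY, the matrix computation can be sketched as a textbook recognizer for a grammar in Chomsky normal form. This is not the parser evaluated in the talk; the toy grammar, the `lexical` and `binary` tables, and the function name `cky_table` are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def cky_table(words: List[str],
              lexical: Dict[str, Set[str]],
              binary: Dict[Tuple[str, str], Set[str]]) -> List[List[Set[str]]]:
    """Fill table[i][j] with the nonterminals that derive words[i:j]."""
    n = len(words)
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        table[i][i + 1] = set(lexical.get(word, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            # Element (i, j) depends on every pair (i, k), (k, j) with i < k < j,
            # so its cost grows with the span and with the size of those sets.
            for k in range(i + 1, j):
                for left in table[i][k]:
                    for right in table[k][j]:
                        table[i][j] |= binary.get((left, right), set())
    return table


# Hypothetical toy grammar: S -> NP VP, VP -> V NP, NP -> Det N
lexical = {"she": {"NP"}, "is": {"V"}, "a": {"Det"}, "girl": {"N"}}
binary = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}, ("Det", "N"): {"NP"}}

table = cky_table("she is a girl".split(), lexical, binary)
print("S" in table[0][4])  # True
```

In a parallel version each element must wait for the elements it reads, and since the inner loops do very different amounts of work per element, the waiting times are hard to predict.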
Solution without Fine-grain Threads (RNA)
Creating a thread for each node: large overhead
Task pool
[Diagram: processors (P P P) sharing a task pool; communication with memory]
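A task pool of this kind can be sketched as below: a fixed number of worker threads repeatedly take a node from a shared queue and push its children back, instead of creating one thread per node. The talk's version is written in C with Solaris threads; this Python sketch uses the standard `queue` and `threading` modules, and `run_task_pool`, `children_of`, and the example tree are hypothetical.

```python
import queue
import threading
from typing import Callable, Hashable, Iterable, List


def run_task_pool(root: Hashable,
                  children_of: Callable[[Hashable], Iterable[Hashable]],
                  num_workers: int = 3) -> List[Hashable]:
    """Visit every node reachable from root using a fixed set of workers."""
    pool: "queue.Queue" = queue.Queue()   # the shared task pool
    visited: List[Hashable] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            node = pool.get()
            if node is None:              # sentinel: no more work
                pool.task_done()
                return
            with lock:
                visited.append(node)
            # Children are enqueued before the parent is marked done, so the
            # pool's unfinished-task count never reaches zero too early.
            for child in children_of(node):
                pool.put(child)
            pool.task_done()

    pool.put(root)
    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in workers:
        t.start()
    pool.join()                           # wait until every task is processed
    for _ in workers:
        pool.put(None)
    for t in workers:
        t.join()
    return visited


# Unbalanced example tree given as an adjacency mapping.
example = {"a": ["b", "c"], "b": ["d"], "d": ["e", "f"], "f": ["g"]}
print(sorted(run_task_pool("a", lambda n: example.get(n, []))))
```

The programmer has to manage the queue, the locking, and termination explicitly, which is part of the description cost the talk compares against.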
Solution without Fine-grain Threads (CKY)
Calculating 1 element → 0 ~ 200 synchronizations
[Diagram: processors (P P P)]
How to implement?
• small delay → simple spin
• large delay → block wait
Decision strategy?
• trial & error
• prediction
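One common way to combine the two waiting styles is to spin for a bounded number of checks and then fall back to a blocking wait. The sketch below illustrates that idea only; it is not the implementation from the talk. The `Cell` class and the `spin_limit` parameter are hypothetical, and choosing `spin_limit` is exactly the decision that would be left to trial and error or prediction.

```python
import threading
from typing import Any, Tuple


class Cell:
    """A write-once value that readers wait for (hypothetical helper)."""

    def __init__(self) -> None:
        self._ready = threading.Event()
        self._value: Any = None

    def put(self, value: Any) -> None:
        self._value = value
        self._ready.set()

    def get(self, spin_limit: int = 1000) -> Tuple[Any, str]:
        """Return (value, how), where how is 'spin' or 'block'."""
        # Small delay: poll a bounded number of times without sleeping.
        for _ in range(spin_limit):
            if self._ready.is_set():
                return self._value, "spin"
        # Large delay: give up the processor until the value arrives.
        self._ready.wait()
        return self._value, "block"


# The value is already there: the reader returns while spinning.
fast = Cell()
fast.put(1)
print(fast.get())

# The value arrives late: the reader exhausts its spins and blocks.
slow = Cell()
threading.Timer(0.05, slow.put, args=(2,)).start()
print(slow.get(spin_limit=10))
```

With up to about 200 synchronizations per element, a poor choice of the spin bound is paid for many times over, which is why the decision strategy matters.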
Performance Evaluation (Condition)
• Sun Ultra Enterprise 10000 (UltraSPARC 250 MHz × 64)
• Solaris 2.5.1
• Solaris threads (user-level threads)
• GC time not included
• Runtime type check omitted
Performance Evaluation (Sequential)
[Figure: bar chart of normalized elapsed time (axis 0 to 3) for RNA and CKY, comparing C and Schematic]
Performance Evaluation (Parallel)
[Figure: speedup (axis 0 to 50) versus # of PEs (axis 0 to 60) for C (RNA), Schematic (RNA), C (CKY), and Schematic (CKY)]
Related Work
ICC++ [Chien et al. 97]
• Similar study using 7 apps
• Experiments on distributed memory machines
• Focus on
  • namespace management
  • data locality
  • object-consistency model
Conclusion
We demonstrated the usefulness of fine-grain multithread languages
• Task pool-like execution with simple description
• Aggressive optimizations for synchronization
We showed the experimental results
• A factor of 2.8 slower than C
• Scalability comparable to C