Early Experience with Out-of-Core Applications on the Cray XMT

Daniel Chavarría-Miranda §, Andrés Márquez §, Jarek Nieplocha §, Kristyn Maschhoff † and Chad Scherrer §
§ Pacific Northwest National Laboratory (PNNL)   † Cray, Inc.
Introduction

- The increasing gap between memory and processor speed is causing many applications to become memory-bound
- Mainstream processors rely on a cache hierarchy; caches are not effective for highly irregular, data-intensive applications
- Multithreaded architectures provide an alternative: switch computation context to hide memory latency
- Cray MTA-2 processors and the newer ThreadStorm processors on the Cray XMT utilize this strategy
Cray XMT

- 3rd generation multithreaded system from Cray
- Infrastructure is based on the XT3/4, scalable up to 8192 processors: SeaStar network, torus topology, service and I/O nodes
- Compute nodes contain 4 ThreadStorm multithreaded processors instead of 4 AMD Opteron processors
- Hybrid execution capabilities: code can run on ThreadStorm processors in collaboration with code running on Opteron processors
Cray XMT (cont.)

- ThreadStorm processors run at 500 MHz
- 128 hardware thread contexts, each with its own set of 32 registers
- No data cache; a 128 KB, 4-way associative data buffer sits on the memory side
- Extra bits in each 64-bit memory word: full/empty bits for synchronization
- Memory is hashed at a 64-byte granularity, i.e. logical addresses contiguous across a 64-byte boundary might be mapped to non-contiguous physical locations
- Global shared memory
- Lightweight User Communication library (LUC) coordinates data transfers and hybrid execution between ThreadStorm and Opteron processors: Portals-based on the Opterons, Fast I/O API-based on the ThreadStorms, with RPC-style semantics (see the sketch after this list)
- Service and I/O (SIO) nodes provide Lustre, a high-performance parallel file system; ThreadStorm processors cannot directly access Lustre
- LUC-based execution and transfers combined with Lustre access on the SIO nodes are an attractive, high-performance alternative for processing very large datasets
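As a rough illustration of these RPC-style semantics, the sketch below shows the indirect I/O path: a ThreadStorm client invokes a remote function that an Opteron-side server executes against Lustre. The interface names (luc_register_function, luc_call), the request struct, and the file path are hypothetical stand-ins, not the actual LUC API.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical RPC-style interface standing in for LUC; names and
       signatures are illustrative only, the real LUC API differs. */
    typedef long (*rpc_fn)(const void *in, size_t in_len, void *out, size_t out_len);
    void luc_register_function(int fn_id, rpc_fn fn);               /* server side */
    long luc_call(int endpoint, int fn_id, const void *in, size_t in_len,
                  void *out, size_t out_len);                       /* client side */

    enum { READ_CHUNK = 1 };
    struct chunk_req { long offset; long length; };   /* hypothetical request */

    /* Server handler (Opteron, SIO node): services the call by reading the
       requested byte range of the input file from Lustre.
       Registered once at startup: luc_register_function(READ_CHUNK, read_chunk); */
    long read_chunk(const void *in, size_t in_len, void *out, size_t out_len) {
        const struct chunk_req *req = in;
        FILE *f = fopen("/lustre/input.dat", "rb");   /* hypothetical path */
        fseek(f, req->offset, SEEK_SET);
        long n = (long)fread(out, 1, (size_t)req->length, f);
        fclose(f);
        return n;                                     /* bytes actually read */
    }

    /* Client (ThreadStorm compute node): pulls the file chunk by chunk into
       a buffer in global shared memory for the analysis code to consume. */
    void load_input(int sio_endpoint, char *buf, long chunk_bytes, long total_bytes) {
        for (long off = 0; off < total_bytes; off += chunk_bytes) {
            struct chunk_req req = { off, chunk_bytes };
            luc_call(sio_endpoint, READ_CHUNK, &req, sizeof req,
                     buf + off, (size_t)chunk_bytes);
        }
    }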
PDTree implementation

- We use an enhancement of the ADTree data structure called a PDTree, where we do not need to store all possible combinations of values; only a priori specified combinations are stored
- The PDTree is implemented as a multiple-type, recursive tree structure:
  - The root node is an array of ValueNodes (counts for the different value instances of the root variables)
  - Interior and leaf nodes are linked lists of ValueNodes
- Inserting a record at the top level involves just incrementing the counter of the corresponding ValueNode; the XMT's int_fetch_add() atomic operation is used to increment counters
- Inserting a record at other levels requires traversing a linked list to find the right ValueNode; if the ValueNode does not exist, it must be appended to the end of the list
- Inserting at other levels when the node does not exist is tricky: to ensure safety, the end pointer of the list must be locked
  - The readfe() and writeef() MTA operations create critical sections, taking advantage of the full/empty bits on each memory word
- As the data analysis progresses, the probability of conflicts between threads decreases (a sketch of the insertion logic follows this list)
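A minimal sketch of that insertion logic in the XMT C dialect, assuming the compiler generics int_fetch_add(), readfe(), and writeef(); the node layout and helper names are hypothetical, not taken from the paper, and for simplicity the whole list is guarded by one lock word, whereas the paper locks only the end pointer of the list.

    #include <stdlib.h>

    /* Hypothetical node layout; int_fetch_add(), readfe() and writeef()
       are XMT compiler generics operating on full/empty-tagged words. */
    typedef struct ValueNode {
        long value;              /* value instance this node counts */
        long count;              /* occurrence counter              */
        struct ValueNode *next;  /* next entry in the linked list   */
    } ValueNode;

    typedef struct NodeList {
        ValueNode *head;   /* first entry, NULL if the list is empty        */
        long lock;         /* guard word; must be initialized to full state */
    } NodeList;

    /* Top level: counters live in a pre-allocated array, so one atomic
       fetch-and-add suffices and no locking is needed. */
    void insert_top_level(ValueNode *roots, long value) {
        int_fetch_add(&roots[value].count, 1);
    }

    /* Other levels: search the list, appending under a full/empty lock
       if the value has not been seen before. */
    void insert_other_level(NodeList *list, long value) {
        readfe(&list->lock);                 /* wait-for-full, read, set empty */
        ValueNode **link = &list->head;
        while (*link != NULL && (*link)->value != value)
            link = &(*link)->next;
        if (*link != NULL) {
            int_fetch_add(&(*link)->count, 1);
        } else {                             /* append a fresh node at the end */
            ValueNode *n = malloc(sizeof *n);
            n->value = value; n->count = 1; n->next = NULL;
            *link = n;
        }
        writeef(&list->lock, 1);             /* wait-for-empty, write, set full */
    }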
Experimental setup and Results

- Large dataset to be analyzed by PDTree: 4 GB resident on disk (64M records, 9-column guide tree)
- Options:
  - Direct file I/O from the ThreadStorm processors via NFS: not very efficient
  - Indirect I/O via a LUC server running on Opteron processors on the SIO nodes: the large input file can reside on the high-performance Lustre file system
- Simulates the use of PDTree for online network traffic analysis: needs the dynamic PDTree with a 128K-element hash table (sketched below)
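As a rough illustration of that hash-table root (the hash function and names are hypothetical, not from the paper), each record's guide-tree column values could be hashed into the 128K-entry table:

    #define TABLE_SIZE (128 * 1024)   /* 128K entries, as in the experiments */

    /* Hypothetical FNV-1a style hash over the guide-tree columns of a record. */
    unsigned long hash_record(const long *cols, int ncols) {
        unsigned long h = 14695981039346656037UL;   /* FNV-1a offset basis */
        for (int i = 0; i < ncols; i++) {
            h ^= (unsigned long)cols[i];
            h *= 1099511628211UL;                   /* FNV-1a prime */
        }
        return h & (TABLE_SIZE - 1);                /* table size is a power of two */
    }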
Experimental setup and Results (cont.)

[Figure: system diagram. A compute node with four ThreadStorm CPUs and DRAM connects over the SeaStar interconnect to a service/login node with an Opteron CPU and DRAM, behind which sits the Lustre file system. Two paths are shown: direct access from the ThreadStorm processors, and indirect access via a LUC RPC through the Opteron.]

Note: results obtained on a preproduction XMT with only half of the DIMM slots populated.
Experimental setup and Results (cont.)

# of procs.   XMT Insertion [s]   XMT Speedup   MTA Insertion [s]   MTA Speedup
          1              239.26          1.00              200.17          1.00
          2              116.36          2.06               98.25          2.04
          4               56.48          4.24               48.07          4.16
          8               27.53          8.69               23.29          8.59
         16               13.97         17.13               11.61         17.24
         32                7.13         33.56                5.81         34.45
         64                3.68         65.02                 N/A           N/A
         96                2.60         92.02                 N/A           N/A

In-core, 1M record execution, static PDTree version.
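Speedup here is insertion time on one processor divided by insertion time on p processors; for example, on 96 XMT processors 239.26 s / 2.60 s ≈ 92.0, i.e. near-linear scaling.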
Experimental setup and Results (cont.)

[Figure: 100 MB dataset. Execution time in seconds (0 to 300) vs. number of processors (8, 16, 32, 64, 96), broken down into Insertion, Preprocessing, and LUC Transfer.]
Experimental setup and Results (cont.)

[Figure: 250 MB dataset. Execution time in seconds (0 to 700) vs. number of processors (8, 16, 32, 64, 96), broken down into Insertion, Preprocessing, and LUC Transfer.]
Conclusions
- Results indicate the value of the XMT hybrid architecture and its improved I/O capabilities: indirect access to Lustre through the LUC interface
- The I/O operation implementation needs improvement to take full advantage of Lustre; multiple LUC transfers in parallel should improve performance
- Scalability of the system is very good for the complex, data-dependent irregular accesses in the PDTree application
- Future work includes comparisons against parallel …