Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization
Dmitry I. Lyakh ([email protected])
ORNL is managed by UT-Battelle for the US Department of Energy. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract No. DE-AC05-00OR22725.
2 ExaTensor
Application-Hardware Interface

• Direct Interface: HPC Application ↔ Hardware/HPC system
  – Interface width ~ O(1): manageable performance portability regardless of the programming model (only O(1) parts of the code interact with raw hardware).
  – Interface width ~ O(N): performance portability can become an issue, even if using OpenMP/OpenACC.
• Uni-Indirect Interface: HPC Application ↔ Runtime + Libraries ↔ Hardware/HPC system
  – The performance-portability effort is shifted towards the runtime and libraries.
  – It remains the HPC application's responsibility to be aware of the hardware.
• Bi-Indirect Interface: HPC Application ↔ Domain-Specific Runtime + Libraries ↔ Hardware/HPC system
  – The performance-portability effort is shifted towards the domain-specific runtime and libraries.
  – The DS runtime is aware of both the application and the hardware: the HPC application is aware of the runtime, and the runtime is responsible for being aware of the hardware.
3 ExaTensor
Domain-Specific Virtual Processor
HPC Application ↔ Domain-Specific Virtual Processor + Libraries ↔ Hardware/HPC system

• A better-structured runtime that formally resembles a processor but is specialized to domain-specific workloads;
• Understands the specificity of domain-specific data, operations, and algorithms;
• Understands the specificity of a class of HPC architectures via a parameterized template of node architectures.
➢ The domain HPC applications express their algorithms in a domain-native (higher-level) language, either via a standalone or an embedded DSL;
➢ The domain algorithms are compiled into instructions for a domain-aware virtual architecture;
➢ The virtual architecture is built from virtual units that map reasonably well to the physical hardware;
➢ Hardware encapsulation, yet potential for seamless co-design.
Requires: Building blocks for heterogeneous computing: HiHat?
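A minimal sketch of the idea (all names here are hypothetical, not the actual ExaTENSOR API): a domain algorithm is expressed as instructions for a virtual processor whose "registers" hold whole tensors, so only the interpreter ever touches hardware-specific code.

```python
# Hypothetical sketch of a domain-specific virtual processor: it interprets
# tensor-algebra instructions instead of exposing raw hardware to the app.
import numpy as np

class TensorVirtualProcessor:
    """Interprets a tiny domain-specific instruction stream.

    Registers hold whole tensors; which backend (CPU, GPU, ...) actually
    executes each instruction is an implementation detail of this class.
    """
    def __init__(self):
        self.regs = {}

    def execute(self, program):
        for instr in program:
            op, *args = instr
            if op == "LOAD":            # LOAD reg, array
                reg, arr = args
                self.regs[reg] = np.asarray(arr, dtype=float)
            elif op == "CONTRACT":      # CONTRACT dst, index_spec, lhs, rhs
                dst, spec, lhs, rhs = args
                self.regs[dst] = np.einsum(spec, self.regs[lhs], self.regs[rhs])
            elif op == "ADD":           # ADD dst, src  (dst += src)
                dst, src = args
                self.regs[dst] = self.regs[dst] + self.regs[src]
            else:
                raise ValueError(f"unknown instruction: {op}")
        return self.regs

# A domain algorithm expressed as instructions, not hardware-specific code:
vp = TensorVirtualProcessor()
vp.execute([
    ("LOAD", "L", np.ones((2, 3))),
    ("LOAD", "R", np.ones((3, 4))),
    ("CONTRACT", "T", "pq,qr->pr", "L", "R"),  # T(p,r) = sum_q L(p,q) R(q,r)
])
print(vp.regs["T"].shape)   # (2, 4); every element equals 3.0
```

The application sees only the instruction set; swapping the interpreter's backend leaves the program unchanged, which is the O(1)-wide interface argued for above.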
5 ExaTensor
Performance Portability Strategy
• Abstract computing system template:
– Distributed (weakly-coupled) level: The computing system is composed of compute nodes interconnected via network interfaces in some topology
– (Semi-)shared (strongly-coupled) level: Each node is composed of multiple compute devices of the same or different kinds, possibly sharing the same (hierarchical) memory
• Algorithms are formulated for this abstract computer
• The hardware specificity is masked by driver libraries that provide a device-unified API for a set of necessary domain-specific primitives (DS-ISA)
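As an illustrative sketch (all names hypothetical, not the ExaTENSOR driver API): each device kind registers a driver object exposing the same small set of domain primitives, and callers dispatch through a device-unified entry point.

```python
# Hypothetical sketch of a device-unified driver layer: every driver
# implements the same domain-specific primitives (a tiny "DS-ISA").
import numpy as np

class CPUDriver:
    """One concrete driver; a GPU driver would expose the same methods."""
    name = "cpu"

    def contract(self, spec, lhs, rhs):
        return np.einsum(spec, lhs, rhs)

    def add(self, dst, src):
        return dst + src

DRIVERS = {"cpu": CPUDriver()}   # a GPU build would register more drivers here

def contract_on(device, spec, lhs, rhs):
    """Device-unified entry point: callers never see hardware details."""
    return DRIVERS[device].contract(spec, lhs, rhs)

result = contract_on("cpu", "ij,jk->ik", np.eye(2), np.full((2, 2), 5.0))
print(result)   # identity times a constant matrix returns it unchanged
```

Adding a new architecture then means writing one more driver, not touching the application: exactly the shift of the portability effort described above.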
• 2006-2008: CLUSTER moved to a direct interpretation of many-body CC equations: High-level spec → Bytecode
• 2009-2013: ACES III and ACES IV: Domain-specific super-instruction language and runtime (SIAL/SIP): SIAL (medium level) → SIP bytecode → interpretation by SIP
• 2014+: ExaTENSOR framework: Direct interpretation of generic tensor expressions on heterogeneous HPC architectures with cross-domain applications
Math Framework: Basic Tensor Algebra

• Formal tensor: T_pq...^rs... ≡ T(p,q,...,r,s,...): an n-D array
  – Full tensor: T(p,q,r,s): p ∈ P, q ∈ Q, r ∈ R, s ∈ S
  – Tensor slice: T(p,q,r,s): p ∈ P′ ⊆ P, q ∈ Q′ ⊆ Q, r ∈ R′ ⊆ R, s ∈ S′ ⊆ S
• Few primitive operations:
  – Tensor addition (parallelism!): ∀p,q,r,s: T(p,q,r,s) += R(p,q,r,s)
  – Tensor product (parallelism!): ∀p,q,r,s: T(p,q,r,s) += L(p,r) · R(q,s)
  – Tensor contraction (compute intensive, potentially!): ∀p,a,i,b,c,d: T(p,a,i,b,c,d) += Σ_{q,r,s} L(p,q,r,s) · R(q,b,c,d,r,s,a,i)
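The three primitive tensor operations can be sketched with numpy.einsum; the shapes below are small illustrative choices, not anything mandated by ExaTENSOR.

```python
# Illustrative numpy versions of the primitive operations: addition,
# (outer) product, and contraction over shared indices q, r, s.
import numpy as np

P, Q, R_, S = 2, 3, 4, 5
T = np.zeros((P, Q, R_, S))
Rt = np.random.rand(P, Q, R_, S)

# Tensor addition: T(p,q,r,s) += R(p,q,r,s)
T += Rt

# Tensor product: T(p,q,r,s) += L(p,r) * R(q,s)
Lpr = np.random.rand(P, R_)
Rqs = np.random.rand(Q, S)
T += np.einsum("pr,qs->pqrs", Lpr, Rqs)

# Tensor contraction: T2(p,a,i,b,c,d) = sum_{q,r,s} L(p,q,r,s) R(q,b,c,d,r,s,a,i)
A, I, B, C, D = 2, 2, 2, 2, 2
L4 = np.random.rand(P, Q, R_, S)
R8 = np.random.rand(Q, B, C, D, R_, S, A, I)
T2 = np.einsum("pqrs,qbcdrsai->paibcd", L4, R8)
print(T2.shape)   # (2, 2, 2, 2, 2, 2)
```

Addition and the outer product are embarrassingly parallel over the output indices; the contraction additionally sums over q, r, s, which is where the compute intensity (and the matrix-multiply-like arithmetic intensity) comes from.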
11 ExaTensor
Math Framework: Tensor Decompositions
• Graphical (diagrammatic) representation: [tensor-network diagrams of a matrix and a matrix*matrix product]
• Linear algebra: the SVD is optimal in the 2-norm
• Tensor (multi-linear) algebra: many choices:
  – Canonical polyadic (CP)
  – Tucker
  – Tensor tree
  – Tensor train
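The 2-norm optimality of the SVD (the Eckart-Young theorem) is easy to check numerically: truncating to the k largest singular triplets leaves a 2-norm error equal to the (k+1)-th singular value.

```python
# Truncated SVD as the optimal rank-k approximation in the 2-norm.
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 5))
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2
Mk = U[:, :k] * s[:k] @ Vt[:k, :]      # keep the k largest singular triplets

# Eckart-Young: the 2-norm error of the best rank-k approximation is s[k]:
err = np.linalg.norm(M - Mk, ord=2)
print(np.isclose(err, s[k]))           # True
```

The tensor decompositions listed above (CP, Tucker, tensor tree, tensor train) generalize this low-rank idea to more than two dimensions, where no single decomposition is optimal in the same sense.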
12 ExaTensor
Tensor Sparsity
• Dense tensors
• Block-sparse tensors (a container of dense slices)
• Regular sparse tensors:
  – Some regular sparsity pattern other than block sparsity
• Irregular sparse tensors:
  – Sparsity pattern has no regularity
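A block-sparse tensor as a container of dense slices can be sketched like this (names and layout are illustrative, not the ExaTENSOR data structure); a 2-D case keeps the example short.

```python
# Illustrative block-sparse container: only nonzero dense blocks are stored,
# keyed by their block coordinates.
import numpy as np

class BlockSparseMatrix:
    """A 2-D example: dense blocks of a fixed block size in a dict."""
    def __init__(self, nblocks, bs):
        self.nblocks, self.bs = nblocks, bs
        self.blocks = {}              # (bi, bj) -> dense bs x bs array

    def set_block(self, bi, bj, data):
        self.blocks[(bi, bj)] = np.asarray(data, dtype=float)

    def to_dense(self):
        n = self.nblocks * self.bs
        out = np.zeros((n, n))
        for (bi, bj), blk in self.blocks.items():
            out[bi*self.bs:(bi+1)*self.bs, bj*self.bs:(bj+1)*self.bs] = blk
        return out

A = BlockSparseMatrix(nblocks=3, bs=2)
A.set_block(0, 0, np.eye(2))
A.set_block(2, 1, np.full((2, 2), 7.0))
dense = A.to_dense()
print(dense.shape, len(A.blocks))   # (6, 6) 2 -- only 2 of 9 blocks stored
```

Because the stored blocks are dense, each block operation can still use fast dense kernels; the sparsity only affects which block pairs are combined, which is what later turns into an irregular, load-balancing-sensitive workload.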
13 ExaTensor
Adaptive (+Hierarchical) Tensor Algebra
• Large tensor elements can become tensors themselves (higher resolution);
• Weak tensor slices can be compressed by lowering the resolution, up to a single (complex) number;
• Adapt to the calculated electronic state and available HPC resources;
• Should be better than simple all-or-nothing (keep/discard) screening of tensor blocks.
D.I.L. Int. J. Quantum Chem. 2014; D.I.L. ArXiv 2017
Extrapolation of H/H²-matrix algebra to tensors
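One way to picture the adaptive-resolution idea (a hedged sketch with an illustrative threshold, not the ExaTENSOR algorithm): a weak slice is stored at a lower rank chosen from its singular-value spectrum, down to a single number when nothing survives the threshold.

```python
# Illustrative adaptive compression: keep only the singular triplets that
# are significant relative to the largest one; the tolerance is arbitrary.
import numpy as np

def compress(block, tol=1e-2):
    """Truncate a 2-D slice to the rank suggested by its spectrum."""
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    if s[0] == 0.0:
        return 0.0                           # fully compressed: one number
    k = int(np.sum(s >= tol * s[0]))         # adaptive rank
    return (U[:, :k], s[:k], Vt[:k, :])

strong = np.outer(np.arange(1, 5), np.arange(1, 5)).astype(float)  # rank 1
U, s, Vt = compress(strong)
print(len(s))   # 1 -- a rank-1 slice is kept with a single triplet
```

The rank is a continuous dial rather than a keep/discard bit, which is what makes this kind of scheme preferable to all-or-nothing screening.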
14 ExaTensor
Computational Challenges
• Dense tensor algebra:
  – Communication bandwidth: communication avoiding and regularization (similar to matrix multiplication, e.g., CARMA by Demmel et al.)
  – Memory size: out-of-core algorithms
• Block-sparse tensor algebra adds:
  – Irregular data placement and computational workload: load balancing → task-based programming model
• Adaptive hierarchical block-sparse tensor algebra adds:
  – Dynamic irregular data placement and computational workload: load balancing, dynamic data layout → adaptive task-based programming model
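The load-balancing step behind a task-based model can be sketched with the classic longest-processing-time greedy heuristic (purely illustrative; task names and costs are made up).

```python
# Greedy load balancing of irregular block workloads: each task, largest
# first, goes to the currently least-loaded worker (LPT heuristic).
import heapq

def balance(task_costs, nworkers):
    """Return a task -> worker assignment minimizing load imbalance greedily."""
    heap = [(0.0, w) for w in range(nworkers)]   # (current load, worker id)
    heapq.heapify(heap)
    assignment = {}
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        load, w = heapq.heappop(heap)            # lightest worker so far
        assignment[task] = w
        heapq.heappush(heap, (load + cost, w))
    return assignment

costs = {"blkA": 8.0, "blkB": 7.0, "blkC": 3.0, "blkD": 2.0}
plan = balance(costs, nworkers=2)
print(plan)   # the two heaviest blocks land on different workers
```

In the adaptive hierarchical case the costs change at runtime, so such an assignment must be recomputed or refined dynamically, which is what pushes the design towards an adaptive task-based programming model.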
ExaTENSOR:
1. Implementation of the tensor algebra DSVP (TA-DSVP) for heterogeneous computing;
2. Implementation of scalable tensor algebra algorithms for the TA-DSVP.