Acrotensor: Flexible Tensor Contractions on the GPU
Aaron Fisher (Computational Scientist), Tzanio Kolev, Johann Dahm
Feb 2019

LLNL-PRES-768399. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.
Why should we care about tensor algebra?
§ All the usual dense matrix operations can be represented with tensor algebra.
§ Tensor algebra extends naturally to enable batching.
§ Higher rank tensor algebra has many applications including:
— Finite elements
— Machine learning
— Quantum simulation
§ Growth opportunity for linear algebra packages
$d = \sum_i A_{ii}$ (trace)    $d = \sum_i u_i v_i$ (dot product)
$C_{ij} = \sum_k A_{ik} B_{kj}$ (matrix product)    $A_{ik} = u_i v_k$ (outer product)
$C_{nij} = \sum_k A_{nik} B_{nkj}$ (batched matrix product)
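Each of the contractions above maps directly onto an einsum string. A minimal sketch using NumPy's `einsum` (one of the CPU-only libraries listed later in the talk); the array names and sizes here are illustrative, not from the slides:

```python
import numpy as np

n, b = 4, 3                       # illustrative tensor size and batch count
A = np.random.rand(n, n)
u = np.random.rand(n)
v = np.random.rand(n)
Ab = np.random.rand(b, n, n)      # batched operands
Bb = np.random.rand(b, n, n)

trace = np.einsum('ii->', A)                  # d = sum_i A_ii
dot = np.einsum('i,i->', u, v)                # d = sum_i u_i v_i
matmul = np.einsum('ik,kj->ij', A, A)         # C_ij = sum_k A_ik A_kj
outer = np.einsum('i,k->ik', u, v)            # A_ik = u_i v_k
batched = np.einsum('nik,nkj->nij', Ab, Bb)   # C_nij = sum_k A_nik B_nkj
```

The last line shows the batching point from the bullets: adding an extra index `n` to every operand turns one matrix product into a batch of independent products.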
§ CPU only
— FTensor
— Taco
— libtensor
— NumPy::einsum
— …
3rd order elements:

| # elements | TensorFlow (s) | AcroTensor (s) | Improvement Factor |
|------------|----------------|----------------|--------------------|
| 10,000     | 0.25           | 0.05           | 5.0                |
| 20,000     | 0.44           | 0.06           | 7.3                |
| 40,000     | 0.78           | 0.12           | 6.5                |
| 80,000     | 1.50           | 0.23           | 6.5                |

5th order elements:

| # elements | TensorFlow (s) | AcroTensor (s) | Improvement Factor |
|------------|----------------|----------------|--------------------|
| 10,000     | 0.75           | 0.12           | 6.3                |
| 20,000     | 1.54           | 0.23           | 6.7                |
| 40,000     | 3.03           | 0.45           | 6.7                |
| 80,000     | 6.37           | 0.90           | 7.1                |
Note: Acrotensor requires an additional 0.2 s to JIT-compile the kernel on the first pass.
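The slides do not show the exact kernels being benchmarked, but the tables describe a batched per-element contraction timed over many finite elements. A hypothetical NumPy `einsum` sketch of that kind of workload (shapes and the operator applied are assumptions, not the actual benchmark):

```python
import time

import numpy as np

num_elements = 10_000   # matches the smallest row in the tables
q = 8                   # hypothetical number of points per element

# One small dense operator applied independently on every element:
# this is the batching pattern the element index provides.
D = np.random.rand(num_elements, q, q)
x = np.random.rand(num_elements, q)

t0 = time.perf_counter()
y = np.einsum('eij,ej->ei', D, x)   # y_ei = sum_j D_eij x_ej
elapsed = time.perf_counter() - t0
```

The point of the comparison above is that the whole batch is expressed as a single contraction, so a library like AcroTensor can fuse the loop over elements into one GPU kernel rather than launching per-element work.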
Summary and Future Work

§ Acrotensor is a C++ library for large-scale tensor contractions on GPUs.
§ JIT compilation provides a user-friendly dynamic interface without giving up high performance.
§ Acrotensor performs >5x faster than TensorFlow's comparable einsum operation.
§ Future work
— CPU JIT compilation for single- and multi-threaded execution.
— Further GPU optimization.
— Integration with interested application codes.
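To illustrate the "user-friendly dynamic interface" idea, here is a small mock-up of a string-driven contraction call. Acrotensor itself is C++ with CUDA JIT compilation; this Python sketch only demonstrates the concept of parsing a contraction string at runtime, and the `contract` function and its `"C_i_j = A_i_k B_k_j"` notation are hypothetical stand-ins, not the library's actual API:

```python
import numpy as np


def contract(expr, out, *tensors):
    """Hypothetical string-driven contraction: parse an expression like
    'C_i_j = A_i_k B_k_j' at runtime and dispatch to NumPy's einsum."""
    lhs, rhs = expr.split('=')

    def indices(term):
        # 'A_i_k' -> 'ik' (drop the tensor name, keep the index letters)
        return ''.join(term.strip().split('_')[1:])

    out_idx = indices(lhs)
    in_idx = [indices(t) for t in rhs.split()]
    out[...] = np.einsum(','.join(in_idx) + '->' + out_idx, *tensors)


A = np.random.rand(3, 4)
B = np.random.rand(4, 5)
C = np.empty((3, 5))
contract("C_i_j = A_i_k B_k_j", C, A, B)
```

Because the contraction is described by a string rather than fixed templates, one entry point handles any rank; in the real library the parsed expression would be JIT-compiled into a GPU kernel instead of interpreted.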
https://github.com/LLNL/acrotensor
Disclaimer

This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.