PARALLEL ALGORITHMS FOR LOW-RANK APPROXIMATIONS OF MATRICES AND TENSORS BY LAWTON MANNING A Thesis Submitted to the Graduate Faculty of WAKE FOREST UNIVERSITY GRADUATE SCHOOL OF ARTS AND SCIENCES in Partial Fulfillment of the Requirements for the Degree of MASTER OF SCIENCE Computer Science May 2021 Winston-Salem, North Carolina Approved By: Grey Ballard, Ph.D., Advisor Jennifer Erway, Ph.D., Chair Samuel Cho, Ph.D.
Low-rank approximations are useful in the compression and interpretation of large datasets. Distributed parallel algorithms for computing such approximations, like those for matrices and tensors, are applicable to even larger datasets that cannot conceivably fit on one computer. In this thesis I present parallel versions of two such approximation algorithms: Hierarchical Nonnegative Matrix Factorization and Tensor Train Rounding. In both cases, the distributed parallel algorithms outperform the state of the art.
Nonnegative Matrix Factorization (NMF) is a tool for clustering nonnegative matrix data. A hierarchical NMF clustering can be achieved by recursively clustering a dataset using Rank-2 (two-cluster) NMF. The hierarchical clustering algorithm can reveal more detailed information about the data. It is also faster than a flat clustering of the same size, since Rank-2 NMF is faster and scales better than the general NMF algorithm as the number of clusters increases.
Tensor Train (TT) uses a series of 3-dimensional TT cores to approximate an N-dimensional tensor. TT ranks determine the sizes of these cores. Arithmetic with Tensor Train causes an artificial increase in the TT ranks, and thus in the sizes of the TT cores, so TT applications use an algorithm called TT rounding to truncate the TT ranks subject to some approximation error. The TT rounding algorithm can be thought of as a Truncated Singular Value Decomposition (tSVD) of a product of highly structured matrices. The state-of-the-art approach requires a slow orthogonalization phase. A faster Gram SVD algorithm avoids this slow phase, reduces the computation time of TT rounding, and improves its parallel scalability.
Chapter 1: Introduction
Low-rank approximations are useful in the compression and interpretation of large
datasets. Distributed-memory parallel algorithms for computing such approximations,
like those for matrices and tensors, are applicable to even larger datasets that cannot
conceivably fit on one computer. In this thesis we present parallel versions of two such
approximation algorithms: Hierarchical Nonnegative Matrix Factorization and Tensor
Train Rounding. In both cases, the distributed-memory parallel algorithms outperform
the state of the art.
1.1 Low-Rank Approximations
There are a wide variety of low-rank approximations that are used in a range of
applications such as facial recognition [18], dimensionality reduction [62], hyperspectral image
segmentation [25], and data completion [55]. Some of these low-rank approximations
include: Singular Value Decomposition (SVD), Nonnegative Matrix Factorization
(NMF), Principal Component Analysis (PCA), the tensor CP Decomposition, and
Tensor Train (TT).
For example, hyperspectral image segmentation is a popular application for Non-
negative Matrix Factorization (NMF). NMF is a clustering algorithm that can cluster
individual pixels in a hyperspectral image. The resulting NMF clustering also con-
tains feature signatures for each cluster and fractional cluster membership for each
pixel. For hyperspectral images, these feature signatures can describe the types of
materials each pixel captures as different materials reflect light at different spectra
(colors) [25].
Another example is the low-rank approximation of incomplete tensor data called
tensor completion. Tensor completion is the problem of filling missing or unobserved
entries of partially observed tensors [55]. Filling missing entries leaves many degrees
of freedom for what those entries could ultimately be, so tensor completion problems
require constraints to be well posed. One of the common constraints is that the
resulting completed tensor have low rank. There are several
definitions of rank for a tensor approximation, depending on the type of approxima-
tion used. One of the common tensor decompositions used for tensor completion is
the CP decomposition. After computing the CP decomposition that best fits the
observed data and has a minimal rank, the unobserved data is predicted using the
corresponding value from that CP model.
1.2 Distributed-Memory Parallel Algorithms
In 1965, Gordon Moore observed that the number of transistors on a single silicon
chip had doubled roughly every year and proposed that it would continue to do so
for at least the next 10 years [42]. This observation, now known as Moore's Law, has
been generalized over time to computational density instead of transistor density. As
engineers met the physical limits of transistor density, other strategies were developed
to meet the extended Moore's Law, such as multiple processor cores on a single chip
and GPU accelerators. However, even as computers become more and more powerful,
there are still problems that take too long to solve. These problems typically also
require large amounts of memory. Both computational and storage bottlenecks
lead us to work on distributed-memory systems such as supercomputers.
The most powerful supercomputers in the world are not made up of futuristic
processors or overly large hard drives. Instead, they are giant networks of individual
computers made of commercially available technology. For example, the Summit
supercomputer at Oak Ridge National Laboratory was the most powerful supercomputer
in the world, with 4608 individual "nodes", each with 2 IBM POWER9 CPUs
and 6 NVIDIA Volta GPUs [39]. Although each of these nodes is powerful in its
own right, the ability to utilize multiple nodes in tandem makes distributed-memory
parallel algorithms high performing.
The Summit nodes each contain 512 GB of main memory for use by the proces-
sors [39]. If a problem requires more than this amount of memory, which is likely
for problems requiring high performance computing, adding more nodes to the com-
putation can allow for the distribution of that problem’s data across many nodes.
However, distributing memory like this comes with a downside: the cost of
communication between nodes.
Relative to the speed of computation on an individual node, the costs associated
with communicating data between two nodes are orders of magnitude higher. In the
worst cases, the majority of time spent in a distributed-memory algorithm can be
that slow communication of data instead of the actual computations of the algorithm,
which limits parallel scalability. This is why we must design parallel algorithms that
avoid this communication as much as possible. The algorithms presented in this thesis
both avoid communicating the bulk of their data, instead communicating the results
of smaller, intermediate calculations.
1.3 Applications
This thesis will cover two distributed-memory parallel algorithms for low-rank approx-
imations: Nonnegative Matrix Factorization and Tensor Train. Nonnegative Matrix
Factorization (NMF) is a clustering algorithm for nonnegative data that can extract
feature signatures and cluster membership for individual samples. Hierarchical
Clustering with Rank-2 NMF (HierNMF) results from an optimization of a flat NMF
clustering algorithm. HierNMF can give a deeper answer than the flat algorithm and
potentially do so faster. This algorithm is discussed further in chapter 3. Tensor Train
(TT) is a data compression format for tensors, which are multidimensional arrays in
any number of dimensions. TT allows computations to be done on tensors implicitly,
without decompressing them. TT rounding is a common bottleneck subroutine
used in many TT applications, and so chapter 4 proposes another approach to that
subroutine that avoids both communication and computation, resulting in a faster
algorithm.
Chapter 2: Preliminaries
This chapter provides background on how distributed-memory algorithms are
designed and implemented using the Message Passing Interface (MPI) and analyzed
using the α-β-γ model, as well as the linear algebra concepts needed to understand
the content of future chapters.
2.1 Distributed-Memory Parallel Computing
Distributed-memory parallel architectures consist of multiple processors, each with
their own local memory. We use the Message Passing Interface (MPI) to allow
processors to explicitly send and receive data. MPI is a standard interface for writing
distributed-memory parallel code in C, C++, and FORTRAN. Unlike shared memory
interfaces like OpenMP, MPI requires that data must be explicitly passed between
processors, often through collectively invoked functions.
2.1.1 MPI Cost Model
In analyzing MPI algorithms, there are the normal costs of computation as well as the
additional communication costs of passing data between processors. Communication
costs can be broken down into two parts: bandwidth and latency. Bandwidth is
the cost associated with the amount of data sent between processors. Latency is
the overhead cost of sending any amount of data in MPI. To analyze these costs
together, we use the α − β − γ model defined in [11]. This model combines the
costs of latency, bandwidth, and computation by assigning the coefficients α, β, and
γ to each, respectively. On distributed-memory systems, latency is the most costly,
followed by bandwidth and then computation, so α ≫ β ≫ γ. In this model, the
cost of sending a message of w words is α + βw.
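As a concrete illustration of the model, the sketch below evaluates the message cost α + βw in the two regimes. The parameter values are made up for illustration; they are not measurements of any particular machine, only chosen to satisfy α ≫ β.

```python
def message_cost(w, alpha, beta):
    """Cost of sending a single message of w words: latency plus bandwidth term."""
    return alpha + beta * w

# Illustrative (not measured) parameters satisfying alpha >> beta:
alpha, beta = 1e-6, 1e-9   # seconds per message, seconds per word

# For small messages the latency term alpha dominates ...
small = message_cost(10, alpha, beta)
# ... while for large messages the bandwidth term beta*w dominates.
large = message_cost(10**8, alpha, beta)

assert small < 2 * alpha            # latency-bound regime
assert large > 0.9 * beta * 10**8   # bandwidth-bound regime
```

This is why latency costs, scaled by α, dominate when many small messages are sent, motivating algorithms that aggregate communication into fewer, larger messages.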
2.1.2 MPI Collectives
MPI collectives are commonly used functions where groups of processors invoke one
function to pass data collectively between them. Table 2.1 shows the MPI collectives
used in this thesis and their initial and final data distributions. For example, given
elements of a vector x scattered across processors, AllGather will gather those
elements so that all processors have a full copy of x. If instead each processor has its
own local x, AllReduce will sum the local copies and store the result on all processors,
while ReduceScatter will sum the local copies and distribute the segments of that
sum across the processors [11].
Operation        Before                                          After
All-Reduce       p_j owns x^(j)                                  every p_j owns Σ_j x^(j)
Reduce-Scatter   p_j owns x^(j) = (x^(j)_0, x^(j)_1, x^(j)_2)    p_i owns Σ_j x^(j)_i
All-Gather       p_i owns segment x_i                            every p_j owns (x_0, x_1, x_2)

Table 2.1: MPI collective algorithm data distributions [11]. x_i is a segment of a vector x. x^(j) is data originally belonging to processor p_j.
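The semantics of these three collectives can be sketched in plain Python, simulating the per-processor data as a list of per-processor values (no MPI required; this illustrates only the data distributions, not the communication algorithms):

```python
def all_reduce(local):
    """Every processor ends with the elementwise sum of all local vectors."""
    total = [sum(vals) for vals in zip(*local)]
    return [total[:] for _ in local]

def reduce_scatter(local):
    """Processor i ends with segment i of the elementwise sum (one word each here)."""
    total = [sum(vals) for vals in zip(*local)]
    return [total[i] for i in range(len(local))]

def all_gather(segments):
    """Every processor ends with the concatenation of all segments."""
    full = list(segments)
    return [full[:] for _ in segments]

# Three processors, each owning a local 3-word vector x^(j):
local = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
assert all_reduce(local) == [[12, 15, 18]] * 3
assert reduce_scatter(local) == [12, 15, 18]      # p_i holds segment i of the sum
assert all_gather([10, 20, 30]) == [[10, 20, 30]] * 3
```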
Table 2.2 shows the minimal α−β−γ costs of each of the three collectives described
in Table 2.1. As the number of processors p increases, the latency costs increase,
eventually creating a bottleneck in any distributed-memory parallel algorithm.
Table 2.2: MPI collective algorithm costs using the α-β-γ model [11]. The costs assume an input array of n words that is communicated using p processors.
2.1.3 Parallel Scaling
Scaling studies are useful for analyzing parallel algorithms. There are two types of
scaling: strong and weak. Strong scaling measures the performance improvement
gained by increasing the number of processors working on a fixed problem. An
algorithm is said to have perfect strong scaling when the speed-up relative to one
processor equals the number of processors used (e.g., 8× speed-up on 8 processors).
Perfect strong scaling is possible when the problem is computation bound and the
work can be evenly distributed among processors. However, after a certain point the
communication cost in a parallel algorithm will start to dominate, since it can grow
with the number of processors used.
Weak scaling is done by observing the performance as the number of processors
increases in step with the size of the problem. Weak scaling suits problems whose
resolution can be increased, such as the number of spatial grid points in a simulation.
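The strong-scaling metrics above can be computed directly; the sketch below uses hypothetical timings (not measurements from this thesis) to show the definitions of speed-up and parallel efficiency.

```python
def speedup(t1, tp):
    """Strong-scaling speed-up of p processors relative to one processor."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Parallel efficiency: 1.0 corresponds to perfect strong scaling."""
    return speedup(t1, tp) / p

# Hypothetical timings: 64 s on 1 processor, 10 s on 8 processors.
s = speedup(64.0, 10.0)        # 6.4x, short of the perfect 8x
e = efficiency(64.0, 10.0, 8)  # 0.8, i.e. 80% efficiency
assert abs(s - 6.4) < 1e-12 and abs(e - 0.8) < 1e-12
```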
2.2 Matrices
A matrix is a two-dimensional grid of numbers and is a useful data storage format. In
this work, a matrix called “A” is written as A. One of the important characteristics
of a matrix that is explored in this thesis is its rank. We will explore the rank further
in section 2.2.1.
Low-rank approximations of matrices extract the most useful features out of the
original matrix. This can be useful in things like image compression as the resulting
representation of the matrix can be smaller but still maintain the essence of the
original data.
2.2.1 Singular Value Decomposition
The Singular Value Decomposition (SVD) is a popular factorization of real or complex
matrices into interpretable component matrices. The SVD is given by
A = UΣV^T   (2.1)

where A ∈ R^{m×n}, U ∈ R^{m×n}, Σ ∈ R^{n×n}, V ∈ R^{n×n}, and m ≥ n.
U and V are matrices with orthonormal columns: each column vector is
perpendicular to the other column vectors in the matrix and has unit norm
("length" 1). In the case of the
SVD, the column vectors of U and V are called the left and right singular vectors,
respectively.
Σ is a diagonal matrix with positive descending diagonal entries. This means that
only the entries along the main diagonal from upper-left to lower-right can be nonzero
while the rest of the matrix is zero. These diagonal entries are called the singular
values, and they are unique to the matrix A.
The SVD has many properties that are useful for Numerical Linear Algebra. The
rank r of the matrix A is defined as the number of nonzero singular values in Σ.
Since the number of singular values is bounded by the number of diagonal entries of
the matrix Σ, the rank is also bounded as r ≤ n. If r = n, a matrix is said to be full
rank.
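The SVD properties above can be verified numerically; the following is a NumPy sketch (not code from the thesis) that computes a thin SVD of a matrix constructed to have rank 2 and checks orthonormality, ordering, and the rank definition.

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a 6x4 matrix of known rank 2: the sum of two rank-1 outer products.
A = rng.standard_normal((6, 1)) @ rng.standard_normal((1, 4)) \
  + rng.standard_normal((6, 1)) @ rng.standard_normal((1, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: U is 6x4
assert np.allclose(U @ np.diag(s) @ Vt, A)         # A = U Sigma V^T
assert np.allclose(U.T @ U, np.eye(4))             # orthonormal columns
assert np.all(s[:-1] >= s[1:]) and np.all(s >= 0)  # descending, nonnegative

# Rank = number of nonzero singular values (up to floating-point tolerance).
rank = int(np.sum(s > 1e-10 * s[0]))
assert rank == 2
```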
2.2.2 Truncated SVD
Given a matrix A with rank r and SVD A = UΣVT , the best rank k ≤ r approxi-
mation of A can be defined as
A_k = Σ_{j=1}^{k} σ_j u_j v_j^T   (2.2)

as provided by [60], where σ_1, …, σ_k are the k largest singular values of A and u_j
and v_j are the corresponding columns of U and V.
From eq. (2.2), the truncated SVD consists of the first k columns of U and V and
the first k singular values from the full SVD of a matrix A. The truncated SVD is
represented as A_k = U_k Σ_k V_k^T.
So, after computing the full SVD as described in section 2.2.1, the truncated SVD
for any rank k is trivial to compute.
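The rank-k truncation of eq. (2.2) can be sketched in NumPy as follows (an illustration, not the thesis's implementation); the Eckart-Young theorem says the resulting Frobenius error equals the square root of the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def truncated_svd(U, s, Vt, k):
    """Keep the first k singular triplets: A_k = U_k Sigma_k V_k^T."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A2 = truncated_svd(U, s, Vt, 2)
# Eckart-Young: the error of the best rank-2 approximation is determined
# by the discarded singular values sigma_3, ..., sigma_5.
err = np.linalg.norm(A - A2, 'fro')
assert np.isclose(err, np.sqrt(np.sum(s[2:] ** 2)))
assert np.linalg.matrix_rank(A2) == 2
```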
2.2.3 QR Decomposition
Similar to the Singular Value Decomposition (section 2.2.1), the QR decomposition
takes any matrix A and computes
A = QR (2.3)
where A ∈ Rm×n, Q ∈ Rm×n, and R ∈ Rn×n. Like U and V in the SVD, Q has
orthonormal columns. R is an upper triangular matrix: the entries on and above
the main diagonal may be nonzero, while every entry below the main diagonal is
zero.
The QR decomposition is useful for solving least squares problems. As will be
explained in chapter 4, it can also be used in computing the truncated SVD, since
the QR decomposition is less computationally expensive than a full SVD.
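Both properties of the QR decomposition, and its use for least squares, can be sketched in NumPy (an illustration, not code from the thesis):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((7, 3))
Q, R = np.linalg.qr(A)                  # thin QR: Q is 7x3, R is 3x3

assert np.allclose(Q @ R, A)            # A = QR
assert np.allclose(Q.T @ Q, np.eye(3))  # Q has orthonormal columns
assert np.allclose(R, np.triu(R))       # R is upper triangular

# Solving the least squares problem min ||Ax - b|| via QR: R x = Q^T b.
b = rng.standard_normal(7)
x = np.linalg.solve(R, Q.T @ b)
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])
```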
A (m×n) ≈ W (m×k) · H^T (k×n)

Figure 2.1: Nonnegative Matrix Factorization (NMF) of a matrix A by factor matrices W and H. The dimensions of each matrix are listed in parentheses below the boxes. The boxes of each matrix are relative in size to one another given dimension choices.
2.2.4 Nonnegative Matrix Factorization
Nonnegative Matrix Factorization (NMF) is an approximation of a matrix with high
dimensions as a product of two lower dimensional nonnegative matrices. The approx-
imation is written as
A ≈ WH^T   (2.4)
where A ∈ R_+^{m×n} is a data matrix, and W ∈ R_+^{m×k} and H ∈ R_+^{n×k} are both
nonnegative factor matrices. The chosen k ≤ min(m, n) is a parameter and is the
rank of the factor matrices and also the nonnegative rank of the approximation of A.
This approximation is also depicted in fig. 2.1.
There are several methods for computing a NMF. One of these methods is the
Alternating Nonnegative Least Squares (ANLS) method [38]. This method starts
with the minimization problem
min_{W≥0} ‖A − WH^T‖   (2.5)

for finding W and the similar problem of

min_{H≥0} ‖A^T − HW^T‖   (2.6)

for finding H. These are both constrained Least Squares (LS) problems with
nonnegativity constraints. They are referred to as Nonnegative Least Squares (NNLS).
By fixing either W or H and solving the linear system for the other, an alternating
update algorithm can converge to a stationary point, since both minimizations are
convex problems [38].
There are different algorithms used to solve the NNLS problems described in
eq. (2.5) and eq. (2.6). One of these methods, Block Principal Pivoting (BPP),
is described in [35] and [31]. BPP uses the active set method in order to compute
the NNLS solution. The active set method deals with the nonnegativity constraint
of NNLS by iteratively computing the unconstrained LS solution and grouping
negative contributions. This active set method is well defined for the vector case and
is extended to the matrix case by going column by column.
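The alternating-update idea can be sketched compactly. The thesis's solvers use BPP for the NNLS subproblems; as a lightweight stand-in, the sketch below uses the classic Lee-and-Seung multiplicative updates, a different (and simpler) update rule that also keeps both factors nonnegative and monotonically decreases the Frobenius error.

```python
import numpy as np

def nmf_mu(A, k, iters=500, eps=1e-9, seed=0):
    """Multiplicative-update NMF sketch: A (m x n, nonnegative) ~ W H^T."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps
    H = rng.random((n, k)) + eps
    for _ in range(iters):
        # Alternate updates of W and H; eps guards against division by zero.
        W *= (A @ H) / (W @ (H.T @ H) + eps)
        H *= (A.T @ W) / (H @ (W.T @ W) + eps)
    return W, H

rng = np.random.default_rng(3)
# A nonnegative matrix with an exact rank-3 nonnegative factorization.
A = rng.random((20, 3)) @ rng.random((3, 15))
W, H = nmf_mu(A, k=3)
assert np.all(W >= 0) and np.all(H >= 0)      # factors stay nonnegative
rel_err = np.linalg.norm(A - W @ H.T) / np.linalg.norm(A)
assert rel_err < 0.05                          # close fit on this easy problem
```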
2.2.5 Hierarchical NMF
NMF can be used to cluster data by interpreting the W and H factor matrices. For
example, if columns of a data matrix represent samples of data and rows represent
features of those samples, then the k columns of W represent k clusters of data and
the k rows of HT represent the membership of each data point in the k clusters.
Since NMF can naturally be used as a clustering algorithm, recursively calling
NMF with k = 2 on data can result in a hierarchical tree of clusters. This is the
basic premise of the Hierarchical NMF algorithm. In Hierarchical NMF, k refers to
the number of leaf clusters in the resulting tree.
From section 2.2.4, BPP is a general approach to solving NNLS for any k and
scales like O(k). In [38], the authors propose a faster NNLS algorithm that requires
k = 2. There are only 4 possible active sets when k = 2, so they can be enumerated
exhaustively without becoming infeasible. Since the algorithm proposed in [38]
is so simple to compute for k = 2, the authors proposed that it be used as a subroutine
for Hierarchical NMF. In chapter 3, we parallelize this Hierarchical NMF algorithm
using a parallel Rank-2 NMF.
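The k = 2 exhaustive active-set idea can be sketched for a single right-hand side: with two variables there are only 4 active sets (both free, x_1 = 0, x_2 = 0, both zero), so the exact NNLS solution is the best feasible candidate among them. This NumPy sketch is an illustration of the principle, not the implementation from [38] or chapter 3.

```python
import numpy as np

def nnls_rank2(C, d):
    """Exact min_{x>=0} ||C x - d|| for a matrix C with 2 columns, by trying
    all 4 possible active sets: both free, x1 = 0, x2 = 0, both zero."""
    candidates = [np.zeros(2)]                       # both variables zero
    x_free, *_ = np.linalg.lstsq(C, d, rcond=None)   # both variables free
    candidates.append(x_free)
    for j in (0, 1):                                 # one variable free
        xj, *_ = np.linalg.lstsq(C[:, [j]], d, rcond=None)
        x = np.zeros(2)
        x[j] = xj[0]
        candidates.append(x)
    feasible = [x for x in candidates if np.all(x >= 0)]
    return min(feasible, key=lambda x: np.linalg.norm(C @ x - d))

rng = np.random.default_rng(4)
C = rng.standard_normal((10, 2))
# If the unconstrained solution is already nonnegative, NNLS matches it.
x_true = np.array([0.5, 1.5])
assert np.allclose(nnls_rank2(C, C @ x_true), x_true)
# Otherwise the solution is nonnegative with residual no worse than x = 0.
d2 = rng.standard_normal(10)
x = nnls_rank2(C, d2)
assert np.all(x >= 0)
assert np.linalg.norm(C @ x - d2) <= np.linalg.norm(d2) + 1e-12
```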
2.3 Tensors
Tensors are a generalization of matrices in higher dimensions. In this work, a tensor
called “T” is written as T. Tensors are popular in a number of fields such as sig-
nal processing, numerical linear algebra, computer vision, numerical analysis, data
mining, graph analysis, and neuroscience [36].
2.3.1 Tensor Train
One of the problems of working with tensors is the so-called “curse of dimensionality”,
where the number of elements of the tensor is exponential in the number of modes [47].
Some tensor applications can use tens to thousands of modes and so can lead to tensors
of infeasible size in both storage and computation. A solution to this problem is to
use a tensor decomposition that can compress the data and is not exponential in the
number of modes. One such decomposition is called Tensor Train.
Tensor Train (TT) is a low-rank tensor decomposition. It has been used in areas
such as molecular simulations, data completion, uncertainty quantification, and clas-
sification [1]. The “train” of tensor train is a series of tensors, called TT cores. Each
of these tensors, with the exception of the first and last tensors, is a three-way tensor.
The first and last tensor in the train are both matrices. Figure 2.2 shows a diagram
of a five-way tensor in TT format.
Figure 2.2: TT format of a five-way tensor X ∈ R^{I1×I2×I3×I4×I5}. Note that R_0 = R_N = 1 is shown through the first and last TT cores being matrices. The blue shaded regions represent the matrices and vectors required in computing eq. (2.7). Although the I_n can be of any size, they are generally thought to be much larger relative to the R_n, and so this representation shows tall TT cores.
2.3.2 Tensor Train Notation
Given a tensor X ∈ RI1×···×IN where N is the number of modes of X and each Ik is
the dimension of that mode, if X can be represented in TT format, then there exist
positive integers R0, . . . , RN with R0 = RN = 1 and N TT cores where the nth TT
core is T_X,n ∈ R^{R_{n−1}×I_n×R_n}. In other words, X is in TT format if it can be
represented as

X(i_1, …, i_N) = T_X,1(:, i_1, :) T_X,2(:, i_2, :) ⋯ T_X,N(:, i_N, :),   (2.7)

where T_X,n is the nth tensor core of N cores in the train [47]. Figure 2.2 shows the
pattern of element access for the entry X(i, j, k, l,m).
The integers R_0, …, R_N are called the TT ranks. By reducing these TT ranks
and approximating X, the resulting tensor is stored in a more compressed format.
This TT rank reduction is called TT rounding.
One of the advantages of using Tensor Train over other tensor low-rank approxima-
tions is that the number of elements of the TT format is linear rather than exponential
in the number of modes of the original tensor. In other words,

|TT(X)| = Σ_{k=1}^{N} R_{k−1} I_k R_k   (2.8)

where |TT(X)| is the number of elements of the TT representation of X. Note that
eq. (2.8) shows that

|TT(X)| = O(N I R²)   (2.9)

where N is the number of modes of X, I is the largest dimension of X, and R is the
largest TT rank of X.
By comparison to eq. (2.9), another decomposition called Tucker has O(R^N + NIR)
elements, where R is the Tucker rank, which may differ from the TT ranks.
TT avoids a number of elements exponential in the number of modes by limiting
the modes of the factor tensors.
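The element-access and storage formulas can be sketched in NumPy (an illustration, not code from the thesis): an entry of X is the product of one slice from each core, and the storage count follows eq. (2.8).

```python
import numpy as np

def tt_num_elements(dims, ranks):
    """|TT(X)| = sum_k R_{k-1} * I_k * R_k (eq. 2.8), with R_0 = R_N = 1."""
    return sum(ranks[k] * dims[k] * ranks[k + 1] for k in range(len(dims)))

def tt_entry(cores, index):
    """X(i_1, ..., i_N) as the product of core slices T_{X,n}(:, i_n, :)."""
    v = np.ones((1, 1))  # 1x1 since R_0 = 1
    for core, i in zip(cores, index):
        v = v @ core[:, i, :]
    return v[0, 0]       # 1x1 since R_N = 1

dims, ranks = [4, 5, 6], [1, 2, 3, 1]   # R_0 = R_N = 1
rng = np.random.default_rng(5)
cores = [rng.standard_normal((ranks[n], dims[n], ranks[n + 1]))
         for n in range(3)]

# Storage follows eq. (2.8) and is smaller than the full 4*5*6 tensor.
assert tt_num_elements(dims, ranks) == 1*4*2 + 2*5*3 + 3*6*1  # = 56
assert tt_num_elements(dims, ranks) < 4 * 5 * 6               # 56 < 120

# Entrywise evaluation matches a direct contraction of the whole train.
full = np.einsum('aib,bjc,ckd->ijk', *cores)
assert np.isclose(full[1, 2, 3], tt_entry(cores, (1, 2, 3)))
```

For tensors with many large modes the gap is dramatic: the full tensor grows exponentially in N while the TT storage grows only linearly.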
Some computations with tensors, such as the truncated SVD, require the individ-
ual TT cores to be “unfolded”. Figure 2.3 shows this pattern of unfolding for vertical
and horizontal unfoldings.
H(T_X,n) ∈ R^{R_{n−1} × I_n R_n},   V(T_X,n) ∈ R^{R_{n−1} I_n × R_n}

Figure 2.3: Types of unfolding for TT tensors. T_X,n is the nth TT core. The blue shaded region is a slice of T_X,n. H(T_X,n) is the horizontal unfolding of T_X,n. V(T_X,n) is the vertical unfolding of T_X,n.
2.3.3 TT Rounding
The truncated SVD is necessary to reduce the ranks of a TT tensor X. In general,
each TT rank Rn is reduced as the TT rounding algorithm proceeds down the train
of TT cores. The current state-of-the-art method of computing TT rounding requires
an orthogonalization step using the QR decomposition. Although it is quite accurate,
this approach is slow. Chapter 4 describes an improvement on this method that
avoids using the QR orthogonalization step, improving the speed of the overall TT
Rounding algorithm.
Chapter 3: Parallel Hierarchical Clustering using
Rank-Two Nonnegative Matrix Factorization
The following chapter is a manuscript published at the International Conference
on High Performance Computing (HiPC'20), authored by myself, Grey Ballard,
Ramakrishnan Kannan, and Haesun Park. For this work, I contributed to designing and
implementing the parallel algorithms identified in the paper. I also contributed to the
experimental section of the manuscript by reporting results and choosing data sets
for experimentation.
3.1 Abstract
Nonnegative Matrix Factorization (NMF) is an effective tool for clustering nonnega-
tive data, either for computing a flat partitioning of a dataset or for determining a
hierarchy of similarity. In this paper, we propose a parallel algorithm for hierarchical
clustering that uses a divide-and-conquer approach based on rank-two NMF to split a
data set into two cohesive parts. Not only does this approach uncover more structure
in the data than a flat NMF clustering, but also rank-two NMF can be computed
more quickly than for general ranks, providing comparable overall time to solution.
Our data distribution and parallelization strategies are designed to maintain compu-
tational load balance throughout the data-dependent hierarchy of computation while
limiting interprocess communication, allowing the algorithm to scale to large dense
and sparse data sets. We demonstrate the scalability of our parallel algorithm in terms
of data size (up to 800 GB) and number of processors (up to 80 nodes of the Summit
supercomputer), applying the hierarchical clustering approach to hyperspectral imag-
ing and image classification data. Our algorithm for Rank-2 NMF scales perfectly
on up to 1000s of cores, and the entire hierarchical clustering method achieves 5.9×
speedup scaling from 10 to 80 nodes on the 800 GB dataset.
3.2 Introduction
Nonnegative Matrix Factorization (NMF) has been demonstrated to be an effective
tool for unsupervised learning problems including clustering [15, 51, 65]. An NMF
consists of two tall-and-skinny non-negative matrices whose product approximates a
nonnegative data matrix. That is, given an m×n data matrix A, we seek nonnegative
matrices W and H that each have k columns so that A ≈ WHT. Each pair of
corresponding columns of W and H form a latent component of the NMF. If the
rows of A correspond to features and the columns to samples, the ith row of the H
matrix represents the loading of sample i onto each latent component and provides a
soft clustering. Because the W factor is also nonnegative, each column can typically
be interpreted as a latent feature vector for each cluster.
Hierarchical clustering is the process of recursively partitioning a group of samples.
While standard NMF is interpreted as a flat clustering, it can also be extended for
hierarchical clustering. Kuang and Park [38] propose a method that uses rank-2 NMF
to recursively bipartition the samples. The method determines a binary tree such that
all leaves contain unique samples and the structure of the tree determines hierarchical
clusters. A single W vector for each node can also be used for cluster interpretation.
We discuss the hierarchical method in more detail in Section 3.3 and Section 3.4.1.
We illustrate the output of the hierarchical clustering method with an example
data set and output tree. Following Gillis et al. [25], we apply the method to a
hyperspectral imaging (HSI) data set of the Washington, D.C national mall, which
has pixel dimensions 1280 × 307 and 191 spectral bands. Figure 3.1 visualizes the
output tree with 6 leaves along with their hierarchical relationships. The root node,
labeled 0, is a flattening of the HSI data to a 2D grayscale image. Each other node is
represented by an overlay of the member pixels of the clusters (in blue) on the original
grayscale image. The first bipartitioning separates vegetation (cluster 1) from non-
vegetation (cluster 2), the bipartitioning of cluster 1 separates grass (cluster 3) from
trees (cluster 4), the bipartitioning of cluster 2 separates buildings (cluster 5) from
sidewalks/water (cluster 6), and so on. If the algorithm continues, it chooses to split
the leaf node that provides the greatest benefit to the overall tree, which can be
quantified as a node’s “score” in various ways.
While the hierarchical clustering method offers advantages in terms of interpre-
tation as well as execution time compared to flat NMF, implementations of the al-
gorithm are limited to single workstations and the dataset must fit in the available
memory. Currently available implementations can utilize multiple cores via MAT-
LAB [38] or explicit shared-memory parallelization in the SmallK library [17].
The goal of this work is to use distributed-memory parallelism to scale the algo-
rithm to large datasets that require the memory of multiple compute nodes and to
high processor counts. While flat NMF algorithms have been scaled to HPC plat-
forms [6, 21, 32, 41], our implementation is the first to our knowledge to scale a hier-
archical NMF method to 1000s of cores. As discussed in detail in Section 3.4.2, we
choose to parallelize the computations associated with each node in the tree, which
involve a Rank-2 NMF and the computation of the node’s score. We choose a data
matrix distribution across processors that avoids any redistribution of the input ma-
trix regardless of the data-dependent structure of the tree’s splitting decisions so that
the communication required involves only the small factor matrices. Analysis of the
algorithm shows the dependence of execution time on computation and communica-
tion costs as well as on k, the number of clusters computed. In particular, we confirm
that many of the dominant costs are logarithmic in k, which compares favorably to
the linear or sometimes superlinear dependence of flat NMF algorithms.
We demonstrate in Section 3.5 the efficiency and scalability of our parallel al-
gorithm on three data sets, including the HSI data of the DC mall and an image
classification data set involving skin melanoma. The experimental results show that
our parallelization of Rank-2 NMF is highly scalable, maintaining computation bound
performance on 1000s of cores. We also show the limits of strong scalability when
scaling to large numbers of clusters (leaf nodes), as the execution time shifts to be-
coming interprocessor bandwidth bound and eventually latency bound. The image
classification data set requires 800 GB of memory across multiple nodes to process,
and in scaling from 10 nodes to 80 nodes of the Summit supercomputer (see Sec-
tion 3.5.1), we demonstrate parallel speedups of 7.1× for a single Rank-2 NMF and
5.9× for a complete hierarchical clustering.
3.3 Preliminaries and Related Work
3.3.1 Non-negative Matrix Factorization (NMF)
The NMF constrained optimization problem
min_{W,H≥0} ‖A − WH^T‖²
is nonlinear and nonconvex, and various optimization techniques can be used to ap-
proximately solve it. A popular approach is to use alternating optimization of the
two factor matrices because each subproblem is a nonnegative least squares (NNLS)
problem, which is convex and can be solved exactly. Many block coordinate descent
(BCD) approaches are possible [34], and one 2-block BCD algorithm that solves the
NNLS subproblems exactly is block principal pivoting [35]. This NNLS algorithm is
an active-set-like method that determines the sets of entries in the solution vectors
that are zero and those that are positive through an iterative but finite process.
Figure 3.1: Hierarchical Clustering of DC Mall HSI
When the rank of the factorization (the number of columns of W and H) is
2, the NNLS subproblems can be solved much more quickly because the number
of possible active sets is only 4. As explained in more detail in Section 3.4.1, the
optimal solution across the 4 sets can be determined efficiently to solve the NNLS
subproblem more quickly than general-rank approaches like block principal pivoting.
Because of the relative ease of solving the NMF problem for the rank-2 case, Kuang
and Park [38] propose a recursive method to use a rank-2 NMF to partition the input
data into 2 parts, whereby each part can be further partitioned via rank-2 NMF
of the corresponding original data. This approach yields a hierarchical factorization,
potentially uncovering more global structure of the input data and allowing for better
scalability of the algorithm to large NMF ranks.
The hierarchical rank-2 NMF method has been applied to document clustering [38]
and hyperspectral image segmentation [25]. The leaves of the tree also yield a set of
column vectors that can be aggregated into an approximate W factor (ignoring their
hierarchical structure). Using this factor matrix to initialize a higher-rank NMF com-
putation leads to quick convergence and overall faster performance than initializing
NMF with random data; this approach is known as Divide-and-Conquer NMF [19].
We focus in this paper on parallelizing the hierarchical algorithms proposed by Kuang
and Park [38] and Gillis et al. [25].
3.3.2 Parallel NMF
Scaling NMF algorithms to large data often requires parallelization, either to fit
the data across the memories of multiple compute nodes or to complete the computation
in a reasonable time. Parallelizations of multiple optimization approaches
have been proposed for general NMF [6, 17, 21, 32, 41]. In particular, we build upon
the work of Kannan et al. [20, 31, 32] and the open-source library PLANC, designed
for nonnegative matrix and tensor factorizations of dense and sparse data. In this
parallelization, the alternating optimization approach is employed with various op-
tions for the algorithm used to (approximately) solve the NNLS subproblems. The
efficiency of the parallelization is based on scalable algorithms for the parallel ma-
trix multiplications involved in all NNLS algorithms; these algorithms are based on
Cartesian distributions of the input matrix across 1D or 2D processor grids.
3.3.3 Communication Model
We use the α-β-γ model [4, 11, 58] for analysis of distributed-memory parallel algo-
rithms. In this model, the cost of sending a single message of n words of data between
two processors is α + β · n, so that α represents the latency cost of the message and
β represents the bandwidth cost of each word in the message. The γ parameter
represents the computational cost of a single floating point operation (flop). In this
simplified communication model, we ignore contention in the network, assuming in
effect a fully connected network, and other limiting factors in practice such as the
number of hops between nodes and the network injection rate [28]. We let p represent
the number of processors available on the machine.
All of the interprocessor communication in the algorithms presented in this work
is encapsulated in collective communication operations that involve the full set of
processors. Algorithms for implementing the collective operations are built out of
pairwise send and receive operations, and we assume the most efficient algorithms are
used in our analysis [11, 58]. The collectives used in our algorithms are all-reduce,
all-gather, and reduce-scatter. In an all-reduce, all processors start out with the same
amount of data and all end with a copy of the same result, which is in our case a sum
of all the inputs (and the same size as a single input). The cost of an all-reduce of size
n words is α ·O(log p) + (β+ γ) ·O(n) for n > p and α ·O(log p) + (β+ γ) ·O(n log p)
for n < p. In an all-gather, all processors start out with separate data and all end
with a copy of the same result, which is the union of all the input data. If each
processor starts with n/p data and ends with n data, the cost of the all-gather is
α · O(log p) + β · O(n). In a reduce-scatter, all processors start out with the same
amount of data and all end with a subset of the result, which is in our case a sum of
all the inputs (and is smaller than its input). If each processor starts with n data and
ends with n/p data, the cost of the reduce-scatter is α ·O(log p)+(β+γ) ·O(n). In the
case of all-reduce and reduce-scatter, the computational cost is typically dominated
by the bandwidth cost because β ≫ γ.
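As a concrete illustration of the model above, the asymptotic collective costs can be encoded in a small cost estimator. This sketch is our own (the function name and parameter values are not from the text); it drops constants and assumes n > p and efficient collective implementations.

```python
import math

def collective_cost(n, p, alpha, beta, gamma, kind):
    """Asymptotic cost of a collective over p processors in the alpha-beta-gamma
    model, assuming efficient implementations and n > p (constants dropped)."""
    latency = alpha * math.log2(p)
    if kind == "all-gather":       # no computation, O(n) words moved
        return latency + beta * n
    if kind == "reduce-scatter":   # summing contributions adds a gamma term
        return latency + (beta + gamma) * n
    if kind == "all-reduce":       # same asymptotic form for n > p
        return latency + (beta + gamma) * n
    raise ValueError(f"unknown collective: {kind}")

# With beta >> gamma, the (beta + gamma) * n term is bandwidth dominated.
```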
3.4 Algorithms
3.4.1 Sequential Algorithms
Rank-2 NMF
Using the 2-block BCD approach for a rank-2 NMF yields NNLS subproblems of the
form min_{H≥0} ‖WH^T − A‖ and min_{W≥0} ‖HW^T − A^T‖. In each case, the columns of
the transposed variable matrix can be computed independently. Considering the ith row
of H, for example, the NNLS problem to solve is

min_{h_{i,1},h_{i,2}≥0} ‖[w1 w2] [h_{i,1}  h_{i,2}]^T − a_i‖ = min_{h_{i,1},h_{i,2}≥0} ‖h_{i,1}w1 + h_{i,2}w2 − a_i‖
where w1 and w2 are the two columns of W and a_i is the ith column of A. We note that
there are four possibilities of solutions, as each of the two variables may be positive
or zero.
As shown by Kuang and Park [38], determining which of the four possible solutions
is feasible and optimal can be done efficiently by exploiting the following properties:
• if the solution to the unconstrained least squares problem admits two positive
values, it is the optimal solution to the nonnegatively constrained problem,
• if W and A are both nonnegative, then the candidate solution with two zero
values is never (uniquely) optimal and can be discarded, and
• if the unconstrained problem does not admit a positive solution, the better of
the two remaining solutions can be determined by comparing a_i^T w1/‖w1‖ and
a_i^T w2/‖w2‖.
If the unconstrained problem is solved via the normal equations, then the temporary
matrices computed for the normal equations (W^T W and A^T W) can be re-used to
determine the better of the two solutions with a single positive variable.
Algorithm 1 implements this strategy for all rows of H simultaneously. It takes as
input the matrices C = A^T W and G = W^T W, first solves the normal equations for
the unconstrained problem, and then chooses between the two alternate possibilities as
necessary. We note that each row of H is independent, and therefore this algorithm is
easily parallelized. Solving for W can be done using inputs C = AH and G = H^T H.
Given that the computational complexity of Algorithm 1 is O(n) (or O(m) when
computing W), and the complexity of computing W^T W and H^T H is O(m + n), the
typical dominant cost of each iteration of Rank-2 NMF is that of computing A^T W
and AH, which is O(mn).
Hierarchical Clustering
A Rank-2 NMF can be used to partition the columns of the matrix into two parts.
In this case, the columns of the W factor represent feature weights for each of the
two latent components, and the strength of membership in the two components for
each column of A is given by the two values in the corresponding row of H. We can
Algorithm 1 Rank-2 Nonnegative Least Squares Solve [38]
Require: C is n × 2 and G is 2 × 2 and s.p.d.
 1: function H = Rank2-NLS-Solve(C, G)
 2:   H = CG^{-1}                      ▷ Solve unconstrained system
 3:   for i = 1 to n do
 4:     if hi1 < 0 or hi2 < 0 then
 5:       ▷ Choose between single-variable solutions
 6:       if ci1/√g11 < ci2/√g22 then
 7:         hi1 = 0
 8:         hi2 = ci2/g22
 9:       else
10:         hi1 = ci1/g11
11:         hi2 = 0
12:       end if
13:     end if
14:   end for
15: end function
Ensure: H = argmin_{H≥0} ‖A − WH^T‖ is n × 2 with C = A^T W and G = W^T W
determine part membership by comparing those values: if hi1 > hi2, then column i of
A is assigned to the first part, which is associated with feature vector w1. Membership
can be determined by other metrics that also take into account balance across parts
or attempt to detect outliers.
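The simple comparison rule reads, for instance, as follows in NumPy (an illustrative snippet with made-up matrix values):

```python
import numpy as np

# Rows of H hold the two membership strengths for each column of A.
H = np.array([[0.9, 0.1],
              [0.2, 0.7],
              [0.5, 0.4]])
in_part1 = H[:, 0] > H[:, 1]      # column i joins the part of w1 iff h_i1 > h_i2
part1 = np.flatnonzero(in_part1)  # column indices assigned to the first part
part2 = np.flatnonzero(~in_part1)
```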
Given Rank-2 NMF as a splitting procedure, hierarchical clustering builds a binary
tree such that each node corresponds to a subset of samples from the original data
set and each node’s children correspond to a 2-way partition of the node’s samples.
In this way, the leaves form a partition of the original data, and the internal nodes
specify the hierarchical relationship among clusters. As the tree is built, nodes are
split in order of their score, or relative value to the overall clustering of the data.
The process can be continued until a target number of leaves is produced or until all
remaining leaves have a score below a given threshold.
A node’s score can be computed in different ways. For document clustering, Kuang
Figure 3.2: Hierarchy node classification
and Park [38] propose using modified normalized discounted cumulative gain, which
measures how distinct a node’s children are from each other using the feature weights
associated with the node and its children. For hyperspectral imaging data, Gillis et
al. [25] propose using the possible reduction in overall NMF error if the node is split,
that is, the difference in error between using the node itself and using its children.
We use the latter in our implementation.
In any case, a node’s score depends on properties of its children, so the compu-
tation for a split must be done before the split is actually accepted. To this end,
we define a frontier node to be a parent of leaves; these are nodes whose children
have been computed but whose splits have not been accepted. Figure 3.2 depicts the
classification of nodes into internal, frontier, and leaf nodes. As the tree is built, the
algorithm selects the frontier node with the highest score to split; accepting the split
requires no further computation because the node's children have already been computed.
When a frontier node split is accepted, it becomes
an internal node and its children are split (so that their scores can be computed) and
added to the set of frontier nodes. When the algorithm terminates, the leaves are
discarded and the frontier nodes become the leaves of the output tree.
Our hierarchical clustering algorithm is presented in Algorithm 2 and follows that
of Kuang and Park [38]. Each node includes a field A, which is a subset of columns
(samples) of the original data, a feature vector w, which is its corresponding column
of the W matrix from its parent’s Rank-2 NMF, a score, and pointers to its left and
right children. A priority queue Q tracks the frontier nodes so that the node with the
highest score is split at each step of the algorithm. We use a target number of leaf
clusters k as the termination condition. When a node is selected from the priority
queue, it is removed from the set of frontier nodes and its children are added.
Algorithm 2 Hierarchical Clustering [38]
Require: A is m × n, k is target number of leaf clusters
 1: function T = Hier-R2-NMF(A)
 2:   R = node(A)                          ▷ create root node
 3:   Split(R)
 4:   Split(R.left); inject(Q, R.left)     ▷ create priority queue
 5:   Split(R.right); inject(Q, R.right)   ▷ of frontier nodes
 6:   while size(Q) < k do
 7:     N = eject(Q)                       ▷ frontier node with max score
 8:     Split(N.left)                      ▷ split left child
 9:     inject(Q, N.left)                  ▷ and add to Q
10:     Split(N.right)                     ▷ split right child
11:     inject(Q, N.right)                 ▷ and add to Q
12:   end while
13: end function
Ensure: T is binary tree rooted at R with k frontier nodes, each node has subset of
cols of A and feature vector w
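The control flow of Algorithm 2 can be sketched with Python's heapq. In this sketch of ours, the Split routine is pluggable (in the thesis it is Rank-2 NMF plus the score of Algorithm 3); the names and the stand-in splitter below are illustrative only.

```python
import heapq

class Node:
    """Tree node holding a subset of column indices of the data matrix."""
    def __init__(self, cols):
        self.cols, self.left, self.right, self.score = cols, None, None, 0.0

def hier_cluster(cols, k, split):
    """Algorithm 2's loop: repeatedly eject the frontier node with the highest
    score, accept its split, and add its (already split) children."""
    root = Node(cols)
    split(root)
    q, tick = [], 0                      # max-heap via negated scores
    for child in (root.left, root.right):
        split(child)                     # children are split so scores exist
        heapq.heappush(q, (-child.score, tick, child)); tick += 1
    while len(q) < k:
        _, _, node = heapq.heappop(q)    # frontier node with max score
        for child in (node.left, node.right):
            split(child)
            heapq.heappush(q, (-child.score, tick, child)); tick += 1
    return root, [n for _, _, n in q]    # tree and the k frontier nodes

def bisect_split(node):
    """Stand-in splitter (not Rank-2 NMF): halve the columns, score by size."""
    mid = len(node.cols) // 2
    node.left, node.right = Node(node.cols[:mid]), Node(node.cols[mid:])
    node.score = float(len(node.cols))
```

With the stand-in splitter, the k frontier nodes always form a partition of the original column set, mirroring the property of the leaves noted in Section 3.4.1.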
The splitting procedure is specified in Algorithm 3. After the Rank-2 NMF is
performed, the H factor is used to determine part membership, and the columns of
the W factor are assigned to the child nodes. The score of the node is computed as
the reduction in overall NMF error if the node is split, which can be computed from
the principal singular values of the subsets of columns of the node and its children,
as given in Line 6. The principal singular values of the children are computed via the
power method. Note that the principal singular value of the node itself need not be
recomputed as it was needed for its parent’s score.
Algorithm 3 Node Splitting via Rank-Two NMF
Require: N has a subset of columns given by field A
 1: function Split(N)
 2:   [W, H] = Rank2-NMF(N.A)           ▷ split N
 3:   partition N.A into A1 and A2 using H
 4:   N.left = node(A1, w1)             ▷ create left child
 5:   N.right = node(A2, w2)            ▷ create right child
 6:   N.score = σ1²(A1) + σ1²(A2) − σ1²(N.A)
 7: end function
Ensure: N has two children and a score
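The score on line 6 of Algorithm 3 can be checked numerically. This sketch of ours substitutes NumPy's full SVD for the thesis's power-method estimate and assumes both parts of the split are nonempty.

```python
import numpy as np

def split_score(A, in_part1):
    """Reduction in squared rank-1 approximation error from splitting the
    columns of A into two parts: sigma_1^2(A1) + sigma_1^2(A2) - sigma_1^2(A)."""
    sigma1_sq = lambda M: np.linalg.svd(M, compute_uv=False)[0] ** 2
    A1, A2 = A[:, in_part1], A[:, ~in_part1]
    return sigma1_sq(A1) + sigma1_sq(A2) - sigma1_sq(A)
```

The score is always nonnegative, since the two parts can be approximated by rank-1 matrices at least as well separately as jointly.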
3.4.2 Parallelization
In this section, we consider the options for parallelizing Hierarchical Rank-2 NMF
Clustering (Algorithm 2) and provide an analysis for our approach. The running
time of an algorithm is data dependent because not only does each Rank-2 NMF
computation require a variable number of iterations, but also the shape of the tree
can vary from a balanced binary tree with O(log k) levels to a tall, unbalanced tree with O(k)
levels. For the sake of analysis, we will assume a fixed number of NMF iterations for
every node of the tree and we will analyze the cost of complete levels.
The first possibility for parallelization is across the nodes of the tree, as each Rank-
2 NMF split is independent. We choose not to parallelize across nodes in the tree for
two reasons. The first reason is that while the NMF computations are independent,
choosing which nodes to split may depend on global information. In particular, when
the global target is to determine k leaf clusters, the nodes must be split in order
of their scores, which leads to a serialization of the node splits. This serialization
might be relaxed using speculative execution, but it risks performing unnecessary
computation. If the global target is to split all nodes with sufficiently high scores,
then this serialization is also avoided and node splits become truly independent. We
choose not to parallelize in this way to remain agnostic to the global stopping criterion.
The second reason is that parallelizing across nodes requires redistribution of the
input data. Given a node split by p processors, in order to assign disjoint sets of
processors to each child node, each of the p processors would have to redistribute
their local data, sending data for samples not in their child’s set and receiving data
for those in their child’s set. The communication would be data dependent, but on
average, each processor would communicate half of its data in the redistribution step,
which could have an all-to-all communication pattern among the p processors. For
a node with n columns, the communication cost would be at least O(mn/p) words,
which is much larger than the communication cost per iteration of Parallel Rank-2
NMF, as we will see in Section 3.4.2.
By choosing not to parallelize across nodes in the tree, we employ all p proces-
sors on each node, and split nodes in sequence. The primary computations used to
split a node are the Rank-2 NMF and the score computation, which is based on an
approximation of the largest singular value. We use an alternating-updating algo-
rithm for Rank-2 NMF as described in Section 3.3, and we parallelize it following the
methodology proposed in [20] and presented in Algorithm 4.
The communication cost of the algorithm depends on the parallel distribution of
the input matrix data A. In order to avoid redistribution of the matrix data, we choose
a 1D row distribution so that each processor owns a subset of the rows of A. Because
the clustering partition splits the columns of A, each processor can partition its
local data into left and right children to perform the split without any interprocessor
communication. If we use a 2D distribution for a given node, then because the
partition is data dependent, a data redistribution is required in order to obtain a
load balanced distribution of both children. Figure 3.3 presents a visualization of the
node-splitting process using a 1D processor distribution. In the following subsections,
we describe the parallel algorithms for Rank-2 NMF and approximating the principal
singular value given this 1D data distribution and analyze their complexity in the
Figure 3.3: Parallel splitting using Rank-2 NMF and 1D processor distribution. A
Rank-2 NMF computes factor matrices W and H to approximate A, the values of H
are used to determine child membership of each column (either red or blue), and the
corresponding column of the W matrix represents the part's feature weighting. The
1D distribution is depicted for 3 processors to show that splitting requires no inter-
processor redistribution, as children are evenly distributed identically to the parent.
context of the hierarchical clustering algorithm.
Algorithms
Parallel Rank-2 NMF Algorithm 4 presents the parallelization of an alternating-
updating scheme for NMF that uses the exact rank-2 solve algorithm presented in
Algorithm 1 to update each factor matrix. The algorithm computes the inputs to
the rank-2 solves in parallel and then exploits the parallelism across rows of the
factor matrix so that each processor solves for a subset of rows simultaneously. The
distribution of all matrices is a 1D row distribution, so that each processor owns a
subset of the rows of A, W, and H: an (m/p) × n block of the data matrix and
(m/p) × 2 and (n/p) × 2 blocks of the factor matrices. With this distribution, the
computation of W^T W and of H^T H is each done via a local multiplication followed
by a single all-reduce collective. All processors own
the data they need to compute their contribution to A^T W; in order to distribute
the result so that the rows of H can be computed independently, a reduce-scatter collective is used
to sum and simultaneously distribute across processors. To obtain the data needed
to compute W, each processor must access all of H, which is performed via an all-
gather collective. The iteration progresses until a convergence criterion is satisfied.
For performance benchmarking we use a fixed number of iterations, and in practice
we use relative change in objective function value (residual norm).
Parallel Power Method In order to compute the score for a frontier node, we
use the difference between the sum of the squared principal singular values of its
children's column subsets and that of the node itself. Thus, we must determine the principal
singular value of every node in the tree once, including leaf nodes. We use the power
method to approximate it, repeatedly applying A^T A to a vector until it converges to
the leading right singular vector. We present the power method in Algorithm 5. Note
that we do not normalize the approximate left singular vector so that the computed
value approximates the square of the largest singular value.
Given the 1D distribution, only one communication collective is required for the
pair of matrix-vector multiplications. That is, the approximate right singular vector v
is redundantly owned on each processor, and the approximate left singular vector u is
distributed across processors. Each processor can compute its local u from v without
Algorithm 4 Parallel Rank-2 NMF
Require: A is m × n and row-distributed across processors so that each local A is an
(m/p) × n submatrix
 1: function [W, H] = Parallel-Rank2-NMF(A)
 2:   Initialize local W randomly
 3:   while not converged do
 4:     ▷ Compute H
 5:     GW = W^T W
 6:     GW = All-Reduce(GW)
 7:     B = A^T W
 8:     C = Reduce-Scatter(B)
 9:     H = Rank2-NLS-Solve(C, GW)
10:     ▷ Compute W
11:     GH = H^T H
12:     GH = All-Reduce(GH)
13:     H = All-Gather(H)
14:     D = AH
15:     W = Rank2-NLS-Solve(D, GH)
16:   end while
17: end function
Ensure: A ≈ WH^T with W, H row-distributed
communication and use the result for its contribution to v = A^T u. An all-reduce
collective is used to obtain a copy of v on every processor for the next iteration,
and the norm is redundantly computed without further communication. We used the
relative change in σ as the stopping criterion for benchmarking.
Algorithm 5 Parallel Power Method
Require: A is m × n and row-distributed across processors so that each local A is an
(m/p) × n submatrix
 1: function σ = Parallel-Power-Method(A)
 2:   Initialize v randomly and redundantly
 3:   while not converged do
 4:     u = Av
 5:     z = A^T u
 6:     v = All-Reduce(z)
 7:     σ = ‖v‖
 8:     v = v/σ
 9:   end while
10: end function
Ensure: σ ≈ σ1²(A) is redundantly owned by all procs
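The computation in Algorithm 5 is easy to verify in serial NumPy, since the parallel version only changes where u and z live. This illustrative sketch mirrors the fact that only v is normalized, so the reported value estimates σ1²; the function name and iteration count are our own.

```python
import numpy as np

def power_method_sigma_sq(A, iters=200, seed=0):
    """Estimate sigma_1(A)^2 by repeatedly applying A^T A to a vector.
    In Algorithm 5, u = A v is local and an all-reduce sums z = A^T u."""
    rng = np.random.default_rng(seed)
    v = rng.random(A.shape[1])
    sigma_sq = np.linalg.norm(v)
    for _ in range(iters):
        u = A @ v                 # local mat-vec on the owned row block
        z = A.T @ u               # contributions summed by the all-reduce
        sigma_sq = np.linalg.norm(z)
        v = z / sigma_sq          # only v is normalized, never u
    return sigma_sq
```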
Analysis
Parallel Rank-2 NMF Each iteration of Algorithm 4 incurs the same cost, so we
analyze per-iteration computation and communication costs. We first consider the
cost of the Rank-2 NNLS solves, which are local computations. In the notation of
Algorithm 1, matrix G is 2 × 2, so solving the unconstrained system (via Cholesky
decomposition) and then choosing between single-positive-variable solutions if necessary
requires constant time per row of C. Thus, the cost of Algorithm 1 is proportional
to the number of rows of the first input matrix. In the context of Algorithm 4, the
per-iteration computational cost of rank-2 solves is then O((m+n)/p). The other local
computations are the matrix multiplications W^T W and H^T H, which also amount
to O((m+n)/p) flops, and A^T W and AH, which require O(mn/p) flops because they
involve the data matrix. Thus, the computation cost is γ · O((mn + m + n)/p) and
typically dominated by the multiplications involving A. We track the lower order
terms corresponding to NNLS solves because their hidden constants are larger than
that of the dominating term.
There are four communication collectives each iteration, and each involves all p
processors. The two all-reduce collectives to compute the Gram matrices of the factor
matrices involve 2×2 matrices and incur a communication cost of (γ+β+α)·O(log p).
The reduce-scatter and all-gather collectives involve n × 2 matrices (the size of H)
and require β ·O(n) +α ·O(log p) in communication cost (we ignore the computation
cost of the reduce-scatter because it is typically dominated by the bandwidth cost).
If the algorithm performs ı iterations, the overall cost of Algorithm 4 is
γ · O(ı(mn + m + n)/p) + β · O(ın) + α · O(ı log p).    (3.1)
Parallel Power Method Similar to the previous analysis, we consider a single
iteration of the power method. The local computation is dominated by two matrix-
vector products involving the local data matrix of size O(mn/p) words, incurring
O(mn/p) flops. The single communication collective is an all-reduce of the approxi-
mate right singular vector, which is of size n, incurring β ·O(n) +α ·O(log p) commu-
nication. We ignore the O(n) computation cost of normalizing the vector, as it will
typically be dominated by the communication cost of the all-reduce. Over ȷ iterations,
Algorithm 5 has an overall cost of

γ · O(ȷmn/p) + β · O(ȷn) + α · O(ȷ log p).    (3.2)
Note the per-iteration cost of the power method differs by only a constant from the
per-iteration cost of Rank-2 NMF. Because the power method involves single vectors
rather than factor matrices with two columns, its constants are smaller than half the
size of their counterparts.
Hierarchical Clustering To analyze the overall cost of the hierarchical clustering
algorithm, we sum the costs over all nodes in the tree. Because the shape of the tree
is data dependent and affects the overall costs, for the sake of analysis we will analyze
only complete levels. The number of rows in any node is m, the same as the root node,
as each splitting corresponds to a partition of the columns. Furthermore, because each
split is a partition, every column of A is represented exactly once in every complete
level of the tree. If we assume that all nodes perform the same number of NMF
iterations (ı) and power method iterations (ȷ), then the dominating cost of a node
with n columns is

γ · O(((ı+ȷ)mn + ı(m+n))/p) + β · O((ı+ȷ)n) + α · O((ı+ȷ) log p).
Because the sum of the number of columns across any level of the tree is n, the cost
of the ℓth level of the tree is

γ · O(((ı+ȷ)mn + ım2^ℓ)/p) + β · O((ı+ȷ)n) + α · O((ı+ȷ)2^ℓ log p).    (3.3)
Note that the only costs that depend on the level index ℓ are the latency cost and a
lower-order computational cost.
Summing over levels and assuming the tree is nearly balanced and has height
O(log k) where k is the number of frontier nodes, we obtain an overall cost of Algo-
rithm 2 of
γ · O(((ı+ȷ)mn log k + ımk)/p) + β · O((ı+ȷ)n log k) + α · O((ı+ȷ)k log p).    (3.4)
We see that the leading order computational cost is logarithmic in k and perfectly
load balanced. If the overall running time is dominated by the computation (and
in particular the matrix multiplications involving A), we expect near-perfect strong
scaling. The bandwidth cost is also logarithmic in k but does not scale with the
number of processors. The latency cost grows most quickly with the target number
of clusters k but is also independent of the matrix dimensions m and n.
3.5 Experimental Results
3.5.1 Experimental Platform
All the experiments in this section were conducted on Summit. Summit is a su-
percomputer created by IBM for the Oak Ridge National Laboratory. There are
approximately 4,600 nodes on Summit. Each node contains two IBM POWER9 pro-
cessors on separate sockets with 512 GB of DDR4 memory. Each POWER9 processor
utilizes 22 IBM SIMD Multi-Cores (SMCs), although one of these SMCs on each pro-
cessor is dedicated to memory transfer and is therefore not available for computation.
For node scaling experiments, all 42 available SMCs were utilized in each node so
that every node computed with 42 separate MPI processes. Additionally, every node
also contains six NVIDIA Volta V100 accelerators, but these were unused by our
algorithm.
Our implementation builds on the PLANC open-source library [20] and uses the
Armadillo library (version 9.900.1) for all matrix operations. On Summit, we linked
this version of Armadillo with OpenBLAS (version 0.3.9) and IBM’s Spectrum MPI
(version 10.3.1.2-20200121).
3.5.2 Datasets
Hyperspectral Imaging We use the Hyperspectral Digital Imagery Collection Ex-
periment (HYDICE) image of the Washington DC Mall. We will refer to this dataset
as DC-HYDICE [40]. DC-HYDICE is formatted into a 3-way tensor representing two
spatial dimensions of pixels and one dimension of spectral bands. So, a slice along the
spectral band dimension would be the full DC-HYDICE image in that spectral band.
For hierarchical clustering, this tensor is flattened so that the rows represent the
191 spectral bands and the columns represent the 392960 pixels. The data set is
approximately 600 MB in size.
Image Classification The SIIM-ISIC Melanoma classification dataset, which we
will refer to as SIIM-ISIC [52], consists of 33126 RGB training images equally sized
at 1024 × 1024. Unlike with hyperspectral imaging, the resulting matrix used in hi-
erarchical clustering consists of image pixels along the rows and individual images
along the columns. So, the resulting matrix is 3145728 × 33126, which is approximately
800 GB in size. Given its size, SIIM-ISIC requires 10 Summit nodes to
perform hierarchical clustering.
Synthetic Dataset Our synthetic dataset has the same aspect ratio as SIIM-ISIC
but has 3× fewer rows and columns. The resulting matrix is
1048576 × 11042. We choose the smaller size in order to fit on a single node for
scaling experiments.
3.5.3 Performance
For all hierarchical clustering experiments in this section, the number of tree leaf
nodes k was set at 100, the number of NMF iterations was set to 100, the power
iteration was allowed to stop iterating after convergence, and only complete levels
were considered for analysis purposes for both level and strong scaling plots.
Figure 3.4: Strong Scaling for Clustering on DC-HYDICE (relative speedup vs. number of compute cores)
Single-Node Scaling for DC Dataset
DC-HYDICE is small compared to the other datasets, so it can easily fit on one com-
pute node. Also, its small number of rows (191) does not allow parallelizing beyond
that number of MPI processes. So, this dataset was used for a single-node scaling
experiment on Summit from 1 to 42 cores. Because Rank-2 NMF is memory band-
width bound, we expect limited speedup on one node due to the memory bandwidth
not scaling linearly with the number of cores. Figure 3.4 shows that there is enough
speedup (14× on 42 cores) for it to be worth parallelizing such a small problem, but
perfect scaling requires more memory bandwidth. In this experiment, the processes
were distributed across both sockets so that an even number of cores on each socket
are used.
Figure 3.5: Strong Scaling Speedup for Rank-2 NMF. (a) Synthetic Data; (b) SIIM-ISIC Data. (Relative speedup vs. number of compute nodes.)
Rank-2 NMF Strong Scaling
We perform strong scaling experiments for a single Rank-2 NMF (Algorithm 4) on
the synthetic and SIIM-ISIC datasets. The theory (Equation (3.1)) suggests that
perfect strong scaling is possible as long as the execution time is dominated by local
computation. Both the matrix multiplications and NNLS solves scale linearly with
1/p (we expect MatMul to dominate), but the bandwidth cost is independent of p
and latency increases slightly with p.
Figures 3.5a and 3.5b show performance relative to the smallest number of com-
pute nodes required to store data and factor matrices. For these data sets, we observe
nearly perfect strong scaling, with 42× speedup on 40 compute nodes (over 1 compute
node) for synthetic data and 7.1× speedup on 80 compute nodes (over 10 compute
nodes) for SIIM-ISIC data.
The relative time breakdowns are presented in Figures 3.6 and 3.7 and explain
the strong scaling performance. Each experiment is normalized to 100% time, so
comparisons cannot be readily made across numbers of compute nodes. For both data
sets, we see that the time is dominated by MatMul, which is the primary reason for
the scalability. The dominant matrix multiplications are between a large matrix and
a matrix with 2 columns, so it is locally memory bandwidth bound, with performance
Figure 3.6: Time Breakdown for Rank-2 NMF on Synthetic (relative time vs. number of compute nodes; categories: MatMul, NNLS, Gram, Comp-Sigma, AllGather, ReduceScatter, AllReduce, Comm-Sigma)
Figure 3.7: Time Breakdown for Rank-2 NMF on SIIM-ISIC (relative time vs. number of compute nodes; same categories as Figure 3.6)
proportional to the size of the large matrix. In each plot, we also see the relative time
of all-gather and reduce-scatter increasing, which is because the local computation is
decreasing while the communication cost is slightly increasing with p. This pattern
will continue as p increases, which will eventually limit scalability, but for these data
sets the MatMul takes around 80% of the time at over 2000 cores.
Hierarchical Clustering Strong Scaling
From Equation (3.4), we expect to see perfect strong scaling in a computationally
bound clustering problem with target cluster count k = 100. Because k is large, we expect
the latency cost of small problems deep in the tree to limit scalability.
Figure 3.8a demonstrates the scalability of the synthetic data set on up to 40 nodes,
and we observe a 15× speedup compared to 1 node. Figure 3.9 shows the relative
Figure 3.8: Strong Scaling Speedup for Clustering. (a) Synthetic Data; (b) SIIM-ISIC Data. (Relative speedup vs. number of compute nodes.)
time breakdown and explains the limitation on scaling. On 40 nodes, computation
still takes 60% of the total time, but the all-gather and reduce-scatter costs have
grown in relative time because they do not scale with p. Because all-reduce involves
only a constant amount of data and its time remains relatively small, we conclude
the communication is bandwidth bound at this scale.
With the larger SIIM-ISIC dataset, it is possible to scale much further, as seen in
Figure 3.8b, where we observe a 5.9× speedup of 80 compute nodes compared to 10.
From Figure 3.10, we see that the communication cost constitutes less than 20% of
the total time even at 80 compute nodes.
We note that the speedup of the overall hierarchical clustering algorithm is not
as high as for a single Rank-2 NMF (measured at the root node). This is due to
inefficiencies in the lower levels of the tree, as we explore in the next section.
Level Scaling
To compare execution time across levels of a particular tree, we consider only complete
levels. From Equation (3.3), the dominant computational term (due to MatMul) is
constant per level, the lower-order computational term (represented by NNLS) grows
like O(2^ℓ), and the latency cost grows similarly, like O(2^ℓ).
Figure 3.9: Time Breakdown for Clustering on Synthetic (relative time vs. number of compute nodes; same categories as Figure 3.6)
Figure 3.10: Time Breakdown for Clustering on SIIM-ISIC (relative time vs. number of compute nodes; same categories as Figure 3.6)
Figure 3.11 shows absolute time across levels for the synthetic data set on 1 node.
The MatMul cost decreases slightly per level, which may be explained by cache effects
in the local matrix multiply, as each node's subproblem decreases in size. The NNLS
cost grows exponentially, as expected, and communication is negligible.
Figure 3.12 shows the level breakdown for the synthetic data on 40 nodes, where we
see different behavior. The MatMul cost is again constant across levels, and the NNLS cost
becomes dominant at lower levels, suggesting it does not scale as well as MatMul.
We also see all-reduce time becoming significant as communication time increases,
indicating that the nodes at lower levels are becoming more latency bound. Thus,
we see that the poorer scaling at the lower levels of the tree is the main reason the
overall hierarchical clustering algorithm does not scale as well as the single Rank-2
NMF at the root node.
Rank Scaling
To confirm the slow growth in running time of the hierarchical algorithm in terms
of the number of clusters k, we perform rank scaling experiments for DC-HYDICE
and synthetic data. Assuming a balanced tree and relatively small k, Equation (3.4)
shows that the dominant computational cost is proportional to log k, while a flat
NMF algorithm has a dominant cost that is linear in k [32]. Figure 3.13 shows the
raw time for various values of k, confirming that running time for HierNMF grows
more slowly in k than a flat NMF algorithm (based on Block Principal Pivoting)
from PLANC [20] with the same number of columns and processor grid. We see that
for sufficiently large k, the hierarchical algorithm outperforms flat NMF and it scales
much better with k.
Figure 3.11: Level Times for 1 Compute Node on Synthetic. (Wall-clock time in seconds versus tree level, 0 to 5; components: MatMul, NNLS, Gram, Comp-Sigma, AllGather, ReduceScatter, AllReduce, Comm-Sigma.)
Figure 3.12: Level Times for 40 Compute Nodes on Synthetic. (Wall-clock time in seconds versus tree level, 0 to 5; components: MatMul, NNLS, Gram, Comp-Sigma, AllGather, ReduceScatter, AllReduce, Comm-Sigma.)
Figure 3.13: Rank Scaling for Hierarchical and Flat NMF. (Time in seconds versus number of clusters k, comparing Hier NMF and Flat NMF; (a) DC-HYDICE data, k = 10 to 50; (b) synthetic data, k = 10 to 100.)
3.6 Conclusion
As shown in the theoretical analysis (Section 3.4.2) and experimental results (Sec-
tion 3.5.3), Algorithm 2 can efficiently scale to large p as long as the execution time
is dominated by local matrix multiplication. The principal barriers to scalability are
the bandwidth cost due to Rank-2 NMF, which is consistent across levels of the tree
and proportional to the number of columns n of the original data set, and the latency
cost due to large numbers of tree nodes in lower levels of the tree. When n is small
relative to m and the number of leaves k and levels ℓ are small, then these barriers
do not pose a problem until p is very large. However, if the input matrix is short and
fat (i.e., has many samples with few features), then the bandwidth cost can hinder
performance for smaller p. Likewise, if k is large or the tree is lopsided, then achieving
scalability for very small problems is more difficult. We also note that in the case of
sparse A, it becomes more difficult to hide communication behind the cheaper matrix
multiplications, and other costs may become more dominant.
One approach for reducing the bandwidth cost of Rank-2 NMF is to choose a
more balanced data distribution over a 2D grid, as proposed by Kannan et al. [31].
This will reduce the communicated data and achieve a local data matrix that is more
square, which can improve local matrix multiplication performance. The downside
of this approach is that it requires a redistribution of the data at each split, but if many
NMF iterations are required, then the single upfront cost may be amortized.
Another approach to alleviate the rising latency costs of lower levels of the tree
is to parallelize across nodes of the tree. This will result in fewer processors working
on any given node, reducing the synchronization time among them, and it will allow
small, latency-bound problems to be solved simultaneously. Prioritizing the sequence
of node splits is more difficult in this case, but modifying the stopping criterion for
splitting to use a score threshold instead of a target number of leaves will allow truly
independent computation.
In the future, we also plan to compare performance of Algorithm 2 with flat NMF
algorithms and employ the Divide-and-Conquer NMF technique [19] of seeding an
iterative flat NMF algorithm with the feature vectors of the leaf nodes. The parallel
technique proposed here can be combined with the existing PLANC library [20] to
obtain faster overall convergence for very large datasets.
Chapter 4: Tensor Train Rounding using Gram Matrices
The following chapter is a manuscript that has been submitted. For this work, I
contributed mainly to the results section by performing experiments and generating
plots.
4.1 Abstract
Tensor Train (TT) is a low-rank tensor representation consisting of a series of three-
way cores whose dimensions specify the TT ranks. Formal tensor train arithmetic
often causes an artificial increase in the TT ranks. Thus, a key operation for appli-
cations that use the TT format is rounding, which truncates the TT ranks subject
to an approximation error guarantee. Truncation is performed via SVD of a highly
structured matrix, and current rounding methods require careful orthogonalization
to compute an accurate SVD. We propose a new algorithm for TT rounding based
on the Gram SVD algorithm that avoids the expensive orthogonalization phase. Our
algorithm performs less computation and can be parallelized more easily than ex-
isting approaches, at the expense of a slight loss of accuracy. We demonstrate that
our implementation of the rounding algorithm is efficient, scales well, and consistently
outperforms the existing state-of-the-art parallel implementation in our experiments.
4.2 Preliminaries
4.2.1 Tensor Train Notation
An order-N low-rank tensor X ∈ R^{I_1×···×I_N} is in the Tensor Train (TT) format if there
exist strictly positive integers R_0, . . . , R_N with R_0 = R_N = 1 and N order-3 tensors
T_{X,1}, . . . , T_{X,N}, called TT cores, with T_{X,n} ∈ R^{R_{n−1}×I_n×R_n}, such that

X(i_1, . . . , i_N) = T_{X,1}(i_1) T_{X,2}(i_2) · · · T_{X,N}(i_N),

where T_{X,n}(i_n) denotes the i_nth slice of the nth TT core of X, an R_{n−1} × R_n matrix,
for 1 ≤ i_n ≤ I_n. Since R_0 = R_N = 1, the slices of the first and last TT cores are row and
column vectors, T_{X,1}(i_1) ∈ R^{1×R_1} and T_{X,N}(i_N) ∈ R^{R_{N−1}×1}, and hence the product
T_{X,1}(i_1) · · · T_{X,N}(i_N) is a scalar.
Different types of matricization (also known as unfolding) of a tensor are used
to express linear algebra operations on tensors. In this work, we often use two
particular matricizations of order-3 tensors. The horizontal unfolding of TT core T_{X,n}
corresponds to stacking the slices for i_n = 1, . . . , I_n horizontally; the horizontal
unfolding operator is denoted by H, so that H(T_{X,n}) ∈ R^{R_{n−1}×R_n I_n}. The vertical
unfolding corresponds to stacking the slices for i_n = 1, . . . , I_n vertically; the vertical
unfolding operator is denoted by V, so that V(T_{X,n}) ∈ R^{R_{n−1} I_n×R_n}. These two
unfoldings are important for the linearization of tensor entries in memory, as they
enable performing matrix operations on the TT core without shuffling or permuting
data.
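To make the two unfoldings concrete, here is a small NumPy sketch (our own illustration, not code from the thesis) for a core stored as an array of shape (R_{n−1}, I_n, R_n). With row-major storage the horizontal unfolding is a zero-copy reshape, while the vertical unfolding requires a transpose; the roles reverse with column-major storage.

```python
import numpy as np

def horizontal_unfolding(core):
    # H(T)[r1, i*R2 + r2] = T[r1, i, r2]; shape (R1, I*R2).
    R1, I, R2 = core.shape
    return core.reshape(R1, I * R2)

def vertical_unfolding(core):
    # V(T)[i*R1 + r1, r2] = T[r1, i, r2]; shape (R1*I, R2).
    R1, I, R2 = core.shape
    return core.transpose(1, 0, 2).reshape(I * R1, R2)

core = np.arange(24.0).reshape(2, 3, 4)        # R1 = 2, I = 3, R2 = 4
H, V = horizontal_unfolding(core), vertical_unfolding(core)
assert H.shape == (2, 12) and V.shape == (6, 4)
# Each slice core[:, i, :] appears intact as a block of both unfoldings:
assert np.array_equal(H[:, 4:8], core[:, 1, :])
assert np.array_equal(V[2:4, :], core[:, 1, :])
```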
Another type of unfolding, which we will use to express mathematical relationships
among TT cores, maps the first n modes to rows and the rest to columns [49]. We use
the notation X(1:n) to represent this unfolding, so that X(1:n) ∈ R^{I_1···I_n×I_{n+1}···I_N}. The
nth TT rank of X is the rank of X(1:n).
4.2.2 Cholesky QR and Gram SVD
Given a tall and skinny matrix A, recall that the corresponding Gram matrices are
AA^T and A^T A. We are typically interested in G_A = A^T A for efficient algorithms
because it is the smaller matrix.
Cholesky QR is an algorithm that exploits the fact that, for A full rank, the
upper triangular Cholesky factor of G_A is also the upper triangular factor in the QR
decomposition of A. That is, for A = QR, we have G_A = R^T Q^T QR = R^T R. If
A is full rank, then R is invertible and Q can be recovered as Q = AR^{-1} using
a triangular solve. In finite precision, Cholesky QR obtains a small decomposition
error ‖A − QR‖, but the orthogonality error ‖Q^T Q − I‖ grows quadratically with the
condition number of A. By comparison, Householder QR obtains small orthogonality
error regardless of the conditioning of A [30]. We note there are techniques for
improving the numerical properties of Cholesky QR, e.g., by using two or three passes [22, 23].
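A minimal NumPy sketch of Cholesky QR (an illustration under the full-rank assumption, not the thesis's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20))            # tall, skinny, well conditioned

# Cholesky QR: R is the upper triangular Cholesky factor of the Gram matrix,
# and Q is recovered by a triangular solve, Q = A R^{-1}.
G = A.T @ A
R = np.linalg.cholesky(G).T                    # G = R^T R with R upper triangular
Q = np.linalg.solve(R.T, A.T).T                # Q = A R^{-1}

assert np.allclose(Q @ R, A)                   # decomposition error is small
# Orthogonality error grows like cond(A)^2; tiny here since A is well conditioned.
assert np.linalg.norm(Q.T @ Q - np.eye(20)) < 1e-10
```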
Likewise, Gram SVD is an algorithm that exploits the connection between the SVD
of a matrix and the eigenvalue decompositions of its Gram matrices. For A = UΣV^T,
we have G_A = VΣU^T UΣV^T = VΣ^2 V^T. We see that the eigenvalues of G_A are the
squares of the singular values of A, and the eigenvectors of G_A are the right singular
vectors of A. We can recover the left singular vectors via U = AVΣ^{-1} (assuming full
rank). Like Cholesky QR, Gram SVD computes an accurate decomposition but suffers
from higher orthogonality error of U as well as reduced accuracy of the singular values.
SVD algorithms using orthogonal transformations compute singular values with error
proportional to ‖A‖ · ε, where ε is the working precision, while the error for Gram
SVD can be larger by a factor as large as the condition number of A [60]. This implies
that backward stable SVD algorithms can compute singular values over a range of 1/ε,
while Gram SVD is limited to computing singular values over a range of 1/√ε.
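Similarly, a NumPy sketch of Gram SVD for a well-conditioned tall-skinny matrix (illustrative only; for ill-conditioned inputs the recovered quantities degrade as described above):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 20))

# Gram SVD: eigendecomposition of G_A = A^T A gives V and Sigma^2;
# left singular vectors are recovered as U = A V Sigma^{-1}.
G = A.T @ A
lam, V = np.linalg.eigh(G)                     # eigenvalues in ascending order
lam, V = lam[::-1], V[:, ::-1]                 # reorder to descending
sigma = np.sqrt(lam)
U = (A @ V) / sigma                            # scale column j by 1/sigma_j

U_ref, s_ref, _ = np.linalg.svd(A, full_matrices=False)
assert np.allclose(sigma, s_ref)               # accurate here: cond(A) is small
assert np.allclose(U @ np.diag(sigma) @ V.T, A)
```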
4.2.3 Cookies Problem and TT-GMRES
As a concrete example of a parametrized PDE for which TT methods work well, we
consider the two-dimensional cookies problem [37, 59] described as follows:
−div(σ(x, y; ρ)∇u(x, y)) = f(x, y) in Ω,
u(x, y) = 0 on ∂Ω,
where Ω = (−1, 1) × (−1, 1), ∂Ω is the boundary of Ω, and σ is defined as

σ(x, y; ρ) = 1 + ρ_i if (x, y) ∈ D_i, and 1 elsewhere,
where Di for i = 1, . . . , p are disjoint disks distributed in Ω such that their centers
are equidistant and ρi is selected from a set of samples Ji ⊂ R for i = 1, . . . , p. To
solve this problem, for each combination of values (ρ1, . . . , ρp), one can solve the linear
system (G_{1,1} + ∑_{i=1}^{p} ρ_i G_{i+1,1}) u = f, where G_{1,1} ∈ R^{I_1×I_1} is the discretization of the
operator −div(∇(·)) in Ω, G_{i+1,1} is the discretization of −div(χ_{D_i}∇(·)) in Ω, where
χ_S is the indicator function of the set S, and f is the discretization of the function f.
The number of linear systems to solve in that case is the product of the cardinalities
of the sets (Ji)1≤i≤p. Knowing that the set of solutions can be well approximated by a
low-rank tensor [13, 26], another approach to solve the problem is to use an iterative
method that exploits the low-rank structure and solves one large system including
all combinations of parameters. That is, to solve a (p + 1)-order problem of the
form GU = F. The operator G is given as G =∑p+1
i=1 Gi,1 ⊗ · · · ⊗ Gi,p+1, , where
Gi,i ∈ RIi×Ii for i = 2, . . . , p + 1 is a diagonal matrix containing the samples of ρi,
and the remaining matrices Gi,j for i = 1, . . . , p + 1, j = 2, . . . , p + 1 and j 6= i are
the identity matrices of suitable size. The right-hand side F = f ⊗ 1I2 ⊗ · · · ⊗ 1Ip+1 ,
where 1Ii is the vector of ones of size Ii.
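A tiny dense sketch of this Kronecker structure (p = 1, with stand-in 1-D operators of our own choosing rather than the actual finite element discretizations): solving the single coupled system reproduces the per-sample solves.

```python
import numpy as np

n, s = 8, 5                                              # spatial size, parameter samples
G11 = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1-D Laplacian stand-in
chi = np.zeros(n); chi[2:5] = 1.0                        # indicator of a "disk"
G21 = np.diag(chi) @ G11 @ np.diag(chi)                  # crude -div(chi grad .) stand-in
rho = np.linspace(0.5, 2.0, s)                           # parameter samples J_1

# G = G11 (x) I_s + G21 (x) diag(rho); F = f (x) 1_s (row-major vectorization).
G = np.kron(G11, np.eye(s)) + np.kron(G21, np.diag(rho))
f = np.sin(np.arange(1, n + 1))
F = np.kron(f, np.ones(s))

U = np.linalg.solve(G, F).reshape(n, s)                  # all samples at once
for k in range(s):
    # Column k solves the single-parameter system (G11 + rho_k G21) u = f.
    assert np.allclose(U[:, k], np.linalg.solve(G11 + rho[k] * G21, f))
```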
In this application and many others, the operator G has an operator rank that is
low and the right-hand side F is given in a low-rank form [3,7,9,37,63,64]. One way to
approximate the solution by a low-rank tensor is to apply a Krylov method adapted
to low rank tensors such as TT-GMRES [16]. In each iteration, the operator G is
applied to a low rank tensor leading to a formal expansion of the ranks. Furthermore,
one needs to orthonormalize the new basis tensor against previous ones by using a
Gram–Schmidt procedure, see algorithm 6. Again, the ranks will increase formally.
In order to keep memory and computations tractable, one has to round the resulting
tensors after performing these two steps. Most of the time, a small reduction in
the final relative residual norm is sufficient, which allows performing aggressive TT
rounding with loose tolerances.
Algorithm 6 TT-GMRES [16]
1: function U = TT-GMRES(G, F, m, ε)
2:   Set β = ‖F‖F, V_1 = F/β, r = β
3:   for j = 1 : m do
4:     Set δ = εβ/r
5:     W = TT-Round(G V_j, δ)
6:     for i = 1 : j do
7:       H(i, j) = InnerProd(W, V_i)
8:     end for
9:     W = TT-Round(W − ∑_{i=1}^{j} H(i, j) V_i, δ)
10:    H(j + 1, j) = ‖W‖F
11:    r = min_y ‖H(1 : j + 1, 1 : j) y − β e_1‖_2
12:    V_{j+1} = W / H(j + 1, j)
13:  end for
14:  y_m = argmin_y ‖H y − β e_1‖_2
15:  U = ∑_{j=1}^{m} y_m(j) V_j
16: end function
4.2.4 TT-Rounding via Orthogonalization
The standard algorithm for TT-rounding [47] is given in algorithm 7. This procedure
is composed of two phases, an orthogonalization phase and a truncation phase. The
orthogonalization phase consists of a sequence of QR decompositions of the vertical
unfolding of each core starting from the leftmost to orthonormalize its columns and
then a multiplication of the triangular factor by the following core. The truncation
phase consists of a sequence of truncated SVDs of the horizontal unfolding of each
core starting from the rightmost, leaving its rows orthonormal (set as the leading right
singular vectors), and multiplying the preceding core by the singular values and the
leading left singular vectors. The direction of these two phases can be reversed.
Given a required accuracy, the TT-Rounding procedure provides a quasi-optimal
approximation with given TT ranks [47].
Algorithm 7 TT-Rounding via Orthogonalization [1, 47]
1: function Y = TT-Round-QR(X, ε)
2:   Set T_{Y,1} = T_{X,1}
3:   for n = 1 to N − 1 do
4:     [V(T_{Y,n}), R] = QR(V(T_{Y,n}))
5:     H(T_{Y,n+1}) = R H(T_{X,n+1})
6:   end for
7:   Compute ‖X‖F = ‖T_{Y,N}‖F and ε_0 = ε‖X‖F/√(N−1)
8:   for n = N down to 2 do
9:     [Q, R] = QR(H(T_{Y,n})^T)
10:    [U, Σ, V] = tSVD(R, ε_0)
11:    H(T_{Y,n})^T = QU
12:    V(T_{Y,n−1}) = V(T_{Y,n−1}) VΣ
13:  end for
14: end function
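As a concrete (sequential, dense) illustration of the two phases, the following NumPy sketch rounds a list of TT cores. The function names, unfolding conventions, and rank-selection rule are ours, and the truncation SVD is applied directly to each unfolding rather than through the intermediate QR of lines 9–11.

```python
import numpy as np

def tt_full(cores):
    """Contract a list of TT cores (shapes (R_prev, I, R_next)) into a dense tensor."""
    res = cores[0].reshape(cores[0].shape[1], cores[0].shape[2])
    for c in cores[1:]:
        r0, i, r1 = c.shape
        res = (res @ c.reshape(r0, i * r1)).reshape(-1, r1)
    return res.reshape([c.shape[1] for c in cores])

def tt_round(cores, eps):
    """Round a TT tensor to relative error ~eps (dense sketch of both phases)."""
    cores = [c.copy() for c in cores]
    N = len(cores)
    # Phase 1: left-to-right orthogonalization via QR of the (R_prev*I) x R_next unfolding.
    for n in range(N - 1):
        r0, i, r1 = cores[n].shape
        Q, R = np.linalg.qr(cores[n].reshape(r0 * i, r1))
        cores[n] = Q.reshape(r0, i, Q.shape[1])
        rb, ib, rc = cores[n + 1].shape
        cores[n + 1] = (R @ cores[n + 1].reshape(rb, ib * rc)).reshape(-1, ib, rc)
    # The tensor norm now lives entirely in the last core (cf. line 7 of algorithm 7).
    delta = eps * np.linalg.norm(cores[-1]) / np.sqrt(N - 1)
    # Phase 2: right-to-left truncated SVDs of the R_prev x (I*R_next) unfolding.
    for n in range(N - 1, 0, -1):
        r0, i, r1 = cores[n].shape
        U, s, Vt = np.linalg.svd(cores[n].reshape(r0, i * r1), full_matrices=False)
        tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]      # tail[L] = ||s[L:]||_2
        L = max(1, int(np.searchsorted(-tail, -delta)))    # smallest L with tail <= delta
        cores[n] = Vt[:L].reshape(L, i, r1)
        ra, ia, _ = cores[n - 1].shape
        W = cores[n - 1].reshape(ra * ia, r0) @ (U[:, :L] * s[:L])
        cores[n - 1] = W.reshape(ra, ia, L)
    return cores

# Example: the TT sum X + X doubles every rank; rounding recovers the original ranks.
rng = np.random.default_rng(0)
dims, rk = [4, 5, 6], [1, 3, 3, 1]
x = [rng.standard_normal((rk[k], dims[k], rk[k + 1])) for k in range(3)]
mid = np.zeros((6, 5, 6)); mid[:3, :, :3] = x[1]; mid[3:, :, 3:] = x[1]
xpx = [np.concatenate([x[0], x[0]], axis=2), mid, np.concatenate([x[2], x[2]], axis=0)]
y = tt_round(xpx, 1e-10)
assert [c.shape[0] for c in y] == [1, 3, 3]
assert np.allclose(tt_full(y), 2 * tt_full(x))
```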
4.2.5 Previous Work on Parallel TT-Rounding
Algorithm 7 has been parallelized by Al Daas et al. [1], who use a 1-D distribution
of TT cores to partition a TT tensor across processors. Each core is distributed over
all processors along the physical mode such that each processor owns Ik/P slices of
the kth core. This distribution guarantees a load balancing and allows to perform
TT arithmetic efficiently. In particular, the QR decompositions are performed via the
Tall-Skinny QR algorithm [14], and multiplications involving TT cores are parallelized
following the 1D distributions. We improve upon this prior work by using an alternate
TT-rounding approach that avoids QR decompositions, reducing arithmetic by a
constant factor and also reducing communication.
4.3 Introduction
Low-rank representations of tensors help to make algorithms addressing large-scale
multidimensional problems computationally feasible. While the size of explicit rep-
resentations of these tensors grows very quickly (an instance of the “curse of dimen-
sionality”), low-rank representations can often approximate explicit forms to sufficient
accuracy while requiring orders of magnitude less space and computational time. For
example, suppose a parametrized PDE depends on 10 parameters, where each param-
eter has 10 possible values. Computing the solution for each of the 1010 configurations
becomes infeasible even for modest discretizations of the state space, but if the so-
lution depends smoothly on the parameters, then the qualitative behavior of the
solution over the entire configuration space can be captured using far fewer than 1010
parameters [13,26,37].
As we describe in detail in section 3.3, the Tensor Train (TT) format [47] is a
low-rank representation with a number of parameters that is linear in the sum of the
tensor dimensions, as compared to an explicit representation whose size is the prod-
uct of the tensor dimensions. The TT format consists of a series of 3-way tensors,
or TT cores, with one dimension corresponding to an original tensor dimension and
two dimensions corresponding to much smaller TT ranks. TT approximations can be
computed from explicit tensors as a means of compression for scientific computing and
machine learning applications [27,47,50,66], but they are also often used to represent
tensors that cannot be formed explicitly at all. In the context of parametrized PDEs,
the TT format has been used to represent both the discretized operators as well as
the solution, residual, and other related vectors [7–9, 16]. In this case, TT tensors
are manipulated using operations such as additions, dot products, and elementwise
multiplications, which causes the TT ranks to grow in size. The key operation that
prevents uncontrolled growth in TT ranks is known as TT rounding, in which a TT
tensor is approximated by another TT tensor with minimal ranks subject to a spec-
ified approximation error. This operation requires a sequence of highly structured
matrix singular value decomposition (SVD) problems, and is typically a computa-
tional bottleneck.
There exists a wide array of high-performance, parallel implementations of tensor
computations for computing decompositions such as CP and Tucker of dense and
sparse tensors [5, 10, 12, 20, 33, 53], as well as for performing contractions of dense,
sparse, and structured tensors [2, 54, 56]. However, the available software for com-
puting, manipulating, and rounding TT tensors is largely limited to productivity
languages such as MATLAB and Python [44, 61]. Aside from the work of Al Daas
et al. [1], which we describe in section 3.3 and compare against in section 4.6, we
are not aware of other HPC implementations of TT-based algorithms. One of the
aims of this paper is to raise the bar for parallel performance for TT rounding and
demonstrate that TT-based approaches can scale to scientific problems with more
and higher dimensions using efficient parallelization.
The TT rounding algorithm utilizes multiple truncated SVDs. The central con-
tribution of this paper is the development of a parallel algorithm that performs these
truncated SVDs more efficiently than the existing approach, by reducing both com-
putational and communication costs. The basic tool of the algorithm is the Gram
SVD algorithm, which exploits the connection between the SVD of a matrix A and
the eigenvalue decomposition of its Gram matrix A^T A. The truncated SVD must
be performed on a highly structured matrix that is analogous to a matrix represented
as X = AB^T, where A and B are tall-skinny matrices. We present our approach in
full detail for this matrix analogue in section 4.4, including empirical results for the
numerical properties, and then show how it can be applied within the TT rounding
algorithm in section 4.5. The key to efficiency in the context of TT rounding is the
computation of Gram matrices of matrices with overlapping TT structure.
We present performance results in section 4.6, demonstrating the efficiency of our
algorithm compared to the existing state of the art. In a MATLAB-based experi-
ment, we show that improvement of a TT-rounding implementation leads to overall
performance improvement for a TT-based linear solver. Then we demonstrate that
our C/MPI implementation is both weakly and strongly scalable on TT tensors with
representative dimensions and ranks. In particular, we achieve up to Y× parallel
speedup when scaling to 64 nodes of a distributed-memory platform for a Z-way
tensor with dimensions of size W and TT ranks of size Q. We also achieve up to
a 8× speedup over a state-of-the-art implementation of the standard TT-rounding
approach. Our results demonstrate that TT rounding is highly scalable using our
algorithm, and we target parallelization of TT-based solvers based on our approach
as future work.
4.4 Truncation of Matrix Product
To gain intuition for the use of Gram SVD within TT-Rounding, we focus in this
section on the (degenerate) case of TT with two modes, with dimensions I × J. In this
case, the tensor is a matrix represented by a low-rank product of matrices:
X = AB^T, (4.1)
where A and B are tall and skinny matrices with R columns. The goal is to approximate X by a lower-rank product
X ≈ ÂB̂^T, (4.2)
where Â ∈ R^{I×L} and B̂ ∈ R^{J×L} with L ≤ R.
4.4.1 Truncation via Orthogonalization
A numerically accurate and reasonably efficient approach to truncate the represen-
tation of X is via orthogonalization. By computing (compact) QR decompositions
A = Q_A R_A and B = Q_B R_B, we have
X = Q_A R_A R_B^T Q_B^T, (4.3)
and the SVD of R_A R_B^T yields the (compact) SVD of X because Q_A and Q_B have
orthonormal columns. Note that R_A R_B^T is R × R, so its SVD is much cheaper to
compute than that of X.
We formalize this approach in algorithm 8. In order to truncate the rank of X,
we truncate the SVD of R_A R_B^T. To obtain the factors Â and B̂, we apply Q_A and
Q_B to the truncated left and right singular vectors, respectively. The singular values can be
distributed arbitrarily between the two factors; we choose to split them evenly via Σ^{1/2}.
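A compact NumPy sketch of this truncation via orthogonalization (an illustrative stand-in for algorithm 8; the relative-tolerance rank selection rule is our assumption):

```python
import numpy as np

def tsvd_abt_qr(A, B, eps):
    """Truncated SVD of X = A @ B.T via QR of both factors.
    Returns (Ah, Bh) with X ~= Ah @ Bh.T; singular values split evenly via sqrt."""
    QA, RA = np.linalg.qr(A)
    QB, RB = np.linalg.qr(B)
    U, s, Vt = np.linalg.svd(RA @ RB.T)            # R x R problem, cheap relative to X
    tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]  # tail[L] = ||s[L:]||_2
    # Keep the smallest L whose discarded tail is below eps * ||X||_F (our rule).
    L = max(1, int(np.searchsorted(-tail, -eps * np.linalg.norm(s))))
    Ah = QA @ (U[:, :L] * np.sqrt(s[:L]))
    Bh = QB @ (Vt[:L].T * np.sqrt(s[:L]))
    return Ah, Bh

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 30))  # rank-10 factor
B = rng.standard_normal((300, 10)) @ rng.standard_normal((10, 30))
Ah, Bh = tsvd_abt_qr(A, B, 1e-8)
assert Ah.shape[1] == 10                           # numerical rank of A @ B.T
assert np.linalg.norm(Ah @ Bh.T - A @ B.T) <= 1e-6 * np.linalg.norm(A @ B.T)
```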
4.4.2 Truncation via Gram SVD
We now present our proposed method, a faster but potentially less accurate rounding
algorithm for the matrix product. Our method is based on the Gram SVD algorithm,
but we note it is not a straightforward application. For example, we can represent
XX^T as AB^T BA^T, and while B^T B is R × R, we cannot obtain the eigenvalue decomposition
easily without orthogonalizing A. Instead, we consider the Gram matrices
of A and B separately, letting G_A = A^T A and G_B = B^T B. For clarity, we first
describe the method using Cholesky QR, then discuss pivoting within Cholesky, and
finally explain the use of Gram SVD. We compare numerical results for the matrix
product case in section 4.4.4.
Cholesky QR
Let us first assume A and B are full rank, and use Cholesky QR to orthonormalize
the columns of A and B. Computing Cholesky decompositions, we have R_A^T R_A = G_A
and R_B^T R_B = G_B. Then eq. (4.3) becomes

X = (A R_A^{-1}) R_A R_B^T (B R_B^{-1})^T.

Given the truncated SVD UΣV^T = R_A R_B^T, we can compute

Â = A(R_A^{-1} U Σ^{1/2}) and B̂ = B(R_B^{-1} V Σ^{1/2})

to obtain eq. (4.2).
Pivoted Cholesky QR
Now suppose that A and B are low rank with ranks L_A and L_B. While the standard
Cholesky algorithm will fail in this case, we can employ pivoted Cholesky to obtain
R_A^T R_A = P_A^T G_A P_A and R_B^T R_B = P_B^T G_B P_B, where P_A and P_B are permutation
matrices and R_A and R_B can be written

$R_A = \begin{bmatrix} \hat{R}_A & \tilde{R}_A \\ 0 & 0 \end{bmatrix}$ and $R_B = \begin{bmatrix} \hat{R}_B & \tilde{R}_B \\ 0 & 0 \end{bmatrix}$,

with $\hat{R}_A$ and $\hat{R}_B$ invertible upper triangular matrices of dimensions L_A × L_A and L_B × L_B,
respectively. Then eq. (4.3) becomes

$X = Q_A \underbrace{\begin{bmatrix} \hat{R}_A & \tilde{R}_A \end{bmatrix} P_A^T P_B \begin{bmatrix} \hat{R}_B & \tilde{R}_B \end{bmatrix}^T}_{M} Q_B^T = Q_A M Q_B^T$,

where

$Q_A = A P_A \begin{bmatrix} \hat{R}_A^{-1} \\ 0 \end{bmatrix}$ and $Q_B = B P_B \begin{bmatrix} \hat{R}_B^{-1} \\ 0 \end{bmatrix}$.

Given the truncated SVD UΣV^T = M, we compute

$\hat{A} = A \left( P_A \begin{bmatrix} \hat{R}_A^{-1} \\ 0 \end{bmatrix} U \Sigma^{1/2} \right)$ and $\hat{B} = B \left( P_B \begin{bmatrix} \hat{R}_B^{-1} \\ 0 \end{bmatrix} V \Sigma^{1/2} \right)$

to obtain eq. (4.2).
Gram SVD
Pivoted Cholesky QR works well for the low-rank case in exact arithmetic, but in the
case of numerically low-rank matrices, it imposes a sharp truncation on each of A and
B individually. We now consider using the Gram SVD approach, which we will see in
section 4.4.4 is more robust than pivoted Cholesky QR. Here, we consider A and B to
be possibly low rank. Given the SVDs A = U_A Σ_A V_A^T and B = U_B Σ_B V_B^T, we have
eigenvalue decompositions G_A = V_A Σ_A^2 V_A^T = V̄_A Σ̄_A^2 V̄_A^T and G_B = V_B Σ_B^2 V_B^T =
V̄_B Σ̄_B^2 V̄_B^T, where Σ̄_A and Σ̄_B contain the nonzero singular values and V̄_A and V̄_B
are the corresponding eigenvectors. We can then write the corresponding left singular
vectors via Ū_A = A V̄_A Σ̄_A^{-1} and Ū_B = B V̄_B Σ̄_B^{-1}. With these quantities, eq. (4.1)
becomes

X = (A V̄_A Σ̄_A^{-1}) Σ̄_A V̄_A^T V̄_B Σ̄_B (B V̄_B Σ̄_B^{-1})^T = Ū_A M Ū_B^T,

where M = Σ̄_A V̄_A^T V̄_B Σ̄_B.
Algorithm 9 Truncated SVD of AB^T using Gram SVDs
1: function [Â, B̂] = tSVD-ABt-Gram(A, B, ε)
2:   G_A = A^T A
3:   G_B = B^T B
4:   [V_A, Λ_A] = Eig(G_A)
5:   [V_B, Λ_B] = Eig(G_B)
6:   [U, Σ, V] = tSVD(Λ_A^{1/2} V_A^T V_B Λ_B^{1/2}, ε)
7:   Â = A(V_A Λ_A^{−1/2} U Σ^{1/2})
8:   B̂ = B(V_B Λ_B^{−1/2} V Σ^{1/2})
9: end function
Given the truncated SVD UΣV^T = M, we compute

Â = A(V̄_A Σ̄_A^{-1} U Σ^{1/2}) and B̂ = B(V̄_B Σ̄_B^{-1} V Σ^{1/2})

to obtain eq. (4.2). The algorithm for the Gram SVD approach is given as algorithm 9,
which can be adapted to pivoted Cholesky QR following the algebra of section 4.4.2.
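A NumPy sketch of algorithm 9, assuming A and B are numerically full rank so that no pivoting or clipping logic is needed (the relative-tolerance rank selection inside the tSVD step is our assumption):

```python
import numpy as np

def tsvd_abt_gram(A, B, eps):
    """Truncated SVD of X = A @ B.T via Gram matrices (sketch of algorithm 9)."""
    lamA, VA = np.linalg.eigh(A.T @ A)             # G_A = V_A Lambda_A V_A^T
    lamB, VB = np.linalg.eigh(B.T @ B)             # G_B = V_B Lambda_B V_B^T
    sA, sB = np.sqrt(lamA), np.sqrt(lamB)          # singular values of A and B
    M = (sA[:, None] * (VA.T @ VB)) * sB           # Lambda_A^{1/2} V_A^T V_B Lambda_B^{1/2}
    U, s, Vt = np.linalg.svd(M)
    tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]  # tail[L] = ||s[L:]||_2
    L = max(1, int(np.searchsorted(-tail, -eps * np.linalg.norm(s))))
    f = np.sqrt(s[:L])                             # split Sigma^{1/2} to both factors
    Ah = (A @ (VA / sA) @ U[:, :L]) * f            # A V_A Lambda_A^{-1/2} U Sigma^{1/2}
    Bh = (B @ (VB / sB) @ Vt[:L].T) * f            # B V_B Lambda_B^{-1/2} V Sigma^{1/2}
    return Ah, Bh

rng = np.random.default_rng(2)
A = rng.standard_normal((500, 25))                 # well conditioned, so Gram route is safe
B = rng.standard_normal((400, 25))
Ah, Bh = tsvd_abt_gram(A, B, 1e-10)
assert np.linalg.norm(Ah @ Bh.T - A @ B.T) <= 1e-8 * np.linalg.norm(A @ B.T)
```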
4.4.3 Complexity Analysis
We now consider the computational complexity of the truncation methods, where
we assume A is I × R, B is J × R, Â is I × L, and B̂ is J × L. Truncation via
orthogonalization is specified in algorithm 8. The QR decompositions in lines 2 and 3
require 2(I + J)R^2 flops, where we assume that the orthogonal factors Q_A and Q_B are
maintained in implicit (e.g., Householder) form. The multiplication and truncated
SVD of line 4 cost O(R^3). Applying the implicit orthogonal factors to R × L matrices
to compute Â and B̂ requires 4(I + J)RL flops, for a total cost bounded by

2(I + J)R^2 + 4(I + J)RL + O(R^3). (4.4)
In the case of the Gram SVD approach, we unify the analysis for Cholesky QR
and Gram SVD. Algorithm 9 gives the explicit steps assuming Gram SVD is used.
The cost of lines 2 and 3 together is (I + J)R^2 operations, which is incurred by
either method. The eigendecompositions of lines 4 and 5 cost O(R^3). This cost is
approximately 10 times more expensive than performing Cholesky decomposition of
the Gram matrices, but we note that O(R^3) is a lower-order term compared to the cost
of computing the Gram matrices. The matrix multiplications and truncated SVD of
line 6 are also O(R^3), possibly less if A and B are low rank, and similar across the two
methods. Finally, lines 7 and 8 first involve computations with small matrices (of size
R × L or smaller) followed by a single multiplication with the large A or B matrices,
which together cost 2(I + J)RL flops. Overall, the computational cost of the Gram SVD
method is bounded by

(I + J)R^2 + 2(I + J)RL + O(R^3), (4.5)
which is about half the cost of that of the orthogonalization approach, given in
eq. (4.4). Furthermore, the dominant costs of eq. (4.5) come from (symmetric) ma-
trix multiplication rather than computation of/with implicit orthogonal factors, so
we expect higher efficiency for the Gram SVD approach in addition to the reduced
arithmetic.
4.4.4 Numerical Examples
In this section, we demonstrate the empirical error of computing a truncated SVD
of X = AB^T using Gram matrices and compare it to the more accurate orthogonalization
approach. We consider three pairs of synthetic input matrices with differing conditioning
properties to illustrate the differences among the three methods (including both the
Cholesky QR and Gram SVD approaches).
In each case, we construct input matrices A and B each to be 1000 × 50 and
to have geometrically distributed singular values with random left and right singular
vectors. We use double precision in these experiments. In the first case, we construct
both A and B to have condition numbers of 10^6: κ(A) = κ(B) = 10^6. That is,
the largest singular value of each matrix is 10^6, the smallest is 10^0, and the rest are
geometrically distributed within that range. The condition number of X in this case
is bounded above by 10^12. The second synthetic case has input matrices that are
more ill-conditioned: κ(A) = κ(B) = 10^12. The third case has input matrices that
are imbalanced, with κ(A) = 10^12 and κ(B) = 10^0.
Figure 4.1 reports the results from truncation via QR (algorithm 8), Gram SVD
(algorithm 9), and Cholesky QR (variant of algorithm 9 described in section 4.4.2).
Each column of the figure corresponds to a different pair of inputs: the top row
plots the computed relative singular values (normalized by σ_1 so that the first index
is equal to 1), the middle row reports the approximation error after truncation for
various tolerances, and the bottom row reports the computed truncation ranks.
In the left column, we see an example of a typical use case of the algorithm: all
algorithms perform equivalently and the approximation error matches the specified
tolerance. Note that when the tolerance is smaller than the smallest singular value,
no truncation is performed. If both input matrices have condition number smaller
than the inverse of the square root of machine precision, then we expect no distinction
among algorithms. In this case, the conditioning of the Gram matrices is such that
the eigenvalues can be computed accurately and Cholesky decomposition will not fail.
In the middle column, we see an example of input matrices whose condition numbers
are larger than 10^8. In this case, the Gram matrices are numerically low rank,
causing truncation of the Cholesky decomposition and a loss of accuracy in the smallest
eigenvalues. This causes a sharp truncation of the rank in the case of Cholesky
and an overestimate of the singular values of X in the case of Gram SVD. For tolerances
smaller than 10^-8, we see that the approximation error of Cholesky QR does
not drop below the square root of machine precision. The Gram SVD approach's rank
Figure 4.1: Numerical results for truncation of the matrix product X = AB^T. Columns correspond to input matrices with different conditioning properties (details given in section 4.4.4). The top row shows the computed relative singular values before truncation, the middle row reports the relative approximation error after truncation, and the bottom row shows the truncation rank used for various requested tolerances.
selection deviates slightly from that of QR, but only for very small tolerances near
10^-14. We note that for tolerances larger than 10^-8, we see no deviation in behavior
across all three algorithms.
In the right column, we consider input with one matrix close to low rank but
the other well conditioned. Again, for tolerances larger than 10^-8, all algorithms
perform well. For tighter tolerances, however, we see that the inaccuracy of small
eigenvalues of the Gram matrix of A causes deviation in truncation rank selection
and approximation error. As in the second case, the Cholesky QR approach does not
attain error below 10^-8 because of the sharp truncation performed by the pivoted
algorithm. The Gram SVD approach computes approximation errors that match
the tolerance closely below 10^-8, but as the tolerance tightens, the method begins
overestimating the truncation rank and eventually stops truncating at all. In this
way, the approximation error satisfies the tolerance, but the rank is not truncated as
much as possible.
Based on these results, we conclude that for tolerances greater than the square
root of machine precision, truncation using Gram matrices is sufficiently accurate.
While small singular values of A and B are not computed as accurately via the Gram
SVD approach, they are not necessary for computing low-rank approximations with
large approximation error. We note that the relationship among the SVDs of A,
B, and X has an effect on the overall accuracy: even if a less accurate method
is used for the SVDs of A and B, these results show that the Gram SVD approach
can compute singular values of X that are smaller than the square root of machine
precision. Although the cheapest approach using pivoted Cholesky QR
is sufficiently accurate for large tolerances, we use the Gram SVD approach in the
context of TT rounding because it is more robust for smaller tolerances and because
the extra computation has little effect on the overall run time.
Figure 4.2: Tensor network diagrams. (a) Equation (4.6) for N = 6, n = 3; (b) A(1:n)^T A(1:n) for n = 3; (c) computing G^L_n from G^L_{n−1} for n = 3.
4.5 TT-Rounding via Gram SVD
We now apply the approach described in section 4.4 for X = AB^T to the case of
TT rounding. In section 4.5.1, we explain the analogues of the matrices A and B within
the TT rounding algorithm, and in section 4.5.2 we show how to compute the Gram
matrices for the associated structured matrices. We then present two algorithmic variants
of TT rounding based on this approach in section 4.5.3 and provide complexity
analysis in section 4.5.5, with comparison against the standard TT rounding via orthogonalization.
4.5.1 TT Rounding Structure
The nth TT rank of a tensor X is the rank of the unfolding X(1:n), which is an I_1···I_n ×
I_{n+1}···I_N matrix where each column is a vectorization of an n-mode subtensor. If
X is already in TT format, then X(1:n) has the following structure [1, Eq. (2.3)]: