
Topological Autoencoders

Michael Moor † 1 2 Max Horn † 1 2 Bastian Rieck ‡ 1 2 Karsten Borgwardt ‡ 1 2

Abstract

We propose a novel approach for preserving topological structures of the input space in latent representations of autoencoders. Using persistent homology, a technique from topological data analysis, we calculate topological signatures of both the input and latent space to derive a topological loss term. Under weak theoretical assumptions, we construct this loss in a differentiable manner, such that the encoding learns to retain multi-scale connectivity information. We show that our approach is theoretically well-founded and that it exhibits favourable latent representations on a synthetic manifold as well as on real-world image data sets, while preserving low reconstruction errors.

1. Introduction

While topological features, in particular multi-scale features derived from persistent homology, have seen increasing use in the machine learning community (Carrière et al., 2019, Guss & Salakhutdinov, 2018, Hofer et al., 2017, 2019a,b, Ramamurthy et al., 2019, Reininghaus et al., 2015, Rieck et al., 2019a,b), employing topology directly as a constraint for modern deep learning methods remains a challenge. This is due to the inherently discrete nature of these computations, making backpropagation through the computation of topological signatures immensely difficult or only possible in certain special circumstances (Chen et al., 2019, Hofer et al., 2019a, Poulenard et al., 2018).

†Equal contribution. ‡These authors jointly directed this work. ¹Department of Biosystems Science and Engineering, ETH Zurich, 4058 Basel, Switzerland. ²SIB Swiss Institute of Bioinformatics, Switzerland. Correspondence to: Karsten Borgwardt <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

This work presents a novel approach that permits obtaining gradients during the computation of topological signatures. This makes it possible to employ topological constraints while training deep neural networks, as well as building topology-preserving autoencoders. Specifically, we make the following contributions:

1. We develop a new topological loss term for autoencoders that helps harmonise the topology of the data space with the topology of the latent space.

2. We prove that our approach is stable on the level of mini-batches, resulting in suitable approximations of the persistent homology of a data set.

3. We empirically demonstrate that our loss term aids in dimensionality reduction by preserving topological structures in data sets; in particular, the learned latent representations are useful in that the preservation of topological structures can improve interpretability.

2. Background: Persistent Homology

Persistent homology (Barannikov, 1994, Edelsbrunner & Harer, 2008) is a method from the field of computational topology, which develops tools for analysing topological features (connectivity-based features such as connected components) of data sets. We first introduce the underlying concept of simplicial homology. For a simplicial complex K, i.e. a generalised graph with higher-order connectivity information such as cliques, simplicial homology employs matrix reduction algorithms to assign K a family of groups, the homology groups. The dth homology group $H_d(K)$ of K contains d-dimensional topological features, such as connected components (d = 0), cycles/tunnels (d = 1), and voids (d = 2). Homology groups are typically summarised by their ranks, thereby obtaining a simple invariant “signature” of a manifold. For example, a circle in $\mathbb{R}^2$ has one feature with d = 1 (a cycle), and one feature with d = 0 (a connected component).

In practice, the underlying manifold $\mathcal{M}$ is unknown and we are working with a point cloud $X := \{x_1, \dots, x_n\} \subseteq \mathbb{R}^d$ and a metric $\mathrm{dist}\colon X \times X \to \mathbb{R}$ such as the Euclidean distance. Persistent homology extends simplicial homology to this setting: instead of approximating $\mathcal{M}$ by means of a single simplicial complex, which would be an unstable procedure due to the discrete nature of X, persistent homology tracks changes in the homology groups over multiple scales of the metric. This is achieved by constructing a special simplicial complex, the Vietoris–Rips complex (Vietoris, 1927). For $0 \le \varepsilon < \infty$, the Vietoris–Rips complex of X at scale ε, denoted by $\mathcal{R}_\varepsilon(X)$, contains all

arXiv:1906.00722v5 [cs.LG] 31 May 2021


[Figure 1 panels: (a) $\varepsilon_0$, (b) $\varepsilon_1$, (c) $\varepsilon_2$, (d) $\mathcal{D}_d$]

Figure 1. The Vietoris–Rips complex $\mathcal{R}_\varepsilon(X)$ of a point cloud X at different scales $\varepsilon_0$, $\varepsilon_1$, and $\varepsilon_2$. As the distance threshold ε increases, the connectivity changes. The creation and destruction of d-dimensional topological features is recorded in the dth persistence diagram $\mathcal{D}_d$.

simplices (i.e. subsets) of X whose elements $\{x_0, x_1, \dots\}$ satisfy $\mathrm{dist}(x_i, x_j) \le \varepsilon$ for all i, j. Given a matrix $\mathbf{A}$ of pairwise distances of a point cloud X, we will use $\mathcal{R}_\varepsilon(\mathbf{A})$ and $\mathcal{R}_\varepsilon(X)$ interchangeably because constructing $\mathcal{R}_\varepsilon$ only requires distances. Vietoris–Rips complexes satisfy a nesting relation, i.e. $\mathcal{R}_{\varepsilon_i}(X) \subseteq \mathcal{R}_{\varepsilon_j}(X)$ for $\varepsilon_i \le \varepsilon_j$, making it possible to track changes in the homology groups as ε increases (Edelsbrunner et al., 2002). Figure 1 illustrates this process. Since X contains a finite number of points, a maximum ε value exists for which the connectivity stabilises; therefore, calculating $\mathcal{R}_\varepsilon$ for this maximum ε is sufficient to obtain topological features at all scales.
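The definition of the Vietoris–Rips complex can be made concrete with a brute-force sketch. It only uses a distance matrix, lists every vertex subset whose pairwise distances stay below the threshold, and checks the nesting relation on a toy point cloud. The function names and the `max_dim` cut-off are illustrative assumptions and are not taken from the paper's implementation; the enumeration is exponential and only meant for very small inputs.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Return the full Euclidean distance matrix as a list of lists."""
    return [[math.dist(p, q) for q in points] for p in points]


def vietoris_rips(dist, eps, max_dim=2):
    """List all simplices (as vertex tuples) up to dimension max_dim
    whose vertices are pairwise within distance eps of each other."""
    n = len(dist)
    simplices = []
    for k in range(1, max_dim + 2):  # k vertices span a (k - 1)-simplex
        for verts in combinations(range(n), k):
            if all(dist[i][j] <= eps for i, j in combinations(verts, 2)):
                simplices.append(verts)
    return simplices


# Toy point cloud: the four corners of a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
A = pairwise_distances(square)

small = vietoris_rips(A, eps=1.0)  # only the four sides: a cycle, no triangles
large = vietoris_rips(A, eps=1.5)  # diagonals are added, so triangles appear
print(len(small), len(large))
```

At the smaller scale the complex contains a 1-dimensional cycle; at the larger scale the triangles fill it in, which is the kind of creation and destruction event that a persistence diagram records.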

We write $\mathrm{PH}(\mathcal{R}_\varepsilon(X))$ for the persistent homology calculation of the Vietoris–Rips complex. It results in a tuple $(\{\mathcal{D}_1, \mathcal{D}_2, \dots\}, \{\pi_1, \pi_2, \dots\})$ of persistence diagrams (1st component) and persistence pairings (2nd component). The d-dimensional persistence diagram $\mathcal{D}_d$ (Figure 1d) of $\mathcal{R}_\varepsilon(X)$ contains coordinates of the form (a, b), where a refers to a threshold ε at which a d-dimensional topological feature is created in the Vietoris–Rips complex, and b refers to a threshold ε′ at which it is destroyed (please refer to Supplementary Section A.1 for a detailed explanation). When d = 0, for example, the threshold ε′ indicates at which distance two connected components in X are merged into one. This calculation is known to be related to spanning trees (Kurlin, 2015) and single-linkage clustering, but the persistence diagrams and the persistence pairings carry more information than either one of these concepts.

The d-dimensional persistence pairing contains indices (i, j) corresponding to simplices $s_i, s_j \in \mathcal{R}_\varepsilon(X)$ that create and destroy the topological feature identified by $(a, b) \in \mathcal{D}_d$, respectively. Persistence diagrams are known to be stable with respect to small perturbations in the data set (Cohen-Steiner et al., 2007). Two diagrams $\mathcal{D}$ and $\mathcal{D}'$ can be compared using the bottleneck distance $d_b(\mathcal{D}, \mathcal{D}') := \inf_{\eta\colon \mathcal{D} \to \mathcal{D}'} \sup_{x \in \mathcal{D}} \|x - \eta(x)\|_\infty$, where $\eta\colon \mathcal{D} \to \mathcal{D}'$ denotes a bijection between the points of the two diagrams, and $\|\cdot\|_\infty$ refers to the $L_\infty$ norm. We use $\mathcal{D}^X$ to refer to the set of persistence diagrams of a point cloud X arising from $\mathrm{PH}(\mathcal{R}_\varepsilon(X))$.
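As an illustration of the bottleneck distance formula, the following sketch evaluates it literally by trying every bijection between two small diagrams of equal size. This is an assumption-laden toy: practical implementations also allow points to be matched to the diagonal and use far more efficient matching algorithms, and the function name `bottleneck_bruteforce` is hypothetical.

```python
from itertools import permutations


def bottleneck_bruteforce(diag_a, diag_b):
    """Minimise, over all bijections, the largest L-infinity distance
    between matched points. Feasible only for a handful of points."""
    if len(diag_a) != len(diag_b):
        raise ValueError("this sketch only handles diagrams of equal size")
    best = float("inf")
    for perm in permutations(diag_b):
        cost = max(
            (max(abs(a[0] - b[0]), abs(a[1] - b[1])) for a, b in zip(diag_a, perm)),
            default=0.0,
        )
        best = min(best, cost)
    return best


# Two 0-dimensional diagrams: every feature is created at 0.
D1 = [(0.0, 1.0), (0.0, 2.0)]
D2 = [(0.0, 2.5), (0.0, 1.25)]
distance = bottleneck_bruteforce(D1, D2)
print(distance)
```

Here the best bijection pairs the two short-lived and the two long-lived features, so the distance is governed by the larger of the two shifts in destruction value.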

[Figure 2 labels: input data X, latent code Z, reconstruction $\tilde{X}$, reconstruction loss, topological loss; persistence diagram axes ε and ε′]

Figure 2. An overview of our method. Given a mini-batch X of data space $\mathcal{X}$, we train an autoencoder to reconstruct X, leading to a reconstruction $\tilde{X}$. In addition to the usual reconstruction loss, we calculate our topological loss based on the topological differences between persistence diagrams, i.e. topological feature descriptors, calculated on the mini-batch X and its corresponding latent code Z. The objective of our topological loss term is to constrain the autoencoder such that topological features in the data space are preserved in latent representations.

3. A Topology-Preserving Autoencoder

We propose a generic framework for constraining autoencoders to preserve topological structures (measured via persistent homology) of the data space in their latent encodings. Figure 2 depicts an overview of our method; the subsequent sections will provide more details about the individual steps.

3.1. Vietoris–Rips Complex Calculation

Given a finite metric space S, such as a point cloud, we first calculate the persistent homology of the Vietoris–Rips complex of its distance matrix $\mathbf{A}^S$. It is common practice to use the Euclidean distance for the calculation of $\mathbf{A}^S$, but neither the persistent homology calculation nor our method is restricted to any particular distance; previous research (Wagner & Dłotko, 2014) shows that even similarity measures that do not satisfy the properties of a metric can be used successfully with $\mathrm{PH}(\cdot)$. In the following, let $\varepsilon := \max \mathbf{A}^S$ so that $\mathcal{R}_\varepsilon(\mathbf{A}^S)$ is the corresponding Vietoris–Rips complex as described in Section 2. Given a maximum dimension¹ of $d \in \mathbb{N}_{>0}$, we obtain a set of persistence diagrams $\mathcal{D}^S$ and a set of persistence pairings $\pi^S$. The dth persistence pairing $\pi^S_d$ contains indices of simplices that are pertinent to the creation and destruction of d-dimensional topological features. We can consider each pairing to represent edge indices, namely the edges that are deemed to be “topologically relevant” by the computation of persistent homology (see below for more details). This works because the Vietoris–Rips complex is a clique complex, i.e. a simplicial complex that is fully determined by its edges (Zomorodian, 2010).

Selecting indices from pairings. The basic idea of our method involves selecting indices in the persistence pairing and mapping them back to a distance between two vertices. We then adjust this distance to harmonise topological features of the input space and the latent space. For 0-dimensional topological features, it is sufficient to consider the indices of edges, which are the “destroyer” simplices, in the pairing $\pi^S_0$. Each index corresponds to an edge in the minimum spanning tree of the data set. This calculation is computationally efficient, having a worst-case complexity of $O(m^2 \cdot \alpha(m^2))$, where m is the batch size and $\alpha(\cdot)$ denotes the extremely slow-growing inverse Ackermann function (Cormen et al., 2009, Chapter 22). For 1-dimensional features, where edges are paired with triangles, we obtain edge indices by selecting the edge with the maximum weight of the triangle. While this procedure, and thus our method, generalises to higher dimensions, our current implementation supports no higher-dimensional features. Since preliminary experiments showed that using 1-dimensional topological features merely increases runtime, the subsequent experiments will focus only on 0-dimensional persistence diagrams. We thus use $(\mathcal{D}^S, \pi^S)$ to denote the 0-dimensional persistence diagram and pairing of S, respectively.
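The link between the 0-dimensional pairing and the minimum spanning tree can be sketched with Kruskal's algorithm and a union-find structure. This is a minimal illustration under our own assumptions, not the authors' code: the name `zero_dim_pairing` is hypothetical, distances are assumed to be unique, and union by rank is omitted for brevity, so the sketch does not attain the complexity bound quoted above.

```python
import math
from itertools import combinations


def zero_dim_pairing(dist):
    """Return the minimum spanning tree edges (i, j) in order of increasing
    distance. Each edge merges two connected components, i.e. it destroys
    one 0-dimensional topological feature."""
    n = len(dist)
    parent = list(range(n))

    def find(u):
        # Follow parent pointers to the component root, halving the path.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    edges = sorted(combinations(range(n), 2), key=lambda e: dist[e[0]][e[1]])
    pairing = []
    for i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge connects two separate components
            parent[root_i] = root_j
            pairing.append((i, j))
    return pairing


points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
A = [[math.dist(p, q) for q in points] for p in points]
pairing = zero_dim_pairing(A)
# Subsetting the distance matrix with the pairing yields the destruction values.
deaths = [A[i][j] for i, j in pairing]
print(pairing, deaths)
```

Because all 0-dimensional features are created at scale 0, the list `deaths` carries the full information of the 0-dimensional persistence diagram for this toy point cloud.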

3.2. Topological Autoencoder

In the following, we consider a mini-batch X of size m from the data space $\mathcal{X}$ as a point cloud. Furthermore, we define an autoencoder as the composition of two functions $h \circ g$, where $g\colon \mathcal{X} \to \mathcal{Z}$ represents the encoder and $h\colon \mathcal{Z} \to \mathcal{X}$ represents the decoder, denoting latent codes by $Z := g(X)$. During a forward pass of the autoencoder, we compute the persistent homology of the mini-batch in both the data as well as the latent space, yielding two sets of tuples, i.e. $(\mathcal{D}^X, \pi^X) := \mathrm{PH}(\mathcal{R}_\varepsilon(X))$ and $(\mathcal{D}^Z, \pi^Z) := \mathrm{PH}(\mathcal{R}_\varepsilon(Z))$.

The values of the persistence diagram can be retrieved by subsetting the distance matrix with the edge indices provided by the persistence pairings; we write $\mathcal{D}^X \simeq \mathbf{A}^X[\pi^X]$ to indicate that the diagram, which is a set, contains the same information as the distances we retrieve with the pairing. We treat $\mathbf{A}^X[\pi^X]$ as a vector in $\mathbb{R}^{|\pi^X|}$. Informally speaking, the persistent homology calculation can thus be seen as a selection of topologically relevant edges of the Vietoris–Rips complex, followed by the selection of corresponding entries in the distance matrix. By comparing both diagrams, we can construct a topological regularisation term $\mathcal{L}_t := \mathcal{L}_t(\mathbf{A}^X, \mathbf{A}^Z, \pi^X, \pi^Z)$, which we add to the reconstruction loss of an autoencoder, i.e.

$$\mathcal{L} := \mathcal{L}_r(X, h(g(X))) + \lambda \mathcal{L}_t, \tag{1}$$

where $\lambda \in \mathbb{R}$ is a parameter to control the strength of the regularisation (see also Supplementary Section A.6).

¹ This means that we do not have to consider higher-dimensional topological features, making the calculation more efficient.

Next, we discuss how to specify $\mathcal{L}_t$. Since we only select edge indices from $\pi^X$ and $\pi^Z$, the PH calculation represents a selection of topologically relevant distances from the distance matrix. Each persistence diagram entry corresponds to a distance between two data points. Following standard assumptions in persistent homology (Hofer et al., 2019a, Poulenard et al., 2018), we assume that the distances are unique so that each entry in the diagram has an infinitesimal neighbourhood that only contains a single point. In practice, this can always be achieved by performing (symbolic) perturbations of the distances. Given this fixed pairing and a differentiable distance function, the persistence diagram entries are therefore also differentiable with respect to the encoder parameters. Hence, the persistence pairing does not change upon a small perturbation of the underlying distances, thereby guaranteeing the existence of the derivative of our loss function. This, in turn, permits the calculation of gradients for backpropagation.

A straightforward approach to impose the data space topology on the latent space would be to directly calculate a loss based on the selected distances in both spaces. Such an approach will not result in informative gradients for the autoencoder, as it merely compares topological features without matching² the edges between $\mathcal{R}_\varepsilon(X)$ and $\mathcal{R}_\varepsilon(Z)$. A cleaner approach would be to enforce similarity on the intersection of the selected edges in both complexes. However, this would initially include very few edges, preventing efficient training and leading to highly biased estimates of the topological alignments between the spaces³. To overcome this, we account for the union of all selected edges in X and Z. Our topological loss term decomposes into two components, each handling the “directed” loss occurring as topological features in one of the two spaces remain fixed. Hence, $\mathcal{L}_t = \mathcal{L}_{X \to Z} + \mathcal{L}_{Z \to X}$, with

$$\mathcal{L}_{X \to Z} := \frac{1}{2} \left\| \mathbf{A}^X[\pi^X] - \mathbf{A}^Z[\pi^X] \right\|^2 \tag{2}$$

and

$$\mathcal{L}_{Z \to X} := \frac{1}{2} \left\| \mathbf{A}^Z[\pi^Z] - \mathbf{A}^X[\pi^Z] \right\|^2, \tag{3}$$

respectively. The key idea for both terms is to align and preserve topologically relevant distances from both spaces. By taking the union of all selected edges (and the corresponding distances), we obtain an informative loss term that is determined by at least |X| distances. This loss can be seen as a more generic version of the loss introduced by Hofer et al. (2019a), whose formulation does not take the two directed components into account and optimises the destruction values of all persistence tuples with respect to a uniform parameter (their goal is different from ours and does not require a loss term that is capable of harmonising topological features across the two spaces; please refer to Section 4 for a brief discussion). By contrast, our formulation aims to align the distances between the mini-batches X and Z (which in turn will lead to an alignment of distances between the spaces $\mathcal{X}$ and $\mathcal{Z}$). If the two spaces are aligned perfectly, $\mathcal{L}_{X \to Z} = \mathcal{L}_{Z \to X} = 0$ because both pairings and their corresponding distances coincide. The converse implication is not true: if $\mathcal{L}_t = 0$, the persistence pairings and their corresponding persistence diagrams are not necessarily identical. Since we did not observe such behaviour in our experiments, however, we leave a more formal treatment of these situations for future work.

² We use the term “matching” only to build intuition. Our approach does not calculate a matching in the sense of a bottleneck or Wasserstein distance between persistence diagrams.

³ When initialising a random latent space Z, the persistence pairing in the latent space will select random edges, resulting in only 1 expected matched edge (independent of mini-batch size) between the two pairings. Thus, only one edge (referring to one pairwise distance between two latent codes) could be used to update the encoding of these two data points.
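The two directed terms and their combination with the reconstruction loss can be written down in a few lines. The sketch below operates on plain nested lists and takes the pairings as given edge lists; in the method they would come from the persistent homology calculation. All function names and the toy numbers are assumptions for illustration.

```python
def directed_loss(A_source, A_target, pairing):
    """Half the squared Euclidean norm of the difference between the
    distances that the pairing selects in both distance matrices."""
    return 0.5 * sum((A_source[i][j] - A_target[i][j]) ** 2 for i, j in pairing)


def topological_loss(A_x, A_z, pairing_x, pairing_z):
    """Sum of the two directed terms, mirroring Eqs. 2 and 3."""
    loss_xz = directed_loss(A_x, A_z, pairing_x)
    loss_zx = directed_loss(A_z, A_x, pairing_z)
    return loss_xz + loss_zx


def total_loss(reconstruction_loss, topo_loss, lam):
    """Regularised objective in the spirit of Eq. 1."""
    return reconstruction_loss + lam * topo_loss


# Toy example: three collinear points; the latent distances are doubled.
A_x = [[0.0, 1.0, 3.0], [1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
A_z = [[0.0, 2.0, 6.0], [2.0, 0.0, 4.0], [6.0, 4.0, 0.0]]
pairing_x = [(0, 1), (1, 2)]  # minimum spanning tree edges of the input batch
pairing_z = [(0, 1), (1, 2)]  # minimum spanning tree edges of the latent codes

topo = topological_loss(A_x, A_z, pairing_x, pairing_z)
objective = total_loss(reconstruction_loss=0.1, topo_loss=topo, lam=0.5)
print(topo, objective)
```

In this example both pairings select the same edges, so the two directed terms coincide; the loss vanishes only if the selected distances agree in both spaces.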

Gradient calculation. Letting θ refer to the parameters of the encoder and using $\rho := (\mathbf{A}^X[\pi^X] - \mathbf{A}^Z[\pi^X])$, we have

$$\frac{\partial}{\partial\theta} \mathcal{L}_{X \to Z} = \frac{\partial}{\partial\theta} \left( \frac{1}{2} \left\| \mathbf{A}^X[\pi^X] - \mathbf{A}^Z[\pi^X] \right\|^2 \right) \tag{4}$$

$$= -\rho^\top \left( \frac{\partial \mathbf{A}^Z[\pi^X]}{\partial\theta} \right) \tag{5}$$

$$= -\rho^\top \sum_{i=1}^{|\pi^X|} \frac{\partial \mathbf{A}^Z[\pi^X]_i}{\partial\theta}, \tag{6}$$

where $|\pi^X|$ denotes the cardinality of a persistence pairing and $\mathbf{A}^Z[\pi^X]_i$ refers to the ith entry of the vector of paired distances. This derivation works analogously for $\mathcal{L}_{Z \to X}$ (with $\pi^X$ being replaced by $\pi^Z$). Furthermore, any derivative of $\mathbf{A}^X$ with respect to θ must vanish because the distances of the input samples do not depend on the encoding by definition. These equations assume infinitesimal perturbations. The persistence diagrams change in a non-differentiable manner during the training phase. However, for any given update step, a diagram is robust to infinitesimal changes of its entries (Cohen-Steiner et al., 2007). As a consequence, our topological loss is differentiable for each update step during training. We make our code publicly available⁴.
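To make the role of the fixed pairing tangible, the following sketch differentiates the directed term with respect to the latent codes themselves (rather than the encoder parameters θ, which would require the chain rule through the encoder) and compares the analytic result with central finite differences. The Euclidean latent distance, the function names, and the toy values are assumptions of this sketch.

```python
import math


def loss_xz(A_x, Z, pairing):
    """Directed loss for a fixed pairing, with Euclidean latent distances."""
    return 0.5 * sum((A_x[i][j] - math.dist(Z[i], Z[j])) ** 2 for i, j in pairing)


def grad_loss_xz(A_x, Z, pairing):
    """Analytic gradient with respect to every latent coordinate."""
    dim = len(Z[0])
    grad = [[0.0] * dim for _ in Z]
    for i, j in pairing:
        d = math.dist(Z[i], Z[j])
        rho = A_x[i][j] - d  # residual of this paired distance
        for k in range(dim):
            unit = (Z[i][k] - Z[j][k]) / d  # derivative of d w.r.t. Z[i][k]
            grad[i][k] -= rho * unit
            grad[j][k] += rho * unit
    return grad


def numerical_grad(A_x, Z, pairing, h=1e-6):
    """Central finite differences, used only as a sanity check."""
    dim = len(Z[0])
    grad = [[0.0] * dim for _ in Z]
    for i in range(len(Z)):
        for k in range(dim):
            plus = [row[:] for row in Z]
            minus = [row[:] for row in Z]
            plus[i][k] += h
            minus[i][k] -= h
            grad[i][k] = (loss_xz(A_x, plus, pairing) - loss_xz(A_x, minus, pairing)) / (2 * h)
    return grad


A_x = [[0.0, 1.0, 3.0], [1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
Z = [[0.0, 0.0], [1.0, 0.5], [2.0, -1.0]]
pairing = [(0, 1), (1, 2)]

analytic = grad_loss_xz(A_x, Z, pairing)
numeric = numerical_grad(A_x, Z, pairing)
max_err = max(abs(a - b) for ra, rb in zip(analytic, numeric) for a, b in zip(ra, rb))
print(max_err)
```

The check only holds while the pairing stays fixed, which is exactly the assumption that the uniqueness of distances is meant to guarantee for small perturbations.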

3.3. Stability

Despite the aforementioned known stability of persistence diagrams with respect to small perturbations of the underlying space, we still have to analyse our topological approximation on the level of mini-batches. The following theorem guarantees that subsampled persistence diagrams are close to the persistence diagrams of the original point cloud.

Theorem 1. Let X be a point cloud of cardinality n and $X^{(m)}$ be one subsample of X of cardinality m, i.e. $X^{(m)} \subseteq X$, sampled without replacement. We can bound the probability of the persistence diagrams of $X^{(m)}$ exceeding a threshold in terms of the bottleneck distance as

$$\mathbb{P}\left( d_b\left( \mathcal{D}^X, \mathcal{D}^{X^{(m)}} \right) > \varepsilon \right) \le \mathbb{P}\left( d_H\left( X, X^{(m)} \right) > 2\varepsilon \right), \tag{7}$$

where $d_H$ refers to the Hausdorff distance between the point cloud and its subsample.

Proof. See Section A.2 in the supplementary materials.

For $m \to n$, each mini-batch converges to the original point cloud, so we have $\lim_{m \to n} d_H(X, X^{(m)}) = 0$. Please refer to Section A.3 for an analysis of empirical convergence rates as well as a discussion of a worst-case bound. Given certain independence assumptions, the next theorem approximates the expected value of the Hausdorff distance between the point cloud and a mini-batch. The calculation of an exact representation is beyond the scope of this work.
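The convergence of the Hausdorff distance as the subsample grows can be observed on a toy point cloud. The sketch below uses nested subsamples (prefixes of one shuffled ordering), which is a simplifying assumption made so that the distance cannot increase with m; the uniform random point cloud and the helper name are likewise illustrative.

```python
import math
import random


def hausdorff_to_subsample(X, sub):
    """Hausdorff distance between X and a subset of X. Every subsample point
    also lies in X, so only the distances from X to the subsample matter."""
    return max(min(math.dist(x, s) for s in sub) for x in X)


rng = random.Random(0)
X = [(rng.random(), rng.random()) for _ in range(200)]

order = X[:]
rng.shuffle(order)  # one fixed random ordering yields nested subsamples
sizes = (10, 50, 200)
dists = [hausdorff_to_subsample(X, order[:m]) for m in sizes]
print(dists)
```

For m = n the subsample equals the point cloud and the distance is exactly zero, in line with the limit stated above.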

⁴ https://github.com/BorgwardtLab/topological-autoencoders

Theorem 2. Let $\mathbf{A} \in \mathbb{R}^{n \times m}$ be the distance matrix between samples of X and $X^{(m)}$, where the rows are sorted such that the first m rows correspond to the columns of the m subsampled points with diagonal elements $a_{ii} = 0$. Assume that the entries $a_{ij}$ with i > m are random samples following a distance distribution $F_D$ with $\mathrm{supp}(F_D) \subseteq \mathbb{R}_{\ge 0}$. The minimal distances $\delta_i$ for rows with i > m follow a distribution $F_\Delta$. Letting $Z := \max_{1 \le i \le n} \delta_i$ with a corresponding distribution $F_Z$, the expected Hausdorff distance between X and $X^{(m)}$ for m < n is bounded by

$$\mathbb{E}\left[ d_H\left( X, X^{(m)} \right) \right] = \mathbb{E}_{Z \sim F_Z}[Z] \tag{8}$$

$$\le \int_0^{+\infty} \left( 1 - F_\Delta(z)^{n-m} \right) \mathrm{d}z, \tag{9}$$

where

$$F_\Delta(z) = -\sum_{k=1}^{m} \binom{m}{k} \left( -F_D(z) \right)^{m-k}. \tag{10}$$

Proof. See Section A.4 in the supplementary materials.

From Eq. 9, we obtain $\mathbb{E}[d_H(X, X^{(m)})] = 0$ as $m \to n$, so the expected value converges as the subsample size approaches the total sample size⁵. We conclude that our subsampling approach results in point clouds that are suitable proxies for the large-scale topological structures of the point cloud X.

4. Related Work

Computational topology and persistent homology (PH) have started gaining traction in several areas of machine learning research. PH is often used as a post hoc method for analysing topological characteristics of data sets. Thus, there are several methods that compare topological features of high-dimensional spaces with different embeddings to assess the fidelity and quality of a specific embedding scheme (Khrulkov & Oseledets, 2018, Paul & Chalup, 2017, Rieck & Leitte, 2015, 2017, Yan et al., 2018). PH can also be used to characterise the training of neural networks (Guss & Salakhutdinov, 2018, Rieck et al., 2019b), as well as their decision boundaries (Ramamurthy et al., 2019). Our method differs from all these publications in that we are able to obtain gradient information to update a model while training. Alternatively, topological features can be integrated into classifiers to improve classification performance. Hofer et al. (2017) propose a neural network layer that learns projections of persistence diagrams, which can subsequently be used as feature descriptors to classify structured data. Moreover, several vectorisation strategies for persistence diagrams exist (Adams et al., 2017, Carrière et al., 2015), making it possible to use them in kernel-based classifiers. These strategies have been subsumed (Carrière et al., 2019) in a novel architecture based on deep sets. The commonality of these approaches is that they treat persistence diagrams as being fixed; while they are capable of learning suitable parameters for classifying them, they cannot adjust input data to better approximate a certain topology.

⁵ For m = n, the two integrals switch their order as m(n − m) = 0 < n − 1 (for n > 1).

Such topology-based adjustments have only recently become feasible. Poulenard et al. (2018) demonstrated how to optimise real-valued functions based on their topology. This constitutes the first approach for aligning persistence diagrams by modifying input data; it requires the connectivity of the data to be known, and the optimised functions have to be node-based and scalar-valued. By contrast, our method works directly on distances and sidesteps connectivity calculations via the Vietoris–Rips complex. Chen et al. (2019) use a similar optimisation technique to regularise the decision boundary of a classifier. However, this requires discretising the space, which can be computationally expensive. Hofer et al. (2019a), the closest work to ours, also present a differentiable loss term. Their formulation enforces a single scale, referred to as η, on the latent space. The learned encoding is then applied to a one-class learning task in which a scoring function is calculated based on the pre-defined scale. By contrast, the goal of our loss term is to support the model in learning a latent encoding that best preserves the data space topology in said latent space, which we use for dimensionality reduction. We thus target a different task, and can preserve multiple scales (those selected through the filtration process) that are present in the data domain.

5. Experiments

Our main task is to learn a latent space in an unsupervised manner such that topological features of the data space, measured using persistent homology approximations on every batch, are preserved as much as possible.

5.1. Experimental Setup

In the following, we briefly describe our data sets and evaluation metrics. Please refer to the supplementary materials for technical details (calculation, hyperparameters, etc.).

5.1.1. DATA SETS

We generate a SPHERES data set that consists of ten high-dimensional 100-spheres living in a 101-dimensional space that are enclosed by one larger sphere that consists of the same number of points as the total of inner spheres (see Section A.5 for more details). We also use three image data sets (MNIST, FASHION-MNIST, and CIFAR-10), which are particularly amenable to our topology-based analysis because real-world images are known to lie on or near low-dimensional manifolds (Lee et al., 2003, Peyré, 2009).
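A data set of this shape could be generated along the following lines. This is only a sketch of the description above: the radii, the Gaussian shift of the inner sphere centres, the number of points per sphere, and the function names are all assumptions chosen for illustration, since the exact construction is deferred to Section A.5.

```python
import math
import random


def sample_sphere(n_points, dim, radius, centre, rng):
    """Draw points uniformly from the sphere of the given radius around
    centre by normalising standard Gaussian vectors."""
    points = []
    for _ in range(n_points):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        points.append([centre[k] + radius * v[k] / norm for k in range(dim)])
    return points


def make_spheres(n_inner=10, points_per_sphere=20, dim=101,
                 inner_radius=5.0, outer_radius=25.0, seed=0):
    """Ten small spheres plus one large sphere around the origin that holds
    as many points as all small spheres together (assumed parameters)."""
    rng = random.Random(seed)
    data, labels = [], []
    for label in range(n_inner):
        centre = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # assumed random shift
        data += sample_sphere(points_per_sphere, dim, inner_radius, centre, rng)
        labels += [label] * points_per_sphere
    n_outer = n_inner * points_per_sphere
    data += sample_sphere(n_outer, dim, outer_radius, [0.0] * dim, rng)
    labels += [n_inner] * n_outer  # label n_inner marks the enclosing sphere
    return data, labels


data, labels = make_spheres()
print(len(data), len(data[0]))
```

With these assumed values, half of all points belong to the enclosing sphere, matching the description that it consists of the same number of points as the inner spheres combined.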

5.1.2. BASELINES & TRAINING PROCEDURE

We compare our approach with several dimensionality reduction techniques, including UMAP (McInnes et al., 2018), t-SNE (van der Maaten & Hinton, 2008), Isomap (Tenenbaum et al., 2000), PCA, as well as standard autoencoders (AE). We apply our proposed topological constraint to this standard autoencoder architecture (TopoAE).

For comparability and interpretability, each method is restricted to two latent dimensions. We split each data set into training and testing (using the predefined split if available; 90% versus 10% otherwise). Additionally, we remove 15% of the training split as a validation data set for tuning the hyperparameters. We normalised our topological loss term by the batch size m in order to disentangle λ from it. All autoencoders employ batch-norm and are optimised using ADAM (Kingma & Ba, 2014). Since t-SNE is not intended to be applied to previously unseen test samples, we evaluate this method only on the train split. In addition, significant computational scaling issues prevent us from running a hyperparameter search for Isomap on real-world data sets, so we only compare this algorithm on the synthetic data set. Please refer to Section A.6 for more details on architectures and hyperparameters.

5.1.3. EVALUATION

We evaluate the quality of latent representations in terms of (1) low-dimensional visualisations, (2) dimensionality reduction quality metrics (evaluated between input data and latent codes), and (3) reconstruction errors (Data MSE; evaluated between input and reconstructed data), provided that invertible transformations are available⁶. For (2), we consider several measures (please refer to Section A.7 for more details). First, we calculate $\mathrm{KL}_\sigma$, the Kullback–Leibler divergence between density estimates of the input and latent space (Chazal et al., 2011, 2014b), where $\sigma \in \mathbb{R}_{>0}$ denotes the length scale of the Gaussian kernel, which is varied to account for multiple data scales. We chose minimising $\mathrm{KL}_{0.1}$ as our hyperparameter search objective. Furthermore, we calculate common non-linear dimensionality reduction (NLDR) quality metrics, which use the pairwise distance matrices of the input and the latent space (as indicated by the “ℓ” in the abbreviations), namely (1) the root mean square error (ℓ-RMSE), which, despite its name, is not related to the reconstruction error of the autoencoder but merely measures to what extent the two distributions of distances coincide, (2) the mean relative rank error (ℓ-MRRE), (3) the continuity (ℓ-Cont), and (4) the trustworthiness (ℓ-Trust). The reported measures are computed on the test splits (except for t-SNE where no transformation between splits is available, so we report the measures on a random subsample of the train split, preserving the cardinality of the test split).

⁶ Invertible transformations are available for PCA and all autoencoder-based methods.
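As a rough illustration of a distance-based quality measure, the sketch below computes a root mean square error between two pairwise distance matrices. The precise definition used for ℓ-RMSE is given in Section A.7 and is not reproduced here, so the normalisation over all matrix entries and the function name are assumptions of this sketch.

```python
import math


def distance_rmse(A_x, A_z):
    """Root mean square difference between two distance matrices of the
    same shape, averaged over all entries (assumed normalisation)."""
    n = len(A_x)
    squared = [(A_x[i][j] - A_z[i][j]) ** 2 for i in range(n) for j in range(n)]
    return math.sqrt(sum(squared) / len(squared))


# Two points at distance 1 in the input space and distance 2 in the latent space.
A_x = [[0.0, 1.0], [1.0, 0.0]]
A_z = [[0.0, 2.0], [2.0, 0.0]]
rmse = distance_rmse(A_x, A_z)
print(rmse)
```

A value of zero means that all pairwise distances are reproduced exactly; the measure says nothing about how well the decoder reconstructs the input.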

[Figure 3 panels: (a) PCA, (b) Isomap, (c) t-SNE, (d) UMAP, (e) AE, (f) TopoAE]

Figure 3. Latent representations of the SPHERES data set. Only our method is capable of representing the complicated nesting relationship inherent to the data; t-SNE, for example, tears the original data apart. For TopoAE, we used a batch size of 28. Please refer to Figure A.5 in the supplementary materials for an enlarged version.

5.2. Results

In addition to a quantitative evaluation in terms of various quality metrics, we also discuss qualitative results in terms of visualisations, which are interpretable in case the ground truth manifold is known.

5.2.1. QUANTITATIVE RESULTS

Table 1 reports the quantitative results. Overall, we observe that our method is capable of preserving the data density over multiple length scales (as measured by KL). Furthermore, we find that TopoAE displays competitive continuity values (ℓ-Cont) and reconstruction errors (Data MSE). The latter is particularly relevant as it demonstrates that imposing our topological constraints does not result in large impairments when reconstructing the input space.

The remaining classical measures favour the baselines (foremost the train (!) performance of t-SNE). However, we will subsequently see by inspecting the latent spaces that those classic measures fail to detect the relevant structural information, as exemplified with known ground truth manifolds, such as the SPHERES data set.

Data set   Method   KL0.01   KL0.1   KL1       ℓ-MRRE   ℓ-Cont   ℓ-Trust   ℓ-RMSE   Data MSE
SPHERES    Isomap   0.181    0.420   0.00881   0.246    0.790    0.676     10.4     –
SPHERES    PCA      0.332    0.651   0.01530   0.294    0.747    0.626     11.8     0.9610
SPHERES    TSNE     0.152    0.527   0.01271   0.217    0.773    0.679     8.1      –
SPHERES    UMAP     0.157    0.613   0.01658   0.250    0.752    0.635     9.3      –
SPHERES    AE       0.566    0.746   0.01664   0.349    0.607    0.588     13.3     0.8155
SPHERES    TopoAE   0.085    0.326   0.00694   0.272    0.822    0.658     13.5     0.8681
F-MNIST    PCA      0.356    0.052   0.00069   0.057    0.968    0.917     9.1      0.1844
F-MNIST    TSNE     0.405    0.071   0.00198   0.020    0.967    0.974     41.3     –
F-MNIST    UMAP     0.424    0.065   0.00163   0.029    0.981    0.959     13.7     –
F-MNIST    AE       0.478    0.068   0.00125   0.026    0.968    0.974     20.7     0.1020
F-MNIST    TopoAE   0.392    0.054   0.00100   0.032    0.980    0.956     20.5     0.1207
MNIST
CIFAR

Table 1. Embedding quality according to multiple evaluation metrics (Section 5.1). The hyperparameters of all tunable methods were selected to minimise the objective KL0.1. For each criterion, the winner is shown in bold and underlined, the runner-up in bold. Please refer to Supplementary Table A.2 for more σ scales and variance estimates. The column “Data MSE” indicates the reconstruction error. It is included to demonstrate that applying our loss term has no adverse effects.

5.2.2. VISUALISATION OF LATENT SPACES

For the SPHERES data set (Figure 3), we observe that only our method is capable of capturing the nesting relationship of the high-dimensional spheres correctly. By contrast, t-SNE “cuts open” the enclosing sphere, distributing most of its points around the remaining spheres. We see that the KL-divergence confirms the visual assessment that only our proposed method preserves the relevant structure of this data set. Several classical evaluation measures, however, favour t-SNE, even though this method fails to capture the global structure and nesting relationship of the enclosing sphere manifold accounting for half of the data set.

On FASHION-MNIST (Figure 4, leftmost column), we see that, as opposed to AE, which is purely driven by the reconstruction error, our method has the additional objective of preserving structure. Here, the constraint helps the regularised autoencoder to “organise” the latent space, resulting in a comparable pattern as in UMAP, which is also topologically motivated (McInnes et al., 2018). Furthermore, we observe that t-SNE tends to fragment certain classes (dark orange, red) into multiple distinct subgroups. This likely does not reflect the underlying manifold structure, but constitutes an artefact frequently observed with this method. For MNIST, the latent embeddings (Figure 4, middle column) demonstrate that the non-linear competitors (mostly by pulling apart distinct classes) lose some of the relationship information between clusters when comparing against our proposed method or PCA. Finally, we observe that CIFAR-10 (Figure 4, rightmost column) is challenging to embed in two latent dimensions in a purely unsupervised manner. Interestingly, our method (consistently, i.e. over all runs) was able to identify a linear substructure that separates the latent space in two additional groups of classes.

6. Discussion and Conclusion

We presented topological autoencoders, a novel method for preserving topological information, measured in terms of persistent homology, of the input space when learning latent representations with deep neural networks. Under weak theoretical assumptions, we showed how our persistent homology calculations can be combined with backpropagation; moreover, we proved that approximating persistent homology on the level of mini-batches is theoretically justified.


[Figure 4 grid: columns FASHION-MNIST, MNIST, CIFAR-10; rows PCA, t-SNE, UMAP, AE, TopoAE]

Figure 4. Latent representations of the FASHION-MNIST (left column), MNIST (middle column), CIFAR-10 (right column) data sets. The respective batch size values for our method are (95, 126, 82). Please refer to Figures A.6, A.7, and A.8 in the supplementary materials for enlarged versions.

In our experiments, we observed that our method is uniquely able to capture spatial relationships between nested high-dimensional spheres. This is relevant, as the ability to cope with several manifolds in the domain of manifold learning still remains a challenging task. On real-world data sets, we observed that our topological loss leads to competitive performance in terms of numerous quality metrics (such as a density preservation metric), while not adversely affecting the reconstruction error. In both synthetic and real-world data sets, we obtain interesting and interpretable representations, as our method does not merely pull apart different classes, but tries to spatially arrange them meaningfully. Thus, we do not observe mere distinct “clouds”, but rather entangled structures, which we consider to constitute a more meaningful representation of the underlying manifolds (an auxiliary analysis in Supplementary Section A.10 confirms that our method influences topological features, measured using PH, in a beneficial manner).

Future work Our topological loss formulation is highly generalisable; it only requires the existence of a distance matrix between individual samples (either globally, or on the level of batches). As a consequence, our topological loss term can be directly integrated into a variety of different architectures and is not limited to standard autoencoders. For instance, we can also apply our constraint to variational setups (see Figure A.3 for a sketch) or create a topology-aware variant of principal component analysis (dubbed “TopoPCA”; see Table A.2 for more details, as well as Figures A.6, A.7, and A.8 for the corresponding latent space representations). Applying our loss term to more involved architectures will be an exciting route for future work. One issue with the calculation is that, given the computational complexity of calculating Rε(·) for higher-dimensional features, we would scale progressively worse with increasing batch size. However, in our low-dimensional setup, we observed that runtime tends to grow with decreasing batch size, i.e. the mini-batch speed-up still dominates runtime (for more details concerning the effect of batch sizes, see Supplementary Section A.8). In future work, scaling to higher dimensions could be mitigated by approximating the calculation of persistent homology (Choudhary et al., 2018, Kerber & Sharathkumar, 2013, Sheehy, 2013) or by exploiting recent advances in parallelising it (Bauer et al., 2014, Lewis & Morozov, 2015). Another interesting extension would be to tackle classification scenarios with topology-preserving loss terms. This might prove challenging, however, because the goal in classification is to increase class separability, which might be achieved by removing topological structures. This goal is therefore at odds with our loss term, which tries to preserve those structures. We think that such an extension might require restricting the method to a subset of scales (namely those that do not impede class separability) to be preserved in the data.

ACKNOWLEDGEMENTS

The authors wish to thank Christian Bock for fruitful discussions and valuable feedback.

This project was supported by the grant #2017-110 of the Strategic Focal Area “Personalized Health and Related Technologies (PHRT)” of the ETH Domain for the SPHN/PHRT Driver Project “Personalized Swiss Sepsis Study” and the SNSF Starting Grant “Significant Pattern Mining” (K.B., grant no. 155913). Moreover, this work was funded in part by the Alfried Krupp Prize for Young University Teachers of the Alfried Krupp von Bohlen und Halbach-Stiftung (K.B.).


References

Adams, H., Emerson, T., Kirby, M., Neville, R., Peterson, C., Shipman, P., Chepushtanova, S., Hanson, E., Motta, F., and Ziegelmeier, L. Persistence images: A stable vector representation of persistent homology. Journal of Machine Learning Research, 18(1):218–252, 2017.

Barannikov, S. A. The framed Morse complex and its invariants. Advances in Soviet Mathematics, 21:93–115, 1994.

Bauer, U., Kerber, M., and Reininghaus, J. Distributed computation of persistent homology. In McGeoch, C. C. and Meyer, U. (eds.), Proceedings of the Sixteenth Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 31–38. Society for Industrial and Applied Mathematics, 2014.

Bibal, A. and Frénay, B. Measuring quality and interpretability of dimensionality reduction visualizations. Safe Machine Learning Workshop at ICLR, 2019.

Burago, D., Burago, Y., and Ivanov, S. A course in metric geometry, volume 33 of Graduate Studies in Mathematics. American Mathematical Society, 2001.

Carrière, M., Oudot, S. Y., and Ovsjanikov, M. Stable topological signatures for points on 3D shapes. In Proceedings of the Eurographics Symposium on Geometry Processing (SGP), pp. 1–12, Aire-la-Ville, Switzerland, 2015. Eurographics Association.

Carrière, M., Chazal, F., Ike, Y., Lacombe, T., Royer, M., and Umeda, Y. PersLay: A neural network layer for persistence diagrams and new graph topological signatures. arXiv e-prints, art. arXiv:1904.09378, 2019.

Chazal, F., Cohen-Steiner, D., Guibas, L. J., Mémoli, F., and Oudot, S. Y. Gromov–Hausdorff stable signatures for shapes using persistence. Computer Graphics Forum, 28(5):1393–1403, 2009.

Chazal, F., Cohen-Steiner, D., and Mérigot, Q. Geometric inference for probability measures. Foundations of Computational Mathematics, 11(6):733–751, 2011.

Chazal, F., de Silva, V., and Oudot, S. Y. Persistence stability for geometric complexes. Geometriæ Dedicata, 173(1):193–214, 2014a.

Chazal, F., Fasy, B. T., Lecci, F., Michel, B., Rinaldo, A., and Wasserman, L. Robust topological inference: Distance to a measure and kernel distance. arXiv e-prints, art. arXiv:1412.7197, 2014b.

Chazal, F., Fasy, B., Lecci, F., Michel, B., Rinaldo, A., and Wasserman, L. Subsampling methods for persistent homology. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning (ICML), volume 37 of Proceedings of Machine Learning Research, pp. 2143–2151. PMLR, 2015a.

Chazal, F., Glisse, M., Labruère, C., and Michel, B. Convergence rates for persistence diagram estimation in topological data analysis. Journal of Machine Learning Research, 16:3603–3635, 2015b.

Chen, C., Ni, X., Bai, Q., and Wang, Y. A topological regularizer for classifiers via persistent homology. In Chaudhuri, K. and Sugiyama, M. (eds.), Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pp. 2573–2582. PMLR, 2019.

Choudhary, A., Kerber, M., and Raghvendra, S. Improved topological approximations by digitization. arXiv e-prints, art. arXiv:1812.04966, 2018.

Cohen-Steiner, D., Edelsbrunner, H., and Harer, J. Stability of persistence diagrams. Discrete & Computational Geometry, 37(1):103–120, 2007.

Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to algorithms. MIT Press, Cambridge, MA, USA, 3rd edition, 2009.

Edelsbrunner, H. and Harer, J. Persistent homology: a survey. In Goodman, J. E., Pach, J., and Pollack, R. (eds.), Surveys on discrete and computational geometry: Twenty years later, number 453 in Contemporary Mathematics, pp. 257–282. American Mathematical Society, Providence, RI, USA, 2008.

Edelsbrunner, H., Letscher, D., and Zomorodian, A. J. Topological persistence and simplification. Discrete & Computational Geometry, 28(4):511–533, 2002.

Gracia, A., González, S., Robles, V., and Menasalvas, E. A methodology to compare dimensionality reduction algorithms in terms of loss of quality. Information Sciences, 270:1–27, 2014.

Guss, W. H. and Salakhutdinov, R. On characterizing the capacity of neural networks using algebraic topology. arXiv e-prints, art. arXiv:1802.04443, 2018.

Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

Hofer, C., Kwitt, R., Niethammer, M., and Uhl, A. Deep learning with topological signatures. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 1633–1643. Curran Associates, Inc., 2017.


Hofer, C., Kwitt, R., Niethammer, M., and Dixit, M. Connectivity-optimized representation learning via persistent homology. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2751–2760. PMLR, 2019a.

Hofer, C. D., Graf, F., Rieck, B., Niethammer, M., and Kwitt, R. Graph filtration learning. arXiv e-prints, art. arXiv:1905.10996, 2019b.

Kerber, M. and Sharathkumar, R. Approximate Cech complexes in low and high dimensions. arXiv e-prints, art. arXiv:1307.3272, 2013.

Khrulkov, V. and Oseledets, I. Geometry score: A method for comparing generative adversarial networks. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2621–2629. PMLR, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv e-prints, art. arXiv:1412.6980, 2014.

Kurlin, V. A one-dimensional homologically persistent skeleton of an unstructured point cloud in any metric space. Computer Graphics Forum, 34(5):253–262, 2015.

Lee, A. B., Pedersen, K. S., and Mumford, D. The nonlinear statistics of high-contrast patches in natural images. International Journal of Computer Vision, 54(1–3):83–103, 2003.

Lee, J. A. and Verleysen, M. Quality assessment of dimensionality reduction: Rank-based criteria. Neurocomputing, 72(7):1431–1443, 2009.

Lewis, R. and Morozov, D. Parallel computation of persistent homology using the blowup complex. In Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 323–331. ACM, 2015.

McInnes, L., Healy, J., and Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv e-prints, art. arXiv:1802.03426, 2018.

Mémoli, F. and Sapiro, G. Comparing point clouds. In Proceedings of the Eurographics/ACM SIGGRAPH Symposium on Geometry Processing (SGP), pp. 32–40, New York, NY, USA, 2004. Association for Computing Machinery.

Paul, R. and Chalup, S. K. A study on validating non-linear dimensionality reduction using persistent homology. Pattern Recognition Letters, 100:160–166, 2017.

Peyré, G. Manifold models for signals and images. Computer Vision and Image Understanding, 113(2):249–260, 2009.

Poulenard, A., Skraba, P., and Ovsjanikov, M. Topological function optimization for continuous shape matching. Computer Graphics Forum, 37(5):13–25, 2018.

Ramamurthy, K. N., Varshney, K., and Mody, K. Topological data analysis of decision boundaries with application to model selection. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5351–5360. PMLR, 2019.

Reininghaus, J., Huber, S., Bauer, U., and Kwitt, R. A stable multi-scale kernel for topological machine learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4741–4748, 2015.

Rieck, B. and Leitte, H. Persistent homology for the evaluation of dimensionality reduction schemes. Computer Graphics Forum, 34(3):431–440, 2015.

Rieck, B. and Leitte, H. Agreement analysis of quality measures for dimensionality reduction. In Carr, H., Garth, C., and Weinkauf, T. (eds.), Topological Methods in Data Analysis and Visualization IV. Springer, Cham, Switzerland, 2017.

Rieck, B., Bock, C., and Borgwardt, K. A persistent Weisfeiler–Lehman procedure for graph classification. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5448–5458. PMLR, 2019a.

Rieck, B., Togninalli, M., Bock, C., Moor, M., Horn, M., Gumbsch, T., and Borgwardt, K. Neural persistence: a complexity measure for deep neural networks using algebraic topology. In International Conference on Learning Representations (ICLR), 2019b.

scikit-optimize contributers, T. scikit-optimize/scikit-optimize: v0.5.2, March 2018.

Sheehy, D. R. Linear-size approximations to the Vietoris–Rips filtration. Discrete & Computational Geometry, 49(4):778–796, 2013.

Tenenbaum, J. B., De Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

van der Maaten, L. J. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.


van der Maaten, L. J., Postma, E. O., and van den Herik, H. J. Dimensionality reduction: A comparative review. Technical Report 2009-005, Tilburg University, 2009.

Venna, J. and Kaski, S. Visualizing gene interaction graphs with local multidimensional scaling. In Proceedings of the 14th European Symposium on Artificial Neural Networks, pp. 557–562. d-side publishing, 2006.

Vietoris, L. Über den höheren Zusammenhang kompakter Räume und eine Klasse von zusammenhangstreuen Abbildungen. Mathematische Annalen, 97(1):454–472, 1927.

Wagner, H. and Dłotko, P. Towards topological analysis of high-dimensional feature spaces. Computer Vision and Image Understanding, 121:21–26, 2014.

Yan, L., Zhao, Y., Rosen, P., Scheidegger, C., and Wang, B. Homology-preserving dimensionality reduction via manifold landmarking and tearing. arXiv e-prints, art. arXiv:1806.08460, 2018.

Zomorodian, A. J. Fast construction of the Vietoris–Rips complex. Computers & Graphics, 34(3):263–271, 2010.


A. Appendix

A.1. Persistent Homology Calculation Details

This section provides more details about the persistent homology calculation; it is geared towards an expert reader and aims for a concise description of all required concepts.

Simplicial homology To understand persistent homology, we first have to understand simplicial homology. Given a simplicial complex $K$, i.e. a high-dimensional generalisation of a graph, let $C_d(K)$ denote the vector space over $\mathbb{Z}_2$ generated by the $d$-simplices in $K$ (see footnote 7). For $\sigma = (v_0, \dots, v_d) \in K$, let $\partial_d \colon C_d(K) \to C_{d-1}(K)$ be the boundary homomorphism defined by

$$\partial_d(\sigma) := \sum_{i=0}^{d} (v_0, \dots, v_{i-1}, v_{i+1}, \dots, v_d) \tag{11}$$

for a single simplex and linearly extended to $C_d(K)$. The $d$th homology group $H_d(K)$ of $K$ is defined as the quotient group $H_d(K) := \ker \partial_d / \operatorname{im} \partial_{d+1}$. The rank of the $d$th homology group is known as the $d$th Betti number $\beta_d$, i.e. $\beta_d(K) := \operatorname{rank} H_d(K)$. The sequence of Betti numbers $\beta_0, \dots, \beta_d$ of a $d$-dimensional simplicial complex is commonly used to distinguish between different manifolds. For example, a 2-sphere in $\mathbb{R}^3$ has Betti numbers (1, 0, 1), while a 2-torus in $\mathbb{R}^3$ has Betti numbers (1, 2, 1). Betti numbers are of limited use for analysing real-world data sets, however, because their representation is too coarse and easily affected by small changes in the underlying simplicial complex. In an idealised, platonic setting, this does not pose a problem, because one assumes that the triangulation of a manifold is known a priori; for real-world data sets, however, we are typically dealing with point clouds and have no knowledge of the underlying manifold, making the calculation of the “proper” simplicial complex nigh impossible. These disadvantages prompted the development of persistent homology.
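As an illustration of the definitions above, the following minimal sketch computes Betti numbers over Z2 from boundary matrices, using β_d = (n_d − rank ∂_d) − rank ∂_{d+1}, where n_d is the number of d-simplices. The function names (`rank_gf2`, `boundary_matrix`, `betti_numbers`) and the input layout are our own illustrative choices and are not taken from the paper's implementation. The example complex is the boundary of a tetrahedron, a triangulation of the 2-sphere.

```python
import numpy as np
from itertools import combinations


def rank_gf2(matrix):
    """Rank of a 0/1 matrix over Z_2, via Gaussian elimination with XOR."""
    m = np.array(matrix, dtype=np.uint8) % 2
    n_rows, n_cols = m.shape
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot_rows = np.nonzero(m[rank:, col])[0]
        if pivot_rows.size == 0:
            continue
        pivot = rank + int(pivot_rows[0])
        m[[rank, pivot]] = m[[pivot, rank]]      # move the pivot row up
        for r in range(n_rows):
            if r != rank and m[r, col]:
                m[r] ^= m[rank]                  # eliminate the column entry
        rank += 1
    return rank


def boundary_matrix(faces, simplices):
    """Matrix of the boundary map: columns are d-simplices, rows (d-1)-simplices."""
    index = {face: i for i, face in enumerate(faces)}
    mat = np.zeros((len(faces), len(simplices)), dtype=np.uint8)
    for j, simplex in enumerate(simplices):
        for i in range(len(simplex)):
            face = simplex[:i] + simplex[i + 1:]  # drop the i-th vertex (Eq. 11)
            mat[index[face], j] = 1
    return mat


def betti_numbers(simplices_by_dim):
    """simplices_by_dim[d] lists the d-simplices as sorted vertex tuples."""
    top = len(simplices_by_dim) - 1
    ranks = [0]                                   # the boundary of a vertex is zero
    for d in range(1, top + 1):
        ranks.append(rank_gf2(boundary_matrix(simplices_by_dim[d - 1],
                                              simplices_by_dim[d])))
    ranks.append(0)                               # there are no (top + 1)-simplices
    return [len(simplices_by_dim[d]) - ranks[d] - ranks[d + 1]
            for d in range(top + 1)]


# Boundary of a tetrahedron: 4 vertices, 6 edges, 4 triangles.
vertices = [(v,) for v in range(4)]
edges = list(combinations(range(4), 2))
triangles = list(combinations(range(4), 3))
sphere_betti = betti_numbers([vertices, edges, triangles])
```

For this complex, `sphere_betti` agrees with the Betti numbers (1, 0, 1) of the 2-sphere quoted above.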

Persistent homology Let $\emptyset = K_0 \subseteq K_1 \subseteq \dots \subseteq K_{m-1} \subseteq K_m = K$ be a nested sequence of simplicial complexes, called a filtration. Filtrations can be defined based on different functions; the Vietoris–Rips filtration that we discuss in the paper, for example, is defined by a distance function, such as the Euclidean distance between points of a point cloud. Notice that we may still calculate the simplicial homology of each $K_i$ in the filtration. The filtration provides more information, though: the family of boundary operators $\partial_{(\cdot)}$, together with the inclusion homomorphism, induces a homomorphism between corresponding homology groups of the filtration, i.e. $f_d^{i,j} \colon H_d(K_i) \to H_d(K_j)$. This homomorphism yields a sequence of homology groups

$$0 = H_d(K_0) \xrightarrow{f_d^{0,1}} H_d(K_1) \xrightarrow{f_d^{1,2}} \dots \xrightarrow{f_d^{m-2,m-1}} H_d(K_{m-1}) \xrightarrow{f_d^{m-1,m}} H_d(K_m) = H_d(K)$$

for every dimension $d$. Given indices $i \le j$, the $d$th persistent homology group is defined as

$$H_d^{i,j} := \ker \partial_d(K_i) / \left( \operatorname{im} \partial_{d+1}(K_j) \cap \ker \partial_d(K_i) \right). \tag{12}$$

It can be seen as the homology group that contains all homology classes created in $K_i$ that are still present (“active”) in $K_j$. We define the $d$th persistent Betti number to be the rank of this group, i.e. $\beta_d^{i,j} := \operatorname{rank} H_d^{i,j}$, which generalises the previous definition for simplicial homology. Persistent homology results in a sequence of Betti numbers, instead of a single number, that permits a fine-grained description of topological activity. This activity is typically summarised in a persistence diagram, thus replacing the indices $i, j$ with real numbers based on the function that was used to calculate the filtration.

Footnote 7: It is also possible to describe this calculation with coefficients in other fields, but the case of $\mathbb{Z}_2$ is advantageous because it simplifies the implementation of all operations.

Persistence diagrams A filtration often has associated values (or weights) $w_0 \le w_1 \le \dots \le w_{m-1} \le w_m$, such as the pairwise distances in a point cloud. These values permit the calculation of topological feature descriptors known as persistence diagrams: for each dimension $d$ and each pair $i \le j$, one stores a pair $(a, b) := (w_i, w_j) \in \mathbb{R}^2$ with multiplicity

$$\mu_{i,j}^{(d)} := \left( \beta_d^{i,j-1} - \beta_d^{i,j} \right) - \left( \beta_d^{i-1,j-1} - \beta_d^{i-1,j} \right) \tag{13}$$

in a multiset (typically, $\mu_{i,j}^{(d)} = 0$ for many pairs). The pair $(a, b)$ represents a topological feature that was created at a certain threshold $a$, and destroyed at another threshold $b$. In the case of the Vietoris–Rips filtration and connected components, we have $a = 0$ because all connected components are present at the beginning of the filtration by definition. Similarly, $b$ will correspond to an edge used in the minimum spanning tree of the data set. In general, the resulting set of points is called the $d$th persistence diagram $\mathcal{D}_d$. Given a point $(a, b) \in \mathcal{D}_d$, the quantity $\operatorname{pers}(a, b) := |a - b|$ is referred to as its persistence.
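The connection between connected components and the minimum spanning tree can be made concrete with a short sketch. The code below is an illustrative stand-in (the name `zero_dim_persistence` is ours, not the paper's): it runs Kruskal's algorithm with a union-find structure over all pairwise Euclidean distances and records a pair (0, b) whenever an edge of length b merges two components. Note that the single component that never merges is omitted here, which is a simplifying assumption of this sketch.

```python
import numpy as np


def zero_dim_persistence(points):
    """Return 0-dimensional persistence pairs (0, b) and the MST edges creating them."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

    # All edges (i < j), sorted by length: the order in which they enter
    # the Vietoris-Rips filtration.
    iu, ju = np.triu_indices(n, k=1)
    order = np.argsort(dist[iu, ju], kind="stable")

    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pairs, edges = [], []
    for idx in order:
        i, j = int(iu[idx]), int(ju[idx])
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # The edge merges two components, so one of them is destroyed at
            # threshold b = dist[i, j]; the edge belongs to the MST.
            parent[root_i] = root_j
            pairs.append((0.0, float(dist[i, j])))
            edges.append((i, j))
    return pairs, edges


pairs, edges = zero_dim_persistence([[0.0], [1.0], [3.0]])
```

For the three collinear points in the example, the two merging edges have lengths 1 and 2, so the persistence of each pair equals its destruction threshold b.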

A.2. Proof of Theorem 1

Theorem 1. Let $X$ be a point cloud of cardinality $n$ and $X^{(m)}$ be one subsample of $X$ of cardinality $m$, i.e. $X^{(m)} \subseteq X$, sampled without replacement. We can bound the probability of $X^{(m)}$ exceeding a threshold in terms of the bottleneck distance as

$$\mathbb{P}\left( d_b\left( \mathcal{D}^{X}, \mathcal{D}^{X^{(m)}} \right) > \epsilon \right) \le \mathbb{P}\left( d_H\left( X, X^{(m)} \right) > 2\epsilon \right), \tag{14}$$

where $d_H$ refers to the Hausdorff distance between the point cloud and its subsample, i.e.

$$d_H(X, Y) := \max\left\{ \sup_{x \in X} \inf_{y \in Y} \operatorname{dist}(x, y),\ \sup_{y \in Y} \inf_{x \in X} \operatorname{dist}(x, y) \right\} \tag{15}$$

for a baseline distance $\operatorname{dist}(x, y)$ such as the Euclidean distance.

Figure A.1. Empirical convergence rate (mean) of the Hausdorff distance for a subsample of size $m$ of 100 points in a $d$-dimensional space, following a standard normal distribution. Horizontal axis: $m$ (0 to 100); vertical axis: $d_H(X, X^{(m)})$; curves for $d = 2$, $d = 5$, and $d = 10$.

Proof. The stability of persistent homology calculations was proved by Chazal et al. (2014a) for finite metric spaces. More precisely, given two metric spaces $\mathcal{X}$ and $\mathcal{Y}$, we have

$$d_b\left( \mathcal{D}^{\mathcal{X}}, \mathcal{D}^{\mathcal{Y}} \right) \le 2\, d_{GH}(\mathcal{X}, \mathcal{Y}), \tag{16}$$

where $d_{GH}(\mathcal{X}, \mathcal{Y})$ refers to the Gromov–Hausdorff distance (Burago et al., 2001, p. 254) of the two spaces. It is defined as the infimum Hausdorff distance over all isometric embeddings of $\mathcal{X}$ and $\mathcal{Y}$. This distance can be employed for shape comparison (Chazal et al., 2009, Mémoli & Sapiro, 2004), but is hard to compute. In our case, with $\mathcal{X} = X$ and $\mathcal{Y} = X^{(m)}$, we consider both spaces to have the same metric (for $\mathcal{Y}$, we take the canonical restriction of the metric from $\mathcal{X}$ to the subspace $\mathcal{Y}$). By definition of the Gromov–Hausdorff distance, we thus have $d_{GH}(\mathcal{X}, \mathcal{Y}) \le d_H(\mathcal{X}, \mathcal{Y})$, so Eq. 16 leads to

$$d_b\left( \mathcal{D}^{\mathcal{X}}, \mathcal{D}^{\mathcal{Y}} \right) \le 2\, d_H(\mathcal{X}, \mathcal{Y}), \tag{17}$$

from which the original claim from Eq. 14 follows by taking probabilities on both sides.

A.3. Empirical Convergence Rates of $d_H(X, X^{(m)})$

Figure A.1 depicts the mean convergence rate of the Hausdorff distance for a subsample of size $m$ of 100 points in a $d$-dimensional space, following a standard normal distribution. We can see that the convergence rate is roughly similar across dimensions, but at different absolute levels that depend on the ambient dimension. While bounding the convergence rate of this expression is feasible (Chazal et al., 2015a,b), it requires more involved assumptions on the measures from which $X$ and $X^{(m)}$ are sampled. Additionally, we can give a simple bound using the diameter $\operatorname{diam}(X) := \sup\{\operatorname{dist}(x, y) \mid x, y \in X\}$. We have $d_H(X, X^{(m)}) \le \operatorname{diam}(X)$ because the supremum is guaranteed to be an upper bound for the Hausdorff distance. This worst-case bound does not account for the sample size (or mini-batch size) $m$, though (see Theorem 2 for an expression that takes $m$ into account).

A.4. Proof of Theorem 2

Prior to the proof we state two observations that arise from our special setting of dealing with finite point clouds.

Observation 1. Since $X^{(m)} \subseteq X$, we have $\sup_{x' \in X^{(m)}} \inf_{x \in X} \operatorname{dist}(x, x') = 0$. Hence, the Hausdorff distance simplifies to:

$$d_H\left( X, X^{(m)} \right) := \sup_{x \in X} \inf_{x' \in X^{(m)}} \operatorname{dist}(x, x') \tag{18}$$

In other words, we only have to consider a “one-sided” expression of the distance because the distance from the subsample to the original point cloud is always zero.

Observation 2. Since our point clouds of interest are finite sets, the suprema and infima of the Hausdorff distance coincide with the maxima and minima, which we will subsequently use for easier readability.

Hence, the computation of $d_H(X, X^{(m)})$ can be divided into three steps.

1. Using the baseline distance $\operatorname{dist}(\cdot, \cdot)$, we compute a distance matrix $A \in \mathbb{R}^{n \times m}$ between all points in $X$ and $X^{(m)}$.

2. For each of the $n$ points in $X$, we compute the minimal distance to the $m$ samples of $X^{(m)}$ by extracting the minimal distance per row of $A$ and gather all minimal distances in $\delta \in \mathbb{R}^n$.

3. Finally, we return the maximal entry of $\delta$ as $d_H(X, X^{(m)})$.
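The three steps above translate almost directly into array operations. The following is a minimal sketch, assuming the Euclidean distance as the baseline distance; the function name `hausdorff_to_subsample` and the toy data are illustrative assumptions rather than part of the original experiments.

```python
import numpy as np


def hausdorff_to_subsample(X, subsample_idx):
    """One-sided Hausdorff distance between X and the subsample X[subsample_idx]."""
    X = np.asarray(X, dtype=float)
    S = X[np.asarray(subsample_idx)]
    # Step 1: distance matrix A of shape (n, m) between X and the subsample.
    A = np.linalg.norm(X[:, None, :] - S[None, :, :], axis=-1)
    # Step 2: minimal distance per row, gathered in delta (length n).
    delta = A.min(axis=1)
    # Step 3: the maximal entry of delta is the Hausdorff distance.
    return float(delta.max())


# Toy example: 100 points from a 2D standard normal, subsample of size 20.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((100, 2))
idx_demo = rng.choice(100, size=20, replace=False)
d_h_demo = hausdorff_to_subsample(X_demo, idx_demo)
```

Rows of A that belong to subsampled points have a minimum of zero, which is why only the remaining n − m rows can determine the result.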

In the subsequent proof, we require an independence assumption of the samples.

Proof. Using Observations 1 and 2 we obtain a simplified expression for the Hausdorff distance, i.e.

$$d_H\left( X, X^{(m)} \right) := \max_{1 \le i \le n} \left( \min_{1 \le j \le m} a_{ij} \right). \tag{19}$$

The minimal distances of the first $m$ rows of $A$ are trivially 0. Hence, the outer maximum is determined by the remaining $n - m$ row minima $\{\delta_i \mid m < i \le n\}$ with $\delta_i = \min_{1 \le j \le m} a_{ij}$.

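Those minima follow the distribution $F_\Delta(y)$ with

$$
\begin{aligned}
F_\Delta(y) &= \mathbb{P}(\delta_i \le y) = 1 - \mathbb{P}(\delta_i > y) && (20) \\
&= 1 - \mathbb{P}\left( \min_{1 \le j \le m} a_{ij} > y \right) && (21) \\
&= 1 - \mathbb{P}\left( \bigcap_{j} a_{ij} > y \right) && (22) \\
&= 1 - (1 - F_D(y))^m && (23) \\
&= -\sum_{k=1}^{m} \binom{m}{k} (-F_D(y))^{k}. && (24)
\end{aligned}
$$

The last step expands $(1 - F_D(y))^m$ with the binomial theorem; the $k = 0$ term cancels the leading 1.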

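Next, we consider $Z := \max_{1 \le i \le n} \delta_i$. To evaluate the density of $Z$, we first need to derive its distribution $F_Z$:

$$
\begin{aligned}
F_Z(z) &= \mathbb{P}(Z \le z) = \mathbb{P}\left( \max_{m < i \le n} \delta_i \le z \right) && (25) \\
&= \mathbb{P}\left( \bigcap_{m < i \le n} \delta_i \le z \right) && (26)
\end{aligned}
$$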

Next, we approximate $Z$ by $Z'$ by imposing i.i.d. sampling of the minimal distances $\delta_i$ from $F_\Delta$. This is an approximation because in practice, the rows $m + 1$ to $n$ are not stochastically independent because of the triangle inequality that holds for metrics. However, assuming i.i.d., we arrive at

$$F_{Z'}(z) = F_\Delta(z)^{n-m}. \tag{27}$$

Since $Z'$ has positive support, its expectation can then be evaluated as:

$$
\begin{aligned}
\mathbb{E}_{Z' \sim F_{Z'}}[Z'] &= \int_0^{+\infty} \left( 1 - F_{Z'}(z) \right) \mathrm{d}z && (28) \\
&= \int_0^{+\infty} \left( 1 - F_\Delta(z)^{n-m} \right) \mathrm{d}z && (29)
\end{aligned}
$$

The independence assumption leading to $Z'$ results in overestimating the variance of the drawn minima $\delta_i$. Thus, the expected maximum of those minima, $\mathbb{E}[Z']$, is overestimating the actual expectation of the maximum $\mathbb{E}[Z]$, which is why Eq. 28 constitutes an upper bound of $\mathbb{E}[Z]$, and equivalently, an upper bound of $\mathbb{E}\left[ d_H(X, X^{(m)}) \right]$.
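To make the expression in Eq. 29 tangible, the sketch below evaluates it numerically. It assumes that $F_D$ is approximated by the empirical distribution function of the pairwise distances of a given point cloud and that the integral is approximated with the trapezoidal rule on a uniform grid; both choices, as well as the function name `expected_hausdorff_bound`, are ours and not part of the paper.

```python
import numpy as np


def expected_hausdorff_bound(pairwise_distances, n, m, grid_size=2000):
    """Numerically evaluate Eq. 29 using an empirical estimate of F_D."""
    d = np.sort(np.asarray(pairwise_distances, dtype=float))
    # Beyond the largest observed distance the integrand vanishes, so the
    # integration range can be truncated there.
    z = np.linspace(0.0, d[-1], grid_size)
    F_D = np.searchsorted(d, z, side="right") / d.size   # empirical CDF
    F_delta = 1.0 - (1.0 - F_D) ** m                      # Eq. 23
    integrand = 1.0 - F_delta ** (n - m)                  # integrand of Eq. 29
    # Trapezoidal rule, written out explicitly.
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(z)))


# Toy example: 100 points from a 2D standard normal distribution.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((100, 2))
D_full = np.linalg.norm(X_demo[:, None, :] - X_demo[None, :, :], axis=-1)
dists = D_full[np.triu_indices(100, k=1)]

bound_m20 = expected_hausdorff_bound(dists, n=100, m=20)
bound_m50 = expected_hausdorff_bound(dists, n=100, m=50)
```

In this sketch the value shrinks as m grows, since a larger m both raises $F_\Delta$ pointwise and lowers the exponent n − m.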

A.5. Synthetic Data Set

SPHERES consists of eleven high-dimensional 100-spheres living in 101-dimensional space. Ten spheres of radius $r = 5$ are each shifted in a random direction (by adding the same Gaussian noise vector per sphere). To this end, we draw ten $d$-dimensional Gaussian vectors following $\mathcal{N}(0, I(10/\sqrt{d}))$ for $d = 101$. Crucially, to add interesting topological information to the data set, the ten spheres are enclosed by an additional larger sphere of radius $5r$. The spheres were generated using the library scikit-tda.
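A data set with this structure can be sketched with plain NumPy. This is not the scikit-tda routine used for the paper; it is a hedged re-creation under several assumptions that the text does not fix: the number of points per sphere (`n_per_sphere`), the reading of $10/\sqrt{d}$ as a variance, and the enclosing sphere being centred at the origin.

```python
import numpy as np


def sample_sphere(n_points, dim, radius, rng):
    """Uniform samples from a dim-sphere of the given radius in R^(dim + 1)."""
    v = rng.standard_normal((n_points, dim + 1))
    return radius * v / np.linalg.norm(v, axis=1, keepdims=True)


def make_spheres(n_per_sphere=50, dim=100, n_inner=10, r=5.0, seed=0):
    """Ten shifted small spheres plus one enclosing sphere of radius 5r."""
    rng = np.random.default_rng(seed)
    ambient = dim + 1                              # d = 101 in the text
    # Assumption: 10 / sqrt(d) is a variance, so the standard deviation
    # of each coordinate of the shift is its square root.
    scale = np.sqrt(10.0 / np.sqrt(ambient))
    data, labels = [], []
    for k in range(n_inner):
        shift = rng.normal(0.0, scale, size=ambient)   # same shift for the whole sphere
        data.append(sample_sphere(n_per_sphere, dim, r, rng) + shift)
        labels.append(np.full(n_per_sphere, k))
    # Assumption: the enclosing sphere is centred at the origin.
    data.append(sample_sphere(n_per_sphere, dim, 5.0 * r, rng))
    labels.append(np.full(n_per_sphere, n_inner))
    return np.vstack(data), np.concatenate(labels)


spheres_X, spheres_y = make_spheres()
```

The labels are only returned for colouring plots; the methods discussed here are trained without them.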

A.6. Architectures and Hyperparameter Tuning

Architectures for synthetic data set For the synthetically generated data set, we use a simple multilayer perceptron architecture consisting of two hidden layers with 32 neurons each in both the encoder and the decoder, and a bottleneck of two neurons, such that the sequence of hidden-layer neurons is 32 − 32 − 2 − 32 − 32. ReLU non-linearities and batch normalization were applied between the layers, excluding the output layer and the bottleneck layer. The networks were fit using mean squared error loss.

Architectures for real-world data sets For the MNIST, FASHION-MNIST, and CIFAR-10 data sets, we use an architecture inspired by DeepAE (Hinton & Salakhutdinov, 2006). This architecture is composed of three layers of hidden neurons of decreasing size (1000 − 500 − 250) for the encoder part, a bottleneck of two neurons, and a sequence of three layers of hidden neurons of increasing size (250 − 500 − 1000) for the decoder. In contrast to the originally proposed architecture, we applied ReLU non-linearities and batch normalization between the layers, as we observed faster and more stable training. For the final layer, we applied the tanh non-linearity, such that the image of the activation matches the range of input images scaled between −1 and 1. Here, too, we applied mean squared error loss.

All neural network architectures were fit using Adam and a weight decay of $10^{-5}$.

Hyperparameter tuning For hyperparameter tuning, we apply random sampling of hyperparameters using the scikit-optimize library (scikit-optimize contributers, 2018) with 20 calls per method on all data sets. We select the best model parameters in terms of $\mathrm{KL}_{0.1}$ on the validation split and evaluate and report it on the test split. To estimate performance means and standard deviations, we repeated the evaluation on an independent test split 5 times by using the best parameters (as identified in the hyperparameter search on the validation split) and refitting the models by resampling the train-validation split.

Neural networks For the neural networks, we sample the learning rate log-uniformly in the range $[10^{-4}, 10^{-2}]$, the batch size uniformly between $[16, 128]$, and for our topological autoencoder method (TopoAE), we sample the regularisation strength $\lambda$ log-uniformly in the range $[10^{-1}, 3]$. Each model was allowed to train for at most 100 epochs and we applied early stopping with patience = 10 based on the validation loss.
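The sampling scheme for the neural networks can be illustrated as follows. The paper uses scikit-optimize for this step; the sketch below swaps in plain NumPy so that it stays self-contained, and the function name `sample_topoae_config` and the dictionary keys are hypothetical.

```python
import numpy as np


def sample_topoae_config(rng):
    """Draw one hyperparameter configuration from the ranges stated above."""

    def log_uniform(low, high):
        # Uniform in log-space, then mapped back with exp.
        return float(np.exp(rng.uniform(np.log(low), np.log(high))))

    return {
        "learning_rate": log_uniform(1e-4, 1e-2),
        "batch_size": int(rng.integers(16, 129)),   # upper bound is exclusive
        "lam": log_uniform(1e-1, 3.0),              # regularisation strength
    }


rng = np.random.default_rng(0)
configs = [sample_topoae_config(rng) for _ in range(20)]   # 20 calls per method
```

Each sampled configuration would then be trained for at most 100 epochs with early stopping, and the configuration with the best validation $\mathrm{KL}_{0.1}$ would be kept.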

Competitor methods For t-SNE, we sample the perplexity uniformly in the range 5 to 50 and the learning rate log-uniformly in the range 10 to 1000. For Isomap and UMAP, the number of neighbors included in the computation was varied between 15 and 500. For UMAP, we additionally vary the min_dist parameter uniformly between 0 and 1.

A.7. Measuring the Quality of Latent Representations

In addition to the reconstruction error (if available; please see the paper for a discussion on this), we use a variety of non-linear dimensionality reduction (NLDR) metrics to assess the quality of our method. Our primary interest concerns the quality of the latent space because, among other uses, it can be used to visualise the data set. We initially considered classical quality metrics from NLDR algorithms (see Bibal & Frénay (2019), Gracia et al. (2014), van der Maaten et al. (2009) for more in-depth descriptions), namely

(1) the root mean square error (ℓ-RMSE) between the distance matrix of the original space and the latent space (as mentioned in the main text, this is not related to the reconstruction error),

(2) the mean relative rank error (ℓ-MRRE), which measures the changes in ranks of distances in the original space and the latent space (Lee & Verleysen, 2009),

(3) the trustworthiness (ℓ-Trust) measure (Venna & Kaski, 2006), which checks to what extent the k nearest neighbours of a point are preserved when going from the original space to the latent space, and

(4) the continuity (ℓ-Cont) measure (Venna & Kaski, 2006), which is defined analogously to ℓ-Trust, but checks to what extent neighbours are preserved when going from the latent space to the original space.
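Two of the measures listed above can be sketched in a few lines. The code is illustrative only and makes assumptions that the text leaves open: the RMSE is taken over all entries of the two Euclidean distance matrices, and the trustworthiness follows the commonly used rank-based formula of Venna & Kaski, which requires k < n/2. The function names are ours.

```python
import numpy as np


def pairwise(P):
    """Euclidean distance matrix of a point cloud."""
    P = np.asarray(P, dtype=float)
    return np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)


def distance_rmse(X, Z):
    """Root mean square error between the distance matrices of X and Z."""
    return float(np.sqrt(np.mean((pairwise(X) - pairwise(Z)) ** 2)))


def trustworthiness(X, Z, k):
    """Penalise latent k-nearest neighbours that are far away in the input space."""
    n = len(X)
    d_X, d_Z = pairwise(X), pairwise(Z)
    np.fill_diagonal(d_X, np.inf)                 # a point is not its own neighbour
    np.fill_diagonal(d_Z, np.inf)
    rows = np.arange(n)[:, None]
    order_X = np.argsort(d_X, axis=1)
    ranks_X = np.empty_like(order_X)
    ranks_X[rows, order_X] = np.arange(1, n + 1)  # rank 1 = nearest neighbour in X
    nn_Z = np.argsort(d_Z, axis=1)[:, :k]         # k nearest neighbours in Z
    excess = ranks_X[rows, nn_Z] - k              # positive only for "intruders"
    penalty = excess[excess > 0].sum()
    return float(1.0 - 2.0 / (n * k * (2 * n - 3 * k - 1)) * penalty)


rng = np.random.default_rng(0)
X_demo = rng.standard_normal((30, 5))
Z_demo = X_demo[:, :2]                            # crude projection as a stand-in embedding
rmse_demo = distance_rmse(X_demo, Z_demo)
trust_demo = trustworthiness(X_demo, Z_demo, k=5)
```

Swapping the roles of X and Z in `trustworthiness` gives a continuity-style measure in the sense of item (4).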

All of these measures are defined based on comparisons of the original space and the latent space; the reconstructed space is not used here. As an additional measure, we calculate the Kullback–Leibler divergence between density distributions of the input space and the latent space. Specifically, for a point cloud $X$ with an associated distance $\operatorname{dist}$, we first use the distance to a measure density estimator (Chazal et al., 2011, 2014b), defined as

$$f_X^\sigma(x) := \sum_{y \in X} \exp\left( -\sigma^{-1} \operatorname{dist}(x, y)^2 \right),$$

where $\sigma \in \mathbb{R}_{>0}$ represents a length scale parameter. For $\operatorname{dist}$, we use the Euclidean distance and normalise it between 0 and 1. Given $\sigma$, we evaluate $\mathrm{KL}_\sigma := \mathrm{KL}\left( f_X^\sigma \,\|\, f_Z^\sigma \right)$, which measures the similarity between the two density distributions. Ideally, we want the two distributions to be similar because this implies that density estimates in a low-dimensional representation are similar to the ones in the original space.
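The density-based measure can be sketched as follows. Two details are assumptions on our part: distances are normalised by dividing by the largest pairwise distance, and each density vector is rescaled to sum to one before the Kullback–Leibler divergence is evaluated over the sample points. The names `density` and `kl_sigma` are illustrative.

```python
import numpy as np


def pairwise(P):
    """Euclidean distance matrix of a point cloud."""
    P = np.asarray(P, dtype=float)
    return np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)


def density(dist, sigma):
    """Density estimate f(x) = sum_y exp(-dist(x, y)^2 / sigma) per sample."""
    dist = dist / dist.max()                       # normalise distances to [0, 1]
    f = np.exp(-(dist ** 2) / sigma).sum(axis=1)
    return f / f.sum()                             # assumption: rescale to sum to one


def kl_sigma(X, Z, sigma=0.1):
    """KL divergence between the density estimates of input X and latent Z."""
    p = density(pairwise(X), sigma)
    q = density(pairwise(Z), sigma)
    return float(np.sum(p * np.log(p / q)))


rng = np.random.default_rng(0)
X_demo = rng.standard_normal((40, 10))
Z_demo = X_demo[:, :2]                             # stand-in for a learned embedding
kl_demo = kl_sigma(X_demo, Z_demo, sigma=0.1)
```

With `sigma=0.1` this corresponds to the $\mathrm{KL}_{0.1}$ objective used in the hyperparameter search; smaller values of σ emphasise local density, larger values emphasise global structure.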

A.8. Assessing the Batch Size

As we used fixed architectures for the hyperparameter search, the batch size remains the main determinant for the runtime of TopoAE. In Figure A.2, we display trends (linear fits) on how loss measures vary with batch size. Additionally, we draw runtime estimates. As we applied early stopping, for better comparability, we approximated the epoch-wise runtime by dividing the execution time of a run by its number of completed epochs. Interestingly, these plots suggest that the runtime grows with decreasing batch size (even though the topological computation is more costly for larger batch sizes!). In these experiments, sticking to 0-dimensional topological features, we conclude that the benefit of using mini-batches for neural network training still dominates the topological computations. The few steep peaks most likely represent outliers (the corresponding runs stopped after few epochs, which is why the effective runtime could be overestimated).

For the loss measures, we see that reconstruction loss tends to decrease with increasing batch size, while our topological loss tends to increase with increasing batch size (despite normalization). The second observation might be due to larger batch sizes enabling more complex data point arrangements and corresponding topologies.

A.9. Extending to Variational Autoencoders

In Figure A.3 we sketch a preliminary experiment, where we apply our topological constraint to variational autoencoders for the SPHERES data set. Here, too, we observe that our constraint helps identify the nesting structure of the enclosing sphere.

A.10. Topological Distance Calculations

To assess the topological fidelity of the resulting latentspaces, we calculate several topological distances betweenthe test data set (full dimensionality) and the latent spacesobtained from each method (two dimensions). More pre-cisely, we calculate (i) the 1st Wasserstein distance (W1),(ii) the 2nd Wasserstein distance (W2), and (iii) the bot-tleneck distance (W∞) between the persistence diagramsobtained from the test data set of the SPHERES data andtheir resulting 2D latent representations. Even thoughour loss function is not optimising this distance, we ob-serve in Table A.1 that the topological distance of ourmethod (“TopoAE”) is always the lowest among all themethods. In particular, it is always smaller than the topo-logical distance of the latent space of the autoencoder archi-tecture; this is true for all distance measures, even thoughW∞, for example, is known to be susceptible to outliers.Said experiment serves as a simple “sanity check” as itdemonstrates that the changes induced by our method are


(a) F-MNIST (b) MNIST (c) CIFAR-10 (d) SPHERES

Figure A.2. A scatterplot of batch sizes verses three measures of interest: Topological Loss, Reconstruction Loss, and KL0.1, our objectivefor the hyperparameter search. Additionally, we draw per-epoch runtime estimates.

(a) VAE (b) TopoVAE

Figure A.3. A depiction of latent spaces obtained for the SPHERES

data set with variational autoencoders (VAEs). Here, VAE repre-sents a standard MLP-based VAE, whereas TopoVAE representsthe same architecture plus our topological constraint.

Method W1 W2 W∞

Isomap 4.32±0.037 0.477±0.0045 0.165±0.00096PCA 4.42±0.053 0.476±0.0046 0.158±0.00108t-SNE 4.38±0.038 0.478±0.0045 0.164±0.00094UMAP 4.47±0.042 0.478±0.0045 0.160±0.00092AE 3.99±0.037 0.469±0.0053 0.154±0.00128

TopoAE 3.73±0.076 0.459±0.0055 0.152±0.00268

Table A.1. Topological distances between the test data set and thecorresponding latent space. We used subsamples of size m = 500and 10 repetitions (obtaining a mean and a standard deviation).

beneficial in that they reduce the topological distance of the latent space to the original data set. For a proper comparison of topological features between the two sets of spaces, a more involved approach would be required, though.
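To make the three distances concrete, the following is a minimal Python sketch, not the evaluation code used for Table A.1. It assumes that a persistence diagram is given as a list of (birth, death) tuples, that two points are compared with the L∞ metric, and that an unmatched point is sent to the diagonal at cost (death − birth)/2. The function name `diagram_distance` is hypothetical, and the optimal matching is found by brute force, so the sketch is only usable for diagrams with a handful of points.

```python
import itertools
import math


def diagram_distance(diag_a, diag_b, p=1.0):
    """Brute-force W_p between two small persistence diagrams.

    Pass p=math.inf to obtain the bottleneck distance instead.
    Each diagram is a list of (birth, death) tuples.
    """
    n, m = len(diag_a), len(diag_b)
    # Pad both sides with diagonal slots so that any point may stay unmatched.
    size = n + m

    def cost(i, j):
        if i < n and j < m:
            # Point to point: L-infinity distance in the (birth, death) plane.
            (b1, d1), (b2, d2) = diag_a[i], diag_b[j]
            return max(abs(b1 - b2), abs(d1 - d2))
        if i < n:
            # Point of the first diagram matched to the diagonal.
            b, d = diag_a[i]
            return (d - b) / 2.0
        if j < m:
            # Point of the second diagram matched to the diagonal.
            b, d = diag_b[j]
            return (d - b) / 2.0
        return 0.0  # diagonal slot matched to diagonal slot

    best = math.inf
    for perm in itertools.permutations(range(size)):
        costs = [cost(i, perm[i]) for i in range(size)]
        if math.isinf(p):
            total = max(costs, default=0.0)
        else:
            total = sum(c ** p for c in costs)
        best = min(best, total)
    return best if math.isinf(p) else best ** (1.0 / p)


# Two toy 0-dimensional diagrams (all births are zero).
diagram_x = [(0.0, 1.0), (0.0, 4.0)]
diagram_z = [(0.0, 2.0)]
for order in (1, 2, math.inf):
    print(order, diagram_distance(diagram_x, diagram_z, p=order))
```

For diagrams of the size used here (subsamples of m = 500 points), enumeration is infeasible; an assignment solver or a dedicated topological data analysis library would be needed instead.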

A.11. Alternative Loss Formulations

Our choice of loss function was motivated by the observation that only aligning the persistence diagrams between mini-batches of X and Z can lead to degenerate or “meaningless” latent spaces. As a simple example (see Figure A.4 for a visualisation), imagine three non-collinear points in the input space and the triangle they are forming. Now assume that the latent space consists of the same triangle (in terms

[Figure: a triangle with vertices A, B, C and pairwise distances d1, d2, d3, shown in (a) X and, with permuted vertex labels, in (b) Z.]

(a) X (b) Z

Figure A.4. An undesirable configuration of the latent space of three non-collinear points, resulting in equal persistence diagrams for X and Z. Pairwise distances are shown as dotted lines. We prevent this by not explicitly minimising the distances between persistence diagrams but by including persistence pairings.

of its side lengths) but with permuted labels. A loss term of the form

$$L' := \left\| A^{X}[\pi^{X}] - A^{Z}[\pi^{Z}] \right\|^{2} \tag{30}$$

only measures the distance between persistence diagrams (which would be zero in this situation) and would not be able to penalise such a configuration.
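The example can be checked numerically. The sketch below is illustrative and rests on several assumptions: it uses a 3-4-5 triangle as the input, obtains the 0-dimensional persistence pairing from the minimum spanning tree of the distance matrix, and takes the pairing-based terms to be ½‖A^X[π^X] − A^Z[π^X]‖² and ½‖A^Z[π^Z] − A^X[π^Z]‖², which is assumed here to match the loss of the main text. All function names are hypothetical.

```python
import math


def distance_matrix(points):
    """Pairwise Euclidean distances of a list of coordinate tuples."""
    return [[math.dist(p, q) for q in points] for p in points]


def persistence_pairing(dist):
    """Edges of the minimum spanning tree (Kruskal), sorted by length.

    For a Vietoris-Rips filtration these are the edges that merge connected
    components, so they serve as the 0-dimensional persistence pairing here.
    """
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    pairing = []
    for _, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            pairing.append((i, j))
    return pairing


def select(dist, pairing):
    """Distances of `dist` at the edges of `pairing`, i.e. A[pi]."""
    return [dist[i][j] for i, j in pairing]


def squared_norm(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Input space X: vertices A, B, C (indices 0, 1, 2) of a 3-4-5 triangle.
X = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
# Latent space Z: the same triangle, but with permuted labels.
Z = [(4.0, 0.0), (0.0, 3.0), (0.0, 0.0)]

A_X, A_Z = distance_matrix(X), distance_matrix(Z)
pi_X, pi_Z = persistence_pairing(A_X), persistence_pairing(A_Z)

# Equation 30: each space uses its own pairing, so only diagrams are compared.
loss_diagram_only = squared_norm(select(A_X, pi_X), select(A_Z, pi_Z))
# Pairing-based terms (assumed form): one pairing is applied to both spaces.
loss_x_to_z = 0.5 * squared_norm(select(A_X, pi_X), select(A_Z, pi_X))
loss_z_to_x = 0.5 * squared_norm(select(A_Z, pi_Z), select(A_X, pi_Z))

print(loss_diagram_only, loss_x_to_z, loss_z_to_x)
```

With these coordinates, L′ evaluates to 0, whereas the two pairing-based terms evaluate to 1.0 and 2.5, so the permuted configuration is penalised once the pairings are taken into account.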


(a) PCA (b) Isomap (c) t-SNE

(d) UMAP (e) AE (f) TopoPCA

(g) TopoAE

Figure A.5. A depiction of all latent spaces obtained for the SPHERES data set. TopoAE used a batch size of 28. This is an enlarged version of the figure shown in Section 5.2.


(a) PCA (b) t-SNE (c) UMAP

(d) AE (e) TopoPCA (f) TopoAE

Figure A.6. Latent representations of the FASHION-MNIST data set. TopoAE used a batch size of 95. This is a larger extension of the figure shown in Section 5.2.




Figure A.7. Latent representations of the MNIST data set. TopoAE used a batch size of 126. This is a larger extension of the figure shown in Section 5.2.




Figure A.8. Latent representations of the CIFAR-10 data set. TopoAE used a batch size of 82. This is a larger extension of the figure shown in Section 5.2.


Data set  Method   KL0.001          KL0.01           KL0.1            KL1              KL10               ℓ-Cont           ℓ-MRRE           ℓ-Trust          ℓ-RMSE            Data MSE
SPHERES   Isomap   0.53095±0.01929  0.18096±0.02547  0.42048±0.00559  0.00881±0.00020  0.000089±0.000002  0.79027±0.00244  0.24573±0.00158  0.67643±0.00323  10.37188±0.22856  –
SPHERES   PCA      0.22445±0.00691  0.33231±0.00552  0.65121±0.00256  0.01530±0.00010  0.000159±0.000001  0.74740±0.00140  0.29402±0.00108  0.62557±0.00066  11.76482±0.01460  0.96103±0.00029
SPHERES   TSNE     0.22794±0.00722  0.15228±0.00805  0.52722±0.03261  0.01271±0.00058  0.000133±0.000006  0.77300±0.00513  0.21740±0.00472  0.67862±0.00474   8.05018±0.11057  –
SPHERES   UMAP     0.24752±0.01917  0.15687±0.00599  0.61326±0.00752  0.01658±0.00028  0.000178±0.000003  0.75153±0.00360  0.24968±0.00094  0.63483±0.00185   9.27009±0.03417  –
SPHERES   Vanilla  0.28432±0.02165  0.56571±0.02864  0.74588±0.04323  0.01664±0.00115  0.000172±0.000013  0.60663±0.01685  0.34918±0.00903  0.58843±0.00475  13.33061±0.05198  0.81545±0.00106
SPHERES   TopoPCA  0.43344±0.01823  0.17837±0.00888  0.39816±0.01178  0.00866±0.00025  0.000087±0.000003  0.77320±0.00135  0.32765±0.00158  0.62260±0.00251  11.91542±0.56134  0.97305±0.00067
SPHERES   TopoAE   0.62765±0.05415  0.08504±0.01270  0.32572±0.02050  0.00694±0.00055  0.000069±0.000006  0.82200±0.01813  0.27239±0.01108  0.65775±0.01428  13.45753±0.04177  0.86812±0.00074
F-MNIST   PCA      0.22559±0.00011  0.35594±0.00011  0.05205±0.00004  0.00069±0.00000  0.000007±0.000000  0.96777±0.00001  0.05744±0.00001  0.91681±0.00003   9.05121±0.00041  0.18439±0.00000
F-MNIST   TSNE     0.03516±0.00226  0.40477±0.01251  0.07095±0.00962  0.00198±0.00026  0.000023±0.000003  0.96731±0.00268  0.01962±0.00073  0.97405±0.00070  41.25460±0.53671  –
F-MNIST   UMAP     0.05069±0.00238  0.42362±0.00609  0.06491±0.00161  0.00163±0.00005  0.000019±0.000001  0.98126±0.00016  0.02867±0.00034  0.95874±0.00060  13.68933±0.02896  –
F-MNIST   Vanilla  0.17177±0.13603  0.47798±0.09567  0.06791±0.00700  0.00125±0.00017  0.000014±0.000002  0.96849±0.00372  0.02562±0.00217  0.97418±0.00119  20.70674±3.56861  0.10197±0.00222
F-MNIST   TopoPCA  0.18857±0.00197  0.36201±0.00186  0.05296±0.00045  0.00083±0.00002  0.000009±0.000000  0.97030±0.00013  0.05584±0.00022  0.91790±0.00034  20.88881±0.29929  0.18315±0.00002
F-MNIST   TopoAE   0.11039±0.02948  0.39204±0.03264  0.05353±0.00959  0.00100±0.00015  0.000011±0.000002  0.97998±0.00194  0.03156±0.00253  0.95612±0.00391  20.49122±0.93206  0.12071±0.00238
MNIST     PCA      0.16754±0.00051  0.38876±0.00146  0.16301±0.00059  0.00160±0.00001  0.000016±0.000000  0.90084±0.00016  0.16582±0.00022  0.74546±0.00048  13.17437±0.00216  0.22269±0.00002
MNIST     TSNE     0.03767±0.00140  0.27695±0.05266  0.13266±0.02362  0.00214±0.00041  0.000024±0.000004  0.92101±0.00288  0.03953±0.00129  0.94624±0.00147  22.89261±0.24373  –
MNIST     UMAP     0.07214±0.00091  0.32063±0.00320  0.14568±0.00207  0.00234±0.00004  0.000027±0.000001  0.93992±0.00066  0.05109±0.00022  0.93770±0.00039  14.61535±0.04332  –
MNIST     Vanilla  0.44690±0.08540  0.61993±0.11742  0.15542±0.02203  0.00156±0.00023  0.000016±0.000002  0.91293±0.00564  0.05828±0.00353  0.93699±0.00262  18.18105±0.21459  0.13732±0.00160
MNIST     TopoPCA  0.16138±0.00716  0.39157±0.01283  0.15556±0.00313  0.00162±0.00007  0.000017±0.000001  0.90301±0.00057  0.16297±0.00049  0.75040±0.00091  17.33353±0.82592  0.22477±0.00009
MNIST     TopoAE   0.32427±0.03312  0.34069±0.03056  0.11012±0.01069  0.00114±0.00010  0.000012±0.000001  0.93210±0.00132  0.05553±0.00044  0.92844±0.00142  19.57784±0.01812  0.13884±0.00066
CIFAR     PCA      0.27320±0.00014  0.59073±0.00004  0.01961±0.00001  0.00023±0.00000  0.000002±0.000000  0.93130±0.00000  0.11921±0.00005  0.82117±0.00002  17.71567±0.00084  0.14816±0.00000
CIFAR     TSNE     0.04451±0.00222  0.62733±0.01427  0.03014±0.00333  0.00073±0.00007  0.000009±0.000001  0.90300±0.00611  0.10265±0.00242  0.86325±0.00151  25.61099±0.11551  –
CIFAR     UMAP     0.06934±0.00202  0.61673±0.00052  0.02562±0.00019  0.00050±0.00001  0.000006±0.000000  0.92045±0.00013  0.12680±0.00028  0.81668±0.00019  33.57785±0.00796  –
CIFAR     Vanilla  0.37737±0.06507  0.66834±0.02992  0.03458±0.00448  0.00062±0.00021  0.000007±0.000002  0.85072±0.00429  0.13204±0.00316  0.86359±0.00442  36.26827±0.56159  0.14030±0.00190
CIFAR     TopoPCA  0.27207±0.00619  0.58401±0.00515  0.01973±0.00040  0.00024±0.00001  0.000003±0.000000  0.92439±0.00115  0.12607±0.00133  0.81551±0.00139  30.73848±1.35231  0.15002±0.00076
CIFAR     TopoAE   0.20877±0.00951  0.55642±0.00412  0.01879±0.00051  0.00031±0.00002  0.000003±0.000000  0.92691±0.00100  0.10809±0.00210  0.84514±0.00359  37.85914±0.03303  0.13975±0.00171

Table A.2. Extended version of the table from the main paper, showing more length scales and variance estimates.
