Topological Data Analysis
Damian A. von Schoenborn
Aug 11, 2015
Abstract
By now, the Big Data revolution is well on its way. Storage capacity has ballooned, and simple queries against these data stores can be executed with relative ease. However, analytic techniques have generally not matured to handle the massive datasets of this new era. This talk will present a set of techniques known collectively as Topological Data Analysis (TDA), where concepts from Topology are applied to classify, visualize, and explore data. TDA shows promise in the era of Big Data.
Agenda
Issues with Big Data analysis
Topology Overview
Computational Topology and Formal TDA
Relaxed TDA
Q&A
Problems in Big Data Analytics
Problems with legacy analytic techniques:
• Run in series, in memory
• Hypothesis-driven
• Limited visualizations
Topology Overview (as relevant here)
Metric Space
• Pair-wise distance between points
• Continuously defined surfaces
Coordinate free
• Orientation doesn’t matter
• Ability to compare sets from different coordinate systems
Small deformations don’t change topology
• Stretching, bending, etc. okay
• Cutting, gluing, etc. not okay
• Less sensitivity to noise [1]
Simplicial Complexes
• Coarse (“compressed”) representations of reality
Intuitively, a topological space is a set of points, each of which knows its neighbors. Formally, a topology on a set X is a collection T ⊆ 2^X such that:
• If S₁, S₂ ∈ T, then S₁ ∩ S₂ ∈ T
• If {Sⱼ | j ∈ J} ⊆ T, then ∪_{j∈J} Sⱼ ∈ T
• ∅, X ∈ T [3]
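For a finite set, the axioms above can be checked mechanically. Here is a minimal sketch (the function name `is_topology` is ours, not from the source); note that for a finite collection, closure under pairwise intersections and unions already implies the full axioms.

```python
from itertools import combinations

def is_topology(X, T):
    """Check whether T (a collection of subsets of the finite set X)
    satisfies the three topology axioms listed above."""
    T = set(map(frozenset, T))
    # Axiom: the empty set and X itself belong to T
    if frozenset() not in T or frozenset(X) not in T:
        return False
    for a, b in combinations(T, 2):
        # Axiom: closed under (pairwise) intersections
        if a & b not in T:
            return False
        # Axiom: closed under unions (pairwise suffices when T is finite)
        if a | b not in T:
            return False
    return True
```

For example, on X = {1, 2, 3}, the collection {∅, {1}, {1, 2}, X} is a topology, while {∅, {1}, {2}, X} is not, since {1} ∪ {2} = {1, 2} is missing.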
Topological Data Analysis
Definition: Given a finite dataset S ⊆ 𝕐 of noisy points sampled from an unknown space 𝕏, topological data analysis recovers the topology of 𝕏, assuming both 𝕏 and 𝕐 are topological spaces.[3]
We want a process that does not require assumptions about manifold structure, smoothness, or lack of curvature.[3]
Formal Combinatorial Representations
Goal
• Construct a combinatorial representation that approximates the underlying space from which the data was sampled [3]
• Many types of these representations (simplicial complexes) have been developed; two of the most popular are the Čech and Vietoris-Rips (VR) complexes
• Both the Čech and VR complexes typically produce simplices in dimensions much higher than the dimension of the space [4]
• The VR complex is less expensive to compute than the corresponding Čech complex, even though the VR complex has more simplices [2]
• The Čech complex is not computed in practice due to its computational complexity [3]
• Currently, the VR complex is one of the few practical methods for topological analysis in high dimensions [3]
Defining the VR Complex
Definition 1 [3]
Given S ⊆ 𝕐 and ε ∈ ℝ, let G_ε = (S, E_ε) be the ε-neighborhood graph on S, where E_ε = { {u, v} | u, v ∈ S, u ≠ v, d(u, v) ≤ ε }
• The VR complex is the clique complex of the ε-neighborhood graph
• A clique is a subset of vertices that induces a complete subgraph; a clique is maximal if it cannot be made any larger
• The clique complex has the maximal cliques of a graph as its maximal simplices
Definition 2 [4]
Let X denote a metric space with metric d. Then the VR complex for X, attached to the parameter ε, is the simplicial complex whose vertex set is X and where {x₀, x₁, …, x_k} spans a k-simplex if and only if d(x_i, x_j) ≤ ε for all 0 ≤ i, j ≤ k
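Definition 2 translates directly into code. The following is a minimal brute-force sketch (the function name `vr_complex` is ours; the enumeration is exponential in `max_dim`, so this is for illustration only, not for real datasets):

```python
from itertools import combinations
from math import dist

def vr_complex(points, eps, max_dim=2):
    """Build the Vietoris-Rips complex up to dimension max_dim.

    Per Definition 2, a set of points spans a simplex iff all of its
    pairwise distances are <= eps. Simplices are returned as tuples
    of point indices."""
    n = len(points)
    close = lambda i, j: dist(points[i], points[j]) <= eps
    simplices = [(i,) for i in range(n)]   # 0-simplices: the vertices
    for k in range(2, max_dim + 2):        # k points span a (k-1)-simplex
        for combo in combinations(range(n), k):
            if all(close(i, j) for i, j in combinations(combo, 2)):
                simplices.append(combo)
    return simplices
```

For example, three mutually close points span a 2-simplex (a filled triangle), while a distant fourth point contributes only its lone vertex.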
Creating the VR Complex
1. Begin with the complete dataset
2. Create ε-balls around each data point
3. Draw an edge connecting each overlapping ε-ball pair
[2]
Describe with Betti numbers:
• b0: # of connected components
• b1: # of 1D holes
• b2: # of 2D holes
Which features are an artifact of the chosen ε, and which represent the underlying structure? Betti numbers alone are insufficient to tell.
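b0 is the easiest Betti number to compute: it is just the number of connected components of the ε-neighborhood graph, which union-find handles in near-linear time. A minimal sketch (function name ours):

```python
def betti_0(n_points, edges):
    """Count connected components (b0) of the epsilon-neighborhood
    graph via union-find. edges is an iterable of (i, j) index pairs."""
    parent = list(range(n_points))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj  # merge the two components
    return len({find(x) for x in range(n_points)})
```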
Persistence
• Features persisting over a large range of ε values are significant
• Features that quickly arise and drop off are noise and can be ignored [2]
Visualizing Persistent Homology
• Persistence graphs [2]
• Barcodes [2][3]
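For b0, the full barcode can be computed with a single-linkage pass: every component is born at ε = 0 and dies when it merges into another as ε grows. A minimal sketch (the function name `b0_barcode` is ours; the one component that never dies corresponds to an infinite bar and is not listed):

```python
from math import dist
from itertools import combinations

def b0_barcode(points):
    """Death values of the b0 (connected-component) bars, via
    single-linkage union-find over edges sorted by length."""
    edges = sorted((dist(p, q), i, j)
                   for (i, p), (j, q) in combinations(enumerate(points), 2))
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)  # a component dies at this scale
    return deaths
```

Long bars (large death values) mark significant components; short bars are noise, per the persistence criterion above.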
Potential Application: Optimizing Model Selection [7]
So where do we stand?
Pros
• Useful when a high-resolution representation is needed
• Surface reconstruction
• Anomaly detection
• Comparing datasets
• Optimizing models: choose models and parameters best suited to handle the type of dataset you’re analyzing
Cons
• Some subjective judgment involved
• Potentially difficult to read
• Not ideal for Big Data:
  • Computationally expensive (ε-balls, pairwise overlap flags, etc. all computed for every ε value in the range) [4]
  • Typically need to sample from data, reducing resolution
Shrinking Data Size
Dimensionality Reduction (Principal Components Analysis, MDS, ISOMAP)
• Retain much of the underlying structure of the data while limiting the number of dimensions needed to describe it [6]
• Drawbacks:
  • Loss of information, missing subtleties
  • Assumes normality
  • Assumes that data is from a flat hyperplane with no curvature [3]
Record Consolidation (Cluster Analysis)
• Discover underlying segments of the data by grouping data points that are most similar [6]
• Drawbacks:
  • Distinct groups with no relationship between them; arbitrary distinctions in continuous data
  • Number of clusters must be specified upfront
  • Often difficult to apply clustering algorithms to very large datasets [4]
With many algorithms in each category, choosing the right one takes experience or luck
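The “specify the number of clusters upfront” drawback is easy to see in code. Below is a minimal pure-Python sketch of Lloyd’s k-means (function name and defaults are ours): k must be fixed before anything is learned from the data, and nothing in the algorithm warns you if k was wrong.

```python
import random
from math import dist

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means (Lloyd's algorithm) sketch for illustration."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k chosen upfront, before seeing structure
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: dist(p, centers[c]))].append(p)
        # recompute each center as the mean of its cluster
        for c, members in enumerate(clusters):
            if members:
                centers[c] = tuple(sum(x) / len(members)
                                   for x in zip(*members))
    return centers, clusters
```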
An alternate approach [6]
Process Overview
A. Discrete sample space
B. Filter function: can be any combination of dimensions in the dataset or derived (calculated) fields
C. Slightly-overlapping bins
D. Simplified representation
[1]
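Steps A-D resemble the Mapper construction of Singh et al. [5]. The following is a heavily simplified sketch (names and defaults are ours; real implementations use a proper clustering algorithm rather than the ε-connected-components stand-in used here):

```python
from math import dist
from itertools import combinations

def mapper(points, filter_fn, n_bins=4, overlap=0.25, eps=1.5):
    """Sketch of the pipeline: filter each point (B), cover the filter
    range with slightly-overlapping bins (C), cluster within each bin,
    and connect clusters that share points (D)."""
    vals = [filter_fn(p) for p in points]
    lo, hi = min(vals), max(vals)
    width = (hi - lo) / n_bins
    nodes = []  # each node is a set of point indices
    for b in range(n_bins):
        # slightly-overlapping bin over the filter range
        start = lo + b * width - overlap * width
        end = lo + (b + 1) * width + overlap * width
        idx = [i for i, v in enumerate(vals) if start <= v <= end]
        # cluster within bin: connected components at scale eps
        unseen = set(idx)
        while unseen:
            comp = {unseen.pop()}
            grew = True
            while grew:
                grew = False
                for j in list(unseen):
                    if any(dist(points[j], points[i]) <= eps for i in comp):
                        comp.add(j); unseen.discard(j); grew = True
            nodes.append(comp)
    # connect nodes that share points, giving the simplified representation
    edges = [(a, b) for a, b in combinations(range(len(nodes)), 2)
             if nodes[a] & nodes[b]]
    return nodes, edges
```

Nodes are the clusters found inside each overlapping bin; an edge appears whenever two clusters share points, yielding the simplified representation of step D.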
Useful filter functions [5]
Field(s) from the data
• Combinations of in-data dimensions (or derivations thereof), typically chosen by domain knowledge
Density
• Use a Gaussian kernel: f_ε(x) = C_ε Σ_y e^(−d(x,y)²/ε)
Eccentricity (data depth)
• Identify points which are far from the center without identifying the actual center
• For 1 ≤ p < ∞, let E_p(x) = (Σ_{y∈X} d(x,y)^p / N)^(1/p)
Eigenvectors of graph Laplacians
• Let L(x,y) = w(x,y) / √(Σ_z w(x,z) · Σ_z w(y,z)), where w(x,y) = k(d(x,y)) for a smoothing kernel k (e.g. Gaussian)
• Eigenvectors of L(x,y) form a set of orthogonal vectors that give interesting geometric information
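As a concrete example, the eccentricity filter E_p defined above translates directly to code (function name ours):

```python
from math import dist

def eccentricity(points, p=2):
    """Eccentricity filter E_p(x) = (sum_y d(x,y)^p / N)^(1/p).
    Points far from the (implicit) center of the data score high."""
    n = len(points)
    return [(sum(dist(x, y) ** p for y in points) / n) ** (1 / p)
            for x in points]
```

An outlier far from the bulk of the data receives the largest eccentricity value, without ever computing an explicit center.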
Application: Gene expression in cancer cells, traditional methods vs. TDA [1]
Benefits
Visual Exploration
• Able to move away from hypothesis-driven analyses [1]
• Visualize entire dataset without making unfounded assumptions
Fungibility
• Process can be applied to a wide variety of data sources
• No predefined format, scaling, etc. needed
• Multiscale representations: useful to have the flexibility of changing the resolution “on the fly” [4]
Integration of favorite machine learning techniques
• Choice of clustering algorithms
• Choice of filter functions
Computation
• Clustering is performed on subsets, which allows for parallelization
Q & A
References
1. Lum, P.Y. et al. Extracting insights from the shape of complex data using topology. Sci. Rep. 3, 1236; DOI: 10.1038/srep01236 (2013)
2. Ghrist, R. Barcodes: The Persistent Topology of Data. Bulletin of the AMS 45.1 pp61-75 S 0273-0979(07)01191-3 (2008)
3. Zomorodian, A. Topological Data Analysis. Proceedings of Symposia in Applied Mathematics. AMS (2011)
4. Carlsson, G. Topology and Data. Bulletin of the AMS 46.2 pp255-308 S 0273-0979(09)01249-X (2009)
5. Singh, G. et al. Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. Eurographics Symposium on Point-Based Graphics (2007)
6. Ayasdi. TDA and Machine Learning: Better Together. (2015)
7. "2.3. Clustering." Scikit-learn 0.15.2 Documentation. Pedregosa et al., Scikit-learn: Machine Learning in Python, JMLR 12, pp. 2825-2830 (2011)