Topological Data Analysis
Damian A. von Schoenborn
Aug 11, 2015
Abstract
By now, the Big Data revolution is well on its way. Storage capacity has ballooned, and simple queries against these data stores can be executed with relative ease. However, analytic techniques have generally not matured to handle the massive datasets of this new era. This talk will present a set of techniques known collectively as Topological Data Analysis (TDA), where concepts from Topology are applied to classify, visualize, and explore data. TDA shows promise in the era of Big Data.
Agenda
Issues with Big Data analysis
Topology Overview
Computational Topology and Formal TDA
Relaxed TDA
Q&A
Problems in Big Data Analytics
Problems with legacy analytic techniques:
• Run in series, in memory
• Hypothesis-driven
• Limited visualizations
Topology Overview (as relevant here)
Metric Space
• Pair-wise distance between points
• Continuously defined surfaces
Coordinate free
• Orientation doesn’t matter
• Ability to compare sets from different coordinate systems
Small deformations don’t change topology
• Stretching, bending, etc. okay
• Cutting, gluing, etc. not okay
• Less sensitivity to noise [1]
Simplicial Complexes
• Coarse (“compressed”) representations of reality
Intuitively, a topological space is a set of points, each of which knows its neighbors. Formally, a topology on a set X is a collection T ⊆ 2^X such that:
• If S₁, S₂ ∈ T, then S₁ ∩ S₂ ∈ T
• If {Sⱼ | j ∈ J} ⊆ T, then ∪_{j∈J} Sⱼ ∈ T
• ∅, X ∈ T [3]
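For a finite set, the axioms above can be checked mechanically. Here is a minimal sketch (the function name `is_topology` is ours, not from the source); note that for a finite collection, closure under pairwise intersections and unions already implies the full axioms.

```python
from itertools import combinations

def is_topology(X, T):
    """Check whether T (a collection of subsets of the finite set X)
    satisfies the three topology axioms listed above."""
    T = set(map(frozenset, T))
    # Axiom: the empty set and X itself belong to T
    if frozenset() not in T or frozenset(X) not in T:
        return False
    for a, b in combinations(T, 2):
        # Axiom: closed under (pairwise) intersections
        if a & b not in T:
            return False
        # Axiom: closed under unions (pairwise suffices when T is finite)
        if a | b not in T:
            return False
    return True
```

For example, on X = {1, 2, 3}, the collection {∅, {1}, {1, 2}, X} is a topology, while {∅, {1}, {2}, X} is not, since {1} ∪ {2} = {1, 2} is missing.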
Topological Data Analysis
Definition: Given a finite dataset S ⊆ 𝕐 of noisy points sampled from an unknown space 𝕏, topological data analysis recovers the topology of 𝕏, assuming both 𝕏 and 𝕐 are topological spaces.[3]
We want a process that does not require assumptions about manifold structure, smoothness, or lack of curvature.[3]
Formal Combinatorial Representations
Goal
• Construct a combinatorial representation that approximates the underlying space from which the data was sampled [3]
• Many types of these representations (simplicial complexes) have been developed; two of the most popular are the Čech and Vietoris-Rips (VR) complexes
• Both the Čech and VR complexes typically produce simplices in dimensions much higher than the dimension of the space [4]
• The VR complex is less expensive to compute than the corresponding Čech complex, even though the VR complex has more simplices [2]
• The Čech complex is not computed in practice due to its computational complexity [3]
• Currently, the VR complex is one of the few practical methods for topological analysis in high dimensions [3]
Defining the VR Complex
Definition 1 [3]
Given S ⊆ 𝕐 and ε ∈ ℝ, let G_ε = (S, E_ε) be the ε-neighborhood graph on S, where E_ε = { {u, v} | u, v ∈ S, u ≠ v, d(u, v) ≤ ε }
• The VR complex is the clique complex of the ε-neighborhood graph
• A clique is a subset of vertices that induces a complete subgraph; a clique is maximal if it cannot be made any larger
• The clique complex has the maximal cliques of a graph as its maximal simplices
Definition 2 [4]
Let X denote a metric space with metric d. Then the VR complex for X, attached to the parameter ε, is the simplicial complex whose vertex set is X and where {x₀, x₁, …, x_k} spans a k-simplex if and only if d(x_i, x_j) ≤ ε for all 0 ≤ i, j ≤ k
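Definition 2 translates directly into code. The following is a minimal brute-force sketch (the function name `vr_complex` is ours; the enumeration is exponential in `max_dim`, so this is for illustration only, not for real datasets):

```python
from itertools import combinations
from math import dist

def vr_complex(points, eps, max_dim=2):
    """Build the Vietoris-Rips complex up to dimension max_dim.

    Per Definition 2, a set of points spans a simplex iff all of its
    pairwise distances are <= eps. Simplices are returned as tuples
    of point indices."""
    n = len(points)
    close = lambda i, j: dist(points[i], points[j]) <= eps
    simplices = [(i,) for i in range(n)]   # 0-simplices: the vertices
    for k in range(2, max_dim + 2):        # k points span a (k-1)-simplex
        for combo in combinations(range(n), k):
            if all(close(i, j) for i, j in combinations(combo, 2)):
                simplices.append(combo)
    return simplices
```

For example, three mutually close points span a 2-simplex (a filled triangle), while a distant fourth point contributes only its lone vertex.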
Creating the VR Complex
1. Begin with the complete dataset
2. Create ε-balls around each data point
3. Draw an edge connecting each overlapping ε-ball pair
[2]
Describe with Betti numbers:
• b0: # of connected components
• b1: # of 1D holes
• b2: # of 2D holes
Which features are an artifact of the chosen ε, and which represent the underlying structure? Betti numbers alone are insufficient to tell.
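b0 is the easiest Betti number to compute: it is just the number of connected components of the ε-neighborhood graph, which union-find handles in near-linear time. A minimal sketch (function name ours):

```python
def betti_0(n_points, edges):
    """Count connected components (b0) of the epsilon-neighborhood
    graph via union-find. edges is an iterable of (i, j) index pairs."""
    parent = list(range(n_points))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj  # merge the two components
    return len({find(x) for x in range(n_points)})
```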
Persistence
• Features persisting over a large range of ε values are significant
• Features that quickly arise and drop off are noise and can be ignored [2]
Visualizing Persistent Homology
• Persistence graphs [2]
• Barcodes [2][3]
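For b0, the full barcode can be computed with a single-linkage pass: every component is born at ε = 0 and dies when it merges into another as ε grows. A minimal sketch (the function name `b0_barcode` is ours; the one component that never dies corresponds to an infinite bar and is not listed):

```python
from math import dist
from itertools import combinations

def b0_barcode(points):
    """Death values of the b0 (connected-component) bars, via
    single-linkage union-find over edges sorted by length."""
    edges = sorted((dist(p, q), i, j)
                   for (i, p), (j, q) in combinations(enumerate(points), 2))
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)  # a component dies at this scale
    return deaths
```

Long bars (large death values) mark significant components; short bars are noise, per the persistence criterion above.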
Potential Application: Optimizing Model Selection [7]
So where do we stand?
Pros
• Useful when a high-resolution representation is needed
• Surface reconstruction
• Anomaly detection
• Comparing datasets
• Optimizing models: choose models and parameters best suited to handle the type of dataset you’re analyzing
Cons
• Some subjective judgment involved
• Potentially difficult to read
• Not ideal for Big Data:
  • Computationally expensive (ε-balls, pairwise overlap flags, etc. all computed for every ε value in the range) [4]
  • Typically need to sample from data, reducing resolution
Shrinking Data Size
Dimensionality Reduction (Principal Components Analysis, MDS, ISOMAP)
• Retain much of the underlying structure of the data while limiting the number of dimensions needed to describe it [6]
• Drawbacks:
  • Loss of information, missing subtleties
  • Assumes normality
  • Assumes that data is from a flat hyperplane with no curvature [3]
Record Consolidation (Cluster Analysis)
• Discover underlying segments of the data by grouping data points that are most similar [6]
• Drawbacks:
  • Distinct groups with no relationship between them; arbitrary distinctions in continuous data
  • Number of clusters must be specified upfront
  • Often difficult to apply clustering algorithms to very large datasets [4]
With many algorithms in each category, choosing the right one takes experience or luck
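The “specify the number of clusters upfront” drawback is easy to see in code. Below is a minimal pure-Python sketch of Lloyd’s k-means (function name and defaults are ours): k must be fixed before anything is learned from the data, and nothing in the algorithm warns you if k was wrong.

```python
import random
from math import dist

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means (Lloyd's algorithm) sketch for illustration."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k chosen upfront, before seeing structure
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: dist(p, centers[c]))].append(p)
        # recompute each center as the mean of its cluster
        for c, members in enumerate(clusters):
            if members:
                centers[c] = tuple(sum(x) / len(members)
                                   for x in zip(*members))
    return centers, clusters
```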
An alternate approach [6]
Process Overview
A. Discrete sample space
B. Filter function: can be any combination of dimensions in the dataset or derived (calculated) fields
C. Slightly-overlapping bins
D. Simplified representation
[1]
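Steps A-D resemble the Mapper construction of Singh et al. [5]. The following is a heavily simplified sketch (names and defaults are ours; real implementations use a proper clustering algorithm rather than the ε-connected-components stand-in used here):

```python
from math import dist
from itertools import combinations

def mapper(points, filter_fn, n_bins=4, overlap=0.25, eps=1.5):
    """Sketch of the pipeline: filter each point (B), cover the filter
    range with slightly-overlapping bins (C), cluster within each bin,
    and connect clusters that share points (D)."""
    vals = [filter_fn(p) for p in points]
    lo, hi = min(vals), max(vals)
    width = (hi - lo) / n_bins
    nodes = []  # each node is a set of point indices
    for b in range(n_bins):
        # slightly-overlapping bin over the filter range
        start = lo + b * width - overlap * width
        end = lo + (b + 1) * width + overlap * width
        idx = [i for i, v in enumerate(vals) if start <= v <= end]
        # cluster within bin: connected components at scale eps
        unseen = set(idx)
        while unseen:
            comp = {unseen.pop()}
            grew = True
            while grew:
                grew = False
                for j in list(unseen):
                    if any(dist(points[j], points[i]) <= eps for i in comp):
                        comp.add(j); unseen.discard(j); grew = True
            nodes.append(comp)
    # connect nodes that share points, giving the simplified representation
    edges = [(a, b) for a, b in combinations(range(len(nodes)), 2)
             if nodes[a] & nodes[b]]
    return nodes, edges
```

Nodes are the clusters found inside each overlapping bin; an edge appears whenever two clusters share points, yielding the simplified representation of step D.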
Useful filter functions [5]
Field(s) from the data
• Combinations of in-data dimensions (or derivations thereof), typically chosen by domain knowledge
Density
• Use a Gaussian kernel: f_ε(x) = C_ε Σ_y e^(−d(x,y)²/ε)
Eccentricity (data depth)
• Identify points which are far from the center without identifying the actual center
• For 1 ≤ p < ∞, let E_p(x) = (Σ_{y∈X} d(x,y)^p / N)^(1/p)
Eigenvectors of graph Laplacians
• Let L(x,y) = w(x,y) / √(Σ_z w(x,z) · Σ_z w(y,z)), where w(x,y) = k(d(x,y)) for a smoothing kernel k (e.g. Gaussian)
• Eigenvectors of L(x,y) form a set of orthogonal vectors that give interesting geometric information
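As a concrete example, the eccentricity filter E_p defined above translates directly to code (function name ours):

```python
from math import dist

def eccentricity(points, p=2):
    """Eccentricity filter E_p(x) = (sum_y d(x,y)^p / N)^(1/p).
    Points far from the (implicit) center of the data score high."""
    n = len(points)
    return [(sum(dist(x, y) ** p for y in points) / n) ** (1 / p)
            for x in points]
```

An outlier far from the bulk of the data receives the largest eccentricity value, without ever computing an explicit center.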
Application: Gene expression in cancer cells, traditional methods vs. TDA [1]
Benefits
Visual Exploration
• Able to move away from hypothesis-driven analyses [1]
• Visualize entire dataset without making unfounded assumptions
Fungibility
• Process can be applied to a wide variety of data sources
• No predefined format, scaling, etc. needed
• Multiscale representations: useful to have the flexibility of changing the resolution “on the fly” [4]
Integration of favorite machine learning techniques
• Choice of clustering algorithms
• Choice of filter functions
Computation
• Clustering is performed on subsets, which allows for parallelization
Q & A
References
1. Lum, P.Y. et al. Extracting insights from the shape of complex data using topology. Sci. Rep. 3, 1236; DOI: 10.1038/srep01236 (2013)
2. Ghrist, R. Barcodes: The Persistent Topology of Data. Bulletin of the AMS 45.1 pp61-75 S 0273-0979(07)01191-3 (2008)
3. Zomorodian, A. Topological Data Analysis. Proceedings of Symposia in Applied Mathematics. AMS (2011)
4. Carlsson, G. Topology and Data. Bulletin of the AMS 46.2 pp255-308 S 0273-0979(09)01249-X (2009)
5. Singh, G. et al. Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. Eurographics Symposium on Point-Based Graphics (2007)
6. Ayasdi. TDA and Machine Learning: Better Together. (2015)
7. "2.3. Clustering." Scikit-learn 0.15.2 Documentation. Pedregosa et al., Scikit-learn: Machine Learning in Python, JMLR 12, pp. 2825-2830 (2011)