
High dimensionality

Feb 03, 2016

Page 1: High dimensionality

1

High dimensionality

Evgeny Maksakov

CS533C

Department of Computer Science

UBC

Page 2: High dimensionality

2

Today

• Problem Overview

• Direct Visualization Approaches
– Dimensional anchors
– Scagnostic SPLOMs

• Nonlinear Dimensionality Reduction
– Locally Linear Embedding and Isomaps
– Charting manifold

Page 3: High dimensionality

3

Problems with visualizing high dimensional data

• Visual clutter

• Clarity of representation

• Visualization is time consuming

Page 4: High dimensionality

4

Classical methods

Page 5: High dimensionality

5

Multiple Line Graphs

Pictures from Patrick Hoffman et al. (2000)

Page 6: High dimensionality

6

Multiple Line Graphs

Advantages and disadvantages:

- Hard to distinguish dimensions when multiple line graphs are overlaid

- Each dimension may have a different scale that should be shown

- More than 3 dimensions can become confusing

Page 7: High dimensionality

7

Scatter Plot Matrices

Pictures from Patrick Hoffman et al. (2000)

Page 8: High dimensionality

8

Scatter Plot Matrices

Advantages and disadvantages:

+ Useful for looking at all possible two-way interactions between dimensions

- Becomes inadequate for medium to high dimensionality
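
Counting the panels shows why the matrix stops scaling. A minimal sketch, assuming one panel per unordered pair of dimensions (the function name is illustrative, not from the cited paper):

```python
from itertools import combinations

def splom_panels(dimensions):
    """Return every unordered pair of dimensions, one per scatterplot panel."""
    return list(combinations(dimensions, 2))

# The panel count grows as p * (p - 1) / 2.
for p in (4, 10, 50):
    print(p, "dimensions ->", len(splom_panels(range(p))), "distinct scatterplots")
```

Each panel also gets a smaller share of the screen as the count grows, so both reading time and resolution suffer together.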

Page 9: High dimensionality

9

Bar Charts, Histograms

Pictures from Patrick Hoffman et al. (2000)

Page 10: High dimensionality

10

Bar Charts, Histograms

Advantages and disadvantages:

+ Good for small comparisons

- Contain little data

Page 11: High dimensionality

11

Survey Plots

Pictures from Patrick Hoffman et al. (2000)

Page 12: High dimensionality

12

Survey Plots

Advantages and disadvantages:

+ Shows correlations between any two variables when the data are sorted by one particular dimension

- Can be confusing

Page 13: High dimensionality

13

Parallel Coordinates

Pictures from Patrick Hoffman et al. (2000)

Page 14: High dimensionality

14

Parallel Coordinates

Advantages and disadvantages:

+ Many connected dimensions are seen in limited space

+ Can see trends in data

- Becomes inadequate for very high dimensionality

- Clutter

Page 15: High dimensionality

15

Circular Parallel Coordinates

Pictures from Patrick Hoffman et al. (2000)

Page 16: High dimensionality

16

Circular Parallel Coordinates

Advantages and disadvantages:

+ Combines properties of glyphs and parallel coordinates, making pattern recognition easier

+ Compact

- Clutter near the center

- Relations between each pair of dimensions are harder to interpret than in parallel coordinates

Page 17: High dimensionality

17

Andrews’ Curves

Pictures from Patrick Hoffman et al. (2000)

Page 18: High dimensionality

18

Andrews’ Curves

Advantages and disadvantages:

+ Can draw a virtually unlimited number of dimensions

- Hard to interpret
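
Each data point becomes one curve, with its coordinates used as coefficients of a Fourier-style series, which is why the number of dimensions is not limited by the layout. A minimal sketch, assuming the usual form f(t) = x1/√2 + x2 sin t + x3 cos t + x4 sin 2t + x5 cos 2t + … (the function name and the sample point are illustrative):

```python
import math

def andrews_curve(x, t):
    """Evaluate the Andrews curve of data point x at angle t (radians)."""
    value = x[0] / math.sqrt(2)
    for i, xi in enumerate(x[1:], start=1):
        harmonic = (i + 1) // 2                      # 1, 1, 2, 2, 3, 3, ...
        basis = math.sin if i % 2 == 1 else math.cos  # sin, cos, sin, cos, ...
        value += xi * basis(harmonic * t)
    return value

# Sample one curve on [-pi, pi] for an arbitrary 4-dimensional point.
ts = [-math.pi + k * 2 * math.pi / 100 for k in range(101)]
curve = [andrews_curve([5.1, 3.5, 1.4, 0.2], t) for t in ts]
```

Plotting `curve` against `ts` for every record overlays all points in one panel; similar records give similar curves, but individual dimensions are no longer readable, which matches the "hard to interpret" drawback.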

Page 19: High dimensionality

19

Radviz

Radviz employs a spring model

Pictures from Patrick Hoffman et al. (2000)
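
The spring model can be sketched in a few lines. This assumes the common formulation: one anchor per dimension, evenly spaced on the unit circle, each pulling the point with a spring whose stiffness is the record's value in that dimension (values already scaled to [0, 1]). The function name is illustrative.

```python
import math

def radviz_point(values):
    """Map one record (values scaled to [0, 1]) to its 2-D equilibrium position."""
    n = len(values)
    total = sum(values)
    if total == 0:
        return (0.0, 0.0)  # no spring pulls, so the point rests at the centre
    x = y = 0.0
    for j, v in enumerate(values):
        angle = 2 * math.pi * j / n      # anchor j sits at this angle on the circle
        x += v * math.cos(angle)
        y += v * math.sin(angle)
    # Equilibrium of the springs is the value-weighted mean of the anchor positions.
    return (x / total, y / total)
```

A record with equal values in every dimension lands at the centre regardless of magnitude, which is one way to see why the plot cannot show quantitative data.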

Page 20: High dimensionality

20

Radviz

Advantages and disadvantages:

+ Good for data manipulation

+ Low clutter

- Cannot show quantitative data

- High computational complexity

Page 21: High dimensionality

21

Dimensional Anchors

Page 22: High dimensionality

22

Attempt to Generalize Visualization Methods for High Dimensional Data

Page 23: High dimensionality

23

What is a dimensional anchor?

Picture from members.fortunecity.com/agreeve/seacol.htm & http://kresby.grafika.cz/data/media/46/dimension.jpg_middle.jpg

Page 24: High dimensionality

24

What is a dimensional anchor?

Nothing like that

A DA is just an axis line… Anchor points are coordinates…

Page 25: High dimensionality

25

Parameters of DA

Scatterplot features

1. Size of the scatter plot points

2. Length of the perpendicular lines extending from individual anchor points in a scatter plot

3. Length of the lines connecting scatter plot points that are associated with the same data point

Page 26: High dimensionality

26

Parameters of DA

Survey plot feature

4. Width of the rectangle in a survey plot

Parallel coordinates features

5. Length of the parallel coordinate lines

6. Blocking factor for the parallel coordinate lines

Page 27: High dimensionality

27

Parameters of DA

Radviz features

7. Size of the radviz plot point

8. Length of “spring” lines extending from individual anchor points of radviz plot

9. Zoom factor for the “spring” constant K

Page 28: High dimensionality

28

DA Visualization Vector

P = (p1, p2, p3, p4, p5, p6, p7, p8, p9)
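
One way to keep the nine parameters straight is a small record type. A minimal sketch: the field names are hypothetical labels for the nine parameters listed on the preceding slides, and the grouping into families follows those slides.

```python
from dataclasses import dataclass, astuple

@dataclass(frozen=True)
class DAParameters:
    """The nine-element vector P; field names are illustrative, not from the paper."""
    scatter_point_size: float = 0.0        # p1
    scatter_anchor_line: float = 0.0       # p2
    scatter_connecting_line: float = 0.0   # p3
    survey_rectangle_width: float = 0.0    # p4
    parallel_line_length: float = 0.0      # p5
    parallel_blocking_factor: float = 0.0  # p6
    radviz_point_size: float = 0.0         # p7
    radviz_spring_line: float = 0.0        # p8
    radviz_zoom_factor: float = 0.0        # p9

    def active_families(self):
        """List the technique families that have at least one nonzero parameter."""
        p = astuple(self)
        families = [("scatterplot", p[0:3]), ("survey plot", p[3:4]),
                    ("parallel coordinates", p[4:6]), ("radviz", p[6:9])]
        return [name for name, values in families if any(values)]

# Settings quoted on the slides that follow.
SCATTERPLOT = DAParameters(scatter_point_size=0.8, scatter_anchor_line=0.2)
PARALLEL = DAParameters(parallel_line_length=1.0, parallel_blocking_factor=1.0)
RADVIZ_LIKE = DAParameters(radviz_point_size=0.5, radviz_spring_line=1.0,
                           radviz_zoom_factor=0.5)
```

Because every family reads its own slice of P, mixing techniques amounts to setting nonzero values in more than one slice.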

Page 29: High dimensionality

29

DA describes visualization for any combination of:

• Parallel coordinates

• Scatterplot matrices

• Radviz

• Survey plots (histograms)

• Circle segments

Page 30: High dimensionality

30

Scatterplots

2 DAs, P = (0.8, 0.2, 0, 0, 0, 0, 0, 0, 0)

2 DAs, P = (0.1, 1.0, 0, 0, 0, 0, 0, 0, 0)

Picture from Patrick Hoffman et al. (1999)

Page 31: High dimensionality

31

Scatterplots with other layouts

3 DAs, P = (0.6, 0, 0, 0, 0, 0, 0, 0, 0)

5 DAs, P = (0.5, 0, 0, 0, 0, 0, 0, 0, 0)

Picture from Patrick Hoffman et al. (1999)

Page 32: High dimensionality

32

Survey Plots

P = (0, 0, 0, 0.4, 0, 0, 0, 0, 0)

P = (0, 0, 0, 1.0, 0, 0, 0, 0, 0)

Picture from Patrick Hoffman et al. (1999)

Page 33: High dimensionality

33

Circular Segments

P = (0, 0, 0, 1.0, 0, 0, 0, 0, 0)

Picture from Patrick Hoffman et al. (1999)

Page 34: High dimensionality

34

Parallel Coordinates

P = (0, 0, 0, 0, 1.0, 1.0, 0, 0, 0)

Picture from Patrick Hoffman et al. (1999)

Page 35: High dimensionality

35

Radviz-like visualization

P = (0, 0, 0, 0, 0, 0, 0.5, 1.0, 0.5)

Picture from Patrick Hoffman et al. (1999)

Page 36: High dimensionality

36

Playing with parameters

Crisscross layout with P = (0, 0, 0, 0, 0, 0, 0.4, 0, 0.5)

Parallel coordinates with P = (0, 0, 0, 0, 0, 0, 0.4, 0, 0.5)

Pictures from Patrick Hoffman et al. (1999)

Page 37: High dimensionality

37

More?

Pictures from Patrick Hoffman et al. (1999)

Page 38: High dimensionality

38

Scatterplot Diagnostics

or

Scagnostics

Page 39: High dimensionality

39

Tukey’s Idea of Scagnostics

• Take measures from scatterplot matrix

• Construct scatterplot matrix (SPLOM) of these measures

• Look for data trends in this SPLOM
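
The three steps can be sketched as a small pipeline. The two measures used here (squared Pearson and squared Spearman correlation) are stand-ins chosen for brevity, not Tukey's measures, and all names are illustrative.

```python
from itertools import combinations

def ranks(values):
    """Rank values 0..n-1 (ties broken by position, which is fine for a sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def pearson_sq(xs, ys):
    """Squared Pearson correlation; 0.0 if either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)

def spearman_sq(xs, ys):
    """Squared Spearman correlation: Pearson on the ranks."""
    return pearson_sq(ranks(xs), ranks(ys))

MEASURES = {"linear": pearson_sq, "monotonic": spearman_sq}  # stand-in measures

def scagnostics_table(columns):
    """columns: dict name -> values. Returns one row of measures per scatterplot."""
    return {(a, b): {name: f(columns[a], columns[b]) for name, f in MEASURES.items()}
            for a, b in combinations(columns, 2)}

data = {"x": [1, 2, 3, 4, 5], "y": [1, 4, 9, 16, 25], "z": [5, 4, 3, 2, 1]}
table = scagnostics_table(data)
```

Each row of `table` is one raw scatterplot described by its measures; plotting those rows against each other gives the SPLOM of measures in which trends and outliers are sought.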

Page 40: High dimensionality

40

Scagnostic SPLOM

Is like:
• Visualization of a set of pointers

Also:
• A set of pointers to pointers can also be constructed

Goal:
• To be able to locate unusual clusters of measures that characterize unusual clusters of raw scatterplots

Page 41: High dimensionality

41

Problems with constructing Scagnostic SPLOM

1) Some of Tukey’s measures presume an underlying continuous empirical or theoretical probability function, which can be a problem for other types of data.

2) The computational complexity of some of the Tukey measures is O(n³).

Page 42: High dimensionality

42

Solution*

1. Use measures from graph theory.
– Do not presume a connected plane of support
– Can be metric over discrete spaces

2. Base the measures on subsets of the Delaunay triangulation
• Gives O(n log(n)) in the number of points

3. Use adaptive hexagon binning before computing to further reduce the dependence on n.

4. Remove outlying points from the spanning tree

* Leland Wilkinson et al. (2005)

Page 43: High dimensionality

43

Properties of geometric graph for measures

• Undirected (edges consist of unordered pairs)

• Simple (no edge pairs a vertex with itself)

• Planar (has an embedding in R² with no crossed edges)

• Straight (embedded edges are straight line segments)

• Finite (V and E are finite sets)

Page 44: High dimensionality

44

Graphs that fit these demands:

• Convex Hull

• Alpha Hull

• Minimal Spanning Tree
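
As an illustration of the third graph, here is a minimal spanning tree built with Prim's algorithm. Note the assumption: this sketch runs on the complete Euclidean graph in O(n²) time for simplicity, whereas the approach described above restricts candidates to Delaunay edges to reach O(n log n). Function and variable names are illustrative.

```python
import math

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j, length) edges."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection of each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        # Relax the connection cost of every remaining vertex through u.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

tree = minimum_spanning_tree([(0, 0), (1, 0), (2, 0), (2, 1)])
```

The result is undirected, simple, planar, straight and finite, so it meets all five properties listed on the previous slide.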

Page 45: High dimensionality

45

Measures:

• Length of an edge

• Length of a graph

• Look for a closed path (boundary of a polygon)

• Perimeter of a polygon

• Area of a polygon

• Diameter of a graph
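
Several of these measures reduce to a few lines of geometry. A minimal sketch covering edge length, graph length, polygon perimeter and polygon area (shoelace formula); the graph diameter is left out, and the function names are illustrative.

```python
import math

def edge_length(p, q):
    """Euclidean length of the edge between vertices p and q."""
    return math.dist(p, q)

def graph_length(points, edges):
    """Sum of edge lengths; edges are (i, j) index pairs into points."""
    return sum(edge_length(points[i], points[j]) for i, j in edges)

def polygon_perimeter(vertices):
    """Length of the closed path through the vertices, in order."""
    n = len(vertices)
    return sum(math.dist(vertices[k], vertices[(k + 1) % n]) for k in range(n))

def polygon_area(vertices):
    """Area of a simple polygon by the shoelace formula."""
    n = len(vertices)
    twice_area = 0.0
    for k in range(n):
        x1, y1 = vertices[k]
        x2, y2 = vertices[(k + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
```

Applied to a hull, perimeter and area describe shape; applied to a spanning tree, edge and graph lengths describe how the points are spread.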

Page 46: High dimensionality

46

Five interesting aspects of scattered points:

• Outliers
– Outlying

• Shape
– Convex
– Skinny
– Stringy
– Straight

• Trend
– Monotonic

• Density
– Skewed
– Clumpy

• Coherence
– Striated

Page 47: High dimensionality

47

Classifying scatterplots

Picture from L. Wilkinson et al. (2005)

Page 48: High dimensionality

48

Looking for anomalies

Picture from L. Wilkinson et al. (2005)

Page 49: High dimensionality

49

Picture from L. Wilkinson et al. (2005)

Page 50: High dimensionality

50

Nonlinear Dimensionality Reduction (NLDR)

Assumptions:
• The data of interest lie on a nonlinear manifold embedded within a higher-dimensional space
• The manifold is low dimensional, so it can be visualized in a low-dimensional space.

Picture from: http://en.wikipedia.org/wiki/Image:KleinBottle-01.png

Page 51: High dimensionality

51

Manifold

A topological space that is “locally Euclidean”.

Picture from: http://en.wikipedia.org/wiki/Image:Triangle_on_globe.jpg

Page 52: High dimensionality

52

Methods

• Locally Linear Embedding

• ISOMAPS

Page 53: High dimensionality

53

Isomaps Algorithm

1. Construct neighborhood graph

2. Compute shortest paths

3. Construct d-dimensional embedding (like in MDS)

Picture from: Joshua B. Tenenbaum et al. (2000)
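
The three steps can be sketched in plain Python. Assumptions, none of them stated on the slide: neighbors are chosen by a fixed radius `eps`, shortest paths use Floyd-Warshall, and the MDS step extracts eigenvectors by power iteration with deflation. This is a small-data illustration, not a tuned implementation.

```python
import math
import random

def isomap(points, eps, d=1, iters=200):
    """Embed points in d dimensions; eps is the neighborhood radius (an assumption)."""
    n = len(points)
    # Step 1: neighborhood graph, edges weighted by Euclidean distance.
    g = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        g[i][i] = 0.0
        for j in range(i + 1, n):
            e = math.dist(points[i], points[j])
            if e <= eps:
                g[i][j] = g[j][i] = e
    # Step 2: all-pairs shortest paths (Floyd-Warshall) approximate geodesics.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if g[i][k] + g[k][j] < g[i][j]:
                    g[i][j] = g[i][k] + g[k][j]
    if any(math.inf in row for row in g):
        raise ValueError("neighborhood graph is disconnected; try a larger eps")
    # Step 3: classical MDS, i.e. double-center the squared distances.
    sq = [[v * v for v in row] for row in g]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(0)
    coords = [[] for _ in range(n)]
    for _ in range(d):
        # Power iteration finds the leading eigenvector of b.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            w = [sum(bij * vj for bij, vj in zip(row, v)) for row in b]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(bij * vj for bij, vj in zip(row, v)) for row in b]
        lam = sum(x * y for x, y in zip(v, bv))  # Rayleigh quotient (eigenvalue)
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i].append(scale * v[i])
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords

# Usage: 20 points on a half circle unroll onto a line.
arc = [(math.cos(math.pi * k / 19), math.sin(math.pi * k / 19)) for k in range(20)]
line = [c[0] for c in isomap(arc, eps=0.2)]
```

In the usage example the straight-line distance between the two ends of the arc is 2, but the embedding separates them by the path length along the arc, which is the point of replacing Euclidean distances with graph distances.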

Page 54: High dimensionality

54

Pictures taken from http://www.cs.wustl.edu/~pless/isomapImages.html

Page 55: High dimensionality

55

Locally Linear Embedding (LLE) Algorithm

Picture from Lawrence K. Saul et al. (2002)

Page 56: High dimensionality

56

Application of LLE

Panels: Original Sample, Mapping by LLE

Picture from Lawrence K. Saul et al. (2002)

Page 57: High dimensionality

57

Limitations of LLE

• The algorithm can only recover embeddings whose dimensionality, d, is strictly less than the number of neighbors, K. A margin between d and K is recommended.

• The algorithm assumes that a data point and its nearest neighbors can be modeled as locally linear; for curved manifolds, too large a K will violate this assumption.

• If the data are originally low dimensional, the algorithm degenerates.
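
The local-linearity assumption is easiest to see in the step that reconstructs each point from its neighbors. A minimal sketch of that one step, assuming the usual constrained least-squares form (weights sum to one) with a small regularizer added to the local Gram matrix; the function name and the default `reg` value are illustrative choices.

```python
def reconstruction_weights(x, neighbors, reg=1e-3):
    """Weights w (summing to 1) that best reconstruct x from its neighbors."""
    k = len(neighbors)
    diffs = [[xi - ni for xi, ni in zip(x, nb)] for nb in neighbors]
    # Local Gram matrix of the neighbor offsets.
    gram = [[sum(a * b for a, b in zip(diffs[i], diffs[j])) for j in range(k)]
            for i in range(k)]
    # Regularize: the Gram matrix is singular when K exceeds the input dimension.
    trace = sum(gram[i][i] for i in range(k))
    bump = reg * (trace if trace > 0 else 1.0) / k
    for i in range(k):
        gram[i][i] += bump
    # Solve gram * w = 1 by Gaussian elimination with partial pivoting.
    a = [row[:] + [1.0] for row in gram]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, k):
            f = a[r][col] / a[col][col]
            for c in range(col, k + 1):
                a[r][c] -= f * a[col][c]
    w = [0.0] * k
    for i in reversed(range(k)):
        w[i] = (a[i][k] - sum(a[i][c] * w[c] for c in range(i + 1, k))) / a[i][i]
    total = sum(w)
    return [wi / total for wi in w]  # enforce the sum-to-one constraint

w = reconstruction_weights((1.0, 1.0), [(0.0, 0.0), (2.0, 2.0)])
```

When the point really does lie in the span of its neighbors, as in this example, the weights reproduce it almost exactly; on a strongly curved patch with a large K no set of weights can, which is the second limitation above.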

Page 58: High dimensionality

58

Proposed improvements*

• Analyze pairwise distances between data points instead of assuming that the data are multidimensional vectors

• Use convex reconstructions

• Estimate the intrinsic dimensionality

• Enforce the intrinsic dimensionality if it is known a priori or highly suspected

* Lawrence K. Saul et al. (2002)

Page 59: High dimensionality

59

Strengths and weaknesses:

• ISOMAP handles holes well

• ISOMAP can fail if the data hull is non-convex

• Vice versa for LLE

• Both offer embeddings without mappings.

Page 60: High dimensionality

60

Charting manifold

Page 61: High dimensionality

61

Algorithm Idea

1) Find a set of data-covering locally linear neighborhoods (“charts”) such that adjoining neighborhoods span maximally similar subspaces

2) Compute a minimal-distortion merger (“connection”) of all charts

Page 62: High dimensionality

62

Picture from Matthew Brand (2003)

Page 63: High dimensionality

63

Video test

Picture from Matthew Brand (2003)

Page 64: High dimensionality

64

Where ISOMAPs and LLE fail, Charting Prevails

Picture from Matthew Brand (2003)

Page 65: High dimensionality

65

Questions?

Page 66: High dimensionality

66

Literature

Covered papers:

1. Graph-Theoretic Scagnostics. L. Wilkinson, R. Grossman, A. Anand. Proc. InfoVis 2005.

2. Dimensional Anchors: a Graphic Primitive for Multidimensional Multivariate Information Visualizations. Patrick Hoffman et al. Proc. Workshop on New Paradigms in Information Visualization and Manipulation, Nov. 1999, pp. 9-16.

3. Charting a manifold. Matthew Brand. NIPS 2003.

4. Think Globally, Fit Locally: Unsupervised Learning of Nonlinear Manifolds. Lawrence K. Saul & Sam T. Roweis. University of Pennsylvania Technical Report MS-CIS-02-18, 2002.

Other papers:

• A Global Geometric Framework for Nonlinear Dimensionality Reduction. Joshua B. Tenenbaum, Vin de Silva, John C. Langford. Science, vol. 290, pp. 2319-2323 (2000).