Visualizing and Exploring Data 1. Outline 1.Introduction 2.Summarizing Data: Some Simple Examples 3.Tools for Displaying Single Variable 4.Tools for Displaying.

Visualizing and Exploring Data

Outline

1. Introduction
2. Summarizing Data: Some Simple Examples
3. Tools for Displaying Single Variable
4. Tools for Displaying Relationships between Two Variables
5. Tools for Displaying More Than Two Variables
6. Principal Components Analysis
7. Multidimensional Scaling

Introduction

• Visual methods are important and ideal for sifting through data to find unexpected relationships.

• Exploratory data analysis aims to find structure in the data that may indicate deeper relationships between cases or variables.

Summarizing Data: Some Simple Examples

Measures of location

• Mean
• Median
• First quartile
• Third quartile
• Deciles
• Percentiles
• Mode

Summarizing Data: Some Simple Examples (Cont.)

Suppose that x(1), x(2), …, x(n) comprise a set of n data values.

• Sample mean

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x(i)$$

μ: true mean of the population; $\hat{\mu}$: estimate of the true mean


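The sample mean is the value that minimizes the sum of squared differences between it and the data values.

Ex. data set {1,2,3,4,5}

With the sample mean, 3, the sum of squared differences is 4 + 1 + 0 + 1 + 4 = 10.

With the value 1 instead, the sum is 0 + 1 + 4 + 9 + 16 = 30.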
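
The minimizing property of the sample mean can be checked numerically. The following is a minimal sketch, not part of the original slides: the helper name `sum_squared_diff` and the grid of candidate values are assumptions made for illustration.

```python
from statistics import mean

def sum_squared_diff(data, c):
    """Sum of squared differences between a candidate value c and the data."""
    return sum((x - c) ** 2 for x in data)

data = [1, 2, 3, 4, 5]
m = mean(data)

# Hypothetical grid of candidate values: 0, 0.5, 1, ..., 6.
candidates = [k / 2 for k in range(13)]
best = min(candidates, key=lambda c: sum_squared_diff(data, c))

print(m, sum_squared_diff(data, m))   # sample mean and its sum of squares
print(sum_squared_diff(data, 1))      # a candidate far from the mean does worse
print(best)                           # the grid point with the smallest sum
```

On this grid, the candidate with the smallest sum of squared differences coincides with the sample mean.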


• Median: The value that has an equal number of data points above and below it.

Ex. data set {1,2,3,4,5}: Median = 3

Ex. data set {1,2,3,4,5,6}: Median = (3+4)/2 = 3.5


• First quartile: The value that is greater than a quarter of the data points.

• Third quartile: The value that is greater than three quarters of the data points.

• Interquartile range: The difference between the third and first quartile.

• Range: The difference between the largest and smallest data point.


• Percentiles: The values of a variable below which a given percentage of observations fall.

• Deciles: The percentiles at multiples of 10 (the 10th, 20th, …, 90th percentiles).
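
As an illustration of the quartile, range, and percentile definitions above, here is a short sketch using Python's standard `statistics` module. The sample data are hypothetical, and `method="inclusive"` is one of several interpolation conventions; other conventions can give slightly different cut points.

```python
from statistics import median, quantiles

# Hypothetical sample: eleven evenly spaced values, so the cut points are easy to read.
data = list(range(0, 101, 10))   # 0, 10, ..., 100

# Quartiles split the sorted data into four parts, deciles into ten.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
deciles = quantiles(data, n=10, method="inclusive")

iqr = q3 - q1                        # interquartile range
data_range = max(data) - min(data)   # largest minus smallest data point

print(q1, q2, q3)
print(iqr, data_range)
print(deciles)
print(median(data) == q2)            # the second quartile is the median
```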


• Mode: The value that occurs most frequently in a data set or a probability distribution.

Ex. data set {1,3,6,6,6,6,7,7,12,12,17}: Mode = 6

Ex. data set {1,1,2,4,4}: Mode = 1, 4
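
The two mode examples can be reproduced with `statistics.multimode` from the Python standard library, which returns every value tied for the highest count. This is only a sketch; the variable names are illustrative.

```python
from statistics import multimode

# One value occurs most often: a single mode.
modes_a = multimode([1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17])

# Two values tie for the highest count: both are reported.
modes_b = multimode([1, 1, 2, 4, 4])

print(modes_a)  # [6]
print(modes_b)  # [1, 4]
```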


• Unimodal: A data set or a distribution with one mode

• Bimodal: A data set or a distribution with two modes

• Multimodal: A data set or a distribution with several modes


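• Variance

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} \bigl(x(i) - \mu\bigr)^2$$

If μ is replaced with $\hat{\mu}$, then the variance is estimated as

$$\hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^{n} \bigl(x(i) - \hat{\mu}\bigr)^2$$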


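• Standard deviation: The square root of the variance

$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \bigl(x(i) - \mu\bigr)^2}$$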
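
A small sketch of the variance and standard deviation calculations follows. It assumes the usual conventions, which the extracted slide text does not spell out: divide by n when the true mean μ is known, and by n − 1 when μ is replaced by the sample mean. The function names are hypothetical.

```python
import math
from statistics import pvariance, variance

def variance_known_mean(data, mu):
    """Average squared deviation from a known mean mu (divisor n)."""
    return sum((x - mu) ** 2 for x in data) / len(data)

def variance_estimated(data):
    """Variance estimate when the mean is itself estimated (divisor n - 1)."""
    n = len(data)
    mu_hat = sum(data) / n
    return sum((x - mu_hat) ** 2 for x in data) / (n - 1)

data = [1, 2, 3, 4, 5]
var_known = variance_known_mean(data, mu=3)   # treats 3 as the true mean
var_est = variance_estimated(data)
std_est = math.sqrt(var_est)                  # standard deviation

print(var_known, var_est, std_est)
# Cross-check against the standard library.
print(var_known == pvariance(data, 3), var_est == variance(data))
```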


• Skewness: It measures whether or not a distribution has a single long tail.

• A distribution is said to be right-skewed if the long tail extends in the direction of increasing values, and left-skewed if it extends in the direction of decreasing values. Symmetric distributions have zero skewness.
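
The slides give no formula for skewness in the extracted text, so the sketch below assumes one common choice, the moment coefficient (third central moment divided by the 1.5 power of the second). The three small data sets are hypothetical and only illustrate the sign convention.

```python
def skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5 (an assumed definition)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n   # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 20]      # long tail toward increasing values
left_skewed = [-20, -4, -3, -2, -1]  # long tail toward decreasing values

print(skewness(symmetric))      # zero for symmetric data
print(skewness(right_skewed))   # positive
print(skewness(left_skewed))    # negative
```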

Tools for Displaying Single Variable

• Histogram-1

Tools for Displaying Single Variable (Cont.)

• Histogram-2


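• Kernel estimate

A single variable X has measured values {x(1), x(2), …, x(n)}.

$$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K\bigl(x - x(i),\, h\bigr)$$

K(): Kernel function, commonly a Gaussian curve
h: Width

• Gaussian curve

$$K(t, h) = C \exp\!\left(-\frac{1}{2}\left(\frac{t}{h}\right)^{2}\right)$$

C: Normalization constant
t = x - x(i)
h: Standard deviation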
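
A kernel estimate with a Gaussian kernel can be sketched in a few lines. This is an illustration under stated assumptions, not the slides' own code: the estimate is taken to be the average of one Gaussian curve per data point, C is chosen as 1/(h·√(2π)) so each curve integrates to one, and the data values and width h are hypothetical.

```python
import math

def gaussian_kernel(t, h):
    """Gaussian curve with standard deviation h, evaluated at t = x - x(i)."""
    c = 1.0 / (h * math.sqrt(2.0 * math.pi))   # normalization constant C
    return c * math.exp(-0.5 * (t / h) ** 2)

def kernel_estimate(x, data, h):
    """Average of the kernels centered on each measured value."""
    return sum(gaussian_kernel(x - xi, h) for xi in data) / len(data)

data = [1.0, 2.0, 2.5, 4.0, 7.0]   # hypothetical measurements
h = 0.8                            # hypothetical width

# Riemann-sum check that the estimate integrates to about one over [-10, 20].
step = 0.01
grid = [-10.0 + k * step for k in range(3001)]
area = sum(kernel_estimate(x, data, h) for x in grid) * step

print(kernel_estimate(2.0, data, h))
print(area)
```

A smaller h gives a spikier estimate that follows individual points, while a larger h gives a smoother one.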


• Box and whisker plot

Tools for Displaying Relationships between Two Variables

• Scatterplot

Tools for Displaying Relationships between Two Variables (Cont.)

• Contour plot

Tools for Displaying More Than Two Variables

• Scatterplot matrix

Tools for Displaying More Than Two Variables (Cont.)

• Trellis plot


• Star plot


• Chernoff’s face


• Parallel coordinates plot

Principal Components Analysis

• Objective: To find the vectors onto which the data can be projected while keeping maximum variance.

• Advantage: This method can reduce the dimensionality of the data.

Principal Components Analysis (Cont.)

• Suppose X is an n×p data matrix in which each row is a data vector x and the columns represent the variables.

• X is mean-centered (i.e., each column has had the sample mean of that variable subtracted).

• Let a be a p×1 column vector of projection weights. The projection of a data vector x along a is

$$a^T x = \sum_{j=1}^{p} a_j x_j$$

• Projecting all data vectors in X onto a gives Xa, an n×1 column vector of projected values.


• Define the variance along a as

$$\sigma_a^2 = (Xa)^T (Xa) = a^T X^T X a = a^T V a$$

• $V = X^T X$: The p×p covariance matrix of the data


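• Impose the constraint $a^T a = 1$ and use a Lagrange multiplier λ to find the a that maximizes the variance along a:

$$u = a^T V a - \lambda\,(a^T a - 1)$$

• Differentiating with respect to a yields

$$\frac{\partial u}{\partial a} = 2Va - 2\lambda a = 0 \quad\Rightarrow\quad Va = \lambda a$$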


• The first principal component a is the eigenvector associated with the largest eigenvalue of the covariance matrix V.

• The second principal component is the eigenvector associated with the second largest eigenvalue; its direction is orthogonal to the first, and so on.
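
The eigenvector characterization can be illustrated with NumPy. This is a minimal sketch under assumptions: the data are synthetic, the function name `principal_components` is hypothetical, and V is formed as XᵀX on mean-centered data without a 1/n factor, following the definition used in these slides.

```python
import numpy as np

def principal_components(X):
    """Return centered data, eigenvalues (descending), and matching eigenvectors."""
    Xc = X - X.mean(axis=0)                 # mean-center each column
    V = Xc.T @ Xc                           # p x p matrix V = X^T X
    eigvals, eigvecs = np.linalg.eigh(V)    # eigh suits symmetric matrices; ascending order
    order = np.argsort(eigvals)[::-1]       # reorder so the largest eigenvalue comes first
    return Xc, eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])   # synthetic data
Xc, eigvals, eigvecs = principal_components(X)

a1 = eigvecs[:, 0]                          # first principal component
var_a1 = (Xc @ a1) @ (Xc @ a1)              # variance along a1, (Xa)^T (Xa)

b = rng.normal(size=3)
b /= np.linalg.norm(b)                      # an arbitrary unit vector for comparison
var_b = (Xc @ b) @ (Xc @ b)

print(np.isclose(var_a1, eigvals[0]))       # variance along a1 equals the top eigenvalue
print(var_b <= var_a1)                      # no other unit direction does better
```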


• When the data are projected onto the first k eigenvectors, the variance of the projected data can be expressed as

$$\sum_{j=1}^{k} \lambda_j$$

• $\lambda_j$: The jth eigenvalue


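• The loss of data from keeping only the first k principal components, as a proportion of the total variance:

$$\frac{\sum_{j=k+1}^{p} \lambda_j}{\sum_{l=1}^{p} \lambda_l}$$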


• Scree plot

• Ex.

269.8  38.9  50.5
272.4  39.5  50.0
272.0  39.3  50.2
268.2  38.6  50.2
268.2  38.6  50.8
267.0  38.2  51.1
267.8  38.4  51.0
273.6  39.6  50.0
271.2  39.1  50.4
270.0  38.9  50.5
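
The numbers behind a scree plot for the example data above can be computed as follows. This is a sketch only: it assumes the ten rows are observations and the three columns are variables, and it forms V as XᵀX on mean-centered data (no 1/n factor), as in the earlier slides. No expected values are quoted here because the slide results did not survive extraction.

```python
import numpy as np

X = np.array([
    [269.8, 38.9, 50.5],
    [272.4, 39.5, 50.0],
    [272.0, 39.3, 50.2],
    [268.2, 38.6, 50.2],
    [268.2, 38.6, 50.8],
    [267.0, 38.2, 51.1],
    [267.8, 38.4, 51.0],
    [273.6, 39.6, 50.0],
    [271.2, 39.1, 50.4],
    [270.0, 38.9, 50.5],
])

Xc = X - X.mean(axis=0)                  # mean-center each variable
V = Xc.T @ Xc                            # 3 x 3 matrix V = X^T X
eigvals = np.linalg.eigvalsh(V)[::-1]    # eigenvalues, largest first
total = eigvals.sum()

# Proportion of variance lost when only the first k components are kept.
loss = [eigvals[k:].sum() / total for k in range(1, len(eigvals) + 1)]

for k, (lam, lost) in enumerate(zip(eigvals, loss), start=1):
    print(f"k={k}  eigenvalue={lam:.4f}  loss={lost:.4f}")
```

Plotting the eigenvalues against k gives the scree plot; the loss column shows how much variance each choice of k gives up.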


Multidimensional Scaling

• Objective: To represent the data points in a lower-dimensional space while preserving, as far as possible, the distances between the data points.

Multidimensional Scaling (Cont.)

• Classical multidimensional scaling

• Metric multidimensional scaling

• Non-metric multidimensional scaling


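• Assume a 3×2 data matrix X in which the mean of each variable is zero.

$$X = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \end{bmatrix}$$

• Then compute a 3×3 matrix B such that

$$B = XX^T = \begin{bmatrix} x_{11}^2 + x_{12}^2 & x_{11}x_{21} + x_{12}x_{22} & x_{11}x_{31} + x_{12}x_{32} \\ x_{21}x_{11} + x_{22}x_{12} & x_{21}^2 + x_{22}^2 & x_{21}x_{31} + x_{22}x_{32} \\ x_{31}x_{11} + x_{32}x_{12} & x_{31}x_{21} + x_{32}x_{22} & x_{31}^2 + x_{32}^2 \end{bmatrix} = \begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{bmatrix}$$

$$\sum_{i} b_{ij} = \sum_{j} b_{ij} = 0$$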


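• The squared Euclidean distance between objects 1 and 2 is

$$\begin{aligned} d_{12}^2 &= (x_{11} - x_{21})^2 + (x_{12} - x_{22})^2 \\ &= x_{11}^2 - 2x_{11}x_{21} + x_{21}^2 + x_{12}^2 - 2x_{12}x_{22} + x_{22}^2 \\ &= b_{11} + b_{22} - 2b_{12} \end{aligned}$$

In general,

$$d_{ij}^2 = b_{ii} + b_{jj} - 2b_{ij} \quad\Rightarrow\quad b_{ij} = -\frac{1}{2}\bigl(d_{ij}^2 - b_{ii} - b_{jj}\bigr) \qquad (1)$$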


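• Define a 3×3 matrix D of squared distances such that

$$D = \begin{bmatrix} d_{11}^2 & d_{12}^2 & d_{13}^2 \\ d_{21}^2 & d_{22}^2 & d_{23}^2 \\ d_{31}^2 & d_{32}^2 & d_{33}^2 \end{bmatrix} = \begin{bmatrix} 0 & b_{11} + b_{22} - 2b_{12} & b_{11} + b_{33} - 2b_{13} \\ b_{22} + b_{11} - 2b_{21} & 0 & b_{22} + b_{33} - 2b_{23} \\ b_{33} + b_{11} - 2b_{31} & b_{33} + b_{22} - 2b_{32} & 0 \end{bmatrix}$$

For example, summing the first column and using $\sum_i b_{i1} = 0$:

$$\sum_{i} d_{i1}^2 = d_{11}^2 + d_{21}^2 + d_{31}^2 = 0 + (b_{22} + b_{11} - 2b_{21}) + (b_{33} + b_{11} - 2b_{31}) = \mathrm{tr}(B) + 3b_{11}$$

In general, with n objects:

$$\sum_{i} d_{ij}^2 = \mathrm{tr}(B) + n\,b_{jj} \qquad (2)$$

$$\sum_{j} d_{ij}^2 = \mathrm{tr}(B) + n\,b_{ii} \qquad (3)$$

$$\sum_{i}\sum_{j} d_{ij}^2 = 2n\,\mathrm{tr}(B) \qquad (4)$$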


Eq (2) gives

$$b_{jj} = \frac{1}{n}\Bigl(\sum_{i} d_{ij}^2 - \mathrm{tr}(B)\Bigr) \qquad (5)$$

Eq (3) gives

$$b_{ii} = \frac{1}{n}\Bigl(\sum_{j} d_{ij}^2 - \mathrm{tr}(B)\Bigr) \qquad (6)$$

Eq (4) gives

$$\mathrm{tr}(B) = \frac{1}{2n}\sum_{i}\sum_{j} d_{ij}^2 \qquad (7)$$

Substituting Eq (7) for tr(B) into Eq (5) and Eq (6) then gives

$$b_{ii} = \frac{1}{n}\sum_{j} d_{ij}^2 - \frac{1}{2n^2}\sum_{i}\sum_{j} d_{ij}^2 \qquad (8)$$

$$b_{jj} = \frac{1}{n}\sum_{i} d_{ij}^2 - \frac{1}{2n^2}\sum_{i}\sum_{j} d_{ij}^2 \qquad (9)$$

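Substituting Eq (8) and Eq (9) for $b_{ii}$ and $b_{jj}$ into Eq (1) gives

$$\begin{aligned} b_{ij} &= -\frac{1}{2}\bigl(d_{ij}^2 - b_{ii} - b_{jj}\bigr) \\ &= -\frac{1}{2}\Bigl(d_{ij}^2 - \frac{1}{n}\sum_{j} d_{ij}^2 + \frac{1}{2n^2}\sum_{i}\sum_{j} d_{ij}^2 - \frac{1}{n}\sum_{i} d_{ij}^2 + \frac{1}{2n^2}\sum_{i}\sum_{j} d_{ij}^2\Bigr) \\ &= -\frac{1}{2}\Bigl(d_{ij}^2 - \frac{1}{n}\sum_{j} d_{ij}^2 - \frac{1}{n}\sum_{i} d_{ij}^2 + \frac{1}{n^2}\sum_{i}\sum_{j} d_{ij}^2\Bigr) \end{aligned}$$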
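
The identities in this derivation can be checked numerically. The sketch below uses a hypothetical 3×2 data matrix (not from the slides), mean-centers it, and verifies that the squared distances determine B = XXᵀ through the final expression for b_ij.

```python
import numpy as np

# Hypothetical 3 x 2 data matrix, mean-centered so each variable has mean zero.
X = np.array([[1.0, 2.0], [3.0, 5.0], [8.0, 2.0]])
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

B = Xc @ Xc.T                                      # B = X X^T
diff = Xc[:, None, :] - Xc[None, :, :]
D2 = (diff ** 2).sum(axis=-1)                      # squared Euclidean distances

# Eq (1): d_ij^2 = b_ii + b_jj - 2 b_ij
eq1_rhs = np.diag(B)[:, None] + np.diag(B)[None, :] - 2 * B

# Final expression: b_ij from row means, column means, and the overall mean of D2.
B_from_D = -0.5 * (
    D2
    - D2.mean(axis=1, keepdims=True)               # (1/n) sum over j
    - D2.mean(axis=0, keepdims=True)               # (1/n) sum over i
    + D2.mean()                                    # (1/n^2) sum over i and j
)

print(np.allclose(B.sum(axis=0), 0.0))             # rows and columns of B sum to zero
print(np.allclose(D2, eq1_rhs))                    # Eq (1)
print(np.isclose(D2.sum(), 2 * n * np.trace(B)))   # Eq (4)
print(np.allclose(B, B_from_D))                    # B recovered from distances alone
```

Each check should print True, which is the point of the derivation: B can be rebuilt from the distance matrix without access to X.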


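• Apply singular value decomposition to B (for the symmetric matrix B = XX^T this coincides with its eigendecomposition):

$$B = V \Lambda V^T$$

V: orthonormal matrix, meaning $V^T V = V V^T = I$; all of its column vectors are eigenvectors of B, $V = [v_1, v_2, \ldots, v_n]$ with $v_i^T v_i = 1$

Λ: diagonal matrix; each element on the diagonal is an eigenvalue of B, $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$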


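• Keep the first r eigenvalues, those that are much larger than the others; r is the number of dimensions we map to.

$$B \approx V_r \Lambda_r V_r^T = \tilde{X}\tilde{X}^T, \qquad \tilde{X} = V_r \Lambda_r^{1/2}, \qquad r \le p$$

(with equality when the discarded eigenvalues are zero)

$V_r$: n×r matrix

$\Lambda_r$: r×r matrix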


• Ex.

Data:
1 2 8
3 4 5
5 6 9

Eigenvalues:
16.9641
7.7025
0

Distance (data):
0 4.1231 5.7446
4.1231 0 4.8990
5.7446 4.8990 0

Transformed data:
-2.4621 1.5436
-0.7528 -2.2085
3.2149 0.6649

Distance (transformed data):
0 4.1231 5.7446
4.1231 0 4.8990
5.7446 4.8990 0

Stress:
1.0325e-016
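
The example can be reproduced with a short classical scaling routine. This is a sketch rather than the code used for the slides: the function name `classical_mds` is hypothetical, B is rebuilt from the squared distances by double centering, and the signs of the resulting coordinates may differ from the transformed data shown above because eigenvectors are only defined up to sign.

```python
import numpy as np

def classical_mds(D2, r):
    """Embed n objects in r dimensions from an n x n matrix of squared distances."""
    # Double centering: b_ij = -1/2 (d_ij^2 - row mean - column mean + overall mean).
    B = -0.5 * (
        D2
        - D2.mean(axis=1, keepdims=True)
        - D2.mean(axis=0, keepdims=True)
        + D2.mean()
    )
    eigvals, eigvecs = np.linalg.eigh(B)           # symmetric B, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]              # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Coordinates: V_r times the square root of Lambda_r (negative round-off clipped).
    coords = eigvecs[:, :r] * np.sqrt(np.clip(eigvals[:r], 0.0, None))
    return coords, eigvals

X = np.array([[1.0, 2.0, 8.0], [3.0, 4.0, 5.0], [5.0, 6.0, 9.0]])
diff = X[:, None, :] - X[None, :, :]
D2 = (diff ** 2).sum(axis=-1)                      # squared distances in 3-D

coords, eigvals = classical_mds(D2, r=2)
diff_low = coords[:, None, :] - coords[None, :, :]
D_low = np.sqrt((diff_low ** 2).sum(axis=-1))      # distances in the 2-D embedding

print(np.round(eigvals, 4))
print(np.round(coords, 4))
print(np.round(D_low, 4))
```

Three points always lie in a plane, so two dimensions reproduce the original distances up to round-off, which is why the stress in the example is essentially zero.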


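• Stress

$$\frac{\sum_{i}\sum_{j} (\delta_{ij} - d_{ij})^2}{\sum_{i}\sum_{j} d_{ij}^2}$$

$\delta_{ij}$: The observed distance between points i and j in the p-dimensional space.

$d_{ij}$: The distance between the points representing these objects in the two-dimensional space.

• Sstress

$$\frac{\sum_{i}\sum_{j} (\delta_{ij}^2 - d_{ij}^2)^2}{\sum_{i}\sum_{j} d_{ij}^4}$$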
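
Both criteria are straightforward to compute from two distance matrices. The sketch below assumes the ratio forms reconstructed here (sum of squared differences over the sum of squared, or fourth-power, embedded distances); some references take a square root of the stress, so treat the exact normalization as an assumption. The three points and their one-dimensional embedding are hypothetical.

```python
import numpy as np

def stress(delta, d):
    """Sum of squared differences between observed and embedded distances, normalized."""
    return ((delta - d) ** 2).sum() / (d ** 2).sum()

def sstress(delta, d):
    """Same idea applied to squared distances."""
    return ((delta ** 2 - d ** 2) ** 2).sum() / (d ** 4).sum()

def distance_matrix(points):
    """Pairwise Euclidean distances between the rows of points."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical example: three points in the plane, embedded on a line by
# keeping only their first coordinate.
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 0.0]])
delta = distance_matrix(points)            # observed distances
d = distance_matrix(points[:, :1])         # distances in the 1-D embedding

print(stress(delta, delta))                # a perfect embedding has zero stress
print(stress(delta, d))
print(sstress(delta, d))
```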
