Peter Fox, Data Analytics – ITWS-4963/ITWS-6965, Week 14b, May 2, 2014: PCA and return to Big Data infrastructure… and assignment time.

Visual approaches for PCA/DR

• Screeplot - A plot, in descending order of magnitude, of the eigenvalues of a correlation matrix. In the context of factor analysis or principal components analysis, a scree plot helps the analyst visualize the relative importance of the factors: a sharp drop in the plot signals that subsequent factors are ignorable.
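A minimal sketch of the scree plot definition above, built by hand from the eigenvalues of the correlation matrix of the built-in USArrests data. The variable names (`ev`, `drops`) are illustrative only, and reading the "elbow" from the largest drop is a rule of thumb rather than a formal test.

```r
# Eigenvalues of the correlation matrix; eigen() returns them in
# descending order for a symmetric matrix.
ev <- eigen(cor(USArrests), symmetric = TRUE)$values

# Scree plot: eigenvalue against component number.
plot(ev, type = "b", xlab = "Component", ylab = "Eigenvalue",
     main = "Scree plot (USArrests, correlation matrix)")

# Size of the drop between consecutive eigenvalues; a large drop
# followed by small ones suggests the later components add little.
drops <- -diff(ev)
print(round(ev, 3))
print(round(drops, 3))
```

The same picture is available directly from `screeplot(prcomp(USArrests, scale = TRUE), type = "lines")`, which plots the variances (squared standard deviations) of the components.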


require(graphics)

## the variances of the variables in the
## USArrests data vary by orders of magnitude, so scaling is appropriate
prcomp(USArrests) # inappropriate
prcomp(USArrests, scale = TRUE)
prcomp(~ Murder + Assault + Rape, data = USArrests, scale = TRUE)
plot(prcomp(USArrests))
summary(prcomp(USArrests, scale = TRUE))
biplot(prcomp(USArrests, scale = TRUE))
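To see why the unscaled call is labelled inappropriate, the following sketch checks the per-variable variances and which variable dominates the first component when no scaling is applied. The names `vars`, `ratio` and `dominant` are illustrative.

```r
# Per-variable variances: they differ by orders of magnitude.
vars <- sapply(USArrests, var)
print(round(vars, 2))

# Ratio of the largest to the smallest variance.
ratio <- max(vars) / min(vars)

# Without scaling, PC1 is driven by the highest-variance variable.
pc1_raw <- prcomp(USArrests)$rotation[, "PC1"]
dominant <- names(which.max(abs(pc1_raw)))
print(dominant)

# With scaling, the PC1 loadings are spread across the variables.
pc1_scaled <- prcomp(USArrests, scale = TRUE)$rotation[, "PC1"]
print(round(pc1_scaled, 3))
```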


> prcomp(USArrests) # inappropriate
Standard deviations:
[1] 83.732400 14.212402  6.489426  2.482790

Rotation:
                PC1         PC2         PC3         PC4
Murder   0.04170432 -0.04482166  0.07989066 -0.99492173
Assault  0.99522128 -0.05876003 -0.06756974  0.03893830
UrbanPop 0.04633575  0.97685748 -0.20054629 -0.05816914
Rape     0.07515550  0.20071807  0.97408059  0.07232502

> prcomp(USArrests, scale = TRUE)
Standard deviations:
[1] 1.5748783 0.9948694 0.5971291 0.4164494

Rotation:
                PC1        PC2        PC3         PC4
Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
Rape     -0.5434321 -0.1673186  0.8177779  0.08902432


screeplot


> prcomp(~ Murder + Assault + Rape, data = USArrests, scale = TRUE)
Standard deviations:
[1] 1.5357670 0.6767949 0.4282154

Rotation:
               PC1        PC2        PC3
Murder  -0.5826006  0.5339532 -0.6127565
Assault -0.6079818  0.2140236  0.7645600
Rape    -0.5393836 -0.8179779 -0.1999436

> summary(prcomp(USArrests, scale = TRUE))
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.5749 0.9949 0.59713 0.41645
Proportion of Variance 0.6201 0.2474 0.08914 0.04336
Cumulative Proportion  0.6201 0.8675 0.95664 1.00000
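The "Importance of components" rows can be reproduced from the standard deviations alone. A short sketch follows; the 85% cut-off used to pick the number of components is an assumed, illustrative threshold and is not part of the original example.

```r
pca <- prcomp(USArrests, scale = TRUE)

# Variance of each component is the squared standard deviation.
variances <- pca$sdev^2
prop_var <- variances / sum(variances)   # "Proportion of Variance"
cum_prop <- cumsum(prop_var)             # "Cumulative Proportion"

importance <- rbind(
  "Standard deviation"     = pca$sdev,
  "Proportion of Variance" = prop_var,
  "Cumulative Proportion"  = cum_prop
)
colnames(importance) <- paste0("PC", seq_along(pca$sdev))
print(round(importance, 4))

# Smallest number of components reaching the assumed 85% threshold.
k <- which(cum_prop >= 0.85)[1]
print(k)
```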


biplot


Line plots lab 6 prcomp (top) and metaPCA (bottom)

Looking for convergence as iteration increases

Eigen Angle RobustAngle SparseAngle

http://cran.r-project.org/web/packages/MetaPCA/MetaPCA.pdf


prostate data (lab 7) 2D plot.


Lab 9

library(dr)
data(ais)
# default fitting method is "sir"
s0 <- dr(LBM~log(SSF)+log(Wt)+log(Hg)+log(Ht)+log(WCC)+log(RCC)+
  log(Hc)+log(Ferr),data=ais)
# Refit, using a different function for slicing to agree with arc.
summary(s1 <- update(s0,slice.function=dr.slices.arc))
# Refit again, using save, with 10 slices; the default is max(8,ncol+3)
summary(s2<-update(s1,nslices=10,method="save"))
# Refit, using phdres. Tests are different for phd; output is similar
# for phdy, but tests are not justifiable.
summary(s3<- update(s1,method="phdres"))
# fit using ire:
summary(s4 <- update(s1,method="ire"))
# fit using Sex as a grouping variable.
s5 <- update(s4,group=~Sex)
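For intuition about what the default "sir" method estimates, here is a base-R sketch of sliced inverse regression on simulated data. It is an illustration under simplifying assumptions, not the dr package's implementation: the function `sir_sketch`, the simulated `X` and `y`, and the equal-width slicing of the ranks are all hypothetical choices made for this example.

```r
sir_sketch <- function(X, y, nslices = 8) {
  X <- as.matrix(X)
  n <- nrow(X)

  # 1. Standardize the predictors: Z = (X - mean) %*% Sigma^(-1/2)
  Xc <- sweep(X, 2, colMeans(X))
  es <- eigen(cov(X), symmetric = TRUE)
  sigma_inv_sqrt <- es$vectors %*% diag(1 / sqrt(es$values)) %*% t(es$vectors)
  Z <- Xc %*% sigma_inv_sqrt

  # 2. Slice the response into groups of roughly equal size
  slice <- cut(rank(y, ties.method = "first"), breaks = nslices, labels = FALSE)

  # 3. Slice means of Z and slice proportions
  counts <- as.numeric(table(slice))
  slice_means <- rowsum(Z, slice) / counts
  weights <- counts / n

  # 4. Weighted covariance of the slice means
  M <- t(slice_means) %*% (slice_means * weights)

  # 5. Eigen-decomposition; map the directions back to the X scale
  em <- eigen(M, symmetric = TRUE)
  list(evalues = em$values, directions = sigma_inv_sqrt %*% em$vectors)
}

# Simulated example: y depends on X only through x1 + x2.
set.seed(1)
X <- matrix(rnorm(500 * 5), ncol = 5)
y <- (X[, 1] + X[, 2])^3 + rnorm(500, sd = 0.5)

fit <- sir_sketch(X, y)
d1 <- fit$directions[, 1] / sqrt(sum(fit$directions[, 1]^2))
print(round(fit$evalues, 3))
print(round(d1, 3))
```

In this simulation the first estimated direction should weight the first two predictors about equally (up to sign), with the remaining eigenvalues much smaller than the first.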


> s0

dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) +
    log(WCC) + log(RCC) + log(Hc) + log(Ferr), data = ais)
Estimated Basis Vectors for Central Subspace:
                  Dir1          Dir2        Dir3         Dir4
log(SSF)   0.150963358 -0.0501785457  0.10898336 -0.002210206
log(Wt)   -0.916480522 -0.1942298625 -0.20123696 -0.089722026
log(Hg)   -0.131538894  0.6854750758  0.71997546 -0.663097774
log(Ht)   -0.093358860 -0.0433408964  0.46445398  0.290838658
log(WCC)   0.004467838  0.0001833808  0.04497590  0.071904557
log(RCC)  -0.188973540  0.3475652934  0.29496908  0.037056363
log(Hc)    0.274758965 -0.6058301419 -0.34196615  0.678877114
log(Ferr) -0.005631238  0.0130588502 -0.08702709  0.015547214
Eigenvalues:
[1] 0.95766163 0.24504161 0.10707594 0.09041305


> summary(s1 <- update(s0,slice.function=dr.slices.arc))

Call:
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) +
    log(RCC) + log(Hc) + log(Ferr), data = ais, slice.function = dr.slices.arc)

Method:
sir with 11 slices, n = 202.

Slice Sizes:
19 19 19 19 19 19 19 18 18 18 15

Estimated Basis Vectors for Central Subspace:
               Dir1       Dir2     Dir3      Dir4
log(SSF)   0.143177 -0.0476079 -0.02815  0.003785
log(Wt)   -0.879504 -0.1425841  0.23303 -0.094970
log(Hg)   -0.195963  0.6318503  0.24483 -0.509424
log(Ht)   -0.058923 -0.1100757 -0.87893  0.217803
log(WCC)  -0.007276 -0.0029772 -0.05309  0.043056
log(RCC)  -0.167736  0.3924936 -0.19711 -0.213689
log(Hc)    0.368652 -0.6418658 -0.26373  0.796849
log(Ferr) -0.002697  0.0002593  0.03492  0.039116

              Dir1   Dir2    Dir3    Dir4
Eigenvalues 0.9572 0.2275 0.09368 0.07319
R^2(OLS|dr) 0.9980 0.9981 0.99839 0.99864

Large-sample Marginal Dimension Tests:
              Stat df p.value
0D vs >= 1D 284.78 80 0.00000
1D vs >= 2D  91.43 63 0.01113
2D vs >= 3D  45.48 48 0.57690
3D vs >= 4D  26.55 35 0.84694
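One common way to read the marginal dimension tests in the summary(s1) output is sequentially: start at 0D and stop at the first hypothesis that is not rejected. The sketch below types the reported values in by hand; the 0.05 level (`alpha`) is an assumed convention, not something stated on the slide.

```r
# Values copied from the summary(s1) output.
tests <- data.frame(
  hypothesis = c("0D vs >= 1D", "1D vs >= 2D", "2D vs >= 3D", "3D vs >= 4D"),
  stat       = c(284.78, 91.43, 45.48, 26.55),
  df         = c(80, 63, 48, 35),
  p.value    = c(0.00000, 0.01113, 0.57690, 0.84694)
)

alpha <- 0.05  # assumed significance level

# The first non-rejected null hypothesis "dD" gives the estimated dimension d.
first_not_rejected <- which(tests$p.value > alpha)[1]
est_dim <- first_not_rejected - 1
print(est_dim)
```

Under that assumed level, 0D and 1D are rejected and 2D is not, which points to a two-dimensional central subspace for this fit.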


> summary(s2<-update(s1,nslices=10,method="save"))

Call:
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) +
    log(RCC) + log(Hc) + log(Ferr), data = ais, slice.function = dr.slices.arc,
    nslices = 10, method = "save")

Method:
save with 10 slices, n = 202.

Slice Sizes:
21 21 20 20 20 25 24 22 20 9

Estimated Basis Vectors for Central Subspace:
               Dir1     Dir2     Dir3     Dir4
log(SSF)   0.127709 -0.00907  0.01018 -0.06144
log(Wt)   -0.905004 -0.07107 -0.15734  0.25774
log(Hg)   -0.056187  0.50674 -0.34064 -0.38087
log(Ht)    0.399868  0.36613  0.68439 -0.54216
log(WCC)   0.032608  0.02733  0.02277  0.03474
log(RCC)  -0.008463  0.15137 -0.24136 -0.47219
log(Hc)   -0.021630 -0.76164  0.57591  0.51526
log(Ferr)  0.002116 -0.01670  0.01631 -0.03360

              Dir1   Dir2   Dir3   Dir4
Eigenvalues 0.9389 0.6611 0.5129 0.4653
R^2(OLS|dr) 0.9936 0.9950 0.9985 0.9989

Large-sample Marginal Dimension Tests:
             Stat df(Nor) p.value(Nor) p.value(Gen)
0D vs >= 1D 378.3     324      0.02012       0.1071
1D vs >= 2D 279.6     252      0.11214       0.3116
2D vs >= 3D 179.9     189      0.67101       0.5160
3D vs >= 4D 134.3     135      0.50176       0.2786


S0 v. S2


S3 and S4


Infrastructure tools

• In RStudio – Install the rmongodb package
– http://cran.r-project.org/web/packages/rmongodb/vignettes/rmongodb_cheat_sheet.pdf
– http://cran.r-project.org/web/packages/rmongodb/vignettes/rmongodb_introduction.html

• MongoDB - http://www.mongodb.org/
– http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis - get familiar with the choices

• General idea:
– These are “backend” stores that can do various “things”


Back-ends

• Files (e.g. csv), application files (e.g. Rdata, xls, mat, …) – essentially for reading/input
• Databases – for reading and writing
– Also – for advanced operations inside the database!!
– Operations range from simple summaries to array operations and analytics functions
– Overhead is opening/ maintaining connections/ closing – easy on your laptop – harder when they are remote (network, authentication, etc.)
– Overhead is also around their internal storage formats (e.g. BSON for MongoDB)
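A small base-R sketch of the file back-end case, plus the open/use/close pattern that any connection-based back-end needs. The temporary file paths and the helper `read_first_lines` are hypothetical names for this illustration; a database client would follow the same open, use, close shape with its own functions.

```r
# Hypothetical temporary files standing in for file back-ends.
csv_path <- tempfile(fileext = ".csv")
rdata_path <- tempfile(fileext = ".RData")

# Plain file (csv): write once, then read as input.
write.csv(USArrests, csv_path)
from_csv <- read.csv(csv_path, row.names = 1)

# Application file (RData): stores the R object itself.
arrests <- USArrests
save(arrests, file = rdata_path)
env <- new.env()
load(rdata_path, envir = env)

# Connection pattern: open, use, and always close, even on error.
read_first_lines <- function(path, n = 3) {
  con <- file(path, open = "r")
  on.exit(close(con))
  readLines(con, n = n)
}
head_lines <- read_first_lines(csv_path)
print(head_lines)
```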


Functions versus languages

• Libraries for R mean that you code in R and call functions and the result returns into R
– Whatever the function does (i.e. how it is implemented) is what you get (subject to setting parameters)
• Languages (like Pig) provide more direct access to efficiently using the underlying capabilities of the application engine/ database
– Cost is learning this new language
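As an illustration of "how it is implemented is what you get", the sketch below compares two base-R PCA functions on the same data. According to their help pages, prcomp works from a singular value decomposition with the usual divisor n - 1, while princomp works from an eigen-decomposition with divisor n, so their standard deviations differ by a constant factor. The variable names are illustrative.

```r
n <- nrow(USArrests)

# Same data, two implementations of PCA (both unscaled here).
sd_prcomp <- prcomp(USArrests)$sdev
sd_princomp <- unname(princomp(USArrests)$sdev)

# The ratio is constant across components.
ratio <- sd_princomp / sd_prcomp
print(round(ratio, 6))
print(round(sqrt((n - 1) / n), 6))
```

The difference is small for n = 50 but it is a reminder to check which definition a library function uses before comparing results across tools.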


Example layout - Hadoop


Relating Open-Source and Commercial


Even further

• http://projects.apache.org/indexes/category.html#database
– Hadoop (MapReduce) – distributed execution (via disk when data is large)
– Pig (http://wiki.apache.org/pig/RunPig )
– HIVE (http://hive.apache.org/releases.html )
– Spark – in memory (RSpark still not easy to find/ install) http://gigaom.com/2014/02/27/as-mapreduce-fades-apache-spark-is-now-a-top-level-project/
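To make the MapReduce idea concrete, here is a single-machine, in-memory sketch using base R's `Map` and `Reduce`. It only mimics the programming model (independent map steps, then a merge of partial results); it involves no Hadoop, no distribution and no disk spill. The toy `docs` vector and the helpers `map_words` and `merge_counts` are hypothetical.

```r
# Toy input: each string plays the role of one input split.
docs <- c("pca and big data", "big data infrastructure", "pca again")

# Map step: each split independently emits its own word counts.
map_words <- function(d) {
  words <- strsplit(d, " ", fixed = TRUE)[[1]]
  tab <- table(words)
  setNames(as.numeric(tab), names(tab))
}
mapped <- Map(map_words, docs)

# Reduce step: merge two partial count vectors by key (word).
merge_counts <- function(a, b) {
  keys <- union(names(a), names(b))
  out <- setNames(numeric(length(keys)), keys)
  out[names(a)] <- out[names(a)] + a
  out[names(b)] <- out[names(b)] + b
  out
}
counts <- Reduce(merge_counts, mapped)
print(counts)
```

Because each map call touches only its own split and the merge is associative, the same two functions could in principle run on separate workers, which is the property the distributed engines exploit.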


~ Objectives

• Provide an application, i.e. predictive/ prescriptive model view of data analytics by focusing on the “front-end” (RStudio)
• Over a variety of data…
• Provide enough of a view of the back-end to know how you will need to interface to them (both open-source and commercial)


Layers across the Analytics Stack


Time for assignments
