Peter Fox, Data Analytics – ITWS-4963/ITWS-6965, Week 6b, February 28, 2014: Weighted kNN, clustering, more plotting, Bayes

Jan 13, 2016


Zsolt Deak

Page 1

Peter Fox

Data Analytics – ITWS-4963/ITWS-6965

Week 6b, February 28, 2014

Weighted kNN, clustering, more plotting, Bayes

Page 2

Plot tools/tips

http://statmethods.net/advgraphs/layout.html

http://flowingdata.com/2014/02/27/how-to-read-histograms-and-use-them-in-r/

pairs, gpairs, scatterplot.matrix, clustergram, etc.

data()

# precip, presidents, iris, swiss, sunspot.month (!), environmental, ethanol, ionosphere

More script fragments in Lab6b_*_2014.R on the web site (escience.rpi.edu/data/DA)

Page 3

Weighted KNN?

require(kknn)

data(iris)

m <- dim(iris)[1]

val <- sample(1:m, size = round(m/3), replace = FALSE,

prob = rep(1/m, m))

iris.learn <- iris[-val,]

iris.valid <- iris[val,]

iris.kknn <- kknn(Species~., iris.learn, iris.valid, distance = 1,

kernel = "triangular")

summary(iris.kknn)

fit <- fitted(iris.kknn)

table(iris.valid$Species, fit)

pcol <- as.character(as.numeric(iris.valid$Species))

pairs(iris.valid[1:4], pch = pcol, col = c("green3", "red")[(iris.valid$Species != fit)+1])
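To make the idea of distance-weighted voting concrete, here is a minimal base-R sketch of a weighted kNN classifier with a triangular kernel and Manhattan distance (matching distance = 1 above). It is a simplified illustration, not kknn's actual implementation; the function name weighted_knn_one, the seed, and k = 7 are assumptions made for this example.

```r
# Simplified weighted kNN for a single query point (illustrative only).
weighted_knn_one <- function(train_x, train_y, query, k = 7) {
  train_x <- as.matrix(train_x)
  query <- as.numeric(unlist(query))
  # Manhattan distance (Minkowski with p = 1) from the query to every training row
  d <- rowSums(abs(sweep(train_x, 2, query)))
  nearest <- order(d)[seq_len(k + 1)]
  # Scale by the (k+1)-th neighbour distance so the k nearest fall in [0, 1]
  d_scaled <- d[nearest[seq_len(k)]] / d[nearest[k + 1]]
  w <- 1 - d_scaled                                   # triangular kernel weight
  votes <- tapply(w, train_y[nearest[seq_len(k)]], sum)
  votes[is.na(votes)] <- 0                            # classes with no neighbour
  names(votes)[which.max(votes)]
}

data(iris)
set.seed(1)
val <- sample(nrow(iris), 50)                         # hold out 50 rows
pred <- vapply(val, function(i)
  weighted_knn_one(iris[-val, 1:4], iris$Species[-val], iris[i, 1:4]),
  character(1))
acc <- mean(pred == iris$Species[val])
print(table(actual = iris$Species[val], predicted = pred))
print(acc)
```

Closer neighbours get larger weights, so a single very near point can outvote several distant ones; an unweighted kNN would count them all equally.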

Page 4

Try Lab6b_8_2014.R

Page 5

New dataset - ionosphere

require(kknn)

data(ionosphere)

ionosphere.learn <- ionosphere[1:200,]

ionosphere.valid <- ionosphere[-c(1:200),]

fit.kknn <- kknn(class ~ ., ionosphere.learn, ionosphere.valid)

table(ionosphere.valid$class, fit.kknn$fit)

# vary kernel

(fit.train1 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15,

kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 1))

table(predict(fit.train1, ionosphere.valid), ionosphere.valid$class)

# alter distance

(fit.train2 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15,

kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 2))

table(predict(fit.train2, ionosphere.valid), ionosphere.valid$class)
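The kernels compared above differ in how quickly a neighbour's weight falls off with its (scaled) distance. The sketch below tabulates the usual textbook forms of three of them on distances scaled to [0, 1]; kknn's internal scaling and its "optimal" kernel are not reproduced here, so treat these formulas as an assumption about the general shape rather than the package's exact code.

```r
# Textbook kernel weight functions on a scaled distance d in [0, 1].
kernels <- list(
  rectangular  = function(d) ifelse(abs(d) <= 1, 0.5, 0),              # constant weight
  triangular   = function(d) ifelse(abs(d) <= 1, 1 - abs(d), 0),       # linear decay
  epanechnikov = function(d) ifelse(abs(d) <= 1, 0.75 * (1 - d^2), 0)  # quadratic decay
)

d <- seq(0, 1, by = 0.25)
weights <- sapply(kernels, function(f) f(d))   # rows: distances, columns: kernels
rownames(weights) <- d
print(round(weights, 3))
```

A rectangular kernel reduces to ordinary unweighted kNN, which is why it is a useful baseline when train.kknn picks among kernels.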

Page 6

Cluster plotting

source("http://www.r-statistics.com/wp-content/uploads/2012/01/source_https.r.txt") # source code from github

require(RCurl)

require(colorspace)

source_https("https://raw.github.com/talgalili/R-code-snippets/master/clustergram.r")

data(iris)

set.seed(250)

par(cex.lab = 1.5, cex.main = 1.2)

Data <- scale(iris[,-5]) # scaling

clustergram(Data, k.range = 2:8, line.width = 0.004) # line.width - adjust according to Y-scale

Page 7

Clustergram

Page 8

Any good?

set.seed(500)

Data2 <- scale(iris[,-5])

par(cex.lab = 1.2, cex.main = .7)

par(mfrow = c(3,2))

for(i in 1:6) clustergram(Data2, k.range = 2:8 , line.width = .004, add.center.points = T)

Page 10

How can you tell it is good?

set.seed(250)

Data <- rbind( cbind(rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)),

cbind(rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3)),

cbind(rnorm(100,2, sd = 0.3),rnorm(100,2, sd = 0.3),rnorm(100,2, sd = 0.3)))

clustergram(Data, k.range = 2:5 , line.width = .004, add.center.points = T)
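As a numeric complement to the visual check, one common alternative (an "elbow" check, not part of the clustergram code) is to compare the total within-cluster sum of squares across k. The sketch below rebuilds the same three-group synthetic data and assumes nstart = 20 and k from 1 to 6, choices made only for this illustration.

```r
# Elbow check on the synthetic data: three groups centred at 0, 1 and 2.
set.seed(250)
Data <- rbind(matrix(rnorm(300, 0, sd = 0.3), ncol = 3),
              matrix(rnorm(300, 1, sd = 0.3), ncol = 3),
              matrix(rnorm(300, 2, sd = 0.3), ncol = 3))

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) kmeans(Data, centers = k, nstart = 20)$tot.withinss)
names(wss) <- 1:6
print(round(wss, 1))

# Improvement gained by adding one more cluster; it should flatten after k = 3
gain <- -diff(wss)
print(round(gain, 1))
```

If the gain drops sharply after a given k, that k is a reasonable candidate for the number of clusters, which can then be compared with where the clustergram lines stabilise.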

Page 11

More complex…

set.seed(250)

Data <- rbind( cbind(rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)),

cbind(rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)),

cbind(rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3)),

cbind(rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3)))

clustergram(Data, k.range = 2:8 , line.width = .004, add.center.points = T)

Page 12

• Look at the location of the cluster points on the Y axis. See when they remain stable, when they start jumping around, and what happens to them at higher numbers of clusters (do they regroup?).

• Observe the strands of the data points. Even if the cluster centers are not ordered, the lines for each item might tend to move together (this needs more research and thinking), hinting at the real number of clusters.

• Run the plot multiple times to observe the stability of the cluster formation (and location)

http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/
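The quantity a clustergram puts on the Y axis can be approximated in a few lines of base R. The sketch below is an assumption about the general idea (cluster centres projected onto the first principal component for each k), not the code from the linked post; the seed and nstart = 10 are arbitrary choices.

```r
# Rough sketch of the values behind a clustergram for scaled iris data.
data(iris)
X <- scale(iris[, -5])
pc1 <- prcomp(X)$x[, 1]              # first principal component score per row

set.seed(250)
k_range <- 2:8
centers_on_pc1 <- lapply(k_range, function(k) {
  cl <- kmeans(X, centers = k, nstart = 10)$cluster
  sort(tapply(pc1, cl, mean))        # one Y position per cluster
})
names(centers_on_pc1) <- k_range
print(lapply(centers_on_pc1, round, 2))
```

Re-running this with different seeds shows which Y positions stay put and which move, which is the stability the bullets above ask you to look for.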

Page 14

Swiss - pairs

pairs(~ Fertility + Education + Catholic, data = swiss, subset = Education < 20, main = "Swiss data, Education < 20")

Page 15

ctree

require(party)

swiss_ctree <- ctree(Fertility ~ Agriculture + Education + Catholic, data = swiss)

plot(swiss_ctree)

Page 16

Hierarchical clustering

> dswiss <- dist(as.matrix(swiss))

> hs <- hclust(dswiss)

> plot(hs)
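The dendrogram alone does not assign observations to groups. A small sketch of one way to do that follows; standardising the columns first and cutting into 3 groups are assumptions for illustration, not choices made on the slide.

```r
# Hierarchical clustering of swiss on standardised columns, then cut into groups.
data(swiss)
hs_scaled <- hclust(dist(scale(swiss)))   # complete linkage is the hclust default
groups <- cutree(hs_scaled, k = 3)        # cluster label for each province
print(table(groups))
print(head(groups))
```

Scaling keeps a variable with a wide numeric range (for example Catholic, 2 to 100) from dominating the Euclidean distances; comparing the result with the unscaled dswiss above shows how much that matters.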

Page 17

scatterplotMatrix

Page 18

require(lattice); splom(swiss)

Page 19

Decision tree (reminder)

> str(iris)

'data.frame': 150 obs. of 5 variables:

$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

> str(swiss)

Page 20

Beyond plot: pairs

pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])

Try Lab6b_2_2014.R - USJudgeRatings

Page 21

Try hclust for iris
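One possible starting point for this exercise is sketched below. Average linkage and a cut at 3 clusters are assumptions made for illustration; other linkage methods may separate the species differently.

```r
# Hierarchical clustering of the four iris measurements, compared with Species.
data(iris)
hc <- hclust(dist(iris[, 1:4]), method = "average")
clusters <- cutree(hc, k = 3)
tab <- table(cluster = clusters, species = iris$Species)
print(tab)
# plot(hc, labels = FALSE)   # uncomment to draw the dendrogram
```

The cross-table shows how well the unsupervised clusters line up with the known species labels, which were never used to build the tree.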

Page 22

require(gpairs)

gpairs(iris)

Try Lab6b_3_2014.R

Page 23

Better scatterplots

install.packages("car")

require(car)

scatterplotMatrix(iris)

Try Lab6b_4_2014.R

Page 24

splom(iris) # default

Try Lab6b_7_2014.R

Page 25

splom extra!

require(lattice)

super.sym <- trellis.par.get("superpose.symbol")

splom(~iris[1:4], groups = Species, data = iris,

panel = panel.superpose,

key = list(title = "Three Varieties of Iris",

columns = 3,

points = list(pch = super.sym$pch[1:3],

col = super.sym$col[1:3]),

text = list(c("Setosa", "Versicolor", "Virginica"))))

splom(~iris[1:3]|Species, data = iris,

layout=c(2,2), pscales = 0,

varnames = c("Sepal\nLength", "Sepal\nWidth", "Petal\nLength"),

page = function(...) {

ltext(x = seq(.6, .8, length.out = 4),

y = seq(.9, .6, length.out = 4),

labels = c("Three", "Varieties", "of", "Iris"),

cex = 2)

})

parallelplot(~iris[1:4] | Species, iris)

parallelplot(~iris[1:4], iris, groups = Species,

horizontal.axis = FALSE, scales = list(x = list(rot = 90)))

> Lab6b_7_2014.R

Page 30

Ctree

> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)

> print(iris_ctree)

Conditional inference tree with 4 terminal nodes

Response: Species

Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

Number of observations: 150

1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
  2)* weights = 50
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
    4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
      5)* weights = 46
    4) Petal.Length > 4.8
      6)* weights = 8
  3) Petal.Width > 1.7
    7)* weights = 46
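The printed splits can be read as nested if/else rules. The sketch below applies them by hand in base R (no party package needed) and tabulates the terminal nodes against Species; the variable name node is made up for this example, and the node numbers are taken from the printout above.

```r
# Apply the printed ctree splits manually and count observations per terminal node.
data(iris)
node <- with(iris,
  ifelse(Petal.Length <= 1.9, 2,          # node 2
  ifelse(Petal.Width > 1.7, 7,            # node 7
  ifelse(Petal.Length <= 4.8, 5, 6))))    # nodes 5 and 6
node_sizes <- table(node)
print(node_sizes)                          # compare with the "weights" lines above
print(table(node, iris$Species))           # species mix inside each terminal node
```

The cross-table makes it easy to see which terminal nodes are pure and which mix two species.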

Page 31

plot(iris_ctree)

Try Lab6b_5_2014.R

> plot(iris_ctree, type="simple") # try this

Page 32

Try these on mapmeans, etc.

Page 33

Something simpler – kmeans and…

> mapmeans<-data.frame(as.numeric(mapcoord$NEIGHBORHOOD), adduse$GROSS.SQUARE.FEET, adduse$SALE.PRICE, adduse$'querylist$latitude', adduse$'querylist$longitude')

> mapobjnew<-kmeans(mapmeans,5, iter.max=10, nstart=5, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))

> fitted(mapobjnew,method=c("centers","classes"))

• Others?
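Because mapmeans is not available outside the lab data, the following sketch uses iris as a stand-in to show what the two fitted() methods return for a kmeans object. The seed, 3 centers and nstart = 5 are arbitrary choices for this illustration.

```r
# What fitted() returns for a kmeans object, using iris as stand-in data.
data(iris)
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 5)

centers_per_row <- fitted(km, method = "centers")  # centre of each row's own cluster
classes <- fitted(km, method = "classes")          # cluster label of each row

# Squared residuals around the fitted centres add up to the total within-cluster SS
resid_ss <- sum((as.matrix(iris[, 1:4]) - centers_per_row)^2)
print(c(resid_ss = resid_ss, tot.withinss = km$tot.withinss))
```

Passing the full vector c("centers", "classes") as on the slide selects the first option, so it behaves like method = "centers".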

Page 34

Plotting clusters (DIY)

library(cluster)

clusplot(mapmeans, mapobj$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

# Centroid plot against the first 2 discriminant functions

library(fpc)

plotcluster(mapmeans, mapobj$cluster)

• dendrogram?

• cluster.stats (in fpc)

Page 35

Bayes

> cl <- kmeans(iris[,1:4], 3)

> table(cl$cluster, iris[,5])

    setosa versicolor virginica
  2      0          2        36
  1      0         48        14
  3     50          0         0

# naiveBayes() is provided by the e1071 package

> require(e1071)

> m <- naiveBayes(iris[,1:4], iris[,5])

> table(predict(m, iris[,1:4]), iris[,5])

             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47

pairs(iris[1:4],main="Iris Data (red=setosa,green=versicolor,blue=virginica)", pch=21, bg=c("red","green3","blue")[unclass(iris$Species)])
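To show what a Gaussian naive Bayes classifier computes for numeric predictors, here is a from-scratch sketch in base R: class priors from frequencies, a per-class normal density for each feature, and the class with the largest log-posterior wins. It is an illustration of the technique under those assumptions, not the e1071 source code, and like the slide it evaluates on the training data.

```r
# Gaussian naive Bayes by hand on iris (illustrative sketch).
data(iris)
X <- iris[, 1:4]
y <- iris$Species

prior <- table(y) / length(y)                                  # P(class)
mu  <- t(sapply(split(X, y), colMeans))                        # classes x features
sdv <- t(sapply(split(X, y), function(d) sapply(d, sd)))       # classes x features

# Log-posterior (up to a constant) of every row under every class
log_post <- sapply(levels(y), function(cl) {
  ll <- rowSums(sapply(names(X), function(f)
    dnorm(X[[f]], mean = mu[cl, f], sd = sdv[cl, f], log = TRUE)))
  log(prior[[cl]]) + ll
})

pred <- factor(levels(y)[max.col(log_post)], levels = levels(y))
conf <- table(predicted = pred, actual = y)
print(conf)
accuracy <- mean(pred == y)
print(accuracy)
```

Working in log space avoids numerical underflow when many small densities are multiplied together.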

Page 36

Digging into iris

classifier<-naiveBayes(iris[,1:4], iris[,5])

table(predict(classifier, iris[,-5]), iris[,5], dnn=list('predicted','actual'))

classifier$apriori

classifier$tables$Petal.Length

plot(function(x) dnorm(x, 1.462, 0.1736640), 0, 8, col="red", main="Petal length distribution for the 3 different species")

curve(dnorm(x, 4.260, 0.4699110), add=TRUE, col="blue")

curve(dnorm(x, 5.552, 0.5518947), add=TRUE, col = "green")
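The means and standard deviations typed into dnorm() above are the per-species summaries of Petal.Length, the same numbers reported in classifier$tables$Petal.Length. A short base-R check that recomputes them directly (the names pl_mean and pl_sd are made up for this example):

```r
# Recompute the per-species mean and sd of Petal.Length used in the dnorm() calls.
data(iris)
pl_mean <- tapply(iris$Petal.Length, iris$Species, mean)
pl_sd   <- tapply(iris$Petal.Length, iris$Species, sd)
params <- cbind(mean = pl_mean, sd = pl_sd)
print(round(params, 7))
```

Computing the parameters instead of hard-coding them makes it easy to repeat the density plot for the other three measurements.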

Page 38

Using a contingency table

> data(Titanic)

> mdl <- naiveBayes(Survived ~ ., data = Titanic)

> mdl

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.formula(formula = Survived ~ ., data = Titanic)

A-priori probabilities:
Survived
      No      Yes
0.676965 0.323035

Conditional probabilities:
        Class
Survived        1st        2nd        3rd       Crew
     No  0.08187919 0.11208054 0.35436242 0.45167785
     Yes 0.28551336 0.16596343 0.25035162 0.29817159

        Sex
Survived       Male     Female
     No  0.91543624 0.08456376
     Yes 0.51617440 0.48382560

        Age
Survived      Child      Adult
     No  0.03489933 0.96510067
     Yes 0.08016878 0.91983122

Try Lab6b_9_2014.R
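The probabilities in this printout can be reproduced from the contingency table with base R alone, which shows what the classifier stores. In the sketch below, dimension 4 of Titanic is Survived and dimension 1 is Class; the variable names are made up for this example.

```r
# Reproduce the a-priori and Class-conditional probabilities from the Titanic table.
data(Titanic)
prior <- prop.table(margin.table(Titanic, 4))                  # P(Survived)
class_given_surv <- prop.table(margin.table(Titanic, c(4, 1)), 1)  # P(Class | Survived)
print(round(prior, 6))
print(round(class_given_surv, 8))
```

Replacing the 1 in c(4, 1) with 2 or 3 gives the Sex and Age tables in the same way.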

Page 39

http://www.ugrad.stat.ubc.ca/R/library/mlbench/html/HouseVotes84.html

require(mlbench)

data(HouseVotes84)

model <- naiveBayes(Class ~ ., data = HouseVotes84)

predict(model, HouseVotes84[1:10,-1])

predict(model, HouseVotes84[1:10,-1], type = "raw")

pred <- predict(model, HouseVotes84[,-1])

table(pred, HouseVotes84$Class)

Page 40

Exercise for you

> data(HairEyeColor)

> mosaicplot(HairEyeColor)

> margin.table(HairEyeColor,3)

Sex
  Male Female
   279    313

> margin.table(HairEyeColor,c(1,3))

       Sex
Hair    Male Female
  Black   56     52
  Brown  143    143
  Red     34     37
  Blond   46     81

How would you construct a naïve Bayes classifier and test it?
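One possible way to approach the exercise, sketched in base R: predict Sex from Hair and Eye using probabilities taken straight from the contingency table. The choice of Sex as the target and the helper name score are assumptions for this illustration, and the accuracy here is measured on the same data used to estimate the probabilities, so a real test would hold some cases out.

```r
# Hand-built naive Bayes on HairEyeColor: predict Sex from Hair and Eye.
data(HairEyeColor)
counts <- as.data.frame(HairEyeColor)                          # Hair, Eye, Sex, Freq

prior  <- prop.table(margin.table(HairEyeColor, 3))            # P(Sex)
p_hair <- prop.table(margin.table(HairEyeColor, c(3, 1)), 1)   # P(Hair | Sex)
p_eye  <- prop.table(margin.table(HairEyeColor, c(3, 2)), 1)   # P(Eye | Sex)

# Unnormalised posterior for one Sex given a Hair/Eye combination
score <- function(sx, h, e) prior[[sx]] * p_hair[sx, h] * p_eye[sx, e]

counts$pred <- unname(mapply(function(h, e) {
  s <- sapply(names(prior), function(sx) score(sx, h, e))
  names(prior)[which.max(s)]
}, as.character(counts$Hair), as.character(counts$Eye)))

# Weight each Hair/Eye/Sex cell by how many people fall in it
accuracy <- sum(counts$Freq[counts$pred == counts$Sex]) / sum(counts$Freq)
print(accuracy)
```

Hair and eye colour carry little information about sex, so the accuracy is expected to be only modestly better than always guessing the majority class.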

Page 41

Assignment 5

• Project proposals…

• Let’s look at it

• Assignment 4 - how is it going? Assume you all start after today?

Page 42

Assignment 6 preview

• Your term projects should fall within the scope of a data analytics problem of the type you have worked with in class/labs, or know of yourself – the bigger the data the better. This means that the work must go beyond just making lots of figures. You should develop the project to indicate you are thinking of and exploring the relationships and distributions within your data. Start with a hypothesis, think of a way to model and use the hypothesis, find or collect the necessary data, and do preliminary analysis, detailed modeling and a summary (interpretation).

– Note: You do not have to come up with a positive result, i.e. disproving the hypothesis is just as good. Please use the section numbering below for your written submission for this assignment.

• Introduction (2%)
• Data Description (3%)
• Analysis (8%)
• Model Development (8%)
• Conclusions and Discussion (4%)
• Oral presentation (5%) (10 mins)

Page 43

Assignments to come

• Term project (6). Due ~ week 13/14 – early May. 30% (25% written, 5% oral; individual). Available after spring break.

• Assignment 7: Predictive and Prescriptive Analytics. Due ~ week 10. 15% (15% written; individual).

Page 44

Admin info (keep/print this slide)

• Class: ITWS-4963/ITWS 6965
• Hours: 12:00pm-1:50pm Tuesday/Friday
• Location: SAGE 3101
• Instructor: Peter Fox
• Instructor contact: [email protected], 518.276.4862 (do not leave a msg)
• Contact hours: Monday** 3:00-4:00pm (or by email appt)
• Contact location: Winslow 2120 (sometimes Lally 207A announced by email)
• TA: Lakshmi Chenicheri [email protected]
• Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014
– Schedule, lectures, syllabus, reading, assignments, etc.