Exploring Data with R
Abhik Seal
May 8, 2014
This is an introductory tutorial to get you started with visualizing and exploring data with R. There are some popular books and many online materials; I will provide links and references at the end of the tutorial.
library(ggplot2)
library(gcookbook)
Scatter Plots and line plots
plot(cars$dist~cars$speed, # y~x
     main="Relationship between car distance & speed", #Plot Title
     xlab="Speed (miles per hour)", #X axis title
     ylab="Distance travelled (miles)", #Y axis title
     xlim=c(0,30), #Set x axis limits from 0 to 30
     yaxs="i", #Set y axis style as internal
     col="red", #Set the colour of plotting symbol to red
     pch=19) #Set the plotting symbol to filled dots
[Figure: scatter plot titled "Relationship between car distance & speed"; x axis: Speed (miles per hour); y axis: Distance travelled (miles)]
Let’s draw vertical error bars with 5% errors on our cars scatterplot using the arrows() function:
plot(cars$dist~cars$speed, # y~x
     main="Relationship between car distance & speed", #Plot Title
     xlab="Speed (miles per hour)", #X axis title
     ylab="Distance travelled (miles)", #Y axis title
     xlim=c(0,30), #Set x axis limits from 0 to 30
     ylim=c(0,140), #Set y axis limits from 0 to 140
     xaxs="i", #Set x axis style as internal
     yaxs="i", #Set y axis style as internal
     col="red", #Set the colour of plotting symbol to red
     pch=19) #Set the plotting symbol to filled dots
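The arrows() call itself does not appear in the listing above. A minimal sketch of how 5% vertical error bars could be drawn, assuming the error is taken as 5% of each dist value (the variable name err is illustrative, not from the tutorial):

```r
# Scatter plot of the built-in cars data
plot(dist ~ speed, data = cars,
     xlim = c(0, 30), ylim = c(0, 140),
     xaxs = "i", yaxs = "i",
     col = "red", pch = 19)

# Assumed error size: 5% of each observed distance
err <- 0.05 * cars$dist

# angle = 90 with code = 3 draws flat caps at both ends,
# turning each arrow into a vertical error bar
arrows(x0 = cars$speed, y0 = cars$dist - err,
       x1 = cars$speed, y1 = cars$dist + err,
       angle = 90, code = 3, length = 0.04)
```

The length argument sets the width of the caps in inches; a small value keeps them from overlapping between neighbouring points.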
From the gcookbook package I am using the heightweight dataset to group data points by a variable. The grouping variable must be categorical, in other words a factor or character vector.
# Other shapes and colours can be set with scale_shape_manual() and scale_colour_manual()
ggplot(heightweight, aes(x=ageYear, y=heightIn, shape=sex, colour=sex)) +
    geom_point()
[Figure: scatter plot of heightIn against ageYear; point shape and colour show sex (f, m)]
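The requirement that the grouping variable be categorical can be illustrated with a built-in dataset. In mtcars the cyl column is stored as a number, so it has to be converted with factor() before it can be mapped to shape. This is a sketch on mtcars rather than heightweight, and the names cars_df and cyl_f are made up for the example:

```r
library(ggplot2)

cars_df <- mtcars
# cyl is numeric; ggplot2 refuses to map a continuous variable to shape,
# so convert it to a factor first
cars_df$cyl_f <- factor(cars_df$cyl)

p <- ggplot(cars_df, aes(x = wt, y = mpg, shape = cyl_f, colour = cyl_f)) +
  geom_point()
print(p)
```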
# Change shape of points
ggplot(heightweight, aes(x=ageYear, y=heightIn)) +
    geom_point(shape=3)
[Figure: scatter plot of heightIn against ageYear drawn with shape=3]
# Change point size; sex is categorical
ggplot(heightweight, aes(x=ageYear, y=heightIn, shape=sex)) +
# Adding annotations to regression plot
model <- lm(heightIn ~ ageYear, heightweight)
summary(model)
# First generate prediction data
# Given a model, predict values of yvar from xvar
# This supports one predictor and one predicted variable
# xrange: If NULL, determine the x range from the model object. If a vector with
#   two numbers, use those as the min and max of the prediction range.
# samples: Number of samples across the x range.
# ...: Further arguments to be passed to predict()
predictvals <- function(model, xvar, yvar, xrange=NULL, samples=100, ...) {
    # If xrange isn't passed in, determine xrange from the model.
    # Different ways of extracting the x range, depending on model type
    if (is.null(xrange)) {
        if (any(class(model) %in% c("lm", "glm")))
            xrange <- range(model$model[[xvar]])
        else if (any(class(model) %in% "loess"))
            xrange <- range(model$x)
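The idea described in the comments above (build an evenly spaced grid across the x range, then call predict() on it) can be sketched with the built-in cars data. This is an illustration of that idea under stated assumptions, not the remainder of predictvals(); the name newdata is chosen for the example:

```r
# Fit a simple linear model on built-in data
model <- lm(dist ~ speed, data = cars)

# x range taken from the data stored in the model object
xrange <- range(model$model[["speed"]])

# 100 evenly spaced x values across that range
newdata <- data.frame(speed = seq(xrange[1], xrange[2], length.out = 100))

# Predicted y value for every grid point
newdata$dist <- predict(model, newdata = newdata)

# Overlay the fitted line on the raw points
plot(dist ~ speed, data = cars, pch = 19, col = "red")
lines(newdata$speed, newdata$dist, lwd = 2)
```

Because the grid is a regular data frame, it can be passed to any plotting function that accepts x and y columns.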
x <- rnorm(1000, 50, 30)
y <- 3*x + rnorm(1000, 0, 20)
require(Hmisc)
plot(x,y)
# scat1d adds tick marks (bar codes, rug plot)
# on any of the four sides of an existing plot,
# corresponding with non-missing values of a vector x.
scat1d(x, col = "red") # density bars on top of graph
scat1d(y, 4, col = "blue") # density bars at right
[Figure: scatter plot of y against x with scat1d tick marks on the top and right sides]
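Base R has a simpler relative of scat1d(): rug() adds one tick per observation along a chosen side of an existing plot. A minimal sketch that needs no extra package (the seed is arbitrary and only makes the example repeatable):

```r
set.seed(1)
x <- rnorm(1000, 50, 30)
y <- 3 * x + rnorm(1000, 0, 20)

plot(x, y)
rug(x, side = 3, col = "red")   # ticks for x along the top
rug(y, side = 4, col = "blue")  # ticks for y along the right
```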
plot(x,y, pch = 20)
histSpike(x, add=TRUE, col = "green4", lwd = 2)
histSpike(y, 4, add=TRUE, col = "blue", lwd = 2)
histSpike(x, type='density', col = "red", add=TRUE) # smooth density at bottom
histSpike(y, 4, type='density', col = "red", add=TRUE)
[Figure: scatter plot of y against x with histSpike histograms and density curves on the bottom and right sides]
Bar graphs and Histograms
barplot(BOD$demand, names.arg=BOD$Time)
[Figure: bar plot of BOD$demand; bars labelled with the BOD$Time values 1, 2, 3, 4, 5, 7]
# Using the table function
barplot(table(mtcars$cyl))
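table() counts how often each value occurs, and barplot() then draws one bar per count. Inspecting the intermediate result makes this explicit (the name counts and the axis labels are chosen for the example):

```r
# Frequency of each cylinder count in mtcars
counts <- table(mtcars$cyl)
print(counts)

# names(counts) become the bar labels, the counts the bar heights
barplot(counts, xlab = "Number of cylinders", ylab = "Number of cars")
```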
# Bar graph of values. This uses the BOD data frame, with the
# "Time" column for x values and the "demand" column for y values.
ggplot(BOD, aes(x=Time, y=demand)) +
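The ggplot2 call above stops at the plus sign, so the layer that follows it is not shown. One way such a bar graph of values could be completed, assuming geom_col() as the layer and treating Time as a discrete variable (both are assumptions, not taken from the tutorial):

```r
library(ggplot2)

# geom_col() uses the y values directly as bar heights;
# factor(Time) gives one bar per measured time point
p <- ggplot(BOD, aes(x = factor(Time), y = demand)) +
  geom_col()
print(p)
```

Leaving Time numeric instead would place the bars on a continuous axis, with an empty slot at 6 where BOD has no measurement.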
# corrgram with pie charts
corrgram(R, order = TRUE, lower.panel = panel.shade, upper.panel = panel.pie,
         text.panel = panel.txt, main = "mtcars Data")
[Figure: corrgram titled "mtcars Data"; variables in order: gear, am, drat, mpg, vs, qsec, wt, disp, cyl, hp, carb]
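The matrix R passed to corrgram() and plotcorr() is not defined in this part of the tutorial. The sketch below assumes it is the correlation matrix of mtcars, which would match the variable names shown in the plots:

```r
# Assumption: R holds the pairwise Pearson correlations of mtcars
R <- cor(mtcars)

dim(R)                  # one row and one column per variable
round(R[1:3, 1:3], 2)   # top-left corner of the matrix
```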
The ellipse package provides the function plotcorr(), which helps us visualize correlations. plotcorr() uses ellipse-shaped glyphs for each entry of the correlation matrix. Here’s the default plot using our correlation matrix R:
# default corrgram
library(ellipse)
plotcorr(R)
[Figure: plotcorr(R) correlation plot with ellipse glyphs; rows and columns: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb]
# colored corrgram
plotcorr(R, col = colorRampPalette(c("firebrick3", "white", "navy"))(10))
[Figure: colored plotcorr(R) correlation plot; rows and columns: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb]
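The col argument in these calls is built with colorRampPalette(), which returns a function; calling that function with 10 yields ten colours interpolated between the anchor colours. A short sketch (the names pal and cols are chosen for the example):

```r
# colorRampPalette() returns a palette-generating function
pal <- colorRampPalette(c("firebrick3", "white", "navy"))

# Ten hex colour strings running from red through white to navy
cols <- pal(10)
print(cols)

# Show the ten colours as a strip
barplot(rep(1, 10), col = cols, border = NA, axes = FALSE)
```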
Another colored corrgram
plotcorr(R, col = colorRampPalette(c("#E08214", "white", "#8073AC"))(10), type = "lower")
[Figure: colored lower-triangle plotcorr(R) correlation plot; rows: cyl to carb; columns: mpg to gear]
Visualizing Dendrograms
# prepare hierarchical cluster
hc = hclust(dist(mtcars))
plot(hc, hang = -1) ## labels at the same level
[Figure: "Cluster Dendrogram" of the mtcars car models; y axis: Height; subtitle: dist(mtcars), hclust (*, "complete")]
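The dendrogram rests on two defaults: dist() computes Euclidean distances between rows, and hclust() uses complete linkage, as the plot subtitle indicates. Both can be checked on the objects themselves (the intermediate name d is chosen for the example):

```r
d <- dist(mtcars)     # pairwise distances between the 32 cars
hc <- hclust(d)       # agglomerative clustering

attr(d, "method")     # distance measure used by dist()
hc$method             # linkage method used by hclust()
head(hc$labels)       # leaf labels are the row names of mtcars
```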
An alternative way to produce dendrograms is to explicitly convert hclust objects into dendrogram objects.
# using dendrogram objects
hcd = as.dendrogram(hc)
# alternative way to get a dendrogram
plot(hcd)
[Figure: dendrogram of the mtcars car models drawn from the dendrogram object hcd]
Having an object of class dendrogram, we can also plot the branches in a triangular form.
# using dendrogram objects
plot(hcd, type = "triangle")
levels = rev(paste("ID", 21:40, sep = ""))), matrix(sample(LETTERS[7:10], 100, T), ncol = 5))
# converting data to long form for ggplot2 use
datf1y <- melt(datfy, id.var = 'indv')
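melt() turns a wide table into one row per combination of indv and variable. The same reshaping can be written out by hand in base R; the two-row table below is invented for illustration and is not the tutorial's datfy:

```r
# A tiny wide table: one id column and two value columns
wide <- data.frame(indv = c("ID1", "ID2"),
                   V1 = c("G", "H"),
                   V2 = c("I", "J"),
                   stringsAsFactors = FALSE)

# Stack the value columns and repeat the id column alongside them
long <- data.frame(indv = rep(wide$indv, times = 2),
                   variable = rep(c("V1", "V2"), each = nrow(wide)),
                   value = c(wide$V1, wide$V2),
                   stringsAsFactors = FALSE)
print(long)
```

The long form has one value per row, which is the shape ggplot2 expects when a column name should become an aesthetic.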
# Define layout for the plots (2 rows, 2 columns)
layt <- grid.layout(nrow=2, ncol=2, heights=c(6/8,2/8), widths=c(2/8,6/8), default.units=c('null','null'))
# View the layout of plots
grid.show.layout(layt)
[Figure: layout preview showing cells (1, 1), (1, 2), (2, 1) and (2, 2); row heights 0.75null and 0.25null, column widths 0.25null and 0.75null]
# Draw plots one by one in their positions
grid.newpage()
pushViewport(viewport(layout=layt))
print(py, vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(pxy, vp=viewport(layout.pos.row=1, layout.pos.col=2))
print(px, vp=viewport(layout.pos.row=2, layout.pos.col=2))
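The objects py, pxy and px are created in a part of the tutorial not shown here. To see the layout mechanics in isolation, the sketch below fills the same three cells with labelled rectangles using only the grid package; draw_cell() is a helper made up for this example:

```r
library(grid)

# Same 2 x 2 layout: tall top row, wide right column
layt <- grid.layout(nrow = 2, ncol = 2,
                    heights = c(6/8, 2/8),
                    widths = c(2/8, 6/8),
                    default.units = "null")

# Hypothetical helper: outline one layout cell and label it
draw_cell <- function(row, col, label) {
  pushViewport(viewport(layout.pos.row = row, layout.pos.col = col))
  grid.rect(gp = gpar(col = "grey40"))
  grid.text(label)
  popViewport()
}

grid.newpage()
pushViewport(viewport(layout = layt))
draw_cell(1, 1, "py")    # narrow panel at top left
draw_cell(1, 2, "pxy")   # main panel at top right
draw_cell(2, 2, "px")    # short panel at bottom right
popViewport()
```

Cell (2, 1) is left empty on purpose, mirroring the three print() calls above.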