A picture speaks a thousand words Data Visualisation with R Barbara Fusinska @BasiaFusinska
Apr 05, 2017
About me
Programmer
Machine Learning
Data Solutions Architect
@BasiaFusinska
Agenda
• Exploratory Data Analysis
• Elements of EDA
• Visual artifacts
• R Visualisation ecosystem
• Base/Lattice/ggplot2 comparison
• Layers in ggplot2
• Interesting visualisations
https://github.com/BasiaFusinska
https://katacoda.com/BasiaFusinska
Exploratory Data Analysis (EDA) is an approach to analysing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modelling or hypothesis testing task.
Why do we need visualisations?
Insight Impress
Use Case - Online Learning Platform
Entities: User, Area, Vendor, Course, CourseTaken
Areas: Cloud (25%), Data Science (50%), Web (15%), Software Engineering (10%)
Vendors: Software Mind (20%), Cloud Solutions (3%), InfraNet (12%), DataLearn (7%), WWW Way (11%), Soft Skills (4%), Edu Zen (10%), Data Foundation (25%), Learning Island (5%), Design Your Way (3%)
Years: 2014, 2015, 2016
Prices: 10$ (25%), 19$ (15%), 49$ (20%), 99$ (20%), 250$ (15%), 500$ (5%)
courses.aggregate
Name                | Area                 | Vendor          | Year | Month | Price [$]
Perez, Lisa         | Data Science         | Data Foundation | 2015 | 7     | 99
Tran, Janiro        | Software Engineering | DataLearn       | 2016 | 2     | 10
Bajwa, John         | Cloud                | InfraNet        | 2015 | 9     | 250
Lindsey, Aaron      | Web                  | Software Mind   | 2014 | 6     | 19
Cooper, Duncan      | Software Engineering | Learning Island | 2014 | 7     | 250
Grumbach, Alexander | Web                  | Design Your Way | 2015 | 2     | 99
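To try the plots that follow without the original data, a stand-in for courses.aggregate can be simulated from the percentages on the use-case slide. This is only a sketch: the row count of 5000, the seed, the uniform draws for year and month, the independence of area and vendor, and the omission of user names are all assumptions, not the author's data. Column names are lowercase because the code on the slides refers to them that way.

```r
# Hypothetical stand-in for courses.aggregate, simulated from the slide's percentages.
set.seed(2017)   # arbitrary seed, for reproducibility only
n <- 5000        # assumed number of courses taken

# Shares copied from the use-case slide
areas   <- c("Cloud" = 0.25, "Data Science" = 0.50, "Web" = 0.15,
             "Software Engineering" = 0.10)
vendors <- c("Software Mind" = 0.20, "Cloud Solutions" = 0.03, "InfraNet" = 0.12,
             "DataLearn" = 0.07, "WWW Way" = 0.11, "Soft Skills" = 0.04,
             "Edu Zen" = 0.10, "Data Foundation" = 0.25, "Learning Island" = 0.05,
             "Design Your Way" = 0.03)
prices  <- c("10" = 0.25, "19" = 0.15, "49" = 0.20, "99" = 0.20,
             "250" = 0.15, "500" = 0.05)

courses.aggregate <- data.frame(
  area   = sample(names(areas), n, replace = TRUE, prob = areas),
  vendor = sample(names(vendors), n, replace = TRUE, prob = vendors),
  year   = sample(2014:2016, n, replace = TRUE),   # assumed uniform over years
  month  = sample(1:12, n, replace = TRUE),        # assumed uniform over months
  price  = as.numeric(sample(names(prices), n, replace = TRUE, prob = prices)),
  stringsAsFactors = TRUE
)

head(courses.aggregate)
```

Because area and vendor are drawn independently here, cross-tabulations of the simulated data will not show the empty cells visible in the slides' tables.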
Categorical data - count occurrences
Area                  Count
Cloud                   693
Data Science           2271
Software Engineering    462
Web                    1574
# Count occurrences
courses.areas <- table(courses.aggregate$area)
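As a small illustration of what table() does, the sketch below rebuilds a vector with the counts shown on the slide and counts it again. The vector is reconstructed from those four numbers purely for demonstration; it is not the original data.

```r
# Rebuild a category vector from the slide's counts, then count it with table().
area <- rep(c("Cloud", "Data Science", "Software Engineering", "Web"),
            times = c(693, 2271, 462, 1574))

courses.areas <- table(area)
print(courses.areas)

# A one-dimensional table behaves like a named vector:
# it can be sorted and indexed by category name.
sort(courses.areas, decreasing = TRUE)
courses.areas[["Data Science"]]
```

Sorting the table before passing it to barplot() is one way to order the bars by size rather than alphabetically.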
Bar plot – Number of courses taken by Area
# Draw the plot
barplot(courses.areas,
  ylab="Count", main="Areas")
Categorical data - count occurrences
# Count occurrences
vendor.area <- table(data.frame(
  courses.aggregate$area, courses.aggregate$vendor))
               CSol  DataF  DataL  DesYW  EZen  …
Cloud             0    263     49     28     0
Data Science     91    636     90      0   192
Software          0     44     83     95     0
Web               0    267    207      0   158
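A two-way count can also be built by passing both columns to table() directly, which lets the dimensions be named. The sketch below uses only the six sample rows shown in the courses.aggregate table, so the counts are tiny; addmargins() is added to show row and column totals.

```r
# Area and vendor of the six sample rows from the courses.aggregate slide
sample.rows <- data.frame(
  area   = c("Data Science", "Software Engineering", "Cloud",
             "Web", "Software Engineering", "Web"),
  vendor = c("Data Foundation", "DataLearn", "InfraNet",
             "Software Mind", "Learning Island", "Design Your Way")
)

# Named arguments become the dimension names of the contingency table
vendor.area <- table(area = sample.rows$area, vendor = sample.rows$vendor)
print(vendor.area)

# Append a "Sum" row and column with the marginal totals
with.totals <- addmargins(vendor.area)
print(with.totals)
```

Rows are areas and columns are vendors, which is the orientation barplot() expects when each bar should be a vendor stacked by area.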
Stacked Bar plot – Areas by Vendors
# Draw the plot
barplot(vendor.area, ylab="Count",
  main="Areas by Vendor", col=rainbow(4))
legend("topright", fill=rainbow(4),
  legend=row.names(vendor.area))
Grouped (Beside) Bar plot – Areas by Year
# Count occurrences
areas.year <- table(data.frame(
  courses.aggregate$area, courses.aggregate$year))
# Draw the plot
barplot(areas.year, ylab="Count",
  main="Areas By Year", col=rainbow(4), beside=TRUE)
legend("topleft", fill=rainbow(4), legend=row.names(areas.year))
Stacked Bar plot – Areas by Year
# Draw the plot
barplot(areas.year, ylab="Count",
  main="Areas by year", col=rainbow(4))
legend("topright",
  legend=row.names(areas.year), fill=rainbow(4))
100% Stacked Bar plot – Areas by Year
# Draw the plot
barplot(prop.table(areas.year, 2)*100,
  col=rainbow(4), ylab="%", main="Years by Areas")
legend("topright",
  legend=row.names(areas.year), fill=rainbow(4))
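The second argument of prop.table() is the margin: 2 divides every cell by its column total, so each bar adds up to 100%. The sketch below checks this on the excerpt of the vendor-by-area counts shown earlier, since the areas.year counts themselves do not appear on the slides; only the five visible vendor columns are included.

```r
# Excerpt of the area x vendor counts from the earlier slide (first five vendors)
vendor.area <- matrix(
  c( 0, 263,  49, 28,   0,
    91, 636,  90,  0, 192,
     0,  44,  83, 95,   0,
     0, 267, 207,  0, 158),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("Cloud", "Data Science", "Software", "Web"),
                  c("CSol", "DataF", "DataL", "DesYW", "EZen"))
)

# margin = 2 normalises within each column (each vendor)
pct <- prop.table(vendor.area, margin = 2) * 100
colSums(pct)   # every column adds up to 100

# 100% stacked bar plot with an automatic legend
barplot(pct, col = rainbow(4), ylab = "%", legend.text = rownames(pct))
```

Using margin = 1 instead would normalise each area across vendors, which answers a different question.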
Pie chart – Areas
# Areas occurrences
per_labels <- round(courses.areas/sum(courses.areas) * 100, 1)
per_labels <- paste(per_labels, "%", sep="")
# Draw the plot
pie(courses.areas, col=rainbow(4), labels=per_labels)
legend("topleft", fill=rainbow(4),
  legend=names(courses.areas))
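The percentage labels can be worked through with the area counts from the earlier slide. This sketch stores the counts in a plain named vector instead of a table, which is an assumption made to keep the example self-contained.

```r
# Area counts as shown on the "count occurrences" slide
courses.areas <- c("Cloud" = 693, "Data Science" = 2271,
                   "Software Engineering" = 462, "Web" = 1574)

# Share of each area in percent, rounded to one decimal place
per_labels <- paste0(round(courses.areas / sum(courses.areas) * 100, 1), "%")
per_labels

# Pie chart labelled with the percentages; the legend maps colours to areas
pie(courses.areas, col = rainbow(4), labels = per_labels)
legend("topleft", fill = rainbow(4), legend = names(courses.areas))
```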
Numerical data – summarise
# Calculate yearly revenue
revenue.year <- aggregate(price~year, data=courses.aggregate, sum)
Year   Price
2014   139001
2015   159002
2016   180197
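The formula interface of aggregate() reads as "summarise price, grouped by year". A minimal sketch on the six sample rows from the courses.aggregate slide follows; with so few rows the totals are of course not the yearly revenues shown above.

```r
# Year and price of the six sample rows from the courses.aggregate slide
sample.rows <- data.frame(
  year  = c(2015, 2016, 2015, 2014, 2014, 2015),
  price = c(99, 10, 250, 19, 250, 99)
)

# Sum of price within each year; the result is a data frame sorted by year
revenue.year <- aggregate(price ~ year, data = sample.rows, FUN = sum)
print(revenue.year)
```

Adding a second grouping variable on the right-hand side, as in price ~ year + area, gives one row per combination.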
Bar plot – Revenue per year
# Draw the plot
barplot(revenue.year$price, names.arg = revenue.year$year,
  ylab="Count [$]", main="Revenue per year")
Categorical data - count occurrences
# Prepare data
library(reshape)
revenue.year.area <- aggregate(price ~ year + area, data=courses.aggregate, sum)
rya <- t(cast(revenue.year.area, year ~ area, value="price"))
                2014    2015    2016
Cloud         127474   17873   16819
Data Science   65639   73645   74289
Software        8342    9976   11781
Web            52556   57508   77308
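As an alternative to aggregate() followed by cast(), base R's xtabs() can sum a numeric column into an area-by-year matrix in one step. This is a swapped-in technique, not the approach on the slide, sketched here on the six sample rows from the courses.aggregate slide.

```r
# Area, year and price of the six sample rows from the courses.aggregate slide
sample.rows <- data.frame(
  area  = c("Data Science", "Software Engineering", "Cloud",
            "Web", "Software Engineering", "Web"),
  year  = c(2015, 2016, 2015, 2014, 2014, 2015),
  price = c(99, 10, 250, 19, 250, 99)
)

# Left-hand side is summed within each area/year cell; empty cells become 0
rya <- xtabs(price ~ area + year, data = sample.rows)
print(rya)

# Column totals give the revenue per year
colSums(rya)
```

The result has areas as rows and years as columns, so it can be passed to barplot() in the same way as the transposed cast() output.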
Stacked Bar plot – Revenue by Year and Area
# Draw the plot
barplot(rya, col=rainbow(4), ylab="Count [$]",
  main="Revenue by Year & Area")
legend("topright", fill=rainbow(4), legend=row.names(rya))
Grouped (Beside) Bar plot – Areas Revenue by Year
# Draw the plot
barplot(rya, col=rainbow(4), ylab="Count [$]",
  main="Revenue by Year & Area", beside=TRUE)
# Four areas, so the legend needs the same four colours as the bars
legend("topright", fill=rainbow(4), legend=row.names(rya))
Histograms – Frequency & Density
Histogram – Course Prices
# Draw the plot
hist(courses.aggregate$price,
  main="Distribution of prices",
  xlab="Course price", breaks=20, col=heat.colors(20))
Histogram – Course Prices per month
# Prepare the data
revenue.year.month <-
  aggregate(price ~ year + month, data=courses.aggregate, sum)
# Draw the plot
hist(revenue.year.month$price, main="Distribution of revenue per month",
  xlab="Revenue per month", breaks=20, col=heat.colors(20))
Density – Course Prices per month
# Probability density
hist(revenue.year.month$price, main="Distribution of revenue per month",
  xlab="Revenue per month", breaks=20, col=heat.colors(20), prob=TRUE)
lines(density(revenue.year.month$price))
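With prob=TRUE the bar heights are densities rather than counts, so the bar areas add up to 1 and the kernel density curve sits on the same scale. The sketch below checks this on simulated monthly revenue; the 36 values (three years of twelve months), their mean and their spread are made-up assumptions.

```r
# Hypothetical monthly revenue: 36 simulated values, not the original data
set.seed(1)   # arbitrary seed, for reproducibility only
revenue <- rnorm(36, mean = 13000, sd = 2500)

# Density-scaled histogram; hist() also returns the bin breaks and heights
h <- hist(revenue, breaks = 20, prob = TRUE, col = heat.colors(20),
          main = "Distribution of revenue per month",
          xlab = "Revenue per month")

# Kernel density estimate drawn over the bars
lines(density(revenue))

# Bar area = height * width; summed over all bins this is 1
bar.area <- sum(h$density * diff(h$breaks))
bar.area
```

Note that breaks=20 is treated by hist() as a suggestion, so the number of bins actually drawn may differ.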
Bivariate graphs
Bar & line plot – Revenue by month
# Draw the plot
revenue.bar <- barplot(revenue.month$price, names.arg = labels,
  ylab="Revenue [$]", main="2016 Revenue by month")
lines(x=revenue.bar, y=revenue.month$units*100)
points(x=revenue.bar, y=revenue.month$units*100)
Line plot & trend – Revenue by month
# Draw the plot
months <- 1:12
plot(price ~ month, data=revenue.month,
  xaxt="n", type="l", ylab="Revenue [$]", xlab="", main="Revenue in 2016")
axis(1, at=months, labels=labels)
# Display the trend
lines(c(1,12), c(25000, 12000), type="l",
  lty=2, col="blue")
legend("topright", c("Revenue", "Trend"),
  col=c("black", "blue"), lty=1:2)
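The trend line on the slide is drawn between two hand-picked endpoints. One alternative, sketched here and not taken from the slides, is to fit a least-squares line with lm() and draw it with abline(). The monthly figures below are hypothetical values chosen to fall from about 24000 to about 12500, and month.abb stands in for the slide's labels vector.

```r
# Hypothetical 2016 monthly revenue (made-up numbers, not the original data)
revenue.month <- data.frame(
  month = 1:12,
  price = c(24000, 23500, 22000, 21000, 20500, 19000,
            18500, 17000, 16500, 15000, 13500, 12500)
)

# Least-squares trend: price as a linear function of month
fit <- lm(price ~ month, data = revenue.month)
coef(fit)   # intercept and slope (change in revenue per month)

# Line plot with month abbreviations on the x axis
plot(price ~ month, data = revenue.month, type = "l", xaxt = "n",
     ylab = "Revenue [$]", xlab = "", main = "Revenue in 2016")
axis(1, at = 1:12, labels = month.abb)

# abline() accepts a fitted lm object and draws its regression line
abline(fit, lty = 2, col = "blue")
legend("topright", c("Revenue", "Trend"),
       col = c("black", "blue"), lty = 1:2)
```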
Line plot & trend – Revenue by Units
# Draw the plot
plot(price~units, data=revenue.month, xlab="Units",
  ylab="Revenue [$]", main="Revenue by Units in 2016")
lines(c(30, 380), c(3000, 35000), type='l', lty=2, col="blue")
legend("topleft", c("revenue/freq", "trend"),
  col=c("black", "blue"), lty=c(0,2), pch=c(21, -1))
Line plot & trend – Revenue by Units
# Draw the plot
plot(price~units, data=revenue.month.area, xlab="Units",
  ylab="Revenue [$]", col=area, main="Revenue by Units (All years)")
legend("topleft", legend=levels(revenue.month.area$area),
  col=1:length(levels(revenue.month.area$area)), pch=21, text.width = 30)
base vs. lattice vs. ggplot2
Stacked Bar chart – base vs. lattice
# base
barplot(rya, col=rainbow(4), ylab="Count [$]",
  main="Revenue by Year & Area")
legend("topright", fill=rainbow(4), legend=row.names(rya))
# lattice
library(lattice)
barchart(Cloud + `Data Science` + `Software Engineering` + Web ~ year,
  data=t(rya), auto.key=TRUE, stack=TRUE, horizontal=FALSE,
  ylab="Count [$]", main="Areas by Year")
Stacked Bar chart – base vs. ggplot2
# base
barplot(rya, col=rainbow(4), ylab="Count [$]",
  main="Revenue by Year & Area")
legend("topright", fill=rainbow(4), legend=row.names(rya))
# ggplot2
library(ggplot2)
ggplot(revenue.year.area, aes(x = year, y=price, fill = area)) +
  geom_bar(stat = "identity") +
  ggtitle("Revenue by Year & Area") +
  ylab("Count [$]")
Histogram – base vs. lattice
# base
hist(revenue.year.month$price, main="Distribution of revenue per month",
  xlab="Revenue per month", breaks=20, col=heat.colors(20))
# lattice
histogram(~price, data=revenue.year.month,
  main="Distribution of revenue per month", xlab="Revenue per month",
  breaks = 20, type = "count", col=heat.colors(20))
Histogram – base vs. ggplot2
# base
hist(revenue.year.month$price, main="Distribution of revenue per month",
  xlab="Revenue per month", breaks=20, col=heat.colors(20))
# ggplot2
ggplot(revenue.year.month, aes(x = price)) +
  geom_histogram(stat = "bin", binwidth=2500, aes(fill=..count..)) +
  ggtitle("Distribution of revenue per month") +
  xlab("Revenue per month")
Box plot – base vs. lattice
boxplot(price~year, data=revenue.year.month, col=2:4, main="Revenue by Year", xlab="Year", ylab="Revenue")
# lattice
library(lattice)
bwplot(price ~ factor(year), data=revenue.year.month, main="Revenue by Year", xlab="Year", ylab="Revenue")
Box plot – base vs. ggplot
boxplot(price~year, data=revenue.year.month, col=2:4, main="Revenue by Year", xlab="Year", ylab="Revenue")
ggplot(revenue.year.month, aes(x=factor(year), y=price)) + geom_boxplot(aes(fill=factor(year))) + ggtitle("Total by Year") + ylab("Revenue") + xlab("Year")
Scatter plot – base vs. lattice
# base
plot(price~units, data=revenue.month.area, xlab="Units",
  ylab="Revenue [$]", col=area, main="Revenue by Units (All years)")
# And you need legend manually created
# lattice
xyplot(price~units, data=revenue.month.area, xlab="Units",
  ylab="Revenue [$]", pch=19, group = area, auto.key = TRUE)
Scatter plot – base vs. ggplot2
# base
plot(price~units, data=revenue.month.area, xlab="Units",
  ylab="Revenue [$]", col=area, main="Revenue by Units (All years)")
# And you need legend manually created
# ggplot2
ggplot(revenue.month.area, aes(x=units, y=price)) +
  geom_point(aes(col=area)) +
  ggtitle("Revenue by Units (All years)") +
  ylab("Revenue [$]") + xlab("Units")
ggplot2 & layers
Scatter plot
# Draw the dots
ggplot(revenue.month, aes(x=units, y=total)) +
  geom_point()
Scatter plot – Colours per area
# Draw the dots
ggplot(revenue.month, aes(x=units, y=total)) +
  geom_point(aes(col=area))
Scatter plot – Labels
# Draw the dots
ggplot(revenue.month, aes(x=units, y=total)) +
  geom_point(aes(col=area)) +
  ggtitle("Revenue by Units (All years)") +
  ylab("Revenue [$]") + xlab("Units")
Scatter plot – Dots’ size
# Draw the dots
ggplot(revenue.month, aes(x=units, y=total)) +
  geom_point(aes(col=area, size=dltotal)) +
  ggtitle("Revenue by Units (All years)") +
  ylab("Revenue [$]") + xlab("Units")
Scatter plot – Lines
# Draw the dots
ggplot(revenue.month, aes(x=units, y=total)) +
  geom_point(aes(col=area)) +
  geom_line() +
  ggtitle("Revenue by Units (All years)") +
  ylab("Revenue [$]") + xlab("Units")
Scatter plot – ab line
# Draw the dots
ggplot(revenue.month, aes(x=units, y=total)) +
  geom_point(aes(col=area)) +
  geom_abline(intercept = 0, slope = 110) +
  ggtitle("Revenue by Units (All years)") +
  ylab("Revenue [$]") + xlab("Units")
Scatter plot – smooth line
# Draw the dots
ggplot(revenue.month, aes(x=units, y=total)) +
  geom_point(aes(col=area)) +
  stat_smooth() +
  ggtitle("Revenue by Units (All years)") +
  ylab("Revenue [$]") + xlab("Units")
Scatter plot – smooth line
# Draw the dots
ggplot(revenue.month, aes(x=units, y=total)) +
  geom_point(aes(col=area)) +
  stat_smooth() +
  ggtitle("Revenue by Units (All years)") +
  ylab("Revenue [$]") + xlab("Units") +
  theme(legend.title=element_text(
    colour="chocolate", size=16, face="bold"))
Scatter plot – smooth line
# Draw the dots
ggplot(revenue.month, aes(x=units, y=total)) +
  geom_point(aes(col=area)) +
  stat_smooth() +
  ggtitle("Revenue by Units (All years)") +
  ylab("Revenue [$]") + xlab("Units") +
  theme(legend.title=element_text(
    colour="chocolate", size=16, face="bold")) +
  scale_color_discrete(name="Learning Areas")
Scatter plot – smooth line
# Draw the dots
ggplot(revenue.month, aes(x=units, y=total)) +
  geom_point(aes(col=area)) + ...
  theme(legend.title=element_text(
    colour="chocolate", size=16, face="bold")) +
  scale_color_discrete(name="Learning Areas") +
  guides(colour = guide_legend(
    override.aes = list(size=4)))
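Because every + adds to an ordinary R object, a ggplot2 plot can be built up step by step and inspected along the way. The sketch below assumes the ggplot2 package is installed and uses a five-row, made-up revenue.month data frame. It fits a straight line with geom_smooth(method = "lm") instead of the default smoother used on the slides, because the default would have too few points to work with here.

```r
library(ggplot2)

# Hypothetical data: five made-up points, not the original revenue.month
revenue.month <- data.frame(
  units = c(30, 120, 200, 310, 380),
  total = c(3000, 12500, 19000, 30500, 35000),
  area  = c("Cloud", "Web", "Web", "Data Science", "Data Science")
)

# Start with data and shared aesthetics only; no layer is drawn yet
p <- ggplot(revenue.month, aes(x = units, y = total))

# Layer 1: points, coloured by area
p <- p + geom_point(aes(colour = area))

# Layer 2: a straight-line fit through all points, without confidence band
p <- p + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Labels are not layers; they change titles and the legend name
p <- p + labs(title = "Revenue by Units (All years)",
              x = "Units", y = "Revenue [$]", colour = "Learning Areas")

length(p$layers)   # number of layers added so far
print(p)           # rendering happens only when the plot is printed
```

Keeping the intermediate object in a variable makes it easy to try a different geom or theme without repeating the earlier steps.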
ggplot2 & maps (ggmap)
Treemap – Revenue by Vendor
# Draw the plot
library(treemap)
treemap(courses.aggregate, index=c("vendor"), vSize="price",
  title="Revenue per vendor", type="index")
Interactive and dynamic graphs
• plotly
• ggiraph
• D3.js
• streamgraph
• animation
plotly - Interactive graphs
# Draw the plot
library(plotly)
plot_ly(revenue.month.vendor, x=~units, y=~total, mode="markers",
  color = ~factor(area), size=~dltotal/1000,
  text=~paste("Units:", units, "</br>Revenue", total,
    "</br>DataLearn cut:", dltotal),
  hoverinfo="text", type="scatter") %>%
  layout(title="Revenue per vendor",
    xaxis=list(title="Units"), yaxis=list(title="Revenue [$]"))
Make an interactive graph from ggplot
# Draw the plot
library(plotly)
ggbar <- ggplot(revenue.year.area, aes(x = year, y=price, fill = area)) +
  geom_bar(stat = "identity")
ggplotly(ggbar)
Network visualisation
• igraph
• ggnet
• ggnetwork
• ggraph
• visNetwork
• sna
igraph – Courses taken by Users
# Draw the plot
user.area <- data.frame(
  user=courses.aggregate$name, area=courses.aggregate$area)
user.area <- user.area[
  sample(1:500, 50, replace=FALSE),]
user.area <- aggregate(
  cbind(user.area[0], width=1), user.area, length)
# Build the graph
library(igraph)
user.area.graph <- graph.data.frame(
  user.area, directed = FALSE, vertices=vertices)
plot(user.area.graph, main="Courses taken by users")
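The aggregate() call on the slide turns repeated user/area pairs into a single edge whose width is the number of repetitions. The sketch below writes that step in a more explicit form that should yield the same counts. The four pairings are hypothetical, and the graph itself is not built so that the example needs only base R.

```r
# Hypothetical user/area pairs; one user took two Web courses
user.area <- data.frame(
  user = c("Perez, Lisa", "Perez, Lisa", "Tran, Janiro", "Perez, Lisa"),
  area = c("Web", "Web", "Cloud", "Cloud")
)

# Give every row a weight of 1, then sum the weights per user/area pair
edges <- aggregate(width ~ user + area,
                   data = transform(user.area, width = 1),
                   FUN = sum)
print(edges)
```

The first two columns of the result act as the edge list and the remaining column as an edge attribute, which is the layout igraph's data-frame constructor reads.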
visNetwork – Dynamic Networks
# Draw the plot
library(visNetwork)
visNetwork(nodes, edges, main="Courses taken by users")
Circular graph – Area per Vendor
# Prepare the data
area.vendor <- data.frame(
  area=courses.merge$areaname, vendor=courses.merge$vname)
circular.data <- with(area.vendor, table(vendor, area))
# Draw the plot
library(circlize)
chordDiagram(
  as.data.frame(circular.data), transparency = 0.5)
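The as.data.frame() call is what turns the contingency table into the from/to/value layout used for the chord diagram: one row per vendor/area combination, with the count in a Freq column. A minimal base R sketch on the six sample rows from the courses.aggregate slide follows; the drawing step is left out so that circlize is not required.

```r
# Area and vendor of the six sample rows from the courses.aggregate slide
area.vendor <- data.frame(
  area   = c("Data Science", "Software Engineering", "Cloud",
             "Web", "Software Engineering", "Web"),
  vendor = c("Data Foundation", "DataLearn", "InfraNet",
             "Software Mind", "Learning Island", "Design Your Way")
)

# Wide form: vendors in rows, areas in columns
circular.data <- with(area.vendor, table(vendor, area))

# Long form: one row per vendor/area pair with its count in Freq
links <- as.data.frame(circular.data)
names(links)

# Only pairs that actually occur produce a visible link
links[links$Freq > 0, ]
```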
Keep in touch
BarbaraFusinska.com
@BasiaFusinska