Amit Kapoor@amitkaps
Visualising Big Data
Visualise Million Data Points
x <- rnorm(1000000, mean=0, sd=2)
y <- rnorm(1000000, mean=0, sd=2)
xy <- data.frame(x,y)
Same order as the number of pixels
on my MacBook Air (1440 x 900)
Data
Data Sample
Sampling can be effective (with overweighting of
unusual values)
But it requires multiple plots or careful
tuning of parameters
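The idea of overweighting unusual values can be sketched in base R: sample a small fraction of a large vector, but give extreme points a higher selection probability so rare tail values survive the sample. The weighting scheme below is a hypothetical illustration, not a method from the talk.

```r
# Sketch (base R, hypothetical weights): sample 1% of a large dataset,
# overweighting unusual (extreme) values so rare points survive.
set.seed(42)
x <- rnorm(1e6, mean = 0, sd = 2)

# Weight each point by its distance from the centre: tail values are
# rare but visually important, so they get a higher sampling probability.
w <- abs(x - mean(x)) + 0.1          # +0.1 keeps central points eligible
idx <- sample(seq_along(x), size = 1e4, prob = w)
x_sample <- x[idx]

length(x_sample)                      # 10,000 points instead of 1,000,000
```

The catch, as the slide notes, is tuning: the weight function itself is a parameter, and a single weighted sample may still need several complementary plots.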
Data Sample
Model
Models are great as they scale nicely.
But visualisation is still required, as
“I don’t know what I don’t know.”
Data Sample
Model
Binning
Binning can solve a lot of these challenges
“Bin - Summarize - Smooth: A framework
for visualising big data” - Hadley Wickham (2013)
“Visualising big data is the process of creating generalized histograms”
Approach
BIN : fixed-width bins, bin index = floor((x - origin) / width)
SUMMARIZE : summary stats = count, mean, stdev
SMOOTH : smoothing e.g. kernel mean, regression
VISUALISE : visualise using standard plots
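The four steps above can be sketched in plain base R (illustrative only; bigvis does the heavy lifting in compiled C++):

```r
# Minimal base-R sketch of the Bin - Summarize - Smooth - Visualise pipeline.
set.seed(1)
x <- rnorm(1e5, mean = 0, sd = 2)

# BIN: fixed-width bins, bin index = floor((x - origin) / width)
origin <- min(x); width <- 0.5
bin_id <- floor((x - origin) / width)

# SUMMARIZE: per-bin count (could also be mean or sd of a second variable)
counts <- table(bin_id)

# SMOOTH: running mean over a window of neighbouring bins
k <- 5
smoothed <- stats::filter(as.numeric(counts), rep(1 / k, k), sides = 2)

# VISUALISE: a standard plot of the condensed data
bin_centres <- origin + (as.numeric(names(counts)) + 0.5) * width
plot(bin_centres, smoothed, type = "l",
     xlab = "x", ylab = "smoothed count")
```

The key point is that after binning, the plot works on the number of bins (hundreds), not the number of raw points (millions).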
Bigvis Package in R
Aim: To plot 100 million points in under 5 seconds.
Approach:
- Plotting using standard R libraries
- Processing done in (fast) compiled C++ code, using Rcpp package
- Outlier removal in big data
- Smoothing to highlight trends & suppress noise
Diamonds dataset
ggplot(diamonds) + aes(carat, price) + geom_point(alpha = 0.2, colour = "orange")
~54k observations, e.g. price and carat of diamonds
Condense (bin + summarise)
library(bigvis)
library(ggplot2)
Nbin <- 20
BinData <- with(diamonds, condense(
  bin(carat, find_width(carat, Nbin)),
  bin(price, find_width(price, Nbin))))
Plotting the Condensed Data
p <- ggplot(BinData) + aes(carat, price, fill=.count) + geom_tile()
Created 20 bins and summarised using count
Both Points & Condensed
q <- p + geom_point(data = diamonds, aes(fill = NULL), alpha = 0.2, colour = "orange")
Created 20 bins, summarised using count & added the base data
Movies dataset
ggplot(movies) + aes(length, rating) + geom_point(alpha = 0.2, colour = "orange")
130k observations, e.g. length and rating of movies on IMDB
Let us see the outliers
  title                                            length  rating
1 Matrjoschka                                        5700     8.5
2 The Cure for Insomnia                              5220     5.9
3 The Longest Most Meaningless Movie in the World    2880     7.3
4 The Hazards of Helen                               1428     6.6
5 ****                                               1100     6.9
Condense (bin + summarise)
library(bigvis)
library(ggplot2)
Nbin <- 1e4
BinData <- with(movies, condense(
  bin(length, find_width(length, Nbin)),
  bin(rating, find_width(rating, Nbin))))
Condensed Plot
p <- ggplot(BinData) + aes(length, rating, fill=.count) + geom_tile()
Create bins = 10000 and summarized using count
Remove Outliers
p %+% peel(BinData)
Created 10,000 bins, summarised using count & peeled 1% outliers
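Conceptually, peel() drops the sparsest outlying bins so that roughly the central 99% of the data remains. A base-R analogue (an assumed approximation, not bigvis's exact algorithm) trims by quantiles instead:

```r
# Base-R sketch of outlier peeling: keep the central ~99% of the data
# by trimming the extreme 0.5% on each side (hypothetical data).
set.seed(7)
movie_length <- rnorm(1e5, mean = 100, sd = 20)

keep <- movie_length > quantile(movie_length, 0.005) &
        movie_length < quantile(movie_length, 0.995)
trimmed <- movie_length[keep]
```

This removes extreme entries like the 5700-minute Matrjoschka so the plot's axes are no longer dominated by a handful of points.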
Smoothing
smoothBinData <- smooth(peel(BinData), h = c(20, 1))
autoplot(smoothBinData)
Created 10,000 bins, summarised using count, peeled 1% outliers & smoothed
Big Data Visualisation
● Approach: Bin - Summarize - Smooth - Visualise
● “Interactively” plot nearly 100 million data points in-memory for EDA in R
● Can be extended to in-database processing, e.g. for binning
● Can be parallelised, e.g. summarise on count, mean
Amit Kapoor
@amitkaps
amitkaps.com
narrativeviz.com
Data
Visual
Story