R Statistics with MongoDB
Dr. Markus Schmidberger
October 14th, 2013, Munich, Germany
Email: [email protected]
Twitter: @cloudHPC

Outline
Introduction to Big Data, MongoSoup and R
R statistics with MongoDB and Examples
Summary & Questions

Big Data
Wikipedia: … a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing. …
storing
processing

Storing: NoSQL - MongoDB
databases using looser consistency models to store data
German MongoDB as a Service: MongoSoup
cloudControl Add-On
currently running on AWS EU-Region (Ireland)
all features available: shared / dedicated hosting, replicaset, sharding
24/7 support available

MongoSoup in < 5 min
go to cloudControl: www.cloudcontrol.com
add an account and a billing address
create a new app, e.g. “rmongodb”
install cloudControl command line tools: cctrlapp
enable your preferred MongoSoup hosting: cctrlapp rmongodb/default addon.add mongosoup.medium
go to the cloudControl Web-Console-AddOns and get your credentials: https://www.cloudcontrol.com/console/app/rmongodb

Processing: Analyzing with R and Hadoop
backward-looking analysis is outdated
today: quasi real-time analysis
tomorrow: forward-looking predictive analysis
more complex methods, more data available, more processing time required
Check my Strata London Tutorial “Big Data Analyses with R”

Introduction to R
R is a free software environment for statistical computing and graphics
offers tools to manage and analyze data
standard statistical methods are implemented
compiles and runs under different OS
support via huge community
www.r-project.org

huge online-libraries with > 5000 R-packages: http://cran.r-project.org
possibility to write personalized code and to contribute new packages
really famous since January 6, 2009: The New York Times, “Data Analysts Captivated by R's Power”

RStudio IDE
http://www.rstudio.com

R as calculator
(5+5) - 1 * 3
[1] 7
x <- 3
x
[1] 3
x^2 + 4
[1] 13

y <- c(1,2,3)
y
[1] 1 2 3
x <- 1:10
x
[1] 1 2 3 4 5 6 7 8 9 10
x < 5
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE

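The logical vector returned by `x < 5` can be used directly as an index. A minimal sketch in base R, reusing the `x` from the slide (the names `small` and `n_small` are only illustrative):

```r
# Logical indexing: keep only the elements for which the condition is TRUE
x <- 1:10
small <- x[x < 5]        # elements of x that are smaller than 5
n_small <- sum(x < 5)    # TRUE counts as 1, so sum() counts the matches
print(small)
print(n_small)
```

The same pattern works for rows of a data.frame, which is how most filtering in R is written.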
x[3:7]
[1] 3 4 5 6 7
mean(x)
[1] 5.5
help("mean")
?mean

Many Statistical Functions
kmeans(dat, 4)
K-means clustering with 4 clusters of sizes 21, 18, 30, 31
Cluster means:
     [,1]    [,2]
1  0.7755  0.8509
2 -0.1557 -0.2305
3  1.2299  1.1472
4  0.1510  0.1507
Clustering vector:
  [1] 4 2 4 4 2 4 4 4 2 4 4 4 2 2 4 4 1 4 2 2 2 4 4 4 2 4 2 4 4 2 4 2 2 4 4
 [36] 4 4 4 4 4 4 4 4 2 4 2 2 4 2 2 1 1 1 1 3 1 3 3 3 1 1 3 3 3 3 1 3 1 3 3
 [71] 1 3 1 1 3 3 3 3 1 1 3 3 1 1 1 3 3 3 3 1 3 1 3 3 3 3 1 3 3 3
Within cluster sum of squares by cluster:
[1] 3.318 1.166 4.019 3.195
 (between_SS / total_SS = 83.0 %)
Available components:
[1] "cluster"      "centers"      "totss"        "withinss"
[5] "tot.withinss" "betweenss"    "size"

# store the clustering result so that it can be used for plotting
cl <- kmeans(dat, 4)
plot(dat, col = cl$cluster, cex = 2, pch = 16)
points(cl$centers, col = 1:4, pch = 13, cex = 4)

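The slides do not show how `dat` was created. The printed output resembles the example in `?kmeans`, so the following sketch assumes two simulated point clouds of 50 points each. The seed and the `nstart` value are arbitrary choices, so the cluster sizes will differ from the ones printed above.

```r
# Assumption: 'dat' is simulated in the style of the example in ?kmeans
set.seed(42)  # arbitrary seed, only to make this sketch reproducible
dat <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),  # 50 points around (0, 0)
  matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)   # 50 points around (1, 1)
)

# Run k-means with 4 clusters; nstart tries several random starts
cl <- kmeans(dat, centers = 4, nstart = 10)

table(cl$cluster)  # number of points per cluster
cl$centers         # one row of coordinates per cluster centre
```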
R Shiny - easy web application
developed by RStudio
turns R analyses into interactive web applications that anyone can use
let your users choose input parameters using friendly controls like sliders, drop-downs, and text fields
easily incorporate any number of outputs like plots, tables, and summaries
no HTML or JavaScript knowledge is necessary, only R
http://www.rstudio.com/shiny/

R and Databases
SQL provides a standard language to filter, aggregate, group, sort data
SQL in new places: Hive, Impala, …
ODBC provides SQL interface to non-database data (Excel, CSV, text files)
R stores relational data in data.frames (extended lists)

data(iris)
head(iris, n=3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
class(iris)
[1] "data.frame"

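That a data.frame is an extended list can be checked directly. A small base R sketch on the built-in `iris` data (the variable names are only illustrative):

```r
data(iris)

# A data.frame is a list with one element per column
is_list <- is.list(iris)

# Each column keeps its own type
col_classes <- sapply(iris, class)

# List-style access returns the column as a plain vector
sepal <- iris[["Sepal.Length"]]

dims <- dim(iris)  # rows and columns
print(col_classes)
print(dims)
```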
R package: sqldf
running SQL statements on R data frames
library(sqldf)
sqldf("select * from iris limit 2")
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
sqldf("select count(*) from iris")
  count(*)
1      150

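For comparison, the usual SQL operations (limit, count, filter, group, sort) also have base R counterparts. This is only a sketch for orientation. The SQL in the comments is an approximate equivalent, not output of sqldf.

```r
data(iris)

first_two <- head(iris, 2)                  # select * from iris limit 2
n_rows    <- nrow(iris)                     # select count(*) from iris
wide      <- subset(iris, Sepal.Width > 4)  # ... where Sepal_Width > 4

# select Species, avg(Sepal_Length) from iris group by Species
avg_len <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# select * from iris order by Petal_Length desc
sorted <- iris[order(-iris$Petal.Length), ]

print(avg_len)
```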
Other relational R packages
RMySQL package provides an interface to MySQL
RPostgreSQL package provides an interface to PostgreSQL
ROracle package provides an interface for Oracle
RJDBC package provides access to databases through a JDBC interface
RSQLite package provides access to SQLite (SQLite engine is included)
One big problem: all packages read the full result into R memory

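One common way to limit memory use is to fetch a result in chunks and aggregate as you go. The sketch below assumes that the DBI and RSQLite packages are installed and that the back end supports incremental fetching. The in-memory database and the chunk size of 50 are arbitrary choices for illustration.

```r
# Assumption: the DBI and RSQLite packages are installed
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")  # throwaway in-memory database
dbWriteTable(con, "iris", iris)

# Send the query without fetching anything yet
res <- dbSendQuery(con, "SELECT * FROM iris")

total <- 0
n <- 0
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 50)            # at most 50 rows in memory at a time
  total <- total + sum(chunk[["Sepal.Length"]])
  n <- n + nrow(chunk)
}

dbClearResult(res)
dbDisconnect(con)

chunked_mean <- total / n  # mean computed without holding the full result
print(chunked_mean)
```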
R and MongoDB
on CRAN there are two packages to connect R with MongoDB
rmongodb
  supported by MongoDB, Inc.
  powerful for big data
  difficult to use due to BSON objects
RMongo
  easy to use
  limited functionality
  reads full results in R memory
  does not work on Mac OS X

R package: RMongo
library(RMongo)
mongo <- mongoDbConnect("cc_JwQcDLJSYQJb", "dbs001.mongosoup.de", 27017)
dbAuthenticate(mongo, username="JwQcDLJSYQJb", password="RSXPkUkXXXXX")
dbShowCollections(mongo)
dbGetQuery(mongo, "zips", "{'state':'AL'}")
dbInsertDocument(mongo, "test_data", '{"foo": "bar", "size": 5 }')
dbDisconnect(mongo)

R package: rmongodb
developed on top of the MongoDB supported C driver
library(rmongodb)
mongo <- mongo.create(host="dbs001.mongosoup.de", db="cc_JwQcDLJSYQJb", username="JwQcDLJSYQJb", password="RSXPkUkXXXXX")
mongo
[1] 0
attr(,"mongo")
<pointer: 0x105a1de80>
attr(,"class")
[1] "mongo"
attr(,"host")
[1] "dbs001.mongosoup.de"
attr(,"name")
[1] ""
attr(,"username")
[1] "JwQcDLJSYQJb"
attr(,"password")
[1] "RSXPkUkxRdOX"
attr(,"db")
[1] "cc_JwQcDLJSYQJb"
attr(,"timeout")
[1] 0

mongo.get.database.collections(mongo, "cc_JwQcDLJSYQJb")
[1] "cc_JwQcDLJSYQJb.zips" "cc_JwQcDLJSYQJb.ccp" "cc_JwQcDLJSYQJb.test"
mongo <- mongo.disconnect(mongo)
buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "state", "AL")
[1] TRUE
query <- mongo.bson.from.buffer(buf)
query
state : 2 AL

res <- mongo.find.one(mongo, "cc_JwQcDLJSYQJb.zips", query)
res
city : 2 ACMAR
loc : 4
  0 : 1 -86.515570
  1 : 1 33.584132
pop : 16 6055
state : 2 AL
_id : 2 35004

out <- mongo.bson.to.list(res)
out$loc
[1] -86.52 33.58
typeof(out$loc)
[1] "double"
out$pop
[1] 6055
out$state
[1] "AL"

cursor <- mongo.find(mongo, "cc_JwQcDLJSYQJb.zips", query)
res <- NULL
while (mongo.cursor.next(cursor)) {
  value <- mongo.cursor.value(cursor)
  Rvalue <- mongo.bson.to.list(value)
  res <- rbind(res, Rvalue)
}
err <- mongo.cursor.destroy(cursor)
head(res, n=4)
       city         loc       pop   state _id
Rvalue "ACMAR"      Numeric,2 6055  "AL"  "35004"
Rvalue "ADAMSVILLE" Numeric,2 10616 "AL"  "35005"
Rvalue "ADGER"      Numeric,2 3205  "AL"  "35006"
Rvalue "KEYSTONE"   Numeric,2 14218 "AL"  "35007"

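Growing `res` with `rbind()` inside the loop copies the result on every iteration and yields a matrix of list cells rather than a data.frame. A possible alternative is to collect the records in a list and bind them once at the end. In this sketch, plain R lists stand in for the output of `mongo.bson.to.list()`, so no database connection is needed. The second `loc` value is a made-up placeholder, and `to_row` is a hypothetical helper.

```r
# Stand-ins for the lists returned by mongo.bson.to.list();
# the loc of the second record is a made-up placeholder
docs <- list(
  list(city = "ACMAR", loc = c(-86.51557, 33.584132),
       pop = 6055, state = "AL", `_id` = "35004"),
  list(city = "ADAMSVILLE", loc = c(-86.9, 33.6),
       pop = 10616, state = "AL", `_id` = "35005")
)

# Hypothetical helper: flatten one record into a one-row data.frame
to_row <- function(d) {
  data.frame(
    city  = d$city,
    lon   = d$loc[1],   # split the two-element loc vector into columns
    lat   = d$loc[2],
    pop   = d$pop,
    state = d$state,
    id    = d[["_id"]],
    stringsAsFactors = FALSE
  )
}

# Convert every record, then bind all rows in a single step
zips <- do.call(rbind, lapply(docs, to_row))
str(zips)
```

With a real cursor, the loop body would append each converted record to a list instead of calling `rbind()` each time.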
It is all about creating BSON query or field objects
b <- mongo.bson.from.list(list(name="Fred", age=29, city="Boston"))
b
name : 2 Fred
age : 1 29.000000
city : 2 Boston
mongo.bson.to.list(b)
$name
[1] "Fred"
$age
[1] 29
$city
[1] "Boston"

?mongo.bson
?mongo.bson.buffer.append
?mongo.bson.buffer.start.array
?mongo.bson.buffer.start.object
# builds { aggregate: "zips", pipeline: [ {$group: ...}, {$match: ...} ] }
buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "aggregate", "zips")
mongo.bson.buffer.start.array(buf, "pipeline")
# assumption: each array element is a sub-object named by its index ("0", "1")
mongo.bson.buffer.start.object(buf, "0")
mongo.bson.buffer.start.object(buf, "$group")
mongo.bson.buffer.append(buf, "_id", "$state")
mongo.bson.buffer.start.object(buf, "totalPop")
mongo.bson.buffer.append(buf, "$sum", "$pop")
mongo.bson.buffer.finish.object(buf)  # totalPop
mongo.bson.buffer.finish.object(buf)  # $group
mongo.bson.buffer.finish.object(buf)  # element "0"
mongo.bson.buffer.start.object(buf, "1")
mongo.bson.buffer.start.object(buf, "$match")
mongo.bson.buffer.start.object(buf, "totalPop")
mongo.bson.buffer.append(buf, "$gte", 10000)  # numeric, not the string "10000"
mongo.bson.buffer.finish.object(buf)  # totalPop
mongo.bson.buffer.finish.object(buf)  # $match
mongo.bson.buffer.finish.object(buf)  # element "1"
mongo.bson.buffer.finish.object(buf)  # pipeline array
query <- mongo.bson.from.buffer(buf)

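What the `$group` and `$match` stages are meant to compute can be reproduced in base R on a small data.frame: sum `pop` per state, then keep the states whose total reaches the threshold. The four AL values are the ones shown on the earlier slide. The state "XX" and its values are made up for illustration.

```r
# Small stand-in for the zips collection; the "XX" rows are hypothetical
zips <- data.frame(
  state = c("AL", "AL", "AL", "AL", "XX", "XX"),
  pop   = c(6055, 10616, 3205, 14218, 1200, 800),
  stringsAsFactors = FALSE
)

# $group with $sum: total population per state
total_pop <- aggregate(pop ~ state, data = zips, FUN = sum)
names(total_pop)[2] <- "totalPop"

# $match with $gte: keep states with at least 10000 inhabitants
big <- total_pop[total_pop$totalPop >= 10000, ]
print(big)
```

The difference is that this version needs all rows in R memory, whereas the aggregation command runs inside MongoDB and returns only the grouped result.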
CCP Web Analytics Challenge
# an empty query matches all documents
buf <- mongo.bson.buffer.create()
query <- mongo.bson.from.buffer(buf)
# return only the fields "user" and "type"
buf <- mongo.bson.buffer.create()
err <- mongo.bson.buffer.append(buf, "user", 1)
err <- mongo.bson.buffer.append(buf, "type", 1)
field <- mongo.bson.from.buffer(buf)
out <- mongo.find(mongo, "cc_JwQcDLJSYQJb.ccp", query, fields=field, limit=1000)
res <- NULL
while (mongo.cursor.next(out)) {
  value <- mongo.cursor.value(out)
  Rvalue <- mongo.bson.to.list(value)
  res <- rbind(res, Rvalue)
}

boxplot( as.integer(table(unlist(res[,2])) ), cex=4, horizontal=TRUE, main="Number of actions per user")
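The counting step inside the `boxplot()` call can be tried on its own. In this sketch the vector `users` is hypothetical and holds one user id per logged action. `table()` counts the actions per user and `as.integer()` drops the names.

```r
# Hypothetical user ids, one entry per logged action
users <- c("u1", "u2", "u1", "u3", "u1", "u2")

actions_per_user <- table(users)          # count of actions for each user
counts <- as.integer(actions_per_user)    # plain integer vector for plotting

# Same kind of plot as on the slide, drawn from the counts
boxplot(counts, horizontal = TRUE, main = "Number of actions per user")
print(actions_per_user)
```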
Shiny Mongo
R based MongoDB User Interface
R packages shiny and rmongodb
less than 200 lines of code
DEMO: http://localhost:8100
https://github.com/comsysto/ShinyMongo

Summary
R is a powerful statistical tool to analyse many different kinds of data
R can access databases
MongoDB and rmongodb are ready for Big Data
start playing around with R, Big Data and MongoDB
http://www.r-project.org
http://www.mongodb.org
http://www.mongosoup.de

See you soon
thanks a lot for your attention
there are R trainings in December 2013 in Munich
we are hosting many events and meetups
meet you at the MongoSoup booth
Email: [email protected]
Twitter: @cloudHPC
http://comsysto.com/events.html#r