Skyrocket your Analytics MongoDB Meetup on December 10, 2012 www.precog.com @precogio Nov - Dec 2012
Dec 10, 2014
■ Welcome to the Precog & MongoDB Meetup!
■ Questions? Please ask away!
welcome & agenda
7:00 - 7:30 Overview of Precog for MongoDB by Derek Chen-Becker
7:30 - 7:45 Break (grab a beer, drink and snacks)
7:45 - 8:15 Analyzing Big Data with Quirrel by John A. De Goes
8:15 - 8:30 Precog Challenge Problems! Win some prizes!
■ Precog Team
Derek Chen-Becker, Lead Infrastructure Engineer
John A. De Goes, CEO/Founder
Kris Nuttycombe, Dir of Engineering
Nathan Lubchenco, Developer Evangelist
■ MongoDB Host
Clay Mcllrath
■ Thank you to Google for hosting us!
who we are
Current MongoDB Support for Analytics
Derek Chen-Becker, Precog Lead Infrastructure Engineer, @dchenbecker
■ Mongo has support for a small set of simple aggregation primitives
○ count - returns the count of a given collection's documents with optional
filtering
○ distinct - returns the distinct values for given selector criteria
○ group - returns groups of documents based on given key criteria. Group
cannot be used in sharded configurations
current mongodb support for analytics
> db.london_medals.group({
    key : {"Country":1},
    reduce : function(curr, result) { result.total += 1 },
    initial: { total : 0, fullTotal: db.london_medals.count() },
    finalize: function(result){ result.percent = result.total * 100 / result.fullTotal }
})
[
{"Country" : "Great Britain", "total" : 88, "fullTotal" : 1019, "percent" : 8.635917566241414},
{"Country" : "Dominican Republic", "total" : 2, "fullTotal" : 1019, "percent" : 0.19627085377821393},
{"Country" : "Denmark", "total" : 16, "fullTotal" : 1019, "percent" : 1.5701668302257115},
...
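To make the roles of key, initial, reduce and finalize concrete, here is a rough in-memory imitation in plain JavaScript. It is only a sketch of the data flow, not how MongoDB implements group; groupBy and the sample medals array are hypothetical names introduced for illustration.

```javascript
// In-memory imitation of how the group command applies its options.
// groupBy() is a hypothetical helper, not a MongoDB API.
function groupBy(docs, keyField, initial, reduce, finalize) {
  const groups = {};
  docs.forEach(function (doc) {
    const key = doc[keyField];
    if (!groups[key]) {
      // Each group starts from its own copy of "initial" plus the key.
      groups[key] = Object.assign({}, initial);
      groups[key][keyField] = key;
    }
    reduce(doc, groups[key]);            // fold one document into its group
  });
  return Object.keys(groups).map(function (key) {
    finalize(groups[key]);               // post-process each finished group
    return groups[key];
  });
}

// Made-up sample data: one document per medal.
const medals = [
  { Country: "Denmark" }, { Country: "Denmark" },
  { Country: "Great Britain" }, { Country: "Denmark" }
];

const out = groupBy(
  medals,
  "Country",
  { total: 0, fullTotal: medals.length },
  function (curr, result) { result.total += 1; },
  function (result) { result.percent = result.total * 100 / result.fullTotal; }
);
console.log(JSON.stringify(out));
```

With this sample data, Denmark ends up with total 3 (75 percent) and Great Britain with total 1 (25 percent).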
■ More sophisticated queries are possible, but require a lot of JS and you'll hit the limits pretty quickly
■ Group cannot be used in sharded configurations. For that you need...
■ Map/Reduce: Exactly what its name says.
■ You utilize JavaScript functions to map your documents' data, then reduce that
data into a form of your choosing.
[Diagram: Input Collection ⇒ Mapping Function ⇒ Reducing Function ⇒ Result Document or Output Collection]
■ Inside the mapping function, "this" refers to the current document
■ Output mapped keys and values are generated via the emit function
■ Emit can be called zero or more times for a single document
// one emit per document
function () { emit(this.Countryname, { count : 1 }); }
// one emit per array element (zero emits if Pupils is empty)
function () {
    for (var i = 0; i < this.Pupils.length; i++) {
        emit(this.Pupils[i].name, { count : 1 });
    }
}
// conditional emit
function () {
    if ((this.parents.age - this.age) < 25) { emit(this.age, { income : this.income }); }
}
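One way to see what a mapping function produces is to run it outside MongoDB with a stand-in emit. The sketch below is plain JavaScript under that assumption; runMap, mapPupils and the sample documents are hypothetical names, not part of MongoDB.

```javascript
// Stand-in for MongoDB's emit: collects [key, value] pairs in memory.
const emitted = [];
function emit(key, value) {
  emitted.push([key, value]);
}

// Same shape as the per-pupil mapping function above: one emit per pupil.
function mapPupils() {
  for (let i = 0; i < this.Pupils.length; i++) {
    emit(this.Pupils[i].name, { count: 1 });
  }
}

// Call the mapping function once per document, with "this" bound to it.
function runMap(mapFn, docs) {
  emitted.length = 0;
  docs.forEach(function (doc) { mapFn.call(doc); });
  return emitted.slice();
}

// Made-up sample documents.
const classes = [
  { Pupils: [{ name: "Ann" }, { name: "Bob" }] },
  { Pupils: [] },                          // emits nothing
  { Pupils: [{ name: "Ann" }] }
];

const pairs = runMap(mapPupils, classes);
console.log(JSON.stringify(pairs));
```

The second document emits nothing and the first emits twice, which is the "zero or more times" behavior described above.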
■ The reduction function is used to aggregate the outputs from the mapping
function
■ The function receives two inputs: the key for the elements being reduced, and
the values being reduced
■ The result of the reduction must be the same format as in the input elements,
and must be idempotent
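function (key, values) {
    var count = 0;
    // values is an array: index into it (for...in would yield indices, not elements)
    for (var i = 0; i < values.length; i++) {
        count += values[i].count;
    }
    return { "count" : count };
}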
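The "same format" rule matters because a reducer's output can be fed back in as one of its input values. A small plain-JavaScript check of that property might look like the sketch below; reduceCount and the sample values are illustrative names only.

```javascript
// A count reducer in the shape described above: it returns the same
// { count: n } format that it receives in the values array.
function reduceCount(key, values) {
  let count = 0;
  for (let i = 0; i < values.length; i++) {
    count += values[i].count;
  }
  return { count: count };
}

// Hypothetical check: reducing everything at once should match reducing
// two partial batches and then reducing those partial results again.
const values = [{ count: 1 }, { count: 1 }, { count: 1 }, { count: 1 }];
const allAtOnce = reduceCount("Denmark", values);
const inBatches = reduceCount("Denmark", [
  reduceCount("Denmark", values.slice(0, 1)),
  reduceCount("Denmark", values.slice(1))
]);

console.log(allAtOnce.count, inBatches.count);
```

Both paths give a count of 4. A reducer that returned a bare number instead of { count: n } would fail the batched path.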
■ Map/Reduce utilizes JavaScript to do all of its work
○ JavaScript in MongoDB is currently single-threaded (performance bottleneck)
○ Using external JS libraries is cumbersome and doesn't play well with sharding
○ No matter what language you're actually using, you'll be writing/maintaining
JavaScript
■ Troubleshooting the Map/Reduce functions is primitive. 10gen's advice: "write
your own emit function" (!)
■ Output options are flexible, but have some caveats
○ Output to a result document must fit in a BSON doc (16MB limit)
○ For an output collection: if you want indices on the result set, you need to
pre-create the collection, then use the merge output option
■ The Aggregation Framework is designed to alleviate some of the issues with
Map/Reduce for common analytical queries
■ New in 2.2
■ Works by constructing a pipeline of operations on data. Similar to M/R, but
implemented in native code (higher performance, not single-threaded)
[Diagram: Input Collection ⇒ Match ⇒ Project ⇒ Group]
■ Filtering/paging ops
○ $match - utilize Mongo selection syntax to choose input docs
○ $limit
○ $skip
■ Field manipulation ops
○ $project - select which fields are processed. Can add new fields
○ $unwind - flattens a doc with an array field into multiple events, one per array
value
■ Output ops
○ $group
○ $sort
■ Most common pipelines will be of the form $match ⇒ $project ⇒ $group
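The $match ⇒ $project ⇒ $group shape can be pictured as three passes over an array. The sketch below imitates that flow in plain JavaScript with made-up athlete documents; it only mirrors the data flow and is not how MongoDB executes a pipeline.

```javascript
// In-memory imitation of a $match => $project => $group pipeline.
const athletes = [
  { Name: "A", Countryname: "Denmark", Sportname: "Rowing", Age: 24 },
  { Name: "B", Countryname: "Denmark", Sportname: "Sailing", Age: 31 },
  { Name: "C", Countryname: "Great Britain", Sportname: "Rowing", Age: 27 },
  { Name: "D", Countryname: "Kenya", Sportname: "Athletics", Age: 22 }
];

// $match: keep only the documents that satisfy the selector.
const matched = athletes.filter(function (doc) {
  return doc.Sportname === "Rowing" || doc.Sportname === "Sailing";
});

// $project: keep only the fields needed downstream.
const projected = matched.map(function (doc) {
  return { Countryname: doc.Countryname };
});

// $group with { $sum : 1 }: one output document per distinct key.
const counts = {};
projected.forEach(function (doc) {
  counts[doc.Countryname] = (counts[doc.Countryname] || 0) + 1;
});
const grouped = Object.keys(counts).map(function (key) {
  return { _id: key, count: counts[key] };
});

console.log(JSON.stringify(grouped));
```

Filtering first keeps the later stages small, which is the same reason the real pipeline benefits from an early $match.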
■ $match is very important to getting good performance
■ Needs to be the first op in the pipeline, otherwise indices can't be used
■ Uses normal MongoDB query syntax, with two exceptions
○ Can't use a $where clause (this requires JavaScript)
○ Can't use Geospatial queries (just because)
{ $match : { "Name" : "Fred" } }
{ $match : { "Countryname" : { $ne : "Great Britain" } } }
{ $match : { "Income" : { $exists : 1 } } }
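To show what selectors of this kind select, here is a tiny in-memory matcher in plain JavaScript. It covers only plain equality, $ne (not equal) and $exists, with one operator per field; matches() and the sample documents are hypothetical and not a MongoDB API.

```javascript
// Tiny in-memory matcher covering only equality, $ne and $exists.
function matches(doc, selector) {
  return Object.keys(selector).every(function (field) {
    const cond = selector[field];
    const present = Object.prototype.hasOwnProperty.call(doc, field);
    if (cond !== null && typeof cond === "object") {
      if ("$ne" in cond) return doc[field] !== cond.$ne;
      if ("$exists" in cond) return present === Boolean(cond.$exists);
    }
    return present && doc[field] === cond;   // plain equality
  });
}

// Made-up sample documents.
const fred = { Name: "Fred", Countryname: "Denmark", Income: 100 };
const jane = { Name: "Jane", Countryname: "Great Britain" };

console.log(matches(fred, { Name: "Fred" }));
console.log(matches(jane, { Countryname: { $ne: "Great Britain" } }));
console.log(matches(jane, { Income: { $exists: 1 } }));
```

Here only the first check is true: jane is from Great Britain and has no Income field.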
■ $project is used to select/compute/augment the fields you want in the output
documents
{ $project : { "Countryname" : 1, "Sportname" : 1 } }
■ Can reference input document fields in computations via "$"
{ $project : { "country_name" : "$Countryname" } } /* renames field */
■ Computation of field values is possible, but it's limited and can be quite painful
{ $project: {
"_id":0, "height":1, "weight":1,
"bmi": { $divide : ["$weight", { $multiply : [ "$height", "$height" ] } ] } }
} /* omit "_id" field, inflict pain and suffering on future maintainers... */
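The nested operator form is easier to read once you see it as an expression tree. The plain-JavaScript sketch below evaluates just the two operators used in the BMI example; evaluate() is a hypothetical helper written for illustration, not MongoDB's evaluator.

```javascript
// Minimal evaluator for the nested $divide / $multiply expression above.
// Handles "$field" references, numeric literals and these two operators only.
function evaluate(expr, doc) {
  if (typeof expr === "string" && expr.charAt(0) === "$") {
    return doc[expr.slice(1)];             // "$weight" -> doc.weight
  }
  if (typeof expr === "number") {
    return expr;
  }
  if (expr.$divide) {
    return evaluate(expr.$divide[0], doc) / evaluate(expr.$divide[1], doc);
  }
  if (expr.$multiply) {
    return expr.$multiply.reduce(function (acc, e) {
      return acc * evaluate(e, doc);
    }, 1);
  }
  throw new Error("unsupported expression");
}

const bmiExpr = { $divide: ["$weight", { $multiply: ["$height", "$height"] }] };
const bmi = evaluate(bmiExpr, { height: 2, weight: 80 });
console.log(bmi);   // 80 / (2 * 2)
```

In ordinary code this is simply weight / (height * height), which is the readability cost the slide is pointing at.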
■ $group, like the group command, collates and computes sets of values based
on the identity field ("_id"), and whatever other fields you want
{ $group : { "_id" : "$Countryname" } } /* distinct list of countries */
■ Aggregation operators can be used to perform computation ($max, $min, $avg,
$sum)
{ $group : { "_id" : "$Countryname", "count" : { $sum : 1 } } } /* histogram by country */
{ $group : { "_id" : "$Countryname", "weight" : { $avg : "$weight" } } }
{ $group : { "_id" : "$Countryname", "weight" : { $sum : "$weight" } } }
■ Set-based operations ($addToSet, $push)
{ $group : { "_id" : "$Countryname", "sport" : { $addToSet : "$sport" } } }
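As a rough picture of what the $avg and $addToSet accumulators do per group, here is an in-memory imitation in plain JavaScript. The sample documents are made up, and the running total/n bookkeeping is just one way to compute an average, not a claim about MongoDB's internals.

```javascript
// In-memory imitation of $group with $avg and $addToSet accumulators.
const docs = [
  { Countryname: "Denmark", weight: 70, sport: "Rowing" },
  { Countryname: "Denmark", weight: 80, sport: "Rowing" },
  { Countryname: "Denmark", weight: 90, sport: "Sailing" },
  { Countryname: "Kenya", weight: 60, sport: "Athletics" }
];

const groups = {};
docs.forEach(function (doc) {
  const key = doc.Countryname;
  if (!groups[key]) {
    groups[key] = { _id: key, total: 0, n: 0, sport: [] };
  }
  const g = groups[key];
  g.total += doc.weight;                    // running sum for the average
  g.n += 1;
  if (g.sport.indexOf(doc.sport) === -1) {  // $addToSet: skip duplicates
    g.sport.push(doc.sport);
  }
});

const result = Object.keys(groups).map(function (key) {
  const g = groups[key];
  return { _id: g._id, weight: g.total / g.n, sport: g.sport };
});
console.log(JSON.stringify(result));
```

Swapping the duplicate check for an unconditional push would give $push behavior instead of $addToSet.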
■ Aggregation framework has a limited set of operators
○ $project limited to $add/$subtract/$multiply/$divide, as well as some
boolean, string, and date/time operations
○ $group arithmetic limited to $min/$max/$avg/$sum (plus the set-based $addToSet/$push)
■ Some operators, notably $group and $sort, are required to operate entirely in
memory
○ This may prevent aggregation on large data sets
○ Can't work around using subsetting like you can with M/R, because output is
strictly a document (no collection option yet)
■ Even with these tools, there are still limitations
○ MongoDB is not relational. This means a lot of work on your part if you have
datasets representing different things that you'd like to correlate. Clicks vs
views, for example
○ While the Aggregation Framework alleviates some of the performance issues
of Map/Reduce, it does so by throwing away flexibility
○ The best approach for parallelization (sharding) is fraught with operational
challenges (come see me for horror stories)
Overview of Precog for MongoDB
Derek Chen-Becker, Precog Lead Infrastructure Engineer, @dchenbecker
■ Download file: http://www.precog.com/mongodb
■ Setup:
$ unzip precog.zip
$ cd precog
$ emacs -nw config.cfg (adjust ports, etc)
$ ./precog.sh
overview of precog for mongodb
■ Precog for MongoDB allows you to perform sophisticated analytics utilizing
existing mongo instances
■ Self-contained JAR bundling:
○ The Precog Analytics service
○ Labcoat IDE for Quirrel
■ Does not include the full Precog stack
○ Minimal authentication handling (single API key in config)
○ No ingest service (just add data directly to mongo)
■ Some sample queries
-- histogram by country
data := //summer_games/athletes
solve 'country { country: 'country, count: count(data where data.Countryname = 'country) }
Analyzing Big Data with Quirrel
John A. De Goes, Precog CEO/Founder, @jdegoes
Quirrel is a statistically-oriented query language designed for the analysis of large-scale, potentially heterogeneous data sets.
overview
● Simple
● Set-oriented
● Statistically-oriented
● Purely declarative
● Implicitly parallel
quirrel
pageViews := //pageViews
avg := mean(pageViews.duration)
bound := 1.5 * stdDev(pageViews.duration)
pageViews.userId where pageViews.duration > avg + bound
sneak peek
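For readers more at home in JavaScript, the sneak-peek query can be read roughly as the sketch below. The page view records are made up, and treating stdDev as the population standard deviation is an assumption; Quirrel's definition may differ.

```javascript
// Plain-JavaScript reading of the sneak-peek query.
// Assumption: stdDev is taken as the population standard deviation.
const pageViews = [
  { userId: "u1", duration: 10 },
  { userId: "u2", duration: 12 },
  { userId: "u3", duration: 11 },
  { userId: "u4", duration: 9 },
  { userId: "u5", duration: 58 }
];

const durations = pageViews.map(function (pv) { return pv.duration; });
const avg = durations.reduce(function (a, b) { return a + b; }, 0) / durations.length;
const variance = durations
  .map(function (d) { return (d - avg) * (d - avg); })
  .reduce(function (a, b) { return a + b; }, 0) / durations.length;
const bound = 1.5 * Math.sqrt(variance);

// userIds of page views lasting longer than avg + bound.
const slowUsers = pageViews
  .filter(function (pv) { return pv.duration > avg + bound; })
  .map(function (pv) { return pv.userId; });
console.log(slowUsers);
```

With this sample only u5 exceeds the threshold. The Quirrel version states the same thing in four lines with no explicit loops.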
1
true
[[1, 0, 0], [0, 1, 0], [0, 0, 1]]
"All work and no play makes jack a dull boy"
{"age": 23, "gender": "female", "interests": ["sports", "tennis"]}
quirrel speaks json
-- Ignore me.
(- Ignore me, too -)
comments
2 * 4
(1 + 2) * 3 / 9 > 23
3 > 2 & (1 != 2)
false & true | !false
basic expressions
x := 2
square := x * x
named expressions
//pageViews
load("/pageViews")
//campaigns/summer/2012
loading data
pageViews := load("/pageViews")
pageViews.userId
pageViews.keywords[2]
drilldown
count(//pageViews)
sum(//purchases.total)
stdDev(//purchases.total)
reductions
pageViews := //pageViews
pageViews.userId where pageViews.duration > 1000
filtering
clicks with {dow: dayOfWeek(clicks.time)}
augmentation
import std::stats::rank
rank(//pageViews.duration)
standard library
ctr(day) := count(clicks where clicks.day = day) / count(impressions where impressions.day = day)
ctrOnMonday := ctr(1)
ctrOnMonday
user-defined functions
solve 'day {day: 'day, ctr: count(clicks where clicks.day = 'day) / count(impressions where impressions.day = 'day)}
grouping - implicit constraints
solve 'day = purchases.day {day: 'day, cummTotal: sum(purchases.total where purchases.day < 'day)}
grouping - explicit constraints
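The explicit-constraint query can be read in plain JavaScript roughly as follows. The purchase records are made up, and treating the sum over an empty set as 0 is an assumption made here; Quirrel may handle that case differently.

```javascript
// Plain-JavaScript reading of the explicit-constraint query:
// for each distinct purchase day, sum the totals of strictly earlier days.
const purchases = [
  { day: 1, total: 10 },
  { day: 2, total: 5 },
  { day: 2, total: 7 },
  { day: 3, total: 1 }
];

// Distinct values of purchases.day play the role of 'day.
const days = purchases
  .map(function (p) { return p.day; })
  .filter(function (d, i, all) { return all.indexOf(d) === i; });

const results = days.map(function (day) {
  const earlier = purchases.filter(function (p) { return p.day < day; });
  // Assumption: an empty sum is 0.
  const cummTotal = earlier.reduce(function (sum, p) { return sum + p.total; }, 0);
  return { day: day, cummTotal: cummTotal };
});
console.log(JSON.stringify(results));
```

With this sample the running totals are 0 for day 1, 10 for day 2 and 22 for day 3.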
http://quirrel-lang.org
questions?
Now, it's your turn! Win some cool prizes!
Precog Challenge Problems
■ Using the conversions data, find the state with the highest average income.
■ Variable names: conversions.customers.state and conversions.customers.income
precog challenge #1
■ Use Labcoat to display a bar chart of the clicks per month.
■ Variable names: clicks.timestamp
precog challenge #2
■ What product has the worst overall sales to women? To men?
■ Variable names: billing.product.ID, billing.product.price, billing.customer.gender
precog challenge #3
conversions := //conversions
results := solve 'state
{state: 'state,
aveIncome: mean(conversions.customer.income where
conversions.customer.state = 'state)}
results where results.aveIncome = max(results.aveIncome)
precog challenge #1 possible solution
clicks := //clicks
clicks' := clicks with {month: std::time::monthOfYear(clicks.timeStamp)}
solve 'month
{month: 'month, clicks: count(clicks'.product.price where clicks'.month = 'month)}
precog challenge #2 possible solution
billing := //billing
results := solve 'product, 'gender
{product: 'product,
gender: 'gender,
sales: sum(billing.product.price where
billing.product.ID = 'product &
billing.customer.gender = 'gender)}
worstSalesToWomen := results where results.gender = "female" &
results.sales = min(results.sales where results.gender = "female")
worstSalesToMen := results where results.gender = "male" &
results.sales = min(results.sales where results.gender = "male")
worstSalesToWomen union worstSalesToMen
precog challenge #3 possible solution
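For comparison, the same challenge #3 logic written out by hand in plain JavaScript might look like the sketch below. The billing records are made-up sample data, and worstFor is a hypothetical helper.

```javascript
// Plain-JavaScript reading of the challenge #3 solution.
const billing = [
  { product: { ID: "p1", price: 10 }, customer: { gender: "female" } },
  { product: { ID: "p1", price: 10 }, customer: { gender: "male" } },
  { product: { ID: "p2", price: 3 }, customer: { gender: "female" } },
  { product: { ID: "p2", price: 3 }, customer: { gender: "male" } },
  { product: { ID: "p2", price: 3 }, customer: { gender: "male" } },
  { product: { ID: "p1", price: 10 }, customer: { gender: "female" } }
];

// Total sales per (product, gender) pair, like the solve over both variables.
const totals = {};
billing.forEach(function (b) {
  const key = b.product.ID + "|" + b.customer.gender;
  if (!totals[key]) {
    totals[key] = { product: b.product.ID, gender: b.customer.gender, sales: 0 };
  }
  totals[key].sales += b.product.price;
});
const results = Object.keys(totals).map(function (k) { return totals[k]; });

// Rows whose sales equal the minimum for the given gender.
function worstFor(gender) {
  const rows = results.filter(function (r) { return r.gender === gender; });
  const min = Math.min.apply(null, rows.map(function (r) { return r.sales; }));
  return rows.filter(function (r) { return r.sales === min; });
}

const worst = worstFor("female").concat(worstFor("male"));
console.log(JSON.stringify(worst));
```

As in the Quirrel version, a product with no sales at all to one gender never appears in the grouped results, so it cannot be reported as the worst seller.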
Thank you!
Follow us on Twitter: @precogio, @jdegoes, @dchenbecker
Download Precog for MongoDB for FREE: www.precog.com/mongodb
Try Precog for free and get a free account: www.precog.com
Subscribe to our monthly newsletter: www.precog.com/about/newsletter