Social Analytics with MongoDB @BuddyMedia
Disclaimer
+= maybe not the best deck in the world
What is MongoDB?
• Document store.
• Schemaless.
• High performance.
Why MongoDB?
• Months of testing
– Data Types
– Horizontal Scaling
– Replication
– Querying
– Atomicity
– Concurrency
Everything in that last slide was a LIE.
Same reason most of you do.
• It’s new and cool and we wanted to check it out.
• We become cool by association.
• But mostly because we like learning new things.
That last slide was kind of a lie too.
• We started with Cassandra.
• Cassandra was written by Facebook and Facebook is really cool, we wanted to be as cool as them.
Why Not Cassandra?
• Thrift.
– “Thrift is a software framework for scalable cross-language services development. It combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, and OCaml.”
• Eff that. We’re a startup.
So MongoDB it Was.
Also, MongoDB Happened to be in NYC. We are in NYC. NYC is Cool.
Proof that NYC is cool.
What You Should Know
• MongoDB is not relational.
• It’s also not schemaless even though they love to say that (applications always have schemas/data models).
• Right tool for right job.
– Logging
– Queues
– Aggregate Analytics
• Don’t get confused with ORM.
• Return what you need.
• Don’t worry about document size limits.
Aggregate Analytics
• Lots of “Stuff” happens at Buddy Media.
• Need to keep track of it all.
• Need it to be real time.
• Need to be able to group it by various levels and resolutions.
• Need to be able to create new metrics on the fly.
• Write heavy, read light.
What does it look like?
Event -> Queue -> Processor -> Metric
Architecture
The Event Listener
• Node.js is the perfect event listener.
– Evented I/O like Twisted or EventMachine.
– 2 days of development (maybe ~100 lines of JS).
– 0 lost events.
– 0 downtime.
– Just don’t upgrade.
Raw Event
A Pageview
{
    "_id" : ObjectId("4d8d0df101cddf2e6e0027af"),
    "created_date" : "2010-07-26 20:15:01",
    "data" : {
        "client_id" : "1034",
        "page_id" : "175"
    },
    "status" : {
        "state" : 0,
        "updated" : "2011-04-12 10:15:15"
    },
    "type" : "pageview"
}
Processing
• 3 resolutions
– Minute
– Hour
– Day
• 1 event = 3 metric updates * number of groupings.
"pageview": {
    "metrics": [
        { "name": "client.pageviews", "key": "client_id" },
        { "name": "page.pageviews", "key": "page_id" }
    ]
}
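The fan-out rule above (3 resolutions times the number of groupings) can be sketched in Python. This is only an illustration: `period_start` and `expand_event` are hypothetical helper names, and truncating to calendar minute, hour, and day is an assumption about how buckets are aligned (the deck's sample metric spans 12:50:00 to 12:54:59, so the real bucketing may differ).

```python
from datetime import datetime

# Mirrors the "pageview" configuration from the slide.
CONFIG = {
    "pageview": {
        "metrics": [
            {"name": "client.pageviews", "key": "client_id"},
            {"name": "page.pageviews", "key": "page_id"},
        ]
    }
}

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def period_start(created_date, period):
    """Truncate a timestamp string to the start of its minute, hour, or day.

    Assumption: buckets align to calendar boundaries.
    """
    moment = datetime.strptime(created_date, DATE_FORMAT)
    if period == "minute":
        moment = moment.replace(second=0)
    elif period == "hour":
        moment = moment.replace(minute=0, second=0)
    elif period == "day":
        moment = moment.replace(hour=0, minute=0, second=0)
    else:
        raise ValueError("unknown period: " + period)
    return moment.strftime(DATE_FORMAT)


def expand_event(event, config=CONFIG):
    """Return one update description per (grouping, resolution) pair."""
    updates = []
    for metric in config[event["type"]]["metrics"]:
        for period in ("minute", "hour", "day"):
            updates.append({
                "name": metric["name"],
                "period": period,
                "start_date": period_start(event["created_date"], period),
                "aggregate_key": event["data"][metric["key"]],
            })
    return updates
```

With two groupings configured for a pageview, one raw event expands into six metric updates.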
Creating a Metric
A pageview happened and I want to update metrics for the client the page belongs to.
metrics.update({
    'name': 'client.pageview',
    'period': 'minute',
    'start_date': '2010-05-12 12:50:00'
}, { '$inc': {'aggregates.1034': 1} }, upsert=True)
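For readers on a current driver: PyMongo 4 removed the legacy `update` method, and `update_one(filter, update, upsert=True)` is the equivalent call. A minimal sketch of the same upsert, where `record_pageview` is a hypothetical wrapper name and `metrics` is assumed to be a PyMongo collection (or anything with a compatible `update_one`):

```python
def record_pageview(metrics, client_id, start_date):
    """Add one pageview to a client's counter in the per-minute bucket.

    `metrics` is assumed to expose PyMongo's Collection.update_one.
    """
    bucket = {
        "name": "client.pageview",
        "period": "minute",
        "start_date": start_date,
    }
    # Dotted path: increments the client's field inside "aggregates".
    increment = {"$inc": {"aggregates.%s" % client_id: 1}}
    # upsert=True creates the bucket document if it does not exist yet.
    return metrics.update_one(bucket, increment, upsert=True)
```

Because the filter identifies the bucket and the counter lives in a per-client field, the caller never has to check whether the document exists first.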
Completed Metric
{
    "_id" : ObjectId("4da45cf6306a22719829b71b"),
    "aggregates" : {
        "1034" : 11
    },
    "end_date" : "2010-05-12 12:54:59",
    "name" : "client.pageview",
    "period" : "minute",
    "start_date" : "2010-05-12 12:50:00",
    "total" : 11
}
What about another client?
If a second pageview comes in for a different client, we end up updating the exact same record. Thus our last metric becomes:
{
    "_id" : ObjectId("4da45cf6306a22719829b71b"),
    "aggregates" : {
        "1034" : 11,
        "1213" : 1
    },
    "end_date" : "2010-05-12 12:54:59",
    "name" : "client.pageview",
    "period" : "minute",
    "start_date" : "2010-05-12 12:50:00",
    "total" : 12
}
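Why different clients land in the same record can be seen with a small in-memory emulation of an upsert with `$inc`. This is a sketch of the semantics only, not MongoDB itself: `upsert_inc` and the `store` dict are hypothetical stand-ins, and the event stream is made up for illustration.

```python
def upsert_inc(store, bucket, field, amount=1):
    """Emulate an upsert whose update is {'$inc': {'aggregates.<field>': amount}}."""
    key = (bucket["name"], bucket["period"], bucket["start_date"])
    # Upsert: create the bucket document the first time it is seen.
    doc = store.setdefault(key, dict(bucket, aggregates={}))
    # $inc: a missing field is treated as 0 before the increment.
    doc["aggregates"][field] = doc["aggregates"].get(field, 0) + amount
    return doc


store = {}
bucket = {
    "name": "client.pageview",
    "period": "minute",
    "start_date": "2010-05-12 12:50:00",
}

# Hypothetical stream: two pageviews for client 1034, one for client 1213.
for client_id in ["1034", "1034", "1213"]:
    upsert_inc(store, bucket, client_id)
```

After the loop `store` holds a single document whose `aggregates` field carries one counter per client, which is what keeps the write path down to one update per bucket.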
Some Queries
1. Get pageviews for all clients that occurred on May 12 between 12:50 and 12:51
db.metrics.find({
    name: "client.pageview",
    period: "minute",
    start_date: "2010-05-12 12:50:00"
});
1 Document, n entries.
2. Get pageviews for client 1034 that occurred on May 12 between 12:50 and 12:51
db.metrics.find({
    name: "client.pageview",
    period: "minute",
    start_date: "2010-05-12 12:50:00"
}, {"aggregates.1034": 1});
1 Document, 1 entry.
More Queries
1. Get pageviews for all clients that occurred on May 12 and graph by hour.
db.metrics.find({
    name: "client.pageview",
    period: "hour",
    start_date: /^2010-05-12/
});
24 Documents, n entries.
2. Get pageviews for client 1034 that occurred on May 12 and graph by minute.
db.metrics.find({
    name: "client.pageview",
    period: "minute",
    start_date: /^2010-05-12/
}, {"aggregates.1034": 1});
1440 Documents, 1 entry.
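The document counts follow from the bucket sizes: a day holds 24 hour buckets and 1440 minute buckets. A rough Python check, assuming every bucket received at least one event (an upsert only creates a document when something happens) and that buckets align to calendar boundaries; `bucket_starts` and `find` are hypothetical helpers that emulate a date-prefix match over a plain list:

```python
from datetime import datetime, timedelta
import re

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def bucket_starts(day, period):
    """List the start_date strings of every bucket of one resolution in a day."""
    step = {"minute": timedelta(minutes=1), "hour": timedelta(hours=1)}[period]
    current = datetime.strptime(day, "%Y-%m-%d")
    end = current + timedelta(days=1)
    starts = []
    while current < end:
        starts.append(current.strftime(DATE_FORMAT))
        current += step
    return starts


# Hypothetical collection contents: one document per bucket on May 12 and 13.
documents = [
    {"name": "client.pageview", "period": period, "start_date": start}
    for day in ("2010-05-12", "2010-05-13")
    for period in ("minute", "hour")
    for start in bucket_starts(day, period)
]


def find(docs, period, date_prefix):
    """Emulate a query on period plus an anchored prefix match on start_date."""
    pattern = re.compile("^" + re.escape(date_prefix))
    return [d for d in docs
            if d["period"] == period and pattern.search(d["start_date"])]


hourly = find(documents, "hour", "2010-05-12")
by_minute = find(documents, "minute", "2010-05-12")
```

Here `len(hourly)` is 24 and `len(by_minute)` is 1440. A looser prefix such as `"2010-05-"` would match every day in the month, so the prefix needs to include the day to stay within May 12.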
Let’s take a peek.
@patr1cks @buddymedia