Real-time Location Based Social Discovery using MongoDB Fredrik Björk Director of Engineering MongoSV, Dec 4th 2012
Dec 18, 2014
What is Banjo?
• The most powerful location based mobile technology that brings you the moments you would otherwise miss
• Aggregates geo tagged posts from Facebook, Twitter, Instagram and Foursquare in real-time
Stats
• Launched June 2011
• 3 million users
• Social graph of 400 million profiles
• 50 billion connections
• ~200 geo posts created per second
Why MongoDB?
• Developer friendly
• Easy to maintain and scale
• Automatic failover
• Rapid prototyping of features
• Good fit for consuming, storing and presenting JSON data
• Geospatial features out of the box
Infrastructure
• ~160 EC2 instances (75% MongoDB, 25% Redis)
• SSD drives for low latency
• App servers (Sinatra & Rails) hosted on Heroku
• Mongos with authentication running on dedicated servers
Geo tagged posts
• Consumed as JSON from social network APIs - streaming, polling & real-time callbacks
• Exposed via REST APIs as JSON to the Banjo iOS and Android apps
Schema design
https://twitter.com/fbjork/status/262989592561606656
> db.posts.find({ _id: '2:262989592561606656' })
{
  _id: "2:262989592561606656",
  username: "fbjork",
  text: "Will give a presentation at #MongoSV on how we use @MongoDB for real-time location based social discovery at @Banjo http://www.10gen.com/events/mongosv",
  ...
}
• _id is composed of provider (Facebook: 1, Twitter: 2 etc.) and post id for uniqueness
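The composite _id described above can be sketched in Ruby as follows. This is a minimal illustration, not Banjo's code: the helper names `post_key` and `parse_post_key` and the `PROVIDERS` table are hypothetical, and only the two provider codes named on the slide are included.

```ruby
# Provider codes: Facebook (1) and Twitter (2) come from the slide.
# The deck does not list the other codes, so none are invented here.
PROVIDERS = { facebook: 1, twitter: 2 }.freeze

# Build the composite _id string "<provider code>:<post id>".
def post_key(provider, post_id)
  code = PROVIDERS.fetch(provider) # raises KeyError for unknown providers
  "#{code}:#{post_id}"
end

# Split a composite _id back into [provider symbol, post id string].
def parse_post_key(key)
  code, post_id = key.split(':', 2)
  [PROVIDERS.key(Integer(code)), post_id]
end

puts post_key(:twitter, 262989592561606656) # prints 2:262989592561606656
```

Prefixing the provider code keeps ids unique even if two networks happen to issue the same numeric post id.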
• Coordinates are stored inside an array with latitude, longitude
{
  _id: "2:262989592561606656",
  username: "fbjork",
  text: "Will give a presentation at #MongoSV on how we use @MongoDB for real-time location based social discovery at @Banjo http://www.10gen.com/events/mongosv",
  coordinates: [37.784234, -122.438212],
  ...
}
• Friends are stored inside an array
{
  _id: "2:262989592561606656",
  username: "fbjork",
  text: "Will give a presentation at #MongoSV on how we use @MongoDB for real-time location based social discovery at @Banjo http://www.10gen.com/events/mongosv",
  coordinates: [37.784234, -122.438212],
  friend_ids: [8816792, 10324882, 2006261, ...]
}
Geospatial Indexing
• Create the geo index:
> db.posts.ensureIndex({ coordinates: '2d' })
Find nearby posts in Miami:
> db.posts.find({ coordinates: { $near: [25.792627, -80.226142] } })
{ _id: "2:809438082", coordinates: [25.792610, -80.226100], username: "Rebecca_Boorsma", text: "I love Miami!", ... }
{ _id: "2:1234567", coordinates: [25.781324, -80.431423], username: "foo", text: "Another day, another dollar", ... }
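To make the ordering of the $near results concrete, here is a rough in-memory sketch of nearest-first sorting. It assumes flat (planar) distance in degrees, which is how a 2d index treats coordinate pairs. It is only an illustration of the result order, not how MongoDB evaluates the query, and the helper names `flat_distance` and `nearest_first` are hypothetical.

```ruby
# Straight-line distance between two [x, y] pairs, in the same units as the
# stored coordinates (degrees here). No earth-curvature correction is applied.
def flat_distance(a, b)
  Math.hypot(a[0] - b[0], a[1] - b[1])
end

# Return posts ordered nearest-first from the query point.
def nearest_first(posts, point)
  posts.sort_by { |post| flat_distance(post[:coordinates], point) }
end

# The two sample posts from the slide, deliberately listed farthest first.
posts = [
  { _id: '2:1234567',   coordinates: [25.781324, -80.431423] },
  { _id: '2:809438082', coordinates: [25.792610, -80.226100] }
]
miami = [25.792627, -80.226142]

nearest_first(posts, miami).each { |post| puts post[:_id] }
# prints 2:809438082, then 2:1234567
```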
Find friend posts globally:
> db.posts.find({ friend_ids: { $in: [2006261] } })
{
  _id: "2:10248172",
  username: "fbjork",
  friend_ids: [8816792, 10324882, 2006261, ...],
  ...
}
Find friend posts in a location:
> db.posts.find({ coordinates: { $near: [25.792627, -80.226142] }, friend_ids: { $in: [2006261] } })
{
  _id: "2:10248172",
  username: "fbjork",
  friend_ids: [8816792, 10324882, 2006261, ...],
  ...
}
Compound geo indexes
• Create a compound index on coordinates and friend_ids:
> db.posts.ensureIndex({ coordinates: '2d', friend_ids: 1 })
• Fails for compound indexes with large arrays
• Geospatial indexes have a size limit of 1000 bytes
> db.posts.ensureIndex({ coordinates: '2d', friend_ids: 1 })
Error: Key too large to index
Geospatial query performance
• Do we need a compound index at all?
• Geospatial index is usually restrictive enough
• Problem: Array traversal (using $in) is CPU hungry for large arrays
• Solution: Pre-sharded array fields
Pre-sharded array fields
• When dealing with large arrays, e.g. @BarackObama's follower ids
• Partition fields using pre-sharding
• shard = Hash(key) MOD shard_count
• Keep array sizes in the low hundreds
{
  friends_0: [1000, 1002, 1006],
  friends_1: [1004],
  friends_2: [1001, 1003, 1005]
}
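# shard_example.rb
require 'zlib'

SHARDS = 3
friend_ids = [1000, 1001, 1002, 1003, 1004, 1005, 1006]
# Print the shard index (CRC32 of the id, modulo the shard count) for each friend id
friend_ids.each { |f| puts Zlib.crc32(f.to_s) % SHARDS }
# Output (one value per line): 0 2 0 2 1 2 0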
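Building on the same hash-and-modulo rule, the following sketch shows how a flat friend list might be turned into the friends_N fields of a document. The helper names `shard_for` and `presharded_friend_fields` are hypothetical, and the deck does not show how Banjo actually writes these fields.

```ruby
require 'zlib'

SHARD_COUNT = 3 # same value as SHARDS in shard_example.rb

# Bucket index for one id: CRC32 of its string form, modulo the bucket count.
def shard_for(id, shard_count = SHARD_COUNT)
  Zlib.crc32(id.to_s) % shard_count
end

# Turn a flat friend list into { "friends_0" => [...], "friends_1" => [...], ... }.
def presharded_friend_fields(friend_ids, shard_count = SHARD_COUNT)
  fields = {}
  shard_count.times { |n| fields["friends_#{n}"] = [] }
  friend_ids.each { |id| fields["friends_#{shard_for(id, shard_count)}"] << id }
  fields
end

p presharded_friend_fields([1000, 1001, 1002, 1003, 1004, 1005, 1006])
```

Because the bucket is derived from the id alone, readers and writers agree on where an id lives without any lookup table. Changing the shard count moves ids between buckets, so it would need to be fixed up front or handled with a rewrite of existing documents.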
Find friend posts using pre-sharding of the friend arrays:
> db.posts.find({ coordinates: { $near: [25.792627, -80.226142] }, friends_0: { $in: [1000] } })
{
  friends_0: [1000, 1002, 1006],
  friends_1: [1004],
  friends_2: [1001, 1003, 1005]
}
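On the read side, the bucket to query can be derived from the friend id with the same hash, so only one small array is scanned. The sketch below builds such a query document as a plain Ruby hash. `friend_posts_near` and `shard_for` are hypothetical names, and passing the hash to a MongoDB driver is left out.

```ruby
require 'zlib'

SHARD_COUNT = 3 # must match the count used when the documents were written

# Bucket index for one id: CRC32 of its string form, modulo the bucket count.
def shard_for(id, shard_count = SHARD_COUNT)
  Zlib.crc32(id.to_s) % shard_count
end

# Query document: posts near `point` whose matching bucket contains `friend_id`.
def friend_posts_near(point, friend_id, shard_count = SHARD_COUNT)
  {
    'coordinates' => { '$near' => point },
    "friends_#{shard_for(friend_id, shard_count)}" => { '$in' => [friend_id] }
  }
end

p friend_posts_near([25.792627, -80.226142], 1000)
```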
Capped collections
• Good fit for storing a feed of posts for a period of time
• Eliminates need to expire old posts
• Documents can’t grow
• Documents can’t be deleted
• Resizing collections is painful
• Can’t be sharded
TTL collections
• We switched to TTL collections with MongoDB 2.2
• Deleting and growing documents is now possible
• Easier to change expiration times
• Can be sharded (not by geo)
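A TTL index removes a document once its indexed date field is older than the configured expireAfterSeconds. The removal runs as a periodic background task, so it is not instantaneous. The rule itself can be sketched in plain Ruby as below. The field name `created_at`, the one-hour window and the helper `expired?` are assumptions for illustration, since the deck does not state which field or expiry Banjo used.

```ruby
# True once the document's timestamp is older than the TTL window.
# `field` is a hypothetical name for the indexed date field.
def expired?(doc, now, expire_after_seconds, field = :created_at)
  doc[field] + expire_after_seconds <= now
end

now = Time.utc(2012, 12, 4, 12, 0, 0)
posts = [
  { _id: '2:1', created_at: now - 7200 }, # two hours old
  { _id: '2:2', created_at: now - 60 }    # one minute old
]

# Keep only posts still inside a one-hour window.
live = posts.reject { |post| expired?(post, now, 3600) }
puts live.map { |post| post[:_id] }.inspect # prints ["2:2"]
```

Unlike a capped collection, the window here is a single number, which is why changing the expiration time is simpler with TTL collections.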
Questions