Graph Operations With MongoDB
Charles Sarrazin, Senior Consulting Engineer, MongoDB
Agenda
01 MongoDB Introduction
02 Graph Use & Concepts
03 New Lookup Operators
04 Example Scenarios
05 Design & Performance Considerations
06 Wrap-up
MongoDB Introduction
Documents
{
  first_name: 'Paul',
  surname: 'Miller',
  cell: 447557505611,
  city: 'London',
  location: [45.123, 47.232],
  profession: ['banking', 'finance', 'trader'],
  cars: [
    { model: 'Bentley', year: 1973, value: 100000, … },
    { model: 'Rolls Royce', year: 1965, value: 330000, … }
  ]
}
• Fields
• Typed field values (Number)
• Fields can contain arrays
• Fields can contain an array of sub-documents
Query Language
db.collection.find({'city': 'London'})
db.collection.find({'profession': {'$in': ['banking', 'trader']}}, {'surname': 1, 'profession': 1})
db.collection.find({'cars.year': {'$lte': 1968}}).sort({'surname': 1}).limit(10)
db.collection.find({'cars.model': 'Bentley', 'cars.year': {'$lt': 1966}})
db.collection.find({'cars': {'$elemMatch': {'model': 'Bentley', 'year': {'$lt': 1966}}}})
db.collection.find({'location': {'$geoWithin': {'$geometry': {
    'type': 'Polygon',
    'coordinates': [ <array-of-coordinates> ]
}}}})
Secondary Indexes
compound, geospatial, text, multikey, hashed,unique, sparse, partial, TTL
Query Language
Aggregation pipeline
db.collection.aggregate([
    {$match: {'profession': {'$in': ['banking', 'trader']}}},
    {$addFields: {'surnameLower': {$toLower: "$surname"}, 'prof': {$ifNull: ["$prof", "Unknown"]}}},
    {$group: { ... }},
    {$sort: { ... }},
    {$limit: { ... }},
    {$match: { ... }},
    ...
])
Schema Design
Embed in same document:
{
  first_name: 'Paul',
  surname: 'Miller',
  cell: 447557505611,
  city: 'London',
  location: [45.123, 47.232],
  profession: ['banking', 'finance', 'trader'],
  cars: [
    { model: 'Bentley', year: 1973, value: 100000, … },
    { model: 'Rolls Royce', year: 1965, value: 330000, … }
  ]
}
Separate collection with reference:
{
  first_name: 'Paul',
  surname: 'Miller',
  cell: 447557505611,
  city: 'London',
  location: [45.123, 47.232],
  profession: ['banking', 'finance', 'trader']
}
cars:
{ owner_id: 146, model: 'Bentley', year: 1973, value: 100000, … },
{ owner_id: 146, model: 'Rolls Royce', year: 1965, value: 330000, … }
Functionality Timeline
2.0 – 2.2
• Geospatial Polygon support
• Aggregation Framework
2.4 – 2.6
• New 2dsphere index
• Aggregation Framework efficiency optimisations
• Full text search
3.0 – 3.2
• Join functionality
• Increased geo accuracy
• New Aggregation operators
• Improved case insensitivity
3.4
• Recursive graph traversal
• Faceted search
• Multiple collations
MongoDB 3.4 - Multi-Model Database
• Document: rich JSON data structures, flexible schema, global scale
• Relational: left-outer join, views, schema validation
• Key/Value: horizontal scale, in-memory
• Search: text search, multiple languages, faceted search
• Binaries: files & metadata, encrypted
• Graph: graph & hierarchical, recursive lookups
• GeoSpatial: GeoJSON, 2D & 2DSphere
Graph Use & Concepts
Common Use Cases
• Networks
  – Social: circle of friends/colleagues
  – Computer network: physical/virtual/application layer
• Mapping / Routes
  – Shortest route A to B
• Cybersecurity & Fraud Detection
  – Real-time fraud/scam recognition
• Personalisation/Recommendation Engine
  – Product, social, service, professional etc.
Graph Key Concepts
• Vertices (nodes)
• Edges (relationships)
• Nodes have properties
• Relationships have name & direction
Relational DBs Lack Relationships
• "Relationships" are actually JOINs
• Raw business or storage logic and constraints, not semantic
• JOIN tables, sparse columns, null-checks
• More JOINs = degraded performance and flexibility
Relational DBs Lack Relationships
• How expensive/complex is:
  – Find my friends?
  – Find friends of my friends?
  – Find mutual friends?
  – Find friends of my friends of my friends?
  – And so on…
Native Graph Database Strengths
• Relationships are first-class citizens of the database
• Index-free adjacency
• Nodes "point" directly to other nodes
• Efficient relationship traversal
Native Graph Database Challenges
• Complex query languages
• Poorly optimized for non-traversal queries
  – Difficult to express
  – May be memory intensive
• Less often used as System Of Record
  – Synchronisation with SOR required
  – Increased operational complexity
  – Consistency concerns
NoSQL DBs Lack Relationships
• "Flat" disconnected documents or key/value pairs
• "Foreign keys" inferred at application layer
• Data integrity/quality onus is on the application
• Some suggest it is difficult to model ANY relationship efficiently with aggregate stores
• However…
Friends Network – Document Style
{
_id: 0,
name: "Bob Smith",
friends: ["Anna Jones", "Chris Green"]
},
{
_id: 1,
name: "Anna Jones",
friends: ["Bob Smith", "Chris Green", "Joe Lee"]
},
{
_id: 2,
name: "Chris Green",
friends: ["Anna Jones", "Bob Smith"]
}
Schema Design – before $graphLookup
• Options
  – Store an array of direct children in each node
  – Store parent in each node
  – Store parent and array of ancestors
• Trade-offs
  – Simple queries…
  – …vs simple updates
[Figure: example tree with nodes numbered 1 to 17]
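To make the trade-off concrete, here is a minimal in-memory sketch of the "store parent and array of ancestors" option. The node ids, the field names `parent` and `ancestors`, and both helper functions are illustrative assumptions rather than anything shown on the slides. Finding a subtree is a single filter, but moving a node means rewriting every descendant.

```javascript
// Hypothetical tree: each node stores its parent and its full ancestor path.
const nodes = [
  { _id: 1, parent: null, ancestors: [] },
  { _id: 2, parent: 1, ancestors: [1] },
  { _id: 3, parent: 2, ancestors: [1, 2] },
  { _id: 4, parent: 3, ancestors: [1, 2, 3] },
  { _id: 7, parent: 1, ancestors: [1] },
];

// Simple query: all descendants of a node is one filter on the ancestors array
// (against MongoDB this would be roughly db.nodes.find({ ancestors: id })).
function descendants(id) {
  return nodes.filter((n) => n.ancestors.includes(id)).map((n) => n._id);
}

// Costly update: re-parenting a node rewrites the path of its whole subtree.
function move(id, newParentId) {
  const node = nodes.find((n) => n._id === id);
  const newParent = nodes.find((n) => n._id === newParentId);
  const newAncestors = [...newParent.ancestors, newParent._id];
  let touched = 0;
  for (const n of nodes) {
    const idx = n.ancestors.indexOf(id);
    if (idx !== -1) {
      // Keep the path from the moved node downwards, replace the part above it.
      n.ancestors = [...newAncestors, ...n.ancestors.slice(idx)];
      touched++;
    }
  }
  node.parent = newParentId;
  node.ancestors = newAncestors;
  return touched + 1; // number of documents rewritten
}

console.log(descendants(2)); // [ 3, 4 ]
console.log(move(2, 7));     // 3 (nodes 2, 3 and 4 are rewritten)
console.log(descendants(7)); // [ 2, 3, 4 ]
```

Storing only the immediate parent would make `move` a one-document update, at the price of needing a recursive query to find descendants.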
Why MongoDB For Graph?
Lookup Operators
$lookup
Syntax
$lookup: {
  from: <target lookup collection>,
  localField: <field from the input document>,
  foreignField: <field from the target collection to connect to>,
  as: <field name for resulting array>
}
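As a rough illustration of what the stage computes, the sketch below emulates the left-outer-join behaviour of `$lookup` in plain JavaScript over in-memory arrays. The data echoes the schema design slides; the `_id: 146` on the person and the second, car-less person are assumptions added so the reference resolves and the outer-join case is visible. Against a real database the equivalent stage would be `{ $lookup: { from: "cars", localField: "_id", foreignField: "owner_id", as: "cars" } }`.

```javascript
// In-memory stand-ins for two collections (assumed data, for illustration only).
const people = [
  { _id: 146, first_name: "Paul", surname: "Miller" },
  { _id: 147, first_name: "Ann", surname: "Other" }, // owns no cars
];
const cars = [
  { owner_id: 146, model: "Bentley", year: 1973 },
  { owner_id: 146, model: "Rolls Royce", year: 1965 },
];

// Simplified equality-match $lookup: every input document is kept (left outer
// join) and the matching foreign documents are attached as an array under `as`.
function lookup(input, { from, localField, foreignField, as }) {
  return input.map((doc) => ({
    ...doc,
    [as]: from.filter((f) => f[foreignField] === doc[localField]),
  }));
}

const joined = lookup(people, {
  from: cars,
  localField: "_id",
  foreignField: "owner_id",
  as: "cars",
});

console.log(joined[0].cars.map((c) => c.model)); // [ 'Bentley', 'Rolls Royce' ]
console.log(joined[1].cars);                     // []
```

The emulation only covers scalar equality; the real stage also handles array-valued and missing fields.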
$graphLookup
Syntax
$graphLookup: {
  from: <target lookup collection>,
  startWith: <expression for value to start from>,
  connectToField: <field name in target collection to connect to>,
  connectFromField: <field name in target collection to connect from – recurse from here>,
  as: <field name for resulting array>,
  maxDepth: <max number of iterations to perform>,
  depthField: <field name for number of recursive iterations required to reach this node>,
  restrictSearchWithMatch: <match condition to apply to lookup>
}
Things To Note
• startWith value is an expression
  – Referencing the value of a field requires the '$' prefix
  – Can do things like {$toLower: "$name"}
  – Handles array fields automatically
• connectToField and connectFromField take field names
• restrictSearchWithMatch takes a standard query expression
• Cycles are automatically detected
• Can be used with 3.4 views:
  – Define a view
  – Recurse across existing view ('base' or 'from')
• Can be used multiple times per aggregation pipeline
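The behaviour described in these notes can be approximated with a breadth-first search. The function below is a simplified in-memory emulation written for this summary, not MongoDB's implementation: `restrict` is a plain predicate standing in for restrictSearchWithMatch, and connectToField is assumed to hold a scalar. It uses the friends documents from the "Friends Network – Document Style" slide.

```javascript
const contacts = [
  { _id: 0, name: "Bob Smith", friends: ["Anna Jones", "Chris Green"] },
  { _id: 1, name: "Anna Jones", friends: ["Bob Smith", "Chris Green", "Joe Lee"] },
  { _id: 2, name: "Chris Green", friends: ["Anna Jones", "Bob Smith"] },
];

// Simplified emulation of $graphLookup as a breadth-first search.
function graphLookup(from, startWith, opts) {
  const {
    connectToField,
    connectFromField,
    maxDepth = Infinity,
    depthField,
    restrict = () => true,
  } = opts;
  // Scalars and arrays are treated alike, mirroring "handles array fields".
  const asArray = (v) => (Array.isArray(v) ? v : v === undefined ? [] : [v]);
  const visited = new Set(); // documents already returned: this stops cycles
  const results = [];
  let frontier = asArray(startWith);
  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next = [];
    for (const doc of from) {
      if (visited.has(doc) || !restrict(doc)) continue;
      if (!frontier.includes(doc[connectToField])) continue;
      visited.add(doc);
      results.push(depthField ? { ...doc, [depthField]: depth } : doc);
      next.push(...asArray(doc[connectFromField]));
    }
    frontier = next;
  }
  return results;
}

const bob = contacts[0];
const network = graphLookup(contacts, bob.friends, {
  connectToField: "name",
  connectFromField: "friends",
  depthField: "depth",
});
console.log(network.map((d) => `${d.name} (${d.depth})`));
// [ 'Anna Jones (0)', 'Chris Green (0)', 'Bob Smith (1)' ]
```

Bob appears in his own network because his friends point back to him, and the search then stops because every reachable document has been visited. "Joe Lee" is referenced but has no document in this sample, so he is not returned. The real stage does not guarantee result order.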
Schema Design – with $graphLookup
• Options
  – Store immediate parent in each node
  – Store immediate children in each node
• Traverse in multiple directions
  – Recurse in same collection
  – Join/recurse into another collection
[Figure: example tree with nodes numbered 1 to 17]
"So just how suitable is MongoDB for the many varied graph use cases I have, then?"
75% of use cases*
*based on beta test user feedback
Example Scenarios
Scenario: Calculate Friend Network
{
_id: 0,
name: "Bob Smith",
friends: ["Anna Jones", "Chris Green"]
},
{
_id: 1,
name: "Anna Jones",
friends: ["Bob Smith", "Chris Green", "Joe Lee"]
},
{
_id: 2,
name: "Chris Green",
friends: ["Anna Jones", "Bob Smith"]
}
Scenario: Calculate Friend Network
[
  { $match: { "name": "Bob Smith" } },
  { $graphLookup: {
      from: "contacts",
      startWith: "$friends",
      connectToField: "name",
      connectFromField: "friends",
      as: "socialNetwork"
  } },
  { $project: { name: 1, friends: 1, socialNetwork: "$socialNetwork.name" } }
]
• This field ("friends") is an array
• No maxDepth set
Scenario: Calculate Friend Network
{
"_id" : 0,
"name" : "Bob Smith",
"friends" : [
"Anna Jones",
"Chris Green"
],
"socialNetwork" : [
"Joe Lee",
"Fred Brown",
"Bob Smith",
"Chris Green",
"Anna Jones"
]
}
• The result ("socialNetwork") is an array
Friends Network - Social
[Figure: graph linking Bob Smith, Chris Green, Anna Jones and Joe Lee by "friends" edges. Recommendation? Acme Soda]
Scenario: Determine Air Travel Options
[Figure: route map linking airports ORD, JFK, BOS, PWM and LHR]
{ "_id" : 0, "airport" : "JFK", "connects" : [ "BOS", "ORD" ] }
{ "_id" : 1, "airport" : "BOS", "connects" : [ "JFK", "PWM" ] }
{ "_id" : 2, "airport" : "ORD", "connects" : [ "JFK" ] }
{ "_id" : 3, "airport" : "PWM", "connects" : [ "BOS", "LHR" ] }
{ "_id" : 4, "airport" : "LHR", "connects" : [ "PWM" ] }
Scenario: Determine Air Travel Options
Meet Lucy
{ "_id" : 0, "name" : "Lucy", "nearestAirport" : "JFK" }
[
  { "$match": { "name": "Lucy" } },
  { "$graphLookup": {
      from: "airports",
      startWith: "$nearestAirport",
      connectToField: "airport",
      connectFromField: "connects",
      maxDepth: 2,
      depthField: "numFlights",
      as: "destinations"
  } }
]
Scenario: Determine Air Travel Options
Record the number of recursions
{
  name: "Lucy",
  nearestAirport: "JFK",
  destinations: [
    { _id: 0, airport: "JFK", connects: ["BOS", "ORD"], numFlights: 0 },
    { _id: 1, airport: "BOS", connects: ["JFK", "PWM"], numFlights: 1 },
    { _id: 2, airport: "ORD", connects: ["JFK"], numFlights: 1 },
    { _id: 3, airport: "PWM", connects: ["BOS", "LHR"], numFlights: 2 }
  ]
}
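With the depth recorded, an application can bucket the destinations by number of flights. The helper below is a hypothetical post-processing step, not part of the slides, applied to the result document for Lucy (the `connects` arrays are omitted for brevity).

```javascript
// Result document for Lucy, as returned by the pipeline with depthField "numFlights".
const lucy = {
  name: "Lucy",
  nearestAirport: "JFK",
  destinations: [
    { _id: 0, airport: "JFK", numFlights: 0 },
    { _id: 1, airport: "BOS", numFlights: 1 },
    { _id: 2, airport: "ORD", numFlights: 1 },
    { _id: 3, airport: "PWM", numFlights: 2 },
  ],
};

// Group airport codes by recursion depth (hypothetical helper).
function byNumFlights(destinations) {
  const groups = {};
  for (const { airport, numFlights } of destinations) {
    if (!groups[numFlights]) groups[numFlights] = [];
    groups[numFlights].push(airport);
  }
  return groups;
}

console.log(JSON.stringify(byNumFlights(lucy.destinations)));
// {"0":["JFK"],"1":["BOS","ORD"],"2":["PWM"]}
```

Depth 0 is the starting airport itself. LHR is missing because it sits one level beyond PWM, which is already at the maxDepth of 2.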
Scenario: Determine Air Travel Options
How many flights this would take
[Figure: route map linking airports ORD, JFK, BOS, PWM, LHR and ATL]
Scenario: Determine Air Travel Options
{ "_id" : 0, "airport" : "JFK", "connects" : [
    { "to" : "BOS", "airlines" : [ "UA", "AA" ] },
    { "to" : "ORD", "airlines" : [ "UA", "AA" ] },
    { "to" : "ATL", "airlines" : [ "AA", "DL" ] } ] }
{ "_id" : 1, "airport" : "BOS", "connects" : [
    { "to" : "JFK", "airlines" : [ "UA", "AA" ] },
    { "to" : "PWM", "airlines" : [ "AA" ] } ] }
{ "_id" : 2, "airport" : "ORD", "connects" : [
    { "to" : "JFK", "airlines" : [ "UA", "AA" ] } ] }
{ "_id" : 3, "airport" : "PWM", "connects" : [
    { "to" : "BOS", "airlines" : [ "AA" ] } ] }
Scenario: Determine Air Travel Options
[
  { "$match": { "name": "Lucy" } },
  { "$graphLookup": {
      from: "airports",
      startWith: "$nearestAirport",
      connectToField: "airport",
      connectFromField: "connects.to",
      maxDepth: 2,
      depthField: "numFlights",
      restrictSearchWithMatch: { "connects.airlines": "UA" },
      as: "UAdestinations"
  } }
]
Scenario: Determine Air Travel Options
We've added a filter
{
  "name" : "Lucy",
  "from" : "JFK",
  "UAdestinations" : [
    { "_id" : 2, "airport" : "ORD", "numFlights" : NumberLong(1) },
    { "_id" : 1, "airport" : "BOS", "numFlights" : NumberLong(1) }
  ]
}
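The restrictSearchWithMatch condition is evaluated against each airport document in the `from` collection, so an airport qualifies when any of its connections is flown by the airline. The sketch below is a plain-JavaScript equivalent of the query `{ "connects.airlines": "UA" }` over the sample documents; the helper name `servedBy` is made up for illustration.

```javascript
// Airport documents from the slide (ATL has no document in the sample).
const airports = [
  { _id: 0, airport: "JFK", connects: [
    { to: "BOS", airlines: ["UA", "AA"] },
    { to: "ORD", airlines: ["UA", "AA"] },
    { to: "ATL", airlines: ["AA", "DL"] },
  ] },
  { _id: 1, airport: "BOS", connects: [
    { to: "JFK", airlines: ["UA", "AA"] },
    { to: "PWM", airlines: ["AA"] },
  ] },
  { _id: 2, airport: "ORD", connects: [{ to: "JFK", airlines: ["UA", "AA"] }] },
  { _id: 3, airport: "PWM", connects: [{ to: "BOS", airlines: ["AA"] }] },
];

// True when any element of `connects` lists the airline, which is how the
// dotted path "connects.airlines" matches across an array of sub-documents.
function servedBy(doc, code) {
  return doc.connects.some((c) => c.airlines.includes(code));
}

const eligible = airports.filter((a) => servedBy(a, "UA")).map((a) => a.airport);
console.log(eligible); // [ 'JFK', 'BOS', 'ORD' ]
```

PWM only has an AA connection, so it fails the filter and the traversal neither returns it nor recurses through it.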
Scenario: Product Categories
[Figure: category tree with nodes Kitchen & Dining, Glassware & Drinkware, Mugs, Commuter & Travel, Outdoor Recreation, Camping Mugs, Running Thermos, Red Run Thermos, White Run Thermos and Blue Run Thermos]
Scenario: Product Categories
Get all children 2 levels deep – flat result
Scenario: Product Categories
Get all children 2 levels deep – nested result
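The sketch below shows one way the flat and nested results could be produced. Everything in it is an assumption made for illustration: the `categories` collection name, the `_id`, `name` and `parent` fields, the sample hierarchy, and the `nest` helper. The pipeline is only defined as a JavaScript object and is not executed here.

```javascript
// Flat result: start from a category's _id and follow child -> parent links
// backwards. maxDepth 1 returns children (depth 0) and grandchildren (depth 1).
const flatPipeline = [
  { $match: { name: "Kitchen & Dining" } },
  { $graphLookup: {
      from: "categories",
      startWith: "$_id",
      connectFromField: "_id",
      connectToField: "parent",
      maxDepth: 1,
      depthField: "depth",
      as: "descendants",
  } },
];

// Hypothetical flat output of that pipeline for a root category with _id 1.
const descendantsFlat = [
  { _id: 2, name: "Glassware & Drinkware", parent: 1, depth: 0 },
  { _id: 3, name: "Mugs", parent: 2, depth: 1 },
  { _id: 4, name: "Commuter & Travel", parent: 1, depth: 0 },
];

// Nested result: rebuild the tree in the application from the parent references.
// Assumes the data is acyclic, as a category hierarchy should be.
function nest(rootId, flat) {
  const childrenOf = (parentId) =>
    flat
      .filter((c) => c.parent === parentId)
      .map((c) => ({ name: c.name, children: childrenOf(c._id) }));
  return childrenOf(rootId);
}

const tree = nest(1, descendantsFlat);
console.log(tree.map((c) => c.name));  // [ 'Glassware & Drinkware', 'Commuter & Travel' ]
console.log(tree[0].children[0].name); // Mugs
```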
Scenario: Article Recommendation
[Figure: graph of articles, each node labelled with content id and conversion rate, arranged at Depth 0, Depth 1 and Depth 2 and linked by recommendation edges]
Scenario: Article Recommendation
[Figure: the same article graph with three targets highlighted, Target #1 (best), Target #2 and Target #3, showing the recommendations for Target #1 and the recommendation for Targets #2 and #3]
Design & Performance Considerations
The Tale of Two Biebers
VS
Follower Churn
• Everyone worries about scaling content
• But follow requests can be >> message send rates
• Twitter enforces per-day follow limits
Edge Metadata
• Models: friends/followers
• Requirements typically start simple
• Add Groups, Favorites, Relationships
Options for Storing Graphs in MongoDB
Option One – Embedding Edges
Embedded Edge Arrays
• Storing connections with user (popular choice)
  ✓ Most compact form
  ✓ Efficient for reads
• However…
  – User documents grow
  – Upper limit on degree (document size)
  – Difficult to annotate (and index) edge
{
  "_id" : "djw",
  "fullname" : "Darren Wood",
  "country" : "Australia",
  "followers" : [ "jsr", "ian" ],
  "following" : [ "jsr", "pete" ]
}
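To get a feel for the upper limit on degree, the following back-of-the-envelope sketch estimates how many short string ids fit in one embedded array before the 16 MB BSON document limit is reached. It counts only the array elements, ignoring the other fields and the document and array headers, so it is an optimistic upper bound; the function names are made up for illustration.

```javascript
const MAX_BSON_BYTES = 16 * 1024 * 1024; // MongoDB's maximum BSON document size

// Size of one string element inside a BSON array:
// 1 type byte + array index as a NUL-terminated key + int32 length + bytes + NUL.
function arrayStringElementBytes(index, valueBytes) {
  return 1 + (String(index).length + 1) + 4 + valueBytes + 1;
}

// Count how many copies of an id of this length fit under the limit.
function maxEmbeddedEdges(id) {
  const valueBytes = Buffer.byteLength(id, "utf8");
  let used = 0;
  let count = 0;
  while (used + arrayStringElementBytes(count, valueBytes) <= MAX_BSON_BYTES) {
    used += arrayStringElementBytes(count, valueBytes);
    count++;
  }
  return count;
}

console.log(maxEmbeddedEdges("jsr")); // roughly one million for 3-character ids
```

A user with a few million followers therefore cannot be stored this way, and longer ids or per-edge metadata lower the ceiling further.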
Embedded Edge Arrays
• Creating rich graph information
• Can become cumbersome
{
  "_id" : "djw",
  "fullname" : "Darren Wood",
  "country" : "Australia",
  "friends" : [
    { "uid" : "jsr", "grp" : "school" },
    { "uid" : "ian", "grp" : "work" }
  ]
}
{
  "_id" : "djw",
  "fullname" : "Darren Wood",
  "country" : "Australia",
  "friends" : [ "jsr", "ian" ],
  "group" : [ "school", "work" ]
}
Option Two – Edge Collection
Edge Collections
• Document per edge
• Very flexible for adding edge data
> db.followers.findOne()
{
  "_id" : ObjectId(…),
  "from" : "djw",
  "to" : "jsr"
}
> db.friends.findOne()
{
  "_id" : ObjectId(…),
  "from" : "djw",
  "to" : "jsr",
  "grp" : "work",
  "ts" : Date("2013-07-10")
}
Edge Collection Indexing Strategies
Finding Followers
Find followers in a single edge collection:
> db.followers.find({from : "djw"}, {_id:0, to:1})
{ "to" : "jsr" }
Using index:
{
  "v" : 1,
  "key" : { "from" : 1, "to" : 1 },
  "unique" : true,
  "ns" : "socialite.followers",
  "name" : "from_1_to_1"
}
• Covered index when searching on "from" for all followers
• "unique": specify only if multiple edges cannot exist
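In the shell, an index like this would typically be created with `db.followers.createIndex({ from: 1, to: 1 }, { unique: true })`. The sketch below is a deliberately simplified check, written for this summary, of why the query above can be answered from the index alone: every filtered and returned field is in the index key and `_id` is excluded. Real coverage rules have further conditions (for example around array fields), so treat `isCovered` as a hypothetical teaching aid.

```javascript
const followersIndex = { from: 1, to: 1 };

// Simplified covered-query test: all filter fields and all returned fields
// must be part of the index key, and _id must not be returned.
function isCovered(indexKey, filter, projection) {
  const indexed = new Set(Object.keys(indexKey));
  const filterOk = Object.keys(filter).every((f) => indexed.has(f));
  const returned = Object.keys(projection).filter((f) => projection[f]);
  const idExcluded = projection._id === 0 || indexed.has("_id");
  return (
    filterOk &&
    idExcluded &&
    returned.length > 0 &&
    returned.every((f) => indexed.has(f))
  );
}

console.log(isCovered(followersIndex, { from: "djw" }, { _id: 0, to: 1 })); // true
console.log(isCovered(followersIndex, { from: "djw" }, { to: 1 }));         // false
```

The second call fails only because `_id` is returned by default, which is why the query on the slide sets `_id: 0`.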
Finding Following
What about who a user is following?
Could use a reverse covered index:
{
  "v" : 1,
  "key" : { "from" : 1, "to" : 1 },
  "unique" : true,
  "ns" : "socialite.followers",
  "name" : "from_1_to_1"
}
{
  "v" : 1,
  "key" : { "to" : 1, "from" : 1 },
  "unique" : true,
  "ns" : "socialite.followers",
  "name" : "to_1_from_1"
}
Notice the flipped field order in the second index
Finding Following
Wait! There may be an issue with the reverse index…
SHARDING!
• If we shard this collection by "from", looking up followers for a specific user is "targeted" to a shard
• To find who the user is following, however, the query must scatter-gather to all shards
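The difference between a targeted and a scatter-gather query can be sketched with a toy router. The shard names and the hash function are invented for illustration and bear no relation to how MongoDB actually distributes chunks; the point is only that a query is routable to one shard when it pins the shard key.

```javascript
const SHARDS = ["shard0", "shard1", "shard2"]; // hypothetical cluster

// Toy routing function: deterministic hash of the shard key value.
function shardFor(value) {
  let h = 0;
  for (const ch of String(value)) h = (h * 31 + ch.codePointAt(0)) >>> 0;
  return SHARDS[h % SHARDS.length];
}

// A query is targeted only if it specifies the shard key; otherwise it is
// broadcast to every shard and the results are merged.
function targetShards(query, shardKey) {
  return shardKey in query ? [shardFor(query[shardKey])] : [...SHARDS];
}

console.log(targetShards({ from: "djw" }, "from").length); // 1 (targeted)
console.log(targetShards({ to: "djw" }, "from").length);   // 3 (scatter-gather)
```

The reverse index still speeds up the lookup on each shard, but it does not reduce the number of shards that have to be asked.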
Dual Edge Collections
• When "following" queries are common
  – Not always the case
  – Consider overhead carefully
• Can use dual collections
  – One for each direction
  – Edges are duplicated, reversed
  – Can be sharded independently
Wrap-up
MongoDB $graphLookup
• Efficient, index-based recursive queries
• Familiar MongoDB query language
• Use a single System Of Record
  – Cater for all query types
  – No added operational overhead
  – No synchronization requirements
  – Reduced technology surface area