Schema Design (and its performance implications)
Jay Runkel
Principal Solutions Architect
@jayrunkel
Agenda
1. Today’s Example
2. MongoDB Schema Design vs. Relational
3. Modeling Relationships
4. Schema Design and Performance
Medical Records
• Collects all patient information in a central repository
• Provides a central point of access for
  – Patients
  – Care providers: physicians, nurses, etc.
  – Billing
  – Insurance reconciliation
• Hospitals, physicians, patients, procedures, records

[Figure: Patient Records at the center, linked to Medications, Lab Results, Procedures, Hospital Records, Physicians, Patients, Nurses, and Billing]
Medical Record Data
• Hospitals
  – Have physicians
• Physicians
  – Have patients
  – Perform procedures
  – Belong to hospitals
• Patients
  – Have physicians
  – Are the subject of procedures
• Procedures
  – Associated with a patient
  – Associated with a physician
  – Have a record
  – Variable metadata
• Records
  – Associated with a procedure
  – Binary data
  – Variable fields
MongoDB                     Relational
Collections                 Tables
Documents                   Rows
Data Use                    Data Storage
What questions do I have?   What answers do I have?
MongoDB versus Relational

Attribute      MongoDB                Relational
Storage        N-dimensional          Two-dimensional
Field Values   0, 1, many, or embed   Single value
Query          Any field or level     Any field
Schema         Flexible               Very structured
Documents are Rich Data Structures

{
  first_name: 'Paul',
  surname: 'Miller',
  cell: '+447557505611',
  city: 'London',
  location: [45.123, 47.232],
  Profession: ['banking', 'finance', 'trader'],
  cars: [
    { model: 'Bentley', year: 1973, value: 100000, … },
    { model: 'Rolls Royce', year: 1965, value: 330000, … }
  ]
}

• Fields hold typed values: strings (first_name), numbers (year), geo-coordinates (location)
• Fields can contain arrays (Profession)
• Fields can contain an array of sub-documents (cars)
Referencing
• Procedure: patient, date, type, physician
• Results: dataType, size, content: {…}

Use two collections with a reference
Similar to relational
Embedding: Document Schema
Procedure
• patient
• date
• type
• results
  – equipmentId
  – data1
  – data2
• physician
• Results
  – type
  – size
  – content: {…}
Referencing

Procedure
{
  "_id" : 333,
  "date" : ISODate("2003-02-09T05:00:00"),
  "hospital" : "County Hills",
  "patient" : "John Doe",
  "physician" : "Stephen Smith",
  "type" : "Chest X-ray",
  "result" : 134
}

Results
{
  "_id" : 134,
  "type" : "txt",
  "size" : NumberInt(12),
  "content" : { value1: 343, value2: "abc", … }
}
Embedding

Procedure
{
  "_id" : 333,
  "date" : ISODate("2003-02-09T05:00:00"),
  "hospital" : "County Hills",
  "patient" : "John Doe",
  "physician" : "Stephen Smith",
  "type" : "Chest X-ray",
  "result" : {
    "type" : "txt",
    "size" : NumberInt(12),
    "content" : { value1: 343, value2: "abc", … }
  }
}
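The difference between the two layouts can be sketched in plain JavaScript, with no database involved. The `Map` objects below are hypothetical in-memory stand-ins for collections, the function names are made up for this illustration, and the documents are trimmed copies of the slide examples. Referencing costs two lookups, while embedding costs one.

```javascript
// Hypothetical in-memory stand-ins for the two collections used when referencing.
const procedures = new Map([
  [333, { _id: 333, patient: "John Doe", type: "Chest X-ray", result: 134 }],
]);
const results = new Map([
  [134, { _id: 134, type: "txt", size: 12, content: { value1: 343, value2: "abc" } }],
]);

// Referencing: one lookup for the procedure, a second one for its result.
function getResultByReference(procedureId) {
  const procedure = procedures.get(procedureId); // lookup 1
  if (!procedure) return undefined;
  return results.get(procedure.result); // lookup 2
}

// Embedding: the result is stored inside the procedure document.
const embeddedProcedures = new Map([
  [333, {
    _id: 333,
    patient: "John Doe",
    type: "Chest X-ray",
    result: { type: "txt", size: 12, content: { value1: 343, value2: "abc" } },
  }],
]);

// A single lookup returns the procedure together with its result.
function getResultEmbedded(procedureId) {
  const procedure = embeddedProcedures.get(procedureId);
  return procedure ? procedure.result : undefined;
}

console.log(getResultByReference(333).type);
console.log(getResultEmbedded(333).type);
```

Against a real deployment each lookup would be a query, so the referenced layout pays one extra round trip per procedure.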
Embedding
• Advantages
  – Retrieve all relevant information in a single query/document
  – Avoid implementing joins in application code
  – Update related information as a single atomic operation
    • MongoDB did not offer multi-document transactions before version 4.0
• Limitations
  – Large documents mean more overhead if most fields are not relevant
  – 16 MB document size limit
Atomicity
• Document operations are atomic

  db.patients.update({_id: 12345},
    {$inc : {numProcedures : 1},
     $push : {procedures : "proc123"},
     $set : {"addr.state" : "TX"}})

• No multi-document transactions (before MongoDB 4.0), so a sequence like this pseudocode could not be written:

  db.beginTransaction();
  db.patients.update({_id: 12345}, …);
  db.procedure.insert({_id: "proc123", …});
  db.records.insert({_id: "rec123", …});
  db.endTransaction();
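To make the single-document update concrete, here is a toy model of what the three operators do to one patient document. It is a sketch for illustration only, not how MongoDB implements updates: `resolvePath` and `applyUpdate` are hypothetical helpers, and the starting patient document is invented. All three changes land on the same document, which is why they can be applied as one atomic step.

```javascript
// Update document from the slide; dotted paths must be quoted in JavaScript.
const update = {
  $inc: { numProcedures: 1 },
  $push: { procedures: "proc123" },
  $set: { "addr.state": "TX" },
};

// Walk a dotted path and return [parentObject, lastKey],
// creating intermediate objects when they are missing.
function resolvePath(doc, path) {
  const keys = path.split(".");
  let node = doc;
  for (const key of keys.slice(0, -1)) {
    if (typeof node[key] !== "object" || node[key] === null) node[key] = {};
    node = node[key];
  }
  return [node, keys[keys.length - 1]];
}

// Toy model: apply $inc, $push and $set to a copy of the document.
function applyUpdate(doc, upd) {
  const copy = JSON.parse(JSON.stringify(doc));
  for (const [path, amount] of Object.entries(upd.$inc || {})) {
    const [node, key] = resolvePath(copy, path);
    node[key] = (node[key] || 0) + amount;
  }
  for (const [path, value] of Object.entries(upd.$push || {})) {
    const [node, key] = resolvePath(copy, path);
    if (!Array.isArray(node[key])) node[key] = [];
    node[key].push(value);
  }
  for (const [path, value] of Object.entries(upd.$set || {})) {
    const [node, key] = resolvePath(copy, path);
    node[key] = value;
  }
  return copy;
}

// Hypothetical starting state of patient 12345.
const patient = {
  _id: 12345,
  numProcedures: 2,
  procedures: ["proc122"],
  addr: { state: "NH" },
};
const updatedPatient = applyUpdate(patient, update);
console.log(updatedPatient);
```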
Referencing
• Advantages
  – Smaller documents
  – Less likely to reach 16 MB document limit
  – Infrequently accessed information not accessed on every query
  – No duplication of data
• Limitations
  – Two queries required to retrieve information
  – Cannot update related information atomically
One to One: General Recommendations
• Embed
  – No additional data duplication
  – Can query or index on embedded field
    • e.g., "result.type"
• Exceptional cases
  – Embedding results in large documents
  – Set of infrequently accessed fields

{
  "_id" : 333,
  "date" : ISODate("2003-02-09T05:00:00"),
  "hospital" : "County Hills",
  "patient" : "John Doe",
  "physician" : "Stephen Smith",
  "type" : "Chest X-ray",
  "result" : {
    "type" : "txt",
    "size" : NumberInt(12),
    "content" : { value1: 343, value2: "abc", … }
  }
}
One-to-Many Relationships
Modeled in 2 possible ways

Embed (Patients collection)
{
  _id: 2,
  first: "Joe",
  last: "Patient",
  addr: { … },
  procedures: [
    { id: 12345, date: "2015-02-15", type: "Cat scan", … },
    { id: 12346, date: "2015-02-15", type: "blood test", … }
  ]
}

Reference (Patients and Procedures collections)
Patients:
{ _id: 2, first: "Joe", last: "Patient", addr: { … }, procedures: [12345, 12346] }
Procedures:
{ _id: 12345, date: "2015-02-15", type: "Cat scan", … }
{ _id: 12346, date: "2015-02-15", type: "blood test", … }
One to Many: General Recommendations
• Embed, when possible
  – Access all information in a single query
  – Take advantage of update atomicity
  – No additional data duplication
  – Can query or index on any field
    • e.g., { "phones.type": "mobile" }
• Exceptional cases:
  – 16 MB document size
  – Large number of infrequently accessed fields

{
  _id: 2,
  first: "Joe",
  last: "Patient",
  addr: { … },
  procedures: [
    { id: 12345, date: "2015-02-15", type: "Cat scan", … },
    { id: 12346, date: "2015-02-15", type: "blood test", … }
  ]
}
Many to Many
Traditional Relational Association

• Physicians: name, specialty, phone
• Hospitals: name
• Join table HosPhysicanRel: hospitalId, physicianId (crossed out on the slide)

Use arrays instead
Many-to-Many Relationships
Embedding physicians in hospitals collection

{
  _id: 1,
  name: "Oak Valley Hospital",
  city: "New York",
  beds: 131,
  physicians: [
    { id: 12345, name: "Joe Doctor", address: {…}, … },
    { id: 12346, name: "Mary Well", address: {…}, … }
  ]
}

{
  _id: 2,
  name: "Plainmont Hospital",
  city: "Omaha",
  beds: 85,
  physicians: [
    { id: 63633, name: "Harold Green", address: {…}, … },
    { id: 12345, name: "Joe Doctor", address: {…}, … }
  ]
}

Data duplication: "Joe Doctor" (id 12345) is stored in both hospital documents.
Many-to-Many Relationships
Referencing

Hospitals
{ _id: 1, name: "Oak Valley Hospital", city: "New York", beds: 131, physicians: [12345, 12346] }
{ _id: 2, name: "Plainmont Hospital", city: "Omaha", beds: 85, physicians: [63633, 12345] }

Physicians
{ _id: 63633, name: "Harold Green", address: {…}, … }
{ _id: 12345, name: "Joe Doctor", address: {…}, … }
{ _id: 12346, name: "Mary Well", address: {…}, … }
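With referencing, the relationship can be followed in both directions from the arrays of ids. The sketch below uses plain arrays as hypothetical stand-ins for the two collections and assumes physician documents are keyed by `_id`. The helper names `physiciansAt` and `hospitalsFor` are made up for this illustration.

```javascript
// Trimmed copies of the slide documents, held in memory for illustration.
const hospitals = [
  { _id: 1, name: "Oak Valley Hospital", physicians: [12345, 12346] },
  { _id: 2, name: "Plainmont Hospital", physicians: [63633, 12345] },
];
const physicians = [
  { _id: 63633, name: "Harold Green" },
  { _id: 12345, name: "Joe Doctor" },
  { _id: 12346, name: "Mary Well" },
];

// Hospital -> physicians: read the id array, then resolve each id.
function physiciansAt(hospitalId) {
  const hospital = hospitals.find((h) => h._id === hospitalId);
  if (!hospital) return [];
  const ids = new Set(hospital.physicians);
  return physicians.filter((p) => ids.has(p._id)).map((p) => p.name);
}

// Physician -> hospitals: find every hospital whose array contains the id.
function hospitalsFor(physicianId) {
  return hospitals
    .filter((h) => h.physicians.includes(physicianId))
    .map((h) => h.name);
}

console.log(physiciansAt(2));
console.log(hospitalsFor(12345));
```

In the mongo shell the second direction corresponds to a filter such as `{ physicians: 12345 }`, since a query on an array field matches documents whose array contains the value.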
Many to Many
General Recommendation
• Use case determines whether to reference or embed:
  1. Data duplication
     • Embedding may result in data duplication
     • Duplication may be okay if reads dominate updates
  2. Referencing may be required if there are many related items
  3. Hybrid approach
     • Potentially do both

Reference (Hospitals and Physicians collections)
Hospitals:
{ _id: 1, name: "Oak Valley Hospital", city: "New York", beds: 131, physicians: [12345, 12346] }
Physicians:
{ _id: 12345, name: "Joe Doctor", address: {…}, … }
{ _id: 12346, name: "Mary Well", address: {…}, … }
GridFS

[Figure: the driver's GridFS API stores doc.jpg as one metadata document in fs.files and several chunk documents in fs.chunks]

mongofiles utility provides command line GridFS interface
Tailor Schema to Queries (cont.)

Query: find all patients from NH that have had chest x-rays

Patient (procedures stored as references only)
{
  "_id" : 593340651,
  "first" : "Gregorio",
  "last" : "Lang",
  "addr" : { "street" : "623 Flowers Rd", "city" : "Groton", "state" : "NH", "zip" : 3266 },
  "physicians" : [10387, 33456],
  "procedures" : ["551ac", "343fs"]
}

Procedure
{
  "_id" : "551ac",
  "date" : "2000-04-26",
  "hospital" : 161,
  "patient" : 593340651,
  "physician" : 10387,
  "type" : "Chest X-ray",
  "records" : ["67bc6"]
}

With only procedure ids in the patient document, the state lives in one collection and the procedure type in the other, so the query has to consult both.

Tailor Schema to Queries (cont.)

Patient (procedure type stored next to each reference)
{
  "_id" : 593340651,
  "first" : "Gregorio",
  "last" : "Lang",
  "addr" : { "street" : "623 Flowers Rd", "city" : "Groton", "state" : "NH", "zip" : 3266 },
  "physicians" : [10387, 33456],
  "procedures" : [
    { id : "551ac", type : "Chest X-ray" },
    { id : "343fs", type : "Blood Test" }
  ]
}

Procedure documents are unchanged. The same query can now be answered from the patient collection alone.
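Once the procedure type is stored inside the patient document, the question "patients from NH that have had chest x-rays" can be expressed as a single filter on one collection. The sketch below shows such a filter and evaluates the same condition in memory. The second and third patients are invented for contrast, and `matchesQuery` is a hypothetical helper rather than MongoDB's matching logic.

```javascript
// Filter that could be passed to find() on the patients collection.
const filter = { "addr.state": "NH", "procedures.type": "Chest X-ray" };

const patients = [
  {
    _id: 593340651, first: "Gregorio", last: "Lang",
    addr: { state: "NH" },
    procedures: [
      { id: "551ac", type: "Chest X-ray" },
      { id: "343fs", type: "Blood Test" },
    ],
  },
  // Invented patients, present only to show non-matching cases.
  { _id: 1001, addr: { state: "NH" }, procedures: [{ id: "a1", type: "Blood Test" }] },
  { _id: 1002, addr: { state: "TX" }, procedures: [{ id: "b1", type: "Chest X-ray" }] },
];

// In-memory equivalent of the filter: the state must match and at least
// one embedded procedure must have the requested type.
function matchesQuery(patient, state, procedureType) {
  return (
    patient.addr.state === state &&
    patient.procedures.some((p) => p.type === procedureType)
  );
}

const matchingIds = patients
  .filter((p) => matchesQuery(p, filter["addr.state"], filter["procedures.type"]))
  .map((p) => p._id);
console.log(matchingIds);
```

The trade-off is that the type is duplicated in the patient and procedure documents, so a change to a procedure's type would have to be written in both places.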
Vital Sign Monitoring Device

Vital signs measured:
• Blood Pressure
• Pulse
• Blood Oxygen Levels

Produces data at regular intervals
• Once per minute
Data From Vital Signs Monitoring Device

{
  deviceId: 123456,
  spO2: 88,
  pulse: 74,
  bp: [128, 80],
  ts: ISODate("2013-10-16T22:07:00.000-0500")
}

• One document per minute per device
• Relational approach
Document Per Hour (By minute)

{
  deviceId: 123456,
  spO2: { 0: 88, 1: 90, …, 59: 92 },
  pulse: { 0: 74, 1: 76, …, 59: 72 },
  bp: { 0: [122, 80], 1: [126, 84], …, 59: [124, 78] },
  ts: ISODate("2013-10-16T22:00:00.000-0500")
}

• Store per-minute data at the hourly level
• Update-driven workload
• 1 document per device per hour
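Under the document-per-hour schema each new reading becomes an update against that hour's document. A minimal sketch of how the filter and `$set` update could be built from one reading follows. `hourlyBucketUpdate` is a hypothetical helper, and it truncates timestamps to the UTC hour, which lines up with local hours only for whole-hour time zone offsets.

```javascript
// Build the filter and update for one reading (document-per-hour schema).
function hourlyBucketUpdate(reading) {
  const ts = new Date(reading.ts);
  const minute = ts.getUTCMinutes(); // position of the reading within its hour

  const hourStart = new Date(ts);
  hourStart.setUTCMinutes(0, 0, 0); // truncate to the start of the hour

  return {
    // One document per device per hour.
    filter: { deviceId: reading.deviceId, ts: hourStart },
    // Dotted paths such as "spO2.7" address the minute slot inside the document.
    update: {
      $set: {
        [`spO2.${minute}`]: reading.spO2,
        [`pulse.${minute}`]: reading.pulse,
        [`bp.${minute}`]: reading.bp,
      },
    },
  };
}

// The per-minute reading from the earlier slide.
const bucket = hourlyBucketUpdate({
  deviceId: 123456,
  spO2: 88,
  pulse: 74,
  bp: [128, 80],
  ts: "2013-10-16T22:07:00.000-05:00",
});
console.log(JSON.stringify(bucket, null, 2));
```

Passing the pair to an update with the upsert option, for example `updateOne(filter, update, { upsert: true })`, would produce the pattern from the slides: the first reading of the hour inserts the document and the remaining 59 update it.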
Characterizing Write Differences
• Example: data generated every minute
• Recording the data for 1 patient for 1 hour:

Document Per Event    60 inserts
Document Per Hour     1 insert, 59 updates
Characterizing Read Differences
• Want to graph 24 hours of vital signs for a patient:

Document Per Event    1440 reads
Document Per Hour     24 reads

• Read performance is greatly improved with one document per hour
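The write and read counts above follow from simple arithmetic, sketched here for an arbitrary number of hours. The function names are made up for this illustration, and the model assumes exactly one reading per minute with no gaps.

```javascript
const MINUTES_PER_HOUR = 60;

// Write operations needed to record `hours` of data for one patient.
function writeOps(hours) {
  return {
    perEvent: { inserts: hours * MINUTES_PER_HOUR, updates: 0 },
    // The first reading of each hour inserts the document; the rest update it.
    perHour: { inserts: hours, updates: hours * (MINUTES_PER_HOUR - 1) },
  };
}

// Documents that must be read to graph `hours` of data for one patient.
function documentsRead(hours) {
  return { perEvent: hours * MINUTES_PER_HOUR, perHour: hours };
}

console.log(writeOps(1));
console.log(documentsRead(24));
```

The total number of write operations is the same in both schemas (60 per hour); what changes is the mix of inserts and updates and the number of documents a read has to fetch.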
Characterizing Memory and Storage Differences
• 100K devices
• 1 year's worth of data

                         Document Per Minute   Document Per Hour
Number of Documents      52.6 B                876 M
Total Index Size         6364 GB               106 GB
  _id index              1468 GB               24.5 GB
  {ts: 1, deviceId: 1}   4895 GB               81.6 GB
Document Size            92 Bytes              758 Bytes
Database Size            4503 GB               618 GB

How the figures are derived (per minute, then per hour):
Number of documents:       100000 * 365 * 24 * 60         100000 * 365 * 24
Total index size (bytes):  100000 * 365 * 24 * 60 * 130   100000 * 365 * 24 * 130
Database size (bytes):     100000 * 365 * 24 * 60 * 92    100000 * 365 * 24 * 758
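The sizing figures can be reproduced with a few lines of arithmetic. This sketch rests on two assumptions read off the slide's formulas rather than stated in it: roughly 130 bytes of index per document across both indexes, and "GB" meaning 2^30 bytes. The `estimate` function is a hypothetical helper.

```javascript
const DEVICES = 100000;
const HOURS_PER_YEAR = 365 * 24;
const BYTES_PER_GB = 2 ** 30; // assumption: the slide's GB is 2^30 bytes

// Estimate one year of storage for a schema that writes `docsPerHour`
// documents per device, each `docBytes` large, with `indexBytesPerDoc`
// bytes of index per document.
function estimate(docsPerHour, docBytes, indexBytesPerDoc) {
  const documents = DEVICES * HOURS_PER_YEAR * docsPerHour;
  return {
    documents,
    indexGB: Math.round((documents * indexBytesPerDoc) / BYTES_PER_GB),
    databaseGB: Math.round((documents * docBytes) / BYTES_PER_GB),
  };
}

const perMinuteSchema = estimate(60, 92, 130);
const perHourSchema = estimate(1, 758, 130);

console.log(perMinuteSchema); // documents: 52560000000, indexGB: 6364, databaseGB: 4503
console.log(perHourSchema);   // documents: 876000000, indexGB: 106, databaseGB: 618
```

Each hourly document is about eight times larger than a per-minute document, but there are sixty times fewer of them, which is where the smaller index and database sizes come from.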
Summary
• Relationships can be modeled by embedding or references
• Decision should be made in context of application data and query workload
  – Tailor schema to application workload
• It is okay, even recommended, to violate RDBMS schema design principles
  – No duplication of data
  – Normalization
• Different schemas may result in dramatically different
  – Query performance
  – Hardware requirements