7/28/2019 2-DataModeling
http://slidepdf.com/reader/full/2-datamodeling 1/36
© Christine Collet
Big data models
Ch. Collet and G. Vargas-Solar
Outline
n Data modeling and processing
u Modeling for simplifying data access
u map-reduce programming approach
n Data distribution
u Replication and sharding
u Management issues
n Data Storage: Polyglot persistence
n Service coordination
n Open issues and perspectives
Data Applications
n Old model
u “Query the world”
u data acquisition coupled to a specific hypothesis
n New model
u “Download the world”
u data acquired en masse, in support of many hypotheses
How to handle scalability?
Scaling: Attack of the clusters
n Scaling up implies bigger machines with more processors, disk storage, and memory.
u The alternative is to scale out: use lots of small machines in a cluster.
n A cluster of small machines can use commodity hardware and ends up being cheaper at these kinds of scales.
n It can also be more resilient: while individual machine failures are common, the overall cluster can be built to keep going despite such failures, providing high reliability.
Clustered relational databases
n They use a cluster-aware file system that writes to a highly available disk subsystem, but this means the cluster still has the disk subsystem as a single point of failure.
n Relational systems could also be run as separate servers for different sets of data, effectively sharding the database.
=> but the distribution has to be controlled by the application, which has to keep track of which database server to talk to for each bit of data.
Also, we lose any querying, referential integrity, transactions, or consistency controls that cross shards.
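To make the sharding point concrete, here is a minimal Python sketch of application-controlled distribution. The server names, the choice of a CRC32 hash, and the helper names `shard_for` and `servers_for_query` are illustrative assumptions, not something the slides prescribe.

```python
import zlib

# Hypothetical shard map: these names are placeholders, not real servers.
SHARDS = ["db-server-0", "db-server-1", "db-server-2"]

def shard_for(customer_id: int) -> str:
    """Pick the server that owns a customer's rows, using a stable hash of the key."""
    digest = zlib.crc32(str(customer_id).encode("utf-8"))
    return SHARDS[digest % len(SHARDS)]

def servers_for_query(customer_ids):
    """A query touching several customers may have to visit several servers;
    the application has to merge the partial results itself."""
    return sorted({shard_for(cid) for cid in customer_ids})

print(shard_for(1))
print(servers_for_query([1, 2, 3, 4]))
```

The routing logic lives entirely in the application, which is why joins, constraints, and transactions that span two servers are no longer handled by the database.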
Relational systems are not designed to run efficiently on clusters
New systems have emerged to address the requirements of data management in the cloud
- NoSQL data stores
- Scalable SQL databases
Global vision
Being …
- In between logical data models and data store models
- Able to handle large volumes of data: manage thousands of processors, distribute data and components
- Polyglot: picking the right language, execution model, and data store for the job
NoSQL Common characteristics
n Not using the relational model; most do not use SQL for queries
n Schemaless: no predefined schemas
n Aggregate-oriented models and others; no theoretical foundation
n Running well on clusters (map-reduce)
n Open-source
n Built for the 21st century web estates
n Polyglot persistence: the most important result of the rise of NoSQL
Example of Data schema (UML)
Relational representation
Aggregate data representation (1)
Two aggregates in JSON
# Customer
{
  "id": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}]
}
# Order
{
  "id": 99,
  "customerId": 1,
  "orderItems": [
    {
      "productId": 27,
      "price": 32.45,
      "productName": "NoSQL Distilled"
    }
  ],
  "shippingAddress": [{"city": "Chicago"}],
  "orderPayment": [
    {
      "ccinfo": "1000-1000-1000-1000",
      "txnId": "abelif879rft",
      "billingAddress": {"city": "Chicago"}
    }
  ]
}
JAVASCRIPT OBJECT NOTATION - JSON
Lightweight text format for exchanging structured data
n Based on a complex data model
n Similar to XML
Fundamental concepts:
n Object: unordered container of key/value pairs; objects are wrapped in {}; keys are strings; ":" separates keys and values
n Array: ordered sequence of values; arrays are wrapped in []
n Value: a string, number, boolean, null, object, or array
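As a small illustration of these concepts, the sketch below parses a JSON text with Python's standard `json` module. The customer fields come from the slides; the `active` and `nickname` fields are assumptions added only to show the boolean and null values.

```python
import json

# "active" and "nickname" are hypothetical fields, added to show booleans and null.
text = (
    '{"id": 1, "name": "Martin", '
    '"billingAddress": [{"city": "Chicago"}], '
    '"active": true, "nickname": null}'
)

customer = json.loads(text)

print(type(customer).__name__)                    # dict  (JSON object)
print(type(customer["billingAddress"]).__name__)  # list  (JSON array)
print(customer["billingAddress"][0]["city"])      # Chicago
print(customer["active"], customer["nickname"])   # True None
```

A JSON object becomes a Python `dict`, an array becomes a `list`, and `true`/`null` become `True`/`None`.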
Aggregate data representation (1)
Two main aggregates: customer and order (black diamond composition marker)
u The customer contains billing addresses.
u The order contains a list of order items, a shipping address, and payments; each payment itself contains a billing address for that payment.
u The same address may be repeated three times (customer billing address, shipping address, payment billing address).
u The link between the customer and the order isn’t within either aggregate; the order only refers to the customer through customerId.
n Aggregate boundary: think about how the data is accessed
u e.g. putting / not putting all the orders for a customer into the customer aggregate.
Another aggregate data representation (2)
One aggregate
{
  "customer": {
    "id": 1,
    "name": "Martin",
    "billingAddress": [{"city": "Chicago"}],
    "orders": [
      {
        "id": 99,
        "customerId": 1,
        "orderItems": [
          {"productId": 27, "price": 32.45, "productName": "NoSQL Distilled"}
        ],
        "shippingAddress": [{"city": "Chicago"}],
        "orderPayment": [
          {
            "ccinfo": "1000-1000-1000-1000",
            "txnId": "abelif879rft",
            "billingAddress": {"city": "Chicago"}
          }
        ]
      }
    ]
  }
}
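The two layouts on these slides (separate customer and order aggregates versus a single customer aggregate that embeds its orders) hold the same data, so one can be derived from the other. The following Python sketch shows the conversion in both directions; the function names `embed_orders` and `split_orders` are hypothetical and the data is abbreviated from the slides.

```python
import copy

customer = {"id": 1, "name": "Martin", "billingAddress": [{"city": "Chicago"}]}
orders = [
    {
        "id": 99,
        "customerId": 1,
        "orderItems": [{"productId": 27, "price": 32.45,
                        "productName": "NoSQL Distilled"}],
        "shippingAddress": [{"city": "Chicago"}],
    }
]

def embed_orders(customer, orders):
    """Build the one-aggregate form: the customer carries its own orders."""
    aggregate = copy.deepcopy(customer)
    aggregate["orders"] = [
        copy.deepcopy(o) for o in orders if o["customerId"] == customer["id"]
    ]
    return {"customer": aggregate}

def split_orders(aggregate):
    """Inverse operation: recover a customer aggregate and its order aggregates."""
    customer = copy.deepcopy(aggregate["customer"])
    orders = customer.pop("orders", [])
    return customer, orders

one_aggregate = embed_orders(customer, orders)
print(len(one_aggregate["customer"]["orders"]))  # 1
```

Which form is preferable depends on access: embedding suits reading a customer together with all orders, while separate aggregates suit working on one order at a time.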
Data models
n Key-value Data Model (aggregate-oriented)
n Document Data Model (aggregate-oriented)
n Column Family Data Model (aggregate-oriented)
n Graph Data Model
n Array Data Model
n Object Data Model
Aggregate-oriented Data models
Key-value Data Model
• Interface
– put(key, value)
– get(key): value
Document Data Model
• Interface
– set(key, document)
– get(key): document
– set(key, name, value)
– get(key, name): value
Column Family Data Model
• Interface
– define(family)
– insert(family, key, columns)
– get(family, key): columns
[Figure: for each model, a sample store with keys k1 … kn; key-value maps keys to opaque values v1 … vn, document maps keys to documents such as “name”:“fred”, column family groups columns such as “name”, “title” and “age” under the labels Private and Public]
Key-value Data Model
n Data storage
u values (data) are stored based on programmer-defined keys
u system is agnostic as to the structure (semantics) of the value
n Queries are expressed in terms of keys
n Indexes are defined over keys
u some systems support secondary indexes over (part of) the value
• Interface
– put(key, value)
– get(key): value
k1 v1
k2 v2
k3 v3
…
kn vn
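The put/get interface above can be sketched in a few lines of Python. This is an in-memory illustration only; the class name `KeyValueStore` and the use of `bytes` for values are assumptions chosen to stress that the store does not interpret what it holds.

```python
from typing import Dict

class KeyValueStore:
    """In-memory sketch: the store never looks inside a value."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        # Overwrites any previous value stored under the same key.
        self._data[key] = value

    def get(self, key: str) -> bytes:
        # Raises KeyError when the key is unknown.
        return self._data[key]

store = KeyValueStore()
store.put("k1", b"v1")
store.put("k2", b"v2")
print(store.get("k1"))  # b'v1'
```

Because values are opaque, any query that is not a lookup by key has to be answered by the client after fetching and decoding the value.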
Customer info in a key-value store
n In this scenario, the application can read the customer’s information and all the related data by using the key.
=> a VALUE
n If the application requirements are to read the orders or the products sold in each order, the whole VALUE has to be read and then parsed on the client side to build the results.
n It is possible to split the VALUE into two objects.
Customer / order info in a key-value store
Split the value object into Customer and Order objects
# Customer object
{
  "customerId": 1,
  "customer": {
    "name": "Martin",
    "billingAddress": [{"city": "Chicago"}],
    "payment": [{"type": "debit", "ccinfo": "1000-1000-1000-1000"}],
    "orders": [{"orderId": 99}]
  }
}
# Order object
{
  "customerId": 1,
  "orderId": 99,
  "order": {
    "orderDate": "Nov-20-2011",
    "orderItems": [{"productId": 27, "price": 32.45}],
    "orderPayment": [{"ccinfo": "1000-1000-1000-1000", "txnId": "abelif879rft"}],
    "shippingAddress": {"city": "Chicago"}
  }
}
Using aggregates this way allows for read optimization, but we have to push the orderId reference into the Customer object every time a new Order is created.
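The bookkeeping cost mentioned here can be sketched as follows. A plain `dict` stands in for the key-value store, values are stored as opaque JSON strings, and the key scheme (`customer:1`, `order:99`) as well as the helper names are assumptions made for this illustration.

```python
import json

store = {}  # stands in for the key-value store; values are opaque strings

def put(key, obj):
    store[key] = json.dumps(obj)

def get(key):
    return json.loads(store[key])  # the client parses the value itself

put("customer:1", {"customerId": 1,
                   "customer": {"name": "Martin", "orders": [{"orderId": 99}]}})
put("order:99", {"customerId": 1, "orderId": 99,
                 "order": {"orderDate": "Nov-20-2011",
                           "orderItems": [{"productId": 27, "price": 32.45}]}})

def place_order(customer_id, order_id, order):
    """Write the new Order, then push its id into the Customer value."""
    put(f"order:{order_id}",
        {"customerId": customer_id, "orderId": order_id, "order": order})
    customer = get(f"customer:{customer_id}")      # read the whole value
    customer["customer"]["orders"].append({"orderId": order_id})
    put(f"customer:{customer_id}", customer)       # write the whole value back

def orders_of(customer_id):
    """Follow the orderId references stored in the Customer value."""
    refs = get(f"customer:{customer_id}")["customer"]["orders"]
    return [get(f"order:{ref['orderId']}") for ref in refs]

# The second order is hypothetical sample data.
place_order(1, 100, {"orderDate": "Nov-21-2011",
                     "orderItems": [{"productId": 27, "price": 32.45}]})
print([o["orderId"] for o in orders_of(1)])  # [99, 100]
```

In this sketch the two writes in `place_order` are independent operations, so the application is responsible for keeping the Customer's reference list in step with the stored Orders.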
Document Data Model
n Data storage
u documents (data) are stored based on programmer-defined keys
u system is aware of the (arbitrary) document structure
u support for lists, pointers and nested documents
n Queries are expressed in terms of key (or attribute, if an index exists)
n Support for key-based indexes and secondary indexes
• Interface
– set(key, document)
– get(key): document
– set(key, name, value)
– get(key, name): value
k1 “name”:“fred”
k2 “name”:“mary”;“age”:“25”
k3 “name”:“oak st”
…
kn “name”:“john”;“address”:“k3”
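A minimal in-memory sketch of this interface is given below. Python has no overloading, so the field-level variants of set/get are named `set_field` and `get_field` here; those names, the class name, and the reading of the diagram in which k3 holds the “oak st” document are assumptions.

```python
from typing import Any, Dict

class DocumentStore:
    """In-memory sketch: unlike a key-value store, it sees inside documents."""

    def __init__(self) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}

    def set(self, key: str, document: Dict[str, Any]) -> None:
        self._docs[key] = dict(document)

    def get(self, key: str) -> Dict[str, Any]:
        return dict(self._docs[key])

    def set_field(self, key: str, name: str, value: Any) -> None:
        # Corresponds to set(key, name, value) on the slide.
        self._docs.setdefault(key, {})[name] = value

    def get_field(self, key: str, name: str) -> Any:
        # Corresponds to get(key, name): value on the slide.
        return self._docs[key][name]

db = DocumentStore()
db.set("k1", {"name": "fred"})
db.set("k2", {"name": "mary", "age": "25"})
db.set("k3", {"name": "oak st"})
db.set("kn", {"name": "john", "address": "k3"})  # "address" points at another key

# Follow the pointer stored in kn to reach the address document.
print(db.get_field(db.get_field("kn", "address"), "name"))  # oak st
```

The field-level calls are what distinguish this model from a key-value store: a single attribute can be read or updated without shipping the whole document to the client.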
# Customer object
{
  "customerId": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}],
  "payment": [{"type": "debit", "ccinfo": "1000-1000-1000-1000"}]
}
# Order object
{
  "orderId": 99,
  "customerId": 1,
  "orderDate": "Nov-20-2011",
  "orderItems": [{"productId": 27, "price": 32.45}],
  "orderPayment": [{"ccinfo": "1000-1000-1000-1000", "txnId": "abelif879rft"}],
  "shippingAddress": {"city": "Chicago"}
}
Document data model
A document database is able to see a structure in the aggregate. It imposes limits on what we can place in it, defining allowable structures and types. In return, however, we get more flexibility in access.
You can submit queries to the database based on the fields in the aggregate (one of which might be the key), retrieve part of the aggregate rather than the whole thing, and the database can create indexes based on the contents of the aggregate.
The line between key-value and document gets a bit blurry: people often put an ID field in a document database to do a key-value style lookup.
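The three capabilities named above (query by field, partial retrieval, indexes on content) can be sketched over a plain list of documents. The helper names `find`, `project` and `build_index` are hypothetical, and the second order is made-up sample data.

```python
from collections import defaultdict

orders = [
    {"orderId": 99, "customerId": 1, "orderDate": "Nov-20-2011",
     "orderItems": [{"productId": 27, "price": 32.45}]},
    # Hypothetical second order, added for illustration.
    {"orderId": 100, "customerId": 2, "orderDate": "Nov-21-2011",
     "orderItems": [{"productId": 31, "price": 10.0}]},
]

def find(docs, field, value):
    """Query on a field inside the aggregate (full scan, no index)."""
    return [d for d in docs if d.get(field) == value]

def project(doc, fields):
    """Retrieve only part of the aggregate."""
    return {f: doc[f] for f in fields if f in doc}

def build_index(docs, field):
    """Secondary index: field value -> documents having that value."""
    index = defaultdict(list)
    for d in docs:
        if field in d:
            index[d[field]].append(d)
    return index

print(project(find(orders, "customerId", 1)[0], ["orderId", "orderDate"]))
by_customer = build_index(orders, "customerId")
print([d["orderId"] for d in by_customer[2]])  # [100]
```

Looking up `orderId` through such an index is exactly the key-value style access that blurs the line between the two models.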
Column Family Data Model
n Data storage
u <name, value, timestamp> triples (so-called columns) are stored based on a column family and key
u system is aware of (arbitrary) structure of column family
u system uses column family information to replicate and distribute data
n Queries are expressed based on key and column family
n Secondary indexes per column family are typically supported
• Interface
– define(family)
– insert(family, key, columns)
– get(family, key): columns
[Figure: rows k1 … kn whose columns are grouped under the labels Private and Public; sample columns include “name”:“fred”, “name”:“mary”, “name”:“john”, “name”:“oak st”, “title”:“Mr” and “age”:“25”]
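The define/insert/get interface can be sketched in memory as below. The class name, the family name `profile`, the optional `timestamp` parameter, and the rule that the newest timestamp wins are all assumptions of this sketch rather than behaviour stated on the slide.

```python
import time
from typing import Any, Dict, Optional, Tuple

Column = Tuple[Any, float]  # (value, timestamp); the name is the dict key

class ColumnFamilyStore:
    """In-memory sketch: family -> row key -> column name -> (value, timestamp)."""

    def __init__(self) -> None:
        self._families: Dict[str, Dict[str, Dict[str, Column]]] = {}

    def define(self, family: str) -> None:
        self._families.setdefault(family, {})

    def insert(self, family: str, key: str, columns: Dict[str, Any],
               timestamp: Optional[float] = None) -> None:
        if family not in self._families:
            raise KeyError(f"undefined column family: {family}")
        ts = time.time() if timestamp is None else timestamp
        row = self._families[family].setdefault(key, {})
        for name, value in columns.items():
            current = row.get(name)
            # Assumed conflict rule: keep the column with the newest timestamp.
            if current is None or ts >= current[1]:
                row[name] = (value, ts)

    def get(self, family: str, key: str) -> Dict[str, Column]:
        return dict(self._families[family].get(key, {}))

cf = ColumnFamilyStore()
cf.define("profile")  # hypothetical family name
cf.insert("profile", "k1", {"name": "fred"}, timestamp=1.0)
cf.insert("profile", "kn", {"name": "john", "title": "Mr"}, timestamp=1.0)
print(cf.get("profile", "kn"))
```

Rows in the same family need not have the same columns, which is one reason a column family should not be read as a relational table.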
Customer info in a column family
n Column-family databases organize their columns into column families.
n Each column has to be part of a single column family, and the column acts as the unit of access, with the assumption that data for a particular column family will usually be accessed together.
n Be careful: a column family is different from a table.
Aggregate orientation
It is difficult to draw aggregate boundaries:
n an order makes a good aggregate when a customer is making and reviewing orders, and when the retailer is processing orders.
n however, if a retailer wants to analyze sales of its products over the last few months, then an order aggregate becomes a problem.
n to get to the product sales history, you have to dig into every aggregate in the database.
So an aggregate structure may help with some data interactions but be an obstacle for others.
Key Points
n Aggregates form the boundaries for ACID operations with the database.
n Aggregates make it easier for the database to manage data storage over clusters.
n Aggregate-oriented databases work best when most data interaction is done with the same aggregate;
n Aggregate-ignorant databases are better when interactions use data organized in many different formations.
n Aggregate-oriented databases often compute materialized views to provide data organized differently from their primary aggregates. This is often done with map-reduce computations.
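The last point can be illustrated with a single-process Python sketch of a map-reduce computation that turns order aggregates into a per-product sales view. The function names and the second order are assumptions; a real system would run the map and reduce steps in parallel across the cluster.

```python
from collections import defaultdict

orders = [
    {"orderId": 99, "orderItems": [{"productId": 27, "price": 32.45}]},
    # Hypothetical second order, added for illustration.
    {"orderId": 100, "orderItems": [{"productId": 27, "price": 32.45},
                                    {"productId": 31, "price": 10.0}]},
]

def map_order(order):
    """Map: emit one (productId, price) pair per order item."""
    for item in order["orderItems"]:
        yield item["productId"], item["price"]

def shuffle(pairs):
    """Group the emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_sales(product_id, prices):
    """Reduce: total sales for one product."""
    return product_id, round(sum(prices), 2)

pairs = (pair for order in orders for pair in map_order(order))
sales_by_product = dict(reduce_sales(k, v) for k, v in shuffle(pairs).items())
print(sales_by_product)  # {27: 64.9, 31: 10.0}
```

The resulting dictionary plays the role of a materialized view: it is organized by product, whereas the primary aggregates are organized by order.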
Big Data models
n Key-value Data Model (aggregate-oriented)
n Document Data Model (aggregate-oriented)
n Column Family Data Model (aggregate-oriented)
n Graph Data Model
n Array Data Model
n Object Data Model
Graph Data Model
n Data storage
u data is stored in terms of nodes and (typed) edges
u both nodes and edges can have (arbitrary) attributes
n Queries are expressed based on system ids (if no indexes exist)
n Secondary indexes for nodes and edges are supported
u retrieve nodes by attributes and edges by type, start and/or end node, and/or attributes
• Interface
– create: id
– get(id)
– connect(id1, id2): id
– addAttribute(id, name, value)
– getAttribute(id, name): value
[Figure: three nodes n1, n2, n3 with attribute sets “name”:“fred”, “name”:“mary”;“age”:“25” and “name”:“oak st”, linked by two LIKES edges, one carrying the attribute “weight”:“-1”]
This is ideal for capturing any data consisting of complex relationships such as social networks, product preferences, or eligibility rules.
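The interface listed on this slide can be sketched in memory as follows. The class name, the snake_case method names, the convention of storing an edge's type in a `type` attribute, the `neighbours` helper, and the direction of the sample LIKES edge are assumptions made for this illustration.

```python
import itertools
from typing import Any, Dict, List, Tuple

class GraphStore:
    """In-memory sketch: nodes and edges share one id space and both carry attributes."""

    def __init__(self) -> None:
        self._next_id = itertools.count(1)
        self._nodes: Dict[int, Dict[str, Any]] = {}
        self._edges: Dict[int, Tuple[int, int]] = {}
        self._edge_attrs: Dict[int, Dict[str, Any]] = {}

    def create(self) -> int:
        node_id = next(self._next_id)
        self._nodes[node_id] = {}
        return node_id

    def connect(self, id1: int, id2: int) -> int:
        if id1 not in self._nodes or id2 not in self._nodes:
            raise KeyError("both endpoints must be existing nodes")
        edge_id = next(self._next_id)
        self._edges[edge_id] = (id1, id2)
        self._edge_attrs[edge_id] = {}
        return edge_id

    def _attrs(self, element_id: int) -> Dict[str, Any]:
        if element_id in self._nodes:
            return self._nodes[element_id]
        return self._edge_attrs[element_id]

    def get(self, element_id: int) -> Dict[str, Any]:
        return dict(self._attrs(element_id))

    def add_attribute(self, element_id: int, name: str, value: Any) -> None:
        self._attrs(element_id)[name] = value

    def get_attribute(self, element_id: int, name: str) -> Any:
        return self._attrs(element_id)[name]

    def neighbours(self, node_id: int, edge_type: str) -> List[int]:
        """Nodes reached from node_id over outgoing edges of the given type."""
        return [end for eid, (start, end) in self._edges.items()
                if start == node_id and self._edge_attrs[eid].get("type") == edge_type]

g = GraphStore()
fred, mary = g.create(), g.create()
g.add_attribute(fred, "name", "fred")
g.add_attribute(mary, "name", "mary")
likes = g.connect(fred, mary)          # direction chosen for the example
g.add_attribute(likes, "type", "LIKES")
print([g.get_attribute(n, "name") for n in g.neighbours(fred, "LIKES")])  # ['mary']
```

Traversals such as `neighbours` follow relationships directly instead of digging through aggregates, which is what makes this model a good fit for heavily connected data.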
An example graph structure
More when considering polyglot persistence