
2-DataModeling

Apr 03, 2018

Page 1: 2-DataModeling

7/28/2019 2-DataModeling

http://slidepdf.com/reader/full/2-datamodeling 1/36

© Christine Collet 

Big data models

Ch. Collet and G. Vargas-Solar 

Page 2: 2-DataModeling


Outline

- Data modeling and processing
  - Modeling for simplifying data access
  - Map-reduce programming approach
- Data distribution
  - Replication and sharding
  - Management issues
- Data storage: polyglot persistence
- Service coordination
- Open issues and perspectives

Page 3: 2-DataModeling


Data Applications

- Old model: "Query the world"
  - Data acquisition is coupled to a specific hypothesis
- New model: "Download the world"
  - Data is acquired en masse, in support of many hypotheses

How to handle scalability?

Page 4: 2-DataModeling


Scaling: Attack of the clusters

- Scaling up implies bigger machines with more processors, disk storage, and memory.
  - The alternative is to scale out: use lots of small machines in a cluster.
- A cluster of small machines can use commodity hardware and ends up being cheaper at these kinds of scales.
- It can also be more resilient: while individual machine failures are common, the overall cluster can be built to keep going despite such failures, providing high reliability.

Page 5: 2-DataModeling


Clustered relational databases

- They use a cluster-aware file system that writes to a highly available disk subsystem, but this means the cluster still has the disk subsystem as a single point of failure.
- Relational systems could also be run as separate servers for different sets of data, effectively sharding the database.
  => But the distribution has to be controlled by the application, which has to keep track of which database server to talk to for each bit of data.
  Also, we lose any querying, referential integrity, transactions, or consistency controls that cross shards.
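The application-controlled distribution described above can be sketched as follows. This is only an illustration, assuming hash-based routing; the server names and the helpers `shard_for` and `shards_for_query` are hypothetical and do not come from the slides.

```python
import hashlib

# Hypothetical shard servers; a real deployment would hold connection details here.
SHARDS = ["db-server-0", "db-server-1", "db-server-2"]


def shard_for(customer_id: int) -> str:
    """Return the server holding the data for this customer.

    The application, not the database, owns this mapping.
    """
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


def shards_for_query(customer_ids):
    """A query over several customers may touch several servers, so the
    application has to fan out and merge the results itself."""
    return sorted({shard_for(cid) for cid in customer_ids})
```

Because each server only sees its own rows, a join or a transaction involving customers on different servers has no single database that can enforce it, which is the loss of cross-shard controls noted in the slide.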

Page 6: 2-DataModeling


Relational systems are not designed to run efficiently on clusters.

New systems have emerged to address the requirements of data management in the cloud:
- NoSQL data stores
- Scalable SQL databases

Page 7: 2-DataModeling


Global vision

Page 8: 2-DataModeling


Been …

- In between logical data models and data store models
- Able to handle large volumes of data: manage thousands of processors, distribute data and components
- Polyglot: picking the right language, execution model, and data store for the job

Page 9: 2-DataModeling


NoSQL common characteristics

- Not using the relational model; most do not use SQL for queries
- Schemaless: no predefined schemas
- Aggregate-oriented models and others; no theoretical foundation
- Running well on clusters (map-reduce)
- Open source
- Built for 21st-century web estates
- Polyglot persistence: the most important result of the rise of NoSQL

Page 10: 2-DataModeling


Example of data schema (UML)

Page 11: 2-DataModeling


Relational representation

Page 12: 2-DataModeling


Aggregate data representation (1)

Page 13: 2-DataModeling


Two aggregates, in JSON

# Customer
{
  "id": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}]
}

# Order
{
  "id": 99,
  "customerId": 1,
  "orderItems": [
    {
      "productId": 27,
      "price": 32.45,
      "productName": "NoSQL Distilled"
    }
  ],
  "shippingAddress": [{"city": "Chicago"}],
  "orderPayment": [
    {
      "ccinfo": "1000-1000-1000-1000",
      "txnId": "abelif879rft",
      "billingAddress": {"city": "Chicago"}
    }
  ]
}

Page 14: 2-DataModeling


JavaScript Object Notation (JSON)

Lightweight text format for exchanging structured data
- Based on a complex data model
- Similar to XML

Fundamental concepts:
- Object: unordered container of key/value pairs; objects are wrapped in {}; keys are strings; ":" separates keys and values
- Array: ordered sequence of values; arrays are wrapped in []
- Value: a string, number, boolean, null, object, or array
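As a small illustration of these concepts, the sketch below parses a JSON text with Python's standard `json` module. The fields `active` and `nickname` are hypothetical additions, included only to show the boolean and null value kinds.

```python
import json

# "active" and "nickname" are hypothetical fields, not taken from the slides.
text = (
    '{"id": 1, "name": "Martin", '
    '"billingAddress": [{"city": "Chicago"}], '
    '"active": true, "nickname": null}'
)

customer = json.loads(text)              # JSON object -> Python dict
addresses = customer["billingAddress"]   # JSON array  -> Python list
first_city = addresses[0]["city"]        # object nested inside the array

# Serializing and parsing again yields an equal structure.
round_trip = json.loads(json.dumps(customer))
```

The mapping is direct: objects become dictionaries, arrays become lists, and `true` and `null` become `True` and `None`.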

Page 15: 2-DataModeling


Aggregate data representation (1)

Two main aggregates: customer and order (the black diamond is the composition marker)
- The customer contains billing addresses.
- The order contains a list of order items, a shipping address, and payments; each payment itself contains a billing address for that payment.
- The same address may be repeated three times.
- The link between the customer and the order is not within either aggregate; it is a relationship between aggregates.

Aggregate boundary: think about how the data is accessed
- e.g. putting, or not putting, all the orders for a customer into the customer aggregate.

Page 16: 2-DataModeling


Another aggregate data representation (2)

Page 17: 2-DataModeling


One aggregate

{
  "customer": {
    "id": 1,
    "name": "Martin",
    "billingAddress": [{"city": "Chicago"}],
    "orders": [
      {
        "id": 99,
        "customerId": 1,
        "orderItems": [
          {
            "productId": 27,
            "price": 32.45,
            "productName": "NoSQL Distilled"
          }
        ],
        "shippingAddress": [{"city": "Chicago"}],
        "orderPayment": [
          {
            "ccinfo": "1000-1000-1000-1000",
            "txnId": "abelif879rft",
            "billingAddress": {"city": "Chicago"}
          }
        ]
      }
    ]
  }
}
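The choice between two aggregates and one aggregate mainly affects how data is read. The sketch below contrasts the two designs using plain Python dictionaries as stand-ins for a data store; the function names are hypothetical and the data is trimmed from the examples above.

```python
# Design 1: two aggregates, stored under separate keys.
customers = {1: {"id": 1, "name": "Martin",
                 "billingAddress": [{"city": "Chicago"}]}}
orders = {99: {"id": 99, "customerId": 1,
               "orderItems": [{"productId": 27, "price": 32.45,
                               "productName": "NoSQL Distilled"}]}}

# Design 2: one aggregate, with the orders nested inside the customer.
customers_nested = {1: {"id": 1, "name": "Martin",
                        "billingAddress": [{"city": "Chicago"}],
                        "orders": [orders[99]]}}


def customer_with_orders_two_aggregates(customer_id):
    """Needs the customer lookup plus a scan (or an index) over the orders."""
    customer = customers[customer_id]
    own_orders = [o for o in orders.values() if o["customerId"] == customer_id]
    return customer, own_orders


def customer_with_orders_one_aggregate(customer_id):
    """A single lookup returns the customer together with all of its orders."""
    customer = customers_nested[customer_id]
    return customer, customer["orders"]


def order_by_id_one_aggregate(order_id):
    """Without an extra index, finding one order means opening customer aggregates."""
    for customer in customers_nested.values():
        for order in customer["orders"]:
            if order["id"] == order_id:
                return order
    return None
```

Neither design is better in general: the nested form favors "show this customer and everything they bought", while the split form favors direct access to individual orders.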

Page 18: 2-DataModeling


Data models

- Aggregate-oriented
  - Key-value data model
  - Document data model
  - Column family data model
- Graph data model
- Array data model
- Object data model

Page 19: 2-DataModeling


Aggregate-oriented data models

Key-value data model
- Interface
  - put(key, value)
  - get(key): value
- Example
  k1 -> v1
  k2 -> v2
  k3 -> v3
  …
  kn -> vn

Document data model
- Interface
  - set(key, document)
  - get(key): document
  - set(key, name, value)
  - get(key, name): value
- Example
  k1 -> "name":"fred"
  k2 -> "name":"mary"; "age":"25"
  k3 -> "name":"oak st"
  …
  kn -> "name":"john"; "address":"k3"

Column family data model
- Interface
  - define(family)
  - insert(family, key, columns)
  - get(family, key): columns
- Example (column families in the figure: Public, Private)
  k1 -> "name":"fred"
  k2 -> "name":"mary"
  k3 -> "name":"oak st"
  …
  kn -> "name":"john"
  Other columns shown in the figure: "title":"Mr", "age":"25"

Page 20: 2-DataModeling


Key-value Data Model

- Data storage
  - Values (data) are stored based on programmer-defined keys
  - The system is agnostic as to the structure (semantics) of the value
- Queries are expressed in terms of keys
- Indexes are defined over keys
  - Some systems support secondary indexes over (part of) the value

Interface
- put(key, value)
- get(key): value

Example
  k1 -> v1
  k2 -> v2
  k3 -> v3
  …
  kn -> vn
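The put/get interface can be sketched with an in-memory class. This is a toy stand-in, not the API of any particular product; the class name `KeyValueStore` is hypothetical. Values are kept as opaque bytes to reflect that the system does not interpret them.

```python
class KeyValueStore:
    """In-memory stand-in for a key-value store: values are opaque bytes."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        # The store never inspects the value; it only associates it with the key.
        self._data[key] = value

    def get(self, key: str) -> bytes:
        # Raises KeyError when the key is unknown.
        return self._data[key]


store = KeyValueStore()
store.put("k1", b"v1")
store.put("k2", b'{"name": "mary"}')  # JSON to the client, plain bytes to the store
```

Any structure inside the value, such as the JSON under `k2`, is only meaningful to the client that wrote it.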

Page 21: 2-DataModeling


Customer info in a key-value store

Page 22: 2-DataModeling


- In this scenario, the application can read the customer's information and all the related data by using the key.
  => a VALUE
- If the application needs to read the orders or the products sold in each order, the VALUE has to be read and then parsed on the client side to build the results.
- It is possible to split the VALUE into two objects.
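The client-side parsing mentioned above can be sketched as follows. A plain dictionary stands in for the key-value store, and the key format `customer:1`, the second order, and product 31 are hypothetical values added for illustration.

```python
import json

# A plain dict stands in for the key-value store; values are opaque strings.
store = {}

customer_value = {
    "id": 1,
    "name": "Martin",
    "orders": [
        {"id": 99, "orderItems": [{"productId": 27, "price": 32.45}]},
        # Hypothetical second order, added only for illustration.
        {"id": 100, "orderItems": [{"productId": 27, "price": 32.45},
                                   {"productId": 31, "price": 10.0}]},
    ],
}
store["customer:1"] = json.dumps(customer_value)


def products_sold(customer_key):
    """Read the whole VALUE, then parse it on the client to extract product ids."""
    value = json.loads(store[customer_key])  # the store cannot do this for us
    return sorted({item["productId"]
                   for order in value["orders"]
                   for item in order["orderItems"]})
```

Even though only the product ids are needed, the entire value travels from the store to the client, which is the cost the slide points at.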

Page 23: 2-DataModeling


Customer / order info in a key-value store

Split the value object into Customer and Order objects

Page 24: 2-DataModeling


# Customer object
{
  "customerId": 1,
  "customer": {
    "name": "Martin",
    "billingAddress": [{"city": "Chicago"}],
    "payment": [{"type": "debit", "ccinfo": "1000-1000-1000-1000"}],
    "orders": [{"orderId": 99}]
  }
}

# Order object
{
  "customerId": 1,
  "orderId": 99,
  "order": {
    "orderDate": "Nov-20-2011",
    "orderItems": [{"productId": 27, "price": 32.45}],
    "orderPayment": [{"ccinfo": "1000-1000-1000-1000", "txnId": "abelif879rft"}],
    "shippingAddress": {"city": "Chicago"}
  }
}

Using aggregates this way allows for read optimization, but we have to push the orderId reference into the Customer object every time a new Order is created.
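The extra write caused by the orderId reference can be sketched like this. A dictionary stands in for the key-value store; the key naming (`customer:1`, `order:99`), the helper names, and order 100 are assumptions made for the example.

```python
import json

store = {}  # a dict stands in for the key-value store


def put(key, obj):
    store[key] = json.dumps(obj)


def get(key):
    return json.loads(store[key])


put("customer:1", {"customerId": 1,
                   "customer": {"name": "Martin", "orders": [{"orderId": 99}]}})
put("order:99", {"customerId": 1, "orderId": 99,
                 "order": {"orderItems": [{"productId": 27, "price": 32.45}]}})


def add_order(customer_id, order_id, order):
    """Store the new Order, then push its id into the Customer object.

    These are two separate writes to two separate aggregates.
    """
    put(f"order:{order_id}",
        {"customerId": customer_id, "orderId": order_id, "order": order})
    customer_obj = get(f"customer:{customer_id}")
    customer_obj["customer"]["orders"].append({"orderId": order_id})
    put(f"customer:{customer_id}", customer_obj)


# Hypothetical new order, used only to exercise the sketch.
add_order(1, 100, {"orderItems": [{"productId": 31, "price": 10.0}]})
```

Reads stay cheap, since the customer object lists its order ids, but every new order now costs a read-modify-write on the customer as well.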

Page 25: 2-DataModeling


Document Data Model

- Data storage
  - Documents (data) are stored based on programmer-defined keys
  - The system is aware of the (arbitrary) document structure
  - Support for lists, pointers, and nested documents
- Queries are expressed in terms of the key (or an attribute, if an index exists)
- Support for key-based indexes and secondary indexes

Interface
- set(key, document)
- get(key): document
- set(key, name, value)
- get(key, name): value

Example
  k1 -> "name":"fred"
  k2 -> "name":"mary"; "age":"25"
  k3 -> "name":"oak st"
  …
  kn -> "name":"john"; "address":"k3"
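The four interface operations can be sketched with a small in-memory class. The class name `DocumentStore` is hypothetical, and the example assumes that the "oak st" document in the figure is stored under k3, which is what the `"address":"k3"` pointer suggests.

```python
class DocumentStore:
    """In-memory sketch of the document interface listed on the slide."""

    def __init__(self):
        self._docs = {}

    def set(self, key, *args):
        """set(key, document) replaces a document;
        set(key, name, value) sets a single attribute."""
        if len(args) == 1:
            self._docs[key] = dict(args[0])
        elif len(args) == 2:
            name, value = args
            self._docs.setdefault(key, {})[name] = value
        else:
            raise TypeError("expected set(key, document) or set(key, name, value)")

    def get(self, key, name=None):
        """get(key) returns the whole document; get(key, name) one attribute."""
        document = self._docs[key]
        return document if name is None else document[name]


db = DocumentStore()
db.set("k1", {"name": "fred"})
db.set("k2", {"name": "mary", "age": "25"})
db.set("k3", {"name": "oak st"})                 # assumed to be the k3 document
db.set("kn", {"name": "john", "address": "k3"})  # "address" points to k3

street = db.get(db.get("kn", "address"), "name")  # follow the pointer
```

Unlike the key-value case, the store can return a single attribute because it understands the structure of the document.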

Page 26: 2-DataModeling


# Customer object
{
  "customerId": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}],
  "payment": [{"type": "debit", "ccinfo": "1000-1000-1000-1000"}]
}

# Order object
{
  "orderId": 99,
  "customerId": 1,
  "orderDate": "Nov-20-2011",
  "orderItems": [{"productId": 27, "price": 32.45}],
  "orderPayment": [{"ccinfo": "1000-1000-1000-1000", "txnId": "abelif879rft"}],
  "shippingAddress": {"city": "Chicago"}
}

Page 27: 2-DataModeling


Document data model

A document database is able to see a structure in the aggregate. It imposes limits on what we can place in it, defining allowable structures and types. In return, however, we get more flexibility in access.

You can submit queries to the database based on the fields in the aggregate (which might be the key), retrieve part of the aggregate rather than the whole thing, and the database can create indexes based on the contents of the aggregate.

The line between key-value and document gets a bit blurry: people often put an ID field in a document database to do a key-value style lookup.
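Querying by field, retrieving part of an aggregate, and indexing on content can be sketched as follows. The documents live in a plain dictionary, the index is built by hand, and order 100 plus the helper names are hypothetical; a real document database would provide these features itself.

```python
from collections import defaultdict

documents = {
    "order:99": {"orderId": 99, "customerId": 1, "orderDate": "Nov-20-2011",
                 "shippingAddress": {"city": "Chicago"}},
    # Hypothetical second document, added only for illustration.
    "order:100": {"orderId": 100, "customerId": 2, "orderDate": "Nov-21-2011",
                  "shippingAddress": {"city": "Boston"}},
}


def build_index(field):
    """Secondary index: field value -> list of document keys."""
    index = defaultdict(list)
    for key, doc in documents.items():
        if field in doc:
            index[doc[field]].append(key)
    return index


customer_index = build_index("customerId")


def find_by_customer(customer_id, projection):
    """Query on a field and return only the requested part of each document."""
    return [{name: documents[key][name] for name in projection}
            for key in customer_index.get(customer_id, [])]
```

For example, `find_by_customer(1, ["orderId", "orderDate"])` returns just those two fields of the matching order rather than the whole aggregate.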

Page 28: 2-DataModeling


Column Family Data Model

- Data storage
  - <name, value, timestamp> triples (so-called columns) are stored based on a column family and key
  - The system is aware of the (arbitrary) structure of a column family
  - The system uses column family information to replicate and distribute data
- Queries are expressed based on key and column family
- Secondary indexes per column family are typically supported

Interface
- define(family)
- insert(family, key, columns)
- get(family, key): columns

Example (column families in the figure: Public, Private)
  k1 -> "name":"fred"
  k2 -> "name":"mary"
  k3 -> "name":"oak st"
  …
  kn -> "name":"john"
  Other columns shown in the figure: "title":"Mr", "age":"25"
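The define/insert/get interface can be sketched with an in-memory class that stores each column as a (name, value, timestamp) triple. The class name is hypothetical, and the assignment of columns to the "public" and "private" families in the usage lines is an assumption, since the figure itself is not available.

```python
import time


class ColumnFamilyStore:
    """In-memory sketch: family -> row key -> column name -> triple."""

    def __init__(self):
        self._families = {}

    def define(self, family):
        self._families.setdefault(family, {})

    def insert(self, family, key, columns):
        """columns maps name -> value; each entry is stored as a
        (name, value, timestamp) triple. Unknown families raise KeyError."""
        row = self._families[family].setdefault(key, {})
        now = time.time()
        for name, value in columns.items():
            row[name] = (name, value, now)

    def get(self, family, key):
        """Return the columns of one row in one family as name -> value."""
        row = self._families[family].get(key, {})
        return {name: value for name, value, _timestamp in row.values()}


cf = ColumnFamilyStore()
cf.define("public")
cf.define("private")
# The split of columns between the two families is assumed for illustration.
cf.insert("public", "kn", {"name": "john", "title": "Mr"})
cf.insert("private", "k2", {"age": "25"})
```

A read always names the family, so columns that are accessed together can be stored and distributed together.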

Page 29: 2-DataModeling


Customer info in a column family

Page 30: 2-DataModeling


- Column-family databases organize their columns into column families.
- Each column has to be part of a single column family, and the column acts as the unit of access, with the assumption that data for a particular column family will usually be accessed together.
- Be careful: a column family is not the same as a table.

Page 31: 2-DataModeling


Aggregate orientation

It is difficult to draw aggregate boundaries:
- An order makes a good aggregate when a customer is making and reviewing orders, and when the retailer is processing orders.
- However, if a retailer wants to analyze sales of its products over the last few months, then an order aggregate becomes a problem.
- To get to the product sales history, you have to dig into every aggregate in the database.

So an aggregate structure may help with some data interactions but be an obstacle for others.

Page 32: 2-DataModeling


Key Points

- Aggregates form the boundaries for ACID operations with the database.
- Aggregates make it easier for the database to manage data storage over clusters.
- Aggregate-oriented databases work best when most data interaction is done with the same aggregate.
- Aggregate-ignorant databases are better when interactions use data organized in many different formations.
- Aggregate-oriented databases often compute materialized views to provide data organized differently from their primary aggregates. This is often done with map-reduce computations.
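The last point can be illustrated with a single-process sketch of a map-reduce computation that derives a per-product sales view from order aggregates. The function names, the second order, and product 31 are hypothetical; a real system would run the map and reduce steps across the cluster.

```python
from collections import defaultdict

orders = [
    {"id": 99, "orderItems": [{"productId": 27, "price": 32.45}]},
    # Hypothetical second order, added only for illustration.
    {"id": 100, "orderItems": [{"productId": 27, "price": 32.45},
                               {"productId": 31, "price": 10.0}]},
]


def map_order(order):
    """Map step: emit one (productId, price) pair per order item."""
    for item in order["orderItems"]:
        yield item["productId"], item["price"]


def reduce_sales(product_id, prices):
    """Reduce step: total the prices emitted for one product."""
    return product_id, round(sum(prices), 2)


def product_sales(all_orders):
    """Group the mapped pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for order in all_orders:
        for product_id, price in map_order(order):
            grouped[product_id].append(price)
    return dict(reduce_sales(pid, prices) for pid, prices in grouped.items())


sales_view = product_sales(orders)  # product id -> total sales
```

The resulting view is organized by product rather than by order, so a sales analysis no longer has to dig into every order aggregate at query time.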

Page 33: 2-DataModeling


Big Data models

- Aggregate-oriented
  - Key-value data model
  - Document data model
  - Column family data model
- Graph data model
- Array data model
- Object data model

Page 34: 2-DataModeling


Graph Data Model

- Data storage
  - Data is stored in terms of nodes and (typed) edges
  - Both nodes and edges can have (arbitrary) attributes
- Queries are expressed based on system ids (if no indexes exist)
- Secondary indexes for nodes and edges are supported
  - Retrieve nodes by attributes, and edges by type, start and/or end node, and/or attributes

Interface
- create: id
- get(id)
- connect(id1, id2): id
- addAttribute(id, name, value)
- getAttribute(id, name): value

Example (from the figure)
- Nodes: n1, n2, n3
- Node attributes: "name":"fred"; "name":"mary", "age":"25"; "name":"oak st"
- Edges: LIKES, LIKES
- Edge attribute: "weight":"-1"

This is ideal for capturing any data consisting of complex relationships such as social networks, product preferences, or eligibility rules.
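The graph interface can be sketched with a small in-memory class. The method names follow the slide but use Python's snake_case; storing the edge type as a `type` attribute, the `neighbours` helper, and the particular endpoints chosen for the two LIKES edges are assumptions, since the figure is not available.

```python
import itertools


class GraphStore:
    """In-memory sketch: nodes and edges share one id space."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._attributes = {}  # id -> {name: value}, for nodes and edges
        self._edges = {}       # edge id -> (start id, end id)

    def create(self):
        node_id = next(self._ids)
        self._attributes[node_id] = {}
        return node_id

    def get(self, element_id):
        return dict(self._attributes[element_id])

    def connect(self, id1, id2):
        edge_id = next(self._ids)
        self._attributes[edge_id] = {}
        self._edges[edge_id] = (id1, id2)
        return edge_id

    def add_attribute(self, element_id, name, value):
        self._attributes[element_id][name] = value

    def get_attribute(self, element_id, name):
        return self._attributes[element_id][name]

    def neighbours(self, node_id, edge_type):
        """Hypothetical helper: end nodes of typed edges leaving node_id."""
        return [end for edge_id, (start, end) in self._edges.items()
                if start == node_id
                and self._attributes[edge_id].get("type") == edge_type]


g = GraphStore()
n1, n2, n3 = g.create(), g.create(), g.create()
g.add_attribute(n1, "name", "fred")
g.add_attribute(n2, "name", "mary")
g.add_attribute(n2, "age", "25")
g.add_attribute(n3, "name", "oak st")

# Edge endpoints are assumed for illustration.
e1 = g.connect(n1, n2)
g.add_attribute(e1, "type", "LIKES")
e2 = g.connect(n2, n3)
g.add_attribute(e2, "type", "LIKES")
g.add_attribute(e2, "weight", "-1")
```

Here relationships are first-class records with their own ids and attributes, which is what makes traversals such as `g.neighbours(n1, "LIKES")` natural to express.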

Page 35: 2-DataModeling


An example graph structure

Page 36: 2-DataModeling


More when considering polyglot persistence