2-DataModeling


© Christine Collet

Big data models

Ch. Collet and G. Vargas-Solar



Outline

- Data modeling and processing
  - Modeling for simplifying data access
  - map-reduce programming approach
- Data distribution
  - Replication and sharding
  - Management issues
- Data Storage: Polyglot persistence
- Service coordination
- Open issues and perspectives



Data Applications

- Old model
  - "Query the world"
  - data acquisition coupled to a specific hypothesis
- New model
  - "Download the world"
  - data acquired en masse, in support of many hypotheses

How to handle scalability?



Scaling: Attack of the clusters

- Scaling up implies bigger machines, with more processors, disk storage, and memory.
  - The alternative is to use lots of small machines in a cluster.
- A cluster of small machines can use commodity hardware and ends up being cheaper at these kinds of scales.
- It can also be more resilient: while individual machine failures are common, the overall cluster can be built to keep going despite such failures, providing high reliability.



Clustered relational databases

- They use a cluster-aware file system that writes to a highly available disk subsystem, but this means the cluster still has the disk subsystem as a single point of failure.
- Relational systems could also be run as separate servers for different sets of data, effectively sharding the database.

=> but the distribution has to be controlled by the application, which has to keep track of which database server to talk to for each bit of data.

Also, we lose any querying, referential integrity, transactions, or consistency controls that cross shards.
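
To make the cost of application-controlled sharding concrete, here is a minimal Python sketch. The server names and the hash-based routing rule are assumptions chosen for illustration; the slides do not prescribe a particular routing scheme.

```python
import hashlib

# Hypothetical database servers, each holding a different subset of the data.
SHARDS = ["db-server-0", "db-server-1", "db-server-2"]


def shard_for(customer_id: int) -> str:
    """Return the server the application must talk to for this customer."""
    digest = hashlib.sha256(str(customer_id).encode()).digest()
    return SHARDS[int.from_bytes(digest[:4], "big") % len(SHARDS)]


def servers_for_query(customer_ids) -> list:
    """A query spanning several customers may have to contact several servers."""
    return sorted({shard_for(cid) for cid in customer_ids})
```

Every read and write has to go through a function like `shard_for`, and anything returned by `servers_for_query` with more than one entry is a cross-shard operation, for which the database no longer provides joins, transactions, or referential integrity.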




Relational systems are not designed to run efficiently on clusters.

New systems have emerged to address the requirements of data management in the cloud:
- NoSQL data stores
- Scalable SQL databases



Global vision



Been …

- In between logical data models and data store models
- Able to handle large volumes of data: manage thousands of processors, distribute data and components
- Polyglot: picking the right language, execution model, data store for the job



NoSQL Common characteristics

- Not using the relational model; most do not use SQL for queries
- Schemaless: no predefined schemas
- Aggregate-oriented models and others; no theoretical foundation
- Running well on clusters (map-reduce)
- Open-source
- Built for the 21st century web estates
- Polyglot Persistence: the most important result of the rise of NoSQL



Example of Data schema (UML)



Relational representation



Aggregate data representation (1)



Two aggregates in JSON

# Customer
{
  "id": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}]
}

# Order
{
  "id": 99,
  "customerId": 1,
  "orderItems": [
    {"productId": 27, "price": 32.45, "productName": "NoSQL Distilled"}
  ],
  "shippingAddress": [{"city": "Chicago"}],
  "orderPayment": [
    {
      "ccinfo": "1000-1000-1000-1000",
      "txnId": "abelif879rft",
      "billingAddress": {"city": "Chicago"}
    }
  ]
}



JAVASCRIPT OBJECT NOTATION - JSON

Lightweight text format for exchanging structured data
- Based on a complex data model
- Similar to XML

Fundamental concepts:
- Object: unordered container of key/value pairs; objects are wrapped in {}; keys are strings, and : separates keys and values
- Array: ordered sequence of values; arrays are wrapped in []
- Value: a string, number, boolean, null, object or array
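
As a small illustration of these concepts, the following Python sketch parses the customer aggregate with the standard `json` module. The variable names are chosen for this example only.

```python
import json

# The customer aggregate as JSON text: an object containing an array of objects.
text = '{"id": 1, "name": "Martin", "billingAddress": [{"city": "Chicago"}]}'

customer = json.loads(text)          # JSON object -> Python dict
addresses = customer["billingAddress"]  # JSON array -> Python list
city = addresses[0]["city"]          # JSON string -> Python str

# Serializing and parsing again yields an equal structure.
round_trip = json.loads(json.dumps(customer))
```

Objects map to dictionaries and arrays to lists, so nested aggregates can be navigated with ordinary indexing.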



Aggregate data representation (1)

Two main aggregates: customer and order (black diamond composition marker)
  - The customer contains: billing addresses
  - The order contains: a list of order items, a shipping address, and payments; the payment itself contains a billing address for that payment
  - The same address may appear three times
  - The link between the customer and the order is not within either aggregate

- Aggregate boundary: think about access to data
  - e.g. putting, or not putting, all the orders for a customer into the customer aggregate



Another Aggregate data representation (2)



One aggregate

{"customer": {
  "id": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}],
  "orders": [{
    "id": 99,
    "customerId": 1,
    "orderItems": [{
      "productId": 27,
      "price": 32.45,
      "productName": "NoSQL Distilled"}],
    "shippingAddress": [{"city": "Chicago"}],
    "orderPayment": [{
      "ccinfo": "1000-1000-1000-1000",
      "txnId": "abelif879rft",
      "billingAddress": {"city": "Chicago"}}]
  }]
}}
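
The two representations favour different access patterns. The sketch below contrasts them with plain Python dictionaries standing in for a data store; the function names are hypothetical and the data is trimmed to the fields needed here.

```python
# Layout 1: two aggregates, linked only by customerId.
customers = {1: {"id": 1, "name": "Martin"}}
orders = {99: {"id": 99, "customerId": 1, "orderItems": [{"productId": 27}]}}


def orders_of(customer_id):
    """Finding a customer's orders needs a scan (or an extra index)."""
    return [o for o in orders.values() if o["customerId"] == customer_id]


# Layout 2: one aggregate, orders nested inside the customer.
customers_nested = {
    1: {"id": 1, "name": "Martin",
        "orders": [{"id": 99, "orderItems": [{"productId": 27}]}]}
}


def orders_of_nested(customer_id):
    """A single lookup by key returns the customer with all orders."""
    return customers_nested[customer_id]["orders"]
```

The nested layout makes "customer with all orders" a single read, at the price of rewriting the whole customer aggregate whenever an order changes.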



Data models

- Key-value Data Model (aggregate-oriented)
- Document Data Model (aggregate-oriented)
- Column Family Data Model (aggregate-oriented)
- Graph Data Model
- Array Data Model
- Object Data Model



Aggregate-oriented Data models

Key-value Data Model
- Interface
  - put(key, value)
  - get(key): value
- Example data
  k1  v1
  k2  v2
  k3  v3
  …
  kn  vn

Document Data Model
- Interface
  - set(key, document)
  - get(key): document
  - set(key, name, value)
  - get(key, name): value
- Example data
  k1  "name":"fred"
  k2  "name":"mary";"age":"25"
  k3  "name":"oak st"
  …
  kn  "name":"john";"address":"k3"

Column Family Data Model
- Interface
  - define(family)
  - insert(family, key, columns)
  - get(family, key): columns
- Example data (column families: Public, Private)
  k2  "name":"mary"
  k3
  …
  kn  "name":"john"
  Further columns shown in the figure: "title":"Mr", "age":"25"



Key-value Data Model

- Data storage
  - values (data) are stored based on programmer-defined keys
  - system is agnostic as to the structure (semantics) of the value
- Queries are expressed in terms of keys
- Indexes are defined over keys
  - some systems support secondary indexes over (part of) the value
- Interface
  - put(key, value)
  - get(key): value
- Example data
  k1  v1
  k2  v2
  k3  v3
  …
  kn  vn
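
The put/get interface can be sketched in a few lines of Python. This in-memory class is only a stand-in for a real key-value store, and the key format `customer:1` is an assumption made for the example.

```python
import json


class KeyValueStore:
    """In-memory stand-in for a key-value store; values are opaque bytes."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes:
        return self._data[key]


store = KeyValueStore()
# The store never looks inside the value: the client serializes it ...
store.put("customer:1", json.dumps({"name": "Martin"}).encode())
# ... and the client parses it again after reading.
name = json.loads(store.get("customer:1"))["name"]
```

Because the store is agnostic about the value, any filtering on its contents has to happen on the client after `get`.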



Customer info in a key-value store



- In this scenario, the application can read the customer's information and all the related data by using the key.
  => a VALUE
- If the application requirements are to read the orders or the products sold in each order, the VALUE has to be read and then parsed on the client side to build the results.
- It is possible to split the VALUE into two objects.



Customer / order info in a key-value store

Split the value object into Customer and Order objects



# Customer object
{
  "customerId": 1,
  "customer": {
    "name": "Martin",
    "billingAddress": [{"city": "Chicago"}],
    "payment": [{"type": "debit", "ccinfo": "1000-1000-1000-1000"}],
    "orders": [{"orderId": 99}]
  }
}

# Order object
{
  "customerId": 1,
  "orderId": 99,
  "order": {
    "orderDate": "Nov-20-2011",
    "orderItems": [{"productId": 27, "price": 32.45}],
    "orderPayment": [{"ccinfo": "1000-1000-1000-1000", "txnId": "abelif879rft"}],
    "shippingAddress": {"city": "Chicago"}
  }
}

Using aggregates this way allows for read optimization, but we have to push the orderId reference into the Customer every time a new Order is created.
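
The extra write this implies can be sketched as follows. A Python dictionary stands in for the key-value store, and the helper names and the `customer:<id>` / `order:<id>` key scheme are assumptions made for the example.

```python
import json

store = {}  # stands in for a key-value store: key -> serialized value


def put(key, obj):
    store[key] = json.dumps(obj)


def get(key):
    return json.loads(store[key])


def add_order(customer_id, order_id, order):
    """Write the Order object, then push its reference into the Customer."""
    put(f"order:{order_id}",
        {"customerId": customer_id, "orderId": order_id, "order": order})
    customer = get(f"customer:{customer_id}")
    customer["customer"]["orders"].append({"orderId": order_id})
    put(f"customer:{customer_id}", customer)  # second write, not atomic with the first


put("customer:1", {"customerId": 1, "customer": {"name": "Martin", "orders": []}})
add_order(1, 99, {"orderDate": "Nov-20-2011"})
```

Each new order costs two writes to two different aggregates, and the store gives no guarantee that both succeed together.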



Document Data Model

- Data storage
  - documents (data) are stored based on programmer-defined keys
  - system is aware of the (arbitrary) document structure
  - support for lists, pointers and nested documents
- Queries are expressed in terms of key (or attribute, if an index exists)
- Support for key-based indexes and secondary indexes
- Interface
  - set(key, document)
  - get(key): document
  - set(key, name, value)
  - get(key, name): value
- Example data
  k2  "name":"mary";"age":"25"
  k3
  …
  kn  "name":"john";"address":"k3"




# Customer object
{
  "customerId": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}],
  "payment": [{"type": "debit", "ccinfo": "1000-1000-1000-1000"}]
}

# Order object
{
  "orderId": 99,
  "customerId": 1,
  "orderDate": "Nov-20-2011",
  "orderItems": [{"productId": 27, "price": 32.45}],
  "orderPayment": [{"ccinfo": "1000-1000-1000-1000", "txnId": "abelif879rft"}],
  "shippingAddress": {"city": "Chicago"}
}



Document data model

A document database is able to see a structure in the aggregate. It imposes limits on what we can place in it, defining allowable structures and types. In return, however, we get more flexibility in access.

You can submit queries to the database based on the fields in the aggregate (which might be the key), retrieve part of the aggregate rather than the whole thing, and the database can create indexes based on the contents of the aggregate.

The line between key-value and document gets a bit blurry: people often put an ID field in a document database to do a key-value style lookup.
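
A minimal in-memory sketch of this behaviour, in Python. The class is a stand-in, not the API of any real document database; `set_field`, `get_field` and `find` are hypothetical names for the slide's `set(key, name, value)`, `get(key, name)` and a query on a field.

```python
class DocumentStore:
    """In-memory stand-in for a document store that can see inside documents."""

    def __init__(self):
        self._docs = {}

    def set(self, key, document):
        self._docs[key] = dict(document)

    def get(self, key):
        return self._docs[key]

    def set_field(self, key, name, value):
        self._docs[key][name] = value

    def get_field(self, key, name):
        # Retrieve part of the aggregate rather than the whole thing.
        return self._docs[key][name]

    def find(self, name, value):
        # Query on a field; a real system would use a secondary index here.
        return sorted(k for k, d in self._docs.items() if d.get(name) == value)


db = DocumentStore()
db.set("k1", {"name": "fred"})
db.set("k2", {"name": "mary", "age": "25"})
db.set_field("k1", "age", "25")
```

Unlike the key-value case, `find("age", "25")` can be answered by the store itself, because the document structure is visible to it.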



Column Family Data Model

- Data storage
  - <name, value, timestamp> triples (so-called columns) are stored based on a column family and key
  - system is aware of the (arbitrary) structure of a column family
  - system uses column family information to replicate and distribute data
- Queries are expressed based on key and column family
- Secondary indexes per column family are typically supported
- Interface
  - define(family)
  - insert(family, key, columns)
  - get(family, key): columns
- Example data (column families: Public, Private)
  k2  "name":"mary"
  k3
  …
  kn  "name":"john"
  Further columns shown in the figure: "title":"Mr", "age":"25"




Customer info in a column family



- Column-family databases organize their columns into column families.
- Each column has to be part of a single column family, and the column acts as a unit for access, with the assumption that data for a particular column family will usually be accessed together.
- Be careful: a column family is not the same as a table.
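
The define/insert/get interface can be sketched in Python as below. This is an in-memory stand-in that omits timestamps, and the assignment of the "title" column to row kn in the Private family is an assumption made for the example.

```python
from collections import defaultdict


class ColumnFamilyStore:
    """In-memory stand-in: family -> row key -> {column name: value}."""

    def __init__(self):
        self._families = {}

    def define(self, family):
        self._families.setdefault(family, defaultdict(dict))

    def insert(self, family, key, columns):
        # Raises KeyError if the family was never defined.
        self._families[family][key].update(columns)

    def get(self, family, key):
        # Only the columns of the requested family are returned.
        return dict(self._families[family][key])


cf = ColumnFamilyStore()
cf.define("Public")
cf.define("Private")
cf.insert("Public", "kn", {"name": "john"})
cf.insert("Private", "kn", {"title": "Mr"})  # hypothetical row assignment
```

The same row key appears in both families, but each read touches one family only, which is what lets the system store and distribute families separately.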



Aggregate orientation

It is difficult to draw aggregate boundaries:
- an order makes a good aggregate when a customer is making and reviewing orders, and when the retailer is processing orders.
- however, if a retailer wants to analyze sales of its products over the last few months, then an order aggregate becomes a problem.
- to get to the product sales history, you have to dig into every aggregate in the database.

So an aggregate structure may help with some data interactions but be an obstacle for others.



Key Points

- Aggregates form the boundaries for ACID operations with the database.
- Aggregates make it easier for the database to manage data storage over clusters.
- Aggregate-oriented databases work best when most data interaction is done with the same aggregate.
- Aggregate-ignorant databases are better when interactions use data organized in many different formations.
- Aggregate-oriented databases often compute materialized views to provide data organized differently from their primary aggregates. This is often done with map-reduce computations.
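
As a rough illustration of the last point, the Python sketch below derives a per-product sales view from order aggregates in map-reduce style. It runs in a single process, with an in-memory grouping step standing in for the shuffle of a real map-reduce framework; order 100 and product 31 are made-up sample data.

```python
from collections import defaultdict

orders = [
    {"id": 99, "orderItems": [{"productId": 27, "price": 32.45}]},
    {"id": 100, "orderItems": [{"productId": 27, "price": 32.45},
                               {"productId": 31, "price": 10.0}]},
]


def map_order(order):
    """Map: emit one (productId, price) pair per order item."""
    for item in order["orderItems"]:
        yield item["productId"], item["price"]


def reduce_sales(product_id, prices):
    """Reduce: combine all prices emitted for one product."""
    return product_id, round(sum(prices), 2)


def product_sales(order_aggregates):
    groups = defaultdict(list)  # stands in for the shuffle/group-by-key phase
    for order in order_aggregates:
        for product_id, price in map_order(order):
            groups[product_id].append(price)
    return dict(reduce_sales(p, v) for p, v in groups.items())


sales_view = product_sales(orders)
```

The map step works on one order aggregate at a time, so it can run wherever that aggregate is stored; only the grouped pairs need to move between machines.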



Big Data models

- Key-value Data Model (aggregate-oriented)
- Document Data Model (aggregate-oriented)
- Column Family Data Model (aggregate-oriented)
- Graph Data Model
- Array Data Model
- Object Data Model



Graph Data Model

- Data storage
  - data is stored in terms of nodes and (typed) edges
  - both nodes and edges can have (arbitrary) attributes
- Queries are expressed based on system ids (if no indexes exist)
- Secondary indexes for nodes and edges are supported
  - retrieve nodes by attributes and edges by type, start and/or end node, and/or attributes
- Interface
  - create: id
  - get(id)
  - connect(id1, id2): id
  - addAttribute(id, name, value)
  - getAttribute(id, name): value
- Example (figure): nodes n1, n2, n3; attributes "name":"fred", "name":"mary";"age":"25", "weight":"-1"; edges of type LIKES

This is ideal for capturing any data consisting of complex relationships such as social networks, product preferences, or eligibility rules.
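
The interface above can be sketched in Python as follows. This is an in-memory stand-in, not the API of a real graph database; storing the edge type as a "type" attribute and the `neighbours` helper are assumptions made for the example.

```python
import itertools


class Graph:
    """In-memory stand-in: nodes and edges share one id space and carry attributes."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._attrs = {}   # id -> attributes (for nodes and edges alike)
        self._edges = {}   # edge id -> (start id, end id)

    def create(self):
        new_id = next(self._ids)
        self._attrs[new_id] = {}
        return new_id

    def connect(self, id1, id2):
        edge_id = self.create()
        self._edges[edge_id] = (id1, id2)
        return edge_id

    def add_attribute(self, id_, name, value):
        self._attrs[id_][name] = value

    def get_attribute(self, id_, name):
        return self._attrs[id_][name]

    def neighbours(self, start_id, edge_type):
        """Follow outgoing edges of one type (hypothetical helper)."""
        return [end for e, (start, end) in self._edges.items()
                if start == start_id and self._attrs[e].get("type") == edge_type]


g = Graph()
n1, n2 = g.create(), g.create()
g.add_attribute(n1, "name", "fred")
g.add_attribute(n2, "name", "mary")
likes = g.connect(n1, n2)
g.add_attribute(likes, "type", "LIKES")
```

Traversal follows stored edges directly, so a question such as "whom does fred like" needs no join over separate tables.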




An example graph structure



More on this when considering polyglot persistence
