Guide to SQL to NoSQL migration

GUIDE TO SQL - NOSQL MIGRATION

Anton Yazovskiy, Solution Architect, Thumbtack Technology

AGENDA

• Why would you want to migrate to NoSQL

• Conceptual difference between RDBMS and NoSQL

• Data modeling and architectural best practices

• Practical migration steps / questions you have to ask

WHY?

• scalability

• performance

• developer productivity

CONCEPTUAL DIFFERENCE BETWEEN RDBMS AND NOSQL

• relational schema allows you to query data in many different ways in different contexts

• accessible for many types of applications and separate dev teams

• schema helps to control rules common for everybody

• always remember that in most cases you run queries across the cluster

• NoSQL is about focusing on particular need and goal

• model your data for specific use case

• define what you are willing to sacrifice to achieve better results

DATA MODELING AND ARCHITECTURAL BEST PRACTICES

POLYGLOT PERSISTENCE

• different solutions are designed to solve different problems

• session & fast transactions

• cache

• aggregations

• analytical ad-hoc queries

• graph traversal

• the requirements for OLTP and OLAP storages are very different

NOSQL DATA STRUCTURES

• Key-Value: Riak, Redis, MemcacheDB, Aerospike and Amazon DynamoDB (Cloud)

• Key-Document: MongoDB and Couchbase

• Column-Family: Cassandra and HBase

• Graph: Neo4j and OrientDB

PRACTICAL MIGRATION STEPS

• what would you like to achieve

• learn your traffic

• learn your data set

• what are you willing to sacrifice

• apply polyglot persistence

• model your data

• synchronization

WHAT WOULD YOU LIKE TO ACHIEVE

• better performance

• scale current solution

• process more and/or different data

• speed up development

• I heard of it

LEARN YOUR TRAFFIC

• what the workload looks like:

• OLTP (simple lookups, short transactions)

• OLAP (aggregations, analytical queries, ad hoc scans, etc.)

• read-heavy, write-heavy

• what kinds of queries do you run to answer the application's questions:

• simple lookups, uncertain search, inner requests, traversal, BI/Analysis

LEARN YOUR DATA SET

• what kinds of data do you work with

• simple key-value

• structured, semi-structured

• nested/hierarchical

• graph-oriented

• how much data of each type do you have

WHAT ARE YOU WILLING TO SACRIFICE

• what data doesn't require strong consistency

• where transactional guarantees aren't required

• what data are you willing to lose in case of hardware failure

• where are you willing to sacrifice joins

APPLY POLYGLOT PERSISTENCE

• Based on the answers you found, identify the most obvious types of storage that you may need

• fast & simple storage for lookups, non-critical data and short transactions

• RDBMS for data that fits on a single server

• document-oriented storage for nested/hierarchical data and aggregate-oriented reads & writes

• graph-oriented storage for traversal queries, social relations, etc.

• highly-scalable storage for BigData background processing

DEFINE A DATA MODEL

DATA MODELING: BEFORE YOU START

• from “what data do I have” to “what questions do I have”

• denormalization & duplication are your best friends

• hierarchical and embedded structures make your life easier, but they can also be your worst enemy

REFERENCES

• in-application joins

• nothing to be ashamed about

• apply carefully

{ user_name: ayazovskiy, contact: {..}, access: { level: 523, group: dev } }

{ access_level: 523, rules: [...] }
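A minimal Python sketch of an in-application join over two documents shaped like the ones above. The dicts stand in for two collections in a document store; the names `users`, `access_rules`, and `load_user_with_rules` are hypothetical, and the rule values are invented for illustration.

```python
# Two in-memory "collections" standing in for a document store (assumption).
users = {
    "ayazovskiy": {"user_name": "ayazovskiy", "access": {"level": 523, "group": "dev"}},
}
access_rules = {
    523: {"access_level": 523, "rules": ["read", "write"]},  # illustrative values
}


def load_user_with_rules(user_name):
    """Join a user to its access rules in application code (two lookups)."""
    user = users[user_name]                        # lookup 1: the user document
    rules = access_rules[user["access"]["level"]]  # lookup 2: follow the reference
    return {**user, "rules": rules["rules"]}


profile = load_user_with_rules("ayazovskiy")
```

Every reference that is followed costs one more round trip to the store, which is the reason to apply references carefully.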

DUPLICATION

• Duplication is a technique of copying pieces of data between structures in order to either optimize query processing time or convert data into a particular business model.

• The main advantages of denormalization are the ability to:

1. reduce the number of I/O operations and query time

2. reduce complexity of query processing in distributed systems
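A small Python sketch of duplication, assuming a hypothetical blog-like data set held in plain dicts. The author's name is copied into every post so that a read needs one lookup; the same copy is what makes a rename expensive.

```python
# Source of truth for author data (hypothetical data set).
authors = {"u1": {"name": "Anton"}}

# Denormalized posts: the author's name is duplicated into each document.
posts = {
    "p1": {"author_id": "u1", "author_name": "Anton", "title": "SQL to NoSQL"},
    "p2": {"author_id": "u1", "author_name": "Anton", "title": "Polyglot persistence"},
}


def read_post(post_id):
    """One lookup returns everything needed to render the post."""
    return posts[post_id]


def rename_author(author_id, new_name):
    """The price of duplication: every copy has to be rewritten."""
    authors[author_id]["name"] = new_name
    touched = 0
    for post in posts.values():
        if post["author_id"] == author_id:
            post["author_name"] = new_name
            touched += 1
    return touched


touched = rename_author("u1", "Anton Y.")
```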

AGGREGATES

• simplify data processing logic

• optimize read/write time

• ability to distribute the data across the cluster

• reduce # of requests across the cluster

• perform atomic updates

{ user_name: ayazovskiy, contact: { phone: 123, email: @thumbtack.net }, access: { level: 5, group: dev } }

AGGREGATES

• updates of duplicated data are heavy and complex

• querying across aggregates is heavy and complex

{ user_name: ayazovskiy, contact: { phone: 123, email: @thumbtack.net }, access: { level: 5, group: dev } }
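Both sides of the aggregate trade-off can be sketched in Python, assuming a key-value store modeled as a dict. The helper names are hypothetical, the second user is invented, and the email field is left out because the original elides it.

```python
import copy

# A key-value store of user aggregates, modeled as a dict (assumption).
store = {
    "ayazovskiy": {"contact": {"phone": 123}, "access": {"level": 5, "group": "dev"}},
    "user_b": {"contact": {"phone": 456}, "access": {"level": 1, "group": "qa"}},
}


def update_aggregate(key, change):
    """Read the whole aggregate, modify a copy, and write it back in one put."""
    aggregate = copy.deepcopy(store[key])
    change(aggregate)
    store[key] = aggregate  # single-key write; typically atomic in real stores


def users_in_group(group):
    """A query across aggregates has no key to use, so it scans everything."""
    return sorted(k for k, v in store.items() if v["access"]["group"] == group)


def promote(aggregate):
    aggregate["access"]["level"] = 6


update_aggregate("ayazovskiy", promote)
```

In a cluster, the scan in `users_in_group` would touch every node, while the update touches only the node that owns the key.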

COUNTERS

• NoSQL auto-increment analog

• distributed consistent auto-increment is tricky

• counters aren't always reliable
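One common way to build an auto-increment analog is a compare-and-set retry loop. The Python sketch below runs against an in-memory stand-in; `VersionedStore` and its methods are hypothetical and do not represent any particular database's API.

```python
class VersionedStore:
    """In-memory stand-in for a store with compare-and-set (hypothetical API)."""

    def __init__(self):
        self._data = {}  # key -> (value, version)

    def get(self, key):
        return self._data.get(key, (0, 0))

    def compare_and_set(self, key, value, expected_version):
        _, version = self.get(key)
        if version != expected_version:
            return False  # another writer got there first
        self._data[key] = (value, version + 1)
        return True


def next_id(store, key, max_retries=10):
    """Increment the counter, retrying when a concurrent writer wins the race."""
    for _ in range(max_retries):
        value, version = store.get(key)
        if store.compare_and_set(key, value + 1, version):
            return value + 1
    raise RuntimeError("too much contention on counter " + key)


store = VersionedStore()
ids = [next_id(store, "user_id") for _ in range(3)]
```

Under heavy contention the retries pile up on a single key, which is part of why distributed auto-increment is tricky.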

COMPOSITE KEYS

{ "ID": "chat#user_1#user_2#december_12_2014", "messages": [ { "user_1": "hey" }, { "user_1": "how is going?" }, { "user_2": "thanks, pretty well!" } ] }
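A short Python sketch of building and parsing a composite key like the one above. The function names are hypothetical, and sorting the two participants so that both sides derive the same key is an assumption, not something the example states.

```python
SEPARATOR = "#"


def chat_key(user_a, user_b, day):
    """Build the key for one day of a conversation between two users."""
    first, second = sorted([user_a, user_b])  # same key whoever asks (assumption)
    return SEPARATOR.join(["chat", first, second, day])


def parse_chat_key(key):
    """Split a composite key back into its parts."""
    kind, first, second, day = key.split(SEPARATOR)
    return {"kind": kind, "users": (first, second), "day": day}


key = chat_key("user_2", "user_1", "december_12_2014")
parts = parse_chat_key(key)
```

Because the key can be computed from values the application already has, the whole day's conversation is fetched with a single lookup and no secondary index.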

APPEND

{ ID: account#User_A, account_total: $100, account_total_calculation_time: .., changes_since_last_calculation: [ 1399493200: +$10, 1399892139: -$25 ] }
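The append pattern above can be sketched in Python as follows. The document is a plain dict, the helper names are hypothetical, amounts are plain integers, and the calculation-time field is left out because the original elides its value.

```python
account = {
    "id": "account#User_A",
    "account_total": 100,
    "changes_since_last_calculation": [],  # list of (timestamp, delta) pairs
}


def append_change(doc, timestamp, delta):
    """Writers only append; they never read or rewrite the total."""
    doc["changes_since_last_calculation"].append((timestamp, delta))


def current_total(doc):
    """Readers add the pending changes to the last calculated total."""
    pending = sum(delta for _, delta in doc["changes_since_last_calculation"])
    return doc["account_total"] + pending


def compact(doc):
    """Fold the pending changes into the total and clear the list."""
    doc["account_total"] = current_total(doc)
    doc["changes_since_last_calculation"] = []


append_change(account, 1399493200, 10)
append_change(account, 1399892139, -25)
balance = current_total(account)
```

Running `compact` periodically keeps the change list short without changing the balance that readers see.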

THINK OF DATA SYNCHRONIZATION

• application-level synchronization:

• e.g. update the user profile in document-oriented storage, its social network in graph storage, and the session in a key-value cache

• regular synchronization:

• this may be an hourly/daily/weekly process that takes updated data and propagates it across the system

• incremental background synchronization

• solutions like Tungsten Replicator allow you to track changes in an RDBMS via its transaction log and apply these changes immediately to NoSQL storage

• e.g. user profiles in MySQL synchronized with Aerospike via a properly configured Tungsten Replicator
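The idea behind incremental background synchronization can be sketched in Python as a checkpointed replay of a change log. The event format and function name are hypothetical; this does not reproduce Tungsten Replicator's actual interface or configuration.

```python
# A simplified change log, as a replicator might extract it from a
# transaction log. The event format is hypothetical.
change_log = [
    {"seq": 1, "op": "upsert", "key": "user:1", "row": {"name": "Ann"}},
    {"seq": 2, "op": "upsert", "key": "user:2", "row": {"name": "Bob"}},
    {"seq": 3, "op": "delete", "key": "user:1", "row": None},
]


def apply_changes(log, target, last_applied_seq):
    """Apply events newer than the checkpoint, in order; return the new checkpoint."""
    for event in log:
        if event["seq"] <= last_applied_seq:
            continue  # already applied, so replaying the log is safe
        if event["op"] == "upsert":
            target[event["key"]] = event["row"]
        elif event["op"] == "delete":
            target.pop(event["key"], None)
        last_applied_seq = event["seq"]
    return last_applied_seq


nosql_store = {}
checkpoint = apply_changes(change_log, nosql_store, last_applied_seq=0)
```

Storing the checkpoint alongside the target data lets the process restart after a failure without applying an event twice.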

“always remember that in most cases you run queries across the cluster”

- Anton Yazovskiy

Any questions?

Thank you

@yazovsky

http://www.thumbtack.net

THANKS / REFERENCES

• NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence by Pramod J. Sadalage and Martin Fowler

• NoSQL Data Modeling Techniques (http://highlyscalable.wordpress.com)

• MongoDB documentation (http://docs.mongodb.org)

• Couchbase documentation (http://docs.couchbase.com)

• FoundationDB Blog (http://blog.foundationdb.com)
