GUIDE TO SQL - NOSQL MIGRATION
Anton Yazovskiy, Solution Architect, Thumbtack Technology
Jan 15, 2015
AGENDA
• Why would you want to migrate to NoSQL
• Conceptual difference between RDBMS and NoSQL
• Data modeling and architectural best practices
• Practical migration steps / questions you have to ask
WHY?
• scalability
• performance
• developer productivity
CONCEPTUAL DIFFERENCE BETWEEN RDBMS AND NOSQL
• a relational schema allows you to query data in many different ways in different contexts
• accessible for many types of applications and separate dev teams
• schema helps to control rules common for everybody
• always remember that in most cases you run queries across the cluster
• NoSQL is about focusing on particular need and goal
• model your data for specific use case
• define what you are willing to sacrifice to achieve better results
DATA MODELING AND ARCHITECTURAL BEST PRACTICES
POLYGLOT PERSISTENCE
• different solutions are designed to solve different problems
• session & fast transactions
• cache
• aggregations
• analytical ad-hoc queries
• graph traversal
• the requirements for OLTP and OLAP storages are very different
NOSQL DATA STRUCTURES
• Key-Value: Riak, Redis, MemcacheDB, Aerospike, and Amazon DynamoDB (cloud)
• Key-Document: MongoDB and Couchbase
• Column-Family: Cassandra and HBase
• Graph: Neo4j and OrientDB
PRACTICAL MIGRATION STEPS
• what would you like to achieve
• learn your traffic
• learn your data set
• what are you willing to sacrifice
• apply polyglot persistence
• model your data
• synchronization
WHAT WOULD YOU LIKE TO ACHIEVE
• better performance
• scale current solution
• process more and/or different data
• speed up development
• I heard of it
LEARN YOUR TRAFFIC
• what does the workload look like:
• OLTP (simple lookups, short transactions)
• OLAP (aggregations, analytical queries, ad hoc scans, etc.)
• read-heavy, write-heavy
• what kinds of queries do you run to answer the application's questions:
• simple lookups, uncertain search, inner requests, traversal, BI/Analysis
LEARN YOUR DATA SET
• what kinds of data do you work with
• simple key-value
• structured, semi-structured
• nested/hierarchical
• graph-oriented
• how much data of each kind do you have
WHAT ARE YOU WILLING TO SACRIFICE
• which data doesn't require strong consistency
• where transactional guarantees aren't required
• which data are you willing to lose in case of hardware failure
• where are you willing to give up joins
APPLY POLYGLOT PERSISTENCE
• based on the answers you found, identify the most obvious storage types you may need
• fast & simple storage for lookups, non-critical data and short transactions
• RDBMS for data that fits on a single server
• document-oriented storage for nested/hierarchical data and aggregate-oriented reads & writes
• graph-oriented storage for traversal queries, social relations, etc.
• highly-scalable storage for BigData background processing
DEFINE A DATA MODEL
DATA MODELING: BEFORE YOU START
• from “what data do I have” to “what questions do I have”
• denormalization & duplication are your best friends
• hierarchical and embedded structures make your life easier, but they are your worst enemy
REFERENCES
• in-application joins
• nothing to be ashamed about
• apply carefully
{ user_name: ayazovskiy, contact: {..}, access: { level: 523, group: dev } }
{ access_level: 523, rules: [...] }
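A minimal sketch of an in-application join, assuming two plain Python dicts stand in for the two stores. The store names, the helper `load_user_with_rules`, and the rule values are made up for illustration; the slide leaves the actual rules elided.

```python
# Hypothetical in-memory stand-ins for two separate document/key-value stores.
users = {
    "ayazovskiy": {
        "user_name": "ayazovskiy",
        "access": {"level": 523, "group": "dev"},
    },
}
access_levels = {
    # Placeholder rules, invented for this example only.
    523: {"access_level": 523, "rules": ["read", "write"]},
}


def load_user_with_rules(user_name):
    """Resolve the reference in application code: two lookups instead of a JOIN."""
    user = users[user_name]                    # first request: the user document
    level = user["access"]["level"]            # the reference stored in the user
    rules_doc = access_levels.get(level, {})   # second request: the referenced document
    return {**user, "rules": rules_doc.get("rules", [])}


profile = load_user_with_rules("ayazovskiy")
print(profile["rules"])
```

Each resolved reference costs one more round trip, which is one reason to apply the technique carefully.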
DUPLICATION
• Duplication is a technique of copying pieces of data between structures in order to either optimize query processing time or fit the data to a particular business model.
• The main advantages of denormalization are the ability to:
1. reduce the number of I/O operations and query time
2. reduce complexity of query processing in distributed systems
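A small sketch of duplication, assuming posts keep a copy of the author's name so a read needs only one lookup. The `users` and `posts` dicts and both helper functions are hypothetical stand-ins, not part of any particular database API.

```python
# Hypothetical in-memory stores.
users = {"u1": {"name": "Anton"}}
posts = {}


def create_post(post_id, user_id, text):
    """Write path: copy the author's name into the post document."""
    posts[post_id] = {
        "author_id": user_id,
        "author_name": users[user_id]["name"],  # duplicated field
        "text": text,
    }


def rename_user(user_id, new_name):
    """Update path: every duplicated copy has to be found and rewritten."""
    users[user_id]["name"] = new_name
    updated = 0
    for post in posts.values():
        if post["author_id"] == user_id:
            post["author_name"] = new_name
            updated += 1
    return updated


create_post("p1", "u1", "hello")
create_post("p2", "u1", "hello again")
print(posts["p1"]["author_name"])  # read needs no second lookup
```

Reads get cheaper because the post is self-contained; the price is the fan-out in `rename_user`, which grows with the number of copies.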
AGGREGATES
• simplify data processing logic
• optimize read/write time
• ability to distribute the data across the cluster
• reduce # of requests across the cluster
• perform atomic updates
{ user_name: ayazovskiy, contact: { phone: 123, email: @thumbtack.net }, access: { level: 5, group: dev } }
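A rough sketch of why aggregates distribute well, assuming each aggregate is stored whole under one key and keys are placed on nodes by a stable hash. The node names and both functions are hypothetical; simple modulo placement is used here only for brevity, and it reshuffles keys whenever the node list changes.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members


def node_for(key):
    """Map an aggregate key to one node via a stable hash (modulo placement)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]


def requests_needed(keys):
    """Number of distinct nodes that must be contacted to read the given keys."""
    return len({node_for(k) for k in keys})


# One aggregate, one key, one node: the whole user document comes back in one request.
print(node_for("user#ayazovskiy"))
print(requests_needed(["user#ayazovskiy"]))
```

Splitting the same user into separate contact and access records under different keys could spread one logical read over several nodes.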
AGGREGATES
• updates of duplicated data are heavy and complex
• querying across aggregates is heavy and complex
COUNTERS
• NoSQL auto-increment analog
• distributed consistent auto-increment is tricky
• counters aren't always reliable
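One common way around coordination is a grow-only counter, where each node increments only its own slot and replicas merge by taking the per-node maximum. The sketch below is an illustration of that idea under those assumptions, not the counter implementation of any specific database, and the class name is made up. It yields a total count, not unique sequential IDs.

```python
class GrowOnlyCounter:
    """Per-node slots; the counter value is the sum of all slots."""

    def __init__(self, node_ids):
        self.counts = {node_id: 0 for node_id in node_ids}

    def increment(self, node_id, amount=1):
        # A node only ever writes to its own slot, so no cross-node locking is needed.
        self.counts[node_id] += amount

    def merge(self, other):
        # Taking the maximum per slot makes merging safe to repeat and reorder.
        for node_id, count in other.counts.items():
            self.counts[node_id] = max(self.counts.get(node_id, 0), count)

    def value(self):
        return sum(self.counts.values())


replica_a = GrowOnlyCounter(["n1", "n2"])
replica_b = GrowOnlyCounter(["n1", "n2"])
replica_a.increment("n1")
replica_a.increment("n1")
replica_b.increment("n2")
replica_a.merge(replica_b)
replica_b.merge(replica_a)
print(replica_a.value(), replica_b.value())
```

Before the merge the two replicas report different values, which is the kind of temporary unreliability the slide warns about.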
COMPOSITE KEYS
{ "ID": "chat#user_1#user_2#december_12_2014", "messages": [ { "user_1": "hey" }, { "user_1": "how is going?" }, { "user_2": "thanks, pretty well!" } ] }
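A short sketch of building and parsing such a key, assuming `#` as the separator as in the example above. The function names are hypothetical, and sorting the two user IDs is an added assumption so that both participants compute the same key.

```python
SEP = "#"


def chat_key(user_a, user_b, day):
    """Build a composite key; participants are sorted so order of arguments doesn't matter."""
    first, second = sorted([user_a, user_b])
    return SEP.join(["chat", first, second, day])


def parse_chat_key(key):
    """Split a composite key back into its parts."""
    kind, first, second, day = key.split(SEP)
    return {"kind": kind, "users": (first, second), "day": day}


key = chat_key("user_2", "user_1", "december_12_2014")
print(key)
print(parse_chat_key(key))
```

The parts themselves must never contain the separator, otherwise `parse_chat_key` fails on unpacking.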
APPEND
{ ID: account#User_A, account_total: $100, account_total_calculation_time: .., changes_since_last_calculation: [ 1399493200: +$10, 1399892139: -$25 ] }
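A sketch of the append pattern using the account document above, assuming amounts are whole dollars and changes are stored as (timestamp, delta) pairs. The calculation time is elided on the slide, so the value used here is a made-up placeholder, and the three helpers are hypothetical.

```python
account = {
    "ID": "account#User_A",
    "account_total": 100,
    "account_total_calculation_time": 1399000000,  # placeholder, not from the slide
    "changes_since_last_calculation": [(1399493200, 10), (1399892139, -25)],
}


def append_change(acc, timestamp, delta):
    """Write path: append only, never rewrite the stored total."""
    acc["changes_since_last_calculation"].append((timestamp, delta))


def current_balance(acc):
    """Read path: last computed total plus everything appended since."""
    pending = sum(delta for _, delta in acc["changes_since_last_calculation"])
    return acc["account_total"] + pending


def compact(acc, now):
    """Background step: fold pending changes into the total and clear the list."""
    acc["account_total"] = current_balance(acc)
    acc["account_total_calculation_time"] = now
    acc["changes_since_last_calculation"] = []


print(current_balance(account))
```

Writes stay cheap because they only append; the list is kept short by running `compact` periodically.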
THINK OF DATA SYNCHRONIZATION
• application-level synchronization:
• e.g. update the user profile in document-oriented storage, its social network in graph storage, and the session in a key-value cache
• regular synchronization:
• this may be an hourly/daily/weekly process that takes updated data and propagates it across the system
• incremental background synchronization
• solutions like Tungsten Replicator allow you to track changes in an RDBMS via the transaction log and apply them immediately to NoSQL storage
• e.g. user profiles in MySQL synchronized with Aerospike via a properly configured Tungsten Replicator
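The idea behind incremental synchronization can be sketched as replaying an ordered change log into the target store while remembering the last applied position. This is a generic illustration under that assumption, not Tungsten Replicator's API; the event layout `(sequence, operation, key, value)` and the function name are hypothetical.

```python
def apply_changes(change_log, target, last_applied):
    """Apply events newer than last_applied to target; return the new position."""
    for sequence, operation, key, value in change_log:
        if sequence <= last_applied:
            continue  # already applied on an earlier run, so replay is safe
        if operation == "upsert":
            target[key] = value
        elif operation == "delete":
            target.pop(key, None)
        last_applied = sequence
    return last_applied


# Hypothetical events as they might be read from a transaction log.
log = [
    (1, "upsert", "user#1", {"name": "a"}),
    (2, "upsert", "user#2", {"name": "b"}),
    (3, "delete", "user#1", None),
]
nosql_store = {}
position = apply_changes(log, nosql_store, 0)
print(position, nosql_store)
```

Persisting the returned position lets the process resume after a restart without reapplying or skipping events.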
“always remember that in most cases you run queries across the cluster”
– Anton Yazovskiy
THANKS / REFERENCES
• NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence by Pramod J. Sadalage and Martin Fowler
• NoSQL Data Modeling Techniques (http://highlyscalable.wordpress.com)
• MongoDB documentation (http://docs.mongodb.org)
• Couchbase documentation (http://docs.couchbase.com)
• FoundationDB Blog (http://blog.foundationdb.com)