RDBMS vs NoSQL
Relational Database Management Systems versus Big Data Management (NoSQL) Systems
ARNAB BHATTACHARYA [email protected]
Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, India
9th August, 2017
TEQIP Short Course on Big Data
Relational Database Management Systems
  A database is a collection of interrelated data
  A database management system (DBMS) provides an environment that is efficient and convenient to use
  Programs and interface to
    Store data
    Visualize data
    Access (query) data
    Manipulate data
Table-based
Relational algebra as mathematical background
  Operators are precisely defined
  Operands are relations
Relations are sets of tuples
Tuples consist of named attributes
Concept of candidate keys to uniquely identify tuples
Queries across relations (joins) are natural
Procedures can be coded into the RDBMS engine
Triggers and views are supported
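The idea that relations are sets of tuples with named attributes, and that joins follow naturally from this, can be sketched in a few lines of Python. This is only an illustration: the `Student` and `Enrol` relations, their attribute names and the `natural_join` helper are hypothetical and do not come from the slides.

```python
from collections import namedtuple

# Hypothetical relations: each one is a set of tuples with named attributes.
Student = namedtuple("Student", ["roll", "name"])
Enrol = namedtuple("Enrol", ["roll", "course"])

students = {Student(1, "Asha"), Student(2, "Ravi")}
enrolments = {Enrol(1, "CS101"), Enrol(1, "CS202"), Enrol(2, "CS101")}


def natural_join(r, s):
    """Join two relations on every attribute name they have in common."""
    out = set()
    for a in r:
        for b in s:
            da, db = a._asdict(), b._asdict()
            common = set(da) & set(db)
            if all(da[k] == db[k] for k in common):
                merged = {**da, **db}
                # Store each result tuple as sorted (attribute, value) pairs
                # so that the result is again a set of tuples.
                out.add(tuple(sorted(merged.items())))
    return out


joined = natural_join(students, enrolments)
```

Here `roll` acts as the candidate key of `Student`, and it is the only shared attribute, so the join pairs each student with that student's enrolments.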
Structured Query Language (SQL)
  Formally defined programming language based on relational algebra
  Declarative language
  RDBMS engine is free to choose the implementation of operations
  Decades of query optimization
  Indexing
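A small sketch of what "declarative" means in practice, using SQLite through Python's standard `sqlite3` module as a stand-in for an RDBMS (the choice of SQLite, the `student` table and the index name are assumptions made for illustration). The query states what is wanted; whether the engine scans the table or uses the index is left to the engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (roll INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Asha", "CSE"), (2, "Ravi", "EE"), (3, "Meera", "CSE")],
)

# An index is an access path the engine may use; the query text does not change.
conn.execute("CREATE INDEX idx_student_dept ON student(dept)")

# Declarative query: no loops, no mention of how rows are located.
names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM student WHERE dept = ? ORDER BY name", ("CSE",)
    )
]

# The plan chosen by the engine can be inspected; its exact text varies by SQLite version.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM student WHERE dept = 'CSE'"
).fetchall()
```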
RDBMSs offer in-built transaction support
A transaction is a logical unit of a program
ACID properties to preserve data integrity
  Atomicity: either all operations or none
  Consistency: database remains consistent before and after a transaction
  Isolation: one transaction has no effect on another even if they run concurrently
  Durability: effect of a transaction is permanent
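Atomicity can be illustrated with a money transfer, again using SQLite via `sqlite3` as an assumed stand-in for an RDBMS. The `account` table, the `CHECK` constraint and the `transfer` helper are hypothetical. If the second update fails, the first one is rolled back, so either both operations happen or neither does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one transaction; return True if committed."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint; the earlier credit is undone too.
        return False


def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM account"))
```

A call such as `transfer(conn, "A", "B", 500)` credits B first and then fails on the debit; afterwards `balances(conn)` shows both accounts unchanged.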
A schedule is a chronological sequence of instructions from concurrent transactions
  If a transaction appears in a schedule, all instructions of the transaction must appear in the schedule
  Order of instructions within a transaction must be maintained in the schedule
To increase concurrency, multiple transactions should be able to run simultaneously
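The two conditions on a schedule can be checked mechanically: projecting the schedule onto any transaction that appears in it must give back exactly that transaction's instruction list. The sketch below assumes a schedule is a list of `(transaction id, instruction)` pairs; the representation and the name `is_valid_schedule` are illustrative choices, not notation from the slides.

```python
def is_valid_schedule(schedule, transactions):
    """Check the two schedule conditions.

    schedule: list of (txn_id, instruction) in chronological order.
    transactions: dict mapping txn_id to its ordered list of instructions.
    """
    projected = {}
    for txn, instruction in schedule:
        projected.setdefault(txn, []).append(instruction)
    # Each transaction that appears must appear completely and in its own order.
    return all(
        txn in transactions and instructions == transactions[txn]
        for txn, instructions in projected.items()
    )


transactions = {
    "T1": ["read(A)", "write(A)"],
    "T2": ["read(B)", "write(B)"],
}
interleaved = [
    ("T1", "read(A)"), ("T2", "read(B)"), ("T1", "write(A)"), ("T2", "write(B)"),
]
incomplete = [("T1", "read(A)"), ("T2", "read(B)"), ("T2", "write(B)")]
reordered = [("T1", "write(A)"), ("T1", "read(A)")]
```

Note that this only checks that a sequence is a well-formed schedule; it says nothing about whether the interleaving is serializable.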
Scalability of RDBMS is a problem
  It is at most vertical, i.e., across relations
  All tuples in a relation must stay in one machine
Distributed design is harder
  Indexing across distributed machines is not provided naturally
Hard to model complex data
  Hierarchical
  Spatio-temporal
  Graphs
  Semi-structured
NoSQL is not “no-SQL”
  It is “not only SQL”
  It does not aim to provide the ACID properties
  Originated as no-SQL though
  Later changed since RDBMS is too powerful to always ignore
Scalability is horizontal, i.e., tuples can be put across distributed machines
Flexibility to model any kind of data
  Natural way of modeling data
Distribution support is in-built
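Horizontal scalability means the tuples of one logical table are spread over several machines. One common way to do this, assumed here purely for illustration, is hash partitioning on a key; the helper names `shard_for` and `partition` are hypothetical.

```python
import hashlib


def shard_for(key, num_machines):
    """Map a key to a machine index using a stable hash (same result on every run)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_machines


def partition(tuples, key_index, num_machines):
    """Distribute tuples over machines according to the hash of their key field."""
    shards = [[] for _ in range(num_machines)]
    for t in tuples:
        shards[shard_for(t[key_index], num_machines)].append(t)
    return shards


tuples = [(i, f"user{i}") for i in range(20)]
shards = partition(tuples, key_index=0, num_machines=4)
```

A lookup by key only needs to contact the machine `shard_for(key, 4)`. Simple modulo placement like this moves most keys when the number of machines changes, which is why real systems tend to use more elaborate schemes.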
CAP: consistency (C), availability (A) and partition tolerance (P)
  All of C, A, P cannot be satisfied simultaneously
  CA: single-site; partitioning is not allowed
  CP: what is available is consistent
  AP: everything is available but may not be consistent
  Began as a conjecture rather than a theorem; a formal version was later proved (Gilbert and Lynch, 2002)
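The CP and AP choices can be seen in a toy model of one value replicated on two machines. Everything here (the class `ReplicatedRegister`, the `partitioned` flag, the two modes) is a deliberately simplified assumption and not a description of any real system.

```python
class ReplicatedRegister:
    """A single value stored on two replicas, with a simulated network partition."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [0, 0]
        self.partitioned = False

    def write(self, replica, value):
        if not self.partitioned:
            self.replicas = [value, value]  # both replicas reachable: update both
            return True
        if self.mode == "CP":
            return False  # refuse the write: stay consistent, give up availability
        self.replicas[replica] = value  # AP: accept locally, replicas may diverge
        return True

    def read(self, replica):
        return self.replicas[replica]
```

During a partition, the CP register rejects writes so both replicas still agree, while the AP register keeps accepting writes and the two replicas can return different values.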
BASE properties
  Basically Available: the system guarantees availability
  Soft state: the state of the system is soft, i.e., it may change without input to maintain consistency
  Eventual consistency: data will eventually become consistent, provided there is no interim perturbation (no new updates)
  Sacrifices consistency
  To counter ACID
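Soft state and eventual consistency can be sketched with replicas that exchange their contents in the background. The reconciliation rule used below, last write wins by timestamp, is an assumption chosen for simplicity; the slides do not prescribe one, and the function names are hypothetical.

```python
def merge(a, b):
    """Combine two replicas, keeping the newest (timestamp, value) for each key."""
    out = dict(a)
    for key, (ts, value) in b.items():
        if key not in out or ts > out[key][0]:
            out[key] = (ts, value)
    return out


def anti_entropy(replicas):
    """One background synchronization round: every replica ends with the merged state."""
    merged = {}
    for replica in replicas:
        merged = merge(merged, replica)
    return [dict(merged) for _ in replicas]


# Two replicas that have diverged: each maps key -> (timestamp, value).
r1 = {"x": (1, "old")}
r2 = {"x": (2, "new"), "y": (1, "only-here")}
synced = anti_entropy([r1, r2])
```

The state of `r1` changes during synchronization without any client input (soft state), and once no new updates arrive all replicas hold the same data (eventual consistency).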
Instead of rows being stored together, columns are stored consecutively
A single disk block (or a set of consecutive blocks) stores a single column family
  A column family may consist of one or multiple columns
  This set of columns is called a super column
Two main types
  Columnar relational models
  Key-value stores and/or big tables
Columnar relational models
  Not NoSQL; actually an RDBMS
  Column-wise storage on the disk
  Allows faster querying when only a few columns are touched over the entire data
  Allows compression of columns
  Provides better memory caching
  Joins are faster since they are mostly on similar columns from two tables
  Not good for updates
  Not good when many columns of a few tuples are accessed
  Good for OLAP (online analytical processing)
  Not good for OLTP (online transaction processing)
  Example: MonetDB
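The difference between the two layouts, and why a column compresses well, can be shown with plain Python lists. The table contents and function names are made up for illustration, and run-length encoding is just one possible compression scheme.

```python
from itertools import groupby

schema = ("id", "dept", "salary")
rows = [(1, "CSE", 50), (2, "CSE", 60), (3, "EE", 55), (4, "EE", 45)]

# Row store: whole tuples are kept together (the `rows` list above).
# Column store: all values of one attribute are kept together.
columns = {name: [row[i] for row in rows] for i, name in enumerate(schema)}


def total_salary_row_store(rows):
    # Every full tuple is visited even though only one attribute is needed.
    return sum(row[2] for row in rows)


def total_salary_column_store(columns):
    # Only the single column that the query touches is read.
    return sum(columns["salary"])


def run_length_encode(values):
    """Compress a column into (value, run length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


encoded_dept = run_length_encode(columns["dept"])
```

The same sketch also hints at the weakness: inserting or updating one tuple means touching every column list, which is why the layout suits OLAP better than OLTP.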
Key-value stores
  Two columns: a key and a value
    Key is mostly text
    Value can be anything and is simply an object
  Essentially, the actual data becomes the “value” and a unique id is generated which becomes the “key”
  The whole database is then just one big table with these two columns
  Becomes schema-less
  Can be distributed and is, thus, highly scalable
  In essence, a big distributed hash table
  All queries are on keys
  Keys are necessarily indexed
  Examples: Cassandra, CouchDB, HBase, Tokyo Cabinet, Redis
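A minimal in-memory sketch of the key-value idea: the data becomes an opaque value, a generated id becomes the key, and the only efficient access is by key. The class `KeyValueStore` and its methods are hypothetical, and serializing values as JSON is an assumption made to keep them opaque to the store.

```python
import json
import uuid


class KeyValueStore:
    """One big two-column table: key -> opaque value."""

    def __init__(self):
        self._table = {}  # a hash table, so keys are indexed by construction

    def put(self, value):
        key = uuid.uuid4().hex  # generated unique id becomes the key
        self._table[key] = json.dumps(value)  # the store never looks inside
        return key

    def get(self, key):
        return json.loads(self._table[key])

    def scan(self, predicate):
        # A query on anything other than the key must unpack every value.
        return [k for k, raw in self._table.items() if predicate(json.loads(raw))]
```

For example, `key = store.put({"name": "Asha", "dept": "CSE"})` followed by `store.get(key)` is a direct lookup, whereas `store.scan(lambda v: v["dept"] == "CSE")` has to examine the whole table.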
Document stores
  Uses documents as the main storage format of data
  Popular document formats are XML, JSON, BSON, YAML
  Document itself is the key while the content is the value
  Document can be indexed by id or simply its location (e.g., URI)
  Content needs to be parsed to make sense
  Content can be organised further
  Extremely useful for insert-once read-many scenarios
  Can use map-reduce framework to compute
  Examples: MongoDB, CouchDB
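The points that content must be parsed and that computation can follow the map-reduce pattern are sketched below on JSON documents held in a Python dictionary. The documents, the `tags` field and the function names are invented for the example; this is a single-process imitation of the pattern, not the API of any document store.

```python
import json
from collections import defaultdict

# Document id -> raw content; the store itself does not interpret the content.
documents = {
    "doc1": '{"title": "Intro", "tags": ["nosql", "db"]}',
    "doc2": '{"title": "Joins", "tags": ["sql", "db"]}',
}


def map_tags(doc_id, content):
    """Map step: parse one document and emit (tag, 1) pairs."""
    doc = json.loads(content)
    for tag in doc.get("tags", []):
        yield tag, 1


def reduce_sum(key, values):
    """Reduce step: combine all values emitted for one key."""
    return key, sum(values)


def map_reduce(documents, mapper, reducer):
    groups = defaultdict(list)
    for doc_id, content in documents.items():
        for key, value in mapper(doc_id, content):
            groups[key].append(value)  # shuffle: group emitted values by key
    return dict(reducer(key, values) for key, values in groups.items())


tag_counts = map_reduce(documents, map_tags, reduce_sum)
```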
Graph databases
  Nodes represent entities or objects
  Edges encode relationships between nodes
    Can be directed
    Can have hyper-edges as well
  Easier to find distances and neighbors
  Examples: Neo4j, HyperGraphDB, InfiniteGraph, Titan, FlockDB
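Why neighbors and distances are easy to obtain in a graph model can be seen from an adjacency-list sketch with breadth-first search. The small directed graph and the function names are made up for illustration, and distance here means the number of edges on a shortest path.

```python
from collections import deque

# Directed graph as adjacency lists: node -> list of nodes it points to.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": []}


def neighbors(graph, node):
    """Neighbors are a direct lookup, with no join needed."""
    return graph.get(node, [])


def distances(graph, source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```

In a relational design, the same distance query would typically need repeated self-joins on an edge table.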
HBase
  Based on Hadoop, HDFS and BigTable
  Key-value store
  Column store
  Uses column families
  Requires ZooKeeper for distributed coordination, configuration and maintenance
  Centralized master that dictates slaves
  Strong consistency
  Versioning can be done
  Scales by adding nodes
  CP
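The combination of column families and versioning can be pictured as a nested map from row key to column family to column to a short list of timestamped values. The class below is a conceptual model only; `VersionedTable`, its methods and the `max_versions` parameter are hypothetical names and not the client API of any particular system.

```python
class VersionedTable:
    """row key -> column family -> column -> [(timestamp, value), ...] newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.rows = {}

    def put(self, row, family, column, value, ts):
        cell = (
            self.rows.setdefault(row, {}).setdefault(family, {}).setdefault(column, [])
        )
        cell.append((ts, value))
        cell.sort(key=lambda version: version[0], reverse=True)
        del cell[self.max_versions:]  # keep only the newest versions

    def get(self, row, family, column, ts=None):
        """Return the newest value, or the newest one at or before ts."""
        cell = self.rows.get(row, {}).get(family, {}).get(column, [])
        for version_ts, value in cell:
            if ts is None or version_ts <= ts:
                return value
        return None
```

With `max_versions=2`, writing three values to the same cell keeps the two newest, so a read "as of" the oldest timestamp finds nothing.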
Cassandra
  Column-store based on BigTable
  Decentralized architecture
  Replicated
  Any node can perform any action
  Strong security
  Continuous availability
  Extremely good single-tuple read performance
  Not fully consistent
  Requires quorum reads for consistency
  AP
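Why quorum reads restore consistency can be sketched with N replicas, a write quorum W and a read quorum R: when R + W > N, every read set overlaps every write set, so at least one replica contacted by the read holds the latest version. The class `QuorumStore` is a hypothetical single-process model; in particular, deriving the next version number from all replicas is a simplification.

```python
class QuorumStore:
    """N replicas of one value, each holding a (version, value) pair."""

    def __init__(self, n, w, r):
        self.n, self.w, self.r = n, w, r
        self.replicas = [(0, None)] * n

    def write(self, value, targets):
        """Write to the replica indices in targets; at least w are required."""
        if len(targets) < self.w:
            raise ValueError("write quorum not met")
        # Simplification: the new version is taken from a global view of the replicas.
        version = max(v for v, _ in self.replicas) + 1
        for i in targets:
            self.replicas[i] = (version, value)

    def read(self, targets):
        """Read from the replica indices in targets and return the newest value seen."""
        if len(targets) < self.r:
            raise ValueError("read quorum not met")
        return max((self.replicas[i] for i in targets), key=lambda pair: pair[0])[1]
```

With N = 3, W = 2, R = 2, a write to replicas 0 and 1 is visible to a read of replicas 1 and 2. With R = 1, a read of replica 2 alone can still return the stale value, which is the "not fully consistent" case.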
Limitations of NoSQL
  No join support (unless columnar RDBMS)
    Cannot work across tables
  Requires unraveling of data values to answer deeper queries
  No natural or direct procedural support
  Consistency may be sacrificed
NoSQL, although it started as anti-SQL, is no longer so
More a realisation that, for some cases,
  RDBMS does not scale or distribute, or
  ACIDity is an overkill
NoSQL is not good for every scenario
  Consistency cannot always be sacrificed
  Most legacy systems still use RDBMS
Many NoSQL systems are increasingly using features of RDBMS
The NoSQL horizon is shifting rapidly with almost no control or sense
However, the trend is towards NoSQL as cloud computing and big data rely on it
https://db-engines.com/en/system/Cassandra%3BHBase%3BMongoDB%3BPostgreSQL