Architectural evolution starting from Hadoop
Monica Franceschini Solution Architecture Manager
Big Data Competency Center Engineering Group
Experiences
• ENERGY: predictive analysis using geo-spatial sensor data
• FINANCE: Big Data architecture for advanced CRM
• P.A.: measurement of energy consumption for 15M users
Energy
Hadoop technologies: HDFS, Kafka, HBase, Spark, Flume, Phoenix
[Architecture diagram: external systems (JMS queues, file systems, RDBMS) feed the cluster through Flume, Kafka and Sqoop; data lands in HDFS and HBase, is processed with Spark and Spark Streaming, and is exposed through Phoenix to web apps.]
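A minimal sketch of the streaming leg of this pipeline (Kafka topic → Spark Streaming → HBase), assuming Spark 1.6-era streaming and HBase 1.x client APIs; the broker address, topic, table and column family below are hypothetical placeholders, not taken from the project:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import kafka.serializer.StringDecoder;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class SensorStreamJob {
  public static void main(String[] args) throws Exception {
    JavaStreamingContext ssc = new JavaStreamingContext(
        new SparkConf().setAppName("energy-sensor-stream"), Durations.seconds(10));

    Map<String, String> kafkaParams = new HashMap<>();
    kafkaParams.put("metadata.broker.list", "broker1:9092");        // hypothetical broker
    JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
        ssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParams, Collections.singleton("sensor-readings"));      // hypothetical topic

    VoidFunction<JavaPairRDD<String, String>> writeToHBase = rdd ->
        rdd.foreachPartition(records -> {
          // One HBase connection per partition, not per record
          try (Connection hbase = ConnectionFactory.createConnection(HBaseConfiguration.create());
               Table table = hbase.getTable(TableName.valueOf("SENSOR_MEASURES"))) { // hypothetical table
            while (records.hasNext()) {
              scala.Tuple2<String, String> kv = records.next();
              Put put = new Put(Bytes.toBytes(kv._1()));              // rowkey = Kafka message key
              put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes(kv._2()));
              table.put(put);
            }
          }
        });
    stream.foreachRDD(writeToHBase);

    ssc.start();
    ssc.awaitTermination();
  }
}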
Finance
Hadoop technologies: NFS, HBase, Spark, Phoenix
[Architecture diagram: external systems deliver files on NFS; data is loaded into HBase (on HDFS), processed with Spark, and exposed through Phoenix to web apps.]
P.A.
Hadoop technologies: HDFS, HBase, Spark, Spark MLlib, Flume, Phoenix
[Architecture diagram: external JMS sources are ingested with Flume into HDFS and HBase; Spark and Spark MLlib process the data, which is exposed through Phoenix to web apps.]
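A hedged sketch of where Spark MLlib could fit in such a pipeline: clustering per-user consumption profiles read from HDFS with MLlib's KMeans. The input path, feature layout and parameters are hypothetical; the deck does not describe the actual models used in the project:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class ConsumptionClustering {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("consumption-clustering"));

    // Hypothetical CSV layout on HDFS: userId, avg_day_kwh, avg_night_kwh, peak_kwh
    JavaRDD<Vector> profiles = sc.textFile("hdfs:///data/consumption/*.csv")
        .map(line -> {
          String[] f = line.split(",");
          return Vectors.dense(Double.parseDouble(f[1]),
                               Double.parseDouble(f[2]),
                               Double.parseDouble(f[3]));
        })
        .cache();

    // Cluster users into 5 consumption profiles, 20 iterations
    KMeansModel model = KMeans.train(profiles.rdd(), 5, 20);
    for (Vector center : model.clusterCenters()) {
      System.out.println("cluster center: " + center);
    }
    sc.stop();
  }
}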
Considerations
• Similar scenarios: Flume, HBase & Spark
• Online performance: HBase instead of HDFS
• Similar data
• High throughput
Moreover…
• Adoption of a well-established solution
• Availability of support services
• Community, open source or … free version!
Hadoop storage
• HDFS: large data sets, unstructured data, write-once-read-many access, append-only file system, Hive HQL access, high-speed writes and scans, fault-tolerant, replication
• HBase: many rows/columns, compaction, random read-writes, updates, rowkey access, NoSQL data modeling, untyped data, sparse schema, high throughput, variable columns
Some HBase features:
• Just one index, the primary key (rowkey)
• Rowkey composed of other fields
• Big denormalized tables
• Rowkey-based horizontal partitioning
• Focus on rowkey design and table schema (data modeling), see the sketch below
• The ACCESS PATTERN must be known in advance!
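A minimal sketch of these ideas with the standard HBase 1.x client: a composite rowkey (fixed-width sensor id + reversed timestamp) designed around a known access pattern, one denormalized row per measure, and a rowkey-prefix scan. Table name, column family and key layout are hypothetical:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RowkeyDesignSketch {
  // Composite rowkey: fixed-length sensorId + reversed timestamp,
  // so "latest measures of one sensor" becomes a short sequential scan.
  static byte[] rowkey(String sensorId, long timestampMillis) {
    return Bytes.add(Bytes.toBytes(String.format("%-12s", sensorId)),
                     Bytes.toBytes(Long.MAX_VALUE - timestampMillis));
  }

  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("MEASURES"))) {   // hypothetical table

      // Write one denormalized row per measurement
      Put put = new Put(rowkey("S-0042", System.currentTimeMillis()));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("kwh"), Bytes.toBytes("1.27"));
      table.put(put);

      // Access pattern known in advance: all rows of sensor S-0042, newest first (prefix scan)
      Scan scan = new Scan().setRowPrefixFilter(Bytes.toBytes(String.format("%-12s", "S-0042")));
      try (ResultScanner rs = table.getScanner(scan)) {
        for (Result r : rs) {
          System.out.println(Bytes.toString(r.getRow()));
        }
      }
    }
  }
}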
What’s missing?
• SQL language
• Analytic queries
• Secondary indexes
Performance for online applications
• Phoenix is fast: a full table scan of 100M rows is usually executed in 20 seconds (narrow table on a medium-sized cluster). This time comes down to a few milliseconds if the query contains a filter on key columns.
• Phoenix follows the philosophy of bringing the computation to the data by using:
  • coprocessors to perform operations on the server side, thus minimizing client/server data transfer
  • custom filters to prune data as close to the source as possible
  In addition, Phoenix uses native HBase APIs to minimize any startup costs.
• Query chunks: Phoenix chunks up your query using the region boundaries and runs the chunks in parallel on the client, using a configurable number of threads. The aggregation is done in a coprocessor on the server side.
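A minimal sketch of how an online application can query Phoenix through its standard JDBC driver, filtering on the leading key columns so the scan is pruned to a few HBase rows; the ZooKeeper quorum, table and column names are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixQuerySketch {
  public static void main(String[] args) throws Exception {
    // Phoenix thick driver; the JDBC URL points at the ZooKeeper quorum
    Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181")) {
      // Filter on the leading key columns so Phoenix prunes the scan server-side
      String sql = "SELECT SENSOR_ID, MEASURE_TS, KWH FROM MEASURES "
                 + "WHERE SENSOR_ID = ? AND MEASURE_TS > ?";
      try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setString(1, "S-0042");
        ps.setLong(2, System.currentTimeMillis() - 3600_000L);
        try (ResultSet rs = ps.executeQuery()) {
          while (rs.next()) {
            System.out.println(rs.getString(1) + " " + rs.getLong(2) + " " + rs.getDouble(3));
          }
        }
      }
    }
  }
}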
• Query engine + metadata store + JDBC driver
• Database over HDFS (for bulk loads and full-table scan queries)
• HBase APIs (not accessing HFiles directly)
• …what about performances?…
Query: select count(1) from table, over 1M and 5M rows. Data is 3 narrow columns. Number of Region Servers: 1 (virtual machine, HBase heap: 2GB, processor: 2 cores @ 3.3GHz Xeon)
• Query engine + metadata store + JDBC driver
• DWH over HDFS
• Runs MapReduce jobs to query HBase
• StorageHandler to read HBase
• …what about performances?…
Query: select count(1) from table over 10M and 100M rows. Data is 5 narrow columns. Number of Region Servers: 4 (HBase heap: 10GB, Processor: 6 cores @ 3.3GHz Xeon)
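A hedged sketch of the Hive-over-HBase setup behind this kind of benchmark: an external Hive table mapped onto an existing HBase table through the HBaseStorageHandler and queried via the HiveServer2 JDBC driver (the count runs as a MapReduce job over the HBase regions). Host, table names and column mapping below are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveOverHBaseSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // HiveServer2 JDBC endpoint (hypothetical host)
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {

      // External Hive table mapped onto the existing HBase table via the HBaseStorageHandler
      stmt.execute(
          "CREATE EXTERNAL TABLE IF NOT EXISTS measures_hive (rowkey STRING, kwh STRING) "
        + "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
        + "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,d:kwh') "
        + "TBLPROPERTIES ('hbase.table.name' = 'MEASURES')");

      // The count is executed as a MapReduce job over the mapped HBase table
      try (ResultSet rs = stmt.executeQuery("SELECT count(1) FROM measures_hive")) {
        if (rs.next()) {
          System.out.println("rows: " + rs.getLong(1));
        }
      }
    }
  }
}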
• Cassandra + Spark as a lightweight solution (replacing HBase + Spark)
• SQL-like language (CQL) + secondary indexes
• …what about the other Hadoop tools?...
• Converged data platform: batch+NoSQL+streaming
• MapR-FS: great for throughput and files of every size + individual updates
• Apache Drill as SQL layer on MapR-FS
• …proprietary solution…
• Developed by Cloudera, open source (→ integrated with the Hadoop ecosystem)
• Low-latency random access
• Super-fast Columnar Storage
• Designed for Next-Generation Hardware (storage based on IO of solid state drives + experimental cache implementation)
• …beta version…
“With Kudu, Cloudera promises to solve Hadoop's infamous storage problem” — InfoWorld, Sep 28, 2015
Hadoop storage
[Positioning diagram: Kudu between HDFS and HBase — a highly scalable, in-memory database for MPP workloads: fast writes, fast updates, fast reads, fast everything; structured data, fixed column schema, SQL + scan use cases. HDFS: unstructured data, deep storage, SQL + scan use cases. HBase: any-type column schema, gets/puts/micro-scans.]
Conclusions
• One size doesn’t fit all the different requirements
• The choice between different open source solutions is driven by the context
• Technology evolves
So what?
• REQUIREMENTS
• NO LOCK-IN
• PEER REVIEWS
Thank you!
Monica Franceschini Twitter @twittmonique Linkedin mfranceschini
Skype monica_franceschini Email [email protected]