Not your Dad’s Old HBase Gilad Moscovitch - Senior Consultant UXC PS @moscovig Yaniv Rodenski - Principal Consultant UXC PS @YRodenski

Apr 15, 2017


Yaniv Rodenski
Transcript
Page 1: Not your dad's h base new

Not your Dad’s Old HBase

Gilad Moscovitch - Senior Consultant UXC PS, @moscovig

Yaniv Rodenski - Principal Consultant UXC PS, @YRodenski

Page 2: Not your dad's h base new

Agenda

Our use cases

Introduction to Apache Phoenix

The first use case - retrospective

Managing a large scale Graph with TitanDB

The second use case - retrospective

Page 3: Not your dad's h base new

The Cable Company

Our story starts with a cable company that grew:

Over a decade ago, bought an ISP

Bought a mobile network

Started new ventures such as VOD and VoIP

Page 4: Not your dad's h base new

Our Dataset

Billions of records (PB scale)

Countless number of formats:

Multiple systems

Network equipment

Devices

Dynamic data model

New devices are introduced frequently (on average every two weeks)

New demands are introduced even more frequently

Page 5: Not your dad's h base new

The Cable Guys:

Gilad Moscovitch - Engineering Manager

Yaniv Rodenski - Architect in the CTO team

Page 6: Not your dad's h base new

Our Starting Point:

Devices

Systems of Records

ETL via ODI

Oracle Exadata

Page 7: Not your dad's h base new

Challenges

The Oracle Data Warehouse and ODI could not handle the load

ETL devs could not handle the load either, so the ETL team became a bottleneck

Not all data types arrive at the warehouse

We had to prioritise due to lack of ETL devs

Incompatibility with the existing data model

Changes to the data model would take an average of a month

Even when data was loaded, analysts were not aware of the new tables, and we ended up with an unusable schema

Page 8: Not your dad's h base new

More Challenges

New data models that are not a good fit for SQL databases:

Sparse data

Geospatial data

Full text

Graph

Need to ask harder questions that require heavy processing:

Machine learning

Page 9: Not your dad's h base new

Breaking Out

The new data platform was Hadoop based

Using CDH (at that time the most advanced option)

Trying to reuse existing components of the platform as much as possible

Page 10: Not your dad's h base new

Challenge #1: Early Data Access

Giving analysts, BI developers and business access to raw data

For this use case we reviewed a few tools, including Apache Phoenix

Page 11: Not your dad's h base new

Apache Phoenix - SQL on HBase

Apache Phoenix is a relational database layer over HBase with a difference:

Table metadata is stored in an HBase table and versioned, so snapshot queries over prior versions automatically use the correct schema

Secondary indexes

Dynamic columns with schema on read

Views

Indexed

Updatable
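The features on this slide can be sketched as a handful of Phoenix SQL statements. The snippet keeps them in a Python list so they can be passed to any DB-API style cursor connected to Phoenix. The table, column, index and view names are hypothetical and do not come from the talk; treat this as an illustration of the syntax, not as the schema the team used.

```python
# Phoenix SQL statements illustrating the features listed on this slide.
# Every table, column, index and view name here is made up for illustration.
PHOENIX_STATEMENTS = [
    # A table whose composite primary key becomes the HBase row key.
    """CREATE TABLE IF NOT EXISTS device_events (
           device_id  VARCHAR NOT NULL,
           event_time TIMESTAMP NOT NULL,
           event_type VARCHAR,
           CONSTRAINT pk PRIMARY KEY (device_id, event_time))""",
    # Secondary index on a non-key column.
    "CREATE INDEX IF NOT EXISTS idx_event_type ON device_events (event_type)",
    # Dynamic column: signal_dbm is not in the table definition,
    # so its type is declared inline when writing...
    """UPSERT INTO device_events
           (device_id, event_time, event_type, signal_dbm INTEGER)
       VALUES ('modem-1', TO_TIMESTAMP('2017-04-15 10:00:00'), 'modem', -52)""",
    # ...and declared again when reading (schema on read).
    """SELECT device_id, signal_dbm
       FROM device_events (signal_dbm INTEGER)
       WHERE event_type = 'modem'""",
    # A view over the table, filtered with a simple equality condition.
    """CREATE VIEW IF NOT EXISTS modem_events AS
       SELECT * FROM device_events WHERE event_type = 'modem'""",
]


def run_all(cursor, statements=PHOENIX_STATEMENTS):
    """Execute each statement in order on a DB-API 2.0 style cursor."""
    for sql in statements:
        cursor.execute(sql)
```

Any Phoenix client that exposes a cursor with an `execute` method could be passed to `run_all`; which client library to use is left open here.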

Page 12: Not your dad's h base new

Demo - Apache Phoenix

Page 13: Not your dad's h base new

Challenge #1: Results

In addition to Phoenix we also looked at Hive and Impala

Spark SQL, Presto and Drill were not considered due to immaturity

Impala was chosen

Schema on read was important

Hive on CDH doesn’t support Tez

Apache Phoenix was overkill and better suited to be a database rather than a warehouse

Page 14: Not your dad's h base new

Challenge #2: Family Time

Clients are never represented by a single entity:

Households

Business

Clients have multiple devices generating data:

Home and mobile phones

IP addresses for devices

DVRs
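The household problem above is naturally a graph: a client entity is linked to each of its devices, and questions such as "which other devices belong to whoever uses this IP address?" become short traversals. The sketch below uses a plain in-memory adjacency structure rather than any graph database API, and all vertex kinds and identifiers are invented for illustration.

```python
from collections import defaultdict


class Graph:
    """Undirected graph stored as adjacency sets; vertices are (kind, id) tuples."""

    def __init__(self):
        self._adj = defaultdict(set)

    def link(self, a, b):
        """Add an undirected edge between vertices a and b."""
        self._adj[a].add(b)
        self._adj[b].add(a)

    def neighbours(self, vertex, kind=None):
        """Adjacent vertices, optionally restricted to one vertex kind."""
        return {v for v in self._adj.get(vertex, set())
                if kind is None or v[0] == kind}


def same_household(graph, device):
    """All other devices that share a household with the given device."""
    found = set()
    for household in graph.neighbours(device, kind="household"):
        found |= graph.neighbours(household)
    found.discard(device)
    return found


# Hypothetical household with four devices.
g = Graph()
household = ("household", "h-1")
for dev in [("mobile", "m-1"), ("home_phone", "p-1"),
            ("ip", "10.0.0.7"), ("dvr", "d-1")]:
    g.link(household, dev)

print(same_household(g, ("ip", "10.0.0.7")))
```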

Page 15: Not your dad's h base new

Titan - A Distributed Graph

Titan is a scalable graph database

Optimized for storing and querying graphs

Runs on top of:

Cassandra

HBase

DynamoDB

BerkeleyDB

Support for geo, numeric range, and full-text search via:

ElasticSearch

Solr

Supports Gremlin - a graph querying DSL via

TinkerPop Gremlin over HTTP
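Choosing a storage backend and a search backend in Titan is a matter of configuration. As a rough sketch, the helper below renders a properties file using Titan's `storage.*` and `index.<name>.*` option names. The host names and the index name `search` are placeholders, not values from the talk, and a real deployment would need further settings.

```python
def titan_properties(storage_backend, storage_hosts,
                     index_backend, index_hosts, index_name="search"):
    """Render a minimal Titan properties file as a string.

    For the HBase backend, storage.hostname points at the ZooKeeper quorum.
    """
    settings = {
        "storage.backend": storage_backend,
        "storage.hostname": ",".join(storage_hosts),
        f"index.{index_name}.backend": index_backend,
        f"index.{index_name}.hostname": ",".join(index_hosts),
    }
    return "\n".join(f"{key}={value}" for key, value in settings.items())


# Placeholder hosts: HBase for storage, Elasticsearch for geo and full-text search.
config_text = titan_properties(
    "hbase", ["zk1.example.com", "zk2.example.com"],
    "elasticsearch", ["es1.example.com"],
)
print(config_text)
```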

Page 16: Not your dad's h base new

Demo - Clash of the Titan

Page 17: Not your dad's h base new

Challenge #2: Testing Stage

HBase vs Cassandra benchmark + sanity check

Simulation for 1 billion Vertices

Sanity check: OK

Not much difference in loading time and querying time on both stores

HBase chosen because of the existing infrastructure

Retrospective: 1 billion Vertices on an empty graph didn’t really simulate anything.

Page 18: Not your dad's h base new

Challenge #2: POC Stage

Initializing an untuned HBase cluster on all 24 nodes of the existing cluster

Hosted side by side with MapReduce and Impala

Developing initial ontology for the largest data source together with a developer from the client application team

Developing a MapReduce job to load hundreds of GB a day according to the ontology

Page 19: Not your dad's h base new

POC Performance

Input data was stored in hourly directories, so at first we scheduled the MapReduce job for each hour.

An hour took about 40 minutes to process and load.

Later on, we scheduled the MapReduce job for a whole day at a time. Loading a whole day took about half a day.

Retrospective: Such long MapReduce jobs create new challenges: they hold lots of reducers for a long time and are not fun to re-run after cluster failures.
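The two scheduling options can be compared with simple arithmetic. The figures are the ones quoted on the slide; reading "about half a day" as 12 hours is an assumption made for this sketch.

```python
HOURS_PER_DAY = 24
MINUTES_PER_HOURLY_JOB = 40   # from the slide: one hour of data took ~40 minutes
DAILY_JOB_HOURS = 12          # assumption: "about half a day" read as 12 hours

# Total cluster time spent loading one day of data under each schedule.
hourly_schedule_hours = HOURS_PER_DAY * MINUTES_PER_HOURLY_JOB / 60
daily_schedule_hours = DAILY_JOB_HOURS

# Time left in the day for catching up after a failed or slow run.
hourly_headroom = HOURS_PER_DAY - hourly_schedule_hours
daily_headroom = HOURS_PER_DAY - daily_schedule_hours

print(f"hourly jobs: {hourly_schedule_hours:.0f} h busy, {hourly_headroom:.0f} h spare")
print(f"daily job:   {daily_schedule_hours:.0f} h busy, {daily_headroom:.0f} h spare")
```

Under these assumptions the daily job spends less total time loading, but a single failure puts up to 12 hours of work at risk instead of about 40 minutes, which is the trade-off the retrospective points to.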

Page 20: Not your dad's h base new

Performance Tuning

HBase didn't handle the load. The symptoms included:

HBase write-blocking compactions

Retired region servers

Tuning performed:

Region split size - split after 11 GB

Memstore flush size tuning

GC Tuning

Java heap size decreased from 32 to 16

Daily major compaction for the graph table

Retrospective: We had to statically partition to two different clusters: One for HBase, and one for everything else
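Some of the tuning steps above map onto standard `hbase-site.xml` properties. The sketch below renders such a fragment. Only the 11 GB split size comes from the slide. The memstore flush size is a placeholder, because the talk gives no number, and setting the major compaction interval to one day is just one possible way to get a daily major compaction (it applies cluster-wide, whereas the slide mentions the graph table only). Heap size and GC flags are set in the region server's JVM options and are not shown.

```python
import xml.etree.ElementTree as ET

GIB = 1024 ** 3
MIB = 1024 ** 2
DAY_MS = 24 * 60 * 60 * 1000


def hbase_site(properties):
    """Render a dict of HBase properties as an hbase-site.xml string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


tuning = {
    # From the slide: split regions after 11 GB.
    "hbase.hregion.max.filesize": 11 * GIB,
    # Placeholder value: the slide says the flush size was tuned but not to what.
    "hbase.hregion.memstore.flush.size": 256 * MIB,
    # Assumption: time-based major compaction once a day, for every table.
    "hbase.hregion.majorcompaction": DAY_MS,
}

site_xml = hbase_site(tuning)
print(site_xml)
```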

Page 21: Not your dad's h base new

Today

The main graph ingests:

~1.7 billion edges

~1.7 billion vertices

The main graph size is 20TB

20 region servers

Rebuilding the graph on average every 3 months for a new ontology

New data sources are added within a day by one (awesome) developer

Using a web-based UI tool for graph exploration

Retrospective: Titan on HBase works pretty well for those sizes
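For a sense of scale, the figures on this slide can be combined with the 11 GB split size from the tuning slide. This is back-of-the-envelope arithmetic only. It assumes binary units, an even spread of data across region servers, and that the 20 TB figure is the size of the table's store files; none of these is stated in the talk.

```python
import math

GRAPH_SIZE_TB = 20        # from the slide
REGION_SERVERS = 20       # from the slide
SPLIT_SIZE_GB = 11        # from the tuning slide
GB_PER_TB = 1024          # assumption: binary units

# Average data held by each region server if the load is spread evenly.
tb_per_server = GRAPH_SIZE_TB / REGION_SERVERS

# A region splits once it passes the split size, so the table needs
# at least this many regions to hold the data.
min_regions = math.ceil(GRAPH_SIZE_TB * GB_PER_TB / SPLIT_SIZE_GB)
min_regions_per_server = math.ceil(min_regions / REGION_SERVERS)

print(f"{tb_per_server:.1f} TB per region server")
print(f"at least {min_regions} regions, about {min_regions_per_server} per server")
```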

Page 22: Not your dad's h base new

Summary

HBase is a versatile datastore

Apache Phoenix modernises HBase with a semi-relational SQL layer

Titan provides powerful graph capabilities

Never be naive about Big Data tools; they will bite you, badly

Page 23: Not your dad's h base new

Next month:

Karel Alfonso - Apache Flink

Ned Shawa - Apache NiFi