page 1 |
Evolving a 1st Generation HBase Deployment to 2nd and Beyond
Doug Meil
Chief Software Architect, HBase Committer
HBaseCon 2013
May 10, 2015
page 2 |
Company Background
page 3 |
Comprehensive view of care including all venues of delivery representative of all major diseases, treatments, and demographics
14 integrated delivery networks with over 200 hospitals and 100,000 providers
$46 billion in care delivered annually by our network members
24 million truly unique patients
The Explorys Value Based Care Big Network
page 4 |
The Explorys Platform
Clinical: EMRs, claims, labs, registries, reported outcomes
Operational: provider org charts, practices, locations, departments, physical assets, and care workflow
Financial: private / payer claims, billing, patient accounting systems
PCP | Specialist | Hospital | Post-acute | Long term | Home | Mobile
Full view of the continuum of care & cost
Secure | Cost Effective | Ready Now
Start with Data Completeness: aggregation, patient matching, curation & attribution, data governance
Engines: profiling, risk analytics, prediction
Insight
page 5 |
Why HBase?
page 6 |
HBase at Explorys
Transactional Store
HBase is our transactional data store
e.g., Clinical and Administrative Data
Why? Flexible data model, operational scalability
General Store
Clinical Indexes for searching
Generated results like Measures and Registries
Why? Operational scalability, fast lookups
page 7 |
[Diagram: data flow through HBase]
Inputs: Source 1, Source 2, Source 3, Source 4
Numbered operations: 1 Loads (Puts), 2 Read (Scan), 3 Bulk-Load, 4 Multi-Get, 5 Impala
Diagram labels: Extract & Load (1), M/R “Late Binding” Transformation & Standardization, Generated Results / Indexes (3), MultiGet (4), Queries (5), Power Search, Patient Chart
Explorys Apps: Explore, Measure, Registry, Engage
High Level HBase Usage Overview
page 8 |
Functional Examples
page 9 |
NQF 0575 Example (Simple Example, Condensed)
Initial Population
Patients aged >= 17 and <= 74 before the start of the measurement period
Denominator
2 encounters (non-acute and outpatient) and an active diagnosis of diabetes
Or
Active meds indicative of diabetes
All within 2 years or during the measurement end-date
Exclusions
Things like active diagnosis of gestational diabetes will exclude patient from denominator
Numerator
Most recent HbA1c test < 8%
Measures Generated in MapReduce
Measure Calculations
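The condensed criteria above can be sketched for a single patient as follows. This is only an illustration under stated assumptions, not the production measure logic: the `Patient` record shape, its field names, and `evaluate` are hypothetical, "within 2 years" is read as a 730-day lookback ending on the measurement end date, and "most recent HbA1c" is taken as the latest result on or before that date.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Tuple

LOOKBACK = timedelta(days=730)  # assumption: "within 2 years" read as 730 days


@dataclass
class Patient:
    """Hypothetical, heavily simplified patient record."""
    age_at_period_start: int
    qualifying_encounters: List[date] = field(default_factory=list)  # non-acute / outpatient
    diabetes_dx_dates: List[date] = field(default_factory=list)
    diabetes_med_dates: List[date] = field(default_factory=list)
    gestational_dx_dates: List[date] = field(default_factory=list)
    hba1c_results: List[Tuple[date, float]] = field(default_factory=list)  # (date, percent)


def in_window(d: date, period_end: date) -> bool:
    """True if d falls in the lookback window ending on period_end."""
    return period_end - LOOKBACK <= d <= period_end


def evaluate(p: Patient, period_end: date) -> Tuple[bool, bool]:
    """Return (in_denominator, in_numerator) for one patient."""
    # Initial population: age bounds.
    if not 17 <= p.age_at_period_start <= 74:
        return (False, False)

    # Denominator: (2 encounters AND active diabetes diagnosis) OR diabetes meds.
    encounters = [d for d in p.qualifying_encounters if in_window(d, period_end)]
    has_dx = any(in_window(d, period_end) for d in p.diabetes_dx_dates)
    has_meds = any(in_window(d, period_end) for d in p.diabetes_med_dates)
    in_denominator = (len(encounters) >= 2 and has_dx) or has_meds

    # Exclusions: e.g. gestational diabetes removes the patient.
    if any(in_window(d, period_end) for d in p.gestational_dx_dates):
        in_denominator = False
    if not in_denominator:
        return (False, False)

    # Numerator: most recent HbA1c result is below 8%.
    results = [r for r in p.hba1c_results if r[0] <= period_end]
    if not results:
        return (True, False)
    latest = max(results, key=lambda r: r[0])
    return (True, latest[1] < 8.0)
```

In a MapReduce setting, logic of this shape would run once per patient per measure, which is where the volume of generated results comes from.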
page 10 |
Measure Results Generated to HBase
Results by Measure, Attributed Provider, Patient, Reporting Window … generated to HBase
Lots of Generated Data
Hundreds of Measures Generate Hundreds of Millions of Measure Results Per Day
Measure Generated Data
page 11 |
Heart Failure Functional Example
No evidence of Myocardial Infarction THEN a prescription for Angiotensin-converting enzyme (ACE) inhibitor agent THEN Myocardial Infarction within one year
C. Diff. Infection Functional Example
Ambulatory Encounter THEN an Inpatient Encounter THEN evidence of C. Diff. infection within 10 days THEN an Ambulatory Encounter within 30 days
Summary
NoSQL works well as the backend for these kinds of “queries” because satisfying them takes complex logic.
PowerSearch
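To show why "A THEN B within N days" chains need procedural logic rather than a simple lookup, here is a minimal sketch of a sequence matcher over one patient's events. Everything in it is an assumption for illustration: the event kinds, the `(kind, max_days)` step format, and the reading of THEN as "on a strictly later date". It does not cover negative conditions such as "no evidence of Myocardial Infarction".

```python
from datetime import date
from typing import List, Optional, Sequence, Tuple

Event = Tuple[date, str]           # (when, kind)
Step = Tuple[str, Optional[int]]   # (kind, max days after the previous step, or None)


def matches(events: Sequence[Event], steps: Sequence[Step]) -> bool:
    """True if the events contain the steps in order, respecting day limits."""
    ordered: List[Event] = sorted(events)

    def search(step_idx: int, start: int, prev_date: Optional[date]) -> bool:
        if step_idx == len(steps):
            return True  # every step has been satisfied
        kind, max_days = steps[step_idx]
        for i in range(start, len(ordered)):
            when, what = ordered[i]
            if what != kind:
                continue
            if prev_date is not None:
                gap = (when - prev_date).days
                if gap <= 0:
                    continue  # assumption: THEN means a strictly later date
                if max_days is not None and gap > max_days:
                    break  # events are sorted, so later ones are further away
            # Try this event for the current step; backtrack if the rest fails.
            if search(step_idx + 1, i + 1, when):
                return True
        return False

    return search(0, 0, None)


# The C. Diff. example expressed as steps (kind names are hypothetical).
CDIFF_STEPS: List[Step] = [
    ("ambulatory", None),
    ("inpatient", None),
    ("cdiff", 10),
    ("ambulatory", 30),
]
```

The backtracking matters: a patient with two inpatient stays may only satisfy the 10-day limit relative to the second one.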
page 12 |
Technical Details
page 13 |
Distro
CDH4.2.1
Hadoop Knobs
HDFS local read shortcut on
HDFS drop-behind reads and read-ahead on
Snappy for MR temp files
Read-ahead for MR temp files
MR heartbeat on task finish
Cluster Information
page 14 |
HBase Knobs
We pre-split our tables
We use KeyPrefixRegionSplitPolicy
Snappy CF compression
HLog compression on
Region size still 2-3 GB (we’ve tested bigger, but staying here for now)
HBase Knobs Under Consideration
HBase checksumming: currently off, but will probably turn on
FAST_DIFF encoding: currently not in use, but will probably use for lookup tables
Cluster Information
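As an illustration of pre-splitting on a fixed-length key prefix, the sketch below computes evenly spaced split points and shows which region a key would land in. It assumes prefixes are spread roughly uniformly over the byte space (for example, hashed), which the slides do not state; `split_points` and `region_for` are hypothetical helper names, and the resulting byte strings are the kind of values one would hand to HBase as split keys at table creation.

```python
from bisect import bisect_right
from typing import List


def split_points(num_regions: int, prefix_len: int = 2) -> List[bytes]:
    """Evenly spaced split keys over a fixed-length prefix space."""
    space = 256 ** prefix_len
    if not 1 <= num_regions <= space:
        raise ValueError("num_regions must be between 1 and 256**prefix_len")
    # N regions need N - 1 boundaries.
    return [
        (space * i // num_regions).to_bytes(prefix_len, "big")
        for i in range(1, num_regions)
    ]


def region_for(key: bytes, splits: List[bytes]) -> int:
    """Index of the region whose [start, end) byte range contains key."""
    return bisect_right(splits, key)
```

Because region boundaries fall on whole prefixes, every row sharing a prefix stays in one region, which is the grouping behavior KeyPrefixRegionSplitPolicy is meant to preserve when regions split later.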
page 15 |
Compression (HDFS and HBase)
LZO -> Snappy
HBase Key Redesign
Our initial HBase RowKeys were too beefy and too Stringy.
• Refactored to be tighter.
Column names a bit too descriptive initially
Changes related to the new KeyPrefixRegionSplitPolicy.
HBase Table Management
We have a layer of metadata around our MR jobs and apps and re-create our tables from time to time, which makes schema changes easier.
What Have We Changed?
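To make the "beefy and Stringy" point concrete, here is a small comparison of a descriptive string key with a packed binary key. The key fields (organization, patient, record type, timestamp) and their widths are invented for illustration; the actual Explorys key layout is not given in the slides.

```python
import struct
from typing import Tuple


def stringy_key(org_id: int, patient_id: int, record_type: str, ts_millis: int) -> bytes:
    """Descriptive, human-readable key: easy to debug, expensive to store."""
    return f"org={org_id}|patient={patient_id}|type={record_type}|ts={ts_millis}".encode("utf-8")


# Big-endian, no padding: 2-byte org, 8-byte patient, 1-byte type code, 8-byte timestamp.
_KEY = struct.Struct(">HQBQ")


def packed_key(org_id: int, patient_id: int, type_code: int, ts_millis: int) -> bytes:
    """Fixed-width 19-byte key; big-endian unsigned fields sort in numeric order."""
    return _KEY.pack(org_id, patient_id, type_code, ts_millis)


def unpack_key(key: bytes) -> Tuple[int, int, int, int]:
    """Recover (org_id, patient_id, type_code, ts_millis) from a packed key."""
    return _KEY.unpack(key)
```

Since the row key is stored with every cell, shaving bytes off it pays back across the whole table, and the same reasoning applies to short column names. A fixed-width leading field like the hypothetical 2-byte org id is also the sort of prefix a prefix-based split policy can group on.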
page 16 |
HBase Loading
Index tables loaded with bulk-loading
Experimented with WAL off and deferred log flushing, but bulk-loading is better.
HBase Gets
When we started, multi-Get didn’t even exist in HBase!
This feature was very much appreciated; our DAO layer was modified to accept batch requests.
• Minimizing RPCs makes a difference.
SQL?
Impala against HBase for internal data investigation
What Have We Changed?
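The effect of a batch-aware DAO on round trips can be sketched with an in-memory stand-in for the table. `FakeTable` and `BatchingDao` are hypothetical names, not the Explorys DAO or the HBase client API, and counting each multi-get as one round trip is a simplification (a real multi-get may fan out to several region servers).

```python
from typing import Dict, Iterable, List, Optional


class FakeTable:
    """In-memory stand-in for a remote table; counts simulated round trips."""

    def __init__(self, rows: Dict[bytes, bytes]) -> None:
        self._rows = dict(rows)
        self.rpc_count = 0

    def get(self, key: bytes) -> Optional[bytes]:
        self.rpc_count += 1  # one round trip per single get
        return self._rows.get(key)

    def multi_get(self, keys: List[bytes]) -> List[Optional[bytes]]:
        self.rpc_count += 1  # simplification: one round trip per batch
        return [self._rows.get(k) for k in keys]


class BatchingDao:
    """DAO that accepts many keys and issues chunked multi-gets."""

    def __init__(self, table: FakeTable, batch_size: int = 100) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be positive")
        self._table = table
        self._batch_size = batch_size

    def get_many(self, keys: Iterable[bytes]) -> Dict[bytes, Optional[bytes]]:
        keys = list(keys)
        found: Dict[bytes, Optional[bytes]] = {}
        for start in range(0, len(keys), self._batch_size):
            chunk = keys[start:start + self._batch_size]
            for key, value in zip(chunk, self._table.multi_get(chunk)):
                found[key] = value
        return found
```

With 250 keys and a batch size of 100, the DAO makes 3 simulated round trips where a loop of single gets would make 250.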
page 17 |
Data Browsers
We’ve built our own data browser for data inspection, and continue to add to it.
This isn’t going away any time soon and is highly used.
Also kind of necessary if you store complex objects in HBase.
HBase Filters
We have some. Didn’t initially, but they have proven quite useful.
Things We’ve Built