1• 800.593.4467 • [email protected] The Big Data Quadfecta Brian O’Neill Lead Architect, Health Market Science @boneill42, [email protected]
Big data philly_jug

May 10, 2015

Big Data Overview and Cassandra Deep Dive for the Philly JUG
Transcript
Page 1: Big data philly_jug

1•800.593.4467 • info@healthmarketscience.com

The Big Data Quadfecta

Brian O’Neill
Lead Architect, Health Market Science
@boneill42, [email protected]

Page 2: Big data philly_jug

Quadfecta?

1. Quadfecta

• A legendary beirut/beer pong shot that lands on the tops of four cups simultaneously. Considered the rarest shot in the game, topping even the trifecta, 2-cup knockover-and-sink, and simultaneous 6-cup game-ending double bounce-in.

• Kafka
• Storm
• Elastic Search
• Cassandra

http://www.flickr.com/photos/yogma/3584984540/

http://www.urbandictionary.com/define.php?term=quadfecta

Page 3: Big data philly_jug

Hold on Tight

http://www.flickr.com/photos/aspexdesign/7817329758/

Page 4: Big data philly_jug

3 V’s

Volume Variety Velocity

http://www.flickr.com/photos/20989942@N00/373985217/

http://www.flickr.com/photos/rhruzek/4071408305/

http://www.flickr.com/photos/adriansalgado/5310969147/

Page 5: Big data philly_jug

The Use Case

Page 6: Big data philly_jug

Our Mission

Prescriber eligibility and remediation

Eliminate fraud, waste and abuse

Insights into the healthcare space

Page 7: Big data philly_jug

The Business

Master Data Solutions: Health Care Provider & Facilities

Variety/Velocity
• >l2000 of sources
• 6 Million unique HCPs
• 10+ years history

Data Challenges
• Constant change in real world data
• Conflicting & partial info
• Frequent changes to source structure
• Authoritative sources vs. crowdsource
• Predicting source quality

Medical Claims Data: Medical Procedures & Diagnosis

Volume/Velocity
• ~1B claims annually
• +5B records annually
• 5+ years history

Data Challenges
• Sources have incomplete capture
• Overlapping source data
• Statistical projections & biases
• Social media type relationships

Business Solutions
• CompleteView, Expense Manager, CompleteSpend
• Prescriber Eligibility/Remediation
• Analytics (Influencer Networks)

Page 8: Big data philly_jug

Our Solutions

Business Needs: Finance & Legal, Business Systems, Compliance, Sales & Marketing

Solutions: Provider Data Compliance; Data Assessment, Integration & Enrichment Services; Market Intelligence

HMS Authoritative Sources: PDC, Federal, State, Medical Claims, Web Derived

Advanced Technology: Master Data Management

Page 9: Big data philly_jug

Datacenter

Hundreds of Machines

1.5 Petabytes of raw storage

Virtualized (VMware)

On a SAN

Should we go physical???

Page 10: Big data philly_jug

Under the Hood

Visualization

Dashboard / Reports

Structured Storage

Relational

Indexing

Flexible Storage

NoSQL Graph(s)

Interfacing

Web Services

Distributed Processing

Standardize

Validate

Match

Consolidate

Analytics

Data Sources

Government

Web

Customer

I’m happy

User Interface

Page 11: Big data philly_jug

Master Data Management

Harvested

Government

Private

Schema Change!

Page 12: Big data philly_jug

The Design

Page 13: Big data philly_jug

System of Record

Flexibility (Variety)

Scalability (Velocity + Volume)

Page 14: Big data philly_jug

Deep Dive

www.history.navy.mil/museums/seabee_museum.htm

Page 15: Big data philly_jug

Installation

As easy as…

Download
http://cassandra.apache.org/download/

Uncompress
tar -xvzf apache-cassandra-1.2.0-beta3-bin.tar.gz

Run
bin/cassandra -f
(-f puts it in foreground)

Page 16: Big data philly_jug

Data Model

Schema (a.k.a. Keyspace)

Table (a.k.a. Column Family)

Row
• Has arbitrary #’s of columns
• Validator for keys (e.g. UTF8Type)

Column
• Validator for values and keys
• Comparator for keys (e.g. DateType or BYOC)

(http://www.youtube.com/watch?v=bKfND4woylw)

Page 17: Big data philly_jug

Distributed Architecture

Nodes form a token ring.

Nodes partition the ring by initial token
• initial_token: (in cassandra.yaml)

Partitioners map row keys to tokens.
• Usually randomly, to evenly distribute the data

All columns for a row are stored together on disk in sorted order.

Page 18: Big data philly_jug

Visually

Token/Hash Range: 0-99

Nodes (token ranges): A (67-0), B (1-33), C (34-66)

Row     Hash
Alice   50
Bob     3
Eve     15
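To make the ring concrete, here is a minimal Java sketch of the lookup this slide illustrates. It assumes the toy 0-99 token space and takes each row's token as given (Alice 50, Bob 3, Eve 15) instead of hashing the key; `TokenRing` and `ownerOf` are illustrative names, not Cassandra APIs.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of the three-node ring above (hypothetical class, not Cassandra code).
class TokenRing {
    // Each node is registered under the highest token it owns.
    private static final TreeMap<Integer, String> RING = new TreeMap<>();
    static {
        RING.put(0, "A");   // A owns 67-99 and wraps around to 0
        RING.put(33, "B");  // B owns 1-33
        RING.put(66, "C");  // C owns 34-66
    }

    // Walk clockwise to the first node token >= the row's token;
    // if there is none, wrap around to the start of the ring.
    static String ownerOf(int token) {
        Map.Entry<Integer, String> e = RING.ceilingEntry(token);
        return (e != null ? e : RING.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        System.out.println("Alice (50) -> " + ownerOf(50)); // C
        System.out.println("Bob   (3)  -> " + ownerOf(3));  // B
        System.out.println("Eve   (15) -> " + ownerOf(15)); // B
    }
}
```

With only three rows, two land on B and none on A, which is why a real partitioner spreads keys by hashing them over a much larger token space.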

Page 19: Big data philly_jug

Java Interpretation

Each table is a Distributed HashMap.
Each row is a SortedMap.
Each column is an entry in the SortedMap.

Cassandra provides a massively scalable version of:
HashMap<rowKey, SortedMap<columnKey, columnValue>>

Implications:
• Direct row fetch is fast.
• Searching a range of rows can be costly.
• Searching a range of columns is cheap.
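The map-of-sorted-maps analogy can be written out in plain Java. This is a single-JVM sketch only, assuming String keys and values; `TableAsMap` and its methods are hypothetical names used for illustration, and nothing here is distributed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// In-memory stand-in for one table: rowKey -> (columnKey -> columnValue).
class TableAsMap {
    private final Map<String, SortedMap<String, String>> rows = new HashMap<>();

    void put(String rowKey, String columnKey, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(columnKey, value);
    }

    // Direct row fetch: one hash lookup.
    SortedMap<String, String> row(String rowKey) {
        return rows.getOrDefault(rowKey, new TreeMap<>());
    }

    // Column range within a row: a view over the sorted map, [from, to).
    SortedMap<String, String> slice(String rowKey, String from, String to) {
        return row(rowKey).subMap(from, to);
    }

    public static void main(String[] args) {
        TableAsMap table = new TableAsMap();
        table.put("frodo", "2012-10-01", "left the shire");
        table.put("frodo", "2012-10-02", "met strider");
        table.put("frodo", "2012-11-01", "reached rivendell");
        // All of frodo's October columns, in sorted order.
        System.out.println(table.slice("frodo", "2012-10", "2012-11"));
    }
}
```

A range of rows has no equivalent cheap operation here: the outer HashMap is unordered, so it would mean scanning every key, which mirrors the "can be costly" implication above.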

Page 20: Big data philly_jug

The World-Wide Globally Scalable Naughty List!

How about a Naughty and Nice list for Santa?

1.9 billion children
• That will fit in a single row!

Queries to support:
• Children can log in and check their standing.
• Santa can find nice children by country, state or zip.
• Toy lists for every child in the world.

Page 21: Big data philly_jug

Two Tables

Children Table
• Store all the children in the world.
• One row per child.
• One column per attribute.

NaughtyOrNice Table
• Supports the queries we anticipate
• Wide-Row Strategy

Page 22: Big data philly_jug

Details of the NaughtyOrNice List

One row per standing:country
• Ensures all children in a country are grouped together on disk.

One column per child using a compound key
• Ensures the columns are sorted to support our search at varying levels of granularity
• e.g. All nice children in the US.
• e.g. All naughty children in PA.

Page 23: Big data philly_jug

Visually

Nodes: Node 1, Node 2, Node 3

Nice:USA
• CA:94333:johny.b.good
• CA:94333:richie.rich

Nice:IRL
• D:EI33:collin.oneill
• D:EI33:owen.oneill

Naughty:USA
• CA:94111:bart.simpson
• CA:94222:dennis.menace
• PA:18964:michael.myers

Watch out for:
• Hot spotting
• Unbalanced Clusters

(1) Go to the row.
(2) Get the column slice.
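The two-step lookup (go to the row, then take a column slice) can be sketched in Java with a sorted map per row. This is an in-memory illustration under stated assumptions: compound column keys are plain "state:zip:child" strings, a slice is a prefix range over them, and `NaughtyOrNice`, `add`, and `find` are hypothetical names, not a Cassandra client API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// In-memory model of the wide-row table: one row per standing:country.
class NaughtyOrNice {
    private final Map<String, SortedMap<String, String>> rows = new HashMap<>();

    void add(String standing, String country, String state, String zip, String child) {
        String rowKey = standing + ":" + country;
        String columnKey = state + ":" + zip + ":" + child; // compound key
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(columnKey, "");
    }

    // (1) Go to the row. (2) Get the slice of columns starting with the prefix.
    // An empty prefix returns every child in the row.
    List<String> find(String standing, String country, String prefix) {
        SortedMap<String, String> row = rows.get(standing + ":" + country);
        if (row == null) {
            return new ArrayList<>();
        }
        // Upper bound: the prefix followed by the largest char value.
        String end = prefix + Character.MAX_VALUE;
        return new ArrayList<>(row.subMap(prefix, end).keySet());
    }

    public static void main(String[] args) {
        NaughtyOrNice list = new NaughtyOrNice();
        list.add("Naughty", "USA", "CA", "94111", "bart.simpson");
        list.add("Naughty", "USA", "CA", "94222", "dennis.menace");
        list.add("Naughty", "USA", "PA", "18964", "michael.myers");
        System.out.println(list.find("Naughty", "USA", "PA:")); // naughty children in PA
        System.out.println(list.find("Naughty", "USA", ""));    // all naughty children in the US
    }
}
```

Because the state comes first in the compound key, state-level and zip-level searches are both contiguous ranges; a search by child name alone would not be.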

Page 24: Big data philly_jug

What about the toys?

No problem. We’re in a NoSQL store. =)

Let’s just add a column.

Page 25: Big data philly_jug

CQL Collections!

http://www.datastax.com/dev/blog/cql3_collections

Set
UPDATE users SET emails = emails + {'[email protected]'} WHERE user_id = 'frodo';

List
UPDATE users SET top_places = [ 'the shire' ] + top_places WHERE user_id = 'frodo';

Maps
UPDATE users SET todo['2012-10-2 12:10'] = 'die' WHERE user_id = 'frodo';

Page 26: Big data philly_jug

Let’s Crank a Bit...

Page 27: Big data philly_jug

Let’s code!

What API should we use?

API              Production-Readiness   Potential   Momentum
Thrift           10                     -1          -1
Hector           10                     8           8
Astyanax         8                      9           10
Kundera (JPA)    6                      9           9
Pelops           7                      6           7
Firebrand        8                      9           8
PlayORM          5                      8           7
GORA             6                      9           7
CQL Driver       8                      10          10

IMHO!

Astyanax + CQL FTW!

Page 28: Big data philly_jug

Coming up for air...

http://www.flickr.com/photos/64738468@N00/7184463727/

Page 29: Big data philly_jug

But continuing at warp speed...

http://www.flickr.com/photos/19942094@N00/4937185452/lightbox/

Page 30: Big data philly_jug

Primitives of Distributed Processing

emit/process(tuple(…))

map<key<map<[], value>>

pop(push(v))

index(field, type)

Kafka

Page 31: Big data philly_jug

What we did wrong…

Could not react to transactional changes

Needed extra logic to track what changed

Took too long

Page 32: Big data philly_jug

What we did wrong… (II)

AOP-based triggers
• Worked well initially.
• Business Processes captured as side-effects.

Page 33: Big data philly_jug

Design Principles

Patterns
• Idempotent Operations
    • Elegantly handle replay
• Immutable data
    • Assertions of facts over time

Anti-Patterns
• Transactions / Locking
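One way to read these two patterns together is the Java sketch below. It assumes each fact is keyed by entity and timestamp and is never updated in place, so replaying the same message is a no-op and no locking is needed to make the replay safe. `FactStore` and `assertFact` are hypothetical names, not code from the talk.

```java
import java.util.Map;
import java.util.TreeMap;

// Append-only store of facts asserted over time (illustrative only).
class FactStore {
    // Key: entity@timestamp. A fact, once asserted, is never overwritten.
    private final Map<String, String> facts = new TreeMap<>();

    // Idempotent: asserting the same fact again leaves the store unchanged.
    void assertFact(String entity, long timestamp, String value) {
        facts.putIfAbsent(entity + "@" + timestamp, value);
    }

    String get(String entity, long timestamp) {
        return facts.get(entity + "@" + timestamp);
    }

    int size() {
        return facts.size();
    }

    public static void main(String[] args) {
        FactStore store = new FactStore();
        store.assertFact("provider-1", 1L, "licensed");
        store.assertFact("provider-1", 1L, "licensed"); // replayed message
        store.assertFact("provider-1", 2L, "suspended"); // a new fact, not an update
        System.out.println(store.size() + " facts stored");
    }
}
```

A change in the real world becomes a new assertion at a later timestamp, so the history stays intact and a replayed stream converges to the same state.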

Page 34: Big data philly_jug

What we did right.

REST APIs for Loose Coupling

See Virgil: https://github.com/hmsonline/virgil

But really… watch out for Intravert: https://github.com/zznate/intravert-ug

Page 35: Big data philly_jug

Kafka

• Millions of Messages
• Replay Enabled
• No transactions / Lightning Fast

Page 36: Big data philly_jug

Elastic Search

• Edit Distance / Soundex
• Native Scalability
• Fuzzy Search
• Geospatial
• Facets

Page 37: Big data philly_jug

Storm

• Guaranteed once semantics
• Well-designed processing abstraction
• Beats BYODP
• Momentum

Page 38: Big data philly_jug

The System

[Diagram labels: REST API; Kafka Queue(s); Offset; C*; A, B, C; ES1, ES2; ElasticSearch]

NP. We can route around it.

NP. Replication Factor > 1.

NP. Rewind!
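The "Rewind!" answer depends on the queue being a log that consumers read by position. The Java sketch below illustrates that idea only, assuming an in-memory list and an integer offset; `ReplayableLog` and `readFrom` are hypothetical names and this is not the Kafka client API.

```java
import java.util.ArrayList;
import java.util.List;

// Append-only log read by offset (illustrative stand-in for a queue).
class ReplayableLog {
    private final List<String> messages = new ArrayList<>();

    void append(String message) {
        messages.add(message);
    }

    // Reading does not remove anything, so any earlier offset can be re-read.
    List<String> readFrom(int offset) {
        return new ArrayList<>(messages.subList(offset, messages.size()));
    }

    public static void main(String[] args) {
        ReplayableLog log = new ReplayableLog();
        log.append("m0");
        log.append("m1");
        log.append("m2");

        int offset = 2; // consumer believes it has processed m0 and m1
        System.out.println("normal read: " + log.readFrom(offset));

        offset = 0; // downstream store was lost: rewind and reprocess
        System.out.println("after rewind: " + log.readFrom(offset));
    }
}
```

Rewinding is only safe when the downstream writes are idempotent, since every message after the restored offset is processed a second time.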

Page 39: Big data philly_jug

Next Steps

Page 40: Big data philly_jug

Shameless Shoutouts

HMS (https://github.com/hmsonline/)
• storm-cassandra
• storm-elastic-search
• storm-jdbi (coming soon)

ptgoetz (https://github.com/ptgoetz)
• storm-jms
• storm-signals

Page 41: Big data philly_jug

The Team

We’re hiring!