Redis and Bloom Filters - Atlanta Java Users Group 9/2014

Failing Fast with Redis-backed Bloom Filters

• Christopher Curtin

• Head of Technical Research

• @ChrisCurtin

About Me

25+ years in technology

Head of Technical Research at Silverpop, an IBM Company (14+ years at Silverpop)

Built a SaaS platform before the term ‘SaaS’ was being used

Prior to Silverpop: real-time control systems, factory automation and warehouse management

Always looking for technologies and algorithms to help with our challenges

Silverpop Open Positions

Technical Lead

Senior Engineer

Architect

Automation Engineers

Agenda

Redis

Bloom Filters

Failing Fast

Agenda: Redis

What it is

Why we started looking at using it

Basics

Concurrency

Operational Considerations

Challenges

Redis – What is it?

From redis.io:

"Redis is an open source, BSD licensed, advanced key-value cache and store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets, sorted sets, bitmaps and hyperloglogs."

Hyper-what-what?

HyperLogLog

Approximation technique for counting distinct entries in a set.

Very small memory footprint for rough approximations (Redis uses at most 12 KB per key, with a standard error of about 0.81%)

Nice – but too much loss for what we need

Features

• Unlike typical key-value stores, you can send commands that edit the value on the server, instead of reading it back to the client, updating it and pushing it to the server again

• Pub/sub

• TTL on keys

• Clustering and automatic fail-over

• Lua scripting

• Client libraries for just about any language you can think of

So why did we start looking at NoSQL?

“For the cost of an Oracle Enterprise license I can give you 64 cores and 3 TB of memory”

Redis Basics

In-memory-only key-value store

Single threaded. Yes, single threaded

No paging, no reading from disk

CS 101 data structures and operations

Tens of millions of keys aren't a big deal

The amount of RAM defines how big the store can get

Basic Data Types

Strings

Hashes

Lists

Sets and Sorted Sets

CS 101 ...

Hashes

- collection of key-value pairs with a single name

- useful for storing data under a common name

- values can only be strings or numbers; no hash of lists

http://redis.io/commands/hget

Sets and Sorted Sets

Buckets of values with very fast membership look-up

No duplicates allowed

Sorted Sets have scores to make them sortable

– Automatically keeps them in order for fast 'top x' look ups

http://redis.io/commands/zadd

http://redis.io/commands/zrange

Lists

Most interesting due to how operations are applied to the remote store

Unbounded (except by memory)

Atomic operations between lists (pop from one, push to another)

CS 101: lpush, rpush, lpop, lrange etc.

Advanced: blocking pops

http://redis.io/commands/rpush

http://redis.io/commands/rpoplpush

Concurrency

Single threaded

Each command runs atomically, whether it touches one key or several

Pipelines send several commands to the server in a single request; they execute in order, but commands from other clients can be interleaved unless the batch is wrapped in MULTI/EXEC

Pipelines do not allow for logic between commands

Lua scripts allow for logic between commands

BE CAREFUL with Lua: a running script blocks all other clients!

Pipeline Java Example

BloomFilterRedis.java line 43

Lua Example

Lua-scripts example

Operational Information

Persistence can be 'none', journal (AOF) or point in time (RDB)

Optional Master/Slave replication

Home-grown HA platform (Sentinel)

Common deployment model is lots of instances per machine

Millions of keys gets hard to manage – build 'directory' hashes to make it easier for operations to find keys to look at

Challenges with Redis

Key explosion – single namespace

Lua scripts can block all other users

Large pipelines and MULTI/EXEC transactions can delay all other users

No nested data types (I want a hash of lists!)

Without namespaces, be careful how you define key names

Concurrency Demo – JMS replacement

Client submits a request to the queue (LPUSH)

Consumer application polls for work when worker is available (RPOPLPUSH)

Worker executes the task assigned to it

When worker is done, its list is removed

Lather, Rinse, Repeat

(We provide a hash of workers for Operations to query for monitoring)
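The queue hand-off above can be sketched without a Redis server. The class below is only an in-memory stand-in for the three list commands involved (LPUSH, RPOPLPUSH, DEL); the class name and the key names `jobs` and `worker:1` are hypothetical, and a real deployment would issue the same commands through a Redis client.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for the Redis list commands the demo relies on (not a Redis client). */
final class InMemoryLists {
    private final Map<String, Deque<String>> lists = new HashMap<>();

    private Deque<String> list(String key) {
        return lists.computeIfAbsent(key, k -> new ArrayDeque<>());
    }

    /** LPUSH: add a value at the head of the list. */
    synchronized void lpush(String key, String value) {
        list(key).addFirst(value);
    }

    /** RPOPLPUSH: move the tail of src to the head of dst in one step; null if src is empty. */
    synchronized String rpoplpush(String src, String dst) {
        Deque<String> source = lists.get(src);
        String value = (source == null) ? null : source.pollLast();
        if (value != null) {
            list(dst).addFirst(value);
        }
        return value;
    }

    /** DEL: remove the whole list, as done when a worker finishes its task. */
    synchronized void del(String key) {
        lists.remove(key);
    }

    /** LLEN: number of entries in the list (0 if it does not exist). */
    synchronized int llen(String key) {
        Deque<String> d = lists.get(key);
        return d == null ? 0 : d.size();
    }
}
```

A client would call `lpush("jobs", "job-1")`, a free worker would claim work with `rpoplpush("jobs", "worker:1")`, and `del("worker:1")` marks the task as done. Because the claim is a single step, a job is always in exactly one list, so a crashed worker leaves its task visible in its own list.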

Agenda: Bloom Filters

What they are

Why we started looking at using them

Basics

False Positives

Example Uses

Why not do this in a database?

Bloom Filters

From Wikipedia (don't tell my kid's teacher!)

"A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate"

Hashing

Apply 'x' hash functions to the key to be stored/queried

Each function returns the position of a bit to set in the bitset

Equations relate the size of the bitset and the number of hash functions to the number of keys and your acceptable error level

http://hur.st/bloomfilter?n=4&p=1.0E-20
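As a rough illustration of those equations, the sketch below uses the standard sizing formulas: m = -n ln(p) / (ln 2)^2 bits for n expected keys at false-positive probability p, and k = (m / n) ln 2 hash functions. The class and method names are made up for this example.

```java
/** Standard Bloom filter sizing formulas; names are illustrative only. */
final class BloomSizing {
    private BloomSizing() {}

    /** Bits needed for n expected keys at false-positive probability p. */
    static long optimalBits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    /** Number of hash functions for m bits and n expected keys (at least 1). */
    static int optimalHashes(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }
}
```

For one million keys at a 1% error level this works out to roughly 9.6 million bits (about 1.2 MB) and 7 hash functions.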

Example

False Positives

Perfect hash functions aren't worth the cost to develop

Sometimes all the bits for a key have already been set by other keys

Make sure you understand the business impact of a false positive

Remember, never a false negative
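A minimal in-memory sketch shows why there is never a false negative: a lookup says "no" only when at least one of the key's bits is clear, and adding a key sets all of them. The class name is hypothetical, and the hashing here (CRC32 combined with String.hashCode via double hashing) is chosen only to keep the example dependency-free; a production filter would use stronger hash functions.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.zip.CRC32;

/** Illustrative single-JVM Bloom filter; not the Orestes implementation. */
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;    // number of bits in the bitset
    private final int hashes;  // number of hash functions ('x' on the Hashing slide)

    SimpleBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    /** i-th bit position for a key, derived from two base hashes (double hashing). */
    private int position(String key, int i) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        long h1 = crc.getValue();
        long h2 = key.hashCode() | 1; // force an odd step
        return (int) Math.floorMod(h1 + i * h2, (long) size);
    }

    /** Sets every bit for the key. */
    void add(String key) {
        for (int i = 0; i < hashes; i++) {
            bits.set(position(key, i));
        }
    }

    /** false means "definitely not added"; true means "probably added". */
    boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A `true` answer can still be wrong when other keys happen to have set the same bits, which is the false positive whose business impact needs to be understood.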

Creation

Libraries are available for every language I looked up (even JavaScript)

Some are built in memory, for a single process/JVM to use

Read-only filters (ad networks) are built using Hadoop and loaded into memory

In memory is great for lots of reads, single process/JVM etc.

But ...

Updates

Updating a 16 MB structure in memory and persisting to disk is expensive

8 bits change and you write 16 MB!!!!!! (DBAs will love you …)

Deletes

Not possible in a regular Bloom Filter – how would you know what bits are used by other keys?

Counting Bloom Filters keep a small counter (3-4 bits) in place of each bit in the bitmap; 'delete' decrements the counters for the key

Not as space friendly any more …

Instead, consider having bloom filters based around the lifetime of the data to be queried

– For a filter 'visited in the last 4 hours' have 4 filters and age the oldest out (TTL in Redis maybe ...)
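For comparison with the lifetime-based approach, here is a minimal sketch of a counting Bloom filter. It assumes 4-bit counters (stored one per byte for simplicity, capped at 15) and the same illustrative CRC32 and String.hashCode double hashing; the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Illustrative counting Bloom filter: one small counter per position instead of one bit. */
final class CountingBloomFilter {
    private static final int MAX = 15; // ceiling of a 4-bit counter
    private final byte[] counters;
    private final int hashes;

    CountingBloomFilter(int size, int hashes) {
        this.counters = new byte[size];
        this.hashes = hashes;
    }

    private int position(String key, int i) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        long h1 = crc.getValue();
        long h2 = key.hashCode() | 1; // force an odd step
        return (int) Math.floorMod(h1 + i * h2, (long) counters.length);
    }

    /** Increments each counter for the key; saturated counters stay at MAX. */
    void add(String key) {
        for (int i = 0; i < hashes; i++) {
            int p = position(key, i);
            if (counters[p] < MAX) {
                counters[p]++;
            }
        }
    }

    /** Decrements each counter for the key. Only call this for keys that were added. */
    void remove(String key) {
        for (int i = 0; i < hashes; i++) {
            int p = position(key, i);
            // A saturated counter has lost its true count, so it is left untouched.
            if (counters[p] > 0 && counters[p] < MAX) {
                counters[p]--;
            }
        }
    }

    boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (counters[position(key, i)] == 0) {
                return false;
            }
        }
        return true;
    }
}
```

The space cost is visible here: each position needs a counter rather than a single bit, which is why ageing out whole filters is often the cheaper choice.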

Issue: Persistence

Load a 16 MB filter from database to check 6 bits?

Worse: update 6 bits in a 16 MB filter

DBAs will not be happy

– Undo/redo

– SGA misses, page faults

– Backups, replication traffic etc.

Why were we interested in Bloom Filters?

Found a lot of places where we went to the database only to find the data didn't exist

Found lots of places where we want to know if a user DIDN'T do something

Persistent Bloom Filters

We needed persistent Bloom Filters for lots of user stories

Found Orestes-BloomFilter on GitHub that used Redis as a store and enhanced it

Added population filters

Fixed a few bugs

Did a pull request and it was accepted!

Benefits

Filters are stored in Redis

• Only SETBIT/GETBIT calls to the server

Reads and updates of the filter from a set of application servers

Persistence has a cost, but a fraction of the RDBMS costs

Can load a BF created offline and begin using it

Remember “For the cost of an Oracle License”

Thousands of filters

Dozens of Redis instances

TTL on a Redis key makes cleanup of old filters trivial

Population Bloom Filters

Unique need we had

Users access the system frequently, but I really only need to count them once per month for billing

Tens of thousands of clients; Finance wants the monthly report in seconds

Logic is simple: if any bits weren't set for the key (user id), increment the counter

Note: there are mathematical methods of estimating a BF population, but we needed a better error rate
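The counting rule can be sketched in memory as follows. This is an illustration of the logic only, not the Orestes code: the class name is hypothetical, the bits live in a local BitSet rather than in Redis, and the hashing is the same dependency-free CRC32 and String.hashCode combination used for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.zip.CRC32;

/** Illustrative population filter: counts a key the first time any of its bits is newly set. */
final class PopulationBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;
    private long population = 0;

    PopulationBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    private int position(String key, int i) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        long h1 = crc.getValue();
        long h2 = key.hashCode() | 1; // force an odd step
        return (int) Math.floorMod(h1 + i * h2, (long) size);
    }

    /** Sets the key's bits; returns true (and counts the key) if any bit was not set before. */
    boolean add(String key) {
        boolean anyNew = false;
        for (int i = 0; i < hashes; i++) {
            int p = position(key, i);
            if (!bits.get(p)) {
                bits.set(p);
                anyNew = true;
            }
        }
        if (anyNew) {
            population++;
        }
        return anyNew;
    }

    long population() {
        return population;
    }
}
```

Repeat visits by the same user never move the counter. The only error is an occasional under-count, when a new user's bits were all set by earlier users.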

Example Uses of Bloom Filters

Web cache – what URLs are already in the cache on another server?

P2P networks – what node contains which part of the file?

Databases

– Do keys exist in this page? If not, don't load the page

– HBase uses them to detect which blocks do not have the data (HDFS is write-once)

– Many RDBMS use them internally to 'fail fast' and not load pages into memory

– Sadly, no RDBMS or NoSQL I know of offers them as user data types

Example Uses of Bloom Filters

Ad networks (old way ...)

– Big Hadoop job hourly/nightly to determine which ads to show based on prior behavior

– Load the filter into a common storage (disk usually)

– Ad servers load all the filters into memory and query for your cookie id to see what to show you

Examples of Redis-backed BloomFilters

Has the user been here this month? If not, show them a message. A false positive doesn't matter

White vs. Black list for IP

– Known bad IP in the filter

– Upon login check the filter. Not found, login. Found – check DB to validate bad IP.

– A false positive leads to a DB query that returns false, but this should be rare

• Ad Networks (real time BF updates based on what you searched on)
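The IP black-list check can be sketched as below. Everything here is a stand-in: a local BitSet plays the Redis-backed filter, a HashSet plays the database table, and the class, field and method names are hypothetical. The `dbLookups` counter exists only to show how often the expensive path is taken.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.CRC32;

/** Illustrative fail-fast check: ask the filter first, hit the "database" only on a possible match. */
final class IpBlacklist {
    private static final int SIZE = 1 << 16, HASHES = 5;
    private final BitSet filter = new BitSet(SIZE);       // stand-in for the Redis-backed filter
    private final Set<String> database = new HashSet<>(); // stand-in for the real table
    int dbLookups = 0;

    private static int position(String key, int i) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        long h1 = crc.getValue();
        long h2 = key.hashCode() | 1; // force an odd step
        return (int) Math.floorMod(h1 + i * h2, (long) SIZE);
    }

    /** Records a known bad IP in both the database and the filter. */
    void block(String ip) {
        database.add(ip);
        for (int i = 0; i < HASHES; i++) {
            filter.set(position(ip, i));
        }
    }

    /** Not in the filter: definitely not black-listed, no database call. Otherwise validate in the database. */
    boolean isBlocked(String ip) {
        for (int i = 0; i < HASHES; i++) {
            if (!filter.get(position(ip, i))) {
                return false;
            }
        }
        dbLookups++;
        return database.contains(ip);
    }
}
```

A false positive costs one extra database query that returns false, so the answer given to the caller is still correct.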

Client-side Joins

Most NoSQL stores don't support joins

Architecture may have data across multiple stores

Keep a Population Bloom Filter by day of unique users in a data source

When needing to join, load smallest data source as the driver and query other sources in order of size

If queries are time based and filters are available for the time, looking up key matches can be very fast

Agenda: Fail Fast

What it is

Redis-backed BloomFilters

Examples

Fail Fast

The ability to quickly know to NOT do something expensive

Example: Black-list of IPs

Think about ways to NOT do some work

Cost of Redis servers is much less than an RDBMS license or the cost of a good DB server with storage!

Hammer Time

Be careful

Sometimes the cost of building and maintaining the structures outweighs the benefit

Convoluted designs to avoid the database

Collect metrics on 'hits' to see whether the filters provide any benefit (Coda Hale Metrics)

Example (naive)

Build a BF for ads shown to a user (hash on user id and ad id)

When the user visits, hash their user id and the top ad to display this hour and set the bits in the BF

If any were not set, the Population count is incremented and you display the ad

If already set, move to the next most important ad.

You now know total unique views by ad by hour

Can do total gross with a Redis Hash too!

Example – smarter

Hash each of the top 10 ad ids with the user id and request them in parallel (pipeline)

Check the return to see which ones aren't set, submit an update request and set the population

2 round trips to check 10 ads.

(Can also do this in Lua in 1 round trip)

Example – part 2

Same idea as before, but build the bloom filter for each hour

When user visits, query last 6 filters in parallel (pipeline!) to see if they've seen the ad(s).

Redis TTL on the hourly filter will drop it automatically when it becomes too old
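A rough in-memory sketch of the hourly-filter idea follows. It assumes one BitSet per hour held in a deque, with `startNewHour()` dropping the oldest filter where a Redis TTL would expire the key. The class name, method names and the `user:ad` key format are hypothetical, and the loop over filters stands in for the pipelined parallel query.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;
import java.util.zip.CRC32;

/** Illustrative rolling window of per-hour Bloom filters, newest first. */
final class HourlyFilters {
    private static final int SIZE = 1 << 16, HASHES = 5;
    private final Deque<BitSet> hours = new ArrayDeque<>();
    private final int window; // how many hourly filters to keep

    HourlyFilters(int window) {
        this.window = window;
        hours.addFirst(new BitSet(SIZE));
    }

    private static int position(String key, int i) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        long h1 = crc.getValue();
        long h2 = key.hashCode() | 1; // force an odd step
        return (int) Math.floorMod(h1 + i * h2, (long) SIZE);
    }

    /** Opens a filter for the new hour and drops any filter that fell out of the window. */
    void startNewHour() {
        hours.addFirst(new BitSet(SIZE));
        while (hours.size() > window) {
            hours.removeLast();
        }
    }

    /** Records the key in the current hour's filter only. */
    void record(String key) {
        BitSet current = hours.peekFirst();
        for (int i = 0; i < HASHES; i++) {
            current.set(position(key, i));
        }
    }

    /** True if any filter still in the window probably contains the key. */
    boolean seenInWindow(String key) {
        for (BitSet hour : hours) {
            boolean allSet = true;
            for (int i = 0; i < HASHES && allSet; i++) {
                allSet = hour.get(position(key, i));
            }
            if (allSet) {
                return true;
            }
        }
        return false;
    }
}
```

Dropping a whole filter gives the effect of a delete without counters: a key simply stops matching once the last hour that recorded it has aged out.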

Example 3

Collect lots of data about users (such as virtual cows, farm land, chickens etc.)

Run a predictive model on the data and identify which special offers to show when the user visits again. Store user ids in a Bloom Filter

Load the BF into Redis

Query each time the user logs in and display appropriate offer

No massive database insert/updates to flag who should see it

False positive isn't too bad

Example 4 – Query optimization

Client-side joins

Ask the Bloom Filter if the user has performed the action (filters for hour, day, week of year etc.)

If not, don't even call the data source

May need to read some extra data due to 'in the last 11 days', but asking the BF and being told 'no' prevents ANY data source resources from being used

What if the BF is lost?

Rebuild it from the base events (Hadoop!)

Conclusion

Redis is a very fast, very simple and very powerful name-value store (“data structure server”)

Bloom Filters have lots of applications when you want to quickly look up if one of millions of 'things' happened

Redis-backed BloomFilters make updatable bloom filters trivial to use

Think about what you need to know to NOT do an expensive operation

Fail fast

References

http://redis.io

http://en.wikipedia.org/wiki/Bloom_filter


https://github.com/Baqend/Orestes-Bloomfilter

http://www.slideshare.net/chriscurtin

@ChrisCurtin on twitter

Github.com/chriscurtin


Questions?
