Failing Fast with Redis backed BloomFilters • Christopher Curtin • Head of Technical Research
• @ChrisCurtin
About Me
25+ years in technology
Head of Technical Research at Silverpop, an IBM Company (14+ years at Silverpop)
Built a SaaS platform before the term ‘SaaS’ was being used
Prior to Silverpop: real-time control systems, factory automation and warehouse management
Always looking for technologies and algorithms to help with our challenges
Silverpop Open Positions
Technical Lead
Senior Engineer
Architect
Automation Engineers
Agenda
Redis
Bloom Filters
Failing Fast
Agenda
Redis
– What it is
– Why we started looking at using it
– Basics
– Concurrency
– Operational Considerations
– Challenges
Redis – What is it?
From redis.io:
"Redis is an open source, BSD licensed, advanced key-value cache and store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets, sorted sets, bitmaps and hyperloglogs."
Hyper-what-what?
HyperLogLog
Approximation technique for counting distinct entries in a set.
Very small memory footprint for rough approximations (at most 12 KB per key in Redis, for a standard error of about 0.81%)
Nice – but too much loss for what we need
Features
• Unlike typical key-value stores, you can send commands that edit the value on the server instead of reading it back to the client, updating it and pushing it to the server
• Pub/sub
• TTL on keys
• Clustering and automatic fail-over
• Lua scripting
• Client libraries for just about any language you can think of
So why did we start looking at NoSQL?
“For the cost of an Oracle Enterprise license I can give you 64 cores and 3 TB of memory”
Redis Basics
In-memory-only key-value store
Single threaded. Yes, single threaded
No paging, no reading from disk
CS 101 data structures and operations
Tens of millions of keys isn't a big deal
The amount of RAM defines how big the store can get
Basic Data Types
Strings
Hashes
Lists
Sets and Sorted Sets
CS 101 ...
Hashes
- Collection of key-value pairs under a single name
- Useful for storing related fields under a common name
- Values can only be strings or numbers. No hash of lists
http://redis.io/commands/hget
Sets and Sorted Sets
Buckets of values with very fast membership look-up
No duplicates allowed
Sorted Sets have scores to make them sortable
– Automatically keeps them in order for fast 'top x' look ups
http://redis.io/commands/zadd
http://redis.io/commands/zrange
Lists
Most interesting due to how operations are applied to the remote store
Unbounded (except by memory)
Atomic operations between lists (pop from one, push to another)
CS 101: lpush, rpush, lpop, lrange etc.
Advanced: blocking pops
http://redis.io/commands/rpush
http://redis.io/commands/rpoplpush
Concurrency
Single threaded
Each operation can work on one or two keys, atomically
Pipelines send a batch of commands to the server in a single request and run them in order; wrap the batch in MULTI/EXEC when no other client's commands may run in between
Pipelines do not allow for logic between commands
Lua scripts allow for logic between commands
BE CAREFUL with Lua: scripts block all clients!
Pipeline Java Example BloomFilterRedis.java line 43
Lua Example Lua-scripts example
Operational Information
Persistence can be 'none', journal (AOF) or point in time (RDB)
Optional Master/Slave replication
Redis's own HA system (Sentinel)
Common deployment model is lots of instances per machine
Millions of keys get hard to manage – build 'directory' hashes to make it easier for Operations to find keys to look at
Challenges with Redis
Key explosion – single name space
Lua scripts can block all other users
Large pipelines and transactions can hold up other users
No nested data types (I want a hash of lists!)
Without name spaces, be cautious of how you define key names
Concurrency Demo – JMS replacement
Client submits a request to the queue (LPUSH)
Consumer application polls for work when a worker is available (RPOPLPUSH)
Worker executes the task assigned to it
When worker is done, its list is removed
Lather, Rinse, Repeat
(We provide a hash of workers for Operations to query for monitoring)
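
The queue flow above can be sketched without a Redis server. This is a minimal in-memory stand-in, not the demo shown in the talk: `ListStore`, `run_worker` and the key names `requests` and `worker:<id>:task` are hypothetical, and only the semantics of LPUSH, RPOPLPUSH and DEL are mimicked.

```python
from collections import deque


class ListStore:
    """In-memory stand-in for the Redis list commands used by the demo."""

    def __init__(self):
        self.lists = {}

    def lpush(self, key, value):
        # LPUSH: add the value at the head of the list.
        self.lists.setdefault(key, deque()).appendleft(value)

    def rpoplpush(self, source, destination):
        # RPOPLPUSH: move the tail of source to the head of destination.
        # In Redis this is one atomic command; here it is plain Python.
        src = self.lists.get(source)
        if not src:
            return None
        value = src.pop()
        self.lists.setdefault(destination, deque()).appendleft(value)
        return value

    def delete(self, key):
        self.lists.pop(key, None)


def run_worker(store, worker_id, handler):
    """Claim one task, execute it, then remove the worker's list."""
    in_progress = f"worker:{worker_id}:task"  # hypothetical key name
    task = store.rpoplpush("requests", in_progress)
    if task is None:
        return None  # nothing to do
    result = handler(task)
    store.delete(in_progress)
    return result


store = ListStore()
store.lpush("requests", "resize:1")
store.lpush("requests", "resize:2")

first = run_worker(store, "w1", str.upper)
second = run_worker(store, "w1", str.upper)
third = run_worker(store, "w1", str.upper)
```

Because the task sits in the worker's own list while it runs, a crashed worker leaves its task visible there instead of losing it.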
Agenda
Bloom Filters
– What they are
– Why we started looking at using them
– Basics
– False Positives
– Example Uses
– Why not do this in a database?
Bloom Filters
From Wikipedia (Don't tell my kid's teacher!)
"A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate"
Hashing
Apply 'x' hash functions to the key to be stored/queried
Each function returns the position of a bit to set in the bitset
Mathematical equations determine how big to make the bitset and how many functions to use for your acceptable error level
http://hur.st/bloomfilter?n=4&p=1.0E-20
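
The sizing equations mentioned above are the standard ones for a Bloom filter: for n expected items and a target false positive rate p, the bitset needs m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. A small sketch; the function name `bloom_parameters` is made up for illustration.

```python
import math


def bloom_parameters(expected_items, false_positive_rate):
    """Return (bits, hash_functions) for the standard Bloom filter formulas."""
    bits = math.ceil(
        -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    hash_functions = max(1, round(bits / expected_items * math.log(2)))
    return bits, hash_functions


# One million keys at a 1% false positive rate:
# roughly 9.6 million bits (about 1.2 MB) and 7 hash functions.
bits, hash_functions = bloom_parameters(1_000_000, 0.01)
```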
Example
False Positives
Perfect hash functions aren't worth the cost to develop
Sometimes all the bits for a key have already been set by other keys
Make sure you understand the business impact of a false positive
Remember, never a false negative
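
A minimal in-memory Bloom filter shows both properties. This is an illustrative sketch, not the Orestes implementation: the class name and the choice of deriving k positions from one SHA-256 digest (double hashing) are assumptions.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits, num_hashes):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from a single digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # True means "maybe present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


bf = BloomFilter(size_bits=10_000, num_hashes=7)
stored = [f"user-{i}" for i in range(500)]
for key in stored:
    bf.add(key)

# Every stored key is found (no false negatives). Keys that were never
# stored are usually rejected, but a few may be reported as present.
false_positives = sum(bf.might_contain(f"other-{i}") for i in range(500))
```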
Creation
Libraries are available for every language I looked up (even JavaScript)
Some are built in memory, for a single process/JVM to use
Read-only filters (ad networks) are built using Hadoop and loaded into memory
In memory is great for lots of reads, single process/JVM etc.
But ...
Updates
Updating a 16 MB structure in memory and persisting it to disk is expensive
8 bits change and you write 16 MB! (DBAs will love you …)
Deletes
Not possible in a regular Bloom Filter – how would you know what bits are used by other keys?
Counting Bloom Filters keep a small counter (3-4 bits) in place of each bit in the bitmap. 'delete' decrements the counters for the key
Not as space friendly any more …
Instead, consider having bloom filters based around the lifetime of the data to be queried
– For a filter 'visited in the last 4 hours' have 4 filters and age the oldest out (TTL in Redis maybe ...)
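
The counting variant described above can be sketched as follows. This is a simplified illustration: the class and method names are hypothetical, and plain Python integers stand in for the 3-4 bit counters a real implementation would use (so there is no overflow handling).

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, size, num_hashes):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size  # one counter per slot instead of one bit

    def _positions(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.counters[pos] += 1

    def might_contain(self, key):
        return all(self.counters[pos] > 0 for pos in self._positions(key))

    def remove(self, key):
        # Only decrement when the key looks present. Removing a key that
        # was never added (a false positive) would corrupt other keys.
        if not self.might_contain(key):
            return False
        for pos in self._positions(key):
            self.counters[pos] -= 1
        return True


cbf = CountingBloomFilter(size=1_000, num_hashes=4)
cbf.add("user-1")
cbf.add("user-2")
removed = cbf.remove("user-1")
```

The list of integers makes the space cost visible: every slot now holds a counter, which is why the time-windowed filters suggested above are often the cheaper way to forget old data.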
Issue: Persistence
Load a 16 MB filter from the database to check 6 bits?
Worse: update 6 bits in a 16 MB filter
DBAs will not be happy
– Undo/redo
– SGA misses, page faults
– Backups, replication traffic etc.
Why were we interested in Bloom Filters?
Found a lot of places where we went to the database only to find the data didn't exist
Found lots of places where we want to know if a user DIDN'T do something
Persistent Bloom Filters
We needed persistent Bloom Filters for lots of user stories
Found Orestes-BloomFilter on GitHub, which used Redis as a store, and enhanced it
Added population filters
Fixed a few bugs
Did a pull request and it was accepted!
Benefits
Filters are stored in Redis
• Only SETBIT/GETBIT calls to the server
Reads and updates of the filter from a set of application servers
Persistence has a cost, but a fraction of the RDBMS costs
Can load a BF created offline and begin using it
Remember “For the cost of an Oracle License”
Thousands of filters
Dozens of Redis instances
TTL on a Redis key makes cleanup of old filters trivial
Population Bloom Filters
Unique need we had
Users access the system frequently, but I really only need to count them once per month for billing
Tens of thousands of clients; Finance wants the monthly report in seconds
Logic is simple: if any bits weren't already set for the key (user id), increment the counter
Note: there are mathematical methods of estimating a BF population, but we needed a better error rate
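
The population logic can be sketched with an in-memory stand-in for Redis. The sketch relies on one real Redis behaviour: SETBIT returns the previous value of the bit, so setting the bits and learning whether the key is new happen in the same calls. `BitStore` and `PopulationBloomFilter` are hypothetical names, and the counter is a Python attribute here where a shared deployment would presumably keep it in Redis as well.

```python
import hashlib


class BitStore:
    """In-memory stand-in for Redis SETBIT (returns the bit's old value)."""

    def __init__(self):
        self.bitmaps = {}

    def setbit(self, key, offset, value):
        bits = self.bitmaps.setdefault(key, set())
        previous = 1 if offset in bits else 0
        if value:
            bits.add(offset)
        else:
            bits.discard(offset)
        return previous


class PopulationBloomFilter:
    def __init__(self, store, name, size_bits, num_hashes):
        self.store = store
        self.name = name
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.population = 0

    def _positions(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size_bits for i in range(self.num_hashes)]

    def add(self, key):
        previous = [
            self.store.setbit(self.name, pos, 1) for pos in self._positions(key)
        ]
        is_new = not all(previous)  # at least one bit was not set before
        if is_new:
            self.population += 1
        return is_new


monthly = PopulationBloomFilter(BitStore(), "billing:2014-10", 100_000, 5)
for user_id in ["user-1", "user-1", "user-2", "user-1"]:
    monthly.add(user_id)
```

A false positive makes the count slightly too low, never too high, because a key is only counted when the filter is certain it has not been seen.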
Example Uses of Bloom Filters
Web cache – what URLs are already in the cache on another server?
P2P networks – what node contains which part of the file?
Databases
– Do keys exist in this page? If not, don't load the page
– HBase uses them to detect which blocks do not have the data (HDFS is write-once)
– Many RDBMSs use them internally to 'fail fast' and not load pages into memory
– Sadly, no RDBMS or NoSQL store I know of offers them as user data types
Example Uses of Bloom Filters
Ad networks (old way ...)
– Big Hadoop job hourly/nightly to determine which ads to show based on prior behavior
– Load the filter into a common storage (disk usually)
– Ad servers load all the filters into memory and query for your cookie id to see what to show you
Examples of Redis-backed BloomFilters
Has the user been here this month? If not, show them a message. A false positive doesn't matter
White vs. black list for IPs
– Known bad IPs are in the filter
– Upon login, check the filter. Not found: log in. Found: check the DB to validate the bad IP
– A false positive leads to a query that returns false, but should be rare
Ad networks (real-time BF updates based on what you searched on)
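
The black-list check above is the core fail-fast pattern: a "no" from the filter skips the database entirely, and only a "maybe" pays for the query. A minimal sketch under stated assumptions: `IpBlacklist` is a hypothetical class, a Python set stands in for the database table, and a Python integer stands in for the Redis bitset.

```python
import hashlib


def bit_positions(key, size_bits, num_hashes):
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % size_bits for i in range(num_hashes)]


class IpBlacklist:
    def __init__(self, bad_ips, size_bits=8192, num_hashes=5):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.database = set(bad_ips)  # stand-in for the authoritative table
        self.db_queries = 0
        self.bits = 0  # integer used as a bitset
        for ip in bad_ips:
            for pos in bit_positions(ip, size_bits, num_hashes):
                self.bits |= 1 << pos

    def is_blocked(self, ip):
        positions = bit_positions(ip, self.size_bits, self.num_hashes)
        if not all((self.bits >> pos) & 1 for pos in positions):
            return False  # definitely not listed: no database call
        # Filter says "maybe": confirm against the database.
        self.db_queries += 1
        return ip in self.database


blacklist = IpBlacklist(["203.0.113.7"])
known_bad = blacklist.is_blocked("203.0.113.7")
normal_user = blacklist.is_blocked("198.51.100.20")
```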
Client-side Joins
Most NoSQL stores don't support joins
Architecture may have data across multiple stores
Keep a Population Bloom Filter by day of unique users in a data source
When needing to join, load smallest data source as the driver and query other sources in order of size
If queries are time based and filters are available for the time, looking up key matches can be very fast
Agenda
Fail Fast
– What it is
– Redis-backed BloomFilters
– Examples
Fail Fast
The ability to quickly know to NOT do something expensive
Example: Black-list of IPs
Think about ways to NOT do some work
Cost of Redis servers is much less than an RDBMS license or the cost of a good DB server with storage!
Hammer Time
Be careful
Sometimes the cost of building and maintaining the structures outweighs the benefit
Convoluted designs to avoid the database
Collect metrics on 'hits' to see if the filters provide any benefit (CodaHale)
Example (naive)
Build a BF for ads shown to a user (hash on user id and ad id)
When the user visits, hash their user id with the top ad to display this hour and set the bits in the BF
If any were not set, the population count is incremented and you display the ad
If already set, move to the next most important ad
Now you know total unique views by ad by hour
Can do total gross with a Redis Hash too!
Example – smarter
Hash each of the top 10 ad ids with the user id and request them in parallel (pipeline)
Check the return to see which ones aren't set, submit an update request and set the population
2 round trips to check 10 ads
(Can also do this in Lua in 1 round trip)
Example – part 2
Same idea as before, but build a Bloom filter for each hour
When the user visits, query the last 6 filters in parallel (pipeline!) to see if they've seen the ad(s)
Redis TTL on the hourly filter will drop it automatically when it becomes too old
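
The hourly-filter idea can be sketched in memory. Everything here is an illustrative assumption: `HourlyFilters` is a hypothetical class, a dict of integers stands in for one Redis key per hour, and the `expire` method mimics what a TTL on each key would do automatically.

```python
import hashlib


class HourlyFilters:
    def __init__(self, window_hours=6, size_bits=4096, num_hashes=4):
        self.window_hours = window_hours
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.filters = {}  # hour number -> integer used as a bitset

    def _positions(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size_bits for i in range(self.num_hashes)]

    def record(self, key, hour):
        bits = self.filters.get(hour, 0)
        for pos in self._positions(key):
            bits |= 1 << pos
        self.filters[hour] = bits

    def expire(self, current_hour):
        # Stand-in for a Redis TTL: drop filters outside the window.
        cutoff = current_hour - self.window_hours
        for hour in [h for h in self.filters if h <= cutoff]:
            del self.filters[hour]

    def seen_recently(self, key, current_hour):
        self.expire(current_hour)
        positions = self._positions(key)
        return any(
            all((bits >> pos) & 1 for pos in positions)
            for bits in self.filters.values()
        )


views = HourlyFilters(window_hours=6)
views.record("user-1:ad-9", hour=100)

still_visible = views.seen_recently("user-1:ad-9", current_hour=105)
aged_out = views.seen_recently("user-1:ad-9", current_hour=106)
```

Dropping a whole filter is how this design gets "deletes" without a counting filter: nothing is removed key by key, the oldest hour simply disappears.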
Example 3
Collect lots of data about users (such as virtual cows, farm land, chickens etc.)
Run a predictive model on the data and identify which special offers to show when the user visits again. Store user ids in a Bloom Filter
Load the BF into Redis
Query each time the user logs in and display the appropriate offer
No massive database inserts/updates to flag who should see it
False positive isn't too bad
Example 4 – Query optimization
Client-side joins
Ask the Bloom Filter if the user has performed the action (filters for hour, day, week of year etc.)
If not, don't even call the data source
May need to read some extra data due to 'in the last 11 days', but asking the BF and being told 'no' prevents ANY data source resources from being used
What if the BF is lost? Rebuild it from the base events (Hadoop!)
Conclusion
Redis is a very fast, very simple and very powerful name-value store (“data structure server”)
Bloom Filters have lots of applications when you want to quickly look up if one of millions of 'things' happened
Redis-backed BloomFilters make updatable bloom filters trivial to use
Think about what you need to know to NOT do an expensive operation
Fail fast
References
redis.io
http://en.wikipedia.org/wiki/Bloom_filter
http://hur.st/bloomfilter?n=4&p=1.0E-20
https://github.com/Baqend/Orestes-Bloomfilter
http://www.slideshare.net/chriscurtin
@ChrisCurtin on twitter
Github.com/chriscurtin
Questions?