Bloom Filters, Minhashes, and Other Random Stuff Brian Brubach University of Maryland, College Park StringBio 2018, University of Central Florida
What?
• Probabilistic
• Space-efficient
• Fast
• Not exact
Why?
• Data deluge / Big data / Massive data
• Millions or billions of sequences
• Human genome: 3 Gbp
  • 1 giga base pairs = 1 billion characters
• Microbiome sample of 1.6 billion 100 bp reads generated in 10.8 days (Caporaso et al., 2012)
• Medium data, but on a laptop
  • Lots of bioinformatics happens here
• Beyond scalability of BWT, FM-index, etc. (Berger, Daniels, and Yu, 2016)
Curse of Dimensionality
• Sequences are compared in high dimensional space
• Comparing $N$ sequences takes $N^2$ time
• Computing edit distance between two sequences of length $n$ takes $n^2$ time
  • Allegedly
Curse of Dimensionality
• ATGATCGAGGCTATGCGACCGATCGATCGATTCGTA
• ATGATGGAGGCTATGGGAACGATCGATCGACTCGTA
• ATGATCGAGGCTATGCCACCGATCGAACGATTCGTA
• ATCATCGAGGCTATGCGACCGTTCGATCGATTCCTA
• GTGATCGTGGCTATGCGACCGATCGATCGATTCGTC
• ATGATCGAGGCTATGCCACCGATCGAACGATTCGTA
• ATGATCCAGGCTATGCGACCGATCGATGCATTCGTA
Why Stay in High Dimensions?
• $4^{100}$ possible DNA strings of length 100
• $4^{15} \approx 1$ billion reads
$k$-mers of a Sequence
• All substrings of length $k$
• Canonical: lexicographically smallest among forward and reverse complement
  • Forget this for now
Sequence: ATCTGAGGTCAC
All 7-mers: ATCTGAG, TCTGAGG, CTGAGGT, TGAGGTC, GAGGTCA, AGGTCAC
Reverse complement: ATCTGAGGTCAC → GTGACCTCAGAT
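The k-mer definitions above can be sketched in a few lines of Python. The function names (`kmers`, `reverse_complement`, `canonical`) are illustrative choices, not taken from the talk, and the sketch assumes an uppercase ACGT alphabet with no ambiguity codes.

```python
# Minimal sketch: k-mers, reverse complement, and canonical k-mers.
# Assumes uppercase DNA over the alphabet ACGT (no N or other codes).

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmers(seq, k):
    """Return all substrings of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


seq = "ATCTGAGGTCAC"  # example sequence from the slide
print(kmers(seq, 7))
print(reverse_complement(seq))
print([canonical(x) for x in kmers(seq, 7)])
```

A sequence of length $n$ has $n - k + 1$ $k$-mers, so the 12-base example yields six 7-mers.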
Hash function
String → Hash Magic → Random integer in $\{1, \dots, m\}$
• Will assume idealized model of hashing for this talk
• Lots of research in this area
Bloom Filter Example Problem
• Store a large set of $N$ $k$-mers
• Query $k$-mers against it for exact matches
• Want speed and space-efficiency
• How can we address this with hashing?
• Put $k$-mers in hash table
  • At least $2Nk$ bits for data plus table overhead
• What if we just store one bit at each hash for presence/absence?
  • Simple Bloom filter, potentially suboptimal
Bloom Filter
• Probabilistic data structure
• Fast and space-efficient
• False positives, but no false negatives
• Insert and contains, but no delete
• Due to Burton Howard Bloom in 1970
  • Gave example of automatic hyphenation
  • Identify the 10% of words that require special hyphenation rules
Bloom Filter
• $N$ items to store: $x_1, x_2, \dots, x_N$
• $m$-bit vector
• $d$ hash functions: $h_1, h_2, \dots, h_d$
• Insert($x$): set bits $h_1(x), h_2(x), \dots, h_d(x)$ to 1
• Contains($y$):
  • Yes if bits $h_1(y), h_2(y), \dots, h_d(y)$ are all 1
  • No if any are 0
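The Insert and Contains operations above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the class name `BloomFilter` and the way the $d$ hash functions are derived (SHA-256 of the item prefixed with the hash index) are assumptions made for the example, and one byte is used per bit purely for readability.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch with an m-bit vector and d hash functions."""

    def __init__(self, m, d):
        self.m = m
        self.d = d
        # One byte per bit keeps the code simple; a real filter packs bits.
        self.bits = bytearray(m)

    def _positions(self, item):
        # Assumption: simulate d hash functions by salting SHA-256 with i.
        for i in range(self.d):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def contains(self, item):
        # "No" as soon as any bit is 0; "yes" only if all d bits are 1.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1000, d=3)
for kmer in ["ATCTGAG", "TCTGAGG", "CTGAGGT"]:
    bf.insert(kmer)

print(bf.contains("TCTGAGG"))  # inserted items are always reported present
print(bf.contains("GGGGGGG"))  # usually absent; a "yes" here is a false positive
```

There is no delete operation because clearing a bit could also erase evidence of another item that hashed to the same position.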
Bloom Filter Example
• $m = 10$, $d = 3$, hash functions: $h_1, h_2, h_3$
• Empty filter:
  0 0 0 0 0 0 0 0 0 0
• Insert($x_1$): set bits $h_1(x_1), h_2(x_1), h_3(x_1)$
  0 1 1 0 0 0 1 0 0 0
• Insert($x_2$): set bits $h_1(x_2), h_2(x_2), h_3(x_2)$
  0 1 1 0 0 0 1 1 0 1
• Contains($x_2$): bits $h_1(x_2), h_2(x_2), h_3(x_2)$ are all 1, so yes
• Contains($y$): bits $h_1(y), h_2(y), h_3(y)$ are all 1 although $y$ was never inserted
  • False Positive!
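False Positive probability
• Pr[one hash misses a bit]
  • $1 - \frac{1}{m}$
• Pr[one insertion misses a bit]
  • $\left(1 - \frac{1}{m}\right)^{d}$
• Pr[all insertions miss a bit]
  • $\left(1 - \frac{1}{m}\right)^{dN}$
• Pr[a single bit flipped to 1]
  • $1 - \left(1 - \frac{1}{m}\right)^{dN} \approx 1 - e^{-dN/m}$
• False positive probability (assuming independence)
  • $\left(1 - e^{-dN/m}\right)^{d}$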
Optimal parameters
• False positive rate $p \approx \left(1 - e^{-dN/m}\right)^{d}$
• False positives minimized at $d = \frac{m}{N} \ln 2$
• Bits per item $\frac{m}{N} \approx -\frac{\log_2 p}{\ln 2} \approx -1.44 \log_2 p$
• Approximate: assuming asymptotic, independence, and integrality of $d$
  • $p = 0.01$, needs 9.59 bits per item
  • $p = 0.001$, needs 14.38 bits per item
• Number of hashes $d \approx -\log_2 p$
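The formulas above are easy to check numerically. The sketch below uses illustrative function names and the same approximations as the slide (asymptotic, independent bits, and a non-integer $d$ allowed).

```python
import math


def false_positive_rate(m, d, n_items):
    """Approximate false positive rate (1 - e^{-dN/m})^d."""
    return (1.0 - math.exp(-d * n_items / m)) ** d


def bits_per_item(p):
    """Bits per item m/N needed for target rate p at the optimal d."""
    return -math.log(p) / (math.log(2) ** 2)


def optimal_num_hashes(p):
    """Optimal (real-valued) number of hash functions for target rate p."""
    return -math.log2(p)


for p in (0.01, 0.001):
    print(p, round(bits_per_item(p), 2), round(optimal_num_hashes(p), 2))
```

In practice $d$ must be rounded to an integer, which makes the achieved false positive rate slightly different from the target $p$.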
Properties
• Insert and check in $O(d)$ time
  • Independent of number of items inserted
• Fast and parallel to compute hashes
• Can do union and intersection with OR and AND of bit vectors
• Can estimate $N$ if unknown
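The last two properties can be illustrated with bit vectors stored as Python integers. Everything here is a sketch under assumptions: the parameters `M` and `D` are hypothetical, the hash construction is the same illustrative salted SHA-256 idea, and the count estimate uses the standard formula $-\frac{m}{d}\ln\left(1 - \frac{X}{m}\right)$ for $X$ set bits, which is not stated in the talk.

```python
import hashlib
import math

M, D = 64, 3  # hypothetical small parameters for illustration


def positions(item, m=M, d=D):
    """Bit positions for an item (assumed salted SHA-256 hashing)."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % m
        for i in range(d)
    ]


def build(items):
    """Build a filter as an integer whose binary digits are the bit vector."""
    bits = 0
    for item in items:
        for pos in positions(item):
            bits |= 1 << pos
    return bits


def contains(bits, item):
    return all((bits >> pos) & 1 for pos in positions(item))


def estimate_count(bits, m=M, d=D):
    """Estimate the number of inserted items from the number of set bits.

    Undefined when every bit is set (log of zero).
    """
    ones = bin(bits).count("1")
    return -(m / d) * math.log(1.0 - ones / m)


a = build(["ATCTGAG", "TCTGAGG"])
b = build(["TCTGAGG", "CTGAGGT"])

union = a | b          # identical to a filter built from all three items
intersection = a & b   # contains every shared item, possibly extra bits

print(contains(union, "CTGAGGT"), contains(intersection, "TCTGAGG"))
print(estimate_count(a))
```

The OR of two filters equals the filter of the union exactly, while the AND is only a superset of the filter of the intersection, so it can have a higher false positive rate than a filter built directly from the shared items.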
Endless Variations
• Deletions
• Counting
• Bloomier filters: storing values
• Cache optimizations
• Distance sensitive: is $x$ close to the set
$k$-mer Bloom Filter
• Can we do better if we know the items are $k$-mers from a genome?
• Observation: the “items” are overlapping substrings from a 4 letter alphabet
• After getting positive,
  • Check all 4 preceding $k$-mers and all 4 following $k$-mers
  • One must be in the set for a true positive
  • False positive next to another positive less likely
• Can reduce false positives or space
  • (Pellow, Filippova, and Kingsford, 2017)
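The neighbor check can be sketched as a wrapper around any membership test. This is a simplified illustration of the idea on the slide, not the published method: the function names are hypothetical, a plain Python set with one injected spurious $k$-mer stands in for a Bloom filter, and $k$-mers at the very ends of a sequence or $k$-mers that are their own neighbor (such as homopolymers) would need extra care.

```python
def neighbors(kmer, alphabet="ACGT"):
    """The 4 possible preceding and 4 possible following k-mers."""
    preceding = [c + kmer[:-1] for c in alphabet]
    following = [kmer[1:] + c for c in alphabet]
    return preceding + following


def contains_with_neighbor_check(contains, kmer):
    """Accept a positive only if at least one neighboring k-mer is also positive."""
    if not contains(kmer):
        return False
    return any(contains(nb) for nb in neighbors(kmer))


seq = "ATCTGAGGTCAC"
true_kmers = {seq[i:i + 7] for i in range(len(seq) - 7 + 1)}

# Stand-in for a Bloom filter: the true set plus one simulated false positive.
raw_positives = true_kmers | {"GGGGGGC"}
raw_contains = raw_positives.__contains__

print(raw_contains("GGGGGGC"))                                # raw filter says yes
print(contains_with_neighbor_check(raw_contains, "GGGGGGC"))  # rejected by the check
print(contains_with_neighbor_check(raw_contains, "CTGAGGT"))  # true k-mer still accepted
```

The check costs up to eight extra queries per positive, which is the price paid for the lower false positive rate.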
Bio Applications
• Pan-genome storage
  • Bloom filter trie (Holley, Wittler, and Stoye, 2015)
• Short-read RNA-seq database
  • Split Sequence Bloom tree (Solomon and Kingsford, 2016)
• Succinct de Bruijn graphs
  • Probabilistic de Bruijn graph (Pell et al., 2011)
  • Exact version (Chikhi and Rizk, 2012)
  • Human genome: 3 Gbp, $k = 27$, 3.7 GB, 13.2 bits per vertex
Locality Sensitive Hashing (LSH)
• What do we typically want to avoid when hashing?
  • Collisions!
• Approximate nearest neighbors: towards removing the curse of dimensionality (Indyk and Motwani, 1998)
  • Idea: get similar elements to hash together
  • “Its key ingredient is the notion of locality-sensitive hashing which may be of independent interest; …”
Comparing Two Sequences
• Mash: fast genome and metagenome distance estimation using MinHash (Ondov et al., 2016)
• Let $A$ and $B$ be two DNA sequences to compare
• Construct $k$-mer sets $A$ and $B$
• Assume $|A| = |B|$ for now (not true)
• Compare the sets somehow
• Not faster yet, but we’ll get there…
Jaccard Index
• Similarity between sets $A$ and $B$
  • $\frac{|A \cap B|}{|A \cup B|}$
• Correlated with Average Nucleotide Identity (ANI)
  • Empirical support, but debatable
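For reference, the exact Jaccard index of two $k$-mer sets can be computed directly. The second sequence below is a made-up variant of the earlier example with its last base changed, and returning 1.0 for two empty sets is an arbitrary convention chosen for the sketch.

```python
def kmer_set(seq, k):
    """Set of all distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard index |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    if not union:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(union)


A = kmer_set("ATCTGAGGTCAC", 7)
B = kmer_set("ATCTGAGGTCAG", 7)  # hypothetical variant: last base C -> G
print(jaccard(A, B))  # 5 shared 7-mers out of 7 distinct ones
```

This exact computation needs both full $k$-mer sets in memory, which is the cost that sketching is meant to avoid.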
Jaccard Index: $\frac{|A \cap B|}{|A \cup B|}$
[Figure: Venn diagram of sets $A$ and $B$]
• What would you do if you were studying a population? Sample!
[Figure: Venn diagram with $A$ = people who like peanut butter and $B$ = people who like jelly]
Sketch
• Small “fingerprint” of a data point (string)
[Figure: sets $A$ and $B$]
Warm-up: Naïve Sketch
• Sample each string independently (don’t want to do $N^2$ sketches for comparing all pairs of $N$ strings)
• Small overlap
[Figure: sets $A$ and $B$ with independent samples]
Minhashing / Bottom-$d$ Sketch
• On the resemblance and containment of documents (Broder, 1997)
  • For comparing documents
• Hash each $k$-mer in a sequence
• Sketch $S(A)$: smallest $d$ hash values in $A$
  • Or take min for each of $d$ different hash functions
• Use same hash function for $S(A)$ and $S(B)$
  • Lets us sketch each string, but “simulate” sketching the union $S(A \cup B)$
  • Canonical $k$-mers, $A$ and $B$ could be reverse complements
Minhashing / Bottom-$d$ Sketch
• Sample smallest $d = 6$ hashes of $k$-mers in each set
[Figure: sets $A$ and $B$ with hash values 10, 3, 9, 2, 4, 7, 1, 5, 8; one configuration is marked “This can’t happen”]
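Comparing sketches
• Jaccard estimate $j$
  • $\frac{|A \cap B|}{|A \cup B|} \approx \frac{|S(A \cup B) \cap S(A) \cap S(B)|}{|S(A \cup B)|}$
• Get $S(A \cup B)$ by merge sort operation in $O(d)$ time
  • Merge until $d$ unique hashes seen
  • Count number of matches $c = 3$
  • $j = \frac{c}{d}$
• Error of estimate is $\epsilon = \frac{1}{\sqrt{d}}$
$S(A)$: 2, 3, 4, 7, 9, 10
$S(A \cup B)$: 1, 2, 3, 4, 5, 7
$S(B)$: 1, 2, 4, 5, 7, 8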
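The merge step can be sketched as a single pass over two sorted sketches. The function name is illustrative, and the code assumes both inputs are sorted lists of distinct hash values; the example values are the ones from the slide.

```python
def jaccard_from_sketches(sketch_a, sketch_b, d):
    """Estimate the Jaccard index from two sorted bottom-d sketches.

    Walks both lists like the merge step of merge sort, stops after d
    unique hashes (the sketch of the union), and counts those present
    in both sketches.
    """
    i = j = seen = matches = 0
    while seen < d and i < len(sketch_a) and j < len(sketch_b):
        if sketch_a[i] == sketch_b[j]:
            matches += 1
            i += 1
            j += 1
        elif sketch_a[i] < sketch_b[j]:
            i += 1
        else:
            j += 1
        seen += 1  # each step consumes exactly one unique hash value
    return matches / seen if seen else 0.0


S_A = [2, 3, 4, 7, 9, 10]
S_B = [1, 2, 4, 5, 7, 8]
print(jaccard_from_sketches(S_A, S_B, d=6))  # 3 matches out of 6 → 0.5
```

The six unique hashes visited are 1, 2, 3, 4, 5, 7, and of these 2, 4, and 7 appear in both sketches, giving $c = 3$ and $j = 3/6$.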
Building Bottom-$d$ Sketch
• Takes $O(n \log d)$ time
  • Traverse string, hashing $k$-mers
  • Keep sorted list of smallest $d$
  • Check each new hash against max in list
  • $O(\log d)$ time to insert if necessary
• Actually expected time $O(n + d \log d \log n)$
  • Because Pr[$i$th hash gets inserted in list] $= \frac{d}{i}$
  • So effectively linear
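A possible way to build the sketch is shown below. Note the substitution: instead of a sorted list, this sketch keeps the $d$ smallest hashes in a max-heap (via Python's `heapq` with negated values), which gives the same $O(\log d)$ insertion and constant-time access to the current maximum. The hash function is an assumed stand-in based on SHA-256; tools such as Mash use MurmurHash3.

```python
import hashlib
import heapq


def hash_kmer(kmer):
    """Assumed stand-in hash: first 8 bytes of SHA-256 as an integer."""
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")


def bottom_d_sketch(seq, k, d, hash_fn=hash_kmer):
    """Return the d smallest distinct k-mer hash values, sorted ascending."""
    heap = []     # max-heap through negation: -heap[0] is the largest kept hash
    kept = set()  # hashes currently in the heap, to skip repeated k-mers
    for i in range(len(seq) - k + 1):
        h = hash_fn(seq[i:i + k])
        if h in kept:
            continue
        if len(heap) < d:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # New hash beats the current maximum: swap it in, O(log d).
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    return sorted(-x for x in heap)


print(bottom_d_sketch("ATCTGAGGTCACATTGCA", k=7, d=4))
```

Most hashes fail the comparison against the current maximum and cost only constant time, which is why the expected running time is close to linear in $n$.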
Minhash parameters
• Probability some $k$-mer $x$ appears in a random genome of length $n$
  • $\Pr[x \in A] \approx 1 - \left(1 - |\Sigma|^{-k}\right)^{n}$
  • Alphabet size $|\Sigma| = 4$
• For $k = 16$, $n = 3$ Gbp:
  • Probability of a given 16-mer in a genome is $\approx 0.5$
  • $\approx 25\%$ of 16-mers expected to be shared between two random 3 Gbp genomes
• Too short $k$-mers can overestimate Jaccard, especially for distant genomes
• Very long could underestimate, but less of an issue
Minhash parameters
• Value of $k$ to achieve a desired probability $q$ of seeing a given $k$-mer in sequence length $n$
  • $k \approx \log_{|\Sigma|} \frac{n(1 - q)}{q}$
• 5 Mbp genome, $q = 0.01$, $k \approx 14$
• 3 Gbp genome, $q = 0.01$, $k \approx 19$
• Mash default: $k = 21$ and sketch size $s = 1000$
  • 8 kB per sketch
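The two formulas on these slides can be evaluated directly. Function names are illustrative, and the probability is computed with `log1p`/`expm1` only to avoid floating-point loss when $|\Sigma|^{-k}$ is tiny; it is the same expression $1 - (1 - |\Sigma|^{-k})^n$.

```python
import math


def prob_kmer_in_genome(k, n, sigma=4):
    """Pr[x in A] ≈ 1 - (1 - sigma^-k)^n for a random genome of length n."""
    return -math.expm1(n * math.log1p(-(sigma ** -k)))


def kmer_size_for(q, n, sigma=4):
    """k ≈ log_sigma(n (1 - q) / q) for a desired chance-hit probability q."""
    return math.log(n * (1.0 - q) / q, sigma)


print(prob_kmer_in_genome(16, 3_000_000_000))  # close to 0.5
print(kmer_size_for(0.01, 5_000_000))          # a little over 14
print(kmer_size_for(0.01, 3_000_000_000))      # a little over 19
```

The 8 kB figure for a default Mash sketch follows if each of the 1000 hashes is stored as a 64-bit (8-byte) value.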
Mash distance
• Mash distance based on Jaccard estimate $j$
  • $-\frac{1}{k} \ln \frac{2j}{1 + j}$
• Based on Poisson error model
• Implicitly uses average size of the two sets, penalizing sets of different size
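The distance formula translates into a one-line function. The name `mash_distance` is illustrative, and returning 1.0 when no hashes are shared is an assumed convention for the sketch, since the logarithm is undefined at $j = 0$.

```python
import math


def mash_distance(j, k):
    """Mash distance -(1/k) ln(2j / (1 + j)) from a Jaccard estimate j."""
    if j <= 0.0:
        return 1.0  # assumed convention: no shared hashes means maximal distance
    return -math.log(2.0 * j / (1.0 + j)) / k


print(mash_distance(0.5, 21))  # about 0.0193
```

Identical sketches ($j = 1$) give a distance of zero, and the distance grows as the Jaccard estimate shrinks.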
Some related works
• Assembly overlaps
  • Assembling large genomes with single-molecule sequencing and locality-sensitive hashing (Berlin et al., 2015)
• Containment for different size sets
  • Improving MinHash Via the Containment Index with Applications to Metagenomic Analysis (Koslicki and Zabeti, 2017)
Implementation
• MurmurHash3
• Open Bloom Filter Library
• Mash
Other Random Stuff
Fruit Fly Brains
• Locality Sensitive Hashing (LSH)
  • A neural algorithm for a fundamental computing problem (Dasgupta, Stevens, and Navlakha, 2017)
• Bloom filters
  • (Dasgupta, Sheehan, Stevens, and Navlakha, upcoming)
  • Have 3 special properties
    • Continuous-valued novelty
    • Distance sensitivity
    • Time sensitivity
Thanks!