Bloom Filters, Minhashes, and Other Random Stuff

Brian Brubach, University of Maryland, College Park

StringBio 2018, University of Central Florida

What?

• Probabilistic
• Space-efficient
• Fast
• Not exact

Why?

• Data deluge / Big data / Massive data
• Millions or billions of sequences
• Human genome: 3 Gbp
• 1 giga base pairs = 1 billion characters
• Microbiome sample of 1.6 billion 100 bp reads generated in 10.8 days (Caporaso, et al., 2012)
• Medium data, but on a laptop
• Lots of bioinformatics happens here
• Beyond scalability of BWT, FM-index, etc. (Berger, Daniels, and Yu, 2016)

Curse of Dimensionality

• Sequences are compared in high dimensional space
• Comparing $N$ sequences takes $N^2$ time
• Computing edit distance between two sequences of length $n$ takes $n^2$ time
• Allegedly

Example sequences:

• ATGATCGAGGCTATGCGACCGATCGATCGATTCGTA
• ATGATGGAGGCTATGGGAACGATCGATCGACTCGTA
• ATGATCGAGGCTATGCCACCGATCGAACGATTCGTA
• ATCATCGAGGCTATGCGACCGTTCGATCGATTCCTA
• GTGATCGTGGCTATGCGACCGATCGATCGATTCGTC
• ATGATCGAGGCTATGCCACCGATCGAACGATTCGTA
• ATGATCCAGGCTATGCGACCGATCGATGCATTCGTA

Why Stay in High Dimensions?

• $4^{100}$ possible DNA strings of length 100
• $4^{15} \approx 1$ billion reads

$k$-mers of a Sequence

• All substrings of length $k$
• Canonical: lexicographically smallest among forward and reverse complement
• Forget this for now

Sequence: ATCTGAGGTCAC

All 7-mers: ATCTGAG TCTGAGG CTGAGGT TGAGGTC GAGGTCA AGGTCAC

Reverse complement: ATCTGAGGTCAC → GTGACCTCAGAT
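The definitions above can be sketched in a few lines of Python. The function names (`kmers`, `reverse_complement`, `canonical`) are illustrative choices, not taken from the talk or from any particular library, and the code assumes an uppercase A/C/G/T alphabet.

```python
def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmers(seq: str, k: int) -> list[str]:
    """All substrings of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def canonical(kmer: str) -> str:
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


seq = "ATCTGAGGTCAC"
print(kmers(seq, 7))
print(reverse_complement(seq))
print(canonical("TGAGGTC"))
```

With canonical $k$-mers, a sequence and its reverse complement yield the same set of $k$-mers, which matters later when sketches of two genomes are compared.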

Hash function

String → Hash Magic → Random integer in {1, ..., m}

• Will assume idealized model of hashing for this talk
• Lots of research in this area

Bloom Filter Example Problem

• Store a large set of $N$ $k$-mers
• Query $k$-mers against it for exact matches
• Want speed and space-efficiency
• How can we address this with hashing?
  • Put $k$-mers in hash table
  • At least $2Nk$ bits for data plus table overhead
• What if we just store one bit at each hash for presence/absence?
  • Simple Bloom filter, potentially suboptimal

Bloom Filter

• Probabilistic data structure
• Fast and space-efficient
• False positives, but no false negatives
• Insert and contains, but no delete
• Due to Burton Howard Bloom in 1970
  • Gave example of automatic hyphenation
  • Identify the 10% of words that require special hyphenation rules

Bloom Filter

• $N$ items to store: $x_1, x_2, \ldots, x_N$
• $m$-bit vector
• $d$ hash functions: $h_1, h_2, \ldots, h_d$
• Insert($x$): set bits $h_1(x), h_2(x), \ldots, h_d(x)$ to 1
• Contains($y$):
  • Yes if bits $h_1(y), h_2(y), \ldots, h_d(y)$ are all 1
  • No if any are 0
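A minimal sketch of the insert and contains operations, under stated assumptions: the class name `BloomFilter` and its methods are illustrative, and the $d$ hash functions are simulated by salting BLAKE2b from Python's standard `hashlib` with the hash index, as a stand-in for the idealized hashing assumed in the talk. The bit vector uses one byte per bit for readability, so it is not itself space-efficient.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: an m-bit vector and d salted hash functions."""

    def __init__(self, m: int, d: int) -> None:
        self.m = m
        self.d = d
        # One byte per bit keeps the code simple; a real filter would pack bits.
        self.bits = bytearray(m)

    def _positions(self, item: str):
        # Assumption: hash function h_i is BLAKE2b salted with the index i.
        for i in range(self.d):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.m

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def contains(self, item: str) -> bool:
        # "No" is always correct; "yes" may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1000, d=3)
for kmer in ["ATCTGAG", "TCTGAGG", "CTGAGGT"]:
    bf.insert(kmer)
print(bf.contains("TCTGAGG"))  # inserted items are always reported present
```

There is no delete operation because clearing a bit could remove evidence of another item that hashed to the same position.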

Bloom Filter Example

• $m = 10$, $d = 3$, hash functions: $h_1, h_2, h_3$
• Empty filter: 0 0 0 0 0 0 0 0 0 0
• Insert($x_1$) sets bits $h_1(x_1), h_2(x_1), h_3(x_1)$: 0 1 1 0 0 0 1 0 0 0
• Insert($x_2$) sets bits $h_1(x_2), h_2(x_2), h_3(x_2)$: 0 1 1 0 0 0 1 1 0 1
• Contains($x_2$) checks bits $h_1(x_2), h_2(x_2), h_3(x_2)$: all are 1, so yes
• Contains($y$) checks bits $h_1(y), h_2(y), h_3(y)$: all are 1 although $y$ was never inserted. False Positive!

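False Positive Probability

• Pr[one hash misses a bit] $= 1 - \frac{1}{m}$
• Pr[one insertion misses a bit] $= \left(1 - \frac{1}{m}\right)^{d}$
• Pr[all insertions miss a bit] $= \left(1 - \frac{1}{m}\right)^{dN}$
• Pr[a single bit flipped to 1] $= 1 - \left(1 - \frac{1}{m}\right)^{dN} \approx 1 - e^{-dN/m}$
• False positive probability (assuming independence): $\left(1 - e^{-dN/m}\right)^{d}$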
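To see how close the exponential approximation is, the two expressions can be evaluated side by side. This is a small numerical check, not part of the talk; the function names and the parameter values ($m = 10000$, $d = 7$, $N = 1000$) are chosen only for illustration.

```python
import math


def fp_exact(m: int, d: int, n_items: int) -> float:
    """(1 - (1 - 1/m)^(dN))^d, still assuming independent bits."""
    return (1 - (1 - 1 / m) ** (d * n_items)) ** d


def fp_approx(m: int, d: int, n_items: int) -> float:
    """(1 - e^(-dN/m))^d, the closed-form approximation."""
    return (1 - math.exp(-d * n_items / m)) ** d


m, d, n_items = 10_000, 7, 1_000
print(fp_exact(m, d, n_items), fp_approx(m, d, n_items))
```

For these values both expressions come out a little above 0.008, and they differ only far past the decimal places that matter when choosing parameters.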

Optimal Parameters

• False positive rate $p \approx \left(1 - e^{-dN/m}\right)^{d}$
• False positives minimized at $d = \frac{m}{N} \ln 2$
• Bits per item $\frac{m}{N} \approx -\frac{\log_2 p}{\ln 2} \approx -1.44 \log_2 p$
• Approximate: assuming asymptotic, independence, and integrality of $d$
  • $p = 0.01$, needs 9.59 bits per item
  • $p = 0.001$, needs 14.38 bits per item
• Number of hashes $d \approx -\log_2 p$
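The two figures quoted above follow directly from the formulas. A short sketch, with illustrative function names, that reproduces them:

```python
import math


def bits_per_item(p: float) -> float:
    """m/N at the optimum: -log2(p) / ln(2), about -1.44 * log2(p)."""
    return -math.log2(p) / math.log(2)


def optimal_hashes(p: float) -> float:
    """d at the optimum: -log2(p), before rounding to an integer."""
    return -math.log2(p)


for p in (0.01, 0.001):
    print(p, round(bits_per_item(p), 2), round(optimal_hashes(p), 2))
```

In practice $d$ has to be an integer, so the achieved false positive rate is slightly off from the target $p$.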

Properties

• Insert and check in $O(d)$ time
  • Independent of number of items inserted
• Fast and parallel to compute hashes
• Can do union and intersection with OR and AND of bit vectors
• Can estimate $N$ if unknown
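The union and intersection property can be illustrated with Python integers standing in for bit vectors. This is a sketch under assumptions: both filters must share the same $m$ and the same hash functions, the sizes `M` and `D` are arbitrary, and the salted BLAKE2b hashes are a stand-in for idealized hashing.

```python
import hashlib

M, D = 64, 3  # illustrative sizes, not tuned


def positions(item: str) -> list[int]:
    """Bit positions for an item under D salted hash functions."""
    return [
        int.from_bytes(
            hashlib.blake2b(item.encode(), digest_size=8, salt=bytes([i])).digest(),
            "little",
        ) % M
        for i in range(D)
    ]


def make_filter(items) -> int:
    """Build a filter as a Python int whose set bits are the filter bits."""
    bits = 0
    for item in items:
        for pos in positions(item):
            bits |= 1 << pos
    return bits


def contains(bits: int, item: str) -> bool:
    return all((bits >> pos) & 1 for pos in positions(item))


a = make_filter(["ATCTGAG", "TCTGAGG"])
b = make_filter(["TCTGAGG", "CTGAGGT"])
union = a | b          # identical to a filter built from the union of the sets
intersection = a & b   # contains every shared item, possibly with extra bits set
print(contains(union, "CTGAGGT"), contains(intersection, "TCTGAGG"))
```

The OR of two filters equals the filter of the union exactly. The AND is only guaranteed to have no false negatives for shared items, so its false positive rate can be higher than that of a filter built directly from the intersection.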

Endless Variations

• Deletions
• Counting
• Bloomier filters: storing values
• Cache optimizations
• Distance sensitive: is $x$ close to the set

$k$-mer Bloom Filter

• Can we do better if we know the items are $k$-mers from a genome?
• Observation: the “items” are overlapping substrings from a 4 letter alphabet
• After getting positive,
  • Check all 4 preceding $k$-mers and all 4 following $k$-mers
  • One must be in the set for a true positive
  • False positive next to another positive less likely
• Can reduce false positives or space
• (Pellow, Filippova, and Kingsford, 2017)
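The neighbor check can be sketched as follows. This is a simplified illustration of the idea in the bullets above, not the implementation from Pellow, Filippova, and Kingsford; the names `neighbors` and `contains_checked` are hypothetical, and a plain Python set with one extra element stands in for a Bloom filter that reports a false positive. It also assumes every stored sequence contributes at least two $k$-mers, so each true $k$-mer has a neighbor.

```python
def neighbors(kmer: str, alphabet: str = "ACGT") -> list[str]:
    """The 4 possible preceding and 4 possible following k-mers."""
    preceding = [c + kmer[:-1] for c in alphabet]
    following = [kmer[1:] + c for c in alphabet]
    return preceding + following


def contains_checked(filt, kmer: str) -> bool:
    """Accept a positive only if some neighboring k-mer is also positive."""
    return kmer in filt and any(n in filt for n in neighbors(kmer))


genome = "ATCTGAGGTCAC"
k = 7
true_kmers = {genome[i:i + k] for i in range(len(genome) - k + 1)}
# Stand-in for a Bloom filter that happens to report "TTTTACG" as present.
filt = true_kmers | {"TTTTACG"}

print(contains_checked(filt, "TCTGAGG"))  # a real k-mer with a real neighbor
print(contains_checked(filt, "TTTTACG"))  # isolated positive, rejected
```

Any object supporting the `in` operator can be passed as `filt`, so a real Bloom filter with a `__contains__` method would work in place of the set.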


Bio Applications

• Pan-genome storage
  • Bloom filter trie (Holley, Wittler, and Stoye, 2015)
• Short-read RNA-seq database
  • Split Sequence Bloom tree (Solomon and Kingsford, 2016)
• Succinct de Bruijn graphs
  • Probabilistic de Bruijn graph (Pell, et al., 2011)
  • Exact version (Chikhi and Rizk, 2012)
  • Human genome: 3 Gbp, $k = 27$, 3.7 GB, 13.2 bits per vertex

Locality Sensitive Hashing (LSH)

• What do we typically want to avoid when hashing?
  • Collisions!
• Approximate nearest neighbors: towards removing the curse of dimensionality (Indyk and Motwani, 1998)
  • Idea: get similar elements to hash together
  • “Its key ingredient is the notion of locality-sensitive hashing which may be of independent interest; …”

Comparing Two Sequences

• Mash: fast genome and metagenome distance estimation using MinHash (Ondov et al., 2016)
• Let $A$ and $B$ be two DNA sequences to compare
• Construct $k$-mer sets $A$ and $B$
  • Assume $|A| = |B|$ for now (not true)
• Compare the sets somehow
• Not faster yet, but we’ll get there…

Jaccard Index

• Similarity between sets $A$ and $B$: $\frac{|A \cap B|}{|A \cup B|}$
• Correlated with Average Nucleotide Identity (ANI)
  • Empirical support, but debatable
• What would you do if you were studying a population? Sample!

[Figure: overlapping sets A (people who like peanut butter) and B (people who like jelly)]

Sketch

• Small “fingerprint” of a data point (string)

[Figure: sets A and B]

Warm-up: Naïve Sketch

• Sample each string independently (don’t want to do $N^2$ sketches for comparing all pairs of $N$ strings)
• Small overlap

[Figure: sets A and B]

Minhashing / Bottom-$d$ Sketch

• On the resemblance and containment of documents (Broder, 1997)
  • For comparing documents
• Hash each $k$-mer in a sequence
• Sketch $S(A)$: smallest $d$ hash values in $A$
  • Or take min for each of $d$ different hash functions
• Use same hash function for $S(A)$ and $S(B)$
• Lets us sketch each string, but “simulate” sketching the union $S(A \cup B)$
• Canonical $k$-mers: $A$ and $B$ could be reverse complements


• Sample smallest $d = 6$ hashes of $k$-mers in each set

[Figure: sets A and B containing hash values 10, 3, 9, 2, 4, 7, 1, 5, 8; a later version of the figure is annotated “This can’t happen”]

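Comparing Sketches

• Jaccard estimate $j$
  • $\frac{|A \cap B|}{|A \cup B|} \approx \frac{|S(A \cup B) \cap S(A) \cap S(B)|}{|S(A \cup B)|}$
• Get $S(A \cup B)$ by merge sort operation in $O(d)$ time
  • Merge until $d$ unique hashes seen
  • Count number of matches $c = 3$
  • $j = \frac{c}{d}$
• Error of estimate is $\epsilon = \frac{1}{\sqrt{d}}$

Example with $d = 6$:

• $S(A)$: 2 3 4 7 9 10
• $S(A \cup B)$: 1 2 3 4 5 7
• $S(B)$: 1 2 4 5 7 8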
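The merge step can be sketched as a two-pointer walk over the two sorted sketches. The function name `jaccard_estimate` is illustrative; the handling of sketches shorter than $d$ (counting the leftover values of the longer sketch toward the union) is an assumption added for completeness and does not come from the slides.

```python
def jaccard_estimate(sketch_a: list[int], sketch_b: list[int], d: int) -> float:
    """Estimate Jaccard from two ascending lists of distinct hash values."""
    ia = ib = seen = matches = 0
    # Walk both lists in sorted order; each step reveals one hash of S(A ∪ B).
    while seen < d and ia < len(sketch_a) and ib < len(sketch_b):
        if sketch_a[ia] == sketch_b[ib]:
            matches += 1
            ia += 1
            ib += 1
        elif sketch_a[ia] < sketch_b[ib]:
            ia += 1
        else:
            ib += 1
        seen += 1
    # Assumption: if one sketch ran out early, the rest of the other sketch
    # still belongs to the union, up to d values in total.
    leftover = (len(sketch_a) - ia) + (len(sketch_b) - ib)
    seen += min(d - seen, leftover)
    return matches / seen if seen else 0.0


s_a = [2, 3, 4, 7, 9, 10]
s_b = [1, 2, 4, 5, 7, 8]
print(jaccard_estimate(s_a, s_b, d=6))  # 3 matches (2, 4, 7) out of 6
```

The walk visits 1, 2, 3, 4, 5, 7 in that order, which is exactly $S(A \cup B)$ from the example, and stops once $d = 6$ unique hashes have been seen.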

Building Bottom-$d$ Sketch

• Takes $O(n \log d)$ time
  • Traverse string, hashing $k$-mers
  • Keep sorted list of smallest $d$
  • Check each new hash against max in list
  • $O(\log d)$ time to insert if necessary
• Actually expected time $O(n + d \log d \log n)$
  • Because Pr[$i$th hash gets inserted in list] $= \frac{d}{i}$
  • So effectively linear
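One way to realize the procedure above is with a bounded max-heap instead of a sorted list, which gives the same $O(\log d)$ insertion cost. This is a sketch under assumptions: `hash_kmer` uses BLAKE2b from `hashlib` as a stand-in hash (the tools named later in the talk use MurmurHash3), canonical $k$-mers are ignored, and the function names are illustrative.

```python
import hashlib
import heapq


def hash_kmer(kmer: str) -> int:
    """Stand-in 64-bit hash; not the hash used by any particular tool."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def bottom_d_sketch(seq: str, k: int, d: int) -> list[int]:
    """Return the d smallest distinct k-mer hashes of seq, ascending."""
    heap: list[int] = []        # max-heap via negated values
    in_sketch: set[int] = set() # avoids storing a repeated k-mer twice
    for i in range(len(seq) - k + 1):
        h = hash_kmer(seq[i:i + k])
        if h in in_sketch:
            continue
        if len(heap) < d:
            heapq.heappush(heap, -h)
            in_sketch.add(h)
        elif h < -heap[0]:
            # Cheap comparison against the current max; insert only if smaller.
            evicted = -heapq.heappushpop(heap, -h)
            in_sketch.discard(evicted)
            in_sketch.add(h)
    return sorted(-x for x in heap)


print(bottom_d_sketch("ATCTGAGGTCACATCTGAGGTTAC", k=7, d=4))
```

Most hashes fail the comparison against the current maximum and cost $O(1)$, which is why the expected running time is close to linear in $n$.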

Minhash Parameters

• Probability some $k$-mer $x$ appears in a random genome of length $n$
  • $\Pr[x \in A] \approx 1 - \left(1 - |\Sigma|^{-k}\right)^{n}$
  • Alphabet size $|\Sigma| = 4$
• For $k = 16$, $n = 3$ Gbp:
  • Probability of a given 16-mer in a genome is $\approx 0.5$
  • $\approx 25\%$ of 16-mers expected to be shared between two random 3 Gbp genomes
• Too short $k$-mers can overestimate Jaccard, especially for distant genomes
• Very long could underestimate, but less of an issue
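The 0.5 and 25% figures can be checked numerically. The function name `prob_kmer_present` is illustrative, and `log1p`/`expm1` are used only to keep the computation accurate when $|\Sigma|^{-k}$ is tiny.

```python
import math


def prob_kmer_present(k: int, n: int, sigma: int = 4) -> float:
    """1 - (1 - sigma^-k)^n for a random sequence of length n."""
    return -math.expm1(n * math.log1p(-(sigma ** -k)))


p = prob_kmer_present(k=16, n=3_000_000_000)
print(round(p, 3))      # chance a given 16-mer occurs in one random genome
print(round(p * p, 3))  # chance it occurs in both of two independent genomes
```

Squaring the probability treats the two genomes as independent random sequences, which is the sense in which about a quarter of 16-mers are expected to be shared by chance.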

Minhash Parameters

• Value of $k$ to achieve a desired probability $q$ of seeing a given $k$-mer in sequence length $n$
  • $k \approx \log_{|\Sigma|} \frac{n(1 - q)}{q}$
• 5 Mbp genome, $q = 0.01$, $k \approx 14$
• 3 Gbp genome, $q = 0.01$, $k \approx 19$
• Mash default: $k = 21$ and $s = 1000$
  • 8 kB per sketch
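The two example values of $k$ follow from the formula. A short check, with an illustrative function name and rounding to the nearest integer as the slide's "≈" suggests:

```python
import math


def kmer_length(n: int, q: float, sigma: int = 4) -> float:
    """log base sigma of n(1 - q)/q, before rounding."""
    return math.log(n * (1 - q) / q, sigma)


for n in (5_000_000, 3_000_000_000):
    print(n, round(kmer_length(n, q=0.01)))
```

The unrounded values are roughly 14.4 and 19.1, so the choice between rounding and taking the ceiling shifts $k$ by at most one.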

Mash Distance

• Mash distance based on Jaccard estimate $j$
  • $-\frac{1}{k} \ln \frac{2j}{1 + j}$
• Based on Poisson error model
• Implicitly uses average size of the two sets, penalizing sets of different size
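The distance formula can be evaluated directly from a Jaccard estimate. The function name `mash_distance` is illustrative, and returning 1.0 when $j = 0$ is an assumption of this sketch to avoid taking the logarithm of zero.

```python
import math


def mash_distance(j: float, k: int) -> float:
    """-(1/k) * ln(2j / (1 + j)) for a Jaccard estimate j in (0, 1]."""
    if j <= 0:
        return 1.0  # assumption: cap the distance when no hashes are shared
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


print(mash_distance(0.5, k=21))
print(mash_distance(1.0, k=21))
```

For the earlier example with $j = 3/6$ and the default $k = 21$, the distance is about 0.019, and identical sketches give a distance of 0.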

Some Related Works

• Assembly overlaps
  • Assembling large genomes with single-molecule sequencing and locality-sensitive hashing (Berlin et al., 2015)
• Containment for different size sets
  • Improving MinHash Via the Containment Index with Applications to Metagenomic Analysis (Koslicki and Zabeti, 2017)

Implementation

• MurmurHash3
• Open Bloom Filter Library
• Mash

Other Random Stuff

Fruit Fly Brains

• Locality Sensitive Hashing (LSH)
  • A neural algorithm for a fundamental computing problem (Dasgupta, Stevens, and Navlakha, 2017)
• Bloom filters
  • (Dasgupta, Sheehan, Stevens, and Navlakha, upcoming)
  • Have 3 special properties
    • Continuous-valued novelty
    • Distance sensitivity
    • Time sensitivity

Thanks!
