Intelligent Stream Filtering Using MongoDB

Mihnea Giurgea

CONTENTS

• Who Are We?

• MongoDB On Amazon

• How We Do Stream-Filtering

• First Approach

• Second Approach

• Questions

UBERVU AT A GLANCE

• ~50K updates and inserts per minute

• ~30 Amazon instances

• 32 TB worth of EBS volumes

The force is strong at uberVU

DB INFRASTRUCTURE

• NO single points of failure

• SCALABLE horizontally & vertically

MULTIPLE DB ENVIRONMENTS

• 4 different mongo environments

• each with its own shards, config servers, etc.

• Why?

• isolate problems & bad behavior

• ++reliability

• better resource (hardware) distribution

• different number of shards per database

• some databases need more or fewer replica nodes

MULTIPLE ENVIRONMENTS

• application servers hold 4 mongos, instead of just 1

• each of the 3 config servers has 4 x mongod processes

MONGOD

• run only one mongod process per replica node

• each shard resides on an mdadm RAID 10 array

• consisting of 16 HDDs x 250 GB each

AMAZON EC2 INSTANCES

• mongod primary: High-Memory Double Extra Large (34.2 GB RAM)

• mongod secondary: High-Memory Extra Large (17.1 GB RAM)

• config servers: Large Instance (cheapest 64-bit machine)

• expensive for its purpose :(

THE PROBLEM

Gather mentions from the web (Twitter, Facebook, etc.)

Data Stream = mentions around a certain term

• mentions are annotated (language, location, sentiment, etc.)

• data stream is indexed in MongoDB

FILTERING

• filter data stream by time (since & until)

• filter by other attributes:

• platform: Twitter, Facebook

• language: English, French

• location: UK, US, Romania

• sentiment

• gender

• etc.

FILTERING

“MongoDB”

filtered by:

• United States

• gender: female

• sentiment: positive

FIRST APPROACH

• if no filters are needed, 1 index will suffice:

1. stream, time

• 1 filter => 2 indexes

1. stream, time

2. stream, platform, time

• sort attribute must be last in index

FIRST APPROACH

• 2 filters => 4 indexes

1. stream, time

2. stream, platform, time

3. stream, language, time

4. stream, platform, language, time

• ...etc... (F filters => 2^F indexes)
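The growth above can be sketched by enumerating one index per subset of filters. This is only an illustration: the helper name `indexesForFilters` is hypothetical and the field names are the ones used on the slides.

```javascript
// Hypothetical helper: list every compound index the first approach needs,
// one per subset of filters, each ending in the sort attribute (time).
function indexesForFilters(filters) {
  const indexes = [];
  // Each bitmask selects one subset of the filters (2^F subsets in total).
  for (let mask = 0; mask < (1 << filters.length); mask++) {
    const subset = filters.filter((_, i) => mask & (1 << i));
    // Equality fields first, sort attribute last.
    indexes.push(['stream', ...subset, 'time']);
  }
  return indexes;
}

const twoFilters = indexesForFilters(['platform', 'language']);
console.log(twoFilters.length); // 4

const fiveFilters = indexesForFilters(
  ['platform', 'language', 'location', 'sentiment', 'gender']
);
console.log(fiveFilters.length); // 32
```

With the five filters listed earlier, this already means 32 indexes per collection, which is why the first approach does not scale.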

IMPROVEMENTS

• don’t really need (stream, platform, language, time)

• when filtering for platform & language, use:

• stream, platform, time OR

• stream, language, time

• which one?

• the one with the smallest cardinality

IMPROVEMENTS

• saves index space

• but increases query scanning time

• finding the right indexes is a trade-off between:

• indexing space

• query scanning time

IMPROVEMENTS

• Question: when filtering by platform & language, what index should we use?

• stream, platform, time

• stream, language, time

• Answer: smallest cardinality

• we need to know the size of each attribute value (its share of the stream):

• platform: twitter - 90%

• language: English - 60%

• location: France - 8%
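A minimal sketch of the selection rule, assuming the per-value shares are available in a lookup table. The table layout and the helper name `pickIndex` are assumptions made for illustration; the percentages are the ones on the slide.

```javascript
// Assumed statistics: fraction of a stream's mentions matching each filter value.
const share = {
  'platform:twitter': 0.90,
  'language:english': 0.60,
  'location:france': 0.08,
};

// Pick the filter whose value matches the fewest mentions; the index
// (stream, <attribute>, time) for it leaves the fewest documents to scan.
function pickIndex(filters, share) {
  const best = filters.reduce((a, b) => (share[a] <= share[b] ? a : b));
  const attribute = best.split(':')[0];
  return ['stream', attribute, 'time'];
}

console.log(pickIndex(['platform:twitter', 'language:english'], share));
// [ 'stream', 'language', 'time' ]
```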

ATTRIBUTES

• normalize each attribute

• language: English => 13

• gender: male => 2038, etc.

• numbers use less space & are faster

• each mention now has several attributes:

{ 'platform': 'twitter', 'language': 'English', 'location': 'UK', 'gender': 'female' }

--->

{ 'attributes': [13, 213, 2039, 1] }
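The normalization step could look roughly like this. The code table is an assumption: only English => 13 and male => 2038 are stated on the slides, and the remaining codes are inferred from the example arrays. The function name `normalize` is hypothetical.

```javascript
// Assumed code table: 13 and 2038 come from the slides, the rest are inferred.
const CODES = {
  'platform:twitter': 1,
  'language:english': 13,
  'language:romanian': 58,
  'location:uk': 213,
  'gender:male': 2038,
  'gender:female': 2039,
};

// Replace each known (field, value) pair with its integer code.
function normalize(mention) {
  const attributes = [];
  for (const [field, value] of Object.entries(mention)) {
    const code = CODES[`${field}:${String(value).toLowerCase()}`];
    if (code !== undefined) attributes.push(code);
  }
  return { attributes };
}

const doc = normalize({
  platform: 'twitter', language: 'English', location: 'UK', gender: 'female',
});
console.log(doc); // { attributes: [ 1, 13, 213, 2039 ] }
```

The order of codes inside the array does not matter for a multikey index; only membership does.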

MULTIKEY INDEX

• use a multikey index for attributes:

• stream, attributes, time

• use $all to query for multiple filters

db.mentions.find( { 'stream': 'mongo', 'platform': 'twitter', 'gender': 'male', 'language': 'romanian' } )

--->

db.mentions.find( { 'stream': 'mongo', 'attributes': { '$all': [1, 2038, 58] } } )
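A sketch of how such a query document might be assembled in application code. The helper `buildStreamQuery` is hypothetical, and the `time` field name is assumed from the index definition (stream, attributes, time); the value type of `since` and `until` is not specified on the slides.

```javascript
// Build the filter document for one stream, a list of attribute codes,
// and an optional time window (since inclusive, until exclusive).
function buildStreamQuery(stream, codes, since, until) {
  const query = { stream };
  if (codes.length > 0) query.attributes = { $all: codes };
  if (since !== undefined || until !== undefined) {
    query.time = {};
    if (since !== undefined) query.time.$gte = since;
    if (until !== undefined) query.time.$lt = until;
  }
  return query;
}

console.log(JSON.stringify(buildStreamQuery('mongo', [1, 2038, 58])));
// {"stream":"mongo","attributes":{"$all":[1,2038,58]}}
```

The result would be passed to `db.mentions.find(...)`; with no codes the `attributes` clause is omitted, so the plain (stream, time) index can serve the query.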

SORT BY FILTER

• $all: only the first item uses the index!

• the rest are scanned through

• ensure the first item has the smallest cardinality

• for the smallest query scanning time

{ location: france} < { gender: male } < { platform: twitter }
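The ordering rule can be sketched as a sort over the codes before they go into `$all`. The share table is an assumption: the 90% and 8% figures come from the slides, while the code for France (77) and the share for male (50%) are made up for the example.

```javascript
// Assumed share of the stream matched by each attribute code.
const shareByCode = new Map([
  [1, 0.90],    // platform: twitter (share from the slides)
  [2038, 0.50], // gender: male (assumed share)
  [77, 0.08],   // location: france (hypothetical code, share from the slides)
]);

// Put the most selective code first, since only the first $all item uses the index.
function orderForAll(codes, shareByCode) {
  // Codes without statistics are treated as matching everything, so they sort last.
  return [...codes].sort(
    (a, b) => (shareByCode.get(a) ?? 1) - (shareByCode.get(b) ?? 1)
  );
}

console.log(orderForAll([1, 2038, 77], shareByCode)); // [ 77, 2038, 1 ]
```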

SECOND APPROACH

• now we only need 2 indexes!

• stream, time

• stream, attributes, time

• works for any number of filters

• is far from perfect

• but gets the job done

• with few resources

MORE IMPROVEMENTS

• don’t store all normalized attributes in index

• skip the very big ones:

• platform: twitter - 90%

• language: English - 60%

• 90% selection rate: no index needed

• decreases index size

• no noticeable performance loss
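One way to express the "skip the very big ones" rule is a filter applied when building the indexed `attributes` array. The cut-off value and the shares for codes 213 and 2039 are assumptions; the slides only give 90% and 60% as examples of values worth skipping, and they do not say how skipped attributes are matched at query time.

```javascript
// Assumed cut-off: values covering at least half the stream are not indexed.
const SKIP_THRESHOLD = 0.5;

// Keep only the codes selective enough to be worth an index entry.
function indexedCodes(codes, shareOf) {
  return codes.filter((code) => (shareOf[code] ?? 0) < SKIP_THRESHOLD);
}

// twitter = 1 (90%), English = 13 (60%); the other two shares are made up.
const shareOf = { 1: 0.90, 13: 0.60, 213: 0.05, 2039: 0.45 };
console.log(indexedCodes([13, 213, 2039, 1], shareOf)); // [ 213, 2039 ]
```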

USE _ID!

• use _id index instead of (stream, time)

• saves memory!

• Problem: _id must be unique

• (stream, time) index is not!

• Question: how to make (stream, time) unique?

• Answer: add some random number

USE _ID!

• pack stream, time & random into a number

• why: number look-ups are faster

• use all 64 bits available!
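A possible bit layout, sketched with BigInt. The slides do not give the actual split, so the 16/32/16 budget, the numeric stream id, and the function names are all assumptions. Putting the stream in the highest bits and time next means that ordering by _id matches ordering by (stream, time).

```javascript
// Assumed layout (high to low): 16 bits stream id, 32 bits unix seconds, 16 random bits.
const STREAM_BITS = 16n;
const TIME_BITS = 32n;
const RANDOM_BITS = 16n;

function fits(value, bits) {
  return value >= 0n && (value >> bits) === 0n;
}

function packId(streamId, unixSeconds, random) {
  const s = BigInt(streamId);
  const t = BigInt(unixSeconds);
  const r = BigInt(random);
  if (!fits(s, STREAM_BITS) || !fits(t, TIME_BITS) || !fits(r, RANDOM_BITS)) {
    throw new RangeError('field does not fit in its bit budget');
  }
  return (s << (TIME_BITS + RANDOM_BITS)) | (t << RANDOM_BITS) | r;
}

function unpackId(id) {
  return {
    streamId: Number(id >> (TIME_BITS + RANDOM_BITS)),
    unixSeconds: Number((id >> RANDOM_BITS) & ((1n << TIME_BITS) - 1n)),
    random: Number(id & ((1n << RANDOM_BITS) - 1n)),
  };
}

// Range of ids covering one stream between two timestamps (until is exclusive).
function idRange(streamId, since, until) {
  return { $gte: packId(streamId, since, 0), $lt: packId(streamId, until, 0) };
}

const id = packId(7, 1300000000, Math.floor(Math.random() * 65536));
console.log(unpackId(id).unixSeconds); // 1300000000
```

MongoDB's 64-bit integer type is signed, so a layout like this one would need the top bit kept clear (stream ids below 2^15) for stored values to sort in the same order.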

DUPLICATES

• we need to detect duplicates

• modify bit packing

• use mention.url

• instead of random bits

• uniquely identifies a mention

• for fastest index lookup use:

• db.mentions.find({ _id: docid }).count()
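A sketch of the modified packing, with bits derived from the mention URL in place of the random bits. The slides do not name a hash function or a bit budget: FNV-1a and the 16/32/16 layout are stand-ins chosen for the example, and the in-memory `Set` only simulates the _id lookup against the collection.

```javascript
// FNV-1a, 32-bit: a simple, well-known string hash used here as a stand-in.
// It hashes UTF-16 code units, which equals the byte-wise hash for ASCII URLs.
function fnv1a32(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Assumed layout (high to low): 16 bits stream id, 32 bits unix seconds,
// 16 bits taken from the URL hash. Inputs are assumed to be in range.
function mentionId(streamId, unixSeconds, url) {
  const urlBits = BigInt(fnv1a32(url) & 0xffff);
  return (BigInt(streamId) << 48n) | (BigInt(unixSeconds) << 16n) | urlBits;
}

// Stand-in for db.mentions.find({ _id: docid }).count() > 0.
const seenIds = new Set();
function isDuplicate(id) {
  if (seenIds.has(id)) return true;
  seenIds.add(id);
  return false;
}

const first = mentionId(7, 1300000000, 'http://example.com/post/1');
const again = mentionId(7, 1300000000, 'http://example.com/post/1');
console.log(isDuplicate(first)); // false
console.log(isDuplicate(again)); // true
```

With only 16 URL bits, two different URLs in the same stream and second can collide, so a real layout would probably give the URL hash more room.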

QUESTIONS?
