Mihnea Giurgea

Intelligent Stream Filtering Using MongoDB

Jul 02, 2015

Page 1: Intelligent Stream Filtering Using MongoDB

Mihnea Giurgea

Page 2: Intelligent Stream Filtering Using MongoDB

CONTENTS

• Who Are We?

• MongoDB On Amazon

• How We Do Stream-Filtering

• First Approach

• Second Approach

• Questions

Page 4: Intelligent Stream Filtering Using MongoDB

UBERVU AT A GLANCE

• ~50K updates and inserts per minute

• ~30 Amazon instances

• 32 TB worth of EBS volumes

The force is strong at uberVU

Page 6: Intelligent Stream Filtering Using MongoDB

DB INFRASTRUCTURE

Page 7: Intelligent Stream Filtering Using MongoDB

DB INFRASTRUCTURE

• NO single points of failure

• SCALABLE horizontally & vertically

Page 8: Intelligent Stream Filtering Using MongoDB

MULTIPLE DB ENVIRONMENTS

• 4 different mongo environments

• each with its own shards, config servers, etc.

• Why?

• isolate problems & bad behavior

• ++reliability

• better resource (hardware) distribution

• different number of shards per database

• some databases need more or fewer replica nodes

Page 9: Intelligent Stream Filtering Using MongoDB

MULTIPLE ENVIRONMENTS

• application servers hold 4 mongos, instead of just 1

• each of the 3 config servers has 4 x mongod processes

[Diagram: 3 config servers, each running 4 mongod processes]

Page 10: Intelligent Stream Filtering Using MongoDB

MONGOD

• run only one mongod process per replica node

• each shard resides on an mdadm RAID 10 array

• consisting of 16 HDDs x 250 GB each

Page 11: Intelligent Stream Filtering Using MongoDB

AMAZON EC2 INSTANCES

• mongod primary: High-Memory Double Extra Large, 34.2 GB of memory

• mongod secondary: High-Memory Extra Large, 17.1 GB of memory

• config servers: Large Instance (cheapest 64-bit machine)

• expensive for its purpose :(

Page 13: Intelligent Stream Filtering Using MongoDB

THE PROBLEM

Gather mentions from the web (Twitter, Facebook, etc.)

Data Stream = mentions around a certain term

• mentions are annotated (language, location, sentiment, etc.)

• the data stream is indexed in MongoDB

Page 14: Intelligent Stream Filtering Using MongoDB

FILTERING

• filter data stream by time (since & until)

• filter by other attributes:

• platform: Twitter, Facebook

• language: English, French

• location: UK, US, Romania

• sentiment

• gender

• etc.

Page 15: Intelligent Stream Filtering Using MongoDB

FILTERING

“MongoDB”

filtered by:

• United States

• gender: female

• sentiment: positive

Page 17: Intelligent Stream Filtering Using MongoDB

FIRST APPROACH

• if no filters are needed, 1 index will suffice:

1. stream, time

• 1 filter => 2 indexes

1. stream, time

2. stream, platform, time

• sort attribute must be last in index

Page 18: Intelligent Stream Filtering Using MongoDB

FIRST APPROACH

• 2 filters => 4 indexes

1. stream, time

2. stream, platform, time

3. stream, language, time

4. stream, platform, language, time

• ...etc... (F filters => 2^F indexes)
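
The growth in index count can be sketched by enumerating every subset of filters. This is only an illustration: the function name `firstApproachIndexes` is made up here, and the field names are the ones used on the slides.

```javascript
// Hypothetical helper: list every compound index the first approach needs.
// Each index starts with "stream", ends with the sort key "time",
// and holds one subset of the filter attributes in between.
function firstApproachIndexes(filters) {
  const indexes = [];
  // Each bit of `mask` decides whether the matching filter is in the index.
  for (let mask = 0; mask < (1 << filters.length); mask++) {
    const chosen = filters.filter((_, i) => mask & (1 << i));
    indexes.push(["stream", ...chosen, "time"]);
  }
  return indexes;
}

const indexes = firstApproachIndexes(["platform", "language"]);
indexes.forEach((fields) => console.log(fields.join(", ")));
```

Adding a third filter doubles the list again, which is why this approach stops being practical quickly.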

Page 19: Intelligent Stream Filtering Using MongoDB

IMPROVEMENTS

• don’t really need (stream, platform, language, time)

• when filtering for platform & language, use:

• stream, platform, time OR

• stream, language, time

• which one?

• the one with the smallest cardinality

Page 20: Intelligent Stream Filtering Using MongoDB

IMPROVEMENTS

• saves index space

• but increases query scanning time

• finding the right indexes is a trade-off between:

• indexing space

• query scanning time

Page 22: Intelligent Stream Filtering Using MongoDB

IMPROVEMENTS

• Question: when filtering by platform & language, what index should we use?

• stream, platform, time

• stream, language, time

• Answer: smallest cardinality

• we need to know the size of each attribute:

• platform: twitter - 90%

• language: English - 60%

• location: France - 8%
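
A minimal sketch of that choice, assuming the selection rates are kept in a simple lookup table. The three rates are the ones quoted above; the table layout and the name `pickIndex` are assumptions made for illustration.

```javascript
// Share of mentions matching each filter value (figures from the slide).
// The "attribute:value" key format is an assumption of this sketch.
const selectionRate = {
  "platform:twitter": 0.9,
  "language:english": 0.6,
  "location:france": 0.08,
};

// Pick the single-filter index whose value matches the fewest mentions.
function pickIndex(filters) {
  let best = null;
  for (const [attribute, value] of Object.entries(filters)) {
    const key = `${attribute}:${value}`;
    // Unknown values are treated as matching everything (worst case).
    const rate = key in selectionRate ? selectionRate[key] : 1.0;
    if (best === null || rate < best.rate) {
      best = { attribute, rate };
    }
  }
  return best === null
    ? ["stream", "time"]
    : ["stream", best.attribute, "time"];
}

const chosen = pickIndex({ platform: "twitter", language: "english" });
console.log(chosen.join(", "));
```

Here English (60%) is rarer than Twitter (90%), so the language index leaves fewer documents to scan.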

Page 23: Intelligent Stream Filtering Using MongoDB

ATTRIBUTES

• normalize each attribute

• language: English => 13

• gender: male => 2038, etc.

• numbers use less space & are faster

• each mention now has several attributes:

{ 'platform': 'twitter', 'language': 'English', 'location': 'UK', 'gender': 'female' }

--->

{ 'attributes': [13, 213, 2039, 1] }
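
The normalization step might look like the sketch below. The ID table is inferred from the numbers on the slides (English => 13, male => 2038, and the example arrays); how uberVU actually stores the mapping is not stated, so `ATTRIBUTE_IDS` and `normalize` are hypothetical.

```javascript
// Hypothetical lookup table; IDs inferred from the slide examples.
const ATTRIBUTE_IDS = {
  "platform:twitter": 1,
  "language:english": 13,
  "language:romanian": 58,
  "location:uk": 213,
  "gender:male": 2038,
  "gender:female": 2039,
};

// Replace named attributes with one array of numeric IDs.
function normalize(mention) {
  const ids = [];
  for (const [attribute, value] of Object.entries(mention)) {
    const id = ATTRIBUTE_IDS[`${attribute}:${String(value).toLowerCase()}`];
    // Values without an ID are skipped in this sketch.
    if (id !== undefined) ids.push(id);
  }
  return { attributes: ids };
}

const normalized = normalize({
  platform: "twitter",
  language: "English",
  location: "UK",
  gender: "female",
});
console.log(JSON.stringify(normalized));
```

The IDs come out in key order rather than the order shown on the slide; for membership tests on the array the order does not matter.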

Page 24: Intelligent Stream Filtering Using MongoDB

MULTIKEY INDEX

• use a multikey index for attributes:

• stream, attributes, time

• use $all to query for multiple filters

db.mentions.find( {
  'stream': 'mongo',
  'platform': 'twitter',
  'gender': 'male',
  'language': 'romanian'
} )

--->

db.mentions.find( {
  'stream': 'mongo',
  'attributes': { '$all': [1, 2038, 58] }
} )
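
As a rough sketch, the translation from named filters to the `$all` query can be done by a small helper. The `IDS` table and `buildQuery` are assumptions for illustration; only the query shape comes from the slide.

```javascript
// Hypothetical ID table covering the values used in the slide's query.
const IDS = {
  "platform:twitter": 1,
  "gender:male": 2038,
  "language:romanian": 58,
};

// Turn { attribute: value } filters into a query on the attributes array.
function buildQuery(stream, filters) {
  const ids = Object.entries(filters).map(([attribute, value]) => {
    const id = IDS[`${attribute}:${value}`];
    if (id === undefined) {
      throw new Error(`unknown attribute value ${attribute}:${value}`);
    }
    return id;
  });
  // With no filters, the plain (stream, time) index serves the query.
  return ids.length === 0
    ? { stream }
    : { stream, attributes: { $all: ids } };
}

const query = buildQuery("mongo", {
  platform: "twitter",
  gender: "male",
  language: "romanian",
});
console.log(JSON.stringify(query));
// With a live connection, this object would be passed to db.mentions.find(query).
```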

Page 25: Intelligent Stream Filtering Using MongoDB

SORT BY FILTER

• $all: only the first item uses the index!

• the rest are scanned through

• ensure the first item has the smallest cardinality

• for the smallest query scanning time

{ location: france} < { gender: male } < { platform: twitter }
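
One way to apply this ordering is to sort the IDs by how often they occur before building the query. In this sketch the 8% and 90% rates come from the slides, while the rate for male and the ID 77 for France are invented placeholders.

```javascript
// Share of mentions carrying each attribute ID.
const frequency = new Map([
  [77, 0.08],  // location: france (ID 77 is made up for this sketch)
  [2038, 0.5], // gender: male (rate is a guess)
  [1, 0.9],    // platform: twitter
]);

// Unknown IDs are treated as matching everything, so they sort last.
function rateOf(id) {
  return frequency.has(id) ? frequency.get(id) : 1.0;
}

// Put the rarest attribute first, since only that item uses the index.
function buildOrderedQuery(stream, ids) {
  const ordered = [...ids].sort((a, b) => rateOf(a) - rateOf(b));
  return { stream, attributes: { $all: ordered } };
}

const orderedQuery = buildOrderedQuery("mongo", [1, 2038, 77]);
console.log(JSON.stringify(orderedQuery));
```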

Page 26: Intelligent Stream Filtering Using MongoDB

SECOND APPROACH

• now we only need 2 indexes!

• stream, time

• stream, attributes, time

• works for any number of filters

• is far from perfect

• but gets the job done

• with few resources

Page 27: Intelligent Stream Filtering Using MongoDB

MORE IMPROVEMENTS

• don’t store all normalized attributes in index

• skip the very big ones:

• platform: twitter - 90%

• language: English - 60%

• 90% selection rate: no index needed

• decreases index size

• no noticeable performance loss
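
A possible way to skip the big attributes is to split each mention's IDs into an indexed array and an unindexed one at write time. The slides do not say how the skipped attributes are stored, so the `common_attributes` field, the 0.5 cut-off, and the ID 77 for France are all assumptions.

```javascript
// Hypothetical cut-off: attributes matching more than half of all
// mentions are kept out of the indexed array.
const MAX_INDEXED_RATE = 0.5;

// Rates from the slides; ID 77 (location: france) is made up.
const rates = new Map([
  [1, 0.9],   // platform: twitter
  [13, 0.6],  // language: English
  [77, 0.08], // location: france
]);

function splitAttributes(ids) {
  const indexed = [];
  const unindexed = [];
  for (const id of ids) {
    // IDs without a known rate are assumed to be rare.
    const rate = rates.has(id) ? rates.get(id) : 0;
    (rate > MAX_INDEXED_RATE ? unindexed : indexed).push(id);
  }
  // Only `attributes` would be part of the multikey index.
  return { attributes: indexed, common_attributes: unindexed };
}

const doc = splitAttributes([1, 13, 77]);
console.log(JSON.stringify(doc));
```

A filter on a skipped attribute would then be checked while scanning, which costs little because it matches most documents anyway.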

Page 28: Intelligent Stream Filtering Using MongoDB

USE _ID!

• use _id index instead of (stream, time)

• saves memory!

• Problem: _id must be unique

• (stream, time) index is not!

• Question: how to make (stream, time) unique?

• Answer: add some random number

Page 29: Intelligent Stream Filtering Using MongoDB

USE _ID!

• pack stream, time & random into a number

• why: number look-ups are faster

• use all 64 bits available!
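
The packing could be sketched as below. The bit widths, the numeric stream ID, and the function names are assumptions; the slides only say that stream, time and a random part share one 64-bit number. This sketch uses 63 bits so the value stays non-negative in a signed 64-bit integer, which is what BSON stores.

```javascript
// Assumed layout, high bits to low bits: stream | time | random.
const STREAM_BITS = 15n;
const TIME_BITS = 32n;
const RANDOM_BITS = 16n;

function checkRange(value, bits, name) {
  if (value < 0n || value >= (1n << bits)) {
    throw new RangeError(`${name} does not fit in ${bits} bits`);
  }
  return value;
}

function packId(streamId, unixSeconds, random) {
  const s = checkRange(BigInt(streamId), STREAM_BITS, "streamId");
  const t = checkRange(BigInt(unixSeconds), TIME_BITS, "unixSeconds");
  const r = checkRange(BigInt(random), RANDOM_BITS, "random");
  return (s << (TIME_BITS + RANDOM_BITS)) | (t << RANDOM_BITS) | r;
}

function unpackId(id) {
  return {
    streamId: Number(id >> (TIME_BITS + RANDOM_BITS)),
    unixSeconds: Number((id >> RANDOM_BITS) & ((1n << TIME_BITS) - 1n)),
    random: Number(id & ((1n << RANDOM_BITS) - 1n)),
  };
}

// _id bounds covering one stream between two timestamps (inclusive).
function idRange(streamId, since, until) {
  return {
    $gte: packId(streamId, since, 0),
    $lte: packId(streamId, until, (1n << RANDOM_BITS) - 1n),
  };
}

const id = packId(3, 1435795200, Math.floor(Math.random() * 65536));
console.log(id.toString(), unpackId(id));
```

Because the stream occupies the highest bits and time the next ones, sorting by `_id` gives the same order as sorting by (stream, time), so a range on `_id` can stand in for the separate index. Storing the value would need the driver's 64-bit integer type, for example `NumberLong` in the mongo shell.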

Page 30: Intelligent Stream Filtering Using MongoDB

DUPLICATES

• we need to detect duplicates

• modify bit packing

• use mention.url

• instead of random bits

• uniquely identifies a mention

• for fastest index lookup use:

• db.mentions.find({ _id: docid }).count()
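
A sketch of the modified packing, assuming the low bits are filled with a hash of the mention URL. The hash (an FNV-1a style function over UTF-16 code units), the bit widths, and the in-memory `Set` standing in for the `_id` index are all choices made for this illustration, not details from the slides.

```javascript
// FNV-1a style 32-bit hash; any stable hash would do for this sketch.
function fnv1a32(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Assumed layout: stream | time (32 bits) | URL hash (16 bits).
const TIME_BITS = 32n;
const URL_BITS = 16n;

function mentionId(streamId, unixSeconds, url) {
  const urlBits = BigInt(fnv1a32(url) & 0xffff);
  return (
    (BigInt(streamId) << (TIME_BITS + URL_BITS)) |
    (BigInt(unixSeconds) << URL_BITS) |
    urlBits
  );
}

// Stand-in for the collection's _id index.
const seen = new Set();

// Returns true when the mention is new, false when it is a duplicate.
function insertIfNew(streamId, unixSeconds, url) {
  const key = mentionId(streamId, unixSeconds, url).toString();
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
}

console.log(insertIfNew(3, 1435795200, "https://example.com/status/1"));
console.log(insertIfNew(3, 1435795200, "https://example.com/status/1"));
```

The same stream, time and URL always produce the same `_id`, so a re-fetched mention is caught by the lookup. With only 16 hash bits, two different URLs in the same stream and second can collide, so a real layout would have to weigh how many bits each part gets.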

Page 32: Intelligent Stream Filtering Using MongoDB

?