Data structures for cloud tag storage

Post on 25-Jan-2017


Tagging schema design for high performance

Plan

▪ Tagging basics
▪ Database challenges
▪ Tagging solutions
▪ Pros and cons
▪ Q&A session

Tagging terms
• A tag is a non-hierarchical keyword or term assigned to a piece of information
• Tags are generally chosen informally and personally by the item's creator or by its viewer
• If tags are assigned by the creator and are limited, it is a taxonomy
• If tags are assigned by the viewer and are unlimited, it is a folksonomy
• Widely used since 2003, starting with the Flickr and Delicious web sites
• Tags are usually shown inline as well as in a tag cloud

Tagging challenges
Pros (+)
1. The vocabulary used directly reflects the user's vocabulary
2. Flexibility: the user can add or remove tags
3. Multi-dimensional nature: users can assign any number and combination of tags to express a concept

lead to

Cons (-)
4. Specialized tags or tags meaningless to anyone but their author, misspellings, singular/plural forms, compound words
5. Tags that are often ambiguous, overly personalized, or poorly applied
6. Synonyms, acronyms and homonyms, which aren't handled well

Database challenges

1. Performance
2. Query awkwardness
3. Database size
4. Housekeeping

Highly normalized approach
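
A highly normalized layout usually keeps documents, tags, and the links between them in three separate tables. The sketch below assumes that layout; the table and column names (`documents`, `tags`, `document_tags`) and both helper functions are hypothetical, and SQLite only stands in for whichever RDBMS is actually used.

```python
import sqlite3

# Hypothetical three-table schema; the names are illustrative, not from the slides.
SCHEMA = """
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE document_tags (
    document_id INTEGER NOT NULL REFERENCES documents(id),
    tag_id      INTEGER NOT NULL REFERENCES tags(id),
    PRIMARY KEY (document_id, tag_id)
);
"""

def build_relational(posts):
    """Load {title: [tag, ...]} into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for title, tag_names in posts.items():
        doc_id = conn.execute(
            "INSERT INTO documents (title) VALUES (?)", (title,)
        ).lastrowid
        for name in tag_names:
            # Each tag name is stored once, however many documents use it.
            conn.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (name,))
            tag_id = conn.execute(
                "SELECT id FROM tags WHERE name = ?", (name,)
            ).fetchone()[0]
            conn.execute(
                "INSERT OR IGNORE INTO document_tags (document_id, tag_id) "
                "VALUES (?, ?)",
                (doc_id, tag_id),
            )
    return conn

def titles_with_all_tags(conn, wanted):
    """AND search: titles of documents that carry every tag in `wanted`."""
    marks = ",".join("?" for _ in wanted)
    rows = conn.execute(
        f"""
        SELECT d.title
        FROM documents d
        JOIN document_tags dt ON dt.document_id = d.id
        JOIN tags t ON t.id = dt.tag_id
        WHERE t.name IN ({marks})
        GROUP BY d.id
        HAVING COUNT(DISTINCT t.id) = ?
        ORDER BY d.title
        """,
        (*wanted, len(wanted)),
    ).fetchall()
    return [row[0] for row in rows]
```

The AND search needs two joins plus a GROUP BY/HAVING step, which gives a feel for the query awkwardness listed among the database challenges.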

Denormalized approach
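
One common reading of a denormalized layout is a single table in which each document row carries its tags as delimited text. The sketch below assumes that reading; the `documents` table, its `tags` column, the comma delimiter, and the helper names are all hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

def build_denormalized(posts):
    """Store each post's tags in one delimited text column, e.g. ',php,mysql,'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, tags TEXT)"
    )
    for title, tag_names in posts.items():
        # Leading and trailing commas let every tag be matched as ',name,'.
        packed = "," + ",".join(tag_names) + ","
        conn.execute(
            "INSERT INTO documents (title, tags) VALUES (?, ?)", (title, packed)
        )
    return conn

def titles_with_tag(conn, tag):
    """Find documents carrying `tag`.

    Assumes tag names contain no ',', '%' or '_' characters, since those
    would interfere with the delimiter or with LIKE wildcards.
    """
    rows = conn.execute(
        "SELECT title FROM documents WHERE tags LIKE ? ORDER BY title",
        (f"%,{tag},%",),
    ).fetchall()
    return [row[0] for row in rows]
```

Reading all tags of a document is a single-row lookup here, but a search by tag has to scan the packed column unless the DBMS offers a suitable index.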

Complex data type approach

Full-text-search oriented solutions

Stack Overflow: <php><mysql><guid><encryption>
JSON: {"tags":["php", "apache2", "openinviter"]}
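
The two text encodings above can be converted to and from a plain list of tags. A minimal sketch, assuming tag names never contain `<` or `>`; the function names are hypothetical.

```python
import json
import re

def parse_angle_tags(text):
    """Split '<php><mysql>' into ['php', 'mysql']."""
    return re.findall(r"<([^<>]+)>", text)

def to_angle_tags(tags):
    """Inverse of parse_angle_tags: ['php', 'mysql'] -> '<php><mysql>'."""
    return "".join(f"<{tag}>" for tag in tags)

def to_json_tags(tags):
    """Serialize tags as a JSON document with a single 'tags' array."""
    return json.dumps({"tags": list(tags)})
```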

Full-text-search approaches

[Diagram: two deployment approaches, "Approach 1" and "Approach 2", each with an application server. Labels: "FTS inside DB"; "+FTS model"; "Relational/denormalized/FTS model"; "FTS server (Lucene, Sphinx, Elastic, Solr, Xapian, etc.)"]
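
Full-text-search engines are generally built around an inverted index that maps each term to the documents containing it. The toy sketch below shows only that core idea applied to tags; it is not how any of the listed servers is implemented, and real engines add tokenization, ranking, compression, and on-disk storage. The function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each tag to the set of document ids that carry it."""
    index = defaultdict(set)
    for doc_id, tags in docs.items():
        for tag in tags:
            index[tag].add(doc_id)
    return index

def search_all(index, wanted):
    """AND search: intersect the posting sets, smallest first."""
    postings = sorted((index.get(tag, set()) for tag in wanted), key=len)
    if not postings:
        return set()
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result
```

Starting from the smallest posting set keeps the intermediate result small, which is one reason an AND search over tags can be cheap in this model.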

Housekeeping

Denormalized/FTS
1. Change all affected tags in all documents if a tag name changes

FTS
1. Rebuild the FTS index when it becomes fragmented
2. Refresh the FTS index if it isn't refreshed on COMMIT
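
The cost of renaming a tag can be sketched as follows. The example assumes a `tags` table for the relational model and a comma-delimited `documents.tags` column for the denormalized one; both layouts and all function names are hypothetical, and the two tables share one SQLite connection purely to keep the demo short.

```python
import sqlite3

def rename_tag_relational(conn, old, new):
    """Relational model: the name lives in exactly one row of `tags`."""
    return conn.execute(
        "UPDATE tags SET name = ? WHERE name = ?", (new, old)
    ).rowcount

def rename_tag_denormalized(conn, old, new):
    """Denormalized model: every document mentioning the tag is rewritten."""
    return conn.execute(
        "UPDATE documents SET tags = REPLACE(tags, ?, ?) WHERE tags LIKE ?",
        (f",{old},", f",{new},", f"%,{old},%"),
    ).rowcount

def demo():
    """Return the number of rows each rename had to touch."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE documents (id INTEGER PRIMARY KEY, tags TEXT);
        INSERT INTO tags (name) VALUES ('php'), ('mysql');
        INSERT INTO documents (tags)
            VALUES (',php,mysql,'), (',php,'), (',mysql,');
    """)
    return (
        rename_tag_relational(conn, "php", "php7"),
        rename_tag_denormalized(conn, "php", "php7"),
    )
```

In the relational case the number of touched rows stays at one, while in the denormalized case it grows with the number of documents using the tag.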

Test example
StackOverflow posts via http://data.stackexchange.com/
From 31-07-2008 to 21-12-2012
Posts: 2 680 474
Applied tags: 7 791 527
Used unique tags: 30 485
Max tags count for a post: 5

Comparison

Initial population time

Model              Insert time, seconds
Relational         1048
Denormalized       1205
Complex data type  2086
Full text search   1950

[Chart: insert time per model, seconds]

Comparison

DB size

Model              Size total, MB  Data size, MB  Index size, MB
Relational         1166            338            828
Denormalized       1080            376            704
Complex data type  1134            256            878
Full text search   1055            416            639

[Chart: DB size per model: index size, data size, and total size, MB]

Comparison

Search by document id and all tag retrieval

Model              Speed with cold cache, seconds  Speed with hot cache, seconds
Relational         0.2                             0.003
Denormalized       0.07                            0.002
Complex data type  0.9                             0.002
Full text search   0.3                             0.001

[Charts: speed with cold cache and speed with hot cache per model, seconds]

Comparison

Search using 1 tag and all tag retrieval

Model              Speed with cold cache, seconds  Speed with hot cache, seconds
Relational         1                               0.005
Denormalized       0.7                             0.004
Complex data type  1.7                             0.005
Full text search   0.7                             0.002

[Charts: speed with cold cache and speed with hot cache per model, seconds]

Comparison

Search by AND using 2 tags and all tag retrieval

Model              Speed with cold cache, seconds  Speed with hot cache, seconds
Relational         40                              34
Denormalized       34                              20
Complex data type  34                              14
Full text search   20                              2

[Chart: search speed per model with hot and cold cache, seconds]

Comparison

Cloud tag population

Model                      Speed, seconds
Relational                 20
Relational simplified      18
Relational without FK      202
Denormalized               18
Complex data type (array)  21
FTS                        40

[Chart: cloud tag population speed per model, seconds]
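
Populating a tag cloud comes down to counting how often each tag is applied. A minimal sketch for a relational layout, assuming hypothetical `tags` and `document_tags` tables; SQLite stands in for the actual RDBMS, and the queries used in the benchmark are not shown in the slides.

```python
import sqlite3

def tag_cloud(conn, limit=10):
    """Return (tag, usage count) pairs, most used first."""
    return conn.execute(
        """
        SELECT t.name, COUNT(*) AS uses
        FROM document_tags dt
        JOIN tags t ON t.id = dt.tag_id
        GROUP BY t.id
        ORDER BY uses DESC, t.name
        LIMIT ?
        """,
        (limit,),
    ).fetchall()

def demo_connection():
    """Tiny in-memory data set: php is used 3 times, mysql 2, guid 1."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE document_tags (document_id INTEGER, tag_id INTEGER);
        INSERT INTO tags (id, name) VALUES (1, 'php'), (2, 'mysql'), (3, 'guid');
        INSERT INTO document_tags
            VALUES (1, 1), (1, 2), (2, 1), (3, 1), (3, 2), (3, 3);
    """)
    return conn
```

The counts returned by `tag_cloud` would typically drive the font size of each tag in the rendered cloud.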

Pros & Cons

Criterion                   Relational    Denormalized  Complex data type  Full text search
Space consumption           worst         moderate      moderate           optimal
Search performance          worst         moderate      moderate           optimal
Insert performance          highest       good          worst              moderate
Maintenance                 minimal       required      required           required
Additional housekeeping     not required  required      required           required
Risk of failure             no            no            no                 yes
Search queries development  worst         moderate      moderate           optimal

Conclusion

1. Choose your best model based on:
   • Performance (search/insert/update)
   • Space consumption
   • Engineer experience
   • Hardware cost
   • Software cost
2. Each storage model should be checked on your RDBMS: don't be afraid to try and measure
3. Understanding how complex data types are stored internally is crucial
4. Understanding how FTS works internally is crucial
5. Investigate your DBMS's unique features
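
The "try and measure" advice can start from a very small timing helper. The sketch below is an assumption about how one might measure, not the method used for the numbers above: the first run is only a loose stand-in for a cold cache, since a real cold-cache measurement needs the DBMS and OS caches to be cleared. The function name and parameters are hypothetical.

```python
import sqlite3
import time

def time_query(conn, sql, params=(), repeats=5):
    """Run a query `repeats` times.

    Returns (first run, best later run) in seconds.
    """
    if repeats < 2:
        raise ValueError("repeats must be at least 2")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # fetchall() forces the full result set to be produced.
        conn.execute(sql, params).fetchall()
        timings.append(time.perf_counter() - start)
    return timings[0], min(timings[1:])
```

Running the same helper against each candidate storage model with the same data set gives comparable numbers for that particular DBMS and hardware.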

There is no silver bullet for tag storage model!

Q&A

Contacts

Feel free to ask any db-related questions: shtock@mail.ru
