SourceForge | Slashdot | ThinkGeek | Ohloh | freshmeat Geeknet, page 1 Allura – an Open Source MongoDB Based Document Oriented SourceForge Rick Copeland @rick446 [email protected]
May 12, 2015
Allura – an Open Source MongoDB Based Document
Oriented SourceForge
Rick Copeland @rick446
I am not Mark Ramm (sorry)
Allura (SF.net “beta” devtools)
Rewrite developer tools with new architecture
Wiki, Tracker, Discussions, Git, Hg, SVN, with more to come
Single MongoDB replica set
Release early & often
Allura Scaling
SourceForge.net currently handles ~4M pageviews per day
Allura will eventually handle 10% of that traffic (with lots of writes)
“Consume” currently handles 3M+ pageviews/day on one shard (read-mostly)
Allura can handle ~48k pageviews / day / shard
Add shards & optimize queries as we migrate projects to sf.net
Most data is project-specific; sharding by project is straightforward
System Architecture
Web-facing App Server
Task Daemon
SMTP Server
FUSE Filesystem (repository hosting)
Ming – an “Object-Document Mapper?”
Your data has a schema
• Your database can define and enforce it
• It can live in your application (as with MongoDB)
• Nice to have the schema defined in one place in the code
Sometimes you need a “migration”
• Changing the structure/meaning of fields
• Adding indexes, particularly unique indexes
• Sometimes lazy, sometimes eager
“Unit of work:” Queuing up all your updates can be handy
Python dicts are nice; objects are nicer
Ming Concepts
Inspired by SQLAlchemy
Group of collection objects with schemas defined
Group of classes to which you map your collections
Use collection-level operations for performance
Use class-level operations for abstraction
Convenience methods for loading/saving objects and ensuring indexes are created
Migrations
Unit of Work – great for web applications
MIM – “Mongo in Memory” nice for unit tests
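The “unit of work” idea above can be illustrated with a minimal sketch. This is not Ming's implementation; the `UnitOfWork` class, its method names, and the dict used as a backing store are all hypothetical and only show the pattern of queuing changes and writing them in one flush at the end of a request.

```python
class UnitOfWork:
    """Collects changed documents and writes them out in one flush."""

    def __init__(self, store):
        self.store = store   # hypothetical backing store: dict keyed by _id
        self._dirty = {}     # _id -> pending document

    def save(self, doc):
        # Queue the change; nothing is written yet.
        self._dirty[doc["_id"]] = doc

    def flush(self):
        # Write every queued document once, then forget the queue.
        written = len(self._dirty)
        for _id, doc in self._dirty.items():
            self.store[_id] = dict(doc)
        self._dirty.clear()
        return written

    def clear(self):
        # Discard pending changes, e.g. when a request fails.
        self._dirty.clear()


store = {}
uow = UnitOfWork(store)
page = {"_id": 1, "title": "Home", "text": "draft"}
uow.save(page)
page["text"] = "final"   # a later edit to the same object
uow.save(page)           # still only one pending write
pending_before_flush = dict(store)
flushed = uow.flush()
```

In a web application the flush would typically happen once at the end of a successful request, and `clear()` on an error, which is why the pattern fits request handling well.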
Ming Example

from ming import collection, schema, Field
from ming.orm import (mapper, Mapper, RelationProperty,
                      ForeignIdProperty)

# session and ormsession are assumed to be configured elsewhere
WikiDoc = collection(
    'wiki_page', session,
    Field('_id', schema.ObjectId()),
    Field('title', str, index=True),
    Field('text', str))
CommentDoc = collection(
    'comment', session,
    Field('_id', schema.ObjectId()),
    Field('page_id', schema.ObjectId(), index=True),
    Field('text', str))

class WikiPage(object): pass
class Comment(object): pass

# Relation targets are named by mapped class ('Comment', 'WikiPage')
ormsession.mapper(WikiPage, WikiDoc, properties=dict(
    comments=RelationProperty('Comment')))
ormsession.mapper(Comment, CommentDoc, properties=dict(
    page_id=ForeignIdProperty('WikiPage'),
    page=RelationProperty('WikiPage')))

Mapper.compile_all()
Allura Artifacts
Artifacts include tickets, wiki pages, discussions, comments, merge requests, etc.
On artifact change, a session extension:
• Queues a Solr index operation (for full text search support)
• Scans the artifact text for references to other artifacts
• Updates statistics on objects created/modified/deleted
[Diagram: Artifact, VersionedArtifact, Snapshot, Message]
Allura Threaded Discussions

MessageDoc = collection(
    'message', project_doc_session,
    Field('_id', str, if_missing=h.gen_message_id),
    Field('slug', str, if_missing=h.nonce),
    Field('full_slug', str),
    Field('parent_id', str),
    …)

_id – use an email Message-ID compatible key
slug – threaded path of random 4-digit hex numbers prefixed by parent (e.g. dead/beef/f00d, dead/beef, dead)
full_slug – slug interspersed with ISO-formatted message datetime
Easy queries for hierarchical data
Find all descendants of a message – slug prefix search “dead/.*”
Sort messages by thread, then by date – full_slug sort
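The slug scheme above can be sketched in plain Python. The exact `full_slug` layout here (a compact `YYYYmmddHHMMSS` timestamp, a colon, then the 4-digit hex part) is an assumption for illustration, as are the helper names `nonce` and `new_message`; only the general idea of prefixing by parent comes from the slide.

```python
import secrets
from datetime import datetime, timedelta


def nonce():
    # Random 4-digit hex string such as "beef" (stand-in for h.nonce).
    return secrets.token_hex(2)


def new_message(parent, timestamp, part=None):
    """Build slug and full_slug for a message under `parent` (None for a root)."""
    part = part or nonce()
    stamped = timestamp.strftime("%Y%m%d%H%M%S") + ":" + part
    if parent is None:
        return {"slug": part, "full_slug": stamped}
    return {
        "slug": parent["slug"] + "/" + part,
        "full_slug": parent["full_slug"] + "/" + stamped,
    }


t0 = datetime(2015, 5, 12, 9, 0, 0)
root = new_message(None, t0, "dead")
reply = new_message(root, t0 + timedelta(minutes=5), "beef")
nested = new_message(reply, t0 + timedelta(minutes=10), "f00d")
other = new_message(None, t0 + timedelta(minutes=1), "cafe")
messages = [other, nested, root, reply]

# Descendants of a message: prefix match on slug (the "dead/.*" search).
descendants = sorted(
    m["slug"] for m in messages if m["slug"].startswith(root["slug"] + "/")
)

# Thread order, then date order within a thread: a plain sort on full_slug.
thread_order = [m["slug"] for m in sorted(messages, key=lambda m: m["full_slug"])]
```

Because every reply's `full_slug` starts with its parent's `full_slug`, a lexicographic sort keeps each thread contiguous, and the embedded timestamps order siblings by date. Against MongoDB, the prefix match would be an anchored regular expression on `slug`, which can use an index on that field.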
MonQ: Async Queueing in MongoDB

states = ('ready', 'busy', 'error', 'complete')
result_types = ('keep', 'forget')

MonQTaskDoc = collection(
    'monq_task', main_doc_session,
    Field('_id', schema.ObjectId()),
    Field('state', schema.OneOf(*states)),
    Field('result_type', schema.OneOf(*result_types)),
    Field('time_queue', datetime),
    Field('time_start', datetime),
    Field('time_stop', datetime),
    # dotted path to function
    Field('task_name', str),
    Field('process', str),  # worker process name: "locks" the task
    Field('context', dict(
        project_id=schema.ObjectId(),
        app_config_id=schema.ObjectId(),
        user_id=schema.ObjectId())),
    Field('args', list),
    Field('kwargs', {None: None}),
    Field('result', None, if_missing=None))
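How a worker might move a task through the states above can be modeled in memory. This is a sketch under assumptions, not Allura's worker code: `claim_next`, `run`, and the `registry` dict (standing in for resolving the dotted `task_name`) are hypothetical, and storing the exception text in `result` on error is also an assumption. Against a real MongoDB collection the claim step would need to be one atomic update (for example PyMongo's `find_one_and_update`) so that two workers cannot take the same task.

```python
from datetime import datetime, timezone


def claim_next(tasks, process):
    """Mark the oldest 'ready' task as 'busy' for `process` and return it."""
    ready = [t for t in tasks if t["state"] == "ready"]
    if not ready:
        return None
    task = min(ready, key=lambda t: t["time_queue"])
    # Writing the worker name is what "locks" the task.
    task.update(state="busy", process=process,
                time_start=datetime.now(timezone.utc))
    return task


def run(task, registry):
    """Execute a claimed task and record the outcome on the task document."""
    try:
        func = registry[task["task_name"]]
        result = func(*task["args"], **task["kwargs"])
        task["state"] = "complete"
        # 'forget' tasks do not keep their return value.
        task["result"] = result if task["result_type"] == "keep" else None
    except Exception as exc:
        task["state"] = "error"
        task["result"] = repr(exc)
    task["time_stop"] = datetime.now(timezone.utc)
    return task


registry = {"demo.add": lambda a, b: a + b}
tasks = [dict(state="ready", result_type="keep", process=None,
              time_queue=datetime(2015, 5, 12, tzinfo=timezone.utc),
              task_name="demo.add", args=[2, 3], kwargs={}, result=None)]
task = run(claim_next(tasks, "worker-1"), registry)
```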
Repository Cache Objects
On commit to a repo (Hg, SVN, or Git)
• Build commit graph in MongoDB for new commits
• Build auxiliary structures
  • tree structure, including all trees in a commit & last commit to modify
  • linear commit runs (useful for generating history)
  • commit difference summary (must be computed in Hg and Git)
• Note references to other artifacts and commits
Repo browser uses cached structure to serve pages
[Diagram: Commit, Tree, Trees, CommitRun, LastCommit, DiffInfo]
Repository Cache Lessons Learned
Using MongoDB to represent graph structures (commit graph, commit trees) requires careful query planning. Pointer-chasing is no fun!
Sometimes Ming validation and ORM overhead can be prohibitively expensive – time to drop down a layer.
Benchmarking and profiling are your friends, as are queries like {'_id': {'$in': […]}} for returning multiple objects
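The difference between pointer-chasing and batched `$in` lookups can be sketched with an in-memory stand-in for a collection. The `parent_ids` field name, the `find_in` helper, and the sample graph are hypothetical; the point is only that walking the graph one level at a time needs one query per level rather than one per commit.

```python
def find_in(collection, ids):
    """Stand-in for collection.find({'_id': {'$in': ids}}) on a plain dict."""
    return [collection[i] for i in ids if i in collection]


def ancestors(collection, start_id):
    """Walk the commit graph level by level, one batched lookup per level."""
    seen, order, queries = set(), [], 0
    frontier = [start_id]
    while frontier:
        docs = find_in(collection, frontier)
        queries += 1
        next_frontier = []
        for doc in docs:
            if doc["_id"] in seen:
                continue
            seen.add(doc["_id"])
            order.append(doc["_id"])
            for parent in doc["parent_ids"]:
                # Queue each unseen parent once for the next batch.
                if parent not in seen and parent not in next_frontier:
                    next_frontier.append(parent)
        frontier = next_frontier
    return order, queries


# A small diamond-shaped history: d merges b and c, which both descend from a.
commits = {
    "a": {"_id": "a", "parent_ids": []},
    "b": {"_id": "b", "parent_ids": ["a"]},
    "c": {"_id": "c", "parent_ids": ["a"]},
    "d": {"_id": "d", "parent_ids": ["b", "c"]},
}
order, queries = ancestors(commits, "d")
```

Here four commits are loaded with three lookups; on long, wide histories the saving over one query per commit grows with the width of each level.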
Authorization: ProjectRole Objects

ProjectRoleDoc = collection(
    'project_role', main_doc_session,
    Field('_id', schema.ObjectId()),
    Field('user_id', schema.ObjectId(), index=True),
    Field('project_id', schema.ObjectId(), index=True),
    Field('name', str),
    Field('roles', [schema.ObjectId()]),
    Index('user_id', 'project_id', 'name', unique=True))

class ProjectRole(object): pass

main_orm_session.mapper(ProjectRole, ProjectRoleDoc, properties=dict(
    user_id=ForeignIdProperty('User'),
    project_id=ForeignIdProperty('Project'),
    user=RelationProperty('User'),
    project=RelationProperty('Project')))
Authorization: ProjectRole Objects
Roles can be named roles (“Groups”) or user proxies. Roles inherit all permissions of the roles they can “act as”.
User membership in a group is stored on the user proxy object (the list of roles for which the user has permission).
Authorization checks all of a user's roles transitively. If any role has the required permission, access is granted.
Hierarchical role structures are supported, but not exposed in the UI.
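The transitive check described above amounts to a graph traversal over the `roles` lists. The sketch below is illustrative only: the role ids, the `acl` mapping from permission name to allowed role ids, and the function names are assumptions, since the slides do not show where permissions are stored.

```python
def reachable_roles(roles, start_id):
    """Return every role id reachable from `start_id` via the 'roles' lists."""
    seen = set()
    stack = [start_id]
    while stack:
        role_id = stack.pop()
        if role_id in seen:
            continue  # also guards against cycles in the role graph
        seen.add(role_id)
        stack.extend(roles[role_id]["roles"])
    return seen


def has_access(roles, user_role_id, acl, permission):
    """Grant access if any role the user can act as holds the permission."""
    allowed = acl.get(permission, set())
    return bool(reachable_roles(roles, user_role_id) & allowed)


# Hypothetical data: a user proxy role plus three named roles ("groups").
roles = {
    "proxy:alice": {"name": None, "roles": ["developer"]},
    "developer": {"name": "Developer", "roles": ["member"]},
    "member": {"name": "Member", "roles": []},
    "admin": {"name": "Admin", "roles": ["developer"]},
}
acl = {"read": {"member"}, "configure": {"admin"}}

can_read = has_access(roles, "proxy:alice", acl, "read")
can_configure = has_access(roles, "proxy:alice", acl, "configure")
```

Alice's proxy role acts as Developer, which in turn acts as Member, so the read check passes through two hops, while the configure check fails because Admin is not reachable from her proxy role.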
Flyway Migrations
Ming supports “lazy migrations” from one schema version to another automatically
Sometimes you want to explicitly version your DB
Flyway allows you to define various versions of your schema with pre- and post-conditions for running an “up” migration and a “down” migration
With multiple tools with interdependencies and a platform under it all, we thought we needed it
We didn’t, but it’s there and it works….
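The “lazy migration” approach mentioned above can be sketched generically: documents are upgraded when they are loaded, not in one eager pass over the database. This does not use Ming's or Flyway's API; the `version` field, the `tags` example, and the function names are hypothetical.

```python
CURRENT_VERSION = 2


def v1_to_v2(doc):
    # Hypothetical schema change: 'tags' was a comma-separated string in v1
    # and becomes a list of strings in v2.
    upgraded = dict(doc)
    raw = upgraded.get("tags", "")
    upgraded["tags"] = [t.strip() for t in raw.split(",") if t.strip()]
    upgraded["version"] = 2
    return upgraded


MIGRATIONS = {1: v1_to_v2}


def load(doc):
    """Return `doc` upgraded to CURRENT_VERSION, migrating only if needed."""
    doc = dict(doc)
    # Documents written before versioning existed are treated as version 1.
    version = doc.get("version", 1)
    while version < CURRENT_VERSION:
        doc = MIGRATIONS[version](doc)
        version = doc["version"]
    return doc


old = {"_id": 1, "tags": "wiki, docs"}
new = {"_id": 2, "tags": ["tracker"], "version": 2}
loaded_old = load(old)
loaded_new = load(new)
```

An eager migration would run the same upgrade functions over every stored document up front, which is where explicit versioning with pre- and post-conditions becomes useful.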
What We Liked
Performance, performance, performance – Easily handle 90% of SF.net traffic from 1 DB server, 4 web servers
Schemaless server allows fast schema evolution in development, making many migrations unnecessary
Replication is easy, making scalability and backups easy
• Keep a “backup slave” running
• Kill backup slave, copy off database, bring back up the slave
• Automatic re-sync with master
Query Language
• You mean I can have performance without map-reduce?
GridFS
Pitfalls
Too-large documents
• Store less per document
• Return only a few fields
Ignoring indexing
• Watch your server log; bad queries show up there
Too much denormalization
• Try to use an index if all you need is a backref
Ignoring your data’s schema
Using many databases when one will do
Using too many queries
Open Source
Ming: http://sf.net/projects/merciless/
• MIT License
Allura: http://sf.net/p/allura/
• Apache License
Future Work
mongos
New Allura Tools
Migrating legacy SF.net projects to Allura
Stats all in MongoDB rather than Hadoop?
Better APIs to access your project data
Rick Copeland @rick446