Allura - an Open Source MongoDB Based Document Oriented SourceForge

SourceForge | Slashdot | ThinkGeek | Ohloh | freshmeat (Geeknet)

Rick Copeland, @rick446

[email protected]


I am not Mark Ramm (sorry)


Allura (SF.net “beta” devtools)

Rewrite developer tools with new architecture

Wiki, Tracker, Discussions, Git, Hg, SVN, with more to come

Single MongoDB replica set

Release early & often


Allura Scaling

SourceForge.net currently handles ~4M pageviews per day

Allura will eventually handle 10% (with lots of writing)

“Consume” currently handles 3M+ pageviews/day on one shard (read-mostly)

Allura can handle ~48k pageviews / day / shard

Add shards & optimize queries as we migrate projects to sf.net

Most data is project-specific; sharding by project is straightforward


System Architecture

Web-facing App Server

Task Daemon

SMTPServer

FUSE Filesystem(repository hosting)


Ming – an “Object-Document Mapper?”

Your data has a schema

Your database can define and enforce it

It can live in your application (as with MongoDB)

Nice to have the schema defined in one place in the code

Sometimes you need a “migration”

Changing the structure/meaning of fields

Adding indexes, particularly unique indexes

Sometimes lazy, sometimes eager

“Unit of work”: queuing up all your updates can be handy

Python dicts are nice; objects are nicer
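The “unit of work” idea on this slide can be sketched in a few lines. This is a toy illustration only, not Ming's implementation: the `UnitOfWork` class, its method names, and the dict used as a backing store are all assumptions made for the example.

```python
class UnitOfWork:
    """Toy unit of work: queue document writes, then apply them in one flush."""

    def __init__(self, store):
        self.store = store   # hypothetical backing store: dict of _id -> document
        self.dirty = {}      # pending writes, keyed by _id

    def save(self, doc):
        # Queue the write; nothing reaches the store yet.
        self.dirty[doc["_id"]] = dict(doc)

    def flush(self):
        # Apply all queued writes at once, then empty the queue.
        count = len(self.dirty)
        self.store.update(self.dirty)
        self.dirty.clear()
        return count

    def clear(self):
        # Discard queued writes, e.g. when a web request fails.
        self.dirty.clear()


store = {}
uow = UnitOfWork(store)
uow.save({"_id": 1, "title": "Home"})
uow.save({"_id": 1, "title": "Home (edited)"})  # same document, coalesced
uow.save({"_id": 2, "title": "About"})
stored_before_flush = len(store)
written = uow.flush()
```

Two saves of the same document collapse into one write, and a failed request can call `clear()` so that no partial update reaches the database.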


Ming Concepts

Inspired by SQLAlchemy

Group of collection objects with schemas defined

Group of classes to which you map your collections

Use collection-level operations for performance

Use class-level operations for abstraction

Convenience methods for loading/saving objects and ensuring indexes are created

Migrations

Unit of Work – great for web applications

MIM – “Mongo in Memory” nice for unit tests


Ming Example

    from ming import collection, schema, Field
    from ming.orm import (mapper, Mapper, RelationProperty,
                          ForeignIdProperty)

    # `session` (document session) and `ormsession` (ORM session)
    # are assumed to be configured elsewhere.
    WikiDoc = collection(
        'wiki_page', session,
        Field('_id', schema.ObjectId()),
        Field('title', str, index=True),
        Field('text', str))
    CommentDoc = collection(
        'comment', session,
        Field('_id', schema.ObjectId()),
        Field('page_id', schema.ObjectId(), index=True),
        Field('text', str))

    class WikiPage(object): pass
    class Comment(object): pass

    # Relations refer to mapped classes by name, so the target is 'Comment'.
    ormsession.mapper(WikiPage, WikiDoc, properties=dict(
        comments=RelationProperty('Comment')))
    ormsession.mapper(Comment, CommentDoc, properties=dict(
        page_id=ForeignIdProperty('WikiPage'),
        page=RelationProperty('WikiPage')))

    Mapper.compile_all()


Allura Artifacts

Artifacts include tickets, wiki pages, discussions, comments, merge requests, etc.

On artifact change, a session extension:

• Queues a Solr index operation (for full text search support)

• Scans the artifact text for references to other artifacts

• Updates statistics on objects created/modified/deleted

(Diagram: artifact class hierarchy showing Artifact, VersionedArtifact, Snapshot, and Message)


Allura Threaded Discussions

    MessageDoc = collection(
        'message', project_doc_session,
        Field('_id', str, if_missing=h.gen_message_id),
        Field('slug', str, if_missing=h.nonce),
        Field('full_slug', str),
        Field('parent_id', str),
        …)

_id – use an email Message-ID compatible key

slug – threaded path of random 4-digit hex numbers prefixed by parent (e.g. dead/beef/f00d, dead/beef, dead)

full_slug – slug interspersed with ISO-formatted message datetime

Easy queries for hierarchical data

Find all descendants of a message – anchored slug prefix search, e.g. “^dead/”

Sort messages by thread, then by date – full_slug sort
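The two queries above can be tried out without a database. The sketch below is an illustration under assumptions: the `make_message` helper is hypothetical, and the exact `full_slug` layout (a compact timestamp, a colon, then the slug part) is a guess at “slug interspersed with datetime”, not a confirmed format.

```python
import re
from datetime import datetime


def make_message(parent, nonce, timestamp):
    """Build a message dict with slug and full_slug derived from its parent."""
    part = f"{timestamp.strftime('%Y%m%d%H%M%S')}:{nonce}"  # assumed layout
    if parent is None:
        return {"slug": nonce, "full_slug": part}
    return {
        "slug": f"{parent['slug']}/{nonce}",
        "full_slug": f"{parent['full_slug']}/{part}",
    }


root = make_message(None, "dead", datetime(2011, 1, 1, 12, 0))
reply = make_message(root, "beef", datetime(2011, 1, 1, 12, 5))
nested = make_message(reply, "f00d", datetime(2011, 1, 1, 12, 10))
late = make_message(root, "cafe", datetime(2011, 1, 1, 12, 7))
messages = [late, nested, root, reply]  # deliberately out of order

# Descendants of "dead": an anchored prefix match on slug.
descendants = sorted(m["slug"] for m in messages if re.match(r"^dead/", m["slug"]))

# Thread order, then date order within a thread: a plain sort on full_slug.
threaded = [m["slug"] for m in sorted(messages, key=lambda m: m["full_slug"])]
```

Because every child's `full_slug` starts with its parent's, a plain string sort keeps each reply directly under the message it answers, and siblings fall into date order.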


MonQ: Async Queueing in MongoDB

    from datetime import datetime
    from ming import collection, schema, Field

    states = ('ready', 'busy', 'error', 'complete')
    result_types = ('keep', 'forget')

    MonQTaskDoc = collection(
        'monq_task', main_doc_session,
        Field('_id', schema.ObjectId()),
        Field('state', schema.OneOf(*states)),
        Field('result_type', schema.OneOf(*result_types)),
        Field('time_queue', datetime),
        Field('time_start', datetime),
        Field('time_stop', datetime),
        # dotted path to function
        Field('task_name', str),
        Field('process', str),  # worker process name: "locks" the task
        Field('context', dict(
            project_id=schema.ObjectId(),
            app_config_id=schema.ObjectId(),
            user_id=schema.ObjectId())),
        Field('args', list),
        Field('kwargs', {None: None}),
        Field('result', None, if_missing=None))
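How the `state` and `process` fields could drive a worker is sketched below. This is an in-memory illustration, not Allura's code: a `threading.Lock` stands in for the atomicity of a single-document update in MongoDB, and `claim_next`, `run`, and the `registry` dict (used here in place of importing a function by dotted path) are hypothetical names.

```python
import threading
from datetime import datetime, timezone

_claim_lock = threading.Lock()  # stand-in for an atomic single-document update


def claim_next(tasks, worker_name):
    """Move the oldest 'ready' task to 'busy' and stamp it with the worker name."""
    with _claim_lock:
        ready = [t for t in tasks if t["state"] == "ready"]
        if not ready:
            return None
        task = min(ready, key=lambda t: t["time_queue"])
        task.update(state="busy", process=worker_name,
                    time_start=datetime.now(timezone.utc))
        return task


def run(task, registry):
    """Run a claimed task and record the outcome in its state and result."""
    try:
        func = registry[task["task_name"]]
        task["result"] = func(*task["args"], **task["kwargs"])
        task["state"] = "complete"
    except Exception as exc:
        task["result"] = repr(exc)
        task["state"] = "error"
    task["time_stop"] = datetime.now(timezone.utc)
    return task


registry = {"demo.add": lambda a, b: a + b, "demo.fail": lambda: 1 / 0}
tasks = [
    {"_id": 1, "state": "ready", "time_queue": datetime(2011, 1, 1, 12, 0),
     "task_name": "demo.add", "args": [2, 3], "kwargs": {}, "process": None},
    {"_id": 2, "state": "ready", "time_queue": datetime(2011, 1, 1, 12, 1),
     "task_name": "demo.fail", "args": [], "kwargs": {}, "process": None},
]
first = run(claim_next(tasks, "worker-1"), registry)
second = run(claim_next(tasks, "worker-2"), registry)
third = claim_next(tasks, "worker-3")  # nothing left in 'ready'
```

Against a real MongoDB collection, the claim step would need to be one atomic find-and-update on `state == 'ready'` so that two workers can never take the same task.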


Repository Cache Objects

On commit to a repo (Hg, SVN, or Git)

• Build commit graph in MongoDB for new commits

• Build auxiliary structures

• tree structure, including all trees in a commit & last commit to modify

• linear commit runs (useful for generating history)

• commit difference summary (must be computed in Hg and Git)

• Note references to other artifacts and commits

Repo browser uses cached structure to serve pages

(Diagram: repository cache objects showing Commit, Tree, Trees, CommitRun, LastCommit, and DiffInfo)


Repository Cache Lessons Learned

Using MongoDB to represent graph structures (commit graph, commit trees) requires careful query planning. Pointer-chasing is no fun!

Sometimes Ming validation and ORM overhead can be prohibitively expensive – time to drop down a layer.

Benchmarking and profiling are your friends, as are queries like {'_id': {'$in': […]}} for returning multiple objects
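One way to avoid pointer-chasing is to walk the graph one level at a time and fetch each level with a single `$in`-style query. The sketch below runs against an in-memory dict; `find_by_ids` is a stand-in for a collection lookup, and the `parent_ids` field name is an assumption, not Allura's actual schema.

```python
# Hypothetical commit documents; 'parent_ids' is an assumed field name.
COMMITS = {
    "c4": {"_id": "c4", "parent_ids": ["c2", "c3"]},
    "c3": {"_id": "c3", "parent_ids": ["c1"]},
    "c2": {"_id": "c2", "parent_ids": ["c1"]},
    "c1": {"_id": "c1", "parent_ids": []},
}
query_count = 0


def find_by_ids(ids):
    """Stand-in for a query of the form {'_id': {'$in': ids}}."""
    global query_count
    query_count += 1
    return [COMMITS[i] for i in ids if i in COMMITS]


def ancestors(start_id):
    """Collect a commit and all its ancestors, one batched query per level."""
    seen = set()
    frontier = [start_id]
    while frontier:
        docs = find_by_ids(frontier)
        seen.update(d["_id"] for d in docs)
        next_frontier = []
        for d in docs:
            for parent_id in d["parent_ids"]:
                if parent_id not in seen and parent_id not in next_frontier:
                    next_frontier.append(parent_id)
        frontier = next_frontier
    return seen


history = ancestors("c4")
```

The number of queries then grows with the depth of the graph rather than with the number of commits visited.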


Authorization: ProjectRole Objects

    from ming import collection, schema, Field, Index
    from ming.orm import ForeignIdProperty, RelationProperty

    ProjectRoleDoc = collection(
        'project_role', main_doc_session,
        Field('_id', schema.ObjectId()),
        Field('user_id', schema.ObjectId(), index=True),
        Field('project_id', schema.ObjectId(), index=True),
        Field('name', str),
        Field('roles', [schema.ObjectId()]),
        Index('user_id', 'project_id', 'name', unique=True))

    class ProjectRole(object):
        pass

    main_orm_session.mapper(ProjectRole, ProjectRoleDoc, properties=dict(
        user_id=ForeignIdProperty('User'),
        project_id=ForeignIdProperty('Project'),
        user=RelationProperty('User'),
        project=RelationProperty('Project')))


Authorization: ProjectRole Objects

Roles can be named roles (“Groups”) or user proxies. Roles inherit all permissions of the roles they can “act as”

User membership in a group is stored on the user proxy object (the list of roles for which the user has permission)

Authorization checks all roles transitively for a user. If any role has the appropriate permission being required, then access is granted.

Hierarchical role structures are supported, but not exposed in the UI.
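The transitive check described above can be sketched as a graph walk over the `roles` lists. This is a minimal illustration, not Allura's authorization code: the role ids, the `PERMISSIONS` table (where permissions actually live is not stated on the slide), and the function names are all hypothetical.

```python
# Hypothetical role documents: named roles ("groups") and per-user proxies.
ROLES = {
    "admin": {"name": "Admin", "roles": ["developer"]},
    "developer": {"name": "Developer", "roles": ["member"]},
    "member": {"name": "Member", "roles": []},
    "alice": {"name": None, "user_id": "alice", "roles": ["admin"]},  # user proxy
    "bob": {"name": None, "user_id": "bob", "roles": ["member"]},     # user proxy
}

# Hypothetical permission table, keyed by role id.
PERMISSIONS = {"member": {"read"}, "developer": {"write"}, "admin": {"configure"}}


def reachable_roles(role_id):
    """Return the role plus every role it can transitively 'act as'."""
    seen = set()
    stack = [role_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cycles in the role graph
        seen.add(current)
        stack.extend(ROLES[current]["roles"])
    return seen


def has_access(proxy_id, permission):
    """Grant access if any reachable role holds the permission."""
    return any(permission in PERMISSIONS.get(role_id, ())
               for role_id in reachable_roles(proxy_id))
```

Here `has_access("alice", "read")` is granted through the chain admin, developer, member, while `has_access("bob", "write")` is denied because bob's proxy only reaches the member role.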


Flyway Migrations

Ming supports “lazy migrations” from one schema version to another automatically

Sometimes you want to explicitly version your DB

Flyway allows you to define various versions of your schema with pre- and post-conditions for running an “up” migration and a “down” migration

With multiple tools with interdependencies and a platform under it all, we thought we needed it

We didn’t, but it’s there and it works….
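The difference between a lazy and an eager migration can be shown with plain dicts. The sketch below illustrates the general idea only; it uses neither Ming's nor Flyway's API, and the `version` field, the `body` to `text` rename, and the function names are made up for the example.

```python
CURRENT_VERSION = 2


def migrate_v1_to_v2(doc):
    """Hypothetical schema change: rename 'body' to 'text'."""
    doc = dict(doc)
    doc["text"] = doc.pop("body", "")
    doc["version"] = 2
    return doc


MIGRATIONS = {1: migrate_v1_to_v2}  # from-version -> upgrade function


def load(store, _id):
    """Lazy migration: upgrade a document when it is read, then write it back."""
    doc = store[_id]
    while doc.get("version", 1) < CURRENT_VERSION:
        doc = MIGRATIONS[doc.get("version", 1)](doc)
    store[_id] = doc
    return doc


def migrate_all(store):
    """Eager migration: upgrade every document in one pass."""
    for _id in list(store):
        load(store, _id)


store = {
    1: {"_id": 1, "body": "hello"},                 # old shape, no version field
    2: {"_id": 2, "version": 2, "text": "already new"},
    3: {"_id": 3, "body": "untouched until read"},
}
page = load(store, 1)         # lazy: only document 1 is upgraded
lazy_leftover = "body" in store[3]
migrate_all(store)            # eager: everything is upgraded
```

The lazy path spreads the cost across normal reads, while the eager path is what you need before adding something like a unique index that must hold for every document.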


What We Liked

Performance, performance, performance – Easily handle 90% of SF.net traffic from 1 DB server, 4 web servers

Schemaless server allows fast schema evolution in development, making many migrations unnecessary

Replication is easy, making scalability and backups easy

Keep a “backup slave” running

Kill backup slave, copy off database, bring back up the slave

Automatic re-sync with master

Query Language

You mean I can have performance without map-reduce?

GridFS


Pitfalls

Too-large documents

• Store less per document

• Return only a few fields

Ignoring indexing

• Watch your server log; bad queries show up there

Too much denormalization

• Try to use an index if all you need is a backref

Ignoring your data’s schema

Using many databases when one will do

Using too many queries


Open Source

Ming: http://sf.net/projects/merciless/

MIT License

Allura: http://sf.net/p/allura/

Apache License


Future Work

mongos

New Allura Tools

Migrating legacy SF.net projects to Allura

Stats all in MongoDB rather than Hadoop?

Better APIs to access your project data

