Aleksi Kallio
CSC – IT Center for Science, Finland
Connecting Chipster genome browser to the cloud
Architecture of Chipster platform
Loosely coupled, independent components
Message-oriented communications
Flexible, scalable, robust
In other words, very cloud-like
Clients
Authentication service
Management service
Computing services
Brokers
Message broker
File broker
Chipster in the cloud
1) Deploying compute nodes in the cloud
• Easy, because the architecture is already loosely coupled and based on message passing
2) Running large parallel jobs in the cloud
• The architecture allows this easily
• Cloud-compatible tools can be integrated quickly
3) Using the cloud as a back end for interactive visualisations
• Maybe not so obvious
• So let's dig into this further...
Background: Chipster Genome Browser
Interactive Swing-based GUI
Shows reads and analysis results in genomic context
Interactive zooming from chromosome down to nucleotide level
Ensembl annotations for genes and transcripts
Integrated with the rest of Chipster
Parallel, distributed to some extent
Basic idea
Preprocess data with Hadoop / MapReduce
Generate powers-of-two summaries for the data, like in Google Earth
• Doubles the data size
The current genome browser samples data to produce summaries
Now summaries can be read directly
– Accurate results, significantly fewer disk seeks
Distribute data to scale to massive datasets
• Use messaging to query independent data providers
Aggregate results as/if they appear to the visualiser
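The powers-of-two summaries can be illustrated with a minimal sketch. It assumes the data is a list of per-position coverage values and that each summary level sums adjacent pairs of bins; the function name `build_summary_levels` and the sample values are hypothetical, and the summary format actually used in Chipster is not described here. The sketch also shows why the summaries roughly double the data size: the levels shrink as n/2 + n/4 + ... and together add up to about n.

```python
def build_summary_levels(coverage):
    """Build power-of-two summary levels from per-position coverage.

    Level 0 is the original data; each further level halves the
    resolution by summing adjacent pairs of bins (an assumed choice).
    """
    levels = [list(coverage)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Pad odd-length levels with an empty bin so that pairs line up.
        if len(prev) % 2:
            prev = prev + [0]
        levels.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
    return levels


coverage = [3, 5, 2, 0, 1, 4, 4, 1]  # hypothetical read depth per position
levels = build_summary_levels(coverage)

# Number of extra bins stored for all summary levels together.
summary_bins = sum(len(level) for level in levels[1:])
```

With 8 original values the summary levels hold 4 + 2 + 1 = 7 bins, so the total stored data is just under twice the original. A visualiser zoomed out by a factor of 2^k could then read level k directly instead of sampling the raw data.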
Work in progress...
Genome browser up and running
Hadoop-based data processing is at a very early stage
Currently trying to get it to scale well
What's the point?
Besides items (e.g., reads), the visualiser can receive “superitems” (e.g., summaries of reads)
• These summarise coverage, quality, SNPs etc. of the original reads
All kinds of advanced information can be generated in the preprocessing step
– Such as features that combine a large number of genomes
– Generators should be pluggable
We spend resources on the server side to improve user experience on the client side
• CPU, memory and disk space are required on the server side
• But only for a short time (like in large batch jobs)
• Cheap commodity servers can be used
• And the experiment has already been expensive
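As a rough illustration of items, superitems and pluggable generators, the sketch below bins reads by start position and emits one summary per non-empty bin. All names (`Read`, `SuperItem`, `summarise_reads`, `GENERATORS`) and fields are hypothetical and chosen for the example; they are not taken from the Chipster code base.

```python
from dataclasses import dataclass


@dataclass
class Read:
    start: int
    length: int
    quality: float


@dataclass
class SuperItem:
    start: int          # first position covered by the bin
    end: int            # exclusive end of the bin
    read_count: int
    mean_quality: float


def summarise_reads(reads, bin_size):
    """Group reads by start position and emit one SuperItem per non-empty bin."""
    bins = {}
    for read in reads:
        bins.setdefault(read.start // bin_size, []).append(read)
    return [
        SuperItem(
            start=index * bin_size,
            end=(index + 1) * bin_size,
            read_count=len(members),
            mean_quality=sum(r.quality for r in members) / len(members),
        )
        for index, members in sorted(bins.items())
    ]


# Pluggable generators: a registry mapping a name to a summarising function.
GENERATORS = {"read_summary": summarise_reads}

reads = [Read(0, 50, 30.0), Read(10, 50, 20.0), Read(130, 50, 40.0)]
superitems = GENERATORS["read_summary"](reads, bin_size=64)
```

New kinds of preprocessed information could be added by registering further functions in `GENERATORS`, without changing the code that consumes the superitems.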
Summary
Use cheap server resources to enable better user experience
Goal: to make data analysis quicker (and more fun)
Tackle server-side unreliability on the client side
Future development
– If this works out, it could also be used in other Chipster visualisers
– Integrating HBase queries into interactive visualisations
– Optimising data summarising for visual truthfulness
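Handling unreliable data providers on the client side, by aggregating results as and if they arrive, might look roughly like the sketch below. It uses a thread pool from the Python standard library as a stand-in for the message broker; the provider functions, the region tuple and `query_providers` are hypothetical names introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout


def query_providers(providers, region, timeout=2.0):
    """Query all providers in parallel and keep whatever arrives in time.

    Returns the merged results and the names of providers that failed
    or did not answer before the timeout.
    """
    results, failed = [], []
    pool = ThreadPoolExecutor(max_workers=len(providers))
    futures = {pool.submit(fn, region): name for name, fn in providers.items()}
    pending = set(futures)
    try:
        for future in as_completed(futures, timeout=timeout):
            pending.discard(future)
            try:
                results.extend(future.result())
            except Exception:
                # A broken provider must not break the visualisation.
                failed.append(futures[future])
    except FuturesTimeout:
        pass  # slow providers are simply left out of this response
    finally:
        pool.shutdown(wait=False)
    failed.extend(futures[f] for f in pending)
    return results, failed


def provider_a(region):
    return [("chr1", 100, 7)]


def provider_b(region):
    raise ConnectionError("provider unavailable")


def provider_c(region):
    return [("chr1", 200, 3)]


providers = {"a": provider_a, "b": provider_b, "c": provider_c}
results, failed = query_providers(providers, ("chr1", 0, 1000))
```

Here provider `b` fails, yet the client still receives the data from `a` and `c` and can draw a partial view instead of showing an error.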
For more info: [email protected]