Introduction to bioinformatics data analysis with the Galaxy application, Enis Afgan, Institut Ruđer Bošković, 30.9.2014.

IRB Galaxy CloudMan radionica

Jul 17, 2015


Enis Afgan
Page 1: IRB Galaxy CloudMan radionica

Introduction to bioinformatics data analysis with the Galaxy application

Enis Afgan, Institut Ruđer Bošković

30.9.2014.

Page 2: IRB Galaxy CloudMan radionica

All of us
•  In 30 seconds or less, tell everyone:
   •  Your name
   •  Your department / affiliation
   •  Something about your research
   •  Why you are here / what you hope to learn

Page 3: IRB Galaxy CloudMan radionica

Workshop overview

9:30-10:00 Introductory lecture: the Galaxy and CloudMan applications

10:00-10:15 Q&A / break

10:15-10:30 Launching your own CloudMan cluster

10:30-11:30 Galaxy 101

11:30-11:45 Q&A / break

11:45-12:30 Configuring the Galaxy and CloudMan applications

12:30-12:45 Survey and AWS credits: 3x $100

Page 4: IRB Galaxy CloudMan radionica
Page 5: IRB Galaxy CloudMan radionica
Page 6: IRB Galaxy CloudMan radionica

Making sense of this data requires a sophisticated analysis environment with adequate computational infrastructure that is accessible to researchers while ensuring reproducibility of scientific results.

Page 7: IRB Galaxy CloudMan radionica

Galaxy: accessible analysis system

Page 8: IRB Galaxy CloudMan radionica

What is Galaxy?

A data analysis and integration tool

A (free for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage

Open source software that makes integrating your own tools and data and customizing for your own site simple

Page 9: IRB Galaxy CloudMan radionica

Need an analysis? There’s a tool for that.

Page 10: IRB Galaxy CloudMan radionica

Running a tool
-  Automatically generated web UI from a tool wrapper (any tool can be integrated)
-  Integrated with other tools

Page 11: IRB Galaxy CloudMan radionica

Data analysis history

Page 12: IRB Galaxy CloudMan radionica

Galaxy Workflows

Page 13: IRB Galaxy CloudMan radionica

Reproducibility in Genomics

18 Nat. Genetics experiments in microarray gene expression

<50% reproducible

Problems
•  missing data (38%)
•  missing software, hardware details (50%)
•  missing methods, processing details (66%)

Ioannidis, J.P.A. et al. “Repeatability of published microarray gene expression analyses.” Nat Genet 41, 149-155 (2009)

14 re-sequencing experiments in Nat. Genetics, Nature, Science

0% reproducible?

Problems
•  missing primary data (50%)
•  tools unavailable (50%)
•  missing parameter settings, tool versions (100%)

"Devil in the details," Nature, vol. 470, 305-306 (2011).

Page 14: IRB Galaxy CloudMan radionica

Metadata = Reproducibility

Page 15: IRB Galaxy CloudMan radionica

Automatic metadata

Page 16: IRB Galaxy CloudMan radionica

Data provenance

Page 17: IRB Galaxy CloudMan radionica

User metadata

Page 18: IRB Galaxy CloudMan radionica

Sharing and Publishing

Page 19: IRB Galaxy CloudMan radionica
Page 20: IRB Galaxy CloudMan radionica
Page 21: IRB Galaxy CloudMan radionica

Three ways to use Galaxy

•  Public website

•  Download and Run Locally

•  Run on the Cloud

Page 22: IRB Galaxy CloudMan radionica

http://usegalaxy.org (a.k.a. Main)

•  Public web site

•  Anybody can use it

•  Hundreds of tools

•  Persistent

•  +500 users/month

•  ~200TB of user data

•  ~140,000 analysis jobs / month

http://bit.ly/gxystats

Page 23: IRB Galaxy CloudMan radionica

Public Galaxy Servers https://wiki.galaxyproject.org/PublicGalaxyServers

Interested in:

•  ChIP-chip and ChIP-seq? ✓ Cistrome
•  Statistical Analysis? ✓ Genomic Hyperbrowser
•  Sequence and tiling arrays? ✓ Oqtans
•  Text Mining? ✓ DBCLS Galaxy
•  Reasoning with ontologies? ✓ GO Galaxy
•  Internally symmetric protein structures? ✓ SymD

Page 24: IRB Galaxy CloudMan radionica

getgalaxy.org

Configuration

Local installation

Page 25: IRB Galaxy CloudMan radionica

Compute clusters

•  A number of connected computers

•  Typically built from commodity components

•  Used to improve performance: throughput or speed (supercomputers)

Page 26: IRB Galaxy CloudMan radionica

ALL GOOD, RIGHT?

Page 27: IRB Galaxy CloudMan radionica

Two challenges still exist

•  Infrastructure
•  Customization

Page 28: IRB Galaxy CloudMan radionica

cloudman.irb.hr

AWS

OpenStack

Eucalyptus

Configuration

Page 29: IRB Galaxy CloudMan radionica

Cloud Computing
•  Dynamically scalable shared resources accessed over a network
•  Control infrastructure via API
•  Private, public, or hybrid
•  Virtually unlimited resources: storage, computing, services
•  Only pay for what you use

Page 30: IRB Galaxy CloudMan radionica
Page 31: IRB Galaxy CloudMan radionica

What is CloudMan?

CloudMan allows one to create a compute cluster in the cloud, use pre-configured applications, or add one’s own. And then share it all.

Page 32: IRB Galaxy CloudMan radionica

Deploying a CloudMan Platform

1.  An account on the supported cloud

2.  Start a master instance via a launcher app or the cloud web dashboard

3.  Use the CloudMan web interface on the master instance to manage the platform

Page 33: IRB Galaxy CloudMan radionica

Manage Your Cluster

Page 34: IRB Galaxy CloudMan radionica

Share Your Instance
•  Share entire (Galaxy) CloudMan platform
•  Even the customized ones (including data and/or tools)
•  Fully automated solution
•  Publish a self-contained analysis
   •  In progress or otherwise

Page 35: IRB Galaxy CloudMan radionica

How much does the Cloud cost?

Amazon Web Services
•  $0.14 per CPU hour (~$100 per CPU month)
•  $0.05 per GB-month (~$50 per TB-month)
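
The monthly figures in parentheses follow directly from the hourly and per-GB prices. A minimal sketch of that arithmetic, assuming a 30-day month and decimal terabytes (1 TB = 1000 GB); the function names are illustrative only, and the prices are the ones quoted on this slide, not current AWS pricing:

```python
# Back-of-the-envelope check of the prices quoted on the slide.
CPU_HOUR_USD = 0.14        # from the slide
GB_MONTH_USD = 0.05        # from the slide
HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month
GB_PER_TB = 1000           # assumption: decimal terabytes


def monthly_cpu_cost(cpus, hours=HOURS_PER_MONTH):
    """Cost in USD of keeping `cpus` CPUs busy for `hours` hours."""
    return cpus * hours * CPU_HOUR_USD


def monthly_storage_cost(terabytes):
    """Cost in USD of storing `terabytes` TB for one month."""
    return terabytes * GB_PER_TB * GB_MONTH_USD


cpu_month = monthly_cpu_cost(1)
tb_month = monthly_storage_cost(1)
print(f"1 CPU for a month: ${cpu_month:.2f}")
print(f"1 TB for a month:  ${tb_month:.2f}")
```

The same two functions can be used to estimate a workshop-sized cluster, for example `monthly_cpu_cost(4, hours=3)` for four CPUs over a three-hour session.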

Page 36: IRB Galaxy CloudMan radionica


Page 37: IRB Galaxy CloudMan radionica

Working with your own CloudMan cluster
•  Launch an instance
•  Demonstrate the following CloudMan features and prepare for the data analysis part:
   •  Manual & Auto-scaling
   •  Using an S3 bucket as a data source
   •  Accessing an instance over ssh
   •  Customizing an instance
   •  Controlling Galaxy
   •  Sharing-an-instance
•  Perform data analysis in Galaxy
   •  Find exons with most SNPs

Interaction flow

Page 38: IRB Galaxy CloudMan radionica

YOUR TURN

Page 39: IRB Galaxy CloudMan radionica

Launch an instance
1.  Slides @ bit.ly/irb-ws
2.  Load biocloudcentral.org
3.  Enter the access key and secret key provided at http://bit.ly/ws-creds
4.  Provide your email address
5.  Use your initials as the cluster name
6.  Set any password (and remember it)
7.  Use Large instance type
8.  Start your instance
    Wait for the instance to start (~2-3 minutes)
9.  Access Galaxy application

For more details, see http://cloudman.irb.hr

Page 40: IRB Galaxy CloudMan radionica


Page 41: IRB Galaxy CloudMan radionica

Agenda details
•  Launch an instance
•  Perform data analysis in Galaxy
   •  Find exons with most SNPs
•  Demonstrate the following CloudMan features and prepare for the data analysis part:
   •  Manual & Auto-scaling
   •  Using an S3 bucket as a data source
   •  Accessing an instance over ssh
   •  Customizing an instance
   •  Controlling Galaxy
   •  Sharing-an-instance

Interaction flow

Page 42: IRB Galaxy CloudMan radionica

On human chromosome 22, which coding exons have the most SNPs in them?

Page 43: IRB Galaxy CloudMan radionica

A Rough Plan

•  Get some data
   •  Coding exons on chromosome 22
   •  SNPs on chromosome 22
•  Mess with it
   •  Identify which exons have SNPs
   •  Count SNPs per exon
•  Visualize our results
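
Outside Galaxy, the "mess with it" step of this plan can be sketched in a few lines of Python. This is only an illustration on made-up toy data, not the Galaxy tools used in the exercise: the exon and SNP names and coordinates are hypothetical, and intervals are assumed to be half-open (start included, end excluded).

```python
from collections import Counter

# Hypothetical toy data: (chrom, start, end, name), half-open intervals.
exons = [
    ("chr22", 100, 200, "exonA"),
    ("chr22", 300, 400, "exonB"),
    ("chr22", 500, 600, "exonC"),
]
snps = [
    ("chr22", 150, 151, "rs1"),
    ("chr22", 350, 351, "rs2"),
    ("chr22", 510, 511, "rs3"),
    ("chr22", 590, 591, "rs4"),
    ("chr22", 700, 701, "rs5"),  # falls outside every exon
]


def overlaps(a, b):
    """True if two half-open intervals on the same chromosome overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def count_snps_per_exon(exons, snps):
    """Pair every exon with every overlapping SNP, then count pairs per exon."""
    pairs = [(e[3], s[3]) for e in exons for s in snps if overlaps(e, s)]
    return Counter(exon_name for exon_name, _ in pairs)


counts = count_snps_per_exon(exons, snps)
ranked = counts.most_common()  # exons sorted by SNP count, highest first
print(ranked)
```

The all-against-all comparison is quadratic and only suitable for toy inputs; it is used here because it mirrors the two steps of the plan (identify overlaps, then count) as directly as possible.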

Page 44: IRB Galaxy CloudMan radionica

[Figure: two input tracks: exons, from UCSC, and SNPs, from UCSC.]

Page 45: IRB Galaxy CloudMan radionica

[Figure: the exon and SNP tracks, both from UCSC, are combined into overlap pairings.]

Page 46: IRB Galaxy CloudMan radionica

[Figure: the overlap pairings are reduced to exon overlap counts; the three illustrated exons get the counts 1, 1 and 2.]

Page 47: IRB Galaxy CloudMan radionica

[Figure: exons, from UCSC, next to the exon overlap counts (1, 1, 2).]

Page 48: IRB Galaxy CloudMan radionica

[Figure: exons, from UCSC, and the exon overlap counts (1, 1, 2) are joined on exon name; the illustration also shows the values 0, 0, 0.]

Page 49: IRB Galaxy CloudMan radionica

[Figure: after the join on exon name, the columns are rearranged with cut.]
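
The last two steps of this build-up (join on exon name, then rearrange columns) can be sketched as plain list operations. The rows below are hypothetical, and placing the count in the score column of the exon record is an assumption made for illustration; the slides do not state which column order the exercise ends with.

```python
# Hypothetical exon rows: chrom, start, end, name, score, strand.
exon_rows = [
    ["chr22", "100", "200", "exonA", "0", "+"],
    ["chr22", "300", "400", "exonB", "0", "+"],
    ["chr22", "500", "600", "exonC", "0", "-"],
]
# Hypothetical overlap counts: exon name, number of SNPs.
count_rows = [["exonA", "1"], ["exonB", "1"], ["exonC", "2"]]


def join_on_name(exon_rows, count_rows, name_col=3):
    """Append the count to each exon row that has a matching name."""
    counts = {name: count for name, count in count_rows}
    return [
        row + [counts[row[name_col]]]
        for row in exon_rows
        if row[name_col] in counts
    ]


def cut_columns(rows, columns):
    """Keep only the given column indexes, in the given order."""
    return [[row[i] for i in columns] for row in rows]


joined = join_on_name(exon_rows, count_rows)
# Assumption: put the count (column 6) where the score (column 4) was.
bed_rows = cut_columns(joined, [0, 1, 2, 3, 6, 5])
for row in bed_rows:
    print("\t".join(row))
```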

Page 50: IRB Galaxy CloudMan radionica

Data types overview: BED
•  Tab-delimited text file that defines a feature track
•  Zero-based
•  One line per feature
•  Each line contains 3-12 columns
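
To make these properties concrete, here is a minimal sketch of reading one BED line in Python. The class and function names are hypothetical, only the first six columns are kept, and the end coordinate is treated as exclusive, which is how BED pairs with its zero-based start.

```python
from typing import NamedTuple, Optional


class BedFeature(NamedTuple):
    chrom: str
    start: int                    # zero-based, included
    end: int                      # excluded
    name: Optional[str] = None
    score: Optional[str] = None
    strand: Optional[str] = None


def parse_bed_line(line):
    """Parse one tab-delimited BED line with 3 to 12 columns."""
    fields = line.rstrip("\n").split("\t")
    if not 3 <= len(fields) <= 12:
        raise ValueError(f"expected 3-12 columns, got {len(fields)}")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    # Columns 4-6 (name, score, strand) are optional; later ones are ignored here.
    return BedFeature(chrom, start, end, *fields[3:6])


feature = parse_bed_line("chr22\t1000\t1100\texonA\t0\t+")
length = feature.end - feature.start  # no +1 needed with a half-open interval
print(feature.name, length)
```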

Page 51: IRB Galaxy CloudMan radionica

Data types overview: Tabular / Interval

•  Tabular
   •  Tab-delimited text file
•  Interval
   •  Each line represents a genomic interval
   •  Zero-based
   •  One line per interval
   •  Each line contains 3-5 columns

Page 52: IRB Galaxy CloudMan radionica

Your turn http://usegalaxy.org/galaxy101

Slides @ http://bit.ly/irb-ws

Page 53: IRB Galaxy CloudMan radionica


Page 54: IRB Galaxy CloudMan radionica


Page 55: IRB Galaxy CloudMan radionica

Scaling computation


Page 56: IRB Galaxy CloudMan radionica

Manual scaling
•  Explicitly add 1 worker node to your cluster

•  Node type corresponds to node processing capacity

•  Research use of Spot instances

Page 57: IRB Galaxy CloudMan radionica

Auto-scaling

Page 58: IRB Galaxy CloudMan radionica

Public / shared data
•  Take a look at the 1000 Genomes data

•  Take a look at AWS Public Datasets

•  More examples exist

•  How to use this freely available data and make new discoveries?

Page 59: IRB Galaxy CloudMan radionica

Using an S3 bucket as a data source

Page 60: IRB Galaxy CloudMan radionica

Accessing an instance over ssh

Use the terminal (or install Secure Shell for Chrome)

Log in over SSH as user ubuntu, with the password you chose when launching the instance:

[local machine]$ ssh ubuntu@<instance IP address>

Page 61: IRB Galaxy CloudMan radionica

Once logged in

•  You have full system access to your instance, including sudo; use it as any other system

•  A galaxy user exists on the system and should be used when manipulating Galaxy (sudo su galaxy)

•  You can submit jobs via the standard qsub command

Page 62: IRB Galaxy CloudMan radionica

Customizing an instance
•  Edit Galaxy’s configuration

$ sudo su galaxy

$ cd /mnt/galaxy/galaxy-app

$ nano universe_wsgi.ini

In universe_wsgi.ini, set:

allow_library_path_paste = True
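
The same change can be scripted instead of made in an editor. A minimal sketch using Python's standard configparser on an in-memory sample; the sample text is hypothetical, and the `[app:main]` section name is an assumption about where the option lives in universe_wsgi.ini, so check the actual file before relying on it.

```python
import configparser
import io

# Hypothetical excerpt; the real file contains many more options.
SAMPLE_INI = """\
[app:main]
# allow admins to paste filesystem paths when uploading to data libraries
allow_library_path_paste = False
"""


def set_option(ini_text, section, option, value):
    """Return ini_text with section/option set to value."""
    # interpolation=None so that '%' characters in values are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, option, value)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


updated = set_option(SAMPLE_INI, "app:main", "allow_library_path_paste", "True")
print(updated)
```

Note that configparser drops comments when it writes the file back, so for a heavily commented configuration file a manual edit, as shown on the slide, keeps more context.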

Page 63: IRB Galaxy CloudMan radionica

Controlling Galaxy
•  Start/stop Galaxy application

•  Add an admin user

•  Use the email you registered with

Page 64: IRB Galaxy CloudMan radionica

S3 bucket as a data library

•  Within Galaxy, create a Data Library, using S3 bucket path as the data source (/mnt/workshop-data)

•  This will import all the datasets into the Data Library

•  Import those datasets into a history

Page 65: IRB Galaxy CloudMan radionica

Extending the tool palette
•  Galaxy ToolShed = App Store for Galaxy

•  Need to be an Admin to use

•  Browse the Main ToolShed and install needed tool(s)

Page 66: IRB Galaxy CloudMan radionica

Sharing-an-Instance
•  Share the entire CloudMan platform

•  Includes all user data and even the customizations

•  Publish a self-contained analysis

•  Make a note of the share-string and send it to your neighbor

Page 67: IRB Galaxy CloudMan radionica


Page 68: IRB Galaxy CloudMan radionica

Want more tutorials?

genome.edu.au/wiki/Learn

galaxy-tut.genome.edu.au

•  RNA-seq (basic and advanced)

•  Variant detection (basic and advanced)

•  Genome assembly

•  Quality control for small RNA

•  …

Page 69: IRB Galaxy CloudMan radionica
Page 70: IRB Galaxy CloudMan radionica

Survey

bit.ly/IRBanketa

Page 71: IRB Galaxy CloudMan radionica

AWS Credits 3x $100

Valid only for AWS services

Will they really be useful for your work? Explain how in one minute!

A (short!) written report on your experience after ~3 months