Galaxy RNA-Seq Analysis: H. sapiens Tutorial
Research Informatics Solutions, Minnesota Supercomputing Institute, University of Minnesota
Version 3, 10/25/2016
RIS, Minnesota Supercomputing Institute, University of Minnesota
1 Introduction
  1.1 Scope of this tutorial
  1.2 Reference materials
  1.3 Outline of tutorial
2 Starting Galaxy
  2.1 Accessing Galaxy
  2.2 Import Fastq files for one sample into current history
  2.3 Import the GTF file from the iGenomes data library
  2.4 Set file attributes
  2.5 Run FastQC
3 Mapping with Tophat
  3.1 Initial Tophat run
  3.2 Determine insert size
  3.3 Rerun Tophat with correct insert size
  3.4 Review mapping statistics
4 Workflows
5 Visualizing alignments with IGV
  5.1 Load BAM alignment files and GTF into new history
  5.2 Load files into IGV
  5.3 Look at a housekeeping gene
  5.4 Look at a gene with differential expression
6 Computing differential expression with cuffdiff
  6.1 Run cuffdiff
  6.2 Filter cuffdiff output
7 Cuffdiff visualization with CummeRbund
  7.1 Run CummeRbund tool
  7.2 Review CummeRbund plots
  7.3 Additional CummeRbund plots
  7.4 Troubleshooting
8 Appendix A: Workflows
  8.1 Extract workflow from current history
  8.2 Edit the workflow
  8.3 Create new history
  8.4 Run workflow
1 Introduction
1.1 Scope of this tutorial
This is a practical, hands-on tutorial designed to give participants experience with RNA-Seq data analysis using Tophat, Cufflinks, and CummeRbund in Galaxy. The analysis in this tutorial is typical of experiments in eukaryotic species with high-quality genomes and genome annotation available. Participants are expected to be familiar with next-generation sequence data, basic theory of RNA-Seq, and Galaxy. Participants do not need previous experience with Tophat, Cufflinks, or CummeRbund.
1.2 Reference materials
RNA-Seq Lecture PDFs on MSI website: https://www.msi.umn.edu/sites/default/files/RNA-Seq%20Lecture_2016.pdf
Galaxy 101: NGS data analysis hands-on tutorial: www.msi.umn.edu/content/bioinformatics-analysis
Tophat manual: ccb.jhu.edu/software/tophat/manual.shtml
Cufflinks manual: cole-trapnell-lab.github.io/cufflinks/manual/
CummeRbund manual: compbio.mit.edu/cummeRbund
1.3 Outline of tutorial
1 Introduction
2 Starting Galaxy
3 Mapping with Tophat
4 Workflows
5 Visualizing alignments with IGV
6 Computing differential expression with cuffdiff
7 Cuffdiff visualization with CummeRbund
8 Appendix A: Workflows
2 Starting Galaxy
Tutorial Dataset (Sect 2.2)
This tutorial will identify genes whose expression levels differ between skeletal muscle tissue and heart muscle tissue. The sample dataset used in this tutorial was created from the heart and skeletal muscle samples from the Illumina Bodymap 2.0 Project (www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30611). The single heart and skeletal muscle samples were split into three subsamples, and the reads mapping to a 5 MB region near the distal end of chromosome 19 were extracted along with some unmapped reads. Each fastq file contains about 50,000 50 base-pair paired-end reads.
NOTE: This dataset was chosen to allow for fast processing and response times in a classroom setting where dozens of people will be submitting jobs at once to the server. It is not ideal due to the small sample sizes (leading to atypical-looking graphs in some cases and poor statistics) and lack of real biological replicates (resulting in unrealistically-good sample separation).

GTF Files (Sect 2.3)
A GTF file identifies the genomic locations of genes and their exons. If a GTF file for your organism is not listed, send a request to MSI, or find one online at sites such as www.ensembl.org/info/data/ftp/index.html, genome.ucsc.edu/cgi-bin/hgTables?command=start, or NCBI. The GTF files provided in the Illumina iGenomes collection (ccb.jhu.edu/software/tophat/igenomes.shtml) have been specially modified for maximum compatibility with the Cufflinks and Cuffdiff programs.

Quality Control (Sect 2.5)
It is important to always verify the integrity of a dataset before starting to analyze it. Quantifying dataset quality may uncover problems that might otherwise go undetected. Data quality problems such as sequencing adaptor contamination or low read quality require trimming and filtering not covered in this tutorial. See the Galaxy 101 tutorial handout on the MSI website for detailed instructions on how to clean up a low quality dataset: www.msi.umn.edu/content/bioinformatics-analysis
The graphs generated in this tutorial are not entirely typical due to the small sample datasets used. See examples of output from good and bad Illumina datasets under the “Example Reports” section on this website: www.bioinformatics.babraham.ac.uk/projects/fastqc/. For more information about interpreting FastQC output refer to the RIS tutorial “QC of Illumina Data using Galaxy” handout: www.msi.umn.edu/content/bioinformatics-analysis
2.1 Accessing Galaxy
a) Open a web browser and navigate to the MSI Galaxy website: galaxy.msi.umn.edu
b) Log in with your MSI username and password
[Figure: Galaxy window showing the Tools pane, Center pane, and History pane]
2.2 Import Fastq files for one sample into current history (see Tutorial Dataset)
a) At the top of the screen select “Shared Data -> Data Libraries”
b) Select “RISS-tutorial-Hsapiens” from the list of data libraries
c) Expand the “Fastq” folder and check the boxes next to the first two files
d) Near the top of the screen click the “to History” button, then click “Import” to import the selected datasets to the selected history (the default is the current history)
2.3 Import the GTF file from the iGenomes data library (see GTF Files)
a) At the top of the screen select “Shared Data -> Data Libraries”
b) Select “iGenomes” from the list of data libraries
c) Check the box next to the “hg19_chr19_genes_2012-03-09.gtf” file
d) Near the top of the page click the “to History” button, then click “Import” to import the selected datasets to the current history
e) At the top of the screen click “Analyze Data” to return to your current history
2.4 Set file attributes
a) In the history pane click on the pencil icon next to the heart-1_R1.fastq file
b) Click the Datatype tab
c) Enter “fastqsanger” in the “New Type” box. A list of available datatypes will appear as you type.
d) Click save
2.5 Run FastQC (see Quality Control)
a) Load the FastQC tool from the tool pane: “NGS: QC and manipulation -> FastQC”
b) Set the input file: select “heart-1_R1.fastq” from the dropdown menu under “Short read data from your current history”
c) Click “Execute”
d) When FastQC has finished running, click on the eye icon on the FastQC Webpage output file to display the file in the center pane
For a real dataset you would need to repeat this step on the R2 fastq file.
See the Galaxy 101 tutorial handout for detailed instructions on how to clean up a low quality dataset: www.msi.umn.edu/content/bioinformatics-analysis
3 Mapping with Tophat
Reference Genomes (Sect 3.1)
It is important that the reference genome you align against and the GTF you are using come from the same reference genome build, because the chromosome names and coordinates used in the GTF file must be the same as those used in the database. If the reference genome for your organism is not listed, email a request to MSI to have it added.
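One way to picture the chromosome-name requirement is to compare the names in the first column of a GTF file with the names in the reference genome. The following is a minimal sketch in Python and is not part of the Galaxy workflow; the function names, the example records, and the reference name list are all made up for illustration.

```python
def gtf_chromosomes(gtf_lines):
    """Collect the chromosome names (column 1) used in GTF records."""
    names = set()
    for line in gtf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):  # skip blank and comment lines
            continue
        names.add(line.split("\t")[0])
    return names


def missing_from_reference(gtf_lines, reference_names):
    """Return GTF chromosome names that the reference genome does not contain."""
    return sorted(gtf_chromosomes(gtf_lines) - set(reference_names))


# Made-up GTF records for illustration only.
example_gtf = [
    'chr19\tunknown\texon\t100\t200\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A";',
    '19\tunknown\texon\t300\t400\t.\t-\t.\tgene_id "GENE_B"; transcript_id "TX_B";',
]

# Hypothetical reference that names its chromosomes with a "chr" prefix.
print(missing_from_reference(example_gtf, ["chr19"]))
```

Any name reported here (in this made-up case, the record that uses "19" instead of "chr19") points to a GTF and reference genome that do not share a naming scheme.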
Mean Inner Distance – Part I (Sect 3.1)
This is the expected (mean) inner distance between mate pairs. For example, the UMGC’s default fragment selection size is 200, so 200 – (2 * read length) is a good value to use for this parameter. We will determine the exact fragment length in the next section.

Junctions (Sect 3.1)
Tophat can attempt to identify exon-exon splice junctions solely using your dataset, or you may supply a set of gene model annotations as a GTF or GFF file. In this tutorial we will provide a GTF annotation file because the human genome is well annotated.

Advanced Tophat Parameters (Sect 3.1)
See the RNA-Seq Lecture 2 handout for more detail on setting parameters properly for other organisms: www.msi.umn.edu/content/bioinformatics-analysis

Mean Inner Distance – Part II (Sect 3.2)
It is important that the mean inner distance Tophat parameter is set correctly in order to get the best mapping results. The actual average fragment size for each sample can be determined by running Tophat with an estimated inner distance and then calculating the true value from the mapped reads. Rerunning Tophat with the true value will give improved results.

Insert Size Histogram (Sect 3.2)
The insert size histogram generated from this sample dataset is noisier than a typical histogram, shown here:
[Figure: typical insert size histogram]

Mapping Statistics (Sect 3.4)
It is important to determine how well the RNA-Seq reads align to the reference genome. Low mapping rates require further investigation to determine the cause.
3.1 Initial Tophat run (see Reference Genomes, Mean Inner Distance – Part I, Junctions, Advanced Tophat Parameters)
a) Load the Tophat tool from the tool pane: “NGS: RNA Analysis -> Tophat”
b) Is this library mate-paired -> Paired-end (as individual datasets)
c) RNA-Seq FASTQ file, forward reads -> heart-1_R1.fastq
d) RNA-Seq FASTQ file, reverse reads -> heart-1_R2.fastq
e) Mean Inner Distance between Mate Pairs -> 100
f) Select a reference genome -> Human hg19 chr19
g) TopHat settings to use -> Full parameter list
h) Do you want to supply your own junction data -> Yes
i) Use Gene Annotation Model -> Use a gene annotation from history
j) Click “Execute” to submit the job
Only files of type “fastqsanger” will appear in the dropdown list. If your fastq file isn’t shown, the file type is set incorrectly. See step 2.4.
3.2 Determine insert size (see Mean Inner Distance – Part II, Insert Size Histogram)
a) Load the insert size tool: “NGS: Picard -> CollectInsertSizeMetrics”
b) Using reference genome -> hg19-chr19
c) Click Execute
d) Click on the “eye” icon next to the first of the two output files in the history pane to view the output in the central pane
e) Identify the mode (highest frequency) insert size from the program output
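Reading the mode off the output by eye works well for a single sample. For many samples, the same step could be scripted. The sketch below is a rough illustration in Python, assuming the metrics text contains a histogram block that starts with a line beginning "## HISTOGRAM", followed by one header line and then tab-separated rows of insert size and read count. That layout, the column header, and the numbers are assumptions made for this example, so check them against your own output file.

```python
def histogram_mode(metrics_text):
    """Return the insert size with the highest read count."""
    lines = metrics_text.splitlines()
    # Assumption: the histogram follows a line starting with "## HISTOGRAM",
    # then one header line, then "insert_size<TAB>count" rows until a blank line.
    start = next(i for i, line in enumerate(lines) if line.startswith("## HISTOGRAM"))
    best_size, best_count = None, -1.0
    for row in lines[start + 2:]:
        if not row.strip():
            break
        fields = row.split("\t")
        size, count = int(fields[0]), float(fields[1])
        if count > best_count:
            best_size, best_count = size, count
    return best_size


# Made-up histogram block for illustration only.
example_metrics = "\n".join([
    "## HISTOGRAM\tjava.lang.Integer",
    "insert_size\tAll_Reads.fr_count",
    "158\t40",
    "159\t52",
    "160\t75",
    "161\t61",
    "",
])

print(histogram_mode(example_metrics))
```

With the made-up counts above, the most frequent insert size is 160.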
3.3 Rerun Tophat with correct insert size
a) Click on the name of any one of the Tophat output files in the history pane to expand it, and click on the circular arrow icon to display the Tophat tool in the central pane with the parameters preset from the last Tophat run
b) Change the “Mean Inner Distance between Mate Pairs” to the correct value: Picard value – (2 * read length) = 160 – (2 * 50) = 60
c) Click “Execute” to submit the job
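The inner distance arithmetic used in sections 3.1 and 3.3 can be written as a one-line function. This is a small sketch in Python; the function name is made up for illustration.

```python
def mean_inner_distance(fragment_size, read_length):
    """Inner distance between mates: fragment size minus both read lengths."""
    return fragment_size - 2 * read_length


# Initial estimate from the default fragment selection size (section 3.1).
initial_estimate = mean_inner_distance(200, 50)
# Value based on the insert size reported by Picard (section 3.3).
measured_value = mean_inner_distance(160, 50)

print(initial_estimate, measured_value)
```

The result is negative when the fragment is shorter than the two reads combined, which means the mates overlap.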
3.4 Review mapping statistics (see Mapping Statistics)
a) Click on the “eye” icon next to the Tophat “align_summary” output file in the history pane to view the output in the central pane
b) Rename the current history: at the top of the history pane click on “Unnamed history” and rename it “heart-1”. (NOTE: you must hit ‘Enter’ after typing the new name, rather than clicking outside the box)
4 Workflows

Galaxy Workflows (Sect 8)
All of the steps that have been performed on the heart-1 sample need to be repeated, in separate histories, for the two other heart samples and the three skeletal samples. Galaxy workflows provide an easy method to automate an analysis pipeline. Appendix A demonstrates how to generate a workflow from your current history and use it to analyze another sample. To save time we will not work through this section in the hands-on workshop, but this section should be completed if working on a real dataset.

5 Visualizing alignments with IGV

Visualization (Sect 5.3)
Visualizing alignments is a quick and easy way to check for major problems with the data. You may wish to verify that housekeeping genes are indeed roughly evenly covered with reads, or that documented differentially-expressed genes indeed have differential coverage between samples of different groups.

Galaxy Visualization Options (Sect 5.2)
Galaxy supports three genome browsers for visualizing data:
• The Integrative Genomics Viewer (IGV) is the recommended genome browser because it is fast, powerful, and easy to use.
• Trackster is a genome browser built into Galaxy. Any data file that can be viewed in Trackster will have a Trackster icon displayed with “Download” and “View details” buttons.
• The Integrated Genome Browser (IGB) is similar to IGV, but most users prefer to use IGV.

Sample Dataset (Sect 5.1)
In this section we start with Bam alignment files that have already been generated for all six heart and skeletal samples. These Bam files were generated using the workflow previously described in this tutorial.
5.1 Load BAM alignment files and GTF into new history (see Sample Dataset)
a) Create a new history by clicking on the gear icon at the top of the history window and selecting “Create New” from the drop-down menu
b) Click on “Shared Data -> Data Libraries” at the top of the window
c) Click on the “RISS-tutorial-Hsapiens” data library
d) Expand the “Bam” folder and check the box next to each bam file
e) Click “to History” near the top of the center pane to import to the current history
f) Import the hg19_chr19 GTF file by clicking on “Shared Data -> Data Libraries” at the top of the screen and selecting “hg19_chr19_genes_2012-03-09.gtf” from the “iGenomes” data library
g) Return to your history by clicking on “Analyze Data” at the top of the screen
5.2 Load files into IGV (see Galaxy Visualization Options)
a) Launch the IGV browser on your computer (to download IGV: http://software.broadinstitute.org/software/igv/download).
b) Click on the “heart-1_accepted_hits.bam” file in the history pane to expand it and click on the “local” link next to “display with IGV”. The heart-1.bam file will load into IGV.
c) Repeat b) to load skeletal-1.bam into IGV.
5.3 Look at a housekeeping gene (see Visualization)
a) Verify that “Human hg19” is selected as the reference genome from the drop-down menu at the top left of the IGV window
b) Enter “ube2s” in the search box to view the reads aligning to the ubiquitin-conjugating enzyme E2S gene, which is expected to have similar expression levels in both tissue types
c) Right-click on the heart coverage track and select “Set Data Range”
d) Set the “Max” value to 16
e) Repeat for the skeletal coverage track
5.4 Look at a gene with differential expression
a) Enter “tnnt1” in the search box to view the reads aligning to the Troponin T, slow skeletal muscle gene, which is expected to be expressed only in skeletal muscle
b) Adjust the scale of the coverage tracks as needed (try max = 1700)
6 Computing differential expression with cuffdiff

Cuffdiff Output (Sect 6.2)
Cuffdiff produces many output files. In this tutorial we look at the gene differential expression testing file, which shows which genes are differentially expressed. The other output files also contain important data, including the results of differential expression testing for spliced transcripts, primary transcripts, and coding sequences. See the Cufflinks manual for detailed information about what is in each file: cole-trapnell-lab.github.io/cufflinks/file_formats/index.html#output-formats-used-in-the-cufflinks-suite

Differential Gene Expression (Sect 6.2)
The gene differential expression testing output file is a tab-delimited text file with one row for each gene. Our sample dataset only covers a small portion of chr19, so most genes will have too few aligned reads for a differential expression test. These genes are indicated with “NOTEST” or “LOWDATA” in column 7.

De novo gene/transcript discovery (Sect 6.1)
The analysis pipeline used in this tutorial will quantify the expression of known genes in a reference annotation. If you are interested in discovering novel genes or splice forms, more steps need to be added to the pipeline. Refer to the Nature Protocols paper “Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks” for more information: www.ncbi.nlm.nih.gov/pubmed/22383036
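To see how many genes had enough reads to be tested, the status values in column 7 of the gene differential expression testing file can be tallied. The sketch below is a minimal illustration in Python, assuming the table has been saved as tab-delimited text with the header line removed. The function name and the example rows are made up.

```python
from collections import Counter


def status_counts(data_rows):
    """Count the values found in column 7 (status) of the gene table."""
    # Column 7 in 1-based numbering is index 6 after splitting on tabs.
    return Counter(row.rstrip("\n").split("\t")[6] for row in data_rows)


# Made-up data rows (header line already removed), truncated to seven columns.
example_rows = [
    "\t".join(["ID_A", "ID_A", "GENE_A", "chr19:1-100", "Heart", "Skeletal", "OK"]),
    "\t".join(["ID_B", "ID_B", "GENE_B", "chr19:200-300", "Heart", "Skeletal", "NOTEST"]),
    "\t".join(["ID_C", "ID_C", "GENE_C", "chr19:400-500", "Heart", "Skeletal", "NOTEST"]),
    "\t".join(["ID_D", "ID_D", "GENE_D", "chr19:600-700", "Heart", "Skeletal", "LOWDATA"]),
]

counts = status_counts(example_rows)
print(counts)
```

A large share of “NOTEST” and “LOWDATA” rows is expected for the small tutorial dataset.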
6.1 Run cuffdiff (see De novo gene/transcript discovery)
a) Load the Cuffdiff tool: “NGS: RNA Analysis -> Cuffdiff”
b) Set parameters:
• Generate SQLite -> Yes
• 1: Condition Name -> Heart
• Replicates -> use shift to select the three heart bam files
• 2: Condition Name -> Skeletal
• Replicates -> use shift to select the three skeletal bam files
c) Click “Execute” to submit the job
6.2 Filter cuffdiff output (see Cuffdiff Output, Differential Gene Expression)
a) Load the text filter tool: “Filter and Sort -> Filter”
b) Click on the output file “gene differential expression testing” to expand it in the history pane (this allows you to see the column names and numbers)
c) Set the Cuffdiff output file “gene differential expression testing” as the file to filter
d) Keep only the genes with a significant change in expression and an absolute log fold-change greater than 1 by entering “c14=='yes' and abs(c10)>1” in the “with following condition” text box
e) Click “Execute” to submit the job
f) Click on the “eye” icon next to the filter output file name to view the results in the center pane
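The filter condition from step d) can be read as a small piece of Python. The sketch below mirrors that condition outside of Galaxy. It assumes tab-delimited data rows with the header line removed, where column 10 holds the log fold-change and column 14 holds the significance flag, as used in the condition above. The helper names and example rows are made up for illustration.

```python
def keep_row(fields):
    """Mirror the Galaxy condition: c14=='yes' and abs(c10)>1."""
    log_fold_change = float(fields[9])  # c10 (1-based) is index 9
    significant = fields[13]            # c14 (1-based) is index 13
    return significant == "yes" and abs(log_fold_change) > 1


def filter_rows(data_rows):
    """Return the tab-delimited data rows that pass the filter."""
    return [row for row in data_rows if keep_row(row.rstrip("\n").split("\t"))]


def make_row(gene, log_fold_change, significant):
    """Hypothetical helper: build a 14-column row with placeholder values."""
    fields = ["."] * 14
    fields[2] = gene
    fields[9] = str(log_fold_change)
    fields[13] = significant
    return "\t".join(fields)


# Made-up rows for illustration only.
example_rows = [
    make_row("GENE_A", 3.2, "yes"),     # significant, large change: kept
    make_row("GENE_B", 0.4, "yes"),     # significant, small change: dropped
    make_row("GENE_C", -2.5, "no"),     # large change, not significant: dropped
    make_row("GENE_D", "-inf", "yes"),  # infinite fold-change: kept
]

kept_genes = [row.split("\t")[2] for row in filter_rows(example_rows)]
print(kept_genes)
```

Taking the absolute value keeps genes that change in either direction, including rows where the fold-change is reported as infinite because one condition has no expression.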
7 Cuffdiff visualization with CummeRbund

CummeRbund
CummeRbund is an easy-to-use R package that takes the output files from a cuffdiff run and creates a SQLite database of the results. This allows the user to explore data for genes, transcripts, transcription start sites, and CDS regions across multiple samples or conditions. CummeRbund implements numerous plotting functions for commonly used visualizations. The CummeRbund wrapper in Galaxy allows easy access to much of CummeRbund’s functionality. For more details about available plots refer to the CummeRbund website: compbio.mit.edu/cummeRbund/

Density Plots
A kernel density plot is interpreted the same as a histogram. The density plot shows the distribution of gene expression levels across different samples. All samples should have reasonably similar distributions. A log10(FPKM) of 0 corresponds to 1 FPKM, which is very low expression.

MDS Plots
MDS plots are similar to Principal Component Analysis (PCA) plots. They are useful for determining the major sources of variation in the dataset. Ideally, samples from the same experimental group will be clustered together in the plot, indicating that experimental condition is the major source of variation. Samples might also cluster by age, batch, date, technician, or other technical aspect of the experiment.

Dendrogram
A dendrogram is a tree diagram showing how samples cluster by similarity. Ideally, samples from the same experimental group are clustered together.
7.1 Run CummeRbund tool (see CummeRbund)
a) Load the CummeRbund tool: “NGS: RNA Analysis -> cummeRbund visualize Cuffdiff output”
b) Set parameters:
• + Insert Plots (click three times to generate three plots)
• Plot type: Density
• Plot type: Multi Dimensional Scaling (MDS) Plot
• Plot type: Dendrogram
c) Click “Execute” to submit the job

Have patience when setting the CummeRbund parameters. After changing each setting it takes several seconds for the center pane to reload. This is common when working with large histories.
7.2 Review CummeRbund plots (see Density Plots, MDS Plots, Dendrogram)
a) When the CummeRbund job has finished, refresh the history pane by clicking on the refresh icon at the top of the history pane
b) Click the “eye” icon next to each of the three CummeRbund output files to view the plots
c) Verify that:
• The samples have similar density distributions
• The samples cluster by experimental condition in the MDS plot
• The samples cluster by experimental condition in the dendrogram
7.3 Additional CummeRbund plots:
a) Volcano, Heatmap, Expression Plot, and Cluster.

7.4 Troubleshooting
If you experience problems using Galaxy, send an email to help@msi.umn.edu with a subject beginning “RIS” and a report of the problem.
8 Appendix A: Workflows

Galaxy Workflows (Sect 8.1)
All of the steps that have been performed on the heart-1 sample need to be repeated for the two other heart samples and the three skeletal samples. Galaxy workflows provide an easy method to automate an analysis pipeline. Appendix A demonstrates how to generate a workflow from your current history and use it to analyze another sample. To save time we will not work through this section in the hands-on workshop.

Workflow Parameters (Sect 8.2)
The workflow we set up in this section will run FastQC, Tophat, and Insertion size metrics. Tophat2 will be run just once, using the inner mate distance calculated from the first sample. Samples that were sequenced together in the same batch often have very similar average insert sizes, so the same inner mate distance can be used for all samples. Check the Insertion size metrics results after running the workflow to verify that this is the case.
8.1 Extract workflow from current history (see Galaxy Workflows)
a) At the top of the history pane click on the small gear icon and select “Extract Workflow” from the pop-up menu
b) In the “Workflow name” box enter “QC and Tophat”
c) Uncheck the second (closest to the bottom) Tophat run
d) Click “Create Workflow” under the workflow name
8.2 Edit the workflow (see Workflow Parameters)
a) Click on “Workflow” at the top of the Galaxy window
b) Click on the workflow that was just created and select “Edit” from the drop-down menu
c) Move the elements of the workflow around to make it easier to see how they are connected.
d) Click on the first Input dataset box and set the Name field to ‘R1’. Repeat for the second input dataset (‘R2’).
e) Click on the Tophat box to display the Tophat options in the “Details” pane on the right side.
f) Set the “Mean Inner Distance between Mate Pairs” to 60.
g) Verify the other Tophat parameters are set correctly.
h) Save your changes by selecting “Options -> Save” near the top of the screen
i) Return to your history by clicking on “Analyze Data” at the top of the screen
8.3 Create new history
a) Rename the current history: at the top of the history pane click on “Unnamed history” and rename it “heart-1”. (NOTE: you must hit ‘Enter’ after typing the new name, rather than clicking outside the box.)
b) Create a new history by clicking on the gear icon at the top of the history pane and selecting “Create New” from the pop-up menu
c) Name the new history “heart-2”
d) Import the heart-2 fastq files by clicking on “Shared Data -> Data Libraries” at the top of the screen and selecting the “heart-2_R1.fastq” and “heart-2_R2.fastq” files from the “RISS-tutorial-Hsapiens” data library
e) Import the hg19_chr19 GTF file by clicking on “Shared Data -> Data Libraries” at the top of the screen and selecting “hg19_chr19_genes_2012-03-09.gtf” from the “iGenomes” data library
f) Return to your history by clicking on “Analyze Data” at the top of the screen
8.4 Run workflow
a) Load a workflow by clicking on “Workflow” at the top of the screen
b) Click on the workflow that was just created and select “Run” from the drop-down menu
c) Select the “heart-2_R1.fastq” file in the first drop-down menu and the “heart-2_R2.fastq” file in the second drop-down menu
d) Verify the GTF file is selected in the third drop-down menu
e) Click on “Run workflow” to submit the FastQC, Tophat, and Insertion size metrics jobs.