Anvaya: A Workflow Engine for High Throughput Genomics
Dr. Rajendra Joshi
Associate Director & HOD, Bioinformatics Group
C-DAC, Pune, India
[email protected]
HIGH-THROUGHPUT TECHNIQUES ARE REVOLUTIONIZING LIFE SCIENCES
To exploit the enormous scientific value of this information for understanding biological systems, the information must be integrated, analyzed, graphically displayed and ultimately modeled computationally.
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca tggatttgcc tgttctggat attcatatta atagaatcaa
CURRENT SCENARIO
Figure: Stuart Owen, “Workflows with Taverna”
Architecture
Key Features of Anvaya
• Rules Engine that adds intelligence to control tool connectivity
• Provision of additional Custom Tools and Custom Parsers
• 13 pre-defined workflows for frequently used pipelines in genome annotation and comparative genomics
• Easy-to-use, standalone Anvaya Client, supported on Windows as well as Linux
Feature : Workflow Operations
• Create a workflow or pipeline using the available tool list
• Set the properties of each node and save the workflow
• Run the workflow to execute the pipeline on a high-end server
• Stop the workflow if the user anticipates changes, etc.
• Resume the workflow from the previously executed node
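The run, stop and resume operations listed above can be illustrated with a minimal Python sketch. It is not Anvaya's implementation: the `Workflow` class, the node names and the in-memory `completed` checkpoint are all hypothetical, and the sketch assumes a simple linear pipeline in which "resume" means continuing after the last node that finished.

```python
from typing import Callable, Dict, List


class Workflow:
    """Hypothetical linear pipeline that can be stopped and resumed."""

    def __init__(self, nodes: List[str], actions: Dict[str, Callable[[], None]]) -> None:
        self.nodes = nodes              # node names in execution order
        self.actions = actions          # node name -> callable that runs the tool
        self.completed: List[str] = []  # checkpoint: nodes that have finished
        self.stop_requested = False

    def stop(self) -> None:
        """Ask the engine to halt before the next node starts."""
        self.stop_requested = True

    def run(self) -> List[str]:
        """Run (or resume) the pipeline, skipping nodes already completed."""
        self.stop_requested = False
        for name in self.nodes:
            if name in self.completed:
                continue                # finished in an earlier run
            if self.stop_requested:
                break                   # leave the remaining nodes for a later resume
            self.actions[name]()
            self.completed.append(name)
        return list(self.completed)


log: List[str] = []


def base_calling() -> None:
    log.append("base_calling")


def assembly() -> None:
    log.append("assembly")
    wf.stop()                           # simulate the user pressing "Stop" during this node


def annotation() -> None:
    log.append("annotation")


wf = Workflow(
    nodes=["base_calling", "assembly", "annotation"],
    actions={"base_calling": base_calling, "assembly": assembly, "annotation": annotation},
)
first_run = wf.run()    # halts once "assembly" has finished
second_run = wf.run()   # resumes with "annotation" only
```

Keeping the list of completed nodes separate from the node order is what lets a second `run()` skip finished work rather than restart the pipeline.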
Feature : Tool List
The tool list in Anvaya is available functionality-wise or in alphabetical order
Feature : Rules Engine
• All the tools included in Anvaya have been categorized according to their functionality, and the allowed logical connectivity between tools has been included as a rules file.
Defines rules for logical connection between the existing tools
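A category-based connectivity check of this kind could look roughly like the sketch below. The rules-file syntax (`source -> target`), the category names and the tool names are invented for illustration; the slides do not describe the actual format of Anvaya's rules file.

```python
from typing import Dict, Set, Tuple

# Hypothetical rules file: one allowed category-to-category connection per line.
RULES_TEXT = """
# source_category -> target_category
similarity_search -> alignment
alignment -> phylogeny
"""

# Hypothetical assignment of tools to functional categories.
TOOL_CATEGORY: Dict[str, str] = {
    "search_tool": "similarity_search",
    "align_tool": "alignment",
    "tree_tool": "phylogeny",
}


def load_rules(text: str) -> Set[Tuple[str, str]]:
    """Parse 'a -> b' lines into a set of allowed (source, target) category pairs."""
    rules: Set[Tuple[str, str]] = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                    # skip blank lines and comments
        src, dst = (part.strip() for part in line.split("->"))
        rules.add((src, dst))
    return rules


def can_connect(src_tool: str, dst_tool: str, rules: Set[Tuple[str, str]]) -> bool:
    """Return True if the output of src_tool may feed dst_tool under the rules."""
    pair = (TOOL_CATEGORY[src_tool], TOOL_CATEGORY[dst_tool])
    return pair in rules


RULES = load_rules(RULES_TEXT)
```

With these sample rules, `can_connect("search_tool", "align_tool", RULES)` is allowed while `can_connect("tree_tool", "search_tool", RULES)` is rejected, which is the kind of check a design canvas could apply before accepting a connection between two nodes.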
Feature : Custom Tools and Parsers
• Custom tools in Anvaya serve as a wrapper around one or more standard tools or are tools with new functionality not available in standard tools.
Anvaya Custom Tools provide novel functionalities to carry out exhaustive comparative analysis
• Parser scripts have been developed in Perl to enhance the logical connectivity between various tools, which was hitherto not possible and required manual intervention
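To give a flavour of what such a parser does, here is a small sketch that turns the output of one tool into the input of the next. It is written in Python purely for illustration (Anvaya's parsers are Perl scripts), and the three-column tab-separated layout (query, subject, E-value), the function name and the default cutoff are assumptions rather than any real tool's format.

```python
from typing import List


def hits_to_ids(tabular: str, max_evalue: float = 1e-5) -> List[str]:
    """Extract unique subject IDs with E-value <= max_evalue, in first-seen order.

    Assumes a hypothetical tab-separated layout: query, subject, evalue.
    """
    ids: List[str] = []
    seen = set()
    for line in tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                    # skip blank lines and comments
        _query, subject, evalue = line.split("\t")[:3]
        if float(evalue) <= max_evalue and subject not in seen:
            seen.add(subject)
            ids.append(subject)
    return ids


# Made-up sample hits, used only to exercise the function.
SAMPLE = (
    "q1\tprotA\t1e-30\n"
    "q1\tprotB\t0.5\n"
    "q2\tprotA\t1e-12\n"
    "q2\tprotC\t1e-8\n"
)
selected = hits_to_ids(SAMPLE)
```

The resulting ID list could then be handed to a downstream node, for example a sequence-retrieval or alignment step, without manual editing.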
GUI : Design Canvas
• Drag tools from the tool list onto the canvas
• Connect them logically to create a workflow pipeline
• Set advanced IO and advanced parameters of each node
• Save the workflow
GUI : Status
Status is available in tabular format on the status tab and also pictorially on the design canvas
GUI : Project Explorer
Allows the user to view the input, output and intermediate output files of the current project
Client Feature: Scribble Note
• Scribble Note allows the user to store short notes regarding the associated node or workflow.
• These can be minimized, hidden or expanded again for readability.
Client Feature: Sub-layer Support
• Nodes (tools) can be logically grouped together to form a sublayer.
• The sublayer can be collapsed or expanded for readability.
Feature : Pre-defined Workflows
• Anvaya provides a set of 13 pre-defined workflows for frequently used pipelines in genome annotation and comparative genomics, ranging from EST assembly and annotation to phylogenetic reconstruction and microarray analysis.
➢ EST Analysis
➢ Genome annotation
➢ Functional Annotation
➢ Ortholog Prediction
➢ Prediction of Motifs
➢ Remote ortholog prediction
➢ Phylogeny (DNA and Protein sequences)
➢ Prediction of potential antigenic sites
➢ Primer Prediction
➢ Phylogenetic profiling
➢ Promoter identification using microarray data
- Reference mapping
- RNAseq differential expression analysis
PDW : EST Analysis
Provides researchers with a single pipeline that can read raw trace files from sequencing machines and produce fully annotated, assembled ESTs.
*Patil DP et al., BMC Genomics (2009)
Base calling
Vector masking
Sequence trimming
Removal of PolyA tail
Trimming of QV
Functional annotation
NCBI submission format
CAP3 pre-processing
CAP3 assembly
Unique transcripts
EST prediction
Domain prediction
Gene ontology
PDW : Phylogenetic Profiling
The workflow aims to infer functional linkages using phylogenetic profiling. The profiles obtained are analyzed for their statistical significance using parameters like mutual information content, Hamming distance and Pearson correlation coefficient.
Similarity search
Conversion to profile matrix format with normalized E-values
Hamming distance
MI and CC
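The three measures named in this workflow can be sketched for a pair of profiles as follows. This is a generic textbook formulation, not Anvaya's code: it assumes the profiles have already been reduced to binary presence/absence vectors (the slide mentions a matrix of normalized E-values, so some thresholding step is assumed here), and the function names are illustrative.

```python
import math
from typing import Sequence


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of genomes in which the two profiles disagree."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return sum(x != y for x, y in zip(a, b))


def mutual_information(a: Sequence[int], b: Sequence[int]) -> float:
    """Mutual information (in bits) between two binary profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    n = len(a)
    mi = 0.0
    for x in (0, 1):
        for y in (0, 1):
            p_xy = sum(1 for i, j in zip(a, b) if i == x and j == y) / n
            p_x = sum(1 for i in a if i == x) / n
            p_y = sum(1 for j in b if j == y) / n
            if p_xy > 0:                # empty cells contribute nothing
                mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


def pearson_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient between two profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_a * var_b)


# Toy profiles across four genomes (1 = homolog present, 0 = absent).
gene_a = [1, 0, 1, 0]
gene_b = [1, 0, 1, 0]   # identical to gene_a
gene_c = [0, 1, 0, 1]   # exact complement of gene_a
gene_d = [1, 1, 0, 0]   # statistically independent of gene_a
```

Note that identical and perfectly complementary profiles both carry one bit of mutual information, whereas their Hamming distances (0 and 4) and correlations (+1 and -1) differ; this is one reason the measures are useful in combination.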
Test Case : Genome annotation of 21 mycobacterial genomes
Input Dataset: 102 MB
Execution Time: 23 min
Anvaya Publications
• Bhakti Limaye, Ruma Banerjee, Avik Datta, Harshal Inamdar, Pankaj Vats, Sonal Dahale, Alok Bhandari, E. P. Ramakrishnan, Rajnikanth Tupakula, Sandeep Malviya, Avinash Bayaskar, Renu Gadhari, Sankalp Jain, Vivek Gavane, Rashmi Mahajan, Sunitha K, and Rajendra Joshi, “ANVAYA: A Workflows Environment For Automated Genome Analysis”, Journal of Bioinformatics and Computational Biology (2012)
• Ruma Banerjee, Pankaj Vats, Sonal Dahale, Sunitha Manjari Kasibhatla and Rajendra Joshi, “Comparative genomics of cell envelope components in Mycobacteria”, PLoS ONE (2011)
ICTBioMed Leadership at Accelerating Biology 2014: Computing Life
THANK YOU
[email protected]