Pipelines

Pipelines. Program input output -Keyboard -File -Pipe -Keyboard -File -Pipe -Screen -File -Pipe -Screen -File -Pipe.

Jan 03, 2016

Page 1:

Pipelines

Page 2:

Input (Keyboard, File, or Pipe) -> Program -> Output (Screen, File, or Pipe)

Page 3:

The “echo” program takes the text given as its arguments on the command line and writes it to the output. It does not read from the input.

Arguments (command line) -> echo -> Output (Screen, File, or Pipe)

Page 4:

The “cat” program reads text from the input and writes it to the output

Input (Keyboard, File, or Pipe) -> cat -> Output (Screen, File, or Pipe)

Page 5:

echo uniprot_sprot_plants.fasta

uniprot_sprot_plants.fasta

Page 6:

cat uniprot_sprot_plants.fasta

>sp|Q43495|108_SOLLC Protein 108 OS=Solanum lycopersicum PE=2 SV=1
MASVKSSSSSSSSSFISLLLLILLVIVLQSQVIECQPQQSCTASLTGLNVCAPFLVPGSPTASTECCNAVQSINHDCMCNTMRIAAQIPAQCNLPPLSCSAN
>sp|Q9XHP0|11S2_SESIN 11S globulin seed storage protein 2 OS=Sesamum indicum PE=2 SV=1
MVAFKFLLALSLSLLVSAAIAQTREPRLTQGQQCRFQRISGAQPSLRIQSEGGTTELWDERQEQFQCAGIVAMRSTIRPNGLSLPNYHPSPRLVYIERGQGLISIMVPGCAETYQVHRSQRTMERTEASEQQDRGSVRDLHQKVHRLRQGDIVAIPSGAAHWCYNDGSEDLVAVSINDVNHLSNQLDQKFRAFYLAGGVPRSGEQEQQARQTFHNIFRAFDAELLSEAFNVPQETIRRMQSEEEERGLIVMARERMTFVRPDEEEGEQEHRGRQLDNGLEETFCTMKFRTNVESRREADIFSRQAGRVHVVDRNKLPILKYMDLSAEKGNLYSNALVSPDWSMTGHTIVYVTRGDAQVQVVDHNGQALMNDRVNQGEMFVVPQYYTSTARAGNNGFEWVAFKTTGSPMRSPLAGYTSVIRAMPLQVITNSYQISPNQAQALKMNRGSQSFLLSPGGRRS
>sp|P19084|11S3_HELAN 11S globulin seed storage protein G3 OS=Helianthus annuus GN=HAG3 PE=3 SV=1
MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEALEPIEVIQAEAGVTEIWDAYDQQFQCAWSILFDTGFNLVAFSCLPTSTPLFWPSSREGVILPGCRRTYEYSQEQQFSGEGGRRGGGEGTFRTVIRKLENLKEGDVVAIPTGTAHWLHNDGNTELVVVFLDTQNHENQLDENQRRFFLAGNPQAQAQSQQQQQRQPRQQSPQRQRQRQRQGQGQNAGNIFNGFTPELIAQSFNVDQETAQKLQGQNDQRGHIVNVGQDLQIVRPPQDRRSPRQQQEQATSPRQQQEQQQGRRGGWSNGVEETICSMKFKVNIDNPSQADFVNPQAGSIANLNSFKFPILEHLRLSVERGELRPNAIQSPHWTINAHNLLYVTEGALRVQIVDNQGNSVFDNELREGQVVVIPQNFAVIKRAN
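
The difference between the two commands can be tried on a small file. The sketch below assumes a made-up two-record file called demo.fasta (it is not part of the original slides) and creates it in a temporary directory:

```shell
# Work in a scratch directory so no existing file is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# demo.fasta is a hypothetical example file, not the UniProt file from the slides.
printf '>seq1 first demo protein\nMASVK\n>seq2 second demo protein\nMVAFK\n' > demo.fasta

# echo writes its argument, which here is just the file name.
echo_out=$(echo demo.fasta)

# cat writes the contents of the file.
cat_out=$(cat demo.fasta)

echo "echo gave: $echo_out"
echo "cat gave:"
echo "$cat_out"
```

The file name is only a piece of text to echo, whereas cat opens the file and copies what is inside it.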

Page 7:

The “grep” program filters the input for given terms and writes the filtered text to the output

Input (Keyboard, File, or Pipe) -> grep -> Output (Screen, File, or Pipe)

Page 8:

grep --help

Usage: grep [OPTION]... PATTERN [FILE] ...
Search for PATTERN in each FILE or standard input.
Example: grep -i 'hello world' menu.h main.c

Regexp selection and interpretation:
  -E, --extended-regexp     PATTERN is an extended regular expression
  -F, --fixed-strings       PATTERN is a set of newline-separated strings
  -G, --basic-regexp        PATTERN is a basic regular expression
  -P, --perl-regexp         PATTERN is a Perl regular expression
  -e, --regexp=PATTERN      use PATTERN as a regular expression
  -f, --file=FILE           obtain PATTERN from FILE
  -i, --ignore-case         ignore case distinctions
  -w, --word-regexp         force PATTERN to match only whole words
  -x, --line-regexp         force PATTERN to match only whole lines
  -z, --null-data           a data line ends in 0 byte, not newline
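
As a rough illustration of two of the listed options, the sketch below compares a plain search with -i and -w on a made-up file, headers.txt (the file and its three lines are invented for this example). It also uses -c, which makes grep print the number of matching lines instead of the lines themselves; that option is not in the excerpt shown on the slide.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Three hypothetical header lines that differ in how "kinase" appears.
printf '>sp|X1|DEMO1 Kinase OS=Arabidopsis thaliana\n' > headers.txt
printf '>sp|X2|DEMO2 kinase-like OS=Zea mays\n' >> headers.txt
printf '>sp|X3|DEMO3 Pseudokinase OS=Zea mays\n' >> headers.txt

# Default: case-sensitive match anywhere in the line.
plain_count=$(grep -c kinase headers.txt)

# -i: ignore upper/lower case, so "Kinase" is matched as well.
nocase_count=$(grep -c -i kinase headers.txt)

# -w: the pattern must be a whole word, so "Pseudokinase" is excluded.
word_count=$(grep -c -w kinase headers.txt)

echo "plain: $plain_count  -i: $nocase_count  -w: $word_count"
```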

Page 9:

grep sp uniprot_sprot_plants.fasta

>sp|Q43495|108_SOLLC Protein 108 OS=Solanum lycopersicum PE=2 SV=1
>sp|Q9XHP0|11S2_SESIN 11S globulin seed storage protein 2 OS=Sesamum indicum PE=2 SV=1
>sp|P19084|11S3_HELAN 11S globulin seed storage protein G3 OS=Helianthus annuus GN=HAG3 PE=3 SV=1
>sp|P13744|11SB_CUCMA 11S globulin subunit beta OS=Cucurbita maxima PE=1 SV=1
>sp|Q05349|12KD_FRAAN Auxin-repressed 12.5 kDa protein OS=Fragaria ananassa PE=2 SV=1
>sp|O23878|13S1_FAGES 13S globulin seed storage protein 1 OS=Fagopyrum esculentum GN=FA02 PE=2 SV=1
>sp|O23880|13S2_FAGES 13S globulin seed storage protein 2 OS=Fagopyrum esculentum GN=FA18 PE=2 SV=1
>sp|Q9XFM4|13S3_FAGES 13S globulin seed storage protein 3 OS=Fagopyrum esculentum GN=FAGAG1 PE=1 SV=1
>sp|P83004|13SB_FAGES 13S globulin basic chain OS=Fagopyrum esculentum PE=1 SV=1
>sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=2 SV=1
>sp|P93207|14310_SOLLC 14-3-3 protein 10 OS=Solanum lycopersicum GN=TFT10 PE=2 SV=2
>sp|Q9S9Z8|14311_ARATH 14-3-3-like protein GF14 omicron OS=Arabidopsis thaliana GN=GRF11 PE=2 SV=1
>sp|Q9C5W6|14312_ARATH 14-3-3-like protein GF14 iota OS=Arabidopsis thaliana GN=GRF12 PE=2 SV=1
>sp|P42643|14331_ARATH 14-3-3-like protein GF14 chi OS=Arabidopsis thaliana GN=GRF1 PE=1 SV=3
>sp|P49106|14331_MAIZE 14-3-3-like protein GF14-6 OS=Zea mays GN=GRF1 PE=1 SV=1
>sp|Q84J55|14331_ORYSJ 14-3-3-like protein GF14-A OS=Oryza sativa subsp. japonica GN=GF14A PE=2 SV=1
>sp|P85938|14331_PSEMZ 14-3-3-like protein 1 (Fragments) OS=Pseudotsuga menziesii PE=1 SV=1
>sp|P93206|14331_SOLLC 14-3-3 protein 1 OS=Solanum lycopersicum GN=TFT1 PE=3 SV=2
>sp|Q41418|14331_SOLTU 14-3-3-like protein OS=Solanum tuberosum PE=2 SV=1
>sp|Q01525|14332_ARATH 14-3-3-like protein GF14 omega OS=Arabidopsis thaliana GN=

Page 10:

Redirection

By placing a “>” followed by a file name at the end of the command line, the output can be redirected to a file instead of the screen.

Page 11:

grep sp uniprot_sprot_plants.fasta > out.txt
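
One detail worth trying out is that “>” replaces whatever the target file already contained. A minimal sketch, assuming a made-up file demo.fasta and the shell's default settings (no noclobber option):

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical two-record input file for the demonstration.
printf '>sp|X1|DEMO1 first\nMASVK\n>sp|X2|DEMO2 second\nMVAFK\n' > demo.fasta

# Nothing appears on the screen: both header lines go to out.txt.
grep sp demo.fasta > out.txt

# A second redirection to the same file overwrites the earlier contents.
grep DEMO2 demo.fasta > out.txt

# Only the result of the last command is left in the file.
cat out.txt
```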

Page 12:

The “wc” program counts lines, words, or characters in the input and writes the count to the output

Input (Keyboard, File, or Pipe) -> wc -> Output (Screen, File, or Pipe)

Page 13:

wc -l uniprot_sprot_plants.fasta

250177 uniprot_sprot_plants.fasta

wc -l out.txt

33851 out.txt
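
The same counts can be tried on a small scale. The sketch below assumes a made-up four-line file, demo.fasta, and also uses “<”, which feeds a file to the program's input so that wc prints only the number without a file name (the “<” form does not appear on the slides).

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical input: two header lines and two sequence lines.
printf '>seq1 demo\nMASVK\n>seq2 demo\nMVAFK\n' > demo.fasta

# With a file argument, wc prints the count followed by the file name.
wc -l demo.fasta

# -c counts bytes instead of lines.
wc -c demo.fasta

# Reading from the input instead of a named file gives the bare number.
line_count=$(wc -l < demo.fasta)
byte_count=$(wc -c < demo.fasta)

echo "lines: $line_count  bytes: $byte_count"
```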

Page 14:

Creating a pipeline

With the “|” character, the output of one program can be linked to the input of another program.

Page 15:

Pipeline:

input -> grep -> output/input (pipe) -> wc -> output

Page 16:

grep sp uniprot_sprot_plants.fasta | wc -l

33851
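
The pipe gives the same count as the earlier two-step route through out.txt, only without the intermediate file. A minimal sketch on a made-up three-record file, demo.fasta (the “<” used in the two-step version feeds the file to wc's input and is not taken from the slides):

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical input with three header lines.
printf '>sp|X1|DEMO1 first\nMASVK\n' > demo.fasta
printf '>sp|X2|DEMO2 second\nMVAFK\n' >> demo.fasta
printf '>sp|X3|DEMO3 third\nMKKLL\n' >> demo.fasta

# Two steps: write the matches to a file, then count its lines.
grep sp demo.fasta > out.txt
via_file=$(wc -l < out.txt)

# One step: grep's output becomes wc's input directly.
via_pipe=$(grep sp demo.fasta | wc -l)

echo "via file: $via_file  via pipe: $via_pipe"
```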

Page 17:

grep sp uniprot_sprot_plants.fasta | grep thaliana

>sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=2 SV=1
>sp|Q9S9Z8|14311_ARATH 14-3-3-like protein GF14 omicron OS=Arabidopsis thaliana GN=GRF11 PE=2 SV=1
>sp|Q9C5W6|14312_ARATH 14-3-3-like protein GF14 iota OS=Arabidopsis thaliana GN=GRF12 PE=2 SV=1
>sp|P42643|14331_ARATH 14-3-3-like protein GF14 chi OS=Arabidopsis thaliana GN=GRF1 PE=1 SV=3
>sp|Q01525|14332_ARATH 14-3-3-like protein GF14 omega OS=Arabidopsis thaliana GN=GRF2 PE=1 SV=2
>sp|P42644|14333_ARATH 14-3-3-like protein GF14 psi OS=Arabidopsis thaliana GN=GRF3 PE=1 SV=2
>sp|P46077|14334_ARATH 14-3-3-like protein GF14 phi OS=Arabidopsis thaliana GN=GRF4 PE=1 SV=2
>sp|P42645|14335_ARATH 14-3-3-like protein GF14 upsilon OS=Arabidopsis thaliana GN=GRF5 PE=1 SV=2
>sp|P48349|14336_ARATH 14-3-3-like protein GF14 lambda OS=Arabidopsis thaliana GN=GRF6 PE=1 SV=1
>sp|Q96300|14337_ARATH 14-3-3-like protein GF14 nu OS=Arabidopsis thaliana GN=GRF7 PE=1 SV=1
>sp|P48348|14338_ARATH 14-3-3-like protein GF14 kappa OS=Arabidopsis thaliana GN=GRF8 PE=2 SV=2
>sp|Q96299|14339_ARATH 14-3-3-like protein GF14 mu OS=Arabidopsis thaliana GN=GRF9 PE=1 SV=2
>sp|Q9LQ10|1A110_ARATH Probable aminotransferase ACS10 OS=Arabidopsis thaliana GN=ACS10 PE=2 SV=1
>sp|Q9S9U6|1A111_ARATH 1-aminocyclopropane-1-carboxylate synthase 11 OS=Arabidopsis thaliana GN=ACS11 PE=1 SV=1
>sp|Q8GYY0|1A112_ARATH Probable aminotransferase ACS12 OS=Arabidopsis thaliana GN=ACS12 PE=2 SV=2
>sp|Q06429|1A11_ARATH 1-aminocyclopropane-1-carboxylate synthase-like protein 1 OS=Arabidopsis thaliana GN=ACS1 PE=1 SV=2
>sp|Q06402|1A12_ARATH 1-aminocyclopropane-1-carboxylate synthase 2 OS=Arabidopsis thaliana GN=ACS2 PE=1 SV=1
>sp|Q43309|1A14_ARATH 1-aminocyclopropane-1-carboxylate synthase 4 OS=Arabidopsis thaliana GN=ACS4 PE=1 SV=1
>sp|Q37001|1A15_ARATH 1-aminocyclopropane-1-carboxylate synthase 5 OS=Arabidopsis thaliana GN=ACS5 PE=1 SV=1
>sp|Q9SAR0|1A16_ARATH 1-aminocyclopropane-1-carboxylate synthase 6 OS=Arabidopsis thaliana GN=ACS6 PE=1 SV=2
>sp|Q9STR4|1A17_ARATH 1-aminocyclopropane-1-carboxylate synthase 7 OS=Arabidopsis thaliana GN=ACS7 PE=1 SV=1
>sp|Q9T065|1A18_ARATH 1-aminocyclopropane-1-carboxylate synthase 8 OS=Arabidopsis thaliana GN=ACS8 PE=1 SV=1
>sp|Q9M2Y8|1A19_ARATH 1-aminocyclopropane-1-carboxylate synthase 9 OS=Arabidopsis thaliana GN=ACS9 PE=1 SV=1
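
A pipeline is not limited to two programs. As a sketch, the two filters can be followed by wc to count the matching headers instead of listing them; the three-record file demo.fasta below is invented for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical input: two Arabidopsis thaliana records and one Zea mays record.
printf '>sp|X1|DEMO1 Protein A OS=Arabidopsis thaliana\nMASVK\n' > demo.fasta
printf '>sp|X2|DEMO2 Protein B OS=Zea mays\nMVAFK\n' >> demo.fasta
printf '>sp|X3|DEMO3 Protein C OS=Arabidopsis thaliana\nMKKLL\n' >> demo.fasta

# Stage 1 keeps the header lines, stage 2 keeps those mentioning thaliana,
# stage 3 counts what is left.
thaliana_count=$(grep sp demo.fasta | grep thaliana | wc -l)

echo "thaliana headers: $thaliana_count"
```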

Page 18:

stdin (Pipe or Keyboard) -> Program -> stdout (Pipe or Screen)

Page 19:

Special output channel for error messages

stdin (Pipe or Keyboard) -> Program -> stdout (Pipe or Screen)
Program -> stderr

Page 20:

grep sp uniprot_sprot_plants.fas > out.txt

grep: uniprot_sprot_plants.fas: No such file or directory
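
The error message still reaches the screen because “>” redirects only stdout. As a sketch of how the two channels can be separated, the example below additionally uses “2>”, which redirects stderr to a file; that syntax and the file names are assumptions of this example and do not appear on the slides.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# The input file deliberately does not exist, so grep reports an error.
# "> out.txt" captures stdout, "2> err.txt" captures stderr.
status=0
grep sp no_such_file.fasta > out.txt 2> err.txt || status=$?

# out.txt was created but stays empty; the message went to err.txt.
echo "exit status: $status"
echo "bytes in out.txt: $(wc -c < out.txt)"
echo "error message: $(cat err.txt)"
```

A non-zero exit status together with an empty output file is a quick sign that a step in a pipeline went wrong.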

Page 21:

EMBOSS

"European Molecular Biology Open Software Suite"

http://emboss.sourceforge.net/

Toolbox with bioinformatics applications

Page 22:

http://emboss.bioinformatics.nl/

Page 23:

wossname "open reading frame"

Finds programs by keywords in their short description
SEARCH FOR 'OPEN READING FRAME'
getorf   Finds and extracts open reading frames (ORFs)
plotorf  Plot potential open reading frames in a nucleotide sequence

Page 24:

wossname documentation

Finds programs by keywords in their short description
SEARCH FOR 'DOCUMENTATION'
tfm  Displays full documentation for an application

Page 25:

tfm getorf

getorf

Function

Finds and extracts open reading frames (ORFs)

Description

This program finds and outputs the sequences of open reading frames (ORFs) in one or more nucleotide sequences. An ORF may be defined as a region of a specified minimum size between two STOP codons, or between a START and a STOP codon. The ORFs can be output as the nucleotide sequence or as the protein translation. Optionally, the program will output the region around the START codon, the first STOP codon, or the final STOP codon of an ORF. The START and STOP codons are defined in a Genetic Code table; a suitable table can be selected for the organism you are investigating. The output is a sequence file containing predicted open reading frames longer than the minimum size, which defaults to 30 bases (i.e. 10 amino acids).

Page 26:

Command line options

All EMBOSS programs have a number of command line options. To get started:

-help    Get help
-stdout  Write to standard output
-filter  Read stdin, write stdout

Page 27:

getorf -help

Standard (Mandatory) qualifiers:
  [-sequence]  seqall     Nucleotide sequence(s) filename and optional format, or reference (input USA)
  [-outseq]    seqoutall  [<sequence>.<format>] Protein sequence set(s) filename and optional format (output USA)

Additional (Optional) qualifiers:
   -table      menu       [0] Code to use (Values: 0 (Standard); 1 (Standard (with alternative initiation codons)); 2 (Vertebrate Mitochondrial); 3 (Yeast Mitochondrial); 4 (Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma); 5 (Invertebrate Mitochondrial); 6 (Ciliate Macronuclear and Dasycladacean); 9 (Echinoderm Mitochondrial); 10 (Euplotid Nuclear); 11 (Bacterial); 12 (Alternative Yeast Nuclear); 13 (Ascidian Mitochondrial); 14 (Flatworm Mitochondrial); 15 (Blepharisma Macronuclear); 16 (Chlorophycean Mitochondrial); 21 (Trematode Mitochondrial); 22 (Scenedesmus obliquus); 23 (Thraustochytrium Mitochondrial))
   -minsize    integer    [30] Minimum nucleotide size of ORF to report (Any integer value)

Page 28:

cat example1.fasta | getorf -filter -find 1

>BTBSCRYR_1 [72 - 110] Bovine mRNA for lens beta-s-crystallin...
MTAIATVQISTCT
>BTBSCRYR_2 [11 - 544] Bovine mRNA for lens beta-s-crystallin...
MSKAGTKITFFEDKNFQGRHYDSDCDCADFHMYLSRCNSIRVEGGTWAVYERPNFAGYMYILPRGEYPEYQHWMGLNDRLSSCRAVHLSSGGQYKLQIFEKGDFNGQMHETTEDCPSIMEQFHMREVHSCKVLEGAWIFYELPNYRGRQYLLDKKEYRKPVDWGAASPAVQSFRRIVE
>BTBSCRYR_3 [159 - 590] Bovine mRNA for lens beta-s-crystallin...
MKGPILLGTCTSYPGASILSTSTGWASTTASAPAGLFTCLVEASISFRSLRKGILMVRCMRPRKTALPSWSSSTCGRSTPVRCWRAPGSSMSCPTTEAGSTCWTRRSTGSPSTGVQLPQLSSLSAALWSDDTDAAKRWLALSSK
>BTBSCRYR_4 [547 - 603] Bovine mRNA for lens beta-s-crystallin...
MIQMRPNAGWPCHPNKHYK
>BTBSCRYR_5 [618 - 445] (REVERSE SENSE) Bovine mRNA for lens beta-s-crystallin...
MPIVLFIMLIWMTRPASVWPHLYHHSTMRRKDWTAGEAAPQSTGFRYSFLSSRYCLPR
>BTBSCRYR_6 [381 - 331] (REVERSE SENSE) Bovine mRNA for lens beta-s-crystallin...
MWNCSMMEGQSSVVSCI
>BTBSCRYR_7 [337 - 197] (REVERSE SENSE) Bovine mRNA for lens beta-s-crystallin...
MHLTIKIPFLKDLKLILASTRQVNSPAGAEAVVEAHPVLVLRILAPG
>BTBSCRYR_8 [192 - 73] (REVERSE SENSE) Bovine mRNA for lens beta-s-crystallin...
MYMYPAKLGLSYTAQVPPSTLMELQRLRYMWKSAQSQSLS

Page 29:

Exercise

Make a pipeline that reports (only) the size in residues of the longest protein in this file:

uniprot_sprot_plants.fasta

It can be done using these applications as building blocks: sizeseq, nthseq, pepstats, grep, cut

Page 30:

http://main.g2.bx.psu.edu/