Introduction to Perl Matt Hudson
Dec 25, 2015

Introduction to Perl

Matt Hudson

Review

• blastall: do a BLAST search
• HMMER
– hmmpfam: search against an HMM database
– hmmsearch: search proteins with an HMM
– hmmbuild: make an HMM from a protein alignment, as made by clustalw
• clustalw: align protein or DNA sequences
• fasta34: search a sequence using an older, slower, but sometimes more flexible algorithm

grep – my favorite

• Allows you to pick out lines of a text file that match a query, count them, and retrieve lines around the match.

grep 'Query=' myblast.txt
What sequences did I BLAST?

grep -c '>' testprotein.txt
How many sequences are in this file?

grep -A 10 '>' testprotein.txt
Give me the header line of each protein plus the ten lines after it

ftp commands
• ftp ftp.ncbi.nih.gov: go to the NCBI site
• open: open a connection
• ls: same as UNIX
• cd: same as UNIX
• get: get me this file
• mget: get more than one file
• put: put a file on the server
• lcd: local cd
• !: local shell
• close: close connection
• bye: exit the ftp program

• OK. You are now up and running with UNIX, and can use it to do some fairly sophisticated bioinformatics.

• We’re going to concentrate on Perl scripting from now on.

UNIX books
• You might find that your UNIX skills need some refreshing from time to time. I recommend having one of these books around in case you need some help using the command line:

• For students who haven’t done much UNIX:
Sams Teach Yourself Unix in 24 Hours (4th Edition) (Paperback) by Dave Taylor

• For more advanced UNIX users:
UNIX System V: A Practical Guide (3rd Edition) (Paperback) by Mark G. Sobell

• Also, for those of you not so familiar with bioinformatics:
Bioinformatics for Dummies (Paperback) by Jean-Michel Claverie, Cedric Notredame

Perl books
• For some reason, although there are hundreds of Perl books out there, none of them are really that good. Here are some that might be useful, but none are completely recommended.

• This one I recommend EXCEPT that it uses tools that come with the book that are non-standard:
Beginning Perl for Bioinformatics (Paperback) by James Tisdall

• This I have heard good things about but not used much myself:
Beginning Perl, Second Edition (Paperback) by James Lee

• This is a classic but slow going if you know no programming:
Learning Perl, Fourth Edition (Paperback) by Randal L. Schwartz, Tom Phoenix, brian d foy

• This is better if you have little programming experience, but not a textbook:
Perl for Dummies (Fourth Edition) (Paperback) by Paul Hoffman

• Once you get started:
Programming Perl, 3rd edition, by Larry Wall, O’Reilly, 2001

Why use Perl?

• Interpreted language – quick to program

• Easy to learn compared to most languages

• Designed for working with text files

• Free for all operating systems

• Most popular language in bioinformatics – many scripts are available that you can “borrow”, along with ready-made modules.

Programming

• In Perl, the program, or script, is just a text file.

• You write it with ANY text editor (we are using WordPad and/or nano).

• Run the program
• Look at the output
• Correct the errors (debugging)
• Edit the script and try again.

All programming courses traditionally start with a program that prints “Hello, world!”. So in keeping with that tradition:

Note:

No line numbers.

Each command line ends with a semicolon

Remember your program?

#!/usr/bin/perl

print "Hello, world\n";

Print
• Most programming languages use “print” to mean “write this to the console”, i.e. the command line.
• Once upon a time, the console was a typewriter. But now “print” never means print on a printer.
• print statements are necessary to keep tabs on what your program is doing.

• You need to tell Perl to put a newline at the end of a printed line
– Use \n in a text string to signify a newline.
– The \ character is called “backslash”.
– It is an “escape”: it changes the meaning of the character after it. In this case it changes “n” to “newline”. Other examples are \t (tab) and \$ (print an actual dollar sign; normally a dollar sign has a special meaning).
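
A minimal sketch of the three escapes mentioned above. The variable names and the text in the strings are made up for illustration.

```perl
use strict;
use warnings;

# Each string below uses one of the escapes described above.
my $two_lines = "first line\nsecond line";   # \n starts a new line
my $columns   = "gene\tlength";              # \t inserts a tab
my $price     = "This costs \$5";            # \$ gives a literal dollar sign

print "$two_lines\n";
print "$columns\n";
print "$price\n";
```

Without the backslash, Perl would try to read $5 as a variable rather than as text.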

Program details

• Perl programs on UNIX start with a line like:
#!/usr/bin/perl

• This first line is an instruction not to Perl but to the UNIX system, which uses it to find the program that runs the script.

• Everywhere else, Perl ignores anything after a # on a line, so # is used for comments to explain the code.

• Lines that are Perl commands end with a semicolon (;).

Run your Perl program

#cd scratch

#nano helloworld.pl

(paste or type text into editor, save, and exit)

#perl helloworld.pl

Or:

#chmod 755 helloworld.pl

#./helloworld.pl

Pseudocode

• Programmers often find it easier to write out the things the program is doing in “normal” language. We call this pseudocode.

print "Hello, world\n";

=

Output the text “Hello, world” to the terminal, followed by a newline character.

Strings

• In Perl, strings are very important. They are just a series of any text characters: letters, numbers, ><?>:$%^&*, etc.

• In the statement

print "Hello, world\n";

the text between the quotes, "Hello, world\n", is a string.

Numbers, etc
• The other common type of data is a number.

• Perl can handle numbers in most common formats, without any complications:

456
5.6743
6.3E-26

• Arithmetic functions:
+ (add)
- (subtract)
/ (divide)
* (multiply)
** (exponentiation)
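
Here is a short sketch of the five arithmetic operators in action. The numbers 7 and 2 and the variable names are arbitrary choices for illustration.

```perl
use strict;
use warnings;

my $sum        = 7 + 2;    # add
my $difference = 7 - 2;    # subtract
my $quotient   = 7 / 2;    # divide: gives 3.5, not a rounded whole number
my $product    = 7 * 2;    # multiply
my $power      = 7 ** 2;   # exponentiation: 7 squared

print "7 + 2 = ", $sum, "\n";
print "7 - 2 = ", $difference, "\n";
print "7 / 2 = ", $quotient, "\n";
print "7 * 2 = ", $product, "\n";
print "7 ** 2 = ", $power, "\n";
```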

A program using numbers

#!/usr/bin/perl
print "2+2\n";
print 3*4 , "\n";
print "8/2=" , 8/2 , "\n";

Do you get it?
Numbers in quotes are part of a string.
Numbers outside quotes are numbers, and the computer does the math before printing.

Pseudocode

print "2+2\n";

=

Output “2+2”, followed by a newline, to the terminal

print 3*4 , "\n";

=

Evaluate 3 x 4, and print the answer, followed by a newline, to the terminal

Variables

• Up till now, we’ve been telling the computer exactly what to print. But in order for the program to generate what is printed, we need to use variables.

• The name of a simple (scalar) variable starts with “$”

• It can hold either a string or a number.

Assigning values

In pretty much all programming languages, = means “assign this value to this variable”.

The “my” keyword in Perl declares the variable. This is optional (unless you turn on “use strict”) but highly recommended.

So, you assign values to a variable as follows:

my $number = 123;

my $dna_sequence_string = "acgt";

A program with variables

#!/usr/bin/perl -w

#this program uses variables containing numbers

my $two = 2;

my $three = $two + 1;

print "\$two * \$three = $two * $three = ",($two * $three);

print "\n";

Pseudocode

my $two = 2;

Assign the value 2 to the variable $two

Interpolation
• When you print the variable, Perl gives the contents rather than the name of the variable.
print $number;
9

• If you put a variable inside double quotes, Perl interpolates the variable
print "The number is $number\n";
The number is 9

• If you use single quotes, no interpolation happens
print 'The number is $number\n';
The number is $number\n

• A more flexible way to do this is to “escape” the $
print "The value of \$number is $number\n";
The value of $number is 9
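
The three quoting styles can be compared side by side. This is a small sketch: the value 9 follows the slide, and the variable names $double, $single and $escaped are made up for illustration.

```perl
use strict;
use warnings;

my $number = 9;

my $double  = "The number is $number";             # double quotes: interpolated
my $single  = 'The number is $number';             # single quotes: kept exactly as typed
my $escaped = "The value of \$number is $number";  # first $ escaped, second interpolated

print $double, "\n";
print $single, "\n";
print $escaped, "\n";
```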

Variables - summary

• A variable name starts with a $
• It contains a number or a text string
• Use my to define a variable
• Use = to assign a value
• Use \ to stop the variable being interpolated
• Take care with variable names and with changing the contents of variables

Standard Input

• To make the program do something, we need to input data.
– The angle bracket operator (<>) tells Perl to expect input, by default from the keyboard.
– Usually this is assigned to a variable

print "Please type a number: ";

my $num = <STDIN>;

print "Your number is $num\n";

Pseudocode

my $num = <STDIN>;

Pause the program, and wait until the user types input. Once the user hits the “enter” key, take the input (including the newline character) and put it into the variable $num.

chomp
• When data is entered from the keyboard, the program waits for you to press the Enter key.
• But the string that is captured includes a newline at its end.
• You can use the chomp function to remove the newline character:

print "Enter your name: ";

my $name = <STDIN>;

# $name still ends in a newline, so this message breaks across two lines
print "Hello $name, happy to meet you!\n";

chomp $name;

# the newline is gone, so the message now prints on one line
print "Hello $name, happy to meet you!\n";
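
To see exactly what chomp removes, it helps to measure the string before and after. This is a sketch under one assumption: instead of the keyboard, it reads from a string that pretends to be typed input, so it runs without anyone typing. The names $typed and $fake_stdin are made up for illustration.

```perl
use strict;
use warnings;

# Pretend the user typed "Matt" and pressed Enter.
my $typed = "Matt\n";
open(my $fake_stdin, '<', \$typed) or die "Cannot open in-memory input: $!";

my $name = <$fake_stdin>;
my $length_before = length $name;   # four letters plus the newline

chomp $name;
my $length_after = length $name;    # just the four letters

print "Before chomp: $length_before characters\n";
print "After chomp: $length_after characters\n";
print "Hello $name, happy to meet you!\n";
```

In a real script you would read from STDIN as shown on the slide.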

if and True/False

• All programming works on ones and zeros – true and false.

if (1 == 1) {print "one equals one";}

Perl evaluates the expression (1 == 1)

Note TWO NOT ONE EQUALS SIGNS!

The if statement causes the command in curly brackets to be executed ONLY IF the expression is true

if

• if evaluates some statement in parentheses (must be true or false)

• Note: conditional block is indented, using tabs.

– Perl doesn’t care about indents, but it makes your code more “human readable”

Comparing variables

if ($one == $two) {print "one equals two";}

Note there are TWO equals signs in this expression. If you remember, = means “assign this variable this value”. So == actually means “equals”. You can also use

> Greater than
< Less than
>= Greater than or equal to
<= Less than or equal to
!= Not equal to

Pseudocode

if ($one == $two) {print "one equals two";}

If the number in the variable $one is equal to the number in the variable $two, print “one equals two”
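
Here is a small sketch that tries the other numerical comparison operators. The values 1 and 2 and the counter $true_tests are made up for illustration.

```perl
use strict;
use warnings;

my $one = 1;
my $two = 2;
my $true_tests = 0;   # counts how many of the tests below are true

if ($one < $two) {
    print "$one is less than $two\n";
    $true_tests = $true_tests + 1;
}
if ($two >= $one) {
    print "$two is greater than or equal to $one\n";
    $true_tests = $true_tests + 1;
}
if ($one != $two) {
    print "$one is not equal to $two\n";
    $true_tests = $true_tests + 1;
}
if ($one == $two) {
    # This block is skipped because the test is false.
    print "$one equals $two\n";
    $true_tests = $true_tests + 1;
}

print "$true_tests of the 4 tests were true\n";
```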

What’s a block?

• In the case of an “if” statement:

• If the test is true, execute all the command lines inside the {} brackets. If not, then go on past the closing } to the statements below.

• You can also do stuff in a block over and over again using a loop – more later.

die, scum

• die kills your script safely and prints a message

• It is often used to prevent you doing something regrettable – e.g. running your script on a file that doesn’t exist, or overwriting an existing file.
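
A common pattern is to die when a file cannot be opened. This is only a sketch: the file name is hypothetical, and the eval wrapper is there just so the example can show the message and carry on. In a normal script you would leave out the eval and let die stop the program.

```perl
use strict;
use warnings;

# Hypothetical file name, assumed not to exist.
my $file = "no_such_file_for_this_demo.txt";

# eval catches the die; the message die would have printed ends up in $@.
my $ok = eval {
    open(my $in, '<', $file) or die "Cannot open $file: $!\n";
    close $in;
    1;
};

my $message = $ok ? "" : $@;
print "The script would have stopped with: $message" unless $ok;
```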

Exercising the Perl muscles

• Now let’s write a script to ask the user their age, and then deliver an insult specific to the age bracket:

• Over 25 - old fogey

• Under 15 – callow youth

• 15-25 – (insert your own insult here)

Pseudocode

Output “Enter your age: ” to the terminal

Pause the program, and wait until the user types input. Once the user hits the “enter” key, take the input (including the newline character) and put it into the variable $age.

Remove newline from $age if present

If the value in $age is less than 15, output “You are too young for this kind of work!” followed by a newline, then terminate the program with the text “too young”

If the value in $age is more than 25, output “You’re old enough to know better!” and then terminate the program with the text “too old”.

If the program is still running (i.e. $age is between 15 and 25), then output “You have much to learn!” followed by a newline.

Conditional Blocks, summary
• An if test can be used to control multiple lines of commands, as in this example

print "Enter your age: ";
my $age = <STDIN>;
chomp $age;
if ($age < 15) {
    print "You are too young for this kind of work!\n";
    die "too young";
}
if ($age > 25) {
    print "You're old enough to know better!";
    die "too old";
}
print "You have much to learn!\n";

Arrays
• An array can store multiple pieces of data.
• They are essential for the most useful functions of Perl. They can store data such as:

– the lines of a text file (e.g. primer sequences)
– a list of numbers (e.g. BLAST e values)

• Arrays are designated with the symbol @

my @bases = ("A", "C", "G", "T");
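
Once the data is in an array, you can pull out single items or ask how many there are. This is a brief sketch; note that a single element is written with $ rather than @, and that elements are numbered from 0. The names $first, $last and $count are made up for illustration.

```perl
use strict;
use warnings;

my @bases = ("A", "C", "G", "T");

my $first = $bases[0];        # the first element is number 0
my $last  = $bases[3];        # so the fourth element is number 3
my $count = scalar(@bases);   # how many elements the array holds

print "First base: $first\n";
print "Last base: $last\n";
print "Number of bases: $count\n";
```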

Converting a variable to an array

split splits a variable into parts and puts them in an array.

my $dnastring = "ACGTGCTA";

my @dnaarray = split //, $dnastring;

@dnaarray is now (A, C, G, T, G, C, T, A)

@dnaarray = split /T/, $dnastring;

@dnaarray is now (ACG, GC, A)
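
split is just as handy for breaking a line of a data file into columns. This sketch assumes a made-up tab-separated line holding a sequence name, a length and an e value; it is not taken from any real BLAST output.

```perl
use strict;
use warnings;

# A made-up line with three tab-separated columns.
my $line = "seq1\t350\t2e-40";

my @fields = split /\t/, $line;   # cut the line at every tab
my $name   = $fields[0];
my $length = $fields[1];
my $evalue = $fields[2];

print "Name: $name\n";
print "Length: $length\n";
print "E value: $evalue\n";
```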

Converting an array to a variable

• join combines the elements of an array into a single scalar variable (a string)

$dnastring = join('', @dnaarray);

The first argument is the spacer placed between the elements (empty here); the second is the array to join.
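
Changing the spacer changes what goes between the elements. This is a quick sketch with made-up variable names:

```perl
use strict;
use warnings;

my @dnaarray = ("A", "C", "G", "T");

my $plain  = join('', @dnaarray);     # empty spacer: letters run together
my $dashed = join('-', @dnaarray);    # a dash between each pair of elements
my $tabbed = join("\t", @dnaarray);   # a tab between each pair of elements

print "$plain\n";
print "$dashed\n";
print "$tabbed\n";
```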

Loops
• A loop repeats a bunch of commands until it is done. The commands are placed in a BLOCK: some code delimited with curly brackets {}

• Loops are really useful with arrays.

• The “foreach” loop is probably the most useful of all:

foreach my $base (@dnaarray) {
    print "$base ";
}

Comparing strings

• String comparison (is the text the same?)

• eq (equal)
• ne (not equal)

There are others but beware of them!
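
A short sketch of eq and ne, using made-up sequences. One thing to beware of: == compares numbers, and text that does not look like a number counts as 0, so two different sequences would look “equal” if you used == by mistake.

```perl
use strict;
use warnings;

my $seq1 = "GAATTC";
my $seq2 = "GAATTC";
my $seq3 = "gaattc";

my $same_text = 0;
if ($seq1 eq $seq2) {
    $same_text = 1;
    print "$seq1 and $seq2 are the same text\n";
}

my $case_differs = 0;
if ($seq1 ne $seq3) {
    # eq and ne are case sensitive, so upper and lower case do not match.
    $case_differs = 1;
    print "$seq1 and $seq3 are not the same text\n";
}
```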

Getting part of a string

• substr takes characters out of a string

$letter = substr($dnastring, $position, 1);

The three arguments are: which string, where in the string to start (counting from 0), and how many letters to take.
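
Here is a brief sketch of substr with different positions and lengths. The sequence and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $dnastring = "ACGTGCTA";

my $first_letter  = substr($dnastring, 0, 1);   # position 0 is the first letter
my $three_letters = substr($dnastring, 3, 3);   # three letters starting at position 3
my $tail          = substr($dnastring, 5);      # no length given: everything from position 5 on

print "First letter: $first_letter\n";
print "Three letters: $three_letters\n";
print "Tail: $tail\n";
```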

Combining strings

• Strings can be concatenated (joined).

• Use the dot . operator

my $seq1 = "ACTG";

my $seq2 = "GGCTA";

my $seq3 = $seq1 . $seq2;

print $seq3;

ACTGGGCTA

Making Decisions - review

• The if statement is generally used together with numerical or string comparison operators, inside an (expression).

numerical: ==, !=, >, <, >=, <=
strings: eq, ne

• You can make decisions on each member of an array using a loop which puts each part of the array through the test, one at a time
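
Putting a loop and a test together, here is a minimal sketch that counts the G and C letters in a sequence. The sequence and the counter name $gc_count are made up for illustration.

```perl
use strict;
use warnings;

my $dnastring = "ACGTGCTA";
my @dnaarray  = split //, $dnastring;

# Put each letter through the tests, one at a time.
my $gc_count = 0;
foreach my $base (@dnaarray) {
    if ($base eq "G") {
        $gc_count = $gc_count + 1;
    }
    if ($base eq "C") {
        $gc_count = $gc_count + 1;
    }
}

print "$dnastring contains $gc_count G or C letters\n";
```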

More healthy exercise

• Write a program that asks the user for a DNA restriction site, and then tells them whether that particular sequence matches the site for the restriction enzyme EcoRI, BamHI, or HindIII.

• Site for EcoRI: GAATTC
• BamHI: GGATCC
• HindIII: AAGCTT