Dec 21, 2015

Page 1: Regular expressions Perl provides a pattern-matching engine Patterns are called regular expressions They are extremely powerful –probably Perl's strongest.

Regular expressions

• Perl provides a pattern-matching engine

• Patterns are called regular expressions

• They are extremely powerful, probably Perl's strongest feature compared to other languages

• Often called "regexps" for short


Motivation: N-glycosylation motif

• Common post-translational modification in the ER
  – Membrane & secreted proteins
  – Purpose: folding, stability, cell-cell adhesion

• Attachment of a 14-sugar oligosaccharide

• Occurs at asparagine residues with the consensus sequence NX1X2, where
  – X1 can be anything (but proline & aspartic acid inhibit)
  – X2 is serine or threonine

• Can we detect potential N-glycosylation sites in a protein sequence?


User Input from Keyboard

• Input a line of input from the user and save it into a variable:

    print "Enter your DNA sequence:";
    $dna = <STDIN>;
    chomp($dna);

  It is often necessary to remove the "new line" character from user input

• We can also input a file name from the user that we want to open:

    print "Enter data file name:";
    $data = <STDIN>;
    chomp($data);
    open F, $data or die "Can't open $data: $!";

  This creates a file handle called F for the file name stored in the $data variable


Interactive testing

• This script echoes input from the keyboard:

    while (<STDIN>) {
        print;
    }

  The special filehandle STDIN means "standard input", i.e. the keyboard

• Sometimes (e.g. in Windows IDEs) the output isn't printed until the script stops

• This is because of buffering.

• To stop buffering, set to "autoflush":

    $| = 1;
    while (<STDIN>) {
        print;
    }

  $| is the autoflush flag


Matching alternative characters

• [ACGT] matches one A, C, G or T:

    while (<STDIN>) {
        print "Matched: $_" if /[ACGT]/;
    }

  Example session:

    this is not printed
    This is printed
    Matched: This is printed

  (On the original slide, italics denote input text; here the "Matched:" line is the program's output.)

• In general square brackets denote a set of alternative possibilities

• Use - to match a range of characters: [A-Z]
• . matches any single character (except a newline, by default)
• \s matches whitespace (spaces, tabs, newlines)
• \S is anything that's not whitespace
• [^X] matches anything but X


Matching alternative strings

• /(this|that)/ matches "this" or "that"

• ...and is equivalent to /th(is|at)/

    while (<STDIN>) {
        print "Matched: $_" if /this|that|other/;
    }

  Example session:

    Won't match THIS
    Will match this
    Matched: Will match this
    Won't match ThE oThER
    Will match the other
    Matched: Will match the other

  Remember, regexps are case-sensitive


Matching multiple characters

• x* matches zero or more x's (greedily)
• x*? matches zero or more x's (sparingly)
• x+ matches one or more x's (greedily)
• x{n} matches n x's
• x{m,n} matches from m to n x's

Word and string boundaries

• ^ matches the start of a string
• $ matches the end of a string (or just before a final newline)
• \b matches word boundaries
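
The difference between greedy and sparing quantifiers is easiest to see on a concrete string. The following is a minimal sketch; the sample text and variable names are made up for illustration and are not from the slides.

```perl
use strict;
use warnings;

my $text = "<b>bold</b> and <i>italic</i>";

# Greedy: .* grabs as much as possible, so the match runs to the last '>'.
my ($greedy) = $text =~ /(<.*>)/;

# Sparing: .*? stops at the first '>' that lets the match succeed.
my ($sparing) = $text =~ /(<.*?>)/;

# Anchors plus a counted repeat: the whole string must be 3 to 5 DNA letters.
my $is_short_dna = ("ACGT" =~ /^[ACGT]{3,5}$/) ? 1 : 0;

print "greedy:    $greedy\n";
print "sparing:   $sparing\n";
print "short DNA: $is_short_dna\n";
```

Without the ^ and $ anchors, the last pattern would also accept longer strings that merely contain three DNA letters somewhere.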


"Escaping" special characters

• \ is used to "escape" characters that otherwise have meaning in a regexp

• so \[ matches the character "["
  – if not escaped, "[" signifies the start of a list of alternative characters, as in [ACGT]
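
As a small illustration of escaping, the sketch below contrasts escaped and unescaped metacharacters. The sample strings are invented; quotemeta is a Perl built-in that escapes a whole string for you.

```perl
use strict;
use warnings;

my $line = "cost[USD] = 3.50";

# Escaped brackets match a literal '[' and ']'.
my $has_tag = ($line =~ /\[USD\]/) ? 1 : 0;

# An unescaped '.' matches any character, so "3x50" also matches /3.50/.
my $loose = ("3x50" =~ /3.50/)  ? 1 : 0;
my $exact = ("3x50" =~ /3\.50/) ? 1 : 0;

# quotemeta escapes every regexp metacharacter in a string.
my $escaped = quotemeta("[USD]");

print "has tag: $has_tag, loose: $loose, exact: $exact\n";
print "escaped: $escaped\n";
```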


Retrieving what was matched

• If parts of the pattern are enclosed by parentheses, then (following the match) those parts can be retrieved from the scalars $1, $2...

• e.g. /the (\S+) sat on the (\S+) drinking (\S+)/
• matches "the cat sat on the mat drinking milk"
• with $1="cat", $2="mat", $3="milk"

    $| = 1;
    while (<STDIN>) {
        if (/(a|the) (\S+)/i) {
            print "Noun: $2\n";
        }
    }

  Example session:

    Pick up the cup
    Noun: cup
    Sit on a chair
    Noun: chair
    Put the milk in the tea
    Noun: milk

  Note: only the first "the" is picked up by this regexp
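
Captured groups can also be collected directly by evaluating the match in list context, which avoids reading $1, $2... after a match that may have failed. This is a minimal sketch using the example sentence from this slide; the variable names are illustrative.

```perl
use strict;
use warnings;

my $sentence = "the cat sat on the mat drinking milk";

# In list context a successful match returns the captured groups.
my ($animal, $place, $drink) =
    $sentence =~ /the (\S+) sat on the (\S+) drinking (\S+)/;

# A failed match returns an empty list, so nothing stale is picked up.
my @none = "no match here" =~ /the (\S+) sat/;

print "$animal / $place / $drink\n";
print "captures from failed match: ", scalar(@none), "\n";
```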


Variations and modifiers

• //i ignores upper/lower case distinctions:

    while (<STDIN>) {
        print "Matched: $_" if /pattern/i;
    }

  Example session:

    pAttERn
    Matched: pAttERn

• //g starts search where last match left off
  – pos($_) is index of first character after last match

• s/OLD/NEW/ replaces first "OLD" with "NEW"
• s/OLD/NEW/g is "global" (i.e. replaces every occurrence of "OLD" in the string)
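
The effect of the g modifier on both substitution and matching can be checked on a short string. This is a sketch with an invented DNA string, not code from the slides.

```perl
use strict;
use warnings;

my $dna = "ATGCATGC";

# Copy the string, then substitute in the copy.
(my $first_only = $dna) =~ s/ATG/atg/;     # only the first ATG changes
(my $every_one  = $dna) =~ s/ATG/atg/g;    # every ATG changes

# With //g in a while loop, pos() reports where each match ended.
my @match_ends;
while ($dna =~ /ATG/g) {
    push @match_ends, pos($dna);
}

print "$first_only $every_one @match_ends\n";
```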


N-glycosylation site detector

    $| = 1;
    while (<STDIN>) {
        $_ = uc $_;                       # convert to upper case
        while (/(N[^PD][ST])/g) {         # 'g' modifier gets all matches in sequence
            print "Potential N-glycosylation sequence ", $1,
                  " at residue ", pos() - 2, "\n";
        }
    }

pos() is the index of the first residue after the match, starting at zero; so pos()-2 is the index of the first residue of the three-residue match, starting at one.

The main regular expression:

    while (/(N[^PD][ST])/g) { ... }
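
One property of the loop above is that //g resumes after the end of each match, so a site that overlaps the previous one (for example the NST inside NNST) is skipped. If overlapping sites matter, one possible variation is to wrap the pattern in a lookahead. The sketch below assumes this is wanted; find_sequons is a hypothetical helper name, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: return [position, motif] pairs, positions starting at one.
sub find_sequons {
    my ($protein) = @_;
    $protein = uc $protein;
    my @sites;
    # (?=...) matches without consuming residues, so the next search
    # starts one residue further on and overlapping motifs are found.
    while ($protein =~ /(?=(N[^PD][ST]))/g) {
        # After a zero-length match, pos() is the zero-based start of the motif.
        push @sites, [ pos($protein) + 1, $1 ];
    }
    return @sites;
}

foreach my $site (find_sequons("nnstnps")) {
    print "Potential N-glycosylation sequence $site->[1] at residue $site->[0]\n";
}
```

Here NPS is rejected because proline is excluded at the middle position, while both NNS and the overlapping NST are reported.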


PROSITE and Pfam

PROSITE: a database of regular expressions for protein families, domains and motifs

Pfam: a database of Hidden Markov Models (HMMs), equivalent to probabilistic regular expressions


Subroutines

• Often, we can identify self-contained tasks that occur in so many different places that we may want to separate their description from the rest of our program.

• Code for such a task is called a subroutine.

• Examples of such tasks:
  – finding the length of a sequence (NB: Perl provides the subroutine length($x) to do this already)
  – reverse complementing a sequence
  – finding the mean of a list of numbers


Finding all sequence lengths (2)

    open FILE, "fly3utr.txt" or die "Can't open fly3utr.txt: $!";
    while (<FILE>) {
        chomp;
        if (/>/) {
            print_name_and_len();         # subroutine call
            $name = $_;
            $len = 0;
        } else {
            $len += length;
        }
    }
    print_name_and_len();                 # subroutine call
    close FILE;

    # Subroutine definition; code in here is not executed
    # unless the subroutine is called
    sub print_name_and_len {
        if (defined ($name)) {
            print "$name $len\n";
        }
    }


Reverse complement subroutine

    sub revcomp {
        my $rev;                  # "my" announces that $rev is local to the subroutine revcomp
        $rev = reverse ($dna);
        $rev =~ tr/acgt/tgca/;
        return $rev;              # "return" announces that the return value of this
                                  # subroutine is whatever's in $rev
    }

    $rev = 12345;

    $dna = "accggcatg";
    $rev1 = revcomp();
    print "Revcomp of $dna is $rev1\n";

    $dna = "cggcgt";
    $rev2 = revcomp();
    print "Revcomp of $dna is $rev2\n";

    print "Value of rev is $rev\n";

Output:

    Revcomp of accggcatg is catgccggt
    Revcomp of cggcgt is acgccg
    Value of rev is 12345

The value of $rev is unchanged by calls to revcomp.


Revcomp with arguments

    sub revcomp {
        my ($dna) = @_;           # the array @_ holds the arguments to the subroutine
                                  # (in this case, the sequence to be revcomp'd)
        my $rev = reverse ($dna);
        $rev =~ tr/acgt/tgca/;
        return $rev;
    }

    $dna1 = "accggcatg";
    $rev1 = revcomp ($dna1);
    print "Revcomp of $dna1 is $rev1\n";

    $dna2 = "cggcgt";
    $rev2 = revcomp ($dna2);
    print "Revcomp of $dna2 is $rev2\n";

Output:

    Revcomp of accggcatg is catgccggt
    Revcomp of cggcgt is acgccg

Now we don't have to re-use the same variable for the sequence to be revcomp'd.
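
The subroutine above only translates lower-case letters, so an upper-case sequence would be reversed but not complemented. A possible variation, assuming mixed-case input should be supported, is sketched below; revcomp_any_case is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical variant of revcomp that handles both cases.
sub revcomp_any_case {
    my ($dna) = @_;
    my $rev = reverse $dna;            # scalar assignment, so the string is reversed
    $rev =~ tr/acgtACGT/tgcaTGCA/;     # complement lower- and upper-case bases
    return $rev;
}

print revcomp_any_case("accggcatg"), "\n";
print revcomp_any_case("AACG"), "\n";
```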


Mean & standard deviation

    @xdata = (1, 5, 1, 12, 3, 4, 6);
    ($x_mean, $x_sd) = mean_sd (@xdata);

    @ydata = (3.2, 1.4, 2.5, 2.4, 3.6, 9.7);
    ($y_mean, $y_sd) = mean_sd (@ydata);

    sub mean_sd {
        my @data = @_;                # subroutine takes a list of $n numeric arguments
        my $n = @data + 0;
        my $sum = 0;
        my $sqSum = 0;
        foreach my $x (@data) {
            $sum += $x;
            $sqSum += $x * $x;
        }
        my $mean = $sum / $n;
        my $variance = $sqSum / $n - $mean * $mean;
        my $sd = sqrt ($variance);    # square root
        return ($mean, $sd);          # subroutine returns a two-element list: (mean, sd)
    }
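
The formula above computes the population variance as the mean of squares minus the squared mean, which can lose precision when the values are large relative to their spread, and it divides by zero on an empty list. A two-pass variant is sketched below as one possible alternative; mean_sd_two_pass is a hypothetical name and the test data are invented.

```perl
use strict;
use warnings;

# Hypothetical two-pass version: first the mean, then the squared deviations.
sub mean_sd_two_pass {
    my @data = @_;
    my $n = @data;
    die "mean_sd_two_pass: empty list\n" if $n == 0;

    my $sum = 0;
    $sum += $_ for @data;
    my $mean = $sum / $n;

    my $sq_dev = 0;
    $sq_dev += ($_ - $mean) ** 2 for @data;

    my $sd = sqrt($sq_dev / $n);      # population SD, as on the slide
    return ($mean, $sd);
}

my ($mean, $sd) = mean_sd_two_pass(2, 4, 4, 4, 5, 5, 7, 9);
print "mean = $mean, sd = $sd\n";
```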


Maximum element of an array

• Subroutine to find the largest entry in an array

    @num = (1, 5, 1, 12, 3, 4, 6);
    $max = find_max (@num);
    print "Numbers: @num\n";
    print "Maximum: $max\n";

    sub find_max {
        my @data = @_;
        my $max = pop @data;
        foreach my $x (@data) {
            if ($x > $max) {
                $max = $x;
            }
        }
        return $max;
    }

Output:

    Numbers: 1 5 1 12 3 4 6
    Maximum: 12


Including variables in patterns

• Subroutine to find number of instances of a given binding site in a sequence

    $dna = "ACGCGTAAGTCGGCACGCGTACGCGT";
    $mcb = "ACGCGT";
    print "$dna has ", count_matches ($mcb, $dna), " matches to $mcb\n";

    sub count_matches {
        my ($pattern, $text) = @_;
        my $n = 0;
        while ($text =~ /$pattern/g) { ++$n }
        return $n;
    }

Output:

    ACGCGTAAGTCGGCACGCGTACGCGT has 3 matches to ACGCGT
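
When a variable is interpolated into a pattern, any metacharacters it contains keep their special meaning. If the variable is meant as a literal string, it can be wrapped in \Q...\E. The sketch below assumes literal matching is wanted; count_literal is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: count occurrences of a literal string in a text.
sub count_literal {
    my ($literal, $text) = @_;
    # \Q...\E escapes metacharacters in $literal; assigning the match to
    # an empty list and then to a scalar yields the number of matches.
    my $n = () = $text =~ /\Q$literal\E/g;
    return $n;
}

print count_literal("ACGCGT", "ACGCGTAAGTCGGCACGCGTACGCGT"), "\n";
print count_literal("a.c", "abc a.c"), "\n";
```

In the second call an unquoted pattern /a.c/ would also match "abc", because the dot matches any character.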


Data structures

• Suppose we have a file containing a table of Drosophila gene names and cellular compartments, one pair on each line:

    Cyp12a5 Mitochondrion
    MRG15 Nucleus
    Cop Golgi
    bor Cytoplasm
    Bx42 Nucleus

Suppose this file is in "genecomp.txt"


Reading a table of data

• We can split each line into a 2-element array using the split command.

• This breaks the line at each run of whitespace:

    open FILE, "genecomp.txt" or die "Can't open genecomp.txt: $!";
    while (<FILE>) {
        ($g, $c) = split;
        push @gene, $g;
        push @comp, $c;
    }
    close FILE;
    print "Genes: @gene\n";
    print "Compartments: @comp\n";

  Output:

    Genes: Cyp12a5 MRG15 Cop bor Bx42
    Compartments: Mitochondrion Nucleus Golgi Cytoplasm Nucleus

• The opposite of split is join, which makes a scalar from an array:

    print join (" and ", @gene);

  Output:

    Cyp12a5 and MRG15 and Cop and bor and Bx42


Finding an entry in a table

• The following code assumes that we've already read in the table from the file:

    $geneToFind = shift @ARGV;
    print "Searching for gene $geneToFind\n";
    for ($i = 0; $i < @gene; ++$i) {
        if ($gene[$i] eq $geneToFind) {
            print "Gene: $gene[$i]\n";
            print "Compartment: $comp[$i]\n";
            exit;
        }
    }
    print "Couldn't find gene\n";

• Example: $ARGV[0] = "Cop"

    Searching for gene Cop
    Gene: Cop
    Compartment: Golgi


Binary search

• The previous algorithm is inefficient. If there are N entries in the list, then on average we have to search through ½(N+1) entries to find the one we want.

• For the full Drosophila genome, N=12,000. This is painfully slow.

• An alternative is the Binary Search algorithm:
  – Start with a sorted list.
  – Compare the middle element with the one we want. Pick the half of the list that contains our element.
  – Iterate this procedure to locate the right element. This takes around log2(N) steps.
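
The three steps above can be sketched in Perl as follows. This is an illustrative sketch, not code from the slides: binary_search is a hypothetical name, and the sorted list is passed as an array reference (a feature not covered in this lecture) so that it is not copied on every call.

```perl
use strict;
use warnings;

# Hypothetical helper: return the index of $target in a sorted array
# (given by reference), or -1 if it is absent.
sub binary_search {
    my ($target, $sorted) = @_;
    my ($lo, $hi) = (0, $#{$sorted});
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);          # middle element
        my $cmp = $target cmp $sorted->[$mid];   # string comparison
        if    ($cmp == 0) { return $mid; }       # found it
        elsif ($cmp < 0)  { $hi = $mid - 1; }    # keep the lower half
        else              { $lo = $mid + 1; }    # keep the upper half
    }
    return -1;
}

# sort uses the same string ordering as cmp, which the search relies on.
my @genes = sort qw(Cyp12a5 MRG15 Cop bor Bx42);
print "Sorted: @genes\n";
print "Cop is at index ", binary_search("Cop", \@genes), "\n";
```

The list must be sorted with the same comparison that the search uses; here both are plain string comparison, so upper-case names sort before lower-case ones.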


Associative arrays (hashes)

• Implementing algorithms like binary search is a common task in languages like C.

• Conveniently, Perl provides a type of array called an associative array (also called a hash) that is pre-indexed for quick search.

• An associative array is a set of key-value pairs (like our gene-compartment table)

    $comp{"Cop"} = "Golgi";

  Curly braces {} are used to index an associative array


Reading a table using hashes

    open FILE, "genecomp.txt" or die "Can't open genecomp.txt: $!";
    while (<FILE>) {
        ($g, $c) = split;
        $comp{$g} = $c;
    }
    $geneToFind = shift @ARGV;
    print "Gene: $geneToFind\n";
    print "Compartment: ", $comp{$geneToFind}, "\n";

...with $ARGV[0] = "Cop" as before:

    Gene: Cop
    Compartment: Golgi
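
The lookup above prints an empty compartment if the gene is not in the table. One way to handle that case, sketched here under the assumption that a fallback label is acceptable, is to test the key with exists first. The helper name compartment_of and the "unknown" label are illustrative.

```perl
use strict;
use warnings;

my %comp = (
    'Cop' => 'Golgi',
    'bor' => 'Cytoplasm',
);

# Hypothetical helper: look a gene up, with a fallback for unknown names.
sub compartment_of {
    my ($table, $gene) = @_;          # $table is a reference to the hash
    return exists $table->{$gene} ? $table->{$gene} : "unknown";
}

print "Cop: ", compartment_of(\%comp, "Cop"), "\n";
print "xyz: ", compartment_of(\%comp, "xyz"), "\n";
```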


Reading a FASTA file into a hash

    sub read_FASTA {
        my ($filename) = @_;
        my (%name2seq, $name, $seq);
        open FILE, $filename or die "Can't open $filename: $!";
        while (<FILE>) {
            chomp;
            if (/^>/) {                          # header line starts a new sequence
                s/^>//;
                if (defined $name) {
                    $name2seq{$name} = $seq;     # store the previous sequence
                }
                $name = $_;
                $seq = "";
            } else {
                $seq .= $_;
            }
        }
        $name2seq{$name} = $seq if defined $name;    # store the last sequence
        close FILE;
        return %name2seq;
    }
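
To try the same parsing logic without a file on disk, Perl can open a filehandle on a string. The sketch below is a variation written for illustration, not the author's code: read_fasta_fh is a hypothetical name, it takes an already opened filehandle, and the two sequences are invented.

```perl
use strict;
use warnings;

# Hypothetical variant: parse FASTA records from an open filehandle.
sub read_fasta_fh {
    my ($fh) = @_;
    my (%name2seq, $name);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(.*)/) {          # header line: remember the name
            $name = $1;
            $name2seq{$name} = "";
        } elsif (defined $name) {         # sequence line: append to current record
            $name2seq{$name} .= $line;
        }
    }
    return %name2seq;
}

# An in-memory "file": open a filehandle on a reference to a string.
my $fasta = ">seq1\nACGT\nAC\n>seq2\nGGG\n";
open my $fh, '<', \$fasta or die "Can't open in-memory file: $!";
my %seqs = read_fasta_fh($fh);
close $fh;

print "$_: $seqs{$_}\n" for sort keys %seqs;
```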


Formatted output of sequences

    sub print_seq {
        my ($name, $seq) = @_;
        print ">$name\n";
        my $width = 50;                   # 50-column output
        for (my $i = 0; $i < length($seq); $i += $width) {
            if ($i + $width > length($seq)) {
                $width = length($seq) - $i;
            }
            print substr ($seq, $i, $width), "\n";
        }
    }

The term substr($x,$i,$len) returns the substring of $x starting at position $i with length $len.

For example, substr("Biology",3,3) is "log"


keys and values

• keys returns the list of keys in the hash
  – e.g. names, in the %name2seq hash

• values returns the list of values
  – e.g. sequences, in the %name2seq hash

    %name2seq = read_FASTA ("fly3utr.txt");
    print "Sequence names: ", join (" ", keys (%name2seq)), "\n";
    my $len = 0;
    foreach $seq (values %name2seq) {
        $len += length ($seq);
    }
    print "Total length: $len\n";

Output:

    Sequence names: CG11488 CG11604 CG11455
    Total length: 210


Files of sequence names

• Easy way to specify a subset of a given FASTA database

• Each line is the name of a sequence in a given database

• e.g.

    CG1167
    CG685
    CG1041
    CG1043


Get named sequences

• Given a FASTA database and a "file of sequence names", print every named sequence:

    ($fasta, $fosn) = @ARGV;
    %name2seq = read_FASTA ($fasta);
    open FILE, $fosn or die "Can't open $fosn: $!";
    while ($name = <FILE>) {
        chomp $name;
        $seq = $name2seq{$name};
        if (defined $seq) {
            print_seq ($name, $seq);
        } else {
            warn "Can't find sequence: $name. ",
                 "Known sequences: ", join (" ", keys %name2seq), "\n";
        }
    }
    close FILE;


Intersection of two sets

• Two files of sequence names:

    fosn1.txt:
    CG1167
    CG685
    CG1041
    CG1043

    fosn2.txt:
    CG215
    CG1041
    CG483
    CG1167
    CG1163

• What is the overlap?

• Find intersection using hashes:

    open FILE1, "fosn1.txt" or die "Can't open fosn1.txt: $!";
    while (<FILE1>) { $gotName{$_} = 1; }
    close FILE1;
    open FILE2, "fosn2.txt" or die "Can't open fosn2.txt: $!";
    while (<FILE2>) {
        print if $gotName{$_};
    }
    close FILE2;

  Output:

    CG1041
    CG1167
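
The same idea works on lists that are already in memory. The sketch below is illustrative and assumes the names have been chomped; intersect_names is a hypothetical helper that takes two array references.

```perl
use strict;
use warnings;

# Hypothetical helper: names from the second list that also occur in the first.
sub intersect_names {
    my ($first, $second) = @_;
    my %seen = map { $_ => 1 } @$first;     # hash of names in the first list
    return grep { $seen{$_} } @$second;     # keep second-list names seen before
}

my @fosn1 = qw(CG1167 CG685 CG1041 CG1043);
my @fosn2 = qw(CG215 CG1041 CG483 CG1167 CG1163);
my @both  = intersect_names(\@fosn1, \@fosn2);

print "$_\n" for @both;
```

As in the file-based version, the result follows the order of the second list.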


Assigning hashes

• A hash can be assigned directly, as a list of "key=>value" pairs:

    %comp = ('Cyp12a5' => 'Mitochondrion',
             'MRG15'   => 'Nucleus',
             'Cop'     => 'Golgi',
             'bor'     => 'Cytoplasm',
             'Bx42'    => 'Nucleus');
    print "keys: ", join(";",keys(%comp)), "\n";
    print "values: ", join(";",values(%comp)), "\n";

Output:

    keys: bor;Cop;Bx42;Cyp12a5;MRG15
    values: Cytoplasm;Golgi;Nucleus;Mitochondrion;Nucleus


The genetic code as a hash

    %aa = ('ttt'=>'F', 'tct'=>'S', 'tat'=>'Y', 'tgt'=>'C',
           'ttc'=>'F', 'tcc'=>'S', 'tac'=>'Y', 'tgc'=>'C',
           'tta'=>'L', 'tca'=>'S', 'taa'=>'!', 'tga'=>'!',
           'ttg'=>'L', 'tcg'=>'S', 'tag'=>'!', 'tgg'=>'W',
           'ctt'=>'L', 'cct'=>'P', 'cat'=>'H', 'cgt'=>'R',
           'ctc'=>'L', 'ccc'=>'P', 'cac'=>'H', 'cgc'=>'R',
           'cta'=>'L', 'cca'=>'P', 'caa'=>'Q', 'cga'=>'R',
           'ctg'=>'L', 'ccg'=>'P', 'cag'=>'Q', 'cgg'=>'R',
           'att'=>'I', 'act'=>'T', 'aat'=>'N', 'agt'=>'S',
           'atc'=>'I', 'acc'=>'T', 'aac'=>'N', 'agc'=>'S',
           'ata'=>'I', 'aca'=>'T', 'aaa'=>'K', 'aga'=>'R',
           'atg'=>'M', 'acg'=>'T', 'aag'=>'K', 'agg'=>'R',
           'gtt'=>'V', 'gct'=>'A', 'gat'=>'D', 'ggt'=>'G',
           'gtc'=>'V', 'gcc'=>'A', 'gac'=>'D', 'ggc'=>'G',
           'gta'=>'V', 'gca'=>'A', 'gaa'=>'E', 'gga'=>'G',
           'gtg'=>'V', 'gcg'=>'A', 'gag'=>'E', 'ggg'=>'G' );


Translating: DNA to protein

    $prot = translate ("gatgacgaaagttgt");
    print $prot;

    sub translate {
        my ($dna) = @_;
        $dna = lc ($dna);
        my $len = length ($dna);
        if ($len % 3 != 0) {
            die "Length $len is not a multiple of 3";
        }
        my $protein = "";
        for (my $i = 0; $i < $len; $i += 3) {
            my $codon = substr ($dna, $i, 3);
            if (!defined ($aa{$codon})) {
                die "Codon $codon is illegal";
            }
            $protein .= $aa{$codon};
        }
        return $protein;
    }

Output:

    DDESC


Counting residue frequencies

    %count = count_residues ("gatgacgaaagttgt");
    @residues = keys (%count);
    foreach $residue (@residues) {
        print "$residue: $count{$residue}\n";
    }

    sub count_residues {
        my ($seq) = @_;
        my %freq;
        $seq = lc ($seq);
        for (my $i = 0; $i < length($seq); ++$i) {
            my $residue = substr ($seq, $i, 1);
            ++$freq{$residue};
        }
        return %freq;
    }

Output:

    g: 5
    a: 5
    c: 1
    t: 4


Counting N-mer frequencies

    %count = count_nmers ("gatgacgaaagttgt", 2);
    @nmers = keys (%count);
    foreach $nmer (@nmers) {
        print "$nmer: $count{$nmer}\n";
    }

    sub count_nmers {
        my ($seq, $n) = @_;
        my %freq;
        $seq = lc ($seq);
        for (my $i = 0; $i <= length($seq) - $n; ++$i) {
            my $nmer = substr ($seq, $i, $n);
            ++$freq{$nmer};
        }
        return %freq;
    }

Output:

    cg: 1
    tt: 1
    ga: 3
    tg: 2
    gt: 2
    aa: 2
    ac: 1
    at: 1
    ag: 1


N-mer frequencies for a whole file

    my %name2seq = read_FASTA ("fly3utr.txt");
    while (($name, $seq) = each %name2seq) {
        %count = count_nmers ($seq, 2, %count);
    }
    @nmers = keys (%count);
    foreach $nmer (@nmers) {
        print "$nmer: $count{$nmer}\n";
    }

    sub count_nmers {
        my ($seq, $n, %freq) = @_;
        $seq = lc ($seq);
        for (my $i = 0; $i <= length($seq) - $n; ++$i) {
            my $nmer = substr ($seq, $i, $n);
            ++$freq{$nmer};
        }
        return %freq;
    }

Output:

    ct: 5
    tc: 9
    tt: 26
    cg: 4
    ga: 11
    tg: 12
    gc: 2
    gt: 17
    aa: 39
    ac: 10
    gg: 4
    at: 17
    ca: 11
    ag: 15
    ta: 20
    cc: 2

The each command is a shorthand for looping through each (key,value) pair in a hash.

Note how we keep passing %freq back into the count_nmers subroutine, to get cumulative counts.
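
Passing %freq in and out copies the whole hash on every call. An alternative, sketched here with hash references (which this lecture has not introduced), is to let the subroutine update the caller's hash in place. The name add_nmer_counts and the two short sequences are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical variant: add the N-mer counts of $seq to the hash that
# $freq refers to, instead of returning a copy.
sub add_nmer_counts {
    my ($seq, $n, $freq) = @_;
    $seq = lc $seq;
    for (my $i = 0; $i <= length($seq) - $n; ++$i) {
        ++$freq->{ substr($seq, $i, $n) };
    }
    return;
}

my %count;
add_nmer_counts($_, 2, \%count) for ("gatgac", "gaaagt");

foreach my $nmer (sort keys %count) {
    print "$nmer: $count{$nmer}\n";
}
```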


Files and filehandles

• Opening a file:

    open XYZ, $filename;

  This XYZ is the filehandle

• Closing a file:

    close XYZ;

• Reading a line:

    $data = <XYZ>;

• Reading an array:

    @data = <XYZ>;

• Printing a line:

    print XYZ $data;

• Read-only:

    open XYZ, "<$filename";

• Write-only:

    open XYZ, ">$filename";

• Test if file exists:

    if (-e $filename) {
        print "$filename exists!\n";
    }
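
The operations listed above can be combined into a short round trip. The sketch below uses the three-argument form of open with lexical filehandles, a common modern alternative to the bareword style on the slide, and checks each open for failure. The temporary file comes from the core File::Temp module; the contents are invented.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create an empty temporary file so the example does not touch real data.
my ($tmp_fh, $filename) = tempfile();
close $tmp_fh;

# Write-only: the mode '>' is a separate argument from the file name.
open my $out, '>', $filename or die "Can't write $filename: $!";
print $out "line one\nline two\n";
close $out;

my $exists = (-e $filename) ? 1 : 0;

# Read-only: read the whole file into an array, one line per element.
open my $in, '<', $filename or die "Can't read $filename: $!";
my @lines = <$in>;
close $in;

unlink $filename;

print "exists: $exists, lines read: ", scalar(@lines), "\n";
```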
