Parsing a File with Perl
Regexp, substr and oneliners
Paolo Marcatili - Programmazione 09-10
Agenda
Today we will see how to:
> Extract information from a file
> Use substr and regexp
We already know how to use:
> Scalar variables $ and arrays @
> if, for, while, open, print, close…
Task Today
Protein Structures
1st task:
> Open a PDB file
> Apply a symmetry transformation
> Extract data from the file header
Zinc Finger
2nd task:
> Open a fasta file
> Find all occurrences of Zinc Fingers
(homework?)
Parsing
Rationale
Biological data -> human readable files
If you can read it, Perl can read it as well
*BUT*
It can be tricky
Parsing flow-chart
Open the file
For each line {
    look for "grammar" and store data
}
Close file
Use data
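The flow-chart above could be sketched in Perl roughly as follows. This is only a minimal sketch: the sample text, the in-memory filehandle (used so that the example runs without a file on disk) and the "grammar" (lines that start with ATOM) are all assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data; in practice you would open a real file instead.
my $text = "HEADER demo\nATOM 1\nATOM 2\nEND\n";

# Open the "file" (here an in-memory filehandle on $text)
open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";

my @records;                       # where we store the data
while (my $line = <$fh>) {         # for each line
    chomp $line;
    # look for the "grammar": here, lines that start with ATOM
    if (substr($line, 0, 4) eq 'ATOM') {
        push @records, $line;      # store data
    }
}
close $fh;                         # close file

# Use data
print scalar(@records), " ATOM lines found\n";
```

With a real file, only the open line changes; the loop stays the same.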
Substr
Substr
substr($data, start, length)
returns a substring from the expression supplied as first argument.
Substr
substr($data, start, length)
  $data:  your string
  start:  where to start, counting from 0
  length: you can omit this (you will extract up to the end of the string)
Substr
substr($data, start, length)
Examples:
my $data = "il mattino ha l'oro in bocca";
print substr($data, 0) . "\n";     # prints the whole string
print substr($data, 3, 5) . "\n";  # prints matti
print substr($data, 23) . "\n";    # prints bocca
print substr($data, -5) . "\n";    # prints bocca
Pdb rotation
PDB
ATOM      4  O   ASP L   1      43.716 -12.235  68.502  1.00 70.05           O
ATOM      5  N   ILE L   2      44.679 -10.569  69.673  1.00 48.19           N
…

COLUMNS   DATA TYPE      FIELD     DEFINITION
-------------------------------------------------------------------------
 1 -  6   Record name    "ATOM  "
 7 - 11   Integer        serial    Atom serial number.
13 - 16   Atom           name      Atom name.
17        Character      altLoc    Alternate location indicator.
18 - 20   Residue name   resName   Residue name.
22        Character      chainID   Chain identifier.
23 - 26   Integer        resSeq    Residue sequence number.
27        AChar          iCode     Code for insertion of residues.
31 - 38   Real(8.3)      x         Orthogonal coordinates for X in Angstroms
39 - 46   Real(8.3)      y         Orthogonal coordinates for Y in Angstroms
47 - 54   Real(8.3)      z         Orthogonal coordinates for Z in Angstroms
55 - 80   Bla Bla Bla (not useful for our purposes)
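The column table counts from 1, while substr counts from 0, so a field in columns from..to starts at offset from-1 and has length to-from+1. A minimal sketch of that conversion; the helper name columns is hypothetical, and the sample line is assumed to follow the fixed-width layout exactly:

```perl
use strict;
use warnings;

# One ATOM record, laid out according to the column table.
my $line = "ATOM      4  O   ASP L   1      43.716 -12.235  68.502  1.00 70.05           O";

# Hypothetical helper: extract the 1-based columns $from..$to (inclusive).
sub columns {
    my ($str, $from, $to) = @_;
    return substr($str, $from - 1, $to - $from + 1);
}

my $serial  = columns($line,  7, 11);
my $name    = columns($line, 13, 16);
my $resName = columns($line, 18, 20);
my $chainID = columns($line, 22, 22);
my $resSeq  = columns($line, 23, 26);
my $x       = columns($line, 31, 38);
my $y       = columns($line, 39, 46);
my $z       = columns($line, 47, 54);

# Fields are padded with spaces; strip them before use.
s/^\s+|\s+$//g for ($serial, $name, $resName, $resSeq, $x, $y, $z);

print "atom $serial ($name) of $resName$resSeq, chain $chainID: x=$x y=$y z=$z\n";
```

Splitting on whitespace would be shorter, but it breaks as soon as two fields touch (for example a coordinate that fills all 8 characters), which is why fixed columns are used here.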
Rotation
X->Z
Y->X   ===> rotation of 120° around u=(1,1,1)
Z->Y
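One way to convince yourself that this swap of coordinates is a 120° rotation: it leaves the axis u=(1,1,1) where it is, and applying it three times (3 × 120° = 360°) brings every point back to its starting position. A minimal sketch, with a hypothetical helper rotate and sample coordinates chosen only for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: the new (x, y, z) is the old (z, x, y),
# i.e. a cyclic swap of the three coordinates.
sub rotate {
    my ($x, $y, $z) = @_;
    return ($z, $x, $y);
}

my @axis  = rotate(1, 1, 1);                  # the axis u is not moved
my @point = (43.716, -12.235, 68.502);        # sample coordinates
my @once  = rotate(@point);                   # one rotation
my @back  = rotate(rotate(rotate(@point)));   # three rotations

print "axis: @axis\n";
print "once: @once\n";
print "back: @back\n";
```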
Rotation
#! /usr/bin/perl -w
use strict;

open(IG, "<IG.pdb") || die "cannot open IG.pdb:$!";
open(IGR, ">IG_rotated.pdb") || die "cannot open IG_rotated.pdb:$!";
while (my $line = <IG>) {
    if (substr($line, 0, 4) eq "ATOM") {
        # x, y, z are in columns 31-38, 39-46, 47-54 (offsets 30, 38, 46)
        my $X = substr($line, 30, 8);
        my $Y = substr($line, 38, 8);
        my $Z = substr($line, 46, 8);
        # write the coordinates back in the order Z, X, Y
        print IGR substr($line, 0, 30) . $Z . $X . $Y . substr($line, 54);
    }
    else {
        print IGR $line;
    }
}
close IG;
close IGR;
RegExp
Regular Expressions
PDB files have a "fixed" structure.
What if we want to do something like "check for a valid email address"…
1. There must be some letters or numbers
2. There must be a @
3. Other letters
4. [email protected] is good
   [email protected] is not good
Regular Expressions
$line =~ m/^[a-z0-9._]+@[^.]+\.[a-z]{2,}$/
WHAAAT???
This means:
Check if $line has some chars at the beginning, then @, then some non-dots, then a dot, then at least two letters
…. Ok, let's start from something simpler :)
Regular Expressions
$line =~ m/^ATOM/
Line starts with ATOM
$line =~ m/^ATOM\s+/
Line starts with ATOM, then there are some spaces
$line =~ m/^ATOM\s+[\-0-9]+/
Line starts with ATOM, then there are some spaces, then there are some digits or -
$line =~ m/^ATOM\s+\-?[0-9]+/
Line starts with ATOM, then there are some spaces, then there can be a minus, then some digits
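To see how each added piece narrows the match, the patterns can be tried on a few lines. A minimal sketch; the sample lines are made up for illustration and their spacing is not exact PDB layout:

```perl
use strict;
use warnings;

# Made-up sample lines, for illustration only.
my @lines = (
    "ATOM      4  O   ASP L   1",
    "ATOM     -4  O   ASP L   1",
    "ATOMS are not records",
    "HETATM    1 ZN    ZN A 100",
);

my %matched;    # line => number of patterns that matched
for my $line (@lines) {
    my @hits;
    push @hits, 'ATOM'                  if $line =~ m/^ATOM/;
    push @hits, 'ATOM + spaces'         if $line =~ m/^ATOM\s+/;
    push @hits, 'ATOM + spaces + number' if $line =~ m/^ATOM\s+\-?[0-9]+/;
    $matched{$line} = scalar @hits;
    print "$line => ", (@hits ? join(', ', @hits) : 'no match'), "\n";
}
```

The third line shows why the extra pieces matter: it starts with ATOM, so the first pattern accepts it, but the stricter ones reject it.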
PDB Header
We want to find %id for L and H chain
$pidL = $1 if ($line =~ m/REMARK SUMMARY-ID_GLOB_L:([\.0-9]+)/);
$pidH = $1 if ($line =~ m/REMARK SUMMARY-ID_GLOB_H:([\.0-9]+)/);
ONELINER!!
cat IG.pdb | perl -ne 'print "$1\n" if ($_ =~ m/^REMARK SUMMARY-ID_GLOB_([LH]:[\.0-9]+)/);'
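The same extraction can also be written as a small script that keeps both values in a hash. This is a minimal sketch under assumptions: the header lines and their numbers below are invented placeholders, and an in-memory filehandle stands in for IG.pdb so the example runs on its own.

```perl
use strict;
use warnings;

# Made-up header lines; the numbers are placeholders, not real data.
my $header = join "\n",
    "REMARK SUMMARY-ID_GLOB_L:87.5",
    "REMARK SUMMARY-ID_GLOB_H:91.2",
    "ATOM      4  O   ASP L   1",
    "";

open(my $fh, '<', \$header) or die "cannot open in-memory header: $!";

my %pid;    # %id, keyed by chain (L or H)
while (my $line = <$fh>) {
    # $1 captures the chain letter, $2 the number
    if ($line =~ m/^REMARK SUMMARY-ID_GLOB_([LH]):([\.0-9]+)/) {
        $pid{$1} = $2;
    }
}
close $fh;

print "chain $_: $pid{$_}\n" for sort keys %pid;
```

Using two capture groups and a hash avoids writing one nearly identical line per chain.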
Zinc Finger
Zinc Finger
Zinc fingers are a large superfamily of protein domains that can bind to DNA.
A zinc finger consists of two antiparallel β strands and an α helix.
The zinc ion is crucial for the stability of this domain type: in the absence of the metal ion the domain unfolds, as it is too small to have a hydrophobic core.
The consensus sequence of a single finger is:
C-X{2-4}-C-X{3}-[LIVMFYWC]-X{8}-H-X{3}-H
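The consensus notation maps onto regexp syntax piece by piece: a letter stands for itself, X means any residue (a dot), X{2-4} becomes .{2,4}, and a bracketed set such as [LIVMFYWC] is already a character class. A minimal sketch for the first piece only, C-X{2-4}-C, on a made-up sequence; the rest of the motif is left as an exercise.

```perl
use strict;
use warnings;

# Made-up sequence, for illustration only.
my $seq = "MKCAACGGHSTWCDEFGCKL";

# C-X{2-4}-C  ->  C.{2,4}C
# With the /g flag, each pass of the loop finds the next
# (non-overlapping) occurrence and stores it in $1.
my @found;
while ($seq =~ m/(C.{2,4}C)/g) {
    push @found, $1;
}
print "$_\n" for @found;
```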
Homework
Find all occurrences of the ZF motif in zincfinger.fasta
Put them in the file ZF_motif.fasta
e.g.
Weofjpihouwefghoicacvgnfglapglifhtylhyuiui
cacvgnfglapglifhtylh