Bioinformatics master course, ‘11/’12 Paolo Marcatili Parsing a File with Perl Regexp, substr and oneliners
Regexp master 2011

May 09, 2015

Paolo Marcatili
Page 1: Regexp master 2011

Bioinformatics master course, ‘11/’12

Paolo Marcatili

Parsing a File with Perl

Regexp, substr and oneliners

Page 2: Regexp master 2011

Agenda

Today we will see how to:
• Extract information from a file
• Use substr and regexp

We already know how to use:
• Scalar variables $ and arrays @
• if, for, while, open, print, close…

Page 3: Regexp master 2011

Task Today

Page 4: Regexp master 2011

Protein Structures

1st task:
• Open a PDB file
• Apply a symmetry transformation
• Extract data from the file header

Page 5: Regexp master 2011

Zinc Finger

2nd task:
• Open a FASTA file
• Find all occurrences of zinc fingers

(homework?)

Page 6: Regexp master 2011

Parsing

Page 7: Regexp master 2011

Rationale

Biological data -> human readable files

If you can read it, Perl can read it as well

*BUT* it can be tricky

Page 8: Regexp master 2011

Parsing flow-chart

Open the file
For each line {
    look for “grammar” and store data
}
Close the file
Use the data
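
The flow-chart above could be sketched in Perl roughly as follows. This is only an illustration: the input text, the "NAME " prefix used as the “grammar”, and the sub name collect_names are all made up for the example, and the “file” is an in-memory string so the sketch runs without anything on disk.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input, kept in a string so the sketch needs no file on disk.
my $text = "NAME alpha\nSKIP this line\nNAME beta\n";

# Hypothetical helper: returns whatever follows "NAME " on matching lines.
sub collect_names {
    my ($content) = @_;
    my @names;

    # Open the "file" (here an in-memory filehandle over the string).
    open(my $fh, '<', \$content) or die "cannot open in-memory file: $!";

    # For each line: look for the "grammar" and store the data.
    while (my $line = <$fh>) {
        chomp $line;
        if (substr($line, 0, 5) eq 'NAME ') {
            push @names, substr($line, 5);
        }
    }

    # Close the file.
    close $fh;
    return @names;
}

# Use the data.
my @names = collect_names($text);
print join(",", @names), "\n";
```

With a real file, the only change would be the open call, which would take a file name instead of a reference to a string.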

Page 9: Regexp master 2011

Substr

Page 10: Regexp master 2011

Substr

substr($data, start, length)

returns a substring of the expression supplied as the first argument.

Page 11: Regexp master 2011

Substr

substr($data, start, length)

$data:   your string
start:   position to start from (counted from 0)
length:  you can omit this (you will then extract up to the end of the string)

Page 12: Regexp master 2011

Substr

substr($data, start, length)

Examples:

my $data = "il mattino ha l'oro in bocca";
print substr($data,0) . "\n";     # prints the whole string
print substr($data,3,5) . "\n";   # prints matti
print substr($data,23) . "\n";    # prints bocca
print substr($data,-5) . "\n";    # prints bocca

Page 13: Regexp master 2011

Pdb rotation

Page 14: Regexp master 2011

PDB

ATOM      4  O   ASP L   1      43.716 -12.235  68.502  1.00 70.05           O
ATOM      5  N   ILE L   2      44.679 -10.569  69.673  1.00 48.19           N
…

COLUMNS   DATA TYPE      FIELD     DEFINITION
--------------------------------------------------------------------------
 1 -  6   Record name    "ATOM  "
 7 - 11   Integer        serial    Atom serial number.
13 - 16   Atom           name      Atom name.
17        Character      altLoc    Alternate location indicator.
18 - 20   Residue name   resName   Residue name.
22        Character      chainID   Chain identifier.
23 - 26   Integer        resSeq    Residue sequence number.
27        AChar          iCode     Code for insertion of residues.
31 - 38   Real(8.3)      x         Orthogonal coordinates for X in Angstroms
39 - 46   Real(8.3)      y         Orthogonal coordinates for Y in Angstroms
47 - 54   Real(8.3)      z         Orthogonal coordinates for Z in Angstroms
55 - 80   Bla Bla Bla (not useful for our purposes)
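
The column table counts from 1, while substr counts from 0, so column 31 becomes offset 30, and so on. A minimal sketch of pulling the fields out of one ATOM record could look like this. The sub name parse_atom and the hash keys are chosen only for illustration, and the sample record is rebuilt with sprintf from the values of the first ATOM line so that the column widths are exact.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample record rebuilt on the fixed PDB columns (values from the first ATOM line).
my $line = sprintf("%-6s%5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f",
    'ATOM', 4, ' O', '', 'ASP', 'L', 1, '', 43.716, -12.235, 68.502, 1.00, 70.05);

# Hypothetical helper: returns a hash reference with the fields of an ATOM
# record, or nothing if the line is not an ATOM record.
sub parse_atom {
    my ($rec) = @_;
    return unless substr($rec, 0, 6) eq 'ATOM  ';

    my %atom = (
        serial  => substr($rec,  6, 5),   # columns  7 - 11
        name    => substr($rec, 12, 4),   # columns 13 - 16
        resName => substr($rec, 17, 3),   # columns 18 - 20
        chainID => substr($rec, 21, 1),   # column  22
        resSeq  => substr($rec, 22, 4),   # columns 23 - 26
        x       => substr($rec, 30, 8),   # columns 31 - 38
        y       => substr($rec, 38, 8),   # columns 39 - 46
        z       => substr($rec, 46, 8),   # columns 47 - 54
    );

    # Fixed-width fields are padded with spaces: strip them.
    for my $key (keys %atom) {
        $atom{$key} =~ s/^\s+|\s+$//g;
    }
    return \%atom;
}

my $atom = parse_atom($line);
print "$atom->{resName} $atom->{chainID} $atom->{resSeq}: ",
      "$atom->{x} $atom->{y} $atom->{z}\n";
```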

Page 15: Regexp master 2011

Symmetry

X -> Z
Y -> X
Z -> Y

[Figure: coordinate axes X and Y]

Page 16: Regexp master 2011

Rotation

#!/usr/bin/perl -w
use strict;

open(IG, "<IG.pdb") || die "cannot open IG.pdb: $!";
open(IGR, ">IG_rotated.pdb") || die "cannot open IG_rotated.pdb: $!";
while (my $line = <IG>) {
    if (substr($line,0,4) eq "ATOM") {
        # coordinates sit in columns 31-38, 39-46 and 47-54 (offsets 30, 38, 46)
        my $X = substr($line,30,8);
        my $Y = substr($line,38,8);
        my $Z = substr($line,46,8);
        # write Z, X, Y into the X, Y, Z columns; the rest of the line is unchanged
        print IGR substr($line,0,30).$Z.$X.$Y.substr($line,54);
    }
    else {
        print IGR $line;
    }
}
close IG;
close IGR;
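
To try the coordinate swap without IG.pdb at hand, the same substr logic can be applied to a single line held in memory. This is a sketch only: the sub name rotate_atom_line is invented for the example, and the sample record is rebuilt with sprintf from the values of the first ATOM line shown earlier.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample record rebuilt on the fixed PDB columns.
my $line = sprintf("%-6s%5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f",
    'ATOM', 4, ' O', '', 'ASP', 'L', 1, '', 43.716, -12.235, 68.502, 1.00, 70.05);

# Hypothetical helper: same swap as the script, applied to one line.
sub rotate_atom_line {
    my ($rec) = @_;
    # Lines that are not ATOM records are returned unchanged.
    return $rec unless substr($rec, 0, 4) eq 'ATOM';

    my $X = substr($rec, 30, 8);
    my $Y = substr($rec, 38, 8);
    my $Z = substr($rec, 46, 8);

    # Z, X, Y go into the X, Y, Z columns; everything else is kept.
    return substr($rec, 0, 30) . $Z . $X . $Y . substr($rec, 54);
}

print "$line\n";
print rotate_atom_line($line), "\n";
```

Because each coordinate field is exactly 8 characters wide, swapping the raw substrings keeps the line length and the column alignment intact.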

Page 17: Regexp master 2011

RegExp

Page 19: Regexp master 2011

Regular Expressions

PDB files have a “fixed” structure.

What if we want to do something like “check for a valid email address”…
1. There must be some letters or numbers
2. There must be a @
3. Other letters

[email protected] is good
[email protected] is not good

Page 20: Regexp master 2011

Regular Expressions

$line =~ m/^[a-z0-9._]+@[^.]+\.[a-z]{2,}$/

WHAAAT???

This means: check whether $line starts with some letters, digits, dots or underscores, then has a @, then some characters that are not dots, then a dot, then at least two letters, and nothing else.

… OK, let’s start from something simpler :)
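
A quick way to get a feeling for such a pattern is to run it against a few strings. The sketch below is only an illustration: the sub name looks_like_email and the addresses are invented, and the first character class is written as [a-z0-9._], since inside [...] a space or a | is matched literally rather than acting as a separator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the string passes the rough check, 0 otherwise.
sub looks_like_email {
    my ($line) = @_;
    # The @ is escaped so that it can never be read as the start of an array name.
    return $line =~ m/^[a-z0-9._]+\@[^.]+\.[a-z]{2,}$/ ? 1 : 0;
}

# Invented test addresses.
for my $candidate ('mario.rossi@example.org',
                   'mario.rossi@example',
                   'Mario@example.org',
                   'mario@mail.example.org') {
    printf "%-26s %s\n", $candidate,
        looks_like_email($candidate) ? 'is good' : 'is not good';
}
```

Only the first address passes. The third fails because the pattern has no uppercase letters (and no /i modifier), and the fourth because [^.]+ allows no dot between the @ and the final part. This is a rough check, not a full validator of email addresses.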

Page 22: Regexp master 2011

Regular Expressions

$line =~ m/^ATOM/
The line starts with ATOM

$line =~ m/^ATOM\s+/
The line starts with ATOM, followed by one or more whitespace characters

$line =~ m/^ATOM\s+[-0-9]+/
The line starts with ATOM, then whitespace, then one or more characters that are digits or -

$line =~ m/^ATOM\s+-?[0-9]+/
The line starts with ATOM, then whitespace, then an optional minus, then one or more digits
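
The four patterns get stricter one step at a time, which can be seen by counting how many of them each test line satisfies. A small sketch, with invented test lines and a hypothetical sub matching_patterns; the third pattern is written as [-0-9], because a | inside [...] would be matched as a literal character:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The four patterns, from the loosest to the strictest.
my @patterns = (
    qr/^ATOM/,
    qr/^ATOM\s+/,
    qr/^ATOM\s+[-0-9]+/,
    qr/^ATOM\s+-?[0-9]+/,
);

# Hypothetical helper: how many of the patterns match this line?
sub matching_patterns {
    my ($line) = @_;
    return scalar grep { $line =~ $_ } @patterns;
}

# Invented test lines.
for my $line ('ATOM      4  O   ASP L   1',
              'ATOMIC number',
              'ATOM  -- x',
              'HETATM    1') {
    printf "%-28s matches %d of 4\n", $line, matching_patterns($line);
}
# The counts are 4, 1, 3 and 0.
```

The third line shows the difference between the last two patterns: [-0-9]+ accepts a run of minus signs, while -?[0-9]+ insists on at least one digit after at most one minus.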

Page 23: Regexp master 2011

Regular Expressions

Page 25: Regexp master 2011

PDB Header

We want to find the %id for the L and H chains

$pidL = $1 if ($line =~ m/REMARK SUMMARY-ID_GLOB_L:([.0-9]+)/);
$pidH = $1 if ($line =~ m/REMARK SUMMARY-ID_GLOB_H:([.0-9]+)/);

ONELINER!!

cat IG.pdb | perl -ne 'print "$1\n" if ($_ =~ m/^REMARK SUMMARY-ID_GLOB_([LH]:[.0-9]+)/);'
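
Inside a script, the two values can also be collected with a single pattern that captures the chain letter and the number separately. The following is a sketch under stated assumptions: the header lines and their values are invented, with a layout guessed from the pattern itself, and the sub name extract_pid is hypothetical. Note the + after the character class, without which only one character would be captured.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented header lines: layout assumed from the pattern, values made up.
my @header = (
    "REMARK SUMMARY-ID_GLOB_L:0.92",
    "REMARK SUMMARY-ID_GLOB_H:0.87",
    "REMARK some other remark",
);

# Hypothetical helper: returns a hash mapping chain (L or H) to its %id.
sub extract_pid {
    my (@lines) = @_;
    my %pid;
    for my $line (@lines) {
        # $1 is the chain letter, $2 the number made of digits and dots.
        if ($line =~ m/^REMARK SUMMARY-ID_GLOB_([LH]):([.0-9]+)/) {
            $pid{$1} = $2;
        }
    }
    return %pid;
}

my %pid = extract_pid(@header);
print "L: $pid{L}\n";
print "H: $pid{H}\n";
```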

Page 26: Regexp master 2011

Zinc Finger

Page 27: Regexp master 2011

Zinc Finger

Zinc fingers are a large superfamily of protein domains that can bind to DNA.

A zinc finger consists of two antiparallel β strands, and an α helix.

The zinc ion is crucial for the stability of this domain type - in the absence of the metal ion the domain unfolds as it is too small to have a hydrophobic core.

The consensus sequence of a single finger is:

C-X{2-4}-C-X{3}-[LIVMFYWC]-X{8}-H-X{3}-H

Page 29: Regexp master 2011

Homework

Find all occurrences of the ZF motif in zincfinger.fasta

Put them in the file ZF_motif.fasta

e.g. weofjpihouwefghoicalcvgnfglapglifhtylhyuiui

match: calcvgnfglapglifhtylh
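
As a starting point for the homework, the consensus can be translated almost symbol by symbol into a Perl pattern: X becomes . (any character), X{2-4} becomes .{2,4}, and the bracketed residue list stays as it is. The sketch below only checks the example string above. The sub name find_zinc_fingers is invented, the /i modifier is an assumption needed because the example is in lowercase, and reading zincfinger.fasta and writing ZF_motif.fasta are left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns every non-overlapping match of the motif.
sub find_zinc_fingers {
    my ($seq) = @_;
    my @hits;
    # C-X{2-4}-C-X{3}-[LIVMFYWC]-X{8}-H-X{3}-H, ignoring case.
    while ($seq =~ m/(C.{2,4}C.{3}[LIVMFYWC].{8}H.{3}H)/gi) {
        push @hits, $1;
    }
    return @hits;
}

my $example = "weofjpihouwefghoicalcvgnfglapglifhtylhyuiui";
print join("\n", find_zinc_fingers($example)), "\n";
# prints calcvgnfglapglifhtylh
```

With the /g modifier each search resumes where the previous match ended, so motifs that overlap an earlier match would not be reported by this loop.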