Top Banner
Physical Mapping II + Perl CIS 667 March 2, 2004
25

Physical Mapping II + Perl CIS 667 March 2, 2004.

Dec 22, 2015

Download

Documents

Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Physical Mapping II + Perl CIS 667 March 2, 2004.

Physical Mapping II + Perl

CIS 667 March 2, 2004

Page 2: Physical Mapping II + Perl CIS 667 March 2, 2004.

Restriction Site Models

• Let each fragment in the Double Digest Problem be represented by its length No measurement errors All fragments present

• Digesting the target DNA by the first enzyme gives the multiset A = {a1, a2, …, an}

• The second enzyme gives B = {b1, b2, …, bn}• Digestion with both gives O = {o1, o2, …, on}

Page 3: Physical Mapping II + Perl CIS 667 March 2, 2004.

Restriction Site Models

• We want to find a permutation A of the elements of A and B of the elements of B Plot lengths A from on a line in the order of A

Plot lengths B from on a line in the order of B

on top of previous plot Several new subintervals may be produced

We need a one-to-one correspondence between each resulting subinterval and each element of O

Page 4: Physical Mapping II + Perl CIS 667 March 2, 2004.

Restriction Site Models

• This problem is NP-complete It is a generalization of the set partition

problem The number of solutions is exponential

• Partial Digest problem has not been proven to be NP-complete The number of solutions is much smaller

than for DDP

Page 5: Physical Mapping II + Perl CIS 667 March 2, 2004.

Interval Graph Models

• We model hybridization mapping using interval graphs Much simpler than the real problem, but

still NP-complete Uses graphs

Vertices represent clones Edges represent overlap information

between clones

Page 6: Physical Mapping II + Perl CIS 667 March 2, 2004.

First Interval Graph Model

• Uses two graphs Gr = (V, Er)

(i, j) Er means we know clones i, j overlap

Gt = (V, Et) Et represents known and unknown overlap

information If we know for sure that two clones don’t

overlap, the corresponding edge is left out of the graph Gt

Page 7: Physical Mapping II + Perl CIS 667 March 2, 2004.

First Interval Graph Model

• Does there exist a graph Gs = (V, Es) such that Er Es Et such that Gs is an interval graph? An interval graph G = (V, E) is an

undirected graph obtained from a collection C of intervals on the real line To each interval in C there corresponds a

vertex in G There is an edge between u and v only if

their intervals have a non-empty intersection

Page 8: Physical Mapping II + Perl CIS 667 March 2, 2004.

First Interval Graph Model

a

bc

d e

a

b

c

d

e

Page 9: Physical Mapping II + Perl CIS 667 March 2, 2004.

Non-Interval Graphs

a

b

c

d

e

a

b

c

d

e

Page 10: Physical Mapping II + Perl CIS 667 March 2, 2004.

Second Interval Graph Model

• Don’t assume that known overlap information is reliable Construct a graph G = (V, E) using that

information Does there exist a graph G’ = (V, E’)

such that E’ E, G’ is an interval graph and |E’| is maximum? We have discarded some false positives The solution is the interpretation that

contains the minimum number of false positives

Page 11: Physical Mapping II + Perl CIS 667 March 2, 2004.

Third Interval Graph Model

• Use overlap information along with information about each clone Different clones come from different

copies of the same molecule Label each clone with the identification

of the molecule copy it came from Assume we had k copies of the target

DNA and different restriction enzymes were used to break up each copy

Page 12: Physical Mapping II + Perl CIS 667 March 2, 2004.

Third Interval Graph Model

• Build a graph G = (V, E) with known overlap information between clones Use k colors to color the vertices No edges between vertices of the same color

since they come from the same clone and hence cannot overlap We say that such a graph has a valid coloring Does there exist graph G’ = (V’, E) such that , G’ is an

interval graph, and the coloring of G is valid for G’? I.e., Can we add edges to G transforming it into an

interval graph without violating the coloring?

Page 13: Physical Mapping II + Perl CIS 667 March 2, 2004.

Consecutive Ones Property

• We can apply the previous models in any situation where we can obtain some type of fingerprint for each fragment Now we use as a clone fingerprint the set of

probes that hybridize to it Assumptions

Reverse complement of each probe’s sequence occurs only once in the target DNA (“probes are unique”

There are no errors All “clones X probes” hybridization experiments have

been done

Page 14: Physical Mapping II + Perl CIS 667 March 2, 2004.

Consecutive Ones Property

• If we have n clones and m probes we will build an n m binary matrix M, where each entry Mij tells us whether probe j hybridized to clone i or not Then obtaining a physical map from the matrix

becomes the problem of finding a permutation of the columns (probes) such that all 1s in each row (clone) are consecutive Such a matrix is said to have the consecutive 1s

property for rows (C1P)

Page 15: Physical Mapping II + Perl CIS 667 March 2, 2004.

Consecutive Ones Property

• There exist polynomial algorithms for the C1P property Works only for data with no errors Realistic algorithms should try to find matrixes

which approximate the C1P property, while minimizing the number of errors which must have been present to lead to such a solution Allow 2 or 3 runs of 1s in a row Minimize the number of runs of 1s in the matrix

• Problem is now NP-hard

Page 16: Physical Mapping II + Perl CIS 667 March 2, 2004.

Now we will look at some Perl in

preparation for assignment 1

Page 17: Physical Mapping II + Perl CIS 667 March 2, 2004.

Perl substitution operator

• Example of Perl substitution operator

$RNA =~ s/T/U/g;

variable binding operator

substitute operator

PATTERN regular expressionTo be replaced by REPLACEMENT

delimiter

REPLACEMENTtext to replace PATTERN

Pattern modifier: g meansglobally, throughout thestring. Others:i case insensitivem multilines single line

Page 18: Physical Mapping II + Perl CIS 667 March 2, 2004.

Example 1

• Let’s use the substitution operator to calculate the reverse complement of a strand of DNA

Page 19: Physical Mapping II + Perl CIS 667 March 2, 2004.

Example 2

• One common task in bioinformatics is to look for motifs, short segments of DNA or protein of interest For example, regulatory elements of DNA

• Let’s see a program to Read in protein sequence data from a file Put all the sequence data into one string for

easy searching Look for motifs the user types in at the

keyboard

Page 20: Physical Mapping II + Perl CIS 667 March 2, 2004.

Turning arrays into Scalars

• We often find sequence data broken into short segments of 80 or so characters This is inconvenient for the Perl program

Have to deal with motifs on more than one line

Collapse an array into a scalar with join $protein = join( ‘’, @protein)

Page 21: Physical Mapping II + Perl CIS 667 March 2, 2004.

Regular expressions

• Regular expressions are ways of matching one or more strings using special wildcard-like operators $protein =~ s/\s//g

\s matches whitespace Can also be written [ \t\n\f\r]

if ($motif =~ /^\s*$/ ) { ^ - beginning of line; $ - end of line * repeated zero or more times

Page 22: Physical Mapping II + Perl CIS 667 March 2, 2004.

Hashes

• There are three main data types in Perl: scalar variables, arrays and hashes (also called associative arrays) A hash provides a fast lookup of the

value associated with a key Initialized like this:%classification = (

‘dog’ => ‘mammal’,‘robin’ => ‘bird’‘asp’ => ‘reptile’

);

Page 23: Physical Mapping II + Perl CIS 667 March 2, 2004.

Example 3

• Let’s look at the use of a hash by a subroutine to translate a codon to an amino acid using hash lookup codon2aa

Page 24: Physical Mapping II + Perl CIS 667 March 2, 2004.

Example 3

• The arguments to the subroutine are in the @_ array

• Declare a local variable as a my variable

• my($dna) = @_;

Page 25: Physical Mapping II + Perl CIS 667 March 2, 2004.

Example 4

• We can use that subroutine to translate DNA into protein

• Note the use of a module (library)• Note the use of .= to concatenate