Restriction Mapping
An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info
Molecular Scissors (restriction enzymes)
Molecular Cell Biology, 4th edition
HindII (first restriction enzyme): discovered accidentally in 1970 during a study of how the bacterium Haemophilus influenzae takes up DNA from a virus. It recognizes and cuts DNA at the sequence GTYRAC (Y = C or T, R = A or G), for example GTTAAC.
Discovering Restriction Enzymes
Werner Arber, Daniel Nathans, Hamilton Smith
• Werner Arber: discovered restriction enzymes
• Daniel Nathans: pioneered the application of restriction enzymes to the construction of genetic maps
• Hamilton Smith: showed that a restriction enzyme cuts DNA in the middle of a specific sequence
"My father has discovered a servant who serves as a pair of scissors. If a foreign king invades a bacterium, this servant can cut him in small fragments, but he does not do any harm to his own king. Clever people use the servant with the scissors to find out the secrets of the kings. For this reason my father received the Nobel Prize for the discovery of the servant with the scissors".
Daniel Nathans’ daughter (from Nobel lecture)
Recognition Sites of Restriction Enzymes
Molecular Cell Biology, 4th edition
Uses of Restriction Enzymes
• Recombinant DNA technology
• Cloning
• cDNA/genomic library construction
• DNA mapping
Restriction Maps
• A restriction map shows the positions of restriction sites in a DNA sequence
• If the DNA sequence is known, then constructing a restriction map is a trivial exercise
• In the early days of molecular biology, DNA sequences were often unknown
• Biologists had to solve the problem of constructing restriction maps without knowing DNA sequences
Full Restriction Digest
• Cutting DNA at each restriction site creates multiple restriction fragments.
• Is it possible to reconstruct the order of the fragments from the sizes of the fragments {3,5,5,9}?
Full Restriction Digest: Multiple Solutions
• The fragment sizes alone are consistent with more than one ordering of the restriction fragments.
[Figure: two alternative orderings of the same restriction fragments]
Measuring Length of Restriction Fragments
• Restriction enzymes break DNA into restriction fragments.
• Gel electrophoresis is a process for separating DNA by size and measuring the sizes of restriction fragments.
• It can separate DNA fragments that differ in length by only 1 nucleotide, for fragments up to 500 nucleotides long.
Gel Electrophoresis
• DNA fragments are injected into a gel positioned in an electric field
• DNA is negatively charged near neutral pH
  The sugar-phosphate (deoxyribose phosphate) backbone of each nucleotide is acidic; DNA has an overall negative charge
• DNA molecules move towards the positive electrode
Gel Electrophoresis (cont’d)
• DNA fragments of different lengths are separated according to size
  Smaller molecules move through the gel matrix more readily than larger molecules
• The gel matrix restricts random diffusion, so molecules of different lengths separate into different bands
Gel Electrophoresis: Example
[Figure: gel with an arrow marking the direction of DNA movement; smaller fragments travel farther. Source: Molecular Cell Biology, 4th edition]
• Distance traveled is (roughly) inversely proportional to the logarithm of molecule size
• Different sized molecules form distinct bands
Detecting DNA: Autoradiography
One way to visualize separated DNA bands on a gel is autoradiography:
• The DNA is radioactively labeled
• The gel is laid against a sheet of photographic film in the dark, exposing the film at the positions where the DNA is present.
Detecting DNA: Fluorescence
Another way to visualize DNA bands in a gel is fluorescence:
• The gel is incubated with a solution containing the fluorescent dye ethidium
• Ethidium binds to the DNA
• The DNA lights up when the gel is exposed to ultraviolet light.
Partial Restriction Digest
• The sample of DNA is exposed to the restriction enzyme for only a limited amount of time to prevent it from being cut at all restriction sites
• This experiment generates the set of all possible restriction fragments between every two (not necessarily consecutive) cuts
• This set of fragment sizes is used to determine the positions of the restriction sites in the DNA sequence
Partial Digest Example
A partial digest results in the following 10 restriction fragment lengths:
L = {3, 5, 5, 8, 9, 14, 14, 17, 19, 22}
These lengths are the pairwise distances between the restriction site positions:
X = {0, 5, 14, 19, 22}
Partial Digest Problem:
Goal: Given all pairwise distances between points on a line, reconstruct the positions of those points
Input: The multiset of pairwise distances L, containing C(n,2) integers
Output: A set X, of n integers, such that ΔX = L
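A minimal sketch of the ΔX operation, assuming Python and representing a multiset as a sorted list. The helper name `delta` is illustrative and does not come from the slides:

```python
from itertools import combinations

def delta(X):
    """Return the multiset of pairwise distances of X as a sorted list.

    `delta` is an illustrative name, not taken from the slides.
    """
    return sorted(abs(a - b) for a, b in combinations(X, 2))

# Restriction site positions from the partial digest example.
X = [0, 5, 14, 19, 22]
print(delta(X))  # the C(5,2) = 10 pairwise distances of X
```

Sorting makes two multisets comparable with `==`, which is all the Partial Digest Problem needs to test whether ΔX = L.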
Note:
It is not always possible to uniquely reconstruct a set X based
only on ΔX.
For example, the sets X = {0, 2, 5} and (X + 10) = {10, 12, 15}
both produce ΔX={2, 3, 5} as their partial digest set.
The sets {0,1,2,5,7,9,12} and {0,1,5,7,8,10,12} present a less
trivial example of non-uniqueness. They both digest into:
{1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 6, 7, 7, 7, 8, 9, 10, 11, 12}
Homometric Sets
Two sets A and B are homometric if ΔA = ΔB
A = {0,1,2,5,7,9,12}   B = {0,1,5,7,8,10,12}

Pairwise distances for A (column point minus row point):
       0   1   2   5   7   9  12
  0        1   2   5   7   9  12
  1            1   4   6   8  11
  2                3   5   7  10
  5                    2   4   7
  7                        2   5
  9                            3
 12

Pairwise distances for B (column point minus row point):
       0   1   5   7   8  10  12
  0        1   5   7   8  10  12
  1            4   6   7   9  11
  5                2   3   5   7
  7                    1   3   5
  8                        2   4
 10                            2
 12
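The homometric property can be checked mechanically. The following is a sketch under the assumption that a multiset is modelled with Python's `collections.Counter`; the function names are hypothetical:

```python
from collections import Counter
from itertools import combinations

def delta_multiset(X):
    """Multiset of pairwise distances of X, as a Counter (value -> multiplicity)."""
    return Counter(abs(a - b) for a, b in combinations(X, 2))

def are_homometric(A, B):
    """True if A and B have the same multiset of pairwise distances."""
    return delta_multiset(A) == delta_multiset(B)

A = [0, 1, 2, 5, 7, 9, 12]
B = [0, 1, 5, 7, 8, 10, 12]
print(are_homometric(A, B))                    # True
print(are_homometric([0, 2, 5], [10, 12, 15])) # True: a shifted copy is always homometric
```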
Partial Digest: Brute Force (exhaustive search)
1. Find the restriction fragment of maximum length M. M is the length of the DNA sequence.
2. For every possible set X = {0, x_2, …, x_{n-1}, M}, compute the corresponding ΔX (i.e., pairwise distances).
3. If ΔX is equal to the experimental partial digest L, then X is the correct restriction map.
Partial Digest: Brute Force
To do this, we need to know n. Note that C(n,2) = n!/[(n-2)!2!] = n(n-1)/2.
But |L| = C(n,2) = n(n-1)/2, so n^2 – n – 2|L| = 0, whose positive root is n = (1 + √(1 + 8|L|))/2.
For L = {3, 5, 5, 8, 9, 14, 14, 17, 19, 22} (i.e., our previous example), |L| = 10 and n = 5. (Recall that X = {0, 5, 14, 19, 22} in that example.)
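Recovering n from |L| amounts to solving n(n-1)/2 = |L| for a positive integer n. A small sketch, assuming Python 3.8+ for `math.isqrt`; the function name is hypothetical:

```python
from math import isqrt

def points_from_distances(num_distances):
    """Solve n(n-1)/2 = num_distances for a positive integer n.

    Raises ValueError if no integer n exists, i.e. the input cannot be
    the size of a complete multiset of pairwise distances.
    """
    disc = 1 + 8 * num_distances   # discriminant of n^2 - n - 2|L| = 0
    root = isqrt(disc)
    if root * root != disc:
        raise ValueError(f"{num_distances} is not of the form n(n-1)/2")
    return (1 + root) // 2

L = [3, 5, 5, 8, 9, 14, 14, 17, 19, 22]
print(points_from_distances(len(L)))  # 5
```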
BruteForcePDP(L, n):
  M ← maximum element in L
  for every set of n – 2 integers 0 < x_2 < … < x_{n-1} < M
    X ← {0, x_2, …, x_{n-1}, M}
    form ΔX from X
    if ΔX = L
      return X
  output “no solution”
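One possible rendering of BruteForcePDP, as a sketch in Python. It assumes multisets are compared as sorted lists, and it returns `None` where the pseudocode outputs “no solution”; the function names are illustrative:

```python
from itertools import combinations

def delta(X):
    """Sorted list of all pairwise distances of X."""
    return sorted(abs(a - b) for a, b in combinations(X, 2))

def brute_force_pdp(L, n):
    """Try every set of n-2 interior points strictly between 0 and M."""
    target = sorted(L)
    M = target[-1]                       # maximum element in L
    for middle in combinations(range(1, M), n - 2):
        X = [0, *middle, M]
        if delta(X) == target:
            return X                     # first map whose distances match L
    return None                          # "no solution"

L = [3, 5, 5, 8, 9, 14, 14, 17, 19, 22]
X = brute_force_pdp(L, 5)
print(X, delta(X) == sorted(L))
```

The map returned may be the mirror image of the one shown earlier, since a map and its reflection produce the same distances.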
AnotherBruteForcePDP(L, n)
  M ← maximum element in L
  for every set of n – 2 integers 0 < x_2 < … < x_{n-1} < M from L
    X ← {0, x_2, …, x_{n-1}, M}
    form ΔX from X
    if ΔX = L
      return X
  output “no solution”
Example (AnotherBruteForcePDP):
L = {3, 5, 5, 8, 9, 14, 14, 17, 19, 22}
n = 5
Form all possible variations of X = {0, a, b, c, M} until finding one for which ΔX = L (where a, b, and c are values < M from L)
Answer: X = {0, 5, 14, 19, 22}
Efficiency of BruteForcePDP:
1. There are C(M-1, n-2) sets of n-2 integers having values in the range (0, M).
2. Creating X, forming ΔX from X, and comparing ΔX to L each require a number of operations that depends only on n, not on M, and is treated as constant here.
3. So, efficiency is O(C(M-1, n-2)) = O(M^{n-2}).
Efficiency of AnotherBruteForcePDP:
1. There are C(|L|, n-2) sets of n-2 integers drawn from L, all with values in the range [0, M]. Note that |L| = n(n-1)/2.
2. As before, the other steps are each treated as taking a constant number of operations.
3. So, efficiency is O(C(|L|, n-2)) = O(n^{2n-4}), since C(|L|, n-2) ≤ |L|^{n-2} < (n^2)^{n-2}.
Compare AnotherBruteForcePDP with BruteForcePDP
• More efficient, but still slow
• Consider L = {2, 998, 1000} (n = 3, M = 1000): BruteForcePDP tries up to 999 candidate positions, while AnotherBruteForcePDP tries only values taken from L
• Fewer sets are examined, but runtime is still exponential: O(n^{2n-4})
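To make the comparison concrete, the candidate counts from the two efficiency analyses can be evaluated directly. This sketch assumes Python 3.8+ (`math.comb`); the function names are hypothetical and the counts are the upper bounds C(M-1, n-2) and C(|L|, n-2) used above:

```python
from math import comb

def brute_force_candidates(M, n):
    """Number of sets of n-2 integers strictly between 0 and M."""
    return comb(M - 1, n - 2)

def another_brute_force_candidates(L, n):
    """Upper bound C(|L|, n-2) on the sets of n-2 values drawn from L."""
    return comb(len(L), n - 2)

L = [2, 998, 1000]
print(brute_force_candidates(max(L), 3))     # 999
print(another_brute_force_candidates(L, 3))  # 3

L = [3, 5, 5, 8, 9, 14, 14, 17, 19, 22]
print(brute_force_candidates(max(L), 5))     # 1330
print(another_brute_force_candidates(L, 5))  # 120
```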
A Better Algorithm…
PartialDigest(L)
  width ← maximum element in L
  DELETE(width, L)
  X ← {0, width}
  PLACE(L, X)

PLACE(L, X)
  if L is empty
    output X
    return
  y ← maximum element in L
  if Δ(y, X) ⊆ L
    Add y to X and remove lengths Δ(y, X) from L
    PLACE(L, X)
    Remove y from X and add lengths Δ(y, X) to L
  if Δ(width-y, X) ⊆ L
    Add width-y to X and remove lengths Δ(width-y, X) from L
    PLACE(L, X)
    Remove width-y from X and add lengths Δ(width-y, X) to L
  return

Notes:
1. DELETE(y, L) removes the value y from L.
2. Δ(y, X) denotes the multiset of distances between a point y and all points in a set X.
3. After each recursive call in PLACE, X and L are restored to their condition before the call, in case another branch in the search tree must be explored.
4. The algorithm lists all sets X with ΔX = L.
Consider an example where L = {2, 2, 3, 3, 4, 5, 6, 7, 8, 10}…
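The PartialDigest/PLACE pseudocode can be sketched in Python roughly as follows. This is one possible translation, not the authors' implementation: it assumes L is held in a `collections.Counter`, collects solutions in a list instead of printing them, and skips the second branch when width-y equals y so the same map is not reported twice. All function names are illustrative:

```python
from collections import Counter

def distances_to(y, X):
    """Δ(y, X): distances between point y and every point already in X."""
    return [abs(y - x) for x in X]

def is_submultiset(ds, L):
    """True if the multiset ds is contained in the Counter L."""
    return all(L[d] >= c for d, c in Counter(ds).items())

def place(L, X, width, solutions):
    if sum(L.values()) == 0:             # L is empty: X explains every distance
        solutions.append(sorted(X))
        return
    y = max(d for d, c in L.items() if c > 0)
    # Try y measured from the left end, then from the right end.
    for point in dict.fromkeys((y, width - y)):
        ds = distances_to(point, X)
        if is_submultiset(ds, L):
            X.append(point)
            L.subtract(ds)
            place(L, X, width, solutions)
            X.pop()                      # restore X and L before the next branch
            L.update(ds)

def partial_digest(distances):
    L = Counter(distances)
    width = max(L)
    L[width] -= 1                        # DELETE(width, L)
    solutions = []
    place(L, [0, width], width, solutions)
    return solutions

print(partial_digest([2, 2, 3, 3, 4, 5, 6, 7, 8, 10]))
```

On the example multiset this yields a map and its mirror image, which is expected because reflecting X leaves ΔX unchanged.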
A Better Algorithm…
Efficiency:
For the ideal case, only one recursive call is made in PLACE each time PLACE is called. The amount of work done for the call is O(n) the first time, O(n-1) the second time, etc., and this continues for n times, so the total work is n+(n-1)+(n-2)+…+1 = n(n+1)/2, or O(n^2).
For pathological cases where both recursive calls are made in PLACE (i.e., if both alternatives are viable) each time PLACE is called, the complexity is O(2^n), where n is |X|.
Notes for Brute Force Approaches
Compare the two algorithms:

BruteForcePDP
  M ← maximum element in L
  for every set of n – 2 integers 0 < x_2 < … < x_{n-1} < M
    X ← {0, x_2, …, x_{n-1}, M}
    form ΔX from X
    if ΔX = L
      return X
  output “no solution”

AnotherBruteForcePDP
  M ← maximum element in L
  for every set of n – 2 integers 0 < x_2 < … < x_{n-1} < M from L
    X ← {0, x_2, …, x_{n-1}, M}
    form ΔX from X
    if ΔX = L
      return X
  output “no solution”

We would like to use the same algorithm to solve both. To do this, put the values that will be used to create X into an array called workingArray. For BruteForcePDP these are the numbers 1, 2, 3, 4, …, M-1; call this set of values allValues. For AnotherBruteForcePDP these are the numbers from L, except M and any duplicates; call this set of values reducedL. So, to use the same code for both BruteForcePDP and AnotherBruteForcePDP, all we need to do is put allValues or reducedL into workingArray, respectively, then use workingArray:

GenericBruteForcePDP
  if algorithm = BruteForcePDP
    workingArray ← allValues
  else
    workingArray ← reducedL
  M ← maximum element in L
  for every set of n – 2 integers in workingArray
    X ← {0, x_2, …, x_{n-1}, M}
    form ΔX from X
    if ΔX = L
      return X
  output “no solution”
Notes for Brute Force Approaches (continued)
Our next problem is to generate every possible set of n-2 integers from values in workingArray. One way is to envision this as a tree search problem where the leaf nodes represent the possible arrangements of the values in workingArray. For example, consider L = {2, 2, 5, 7, 9, 10} and n = 4. In this case, workingArray contains the values {2, 5, 7, 9}.
A simple (but naïve) tree enumerates ordered selections. Permutations: P(x,y) = x!/(x-y)!, so P(4,2) = 4!/(4-2)! = 12.
A tree that eliminates redundancy enumerates unordered selections. Combinations: C(x,y) = x!/[(x-y)!y!], so C(4,2) = 4!/[(4-2)!2!] = 6.
[Figures: the naïve search tree and the tree with redundancy eliminated]
Of course, we really just want the leaf nodes. To produce them, we can simply perform a depth-first search, adding to the set of values at each of the n-2 positions (from left to right) as we go deeper into the tree until all positions have been filled. Our choice of value at any time will be made from the unused values in workingArray. When we use a value, we must remove it from workingArray so that it cannot be used at the next level. However, for this to work, we must restore workingArray to its previous state when we backtrack to a node. The easiest way to do this is via recursion, in which case we only need to make a copy of workingArray before the next recursive call, remove the appropriate value from the copy, and then pass the copy. That way, when returning from the recursive call, workingArray will already be as it was before the recursion.
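The gap between ordered and unordered selections can be seen by enumerating both for the workingArray of the example. This sketch uses Python's standard `itertools`; the variable names are illustrative:

```python
from itertools import combinations, permutations

working_array = [2, 5, 7, 9]   # values from the example above
k = 2                          # n - 2 positions to fill when n = 4

ordered = list(permutations(working_array, k))    # ordered selections
unordered = list(combinations(working_array, k))  # unordered selections

print(len(ordered))    # 12 = P(4,2)
print(len(unordered))  # 6  = C(4,2)
print(unordered)
```

Because X is a set, (2, 5) and (5, 2) describe the same candidate map, so only the unordered selections need to be tested.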
Notes for Brute Force Approaches (continued)
Let setOfIntegers be the collection of n-2 integers that we must generate (i.e., a candidate map). Recall that this will begin with no values. Here is pseudocode for a depth-first traversal of the search tree, where X denotes setOfIntegers together with 0 and M:

depthFirst (setOfIntegers, workingArray)
  if setOfIntegers is complete (i.e., has no unfilled positions)
    if ΔX = L
      show setOfIntegers
    return
  for each position in workingArray
    V ← value at current position in workingArray (i.e., next unused value)
    workingArrayCopy ← workingArray
    remove V from workingArrayCopy
    setOfIntegersCopy ← setOfIntegers
    next available position in setOfIntegersCopy ← V
    depthFirst (setOfIntegersCopy, workingArrayCopy)
  return

Note: Before each recursive call we are reducing the contents of workingArray and increasing the number of values in setOfIntegers (i.e., the candidate map). We make copies of these arrays and pass them so that upon return both workingArray and setOfIntegers are as they were before being modified for the recursive call.
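A sketch of depthFirst in Python, under a few assumptions that are not in the slides: the extra parameters `k` (the number of positions, n-2), `L`, `M`, and `found` are added so the function is self-contained, and matching maps are collected in a set rather than shown, because the naïve tree reaches each map once per ordering of its values:

```python
from itertools import combinations

def delta(X):
    """Sorted list of all pairwise distances of X."""
    return sorted(abs(a - b) for a, b in combinations(X, 2))

def depth_first(set_of_integers, working_array, k, L, M, found):
    """Fill k positions from working_array; record maps X with ΔX = L."""
    if len(set_of_integers) == k:                   # no unfilled positions
        X = sorted([0, *set_of_integers, M])
        if delta(X) == sorted(L):
            found.add(tuple(X))
        return
    for i, v in enumerate(working_array):
        # Copies leave the caller's lists untouched, so backtracking is free.
        working_copy = working_array[:i] + working_array[i + 1:]
        depth_first(set_of_integers + [v], working_copy, k, L, M, found)

L = [3, 5, 5, 8, 9, 14, 14, 17, 19, 22]
M = max(L)
reduced_L = sorted(set(L) - {M})                    # reducedL from the notes
found = set()
depth_first([], reduced_L, 3, L, M, found)
print(sorted(found))
```

Passing `working_array[i + 1:]` instead of `working_copy` would restrict the search to the tree without redundancy, visiting each unordered selection once.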