Dec 19, 2015

Page 1: Where are we going? Remember the extended analogy? – Given binary code, what does the program do? – How does it work? At the end of the semester, I am.

Where are we going?

Remember the extended analogy?
– Given binary code, what does the program do?
– How does it work?

At the end of the semester, I am going to show you how biologists solved that problem
– Binary code → DNA code
– How the program works → How life works

But we are approaching it from the bottom up

Page 2

Bottom-up Design

Top Down
– See the big picture first
– Break it into parts
– Analyze each part
– Continue breaking down sub-parts into solvable tasks

Bottom Up
– Identify easily solvable tasks
– Use them to solve larger problems
– Use the solutions to larger and larger problems to solve the BIG problem and see the big picture

Page 3

Bottom-up Design

Top Down
– Rethinking the design of existing ideas/inventions
– Managing projects that are underway
– Works really well in the Utopian world

Bottom Up
– Designing totally new ideas
– Putting together projects from scratch
– Works really well in the real world

Page 4

Bottom-up Design

Top Down
– Let’s build an airplane
– Let’s build a steering mechanism
– Let’s build a lift mechanism
– Let’s build a propulsion mechanism

Bottom Up
– This shape produces lift
– A spinning propeller creates propulsion in the air
– Canvas with a wood frame is light enough
– Perhaps we can build a stable, controllable airplane

Page 5

Bottom-up Design

Before we can analyze the big picture, we have to
– Look at some of the initial smaller problems
– See how they were solved
– See how they led to new discoveries

Page 6

Remember

Don’t forget to
– pick a paper and
– email me

See the schedule to see what’s taken
– http://www.cs.siena.edu/~ebreimer/csis-400-f03/schedule.html

Page 7

Recap

3 different types of comparisons

1. Whole genome comparison

2. Gene search

3. Motif discovery (shared pattern discovery)

Page 8

Agenda

Overview of Shared Pattern Discovery

Edit Distance
– How do you compute it
– Why it’s not good enough

Alignment
– Why it’s better
– How to compute it

Page 9

Shared Pattern Discovery

I have 100 rats that all have green eyes
I have 1000 rats that all have blue eyes
What exactly do the 100 rats have in common that gives them green eyes?

Page 10

Shared Pattern Discovery

A technique called multiple alignment can be used to measure the strength of a genomic pattern found in a set of sequences (a group of rats)
– You can identify a subset (rats that have green eyes) and
– You can find a sub-region of DNA (a pattern) that the subset shares
– But that isn’t shared by any other subset (rats that have blue eyes)

Initially, this is how genes were pinpointed

Page 11

Shared Pattern Discovery

To understand multiple alignment, one needs to understand pair-wise alignment
– Multiple alignment emerged from the successful application of pair-wise alignment
– Pair-wise alignment emerged from improvements to traditional string matching algorithms
– All of this emerged from a need to compare genetic sequences

Page 12

Exact string matching

Target: CGTACGAC
Pattern: ACGTACGTACGT
Problem: the target cannot be found in the pattern even though it’s really close
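The failure of exact matching on this slide can be checked with a minimal C sketch. The function name `exact_match_offset` is hypothetical; it simply wraps the standard `strstr` call and keeps the slide's naming, where the target is the short string searched for inside the pattern.

```c
#include <string.h>

/* Returns the 0-based offset of the first exact occurrence of target
   inside pattern, or -1 if target does not occur exactly. */
int exact_match_offset(const char *pattern, const char *target)
{
    const char *hit = strstr(pattern, target);
    return hit ? (int)(hit - pattern) : -1;
}
```

Calling `exact_match_offset("ACGTACGTACGT", "CGTACGAC")` returns -1, even though the target differs from part of the pattern by a single symbol.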

Page 13

Edit Distance

How many edits are needed to exactly match the target with part of the pattern?

Target: CGTACGAC
Pattern: ACGTACGTACGT
Just one
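One way to confirm the "just one" answer is a small variation of the dynamic-programming table in which a match may start and end anywhere in the pattern. This is only a sketch under assumptions: edits are single-symbol insertions and deletions (as the deck defines them later), and the names `best_substring_distance` and `min2` are made up for illustration.

```c
#include <string.h>

static int min2(int a, int b) { return a < b ? a : b; }

/* Minimum number of single-symbol insertions/deletions needed to turn
   target into some substring of pattern. */
int best_substring_distance(const char *target, const char *pattern)
{
    int n = (int)strlen(target);
    int m = (int)strlen(pattern);
    int d[n + 1][m + 1];  /* C99 variable-length array; fine for short inputs */

    for (int y = 0; y <= m; y++) d[0][y] = 0;  /* a match may start anywhere */
    for (int x = 1; x <= n; x++) d[x][0] = x;

    for (int x = 1; x <= n; x++)
        for (int y = 1; y <= m; y++)
            if (target[x - 1] == pattern[y - 1])
                d[x][y] = d[x - 1][y - 1];
            else
                d[x][y] = min2(d[x][y - 1] + 1, d[x - 1][y] + 1);

    int best = d[n][0];
    for (int y = 1; y <= m; y++)
        best = min2(best, d[n][y]);            /* ... and end anywhere */
    return best;
}
```

For the slide's strings the result is 1: inserting a T into CGTACGAC gives CGTACGTAC, which occurs in the pattern.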

Page 14

Edit Distance

How many edits are needed to exactly match the target with the WHOLE pattern?

Target: CGTACGAC
Pattern: ACGTACGTACGT
Four
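A quick way to see why the answer is four: the pattern is four symbols longer than the target, so at least four insertions are needed, and exactly four suffice if the target's symbols already appear in the pattern in order. The sketch below checks that second condition; `is_subsequence` is a hypothetical helper name, and insert/delete-only edits are assumed.

```c
#include <string.h>

/* Returns 1 if every symbol of target appears in pattern in the same
   order (not necessarily contiguously), else 0. */
int is_subsequence(const char *target, const char *pattern)
{
    while (*target && *pattern) {
        if (*target == *pattern)
            target++;      /* matched this target symbol, move on */
        pattern++;
    }
    return *target == '\0';
}
```

Since `is_subsequence("CGTACGAC", "ACGTACGTACGT")` returns 1, the distance is the length difference, 12 - 8 = 4.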

Page 15

Edit Distance – Dynamic Programming

[Figure: dynamic-programming table. Axis letters as extracted: "A C G TC G C AT" (top) and "A C G T G T G C" (side).]

0 1 2 3 4 5 6 7 8
1 2 1 2 3 4 5 6 7
2 3 2 1 2 3 4 5 6
3 2 3 2 1 2 3 4 5
4 3 2 3 2 1 2 3 4
5 4 3 4 3 2 1 2 3
6 5 4 5 4 3 2 3 4
7 6 5 6 5 4 3 2 3

Callouts:
– Optimal edit distance for TG and TCG
– Optimal edit distance for TG and TCGA
– Optimal edit distance for TGA and TCG
– Final answer: optimal edit distance for TGA and TCGA

Page 16

Edit Distance

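static int min(int a, int b) { return a < b ? a : b; }

/* n and m are the lengths of seq1 and seq2 */
int edit_distance(const char *seq1, int n, const char *seq2, int m)
{
    /* matrix[x][y]: distance between the first x symbols of seq1
       and the first y symbols of seq2 */
    int matrix[n+1][m+1];
    int x, y;

    for (x = 0; x <= n; x++) matrix[x][0] = x;    /* x deletions */
    for (y = 1; y <= m; y++) matrix[0][y] = y;    /* y insertions */

    for (x = 1; x <= n; x++)
        for (y = 1; y <= m; y++)
            if (seq1[x-1] == seq2[y-1])           /* strings are 0-indexed */
                matrix[x][y] = matrix[x-1][y-1];
            else                                  /* cheaper of insert or delete */
                matrix[x][y] = min(matrix[x][y-1] + 1,
                                   matrix[x-1][y] + 1);

    return matrix[n][m];
}

[Figure: the dynamic-programming table from Page 15]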

Page 17

Edit Distance

int matrix[n+1][m+1];

for (x = 0; x <= n; x++) matrix[x][0] = x;
for (y = 0; y <= m; y++) matrix[0][y] = y;

for (x = 1; x <= n; x++)
    for (y = 1; y <= m; y++)
        if (seq1[x-1] == seq2[y-1])               /* strings are 0-indexed */
            matrix[x][y] = matrix[x-1][y-1];
        else                                      /* minimum, not maximum */
            matrix[x][y] = min(matrix[x][y-1] + 1,
                               matrix[x-1][y] + 1);

return matrix[n][m];

Callouts:
– How many times is this comparison performed?
– How many times is each of these assignments performed? (three callouts)
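The counting questions on this slide can be answered empirically with an instrumented version of the loop. This is a sketch, not the lecturer's code: the `op_counts` struct and `count_edit_distance_ops` are hypothetical names, the else branch takes the minimum of the two neighbours, and the second initialization loop starts at y = 0 as on this slide.

```c
#include <string.h>

struct op_counts {
    long comparisons;   /* seq1[x-1] == seq2[y-1] tests          */
    long column_inits;  /* matrix[x][0] = x assignments          */
    long row_inits;     /* matrix[0][y] = y assignments          */
    long fills;         /* assignments to interior cells         */
    int  distance;      /* resulting insert/delete edit distance */
};

struct op_counts count_edit_distance_ops(const char *seq1, const char *seq2)
{
    int n = (int)strlen(seq1), m = (int)strlen(seq2);
    int matrix[n + 1][m + 1];  /* C99 variable-length array */
    struct op_counts c = {0, 0, 0, 0, 0};

    for (int x = 0; x <= n; x++) { matrix[x][0] = x; c.column_inits++; }
    for (int y = 0; y <= m; y++) { matrix[0][y] = y; c.row_inits++; }

    for (int x = 1; x <= n; x++)
        for (int y = 1; y <= m; y++) {
            c.comparisons++;
            if (seq1[x - 1] == seq2[y - 1]) {
                matrix[x][y] = matrix[x - 1][y - 1];
            } else {
                int left = matrix[x][y - 1] + 1;
                int up   = matrix[x - 1][y] + 1;
                matrix[x][y] = left < up ? left : up;
            }
            c.fills++;
        }

    c.distance = matrix[n][m];
    return c;
}
```

For sequences of length n = 8 and m = 12 the counters come out as n + 1 = 9, m + 1 = 13, and n·m = 96 comparisons and interior assignments, which is where the O(nm) cost comes from.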

Page 18

Edit Distance – Dynamic Programming

[Figure: the dynamic-programming table from Page 15, with callouts:]

– To derive the value 7, we need to know that we can match two T’s
– To derive the value 6, we need to know that we can match two C’s after matching two T’s
– To derive this value 5, we need to know that we can match two G’s after already matching two C’s and previously matching two T’s

n = 8
In the worst case, this may take n comparisons

Page 19

Edit Distance – Dynamic Programming

[Figure: the dynamic-programming table from Page 15, with callouts:]

– Given our previous matches, there is no way we can match two A’s. Thus, the edit distance is increased.
– Luckily, we can match these two C’s. But now we’ve matched the last symbol.
– We can’t do any more matching (period!)

Page 20

Lesson to learn

There is no way to compute the optimal (minimum) edit distance without considering all possible matching combinations.

The only way to do that is to consider all possible sub-problems.

This is the reason the entire table must be considered.

If you can compute the optimal (minimum) edit distance using fewer than O(nm) computations, then you will be renowned!

Page 21

Why Edit Distance Stinks for Genetic Data

DNA evolves in strange ways

…TAGATCCCAGATCAGTATTCAAGTTATAC…  (this is a gene in the rat genome)
…GATCTCCCAGATAGAAGCAGTATTCAGTCA…  (this is the same gene in the fruit bat)
…CCTATCAGCAGGATCAAGTATGTCATACTAC…  (this is a totally unrelated region of the AIDS virus)

The edit distance between rat and virus is smaller than between rat and fruit bat.

Page 22

Alignment

We need a more robust way to measure similarity

Alignment meets several requirements
1. It rewards matches
2. It penalizes mismatches
3. It allows for different strategies for penalizing gaps
4. It helps visualize similarity.

Page 23

Alignment

Example

1. G A A T T C A G T T A (sequence #1)

2. G G A T C G A (sequence #2)

One possible alignment:

G A A T T C A G T T A

G G A _ T C _ G _ _ A

Callouts: mismatch (position 2), gap (position 4), gap (position 7), gap of size 2 (positions 9 and 10)

Page 24

Alignment

A simple scoring scheme is used where
– $S_{i,j}$ is the score at position $i,j$
– $S_{i,j} = 1$ if the residue at position $i$ of sequence #1 matches the residue at position $j$ of sequence #2 (match score); otherwise
– $S_{i,j} = 0$ (mismatch score)

$w$ is the gap penalty, which we will discuss later

Page 25

Alignment

Three steps in the dynamic programming algorithm for alignment

1. Initialization

2. Matrix fill (scoring)

3. Traceback (alignment)

Page 26

Alignment

Initialization Step

The first step creates a matrix with
– M + 1 columns and
– N + 1 rows
– where M and N correspond to the sizes of the sequences to be aligned.

The first row and first column of the matrix can be initially filled with 0.

Page 27

Alignment

Page 28

Alignment

Matrix Fill Step
– For each position, $M_{i,j}$ is defined to be the maximum score at position $i,j$; i.e.

$$M_{i,j} = \max\left[\; M_{i-1,j-1} + S_{i,j} \ \text{(match/mismatch)},\;\; M_{i,j-1} + w \ \text{(gap in sequence \#1)},\;\; M_{i-1,j} + w \ \text{(gap in sequence \#2)} \;\right]$$
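The initialization and matrix fill steps can be sketched in C as follows. The names `alignment_score` and `max3` are hypothetical; the sketch assumes the simple scheme from the earlier slide (match = 1, mismatch = 0), takes the gap penalty w as a parameter, and fills the first row and column with 0 as the initialization slide describes.

```c
#include <string.h>

static int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Fills the score matrix and returns the bottom-right score.
   i indexes sequence #1 and j indexes sequence #2. */
int alignment_score(const char *seq1, const char *seq2, int w)
{
    int n = (int)strlen(seq1), m = (int)strlen(seq2);
    int M[n + 1][m + 1];  /* C99 variable-length array */

    /* Initialization step: first row and first column start at 0. */
    for (int i = 0; i <= n; i++) M[i][0] = 0;
    for (int j = 0; j <= m; j++) M[0][j] = 0;

    /* Matrix fill step: best of match/mismatch, gap in #1, gap in #2. */
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int s = (seq1[i - 1] == seq2[j - 1]) ? 1 : 0;  /* S(i,j) */
            M[i][j] = max3(M[i - 1][j - 1] + s,
                           M[i][j - 1] + w,
                           M[i - 1][j] + w);
        }

    return M[n][m];
}
```

With w = 0, `alignment_score("GAATTCAGTTA", "GGATCGA", 0)` returns 6, the number of matched residues in the example alignment shown earlier.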

Page 29

Example

The score at position 1,1 can be calculated.

The first residue in both sequences is a G

Thus, $S_{1,1} = 1$

Thus, $M_{1,1} = \max[M_{0,0} + 1,\ M_{1,0} + 0,\ M_{0,1} + 0] = \max[1, 0, 0] = 1$.
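As a further worked step under the same assumptions (gap penalty $w = 0$, zero-filled first row and column), the neighbouring cell at position 1,2 follows the same pattern. The first residue of sequence #1 (G) matches the second residue of sequence #2 (G), so $S_{1,2} = 1$:

```latex
\begin{aligned}
M_{1,2} &= \max\left[\, M_{0,1} + S_{1,2},\; M_{1,1} + w,\; M_{0,2} + w \,\right] \\
        &= \max\left[\, 0 + 1,\; 1 + 0,\; 0 + 0 \,\right] \\
        &= 1
\end{aligned}
```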

Page 30

Example

Page 31

Example

Page 32

Example

Page 33

Example

Page 34

Edit Dist. vs. Alignment Scoring

Note that the metric used in alignment is different from that of edit distance
– Smaller edit distance → more similar
– Higher alignment score → more similar

Also: edit distance refers specifically to edits
– delete or insert a symbol
– discrete value
– not flexible

Page 35

Alignment Scoring

$$M_{i,j} = \max\left[\; M_{i-1,j-1} + S_{i,j} \ \text{(match/mismatch score)},\;\; M_{i,j-1} + w \ \text{(gap in sequence \#1)},\;\; M_{i-1,j} + w \ \text{(gap in sequence \#2)} \;\right]$$

S(i,j)    A     C     G     T
A        1.1   0.0   0.3   0.5
C              1.3   0.1   0.0
G                    1.0   0.0
T                          1.2
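A scoring table like this one can be stored as a small lookup matrix. The sketch below is illustrative only: `base_index` and `pair_score` are hypothetical names, the lower triangle is filled in by symmetry (an assumption, since the slide shows only the upper triangle), and unknown symbols are given a score of 0.

```c
/* Maps a nucleotide to a row/column index, or -1 for anything else. */
static int base_index(char b)
{
    switch (b) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Scores from the slide's table; the lower triangle mirrors the
   upper one (assumed symmetric). */
static const double S[4][4] = {
    /*        A    C    G    T  */
    /* A */ {1.1, 0.0, 0.3, 0.5},
    /* C */ {0.0, 1.3, 0.1, 0.0},
    /* G */ {0.3, 0.1, 1.0, 0.0},
    /* T */ {0.5, 0.0, 0.0, 1.2},
};

double pair_score(char a, char b)
{
    int i = base_index(a), j = base_index(b);
    if (i < 0 || j < 0)
        return 0.0;  /* unknown symbols score 0 in this sketch (assumption) */
    return S[i][j];
}
```

In the matrix fill step, `pair_score(seq1[i-1], seq2[j-1])` would take the place of the 1-or-0 match score, so an A aligned with a G earns 0.3 while an A aligned with a C earns nothing.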

Page 36

Alignment Scoring

$$M_{i,j} = \max\left[\; M_{i-1,j-1} + S_{i,j} \ \text{(match/mismatch score)},\;\; M_{i,j-1} + w \ \text{(gap in sequence \#1)},\;\; M_{i-1,j} + w \ \text{(gap in sequence \#2)} \;\right], \qquad w = -1$$

One possible alignment:

G A A T T C A G T T A

G G A _ T C _ G _ _ A

Callouts: gap −1, gap −1, gap (size 2) −2
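The total score of this alignment can be tallied with a short sketch. It assumes the simple scheme from earlier (match = 1, mismatch = 0) with every gap position costing w; the function name `score_alignment` and the use of `INT_MIN` as an error value are choices made for this illustration.

```c
#include <limits.h>
#include <string.h>

/* Scores two aligned rows of equal length; '_' marks a gap.
   match = +1, mismatch = 0, each gap position = w.
   Returns INT_MIN if the rows have different lengths. */
int score_alignment(const char *row1, const char *row2, int w)
{
    size_t len = strlen(row1);
    int score = 0;

    if (strlen(row2) != len)
        return INT_MIN;

    for (size_t k = 0; k < len; k++) {
        if (row1[k] == '_' || row2[k] == '_')
            score += w;          /* gap position */
        else if (row1[k] == row2[k])
            score += 1;          /* match */
        /* a mismatch adds 0 */
    }
    return score;
}
```

For the alignment above with w = -1 there are six matches, one mismatch and four gap positions, so `score_alignment("GAATTCAGTTA", "GGA_TC_G__A", -1)` returns 6 - 4 = 2.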

Page 37

Alignment Scoring

Summary:
– We have a way of rewarding different types of matches and mismatches
– We have a separate way of penalizing gaps
– We could choose not to penalize gaps if we had a clue that they weren’t harmful

Page 38

Recall

DNA evolves in strange ways

…TAGATCCCAGATCAGTATTCAAGTTATAC…  (this is a gene in the rat genome)
…GATCTCCCAGATAGAAGCAGTATTCAGTCA…  (this is the same gene in the fruit bat)
…CCTATCAGCAGGATCAAGTATGTCATACTAC…  (this is a totally unrelated region of the AIDS virus)

The edit distance between rat and virus is smaller than between rat and fruit bat.

Page 39

Tracing back the alignment

(Seq #1) A

|    

(Seq #2) A

Page 40

Tracing back the alignment

(Seq #1) A

|    

(Seq #2) A

Page 41

Tracing back the alignment

(Seq #1) TA

|    

(Seq #2)  A

Page 42

Tracing back the alignment

(Seq #1) TTA

|    

(Seq #2)  A

Page 43

Tracing back the alignment

(Seq #1) GAATTCAGTTA
         | | || |  |
(Seq #2) GGA_TC_G__A
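The traceback shown on these slides can be sketched in C as below. This is an illustration under assumptions, not the lecturer's implementation: it uses match = 1, mismatch = 0 and gap penalty w = 0, the names `align` and `MAXLEN` are hypothetical, and because several alignments tie for the best score, the one it recovers may differ from the one drawn on the slide.

```c
#include <string.h>

#define MAXLEN 64  /* sketch-sized limit on sequence length (assumption) */

/* Writes one optimal alignment into out1/out2 (each needs room for
   2*MAXLEN + 1 chars), using '_' for gaps. Returns the alignment
   score, or -1 if an input is longer than MAXLEN. */
int align(const char *s1, const char *s2, char *out1, char *out2)
{
    static int M[MAXLEN + 1][MAXLEN + 1];
    char r1[2 * MAXLEN + 1], r2[2 * MAXLEN + 1];
    int n = (int)strlen(s1), m = (int)strlen(s2);
    int i, j, k = 0;

    if (n > MAXLEN || m > MAXLEN)
        return -1;

    /* Initialization and matrix fill (w = 0, so gaps add nothing). */
    for (i = 0; i <= n; i++) M[i][0] = 0;
    for (j = 0; j <= m; j++) M[0][j] = 0;
    for (i = 1; i <= n; i++)
        for (j = 1; j <= m; j++) {
            int best = M[i - 1][j - 1] + (s1[i - 1] == s2[j - 1]);
            if (M[i][j - 1] > best) best = M[i][j - 1];
            if (M[i - 1][j] > best) best = M[i - 1][j];
            M[i][j] = best;
        }

    /* Traceback: start at the bottom-right cell and step to whichever
       neighbour the score came from, building the rows in reverse. */
    i = n; j = m;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 &&
            M[i][j] == M[i - 1][j - 1] + (s1[i - 1] == s2[j - 1])) {
            r1[k] = s1[--i]; r2[k] = s2[--j];   /* residues paired     */
        } else if (i > 0 && M[i][j] == M[i - 1][j]) {
            r1[k] = s1[--i]; r2[k] = '_';       /* gap in sequence #2  */
        } else {
            r1[k] = '_';     r2[k] = s2[--j];   /* gap in sequence #1  */
        }
        k++;
    }

    /* Reverse the rows into the output buffers. */
    for (int t = 0; t < k; t++) {
        out1[t] = r1[k - 1 - t];
        out2[t] = r2[k - 1 - t];
    }
    out1[k] = out2[k] = '\0';
    return M[n][m];
}
```

Called on GAATTCAGTTA and GGATCGA, the function returns a score of 6 and fills the two buffers with equal-length rows that can be printed one above the other, as on the slide.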