Top Banner
Bioinformatics GBIO0002 -1 Biological Sequences _________________________________________________________________________________________________ ___________________ Kirill Bessonov Biological Sequences Analysis GBIO00002-1 Presented by Kirill Bessonov Oct 27, 2015
74

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Jan 17, 2016

Download

Documents

Bethanie Carr
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 1

Biological Sequences Analysis

GBIO00002-1

Presented by

Kirill Bessonov

Oct 27, 2015

Page 2: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 2

Talk Structure

1. Introduction 2. Global Alignment: Needleman-Wunsch3. Local Alignment: Smith-Waterman4. Practical

– Retrieval of sequences using R– Alignment of sequences using R– Finding ORF with R– Comparing two species based on genes

Page 3: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 3

Gene structure and expression• Gene regions

– 5’ untranslated region (5’UTR)• Directly upstream of start codon (AUG)• Regulates translation

– 3’ untranslated region (3’UTR)• Right after the stop codon• Influences translation efficiency [protein]

– Open Reading Frame (ORF) • Protein coding region (with introns / exons)

mRNA

Page 4: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 4

Gene expression sequences

• Central dogma– DNA RNA protein

• Each codon codes for one amino acid (a.a.)– residue = amino acid

• mRNA polymerase II– Reads from 5’ to 3’ direction– 3 nucleotides code for 1 a.a.

• In the DNA context– Start codon: ATG– Stop codon: TAA, TGA, TAG

Page 5: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 5

mRNA Protein alphabet

• Codon table: 3 nucleotides code for 1 a.a.

Page 6: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 6

Biological sequence

• Single continuous molecule– DNA ACGGCT– RNA ACGGCU– Protein TA or ThrAla

Page 7: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 7

Biological problem

• Given the DNA sequence AATCGGATGCGCGTAGGATCGGTAGGGTAGGCTTTAAGATCATGCTATTTTCGAGATTCGATTCTAGCTA• Answer

– Is it likely to be a gene?– What is its possible expression level?– What is the possible structure of the protein product?– Can we get the protein (i.e. express protein)?– Can we figure out the key residues of the protein?– Can we determine the organism from which this sequence

came?

Page 8: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 8

Biological alphabet

– In the case of DNA• A, C, T, G

– In the case of RNA• A,C, U, G

– In the case of protein• 20 amino acids• Complete list is found here

Page 9: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 9

Words

• Short strings of letters from an alphabet• A word of length k is called a k-word or k-tuple• Examples:

– 1-tuple: individual nucleotide– 2-tuple: dinucleotide– 3-tuple: codon

Page 10: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 10

2-words: dinucleotides

• Composed of 2 nucleotides– Given DNA alphabet {A,T,C,G}

• How many possible dinucleoties?• Total of 16: AA, AC,AG,AT … TG,TT

• CpG islands are regions of DNA– Frequent repetition of CpG dinucleotides– Rich in ‘G’ and ‘C’– CpG islands appear in some 70% of promoters of

human genes

Page 11: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 11

CpG islands

• CpG sites could be methylated• Location and methylation state

– Impact gene expression• If in promoter region and methylated

– May inhibit expression

– Present at the gene ‘start’ region• How many CpG sites in this DNA sequence?

Page 12: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 12

3-words: codons

• Important in case of DNA sequences• Linked to expression

– DNA RNA protein

Page 13: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 13

Patterns• Recognizing motifs, sites, signals, domains

– functionally important regions– a conserved motif - consensus sequence– Often words (in bold) are used interchangeably

• Gene starts with with an “ATG” codon– Identify # of potential gene start sites

• 4 sites

AATCGGATGCGCGTAGGATCGGTAGGGTAGGCTTTAAGATCATGCTATTTTCGAGATTCGATTCTAGCTAGGTTTAGCTTAGCTTAGTGCCAGAAATCGGATGCGCGTAGGATCGGTAGGGTAGGCTTTAAGATCATGCTATTTTCGAGATTCGATTCTAGCTAGGTTTTTAGTGCCAGAAATCGTTAGTGCCAGAAATCGATT

Page 14: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 14

Bases distribution• The distribution of bases within a DNA

– is not ordinarily uniform• i.e. not P(A) =P(C) = P(G) = P(T) = 0.25

• There may be an excess of G over C on the leading strands

– This can be described by the “GC skew”,

Page 15: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 15

GC skew in a sequence

• GC skew is defined as • (#G - #C) / (#G + #C)• Calculated in windows of length l• Theoretical minimal is 0

– #G and #C are equal

Page 16: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 16

Human genome characteristics

Fact: the amount of adenine is always the same as the amount of thymine, and the amount of guanine equals the amount of cytosine (A:T=1 and G:C = 1) - in the whole genome

Page 17: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 17

Alignments

Page 18: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 18

Biological context

• Proteins may be multifunctional– Sequence determines protein function – Assumptions

• Pairs of proteins with similar sequence also share similar biological function(s)

Page 19: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 19

Comparing sequences• are important for a number of reasons.

– used to establish evolutionary relationships among organisms

– identifi cation of functionally conserved sequences (e.g., DNA sequences controlling gene expression)• ‘TATAAT’ box transcription initiation

– develop models for human diseases • identify corresponding genes in model organisms (e.g.

yeast, mouse), which can be geneti cally manipulated– E.g. gene knock outs / silencing

Page 20: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 20

Deletions/insertions/substitution of nucleotides

• Mutations are of 3 main types– Deletions – Insertions– Substitution– Cause a shift in the mRNA reading frame

Page 21: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 21

Comparing two sequences• There are two ways of pairwise comparison

– Global using Needleman-Wunsch algorithm (NW)– Local using Smith-Waterman algorithm (SW)

• Global alignment (NW)• Alignment of the “whole” sequence

• Local alignment (SW)• tries to align portions (e.g. motifs) • more flexible

– Considers sequences “parts”

• works well on – highly divergent sequences

entire sequence

perfect match

unaligned sequence

aligned portion

Page 22: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 22

Global alignment

Page 23: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 23

Global alignment (NW)

• Sequences are aligned end-to-end along their entire length • Many possible alignments are produced

– The alignment with the highest score is chosen• Naïve algorithm is very inefficient (Oexp)

– To align sequence of length 15, need to consider• (insertion, deletion, gap)15 = 315 = 1,4*107

– Impractical for sequences of length >20 nt• Used to analyze homology/similarity of

– genes and proteins– between species

Page 24: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 24

Methodology of global alignment (1 of 4)

• Define scoring scheme for each event– mismatch between ai and bj

• if

– gap (insertion or deletion)

– match between ai and bj • if

• Provide no restrictions on minimal score• Start completing the alignment MxN matrix

s1: ..AATA.. s2: ..AACA..

s1: ..AAT-A.. s2: ..AACA..

s1: ..AATA.. s2: ..AATA..

Page 25: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 25

Methodology of global alignment (2 of 4)• The matrix should have extra column and row

– M+1 columns , where M is the length sequence M– N+1 rows, where N is the length of sequence N

1. Initialize the matrix– introduce gap penalty at every initial position

along rows and columns– Scores at each cell are cumulative

W H A T 0 -2 -4 -6 -8

W -2 H -4 Y -6

-2 -2 -2 -2-2

-2

-2

Page 26: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 26

Methodology of global alignment (3 of 4)

2. Alignment possibilities Gap (horiz/vert) Match (W-W diag.)

3. Select the maximum score– Best alignment

W H A T 0 -2 -4 -6 -8

W -2 2 0 -2 -4 H -4 0 4 2 0 Y -6 -2 2 3 1

W H 0 -2 -4

W -2 -4

W H 0 -2 -4

W -2 +2-2

-2+2-OR-

Page 27: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 27

Methodology of global alignment (4 of 4)4. Select the most very bottom right cell 5. Consider different path(s) going to very top left cell

– How the next cell value was generated? From where?

WHAT WHATWHY- WH-Y

Overall score = 1 Overall score = 1

6. Select the best alignment(s)

W H A T 0 -2 -4 -6 -8

W -2 2 0 -2 -4 H -4 0 4 2 0 Y -6 -2 2 3 1

W H A T 0 -2 -4 -6 -8

W -2 2 0 -2 -4 H -4 0 4 2 0 Y -6 -2 2 3 1

Page 28: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 28

Local alignment

Page 29: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 29

Local alignment (SW)

• Sequences are aligned to find regions where the best alignment occurs (i.e. highest score)

• Assumes a local context (aligning parts of seq.)• Ideal for finding short motifs, DNA binding sites

– helix-loop-helix (bHLH) - motif– TATAAT box (a famous promoter region) – DNA binding site

• Works well on highly divergent sequences

Page 30: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 30

Methodology of local alignment (1 of 4)

• The scoring system is similar with one exception– The minimum possible score in the matrix is zero– There are no negative scores in the matrix

• Let’s define the scoring system as in globalmismatch between seq. ai and bj gap (insertion or deletion)

if

match between ai and bj if

Page 31: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 31

Methodology of local alignment (2 of 4)

• Construct the MxN alignment matrix with M+1 columns and N+1 rows

• Initialize the matrix by introducing gap penalty at 1st row and 1st column

W H A T

0 0 0 0 0

W 0

H 0

Y 0

s(a,b) ≥ 0(min value is zero)

Page 32: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 32

Methodology of local alignment (3 of 4)

• For each subsequent cell consider alignments– Vertical s(I, - )– Horizontal s(-,J)– Diagonal s(I,J)

• For each cell select the highest score– If score is negative assign zero

W H A T 0 0 0 0 0

W 0 2 0 0 0H 0 0 4 2 0Y 0 0 2 3 1

Page 33: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 33

Methodology of local alignment (4 of 4)• Select the initial cell with the highest score(s)• Consider different path(s) leading to score of zero

– Trace-back the cell values – Look how the values were originated (i.e. path)

WHWH

• Mathematically– where S(I, J) is the score for sub-sequences I and J

W H A T 0 0 0 0 0

W 0 2 0 0 0H 0 0 4 2 0Y 0 0 2 3 1

total score of 4B

A

J

I

Page 34: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 34

Local alignment illustration (1 of 2)

• Determine the best local alignment and the maximum alignment score for

• Sequence A: ACCTAAGG• Sequence B: GGCTCAATCA• Scoring conditions:

– if , – if and

Page 35: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 35

Local alignment illustration (2 of 2)

G G C T C A A T C A

A

C

C

T

A

A

G

G

G G C T C A A T C A

0 0 0 0 0 0 0 0 0 0 0

A 0

C 0

C 0

T 0

A 0

A 0

G 0

G 0

G G C T C A A T C A

0 0 0 0 0 0 0 0 0 0 0

A 0 0 0 0 0 0 2

C 0

C 0

T 0

A 0

A 0

G 0

G 0

G G C T C A A T C A

0 0 0 0 0 0 0 0 0 0 0

A 0 0 0 0 0 0 0

C 0

C 0

T 0

A 0

A 0

G 0

G 0

G G C T C A A T C A

0 0 0 0 0 0 0 0 0 0 0

A 0 0 0 0 0 0 0

C 0

C 0

T 0

A 0

A 0

G 0

G 0

G G C T C A A T C A

0 0 0 0 0 0 0 0 0 0 0

A 0 0 0 0 0 0 2 2 0 0 2

C 0

C 0

T 0

A 0

A 0

G 0

G 0

G G C T C A A T C A

0 0 0 0 0 0 0 0 0 0 0

A 0 0 0 0 0 0 2 2 0 0 2

C 0 0 0 2 0 2 0 1 1 2 0

C 0

T 0

A 0

A 0

G 0

G 0

G G C T C A A T C A

0 0 0 0 0 0 0 0 0 0 0

A 0 0 0 0 0 0 2 2 0 0 2

C 0 0 0 2 0 2 0 1 1 2 0

C 0 0 0 2 1 2 1 0 0 3 1

T 0

A 0

A 0

G 0

G 0

G G C T C A A T C A

0 0 0 0 0 0 0 0 0 0 0

A 0 0 0 0 0 0 2 2 0 0 2

C 0 0 0 2 0 2 0 1 1 2 0

C 0 0 0 2 1 2 1 0 0 2 1

T 0 0 0 0 4 2 1 0 2 0 1

A 0

A 0

G 0

G 0

G G C T C A A T C A

0 0 0 0 0 0 0 0 0 0 0

A 0 0 0 0 0 0 2 2 0 0 2

C 0 0 0 2 0 2 0 1 1 2 0

C 0 0 0 2 1 2 1 0 0 2 1

T 0 0 0 0 4 2 1 0 2 0 1

A 0 0 0 0 2 3 4 3 1 1 2

A 0

G 0

G 0

G G C T C A A T C A

0 0 0 0 0 0 0 0 0 0 0

A 0 0 0 0 0 0 2 2 0 0 2

C 0 0 0 2 0 2 0 1 1 2 0

C 0 0 0 2 1 2 1 0 0 2 1

T 0 0 0 0 4 2 1 0 2 0 1

A 0 0 0 0 2 3 4 3 1 1 2

A 0 0 0 0 0 1 5 6 4 2 3

G 0

G 0

G G C T C A A T C A

0 0 0 0 0 0 0 0 0 0 0

A 0 0 0 0 0 0 2 2 0 0 2

C 0 0 0 2 0 2 0 1 1 2 0

C 0 0 0 2 1 2 1 0 0 2 1

T 0 0 0 0 4 2 1 0 2 0 1

A 0 0 0 0 2 3 4 3 1 1 2

A 0 0 0 0 0 1 5 6 4 2 3

G 0 2 2 0 0 0 3 4 5 3 1

G 0

G G C T C A A T C A

0 0 0 0 0 0 0 0 0 0 0

A 0 0 0 0 0 0 2 2 0 0 2

C 0 0 0 2 0 2 0 1 1 2 0

C 0 0 0 2 1 2 1 0 0 3 1

T 0 0 0 0 4 2 1 0 2 1 2

A 0 0 0 0 2 3 4 3 1 1 3

A 0 0 0 0 0 1 5 6 4 2 3

G 0 2 2 0 0 0 3 4 5 3 1

G 0 2 4 2 0 0 1 2 3 4 2

Page 36: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 36

Local alignment illustration (3 of 3)

CTCAA GGCTCAATCACT-AA ACCT-AAGG

Best score: 6

G G C T C A A T C A 0 0 0 0 0 0 0 0 0 0 0

A 0 0 0 0 0 0 2 2 0 0 2C 0 0 0 2 0 2 0 1 1 2 0C 0 0 0 2 1 2 1 0 0 3 1T 0 0 0 0 4 2 1 0 2 1 1A 0 0 0 0 2 3 4 3 1 1 3A 0 0 0 0 0 1 5 6 4 2 3G 0 2 2 0 0 0 3 4 5 3 1G 0 2 4 2 0 0 1 2 3 4 2

in the whole seq. context (globally)locally

Page 37: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 37

Aligning proteinsGlobally and Locally

Page 38: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 38

Biological context

• Find common functional units– Structural motifs

• Helix-loop-helix

• Zinc finger• …

• Phylogeny– Distance between species

Page 39: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 39

Protein Alignment• Protein local and global alignment

– follows the same rules as we saw with DNA/RNA • Differences (∆)

– alphabet of proteins is 22 residues (aa) long – scoring/substitution matrices used (BLOSUM)

• protein proprieties are taken into account– residues that are totally different due to charge such as polar

Lysine and apolar Glycine are given a low score

Page 40: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 40

Substitution matrices

• Protein sequences are more complex– matrices = collection of scoring rules

• Matrices over events such as– mismatch and perfect match

• Need to define gap penalty separately• E.g. BLOcks SUbstitution Matrix (BLOSUM)

Page 41: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 41

BLOSUM-x matrices

• Constructed from aligned sequences with specific x% similarity– matrix built using sequences with no more then

50% similarity is called BLOSUM-50

• For highly mutating / dissimilar sequences use– BLOSUM-45 and lower

• For highly conserved / similar sequences use– BLOSUM -62 and higher

Page 42: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 42

BLOSUM 62

• What diagonal represents? • What is the score for substitution ED (acid a.a.)? • More drastic substitution KI (basic to non-polar)?

perfect match between a.a.

Score = 2

Score = -3

Page 43: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 43

Practical problem:Align following sequences both globally and locally using BLOSUM 62 matrix with gap penalty of -8

Sequence A: AAEEKKLAAASequence B: AARRIA

Page 44: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 44

Aligning globally using BLOSUM 62

AAEEKKLAAAAA--RRIA--

Score: -14Other alignment options? Yes

A A E E K K L A A A 0 -8 -16 -24 -32 -40 -48 -56 -64 -72 -80

A -8 4 -4 -12 -20 -28 -36 -44 -52 -60 -68A -16 -4 8 0 -8 -16 -24 -32 -40 -48 -56R -24 -12 0 8 0 -6 -14 -22 -30 -38 -46R -32 -20 -8 0 8 2 -4 -12 -20 -28 -36I -40 -28 -16 -8 0 5 -1 -2 -10 -18 -26

A -48 -36 -24 -16 -8 -1 4 -2 2 -6 -14

Page 45: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 45

Aligning locally using BLOSUM 62

KKLARRIA

Score: 10

A A E E K K L A A A

0 0 0 0 0 0 0 0 0 0 0

A 0 4 4 0 0 0 0 0 4 4 4

A 0 4 8 3 0 0 0 0 4 8 8

R 0 0 3 8 3 2 2 0 0 3 7

R 0 0 0 3 8 5 4 0 0 0 2

I 0 0 0 0 0 5 2 6 0 0 0

A 0 4 4 0 0 0 4 1 10 4 4

Page 46: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 46

Practical 1 of 4 :

Sequence Retrieval and Analysisvia R

Page 47: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 47

Protein database

• UniProt database (http://www.uniprot.org/) has high quality protein data manually curated

• It is manually curated• Each protein is assigned UniProt ID

Page 48: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 48

Retrieving data from

• In search field one can enter either use UniProt ID or common protein name – example: myelin basic protein

• We will use retrieve data for P02686

Uniprot ID

Page 49: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 49

FASTA format

• FASTA format is widely used and has the following parameters– Sequence name start with > sign– The fist line corresponds to protein name

Actual protein sequence starts from 2nd line

Page 50: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 50

Retrieving protein data with R

• Can “talk” programmatically to UniProt database using R and seqinR library– seqinR library is suitable for

• “Biological Sequences Retrieval and Analysis”• Detailed manual could be found here

– Install this library in your R environmentinstall.packages("seqinr")library("seqinr")

– Choose database to retrieve data from choosebank("swissprot")

– Download data object for target protein (P02686) MBP_HUMAN = query("MBP_HUMAN", "AC=P02686")

– See sequence of the object MBP_HUMAN MBP_HUMAN_seq = getSequence(MBP_HUMAN); MBP_HUMAN_seq

Page 51: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 51

Dot Plot (comparison of 2 sequences) (1of2)

• Each sequence plotted on vertical or horizontal dimension– If two a.a. from two

sequences at given positions are identical the dot is plotted

– matching sequence segments appear as diagonal lines (that could be parallel to the absolute diagonal line if insertion or gap is present)

Page 52: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 52

Dot Plot (comparison of 2 sequences) (2of2)

• Visualize dot plotdotPlot(MBP_HUMAN_seq[[1]], MBP_MOUSE_seq[[1]],xlab="MBP - Human", ylab = "MBP - Mouse")

- Is there similarity between human and mouse form of MBP protein?- Where is the difference in the sequence between the two isoforms?

• Let’s compare two protein sequences– Human MBP (Uniprot ID: P02686)– Mouse MBP (Uniprot ID: P04370)

• Download 2nd mouse sequenceMBP_MOUSE = query("MBP_MOUSE", "AC=P04370");MBP_MOUSE_seq = getSequence(MBP_MOUSE);

INSERTION in MBP-Human or GAP in MBP-Mouse

Shift in diagonal line (identical regions)

Breaks in diagonal line = regions of dissimilarity

Page 53: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 53

Practical 2 of 4:Pairwise global and local alignmentsvia R and Biostrings

Page 54: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 54

Installing Biostrings library

• Install library from Bioconductorsource("http://bioconductor.org/biocLite.R")biocLite("Biostrings")library(Biostrings)

• Define substitution martix (e.g. for DNA)DNA_subst_matrix = nucleotideSubstitutionMatrix(match = 2,

mismatch = -1, baseOnly = TRUE)

• The scoring rules– Match: = 2 if – Mismatch : = -1 if – Gap: = -2 or = -2

DNA_subst_matrix

Page 55: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 55

Global alignment using R and Biostrings

• Create two sting vectors (i.e. sequences)seqA = "GATTA"seqB = "GTTA"

• Use pairwiseAlignment() and the defined rulesglobalAlignAB = pairwiseAlignment(seqA, seqB, substitutionMatrix = DNA_subst_matrix, gapOpening = -2,

scoreOnly = FALSE, type="global")

• Visualize best paths (i.e. alignments) globalAlignAB

Global PairwiseAlignedFixedSubject (1 of 1)pattern: [1] GATTA subject: [1] G-TTA score: 2

Page 56: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 56

Local alignment using R and Biostrings

• Input two sequencesseqA = "AGGATTTTAAAA"seqB = "TTTT"

• The scoring rules will be the same as we used for global alignmentlocalAlignAB = pairwiseAlignment(seqA, seqB, substitutionMatrix = DNA_subst_matrix, gapOpening = -2,

scoreOnly = FALSE, type="local")

• Visualize alignmentglobalAlignABLocal PairwiseAlignedFixedSubject (1 of 1)pattern: [5] TTTT subject: [1] TTTT score: 8

Page 57: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 57

Aligning protein sequences

• Protein sequences alignments are very similar except the substitution matrix is specified

data(BLOSUM62)BLOSUM62

• Will align sequencesseqA = "PAWHEAE"seqB = "HEAGAWGHEE"

• Execute the global alignmentglobalAlignAB <- pairwiseAlignment(seqA, seqB, substitutionMatrix = "BLOSUM62", gapOpening = -2,

gapExtension = -8, scoreOnly = FALSE)

Page 58: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 58

Practical 3 of 4:

DNA sequence statistics and Seqinr

Page 59: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 59

Retrieving genome sequence data

Can retrieve sequence data from NCBI 1. Manually via webGUI2. Programmatically via R

• DEN-1 Dengue virus genome sequence, which has NCBI accession NC_001477

• Gain in speed compared to manual retrieval• More complex queries

Page 60: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 60

Manually

• NCBI Sequence Database via its website www.ncbi.nlm.nih.gov

• Dengue DEN-1 DNA sequence is a viral DNA sequence

• NCBI accession is NC_001477

Page 61: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 61

NCBI database

Page 62: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 62

Retrieving FASTA sequence• To retrieve the DNA sequence • as a FASTA format sequence file

– click on “Send” at the top right • choose “File” in the pop-up menu

– and then choose FASTA from the “Format” » click on “Create file”.

Page 63: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 63

Retrieving genome sequence data using SeqinR• Can retrieve sequences much faster programmatically getncbiseq <- function(accession) { require("seqinr"); # this function requires the SeqinR R package # first find which ACNUC database the accession is stored in: dbs <- c("genbank","refseq","refseqViruses","bacterial"); numdbs <- length(dbs); for (i in 1:numdbs) { db <- dbs[i]; choosebank(db); # check if the sequence is in ACNUC database 'db': resquery <- try(query(".tmpquery", paste("AC=", accession)), silent = TRUE); if (!(inherits(resquery, "try-error"))) { queryname <- "query2"; thequery <- paste("AC=",accession,sep=""); query2 <- query(`queryname`,`thequery`); # see if a sequence was retrieved: seq <- getSequence(query2$req[[1]]); closebank(); return(seq); } closebank(); } print(paste("ERROR: accession",accession,"was not found")); }

dengueseq <- getncbiseq("NC_001477");dengueseq[1:50];length(dengueseq);

Page 64: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 64

Base composition of a DNA sequence

• Count frequencies of the 4 nucleotides

table(dengueseq);dengueseq; a c g t 3426 2240 2770 2299

• This means that the DEN-1 Dengue virus genome sequence has – 3426 As, 2240 Cs, 2770 Gs and 2299 Ts

Page 65: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 65

GC Content of DNA

• the fraction of the sequence that consists of Gs and Cs, ie. the %(G+C).– the percentage of the bases in the genome that are – Gs or Cs

GC(dengueseq)[1] 0.4666977

(2240+2770)*100/(3426+2240+2770+2299) [1] 46.66977

Page 66: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 66

Di-nucleotides• it is also interesting to know

– the frequency of longer DNA “words”– dinucleotides

• ie. “AA”, “AG”, “AC”, “AT”, “CA”, “CG”, “CC”, “CT”, “GA”, “GG”, “GC”, “GT”, “TA”, “TG”, “TC”, and “TT”

count(dengueseq, 2) aa ac ag at 1108 720 890 708

ca cc cg ct 901 523 261 555

ga gc gg gt 976 500 787 507

ta tc tg tt440 497 832 529

Page 67: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 67

Practical 4 of 4:Using BLAST for sequence identification

Page 68: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 68

BLAST

• Basic Local Alignment Search Tool• Many different types• http://blast.ncbi.nlm.nih.gov/Blast.cgi

Page 69: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 69

Types

• blastn– nucleotide query vs nucleotide database

• blastp– protein query vs protein DB

• blastx– translated in 6 frames nucleotide query vs protein DB

Page 70: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 70

Sequence identity• Want to know which genes are coded by the genomic sequence

• >human_genomic_seq TGGACTCTGCTTCCCAGACAGTACCCCTGACAGTGACAGAACTGCCACTCTCCCCACCTG ACCCTGTTAGGAAGGTACAACCTATGAAGAAAAAGCCAGAATACAGGGGACATGTGAGCC ACAGACAACACAAGTGTGCACAACACCTCTGAGCTGAGCTTTTCTTGATTCAAGGGCTAG TGAGAACGCCCCGCCAGAGATTTACCTCTGGTCTTCTGAGGTTGAGGGCTCGTTCTCTCT TCCTGAATGTAAAGGTCAAGATGCTGGGCCTCAGTTTCCTCTTACATACTCACCAAAAGG CTCTCCTGATCAGAGAAGCAGGATGCTGCACTTGTCCTCCTGTCGATGCTCTTGGCTATG ACAAAATCTGAGCTTACCTTCTCTTGCCCACCTCTAAACCCCATAAGGGCTTCGTTCTGT GTCTCTTGAGAATGTCCCTATCTCCAACTCTGTCATACGGGGGAGAGCGAGTGGGAAGGA TCCAGGGCAGGGCTCAGACCCCGGCGCATGGACCTAGTCGGGGGCGCTGGCTCAGCCCCGCCCCGCGCGCCCCCGTCGCAGCCGACGCGCGCTCCCGGGAGGCGGCGGCAGAGGCAGCATCCACAGCATCAGCAGCCTCAGCTTCATCCCCGGGCGGTCTCCGGCGGGGAAGGCCGGTGGGACAAACGGACAGAAGGCAAAGTGCCCGCAATGGAGGGAGCATCCTTTGGCGCGGGCCGTGCGGGAGCTGCCTTTGATCCCGTGAGCTTTGCGCGGCGGCCCCAGACCCTGTTGCGGGTCGTGTCCTGG

Page 71: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 71

BLAST GUI

Page 72: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 72

Results

Top hit• Mus musculus targeted KO-first, conditional ready, lacZ-tagged mutant

allele Tbl3:tm1a(EUCOMM)Hmgu; transgenic

Page 73: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 73

Page 74: Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences

____________________________________________________________________________________________________________________Kirill Bessonov slide 74

Resources

• Online Tutorial on Sequence Alignment– http://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/s

rc/chapter4.html

• Graphical alignment of proteins– http://www.itu.dk/~sestoft/bsa/graphalign.html

• Pairwise alignment of DNA and proteins using your rules:– http://www.bioinformatics.org/sms2/pairwise_align_dna.html

• Documentation on libraries– Biostings: http://www.bioconductor.org/packages/2.10/bioc/manuals/Biostrings/man/Biostrings.pdf

– SeqinR: http://seqinr.r-forge.r-project.org/seqinr_2_0-7.pdf