Introduction to Biological sequences Sushmita Roy www.biostat.wisc.edu/bmi576/ sroy@biostat.wisc.edu September 4, 2014 BMI/CS 576

Dec 23, 2015

Page 1: Introduction to Biological sequences Sushmita Roy  sroy@biostat.wisc.edu September 4, 2014 BMI/CS 576.

Introduction to Biological sequences

Sushmita Roy

www.biostat.wisc.edu/bmi576/

sroy@biostat.wisc.edu

September 4, 2014

BMI/CS 576

Page 2

Goals for today

• A few key concepts in molecular biology
  – Nucleic acids
  – Genes
  – Proteins
  – The Central Dogma

• Connection between DNA, RNA and proteins

• Problems in sequence similarity
  – Sequence alignment
  – Sequence search

Page 3

A Living Cell

• The fundamental unit of life
• There are unicellular (one cell) and multi-cellular organisms
• A cell has different cellular components
• We will be concerned with
  – Nucleus
  – Ribosomes
  – Cytoplasm
• Prokaryotes (single-celled organisms lacking a nucleus)
• Eukaryotes (organisms with a nucleus)

Page 4

An animal cell

http://www.genome.gov/Glossary/index.cfm?id=25

Page 5

Deoxyribonucleic acid (DNA)

Image from the DOE Human Genome Program, http://www.ornl.gov/hgmis

Page 6

DNA is a double helical molecule

• In 1953, James Watson and Francis Crick discovered that the DNA molecule has two strands arranged in a double helix

• This was possible through the X-ray diffraction data from Maurice Wilkins and Rosalind Franklin

http://www.chemheritage.org/discover/online-resources/chemistry-in-history/themes/biomolecules/dna/watson-crick-wilkins-franklin.aspx

Watson and Crick

Maurice Wilkins

Rosalind Franklin

Page 7

Nucleotides

• DNA is composed of small chemical units called nucleotides

• Nucleotide
  – Nitrogen-containing base
  – 5-carbon sugar: deoxyribose
  – Phosphate group
  – Phosphate-hydroxyl bonds (phosphodiester linkages) connect the nucleotides
• Four nucleotides make up DNA
  – adenine (A), cytosine (C), guanine (G) and thymine (T)
  – The nucleotides differ from one another in the base

[Figure: a nucleotide with its phosphate, base, sugar and hydroxyl group labeled]

Page 8

Bases in the nucleotides

• Purines (two rings): Adenine (A), Guanine (G)

• Pyrimidines (one ring): Cytosine (C), Thymine (T)

Page 9

Nucleotides are linked to form one strand of DNA

[Figure: two linked nucleotides, each showing the base, the sugar with its carbons numbered 1’ to 5’, and the phosphate group attached at the 5’ CH2 position]

Page 10

5’ and 3’ of a DNA molecule

• Each strand is made up of linkages between the 5’ position (phosphate) on one nucleotide and the 3’ position of the following nucleotide
• At one end, there is a free phosphate group: 5’ end
• At the other end, there is a free OH group: 3’ end
• Therefore we can talk about directionality
  – the 5’ and the 3’ ends of a DNA strand
• The two strands are held together through base pairing

Page 11

5’ and 3’ of a DNA molecule, contd.

• DNA sequence is read from 5’ to 3’
• The two strands run anti-parallel to each other
  – One is the complement of the other
• For example, if AAG is the sequence on one strand, the sequence on the other strand, also read 5’ to 3’, is CTT
  – Not TTC

Page 12

Watson-Crick Base pairing

A always bonds to T
C always bonds to G

• This base pairing is also called “complementary base pairing”
• Each strand has a base sequence that is complementary to the sequence on the other strand.
• If you know the sequence on one strand, you know the sequence on the other strand
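
Complementary base pairing can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture; the names `PAIR`, `complement` and `reverse_complement` are chosen here for the example. Because the two strands are anti-parallel, the partner strand read 5’ to 3’ is the reverse of the base-by-base complement.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Base-by-base complement, in the same left-to-right order as the input."""
    return "".join(PAIR[base] for base in strand)

def reverse_complement(strand):
    """Sequence of the partner strand, read 5' to 3'."""
    return complement(strand)[::-1]

print(complement("AAG"))          # TTC: partner strand read 3' to 5'
print(reverse_complement("AAG"))  # CTT: partner strand read 5' to 3'
```

The same `complement` call applied to CATTGCCCAGT gives GTAACGGGTCA, the second strand in the replication example.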

Page 13

DNA stores the blueprint of an organism

• The heredity molecule
• Has the information needed to make an organism
• Double strandedness of the DNA molecule provides stability and prevents errors in copying
  – one strand has all the information
• DNA replication is the process by which this information is copied through generations of daughter cells

Page 14

DNA replication

• Helicase, an enzyme, separates the double helix
• DNA polymerase makes a copy of each strand using free nucleotides
• Each strand of DNA serves as a template

Parent DNA double helix

Strand A           5’ C A T T G C C C A G T 3’
Strand B           3’ G T A A C G G G T C A 5’

Template strand A  5’ C A T T G C C C A G T 3’
New strand B       3’ G T A A C G G G T C A 5’

New strand A       5’ C A T T G C C C A G T 3’
Template strand B  3’ G T A A C G G G T C A 5’

Adapted from “Understanding Bioinformatics”

Page 15

Videos on DNA replication

https://www.youtube.com/watch?v=zdDkiRw1PdU
https://www.youtube.com/watch?v=27TxKoFU2Nw

Page 16

Chromosomes

• All the DNA of an organism is divided up into individual chromosomes

• Each chromosome is really a DNA molecule

• Different organisms have different numbers of chromosomes

Image from www.genome.gov

Page 17

Different organisms have different numbers of chromosomes

Organism # of chromosomes

Yeast 32

Human 46

Fly 8

Mouse 40

Arabidopsis 10

Worm 12

Page 18

Genes

• Genes are the units of heredity

• A gene is a sequence of bases which specifies a protein or RNA molecule

• The human genome has ~ 25,000 protein-coding genes (still being revised)

• One gene can have many functions

• One function can require many genes

…GTATGTCTAAGCCTGAATTCAGTCTGCTTTAAACGGCTTC…

Page 19

Genomes

• A genome is the complete complement of DNA for a given species

• The human genome consists of 2 × 23 chromosomes

• Every cell (except egg and sperm cells and mature red blood cells) contains the complete genome of an organism

Page 20

Some Greatest Hits

Genome                        Where                          Year
H. influenzae (bacteria)      TIGR                           1995
E. coli (K12)                 Wisconsin                      1997
S. cerevisiae (yeast)         International Collab.          1997
C. elegans (worm)             Washington U./Sanger           1998
D. melanogaster (fruit fly)   Multiple groups                2000
E. coli O157:H7 (pathogen)    Wisconsin                      2000
H. sapiens (humans)           International Collab./Celera   2001
M. musculus (mouse)           International Collab.          2002
R. norvegicus (rat)           International Collab.          2004

Page 21

Some Genome Sizes

Genome # base pairs

HIV 9750

E. coli 4.6 million

S. cerevisiae 12 million

C. elegans 97 million

D. melanogaster 137 million

H. sapiens 3.1 billion

Page 22

The central dogma of Molecular biology

DNA -(Transcription)-> RNA -(Translation)-> Proteins

Page 23

RNA: Ribonucleic acid

• RNA
  – Made up of repeating nucleotides
  – The sugar is ribose
  – U is used in place of T
• A strand of RNA can be thought of as a string composed of the four letters: A, C, G, U
• RNA is single stranded
  – More flexible than DNA
  – Can double back and form loops
  – Such structures can be more stable

Page 24

Transcription

• In eukaryotes: happens inside the nucleus
• RNA polymerase (RNA Pol) is an enzyme that builds an RNA strand from a gene
• RNA Pol is recruited at specific parts of the genome in a condition-specific way.
• Transcription factor proteins are assigned the job of RNA Pol recruitment.
• RNA that is transcribed from a protein coding region is called messenger RNA (mRNA)

Page 25

Transcription

The RNA string produced is identical to the non-template strand except T is replaced by U.
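
Treating strands as strings, this rule is a one-line substitution. The sketch below is illustrative only: the function names and the short example sequences are assumptions made for this example, and both input strands are assumed to be written 5’ to 3’.

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def transcribe(non_template_strand):
    """mRNA for a DNA non-template strand: same letters, with T replaced by U."""
    return non_template_strand.upper().replace("T", "U")

def transcribe_from_template(template_strand):
    """mRNA for a DNA template strand written 5' to 3'.

    The non-template strand is the reverse complement of the template,
    so we recover it first and then replace T with U.
    """
    non_template = "".join(PAIR[b] for b in reversed(template_strand.upper()))
    return transcribe(non_template)

print(transcribe("ATGGTGCTG"))                # AUGGUGCUG
print(transcribe_from_template("CAGCACCAT"))  # AUGGUGCUG
```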

Page 26

The central dogma of Molecular biology

DNA -(Transcription)-> RNA -(Translation)-> Proteins

Page 27

Translation

• Process of turning mRNA into proteins.

• Happens outside the nucleus, in the cytoplasm, on ribosomes

• Ribosomes are the machines that synthesize proteins from mRNA

Page 28

Proteins

• Proteins are polymers too
• The repeating units are amino acids
• There are 20 standard amino acids
• DNA codes for protein
  – How many nucleotides are needed to specify 20 amino acids?

Page 29

Amino Acids

Page 30

Codons

• Each triplet of bases is called a codon
• How many codons are possible?
• Some codons are special
  – One start codon: AUG: start of translation (it also codes for methionine)
  – Three stop codons (UAA, UAG, UGA): end of translation
• All others code for a particular amino acid
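
The counting behind codons can be checked with a short script. This is a sketch for illustration; the variable names are chosen here, and the stop codons listed are those of the standard genetic code. With four bases, words of length 1 or 2 give only 4 or 16 combinations, too few for 20 amino acids, so triplets are the shortest words that suffice.

```python
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = 20

# Smallest word length k such that 4**k distinct words can cover 20 amino acids.
k = 1
while len(BASES) ** k < AMINO_ACIDS:
    k += 1
print(k)  # 3

# Enumerate every possible triplet of bases.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
STOP_CODONS = {"UAA", "UAG", "UGA"}  # standard genetic code
START_CODON = "AUG"

# Codons that specify an amino acid (the start codon is one of them).
sense_codons = [c for c in codons if c not in STOP_CODONS]
print(len(codons), len(sense_codons))  # 64 61
```

Since 61 codons specify only 20 amino acids, several codons must map to the same amino acid.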

Page 31

The Genetic Code: Specifies how mRNA is translated into protein

Genetic code is degenerate

Page 32

Codons and Reading Frames

5’ CUC AGC GUU ACC AU 3’
   Leu Ser Val Thr

C UCA GCG UUA CCA U
  Ser Ala Leu Pro

CU CAG CGU UAC CAU
   Gln Arg Tyr His
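
The three reading frames can be reproduced with a small translation sketch. It assumes the standard genetic code, written compactly as a 64-letter string with codons enumerated in U, C, A, G order; the function name `translate_frame` is illustrative. Amino acids are reported in one-letter codes (L = Leu, S = Ser, V = Val, T = Thr, A = Ala, P = Pro, Q = Gln, R = Arg, Y = Tyr, H = His).

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order
# (first base varies slowest). "*" marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate_frame(mrna, frame):
    """Translate every complete codon of mrna starting at offset frame (0, 1 or 2).

    Stop codons are reported as "*"; translation does not halt at them.
    """
    codons = (mrna[i:i + 3] for i in range(frame, len(mrna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)

mrna = "CUCAGCGUUACCAU"
for frame in range(3):
    print(frame, translate_frame(mrna, frame))
# 0 LSVT
# 1 SALP
# 2 QRYH
```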

Page 33

Proteins are the workhorses of the cell

• structural support
• transport of substances
• coordination of an organism’s activities
• response of cell to chemical stimuli
• protection against disease
• catalyzing chemical reactions

Page 34

Proteins are complex molecules

• Primary amino acid sequence

• Secondary structure

• Tertiary structure

• Quaternary structure

• These structures are formed through different levels of protein folding and packaging

Page 35

Some well-known proteins

Hemoglobin: carries oxygen
Insulin: metabolism of sugar
Actin: maintenance of cell structure

http://en.wikipedia.org/wiki/Hemoglobin
http://en.wikipedia.org/wiki/Insulin
http://en.wikipedia.org/wiki/Actin

Page 36

Hemoglobin protein HBA1

>gi|224589807:226679-227520 Homo sapiens chromosome 16, GRCh37.p9 Primary Assembly

1 CCCACAGACT CAGAGAGAAC CCACCATGGT GCTGTCTCCT GACGACAAGA CCAACGTCAA

61 GGCCGCCTGG GGTAAGGTCG GCGCGCACGC TGGCGAGTAT GGTGCGGAGG CCCTGGAGAG

121 GATGTTCCTG TCCTTCCCCA CCACCAAGAC CTACTTCCCG CACTTCGACC TGAGCCACGG

181 CTCTGCCCAG GTTAAGGGCC ACGGCAAGAA GGTGGCCGAC GCGCTGACCA ACGCCGTGGC

241 GCACGTGGAC GACATGCCCA ACGCGCTGTC CGCCCTGAGC GACCTGCACG CGCACAAGCT

301 TCGGGTGGAC CCGGTCAACT TCAAGCTCCT AAGCCACTGC CTGCTGGTGA CCCTGGCCGC

361 CCACCTCCCC GCCGAGTTCA CCCCTGCGGT GCACGCCTCC CTGGACAAGT TCCTGGCTTC

421 TGTGAGCACC GTGCTGACCT CCAAATACCG TTAAGCTGGA GCCTCGGTGG CCATGCTTCT

481 TGCCCCTTTG G

DNA sequence (491 bp)

>sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens GN=HBA1 PE=1 SV=2
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

Amino acid sequence (142 aa)
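
Both records above follow the FASTA convention: a header line starting with ">" followed by sequence lines. A minimal reader might look like the sketch below. The function name `parse_fasta` and the toy records are hypothetical, and dropping every non-letter character from sequence lines is a convenience assumed here so that numbered, space-separated layouts are also accepted.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Keep letters only, dropping position numbers and spaces.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Hypothetical toy input, not real database records.
example = """>toy_dna illustrative record
1 CCCACAGACT CAGAGAGAAC
21 CCACC
>toy_protein illustrative record
MVLSPADKTN
"""
for header, sequence in parse_fasta(example):
    print(header.split()[0], len(sequence))
```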

Page 37

RNA genes

• Not all genes encode proteins
• For some genes the end product is RNA
  – ribosomal RNAs (rRNAs), which include major constituents of ribosomes
  – transfer RNAs (tRNAs), which carry amino acids to ribosomes
  – micro RNAs (miRNAs), which play an important regulatory role in various plants and animals
  – lincRNAs (long intergenic non-coding RNAs), which play important regulatory roles

Page 38

RECAP

• Key components of a eukaryotic cell
  – Nucleus, Cytoplasm, Ribosome
• What are DNA and RNA?
  – Each is a large molecule called a polymer
  – Made up of repeated units
• Nucleotides
  – DNA: ATGC
  – RNA: AUGC
• What is a protein?
  – Also a polymer, but the units are amino acids
• The Central Dogma: DNA->RNA->protein
• Important processes
  – DNA replication, Transcription, Translation
• Some resources
  – http://www.genome.gov/Glossary/index.cfm

Page 39

http://www.youtube.com/watch?v=41_Ne5mS2ls

A video on transcription and translation

Page 40

Things we did not talk about

• DNA packaging
• Alternative splicing
• Polyadenylation
• Post-translational modifications

Page 41

A few important biological data/knowledge bases

• The 2014 Nucleic Acids Research Database issue reports 1,552 databases
• National Center for Biotechnology Information (NCBI)
  – http://www.ncbi.nlm.nih.gov
  – GenBank: Database of sequences
  – RefSeq: Reference sequences
• Ensembl
  – http://useast.ensembl.org/info/about/index.html
• UniProt: Protein sequence and protein function
• Protein Data Bank: Protein structure
• Pathway databases
  – Gene Ontology
  – KEGG
• Interaction databases
  – BioGRID
  – STRING

See also http://nar.oxfordjournals.org/content/42/D1/D1.full#T1

Page 42

Number of genomes in RefSeq

Source: http://www.ncbi.nlm.nih.gov/refseq/statistics/

Page 43

Sequence similarity

• Sequence similarity is central to addressing many questions in biology
  – Are two sequences related?
• Similarity in sequence can imply similarity in function.
  – Assign function to uncharacterized sequences based on characterized sequences
• Sequences from different species can be compared to estimate the evolutionary relationships between species
  – We will come back to this in Phylogenetic trees.

Page 44

Overview of sequence similarity problems

• Assessing similarity between a small number of DNA or protein sequences
  – Pairwise sequence alignment
  – Multiple sequence alignment
• Searching databases for a query sequence
  – Heuristic search using BLAST

Page 45

What is sequence alignment?

The task of locating equivalent regions of two or more sequences to assess their overall similarity

Page 46

A very simple alignment of two sequences

T H I S S E Q U E N C E

T H A T S E Q U E N C E

Aligned/matched positions

Page 47

How to align these two sequences?

T H I S S E Q U E N C E

T H A T I S A S E Q U E N C E

The problem arises when the sequences to be compared are of unequal length

Page 48

How do sequences change?

• Sequences change through mutations

substitutions: ACGA → AGGA

insertions: ACGA → ACGGA

deletions: ACGA → AGA
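
On strings, the three mutation types are simple edit operations. The helper names below are illustrative, and positions are 0-based; this is a sketch of the idea rather than a model of how mutations arise.

```python
def substitute(seq, pos, base):
    """Replace the base at position pos with a different base."""
    return seq[:pos] + base + seq[pos + 1:]

def insert(seq, pos, bases):
    """Insert one or more bases before position pos."""
    return seq[:pos] + bases + seq[pos:]

def delete(seq, pos, length=1):
    """Remove length bases starting at position pos."""
    return seq[:pos] + seq[pos + length:]

print(substitute("ACGA", 1, "G"))  # AGGA
print(insert("ACGA", 2, "G"))      # ACGGA
print(delete("ACGA", 1))           # AGA
```

Insertions and deletions change the length of a sequence, which is why alignments of related sequences need gaps.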

Page 49

Need to incorporate gaps while aligning sequences

Alignment 1: 3 gaps, 9 matches

_ _ _ T H I S S E Q U E N C E
T H A T I S A S E Q U E N C E

Alignment 2: 3 gaps, 10 matches

T H I S _ _ _ S E Q U E N C E
T H A T I S A S E Q U E N C E
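
Gap and match counts for a given alignment can be tallied column by column. The sketch below uses a hypothetical helper, `summarize`, and writes each aligned row without spaces, with "_" as the gap character. Counted this way on the rows as transcribed here, Alignment 1 has 3 gaps and 9 matches, and Alignment 2 has 3 gaps and 10 matches.

```python
def summarize(row1, row2, gap="_"):
    """Count gap, match and mismatch columns of a pairwise alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    gaps = matches = mismatches = 0
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return gaps, matches, mismatches

print(summarize("___THISSEQUENCE", "THATISASEQUENCE"))  # (3, 9, 3)
print(summarize("THIS___SEQUENCE", "THATISASEQUENCE"))  # (3, 10, 2)
```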

Page 50

Issues in sequence alignment

• What type of alignment?
  – Align the entire sequence or part of it?
  – Two sequences or multiple sequences?
• How to find the alignment?
  – Search algorithms for alignment
• How to score an alignment?
  – the sequences we’re comparing typically differ in length
  – some characters (nucleotide or amino acid) are more substitutable than others
• How to tell if the alignment is biologically meaningful?
  – Assessing how likely the alignment could have happened by random chance

Page 51

Algorithms for alignment

• Pairwise alignment algorithms based on Dynamic programming
  – Global alignment
  – Local alignment
• Multiple sequence alignment
  – Progressive/Guide-tree based approaches
  – Iterative alignments
• BLAST
  – Searching a query sequence in a database of sequences with efficient pre-processing
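
To give a flavor of the dynamic programming approach, here is a minimal sketch of global alignment scoring in the style of the Needleman-Wunsch algorithm. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, and the function returns only the best score, not the alignment itself.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of s and t by dynamic programming."""
    # prev[j] is the best score for aligning the processed prefix of s with t[:j].
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]  # aligning s[:i] against an empty prefix of t
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # Best of: align s[i-1] with t[j-1], gap in t, or gap in s.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

print(global_alignment_score("THISSEQUENCE", "THATSEQUENCE"))     # 8
print(global_alignment_score("THISSEQUENCE", "THATISASEQUENCE"))  # 9
```

Each cell depends only on three neighbors, so the running time is proportional to the product of the two sequence lengths. For the second pair, the score of 9 comes from matching all 12 letters of the shorter sequence and paying for 3 gap columns.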

Page 52

Scoring alignments

• Percent identity

• Substitution matrices of amino acids
  – Genuine matches may not be identical
  – PAM, BLOSUM50 matrices

• Gap penalty functions
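
These three ingredients can be combined in a small scoring sketch. Everything numeric below is an assumption for illustration: `toy_substitution` is a stand-in with made-up values, not PAM or BLOSUM scores; the gap penalty is a simple affine one (opening a gap costs more than extending it); and percent identity is taken over all alignment columns, which is only one of several conventions in use.

```python
def percent_identity(row1, row2, gap="_"):
    """Identical columns as a percentage of all alignment columns."""
    same = sum(a == b and a != gap for a, b in zip(row1, row2))
    return 100.0 * same / len(row1)

def toy_substitution(a, b):
    # Hypothetical scores, not PAM or BLOSUM values.
    return 2 if a == b else -1

def alignment_score(row1, row2, substitution, gap_open=-2, gap_extend=-1, gap="_"):
    """Sum substitution scores plus affine gap penalties over the columns.

    The first column of a run of gaps costs gap_open and each further
    column costs gap_extend. For simplicity, adjacent gaps are treated as
    one run even if they fall in different rows.
    """
    score, in_gap = 0, False
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += substitution(a, b)
            in_gap = False
    return score

row1, row2 = "THIS___SEQUENCE", "THATISASEQUENCE"
print(round(percent_identity(row1, row2), 1))         # 66.7
print(alignment_score(row1, row2, toy_substitution))  # 14
```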

Page 53

Reading assignment for Sep 9th

• Chapter 2, Sections 2.1-2.3, from Textbook: Biological Sequence Analysis