Genomics and Gene Recognition CIS 667 April 27, 2004
Page 1: Genomics and Gene Recognition CIS 667 April 27, 2004.

Genomics and Gene Recognition

CIS 667 April 27, 2004

Page 2: Genomics and Gene Recognition CIS 667 April 27, 2004.

Genomics and Gene Recognition

• How do we recognize the genes given the raw sequence data?

• Two different cases:
    • Prokaryotes: relatively easy
    • Eukaryotes: relatively difficult
        • Much “junk DNA” to search through

• Signals determine the beginnings and ends of genes
    • Need to find the signals

Page 3: Genomics and Gene Recognition CIS 667 April 27, 2004.

Prokaryotic Genomes

• Genomic information of prokaryotes is dedicated mainly to basic tasks
    • Make and replicate DNA
    • Make new proteins
    • Obtain and store energy

• Over 60 prokaryotic genomes have been completely sequenced since mid-1990s

Page 4: Genomics and Gene Recognition CIS 667 April 27, 2004.

Prokaryotic Genomes

• Recall - prokaryotes have a single circular chromosome

• Also - no cell nucleus, therefore no splicing out of introns

• Therefore, prokaryotic gene structure is quite simple

Transcriptional start site

Promoter region

Operator sequence

Open Reading Frame

Transcriptional stop site

Translational start site (AUG)

Translational stop site

Page 5: Genomics and Gene Recognition CIS 667 April 27, 2004.

Promoter Elements

• Gene expression begins with transcription
    • An RNA copy of a gene is made by an RNA polymerase
    • Prokaryotic RNA polymerases are assemblies of several different proteins
        • β′ protein binds to the DNA template
        • β protein links nucleotides
        • α protein holds the subunits together
        • σ protein recognizes specific nucleotide sequences of promoters

Page 6: Genomics and Gene Recognition CIS 667 April 27, 2004.

Promoter Elements

• β′, β, and α are often very similar from one bacterial species to another

• σ can vary (less well conserved)
    • Several variants are often found in a cell
    • The ability to use several different σ factors allows a cell to turn on or off expression of whole sets of genes
    • For example, σ32 turns on expression of genes associated with heat shock, σ54 does the same for nitrogen stress, and genes that always need to be expressed are transcribed by polymerases with σ70

Page 7: Genomics and Gene Recognition CIS 667 April 27, 2004.

Promoter Elements

• Each σ factor recognizes a particular sequence of nucleotides upstream from the gene
    • σ70 looks for the -35 sequence TTGACA and the -10 sequence TATAAT
    • Other σ factors look for other -35 and -10 sequences
    • The match need not always be exact
    • The better the match, the more likely transcription will be initiated
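The idea that a better match to the -35 and -10 consensus sequences makes initiation more likely can be sketched as a simple match count. This is only an illustration: the function names and the 15-19 bp spacer range between the two hexamers are assumptions, not something stated on the slides.

```python
# Consensus hexamers quoted on the slide.
MINUS_35 = "TTGACA"
MINUS_10 = "TATAAT"


def matches(window: str, consensus: str) -> int:
    """Count positions where the window agrees with the consensus."""
    return sum(a == b for a, b in zip(window, consensus))


def best_promoter(seq: str, min_gap: int = 15, max_gap: int = 19):
    """Return (score, pos35, pos10) for the best-scoring hexamer pair.

    score is the number of matching bases out of 12. The spacer range
    (min_gap..max_gap) is an assumed parameter for illustration only.
    """
    seq = seq.upper()
    best = (-1, -1, -1)
    for i in range(len(seq) - 6 + 1):
        s35 = matches(seq[i:i + 6], MINUS_35)
        for gap in range(min_gap, max_gap + 1):
            j = i + 6 + gap
            if j + 6 > len(seq):
                break
            score = s35 + matches(seq[j:j + 6], MINUS_10)
            if score > best[0]:
                best = (score, i, j)
    return best
```

A score of 12 means both hexamers match exactly; lower scores correspond to weaker candidate promoters under this toy model.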

Page 8: Genomics and Gene Recognition CIS 667 April 27, 2004.

Promoter Elements

• Protein products from some genes are always used in tandem with those from some other genes
    • These related genes may share a single promoter in prokaryotic genomes and be arranged in an operon
    • When one gene is transcribed, so are all of the others - one polycistronic RNA molecule is produced
    • The lactose operon contains three genes involved in metabolism of the sugar lactose in bacterial cells

Page 9: Genomics and Gene Recognition CIS 667 April 27, 2004.

Operon


Page 10: Genomics and Gene Recognition CIS 667 April 27, 2004.

Operon

• The protein encoded by the regulatory gene (pLacI) can bind to lactose or to the operator sequence of the operon
    • When lactose is abundant, pLacI is less likely to bind to the operator sequence
    • When it does bind the operator, it blocks transcription, thus acting as a negative regulator
    • Even without negative regulation, operon expression is low because the promoter is a poor match to the consensus sequence for the σ70 factor

• A positive regulator (CRP) promotes expression

Page 11: Genomics and Gene Recognition CIS 667 April 27, 2004.

Operon


Page 12: Genomics and Gene Recognition CIS 667 April 27, 2004.

Lac Operon


Page 13: Genomics and Gene Recognition CIS 667 April 27, 2004.

Open Reading Frames

• Recall - 3 of the 64 codons are stop codons (UAA, UAG, UGA) - they cause translation to stop

• Most prokaryotic proteins are longer than 60 amino acids
    • In random sequence we expect a stop codon about once in every 21 codons (64/3 ≈ 21), so a run of 30 or more codons with no stop codon (an Open Reading Frame - ORF) is good evidence that we are looking at the coding sequence of a prokaryotic gene
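The ORF argument can be made concrete with a short sketch. Assuming all 64 codons are equally likely (a simplification), the chance that a run of n codons contains no stop is (61/64)^n, which is roughly 24% for 30 codons and roughly 6% for 60 codons. The scanner below is a minimal illustration that reports stop-free runs in the three forward reading frames only; the function names are hypothetical, and it does not look for a start codon or scan the reverse strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA equivalents of UAA, UAG, UGA


def p_no_stop(n_codons: int) -> float:
    """Chance that n random codons contain no stop (uniform codon model)."""
    return (61 / 64) ** n_codons


def find_orfs(dna: str, min_codons: int = 30):
    """Yield (start, end, frame) for stop-free runs of >= min_codons codons.

    Only the three forward frames are scanned; end is exclusive and
    points at the stop codon (or at the end of the last full codon).
    """
    dna = dna.upper()
    for frame in range(3):
        run_start = frame
        for pos in range(frame, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                if (pos - run_start) // 3 >= min_codons:
                    yield (run_start, pos, frame)
                run_start = pos + 3
        # Close a run that reaches the end of the sequence.
        last = frame + ((len(dna) - frame) // 3) * 3
        if (last - run_start) // 3 >= min_codons:
            yield (run_start, last, frame)
```

Raising `min_codons` from 30 to 60 trades sensitivity for fewer chance hits, which is one way to read the different thresholds quoted in these slides.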

Page 14: Genomics and Gene Recognition CIS 667 April 27, 2004.

Open Reading Frames

• AUG is a start codon
    • Defines where translation begins
    • If no likely promoter sequences are found upstream of a start codon at the start of an ORF before the end of the preceding ORF, assume the two genes are part of an operon whose promoter sequence is further upstream

Page 15: Genomics and Gene Recognition CIS 667 April 27, 2004.

Termination Sequence

• Most prokaryotic operons contain specific signals for the termination of transcription called intrinsic terminators
    • Must have a sequence of nucleotides that includes an inverted repeat followed by a run of roughly six uracils
    • The inverted repeat allows the RNA to form a loop structure that greatly slows down RNA synthesis
    • Together with the chemical properties of uracil, this is enough to end transcription
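A rough sketch of how such a signal could be searched for: look for a short stretch whose reverse complement appears a few bases downstream (the inverted repeat), immediately followed by a run of U's. The stem length and loop-size range below are assumed values chosen for illustration, the function names are hypothetical, and real terminators tolerate mismatches that this exact-match version ignores.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def revcomp(rna: str) -> str:
    """Reverse complement of an RNA string."""
    return rna.translate(COMPLEMENT)[::-1]


def find_terminators(rna, stem=6, min_loop=3, max_loop=8, u_run=6):
    """Return (stem_start, loop_len) for hairpin + U-run candidates.

    A candidate is an exact inverted repeat of `stem` bases separated by
    a loop of min_loop..max_loop bases and followed directly by `u_run`
    uracils. DNA input is accepted and T is read as U.
    """
    rna = rna.upper().replace("T", "U")
    hits = []
    for i in range(len(rna) - 2 * stem - min_loop - u_run + 1):
        left = rna[i:i + stem]
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop
            right = rna[j:j + stem]
            tail = rna[j + stem:j + stem + u_run]
            if len(tail) < u_run:
                break  # no room left for the U run
            if right == revcomp(left) and tail == "U" * u_run:
                hits.append((i, loop))
    return hits
```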

Page 16: Genomics and Gene Recognition CIS 667 April 27, 2004.

Termination Sequence


Page 17: Genomics and Gene Recognition CIS 667 April 27, 2004.

GC Content in Prokaryotic Genomes

• For every G within a double-stranded DNA genome there must be a C - likewise an A for every T
    • The only constraint on the fractions of nucleotides that are G/C as opposed to A/T is that the two must add to 100%
    • Can use genomic GC content to identify bacterial species (ranges from 25% to 75%)
    • Can also use GC content to identify genes that have been obtained from other bacteria by horizontal gene transfer
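GC content is straightforward to compute, and a sliding window gives a crude way to flag regions whose composition differs from the rest of the genome, as horizontally transferred genes might. The window size, step, and tolerance below are arbitrary assumptions for illustration, and the function names are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_windows(seq, window=1000, step=500, tolerance=0.10):
    """Return (start, gc) for windows whose GC fraction differs from the
    whole-sequence GC fraction by more than `tolerance`."""
    overall = gc_content(seq)
    flagged = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        gc = gc_content(seq[start:start + window])
        if abs(gc - overall) > tolerance:
            flagged.append((start, gc))
    return flagged
```

Flagged windows are only candidates; composition alone cannot prove that a region was acquired from another species.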

Page 18: Genomics and Gene Recognition CIS 667 April 27, 2004.

Prokaryotic Gene Densities

• Gene density within prokaryotic genomes is very high
    • Between 85% and 88% of the nucleotides are typically associated with coding regions of genes
    • Just as large portions of chromosomes can be acquired, they can also be deleted
        • Portions left are those which code for essential genes

Page 19: Genomics and Gene Recognition CIS 667 April 27, 2004.

Gene Recognition in Prokaryotes

• Long ORFs (60 or more codons)

• Matches to simple promoter sequences

• Recognizable transcriptional termination signal (inverted repeats followed by a run of uracils)

• Comparison with nucleotide (or amino acid) sequences of known protein coding regions from other organisms

Page 20: Genomics and Gene Recognition CIS 667 April 27, 2004.

Eukaryotic Genomes

• Much more complex
    • Internal membrane-bound compartments allow a wide variety of chemical environments in each cell
    • Multicellular organisms
        • Each cell type has distinct gene expression
    • Size of genome may be larger
        • Allows for “junk DNA”

• Gene expression more complex and flexible than in prokaryotes

Page 21: Genomics and Gene Recognition CIS 667 April 27, 2004.

Eukaryotic Gene Structure

Page 22: Genomics and Gene Recognition CIS 667 April 27, 2004.

Promoter Elements

• Each different cell type requires different gene expression
    • Therefore eukaryotes have elaborate mechanisms for starting transcription
    • Prokaryotes have a single RNA polymerase - eukaryotes have three
        • RNA polymerase I - Ribosomal RNAs
        • RNA polymerase II - Protein-coding genes
        • RNA polymerase III - tRNAs, other small RNAs

Page 23: Genomics and Gene Recognition CIS 667 April 27, 2004.

Promoter Elements

• Most RNA polymerase II promoters contain a set of sequences known as a basal promoter where an initiation complex is assembled and transcription begins

• Also have several upstream promoter elements (typically at least 5) to which other proteins bind
    • Without the proteins binding upstream, initiation complex assembly is difficult

Page 24: Genomics and Gene Recognition CIS 667 April 27, 2004.

Promoter Elements

• RNA polymerase II does not directly recognize the basal sequences of promoters
    • Basal transcription factors, including a TATA-binding protein (TBP) and at least 12 TBP-associated factors, bind to the promoter in a specific order, facilitating binding of RNA polymerase
    • TATA-box 5’-TATAWAW-3’ (W is A or T) at -25 relative to the transcriptional start site
    • Initiator sequence 5’-YYCARR-3’ (Y is C or T and R is G or A) at the transcriptional start site
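The degenerate symbols in these consensus sequences (W, Y, R) map naturally onto regular-expression character classes. The sketch below, with hypothetical function names, finds TATA-box matches and pairs each with an initiator match that starts about 25 bases downstream; the 20-30 bp spacing window is an assumption used to approximate the "-25" position from the slide.

```python
import re

# Only the nucleotide codes used on the slide are listed here.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "[AT]", "Y": "[CT]", "R": "[AG]"}


def motif_pattern(motif: str) -> str:
    """Translate a degenerate motif such as TATAWAW into a regex string."""
    return "".join(IUPAC[ch] for ch in motif)


TATA_BOX = motif_pattern("TATAWAW")
INITIATOR = motif_pattern("YYCARR")


def tata_then_inr(dna, min_gap=20, max_gap=30):
    """Return (tata_pos, inr_pos) pairs where the initiator starts
    min_gap..max_gap bases after the start of the TATA box."""
    dna = dna.upper()
    # Lookahead patterns so that overlapping matches are all reported.
    tatas = [m.start() for m in re.finditer(f"(?={TATA_BOX})", dna)]
    inrs = [m.start() for m in re.finditer(f"(?={INITIATOR})", dna)]
    return [(t, i) for t in tatas for i in inrs
            if min_gap <= i - t <= max_gap]
```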


Page 25: Genomics and Gene Recognition CIS 667 April 27, 2004.

Transcription

Page 26: Genomics and Gene Recognition CIS 667 April 27, 2004.

Regulatory Protein Binding Sites

• Transcription initiation in eukaryotes relies heavily on positive regulation
    • Constitutive factors work on many genes and don’t respond to external signals
    • Regulatory factors act on a limited number of genes and respond to external signals
        • Response factors (e.g. heat shock)
        • Cell-specific factors (e.g. pituitary cells only)
        • Developmental factors (e.g. early embryo organization)

Page 27: Genomics and Gene Recognition CIS 667 April 27, 2004.

Open Reading Frames

• Before translation, a heterogeneous nuclear RNA (hnRNA) is transformed into mRNA by being
    • Capped
        • 5’ end chemically altered
    • Spliced
        • Various splicings can occur
    • Polyadenylated
        • Long stretch of A’s added at 3’ end

Page 28: Genomics and Gene Recognition CIS 667 April 27, 2004.

Introns and Exons

• The introns are spliced out of the hnRNA
    • Protein-coding genes conform to the GU-AG rule
        • These are the nucleotides at the 5’ and 3’ ends of the intron
    • Other nucleotides are examined as well
        • Most of these are inside the intron
        • These signals constrain introns to be at least 60 bp long - but there is no upper limit
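The GU-AG rule and the 60 bp minimum can be combined into a naive candidate-intron enumerator. This sketch, with a hypothetical function name, pairs every GU with every downstream AG that leaves at least 60 bases between the two ends inclusive. It ignores the additional internal signals mentioned above, so it heavily over-predicts and is meant only to show the constraint, not to serve as a splice-site predictor.

```python
import re


def candidate_introns(rna, min_len=60, max_len=None):
    """Return (start, end) spans that begin with GU and end with AG.

    end is exclusive, so rna[start:end] is the candidate intron.
    max_len is optional because the slide gives no upper limit.
    DNA input is accepted and T is read as U.
    """
    rna = rna.upper().replace("T", "U")
    donors = [m.start() for m in re.finditer("(?=GU)", rna)]
    acceptor_ends = [m.start() + 2 for m in re.finditer("(?=AG)", rna)]
    spans = []
    for d in donors:
        for a in acceptor_ends:
            length = a - d
            if length >= min_len and (max_len is None or length <= max_len):
                spans.append((d, a))
    return spans
```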

Page 29: Genomics and Gene Recognition CIS 667 April 27, 2004.

Alternative Splicing

• About 20% of human genes give rise to more than one type of mRNA sequence due to alternative splicing

• Splice junctions can be masked, causing an exon to be spliced out

• The following slide shows how alternative splicing based on different splicing factors (proteins) can stop a useful protein from being produced

Page 30: Genomics and Gene Recognition CIS 667 April 27, 2004.

Alternative Splicing

Page 31: Genomics and Gene Recognition CIS 667 April 27, 2004.

GC Content

• Overall GC content does not vary as much between eukaryotic genomes as between prokaryotic genomes
    • However, variations in GC content within a genome can help us to recognize genes
    • Of all the pairs of adjacent nucleotides (dinucleotides), CG is found at only 20% of its statistically expected frequency
    • No other pair is under- or over-represented
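"Expected value" here can be read as the frequency of C times the frequency of G, i.e. what independent bases would produce. A minimal sketch of the observed/expected ratio under that assumption follows; the function name is hypothetical.

```python
def dinucleotide_ratio(seq: str, pair: str = "CG") -> float:
    """Observed/expected frequency of a dinucleotide.

    Expected frequency assumes independent bases:
    freq(first base) * freq(second base).
    Returns 0.0 when the ratio is undefined.
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        return 0.0
    observed = sum(seq[i:i + 2] == pair for i in range(n - 1)) / (n - 1)
    expected = (seq.count(pair[0]) / n) * (seq.count(pair[1]) / n)
    return observed / expected if expected else 0.0
```

A ratio near 0.2 would correspond to the genome-wide depletion described above, while a ratio near 1.0 means CG occurs about as often as base composition alone predicts.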

Page 32: Genomics and Gene Recognition CIS 667 April 27, 2004.

GC Content

• The expected levels of CG are found, however, in stretches of 1-2 kbp at the 5’ ends of many human genes
    • These are called CpG islands and are associated with methylation
        • Methylation makes it easy for CG to mutate to TG or CA
        • High levels of methylation imply low levels of acetylation of histones (proteins which, when acetylated, make transcription of DNA possible)

Page 33: Genomics and Gene Recognition CIS 667 April 27, 2004.

Isochores

• Vertebrates and plants display a level of organization called isochores that is intermediate between that of genes and chromosomes
    • The GC content of an isochore is relatively uniform throughout
    • There are five classes of isochores depending on the level of GC content
        • Those with high GC content also have high gene density
        • The types of genes found in different classes differ as well

Page 34: Genomics and Gene Recognition CIS 667 April 27, 2004.

Codon Usage Bias

• Another hint for gene hunting can be derived from the fact that every organism prefers some of the equivalent (synonymous) triplet codons when coding for proteins

• Real exons generally reflect the bias while randomly chosen strings of triplets do not
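One way to turn codon bias into a score is to count codons in genes already known for the organism and then ask how well a candidate stretch matches those frequencies. The sketch below uses a mean log-odds against a uniform 1/64 background with add-one smoothing; both the scoring scheme and the function names are illustrative assumptions rather than the method used in the course.

```python
import math
from collections import Counter


def codon_counts(cds: str) -> Counter:
    """Count in-frame codons, ignoring any trailing partial codon."""
    cds = cds.upper()
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))


def usage_score(candidate: str, reference: Counter) -> float:
    """Mean log-odds of the candidate's codons under the reference
    usage versus a uniform 1/64 background (add-one smoothing)."""
    total = sum(reference.values()) + 64  # one pseudocount per codon
    codons = codon_counts(candidate)
    n = sum(codons.values())
    if n == 0:
        return 0.0
    score = 0.0
    for codon, k in codons.items():
        freq = (reference[codon] + 1) / total
        score += k * math.log(freq * 64)
    return score / n
```

A positive score means the candidate uses codons that the reference genes favor; a negative score means it leans on codons the reference rarely uses.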

Page 35: Genomics and Gene Recognition CIS 667 April 27, 2004.

Gene Recognition

• In summary, useful DNA sequence features for gene hunting include
    • Known promoter elements (e.g. TATA boxes)
    • CpG islands
    • Splicing signals associated with introns
    • ORFs with characteristic codon utilization
    • Similarity to the sequences of ESTs or genes from other organisms

Page 36: Genomics and Gene Recognition CIS 667 April 27, 2004.

Gene Expression

• Expression varies greatly, however

• Tools for determining gene expression levels include cDNAs and ESTs
    • Complementary DNAs (cDNAs) are synthesized from mRNAs and can be used to provide expressed sequence tags (ESTs) useful for contig assembly or gene recognition

Page 37: Genomics and Gene Recognition CIS 667 April 27, 2004.

cDNA

Page 38: Genomics and Gene Recognition CIS 667 April 27, 2004.

Microarrays

• Gene expression patterns can be studied using microarrays
    • Small silica (glass) chips covered with thousands of short nucleotide probes of known sequence
    • The microarray can then be used to compare the expression of all of the genes in the genome simultaneously
    • A gene is represented by a set of 16 probes

Page 39: Genomics and Gene Recognition CIS 667 April 27, 2004.

Microarrays

• The probes representing genes are arranged in a grid on the chip

• Fluorescently labeled cDNA from the tissue/organism we want to test is washed over the chip

• If a gene is expressed, its cDNA will bind to the gene’s probes

• We can detect this through pattern recognition

Page 40: Genomics and Gene Recognition CIS 667 April 27, 2004.

Microarrays

Make cDNA from cells after treatment with a drug

Make cDNA from cells before treatment with a drug
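A before/after comparison like this is often summarized per gene as a log ratio of signal intensities. The sketch below averages the probe intensities for one gene in each sample and reports log2(after/before). The averaging step, the intensity floor, and the function names are assumptions for illustration; real microarray analysis involves background correction and normalization that are not shown here.

```python
import math
from statistics import mean


def gene_signal(probe_intensities):
    """Average the probe intensities recorded for one gene."""
    return mean(probe_intensities)


def log2_ratio(after, before, floor=1.0):
    """log2(after / before) of the averaged probe signals.

    `floor` keeps very small signals from producing extreme ratios;
    its value is an arbitrary assumption.
    """
    a = max(gene_signal(after), floor)
    b = max(gene_signal(before), floor)
    return math.log2(a / b)
```

Under this convention a value of 2.0 means the gene's signal is four times higher after treatment, and 0.0 means no change.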

Page 41: Genomics and Gene Recognition CIS 667 April 27, 2004.

Microarrays

Page 42: Genomics and Gene Recognition CIS 667 April 27, 2004.

Transposition

• Transposons result from insertion of a duplicate sequence from another part of the genome, aided by a transposase enzyme
    • If inserted in “junk DNA”, not harmful
    • More common are retrotransposons, which are associated with retroviruses (encapsulated RNA and reverse transcriptase, which use a host to replicate) like HIV

Page 43: Genomics and Gene Recognition CIS 667 April 27, 2004.

Retrovirus Replication


Page 44: Genomics and Gene Recognition CIS 667 April 27, 2004.

Virus Replication
