Genes, Microarrays and Motifs

Lecture 8

CSC 2417/BCB 410

Michael Brudno

Many slides from various sources, including T. Hughes (U. of T.), S. Batzoglou (Stanford), Sanja Rogic (UBC), Manolis Kellis (MIT)

Outline

Intro to genes and motifs
Identifying Gene Structures
Microarrays
Identifying Regulatory Elements

Genes and Motifs

Cells respond to environment

Cell responds to environment: various external messages

Genome is fixed – Cells are dynamic

A genome is static

Every cell in our body has a copy of the same genome

A cell is dynamic

Responds to external conditions
Most cells follow a cell cycle of division

Cells differentiate during development

Gene expression varies according to:

Cell type
Cell cycle
External conditions
Location

slide credits: M. Kellis

Where gene regulation takes place

Opening of chromatin

Transcription

Translation

Protein stability

Protein modifications

Transcriptional Regulation

Efficient place to regulate:

No energy wasted making intermediate products

However, slowest response time

After a receptor notices a change:

1. Cascade message to nucleus
2. Open chromatin & bind transcription factors
3. Recruit RNA polymerase and transcribe
4. Splice mRNA and send to cytoplasm
5. Translate into protein

Transcription Factors Binding to DNA

Transcription regulation:

Transcription factors bind DNA

Binding recognizes DNA substrings:

Regulatory motifs

Promoter and Enhancers

Promoter necessary to start transcription

Enhancers can affect transcription from afar

Gene Regulation with TFs

[Figure: a transcription factor (protein) bound to a regulatory element on the DNA next to a gene; RNA polymerase at the gene; new protein produced]

TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGGATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATCATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAG...TTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCA
TAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAATGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAAT


Features marked on the sequence: promoter motifs, 3’ UTR motifs, exons, introns

Example: A Human heat shock protein

TATA box: positioning transcription start

TATA, CCAAT: constitutive transcription

GRE: glucocorticoid response

MRE: metal response

HSE: heat shock element

[Figure: promoter of heat shock hsp70, axis from 0 to -158, next to the GENE; elements as labeled: TATA, SP1, CCAAT, AP2, HSE, AP2, CCAAT, SP1]

Splicing

frgjjthissentencehjfmkcontainsjunkelm

thissentencecontainsjunk

Gene structure

[Figure: exon1, intron1, exon2, intron2, exon3; steps labeled transcription, splicing, translation]

exon = protein-coding
intron = non-coding

Codon: A triplet of nucleotides that is converted to one amino acid

Alternative splicing

[Figure: Isoform 1, Isoform 2 and Isoform 3 drawn over exon 1, exon 2, exon 3, exon 4, exon 5]

Predicting Gene Structure

Gene Finding: Different Approaches

Similarity-based methods (extrinsic) - use similarity to annotated sequences:
  proteins
  cDNAs
  ESTs

Comparative genomics - Aligning genomic sequences from different species

Ab initio gene-finding (intrinsic)

Integrated approaches

Similarity-based methods

Based on sequence conservation due to functional constraints

Use local alignment tools (Smith-Waterman algo, BLAST, FASTA) to search protein, cDNA, and EST databases

Will not identify genes that code for proteins not already in databases

Limits of the regions of similarity not well defined

Comparative Genomics

Based on the assumption that coding sequences are more conserved than non-coding

Two approaches:
  intra-genomic (gene families)
  inter-genomic (cross-species)

Alignment of homologous regions

Difficult to define limits of higher similarity

Difficult to find optimal evolutionary distance

Using Comparative Information

Hox cluster is an example where everything is conserved

Patterns of Conservation

              Genes    Intergenic   Separation
Mutations     30%      58%          2-fold
Gaps          1.3%     14%          10-fold
Frameshifts   0.14%    10.2%        75-fold

Summary for Extrinsic Approaches

Strengths:

Rely on accumulated pre-existing biological data, thus should produce biologically relevant predictions

Weaknesses:

Limited to pre-existing biological data
Errors in databases
Difficult to find limits of similarity

Ab initio Gene Finding, Part 1

Input: A DNA string over the alphabet {A,C,G,T}

Output: An annotation of the string showing for every nucleotide whether it is coding or not

AAAGCATGCATTTAACGAGTGCATCAGGACTCCATACGTAATGCCG

Gene finder

AAAGC ATG CAT TTA ACG A GT GCATC AG GA CTC CAT ACG TAA TGCCG
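The input/output contract of a gene finder can be made concrete with a small Python sketch. It does not predict anything: it assumes the exon coordinates are already known (here read off the spaced version of the slide's example, as 0-based half-open intervals) and shows what a per-nucleotide annotation and the resulting protein look like. The function names `annotate` and `translate_exons` are illustrative, not from any library.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def annotate(seq, exons):
    """Return one label per nucleotide: 'C' inside a coding exon, '-' elsewhere.

    `exons` is a list of half-open, 0-based (start, end) intervals.
    """
    labels = ["-"] * len(seq)
    for start, end in exons:
        for pos in range(start, end):
            labels[pos] = "C"
    return "".join(labels)


def translate_exons(seq, exons):
    """Join the exons (splicing) and translate codon by codon until a stop."""
    coding = "".join(seq[start:end] for start, end in exons)
    protein = []
    for i in range(0, len(coding) - 2, 3):
        amino_acid = CODON_TABLE[coding[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# The slide's example sequence; exon coordinates are read off its spaced form.
SEQ = "AAAGCATGCATTTAACGAGTGCATCAGGACTCCATACGTAATGCCG"
EXONS = [(5, 18), (27, 41)]

if __name__ == "__main__":
    print(SEQ)
    print(annotate(SEQ, EXONS))
    print(translate_exons(SEQ, EXONS))
```

Note that the first exon ends in the middle of a codon (the lone `A`), so the reading frame only makes sense after the intron `GT GCATC AG` is removed.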

Ab initio Gene Finding, Part 2

Using only sequence information

Identifying only coding exons of protein-coding genes (transcription start site, 5’ and 3’ UTRs are ignored)

Integrates coding statistics with signal detection

Coding Statistics, Part 1

Unequal usage of codons in the coding regions is a universal feature of the genomes
  uneven usage of amino acids in existing proteins
  uneven usage of synonymous codons

We can use this feature to differentiate between coding and non-coding regions of the genome

Coding statistics - a function that for a given DNA sequence computes a likelihood that the sequence is coding for a protein

Coding Statistics, Part 2

Many different ones

codon usage
hexamer usage
GC content
compositional bias between codon positions
nucleotide periodicity
…
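One of the simplest coding statistics, codon usage, can be sketched as a log-odds score. The Python below is a minimal illustration under stated assumptions: the training sequences are made-up toy data, the background model is uniform (1/64 per codon), sequences are uppercase A/C/G/T read in frame 0, and the names `codon_frequencies` and `coding_score` are hypothetical. Real gene finders use much larger training sets and richer models.

```python
import math
from collections import Counter

ALL_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]


def codons(seq):
    """Split a sequence into consecutive triplets, starting at position 0."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def codon_frequencies(coding_seqs, pseudocount=1.0):
    """Estimate codon frequencies from known coding sequences.

    A pseudocount keeps unseen codons from getting probability zero.
    """
    counts = Counter()
    for seq in coding_seqs:
        counts.update(codons(seq))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def coding_score(seq, freqs):
    """Log-odds of `seq` under the codon-usage model vs. a uniform background.

    Positive scores mean the codons look like the training (coding) data.
    """
    background = 1.0 / len(ALL_CODONS)
    return sum(math.log(freqs[c] / background) for c in codons(seq))


# Toy training data (hypothetical, for illustration only).
TRAINING = ["ATGGCTGAAAAAGCTGAA", "ATGAAAGCTGAAGCTAAA"]
FREQS = codon_frequencies(TRAINING)

if __name__ == "__main__":
    print(coding_score("GCTGAAGCT", FREQS))  # codons seen in training
    print(coding_score("CCCTTTGGG", FREQS))  # codons never seen in training
```

In practice the score would be computed over a sliding window and in all reading frames; the frame with a consistently high score suggests where a coding exon might lie.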

An Example of Coding Statistics, Part 1

Signal Sensors, Part 1

Signal – a string of DNA recognized by the cellular machinery

Signal Sensors, Part 2

Various pattern recognition methods are used for identification of these signals:

consensus sequences
weight matrices
weight arrays
decision trees
Hidden Markov Models (HMMs)
neural networks
…

ATG (start codons)
GT (donor splice sites)
AG (acceptor splice sites)
TGA, TAA, TAG (stop codons)
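A weight matrix, one of the signal sensors listed for this topic, can be sketched in a few lines of Python. This is an illustration under assumptions: the aligned donor-site examples are invented toy data, the background is uniform over A/C/G/T, and `build_pwm`, `score` and `best_hit` are hypothetical helper names rather than any tool's API.

```python
import math


def build_pwm(sites, pseudocount=1.0, alphabet="ACGT"):
    """Build a log-odds weight matrix from equal-length aligned sites.

    Each column stores log(p(base at this position) / background).
    """
    width = len(sites[0])
    total = len(sites) + pseudocount * len(alphabet)
    background = 1.0 / len(alphabet)
    pwm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log((column.count(base) + pseudocount) / total / background)
            for base in alphabet
        })
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds for one candidate window."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_hit(pwm, seq):
    """Return the start index of the highest-scoring window in `seq`."""
    width = len(pwm)
    return max(range(len(seq) - width + 1),
               key=lambda i: score(pwm, seq[i:i + width]))


# Toy aligned donor-site examples (hypothetical, for illustration only).
DONOR_SITES = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
PWM = build_pwm(DONOR_SITES)

if __name__ == "__main__":
    print(score(PWM, "GTAAGT"), score(PWM, "CCCCCC"))
    print(best_hit(PWM, "CCCCGTAAGTCCCC"))
```

Unlike a consensus sequence, the matrix tolerates variation: a site that differs from the consensus at a weakly conserved position loses only a little score.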

Stochastic Nature of Signal Motifs


Gene Prediction Summary

Expressed Sequence (cDNA) or protein sequence available?
  Yes: Spliced alignment
    BLAT, Exonerate, est_genome, spidey, GMAP, Genewise
  No: Integrated gene prediction
    Informant genome(s) available?
      Yes: Dual or n-genome de novo predictors:
        SGP2, Twinscan, NSCAN, (Genomescan – same or cross genome protein blastx)
      No: ab initio predictors
        geneid, genscan, augustus, fgenesh, genemark, etc.

Many newer gene predictors can run in multiple modes depending on the evidence available.

Microarrays

Microarray

Measure the level of mRNA messages in a cell

[Figure: an array with spots DNA 1 to DNA 6, one per gene (Gene 1 to Gene 6). RNA 4 and RNA 6 are converted by RT into cDNA 4 and cDNA 6, hybridized to the array, and measured.]

Typical use of cDNA Microarrays: “Internal” normalization using two colors

[Figure: cDNA pools from a control and a treatment (drug, mutation) are compared on one array with spots x, y, z; each spot is read as up, down, unchanged, or not present.]

Alternative Splicing Microarray

Measure the expression of the various probes

Infer the expression of the different splice forms from the ratio of the inclusion and exclusion isoform

Picture taken from:

J. Calarco et al. Genes and Dev. 21, 2963-2975

“cDNA microarrays” are essentially dot-blots on glass slides

http://arrayit.com/Products/Printing/Stealth/stealth.html

• This slide was made with 16 pins
• 4.5 mm pin spacing matches 384-well plates (16 x 24)
• Done with robotics
• Slides usually coated with poly-lysine
• Spots are usually 100-150 microns
• Spot spacing is usually 200-300 microns
• Slides are 25 x 75 mm
• Easy to deposit 20K spots/slide

0.45 mm

Microarray expression profiling by 2-color assay (“cDNA arrays”)

Array: PCR products, 6250 yeast ORFs

hybridized cDNAs:
green = control
red = experiment

*Schena et al., 1995
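The two-color readout is usually summarized per spot as a log ratio of the experiment channel to the control channel. The Python sketch below follows the slide's color convention (red = experiment, green = control); the 2-fold threshold and the detection floor are illustrative assumptions, not values from the slides, and `call_spot` is a hypothetical helper name.

```python
import math


def log10_ratio(red, green):
    """log10(experiment / control) for one spot."""
    return math.log10(red / green)


def call_spot(red, green, fold=2.0, floor=50.0):
    """Classify one spot as up, down, unchanged, or not present.

    `fold` (minimum fold change) and `floor` (minimum detectable intensity)
    are assumed thresholds chosen only for this example.
    """
    if red < floor and green < floor:
        return "not present"
    # Clamp to 1.0 so a zero intensity cannot break the division or the log.
    ratio = log10_ratio(max(red, 1.0), max(green, 1.0))
    threshold = math.log10(fold)
    if ratio >= threshold:
        return "up"
    if ratio <= -threshold:
        return "down"
    return "unchanged"


if __name__ == "__main__":
    for red, green in [(400, 100), (100, 400), (100, 110), (5, 8)]:
        print(red, green, call_spot(red, green))
```

Because both samples are hybridized to the same spot, spot-to-spot differences in the amount of printed DNA largely cancel in the ratio, which is the sense in which the normalization is "internal".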

Looking at data from a single experiment

[Figure: scatter plots of log10(Expression Ratio) (y-axis, -2 to 2) against log10(Intensity) (x-axis, -2 to 2), annotated "P-value < 0.01". Panels: 3-AT vs. no drug; wild-type vs. wild-type. Slides: 11120c01 - 11121c01; 11857c01 - 11858c01.]

[Figure: log10(ratio) (y-axis, -2 to 2) against log10(average intensity) (x-axis, -2 to 2), two panels.]

Clustering Algorithms

• Hierarchical
• K-means

[Figure: points a to h clustered two ways: a hierarchical tree over the points, and a K-means partition around centers c1, c2, c3]


Hierarchical clustering

Bottom-up algorithm:
  Initialization: each point in a separate cluster
  At each step:
    Choose the pair of closest clusters
    Merge
  The exact behavior of the algorithm depends on how we define the distance CD(X,Y) between clusters X and Y

Avoids the problem of specifying the number of clusters


Distance between clusters

Single-link method: $CD(X,Y) = \min_{x \in X,\, y \in Y} D(x,y)$

Complete-link method: $CD(X,Y) = \max_{x \in X,\, y \in Y} D(x,y)$

Average-link method: $CD(X,Y) = \operatorname{avg}_{x \in X,\, y \in Y} D(x,y)$

Centroid method: $CD(X,Y) = D(\operatorname{avg}(X), \operatorname{avg}(Y))$

[Figure: the same points d, e, f, g, h shown once for each of the four definitions]
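The bottom-up procedure and the four cluster distances can be sketched together in plain Python. This is a deliberately naive illustration (it recomputes all pairwise cluster distances at every step), it assumes Euclidean distance for D(x,y), and the function names are hypothetical. Clusters are tuples of point indices.

```python
import math


def centroid(cluster, points):
    """Coordinate-wise mean of the points in a cluster."""
    members = [points[i] for i in cluster]
    return tuple(sum(coords) / len(members) for coords in zip(*members))


def cluster_distance(X, Y, points, linkage):
    """CD(X, Y) under the chosen linkage, with Euclidean D(x, y)."""
    if linkage == "centroid":
        return math.dist(centroid(X, points), centroid(Y, points))
    pairwise = [math.dist(points[i], points[j]) for i in X for j in Y]
    if linkage == "single":
        return min(pairwise)
    if linkage == "complete":
        return max(pairwise)
    if linkage == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(points, linkage="single"):
    """Merge the closest pair of clusters until one cluster remains.

    Returns the list of merges as (cluster_a, cluster_b, distance).
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Ties are broken by the smallest pair of cluster positions.
        dist, a, b = min(
            (cluster_distance(clusters[a], clusters[b], points, linkage), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merges.append((clusters[a], clusters[b], dist))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Toy 2-D points (hypothetical, for illustration only).
POINTS = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]

if __name__ == "__main__":
    for step in agglomerate(POINTS, "single"):
        print(step)
```

The list of merges is exactly the information needed to draw the tree; cutting it at a chosen distance yields a flat clustering without fixing the number of clusters in advance.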


Hierarchical Clustering Result

[Figure: clustered heat map. Color scale: fold repression to fold induction (ticks -10, -5, -2, 1, 2, 5, 10). Axes: transcript response index, experiment index. Labeled groups: RHO O/X, PKC O/X, ste mutants, treatment with alpha-factor.]

Data from Roberts et al., Science (2000)

K-Means Clustering Algorithm

Each cluster $X_i$ has a center $c_i$

Define the clustering cost criterion
$COST(X_1, \ldots, X_k) = \sum_{X_i} \sum_{x \in X_i} |x - c_i|^2$

Algorithm tries to find clusters $X_1 \ldots X_k$ and centers $c_1 \ldots c_k$ that minimize COST

K-means algorithm:
  Initialize centers
  Repeat:
    Compute best clusters for given centers → Attach each point to the closest center
    Compute best centers for given clusters → Choose the centroid of points in cluster
  Until the changes in COST are “small”

[Figure: points a to h with centers c1, c2, c3]


K-Means Algorithm

1. Randomly Initialize Clusters
2. Assign data points to nearest clusters
3. Recalculate Clusters
4. Repeat … until convergence

Time: O(KNM) per iteration

N: #genes
M: #conditions
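The K-means loop (assign each point to the nearest center, recompute each center as a centroid, stop when the cost barely changes) can be sketched in plain Python. Assumptions in this sketch: initial centers are K randomly chosen data points, the stopping tolerance `tol` is arbitrary, and an empty cluster simply keeps its previous center. The names are illustrative.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(members) for coords in zip(*members))


def kmeans(points, k, tol=1e-9, seed=0, max_iter=100):
    """Return (centers, assignment, cost) after iterating to convergence."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k data points
    previous_cost = float("inf")
    for _ in range(max_iter):
        # Best clusters for the given centers: nearest center for each point.
        assignment = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        # Best centers for the given clusters: centroid of each cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = centroid(members)
        cost = sum(sq_dist(p, centers[a]) for p, a in zip(points, assignment))
        if previous_cost - cost <= tol:  # change in COST is "small"
            break
        previous_cost = cost
    return centers, assignment, cost


# Toy data: two well-separated groups (hypothetical, for illustration only).
POINTS = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

if __name__ == "__main__":
    centers, assignment, cost = kmeans(POINTS, k=2)
    print(centers, assignment, cost)
```

Each iteration compares every one of the N points with each of the K centers across M coordinates, which is where the O(KNM) cost per iteration comes from. The result depends on the initial centers, so several random restarts are commonly compared by their final cost.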

K-Means Result

One example: K-means (must choose K)

[Figure: K = 10; panels #1, #2, #3]

See: Sherlock G. Analysis of large-scale gene expression data. Curr Opin Immunol. 2000 Apr;12(2):201-5.

GO-Biological Process categories

Breadth levels shown: Very Broad, Broad, Mid-level, Narrow

Categories shown: metabolism, development, CNS development, insulin secretion, vision, ATP biosynthesis, striated muscle contraction, pigment metabolism, eye morphogenesis, eye pigment metabolism

# annotated genes (mouse), as listed: 163, 137, 21, 36, 25, 33, 34, 1548; development 2341

GO-Biological Process hierarchy

eye pigment metabolism

eye morphogenesis

pigment metabolism

CNS development

metabolism

development

Other types of categorical annotations:

KEGG, EC numbers (describe biochemical “pathways”)

MIPS, YPD (yeast databases – older than GO)

Results of individual studies (localization, 2-hybrid screens, protein complexes, etc.)

Sequence motifs, structural domains (pfam, SMART)

Evaluating clusters – Hypergeometric Distribution

• N genes, p labeled +, (N-p) labeled –
• Cluster: k genes, m labeled +
• P-value of single cluster containing k genes of which at least r are +

Prob a random set of k genes has m + and k-m – genes:
$P(m) = \binom{p}{m} \binom{N-p}{k-m} \Big/ \binom{N}{k}$

P-value that at least r genes are + in the cluster:
$P\text{-value} = \sum_{m=r}^{\min(k,p)} \binom{p}{m} \binom{N-p}{k-m} \Big/ \binom{N}{k}$
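The cluster P-value can be computed directly with exact binomial coefficients. The Python sketch below uses the slide's symbols (N genes in total, p of them labeled +, a cluster of k genes, at least r of them labeled +); the function names are illustrative, and the numbers in the example are made up.

```python
from math import comb


def hypergeom_pmf(N, p, k, m):
    """Probability that a random set of k genes out of N, of which p are
    labeled +, contains exactly m genes labeled +."""
    if m < 0 or m > k or m > p or k - m > N - p:
        return 0.0
    return comb(p, m) * comb(N - p, k - m) / comb(N, k)


def enrichment_pvalue(N, p, k, r):
    """Probability that such a random set contains at least r genes labeled +."""
    return sum(hypergeom_pmf(N, p, k, m) for m in range(r, min(k, p) + 1))


if __name__ == "__main__":
    # Made-up example: 6000 genes, 40 in the category, cluster of 25 with 8 hits.
    print(enrichment_pvalue(6000, 40, 25, 8))
```

When many categories are tested against many clusters, each P-value computed this way is one of many tests, so a multiple-testing correction is usually applied before calling an enrichment significant.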


Analyzing clusters:

amino acid biosynthesis (p<10^-14)**
amino acid metabolism (p<10^-14)**
methionine metabolism (p=1.07×10^-7)

**When testing clusters against many different types of categorical annotations, should consider correcting for multiple-testing, and also consider that categories are often not independent

Non-overlapping yeast gene expression clusters, 424 experiments (Chua et al., 2004)

Figure labels: 249 genes; 1,226 genes

Cluster label                   Sample genes
amino acid metabolism           TRP4, HIS3
arginine biosynthesis           ARG1, ARG3
arginine catabolism             CAR1, CAR2
aromatic AA metabolism          ARO9, ARO10
asparagine biosynthesis         ASN1, ASN2
branched chain AA synth         ILV1,2,3,6
lysine biosynthesis             LYS2, LYS9
methionine biosynthesis         MET3,16,28
sulfur AA tnsprt, metab         MUP1, MHT1
adenine biosynthesis            ADE1,4,8
aldehyde metabolism             AAD4,14,16
biotin biosynthesis             BIO3,4
citrate metabolism              CIT1,2
ergosterol biosynthesis         ERG1,5,11
fatty acid biosynthesis         FAS1, FAS2
gluconeogenesis                 PGK1, TDH1,2,3
NAD biosynthesis                BNA4,6
one-carbon metabolism           GCV1,2,3
pyridoxine metabolism           SNO1, SNZ1
thiamin biosynthesis 1          THI5,12
thiamin biosynthesis 2          THI2,20
hexose transport                HXT4, GSY1
sodium ion transport            ENA1,2,5
polyamine transport             TPO2,3
nucleocytoplasmic transport     KAP123, NUP100
ribosome/RNA biogenesis         MAK16, CBF5
ribosomal proteins              RPS1A, RPL28
translational elongation        TEF1,2
protein folding                 SSA1, HSP60
secretion                       VTH1, KRE11
protein glycosylation           ALG6, CAX4
vesicle-mediated transport      VPS5, IMH1
proteasome                      RPN6, RPT5
vacuole fusion                  VTC1,3,4, PHO84
mitoribosome/respiration        MRPL1, MRPS5
Mitochond. electron trans.      ATP1, COX4
iron transport/TCA cycle        FRE1, FET3
Chromatin/transcription         SNF2, CHD1, DOT6
histones                        HTA1, HHF1
MCM2/3/6/CDC47                  MCM2,3,6
DNA replication                 RFA1, POL12
mitotic cell cycle              SPC110, CIN8
CLB1/CLB6/BBP1                  CLB1,6
cytokinesis                     CTS1, EGT2
development                     PAM1, GIC2
pheromone response              FUS3, FAR1
conjugation                     CIK1, KAR3
sporulation/meiosis             SPO11, SPO19
response to oxidative stress    GDH3, HYR1
stress/heat shock               HSP104, SSA4

Candidate regulator (in slide order; the column has blank rows between the groups below, so the row alignment is not recoverable):
GCN4; ARG80/81; ARG80/81/UME6/RPD3; ARO80; GCN4/HAP1/HAP2; LEU3, GCN4; LYS14; CBF1, MET28, MET32; MET31, MET32; BAS1, BAS2, GCN4
RTG3; ECM22/UPC2; INO4; GCR1
THI2/THI3; THI2/THI3; GCR1; NRG1, MIG1; HAA1; RRPE-binding factor; PAC/RRPE-binding factors
HAC1, ROX1; RLM1; XBP1
RPN4; PHO4
HAP2/3/4/5; MAC1/RCS1/AFT1/PDR1/3
HIR1, HIR2; ECB; MCB; HCM1; FKH1; ACE2, SWI4
MATALPHA2, STE12; KAR4; NDT80; ROX1, MSN2, MSN4; MSN2, MSN4

Gene Function Prediction

Microarray expression data

Co-regulated groups of genes

Functional categories

Predict functions of new genes

cis, trans regulators

Motifs: Splice Sites & Binding Sites

Identifying Motifs

Genes are turned on or off by regulatory proteins

These proteins bind to upstream regulatory regions of genes to either attract or block an RNA polymerase

Regulatory protein (TF) binds to a short DNA sequence called a motif (TFBS)

So finding the same motif in multiple genes’ regulatory regions suggests a regulatory relationship amongst those genes

Identifying Motifs: Complications

We do not know the motif sequence

We do not know where it is located relative to the gene’s start

Motifs can differ slightly from one gene to the next

How to discern it from “random” motifs?

Regulatory Motif Discovery

[Figure: DNA of a group of co-regulated genes with a common subsequence marked]

Find motifs within groups of co-regulated genes


Random Sample

atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca

tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag

gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca

Implanting Motif AAAAAAAAGGGGGGG

atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa

tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag

gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa

Where is the Implanted Motif?

atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga

tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag

gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga

Implanting Motif AAAAAAAAGGGGGGG with Four Mutations

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

Where is the Motif???

atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga
tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag
gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga

Why Is Finding a (15,4) Motif Difficult?

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

AgAAgAAAGGttGGG

cAAtAAAAcGGcGGG

..|..|||.|..|||
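
The comparison above can be checked with a short sketch. Each instance lies within 4 mismatches of the implanted motif, but two instances may differ from each other in up to 8 positions, so pairwise similarity between instances is much weaker than similarity to the motif. The function name `hamming` and the case-insensitive comparison are choices made for this illustration, not part of the slides.

```python
def hamming(a: str, b: str) -> int:
    """Count positions where two equal-length strings differ (case-insensitive)."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a.lower(), b.lower()))

motif = "AAAAAAAAGGGGGGG"   # the implanted 15-mer
inst1 = "AgAAgAAAGGttGGG"   # instance from the first sequence
inst2 = "cAAtAAAAcGGcGGG"   # instance from the second sequence

# Distance of each instance to the motif, then between the two instances.
print(hamming(motif, inst1), hamming(motif, inst2), hamming(inst1, inst2))
# prints: 4 4 7
```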

Challenge Problem

Find a motif in a sample of

- 20 "random" sequences (e.g. 600 nt long),

- each sequence containing an implanted pattern of length 15,

- each pattern appearing with 4 mismatches as a (15,4)-motif.

Pevzner et al.
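
A test instance of this challenge could be generated roughly as follows. This is a minimal sketch under stated assumptions: bases are drawn uniformly, each sequence receives exactly one copy of the motif with exactly `d` substitutions, and the names `mutate` and `implant` and the fixed `seed` are hypothetical helpers introduced here.

```python
import random

ALPHABET = "acgt"

def mutate(motif, d, rng):
    """Return a copy of motif with exactly d positions changed to a different base."""
    out = list(motif)
    for p in rng.sample(range(len(motif)), d):
        out[p] = rng.choice([b for b in ALPHABET if b != out[p]])
    return "".join(out)

def implant(motif, d, t=20, n=600, seed=0):
    """Build t uniform random sequences of length n, each holding one mutated motif copy."""
    rng = random.Random(seed)
    seqs, starts = [], []
    for _ in range(t):
        seq = [rng.choice(ALPHABET) for _ in range(n)]
        start = rng.randrange(n - len(motif) + 1)
        # Overwrite a window of the background with the mutated instance.
        seq[start:start + len(motif)] = mutate(motif, d, rng)
        seqs.append("".join(seq))
        starts.append(start)
    return seqs, starts

MOTIF = "aaaaaaaaggggggg"
seqs, starts = implant(MOTIF, d=4)
print(len(seqs), len(seqs[0]))  # prints: 20 600
```

Keeping the start positions makes it possible to score a motif finder against the known answer.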

Discrete Formulations

Given sequences $S = \{x_1, \dots, x_n\}$

A motif $W$ is a consensus string $w_1 \dots w_K$

Find the motif $W^*$ with the "best" match to $x_1, \dots, x_n$

Definition of "best":

$d(W, x_i)$ = minimum Hamming distance between $W$ and any $K$-long word in $x_i$

$d(W, S) = \sum_i d(W, x_i)$
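
The two distances defined above can be written down directly. This is a minimal sketch; the names `hamming`, `d_seq` and `d_total` and the toy sequences are assumptions made for illustration, and every sequence is assumed to be at least K long.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def d_seq(W, x):
    """d(W, x): minimum Hamming distance between W and any |W|-long word of x."""
    K = len(W)
    return min(hamming(W, x[i:i + K]) for i in range(len(x) - K + 1))

def d_total(W, S):
    """d(W, S): sum of d(W, x) over all sequences x in S."""
    return sum(d_seq(W, x) for x in S)

S = ["ttacgttt", "ggacctgg", "acgacc"]   # toy data, not from the slides
print([d_seq("acgt", x) for x in S], d_total("acgt", S))
# prints: [0, 1, 1] 2
```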

Exhaustive Searches

1. Pattern-driven algorithm:

For W = AA…A to TT…T ($4^K$ possibilities)

    Find d(W, S)

Report W* = argmin( d(W, S) )

Running time: $O(K N 4^K)$ (where $N = \sum_i |x_i|$)

Advantage: Finds provably "best" motif W

Disadvantage: Time
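
The pattern-driven search can be sketched as a loop over all $4^K$ candidate words. The helper names and the toy input are assumptions for illustration; ties are broken by enumeration order, which the slides do not specify.

```python
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def d_total(W, S):
    """Sum over sequences of the best (minimum) Hamming match to W."""
    K = len(W)
    return sum(min(hamming(W, x[i:i + K]) for i in range(len(x) - K + 1)) for x in S)

def pattern_driven(S, K):
    """Try every K-mer over {a,c,g,t}; return the one minimizing d(W, S)."""
    best_W, best_d = None, float("inf")
    for letters in product("acgt", repeat=K):   # 4**K candidates
        W = "".join(letters)
        d = d_total(W, S)
        if d < best_d:                          # keep the first best candidate
            best_W, best_d = W, d
    return best_W, best_d

S = ["ttacgttt", "ggacctgg", "acgacc"]   # toy data, not from the slides
best_W, best_d = pattern_driven(S, K=4)
print(best_d)  # prints: 2
```

For K = 15 the loop would visit about 10^9 words, which is the time disadvantage named above.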

Exhaustive Searches

2. Sample-driven algorithm:

For W = any K-long word occurring in some $x_i$

    Find d(W, S)

Report W* = argmin( d(W, S) ), or report a local improvement of W*

Running time: $O(K N^2)$

Advantage: Time

Disadvantage: If the true motif is weak and does not occur in the data, then a random motif may score better than any instance of the true motif
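
A sample-driven search restricts the candidates to words that actually occur in the input. The sketch below (hypothetical helper names, toy data chosen for this illustration) also shows the stated disadvantage: the consensus `acgt` never occurs exactly, so the best word found in the data scores worse than the true motif would.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def d_total(W, S):
    """Sum over sequences of the best (minimum) Hamming match to W."""
    K = len(W)
    return sum(min(hamming(W, x[i:i + K]) for i in range(len(x) - K + 1)) for x in S)

def sample_driven(S, K):
    """Score only K-mers that occur in some sequence; return the best one."""
    candidates = {x[i:i + K] for x in S for i in range(len(x) - K + 1)}
    # sorted() makes tie-breaking deterministic (alphabetical order).
    return min(sorted(candidates), key=lambda W: d_total(W, S))

# Three instances of "acgt", each carrying one mutation (toy data).
S = ["aagt", "acct", "acga"]
best = sample_driven(S, K=4)
print(d_total(best, S), d_total("acgt", S))
# prints: 4 3
```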

Consensus splice sites

Donor: 7.9 bits

Acceptor: 9.4 bits

Example of Consensus Sequence

Obtained by choosing the most frequent base at each position of the multiple alignment of subsequences of interest:

TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT

consensus sequence: TATAAT

consensus (IUPAC): TATRNT

Leads to loss of information and can produce many false positive or false negative predictions.

MELON
MANGO
HONEY
SWEET
COOKY

consensus: MONEY
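
The column-wise majority rule can be sketched in a few lines. The tie-breaking rule (alphabetically first letter among equally frequent ones) is an assumption made here so that the result is deterministic; the slide does not say how ties are resolved. The word example also shows the information loss: the consensus need not be one of the input strings.

```python
from collections import Counter

def consensus(seqs):
    """Most frequent letter per alignment column; ties go to the alphabetically first."""
    out = []
    for col in zip(*seqs):
        counts = Counter(col)
        # max() returns the first maximal item, and sorted() puts letters in order.
        out.append(max(sorted(counts), key=counts.get))
    return "".join(out)

sites = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
words = ["MELON", "MANGO", "HONEY", "SWEET", "COOKY"]

print(consensus(sites))                                   # prints: TATAAT
print(consensus(words), consensus(words) in words)        # prints: MONEY False
```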

Sequence Logos

TGGGGGA
TGAGAGA
TGGGGGA
TGAGAGA
TGAGGGA

Characteristics of Regulatory Motifs

Tiny

Highly variable

~Constant size, because a constant-size transcription factor binds

Often repeated

Low-complexity-ish

Weight Matrices & Sequence Logos

Set of signal sequences:

     1  2  3  4  5  6  7  8  9 10 11 12 13 14
1    G  A  C  C  A  A  A  T  A  A  G  G  C  A
2    G  A  C  C  A  A  A  T  A  A  G  G  C  A
3    T  G  A  C  T  A  T  A  A  A  A  G  G  A
4    T  G  A  C  T  A  T  A  A  A  A  G  G  A
5    T  G  C  C  A  A  A  A  G  T  G  G  T  C
6    C  A  A  C  T  A  T  C  T  T  G  G  G  C
7    C  A  A  C  T  A  T  C  T  T  G  G  G  C
8    C  T  C  C  T  T  A  C  A  T  G  G  G  C

Position Frequency Matrix (PFM):

     1  2  3  4  5  6  7  8  9 10 11 12 13 14
A    0  4  4  0  3  7  4  3  5  4  2  0  0  4
C    3  0  4  8  0  0  0  3  0  0  0  0  2  4
G    2  3  0  0  0  0  0  0  1  0  6  8  5  0
T    3  1  0  0  5  1  4  2  2  4  0  0  1  0

Position Weight Matrix (PWM):

       1      2      3      4      5      6      7      8      9     10     11     12     13     14
A  -1.93   0.79   0.79  -1.93   0.45   1.50   0.79   0.45   1.07   0.79   0.00  -1.93  -1.93   0.79
C   0.45  -1.93   0.79   1.68  -1.93  -1.93  -1.93   0.45  -1.93  -1.93  -1.93  -1.93   0.00   0.79
G   0.00   0.45  -1.93  -1.93  -1.93  -1.93  -1.93  -1.93  -0.66  -1.93   1.30   1.68   1.07  -1.93
T   0.45  -0.66  -1.93  -1.93   1.07  -0.66   0.79   0.00   0.00   0.79  -1.93  -1.93  -0.66  -1.93

Score for New Sequence:

       T      T      G      C      A      T      A      A      G      T      A      G      T      C
    0.45  -0.66   0.79   1.68   0.45  -0.66   0.79   0.45  -0.66   0.79   0.00   1.68  -0.66   0.79
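
The step from counts to weights can be sketched as follows. The slides do not state the formula; the values shown are consistent with a log-odds score against a uniform background of 0.25, using a pseudocount of sqrt(N) distributed by the background frequency, so that rule is used here as an assumption. The function names are hypothetical.

```python
import math

# The eight aligned signal sequences from the slide.
SITES = [
    "GACCAAATAAGGCA", "GACCAAATAAGGCA", "TGACTATAAAAGGA", "TGACTATAAAAGGA",
    "TGCCAAAAGTGGTC", "CAACTATCTTGGGC", "CAACTATCTTGGGC", "CTCCTTACATGGGC",
]

def build_pfm(sites):
    """Count each base at each alignment position."""
    width = len(sites[0])
    return {b: [sum(s[i] == b for s in sites) for i in range(width)] for b in "ACGT"}

def build_pwm(pfm, n_sites, background=0.25):
    """Log2 odds versus background, with a sqrt(N) pseudocount (assumed rule)."""
    pseudo = math.sqrt(n_sites)
    return {
        b: [math.log2((n + pseudo * background) / (n_sites + pseudo) / background)
            for n in row]
        for b, row in pfm.items()
    }

def score(pwm, seq):
    """Sum the position-specific weights of the bases in seq."""
    return sum(pwm[b][i] for i, b in enumerate(seq))

pfm = build_pfm(SITES)
pwm = build_pwm(pfm, len(SITES))
print(round(pwm["C"][3], 2), round(pwm["A"][0], 2))   # position 4 is always C; A never occurs at position 1
print(round(score(pwm, SITES[0]), 2))                 # score of one of the training sites
```

Scoring a new sequence is then a table lookup per position followed by a sum, as on the slide.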

Sequence Logo & Information content
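
The height of a column in a sequence logo is its information content. A minimal sketch, assuming the usual definition of 2 bits minus the Shannon entropy of the observed base frequencies, with no small-sample correction (the slides do not say which variant was used):

```python
import math
from collections import Counter

# The eight aligned signal sequences from the weight-matrix slide.
SITES = [
    "GACCAAATAAGGCA", "GACCAAATAAGGCA", "TGACTATAAAAGGA", "TGACTATAAAAGGA",
    "TGCCAAAAGTGGTC", "CAACTATCTTGGGC", "CAACTATCTTGGGC", "CTCCTTACATGGGC",
]

def information_content(sites):
    """Per-column information in bits: 2 - H(column) for a 4-letter alphabet."""
    n = len(sites)
    ic = []
    for col in zip(*sites):
        # Only observed bases contribute; unobserved bases have p = 0.
        entropy = -sum((c / n) * math.log2(c / n) for c in Counter(col).values())
        ic.append(2.0 - entropy)
    return ic

ic = information_content(SITES)
print(ic[2], ic[3])        # prints: 1.0 2.0
print(round(sum(ic), 2))   # total information of the motif in bits
```

Position 4 is always C and carries the full 2 bits, while position 3 is split evenly between A and C and carries 1 bit. Summing over positions gives a total of the kind quoted for splice sites (for example 7.9 bits for the donor site).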

Motifs in Biological Sequences

[Figure: sequence data (R, l) with motif windows 1, …, K]

$A = (a_1, \dots, a_K)$: positions of the windows

$\Theta = (\theta_{1,A}, \dots, \theta_{w,T})$: probability of different bases in the window

$\theta_0 = (\theta_A, \dots, \theta_T)$: background frequencies of nucleotides

Priors: $A$ has a uniform prior; $\theta_j$ has a Dirichlet($N_0$) prior (base frequency in genome), where $N_0$ is the pseudocount

Natural Extensions to Basic Model

Correlated nucleotide occurrence in motif: Modeling within-motif dependence for transcription factor binding site predictions. Bioinformatics, 6, 909-916.

Regulatory modules: De novo cis-regulatory module elicitation for eukaryotic genomes. Proc Nat’l Acad Sci USA, 102, 7079-84.

[Figure labels: M1, M2, M3, Start, Stop, Gene A, Gene B]

Insertion-deletion: BALSA: Bayesian algorithm for local sequence alignment. Nucl. Acids Res., 30, 1268-77.

[Figure labels: windows 1, …, K; w1, w2, w3, w4]
