
Advanced Algorithms and Models for Computational Biology -- a machine learning approach

Jan 13, 2016


Advanced Algorithms and Models for Computational Biology -- a machine learning approach. Introduction to cell biology, genomics, development, and probability. Eric Xing. Lecture 2, January 23, 2006. Reading: Chap. 1, DTM book.
Transcript
Page 1: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Advanced Algorithms and Models for Computational Biology -- a machine learning approach

Introduction to cell biology, genomics, development, and probability

Eric Xing

Lecture 2, January 23, 2006

Reading: Chap. 1, DTM book

Page 2: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Introduction to cell biology, functional genomics, development, etc.

Page 3: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Model Organisms

Page 4: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Bacterial Phage: T4

Page 5: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Bacteria: E. coli

Page 6: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

The Budding Yeast: Saccharomyces cerevisiae

Page 7: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

The Fission Yeast: Schizosaccharomyces pombe

Page 8: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

The Nematode: Caenorhabditis elegans

Features of the nematode Caenorhabditis elegans:

• Small: ~250 µm
• Transparent
• 959 cells
• 300 neurons
• Short generation time
• Simple growth medium
• Self-fertilizing hermaphrodite
• Rapid isolation and cloning of multiple types of mutant organisms

Page 9: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

The Fruit Fly: Drosophila melanogaster

Page 10: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

The Mouse

transgenic for human growth hormone

Page 11: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Prokaryotic and Eukaryotic Cells

Page 12: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

A Close Look at a Eukaryotic Cell

The structure:

The information flow:

Page 13: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Cell Cycle

Page 14: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Signal Transduction

A variety of plasma membrane receptor proteins bind extracellular signaling molecules and transmit signals across the membrane to the cell interior

Page 15: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Signal Transduction Pathway

Page 16: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Functional Genomics and X-omics

Page 17: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

A Multi-resolution View of the Chromosome

Page 18: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

DNA Content of Representative Types of Cells

| Organism | Number of base pairs (millions) | Number of encoded proteins | Number of chromosomes |
|---|---|---|---|
| PROKARYOTIC | | | |
| Mycoplasma genitalium (bacterium) | 0.58 | 470 | 1 |
| Helicobacter pylori (bacterium) | 1.67 | 1590 | 1 |
| Haemophilus influenzae (bacterium) | 1.83 | 1743 | 1 |
| EUKARYOTIC | | | |
| Saccharomyces cerevisiae (yeast) | 12 | 5885 | 17 |
| Drosophila melanogaster (insect) | 165 | 13,601 | 4 |
| Caenorhabditis elegans (worm) | 97 | 19,099 | 6 |
| Homo sapiens (human) | 2900 | 30,000 to 40,000 | 23 |
| Arabidopsis thaliana (plant) | 125 | 25,498 | 10 |

Page 19: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Functional Genomics

The various genome projects have yielded the complete DNA sequences of many organisms, e.g., human, mouse, yeast, and fruit fly. Human: 3 billion base pairs, 30-40 thousand genes.

Challenge: go from sequence to function, i.e., define the role of each gene and understand how the genome functions as a whole.

Page 20: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Regulatory Machinery of Gene Expression

[Figure: regulatory machinery of gene expression; one element is labeled "motif"]

Page 21: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Classical Analysis of Transcription Regulation Interactions

"Gel shift": electrophoretic mobility shift assay ("EMSA") for DNA-binding proteins

[Figure: gel with marked bands: free DNA probe; protein-DNA complex]

Advantage: sensitive

Disadvantage: requires stable complex; little "structural" information about which protein is binding

Page 22: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Modern Analysis of Transcription Regulation Interactions

Genome-wide Location Analysis (ChIP-chip)

Advantage: high throughput

Disadvantage: inaccurate

Page 23: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Gene Regulatory Network

Page 24: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Biological Networks and Systems Biology

Gene expression networks

Regulatory networks

Protein-protein interaction networks

Metabolic networks

Systems Biology: understanding cellular events in a system-level context

Genome + proteome + lipome + …

Page 25: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Gene Regulatory Functions in Development

Page 26: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Temporal-spatial Gene Regulation and Regulatory Artifacts

[Figure panels: a normal fly; "Hopeful monster?"]

Page 27: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Microarray or Whole-body ISH?

Page 28: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Gene Regulation and Carcinogenesis

[Figure: cell-cycle diagram (G0 or G1, S, G2, M) relating gene regulation to cancer. Recoverable labels: PCNA (not cycle specific); Gadd45 (DNA repair); Rb, E2F, and phosphorylated Rb (Rb-P); Cyclin (labels E, A, B) and Cdk, with "phosphorylation of" arrows; p14, p15, p16, p21, and p53 (transcriptional activation); Fas, TNF, TGF-...; apoptosis; oncogenetic stimuli (i.e., Ras); extracellular stimuli (TGF-β); arrows marked "promotes", "inhibits", and "activates"; cell damage, time required for DNA repair, severe DNA damage; outcome: "Cancer!"]

Page 29: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

The Pathogenesis of Cancer

[Figure panels: Normal, BCH, CIS, DYS, SCC]

Page 30: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Genetic Engineering: Manipulating the Genome

Restriction enzymes, which occur naturally in bacteria, cut DNA at very specific places.
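As a rough illustration of what "very specific places" means, the sketch below treats a restriction enzyme as a pattern match on one DNA strand. It uses EcoRI, which recognizes GAATTC and cuts between the G and the first A. The function name `digest`, the `cut_offset` parameter, and the example sequence are hypothetical, and only the top strand is modeled (no sticky ends, no reverse strand).

```python
def digest(sequence, site="GAATTC", cut_offset=1):
    """Cut `sequence` at every occurrence of `site` (top strand only).

    `cut_offset` is the number of bases of the site that stay on the
    left-hand fragment; 1 corresponds to EcoRI cutting G^AATTC.
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])  # remainder after the last cut
    return fragments


# Hypothetical sequence containing two EcoRI sites.
dna = "TTGAATTCAAGGGAATTCTT"
fragments = digest(dna)
print(fragments)
```

Joining the fragments back together recovers the original sequence, which is a quick sanity check that no bases were lost in the digest.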

Page 31: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Recombinant DNA

Page 32: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Transformation

Page 33: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Formation of Cell Colony

Page 34: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

How was Dolly cloned?

Dolly is claimed to be an exact genetic replica of another sheep.

Is it exactly "exact"?

Page 35: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Definitions

Recombinant DNA: Two or more segments of DNA that have been combined by humans into a sequence that does not exist in nature.

Cloning: Making an exact genetic copy. A clone is one of the exact genetic copies.

Cloning vector: A self-replicating agent that serves as a vehicle to transfer and replicate genetic material.

Page 36: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Software and Databases

NCBI/NLM Databases: GenBank, PubMed, PDB

Data types: DNA, protein, protein 3D structure, literature

Page 37: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Introduction to Probability

[Figure: two plots of f(x) against x]

Page 38: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Basic Probability Theory Concepts

A sample space S is the set of all possible outcomes of a conceptual or physical, repeatable experiment. (S can be finite or infinite.) E.g., S may be the set of all possible nucleotides of a DNA site: $S \equiv \{A, T, C, G\}$

A random variable is a function that associates a unique numerical value (a token) with every outcome of an experiment. (The value of the r.v. will vary from trial to trial as the experiment is repeated.)

[Figure: the map X(·) from outcomes in S to values]

E.g., seeing an "A" at a site: X=1, o/w X=0. This describes the true or false outcome of a random event.

Can we describe richer outcomes in the same way (i.e., X=1, 2, 3, 4 for being A, C, G, T)? Think about what would happen if we take the expectation of X.

Unit-base random vector: $X_i = [X_{iA}, X_{iT}, X_{iG}, X_{iC}]^T$; $X_i = [0,0,1,0]^T$ means seeing a "G" at site i.
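One way to explore the question about taking the expectation of X is to compare the two encodings on a few observed sites. This is a minimal sketch: the four example sites and the helper names `unit_basis` and `mean_vector` are made up for illustration. The average of the integer codes 1 to 4 depends on an arbitrary labeling, while the average of the unit-base vectors is simply the observed frequency of each nucleotide.

```python
# Same component order as X_i = [X_iA, X_iT, X_iG, X_iC]^T on the slide.
NUCLEOTIDES = ["A", "T", "G", "C"]


def unit_basis(nucleotide):
    """Return the unit-base (one-hot) vector for a nucleotide."""
    return [1 if nucleotide == n else 0 for n in NUCLEOTIDES]


def mean_vector(vectors):
    """Component-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


sites = list("GGAT")  # hypothetical observations at four sites

# Integer coding X = 1, 2, 3, 4 for A, C, G, T: the mean is not a nucleotide.
integer_codes = {"A": 1, "C": 2, "G": 3, "T": 4}
mean_code = sum(integer_codes[s] for s in sites) / len(sites)

# Unit-base coding: the mean is the vector of observed frequencies.
mean_onehot = mean_vector([unit_basis(s) for s in sites])

print(mean_code)    # a number between the codes for C and G
print(mean_onehot)  # frequencies of A, T, G, C
```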

Page 39: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Basic Prob. Theory Concepts, ctd

(In the discrete case), a probability distribution P on S (and hence on the domain of X) is an assignment of a non-negative real number P(s) to each $s \in S$ (or each valid value of x) such that $\sum_{s \in S} P(s) = 1$ (with $0 \le P(s) \le 1$).

Intuitively, P(s) corresponds to the frequency (or the likelihood) of getting s in the experiments, if repeated many times.

Call $\theta_s = P(s)$ the parameters in a discrete probability distribution.

A probability distribution on a sample space is sometimes called a probability model, in particular if several different distributions are under consideration. Write models as M1, M2 and probabilities as P(X|M1), P(X|M2).

E.g., M1 may be the appropriate prob. dist. if X is from a "splice site"; M2 is for the "background".

M is usually a two-tuple of {dist. family, dist. parameters}.
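The idea of comparing two models on the same observation can be sketched as follows. The parameter values for M1 and M2 are invented purely for illustration (the slide gives none), and `is_distribution` and `log_likelihood_ratio` are hypothetical helper names.

```python
import math

# Hypothetical parameters: each model assigns P(s) to every nucleotide s.
M1 = {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10}  # "splice site" model
M2 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # "background" model


def is_distribution(theta, tol=1e-9):
    """Check non-negativity and that the probabilities sum to one."""
    non_negative = all(p >= 0 for p in theta.values())
    return non_negative and abs(sum(theta.values()) - 1.0) < tol


def log_likelihood_ratio(x, m1, m2):
    """log P(x | M1) - log P(x | M2) for a single observed nucleotide x."""
    return math.log(m1[x] / m2[x])


llr_G = log_likelihood_ratio("G", M1, M2)  # positive: G favors M1
llr_A = log_likelihood_ratio("A", M1, M2)  # negative: A favors M2
print(llr_G, llr_A)
```

A positive ratio means the observation is more probable under M1 than under M2; it says nothing yet about how plausible each model was before seeing the data.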

Page 40: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

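Discrete Distributions

Bernoulli distribution: Ber(p)

$$P(x) = \begin{cases} 1-p & \text{for } x = 0 \\ p & \text{for } x = 1 \end{cases} \quad\Rightarrow\quad P(x) = p^{x}(1-p)^{1-x}$$

Multinomial distribution: Mult(1, θ)

Multinomial (indicator) variable:

$$X = [X_A, X_C, X_G, X_T]^T, \quad \text{where } X_j \in \{0, 1\} \text{ and } \sum_{j \in [A,C,G,T]} X_j = 1; \quad X_j = 1 \text{ w.p. } \theta_j, \quad \sum_{j \in [A,C,G,T]} \theta_j = 1.$$

$$p(x(j)) = P(\{X_j = 1, \text{ where } j \text{ indexes the observed nucleotide}\}) = \theta_j = \theta_A^{x_A}\,\theta_C^{x_C}\,\theta_G^{x_G}\,\theta_T^{x_T} = \prod_k \theta_k^{x_k}$$

Multinomial distribution: Mult(n, θ)

Count variable:

$$X = [x_1, \ldots, x_K]^T, \quad \text{where } \sum_j x_j = n$$

$$p(x) = \frac{n!}{x_1!\,x_2! \cdots x_K!}\,\theta_1^{x_1}\theta_2^{x_2}\cdots\theta_K^{x_K}$$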
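The Bernoulli and multinomial probability mass functions on this slide can be evaluated directly. The sketch below assumes a uniform parameter vector over the four nucleotides, which is an illustrative choice and not a value from the lecture; `bernoulli_pmf` and `multinomial_pmf` are hypothetical names.

```python
from math import factorial, prod


def bernoulli_pmf(x, p):
    """P(x) = p^x (1 - p)^(1 - x) for x in {0, 1}."""
    return p ** x * (1 - p) ** (1 - x)


def multinomial_pmf(counts, theta):
    """p(x) = n! / (x_1! ... x_K!) * prod_k theta_k^(x_k)."""
    n = sum(counts)
    coefficient = factorial(n)
    for x in counts:
        coefficient //= factorial(x)  # stays an exact integer at each step
    return coefficient * prod(t ** x for t, x in zip(theta, counts))


theta = [0.25, 0.25, 0.25, 0.25]  # hypothetical, uniform over four nucleotides

# Mult(1, theta): an indicator vector with a single 1 has probability theta_j.
p_indicator = multinomial_pmf([0, 0, 1, 0], theta)

# Mult(4, theta): each nucleotide observed exactly once in four draws.
p_counts = multinomial_pmf([1, 1, 1, 1], theta)

print(p_indicator, p_counts)
```

With n = 1 the multinomial coefficient is 1, so the count form reduces to the indicator form, which is why both cases can share one function.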

Page 41: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Basic Prob. Theory Concepts, ctd

A continuous random variable X can assume any value in an interval on the real line or in a region in a high-dimensional space. X usually corresponds to a real-valued measurement of some property, e.g., length, position, …

It is not possible to talk about the probability of the random variable assuming a particular value: P(X = x) = 0.

Instead, we talk about the probability of the random variable assuming a value within a given interval, or half interval:

$$P(X \in [x_1, x_2]), \qquad P(X < x) = P(X \in [-\infty, x])$$

The probability of the random variable assuming a value within some given interval from x1 to x2 is defined to be the area under the graph of the probability density function between x1 and x2.

Probability mass:

$$P(X \in [x_1, x_2]) = \int_{x_1}^{x_2} p(x)\,dx, \quad \text{note that } \int_{-\infty}^{\infty} p(x)\,dx = 1.$$

Cumulative distribution function (CDF):

$$P(x) = P(X < x) = \int_{-\infty}^{x} p(x')\,dx'$$

Probability density function (PDF):

$$p(x) = \frac{d}{dx} P(x)$$

Page 42: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Continuous Distributions

Uniform Probability Density Function

$$p(x) = \begin{cases} 1/(b-a) & \text{for } a \le x \le b \\ 0 & \text{elsewhere} \end{cases}$$

Normal Probability Density Function

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2 / 2\sigma^2}$$

The distribution is symmetric, and is often illustrated as a bell-shaped curve. Two parameters, μ (mean) and σ (standard deviation), determine the location and shape of the distribution. The highest point on the normal curve is at the mean, which is also the median and mode. The mean can be any numerical value: negative, zero, or positive.

Exponential Probability Distribution

$$\text{density: } p(x) = \frac{1}{\mu}\, e^{-x/\mu}, \qquad \text{CDF: } P(x \le x_0) = 1 - e^{-x_0/\mu}$$

[Figure: exponential density f(x) plotted against x, Time Between Successive Arrivals (mins.), with axis ticks from .1 to .4 and from 1 to 10; P(x < 2) = area = .4866]
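The area quoted in the exponential figure, P(x < 2) = .4866, can be checked against the closed-form CDF and against a direct numerical integral of the density. The mean of 3 minutes used below is an assumption: it is not stated in the transcript, but it is the value that reproduces .4866. The function names and the trapezoid-rule integrator are illustrative only.

```python
import math


def exponential_pdf(x, mu):
    """Density p(x) = (1 / mu) * exp(-x / mu) for x >= 0."""
    return math.exp(-x / mu) / mu if x >= 0 else 0.0


def exponential_cdf(x0, mu):
    """P(x <= x0) = 1 - exp(-x0 / mu) for x0 >= 0."""
    return 1.0 - math.exp(-x0 / mu) if x0 >= 0 else 0.0


def integrate(f, a, b, steps=10_000):
    """Trapezoid-rule approximation of the area under f between a and b."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


mu = 3.0  # assumed mean time between arrivals, in minutes

closed_form = exponential_cdf(2.0, mu)
numeric = integrate(lambda x: exponential_pdf(x, mu), 0.0, 2.0)

print(round(closed_form, 4))  # matches the .4866 shown in the figure
print(round(numeric, 4))
```

The agreement between the two numbers is the statement that the CDF is the integral of the density, evaluated here on one interval.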

Page 43: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

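Statistical Characterizations

Expectation (the center of mass, mean value, first moment):

$$E(X) = \begin{cases} \sum_{i \in S} x_i\, p(x_i) & \text{discrete} \\ \int_{-\infty}^{\infty} x\, p(x)\, dx & \text{continuous} \end{cases}$$

Sample mean:

Variance (the spread):

$$Var(X) = \begin{cases} \sum_{i \in S} [x_i - E(X)]^2\, p(x_i) & \text{discrete} \\ \int_{-\infty}^{\infty} [x - E(X)]^2\, p(x)\, dx & \text{continuous} \end{cases}$$

Sample variance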
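For the discrete case, expectation and variance are short sums, and the sample versions are the same sums taken over observed data. The sketch below uses a Bernoulli variable with p = 0.3 and a made-up data set. The sample variance here divides by N, which is one common convention and an assumption on my part; the transcript does not preserve the slide's formula, and dividing by N - 1 is the usual unbiased alternative.

```python
def expectation(values, probs):
    """E(X) = sum_i x_i p(x_i) for a discrete distribution."""
    return sum(x * p for x, p in zip(values, probs))


def variance(values, probs):
    """Var(X) = sum_i [x_i - E(X)]^2 p(x_i) for a discrete distribution."""
    mean = expectation(values, probs)
    return sum((x - mean) ** 2 * p for x, p in zip(values, probs))


def sample_mean(data):
    """Average of the observed values."""
    return sum(data) / len(data)


def sample_variance(data):
    """Average squared deviation from the sample mean (divides by N)."""
    m = sample_mean(data)
    return sum((x - m) ** 2 for x in data) / len(data)


p = 0.3  # hypothetical Bernoulli parameter
mean_x = expectation([0, 1], [1 - p, p])  # equals p
var_x = variance([0, 1], [1 - p, p])      # equals p * (1 - p)

data = [1, 0, 0, 1, 0]  # hypothetical observations
print(mean_x, var_x, sample_mean(data), sample_variance(data))
```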

Page 44: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Basic Prob. Theory Concepts, ctd

Joint probability: For events E (i.e., X=x) and H (say, Y=y), the probability that both events are true:

P(E and H) := P(x,y)

Conditional probability: The probability that E is true given the outcome of H:

P(E given H) := P(x|y)

Marginal probability: The probability that E is true regardless of the outcome of H:

P(E) := P(x) = $\sum_y P(x,y)$

Putting everything together:

P(x|y) = P(x,y)/P(y)
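The three quantities can be read off a small joint table. The table below is invented for illustration, with X a nucleotide and Y a binary label; the helper names are hypothetical.

```python
# Hypothetical joint distribution P(x, y); the four entries sum to one.
joint = {
    ("A", 0): 0.10, ("A", 1): 0.30,
    ("G", 0): 0.40, ("G", 1): 0.20,
}


def marginal_x(joint, x):
    """P(x) = sum over y of P(x, y)."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def marginal_y(joint, y):
    """P(y) = sum over x of P(x, y)."""
    return sum(p for (_, yi), p in joint.items() if yi == y)


def conditional_x_given_y(joint, x, y):
    """P(x | y) = P(x, y) / P(y)."""
    return joint[(x, y)] / marginal_y(joint, y)


p_A = marginal_x(joint, "A")                        # 0.10 + 0.30
p_A_given_1 = conditional_x_given_y(joint, "A", 1)  # 0.30 / 0.50
print(p_A, p_A_given_1)
```

In this table knowing Y = 1 raises the probability of "A" from 0.4 to 0.6, so X and Y are not independent.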

Page 45: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Independence and Conditional Independence

Recall that for events E (i.e. X=x) and H (say, Y=y), the conditional probability of E given H, written as P(E|H), is

P(E and H)/P(H)

(= the probability that both E and H are true, given that H is true)

E and H are (statistically) independent if

P(E) = P(E|H)

(i.e., the prob. that E is true doesn't depend on whether H is true); or equivalently

P(E and H)=P(E)P(H).

E and F are conditionally independent given H if

P(E|H,F) = P(E|H)

or equivalently

P(E,F|H) = P(E|H)P(F|H)
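The product test P(E and H) = P(E)P(H) can be applied mechanically to any two-variable joint table. In this sketch both tables are hypothetical: one is built as a product of marginals, so it is independent by construction, and the other puts all its mass on matching outcomes. `marginals` and `is_independent` are illustrative names.

```python
from itertools import product


def marginals(joint):
    """Return the two marginal distributions of a joint table P(x, y)."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py


def is_independent(joint, tol=1e-9):
    """True if P(x, y) = P(x) P(y) for every pair of outcomes."""
    px, py = marginals(joint)
    return all(
        abs(joint.get((x, y), 0.0) - px[x] * py[y]) < tol
        for x, y in product(px, py)
    )


# Independent by construction: every entry is a product of marginals.
p_e = {"E": 0.3, "notE": 0.7}
p_h = {"H": 0.6, "notH": 0.4}
independent = {(e, h): p_e[e] * p_h[h] for e, h in product(p_e, p_h)}

# Fully dependent: E is true exactly when H is true.
dependent = {
    ("E", "H"): 0.5, ("E", "notH"): 0.0,
    ("notE", "H"): 0.0, ("notE", "notH"): 0.5,
}

print(is_independent(independent), is_independent(dependent))
```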

Page 46: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

Representing multivariate dist.

Joint probability dist. on multiple variables:

$$P(X_1, X_2, X_3, X_4, X_5, X_6) = P(X_1)\,P(X_2|X_1)\,P(X_3|X_1,X_2)\,P(X_4|X_1,X_2,X_3)\,P(X_5|X_1,X_2,X_3,X_4)\,P(X_6|X_1,X_2,X_3,X_4,X_5)$$

If Xi's are independent: (P(Xi|·) = P(Xi))

$$P(X_1, X_2, X_3, X_4, X_5, X_6) = P(X_1)\,P(X_2)\,P(X_3)\,P(X_4)\,P(X_5)\,P(X_6) = \prod_i P(X_i)$$

If Xi's are conditionally independent, the joint can be factored to simpler products, e.g.,

P(X1, X2, X3, X4, X5, X6) = P(X1) P(X2| X1) P(X3| X2) P(X4| X1) P(X5| X4) P(X6| X2, X5)

The Graphical Model representation

[Figure: directed graph over nodes X1 to X6, with each node labeled by its factor: p(X1), p(X2| X1), p(X3| X2), p(X4| X1), p(X5| X4), p(X6| X2, X5)]
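To see what the six-variable factorization buys, the sketch below counts free parameters and evaluates the factored joint. It assumes every variable is binary, which the slide does not state, and fills the conditional tables with arbitrary made-up numbers. The parent sets are taken from the factors P(X1) P(X2| X1) P(X3| X2) P(X4| X1) P(X5| X4) P(X6| X2, X5).

```python
from itertools import product

# Parent set of each node, read off the factorization.
parents = {
    "X1": [], "X2": ["X1"], "X3": ["X2"],
    "X4": ["X1"], "X5": ["X4"], "X6": ["X2", "X5"],
}


def free_parameters_factored(parents):
    """Binary variables: one free number per configuration of the parents."""
    return sum(2 ** len(pa) for pa in parents.values())


def free_parameters_full(n_variables):
    """An unrestricted joint table over n binary variables."""
    return 2 ** n_variables - 1


def joint_probability(assignment, parents, cpt):
    """Multiply one conditional factor per node."""
    prob = 1.0
    for node, pa in parents.items():
        key = tuple(assignment[q] for q in pa)
        p_one = cpt[node][key]  # P(node = 1 | parents = key)
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob


# Hypothetical conditional tables: P(node = 1 | parents) grows with the parents.
cpt = {
    node: {key: 0.2 + 0.1 * sum(key)
           for key in product([0, 1], repeat=len(pa))}
    for node, pa in parents.items()
}

names = list(parents)
total = sum(
    joint_probability(dict(zip(names, values)), parents, cpt)
    for values in product([0, 1], repeat=len(names))
)

print(free_parameters_factored(parents), free_parameters_full(6))
print(total)  # the factored joint still sums to one over all 64 assignments
```

Under the binary assumption the factored form needs 13 numbers where the full table needs 63, and the gap widens quickly as variables are added.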

Page 47: Advanced Algorithms  and Models for  Computational Biology -- a machine learning approach

The Bayesian Theory

The Bayesian Theory: (e.g., for data D and model M)

P(M|D) = P(D|M)P(M)/P(D)

The posterior equals the likelihood times the prior, up to a constant.

This allows us to capture uncertainty about the model in a principled way
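A minimal numerical sketch of the rule, using two candidate models: all priors and likelihoods below are invented for illustration, and `posterior` is a hypothetical helper. The constant P(D) is obtained by summing likelihood times prior over the models under consideration.

```python
def posterior(priors, likelihoods):
    """P(M | D) = P(D | M) P(M) / P(D), with P(D) = sum_M P(D | M) P(M)."""
    unnormalized = {m: likelihoods[m] * priors[m] for m in priors}
    evidence = sum(unnormalized.values())  # P(D)
    return {m: u / evidence for m, u in unnormalized.items()}


# Hypothetical numbers: splice sites are rare a priori, but the observed
# data D (say, a "G" at this site) is more probable under that model.
priors = {"splice site": 0.01, "background": 0.99}
likelihoods = {"splice site": 0.70, "background": 0.25}

post = posterior(priors, likelihoods)
print(post)
```

The data raises the probability of the splice-site model above its prior, yet the background model remains far more probable, which is the kind of model uncertainty the posterior is meant to express.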