LectureW2_Intro_Array

Friday (1/15) computer lab session:

Location: 3073 (3rd floor), Department of Computational Biology, BST3, 3501 Fifth Avenue.

Time: 9:30-10:45 AM

Play with R (tutorial) at home before the lab session.

Agenda

• Introduction to microarray
– Motivation & previous techniques
  • Concept of biological pathway
  • Northern blot, RT-PCR and real-time RT-PCR
– Affymetrix microarray experiment
– cDNA microarray experiment
– Comparison of the two
– Codelink, Illumina & Agilent
– MAQC (Microarray Quality Control) Project

• Introduction to next generation sequencing (RNA-seq, ChIP-seq etc.)

Review

The central dogma of molecular biology:

DNA → (transcription) → RNA → (translation) → Protein

Transcription produces three kinds of RNA: mRNA (messenger), rRNA (ribosomal) and tRNA (transfer). mRNA is translated into protein at the ribosome.

Microarray is a technology that detects mRNA expression levels globally, measuring thousands of genes simultaneously.

Why detect expression level of protein or mRNA?

Cell cycle

Cancer cells are malignant cells that do not die but instead reproduce rapidly.

It is important to repair problematic mutations during cell division.

Example 1: p53 Pathway (an important tumor suppressor)

(DNA damaged)

http://breast-cancer-research.com/content/pdf/bcr426.pdf

Example 2: KRas Pathway (an oncogene)

( upregulation; downregulation)

Normal cell
– p53 properly suppresses cell replication.
– Ras genes properly activate cell replication.

Cancerous cell
– p53 does not suppress cell replication.
– Ras genes are overexpressed; cells replicate excessively.

From http://www.icnet.uk/axp/mphh/biomed/lemoine.html

Prediction of a disease:

If the mechanism is known, detecting expression levels can help identify cancer patients (e.g. unusual p53 or KRas expression activity).

Exploratory:

In general, microarrays can help identify candidate genes that contribute to tumor progression and suggest hypotheses about the underlying genetic network.

Why detect expression level of protein or mRNA?

http://www.escience.ws/b572/L13/north.html

Northern Blot (an old technique for measuring mRNA expression)

mRNA extracted and purified.

mRNA loaded for electrophoresis.

Lane 1: size standards. Lane 2: RNA to be tested.

The gel is charged and the RNAs “swim” through the gel according to their molecular weight.

The mRNAs are transferred from the gel to a membrane.

A labelled probe specific for the RNA fragment is incubated with the blot so that the RNA of interest can be detected.

See next page for the details of this step.

http://www.escience.ws/b572/L13/northupclose.html

Northern Blot close-up (color staining)

In this simplified cartoon, two mRNAs are bound on the membrane.

Labeled DNA complementary to A is prepared and hybridized to all the mRNA on the membrane.

The labeled complementary DNA will bind to A but not to B.

After washing and detection, the abundance of the target mRNA can be seen.

See animation of RT-PCR: http://www.bio.davidson.edu/courses/Immunology/Flash/RT_PCR.html

RT-PCR (reverse transcription-polymerase chain reaction)

http://www.ambion.com/techlib/basics/rtpcr/

1. RNA is reverse transcribed to DNA.
2. PCR is used to amplify the DNA at an exponential rate.
3. The amplified product is quantified on a gel.

---- a semi-quantitative method. A smaller amount of sample is needed.

Real-time RT-PCR

1. The PCR amplification is monitored by fluorescence in “real time”.
2. The fluorescence values recorded in each cycle represent the amount of amplified product.

---- a quantitative method. Currently the most advanced and accurate analysis of mRNA abundance. Often used to validate microarray results.
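The idea behind monitoring amplification cycle by cycle can be sketched as follows. This is a minimal illustration, assuming ideal amplification (the product grows by a factor of 1 + efficiency each cycle) and a fixed detection threshold; the function names, the threshold and the copy numbers are hypothetical and not part of any instrument's software.

```python
def copies_after(initial_copies, cycles, efficiency=1.0):
    """Product after a number of cycles under ideal exponential growth.

    efficiency=1.0 means perfect doubling in every cycle (an assumption).
    """
    return initial_copies * (1.0 + efficiency) ** cycles


def threshold_cycle(initial_copies, threshold, efficiency=1.0, max_cycles=60):
    """First cycle at which the product reaches the detection threshold."""
    copies = float(initial_copies)
    for cycle in range(max_cycles + 1):
        if copies >= threshold:
            return cycle
        copies *= 1.0 + efficiency
    return None  # threshold never reached within max_cycles


# A sample that starts with more template crosses the threshold earlier.
low_sample = threshold_cycle(100, 1e6)
high_sample = threshold_cycle(1000, 1e6)
print(low_sample, high_sample)
```

Under these assumptions, the sample with more starting template reaches the threshold in fewer cycles, which is why the cycle number carries quantitative information.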

Limitation of the old techniques

1. Labor intensive

2. Can only detect up to dozens of genes. (gene-by-gene analysis)

3. The target sequences must be known. For RT-PCR, at least the primer sequences needed to start the PCR must be known.

Various microarrays

A new view on genomic level

Affymetrix GeneChip

from Affymetrix Inc.

Overview of the Affymetrix GeneChip technology

From experiments to analysis

Details of labeling and hybridization

RNA → (reverse transcriptase) → DNA → (RNA polymerase) → RNA

Example sequences in the figure: TACGTATTGCAAAA, TTTTGCAATACGTA (its reverse complement), TACGTATTGCAAAA (biotin label at C and T)

• Only pyrimidines (C and T) are biotin-labeled. This is where the color intensities come from.

• Fragmentation makes the biotin-labeled cRNA shorter and improves hybridization efficiency.

• The sequence of the target mRNA must be known so that the complementary sequence can be prepared on the array.
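The complementarity used in the labeling figure can be checked with a few lines of Python. This is a small sketch using the two example sequences from the slide; the function name and the pyrimidine count are illustrative additions.

```python
# Watson-Crick pairing for DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Sequence of the strand that hybridizes to seq, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


target = "TACGTATTGCAAAA"           # example sequence from the slide
probe = reverse_complement(target)  # strand that would pair with it

# Number of pyrimidine positions (C and T) that could carry a label.
labeled_positions = sum(base in "CT" for base in target)
print(probe, labeled_positions)
```

Applying the function twice returns the original sequence, which is the property that lets a probe on the array capture its target.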

25-mer unique oligo

mismatch in the middle nucleotide

multiple probes (11~16) for each gene

from Affymetrix Inc.

Array Design
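A perfect-match / mismatch probe pair can be sketched as below. This is a minimal illustration, assuming the mismatch probe differs from the perfect match only at the middle (13th) base of the 25-mer, which is replaced by its complement; the example 25-mer is made up for illustration and is not a real probe.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def mismatch_probe(perfect_match):
    """Return the mismatch partner of a 25-mer perfect-match probe."""
    if len(perfect_match) != 25:
        raise ValueError("expected a 25-mer")
    middle = len(perfect_match) // 2  # index 12, i.e. the 13th base
    return (perfect_match[:middle]
            + COMPLEMENT[perfect_match[middle]]
            + perfect_match[middle + 1:])


pm = "ACGTACGTACGTACGTACGTACGTA"  # hypothetical 25-mer
mm = mismatch_probe(pm)
differences = [i for i in range(25) if pm[i] != mm[i]]
print(mm, differences)
```

The pair differs at exactly one position, so the mismatch probe is intended to capture mostly non-specific binding.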

from Affymetrix Inc.

Needs at most 4 × 25 = 100 masking and coupling steps.

Technology adapted from the semiconductor industry (photolithography and combinatorial chemistry).
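The bound of 4 × 25 = 100 steps follows from needing at most one mask per nucleotide (A, C, G, T) at each of the 25 positions. The sketch below counts the steps for a toy set of probes; it is a simplification that ignores the direction of synthesis and any mask optimization, and the probe sequences are hypothetical.

```python
BASES = "ACGT"


def max_steps(probe_length):
    """Upper bound: one masking/coupling step per base per position."""
    return len(BASES) * probe_length


def synthesis_steps(probes):
    """List (position, base, spots) for every step that couples something.

    spots holds the indices of the probes that receive this base at this
    position, i.e. the spots the mask leaves exposed.
    """
    steps = []
    for position in range(len(probes[0])):
        for base in BASES:
            spots = [i for i, p in enumerate(probes) if p[position] == base]
            if spots:
                steps.append((position, base, spots))
    return steps


toy_probes = ["ACGT", "AGGT", "TCGA"]  # hypothetical 4-mers
steps = synthesis_steps(toy_probes)
print(len(steps), "steps used;", max_steps(4), "at most")
print(max_steps(25), "steps at most for 25-mers")
```

The number of steps depends on the probe length, not on how many probes are on the array, which is what makes the approach scale to hundreds of thousands of probes.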

Array Manufacturing

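Sequence source
– HG-U95: UniGene database Build 95 (Oct. 2, 1999??)
– HG-U133 Set: UniGene database Build 133 (April 20, 2001)
– HG-U133 Plus 2.0 Array: UniGene database Build 133 (April 20, 2001)

Probe uniqueness
– HG-U95: 21/25 bases
– HG-U133 Set: two 8-mers including at least one 12-mer
– HG-U133 Plus 2.0 Array: two 8-mers including at least one 12-mer

# of probes
– HG-U95: ~16
– HG-U133 Set: 11
– HG-U133 Plus 2.0 Array: 11

# of arrays
– HG-U95: 5
– HG-U133 Set: 2
– HG-U133 Plus 2.0 Array: 1

# of transcripts
– HG-U95: ~54000 genes; HG-U95Av2: ~12000; HG-U95B-E: ~44000 EST
– HG-U133 Set: ~33,000 genes
– HG-U133 Plus 2.0 Array: ~38500 genes

Feature size
– HG-U95: 20 µm
– HG-U133 Set: 18 µm
– HG-U133 Plus 2.0 Array: 11 µm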

Chip Advances

A few years ago, the U95 set had 5 arrays. Normally only U95Av2 is used.

Improved probe selection algorithm to avoid non-specific binding. Decreased # of probes in each probe set (20 => 11).

Smaller probe size: 20 µm => 11 µm

More genes on each array and lower cost (only one array for HG-U133 Plus).

Chip Advances

Three steps: background adjustment, normalization, summarization.

Together they give an expression measure for each probe set on each array.

The result will greatly affect subsequent analysis (e.g. clustering and classification). If these steps are not modeled properly:

=> “Garbage in, garbage out”

Array Probe Level Analysis

Details will be discussed in the next lecture.
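As a preview, the three steps can be sketched with deliberately simple stand-ins. This is only an illustration under stated assumptions: subtracting the array minimum as "background", scaling every array to a common median as "normalization", and taking the median of log2 intensities as "summarization". These are not the methods of any particular software package, and the intensities and probe-set name are made up.

```python
import math
from statistics import median


def background_adjust(array, floor=1.0):
    """Subtract the smallest intensity; keep values positive for the log."""
    background = min(array)
    return [max(x - background, floor) for x in array]


def scale_normalize(arrays):
    """Rescale every array so that all arrays share the same median."""
    target = median(median(a) for a in arrays)
    return [[x * target / median(a) for x in a] for a in arrays]


def summarize(array, probe_sets):
    """One value per probe set: median of the log2 probe intensities."""
    return {name: median(math.log2(array[i]) for i in indices)
            for name, indices in probe_sets.items()}


# Two hypothetical arrays; the second is roughly twice as bright overall.
raw = [[101, 201, 401, 801], [201, 401, 801, 1601]]
probe_sets = {"hypothetical_gene": [1, 2, 3]}  # probe indices per probe set

adjusted = [background_adjust(a) for a in raw]
normalized = scale_normalize(adjusted)
expression = [summarize(a, probe_sets) for a in normalized]
print(expression)
```

In this toy example the overall brightness difference between the two arrays disappears after normalization, so the probe set gets the same expression measure on both.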

Spotted cDNA microarray

From experiments to analysis

1. 48 grids in a 12x4 pattern.

2. Each grid has 12x16 features (spots).

3. Total 9216 features (spots).

4. Each pin prints 3 grids.
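The layout numbers above are consistent with each other, as the short check below shows. The number of pins (16) is derived here from 48 grids at 3 grids per pin; which of 12 and 16 counts rows versus columns within a grid, and the indexing function, are assumptions made for illustration.

```python
GRID_ROWS, GRID_COLS = 12, 4    # 48 grids in a 12x4 pattern
SPOT_ROWS, SPOT_COLS = 12, 16   # 12x16 features per grid (orientation assumed)
GRIDS_PER_PIN = 3

n_grids = GRID_ROWS * GRID_COLS
features_per_grid = SPOT_ROWS * SPOT_COLS
total_features = n_grids * features_per_grid
n_pins = n_grids // GRIDS_PER_PIN


def feature_index(grid, row, col):
    """Running index of a spot, counting grid by grid, row by row."""
    return grid * features_per_grid + row * SPOT_COLS + col


print(n_grids, features_per_grid, total_features, n_pins)
```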

Probe (array) printing

Probe design and printing

From Y. Chen et al. 1997

The experiment

From: http://www.techfak.uni-bielefeld.de/ags/ai/projects/microarray/

An image example

Image analysis is more difficult than for Affymetrix arrays: the probes are spotted by a robot rather than synthesized in place, so their exact physical locations are not known.

Probe preparation
– cDNA: Probes are cDNA fragments, usually amplified by PCR and spotted by robot.
– GeneChip: Probes are short oligos synthesized using a photolithographic approach.

Colors
– cDNA: Two-color (measures relative intensity)
– GeneChip: One-color (measures absolute intensity)

Gene representation
– cDNA: One probe per gene
– GeneChip: 11-16 probe pairs per gene

Probe length
– cDNA: Long, varying lengths (hundreds to 1K bp)
– GeneChip: 25-mers

Density
– cDNA: Maximum of ~15000 probes
– GeneChip: 38500 genes * 11 probes = 423500 probes

Comparison of cDNA array and GeneChip

Affymetrix GeneChip: one-color design

cDNA microarray: two-color design

Why the difference?

Affymetrix GeneChip: photolithography (the amount of oligos on a probe is well controlled)

cDNA microarray: robotic spotting (the amount of cDNA spotted on a probe may vary greatly)
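Why a two-color design copes with variable spotting can be seen in a small calculation. This sketch assumes that both channel intensities scale with the amount of cDNA on the spot, so the spot amount cancels in the ratio; the channel names "red" and "green" and all numbers are illustrative.

```python
import math


def spot_intensities(spot_amount, red_abundance, green_abundance):
    """Both channels are assumed proportional to the amount of spotted cDNA."""
    return spot_amount * red_abundance, spot_amount * green_abundance


def log_ratio(red, green):
    """Relative expression of the two samples on one spot (log2 scale)."""
    return math.log2(red / green)


# Same gene, same two samples, but very different amounts of spotted cDNA.
well_spotted = spot_intensities(2.0, red_abundance=400, green_abundance=100)
poorly_spotted = spot_intensities(0.5, red_abundance=400, green_abundance=100)

print(well_spotted, log_ratio(*well_spotted))
print(poorly_spotted, log_ratio(*poorly_spotted))
```

The absolute intensities differ by a factor of four between the two spots, but the log ratio is the same, which is why the cDNA platform reports relative rather than absolute intensity.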

Advantage and disadvantage of cDNA array and GeneChip

• cDNA microarray: The data can be noisy and of variable quality.
  Affymetrix GeneChip: Specific and sensitive. Results very reproducible.

• cDNA microarray: Cross (non-specific) hybridization can often happen.
  Affymetrix GeneChip: Hybridization more specific.

• cDNA microarray: May need an RNA amplification procedure.
  Affymetrix GeneChip: Can use a small amount of RNA.

• cDNA microarray: Image analysis is more difficult.
  Affymetrix GeneChip: Image analysis and intensity extraction are easier.

• cDNA microarray: Need to search the database for gene annotation.
  Affymetrix GeneChip: More widely used. Better quality of gene annotation.

• cDNA microarray: Cheap (both initial cost and per-slide cost).
  Affymetrix GeneChip: Expensive (~$400 per array + labeling and hybridization).

• cDNA microarray: Can be custom made for special species.
  Affymetrix GeneChip: Only several popular species are available.

• cDNA microarray: Do not need to know the exact DNA sequence.
  Affymetrix GeneChip: Need the DNA sequence for probe selection.

Other platforms of microarray

• GE Codelink (out of market now)

• Illumina

• Agilent

Codelink

Fig. End-point attachment orients the DNA while the polymeric coating holds it away from the surface of the slide, making the DNA readily available for hybridization.

Codelink’s gel matrix

Probe preparation
– cDNA: Probes are cDNA fragments, usually amplified by PCR and spotted by robot.
– GeneChip: Probes are short oligos synthesized using a photolithographic approach.
– Codelink: 3-D aqueous gel matrix
– Agilent: Probes are printed by inkjet technology from HP.

Colors
– cDNA: Two-color (measures relative intensity)
– GeneChip: One-color (measures absolute intensity)
– Codelink: One-color
– Agilent: One- or two-color

Gene representation
– cDNA: One probe per gene
– GeneChip: 11-16 probe pairs per gene
– Codelink: One probe per gene

Probe length
– cDNA: Long, varying lengths (hundreds to 1K bp)
– GeneChip: 25-mers
– Codelink: 30-mers
– Agilent: 60-mers

Density
– cDNA: Maximum of ~15000 probes
– GeneChip: 38500 genes * 11 probes = 423500
– Codelink: ~57000
– Agilent: ~22000 probes

Manufacturer
– cDNA: Stanford and many labs
– GeneChip: Affymetrix
– Codelink: GE
– Agilent: Agilent

Comparisons

Mechanisms in microarray

Important mechanisms that make microarray work:

1. Reverse transcription: mRNA => cDNA. This is usually also the step at which the dye labels are attached.

(Protein cannot be reverse translated to mRNA or to another form, so it is difficult to label with dyes.)

2. Double-strand binding of complementary DNA sequences.

(Protein has no such property: there are 20 amino acids and no complementary binding.)

Microarray Quality Control (MAQC) Project

a series of papers published in Nature Biotechnology (Sep 2006)

Previous paper in NAR 2003

• Evaluation of gene expression measurements from commercial microarray platforms. Tan et al. Nucleic Acids Research. 2003. 31:5676-5684.

• Poor consistency made it a concern for precise science and routine clinical use.

• Three commercial platforms were compared.

• Inconsistent results were found across platforms.

Experiment Design

• 7 microarray platforms; each platform was run at 3 test sites; at each site, 4 pools of RNA were hybridized with 5 replicates each (3*4*5=60 arrays for each platform).

• The 4 pools of RNA are: A. 100% UHRR; B. 100% HBRR; C. 75% UHRR + 25% HBRR; D. 25% UHRR + 75% HBRR.
– UHRR: Universal Human Reference RNA from Stratagene
– HBRR: Human Brain Reference RNA from Ambion

• 3 RT-PCR based alternative gene expression platforms were also tested: TaqMan, StaRT-PCR and QuantiGene assays.
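The design above can be enumerated in a few lines. The sketch also shows why the mixed pools C and D are useful: if signal is assumed to mix linearly in the stated proportions (an assumption made here for illustration), the expected signal for C and D follows from A and B. The site labels and signal values are hypothetical.

```python
from itertools import product

SITES = ["site_1", "site_2", "site_3"]
SAMPLES = ["A", "B", "C", "D"]
REPLICATES = [1, 2, 3, 4, 5]

# One entry per hybridization on a single platform.
hybridizations = list(product(SITES, SAMPLES, REPLICATES))

# Fractions of (UHRR, HBRR) in each pool, taken from the design.
MIX = {"A": (1.00, 0.00), "B": (0.00, 1.00),
       "C": (0.75, 0.25), "D": (0.25, 0.75)}


def expected_signal(sample, signal_a, signal_b):
    """Signal expected for a pool, assuming linear mixing of A and B."""
    uhrr_fraction, hbrr_fraction = MIX[sample]
    return uhrr_fraction * signal_a + hbrr_fraction * signal_b


print(len(hybridizations), "arrays per platform")
for sample in SAMPLES:
    print(sample, expected_signal(sample, signal_a=100.0, signal_b=200.0))
```

Under the linear-mixing assumption, a gene's signal should be ordered A, C, D, B (or the reverse), which gives a built-in check without knowing the true expression values.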

Experiment Design

• NCI has only 2 test sites. AGL has only 2 samples. Some problematic arrays were removed.

• AGL is not included in this paper. A total of 386 arrays were analyzed.

Difficulties in comparing multiple platforms

• Each platform has a different probe design.

• Sensitivity and specificity of the probes differ. (Some cross-platform variability may be due to this annotation problem.)

• Databases (NCBI RefSeq) often change, making it difficult to match probes.

• Probes may bind to multiple alternatively spliced transcripts, which may have different functions and expression patterns.

Kuo (2006): probe matching within one exon for Gas1

Gene matching across different platforms is not easy. Essentially each platform detects different targets.

Match genes across platforms

• All probes were mapped to the RefSeq and AceView databases.

• Each platform assayed 15,429-16,990 Entrez genes.

• 23,971 of 24,157 RefSeq NM accessions were assayed on at least one platform. Among them, 15,615 accessions (which correspond to 12,091 Entrez genes) were assayed on all platforms.

• When multiple probes match one RefSeq accession, only the probe closest to the 3’ end is used.

• Finally, each platform has 12,091 probes matching a common set of 12,091 RefSeq accessions from 12,091 different genes.
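The matching rules above can be sketched as a small procedure: keep one probe per accession (the one closest to the 3’ end), then keep only accessions present on every platform. The data layout, platform names, probe IDs, accession labels and distances below are all hypothetical placeholders.

```python
def pick_3prime_probes(probes):
    """Keep, for each accession, the probe with the smallest 3' distance.

    probes: list of (probe_id, accession, distance_to_3prime_end).
    """
    best = {}
    for probe_id, accession, distance in probes:
        current = best.get(accession)
        if current is None or distance < current[1]:
            best[accession] = (probe_id, distance)
    return {accession: probe_id for accession, (probe_id, _) in best.items()}


def common_accessions(platforms):
    """Accessions that have a selected probe on every platform."""
    selected = [set(pick_3prime_probes(p)) for p in platforms.values()]
    return set.intersection(*selected)


platforms = {
    "platform_1": [("p1", "NM_A", 350), ("p2", "NM_A", 120), ("p3", "NM_B", 80)],
    "platform_2": [("q1", "NM_A", 40), ("q2", "NM_C", 10)],
}

print(pick_3prime_probes(platforms["platform_1"]))
print(common_accessions(platforms))
```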

Number of detected genes called by the manufacturers’ software

CV (coefficient of variation) of 5 technical replicates

Blue: CV of 5 technical replicates. Red: CV of all 15 replicates (5 technical replicates × 3 test sites).

Blue dot: percentage of genes concordantly called detected in each test site. Blue bar: percentage of genes concordantly called detected in all three test sites.
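The two CV summaries in the figure legends can be computed as below. This is a minimal sketch for a single gene, taking CV as the sample standard deviation divided by the mean; the intensity values and site names are made up and only meant to show that site-to-site differences inflate the CV of the pooled 15 replicates.

```python
from statistics import mean, stdev


def cv(values):
    """Coefficient of variation: sample standard deviation over the mean."""
    return stdev(values) / mean(values)


# Hypothetical intensities of one gene: 5 technical replicates at 3 sites.
site_replicates = {
    "site_1": [100, 104, 98, 101, 97],
    "site_2": [110, 113, 108, 111, 108],
    "site_3": [90, 93, 89, 91, 87],
}

within_site_cv = {site: cv(values) for site, values in site_replicates.items()}
pooled = [x for values in site_replicates.values() for x in values]
overall_cv = cv(pooled)

print(within_site_cv)
print(overall_cv)
```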

Conclusions

• Microarray provides an opportunity to measure thousands of genes simultaneously and makes the global monitoring of cellular activities possible.

• The method produces noisier data, and the choice of an adequate design and analysis is the key.

• RT-PCR is used for validation of a small number of genes.

• Data obtained from different platforms and centers are consistent. Ready for routine clinical use.

Limitation

• The method measures mRNA instead of proteins. The actual protein abundance and post-translational modifications cannot be detected.

• The method usually does not measure spatial or temporal dynamics of the cellular activity.

• The method is suitable for global monitoring and should be used to generate further hypotheses or be combined with other carefully designed experiments.

Introduction to next generation sequencing

Introduction

• What is next generation sequencing?
– Short reads (35~70 bps)
– Higher throughput
– Faster
– Cheaper

Introduction

• Compared with traditional sequencing
– Traditional sequencing
  • No reference sequence available (ab initio)
  • Longer reads and additional linkage information required to assemble the entire sequence
– Next generation sequencing
  • Reference sequence available (sequenced by traditional sequencing)
  • No need for assembly; just map the short reads back to the reference sequence.
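Mapping short reads back to a reference can be illustrated with the simplest possible approach: exact string search. This is only a toy sketch; it does not handle mismatches, reverse strands or large genomes, and the reference sequence and reads are made up.

```python
def map_reads(reference, reads):
    """Return, for each read, every position where it matches exactly."""
    positions = {}
    for read in reads:
        hits = []
        start = reference.find(read)
        while start != -1:
            hits.append(start)
            start = reference.find(read, start + 1)
        positions[read] = hits
    return positions


reference = "ACGTTAGCCGATAGCTAGGCTA"  # hypothetical reference sequence
reads = ["TAGC", "CCGA", "GGGG"]       # hypothetical short reads

mapping = map_reads(reference, reads)
for read, hits in mapping.items():
    print(read, hits)
```

A read that matches more than one position is ambiguous, and a read with no match may come from sequence missing in the reference; both cases need extra handling in practice.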

Technology

Major Applications

• ChIP-Seq (Chromatin Immunoprecipitation)
– A substitute for ChIP-chip
– To find the binding sequences of proteins (TFBS)

• RNA-Seq
– A substitute for microarray
– To measure the amount of RNA expressed

RNA-Seq

• Compared with microarray
– Microarray
  • Closed technology: prior knowledge required
  • Affected by pseudogenes (homologs of real genes)
  • Cheap and mature
– RNA-Seq
  • Open technology: no prior knowledge required
  • Not affected by pseudogenes because the exact sequence is measured
  • Other information can be obtained (SNPs, alternative splicing)
  • Still more expensive than microarray

See also the following introduction slides:

http://biocluster.ucr.edu/~tgirke/HTML_Presentations/Manuals/HT-Seq/HT-Seq.pdf
