An ABC of Proteomics - LTH

The aim of the lecture is to introduce you to the basic methods used in modern proteomics research. Afterwards you should be able to understand current literature and papers that refer to the use of these techniques. It is a short overview; for a deeper introduction, please look at: Principles of Proteomics. R.M. Twyman. ISBN 978-1859962732. BIOS Scientific Publishers (2004) and Principles and Practice of Biological Mass Spectrometry. C. Dass. ISBN 978-0-471-33053-0. Wiley Interscience (2006). If you are seriously considering using proteomics techniques in the lab then the following (expensive) texts are highly recommended: Proteins and Proteomics, a laboratory manual. Richard Simpson. ISBN 0-87969-554-4. Cold Spring Harbor Laboratory Press (2002) and Purifying Proteins for Proteomics, a laboratory manual. Richard Simpson. ISBN 0-87969-696-6. Cold Spring Harbor Laboratory Press (2004).

An OMICS Glossary

•Genomics

- The study of the set of genes contained in the chromosomes

•Transcriptomics

- The study of the set of mRNA molecules being expressed in a specific cell at a given time under specified conditions

•Proteomics

- The study of the set of proteins being expressed in a specific cell at a given time under specified conditions, and their state of modification

•Metabolomics

- The study of the set of small molecules in a specific cell at a given time under specified conditions

These are the basic definitions that cover modern day approaches to biology. From clinical applications to basic instrument development, these subjects now serve as the basic cornerstones upon which systems biology is built.

Genomics

The Genome is the set of genes contained in the chromosomes

−The human genome is contained in 23 pairs of chromosomes (1-22 + X/Y)

−Each chromosome is one long molecule of DNA

−Every cell has the same genome (except in cancer)

The genome is relatively stable. Although it is huge, roughly 3,000 megabases, it is now relatively simple (though expensive) to analyse. For example, look at the paper describing the first diploid sequence, the personal genome of Craig Venter (Levy et al. The Diploid Genome Sequence of an Individual Human. PLoS Biol. 2007 Sep 4;5(10):e254).

The Genome

Analogy: Complete works of an author in a partially understood language

Two approaches

Page by page

All at once

Genomics began with the goal of sequencing entire genomes. To accomplish this task, two different sequencing approaches were developed. These methods can be thought of in the following way: Imagine that you have the complete works of an author, written in a language that you studied in school, but never became fluent in. Moreover, the books are in such bad shape that if you open them, they disintegrate. You have two alternatives. You can remove one page at a time, preserve it and decipher it. Or you can open all the books at once and then pick up the fragments of paper and use the words on them to figure out how they fit together.

Page-by-page sequencing strategy

Sequence = determining the letters of each word on each piece of paper

Assembly = fitting the words back together in the correct order

The page-by-page approach to sequencing the human genome was used by the public genome-sequencing consortium. This group first figured out how all the pages fit together and then deciphered all the words on each page. Finally, it assembled the pages back together to produce the whole genome. The advantage of this approach is that it is very precise. The disadvantage is that it takes a long time.

Technical foundations of genomics

Molecular biology: recombinant-DNA technology

DNA sequencing

Library construction

PCR amplification

Hybridisation techniques

[Figure: plot of log MW against distance]

Almost all of the underlying techniques of genomics originated with molecular biology, or recombinant-DNA technology. In particular, almost all DNA sequencing is still performed using the approach pioneered by Sanger, for which he won his second Nobel Prize. Also essential to high-throughput sequencing is the ability to generate libraries of genomic clones and then cut portions of these clones and introduce them into other vectors. These techniques were developed in the late 1970s by a number of scientists, including Maniatis and Cohen. The use of the polymerase chain reaction (PCR) to amplify DNA, developed in the 1980s, is another technique at the core of genomics approaches. Finally, the use of hybridization of one nucleic acid to another in order to detect and quantitate DNA and RNA was pioneered by Southern and Alwine in the 1970s. This method remains the basis for genomics techniques such as microarrays.

Genomics relies on high-throughput

Automated sequencers

Robotics

Microarray spotters

Colony pickers

High-throughput genetics

What genomics added to these recombinant-DNA techniques was automation. The innovation that made the greatest impact on genomic sequencing was the use of fluorescent dyes and capillaries in an automated sequencing system. Pictured in the slide is the Applied Biosystems ABI 3700, which has been the most widely used instrument for large-scale sequencing. It has 96 capillaries that are fed by robotic loading from two 384-well microtiter plates. It makes a sequence run every two to three hours and can read, on average, 600–700 bases per run. Celera, the company that produced a rough draft of the human genome in three years, used 200 of these machines running 24/7 to do so. Similarly, automation was applied to the processes of spotting DNA onto slides to make microarrays and of identifying and isolating bacterial colonies to grow up DNA for sequencing. While initially applied to improving genomics techniques, high-throughput approaches are now permeating much of biology. An example of such an application is the use of robots to automate genetic screens for new mutants.

All-at-once sequencing strategy

Find small pieces of paper

Decipher the words on each fragment

Look for overlaps to assemble

The biotechnology company Celera used the other method, called “whole genome shotgun sequencing,” in its competing effort to sequence the human genome. This method is equivalent to figuring out what’s written on all the fragments of paper from all of the volumes and then figuring out how they piece together. To do this procedure effectively requires starting with several copies of each volume so that overlaps among the fragments can be found. The number of original copies is referred to as “coverage.” To produce a high-quality sequence by this method usually requires eight- to tenfold coverage. The disadvantage of this method is that you rarely get the whole sequence to line up. The advantage is that the portion of the sequence that does line up is acquired much more rapidly than via the page-by-page method.
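The relationship between coverage, read length and number of reads can be made concrete with a small calculation. This is a minimal sketch: the function names and the 1 Mb example genome are illustrative assumptions, and the estimate of the fraction of bases left unsequenced, exp(-coverage), is the standard Poisson approximation rather than anything taken from the lecture.

```python
import math


def coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base of the genome is read."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a given average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


def fraction_unsequenced(cov: float) -> float:
    """Poisson estimate of the fraction of bases hit by no read at all."""
    return math.exp(-cov)


if __name__ == "__main__":
    genome = 1_000_000  # hypothetical 1 Mb genome (assumption)
    read_len = 650      # within the 600-700 bases per read quoted earlier
    n = reads_needed(8, read_len, genome)
    print(n, "reads give", round(coverage(n, read_len, genome), 2), "fold coverage")
    print("expected fraction of bases missed:", fraction_unsequenced(8))
```

Even at eightfold coverage the Poisson estimate leaves a small fraction of bases unread, which is one reason the whole sequence rarely lines up.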

Shotgun genomics

Generates 100 Mb of sequence per day; a whole human genome is about 3 Gb

This instrument was released onto the market in August 2006. The principle is to chop DNA into short strands and to encapsulate a single strand, together with a bead, in an oil droplet in an emulsion. The single DNA molecule is amplified and the resulting identical copies are attached to the bead. Bases are then added one type at a time; the pyrophosphate (PPi) released when a base is incorporated drives a light-emitting reaction, and measuring the light output allows short sequences of ca. 100 bp to be read. There is only one bead in each hole, but there are hundreds of thousands of holes, so in a single instrument run of 4 hours some 25 million bases of sequence can be read. The genome of Mycoplasma genitalium was sequenced in one run at over 99.4% accuracy. Reference for method: Margulies et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005 Sep 15;437(7057):376-80

What is Proteomics?

Expression Proteomics - Define all gene products present in a cell and their modifications

Cell-Map Proteomics - Define the spatial & temporal positions of all proteins and interactions

Functional Proteomics - Define the biological function of all proteins within their networks and complexes

Structural Proteomics - Determine the structure of all proteins, alone and in complexes

Population Proteomics - Large scale version of expression proteomics for disease studies

There has been a rush to define everything using an omics suffix, most of which is unnecessary and will disappear with time as the focus goes back to understanding the principles of biology that underlie most events studied today.

Proteomics ABC

23,000 genes in the genome but ca. 1,000,000 proteins, caused by exon splicing and 300+ post-translational modifications

Dynamic Range: Cell 10^6, Plasma 10^12

The Dynamic Proteome

Temporal (milliseconds, months)

Spatial (cell, organelle)

Developmental (100+ cell types in the body, years)

All proteins exist in dynamic complexes

This determines their function and is highly dynamic

The point here is that the genome amounts to 46 DNA molecules per cell (23 pairs of chromosomes), and mRNA is found at between 10 and 1,000 copies per cell. Both can be amplified using PCR. Proteins, however, cannot be amplified and are found at concentrations of between 1 and 1,000,000 copies per cell, or between 1 and 1,000,000,000,000 copies per litre in the blood.

The Two Philosophies

‘Traditional’ Proteomics: Protein separation, digestion and MS

Usually two-dimensional electrophoresis, spot picking and digestion

Allows pre-enrichment (organelles), depletion (albumin removal)

Advantages: Maintains isoforms, highly parallel, cheap

Disadvantages: Lab-to-lab reproducibility, depth of coverage, non-automatable

‘Shotgun’ Proteomics: Digest down to peptides, multidimensional separation and MS

Usually ion-exchange separation followed by reverse phase HPLC-MS

Allows modification-specific targeting (isolation of phosphopeptides etc.)

Advantages: depth of coverage, automatable

Disadvantages: uncertainty in identification, slow, expensive

This mirrors genomics: a one-by-one approach to sequencing versus a shotgun approach.


Protein Separation: 2D Electrophoresis

Proteins are chemically very heterogeneous and, unlike DNA, there is no single separation method that works for all of them (electrophoresis can separate DNA fragments that are over 1,000 bases long but differ in length by one base). Proteins are very hard to separate; the highest-resolution method is to separate by charge first (first dimension) and then by size (second dimension).

The Current High-Tech Analyser

The left-hand photo shows the protein mixture or cell extract being loaded onto the first-dimension focussing instrument. The right-hand photo shows the proteins in the gel strip, after removal from the focussing device, being loaded onto the second-dimension gel for separation by size.

Two-Dimensional Gel Based Proteomics

[Figure: two 2D gel images, Disease Tissue and Healthy Tissue, with numbered protein spots]

The images show the protein patterns of extracts from normal and diseased tissue. The pattern alone can show which state the tissue is in and the spots that are changing are thus the differences between the two. Only those spots which change are cut out for further analysis.

Protein Fingerprinting

Enzyme digest

Peptide masses

Proteins can be identified in simple mixtures by digesting them with an enzyme and then measuring the masses of the peptides formed. The set of masses is called the peptide fingerprint. A database is made containing all the proteins in the species genome and the masses of all the peptides from each protein produced by a certain enzyme are calculated. Thus each protein has a theoretical peptide fingerprint. The experimental fingerprint is then compared to all the theoretical fingerprints and the best match is calculated. This should be the correct identity of the unknown protein.
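The comparison of an experimental fingerprint with the theoretical ones can be sketched as a simple counting exercise. This is only an illustration under stated assumptions: the protein names, masses and the 0.5 Da tolerance are invented, and real search engines use probability-based scores rather than a plain count of matching masses.

```python
def count_matches(experimental, theoretical, tol=0.5):
    """Number of experimental masses lying within tol Da of a theoretical mass."""
    return sum(
        any(abs(m - t) <= tol for t in theoretical)
        for m in experimental
    )


def best_match(experimental, database, tol=0.5):
    """Return (name of best-scoring protein, scores for every protein)."""
    scores = {
        name: count_matches(experimental, masses, tol)
        for name, masses in database.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical theoretical fingerprints (peptide masses in Da), invented for illustration.
DATABASE = {
    "protein_A": [501.3, 742.4, 988.5, 1203.6, 1550.8],
    "protein_B": [433.2, 742.4, 1011.5, 1340.7, 1876.9],
    "protein_C": [615.3, 870.4, 1102.6, 1489.7, 2020.0],
}

if __name__ == "__main__":
    measured = [433.4, 1011.3, 1340.9, 1877.0, 650.0]  # hypothetical experiment
    name, scores = best_match(measured, DATABASE)
    print(name, scores)
```

One measured mass (650.0) matches nothing, which is normal: contaminants and modified peptides mean that an experimental fingerprint is never a perfect copy of the theoretical one.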

Proteases

• There are two types of proteases

– Endoproteases, e.g. trypsin (C-terminal to K and R, but not if followed by P), AspN (N-terminal to D), LysC (C-terminal to K) and V8 (C-terminal to D and E, depending on pH), are specific. There are less specific endoproteases such as chymotrypsin (C-terminal to W, F, Y, L, I etc.), pepsin, elastase etc.

– Exoproteases, e.g. carboxypeptidases C, P and Y and aminopeptidase B

Endoproteinase Red-Case

Specific cuts, n=0 miscuts

Specific cuts, n=1 miscuts

Fingerprints are generated by using specific proteases. These are ones that cut next to known amino acids, so one can predict theoretically which peptides will be formed.
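The prediction of peptides from a specific protease can be sketched for trypsin, using the rule given above (cut C-terminal to K or R, but not if followed by P) and allowing for n miscuts (missed cleavages). The function names are illustrative assumptions; the test sequence is the start of the envelope protein shown on the database slide.

```python
def cleavage_sites(sequence):
    """Positions after which trypsin cuts: after K or R, unless the next residue is P."""
    return [
        i + 1
        for i in range(len(sequence) - 1)
        if sequence[i] in "KR" and sequence[i + 1] != "P"
    ]


def digest(sequence, missed=0):
    """Return all peptides containing at most `missed` uncut cleavage sites."""
    cuts = [0] + cleavage_sites(sequence) + [len(sequence)]
    peptides = []
    for i in range(len(cuts) - 1):
        # j runs over the possible end points: the next cut, or up to `missed` cuts further on
        for j in range(i + 1, min(i + 2 + missed, len(cuts))):
            peptides.append(sequence[cuts[i]:cuts[j]])
    return peptides


if __name__ == "__main__":
    seq = "SIPETQKGVIFYESHGKLEHKDIPVPKPKANELLINVK"
    print(digest(seq))            # n = 0 miscuts
    print(digest(seq, missed=1))  # n = 1 miscut
```

Note that DIPVPKPK stays in one piece with n = 0 miscuts, because its internal K is followed by P.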

Making a database

>gi|532319|pir|TVFV2E|TVFV2E envelope protein
SIPETQKGVIFYESHGKLEHKDIPVPKPKANELLINVKYSGVCHTDLHAWHGDWPLPVKLPLVGGHEGAGVVVGMGENVKGWKIGDYAGIKWLNGSCMACEYCELGNESNCPHADLSGYTHDGSFQQYATADAVQAAHIPQGTDLAQVAPILCAGITVYKALKSANLMAGHWVAISGAAGGLGSLAVQYAKAMGYRVLGIDGGEGKEELFRSIGGEVFIDFTKEKDIVGAVLKATDGGAHGVINVSVSEAAIEASTRYVRANGTTVLVGMPAGAKCCSDVFNQVVKSISIVGSYVGNRADTREALDFFARGLVKSPIKVVGLSTLPEIYEKMEKGQIVGRYVVDTS

Peptides and their masses (the residues outside < > are the neighbouring amino acids in the protein):

Mass      Peptide
275.3     K <EK> D
330.4     K <ALK> S
389.4     K <GWK> I
406.5     K <MEK> G
415.5     R <GLVK> S
436.5     R <YVR> A
443.5     K <SPIK> V
461.4     R <ADTR> E
525.6     K <LEHK> D
596.7     K <AMGYR> V
628.7     K <GQIVGR> Y
692.7     K <EELFR> S
801.8     <SIPETQK> G
810.9     R <YVVDTSK>
813.9     K <DIVGAVLK> A
835.9     K <IGDYAGIK> W
893.1     K <DIPVPKPK> A
944.1     R <VLGIDGGEGK> E
968.1     R <EALDFFAR> G
1013.2    K <ANELLINVK> Y
1136.2    K <GVIFYESHGK> L
1241.4    K <CCSDVFNQVVK> S
1251.4    K <SISIVGSYVGNR> A
1312.4    R <SIGGEVFIDFTK> E
1386.6    R <ANGTTVLVGMPAGAK> C
1447.6    K <VVGLSTLPEIYEK> M
2019.3    K <LPLVGGHEGAGVVVGMGENVK> G
2312.4    K <ATDGGAHGVINVSVSEAAIEASTR> Y
2418.7    K <YSGVCHTDLHAWHGDWPLPVK> L

A specific enzyme, here trypsin, is used to cut the protein into peptides. Trypsin cuts after arginine (R) and lysine (K) and the masses of the peptides can then be calculated.
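The mass calculation itself is a sum over the amino acids in the peptide plus one water. The sketch below assumes standard average residue masses (these values are not given in the lecture); with them, the small peptides on the slide come out within about 0.1 Da of the listed masses.

```python
# Average residue masses in Da (amino acid minus water); standard values, assumed here.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water is added for the free N- and C-termini


def peptide_mass(peptide):
    """Average mass of a neutral, unmodified peptide."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in peptide) + WATER


if __name__ == "__main__":
    for pep in ["EK", "MEK", "LEHK", "DIPVPKPK"]:
        print(pep, round(peptide_mass(pep), 1))
```

Mass spectrometers with high resolution measure monoisotopic rather than average masses, so a real search would use a monoisotopic table instead.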

Ovarian Unsupervised Analysis

- based on outcome (5 year follow-up)

Even without knowing the identities of the proteins on a gel, the patterns can be used to distinguish different states. Here the patterns distinguish between benign and malignant ovarian tumours.

Gel Reproducibility

BRCA is tissue removed pre-emptively from a patient, which nevertheless clusters clearly as malignant in both the left and right nodes

Unsupervised Pearson Clustering of dye swaps

Here normal tissue was taken from a patient who had hereditary breast cancer and opted to have her ovaries removed. The protein pattern showed that the tissue, although not a tumour, was already in a precancerous state.
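The Pearson correlation behind this kind of clustering can be sketched in a few lines. The spot intensities below are invented for illustration, and the comparison shown is a simple "which reference profile is most similar" lookup, a simplification of the unsupervised clustering used for the slide.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long intensity profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def most_similar(sample, references):
    """Name of the reference profile with the highest correlation to the sample."""
    return max(references, key=lambda name: pearson(sample, references[name]))


if __name__ == "__main__":
    # Hypothetical spot intensities for four spots (assumption, not lecture data).
    references = {
        "benign": [1.0, 2.0, 3.0, 4.0],
        "malignant": [4.0, 3.0, 2.0, 1.0],
    }
    unknown = [3.9, 3.2, 1.8, 1.1]
    print(most_similar(unknown, references))
```

Because the correlation depends only on the shape of the pattern, a sample can group with the malignant profiles even when no tumour is visible, as in the case described above.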

The ‘Core Cancer Proteome’

Swiss Prot/TrEMBL | Name | Mascot score(s) | Peptides matched by Mascot | Peptides matched by Sequest | Benign vs. Malignant ROC area | Benign vs. Malignant ROC p-value | Benign vs. Malignant fold change | Benign vs. Malignant trend
P55072 | Transitional endoplasmic reticulum ATPase | 102 | 9 | 1 | 0.89 | 3.80E-05 | 1.7 | M>B
P36957 | Dihydrolipoyllysine-residue succinyltransferase component of 2-oxoglutarate dehydrogenase | 103 | 3 | 2 | 0.88 | 1.60E-04 | 2.1 | M>B
P07858 | Cathepsin B | 118 | 4 | 2 | 0.82 | 1.20E-03 | 2.7 | M>B
P61978 | Heterogeneous nuclear ribonucleoprotein K | 123 | 5 | 4 | 0.95 | 2.10E-05 | 2.7 | M>B
P45973 | Chromobox protein homolog 5 | 123 | 3 | 3 | 0.9 | 3.00E-02 | 3.2 | M>B
P30101 | Protein disulfide-isomerase A3 | 134 | 6 | 3 | 0.92 | 7.50E-05 | 2 | M>B
P60709 | Actin, cytoplasmic 1 | 141 | 5 | 0 | 0.93 | 5.30E-04 | 3.3 | M>B
P40121 | Macrophage capping protein | 143 | 4 | 2 | 0.86 | 1.70E-04 | 2.4 | M>B
P00441 | Superoxide dismutase 1, soluble | 170 | 2 | 2 | 0.89 | 4.60E-05 | 1.8 | M>B
P02545 | Lamin A/C | 172 | 6 | 3 | 0.92 | 7.50E-05 | 2 | M>B
P10809 | 60 kDa heat shock protein | 178 | 9 | 6 | 0.95 | 2.10E-05 | 2.7 | M>B
P52565 | Rho GDP-dissociation inhibitor 1 | 187 | 6 | 0 | 0.9 | 9.30E-05 | 2.1 | M>B
P27797 | Calreticulin | 196 | 7 | 1 | 0.88 | 6.40E-05 | 2.5 | M>B
P11021 | 78 kDa glucose-regulated protein | 208 | 6 | 2 | 0.88 | 6.70E-05 | 2 | M>B
P07951 | Tropomyosin beta chain | 220 | 5 | 6 | 0.89 | 3.80E-05 | 2.4 | M>B
P14625 | Endoplasmin | 241 | 10 | 3 | 0.9 | 1.50E-05 | 2.1 | M>B
P02768 | Serum albumin | 246 | 12 | 5 | 0.86 | 2.00E-04 | 2.2 | B>M
Q15181 | Inorganic pyrophosphatase | 255 | 10 | 4 | 0.86 | 2.00E-04 | 3 | M>B
P32119 | Peroxiredoxin 2 | 266 | 7 | 3 | 0.91 | 1.20E-05 | 2.6 | M>B
P07237 | Protein disulfide-isomerase | 271 | 10 | 4 | 0.88 | 3.00E-04 | 2.1 | M>B
P60709 | Actin, cytoplasmic 1 | 291 | 5 | 1 | 0.82 | 6.80E-03 | 2.5 | M>B
P61978 | Heterogeneous nuclear ribonucleoprotein K | 341 | 7 | 7 | 0.83 | 9.80E-04 | 3.2 | M>B
P31947 | 14-3-3 protein sigma | 393 | 8 | 3 | 0.91 | 5.90E-05 | 2.3 | M>B
P67936 | Tropomyosin alpha 4 chain | 431 | 12 | 8 | 0.89 | 3.80E-05 | 2.4 | M>B

All these Proteins Appear in Ovarian, Breast, Prostate, Glioma……

The Hallmarks of Cancer……?

Obviously one can identify the protein spots in the pattern that help distinguish between benign and malignant. However, the same proteins are involved in many different cancer types. This is quite logical, since tumours all face the same problems: avoiding detection by the immune system, growing fast, recruiting blood vessels to supply oxygen to the interior of the tumour, etc.
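The ROC area and fold change quoted for each protein can be computed from the spot intensities of the two groups. The sketch below uses the fact that the ROC area equals the probability that a randomly chosen malignant sample has a higher value than a randomly chosen benign one (ties count one half). The intensities are invented, and how the values in the table were actually calculated is not stated in the lecture.

```python
def roc_area(positives, negatives):
    """Area under the ROC curve via pairwise comparison of the two groups."""
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def fold_change(positives, negatives):
    """Ratio of the mean intensity in the two groups."""
    mean_pos = sum(positives) / len(positives)
    mean_neg = sum(negatives) / len(negatives)
    return mean_pos / mean_neg


if __name__ == "__main__":
    malignant = [3.0, 4.0, 5.0]  # hypothetical spot intensities
    benign = [1.0, 2.0, 3.0]
    print(round(roc_area(malignant, benign), 2), round(fold_change(malignant, benign), 2))
```

An area of 1.0 means the spot separates the groups perfectly, while 0.5 means it is no better than chance.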

Why bother with Proteins?

DNA Microarray is faster and more accurate??

[Figure: expression patterns for primary GBM, Grade II, Grade III and normal brain; panels: mRNA and Protein]

Both DNA and Protein expression patterns can determine the cancer state. However, very often the genes used to determine the state are different.

Correlating Proteins and mRNA

Most, ca 58% show a positive correlation

[Figure label: Down-/Up-Regulation]

Correlating Proteins and mRNA

Some, ca 42% show negative or no correlation

[Figure labels: -Corr, No Corr]

Why does the level of protein expression very often not agree with the level of mRNA expression? One explanation is time dependence: if the mRNA is unstable it may disappear quickly, leaving a high amount of protein. Alternatively, the mRNA level may be high but the protein absent if a protease has been activated that removes the protein. Proteins may also be secreted from the cell.

Whichever way, You Must Validate

Tissue microarray of anti-Integrin alpha 5 Antibodies

Strong staining in Ovarian Tumours

No staining in normal Ovarian Tissues

Whatever experiment you do, you must confirm your findings using a different technique. Very often antibodies can be used to confirm the presence or absence of a protein, and often to show where the protein is in a cell.

Whole Body Summary for Integrin

For the example of a cancer biomarker, antibodies can be used to screen the whole body to try and confirm that the marker is unique to the tissue of origin.


2D-HPLC in depth and breadth coverage

Digesting a whole cell extract generates around 100,000 different peptides. These are usually separated in multiple dimensions. Here they are separated using strong cation exchange chromatography (according to charge) and each of the fractions is then separated by reverse-phase chromatography (according to hydrophobicity). Usually the peptides are detected by a mass spectrometer which is coupled directly to the end of the reverse-phase column.

Non-gel Based Proteomics

[Figure: workflow. Protein mixture; (I) peptides, with 1D, 2D or 3D peptide separation (chromatogram, time in min); (II) tandem mass spectrum acquired via Q1, Q2 (collision cell) and Q3 (m/z 200 to 1200); (III) correlative sequence database searching of theoretical against acquired spectra; protein identification]

MS scanning modes

The mass spectrometer used for detection operates in two modes. In the first scan, the masses of the intact peptides coming from the HPLC are measured (MS mode). Then, in the second scan, one peptide is selected and broken into pieces by colliding it with a gas (collision-induced dissociation, CID), and the fragments are measured (MS/MS mode). The mass spectrometer then goes back to measuring intact masses again (MS mode).

Peptide Fingerprinting

Peptide disintegration

Fragment ions

Peptides are identified in a similar way to proteins. Perhaps 10 peptides enter the mass spectrometer at the same time. The MS automatically picks the most intense, isolates it (throwing away the other 9 peptides) and then smashes it into pieces. The mass of the peptide is used to search the database to find all peptides with the same mass. The fragmentation spectra of all these peptides are then calculated and compared to the experimental fragments observed. The best matching peptide sequence is then selected.
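The calculation and comparison of fragmentation spectra can be sketched as follows. Two assumptions are made that the lecture does not spell out: fragmentation is modelled as singly charged b ions (N-terminal pieces) and y ions (C-terminal pieces), and the score is a plain count of matching fragments. The monoisotopic residue masses are standard values; real search engines such as Mascot and Sequest use more elaborate scoring.

```python
# Monoisotopic residue masses in Da (standard values, assumed here).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (b ions, y ions) as singly charged m/z values."""
    masses = [MONO[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(peptide))]
    y = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(peptide))]
    return b, y


def score(observed, candidate, tol=0.02):
    """Number of observed fragments explained by the candidate sequence."""
    b, y = fragment_ions(candidate)
    theoretical = b + y
    return sum(any(abs(o - t) <= tol for t in theoretical) for o in observed)


if __name__ == "__main__":
    # Pretend the true peptide is LEHK; ELHK has exactly the same intact mass.
    b, y = fragment_ions("LEHK")
    observed = b + y
    for candidate in ["LEHK", "ELHK"]:
        print(candidate, score(observed, candidate))
```

The two candidates cannot be told apart by intact mass alone, but the fragment ions that differ between them separate the correct sequence from the wrong one.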

Quantitation by Isotope dilution

Sample 1: incorporate stable light isotope

Sample 2 (reference): incorporate stable heavy isotope

Combine samples

Analyze by mass spectrometer

• h/l analytes are chemically identical ⇒ identical specific signal in MS

• Ratio of h/l signals indicates ratio of analytes

The most common way to compare samples is to label one sample with a heavy isotope. For example, cells grown in normal medium contain oxygen in the form of O-16, but the cells can also be grown in O-18 water. Alternatively, amino acids containing C-13 can be fed to the cells. The proteins from the two cell populations are mixed together, digested, and the peptides then separated by 2D-HPLC. The same peptide from each population elutes at the same time, but one form is heavier than the other.
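Turning the heavy and light signals into a relative amount is a matter of taking ratios. In the sketch below the intensities are invented, and combining the peptide ratios of one protein by their median is an assumption chosen for robustness, not a method stated in the lecture.

```python
import math
import statistics


def peptide_ratios(pairs):
    """Heavy/light intensity ratio for each (light, heavy) peptide pair."""
    return [heavy / light for light, heavy in pairs]


def protein_ratio(pairs):
    """Median heavy/light ratio over all peptide pairs of one protein."""
    return statistics.median(peptide_ratios(pairs))


if __name__ == "__main__":
    # Hypothetical (light, heavy) peak intensities for three peptides of one protein.
    pairs = [(100.0, 200.0), (50.0, 110.0), (80.0, 150.0)]
    ratio = protein_ratio(pairs)
    print("heavy/light =", ratio, "log2 =", math.log2(ratio))
```

A ratio of 2 means the protein is twice as abundant in the heavy-labelled sample; reporting log2 values makes up- and down-regulation symmetric around zero.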

Protein Labelling in Culture

Cell culture SILAC (for MS)

–Control with Arg C-12 and/or Lys C-12

–Experimental with Arg C-13 and/or Lys C-13

–Extract and combine proteins then digest

[Figure: metabolic stable isotope labeling. Culture A and Culture B; mix protein extracts; digest together; mass spectrometry (intensity against m/z) showing peptide pairs A and B]

Here cells are grown in culture with labelled amino acids. The peptides appear as pairs, one light from culture A and one heavy from culture B.
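The spacing between the light and heavy peak of a pair can be predicted from the sequence. The sketch assumes that the heavy arginine and lysine carry C-13 at all six carbon positions (the lecture does not say how many carbons are labelled) and that labelling is complete; the function names are illustrative.

```python
C13_MINUS_C12 = 1.003355  # mass difference between C-13 and C-12 in Da


def heavy_mass_shift(peptide, labelled="KR", carbons_per_residue=6):
    """Mass difference in Da between the heavy and light form of a peptide."""
    n_labelled = sum(peptide.count(aa) for aa in labelled)
    return n_labelled * carbons_per_residue * C13_MINUS_C12


def mz_spacing(peptide, charge):
    """Distance on the m/z axis between the light and heavy peak."""
    return heavy_mass_shift(peptide) / charge


if __name__ == "__main__":
    for pep, z in [("LEHK", 1), ("LEHK", 2), ("DIPVPKPK", 2)]:
        print(pep, z, round(mz_spacing(pep, z), 4))
```

Since trypsin cuts after K and R, almost every tryptic peptide ends in a labelled residue, which is why labelling Arg and Lys gives a pair for nearly every peptide.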

Peptide Digest Labelling

Peptide labelling strategies

Chemical labelling of protein, mix, digest

Digest, chemical label, mix

Enzymatic digestion in H2(18)O or H2(16)O, mix

[Figure: isotope tagging by chemical reaction. Digest; label; mass spectrometry (intensity axis)]

Alternatively, chemical labels can be attached to the proteins before digestion or to the peptides after digestion, in either case before the samples are mixed.

usefulness

This is an excellent example of modern mass spectrometry coupled with intelligent biology. The proteins in the signalling pathway of the DAF-2 receptor are identified.

Experiment

Feed C. elegans on N-15 labelled bacteria

Use data-independent MS/MS

Find DAF-2 targets changed by mutation

Confirm by SRM

The whole worm is fed nitrogen in the form of ammonium salts, either as the light isotope (N-14) or as the heavy isotope (N-15). A worm with a mutation in the DAF-2 receptor, which causes the worm to live longer, was then generated. The proteins that were up- or down-regulated in the mutant compared with the wild type were identified by MS.

only orthogonal data

The levels of the proteins were analysed after temperature shifting. The proteins that were identified as changing in expression level were confirmed by a special mass spectrometric technique called SRM, which allows only those peptides coming from the proteins identified to be detected. The proteins were also confirmed by western blotting using antibodies.

Confirm targets by RNAi

To really confirm that the proteins identified were involved in the DAF-2 receptor pathway, RNA interference (RNAi) was used: interfering RNAs were introduced which lead to destruction of the mRNA for that protein, knocking it down temporarily. The effect of this on the biological function was then assayed (the formation of a hibernation life stage called a dauer).

Interested?

A pdf version of the talk will be available on

- www.immun.lth.se

There are also "What is proteomics?" pages etc.
