Proteomic Characterization of Alternative Splicing and Coding Polymorphism Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park

Dec 23, 2015


Lindsay Moody
Transcript
Page 1

Proteomic Characterization of Alternative Splicing and Coding Polymorphism

Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park

Page 2

Mass Spectrometry for Proteomics

• Measure mass of many (bio)molecules simultaneously
  • High bandwidth

• Mass is an intrinsic property of all (bio)molecules
  • No prior knowledge required

Page 3

Mass Spectrometry for Proteomics

• Measure mass of many molecules simultaneously
  • ...but not too many, abundance bias

• Mass is an intrinsic property of all (bio)molecules
  • ...but need a reference to compare to

Page 4

High Bandwidth

[Figure: mass spectrum, % Intensity (0 to 100) versus m/z (250 to 1000)]

Page 5

Mass is fundamental!

Page 6

Mass Spectrometry for Proteomics

• Mass spectrometry has been around since the turn of the 20th century...
  • ...why is MS-based proteomics so new?

• Ionization methods
  • MALDI, Electrospray

• Protein chemistry & automation
  • Chromatography, Gels, Computers

• Protein sequence databases
  • A reference for comparison

Page 7

Sample Preparation for Peptide Identification

Enzymatic Digest and Fractionation

Page 8

Single Stage MS

[Figure: MS spectrum with an m/z axis]

Page 9

Tandem Mass Spectrometry (MS/MS)

Precursor selection

[Figure: two spectra, each with an m/z axis]

Page 10

Tandem Mass Spectrometry (MS/MS)

Precursor selection + collision induced dissociation (CID)

[Figure: MS/MS, two spectra, each with an m/z axis]

Page 11

Peptide Identification

• For each (likely) peptide sequence
  1. Compute fragment masses
  2. Compare with spectrum
  3. Retain those that match well

• Peptide sequences from protein sequence databases
  • Swiss-Prot, IPI, NCBI’s nr, ...

• Automated, high-throughput peptide identification in complex mixtures
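
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scoring used by Mascot, SEQUEST, or any other search engine: it assumes singly charged b- and y-ions, monoisotopic residue masses, a 0.5 Da tolerance, and a plain shared-peak count as the score. The function names (`fragment_mz`, `count_matches`, `best_peptides`) are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # monoisotopic mass of H2O
PROTON = 1.00728   # mass of the charge-carrying proton


def fragment_mz(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal piece
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal piece
    return b_ions, y_ions


def count_matches(peptide, peaks, tol=0.5):
    """Step 2: number of theoretical fragments within tol of an observed peak."""
    b_ions, y_ions = fragment_mz(peptide)
    return sum(any(abs(mz - p) <= tol for p in peaks) for mz in b_ions + y_ions)


def best_peptides(candidates, peaks, min_matches=3, tol=0.5):
    """Step 3: keep candidates that match well, best first."""
    scored = [(count_matches(p, peaks, tol), p) for p in candidates]
    return sorted((s for s in scored if s[0] >= min_matches), reverse=True)
```

A candidate that is absent from the searched database is never scored at all, which is the bias the following slides discuss.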

Page 12

Why don’t we see more novel peptides?

• Tandem mass spectrometry doesn’t discriminate against novel peptides...

...but protein sequence databases do!

• Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!

Page 13

What goes missing?

• Known coding SNPs

• Novel coding mutations

• Alternative splicing isoforms

• Alternative translation start-sites

• Microexons

• Alternative translation frames

Page 14

Why should we care?

• Alternative splicing is the norm!
  • Only 20-25K human genes
  • Each gene makes many proteins

• Proteins have clinical implications
  • Biomarker discovery

• Evidence for SNPs and alternative splicing stops with transcription
  • Genomic assays, ESTs, mRNA sequence.
  • Little hard evidence for translation start site

Page 15

Novel Splice Isoform

• Human Jurkat leukemia cell-line
  • Lipid-raft extraction protocol, targeting T cells
  • von Haller, et al. MCP 2003.

• LIME1 gene:
  • LCK interacting transmembrane adaptor 1

• LCK gene:
  • Leukocyte-specific protein tyrosine kinase
  • Proto-oncogene
  • Chromosomal aberration involving LCK in leukemias.

• Multiple significant peptide identifications

Page 18

Novel Frame

Page 19

Novel Frame

Page 20

Novel Mutation

• HUPO Plasma Proteome Project
  • Pooled samples from 10 male & 10 female healthy Chinese subjects
  • Plasma/EDTA sample protocol
  • Li, et al. Proteomics 2005. (Lab 29)

• TTR gene
  • Transthyretin (pre-albumin)
  • Defects in TTR are a cause of amyloidosis.
  • Familial amyloidotic polyneuropathy
    • late-onset, dominant inheritance

Page 23

Searching ESTs

• Proposed long ago:
  • Yates, Eng, and McCormack; Anal Chem, ’95.

• Now:
  • Protein sequences are sufficient for protein identification
  • Computationally expensive/infeasible
  • Difficult to interpret

• Make routine EST searching feasible, so that novel peptides can be discovered.

Page 24

Searching Expressed Sequence Tags (ESTs)

Pros
• No introns!
• Primary splicing evidence for annotation pipelines
• Evidence for dbSNP
• Often derived from clinical cancer samples

Cons
• No frame
• Large (8Gb)
• “Untrusted” by annotation pipelines
• Highly redundant
• Nucleotide error rate ~ 1%

Page 25

Compressed EST Peptide Sequence Database

• For all ESTs mapped to a UniGene gene:
  • Six-frame translation
  • Eliminate ORFs < 30 amino-acids
  • Eliminate amino-acid 30-mers observed once
  • Compress to C2 FASTA database

• Complete, Correct for amino-acid 30-mers

• Gene-centric peptide sequence database:
  • Size: < 3% of naïve enumeration, 20774 FASTA entries
  • Running time: ~ 1% of naïve enumeration search
  • E-values: ~ 2% of naïve enumeration search results
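
A minimal sketch of the first three steps of that pipeline follows. It rests on several assumptions that the slide does not state: the standard genetic code, an "ORF" taken to mean any stop-free stretch of a translated frame, codons containing ambiguous bases translated as `X`, and 30-mer occurrences counted across all translated ESTs of a gene. The function names are hypothetical, and the final compression step is not shown.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_orfs(dna, min_len=30):
    """Yield stop-free translated stretches of at least min_len residues."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    for strand in (dna, reverse_complement):
        for frame in range(3):
            codons = (strand[i:i + 3] for i in range(frame, len(strand) - 2, 3))
            protein = "".join(CODON_TABLE.get(c, "X") for c in codons)
            for orf in protein.split("*"):   # "*" marks a stop codon
                if len(orf) >= min_len:
                    yield orf


def repeated_kmers(orfs, k=30):
    """Return the amino-acid k-mers seen at least twice across all ORFs."""
    counts = Counter(orf[i:i + k] for orf in orfs
                     for i in range(len(orf) - k + 1))
    return {kmer for kmer, n in counts.items() if n >= 2}
```

Dropping 30-mers seen only once is a plausible guard against the ~1% nucleotide error rate noted earlier, since a sequencing error is unlikely to recur identically in a second EST.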


Page 27

SBH-graph

ACDEFGI, ACDEFACG, DEFGEFGI
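
One way to build such a graph for the three example sequences is sketched below. The slide text does not give k; k = 4 is an assumption, chosen because it reproduces the edge counts and the enumerated paths that appear on the later slides. Each k-mer is treated as an edge from its (k-1)-mer prefix to its (k-1)-mer suffix, which is the usual de Bruijn-style reading of a sequencing-by-hybridization (SBH) graph. The function name is hypothetical.

```python
from collections import Counter


def sbh_graph(sequences, k=4):
    """Map each k-mer edge (prefix node, suffix node) to its occurrence count."""
    edges = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1   # nodes are (k-1)-mers
    return edges


graph = sbh_graph(["ACDEFGI", "ACDEFACG", "DEFGEFGI"])
for (prefix, suffix), count in sorted(graph.items()):
    print(f"{prefix} -> {suffix}  seen {count}x")
```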

Page 28

Compressed SBH-graph

ACDEFGI, ACDEFACG, DEFGEFGI
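
Compression can be read as merging every chain of k-mers that passes through non-branching nodes into a single labeled edge. The sketch below does this for the example sequences, again assuming k = 4. It is an illustration of the idea rather than the author's C++ implementation, and it ignores isolated cycles that contain no branching node.

```python
from collections import Counter, defaultdict


def compress(kmers, k=4):
    """Merge chains of k-mers through non-branching (k-1)-mer nodes."""
    out_edges, in_degree = defaultdict(list), Counter()
    for kmer in set(kmers):
        out_edges[kmer[:-1]].append(kmer)
        in_degree[kmer[1:]] += 1

    def is_junction(node):
        # A node is kept unless exactly one edge enters and one edge leaves.
        return not (in_degree[node] == 1 and len(out_edges[node]) == 1)

    labels = []
    for node in list(out_edges):
        if not is_junction(node):
            continue
        for kmer in out_edges[node]:
            label = kmer
            # Extend through pass-through nodes, one letter per k-mer.
            while not is_junction(label[-(k - 1):]):
                label += out_edges[label[-(k - 1):]][0][-1]
            labels.append(label)
    return sorted(labels)


sequences = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
kmers = [s[i:i + 4] for s in sequences for i in range(len(s) - 3)]
print(compress(kmers))
```

For these sequences the ten distinct 4-mers collapse to five compressed edges.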

Page 29

Sequence Databases & CSBH-graphs

• Original sequences correspond to paths

ACDEFGI, ACDEFACG, DEFGEFGI

Page 30

Sequence Databases & CSBH-graphs

• All k-mers represented by an edge have the same count

[Figure: compressed edges labeled with k-mer counts 2, 2, 1, 2, 1]
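
The same-count property can be checked directly on the example. The sketch assumes k = 4 and takes the five compressed-edge labels as given; they were derived by hand for that k and do not appear in the slide text. Under those assumptions the per-edge counts come out as three 2s and two 1s, matching the five numbers on the slide.

```python
from collections import Counter

SEQUENCES = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
K = 4  # assumed k-mer length
# Assumed compressed-edge labels, derived by hand for K = 4.
EDGES = ["ACDEF", "DEFG", "EFGI", "DEFACG", "EFGEFG"]

COUNTS = Counter(s[i:i + K] for s in SEQUENCES for i in range(len(s) - K + 1))


def counts_on_edge(label):
    """Return the set of distinct occurrence counts among one edge's k-mers."""
    return {COUNTS[label[i:i + K]] for i in range(len(label) - K + 1)}


def edges_seen_at_least(n):
    """Return the edges whose k-mers all occur at least n times."""
    return [edge for edge in EDGES if min(counts_on_edge(edge)) >= n]


for edge in EDGES:
    print(edge, counts_on_edge(edge))
```

Because the count is constant along an edge, filtering k-mers by count only requires one lookup per edge, not one per k-mer.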

Page 31

cSBH-graphs

• Quickly determine those that occur twice

[Figure: compressed edges labeled with k-mer counts 2, 2, 1, 2]

Page 32

Correct, Complete, Compact (C3) Enumeration

• Set of paths that use each edge exactly once

ACDEFGEFGI, DEFACG
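
The three properties can be tested mechanically. The checker below uses one reading of the terms, which is an interpretation and not a definition taken from the slides: "correct" means the enumeration introduces no k-mer absent from the original sequences, "complete" means no original k-mer is lost, and "compact" means no k-mer is written twice. k = 4 is assumed, and the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(sequences, k=4):
    """Count every k-mer occurrence across a list of sequences."""
    return Counter(s[i:i + k] for s in sequences
                   for i in range(len(s) - k + 1))


def check_enumeration(original, paths, k=4):
    """Report which of the three C's a proposed enumeration satisfies."""
    wanted = set(kmer_counts(original, k))
    produced = kmer_counts(paths, k)
    return {
        "correct": set(produced) <= wanted,    # nothing spurious
        "complete": wanted <= set(produced),   # nothing missing
        "compact": all(n == 1 for n in produced.values()),  # nothing repeated
    }


original = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
print(check_enumeration(original, ["ACDEFGEFGI", "DEFACG"]))
print(check_enumeration(original, original))
```

Under this reading the two-path enumeration on the slide satisfies all three properties, while the original sequences are correct and complete but not compact.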

Page 33

Correct, Complete (C2) Enumeration

• Set of paths that use each edge at least once

ACDEFGEFGI, DEFACG

Page 34

Patching the CSBH-graph

• Use artificial edges to fix unbalanced nodes
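
A sketch of how patching might work: add one artificial edge from each node with surplus incoming edges to a node with surplus outgoing edges, walk an Eulerian circuit of the now-balanced graph with Hierholzer's algorithm, and cut the circuit at the artificial edges. The details here (k = 4, the pairing of unbalanced nodes in sorted order, the empty-string marker for artificial edges) are assumptions for illustration, not the author's implementation.

```python
from collections import defaultdict

ARTIFICIAL = ""  # letter carried by a patching edge


def eulerian_paths(kmers, k=4):
    """Split distinct k-mers into strings that use each k-mer exactly once."""
    out_edges = defaultdict(list)   # node -> [(next node, appended letter)]
    balance = defaultdict(int)      # out-degree minus in-degree
    for kmer in sorted(set(kmers)):
        out_edges[kmer[:-1]].append((kmer[1:], kmer[-1]))
        balance[kmer[:-1]] += 1
        balance[kmer[1:]] -= 1
    # Patch: one artificial edge per unit of imbalance.
    sinks = [n for n in sorted(balance) for _ in range(-balance[n])]
    sources = [n for n in sorted(balance) for _ in range(balance[n])]
    for sink, source in zip(sinks, sources):
        out_edges[sink].append((source, ARTIFICIAL))

    paths = []
    for start in sorted(out_edges):
        if not out_edges[start]:
            continue                      # edges used by an earlier circuit
        stack, trail = [(start, None)], []
        while stack:                      # Hierholzer's algorithm, iterative
            node = stack[-1][0]
            if out_edges[node]:
                stack.append(out_edges[node].pop())
            else:
                trail.append(stack.pop())
        steps = trail[::-1][1:]           # (node reached, letter) per edge
        cuts = [i for i, (_, letter) in enumerate(steps) if letter == ARTIFICIAL]
        if cuts:                          # rotate to end on an artificial edge
            steps = steps[cuts[0] + 1:] + steps[:cuts[0] + 1]
            path = steps[-1][0]
        else:
            path = start
        for node, letter in steps:
            if letter == ARTIFICIAL:
                paths.append(path)        # cut the circuit here
                path = node
            else:
                path += letter
        if not cuts:
            paths.append(path)
    return sorted(p for p in paths if len(p) >= k)


sequences = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
kmers = [s[i:i + 4] for s in sequences for i in range(len(s) - 3)]
print(eulerian_paths(kmers))
```

For the example sequences, two nodes need an extra outgoing edge and two need an extra incoming edge, so two artificial edges are added and the circuit is cut into two paths.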

Page 35

Compressed EST Database

• Gene-centric compressed EST peptide sequence database
  • 20,774 sequence entries
  • ~8Gb vs 223 Mb
  • ~35 fold compression

• 22 hours becomes 15 minutes
  • E-values improve by similar factor!

• Makes routine EST searching feasible
  • Search ESTs instead of IPI?
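
The quoted figures can be cross-checked with a little arithmetic. The only assumption is that "~8Gb" is read as roughly 8000 Mb.

```python
# Size: ~8 Gb of ESTs (taken as 8000 Mb, an assumption) vs a 223 Mb database.
size_ratio = 8000 / 223
print(f"compression: about {size_ratio:.0f}-fold, "
      f"compressed size is {100 / size_ratio:.1f}% of the original")

# Search time: 22 hours vs 15 minutes.
time_ratio = (22 * 60) / 15
print(f"speed-up: {time_ratio:.0f}-fold, "
      f"search time is {100 / time_ratio:.1f}% of the original")
```

Both results sit close to the "< 3%" size and "~ 1%" running-time figures given on the earlier slide about the compressed database.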

Page 36

“Novel Peptide” Computational Infrastructure

• Binaries (C++)
  • cSBH-graph construction
    • Condor grid-enabled
  • Eulerian path k-mer enumeration
    • Suitable for large graphs

• Data-model for peptide identification
  • Spectra (>5 million)
  • Peptide identifications
    • Mascot, SEQUEST, X!Tandem, NIST
  • Genomic context of peptides

Page 37

“Novel Peptide” Computational Infrastructure

• Condor grid-enabled MS/MS search
  • Mascot, X!Tandem, (Inspect, OMSSA)

• TurboGears Python web-stack
  • SQLObject Object-Relational-Manager
  • MVC web-application framework
  • Suitable for AJAX & web-services too

• Integration with UCSC genome browser
  • caBIG compatible web-services

• Java applet for viewing spectra

Page 38

Peptide Identification Navigator

Page 39

Peptide Identification Navigator

Page 40

Spectrum Viewer

Page 41

Spectrum Viewer

Page 42

Back to the lab...

• Current LC/MS/MS workflows identify a few peptides per protein
  • ...not sufficient for protein isoforms

• Need to raise the sequence coverage to (say) 80%
  • ...protein separation prior to LC/MS/MS analysis
  • Potential for database of splice sites of (functional) proteins!

Page 43

Microorganism Identification by MALDI Mass Spectrometry

• Direct observation of microorganism biomarkers in the field.

• Peaks represent masses of abundant proteins.

• Statistical models assess identification significance.

[Figure: B. anthracis spores analyzed by MALDI mass spectrometry]

Page 44

Key Principles

• Protein mass from protein sequence
  • No introns, few PTMs

• Specificity of single mass is very weak
  • Statistical significance from many peaks

• Not all proteins are equally likely to be observed
  • Ribosomal proteins, SASPs
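
Why many peaks help can be illustrated with a simple binomial tail. This is a generic textbook model, not the statistical model used by the author or by RMIDb: it assumes each observed peak independently matches some protein mass of a wrong organism with the same probability `p_single`, a hypothetical parameter.

```python
from math import comb


def chance_match_pvalue(n_peaks, n_matched, p_single):
    """Probability that at least n_matched of n_peaks match purely by chance."""
    return sum(
        comb(n_peaks, j) * p_single ** j * (1 - p_single) ** (n_peaks - j)
        for j in range(n_matched, n_peaks + 1)
    )


# One matched peak says little; eight out of ten says much more.
print(chance_match_pvalue(1, 1, 0.3))
print(chance_match_pvalue(10, 8, 0.3))
```

With a 30% chance of a random match per peak, a single match is unremarkable, while eight matches out of ten would be rare under this model.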

Page 45

Rapid Microorganism Identification Database (www.RMIDb.org)

• Protein Sequences
  • 8.1M (2.9M)

• Species
  • ~ 18K

• Genbank
  • Microbial, Virus, Plasmid

• RefSeq
• CMR
• Swiss-Prot
• TrEMBL

Page 46

Rapid Microorganism Identification Database (www.RMIDb.org)

Page 47

Informatics Issues

• Need good species / strain annotation
  • B. anthracis vs B. thuringiensis

• Need correct protein sequence
  • B. anthracis Sterne α/β SASP
  • RefSeq/Gb: MVMARN... (7442 Da)
  • CMR: MARN... (7211 Da)

• Need chemistry-based protein classification
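
The effect of a wrong start site on the predicted mass can be estimated from the sequence alone. The sketch uses standard average residue masses and ignores post-translational modifications; the function names are hypothetical. Only the six and four N-terminal residues shown on the slide are used, since the rest of the sequence is elided there and cancels out of the difference.

```python
# Average residue masses (Da) for the 20 standard amino acids.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # average mass of H2O, added once per chain


def average_mass(sequence):
    """Average mass of an unmodified protein or peptide chain."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence) + WATER


def n_terminal_shift(longer, shorter):
    """Mass difference caused by extra residues at the N-terminus."""
    if not longer.endswith(shorter):
        raise ValueError("sequences must differ only at the N-terminus")
    return average_mass(longer) - average_mass(shorter)


print(round(n_terminal_shift("MVMARN", "MARN"), 1))
```

The extra Met-Val accounts for roughly 230 Da, close to the 231 Da gap between the two database masses, which is far more than enough to move a peak away from its expected position.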

Page 48

Conclusions

• Proteomics can inform genome annotation
  • Eukaryotic and prokaryotic
  • Functional vs silencing variants

• Peptides identify more than just proteins
  • Untapped source of disease biomarkers

• Compressed peptide sequence databases make routine EST searching feasible

Page 49

Future Research Directions

• Identification of protein isoforms:
  • Optimize proteomics workflow for isoform detection
  • Identify splice variants in cancer cell-lines (MCF-7) and clinical brain tumor samples
  • Aggressive peptide sequence enumeration
  • dbPep for genomic annotation
  • Open, flexible informatics infrastructure for peptide identification

Page 50

Future Research Directions

• Proteomics for Microorganism Identification
  • Specificity of tandem mass spectra
  • Revamp RMIDb prototype
  • Incorporate spectral matching

• Primer design
  • k-mer sets as FASTA sequence databases
  • Uniqueness oracle for exact and inexact match
  • Integration with Primer3
  • Tiling, multiplexing, pooling, & tag arrays

Page 51

Acknowledgements

• Chau-Wen Tseng, Xue Wu
  • UMCP Computer Science

• Catherine Fenselau, Steve Swatkoski
  • UMCP Biochemistry

• Calibrant Biosystems

• PeptideAtlas, HUPO PPP, X!Tandem

• Funding: National Cancer Institute