Statistical Consideration for Identification and Quantification in Top-Down Proteomics. Richard LeDuc, National Center for Genome Analysis Support. Discovery Omics with Top Down Proteomics


Dec 26, 2015


Antony Harris
Transcript
Page 1

Statistical Consideration for Identification and Quantification in Top-Down Proteomics

Richard LeDuc, National Center for Genome Analysis Support

Discovery Omics with Top Down Proteomics

Page 2

Acknowledgements

• Leonid Zamdborg
• Shannee Babai
• Bryon Early
• Ian Spauling
• Kevin Glowacz
• Eric Bluhm
• Vinayak Viswanathan
• Yong-Bin Kim
• Ryan Fellers
• Tom Januszyk
• Brian Cis
• Chris Strouse
• Seyoung Sohn
• Greg Taylor
• Joe Sola
• Lee Bynum
• Andrew Birck

All the other numerous members of the KRG who have contributed insights over the years.

Drs. Neil Kelleher, Paul Thomas, and Andy Forbes, and ProSight Development Team (past and present)

Yury Bukhman, James McCurdy, Adam Halstead, Irene Ong (Area 3), Mary Lipton (PNNL), Kathryn Richmond (Enabling Technologies) and others

Proteomics Core: Reid Townsend, Petra Gilmore, Cheryl Lichti, James Malone, Alan Davis

Michael Gross (NCRR Mass Spec), Henry Rohrs (NCRR Mass Spec), Ron Bose (Oncology), Mike Boyne (FDA), Jeffry Hiken (Genetics)

Le-Shin Wu, Carrie Ganote, Tom Doak, Bill Barnett, and a cast of thousands

Limbrick Laboratory: David Limbrick, Diego Morales

Holtzman Laboratory: David Holtzman, Rick Perrin, Jacqueline Payton, Chengjie Xiong (Biostatistics)

National Center for Genome Analysis Support

Washington Univ. School of Medicine

The Kelleher Research Group

Page 3

Differential Omics Studies

1. RNA-seq, bottom-up proteomics, metabolomics

2. Looking for a list of discovered entities that have different expression levels between treatments

3. Very popular for target discovery

4. Frequently done on organisms before a genome is completed

Page 4

'P score':

$$P_{f,n} = \frac{(xf)^{n}\, e^{-xf}}{n!}$$

where f is the # of input fragment ions, n is the # of matches, and $M_a$ is the mass accuracy, with

$$x = \frac{1}{111.1} \times 2 \times M_a \times 2$$

$$p_{crude} = 1 - \sum_{i=0}^{n-1} P_{f,i}$$

F. Meng, B. Cargile, L. Miller, J. Johnson, and N. Kelleher, Nat. Biotechnol., 2001, 19, 952-957.

“Kelleher P-Score” Example
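The Poisson form of the P score on this slide, $P_{f,n} = (xf)^n e^{-xf} / n!$, can be sketched in a few lines. This is a minimal illustration rather than the ProSight implementation. The function names are hypothetical. The per-fragment match probability `x = (1/111.1) * 2 * M_a * 2` and the tail sum `1 - sum(P_{f,i} for i < n)` are assumed readings of the slide, and the unit of `M_a` (Da) is also an assumption.

```python
from math import exp, factorial

AVG_RESIDUE_MASS = 111.1  # assumed: the 111.1 constant from the slide, in Da


def match_probability(mass_accuracy_da):
    """Assumed per-fragment chance of a random match: (1/111.1) * 2 * M_a * 2."""
    return (1 / AVG_RESIDUE_MASS) * 2 * mass_accuracy_da * 2


def poisson_term(f, n, x):
    """P_{f,n}: Poisson probability of exactly n random matches among f fragments."""
    lam = x * f
    return lam ** n * exp(-lam) / factorial(n)


def p_score(f, n, x):
    """Probability of n or more random matches: 1 - sum of P_{f,i} for i < n.

    Note: for very small scores the subtraction loses floating-point
    precision; this sketch does not try to correct for that.
    """
    return 1.0 - sum(poisson_term(f, i, x) for i in range(n))


# Hypothetical query: 40 fragment ions, 12 matches, 0.01 Da mass accuracy.
x = match_probability(0.01)
print(p_score(40, 12, x))
```

More matches for the same number of input fragments give a smaller score, which is why a low P score indicates a match unlikely to have arisen by chance.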

Page 5

Modeling the Scrambled P-Scores

Motivation Goodness of Fit

9,839 MS/MS Queries (MS1 and MS2 data)

Page 6

Better is better, but the easy ones are easy

Page 7

Computers Ask the Darndest Questions

Page 8

Top Down Proteomics!

• Three pillars of proteomics:
  • Identification
  • Characterization
  • Quantification

• Top down proteomic studies are underway.

• These are large and complex studies (at several institutions, a typical production bottom-up study would have 200+ LC runs)

Page 9

Top Down Proteomics Biometrics

Sources of variation:

1. Intensity calculation

2. LC alignment

3. Mass Spec Physics

4. Separation (different fractions, etc.)

5. Protein isolation (ChIP, RBC ghosts, etc.)

6. Tissue variation

7. Individual variation

8. Population variation

9. Random and systematic errors

Page 10

Experimental Design

• Ronald A. Fisher (1926): "The Arrangement of Field Experiments"

• All measurements have errors

• All biological systems have individual variation

• The goal of experimental design is to design the experiment so that the variation can be partitioned

• Typically testing variation between groups against the variation within groups

Healthy Group: Sub 1, Sub 2; replicates R1 R2 R3 R4 R5 R6 R7 R8

Diseased Group: Sub 1, Sub 2; replicates R1 R2 R3 R4 R5 R6 R7 R8
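Testing variation between groups against variation within groups can be sketched as a one-way ANOVA F ratio. This is a minimal illustration, not the analysis used in the talk. The function name `f_ratio` and the intensity values are hypothetical, and each group is treated as a flat list of measurements, which ignores the nesting of replicates within subjects.

```python
from statistics import mean


def f_ratio(groups):
    """One-way ANOVA F ratio: between-group mean square / within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    k = len(groups)
    n_total = len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variation of each measurement around its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical log-intensities, invented for illustration only.
healthy = [10.1, 9.8, 10.3, 10.0]
diseased = [11.2, 11.0, 11.5, 10.9]
print(f_ratio([healthy, diseased]))
```

A large ratio means the groups differ by more than the scatter inside each group would explain. A design that respects the subject level would add a subject term instead of pooling replicates.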

Page 11

Typical Results: Human RBC Ghosts

[Figure: Control Samples vs. PNH Samples (samples 1 to 13) for RAP1A; gel panels labeled RAP1A, Coomassie, Catalase, and Peroxiredoxin, with lanes CON_1, CON_6, PNH_9, and PNH_10]

Page 12

Populations of Experiments

• Instead of doing 1 experiment, you are doing an unknown number of experiments

• Number of experiments determined by how many unique entities are observed consistently over the entire set of observations

[Figure: Control Samples and PNH Samples]
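Counting how many entities are observed consistently, and therefore how many separate tests the study contains, can be sketched as follows. This is an illustration under stated assumptions: the function name, the `min_fraction` threshold, the sample labels, and the entity names are all hypothetical, and "consistently" is taken to mean "present in at least a given fraction of samples".

```python
def consistent_entities(observations, min_fraction=1.0):
    """Return the entities seen in at least min_fraction of the samples.

    observations maps a sample label to the collection of entities
    identified in that sample.
    """
    n_samples = len(observations)
    counts = {}
    for entities in observations.values():
        for entity in set(entities):  # count each entity once per sample
            counts[entity] = counts.get(entity, 0) + 1
    return {e for e, c in counts.items() if c >= min_fraction * n_samples}


# Hypothetical identifications, invented for illustration only.
runs = {
    "sample_1": {"protein_A", "protein_B", "protein_C"},
    "sample_2": {"protein_A", "protein_B"},
    "sample_3": {"protein_A", "protein_B", "protein_C"},
    "sample_4": {"protein_A", "protein_C"},
}
tested = consistent_entities(runs)
print(len(tested), "entities give", len(tested), "separate tests")
```

Relaxing `min_fraction` raises the number of entities tested, and with it the number of experiments being run at once.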

Page 13

Typical Results: Breast Cancer Model

Page 14

Sources of Variation: The Model

$$I_{ijkl} = a_i + d_{j(i)} + r_{k(ij)} + e_{ijkl}$$

Where
i = 1 or 2 and represents the two preparations,
j = 1 to 3 for each digestion within a given preparation,
k = 1 to 3 for each injection (or run) within each digestion,
l = 1 to the number of peptides for the given protein.

Under this model, let
$a_i$ be the effect for the i-th random preparation,
$d_{j(i)}$ be the effect for the j-th random digestion from the i-th preparation,
$r_{k(ij)}$ be the effect for the k-th random run from the j-th digestion from the i-th preparation,
$e_{ijkl}$ be the residuals.
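The nested model on this slide (preparation, digestion within preparation, run within digestion, residual) can be sketched with method-of-moments variance component estimates for a balanced design. This is an assumption-laden illustration, not the estimator used in the talk: the function names are hypothetical, the data are simulated, negative estimates are clamped to zero, and every level is assumed to be balanced.

```python
import random
from statistics import mean


def variance_components(data):
    """Method-of-moments estimates for a balanced nested design.

    data[i][j][k] is the list of intensities (index l) for
    preparation i, digestion j, run k.
    """
    a, b, c = len(data), len(data[0]), len(data[0][0])
    n = len(data[0][0][0])

    run_means = [[[mean(run) for run in dig] for dig in prep] for prep in data]
    dig_means = [[mean(rm) for rm in prep] for prep in run_means]
    prep_means = [mean(dm) for dm in dig_means]
    grand = mean(prep_means)

    # Mean squares for each level of the hierarchy.
    ms_a = n * c * b * sum((m - grand) ** 2 for m in prep_means) / (a - 1)
    ms_d = n * c * sum((dig_means[i][j] - prep_means[i]) ** 2
                       for i in range(a) for j in range(b)) / (a * (b - 1))
    ms_r = n * sum((run_means[i][j][k] - dig_means[i][j]) ** 2
                   for i in range(a) for j in range(b)
                   for k in range(c)) / (a * b * (c - 1))
    ms_e = sum((x - run_means[i][j][k]) ** 2
               for i in range(a) for j in range(b) for k in range(c)
               for x in data[i][j][k]) / (a * b * c * (n - 1))

    # Solve the expected-mean-square equations; clamp negatives to zero.
    return {
        "preparation": max((ms_a - ms_d) / (n * c * b), 0.0),
        "digestion": max((ms_d - ms_r) / (n * c), 0.0),
        "run": max((ms_r - ms_e) / n, 0.0),
        "residual": ms_e,
    }


def simulate(rng, sd_a, sd_d, sd_r, sd_e, n_peptides=5):
    """Simulate 2 preparations x 3 digestions x 3 runs from the model."""
    data = []
    for _ in range(2):
        a_i = rng.gauss(0, sd_a)
        prep = []
        for _ in range(3):
            d_ij = rng.gauss(0, sd_d)
            dig = []
            for _ in range(3):
                r_ijk = rng.gauss(0, sd_r)
                dig.append([a_i + d_ij + r_ijk + rng.gauss(0, sd_e)
                            for _ in range(n_peptides)])
            prep.append(dig)
        data.append(prep)
    return data


# Hypothetical standard deviations, chosen only to exercise the code.
print(variance_components(simulate(random.Random(0), 0.5, 0.3, 0.2, 0.1)))
```

With only two preparations the preparation component rests on a single degree of freedom, so its estimate is very noisy.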

Page 15

Variance Component Estimates

Page 16

Power Calculations

[Figure: Power Curves for High Subject Low Residual Variation. x-axis: Effect Size (in STD), 0 to 2.5; y-axis: Power, 0 to 1; curves for n=5 and n=20; annotations: Human Subjects, Inbred Mice]
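Power curves of this kind, power against effect size in standard deviations for n=5 and n=20, can be approximated with a short calculation. This sketch uses a two-sided, two-sample normal (z-test) approximation, which is an assumption and not necessarily the method behind the slide. The function name and the choice of alpha = 0.05 are also assumptions.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the difference between group means in units of the
    common standard deviation; n_per_group is the size of each group.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability that the test statistic lands in either rejection tail.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Effect sizes 0, 0.5, ..., 2.5 for the two group sizes on the slide.
for n in (5, 20):
    curve = [round(power_two_sample(d / 2, n), 3) for d in range(6)]
    print(f"n={n}:", curve)
```

For small groups a t-based calculation gives somewhat lower power than this approximation, so the sketch is best read as an upper guide.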

Page 17

Systems Analysis

• What to do with the laundry lists of significant genes?
  • Gene Ontology Analysis
  • Gene Set Enrichment Analysis

• Often paired with RNA or metabolomic data.

• Creates a third level of analysis
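One common way to turn a list of significant genes into a Gene Ontology result is an over-representation test based on the hypergeometric distribution. The sketch below is an assumption about how such an analysis might be set up, not the procedure from the talk. The function name and the example counts are hypothetical, and this is not the rank-based Gene Set Enrichment Analysis method.

```python
from math import comb


def enrichment_p_value(population, annotated, selected, hits):
    """Upper-tail hypergeometric probability.

    population: number of genes considered in total
    annotated:  number of those genes carrying the term of interest
    selected:   number of genes on the significant list
    hits:       number of significant genes carrying the term

    Returns the chance of seeing `hits` or more annotated genes in a
    random list of size `selected`.
    """
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, selected - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts, invented for illustration only.
print(enrichment_p_value(population=2000, annotated=40, selected=100, hits=8))
```

Because one such test is run per term, the resulting p-values form their own population of experiments and need the same care as the entity-level tests.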

Page 18

To Review

• Everything is in place for top-down proteomic studies.

• In any discovery omic study, extreme care must be taken – lots of pilot work to understand the behavior of your analytic system

• Technology and mathematical formalism do not trump biology. (Bad experimental design results in bad experiments)

Page 19

• Funded by National Science Foundation

1. Large memory clusters for assembly

2. Bioinformatics consulting for biologists

3. Optimized software for better efficiency

• Partner Institutions:
  • Extreme Science and Engineering Discovery Environment (XSEDE)
  • Texas Advanced Computing Center (TACC) at the University of Texas at Austin
  • San Diego Supercomputer Center (SDSC) at the University of California, San Diego
  • Pittsburgh Supercomputing Center (PSC)

• Open for business at: http://ncgas.org

Shameless NCGAS Plug

Questions?