Bioinformatics Cyber-infrastructure for Genomics and Proteomics in Systems Biology

By Xianfeng (Jeff) Chen, Computational and Systems Biologist

May 7, 2009

Post on 26-Dec-2015



Agenda Today

(1) Cyber-infrastructure and systems biology.

(2) High performance computing and software for peptide/protein identification and quantification, and for data mining/target discovery, on mass spectrometry-generated proteomics data.

(3) Relational database management systems, genome annotation methodology, systems biology data integration, and biology knowledge generation and augmentation.

Section One: Cyber-infrastructure and Systems Biology

Reductionist approach: one gene, one protein

Systems approach: multiple genes, network analysis

Cutting edge science and technology

Status of Technologies in Systems Biology

Cyber-infrastructure for Systems Biology

• “… build new types of scientific and engineering knowledge environments and organizations to pursue research in new ways and with increased efficacy.”

• … new NSF funding of $1 billion per year is needed to achieve critical mass …

2008: Awarded $50 million

http://www.communitytechnology.org/nsf_ci_report/

2004: Awarded $100 million

2004: Awarded $85 million

Supporting Cyber-infrastructure and Systems Biology Workflow

Historic strong area

Supporting

(DOE - Genomics: GTL Roadmap, p.52)

Cyber-knowledge System to Enable Genomics-based Predictive Medicine

System Integration at Systems Biology Center

Core Laboratory Facility: Data Generation

Core Computational Facility: Data Processing, Storage, and Dissemination

Cyber-infrastructure, Data Management, Data Analysis Pipeline, and Data Display

(1) LIMS for raw data & protocol

(2) Preprocessed data management

(3) High throughput computing

(4) Data validation and integration

(5) Knowledge representation

Data Mining and Knowledge Discovery

Cyber-infrastructure Component (1): High Performance Computing

--- Migration of Bio-Computing Capability

Start point: PC single-CPU computing (most labs)

Step 1: Unix multiple-CPU computing (5-10 biological labs in US)

Step 2: Cluster computing (2-4 biological labs)

For analysis of large data sets

Cyber-infrastructure Component (2) : Integrated Knowledgebase System

--- Case Study of National Biodefense Proteomics Data Center

Public File Server

Private File Server

Oracle Relational Database

Database query, data upload over HTTP

Batch Processing

(1) Data uploading;

(2) Data validation;

(3) Data analysis;

(4) Data processing

Perl, Java

Web services

Data exchange using XML-based SOAP

---- System Integration Case 1: UVa Proteomics Data Center

High Performance and Throughput Computing

Data Management

Section Two: High Performance Computing and Proteomics

Computational Proteomics Software and Algorithms

Protein Database Search Engines

Mascot: Matrix Science

Sequest / Bioworks: Scripps/Thermo

X! Tandem: the GPM

Spectrum Mill: Agilent Technologies

OMSSA: NCBI

PEAKS: Bioinformatics Solutions Inc.

Phenyx: GeneBio

Statistical Validation and Quantitation

PeptideProphet: Institute for Systems Biology

ProteinProphet: Institute for Systems Biology

ASAPRatio, XPRESS, Libra: Institute for Systems Biology

Scaffold Batch System: Proteome Software, Inc.

SIEVE: Thermo

Census: Scripps Research Institute

Open Data Standards

FuGE and XAR: FHCRC, ICBC, ITMAT, & Manchester

MIAPE: HUPO PSI and Collaborators

mzXML, pepXML, protXML: Institute for Systems Biology

MS1, MS2, SQT: Scripps Research Institute

Many more …

System Integration Case 2: National Biodefense Proteomics Data Center

http://www.proteomicsresource.org

Awarded $14 million

Proteomics Research Centers (PRC) and Their Major Data Types

PRC Organizations: Major Data Types

(1) University of Michigan: Microarray and mass spectrometry

(2) Caprion Pharmaceuticals: Mass spectrometry

(3) Harvard Proteomics Institute: Genomics and protein expression array

(4) Albert Einstein College of Medicine: Mass spectrometry

(5) PNNL: Mass spectrometry

(6) Scripps: NMR structural data, X-ray crystal diffraction data, and mass spectrometry

(7) Myriad Genetics: Yeast two-hybrid system

Proteomics Data Flow

Data sources: PRCs, VBI, public

Data types: 2D gels, protein array, LC, immunoaffinity purification, Y2H, MS, MS/MS, NMR, X-ray cryoelectron microscopy, X-ray diffraction, etc.

(1) Quality assurance and quality control (QA & QC)

(2) Converting to the standard format for each data type

(3) Quality assurance and quality control (QA & QC)

(4) Data modeling / decomposition

(5) Relational database

MIAME and MIAPE-like standards/SOP for data submission

Proteomics Database Architecture

Search By Experiment/Sample

Databases in Proteomics Data Center

• Annotation improvement and interaction network analysis

(1) Non-homology-based methods: phylogenetic profiling, Rosetta stone pattern, operon analysis, co-expression profiling, gene neighboring, etc.

(2) Comparative genomics with reference genomes of model organisms: E. coli, yeast, Arabidopsis, etc.

• Identifying anchor points for data integration

(1) Known metabolic pathway;

(2) Known signal transduction pathway;

(3) Known gene regulation machinery;

(4) Known protein-protein interaction map.

Strategies for Annotating Raw Data into Meaningful Knowledge

BMC Bioinformatics 2006, 7 (Suppl 4):S18
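Phylogenetic profiling, the first non-homology-based method listed on this slide, can be sketched in a few lines. This is only an illustration under stated assumptions: the gene names, the six-genome presence/absence profiles, and the 0.8 Jaccard threshold are hypothetical and do not come from the talk. The idea is that genes present and absent in the same set of genomes are candidates for a functional link.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two presence/absence profiles (sequences of 0/1)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def linked_pairs(profiles, threshold=0.8):
    """Return (gene1, gene2, score) for every gene pair whose profiles are similar."""
    pairs = []
    for g1, g2 in combinations(sorted(profiles), 2):
        score = jaccard(profiles[g1], profiles[g2])
        if score >= threshold:
            pairs.append((g1, g2, score))
    return pairs


# Hypothetical profiles: one column per reference genome, 1 = ortholog present.
profiles = {
    "geneA": (1, 0, 1, 1, 0, 1),
    "geneB": (1, 0, 1, 1, 0, 1),
    "geneC": (0, 1, 0, 0, 1, 0),
    "geneD": (1, 0, 1, 0, 0, 1),
}

candidates = linked_pairs(profiles)
```

With these made-up profiles only geneA and geneB share an identical pattern, so they are the only pair reported. A real analysis would use many more genomes and a significance test rather than a fixed threshold.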

Qualitative Data Integration and Knowledge Augmentation Based on Networks Biology

Quantitative Proteome Profiling

--- The field is 2-3 years old

Thermo SIEVE Scatter Plot of 14 UVa Raw Files for Validation of Data Quality and Absolute Quantification.

Scaffold capability: proteome spectral counts for semi-quantification.
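Spectral counting as a semi-quantitative measure can be illustrated with a short sketch. The slides do not say how Scaffold normalizes counts, so the example below swaps in a generic, widely used scheme, the normalized spectral abundance factor (NSAF); the protein names, counts, and lengths are hypothetical.

```python
def nsaf(spectral_counts, lengths):
    """Normalized spectral abundance factor.

    Each protein's spectral count is divided by its length (SAF), and the
    SAF values are then scaled so that they sum to 1 within the sample.
    """
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    if total == 0:
        return {p: 0.0 for p in saf}
    return {p: value / total for p, value in saf.items()}


# Hypothetical sample: spectral counts and protein lengths (in residues).
counts = {"protA": 40, "protB": 10, "protC": 50}
lengths = {"protA": 400, "protB": 100, "protC": 1000}

abundance = nsaf(counts, lengths)
```

Here protC has the highest raw count but the lowest normalized abundance, because its count is spread over a much longer sequence. That length bias is the reason raw spectral counts are usually normalized before samples are compared.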

Search Engine Comparison at UVa Proteomics Data Center (1)

Few common annotations

Low annotation rates

Peptide/Protein Identifications with Various Protein Database Search Engines (2)

X!Tandem missed

OMSSA missed

Sequest over-predicted

UVaPDC, MS/MS Search Engine Comparison (3)

Spectra counts

Common annotations

Statistics on confidence values
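The kind of comparison summarized on these search engine slides (common annotations, identifications missed by one engine) can be expressed with plain set operations. The sketch below is an assumption about how such a tally might be computed, not the procedure used at the UVa Proteomics Data Center; the identifiers P1 to P5 are hypothetical, and the function expects at least one engine.

```python
def compare_engines(ids_by_engine):
    """Summarize agreement between search engines.

    Returns (common, missed, unique):
      common - identifications reported by every engine
      missed - per engine, identifications reported elsewhere but not by it
      unique - per engine, identifications reported by no other engine
    """
    sets = {engine: set(ids) for engine, ids in ids_by_engine.items()}
    all_ids = set().union(*sets.values())
    common = set.intersection(*sets.values())
    missed = {}
    unique = {}
    for engine, ids in sets.items():
        others = set()
        for other, other_ids in sets.items():
            if other != engine:
                others |= other_ids
        missed[engine] = all_ids - ids
        unique[engine] = ids - others
    return common, missed, unique


# Hypothetical identifications from three engines on the same spectra.
results = {
    "Sequest": {"P1", "P2", "P3", "P4", "P5"},
    "X!Tandem": {"P1", "P2", "P3"},
    "OMSSA": {"P1", "P2", "P4"},
}

common, missed, unique = compare_engines(results)
```

An identification reported by a single engine is not necessarily wrong, so the `unique` sets only flag candidates for closer statistical validation.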

Statistics and Summarization Capability of Scaffold

--- The best feature of the software

Data Mining on Data Processed via Computational Approach

Knowledge-based Discovery

Identified

Rate-limiting step

Knowledge Inference

Inference on Gene Network in Systems Biology

(1) Y2H, (2) MS pull down assay, (3) Co-expression assay.

Where are the significant regulatory steps impacting pathway expression?

Target/lead protein

[Pathway diagram; labeled components: Raf, MAPK, EDH1, EPS8L1* or EPS8L2*, GDP, GTP, NRas*, EPS15, Mucin-4*, Gα*, Gγ, P, EGFR, adenylate cyclase, ATP, cAMP, cell proliferation, MP formation]

Urinary Biomarker Identification --- EGFR Pathway-Related Bladder Cancer

----- Small scale analysis

* Differentially expressed

[Workflow diagram; labeled steps: urine from a patient with bladder cancer and from a healthy individual; urine microparticles (exosomes, ectosomes); LC-MS/MS; SEQUEST; spectral count analysis; Western blotting; EPS8L2]

Pattern Matching on Gene Signatures at Various Biological States

--- Large-scale analysis

*** Query signatures are compared to reference gene/protein expression signatures for known perturbations or disease phenotypes (many-to-many association analysis).
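A many-to-many comparison of query signatures against reference signatures might look like the sketch below. The slide does not name a similarity measure, so cosine similarity over the genes shared by both signatures is an assumption chosen for simplicity; the signature names, gene names, and expression values (for example log fold changes) are hypothetical.

```python
import math


def cosine(query, reference):
    """Cosine similarity of two signatures over the genes they share."""
    shared = set(query) & set(reference)
    dot = sum(query[g] * reference[g] for g in shared)
    norm_q = math.sqrt(sum(query[g] ** 2 for g in shared))
    norm_r = math.sqrt(sum(reference[g] ** 2 for g in shared))
    if norm_q == 0 or norm_r == 0:
        return 0.0
    return dot / (norm_q * norm_r)


def match_signatures(queries, references):
    """Score every query against every reference, highest similarity first."""
    hits = [
        (q_name, r_name, cosine(q_sig, r_sig))
        for q_name, q_sig in queries.items()
        for r_name, r_sig in references.items()
    ]
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Hypothetical signatures: gene -> expression change.
queries = {
    "query1": {"g1": 1.0, "g2": 2.0, "g3": -1.0},
    "query2": {"g1": -1.0, "g2": -1.0, "g3": 0.5},
}
references = {
    "perturbation_X": {"g1": 2.0, "g2": 4.0, "g3": -2.0},
    "phenotype_Y": {"g1": -1.0, "g2": -2.0, "g3": 1.0},
}

hits = match_signatures(queries, references)
```

Strongly positive scores suggest that a query resembles a known perturbation or phenotype, and strongly negative scores suggest an opposing pattern. At large scale, a rank-based statistic with a permutation test would be a more robust choice than a raw cosine score.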

Section Three: Knowledge Base Establishment

Database Case 1: Soybean Upstream Regulatory Elements for Ongoing Regulatory Motif Annotation


Nominated Transcription Factor Involved in Stress Response

Group IX

Red Dot = Soybean ERF genes

Implicated in regulating wounding and jasmonate responses

Soybean promoters: GmERFs, Gmubis, Gmcons, GmWRKYs, and more

10 promoters per month

Ongoing Effort on Transcription Factor Binding Motifs

---- Identify genetic circuits of cell wall, starch, and lipid biosynthesis and degradation

Elucidation of Conserved Co-expression Networks via Data Integration with Expression Profiling Data

(1) BMC Bioinformatics. 2007, 8:129.

(2) BMC Bioinformatics. 2008, 9:53.

Database Case 2: CGKB and TOBFAC Knowledge Bases

Genome Annotation Strategy (1) : Homology-based Annotation

263,425 total cowpea gene space sequences (GSS).

High-level coding region detection!

BMC Genomics. 2008, 9:103.

Genome Annotation Strategy (2) : Metabolic Pathway Integration

BMC Bioinformatics. 2007, 8:129.

Genome Annotation Strategy (3) : GO Integration with Distribution of Function Assignments

BMC Genomics. 2008, 9:103.

Genome Annotation Strategy (4): Comparative Genomics at Genome-scale

BMC Genomics. 2008, 9:103.

---- Example of Medicago vs. cowpea

Genome Annotation Strategy (5): Comparison at Gene Family Level

(1) BMC Genomics. 2008, 9:103.

(2) Plant Physiology. 2008, 147:280-295.

--- WRKY and CONSTANS (CO) and CO-like Gene Families of Cowpea Transcription Factors

Genome Annotation Strategies: (6) Repeat, (7) Domain, (8) Gene Model

BMC Bioinformatics. 2007, 8:129.

Repeat

Domain

Gene Model

Genome Annotation Strategy (9) : Comparative Genomics on Network for Conserved Protein Complexes

Comparative genome analysis

Conserved networks

Published Protein-Protein (PPI) Interactions in Organisms

Example of Yeast PPI

Genome Annotation Strategy (10): Functional Validation of Genes of Interest Through Reverse Genetics Program


Acknowledgement
