Prediction of Human Protein Function According to Gene Ontology Categories
L. J. Jensen, R. Gupta, H.-H. Stærfeldt and S. Brunak
Bioinformatics, Vol. 19, No. 5, pp. 635-642, 2003

Dec 30, 2015


Susanna Weaver
Transcript
Page 1:

2003/12/5 PPLAB 1

Prediction of Human Protein Function According to Gene Ontology Categories

L. J. Jensen, R. Gupta, H.-H. Stærfeldt and S. Brunak

Bioinformatics, Vol. 19, No. 5, pp. 635-642, 2003

Page 2:

Outline

Introduction
System and Methods
Discussion
Conclusion

Page 3:

Introduction

For most whole-genome sequencing projects, the function of a large fraction of the proteins remains unknown

This paper extends the ProtFun prediction method, which previously predicted cellular role categories and enzymatic function, to also cover a number of GO categories
– A more specific description of the function

Page 4:

System and Methods

The Neural Networks Approach
– Data set
– Data set partitioning
– Choosing the classes to predict
– Sequence-derived protein features
– Neural network training and feature selection
– Making predictions with the neural networks

Page 5:

The Neural Networks

Page 6:

Data Set

Generation of a Labeled Data Set
– Uses the InterPro database, in which protein families have been assigned GO numbers
– Links these assignments with a list of InterPro domain matches to SWISS-PROT and TrEMBL
– Result: a set of 21,401 human sequences with annotated GO numbers
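The labeling step above amounts to a join between two tables. The following is a minimal sketch of that idea only; the function name and all identifiers (`IPR_A`, `P1`, `GO:1`, and so on) are made up for illustration and are not taken from the paper or from the real databases.

```python
from collections import defaultdict

def annotate_proteins(interpro_to_go, protein_matches):
    """Transfer GO numbers from InterPro families to the proteins they match."""
    protein_to_go = defaultdict(set)
    for protein, families in protein_matches.items():
        for family in families:
            # Families without GO assignments contribute nothing.
            protein_to_go[protein].update(interpro_to_go.get(family, ()))
    # Keep only proteins that received at least one GO number.
    return {p: gos for p, gos in protein_to_go.items() if gos}

# Toy, hypothetical identifiers (not real InterPro or GO accessions).
interpro_to_go = {"IPR_A": {"GO:1", "GO:2"}, "IPR_B": {"GO:3"}}
protein_matches = {"P1": ["IPR_A"], "P2": ["IPR_A", "IPR_B"], "P3": ["IPR_C"]}
labeled = annotate_proteins(interpro_to_go, protein_matches)
```

In this toy input, `P3` matches only a family without GO numbers, so it is left out of the labeled set.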

Page 7:

Data Set Partitioning

Cross Validation - Heuristic
– Divide the data set into five sets of equal size with minimal sequence similarity overlap between the sets

Unfortunately, either it was impossible to split the set into five unrelated subsets, or the heuristic failed to find a sufficiently good solution

Page 8:

Data Set Partitioning (cont.)

Reducing each of the five subsets to 2500 sequences
– By removing the sequences with the highest connectivity

Result: a five-fold cross validation set of 12,500 sequences with no significant similarity between sequences in different subsets
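One way to picture the reduction step is the sketch below. The slides do not define "connectivity" or the removal order, so this assumes that connectivity means the number of similar sequences located in a different subset, and that sequences are removed one at a time, most connected first. The function name and the toy sequence names are hypothetical.

```python
def prune_subsets(subsets, similar_pairs, target_size):
    """Greedily drop the most connected sequence from each oversized subset.

    Assumption: connectivity = number of similar sequences that currently
    sit in a *different* subset. The slides do not spell this out.
    """
    subsets = [set(s) for s in subsets]
    neighbours = {}
    for a, b in similar_pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    def connectivity(seq, own):
        others = [s for s in subsets if s is not own]
        return sum(1 for n in neighbours.get(seq, ()) if any(n in s for s in others))

    for own in subsets:
        while len(own) > target_size:
            # sorted() makes tie-breaking deterministic.
            worst = max(sorted(own), key=lambda s: connectivity(s, own))
            own.remove(worst)
    return subsets

# Toy example: two subsets of three sequences, reduced to two each.
pruned = prune_subsets(
    [{"a1", "a2", "a3"}, {"b1", "b2", "b3"}],
    [("a1", "b1"), ("a1", "b2"), ("a2", "b1")],
    target_size=2,
)
```

In the toy example, `a1` and `b1` carry all cross-subset similarity, so removing them leaves two subsets with no similar pairs between them.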

Page 9:

Choosing the Classes to Predict

GO Categories as of June 10th, 2001
– 1532 of 7949 different classes were represented in the data set described above
– Leaving 347 categories that were annotated to at least 20 different InterPro families
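The class filter above can be sketched as a simple count over family annotations. This is an illustration under assumptions: the function name, the data layout (family mapped to a list of GO categories) and the toy identifiers are hypothetical; only the threshold of 20 families comes from the slide.

```python
def select_classes(family_to_go, min_families=20):
    """Return GO categories annotated to at least `min_families` families."""
    counts = {}
    for family, go_terms in family_to_go.items():
        for go in set(go_terms):  # count each family once per category
            counts[go] = counts.get(go, 0) + 1
    return sorted(go for go, n in counts.items() if n >= min_families)

# Toy data: 25 families carry "GO:A", only 5 of them also carry "GO:B".
family_to_go = {
    f"IPR{i:03d}": ["GO:A"] + (["GO:B"] if i < 5 else [])
    for i in range(25)
}
kept = select_classes(family_to_go)
```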

Page 10:

Sequence-Derived Protein Features

Features Used
– Aliphatic index
– Extinction coefficient
– Hydrophobicity
– Instability index
– Number of atoms
– Number of negative residues
– Number of positive residues
– Isoelectric point
– Secondary structure
– Transmembrane helices
– Low complexity regions
– Propeptides
– Signal peptides
– Protein targeting
– Protein sorting
– N-glycosylation
– S/T-phosphorylation
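Some of the features listed above follow directly from amino acid composition. The sketch below computes two of them; the slides do not say which tools or formulas were used, so this assumes the common conventions: negative residues are Asp (D) and Glu (E), positive residues are Lys (K) and Arg (R), and the aliphatic index follows Ikai's definition based on mole percentages. The function names are hypothetical.

```python
def charged_residue_counts(seq):
    """Return (negative, positive) residue counts for a one-letter sequence."""
    negative = sum(seq.count(r) for r in "DE")  # Asp + Glu
    positive = sum(seq.count(r) for r in "KR")  # Lys + Arg
    return negative, positive

def aliphatic_index(seq):
    """Aliphatic index: X(Ala) + 2.9*X(Val) + 3.9*(X(Ile) + X(Leu)),
    where X is the mole percentage of the residue."""
    n = len(seq)

    def mole_pct(residue):
        return 100.0 * seq.count(residue) / n

    return mole_pct("A") + 2.9 * mole_pct("V") + 3.9 * (mole_pct("I") + mole_pct("L"))

# Toy sequence with one of each relevant residue.
toy = "AVILDEKR"
counts = charged_residue_counts(toy)
index = aliphatic_index(toy)
```

The other features in the list (secondary structure, transmembrane helices, signal peptides and so on) are themselves outputs of separate prediction methods and are not covered by this sketch.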

Page 11:

Training

For each GO class, standard feed-forward neural networks with a single layer of hidden neurons were used to predict which examples belong to the class

For each feature combination, the input vector for the neural networks is the concatenation of the respective feature vectors, while the target output is a single value (1 or 0)
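The input encoding and network shape described above can be sketched as follows. This shows only the forward pass of a network with one hidden layer and a single output; the feature names, vector lengths, sigmoid activation and weight values are assumptions made for illustration and are not taken from the paper.

```python
import math

def build_input(feature_vectors, combination):
    """Concatenate the feature vectors named in `combination`, in order."""
    x = []
    for name in combination:
        x.extend(feature_vectors[name])
    return x

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_weights, output_weights):
    """Single hidden layer, single output in (0, 1).

    hidden_weights: one weight list per hidden neuron, bias as last entry.
    output_weights: one weight per hidden neuron, bias as last entry.
    """
    hidden = [
        sigmoid(sum(w * v for w, v in zip(ws[:-1], x)) + ws[-1])
        for ws in hidden_weights
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights[:-1], hidden)) + output_weights[-1])

# Hypothetical encoded features for one protein.
features = {"hydrophobicity": [0.4], "tm_helices": [0.0, 1.0], "signal_peptide": [0.9]}
x = build_input(features, ("hydrophobicity", "signal_peptide"))

# Arbitrary example weights: two hidden neurons, two inputs each plus bias.
y = forward(x, [[0.5, -0.3, 0.1], [-0.2, 0.8, 0.0]], [1.0, -1.0, 0.2])
```

During training, `y` would be compared with the target value of 1 (protein belongs to the GO class) or 0 (it does not).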

Page 12:

Training (cont.)

Only 26 GO classes remained after removing those not strongly correlated with any of the predicted features

For each of these, the optimal feature combination was searched for using a greedy search heuristic

The final set of predictors consists of cross validation ensembles of five neural networks for each of 14 GO classes
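The slide mentions a greedy search heuristic for feature combinations without giving details. A common form is forward selection, sketched below as an assumption: start from the empty set and repeatedly add the single feature that most improves a score, stopping when nothing helps. The `score` function here is a made-up stand-in for cross-validated network performance, and the gain values are invented.

```python
def greedy_feature_selection(features, score):
    """Forward selection: add the best-scoring feature until no gain remains."""
    selected, best = [], float("-inf")
    remaining = list(features)
    while remaining:
        candidate, candidate_score = None, best
        for f in remaining:
            s = score(selected + [f])
            if s > candidate_score:
                candidate, candidate_score = f, s
        if candidate is None:  # no feature improves the score
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected, best

# Hypothetical per-feature gains with a small penalty per added feature.
gains = {"signal_peptide": 0.30, "tm_helices": 0.20,
         "isoelectric_point": 0.10, "n_atoms": 0.01}

def toy_score(combo):
    return sum(gains[f] for f in combo) - 0.04 * len(combo)

chosen, chosen_score = greedy_feature_selection(gains, toy_score)
```

Greedy search evaluates far fewer combinations than an exhaustive search, at the cost of possibly missing the best combination.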

Page 13:

Making Predictions

Use the neural networks to predict the function of novel sequences
– Sequence-derived features are calculated and encoded
– For each GO class, the average output is calculated and converted to a probability using a calibration curve
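The prediction step above can be sketched as an average followed by a lookup. The slides do not describe how the calibration curve is represented, so this assumes a list of (score, probability) points with linear interpolation between them; the curve values and network outputs are invented.

```python
from bisect import bisect_right

def ensemble_average(outputs):
    """Mean output of the networks in one cross validation ensemble."""
    return sum(outputs) / len(outputs)

def calibrate(score, curve):
    """Piecewise-linear interpolation on (score, probability) points.

    `curve` must be sorted by score; values outside its range are clamped.
    """
    xs = [x for x, _ in curve]
    if score <= xs[0]:
        return curve[0][1]
    if score >= xs[-1]:
        return curve[-1][1]
    i = bisect_right(xs, score)
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (score - x0) / (x1 - x0)

# Hypothetical calibration curve and outputs of five networks for one GO class.
curve = [(0.0, 0.0), (0.5, 0.2), (1.0, 0.9)]
avg = ensemble_average([0.6, 0.7, 0.8, 0.7, 0.7])
probability = calibrate(avg, curve)
```

The calibration step matters because a raw network output of 0.7 need not mean a 70% chance of class membership.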

Page 14:

Discussion

Why can so relatively few GO classes be predicted?
– Lack of data: for 90% of the GO classes we cannot assign a single positive example among human SWISS-PROT and TrEMBL entries
– Many of the categories were removed for other reasons

Page 17:

Discussion (cont.)

The method appears to be better at predicting biological process than molecular function

Novel putative receptors
Chromosomal clustering of proteins with similar function

Page 18:

Conclusions

We have succeeded in developing a sequence-based function prediction method for a subset of the GO

The method is well suited for computational screening of the human genome for novel drug targets

Page 19:

Thank you!

Yi-Yao Huang (黃奕堯)

[email protected]@par.cse.nsysu.edu.tw

Page 20:

What Is GO (Gene Ontology)?

The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases

Three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner

DAG (directed acyclic graph)
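Because GO is a directed acyclic graph rather than a tree, a term can have more than one parent, and an annotation to a term also implies its more general ancestor terms. The sketch below collects all ancestors of a term; the term names and the `parents` mapping are toy examples, not real GO identifiers.

```python
def ancestors(term, parents):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:  # a shared ancestor is visited only once
                seen.add(p)
                stack.append(p)
    return seen

# Toy DAG: "child" has two parents that share the same root.
parents = {"child": ["mid_a", "mid_b"], "mid_a": ["root"], "mid_b": ["root"]}
implied = ancestors("child", parents)
```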

Page 21:

Glossary

Protein sorting and protein targeting
– Proteins are sorted (delivered to their destination within the cell) in accordance with sorting signals

Propeptides, signal peptides and glycosylation
– For the product of a transgene to be expressed or retained in a specific organelle, a signal peptide or propeptide that directs the gene product to a specific location or organelle usually has to be added after the promoter. A signal peptide usually directs the gene product to the ER (endoplasmic reticulum), where it undergoes glycosylation and finally enters the secretory pathway to be secreted out of the cell.