Anastasia Nikolskaya PIR (Protein Information Resource), Georgetown University Medical Center FUNCTIONAL ANALYSIS OF PROTEIN SEQUENCES: ANNOTATION AND FAMILY CLASSIFICATION

Feb 25, 2016

Page 1: Anastasia Nikolskaya PIR (Protein Information Resource), Georgetown University Medical Center

Anastasia Nikolskaya
PIR (Protein Information Resource), Georgetown University Medical Center

FUNCTIONAL ANALYSIS OF PROTEIN SEQUENCES: ANNOTATION AND FAMILY CLASSIFICATION

Page 2

Overview

Problem:
- Most new protein sequences come from genome sequencing projects
- Many have unknown functions
- Large-scale functional annotation of these sequences based simply on BLAST best hit has pitfalls; results are far from perfect

Solution for Large-scale Annotation:
- Highly curated and annotated protein classification system
- Automatic annotation of sequences based on protein families

PIRSF Protein Classification System:
- Whole-protein family classification based on evolution
- Highly annotated, optimized for annotation propagation
- Functional predictions for uncharacterized proteins
- Used to facilitate and standardize annotations in UniProt

Functional Analysis of Protein Sequences:
- Homology-based (sequence analysis, structure analysis)
- Non-homology (genome context, phylogenetic distribution)

Page 3

Proteomics and Bioinformatics

- Data: Genome projects (sequencing)
- Data: Gene expression profiling (genome-wide analysis of gene expression)
- Data: Protein-protein interaction
- Data: Structural genomics (3D structures of all protein families)
- ….

Bioinformatics:
- Computational analysis and integration of these data
- Making predictions (function etc.), reconstructing pathways

Page 4

What’s In It For Me?

- When an experiment yields a sequence (or a set of sequences), we need to find out as much as we can about this protein and its possible function from available data
- Especially important for poorly characterized or uncharacterized (“hypothetical”) proteins
- More challenging for large sets of sequences generated by large-scale proteomics experiments
- The quality of this assessment is often critical for interpreting experimental results and making hypotheses for future experiments

Sequence -> function

Page 5

[Figure: from DNA sequence to gene to protein sequence to function. Labels: Genomic DNA Sequence (promoter with CACA, CAAT and TATA elements; 5' UTR; ATG; Exon1, Exon2, Exon3 separated by introns with GT/AG boundaries; AATAAA; 3' UTR); Gene Recognition; Protein Sequence; Structure Determination; Protein Structure; Family Classification; Protein Family; Molecular Evolution; Function Analysis; Gene Network; Metabolic Pathway]

Work with Protein, not DNA Sequence

Page 6

The Changing Face of Protein Science

20th century:
- Few well-studied proteins
- Mostly globular with enzymatic activity
- Biased protein set

21st century:
- Many “hypothetical” proteins (most new proteins come from genome sequencing projects; many have unknown functions)
- Various, often with no enzymatic activity
- Natural protein set

Credit: Dr. M. Galperin, NCBI

Page 7

Knowing the Complete Genome Sequence

Advantages:
- All encoded proteins can be predicted and identified
- The missing functions can be identified and analyzed
- Peculiarities and novelties in each organism can be studied
- Predictions can be made and verified

Challenge:
- Accurate assignment of known or predicted functions (functional annotation)

Page 8

[Figure panels: Escherichia coli, Methanococcus jannaschii, Yeast, Human]

Category | E. coli | M. jannaschii | S. cerevisiae | H. sapiens
Characterized experimentally | 2046 | 97 | 3307 | 10189
Characterized by similarity | 1083 | 1025 | 1055 | 10901
Unknown, conserved | 285 | 211 | 1007 | 2723
Unknown, no similarity | 874 | 411 | 966 | 7965

From Koonin and Galperin, 2003, with modifications
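To make the proportions behind these counts easier to compare across organisms, the short sketch below converts each column into fractions of that organism's total. The counts are copied from the table; the dictionary keys and the function name are illustrative choices, not part of the original slides.

```python
# Counts copied from the table (Koonin and Galperin, 2003, with modifications).
# Key names are illustrative labels for the four rows.
COUNTS = {
    "E. coli": {"experimental": 2046, "by_similarity": 1083,
                "unknown_conserved": 285, "unknown_no_similarity": 874},
    "M. jannaschii": {"experimental": 97, "by_similarity": 1025,
                      "unknown_conserved": 211, "unknown_no_similarity": 411},
    "S. cerevisiae": {"experimental": 3307, "by_similarity": 1055,
                      "unknown_conserved": 1007, "unknown_no_similarity": 966},
    "H. sapiens": {"experimental": 10189, "by_similarity": 10901,
                   "unknown_conserved": 2723, "unknown_no_similarity": 7965},
}


def category_fractions(counts):
    """Return each category's share of one organism's total protein count."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


FRACTIONS = {organism: category_fractions(c) for organism, c in COUNTS.items()}

for organism, fractions in FRACTIONS.items():
    row = ", ".join(f"{cat}={frac:.1%}" for cat, frac in fractions.items())
    print(f"{organism}: {row}")
```

Read this way, the table shows how strongly the experimentally characterized share differs between a model organism such as E. coli and a less studied archaeon such as M. jannaschii.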

Page 9

Functional Annotation for Different Groups of Proteins

- Experimentally characterized
  - Find up-to-date information, accurate interpretation
- Characterized by similarity (“knowns”) = closely related to experimentally characterized
  - Avoid propagation of errors
- Function can be predicted (no close sequence similarity, may be distant similarity to characterized proteins)
  - Extract maximum possible information, avoid errors and overpredictions
  - Most value-added (fill the gaps in metabolic pathways, etc.)
- “Unknowns” (conserved or unique)
  - Rank by importance

Page 10

How are Protein Sequences Annotated? (“regular approach”)

Protein Sequence -> Function

- Automatic assignment based on sequence similarity (best BLAST hit): gene name, protein name, function
- Large-scale functional annotation of sequences based simply on BLAST best hit has pitfalls; results are far from perfect
- To avoid mistakes, need human intervention (manual annotation)

Quality vs Quantity

Page 12

Problems in Functional Assignments for “Knowns”

- Misinterpreted experimental results (e.g. suppressors, cofactors)
- Biologically senseless annotations
  - Arabidopsis: separation anxiety protein-like
  - Helicobacter: brute force protein
  - Methanococcus: centromere-binding protein
  - Plasmodium: frameshift
- “Goofy” mistakes of sequence comparison (e.g. abc1/ABC)
- Multi-domain organization of proteins
- Low sequence complexity (coiled-coil, transmembrane, non-globular regions)
- Enzyme evolution:
  - Divergence in sequence and function (minor mutation in active site)
  - Non-orthologous gene displacement: convergent evolution

Page 13

Problems in Functional Assignments for “Knowns”: multi-domain organization of proteins

[Figure: BLAST search with a new sequence that contains an ACT domain; the hit, a chorismate mutase, contains a chorismate mutase domain and an ACT domain]

In BLAST output, top hits are to chorismate mutases -> the name “chorismate mutase” is automatically assigned to the new sequence. ERROR! (The protein gets an erroneous name and EC number, is assigned to an erroneous pathway, etc.)
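One common guard against this kind of error is to require that an alignment span most of both the query and the hit before a name is transferred. The sketch below illustrates that idea only; the `Hit` record, the 0.8 threshold, and all coordinates are hypothetical and are not taken from the slides or from any BLAST parser.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One pairwise alignment (coordinates 1-based, inclusive). Hypothetical record."""
    subject_name: str
    query_len: int
    subject_len: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int


def coverage(start, end, length):
    """Fraction of a sequence covered by the aligned region."""
    return (end - start + 1) / length


def name_is_transferable(hit, min_coverage=0.8):
    """Allow name transfer only if the alignment spans most of BOTH sequences.

    min_coverage is an illustrative threshold, not a recommended value.
    """
    query_cov = coverage(hit.q_start, hit.q_end, hit.query_len)
    subject_cov = coverage(hit.s_start, hit.s_end, hit.subject_len)
    return query_cov >= min_coverage and subject_cov >= min_coverage


# Made-up numbers: a 90-residue ACT-only protein matches just the
# ACT domain at the end of a 380-residue chorismate mutase.
partial_hit = Hit("chorismate mutase", 90, 380, 5, 88, 295, 378)
# Made-up numbers: the same query matches another short protein end to end.
full_hit = Hit("ACT domain protein", 90, 95, 3, 90, 4, 93)

print(name_is_transferable(partial_hit))
print(name_is_transferable(full_hit))
```

A coverage filter catches only the partial-match case; it does not replace family-level curation, which is the approach the rest of the presentation argues for.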

Page 14

Problems in Functional Assignments for “Knowns”

Previous low quality annotations lead to propagation of mistakes

Page 16

Functional Prediction: I. Sequence and Structure Analysis (homology-based methods)

In non-obvious cases:
- Sophisticated database searches (PSI-BLAST, HMM)
- Detailed manual analysis of sequence similarities
- Structure-guided alignments and structure analysis

Often, only general function can be predicted:
- Enzyme activity can be predicted, the substrate remains unknown (ATPases, GTPases, oxidoreductases, methyltransferases, acetyltransferases)
- Helix-turn-helix motif proteins (predicted transcriptional regulators)
- Membrane transporters

Page 17

Using Sequence Analysis: Hints

- Proteins (domains) with different 3D folds are not homologous (unrelated by origin). Proteins with similar 3D folds are usually (but not always) homologous
- Those amino acids that are conserved in divergent proteins within a (super)family are likely to be functionally important (catalytic or binding sites, etc.)
- Reaction chemistry often remains conserved even when sequence diverges almost beyond recognition
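The second hint, that residues conserved across divergent family members are likely to be functionally important, can be illustrated with a minimal sketch that scans a multiple sequence alignment for invariant columns. The toy alignment below is made up for illustration, and strict identity is only the simplest possible conservation criterion.

```python
def conserved_columns(aligned_seqs, gap="-"):
    """Return 0-based alignment columns where every sequence has the same residue."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")
    conserved = []
    for i in range(length):
        column = {seq[i] for seq in aligned_seqs}
        # A column counts as conserved only if it holds one residue and no gap.
        if len(column) == 1 and gap not in column:
            conserved.append(i)
    return conserved


# Made-up toy alignment of three divergent sequences.
ALIGNMENT = [
    "MKG-DLHTA",
    "MRGSDIHSA",
    "MQG-DVHNA",
]

print(conserved_columns(ALIGNMENT))
```

In practice the invariant positions found this way are candidates to check against known catalytic or binding residues, not proof of function.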

Page 18

Using Sequence Analysis: Hints

- Prediction of 3D fold (if distant homologs have known structures!) and of general biochemical function is much easier than prediction of exact biological function
- Sequence analysis complements structural comparisons and can greatly benefit from them
- Comparative analysis allows us to find subtle sequence similarities in proteins that would not have been noticed otherwise

Credit: Dr. M. Galperin, NCBI

Page 19

Functional Prediction: Role of Structural Genomics

Protein Structure Initiative: Determine 3D Structures of All Proteins
- Family Classification: Organize protein sequences into families, collect families without known structures
- Target Selection: Select family representatives as targets
- Structure Determination: X-ray crystallography or NMR spectroscopy
- Homology Modeling: Build models for other proteins by homology
- Attempt functional prediction based on structure

Page 20

Structural Genomics: Structure-Based Functional Predictions

Methanococcus jannaschii MJ0577 (Hypothetical Protein)

Contains bound ATP => ATPase or ATP-Mediated Molecular Switch

Confirmed by biochemical experiments

Page 21

Crystal Structure is Not a Function!

Credit: Dr. M. Galperin, NCBI

Page 22

Functional Prediction: II. Computational Analysis Beyond Homology

- Phylogenetic distribution (comparative genomics)
  - Wide - most likely essential
  - Narrow - probably clade-specific
  - Patchy - most intriguing
- Domain association – “Rosetta Stone”
- Genome context (gene neighborhood, operon organization)

Clues: specific to niche, pathway type
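The wide / narrow / patchy distinction can be sketched as a simple classification of a presence/absence profile across genomes. The slides give no numeric definitions, so the rules used here (a 90% cutoff for "wide", a single clade for "narrow") and the genome and clade names are assumptions made purely for illustration.

```python
# Hypothetical genomes grouped into hypothetical clades.
CLADE_OF = {
    "genome_A1": "clade_A", "genome_A2": "clade_A",
    "genome_B1": "clade_B", "genome_B2": "clade_B",
    "genome_C1": "clade_C", "genome_C2": "clade_C",
}


def classify_distribution(presence, clade_of=CLADE_OF, wide_fraction=0.9):
    """Classify a presence/absence profile as wide, narrow, patchy, or absent.

    presence maps genome name -> True/False. The thresholds are illustrative.
    """
    present = [genome for genome, found in presence.items() if found]
    if not present:
        return "absent"
    if len(present) / len(presence) >= wide_fraction:
        return "wide"      # found almost everywhere
    if len({clade_of[genome] for genome in present}) == 1:
        return "narrow"    # confined to a single clade
    return "patchy"        # scattered across clades


def profile(*present):
    """Build a presence/absence profile over all genomes in CLADE_OF."""
    return {genome: genome in present for genome in CLADE_OF}


print(classify_distribution(profile(*CLADE_OF)))
print(classify_distribution(profile("genome_A1", "genome_A2")))
print(classify_distribution(profile("genome_A1", "genome_C2")))
```

A real analysis would work from ortholog tables over many genomes and would need to account for missed gene calls, but the three outcomes map directly onto the bullets above.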

Page 23

Using Genome Context for Functional Prediction

Embden-Meyerhof and Gluconeogenesis pathway: 6-phosphofructokinase (EC 2.7.1.11)

SEED analysis tool (by FIG)

Page 24

Functional Prediction: Problem Areas

- Identification of protein-coding regions
- Delineation of potential function(s) for distant paralogs
- Identification of domains in the absence of close homologs
- Analysis of proteins with low sequence complexity

Page 25

Case Study: Prediction Verified: GGDEF domain

Proteins containing this domain: Caulobacter crescentus PleD controls swarmer cell - stalk cell transition (Hecht and Newton, 1995). In Rhizobium leguminosarum, Acetobacter xylinum, required for cellulose biosynthesis (regulation)

Predicted to be involved in signal transduction because it is found in fusions with other signaling domains (receiver, etc)

In Acetobacter xylinum, cyclic di-GMP is a specific nucleotide regulator of cellulose synthase (signalling molecule). Multidomain protein with GGDEF domain was shown to have diguanylate cyclase activity (Tal et al., 1998)

Detailed sequence analysis tentatively predicts GGDEF to be a diguanylate cyclase domain (Pei and Grishin, 2001)

Complementation experiments prove diguanylate cyclase activity of GGDEF (Ausmees et al., 2001)

Page 26

The Need for Classification

Problem:
- Most new protein sequences come from genome sequencing projects
- Many have unknown functions
- Large-scale functional annotation of these sequences based simply on BLAST best hit has pitfalls; results are far from perfect
- Manual annotation of individual proteins is not efficient

Solution:
- Highly curated and annotated protein classification system
- Automatic annotation of sequences based on protein families

Facilitates:
- Automatic annotation of sequences based on protein families
- Systematic correction of annotation errors
- Protein name standardization
- Functional predictions for uncharacterized proteins

This all works only if the system is optimized for annotation

Page 27

Levels of Protein Classification

Level | Example | Similarity | Evolution
Class | / | Structural elements | No relationships
Fold | TIM-Barrel | Topology of backbone | Possible monophyly
Domain Superfamily | Aldolase | Recognizable sequence similarity (motifs); basic biochemistry | Monophyletic origin
Family | Class I Aldolase | High sequence similarity (alignments); biochemical properties | Evolution by ancient duplications
Orthologous group | 2-keto-3-deoxy-6-phosphogluconate aldolase | Orthology for a given set of species; biochemical activity; biological function | Traceable to a single gene in LCA
Lineage-specific expansion (LSE) | PA3131 and PA3181 | Paralogy within a lineage | Recent duplication

Page 28

Protein Evolution

Sequence changes
With enough similarity, one can trace back to a common origin

Domain shuffling
What about these?

Domain: Evolutionary/Functional/Structural Unit

Page 29

Consequences of Domain Shuffling

[Figure: domain composition of related PIRSF families, with candidate names marked by question marks]
- PIRSF001500: CM (AroQ type), PDT, ACT
- PIRSF001501: CM (AroQ type)
- PIRSF006786: PDH
- PIRSF001499: CM (AroQ type), PDH
- PIRSF005547: PDH, ACT
- PIRSF001424: PDT, ACT

Candidate names shown in the figure: PDT? CM/PDH? PDH? CM/PDT? CM?

CM = chorismate mutase; PDH = prephenate dehydrogenase; PDT = prephenate dehydratase; ACT = regulatory domain

Page 30

Whole Protein = Sum of its Parts?

[Figure: PIRSF006256 domain architecture. Labels: Acylphosphatase, ZnF, ZnF, YrdC, Peptidase M22]

On the basis of domain composition alone, biological function was predicted to be:
- RNA-binding translation factor
- maturation protease

Actual function:
- [NiFe]-hydrogenase maturation factor, carbamoyltransferase

Whole protein functional annotation is best done using annotated whole-protein families

Page 31

Practical classification of proteins: setting realistic goals

We strive to reconstruct the natural classification of proteins to the fullest possible extent

BUT
Domain shuffling rapidly degrades the continuity in the protein structure (faster than sequence divergence degrades similarity)

THUS
The further we extend the classification, the finer is the domain structure we need to consider

SO
We need to compromise between the depth of analysis and protein integrity

OR …

Credit: Dr. Y. Wolf, NCBI

Page 32

Complementary Approaches

Domain Classification
- Allows a hierarchy that can trace evolution to the deepest possible level, the last point of traceable homology and common origin
- Can usually annotate only general biochemical function

Whole-protein Classification
- Cannot build a hierarchy deep along the evolutionary tree because of domain shuffling
- Can usually annotate specific biological function (preferred to annotate individual proteins)
- Can map domains onto proteins
- Can classify proteins even when domains are not defined

Page 34

Protein Classification Databases

- Whole protein classification: PIRSF
- Domain classification: Pfam, SMART, CDD
- Mixed: TIGRFAMS, COGs
- Based on structural fold: SCOP

InterPro: integrates various types of classification databases

Page 35

InterPro

Integrated resource for protein families, domains and sites. Combines a number of databases: PROSITE, PRINTS, Pfam, SMART, ProDom, TIGRFAMs, PIRSF

[Figure: SF001500, Bifunctional chorismate mutase/prephenate dehydratase, with domains CM, PDT, ACT]

Page 36

The Ideal System…

- Comprehensive: each sequence is classified either as a member of a family or as an “orphan” sequence
- Hierarchical: families are united into superfamilies on the basis of distant homology, and divided into subfamilies on the basis of close homology
- Allows for simultaneous use of the whole protein and domain information (domains mapped onto proteins)
- Allows for automatic classification/annotation of new sequences when these sequences are classifiable into the existing families
- Expertly curated membership, family name, function, background, etc.
- Evidence attribution (experimental vs predicted)

Page 37

PIRSF Classification System

PIRSF:
- Reflects evolutionary relationships of full-length proteins
- A network structure from superfamilies to subfamilies

Definitions:
- Homeomorphic Family: Basic Unit
- Homologous: Common ancestry, inferred by sequence similarity
- Homeomorphic: Full-length similarity & common domain architecture
- Hierarchy: Flexible number of levels with varying degrees of sequence conservation
- Network Structure: allows multiple parents

Advantages:
- Annotate both general biochemical and specific biological functions
- Accurate propagation of annotation and development of standardized protein nomenclature and ontology

http://pir.georgetown.edu/

Page 38

PIRSF Classification System

A protein may be assigned to only one homeomorphic family, which may have zero or more child nodes and zero or more parent nodes. Each homeomorphic family may have as many domain superfamily parents as its members have domains.

Levels:
- Domain Superfamily: one common Pfam domain
- PIRSF Superfamily: 0 or more levels; one or more common domains
- PIRSF Homeomorphic Family: exactly one level; full-length sequence similarity and common domain architecture
- PIRSF Homeomorphic Subfamily: 0 or more levels; functional specialization

Examples, grouped by domain superfamily:

PF02153: Prephenate dehydrogenase (PDH)
- PIRSF001499: Bifunctional CM/PDH (T-protein)
- PIRSF006786: PDH, feedback inhibition-insensitive
- PIRSF005547: PDH, feedback inhibition-sensitive

PF01817: Chorismate mutase (CM)
- PIRSF017318: CM of AroQ class, eukaryotic type
- PIRSF001501: CM of AroQ class, prokaryotic type
- PIRSF026640: Periplasmic CM
- PIRSF001500: Bifunctional CM/PDT (P-protein)
- PIRSF001499: Bifunctional CM/PDH (T-protein)

PF02735: Ku70/Ku80 beta-barrel domain
- PIRSF800001: Ku70/80 autoantigen
- PIRSF003033: Ku70 autoantigen
- PIRSF016570: Ku80 autoantigen
- PIRSF006493: Ku, prokaryotic type

PF00219: Insulin-like growth factor binding protein (IGFBP)
- PIRSF001969: IGFBP
- PIRSF500001: IGFBP-1
- …
- PIRSF500006: IGFBP-6
- PIRSF018239: IGFBP-related protein, MAC25 type
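Because a homeomorphic family can have more than one parent, the classification is a network (a directed acyclic graph) rather than a tree. The sketch below models a few of the nodes listed on this slide as a child-to-parents mapping; PIRSF001499 appears under both PF02153 and PF01817, as on the slide. The dictionary layout and function names are illustrative and do not reflect how PIR stores its data.

```python
# Child -> parents mapping for a few nodes named on this slide.
# The data structure itself is an illustrative assumption.
PARENTS = {
    "PIRSF001499": ["PF02153", "PF01817"],  # bifunctional CM/PDH: two domain parents
    "PIRSF006786": ["PF02153"],
    "PIRSF005547": ["PF02153"],
    "PIRSF001500": ["PF01817"],
    "PIRSF001501": ["PF01817"],
}


def ancestors(node, parents=PARENTS):
    """Return every node reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def children(node, parents=PARENTS):
    """Return the direct children of a node, sorted by identifier."""
    return sorted(child for child, ps in parents.items() if node in ps)


print(sorted(ancestors("PIRSF001499")))
print(children("PF02153"))
```

Storing parent links rather than a single parent pointer is what lets a bifunctional family be reached from each of its domain superfamilies.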

Page 39

Creation and Curation of PIRSFs

[Figure: PIRSF creation and curation workflow. Labels: UniProtKB proteins; Automatic clustering; Preliminary Homeomorphic Families; Orphans; Computer-assisted Manual Curation; Merge/split clusters; Add/remove members; Map domains on Families; Build and test HMMs; Name, refs, description; Create hierarchies (superfamilies/subfamilies); Protein name rule/site rule; Curated Homeomorphic Families; Final Homeomorphic Families; Automatic Procedure; New proteins; Automatic placement; Unassigned proteins]

- Computer-Generated (Uncurated) Clusters (35,000 PIRSFs)
- Preliminary Curation (4,400 PIRSFs): Membership, Signature Domains
- Full Curation (3,200 PIRSFs): Family Name, Description, Bibliography
- PIRSF Name Rules

Page 40

PIRSF Family Report: Curated Protein Family Information

- Taxonomic distribution of PIRSF can be used to infer evolutionary history of the proteins in the PIRSF
- Phylogenetic tree and alignment view allows further sequence analysis

Page 41

PIRSF Hierarchy and Network: DAG Viewer

Page 42

PIRSF Family Report (II)

Integrated value added information from other databases

Mapping to other protein classification databases

Page 43

PIRSF Protein Classification: Platform for Protein Analysis and Annotation

- Improves automatic annotation quality
- Serves as a protein analysis platform for a broad range of users
- Matching a protein sequence to a curated protein family rather than searching against a protein database
- Provides value-added information by expert curators, e.g., annotation of uncharacterized hypothetical proteins (functional predictions)

Page 44

Family-Driven Protein Annotation

Objective: Optimize for protein annotation

PIRSF Classification Name
- Reflects the function when possible
- Indicates the maximum specificity that still describes the entire group
- Standardized format
- Name tags: validated, tentative, predicted, functionally heterogeneous

Hierarchy
- Subfamilies increase specificity (kinase -> sugar kinase -> hexokinase)

Name Rules
- Define conditions under which names propagate to individual proteins
- Enable further specificity based on taxonomy or motifs
- Names adhere to Swiss-Prot conventions (though we may make suggestions for improvement)

Site Rules
- Define conditions under which features propagate to individual proteins

Page 45

PIR Name Rules

- Account for functional variations within one PIRSF, including:
  - Lack of active site residues necessary for enzymatic activity
  - Certain activities relevant only to one part of the taxonomic tree
  - Evolutionarily-related proteins whose biochemical activities are known to differ
- Monitor such variables to ensure accurate propagation
- Propagate other properties that describe function: EC, GO terms, misnomer info, pathway

Name Rule types:
- “Zero” Rule
  - Default rule (only condition is membership in the appropriate family)
  - Information is suitable for every member
- “Higher-Order” Rule
  - Has requirements in addition to membership
  - Can have multiple rules that may or may not have mutually exclusive conditions

Page 46

Example Name Rules

Rule ID | Rule Conditions | Propagated Information
PIRNR000881-1 | PIRSF000881 member and vertebrates | Name: S-acyl fatty acid synthase thioesterase; EC: oleoyl-[acyl-carrier-protein] hydrolase (EC 3.1.2.14)
PIRNR000881-2 | PIRSF000881 member and not vertebrates | Name: Type II thioesterase; EC: thiolester hydrolases (EC 3.1.2.-)
PIRNR025624-0 | PIRSF025624 member | Name: ACT domain protein; Misnomer: chorismate mutase

Note the lack of a zero rule for PIRSF000881

Page 47

Name Rule Propagation Pipeline

- Affiliation of sequence: homeomorphic family or subfamily (whichever PIRSF is the lowest possible node)
- Name rule exists?
  - No -> nothing to propagate
  - Yes -> check higher-order rules
- Protein fits criteria for any higher-order rule?
  - Yes -> assign name from Name Rule 1 (or 2 etc.)
  - No -> check for a zero rule
- PIRSF has zero rule?
  - Yes -> assign name from Name Rule 0
  - No -> nothing to propagate
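The decision flow of this pipeline can be sketched in a few lines, using the three example rules from Page 46 as data. The rule IDs and names come from that slide; the dictionary layout, the `vertebrate` field, and the function name are hypothetical and are not PIR's implementation.

```python
# Example rules from Page 46. A condition of None marks a "zero" rule
# (membership is the only requirement). The layout is an assumption.
NAME_RULES = {
    "PIRSF000881": [
        {"id": "PIRNR000881-1",
         "condition": lambda p: p.get("vertebrate") is True,
         "name": "S-acyl fatty acid synthase thioesterase"},
        {"id": "PIRNR000881-2",
         "condition": lambda p: p.get("vertebrate") is False,
         "name": "Type II thioesterase"},
    ],
    "PIRSF025624": [
        {"id": "PIRNR025624-0", "condition": None, "name": "ACT domain protein"},
    ],
}


def propagate_name(protein, rules_by_family=NAME_RULES):
    """Return the name to propagate, or None if there is nothing to propagate."""
    rules = rules_by_family.get(protein["pirsf"])
    if not rules:
        return None                      # no name rule exists
    for rule in rules:                   # higher-order rules take priority
        if rule["condition"] is not None and rule["condition"](protein):
            return rule["name"]
    for rule in rules:                   # fall back to the zero rule, if any
        if rule["condition"] is None:
            return rule["name"]
    return None                          # no rule fits and no zero rule


print(propagate_name({"pirsf": "PIRSF000881", "vertebrate": True}))
print(propagate_name({"pirsf": "PIRSF000881", "vertebrate": False}))
print(propagate_name({"pirsf": "PIRSF025624"}))
```

Since PIRSF000881 has no zero rule, a member whose taxonomy is unknown gets no name at all in this sketch, which is the conservative outcome the flowchart describes.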

Page 48

Name Rule in Action at UniProt

Current:
- Automatic annotations (AA) are in a separate field
- AA only visible from www.ebi.uniprot.org

Future:
- Automatic name annotations will become DE line if DE line will improve as a result
- AA will be visible from all consortium-hosted web sites

Page 49

PIR Site Rules

Position-Specific Site Features:
- active sites
- binding sites
- modified amino acids

Current requirements:
- at least one PDB structure
- experimental data on functional sites: CATRES database (Thornton)

Rule Definition:
- Select template structure
- Align PIRSF seed members with structural template
- Edit alignment to retain conserved regions covering all site residues
- Build Site HMM from concatenated conserved regions

Page 50

Match Rule Conditions

Only propagate site annotation if all rule conditions are met:
- Membership Check (PIRSF HMM threshold)
  - Ensures that the annotation is appropriate
- Conserved Region Check (site HMM threshold)
- Residue Check (all position-specific residues in HMMAlign)
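The three checks combine as a simple conjunction: a site annotation is propagated only when all of them pass. The sketch below shows that logic with made-up scores, thresholds, and residue positions; in a real pipeline the scores would come from HMM searches and the residues from an alignment to the site HMM, neither of which is modeled here.

```python
def site_rule_matches(family_score, family_threshold,
                      region_score, region_threshold,
                      aligned_residues, required_residues):
    """Return True only if all three site-rule conditions are met.

    aligned_residues: template position -> residue observed in the target.
    required_residues: template position -> residue required by the rule.
    All inputs are hypothetical stand-ins for HMM search and alignment output.
    """
    if family_score < family_threshold:        # membership check
        return False
    if region_score < region_threshold:        # conserved region check
        return False
    # Residue check: every position-specific residue must be present.
    return all(aligned_residues.get(pos) == aa
               for pos, aa in required_residues.items())


# Made-up rule requiring a Ser/His/Asp triad at three template positions.
REQUIRED = {101: "S", 237: "H", 212: "D"}

intact = {101: "S", 237: "H", 212: "D"}
mutated = {101: "A", 237: "H", 212: "D"}   # active-site Ser replaced

print(site_rule_matches(250.0, 100.0, 40.0, 25.0, intact, REQUIRED))
print(site_rule_matches(250.0, 100.0, 40.0, 25.0, mutated, REQUIRED))
```

The mutated case passes the membership and conserved-region checks but fails the residue check, so no active-site annotation would be propagated to it.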

Page 51

Rule-based Annotation of Protein Entries

- Functional variations within one PIRSF (family or subfamily): binding sites with different specificity
- Monitor such variables for accurate propagation

Site Rules Feed Name Rules
- Functional Site rule: tags active site, binding, other residue-specific information
- Functional Annotation rule: gives name, EC, other activity-specific information

Page 52

Impact of Protein Bioinformatics and Genomics

- Single protein level
  - Discovery of new enzymes and superfamilies
  - Prediction of active sites and 3D structures
- Pathway level
  - Identification of “missing” enzymes
  - Prediction of alternative enzyme forms
  - Identification of potential drug targets
- Cellular metabolism level
  - Multisubunit protein systems
  - Membrane energy transducers
  - Cellular signaling systems

Page 53

Overview

Problem:
- Most new protein sequences come from genome sequencing projects
- Many have unknown functions
- Large-scale functional annotation of these sequences based simply on BLAST best hit has pitfalls; results are far from perfect

Solution for Large-scale Annotation:
- Highly curated and annotated protein classification system
- Automatic annotation of sequences based on protein families

Facilitates:
- Automatic annotation of sequences based on protein families
- Systematic correction of annotation errors
- Name standardization in UniProt
- Functional predictions for uncharacterized proteins

Functional Analysis of Protein Sequences:
- Homology-based (sequence analysis, structure analysis)
- Non-homology (genome context, phylogenetic distribution)

Page 54

What to do with a new protein sequence

Basic:
- Domain analysis (SMART = most sensitive; PFAM, CDD)
- BLAST
- Curated protein family databases (PIRSF, InterPro, COGs)
- Literature (PubMed) from links from individual entries on the BLAST output (look for SwissProt entries first)

If not sufficient:
- PSI-BLAST
- Refined PubMed search using gene/protein names, synonyms, function and other terms you found

Advanced:
- Multiple sequence alignments (manual)
- Structure-guided alignments and structure analysis
- Phylogenetic tree reconstruction

Page 55

PIR Team

Dr. Cathy Wu, Director

Protein Classification team: Dr. Winona Barker, Dr. Lai-Su Yeh, Dr. Anastasia Nikolskaya, Dr. Darren Natale, Dr. Zhangzhi Hu, Dr. Raja Mazumder, Dr. CR Vinayaka, Dr. Sona Vasudevan, Dr. Cecilia Arighi

Informatics team: Dr. Hongzhan Huang, Dr. Peter McGarvey, Baris Suzek, M.S., Dr. Leslie Arminski, Dr. Hsing-Kuo Hua, Yongxing Chen, M.S., Jing Zhang, M.S., Dr. Xin Yuan

Students: Christina Fang, Natalia Petrova