Update Susan Bridges, Fiona McCarthy, Shane Burgess NRI 2006-04846
Jan 18, 2016
1. Some of what we’ve been doing: confirmation of predicted/hypothetical proteins in chicken.
2. Something of more interest to almost everyone in here for analyzing your data.
Educate researchers who need to use GO.
University of Delaware, 12-13 November, 2007.
…… currently working with researchers from the Universities of Delaware and Maryland to provide GO annotations necessary to facilitate publication of array data.
First residential workshop at MSU, 20-22 May, 2008.
Avian Genome Conference 18-20 May, 2008
GO Annotation Jamboree 21-22 May, 2008
“Hypothetical” and “predicted” proteins
Naive and activated purified CD4+ T cells; transformed CD4+ T cells; spleen; brain tissues; bursal B and stromal cells; muscle; and serum.
Database of all predicted proteins, from chicken build 2.1, using DFF-2D LC MS2 and our computational pipeline.
Experimentally-confirmed 7,809 chicken predicted proteins: 52% were expressed in more than one tissue.
6,027 (77%) of these proteins mapped to human and mouse orthologs and we assigned standardized nomenclature to 5,326 (64%).
We made 8,213 GO associations to 21% of the identified chicken proteins, using the ISS evidence code to transfer function between human-chicken and human-mouse orthologs.
This increased the current chicken GO annotations by 8% and doubled the number of manually curated chicken annotations.
The data are in the PRIDE and NCBI databases and are being used at NCBI to promote XP (computational model) accessions to NP (confirmed product) accessions, i.e. the words “hypothetical” and “predicted” are removed.
We also add experimentally-derived cell component GO annotations.
Tissue distribution of expressed ‘predicted’ proteins
[Figure. Legend: in one tissue; in two tissues; in three tissues; in four tissues; in five tissues; in six tissues; in seven tissues; in all eight tissues. Slice labels, in extracted order: 48% (3,779); 1% (61); 4% (313); 7% (561); 26% (2,020); 14% (1,073); 0% (0); 0% (2).]
[Figure. x-axis: Tissue type (Spleen, UA01, Stroma, T cells, B-cells, Serum, Muscle, Brain); y-axis: Number of proteins, 0 to 6,000. Legend: Tissue specific proteins; Proteins identified in other tissues.]
chicken: human/mouse orthologs (1:1)
[Figure. Sets: Mouse orthologs; Human orthologs. Values, in extracted order: 236; 5,685; 106. No human or mouse orthologs: 1,784.]
Cumulative external visits to AgBase
[Figure. x-axis: months, July 2005 to December 2007; y-axis: 0 to 10,000.]
Summary of GO annotations for last 12 months
11,716 GO annotations for chicken & cow:
• 214 cow gene products GO annotated (1,521 GO annotations)
• 1,762 chicken gene products GO annotated (10,194 GO annotations)
• in addition, orthology with human and mouse genes used to GO annotate 7,809 computationally ‘predicted’ chicken proteins (8,213 GO annotations)
Annotation metrics
Database distribution of AgBase GO Annotations
[Figure. Panels: Chicken Dec '07; Cow Dec '07. Legend: AgBase Community file; GO Consortium file.]
GO Annotation of Arrays
Functional annotation using Gene Ontology
Nomenclature (species’ genome nomenclature committees)
Other annotations using other bio-ontologies, e.g. Anatomy Ontology
Structural Annotation including Sequence Ontology
Genomic Annotation
Quality improvement of annotations: Pre-annotation, Re-annotation
GO annotation of arrays.
[Flowchart. Boxes and decision points:]
• Array IDs
• structural mapping
• Gene product IDs: ‘known’ genes from public databases; ‘predicted’ genes from genome sequencing
• Is functional literature available? (YES / NO)
• GO annotation of literature
• Are strict mammalian orthologs available? (YES / NO)
• GO annotation from orthologs (ISO)
• Electronic GO annotation using InterPro data (IEA)
• Collate GO annotations
• Submit to EBI-GOA, GOC
• link to array IDs (updateable)
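The decision points in this flowchart can be sketched as a small routing function. This is only an illustration under stated assumptions: the order in which the two questions are asked (literature first, then orthologs, with InterPro-based IEA as the fallback) is assumed, and the class, field, and ID names are hypothetical rather than part of AgBase.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class GeneProduct:
    # Hypothetical record; the field names are illustrative only.
    gene_product_id: str
    has_functional_literature: bool
    has_strict_mammalian_ortholog: bool


def annotation_route(gp: GeneProduct) -> str:
    """Pick an annotation route for one gene product.

    Assumed order: literature first, then orthologs (ISO),
    otherwise electronic annotation from InterPro data (IEA).
    """
    if gp.has_functional_literature:
        return "literature"
    if gp.has_strict_mammalian_ortholog:
        return "ISO"
    return "IEA"


def route_array(array_to_gene_product: Dict[str, str],
                gene_products: Dict[str, GeneProduct]) -> Dict[str, str]:
    """Map every array ID to a route, keeping the link back to the array ID."""
    routes = {}
    for array_id, gp_id in array_to_gene_product.items():
        gp = gene_products.get(gp_id)
        # Array IDs without a structurally mapped gene product stay unmapped.
        routes[array_id] = annotation_route(gp) if gp is not None else "unmapped"
    return routes


# Toy data with made-up identifiers.
gene_products = {
    "gp1": GeneProduct("gp1", True, True),
    "gp2": GeneProduct("gp2", False, True),
    "gp3": GeneProduct("gp3", False, False),
}
array_to_gene_product = {
    "probe_1": "gp1",
    "probe_2": "gp2",
    "probe_3": "gp3",
    "probe_4": "gp_missing",
}
routes = route_array(array_to_gene_product, gene_products)
```

Keeping the result keyed by array ID mirrors the "link to array IDs (updateable)" step: when the gene product mapping changes, the routes can be recomputed without touching the array definition.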
AgBase: annotating arrays
1. Del-Mar 14K Chicken Integrated Systems microarray (GPL1731).
• 14,053 chicken genes represented
• 9,587 contigs GO annotated (CC: 3,514; MF: 6,640; BP: 4,623)
• 3,101 singletons GO annotated (CC: 487; MF: 881; BP: 646)
• many singletons map to chicken ESTs with no associated GO
Figure 1A: Biological Process associated with Del-Mar 14K array
[Figure. Categories: metabolic process; transport; cell communication; development; immune response; cell death; cell differentiation; response to stress; sensory perception; cell motility; regulation of biological process; cellular organization and biogenesis; behavior; response to chemical stimulus; process unknown.]
Relative amount of GO BP associated with Del-Mar 14K array compared to total chicken GO.
[Figure. x-axis: GO Biological Processes, in extracted order: development; immune response; cell death; response to stress; process unknown; cell motility; cell differentiation; behavior; transport; regulation of biological process; sensory perception; response to chemical stimulus; secretion; cellular organization and biogenesis; response to stimulus; metabolic process; cell communication. y-axis: Array GO / total chicken GO, -6.0 to 6.0.]
AgBase: annotating arrays
2. TAMU Agilent 44K chicken array
• approx 44,000 chicken genes represented
• added GO annotation for 8,731 chicken gene products
• many of the array IDs with no associated GO annotation map to chicken EST sequences
AgBase: annotating arrays
3. FHCRC Chicken 13K v2.0 (GPL1836)
• 13,007 chicken genes represented
• 2,491 array IDs mapped to chicken gene products & GO annotated
• 628 mapped to chicken gene products with no GO
• approx 2,000 array IDs mapped to human or mouse gene products with GO annotation
GO Annotation Quality Score: “GAQ”
GAQ: no. annotations; DAG depth; GO evidence code
• calculate overall GAQ score for any dataset (e.g. array)
• calculate GAQ for subsets (e.g. biological processes studied using arrays)
“Gene Ontology”
“Biological Process”
Evidence Code
IEA  inferred from electronic annotation
ISS  inferred from sequence similarity
IMP  inferred from mutant phenotype
IGI  inferred from genetic interaction
IPI  inferred from physical interaction
IDA  inferred from direct assay
IEP  inferred from expression pattern
TAS  traceable author statement
NAS  non-traceable author statement
ND   no biological data available
RCA  inferred from reviewed computational analysis
IC   inferred by curator
Your Favorite Gene
Low GAQ score
Your NEW Favorite gene
High GAQ score
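A minimal sketch of how a GAQ-style score could be computed from the three inputs named above (number of annotations, DAG depth, GO evidence code). The slides do not give the formula, so the combination used here (sum over annotations of depth times an evidence-code weight) is an assumption, and the weights, class name, and example IDs are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical evidence-code weights, for illustration only.
# The slides do not state the actual ranking.
EVIDENCE_WEIGHT: Dict[str, int] = {
    "IDA": 5, "IMP": 5, "IGI": 5, "IPI": 5,
    "IEP": 4, "TAS": 4, "IC": 4,
    "ISS": 3, "RCA": 3, "NAS": 3,
    "IEA": 2,
    "ND": 1,
}


@dataclass(frozen=True)
class Annotation:
    gene_product: str   # placeholder ID
    go_term: str        # placeholder term label
    dag_depth: int      # depth of the term in the GO DAG
    evidence_code: str  # one of the codes listed above


def gaq_score(annotations: Iterable[Annotation],
              weights: Dict[str, int] = EVIDENCE_WEIGHT) -> int:
    """Assumed scoring: sum of (DAG depth x evidence weight) over annotations."""
    return sum(a.dag_depth * weights[a.evidence_code] for a in annotations)


def gaq_by_gene_product(annotations: Iterable[Annotation],
                        weights: Dict[str, int] = EVIDENCE_WEIGHT) -> Dict[str, int]:
    """Per-gene-product scores, so a subset can be compared with a whole dataset."""
    totals: Dict[str, int] = {}
    for a in annotations:
        contribution = a.dag_depth * weights[a.evidence_code]
        totals[a.gene_product] = totals.get(a.gene_product, 0) + contribution
    return totals


# Toy annotations with made-up IDs and depths.
example = [
    Annotation("geneA", "term_1", 3, "IEA"),
    Annotation("geneA", "term_2", 7, "IDA"),
    Annotation("geneB", "term_3", 2, "ND"),
]
total = gaq_score(example)
per_gene = gaq_by_gene_product(example)
```

Under this assumed scoring, a gene product with a few deep, experimentally supported annotations scores higher than one with many shallow electronic annotations, which is the contrast the low and high GAQ examples on this slide point at.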
Quantification of re-annotation
Metrics
Granularity: # previous annotations; # re-annotations
Specificity: # chicken annotations; # human/mouse annotations
Quality: GO Annotation Quality (GAQ) score
[Figure. x-axis: Annotation type (Whole Array, Chicken, Human/Mouse); y-axis: Number of annotations, 0 to 4,500. Legend: Pre-annotation; Re-annotation. Panels: GRANULARITY; SPECIFICITY. Callouts: 300% increase; 50% increase; 700% increase.]
• 13% of previous annotations to other species were corrected to chicken-specific annotations
Bart van den Berg, CVM MSU/ Sue Lamont and Huaijun Zhu
GAQ score summary
Metric                       Pre-annotation   Re-annotation   Fold difference
Total GAQ score              207,869          579,599         2.8
Total # proteins (Breadth)   886              4,240           4.8
Confidence score total       39,355           108,537         2.8
Depth                        87,250           231,184         2.7
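The fold differences in the GAQ score summary are the re-annotation value divided by the pre-annotation value. A short sketch of that arithmetic, using the figures from the summary; the variable and function names are illustrative only.

```python
# (pre-annotation, re-annotation) values taken from the GAQ score summary.
summary = {
    "Total GAQ score": (207_869, 579_599),
    "Total # proteins (Breadth)": (886, 4_240),
    "Confidence score total": (39_355, 108_537),
    "Depth": (87_250, 231_184),
}


def fold_difference(pre: float, post: float) -> float:
    """Fold difference as re-annotation divided by pre-annotation."""
    if pre == 0:
        raise ValueError("pre-annotation value must be non-zero")
    return post / pre


folds = {metric: fold_difference(pre, post)
         for metric, (pre, post) in summary.items()}

for metric, fold in folds.items():
    print(f"{metric}: {fold:.2f}")
```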
Quality improvement of annotations: Pre-annotation, Re-annotation
GO biological process annotations
[Figure. x-axis: GO Term, in extracted order: cell communication; metabolic process; catabolic process; transport; regulation of biological process; macromolecule metabolic process; biological_process; cell motility; response to stimulus; nucleobase, nucleoside, nucleotide and nucleic acid metabolic process; cell differentiation; cell death; multicellular organismal development. y-axis: Relative difference, microarray GO / total chicken GO, -6 to 6. Bar values, in extracted order: -4.88; -3.61; -1.80; -0.75; -0.04; 0.18; 0.33; 0.46; 1.04; 1.06; 1.26; 1.64; 5.12.]
Modeling using the GO
Functional Understanding
Implied
Derived
Physiology (= Cellular Component + Biological Process + Molecular Function)
Network Modeling
Gene Ontology
(interactions)
Hypothesis-driven GO-based data interrogation
Buza, J. J. and S.C. Burgess. Modeling the proteome of a Marek's disease transformed cell line: a natural animal model for CD30 over-expressing lymphomas. Proteomics, 2007. 7:1316-26.