Cancer Genome Anatomy Project (CGAP) Informatics Carl F. Schaefer December 7, 2001
Agenda
• Overview
• New Features
• Future plans
• Behind the scenes
url: http://cgap.nci.nih.gov
CGAP Informatics
• Main CGAP
– Susan Greenhut (OCG)
– Denise Hise (NCICB)
– Carl Schaefer (NCICB)
– Kotien Wu (NCICB)
• GAI
– Bob Clifford (LPG)
– Michael Edmonson (LPG)
– Ying Hu (LPG)
– Cu Nguyen (LPG)
Cancer Genome Anatomy Project
Find correlations between ...
• Gene expression
• Polymorphism
• Chromosomal aberrations
• Topography (tissue)
• Morphology (histology)
• Earlier informatics segments:
– Tumor Gene Index
– Cancer Chromosome Aberration Project
– Gene Annotation Initiative
History (Simplified)
[Timeline figure. Hosts: NCBI; NCI/LPG; NCI/NCICB. Stages: Prototype; Live, parallel; Live. Dates marked: 01/00, 09/00, 02/01, 05/01.]
NCI CGAP Site: Original Goals
• Organize by biology rather than by funding
– Genes, tissues, chromosomes, reagents, …
• Add bio-functional component (e.g. pathways)
• Make the site
– Consistent: search forms; lists; info pages
– Coherent: tied together with internal links
New Features
• Expression: DGED; SAGE data
• Function: GO and pathways
• Structure: protein motifs
• Chromosome aberrations in cancer
Measuring Gene Expression
• Sequencing
– ESTs (100-600 bp end or full sequence of single clone)
– SAGE (10 bp “tag” excised with restriction enzymes)
• Hybridization
– Spotted cDNA arrays
  • Longer probes, fewer features per slide
– “Gene chip” (e.g. Affymetrix)
  • Multiple shorter probes, more features per chip
Digital Gene Expression Display
• Evolved from a concrete request: help find vaccine targets
• Similar to NCBI’s DDD but more flexible interface
• Queries both EST and SAGE data
– But better not to mix
DGED-1
DGED-2
DGED-3
Other Expression Stuff
• SAGE data
– SAGE libraries accessible in library browser and DGED
– Caveat: tag-to-gene mapping is ambiguous both ways
  • Stay tuned for improvement here
• Virtual Northern
– For each gene, contrast cancer vs. normal, in ESTs and SAGE, for each of 50 tissues
– Ratio: tags for gene G in a given tissue and histology, divided by total tags in that tissue and histology
– Convert to decile
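The ratio and decile steps above can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, and the slide does not say which population the deciles are taken over, so the sketch assumes they are taken over the ratios of all genes in the same tissue and histology pool.

```python
from bisect import bisect_right
from typing import Sequence


def expression_ratio(gene_tags: int, total_tags: int) -> float:
    """Tags for one gene in a (tissue, histology) pool divided by all tags in that pool."""
    if total_tags <= 0:
        raise ValueError("total_tags must be positive")
    return gene_tags / total_tags


def decile(value: float, population: Sequence[float]) -> int:
    """Return 1..10, the tenth of the sorted population that the value falls in.

    Assumption: the population is the set of ratios for all genes in the pool.
    """
    if not population:
        raise ValueError("population must not be empty")
    ranked = sorted(population)
    rank = bisect_right(ranked, value)   # how many population values are <= value
    n = len(ranked)
    tenth = (rank * 10 + n - 1) // n     # integer ceiling of rank * 10 / n
    return max(1, min(10, tenth))
```

With ten genes whose tag counts are 1 through 10 out of 1000 total tags, the gene with 5 tags lands in decile 5 and the gene with 10 tags in decile 10.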
Virtual Northern
Functional Information
• Ontologies
– Set membership
– E.g., TP53 is in “DNA binding”, “DNA repair”, “transcription factor”, …
• Pathways
– Set membership, e.g. TP53 is in “ATM Signaling Pathway”, “p53 Signaling Pathway” …
– But also relations among members of a pathway, e.g. “is catalyst for”, “activates”, …
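One way to picture the difference between set membership and relations is the sketch below. It is not the CGAP schema: the type and function names are hypothetical, the TP53 memberships are the ones named on the slide, and the relation uses placeholder gene names rather than real biology.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class Relation(NamedTuple):
    pathway: str
    subject: str
    predicate: str   # e.g. "is catalyst for", "activates"
    target: str


# Ontology terms and pathways both behave as named sets of genes.
gene_sets: Dict[str, Set[str]] = defaultdict(set)
# Only pathways carry relations among their members.
relations: List[Relation] = []


def add_member(set_name: str, gene: str) -> None:
    gene_sets[set_name].add(gene)


def sets_containing(gene: str) -> List[str]:
    return sorted(name for name, genes in gene_sets.items() if gene in genes)


def relations_in(pathway: str) -> List[Relation]:
    return [r for r in relations if r.pathway == pathway]


for name in ("DNA binding", "DNA repair", "transcription factor",
             "ATM Signaling Pathway", "p53 Signaling Pathway"):
    add_member(name, "TP53")

# Hypothetical relation with placeholder gene names, for illustration only.
relations.append(Relation("p53 Signaling Pathway", "GENE_X", "activates", "GENE_Y"))
```

A membership query answers "which sets is TP53 in?"; a relation query answers "what does the pathway say about how its members act on each other?".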
Gene Ontology
• Three top-level categories
– Biological process
– Molecular function
– Cellular component
• Given gene may appear in multiple sets
• Mouse by JAX; human by Proteome
• An evolving vocabulary
– The sorry fate of “tumor suppressor” and “oncogenesis”
GO Browser
BioCarta Pathways
• 95 pathway diagrams
• Artistic rendering …
• but dumb
– Relations (e.g. “is catalyst for”) are drawn, but are not data
AKT Signaling Pathway
KEGG Pathways
• Mainly metabolic pathways; some regulatory
• Genes represented by EC numbers
– Many can be hyperlinked to CGAP gene info pages
– Some refs to non-human organisms
• Compounds appear under various names; each has a unique KEGG compound number
• Database contains representation of reactions (unlike BioCarta)
D-Glutamine and D-Glutamate Metabolism
L-Glutamate Compound Info Page
Summary of Functional Information for CASPASE 7
Structure: Protein Motifs
• GAI using HMMER to locate Pfam motifs on RefSeq (NM_ …) and MGC (BC…) transcripts
• Similarity among transcripts:
– Raw sequence
– Single motif occurrences
– Multiple motif occurrences
• E-value: fit of motif to transcript
• P-value: relative probability that two transcripts are closely related
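The difference between comparing single and multiple motif occurrences can be illustrated with a small sketch. It is not the GAI method and computes no E-value or P-value; the function names are hypothetical, and each transcript is assumed to be represented simply as a list of motif names, one entry per occurrence.

```python
from collections import Counter
from typing import Sequence, Set


def shared_motifs(a: Sequence[str], b: Sequence[str]) -> Set[str]:
    """Motifs present at least once in both transcripts (occurrence counts ignored)."""
    return set(a) & set(b)


def shared_occurrences(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of motif occurrences the two transcripts have in common.

    Counter intersection keeps, for each motif, the smaller of the two counts.
    """
    return sum((Counter(a) & Counter(b)).values())
```

Two transcripts that each carry the same motif twice share one motif by the first measure and two occurrences by the second.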
Structure: Protein Motifs (Example: ICE_p10, ICE_p20, and CARD among the CASPASes)
Mitelman Database of Chromosomal Aberrations in Cancer
• Data culled from literature -- 39,000 cases
• Case records:
– Clinical/demographic
– Topography/morphology
– Karyotype
– Reference
• Recurrent subset
• Separate dataset of associations, often to specific genes
Future Plans
• Function: smarter pathways
• Expression:
– New SAGE data and display
– Microarray data (NCI 60 cell lines) (see CMAP presentation)
• Structure: gene query by motif
• Operations on lists of genes
– Adding columns of information to gene lists
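The planned list operations could look something like the sketch below. This is speculative, since the slide only names the goal; the function names, the row layout, and the placeholder gene symbols are all assumptions.

```python
from typing import Dict, List, Sequence


def intersect(first: Sequence[str], second: Sequence[str]) -> List[str]:
    """Genes present in both lists, in the order of the first list."""
    keep = set(second)
    return [gene for gene in first if gene in keep]


def add_columns(genes: Sequence[str],
                columns: Dict[str, Dict[str, str]]) -> List[Dict[str, str]]:
    """Build one row per gene, adding one column per lookup table.

    Genes missing from a lookup table get an empty string in that column.
    """
    rows = []
    for gene in genes:
        row = {"gene": gene}
        for column_name, lookup in columns.items():
            row[column_name] = lookup.get(gene, "")
        rows.append(row)
    return rows


# Placeholder gene symbols, for illustration only.
pathway_genes = ["GENE_A", "GENE_B", "GENE_C"]
ontology_genes = ["GENE_B", "GENE_C", "GENE_D"]
both = intersect(pathway_genes, ontology_genes)
table = add_columns(both, {"tissue": {"GENE_B": "colon"}})
```

Here `both` holds the genes common to a pathway list and an ontology list, and `table` is that list with a tissue column attached.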
Genes in AKT Signaling Pathway
Clone List
Genes in GO Apoptosis
Pathways, Ontology, Tissues
Behind the Scenes
• The build process (not a pretty sight)
• Software architecture
Data Sources/Sizes

Source              Dataset                 MB
UniGene             Hs.data                236
                    Hs.seq.all            1965
                    Hs.lib.info              1
                    Mm.data                151
                    Mm.seq.all            1022
                    Mm.lib.info              1
LocusLink           LL_tmpl                 81
HomoloGene          hmlg.ftp                18
SAGE                tag_lib_freq            67
                    tag_cluster_map         26
Research Genetics   Hs verified clones      10
                    Mm verified clones       2
Felix Mitelman      cytogenetic data        27
NCBI custom         library.report           6
                    hierarchy.txt            1
                    BAC clones               1
DTP                 microarray ids           1
Gene Ontology       GO terms                 1
BioCarta            Pathways                 8
Total                                     3625
[Build data-flow diagram]
Raw input and intermediate files: Hs.data, Hs.seq.all, Hs.lib.info, LL_tmpl, LL_stripped, hmlg.ftp, ResGen, svc, lib.rep, hier.txt, unilib.info, BLAST libs, SAGE tag-gene map, SAGE tag-lib-freq, Hs.seq.tmp, Hs.accs, GO hier.
Oracle input: hs_cluster.dat, all_libraries.dat, library_keyword.dat, gene_alias.dat, gene_keyword.dat, hs_gxs.dat, hs_svc.dat, hs_gene_tissue.dat, hs_clust2est.dat, hs_clust2sage.dat, hs_vn.dat, hs_vn_lib.dat, hs_decile.dat, hs_ug_clones.dat, go_genes.dat, go_names.dat, hs_gl.dat
Build Process -- Goals
• Automated
• Current (with respect to external data sources)
• Internally consistent (i.e. new UniGene cluster numbers throughout)
• Efficient (only recompute when necessary)
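The "only recompute when necessary" goal is the rule that make applies: a target is rebuilt when it is missing or older than any of its inputs. A minimal sketch of that rule, with hypothetical function names and not taken from the CGAP build itself:

```python
import os
from typing import Callable, Iterable


def needs_rebuild(target: str, prerequisites: Iterable[str]) -> bool:
    """True if the target is missing or older than any prerequisite file."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)


def rebuild_if_needed(target: str,
                      prerequisites: Iterable[str],
                      build: Callable[[], None]) -> bool:
    """Run the build step only when the target is out of date; report whether it ran."""
    if needs_rebuild(target, list(prerequisites)):
        build()
        return True
    return False
```

Chaining such rules from the raw downloads through to the Oracle load files means that a new UniGene release triggers exactly the steps that depend on it.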
Makefile Example
# Regenerate the gene/tissue data file when its inputs change.
$(HS_GENE_TISSUE_DAT): $(TISSUE_SELECTION_DAT) $(HS_GXS_DAT)
	$(GENE_TISSUE_CMD) \
	    Hs \
	    $(ALL_LIBRARIES_DAT) \
	    $(TISSUE_SELECTION_DAT) \
	    $(LIBRARY_KEYWORD_DAT) \
	    $(HS_GXS_DAT) \
	    $(HS_GENE_TISSUE_DAT)

# Reload the Oracle table when the data file changes: drop the indexes,
# bulk-load, then recreate the indexes and refresh the statistics.
# The .mak file is a timestamp marker recording the last successful load.
$(DATA_DIR)/load_hs_gene_tissue.mak: $(HS_GENE_TISSUE_DAT)
	echo "drop index Hs_Gene_Tissue1;" | sqlplus $(DB_USER)
	echo "drop index Hs_Gene_Tissue2;" | sqlplus $(DB_USER)
	sqlldr userid=$(DB_USER) control=$(LOAD_DIR)/Hs_Gene_Tissue.ctl \
	    >$(LOAD_DIR)/load_hs_gene_tissue.log 2>&1
	echo "create index Hs_Gene_Tissue1 on Hs_Gene_Tissue(tissue_code);" | sqlplus $(DB_USER)
	echo "create index Hs_Gene_Tissue2 on Hs_Gene_Tissue(cluster_number);" | sqlplus $(DB_USER)
	echo "analyze table Hs_Gene_Tissue compute statistics for table;" | sqlplus $(DB_USER)
	echo "analyze table Hs_Gene_Tissue compute statistics for all indexes;" | sqlplus $(DB_USER)
	touch $(DATA_DIR)/load_hs_gene_tissue.mak
CGAP Site Architecture (Overview)
[Architecture diagram]
Components: Static Content; Web Server (Zope); Content Management System (Zope); Oracle Tables; Flat Files (initialization data)
Languages and links: Python; Perl; Oracle; socket; SQL
Servers:
• GeneServer
• LibServer
• GLServer
• GXSServer
• CytSearchServer
• BlastQueryServer
Distributed Processing
GENE_HOST = Venus
GENE_PORT = 9000
BLAST_HOST = Mars
BLAST_PORT = 9001
GLS_HOST = Pluto
GLS_PORT = 9002
LIB_HOST = Pluto
LIB_PORT = 9003

[Diagram: Zope and three hosts, each labeled with CGAPConfig]
Venus: GeneServer (GENE_PORT = 9000)
Mars: BlastServer (BLAST_PORT = 9001)
Pluto: GLSServer (GLS_PORT = 9002), LibServer (LIB_PORT = 9003)
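A client on the web tier could resolve a service to its host and port from the shared configuration and then talk to it over a socket, roughly as sketched here. The host and port values are the ones on the slide; the dictionary layout, the function names, and the newline-terminated request format are assumptions, since the slides do not show the actual CGAPConfig format or wire protocol.

```python
import socket
from typing import Dict, Tuple, Union

# Values from the slide; the dictionary form is an assumption.
CGAP_CONFIG: Dict[str, Union[str, int]] = {
    "GENE_HOST": "Venus", "GENE_PORT": 9000,
    "BLAST_HOST": "Mars", "BLAST_PORT": 9001,
    "GLS_HOST": "Pluto", "GLS_PORT": 9002,
    "LIB_HOST": "Pluto", "LIB_PORT": 9003,
}


def service_address(service: str, config=CGAP_CONFIG) -> Tuple[str, int]:
    """Look up (host, port) for a service name such as "gene" or "blast"."""
    key = service.upper()
    return str(config[key + "_HOST"]), int(config[key + "_PORT"])


def send_request(service: str, request: str, timeout: float = 10.0) -> str:
    """Send one request line and return the full reply (hypothetical protocol)."""
    host, port = service_address(service)
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(request.encode("utf-8") + b"\n")
        conn.shutdown(socket.SHUT_WR)      # signal that the request is complete
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:                   # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")
```

Because every host reads the same settings, moving a server to another machine only requires changing its `_HOST` entry.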
Application Support
Security
• Screen IP
• Screen request
Request status
• OK
• NO DATA
• BAD REQUEST
• SERVER FAIL
Server loop
• Wait for request
• Replicate processing thread
• Set timer
• Reset server
# Define services:
sub FindGene { ... }
sub GetClones { ... }
# "Publish" services:
SetSafe("FindGene", "GetClones", ...);
SetForkable("FindGene", "GetClones", ...);
# Start server loop:
StartServer(GENE_PORT);
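The division of labor above, where the application defines and publishes services and the support layer screens and dispatches requests, can be sketched in Python as follows. This is not the CGAP Perl library: the class and method names are hypothetical, the choice of BAD REQUEST for a rejected IP address is an assumption, and the forking, timer, and reset steps of the server loop are left out.

```python
from typing import Callable, Dict, Iterable, Optional, Set, Tuple

# The four request statuses listed on the slide.
OK, NO_DATA, BAD_REQUEST, SERVER_FAIL = "OK", "NO DATA", "BAD REQUEST", "SERVER FAIL"


class ServiceRegistry:
    def __init__(self, allowed_ips: Iterable[str]) -> None:
        self.allowed_ips: Set[str] = set(allowed_ips)
        self.services: Dict[str, Callable[..., object]] = {}
        self.safe: Set[str] = set()

    def define(self, name: str, func: Callable[..., object]) -> None:
        self.services[name] = func

    def set_safe(self, *names: str) -> None:
        """Publish services: only names listed here may be called remotely."""
        self.safe.update(names)

    def handle(self, client_ip: str, name: str, *args: str) -> Tuple[str, Optional[object]]:
        if client_ip not in self.allowed_ips:                   # screen IP
            return BAD_REQUEST, None
        if name not in self.safe or name not in self.services:  # screen request
            return BAD_REQUEST, None
        try:
            result = self.services[name](*args)
        except Exception:
            return SERVER_FAIL, None
        return (OK, result) if result else (NO_DATA, None)


# Placeholder data and service body, for illustration only.
records = {"TP53": "placeholder gene record"}
registry = ServiceRegistry(allowed_ips=["127.0.0.1"])
registry.define("FindGene", lambda symbol: records.get(symbol))
registry.set_safe("FindGene")
```

Keeping the screening in the support layer means each application server only has to supply its service functions and the list of names it is willing to publish.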