Cancer Genome Anatomy Project (CGAP) Informatics Carl F. Schaefer December 7, 2001

Cancer Genome Anatomy Project (CGAP) Informatics

Carl F. Schaefer

December 7, 2001

Agenda

• Overview
• New Features
• Future plans
• Behind the scenes

url: http://cgap.nci.nih.gov

CGAP Informatics

• Main CGAP
  – Susan Greenhut (OCG)
  – Denise Hise (NCICB)
  – Carl Schaefer (NCICB)
  – Kotien Wu (NCICB)
• GAI
  – Bob Clifford (LPG)
  – Michael Edmonson (LPG)
  – Ying Hu (LPG)
  – Cu Nguyen (LPG)

Cancer Genome Anatomy Project

Find correlations between ...
• Gene expression
• Polymorphism
• Chromosomal aberrations
• Topography (tissue)
• Morphology (histology)

• Earlier informatics segments:
  – Tumor Gene Index
  – Cancer Chromosome Aberration Project
  – Gene Annotation Initiative

History (Simplified)

[Timeline figure. Hosts: NCBI; NCI/LPG; NCI/NCICB. Stages: Prototype; Live, parallel; Live. Dates marked: 01/00, 09/00, 02/01, 05/01.]

NCI CGAP Site: Original Goals

• Organize by biology rather than by funding
  – Genes, tissues, chromosomes, reagents, …
• Add bio-functional component (e.g. pathways)
• Make the site
  – Consistent: search forms; lists; info pages
  – Coherent: tied together with internal links

New Features

• Expression: DGED; SAGE data
• Function: GO and pathways
• Structure: protein motifs
• Chromosome aberrations in cancer

Measuring Gene Expression

• Sequencing
  – ESTs (100-600 bp end or full sequence of single clone)
  – SAGE (10 bp “tag” excised with restriction enzymes)
• Hybridization
  – Spotted cDNA arrays
    • Longer probes, fewer features per slide
  – “Gene chip” (e.g. Affymetrix)
    • Multiple shorter probes, more features per chip

Digital Gene Expression Display

• Evolved from a concrete request: help find vaccine targets
• Similar to NCBI’s DDD (Digital Differential Display), but with a more flexible interface
• Queries both EST and SAGE data
  – But better not to mix

DGED-1

DGED-2

DGED-3

Other Expression Stuff

• SAGE data
  – SAGE libraries accessible in library browser and DGED
  – Caveat: tag-to-gene mapping is ambiguous both ways
    • Stay tuned for improvement here
• Virtual Northern
  – For each gene, contrast cancer vs. normal, in ESTs and SAGE, for each of 50 tissues
  – Ratio: tags for gene G in a given (tissue, histology) pool divided by total tags in that pool
  – Convert to decile
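
A minimal sketch of the Virtual Northern arithmetic, assuming tag counts per gene are already available for one (tissue, histology) pool. The function names and the rank-based decile rule are illustrative assumptions; the slides do not say how deciles are assigned.

```python
from typing import Dict


def expression_ratios(tag_counts: Dict[str, int]) -> Dict[str, float]:
    """Ratio per gene: tags for the gene divided by total tags in the pool."""
    total = sum(tag_counts.values())
    if total == 0:
        return {gene: 0.0 for gene in tag_counts}
    return {gene: count / total for gene, count in tag_counts.items()}


def to_deciles(ratios: Dict[str, float]) -> Dict[str, int]:
    """Map each gene to a decile, 1 (lowest ratios) through 10 (highest).

    Assumption: deciles are rank-based within one pool. Genes with equal
    ratios are ordered by name, so ties may land in adjacent deciles.
    """
    ranked = sorted(ratios, key=lambda gene: (ratios[gene], gene))
    n = len(ranked)
    return {gene: (rank * 10) // n + 1 for rank, gene in enumerate(ranked)}


# Toy counts for one pool (made-up numbers, not CGAP data).
counts = {"TP53": 30, "CASP7": 10, "AKT1": 60}
deciles = to_deciles(expression_ratios(counts))
```

Computing the same ratios for the cancer pool and the normal pool of a tissue gives the two values that the display contrasts.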

Virtual Northern

Functional Information

• Ontologies
  – Set membership
  – E.g., TP53 is in “DNA binding”, “DNA repair”, “transcription factor”, …
• Pathways
  – Set membership, e.g. TP53 is in “ATM Signaling Pathway”, “p53 Signaling Pathway” …
  – But also relations among members of a pathway, e.g. “is catalyst for”, “activates”, …
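
The distinction between plain set membership and relations among members can be sketched as data. This is an illustrative layout, not the CGAP schema: the dictionary and function names are assumptions, and GENE_A and GENE_B are hypothetical placeholders rather than real annotations.

```python
from typing import Dict, List, Set, Tuple

# Ontology terms and pathways both answer "which sets contain this gene?"
ontology: Dict[str, Set[str]] = {
    "DNA binding": {"TP53"},
    "DNA repair": {"TP53"},
    "transcription factor": {"TP53"},
}
pathways: Dict[str, Set[str]] = {
    "ATM Signaling Pathway": {"TP53"},
    "p53 Signaling Pathway": {"TP53"},
}

# Pathways additionally carry typed relations among their members.
# Each entry is (source, relation, target); the genes are placeholders.
relations: List[Tuple[str, str, str]] = [
    ("GENE_A", "activates", "GENE_B"),
    ("GENE_A", "is catalyst for", "GENE_B"),
]


def sets_containing(gene: str, collection: Dict[str, Set[str]]) -> List[str]:
    """Names of all sets (terms or pathways) that include the gene."""
    return sorted(name for name, members in collection.items() if gene in members)


def relations_from(gene: str, rels: List[Tuple[str, str, str]]) -> List[Tuple[str, str]]:
    """(relation, target) pairs in which the gene is the source."""
    return [(rel, target) for source, rel, target in rels if source == gene]
```

A membership query works identically on both collections; only the pathway side can answer questions such as "what does this gene activate?".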

Gene Ontology

• Three top-level categories
  – Biological process
  – Molecular function
  – Cellular component
• Given gene may appear in multiple sets
• Mouse by JAX; human by Proteome
• An evolving vocabulary
  – The sorry fate of “tumor suppressor” and “oncogenesis”

GO Browser

BioCarta Pathways

• 95 pathway diagrams
• Artistic rendering …
• but dumb
  – Relations (e.g. “is catalyst for”) are drawn, but are not data

AKT Signaling Pathway

KEGG Pathways

• Mainly metabolic pathways; some regulatory
• Genes represented by EC numbers
  – Many can be hyperlinked to CGAP gene info pages
  – Some refs to non-human organisms
• Compounds appear under various names; each has a unique KEGG compound number
• Database contains representation of reactions (unlike BioCarta)
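
A small sketch of the two KEGG properties noted above: several names resolving to one compound number, and reactions held as data rather than drawings. The table layout, class, and function names are assumptions for illustration, and the sample entries should be checked against KEGG before any real use.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Several names map to one compound number (illustrative entries).
COMPOUND_NUMBER_BY_NAME: Dict[str, str] = {
    "l-glutamate": "C00025",
    "l-glutamic acid": "C00025",
    "d-glutamate": "C00217",
}


@dataclass(frozen=True)
class Reaction:
    """A reaction stored as data: catalyzing EC number plus compound numbers."""
    ec_number: str
    substrates: Tuple[str, ...]
    products: Tuple[str, ...]


# One illustrative reaction: interconversion of L- and D-glutamate.
REACTIONS: List[Reaction] = [
    Reaction(ec_number="5.1.1.3", substrates=("C00025",), products=("C00217",)),
]


def compound_number(name: str) -> Optional[str]:
    """Resolve any known name to its compound number, ignoring case."""
    return COMPOUND_NUMBER_BY_NAME.get(name.strip().lower())


def reactions_involving(number: str) -> List[Reaction]:
    """All reactions that consume or produce the given compound."""
    return [r for r in REACTIONS if number in r.substrates or number in r.products]
```

Because every name is normalized to a compound number first, a query for "L-Glutamic acid" and one for "L-Glutamate" reach the same reactions.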

D-Glutamine and D-Glutamate Metabolism

L-Glutamate Compound Info Page

Summary of Functional Information for CASPASE 7

Structure: Protein Motifs

• GAI using HMMER to locate Pfam motifs on RefSeq (NM_ …) and MGC (BC…) transcripts
• Similarity among transcripts:
  – Raw sequence
  – Single motif occurrences
  – Multiple motif occurrences
• E-value: fit of motif to transcript
• P-value: relative probability that two transcripts are closely related
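
The slides do not define how motif-based similarity is scored, so the sketch below swaps in a plain Jaccard index purely to illustrate the difference between comparing single occurrences (presence or absence of a motif) and multiple occurrences (how many times each motif appears). It is not the GAI method, and the function names are assumptions.

```python
from collections import Counter
from typing import Sequence


def similarity_single(motifs_a: Sequence[str], motifs_b: Sequence[str]) -> float:
    """Jaccard index over motif presence/absence."""
    set_a, set_b = set(motifs_a), set(motifs_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def similarity_multiple(motifs_a: Sequence[str], motifs_b: Sequence[str]) -> float:
    """Jaccard index over motif counts (multiset intersection / union)."""
    count_a, count_b = Counter(motifs_a), Counter(motifs_b)
    union = sum((count_a | count_b).values())  # per-motif maximum count
    shared = sum((count_a & count_b).values())  # per-motif minimum count
    return shared / union if union else 0.0


# Motif lists for two hypothetical transcripts (motif names from Pfam).
one_card = ["CARD", "ICE_p20", "ICE_p10"]
two_card = ["CARD", "CARD", "ICE_p20", "ICE_p10"]
```

With these inputs the single-occurrence score treats the two transcripts as identical, while the multiple-occurrence score registers the extra CARD motif.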

Structure: Protein Motifs (Example: ICE_p10, ICE_p20, and CARD among the caspases)

Mitelman Database of Chromosomal Aberrations in Cancer

• Data culled from literature: 39,000 cases
• Case records:
  – Clinical/demographic
  – Topography/morphology
  – Karyotype
  – Reference
• Recurrent subset
• Separate dataset of associations, often to specific genes

Future Plans

• Function: smarter pathways
• Expression:
  – New SAGE data and display
  – Microarray data (NCI 60 cell lines) (see CMAP presentation)
• Structure: gene query by motif
• Operations on lists of genes
  – Adding columns of information to gene lists

Genes in AKT Signaling Pathway

Clone List

Genes in GO Apoptosis

Pathways, Ontology, Tissues

Behind the Scenes

• The build process (not a pretty sight)
• Software architecture

Data Sources/Sizes

Source             Dataset              MB
UniGene            Hs.data              236
                   Hs.seq.all           1965
                   Hs.lib.info          1
                   Mm.data              151
                   Mm.seq.all           1022
                   Mm.lib.info          1
LocusLink          LL_tmpl              81
HomoloGene         hmlg.ftp             18
SAGE               tag_lib_freq         67
                   tag_cluster_map      26
Research Genetics  Hs verified clones   10
                   Mm verified clones   2
Felix Mitelman     cytogenetic data     27
NCBI custom        library.report       6
                   hierarchy.txt        1
                   BAC clones           1
DTP                microarray ids       1
Gene Ontology      GO terms             1
BioCarta           Pathways             8
Total                                   3625

[Build data-flow diagram; labels recovered from the figure]

Legend: Raw Input; Oracle input

Source files and intermediates: Hs.data, Hs.seq.all, Hs.lib.info, LL_tmpl, LL_stripped, hmlg.ftp, ResGen, svc, lib.rep, hier.txt, unilib.info, BLAST libs, SAGE tag-gene map, SAGE tag-lib-freq, Hs.seq.tmp, Hs.accs, GO hier.

.dat files: hs_cluster.dat, all_libraries.dat, library_keyword.dat, gene_alias.dat, gene_keyword.dat, hs_gxs.dat, hs_svc.dat, hs_gene_tissue.dat, hs_clust2est.dat, hs_clust2sage.dat, hs_vn.dat, hs_vn_lib.dat, hs_decile.dat, hs_ug_clones.dat, go_genes.dat, go_names.dat, hs_gl.dat

Build Process -- Goals

• Automated
• Current (with respect to external data sources)
• Internally consistent (i.e. new UniGene cluster numbers throughout)
• Efficient (only recompute when necessary)
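
The "only recompute when necessary" goal is what a timestamp comparison in the style of make provides. A minimal sketch of that check, assuming file modification times are a reliable signal; the function name is illustrative and not part of the CGAP build.

```python
import os
from typing import Iterable


def needs_rebuild(target: str, prerequisites: Iterable[str]) -> bool:
    """Return True if the target is missing or older than any prerequisite.

    A missing prerequisite raises FileNotFoundError, since the target
    cannot be rebuilt without it.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(path) > target_mtime for path in prerequisites)
```

A build step would call this before regenerating a derived file, and skip the work when it returns False.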

Makefile Example

# Build the gene-by-tissue data file from its inputs.
$(HS_GENE_TISSUE_DAT): $(TISSUE_SELECTION_DAT) $(HS_GXS_DAT)
	$(GENE_TISSUE_CMD) \
	    Hs \
	    $(ALL_LIBRARIES_DAT) \
	    $(TISSUE_SELECTION_DAT) \
	    $(LIBRARY_KEYWORD_DAT) \
	    $(HS_GXS_DAT) \
	    $(HS_GENE_TISSUE_DAT)

# Load the file into Oracle: drop indexes, bulk-load, rebuild indexes,
# refresh statistics, then touch a marker file to record the load time.
$(DATA_DIR)/load_hs_gene_tissue.mak: $(HS_GENE_TISSUE_DAT)
	echo "drop index Hs_Gene_Tissue1;" | sqlplus $(DB_USER)
	echo "drop index Hs_Gene_Tissue2;" | sqlplus $(DB_USER)
	sqlldr userid=$(DB_USER) control=$(LOAD_DIR)/Hs_Gene_Tissue.ctl \
	    >$(LOAD_DIR)/load_hs_gene_tissue.log 2>&1
	echo "create index Hs_Gene_Tissue1 on Hs_Gene_Tissue(tissue_code);" | sqlplus $(DB_USER)
	echo "create index Hs_Gene_Tissue2 on Hs_Gene_Tissue(cluster_number);" | sqlplus $(DB_USER)
	echo "analyze table Hs_Gene_Tissue compute statistics for table;" | sqlplus $(DB_USER)
	echo "analyze table Hs_Gene_Tissue compute statistics for all indexes;" | sqlplus $(DB_USER)
	touch $(DATA_DIR)/load_hs_gene_tissue.mak

CGAP Site Architecture (Overview)

[Architecture diagram. Components: Static Content; Web Server (Zope); Content Management System (Zope); Python; Perl; Oracle; Oracle Tables; Flat Files (initialization data). Connections labeled: socket; SQL.]

• GeneServer
• LibServer
• GLServer
• GXSServer
• CytSearchServer
• BlastQueryServer

Distributed Processing

Configuration (host and port for each server):

    GENE_HOST = Venus
    GENE_PORT = 9000
    BLAST_HOST = Mars
    BLAST_PORT = 9001
    GLS_HOST = Pluto
    GLS_PORT = 9002
    LIB_HOST = Pluto
    LIB_PORT = 9003

[Diagram labels: Zope; one CGAPConfig per host]

• Venus: CGAPConfig (GENE_PORT = 9000); GeneServer
• Mars: CGAPConfig (BLAST_PORT = 9001); BlastServer
• Pluto: CGAPConfig (GLS_PORT = 9002, LIB_PORT = 9003); GLSServer, LibServer
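
A sketch of how a client might turn the host and port settings on this slide into a connection address. The `KEY = value` file format and the function names are assumptions for illustration; the slides show the settings but not how CGAPConfig reads them.

```python
from typing import Dict, Tuple

# Settings copied from the slide.
CONFIG_TEXT = """
GENE_HOST = Venus
GENE_PORT = 9000
BLAST_HOST = Mars
BLAST_PORT = 9001
GLS_HOST = Pluto
GLS_PORT = 9002
LIB_HOST = Pluto
LIB_PORT = 9003
"""


def parse_config(text: str) -> Dict[str, str]:
    """Parse KEY = value lines, skipping blank lines and # comments."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def service_address(settings: Dict[str, str], service: str) -> Tuple[str, int]:
    """Look up (host, port) for a service prefix such as 'GENE' or 'LIB'."""
    prefix = service.upper()
    return settings[f"{prefix}_HOST"], int(settings[f"{prefix}_PORT"])
```

Keeping the host next to the port for each service means a server can be moved to another machine by editing one line, with no change to client code.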

Application Support

Support

• Security
  – Screen IP
  – Screen request
• Request status
  – OK
  – NO DATA
  – BAD REQUEST
  – SERVER FAIL
• Server loop
  – Wait for request
  – Replicate processing thread
  – Set timer
  – Reset server

Application

    # Define services:
    sub FindGene { ... }
    sub GetClones { ... }

    # "Publish" services:
    SetSafe ("FindGene", "GetClones", ...);
    SetForkable ("FindGene", "GetClones", ...);

    # Start server loop:
    StartServer(GENE_PORT);
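
The division of labor on this slide (the application defines and publishes services; the support layer screens requests and reports a status) can be sketched without any networking. This is a Python illustration of the pattern, not the CGAP Perl library: the class and method names are assumptions, and the choice of BAD REQUEST for a screened-out IP address is a guess. Forking and the request timer are left out.

```python
from typing import Any, Callable, Dict, Iterable, Optional, Set, Tuple

# Request statuses named on the slide.
OK = "OK"
NO_DATA = "NO DATA"
BAD_REQUEST = "BAD REQUEST"
SERVER_FAIL = "SERVER FAIL"


class ServiceRegistry:
    """Support layer: screens callers and requests, then dispatches."""

    def __init__(self, allowed_ips: Iterable[str]) -> None:
        self._allowed_ips: Set[str] = set(allowed_ips)
        self._services: Dict[str, Callable[..., Any]] = {}
        self._safe: Set[str] = set()

    def define(self, name: str, func: Callable[..., Any]) -> None:
        """Register a service implementation (not yet callable by clients)."""
        self._services[name] = func

    def set_safe(self, *names: str) -> None:
        """Publish services: only names listed here may be requested."""
        self._safe.update(names)

    def handle(self, client_ip: str, name: str, *args: Any) -> Tuple[str, Optional[Any]]:
        """Screen the IP and the request, run the service, return (status, data)."""
        if client_ip not in self._allowed_ips:
            return BAD_REQUEST, None
        if name not in self._safe or name not in self._services:
            return BAD_REQUEST, None
        try:
            result = self._services[name](*args)
        except Exception:
            return SERVER_FAIL, None
        if not result:  # empty result is reported as NO DATA
            return NO_DATA, None
        return OK, result
```

An application would call `define` for each service, `set_safe` for the ones it publishes, and pass each incoming request to `handle`. Because unpublished names are rejected before dispatch, a client cannot invoke an internal helper by guessing its name.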