
UCSC Genome Tools and Databases Jim Kent - Genome Bioinformatics Group University of California Santa Cruz.

Dec 21, 2015


Behind the Genome Browser

• ‘Genome’ database, one for each assembly of each genome.
  – hg17 (human genome assembly 17)
  – mm6 (Mus musculus 6)
  – canFam1 (Canis familiaris 1)

• hg17 has 1616 tables, but not really
  – Some tables split across chromosomes for speed
  – 228 logical tables
  – Only ~30 different types of tables
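The difference between physical and logical table counts can be sketched in a few lines. This is only an illustration: it assumes split tables carry a chromosome prefix such as chr1_est, the table names are examples, and logical_table and count_logical are hypothetical helpers rather than part of the UCSC code.

```python
import re

# Assumed naming convention for split tables: "<chrom>_<logicalName>",
# for example chr1_est or chrX_est. Real databases may have exceptions.
SPLIT_PREFIX = re.compile(r"^chr[0-9A-Za-z]+_(?P<logical>.+)$")


def logical_table(physical_name):
    """Return the logical table name for one physical table name."""
    match = SPLIT_PREFIX.match(physical_name)
    return match.group("logical") if match else physical_name


def count_logical(physical_names):
    """Count the distinct logical tables among physical table names."""
    return len({logical_table(name) for name in physical_names})


# Illustrative list: two split tables plus two unsplit tables.
tables = ["chr1_est", "chr2_est", "chrX_est",
          "chr1_mrna", "chr2_mrna",
          "cpgIsland", "trackDb"]
print(len(tables), "physical tables,", count_logical(tables), "logical tables")
```

Seven physical names collapse to four logical tables here, which is the same effect that takes hg17 from 1616 tables to 228.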


Selected fields from related tables results: Ensembl Gene (ensGene) and Superfamily Description (sfDescription).


Custom Track Output

• Useful for visualizing results of queries in genome browser

• The way to produce more complex queries.
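A custom track is plain text, so query results can be turned into one with very little code. The sketch below assumes the BED-style layout (a track line followed by tab-separated chrom, start, end, name rows); the track name, description, and regions are made up, and custom_track is a hypothetical helper.

```python
def custom_track(name, description, regions):
    """Format (chrom, start, end, label) tuples as custom track text."""
    lines = ['track name="%s" description="%s"' % (name, description)]
    for chrom, start, end, label in regions:
        # One tab-separated row per region.
        lines.append("%s\t%d\t%d\t%s" % (chrom, start, end, label))
    return "\n".join(lines) + "\n"


# Made-up query results, for illustration only.
regions = [("chr1", 1000, 2000, "hitA"), ("chr2", 500, 900, "hitB")]
text = custom_track("queryResults", "Output of a table query", regions)
print(text)
```

Once loaded, such a track behaves like any other table, so a second query can be run against it. That is what makes custom tracks a route to more complex queries.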


681/3329 (20%) of Ensembl not known also not conserved
1728/33,666 (5%) of Ensembl in general not conserved
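The two percentages can be checked directly from the counts on the slide. The snippet below only redoes the arithmetic; the variable names are illustrative.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


not_known = percent(681, 3329)    # roughly 20%
overall = percent(1728, 33666)    # roughly 5%
ratio = not_known / overall       # how much more often the first group is not conserved

print("%.1f%% vs %.1f%%, ratio %.1f" % (not_known, overall, ratio))
```

The counts give a rate about four times higher for the first group than for Ensembl genes in general.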


Meta-data behind Table Browser

• The trackDb table describes each track.

• Table and field descriptions in AutoSql .as files, which also generate SQL code and C code to load/save from database and tab-separated files.

• Descriptions of how tables are connected in all.joiner file, which along with joinerCheck program checks database integrity.


.as Files - table and field docs

table cpgIsland
"Describes the CpG Islands"
    (
    string chrom;      "Human chromosome or FPC contig"
    uint chromStart;   "Start position in chromosome"
    uint chromEnd;     "End position in chromosome"
    string name;       "CpG Island"
    uint length;       "Island Length"
    uint cpgNum;       "Number of CpGs in island"
    uint gcNum;        "Number of C and G in island"
    float perCpg;      "Percentage of island that is CpG"
    float perGc;       "Percentage of island that is C or G"
    )

autoSql generates code from these. They also help document.
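To give a feel for what generating code from an .as file involves, here is a rough sketch that reads the field lines and emits a CREATE TABLE statement. It is not autoSql itself: the type mapping in SQL_TYPES is an assumption, the parser handles only the simple field syntax shown on the slide, and parse_fields and create_table are hypothetical names.

```python
import re

# Assumed mapping from AutoSql field types to SQL column types.
SQL_TYPES = {"string": "varchar(255)", "uint": "int unsigned", "float": "float"}

# Matches lines such as:  uint chromStart;  "Start position in chromosome"
FIELD = re.compile(r'(\w+)\s+(\w+);\s*"([^"]*)"')


def parse_fields(as_text):
    """Return a list of (type, name, comment) tuples, one per field."""
    return FIELD.findall(as_text)


def create_table(table, as_text):
    """Build a CREATE TABLE statement from the fields of an .as record."""
    columns = ["    %s %s not null" % (name, SQL_TYPES[kind])
               for kind, name, _comment in parse_fields(as_text)]
    return "CREATE TABLE %s (\n%s\n);" % (table, ",\n".join(columns))


AS_TEXT = '''
table cpgIsland
"Describes the CpG Islands"
    (
    string chrom;      "Human chromosome or FPC contig"
    uint chromStart;   "Start position in chromosome"
    uint chromEnd;     "End position in chromosome"
    string name;       "CpG Island"
    uint length;       "Island Length"
    uint cpgNum;       "Number of CpGs in island"
    uint gcNum;        "Number of C and G in island"
    float perCpg;      "Percentage of island that is CpG"
    float perGc;       "Percentage of island that is C or G"
    )
'''

print(create_table("cpgIsland", AS_TEXT))
```

The quoted comments travel with the parsed fields, which is why the same file can double as documentation.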


all.joiner - basic example

• The central concept is an identifier that appears in fields in multiple tables, sometimes even multiple databases.

• $gbd is a variable that contains a comma-separated list of databases.

• An identifier record ends with a blank line.

identifier softberryGeneName
"Link together Fshgene++ gene structure, peptide, and homolog"
    $gbd.softberryGene.name
    $gbd.softberryPep.name
    $gbd.softberryHom.name
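The way a variable such as $gbd fans one line out over several databases can be sketched as follows. The contents given for gbd are an assumption for illustration, and expand_field is a hypothetical helper, not the joinerCheck parser.

```python
def expand_field(spec, variables):
    """Expand 'databases.table.field' into (database, table, field) triples.

    The database part is either a $variable or a literal comma-separated list.
    """
    db_part, table, field = spec.split(".")
    if db_part.startswith("$"):
        # Substitute the variable's comma-separated database list.
        db_part = variables[db_part[1:]]
    return [(db, table, field) for db in db_part.split(",")]


# Assumed value of $gbd; the real list is defined in all.joiner.
variables = {"gbd": "hg17,mm6,canFam1"}

record = ["$gbd.softberryGene.name",
          "$gbd.softberryPep.name",
          "$gbd.softberryHom.name"]
for spec in record:
    print(expand_field(spec, variables))
```

Each line of the record therefore names one field in every listed database, and all of those fields are expected to hold the same identifier.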


# Genbank/trEMBL Accessions and meaningful subsets thereof
identifier genbankAccession external=genbank
"Generic Genbank Accession. More specific Genbank accessions follow"
    $gbd.seq.acc

identifier bacEndAccession typeOf=genbankAccession
"Genbank accession of a BAC end read."
    $gbd.all_bacends.qName dupeOk
    $gbd.bacEndPairs.lfNames comma
    $hg.fishClones.beNames comma minCheck=0.70

typeOf - allows joins between parent and child, but not between siblings.
dupeOk - allows more than one row with the same identifier in the primary table.
comma - indicates the field is a comma-separated list of identifiers.
minCheck - indicates that only a portion of the identifiers in the field are in the primary table.
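The comma and minCheck keywords can be read as a small integrity test: split each field value on commas, then require that at least the given fraction of identifiers appears in the primary table. The sketch below shows that idea only. It is not the joinerCheck implementation, the accessions are made up, and check_field is a hypothetical helper.

```python
def check_field(primary_ids, field_values, comma=False, min_check=1.0):
    """Return (fraction_found, passed) for one secondary field.

    primary_ids  -- set of identifiers in the primary table
    field_values -- values of the secondary field, one per row
    comma        -- True if each value is a comma-separated list
    min_check    -- minimum fraction of identifiers that must be found
    """
    ids = []
    for value in field_values:
        if comma:
            # Skip empty pieces so a trailing comma is harmless.
            ids.extend(part for part in value.split(",") if part)
        else:
            ids.append(value)
    if not ids:
        return 1.0, True
    found = sum(1 for identifier in ids if identifier in primary_ids)
    fraction = found / len(ids)
    return fraction, fraction >= min_check


# Made-up accessions, for illustration only.
primary = {"AQ000001", "AQ000002", "AQ000003"}
be_names = ["AQ000001,AQ000002", "AQ000003,ZZ999999,"]

fraction, ok = check_field(primary, be_names, comma=True, min_check=0.70)
print(fraction, ok)
```

With the default min_check of 1.0 the same data would fail, which is why a field known to be incomplete carries an explicit threshold.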


identifier hugoName external=HUGO fuzzy
"International Human Gene Identifier"
    $hg.refLink.name
    $hg.atlasOncoGene.locusSymbol
    $hg.kgAlias.alias
    $hg.kgXref.geneSymbol
    $hg.refFlat.geneName
    $hg.jaxOrtholog.humanSymbol
    hg13,hg15.geneBands.name

“Biological” names for human genes are so messy, no validation is done (note ‘fuzzy’ keyword).


Other Databases

• Genome databases - one for each assembly of each organism: hg17, mm6, canFam1, etc.
• hgCentral - home to dbDb and user settings info. One database shared by all web servers.
• hgFixed - mostly microarray data.
• uniProt - Relationalized SwissProt/trEMBL database.
• go - Gene ontology terms and term/gene associations.
• genePix - gene image database


Gene Pix

• Image browser for in-situ and other gene-oriented pictures

• Hopefully in the long run will have a million images covering almost all vertebrate genes.

• (Needs new name, Gene Pix is a microarray analysis program. VisiGene?)


Data Sets

• Paul Gray - ~1000 mouse transcription factor genes - whole embryo & sections. These are in the database now.

• Other potential sources:
  – German AxelDB frog in situs
  – Japanese NIBB frog in situs (have nice browser)
  – Genepaint.org - mouse stuff
  – EMAGE and Jackson Lab mouse images

• From development and other journals, copyright issues.
  – Nathaniel Heintz BAC expression constructs
  – Eddy Rubin lab mouse embryos
  – UCSF cell-localization stuff?


Types of images

• Whole animal vs. sectioned tissues, vs. single cell.

• Single vs. multiple probes within same image.

• Single image vs. image series (movies even).

• RNA, Antibody, Fusion protein.


Mitotic cell 3 stains


Gene Pix Programs

• genePixLoad - loads SQL database from a well-defined format involving a .ra file and a tab-separated file. See genePixLoad.doc

• loadMahoney - converts Paul Gray (Mahoney center) spreadsheet and image directory into genePixLoad format

• Hg/lib/genePix.c - interface with SQL database.

• hgGenePix - cgi script to display images

• knownToGenePix - makes table in mm5 (or other) genome database to connect known genes to genePix Ids.


Gene Pix Database

• Just a single database for all assemblies of all organisms.

• A knownToGenePix table in the assembly database.


GenePix tables

• fileLocation - directory

• bodyPart - whole, brain etc.

• sliceType - transverse, sagittal

• treatment - tech details

• contributor - who done it

• Journal - scientific journal

• submissionSet - info about a whole set of images from one author

• sectionSet - links together separate sections of same specimen.

• Gene - gene info

• geneSynonym

• Antibody - info on an antibody

• probeType - antibody, RNA, fusion protein

• Probe - links gene, primers, sequence Ab.

• probeColor - color probe is

• imageFile - file containing image

• Image - a single image.

• imageProbe - links image and probe


Some Anatomy Required


Especially with slices


Edinburgh mouse atlas


Theiler Stages


Later Stages


NIBB Japanese Frog Site


Earlier Stages


Who you gonna call?

Angie Hinrichs - developer of 2nd and 4th versions of Table Browser. Genome browser hacker extraordinaire.

Hiram Clawson - main mouse man at the moment. Developed ‘wiggle’ tracks.


Kate Rosenbloom - ENCODE project and multiple alignment display.

Bob Kuhn - Software and database quality assurance.

David Haussler - Ideas. Money. Comparative genomics.


More Acknowledgements

• UCSC - Robert Baertsch, Gill Bejerano, Galt Barber, Ron Chao, Mark Diekhans, Jorge Garcia, Patrick Gavin, Rachel Harte, Fan Hsu, Yontao Lu, Crystal Lynch, Donna Karolchik, Jennifer Jackson, Ann Pace, Jacob Pedersen, Andy Pohl, Katie Pollard, Ali Sultan-Qurraie, Brian Raney, Krishna Roskin, Adam Siepel, Chuck Sugnet, Paul Tatarsky, Daryl Thomas, Heather Trumbower

• Penn State - Scott Schwartz, Laura Elnitski, Belinda Giardine, Ross Hardison, Minmei Hou, Webb Miller, Anton Nekrutenko

• Funding - NHGRI, HHMI, NCI, UCSC