
IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Aug 23, 2014

Ann Loraine

These slides were developed by Gregg Helt and Cyrus Harmon to explain the core data models in Integrated Genome Browser (IGB). The goal was to make translation between protein, transcript, and genome coordinate systems easier and more powerful. These data models are what make IGB capable of correctly displaying probes that are split across intron boundaries. They also form the core of the ProtAnnot application, which displays protein domains mapped onto genomic sequence.
Transcript
Page 1: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Genometry

Gregg Helt Cyrus Harmon

Page 2: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Genometry

•  Motivation and Purpose
•  Points of Reference
•  Genometry interfaces
•  Genometry manipulations
•  Genometry implementation
•  Representation examples
•  Prototype apps
•  Current status, future work

Page 3: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Motivation and Goals

•  Desire for a more unified data model to represent relationships between biological sequences, such as:
   –  Annotations
   –  Alignments
   –  Sequence composition
•  More networked, less hierarchical (genome-centric, transcript-centric)
•  Simplicity
•  Expressivity / Flexibility
•  Memory and Computational Efficiency
•  Use by others to provide core functionality for various Affy projects

Page 4: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Points of Reference

•  com.neomorphic.bio models
•  Genisys DB and Genisys IDL
•  EBI mapping models
•  Apollo data models
•  BioPerl
•  BioJava
•  Closest similarity to bio alignment models and Genisys alignment models

Page 5: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Basic Annotations

Sequences: Genome G, Transcript T

Transcript T    G:1000..5000
    Exon E1     G:1000..1200
    Exon E2     G:3000..3500
    Exon E3     G:4500..5000

Page 6: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Genometry Annotations – Specify All Coordinates

Sequences: Genome G, Transcript T

Transcript T    G:1000..5000    T:0..1200
    Exon E1     G:1000..1200    T:0..200
    Exon E2     G:3000..3500    T:200..700
    Exon E3     G:4500..5000    T:700..1200

Page 7: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Genometry Annotations – All coordinates are relative to BioSeqs

BioSeqs: Genome G, Transcript T

TranscriptAnnot T1    G:1000..5000    T:0..1200
    ExonAnnot E1      G:1000..1200    T:0..200
    ExonAnnot E2      G:3000..3500    T:200..700
    ExonAnnot E3      G:4500..5000    T:700..1200

Page 8: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Genometry Annotations – SeqSpans encapsulate a range along a BioSeq

BioSeqs: Genome G, Transcript T

Each coordinate range below is a SeqSpan on the named BioSeq.

TranscriptAnnot T1    G:1000..5000    T:0..1200
    ExonAnnot E1      G:1000..1200    T:0..200
    ExonAnnot E2      G:3000..3500    T:200..700
    ExonAnnot E3      G:4500..5000    T:700..1200

Page 9: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Genometry Core Core

•  BioSeq
   –  length, residues (optional)
•  SeqSpan
   –  start, end, BioSeq
•  SeqSymmetry
   –  SeqSpans (breadth)
   –  SeqSymmetry parent / child hierarchy (depth)
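The three core types can be sketched in a few lines. This is only an illustration in Python, not the Genometry interfaces themselves: the field and method names (`span_on`, `breadth`, `depth`) and the genome length are assumptions made for the example. It rebuilds the transcript annotation from the earlier slides, where every exon carries one span on the genome and one on the transcript.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class BioSeq:
    """A sequence: a length plus optional residues."""
    name: str
    length: int
    residues: Optional[str] = None


@dataclass(frozen=True)
class SeqSpan:
    """A start..end range along one BioSeq."""
    start: int
    end: int
    seq: BioSeq


@dataclass
class SeqSymmetry:
    """Spans on one or more BioSeqs (breadth) plus child symmetries (depth)."""
    name: str
    spans: List[SeqSpan] = field(default_factory=list)
    children: List["SeqSymmetry"] = field(default_factory=list)

    def span_on(self, seq: BioSeq) -> Optional[SeqSpan]:
        """Return this symmetry's span on the given sequence, if it has one."""
        for span in self.spans:
            if span.seq == seq:
                return span
        return None

    def breadth(self) -> int:
        return len(self.spans)

    def depth(self) -> int:
        return 1 + max((child.depth() for child in self.children), default=0)


genome = BioSeq("G", 10_000)      # genome length is an assumed value
transcript = BioSeq("T", 1200)


def exon(name, g_start, g_end, t_start, t_end):
    """Hypothetical helper: an exon with one span per sequence."""
    return SeqSymmetry(name, [SeqSpan(g_start, g_end, genome),
                              SeqSpan(t_start, t_end, transcript)])


t1 = SeqSymmetry(
    "T1",
    [SeqSpan(1000, 5000, genome), SeqSpan(0, 1200, transcript)],
    [exon("E1", 1000, 1200, 0, 200),
     exon("E2", 3000, 3500, 200, 700),
     exon("E3", 4500, 5000, 700, 1200)],
)

print(t1.breadth(), t1.depth())
```

Note that nothing in `BioSeq` points back at the annotation: the symmetry refers to the sequences, not the other way round.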

Page 10: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Expressiveness of Core Core

•  “Standard” annotations
•  Singleton annotations
•  Alternative Splicing
•  Pairwise alignments
•  Annotations with depth > 2
•  Annotations with breadth > 2
•  Indels
•  Structure of analyzed sequence
•  Fuzzy locations
•  All without explicit pointers from BioSeq to annotation

Page 11: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Genometry Modelling of Insertions and Deletions #1a

Genome G        …AGGCAATTAATTGATCCAGGTG……GAGTCCGAATAGGGTTAGCG…
Transcript T    GCAATTCAATTGATCCAG TCCGAATAGGTTAGCG

G:1000..2017 T:0..34
    G:1000..1017 T:0..18
        G:1000..1006 T:0..6
        G:1006..1017 T:7..18
    G:2000..2017 T:18..34
        G:2000..2010 T:18..28
        G:2011..2017 T:28..34

First block: insertion in transcript relative to genome (deletion in genome relative to transcript)

Second block: deletion in transcript relative to genome (insertion in genome relative to transcript)
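In this representation an indel is implied by a gap between adjacent child spans on one sequence that has no matching gap on the other. A minimal sketch of reading those gaps back out, assuming each child is given as a plain `(g_start, g_end, t_start, t_end)` tuple (the tuple layout and the function name `find_indels` are illustrative, not part of Genometry):

```python
from typing import List, Tuple

# (g_start, g_end, t_start, t_end): one child symmetry with a genome span
# and a transcript span. This tuple layout is an assumption for the sketch.
Child = Tuple[int, int, int, int]


def find_indels(children: List[Child]) -> List[Tuple[str, str, int, int]]:
    """Report gaps between adjacent children that appear on only one sequence."""
    events = []
    for left, right in zip(children, children[1:]):
        g_gap = right[0] - left[1]   # unaligned genome bases between children
        t_gap = right[2] - left[3]   # unaligned transcript bases between children
        if t_gap > 0 and g_gap == 0:
            events.append(("insertion in transcript", "T", left[3], right[2]))
        elif g_gap > 0 and t_gap == 0:
            events.append(("deletion in transcript", "G", left[1], right[0]))
    return events


first_block = [(1000, 1006, 0, 6), (1006, 1017, 7, 18)]
second_block = [(2000, 2010, 18, 28), (2011, 2017, 28, 34)]

print(find_indels(first_block))
print(find_indels(second_block))
```

For the coordinates on this slide the first block yields an insertion at T:6..7 and the second a deletion at G:2010..2011.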

Page 12: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Genometry Modelling of Insertions and Deletions #1b

Genome G        …AGGCAATTAATTGATCCAGGTG……GAGTCCGAATAGGGTTAGCG…
Transcript T    GCAATTCAATTGATCCAG TCCGAATAGGTTAGCG

Genome positions:        g0 g1 g2 g3 g4 g4+1 g5
Transcript positions:    t0 t1 t1+1 t2 t3 t4 t5

G:g0..g5 T:t0..t5
    G:g0..g2 T:t0..t2
        G:g0..g1 T:t0..t1
        G:g1..g2 T:t1+1..t2
    G:g3..g5 T:t3..t5
        G:g3..g4 T:t3..t4
        G:g4+1..g5 T:t4..t5

First block: insertion in transcript relative to genome (deletion in genome relative to transcript)

Second block: deletion in transcript relative to genome (insertion in genome relative to transcript)

Page 13: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Genometry Modelling of Insertions and Deletions #2

Genome G        …AGGCAATTAATTGATCCAGGTG……GAGTCCGAATAGGGTTAGCG…
Transcript T    GCAATTCAATTGATCCAG TCCGAATAGGTTAGCG

Genome positions:        g0 g1 g2 g3 g4 g4+1 g5
Transcript positions:    t0 t1 t1+1 t2 t3 t4 t5

G:g0..g5 T:t0..t5
    G:g0..g2 T:t0..t2
        G:g0..g1 T:t0..t1
        T:t1..t1+1 “C”:0..1
        G:g1..g2 T:t1+1..t2
    G:g3..g5 T:t3..t5
        G:g3..g4 T:t3..t4
        G:g4..g4+1 “G”:0..1
        G:g4+1..g5 T:t4..t5

First block: insertion in transcript relative to genome (deletion in genome relative to transcript)

Second block: deletion in transcript relative to genome (insertion in genome relative to transcript)

Page 14: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Genometry Modelling of Insertions and Deletions #3

Genome G        …AGGCAATTAATTGATCCAGGTG……GAGTCCGAATAGGGTTAGCG…
Transcript T    GCAATTCAATTGATCCAG TCCGAATAGGTTAGCG

Genome positions:        g0 g1 g2 g3 g4 g4+1 g5
Transcript positions:    t0 t1 t1+1 t2 t3 t4 t5

G:g0..g5 T:t0..t5
    G:g0..g2 T:t0..t2
        G:g0..g1 T:t0..t1
        T:t1..t1+1 G:g1..g1
        G:g1..g2 T:t1+1..t2
    G:g3..g5 T:t3..t5
        G:g3..g4 T:t3..t4
        G:g4..g4+1 T:t4..t4
        G:g4+1..g5 T:t4..t5

First block: insertion in transcript relative to genome (deletion in genome relative to transcript)

Second block: deletion in transcript relative to genome (insertion in genome relative to transcript)

Page 15: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Genometry Modelling of Insertions and Deletions #4

Genome G        …AGGCAATTAATTGATCCAGGTG……GAGTCCGAATAGGGTTAGCG…
Transcript T    GCAATTCAATTGATCCAG TCCGAATAGGTTAGCG

Genome positions:        g0 g1 g2 g3 g4 g4+1 g5
Transcript positions:    t0 t1 t1+1 t2 t3 t4 t5

G:g0..g5 T:t0..t5
    G:g0..g2 T:t0..t2
        G:g0..g1 T:t0..t1
        T:t1..t1+1 G:g1..g1 “C”:0..1
        G:g1..g2 T:t1+1..t2
    G:g3..g5 T:t3..t5
        G:g3..g4 T:t3..t4
        G:g4..g4+1 T:t4..t4 “G”:0..1
        G:g4+1..g5 T:t4..t5

First block: insertion in transcript relative to genome (deletion in genome relative to transcript)

Second block: deletion in transcript relative to genome (insertion in genome relative to transcript)

Page 16: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Modelling SNPs with Genometry: Two Approaches

SeqB : 0..n

SeqA : 0..x SeqB : 0..x

“T” : 0..1 SeqB : x..x+1

SeqA : 0..m

SeqA : x+1..m SeqB : x+1..n

SeqA : x..x+1 …GGCAAGGAATGATC… SeqA x x+1

…GGCAAGGAATGATC… SeqA

SeqB …GGCAAGTAATGATC…

x x+1

SeqA = reference chromosome SeqB = exactly same as reference chromosome, except for one SNP

I. SNPs as annotations of differences between sequences

II. SNPs as gaps in similarity between two sequences

T

SeqB : x..x+1 SeqA : x..x+1 …GGCAAGGAATGATC… SeqA

SeqB …GGCAAGTAATGATC…

x x+1

“T” : 0..1 SeqA : x..x+1 …GGCAAGGAATGATC… SeqA

T

x x+1

I.a. annotation of just reference seq

I.b. annotation of reference seq w/ variant base

I.c. annotation of reference and variant seq

Page 18: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Sequence-oriented annotations

•  AnnotatedBioSeq
   –  Contains a collection of SeqSymmetries that annotate the sequence
   –  Interfaces to retrieve annotations covered by a span within the sequence
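A rough sketch of what such a sequence-oriented container might look like. The class below is illustrative only: the method names `add_annotation`, `overlapping`, and `contained_in` are assumptions, annotations are reduced to `(name, start, end)` tuples rather than full SeqSymmetries, and the slide does not say whether "covered by a span" means overlap or full containment, so both queries are shown.

```python
from typing import List, Tuple

# (name, start, end) on this sequence; a stand-in for a SeqSymmetry's span.
Annotation = Tuple[str, int, int]


class AnnotatedBioSeq:
    """A sequence that holds the annotations placed on it (illustrative only)."""

    def __init__(self, name: str, length: int) -> None:
        self.name = name
        self.length = length
        self._annotations: List[Annotation] = []

    def add_annotation(self, name: str, start: int, end: int) -> None:
        self._annotations.append((name, start, end))

    def overlapping(self, start: int, end: int) -> List[Annotation]:
        """Annotations that share at least one base with start..end."""
        return [a for a in self._annotations if a[1] < end and a[2] > start]

    def contained_in(self, start: int, end: int) -> List[Annotation]:
        """Annotations that lie entirely inside start..end."""
        return [a for a in self._annotations if start <= a[1] and a[2] <= end]


genomic = AnnotatedBioSeq("G", 10_000)   # length is an assumed value
genomic.add_annotation("E1", 1000, 1200)
genomic.add_annotation("E2", 3000, 3500)
genomic.add_annotation("E3", 4500, 5000)

print(genomic.overlapping(1100, 3200))
print(genomic.contained_in(900, 3600))
```

A linear scan is enough for a sketch; a real implementation holding many annotations would likely want a sorted or indexed structure.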

Page 19: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Annotation Networks

•  Can traverse networks of annotations, alternating between AnnotatedBioSeqs and SeqSymmetries

protein2mRNA proteinSpanB

mrnaSpanB

mRNA2genomic genomicSpanC mrnaSpanC

Annotated GenomicSeq G

Annotated mRNASeq M

Annotated ProteinSeq P

m2gSub0 gSpanC0 mSpanC0

m2gSub1 gSpanC1 mSpanC1

m2gSub2 gSpanC2 mSpanC2

domainOnProtein proteinSpanA

= AnnotatedBioSeq = SeqSymmetry

Page 20: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Sequence Composition

•  CompositeBioSeq
   –  Contains a SeqSymmetry describing the mapping of BioSeqs used in composition to the CompositeBioSeq itself

Page 21: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Sequence Composition Representations

•  Sequence Assembly / Golden Path / etc.
•  Piecewise data loading / lazy data loading
•  Genotypes
•  Chromosomal Rearrangements
•  Primer construction
•  Reverse Complement
•  Coordinate Shifting

Page 22: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Genometry Modelling of Reverse Complement

Sequence B = reverse complement of Sequence A

BioSeq A length: x

Composite BioSeq B

length: x

A:0..x B:x..0

Sym AB composition

AGGCAATTAATTGATCCAGGTGGAGTCCGAATAGGGTTAGCGA

TCGCTAACCCTATTCGGACTCCACCTGGATCAATTAATTGCCT

SeqA

SeqB
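The composition `A:0..x B:x..0` says that position p on A corresponds to position x - p on B, with orientation flipped. A small sketch using the two sequences from this slide; the function names are illustrative and not taken from Genometry:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

SEQ_A = "AGGCAATTAATTGATCCAGGTGGAGTCCGAATAGGGTTAGCGA"
SEQ_B = "TCGCTAACCCTATTCGGACTCCACCTGGATCAATTAATTGCCT"


def reverse_complement(residues: str) -> str:
    """Complement each base, then reverse the string."""
    return residues.translate(COMPLEMENT)[::-1]


def map_span_a_to_b(start: int, end: int, x: int):
    """Map a span on A through the composition A:0..x <-> B:x..0.

    The result has start > end, which marks it as reverse-oriented on B.
    """
    return (x - start, x - end)


x = len(SEQ_A)
b_start, b_end = map_span_a_to_b(0, 6, x)

# Residues under the mapped span, read in B's forward direction.
b_residues = SEQ_B[min(b_start, b_end):max(b_start, b_end)]

print(reverse_complement(SEQ_A) == SEQ_B)
print((b_start, b_end), b_residues)
```

Because B is defined only by its composition symmetry, it does not need to store residues of its own; they can be derived from A on demand.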

Page 23: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

MultiSequence Alignments

•  MultiSeqAlignment
   –  Alignments sliced “horizontally” -- each “row” in an alignment is a CompositeBioSeq whose composition maps another BioSeq to the same coord space as the alignment
•  Can also slice vertically (synteny)

Page 24: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Alignment Representations

•  Can represent same alignment as either MultiSeqAlignment or Synteny
•  Transformation from horizontal slicing (MultiSeqAlignment) to vertical slicing (Synteny)

Page 25: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Complete Genometry Core Models

•  Mutability
•  Curations

Page 26: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Genometry Manipulations

•  Symmetry Intersection (AND)
•  Symmetry Union (OR)
•  Symmetry Inverse (NOT)
•  Symmetry Mutual Exclusion (XOR)
•  Symmetry Transformation / Mapping

Page 27: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Symmetry Combination Operations

SymA SymB

XOR(A, B)

AND(A, B)

OR(A, B)

NOT(A)

NOT(B)
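The combination operations behave like set operations on the bases each symmetry covers. A simplified sketch, assuming each symmetry is flattened to a list of `(start, end)` half-open intervals on a single sequence; this drops the parent/child structure and merges touching intervals, so it only illustrates the logic, not the Genometry implementation:

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end) on one sequence


def normalize(spans: List[Interval]) -> List[Interval]:
    """Sort spans and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(spans):
        if start >= end:
            continue
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def union(a: List[Interval], b: List[Interval]) -> List[Interval]:
    return normalize(list(a) + list(b))


def intersection(a: List[Interval], b: List[Interval]) -> List[Interval]:
    a, b = normalize(a), normalize(b)
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def inverse(a: List[Interval], seq_length: int) -> List[Interval]:
    """Everything on 0..seq_length not covered by a."""
    result, pos = [], 0
    for start, end in normalize(a):
        if start > pos:
            result.append((pos, start))
        pos = max(pos, end)
    if pos < seq_length:
        result.append((pos, seq_length))
    return result


def xor(a: List[Interval], b: List[Interval], seq_length: int) -> List[Interval]:
    """Covered by exactly one of a, b: (a OR b) AND NOT (a AND b)."""
    return intersection(union(a, b), inverse(intersection(a, b), seq_length))


sym_a = [(100, 300), (500, 700)]
sym_b = [(200, 600)]

print(intersection(sym_a, sym_b))
print(union(sym_a, sym_b))
print(inverse(sym_a, 1000))
print(xor(sym_a, sym_b, 1000))
```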

Page 28: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Genometry Transformations

•  Every symmetry of breadth > 1 describes a mapping between different sequences

•  Therefore every symmetry can be used to transform coordinates of other symmetries from one sequence to another

•  Because sequence annotations, alignments, and composition are all based on symmetries, can use any of them as mappings

•  Discontiguous linear mapping algorithm
•  Results of transformation are also symmetries
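One way to read "discontiguous linear mapping": each child of the mapping symmetry is a linear piece, and a span is carried across by clipping it to each piece and interpolating. The sketch below is an assumed simplification rather than the Genometry algorithm: the mapping is a list of `(source_span, destination_span)` pairs and `map_span` is a hypothetical name. It uses the exon coordinates from the earlier annotation slides to carry a span from transcript to genome coordinates.

```python
from typing import List, Tuple

Span = Tuple[int, int]
Piece = Tuple[Span, Span]  # (span on source seq, span on destination seq)


def map_span(start: int, end: int, mapping: List[Piece]) -> List[Piece]:
    """Carry start..end on the source sequence through each linear piece."""
    results = []
    for (s0, s1), (d0, d1) in mapping:
        lo, hi = max(start, s0), min(end, s1)
        if lo >= hi:
            continue  # no overlap with this piece
        # Linear interpolation; integer division assumes exact ratios.
        mapped_lo = d0 + (lo - s0) * (d1 - d0) // (s1 - s0)
        mapped_hi = d0 + (hi - s0) * (d1 - d0) // (s1 - s0)
        results.append(((lo, hi), (mapped_lo, mapped_hi)))
    return results


# Transcript -> genome, one piece per exon (coordinates from the earlier slides).
transcript_to_genome = [
    ((0, 200), (1000, 1200)),
    ((200, 700), (3000, 3500)),
    ((700, 1200), (4500, 5000)),
]

# A feature at T:100..500 (hypothetical) touches only the first two exons.
mapped = map_span(100, 500, transcript_to_genome)
print(mapped)
```

The result is itself a list of paired spans, so it could be wrapped as a new symmetry with one child per overlapped piece.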

Page 29: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Coordinate Mapping

(note that domain mapped to spliced transcript only overlaps two of the three exons, hence only end up with two children for resulting domain2genomic symmetry)

Example – mapping domain from protein coords to genomic coords

protein2mRNA proteinSpanB

mrnaSpanB

mRNA2genomic genomicSpanC mrnaSpanC

Annotated GenomicSeq G

Annotated mRNASeq M

Annotated ProteinSeq P

m2gSub0 gSpanC0 mSpanC0

domain2genomic proteinSpanA

d2gSub0 pSpanA0 mSpanA0 gSpanA0

domain2genomic proteinSpanA mrnaSpanA

domain2genomic proteinSpanA mrnaSpanA

genomicSpanA

d2gSub1 pSpanA1 mSpanA1 gSpanA1

transform via protein2mRNA

transform via mRNA2genomic

m2gSub1 gSpanC1 mSpanC1

m2gSub2 gSpanC2 mSpanC2

domainOnProtein proteinSpanA

= AnnotatedBioSeq (BioSeq)

= SeqSymmetry (SeqAnnot)

“Growing” domain2genomic result

= MutableSeqSymmetry
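The "growing" result in this diagram can be imitated by letting every child of the result accumulate one span per sequence as it passes through each mapping. The sketch below is a loose illustration, not the actual algorithm: children are plain dicts from sequence name to `(start, end)`, `transform` is a hypothetical function, and the protein length and CDS position (P:0..350 on M:50..1100) and the domain span (P:30..130) are invented so that the arithmetic divides evenly. Exon coordinates reuse the earlier slides.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]
Child = Dict[str, Span]  # sequence name -> span on that sequence


def interpolate(pos: int, src: Span, dst: Span) -> int:
    """Linearly carry pos from src coordinates to dst coordinates."""
    return dst[0] + (pos - src[0]) * (dst[1] - dst[0]) // (src[1] - src[0])


def transform(result: List[Child], mapping: List[Child],
              src: str, dst: str) -> List[Child]:
    """Split each result child along the mapping and add a span on dst."""
    grown = []
    for child in result:
        c_start, c_end = child[src]
        for piece in mapping:
            lo = max(c_start, piece[src][0])
            hi = min(c_end, piece[src][1])
            if lo >= hi:
                continue
            # Narrow every span the child already has to the overlapping part.
            new_child = {
                name: (interpolate(lo, child[src], span),
                       interpolate(hi, child[src], span))
                for name, span in child.items()
            }
            # Then add the span on the destination sequence.
            new_child[dst] = (interpolate(lo, piece[src], piece[dst]),
                              interpolate(hi, piece[src], piece[dst]))
            grown.append(new_child)
    return grown


protein2mrna = [{"P": (0, 350), "M": (50, 1100)}]   # assumed CDS position
mrna2genomic = [
    {"M": (0, 200), "G": (1000, 1200)},
    {"M": (200, 700), "G": (3000, 3500)},
    {"M": (700, 1200), "G": (4500, 5000)},
]

domain = [{"P": (30, 130)}]                          # assumed domain span
on_mrna = transform(domain, protein2mrna, "P", "M")
on_genome = transform(on_mrna, mrna2genomic, "M", "G")

for child in on_genome:
    print(child)
```

With these numbers the domain overlaps only two of the three exons, so the final result has two children, each holding a protein, mRNA, and genomic span.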

Page 30: IGB genome genometry data models by Gregg Helt and Cyrus Harmon
Page 31: IGB genome genometry data models by Gregg Helt and Cyrus Harmon
Page 32: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

mRNA2genomic genomicSpanC mrnaSpanC

m2gSub0 gSpanC0 mSpanC0

m2gSub1 gSpanC1 mSpanC1

m2gSub2 gSpanC2 mSpanC2

domain2genomic proteinSpanA mrnaSpanA

domain2genomic proteinSpanA mrnaSpanA

d2gSub0 mSpanA0

domain2genomic proteinSpanA mrnaSpanA

d2gSub0 mSpanA0 pSpanA0

domain2genomic proteinSpanA mrnaSpanA

d2gSub0 mSpanA0 pSpanA0 gSpanA0

d2gSub0 pSpanA0 mSpanA0 gSpanA0

domain2genomic proteinSpanA mrnaSpanA

genomicSpanA

d2gSub1 pSpanA1 mSpanA1 gSpanA1

domain2genomic proteinSpanA mrnaSpanA

d2gSub0 mSpanA0 pSpanA0 gSpanA0

d2gSub1 mSpanA1 pSpanA1 gSpanA1

step1b step1c step1a

step 2

step1 (loop2) [a,b,c]

Step 2 “roll up”

Step 1a “sit still”

Step1b “roll back”

Step1c “roll forward”

Step 1 Details of “split” mapping

Page 33: IGB genome genometry data models by Gregg Helt and Cyrus Harmon
Page 34: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Transformations Applications

•  Mapping Affy probes to genome
•  Mapping contig annotations to larger genomic assemblies
•  Mapping protein annotations to genome
•  Mapping genomic annotations to proteins and transcripts (SNPs, for example)
•  Sequence slice-and-dice with annotation propagation
•  Propagation of annotations across versioned sequences (such as Golden Path)
•  Deep mappings (for example, SNP to genomeA to transcriptB to proteinC to homolog proteinD to transcriptE to genomeF to putative SNP location in genomeF – symmetry path of depth 5)
•  Etc., etc.

Page 35: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Prototypes & Applications

•  GenometryTest
•  Generic Genometry Viewer
•  ProtAnnot (Ann)
•  GPView (Cyrus)
•  AlignView (Eric)
•  ContigViewer (Peter, Barry)
•  Unibrow (Transcriptome Group)

Page 36: IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Genometry Summary

• Genometry presents a unified model for location-based sequence relationships

•  Sequence annotation, composition, and alignment are all based on SeqSymmetry

•  Provides powerful genometry manipulations -- any SeqSymmetry can be used to map other SeqSymmetries across sequences / coordinate spaces

• Work in progress