ggbio: Extending the Grammar of Graphics to Genomic Data
Tengfei Yin, Di Cook (Iowa State University)
Michael Lawrence (Genentech)
Interface 2012, Rice University
[Title-slide figures: coverage and stepping tracks on hg19 chrX colored by strand; karyogram of chromosomes 1-22, X and Y colored by gieStain; circular layout of interchromosomal and intrachromosomal rearrangements]
ggbio - Genomic Data Vis - Interface 2012, Rice University /31
Motivation
Gviz (Hahne et al.)
Pretty good!
Incorporated with R and R data structures
Uses grid (low-level) graphics: very flexible, but does not leverage tools like ggplot2
Outline
What is the grammar of graphics?
How it is extended for genomic data
Examples
Next steps: interactive graphics
Grammar
Grammar forms the foundation of a language. It is a set of structural rules that govern composition.
For graphics, it provides a way to construct a plot in a common form, and enables clarification of similarities and differences between plots.
Grammar (ggplot2)
Bar chart
ggplot(data=tips, aes(x=day, fill=day)) + geom_bar(width=1)
Rose plot/Coxcomb (labeled "Pie chart" in the earlier builds of this slide)
ggplot(data=tips, aes(x=day, fill=day)) + geom_bar(width=1) + coord_polar()
[Figures: bar chart and rose plot of count (0 to 80) by day (Thu, Fri, Sat, Sun), filled by day]
Grammar (ggplot2)
Stacked bar chart
ggplot(data=tips, aes(x="", fill=day)) + geom_bar(width=1)
Pie chart
ggplot(data=tips, aes(x="", fill=day)) + geom_bar(width=1) + coord_polar(theta="y")
[Figures: stacked bar chart and pie chart of count (0 to 200), filled by day (Thu, Fri, Sat, Sun)]
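The four plots above differ only in the x mapping and the coordinate system. A minimal, self-contained sketch of that idea follows; it assumes the ggplot2 package is installed and uses a small made-up data frame (`meals`, with hypothetical counts) as a stand-in for the `tips` data used on the slides.

```r
library(ggplot2)

# Hypothetical stand-in for the tips data: one row per meal, labeled by day.
days <- c("Thu", "Fri", "Sat", "Sun")
meals <- data.frame(
  day = factor(rep(days, times = c(62, 19, 87, 76)), levels = days)
)

# Bar chart: one bar per day, height = number of rows for that day.
bar <- ggplot(meals, aes(x = day, fill = day)) + geom_bar(width = 1)

# Same layer, polar coordinates: rose plot (Coxcomb).
rose <- bar + coord_polar()

# Stacked bar chart: a single x position, bars stacked by fill.
stacked <- ggplot(meals, aes(x = "", fill = day)) + geom_bar(width = 1)

# Same layer, with the stacked counts mapped to the angle: pie chart.
pie <- stacked + coord_polar(theta = "y")
```

Printing `bar`, `rose`, `stacked` or `pie` draws the corresponding plot. Only the coord component changes between each pair, which is the point the slides make about the grammar.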
Grammar Elements
DATA: what is to be plotted
STAT: statistical operations to perform on the data, such as binning
GEOM: geometric objects used to display aspects of the data
SCALE: maps data values to the aesthetics of the geoms
COORD: coordinate system to use, e.g. Cartesian
(FACET): subset the data and display each subset
Example: MA plot
DATA = expression data frame, x = average intensity, y = fold change
GEOM = point
SCALE = x is logged
SCALE = color is mapped to statistical significance
COORD = default, Cartesian
FACET = none
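The MA plot decomposition can be written almost line for line in ggplot2. The sketch below assumes ggplot2 is installed and uses simulated values, not real expression data; the column names (`avg_intensity`, `fold_change`, `p_value`, `significant`) and the 0.05 cutoff are illustrative assumptions.

```r
library(ggplot2)

set.seed(1)
n <- 500

# DATA: simulated expression summary (illustrative values only).
expr <- data.frame(
  avg_intensity = exp(rnorm(n, mean = 5, sd = 1)),
  fold_change   = rnorm(n, mean = 0, sd = 0.8),
  p_value       = runif(n)
)
expr$significant <- expr$p_value < 0.05  # assumed significance cutoff

ma_plot <- ggplot(expr, aes(x = avg_intensity, y = fold_change,
                            colour = significant)) +  # SCALE: color <- significance
  geom_point() +        # GEOM: point
  scale_x_log10() +     # SCALE: x is logged
  coord_cartesian()     # COORD: default, Cartesian (FACET: none, the default)
```

Printing `ma_plot` draws the figure. Each grammar element from the slide corresponds to one term in the sum, so swapping a single term changes one aspect of the plot and nothing else.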
What's different?
Genomic data has interval context
Several geoms commonly used in standard plots are not in the current grammar
Additional transformations are common
Multiple data plots need to be lined up, especially against the genome
DATA: Genomic ranges
No   seqnames  ranges                strand  tx_id  exon_id
1    chrX      [48242968, 48243005]  +       35775  132624
2    chrX      [48243475, 48243563]  +       35775  132625
3    chrX      [48244003, 48244117]  +       35775  132626
4    chrX      [48244794, 48244889]  +       35775  132627
5    chrX      [48246753, 48246802]  +       35775  132628
...  ...       ...                   ...     ...    ...
26   chrX      [48270193, 48270307]  -       35778  132637
27   chrX      [48269421, 48269516]  -       35778  132636
28   chrX      [48267508, 48267557]  -       35778  132635
29   chrX      [48262894, 48262998]  -       35778  132633
30   chrX      [48261524, 48262111]  -       35778  132632
Table 2: Typical biological data coerced into a data frame: a GRanges table representing the genes SSX4 and SSX4B. Each row represents one exon; seqnames gives the chromosome name, ranges the interval of the exon, strand the direction, and tx_id and exon_id are the internal IDs used for mapping across databases.
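A table like this can be built directly as a GRanges object. The sketch below assumes the Bioconductor package GenomicRanges is installed and uses only the first five rows of Table 2; the variable names `exons` and `exon_df` are illustrative.

```r
library(GenomicRanges)  # also attaches IRanges

# First five exons from Table 2 (transcript 35775, plus strand).
exons <- GRanges(
  seqnames = "chrX",
  ranges   = IRanges(
    start = c(48242968, 48243475, 48244003, 48244794, 48246753),
    end   = c(48243005, 48243563, 48244117, 48244889, 48246802)
  ),
  strand   = "+",
  tx_id    = 35775L,
  exon_id  = 132624:132628
)

# Coercing to a data frame splits each interval into start, end and width columns.
exon_df <- as.data.frame(exons)
```

Keeping the data as GRanges rather than a plain data frame preserves the interval context (chromosome, range, strand) that the genomic geoms and stats rely on.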
Extensions
[Framework diagram; labels: data source(s), I/O packages in Bioconductor, abstract data (formal model), meta data, grammar of graphics with extensions (geom, stat, scale, coord, facet, layout), autoplot, plots, tracks]
current software. Development of new visualization tools should be independently factorized into components of the grammar. Table 1 describes the extensions developed in this work.
Comp      name            usage
geom      geom_rect       rectangle
          geom_segment    segment
          geom_chevron    chevron
          geom_arrow      arrow
          geom_arch       arches
          geom_bar        bar
          geom_alignment  alignment (gene)
stat      stat_coverage   coverage (of reads)
          stat_mismatch   mismatch pileup for alignments
          stat_aggregate  aggregate in sliding window
          stat_stepping   avoid overplotting
          stat_gene       consider gene structure
          stat_table      tabulate ranges
          stat_identity   no change
coord     linear          ggplot2 linear but facet by chromosome
          genome          put everything on genomic coordinates
          truncate_gaps   compact view by shrinking gaps
layout    track           stacked tracks
          karyogram       karyogram display
          circle          circular
faceting  formula         facet by formula
          ranges          facet by ranges
scale     not extended    ggplot2 default
Table 1: Components of the basic grammar of graphics, with the extensions available in ggbio. (The icon column of the original table is not reproduced here.)
In comparison to regular data elements, which might be mapped to the ggplot2 geoms of points, lines and polygons, genomic data has a basic currency of an interval. Intervals underlie exons, introns and locations on the genome, which form the reference frame for biological data. We have introduced several new geoms for representing intervals and connections between intervals: rectangle, segment, chevron, arch, arrow and arrow rectangle. The geom alignment and its variants are combinations of those geoms, which function as a unit. For example, the alignment geom might draw exons as arrow rectangles and introns as chevrons. Figure 2 shows the new geoms for interval data.
Figure 1: Diagram of the framework for biological data visualization. It starts with a mapping from empirical data to an abstract data model, followed by a general and extended grammar of graphics that maps data elements to graphical elements. Orange boxes indicate the new components provided by ggbio, and the dashed frame indicates the body of the grammar of graphics, including the parts we extended with ggbio.
Several new types of statistical transformation (stat) are defined. In ggplot2, some common transformations are possible, for example binning to create a histogram, or smoothing to add a line representing a model fit to the data. For genomic data there are some commonly useful transformations that are incorporated in ggbio: coverage, i.e., feature stack depth, and mismatch summaries from read alignments.
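To make "coverage, i.e., feature stack depth" concrete, here is a small base-R sketch that counts how many intervals overlap each position. It is an illustration of the transformation only, not ggbio's implementation; the `reads` data frame and the helper `coverage_depth` are hypothetical. In practice the `coverage()` function from Bioconductor's IRanges package computes this far more efficiently.

```r
# Hypothetical read alignments as closed integer intervals [start, end].
reads <- data.frame(start = c(1, 3, 4, 8), end = c(5, 6, 9, 10))

# Coverage: for every position, the number of intervals that contain it.
coverage_depth <- function(start, end, from = min(start), to = max(end)) {
  positions <- seq(from, to)
  depth <- vapply(
    positions,
    function(p) sum(start <= p & end >= p),
    integer(1)
  )
  data.frame(position = positions, depth = depth)
}

cov <- coverage_depth(reads$start, reads$end)
cov$depth
# 1 1 2 3 3 2 1 2 2 1
```

The result is an ordinary data frame, so it could be drawn with a standard line or area geom; a coverage stat bundles this computation with the drawing step.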
Additional types of coordinate system (coord), layout and faceting methods are also available. These additional components are listed in Table 1, and they are described in more detail in the plot examples.
Let's analyze the anatomy of a minimal example, illustrated in Figure 3, to get an impression of the components of the grammar. In this plot, we are showing a gene structure with four transcripts by using the alignment geom, stepping transformation and aesthetic attributes such as color mapped to strand, in the genomic coordinate system. It will become clear that almost all graphics found in existing genome browsers or visualization tools could be described by the different components introduced here. This is the strength of the grammar of graphics. While it may appear simplistic initially, once one gains a deeper understanding of the grammar, one may discover that seemingly complex data graphics can be abstracted as components from "pictures" and thus made tangible. This will aid in the design of future graphics.
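The stepping transformation (listed in Table 1 as a way to avoid overplotting) can be pictured as assigning each interval to a row so that intervals sharing a row never overlap. The base-R sketch below uses a simple greedy assignment; this is one possible approach chosen for illustration, not necessarily how ggbio computes stepping levels, and `assign_stepping` is a hypothetical helper.

```r
# Greedy stepping: visit intervals by start position and place each one on
# the first level whose last interval ends before the new one starts.
assign_stepping <- function(start, end) {
  ord <- order(start, end)
  level_end <- numeric(0)            # end of the last interval on each level
  levels <- integer(length(start))
  for (i in ord) {
    free <- which(level_end < start[i])
    if (length(free) > 0) {
      lvl <- free[1]                 # reuse the lowest free level
    } else {
      lvl <- length(level_end) + 1L  # open a new level
    }
    level_end[lvl] <- end[i]
    levels[i] <- lvl
  }
  levels
}

# Four hypothetical intervals; the third fits on level 1 after the first ends.
steps <- assign_stepping(start = c(1, 3, 6, 2), end = c(5, 8, 9, 4))
steps
# 1 3 1 2
```

Using the level as the y position of each rectangle gives the stacked-transcript look common in genome browsers.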
The following sections of the paper describe the grammar extensions, and provide examples of their use.
2.1 Modeling Data
Data are the first component of the grammar, and data may be collected in different ways. Wilkinson makes a distinction between empirical data, abstract data and metadata [19]. Empirical data are collected from observations of the real world, while abstract data are defined by a formal mathematical model. Metadata are data about data, which might be empirical, abstract or metadata themselves.
Genomic data are often communicated in tabular text files, such as csv and tab-delimited files. Rows always represent observations and columns always represent a set of variables. Annotation tracks are usually stored according to specific formats, each with a fixed
Extensions: autoplot
Tries, and does a jolly good job, of recognizing the data object to be plotted, and how it should be displayed.
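The idea of recognizing the data object and choosing a display can be sketched with the `autoplot()` generic that ggplot2 exports. The example below assumes ggplot2 is installed and defines a method for a hypothetical class, `exon_table`; it illustrates class-based dispatch only and is not ggbio's implementation, which targets Bioconductor classes.

```r
library(ggplot2)

# Hypothetical class: a data frame of exon intervals tagged "exon_table".
exons <- structure(
  data.frame(
    start  = c(48242968, 48243475, 48244003),
    end    = c(48243005, 48243563, 48244117),
    strand = "+"
  ),
  class = c("exon_table", "data.frame")
)

# S3 method for ggplot2's autoplot() generic: the class of the object
# decides the default display (here, one rectangle per interval).
autoplot.exon_table <- function(object, ...) {
  df <- object
  class(df) <- "data.frame"  # drop the custom class before plotting
  ggplot(df) +
    geom_rect(aes(xmin = start, xmax = end, ymin = 0, ymax = 1, fill = strand))
}

p <- autoplot(exons)
```

The caller never names a geom: `autoplot(exons)` picks rectangles because of the object's class, and the returned ggplot object can still be modified with further `+` terms.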
Benefits
Flexibility in drawing genomic data
Aesthetics are changeable, with color schemes for different purposes
Plots are defined in a way that makes them easy to compare and contrast
A huge variety of displays is available in one location
Builds on a good data model and the tools available in Bioconductor
Future Work
Clean up the code and autoplot, make usage consistent, and make circular layouts as elegant as Circos
Ideally, integrate the new grammar components better with the ggplot2 code (not trivial)
Build interactive graphics using the qtbase and qtpaint primitives
Availability
ggbio is on www.bioconductor.org
Tengfei's ggbio web page has tutorials and a gallery of examples: http://tengfei.github.com/ggbio