Supplemental Information
Quantitative Estimation of Activity and Quality for Collections of Functional Genetic Elements
Vivek K. Mutalik 1,2,3,9, Joao C. Guimaraes 1,3,4,9, Guillaume Cambray 1,3,9, Quynh-Anh Mai 1,3, Marc Juul Christoffersen 1,3, Lance Martin 1,3,8, Ayumi Yu 1,3,8, Colin Lam 1,3, Cesar Rodriguez 1,3,8, Gaymon Bennett 1,3,8, Jay D. Keasling 1,2,3,6,7, Drew Endy 1,5,9,*, Adam P. Arkin 1,2,3,9,*
1 BIOFAB International Open Facility Advancing Biotechnology (BIOFAB), 5885 Hollis Street, Emeryville, CA 94608, USA
2 Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
3 Department of Bioengineering, University of California, Berkeley, CA, 94720, USA
4 Department of Informatics, Computer Science and Technology Center, University of Minho, Campus de Gualtar, Braga, Portugal
5 Department of Bioengineering, Stanford University, Stanford, CA 94305, USA
6 Department of Chemical & Biomolecular Engineering, University of California, Berkeley, CA, 94720, USA
7 Joint Bioenergy Institute, 5885 Hollis Street, Emeryville, CA 94608, USA
8 Present Addresses: Dept. of Bioengineering, Stanford University, Stanford, CA 94305, USA (L. M.); Philotic, Inc. 88 Kearny St, Suite 2100, San Francisco, CA 94108, USA (A. Y.); Autodesk, Inc. One Market Street, Suite 200, San Francisco, CA 94105 (C. R.); Center for Biological Futures, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. Seattle, WA 98109 (G. B.).
9 Equal contribution
* Correspondence should be addressed to D.E. or A.P.A. ([email protected]; [email protected])
Nature Methods: doi:10.1038/nmeth.2403
Supplementary Table 2: ANOVA table for main expression elements and their interactions. Sums of squares represent the variation in the output measurement (fluorescence, transcript abundance, and translational efficiency) explained by each term. The mean squares represent the average contribution of each factor or interaction, taking into account its degrees of freedom (df). (a) ANOVA table for the fluorescence dataset. (b) ANOVA table for the mRNA dataset. (c) ANOVA table for the translational efficiency dataset.
a: ANOVA table for Fluorescence
360 (120 per) Oligonucleotides ($550), enzymes ($1,144), sequencing ($1,650), tips, media, plates ($500). $3,844 total or $17.08 per vector.
$19,500 $3,844
Measurements (plate reader, cytometry, qPCR)
1 senior researcher, 2 research assistants
360 (120 per) Triplicate assays for 225 constructs by plate reader and cytometry. Additional triplicate assays for 168 plasmid-based constructs by qPCR. Cost per plate reader or cytometer assay ($3.00/construct). Cost per qPCR assay ($20.00/construct). $12,000 total.
All costs calculated for low-to-medium-throughput assays carried out in a rent-free BSL1 facility equipped with a -80 °C freezer, a -20 °C freezer, a deli fridge, 1 microplate shaker, multichannel pipettes, a microplate reader, a flow cytometer, and a qPCR machine. Salaries include overhead and benefits and are approximate to within 10%.
Supplementary Table 4: List of plasmids and strains used in the present work
Column A: Promoter name (generic or source name)
Column B: Promoter name used in the main text
Column C: Abstract part number for promoter element indicated as apFAB #
Column D: 5’UTR name (generic or source name)
Column E: 5’UTR name used in the main text
Column F: Abstract part number for 5’UTR element indicated as apFAB #
Column G: Plasmid number for the combinatorial library with GFP reporter
Column H: Strain number with GFP library
Column I: Plasmid number for the combinatorial library with RFP reporter
Column J: Strain number with RFP library
Column K: Strain number for the combinatorial library with GFP reporter on the chromosome.
Supplementary Table 5: List of primers used in the present work
Column A: Oligonucleotide numbers: oFAB #. Primers used for sequencing are denoted as soFAB #
Column B: Forward and reverse primers are indicated as FW and RV
Column C: Information notes for the primer
Column D: Primer sequences (5’ to 3’)
Supplementary Note
Quantitative estimate of the time, effort, and costs required to perform genetic part
characterization in a BIOFAB-like facility
Ideally, when introducing a new biological part into a cell, one could predict its operation
from a first-principles model. However, the first principles necessary to understand part
function within a cellular milieu are often not clear. Uncertainty about the proper model
falls into at least three classes: (i) the mechanism underlying the model, (ii) the in vivo
values of the kinetic and thermodynamic parameters of that model, and, most difficult, (iii)
which other processes may impact the part operation. In this last class, for cellular
systems, there are likely many direct interactions and interferences with currently
uncharacterized or unknown cellular processes not included in the model, as well as
indirect effects on cellular resources that affect the fitness and stability of the host.
It is inevitable, therefore, that although the known physics will inform and constrain the
model for the part and its interactions with other cellular factors, there will have to
be some leeway for modeling interactions for which there is no known mechanism. A
low-order regression model is a good approximation for most cases, and this is what we
have presented here for gene expression parts. Such an approach has also been
used to represent classical sequence/structure/activity relationships with proven
industrial utility (refs. 9, 10). Such models are most useful when the part classes are limited and
the number of factors being considered is relatively small (i.e., a main effect plus a small
number of interactions). Different classes of elements are likely to have different modes
of interaction with other elements, so a model that captures all possibilities becomes more
complex. Stated differently, it is easier to parameterize such models when the part
classes used to vary a functional feature, say translation initiation, have like mechanisms,
such that similar underlying models can be used to describe part performance. Once a
class of models exists for a given part family, along with an understanding of possible
context mediators (i.e., interactions among parts, the cellular context, and the external
environment) and the activity variables one wishes to track, it is possible to calculate (to
some degree) how the process of characterizing parts scales.
In the example developed here, we use mRNA levels and total protein fluorescence as
activity variables and two classes (parts families) of gene expression controllers (i.e.,
“promoter” and “5’ UTR”) that we assert affect mostly transcription and translation
initiation, respectively, with some additional effect on mRNA stability by both part
classes. There are also two context mediators: interaction amongst parts and the gene they
are driving, and the temperature change tested here. We use a factorial ANOVA to
analyze how all variables impact mRNA and protein values to derive the estimated
activities of parts (ANOVA main effects) and interactions. The form of the model, and the
goal of using such a model to estimate part scores, lead to the question of how many
constructs must be measured to achieve a particular confidence in the estimated scores.
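As a sketch of what such a factorial model looks like, the three design factors named above (promoter, 5’ UTR, and gene) can be written as a standard fixed-effects ANOVA decomposition. The symbols are illustrative and are not taken from the original text; temperature could be added as a further factor in the same way.

```latex
% Illustrative notation (assumed, not from the source):
%   y_{ijkr}            measured activity (fluorescence or mRNA level) of replicate r
%   \mu                 grand mean
%   P_i, U_j, G_k       main effects of promoter i, 5' UTR j, and gene k (the part scores)
%   (PU), (PG), (UG)    two-way interactions; (PUG) the three-way interaction
%   \varepsilon_{ijkr}  residual error
y_{ijkr} = \mu + P_i + U_j + G_k
         + (PU)_{ij} + (PG)_{ik} + (UG)_{jk} + (PUG)_{ijk}
         + \varepsilon_{ijkr}
```

A low-order model in the sense used here keeps the main effects and only the few interaction terms that explain appreciable variance.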
Ideally, we would hope to have sets of parts that collectively represent a range of
activities. If we assume that the constructs to be assayed are drawn from a random
combinatorial library composed from such sets, then we can ask how many of these we
would have to screen to derive canonical scores for the parts and their interactions. For
factorial ANOVA-based models we can use standard formulas for a priori statistical
power to derive the number of constructs necessary to classify these scores into some
number of levels (Reference: Sample Size Calculations: Practical Methods for Engineers
and Scientists, Paul Mathews, ISBN: 0615324614). To use these formulas we need to
specify a desired maximum false positive rate, a maximum false negative rate, and an
expected effect size. The effect size can be estimated from prior data or standard rules of
thumb. For example, we might expect to distinguish different “strengths” for each of the
promoters and UTRs, assuming about five distinguishable strengths for each and moderate
interactions among all elements. We then further assert two “levels” for the genes (RFP
and GFP). Thus, there are 5 × 5 × 2 = 50 groups. To keep the probability of falsely
identifying two elements as identically strong (the false negative rate) below 20% at a
significance level of 0.05, we would need approximately 200 randomly chosen individual
constructs made from representative promoters and UTRs. As the number of variables and
levels increases, the effect sizes become smaller, or the required false positive and false
negative rates drop, the number of samples, and with it the cost, goes up rather rapidly.
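The kind of calculation described above can be illustrated with a small simulation. This is a minimal sketch, not the authors' procedure: it replaces the closed-form power formulas from the cited reference with a Monte Carlo estimate, and it assumes a balanced 5 × 5 × 2 layout with four observations per group (200 in total), unit-variance Gaussian noise, and an effect on the promoter factor only. The function names and the effect sizes are hypothetical.

```python
import random
from statistics import fmean


def simulate(effects, n_utr, n_gene, reps, rng):
    """Draw one balanced data set: cell (p, u, g) -> list of replicate values.

    Assumption: only the first factor (promoter) shifts the mean; noise is N(0, 1).
    """
    return {
        (p, u, g): [shift + rng.gauss(0.0, 1.0) for _ in range(reps)]
        for p, shift in enumerate(effects)
        for u in range(n_utr)
        for g in range(n_gene)
    }


def main_effect_f(data, n_levels):
    """F statistic for the first factor, tested against within-cell error."""
    values = [y for ys in data.values() for y in ys]
    grand = fmean(values)
    ss_effect = 0.0
    for level in range(n_levels):
        level_ys = [y for cell, ys in data.items() if cell[0] == level for y in ys]
        ss_effect += len(level_ys) * (fmean(level_ys) - grand) ** 2
    ss_error = sum((y - fmean(ys)) ** 2 for ys in data.values() for y in ys)
    df_effect = n_levels - 1
    df_error = len(values) - len(data)  # observations minus number of cells
    return (ss_effect / df_effect) / (ss_error / df_error)


def estimate_power(effects, n_utr=5, n_gene=2, reps=4, alpha=0.05, n_sim=300, seed=1):
    """Fraction of simulated experiments in which the promoter effect is detected."""
    rng = random.Random(seed)
    n_levels = len(effects)
    # Empirical critical value: the (1 - alpha) quantile of F under no effect.
    null_f = sorted(
        main_effect_f(simulate([0.0] * n_levels, n_utr, n_gene, reps, rng), n_levels)
        for _ in range(n_sim)
    )
    critical = null_f[min(n_sim - 1, int((1 - alpha) * n_sim))]
    hits = sum(
        main_effect_f(simulate(effects, n_utr, n_gene, reps, rng), n_levels) > critical
        for _ in range(n_sim)
    )
    return hits / n_sim


if __name__ == "__main__":
    # Hypothetical promoter shifts in units of the noise SD (Cohen's f of about 0.25).
    print(estimate_power([-0.35, -0.18, 0.0, 0.18, 0.35]))
```

Raising `reps`, or enlarging the assumed shifts, increases the estimated power, while shrinking the effects lowers it; that is the trade-off between sample number and detectable effect size discussed above. The source does not state which effect size underlies the figure of approximately 200 constructs, so the values here are placeholders.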
In this manuscript, the parts families are small and perhaps not entirely representative,
and we analyzed an exhaustive combinatorial library. While this approach ensures we
have scores for every element with the maximum statistical power possible given the size
of the families we started with, ultimately for these models to be the most useful we
would need three things.
First, it would be ideal for the parts families to be composed of members that are
mechanistically homogeneous (as noted above, so that the same model faithfully applies
to each element), that represent a wide range of activities (as done here), and that have as
much sequence variation as possible (to exercise all the possible idiosyncratic interactions
with other parts and context variables, akin to what was done here).
Second, these parts families should be engineered to have as well-insulated function as
possible (not done in this manuscript, but addressed in the accompanying manuscript, ref. 11).
In the present case, for example, the promoter library had differing lengths of 5’ UTR at
one end, thereby adding elements to the transcript that could affect mRNA stability and
translation initiation. Similarly, there was a strong interaction between the 5’ UTR and
the downstream gene. Thus, multiple features impacting mRNA and fluorescent protein
levels are changed by each member of the library, leading to more complex interactions
and the need for a full factorial model. Lei et al. (ref. 12), for example, show how
standardized transcript cleavage can insulate promoter and 5’ UTR function, and in the
companion paper to this manuscript (ref. 11) we show how to insulate 5’ UTR and gene
function, thus simplifying the model and thereby theoretically reducing the amount of
characterization necessary.
Third, parts would be characterized in standard ways over the range of compositions and
contexts most useful for any particular application (in this manuscript we are focused on
supporting laboratory research as the application). Once an initial model is well
characterized, any new member of a part family can be effectively characterized with
many fewer samples than were needed to create the initial part characterizations, at some
cost in error (Figure 5, main text). Such efforts can be distributed outside a single facility
as long as individuals follow standard characterization protocols that do not
exercise variables not captured in the central model or, if they do exercise such variables,
capture them effectively as metadata.
The entire cost for the process, then, is the cost to make and characterize the N samples
needed to realize the initial models (Supplementary Table 3). That is, the total cost is
N × ((cost to make a construct) + (cost to assay) + (cost to process data)) + (cost to calculate
the parameterized model). The last term is usually negligible. Note that this representation
does not account for the cost to design standard biological parts that operate robustly and
homogeneously across changing contexts (see companion paper).
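The cost expression above is linear in N, which a few lines of code make concrete. This is a minimal sketch; the function name and the per-construct figures in the usage loop are hypothetical placeholders, not values from Supplementary Table 3.

```python
def total_characterization_cost(n_constructs, make_cost, assay_cost, data_cost, model_cost=0.0):
    """Total cost = N * (make + assay + process data) + model calculation.

    All costs are per construct except model_cost, which is paid once.
    """
    if n_constructs < 0:
        raise ValueError("n_constructs must be non-negative")
    per_construct = make_cost + assay_cost + data_cost
    return n_constructs * per_construct + model_cost


if __name__ == "__main__":
    # Hypothetical per-construct figures, chosen only to illustrate the scaling with N.
    for n in (50, 200, 800):
        print(n, total_characterization_cost(n, make_cost=100.0, assay_cost=50.0, data_cost=5.0))
```

Because the one-off model term is usually negligible, the total is dominated by N, so anything that lowers the number of constructs required (for example, better-insulated parts that need fewer interaction terms) lowers the cost roughly in proportion.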
In summary, for ubiquitous functions like gene expression controllers, core metabolic
activities, and perhaps elements that target macromolecules to different locations or
processing machinery, it will be fruitful for BIOFABs to take on the bulk of the part
choice, composition design, and subsequent characterization. Individual variants not
present in BIOFAB libraries can then be characterized by individual users and made
useful to others by donating information to public domain repositories that use such data
to improve models. Specialty parts that are used in only a few applications will ultimately
fall into the domain of the particular stakeholder for those applications.
References
1. Markham, N.R. & Zuker, M. UNAFold: software for nucleic acid folding and
hybridization. Methods Mol Biol 453, 3-31 (2008).
2. Lutz, R. & Bujard, H. Independent and tight regulation of transcriptional units in
Escherichia coli via the LacR/O, the TetR/O and AraC/I1-I2 regulatory elements.
Nucleic Acids Res 25, 1203-1210 (1997).
3. Hook-Barnard, I.G. & Hinton, D.M. Transcription initiation by mix and match
elements: flexibility for polymerase binding to bacterial promoters. Gene Regul Syst
Bio 1, 275-293 (2007).
4. Lee, T.S. et al. BglBrick vectors and datasheets: A synthetic biology platform for gene
expression. J Biol Eng 5, 12 (2011).
5. Ringquist, S. et al. Translation Initiation in Escherichia-Coli - Sequences within the