1 Sequence Optimization For Synthetic Genes Using Genetic Algorithms David Sigfredo Angulo 1 Rob Vogelbacher 1, Benjamin R. Capraro 2 , Tobin Sosnick 2 , Shohei Koide 2 1 School of Computer Science Telecommunications and Information Systems DePaul University 2 Department of Biochemistry and Molecular Biology The University of Chicago
Sequence Optimization For Synthetic Genes Using Genetic Algorithms
David Sigfredo Angulo 1, Rob Vogelbacher 1, Benjamin R. Capraro 2, Tobin Sosnick 2, Shohei Koide 2
1 School of Computer Science, Telecommunications and Information Systems, DePaul University
2 Department of Biochemistry and Molecular Biology, The University of Chicago
Introduction
• Genetic Algorithms:
– Using ideas based on the biology of genes
– Create software that uses such stochastic means to search through large search spaces
– Resulting algorithm has nothing to do with genes
• Designing Genes
– This search space is huge
– REALLY NOVEL IDEA:
• Use Genetic Algorithms based on genes to design genes!!
Outline
• Short biology Tutorial
• DNA Sequence Generation
– Why is the problem difficult?
• IBG Gene Designer
– Genetic Algorithm (GA) solution
– Heuristics and Fitness Evaluation
First
• Before the problem can be described
– Must give some background biochemistry principles
• Tutorial outline
– DNA
– Codons
– Protein
• Synthetic genes
– What are they and what are they used for?
– Restriction Enzymes
– Expressing Proteins using Vectors
Transcription/Translation
Central Dogma of Molecular Biology:
DNA → (transcription, by RNA polymerase) → RNA → (translation, by ribosomes) → Protein
DNA
• Deoxyribonucleic acid
• Strand backbone is made of sugar & phosphate molecules
• Strands connected by nitrogen containing nucleotide bases
• Two strands join making a double helix
• Each strand is made of nucleotides joined together
[Figure: levels of DNA packing, with approximate widths]
• 2 nm: short region of DNA double helix
• 11 nm: "beads on a string" form of chromatin
• 30 nm: chromatin fiber of packed nucleosomes
• 300 nm: section of chromosome in an extended form
• 700 nm: condensed section of chromosome
• 1100 nm: entire mitotic chromosome
DNA
Four Nucleotides: A, G, T, C
DNA: Base Pairing
Short Biology Tutorial
• Tutorial outline
– DNA
– Codons
– Protein
– Restriction Enzymes
– Expressing Proteins using Vectors
DNA Sequence Generation: Codon to Amino Acid Translation
• Can be designed to “block” the action of other proteins
• Expressed proteins
– Expressed in cow’s milk or chicken eggs
– Can manufacture drugs on large scales in this way
• E.g. insulin
Synthetic Genes
• DNA sequences
– “backtranslated” from a novel Protein or Amino Acid sequence
DNA → (transcription, by RNA polymerase) → RNA → (translation, by ribosomes) → Protein
• We’ll put the DNA for our designed protein into an organism (a vector)
• Then that vector will make (express) our protein
• But, how do we get the DNA into an organism???
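Backtranslation and the size of the resulting search space can be sketched in a few lines of Python. This is only an illustration and not the IBG GeneDesigner code: the function names are hypothetical, and the only fixed input is the standard genetic code, written compactly with bases in T, C, A, G order.

```python
from itertools import product
from math import prod

# Standard genetic code; first, second and third base each run through T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# codon -> one-letter amino acid ('*' marks a stop codon)
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# amino acid -> list of synonymous codons
AA_TO_CODONS = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS.setdefault(aa, []).append(codon)


def translate(dna):
    """Translate a DNA coding sequence (length divisible by 3) to protein."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def naive_backtranslate(protein):
    """Pick the first listed codon for every residue (no optimization at all)."""
    return "".join(AA_TO_CODONS[aa][0] for aa in protein)


def count_encodings(protein):
    """Number of distinct DNA sequences that encode the given protein."""
    return prod(len(AA_TO_CODONS[aa]) for aa in protein)


if __name__ == "__main__":
    peptide = "MKLSR"  # arbitrary example peptide, not from the talk
    dna = naive_backtranslate(peptide)
    print(dna, translate(dna) == peptide, count_encodings(peptide))
```

Because the count is a product over residues, it grows exponentially with protein length, which is why an exhaustive search over all synonymous sequences is not practical.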
Short Biology Tutorial
• Tutorial outline
– DNA
– Codons
– Protein
– Restriction Enzymes
– Expressing Proteins using Vectors
Restriction Enzyme Digests
• Watson – Crick 1953
• Took 20 years to be able to do anything with DNA
• H. Smith (and others) made a discovery that allowed manipulation and deciphering of DNA
• The discovery was that bacteria produce enzymes that introduce breaks in double-stranded DNA molecules wherever they encounter a specific string of nucleotides
• These enzymes are called Restriction Enzymes
• Restriction Enzymes can be used as precise scissors
– They let biologists cut (and paste) portions of DNA
EcoRI
• EcoRI was one of the first Restriction Enzymes discovered
– "Eco" because it was isolated from E. coli (Escherichia coli)
– "R" because it came from strain RY13
– "I" because it was the first Restriction Enzyme identified in that strain
– Now over 300 Restriction Enzymes known
• EcoRI cleaves (restricts, digests) DNA
– Between the G and A nucleotides
– Only when it encounters them in the string 5'-GAATTC-3'
– This is called the restriction site
5'-GAATTC-3'
3'-CTTAAG-5'

5'-G AATTC-3'
3'-CTTAA G-5'
Regulated by EcoRI
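As a rough sketch of what "cutting at the restriction site" means computationally, the snippet below finds every GAATTC in a top-strand sequence and splits the strand between the G and the A. The function names and the example sequence are made up for illustration; only the recognition site and the cut position come from the slide.

```python
ECORI_SITE = "GAATTC"
CUT_OFFSET = 1  # EcoRI cuts between the G and the first A on the top strand


def find_sites(dna, site=ECORI_SITE):
    """Return the 0-based start index of every occurrence of `site` in `dna`."""
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)
    return positions


def digest_top_strand(dna, site=ECORI_SITE, cut_offset=CUT_OFFSET):
    """Split the top strand at every recognition site and return the fragments."""
    fragments = []
    previous = 0
    for pos in find_sites(dna, site):
        cut = pos + cut_offset
        fragments.append(dna[previous:cut])
        previous = cut
    fragments.append(dna[previous:])
    return fragments


if __name__ == "__main__":
    print(digest_top_strand("AAGAATTCTT"))  # hypothetical 10-base sequence
```

A sequence with no recognition site comes back as a single fragment, which is the situation a designed gene needs for every enzyme that must not cut it.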
Sticky Ends
• Many restriction enzymes cut in such a way that some single-stranded DNA is left at both ends
• These nucleotide sequences
– Are complementary to each other
– Are 5'-AATT-3' in the case of EcoRI
– Can base pair with other nucleotides in a sequence
– Thus, are called "sticky ends"
– Can temporarily hold two DNA strands together
– The enzyme ligase will permanently join those strands
– This is called ligation
5'-GAATTC-3'
3'-CTTAAG-5'

5'-G AATTC-3'
3'-CTTAA G-5'
Regulated by EcoRI
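The reason the two EcoRI overhangs are complementary to each other can be checked with a reverse-complement helper. This is a small illustrative sketch with hypothetical function names: both the recognition site GAATTC and the overhang AATT read the same on the opposite strand.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def is_palindromic_site(site):
    """True when the site reads the same 5' to 3' on both strands."""
    return site == reverse_complement(site)


if __name__ == "__main__":
    print(is_palindromic_site("GAATTC"))  # EcoRI recognition site
    print(is_palindromic_site("AATT"))    # EcoRI single-stranded overhang
```

Since the overhang is its own reverse complement, any two EcoRI-cut ends can base pair, which is what lets a cut gene be pasted into a cut plasmid.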
Short Biology Tutorial
• Tutorial outline
– DNA
– Codons
– Protein
– Restriction Enzymes
– Expressing Proteins using Vectors
Gene Synthesis: On the Lab Bench
• Initial Sequence Construction
– Oligonucleotides (short strands of DNA) are defined with complementary overlapping sites
• The “sticky ends”
– Assembly PCR
• Oligonucleotides and polymerase are mixed and placed in a thermocycler
• Creates contiguous DNA sequence from component oligos
Gene Synthesis: On the Lab Bench (cont)
• After PCR, generated DNA sequence cut with restriction enzymes
• Expression host's plasmid cut with restriction enzymes
• Synthetic gene inserted into plasmid and plasmid repaired
• Expression Vectors
– Host organisms used to express the synthetic genes (make the protein)
– Typically E. coli
• Possibly Chickens or Cows
• Expression vector can now express protein coded for by synthetic gene
– A bit more complicated than described above!!!
DNA Sequence Generation: Gene Insertion
Outline
• Short biology Tutorial
• DNA Sequence Generation
– Why is the problem difficult?
• IBG Gene Designer
– Genetic Algorithm (GA) solution
– Heuristics and Fitness Evaluation
DNA Sequence Generation: The Computational Problem
• Why is the problem difficult?
– Conflicting goals
DNA Sequence Generation: The Computational Problem (cont)
• Why is the problem difficult?
– (continued)
– Restriction Enzymes
• The vector will contain many restriction enzymes
– If these cut up our DNA, we won’t express our proteins
– We must design the DNA string using synonymous codons so that there are no restriction sites
• Helpful to include some other restriction sites
– We must design the DNA string using synonymous codons so that these are included
– (continued)
DNA Sequence Generation: The Computational Problem (cont)
• Why is the problem difficult?
– (continued)
– mRNA Secondary Structure
• In prokaryotes, mRNA can fold into complex shapes
• This inhibits protein creation
– Oligonucleotide generation
• Want a specific melting temperature so that the complex folding doesn’t take place
• The “sticky ends” must have the same melting temperature so that they will bind together.
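To make the melting-temperature requirement concrete, here is a minimal sketch that scores overlap regions with the Wallace rule, Tm ≈ 2(A+T) + 4(G+C) in °C. The slides do not say which Tm model GeneDesigner uses, so the rule is an assumption chosen for simplicity; it is a crude estimate meant for short oligonucleotides, and the function names are hypothetical.

```python
def wallace_tm(oligo):
    """Estimate melting temperature (deg C) with the Wallace rule.

    Tm = 2 * (A + T) + 4 * (G + C); a rough approximation for short oligos.
    """
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc


def tm_spread(overlaps):
    """Difference between the highest and lowest estimated Tm of the overlaps."""
    temps = [wallace_tm(o) for o in overlaps]
    return max(temps) - min(temps)


if __name__ == "__main__":
    # Hypothetical overlap regions; a smaller spread means more similar annealing.
    print(tm_spread(["GAATTCGGCA", "ATGCCGTTAA", "CCGGATATTC"]))
```

An optimizer can treat the spread as a penalty, so that candidate oligonucleotide sets whose overlaps anneal at similar temperatures score better.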
Outline
• Short biology Tutorial
• DNA Sequence Generation
– Why is the problem difficult?
• IBG Gene Designer
– Genetic Algorithm (GA) solution
– Heuristics and Fitness Evaluation
IBG GeneDesigner: Our Solution
• IBG GeneDesigner
IBG GeneDesigner: Genetic Algorithm
• Uses a Genetic Algorithm for sequence optimization
– Tournament selection model
– Uniform and single-point crossover (behind the scenes – not user selectable at present.)
– Mutation causes codon “wobbling”
– Sequence “fitness” determined by heuristic evaluation
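The operators listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the GeneDesigner implementation: an individual is a list of codons, "wobbling" is modeled as swapping a codon for a random synonymous one, and the tournament size, mutation rate, population size and the `evolve` loop itself are hypothetical choices. The fitness function is supplied by the caller.

```python
import random
from itertools import product

# Standard genetic code, bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
AA_TO_CODONS = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS.setdefault(aa, []).append(codon)


def random_individual(protein, rng):
    """One random synonymous encoding of the protein, as a list of codons."""
    return [rng.choice(AA_TO_CODONS[aa]) for aa in protein]


def tournament_select(population, fitness, rng, k=3):
    """Pick k random individuals and return the fittest of them."""
    return max(rng.sample(population, k), key=fitness)


def single_point_crossover(a, b, rng):
    """Head of parent a joined to the tail of parent b at a random codon boundary."""
    if len(a) < 2:
        return list(a)
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def uniform_crossover(a, b, rng):
    """Take each codon from either parent with equal probability."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def wobble_mutate(individual, rng, rate=0.05):
    """Replace each codon, with probability `rate`, by a random synonymous codon."""
    return [
        rng.choice(AA_TO_CODONS[CODON_TO_AA[c]]) if rng.random() < rate else c
        for c in individual
    ]


def evolve(protein, fitness, generations=50, pop_size=40, seed=0):
    """Run a simple generational GA and return the best DNA string found."""
    rng = random.Random(seed)
    population = [random_individual(protein, rng) for _ in range(pop_size)]
    for _ in range(generations):
        next_generation = []
        while len(next_generation) < pop_size:
            a = tournament_select(population, fitness, rng)
            b = tournament_select(population, fitness, rng)
            crossover = rng.choice([single_point_crossover, uniform_crossover])
            next_generation.append(wobble_mutate(crossover(a, b, rng), rng))
        population = next_generation
    return "".join(max(population, key=fitness))
```

Because crossover only exchanges codons at the same position and mutation only swaps synonymous codons, every individual in every generation still encodes the input protein. Only the choice among synonymous codons is being searched.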
IBG GeneDesigner: Fitness Evaluation
• GeneDesigner heuristics
– Manipulation of nucleotide percentages/ratios to reduce mRNA secondary structure formation
– Inclusion and Exclusion of restriction sites
• Restriction sites requested for inclusion should only occur once
– Matching of codon preference
– Oligonucleotide generation
• Fitness determined by melting points, start and end nucleotide
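A heuristic fitness of this kind can be sketched as a weighted sum of penalties. The version below is a simplified illustration, not GeneDesigner's scoring: it covers only G/C content and restriction-site inclusion and exclusion, the weights and parameter names are hypothetical, and codon preference and oligonucleotide terms are left out for brevity.

```python
def gc_fraction(dna):
    """Fraction of bases that are G or C."""
    return (dna.count("G") + dna.count("C")) / len(dna)


def count_occurrences(dna, site):
    """Count occurrences of `site` in `dna`, including overlapping ones."""
    count = 0
    start = dna.find(site)
    while start != -1:
        count += 1
        start = dna.find(site, start + 1)
    return count


def fitness(dna, target_gc=0.40, exclude=(), include=(), weights=(1.0, 1.0, 1.0)):
    """Higher is better; 0 means every modeled goal is met exactly.

    exclude: sites that must not appear at all.
    include: sites that should appear exactly once.
    weights: hypothetical relative weights for (G/C, exclusion, inclusion).
    """
    w_gc, w_exclude, w_include = weights
    gc_penalty = abs(gc_fraction(dna) - target_gc)
    exclude_penalty = sum(count_occurrences(dna, s) for s in exclude)
    include_penalty = sum(abs(count_occurrences(dna, s) - 1) for s in include)
    return -(w_gc * gc_penalty + w_exclude * exclude_penalty + w_include * include_penalty)


if __name__ == "__main__":
    # Hypothetical candidate sequence scored against an EcoRI exclusion.
    print(fitness("ATGAAAGAATTCCTGTGG", exclude=("GAATTC",)))
```

The sketch scans only the given strand. That is enough for palindromic sites such as GAATTC, but a non-palindromic site would also need its reverse complement checked.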
IBG GeneDesigner: Future Work
• Algorithm parameters
– Systematically manipulate GA parameters to identify default values for sequence optimization
• Population size
• Number of generations
• Mutation rate
• Convergence criteria
– Modify heuristic weighting scheme
• Selection models
– Experiment with alternative selection models (Roulette wheel, elitism, limit population replacement)
IBG GeneDesigner: Future Work
• Move algorithm to ECJ architecture
– Use the Strength-Pareto multi-objective optimization algorithm
• Create web-based version of application
• Explore island model effects on optimization
Results
• IBG GeneDesigner was used to generate a nucleotide sequence for the SH3 domain of α-spectrin1.
• The codon optimization option was set for expression in E. coli with a 40% G/C bias.
• We also used the application to generate four assembly PCR template oligonucleotide sequences to produce the protein coding sequence flanked by desired restriction enzyme recognition sites.
• The calculated Tm values of the three overlapping regions were within 1.6 °C of one another
– Promoting similar annealing behavior between strands.
– Success of the reaction was confirmed by DNA sequencing of a pUC19 expression vector containing the PCR product cloned between restriction sites included in the gene design.
• Summary: Protein Made!!!
Input: Protein Sequence, Vector, Restriction Enzymes
Input: Flanking Sequences
Input: Algorithm Parameters and Fitness Scores
Output: Generation of Oligonucleotides
Acknowledgements
• Graduate student who did much of the coding
– Rob Vogelbacher
• University of Chicago undergraduate who used it to build a protein
– Benjamin R. Capraro
• His advisor
– Tobin Sosnick
• Our collaborator at University of Chicago
– Shohei Koide