Jul 28, 2018
Logistics
• Syllabus distributed
– Class taught in 3 stages by faculty in CS, math/stats, and microbio
– Grades will be based on up to six homework assignments
– Office hours on syllabus. All faculty are readily available by email. We are happy to discuss the class with you personally.
– Not all notes will be available online - you should attend all lectures and take good notes
• Diverse group of students
• Emphasis will be on understanding methods and practical use of existing bioinformatics tools
• Why are you here? What is your background? What are you hoping to get out of this class? Please sign the email sheet!
• Homework will involve the use of the unix ED-LAB computers. There will be a special meeting on WEDNESDAY, SEPTEMBER 14 for novice unix users.
What is Bioinformatics
• Computational Biology: The use of algorithmic, mathematical, and statistical methods to analyze genome sequences (i.e. DNA, RNA, protein) and derived data (e.g. expression, NMR, etc.)
• Informatics: The software and data management methodologies for storing, retrieving, and integrating such data
• Data Mining / In-silico Biology: Hypothesis generation and testing from genome data sets
Topics
• Detecting similar sequences (homology)
– Pairwise and multiple sequence alignment
– Protein function/structure prediction
• Sequence pattern modeling and recognition
– Motif discovery
– Gene finding
• Analyzing high-dimension data
– Function prediction, target discovery, etc. from gene expression
• Constructing trees
– Phylogenetics
• Informatics and integration
– Genome biology
The Cell
• Prokaryotes are unicellular with minimal compartments - bacteria, archaea
• Eukaryotes are often multicellular, with cell differentiation and many organelles including the nucleus, and typically can reproduce sexually - all higher organisms including mammals, birds, fish, invertebrates, mushrooms, plants, and yeast (which is unicellular). Roughly 30,000,000,000,000 cells in a human.
The Cell
• The cell is composed of and makes thousands of proteins, e.g.
– the cell membrane is made of a layer of proteins and lipids.
– There are special proteins embedded in the membrane as channels and pumps
– And the cell makes (synthesizes) proteins
• “DNA makes RNA, RNA makes proteins, and proteins make us!” F. Crick
• The cell is a chemical catalytic machine
• Networks:
– One type of network is the metabolic network, which describes catalytic reactions for the consumption or synthesis of products necessary for life. Many of these are fairly well understood. (e.g. photosynthesis)
– Another type of network is the signaling network, where information is conveyed about the environment. These are partially understood. (e.g. protein kinases are involved in cell differentiation and cell death)
The Cell - Genetic Information
• There is a third major type of network: genetic information processing. We will focus on these networks.
• To understand this:
– we describe the nature of DNA
– tangentially mention homology and conservation
– then discuss the process of translation
DNA Structure - Eukaryotic Chromosome
• DNA - a string of nucleotides, each carrying one of four bases (Adenine, Guanine, Cytosine, and Thymine)
• Regular, long, stable, oriented, double-stranded, helical structure
• Humans: 23 pairs of chromosomes. Total ~3B “bases” (x2)
• DNA resides in the nucleus in eukaryotes
DNA Structure
• Always: chemical pairing of A-T and C-G. Thus, strands are complementary.
• Two chains run in opposite directions: each is read 5’ to 3’
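Because the strands are complementary and run in opposite directions, either strand determines the other. A minimal Python sketch of that relationship follows; the function name `reverse_complement` and the example sequence are illustrative choices, not part of the lecture.

```python
# Watson-Crick pairing from the slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    # Complement each base and reverse the order, because the strands are antiparallel.
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


print(reverse_complement("ATGCCG"))  # CGGCAT
```

Applying the function twice returns the original strand, which is a quick sanity check on the pairing table.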
Prokaryotic Chromosomes
• Prokaryotes (and mitochondria) have one circular chromosome
• This shows the E. coli genome with orange and yellow bars indicating the positions of the genes on the two strands.
RNA
• RNA is a similar molecule composed of 4 nucleotides (A, C, G, and U)
• Single-stranded.
• Can base-pair with DNA (synthesis)
• Can self-base-pair and fold
DNA Replication
• We won’t be discussing the details of DNA replication. There are 2 processes:
– Mitosis for normal cell duplication
– Meiosis for gametes for sexual reproduction - single, recombined chromosomes
• In both processes, DNA is copied by breaking double-strand (dsDNA) into single-strands (ssDNA) at origins of replication and synthesizing a complementary copy from the template.
– 50 bp/sec * 15K origins = ~1 hr to replicate human genome
• Problem:
– How does DNA polymerase find the origins? Are there sequence patterns?
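The replication-time estimate can be checked with a few lines of arithmetic. This sketch uses the slide's rate and origin count with the ~3B base figure quoted earlier, and it assumes (a simplification) that every origin is active in parallel for the whole time at the same rate.

```python
GENOME_SIZE_BP = 3_000_000_000  # ~3B bases, the figure quoted in these slides
RATE_BP_PER_SEC = 50            # synthesis rate per origin, from the slide
N_ORIGINS = 15_000              # number of origins, from the slide

# Assumption: all origins copy DNA simultaneously and at the same constant rate.
seconds = GENOME_SIZE_BP / (RATE_BP_PER_SEC * N_ORIGINS)
hours = seconds / 3600

print(f"{seconds:.0f} s = {hours:.2f} h")  # 4000 s = 1.11 h
```

Counting both copies of each chromosome (the "x2") would double the estimate under the same assumptions.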
DNA Conservation and Variation
• Mutations occur in DNA due to environmental effects (e.g. radiation) and random mistakes during synthesis. Usually just single nucleotides are changed; sometimes there are large rearrangements.
• Those changes occurring in somatic (non-sex) cells cause local damage, usually cell death, but can cause cancer. (Search for the common mutations that cause different types of cancers.)
• Those changes occurring in gametes can be inherited and if favorable can become “fixed”
• Variation in non-functional (junk) DNA tends to “drift”, whereas functional DNA (e.g. containing genes) tends to remain “conserved”.
• Problems:
– Given a set of sequences from different organisms:
• Identify and align sequences from a common ancestor (homologous)
• What are the important (conserved) parts?
• What was the evolutionary history? (Reconstruct the “tree”)
– Given a model organism (e.g. mouse, yeast, fruitfly, etc.), find the orthologous locus in human
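One common way to quantify how well two sequences align is a global alignment score computed by dynamic programming (the Needleman-Wunsch recurrence). The slides do not name a method at this point, so the sketch below is only one possible illustration; the scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of a and b (Needleman-Wunsch recurrence)."""
    # prev[j] is the best score for aligning the previous prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Either pair the two characters, or put a gap in one sequence.
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATACA"))  # 5
```

Here six positions match and one gap is needed, giving 6 - 1 = 5. Higher scores between sequences from different organisms are evidence for, not proof of, common ancestry.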
Examples of Sequence Conservation
• A segment from the RNA needed for protein synthesis - a fundamental process in all life forms. It is conserved across all 3 major branches of the tree of life.
• A multiple alignment of homologous protein sequences. Colors indicate different classes of amino acids. Dots are inserts/deletes.
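Given sequences that are already aligned, conserved positions can be found by looking at each column. The sketch below scores a column by the fraction of sequences that share its most common residue; this simple measure and the toy alignment are assumptions made for illustration, with "." and "-" treated as insert/delete symbols.

```python
from collections import Counter


def column_conservation(alignment: list[str]) -> list[float]:
    """Fraction of sequences sharing the most common residue in each column."""
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r not in ".-"]  # skip gap symbols
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Hypothetical toy alignment, made up for this example.
toy = ["ACG.T", "ACGAT", "ACCAT", "TCGAT"]
print(column_conservation(toy))  # [0.75, 1.0, 0.75, 0.75, 1.0]
```

Columns scoring 1.0 are identical in every sequence and are candidates for the "important" parts.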
DNA contains “GENES”
• Genes are hereditary units of DNA
– We now know that, for the most part, genes are regions that “code” for proteins
• Proteins are derived from DNA according to the “central dogma”: DNA => RNA => Protein
– Like DNA replication, DNA is opened into two single strands.
– Using a ssDNA as a template, a complementary copy of RNA is synthesized for a small region of the genome (1,000-100,000 nt)
– The RNA is processed and transported (more about that in later lectures)
– Each triple of RNA (codon) is translated to one of 20 amino acids, creating a polypeptide chain, which folds into a protein
• Problems:
– How does the cell know where to find a gene? (Sequence patterns?)
– How does RNA transcription know when to stop? (Patterns?)
– How is RNA edited?
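The transcription step (a complementary RNA copy made from a single-stranded DNA template) can be sketched as a base-by-base mapping. The function name and the example template are illustrative; the template is assumed to be written 3' to 5', so the RNA comes out 5' to 3'.

```python
# Pairing used during transcription: RNA carries U where DNA would carry T.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_3_to_5: str) -> str:
    """Build the RNA complementary to a DNA template strand.

    Assumes the template is given 3' to 5', so the RNA reads 5' to 3'.
    """
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())


print(transcribe("TACGAAATT"))  # AUGCUUUAA
```

This ignores RNA processing and the start and stop signals that the problems above ask about.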
Codon Translation
• Each triplet translates to exactly one amino acid (or a stop signal). For example, CUU is Leucine.
• There are 4*4*4=64 possible codons that translate into 20 amino acids, so most amino acids are encoded by more than one codon
• This translation table is fixed for almost all life
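The table can be written compactly as a 64-letter string. The sketch below builds the standard genetic code that way, with codons enumerated in U, C, A, G order and "*" marking stop codons; the `translate` helper and the example sequence are illustrative and assume the input is already in the correct reading frame.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; "*" is a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(rna: str) -> str:
    """Translate an RNA reading frame codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(CODON_TABLE))        # 64
print(CODON_TABLE["CUU"])      # L
print(translate("AUGCUUUAA"))  # ML
```

"L" is the one-letter code for Leucine, matching the slide's CUU example. Organisms or organelles that deviate from the standard table would need a different 64-letter string.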
Cell Differentiation
• Eukaryotes have many different cell types (skin, muscle, neurons, etc.) that each play a different role.
• To accomplish the cell’s role, different genes must be activated
• Problems:
– How are genes activated? What regulatory patterns are in the DNA?
– What genes control other genes? What network associations among genes can be found?
– What genes are “differentially expressed”?
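As a rough illustration of the last question, one simple way to flag candidate differentially expressed genes is the log2 fold change between two cell types. Everything in this sketch is an assumption made for illustration: the gene names, the expression values, the pseudocount, and the threshold are all made up, and a real analysis would also need replicates and a statistical test.

```python
import math

# Hypothetical expression levels (arbitrary units) in two cell types.
expression = {
    "geneA": (100.0, 400.0),
    "geneB": (250.0, 240.0),
    "geneC": (80.0, 10.0),
}


def log2_fold_change(level_1: float, level_2: float, pseudocount: float = 1.0) -> float:
    """log2 ratio of two expression levels; the pseudocount avoids dividing by zero."""
    return math.log2((level_2 + pseudocount) / (level_1 + pseudocount))


def differentially_expressed(data: dict, threshold: float = 1.0) -> list:
    """Genes whose expression changes at least 2**threshold-fold in either direction."""
    return sorted(gene for gene, (x, y) in data.items()
                  if abs(log2_fold_change(x, y)) >= threshold)


print(differentially_expressed(expression))  # ['geneA', 'geneC']
```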
Protein Sequence, Structure, Function
• Lastly, given a protein sequence, what is the 3-D structure and function?
• The most common approach is to exploit conservation (see earlier)
• Problem:
– Find similar proteins to my query protein. Maybe I can assign structure or function to my new query protein, if structure or function is already known for a homologous protein. (Sequence similarity searching, protein family modeling)