This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Functional Genomics - The challenge: Many new genes of unknown function• Where/when are they expressed?
– Known genes (e.g. from genome projects)• Gene chips (Affymetrix)• Microarrays (Oligo, cDNA, protein) (Iyer paper week 4)
– Novel genes• Expression profiling (3 papers this week)
– Genomic tiling microarrays (Kapranov)– SAGE and related approaches (RIKEN)– Massively parallel sequencing (RNAseq) (t’Hoen)
• Which genes regulate what other genes? • Epigenetic modification of gene expression (week 7 papers)• What is the phenotype of loss-of-function? (week 8 papers)
• Positional cloning – identifying by where the gene is located– from genetic linkage map– by walking from nearby sequence tags (ESTs, STS, STC, etc)– Gene trap techniques (week 8)– Interspecific backcrosses (Mus musculus vs M. spretus)– When possible – try to rescue phenotype with candidate region
• Positional cloning – ok you have identified a region where your gene of interest may be located - now what?– How do we figure out what genes are in this region without
knowing function?– General problem for annotation of sequences
• Three major databases of sequences (automatically duplicated)– GENBANK http://www.ncbi.nlm.nih.gov/Genbank/index.html– DNA Databank of Japan http://www.ddbj.nig.ac.jp/– European Molecular Biology Laboratory (EMBL)
• Gene identification/prediction is important but difficult– Large variety of methods and algorithms to predict exons– To identify genes must first identify open reading frames (ORFs)
• When dealing with cDNAs – look for regions that code for proteins
– Do all genes code for proteins?– Correct reading frame for a sequence is assumed to be
largest with no stop codons (TGA, TAA, TAG)• Lots of tricks can be employed
– Codon frequency for an organism» Coding sequences follow codon usage» Noncoding sequences do not, often have lots of stop
codons– Consensus sites
» Kozak translational initiation CCRCCATGG• What is a very important consideration when searching
sequences to predict ORFs?– Sequence must be accurate
» Incorrect base calls are troublesome» But indels (insertions or deletions) are disastrous
– Generally, microbial genomes are much easier to annotate – WHY?
• Simply identify ORFS > 300 bp (100 amino acids)– Works very well– But can miss small coding sequences– Must run on both strands because there are shadow genes
(overlap on two strands)• Using a variety of programs, can predict genes in bacterial
Genome annotation – how to identify genes (contd)?
• Computer based gene prediction methods (contd)– Training sets are used to
“teach” programs howto solve problems
• Training set is actual data – genes with known features– Programs use training sets to classify new data
• Neural networks use training data to build decision trees• Rule-based systems use training data to generate rules• HMM build table of probabilities for states and transitions
– Pierre Baldi in IGB is UCI expert in machine learning
• How well do gene predicting programs work?– Extremely well on bacterial genomes– Fairly well on simply eukaryotic genomes– Variable on complex genomes– Rule of thumb – use a group of programs and look for areas of
agreement among them– The best current programs combine ab initio predictions with
• You have identified a gene – what is its function?– Always look for similarity to known sequences
• Swiss-prot is fairly well annotated • GENBANK translated database is most complete• BLAST is tool to use• Amino acid searches more sensitive than nucleotide searches
– Because identical amino acid sequences might only be 67% identical at nucleotide level
– What might you find?• Match may predict biochemical and physiological function
– e.g., a known enzyme from another organism• Match may predict biochemical function only
– e.g., a kinase• Match a gene from another organism with no known function
– May match ESTs or ORFs from other organisms• Match a known gene with partly characterized function
– Search leads to clarification of function – NifS in book• Might not match anything at all
• Extremely important as number of sequences increases– Goals are to identify
• all of the sequences• all of the features of each sequence• All of the functions of the identified genes
– Sometimes annotation does not agree with known function• Human error• New and updated information not propagated to database• Inaccurate sequencing• Sometimes annotation is correct but protein lacks function
under certain conditions (e.g., need cofactors)– Gold standard for functional analysis is loss-of-function analysis
• Most accurate annotation– Common to have “annotation jamborees” where biologists and
bioinformaticians come together to annotate new sequences• Xenopus tropicalis jamboree was in Spring 2006• But many genes and gene models are still unannotated
• The biggest defect of expression microarrays or transcript profiling is that neither can distinguish direct targets from indirect targets– Which genes are a primary response to the treatment vs.– Which ones have one or more intermediates?
• How can we approach and solve this important problem?
• All eukaryotic DNA occurs as chromatin (DNA+histone and other proteins)
– Chromatin conformation influences whether transcription can occur
– Identify the genes to which a transcription factor binds– Identify to which genes RNA polymerase II is recruited
• And from which it is dismissed
Which genes regulate what other genes?
—Open chromatin is accessible to transcriptional machinery• DNA is unmethylated• Histones are methylated and acetylated (acetylase activates)• Many transcriptional co-activators methylate and acetylate
histones and other chromatin associated proteins− Opens the chromatin
—Closed chromatin is inaccessible to transcriptional machinery• DNA is methylated• Some histone tails are methylated, others not• Transcriptional co-repressors recruit histone deacetylases
(HDACs) which lead to chromatin condenstation• Chromatin condensation leads to gene silencing
• Identification of chromatin-localized proteins is diagnostic for direct target genes of transcription factors
− Most common application
• Identification of promoters to which RNA Pol II is recruited upon some treatment is diagnostic for genes directly upregulated by the treatment
− Trickier but very useful
• Identification of promoters from which RNA Pol II is dismissed upon some treatment is diagnostic for genes downregulated by the treatment
• ChIP - general strategy (contd)– Break chromatin into small chunks by sonicating– Typically want ~500 bp fragments– Evaluate sonication quality and extent by gel electrophoresis to
ensure that size range is obtained• Needs MUCH optimization
• Flavors of ChIP commonly in use– ChIP-sequencing (ChIP-seq) – chromatin immunoprecipitation
sequencing• Massively parallel sequencing of recovered fragments• Unbiased method to identify transcription factor binding sites• Price is fast converging on ChIP-chip• Requires excellent, well-characterized antibody!
• Phylogenetic footprinting comparison of zebrafish, mouse and xenopus caudal orthologs (cdx4, cdx4, Xcad3)– A number of putative conserved elements identified including– TTCATTTGAATGCAAATGTA– Absolutely conserved in all 3 promoters– Compare with database
• http://www.ncbi.nlm.nih.gov/blast/
– Also found in human cdx4 – a good validation of the result
• As more genomes are sequenced and compared, phylogenetic footprinting becomes a very powerful filter to identify potentially conserved regulatory sequences– ECR browser offers precomputed comparisons of conserved
1. (8 points) You are a 4th year senior at UCI and had hoped to stay for a 5th year so that you could double major in Molecular Biology and Biochemistry together with Pharmaceutical Sciences. Sadly, the powers that be in the School of Biological Sciences have decreed that no one can stay for more than 4 years irrespective of the reason. In order to graduate on time, you need to take an advanced molecular biology laboratory. The class is a new one at UCI and involves the entire class collaborating on a project to sequence and characterize the genome of an as yet uncharacterized organism. UCI is part of the Genome 10K project that aims to sequence 10,000 vertebrate genomes. An enterprising young assistant professor has secured support from NSF to fund this course. This year’s project is to sequence the genome of Myrmecophaga tridactyla - the giant anteater (why are you surprised?). This is a one quarter long project so there is no time to waste.
a) (2 points) The department chair read Craig Venter's autobiography and suggests that the EST (expressed sequence tag, cDNA) project use a 384-capillary sequencer (donated by Craig Venter), together with cycle sequencing to generate 200,000 ESTs. Is this a good idea? Explain why, or why not.
Not a good idea. Standard sequencing can certainly generate 200K ESTs, but in more like a year using a single machine and Sanger sequencing (200,000/384 = 520 runs). Therefore, it will be too slow and you will not graduate. It will probably cost much more than the existing budget, too.
Amanda is right – you need a new department chair. The only hope of generating a map in the time that is available will be to use HAPPY mapping. I would map the ESTs being identified in part a) to genomic DNA using this PCR based method. It will be tedious, but doable. HAPPY mapping uses isolated genomic DNA and PCR to determine how close 2 markers are to each other in the genome.
1b) (3 points) Your project is to generate a map of the anteater genome. The department chair (who is full of ideas) suggests that you conduct large-scale BAC end sequencing and order these by sequentially hybridizing individual BACs to the library. Amanda, the course director, thinks this is a bad idea. Who is right? Amanda says that you must select the method to use and generate a good map by the end of the quarter, otherwise you are not going to pass the course or graduate. Explain briefly why you chose the method you did and in succinct terms highlight the main features and strengths of the method
Bill is correct – only nextgen sequencing will give yield enough sequence to assemble the genome in the time available. Sequencing by hybridization is only really applicable for resequencing regions that you already know the sequence of. The essential feature of nextgen sequencing is that, depending on which flavor you choose, it can generate ~108-109 bases per run. Thus, you should be able to sequence the genome to a high degree of completion without many runs.
1. c) (3 points) Your significant other is in the group that will generate the actual genome sequence and asks your advice about how to proceed. Bill, the group leader, proposes to use nextgen sequencing while George, a very vocal member of the group read Jurassic Park and insists that sequencing by hybridization is the best and fastest approach (after all, they cloned dinosaurs!). Who is correct? Why? What key features of the chosen method might enable the sequence to be generated and assembled in 10 weeks?
2. (12 points) Since you were able to compete your tasks in the molecular biology lab course and graduated on time, you have been recruited to NASA where you will be part of a team that characterizes the biological material returned from a probe that has been secretly sent to collect samples from Europa – a moon of Jupiter that was long suspected to have liquid water under a crust of ice. Indeed, Europa has an ocean under the ice and NASA has quietly returned a sample to Earth for characterization.
2a) (3 points) The first task is to determine whether there are any living organisms in the sample. How might you go about identifying how many different organisms are in the sample with a high probability of not missing any? What is the main assumption you must make in order for this to be successful?This is based on the Venter paper. You will want to extract DNA from the sample and
sequence it directly. The main assumption you have to make is that these organisms will contain DNA that is sufficiently similar to that of terrestrial organisms so that you can sequence it using standard methods and assemble into genomes by computer. If you are lucky it will be and a strategy similar to whole genome shotgun that Venter used will be best. However, it is 2015 so you would want to use one of the nextgen methods.
2b) (3 points) You are lucky - your sample yields 11,279 sequence assemblies, 8017 of which are complete, or nearly complete. Are these genomes likely to be prokaryotic or eukaryotic? Why? How will you determine whether these sequences are bona fide residents of Europa vs. contaminants introduced on Earth by a sloppy scientist (not you, of course) ?
If you got 8017 complete genomes, then these are probably small, making it likely that they will be prokaryotic. I will compare these sequences with those available in the GENBANK database and determine whether they are identical to any organisms found on Earth. If they are not, then they have likely originated on Europa. If some are identical, or highly similar to terrestrial organisms, then you would suspect contamination.
After sequencing the bacteria, I would develop a RT-PCR assay using Taqman or similar technology to quickly identify these sequences in any sort of sample you receive. Such technology uses custom primers to specifically and quantitatively identify matching sequences in any number of samples. PCR makes it very sensitive. Or, if you had microarrays containing Europan sequences, you could use those.
2c) (3 points) Unfortunately, your colleagues on the project develop a strange illness. Orange fur grows all over their bodies and after two weeks, they develop high fevers, become delirious and demand to be BioSci majors at UCI. Sadly, they die shortly afterward. You are the kind of person who makes lemonade out of lemons, instead of complaining that they are not sweet enough. It is tragic that your colleagues died, but the identification of Europan DNA sequences on their skin suggests that the microorganisms can be cultured. Sure enough, you are able to culture two types of bacteria from the skin of the first deceased patient. Design a diagnostic test to determine whether someone carries these Europan bacteria on their skin, or whether it is present in any sort of random sample to be tested? Note the key features of your assay
3d) (3 point) Clearly, causing a person to develop orange skin and later die is a serious problem. Describe a relatively thorough approach to identify and track the types of changes over time induced in patients by exposure to the Europan organisms ?
The approach I was looking for was that of the Chen paper. Since you have learned about genomic and transcriptomic analyses, it would be appropriate to perform whole genome sequencing to rule out any mutations induced by the infection and transcriptome analysis to identify changes in gene expression over time. It would be best if you had pre-infection samples to compare with those after the person was presumed to be infected.
3. (8 points) The plot thickens. Patients, doctors and nurses in the local hospital begin to show signs of the disease – orange fur. Word gets out to the media and suddenly Anderson Cooper is there broadcasting from in front of the hospital, bravely not wearing hazmat gear. He is muy macho but not very bright (and does not look good in a hazmat suit anyway). Intriguingly, some of the patients do not become delirious and die, they simply have orange fur.
3a) (2 points) The first thing you need to figure out is how the contamination spread from the original patient to others. You recall that your sarcastic D145 professor said that doctors are the major source of infections in hospitals, but you aspire to be a doctor and don't believe him. The patients were immediately put into isolation after arriving in the hospital so it is very unlikely that the contamination spread without personal contact. How could you test whether the Europan microorganisms in Dr. Quackenbush (the attending physician in the ER) are the same as in the initial patients from NASA and in the others who subsequently become affected? Dr. Quackenbush appears to be unaffected.Perhaps the best way to do this is to sequence microorganisms collected from the various patients and swabs from Dr. Quackenbush’s body. If the sequences are identical, AND present on Dr. Quackenbush, then one appropriate conclusion would be that he is an intermediate carrier of the infection. You could also use high resolution fingerprinting to compare the strains with each other. Expression microarrays are not sufficient to discriminate closely related species as these are likely to be.
3b) (3 points) Intriguingly, Dr. Quackenbush appears to have the Europan microorganisms on his hands and on his lab coat and gloves, but remains unaffected. Nevertheless, he is placed into isolation and observed. As time passes and more data are collected, you notice that there are 3 phenotypes among people who show evidence of Europan microorganisms on their skin (detected with the test you developed in 2c). These are a) orange fur, delirious, then dead, b) orange fur but otherwise healthy, and c) unaffected. This sounds to you like a classic case of a single gene that differs when it is homozygous for one allele (sensitive, i.e. dead), or the other allele (resistant), or heterozygous (orange but alive). You perform whole genome association studies and quickly discover that one region on chromosome 20 (about 1 megabase long) is highly associated with resistance to the Europan microorganisms. How might you determine whether this region is transcribed into mRNA and, if so, how large the transcripts are and how much of the transcript is present?Make RNA from skin of affected people and perform Northern analysis using the region from chromosome 20 as a probe to detect whether transcripts are present and what size they are. Add a standard to determine how much is present. Northern is the only way to detect size but you could use QRT-PCR to quantitate with appropriate standards. Many people wanted to use RNA-seq and then determine which sequences mapped to this region. I gave credit for this, but this would be equivalent to killing a fly with a missle.
3c) (3 points)You have identified several families where one parent is totally resistant to the infection, the other parent gets orange fur and the children are either resistant or get orange fur but do not die. Your analysis in b) above has shown that there are no differences in the coding capacity of this region between resistant, partially resistant or susceptible individuals and that the sequence of the affected region does not change significantly among patients. The data suggest that a variation in copy number of some part of the candidate region might be responsible for resistance. Outline how you would go about identifying copy number variations in this region, determining whether patients have the CNV or not and if CNVs are linked with the disease?This would use an approach similar to that in the Redon paper. 1) Use SNP or chromosome tiling microarrays to identify CNV regions and 2) test this in all sorts of patients. 3) If the CNV is linked with the disease, then the patients should segregate into groups where the number of copies is correlated with the severity of disease.
4. (7 points) As you might expect, Anderson Cooper has now developed orange fur. CNN is desperate to know whether or not he will survive and the ongoing drama is generating big ratings. They need to know how long the ratings bonanza will continue so that they can project how much they can charge for commercials. Meanwhile, Larry King has come out of retirement to anchor the news story since he is so old that no one is worried whether he will contract the disease (plus it makes him look brave).
4a) (1 point) How could you determine whether Anderson will die, or simply remain orange (other than waiting and observing what happens) ?Assuming you showed that CNVs are associated with resistance in 2c, use the same method to determine which group Anderson falls into.
Two methods come to mind. The first would be gene expression microarray analysis. You would prepare mRNA from both cell populations, label it and probe appropriate microarrays. You would then compare the genes that are expressed in normal skin stem cell and the infected, hair follicle stem cell-like cells. Genes that are expressed differently will be the ones that you focus your attention on. You could also perform exhaustive nextgen sequencing of cDNA derived from the mRNA to identify which sequences are different between the 2 populations. This method has the benefit that it would detect microorganism encoded transcripts that would not appear in your analysis otherwise. Of course, you need a genome assembly to map the RNA-seq reads against.
4b) (3 points) Continuing to make lemonade out of lemons, you hypothesize that understanding how the Europan microorganisms infect skin cells and lead to the generation of orange fur might lead you to a cure for baldness and give you financial security forever. You found that the microorganisms behave somewhat like Chlamydia and grow as intracellular parasites in skin stem cells. Your analysis suggests that the infected skin stem cells are instead transformed into a cell type like hair follicle stem cells. Assuming that you can isolate both cell types in pure form, outline how would you determine what transcripts are expressed in normal skin stem cells compared with infected stem cells?
First you need to culture skin cells from regions with orange fur (infected) and regions of normal skin (presumably uninfected). The most brute force method would be to conduct genome sequencing from both cells and focus on where LINE elements are located in each. If they are in different places, then you will have proven that they are moving around. Alternatively, you could use FISH to show that the chromosomal patterns of LINE localization are different between the two cell types. Credit was given for a variety of creative answers.
4b) c) (3 points) Another possible hypothesis is that something about the infection of skin cells with the Europan microorganisms has mobilized one or more LINE elements (retrotransposon-like repeated sequences) allowing them to hop around the genome and cause the strange effects observed. How could you determine whether LINE elements in cultured skin cells infected with Europan organisms compared with uninfected cells, move from their original location in the genome to a new location?