Unfinished part of lecture 5

Unfinished part of lecture 5. How to explain trans-linked gene If there is a cis-linked gene in a locus, a straightforward explanation of the trans-linkages.

Dec 21, 2015


How to explain trans-linked genes

• If there is a cis-linked gene in a locus, a straightforward explanation of the trans-linkages is that the sequence polymorphism in the eQTL first affects the expression of the cis-linked gene, and the cis-linked gene then affects the expression of the trans-linked genes.

• In this situation, we would expect to observe overall co-expression between the cis-linked gene and the trans-linked genes.
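The expected overall co-expression can be checked with an ordinary Pearson correlation between the cis-linked gene and each trans-linked gene. The sketch below is only an illustration: the gene names and expression values are made-up toy data, and `pearson` is a helper written here, not a function from the lecture.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical toy profiles across six segregants (not real data).
cis_gene = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
trans_genes = {
    "geneA": [2.1, 3.9, 6.2, 8.0, 9.8, 12.1],  # tracks the cis-linked gene
    "geneB": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],   # little relation to it
}

# Correlation of each trans-linked gene with the cis-linked gene.
coexpr = {name: pearson(cis_gene, prof) for name, prof in trans_genes.items()}
```

A trans-linked gene that is driven through the cis-linked gene should behave like `geneA`, with a correlation close to 1 or -1.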


• A more complicated situation encountered in 1D-trait mapping: loci are detected with only trans-linkages and no cis-linkages.

• In that case, the correlation between the expression profile of a trans-linked gene and that of any gene in or around the eQTL is most likely to be low.

• This suggests that we may use 2D-trait mapping to find out whether there are more subtle dynamic co-expression patterns.


• With 1D-trait mapping, we find altogether 76 genes trans-linked to the marker blocks that contain no cis-linked genes (details in the supplementary materials).

• We focus on spots with more than 3 trans-linked genes.

• Altogether, 7 such linkage spots (corresponding to a total of 44 trans-linked genes) are identified.


• For each linkage spot, we measure the functional enrichment of the trans-linked genes using the GO Term Finder of SGD (http://db.yeastgenome.org/cgi-bin/GO/goTermFinder).

• We find 3 spots with enriched GO term annotations:

• 1. Marker block 391: 8 genes are trans-linked to it, with enriched GO term “ATP metabolic process” (1.97e-7). (To be discussed next)

• 2. Marker block 335: 3 genes are trans-linked to it, with enriched GO term “formate metabolic process” (3.87e-10).

• 3. Marker block 446: 4 genes are trans-linked to it, with enriched GO term “mitochondrial electron transport, ubiquinol to cytochrome c” (0.00041).


• First, by 1D-trait mapping, eight genes functioning in ATP metabolism and aerobic respiration are linked to Chromosome XI: 235.0kb to 252.8kb (marker blocks 390-391).

• HAP4, which encodes a transcription activator of respiratory genes, is found in this locus.

• Genome-wide TF binding data show that Hap4 binds the upstream regions of ATP5, ATP7, and ATP14.

• HAP4 is not cis-linked, since this locus is a cis-null, all-trans linkage spot.

• The correlations in expression between HAP4 and any of the 8 trans-linked genes are quite low.


• To identify possible dynamic co-expression patterns between HAP4 and the eight trans-linked genes, we calculate LA scores, taking

• the expression profile of each of the eight trans-linked genes as X,

• the expression profile of HAP4 as Y,

• and the genotypes of all the 667 marker blocks as Z.

• We look for marker blocks that appear multiple times in the short lists of marker blocks with the best LA scores (the 20 most positive and the 20 most negative).
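The scoring and ranking step can be sketched as follows. This is a simplified illustration, not the lecture's implementation: it assumes the LA score is estimated as the average of the product X·Y·Z after each variable is standardized to mean 0 and variance 1 (Li, 2002, applies a normal-score transformation first, which is omitted here). The function names, marker labels and toy data are hypothetical.

```python
from math import sqrt

def standardize(v):
    """Center to mean 0 and scale to variance 1 (v must not be constant)."""
    n = len(v)
    m = sum(v) / n
    sd = sqrt(sum((a - m) ** 2 for a in v) / n)
    return [(a - m) / sd for a in v]

def la_score(x, y, z):
    """Average of X*Y*Z over samples, after standardizing each variable."""
    xs, ys, zs = standardize(x), standardize(y), standardize(z)
    return sum(a * b * c for a, b, c in zip(xs, ys, zs)) / len(xs)

def rank_markers(x, y, genotypes, top=20):
    """Return (most negative, most positive) marker lists by LA score."""
    scores = {m: la_score(x, y, z) for m, z in genotypes.items()}
    ordered = sorted(scores, key=scores.get)
    return ordered[:top], ordered[::-1][:top]

# Toy data: X and Y are anti-correlated when z = 0 and correlated when z = 1.
x = [1, 2, 3, 4, 1, 2, 3, 4]
y = [4, 3, 2, 1, 1, 2, 3, 4]
genotypes = {
    "m1": [0, 0, 0, 0, 1, 1, 1, 1],
    "m2": [1, 1, 1, 1, 0, 0, 0, 0],
    "m3": [0, 1, 0, 1, 0, 1, 0, 1],
}
most_negative, most_positive = rank_markers(x, y, genotypes, top=1)
```

Running this once per trans-linked gene (each as X, with HAP4 as Y) and counting how often a marker block recurs in the short lists mirrors the procedure described above.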


We find that one marker block, marker block 41 (Chromosome II: 328.5kb to 334.0kb), appears six times (Table 6) among the 20 marker blocks with the most negative LA scores.

We further find that HAP4 co-expresses well with these genes if the sequence of marker block 41 is inherited from the RM strain.
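One way to see such genotype-dependent co-expression is to compute the correlation separately within each genotype group. The sketch below uses toy values; the label "BY" for the other parental strain and the function names are assumptions made for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def corr_by_genotype(x, y, genotype):
    """Correlation of x and y within each genotype label."""
    result = {}
    for label in sorted(set(genotype)):
        idx = [i for i, g in enumerate(genotype) if g == label]
        result[label] = pearson([x[i] for i in idx], [y[i] for i in idx])
    return result

# Toy profiles for a regulator and one target gene (not real data).
regulator = [1, 2, 3, 4, 1, 2, 3, 4]
target = [4, 3, 2, 1, 1, 2, 3, 4]
marker = ["BY", "BY", "BY", "BY", "RM", "RM", "RM", "RM"]

overall = pearson(regulator, target)
per_group = corr_by_genotype(regulator, target, marker)
```

In this toy case the overall correlation is zero, while the two genes are perfectly co-expressed within the RM group, which is the kind of pattern a low 1D correlation can hide.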


• Using 1D-trait mapping, a gene, TCM62, is found to be cis-linked to this marker block.

• It is known that Tcm62 forms a complex containing at least three SDH subunits, Sdh1, Sdh2 and Sdh3 (Dibrov et al., 1998), and all these SDH genes are involved in aerobic respiration (Oyedotun and Lemire, 2004),

• which is consistent with the function of HAP4 and its target genes.

• Thus marker block 41, or more specifically, gene TCM62 is a plausible candidate that mediates the co-expression pattern between HAP4 and its target genes.


References

• Sun, W., S. Yuan and K-C. Li. Trait-trait dynamic interaction: 2D-trait eQTL mapping for genetic variation study. BMC Genomics 2008; 9: 242. doi:10.1186/1471-2164-9-242

• Li, K-C. Genome-wide co-expression dynamics: theory and application. Proc. Natl. Acad. Sci. USA 2002; 99: 16875-16880.

• Li, K-C. and S. Yuan. A functional genomic study on NCI’s anticancer drug screen. The Pharmacogenomics Journal 2004; 4: 127-135.

• Li, K-C., C-T. Liu, W. Sun, S. Yuan and T. Yu. A system for enhancing genome-wide co-expression dynamics study. Proc. Natl. Acad. Sci. USA 2004.


http://www.geneontology.org/index.shtml

An Introduction to the Gene Ontology (GO)

The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism.


http://www.geneontology.org/index.shtml

• Search the Gene Ontology Database: search for genes, proteins or GO terms (by gene or protein name, or by GO term or ID) using AmiGO. AmiGO is the official GO browser and search engine. Browse the Gene Ontology with AmiGO.


What does the Gene Ontology Consortium do?

• Biologists currently waste a lot of time and effort in searching for all of the available information about each small area of research.

• This is hampered further by the wide variations in terminology that may be common usage at any given time, which inhibit effective searching by both computers and people.

• For example, if you were searching for new targets for antibiotics, you might want to find all the gene products that are involved in bacterial protein synthesis, and that have significantly different sequences or structures from those in humans. If one database describes these molecules as being involved in 'translation', whereas another uses the phrase 'protein synthesis', it will be difficult for you - and even harder for a computer - to find functionally equivalent terms.

• The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases.

• The project began as a collaboration between three model organism databases, FlyBase (Drosophila), the Saccharomyces Genome Database (SGD) and the Mouse Genome Database (MGD), in 1998.

• Since then, the GO Consortium has grown to include many databases, including several of the world's major repositories for plant, animal and microbial genomes. See the GO Consortium page for a full list of member organizations.


• The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner.

• There are three separate aspects to this effort:

• first, the development and maintenance of the ontologies themselves;

• second, the annotation of gene products, which entails making associations between the ontologies and the genes and gene products in the collaborating databases;

• and third, development of tools that facilitate the creation, maintenance and use of ontologies.

• The use of GO terms by collaborating databases facilitates uniform queries across them.

• The controlled vocabularies are structured so that they can be queried at different levels:

• for example, you can use GO to find all the gene products in the mouse genome that are involved in signal transduction,

• or you can zoom in on all the receptor tyrosine kinases.

• This structure also allows annotators to assign properties to genes or gene products at different levels, depending on the depth of knowledge about that entity.


Terms in the Gene Ontology

• The building blocks of the Gene Ontology are the terms, so what makes up a GO term?

• Each entry in GO has a unique numerical identifier of the form GO:nnnnnnn, and a term name, e.g. cell, fibroblast growth factor receptor binding or signal transduction.

• Each term is also assigned to one of the three ontologies, molecular function, cellular component or biological process.

• The majority of terms have a textual definition, with references stating the source of the definition.

• If any clarification of the definition or remarks about term usage are required, these are held in a separate comments field.

• Many GO terms have synonyms; GO uses 'synonym' in a loose sense, as the names within the synonyms field may not mean exactly the same as the term they are attached to. Instead, a GO synonym may be broader or narrower than the term string; it may be a related phrase; it may be alternative wording, spelling or use a different system of nomenclature; or it may be a true synonym. This flexibility allows GO synonyms to serve as valuable search aids, as well as being useful for applications such as text mining and semantic matching. The relationship of the synonym to the term is recorded within the GO file.


• The scope of the Gene Ontology overlaps with a number of other databases, and in cases where a GO term is identical in meaning to an object in another database, a database cross reference is added to the term. These cross references can also be downloaded from the mappings to GO page.

• Species-specific terms: The Gene Ontology aims to provide a controlled vocabulary that can be used to describe any organism; nevertheless, many functions, processes and components are not common to all life forms. The convention is to include any term that can apply to more than one taxonomic class of organism. To specify the class of organisms to which a term is applicable, GO uses the designator sensu, 'in the sense of'; for example, trichome differentiation (sensu Magnoliophyta) represents the differentiation of plant hair cells (trichomes).

• Obsolete terms: Occasionally, a term is found that is outside the scope of GO, is misleadingly named or defined, or describes a concept that would be better represented in another way. Rather than delete the term, it is deprecated or made obsolete. The term and ID still exist in the GO database, but the term is marked as obsolete, and a comment added, giving a reason for the obsoletion and recommending alternative terms where appropriate.


The Ontologies

• The three organizing principles of GO are cellular component, biological process and molecular function. A gene product might be associated with or located in one or more cellular components; it is active in one or more biological processes, during which it performs one or more molecular functions. For example, the gene product cytochrome c can be described by the molecular function term oxidoreductase activity, the biological process terms oxidative phosphorylation and induction of cell death, and the cellular component terms mitochondrial matrix and mitochondrial inner membrane.


Cellular component

• A cellular component is just that, a component of a cell, but with the proviso that it is part of some larger object; this may be an anatomical structure (e.g. rough endoplasmic reticulum or nucleus) or a gene product group (e.g. ribosome, proteasome or a protein dimer). See the documentation on the cellular component ontology for more details.


Biological process

• A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions. Examples of broad biological process terms are cellular physiological process or signal transduction.

• Examples of more specific terms are pyrimidine metabolism or alpha-glucoside transport.

• It can be difficult to distinguish between a biological process and a molecular function, but the general rule is that a process must have more than one distinct step.

• A biological process is not equivalent to a pathway; at present, GO does not try to represent the dynamics or dependencies that would be required to fully describe a pathway.

• Further information can be found in the process ontology documentation.


Molecular function

• Molecular function describes activities, such as catalytic or binding activities, that occur at the molecular level.

• GO molecular function terms represent activities rather than the entities (molecules or complexes) that perform the actions, and do not specify where or when, or in what context, the action takes place.

• Molecular functions generally correspond to activities that can be performed by individual gene products, but some activities are performed by assembled complexes of gene products.

• Examples of broad functional terms are catalytic activity, transporter activity, or binding;

• examples of narrower functional terms are adenylate cyclase activity or Toll receptor binding.

• It is easy to confuse a gene product name with its molecular function, and for that reason many GO molecular functions are appended with the word "activity". The documentation on gene products explains this confusion in more depth. The documentation on the function ontology explains more about GO functions and the rules governing them.


Ontology structure

• The terms in an ontology are linked by two relationships, is_a and part_of.

• is_a is a simple class-subclass relationship, where A is_a B means that A is a subclass of B; for example, nuclear chromosome is_a chromosome.

• part_of is slightly more complex; C part_of D means that whenever C is present, it is always a part of D, but C does not always have to be present. An example would be nucleus part_of cell; nuclei are always part of a cell, but not all cells have nuclei.

• The ontologies are structured as directed acyclic graphs, which are similar to hierarchies but differ in that a child, or more specialized, term can have many parents, or less specialized, terms.

• For example, the biological process term hexose biosynthesis has two parents, hexose metabolism and monosaccharide biosynthesis. This is because biosynthesis is a subtype of metabolism, and a hexose is a type of monosaccharide.

• When any gene involved in hexose biosynthesis is annotated to this term, it is automatically annotated to both hexose metabolism and monosaccharide biosynthesis, because every GO term must obey the true path rule: if the child term describes the gene product, then all its parent terms must also apply to that gene product.
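The true path rule amounts to propagating each annotation up the directed acyclic graph. The sketch below is a minimal illustration with a hand-written parent table; the grandparent term "monosaccharide metabolism" and the gene name are assumptions added to make the example complete, and real GO files carry many more terms and relationship types.

```python
# Child term -> list of parent terms (a tiny hand-made fragment of a DAG).
parents = {
    "hexose biosynthesis": ["hexose metabolism", "monosaccharide biosynthesis"],
    "hexose metabolism": ["monosaccharide metabolism"],        # assumed link
    "monosaccharide biosynthesis": ["monosaccharide metabolism"],  # assumed link
    "monosaccharide metabolism": [],
}

def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen, stack = set(), list(parents.get(term, []))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, []))
    return seen

def propagate(direct_annotations, parents):
    """Apply the true path rule: add every ancestor of each direct term."""
    full = {}
    for gene, terms in direct_annotations.items():
        expanded = set(terms)
        for t in terms:
            expanded |= ancestors(t, parents)
        full[gene] = expanded
    return full

# Hypothetical gene annotated directly to the most specific term only.
annotations = propagate({"geneX": ["hexose biosynthesis"]}, parents)
```

After propagation, `geneX` carries both parents named in the text, even though it was annotated only to hexose biosynthesis.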


What GO is NOT

• It is important to clearly state the scope of GO, and what it does and does not cover. The ontologies section explains the domains covered by GO; the following areas are outside the scope of GO, and terms in these domains would not appear in the ontologies.

• Gene products: e.g. cytochrome c is not in the ontologies, but attributes of cytochrome c, such as oxidoreductase activity, are.

• Processes, functions or components that are unique to mutants or diseases: e.g. oncogenesis is not a valid GO term because causing cancer is not the normal function of any gene.

• Attributes of sequence such as intron/exon parameters: these are not attributes of gene products and will be described in a separate sequence ontology (see the OBO website for more information).

• Protein domains or structural features.

• Protein-protein interactions.

• Environment, evolution and expression.

• Anatomical or histological features above the level of cellular components, including cell types.

• GO is not a database of gene sequences, nor a catalog of gene products. Rather, GO describes how gene products behave in a cellular context.

• GO is not a dictated standard, mandating nomenclature across databases. Groups participate because of self-interest, and cooperate to arrive at a consensus.

• GO is not a way to unify biological databases (i.e. GO is not a 'federated solution'). Sharing vocabulary is a step towards unification, but is not, in itself, sufficient. Reasons for this include the following:

• Knowledge changes and updates lag behind.

• Individual curators evaluate data differently. While we can agree to use the word 'kinase', we must also agree to support this by stating how and why we use 'kinase', and consistently apply it. Only in this way can we hope to compare gene products and determine whether they are related.

• GO does not attempt to describe every aspect of biology; its scope is limited to the domains described above.

Annotation and tools

• How do the terms in GO become associated with their appropriate gene products? Collaborating databases annotate their genes or gene products with GO terms, providing references and indicating what kind of evidence is available to support the annotations. More information can be found in the GO Annotation Guide.

• If you browse any of the contributing databases, you'll find that each gene or gene product has a list of associated GO terms. Each database also publishes downloadable files containing these associations; these can be downloaded from the GO annotations page.

• You can browse the ontologies using a range of web-based browsers. A full list of these, and other tools for analyzing gene function using GO, is available in the GO Tools section.

• In addition, the GO consortium has prepared GO slims, 'slimmed down' versions of the ontologies that allow you to annotate genomes or sets of gene products to gain a high-level view of gene functions. Using GO slims you can, for example, work out what proportion of a genome is involved in signal transduction, biosynthesis or reproduction. See the GO Slim Guide for more information.


Beyond GO

• GO allows us to annotate genes and their products with a limited set of attributes. For example, GO does not allow us to describe genes in terms of which cells or tissues they're expressed in, which developmental stages they're expressed at, or their involvement in disease. It is not necessary for GO to do these things because other ontologies are being developed for these purposes. The GO consortium supports the development of other ontologies and makes its tools for editing and curating ontologies freely available. A list of freely available ontologies that are relevant to genomics and proteomics and are structured similarly to GO can be found at the Open Biomedical Ontologies website. A larger list, which includes the ontologies listed at OBO and also other controlled vocabularies that do not fulfill the OBO criteria, is available at the Ontology Working Group section of the Microarray Gene Expression Data (MGED) Network site.


Download

• All data from the GO project is freely available. You can download the ontology data in a number of different formats, including XML and MySQL, from the GO Downloads page. For more information on the syntax of these formats, see the GO File Format Guide.

• If you need lists of the genes or gene products that have been associated with a particular GO term, the Current Annotations table tracks the number of annotations and provides links to the gene association files for each of the collaborating databases.


GO term enrichment

• Hypergeometric

• SGD: example using LA output
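A hypergeometric enrichment p-value is the probability of seeing at least k annotated genes in a list of n genes, when K of the N background genes carry the GO term. The sketch below computes that upper tail directly with the standard library; the function name and the example counts are hypothetical, and tools such as the SGD GO Term Finder may add multiple-testing corrections not shown here.

```python
from math import comb

def hypergeom_enrichment_p(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N genes, K annotated, n drawn)."""
    upper = min(K, n)
    total = comb(N, n)
    # comb() returns 0 when the requested subset is larger than the pool.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Hypothetical counts: 6000 background genes, 30 annotated to a term,
# and 5 of the 8 genes in a linkage spot carry that term.
p_value = hypergeom_enrichment_p(N=6000, K=30, n=8, k=5)
```

A small p-value means the term appears among the listed genes more often than random sampling from the background would explain.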


Transitive functional annotation by shortest-path analysis of gene expression data

Xianghong Zhou, Ming-Chih J. Kao, and Wing Hung Wong

PNAS, October 1, 2002, vol. 99, no. 20, 12783-12788

• Fig. 1. (A) Application of the shortest-path (SP) algorithm to gene expression data. Nine genes are depicted in the graph. The distance between two genes is a decreasing function of their correlation. For example, there are multiple expression dependence paths leading from gene a to gene e. Among them, the shortest dependence path is a-b-c-d-e, with genes b, c, and d serving as the transitive genes. This is the most parsimonious summary of the expression relationship between the terminal genes a and e. (B) Level 0 (L0) and level 1 (L1) matches of genes on the SP a-b-c-d-e defined according to their relationships in the Gene Ontology (GO) classification tree. With respect to the terminal genes a and e, the transitive gene b is an L0 match because it is annotated in the informative node where a and e are annotated; the transitive gene c is an L1 match because it shares the same direct parent as the two terminal genes; the transitive gene d is neither an L0 nor an L1 match.


Current methods for the functional analysis of microarray gene expression data make the implicit assumption that genes with similar expression profiles have similar functions in cells. However, among genes involved in the same biological pathway, not all gene pairs show high expression similarity. Here, we propose that transitive expression similarity among genes can be used as an important attribute to link genes of the same biological pathway. Based on large-scale yeast microarray expression data, we use the shortest-path analysis to identify transitive genes between two given genes from the same biological process. We find that not only functionally related genes with correlated expression profiles are identified but also those without. In the latter case, we compare our method to hierarchical clustering, and show that our method can reveal functional relationships among genes in a more precise manner. Finally, we show that our method can be used to reliably predict the function of unknown genes from known genes lying on the same shortest path. We assigned functions for 146 yeast genes that are considered as unknown by the Saccharomyces Genome Database and by the Yeast Proteome Database. These genes constitute around 5% of the unknown yeast ORFome.


Data Processing

• We used Saccharomyces cerevisiae gene expression profiles from the Rosetta Compendium (6), which includes 300 deletion and drug treatment experiments. Genes were annotated by using the biological process ontology of Gene Ontology (GO) (7) provided by the Saccharomyces Genome Database (SGD) (8).

• After removing the genes without GO process annotation and the 20 genes for which there are fewer than 80 experimental measurements in the Rosetta Compendium, we were left with 266 mitochondrial, 398 cytoplasmic, and 659 nuclear GO-annotated genes.

• For each of the three sets of genes, we calculated the expression similarities of all gene pairs {a, b} using $C_{a,b}$, the minimum of the absolute values of the leave-one-out Pearson correlation coefficient estimates. This estimate is robust against single-experiment outliers and sensitive to overall similarities in expression patterns.
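The leave-one-out similarity described above can be sketched as follows: drop each experiment in turn, recompute the Pearson correlation, and keep the smallest absolute value. The function names and toy profiles are illustrative assumptions, and missing measurements are not handled.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def loo_min_abs_corr(x, y):
    """Minimum over i of |corr(x, y)| computed with experiment i removed."""
    estimates = []
    for i in range(len(x)):
        xi = x[:i] + x[i + 1:]
        yi = y[:i] + y[i + 1:]
        estimates.append(abs(pearson(xi, yi)))
    return min(estimates)

# Toy profiles: the last experiment is an outlier that inflates the correlation.
gene_a = [1, 2, 3, 4, 5, 20]
gene_b = [2, 1, 4, 3, 5, 20]

full_corr = pearson(gene_a, gene_b)
robust_corr = loo_min_abs_corr(gene_a, gene_b)
```

Here the full correlation is driven by one extreme experiment, while the leave-one-out minimum reports the weaker similarity that remains once it is removed.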


Graph Construction and SP Computation.

• We constructed three graphs, one for each set of the 266 mitochondrial genes, the 398 cytoplasmic genes, and the 659 nuclear genes. In each graph, two genes were assigned an edge if their absolute expression correlation $C_{a,b}$ was higher than a cutoff of 0.6.

• This cutoff, while conservative, nonetheless retains a sufficient number of connected gene pairs in the graph. The edge length between vertices a and b is $d_{a,b} = f(C_{a,b}) = (1 - C_{a,b})^k$. The powering factor k is used to enhance the differences between low and high correlations. Because the length of a path is the sum of the individual edge lengths, by exaggerating the differences between edge lengths, the SPs will be more likely to cover more transitive genes. Thus by increasing k we gain more power to reveal transitive co-expression. We set k = 6 because for $k \geq 6$ the number of transitive genes stabilizes (detailed results at www.biostat.harvard.edu/complab/SP/). To ensure the quality of SPs, we consider only SPs with total path lengths <0.008.
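The graph construction and shortest-path search can be sketched with Dijkstra's algorithm. The cutoff 0.6, the power k = 6 and the 0.008 length limit come from the text; the choice of Dijkstra, the function names and the toy correlations are assumptions made for this illustration.

```python
import heapq

def build_graph(corr, cutoff=0.6, k=6):
    """Keep gene pairs with correlation above cutoff; edge length (1 - C)^k."""
    graph = {}
    for (a, b), c in corr.items():
        if c > cutoff:
            d = (1.0 - c) ** k
            graph.setdefault(a, {})[b] = d
            graph.setdefault(b, {})[a] = d
    return graph

def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns (path, length) or (None, inf)."""
    dist, prev, done = {source: 0.0}, {}, set()
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]

# Toy absolute correlations between hypothetical genes a, b, c, d.
corr = {("a", "b"): 0.9, ("b", "c"): 0.9, ("a", "c"): 0.65, ("c", "d"): 0.5}
graph = build_graph(corr)
path, length = shortest_path(graph, "a", "c")
keep = length < 0.008  # quality filter on total path length
```

Because of the sixth power, the two strong edges a-b and b-c together are far shorter than the single weak edge a-c, so b is recovered as a transitive gene between a and c.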


Predicting the Functions of Unknown Genes.

• We use the SP method to classify previously unannotated yeast genes by adding the 3,255 ORFs unknown to SGD into the graphs of known genes in the mitochondrial, cytoplasmic, and nuclear compartments.

• As before, an edge is constructed between two genes if their absolute expression correlation is higher than 0.6. 

• For all pairs of known genes, we determine the SPs connecting them. For the purpose of functional prediction, we would like to assign a putative function that is as specific as possible to the gene. Given all known genes on a SP, we achieve this by tracing back their annotations along the GO process tree and finding their lowest common ancestor.

• If the lowest ancestral node is at least 4 levels below the root of the GO tree, that is, it defines a sufficiently specific gene function, we then assign this function to the unknown genes on the SP.

• Analogous to the L0 and L1 matches, here the L0 prediction then corresponds to the lowest common ancestor, and the L1 prediction to its direct parent. In this way, the function

represented by the lowest common ancestor can be more specific than that defined by the informative nodes.

• . For each predicted gene function, we provide both the number of support SPs from which the prediction was derived and the number of unique known genes on those support SPs (support

genes). The more support genes there are, the more confidence we have in the corresponding prediction.

• Note that a gene can be assigned putative functions in multiple graphs, because many genes are known to function in multiple cellular compartments.

• Under two circumstances an unknown gene may be assigned multiple functions: (i) Because known genes on a SP may each have multiple functions, they may share several lowest common ancestors in the GO tree. (ii) An unknown gene may reside in different SPs with different lowest common ancestors.
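The lowest-common-ancestor rule can be illustrated with a small sketch. It assumes a simplified tree in which every term has a single parent (the real GO is a directed acyclic graph), and the term names and helper functions are hypothetical.

```python
def ancestors(term, parent):
    # Chain from the term up to the root, term first.
    chain = [term]
    while term in parent:
        term = parent[term]
        chain.append(term)
    return chain

def lowest_common_ancestor(terms, parent):
    chains = [ancestors(t, parent) for t in terms]
    common = set(chains[0]).intersection(*chains[1:])
    # Walking up from the first term, the first shared node is the lowest.
    for node in chains[0]:
        if node in common:
            return node
    return None

def depth(term, parent):
    # Number of levels below the root.
    return len(ancestors(term, parent)) - 1

def predict(known_annotations, parent, min_depth=4):
    # Assign a function only if the LCA is at least min_depth levels
    # below the root, i.e. it is specific enough.
    lca = lowest_common_ancestor(known_annotations, parent)
    if lca is None or depth(lca, parent) < min_depth:
        return None
    return {"L0": lca, "L1": parent.get(lca)}

# Toy tree (invented): root > a > b > c > d, with leaves e and f under d.
parent = {"a": "root", "b": "a", "c": "b", "d": "c", "e": "d", "f": "d"}
specific = predict(["e", "f"], parent)    # LCA is d, four levels down
too_general = predict(["e", "b"], parent) # LCA is b, only two levels down
```

Here the annotations of the known genes on one SP are passed in as a list; an unknown gene on that SP would receive the L0 term, with the L1 term as the more general fallback.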

Page 44

Study of coordinative gene expression at the biological process level

Tianwei Yu, Wei Sun, Shinsheng Yuan and Ker-Chau Li

Bioinformatics 2005 21(18):3651-3657

• Motivation: Cellular processes are not isolated groups of events. Nevertheless, in most microarray analyses, they tend to be treated as standalone units. To shed light on how various parts of the interlocked biological processes are coordinated at the transcription level, there is a need to study the between-unit expressional relationship directly.

• Results: We approach this issue by constructing an index of correlation function to convey the global pattern of coexpression between genes from one process and genes from the entire genome. Processes with similar signatures are then identified and projected to a process-to-process association graph. This top–down method allows for detailed gene-level analysis between linked processes to follow up. Using the cell-cycle gene-expression profiles for Saccharomyces cerevisiae, we report well-organized networks of biological processes that would be difficult to find otherwise. Using another dataset, we report a sharply different network structure featuring cellular responses under environmental stress.

Page 45

Strategy of the study

• Arrow a: select biological processes from the gene ontology system using a scheme described in Supplementary Figure 7.

• Arrow b: compute correlations from large scale microarray data.

• Arrow c: find gene level linkages between processes (this step may be skipped).

• Arrow d: GIOC functions are established for each process.

• Arrow e: use similarity between GIOC functions to measure the degree of expressional association between processes.

• Arrow f: determine the significance of process association by randomization test.

• Arrow g: connect associated processes and project the results as a graph.

Page 46

Genome-wide index of correlation

• For each GO term H, we create a probability function to serve as its GIOC.

• Denote the collection of all yeast genes present in the gene-expression dataset as G.

• For each gene profile x_i in G, we first evaluate its correlation with every gene profile y_j in H.

• The highest correlation, c_i = max_j corr(x_i, y_j), where the maximum is taken over all genes in H, indicates the level of interaction between gene i and term H.

• Using the clustering analysis terminology, this corresponds to the single linkage distance measure between x_i and all genes in term H.

• We then convert c_i into an index of correlation by a power function transformation.

• Assign each gene i in G a probability mass p_i ∝ (1 + c_i)^6. Here the proportionality constant can be determined by setting the total probability mass equal to 1.

• The resulting probability function P_H(x_i) = p_i, i = 1,...,n, is called the GIOC function for term H.
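The GIOC construction can be sketched with NumPy as follows. The function name `gioc`, the array layout (genes in rows, conditions in columns) and the random toy data are assumptions made for illustration only.

```python
import numpy as np

def gioc(expr_all, expr_term, power=6):
    # expr_all:  (n_genes, n_conditions) profiles of all genes in G
    # expr_term: (m_genes, n_conditions) profiles of the genes in term H
    n = expr_all.shape[0]
    # Cross-correlation block between every gene in G and every gene in H.
    corr = np.corrcoef(expr_all, expr_term)[:n, n:]
    c = corr.max(axis=1)          # c_i = max_j corr(x_i, y_j)
    mass = (1.0 + c) ** power     # power-function transformation
    return mass / mass.sum()      # normalise so the masses sum to 1

# Toy data (random, for illustration): 50 genes, 20 conditions,
# with the first 5 genes playing the role of term H.
rng = np.random.default_rng(0)
G = rng.normal(size=(50, 20))
H = G[:5]
p = gioc(G, H)
```

Because each gene in H correlates perfectly with itself, the term's own genes receive the largest mass, and genes elsewhere in the genome are weighted by how well they track their best match in H.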

Page 47

GO term expressional association measure

• The degree of expressional association between two GO terms H1 and H2 is determined by how similar their GIOC functions are. We use K–L divergence between probability measures to quantify the distance:

$$KL(H_1, H_2) = \sum_{i=1}^{n} P_{H_1}(x_i) \log \frac{P_{H_1}(x_i)}{P_{H_2}(x_i)}$$
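A minimal numerical sketch of this distance, assuming the two GIOC functions are given as arrays of strictly positive masses over the same genes (the function name `kl_divergence` and the toy vectors are illustrative):

```python
import numpy as np

def kl_divergence(p, q):
    # Sum over genes of p_i * log(p_i / q_i); assumes all masses are > 0,
    # which holds for GIOC masses whenever every c_i is above -1.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Toy probability vectors (invented).
p = [0.5, 0.5]
q = [0.9, 0.1]
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
```

The measure is not symmetric (d_pq and d_qp differ), so the order of the two terms matters unless the two directions are combined in some way; the slides do not say whether that is done.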

Page 48

Randomization test of significance

• We first specify the null hypothesis.

• Suppose there are n genes in term H1, and m genes in term H2. To incorporate the case that there may be genes that are annotated to both terms, we further assume that there are r overlap genes.

• Under the null hypothesis of no association between two terms, the m + n – r gene-expression profiles for these two terms should behave as if they were randomly drawn from the entire gene-expression database.

• To find the null distribution of the K–L distance, we use the Monte Carlo method.

• Draw n+m–r profiles randomly from the collection of all gene profiles. We use the first n of them to form one term and the last m of them to form the second term. This naturally leads to r overlaps between the two terms.

• Compute the K–L distance between these two artificially created terms.

• This procedure is iterated many times to yield an approximation of the distribution of K–L distance.

• Once the null distribution is available, we can call a pair of GO terms significantly associated if their K–L distance is shorter than a cutoff percentile.
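The Monte Carlo procedure can be sketched as below. All function names, the random toy data, the number of iterations and the 0.025 cutoff level are assumptions for illustration; the GIOC and K–L helpers are repeated so that the block runs on its own.

```python
import numpy as np

def gioc_from_corr(corr, term_idx, power=6):
    # GIOC masses from a precomputed gene-by-gene correlation matrix.
    c = corr[:, term_idx].max(axis=1)
    mass = (1.0 + c) ** power
    return mass / mass.sum()

def kl(p, q):
    # Assumes strictly positive masses.
    return float(np.sum(p * np.log(p / q)))

def null_kl_distances(expr_all, n, m, r, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    corr = np.corrcoef(expr_all)
    total = expr_all.shape[0]
    out = np.empty(n_iter)
    for it in range(n_iter):
        # Draw n + m - r distinct genes; the first n and the last m
        # then share exactly r genes.
        drawn = rng.choice(total, size=n + m - r, replace=False)
        term1 = drawn[:n]
        term2 = drawn[-m:]
        out[it] = kl(gioc_from_corr(corr, term1),
                     gioc_from_corr(corr, term2))
    return out

def is_significant(observed, null, level=0.025):
    # Significant if the observed distance is shorter than the chosen
    # lower percentile of the null distribution.
    return bool(observed < np.quantile(null, level))

# Toy data (random): 60 genes, 15 conditions.
rng = np.random.default_rng(1)
expr = rng.normal(size=(60, 15))
null = null_kl_distances(expr, n=5, m=6, r=2, n_iter=50)
```

In practice the number of iterations would need to be large enough for the lower tail of the null distribution to be estimated stably.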

Page 49

Selecting GO terms to represent biological processes

• Use the ‘biological process’ ontology for Saccharomyces cerevisiae.

• The GO system forms a directed acyclic graph.

• Construct a representative set of GO terms that do not have ancestor–descendent relationships.

• This is because the analysis of a full size GO, which contains both ancestor–descendent and sibling relationships, involves too much complexity and redundancy to yield easily interpretable results.

• Use computer search to gain objectiveness.

• Our program traverses the entire ‘biological process’ branch of GO from top to bottom (Supplementary Figure 5).

• A couple of parameters are optimized to reach the dual aim of choosing terms as close to the bottom level as possible, and covering as many genes as possible.

• The result is a collection C of 214 parallel terms.

• This representative list is at a scale finer than ‘GO slims’ (Ashburner et al., 2000; Dwight et al., 2002). The distribution of the number of genes in the selected terms is shown in Supplementary Figure 6.

Page 50

Within-GO term and between-GO term correlation structures

• In order to find a proper measure of the expression association between two GO terms, we first study how gene-expression profiles within a GO term are correlated. We created an on-line GO term computation page (a module in http://kiefer.stat.ucla.edu/lap2) to facilitate the investigation.

• Given a pair of terms X and Y, the system computes gene-level correlations within each term and between the two terms. Subject to a user-specified size limit, the system also searches the entire genome for two lists of highest co-expressed genes, one for each term.

• These two lists are then linked to the GO Term Finder of SGD to identify enriched functional groups.

Page 51

Not all genes from the same GO term are tightly coexpressed.

• On the contrary, the correlations within the majority of the terms we investigate are low (Supplementary Figure 7);

• e.g. the range is between –0.50 and 0.47 for ‘actin cortical patch assembly’ (14 genes),

• between –0.59 and 0.80 (median 0.03) for ‘axial budding’ (21 genes),

• and between –0.18 and 0.43 (median 0.19) for ‘NAD biosynthesis’ (6 genes).

• The correlations are much higher for terms involving translation mechanism, e.g. from –0.16 to 0.85 (median 0.53) for ‘ribosomal large subunit biogenesis’ (14 genes).

Page 52

Yeast uses multiple intracellular or extracellular cues in regulating the resources devoted to a functional module.

• Despite the low average correlation within a GO term, each term has many strongly correlated genes elsewhere in the genome;

• but these genes are not highly correlated among themselves, and their cellular roles are diverse.

• For instance, when we submit the top 200 genes which have the best correlations (all >0.57) with ‘NAD biosynthesis’ to GO Term Finder, no more than one-quarter of them fall into functionally enriched groups, the most visible ones being ‘catabolism’ (27 genes), ‘protein folding’ (10 genes) and ‘regulation of protein metabolism’ (5 genes).

Page 53

These preliminary findings argue for the merit of considering the GIOC function.

• Our aim is to find a higher order organization among a diverse list of biological processes.

• Therefore, in quantifying the degree of expressional association between a pair of GO terms, we should not isolate the genes in the term pair from the rest of the genome.

• The information from genes outside of the two GO terms must be integrated first.

Page 54

Expressional association in cell cycle data

• We find a total of 202 GO-term associations significant at level 0.025.

• Component A: cell-cycle mechanism.

• Component B: coherent operation within the translation mechanism.

• Component C: features the protein transport mechanism.

• Component D: an extensively connected network of metabolic processes including four major categories: coenzyme metabolism, amino acid/lipid metabolism, small molecule transport/homeostasis and polysaccharide metabolism/energy generation.

Page 55

Environmental stress gene expression data

• Fewer connections are found.

• Section A features yeast's characteristic responses under stress.

• Section B features a cluster of ribosome/protein synthesis terms, together with a group of closely related metabolic terms.

Page 56

GO-graph distance and expressional association.

(a) Boxplots showing the relationship between GO-graph distances and K–L distances.

(b) Proportion of expressionally associated pairs versus GO-graph distance. The GO-graph distance between two terms is the length of the shortest path between them, considering all edges as bi-directional. The K–L distances were computed from cell-cycle data.
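The GO-graph distance defined in the caption can be computed with a breadth-first search over the undirected version of the graph. The sketch below uses an invented toy hierarchy and a hypothetical function name.

```python
from collections import deque

def go_graph_distance(edges, start, goal):
    # edges: (child, parent) pairs; every edge is treated as bi-directional.
    adj = {}
    for child, parent in edges:
        adj.setdefault(child, set()).add(parent)
        adj.setdefault(parent, set()).add(child)
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt == goal:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # the two terms are not connected

# Toy hierarchy (invented): b and c are children of a; d is under b, e under c.
edges = [("b", "a"), ("c", "a"), ("d", "b"), ("e", "c")]
siblings = go_graph_distance(edges, "b", "c")
cousins = go_graph_distance(edges, "d", "e")
```

Under this definition two sibling terms are at distance 2, since the path goes up to the shared parent and back down.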

Page 57

Further discussion

• We find two possible scenarios for a pair of terms to be linked by our expressional measure:

• (1) by tight coexpression between their genes directly;

• (2) by their shared co-expressed genes elsewhere in the genome.

• Ribosome and translation related genes are known to be under tight cellular control. As expected, both the within-term and the between-term correlations in component B of Figure 2 are high.

Page 58

• In contrast, we find both the within-term and the between-term correlations in component D are much lower (Supplementary Figure 12). This indicates that multiple intracellular cues have been utilized to ensure the proper flow of metabolites across a variety of metabolic processes.

Page 59

An example of the first scenario

• In Supplementary Figure 13a, the expression profiles for genes in the pair ‘rRNA modification’ and ‘ribosomal large subunit biogenesis’ (both 14 genes; no overlap) are compared by hierarchical clustering. Many cross-term neighbors are observed.

Page 60

An example of the second scenario

• Revisit the term ‘NAD biosynthesis’ in component D of Figure 2.

• Because NAD is one of the key coenzymes involved in multiple metabolic pathways, the level of NAD and the NAD/NADH ratio are crucial for maintaining well-regulated metabolism.

• Reflecting this important physiological relationship, our method finds a direct link between ‘NAD biosynthesis’ and ‘NADH metabolism’ (6 and 7 genes respectively; no overlap).

• In order to identify the source of the link, we find the coexpressed genes for each term.

• There are 463 genes that have correlations of >0.5 with ‘NAD biosynthesis’, and 363 genes with ‘NADH metabolism’. The two groups share 117 genes.

• These 117 genes serve as the bridges that link the two terms. However, there are only two cross-term correlations >0.5.

• We note that the two terms share an ancestor ‘nicotinamide metabolism’. Among the 13 genes that are annotated to this ancestor but not to the two NAD terms, 11 are in the descendent term ‘NADPH regeneration’ (no overlap with the two NAD terms). However, ‘NADPH regeneration’ is connected to neither of the two terms, and none of its 11 genes serve as a bridge for the two terms.
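The bridge-gene count in this example can be reproduced in outline as follows. This is a sketch under assumptions: a gene is taken to be co-expressed with a term when its correlation with at least one gene of the term exceeds 0.5, the term's own genes are excluded from its co-expressed list, and the small correlation matrix is hand-made for illustration rather than derived from real profiles.

```python
import numpy as np

def coexpressed_genes(corr, term_idx, cutoff=0.5):
    # Genes outside the term whose best correlation with any term gene
    # exceeds the cutoff (excluding term members is an assumption).
    best = corr[:, term_idx].max(axis=1)
    hits = set(np.flatnonzero(best > cutoff).tolist())
    return hits - set(term_idx)

def bridge_summary(corr, term1, term2, cutoff=0.5):
    co1 = coexpressed_genes(corr, term1, cutoff)
    co2 = coexpressed_genes(corr, term2, cutoff)
    cross = corr[np.ix_(term1, term2)]  # direct gene-to-gene correlations
    return {
        "n_co1": len(co1),
        "n_co2": len(co2),
        "bridges": co1 & co2,
        "n_cross": int((cross > cutoff).sum()),
    }

# Toy matrix (invented): genes 0-1 form term 1, genes 2-3 form term 2,
# gene 4 correlates with both terms, gene 5 with neither.
corr = np.full((6, 6), 0.1)
np.fill_diagonal(corr, 1.0)
corr[0, 4] = corr[4, 0] = 0.7
corr[2, 4] = corr[4, 2] = 0.6
result = bridge_summary(corr, [0, 1], [2, 3])
```

The toy case mirrors the second scenario: there is no direct cross-term correlation above the cutoff, yet the two terms are linked through a shared co-expressed gene.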

Page 61

Another example

• Consider the pair ‘NAD biosynthesis’ and ‘tricarboxylic acid cycle’ (6 and 14 genes respectively; no overlap).

• It is well known that multiple steps in the TCA cycle require NAD (Alberts et al., 2002).

• Our method does find the link between these two terms.

• There are 463 genes that have correlations of >0.5 with ‘NAD biosynthesis’, and 566 genes with ‘tricarboxylic acid cycle’.

• The two groups share 207 genes.

• However, there is only one cross-term correlation >0.5.

• Supplementary Figure 13b and c show how the clustering patterns in these two examples are different from what is seen in Supplementary Figure 13a.
