GOSt a Gene Ontology mining tool Jüri Reimand

Dec 30, 2015

Page 1: GOSt a Gene Ontology mining tool Jüri Reimand

GOSt, a Gene Ontology mining tool

Jüri Reimand

Page 2: GOSt a Gene Ontology mining tool Jüri Reimand

Overview

• Introduction, bioinformatics
• Gene Ontology (GO)
• GOSt, a Gene Ontology mining tool
• Statistics and thresholds
• Ordered gene lists
• Extending GO

Page 3: GOSt a Gene Ontology mining tool Jüri Reimand

Introduction

• Bioinformatics
– Analysis of experimental data
• Genes encode proteins
– Proteins: building blocks of living organisms
– Gene expression: protein production from genetic code
• Microarray experiments measure gene expression
– Thousands of genes simultaneously
– Expression levels over time
– Different biological conditions
– Comparison of healthy and diseased cells

[Figure labels: measures over time; cluster similar profiles]

Page 4: GOSt a Gene Ontology mining tool Jüri Reimand

Introduction

• Biological experiments give large amounts of data
• Groups of similar genes:
– top “most active” genes
– similar expression profiles over time
• Many genes have some available annotations
– Previous knowledge from databases
• How to describe the group as a whole?
– What are the common features?
– Which features are significantly overrepresented?

[Figure labels: “steroid metabolism”, “biosynthesis”, “iron ion binding”]

Page 5: GOSt a Gene Ontology mining tool Jüri Reimand

Gene Ontology (GO)

• GO: Directed Acyclic Graph (DAG)
– Vertices: terms
– Edges: relations between general and specific terms
• Hierarchically structured vocabulary
– 3 DAGs: processes, components, functions
• Annotations to vocabulary terms
– Association between a gene g and a property t (GO term t)
– Based on biological discoveries
– Genes of many genomes are annotated to GO
• Annotation sets: for a fixed organism
– All genes associated with GO term t

Page 6: GOSt a Gene Ontology mining tool Jüri Reimand

GO example

• Graph fragment with some terms related to organ development
• Vocabulary is general to living organisms
• Gene annotations are organism-specific
• True Path Rule: hierarchical annotations

[Figure labels: ENSG00000163217, ENSG00000161202]
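The True Path Rule can be illustrated with a minimal sketch: a gene annotated to a specific term also counts as annotated to every more general term above it in the DAG. The graph fragment and the term assignments below are invented for illustration; only the two gene IDs come from the slide, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy fragment of the GO graph: child term -> parent terms.
PARENTS = {
    "heart development": ["organ development"],
    "organ development": ["development"],
    "development": [],
}

def ancestors(term, parents):
    """Return the term itself plus every more general term reachable from it."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, []))
    return seen

def annotation_sets(direct, parents):
    """Build term -> set of genes, propagating each annotation to all ancestors."""
    sets = defaultdict(set)
    for gene, terms in direct.items():
        for term in terms:
            for t in ancestors(term, parents):
                sets[t].add(gene)
    return dict(sets)

# Gene IDs are from the slide; which term each gene belongs to is an assumption.
DIRECT = {
    "ENSG00000163217": ["heart development"],
    "ENSG00000161202": ["organ development"],
}

SETS = annotation_sets(DIRECT, PARENTS)
```

With this toy input, the most general term collects both genes, while the most specific term keeps only the gene annotated to it directly.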

Page 8: GOSt a Gene Ontology mining tool Jüri Reimand

GOSt – Gene Ontology Statistics

• GO annotations to groups of genes
• Statistical significance of results
• Thresholds for distinguishing significant results
• Analysing ordered lists of genes
• Visualisation methods, WWW interface
• Command line toolset for large-scale analysis

Page 9: GOSt a Gene Ontology mining tool Jüri Reimand

GOSt example

Page 10: GOSt a Gene Ontology mining tool Jüri Reimand

45 mouse genes

338 GO

Page 11: GOSt a Gene Ontology mining tool Jüri Reimand

[Figure labels: Genes, GO terms, P-value, Evidence codes]

Page 12: GOSt a Gene Ontology mining tool Jüri Reimand

Annotations to gene groups

• Result: term t matches query Q

[Figure labels: Gq (query), Gt (GO term, e.g. heart development)]

Page 13: GOSt a Gene Ontology mining tool Jüri Reimand

Statistical significance

• Is intersection Q∩T significant?
• Fisher's one-tailed test
– Cumulative hypergeometric probability
– Probability of getting the observed number of genes or more in intersection Q∩T
– P ( pick k white balls out of K white and N-K black balls )
• Multiple testing
– Every query results in a number of p-values
– Matching GO terms are not independent
– Increased rate of false positive matches
• Which p-values are significant?
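The cumulative hypergeometric probability can be written out as a short sketch. The notation here is an assumption chosen to avoid reusing k: N is the number of genes in the organism, K the size of annotation set T, n the query size, and x the size of the intersection Q∩T. The function name is hypothetical.

```python
from math import comb

def hypergeom_tail(x, n, K, N):
    """One-tailed p-value P(X >= x): probability of seeing x or more annotated
    genes when n genes are drawn without replacement from N genes, K of which
    are annotated to the term."""
    upper = min(n, K)
    total = comb(N, n)
    # Sum the hypergeometric probabilities for overlaps x, x+1, ..., upper.
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(x, upper + 1))
    return hits / total

# Toy numbers: 10 genes, 4 annotated, query of 3 genes, 2 of them annotated.
P_EXAMPLE = hypergeom_tail(2, 3, 4, 10)
```

In the toy case the sum is (6·6 + 4·1) / 120 = 1/3, so an overlap of two genes would not be remarkable here.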

Page 14: GOSt a Gene Ontology mining tool Jüri Reimand

Experimental thresholds

• Simulation experiment
– Fix some gene query size k
– Repeat 1000 times:
  • Generate synthetic query Q with k elements: random subset of organism's genes
  • Observe best p-value p for query Q
  • Store p-value p in P
– Choose p', the 50th smallest p-value from P
– Threshold p': top 5% of p-values for random queries of size k
• Calculate for query lengths k = [1,1000]
• Compare with standard multiple testing corrections
– Bonferroni (1936), Benjamini-Hochberg (1995)
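The simulation experiment can be sketched as follows, under stated assumptions: the gene universe and annotation sets are toy data, the function names are hypothetical, and the run count is reduced to 200 in the example call to keep it fast (the slide uses 1000 runs, where the 5% rank is the 50th smallest value).

```python
import random
from math import comb

def hypergeom_tail(x, n, K, N):
    """One-tailed hypergeometric p-value P(X >= x)."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(x, min(n, K) + 1)) / total

def best_p_value(query, annotation_sets, N):
    """Smallest p-value over all annotation sets that share a gene with the query."""
    best = 1.0
    for genes in annotation_sets.values():
        x = len(query & genes)
        if x:
            best = min(best, hypergeom_tail(x, len(query), len(genes), N))
    return best

def simulated_threshold(all_genes, annotation_sets, k, runs=1000, alpha=0.05, seed=0):
    """Best p-values of `runs` random queries of size k; return the value at the
    alpha quantile (the 50th smallest when runs=1000 and alpha=0.05)."""
    rng = random.Random(seed)
    N = len(all_genes)
    best = sorted(
        best_p_value(set(rng.sample(all_genes, k)), annotation_sets, N)
        for _ in range(runs)
    )
    rank = max(1, round(alpha * runs))
    return best[rank - 1]

# Toy data (invented): 200 genes and 10 annotation sets of growing size.
GENES = [f"g{i}" for i in range(200)]
TOY_SETS = {f"term{j}": set(GENES[j * 10: j * 10 + 5 + 3 * j]) for j in range(10)}

THRESHOLD = simulated_threshold(GENES, TOY_SETS, k=20, runs=200)
```

Repeating the call for each query size k gives one threshold per size, which can then be compared against a Bonferroni or Benjamini-Hochberg cutoff for the same number of terms.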

Page 15: GOSt a Gene Ontology mining tool Jüri Reimand

Analytical thresholds

• Analytical approach to simulated thresholds
– Fix gene query size k
– Observe all sizes and frequencies of GO annotation sets T
– Presume events with different T independent
– Observe possible p-values p with query of k elements
– Always correct p by constant c=0.97 (set dependencies!)
– Find such threshold p' that gives p ~= 0.95
• Repeat for query lengths k = [1,1000]
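One possible reading of the analytical approach is sketched below; it is an interpretation, not the author's implementation. For an annotation set of size K, the chance that a random query reaches a p-value of at most p' equals the largest achievable tail value not exceeding p'. Treating terms as independent, the chance that no term reaches p' is the product over all set sizes, and the threshold is the largest p' that keeps this product at or above 0.95. The correction constant c=0.97 is left out because the slide does not say how it enters the calculation. All names are hypothetical.

```python
from collections import Counter
from math import comb

def tail_values(n, K, N):
    """Achievable one-tailed p-values P(X >= x) for x = 1..min(n, K)."""
    total = comb(N, n)
    pmf = [comb(K, i) * comb(N - K, n - i) / total for i in range(min(n, K) + 1)]
    tails, acc = [], 0.0
    for p in reversed(pmf):      # accumulate from the largest overlap downwards
        acc += p
        tails.append(acc)
    tails.reverse()              # now tails[x] == P(X >= x)
    return tails[1:]             # x = 0 always has p-value 1, so skip it

def prob_no_hit(p_thr, n, size_freq, N):
    """P(no term reaches a p-value <= p_thr), assuming independent terms."""
    prob = 1.0
    for K, freq in size_freq.items():
        reachable = [t for t in tail_values(n, K, N) if t <= p_thr]
        p_hit = max(reachable, default=0.0)
        prob *= (1.0 - p_hit) ** freq
    return prob

def analytical_threshold(n, set_sizes, N, target=0.95):
    """Largest achievable p-value p' with prob_no_hit(p') >= target, or None."""
    size_freq = Counter(set_sizes)
    candidates = sorted({t for K in size_freq for t in tail_values(n, K, N)})
    best = None
    for p_thr in candidates:     # prob_no_hit only decreases as p_thr grows
        if prob_no_hit(p_thr, n, size_freq, N) >= target:
            best = p_thr
        else:
            break
    return best

# Toy input (invented): sizes of seven annotation sets in a 200-gene organism.
SIZES = [5, 5, 8, 12, 12, 12, 30]
THR = analytical_threshold(20, SIZES, 200)
```

Only the sizes and frequencies of the annotation sets are needed, which is why the sketch groups sets with `Counter` instead of looking at their gene content.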

Page 16: GOSt a Gene Ontology mining tool Jüri Reimand

Significance thresholds

Page 20: GOSt a Gene Ontology mining tool Jüri Reimand

Ordered lists of genes

• Gene groups may be ordered
– Interesting gene and few most similar genes
– Top “most active” genes
– Increasing distance from cluster centre
• Top of the list, but how many?
– Compare list with GO term
– Which portion gives best p-value?
– Peak significance of ordered query

Page 21: GOSt a Gene Ontology mining tool Jüri Reimand

GOSt algorithms

• Unordered query
– Intersections with all annotation sets T
• Exhaustive algorithm for ordered queries:
– intersections with all Qi and annotation sets T
• Approximate algorithm for ordered queries:
– for every annotation set T, view only list portions that give local p-value extremes
  • local best p: list ends with matching gene
  • local worst p: list ends just before matching gene
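The two strategies for ordered queries can be compared on a single annotation set with a small sketch. It covers only the "local best p" half of the approximate algorithm: prefixes that end with a matching gene. The function names and toy data are assumptions, and the real tool may organise the search differently.

```python
from math import comb

def hypergeom_tail(x, n, K, N):
    """One-tailed hypergeometric p-value P(X >= x)."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(x, min(n, K) + 1)) / total

def peak_exhaustive(ordered, term_genes, N):
    """Test every prefix Q_i of the ordered list; return (best p, prefix length)."""
    best_p, best_len, x = 1.0, 0, 0
    for i, gene in enumerate(ordered, start=1):
        if gene in term_genes:
            x += 1
        p = hypergeom_tail(x, i, len(term_genes), N)
        if p < best_p:
            best_p, best_len = p, i
    return best_p, best_len

def peak_local_best(ordered, term_genes, N):
    """Test only prefixes that end with a matching gene."""
    best_p, best_len, x = 1.0, 0, 0
    for i, gene in enumerate(ordered, start=1):
        if gene in term_genes:
            x += 1
            p = hypergeom_tail(x, i, len(term_genes), N)
            if p < best_p:
                best_p, best_len = p, i
    return best_p, best_len

# Toy data (invented): 6 ordered genes, a term with 5 genes, 50 genes in total.
ORDERED = ["a", "b", "c", "d", "e", "f"]
TERM = {"a", "b", "d", "x", "y"}
PEAK = peak_local_best(ORDERED, TERM, 50)
```

Appending a non-matching gene leaves the overlap unchanged while the prefix grows, which can only raise the p-value, so for one term the peak always sits at a prefix ending with a match. In the toy data the peak is at the first four genes.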

Page 22: GOSt a Gene Ontology mining tool Jüri Reimand

Example: Ordered list analysis

[Figure: p-value against query length; peak significance at ordered list of 28 genes]

[Figure: list of genes, and matches for “Biosynthesis of steroids”]

Page 23: GOSt a Gene Ontology mining tool Jüri Reimand

[Figure labels: Genes, GO categories, P-value, Evidence codes, Ordered list query]

Page 24: GOSt a Gene Ontology mining tool Jüri Reimand

Algorithm speed comparison

24 sec

2.8 sec

Page 25: GOSt a Gene Ontology mining tool Jüri Reimand

GOSt features

• Command line interface (C/C++ and Perl)
• Graphical user interface in web: http://bioinf.ebc.ee/GOST
– SWOG (Graphics language, Jaanus Hansen 2005)
• Data for multiple organisms
– yeast, chicken, cow, mouse, rat, human...
• Wrappers for parallel applications (GRID, MPI)
• Pipelines for gene expression data analysis

Page 26: GOSt a Gene Ontology mining tool Jüri Reimand

Extending GO ( i )

• Pathway: a network of interacting genes and proteins
– metabolism pathways, disease pathways, ..
• Include pathway data in GO vocabulary
– KEGG Pathway database
– pathways as vocabulary terms
– related genes as annotations to terms
• KEGG terms independent of GO vocabulary

GO:0003674 molecular_function

GO:0005575 cellular_component

GO:0008150 biological_process

KEGG:00000 KEGG pathways

GO
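A minimal sketch of how pathways could sit next to the three GO roots as a fourth, independent branch. The dictionary layout, the function name, and the gene names are assumptions for illustration; the term IDs and names are taken from the slides.

```python
# Vocabulary as term ID -> (name, parent term IDs); root terms have no parents.
vocabulary = {
    "GO:0003674": ("molecular_function", []),
    "GO:0005575": ("cellular_component", []),
    "GO:0008150": ("biological_process", []),
}
annotations = {}  # term ID -> set of genes

def add_pathways(vocabulary, annotations, pathways,
                 root_id="KEGG:00000", root_name="KEGG pathways"):
    """Add each pathway as a term under a separate KEGG root and
    record its member genes as annotations to that term."""
    vocabulary.setdefault(root_id, (root_name, []))
    for pathway_id, (name, genes) in pathways.items():
        vocabulary[pathway_id] = (name, [root_id])
        annotations.setdefault(pathway_id, set()).update(genes)

# Gene names are placeholders, not real pathway members.
add_pathways(vocabulary, annotations, {
    "KEGG:05010": ("Alzheimer's disease", {"geneA", "geneB"}),
})
```

Because pathway terms use the same term-to-genes structure as GO terms, the same enrichment test can be applied to them without further changes.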

Page 27: GOSt a Gene Ontology mining tool Jüri Reimand

KEGG:05010 - Alzheimer's disease

Page 28: GOSt a Gene Ontology mining tool Jüri Reimand

Extending GO ( ii )

• Gene expression is started by transcription factors (TF)
• TFs bind to certain patterns in DNA
– Transcription Factor Binding Sites (TFBS)
– Often found in regions close to gene (1k bp)
• Include TFBS data from TRANSFAC
– Patterns (putative TFBS) as vocabulary terms
– annotations to genes near patterns

ATATAATAAAGATGAGGCGAATATAAATATACCGGCCCTTAGCGCGAAGCAATTCATCATATAAGCGAGAGAGGCCAATATGCAATCTTCGACAGCAT

[Figure labels: gene, TF binding site, transcription factor]
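Annotating genes by nearby patterns can be sketched with plain consensus matching. This is a simplification swapped in for illustration: the slides refer to TRANSFAC patterns scored with PWMs, whereas the code below only matches an IUPAC consensus string on the forward strand. It also assumes that each sequence is the region upstream of a gene and ends at the gene start. Function names and sequences are hypothetical; the consensus TTTSGCGS appears in the slides.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def motif_regex(consensus):
    """Compile an IUPAC consensus string into a regular expression."""
    return re.compile("".join(IUPAC[c] for c in consensus.upper()))

def genes_near_motif(upstream, consensus, window=1000):
    """Return genes whose last `window` upstream bases contain the motif.
    `upstream` maps gene -> upstream sequence ending at the gene start."""
    pattern = motif_regex(consensus)
    return {gene for gene, seq in upstream.items()
            if pattern.search(seq.upper()[-window:])}

# Toy sequences (invented).
UPSTREAM = {
    "geneA": "ACGTTTTCGCGCAAT",   # contains TTTCGCGC, which matches TTTSGCGS
    "geneB": "ACGTACGTACGTACG",   # no match
}
HITS = genes_near_motif(UPSTREAM, "TTTSGCGS")
```

The resulting gene set plays the same role as an annotation set for a GO term, with the pattern acting as the vocabulary term.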

Page 29: GOSt a Gene Ontology mining tool Jüri Reimand

TRANSFAC motifs

• Motifs added in a hierarchy
– according to PWM score
– 5 levels:
  • near_threshold
  • ...
  • near_MAX_score
• Work in progress
– Hedi Peterson

GO:0003674 molecular_function

GO:0005575 cellular_component

GO:0008150 biological_process

KEGG:00000 KEGG pathways

GO

TF:M00000 TRANSFAC motifs

TF:M00431_4 TTTSGCGS:4
TF:M00431_3 TTTSGCGS:3
TF:M00431_2 TTTSGCGS:2
TF:M00431_1 TTTSGCGS:1
TF:M00431_0 TTTSGCGS:0
TF:M00328_4 NCNNTNNTGCRTGANNNN:4
TF:M00328_3 NCNNTNNTGCRTGANNNN:3
TF:M00328_2 NCNNTNNTGCRTGANNNN:2

[Figure label: depth in hierarchy]
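The five-level motif hierarchy might be built along these lines. Several points are assumptions not stated in the slides: that the levels are equal-width score bins between the match threshold and the maximum PWM score, that level 0 is the bin nearest the threshold, and that a match at a stronger level also counts for all weaker levels. Function names are hypothetical; the ID format follows the TF:M00431_N terms shown above.

```python
def score_level(score, threshold, max_score, levels=5):
    """Map a PWM score to a level 0..levels-1, or None below the threshold.
    Equal-width bins are an assumption, not taken from the slides."""
    if max_score <= threshold:
        raise ValueError("max_score must exceed threshold")
    if score < threshold:
        return None
    frac = (score - threshold) / (max_score - threshold)
    return min(levels - 1, int(frac * levels))

def level_terms(motif_id, level):
    """Term IDs for a match at `level`: that level and every weaker one."""
    return [f"TF:{motif_id}_{i}" for i in range(level + 1)]

# Toy scores on an invented 80..100 scale.
LEVEL = score_level(90, 80, 100)
TERMS = level_terms("M00431", LEVEL)
```

Under these assumptions a gene with a mid-range match is annotated to levels 0 through 2 of the motif, which mirrors how the True Path Rule propagates GO annotations to more general terms.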

Page 30: GOSt a Gene Ontology mining tool Jüri Reimand

Summary

• We investigated means for finding GO annotations to groups of genes, and statistical methods for determining significance of results.

• We combined GO vocabulary with various types of biological data, such as KEGG pathways and TRANSFAC regulatory elements.

• We proposed analytical thresholds for distinguishing significant results from structured and partly dependent GO annotations, and verified thresholds with simulation experiments.

• We proposed a novel concept of analyzing GO annotations for ordered lists of genes, and implemented fast algorithms for the purpose.

• The practical result of our work is GOSt, a GO mining tool. Command line interface is suitable for large-scale automatic analysis, while graphical web interface enables highly visualized and interactive analysis.

Page 31: GOSt a Gene Ontology mining tool Jüri Reimand

Sneak preview

• GO analysis of hierarchical clustering tree
– Cluster genes according to expression similarity and ..
– .. “Wrap up” nodes that show no significant annotations in GO
• Work in progress
– Meelis Kull
– Darja Krushevskaja

Page 32: GOSt a Gene Ontology mining tool Jüri Reimand

Acknowledgments

Jaak Vilo

BIIT group
Hedi Peterson, Raivo Kolde, Meelis Kull, Konstantin Tretjakov, Jaanus Hansen, Pavlos Pavlidis, Priit Adler, Asko Tiidumaa, Ilja Livenson, Darja Krushevskaja

FunGenES Consortium