NGS Bioinformatics Workshop 2.5: Meta-Analysis of Genomic Data

May 30th, 2012, IRMACS 10900

Facilitator: Richard Bruskiewich, Adjunct Professor, MBB

Acknowledgment: Several slides courtesy of Professor Fiona Brinkman, MBB

Today’s Agenda

A brief overview of the bioinformatics for:

• SNP detection software
• Proteins
• Systems biology
• Metagenomics (some resources; very brief…)

Group feedback: bioinformatics needs at SFU?

NGS-based SNP Analysis Programs

From: Nielsen et al. 2011. Nature Reviews Genetics 12:443-451

BIOINFORMATICS OF PROTEINS


From DNA to Protein to Systems

ATGGAATTC…
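The first step of "DNA to protein" can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and reading frame 1; the function name `translate` is hypothetical, and only the nine bases shown on the slide are used as input.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in frame 1; unknown codons become 'X', stops '*'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


print(translate("ATGGAATTC"))  # MEF
```

The three codons ATG, GAA and TTC give Met, Glu and Phe.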

Amino Acid Properties – Venn Diagram

Polypeptides

[Figure: polypeptide chain structure; recoverable labels: R4, H, H3N+]

Ramachandran Plot

Secondary Structure (SS) Prediction

Note major assumptions in all methods:
– The entire information for forming ss is contained in the primary sequence
– Side groups of residues will determine structure

• Pattern recognition
– Looks for patterns in common ss’s like amphipathic alpha-helices (e.g. pattern of polar and non-polar residues)

• Homology
– Predict ss of the central residue of a given segment from homologous segments (neighbors)
– Based on alignments of homologous residues from a protein family
– Assumption: homologous proteins = similar structure
– Extension: use BLOSUM to detect similarity or, better, use a Position Specific Scoring Matrix (PSSM)
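The pattern-recognition idea (a periodic pattern of polar and non-polar residues in an amphipathic helix) can be illustrated with a hydrophobic moment calculation. This is a sketch under stated assumptions, not the method of any program listed here: it assumes the Kyte-Doolittle hydropathy scale and an ideal alpha-helix with 100 degrees of rotation per residue, and the function name `hydrophobic_moment` is hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values (assumed scale for this sketch).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(segment, angle_deg=100.0):
    """Mean hydrophobic moment of a segment, assuming a fixed turn per residue."""
    delta = math.radians(angle_deg)
    sin_sum = sum(KD[aa] * math.sin(i * delta) for i, aa in enumerate(segment))
    cos_sum = sum(KD[aa] * math.cos(i * delta) for i, aa in enumerate(segment))
    return math.hypot(sin_sum, cos_sum) / len(segment)


# Toy sequences: Leu on one helical face and Lys on the other, versus all Leu.
amphipathic = "".join(
    "L" if math.cos(i * math.radians(100.0)) > 0 else "K" for i in range(18)
)
uniform = "L" * 18

print(amphipathic, round(hydrophobic_moment(amphipathic), 2))
print(uniform, round(hydrophobic_moment(uniform), 2))
```

A high moment means hydrophobic residues cluster on one face of the helix; the uniform sequence scores close to zero because its contributions cancel around the helical wheel.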

SS Prediction Programs

• PredictProtein-PHD (72%)
– http://www.predictprotein.org/

• PREDATOR (75%)
– http://www-db.embl-heidelberg.de/jss/servlet/de.embl.bk.wwwTools.GroupLeftEMBL/argos/predator/predator_info.html

• PSIpred (77%)
– http://bioinf.cs.ucl.ac.uk/psipred/ (PSSM generated by PSI-BLAST, better sequence database, won CASP competition for many years)

• Jpred (81%)
– http://www.compbio.dundee.ac.uk/jpred/

Tertiary Structure

Lactate Dehydrogenase: Mixed α/β

Immunoglobulin Fold: β

Hemoglobin B Chain: α

Tertiary Structure: Protein Folds

Holm, L. and Sander, C. (1996) Mapping the protein universe. Science, 273, 595-603.

Protein Folds

Folds: definition is difficult, and different criteria are used for different classification systems
– Normally formed around a separate hydrophobic core

Current protein fold taxonomy
– Very roughly …
– Approx. 1000-2000 different estimated folds, depending on method of analysis
– of which about half are estimated to be known (500-1000)
– Average domain size approx. 150 aa (50 – 250 aa approx std dev)

Protein Fold Major Classes

All alpha proteins (all α)

All beta proteins (all β)

Alpha/beta proteins (α/β)
- Parallel strands connected by helices (βαβ motifs)

Alpha plus beta proteins (α+β)
- More irregular α and β combinations

“Other”
- Often subclassified now

Protein Fold Classification

• Curated/Semi Manual Classification

– SCOP (Structural Classification Of Proteins)

http://scop.mrc-lmb.cam.ac.uk/scop/

– CATH (Class, Architecture, Topology, Homologous superfamily)

http://www.cathdb.info/

SCOP classification

Family: clear evolutionary relationship
– Residue identities >= 30%
– OR known similar functions and structures (example: globins form a family though some are only 15% identical)

Superfamily: probable common evolutionary origin
– Low sequence identities, but structural and functional features suggest a common evolutionary origin (example: actin, the ATPase domain of heat shock proteins, and hexokinase form a superfamily)

Fold: major structural similarity
– Same major ss in same arrangement with the same topological connections
– May occur by convergent evolution
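The 30% residue identity criterion for a SCOP family can be made concrete with a small calculation on a pairwise alignment. This is a sketch only: the function names are hypothetical, the choice of denominator (aligned, non-gap columns) is an assumption, and the sequence criterion alone does not decide family membership, since structural and functional similarity can also qualify a pair.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [
        (a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap
    ]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def meets_identity_criterion(aligned_a, aligned_b, threshold=30.0):
    """True if the pair passes the >= 30% identity test (sequence criterion only)."""
    return percent_identity(aligned_a, aligned_b) >= threshold


print(percent_identity("ACDE", "ACDF"))          # 75.0
print(meets_identity_criterion("ACDE", "WYDF"))  # False (25% identity)
```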

SCOP example

CATH example

Protein Fold Classification

• Automated Classification

– DALI
http://ekhidna.biocenter.helsinki.fi/dali

– VAST (Vector Alignment Search Tool)
http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml

Domain Classification # (DC_l_m_n_p)

l: fold space attractor region

m: globular folding topology/fold type (clusters of structural neighbours in fold space with average pairwise Z-scores, by Dali, above 2)

n: functional family (PSI-Blast, clusters of identically conserved functional residues, E.C. numbers, Swissprot keywords)

p: sequence family (>25% identities)

DALI/FSSP – Automated classification

Exhaustive all-against-all 3D structure comparison of protein structures currently in the PDB

VAST – Automated classification

http://www.ncbi.nlm.nih.gov/Structure/VAST/vasthelp.html

All-against-all BLAST comparison of NCBI’s MMDB (database of known protein structures at NCBI, derived from the PDB)

Clustered into groups by a neighbor joining procedure, using BLAST p-value cutoffs of C or less (where C=10e-7, 10e-40 or 10e-80, to reflect three different levels of redundancy). A fourth level of classification is based on sequence identity

Motif and Domain Searching

• InterPro – an integration of tools (PROSITE, PFAM, PRINTS, PRODOM)
– http://www.ebi.ac.uk/interpro/

• Expasy Tools has more…
– PATTINPROT, to search for patterns in proteins yourself, etc…

But first… Check if the analysis you want to do has already been done!

e.g. www.ebi.ac.uk/proteome/ and db.psort.org

Phylofacts

PhyloFacts includes hidden Markov models for classification of user-submitted protein sequences to protein families across the Tree of Life.

http://phylogenomics.berkeley.edu/phylofacts/

Subcellular Localization Prediction – Example of the benefit of integrating results with a Bayesian approach

Localization Prediction - methods

Several programs analyze single features:

TargetP

Initially one program analyzed multiple features:

PSORT I (eukaryotes and prokaryotes)

Developed in 1990

PSORT I prediction method: Rule based

Nakai & Kanehisa, Proteins: Structure, Function, Genetics (1991)

Compositional Analysis

• Molecular Weight
• Amino Acid Frequency
• Isoelectric Point
• UV Absorptivity
• Solubility, Size, Shape
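Two of the compositional properties above, amino acid frequency and molecular weight, are simple enough to sketch directly. The example below is illustrative only: it assumes approximate average residue masses (in Da) and adds one water per chain, and the function names are hypothetical. Isoelectric point, UV absorptivity and solubility need more elaborate models and are not covered.

```python
from collections import Counter

# Approximate average residue masses in Da (assumed values for this sketch).
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER_MASS = 18.02  # one water is retained per chain (free N- and C-termini)


def amino_acid_frequency(protein):
    """Fraction of the sequence made up by each residue type."""
    counts = Counter(protein.upper())
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def molecular_weight(protein):
    """Approximate average molecular weight of a linear peptide in Da."""
    return sum(RESIDUE_MASS[aa] for aa in protein.upper()) + WATER_MASS


peptide = "MEF"
print(amino_acid_frequency(peptide))
print(round(molecular_weight(peptide), 2))
```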

SYSTEMS BIOLOGY

Systems Biology

What is systems biology?

① Considers all (or many) of the proteins and genes in the system

② Links proteins and genes using interactions and functions

③ Uses computational models to study system

④ Provides insights into mechanisms, system dynamics, global properties

Molecular Interaction (MI) Network

Nodes = Gene / Protein
Edge = Interaction

Possible interactions:
• phosphorylation
• physical binding
• transcriptional regulation
• others?

Cytoscape

http://www.cytoscape.org/

Cytoscape supports many use cases in molecular and systems biology, genomics, and proteomics:

Load molecular and genetic interaction data sets in many formats

Project and integrate global datasets and functional annotations

Establish powerful visual mappings across these data

Perform advanced analysis and modeling using Cytoscape plugins

Visualize and analyze human-curated pathway datasets such as Reactome or KEGG.

Cytoscape

Attributes for highlighted nodes / edges

Change visible attributes

Network navigation

Visible networks

Search for nodes

Control tabs: Network, VizMapper, plugin tabs

Data Files:
1. Network (Simple Interaction Format)
2. Node attributes (tab-delimited)
3. Gene expression (tab-delimited)

Cytoscape – Loading Data

1. Network (Simple Interaction Format)

• Format:

gene1 interaction_type gene2

• E.g.:

C1QB pp C1R
C1R pp C2
C2 pp C4
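To show what the Simple Interaction Format encodes, here is a minimal Python sketch that reads the three example lines into an edge list and an undirected adjacency map. The helper names `parse_sif` and `adjacency` are hypothetical and are not part of Cytoscape; the sketch assumes whitespace-delimited fields and skips lines with fewer than three fields.

```python
from collections import defaultdict

SIF_TEXT = """\
C1QB pp C1R
C1R pp C2
C2 pp C4
"""


def parse_sif(text):
    """Return (source, interaction_type, target) tuples from SIF text."""
    edges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank lines or single-node lines carry no edge
        source, interaction, *targets = fields
        # A SIF line may list several targets after the interaction type.
        edges.extend((source, interaction, target) for target in targets)
    return edges


def adjacency(edges):
    """Undirected neighbor sets, ignoring the interaction type."""
    neighbors = defaultdict(set)
    for source, _interaction, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    return dict(neighbors)


edges = parse_sif(SIF_TEXT)
print(edges)
print(adjacency(edges)["C1R"])
```

Each edge keeps its interaction type (here `pp`), which is what lets a network viewer style edges by type.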

2. Gene Attribute (tab-delimited table)

• Maps data values to nodes

Load File

Check off “Show Text File Import Options”

Check off “Transfer first line as attribute names..”

Preview

3. Gene expression (tab-delimited table)

• Format:

gene1 exp_cond1 exp_cond2 … sig_cond1 sig_cond2 …

• Expression value: fold-change or intensity from microarray

• Significance value: P-value for the test of differential expression between conditions (a smaller value means stronger evidence that expression differs)

Cytoscape – Network Style

In “VizMapper” tab…

Double-click “Node color”

Select expression fold-change values (CMexp)

Select “Continuous Mapping” as mapping type

Can change color by double-clicking on arrows
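Conceptually, a continuous mapping interpolates a visual property between anchor points. The sketch below shows one way this could work for node color, assuming a blue-white-red gradient with anchors at fold-change -2, 0 and +2. The anchors, the colors and the function name `fold_change_to_color` are illustrative assumptions, not Cytoscape defaults or its API.

```python
LOW_COLOR = (0, 0, 255)      # blue: strongly down-regulated
MID_COLOR = (255, 255, 255)  # white: unchanged
HIGH_COLOR = (255, 0, 0)     # red: strongly up-regulated


def _interpolate(color_a, color_b, t):
    """Linear interpolation between two RGB colors, with t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(color_a, color_b))


def fold_change_to_color(value, low=-2.0, mid=0.0, high=2.0):
    """Map a fold-change value onto a blue-white-red gradient as a hex string."""
    value = max(low, min(high, value))  # clamp values outside the anchor range
    if value <= mid:
        rgb = _interpolate(LOW_COLOR, MID_COLOR, (value - low) / (mid - low))
    else:
        rgb = _interpolate(MID_COLOR, HIGH_COLOR, (value - mid) / (high - mid))
    return "#{:02x}{:02x}{:02x}".format(*rgb)


for fold_change in (-5.0, -1.0, 0.0, 2.0):
    print(fold_change, fold_change_to_color(fold_change))
```

Moving an anchor (the arrows in the gradient editor) corresponds to changing `low`, `mid` or `high` in this sketch.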

1. Differentially-expressed subnetworks
• jActiveModules

2. Functional enrichment
• BiNGO

Systems Biology Analyses

Search for sub-networks that contain a significant number of differentially-expressed genes (nodes)

All genes in a sub-network interact… SO these highly differentially-expressed sub-networks may represent a critical pathway or complex involved in a condition of interest

Differentially-Expressed Subnetworks

jActive algorithm:
• Searches for sub-networks that contain a significant number of differentially-expressed genes (or nodes)
• Heuristic – won’t always find the optimum result
• Z-score indicates how unlikely it is to find, by chance, a subnetwork with a similar number of DE genes (higher = more significant)
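The scoring idea behind a subnetwork Z-score can be sketched as follows. This is a simplified illustration, not the jActiveModules implementation: it assumes each gene's p-value is converted to a z-score, the z-scores in a candidate subnetwork are summed and scaled by the square root of its size, and the result is calibrated against randomly drawn gene sets of the same size. The function names, the toy p-values and the number of random draws are all hypothetical, and the heuristic search over subnetworks is not shown.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

_STD_NORMAL = NormalDist()


def aggregate_z(p_values):
    """Sum of per-gene z-scores scaled by sqrt(k); p-values must lie in (0, 1)."""
    # -inv_cdf(p) equals inv_cdf(1 - p) by symmetry and is stabler for tiny p.
    z_scores = [-_STD_NORMAL.inv_cdf(p) for p in p_values]
    return sum(z_scores) / sqrt(len(z_scores))


def calibrated_score(subnet_genes, gene_p, n_random=1000, seed=0):
    """Compare a subnetwork's aggregate z with random gene sets of equal size."""
    rng = random.Random(seed)
    all_genes = list(gene_p)
    k = len(subnet_genes)
    observed = aggregate_z([gene_p[g] for g in subnet_genes])
    background = [
        aggregate_z([gene_p[g] for g in rng.sample(all_genes, k)])
        for _ in range(n_random)
    ]
    return (observed - mean(background)) / stdev(background)


# Toy data: 5 strongly differentially-expressed genes among 50.
gene_p = {f"g{i}": (0.001 if i < 5 else 0.5) for i in range(50)}
active = [f"g{i}" for i in range(5)]
print(round(calibrated_score(active, gene_p), 2))
```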

Differentially-Expressed Subnetworks

Search from highlighted nodes

Select expression significance (p-values)

jActive - Inputs

Highlight result and click “Create Network”

Subnetworks listed here

jActive - Results

Functional Enrichment:
• Also called over-representation analysis
• Searches for common or related functions in a gene set
• Is there a common annotation (e.g. pathway, GO term) for a set of genes that is more frequent than you would expect by chance?

Functional Enrichment

Gene Ontology

• Controlled vocabulary describing functions, processes and cell components
• Consistency between organisms and gene products
• GO terms linked by relationships (is-a, part-of) and have hierarchy (parent – child)

[Figure: GO hierarchy example. Legend: is-a, part-of. Terms shown: protein complex, organelle, mitochondrion, fatty acid beta-oxidation multienzyme complex, [other protein complexes], [other organelles]]
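The parent-child hierarchy means that a gene annotated to a specific term is implicitly annotated to all of its ancestors. A minimal sketch of that traversal is shown below, using the terms named in the figure. The specific is-a and part-of edges are assumed for illustration, since the figure's arrows are not recoverable here, and the names `PARENTS` and `ancestors` are hypothetical.

```python
# child term -> list of (relationship, parent term); edges assumed for illustration
PARENTS = {
    "fatty acid beta-oxidation multienzyme complex": [
        ("is-a", "protein complex"),
        ("part-of", "mitochondrion"),
    ],
    "mitochondrion": [("is-a", "organelle")],
    "protein complex": [],
    "organelle": [],
}


def ancestors(term, parents=PARENTS):
    """All terms reachable from `term` by following is-a and part-of links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("fatty acid beta-oxidation multienzyme complex")))
```

Enrichment tools typically propagate annotations this way before testing, so that general terms collect the genes annotated to their more specific descendants.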

BiNGO:
• Looks for GO terms that are over-represented in a set of genes.
• Displays the results in two ways
– A table with p-values
– A graph showing relationships between terms
• Uses the hypergeometric test to statistically test for over-representation of each GO term.
• Performs multiple hypothesis correction (since we are testing multiple GO terms for over-representation).
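The hypergeometric test and a multiple-testing correction can be sketched in plain Python. This is an illustration under stated assumptions, not BiNGO's code: the function names and the toy numbers are hypothetical, and Bonferroni is used here only because it is the simplest correction to show; the source does not say which correction is applied.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits): chance of at least `hits` annotated genes in the sample.

    population: genes in the reference set
    annotated:  reference genes carrying the GO term
    sample:     genes in the selected set
    hits:       selected genes carrying the GO term
    """
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return tail / comb(population, sample)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Toy example: 1000 reference genes, 40 carry the term, 20 selected, 6 hits.
p_term = hypergeom_enrichment_p(1000, 40, 20, 6)
print(p_term)
print(bonferroni([p_term, 0.2, 0.04]))
```

With roughly 0.8 hits expected by chance (20 x 40 / 1000), observing 6 gives a small p-value, which is the sense in which the term is over-represented.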

Functional Enrichment

BiNGO - Inputs

Click Start BiNGO

Select “Custom” and then load go.annot file

Lower significance level

Fill in Name

BiNGO - Results

General GO Terms

Specific GO Terms

Significance

EGAN: Exploratory Gene Association Networks

http://akt.ucsf.edu/EGAN/

METAGENOMICS

What is Metagenomics?

The culture-independent isolation and characterization of DNA from uncultured microorganism communities

Nice reading list on the topic:
http://www.cbcb.umd.edu/confcour/CMSC828G-materials/reading-list.html

See also: Torsten Thomas, Jack Gilbert and Folker Meyer. 2012. Metagenomics - a guide from sampling to data analysis. Microb. Inform. Exp. doi:10.1186/2042-5783-2-3 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3351745/

I will just mention a few relevant bioinformatics tools here (no specific endorsements implied).

MG-RAST server

http://metagenomics.nmpdr.org/

Meyer, F. et al. 2008. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 9:386 doi:10.1186/1471-2105-9-386

MEGAN - MEtaGenome ANalyzer
http://ab.inf.uni-tuebingen.de/software/megan/

Huson DH et al. 2007. MEGAN analysis of metagenomic data. Genome Res. 17: 377-386

