Bioinformatics Basics Cyrus Chan, Peter Lo, David Lam Courtesy from LO Leung Yau’s original presentation.

Dec 22, 2015

Transcript
Page 1:

Bioinformatics Basics

Cyrus Chan, Peter Lo, David Lam
Courtesy from LO Leung Yau’s original presentation

Page 2:

Biological Background

Page 3:

Outline

Biological Background
  Cell
  Protein
  DNA & RNA
  Central Dogma
  Gene Expression

Bioinformatics
  Sequence Analysis
  Phylogenetic Trees
  Data Mining

Page 4:

Biological Background – Cell

Basic unit of organisms
  Prokaryotic (lacks a cell nucleus)
  Eukaryotic
A bag of chemicals
Metabolism controlled by various enzymes
Correct working needs suitable amounts of various proteins

Picture taken from http://en.wikipedia.org/wiki/Cell_(biology)

Page 5:

Biological Background – Protein

Polymer of 20 types of amino acids
Folds into 3D structure
Shape determines the function
Many types
  Transcription factors
  Enzymes
  Structural proteins
  …

Pictures taken from http://en.wikipedia.org/wiki/Protein and http://en.wikipedia.org/wiki/Amino_acid

Page 6:

Biological Background – DNA & RNA

DNA
  Double stranded
  Adenine, Cytosine, Guanine, Thymine
  Base pairing: A-T, G-C
  Those parts coding for proteins are called genes
RNA
  Single stranded
  Adenine, Cytosine, Guanine, Uracil

Picture taken from http://en.wikipedia.org/wiki/Gene

Chromosome
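
The pairing rules on this slide (A-T, G-C, and uracil replacing thymine in RNA) can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function names are made up for the example.

```python
# Watson-Crick partners for DNA bases (A-T, G-C), as listed on the slide.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(dna):
    """Return the opposite strand, read in its own 5'->3' direction."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna.upper()))

def transcribe(coding_strand):
    """Return the RNA that carries the coding-strand sequence (T becomes U)."""
    return coding_strand.upper().replace("T", "U")

print(reverse_complement("ATGC"))  # GCAT
print(transcribe("ATGC"))          # AUGC
```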

Page 7:

Chromatin Structure

Super compact packaging

euchromatin heterochromatin

Page 8:

Biological Background – Genes

Genes – protein coding regions
3 nucleotides (a codon) code for one amino acid
There are also start and stop codons
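
To make the codon idea concrete, here is a small sketch that translates a DNA coding sequence with the standard genetic code. It assumes the reading frame starts at the first ATG and that the input holds only A, C, G and T; the helper names are illustrative.

```python
from itertools import product

# Standard genetic code, written in the usual T, C, A, G ordering.
# '*' marks the three stop codons (TAA, TAG, TGA).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate from the first start codon (ATG) up to the first stop codon."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("ccATGGCCTGAaa"))  # MA
```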

Page 9:

Biological Background – in a nutshell

Abstractions – the Central Dogma
  Functional Units: Proteins
  Templates: RNAs
  Blueprints: DNAs

Not only the information (data), but also the control signals about what and how much data is to be sent

Proteins (TFs) also help

Page 10:

Biological Background

…acatggccgatca…tcaccctgaacatgtcgctttaacctactggtgatgcacct…atgatcaggg…atactggatacagggcata….

Figure labels: Gene → RNA → Protein (shown for two genes); intergenic region (“non-coding region”) between the genes

Page 11:

Biological Background

…acatgggcgatca…tcaccctgaacatgtcgctttaacctactggtgatgcacct…atgatcaggg…atactggatacagggcata….

Figure labels: Gene → RNA → Protein (malfunctioning) for the mutated gene; Gene → RNA → Protein for the other gene; intergenic region (“non-coding region”) between the genes

Genetic disease caused by a single mutation
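
The first sequence fragment on this slide differs from the one on the previous slide at a single base. A position-by-position comparison finds it; the sketch below assumes the two sequences are already aligned and of equal length, and the function name is illustrative.

```python
def point_differences(reference, sample):
    """List (position, reference_base, sample_base) for every mismatch.

    Positions are 0-based. Both sequences must have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return [
        (i, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]

normal = "acatggccgatca"   # fragment from the previous slide
mutated = "acatgggcgatca"  # fragment from this slide
print(point_differences(normal, mutated))  # [(6, 'c', 'g')]
```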

Page 12:

Biological Background

There can be multiple mutations that cause diseases (or increase the risk of disease)

[Figure: DNA from different people, some labelled Normal and some labelled Disease, differing at single nucleotide positions]

SNP (single nucleotide polymorphism)

Page 13:

Biological Background – Sequences

Abstractions

Sequences

…acatggccgatcaggctgtttttgtgtgcctgtttttctattttacgtaaatcaccctgaacatgtTTGCATCAacctactggtgatgcacctttgatcaatacattttagacaaacgtggtttttgagtccaaagatcagggctgggttgacctgaatactggatacagggcatataaaacaggggcaaggcacagactc…

FT   intron   <1..28
FT            /gene="CREB"
FT            /number=3
FT            /experiment="experimental evidence…
FT            recorded"
FT   exon     29..174
FT            /gene="CREB"
FT            /number=4
FT            /experiment="experimental evidence…
FT            recorded"
FT   intron   175..>189
FT            /gene="CREB"
FT            /number=4

Annotations

Visualizations

Page 14:

Biological Background – DNA → RNA → Protein

Picture taken from http://en.wikipedia.org/wiki/Gene

gene

Page 15:

Biological Background – DNA → RNA → Protein

Transcriptional Regulatory Network is the complex interaction between genes, transcription factors (TF) and transcription factor binding sites (TFBS).

Figure labels: Transcription Factors; Binding sites; Genes; Promoter regions; Other functions

Page 16:

Complex Interactions between Genes, TFs and TFBSs

Page 17:

Biological Background – DNA → RNA → Protein

Transcriptional Regulatory Network is the complex interaction between genes, transcription factors (TF) and transcription factor binding sites (TFBS).

Figure labels: Transcription Factors; Binding sites; Genes; Promoter regions; Other functions

Page 18:

Gene Expression Microarray Data

High throughput
Measures RNA level
Relies on A-T, G-C pairing
Can monitor expression of many genes

Picture taken from http://en.wikipedia.org/wiki/DNA_microarray_experiment

Page 19:

Gene Expression Microarray Data

Picture taken from http://en.wikipedia.org/wiki/DNA_microarray

Genes

Time points/Conditions

Colors: Expression (RNA) Levels

Page 20:

Bioinformatics

Page 21:

Bioinformatics – Sequence Analysis

Alignments
  A way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences

http://en.wikipedia.org/wiki/Sequence_alignment

Page 22:

Bioinformatics – Sequence Analysis

Pair-wise alignments
  Method: dynamic programming!
  No penalty for the consecutive ‘-’s (gaps) before and after the sequence to be aligned

\\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC3220 Lectures
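
As a rough sketch of the dynamic programming idea with free end gaps, the function below scores the best placement of a short sequence inside a longer one. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions made for the example; the slides do not specify a scoring scheme.

```python
def fit_alignment_score(pattern, text, match=1, mismatch=-1, gap=-1):
    """Best score for aligning all of `pattern` against some part of `text`.

    Gaps placed before and after `pattern` are free: row 0 is all zeros
    (free leading gaps) and the answer is the maximum over the last row
    (free trailing gaps). Gaps inside the alignment cost `gap` each.
    """
    n = len(text)
    prev = [0] * (n + 1)  # aligning the empty pattern prefix costs nothing
    for i in range(1, len(pattern) + 1):
        curr = [prev[0] + gap] + [0] * n
        for j in range(1, n + 1):
            same = pattern[i - 1] == text[j - 1]
            diagonal = prev[j - 1] + (match if same else mismatch)
            skip_pattern_char = prev[j] + gap   # pattern base against '-'
            skip_text_char = curr[j - 1] + gap  # text base against '-'
            curr[j] = max(diagonal, skip_pattern_char, skip_text_char)
        prev = curr
    return max(prev)

print(fit_alignment_score("GAT", "ACGATTA"))  # 3
```

Only the score is returned here; recovering the alignment itself would need a traceback through the full table, which this sketch does not keep.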

Page 23:

Bioinformatics – Sequence Analysis

Multiple (global) sequence alignment
  Also dynamic programming (but can’t scale up: the table grows exponentially with the number of sequences)

Page 24:

Bioinformatics – Sequence Analysis

Multiple local sequence alignment
  i.e. motif (pattern) discovery

>seq1
acatggccgatcagctggtttttgtgtgcctgtttctgaatc
>seq2
ttctattttacgtaaatcagcttgaacatgtacctactggtg
>seq3
atgcacctttgatcaataccagctagacaaacgtgtgttg
>seq4
agtccaaagatcagggctggctgaatactggatcagct
>seq5
cagctacagggcatataaaggggcaaggcacagactc

Such overrepresented patterns are often important components (e.g. TFBSs if the sequences are promoters of similar genes).

TFBSs are the controlling keyholes in gene regulation!
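
A very simple way to look for an overrepresented pattern is to count, for every k-mer, how many of the input sequences contain it. The sketch below does that for the five sequences on this slide. It only handles exact matches, which is a simplifying assumption: real motif discovery tools allow mismatches and use probabilistic models. The function name is illustrative.

```python
from collections import Counter

def kmer_support(sequences, k):
    """Count, for each k-mer, the number of sequences that contain it."""
    support = Counter()
    for seq in sequences:
        # A set, so that a k-mer repeated within one sequence counts once.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(kmers)
    return support

sequences = [
    "acatggccgatcagctggtttttgtgtgcctgtttctgaatc",
    "ttctattttacgtaaatcagcttgaacatgtacctactggtg",
    "atgcacctttgatcaataccagctagacaaacgtgtgttg",
    "agtccaaagatcagggctggctgaatactggatcagct",
    "cagctacagggcatataaaggggcaaggcacagactc",
]
support = kmer_support(sequences, 5)
print(support["cagct"])  # 5: the pattern occurs in every sequence
```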

Page 25:

DNA motifs

Similar DNA fragments across individuals and/or species
TFBS motifs: DNA fragments similar to “TATAA” are common in order to recruit the polymerase to initiate transcription in eukaryotes
Expensive and time-consuming to try a large set of candidates in biological experiments

Figure labels: DNA; TF (Transcription Factor); TFBS (controlling), TATAA; Gene (functioning); Transcription; RNA; Translation; Protein

Page 26:

Motif discovery

Figure labels: TFBS Motif Discovery; CGATTGA; f; Maximized; similar controlled functions, e.g. cancer gene activities

Motif discovery usually refers to TFBS motifs

But motif is a general term meaning “pattern”: sequence motifs, structural motifs, network motifs…

Page 27:

ChIP-Seq motif discovery

Same as traditional TFBS motif discovery in principle
Data input precision and scale are different
  Genome-wide: tens of thousands of sequences
  Short: 50-100 bp
  Each sequence measured by some enrichment score (a peak)

Page 28:

Introduction

ChIP-Seq technology
Peak-calling

High-resolution sequences from more direct binding evidence; the enriched regions are likely to contain motifs coupled with peak signals; genome-wide sequences; in vivo

Too many sequences for older methods

Page 29:

Enrichment

Introduction

ChIP-Seq technology Motifs?

…Old-day methods reapplied

Page 30:

Phylogenetic Trees (Phylogenies)

Preliminaries
Distance-based methods
Parsimony methods

Adapted from: Fundamental Concepts of Bioinformatics, Michael L. Raymer, Computer Science, Biomedical Sciences, Wright State University, birg.cs.wright.edu/text/Tutorial.ppt

Page 31:

Phylogenetic Trees

Hypothesis about the relationship between organisms
Can be rooted or unrooted

[Figure: taxa A, B, C, D, E drawn as a rooted tree (with Root and a Time axis) and as an unrooted tree]

birg.cs.wright.edu/text/Tutorial.ppt

Page 32:

Tree proliferation

Number of rooted trees for n species: $N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$

Number of unrooted trees for n species: $N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$

Species   Number of Rooted Trees          Number of Unrooted Trees
2         1                               1
3         3                               1
4         15                              3
5         105                             15
10        34,459,425                      2,027,025
15        213,458,046,676,875             7,905,853,580,625
20        8,200,794,532,637,891,559,375   221,643,095,476,699,771,875

birg.cs.wright.edu/text/Tutorial.ppt
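
The tree counts can be checked with a few lines of Python. This sketch uses the standard closed forms for bifurcating trees, (2n-3)!/(2^(n-2)(n-2)!) rooted and (2n-5)!/(2^(n-3)(n-3)!) unrooted; the function names are illustrative. Evaluating them shows that the counts 34,459,425 and 2,027,025 correspond to n = 10 species.

```python
from math import factorial

def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 species."""
    if n < 2:
        raise ValueError("need at least 2 species")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 2 species."""
    if n < 2:
        raise ValueError("need at least 2 species")
    if n == 2:
        return 1  # the closed form below is only defined for n >= 3
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in (2, 3, 4, 5, 10, 15, 20):
    print(n, f"{rooted_trees(n):,}", f"{unrooted_trees(n):,}")
```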

Page 33:

An ongoing didactic

Pheneticists tend to prefer distance based metrics, as they emphasize relationships among data sets, rather than the paths they have taken to arrive at their current states.

Cladists are generally more interested in evolutionary pathways, and tend to prefer more evolutionarily based approaches such as maximum parsimony.

birg.cs.wright.edu/text/Tutorial.ppt

Page 34:

Parsimony methods

Belong to the broader class of character-based methods of phylogenetics
Emphasize simpler, and thus more likely, evolutionary pathways
Enumerate all possible trees
Note the number of substitution events invoked by each possible tree
  Can be weighted by transition/transversion probabilities, etc.
Select the most parsimonious

birg.cs.wright.edu/text/Tutorial.ppt
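
Counting the substitutions a given tree needs is commonly done with Fitch's algorithm, which the slides do not name; it is used here as one standard choice. The sketch handles a single site with unweighted substitutions and represents a rooted binary tree as nested 2-tuples whose leaves are the observed bases. The function names are illustrative.

```python
def fitch(tree):
    """Return (possible ancestral states, minimum substitutions) for one site."""
    if isinstance(tree, str):
        return {tree}, 0  # a leaf: its state is known, no change needed
    left, right = tree
    left_states, left_cost = fitch(left)
    right_states, right_cost = fitch(right)
    shared = left_states & right_states
    if shared:
        # The children can agree on a state: no extra substitution here.
        return shared, left_cost + right_cost
    # No shared state: one substitution is needed on a branch below this node.
    return left_states | right_states, left_cost + right_cost + 1

def parsimony_cost(tree):
    return fitch(tree)[1]

# The same four observed bases arranged on two candidate trees.
print(parsimony_cost((("A", "A"), ("C", "C"))))  # 1
print(parsimony_cost((("A", "C"), ("A", "C"))))  # 2
```

Under maximum parsimony the first tree would be preferred for this site, because it explains the data with fewer substitutions.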

Page 35:

Branch and Bound methods

Key problem – the number of possible trees grows enormously as the number of species gets large

Branch and bound – a technique that allows large numbers of candidate trees to be rapidly disregarded

Requires a “good guess” at the cost of the best tree

birg.cs.wright.edu/text/Tutorial.ppt

Page 36:

Parsimony – Branch and Bound

Use the UPGMA tree for an initial best estimate of the minimum cost (most parsimonious) tree
Use branch and bound to explore all feasible trees
Replace the best estimate as better trees are found
Choose the most parsimonious

birg.cs.wright.edu/text/Tutorial.ppt

Page 37:

Bioinformatics – Data mining

Clustering (unsupervised learning)
  Similar things go together
  Similarity measure is critical
  Types:
    Hierarchical clustering (UPGMA)
    Partitional clustering (K-means)
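
As an illustration of hierarchical clustering, here is a compact UPGMA sketch: repeatedly merge the two closest clusters and set the distance from the merged cluster to every other cluster to the size-weighted average. The input format (a dict of pairwise distances keyed by name pairs), the example distances and the function name are assumptions for the example; branch lengths are not computed.

```python
from itertools import combinations

def upgma(names, distances):
    """Return the UPGMA tree topology as nested tuples.

    `distances` maps each unordered pair of names, e.g. ("A", "B"),
    to a distance.
    """
    d = {frozenset(pair): value for pair, value in distances.items()}
    sizes = {name: 1 for name in names}  # cluster label -> number of leaves
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        for other in sizes:
            # Size-weighted average of the distances to the merged members.
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))

example = {
    ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
    ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10,
}
print(upgma(["A", "B", "C", "D"], example))  # ('D', ('C', ('A', 'B')))
```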

Page 38:

Bioinformatics – Data mining

Classification (supervised learning)
  To predict!
  Pre-processing: tidy up your materials!
  Feature selection: the key points to go over
  Classifier: the thinking style/manner of how to combine the key points and get some answer
  Training: your practice of your thinking manner with answers known
  Validation: mock quiz to evaluate what you’ve learnt from the training
  Testing: your examination!

\\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class1.pdf

Underfitting & Overfitting
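
The training, validation and testing roles only make sense if the three sets do not overlap. A minimal sketch of such a split follows; the 60/20/20 proportions, the fixed seed and the function name are assumptions for the example, not values from the slides.

```python
import random

def split_dataset(samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the samples and cut them into train, validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out
    return train, validation, test

train, validation, test = split_dataset(range(10))
print(len(train), len(validation), len(test))  # 6 2 2
```

A model that scores well on the training set but poorly on the validation set is overfitting; one that scores poorly on both is underfitting.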

Page 39:

Bioinformatics – Data mining

Evaluation (scores!)
  Confusion matrix
  Binary classification
  Performance evaluation metrics
    Accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$
    Sensitivity/Recall/TP Rate = $\frac{TP}{TP + FN}$
    Specificity/TN Rate = $\frac{TN}{TN + FP}$
    Precision/PPV = $\frac{TP}{TP + FP}$
    …

\\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class3.pdf
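
The four metrics follow directly from the confusion-matrix counts. A small sketch, with made-up counts and an illustrative function name:

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard metrics from binary confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # recall, true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": tp / (tp + fp),    # positive predictive value
    }

# Hypothetical counts: 50 actual positives, 50 actual negatives.
metrics = binary_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```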

Page 40:

Bioinformatics – Data mining

Evaluation
  ROC (Receiver Operating Characteristic)
  Trade-off between positive hits (TP) and false alarms (FP)
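
An ROC curve can be traced by lowering the decision threshold one prediction at a time and recording the false positive rate and true positive rate at each step. The sketch below assumes distinct scores and at least one example of each class; the function names are illustrative.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points of the ROC curve.

    `labels` are 1 for positive and 0 for negative; a higher score means
    "more likely positive".
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

points = roc_points([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6])
print(points)
print(area_under_curve(points))  # 0.75
```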

Page 41:

Statistical Tests

Many different kinds of tests
You should choose the appropriate ones

Page 42:

Where to get data

Databases
  Transfac: TF and TFBS sequence data
  Protein Data Bank: 3D structures of proteins and of protein-DNA and protein-ligand complexes (sequences and atoms included as well)
There are thousands more… find the ones that fit your topic

Page 43:

Where to get data

Typical format: tags + descriptions in plain text

Page 44:

Where to get data

We have to parse and pre-process data before using it
  Tedious and time-consuming process
  Some packages can help accelerate this: BioPerl, BioJava, BioPython…
Besides data, sometimes evaluation has to be done with literature evidence (manual!)
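
For a sense of what the parsing step involves, here is a hand-rolled sketch for a "tag + description" flat file. The record layout (a two-letter tag, then the value, with `//` ending each record) and the sample entries are hypothetical stand-ins, not the exact format of any particular database; in practice a package such as those listed above is usually the better choice.

```python
def parse_records(text):
    """Parse tagged plain-text records into a list of {tag: [values]} dicts."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("//"):  # assumed end-of-record marker
            if current:
                records.append(current)
            current = {}
            continue
        if not line.strip():
            continue  # skip blank lines
        tag, _, value = line.partition(" ")
        current.setdefault(tag, []).append(value.strip())
    if current:
        records.append(current)  # last record without a terminator
    return records

# Hypothetical example data, made up for this sketch.
EXAMPLE = """\
ID  EXAMPLE_SITE_1
DE  hypothetical binding site
SQ  TATAA
//
ID  EXAMPLE_SITE_2
SQ  CGATTGA
//
"""
records = parse_records(EXAMPLE)
print(len(records), records[0]["SQ"])  # 2 ['TATAA']
```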

Page 45:

Where to get papers (published)

A difficult question…
  Your research quality, your writing and organization, plus some luck…
  “Know yourself and know your opponent” (知己知彼): learn from the published papers and compare your research topic and level to them

Where to find papers to read
  Play on the CS side:
    IEEE Transactions, ACM Transactions
    IEEE and ACM top conferences
  Play on the Bioinformatics side:
    Bioinformatics, BMC Bioinformatics, Nucleic Acids Research
    PLoS Computational Biology…
  Aim high:
    Nature (series), Science
    PNAS, Cell, …

Page 46:

Roadmap

Page 47:

Not The End

Your corresponding tutor will have more project-specific stuff to tell you

Thanks

Q & A