Bioinformatics Basics
Cyrus Chan, Peter Lo, David Lam
Courtesy of LO Leung Yau’s original presentation
Dec 22, 2015
Outline
Biological Background
  Cell
  Protein
  DNA & RNA
  Central Dogma
  Gene Expression
Bioinformatics
  Sequence Analysis
  Phylogenetic Trees
  Data Mining
Biological Background – Cell
Basic unit of organisms
  Prokaryotic (lacks a cell nucleus)
  Eukaryotic
A bag of chemicals
Metabolism controlled by various enzymes
Correct working needs suitable amounts of various proteins
Picture taken from http://en.wikipedia.org/wiki/Cell_(biology)
Biological Background – Protein
Polymer of 20 types of Amino Acids
Folds into 3D structure
Shape determines the function
Many types
  Transcription Factors
  Enzymes
  Structural Proteins
  …
Picture taken from http://en.wikipedia.org/wiki/Protein and http://en.wikipedia.org/wiki/Amino_acid
Biological Background – DNA & RNA
DNA
  Double stranded
  Adenine, Cytosine, Guanine, Thymine
  A-T, G-C
  Those parts coding for proteins are called genes
RNA
  Single stranded
  Adenine, Cytosine, Guanine, Uracil
Picture taken from http://en.wikipedia.org/wiki/Gene
Chromosome
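The A-T / G-C pairing rule and the T-to-U difference between DNA and RNA can be illustrated with a small sketch. The function names `reverse_complement` and `transcribe` are illustrative choices, not part of the original slides, and the sketch assumes sequences containing only A, C, G and T.

```python
# Base-pairing rule from the slide: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5'->3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(dna):
    """Write a DNA coding-strand sequence as RNA (thymine -> uracil)."""
    return dna.replace("T", "U").replace("t", "u")


print(reverse_complement("ACATGG"))  # CCATGT
print(transcribe("ACATGG"))          # ACAUGG
```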
Biological Background – Genes
Genes – protein-coding regions
3 nucleotides (a codon) code for one amino acid
There are also start and stop codons
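As a rough illustration of the codon idea, the sketch below reads a DNA string three nucleotides at a time, starting at the first ATG and stopping at a stop codon. The table is deliberately partial (only the codons used in the example, taken from the standard genetic code), and the function name `translate` is an illustrative choice.

```python
# Partial codon table from the standard genetic code (illustration only).
CODON_TABLE = {
    "ATG": "M",  # methionine, also the usual start codon
    "GCC": "A",  # alanine
    "GAT": "D",  # aspartate
    "CAG": "Q",  # glutamine
    "TAA": "*", "TAG": "*", "TGA": "*",  # stop codons
}


def translate(dna):
    """Translate from the first ATG up to (not including) a stop codon."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X: not in this partial table
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("acATGGCCGATCAGTAAgg"))  # MADQ
```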
Biological Background – in a nutshell
Abstractions – the Central Dogma
Functional Units: Proteins
Templates: RNAs
Blueprints: DNAs
Not only the information (data) is passed along, but also the control signals about what and how much data is to be sent
Proteins (TFs) also help
Biological Background
…acatggccgatca…tcaccctgaacatgtcgctttaacctactggtgatgcacct…atgatcaggg…atactggatacagggcata….
[Figure labels: two genes, each giving RNA and then Protein; intergenic region (“non-coding region”) between the genes]
Biological Background
…acatgggcgatca…tcaccctgaacatgtcgctttaacctactggtgatgcacct…atgatcaggg…atactggatacagggcata….
[Figure labels: two genes, each giving RNA and then Protein; the first protein is marked “malfunctioning”; intergenic region (“non-coding region”) between the genes]
Genetic Disease caused by a single mutation
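The single mutation in the example can be located by comparing the two sequences position by position. This is a minimal sketch, assuming the two fragments have equal length and are already aligned; `point_differences` is an illustrative name.

```python
def point_differences(reference, sample):
    """List (0-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    return [
        (i, ref_base, alt_base)
        for i, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


# First fragment of the normal and the mutated sequence from the slides.
print(point_differences("acatggccgatca", "acatgggcgatca"))  # [(6, 'c', 'g')]
```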
Biological Background
There can be multiple mutations that cause diseases (increase risks of diseases)
[Figure: aligned DNA from different people, labelled “Normal” and “Disease!”, differing at single positions; base pairs shown as A-T and C-G]
SNP (single nucleotide polymorphism)
Biological Background – Sequences
Abstractions
Sequences
…acatggccgatcaggctgtttttgtgtgcctgtttttctattttacgtaaatcaccctgaacatgtTTGCATCAacctactggtgatgcacctttgatcaatacattttagacaaacgtggtttttgagtccaaagatcagggctgggttgacctgaatactggatacagggcatataaaacaggggcaaggcacagactc…
FT   intron          <1..28
FT                   /gene="CREB"
FT                   /number=3
FT                   /experiment="experimental evidence…
FT                   recorded"
FT   exon            29..174
FT                   /gene="CREB"
FT                   /number=4
FT                   /experiment="experimental evidence…
FT                   recorded"
FT   intron          175..>189
FT                   /gene="CREB"
FT                   /number=4
Annotations
Visualizations
Biological Background – DNA RNA Protein
A transcriptional regulatory network is the complex interaction between genes, transcription factors (TFs) and transcription factor binding sites (TFBSs).
[Figure labels: Other functions; Transcription Factors; Binding sites; Genes; Promoter regions]
Gene Expression Microarray Data
High throughput
Measures RNA level
Relies on A-T, G-C pairing
Can monitor expression of many genes
Picture taken from http://en.wikipedia.org/wiki/DNA_microarray_experiment
Gene Expression Microarray Data
Picture taken from http://en.wikipedia.org/wiki/DNA_microarray
Genes
Time points/Conditions
Colors: Expression (RNA) Levels
Bioinformatics – Sequence Analysis
Alignments
  A way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences
http://en.wikipedia.org/wiki/Sequence_alignment
Bioinformatics – Sequence Analysis
Pair-wise alignments
  Method: dynamic programming!
  No penalty for the consecutive ‘-’s (gaps) before and after the sequence to be aligned
\\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC3220 Lectures
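The dynamic-programming idea with free end gaps can be sketched as a small scoring routine. This is only an illustration under assumed scores (match +1, mismatch -1, gap -2, all hypothetical defaults); it returns the best score for placing a short `query` somewhere along a longer `target`, and does not recover the alignment itself.

```python
def semiglobal_score(query, target, match=1, mismatch=-1, gap=-2):
    """Best score for aligning all of `query` against any stretch of `target`.

    Gaps before and after the query (unused target prefix/suffix) cost nothing.
    """
    n = len(target)
    prev = [0] * (n + 1)  # row 0: skipping a target prefix is free
    for i in range(1, len(query) + 1):
        curr = [i * gap] + [0] * n  # column 0: query bases aligned to gaps
        for j in range(1, n + 1):
            pair = match if query[i - 1] == target[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + pair,  # align query[i-1] with target[j-1]
                prev[j] + gap,       # query base opposite a gap
                curr[j - 1] + gap,   # target base opposite a gap
            )
        prev = curr
    return max(prev)  # skipping a target suffix is free


print(semiglobal_score("gatca", "acatggccgatcagg"))  # 5 (exact occurrence)
print(semiglobal_score("gatta", "ccgatcagg"))        # 3 (one mismatch)
```

Filling the table takes time proportional to the product of the two sequence lengths, which is why the pairwise case is practical.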
Bioinformatics – Sequence Analysis
Multiple (global) sequence alignment
  Also dynamic programming (but it can’t scale up: the cost grows exponentially with the number of sequences!)
Bioinformatics – Sequence Analysis
Multiple local sequence alignment
  i.e. Motif (pattern) discovery
>seq1
acatggccgatcagctggtttttgtgtgcctgtttctgaatc
>seq2
ttctattttacgtaaatcagcttgaacatgtacctactggtg
>seq3
atgcacctttgatcaataccagctagacaaacgtgtgttg
>seq4
agtccaaagatcagggctggctgaatactggatcagct
>seq5
cagctacagggcatataaaggggcaaggcacagactc
Such overrepresented patterns are often important components (e.g. TFBSs if the sequences are promoters of similar genes).
TFBSs are the controlling key holes in gene regulation!
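A very naive way to look for an overrepresented pattern is to collect the k-mers that occur in every sequence. The sketch below does this for the five example sequences; it only finds exact matches, whereas real motif discovery tools tolerate mismatches and score patterns statistically. The name `shared_kmers` and the choice k = 5 are illustrative assumptions.

```python
def shared_kmers(sequences, k):
    """Return the set of k-mers that occur at least once in every sequence."""
    kmer_sets = [
        {seq[i:i + k] for i in range(len(seq) - k + 1)}
        for seq in sequences
    ]
    return set.intersection(*kmer_sets)


SEQS = [
    "acatggccgatcagctggtttttgtgtgcctgtttctgaatc",
    "ttctattttacgtaaatcagcttgaacatgtacctactggtg",
    "atgcacctttgatcaataccagctagacaaacgtgtgttg",
    "agtccaaagatcagggctggctgaatactggatcagct",
    "cagctacagggcatataaaggggcaaggcacagactc",
]

# "cagct" is among the 5-mers shared by all five example sequences.
print("cagct" in shared_kmers(SEQS, 5))  # True
```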
DNA motifs
Similar DNA fragments across individuals and/or species
TFBS motifs: DNA fragments similar to “TATAA” are common, in order to recruit the polymerase to initiate transcription in eukaryotes
Expensive and time-consuming to try a large set of candidates in biological experiments
[Figure labels: DNA; TF (Transcription Factor); TFBS (controlling), e.g. TATAA; Gene (functioning); Transcription; RNA; Translation; Protein; Motif discovery; CGATTGA; Similar controlled functions, e.g. cancer gene activities; Maximized]
TFBS Motif Discovery
Motif discovery usually refers to TFBS motifs
But motif is a general term meaning “pattern”: sequence motifs, structural motifs, network motifs…
ChIP-Seq motif discovery
The same as traditional TFBS motif discovery in principle
Data input precision and scale are different
  Genome-wide: tens of thousands of sequences
  Short: 50-100 bp
  Each sequence measured by some enrichment score (a peak)
Introduction
ChIP-Seq technology
Peak-calling
…
High-resolution sequences from more direct binding evidence; The enriched regions are likely to contain motifs coupled with peak signals; genome-wide sequences; in vivo
Too many sequences for older methods to handle
Phylogenetic Trees (Phylogenies)
Preliminaries
Distance-based methods
Parsimony Methods
Adopted from: Fundamental Concepts of Bioinformatics, Michael L. Raymer, Computer Science, Biomedical Sciences, Wright State University, birg.cs.wright.edu/text/Tutorial.ppt
Phylogenetic Trees
Hypothesis about the relationship between organisms
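Can be rooted or unrooted
[Figure: a rooted tree over species A, B, C, D, E with a time axis starting at the root, and an unrooted tree over the same species]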
Tree proliferation
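For n species, the numbers of rooted trees N_R and unrooted trees N_U are:
$$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$$
$$N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$$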
Species   Number of Rooted Trees              Number of Unrooted Trees
2         1                                   1
3         3                                   1
4         15                                  3
5         105                                 15
10        34,459,425                          2,027,025
15        213,458,046,676,875                 7,905,853,580,625
20        8,200,794,532,637,891,559,375       221,643,095,476,699,771,875
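The tree counts follow directly from the two factorial formulas, N_R = (2n-3)! / (2^(n-2) (n-2)!) and N_U = (2n-5)! / (2^(n-3) (n-3)!). A minimal sketch using exact integer arithmetic; the function names are illustrative, and the n = 2 case for unrooted trees is handled separately because the formula would need (-1)!.

```python
from math import factorial


def rooted_trees(n):
    """Number of rooted binary trees for n species: (2n-3)! / (2^(n-2) (n-2)!)."""
    if n < 2:
        raise ValueError("need at least 2 species")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n):
    """Number of unrooted binary trees for n species: (2n-5)! / (2^(n-3) (n-3)!)."""
    if n < 2:
        raise ValueError("need at least 2 species")
    if n == 2:
        return 1  # a single branch joining the two species
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (2, 3, 4, 5, 10):
    print(n, rooted_trees(n), unrooted_trees(n))
# The n = 10 line prints: 10 34459425 2027025
```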
An ongoing didactic
Pheneticists tend to prefer distance-based metrics, as they emphasize relationships among data sets, rather than the paths they have taken to arrive at their current states.
Cladists are generally more interested in evolutionary pathways, and tend to prefer more evolutionarily based approaches such as maximum parsimony.
Parsimony methods
Belong to the broader class of character-based methods of phylogenetics
Emphasize simpler, and thus more likely, evolutionary pathways
Enumerate all possible trees
Note the number of substitution events invoked by each possible tree
  Can be weighted by transition/transversion probabilities, etc.
Select the most parsimonious
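Counting the substitution events a given tree needs can be sketched with Fitch's small-parsimony algorithm. The slides do not name a specific counting method, so this is one common choice rather than the presenters' own; it handles a single unweighted site, and the nested-tuple tree encoding and the name `fitch_score` are assumptions made for illustration.

```python
def fitch_score(tree, states):
    """Return (possible ancestral states, minimum substitutions) for one site.

    `tree` is a leaf name (str) or a pair (left_subtree, right_subtree);
    `states` maps each leaf name to its observed nucleotide.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_score(left, states)
    right_set, right_cost = fitch_score(right, states)
    common = left_set & right_set
    if common:
        # Children can agree on a state: no extra substitution needed here.
        return common, left_cost + right_cost
    # Children disagree: one substitution is required on this branch pair.
    return left_set | right_set, left_cost + right_cost + 1


site = {"A": "a", "B": "a", "C": "g", "D": "g"}
print(fitch_score((("A", "B"), ("C", "D")), site)[1])  # 1
print(fitch_score((("A", "C"), ("B", "D")), site)[1])  # 2
```

For this site the first tree is the more parsimonious one; a full analysis would sum the scores over all sites and compare every candidate tree.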
Branch and Bound methods
Key problem – the number of possible trees grows enormously as the number of species gets large
Branch and bound – a technique that allows large numbers of candidate trees to be rapidly disregarded
Requires a “good guess” at the cost of the best tree
Parsimony – Branch and Bound
Use the UPGMA tree for an initial best estimate of the minimum cost (most parsimonious) tree
Use branch and bound to explore all feasible trees
Replace the best estimate as better trees are found
Choose the most parsimonious
Bioinformatics—Data mining
Clustering (Unsupervised learning)
  Similar things go together
  Similarity measure is critical
  Types:
    Hierarchical clustering (UPGMA)
    Partitional clustering (K-means)
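Hierarchical clustering with UPGMA can be sketched in a few lines: repeatedly merge the two closest clusters and set the distance from the merged cluster to every other cluster to the size-weighted average. This is a minimal, unoptimized sketch; the distance matrix is made up for illustration, and the function returns only the tree shape as nested tuples (no branch lengths).

```python
def upgma(labels, matrix):
    """Cluster items by UPGMA; `matrix[i][j]` is the distance between items i and j."""
    clusters = {i: (label, 1) for i, label in enumerate(labels)}  # id -> (tree, size)
    dist = {(i, j): matrix[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(labels)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        merged = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            # Size-weighted average keeps this the mean over all original pairs.
            merged[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        dist = {pair: d for pair, d in dist.items() if i not in pair and j not in pair}
        dist.update(merged)
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Hypothetical distances between four items A-D.
D = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(["A", "B", "C", "D"], D))  # ('D', ('C', ('A', 'B')))
```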
Bioinformatics—Data mining
Classification (Supervised Learning)
  To predict!
  Pre-processing – tidy up your materials!
  Feature selection – the key points to go over
  Classifier – the thinking style/manner of how to combine the key points and get some answer
  Training – your practice of your thinking manner with answers known
  Validation – mock quiz to evaluate what you’ve learnt from the training
  Testing – your examination!
\\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class1.pdf
Underfitting & Overfitting
Bioinformatics—Data mining
Evaluation (scores!)
  Confusion Matrix
  Binary Classification
Performance Evaluation Metrics
  Accuracy
  Sensitivity/Recall/TP Rate
  Specificity/TN Rate
  Precision/PPV
  …
\\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class3.pdf
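$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Sensitivity (Recall, TP rate)} = \frac{TP}{TP + FN}$$
$$\text{Specificity (TN rate)} = \frac{TN}{TN + FP}$$
$$\text{Precision (PPV)} = \frac{TP}{TP + FP}$$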
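The four metrics can be computed directly from the confusion-matrix counts. A minimal sketch with made-up counts; it assumes every denominator is non-zero, and the function name is an illustrative choice.

```python
def classification_metrics(tp, tn, fp, fn):
    """Binary-classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # recall, TP rate
        "specificity": tn / (tn + fp),  # TN rate
        "precision": tp / (tp + fp),    # PPV
    }


# Hypothetical counts: 40 TP, 45 TN, 5 FP, 10 FN.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```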
Bioinformatics—Data mining
Evaluation
  ROC (Receiver Operating Characteristics)
  Trade-off between positive hits (TP) and false alarms (FP)
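The trade-off can be made concrete by sweeping a decision threshold over classifier scores and recording the false-positive and true-positive rates at each step. A minimal sketch with made-up scores; it assumes all scores are distinct and that both classes are present, and `roc_points` is an illustrative name.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points as the threshold is lowered past each score.

    `labels` holds 1 for positive and 0 for negative examples.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    tp = fp = 0
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1  # one more positive hit
        else:
            fp += 1  # one more false alarm
        points.append((fp / negatives, tp / positives))
    return points


print(roc_points([0.9, 0.8, 0.6, 0.4], [1, 0, 1, 0]))
# [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```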
Where to get data
Databases
  Transfac – TF and TFBS sequence data
  Protein Data Bank – 3D structures of proteins and of protein-DNA and protein-ligand complexes (sequences and atoms included as well)
There are thousands more… find the ones that fit your topic
Where to get data
We have to parse and pre-process data before using it
  Tedious and time-consuming process
  Some packages can help accelerate this: BioPerl, BioJava, BioPython…
Besides data, sometimes evaluation has to be done with literature evidence (manual!)
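To give a feel for what parsing involves, here is a minimal hand-written reader for FASTA-formatted text. It is only a sketch: it keeps the first word of each header as the record name and ignores anything before the first header. In practice a package such as Biopython (its `Bio.SeqIO` module) would normally be preferred.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping record name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""  # first word of the header line
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequences may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


example = ">seq1 demo record\nacgt\nac\n>seq2\ntt\n"
print(parse_fasta(example))  # {'seq1': 'acgtac', 'seq2': 'tt'}
```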
Where to get papers (published)
A difficult question…
  Your research quality, your writing and organization, plus some luck…
  “Know yourself and know the other side” (知己知彼): learn from the published papers and compare your research topic and level to them
Where to find papers to read
Play on the CS side:
  IEEE Transactions, ACM Transactions
  IEEE and ACM top conferences
Play on the Bioinformatics side:
  Bioinformatics, BMC Bioinformatics, Nucleic Acids Research
  PLoS Computational Biology…
Aim high:
  Nature (series), Science
  PNAS, Cell, …