Introduction to Pattern Recognition

Prediction in Bioinformatics
• What do we want to predict?
  – Features from sequence
  – Data mining
• How can we predict?
  – Homology / Alignment
  – Pattern Recognition / Statistical Methods / Machine Learning
• What is prediction?
  – Generalization / Overfitting
  – Preventing overfitting: Homology reduction
• How do we measure prediction?
  – Performance measures
  – Threshold selection

Henrik Nielsen
Center for Biological Sequence Analysis
Technical University of Denmark
• Proteins belong in different organelles of the cell – and some even have their function outside the cell
• In 1999, Günter Blobel was awarded the Nobel Prize in Physiology or Medicine for the discovery that "proteins have intrinsic signals that govern their transport and localization in the cell"
Data: UniProt annotation of protein sorting
Annotations relevant for protein sorting are found in:
– the CC (comments) lines
– cross-references (DR lines) to GO (Gene Ontology)
– the FT (feature table) lines
ID   INS_HUMAN   Reviewed;   110 AA.
AC   P01308;
...
DE   Insulin precursor [Contains: Insulin B chain; Insulin A chain].
GN   Name=INS;
...
CC   -!- SUBCELLULAR LOCATION: Secreted.
...
DR   GO; GO:0005576; C:extracellular region; IC:UniProtKB.
...
FT   SIGNAL        1     24
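As a rough illustration of how these three line types could be mined, the following Python sketch pulls the sorting-related annotations out of the INS_HUMAN lines above. It assumes the older fixed-layout flat-file format shown on the slide (two-letter line code, whitespace-separated FT positions); the function name sorting_annotations and the shape of the returned dictionary are hypothetical choices made for this example, not part of any UniProt tool.

```python
# Abbreviated copy of the INS_HUMAN lines shown above; the "..." lines of
# the slide are omitted.
ENTRY = """\
ID   INS_HUMAN   Reviewed;   110 AA.
AC   P01308;
DE   Insulin precursor [Contains: Insulin B chain; Insulin A chain].
GN   Name=INS;
CC   -!- SUBCELLULAR LOCATION: Secreted.
DR   GO; GO:0005576; C:extracellular region; IC:UniProtKB.
FT   SIGNAL        1     24
"""


def sorting_annotations(entry):
    """Collect sorting-related annotations from one flat-file entry.

    Hypothetical helper: it only looks at the three line types named on
    the slide (CC, DR and FT) and ignores everything else.
    """
    result = {"location": [], "go_component": [], "signal_peptide": None}
    for line in entry.splitlines():
        code, content = line[:2], line[2:].strip()
        if code == "CC" and content.startswith("-!- SUBCELLULAR LOCATION:"):
            # Keep the text after the topic name, without the final period.
            text = content.split(":", 1)[1].strip()
            result["location"].append(text.rstrip("."))
        elif code == "DR" and content.startswith("GO;"):
            # Fields: database, GO id, "aspect:term", "evidence:source".
            fields = [f.strip() for f in content.rstrip(".").split(";")]
            if len(fields) >= 4 and fields[2].startswith("C:"):
                evidence = fields[3].split(":", 1)[0]
                result["go_component"].append((fields[1], fields[2][2:], evidence))
        elif code == "FT":
            parts = content.split()
            if len(parts) >= 3 and parts[0] == "SIGNAL":
                # Only plain integer positions are handled in this sketch.
                if parts[1].isdigit() and parts[2].isdigit():
                    result["signal_peptide"] = (int(parts[1]), int(parts[2]))
    return result


if __name__ == "__main__":
    print(sorting_annotations(ENTRY))
```

Only GO terms from the cellular component aspect (prefix C:) are kept, since those are the ones that describe location.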
Three types of non-experimental qualifiers in the CC and FT lines:
– Potential: Predicted by sequence analysis methods
– Probable: Inconclusive experimental evidence
– By similarity: Predicted by alignment to proteins with known location
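When building a data set, these qualifiers have to be separated from the annotation itself so that non-experimental entries can be set aside. The Python sketch below shows one way this might be done. It assumes the qualifier appears at the end of the annotation text, either in parentheses (as in "Secreted (By similarity).") or bare (as in an FT description "Potential."); the function name split_qualifier and the example strings are illustrative assumptions, not taken from the slides.

```python
import re

# The three non-experimental qualifiers named on the slide.
NON_EXPERIMENTAL = ("Potential", "Probable", "By similarity")

# A qualifier at the very end of the text, optionally in parentheses and
# optionally followed by a period.
_QUALIFIER = re.compile(
    r"\s*\(?\b(" + "|".join(NON_EXPERIMENTAL) + r")\)?\.?\s*$"
)


def split_qualifier(annotation):
    """Split an annotation into (text, qualifier).

    The qualifier is None when no non-experimental qualifier is found,
    which this sketch treats as "experimentally supported".
    """
    match = _QUALIFIER.search(annotation)
    if match is None:
        return annotation.strip().rstrip("."), None
    text = annotation[:match.start()].strip().rstrip(".")
    return text, match.group(1)


def experimental_only(annotations):
    """Keep only the annotations that carry no non-experimental qualifier."""
    kept = []
    for annotation in annotations:
        text, qualifier = split_qualifier(annotation)
        if qualifier is None:
            kept.append(text)
    return kept


if __name__ == "__main__":
    examples = ["Secreted.", "Secreted (By similarity).", "Nucleus (Potential)."]
    for example in examples:
        print(example, "->", split_qualifier(example))
    print(experimental_only(examples))
```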
Problems in database parsing
Extreme example: A4_HUMAN, Alzheimer disease amyloid protein
CC   -!- SUBCELLULAR LOCATION: Membrane; Single-pass type I membrane
CC       protein. Note=Cell surface protein that rapidly becomes
CC       internalized via clathrin-coated pits. During maturation, the
CC       immature APP (N-glycosylated in the endoplasmic reticulum) moves
CC       to the Golgi complex where complete maturation occurs (O-
CC       glycosylated and sulfated). After alpha-secretase cleavage,
CC       soluble APP is released into the extracellular space and the C-
CC       terminal is internalized to endosomes and lysosomes. Some APP
CC       accumulates in secretory transport vesicles leaving the late Golgi
CC       compartment and returns to the cell surface. Gamma-CTF(59) peptide
CC       is located to both the cytoplasm and nuclei of neurons. It can be
CC       translocated to the nucleus through association with Fe65. Beta-
CC       APP42 associates with FRPL1 at the cell surface and the complex is
CC       then rapidly internalized. APP sorts to the basolateral surface in
CC       epithelial cells. During neuronal differentiation, the Thr-743
CC       phosphorylated form is located mainly in growth cones, moderately
CC       in neurites and sparingly in the cell body. Casein kinase
CC       phosphorylation can occur either at the cell surface or within a
CC       post-Golgi compartment.
...
DR   GO; GO:0009986; C:cell surface; IDA:UniProtKB.
DR   GO; GO:0005576; C:extracellular region; TAS:ProtInc.
DR   GO; GO:0005887; C:integral to plasma membrane; TAS:ProtInc.