Teachers: Doc. Jiří Kléma (CTU Prague, Dept. of Computer Science); Prof. Filip Železný (CTU Prague, Dept. of Computer Science)
Bioinformatics: course introduction
Filip Železný and Jiří Kléma
Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Cybernetics
Intelligent Data Analysis lab
http://ida.felk.cvut.cz
Filip Železný and Jiří Kléma (ČVUT) Bioinformatics - intro 1 / 38
A6M33BIN – Biomedical Engineering and Informatics
B4M36BIN – Open Informatics, Bioinformatics
Purpose of this course:
Understand the computational problems in bioinformatics, the available types of data and databases, and the algorithms that solve the problems.
Methods/Prerequisites
  - mainly: probability and statistics, algorithms (complexity classes), programming skills
  - also: discrete math topics (graphs, automata), relational databases
Lectures may be held in English
  - the OI study program is open to foreign students
Purpose of this lecture
An informal sneak preview of the major bioinformatics topics
NP-complete problem (exponential time in the number of aligned sequences)
Probabilistic Sequence Models
specific sites (substrings) on a sequence have specific roles
e.g. genes or promoters on DNA, active sites on proteins
How to tell them apart?
Markov Chain Model
Each type of site has a different probabilistic model
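As a rough illustration of how one probabilistic model per site type can tell sites apart, the sketch below trains two first-order Markov chains and scores a sequence by its log-odds ratio. The training sequences, the pseudocount, and all function names are hypothetical choices made for this example and do not come from the lecture.

```python
import math

ALPHABET = "ACGT"

def train_markov_chain(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | current)."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    model = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        model[a] = {b: counts[a][b] / total for b in ALPHABET}
    return model

def log_odds(seq, model_pos, model_neg):
    """Sum of log-ratios of transition probabilities; > 0 favours model_pos."""
    return sum(
        math.log(model_pos[cur][nxt] / model_neg[cur][nxt])
        for cur, nxt in zip(seq, seq[1:])
    )

# Hypothetical toy training data: "site" sequences vs. background sequences.
site_examples = ["CGCGCGCG", "GCGCGCGC", "CGCGGCGC"]
background_examples = ["ATATTAAT", "TTAATATA", "AATTATAT"]

site_model = train_markov_chain(site_examples)
background_model = train_markov_chain(background_examples)

score_site = log_odds("CGCGCG", site_model, background_model)
score_bg = log_odds("ATATAT", site_model, background_model)
print(f"CGCGCG log-odds: {score_site:.2f}")
print(f"ATATAT log-odds: {score_bg:.2f}")
```

A positive score means the sequence is better explained by the site model, a negative score by the background model. The pseudocount keeps unseen transitions from producing zero probabilities.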
Protein Spatial Structure
From the DNA nucleic-acid sequence, the protein amino-acid sequence is constructed by cell machinery
The protein folds into a complex spatial conformation
Spatial conformation can be determined at high cost
e.g. X-ray crystallography
Determined structures are deposited in public protein databases
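The first step above, going from a nucleic-acid sequence to an amino-acid sequence, can be sketched in a few lines. This is a simplified illustration: it reads the coding DNA strand directly with the standard genetic code and skips the intermediate mRNA step that the cell actually uses. The function name and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
# (the third base varies fastest). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(coding_dna):
    """Translate a coding DNA sequence codon by codon until a stop codon."""
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        amino_acid = CODON_TABLE[coding_dna[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical example sequence: ATG GCC TGG TAA
print(translate("ATGGCCTGGTAA"))
```

The sequence alone says nothing about the folded conformation, which is why the later slides treat structure determination and prediction as separate problems.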
Protein Structure Prediction
Can we compute protein structure from sequence?
At least distinguish α-helices from β-sheets
A very difficult problem, not yet solved
Approaches include machine learning
Protein Function Prediction
Protein function is given by its geometrical conformation
E.g., ability to bind to DNA or to other proteins
The active site (shown in purple) is most important
Important machine-learning tasks:
  - prediction of function from structure
  - detection of active sites within structure
Figure legend: purple - active site
Protein Docking Problem
Proteins interact by docking
Will a protein dock into another protein?
Optimization problem in a geometrical setting
Important for novel drug discovery
  - e.g., green - receptor, red - drug
  - the trouble is that the protein may also dock into many unwanted receptors
  - immensely hard computational problems under uncertainty
Gene Expression Analysis
A gene is expressed if the cell produces proteins according to it
Rate of expression can be measured for thousands of genes simultaneously by microarrays
Can we predict phenotype (e.g. diseases) by gene expression profiling?
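To make the prediction question concrete, here is a minimal sketch of one simple approach, a nearest-centroid classifier over expression profiles. The choice of classifier, the three-gene profiles, and the labels are assumptions made purely for illustration; the lecture does not prescribe a method, and real profiles contain thousands of genes.

```python
def centroid(profiles):
    """Gene-wise mean of a list of equally long expression profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def squared_distance(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit(profiles, labels):
    """Compute one centroid per phenotype label."""
    by_label = {}
    for profile, label in zip(profiles, labels):
        by_label.setdefault(label, []).append(profile)
    return {label: centroid(group) for label, group in by_label.items()}

def predict(centroids, profile):
    """Return the label whose centroid is closest to the profile."""
    return min(centroids, key=lambda label: squared_distance(centroids[label], profile))

# Hypothetical toy data: expression levels of three genes in four samples.
train_profiles = [
    [1.0, 5.0, 2.0],
    [1.2, 4.8, 2.1],
    [4.0, 1.0, 2.0],
    [3.8, 1.3, 1.9],
]
train_labels = ["healthy", "healthy", "disease", "disease"]

centroids = fit(train_profiles, train_labels)
print(predict(centroids, [3.5, 1.5, 2.0]))
print(predict(centroids, [0.9, 5.2, 2.2]))
```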
High-throughput data analysis
Gene expression data are called high-throughput since lots of measurements (thousands of genes) are produced in a single experiment
Puts biologists in a new, difficult situation: how to interpret such data?
Example problems:
  - Too many suspects (genes), multiple hypothesis testing
  - How to spot functional patterns among so many variables?
  - How to construct multi-factorial predictive models?
Wide opportunities for novel data analysis methods, incl. machine learning
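The multiple hypothesis testing problem mentioned above can be illustrated with two standard corrections: Bonferroni, which controls the family-wise error rate, and Benjamini-Hochberg, which controls the false discovery rate. The p-values below are hypothetical and stand in for one test per gene; this is a sketch, not the course's reference implementation.

```python
def bonferroni(p_values, alpha=0.05):
    """Indices of hypotheses rejected under the Bonferroni correction."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]

def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of hypotheses rejected at false discovery rate alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        # Find the largest rank whose p-value is below its step-up threshold.
        if p_values[idx] <= rank * alpha / m:
            largest_k = rank
    return sorted(order[:largest_k])

# Hypothetical p-values, one per tested gene.
p_values = [0.039, 0.001, 0.216, 0.008, 0.041, 0.06, 0.042, 0.205, 0.074, 0.212]

print("uncorrected:", [i for i, p in enumerate(p_values) if p <= 0.05])
print("Bonferroni:", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

Without correction, five of the ten toy genes pass the 0.05 threshold. With thousands of genes, dozens would pass by chance alone, which is why a correction is needed.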
Other high-throughput technologies
Methylation arrays (epigenetics)
ChIP-on-chip (protein × DNA interactions)
Mass spectrometry (presence of proteins)
... and more
Genome-wide association studies
Correlate traits (e.g. susceptibility to disease) with genetic variations
“variations”: single nucleotide polymorphisms (SNPs) in the DNA sequence
involves a population of people
X: SNPs, Y: level of association
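One way to obtain a "level of association" for a single SNP is a chi-square test on allele counts in cases versus controls, reported as -log10 of the p-value. The sketch below assumes that convention and uses made-up counts for two hypothetical SNPs; the lecture does not state which statistic its plot uses.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    Assumes all row and column totals are non-zero.
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_1df(chi2):
    """Upper-tail probability of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

def association_level(case_minor, case_major, control_minor, control_major):
    """-log10(p) for the allele-count table of one SNP."""
    chi2 = chi_square_2x2(case_minor, case_major, control_minor, control_major)
    return -math.log10(p_value_1df(chi2))

# Hypothetical allele counts: (case minor, case major, control minor, control major).
snps = {
    "snp_A": (60, 40, 30, 70),
    "snp_B": (50, 50, 48, 52),
}
levels = {name: association_level(*counts) for name, counts in snps.items()}
for name, level in levels.items():
    print(f"{name}: -log10(p) = {level:.2f}")
```

In a genome-wide study, this is repeated for every SNP, so the multiple-testing issue from the high-throughput slide applies here as well.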
Gene Regulatory Networks
Feedback loops in expression:
  - (a protein coded by) a gene influences the expression of another gene
  - positively (transcription factor) or negatively (inhibitor)
Results in extremely complex networks with intricate dynamics
Most regulatory networks are unknown or only partially known.
Can we infer such networks from time-stamped gene expression data?
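A very crude way to approach the inference question is to correlate each gene's expression at time t with every other gene's expression at time t + 1. The sketch below does exactly that on hypothetical toy series in which gene A activates gene B and represses gene C with a one-step delay. Lagged correlation is only a heuristic chosen for illustration. It does not establish causation, and it is not presented here as the method used in the course.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equally long, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def lagged_influence(series, lag=1):
    """Score each ordered pair (regulator, target).

    The score is the correlation of the regulator at time t
    with the target at time t + lag.
    """
    scores = {}
    for regulator, x in series.items():
        for target, y in series.items():
            if regulator != target:
                scores[(regulator, target)] = pearson(x[:-lag], y[lag:])
    return scores

# Hypothetical time series: B follows A one step later, C mirrors A one step later.
series = {
    "A": [1, 2, 3, 4, 3, 2, 1, 2],
    "B": [0, 1, 2, 3, 4, 3, 2, 1],
    "C": [5, 4, 3, 2, 1, 2, 3, 4],
}
scores = lagged_influence(series)
print(f"A -> B: {scores[('A', 'B')]:+.2f}")
print(f"A -> C: {scores[('A', 'C')]:+.2f}")
```

A strongly positive score suggests activation and a strongly negative one suggests inhibition, matching the two kinds of influence listed above.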
Metabolic Networks
Capture metabolism (energy processing) in cells
Involve genes/proteins but also other molecules
Computational problems similar to those in gene regulatory networks
Exploiting Background Knowledge
The bioinformatics tasks exemplified so far followed the pattern
Data → Genomic knowledge
A lot of relevant formal (computer-understandable) knowledge is available, so the equation should be
Data + Current Genomic Knowledge → New Genomic Knowledge
for example:
Gene expression data + Known functions of genes → Phenotype linked to a gene function
But how to represent background knowledge and use it systematically in data analysis?
Important bioinformatics problem
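One common way to combine expression data with known gene functions is an over-representation test: given the genes found to be differentially expressed, ask whether a particular functional annotation appears among them more often than chance would predict. The sketch below computes the hypergeometric upper-tail probability for that question. The numbers are hypothetical, and the test is just one illustration of the "Data + Current Genomic Knowledge" pattern.

```python
from math import comb

def hypergeom_upper_tail(k, population, annotated, drawn):
    """P(X >= k), where X counts annotated genes in a random draw.

    The draw takes `drawn` genes, without replacement, from a
    population of which `annotated` carry the annotation.
    """
    total = comb(population, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(k, min(annotated, drawn) + 1)
    )
    return favourable / total

# Hypothetical numbers: 20 genes measured, 5 annotated with some function,
# 4 differentially expressed, 3 of those 4 carry the annotation.
p = hypergeom_upper_tail(k=3, population=20, annotated=5, drawn=4)
print(f"enrichment p-value: {p:.4f}")
```

A small p-value suggests that the phenotype under study is linked to the annotated function rather than to individual genes in isolation.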
Bioinformatics: impact in scientific literature
Bioinformatics programs are 31-fold over-represented among the highest impact scientific papers of the past two decades [Wren, Bioinformatics ’16]
IDA methods in journal papers
IDA applications in medical studies
Bioinformatics at the IDA lab
If you find this course interesting, you can take part in IDA’s research!