Bioinformatics
The application of computer science to biological data
Tony C Smith
Department of Computer Science
University of Waikato
[email protected]
The essence is prediction …
My dog is very littl_ ?
We know that letters do not occur in English at random (e.g. ‘t’ is more common than ‘x’)
We know that context changes the probability of a letter (e.g. ‘x’ is more common than ‘t’ after the sequence “I eat Weet-Bi_”)
Predicting symbols is fundamental to a wide range of important applications (e.g. encryption, compression)
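The idea that context changes the probability of the next letter can be sketched with a small character model. This is only an illustration under stated assumptions: the toy corpus, the function names `train` and `predict`, and the context length of three characters are all invented here and do not come from the talk.

```python
from collections import Counter, defaultdict


def train(text, order=2):
    """Count which character follows each context of `order` characters."""
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        counts[text[i - order:i]][text[i]] += 1
    return counts


def predict(counts, context, order=2):
    """Return the most frequent follower of the last `order` characters, or None."""
    followers = counts.get(context[-order:])
    if not followers:
        return None  # this context was never seen in training
    return followers.most_common(1)[0][0]


# Hypothetical training text; a real model would use far more data.
corpus = "my dog is very little and my cat is very little too"
model = train(corpus, order=3)
guess = predict(model, "my dog is very littl", order=3)
print(guess)
```

With this corpus the context "ttl" has only ever been followed by "e", so that is the letter the sketch proposes.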
Prediction in bioinformatics
Predicting the location of genes in DNA
Predicting gene roles in an organism
Predicting errors in a genetic transcription
Predicting the function of proteins
Predicting diseases from molecular samples
Anything that involves “making a judgment”; a yes/no decision about whether some sample datum ‘does’ or ‘does not’ have some property.
Representation
W e e t – B i x
0101011101100101011001010111010000101101 …
… to the computer, everything is binary!
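As a small check of the representation shown above, the following sketch converts text to its 8-bit ASCII codes and back. The helper names `to_bits` and `from_bits` are made up for this example, and a plain ASCII hyphen is assumed for the dash.

```python
def to_bits(text):
    """Concatenate the 8-bit ASCII code of every character."""
    return "".join(format(byte, "08b") for byte in text.encode("ascii"))


def from_bits(bits):
    """Inverse of to_bits: read the string eight bits at a time."""
    codes = [int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)]
    return bytes(codes).decode("ascii")


bits = to_bits("Weet-")
print(bits)             # 0101011101100101011001010111010000101101
print(from_bits(bits))  # Weet-
```

The 40 bits printed for the first five characters are the same 40 bits shown on the slide.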
0101011101100101011001010111010000101101
0101101100100111111011010011010000101101
A A C G T C A T T C G A T G A T T C G A
Just as we can teach a computer to predict things about a sequence of letters in English prose, we can also teach it to predict things about other sequences, such as a genetic sequence.
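Because DNA has only four letters, each nucleotide fits in two bits, which is why 20 nucleotides can sit beside a 40-bit string. The sketch below shows one possible encoding. The particular mapping (A=00, C=01, G=10, T=11) is an assumption chosen for illustration and is not claimed to be the one used on the slide.

```python
# Assumed 2-bit codes; any one-to-one assignment would work equally well.
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
DECODE = {bits: base for base, bits in CODE.items()}


def encode_dna(sequence):
    """Turn a nucleotide string into a bit string, two bits per base."""
    return "".join(CODE[base] for base in sequence)


def decode_dna(bits):
    """Turn a bit string back into nucleotides."""
    return "".join(DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))


sequence = "AACGTCATTCGATGATTCGA"  # the 20 nucleotides from the slide
encoded = encode_dna(sequence)
print(len(encoded))                # 40 bits for 20 bases
print(decode_dna(encoded) == sequence)
```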
A genetic prediction problem
A gene encodes a protein
It is a blueprint that provides biochemical instructions on how to construct a sequence of amino acids so as to make a working protein that will perform some function in the organism
A genetic prediction problem
[Figure labels: encoding region, untranslated region, transcription factor, RNA]
A genetic prediction problem
untranslated region
A genetic prediction problem
Clues: Where there is one binding site, often there is another nearby.
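The clue above can be turned into a simple search: locate every occurrence of a candidate motif, then keep the occurrences that have a neighbour close by. This is a rough sketch only. The motif "GACG", the example sequence, and the distance threshold are hypothetical, and real binding sites are rarely exact string matches.

```python
def find_sites(dna, motif):
    """Return the start position of every occurrence of motif (overlaps allowed)."""
    positions = []
    start = dna.find(motif)
    while start != -1:
        positions.append(start)
        start = dna.find(motif, start + 1)
    return positions


def nearby_pairs(positions, max_gap):
    """Keep consecutive sites whose start positions are at most max_gap apart."""
    return [(a, b) for a, b in zip(positions, positions[1:]) if b - a <= max_gap]


# Hypothetical sequence: two sites close together, one far away.
dna = "TT" + "GACG" + "TTTT" + "GACG" + "A" * 20 + "GACG" + "T"
sites = find_sites(dna, "GACG")
pairs = nearby_pairs(sites, max_gap=10)
print(sites)  # [2, 10, 34]
print(pairs)  # [(2, 10)]
```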
A genetic prediction problem
All of these properties are the kinds of things for which computer science has developed algorithms and data structures to identify quickly and efficiently, and therefore it is exactly the kind of problem computer scientists should be able to solve.
Proteomics
Three consecutive nucleotides in the coding region form a ‘codon’ … i.e. encode an amino acid.
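Reading a coding region three nucleotides at a time can be sketched as follows. The table is the standard genetic code written in the conventional T, C, A, G ordering, with `*` marking a stop codon. The function name `translate` and the short example sequence are invented for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon from TTT, TTC, TTA, ... to GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO)
}


def translate(coding_dna):
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        amino = CODON_TABLE[coding_dna[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


print(translate("ATGTGTGGCTAA"))  # MCG
```

Here ATG, TGT and GGC encode methionine, cysteine and glycine, and TAA ends translation.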
A relatively short sequence of amino residues at the N-terminus of the nascent protein
Maximum entropy (Clote, 2002)
SignalP (Nielsen et al., 1997-2004)
HMMs (or NNs) used to predict cleavage point
Existing methods all perform reasonably well and with about the same accuracy (90% eukaryotes, 87% gram-, 85% gram+)
Do not offer a transparent explanatory framework as to the underlying biology
Many other learning algorithms do! (WEKA data mining tools, Waikato University)
From sequences to text
Primary sequence data has many similarities with text
– Amino residues (letters)
– Polypeptides (words)
– Secondary structures (phrases/sentences)
From sequences to text
Primary sequence data looks like text
– Amino residues (letters)
– Polypeptides (words)
– Secondary structures (phrases/sentences)
– Tertiary structure (whole documents)
Approach: transform a sequence into a set of pseudo-text documents
Approach
Problem is stated as two-class:
an amino acid is either the first residue of the mature protein or it is not
Each residue is described by a single document, which includes as many electrochemical, structural or contextual facts as are available (desirable)
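One way to picture the two-class framing is to generate one small document per residue, together with a yes/no label. The sketch below records only contextual facts (the neighbouring residues). The token format, the window size, the toy sequence and the cleavage position are all assumptions made for this example, not details taken from the talk.

```python
def residue_documents(sequence, cleavage_index, window=2):
    """Build one (document, label) pair per residue.

    The label is True only for the first residue of the mature protein,
    which is assumed to sit at cleavage_index.
    """
    examples = []
    for i, residue in enumerate(sequence):
        tokens = [f"residue_{residue}"]
        for offset in range(-window, window + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(sequence):
                # Contextual fact: which residue sits at this relative position.
                tokens.append(f"pos{offset:+d}_{sequence[j]}")
        examples.append((" ".join(tokens), i == cleavage_index))
    return examples


# Hypothetical six-residue sequence with the mature protein starting at index 4.
examples = residue_documents("MKAVLA", cleavage_index=4)
for document, label in examples:
    print(label, document)
```

Every document can then be handed to an ordinary text classifier, with exactly one positive example per sequence.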
Properties of amino acids
Free facts about amino acids
Residue as a document
E.g. Cysteine Cys C
aliphatic [yes], aromatic [no], hydrophobic [yes], charge [-], polarized [yes], small [no], number of nitrogen atoms [1], contains sulphur [yes], has a carbon ring [no], ionized [yes], valence [2], cbeta [no], covalent [yes], h-bond [yes],
etc. (whatever else experimenter wants to include)
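A property list like the one above can be flattened into a pseudo-text document by turning each fact into a single token. The sketch below copies a few of the cysteine values exactly as listed on the slide. The token format and the function name `property_document` are illustrative assumptions.

```python
# A subset of the facts listed on the slide for cysteine, copied as given.
CYSTEINE = {
    "aliphatic": "yes",
    "aromatic": "no",
    "hydrophobic": "yes",
    "small": "no",
    "number of nitrogen atoms": "1",
    "contains sulphur": "yes",
    "has a carbon ring": "no",
    "h-bond": "yes",
}


def property_document(name, properties):
    """Join a residue name and its facts into one space-separated document."""
    tokens = [name.lower()]
    for key, value in properties.items():
        # Replace spaces and hyphens so each fact stays a single token.
        field = key.replace(" ", "_").replace("-", "_")
        tokens.append(f"{field}_{value}")
    return " ".join(tokens)


document = property_document("Cysteine", CYSTEINE)
print(document)
```

Any further fact the experimenter wants to include becomes one more entry in the dictionary and therefore one more token in the document.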
A [pseudo] text classification approach to sequence prediction problems can perform as well as the state-of-the-art stochastic methods
Allows miscellaneous facts (i.e. any textual description of relevant information) to be included
A ranked list of features from the text classifier provides insights into the underlying biology
Features could be used for text generation
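To give a feel for what a ranked list of features might look like, the sketch below scores each token by a smoothed log-odds ratio between positive and negative documents. This is a stand-in technique chosen for brevity, not the classifier used in the talk, and the four toy documents and their labels are invented.

```python
import math
from collections import Counter


def rank_features(documents, labels):
    """Rank tokens by smoothed log-odds of appearing in positive documents."""
    positive, negative = Counter(), Counter()
    for document, label in zip(documents, labels):
        # Count each token once per document.
        (positive if label else negative).update(set(document.split()))
    n_pos = sum(1 for label in labels if label)
    n_neg = len(labels) - n_pos
    scores = {}
    for token in set(positive) | set(negative):
        p = (positive[token] + 1) / (n_pos + 2)  # add-one smoothing
        q = (negative[token] + 1) / (n_neg + 2)
        scores[token] = math.log(p / q)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical documents: True marks the first residue of a mature protein.
documents = [
    "small_yes pos-1_A",
    "small_yes pos-1_A",
    "small_no pos-1_L",
    "small_no pos-1_A",
]
labels = [True, True, False, False]
ranked = rank_features(documents, labels)
for token, score in ranked:
    print(f"{score:+.2f} {token}")
```

Tokens at the top of the list are the facts most associated with the positive class, which is the kind of output a biologist could inspect for plausibility.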
Together, they can find out a lot of hidden information about genes and proteins
Biotechnology is a multi-billion dollar industry
Biotechnology is one of the best funded areas of scientific research
The University of Waikato
Waikato University is the centre of the universe for machine learning
The Machine Learning Group is a large, globally active, well-funded research group
The WEKA workbench of ML tools is used around the world
Professors at Waikato University literally wrote the book on sequence modeling
The University of Waikato
If you’re seriously interested in machine learning, in getting involved in bioinformatics research, or indeed any other area along the leading edge of computer science, then university is the only place to be, and