Biological sequence analysis and information processing by artificial neural networks
Søren Brunak
Center for Biological Sequence Analysis
Technical University of Denmark
Dec 18, 2015
Pairwise alignment
>carp Cyprinus carpio growth hormone 210 aa vs.
>chicken Gallus gallus growth hormone 216 aa
scoring matrix: BLOSUM50, gap penalties: -12/-2
40.6% identity; Global alignment score: 487
10 20 30 40 50 60 70
carp MA--RVLVLLSVVLVSLLVNQGRASDN-----QRLFNNAVIRVQHLHQLAAKMINDFEDSLLPEERRQLSKIFPLSFCNSD
:: . : ...:.: . : :. . :: :::.:.:::: :::. ..:: . .::..: .: .:: :.
chicken MAPGSWFSPLLIAVVTLGLPQEAAATFPAMPLSNLFANAVLRAQHLHLLAAETYKEFERTYIPEDQRYTNKNSQAAFCYSE
10 20 30 40 50 60 70 80
80 90 100 110 120 130 140 150
carp YIEAPAGKDETQKSSMLKLLRISFHLIESWEFPSQSLSGTVSNSLTVGNPNQLTEKLADLKMGISVLIQACLDGQPNMDDN
: ::.:::..:..: ..:::.:. ::.:: : : ::. .:.:. :. ... ::: ::. ::..:.. : .: .
chicken TIPAPTGKDDAQQKSDMELLRFSLVLIQSWLTPVQYLSKVFTNNLVFGTSDRVFEKLKDLEEGIQALMRELEDRSPR---G
90 100 110 120 130 140 150 160
170 180 190 200 210
carp DSLPLP-FEDFYLTM-GENNLRESFRLLACFKKDMHKVETYLRVANCRRSLDSNCTL
.: : .. : . . .:. : ... ::.:::::.:::::::.: .::: .::::.
chicken PQLLRPTYDKFDIHLRNEDALLKNYGLLSCFKKDLHKVETYLKVMKCRRFGESNCTI
170 180 190 200 210
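The identity figure in the alignment header can be recomputed from any pair of aligned rows. The following is a minimal sketch, assuming identity is counted as identical columns divided by the number of alignment columns (gaps included). Alignment programs differ on the denominator, so this is not guaranteed to reproduce the 40.6% reported above. `percent_identity` is a hypothetical helper name, not part of any alignment package.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percent of alignment columns where both rows carry the same residue.

    Assumption: the denominator is the full alignment length, gap columns
    included. Other conventions (shorter sequence, ungapped columns) exist.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    if not row_a:
        return 0.0
    # A gap aligned to a gap is not counted as a match.
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    return 100.0 * matches / len(row_a)


# Last block of the carp/chicken alignment shown above.
carp = "DSLPLP-FEDFYLTM-GENNLRESFRLLACFKKDMHKVETYLRVANCRRSLDSNCTL"
chicken = "PQLLRPTYDKFDIHLRNEDALLKNYGLLSCFKKDLHKVETYLKVMKCRRFGESNCTI"
print(f"{percent_identity(carp, chicken):.1f}% identity in this block")
```

Applied to the complete alignment rows, the same function gives a whole-alignment figure under the stated convention.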
Diversity of interactions in a network enables complex calculations
• Similar in biological and artificial systems
• Excitatory (+) and inhibitory (-) relations between compute units
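A small illustration of why mixed excitatory and inhibitory connections matter: the XOR function cannot be computed by a single threshold unit, but three units with both positive and negative weights can compute it. This is a sketch with hand-picked weights and thresholds (assumptions chosen for illustration, not taken from the slides).

```python
def unit(inputs, weights, threshold):
    """Binary threshold unit: outputs 1 if the weighted input sum exceeds the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0


def xor_network(x1, x2):
    """Two hidden units feeding one output unit; weights are hand-set."""
    h_or = unit((x1, x2), (1.0, 1.0), 0.5)    # on if at least one input is on
    h_and = unit((x1, x2), (1.0, 1.0), 1.5)   # on only if both inputs are on
    # The output is excited (+1) by the OR unit and inhibited (-1) by the AND unit.
    return unit((h_or, h_and), (1.0, -1.0), 0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_network(a, b))
```

Removing the inhibitory connection turns the output into plain OR, which is the point of the example.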
Transfer of biological principles to neural network algorithms
• Non-linear relation between input and output
• Massively parallel information processing
• Data-driven construction of algorithms
• Ability to generalize to new data items
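The first two points can be made concrete with a single layer of sigmoid units. In this sketch each unit sees the same input and is computed independently of the others (so all units could run in parallel), and the sigmoid makes the output a non-linear function of the input. The weights and biases are arbitrary illustrative values, and `layer` is a hypothetical helper name.

```python
import math


def sigmoid(a: float) -> float:
    """Logistic transfer function, mapping any real activation into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def layer(inputs, weight_rows, biases):
    """Output of one layer: each unit applies the sigmoid to its own weighted sum."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weight_rows, biases)
    ]


# Two units reading the same two inputs (illustrative weights only).
weights = [[2.0, -1.0], [-3.0, 0.5]]
biases = [0.0, 0.0]
print(layer([1.0, 0.0], weights, biases))

# Non-linearity: doubling the activation does not double the output.
print(sigmoid(1.0), sigmoid(2.0))
```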
Simplest non-trivial classification problem
CNHSYYP, HIETRRA, NWQSADY, NQYSEPR, WHITRCA, DYHSANY, ...
• Two categories: positives and negatives
• Data described by two features, e.g. charge, sidechain volume, molecular weight, number of atoms, ...
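A two-category, two-feature problem of this kind can be handled by a single linear threshold unit when the categories are linearly separable. Below is a minimal sketch of the classic perceptron learning rule. The feature values and labels are invented toy numbers for illustration only; they are not measurements of the peptides listed above, and `train_perceptron` and `classify` are hypothetical names.

```python
def classify(w, b, x):
    """Linear threshold unit on two features: 1 = positive, 0 = negative."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0


def train_perceptron(samples, labels, epochs=100, lr=0.1):
    """Perceptron rule: adjust weights only when an example is misclassified."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            delta = y - classify(w, b, x)
            if delta != 0:
                w[0] += lr * delta * x[0]
                w[1] += lr * delta * x[1]
                b += lr * delta
                errors += 1
        if errors == 0:  # every training example is classified correctly
            break
    return w, b


# Toy data (made up): each point is (feature 1, feature 2).
samples = [(2.0, 1.0), (3.0, 1.5), (2.5, 2.0), (-1.0, -0.5), (-2.0, 0.5), (-1.5, -1.0)]
labels = [1, 1, 1, 0, 0, 0]

w, b = train_perceptron(samples, labels)
print("weights:", w, "bias:", b)
print("unseen point (4.0, 0.0) ->", classify(w, b, (4.0, 0.0)))
```

If the two categories are not linearly separable in the chosen features, this rule does not converge, which is one motivation for networks with hidden units.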
Features of phosphorylation sites
PKG: cGMP-dep. kinase
PKC
CaM-II: Ca++/calmodulin-dep. kinase
cdc2: Cyclin-dep. kinase 2
CK-II: Casein kinase 2
Sparse encoding of nucleotide sequence windows
Nucleotides
4 letter alphabet
Normally no need for a fifth letter
ACGTAGGCAATCTCAGACGTTTATC
1000010000100001100000100010010010001000000101000001010010000010100001000010000100010001100000010100
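The sparse (one-hot) encoding above assigns each nucleotide four binary inputs with exactly one bit set, so a window of n nucleotides becomes 4n network inputs. A minimal sketch follows, assuming the bit order A, C, G, T, which matches the first four groups of the bit string on the slide (ACGT gives 1000 0100 0010 0001). `sparse_encode` is a hypothetical helper name.

```python
ALPHABET = "ACGT"  # assumed bit order: A=1000, C=0100, G=0010, T=0001


def sparse_encode(seq: str) -> str:
    """Encode a nucleotide sequence as a string of 4 bits per position."""
    bits = []
    for base in seq.upper():
        if base not in ALPHABET:
            # No fifth letter is provided for; reject anything outside ACGT.
            raise ValueError(f"unexpected symbol: {base!r}")
        bits.extend("1" if base == letter else "0" for letter in ALPHABET)
    return "".join(bits)


window = "ACGTAGGCAATCTCAGACGTTTATC"
encoded = sparse_encode(window)
print(len(window), "nucleotides ->", len(encoded), "input bits")
print(encoded)
```

Because each position has exactly one bit set, no two nucleotides are closer to each other than any other pair, so the encoding imposes no artificial ordering on the alphabet.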