Bioinformatics: Knowledge-representation in molecular biology. Sándor Pongor, Protein Structure and Bioinformatics, ICGEB, Trieste

Jun 27, 2020

Page 1: Knowledge-representation in molecular biology › ... › bioinformatics_PL_INTRO.pdf · Bioinformatics milestones 2 1988 - National Center for Biotechnology Information (US) 1988

Bioinformatics: Knowledge-representation in molecular biology

Sándor Pongor

Protein Structure and Bioinformatics, ICGEB, Trieste

Page 2

Representation of biological knowledge

[Figure: growth of biological data, 1965 to 1995. Panels: BIBLIOGRAPHY (articles), NUCLEOTIDE SEQUENCES (sequences), BIBLIOGRAPHY-GENETICS (articles), PROTEIN SEQUENCES (sequences), PROTEIN 3D STRUCTURES (structures), MAPPED HUMAN GENES (genes). Database labels: SWISS-PROT, GenBank, PDB. Source: NCBI]

Page 3

Bioinformatics milestones 1

1962 - Pauling's theory of molecular evolution
1967 - Margaret Dayhoff's Atlas of Protein Sequences
1970 - Needleman-Wunsch algorithm
1977 - DNA sequencing and software to analyze it (Staden)
1981 - The concept of a sequence motif (Doolittle)
1982 - Phage lambda genome
1983 - Database search (Wilbur-Lipman)
1985 - FASTP/FASTN: fast sequence similarity searching
1987 - Sequence profiles
1987 - EMBL, GenBank, Swiss-Prot databases

Page 4

Bioinformatics milestones 2

1988 - National Center for Biotechnology Information (US)
1988 - EMBnet network for database distribution
1990 - BLAST: fast sequence similarity searching
1991 - EST: expressed sequence tag sequencing
1993 - Sanger Centre, Hinxton, UK
1994 - EMBL European Bioinformatics Institute, Hinxton, UK
1995 - First bacterial genomes
1996 - Yeast genome
1997 - PSI-BLAST
1998 - Worm (multicellular) genome
2000+ The rice and human genomes. Microarrays, high throughput methods, new generation sequencing…

Page 5

The ingredients

Data collection techniques (DNA sequencing, protein sequencing, microarrays)
Theoretical milestones (concepts of DNA structure, protein structure, evolution)
Algorithms and programs (BLAST, FASTA)
Databases
Institutions
Complex genomic and high throughput data

Page 6

NCBI, Washington DC

EBI, Hinxton, UK

Page 7

The evolution of bioinformatics as seen in the 90's

??

Page 8

Bioinformatics is an approach to biology…

Systems theory

Cognitive sciences

Page 9

[Diagram labels: COGNITIVE SCIENCE, BIOINFORMATICS, BIOLOGICAL DATA, INFORMATICS]

Model, description and visualization

Page 10

Why is bioinformatics important?

"A paradigm shift in biology: from data collection to data processing"
Walter Gilbert, Nature, 1991

"Biotechnology is the industrial use of biological information"
Lee Hood, in The Economist, 1997

Page 11

Current trends

Massive data processing
Massive generation of data: sequences (genomics), functions (functional genomics) and structures (structural genomics)
Interpretation of data: data mining, data warehousing techniques

Informatics strike back...

Page 12

Structural genomics

Classify proteins (Database of protein motifs)
Choose and express representative proteins from all families
Determine structure by X-ray or NMR
Predict the rest by homology modelling

Tom Terwilliger, Los Alamos National Labs

Gu et al, 1999

Bioinformatics

Page 13

Functional genomics

Sequence complete genome
Identify protein coding regions
Identify unique genes
Gene knockout
Functional analysis (phenotype, detailed functional characterization..)
Structural studies, drug development

Bioinformatics

Page 14

acaattgtaataggcgaacatgttacgcaaagtggtattgaggaacattgtaacaacaattgtaataggcgaacatgtcagtacaagtggtattgaggaacattgtaacaacaattgtaataggcgaacaatatgttacaagtggtattgaggaactacattgtaacaacaattgtaataggcgaacatgttacaagtggtattgaggaacattgtaacaacaattgtaataggcgaacatgttacgcaaagtggtattgaggaacattgtaacaacaattgtaataggcgaacatgtcagtacaagtggtattgaggaacattgtaacaacaattgtaataggcgaacaatatgttacaagtggtattgaggaactacattgtaacaacaattgtaataggcattataagattat

aattgtaataggcattataagattatgcaaagtggtattgaggaacattgtaacaacaattgtaataggcgaacatgtcagtacaagtggtattgaggaacattgtaacaacaattgtaataggcgaacaatatgttacaagtggtattgaggaactacattgtaacaacaattgtaataggcgaacatgttacaagtggtattgaggaacattgtaacaacaattgtaataggcgaacatgttacgcaaagtggtattgaggaacattgtaacaacaattgtaataggcgaacatgtcagtacaagtggtattgaggaacattgtaacaacaattgtaataggcgaacaatatgttacaagtggtattgaggaactacattgtaacaacaattgtaataggcattataagattat

New data representations

Data (property) visualization

Page 15

What bioinformatics is…

User
Data analysis, interpretation
Processing of raw sequence data & instrument output
Database maintenance
Biocomputing, biomathematics
Data management
Infrastructure
Research

Page 16

Bioinformatics: managing information for the life sciences

For: Biomedicine, Agriculture
In: Academic and Industrial Research and Development, Medical Practice
Scientific Infrastructure (service)
Advanced Informatics (research)
Education (biologists vs. informaticians)

An independent field of study but a general approach to biology

Page 17

Current challenges – the three gaps

Understanding ("annotating") new data: "annotational gap"
Translating data to practice: personalized medicine, epidemics… "translational gap"
Making users (biologists, medical doctors) aware of what is there and how to use it… ("communication gap")

Page 18

Protein Structure and Bioinformatics

Established as a resource group for protein chemistry and protein engineering
In charge of bioinformatics services since 1991
Research projects on bioinformatics, structural biology, systems modeling
Currently includes 12 researchers and students

Page 19

ICGEB bioinformatics

Biological computing service for 800 users from 47 countries
2-3 training courses per year, 1400+ students in 18 years...
Methods development (classification, machine learning)
WWW-services: DNA-tools, protein domain identification

Page 20

EMBnet: A world wide network of bioinformatics

32 national nodes, 35,000 registered users.

12 specialist nodes including all major European database producers.

Includes China, India, Australia

Education: High level courses organised in member countries. WWW-tutorials.

A coordinated network of bioinformatics services, a global technical and educational resource

of bioinformatics centres world wide

www.embnet.org

Page 21

Bioinformatics summer course

- One of the oldest continuous teaching traditions in Europe, over 1300 students since 1991

- Introduction to theory and practice of bioinformatics

- Contact with the centers of (NCBI, EBI, SwissProt, KEGG)

Protein Structure and Bioinformatics

Page 22

A take-home message

Bioinformatics is a general approach (paradigm) in biology today.

A bioinformatician has to understand
The biological question, the biological model
The data-collection technology, the data model
The mathematics/statistics of data-evaluation

This is why we have this course

Page 23

Thank you for your attention…

Page 24

Bioinformatics: Knowledge-representation in molecular biology

Sándor Pongor

Protein Structure and Bioinformatics, ICGEB, Trieste

Page 25

An overview of bioinformatics

History and development
Model, description and visualization
• Sequences
• 3D structures
• Networks
• Text (abstracts)
Similarity and classification:
• similarity measures (structured, unstructured)
• database search
• consensus descriptions
Integrated resources

Page 26

The subjects: Molecular structures

MARTKQTARKSTGGKAPRKQLATKAARKSA

Sequences

CIPKWNRCGPKMDGVPCCEPYTCTSDYYGNCS

Extended sequences (e.g. disulphide-topologies)

Domain-cartoons (sec. str. cartoons)

Diagrams (hydrophobicity plots, helical circles)

3D cartoons

3D structures
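
As a hedged illustration of the "hydrophobicity plot" named on this slide, the sketch below computes a sliding-window hydropathy profile for the first sequence shown above and prints it as a crude text plot. The slide does not say which scale or window is meant; the Kyte-Doolittle scale and the window of 7 residues are assumptions made only for this example, and `hydropathy_profile` is a hypothetical helper name.

```python
# Kyte-Doolittle hydropathy index per residue (assumed scale; the slide names none).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=7):
    """Return the mean hydropathy of every full window along the sequence."""
    values = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Sequence copied from the slide; the window size is an assumption.
seq = "MARTKQTARKSTGGKAPRKQLATKAARKSA"
profile = hydropathy_profile(seq, window=7)

# Text "plot": one row per window start, bar length grows with hydropathy.
for start, score in enumerate(profile, start=1):
    bar = "#" * max(0, round((score + 4.5) * 2))
    print(f"{start:3d} {score:6.2f} {bar}")
```

The same sequence thus has two descriptions: the character string stored for the computer and a simplified diagram derived from it for the human reader.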

Page 27

A structural model

Structure

Substructures Relationships

Entity-relationship model. Pongor, Nature, 1987

Substructures, relations, rules = ontology

Page 28

Structures As Database Records

Identification
Name of protein
Organism
Function
Cross-references
...
Domain structure
Sec. structure
Disulphides
….

Sequence (structure)
qfinetdttvivtwtpprarivgyrltvgllseegdepqyldlpstatsvnipdllpgrkytvnvyeiseegeqnlilstsqttapdappdptvdqvddtsivvrwsrprapitgyrivyspsvegsstelnlpetansvtlsdlqpgvqynitiyaveenqestpvfiqqettgvprsdkvppprdlqfvevtdvkitimwtppespvtgyrvdvipvnlpgehgqrlpvsrntfaevtglspgvtyhfkv

ANNOTATIONS

SEQUENCE OR STRUCTURE

CIPKWNRCGPKMDGVPCCEPYTCTSDYYGNC

Database record, fields
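
The record layout on this slide (annotation fields plus the sequence itself) can be sketched as a small data structure. This is a minimal illustration, not the schema of any real database: the class name `ProteinRecord`, the field names, the identifier and the disulphide pairing are all hypothetical; only the sequence string is copied from the slide.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ProteinRecord:
    """One database record: annotation fields plus the sequence itself."""
    identification: str
    protein_name: str
    organism: str
    function: str
    sequence: str
    cross_references: List[str] = field(default_factory=list)
    # Disulphide bridges as pairs of 1-based cysteine positions.
    disulphides: List[Tuple[int, int]] = field(default_factory=list)

    def disulphides_are_consistent(self) -> bool:
        """Check that every annotated bridge joins two cysteines."""
        return all(
            self.sequence[a - 1] == "C" and self.sequence[b - 1] == "C"
            for a, b in self.disulphides
        )


record = ProteinRecord(
    identification="EXAMPLE_1",        # hypothetical identifier
    protein_name="example protein",    # hypothetical name
    organism="unknown",
    function="unknown",
    sequence="CIPKWNRCGPKMDGVPCCEPYTCTSDYYGNC",  # sequence from the slide
    disulphides=[(1, 18), (8, 23), (17, 31)],    # illustrative pairing only
)
print(record.disulphides_are_consistent())
```

Keeping annotations and sequence in one typed record lets a program cross-check them, here by confirming that each annotated bridge really points at cysteines.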

Page 29

The subjects of bioinformatics

Models (knowledge)

Stored data = descriptions for computers

Visualization, text = simplified descriptions for humans

Page 30

SEQUENCES

Page 31

SEQUENCES

Model: Chemical structure

Description: Series of characters

Simplified and/or extended visualization

IFPPVPGP

domain1 Binding site

Page 32

Sequences as language

qfinetdttvivtwtpprarivgyrltvgllseegdepqyldlpstatsvnipdllpgrkytvnvyeiseegeqnlilstsqttapdappdptvdqvddtsivvrwsrprapitgyrivyspsvegsstelnlpetansvtlsdlqpgvqynitiyaveenqestpvfiqqettgvprsdkvppprdlqfvevtdvkitimwtppespvtgyrvdvipvnlpgehgqrlpvsrntfaevtglspgvtyhfkvfavnqgreskpltaqqatkldaptnlqfinetdttvivtwtpprarivgyrltvgltrggqpkqynvgpaasqyplrnlqpgseyavslvavkgnqqsprvtgvfttlqplgsiphyntevtettivitwtpaprigfklgvrpsqggeaprevtsesgsivvsgltpgveyvytisvlrdgqerdapivkkvvtplspptnlhleanpdtgvltvswersttpditgyritttptngqqgysleevvhadqssctfenlspgleynvsvytvkddkesvpisssfvvswvsasdtvsgfrveyelseegdepqyldlpstatsvnipdllpgrkytvnvyeisee

Query (name, length, self-score)

From To

HSP

Pattern

Score

L HSP, j

L HSP, i

From To

Subject (name, length, self-score)

LANGUAGE

Character strings, computer-languages, Chomsky et al, etc.

Alignments

Page 33

3D STRUCTURES

Page 34

Chimie dans l'espace

Van 't Hoff, 1852-1911

1898

Page 35

Some molecules are more equal than others…

…"This figure is purely diagrammatic. The two ribbons symbolize the two phosphate-sugar chains, and the horizontal rods the pairs of the bases holding the chains together. The vertical line marks the fibre axis"

Page 36

Protein models

3D OBJECTS

Page 37

3D structures

Model: 3D chemical structures

Description: 3D coordinates

Simplified and/or extended visualization

(xi, yi, zi)n

!!!??

Surface, backbone
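
To make the "description: 3D coordinates" idea concrete, the sketch below stores a structure as a list of (x, y, z) tuples and derives two simplified views from it: a centroid and a list of close contacts. The coordinates are invented for illustration (they come from no real structure), and the 4.0 cutoff and the helper names `centroid` and `contacts` are assumptions.

```python
import math

# Hypothetical coordinates, one (x, y, z) tuple per atom; not from a real structure.
coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (3.8, 3.8, 0.0),
    (0.0, 3.8, 0.0),
]


def centroid(points):
    """Return the geometric centre of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def contacts(points, cutoff=4.0):
    """Return index pairs (i, j), i < j, whose distance is below the cutoff."""
    return [
        (i, j)
        for i in range(len(points))
        for j in range(i + 1, len(points))
        if math.dist(points[i], points[j]) < cutoff
    ]


print(centroid(coords))
print(contacts(coords))
```

The contact list is already a step from raw coordinates towards a graph, which is the representation the following slides turn to.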

Page 38

NETWORKS

Page 39

Small molecules – classical graphs

Loschmidt, 1861 Kekulé, 1865

Crum Brown, 1861 Cayley, 1872

Van’t Hoff, 1898

TOPOLOGIES, GRAPHS

Page 40

Genomes, assemblies

Entity-relationship models
Topological meta-models

Similarity group
Neighbourhood
Genome
Metabolic pathway
Genetic network
Food network
Tree-hierarchy
Complexes

TOPOLOGIES, GRAPHS

Page 41

The transcription regulatory networks

+ (up)
- (down)

E. coli S. cerevisiae

TOPOLOGIES, GRAPHS
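
A regulatory network with "+ (up)" and "- (down)" links can be described for a computer as a signed, directed graph. The sketch below is a minimal illustration under stated assumptions: the gene names and edges are invented, and `build_network` and `targets_by_sign` are hypothetical helper names, not part of any library.

```python
from collections import defaultdict

# Hypothetical edges: (regulator, target, sign); +1 = up-regulation, -1 = down-regulation.
edges = [
    ("geneA", "geneB", +1),
    ("geneA", "geneC", -1),
    ("geneB", "geneC", +1),
    ("geneC", "geneC", -1),  # self-repression
]


def build_network(edge_list):
    """Return {regulator: {target: sign}} for a list of signed edges."""
    network = defaultdict(dict)
    for regulator, target, sign in edge_list:
        network[regulator][target] = sign
    return dict(network)


def targets_by_sign(network, regulator, sign):
    """List the targets that a regulator switches up (+1) or down (-1)."""
    return sorted(t for t, s in network.get(regulator, {}).items() if s == sign)


network = build_network(edges)
print(targets_by_sign(network, "geneA", +1))
print(targets_by_sign(network, "geneA", -1))
```

The nested dictionary is one possible stored description; a drawing with arrows and colours would be the corresponding simplified visualization.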

Page 42

TEXTS (article abstracts)

Page 43

The language of bibliographies

LANGUAGE

Database record, fields

Keyword-collections, ontologies, etc.

Page 44

Texts (abstracts)

Model: ??

Description: structured files (records, fields), standardized language

Simplified and/or extended visualization

Page 45

Models

A structural model

Structure

Substructures Relationships

Entity-relationship mPongor, Nature, 19

SEQUENCES 3-D NETWORKS

tassfvvswvsasdtvsgfrveyelseegdepqyldlpstatsvnipdllpgrkytvnvyeiseegeqnlilstsqttapdappdptvdqvddtsivvrwsrprapitgyrivyspsvegsstelnlpetansvtlsdlqpgvqynitiyaveenqestpvfiqqettgvprsdkvppprdlqfvevtdvkitimwtppespvtgyrvdvipvnlpgehgqrlpvsrntfaevtglspgvtyhfkvfavnqgreskpltaqqatkldaptnlqfinetdttvivtwtpprarivgyrltvgltrggqpkqynvgpaasqyplrnlqpgseyavslvavkgnqqsprvtgvfttlqplgsiphyntevtettivitwtpaprigfklgvrpsqggeaprevtsesgsivvsgltpgveyvytisvlrdgqerdapivk

TEXT

Page 46

An overview of bioinformatics

History and development
Model, description and visualization
• Sequences
• 3D structures
• Networks
• Text (abstracts)
Similarity and classification:
• similarity measures (structured, unstructured)
• database search
• consensus descriptions
Integrated resources

Page 47

The concept of similarity I

...easier if modular

Shared parts Shared context

Page 48

The concept of similarity II

…Easy for humans, hard for computers

Page 49

Multiple Objects

Similarity groups or neighborhoods
Metabolic pathways
Subunit structures, ligands
Genomes
Evolutionary trees
Trajectories

Multiple alignments

CGPK-MDGVPCCEPY
CGGQNWSGPTCCASG
CSPTSYN---CCR--
CSRLMY---DCCT--
CIPYYL---DCCEPL

Structural similarity

Context (function)

Page 50

Similarity of molecules

Shared context

Shared parts

Shared relations

Page 51

Quantitative comparison

Unstructured models
Typical form: numbers, lists, vectors (x1, x2, … xn)
Similarity score
Clustering, classification etc.

Structured models
Typical form: sequences, networks etc.
Alignment (matching)
Similarity score
Clustering, classification etc.
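
For the unstructured case, a similarity score can be computed directly on vectors, with no alignment step. The sketch below is one possible illustration, not a method named on the slide: it turns two sequences from earlier slides into amino-acid composition vectors and compares them by cosine similarity. The choice of composition as the vector and of cosine as the measure are both assumptions, and the sequences are assumed to be non-empty and to contain standard residues.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def composition(sequence):
    """Return the 20-dimensional amino-acid frequency vector of a sequence."""
    counts = Counter(sequence.upper())
    return [counts[residue] / len(sequence) for residue in ALPHABET]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Two sequences copied from earlier slides.
seq_a = "MARTKQTARKSTGGKAPRKQLATKAARKSA"
seq_b = "CIPKWNRCGPKMDGVPCCEPYTCTSDYYGNCS"

score = cosine_similarity(composition(seq_a), composition(seq_b))
print(f"composition similarity: {score:.3f}")
```

Because the order of residues is discarded, two very different sequences can receive a high score; capturing order is what the alignment step in the structured case is for.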

Page 52

Quantification of sequence similarity: sequence alignment and its scoring

Mismatch Gap

Range of Alignment

ATTGTCAAAGACTTGAGCTGATGCAT
        |||| ||| ||||
     GGCAGACATGA-CTGACAAGGGTATCG

Score = sum of the contributions of matches, minus penalties for mismatches and gaps
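
The scoring rule can be sketched in a few lines. The example scores the aligned range of the two sequences on this slide (the stretch covered by the match bars, taken here as AGACTTGAGCTGA over AGACATGA-CTGA). The slide gives no numerical values, so the weights of +1 per match, -1 per mismatch and -2 per gap are assumptions chosen only for illustration, and `score_alignment` is a hypothetical helper name.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Score two aligned strings of equal length; '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # penalty for a gap position
        elif a == b:
            score += match      # contribution of a match
        else:
            score += mismatch   # penalty for a mismatch
    return score


# Aligned range from the slide: 11 matches, 1 mismatch, 1 gap.
top = "AGACTTGAGCTGA"
bottom = "AGACATGA-CTGA"
print(score_alignment(top, bottom))  # 11*1 + 1*(-1) + 1*(-2) = 8
```

Real programs replace the flat match and mismatch values with a substitution matrix and often charge gap opening and gap extension separately, but the additive structure of the score stays the same.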

Page 53

Substructure identity ~ similarity

"The similarity of objects can be best described as partial identities of components and relationships"

Erich Goldmeier, The similarity of perceived forms, 1936

Page 54

Using similarity: Comparing one sequence with a group (database)

Twilight zone

Ranked list of best similarities

1

2

3

4

SEQUENCE SCORE DESCRIPTION

SWISSALL:IAAI 457.36 ALPHA-AMYLASE INHIBITOR AAI. 2/9

SWISSALL:O426 152.82 CELLULOSE BINDING PROTEIN

SWISSALL:GUX 145.77 EXOGLUCANASE I PRECURSOR

SWISSALL:Q126 145.66 CELLULASE (EC 3.2.1.91)

Similarities??

EXPECTation Threshold (E parameter)
|
V     Observed Counts-->
10000  6336 1688 |============================================================
6310   4648 1618 |=========================================================
3980   3030  886 |===============================
2510   2144  706 |=========================
1580   1438  438 |===============
1000   1000  272 |=========
631     728  185 |======
398     543  141 |=====
251     402  103 |===
158     299   63 |==
100     236   43 |=
63.1    193   15 |:
39.8    178   18 |:
25.1    160   17 |:
15.8    143    7 |:
>>>>>>>>>>>>>>>>>>>>> Expect= 10.0, Observed= 136 <<<<<<<<<<<<<<<<<
10.0    136    2 |:
6.31    134    3 |:
3.98    131    2 |:
2.51    129    2 |:
1.58    127    0 |
1.00    127    1 |:
0.63    126    0 |
0.40    126    4 |:
0.25    122    0 |
0.16    122    0 |
0.10    122    0 |
0.063   122    0 |
0.040   122    0 |
0.025   122    0 |
0.016   122    1 |:
0.010   121    0 |
0.0063  121    1 |:
0.0040  120    0 |
0.0025  120    1 |:

BLAST program
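
The idea of comparing one query with every database entry and returning a ranked list can be sketched as follows. This is deliberately not the BLAST algorithm and computes no E-values: as a simple stand-in, each entry is scored by the number of distinct 3-letter words it shares with the query. The database entries, their names and the helpers `kmers` and `rank_database` are hypothetical.

```python
def kmers(sequence, k=3):
    """Return the set of all overlapping words of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def rank_database(query, database, k=3):
    """Return (score, name) pairs sorted from best to worst hit."""
    query_words = kmers(query, k)
    hits = [
        (len(query_words & kmers(sequence, k)), name)
        for name, sequence in database.items()
    ]
    return sorted(hits, key=lambda hit: (-hit[0], hit[1]))


# Hypothetical mini-database; entry names are invented.
database = {
    "entry_1": "CIPKWNRCGPKMDGVPCCEPYTCTSDYYGNC",
    "entry_2": "MARTKQTARKSTGGKAPRKQLATKAARKSA",
    "entry_3": "CGPKMDGAPCCEPYT",
}
query = "CGPKMDGVPCCEPY"

for rank, (score, name) in enumerate(rank_database(query, database), start=1):
    print(rank, name, score)
```

In a real search the hard part is the question raised on the slide: where in the ranked list the meaningful similarities end, which is what the expectation threshold is meant to decide.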

Page 55

Using similarity: comparing a group with itself

Similarity group or neighbourhood

CGPK-MDGVPCCEPY
CGGQNWSGPTCCASG
CSPTSYN---CCR--
CSRLMY---DCCT--
CIPYYL---DCCEPL

Multiple alignment

Mathematical consensus for database search

Regular expressions
Consensus sequence
Frequency matrix
Markov chains
Neural networks
etc.

Publish

Nature

CLUSTAL program
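
Two of the consensus descriptions listed on this slide, the frequency matrix and the consensus sequence, can be derived from a multiple alignment in a few lines. The sketch assumes that the alignment block on the slide splits into five rows of 15 columns; the majority threshold of 0.5 and the use of "x" for columns without a majority residue are also assumptions, and the helper names are hypothetical.

```python
from collections import Counter

# Alignment rows as read from the slide (assumed split: 5 rows x 15 columns).
alignment = [
    "CGPK-MDGVPCCEPY",
    "CGGQNWSGPTCCASG",
    "CSPTSYN---CCR--",
    "CSRLMY---DCCT--",
    "CIPYYL---DCCEPL",
]


def frequency_matrix(rows):
    """Return one {symbol: frequency} dict per alignment column (gaps included)."""
    n = len(rows)
    return [
        {symbol: count / n for symbol, count in Counter(column).items()}
        for column in zip(*rows)
    ]


def consensus(rows, threshold=0.5):
    """Return the majority residue per column, or 'x' where none exceeds the threshold."""
    result = []
    for freqs in frequency_matrix(rows):
        residues = {s: f for s, f in freqs.items() if s != "-"}
        best = max(residues, key=residues.get) if residues else None
        result.append(best if best and residues[best] > threshold else "x")
    return "".join(result)


matrix = frequency_matrix(alignment)
print(matrix[0])
print(consensus(alignment))
```

The consensus string keeps only the conserved cysteines and one proline, while the frequency matrix retains the full per-column distribution; that trade-off between simplicity and information is what separates the descriptions in the list.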

Page 56

Similarities: a practical overview

tassfvvswvsasdtvsgfrveyelseegdepqyldlpstatsvnipdllpgrkytvnvyeiseegeqnlilstsqttapdappdptvdqvddtsivvrwsrprapitgyrivyspsvegsstelnlpetansvtlsdlqpgvqynitiyaveenqestpvfiqqettgvprsdkvppprdlqfvevtdvkitimwtppespvtgyrvdvipvnlpgehgqrlpvsrntfaevtglspgvtyhfkvfavnqgreskpltaqqatkldaptnlqfinetdttvivtwtpprarivgyrltvgltrggqpkqynvgpaasqyplrnlqpgseyavslvavkgnqqsprvtgvfttlqplgsiphyntevtettivitwtpaprigfklgvrpsqggeaprevtsesgsivvsgltpgveyvytisvlrdgqerdapivk

SEQUENCES 3D NETWORKS

Bulk “Glycine-rich” “α-helical” “scale-free”

Substructure-alignment

Motifs G-RR

(metabolic pathways)

PAPERS

“genomics”

same author, common

references

“Joe Doe, folding”

Page 57

An overview of bioinformatics

History and development

Model, description and visualization
• Sequences
• 3D structures
• Networks
• Text (abstracts)

Similarity and classification:
• similarity measures (structured, unstructured)
• database search
• consensus descriptions

Integrated resources

Page 58

Biological knowledge as a network of data

Text (keyword) Similarity

Taxonomic Similarity

Nucleotide Sequence Similarity

Protein Sequence Similarity

Structural Similarity

Nucleotide sequences

Protein sequences

3-D Structure

Bibliography

Genomes

Phylogeny (Taxonomy)

actactgagaacat

MSLLDHRGDRGD

The world according to a PC...

Source: NCBI
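The idea of knowledge as a network of data can be sketched as a small graph whose nodes are the data types named on this slide. The edges below are hypothetical, since the slide does not list them, and `neighbourhood` is an illustrative helper that collects everything reachable within a given number of links.

```python
from collections import deque

# Hypothetical cross-links between the data types named on the slide.
EDGES = [
    ("Nucleotide sequences", "Protein sequences"),
    ("Nucleotide sequences", "Genomes"),
    ("Nucleotide sequences", "Taxonomy"),
    ("Nucleotide sequences", "Bibliography"),
    ("Protein sequences", "3-D Structure"),
    ("Protein sequences", "Bibliography"),
    ("3-D Structure", "Bibliography"),
    ("Genomes", "Taxonomy"),
]


def build_graph(edges):
    """Undirected adjacency sets built from a list of node pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def neighbourhood(graph, start, max_steps):
    """Breadth-first search: map each node within max_steps links to its distance."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_steps:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


GRAPH = build_graph(EDGES)
```

In an integrated resource such links are precomputed, so moving from a sequence to its structures or abstracts is a lookup rather than a new search.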

Page 59

Search on a preprocessed, integrated database: the importance of a good neighbourhood

Unknown DNA query

+

DNA

Proteins

3D Structures

Literature, abstracts

Blast

Derived protein sequence

+

Blast Oops!
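The "derived protein sequence" step can be sketched as a translation of the DNA query with the standard genetic code. This is a minimal illustration, not the procedure of any particular search service. The function names are assumptions, and only the three forward reading frames are shown.

```python
# Standard genetic code, with codons ordered by the bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate one forward frame; unknown codons become 'X', stops '*'."""
    seq = dna.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def forward_frames(dna):
    """Translations of the three forward reading frames."""
    return [translate(dna, frame) for frame in range(3)]
```

Each translated frame can then be used as a protein query, which gives access to the protein, structure and literature neighbourhoods that a DNA-level search alone may miss.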

Page 60

Models are human constructs...

THIS IS NOT A PIPE!

Page 61

Models are human constructs...

THIS IS NOT A MOLECULE

Page 62

The central dogma:

DNA → RNA → Protein

Dogma, paradigm, mythology

Page 63

RNA

Metabolites

DNA Protein

Growth rate

Expression

Interactions

Polymers: Initiate, elongate, terminate, fold, modify, localize, degrade

New central dogma: Self-assembly, catalysis, replication, networks

Evolution + Self-assembly, Systems biology

Page 64

Summary of topics discussed

History and development

Model, description and visualization
• Sequences
• 3D structures
• Networks
• Text (abstracts)

Similarity and classification:
• similarity measures (structured, unstructured)
• database search
• consensus descriptions

Integrated resources

Page 65

Summary of the introduction

Bioinformatics is the science of biological information, or rather a computer-based approach to biological problems.

All kinds of biological data are structures defined with entities and relationships (metabolites, genes, networks).

Typical tasks: similarity search, categorization and clustering

Simultaneous handling of many complex data types

Page 66

On-line help to this lecture

Bioinformatics tutorials on-line
http://www.ebi.ac.uk/2can/home.html

ICGEBnet
http://www.icgeb.org/~netsrv/

The Trieste bioinformatics course
http://www.icgeb.org/~netsrv/netcourse.html

Page 67

Reading about bioinformatics

In depth introduction

Genomics research problems Math principles

Evolutionary principles

Page 68

Computer methods in Molecular Biology
Trieste, June 21 - 26, 2010

Theoretical overview: Sándor Pongor

Sequence database searching, theory and practice (Dave Judge and Jack Leunissen)

Nucleic acid databases, Medline, Pubmed (David Landsman)

Functional genomics databases, KEGG (Minoru Kanehisa)

EBI Services (Jim Watson)

Protein databases, Swissprot, Prosite (Marie-Claude Blatter)

Genome analysis (Martin Bishop)