SBML (the Systems Biology Markup Language), model databases, and other resources


Michael Hucka, Ph.D.

Department of Computing + Mathematical Sciences

California Institute of Technology, Pasadena, CA, USA

CCB 2012, August 2012, Cold Spring Harbor Laboratory, NY, USA

Email: [email protected] Twitter: @mhucka

http://cshlccb.wordpress.com/2012/08/03/michael-hucka-california-institute-of-technology-the-systems-biology-markup-language-sbml-model-databases-and-translation/





Outline

General background and motivations

Brief summary of SBML features

A selection of resources for the SBML-oriented modeler

Annotations, connections and semantics

Current and upcoming developments in community standards

Closing


Research today: experimentation, computation, cogitation

The many roles of computation in biological research

Instrument/device control, data management, data processing, database applications, statistical analysis, pattern matching, image processing, text mining, chemical structure prediction, genomic sequence analysis, proteomics, other *omics, molecular modeling, molecular dynamics, kinetic simulation, simulated evolution, phylogenetics, ... (to name only a subset)!

Focus here: modeling and simulation

What are the outcomes of modeling and simulation?

Usually, there are at least two scientific outcomes:

• One or more models (+ associated claims about their behaviors)

• Publication of the results (in some form)

Models come in many forms

Models are results

Models serve as statements of our current understanding of the phenomena being studied*

• A computational model documents your theory in a concrete form

Models can:

• Reduce ambiguity in communication

• Offer a concrete framework for adding new data and theories

• Support direct evaluation of relationships between theories

* Bower & Bolouri, Computational modeling of genetic and biochemical networks, MIT Press, 2001

But only if the modeling results are reproducible

Many models have traditionally been published this way

Problems:

• Errors in printing

• Missing information

• Dependencies on implementation

• Outright errors

• Can be a huge effort to recreate

Is it enough to describe the model & equations in a paper?

Is it enough to make your (software X) script or code available?

It’s vital for good science:

• Someone with access to the same software can try to run it, understand it, verify the computational results, build on them, etc.

• Opinion: you should always do this in any case

But it’s still not ideal for communication of scientific results:

• What if they don’t have access to that software?

• And anyway, how will people find the model?

• And how will people be able to relate the model to other work?

Different tools ⇒ different interfaces & languages

Communication is better with interoperable data formats


SBML: a lingua franca for software

SBML = Systems Biology Markup Language

Format for representing computational models of biological processes

• Data structures + usage principles + serialization to XML

Neutral with respect to modeling framework

• E.g., ODE, stochastic systems, etc.

Development started in 2000, with first specification distributed in 2001

Basic SBML concepts are fairly simple

The process is central

• Called a “reaction” in SBML

• Participants are pools of entities (species)

Models can further include:

• Other constants & variables

• Compartments

• Explicit math

• Discontinuous events

• Unit definitions

• Annotations
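As a rough illustration of these concepts, the following Python sketch assembles a tiny SBML-like document (one compartment, two species pools, one reaction) using only the standard library. The identifiers (`example`, `cell`, `S1`, `S2`, `r1`) and the choice of the Level 2 Version 4 namespace are assumptions made for this example, and the output is not checked against any validator.

```python
import xml.etree.ElementTree as ET

# Assumed level/version for this illustration.
SBML_NS = "http://www.sbml.org/sbml/level2/version4"


def build_model() -> ET.Element:
    """Build a minimal model: one compartment, two species, one reaction."""
    sbml = ET.Element("sbml", {"xmlns": SBML_NS, "level": "2", "version": "4"})
    model = ET.SubElement(sbml, "model", {"id": "example"})

    compartments = ET.SubElement(model, "listOfCompartments")
    ET.SubElement(compartments, "compartment", {"id": "cell", "size": "1e-15"})

    # Species pools are located in a compartment.
    species = ET.SubElement(model, "listOfSpecies")
    for sid, amount in (("S1", "1000"), ("S2", "0")):
        ET.SubElement(
            species,
            "species",
            {"id": sid, "compartment": "cell", "initialAmount": amount},
        )

    # The process ("reaction") connects reactant and product pools.
    reactions = ET.SubElement(model, "listOfReactions")
    reaction = ET.SubElement(reactions, "reaction", {"id": "r1", "reversible": "false"})
    reactants = ET.SubElement(reaction, "listOfReactants")
    ET.SubElement(reactants, "speciesReference", {"species": "S1"})
    products = ET.SubElement(reaction, "listOfProducts")
    ET.SubElement(products, "speciesReference", {"species": "S2"})
    return sbml


xml_text = ET.tostring(build_model(), encoding="unicode")
print(xml_text)
```

A real application would more likely use a dedicated SBML library than hand-built XML; the sketch only shows how the concepts map onto a document structure.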

Well-stirred compartments

Species pools are located in compartments

[Figure: two compartments, labeled c and n, containing the species pools protein A, protein B, gene, mRNAn and mRNAc]

Reactions can involve any species anywhere

Reactions can cross compartment boundaries

Reaction/process rates can be (almost) arbitrary formulas

[Figure: the same diagram, with the reactions labeled by rate formulas f1(x) to f5(x)]

“Rules”: equations expressing relationships in addition to the reaction system

[Figure: the same diagram, with rule equations g1(x), g2(x), ... listed alongside the rate formulas f1(x) to f5(x)]

“Events”: discontinuous actions triggered by system conditions

Event1: when (...condition...), do (...assignments...)

...

[Figure: the same diagram, with the list of events added alongside the rate formulas and rules]
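To make the interplay of rate formulas, rules and events more concrete, here is a minimal Python sketch of a fixed-step simulation of a single reaction S1 -> S2. Everything in it is an assumption for illustration: the species names, the rate constant, the threshold and the simple Euler stepping are hypothetical, and the event is simplified to fire at most once. It is not meant to reflect how any particular SBML simulator works.

```python
def simulate(t_end=10.0, dt=0.01):
    """Euler simulation of S1 -> S2 with one rate formula, one rule, one event."""
    s1, s2 = 10.0, 0.0   # hypothetical initial amounts of the two species pools
    k = 0.5              # hypothetical rate constant
    fired = False        # whether the event has already fired

    for _ in range(int(round(t_end / dt))):
        # Rate formula f(x): here simple first-order kinetics.
        rate = k * s1
        s1 -= rate * dt
        s2 += rate * dt

        # Event: when S1 drops below 1, do the assignment S1 = S1 + 5.
        if not fired and s1 < 1.0:
            s1 += 5.0
            fired = True

    # Rule g(x): a quantity defined by an equation rather than by a reaction.
    total = s1 + s2
    return s1, s2, total, fired


s1, s2, total, fired = simulate()
print(f"S1={s1:.3f} S2={s2:.3f} total={total:.3f} event fired={fired}")
```

The event introduces a discontinuity that the rate formula alone cannot express, which is why events are a separate construct.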

Annotations: machine-readable semantics and links to other resources

[Figure: the same diagram, with annotations attached to its parts, for example:]

• “This event represents ...”

• “This is identified by GO id # ...”

• “This is an enzymatic reaction with EC # ...”

• “This is a transport into the nucleus ...”

• “This compartment represents the nucleus ...”

Scope of SBML encompasses many types of models

Today: spatially homogeneous models

• Metabolic network models

• Signaling pathway models

• Conductance-based models

• Neural models

• Pharmacokinetic/dynamics models

• Infectious diseases

Coming: SBML Level 3 packages to support other types

• E.g.: Spatially inhomogeneous models, also qualitative/logical

Find examples in BioModels Database: http://biomodels.net/biomodels

Model scale & complexity have been increasing

Many significant and popular models are in SBML form

[Figure: first page of “A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology” (doi:10.1038/nbt1492), annotated “2342 reactions”]

Herrgård et al., Nature Biotech., 26:10, 2008

SBML Level 1 | SBML Level 2 | SBML Level 3
predefined math functions | user-defined functions | user-defined functions
text-string math notation | MathML subset | MathML subset
reserved namespaces for annotations | no reserved namespaces for annotations | no reserved namespaces for annotations
no controlled annotation scheme | RDF-based controlled annotation scheme | RDF-based controlled annotation scheme
no discrete events | discrete events | discrete events
default values defined | default values defined | no default values
monolithic | monolithic | modular
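Because the Levels differ in features, software usually needs to know which Level and Version a file uses before interpreting it. The sketch below reads the `level` and `version` attributes from the root `sbml` element with the Python standard library; the sample document string is an assumption made up for the example.

```python
import xml.etree.ElementTree as ET


def sbml_level_version(xml_text):
    """Return (level, version) read from the root <sbml> element."""
    root = ET.fromstring(xml_text)
    # Strip any XML namespace from the tag, e.g. "{uri}sbml" -> "sbml".
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name != "sbml":
        raise ValueError("root element is not <sbml>")
    return int(root.get("level")), int(root.get("version"))


# Hypothetical minimal document used only to exercise the function.
SAMPLE = (
    '<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" '
    'level="3" version="1"><model/></sbml>'
)

print(sbml_level_version(SAMPLE))
```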







You want models? We got models.

BioModels Database

Stores & serves quantitative models of biological interest

• Free, public resource

• Models must be described in peer-reviewed publication(s)

Hundreds of models are curated by hand

Imports & exports models in several formats

Figure courtesy of Camille Laibe

http://biomodels.net/biomodels





Contents of BioModels Database

Contents today:

• 142,000+ pathway models (converted from KEGG)

• 400+ hand-curated quantitative models

• 400+ non-curated quantitative models

[Pie chart of model categories: signal transduction, metabolic process, multicellular organismal process, rhythmic process, cell cycle, homeostatic process, response to stimulus, cell death, localization, others (e.g., developmental process); segment sizes range from 3% to 25%]

Database data from 2012-08-10

How can you check that a given SBML file is valid?

The Online SBML Validator

Find it here: http://sbml.org/Facilities/Validator





Where can you find more software?

Find software in the SBML Software Guide

http://sbml.org/SBML_Software_Guide

Results of 2011 survey of SBML-compatible software

Question: Which of the following categories best describe your software? (Check all that apply.)

Out of 81 responses:

Category | Responses
Simulation software | 42
Analysis software (in addition to, or instead of, simulation) | 40
Creation/model development software | 31
Visualization/display/formatting software | 31
Utility software (e.g., format conversion) | 23
Data integration and management software | 16
Repository or database | 14
Framework or library (for use in developing software) | 13
Software for interactive environments (e.g., MATLAB, R, ...) | 13
Annotation software | 11

What about libraries for writing SBML-compatible software?

libSBML

Reads, writes, validates SBML

Can check & convert units

Written in portable C++

Runs on Linux, Mac, Windows

APIs for C, C++, C#, Java, Octave, Perl, Python, R, Ruby, MATLAB

Well documented API

Open-source (LGPL)

http://sbml.org/Software/libSBML





JSBML

Pure Java implementation

API is compatible with libSBML but more Java-like

Functionality is subset of libSBML

Open source (LGPL)

http://sbml.org/Software/JSBML





How can you stay informed of new developments?

Resources for news, questions and discussions

Front-page news


Twitter & RSS feeds


Mailing lists/forums








SBML itself provides syntax and only limited semantics


No standard identifiers


Low info content


Raw models alone are insufficient

Need standard schemes for machine-readable annotations

• Identify entities

• Mathematical semantics

• Links to other data resources

• Authorship & pub. info


Annotations at their simplest

Element in the model → relationship qualifier (optional) → Entity elsewhere (e.g., in a database)

Annotations can answer questions:

• “What exactly is the process represented by equation ‘r17’?”

• “What other identities (synonyms) does this entity have?”

• “What role does constant ‘k3’ play in equation ‘r17’?”

• “What organism are we talking about?”

• ... etc. ...

Multiple annotations on same entity are common

Annotations add meaning and connections
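One way to picture an annotation in code is as a small record linking a model element to an external entity, with an optional relationship qualifier. The Python sketch below is only a mental model, not SBML syntax: the class name, the field names, the qualifier label `is` and the second resource string are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    element: str                     # id of the element in the model
    resource: str                    # identifier of the entity elsewhere
    qualifier: Optional[str] = None  # optional relationship qualifier


# Multiple annotations on the same model element are common.
annotations = [
    Annotation("r17", "urn:miriam:ec-code:1.1.1.1", "is"),
    Annotation("r17", "hypothetical:resource:42"),
    Annotation("k3", "hypothetical:resource:7"),
]


def annotations_for(element, items):
    """Return all annotations attached to one model element."""
    return [a for a in items if a.element == element]


for a in annotations_for("r17", annotations):
    print(a.element, a.qualifier, a.resource)
```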

SBML supports two annotation schemes

SBO (Systems Biology Ontology)

• For mathematical semantics

• One SBML object ← one SBO term

• Short, compact, tightly coupled but limited scope

MIRIAM (Minimum Information Requested In the Annotation of Models)

• For any kind of annotation

• One SBML object ← multiple MIRIAM annotations

• Larger, more free-form, wider scope

Both are externalized and independent of SBML

Systems Biology Ontology (SBO)

http://biomodels.net/sbo





<sbml ...>
  ...
  <listOfCompartments>
    <compartment id="cell" size="1e-15" />
  </listOfCompartments>
  <listOfSpecies>
    <species compartment="cell" id="S1" initialAmount="1000" />
    <species compartment="cell" id="S2" initialAmount="0" />
  </listOfSpecies>
  <listOfParameters>
    <parameter id="k" value="0.005" sboTerm="SBO:0000339" />
  </listOfParameters>
  <listOfReactions>
    <reaction id="r1" reversible="false">
      <listOfReactants>
        <speciesReference species="S1" stoichiometry="2" sboTerm="SBO:0000010" />
      </listOfReactants>
      <listOfProducts>
        <speciesReference species="S2" stoichiometry="2" sboTerm="SBO:0000011" />
      </listOfProducts>
      <kineticLaw sboTerm="SBO:0000052">
        <math>
          ...
        </math>
        ...
</sbml>


SBO:0000339 = “forward bimolecular rate constant, continuous case”
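Since SBO terms are plain attribute values, software can collect them with very little code. The following Python sketch gathers every `sboTerm` attribute in an XML fragment using the standard library. The fragment is a shortened, made-up excerpt (the parameter `unlabeled` is hypothetical), and looking up what each term means in the ontology is left out.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical fragment in the style of an SBML parameter list.
FRAGMENT = """
<listOfParameters>
  <parameter id="k" value="0.005" sboTerm="SBO:0000339" />
  <parameter id="unlabeled" value="1" />
</listOfParameters>
"""


def sbo_terms(xml_text):
    """Map element id (or tag name, if there is no id) to its sboTerm."""
    root = ET.fromstring(xml_text)
    terms = {}
    for element in root.iter():
        term = element.get("sboTerm")
        if term is not None:
            key = element.get("id") or element.tag
            terms[key] = term
    return terms


print(sbo_terms(FRAGMENT))
```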

Software can use SBO terms to help you work with models

Examples: semanticSBML, SBMLsqueezer

MIRIAM is not specific to SBML

Addresses 2 general areas of annotation needs:

• Requirements for reference correspondence

• Scheme for encoding annotations

- Annotations for attributing model creators & sources

- Annotations for referring to external data resources

Annotations for attributing model creators and sources

Goal: permit tracing model’s origins & people involved in its creation

Minimal info required:

• Name for the model

• Citation for a description of what is being modeled & its author

• Contact info for the model creator(s)

• Creation date & time

• Last modification date & time

• Statement of the model’s terms of distribution

- Specific terms not mandated, just a statement of the terms
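The minimal attribution information above lends itself to a simple completeness check. The Python sketch below treats model metadata as a dictionary and reports which required items are missing. The field names are hypothetical labels invented for this example and are not part of MIRIAM or SBML.

```python
# Hypothetical field names standing in for the minimal attribution items.
REQUIRED_FIELDS = (
    "name",                   # name for the model
    "citation",               # description of what is modeled and its author
    "creator_contact",        # contact info for the model creator(s)
    "created",                # creation date & time
    "modified",               # last modification date & time
    "terms_of_distribution",  # any statement of terms; specifics not mandated
)


def missing_attribution(metadata):
    """Return the required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not metadata.get(name)]


example = {"name": "Example model", "citation": "Author, Journal, Year"}
print(missing_attribution(example))
```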









Annotations for external references

Goal: link model constituents to corresponding entities in bioinformatics resources (e.g., databases, controlled vocabularies)

• Supports:

- Precise identification of model constituents

- Discovery of models that concern the same thing

- Comparison of model constituents between different models

MIRIAM approach avoids putting data content directly in the model; instead, it points at external resources that contain the knowledge.
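Because the model only points at external resources, a tool can recover those pointers by reading the annotation. The Python sketch below extracts (qualifier, resource) pairs from an RDF annotation using the standard library. The XML layout is an assumption modeled on the RDF-based annotation scheme, and the reaction id, metaid and qualifier shown are made up for the example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical annotated element; the layout is assumed for illustration.
REACTION = """
<reaction xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
          xmlns:bqbiol="http://biomodels.net/biology-qualifiers/"
          id="r17" metaid="meta_r17">
  <annotation>
    <rdf:RDF>
      <rdf:Description rdf:about="#meta_r17">
        <bqbiol:is>
          <rdf:Bag>
            <rdf:li rdf:resource="urn:miriam:ec-code:1.1.1.1"/>
          </rdf:Bag>
        </bqbiol:is>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</reaction>
"""


def external_references(xml_text):
    """Return a list of (qualifier, resource URI) pairs found in the annotation."""
    root = ET.fromstring(xml_text)
    references = []
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        for qualifier in description:
            # Drop the namespace part of the tag, keeping e.g. "is".
            name = qualifier.tag.rsplit("}", 1)[-1]
            for item in qualifier.iter(f"{{{RDF_NS}}}li"):
                references.append((name, item.get(f"{{{RDF_NS}}}resource")))
    return references


print(external_references(REACTION))
```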

Why might you care?

Example: salicylic acid (see ChEBI, http://www.ebi.ac.uk/chebi)

Known by different names. Do you want to write all of them into your model?

Identifying resources has its own challenges

For linking to data, need:

• Globally unique, unambiguous identifiers

• ... that are persistent despite resource changes (e.g., changed URLs)

• ... that are maintained by the community

Problem: different resources have different identification schemes

• E.g.: entity “16480”

- In ChEBI: entry 16480 is nitric oxide

- In PubMed: entry 16480 is the 1977 paper “Effect of gallstone-dissolution therapy on human liver structure”

- In PubChem: entry 16480 is 1-chloro-4-isothiocyanatobenzene

How do we create globally unique identifiers consistently?

Long story short:

• Create unique URIs (uniform resource identifiers) by combining 2 parts:

- Namespace: identifies a dataset

- Entity identifier: identifies a datum within the dataset

• Create registry for namespaces

- Allows people & software to use same namespace identifiers

• Create service for URI resolution

- Allows people & software to take a given resource identifier and figure out what it points to
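The two-part idea can be sketched in a few lines of Python: once the namespace is part of the key, the same entity identifier no longer collides across resources. The namespace strings and the short descriptions below are placeholders chosen for this example, not registry entries.

```python
from typing import NamedTuple


class ResourceId(NamedTuple):
    namespace: str  # identifies a dataset
    entity: str     # identifies a datum within the dataset


# The bare identifier "16480" is ambiguous; the pairs are not.
# Namespace labels and descriptions are hypothetical placeholders.
CATALOG = {
    ResourceId("chebi", "16480"): "a chemical entity",
    ResourceId("pubmed", "16480"): "a 1977 paper",
    ResourceId("pubchem", "16480"): "a different chemical compound",
}


def describe(namespace, entity):
    """Look up what a (namespace, entity) pair refers to."""
    return CATALOG[ResourceId(namespace, entity)]


print(describe("pubmed", "16480"))
```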

Resolving resource identifiers

MIRIAM Registry supports the creation of globally unique identifiers

• Example MIRIAM identifier: urn:miriam:ec-code:1.1.1.1

• Provides various data about the resource, including alternate servers

• Provides web services

identifiers.org is layered on top of that and provides resolvable URIs

• Can type it in a web browser!

• Example identifiers.org URI: http://identifiers.org/ec-code/1.1.1.1
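The two example identifiers share the same namespace and entity parts, so converting one form into the other is mostly string handling. The Python sketch below does this for the simple case shown in the examples. It assumes the pattern generalizes, does not contact any web service, and ignores details such as percent-encoded characters.

```python
def miriam_urn_to_url(urn):
    """Turn 'urn:miriam:<namespace>:<entity>' into an identifiers.org URL."""
    prefix = "urn:miriam:"
    if not urn.startswith(prefix):
        raise ValueError(f"not a MIRIAM URN: {urn!r}")
    # Split only on the first colon, since entity ids may contain colons.
    namespace, separator, entity = urn[len(prefix):].partition(":")
    if not separator or not namespace or not entity:
        raise ValueError(f"missing namespace or entity in: {urn!r}")
    return f"http://identifiers.org/{namespace}/{entity}"


print(miriam_urn_to_url("urn:miriam:ec-code:1.1.1.1"))
```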

BioModels Database: example of using the annotations

Annotations enable many interesting possibilities

Example: semanticSBML

Figure courtesy of Wolfram Liebermeister

Summary: why care about standard ways of writing annotations?

Structured, machine-readable annotations increase your model’s utility

• Allow more precise identification of model components

- Understand model structure

- Search/discover models

- Compare models

• Add a semantic layer that integrates knowledge into the model

- Helps recipients understand the underlying biology

- Allows for better reuse of models

- Supports conversion of models from one form to another







Major dimensions of a computational model

• Model representation level: mathematical semantics, biological semantics, visual interpretation

• Model type: discrete stochastic entities, continuous lumped parameter, state transition, mean field approximation

• Model life-cycle: model creation, model annotation, model analysis, numerical results

Concept due to Nicolas Le Novère

What about other kinds of models?

SBML Level 3: Supporting more categories of models

An SBML Level 3 package adds constructs & capabilities

Models declare which packages they use

• Applications tell users which packages they support

Package development can be decoupled

[Figure: packages W, X, Y and Z built on top of SBML Level 3 Core, with dependencies between packages]
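A reader can picture the package declaration as extra namespaced attributes on the root element of the document. The Python sketch below lists the packages a document declares by looking for a namespaced `required` attribute. Treat the mechanism shown here as an assumption for illustration: the package namespace URI and the `px` prefix are hypothetical stand-ins for "Package X".

```python
import xml.etree.ElementTree as ET

# Hypothetical document declaring one package ("Package X").
DOCUMENT = (
    '<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" '
    'xmlns:px="http://example.org/sbml/packageX" '
    'level="3" version="1" px:required="true"><model/></sbml>'
)


def declared_packages(xml_text):
    """Map package namespace URI -> whether the package is flagged as required."""
    root = ET.fromstring(xml_text)
    suffix = "}required"
    packages = {}
    for name, value in root.attrib.items():
        # ElementTree stores namespaced attributes as "{uri}localname".
        if name.startswith("{") and name.endswith(suffix):
            packages[name[1:-len(suffix)]] = value == "true"
    return packages


print(declared_packages(DOCUMENT))
```

An application could compare this result against the packages it supports and warn the user about anything it cannot interpret.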

Level 3 package | What it enables
Hierarchical composition | Models containing submodels
Flux balance constraints | Flux balance analysis models
Qualitative models | Petri net models, Boolean models
Spatial | Nonhomogeneous spatial models
Multicomponent species | Entities with structure & state; rule-based models
Graph layout | Diagrams of models
Graph rendering | Diagrams of models
Distribution & ranges | Nonscalar values
Annotations | Richer annotation syntax
Groups | Arbitrary grouping of model components
Dynamic structures | Creation & destruction of model components
Arrays & sets | Arrays or sets of entities

How can we capture the simulation/analysis procedures?

Software can’t read figure legends

[Figure: BIOMD0000000319 in BioModels Database, from Decroly & Goldbeter, PNAS, 1982]

SED-ML = Simulation Experiment Description Markup Language

Application-independent format to capture procedures, algorithms, parameter values

• Neutral format for encoding the steps to go from model to output

Can be used for:

• Simulation experiments encoding parametrizations & perturbations

• Simulations using more than one model

• Simulations using more than one method

• Data manipulations to produce plot(s)

The libSedML project is developing an API library

http://www.biomodels.net/sedml
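The idea of describing an experiment separately from the model can be illustrated with a small in-memory structure. The Python sketch below is not SED-ML syntax: the `Task` class, its fields, the method label and the parameter names are all hypothetical, and it only shows how a perturbation might be recorded as data rather than hidden in a script.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    model: str    # reference to a model, e.g. a database identifier
    method: str   # label for the simulation method (hypothetical)
    changes: dict = field(default_factory=dict)  # parameter overrides


def effective_parameters(defaults, task):
    """Combine a model's default parameters with a task's perturbations."""
    parameters = dict(defaults)       # copy, so the defaults stay untouched
    parameters.update(task.changes)
    return parameters


# Two runs of the same model: one unmodified, one with a perturbed parameter.
tasks = [
    Task("BIOMD0000000319", "deterministic"),
    Task("BIOMD0000000319", "deterministic", {"k": 2.0}),
]

defaults = {"k": 1.0, "v": 3.0}  # hypothetical parameter values
for task in tasks:
    print(task.model, task.method, effective_parameters(defaults, task))
```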

What about visual diagrams?

Graphical representation of models

Today: broad variation in graphical notation used in biological diagrams

• Between authors, between journals, even people in same group

However, standard notations would offer benefits:

• Consistency = easier to read diagrams with less ambiguity

• Software support: verification of correctness, translation to math

SBGN = Systems Biology Graphical Notation

Goal: standardize the graphical notation in diagrams of biological processes

• Community-based development, à la SBML

Many groups participating

3 sublanguages to describe different facets of a model

http://sbgn.org



Such standards are the work of a great community

[Photo: Attendees at SBML 10th Anniversary Symposium, Edinburgh, 2010]

COMBINE (Computational Modeling in Biology Network)

• SBML, SBGN, BioPAX, SED-ML, CellML, NeuroML

Upcoming meeting: August 15–19 in Toronto, Canada

• Right before ICSB (International Conference on Systems Biology)

Get involved and make things better!

http://co.mbine.org


URLs

SBML http://sbml.org

BioModels Database http://biomodels.net/biomodels

COMBINE http://co.mbine.org

identifiers.org http://identifiers.org

MIRIAM http://biomodels.net/miriam

SED-ML http://biomodels.net/sed-ml

SBO http://biomodels.net/sbo

SBGN http://sbgn.org

I’d like your feedback!

You can use this anonymous form:

http://tinyurl.com/mhuckafeedback



SBML was made possible thanks to funding from:

National Institute of General Medical Sciences (USA)
European Molecular Biology Laboratory (EMBL)
JST ERATO Kitano Symbiotic Systems Project (Japan) (to 2003)
JST ERATO-SORST Program (Japan)
ELIXIR (UK)
Beckman Institute, Caltech (USA)
Keio University (Japan)
International Joint Research Program of NEDO (Japan)
Japanese Ministry of Agriculture
Japanese Ministry of Educ., Culture, Sports, Science and Tech.
BBSRC (UK)
National Science Foundation (USA)
DARPA IPTO Bio-SPICE Bio-Computation Program (USA)
Air Force Office of Scientific Research (USA)
STRI, University of Hertfordshire (UK)
Molecular Sciences Institute (USA)
