Subcellular Location Markup Language (SLML) Level 1 Version 1.6 Release 1. Communicating subcellular location protein patterns for systems biology. A Thesis Presented To Carnegie Mellon University, Pittsburgh, Pennsylvania, In Partial Fulfillment of the Requirements For the Degree of Master of Science In Computational Biology. By Iván E. Cao-Berg. Spring 2009

SLML Level 1 Version 1.6 Release 1 - Murphy Lab | murphylab.web.cmu.edu/services/SLML/Cao-BergMSThesis.pdf

Aug 12, 2020




“The nice thing about standards is that there are so many to choose from.”

-Andrew S. Tanenbaum


Contents

Abstract
Introduction
Subcellular Location Markup Language Level 1
  Document Conventions
  Preliminary Definitions and Principles
    Matrix
  Identification, Name, Meta and Notes
    Identification
    Name
    Meta
    Value
  Mathematical notation support
SLML Components
  The SLML Container
  Documentation
  ListOfCells
  Cell
  Information
  ListOfModels
  Model
  ListOfPatterns
  ListOfObjects
  Object
  Shape
  Texture
  Frequency
  ListOfParameters
  Parameter
    XML Definition
Materials and Methods
  SLML Toolbox for Matlab
    Main Tools
    SLML Model Trainer
Results
Discussion
  Future Levels in SLML
  Language Integration
    SBML
    VCML
  Software Integration
    MCell
References
Appendix
  List of Tests


Abstract

The Subcellular Location Markup Language (SLML) Level 1 Version 1.6 Release 1 is a model representation for generative models of protein subcellular location patterns. SLML is oriented towards describing and annotating the parameters and relationships of these models, which can be used to synthesize, among other things, multicolor images. SLML is an XML-based language; that is, it is written in a fashion that is neutral with respect to programming languages and software encoding. SLML was built as a tool for communicating these patterns with fewer bits than the original data by describing a model that is automated, generative and statistically accurate. Thus it provides a foundation for accurately describing compartmental volumes that can be incorporated into other systems biology markup languages such as SBML and CellML, as well as biochemical applications such as MCell and VCell. A detailed description of the language model is presented, together with a set of tools to train the generative models and synthesize multicolor images from SLML instances.

Introduction

In the study of complex biological phenomena, one of the most popular mathematical tools is the use of ordinary and partial differential equations to represent biochemical kinetics (Doyle 2001). Studying the behavior of such models involves much more than finding the solutions to the system of equations; approaches such as equilibrium analysis provide deep insight into the behavior of the model. Nevertheless, in recent years, research has shown that understanding complex biological systems requires the integration of experimental and computational research, in other words a systems biology approach (Kitano 2002).

Simulations of single or several biochemical pathways that can be found in literature have the

potential to be used in a systems-wide approach because they serve as building blocks for more

complex phenomena. Hence, computational models that reproduce and predict the detailed

behaviors of cellular systems at this level are the Holy Grail of systems biology (Kitano 2006).

Because multiple tools allow simulation at the systems level, their existence has fueled the development of languages that enable the use and reuse of mathematical models without the need to rewrite them for each tool. This permits instances of models to become building blocks of more complex simulations. Languages like the Virtual Cell Markup Language (VCML), CellML and the Systems Biology Markup Language (SBML) allow the communication of mathematical models in a neutral fashion. Yet their support for compartmental geometries is limited to constructions of geometrical shapes or the pixelation of 2D experimental images.

Thus, we strive to provide more detailed information about compartmental topologies that could be mapped into a language similar to SBML and used in other applications. The Subcellular Location Markup Language (SLML) Level 1 Version 1.6 Release 1 is a model representation format for generative models of protein subcellular location patterns. SLML is defined in the eXtensible Markup Language (W3C 2001) and is supported by an XML Schema that defines the components and relationships of the language model. The models described in SLML instances allow the systematic and comprehensive study of protein subcellular location and provide useful descriptions of these patterns. The models described in SLML Level 1 instances are (1) automated, (2) generative, (3) statistically accurate and (4) compact (Zhao and Murphy 2007).

The definition of the model description language presented in this document specifies only the generative model parameters and the relationships between models. It does not specify how programs should use SLML instances, nor does it describe how to implement them. Nevertheless, a collection of applications was written for reading and writing SLML instances as well as for generating new examples from them. The SLML Toolbox for Matlab is also described in detail in this document.

Subcellular Location Markup Language Level 1

The Subcellular Location Markup Language (SLML) Level 1 Version 1.6 Release 1 is a model

representation format for generative models of protein subcellular location patterns (Zhao and

Murphy, 2007). SLML is oriented towards communicating models that are

1. automated, in the sense that they are learned from experimental data,
2. generative, in the sense that we can synthesize new examples from the SLML instance,
3. statistically accurate, in the sense that the SLML instance describes the variations from cell to cell, and
4. compact, in the sense that we can communicate these variations using fewer bits than the original data set.


SLML is described as a collection of components in UML and mapped into an XML schema. This

allows the description of its contents in a neutral fashion that is system independent and widely

supported by most modern programming languages (Quackenbush 2006).

Document Conventions

All the components and attributes of the language model are described in the Unified Modeling Language (UML). The main reason for using UML is that it provides a system-independent representation of the model that is both intuitive and clear.

In the XML Schema 1.0 language there are two main classes of relationships between components. The first is the superclass relationship. In SLML Level 1 all major components have the Name and Identification components as parent classes. This will allow future developers of SLML to make changes across the schema easily without modifying the general structure of the language. The second is the “composed of” relationship, which may seem similar to the first to those who are not familiar with XML. It describes the case where a compartment is composed of other subcompartments that do not inherit attributes from the parent compartment. Most XML languages use this nesting convention in a fashion similar to HTML.


Figure 1 listOfParameters class in SLML Level 1 in UML. In SLML, classes do not possess operations so the third part is ignored.

In this document all parent classes are ignored in diagrams while “composed of” relationships

are shown for simplicity.

Figure 2 A snippet of SLML Level 1 that shows the two main relationships. Parent classes are ignored throughout this document since Identification and Name are parent classes of every major component.

Preliminary Definitions and Principles

SLML Level 1.0 inherits all primitive data types from XML Schema 1.0 (Biron and Malhotra,

2000) but in reality only a minor subset of them is actually used in the language model. These

data types are (1) integers, (2) strings, (3) booleans and (4) doubles.

Matrix

Figure 3 The Matrix, Mrow and Cn components in UML format. Only the Matrix component is presented in detail.


The Matrix component is a helper container that is used to define multidimensional arrays or matrices. Matrices are used in SLML to hold multidimensional parameters. The Matrix component follows a notation similar to the matrix element defined in MathML (W3C 2001), but adds further dimensions through its length, width and height attributes. A matrix with only a length is considered a vector; a matrix with a length and a width is considered a 2D matrix; and one that also has a height is considered a 3D matrix.

The Matrix component may contain other matrices, which in turn are composed of matrix rows. Each row may contain only numbers. The use of variables as array entries has not been considered in this Version. Hence the Matrix component in SLML cannot be mapped into the MathML namespace.

XML Definition

<!-- Definition:Matrix -->
<xsd:complexType name="Matrix">
  <xsd:sequence>
    <xsd:element name="mrow" type="Mrow" minOccurs="1" maxOccurs="unbounded" />
  </xsd:sequence>
  <xsd:attribute name="id" type="Identification" use="optional" />
  <xsd:attribute name="name" type="Name" use="required" />
  <xsd:attribute name="length" type="xsd:int" use="optional" />
  <xsd:attribute name="width" type="xsd:int" use="optional" />
  <xsd:attribute name="height" type="xsd:int" use="optional" />
  <xsd:attribute name="notes" type="xsd:notes" use="optional" />
</xsd:complexType>
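As a rough illustration of how a consumer might read such a matrix, the following Python sketch parses a small instance fragment and classifies it as a vector, 2D matrix or 3D matrix from the size attributes that are present. The element names "matrix" and "cn" and the sample values are assumptions made for this example; only the "mrow" element and the attribute names come from the definition above.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance fragment. The "matrix" and "cn" tag names and the
# numbers are assumptions; "mrow" and the attributes follow the schema.
MATRIX_XML = """
<matrix name="covariance" length="2" width="2">
  <mrow><cn>1.0</cn><cn>0.5</cn></mrow>
  <mrow><cn>0.5</cn><cn>2.0</cn></mrow>
</matrix>
"""


def matrix_kind(elem):
    """Classify a matrix by which of length/width/height are present."""
    present = tuple(elem.get(k) is not None for k in ("length", "width", "height"))
    kinds = {
        (True, False, False): "vector",
        (True, True, False): "2D matrix",
        (True, True, True): "3D matrix",
    }
    return kinds.get(present, "unspecified")


def matrix_rows(elem):
    """Return the numeric entries of each mrow as a list of floats."""
    return [[float(cn.text) for cn in row.findall("cn")]
            for row in elem.findall("mrow")]


root = ET.fromstring(MATRIX_XML.strip())
print(matrix_kind(root), matrix_rows(root))
```

Because rows may contain only numbers, converting every entry with float is sufficient in this sketch; nested matrices are not handled here.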

Identification, Name, Meta and Notes

These are the minor components of SLML used for describing patterns of data used in the

major components and their attributes.

Identification


Figure 4 The Identification component in UML format.

The Identification component describes the characters that can be used for the identification attribute of all major and minor components. Although the identification attribute is optional and its value is left to the user, it is good practice to make all identifications unique across an instance whenever they are used.

XML Definition

<!-- Definition:Identification -->
<xsd:simpleType name="Identification">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="(_|[a-z]|[A-Z])(_|[a-z]|[A-Z]|[0-9])*" />
  </xsd:restriction>
</xsd:simpleType>
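The pattern above can be checked outside a schema validator. The sketch below, written in Python purely for illustration, applies the same regular expression and also reports repeated identifications, reflecting the recommendation that they be unique across an instance. The function names are hypothetical and are not part of SLML or its toolbox.

```python
import re

# Same expression as the xsd:pattern facet. An XSD pattern must match the
# whole value, so re.fullmatch is used rather than re.search.
IDENTIFICATION = re.compile(r"(_|[a-z]|[A-Z])(_|[a-z]|[A-Z]|[0-9])*")


def is_valid_identification(value):
    """True if value starts with a letter or underscore, then letters, digits or underscores."""
    return IDENTIFICATION.fullmatch(value) is not None


def duplicate_ids(ids):
    """Return identifications that occur more than once, in first-seen order."""
    seen, dupes = set(), []
    for ident in ids:
        if ident in seen and ident not in dupes:
            dupes.append(ident)
        seen.add(ident)
    return dupes


print(is_valid_identification("nucleus_1"), is_valid_identification("1cell"))
print(duplicate_ids(["cell1", "model1", "cell1"]))
```

Uniqueness is not enforced by the pattern itself, which is why it is checked separately here.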

Name

Figure 5 The Name component in UML format.

The Name component defines the character set that can be used for the name attribute of all containers. The attribute is required for all major components and optional for minor ones. Although the choice of name is left to the user, the main purpose of this attribute is to map major components, such as models, to other languages that may reside in a different namespace within the same file, e.g. mapping a generative model into an SBML instance.

The characters allowed by the pattern include all Unicode characters. The Name component is a

parent class of every component in SLML Level 1.

XML Definition

<!-- Definition:Name -->
<xsd:simpleType name="Name">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="(_|[a-z]|[A-Z])(_|[a-z]|[A-Z]|[0-9])*" />
  </xsd:restriction>
</xsd:simpleType>

Meta

The Meta component defines the meta container used by the Information and Documentation classes. It follows the notation of HTML meta tags and is mainly used for annotation. All Meta components are optional.

The characters allowed by the pattern include all Unicode characters.

XML Definition

<!-- Definition:Meta -->
<xsd:complexType name="Meta">
  <xsd:attribute name="name" type="Name" use="required" />
  <xsd:attribute name="value" type="Value" use="required" />
</xsd:complexType>

Value


The Value component defines the character set that can be used for value attributes in the

Meta components. The characters allowed by the pattern include all Unicode characters.

XML Definition

<!-- Definition:Value -->
<xsd:simpleType name="Value">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="(_|[a-z]|[A-Z])(_|[a-z]|[A-Z]|[0-9])*" />
  </xsd:restriction>
</xsd:simpleType>

Mathematical notation support

SLML Level 1.0 includes support for MathML (W3C, 2008). Nevertheless, SLML itself does not use MathML at this point because a new Matrix class was defined in the SLML namespace for Version 1.6. Support for MathML will still be necessary for future Levels of the language, since the inclusion of methods in MathML content format has been discussed; this would allow any generic parser to synthesize images directly from the XML instance. MathML parsers are available for most programming languages, and conversion of MathML content format to equations is supported in popular environments such as Matlab and Java.

SLML Components

This section discusses the main components of SLML. Some of the components contain the parameters of the generative models while others exist to describe relationships between compartments.

The SLML Container


Figure 6 The main component of SLML.

The SLML container is the main component and main class of the language. It follows the notation of XML Schema 1.0 and contains four attributes:
1. The namespace of the language, i.e. http://murphylab.web.cmu.edu/services/SLML/level1. By convention it should point to the actual schema. Its use is required.
2. The Level of the current schema. The current Level is 1, which corresponds to the first public release of SLML. Its use is required.
3. The Version of the current schema. The current Version is 1.6, which is the version discussed in this document. Its use is required.
4. The Release of the current schema. The current Release is 1. Its use is required.
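The four required attributes lend themselves to a simple pre-validation check. The Python sketch below is one possible way to do it and is not taken from the SLML Toolbox: the root tag "slml" and the attribute names "level", "version" and "release" are assumptions, while the namespace URI is the one given in the list above.

```python
import xml.etree.ElementTree as ET

SLML_NS = "http://murphylab.web.cmu.edu/services/SLML/level1"

# Hypothetical root element: the tag and attribute names are assumptions.
INSTANCE = f'<slml xmlns="{SLML_NS}" level="1" version="1.6" release="1" />'


def check_container(xml_text):
    """Return the names of required container attributes that are missing."""
    root = ET.fromstring(xml_text)
    problems = []
    # ElementTree folds the namespace into the tag as "{namespace}localname".
    if not root.tag.startswith("{" + SLML_NS + "}"):
        problems.append("namespace")
    for attr in ("level", "version", "release"):
        if root.get(attr) is None:
            problems.append(attr)
    return problems


print(check_container(INSTANCE))
print(check_container('<slml level="1" />'))
```

A full validator would check the instance against the schema itself; this sketch only confirms that the four required attributes are present.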

Documentation

Figure 7 The Documentation component of SLML in UML format.


The Documentation container allows the user to add new information to the SLML instance. The main purpose of this class is to let the user annotate the SLML instance with additional data that its users might find useful. This class is in turn composed of a Meta component that is similar in notation to the meta tag used in HTML. The use of this component is optional. Its only argument is also optional.

ListOfCells

Figure 8 The ListOfCells component in UML format.

The ListOfCells component is merely a container for all cell models in an SLML Level 1 instance. It

aggregates all cell models making

Cell

Figure 9 The Cell component in UML format.


The Cell component defines the cell container which is composed of a list of models and

information regarding the data set. Several three-color generative models may be contained

within a single cell compartment. This means they come from the same data set.

Information

Figure 10 The Information component in UML format.

The Information component is a container of information regarding the Cell component. Its

purpose is to annotate the experiment or dataset used to train the generative model.

ListOfModels

Figure 11 The ListOfModels component in UML format.

The ListOfModels is an aggregator of models that facilitates searching.

Model


Figure 12 The Model component in UML format.

A model is composed of a list of patterns. Several patterns can make up a model, e.g. a vesicular model is composed of a medial-axis model for nuclear shape, a radial distance model for the cell membrane and a Gaussian mixture model for the protein distribution of vesicular compartments.

ListOfPatterns

Figure 13 The ListOfPatterns component in UML format.

The ListOfPatterns is an aggregator of patterns that facilitates searching.

ListOfObjects


Figure 14 The ListOfObjects component in UML format.

The ListOfObjects is an aggregator of objects that facilitates searching.

Object

Figure 15 The Object component in UML format. The order of the other containers it is made of is irrelevant.

The Object component contains a description of an object in the pattern. Every object in SLML Level 1.0 is composed of four other main components:
1. Shape component. It describes the shape of the object, e.g. a nuclear shape model.
2. Texture component. It describes the texture of the object, e.g. a nuclear texture model.
3. Position component. It describes the position of the object with respect to other objects, e.g. a Gaussian object position model.
4. Frequency component. It describes the number of objects in the pattern.
The minimum number of objects supported by this language is 1 and the maximum is unbounded. Note that a pattern may be composed of several object types.
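The containment structure of a pattern can be mirrored in ordinary data structures. The Python sketch below is illustrative only: all class and field names are hypothetical, the shape is treated as the only required member and texture and frequency as optional, following the component descriptions in this document, and treating position as optional is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Parameter:
    name: str
    value: object


@dataclass
class SubModel:
    """A shape, texture, position or frequency model: a named list of parameters."""
    name: str
    parameters: List[Parameter] = field(default_factory=list)


@dataclass
class PatternObject:
    name: str
    shape: SubModel                       # required
    texture: Optional[SubModel] = None    # optional
    position: Optional[SubModel] = None   # optionality assumed for this sketch
    frequency: Optional[SubModel] = None  # optional


@dataclass
class Pattern:
    name: str
    objects: List[PatternObject]

    def __post_init__(self):
        # The language supports a minimum of one object and no upper bound.
        if len(self.objects) < 1:
            raise ValueError("a pattern needs at least one object")


nucleus = PatternObject(name="nucleus", shape=SubModel("medial_axis"))
pattern = Pattern(name="nuclear", objects=[nucleus])
print(pattern.objects[0].shape.name, pattern.objects[0].texture)
```

Keeping the optional members as None makes it easy for a synthesizer to fall back to default behavior when a sub-model is absent.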

Shape

Figure 16 The Shape component in UML format. This is the only member of the Model container that is required.

The Shape component contains a description of the shape of the object. A shape model is

composed of a list of parameters that describe the model. This list should contain all the

parameters needed to synthesize the shape of the object. This is the only member of the Model

component that is required, since the minimum information needed to synthesize an object is

its shape.

Texture


Figure 17 The Texture component in UML format.

The Texture component contains a description of the texture of an object. A texture model is composed of a list of parameters that describe the model. This list should contain all the parameters needed to synthesize an object with texture. This member of the Model component is optional since some objects can be synthesized without texture. That is, any software that parses an SLML instance in which an object does not contain a texture model should synthesize only the object outline.

Frequency

Figure 18 The Frequency component in UML format.

The Frequency component contains a description of the number of objects in a pattern. A frequency model is composed of a list of parameters that describe the model. This list should contain all the parameters needed to determine how many objects to synthesize. This member of the Model component is optional since some patterns are composed of a single object. That is, any software that parses an SLML instance in which an object does not contain a frequency model should synthesize a single object.

Although a frequency model is normally composed of the parameters needed to sample from a distribution, SLML also allows frequency models to be described as integers that simply state how many objects should be synthesized.
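The three cases described above (no frequency model, a plain integer, or distribution parameters) can be sketched as a single function. The Python example below is only an illustration: the function name, the dictionary keys "mean" and "std", and the choice of a normal distribution are all assumptions, since this document does not fix which distribution a frequency model uses.

```python
import random


def object_count(frequency=None, rng=None):
    """Decide how many objects to synthesize for one pattern."""
    rng = rng or random.Random()
    if frequency is None:
        # No frequency model: synthesize a single object.
        return 1
    if isinstance(frequency, int):
        # Frequency given directly as an integer count.
        return frequency
    # Hypothetical distribution parameters: a normal distribution is assumed
    # here purely for illustration; at least one object is always returned.
    sample = rng.gauss(frequency["mean"], frequency["std"])
    return max(1, round(sample))


print(object_count())
print(object_count(7))
print(object_count({"mean": 20.0, "std": 3.0}, random.Random(0)))
```

Passing a seeded random.Random instance makes the sampled count reproducible, which is convenient when comparing synthesized images.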

ListOfParameters

Figure 19 ListOfParameters component in UML format.

The ListOfParameters component is a mere aggregator for the parameters of the shape, texture, position and frequency models. The use of this component is optional, though its absence means no parameters are present.

Parameter

Figure 20 The Parameter component in UML format.


The Parameter component is probably the most important container of SLML Level 1. It holds the attributes of a parameter as well as its value. It contains six attributes:
1. Identification. Similar to other components. Its use is optional.
2. Name. Similar to other components. Its use is required.
3. Constant. True if the parameter is constant. The default value is true. Its use is optional.
4. Complex. True for parameters that contain other parameters. The default is false. Its use is optional.
5. Type. The data type of this parameter. It includes basic data types as well as a matrix definition. The DataType class is a list container that includes the data types supported by the Parameter container. The default is double. Its use is optional.
6. Notes. Annotation for the parameter. Its use is optional.
The Parameter component may hold a scalar, another parameter or a Matrix component.

XML Definition

<!-- Definition:Parameter -->
<xsd:complexType name="Parameter">
  <xsd:attribute name="id" type="Identification" use="optional" />
  <xsd:attribute name="name" type="Name" use="required" />
  <xsd:attribute name="constant" type="xsd:boolean" default="true" />
  <xsd:attribute name="complex" type="xsd:boolean" default="false" />
  <xsd:attribute name="type" type="DataType" use="optional" default="double" />
  <xsd:attribute name="notes" type="xsd:anyType" use="optional" />
</xsd:complexType>

<!-- Definition:DataType -->
<xsd:simpleType name="DataType">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="double" />
    <xsd:enumeration value="integer" />
    <xsd:enumeration value="string" />
    <xsd:enumeration value="boolean" />
  </xsd:restriction>
</xsd:simpleType>
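To show how the attribute defaults and the DataType enumeration interact, the following Python sketch reads a scalar parameter and converts its value according to the declared type. It is a hedged illustration, not the toolbox implementation: the tag name "parameter" and the idea that the scalar value is stored as the element's text are assumptions, and nested parameters and matrices are not handled.

```python
import xml.etree.ElementTree as ET

# One converter per value of the DataType enumeration.
CONVERTERS = {
    "double": float,
    "integer": int,
    "string": str,
    "boolean": lambda text: text in ("true", "1"),
}


def xsd_bool(text):
    """Interpret the xsd:boolean lexical forms true/false/1/0."""
    return text in ("true", "1")


def read_parameter(xml_text):
    """Read a scalar parameter, applying the defaults declared in the schema."""
    elem = ET.fromstring(xml_text)
    if elem.get("name") is None:
        raise ValueError("the name attribute is required")
    dtype = elem.get("type", "double")  # default="double"
    return {
        "id": elem.get("id"),
        "name": elem.get("name"),
        "constant": xsd_bool(elem.get("constant", "true")),  # default="true"
        "complex": xsd_bool(elem.get("complex", "false")),   # default="false"
        "type": dtype,
        # Assumption: the scalar value is the element's text content.
        "value": CONVERTERS[dtype]((elem.text or "").strip()),
    }


sigma = read_parameter('<parameter name="sigma">2.5</parameter>')
count = read_parameter(
    '<parameter name="count" type="integer" constant="false">12</parameter>')
print(sigma["value"], sigma["constant"], count["value"], count["constant"])
```

Applying the defaults in the reader means downstream code never needs to distinguish between an omitted attribute and one set explicitly to its default.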



Materials and Methods

Figure 21 Examples of synthesized images from SLML instances.

The data used to train the SLML Level 1 instances come from a 3D HeLa dataset. The data contain three fluorescence channels for each field, which correspond to the DNA distribution, total protein and one of six proteins in the dataset: (1) giantin, (2) gpp130, (3) LAMP2, (4) a mitochondrial protein, (5) nucleolin and (6) transferrin. The data used for this project can be found at http://murphylab.web.cmu.edu/data/2007_Cytometry_GenModel.html

The algorithms and software in this work were implemented in Matlab 2008a and Java. The software written for this project can be found at http://murphylab.web.cmu.edu/software/SLML. Training of generative models was performed according to (Zhao and Murphy 2009). Once the models are learned, they are converted to XML following the rules of the SLML Level 1 schema.

Figure 22 Flowchart describing the process of learning the SLML instances from a collection of microscope images.

SLML Toolbox for Matlab

The Subcellular Location Markup Language (SLML) Toolbox 2009 (v1.5.2) for Matlab is a collection of scripts and functions that perform the most common tasks on SLML Level 1 Version 1.* instances. The toolbox can (1) read, write, edit and save SLML instances; (2) validate SLML instances; (3) train generative models of protein subcellular location patterns; and (4) synthesize multicolor images from SLML instances.

The SLML Toolbox 2009 (v1.5.2) for Matlab is compliant with SLML Level 1. For more information about SLML Level 1 as well as other tools, visit http://murphylab.web.cmu.edu/services/SLML/level1

Main Tools

The Toolbox contains five tools for training generative models and synthesizing multicolor images. The tools used in this work are described below.

img2slml

img2slml is a command-line tool that trains a generative model of protein subcellular location and saves the model as an SLML Level 1 instance.

slml2img

slml2img is a command-line tool that synthesizes multicolor images from one or several SLML Level 1 instances.

SLML Image Synthesizer

This GUI-based tool allows the user to synthesize multicolor images from multiple SLML Level 1

instances.

SLML Model Trainer

The SLML Model Trainer will train a generative model of protein subcellular location from a

collection of three-color images. To use this tool run the command

Results

The generative models of protein subcellular location patterns were mapped to an XML language known as SLML Level 0. This level merely represented a mapping of all the variables needed to synthesize a three-channel digital image.

Level                 Release   Description
Level 0 Version 0     Private   An XML dump of the generative models data structure. (Deprecated)
Level 0 Version 0.5   Private   An XML dump of the generative models data structure. The data structure changed from the previous version. Supported by a DTD. (Deprecated)
Level 1 Version 1.6   Public    A detailed description of the generative models data structure that is not a dump yet allows a one-to-one mapping between documents. It supports documentation and is supported by an XML schema.

Figure 23 Three main versions of SLML.

After further examination of the first private release, new compartment relationships were included to capture the dependencies within the vesicle model, which can be described as containing (1) a nuclear shape and texture model, (2) a cell membrane model and (3) a Gaussian mixture protein pattern. With these relationships added, the language became Level 1 as it is described in this document.

Since SLML is software independent, a set of tools for reading, editing and writing SLML instances was written and used to test the validity of the SLML instances with respect to the schema language definition.

After the models were verified and known to be syntactically and semantically correct, two applications were constructed: a GUI-based application that synthesizes multicolor images and lets the user view the generated images in gallery form, and a GUI-based application for training generative models of subcellular location protein patterns from a three-color image collection.

The set of applications, known as the SLML Toolbox for Matlab, was then ported as executables for Windows, MacOSX and Linux.

Name                Release   Description
SLML Toolbox 2006   Private   • Train generative models from three-channel images
                              • Synthesize three-color images
                              • XML dump of the models
SLML Toolbox 2007   Private   • Train generative models from three-channel images
                              • Synthesize three-color images
                              • XML dump of the models supported (but not validated) by a DTD
                              • Verification through recursion
SLML Toolbox 2008   Public    • Train generative models from three-channel images
                              • Synthesize multicolor images
                              • Validation using a Java parser
                              • Joining of multiple models
SLML Toolbox 2009   Public    • Train generative models from three-channel images
                              • Synthesize multicolor images

Figure 24 History of the SLML Toolbox for Matlab.

The four main distributions were then tested on different operating systems to verify their integrity.

OS/Matlab        Matlab R13       Matlab 2006a     Matlab 2006b     Matlab 2007a   Matlab 2007b   Matlab 2008a
Win XP SP2       Untested         Failed           Passed           Passed         Passed         Passed
Win Vista        Not Compatible   Not Compatible   Not Compatible   Untested       Untested       Passed
Cygwin           Failed           Failed           Failed           Failed         Failed         Untested
MacOSX Tiger     Passed           Passed           Passed           Passed         Passed         Passed
MacOSX Leopard   Untested         Untested         Passed           Passed         Untested       Passed
Mandrake         Untested         Passed           Passed           Passed         Untested       Passed
OpenSuse         Passed           Passed           Passed           Passed         Untested       Passed
Ubuntu Dapper    Untested         Passed           Passed           Passed         Untested       Passed

Figure 25 Combinations of OS and Matlab version on which the Toolbox was tested.

Discussion

Future Levels in SLML

As it stands, SLML Level 1 provides a robust description for generative models of protein subcellular location. It is designed to cater to a wide variety of new models that can be easily mapped to SLML instances. Yet the language lacks a powerful descriptor for the relationships between compartments and models. Future developments of the language should consider the consequences of modeling dependencies and be able to map these to a set of rules.

Another important aspect is metadata. Even though SLML can annotate a great deal of additional data through its Documentation and Information components, a defined collection of metadata should be included so that information about the data set from which the models were trained, e.g. the resolution of the original images, can be mapped into SLML.

Language Integration

SBML

As it stands, SLML instances can easily be mapped into SBML instances. Even though the two languages share some class names, they reside in different namespaces, so names do not clash between them.
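The namespace argument can be illustrated with a short Python sketch in which two elements share a local name but belong to different namespaces and are retrieved independently. Both namespace URIs, the wrapping "container" element and the attribute values are placeholders chosen for illustration; they are not the namespaces published for SLML or SBML.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs, assumed for illustration only.
SLML_NS = "http://example.org/slml/level1"
SBML_NS = "http://example.org/sbml/level1"

document = f"""
<container xmlns:slml="{SLML_NS}" xmlns:sbml="{SBML_NS}">
  <slml:compartment name="nucleus" />
  <sbml:compartment name="cytosol" />
</container>
"""

root = ET.fromstring(document)

# ElementTree expands each prefixed tag to "{namespace-uri}local-name",
# so the two "compartment" elements are distinct tags.
slml_names = [e.get("name") for e in root.findall(f"{{{SLML_NS}}}compartment")]
sbml_names = [e.get("name") for e in root.findall(f"{{{SBML_NS}}}compartment")]

print(slml_names, sbml_names)
```

Because each lookup is qualified by its namespace, a tool that understands only one of the languages can skip the other's elements without ambiguity.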

Since we constructed the SLML Toolbox for Matlab and the SBML Toolbox for Matlab already exists, future developments towards inclusion in new languages should start at this point, where parsing a generative model into Matlab is trivial, as is parsing an SBML model.

VCML

VCML resides in its own namespace, so inclusion of SLML instances is just as trivial as with SBML. Nevertheless, VCML has powerful components for describing compartmental geometries. Thus, future development of SLML and VCML integration should aim to generate compartmental geometries in this format rather than to synthesize multicolor images.

Software Integration

The PSLID-VCell application is an integration of the Protein Subcellular Location Image Database (PSLID) and VCell that allows users to create geometries from generative models of subcellular location protein patterns.

Integration of SLML instances into PSLID-VCell should allow SLML files to be imported to generate new compartmental geometries, in a fashion similar to what can be done with experimental data.

MCell

MCell is a modeling tool for cellular microphysiology in 3D. Even though 3D models were not considered as part of this project, generative models of the 3D cellular framework are known (Zhao and Murphy 2007). Since MCell uses the Model Description Language (MDL), which is not an XML-based language, integration of SLML with MCell should occur at a different level. An intermediate solution to this problem would be to generate 3D meshes and map these to the Virtual Reality Modeling Language (VRML). VRML instances can be easily read in Blender, a free open source 3D content creation suite, and then mapped to MDL.

References

1. Aderem. Systems Biology: Its Practice and Challenges (2005). Cell 121:511-513.

2. Biron and Malhotra. XML Schema Part 2: Datatypes (2000). Retrieved from

http://www.w3c.org/TR/xmlschema-2 on May 1, 2009.

3. Butler. Computing 2010: from black holes to biology (1999). Nature 402:C67-C70.

4. Chou and Cai. Prediction and classification of protein subcellular location – sequence-order effect and pseudo amino acid composition (2003). Journal of Cellular Biochemistry 90:1250-1260.

5. Doyle. Beyond the spherical cow (2001). Nature 411:151-152.

6. Editorial. Towards a theory of biological robustness (2007). Molecular Systems Biology

3:137.

7. Hucka, et al. Systems biology markup language (SBML) Level 1: structures and facilities for basic model definitions. Retrieved from http://ww.sbml.org on January 23, 2009.

8. Hucka, et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models (2003). Bioinformatics 19(4):524-531.

9. Hunter, et al. Beginning XML (2007). O’Reilly. ISBN: 978-0-470-11487-2.

10. Kitano. Computational cellular dynamics: a network-physics integral (2006). Nature Reviews

7:163.

11. Kitano. Computational systems biology (2002). Nature 420:206-210.

12. Kitano. International alliances for quantitative modeling in systems biology (2005).

Molecular Systems Biology 1:1-2.

13. Kitano. Systems Biology: A Brief Overview (2002). Science 295(5560):1662-1664.

14. Lloyd, Halstead and Nielsen. CellML: its future, present and past. Progress in Biophysics and

Molecular Biology 85(2-3):433-450.

15. Nature. Are you ready for the revolution? (2001) Nature 409:758-760.

16. Noble. The rise of computational biology (2002). Nature Reviews 3:460-463.

17. Oram and Wilson. Beautiful Code, Leading Programmers Explain How They Think (2007).

O’Reilly. ISBN-10: 0-5960-51004-7.

18. Quackenbush. Standardizing the standards (2006). Molecular Systems Biology 10:1-2.

19. Slepchenko, et al. Computational Cell Biology – Spatiotemporal Simulation of Cellular Events (2002). Annu. Rev. Biophys. Biomol. Struct. (31):423-441.

20. W3C. Extensible Markup Language (XML) 1.0 (2008). Retrieved from

http://www.w3.org/TR/2008/REC-xml-20081126/ on May 1, 2009.

21. W3C. Mathematical Markup Language (MathML) 2.0 (2003). Retrieved from

http://www.w3.org/TR/MathML2/ on May 1, 2009.

22. W3C. Namespaces in XML 1.0 (2006). Retrieved from http://www.w3.org/TR/REC-xml-names/ on May 1, 2009.

23. W3C. XML Schema Data Types (2004). Retrieved from http://www.w3.org/TR/xmlschema-2/ on May 1, 2009.

24. You. Toward Computational Systems Biology (2004). Cell Biochemistry and Biophysics

40:167-185.

25. Zhao and Murphy. Automated learning of generative models for subcellular location:

building blocks for systems biology (2007). Cytometry Part A 71A:978-990.

Appendix

List of Tests

The following table contains a description of all the tests performed on the SLML Toolbox. In

order for a distribution/OS to be compatible with the toolbox, all tests must pass.

Test   Description
0000   Train a generative model of protein subcellular location pattern. This test trains the models from (Zhao & Murphy, 2007) and then parses them to SLML Level 1.
0001   Test isCompatible.m on juggernaut.cbi.cmu.edu, troll.cbi.cmu.edu and alien.cbi.cmu.edu, the three systems that were used to create the stand-alone applications.
0002   Test array2mathml.m
0003   Test mathml2array.m
0004   Parse a set of generative models of protein subcellular location to SLML.
0005   Train a generative model of protein subcellular location from a collection of microscope images and save them as SLML instances. The number of images used will train a negligible model, because the purpose of this test is just to test the parsing into SLML vocabulary.
0006   Train a generative model of protein subcellular location from a subset of a collection of microscope images. The number of images used will train a negligible model, because the purpose of this test is just to test the set of functions that train the model.
0007   Train a generative model of protein subcellular location from a collection of TfR microscope images and save them as SLML instances. The purpose of this test is to assess the time it takes to train a model.
0008   Train a generative model of protein subcellular location pattern from a collection of giantin microscope images and save them as SLML instances. The purpose of this test is to assess the time it takes to train a model.
0009   Generate multicolor images from multiple generative models.
0010   Test model2slml.m
0011   Generate framework from a single SLML instance. Used to test ml_gencellcomp from SLIC.
0012   Generate cell framework from lysosome.mat
0013   Generate cell framework from nucleolus.mat
0014   Generate cell framework from endosome.mat
0015   Generate cell framework from giantin.mat
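Tests 0002 and 0003 exercise array2mathml.m and mathml2array.m, whose source is not reproduced in this document. The Python sketch below only illustrates the kind of round trip such a pair of tests would check. It assumes the content MathML encoding of a matrix (matrix, matrixrow and cn elements); the function names array_to_mathml and mathml_to_array are hypothetical, and the Matlab functions may use a different encoding.

```python
import xml.etree.ElementTree as ET


def array_to_mathml(rows):
    """Encode a list of numeric rows as a content MathML <matrix> string."""
    matrix = ET.Element("matrix")
    for row in rows:
        row_el = ET.SubElement(matrix, "matrixrow")
        for value in row:
            # repr() of a Python float round-trips exactly through float().
            ET.SubElement(row_el, "cn").text = repr(float(value))
    return ET.tostring(matrix, encoding="unicode")


def mathml_to_array(text):
    """Decode a content MathML <matrix> string back into a list of rows."""
    matrix = ET.fromstring(text)
    return [
        [float(cn.text) for cn in row.findall("cn")]
        for row in matrix.findall("matrixrow")
    ]


# Round trip: decoding the encoded matrix should return the original values.
original = [[1.0, 2.5], [-3.0, 4.0]]
encoded = array_to_mathml(original)
decoded = mathml_to_array(encoded)
print(encoded)
print(decoded == original)
```

A test in this style passes only if every element survives the conversion unchanged, which is the property a matrix-valued Parameter needs when a model is written to and read back from an SLML instance.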