UniProt and the Semantic Web

Chimezie Ogbuji

‘Omics’ Data Challenges

Advances in protein science are a major catalyst for the exploding availability of bioinformatics data

We have already discussed the dimensions of omics data: molecular components, interactions, and phenotype observations

Data from large-scale experiments are no longer published conventionally but stored in a database

Protein sequence databases are among the most comprehensive information resources for scientists

Protein Sequence Databases

Universal protein sequence databases cover all species

Specialized protein databases are particular to a protein family or organism

Sequence repositories: a simple registry of sequence records, with no annotations

Curated protein databases: enrich sequence information with links to various sources (primarily scientific literature)

Informatics Challenges

The standard data integration challenge is the lack of common conventions

This applies not just to notation but also to the use of identifiers, the representation of cross-references, and the framework for defining terms and the relationships between them

Links between omics sources are another important component of data integration

What is UniProt?

A comprehensive repository of protein sequences and their functional annotations

Curators add value to raw data by annotating it against the scientific literature

Objective: the creation and maintenance of stable, comprehensive, and high-quality protein databases, with a high level of accessibility, to facilitate cross-database information retrieval

Makes use of Semantic Web technologies to address its challenges

UniProt: Core Activities

Sequence archiving

Manual (peer-reviewed) and automated curation of sequences

Development of a human- and machine-readable UniProt web site

Interaction with other protein-related databases for expanding cross references

UniProt: Components

UniProtKB: protein sequence annotations and metadata

Protein name, function, taxonomy, enzyme-specific information, domains, sites, subcellular location, interactions, relationships to disease, etc.

Links to external sources: DNA sequence repositories, protein structure databases, protein domain and family databases, and species- and function-specific data collections

UniRef: compresses sequences at different resolutions

Parameterized by the percentage identity between two sequences or sub-sequences (100, 90, 50)

UniParc: non-redundant database of all publicly available protein sequences

Manages globally unique identifiers, the sequence, information on the source database, and a CRC check number

Semantic Web Technologies

Set of standards for managing web-based content in a way that emphasizes use by an automaton

Automaton: a machine that performs a function according to a predetermined set of coded instructions

The architectural vision (the Semantic Web) is to extend the standards and best practices behind the World Wide Web with new standards that emphasize the meaning of data over its structure

Common data formats

Provide a means to make assertions about the world such that an automaton can reason about it through them

The vision is often confused with the tools meant to achieve it (i.e., the set of standards)

RDF: Data Model

Standardized format for representing arbitrary information as a labelled, directed graph

Composed of statements (triples): subject, predicate, object

Terms in statements can be Uniform Resource Identifiers (URIs), Blank Nodes (anonymous entities), or Literals

Abstract data model: a labelled, directed graph

Various serializations: XML-based and text-based
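The data model on this slide can be illustrated with a minimal sketch in plain Python. It assumes a toy representation: the classes `URI`, `BlankNode`, and `Literal`, the `http://example.org/` namespace, and the sample statements are all hypothetical and are not part of any RDF library or of UniProt. The output loosely follows the line-oriented N-Triples text serialization, with deliberately simplified escaping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    value: str


def term_to_ntriples(term):
    """Render one term in N-Triples-style syntax (escaping is deliberately minimal)."""
    if isinstance(term, URI):
        return f"<{term.value}>"
    if isinstance(term, BlankNode):
        return f"_:{term.label}"
    escaped = term.value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def graph_to_ntriples(graph):
    """Serialize a set of (subject, predicate, object) statements, one per line."""
    lines = (" ".join(term_to_ntriples(t) for t in triple) + " ." for triple in graph)
    return "\n".join(sorted(lines))


EX = "http://example.org/"  # hypothetical namespace, for illustration only

# A graph is simply a set of statements; each statement is one labelled, directed edge.
graph = {
    (URI(EX + "protein1"), URI(EX + "name"), Literal("Example protein")),
    (URI(EX + "protein1"), URI(EX + "encodedBy"), BlankNode("gene1")),
    (BlankNode("gene1"), URI(EX + "name"), Literal("exampleGene")),
}

print(graph_to_ntriples(graph))
```

Representing the graph as a set of statements reflects the abstract model: there is no inherent order among statements, and the same statement cannot occur twice.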

Information About John Smith

Modelling vocabulary: RDFS/OWL

RDF Schema (RDFS): a simple, minimal schema language for RDF

Web Ontology Language (OWL): a vocabulary for defining classes, relationships, and various constraints that limit how RDF is interpreted

OWL is the more powerful modeling language

Tools for constraining & defining reality that can be used to codify scientific understanding

Gene Ontology is modelled in this way to capture our understanding of macromolecular reality

Query Language: SPARQL

Provides a common graph-matching language for querying RDF data

Similar to SQL in many respects
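The graph-matching idea behind SPARQL can be sketched in a few lines of plain Python. This is a simplified illustration, not a SPARQL implementation: terms are plain strings, anything starting with `?` is treated as a variable, and the `ex:` names and sample data are hypothetical. The equivalent SPARQL query is shown in a comment.

```python
def match_pattern(pattern, triple, bindings):
    """Unify one triple pattern with one triple; return extended bindings or None."""
    extended = dict(bindings)
    for pattern_term, triple_term in zip(pattern, triple):
        if pattern_term.startswith("?"):
            # A variable must keep the same value everywhere it appears.
            if extended.setdefault(pattern_term, triple_term) != triple_term:
                return None
        elif pattern_term != triple_term:
            return None
    return extended


def query(graph, patterns):
    """Return every set of variable bindings that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for triple in graph
            if (extended := match_pattern(pattern, triple, bindings)) is not None
        ]
    return solutions


# Hypothetical data, for illustration only.
graph = [
    ("ex:protein1", "ex:organism", "ex:taxonA"),
    ("ex:protein2", "ex:organism", "ex:taxonB"),
    ("ex:taxonA", "ex:name", "Taxon A"),
]

# Roughly: SELECT ?protein ?name
#          WHERE { ?protein ex:organism ?taxon . ?taxon ex:name ?name }
patterns = [
    ("?protein", "ex:organism", "?taxon"),
    ("?taxon", "ex:name", "?name"),
]

results = query(graph, patterns)
print(results)
```

The shared variable `?taxon` plays the role that a join condition plays in SQL, which is one sense in which the two languages are similar.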

Nature of UniProt Data

Very large number of cross references to external resources

The cross-reference topology is that of a graph, not a tree

Automated and manual annotation require storage of provenance information (how / when data was acquired)

Requires a framework for both data and metadata (data about data)

UniProt Distribution

UniProt: Data Conventions

All outbound RDF statements are grouped together (statements about the same subject)

Each dataset (a node in the previous graph) is distributed as a single file

Only stated data is stored, not entailed data. For instance, relationships involving symmetric properties are stored in only one direction
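The difference between stated and entailed data can be sketched as follows. This is a minimal illustration in plain Python, assuming string terms; the property name `ex:interactsWith` and the sample statements are hypothetical and are not taken from the UniProt vocabulary.

```python
def entail_symmetric(stated, symmetric_properties):
    """Return the stated statements plus the reverse of each symmetric one."""
    entailed = set(stated)
    for subject, predicate, obj in stated:
        if predicate in symmetric_properties:
            entailed.add((obj, predicate, subject))
    return entailed


# Hypothetical stated data: the symmetric relationship appears in one direction only.
stated = {
    ("ex:p1", "ex:interactsWith", "ex:p2"),
    ("ex:p1", "ex:organism", "ex:taxonA"),
}

entailed = entail_symmetric(stated, symmetric_properties={"ex:interactsWith"})
print(sorted(entailed - stated))
```

Storing only the stated direction keeps the distributed files smaller; a consumer that knows which properties are symmetric can recover the other direction when needed.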

UniProt: Naming Conventions

Generally, in semiotics: a symbol denotes a referent.

In Web architecture, URIs identify resources

URIs that can be resolved over the web are URLs

UniProt URIs identify:

Resources that correspond to database entries

Modeling vocabulary that uses standard namespaces: RDFS and OWL

Classes and properties used by UniProt, for example http://purl.uniprot.org/core/Gene

Resources without stable identifiers (from their source)

The Omics Identification Problem

UniProt uses a templated naming convention:

http://purl.uniprot.org/{database}/{identifier}

http://purl.uniprot.org/uniprot/{protein_identifier}
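The templated naming convention can be sketched as a small helper. The function name `uniprot_purl` is hypothetical, and percent-encoding the two path segments is an assumption made here for safety rather than a documented UniProt rule.

```python
from urllib.parse import quote

# Template as given on the slide.
PURL_TEMPLATE = "http://purl.uniprot.org/{database}/{identifier}"


def uniprot_purl(database, identifier):
    """Fill the template, percent-encoding each path segment (an assumption)."""
    return PURL_TEMPLATE.format(
        database=quote(database, safe=""),
        identifier=quote(identifier, safe=""),
    )


print(uniprot_purl("uniprot", "P04926"))
print(uniprot_purl("core", "Gene"))
```

Both calls reproduce URIs that appear elsewhere in these slides: the protein entry P04926 and the `Gene` class in the core vocabulary.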

Problem: http://purl.uniprot.org/uniprot/P04926 denotes the malaria protein EX-1

If loading that address in a browser returns a web page, can an automaton infer that malaria protein EX-1 is a web page?

How do you identify abstract concepts vs. digital media?


The PURL Solution

Persistent Uniform Resource Locator (PURL) is a public URI management service that lets a maintainer allocate a ‘URI space’ of identifiers (aliases) mapped to resources the maintainer is not immediately responsible for

PURLs are web addresses that act as permanent identifiers in the face of a dynamic and changing Web infrastructure

A request to a PURL returns a 303 HTTP status code and a location

303 (See Other) indicates that a response can be found under the returned location
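The 303 behaviour can be sketched without touching the network. The following is a simulation only: the `REDIRECTS` table, the target address, and the function names are assumptions made for illustration and do not describe how the actual PURL service is implemented.

```python
from http import HTTPStatus

# Hypothetical redirect table standing in for the PURL service's mapping.
REDIRECTS = {
    "http://purl.uniprot.org/uniprot/P04926": "http://www.uniprot.org/uniprot/P04926",
}


def resolve_purl(purl):
    """Simulate a PURL lookup: return (status, location) without following it."""
    location = REDIRECTS.get(purl)
    if location is None:
        return HTTPStatus.NOT_FOUND, None
    return HTTPStatus.SEE_OTHER, location


def interpret(purl):
    """Keep the identified thing separate from the document that describes it."""
    status, location = resolve_purl(purl)
    described_by = location if status == HTTPStatus.SEE_OTHER else None
    return {"identifies": purl, "described_by": described_by}


print(resolve_purl("http://purl.uniprot.org/uniprot/P04926"))
print(interpret("http://purl.uniprot.org/uniprot/P04926"))
```

Because the answer is a 303 rather than a 200, a client has no reason to conclude that the PURL itself names a web page; it only learns where a description can be found.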

The PURL Solution: Continued

Can use PURL addresses to identify abstract concepts

Redirect requests to such addresses to an informative web page (for humans) with a means for machines to extract other formats

RDF statements are about proteins, machines can reason about proteins, and humans resolve protein identifiers to view informative web pages

RDF/XML link: http://www.uniprot.org/uniprot/P04926.rdf

UniProt: Protein Class

UniProt: Annotation Hierarchy

Serendipitous Re-use

Having a rich repository of protein sequence metadata, annotations, and taxonomic classification in a distributed, standard format encourages scientific collaboration

General UniProt Re-Use Scenario

User A refers to protein P1 in their dataset

User A’s dataset doesn’t include statements about P1 (the host organism, for instance)

User B comes across this dataset and (in order to find out more about protein P1) puts the URI of protein P1 in their browser and pulls up human-readable information about it (including the host organism)

Automaton C comes across the same dataset, fetches the web page, fetches the RDF about P1 and has access to the same information as user B and can reason about the major taxon the host organism belongs to
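The reasoning step performed by Automaton C can be sketched as a merge of two sets of statements followed by a walk up a taxonomy. All names here are hypothetical placeholders (`ex:P1`, `ex:organism`, the taxa), and the use of `rdfs:subClassOf` to link a taxon to its parent is an assumption made for this illustration.

```python
def ancestors(graph, node, predicate):
    """Follow `predicate` edges upward from `node`, collecting every ancestor."""
    found = []
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for subject, pred, obj in graph:
            if subject == current and pred == predicate and obj not in found:
                found.append(obj)
                frontier.append(obj)
    return found


def host_lineage(graph, protein):
    """Map each host organism of `protein` to the taxa it belongs to."""
    hosts = [o for s, p, o in graph if s == protein and p == "ex:organism"]
    return {host: ancestors(graph, host, "rdfs:subClassOf") for host in hosts}


# User A's dataset mentions P1 but says nothing about its host organism.
user_a_dataset = {("ex:experiment7", "ex:measured", "ex:P1")}

# Statements Automaton C obtains by fetching the RDF about P1 (hypothetical).
protein_rdf = {
    ("ex:P1", "ex:organism", "ex:speciesS"),
    ("ex:speciesS", "rdfs:subClassOf", "ex:genusG"),
    ("ex:genusG", "rdfs:subClassOf", "ex:majorTaxonT"),
}

merged = user_a_dataset | protein_rdf  # merging RDF graphs is a set union

print(host_lineage(user_a_dataset, "ex:P1"))
print(host_lineage(merged, "ex:P1"))
```

The merge works without any coordination between User A and the data publisher because both sides use the same identifier for P1.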

References

Wu, C. et al., “The Universal Protein Resource (UniProt): an expanding universe of protein information”. Nucleic Acids Research, vol. 34, 2006

Swiss Institute of Bioinformatics, “UniProt RDF (project page)”. http://dev.isb-sib.ch/projects/uniprot-rdf/

Redaschi, N. and UniProt Consortium, “UniProt in RDF: Tackling Data Integration and Distributed Annotation”. Nature Precedings, 3rd International Biocuration Conference, April 2009. http://precedings.nature.com/documents/3193/version/1

