UniProt and the Semantic Web Chimezie Ogbuji
May 10, 2015
‘Omics’ Data Challenges
Advances in protein science are a major catalyst for the exploding availability of bioinformatics data
We have already discussed the dimensions of omics data: Molecular components, interactions, and phenotype
observations
Data from large-scale experiments are no longer published conventionally but stored in a database
Protein sequence databases are one of the most comprehensive information resources for scientists
Protein Sequence Databases
Universal protein sequence databases cover all species
Specialized protein databases are particular to a protein family or organism
Sequence repositories: a simple registry of sequence records, with no annotations
Curated protein databases: enrich sequence information with links to various sources (primarily scientific literature)
Informatics Challenges
A standard data integration challenge is the lack of common conventions
This applies not just to notation but also to: the use of identifiers; the representation of cross-references; and the framework for defining terms and the relationships between them
Links between omics sources are another important component of data integration
What is UniProt?
A comprehensive repository of protein sequences and their functional annotations
Curators add value to raw data by annotating it against the scientific literature
Objective: the creation and maintenance of stable, comprehensive, and high-quality protein databases, with a high level of accessibility, to facilitate cross-database information retrieval
Makes use of Semantic Web technologies to address its challenges
UniProt: Core Activities
Sequence archiving
Manual (peer-reviewed) and automated curation of sequences
Development of a human- and machine-readable UniProt web site
Interaction with other protein-related databases for expanding cross references
UniProt: Components
UniProtKB: protein sequence annotations and metadata, including protein name, function, taxonomy, enzyme-specific information, domains, sites, subcellular location, interactions, and relationships to disease
Links to external sources: DNA sequence repositories, protein structure databases, protein domain and family databases, and species- and function-specific data collections
UniRef: compresses sequences at different resolutions, parameterized by the percentage identity between two sequences or sub-sequences (100, 90, 50)
UniParc: non-redundant database of all publicly available protein sequences; manages globally unique identifiers, the sequence, information on the source database, and a CRC checksum
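The non-redundant archiving idea behind UniParc can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the `SEQ000001` identifier scheme, and the record layout are hypothetical, and `zlib.crc32` from the standard library stands in for whatever CRC variant the real database uses.

```python
import zlib


def checksum(sequence: str) -> str:
    """Hex CRC of a sequence (CRC-32 here, purely as a stdlib stand-in)."""
    return format(zlib.crc32(sequence.encode("ascii")), "08x")


class SequenceArchive:
    """Hypothetical non-redundant archive: one record per distinct sequence."""

    def __init__(self):
        self._by_sequence = {}  # sequence text -> record

    def add(self, sequence: str, source_db: str) -> str:
        """Register a sequence seen in source_db and return its archive identifier."""
        record = self._by_sequence.get(sequence)
        if record is None:
            record = {
                "id": f"SEQ{len(self._by_sequence) + 1:06d}",  # made-up identifier scheme
                "sequence": sequence,
                "crc": checksum(sequence),
                "sources": set(),
            }
            self._by_sequence[sequence] = record
        # The same sequence from another database only adds a source entry.
        record["sources"].add(source_db)
        return record["id"]

    def sources(self, sequence: str) -> set:
        return set(self._by_sequence[sequence]["sources"])

    def __len__(self) -> int:
        return len(self._by_sequence)


archive = SequenceArchive()
first = archive.add("MKTAYIAK", "databaseA")
second = archive.add("MKTAYIAK", "databaseB")  # duplicate sequence, new source
third = archive.add("MKV", "databaseA")
```

The archive is keyed on the full sequence rather than on the checksum, so a checksum collision cannot merge two different sequences; the checksum is kept only as a quick integrity check.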
Semantic Web Technologies
Set of standards for managing web-based content in a way that emphasizes use by an automaton
Automaton: a machine that performs a function according to a predetermined set of coded instructions
The architectural vision (the Semantic Web) is to extend the standards and best practices behind the World Wide Web with new standards that emphasize the meaning of data over its structure
Common data formats
A means to make assertions about the world such that an automaton can reason about the world through them
The vision is often confused with the tools meant to achieve it (i.e., the set of standards)
RDF: Data Model
Standardized format for representing arbitrary information as a labelled, directed graph
Composed of statements (triples): subject, predicate, object
Terms in statements can be Uniform Resource Identifiers (URIs), Blank Nodes (anonymous entities), or Literals
Abstract data model: a labelled, directed graph
Various serializations: XML-based and text-based
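The statement model above can be sketched directly in Python. This is a minimal illustration, not an RDF library: the `http://example.org/` namespace and the sample statements are hypothetical, the serializer emits N-Triples-style lines with only basic escaping, and nothing checks RDF's rules about which term kinds may appear in which position.

```python
from dataclasses import dataclass
from typing import NamedTuple, Union


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    text: str


Term = Union[URI, BlankNode, Literal]


class Statement(NamedTuple):
    subject: Term
    predicate: Term
    obj: Term


def term_to_text(term: Term) -> str:
    """Render one term in an N-Triples-like notation."""
    if isinstance(term, URI):
        return f"<{term.value}>"
    if isinstance(term, BlankNode):
        return f"_:{term.label}"
    escaped = term.text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def serialize(graph) -> str:
    """Serialize a set of statements, one per line, in sorted order."""
    lines = (" ".join(term_to_text(t) for t in st) + " ." for st in graph)
    return "\n".join(sorted(lines))


EX = "http://example.org/"  # hypothetical namespace
graph = {
    Statement(URI(EX + "protein1"), URI(EX + "name"), Literal("Example protein")),
    Statement(URI(EX + "protein1"), URI(EX + "citation"), BlankNode("b0")),
    Statement(BlankNode("b0"), URI(EX + "title"), Literal("An example paper")),
}

print(serialize(graph))
```

Because the graph is a plain set of statements, the same abstract model could be written out in any serialization; only `serialize` would change.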
Information About John Smith
Modelling vocabulary: RDFS/OWL
RDF Schema (RDFS): a simple, minimal schema language for RDF
Web Ontology Language (OWL): a vocabulary for defining classes, relationships, and various constraints that limit how RDF is interpreted; a more powerful modeling language than RDFS
Tools for constraining & defining reality that can be used to codify scientific understanding
Gene Ontology is modelled in this way to capture our understanding of macromolecular reality
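To make "constraints that limit how RDF is interpreted" concrete, here is a small sketch of two standard RDFS entailments: `rdfs:subClassOf` is transitive, and an instance of a class is also an instance of its superclasses. The `ex:` class names are hypothetical, and a real reasoner covers many more rules than these two.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def rdfs_closure(triples):
    """Return the triples plus those entailed by subclass transitivity
    and by propagating rdf:type up the class hierarchy."""
    closure = set(triples)
    while True:
        superclass_links = [(s, o) for s, p, o in closure if p == SUBCLASS_OF]
        new = {
            (s, p, parent)
            for s, p, o in closure
            if p in (SUBCLASS_OF, RDF_TYPE)
            for child, parent in superclass_links
            if child == o
        } - closure
        if not new:
            return closure
        closure |= new


# Hypothetical class hierarchy and one typed instance.
stated = {
    ("ex:Kinase", SUBCLASS_OF, "ex:Enzyme"),
    ("ex:Enzyme", SUBCLASS_OF, "ex:Protein"),
    ("ex:p1", RDF_TYPE, "ex:Kinase"),
}
entailed = rdfs_closure(stated)
```

Only three statements are stored, yet an automaton applying these rules can conclude that `ex:p1` is an `ex:Protein`.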
Query Language: SPARQL
Provides a common graph-matching language for querying RDF data
Similar to SQL in many respects
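Graph matching can be illustrated without a SPARQL engine. The sketch below evaluates a list of triple patterns against a set of triples, treating terms that start with `?` as variables; it corresponds roughly to a query of the form `SELECT ?protein ?name WHERE { ?protein ex:organism ?org . ?org ex:name ?name }`. The `ex:` vocabulary and the data are hypothetical, and real SPARQL offers far more (filters, optional patterns, and so on).

```python
def match_pattern(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`.
    Returns the extended binding, or None if they are incompatible."""
    result = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if result.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return result


def query(graph, patterns):
    """Evaluate a basic graph pattern: every pattern must match under one binding."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for binding in solutions
            for triple in graph
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return solutions


graph = {
    ("ex:p1", "ex:organism", "ex:org1"),
    ("ex:p2", "ex:organism", "ex:org2"),
    ("ex:org1", "ex:name", "Organism one"),
    ("ex:org2", "ex:name", "Organism two"),
}
results = query(graph, [
    ("?protein", "ex:organism", "?org"),
    ("?org", "ex:name", "?name"),
])
```

The shared variable `?org` plays the role that a join condition plays in SQL, which is one sense in which the two languages are alike.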
Nature of UniProt Data
Very large number of cross references to external resources
The cross-reference topology is that of a graph, not a tree
Automated and manual annotation require storage of provenance information (how / when data was acquired)
Requires a framework for both data and metadata (data about data)
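One way RDF can carry metadata about a statement is reification: the statement itself becomes a resource that other statements describe. The sketch below uses the standard `rdf:Statement`, `rdf:subject`, `rdf:predicate`, and `rdf:object` terms. Whether UniProt stores provenance exactly this way is not stated here, so treat the approach, the `http://example.org/` properties, and the sample values as assumptions.

```python
from datetime import date

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/"  # hypothetical vocabulary for provenance properties


def reify(statement, statement_id, source, acquired_on):
    """Describe `statement` as a resource and attach how/when it was acquired."""
    s, p, o = statement
    return {
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", s),
        (statement_id, RDF + "predicate", p),
        (statement_id, RDF + "object", o),
        (statement_id, EX + "source", source),                      # how
        (statement_id, EX + "acquiredOn", acquired_on.isoformat()),  # when
    }


annotation = (EX + "protein1", EX + "locatedIn", EX + "membrane")
metadata = reify(annotation, EX + "statement1", "manual curation", date(2015, 5, 10))
```

The annotation and its provenance end up in the same graph model, so one query language covers both data and metadata.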
UniProt Distribution
UniProt: Data Conventions
All outbound RDF statements are grouped together (statements about the same subject)
Each dataset (a node in the previous graph) is distributed as a single file
Only stated data is stored, not entailed data. For instance, relationships involving symmetric properties are stored in only one direction
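The difference between stated and entailed data can be shown with a symmetric property. In this sketch a consumer recovers the reverse direction that the distribution leaves out; the property name `ex:interactsWith` and the protein names are hypothetical.

```python
def with_symmetric_entailments(triples, symmetric_properties):
    """Add the reverse of every triple whose predicate is a symmetric property."""
    entailed = set(triples)
    for s, p, o in triples:
        if p in symmetric_properties:
            entailed.add((o, p, s))
    return entailed


# Stated data: the interaction is stored in one direction only.
stated = {
    ("ex:proteinA", "ex:interactsWith", "ex:proteinB"),
    ("ex:proteinA", "ex:name", "Protein A"),
}
full = with_symmetric_entailments(stated, {"ex:interactsWith"})
```

Storing one direction halves the number of such statements in the distributed files, at the cost of requiring consumers to apply the rule themselves.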
UniProt: Naming Conventions
Generally, in semiotics: a symbol denotes a referent.
In Web architecture, URIs identify resources
URIs that can be resolved over the web are URLs
UniProt URIs identify:
Resources that correspond to database entries
Modeling vocabulary that uses standard namespaces: RDFS and OWL
Classes and properties used by UniProt, for example http://purl.uniprot.org/core/Gene
Resources without stable identifiers (from their source)
The Omics Identification Problem
UniProt uses a templated naming convention:
http://purl.uniprot.org/{database}/{identifier}
http://purl.uniprot.org/uniprot/{protein_identifier}
Problem: http://purl.uniprot.org/uniprot/P04926 denotes the Malaria protein EX-1
If loading that address in a browser returns a web page, can an automaton infer that Malaria protein EX-1 is a web page?
How do you identify abstract concepts vs. digital media?
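The templated naming convention on this slide can be sketched as a pair of helper functions. The function names are hypothetical; the base address and the two sample URIs are the ones given in these slides.

```python
from urllib.parse import quote, unquote

PURL_BASE = "http://purl.uniprot.org"


def purl(database: str, identifier: str) -> str:
    """Fill in the template http://purl.uniprot.org/{database}/{identifier}."""
    return f"{PURL_BASE}/{quote(database, safe='')}/{quote(identifier, safe='')}"


def split_purl(uri: str):
    """Recover (database, identifier) from a URI that follows the template."""
    prefix = PURL_BASE + "/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a UniProt PURL: {uri}")
    database, _, identifier = uri[len(prefix):].partition("/")
    if not database or not identifier:
        raise ValueError(f"missing database or identifier: {uri}")
    return unquote(database), unquote(identifier)


protein_uri = purl("uniprot", "P04926")
gene_class_uri = purl("core", "Gene")
```

Building a URI from its parts says nothing about what the URI denotes, which is exactly the question the slide raises.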
The PURL Solution
Persistent Uniform Resource Locator (PURL): a public URI management service for allocating a ‘URI space’ of identifiers (aliases) that map to resources the service is not directly responsible for
PURLs are web addresses that act as permanent identifiers in the face of a dynamic and changing Web infrastructure
A request to a PURL returns a 303 HTTP status code and a location
303 (See Other) indicates that a response can be found under the returned location
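The 303 behaviour described above can be sketched with a simulated PURL server. Nothing here touches the network: the redirect table is a hypothetical in-memory stand-in, and the target address for the human-readable page is an assumption made for illustration.

```python
SEE_OTHER = 303
NOT_FOUND = 404

# Hypothetical redirect table standing in for a PURL server.
PURL_TABLE = {
    "http://purl.uniprot.org/uniprot/P04926": "http://www.uniprot.org/uniprot/P04926",
}


def purl_response(uri):
    """Simulate the server: 303 plus a Location header, or 404 if unknown."""
    target = PURL_TABLE.get(uri)
    if target is None:
        return NOT_FOUND, {}
    return SEE_OTHER, {"Location": target}


def describe(uri):
    """Keep the thing identified separate from the document that describes it."""
    status, headers = purl_response(uri)
    if status != SEE_OTHER:
        raise LookupError(f"{uri} did not resolve (status {status})")
    return {"identifies": uri, "described_at": headers["Location"]}


result = describe("http://purl.uniprot.org/uniprot/P04926")
```

A client that receives 303 rather than 200 has been told that the requested URI is not itself the returned document, so it need not conclude that the protein is a web page.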
The PURL Solution: Continued
Can use PURL addresses to identify abstract concepts
Redirect requests to such addresses to an informative web page (for humans) with a means for machines to extract other formats
RDF statements are about proteins, machines can reason about proteins, and humans resolve protein identifiers to view informative web pages
RDF/XML link: http://www.uniprot.org/uniprot/P04926.rdf
UniProt: Protein Class
UniProt: Annotation Hierarchy
Serendipitous Re-use
Having a rich repository of protein sequence metadata, annotations, and taxonomic classification in a distributed, standard format encourages scientific collaboration
General UniProt Re-Use Scenario
User A refers to protein P1 in their dataset
User A’s dataset doesn’t include statements about P1 (the host organism, for instance)
User B comes across this dataset and (in order to find out more about protein P1) puts the URI of protein P1 in their browser and pulls up human-readable information about it (including the host organism)
Automaton C comes across the same dataset, fetches the web page, fetches the RDF about P1 and has access to the same information as user B and can reason about the major taxon the host organism belongs to
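Automaton C's part of this scenario can be sketched as a graph merge followed by a walk up the taxonomy. All URIs, property names, and taxa below are hypothetical placeholders, and the "fetched" statements are hard-coded rather than retrieved over the web.

```python
EX = "http://example.org/"  # hypothetical vocabulary and identifiers
P1 = EX + "protein/P1"
ORGANISM = EX + "organism"
PARENT_TAXON = EX + "parentTaxon"

# User A's dataset mentions P1 but says nothing about its host organism.
user_a_dataset = {(EX + "experiment1", EX + "measured", P1)}

# Statements obtained by resolving P1's URI (hard-coded here).
fetched_about_p1 = {
    (P1, ORGANISM, EX + "organismX"),
    (EX + "organismX", PARENT_TAXON, EX + "genusY"),
    (EX + "genusY", PARENT_TAXON, EX + "kingdomZ"),
}

# Because both graphs use the same URI for P1, merging is a set union.
merged = user_a_dataset | fetched_about_p1


def objects(graph, subject, predicate):
    """All objects of statements with the given subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def lineage(graph, taxon):
    """Follow parent-taxon links upward, guarding against cycles."""
    chain, seen, current = [], {taxon}, taxon
    while True:
        candidates = sorted(objects(graph, current, PARENT_TAXON) - seen)
        if not candidates:
            return chain
        current = candidates[0]
        seen.add(current)
        chain.append(current)


host = objects(merged, P1, ORGANISM)
major_taxa = lineage(merged, EX + "organismX")
```

The shared identifier is what makes the re-use serendipitous: User A never had to anticipate what others would want to know about P1.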
References
Wu, C. et al., “The Universal Protein Resource (UniProt): an expanding universe of protein information”. Nucleic Acids Research, vol. 34, 2006
Swiss Institute of Bioinformatics, “UniProt RDF (project page)”. http://dev.isb-sib.ch/projects/uniprot-rdf/
Redaschi, N. and UniProt Consortium, “UniProt in RDF: Tackling Data Integration and Distributed Annotation”. Nature Precedings, 3rd International Biocuration Conference, April 2009. http://precedings.nature.com/documents/3193/version/1