May (updated) 2010 Product Stack
Nov 28, 2014
Enterprise Approach
3
Enterprise Approach
Semantic Enterprise based on semantic Web, linked data
Leverage existing assets
» Data, records and instances
» Taxonomies, structure and schema
Layer semantics onto existing systems
Develop incrementally
Add sophistication, scope over time
Keep risks low
Integrate with public and Web data (“open world”)
4
Linked Data
“Linked Data is a set of best practices for publishing
and deploying instance and class data using the RDF
data model, naming the data objects using uniform
resource identifiers (URIs), thereby exposing the data
for access via the HTTP protocol, while emphasizing
data interconnections, interrelationships and context
useful to both humans and machine agents.”
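The definition above can be made concrete with a minimal sketch, assuming a hypothetical `http://example.org/id/` namespace and the public FOAF vocabulary. It names three data objects with URIs and writes their relationships as RDF triples in N-Triples syntax; it is an illustration only, not part of the product stack.

```python
# Hypothetical namespace, used only for illustration.
EX = "http://example.org/id/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"


def uri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string literal (minimal escaping only)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


# Each triple is (subject, predicate, object); subjects and predicates are URIs.
triples = [
    (uri(EX + "alice"), uri(RDF_TYPE), uri(FOAF + "Person")),
    (uri(EX + "alice"), uri(FOAF + "name"), literal("Alice")),
    (uri(EX + "alice"), uri(FOAF + "knows"), uri(EX + "bob")),
]


def to_ntriples(rows):
    """Serialize triples one per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in rows)


print(to_ntriples(triples))
```

The third triple is the "link" in linked data: its object is another URI that can itself be dereferenced over HTTP.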
5
Layers and Current Products
6
Current Products
structWSF: the pivotal product; Web services middleware that provides distributed data access and federation
conStruct SCS: Drupal-based structured data linkage to structWSF
irON: spreadsheet, JSON and XML authoring and conversion framework
UMBEL: reference set of linking subjects and basis for domain vocabularies
scones: an ontology- and entity-driven information extraction and tagging system
7
Fit of Current Products within Layers
8
Existing Assets Layer
9
Existing Assets
These are the materials that need to be federated, made interoperable, and given a common semantics
» structured data / databases
» semi-structured data (XML, Web pages)
» unstructured data (text)
10
Preserving Existing Assets
Relational databases (RDBMSs)
Distributed structured assets
» spreadsheets
» lightweight datastores
Web pages and Web sites
Existing documents and text
Web databases and APIs
Other databases (RDF, OO, etc.)
11
Access/Conversion Layer
12
Conversion
Provides in-place access to existing information
Translates existing formats and structures to RDF
Extracts structured information from unstructured text
Aids creation of interoperable datasets
Geared almost entirely to records, instances or entities (that is, basic data)
13
Conversion Methods
Relational DBs: RDB2RDF
RDFizers
Information Extraction
New Dataset Authoring
Direct Use (already in RDF)
14
Relational DB Conversion
Simple mappings of instance records to RDF
Methodologies well proven if kept to the instance level
RDB schemas inform the interoperable layer (“ontologies”)
Relational datastores left in place
Record data obtained via access layer (structWSF)
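A minimal sketch of instance-level mapping, assuming a hypothetical `customer` table and a hypothetical `http://example.org/` base URI. Each row becomes one subject URI and each non-null column becomes one triple; the relational store itself is left in place and only read. This is not the RDB2RDF tooling named above, just the basic idea.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical base URI


def rows_to_triples(conn, table, key_column):
    """Map each row to a subject URI and each non-null column to a triple.

    `table` and `key_column` are interpolated into SQL, so they must come
    from trusted configuration, never from user input.
    """
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY {key_column}")
    columns = [description[0] for description in cursor.description]
    triples = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"{BASE}{table}/{record[key_column]}"
        for column, value in record.items():
            if column == key_column or value is None:
                continue  # the key is already in the subject; skip NULLs
            triples.append((subject, f"{BASE}{table}#{column}", value))
    return triples


# In-memory stand-in for an existing relational datastore.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Acme", "Iowa City"), (2, "Globex", None)],
)
triples = rows_to_triples(conn, "customer", "id")
for triple in triples:
    print(triple)
```

The column names here serve as stand-in predicates; in the approach described above, the table schema would instead inform the ontologies layer.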
15
RDFizers
General serialization or data format conversions to RDF
Mostly applied to:
» Standard data formats and data structs
» Web content APIs
» Some legacy content
Sometimes some minor ontology or schema mapping
Embodies all conversion steps to linked data
We have access to more than 100 existing formats
16
RDFizers – Listing 1
URN handlers (in addition to IRI and URI): DOI, LSID, OAI
RDF serialization formats: irON, N3, RDF/XML, Turtle
Languages and ontologies: AB Meta, Annotea, APML, AtomOWL, Bibliographic Ontology, Creative Commons, EXIF, FOAF, GeoNames, GoodRelations, Java, Javadoc, MARC/MODS, Meta Standards, Music Ontology, Natural Language, Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), Open Geospatial, OWL, SIOC, SIOCT, SKOS, UMBEL, vCard, XML, others
(X)HTML pages, embedded Microformats and GRDDL * (see note below): DC, eRDF, geoURL, Google Base, hAudio, hCalendar, hCard, hListing, hResume, hReview, HR-XML, Ning, RDFa, relLicense, SVG, XBRL, XFN, xFolk, XR-XML, XSLT
Syndication formats: Atom, OPML, OCS, RSS 1.1, RSS 2.0, XBEL (for bookmarks)
REST-style Web service APIs: Alchemy, Amazon, Apple, Best Buy, Calais, CNet, CrunchBase, Del.icio.us, Digg, Discogs, Disqus, eBay, Facebook, Flickr, Freebase (MQL), FriendFeed, Garmin, Get Satisfaction, Google, Google Apps, Hoover's, HTTP (raw), ISBN DB, Last.fm, Library Thing, Magnolia, Meetup, MusicBrainz, New York Times, New York Times Campaign Finance (NYTCF), New York Times tags
17
RDFizers – Listing 2
REST-style Web service APIs (continued): Open Library, Open Social, Open Street, OpenLink (facets), O'Reilly, Picasa, Radio Pop (BBC), Rhapsody, Salesforce, Slideshare, Slidy, Technorati, Tesco, They Work For You, Twine, Twitter, Weather, Wikipedia, World Bank, Yahoo! BOSS, Yahoo! Finance, Yahoo! Maps, Yahoo! Weather, Yelp, YouTube, Zemanta, Zillow
Files (multitude of file formats and MIME types, including): audio (general), BibJSON, BibTeX and others, BitTorrent, commON, CSV, Fink, flat files, irJSON, irXML, JPEG, JSON, images, MS Office, OpenOffice, Open Document Format, Palm, RDF123, video, XLS, etc.
Metadata extractors: CRW, DEB, EXIF, OCW, RPM, XMP
Email formats: EMail, Outlook, RFC822
Version control and related systems: Bugzilla, Jira, POM, Subversion
Other Web service frameworks: BPEL, WSDL, XBRL, XBEL
Data exchange formats: iCalendar, LDIF, vCalendar, vCard
Relational databases and related: D2RQ, D2RMAP, RDF Views
Virtuoso VADs
OpenLink license files
Third-party metadata extraction frameworks: Aperture, Spotlight
Miscellaneous and other related converters: MPEG-7/CS → OWL, random XSD → OWL
*GRDDL (Gleaning Resource Descriptions from Dialects of Languages) accommodates a wide variety of dialects (see one listing) and can be combined with arbitrary transformation mechanisms (though currently mostly based on XSLTs).
18
scones
19
Information Extraction
scones (Subject Concept Or Named EntitieS) is our IE tagger
Information extraction is applied to input Web pages and unstructured text
May be applied after structure extraction (often, at minimum, defluffing)
Settable “window” for snippet (from # of bracketing terms to full document)
Extraction is performed for both:
» Entities (per Wikipedia and enterprise dictionaries)
» Subject concepts (per UMBEL and domain ontologies)
Presently in prototype
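The dictionary-driven tagging and settable snippet window described above can be sketched as follows. This is not the scones implementation; the function name `tag`, the dictionary contents, and the `entity:` / `concept:` labels are all hypothetical. The `window` argument is the number of bracketing terms kept on each side of a match.

```python
import re


def tag(text, dictionary, window=3):
    """Return (term, label, snippet) for each dictionary term found in text.

    Matching is case-insensitive and token-based; multi-word terms match
    consecutive tokens. `window` sets how many tokens of context are kept
    on each side of the match.
    """
    tokens = re.findall(r"\w+", text)
    lowered = [token.lower() for token in tokens]
    results = []
    for term, label in dictionary.items():
        term_tokens = term.lower().split()
        n = len(term_tokens)
        for i in range(len(tokens) - n + 1):
            if lowered[i:i + n] == term_tokens:
                start = max(0, i - window)
                end = min(len(tokens), i + n + window)
                results.append((term, label, " ".join(tokens[start:end])))
    return results


# Hypothetical dictionary: one named entity and one subject concept.
dictionary = {
    "Iowa City": "entity:Place",
    "database": "concept:Database",
}
text = "The firm in Iowa City converts each relational database to RDF."

for hit in tag(text, dictionary, window=2):
    print(hit)
```

Setting `window` to the token count of the document returns the full (tokenized) text as the snippet, which corresponds to the upper end of the range mentioned above.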
20
(Named) Entities
The places, events, people, objects, and specific things of the real world
Literally millions of notable instances
Each belongs to one or more subject concept(s)
Currently, the predominant basis for linked data
Public sources include Wikipedia, Freebase and others
Can be readily mixed-and-matched with private entities
21
Creating New Entity Dictionaries
22
Triangulating Information Extraction
23
irON – instance record and Object Notation
24
irON Dataset Authoring Framework
Simple authoring and dataset creation
irON includes an abstract notation and vocabulary for instance records
Serializations available for:
» XML (irXML)
» JSON (irJSON)
» CSV/spreadsheets (commON)
Notations for:
» Instance records
» Schema
» Datasets and metadata
» Linkages to other schema
25
Three irON Serializations
irXML
irJSON
commON
26
More-or-less Interchangeable Formats
27
structWSF
28
structWSF
Generally RESTful Web services middleware
Uniform, distributed access point
Provides the interoperability architecture
Based on canonical RDF data model
Dataset access orientation
Standard tools and services:
» User permissions and access
» CRUD (create, read, update, delete)
» Browse
» Full-text, faceted search
» Import / export
» Many others
29
RDF and Data Federation Model
30
Advantages of a Canonical Model
All tools can be driven from a single data format basis
Single converters can link in other hubs of data forms
‘Round-tripping’ through the canonical form can bring consistency and cleanliness to input data
RDF is well-suited as the canonical form:
» Structured data
» Semi-structured data
» Unstructured data (after IE)
» Simple-to-complex data structures
» Logic and inferencing
» Suitable to all input data formats
» Many serializations possible
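The hub idea behind a canonical model can be sketched in a few lines. The sketch assumes a simplified canonical form (a list of subject, predicate, object string triples standing in for RDF) and two hypothetical formats, CSV and JSON; all function names are illustrative. Each format needs only one reader and one writer against the canonical form, and any conversion is a round trip through it.

```python
import csv
import io
import json


def csv_to_triples(text):
    """Reader: CSV rows with an 'id' column -> triples; empty cells dropped."""
    rows = csv.DictReader(io.StringIO(text))
    return [(r["id"], k, v) for r in rows for k, v in r.items() if k != "id" and v]


def json_to_triples(text):
    """Reader: JSON list of records with an 'id' key -> triples."""
    return [(str(rec["id"]), k, str(v))
            for rec in json.loads(text) for k, v in rec.items() if k != "id"]


def _group(triples):
    """Regroup triples into one record per subject."""
    records = {}
    for s, p, o in triples:
        records.setdefault(s, {"id": s})[p] = o
    return list(records.values())


def triples_to_json(triples):
    return json.dumps(_group(triples), sort_keys=True)


def triples_to_csv(triples):
    fields = ["id"] + sorted({p for _, p, _ in triples})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(_group(triples))  # missing fields are written as empty
    return out.getvalue()


READERS = {"csv": csv_to_triples, "json": json_to_triples}
WRITERS = {"csv": triples_to_csv, "json": triples_to_json}


def convert(text, source, target):
    """Convert between any two registered formats via the canonical form."""
    return WRITERS[target](READERS[source](text))


def converters_needed(n_formats):
    """Directed converters needed: every pair directly vs. via one hub."""
    return {"pairwise": n_formats * (n_formats - 1), "via_hub": 2 * n_formats}


csv_text = "id,name,city\n1,Acme,Iowa City\n2,Globex,\n"
as_json = convert(csv_text, "csv", "json")
back_to_csv = convert(as_json, "json", "csv")
print(as_json)
print(back_to_csv)
print(converters_needed(10))
```

The round trip normalizes column order and drops empty cells, a small instance of the consistency benefit noted above; with ten formats the hub needs 20 converters where direct pairwise conversion would need 90.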
31
A Collaborative, Distributed Network
32
Flexible User Access Permissions
33
Access, APIs and Endpoints
The resulting linked data may be exposed as:
APIs
Web services
SPARQL endpoints
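As an illustration of the last option, a SPARQL endpoint accepts a query over HTTP GET in a `query` parameter, per the W3C SPARQL Protocol. The sketch below only builds the request URL; the endpoint address `http://example.org/sparql` is hypothetical, and nothing here is specific to structWSF.

```python
from urllib.parse import urlencode

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint URL

# List up to ten resources together with their rdfs:label values.
QUERY = """
SELECT ?s ?label WHERE {
  ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label .
} LIMIT 10
""".strip()


def sparql_get_url(endpoint, query):
    """Build a SPARQL Protocol GET URL with the query URL-encoded."""
    return f"{endpoint}?{urlencode({'query': query})}"


url = sparql_get_url(ENDPOINT, QUERY)
print(url)
```

Against a live endpoint, the URL could be fetched with `urllib.request.urlopen`, typically with an `Accept: application/sparql-results+json` header to request JSON results.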
34
Ontologies Layer
35
Ontologies
Ontologies provide the basis for:
» Interoperating
» Reconciling semantics
Multiple ontologies may be used at any time
Both enterprise (internal) and external ontologies
Best built incrementally, with participation
Easily modified: OK to test and experiment
36
Ontologies
The structural relationships of concepts within a domain
Generally class- (or set-) oriented
Analogous to a relational database schema, but with controlled vocabularies and exact semantics
Sets the structure of how to organize the actual data (“instances”) in the domain
Semantics and mapping techniques allow disparate ontologies to be inter-related
Supports inferencing and reasoning over the structure
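A minimal sketch of reasoning over class structure, assuming a hypothetical three-class hierarchy and one instance. An instance asserted to belong to a class is inferred to belong to every superclass as well, which is the simplest form of the subclass inference that ontology reasoners perform.

```python
# Hypothetical mini-ontology: each class maps to its direct superclasses.
SUBCLASS_OF = {
    "City": {"PopulatedPlace"},
    "PopulatedPlace": {"Place"},
    "Place": set(),
}

# Asserted instance data: the instance is only declared to be a City.
INSTANCE_TYPES = {"IowaCity": {"City"}}


def superclasses(cls, subclass_of):
    """Return all direct and indirect superclasses of a class."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_types(instance, instance_types, subclass_of):
    """Return asserted types plus every type implied by the hierarchy."""
    asserted = instance_types.get(instance, set())
    inferred = set(asserted)
    for cls in asserted:
        inferred |= superclasses(cls, subclass_of)
    return inferred


print(sorted(inferred_types("IowaCity", INSTANCE_TYPES, SUBCLASS_OF)))
```

A query for all instances of `Place` can then return `IowaCity` even though that fact was never stated directly.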
37
Migrating Structure to the Ontology Layer
38
Ontologies Layer
39
irON
40
irON Record Vocabulary
irON also provides the standard instance record vocabulary for all federated records
Each record source has its own attributes
But irON provides common descriptors:
» Useful for interoperating
» Unique, Web-accessible identifiers
» Standard descriptions and labels
» Conventions for “driving” user interfaces and tools
41
UMBEL
UMBEL (Upper Mapping and Binding Exchange Layer)
20,000 defined reference points in information space
Means to assert what a given chunk of content is about
Enable similar content to be aggregated
Place content in context with other content
Aggregation points for tying in instances and entities
Derived from, and a subset of, the Cyc knowledge base
Vocabulary basis for domain-specific subject ontologies
42
Notable Ontologies and Vocabularies
43
Management Layer
44
Management/Federation Layer
Management/Federation Layer handles:
» Ontology mapping, management
» Queries and retrievals
» All Web services
» Imports and exports
» Inferencing and logic
» Ontology creation and expansion
Works off of many RDF datastores
Has efficient, full-text indexing with faceting
Interface to the system is structWSF
Can plug into many options at the Applications Layer (so far, only Drupal with conStruct SCS is deployed)
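Full-text indexing with faceting, mentioned above, can be illustrated with a toy sketch. The records, field names, and functions below are hypothetical and say nothing about the indexing engine the stack actually uses: an inverted index maps each token to record ids, and facet counts summarize the matching records by attribute.

```python
from collections import Counter, defaultdict

# Hypothetical federated records with two facetable attributes.
RECORDS = [
    {"id": "r1", "text": "semantic web services", "type": "Product", "source": "internal"},
    {"id": "r2", "text": "linked data services", "type": "Dataset", "source": "public"},
    {"id": "r3", "text": "semantic enterprise data", "type": "Dataset", "source": "internal"},
]


def build_index(records):
    """Build an inverted index: token -> set of record ids containing it."""
    index = defaultdict(set)
    for record in records:
        for token in record["text"].lower().split():
            index[token].add(record["id"])
    return index


def search(records, index, term, facets=("type", "source")):
    """Return matching record ids and per-facet value counts for the hits."""
    matching_ids = index.get(term.lower(), set())
    hits = [record for record in records if record["id"] in matching_ids]
    counts = {facet: Counter(record[facet] for record in hits) for facet in facets}
    return [record["id"] for record in hits], counts


index = build_index(RECORDS)
ids, counts = search(RECORDS, index, "data")
print(ids)
print(counts)
```

The facet counts are what a browse interface would display beside the results so that a user can narrow them by type or source.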
45
Web-oriented Architecture
46
Applications Layer
47
conStruct SCS
48
conStruct Browse Screen
49
conStruct Capabilities
Based on Drupal
Single-click (cloud) deployment
Theming
User and group access and management
Data display templates
General content management system (CMS)
Publishing RDF
Open source
50
Re-cap
51
Summary
Incremental, low-risk approach to the semantic enterprise
Maximum leverage and re-use of existing information assets
Conversion and federation of all available data forms
Excellent uses for:
» Business intelligence
» Knowledge management
» Master data management modernization
» Taxonomy modernization
» Enterprise content integration
All baseline products are open source
52
Contacts & Information
Michael K. Bergman, CEO
319.621.5225
blog: www.mkbergman.com
Steve Ardire, Senior Advisor
Frédérick Giasson, CTO
blog: fgiasson.com/blog
Web Sites
structureddynamics.com
umbel.org
umbel.structureddynamics.com (UMBEL Web services)
citizen-dan.org (community indicator systems)
openstructs.org (open source distros + documentation)
constructscs.com (Drupal structured data system)