Confidential HP
Content Management, Metadata & Semantic Web
Keynote Address, Net.ObjectDAYS 2001, Erfurt, Germany, September 11, 2001
Amit Sheth
CTO/Sr VP, Voquette (www.voquette.com) [formerly Founder/CEO, Taalee, www.taalee.com]
Director, Large Scale Distributed Information Systems Lab, University of Georgia (lsdis.cs.uga.edu)
[email protected]
Metadata Extraction is a patent-pending technology of Taalee, Inc. Semantic Engine and WorldModel are trademarks of Taalee, Inc.
Keynote given at NetObjectDays conference, Erfurt, September 11, 2001.
One of the earliest keynotes discussing commercial semantic web technologies and semantic web applications (including semantic search, semantic targeting, and semantic content management). Prof. Sheth started a Semantic Web company, Taalee, Inc., in 1999 (its product was the MediaAnywhere A/V search engine), which merged to become Voquette in 2001 (product called SCORE), then Semagix in 2004 (product called Semagix Freedom), and then Fortent in 2006 (products included Know Your Customers). Additional details can be found in U.S. Patent #6311194, 30 Oct. 2001 (filed 2000).
Note: the commercial system used the term "WorldModel" because, at the time, business customers were not yet warm to "Ontology"; the concept and intent are the same. More recent information at http://knoesis.org
New Content Management Challenges faced by Enterprises
Semantic Content Management
Metadata
Metadata Descriptions and Standards
(Automated) Metadata Creation/Extraction/Tagging
Metadata Usage/Applications
Semantics (and Semantic Web)
Current and Future
Traditional Content Management: Core Objectives and Features
Primary Objective: Effectively create, manage and publish internal content, with
Existing content creation applications (MS-Office, Notes) and provide some new capabilities (Speech to text)
(Basic, Syntactic) metadata
Workflow or lifecycle support (from author to Web publication or distribution)
Versioning and Rollback
(Keyword-based/Syntactical) Search and Personalization
Internal Distribution
Web publishing
Content Creation and Edition
Content Management
Content Personalization and Services
Content Delivery
Technology/Product Provider Landscape
Traditional Content Management Companies: Interwoven, Vignette, Broadvision, Enprise, Documentum, Open Market
Three of several upcoming companies focusing on metadata, semantics and/or semantic web: Applied Semantics, Voquette (Taalee), Ontoprise
See http://business.semanticweb.org for more
Enterprise Content Management – sample user requirements (from a large Financial Svcs Company)
“If a new bond comes into inventory, then we should get a message, an alert...and be able to refine to say that I only have California, Oregon and Washington clients...."
“In the month of July, I received 95 e-mails from my subscriptions. These e-mails included 61 that had 143 attachments that had 67 more attachments. In total therefore, I received almost 400 documents including 5 different types (HTML,PDF, Word, Rich Media, …). Even with this volume, I had subscribed to only 10 categories in the Equities area. There are a total of 26 Equity Subscription areas and a total of 166 categories to which a user can subscribe across all Product Areas.”
Professional users of a traditional Content Management Product/Solution
Enterprise Content Management – sample user requirements (from a large Financial Svcs Company)
The real question is, "Which sales ideas may have significant relevance to my book of business?" For example, an earnings warning on an equity rated Hold or Lower and not owned by any of my clients may not be of high relevance to me. Ideally, a relevance analysis would:
Greatly reduce the volume of Product Area Ideas sent to every FA, hopefully to perhaps 10% to 20% or less of today's volume with ideas that are potentially actionable for that FA and his/her client
Result in FAs reading and evaluating the Product Area Ideas, taking appropriate actions, and generating sales because the Product Area Ideas would be relevant
Result in customer satisfaction because clients would understand FAs are paying attention to their needs and developing focused ideas
Professional users of a traditional Content Management Product/Solution
Enterprise Content Management – sample product requirements (from a large Financial Svcs Company)
“Content generation is a more complex and probably costly problem to solve ... we reportedly create about 9 million messages a month for field delivery. On average, this would mean 1,000 messages per month per ‘big user’ or perhaps only 500 to 600 per ‘little user’. … I strongly believe an analysis is in order of the nature and necessity of generated content, the establishment of content generation standards, the movement towards development and implementation of a relevance engine, …”
Director (Product Management) of a large company that uses a leading Content Management Product
New Enterprise Content Management Challenges
1. More variety and complexity
More formats (MPEG, PDF, MS Office, WM, Real, AVI, etc.)
More types (Docs, Images -> Audio, Video, variety of text: structured, unstructured)
More sources (internal, extranet, internet, feeds)
2. Information Overload
Too much data, precious little information (Relevance)
3. Creating Value from Content
How to distribute the right content to the right people as needed? (Personalization -- book of business)
Customized delivery for different consumption options (mobile/desktop, devices)
Insight, Decision Making (Actionable)
New Enterprise Content Management Technical Challenges
1. Aggregation
Feed handlers/Agents that understand content representation and media semantics
Push-pull, Web-DB-Files, Structured-Semi-structured-Unstructured data of different types
2. Homogenization and Enhancement
Enterprise-wide common view
Domain model, taxonomy/classification, metadata standards
Semantic Metadata, created automatically if possible
3. Semantic Applications
Search, personalization, directory, alerts, etc. using metadata and semantics (semantic association and correlation), for improved relevance, intelligent personalization, customization
Creating and Serving Metadata to Power the Life-cycle of Content
Produce/Aggregate: Where is the content? Whose is it?
Catalog/Index: What is this content about?
Integrate/Syndicate: What other content is it related to?
Personalize: What is the right content for this user?
Interactive Marketing: What is the best way to monetize this interaction?
Delivery channels: Broadcast, Wireline, Wireless, Interactive TV
Semantic Metadata: Applications, Back End
"A Web content repository without metadata is like a library without an index." - Jack Jia, IWOV
“Metadata increases content value in each step of content value chain.” - Amit Sheth
A Metadata Classification
Data (Heterogeneous Types/Media)
Content Dependent Metadata (size, max colors, rows, columns...)
Direct Content Based Metadata (inverted lists, document vectors, LSI)
Domain Independent (structural) Metadata (C++ class-subclass relationships, HTML/SGML Document Type Definitions, C program structure...)
Domain Specific Metadata: area, population (Census), land-cover, relief (GIS), metadata concept descriptions from ontologies
“The Web of data (and connections) with meaning in the sense that a computer program can learn enough about what the data means to process it. . . . Imagine what computers can understand when there is a vast tangle of interconnected terms and data that can automatically be followed.” (Tim Berners-Lee, Weaving the Web, 1999)
A Content Management centric definition of Semantic Web: The concept that Web-accessible content can be organized and utilized semantically, rather than through syntactic and structural methods.
Semantics: The Next Step in the Web’s Evolution
Next Generation:
Semantic Content Management
Organizing Content
Different and Related Objectives: Search, Browse, Summarization, Association/Relationships
Indexing
Clustering
Classification
Controlled Vocabulary, Reference Data/Dictionary/Thesaurus
Metadata
Knowledge Base (Entities/Objects and Relationships)
Traditional Text Categorization
Statistical/AI Techniques
Customer Article Feed (Article 4715) and Customer Training Set
Classify: place in a taxonomy (Classification of Article 4715)
Routing/Distribution
Standard Metadata: Feed Source: iSyndicate; Posted Date: 11/20/2000
Most traditional Content Management Products support Categorization of unstructured content.
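The categorization flow above (a customer training set used to place an incoming article in a taxonomy) can be sketched with a small statistical classifier. This is only an illustrative sketch, not the technique used by any product named in the talk: it assumes a multinomial naive Bayes model with add-one smoothing, and the class name, training sentences, and category labels are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and keep purely alphabetic tokens.
    return [w for w in text.lower().split() if w.isalpha()]


class NaiveBayesCategorizer:
    """Hypothetical statistical categorizer trained on labeled articles."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> number of training docs
        self.vocabulary = set()

    def train(self, text, category):
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, float("-inf")
        for category, n_docs in self.doc_counts.items():
            # Log prior plus log likelihood with add-one (Laplace) smoothing.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[category].values())
            for w in words:
                count = self.word_counts[category][w]
                score += math.log((count + 1) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical training set standing in for the "Customer Training Set".
categorizer = NaiveBayesCategorizer()
categorizer.train("quarterback throws touchdown pass in playoff game", "Sports")
categorizer.train("team wins game after late touchdown", "Sports")
categorizer.train("company reports quarterly earnings and stock rises", "Finance")
categorizer.train("analyst upgrades stock after earnings call", "Finance")

# Hypothetical text standing in for article 4715 from the feed.
article_4715 = "stock falls after company earnings warning"
category = categorizer.classify(article_4715)
print(category)
```

The resulting category could then drive routing and distribution; the feed source and posted date remain standard metadata that need no classification.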
WordNet
Cyc
The Medical Subject Headings (MeSH): NLM's controlled vocabulary used for indexing articles, for cataloging books and other holdings, and for searching MeSH-indexed databases, including MEDLINE. MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts. Year 2000 MeSH includes more than 19,000 main headings, 110,000 Supplementary Concept Records (formerly Supplementary Chemical Records), and an entry vocabulary of over 300,000 terms.
The content provider supplies NewsML-packaged media content to the operator. The content can be categorized as current events, finance, sport, etc. (but no standard is specified) and is updated hourly.
The operator receives NewsML data from the content provider. The content server automatically pushes updated news articles to all news service subscribers.
Consumers sign up for the news service directly on the device. When using the news service, the user browses through the categories and reads the news articles. The news articles are presented in a continuous flow (one after the other) without end-user interaction.
Source: http://www.mediabricks.com
NewsML
Content-descriptive metadata:
<HeadLine>Seattle attacked by Godzilla-like creature, Microsoft closes HQ</HeadLine>
<DateLine>Seattle, Was., Aug 30, 2009 /AthensWire via COMTEX/ --</DateLine>
<CopyrightLine>Copyright (C) 2009 AthensWire. All rights reserved.</CopyrightLine>
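As a rough illustration of how an application might read the content-descriptive elements shown above, the sketch below parses them with Python's standard XML library. The three elements are copied from the slide; wrapping them in a single NewsLines root element and the function name extract_news_lines are assumptions made so the fragment parses as a standalone document.

```python
import xml.etree.ElementTree as ET

# The three child elements come from the slide; the enclosing root element
# is an assumption added so that the fragment is well-formed on its own.
fragment = """
<NewsLines>
  <HeadLine>Seattle attacked by Godzilla-like creature, Microsoft closes HQ</HeadLine>
  <DateLine>Seattle, Was., Aug 30, 2009 /AthensWire via COMTEX/ --</DateLine>
  <CopyrightLine>Copyright (C) 2009 AthensWire. All rights reserved.</CopyrightLine>
</NewsLines>
"""


def extract_news_lines(xml_text):
    """Return a dict mapping each child element name to its text."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


metadata = extract_news_lines(fragment)
for name, value in metadata.items():
    print(f"{name}: {value}")
```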
Financial metadata for Buy/Sell sides
Highly domain-specific Schema (see next slide) [from UserGuide, p. 31]
Example: MorningCall.xml
RIXML Schema
Metadata Creation and Semanticization
Automatic Content
Classification/Categorization
Metadata Creation/Extraction:
Types of metadata created
Semantic Engine and WorldModel are trademarks of Taalee, Inc. Metadata Extraction is a patented technology of Taalee, Inc.
Content Handling/Ingest
Infrastructure/Exchange
Feed Handlers
Crawlers/Screen Scrapers/Bots
Software Agents
Centralized, Distributed, or Mobile/Migratory
Information Extraction for Metadata Creation
Sources: WWW, Enterprise Repositories; Nexis, UPI, AP; Feeds/Documents; Data Stores; Digital Maps; Digital Audios; Digital Videos; Digital Images; ...
METADATA EXTRACTORS
Key challenge: Create/extract as much (semantic) metadata automatically as possible
Extracting a Text Document: Syntactic approach
INCIDENT MANAGEMENT SITUATION REPORT
Friday August 1, 1997 - 0530 MDT
NATIONAL PREPAREDNESS LEVEL II
CURRENT SITUATION: Alaska continues to experience large fire activity. Additional fires have been staffed for structure protection.
SIMELS, Galena District, BLM. This fire is on the east side of the Innoko Flats, between Galena and McGr The fire is active on the southern perimeter, which is burning into a continuous stand of black spruce. The fire has increased in size, but was not mapped due to thick smoke. The slopover on the eastern perimeter is 35% contained, while protection of the historic cabin continues.
CHINIKLIK MOUNTAIN, Galena District, BLM. A Type II Incident Management Team (Wehking) is assigned to the Chiniklik fire. The fire is contained. Major areas of heat have been mopped up. All crews and overhead will mop up where the fire burned beyond the meadows. No flare-ups occurred today. Demobilization is planned for this weekend, depending on the results of infrared scanning.
LAYOUT
Date => day month int ‘,’ int
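The layout rule above (Date => day month int ',' int) can be approximated with a regular expression. This is a minimal sketch of the syntactic approach, assuming English day and month names; the pattern and the function name extract_date are illustrative and not taken from the original system.

```python
import re

DAYS = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Date => day month int ',' int
DATE_RULE = re.compile(
    rf"(?P<day>{DAYS})\s+(?P<month>{MONTHS})\s+(?P<date>\d{{1,2}})\s*,\s*(?P<year>\d{{4}})"
)


def extract_date(text):
    """Return the first date matching the layout rule as a dict, or None."""
    match = DATE_RULE.search(text)
    return match.groupdict() if match else None


# Header lines taken from the situation report shown above.
report_header = "INCIDENT MANAGEMENT SITUATION REPORT\nFriday August 1, 1997 - 0530 MDT"
fields = extract_date(report_header)
print(fields)
```

A purely syntactic rule like this finds the date but says nothing about which fire or district it refers to, which is the gap the semantic approach in the following slides addresses.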
Extraction Agent
Web Page Enhanced Metadata Asset
Taalee Extraction and Knowledgebase Enhancement
Automatic Categorization & Metadata Tagging (unstructured text/transcript of A/V)
ABSOLUTE CONTROL OF THE SENATE IS STILL IN QUESTION. AS OF TONIGHT, THE REPUBLICANS HAVE 50 SENATE SEATS AND THE DEMOCRATS 49. IN WASHINGTON STATE, THE SENATE RACE REMAINS TOO CLOSE TO CALL. IF THE DEMOCRATIC CHALLENGER UNSEATS THE REPUBLICAN INCUMBENT THE SENATE WILL BE EVENLY DIVIDED. IN MISSOURI, REPUBLICAN SENATOR JOHN ASHCROFT SAYS HE WILL NOT CHALLENGE HIS LOSS TO GOVERNOR MEL CARNAHAN WHO DIED IN A CRASH THREE WEEKS AGO. GOVERNOR CARNAHAN'S WIFE IS EXPECTED TO TAKE HIS PLACE. IN THE HIGHEST PROFILE SENATE EVENT OF THE NIGHT, HILLARY CLINTON WON THE NEW YORK SENATE SEAT. SHE IS THE FIRST FIRST LADY TO RUN MUCH LESS WIN.
Jimmy Smith Interview Part Seven
Jimmy Smith explains his philosophy on showboating. URL: http://cbs.sportsline...
Brian Griese Interview Part Four
Brian Griese talks about the first touchdown he ever threw. URL: http://cbs.sportsline...
Metadata from Typical Cataloging of Football Assets
Taalee Metadata on Football Assets
Rich Media Reference Page
Baltimore 31, Pit 24
http://www.nfl.com
Quandry Ismail and Tony Banks hook up for their third long touchdown, this time on a 76-yarder to extend the Ravens' lead to 31-24 in the third quarter.
League: Professional
Teams: Ravens, Steelers
Score: Bal 31, Pit 24
Players: Quandry Ismail, Tony Banks
Event: Touchdown
Produced by: NFL.com
Posted date: 2/02/2000
Crawler provided text for indexing vs Agent provided semantic metadata
Content of all formats, media, push/pull:
Web sites/pages: static, dynamic
Content Feeds (unstructured, semistructured/docs, tagged/XML)
Corporate Repositories/databases
Homogenization/integration:
with taxonomy (categorization)
contextually relevant metadata with respect to the domain model, automatically generated from content and inferenced
Example (test on http://directory.mediaanywhere.com)
Search for company ‘Commerce One’
Links to news on companies that compete against Commerce One
Links to news on companies Commerce One competes against (To view news on Ariba, click on the link for Ariba)
Crucial news on Commerce One’s competitors (Ariba) can be accessed easily and automatically
What else can a context do? (a commercial perspective)
Semantic Enrichment
Semantic Targeting
Semantic/Interactive Targeting
Buy Al Pacino Videos
Buy Russell Crowe Videos
Buy Christopher Plummer Videos
Buy Diane Venora Videos
Buy Philip Baker Hall Videos
Buy The Insider Video
Precisely targeted through the use of Structured Metadata and integration from multiple sources
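One way to picture how structured metadata could drive the targeted offers listed above is the sketch below. It assumes the asset being viewed carries a title and a list of actors in its metadata; the field names and the function targeted_offers are hypothetical and not part of the original system.

```python
def targeted_offers(asset_metadata):
    """Build offer strings from structured metadata of the asset being viewed."""
    offers = [f"Buy {actor} Videos" for actor in asset_metadata.get("actors", [])]
    title = asset_metadata.get("title")
    if title:
        offers.append(f"Buy {title} Video")
    return offers


# Hypothetical metadata record; the names are those shown on the slide.
asset = {
    "title": "The Insider",
    "actors": [
        "Al Pacino",
        "Russell Crowe",
        "Christopher Plummer",
        "Diane Venora",
        "Philip Baker Hall",
    ],
}

offers = targeted_offers(asset)
for offer in offers:
    print(offer)
```

A keyword match on page text alone could not reliably produce the actor list; the offers depend on the actors being present as distinct, typed metadata values.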
Example 1 – Snapshots (“Jamal Anderson”)
Search for ‘Jamal Anderson’ in ‘Football’
Click on first result for Jamal Anderson
View metadata. Note that Team name and League name are also included in the metadata
View the original source HTML page. Verify that the source page contains no mention of Team name and League name. They were Taalee’s value-additions to the metadata to facilitate easier search.
Example 2 – Snapshots (“Gary Sheffield”)
Search for ‘Gary Sheffield’ in ‘Baseball’
Click on first result for Gary Sheffield
View metadata. Note that Team name and League name are also included in the metadata
View the original source HTML page. Verify that the source page contains no mention of Team name and League name. They were Taalee’s value-additions to the metadata to facilitate easier search.
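The value-addition in these two examples, where team and league appear in the metadata although the source page never mentions them, can be sketched as a knowledge base lookup keyed on the extracted player name. This is an illustrative sketch only: the knowledge base structure, the field names, and the function enrich are assumptions, and the team and league entries are supplied here for illustration rather than taken from the slides.

```python
# Hypothetical knowledge base: entity -> facts known independently of any page.
KNOWLEDGE_BASE = {
    "Jamal Anderson": {"team": "Atlanta Falcons", "league": "NFL"},
    "Gary Sheffield": {"team": "Los Angeles Dodgers", "league": "MLB"},
}


def enrich(extracted, knowledge_base):
    """Return a copy of the extracted metadata with team/league added from the KB."""
    teams = list(extracted.get("teams", []))
    leagues = list(extracted.get("leagues", []))
    for player in extracted.get("players", []):
        facts = knowledge_base.get(player)
        if facts is None:
            continue  # Unknown entity: leave the metadata as extracted.
        if facts["team"] not in teams:
            teams.append(facts["team"])
        if facts["league"] not in leagues:
            leagues.append(facts["league"])
    enriched = dict(extracted)
    enriched["teams"] = teams
    enriched["leagues"] = leagues
    return enriched


# Metadata as an extractor might produce it from the page text alone.
page_metadata = {"players": ["Jamal Anderson"], "category": "Football"}
enriched = enrich(page_metadata, KNOWLEDGE_BASE)
print(enriched)
```

With the enriched record indexed, a search by team or league can return the page even though neither term occurs in its text.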
Semantic Web – Intelligent Content (supported by Taalee Semantic Engine)
Content linked to a COMPANY: Related Stock News, Industry News, Technology Products, Regulations (SEC, EPA), Competition
Relationships shown: COMPANIES in Same or Related INDUSTRY; COMPANIES in INDUSTRY with Competing PRODUCTS; Impacting INDUSTRY or Filed By COMPANY; Important to INDUSTRY or COMPANY
Intelligent Content = What You Asked for + What you need to know!
Semantic Application – Equity Dashboard
Focused relevant content organized by topic (semantic categorization)
Automatic Content Aggregation from multiple content providers and feeds
Related news not specifically asked for (Semantic Associations)
Competitive research inferred automatically
Automatic 3rd party content integration
Metadata-centric Content Management Architecture
Content sources: Internal Source 1 (Research), Internal Source 2, External feeds/Web (e.g. Reuters)
Extractor Agent 1, Extractor Agent 2, Extractor Agent 3
Voquette Metabase, World Model, Knowledge Base, Semantic Engine
Third-party Content Mgmt and Syndication
Semantic Application (ASP/Enterprise hosted)
XCM-compliant metadata, XML or other format
1. Cisco story from Source 1 passed on to add semantic associations
2. Consults Knowledge Base for Cisco’s competition
3. Returns result: Lucent is a competitor of Cisco
4. Lucent story from external feeds picked for publishing as “semantically related” to Cisco story, passed on to Dashboard
Output: Story on Cisco, Story on Lucent
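The four numbered steps of this architecture can be sketched as a lookup followed by a filter. The sketch assumes the knowledge base is reduced to a simple competitor table and that each story already carries a company tag from the extractor agents; the Story class, the function semantically_related, and the headlines are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Story:
    headline: str
    company: str  # Company tag assumed to come from an extractor agent.
    source: str


# Knowledge base reduced to one relationship type: company -> competitors.
COMPETITORS = {"Cisco": {"Lucent"}}


def semantically_related(story, candidate_stories, competitors):
    """Steps 2-4: look up the story's competitors, then pick matching stories."""
    rivals = competitors.get(story.company, set())
    return [c for c in candidate_stories if c.company in rivals]


# Step 1: a Cisco story arrives from an internal source.
cisco_story = Story("Placeholder headline about Cisco", "Cisco", "Internal Source 1")

# Candidate stories from external feeds (placeholder headlines).
external_feed = [
    Story("Placeholder headline about Lucent", "Lucent", "External feed"),
    Story("Placeholder headline about IBM", "IBM", "External feed"),
]

related = semantically_related(cisco_story, external_feed, COMPETITORS)
dashboard = [cisco_story] + related
for item in dashboard:
    print(item.company, "-", item.source)
```

The Lucent story reaches the dashboard without the user asking for it, because the association comes from the knowledge base rather than from any keyword shared between the two stories.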
Wireless Application of Semantic Metadata and Automatic Content Enrichment
MyMedia: MyStocks, News, Sports, Music
My Stocks: CSCO, NT, IBM, Market
CSCO: Analyst Call, Conf Call, Earnings
CSCO Analysis: 11/08 ON24 Payne; 11/07 ON24 H&Q; 11/06 CBS Langlesis
Clicking on the link for Cisco Analyst Calls displays a listing sorted by date. Semantic filtering uses just the right metadata to meet screen and other constraints. E.g., Analyst Call focuses on the source and analyst name or company. The icon denotes additional metadata, such as “Strong Buy” by H&Q Analyst.
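The semantic filtering described above, choosing only the metadata fields that suit a small screen, might look roughly like the following. It is a sketch under stated assumptions: the per-type field table, the "*" marker standing in for the icon, newest-first ordering, and the function render_listing are all hypothetical, and dates are compared as MM/DD strings within a single year.

```python
# Hypothetical table: which metadata fields to show for each content type.
DISPLAY_FIELDS = {"Analyst Call": ("date", "source", "analyst")}


def render_listing(items, content_type, max_chars):
    """Render one short line per item, newest first, within a width limit."""
    fields = DISPLAY_FIELDS[content_type]
    rows = []
    for item in sorted(items, key=lambda i: i["date"], reverse=True):
        line = " ".join(item[f] for f in fields)
        if "rating" in item:
            line += " *"  # Marker standing in for the icon on the slide.
        rows.append(line[:max_chars])
    return rows


# Entries taken from the slide; the "rating" field holds the extra metadata.
analyst_calls = [
    {"date": "11/06", "source": "CBS", "analyst": "Langlesis"},
    {"date": "11/08", "source": "ON24", "analyst": "Payne"},
    {"date": "11/07", "source": "ON24", "analyst": "H&Q", "rating": "Strong Buy"},
]

listing = render_listing(analyst_calls, "Analyst Call", max_chars=20)
for row in listing:
    print(row)
```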
Metadata’s role in emerging iTV infrastructure
Enhanced Digital Cable: Video, MPEG Encoder, MPEG Decoder (MPEG-2/4/7)
Create Scene Description Tree; Retrieve Scene Description Track
Scene Description Tree: Node = AVO Object, with Object Content Information (OCI) and Enhanced XML Description
Voquette/Taalee Semantic Engine: metadata-rich, value-added node “NSF Playoff”
Produced by: Fox Sports
Creation Date: 12/05/2000
League: NFL
Teams: Seattle Seahawks, Atlanta Falcons
Players: John Kitna
Coaches: Mike Holmgren, Dan Reeves
Location: Atlanta
GREAT USER EXPERIENCE
Channel sales through Video Server Vendors, Video App Servers, and Broadcasters
License metadata decoder and semantic applications to device makers
Metadata for Automatic Content Enrichment
Interactive Television
This segment has embedded or referenced metadata that is used by a personalization application to show only the stocks that the user is interested in.
This screen is customizable with an interactivity feature using metadata, such as whether there is a new Conference Call video on CSCO.
Part of the screen can be automatically customized to show conference-call-specific information, including transcript, participation, etc., all of which are relevant metadata.
The Conference Call itself can have embedded metadata to support personalization and interactivity.
Semantic Technology Features
Unstructured Text Content
Semi-Structured Content
Structured Content
Audio/Video Content with associated text (transcript, journalist notes)
Create a Customized "World Model" (Taxonomy Tree with customized domain attributes)
Automatically homogenize content feed tags
Automatically categorize unstructured text
Automatically create tags based on the text itself
Create and maintain a Customized Knowledge Base for any domain
Automatically enhance content tags based on information beyond text
Build contextually relevant custom research applications
Contextual Search (an order of magnitude better than keyword-based search)
Support push or pull delivery/ingestion of content
Personalization/Alerts/Notifications
Real Time Indexing (stories indexed for search/personalization within a minute)
Provide the user with relevant information not explicitly asked for (Semantic Associations)
Along with the evolution of metadata and semantic technologies enabling the next generation of the Web, Content Management has entered the next generation of Enhanced Content Management.
Resources/References
RDF: www.w3.org/TR/REC-rdf-syntax/
ICE: www.icestandard.org
Meta Object Facility (MOF) Specification, Version 1.3, September 27, 1999: http://cgi.omg.org/cgi-bin/doc?ad/99-09-05
XML Metadata Interchange (XMI) Specification, Version 1.1, October 25, 1999:
Multimedia Data Management: Using Metadata to Integrate and Apply Digital Media, Amit Sheth & Wolfgang Klas, Eds., McGraw Hill, ISBN: 0-07-057735-8, 1998.
Information Brokering, Vipul Kashyap & Amit Sheth, Kluwer Academic Publishers, 2001.
Voquette Semantic Technology White Paper.
Mysteries of Metadata, Speaker – Amit Sheth, Workshop at Content World 2001.