Putting Structured Business Vocabularies to Work
Ian Davis, Global Project Manager, Dow Jones & Company
Data Management and Information Quality Conference, IRM UK
November 4, 2008
© Copyright 2008 Dow Jones and Company, Inc.
May 12, 2015
What we’ll cover today:
Understanding the challenges of controlled versus uncontrolled vocabularies
Developing a strategy to create and maintain controlled vocabularies
Identifying how you want to integrate your controlled vocabularies into your systems
Understanding the requirements of integrating controlled vocabularies into multiple applications
Setting the Context
Once upon a time…
Most of the business was IT-enabled. There was some degree of “sharing” of information and content; there were even some large, well-structured document repositories.
Yet no one could find anything. Actually, they found things, but not what they wanted when they wanted it, and they were never sure they found the “best” or “saw it all”.
Once upon a time…
The C-level executives were a bit irritated. They’d spent lots on the technology, and people really weren’t much more efficient; the pinch point in the workflow had simply moved further downstream.
So, what happened next?
Once upon a time…
They SPENT <more> MONEY and bought the best-in-class search utilities.
Yet no one could find anything. Actually, they found things, but not what they wanted when they wanted it, and they were never sure they found the “best” or “saw it all”.
Once upon a time…
The C-level executives became a bit more irritated.
Everyone was a bit frustrated. What was missing?
Optimized?
Is the search utility optimized using all the bells and whistles it came with?
  Relevancy rankings
  “Thesaurus” files (synonym lists)
  Multi-lingual capabilities
  Common searches saved and presented to users
  Logs reviewed to understand user issues
Usable?
Is the user interface considerate to users?
  Was it designed with YOUR users in mind?
    Designed for occasional users?
    Designed for power users?
  Was it designed with YOUR business in mind?
    Task-based views for context-sensitive searches
    Results presented in a format readily used within workflows
Metadata?
Are there required metadata fields within the CMS?
  Author, Title, Language, Topic, Product/Service, etc.
Are the entry values to those fields controlled?
  Lookups against authority files, taxonomies, thesauri
Does the search utility support fielded searches?
Does the search utility weight terms within metadata fields higher than free text?
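A controlled entry value can be enforced with a simple lookup at the point of data entry. The following is a minimal sketch only: the field names, the allowed values and the `validate_metadata` helper are illustrative assumptions, not part of any particular CMS.

```python
# Hypothetical authority files: a flat set of allowed values per metadata field.
AUTHORITY_FILES = {
    "language": {"en", "fr", "de"},          # e.g. ISO 639-1 language codes
    "content_type": {"Proposal", "Report"},  # illustrative values only
}

def validate_metadata(record):
    """Return the (field, value) pairs that are not found in the authority file."""
    errors = []
    for field, allowed in AUTHORITY_FILES.items():
        value = record.get(field)
        if value not in allowed:
            errors.append((field, value))
    return errors

good = {"language": "en", "content_type": "Proposal"}
bad = {"language": "English", "content_type": "Proposal"}

print(validate_metadata(good))  # []
print(validate_metadata(bad))   # [('language', 'English')]
```

Rejecting “English” in favour of the controlled code “en” is what later allows a fielded search on Language to return every matching document.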
Metadata?
For example: if a financial analyst enters the query term “stock” within the company’s knowledge base:
Will he get back results with the documents specifically discussing “stock” as a financial instrument listed first?
Or will he have to look through hundreds of documents, some relevant to him, mixed in with every document whose free-text body happens to mention:
  soup stock (food industry),
  cows (livestock industry),
  or stock car racing (professional sports industry)?
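The “stock” example can be sketched as a scoring rule that boosts matches in a controlled metadata field over matches in the body text. This is an illustration under stated assumptions: the documents, the `topic` field and the weight values are invented for the example, and real search utilities expose field weighting in their own ways.

```python
# Hypothetical documents: a controlled "topic" field plus a free-text body.
DOCS = [
    {"id": "soup", "topic": ["Food industry"], "body": "Simmer the stock for two hours."},
    {"id": "equity", "topic": ["Stock"], "body": "The stock rose after earnings."},
    {"id": "racing", "topic": ["Motorsport"], "body": "Stock car racing season opens."},
]

METADATA_WEIGHT = 5.0  # assumed boost for a match in a controlled field
BODY_WEIGHT = 1.0      # weight for each free-text occurrence

def score(doc, term):
    """Score one document: controlled-field matches count more than body matches."""
    term = term.lower()
    meta_hits = sum(1 for t in doc["topic"] if t.lower() == term)
    body_hits = doc["body"].lower().split().count(term)
    return METADATA_WEIGHT * meta_hits + BODY_WEIGHT * body_hits

def search(term):
    """Return ids of matching documents, highest score first."""
    scored = [(score(d, term), d["id"]) for d in DOCS]
    hits = [(s, doc_id) for s, doc_id in scored if s > 0]
    return [doc_id for s, doc_id in sorted(hits, key=lambda pair: -pair[0])]

print(search("stock"))  # ['equity', 'soup', 'racing']
```

All three documents still match, but the one tagged with the controlled term is listed first.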
Metadata?
Precise and comprehensive searches
  Only if controlled vocabularies have been used to populate metadata fields
  AND
  The search utility takes advantage of that by giving priority to query term occurrence within controlled-value metadata fields
  OR
  Fielded searches are enabled
    e.g. <Author = Smith> + <Service = Consulting> + <Industry = Automotive> + <Date = January 2006> + <Content Type = Proposal>
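A fielded search like the one above amounts to an exact match on every requested metadata field. The sketch below assumes records are plain dictionaries keyed by field name; the sample records and the `fielded_search` function are hypothetical.

```python
# Hypothetical content records with controlled metadata values.
RECORDS = [
    {"title": "Plant efficiency proposal", "Author": "Smith", "Service": "Consulting",
     "Industry": "Automotive", "Date": "January 2006", "Content Type": "Proposal"},
    {"title": "Store rollout proposal", "Author": "Smith", "Service": "Consulting",
     "Industry": "Retail", "Date": "January 2006", "Content Type": "Proposal"},
    {"title": "Automotive market report", "Author": "Jones", "Service": "Research",
     "Industry": "Automotive", "Date": "March 2006", "Content Type": "Report"},
]

def fielded_search(records, criteria):
    """Keep only the records whose metadata equals every requested field value."""
    return [r for r in records
            if all(r.get(field) == value for field, value in criteria.items())]

query = {
    "Author": "Smith",
    "Service": "Consulting",
    "Industry": "Automotive",
    "Date": "January 2006",
    "Content Type": "Proposal",
}
matches = fielded_search(RECORDS, query)
print([r["title"] for r in matches])  # ['Plant efficiency proposal']
```

Exact matching only works this cleanly when the field values were assigned from controlled vocabularies in the first place.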
Challenges: Controlled versus Uncontrolled
Controlled Vocabularies Explained
Authority files
  e.g. Company’s active directory, ISO standard for Languages
  Typically a flat list of allowed values
Taxonomies
  e.g. Linnaean Classification (kingdom, phylum, class, order, family, genus, and species)
  Typically includes only hierarchical relationships between terms
Thesauri
  e.g. NASA Thesaurus (http://www.sti.nasa.gov/thesfrm1.htm)
  Includes full set of semantic relationships defined between terms (hierarchical, associative, equivalence)
NASA Thesaurus – Sample Entry
Semantic Relationships
Hierarchical
  Superordination - representing a class or a whole, and subordination - referring to members or parts
    e.g. mammals and vertebrates
    e.g. cherry pie and cherry pie slices
Equivalence
  One concept expressed by two or more terms
    e.g. dogs and canines
Associative
  Terms that are conceptually linked, but not through hierarchy or equivalence
    e.g. accounting and accountant
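The three relationship types can be pictured as fields on a term record. This is a rough sketch, not a standard schema: the `Term` class and helper functions are assumptions for illustration, with field names loosely following the thesaurus conventions of broader term, narrower term, related term and “used for”.

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    label: str
    broader: set = field(default_factory=set)   # hierarchical: class or whole
    narrower: set = field(default_factory=set)  # hierarchical: members or parts
    related: set = field(default_factory=set)   # associative
    used_for: set = field(default_factory=set)  # equivalence: non-preferred variants

thesaurus = {}

def term(label):
    """Fetch the term record for a label, creating it on first use."""
    return thesaurus.setdefault(label, Term(label))

def add_hierarchy(broader, narrower):
    term(broader).narrower.add(narrower)
    term(narrower).broader.add(broader)

def add_association(a, b):
    # Associative links are symmetric, so record them on both terms.
    term(a).related.add(b)
    term(b).related.add(a)

def add_equivalence(preferred, variant):
    term(preferred).used_for.add(variant)

add_hierarchy("vertebrates", "mammals")
add_association("accounting", "accountant")
add_equivalence("dogs", "canines")

print(thesaurus["mammals"].broader)     # {'vertebrates'}
print(thesaurus["accountant"].related)  # {'accounting'}
print(thesaurus["dogs"].used_for)       # {'canines'}
```

An authority file would need only the labels; a taxonomy would use only the two hierarchical fields; a thesaurus uses all four.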
Challenges – Uncontrolled Vocabularies
Uncontrolled vocabularies are:
  Comprehensive but noisy
    Only comprehensive if synonym lists are used
  Limited in their precision and relevancy
    Time lost scanning through hundreds of “miss” hits
    Reduced effectiveness of cross-repository searches
    Limited ways to disambiguate ‘soup stock’ from ‘stock car’
Challenges - Controlled Vocabularies
Controlled vocabularies bring challenges of their own:
  Potentially significant overhead effort (manual and technical)
  Organizational politics can add YEARS to establishing an initial set of controlled vocabularies
  A lack of basic understanding of what controlled vocabularies are and how they work impedes effective development and utilization
Challenges - Controlled Vocabularies
Controlled vocabularies:
  Richness and power come from a full set of semantic relationships, not just hierarchical ones
    Hierarchy supports the ability to narrow and broaden search queries
    Association supports “did you mean” and “you might also want to look at”
    Equivalence lets users search in their own familiar language and still retrieve content that is conceptually on target but never uses their term
      e.g. user enters “dog” and the search utility expands the query to include “canine, k-9, puppy”
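The three uses of semantic relationships can be sketched as lookups against small reference tables. Only the “dog” equivalents come from the slide; the narrower, broader and related terms, and the function names, are made up for illustration.

```python
# Equivalence values are from the example above; the rest are hypothetical.
EQUIVALENTS = {"dog": ["canine", "k-9", "puppy"]}
NARROWER = {"dog": ["terrier", "retriever"]}
BROADER = {"dog": ["mammal"]}
RELATED = {"dog": ["veterinarian"]}

def expand_query(term):
    """Equivalence: search on the user's term plus all of its equivalents."""
    return [term] + EQUIVALENTS.get(term, [])

def refinements(term):
    """Hierarchy and association: options to offer alongside the results."""
    return {
        "narrow": NARROWER.get(term, []),    # narrow the search
        "broaden": BROADER.get(term, []),    # broaden the search
        "see_also": RELATED.get(term, []),   # "you might also want to look at"
    }

print(expand_query("dog"))             # ['dog', 'canine', 'k-9', 'puppy']
print(refinements("dog")["see_also"])  # ['veterinarian']
```

A term with no entries in the tables simply passes through unchanged, so the same code path serves controlled and uncontrolled query terms.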
Challenges - Controlled Vocabularies
Controlled vocabularies:
  Richness and power come at the cost of added complexity of development, implementation, integration and maintenance
  Utilization of controlled vocabularies can produce performance issues
    During search index creation
    During query run time
Tackling the Challenges
Strategy – Creation and Maintenance
State the business case clearly
  Benefits
    Reduced time for knowledge discovery
    Increased richness of knowledge discovery
    Decreased risk to firm of making business decisions with partial information
  Scope
    One business unit or enterprise-wide?
  Resource requirements
    Skill sets (IS, IT, business knowledge)
    Time commitment
Strategy – Creation and Maintenance
Tackle organizational politics head-on
  Gain credibility and ensure usability by establishing a cross-functional working committee that will become the Review Committee
  Include all major stakeholder groups and any interested parties (even the non-supporters)
  Establish methods of broadly soliciting end-user input that will become a source of change requests during maintenance phases
Strategy – Creation and Maintenance
Additional considerations before you start:
  How rigorous does it need to be?
    What external standards should be adopted?
      ANSI/NISO Z39.19-2005
      British Standard – BS 8723
    What internal standards should be developed?
      Editorial Guidelines
      Usage Guidelines
  How extensive will it be?
    Depth and breadth within and across facets
  What about adaptability and flexibility?
    Will there be a need for local extensions?
Strategy – Creation and Maintenance
Additional considerations before you start:
  Projected frequency of revisions
    How quickly does the content base change with respect to concepts; is there significant content drift?
    How volatile is the language?
      Management consulting vs. accounting
  Vocabulary Management Software
    DON’T spend money just to spend money
    However, you CAN’T manage controlled vocabularies in a spreadsheet
    Buy the tool you need based on your documented functional requirements
Strategy – Integration Choices
Performance trade-offs
  Store UIDs within content, then use a look-up table at query run time
  Store the full text of a term, then touch all content when a taxonomy value changes (must re-assign new term value)
Version control
  Use static versions of controlled vocabularies within CMS and search utilities, releasing new versions periodically
  Use a dynamic version of controlled vocabularies with continuous revisions occurring
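The performance trade-off shows up as soon as a term is renamed. In this sketch the UIDs, labels and documents are all hypothetical; the point is only to contrast one look-up table update against re-tagging every affected content record.

```python
# Look-up table: UID -> current preferred label (hypothetical values).
labels = {"T001": "Motor vehicles", "T002": "Consulting"}

# Option 1: content stores UIDs; labels are resolved at query run time.
docs_by_uid = [
    {"id": 1, "topics": ["T001"]},
    {"id": 2, "topics": ["T001", "T002"]},
]

# Option 2: content stores the full text of each term.
docs_by_text = [
    {"id": 1, "topics": ["Motor vehicles"]},
    {"id": 2, "topics": ["Motor vehicles", "Consulting"]},
]

def display_topics(doc):
    """Option 1 pays a look-up cost on every query or display."""
    return [labels[uid] for uid in doc["topics"]]

def rename_with_uids(uid, new_label):
    """Option 1: a rename updates one row in the look-up table."""
    labels[uid] = new_label
    return 1

def rename_with_text(old_label, new_label):
    """Option 2: a rename must touch every document that carries the term."""
    touched = 0
    for doc in docs_by_text:
        if old_label in doc["topics"]:
            doc["topics"] = [new_label if t == old_label else t for t in doc["topics"]]
            touched += 1
    return touched

uid_rows_touched = rename_with_uids("T001", "Automotive")
text_docs_touched = rename_with_text("Motor vehicles", "Automotive")
print(uid_rows_touched, text_docs_touched)  # 1 2
print(display_topics(docs_by_uid[1]))       # ['Automotive', 'Consulting']
```

With two documents the difference is trivial; across a large repository, the second option means re-assigning and re-indexing content every time the vocabulary is revised.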
Strategy – Integration Choices
Utilizing semantic relationships
  Store full set (term values or UIDs) within content record
  OR
  Store single UID and have search utility use reference tables to determine related terms
Display of semantic relationships
  User interface considerations for effective presentation of non-hierarchically related terms
Strategy – Integration Choices
Browse navigation options
Query entry (including ability to broaden or narrow current search results)
Query results listing
Related topics (defined through associative relationships)
Previous query statement user entered plus any auto-expansion done by engine
Strategy – Multiple Applications
Expanding the adoption and use of controlled vocabularies
  Know the business objectives of the applications
    In conjunction with the search utility, does the controlled vocabulary enable this objective?
  Are there metadata fields available within the current application for the controlled vocabulary?
  Does the business have resources to assign the controlled vocabulary?
  What format does the controlled vocabulary need to be in to be integrated with the application?
Strategy – Multiple Applications
Additional considerations
  Will there be conflicting version management needs?
  How does search currently index these applications, and will that change with the use of controlled vocabularies?
Five Key Points
1. Controlled vocabularies are a lever to improve precision and comprehensiveness
2. Controlled vocabularies are never finished – they are always a work in process
3. Search utilities can only be tweaked so far
4. Tapping into the richness of the semantic relationships between terms can be extremely powerful
5. There are lots of options for implementing and integrating controlled vocabularies