1
Synonyms & Taxonomies
Thesaurus Design for Information Architects
an ACIA Seminar
by Peter Morville & Samantha Bailey
2
Introductions
Peter Morville ([email protected])
• CEO, Argus Associates
• Co-author, Information Architecture for the World Wide Web
• Director, ACIA
• LIS background
• Fortune 500 consulting
3
Introductions
Samantha Bailey ([email protected])
• VP of Operations, Argus Associates
• LIS background
• Fortune 500 consulting
• VC experience
4
Seminar Outline
I. Thesauri in Context
II. Value of Thesauri
III. Methodology
IV. Metadata
V. Vocabulary Control
VI. Structure & Relationships
VII. Thesaurus Management
VIII. Case Study
IX. Related Topics
Instructional Methods
Exercises, Quizzes, Discussions, Breaks
5
Our Approach
Assumptions
• Understanding of IA Basics
• Interest in Thesauri and the Web
Philosophy
• Reality is Important
• Technology has Limitations
• Success takes Time
• Tension can be Healthy
6
Thesauri in Context
What is IA?
The art and science of structuring and organizing information systems to help people achieve their goals.
7
Thesauri in Context
An Ecological Approach
Books: Information Ecologies by Bonnie Nardi and Information Ecology by Thomas Davenport
Diagram labels: Content, Business Context, Users
8
Thesauri in Context
IA From Top to Bottom
Top-Down        Bottom-Up
portal          sub-site
strategy        objects
hierarchy       metadata
primary path    multiple paths
Diagram labels: portal; local subsites (HR, Engineering, R&D…); Object X with fields Name, Product Category, Topic, Stale Date, Author, Security
9
Thesauri in Context
Where Does IA Fit?
The Elements of User Experience, by Jesse James Garrett
http://www.jjg.net/ia/elements.pdf
10
Thesauri in Context
What is Vocabulary Control?
Controlled Vocabulary
A list of preferred and variant terms.
A subset of natural language.
Preferred     Variants                           Authority
AZ            Ariz, Arizona, 85XXX               US Postal Service
IBM           Intl Bus Machines, Big Blue        NY Stock Exchange
Nyctalopia    Night blindness, Moon blindness    National Library of Medicine
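The table above can be read as a lookup from any variant to its preferred term. A minimal Python sketch, assuming a plain dictionary stands in for the authority file; the names AUTHORITY_FILE, build_entry_vocabulary, and to_preferred are hypothetical, and the "85XXX" ZIP code pattern is left out because it would need pattern matching rather than a simple lookup:

```python
# Hypothetical authority file taken from the slide's table: preferred term -> variants.
AUTHORITY_FILE = {
    "AZ": ["Ariz", "Arizona"],
    "IBM": ["Intl Bus Machines", "Big Blue"],
    "Nyctalopia": ["Night blindness", "Moon blindness"],
}


def build_entry_vocabulary(authority_file):
    """Map every variant (and each preferred term itself) to its preferred term."""
    lookup = {}
    for preferred, variants in authority_file.items():
        lookup[preferred.lower()] = preferred
        for variant in variants:
            lookup[variant.lower()] = preferred
    return lookup


ENTRY_VOCABULARY = build_entry_vocabulary(AUTHORITY_FILE)


def to_preferred(term):
    """Return the preferred term for a user's term, or None if it is not in the vocabulary."""
    return ENTRY_VOCABULARY.get(term.strip().lower())
```

Matching is case-insensitive here only as an illustrative choice; a real vocabulary may need to preserve case for acronyms.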
11
Thesauri in Context
Why Control Vocabulary?
Language is Ambiguous
• Synonyms, homonyms, antonyms, contronyms, etc.
In the Oxford English Dictionary:
• “Round” takes 7 ½ pages or 15,000 words to define.
• “Set” has 58 uses as a noun, 126 as a verb, 10 as an adjective.
The Mother Tongue: English & How It Got That Way, by Bill Bryson
12
Thesauri in Context
Why Control Vocabulary?
So Your Users Don’t Have To!
Diagram: Users and Documents and Applications, separated by a Communication Chasm
Example: Personal Digital Assistant
Synonyms: Handheld Computer
"Alternate" Spellings: Persenal Digitel Asistent
Abbreviations / Acronyms: PDA
Broader Terms: Wireless, Computers
Narrower Terms: PalmPilot, PocketPC
Related Terms: WindowsCE, Cell Phones
13
Thesauri in Context
Semantic Relationships
Types
1. Equivalence
2. Hierarchical
3. Associative
Example (diagram)
(Preferred) Vermont
Equivalence: (Variant) Green Mountain State; (Variant) Vt
Hierarchical: (Broader) United States; (Narrower) Burlington
Associative: (Related) Skiing; (Related) Maple Syrup
14
Thesauri in Context
Levels of Control
Simple to Complex
(Vocabularies) Synonym Rings, Authority Files, Classification Schemes, Thesauri
(Relationships) Equivalence, Hierarchical, Associative
15
Thesauri in Context
What is a Thesaurus?
Traditional Use
• Dictionary of synonyms (Roget’s)
• From one word to many words
Information Retrieval Context
• A controlled vocabulary in which equivalence, hierarchical, and associative relationships are identified for purposes of improved retrieval
• Many words to one concept
16
Thesauri in Context
Terminology
Preferred Terms (UF subject headings, descriptors)
SN Scope Notes
UF Used For
BT Broader Term
NT Narrower Term
RT Related Terms (“See Also”)
Variant Terms (UF non-preferred, entry terms)
USE (“See”)
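The abbreviations above map naturally onto the fields of a term record. A small sketch, assuming one record per preferred term; the TermRecord class and its field names are hypothetical, and the sample data is the Vermont example used elsewhere in this seminar:

```python
from dataclasses import dataclass, field


@dataclass
class TermRecord:
    """One preferred term and its relationships, named after the slide's abbreviations."""
    preferred: str
    sn: str = ""                             # SN: scope note
    uf: list = field(default_factory=list)   # UF: variant terms this term is used for
    bt: list = field(default_factory=list)   # BT: broader terms
    nt: list = field(default_factory=list)   # NT: narrower terms
    rt: list = field(default_factory=list)   # RT: related terms ("See Also")

    def use_references(self):
        """Derive the reciprocal USE ("See") entries: variant -> preferred term."""
        return {variant: self.preferred for variant in self.uf}


vermont = TermRecord(
    preferred="Vermont",
    uf=["Green Mountain State", "Vt"],
    bt=["United States"],
    nt=["Burlington"],
    rt=["Skiing", "Maple Syrup"],
)
```

Deriving USE from UF, rather than storing both, is one way to keep the equivalence relationship reciprocal by construction.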
17
Thesauri in Context
Types of Thesauri
                          Used in Indexing: No    Used in Indexing: Yes
Used in Searching: No     Natural Language        Indexing Thesaurus
Used in Searching: Yes    Searching Thesaurus     Classic Thesaurus
18
Thesauri in Context
Visibility
Classic Use
• Both indexers and searchers explicitly map natural language terms onto controlled vocabularies
Web Environment
• Able to choose level of visibility (implicit use, thesaural browsers)
• Opportunity to educate users (terminology, associative learning)
19
Thesauri in Context
Niche Applications (hypothetical example)
Product Catalog: multiple views enabled by thesaurus
Technical Support Database: entry vocabulary maps problems to solutions
Searching Thesaurus: implicit term explosion manages synonyms
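Implicit term explosion in a searching thesaurus can be sketched as query expansion: the user types one term and the system silently ORs in its equivalents. The sketch below is illustrative only; SYNONYM_RINGS, explode, and to_boolean_query are hypothetical names, and the ring reuses the Personal Digital Assistant example from an earlier slide:

```python
# Hypothetical synonym rings: each set holds terms treated as equivalent at search time.
SYNONYM_RINGS = [
    {"personal digital assistant", "handheld computer", "pda"},
]


def explode(query_term, rings=SYNONYM_RINGS):
    """Return every term in the ring containing query_term, or the term alone."""
    term = query_term.strip().lower()
    for ring in rings:
        if term in ring:
            return sorted(ring)
    return [term]


def to_boolean_query(query_term):
    """Build an OR query string from the exploded terms."""
    return " OR ".join(f'"{term}"' for term in explode(query_term))
```

The user never sees the expansion, which is what makes this use of a thesaurus "invisible".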
20
Thesauri in Context
Thesaurus Standards
Mono-Lingual Thesauri
• ISO 2788 (1974, 1985, 1986, International)
• BS 5723 (1987, British)
• AFNOR NFZ 47-100 (1981, French)
• DIN 1463 (1987-1993, German)
• ANSI/NISO Z39.19 (1994, United States)
Multi-Lingual Thesauri
• ISO 5964 (1985, International)
21
Thesauri in Context
ANSI/NISO Standard Z39.19-1993
Guidelines for the Construction, Format, and Management of Monolingual Thesauri.
84 pp. ISBN: 1-880124-04-1 Price: $49.00
http://www.niso.org/stantech.html
Reasons to Follow Standard
• Significant thinking behind guidelines
• Technology integration
• Cross-database compatibility
22
Thesauri in Context
Oracle’s Perspective
“The phrase…thesaurus standard is somewhat misleading. The computing industry considers a ‘standard’ to be a specification of behavior or interface. These standards do not specify anything. If you are looking for a thesaurus function interface, or a standard thesaurus file format, you won't find it here. Instead, these are guidelines for thesaurus compilers -- compiler being an actual human, not a program.
What Oracle has done is taken the ideas in these guidelines and in ANSI Z39.19…and used them as the basis for a specification of our own creation…So, Oracle supports ISO-2788 relationships or ISO-2788 compliant thesauri.”
23
Thesauri in Context
A World in Transition
“The majority of basic problems of thesaurus construction had already been solved by 1967.” (Krooks and Lancaster, 1993)
Traditional Thesauri       Web Thesauri
Print                      Online
Academic / Library         Business
Expert / Repeat Users      Novice / Infrequent Users
Visible                    Invisible
Accepted Value             Unknown Value
24
Section Break
I. Thesauri in Context II. Value of Thesauri III. Methodology IV. Metadata V. Vocabulary Control VI. Structure & Relationships VII. Thesaurus Management VIII. Case Study IX. Related Topics
25
Value of Thesauri
IA Metrics
• Cost of finding (time, clicks, frustration, precision).
• Cost of not finding (success, recall, frustration, alternatives).
• Cost of development (time, budget, staff, frustration).
• Value of learning (related products, services, projects, people).
26
Value of Thesauri
KM Metrics
• Revenue Generation (% revenues spent on KM, new revenue generation)
• Opportunity Cost (staff time, customers lost)
• Knowledge Efficiency (faster product development, # mistakes made twice)
• Data Quality (% knowledge on intranet, % email with attachments)
• Intranet Usage (# hits, # contributions)
• Individual Behavior (# citations)
• Technical Performance (uptime, search response time)
Working Council for Chief Information Officers
Basic Principles of Information Architecture
(http://www.cio.executiveboard.com)
27
Value of Thesauri
Web Site Statistics
Wasted expense: most sites will waste between $1.5M and $2.1M on redesigns next year.
Forfeited revenue: poorly architected retailing sites are underselling by as much as 50%.
Lost customers: the sites we tested are driving away up to 40% of repeat traffic.
Eroded brand: people who have a bad experience typically tell 10 others.
Forrester Research Why Most Web Sites Fail (Sept 98)
28
Value of Thesauri
Intranet Statistics
Employees spend 35% of productive time searching for information online.
Working Council for Chief Information Officers
Basic Principles of Information Architecture
(http://www.cio.executiveboard.com)
Managers spend 17% of their time (6 weeks a year) searching for information.
Information Ecology
Thomas Davenport and Lawrence Prusak
(http://argus-acia.com/content/review001.html)
29
Value of Thesauri
Intranet Statistics
Sun Microsystems’ usability experts calculated that 21,000 employees were wasting an average of six minutes per day due to inconsistent intranet navigation structures. When lost time was multiplied by staff salaries, the estimated productivity loss exceeded $10 million per year.
Jakob Nielsen
Web Design and Development
September 1997
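The arithmetic behind the Sun Microsystems figure can be checked roughly. The sketch below assumes 250 working days per year, which the slide does not state, and treats $10 million as the quoted lower bound; it then backs out the average hourly staff cost that the estimate implies:

```python
EMPLOYEES = 21_000               # from the slide
MINUTES_LOST_PER_DAY = 6         # from the slide
WORKING_DAYS_PER_YEAR = 250      # assumption, not stated on the slide
ESTIMATED_LOSS_USD = 10_000_000  # the slide says the loss "exceeded" this figure

hours_lost_per_day = EMPLOYEES * MINUTES_LOST_PER_DAY / 60
hours_lost_per_year = hours_lost_per_day * WORKING_DAYS_PER_YEAR

# Average hourly staff cost implied by the lower-bound estimate.
implied_hourly_cost = ESTIMATED_LOSS_USD / hours_lost_per_year

print(f"{hours_lost_per_day:,.0f} hours lost per day")
print(f"{hours_lost_per_year:,.0f} hours lost per year")
print(f"${implied_hourly_cost:.2f} implied cost per staff hour")
```

Under that assumption, six minutes a day adds up to roughly half a million staff hours a year, so a modest hourly cost is enough to reach the quoted figure.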
30
Value of Thesauri
Intranet Statistics
After spending two years and $3 million on development and usability testing, Bay Networks expects to see $10 million in productivity gains and a 10 percent cycle-time reduction for new product development as a result of its new information architecture.
Working Council for Chief Information Officers
Basic Principles of Information Architecture
(http://www.cio.executiveboard.com)
31
Value of Thesauri
Intranet Statistics
40% of corporate users can’t find the information they need on their intranet.
Prior to intranet reengineering in 1997, Ford conducted a survey of its 100,000+ user base. Employees stated they could only find 15% of the information they needed to do their jobs.
Under-investment in (unstructured) information. 80% spending on 20% (structured) data.
Working Council for Chief Information Officers
Basic Principles of Information Architecture
(http://www.cio.executiveboard.com)
32
Value of Thesauri
Searching Problems
“Most of the complaints we get are due to the way users search – they use the wrong keywords.”
- a manufacturing company
“We have problems with the way customers enter queries. Capitalizations and misspellings give us headaches.”
- a software company
Forrester Research
Must Search Stink? (June 2000)
33
Value of Thesauri
Searching Statistics
“Search will become the center piece of navigation.”
90% of firms rate search as very or extremely important.
52% don’t measure search effectiveness.
Forrester Research
Must Search Stink? (June 2000)
34
Value of Thesauri
CV Statistics
Researchers at Bell Labs found the probability that two people would choose the same word to describe an object to be less than 20%.
Furnas, Landauer, et al., Bell Labs (1987)
30% of corporations systematically utilize metadata to classify information, while only one to three percent of companies populate those metadata tags using controlled vocabularies.
71% don’t account for misspellings or synonyms.
Forrester Research
Building an Intranet Portal (Jan 1999)
35
Value of Thesauri
CV Statistics
Principle of unlimited aliasing: by leveraging synonyms, recall went from 20% to 80% (in a small collection).
The Trouble with Computers
Research study at Bellcore (Furnas et al. 1987)
“The findings indicate that a hypertext index with multiple access points for each concept…led to greater effectiveness and efficiency of retrieval on almost all measures.”
A Usability Assessment of Online Indexing Structures, by Carol A. Hert, Elin K. Jacob, and Patrick Dawson
Journal of the American Society for Information Science (September 2000)
36
Value of Thesauri
Complementary Approaches
Basic
• Navigation Design (Browsing)
• Full Text Indexing (Searching)
Advanced
• Collaborative Filtering
• Lexical Databases
• Automated Hierarchy-Generation
37
Value of Thesauri
Navigation Design
Relationships
• Global & Local (hierarchical)
• Contextual (associative)
Diagram: page layout
• Global Navigation: Where am I?
• Local Navigation: What's nearby?
• Content is here, with contextual navigation embedded or separate: What's related to what's here?
38
Value of Thesauri
Full Text Indexing
Strengths
• Enables high precision (exact phrase)
• Enables high recall (word occurrence)
Weaknesses
• Often results in low precision (“aboutness”)
• Often results in low recall (synonyms)
Complementary Use
• Provide users with option (search CV, full text)
• Intelligent next step (no hits on CV > full text)
• Full text search within CV search zones
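The "intelligent next step" idea (try the controlled vocabulary first, fall back to full text when it returns nothing) can be sketched as follows. Everything here is hypothetical sample data: DOCUMENTS, ENTRY_TERMS, and the search function are illustrative names, not part of any product mentioned in this seminar:

```python
# Hypothetical collection: each document has assigned CV terms and its full text.
DOCUMENTS = {
    1: {"cv_terms": {"Personal Digital Assistant"}, "text": "Review of the new handheld computer"},
    2: {"cv_terms": set(), "text": "Our PDA buying guide"},
}

# Hypothetical entry vocabulary: lower-cased entry term -> preferred term.
ENTRY_TERMS = {
    "personal digital assistant": "Personal Digital Assistant",
    "handheld computer": "Personal Digital Assistant",
    "pda": "Personal Digital Assistant",
}


def search(query):
    """Search the controlled vocabulary first; fall back to full text if nothing is found."""
    q = query.strip().lower()
    preferred = ENTRY_TERMS.get(q)
    if preferred is not None:
        hits = sorted(doc_id for doc_id, doc in DOCUMENTS.items() if preferred in doc["cv_terms"])
        if hits:
            return "cv", hits
    # No CV hits: fall back to a simple substring match on the full text.
    hits = sorted(doc_id for doc_id, doc in DOCUMENTS.items() if q in doc["text"].lower())
    return "full_text", hits
```

Returning the search mode alongside the hits lets the interface tell the user which kind of result they are looking at.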
39
Value of Thesauri
Collaborative Filtering
SN. Approaches that leverage knowledge about preferences or behaviors of people or organizations to facilitate information retrieval.
Popularity / Importance
• Direct Hit (analysis of searcher behavior)
• Amazon (cross-title purchasing habits)
• Google (citation indexing)
Considerations
• Favors established materials
• Lacks benefits of vocabulary control
• User-centric (ignores content, context)
40
Value of Thesauri
Lexical Databases
Scope Notes
• Broad term banks or semantic networks that specify lexical variants and term relationships.
• General-interest, off-the-shelf thesauri.
Examples
• Roget’s Thesaurus
• WordNet
• Plumb Design Visual Thesaurus
41
Value of Thesauri
Lexical Databases
Number of Terms (General, Niche)
Importance of Context (Bug in Software, Espionage)
                             # of Terms    # of Meanings    Notes
WordNet                      50,000        70,000
Oxford English Dictionary    615,000       2.4M             > 20,000 New Terms Per Year
Named Insect Species         1.4M                           Drosophila UF Fruit Fly
Square D Products            300,000                        Electrical Distribution
42
Value of Thesauri
Hierarchy-Generation Software
An Intimidating Vocabulary
• Multivariate regression models, probabilistic Bayesian models, neural networks, symbolic rule learning, computational semiotics, and support vector machines
General Techniques
• Clustering (similarity, word co-occurrence)
• Vector Space (extract “meaning” from terms, teach by example)
43
Value of Thesauri
Hierarchy-Generation Software
Examples
• Autonomy (http://www.autonomy.com/)
• Semio (http://www.semio.com/)
• Cartia (http://www.cartia.com/)
Hyperbole
Autonomy claims their software eliminates "the need for any manual labor in the process."
44
Value of Thesauri
Hierarchy-Generation Software
Considerations
• No business context
• No consideration of users
• No planning for future
• Mixed category schemes
• Hidden costs (integration, rule design, training)
Trends
• Niche use (e.g., news, web search results)
• Integration with manual classification schemes
Diagram labels: Content, Business Context, Users (two X marks appear in the diagram)
45
Section Break
I. Thesauri in Context II. Value of Thesauri III. Methodology IV. Metadata V. Vocabulary Control VI. Structure & Relationships VII. Thesaurus Management VIII. Case Study IX. Related Topics
46
Methodology
Overview
indicates special emphasis during this phase
Strategy Design Build
Process
Deliverables
Consulting
47
Methodology
Strategy x Process
Information Architect’s Toolbox *
Business Context: strategy meetings; opinion leader interviews; technology assessment
Content & Applications: content inventory; content analysis; metadata evaluation
Users: log analysis; observation / usability testing; interviews / affinity modeling
Existing IA: heuristic evaluation; classification scheme analysis; benchmarking
* select right mix for project; this is a partial list of tools
48
Methodology
Design x Deliverables
Information Architect’s Toolbox *
Organization & Labeling: metadata specifications; controlled vocabularies; thesaurus
Navigation (Embedded): primary taxonomy; classification schemes; blueprints and wireframes
Navigation (Supplemental): search system; sitemap / indexes; personalization / customization
Synthesis: design / authoring guidelines; content management policies; functional specifications
* select right mix for project; this is a partial list of tools
49
Methodology
Consulting x Build
Information Architect’s Toolbox *
Metadata Application: object-level indexing guides; support indexers; support thesaurus managers
Point of Production: support designers / developers; usability testing input / analysis; fix problems
Post-Launch: metrics; evaluation; improvement
* select right mix for project; this is a partial list of tools
50
Methodology
Thesaurus Construction
Strategy
1. Define Thesaurus Strategy
2. Develop Project Plan
Design
3. Gather Candidate Terms / Variants
4. Select Preferred Terms
5. Develop Facet Hierarchies
6. Identify ‘See Also’ Links
7. Write Design / Functional Specifications
8. Build / Buy Software Applications
Build
9. Launch Indexing Operation
10. Refine Controlled Vocabularies
51
Methodology
Strategy Questions
• Does vocabulary control make sense?
• Where and for what purposes?
• How will it align with business goals?
• How will it support users’ goals?
• How will it impact content management?
• Will we buy, borrow, or build?
52
Section Break
I. Thesauri in Context II. Value of Thesauri III. Methodology IV. Metadata V. Vocabulary Control VI. Structure & Relationships VII. Thesaurus Management VIII. Case Study IX. Related Topics
53
Metadata
Definition
Information about information
Purposes
1. Document surrogate (abstract)
2. Provides context (date, publisher)
3. Facilitates retrieval (subject)
54
Metadata
Ways to Leverage
User Interface
• Generate browsable indexes (site-wide, sub-site, specialized authority files)
• Enable field-specific searching (filters, zones, sorting)
• Support personalization (map profile to tags)
Behind the Scenes
• Enable efficient content management
• Support decentralized tagging
55
Metadata
Types of Indexing (Manual vs. Automated)
Full Text
• Manual: x
• Automated: complete text minus stop words
Keyword (Natural Language)
• Manual: humans assign “relevant” words and phrases
• Automated: software assigns “relevant” words and phrases
Controlled Vocabulary
• Manual: humans map variants to preferred terms
• Automated: software maps variants to preferred terms
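Automated full text indexing ("complete text minus stop words") amounts to building an inverted index. A minimal sketch, assuming a tiny illustrative stop list and simple lower-cased word tokens; STOP_WORDS, build_full_text_index, and the sample documents are hypothetical:

```python
import re
from collections import defaultdict

# Illustrative stop list only; real systems use much longer lists.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def build_full_text_index(documents):
    """Map every non-stop word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return dict(index)


docs = {
    1: "Software for Information Architecture",
    2: "The Architecture of the Web",
}
index = build_full_text_index(docs)
```

Nothing here knows that "architecture" in a building sense and in an information sense differ, which is the "aboutness" weakness of full text indexing.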
56
Metadata
Full Text Indexing
57
Metadata
Keyword Indexing
<HTML><HEAD><TITLE>STARTREK.COM:The Official Star Trek Web Site!</TITLE>
<META NAME='description' CONTENT='STARTREK.COM:The Official Star Trek Web Site! The starting point for all Star Trek information on the web.'>
<META NAME='keywords' CONTENT='star trek, enterprise, james kirk, mister spock, seven of nine, doctor mccoy, captain sulu, borg, klingon, romulan, ferengi, human, starfleet command, delta quadrant, alpha quadrant, gamma quadrant, excelsior, paramount, voyager, deep space nine, captain sisko, jean luc picard, kathryn janeway, starfleet academy, united federation of planets'>
<META NAME='author' CONTENT='Paramount Digital Entertainment'>
58
Metadata
CV Indexing
Partners/Competitors
UI / LRID    Accepted Term    Variant Terms
PC0004       Bell Atlantic    BellAtlantic; Bell Atlantic / North; NYNEX; Nynex
PC0091       NLG              National Leisure Group
PC0076       VH1              Video Hits 1; VH-1
59
Metadata
Indexing Guidelines
Considerations
• Specificity: rule of specific entry
• Exhaustivity: number of terms per document
• Aboutness: strive for consistent interpretation
• Consistency: can be more important than quality
• Quality: balance against speed and consistency
60
Metadata
Comparative Analysis
Full Text (extraction)
• High specificity enables precision (sometimes)
• Exhaustivity allows for high recall (sometimes)
Keyword (assignment or extraction)
• Relatively low level of investment
• Selection of more relevant words / phrases may increase recall and precision (sometimes)
Controlled Vocabulary (assignment)
• Synonym management increases recall
• Disambiguation increases precision (value increases with size, Medline > 6M documents)
• Enables hierarchical and “see also” browsing
61
Metadata
Cost Analysis
Searching Costs: # users, usage volume, user value, success value, size, complexity
Thesaurus Costs: complexity, vocabulary stability, technology
Indexing Costs: content volume, # fields, time per field, rate of growth / churn
62
Metadata
Automated Indexing
Primary Benefit
• Save money (cost of manually classifying 1 journal article = $1.70)
Approaches
• Term Extraction: extraction of “important” words and phrases (proximity, stemming)
• Latent Semantic Indexing: vector space approach (extracts meaning, training required)
Desired Features
• Assign terms from controlled vocabularies
• Integrate with thesauri, database tools, etc.
• Handle multi-lingual collections
63
Metadata
Automated Indexing
Software Categories & Labels
Search Engines, Data Mining, Text Extraction, Knowledge Management, Automatic Classification, Meta-Tagging
Leading Products
Metacode’s Metatagger (http://www.metacode.com/)
Mohomine (http://www.mohomine.com/)
Oingo (http://www.oingo.com/)
InXight Categorizer (http://www.inxight.com/)
Semio Taxonomy (http://www.semio.com/)
Inktomi / Ultraseek CCE (http://www.inktomi.com/)
64
Metadata
Selecting a Strategy
Factors to Consider                             Manual      Automated
Cost (per document)                             High        Low
Speed                                           Slow        Fast
Consistency                                     Variable    High
Quality                                         Variable    Variable
Multimedia-Capable                              Yes         No
Intelligent (understand text and guidelines)    Yes         No
65
Section Break
I. Thesauri in Context II. Value of Thesauri III. Methodology IV. Metadata V. Vocabulary Control VI. Structure & Relationships VII. Thesaurus Management VIII. Case Study IX. Related Topics
66
Vocabulary Control
Getting Started
Types
1. Equivalence
2. Hierarchical
3. Associative
Example (diagram)
(Preferred) Vermont
Equivalence: (Variant) Green Mountain State; (Variant) Vt
Hierarchical: (Broader) United States; (Narrower) Burlington
Associative: (Related) Skiing; (Related) Maple Syrup
67
Vocabulary Control
Identify Terms
Published Reference Materials: Thesauri, classification schemes, encyclopedias, dictionaries, glossaries, indexes
Content: Representative sample of web site / intranet
Users: Search log analysis, surveys, interviews
Experts: Authors, subject experts
68
Vocabulary Control
Organize Terms
1. Define preferred terms
2. Link synonyms and variants
3. Group preferred terms by subject
4. Identify broader and narrower terms
5. Identify related terms
Note: steps 3-5 are tentative designations and part of iterative process.
69
Vocabulary Control
Form of Preferred Terms
Grammatical Form (noun, adjective, verb)
Spelling (defined authority, house style)
Singular & Plural Form (count nouns)
Abbreviations & Acronyms (popular use)
Considerations
• Stemming helps (but not for mouse/mice)
• Global guidelines / term-specific decisions
• Rules simplify decision-making
• Consistency enhances usability
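The mouse/mice caveat can be illustrated with a deliberately naive plural stripper backed by an exception list. This is a sketch of the general idea, not a real stemming algorithm; IRREGULAR_PLURALS and singularize are hypothetical names, and the suffix rules cover only a few regular English patterns:

```python
# Term-specific decisions: irregular plurals that suffix rules cannot handle.
IRREGULAR_PLURALS = {"mice": "mouse", "children": "child", "feet": "foot"}


def singularize(word):
    """Reduce a plural count noun to its singular form using naive rules plus exceptions."""
    w = word.lower()
    if w in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[w]
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w
```

The global rules handle most terms, and the exception table carries the term-specific decisions that the slide calls out.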
70
Vocabulary Control
Selection of Preferred Terms
ANSI/NISO Z39.19-1993
3.0 “Literary warrant (occurrence of terms in documents) is the guiding principle for selection of the preferred (term).”
5.2.2 “Preferred terms should be selected to serve the needs of the majority of users.”
71
Vocabulary Control
Definition of Terms
The meaning of the term must be deliberately restricted.
Qualifiers (manage homographs)
Cells (biology) / Cells (electric)
Scope Notes (restrict meaning)
Hamburger. SN: includes burgers made with beef. Otherwise use “Turkey Burger” or “Veggie Burger”
Definition (clarify and educate)
Trend towards integration of glossaries
72
Vocabulary Control
Variant Terms
Variant terms provide the users with entry points into the vocabulary.
Synonyms (same meaning)
cats USE felines, helicopters USE whirlybirds
Lexical Variants (different word forms)
paediatrics USE pediatrics, BK USE Burger King
Quasi-Synonyms (treated as equivalent)
generic posting: beagle USE dog
antonyms/continuum: wetness USE dryness
73
Vocabulary Control
Recall and Precision
Costs: Time to Find, Failure to Find, Development
Precision Devices: Specificity, Coordination (AND), Compound Terms, Term Definition, Proximity
Recall Devices: Word Stemming, Variants (OR), Generic Posting, Relationships
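Precision and recall follow the standard information retrieval definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. A short sketch with made-up document ids shows how a narrow search and a broad search trade one for the other:

```python
def precision_and_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved ids against the relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection in which documents 1-5 are relevant to the query.
relevant_docs = {1, 2, 3, 4, 5}
narrow_search = {1}                        # e.g., one specific compound term
broad_search = {1, 2, 3, 4, 6, 7, 8, 9}    # e.g., variants ORed together

narrow_result = precision_and_recall(narrow_search, relevant_docs)
broad_result = precision_and_recall(broad_search, relevant_docs)
```

Here the narrow search is fully precise but finds one relevant document in five, while the broad search finds four in five at the cost of half its results being irrelevant.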
74
Vocabulary Control
Term Specificity
Assuming a good entry vocabulary, increased term specificity allows for improved precision without hurting recall (but costs grow fast).
Vocabulary A     Vocabulary B
United States    United States
                 California
                 San Diego
75
Vocabulary Control
Compound Terms
ANSI/NISO Z39.19.
“Each descriptor…should represent a single concept.”
ISO 2788.
“It is a general rule that…compound terms should be factored (split) into simple elements.”
76
Vocabulary Control
Compound Terms
Article: “Software for Information Architecture”
(diagram scale runs from High Precision to High Recall)
One Term       Information Architecture Software
Two Terms      Information Architecture | Software
Three Terms    Architecture | Information | Software
77
Section Break
I. Thesauri in Context II. Value of Thesauri III. Methodology IV. Metadata V. Vocabulary Control VI. Structure & Relationships VII. Thesaurus Management VIII. Case Study IX. Related Topics
78
Structure & Relationships
Types
• Bottom-up (semantic, term to term)
• Top-down (shape, classification)
Semantic Relationships (reciprocity)
• Equivalence
• Hierarchical
• Associative
79
Structure & Relationships
Semantic Relationships
(Preferred) Settlements
(Variant) Human Settlements
(Synonym) Inhabited Places
(Broader) Cultural Landscapes
(Narrower) Ghost Towns
(Related) Housing
(Related) Dwellings
80
Structure & Relationships
Semantic Relationships
Equivalence
• Use/Used For (USE/UF)
• Leads from variants to preferred, e.g., prams: USE baby carriages
A = B
81
Structure & Relationships
Semantic Relationships
Hierarchical
• Broader Term/Narrower Term (BT/NT)
Types
• Generic (class/species, inheritance): Vertebrata NT Amphibia
• Whole-Part (associative unless exclusive): Ear NT Vestibular Apparatus
• Instance (proper name): Seas NT Mediterranean Sea
Diagram: A above B
82
Structure & Relationships
Semantic Relationships
Associative
• Related Term (RT, See Also)
• Non-hierarchical and non-equivalent
• Relation should be “strongly implied”, e.g., hammers RT nails
A B
83
Structure & Relationships
Associative Relationships
Examples
Field of Study and Object of Study
• Forestry RT Forests
Process and its Agent
• Temperature Control RT Thermostat
Concepts and their Properties
• Poisons RT Toxicity
Action and Product of Action
• Weaving RT Cloth
Concepts Linked by Causal Dependence
• Bereavement RT Death
84
Structure & Relationships
Classification Schemes
SN Hierarchical arrangement of terms. In navigation context, use Hierarchy.
UF Categorization
Taxonomy
Ontology
RT Hierarchy
85
Structure & Relationships
Pre- & Post-Coordination
Enumerative Classification Schemes
• Pre-coordinate (more compound terms)
• All terms are enumerated (listed) in their entirety in the scheme.
• Example: Library of Congress Classification Scheme
Synthetic Classification Schemes
• Post-coordinate (more uni-terms)
• New terms can be created by combining terms during a search (AND).
• Example: Art & Architecture Thesaurus
86
Structure & Relationships
Pre- & Post-Coordination
• In the highly enumerative LC Classification, “Groundwater - - Pollution” and “Soil pollution” are dispersed at indexing (high precision, low recall).
• Keyword searching improves recall, hurts precision (a synthetic band-aid, potential false drop on “soil purification standards”).
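The false drop mentioned above can be reproduced with a toy collection. The sketch assumes two hypothetical documents with invented titles and pre-coordinated headings; keyword_and_search stands in for post-coordinate keyword searching, and precoordinated_search for lookup under a single enumerated heading:

```python
import re

# Hypothetical documents: a title plus the pre-coordinated heading assigned at indexing.
TITLES = {
    1: "Soil pollution in farmland",
    2: "Groundwater pollution and soil purification standards",
}
HEADINGS = {
    1: {"Soil pollution"},
    2: {"Groundwater -- Pollution"},
}


def keyword_and_search(titles, *keywords):
    """Post-coordinate search: return documents whose title contains every keyword."""
    wanted = {keyword.lower() for keyword in keywords}
    hits = []
    for doc_id, title in titles.items():
        words = set(re.findall(r"[a-z]+", title.lower()))
        if wanted <= words:
            hits.append(doc_id)
    return sorted(hits)


def precoordinated_search(headings, heading):
    """Pre-coordinate search: return documents indexed under exactly this heading."""
    return sorted(doc_id for doc_id, assigned in headings.items() if heading in assigned)
```

Searching the keywords "soil" AND "pollution" retrieves both documents, including the one about soil purification, while the heading "Soil pollution" retrieves only the first and misses the related groundwater document.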
87
Structure & Relationships
Polyhierarchy
Strict Hierarchies
• Each term appears in only one place in the hierarchy.
• Essential for placement of physical objects.
Polyhierarchies
• Terms cross-listed in multiple categories.
• Accepts complex nature of reality.
88
Structure & Relationships
Polyhierarchy
Medical Subject Headings (MeSH)
• Compound terms needed to manage 6 million documents in Medline.
• High level of pre-coordination forces polyhierarchy.
• Terms may have more than one BT.
Diagram labels: Viral Pneumonia, Virus Diseases, Respiratory Tract Diseases, Diseases
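A polyhierarchy is easy to represent once each term may list more than one broader term. The sketch below assumes a reading of the diagram in which Viral Pneumonia has two BTs (Virus Diseases and Respiratory Tract Diseases) that both roll up to Diseases; BROADER and all_broader_terms are hypothetical names:

```python
# Assumed reading of the slide's diagram: term -> list of broader terms (BT).
BROADER = {
    "Viral Pneumonia": ["Virus Diseases", "Respiratory Tract Diseases"],
    "Virus Diseases": ["Diseases"],
    "Respiratory Tract Diseases": ["Diseases"],
    "Diseases": [],
}


def all_broader_terms(term, broader=BROADER):
    """Collect every ancestor of a term, breadth first, visiting shared ancestors once."""
    seen = []
    queue = list(broader.get(term, []))
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(broader.get(current, []))
    return seen
```

Tracking visited terms matters in a polyhierarchy because two branches can meet again higher up, as they do here at Diseases.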
89
Structure & Relationships
Faceted Classification
Overview
• Invented by S.R. Ranganathan (1930s)
• Handle complex subjects (reality)
• One principle of division at a time
• Multiple “pure” taxonomies
• UF analytico-synthetic scheme, fielded database
Facets
• Fundamental facets: personality, matter, energy, space, time
• Common facets: subject (about), geography (in), author (by whom)
Examples: Art & Architecture Thesaurus, ASIS Thesaurus
90
Structure & Relationships
Facets, Coordination, Specificity
Facets
• Entities: Apples, Pears, Peaches
• Processes: Canning, Freezing, Drying
• Forms: Canned, Frozen, Fresh
Partial List of Potential Combinations
• Apples, Pears, Peaches
• Canning, Freezing, Drying
• Canned, Frozen, Fresh
• Canning of Apples, Canning of Pears, Canning of Peaches
• Freezing of Apples, Freezing of Pears, Freezing of Peaches
• Drying of Apples, Drying of Pears, Drying of Peaches
• Canned Apples, Canned Pears, Canned Peaches
• Frozen Apples, Frozen Pears, Frozen Peaches
• Fresh Apples, Fresh Pears, Fresh Peaches
• Freezing of Canned Apples, Canning of Dried Pears, Drying of Fresh Peaches
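The growth in combinations can be counted directly. This sketch uses the three facets from the slide and Python's itertools.product to enumerate the compound terms an enumerative scheme would have to list; the variable names and the phrase patterns ("X of Y") are illustrative choices:

```python
from itertools import product

ENTITIES = ["Apples", "Pears", "Peaches"]
PROCESSES = ["Canning", "Freezing", "Drying"]
FORMS = ["Canned", "Frozen", "Fresh"]

# A faceted (synthetic) vocabulary only needs the terms in each facet.
faceted_terms = ENTITIES + PROCESSES + FORMS

# An enumerative vocabulary must list each combination as its own compound term.
process_of_entity = [f"{p} of {e}" for p, e in product(PROCESSES, ENTITIES)]
form_entity = [f"{f} {e}" for f, e in product(FORMS, ENTITIES)]
process_of_form_entity = [
    f"{p} of {f} {e}" for p, f, e in product(PROCESSES, FORMS, ENTITIES)
]
enumerated_terms = process_of_entity + form_entity + process_of_form_entity
```

With only three values per facet, nine faceted terms already stand in for 45 enumerated compounds, and the gap widens quickly as facets grow.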
91
Structure & Relationships
Yahoo
Characteristics
• Single Facet (a topical hierarchy)
• Fairly Enumerative (search on “Boston” finds 45 categories including: Boston Celtics, Boston Tea Party, Anonymous Account of the Boston Massacre)
• Polyhierarchical (Computer Science@ listed under Computers & Internet and Science)
Observations
• Huge number of categories and levels (unwieldy)
• Fits user expectations (where do I find this?)
92
Structure & Relationships
ASIS Thesaurus
Characteristics
• Faceted (16 facets including document types, fields and disciplines, organizations, qualities)
• Fairly Synthetic (large percentage of one or two word single-concept descriptors)
• Polyhierarchical (machine aided indexing BT computer applications, BT indexing)
Observations
• Faceted approach allows small number of terms to be combined in large number of unexpected ways (e.g., ambiguity and informatics)
• Presentation is not accessible to typical user
93
Structure & Relationships
A Unification Theory
Hypothesis: This hybrid information architecture will become a common model for web sites and intranets over the next several years.
Taxonomy (single facet, enumerative)
• fits user expectations (where did they put this?)
• use for top few levels (familiar gateway to site)
• early user tests (best primary hierarchy)
• application of human expertise
Thesaurus (faceted, synthetic)
• fits content complexity (how can I describe this?)
• populate the hierarchy (combinations, see also)
• ongoing user tests (leverage power, flexibility)
• human-software hybrid (facet-specific solutions)
94
Section Break
I. Thesauri in Context II. Value of Thesauri III. Methodology IV. Metadata V. Vocabulary Control VI. Structure & Relationships VII. Thesaurus Management VIII. Case Study IX. Related Topics
95
Thesaurus Management
What’s Involved?
• Software, workflow, quality control
• Vocabularies evolve over time
• Impacts authors, indexers, users
Vocabulary Maintenance Tasks
• Add, delete, enhance, normalize terms
• Overall evaluation
96
Thesaurus Management
Software: What to Look For
• Traditional database functionality
• Compliant with standards (ANSI, ISO)
• Relationship control (reciprocity, validation, orphan identification)
• Term status (proposed, provisional, accepted)
• Flexible output (alphabetical, hierarchical)
• Integration with related tools and tasks (indexing, searching, browsing)
Willpower’s List of Thesaurus Software
http://www.willpower.demon.co.uk/thessoft.htm
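Relationship control (reciprocity, validation, orphan identification) is the kind of check such software automates. A minimal sketch, assuming a thesaurus stored as a dictionary of BT/NT/RT lists; the sample data deliberately contains one missing reciprocal and one orphan, and "screwdrivers" is an invented term:

```python
# Hypothetical thesaurus: term -> relationship lists.
THESAURUS = {
    "Vertebrata": {"BT": [], "NT": ["Amphibia"], "RT": []},
    "Amphibia": {"BT": ["Vertebrata"], "NT": [], "RT": []},
    "hammers": {"BT": [], "NT": [], "RT": ["nails"]},
    "nails": {"BT": [], "NT": [], "RT": []},         # reciprocal RT to hammers is missing
    "screwdrivers": {"BT": [], "NT": [], "RT": []},  # orphan: linked to nothing
}

# BT and NT are inverses of each other; RT is symmetric.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT"}


def missing_reciprocals(thesaurus):
    """List (term, relationship, target) links that should exist but do not."""
    problems = []
    for term, relationships in thesaurus.items():
        for relationship, targets in relationships.items():
            for target in targets:
                inverse = RECIPROCAL[relationship]
                if term not in thesaurus.get(target, {}).get(inverse, []):
                    problems.append((target, inverse, term))
    return problems


def orphans(thesaurus):
    """List terms that take part in no relationship at all."""
    linked = set()
    for term, relationships in thesaurus.items():
        for targets in relationships.values():
            if targets:
                linked.add(term)
                linked.update(targets)
    return sorted(term for term in thesaurus if term not in linked)
```

Running both checks after every edit is one way to keep a growing vocabulary internally consistent.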
97
Thesaurus Management
Software: What You’ll Find
Thesaurus Management Software
• Standards-compliant, sophisticated
• Poor integration (library-centric)
• Examples: Lexico, MultiTes
Database Management Software
• Strong integration
• Less thesaurus-specific functionality
• Examples: Oracle (interMedia), Sybase (English Wizard)
98
Thesaurus Management
Software: What You’ll Find
Search Engines
• Watch for casual use of “thesaurus”
• Look for integration with browsing.
Ultraseek
Thesaurus Expansion for Queries: Administrators may put sets of synonyms in the thesaurus.txt file…When a query matches one of the terms in that file, the synonyms will automatically appear, so the user has the option to add it to the query.
Verity
Verity's core search products include the following advanced knowledge retrieval capabilities: advanced query expansion and disambiguation tools, including linguistic stemming and thesaurus expansion.
99
Section Break
I. Thesauri in Context II. Value of Thesauri III. Methodology IV. Metadata V. Vocabulary Control VI. Structure & Relationships VII. Thesaurus Management VIII. Case Study IX. Related Topics
100
Case Study
Call Center Intranet
Introduction
• KM application
• 6,000 users (customer care associates)
• 8,000 documents (hierarchy, search)
• 6 month project (10/97 to 4/98)
• $500K of $10M redesign
Goals
• Reduce training time / time to find
• Increase use / customer satisfaction
101
Case Study: Call Center Intranet
Process Overview
Strategy
• Background, vocabulary, meetings, observation
• 4 weeks x 2.5 PM + 1 IA
Design
• Bottom-up focus (doc types, fields, templates)
• 4 weeks x 2 PM + 2 IA
• 4 weeks x 1 IA (during implementation)
Implementation
• Indexing / develop controlled vocabularies
• Specifications (authors, indexers, developers)
• 16 weeks x 4 indexers + 1 IA + 2 PM + 1 subject expert
102
Case Study: Call Center Intranet
Controlled Vocabularies
Primary Vocabularies
• Partners/Competitors (122)
• Plans/Promotions (173)
• Products/Services (151 / 184 variants)
• Geographic Codes (51)
Secondary Vocabularies
• Adjustment Codes (36)
• Corporate Terminology (70)
• Time Codes (12)
103
Case Study: Call Center Intranet
Primary Vocabularies
Partners/Competitors
UI / LRID    Accepted Term    Variant Terms
PC0004       Bell Atlantic    BellAtlantic; Bell Atlantic / North; NYNEX; Nynex
PC0091       NLG              National Leisure Group
PC0076       VH1              Video Hits 1; VH-1
104
Case Study: Call Center Intranet
Primary Vocabularies
Products/Services
UI / LRID    Accepted Term     Variant Terms
PS0135       Access Dialing    10-288; 10-322; dial around
PS0006       Air Miles         AirMiles
PS0151       XYZ Direct        USADirect; XYZ USA Direct; XYZDirect card
105
Case Study: Call Center Intranet
Primary Vocabularies
Geographic Codes
CT    Connecticut
DE    Delaware
DC    District of Columbia; Dist. of Columbia; Dist. Columbia
Note: Continental U.S. is equivalent to the lower 48 states.
106
Case Study: Call Center Intranet
Secondary Vocabularies
Adjustment Codes
DAK    Denies All Knowledge        -
MOS    Monthly Service Charge      Mnthly. Service Charge; Mnthly. Svc. Charge; Monthly Svc. Charge
WNO    Wrong Number                -
WTN    Working Telephone Number    Working Tele. Number
107
Case Study: Call Center Intranet
Secondary Vocabularies
Corporate Terminology
Billed Telephone Number (BTN)    Billed Tele. Number
Cross Boundary Account           Foreign Account
Fraud                            -
Multi Level Marketing            Multi-Level Marketing; MultiLevel Marketing; MLM
World Wide Web                   WWW; WorldWideWeb
108
Case Study: Call Center Intranet
Blueprints
Diagram labels: Customer Care; Browse by Plans & Promotions; Browse by Products & Services; Browse by Topics; Browse by Partners & Competitors; Browse by Geography; Browse by What's New; Advanced Search; Search Interface; Express Links (Top 10); Express Links
109
Case Study: Call Center Intranet
Wireframes: Content
110
Case Study: Call Center Intranet
Wireframes: Browsable Index
Provides ability to view all documents tagged with same preferred term. Ability to combine fields for powerful search/browse.
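A browsable index of this kind can be sketched as a mapping from (field, preferred term) pairs to document ids, with combined fields handled by intersecting the sets. The document ids and tag assignments below are invented; only the field names and preferred terms come from the case study vocabularies:

```python
from collections import defaultdict

# Hypothetical tagged documents: field name -> preferred terms assigned by indexers.
TAGGED_DOCUMENTS = {
    "doc-1": {"Partners/Competitors": ["Bell Atlantic"], "Geographic Codes": ["CT"]},
    "doc-2": {"Partners/Competitors": ["Bell Atlantic"], "Geographic Codes": ["DE"]},
    "doc-3": {"Products/Services": ["Air Miles"], "Geographic Codes": ["CT"]},
}


def build_browsable_index(documents):
    """Map each (field, preferred term) pair to the documents tagged with it."""
    index = defaultdict(set)
    for doc_id, fields in documents.items():
        for field_name, terms in fields.items():
            for term in terms:
                index[(field_name, term)].add(doc_id)
    return index


def browse(index, *field_term_pairs):
    """Return documents tagged with every given (field, term) pair."""
    sets = [index.get(pair, set()) for pair in field_term_pairs]
    return sorted(set.intersection(*sets)) if sets else []


browsable_index = build_browsable_index(TAGGED_DOCUMENTS)
```

Browsing one pair lists everything tagged with that preferred term; adding a second pair narrows the list, which is the "combine fields" capability described above.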
111
Case Study: Call Center Intranet
Deliverables Overview
• Blueprints and Wireframes
• Controlled Vocabularies
• Authoring & Indexing Guidelines
• Indexed Documents (4,000)
• Functional Specifications
• Documentation & Training
112
Section Break
I. Thesauri in Context II. Value of Thesauri III. Methodology IV. Metadata V. Vocabulary Control VI. Structure & Relationships VII. Thesaurus Management VIII. Case Study IX. Related Topics
113
Related Topics
Multi-Lingual Thesauri
Concepts
• Source / Target Language
• Degrees of Equivalence
• Localization, not Globalization
Facts (from The Mother Tongue by Bill Bryson)
• There are now more students of English in China than there are people in the United States
• The French can’t distinguish house and home
• Finnish has 15 case forms (noun variants)
• The Eskimos have 50 words for types of snow but no word that just means snow
• A blizzard in England is a flurry in Nebraska
114
Related Topics
The List Goes On…
Thesauri AND
• Business Strategy
• Content Management
• Markup Languages
• Notation
• XML
115
Seminar Review
I. Thesauri in Context II. Value of Thesauri III. Methodology IV. Metadata V. Vocabulary Control VI. Structure & Relationships VII. Thesaurus Management VIII. Case Study IX. Related Topics
116
How To Learn More
Argus Center for Information Architecture
Web Site
http://argus-acia.com
Email Newsletter
Strange Connections, Events, Interviews
Thesaurus Resources & Examples
http://argus-acia.com/seminars/
user name and password both = “lajolla”
117
Contact Us
Argus Associates, Inc.
912 North Main Street
Ann Arbor, Michigan 48104
(734) 913-0010
Sales [email protected]
Employment http://argus-inc.com/recruiting/
Web Sites http://argus-inc.com/
http://argus-acia.com/