Taxonomies: Insuring compatibility and crosswalks Marjorie M. K. Hlava Access Innovations / Data Harmony [email protected]
Dec 29, 2015
Background
"Underlying the information architecture for web sites and search are taxonomies. The standards for thesauri, taxonomies, ontologies, semantic web and topic maps are converging. Where do they differ and where are they the same? This one-hour talk will cover the ISO, ANSI/NISO, and W3C terminology and controlled vocabulary standards, as well as the differences in the new standards compared to the previous editions. Finally, it will talk about the crosswalks and registries underway between these development communities."
What we will cover today
Background
Overview of standards
Specifics on 3 things
  NISO Z39.19
  BSI 8723
  IFLA
Thoughts on a registry
Why are taxonomies hot?
Search doesn’t work without tagged data
Websites need them to display information
To tag navigation back to content
What’s happening to the business?
Carpetbaggers
Differences of opinion
Want to build on existing taxonomies
Need for standards
Need for crosswalks
Need for international communication
Need for general registries of taxonomies
The Problem – KEEPING UP
Many players we know and don’t know
Between controlled vocabulary standards
  ISO 2788 and 5964, BSI 8723
Groups developing guidelines and standards
  W3C with SKOS and OWL
Governments worldwide developing and mandating taxonomies
Communities increasing reuse, mapping, and interoperability between controlled vocabularies
Traditional Standards
ISO
  TC 46 SC 9
ANSI NISO
  Z39.19
BSI
  BS 8723
W3C
  OWL
  SKOS
US Government Office of Management and Budget
European Union
Thesaurus related
NISO Z39.19 2006 www.niso.org
BSI (BS 8723) - the next revised ISO
ISO 2788 - Monolingual (1986)
ISO 5964 - Multilingual (1985)
  www.iso.ch/iso/en/ISOOnline.frontpage
ISO 5127 - Information and documentation - Vocabulary
OWL from W3C
SKOS - the W3C thesaurus standard
Thesaurus and Indexing Standards – ANSI/NISO
ANSI/NISO Z39.19 - 2003 Guidelines for the Construction, Format, and Management of Monolingual Thesauri
NISO Z39.19-200x Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies
NISO TR02-1997 Guidelines for Indexes and Related Information Retrieval Devices, by James D. Anderson
Z39.19 - What’s new?
The old standard
  Coverage: documents
  Types of vocabularies: thesauri
  Single BT
  Post-coordinated
  Printed formats
  Monolingual vocabularies
The revised standard
  Coverage: content objects
  Types of vocabularies: lists, synonym rings, taxonomy
  Pre-coordinated
  Web format
  Multilingual vocabularies (general)
  Polyhierarchical
  Interoperability
  Facet analysis
British Standards - BS 8723
Structured vocabularies for information retrieval – Guide
  Part 1: General
  Part 2: Thesauri
  Part 3: Vocabularies other than thesauri
  Part 4: Interoperability between vocabularies
  Part 5: Interoperability with applications
ISO TC 37
Scope of ISO TC 37: Standardization of principles, methods and applications relating to terminology and other language resources.
TC 37/SC 1 - Principles and methods
TC 37/SC 2 - Terminography and lexicography
TC 37/SC 3 - Computer applications for terminology
TC 37/SC 4 - Language resource management
Other ISO standards: Concept-oriented terminology
ISO 704:2000 Terminology work - Principles and methods
ISO 860:1996 Terminology work - Harmonization of concepts and terms
ISO 1087-1:2000 Terminology work - Vocabulary - Part 1: Theory and application
ISO 1087-2:2000 Terminology work - Vocabulary - Part 2: Computer applications
ISO 10241:1992 Preparation and layout of international terminology standards
Sample ISO - Data Categories
ISO 12200:1999 Computer applications in terminology - Machine-readable terminology interchange format (MARTIF) - Negotiated interchange
ISO 12616:2002 Translation-oriented terminography
ISO/TR 12618:1994 Computer aids in terminology - Creation and use of terminological databases and text corpora
ISO 12620:1999 Computer applications in terminology - Data categories
  Used to create glossaries
ISO Thesaurus and Indexing Standards
ISO 2788:1986 Documentation - Guidelines for the establishment and development of monolingual thesauri
ISO 5964:1985 Documentation - Guidelines for the establishment and development of multilingual thesauri
ISO 5963:1985 Documentation - Methods for examining documents, determining their subjects, and selecting indexing terms
ISO 999:1996 Information and documentation - Guidelines for the content, organization and presentation of indexes
ISO TC 46/SC 9
Information and Documentation - Identification and Description
TC 46 is ISO's Technical Committee (TC) for information and documentation standards.
SC 9 is the TC 46 Subcommittee (SC) that develops and maintains ISO standards on the identification and description of information resources.
ANSI/NISO Thesaurus and Indexing Standards
ANSI/NISO Z39.19 - 2003 Guidelines for the Construction, Format, and Management of Monolingual Thesauri
NISO Z39.19-200x Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies
NISO TR02-1997 Guidelines for Indexes and Related Information Retrieval Devices, by James D. Anderson
Reports to use
Report on the Workshop on Electronic Thesauri, November 4-5, 1999
  http://www.niso.org/news/events_workshops/thes99rprt.html
Final Report to the ALCTS/CCS Subject Analysis Committee: Subcommittee on Subject Relationships/Reference Structures, June 1997
  http://archive.ala.org/alcts/organization/ccs/sac/rpt97rev.html
Other links
http://esw.w3.org/topic/SkosDev/ThesaurusLinks/
XmlFormats
  MARC-21 XMLSchema
  Zthes Z39.50 profile for thesaurus navigation (2001)
  TML thesaurus markup language (1999)
  ADL Thesaurus Protocol XML formats (2002)
  MeSH XML format (2001)
  GEMET XML format (2003)
  APAIS XML thesaurus format, an extension of Zthes (2000)
  Open University thesaurus schemas (2002)
  Soergel XML thesaurus specification (2001)
W3C
OWL – Web Ontology Language
RDF – Resource Description Framework
Topic Maps
SKOS - Simple Knowledge Organization System
Which community to serve?
Build on the current standard
Might make this link next
Other things to watch
Other W3C and ISO areas
Support groups
  Blogs
  Communities of Practice
SIMILE
Web 2.0 activities
WSDL – Web Services Description Language
Other Relevant ISO & W3C Standards
For translation, terminology and applied linguists go to: http://appling.kent.edu/ResourcePages/LTStandards/Chart/standards.chart.htm#Ontology
• Markup Languages
• Metadata Resources
• Character Coding
• Access Protocols and Interoperability
• Content Creation, Manipulation, and Maintenance
• Authoring Standards
• Text and Content Markup
• Translation Standards
• Terminology and Lexicography Standards
• ISO TC 37 Standards
• Terminology Interchange Standards
• Controlled Language Standards
• Taxonomy and Ontology Standards
• Corpus Management Standards
• Locale-Related Standards
SIMILE Semantic Interoperability of Metadata and Information in unLike Environments
Forming a data reference for open source taxonomies
Revised Standards for Controlled Vocabularies
U.S. Standard (NISO Z39.19 - 2005)
British Standard (BS 8723 - 2005)
IFLA Guidelines - 2005
U.S. Standard for Controlled Vocabularies – NISO Z39.19
NISO Z39.19-200x Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies
Some of the slides are based on Emily Fayen's SLA presentation (2004.6), Margie Hlava’s talk at the 2005 Data Harmony User Group meeting, and Marcia Zeng's talk at the NKOS Meeting in Denver.
A little bit of history…
ANSI/NISO Z39.19, Guidelines for the Construction, Format, and Management of Monolingual Thesauri – 1993
  The most frequently requested NISO Standard
  In spite of its age the Standard is still relevant
1999: NISO Workshop on Electronic Thesauri
  http://www.niso.org/news/events_workshop/thes99rpt.html
2002: NISO initiates revision of Z39.19
2004: 1993 reaffirmed
2005: new standard published
Scope
Expand beyond thesaurus
Make more user-friendly
Explain important concepts
Explain principles of vocabulary control
Include electronic information environment
Include additional user search methods:
  Browse
  Navigate
  Keyword searching
Expand beyond A & I services
Include Web applications
The Team:
Vivian Bliss – Microsoft
Carol Brent – ProQuest
John Dickert – DTIC
Lynn El-Hoshy – Library of Congress
Marjorie Hlava – Access Innovations
Stephen Hearn – ALA
Sabine Kuhn – Chemical Abstracts Service
Pat Kuhr – H.W. Wilson Company
Diane McKerlie – DMA Consulting
Peter Morville – Semantic Studios
Stuart Nelson – National Library of Medicine
Allan Savage – National Library of Medicine
Diane Vizine-Goetz – OCLC
Marcia Lei Zeng – Special Libraries Association
Z39.19 Chapters
1. Introduction
2. Scope
3. Referenced Standards
4. Definitions, Abbreviations, and Acronyms
5. Controlled Vocabularies – Purpose, Concepts, Principles, and Structure
6. Term Choice, Scope, and Form
7. Compound Terms
8. Relationships
9. Displaying Controlled Vocabularies
10. Interoperability
11. Construction, Testing, Maintenance, and Management Systems
Principles of Controlled Vocabularies
There are four important principles of vocabulary control that guide their design and development.
• eliminating ambiguity
• controlling synonyms
• establishing relationships among terms where appropriate
• testing and validation of terms
Type of vocabulary control
Lists
A list is a simple group of terms
Example:
Alabama
Alaska
Arkansas
California
Colorado
. . . .
Frequently used in Web site pick lists and pull down menus
Synonym Rings
A synonym ring is a list of synonyms or near synonyms that are used interchangeably for retrieval purposes
Synonym Rings -- Examples
Synonym rings are usually found as sets of lists that allow users to access all content containing any of the terms.
e.g., cholesterol:
  Cholesterol
  Blood Cholesterol
  Serum Cholesterol
  Good Cholesterol
  Bad Cholesterol
  LDL
  . . .
-- Frequently used in systems where the content is not indexed or the indexing vocabulary is not controlled
An example from International SEMATECH: a search for Silicon would look like this:
  Your search was submitted as “SILICON” or “SI”
Synonym Rings are used
-- To expand queries for content objects: any one of these terms retrieves any of the terms in the cluster
-- With unstructured natural language format, where the interface draws together similar terms
-- With search engines
-- To help control the diversity of the language
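A minimal sketch of the query expansion described above, assuming each ring is stored as a plain set of lowercase terms. The ring contents come from the slide examples; the function names `expand_query` and `to_boolean_query` are hypothetical and are not taken from any search product.

```python
# Each ring is a set of interchangeable terms (lowercased for matching).
# Terms are taken from the cholesterol and SEMATECH examples on the slides.
SYNONYM_RINGS = [
    {"cholesterol", "blood cholesterol", "serum cholesterol",
     "good cholesterol", "bad cholesterol", "ldl"},
    {"silicon", "si"},
]


def expand_query(term, rings=SYNONYM_RINGS):
    """Return every term in the ring that contains `term`, sorted.

    A term that belongs to no ring is returned on its own.
    """
    key = term.strip().lower()
    for ring in rings:
        if key in ring:
            return sorted(ring)
    return [key]


def to_boolean_query(term):
    """Render the expanded ring as an OR query string."""
    return " OR ".join(f'"{t.upper()}"' for t in expand_query(term))


print(to_boolean_query("Silicon"))
```

Because a ring has no preferred term, any member expands to the whole cluster, which is what separates it from the Use/Used For relationship in a thesaurus.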
Taxonomies
A taxonomy is a set of preferred terms, all connected by a hierarchy or polyhierarchy
Example:
Chemistry
  Organic chemistry
    Polymer chemistry
      Nylon
Frequently used in web navigation systems
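One way to picture the difference between a hierarchy and a polyhierarchy is to store, for each term, the list of its parents. The sketch below uses the chemistry example from the slide; "Materials science" is a hypothetical second parent added only to show a polyhierarchy, and `paths_to_root` is an illustrative name.

```python
# Child term -> list of parent terms. A term with more than one parent
# makes the structure a polyhierarchy.
# "Materials science" is a hypothetical extra parent, not from the slide.
PARENTS = {
    "Organic chemistry": ["Chemistry"],
    "Polymer chemistry": ["Organic chemistry", "Materials science"],
    "Nylon": ["Polymer chemistry"],
}


def paths_to_root(term, parents=PARENTS):
    """Return every path from a top term down to `term`."""
    ups = parents.get(term, [])
    if not ups:
        # No parents recorded: this is a top term.
        return [[term]]
    return [path + [term] for up in ups for path in paths_to_root(up, parents)]


for path in paths_to_root("Nylon"):
    print(" > ".join(path))
```

In a web navigation system each path would correspond to one breadcrumb trail leading to the same page.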
Thesauri
A thesaurus is a controlled vocabulary with multiple types of relationships
Example:
Rice
  UF paddy
  BT Cereals
  BT Plant products
  NT Brown rice
  RT Rice straw
Thesauri (cont.)
Relationship types:
  Equivalence (Use/Used For) – indicates preferred term in a synonym relationship
  Hierarchy – indicates broader and narrower terms
  Associative – almost unlimited types of relationships may be used (related terms)
It is the most complex format for controlled vocabularies and is widely used.
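The three relationship types can be sketched as a small record type, using the Rice example from the slide. This is an illustration under simple assumptions (labels are unique, relationships are stored as lists of labels); the names `Term`, `build_thesaurus` and `preferred` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One thesaurus entry with the relationship tags used on the slide."""
    label: str
    uf: list = field(default_factory=list)  # Used For (non-preferred synonyms)
    bt: list = field(default_factory=list)  # Broader Terms
    nt: list = field(default_factory=list)  # Narrower Terms
    rt: list = field(default_factory=list)  # Related Terms


def _add(values, label):
    """Append a label to a relationship list if it is not already there."""
    if label not in values:
        values.append(label)


def build_thesaurus(terms):
    """Index terms by label and fill in the reciprocal relationships."""
    index = {t.label: t for t in terms}
    for term in list(index.values()):
        for broader in term.bt:
            _add(index.setdefault(broader, Term(broader)).nt, term.label)
        for narrower in term.nt:
            _add(index.setdefault(narrower, Term(narrower)).bt, term.label)
        for related in term.rt:
            _add(index.setdefault(related, Term(related)).rt, term.label)
    return index


def preferred(entry, thesaurus):
    """Resolve an entry term to its preferred term, or None if unknown."""
    wanted = entry.casefold()
    for term in thesaurus.values():
        if wanted == term.label.casefold() or wanted in (u.casefold() for u in term.uf):
            return term.label
    return None


thesaurus = build_thesaurus([
    Term("Rice", uf=["paddy"], bt=["Cereals", "Plant products"],
         nt=["Brown rice"], rt=["Rice straw"]),
])
print(preferred("paddy", thesaurus))
```

Filling in the reciprocal entries (BT implies NT on the other term, RT is symmetric) is the kind of consistency check that thesaurus management software normally performs.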
Interoperability
One of the most important issues from the 1999 workshop
Question: How to
  compare indexes
  perform searches
  merge databases
that have been developed using different controlled vocabularies?
Interoperability (cont.)
Factors Affecting Interoperability
Multilingual Controlled Vocabularies
Searching
Indexing
Merging Databases
Merging Controlled Vocabularies
Achieving Interoperability
Storage and Maintenance of Relationships among Terms in Multiple Controlled Vocabularies
II. The British Standard
BS 8723: Structured Vocabularies for Information Retrieval – Guide
Slides based on the presentation by Stella G Dextre Clarke, Alan Gilchrist, and Leonard Will at ISKO 2004, London
Existing BSI/ISO thesaurus standards
ISO 2788-1986 Guidelines for the establishment and development of monolingual thesauri
  = BS 5723:1987
ISO 5964-1985 Guidelines for the establishment and development of multilingual thesauri
  = BS 6723:1985
What needs updating?
Printed versus electronic application
Guidance on management software
Interoperability:
  Mapping between thesauri and other types of vocabulary
  Formats/protocols for data exchange with downstream applications
Applicability to end-user applications, not just those for information professionals
Outline of new standard
BS 8723: Structured vocabularies for information retrieval – Guide
  Part 1 - Definitions, symbols and abbreviations
  Part 2 - Thesauri
  Part 3 - Vocabularies other than thesauri
  Part 4 - Interoperability between vocabularies
  Part 5 - Interoperation between vocabularies and other components of information storage and retrieval systems
Part 3 chapters
Classification schemes
Subject heading lists
Taxonomies
Ontologies
Semantic nets (?)
Search thesauri
Issues for Part 3
How much guidance is needed on how to build other sorts of vocabulary?
Should we describe the idiosyncrasies of existing schemes, even where we judge there is a ‘better’ way?
Pick out the characteristics of different vocabulary types that govern when and how you can map them.
But some of the observable characteristics might not be what we’d recommend.
Part 4: Interoperability between vocabularies
Huge demand for accessing information indexed with another language and/or vocabulary: ‘mapping’. The Semantic Web is just one application.
Includes multilingual thesauri, a special case of mapping between vocabularies.
Applies where more than one language or vocabulary is in use but access to all resources is through one vocabulary.
Part 4: Interoperability between vocabularies (cont.)
BS 8723 Part 4 has a wider scope than BS 6723, which dealt only with multilingual thesauri.
BS 8723 extends the scope to:
  thesauri in different dialects of one language
  different thesauri in a single language
  situations where a thesaurus interoperates with one or more different types of structured vocabulary, such as classification schemes
  situations where not all the interoperating vocabularies have the same status and/or function
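As a rough illustration of mapping between vocabularies, a crosswalk can be held as a table from source terms to target terms, with each mapping tagged by how close the match is. Everything here is an assumption made for illustration: the data, the match labels ("exact", "broader", "narrower", loosely modelled on common mapping practice) and the function name `translate`. None of it is taken from BS 8723.

```python
# Source term -> list of (target term, match type).
# Hypothetical crosswalk between a British English and a US English
# vocabulary; "stopcocks" is an invented entry showing a broader match.
CROSSWALK = {
    "taps": [("faucets", "exact")],
    "water taps": [("water faucets", "exact")],
    "cranes": [("cranes (birds)", "narrower"),
               ("cranes (lifting equipment)", "narrower")],
    "stopcocks": [("faucets", "broader")],
}


def translate(terms, crosswalk=CROSSWALK, accept=("exact",)):
    """Map index terms to the target vocabulary.

    Returns (mapped, unmapped). Only mappings whose match type is listed
    in `accept` are used; everything else is reported for human review.
    """
    mapped, unmapped = [], []
    for term in terms:
        hits = [target for target, kind in crosswalk.get(term, []) if kind in accept]
        if hits:
            mapped.extend(hits)
        else:
            unmapped.append(term)
    return mapped, unmapped


print(translate(["taps", "cranes"]))
print(translate(["cranes"], accept=("exact", "narrower")))
```

Keeping the match type with each mapping lets a search application decide how loosely to expand, while an indexing workflow can restrict itself to exact matches.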
Part 5: Interoperability with applications
Vocabularies must work with
  Search software
  Content Management Systems
  Web publishing software, etc.
Build on existing formats and protocols for data exchange
  Z39.50 and Zthes
  XML schema
  DTD
  MARC
  SKOS Core Schema
  Topic Map
  ADL gazetteer protocol
  W3C crosswalks
  OMB - Section 207 of the E-Government Act
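To make the data exchange idea concrete, the sketch below writes one thesaurus term as a SKOS concept in Turtle syntax using only string formatting. The SKOS namespace and property names (`prefLabel`, `altLabel`, `broader`, `narrower`, `related`) are the published W3C ones; the base URI `http://example.org/vocab/`, the slug rule and the function names are assumptions for illustration, and labels containing quotes or other special characters are not handled.

```python
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."
BASE = "http://example.org/vocab/"  # hypothetical base URI, not from the talk


def uri(label):
    """Build a concept URI from a label (naive slug, assumes simple labels)."""
    return f"<{BASE}{label.strip().lower().replace(' ', '-')}>"


def to_turtle(label, uf=(), bt=(), nt=(), rt=()):
    """Serialize one thesaurus term as a SKOS concept in Turtle syntax."""
    pairs = [("skos:prefLabel", f'"{label}"@en')]
    pairs += [("skos:altLabel", f'"{v}"@en') for v in uf]   # UF
    pairs += [("skos:broader", uri(v)) for v in bt]          # BT
    pairs += [("skos:narrower", uri(v)) for v in nt]         # NT
    pairs += [("skos:related", uri(v)) for v in rt]          # RT
    body = " ;\n".join(f"    {prop} {obj}" for prop, obj in pairs)
    return f"{SKOS_PREFIX}\n\n{uri(label)} a skos:Concept ;\n{body} .\n"


rice_ttl = to_turtle("Rice", uf=["paddy"], bt=["Cereals", "Plant products"],
                     nt=["Brown rice"], rt=["Rice straw"])
print(rice_ttl)
```

A production export would normally go through an RDF library rather than string formatting; the point here is only how the UF/BT/NT/RT tags line up with SKOS properties.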
Review and Comments
Request a copy for Parts 1, 2, 3 and 4:
  Parts 1 and 2 numbered 04/30086620 DC and 04/30094113 DC.
  The documents may be ordered from BSI Customer Services, tel +44(0)208-996-9001 or email [email protected]
Part 5 is out for comment
III. IFLA Guidelines for Multilingual Thesauri
IFLA Classification and Indexing Section
April 2005: released for comments
Published 2005
World-Wide Review of IFLA Guidelines for Multilingual Thesauri
URL: http://www.ifla.org/VII/s29/pubs/Draft-multilingualthesauri.pdf
Adds to ISO 5964 for multilingual thesauri
IFLA Classification and Indexing Section WG on Guidelines for Multilingual Thesauri
Chair: Gerhard J.A. Riesthuis (Netherlands)
Members: Lois Mai Chan (USA), Patrice Landry (Switzerland), Pia Leth (Sweden), Ia McIlwaine (United Kingdom), Martin Kunz (Germany), Dorothy McGarry (USA), Max Naudi (France), Marcia Lei Zeng (USA)
Three approaches in the development of multilingual thesauri:
1. building a new thesaurus from the bottom up
  starting with one language and adding another language or languages
  starting with more than one language simultaneously
2. combining existing thesauri
  merging two or more existing thesauri into one new (multilingual) information retrieval language to be used in indexing and retrieval
  linking existing thesauri and subject heading languages to each other; using the existing thesauri and/or subject heading languages both in indexing and retrieval
3. translating a thesaurus into one or more other languages
Semantic problems
Semantic problems pertain to equivalence relations between terms used as preferred and non-preferred terms in information retrieval languages.
Equivalence relations exist not only within each separate language involved, but also between the languages (intra-language equivalence and inter-language equivalence).
Intra-language homonymy and inter-language homonymy are also considered semantic questions.
Additional problems pertaining to semantics involve the scope, form and choice of thesaurus terms.
Structural problems
Structural problems involve hierarchical and associative relations between the terms.
An important question in this respect is whether the structure should be the same or different for each language.
In most if not all cases of linking, the structure will most probably not be the same in all the information retrieval languages involved.
In the other approaches mentioned it is possible in principle to apply the same structure to all languages.
Contents covered by the guidelines
Building multilingual thesauri starting from scratch
  Structure
  Morphology and Semantics
Starting from existing thesauri
  Merging
  Linking
Glossary
Appendix: An example of a non-symmetrical thesaurus
Examples are in multiple languages
English (British)           English (USA)               Dutch                             French
cranes (birds)              cranes (birds)              kraanvogels                       grue (oiseau)
cranes (lifting equipment)  cranes (lifting equipment)  hijskranen                        grue (appareil de levage)
                                                          SN voor andere typen kranen, zie aldaar
water taps                  water faucets               waterkranen                       robinet à eau
gas taps                    gas faucets                 gaskranen                         robinet à gaz
taps                        faucets                     kranen                            robinet
                                                          SN voor kranen als hijswerktuig gebruik hijskranen
  NT water taps               NT water faucets            NT waterkranen                    NT robinet à eau
  NT gas taps                 NT gas faucets              NT gaskranen                      NT robinet à gaz
That cranes is a homograph in English does not necessarily mean that the equivalent terms in other languages are also homographs. The Dutch term kranen is a homograph too, but with the meanings cranes (lifting equipment) and taps.
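The non-symmetry in this example can be checked mechanically if each concept carries one preferred label per language. The sketch below uses the labels from the table; the concept identifiers, the language codes and the helper names `head_word` and `senses` are hypothetical, and a parenthetical qualifier is assumed to be the only way a label marks a homograph.

```python
import re

# Concept id -> preferred label per language, taken from the example table.
# Concept ids and language codes are illustrative.
CONCEPTS = {
    "crane_bird": {"en-GB": "cranes (birds)", "en-US": "cranes (birds)",
                   "nl": "kraanvogels", "fr": "grue (oiseau)"},
    "crane_lift": {"en-GB": "cranes (lifting equipment)",
                   "en-US": "cranes (lifting equipment)",
                   "nl": "hijskranen", "fr": "grue (appareil de levage)"},
    "water_tap": {"en-GB": "water taps", "en-US": "water faucets",
                  "nl": "waterkranen", "fr": "robinet à eau"},
    "gas_tap": {"en-GB": "gas taps", "en-US": "gas faucets",
                "nl": "gaskranen", "fr": "robinet à gaz"},
    "tap": {"en-GB": "taps", "en-US": "faucets",
            "nl": "kranen", "fr": "robinet"},
}


def head_word(label):
    """Strip a trailing parenthetical qualifier, e.g. 'cranes (birds)' -> 'cranes'."""
    return re.sub(r"\s*\(.*\)$", "", label)


def senses(word, lang, concepts=CONCEPTS):
    """Return the concepts whose label in `lang` reduces to `word`."""
    return [cid for cid, labels in concepts.items()
            if head_word(labels[lang]) == word]


def translate(label, source, target, concepts=CONCEPTS):
    """Translate a preferred label from one language to another, or None."""
    for labels in concepts.values():
        if labels[source] == label:
            return labels[target]
    return None


print(senses("cranes", "en-GB"))
print(senses("kranen", "nl"))
print(translate("taps", "en-GB", "nl"))
```

English and French need qualifiers to separate the two senses of cranes and grue, while the Dutch labels are distinct words, so the qualifier pattern differs from language to language even though the concepts line up.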
What is a taxonomist to do?
Watch the standards
Participate in development
Exceed the guidelines
Comply with all standards – internationally
Promote standards participation
And we do – so far!
Controlled vocabularies of all stripes need a place to call home
Open contribution
Thesaurus metadata contributions
Comments on the contributions
Examples of implementation
A clearinghouse to keep track of all the initiatives and suggested standards, a means to allow input from and to those initiatives, and publishing of best practices or lessons learned from implementations
Perhaps a WikiKOS
The Solutions
Registry?
NKOS
KOS of KOS
SKOS participants
KOS typology - Tudhope
Tesauro.com – Spanish - Salama
Kent.edu site – Marcia Zeng
Taxonomy Warehouse – Factiva - Clarke
UMLS - Unified Medical Language System
More Solutions
SIMILE - Semantic Interoperability of Metadata and Information in unLike Environments (Open Source)
UK HILT - Dennis Nicholson
Good starts
Link to each other
Include
  Thesauri
  Taxonomies
  Semantic webs
  Classification systems
  Subject headings
  SKOS
  OWL and Ontologies
  Other KOS
What about?
  Authority Files
  Other pick lists
  Roget's and other synonym rings
  Dictionaries
  Gazetteers
  Glossaries
  Etc.
Discussion??
Thank you for your attention!
Marjorie M. K. Hlava
Access Innovations / Data Harmony