1 Enhancing search An update on taxonomies, metadata and thesauri Leonard Will Willpower Information
Jan 29, 2016
1
Enhancing search: An update on taxonomies, metadata and thesauri
Leonard Will
Willpower Information
2
Summary
1 Metadata creation is cataloguing
2 Taxonomies are classifications
3 Thesauri and classifications are complementary ways of grouping concepts
4 Facet analysis is a useful technique for constructing schemes systematically
5 Most computer search interfaces are inadequate
3
Metadata = catalogue records
• Resources: any things that can be identified
– documents, web pages, images, sound files, teaching packages, books, museum objects, people, organisations
• Metadata: structured information about resources
– May be included with resources (e.g. “CIP”) or collected in separate “union catalogues” (e.g. OAI-PMH)
– Some from the resource itself (size, format), some from external sources (provenance, location, accessibility)
4
Metadata standards
• Anglo-American Cataloguing Rules (AACR)
• Encoded Archival Description (EAD)
• Learning Object Metadata (LOM)
• Spectrum standard for museum information
• Friend of a Friend (FOAF) and vCard
• e-Government Metadata Standard (eGMS)
• Dublin Core - lowest common denominator
5
Kinds of standards
• Content standards: which pieces of information are to be recorded (DC, AACR)
• Value standards: how the information is to be recorded (= DC encoding schemes)
– formats (ISO date format, NCA name formats, AACR)
– lists of valid values (thesauri, authority files)
• Structure standards: how the information is to be grouped and labelled for use by computers and humans (XML schemas, MARC)
• Application profiles: Choices from the above
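A value standard can be pictured as a set of checks applied to the values in a record. The following is a minimal sketch only: the list `VALID_SUBJECTS`, the record layout and the function names are assumptions made for illustration, not part of any of the standards named above.

```python
from datetime import date

# Hypothetical list of valid subject values, standing in for a thesaurus.
VALID_SUBJECTS = {"agriculture", "automobiles", "photography"}

def is_iso_date(value):
    """True if value is a calendar date in ISO 8601 form such as 2016-01-29."""
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True

def check_record(record):
    """Return a list of problems found in a record's 'date' and 'subject' values."""
    problems = []
    if not is_iso_date(record.get("date", "")):
        problems.append("date is not in ISO format")
    if record.get("subject") not in VALID_SUBJECTS:
        problems.append("subject is not a valid value")
    return problems

print(check_record({"date": "2016-01-29", "subject": "agriculture"}))
print(check_record({"date": "29/01/2016", "subject": "farming"}))
```

The first record passes both checks; the second fails on format (the date) and on the list of valid values (the subject).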
6
Dublin Core metadata
• Title
• Creator
• Subject
• Description
• Publisher
• Contributor
• Date
• Type
• Format
• Identifier
• Source
• Language
• Relation
• Coverage
• Rights
• + element refinements
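As a rough illustration, a Dublin Core description can be held as a simple mapping from element names to values. Only the fifteen element names come from the list above; the subject values and the helper name `unknown_elements` are assumptions for the sake of the example.

```python
# The fifteen Dublin Core elements listed above (lower-cased for use as keys).
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# An illustrative record; the subject and language values are invented.
record = {
    "title": "Enhancing search",
    "creator": "Leonard Will",
    "subject": ["thesauri", "classification schemes"],
    "language": "en",
}

def unknown_elements(rec):
    """Return the keys of rec that are not Dublin Core element names."""
    return sorted(set(rec) - DC_ELEMENTS)

print(unknown_elements(record))                 # []
print(unknown_elements({"author": "L. Will"}))  # ['author']
```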
7
Subject
“Typically, Subject will be expressed as keywords, key phrases or classification codes that describe a topic of the resource.
Recommended best practice is to select a value from a controlled vocabulary or formal classification scheme.”
8
Taxonomies = controlled vocabularies
• “Taxonomy”: woolly meaning -> confusion
– keep it for biological classification systems
• Knowledge organization systems (KOS)
– a better expression for the general concept
• Main types are
– thesauri
– classification schemes
– ontologies
9
Thesauri and classification schemes
• Thesauri and classification schemes are alternative ways of showing concepts and their relationships
• They are complementary and both approaches are needed
• They can both be built on the principles of facet analysis
10
Building blocks of all knowledge organisation schemes
• concepts
• relationships
35 mm cameras
  CC: H012
  BT: film cameras
aqualungs
  CC: D002
  BT: diving equipment
camera accessories
  CC: H002
  BT: photographic equipment
  NT: flash guns
      light meters
      tripods
  RT: cameras
11
Relationships are between concepts, not words
Broader concept (BT), labelled by any of: vehicles, road vehicles, conveyances, voitures, 388.34, 629.2
Narrower concept (NT), labelled by any of: cars, automobiles, autos, private cars, 388.342, 629.222
Choose one term as a descriptor to label the concept:
cars USE automobiles
12
Preferred term substitution
Anything on farming?
I use the term agriculture for farming, so I’ll search for that
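Preferred term substitution can be sketched as a lookup from entry terms to descriptors. The table `USE` below is a hypothetical fragment built from the examples on these slides, and `preferred_term` is an illustrative name rather than a function of any particular thesaurus package.

```python
# Hypothetical USE references: non-preferred term -> preferred term (descriptor).
USE = {
    "cars": "automobiles",
    "autos": "automobiles",
    "private cars": "automobiles",
    "farming": "agriculture",
}

def preferred_term(term):
    """Map a searcher's term to its descriptor; descriptors map to themselves."""
    key = term.strip().lower()
    return USE.get(key, key)

print(preferred_term("farming"))      # agriculture
print(preferred_term("Automobiles"))  # automobiles
```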
13
Relationships between concepts
• Paradigmatic, or a priori: apply generally, independently of any specific document
– shoes BT footwear
– shoes RT shoemakers
• Syntagmatic, or a posteriori: concepts that are related only in the context of a specific document
– shoes : history
– shoes : prices
A thesaurus can show these
A classification scheme can also show these
14
Searching hierarchies
I need information on road vehicles
I know that buses, cars and lorries are all kinds of road vehicles, so I’ll search for these terms as well as for road vehicles
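The reasoning in this exchange amounts to expanding a search term with its narrower terms. A minimal sketch follows, assuming a small hypothetical `NT` table and a hierarchy that contains no cycles.

```python
# Hypothetical narrower-term (NT) links: descriptor -> narrower descriptors.
NT = {
    "vehicles": ["road vehicles"],
    "road vehicles": ["buses", "cars", "lorries"],
}

def expand_with_narrower(term):
    """Return term plus its narrower terms at every level, without duplicates.

    Assumes the hierarchy has no cycles.
    """
    found = [term]
    for narrower in NT.get(term, []):
        for t in expand_with_narrower(narrower):
            if t not in found:
                found.append(t)
    return found

print(expand_with_narrower("road vehicles"))
# ['road vehicles', 'buses', 'cars', 'lorries']
```

The expanded list would then be combined with OR when the search is run.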
15
Searching related terms
Please give me information about agriculture
OK, I’ll look for that. Would you also be interested in items dealing with forestry, livestock or pet breeding?
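Suggesting related terms can be sketched in the same way. The pairs in `RT_PAIRS` are taken from the exchange above and treated, as an assumption, as reciprocal RT links; the function names are illustrative.

```python
from collections import defaultdict

# Hypothetical related-term (RT) pairs; RT links are recorded in both directions.
RT_PAIRS = [
    ("agriculture", "forestry"),
    ("agriculture", "livestock"),
    ("agriculture", "pet breeding"),
]

def build_related(pairs):
    """Build a symmetric lookup: each term maps to the set of its related terms."""
    related = defaultdict(set)
    for a, b in pairs:
        related[a].add(b)
        related[b].add(a)
    return related

RELATED = build_related(RT_PAIRS)

def suggest_related(term):
    """Return the related terms to offer the searcher, in alphabetical order."""
    return sorted(RELATED.get(term, set()))

print(suggest_related("agriculture"))  # ['forestry', 'livestock', 'pet breeding']
print(suggest_related("forestry"))     # ['agriculture']
```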
16
Paradigmatic relationships in a thesaurus
• Many relationships are indicated as RT/RT, but their nature is not specified, so they cannot be used for systematic grouping (ontologies overcome this)
• Hierarchical generic-specific relationship (BT/NT) allows (requires) grouping of concepts into facets - the terms have to be in the same facet
17
What is a facet? (Sometimes called a fundamental facet)
A high-level grouping of concepts of the same inherent category, e.g. activities, disciplines, people, materials, places, times. For example:
animals, mice, daffodils and bacteria could all be members of a living organisms facet;
digging, writing and cooking could all be members of an activities facet;
birthdays, wars and football matches could all be members of an events facet.
A concept cannot belong to more than one facet
19
What is an array? (Sometimes called a subfacet)
A grouping of concepts within a facet by some stated characteristic of division.
vehicles
<vehicles by number of wheels>
. bicycles
. tricycles
. four-wheeled vehicles
. . automobiles
<vehicles by load carried>
. goods vehicles
. . lorries
. passenger vehicles
. . automobiles
. . buses
Node labels (in angle brackets) show characteristics of division; the concepts grouped under each node label form an array
A concept may occur in more than one array
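One way to picture arrays is as groups keyed by their node labels. The sketch below encodes the vehicles example; the structure `ARRAYS` and the function `arrays_containing` are illustrative assumptions, not a standard representation.

```python
# Arrays within the vehicles facet, keyed by node label.
# Each array maps a member concept to its narrower concepts.
ARRAYS = {
    "vehicles by number of wheels": {
        "bicycles": [],
        "tricycles": [],
        "four-wheeled vehicles": ["automobiles"],
    },
    "vehicles by load carried": {
        "goods vehicles": ["lorries"],
        "passenger vehicles": ["automobiles", "buses"],
    },
}

def arrays_containing(concept):
    """Return the node labels of every array under which concept appears."""
    labels = []
    for label, members in ARRAYS.items():
        for member, narrower in members.items():
            if concept == member or concept in narrower:
                labels.append(label)
                break
    return labels

print(arrays_containing("automobiles"))
# ['vehicles by number of wheels', 'vehicles by load carried']
print(arrays_containing("lorries"))
# ['vehicles by load carried']
```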
20
Parametric search
• Searching for resources that have one or more specified characteristics
• e.g. vehicles which
– have three wheels AND
– are used for carrying passengers
• This is an important and useful aspect of post-coordinate searching, but it is not faceted classification
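A parametric search of this kind can be sketched as a filter over resource attributes. The catalogue entries, the `Vehicle` fields and the function name below are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Vehicle:
    name: str
    wheels: int
    load: str  # "passengers" or "goods"

# A made-up catalogue of resources with their characteristics.
CATALOGUE = [
    Vehicle("auto rickshaw", 3, "passengers"),
    Vehicle("cargo tricycle", 3, "goods"),
    Vehicle("automobile", 4, "passengers"),
    Vehicle("lorry", 6, "goods"),
]

def parametric_search(items, **criteria):
    """Return names of items whose attributes match ALL the given criteria."""
    return [
        item.name
        for item in items
        if all(getattr(item, key) == value for key, value in criteria.items())
    ]

print(parametric_search(CATALOGUE, wheels=3, load="passengers"))
# ['auto rickshaw']
```

The search combines independent characteristics at query time; nothing in it depends on how the concepts are arranged in a classification.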
21
Ways of displaying concepts and their paradigmatic relationships
1. Alphabetically, with their relationships
35 mm cameras
  BT: film cameras
aqualungs
  BT: diving equipment
camera accessories
  BT: photographic equipment
  NT: flash guns
      light meters
      tripods
  RT: cameras
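An alphabetical display of this kind can be generated from stored relationships. The following sketch assumes a hypothetical `THESAURUS` dictionary holding the three entries shown; the layout rules are a guess at the slide's formatting, not a standard.

```python
# Hypothetical store: descriptor -> relationship tag -> list of target terms.
THESAURUS = {
    "camera accessories": {
        "BT": ["photographic equipment"],
        "NT": ["flash guns", "light meters", "tripods"],
        "RT": ["cameras"],
    },
    "35 mm cameras": {"BT": ["film cameras"]},
    "aqualungs": {"BT": ["diving equipment"]},
}

def alphabetical_display(thesaurus):
    """Return display lines: terms in alphabetical order, each followed by
    its BT, NT and RT entries; the tag is printed once per group."""
    lines = []
    for term in sorted(thesaurus, key=str.lower):
        lines.append(term)
        for rel in ("BT", "NT", "RT"):
            for i, target in enumerate(thesaurus[term].get(rel, [])):
                tag = rel + ":" if i == 0 else "   "
                lines.append(f"  {tag} {target}")
    return lines

print("\n".join(alphabetical_display(THESAURUS)))
```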
22
Ways of displaying concepts and their paradigmatic relationships
2. Hierarchically - one tree for each facet
(fields of work)
. diving
. photography
. physics
. . optics
(people)
<people by age>
. infants
. children
. adults
<people by occupation>
. divers
. models (people)
. photographers
. physicists
(equipment)
. diving equipment
. . aqualungs
. . diving suits
. . . dry suits
. . . wet suits
. . face masks
. photo equipment
. . cameras
23
Ways of displaying concepts and their paradigmatic relationships
3. In subject groups or categories (microthesauri)
– one tree for each facet in each category
797.23: DIVING
(fields of work)
. diving
. . scuba diving
. . snorkel diving
(people)
. divers
(equipment)
. diving equipment
. . aqualungs
. . diving suits
. . . dry suits
770: PHOTOGRAPHY
(fields of work)
. photography
. . colour photography
(people)
. models (people)
. photographers
(equipment)
. photo equipment
. . cameras
24
Combining concepts: syntagmatic relationships
(places)
A1 Italy
A2 The Netherlands
A3 Russia
(people)
B1 potters
B2 repairers
B3 ceramicists
(activities)
C1 moulding
C2 throwing
C3 decoration
(objects)
D1 earthenware
D2 porcelain
D3 stoneware
Combine to express compound subjects - either post-coordinate, for searching:
porcelain AND decoration AND Russia
or pre-coordinate, for browsing:
porcelain decoration in Russia: D2C3A3
Node labels (in parentheses) show facet names
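Both ways of combining concepts can be sketched with the notation above. The notation table is taken from this slide; the three indexed documents and the function names are invented for illustration.

```python
# Notation from the four facets above.
NOTATION = {
    "Italy": "A1", "The Netherlands": "A2", "Russia": "A3",
    "potters": "B1", "repairers": "B2", "ceramicists": "B3",
    "moulding": "C1", "throwing": "C2", "decoration": "C3",
    "earthenware": "D1", "porcelain": "D2", "stoneware": "D3",
}

def precoordinate(*concepts):
    """Build a compound class mark by joining notations in the order given."""
    return "".join(NOTATION[c] for c in concepts)

# Hypothetical index: document -> set of concepts assigned by the indexer.
INDEX = {
    "doc1": {"porcelain", "decoration", "Russia"},
    "doc2": {"porcelain", "moulding", "Italy"},
    "doc3": {"stoneware", "decoration", "Russia"},
}

def postcoordinate(*concepts):
    """Return documents indexed with ALL the given concepts (Boolean AND)."""
    wanted = set(concepts)
    return sorted(doc for doc, terms in INDEX.items() if wanted <= terms)

print(precoordinate("porcelain", "decoration", "Russia"))   # D2C3A3
print(postcoordinate("porcelain", "decoration", "Russia"))  # ['doc1']
```

In the pre-coordinate case the combination is fixed when the item is classified; in the post-coordinate case it is made at search time.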
25
Order of combining facets
thing - kind - part - property - material - process - operation - system operated on - product - by-product - agent - space - time - form
e.g. porcelain (thing) - decoration (process) - in Russia (space)
A facet may occur more than once in a string
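Applying a combination order can be sketched as a sort on facet position. The order list is the one given above; the function name and the input format (pairs of concept and facet) are assumptions.

```python
# The order of combining facets given above.
CITATION_ORDER = [
    "thing", "kind", "part", "property", "material", "process",
    "operation", "system operated on", "product", "by-product",
    "agent", "space", "time", "form",
]

def order_facets(tagged):
    """Sort (concept, facet) pairs into the combination order.

    sorted() is stable, so concepts from a repeated facet keep their input order.
    """
    return sorted(tagged, key=lambda pair: CITATION_ORDER.index(pair[1]))

pairs = [("Russia", "space"), ("decoration", "process"), ("porcelain", "thing")]
print(" - ".join(concept for concept, _ in order_facets(pairs)))
# porcelain - decoration - Russia
```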
26
Faceted classification with processes subordinated to objects
(processes)
A ceramic production processes in general
AA forming in general
AAA coiling
AAB moulding
AAC throwing
AB decoration in general
ABA glazing
ABB transfer printing
(objects)
B ceramics in general
BB earthenware in general
(processes)
BB.AA forming of earthenware
BB.AAB moulding of earthenware
BB.AB decoration of earthenware
BB.ABA glazing of earthenware
BB.ABB transfer printing of earthenware
BC porcelain in general
(processes)
BC.AA forming of porcelain
BC.AAB moulding of porcelain
Words shown in blue may be omitted as they are implied by the hierarchical structure
27
Faceted classification: generation of subject strings
(objects)
B ceramics
BB earthenware
(processes)
BB.AA forming
BB.AAB moulding
BB.AB decoration
BB.ABA glazing
BB.ABB transfer printing
BC porcelain
(processes)
BC.AA forming
BC.AAB moulding
ceramics > earthenware > forming
ceramics > earthenware > forming > moulding
ceramics > earthenware > decoration
ceramics > earthenware > decoration > glazing
ceramics > earthenware > decoration > transfer printing
ceramics > porcelain
ceramics > porcelain > forming
ceramics > porcelain > forming > moulding
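The subject strings above can be generated mechanically from the schedule. The sketch below assumes, purely for convenience, that the hierarchy can be read off the notation by trimming characters from the right; a real scheme need not have such expressive notation, and the function name is illustrative.

```python
# Schedule from the slide: notation -> caption.
SCHEDULE = {
    "B": "ceramics",
    "BB": "earthenware",
    "BB.AA": "forming",
    "BB.AAB": "moulding",
    "BB.AB": "decoration",
    "BB.ABA": "glazing",
    "BB.ABB": "transfer printing",
    "BC": "porcelain",
    "BC.AA": "forming",
    "BC.AAB": "moulding",
}

def subject_string(notation):
    """Build 'broadest > ... > narrowest' by trimming the notation one
    character at a time and keeping every prefix found in the schedule."""
    chain = []
    code = notation
    while code:
        if code in SCHEDULE:
            chain.append(SCHEDULE[code])
        code = code[:-1]
    return " > ".join(reversed(chain))

print(subject_string("BB.AAB"))  # ceramics > earthenware > forming > moulding
print(subject_string("BC"))      # ceramics > porcelain
```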
28
Alphabetical index
ceramic production processes A
ceramics B
coiling : forming : ceramic production AAA
decoration : ceramic production AB
decoration : earthenware : ceramics BB.AB
earthenware : ceramics BB
forming : ceramic production AA
forming : earthenware : ceramics BB.AA
forming : porcelain : ceramics BC.AA
glazing : decoration : ceramic production ABA
glazing : decoration : earthenware : ceramics BB.ABA
moulding : earthenware : ceramics BB.AAB
moulding : forming : ceramic production AAB
moulding : porcelain : ceramics BC.AAB
porcelain : ceramics BC
throwing : forming : ceramic production AAC
transfer printing : decoration : ceramic production ABB
transfer printing : decoration : earthenware : ceramics BB.ABB
29
The same concepts viewed in different ways
Thesaurus view
• Good for searching if you know what you want
• Like a gazetteer
• Like a book’s index
• Gets quickly to individual concepts
• Usually arranged by facet
• Shows paradigmatic relationships
• Lets you combine concepts when searching
Classification view
• Good for browsing or surveying a topic
• Like a map
• Like a book’s contents page
• Shows related concepts together
• Usually arranged by discipline
• Shows syntagmatic and paradigmatic relationships
• Shows compound topics as pre-combined subject strings
30
Some clarifications
• A classification can be both hierarchical and faceted
• A classification built on faceted principles can be enumerative
• A symbolic notation is not essential, and should not determine the structure
• A classification can arrange compound topics in a useful linear sequence - a thesaurus cannot
• One-to-one mapping between a thesaurus and a classification is not possible
• A “guide to popular topics” may be used to supplement a systematic classification
31
Use of a thesaurus
• A thesaurus as a search aid with unindexed material
– Allows searching on terms linked to the term asked for
• Software support for formulating questions
– Browsing the thesaurus to choose terms
– Combining terms with AND, OR, NOT and ( )
32
An ambiguous search interface
Does this mean: (lorries OR cars) AND diesel ?
or does it mean: lorries OR (cars AND diesel) ?
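The two readings retrieve different sets, which is why the ambiguity matters. A small sketch with invented document sets makes the difference concrete.

```python
# Hypothetical sets of documents indexed with each term.
lorries = {"doc1", "doc2"}
cars = {"doc3", "doc4"}
diesel = {"doc2", "doc3", "doc5"}

# Reading 1: (lorries OR cars) AND diesel
reading_1 = (lorries | cars) & diesel

# Reading 2: lorries OR (cars AND diesel)
reading_2 = lorries | (cars & diesel)

print(sorted(reading_1))  # ['doc2', 'doc3']
print(sorted(reading_2))  # ['doc1', 'doc2', 'doc3']
```

Under the second reading, doc1 is retrieved even though it has nothing to do with diesel, so an interface needs to make the grouping explicit.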
33
Thesaurus creation and management
• Standards
– BS/ISO standards give helpful guidance
– Draft revised BS standard now out for comments
• Software
– Many packages available
– Best if integrated with database used for cataloguing
• Cooperative thesaurus development and use
– DIY is a major and continuing task
34
Thesaurus development never ends
• It is an ongoing task
• It needs a knowledgeable thesaurus editor
• It needs cooperation and input from indexers and users
• User feedback
35
What we need
• Software for the combined development of thesaurus and classification
– Thesaurofacet; Classaurus; ROOT; Bliss; Taxomita
• Software support for combining facets when searching, using a thesaurus. Often referred to as faceted classification, but not the same thing
– Flamenco; View-based searching; No zero match (NZM)
• Software support for browsing in a classified catalogue with notation, captions and an alphabetical index
36
Links and further information
<http://www.willpowerinfo.co.uk/>