Subject Analysis and Vocabulary Control Spring 2006, 6 March Bharat Mehra IS 520 (Organization and Representation of Information) School of Information Sciences University of Tennessee

Jan 14, 2016

Augustine Glenn
Page 1

Subject Analysis and Vocabulary Control

Spring 2006, 6 March

Bharat Mehra

IS 520 (Organization and Representation of Information)

School of Information Sciences

University of Tennessee

Page 2

Subject and its Representation

Subject reveals what a work is about: the content of the work

Representing subjects of an information object in the most precise and concise linguistic format is necessary for computerized searching: word, phrase, sentence, etc.

Page 3

Questions

Why can’t a computer do a good job in identifying the “aboutness” of a work?

How can you identify “aboutness” for nontextual materials?

Page 4

Subject Analysis

Is the part of creating metadata that deals with:

• the conceptual analysis of an information object to determine what it is about, and

• translating the “aboutness” of an info object into controlled vocabulary terms for subject headings and classification notations

Page 5

Purpose of Subject Analysis

Provides meaningful subject access via the retrieval tool

Provides collocation of objects of a like nature (Cutter)

Provides a logical location for similar objects

Saves user time

Page 6

Conceptual Analysis

What is it? Philosophy, history

What is it for? For a farmer…

What is it about?

D. W. Langridge, 1989

Page 7

Methods in Conceptual Analysis

Purposive method: Figure out author’s purpose (statement of purpose)

Figure-ground method (what are the problems in this method?)

Objective method: Counting of references (what are the problems in this method?)

Appealing to unity, or to rules of selection and rejection: what has been said (selected) and what has not been said (rejected)

P. Wilson, 1968

Page 8

Identification of Concepts

Topics

Names (persons, corporate bodies, geographic areas, other named entities)

Time periods

Form

Page 9

Subject Access Process

Textual and non-textual info objects

What will be helpful for identifying the “aboutness” of the info object?

What did the user queries of the NLM’s Prints and Photographs Collection reveal?

Page 10

Dewey Decimal Classification

Main classes => divisions => sections

The system is made up of ten categories:
000 Computers, information and general reference
100 Philosophy and psychology
200 Religion
300 Social sciences
400 Language
500 Science and mathematics
600 Technology
700 Arts and recreation
800 Literature
900 History and geography

Examples:
330 for economy + 94 for Europe = 330.94, European economy
973 for United States + 005 form division for periodicals = 973.005, periodicals concerning the United States generally
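The main class, division and section of a class number can be read off its first three digits. The following is only an illustrative Python sketch, not part of the original slides: the function name `ddc_levels` is hypothetical, and the labels are the ten main classes listed above.

```python
# Ten DDC main classes, as listed on the slide.
MAIN_CLASSES = {
    "000": "Computers, information and general reference",
    "100": "Philosophy and psychology",
    "200": "Religion",
    "300": "Social sciences",
    "400": "Language",
    "500": "Science and mathematics",
    "600": "Technology",
    "700": "Arts and recreation",
    "800": "Literature",
    "900": "History and geography",
}


def ddc_levels(classmark: str) -> dict:
    """Split a DDC class number into main class, division and section.

    Hypothetical helper for illustration: it only inspects the three
    digits before the decimal point and ignores any extension after it.
    """
    base = classmark.split(".")[0]
    if len(base) != 3 or not base.isdigit():
        raise ValueError(f"expected three digits before the decimal point: {classmark!r}")
    main = base[0] + "00"
    return {
        "main_class": main,
        "main_class_label": MAIN_CLASSES[main],
        "division": base[:2] + "0",
        "section": base,
    }


if __name__ == "__main__":
    print(ddc_levels("330.94"))   # the European economy example
    print(ddc_levels("973.005"))  # the United States periodicals example
```

For 330.94 the sketch reports main class 300 (Social sciences), division 330 and section 330; everything after the decimal point is the extension for place, form, and so on.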

Page 11

Dewey Decimal Classification

• From the divine to the mundane (except 000)

• Choosing decimals for its categories allows a purely numerical and infinitely hierarchical scheme

• Faceted classification: combines elements from different parts of the structure to construct a number representing the subject content

• Except for general works and fiction, works are classified principally by subject, with extensions for subject relationships, place, time or type of material, producing classification numbers of not less than three digits but otherwise of indeterminate length, with a decimal point before the fourth digit, where present

• Classmarks are to be read as numbers, in the order: 050, 220, 330.973, 331 etc.
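Reading classmarks "as numbers" means ordering them by decimal value. A minimal Python sketch of that ordering, offered only as an illustration (the function name `shelf_order` is hypothetical):

```python
from decimal import Decimal


def shelf_order(classmarks):
    """Return the classmarks sorted by their decimal value.

    The original strings are kept, so leading zeros such as "050" survive.
    """
    return sorted(classmarks, key=Decimal)


if __name__ == "__main__":
    print(shelf_order(["331", "330.973", "050", "220"]))
```

With the classmarks from the slide, 330.973 files before 331 because it is the smaller number, even though it has more digits.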

Page 12

Subject Access--The Problems

• diverse expressions
• linguistic phenomena
• cultural diversity
• human cognitive factors
• individual differences
• differences in methods, lack of consistency
• exhaustivity: summarization and depth indexing

Page 13

Subject Access--Some Solutions

1. Vocabulary control in indexing

2. Classification systems arranging concepts in hierarchical structure

3. Citations: citing and being cited

4. Hyperlinks

Page 14

Why is controlled vocabulary needed?

Page 15

What can Vocabulary Control Do?

to promote the consistent representation of subject matter by indexer/cataloger and searchers;

to guide users on subject access by clarifying linguistic ambiguity and linking terms with related meanings;

to increase precision as well as recall.

Page 16

Recall and Precision

Basic measures used in evaluating search strategies

Assumptions:

• There is a set of records in the DB which is relevant to the search topic

• Records are assumed to be either relevant or irrelevant (these measures do not allow for degrees of relevancy)

• The actual retrieval set may not perfectly match the set of relevant records

Page 17

Recall and Precision

RECALL is the ratio of the number of relevant records retrieved to the total number of relevant records in the database. It is usually expressed as a percentage.

PRECISION is the ratio of the number of relevant records retrieved to the total number of irrelevant and relevant records retrieved. It is usually expressed as a percentage.
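The two ratios can be computed directly from the set of retrieved records and the set of relevant records. A minimal Python sketch under the binary-relevance assumption stated earlier; the function name and the record IDs are made up for illustration.

```python
def recall_and_precision(retrieved: set, relevant: set) -> tuple:
    """Return (recall, precision) as fractions between 0 and 1.

    recall    = relevant records retrieved / all relevant records in the DB
    precision = relevant records retrieved / all records retrieved
    """
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


if __name__ == "__main__":
    # Hypothetical search: 10 relevant records exist, 8 records are
    # retrieved, and 4 of the retrieved records are relevant.
    relevant = set(range(1, 11))
    retrieved = {1, 2, 3, 4, 11, 12, 13, 14}
    recall, precision = recall_and_precision(retrieved, relevant)
    print(f"recall = {recall:.0%}, precision = {precision:.0%}")
```

In this made-up case recall is 4/10 = 40% and precision is 4/8 = 50%.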

Page 18

PR Inverse Relationship

Why is there an inverse relationship?

Issue of language: if the search goal is comprehensive retrieval (high recall), the searcher must include synonyms, related terms, and broad or general terms for each concept

Precision suffers as a result:

• The searcher may decide to combine terms using Boolean rather than proximity operators, and secondary concepts may get omitted

• Because synonyms may not be exact synonyms, the probability of retrieving irrelevant material increases

• Broader terms may result in the retrieval of material which does not discuss the narrower search topic

• Using Boolean operators rather than proximity operators may increase the probability that the terms won't be in context

Recall suffers in the opposite case: narrowing the search to raise precision risks missing relevant records

Page 19

Other Problems with P and R

Records must be considered either relevant or irrelevant (what about records that are marginally relevant, somewhat irrelevant, very relevant, completely irrelevant?)

Individual perception: what is relevant to one person may not be relevant to another

Measuring recall: difficult to know how many relevant records exist in DB

Measures for estimating recall

Usefulness of P and R

Page 20

Challenges in Vocabulary Control

• Specific vs. general
• Synonymous concepts
• Word form for one-word terms (e.g., online)
• Sequence and form for multiword terms and phrases; inverted order
• Abbreviations and acronyms
• Popular vs. technical names

Page 21

What is a Controlled Vocabulary?

A limited set of terms for indexing (subject cataloging) and for searching

• authorized terms (representing concepts)
• scope notes
• related concepts
• lead-in terms (non-preferred synonym terms, not for indexing or searching; pointers to authorized ones)

Page 22

Types of Control

Terminology
• Synonyms (more than one term for one concept)
• Homographs (more than one meaning): qualifiers or preferred synonym term
• Homophones

Conceptual relationships
• Hierarchical (narrower, broader)
• Associative (related)

Cross References

Page 23

Structure of Controlled Vocabulary

Term-A
    scope note: explains use of the term
    UF lead-in term-B (“used for”)
    BT term(s)
    SA term(s) (“see also”)
    NT term(s)
    -- subdivision

Lead-in term-B
    USE Term-A

Cross reference: the UF entry under Term-A and the USE entry under lead-in term-B point to each other
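The entry structure above can be modelled as a small data structure in which every UF lead-in term automatically gets its reciprocal USE pointer. This is a hedged Python sketch, not the format of any particular thesaurus tool; the class and method names (`Term`, `Vocabulary`, `resolve`) are hypothetical, and the example uses the slide's own placeholder terms.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One authorized term and its references."""
    name: str
    scope_note: str = ""
    used_for: list = field(default_factory=list)  # UF: lead-in terms
    broader: list = field(default_factory=list)   # BT
    narrower: list = field(default_factory=list)  # NT
    see_also: list = field(default_factory=list)  # SA


class Vocabulary:
    def __init__(self):
        self.terms = {}  # authorized term name -> Term
        self.use = {}    # lead-in term -> authorized term name (USE)

    def add(self, term: Term) -> None:
        """Register a term and create a USE pointer for each UF entry."""
        self.terms[term.name] = term
        for lead_in in term.used_for:
            self.use[lead_in] = term.name

    def resolve(self, entry: str) -> str:
        """Return the authorized term for an authorized or lead-in entry.

        Raises KeyError if the entry is not in the vocabulary at all.
        """
        if entry in self.terms:
            return entry
        return self.use[entry]


if __name__ == "__main__":
    vocab = Vocabulary()
    vocab.add(Term("Term-A", scope_note="explains use of the term",
                   used_for=["Lead-in term-B"]))
    print(vocab.resolve("Lead-in term-B"))
```

Resolving the lead-in term returns Term-A, which is the behavior the USE reference describes: the searcher is pointed to the authorized heading.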

Page 24

Examples

Subject heading lists: developed in the library community; favor pre-coordination, suited to the card cataloging environment

Thesauri: developed as part of IR systems; favor post-coordination, with some pre-coordination

Page 25

Pre-Coordination

The combination of concepts at the time of cataloging or indexing, e.g.:
Library -- automation -- United States

The above example is one heading in a structured format: Topic -- subtopic -- geography

(LCSH is a highly pre-coordinated controlled vocabulary)

Indexer constructs subject strings with main terms followed by subdivisions

Page 26

Post-Coordination

The combination of concepts at the time of searching for a compound concept, e.g.:
library
automation
United States

The above example indicates three descriptors assigned to a work; no structure exists between them

Examples: ERIC

Page 27

Pre-Coordinated SH

Document Number: 195
Title: France importing crops from US and exporting wine to US
• SH: crop--export--US
• SH: crop--import--France
• SH: wine--export--France
• SH: wine--import--US

Document Number: 44
Title: US importing wine from France
• SH: wine--export--France
• SH: wine--import--US

Page 28

Pre-Coordinated Indexes

crop--export--US: 195
crop--import--France: 195
wine--export--France: 44, 195
wine--import--US: 44, 195

These facet headings are clear about the direction of the trade between two countries.

What happens if the concepts are not combined in the headings?

Page 29

Post-Coordinated Indexes

crop: 195
export: 44, 195
import: 44, 195
France: 44, 195
US: 44, 195
wine: 44, 195

Let’s do a Boolean search: crop AND import AND US

Results: Document 195 -- irrelevant (in Document 195 the US exports crops; it does not import them)
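The false hit can be reproduced with the two indexes from the slides. This is a minimal Python sketch for illustration only; the function names are hypothetical, and the index entries are copied from the pre-coordinated and post-coordinated slides.

```python
# Post-coordinated index: single descriptors -> document numbers.
POST_COORDINATED = {
    "crop": {195},
    "export": {44, 195},
    "import": {44, 195},
    "France": {44, 195},
    "US": {44, 195},
    "wine": {44, 195},
}

# Pre-coordinated index: whole heading strings -> document numbers.
PRE_COORDINATED = {
    "crop--export--US": {195},
    "crop--import--France": {195},
    "wine--export--France": {44, 195},
    "wine--import--US": {44, 195},
}


def boolean_and(index: dict, *terms: str) -> set:
    """Return the documents posted under every one of the given terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def heading_lookup(index: dict, heading: str) -> set:
    """Return the documents filed under one exact pre-coordinated heading."""
    return index.get(heading, set())


if __name__ == "__main__":
    # Post-coordination: the three descriptors co-occur in Document 195,
    # although that document is about the US exporting crops.
    print(boolean_and(POST_COORDINATED, "crop", "import", "US"))
    # Pre-coordination: no heading says the US imports crops, so nothing
    # is retrieved.
    print(heading_lookup(PRE_COORDINATED, "crop--import--US"))
```

The Boolean AND returns Document 195, while the heading lookup returns nothing: combining the concepts in the heading preserves the direction of trade, and separate descriptors lose it.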

Page 30

Subject Cataloging--Process

1. Conceptual analysis of a document to identify what the document is about

The methods:
• purpose of the author (indicative statements)
• figure-ground
• objective analysis (statistics)

Page 31

Subject Cataloging--Process (cont’d)

2. Translation of the conceptual analysis into a particular vocabulary

The methods:
• look up subject headings
• weighted headings
• assign headings

Page 32

Various Subjects in MARC

600, 610, 650, 651

MARC tags:
600 vs. 100 vs. 700
610 vs. 110 vs. 710

1XX fields (main entries)
4XX fields (series statements)
6XX fields (subject headings)
7XX fields (added entries other than subject or series)
8XX fields (series added entries)

X00 Personal names
X10 Corporate names
X11 Meeting names
X30 Uniform titles
X40 Bibliographic titles
X50 Topical terms
X51 Geographic names

For example, 610: subject heading that is a corporate name
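The tag pattern combines a first digit (the field group) with the last two digits (the type of content). A hedged Python sketch of that mnemonic, using only the two tables on this slide; the function name `describe_tag` is hypothetical, and the pattern is a reading aid rather than a validator, since not every combination is a defined MARC field.

```python
# First digit of the tag -> field group (from the slide).
FIELD_GROUPS = {
    "1": "main entry",
    "4": "series statement",
    "6": "subject heading",
    "7": "added entry other than subject or series",
    "8": "series added entry",
}

# Last two digits of the tag -> type of content (from the slide).
CONTENT_TYPES = {
    "00": "personal name",
    "10": "corporate name",
    "11": "meeting name",
    "30": "uniform title",
    "40": "bibliographic title",
    "50": "topical term",
    "51": "geographic name",
}


def describe_tag(tag: str):
    """Describe a three-digit tag, or return None if the pattern does not apply."""
    if len(tag) != 3 or not tag.isdigit():
        raise ValueError(f"expected a three-digit tag: {tag!r}")
    group = FIELD_GROUPS.get(tag[0])
    content = CONTENT_TYPES.get(tag[1:])
    if group is None or content is None:
        return None
    return f"{group}: {content}"


if __name__ == "__main__":
    for tag in ("600", "100", "700", "610", "650", "651"):
        print(tag, "->", describe_tag(tag))
```

This makes the 600 vs. 100 vs. 700 comparison explicit: the same personal name can appear as a subject heading, a main entry, or an added entry, depending on the first digit.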

Page 33

Subject Cataloging Quality

Consistency: works on the same subjects are given the same headings

Exhaustivity: whether the headings cover all aspects of the work -- number of headings

Specificity: whether the heading assigned is at the same hierarchical level as the concept

Page 34

Controlled Vocabularies

1. Subject heading lists: include phrases, precoordinated terms
• LCSH
• Sears List of SH
• MeSH

2. Thesauri: single and bound terms (e.g., Type A Personality) representing single concepts (descriptors); strictly hierarchical; narrower in scope; can be multilingual
• Art & Architecture Thesaurus (cultural heritage info)
• Thesaurus of ERIC Descriptors (educational resources)
• INSPEC Thesaurus (physics and engineering communities)

Page 35

Controlled Vocabularies

1. and 2. provide subject access to info objects by providing terminology that can be consistent (controlled vocabulary)

• Choose preferred terms and make references from non-used terms
• Provide hierarchies: BT, NT, RT

3. Ontologies (“systematic account of existence”): bring together all variant ways of expressing a concept and show relationships via BT, NT, RT; do not select preferred terms

Page 36

Solution to the “Subject Problem” for Images: Natural Language Analysis

Natural language that people use (linguistic constructs, grammar relationships, syntax, communication vocabulary) can be used for describing and searching in visual information retrieval systems

Content-based natural language processing is understood in terms of syntactic structure in the spoken natural language

Concept-based natural language processing attempts to capture the semantics of an image

Page 37

Critical Reflection 7: The GAME

User-Based Natural Language Analysis for Creation and Evaluation of Visual Information Retrieval Systems in Library and Museum Settings

Your response: On the Black Board space, respond to the questions provided on the handout

Page 38

Exercise 3: Authority Control

OBJECTIVES
• to observe name authority control
• to observe controlled vocabulary for subject access

Part I. Name authority

Go to the authority record database at the Library of Congress, http://authorities.loc.gov/. Search for the popular author, Samuel Clemens.

How many authorized headings are established for him? Attach the most complete MARC Authority record for each authorized heading.

For the MARC Authority format, explain the semantics (meanings) of the fields: 1xx, 4xx and 5xx. Make sure that you mention how authorized and unauthorized headings are cross-referenced.

For each authorized heading, how many bibliographic records are found in LC collection using the heading? If an authorized heading is not used, why so?

Can the user just click on the authorized heading to retrieve bibliographic records by the author?

Page 39

Exercise 3: Authority Control

Part II. Authorized subject headings

Go to the authority record database at the Library of Congress, http://authorities.loc.gov/.

Search for an authorized subject heading for each of the topics:
• Teapot Dome scandal
• Watergate scandal

What are the broader headings (BT)? What are the narrower headings (NT)? What are the related headings (RT)?

Construct an alphabetical subject headings list of the headings (BT, NT, RT, the heading itself) and their related headings including both authorized headings and lead-in terms. Under each heading cross-reference the related terms: Used-for, Use, BT, NT, RT.

Page 40

Exercise 3: Authority Control

WHAT TO TURN IN?

The authority records for Samuel Clemens in MARC format and your answers to all the questions.

The authority records for the two subject headings in MARC format and the subject headings list.

A brief discussion on the roles of authority control in IR.