Top Banner
2007.02.08 - SLIDE 1 IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information In Collections Automatic Classification Some slides in this lecture were originally created by Prof. Marti Hearst
79

2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

Dec 21, 2015

Download

Documents

Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 1IS 240 – Spring 2007

Lecture 8: Clustering

University of California, BerkeleySchool of Information

IS 245: Organization of Information In Collections

Automatic Classification

Some slides in this lecture were originally created by Prof. Marti Hearst

Page 2: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 2IS 240 – Spring 2007

Overview

• Introduction to Automatic Classification and Clustering

• Classification of Classification Methods

• Classification Clusters and Information Retrieval in Cheshire II

• The 4W project revisited…

Page 3: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 3IS 240 – Spring 2007

Classification

• The grouping together of items (including documents or their representations) which are then treated as a unit. The groupings may be predefined or generated algorithmically. The process itself may be manual or automated.

• In document classification the items are grouped together because they are likely to be wanted together– For example, items about the same topic.

Page 4: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 4IS 240 – Spring 2007

Automatic Indexing and Classification

• Automatic indexing is typically the simple deriving of keywords from a document and providing access to all of those words.

• More complex Automatic Indexing Systems attempt to select controlled vocabulary terms based on terms in the document.

• Automatic classification attempts to automatically group similar documents using either:– A fully automatic clustering method.– An established classification scheme and set of

documents already indexed by that scheme.

Page 5: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 5IS 240 – Spring 2007

Background and Origins

• Early suggestion by Fairthorne – “The Mathematics of Classification”

• Early experiments by Maron (1961) and Borko and Bernick(1963)

• Work in Numerical Taxonomy and its application to Information retrieval Jardine, Sibson, van Rijsbergen, Salton (1970’s).

• Early IR clustering work more concerned with efficiency issues than semantic issues.

Page 6: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 6IS 240 – Spring 2007

Cluster Hypothesis

• The basic notion behind the use of classification and clustering methods:

• “Closely associated documents tend to be relevant to the same requests.”– C.J. van Rijsbergen

Page 7: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 7IS 240 – Spring 2007

Classification of Classification Methods

• Class Structure– Intellectually Formulated

• Manual assignment (e.g. Library classification)• Automatic assignment (e.g. Cheshire Classification

Mapping)

– Automatically derived from collection of items• Hierarchic Clustering Methods (e.g. Single Link)• Agglomerative Clustering Methods (e.g. Dattola)• Hybrid Methods (e.g. Query Clustering)

Page 8: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 8IS 240 – Spring 2007

Classification of Classification Methods

• Relationship between properties and classes– monothetic– polythetic

• Relation between objects and classes– exclusive– overlapping

• Relation between classes and classes– ordered– unordered

Adapted from Sparck Jones

Page 9: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 9IS 240 – Spring 2007

Properties and Classes

• Monothetic– Class defined by a set of properties that are both

necessary and sufficient for membership in the class

• Polythetic– Class defined by a set of properties such that to be a

member of the class some individual must have some number (usually large) of those properties, and that a large number of individuals in the class possess some of those properties, and no individual possesses all of the properties.

Page 10: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 10IS 240 – Spring 2007

A B C D E F G H 1 + + +2 + + +3 + + +4 + + +5 + + +6 + + +7 + + +8 + + +

Monothetic vs. Polythetic

Polythetic

Monothetic

Adapted from van Rijsbergen, ‘79

Page 11: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 11IS 240 – Spring 2007

Exclusive Vs. Overlapping

• Item can either belong exclusively to a single class

• Items can belong to many classes, sometimes with a “membership weight”

Page 12: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 12IS 240 – Spring 2007

Ordered Vs. Unordered

• Ordered classes have some sort of structure imposed on them– Hierarchies are typical of ordered classes

• Unordered classes have no imposed precedence or structure and each class is considered on the same “level”– Typical in agglomerative methods

Page 13: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 13IS 240 – Spring 2007

Clustering Methods

• Hierarchical

• Agglomerative

• Hybrid

• Automatic Class Assignment

Page 14: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 14IS 240 – Spring 2007

Coefficients of Association

• Simple

• Dice’s coefficient

• Jaccard’s coefficient

• Cosine coefficient

• Overlap coefficient

|||,min(|||

||||

||

||||

||||

||2

||

BABA

BA

BA

BABA

BA

BA

BA

Page 15: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 15IS 240 – Spring 2007

Hierarchical Methods

2 .43 .4 .24 .3 .3 .35 .1 .4 .4 .1 1 2 3 4

Single Link Dissimilarity Matrix

Hierarchical methods: Polythetic, Usually Exclusive, OrderedClusters are order-independent

||||

||1

BA

BAitydissimilar

Page 16: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 16IS 240 – Spring 2007

Threshold = .1

Single Link Dissimilarity Matrix

2 .43 .4 .24 .3 .3 .35 .1 .4 .4 .1 1 2 3 4

2 03 0 04 0 0 05 1 0 0 1 1 2 3 4

2

1

35

4

Page 17: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 17IS 240 – Spring 2007

Threshold = .2

2 .43 .4 .24 .3 .3 .35 .1 .4 .4 .1 1 2 3 4

2 03 0 14 0 0 05 1 0 0 1 1 2 3 4

2

1

35

4

Page 18: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 18IS 240 – Spring 2007

Threshold = .3

2 .43 .4 .24 .3 .3 .35 .1 .4 .4 .1 1 2 3 4

2 03 0 14 1 1 15 1 0 0 1 1 2 3 4

2

1

35

4

Page 19: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 19IS 240 – Spring 2007

Clustering

Agglomerative methods: Polythetic, Exclusive or Overlapping, Unorderedclusters are order-dependent.

DocDoc

DocDoc

DocDoc

DocDoc

1. Select initial centers (I.e. seed the space)2. Assign docs to highest matching centers and compute centroids3. Reassign all documents to centroid(s)

Rocchio’s method (similar to current K-means methods

Page 20: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 20IS 240 – Spring 2007

Automatic Class Assignment

DocDoc

DocDoc

DocDoc

Doc

SearchEngine

1. Create pseudo-documents representing intellectually derived classes.2. Search using document contents3. Obtain ranked list4. Assign document to N categories ranked over threshold. OR assign to top-ranked category

Automatic Class Assignment: Polythetic, Exclusive or Overlapping, usually orderedclusters are order-independent, usually based on an intellectually derived scheme

Page 21: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 21IS 240 – Spring 2007

Automatic Categorization in Cheshire II

• The Cheshire II system is intended to provide a bridge between the purely bibliographic realm of previous generations of online catalogs and the rapidly expanding realm of full-text and multimedia information resources. It is currently used in the UC Berkeley Digital Library Project and for a number of other sites and projects.

Page 22: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 22IS 240 – Spring 2007

Overview of Cheshire II

• It supports SGML as the primary database type.• It is a client/server application.• Uses the Z39.50 Information Retrieval Protocol.• Supports Boolean searching of all servers.• Supports probabilistic ranked retrieval in the

Cheshire search engine.• Supports ``nearest neighbor'' searches,

relevance feedback and Two-Stage Search.• GUI interface on X window displays (Tcl/Tk).• HTTP/CGI interface for the Web (Tcl scripting).

Page 23: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 23IS 240 – Spring 2007

Cheshire II - Cluster Generation• Define basis for clustering records.

– Select field to form the basis of the cluster.– Evidence Fields to use as contents of the pseudo-

documents.

• During indexing cluster keys are generated with basis and evidence from each record.

• Cluster keys are sorted and merged on basis and pseudo-documents created for each unique basis element containing all evidence fields.

• Pseudo-Documents (Class clusters) are indexed on combined evidence fields.

Page 24: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 24IS 240 – Spring 2007

Cheshire II - Two-Stage Retrieval• Using the LC Classification System

– Pseudo-Document created for each LC class containing terms derived from “content-rich” portions of documents in that class (subject headings, titles, etc.)

– Permits searching by any term in the class– Ranked Probabilistic retrieval techniques attempt to

present the “Best Matches” to a query first.– User selects classes to feed back for the “second

stage” search of documents.

• Can be used with any classified/Indexed collection.

Page 25: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 25IS 240 – Spring 2007

Probabilistic Retrieval: Logistic Regression

• Estimates for relevance based on log-linear model with various statistical measures of document content as independent variables.

nnkji vcvcvcctdR|qO ...),,(log 22110

)),|(log(1

1),|(

ji dqROjie

dqRP

m

kkjiji ROtdqROdqRO

1, )](log),|([log),|(log

Log odds of relevance is a linear function of attributes:

Term contributions summed:

Probability of Relevance is inverse of log odds:

Page 26: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 26IS 240 – Spring 2007

Probabilistic Retrieval: Logistic Regression

6

10),|(

iii XccDQRP

In Cheshire II probability of relevance is based onLogistic regression from a sample set of TREC documents to determine values of the coefficients.At retrieval the probability estimate is obtained by:

For 6 attributes or “clues” about term usage in the documents and the query

Page 27: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 27IS 240 – Spring 2007

Probabilistic Retrieval: Logistic Regression attributes

MX

n

nNIDF

IDFM

X

DLX

DAFM

X

QLX

QAFM

X

j

j

j

j

j

t

t

M

t

M

t

M

t

log

log1

log1

log1

6

15

4

13

2

11

Average Absolute Query Frequency

Query Length

Average Absolute Document Frequency

Document Length

Average Inverse Document Frequency

Inverse Document Frequency

Number of Terms in common between query and document (M) -- logged

Page 28: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 28IS 240 – Spring 2007

Cheshire II Demo

• Examples from the:– SciMentor(BioSearch) project

• Journal of Biological Chemistry and MEDLINE data

– CHESTER (EconLit)• Journal of Economic Literature subjects

– Unfamiliar Metadata & TIDES Projects• Basis for clusters is a normalized Library of Congress Class

Number• Evidence is provided by terms from record titles (and subject

headings for the “all languages”• Five different training sets (Russian, German, French,

Spanish, and All Languages• Testing cross-language retrieval and classification

– 4W Project Search

Page 29: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 29IS 240 – Spring 2007

References

• Christian Plaunt & Barbara Norgard, “An Association Based Method for Automatic Indexing with a Controlled Vocabulary”. JASIS 49(10), 1998.– Preprint available available on class web site

• Ray R. Larson, Jerome McDonough, Lucy Kuntz, Paul O’Leary & Ralph Moon, “Cheshire II: Designing a Next-Generation Online Catalog”. JASIS, 47(7) 555-567, 1996.

Page 30: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 30IS 240 – Spring 2007

Developing a Metadata Infrastructure for Information

Access:What, Where, When and

Who?Prof. Ray R. Larson

University of California, BerkeleySchool of Information

Page 31: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 31IS 240 – Spring 2007

Overview

• Metadata as Infrastructure– What, Where, When and Who?

• What are Entry Vocabulary Indexes?– Notion of an EVI– How are EVIs Built

• Time Period Directories– Mining Metadata for new metadata

Page 32: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 32IS 240 – Spring 2007

Metadata as Infrastructure

• The difference between memorization and understanding lies in knowing the context and relationships of whatever is of interest. When setting out to learn about a new topic, a well-tested practice is to follow the traditional “5Ws and the H”: Who?, What?, When?, Where?, Why?, and How?

Page 33: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 33IS 240 – Spring 2007

Metadata as Infrastructure

• The reference collections of paper-based libraries provide a structured environment for resources, with encyclopedias and subject catalogs, gazetteers, chronologies, and biographical dictionaries, offering direct support for at least What, Where, When, and Who.

• The digital environment does not yet provide an effective, and easily exploited, infrastructure comparable to the traditional reference library.

Page 34: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 34IS 240 – Spring 2007

What?

Searching texts by topic, e.g. Dewey, LCSH, any subject index, or category scheme applied to documents.

• Two kinds of mapping in every search:– Documents are assigned to topic categories, e.g.

Dewey– Queries have to map to topic categories, e.g. Dewey’s

Relativ Index from ordinary words/phrases to Decimal Classification numbers.

• Also mapping between topic systems, e.g. US Patent classification and International Patent Classification.

Page 35: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 35IS 240 – Spring 2007

Texts

‘What’ searches involve mapping to controlled vocabularies

Thesaurus/Ontology

Page 36: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 36IS 240 – Spring 2007

Start with a collection of documents.

Page 37: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 37IS 240 – Spring 2007

Classify and index with controlled

vocabulary

Or use a pre-indexed

collection.

Index

Page 38: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 38IS 240 – Spring 2007

Problem:Controlled

Vocabularies can be

difficult for people to

use.“pass mtr veh spark ign eng”

Index

Use: “Economic Policy”

In Library of Congress subj

For: “Wirtschaftspolitik”

Page 39: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 39IS 240 – Spring 2007

Solution:Entry Level Vocabulary

Indexes.Index

EVIpass mtr veh

spark ign eng”

= “Automobile”

Page 40: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 40IS 240 – Spring 2007

“What” and Entry Vocabulary Indexes

• EVIs are a means of mapping from user’s vocabulary to the controlled vocabulary of a collection of documents…

Page 41: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 41IS 240 – Spring 2007

Has an Entry Vocabulary

Module been built?

User selects a subject domain of

interest.

Download a set of training data.

Build associations between extracted terms & controlled

vocabularies.

Map user’s query to ranked list of

controlled vocabulary terms

Part of speech tagging

Use an existing EVI.

Extract terms (words and noun phrases) from

titles and abstracts.

User selects search terms from the ranked

list of terms returned by the EVI.

YES

Building an Entry Vocabulary Module (EVI)

Searching

For noun phrases

Internet DB indexed with a controlled

vocabulary.

Domains to select from: Engineering, Medicine, Biology, Social science, etc.

User has question but is unfamiliar with the domain

he wants to search.

NO

Building and Searching EVIs

Page 42: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 42IS 240 – Spring 2007

Technical Details

Download a set of

training data.

Build associations between extracted terms & controlled

vocabularies.

Part of speech tagging

Extract terms (words and noun

phrases) from titles and abstracts.

Building an Entry Vocabulary Module (EVI)

For noun phrases

Internet DB indexed with a

controlled vocabulary.

Page 43: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 43IS 240 – Spring 2007

Association Measure

C ¬Ct a b¬t c d

Where t is the occurrence of a term and C is the occurrence of a class in the training set

Page 44: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 44IS 240 – Spring 2007

Association Measure

• Maximum Likelihood ratio

W(C,t) = 2[logL(p1,a,a+b) + logL(p2,c,c+d) - logL(p,a,a+b) – logL(p,c,c+d)] where logL(p,n,k) = klog(p) + (n – k)log(1- p)

and p1= p2= p=

a a+b

c c+d

a+c a+b+c+d

Vis. Dunning

Page 45: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 45IS 240 – Spring 2007

Alternatively

• Because the “evidence” terms in EVIs can be considered a document, you can also use IR techniques and use the top-ranked classes for classification or query expansion

Page 46: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 46IS 240 – Spring 2007

FindPlutonium

In Arabic Chinese Greek Japanese Korean Russian Tamil

...),,2[logL(p t)W(c, 1 baaStatistical association

Digital library resources

Page 47: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 47IS 240 – Spring 2007

EVI example

EVI 1

Index term:“pass mtr veh spark ign eng”User

Query “Automobile

” EVI 2Index term:“automobiles”OR

“internal combustible engines”

Page 48: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 48IS 240 – Spring 2007

But why stop there?

Index

EVI

Page 49: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 49IS 240 – Spring 2007

“Which EVI do I use?”

Index

EVI

Index

Index EVI

IndexEVI

Page 50: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 50IS 240 – Spring 2007

EVI to EVIs

Index

EVI

Index

Index EVI

IndexEVI

EVI2

Page 51: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 51IS 240 – Spring 2007

FindPlutonium

In Arabic Chinese Greek Japanese Korean Russian Tamil

Why not treat language the same way?

Page 52: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 52IS 240 – Spring 2007

Texts

Numericdatasets

It is also difficult to move between different media forms

Thesaurus/Ontology

EVI

Page 53: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 53IS 240 – Spring 2007

Searching across data types

• Different media can be linked indirectly via metadata, but often (e.g. for socio-economic numeric data series) you also need to specify WHERE to get correct results

Page 54: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 54IS 240 – Spring 2007

Texts

Numericdatasets

But texts associated with numeric data can be mapped as well…

Thesaurus/Ontology

captions

EVI

EVI

Page 55: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 55IS 240 – Spring 2007

EVI to Numeric Data example

EVI LCSH

marcnew query

search resultscaptions

numeric table

numeric database

online catalog

search interface 1

search interface 2

1

8 7 6

5

432

11

10 9

Page 56: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 56IS 240 – Spring 2007

Texts

Numericdatasets

But there are also geographic dependencies…

Thesaurus/Ontology

captionsMaps/Geo Data

EVI

EVI

Page 57: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 57IS 240 – Spring 2007

WHERE: Place names are problematic…

• Variant forms: St. Petersburg, Санкт Петербург, Saint-Pétersbourg, . . .

• Multiple names: Cluj, in Romania / Roumania / Rumania, is also called Klausenburg and Kolozsvar.

• Names changes: Bombay Mumbai.• Homographs:Vienna, VA, and Vienna, Austria;

– 50 Springfields.• Anachronisms: No Germany before 1870• Vague, e.g. Midwest, Silicon Valley• Unstable boundaries: 19th century Poland;

Balkans; USSR• Use a gazetteer!

Page 58: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 58IS 240 – Spring 2007

WHERE. Geo-temporal search interface. Place names found in documents. Gazetteer provided lat. & long. Places displayed on map.

Timebar

Page 59: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 59IS 240 – Spring 2007

Zoom on map. Click on place for a list of records. Click on record to display text.

Page 60: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 60IS 240 – Spring 2007

Catalogs and gazetteers should talk to each other!

Geographic sort / display of catalog search result.

Catalog search

Gazetteer search

Page 61: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 61IS 240 – Spring 2007

Texts

Numericdatasets

So geographic search becomes part of the infrastructure

Thesaurus/Ontology

Gazetteers captionsMaps/Geo Data

EVI

Page 62: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 62IS 240 – Spring 2007

WHEN: Search by time is also weakly supported…

• Calendars are the standard for time• But people use the names of events to refer to

time periods• Named time periods resemble place names in

being:– Unstable: European War, Great War, First World War– Multiple: Second World War, Great Patriotic War– Ambiguous: “Civil war” in different centuries in

England, USA, Spain, etc.

• Places have temporal aspects & periods have geographical aspects: When the Stone Age was, varies by region

Page 63: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 63IS 240 – Spring 2007

• Suggests a similar solution: A gazetteer-like Time Period Directory.

• Gazetteer:– Place name – Type – Spatial markers (Lat & long) -- When

• Time Period Directory: – Period name – Type – Time markers (Calendar) – Where

• Note the symmetry in the connections between Where and When.

Similarity between place names and period names

Page 64: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 64IS 240 – Spring 2007

Solution - Time Period Directories

• Initial development involved mining the Library of Congress Subject Authority file for named time periods…

Page 65: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 65IS 240 – Spring 2007

LC MARC Authorities Records

<USMARC><Fld001>sh 00000613 </Fld001><Fld151><a>Magdeburg

(Germany)</a><x>History</x><y>Siege, 1550-1551</y></Fld151>

<Fld550><w>g</w><a>Sieges</a><z>Germany</z></Fld550><Fld670><a>Work cat.: 45053442: Besselmeier, S. Warhafftige

history vnd beschreibung des Magdeburgischen Kriegs, 1552.</a></Fld670>

<Fld670><a>Cath. encyc.</a><b>(Magdeburg: besieged (1550-51) by the Margrave Maurice of Saxony)</b></Fld670>

<Fld670><a>Ox. encyc. reformation</a><b>(Magdeburg: ... during the 1550-1551 siege of Magdeburg ...)</b></Fld670>

</USMARC>

Page 66: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 66IS 240 – Spring 2007

timePeriodEntry Time Period Directory InstanceContains components described below

- periodID Unique identifier

- periodName Period name, can be repeated for alternative namesInformation about language, script, transliteration schemeSource information and notes (where was the period name mentioned)

- descriptiveNotes Description of time period

- dates Calendar and date formatBegin & end date (exact, earliest, latest, most-likely, advocated-by-source, ongoing)Notes, sources

- periodClassification Period type, e.g. Period of Conflict, Art movementCan plug in different classification schemesCan be repeated for several classifications

- location Associated places with time periodContains both place name and entry to a gazetteer providing more specific place information like latitude / longitude coordinatesCan plug in different location indicators (e.g. ADL gazetteer, Getty Thesaurus of Geographic names)Recently added coordinates for direct use

- relatedPeriod Related time periodsperiodID of related periodsInformation about relationship type (part-of, successor etc.)Can plug in different relationship type schemes

- entryMetadata Notes about creator / creation of instanceEntry dateModification date

Page 67: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 67IS 240 – Spring 2007

Page 68: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 68IS 240 – Spring 2007

Time periods by named location

Page 69: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 69IS 240 – Spring 2007

Catalog Search Result

Page 70: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 70IS 240 – Spring 2007

Web Interface - Access by map

Page 71: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 71IS 240 – Spring 2007

Zoomable interface gives access to geographically focused info…

Page 72: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 72IS 240 – Spring 2007

Link initiates search of theLibrary of Congress catalogfor all records relating to thistime period.

Web Interface - Access by timeline

Page 73: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 73IS 240 – Spring 2007

WHEN and WHAT

• These named time periods are derived from Library of Congress catalog subject headings and so can be used for catalog searching which finds books on topics important for that time period

Page 74: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 74IS 240 – Spring 2007

Time period directories link via the place (or time)

Texts

Numericdatasets

Thesaurus/Ontology

Gazetteers captionsMaps/Geo Data

EVI

Time Period Directory Time lines, Chronologies

Page 75: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 75IS 240 – Spring 2007

WHEN, WHERE and WHO

• Catalog records found from a time period search commonly include names of persons important at that time. Their names can be forwarded to, e.g., biographies in the Wikipedia encyclopedia.

Page 76: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 76IS 240 – Spring 2007

Place and time are broadly important across numerous tools and genres including, e.g. Language atlases, Library catalogs,Biographical dictionaries, Bibliographies, Archival finding aids, Museum records, etc., etc.

Biographical dictionaries are heavy on place and time: Emanuel Goldberg, Born Moscow 1881. PhD under Wilhelm Ostwald, Univ. of Leipzig, 1906. Director, Zeiss Ikon, Dresden, 1926-33. Moved to Palestine 1937. Died Tel Aviv, 1970.

Life as a series of episodes involving Activity (WHAT), WHERE, WHEN, and WHO else.

Page 77: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 77IS 240 – Spring 2007

A new form of biographical dictionary would link to all

Texts

Numericdatasets

Thesaurus/Ontology

Gazetteers captionsMaps/Geo Data

EVI

Time Period Directory Time lines, Chronologies

Biographical Dictionary

Page 78: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 78IS 240 – Spring 2007

A Metadata Infrastructure

CATALOGS

AchivesHistorical Societies

LibrariesMuseums

Public TelevisionPublishersBooksellers

AudioImages

Numeric DataObjectsTexts

Virtual RealityWebpages

RESOURCES

INTERMEDIA INFRASTRUCTURE

Text and ImagesBiographical DictionaryWHO

TimelinesTime Period DirectoryWHEN

MapsGazetteerWHERE

Syndetic StructureThesaurusWHAT

Special Display ToolsAuthority ControlFacet

Learners

Dossiers

Page 79: 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

2007.02.08 - SLIDE 79IS 240 – Spring 2007

Acknowledgements

• Electronic Cultural Atlas Initiative project• This work was partially supported by the

Institute of Museum and Library Services through a National Leadership Grant for Libraries, award number LG-02-04-0041-04, Oct 2004 - Sept 2006 entitled “Supporting the Learner: What, Where, When and Who” – See: http://ecai.org/imls2004

• Michael Buckland, Fred Gey, Vivien Petras, Matt Meiske, Kim Carl, Anya Kartavenko, Minakshi Mukherjee

• Contact: [email protected]