CS 430: Information Discovery Lecture 15 Library Catalogs 3

Dec 24, 2015


Brittney Parker

CS 430: Information Discovery

Lecture 15

Library Catalogs 3


Course Administration

• Midterm examination results have been sent by email.

• Assignment 2 results will be mailed shortly.

• Assignment 3, due November 10, will be posted soon.


Automatic extraction of catalog data

Example: Dublin Core records for web pages

Strategies

• Manual by trained cataloguers - high quality records, but expensive and time consuming

• Entirely automatic - fast, almost zero cost, but poor quality

• Automatic followed by human editing - cost and quality depend on the amount of editing

• Manual collection level record, automatic item level record - moderate quality, moderate cost


DC-dot

DC-dot is a Dublin Core metadata editor for web pages, created by Andy Powell at UKOLN

http://www.ukoln.ac.uk/metadata/dcdot/

DC-dot has two parts:

(a) A skeleton Dublin Core record is created automatically from clues in the web page

(b) A user interface is provided for cataloguers to edit the record


Automatic record for CS 430 home page

DC-dot applied to http://www.cs.cornell.edu/courses/cs430/2001sp/

<link rel="schema.DC" href="http://purl.org/dc">

<meta name="DC.Title" content="CS 430: Information Discovery">

<meta name="DC.Subject" content="[email protected]; Course Structure; Readings and references; Slides; Basic Information; William Y. Arms; Information Retrieval Data Structures and Algorithms; [email protected]; Assignments; Syllabus; Text Book; Laptop computers; Assumed Background; Nomadic Computing Experiment; Notices; Course Description; Code of practice; Assignments and Grading; Last changed: February 6, 2001">

continued on next slide


Automatic record for CS 430 home page (continued)

DC-dot applied to http://www.cs.cornell.edu/courses/cs430/2001sp/

<meta name="DC.Publisher" content="Cornell University">

<meta name="DC.Date" scheme="W3CDTF" content="2001-02-07">

<meta name="DC.Type" scheme="DCMIType" content="Text">

<meta name="DC.Format" content="text/html">

<meta name="DC.Format" content="5781 bytes">

<meta name="DC.Identifier" content="http://www.cs.cornell.edu/courses/cs430/2001sp/">


Observations on DC-dot applied to CS430 home page

DC.Title is a copy of the html <title> field

DC.Publisher is the owner of the IP address where the page was stored

DC.Subject is a list of headings and noun phrases presented for editing

DC.Date is taken from the Last-Modified field in the http header

DC.Type and DC.Format are taken from the MIME type of the http response

DC.Identifier was supplied by the user as input
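The observations above can be sketched in a few lines of Python. This is only an illustration of the kind of clue-gathering described, not DC-dot's actual code: the names `PageClues` and `skeleton_record` are hypothetical, the publisher is passed in as a parameter rather than looked up from the IP address, the byte count is simply the length of the encoded page, and only headings (not noun phrases) are collected for DC.Subject.

```python
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser


class PageClues(HTMLParser):
    """Collect the <title> text and the heading texts of an HTML page."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self._current = None  # tag currently being captured, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in self.HEADINGS:
            self._current = tag
            self._buf = []

    def handle_data(self, data):
        if self._current is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = " ".join("".join(self._buf).split())
            if tag == "title":
                self.title = text
            elif text:
                self.headings.append(text)
            self._current = None


def skeleton_record(url, html, headers, publisher=None):
    """Build a draft Dublin Core record as a list of (name, content) pairs.

    `headers` is a dict of HTTP response headers; `publisher` is an
    assumption supplied by the caller (no IP address lookup is attempted).
    """
    clues = PageClues()
    clues.feed(html)
    mime = headers.get("Content-Type", "").split(";")[0].strip()
    date = ""
    if "Last-Modified" in headers:
        date = parsedate_to_datetime(headers["Last-Modified"]).date().isoformat()
    record = [
        ("DC.Title", clues.title),
        ("DC.Subject", "; ".join(clues.headings)),
        ("DC.Publisher", publisher or ""),
        ("DC.Date", date),
        ("DC.Type", "Text" if mime.startswith("text/") else ""),
        ("DC.Format", mime),
        ("DC.Format", f"{len(html.encode('utf-8'))} bytes"),
        ("DC.Identifier", url),
    ]
    # Drop fields for which no clue was found; a cataloguer would edit the rest.
    return [(name, content) for name, content in record if content]
```

The result is a draft meant for human editing, which is why empty fields are dropped rather than guessed.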


Automatic record for George W. Bush home page

DC-dot applied to http://www.georgewbush.com/

<link rel="schema.DC" href="http://purl.org/dc">

<meta name="DC.Subject" content="George W. Bush; Bush; George Bush; President; republican; 2000 election; election; presidential election; George; B2K; Bush for President; Junior; Texas; Governor; taxes; technology; education; agriculture; health care; environment; society; social security; medicare; income tax; foreign policy; defense; government">

<meta name="DC.Description" content="George W. Bush is running for President of the United States to keep the country prosperous.">

continued on next slide


Automatic record for George W. Bush home page (continued)

DC-dot applied to http://www.georgewbush.com/

<meta name="DC.Publisher" content="Concentric Network Corporation">

<meta name="DC.Date" scheme="W3CDTF" content="2001-01-12">

<meta name="DC.Type" scheme="DCMIType" content="Text">

<meta name="DC.Format" content="text/html">

<meta name="DC.Format" content="12223 bytes">

<meta name="DC.Identifier" content="http://www.georgewbush.com/">


Observations on DC-dot applied to George W. Bush home page

The home page has several meta tags:

<META NAME="TITLE" CONTENT="George W. Bush for President"> [The page has no html <title>]

<META NAME="CONTACT" CONTENT="George W Bush Campaign, P. O. Box 1902, Austin, TX 78767, Phone: (512) 637-2000">

<META NAME="DESCRIPTION" CONTENT="George W. Bush is running for President of the United States to keep the country prosperous.">

<META NAME="KEYWORDS" CONTENT="George W. Bush, Bush, George Bush, President, republican, 2000 election and more
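Comparing these tags with the automatic record suggests that the KEYWORDS list became DC.Subject (with commas replaced by semicolons) and DESCRIPTION became DC.Description, while no DC.Title appears in the record. A minimal Python sketch of such a mapping follows; it is an assumption about the behaviour, not DC-dot's code, and `MetaTags` and `meta_to_dc` are hypothetical names.

```python
from html.parser import HTMLParser


class MetaTags(HTMLParser):
    """Collect <meta name=... content=...> pairs, keyed by upper-cased name."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content:
                self.meta[name.upper()] = content


def meta_to_dc(html):
    """Map KEYWORDS and DESCRIPTION meta tags to Dublin Core fields.

    The TITLE meta tag is deliberately left unmapped, mirroring the record
    shown on the slides, which contains no DC.Title.
    """
    parser = MetaTags()
    parser.feed(html)
    record = {}
    if "KEYWORDS" in parser.meta:
        terms = [term.strip() for term in parser.meta["KEYWORDS"].split(",")]
        record["DC.Subject"] = "; ".join(term for term in terms if term)
    if "DESCRIPTION" in parser.meta:
        record["DC.Description"] = parser.meta["DESCRIPTION"]
    return record
```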


Collection-level metadata

Several of the most difficult fields to extract automatically are the same across all pages in a web site.

Therefore create a collection record manually and combine it with automatic extraction of other fields at item level.

For the CS 430 home page, collection-level metadata:

<meta name="DC.Publisher" content="Cornell University">

<meta name="DC.Creator" content="William Y. Arms">

<meta name="DC.Rights" content="William Y. Arms, 2001">

See: Jenkins and Inman
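One way to combine the two levels is sketched below in Python. The merge rule is an assumption chosen for illustration: a collection-level entry replaces any automatically extracted entry with the same field and qualifier, and everything else is kept. The function name `combine` and the tuple layout are hypothetical.

```python
def combine(item_record, collection_record):
    """Merge an automatically extracted item record with a collection record.

    Both records are lists of (field, qualifier, content) tuples. The result
    is a list of (field, qualifier, content, from_collection) tuples.
    """
    collection_keys = {(field, qualifier) for field, qualifier, _ in collection_record}
    combined = [
        (field, qualifier, content, False)
        for field, qualifier, content in item_record
        if (field, qualifier) not in collection_keys  # collection level wins
    ]
    combined += [
        (field, qualifier, content, True)
        for field, qualifier, content in collection_record
    ]
    return combined


# Item-level fields as extracted automatically for the CS 430 home page.
item = [
    ("title", "", "CS 430: Information Discovery"),
    ("publisher", "", "Cornell University"),
    ("format", "", "text/html"),
    ("format", "", "5781 bytes"),
]

# Collection-level fields created once, by hand, for the whole site.
collection = [
    ("publisher", "", "Cornell University"),
    ("creator", "", "William Y. Arms"),
    ("rights", "", "William Y. Arms, 2001"),
]

record = combine(item, collection)
```

Repeated item-level fields such as the two format entries survive the merge, because only entries that clash with the collection record are dropped.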


Metadata extracted automatically by DC-dot

D.C. Field   Qualifier    Content
title                     Digital Libraries and the Problem of Purpose
subject                   not included in this slide
publisher                 Corporation for National Research Initiatives
date         W3CDTF       2000-05-11
type         DCMIType     Text
format                    text/html
format                    27718 bytes
identifier                http://www.dlib.org/dlib/january00/01levy.html


Collection-level record

D.C. Field   Qualifier    Content
publisher                 Corporation for National Research Initiatives
type                      article
type         resource     work
relation     rel-type     InSerial
relation     serial-name  D-Lib Magazine
relation     issn         1082-9873
language                  English
rights                    Permission is hereby given for the material in D-Lib Magazine to be used for ...


Combined item-level record (DC-dot plus collection-level)

D.C. Field   Qualifier        Content
title                         Digital Libraries and the Problem of Purpose
publisher                 (*) Corporation for National Research Initiatives
date         W3CDTF           2000-05-11
type                      (*) article
type         resource     (*) work
type         DCMIType         Text
format                        text/html
format                        27718 bytes

(*) indicates collection-level metadata

continued on next slide


Combined item-level record (DC-dot plus collection-level) (continued)

D.C. Field   Qualifier        Content
relation     rel-type     (*) InSerial
relation     serial-name  (*) D-Lib Magazine
relation     issn         (*) 1082-9873
language                  (*) English
rights                    (*) Permission is hereby given for the material in D-Lib Magazine to be used for ...
identifier                    http://www.dlib.org/dlib/january00/01levy.html

(*) indicates collection-level metadata


Manually created record

D.C. Field   Qualifier        Content
title                         Digital Libraries and the Problem of Purpose
creator                   (+) David M. Levy
publisher                     Corporation for National Research Initiatives
date         publication      January 2000
type                          article
type         resource         work

(+) entry that is not in the automatically generated records

continued on next slide


Manually created record (continued)

D.C. Field   Qualifier        Content
relation     rel-type         InSerial
relation     serial-name      D-Lib Magazine
relation     issn             1082-9873
relation     volume       (+) 6
relation     issue        (+) 1
identifier   DOI          (+) 10.1045/january2000-levy
identifier   URL              http://www.dlib.org/dlib/january00/01levy.html
language                      English
rights                    (+) Copyright (c) David M. Levy

(+) entry that is not in the automatically generated records


Collection-level metadata

Compare:

(a) Metadata extracted automatically by DC-dot

(b) Collection-level record

(c) Combined item-level record (DC-dot plus collection-level)

(d) Manual record

For web pages, information retrieval works better with automatic indexing than with automatic extraction of metadata followed by indexing of the metadata.

However, we will see later an effective example of automated extraction of metadata from video sequences (Informedia).


Metatest

Metatest is a research project led by Liz Liddy at Syracuse with participation from the Human Computer Interaction group at Cornell.

The aim is to compare the effectiveness as perceived by the user of indexing based on:

(a) Manually created Dublin Core

(b) Automatically created Dublin Core (higher quality than DC-dot)

(c) Full text indexing

Preliminary results suggest remarkably little difference in effectiveness.


Midterm Examination Q3

(a) The aggregate term weighting for term j in document i is sometimes written:

$w_{ij} = tf_{ij} \times idf_j$

Explain the purpose of $tf_{ij}$ and $idf_j$.

Term frequency assumes that the usefulness of a term for retrieval increases as the number of times that the term appears in the document increases.

Inverse document frequency assumes that terms that appear in few documents are better discriminators than those that appear in many.


Midterm Examination Q3 (continued)

(b) In class, we recommended the following term weighting for free text documents:

$tf_{ij} = f_{ij} / m_i$

$idf_j = \log_2 (n / n_j) + 1$

(i) Explain why this form is frequently changed for the weighting of terms in documents, such as catalog records, that are not free text.

The forms of tf and idf have been developed for free text. The distribution of terms is different in free text and catalog records.

(ii) Explain why this form might give difficulties if the documents vary greatly in length.

The scaling factors in tf and idf have been developed for collections of similar length records.
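The recommended weighting can be written out as a short Python sketch. The slide does not define the symbols, so the reading used here is an assumption: f_ij is the number of times term j occurs in document i, m_i is the largest such count in document i, n is the number of documents, and n_j is the number of documents containing term j. The function names are hypothetical.

```python
import math
from collections import Counter


def tf(doc_terms):
    """Return tf_ij = f_ij / m_i for every term in one document.

    Assumes m_i is the highest raw frequency of any term in the document.
    """
    counts = Counter(doc_terms)
    m = max(counts.values())
    return {term: f / m for term, f in counts.items()}


def idf(docs):
    """Return idf_j = log2(n / n_j) + 1 for every term in the collection.

    `docs` is a list of documents, each a list of terms; n_j counts the
    documents that contain term j at least once.
    """
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log2(n / nj) + 1 for term, nj in doc_freq.items()}


def weights(docs):
    """Return w_ij = tf_ij * idf_j as one dict per document."""
    inverse = idf(docs)
    return [{term: t * inverse[term] for term, t in tf(doc).items()} for doc in docs]
```

Under this reading, a term that occurs in every document gets idf 1 rather than 0, so it still contributes to the weight.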


Midterm Examination Q3 (continued)

(c) Consider the query:

q: dog cat dog

and the following set of documents:

d1: bee dog bee cat bee elk elk

d2: elk dog ant ant dog ant

d3: cat cat cat cat dog

(i) With no term weighting, what is the similarity between this query and each of the documents?


Midterm Examination Q3 (continued)

Term vector matrix

      ant  bee  cat  dog  elk  length
q               1    1         √2
d1         1    1    1    1    2
d2    1              1    1    √3
d3              1    1         √2

Similarities

      q    d1    d2    d3
q     1    1/√2  1/√6  1
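The slide does not name the similarity measure, but the values agree with the cosine of the angle between term vectors: the inner product divided by the product of the two lengths. As a worked check with the unweighted (0 or 1) vectors:

```latex
\begin{aligned}
\mathrm{sim}(q, d_1) &= \frac{q \cdot d_1}{\lVert q \rVert \, \lVert d_1 \rVert}
                      = \frac{1 + 1}{\sqrt{2} \cdot 2} = \frac{1}{\sqrt{2}} \\
\mathrm{sim}(q, d_2) &= \frac{1}{\sqrt{2} \cdot \sqrt{3}} = \frac{1}{\sqrt{6}} \\
\mathrm{sim}(q, d_3) &= \frac{1 + 1}{\sqrt{2} \cdot \sqrt{2}} = 1
\end{aligned}
```

The query shares cat and dog with d1 and d3, but only dog with d2, which is why the numerator is 2 in the first and third lines and 1 in the second.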


Midterm Examination Q3 (continued)

(ii) Weighting both the query and the documents for term frequency, but not weighting for inverse document frequency, what is the similarity between this query and each of the documents?


Midterm Examination Q3 (continued)

Term vector matrix

      ant  bee  cat  dog  elk  length
q               1    2         √5
d1         3    1    1    2    √15
d2    3              2    1    √14
d3              4    1         √17

Similarities

      q    d1     d2     d3
q     1    3/√75  4/√70  6/√85
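Both parts of the question can be checked with a short Python sketch, assuming (as the values in the matrices suggest) that similarity is the cosine between term vectors and that the term frequency weight is the raw count. The helper names are hypothetical.

```python
import math
from collections import Counter

QUERY = "dog cat dog"
DOCS = {
    "d1": "bee dog bee cat bee elk elk",
    "d2": "elk dog ant ant dog ant",
    "d3": "cat cat cat cat dog",
}


def binary_vector(text):
    """No term weighting: 1 for every distinct term that occurs."""
    return {term: 1 for term in set(text.split())}


def count_vector(text):
    """Term frequency weighting: raw number of occurrences of each term."""
    return dict(Counter(text.split()))


def length(vec):
    """Euclidean length of a sparse term vector."""
    return math.sqrt(sum(w * w for w in vec.values()))


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b.get(term, 0) for term, w in a.items())
    return dot / (length(a) * length(b))


def similarities(vectorize):
    """Similarity of the query to each document under one weighting."""
    q = vectorize(QUERY)
    return {name: cosine(q, vectorize(text)) for name, text in DOCS.items()}


if __name__ == "__main__":
    for label, vectorize in (("unweighted", binary_vector),
                             ("term frequency", count_vector)):
        print(label, similarities(vectorize))
```

Because cosine similarity is unchanged when a vector is multiplied by a positive constant, dividing each document's counts by its maximum count would give the same similarities as the raw counts used here.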