Subject Metadata Enhancement using Statistical Topic Models
DLF Forum, April 24, 2007
David Newman (UC Irvine), Kat Hagedorn (U Michigan)


Everyone could use better metadata!

?


Outline

1. OAIster: Metadata Challenges

2. Clustering: Topic Model

3. Deployment of the Prototype

4. Lessons Learned


OAIster

• (as of Sept. 2006)
• 700+ institutions
• 9.6 million records
• Academically-oriented material, research literature, images, and more
• Problem: How to go beyond keyword search?

OAIster collection, e.g.,
– CiteSeer
– PubMed
– Library of Congress
– arXiv.org
– PictureAustralia

plus…
– Xiamen University Repository
– National Library of Serbia
– DSpace at Malmo University
– Deep Blue


Our Challenge

• OAIster wants to provide users with the best possible search and discovery experience

• Can improve search and discovery with
  – Better metadata
  – Better access to the metadata


How?

Enhancing access via…

• Search
  – limit search results by subject classification

• Browse
  – browse subject classification hierarchy

• Built prototype to showcase this


Topic Model

• State-of-the-art statistical algorithm

• Learns a set of topics or subjects covered by a collection of text records

• Works by finding patterns of co-occurring words

• Determines the mix of topics associated with each record
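The slides do not name the specific algorithm, so the following is only a minimal sketch, assuming latent Dirichlet allocation (LDA) fitted by collapsed Gibbs sampling, which is one widely used topic model. The function name `fit_topic_model`, the hyperparameters `alpha` and `beta`, and the toy records are all illustrative assumptions, not the authors' implementation.

```python
import random
from collections import Counter


def fit_topic_model(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized records (lists of words)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # words in record d assigned to topic k
    topic_word = [Counter() for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                       # total words assigned to topic k
    assign = []                                          # current topic of every word position

    # Start from a random topic assignment for every word.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        assign.append(zs)
        for w, k in zip(doc, zs):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the word's current assignment, then resample it.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Topics that are common in this record AND that favor this
                # word get more weight: this is how co-occurrence is captured.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Most frequent words per topic, and the smoothed topic mix per record.
    top_words = [[w for w, n in tw.most_common(5) if n > 0] for tw in topic_word]
    mixes = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return top_words, mixes


# Toy records (made up for illustration).
docs = [
    ["gene", "dna", "sequence", "gene"],
    ["dna", "clone", "gene"],
    ["gravity", "einstein", "spacetime"],
    ["gravity", "tensor", "einstein", "gravity"],
]
top_words, mixes = fit_topic_model(docs, num_topics=2)
```

`top_words` corresponds to the learned topics and `mixes` to the mix of topics per record; on a real collection the number of topics and iterations would need tuning.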


Process

Cluster: OAI records → preprocess (using vocabulary) → topic model (cluster/learn) → topics


Process

Cluster: OAI records → preprocess (using vocabulary) → topic model (cluster/learn) → topics

Classify: OAI records → preprocess (using vocabulary) → topic model (classify) → OAI record with 1. topics in records, 2. records in topics


Process

Cluster (clustering is learning the topics): OAI records → preprocess (using vocabulary) → topic model (cluster/learn) → topics

Classify (classification is using the learned topics): OAI records → preprocess (using vocabulary) → topic model (classify) → OAI record with 1. topics in records, 2. records in topics
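The classify step can be illustrated with a short sketch that holds the learned topics fixed and estimates only the topic mix of a new record. This uses a simple EM-style fold-in, which is an assumption made for illustration (the slides do not say how classification is computed); `classify_record` and the toy topic probabilities are hypothetical.

```python
def classify_record(tokens, topic_word_prob, iters=50, alpha=0.1):
    """Estimate a record's topic mix while the learned topics stay fixed.

    topic_word_prob: one dict per topic mapping word -> probability.
    """
    num_topics = len(topic_word_prob)
    # Ignore words that no learned topic knows about.
    known = [w for w in tokens if any(w in t for t in topic_word_prob)]
    mix = [1.0 / num_topics] * num_topics
    for _ in range(iters):
        counts = [alpha] * num_topics  # small prior keeps every topic possible
        for w in known:
            # Responsibility of each topic for this word under the current mix.
            post = [mix[k] * topic_word_prob[k].get(w, 1e-9) for k in range(num_topics)]
            z = sum(post)
            for k in range(num_topics):
                counts[k] += post[k] / z
        total = sum(counts)
        mix = [c / total for c in counts]
    return mix


# Two toy topics with made-up word probabilities.
topics = [
    {"gene": 0.5, "dna": 0.3, "sequence": 0.2},
    {"gravity": 0.5, "einstein": 0.3, "spacetime": 0.2},
]
mix = classify_record(["gene", "dna", "gene", "gravity"], topics)
```

Because the topics are not re-learned, this step is much cheaper than clustering and can be run record by record.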


Building Vocabulary

• Preprocessed (sampled) repositories, excluded stopwords

• Only kept words that occurred in more than 10 records

• Result: a final vocabulary with ~ 90,000 words
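The vocabulary rule above can be sketched as a document-frequency filter. The tokenizer, the tiny stopword list, and the function names are assumptions for illustration; only the "more than 10 records" threshold comes from the slide.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "it", "this", "from"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(records, min_records=10, stopwords=STOPWORDS):
    """Keep non-stopwords that occur in more than `min_records` records."""
    doc_freq = Counter()
    for text in records:
        # set() so a word counts once per record, however often it repeats.
        doc_freq.update(set(tokenize(text)) - stopwords)
    return {w for w, n in doc_freq.items() if n > min_records}


# 11 records mention "gene sequence"; 10 mention "rare word".
records = ["the gene sequence"] * 11 + ["rare word"] * 10
vocab = build_vocabulary(records)  # {"gene", "sequence"}
```

Counting records rather than raw occurrences keeps a word that is repeated many times in a single record from entering the vocabulary.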


OAI Record (from USC repository)

Coast route from Los Angeles to San Diego. Part one: Los Angeles to Santa Ana, 1927

Strip map of the automobile route from Los Angeles to Santa Ana via the coast route. Includes Los Angeles (top & left), Tustin (bottom), La Habra (right). Due north is about 35 degrees to the right of vertical. Principal features: municipalities, railroads, roads, mileages, road names, hotels, garages, Auto Club offices. Prominent locations (cities; streets; geographical features; institutions): Los Angeles (Broadway, 7th Street, Whittier Boulevard), Pasadena, Alhambra, San Gabriel, Montebello, Downey, Whittier, La Habra (Spadra), Fullerton, Anaheim (Los Angeles Street), Orange, Santa Ana (Main Street, 1st Street, 4th Street), Tustin; Rio Hondo, San Gabriel River; State School, Orange County Hospital.

Maps, Tourist; roadways; Automobile Club of Southern California


OAI Record

Harbor Freeway at 3rd Street overpass


Preprocessing Example

<ID=oai:CiteSeerPSU:44072>

<title>Reinforcement Learning: A Survey

<description>This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." …

<subject>Leslie Pack Kaelbling, Michael Littman, Andrew Moore. Reinforcement Learning: A Survey

After preprocessing (using vocabulary):

<ID=oai:CiteSeerPSU:44072>

reinforcement learning survey

survey field reinforcement learning computer science perspective written accessible researcher familiar machine learning historical basis field broad selection current summarized reinforcement learning faced agent learn behavior trial error interaction dynamic environment resemblance psychology differ considerably detail word reinforcement …

leslie pack kaelbling littman andrew moore reinforcement learning survey
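A minimal sketch of the transformation shown above, assuming preprocessing amounts to lowercasing, splitting on non-letters, and keeping only vocabulary words. The slide output also appears to normalize some plurals (e.g. "researchers" to "researcher"), which this sketch does not attempt; the `preprocess` name and the three-word vocabulary are hypothetical.

```python
import re


def preprocess(fields, vocabulary):
    """Lowercase each field, split on non-letters, keep only vocabulary words."""
    return {
        name: [w for w in re.findall(r"[a-z]+", text.lower()) if w in vocabulary]
        for name, text in fields.items()
    }


record = {"title": "Reinforcement Learning: A Survey"}
vocabulary = {"reinforcement", "learning", "survey"}  # toy vocabulary
cleaned = preprocess(record, vocabulary)
# cleaned == {"title": ["reinforcement", "learning", "survey"]}
```

Filtering against the vocabulary removes stopwords and rare words in one pass, since neither survives vocabulary building.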


Example Topics (1)

Words in Topic | Topic Label

gene sequence genes sequences cdna region amino_acid clones encoding cloned coding dna genomic cloning clone | gene sequencing

social cultural political culture conflict identity society economic context gender contemporary politic world examines tradition sociology institution ethic discourse | cultural identity

general_relativity gravity gravitational solution black_hole tensor einstein horizon spacetime equation field metric vacuum scalar matter energy relativity | relativity

house garden houses dwelling housing homes terrace estate home building architecture residence homestead residences road cottage domestic fences lawn historic | domestic architecture


Example Topics (2)

Words in Topic | Usefulness

large small size larger smaller sizes scale sized largest | Reasonable but unusable

foi para pacientes por foram dos doen resultados grupo das tratamento entre | Topic about patient treatment, in Portuguese

building street visible santa_ana view avenue public_library front orange corner | Not usable: mix of concept words and specific geographic location words


Topics Assigned to a Record

Metadata Record:

Aggregating sets of judgments: two impossibility results compared. (C. List and P. Pettit)

May's celebrated theorem (1952) shows that, if a group of individuals wants to make a choice between two alternatives (say x and y), then majority voting is the unique decision procedure satisfying a set of attractive minimal conditions ...

Topic Labels (% words assigned):

game theory (21%)

argument (12%)

criteria (7%)
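The percentages can be read as the share of a record's words that the model assigned to each topic. The sketch below assumes exactly that interpretation; the per-word assignments are made up so that the top three shares match the slide, and `topic_percentages` is a hypothetical helper.

```python
from collections import Counter


def topic_percentages(word_topics, labels, top_n=3):
    """word_topics holds the topic id assigned to each word of one record."""
    total = len(word_topics)
    ranked = Counter(word_topics).most_common(top_n)
    return [(labels.get(k, f"topic {k}"), round(100 * n / total)) for k, n in ranked]


labels = {0: "game theory", 1: "argument", 2: "criteria"}
# Made-up assignments for a 100-word record: 21, 12 and 7 words go to the
# three labeled topics; the other 60 are spread thinly over twelve topics.
word_topics = [0] * 21 + [1] * 12 + [2] * 7 + [k for k in range(3, 15) for _ in range(5)]
shares = topic_percentages(word_topics, labels)
# shares == [("game theory", 21), ("argument", 12), ("criteria", 7)]
```

The shares need not sum to 100 because only the highest-ranked topics are reported.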


Top Records in “Game Theory”

[GAME THEORY] game games equilibrium preferences player cooperative preference equilibria cooperation collective utility individual choice bargaining coalition nash strategy

1. Fundamental Components of the Gameplay Experience: Analysing Immersion
2. The Ethics of Computer Game Design
3. Backward Induction and Common Knowledge
4. Designing Puzzles for Collaborative Gaming Experience
5. Aggregating sets of judgments: two impossibility results compared
6. Games for Modal and Temporal Logics
7. Configuring the player - subversive behaviour in Project Entropia
8. From Mass Audience to Massive Multiplayer: How Multiplayer Games Create New Media Politics
9. Bargaining with incomplete information; Handbook of Game Theory with Economic Applications
10. Testable Restrictions of General Equilibrium Theory in Exchange Economies with Externalities

Repositories: RePEc; DSpace at ANU (Australian National University); Edinburgh Research Archive; Almae Matris Studiorum (AMS); eScholarship Repository; CiteSeer
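A "records in topics" listing like this one can be produced by ranking records on how much of each record is assigned to the topic. This is a sketch under that assumption; the proportions below are invented and `top_records` is a hypothetical name.

```python
def top_records(doc_topic_mix, titles, topic, n=10):
    """Return the titles of the n records with the largest share of `topic`."""
    ranked = sorted(range(len(titles)), key=lambda d: doc_topic_mix[d][topic], reverse=True)
    return [titles[d] for d in ranked[:n]]


titles = [
    "Backward Induction and Common Knowledge",
    "The Ethics of Computer Game Design",
    "Games for Modal and Temporal Logics",
]
# Invented topic mixes: column 0 stands for the "game theory" topic.
doc_topic_mix = [[0.2, 0.8], [0.7, 0.3], [0.5, 0.5]]
best = top_records(doc_topic_mix, titles, topic=0, n=2)
# best == ["The Ethics of Computer Game Design", "Games for Modal and Temporal Logics"]
```

Ranking purely by topic share is consistent with the list above mixing computer-game and economics records: both use the same vocabulary.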


Improving Topics

• First topic model of OAIster resulted in 70% of topics being usable

• We improved topics by
  – Reduced Vocabulary: Remove topically low-value words from vocabulary (manual)
  – Background Words Model: Automatically detect and remove words specific to repositories (automatic)


Background Words Model

METADATA

<T>Country Landscapes</T>

<SU>Ireland; town; street; sidewalk; house; cart; Dorothea Lange Collection; Photographic Essays (1953-1959); Ireland; Ireland, Country Landscapes</SU>


METADATA

<T>Horseplay</T>

<SU>Gunlock, UT; child, female, face; horse, back; rein, rope; Dorothea Lange Collection; Photographic Essays (1953-1959); Three Mormon Towns; Gunlock; Gunlock, Horseplay</SU>


METADATA

<T>Country Landscapes</T>

<SU>Ireland; town; street; sidewalk; house; cart; Dorothea Lange Collection; Photographic Essays (1953-1959); Ireland; Ireland, Country Landscapes</SU>

METADATA

<T>Horseplay</T>

<SU>Gunlock, UT; child, female, face; horse, back; rein, rope; Dorothea Lange Collection; Photographic Essays (1953-1959); Three Mormon Towns; Gunlock; Gunlock, Horseplay</SU>

California Digital Library (CDL) repository


Background Words Model

[Diagram: shared topics are learned across repositories, while each repository (e.g., CDL) also has its own repository-specific words, such as "Dorothea Lange Collection …"]
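The slides do not describe how the Background Words Model works internally, so the sketch below swaps in a much simpler frequency heuristic to illustrate the goal: flag words whose occurrences fall almost entirely within one repository. The thresholds, the function name, and the toy records (including the "USC" rows) are assumptions, and this is not the authors' probabilistic model.

```python
from collections import Counter, defaultdict


def repository_specific_words(records, min_count=3, threshold=0.9):
    """records: iterable of (repository, tokens) pairs.

    Flags a word for a repository when the word occurs at least `min_count`
    times overall and at least `threshold` of those occurrences are in
    that one repository.
    """
    per_repo = defaultdict(Counter)
    total = Counter()
    for repo, tokens in records:
        per_repo[repo].update(tokens)
        total.update(tokens)
    flagged = defaultdict(set)
    for repo, counts in per_repo.items():
        for word, n in counts.items():
            if total[word] >= min_count and n / total[word] >= threshold:
                flagged[repo].add(word)
    return dict(flagged)


# Toy records loosely modeled on the CDL examples; the USC rows are invented.
records = [
    ("CDL", ["ireland", "street", "house", "dorothea_lange_collection", "photographic_essays"]),
    ("CDL", ["horse", "rope", "dorothea_lange_collection", "photographic_essays"]),
    ("CDL", ["street", "dorothea_lange_collection", "photographic_essays"]),
    ("USC", ["street", "house", "horse"]),
    ("USC", ["street", "ireland", "rope", "house"]),
]
background = repository_specific_words(records)
# background == {"CDL": {"dorothea_lange_collection", "photographic_essays"}}
```

Words such as "street" and "house" survive because they are shared across repositories, which is the behavior a shared topic needs.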


Improvement of Music topic (Background Words)

Standard Topic Model

[MUSIC] music moa periodical_devoted ladies_repository literature_art song musical mus gov_nla nla_mus music_australian cover instrument piano musician south_america voice_piano drum peru song_piano composer marsh word singing violin playing orchestra vanity_fair

Shared Topic from Background Words Topic Model

[MUSIC] music poster musical dance song theatre instrument actor concert entertainment theater piano sound festival theatrical musician drama performances opera art ballet popular performing performer folk singer composer dancer drum jazz dancing orchestra stage pieces singing recording


Improvement of Family Photos topic (Reduced Vocab)

Standard Topic Model

family_photograph mss jpg george_edward anderson_photograph plate_negative women_portrait gelatin_dry photograph_portrait south_africa studio_portrait children_portrait hair standing sitting portrait underwood portrait_portrait front infant_portrait

Reduced Vocab Topic Model

family_photograph wearing woman hair dress clothing shoulder baby suit dressed chair clothing_dress wear hand tie shirt jacket costume boy ribbon collar dark lap bow white full_face beard young_woman leaning striped outdoor


Improvement Measures

[Bar chart: percent (0 to 100) for Usable Topics, Records Enhanced, and Coverage, comparing the Original and Improved topic models]


Deployment

• Added topic labels to a 2.5-million-record subset of OAIster (62 repositories)

• Built search and browse interface for testing purposes


Creating labels

• Community of “experts” created labels for the 352 topics


Adding classification

• At the same time, classification categories were associated with the topic labels

• “High Level Browse” classification developed at UM, based on LC call numbers


Assign to records

• Top-4 topics are assigned to records

• Plus, the chosen classification categories

• Performed for each repository
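The assignment step can be sketched as picking each record's four highest-weighted topics and collecting the classification categories mapped to them. The function name, the topic mix, and the category names below are hypothetical placeholders, not the actual "High Level Browse" categories.

```python
def enhance_record(topic_mix, topic_labels, topic_categories, top_n=4):
    """Pick the top_n topics of a record and gather their mapped categories."""
    ranked = sorted(range(len(topic_mix)), key=lambda k: topic_mix[k], reverse=True)[:top_n]
    categories = []
    for k in ranked:
        for category in topic_categories.get(k, []):
            if category not in categories:  # keep order, drop duplicates
                categories.append(category)
    return {"topics": [topic_labels[k] for k in ranked], "categories": categories}


topic_labels = ["gene sequencing", "game theory", "argument", "criteria", "cultural identity", "music"]
# Placeholder category names, invented for this example.
topic_categories = {
    1: ["Social Sciences"],
    2: ["Humanities"],
    3: ["Humanities"],
    4: ["Social Sciences", "Humanities"],
}
topic_mix = [0.05, 0.30, 0.12, 0.25, 0.20, 0.08]  # made-up topic mix for one record
enhanced = enhance_record(topic_mix, topic_labels, topic_categories)
# enhanced["topics"] == ["game theory", "criteria", "cultural identity", "argument"]
# enhanced["categories"] == ["Social Sciences", "Humanities"]
```

Capping at a fixed number of topics keeps the added subject metadata short even for records whose words are spread over many topics.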


Resulting record


Performing a search

• Using interface, end-users can choose to limit their searches by subject categories mapped to the topic labels

• e.g., doing a search by “gender” and “diversity”, limited to subject categories
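Limiting a search by subject category can be sketched as a keyword match followed by a category filter. The record structure, field names, and category names are assumptions for illustration; the prototype's actual search engine is not described in the slides.

```python
def search(records, query_terms, category=None):
    """Return ids of records containing every query term, optionally
    limited to records carrying the given subject category."""
    hits = []
    for rec in records:
        words = set(rec["text"].lower().split())
        if not all(term.lower() in words for term in query_terms):
            continue
        if category is not None and category not in rec["categories"]:
            continue
        hits.append(rec["id"])
    return hits


# Invented records; "categories" stands for the categories added during enhancement.
records = [
    {"id": "r1", "text": "Gender and diversity in higher education", "categories": ["Social Sciences"]},
    {"id": "r2", "text": "Gender diversity in plant populations", "categories": ["Science"]},
    {"id": "r3", "text": "Diversity of soil bacteria", "categories": ["Science"]},
]
everything = search(records, ["gender", "diversity"])                    # ["r1", "r2"]
limited = search(records, ["gender", "diversity"], "Social Sciences")    # ["r1"]
```

The same keyword query returns fewer, more focused results once the category limit is applied, without the user rewriting the query.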


Search results

• Can then choose to limit search by a different subject category…on the search results page


Resulting record in display


Lessons Learned (1)

• Topic modeling allows…
  – Narrow/expand search results, without having to re-do search
  – Clarification of search

• Labels and classification
  – Humanities records fared worse than scientific records, e.g., lack of metadata, use of metaphors
  – Classification has some holes, e.g., history of war


Lessons Learned (2)

• Quality
  – Should use experts to label topics

• Scalability
  – Found ways to improve topics
  – Some manual, some automatic, reasonably scalable

• Testing
  – Real accuracy analysis needed
  – Real end-user usability needed


Questions?

• David Newman, U. of California Irvine

• Kat Hagedorn, U. of Michigan