1
Subject Metadata Enhancement using Statistical Topic Models
DLF Forum, April 24, 2007
David Newman (UC Irvine)
Kat Hagedorn (U Michigan)
3
Outline
1. OAIster: Metadata Challenges
2. Clustering: Topic Model
3. Deployment of the Prototype
4. Lessons Learned
4
OAIster
• (as of Sept. 2006)
• 700+ institutions
• 9.6 million records
• Academically-oriented material, research literature, images, and more
• Problem: How to go beyond keyword search?
OAIster collection, e.g.,
– CiteSeer
– PubMed
– Library of Congress
– arXiv.org
– PictureAustralia
plus…
– Xiamen University Repository
– National Library of Serbia
– DSpace at Malmo University
– Deep Blue
5
Our Challenge
• OAIster wants to provide users with the best possible search and discovery experience
• Can improve search and discovery with
– Better metadata
– Better access to the metadata
6
How?
Enhancing access via…
• Search
– limit search results by subject classification
• Browse
– browse subject classification hierarchy
• Built prototype to showcase this
7
Topic Model
• State-of-the-art statistical algorithm
• Learns a set of topics or subjects covered by a collection of text records
• Works by finding patterns of co-occurring words
• Determines the mix of topics associated with each record
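As a rough illustration of the four points above, here is a minimal sketch of one common topic model, latent Dirichlet allocation fitted by collapsed Gibbs sampling. The slides do not say which implementation was used, so the function name `fit_topics`, the hyperparameters `alpha` and `beta`, and the toy records are all assumptions.

```python
import random

def fit_topics(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy LDA via collapsed Gibbs sampling (illustrative, not the authors' code).

    docs: list of token lists. Returns (topic_word_counts, doc_topic_mix).
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    # Count tables, updated whenever a word-to-topic assignment changes.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = []
    for d, doc in enumerate(docs):  # random initial assignment
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Remove the current assignment before resampling it.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight each topic by how well it fits the record and the word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Mix of topics per record: smoothed share of its words in each topic.
    mix = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return topic_word, mix

# Hypothetical toy records, already preprocessed into tokens.
records = [
    "gene sequence dna cloning gene".split(),
    "dna gene genomic sequence".split(),
    "gravity einstein spacetime relativity".split(),
    "relativity gravity tensor einstein gravity".split(),
]
topic_word, mix = fit_topics(records, num_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(counts, key=counts.get, reverse=True)[:4]
    print(f"topic {k}: {' '.join(top)}")
```

Words that co-occur in the same records tend to end up in the same topic, and each row of `mix` gives the topic proportions for one record.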
9
Process
Cluster: OAI records → preprocess (vocabulary) → topic model (cluster/learn) → topics
Classify: OAI records → preprocess (vocabulary) → topic model (classify) → 1. topics in records 2. records in topics
10
Process
Cluster: OAI records → preprocess (vocabulary) → topic model (cluster/learn) → topics
(clustering is learning the topics)
Classify: OAI records → preprocess (vocabulary) → topic model (classify) → 1. topics in records 2. records in topics
(classification is using the learned topics)
11
Building Vocabulary
• Preprocessed (sampled) repositories, excluded stopwords
• Only kept words that occurred in more than 10 records
• Result: a final vocabulary with ~ 90,000 words
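A minimal sketch of the vocabulary step, assuming whitespace tokenization and a tiny placeholder stopword list (the actual tokenizer and stopword list are not given in the slides). `build_vocabulary` and `STOPWORDS` are hypothetical names.

```python
from collections import Counter

# Tiny illustrative stopword list; the list actually used is not given in the slides.
STOPWORDS = {"the", "of", "a", "an", "and", "to", "in", "is", "for"}

def build_vocabulary(records, min_records=10, stopwords=STOPWORDS):
    """Keep non-stopwords that occur in more than `min_records` records."""
    doc_freq = Counter()
    for text in records:
        # Count each word once per record (document frequency).
        doc_freq.update(set(text.lower().split()))
    return {w for w, n in doc_freq.items() if n > min_records and w not in stopwords}

# Toy sample with a lowered threshold so the effect is visible.
sample = ["the gene sequence", "gene cloning", "the map of roads"]
vocab = build_vocabulary(sample, min_records=1)
print(sorted(vocab))
```

Counting records rather than raw occurrences keeps a word that is repeated many times inside a single record from entering the vocabulary.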
12
OAI Record (from USC repository)
Coast route from Los Angeles to San Diego. Part one: Los Angeles to Santa Ana, 1927
Strip map of the automobile route from Los Angeles to Santa Ana via the coast route. Includes Los Angeles (top & left), Tustin (bottom), La Habra (right). Due north is about 35 degrees to the right of vertical. Principal features: municipalities, railroads, roads, mileages, road names, hotels, garages, Auto Club offices. Prominent locations (cities; streets; geographical features; institutions): Los Angeles (Broadway, 7th Street, Whittier Boulevard), Pasadena, Alhambra, San Gabriel, Montebello, Downey, Whittier, La Habra (Spadra), Fullerton, Anaheim (Los Angeles Street), Orange, Santa Ana (Main Street, 1st Street, 4th Street), Tustin; Rio Hondo, San Gabriel River; State School, Orange County Hospital.
Maps, Tourist; roadways; Automobile Club of Southern California
14
Preprocessing Example
<ID=oai:CiteSeerPSU:44072>
<title>Reinforcement Learning: A Survey
<description>This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." …
<subject>Leslie Pack Kaelbling, Michael Littman, Andrew Moore. Reinforcement Learning: A Survey
preprocess (using vocabulary) gives:
<ID=oai:CiteSeerPSU:44072>
reinforcement learning survey
survey field reinforcement learning computer science perspective written accessible researcher familiar machine learning historical basis field broad selection current summarized reinforcement learning faced agent learn behavior trial error interaction dynamic environment resemblance psychology differ considerably detail word reinforcement …
leslie pack kaelbling littman andrew moore reinforcement learning survey
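The preprocessing shown above can be approximated as follows. This is a sketch only: it lowercases, splits on non-letters, and drops out-of-vocabulary words, but omits the normalization of word forms visible in the slide's output (e.g., "researchers" becoming "researcher"). The function name and the small vocabulary are hypothetical.

```python
import re

def preprocess(text, vocabulary):
    """Lowercase, split on non-letters, and keep only in-vocabulary words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t in vocabulary]

# Hypothetical vocabulary; the real one had about 90,000 words.
vocabulary = {"reinforcement", "learning", "survey", "field", "agent"}
print(" ".join(preprocess("Reinforcement Learning: A Survey", vocabulary)))
```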
15
Example Topics (1)
Words in Topic | Topic Label
gene sequence genes sequences cdna region amino_acid clones encoding cloned coding dna genomic cloning clone | gene sequencing
social cultural political culture conflict identity society economic context gender contemporary politic world examines tradition sociology institution ethic discourse | cultural identity
general_relativity gravity gravitational solution black_hole tensor einstein horizon spacetime equation field metric vacuum scalar matter energy relativity | relativity
house garden houses dwelling housing homes terrace estate home building architecture residence homestead residences road cottage domestic fences lawn historic | domestic architecture
16
Example Topics (2)
Words in Topic | Usefulness
large small size larger smaller sizes scale sized largest | Reasonable but unusable
foi para pacientes por foram dos doen resultados grupo das tratamento entre | Topic about patient treatment, in Portuguese
building street visible santa_ana view avenue public_library front orange corner | Not usable: mix of concept words and specific geographic location words
17
Topics Assigned to a Record
Metadata Record | Topic Labels (% words assigned)
Aggregating sets of judgments: two impossibility results compared. (C. List and P. Pettit) May's celebrated theorem (1952) shows that, if a group of individuals wants to make a choice between two alternatives (say x and y), then majority voting is the unique decision procedure satisfying a set of attractive minimal conditions ... | game theory (21%), argument (12%), criteria (7%)
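One way to arrive at "% words assigned" figures like these is to count, for a single record, how many of its words the model assigned to each topic. The sketch below assumes per-word topic labels are already available; the 100-word record and the catch-all "other" label are hypothetical and only chosen to reproduce the percentages on the slide.

```python
from collections import Counter

def topic_percentages(word_topics):
    """Percent of a record's words assigned to each topic, largest first."""
    counts = Counter(word_topics)
    total = len(word_topics)
    return [(label, round(100 * n / total)) for label, n in counts.most_common()]

# Hypothetical per-word topic assignments for one 100-word record.
word_topics = (["game theory"] * 21 + ["argument"] * 12
               + ["criteria"] * 7 + ["other"] * 60)
for label, pct in topic_percentages(word_topics):
    print(f"{label} ({pct}%)")
```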
18
Top Records in “Game Theory”
[GAME THEORY] game games equilibrium preferences player cooperative preference equilibria cooperation collective utility individual choice bargaining coalition nash strategy
1. Fundamental Components of the Gameplay Experience: Analysing Immersion
2. The Ethics of Computer Game Design
3. Backward Induction and Common Knowledge
4. Designing Puzzles for Collaborative Gaming Experience
5. Aggregating sets of judgments: two impossibility results compared
6. Games for Modal and Temporal Logics
7. Configuring the player - subversive behaviour in Project Entropia
8. From Mass Audience to Massive Multiplayer: How Multiplayer Games Create New Media Politics
9. Bargaining with incomplete information; Handbook of Game Theory with Economic Applications
10. Testable Restrictions of General Equilibrium Theory in Exchange Economies with Externalities
Repositories
RePEc
Dspace at ANU (Australian National University)
Edinburgh Research Archive
Almae Matris Studiorum (AMS)
eScholarship Repository
CiteSeer
19
Improving Topics
• First topic model of OAIster resulted in 70% of topics being usable
• We improved topics by
– Reduced Vocabulary: Remove topically low-value words from vocabulary (manual)
– Background Words Model: Automatically detect and remove words specific to repositories (automatic)
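To make the idea of "words specific to repositories" concrete, here is a simple frequency heuristic. It is not the Background Words Model itself, which is a statistical model; it is a swapped-in stand-in that flags a word when nearly all records containing it come from one repository and it appears in most of that repository's records. The thresholds, function name, and toy records are assumptions.

```python
from collections import Counter, defaultdict

def repository_specific_words(records, min_share=0.9, min_coverage=0.75):
    """Flag words concentrated in one repository (simple heuristic).

    records: list of (repository, tokens) pairs.
    Returns a dict mapping each flagged word to its repository.
    """
    repo_sizes = Counter(repo for repo, _ in records)
    doc_freq = defaultdict(Counter)  # word -> repository -> record count
    for repo, tokens in records:
        for w in set(tokens):
            doc_freq[w][repo] += 1
    flagged = {}
    for w, by_repo in doc_freq.items():
        repo, n = by_repo.most_common(1)[0]
        share = n / sum(by_repo.values())  # how exclusive to one repository
        coverage = n / repo_sizes[repo]    # how pervasive within it
        if share >= min_share and coverage >= min_coverage:
            flagged[w] = repo
    return flagged

# Hypothetical toy records from two repositories.
records = [
    ("CDL", ["ireland", "street", "house", "dorothea_lange_collection"]),
    ("CDL", ["horse", "child", "dorothea_lange_collection"]),
    ("USC", ["street", "map", "automobile"]),
    ("USC", ["house", "street", "road"]),
]
print(repository_specific_words(records))
```

Words flagged this way could then be dropped before fitting the topic model, so that collection names do not crowd out subject words.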
20
Background Words Model
METADATA
<T>Country Landscapes</T>
<SU>Ireland; town; street; sidewalk; house; cart; Dorothea Lange Collection; Photographic Essays (1953-1959); Ireland; Ireland, Country Landscapes</SU>
21
METADATA
<T>Horseplay</T>
<SU>Gunlock, UT; child, female, face; horse, back; rein, rope; Dorothea Lange Collection; Photographic Essays (1953-1959); Three Mormon Towns; Gunlock; Gunlock, Horseplay</SU>
22
METADATA
<T>Country Landscapes</T>
<SU>Ireland; town; street; sidewalk; house; cart; Dorothea Lange Collection; Photographic Essays (1953-1959); Ireland; Ireland, Country Landscapes</SU>
METADATA
<T>Horseplay</T>
<SU>Gunlock, UT; child, female, face; horse, back; rein, rope; Dorothea Lange Collection; Photographic Essays (1953-1959); Three Mormon Towns; Gunlock; Gunlock, Horseplay</SU>
California Digital Library (CDL) repository
23
Background Words Model
[Diagram: Shared Topics span all Repositories; each repository (e.g., CDL) also has its own Repository-specific words (e.g., Dorothea Lange Collection …)]
24
Improvement of Music topic (Background Words)
Standard Topic Model
[MUSIC] music moa periodical_devoted ladies_repository literature_art song musical mus gov_nla nla_mus music_australian cover instrument piano musician south_america voice_piano drum peru song_piano composer marsh word singing violin playing orchestra vanity_fair
Shared Topic from Background Words Topic Model
[MUSIC] music poster musical dance song theatre instrument actor concert entertainment theater piano sound festival theatrical musician drama performances opera art ballet popular performing performer folk singer composer dancer drum jazz dancing orchestra stage pieces singing recording
25
Improvement of Family Photos topic (Reduced Vocab)
Standard Topic Model
family_photograph mss jpg george_edward anderson_photograph plate_negative women_portrait gelatin_dry photograph_portrait south_africa studio_portrait children_portrait hair standing sitting portrait underwood portrait_portrait front infant_portrait
Reduced Vocab Topic Model
family_photograph wearing woman hair dress clothing shoulder baby suit dressed chair clothing_dress wear hand tie shirt jacket costume boy ribbon collar dark lap bow white full_face beard young_woman leaning striped outdoor
26
Improvement Measures
[Bar chart: Usable Topics, Records Enhanced, and Coverage, each in percent (0 to 100), Original vs. Improved]
27
Deployment
• Added topic labels to a 2.5-million-record subset of OAIster (62 repositories)
• Built search and browse interface for testing purposes
29
Adding classification
• At the same time, associated classification categories
• “High Level Browse” classification developed at UM, based on LC call numbers
30
Assign to records
• Top-4 topics are assigned to records
• Plus, the chosen classification categories
• Performed for each repository
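Selecting the top-4 topics for a record amounts to sorting its topic mix and keeping the four largest entries. A minimal sketch, assuming the mix is available as a label-to-fraction mapping; the function name and the example values are hypothetical.

```python
def top_topics(topic_mix, n=4):
    """Return the n topic labels with the largest share for one record."""
    ranked = sorted(topic_mix.items(), key=lambda item: item[1], reverse=True)
    return [label for label, _ in ranked[:n]]

# Hypothetical topic mix for one record (label -> fraction of words).
mix = {"game theory": 0.21, "argument": 0.12, "criteria": 0.07,
       "logic": 0.05, "economics": 0.03}
print(top_topics(mix))
```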
32
Performing a search
• Using the interface, end-users can choose to limit their searches by subject categories mapped to the topic labels
• e.g., doing a search by “gender” and “diversity”, limited to subject categories
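The limiting behaviour can be pictured as a keyword match followed by a category filter. The sketch below is not the prototype's search code; the record layout, the category names, and the `search` function are assumptions made for illustration.

```python
def search(records, query_terms, categories=None):
    """Keyword match, optionally limited to records in given subject categories."""
    hits = []
    for rec in records:
        words = set(rec["text"].lower().split())
        if not all(term in words for term in query_terms):
            continue  # every query term must appear in the record
        if categories and not set(rec["categories"]) & set(categories):
            continue  # record carries none of the chosen categories
        hits.append(rec["id"])
    return hits

# Hypothetical records with subject categories already assigned.
records = [
    {"id": "r1", "text": "gender and diversity in higher education",
     "categories": ["Social Sciences"]},
    {"id": "r2", "text": "gender diversity on corporate boards",
     "categories": ["Business"]},
    {"id": "r3", "text": "species diversity in wetlands",
     "categories": ["Science"]},
]
print(search(records, ["gender", "diversity"]))
print(search(records, ["gender", "diversity"], categories=["Social Sciences"]))
```

Because the filter is applied to an existing result set, switching to a different category does not require re-running the keyword match.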
34
Search results
• Can then choose to limit search by a different subject category… on the search results page
36
Lessons Learned (1)
• Topic modeling allows…
– Narrow/expand search results, without having to re-do search
– Clarification of search
• Labels and classification
– Humanities records fared worse than scientific records, e.g., lack of metadata, use of metaphors
– Classification has some holes, e.g., history of war
37
Lessons Learned (2)
• Quality
– Should use experts to label topics
• Scalability
– Found ways to improve topics
– Some manual, some automatic, reasonably scalable
• Testing
– Real accuracy analysis needed
– Real end-user usability needed
38
Questions?
• David Newman, U. of California Irvine
– [email protected]
• Kat Hagedorn, U. of Michigan
– [email protected]