JSTOR Sustainability Collection. Sharon Garewal, JSTOR Senior Metadata Librarian; Ron Snyder, ITHAKA Labs Director of Research and Development
Transcript
Page 1: JSTOR Sustainability Collection - DHUG 2015

JSTOR Sustainability Collection

Sharon Garewal, JSTOR Senior Metadata Librarian

Ron Snyder, ITHAKA Labs Director of Research and Development

Page 2: JSTOR Sustainability Collection - DHUG 2015

Overview

Sustainability collection defined

Utilization of the thesaurus within the sustainability collection

Subject matter experts enlisted

Results

Live demo

Page 3: JSTOR Sustainability Collection - DHUG 2015

JSTOR: a quick primer

3,200+ journals & 30,000+ books

9.3 million full-length articles

70 million pages

2.9 million book reviews

138 million content accesses in 2013

100 million searches per year

http://www.jstor.org/

Page 4: JSTOR Sustainability Collection - DHUG 2015

Sustainability Collection: what will it be?

Driver: Sustainability is an emerging interdisciplinary area whose research and teaching needs JSTOR wanted to support.

Core topics: Cities and Urbanization, Food and Agriculture, Industrial Ecology, Resource Economics, Forestry and Land Use, and Environmental Policy and Law

Composed of journals, books, and grey literature (working reports, research reports, technical reports, etc.)

Specialized functionality to support research, including semantic indexing that helps researchers locate related terms and concepts. This is where the JSTOR Thesaurus (JTHES) comes into play!

Page 5: JSTOR Sustainability Collection - DHUG 2015
Page 6: JSTOR Sustainability Collection - DHUG 2015

JTHES: 19 top terms; 57,470 terms; 103,129 rules

Page 7: JSTOR Sustainability Collection - DHUG 2015

The challenge

To assemble a list of key terms in Sustainability

The terms will be used to organize and tag sustainability-related research articles on JSTOR starting in 2015.

These terms will also be used for an autocomplete function in the search component.

Utilize the JTHES in a live prototype

This was the first project where we looked at how to use the thesaurus as an intelligence layer within a collection. How should it work? How do we do this?

Page 8: JSTOR Sustainability Collection - DHUG 2015

How do we get this done? The options…

Create a new thesaurus for sustainability:

Pros: Specific to sustainability

Cons: Changes must be made in more than one place; cost of creating and maintaining a separate thesaurus

Create a sustainability branch within JTHES:

Pros: Could use BT (broader term) relationships to pull all relevant branches and terms from elsewhere in JTHES into one branch

Cons: Redundant; multiple BTs clutter up JTHES

Create a facet to tag terms within JTHES as “Sustainability”:

Pros: Creates a flat list (in faceted view) of all of the terms in that facet; Easy to maintain

Cons: Does not show a hierarchy; Cannot have multiple facets
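The facet option can be pictured with a small sketch. This is only an illustration under stated assumptions: the Term class, the sample terms and their broader-term links are hypothetical and are not taken from JTHES. Each term keeps its place in the hierarchy through its BT links and carries at most one facet tag, which mirrors the "cannot have multiple facets" limitation above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Term:
    """A thesaurus term (hypothetical structure, not the JTHES schema)."""
    name: str
    broader: List[str] = field(default_factory=list)  # BT links
    facet: Optional[str] = None                       # at most one facet tag


# Hypothetical sample terms, for illustration only.
thesaurus = {
    "Environmental sciences": Term("Environmental sciences"),
    "Economics": Term("Economics"),
    "Industrial ecology": Term(
        "Industrial ecology", ["Environmental sciences"], "Sustainability"),
    "Resource economics": Term(
        "Resource economics", ["Economics"], "Sustainability"),
    "Monetary policy": Term("Monetary policy", ["Economics"]),
}


def facet_view(terms, facet):
    """Return the flat, alphabetical list of terms tagged with a facet."""
    return sorted(t.name for t in terms.values() if t.facet == facet)


print(facet_view(thesaurus, "Sustainability"))
```

Tagged terms stay in their original branches, so nothing has to be maintained in two places; the trade-off is that the facet view itself is flat.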

Page 9: JSTOR Sustainability Collection - DHUG 2015

The road to sustainability…

Research: examined existing glossaries and thesauri created by research libraries, discipline associations and individual scholars in each of the disciplines.

Existing terms (pulling lists)

Existing branches (clean up)

Adding new terms

Adding new branches: Food studies, Urban studies, etc.

Constructing new rules and refining existing rules

Testing content

Page 10: JSTOR Sustainability Collection - DHUG 2015

Enlisting subject matter experts

We asked faculty members in ten disciplines to review the subset of terms assembled for their discipline, with an eye toward:

Is this how people in the field express this concept?

Is it correctly included in the sustainability facet?

Are there any important terms or concepts that we've missed? (including acronyms, synonyms, variant spellings, inverted phrases)

Page 11: JSTOR Sustainability Collection - DHUG 2015

SME spreadsheets

Each SME approached their subject area slightly differently: some were reluctant to give much feedback, while others gave large amounts of feedback to sift through.

Example of terms pulled from Law, Public administration/policy and International/global studies

Page 12: JSTOR Sustainability Collection - DHUG 2015

View - Facet provides an alphabetical list of all tagged terms

Page 13: JSTOR Sustainability Collection - DHUG 2015

The Results!!

labs.jstor.org/sustainability

Page 14: JSTOR Sustainability Collection - DHUG 2015

Implementation of the Sustainability Prototype

The thesaurus and semantic index are used for content discovery and presentation

The identification of a “sustainability collection” from the JSTOR corpus was performed using topic modeling (specifically LDA – Latent Dirichlet Allocation)

A model of 100 topics was generated from the content

Staff assigned sustainability scores for each of the topics based on a review of the top words in each topic

Each document in the JSTOR corpus was then assigned a sustainability score of 0-9 based on the sustainability scores for the topics most closely associated with the document
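The last step above can be sketched roughly as follows. The slides do not give the exact formula, so this is an assumption: the document score is taken as the weighted average of the staff-assigned topic scores over the document's top few topics, rounded to the 0-9 range. The topic names, weights, scores and the top_k cutoff are all hypothetical; in practice the per-document topic weights would come from the LDA model.

```python
def document_score(topic_weights, topic_scores, top_k=3):
    """Score a document from its most strongly associated topics.

    topic_weights: topic id -> weight of that topic in the document
                   (assumed to come from an LDA model).
    topic_scores:  topic id -> staff-assigned sustainability score (0-9).
    top_k:         how many top topics to use (an assumed cutoff).
    """
    top = sorted(topic_weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(weight for _, weight in top)
    if total == 0:
        return 0
    average = sum(weight * topic_scores[topic] for topic, weight in top) / total
    return max(0, min(9, round(average)))


# Hypothetical topic scores and document/topic weights.
topic_scores = {"t_energy": 9, "t_farming": 7, "t_poetry": 0, "t_finance": 2}
doc_a = {"t_energy": 0.5, "t_farming": 0.3, "t_finance": 0.2}
doc_b = {"t_poetry": 0.8, "t_finance": 0.2}

print(document_score(doc_a, topic_scores))
print(document_score(doc_b, topic_scores))
```

A cutoff on this score is one plausible way to decide which documents belong to the collection, though the slides do not say which threshold was used.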

Page 15: JSTOR Sustainability Collection - DHUG 2015

Weighting of document-level indexed terms

Document-level weights were computed for each semantic term using TF-IDF

TF-IDF is a measure of how important a word is to a document in a collection

The TF-IDF value increases proportionally to the number of times the word appears in a document (the ‘TF’ or term frequency), but is offset by how common the word is in a corpus (the ‘IDF’ or inverse document frequency)

The TF-IDF weighted terms are used to:

order the terms displayed for each document

boost document relevancy when index terms are used in discovery
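A minimal sketch of the weighting, assuming one common TF-IDF variant (relative term frequency multiplied by the natural log of the inverse document frequency). The slides do not say which variant or smoothing was used, and the sample documents and terms are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF weights for index terms.

    docs: doc id -> list of index terms assigned to that document
          (repeats count toward term frequency).
    Returns doc id -> {term: weight}.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))  # count each term once per document

    weights = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        weights[doc_id] = {
            term: (count / len(terms)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


def ranked_terms(weights, doc_id):
    """Order a document's terms for display, highest weight first."""
    doc = weights[doc_id]
    return sorted(doc, key=lambda term: (-doc[term], term))


# Hypothetical index terms for three documents.
docs = {
    "d1": ["water", "water", "policy", "cities"],
    "d2": ["policy", "forestry"],
    "d3": ["policy", "cities", "food"],
}
weights = tf_idf(docs)
print(ranked_terms(weights, "d1"))
```

In this unsmoothed variant a term that appears in every document ("policy" here) gets a weight of zero, which is the offsetting effect described above taken to its limit.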

Page 16: JSTOR Sustainability Collection - DHUG 2015

Auto-suggest and refining results

[Thesaurus slide: a new thing, metadata we create, screenshot(s) of Sustainability Portal]
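The auto-suggest behavior can be approximated with a simple prefix lookup over the tagged terms. This is a sketch only: the slides do not describe how the prototype implements suggestions, and the term list, function names and result limit here are assumptions.

```python
import bisect


def build_index(terms):
    """Sort terms by lowercase form so prefix matches are contiguous."""
    return sorted((term.lower(), term) for term in terms)


def suggest(index, prefix, limit=5):
    """Return up to `limit` terms that start with `prefix` (case-insensitive)."""
    key_prefix = prefix.lower()
    start = bisect.bisect_left(index, (key_prefix,))
    matches = []
    for key, term in index[start:]:
        if not key.startswith(key_prefix):
            break
        matches.append(term)
        if len(matches) == limit:
            break
    return matches


# Hypothetical sustainability-facet terms.
index = build_index([
    "Urban planning", "Urbanization", "Urban agriculture",
    "Forestry", "Food security",
])
print(suggest(index, "urb"))
print(suggest(index, "fo"))
```

A production search component would more likely rely on the search engine's own suggester; the sorted list simply shows how a flat facet list lends itself to this kind of lookup.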

Page 17: JSTOR Sustainability Collection - DHUG 2015

Refinements in our use of the thesaurus and semantic index in sustainability

Automatic calculation of the sustainability score using LDA topics and the thesaurus sustainability facet

Calculate topics and term correlations

Compute sustainability score for each topic based on the most relevant terms and sustainability facet

Compute a sustainability score for each corpus document based on topic weights and topic sustainability score
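The three steps above might look roughly like this. The scoring rule is an assumption made for illustration: a topic's score is the share of its top term correlations that falls on sustainability-facet terms, scaled to 0-9, and a document's score is the topic-weighted sum of those topic scores. The terms, correlations and weights are hypothetical.

```python
def topic_score(term_correlations, facet_terms, top_n=10):
    """Score a topic by how much of its top-n correlation mass is in the facet.

    term_correlations: thesaurus term -> correlation with the topic.
    facet_terms:       set of terms tagged with the sustainability facet.
    """
    top = sorted(term_correlations.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    total = sum(corr for _, corr in top)
    if total <= 0:
        return 0.0
    in_facet = sum(corr for term, corr in top if term in facet_terms)
    return 9 * in_facet / total


def score_document(topic_weights, topic_scores):
    """Combine topic weights and topic scores into a 0-9 document score."""
    return round(sum(w * topic_scores[t] for t, w in topic_weights.items()))


# Hypothetical facet terms and topic/term correlations.
facet = {"renewable energy", "land use", "water resources"}
topics = {
    "t1": {"renewable energy": 0.5, "land use": 0.25, "taxation": 0.25},
    "t2": {"sonnets": 0.75, "meter": 0.25},
}
topic_scores = {name: topic_score(corr, facet) for name, corr in topics.items()}

print(topic_scores)
print(score_document({"t1": 0.5, "t2": 0.5}, topic_scores))
```

Compared with staff review of each topic's top words, a rule like this can be rerun automatically whenever the topic model or the facet changes.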

Automated LDA topic labeling

Labeling topics generated by unsupervised topic modeling is an ongoing challenge

We’re investigating the feasibility of using the same topic/term correlations used to compute sustainability scores to assign labels

The approach attempts to find the thesaurus term that best characterizes the most highly correlated terms for each topic
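One way such labeling could work, sketched under loud assumptions: the slides describe this only as an investigation, so the rule below (pick the most specific thesaurus term whose branch covers at least a given share of the topic's correlation mass) is our illustration, not the method being tested. The broader-term map, the sample correlations and the min_share threshold are hypothetical, and the hierarchy is assumed to be acyclic.

```python
def ancestors(term, broader):
    """Return the term plus every broader term reachable from it."""
    seen, stack = [], [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(broader.get(current, []))
    return seen


def depth(term, broader):
    """Distance from a top term (0 for terms with no broader term)."""
    parents = broader.get(term, [])
    return 0 if not parents else 1 + min(depth(p, broader) for p in parents)


def label_topic(term_correlations, broader, min_share=0.6):
    """Pick the deepest term covering at least min_share of the correlations."""
    total = sum(term_correlations.values())
    coverage = {}
    for term, corr in term_correlations.items():
        for candidate in ancestors(term, broader):
            coverage[candidate] = coverage.get(candidate, 0.0) + corr
    candidates = [t for t, c in coverage.items() if c >= min_share * total]
    if not candidates:
        return None
    return max(candidates, key=lambda t: (depth(t, broader), coverage[t], t))


# Hypothetical BT links (term -> broader terms) and topic correlations.
broader = {
    "Solar energy": ["Renewable energy"],
    "Wind energy": ["Renewable energy"],
    "Renewable energy": ["Energy"],
    "Taxation": ["Economics"],
}
topic = {"Solar energy": 0.5, "Wind energy": 0.25, "Taxation": 0.25}
print(label_topic(topic, broader))
```

Here "Renewable energy" is preferred over "Energy" because both cover the same share of the topic but the former is more specific.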

Page 18: JSTOR Sustainability Collection - DHUG 2015

Other JSTOR Labs projects/tools using the thesaurus and semantic index

http://labs.jstor.org/jthes/

http://labs.jstor.org/snap/

http://labs.jstor.org/readings/

Thesaurus Visualization Tool

Page 19: JSTOR Sustainability Collection - DHUG 2015

And some other JSTOR Labs projects

http://labs.jstor.org/reflowit/

http://labs.jstor.org/shakespeare/