Contextualization of Topics: browsing through terms, authors, journals and cluster allocations. Rob Koopman (1), Shenghui Wang (1), Andrea Scharnhorst (2); (1) OCLC Research, (2) DANS-KNAW. ISSI 2015

Contextualization of topics - browsing through terms, authors, journals and cluster allocations

Aug 12, 2015

Shenghui Wang
Transcript
Page 1: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

Introduction Method Results Summary

Contextualization of Topics: browsing through terms, authors, journals and cluster allocations

Rob Koopman (1), Shenghui Wang (1), Andrea Scharnhorst (2)

(1) OCLC Research, (2) DANS-KNAW

ISSI 2015

Page 2: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

Introduction

What are the essence and the boundaries of a scientific field?

There are different ways to find clusters in scientific literature, based on connectivity in terms of authorship, citations, language similarity, etc.

The ambiguous nature of science

Page 3: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

Ariadne: interactive context explorer

Ariadne is an interactive interface which allows users to explore the context of entities such as authors, journals, topical terms, etc.

It builds on semantic indexing statistically computed from a large-scale bibliographic corpus

It was originally implemented to explore 1M topical terms, 3M authors, 35K journals and 700+ Dewey decimal classes associated with 65M articles.

Page 4: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

Research questions

Q1: How does the Ariadne algorithm work on a much smaller, field-specific dataset?

Q2: Can we use Ariadne to label the clusters produced by the different methods?

Q3: Can we use Ariadne to compare different clustering solutions?

Page 5: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

LittleAriadne

LittleAriadne: context explorer over Astrophysics data

Offline: generates a semantic representation for each entity

Online: finds the most related entities and displays them using multidimensional scaling
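
The online display step can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say which multidimensional scaling variant is used, so the code uses a plain gradient-descent minimisation of the MDS stress, and the dissimilarity matrix `dist` (for example, 1 minus cosine similarity between three entities) is made up. The names `stress` and `mds_2d` are hypothetical.

```python
import math
import random

def stress(layout, dist):
    """Sum of squared differences between layout distances and target distances."""
    n = len(layout)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += (math.dist(layout[i], layout[j]) - dist[i][j]) ** 2
    return total

def mds_2d(dist, steps=500, lr=0.05, seed=0):
    """Place n items in 2D so that pairwise distances approximate `dist`.

    Gradient descent on the stress; the best layout seen is returned.
    """
    rng = random.Random(seed)
    n = len(dist)
    layout = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    best, best_stress = [p[:] for p in layout], stress(layout, dist)
    for _ in range(steps):
        grads = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(layout[i], layout[j]) or 1e-9  # avoid division by zero
                coeff = 2.0 * (d - dist[i][j]) / d
                grads[i][0] += coeff * (layout[i][0] - layout[j][0])
                grads[i][1] += coeff * (layout[i][1] - layout[j][1])
        for i in range(n):
            layout[i][0] -= lr * grads[i][0]
            layout[i][1] -= lr * grads[i][1]
        s = stress(layout, dist)
        if s < best_stress:
            best, best_stress = [p[:] for p in layout], s
    return best, best_stress

# Made-up dissimilarities between three entities (two close, one far away).
dist = [
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.85],
    [0.9, 0.85, 0.0],
]
layout, final_stress = mds_2d(dist)
for point in layout:
    print([round(c, 3) for c in point])
```

Entities with small dissimilarity should end up close together in the 2D layout, which is the property the interactive view relies on.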

Page 6: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

An example article

Article ID ISI:000276828000006

Title On the Mass Transfer Rate in SS Cyg

Abstract The mass transfer rate in SS Cyg at quiescence, estimated from the observed luminosity of the hot spot, is log M-tr = 16.8 +/- 0.3. This is safely below the critical mass transfer rates of log M-crit = 18.1 (corresponding to log T-crit(0) = 3.88) or log M-crit = 17.2 (corresponding to the “revised” value of log T-crit(0) = 3.65). The mass transfer rate during outbursts is strongly enhanced

Author [author:smak j]

ISSN [issn:0001-5237]

Subject [subject:accretion, accretion disks] [subject:cataclysmic variables] [subject:disc instability model] [subject:dwarf novae] [subject:novae, cataclysmic variables] [subject:outbursts] [subject:parameters] [subject:stars] [subject:stars dwarf novae] [subject:stars individual ss cyg] [subject:state] [subject:superoutbursts]

Page 7: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

An example article

Cluster label [cluster:a 19] [cluster:b 16] [cluster:c 15] [cluster:d 51] [cluster:e 17] [cluster:f 1]

Page 8: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

Six different clustering solutions

x   Source     y = #Clusters   #Clusters in LittleAriadne
a   cwts 1.8   23              23
b   UMSI       23              23
c   oclc 20    20              20
d   hu         139             48
e   sts        5664            229
f   ECOOM      15              15

Page 9: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

Entities in the Astrophysics dataset

There are in total 90,343 entities associated with 111,616 astrophysics articles

59 journals

27,027 author names (no disambiguation applied)

39,577 topical terms

23,322 subjects (extracted from "Author Keywords" and "Keywords Plus")

358 cluster labels (source + cluster id)

Page 10: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

Build semantic representation

Basic assumptions

An entity can be represented by its context

Entities which share more context are more likely to be related

Context is the textual environment where an entity occurs
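
The assumptions above can be illustrated with a minimal sketch. It assumes that each article is reduced to the set of entities it mentions and that an entity's context is simply the count of other entities it co-occurs with. The toy records and the names `build_context` and `shared_context` are hypothetical, not the authors' implementation.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: each article is the set of entities it mentions.
articles = [
    {"mass transfer rate", "[subject:dwarf novae]", "[author:smak j]"},
    {"mass transfer rate", "[subject:dwarf novae]", "[subject:outbursts]"},
    {"dark energy", "[subject:cosmology]"},
]

def build_context(articles):
    """Map each entity to a Counter of the entities it co-occurs with."""
    context = defaultdict(Counter)
    for entities in articles:
        for entity in entities:
            for other in entities:
                if other != entity:
                    context[entity][other] += 1
    return context

def shared_context(context, a, b):
    """Amount of co-occurrence context that entities a and b have in common."""
    return sum((context[a] & context[b]).values())

context = build_context(articles)
# The author and the subject never co-occur, yet they share two context entities.
print(shared_context(context, "[author:smak j]", "[subject:outbursts]"))
# Entities from unrelated articles share no context.
print(shared_context(context, "[author:smak j]", "dark energy"))
```

Note that the author and the subject in the first comparison never appear in the same article; they are related only through their shared context, which is the point of the second assumption.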

Page 12: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

Dimension reduction using Random Projection

[Figure: entities from the example article: "mass transfer rate", [subject:outburst], [subject:stars], [subject:parameters], [author:smak j], [cluster:a 19], [issn:0001-5237]]
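
A minimal sketch of dimension reduction by random projection is given below. The slides do not specify the projection matrix or the target dimension, so this assumes a sign-based (+1/-1) random vector per context feature and a dimension of 64, both chosen only for illustration. The names `random_index_vector` and `project` are hypothetical.

```python
import math
import random

def random_index_vector(feature, dim, seed=0):
    """Deterministic pseudo-random +1/-1 vector for one context feature."""
    rng = random.Random(f"{seed}:{feature}")
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]

def project(context_counts, dim=64, seed=0):
    """Project a sparse {feature: count} context into `dim` dimensions.

    Equivalent to multiplying the sparse count vector by a random
    +1/-1 matrix scaled by 1/sqrt(dim).
    """
    vec = [0.0] * dim
    for feature, count in context_counts.items():
        r = random_index_vector(feature, dim, seed)
        for k in range(dim):
            vec[k] += count * r[k]
    scale = 1.0 / math.sqrt(dim)
    return [v * scale for v in vec]

# Hypothetical sparse context of one entity (feature -> co-occurrence count).
ctx = {"[subject:dwarf novae]": 2, "[author:smak j]": 1}
vec = project(ctx, dim=64)
print(len(vec))
```

The entity is now a dense vector of fixed length no matter how many distinct context features exist in the corpus, and distances between such vectors approximately preserve those between the original sparse contexts.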

Page 14: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

From semantic representation to visualisation and more

Each entity has its semantic representation

Cosine similarity between entities can be computed very fast, and the 2D visualisation is built on it

For each article, we collected the semantic representations of all the entities it involves and took their average as the article's semantic representation

We applied a standard K-means clustering method to cluster these articles based on their semantic representations
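
The three steps above (cosine similarity, averaging entity vectors per article, K-means over articles) can be sketched as follows. The entity vectors are made-up 2D values, and the small Lloyd-style `kmeans` function is a generic textbook version, not necessarily the implementation used by the authors.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def average(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd iteration; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        for i, p in enumerate(points):
            assignment[i] = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = average(members)
    return assignment

# Hypothetical entity vectors (real ones would come from the offline step).
entity_vectors = {
    "stars": [1.0, 0.0],
    "dwarf novae": [0.8, 0.2],
    "cosmology": [0.0, 1.0],
    "dark energy": [0.2, 0.8],
}
# Hypothetical articles, each listed by the entities it involves.
articles = [
    ["stars", "dwarf novae"],
    ["stars"],
    ["cosmology", "dark energy"],
    ["cosmology"],
]
article_vectors = [average([entity_vectors[e] for e in a]) for a in articles]
assignment = kmeans(article_vectors, k=2)
print(assignment)
print(round(cosine(entity_vectors["stars"], entity_vectors["dwarf novae"]), 3))
```

On this toy input the two stellar articles fall into one cluster and the two cosmology articles into the other; which cluster receives index 0 depends on the random initialisation.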

Page 16: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

Experiment 1: Exploring context

Now we can explore

Let’s start with stars

An overview of all journals

Page 17: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

Contextual view of stars

Page 18: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

Experiment 2: Labelling clusters

What is cluster a 2?

Page 19: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

Experiment 2: Labelling clusters

Page 20: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

Experiment 2: Labelling clusters

Cluster ID   Top 9 most related topical terms

a 2    "cosmology", "dark energy", "density perturbations", "cosmologies", "planck", "cosmological", "spatial curvature", "inflationary", "inflation"

b 2    "cosmology", "cosmological constant", "cosmologies", "cosmological", "universes", "dark energy", "quadratic", "tensor", "planck"

c 17   "power spectrum", "cosmological parameters", "cmb", "last scattering", "anisotropies", "microwave background", "power spectra", "planck", "cosmic microwave"

d 28   "density perturbations", "inflationary", "inflation", "dark energy", "scale invariant", "spatial curvature", "cosmological perturbations", "inflationary models", "cosmologies"
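
A list like the one above could be produced by ranking topical terms by their similarity to a cluster label. The sketch below assumes cosine similarity as the ranking measure, and assumes that topical terms are the entities without a bracketed type prefix such as `[author:...]`. The vectors are invented and `top_terms` is a hypothetical name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_terms(cluster_label, vectors, k=9):
    """Return the k topical terms most similar to the given cluster label."""
    target = vectors[cluster_label]
    scored = [
        (cosine(target, vec), name)
        for name, vec in vectors.items()
        if not name.startswith("[")  # assumption: typed entities are bracketed
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Invented semantic vectors for a handful of entities.
vectors = {
    "[cluster:a 2]": [1.0, 0.1],
    "[author:smak j]": [0.1, 1.0],
    "cosmology": [0.9, 0.1],
    "dark energy": [0.8, 0.3],
    "stars": [0.0, 1.0],
}
print(top_terms("[cluster:a 2]", vectors, k=2))
```

The ranked terms are only candidates for a label; turning them into a real topic name still needs expert judgement.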

Page 21: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

Experiment 3: Comparing clustering solutions

Cluster labels are treated as entities

Let’s compare
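
Because cluster labels are entities with their own semantic vectors, clusters from different solutions can be matched by similarity. The sketch below assumes cosine similarity and the `[cluster:<solution> <id>]` naming seen in the example article. The vectors are invented and `best_match` is a hypothetical helper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def best_match(cluster_label, solution, vectors):
    """Find the cluster of `solution` most similar to `cluster_label`."""
    prefix = f"[cluster:{solution} "
    target = vectors[cluster_label]
    candidates = [
        (cosine(target, vec), name)
        for name, vec in vectors.items()
        if name.startswith(prefix)
    ]
    score, name = max(candidates)
    return name, score

# Invented vectors for cluster labels from two solutions, a and b.
vectors = {
    "[cluster:a 2]": [1.0, 0.1],
    "[cluster:a 19]": [0.0, 1.0],
    "[cluster:b 2]": [0.9, 0.2],
    "[cluster:b 16]": [0.1, 1.0],
}
for label in ("[cluster:a 2]", "[cluster:a 19]"):
    match, score = best_match(label, "b", vectors)
    print(label, "->", match, round(score, 3))
```

A high best-match score in both directions suggests two solutions agree on that topic, while a low or one-sided score points to partial agreement.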

Page 22: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

Highly similar clustering solutions

Page 23: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

Partially agreeing clustering solutions

Page 24: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

An overview of all clustering solutions

Page 25: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

Summary

We present a method and an interface that allow visual exploration of the contexts of entities

We can provide the topical terms most related to each cluster, although expert knowledge is needed to transform them into real labels/topics

LittleAriadne provides a visual way of comparing different clustering solutions

Our naïve way of clustering is worth exploring further

Page 26: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

Future extensions

Add more types of entities, such as citations, publishers, conferences, etc., to provide richer context

Add direct links to articles to answer information retrieval needs

Study context sensitivity

compare ”young” and ”young”

Page 27: Contextualization of topics - browsing through terms, authors, journals and cluster allocations

Thank you

http://thoth.pica.nl/astro/relate

Rob Koopman ([email protected])
Shenghui Wang ([email protected])
Andrea Scharnhorst ([email protected])