Ariadne’s Thread: Exploring a world of networked information built from free-text metadata. Rob Koopman, Shenghui Wang, OCLC Research, 12 March 2015, KNAW eHumanities group
Ariadne's Thread -- Exploring a world of networked information built from free-text metadata

Jul 17, 2015

Shenghui Wang
Transcript
Page 1: Ariadne's Thread -- Exploring a world of networked information built from free-text metadata

Ariadne’s Thread

Exploring a world of networked information built from free-text metadata

Rob Koopman, Shenghui Wang

OCLC Research

12 March 2015

KNAW eHumanities group

Page 2:

What would you do if you were interested in a topic?

Page 5:

It is difficult to answer these questions:

• What are the different aspects of this topic?

• Are there related aspects missing in my search terms?

• Who are the most prominent authors about this topic?

• Which journals publish most about this topic?

• How have others (e.g. librarians) described and classified this topic?

Page 6:

Demo examples

• http://thoth.pica.nl/demo/relate

Page 7:

How do we do this?

● Offline: Build a low-dimensional semantic representation using Random Projection

● Online: Interactive exploration of networked entities

Page 8:

A MARC record

title

authors

issn

dewey

publisher

Page 9:

Step 1: Build semantic representation

• Direct co-occurrence based

Does prekindergarten improve school preparation and performance?

[author:loeb susanna]

[issn:0272-7757]

• Indirect co-occurrence based (context based)

The effects of state prekindergarten programs on young children’s school readiness in five states

[author:jung kwanghee]

[subject:readiness for school]
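
The indirect (context-based) case can be illustrated with a toy sketch. It assumes that each bracketed entity belongs to the record whose title precedes it on the slide, and that an entity's context is simply the bag of title words of its records. The tokenizer, the `records` layout and the function names are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter
from math import sqrt

# Two toy records, using the titles and entities shown on the slide.
records = [
    {"title": "Does prekindergarten improve school preparation and performance?",
     "entities": ["author:loeb susanna", "issn:0272-7757"]},
    {"title": "The effects of state prekindergarten programs on young "
              "children's school readiness in five states",
     "entities": ["author:jung kwanghee", "subject:readiness for school"]},
]

def tokens(text):
    """Very rough tokenizer: lowercase alphabetic runs, no stop-word removal."""
    return re.findall(r"[a-z]+", text.lower())

def entity_contexts(records):
    """Collect, for every entity, the counts of words it co-occurs with."""
    contexts = {}
    for record in records:
        words = tokens(record["title"])
        for entity in record["entities"]:
            contexts.setdefault(entity, Counter()).update(words)
    return contexts

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

contexts = entity_contexts(records)
# Same record: identical contexts (direct co-occurrence).
direct = cosine(contexts["author:loeb susanna"], contexts["issn:0272-7757"])
# Different records that share the words "prekindergarten" and "school":
# the entities never co-occur, yet their contexts overlap.
indirect = cosine(contexts["author:loeb susanna"],
                  contexts["subject:readiness for school"])
print(direct, indirect)
```

The two entities in the second comparison never appear in the same record, but their similarity is still above zero because their records share vocabulary.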

Page 10:

Dataset

● WorldCat, 300+ million records

● Selected 13 million items (topical terms, authors, ISSNs, Dewey decimal codes, publishers, subject headings)

● Represented by 6 million topical terms

But a matrix of 13M × 6M is too big to process
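
A back-of-the-envelope calculation shows why the full matrix is out of reach. The 4 bytes per cell is an assumption made here for illustration (a dense 32-bit value per cell); the slides do not state a storage format.

```python
n_items = 13_000_000      # rows: selected items
n_terms = 6_000_000       # columns: topical terms
bytes_per_cell = 4        # assumption: one dense 32-bit value per cell

cells = n_items * n_terms
dense_bytes = cells * bytes_per_cell
dense_terabytes = dense_bytes / 1e12

print(f"{cells:.2e} cells, about {dense_terabytes:.0f} TB if stored densely")
```

Under that assumption the dense matrix would need hundreds of terabytes, which motivates the dimension reduction that follows.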

Page 11:

Dimension reduction based on Random Projection

C: a co-occurrence matrix

R: a random matrix with entries ±1

C’ = C · R: approximation of C after random projection (the semantic matrix)

Koopman, R., Wang, S., Scharnhorst, A., Englebienne, G.: Ariadne’s thread: Interactive navigation in a world of networked information. In: CHI’15 Extended Abstracts.
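
A minimal sketch of random projection on toy data, assuming NumPy. The matrix sizes, the number of output columns (512) and the function names are illustrative assumptions, not the values used in Ariadne.

```python
import numpy as np

def random_projection(C, n_columns, seed=0):
    """Multiply C (items x terms) by a random +/-1 matrix R (terms x n_columns)."""
    rng = np.random.default_rng(seed)
    R = rng.choice(np.array([-1.0, 1.0]), size=(C.shape[1], n_columns))
    return C @ R

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy co-occurrence matrix: 3 items over 1000 terms.
# Items 0 and 1 share 90 of their 100 terms; item 2 shares none with item 0.
C = np.zeros((3, 1000))
C[0, 0:100] = 1.0
C[1, 10:110] = 1.0
C[2, 500:600] = 1.0

C_proj = random_projection(C, n_columns=512)

# Cosine similarities are approximately preserved in the smaller space.
print(cosine(C[0], C[1]), cosine(C_proj[0], C_proj[1]))
print(cosine(C[0], C[2]), cosine(C_proj[0], C_proj[2]))
```

The projected vectors have 512 columns instead of 1000, yet similar items stay similar and unrelated items stay close to orthogonal.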

Page 12:

Scaling up

- We first select the items we want to build vectors for.

- Matrices C and R are never stored.

- We only store C’, a matrix of about 16 GB.

- We read metadata records sequentially and update C’ for each “co-occurrence”.

- Cost is of order N_totalwords × N_columns.

- Can be done in parallel.

- Can use Hadoop.
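
One way to update C’ without ever storing C or R is to regenerate each term's ±1 row on demand from a hash of the term. The sketch below assumes that approach; the slides do not say how the random rows are produced, and `term_vector`, `build_semantic_matrix` and the record layout are hypothetical names for illustration.

```python
import zlib
import numpy as np

N_COLUMNS = 64  # illustrative; far smaller than a realistic setting

def term_vector(term, n_columns=N_COLUMNS):
    """Regenerate the +/-1 row of R for a term, seeded by a hash of the term."""
    seed = zlib.crc32(term.encode("utf-8"))
    rng = np.random.default_rng(seed)
    return rng.integers(0, 2, size=n_columns) * 2.0 - 1.0

def build_semantic_matrix(records, n_columns=N_COLUMNS):
    """Stream over (entities, terms) records and accumulate the rows of C'."""
    c_prime = {}
    for entities, terms in records:
        # Sum of the random rows of all terms in this record:
        # one pass over the words, n_columns operations per word.
        record_vec = np.zeros(n_columns)
        for term in terms:
            record_vec += term_vector(term, n_columns)
        # Every selected entity in the record co-occurs with all its terms.
        for entity in entities:
            row = c_prime.setdefault(entity, np.zeros(n_columns))
            row += record_vec  # in-place update of the stored row
    return c_prime

records = [
    (["author:loeb susanna", "issn:0272-7757"],
     ["prekindergarten", "school", "preparation", "performance"]),
    (["author:jung kwanghee", "subject:readiness for school"],
     ["prekindergarten", "school", "readiness", "states"]),
]
c_prime = build_semantic_matrix(records)
print(len(c_prime), c_prime["author:loeb susanna"][:8])
```

Because each update only adds to a row of C’, records can be split across workers and the partial matrices summed afterwards.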

Page 13:

Step 2: Interactive exploration

- Let the user input a search term

- Calculate the top 500 most related candidates

- Find mutually related items

- Convert distances to probabilities

- Project to 2D
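
The candidate-selection step can be sketched as a cosine-similarity ranking over the rows of the semantic matrix. Cosine similarity as the relatedness measure and the brute-force scan are assumptions made for this sketch; `top_related` is a hypothetical helper.

```python
import numpy as np

def top_related(query_index, vectors, k=500):
    """Return indices and cosine similarities of the k rows closest to the query row."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)
    sims = unit @ unit[query_index]
    sims[query_index] = -np.inf          # never return the query itself
    k = min(k, len(sims) - 1)
    order = np.argsort(-sims)[:k]        # highest similarity first
    return order, sims[order]

# Toy semantic vectors: row 1 is close to row 0, row 3 points the opposite way.
vectors = np.array([[1.0, 0.0],
                    [0.9, 0.1],
                    [0.0, 1.0],
                    [-1.0, 0.0]])
indices, similarities = top_related(0, vectors, k=2)
print(indices, similarities)
```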

Page 14:

Step 2: Interactive exploration

- Let the user select a term

- Calculate the top 500 most similar candidates

- Find mutually related items

- Convert distances to probabilities

- Project to 2D

Page 15:

Raw

Page 16:

Find mutually related items

- Basic assumption: my friends are the ones who consider me a friend too.

- For each candidate, calculate the average distance and standard deviation to all other candidates in the top 500 list.

- Keep the candidates with the highest z-scores for the selected term.
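
One possible reading of this filter, sketched below: for each candidate, the distance to the selected term is standardised against that candidate's distances to all other candidates, so a high z-score means the selected term is unusually close from the candidate's own point of view. The sign convention, the use of cosine distance and the function names are assumptions, not details confirmed by the slides.

```python
import numpy as np

def cosine_distances(X, Y):
    """Pairwise cosine distances between the rows of X and the rows of Y."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return 1.0 - Xn @ Yn.T

def mutually_related(query_vec, cand_vecs, keep):
    """Keep the candidates that consider the query unusually close."""
    n = len(cand_vecs)
    D = cosine_distances(cand_vecs, cand_vecs)
    # Distances from each candidate to all *other* candidates.
    others = D[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    mu = others.mean(axis=1)
    sigma = others.std(axis=1)
    d_query = cosine_distances(cand_vecs, query_vec[None, :])[:, 0]
    # High z-score: the query is much closer than the candidate's average.
    z = (mu - d_query) / np.where(sigma == 0, 1.0, sigma)
    order = np.argsort(-z)[:keep]
    return order, z[order]

query = np.array([1.0, 0.0, 0.0])
candidates = np.array([
    [1.0, 0.1, 0.0],   # close to the query
    [1.0, 0.0, 0.1],   # close to the query
    [1.0, 1.0, 1.0],   # a "hub": moderately close to everything
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
kept, scores = mutually_related(query, candidates, keep=2)
print(kept, scores)
```

In this toy case the hub candidate is dropped: it is roughly as close to every other candidate as to the query, so the query is not special from its point of view.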

Page 17:

Cooked

Page 18:

Visualise in 2D

- Multidimensional scaling

- Simple spring model

- Verlet integration
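
A minimal sketch of a spring layout advanced with damped position-Verlet steps. It assumes every pair of items is joined by a spring whose rest length is the target distance; the step size, damping factor and function name are illustrative choices, and the slides do not say how the multidimensional scaling and spring steps are combined.

```python
import numpy as np

def spring_layout(target, n_steps=1000, dt=0.2, damping=0.9, seed=0):
    """Place n points in 2D so pairwise distances approach `target` (n x n)."""
    n = target.shape[0]
    rng = np.random.default_rng(seed)
    pos = rng.normal(scale=0.1, size=(n, 2))
    prev = pos.copy()                      # previous positions (zero initial velocity)
    for _ in range(n_steps):
        diff = pos[:, None, :] - pos[None, :, :]   # diff[i, j] = pos[i] - pos[j]
        dist = np.linalg.norm(diff, axis=2)
        np.fill_diagonal(dist, 1.0)                # avoid division by zero
        stretch = dist - target                    # > 0: spring is too long
        np.fill_diagonal(stretch, 0.0)
        # Hooke's law: pull i towards j when stretched, push away when compressed.
        force = -(stretch / dist)[:, :, None] * diff
        accel = force.sum(axis=1) / n
        # Damped position Verlet: the velocity is implicit in (pos - prev).
        new = pos + damping * (pos - prev) + accel * dt ** 2
        prev, pos = pos, new
    return pos

# Target distances of a 3-4-5 right triangle.
target = np.array([[0.0, 3.0, 4.0],
                   [3.0, 0.0, 5.0],
                   [4.0, 5.0, 0.0]])
pos = spring_layout(target)
final = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=2)
print(np.round(final, 3))
```

Verlet integration needs only the current and previous positions, which keeps the per-frame update cheap enough for interactive use.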

Page 19:

Convert distances to probabilities

Idea from SNE to get more structure:

- Calculate z-score of each item

- Convert these scores to probabilities

- Average A to B and B to A probabilities

L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008.
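
The three steps above can be sketched as follows. The conversion from z-scores to probabilities is assumed here to be a softmax over negated z-scores; the slides do not specify the conversion, and SNE itself uses a Gaussian kernel. The function name is hypothetical.

```python
import numpy as np

def distances_to_probabilities(D):
    """Turn a symmetric distance matrix into symmetric pairwise probabilities."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    P = np.zeros((n, n))
    for i in range(n):
        d = D[i, off_diag[i]]                      # distances from i to the others
        sigma = d.std()
        z = (d - d.mean()) / (sigma if sigma > 0 else 1.0)
        w = np.exp(-z)                             # assumption: softmax of -z
        P[i, off_diag[i]] = w / w.sum()            # P[i, j]: "i picks j"
    return (P + P.T) / 2.0                         # average A-to-B and B-to-A

D = np.array([[0.0, 0.1, 0.9, 1.0],
              [0.1, 0.0, 0.8, 0.9],
              [0.9, 0.8, 0.0, 0.2],
              [1.0, 0.9, 0.2, 0.0]])
S = distances_to_probabilities(D)
print(np.round(S, 3))
```

Standardising per item means each item distributes its probability relative to its own neighbourhood, which is what brings out more structure than raw distances.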

Page 20:

cosine similarity

Page 21:

Probability

Page 22:

Future work

● Compare the algorithm to other existing algorithms - benchmarking

● Improve visualisation (more simple NLP)

● More functionality (timeline, history)

● More metadata fields (publisher, subject, identifiers)

● Extend the implementation to other databases

Page 23:

Future work

● Identify applications, e.g.

o Author name disambiguation

o Matching chemical molecules

● Prepare user scenarios for usability testing

Page 24:

Thank you

[email protected]

[email protected]

http://thoth.pica.nl/relate (ArticleFirst)

http://thoth.pica.nl/astro/relate (Astrophysics articles)

http://thoth.pica.nl/demo/relate (WorldCat)