Pascal Visualization Challenge

Blaž Fortuna, IJS

Marko Grobelnik, IJS

Steve Gunn, US

Part I: Challenge details

ePrints

Database of around 1600 papers published by Pascal members

Papers are described with:

Authors (unique Pascal Id)

Title

Abstract (most papers)

Publication date (some papers only have the year)

Challenge Goal

Two main goals:

to test and compare different text visualization methods, ideas and algorithms on a common dataset,

to contribute to the Pascal dissemination and promotion activities by using data about scientific publications from Pascal’s EPrints server.

Task

Visualize and present the Pascal ePrints data in a novel way which enables:

discovering main areas covered by the papers and people in Pascal,

discovering how areas and people develop through time,

recommending to researchers which papers to read,

helping to find the right reviewers for new papers.

Data

Raw XML file from the Pascal ePrints server

Processed data for easier use:

Bag-of-words (TextGarden, Matlab)

Graph (Matlab, Pajek)

The data is processed for several possible scenarios.

Raw XML file

Cleaned data from Pascal ePrints server.

Data is given as a list of papers; each paper is described by:

Title

Abstract

Year of publication

List of authors

Each author is described by a unique Pascal Id and an institution.

<paper id="2080" year="2006">
  <title>Synthesis of Maximum…</title>
  <abstract>In this presentation…</abstract>
  <subjects>
    <subject id="CS">Computati…</subject>
    <subject id="LO">Learning…</subject>
    <subject id="TA">Theory …</subject>
  </subjects>
  <authors>
    <author id="452" institution_id="1">Sandor Szedmak</author>
    <author id="1" institution_id="1">John Shawe-Taylor</author>
  </authors>
  <institutions>
    <institution id="1">Universit…</institution>
  </institutions>
</paper>
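The record layout above can be read with a standard XML parser. The following is a minimal sketch using Python's `xml.etree.ElementTree`; the element and attribute names come from the sample record, while the enclosing `<papers>` root element and the function name `parse_papers` are assumptions made for illustration, since the slides do not show the top level of the file.

```python
import xml.etree.ElementTree as ET

# One record copied from the slide, wrapped in an ASSUMED <papers> root.
SAMPLE = """<papers>
<paper id="2080" year="2006">
  <title>Synthesis of Maximum…</title>
  <abstract>In this presentation…</abstract>
  <subjects>
    <subject id="CS">Computati…</subject>
    <subject id="LO">Learning…</subject>
    <subject id="TA">Theory …</subject>
  </subjects>
  <authors>
    <author id="452" institution_id="1">Sandor Szedmak</author>
    <author id="1" institution_id="1">John Shawe-Taylor</author>
  </authors>
  <institutions>
    <institution id="1">Universit…</institution>
  </institutions>
</paper>
</papers>"""


def parse_papers(xml_text):
    """Return a list of dicts, one per <paper> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    papers = []
    for paper in root.iter("paper"):
        papers.append({
            "id": paper.get("id"),
            "year": paper.get("year"),
            "title": (paper.findtext("title") or "").strip(),
            "abstract": (paper.findtext("abstract") or "").strip(),
            "subjects": [s.get("id") for s in paper.iter("subject")],
            "authors": [
                {
                    "id": a.get("id"),
                    "institution_id": a.get("institution_id"),
                    "name": (a.text or "").strip(),
                }
                for a in paper.iter("author")
            ],
        })
    return papers


papers = parse_papers(SAMPLE)
```

Stripping whitespace from the author text matters here because the raw file may break names across lines.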

Bag-of-words

Covered scenarios:

Document == Paper

Document == Author

Document == Institution

Available formats:

TextGarden: text file where one line equals one document

Matlab: data available in the form of a sparse term-document matrix

TextGarden (www.textmining.net):

Format: Document_name !Subject DocumentList

Example: Support_Vector_Machine_to_synthesise_kernels !Machine_Vision !Theory_and_Algorithms Support Vector Machine to synthesise kernels -- Suppose we are given two sets of …
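A line in this format can be split into its three parts with a few lines of code. The sketch below is written in Python and inferred only from the example line: it assumes the first whitespace-separated token is the document name, that every following token starting with `!` is a subject, and that the remainder is the document text. The function name `parse_line_document` is hypothetical and not part of TextGarden.

```python
def parse_line_document(line):
    """Split one line-document into (name, subjects, text).

    Assumed layout, inferred from the example on the slide:
    name !Subject1 !Subject2 ... free text
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    name = tokens[0]
    subjects = []
    i = 1
    # Collect the leading run of !-prefixed tokens as subjects.
    while i < len(tokens) and tokens[i].startswith("!"):
        subjects.append(tokens[i][1:])
        i += 1
    text = " ".join(tokens[i:])
    return name, subjects, text


example = (
    "Support_Vector_Machine_to_synthesise_kernels !Machine_Vision "
    "!Theory_and_Algorithms Support Vector Machine to synthesise kernels "
    "-- Suppose we are given two sets of …"
)
name, subjects, text = parse_line_document(example)
```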

Matlab: Sparse matrix saved in a text file; it can be read into Matlab with:

X = spconvert(load('papers.dat'));

Documents are columns in the matrix. Names of columns (document names) and rows (words) are provided.
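For readers working outside Matlab, the same file can be read by hand. `spconvert` expects rows of the form `row column value` with 1-based indices, so the sketch below assumes `papers.dat` has exactly that three-column layout. It is written in Python with only the standard library; the function name `read_triplets` and the dictionary-of-columns layout are illustrative choices, not part of the challenge data.

```python
from collections import defaultdict


def read_triplets(lines):
    """Parse 'row col value' triplets into {col: {row: value}}.

    Indices stay 1-based, as in the Matlab file. Repeated (row, col)
    entries are summed, mirroring how Matlab builds sparse matrices.
    """
    columns = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        row = int(float(parts[0]))
        col = int(float(parts[1]))
        value = float(parts[2])
        columns[col][row] = columns[col].get(row, 0.0) + value
    return dict(columns)


# Hypothetical three-entry file: two documents (columns), three words (rows).
sample_lines = ["1 1 2", "3 1 1", "2 2 5"]
matrix = read_triplets(sample_lines)
```

With a real file, `read_triplets(open("papers.dat"))` would give one dictionary of word counts per document, since documents are the columns.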

Graph

Covered scenarios:

Vertex == Word, Edge == Co-Appearance

Vertex == Author, Edge == Co-Authors

Vertex == Institution, Edge == Collaboration

Available formats:

Matlab: data available in the form of a sparse adjacency matrix

Pajek: software for network analysis

Matlab: Sparse matrix saved in a text file; it can be read into Matlab with:

X = spconvert(load('words.dat'));

Names of vertices (words, authors, institutions) are provided.

Pajek: Can be downloaded from: vlado.fmf.uni-lj.si/pub/networks/pajek
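To illustrate the Vertex == Author, Edge == Co-Authors scenario, the sketch below builds a weighted co-authorship graph from per-paper author lists and writes it in Pajek's plain-text `.net` layout (`*Vertices`, then `*Edges`). It is written in Python; the edge weight (number of shared papers) is an assumption, since the slides do not say how the provided graphs are weighted, and the function names and the third author are hypothetical.

```python
from collections import Counter
from itertools import combinations


def coauthor_edges(papers):
    """Count, for each unordered author pair, the papers they share."""
    edges = Counter()
    for authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges


def to_pajek(edges):
    """Render weighted undirected edges as Pajek .net text."""
    names = sorted({v for pair in edges for v in pair})
    index = {name: i + 1 for i, name in enumerate(names)}  # 1-based ids
    lines = [f"*Vertices {len(names)}"]
    lines += [f'{index[n]} "{n}"' for n in names]
    lines.append("*Edges")
    lines += [f"{index[a]} {index[b]} {w}" for (a, b), w in sorted(edges.items())]
    return "\n".join(lines)


# Author lists per paper; "Hypothetical Author" is made up for the example.
paper_authors = [
    ["Sandor Szedmak", "John Shawe-Taylor"],
    ["John Shawe-Taylor", "Sandor Szedmak", "Hypothetical Author"],
]
edges = coauthor_edges(paper_authors)
net_text = to_pajek(edges)
```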

Submissions

The results can be: images, movies, Web sites, VRML files, executables (Windows, Linux), etc.

For an interactive tool, also provide a video showing its use on the Pascal ePrints data.

Evaluation

Usability of visualization – The goal is to assess the usability of a particular visualization in different practical contexts.

Innovativeness – The goal is to estimate how innovative the ideas used for the visualization are.

Aesthetics of the image – Here we aim to identify the "nicest" images from the challenge.

General voting by Pascal researchers over the web on "who likes what".

Since all the criteria are subjective, we will hire experts to judge the quality.

Each of the criteria will generate a separate ranking.

Part II: Examples

Visualization example 1/2: Document Atlas

Bag-of-words approach: Document == Author

An author is described by the sum of all the abstracts of the papers they co-authored.

We construct a separate profile for papers from 2004 and papers from 2005.
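The per-year author profiles described above can be sketched as follows. This is a minimal Python illustration, not the Document Atlas implementation: it assumes plain whitespace tokenization and raw word counts (the slides do not specify stop-word removal or term weighting), and the input dictionaries and the name `author_profiles` are hypothetical.

```python
from collections import Counter, defaultdict


def author_profiles(papers):
    """Return {(author_id, year): Counter of words}.

    Each profile is the sum of the bag-of-words vectors of the
    abstracts that author co-authored in that year.
    """
    profiles = defaultdict(Counter)
    for paper in papers:
        words = paper["abstract"].lower().split()
        for author_id in paper["authors"]:
            profiles[(author_id, paper["year"])].update(words)
    return dict(profiles)


# Made-up abstracts, for illustration only.
toy_papers = [
    {"year": 2004, "authors": ["1", "452"], "abstract": "kernel methods for text"},
    {"year": 2004, "authors": ["1"], "abstract": "kernel learning"},
    {"year": 2005, "authors": ["1"], "abstract": "text visualization"},
]
profiles = author_profiles(toy_papers)
```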

Dimensionality reduction

Documents are mapped from bag-of-words space to two dimensions in two steps:

Latent Semantic Indexing: 13,000 dim => 110 dim

Multidimensional Scaling: 110 dim => 2 dim
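The two steps can be written out in their textbook form. The formulation below is a sketch under stated assumptions: the slides give only the dimensions, so the choice of Euclidean distance and of the raw-stress objective for MDS is an assumption, and the symbols $X$, $k$, $d_{ij}$ and $y_i$ are introduced here for illustration.

```latex
% Step 1: Latent Semantic Indexing as a truncated SVD of the
% term-document matrix X (m = 13{,}000 terms, n documents, k = 110).
X \approx U_k \Sigma_k V_k^{\top}, \qquad
\hat{x}_j = U_k^{\top} x_j \in \mathbb{R}^{k}

% Step 2: Multidimensional Scaling places each document at y_i in R^2
% so that map distances approximate distances in the k-dim LSI space.
d_{ij} = \lVert \hat{x}_i - \hat{x}_j \rVert, \qquad
\min_{y_1,\dots,y_n \in \mathbb{R}^2} \;
\sum_{i<j} \bigl( \lVert y_i - y_j \rVert - d_{ij} \bigr)^2
```

Here $x_j$ is the bag-of-words column of document $j$ and $\hat{x}_j$ its 110-dimensional LSI representation.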

The background reflects the density of documents

Background words

Each part of the map is assigned the keyword that is most representative of the documents in the area.

We get a “map” of the topics covered within the documents.

In the case of the Pascal ePrints data, areas on the map correspond to the areas covered within the Pascal Network.

Time dynamics

For each author we have a profile for each of the years 2004 and 2005.

By showing the difference we can see how authors’ research focus developed between 2004 and 2005.
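One simple way to express that difference is as a displacement vector between an author's map positions in the two years. The Python sketch below assumes each author already has a 2-D position per year (for example from the dimensionality reduction step); the function name `focus_shift` and the coordinates are hypothetical, and the slides do not state how Document Atlas computes or draws the change.

```python
import math


def focus_shift(pos_2004, pos_2005):
    """Return {author: (dx, dy, length)} for authors present in both years."""
    shifts = {}
    for author, (x0, y0) in pos_2004.items():
        if author not in pos_2005:
            continue  # no 2005 profile, so no shift to show
        x1, y1 = pos_2005[author]
        dx, dy = x1 - x0, y1 - y0
        shifts[author] = (dx, dy, math.hypot(dx, dy))
    return shifts


# Made-up map coordinates, for illustration only.
positions_2004 = {"author_a": (0.0, 0.0), "author_b": (1.0, 1.0)}
positions_2005 = {"author_a": (3.0, 4.0)}
shifts = focus_shift(positions_2004, positions_2005)
```

The vector direction shows which topic area an author moved towards, and its length how far their focus moved.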

Co-Authorships

Live Demo

Visualization example 2/2: IST World

Web portal developed within the IST World EU project

Uses search and visualization methods to:

discover the main research areas and collaborations within the PASCAL organizations

produce recommendations on which papers to read (e.g. papers on image recognition, or kernel trick)

find the right reviewers for a new paper (e.g. a paper on "brain computer interface") and assess their competence
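As an illustration of the reviewer-finding task, the sketch below ranks authors by cosine similarity between a new paper's text and each author's bag-of-words profile. This is a generic baseline written in Python, not IST World's method, which the slides do not describe; the function names, the whitespace tokenization, and the toy profiles are all assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_reviewers(query_text, profiles, exclude=()):
    """Return author names sorted by decreasing similarity to the query.

    `exclude` lets the caller drop the paper's own authors.
    """
    query = Counter(query_text.lower().split())
    scored = [
        (cosine(query, profile), name)
        for name, profile in profiles.items()
        if name not in exclude
    ]
    return [name for score, name in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Made-up author profiles, for illustration only.
toy_profiles = {
    "author_a": Counter("brain computer interface signal".split()),
    "author_b": Counter("kernel methods text".split()),
}
ranking = rank_reviewers("brain computer interface", toy_profiles)
```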

Research areas

Institutions are placed on the map of research areas from the Pascal Network.

The example shows which areas are closely related to JSI.

Collaborations

Collaboration of institutions

Collaboration of authors working on “text mining”

Paper Recommendation

Competence Search

Live Demo

Thank you!