Pascal Visualization Challenge
Blaž Fortuna, IJS
Marko Grobelnik, IJS
Steve Gunn, US
Part I: Challenge details
ePrints
Database of around 1600 papers published by Pascal members
Papers are described with:
Authors (unique Pascal Id)
Title
Abstract (most papers)
Publish date (some papers only have year)
Challenge Goal
Two main goals:
to test and compare different text visualization methods, ideas and algorithms on a common dataset,
to contribute to the Pascal dissemination and promotion activities by using data about scientific publications from Pascal’s ePrints server.
Task
Visualize and present the Pascal ePrints data in a novel way which enables:
discovering main areas covered by the papers and people in Pascal,
discovering how areas and people develop through time,
helping researchers with recommendations on which papers to read,
helping to find the right reviewers for new papers.
Data
Raw XML file from Pascal ePrints server
Processed data for easier use:
Bag-of-words (TextGarden, Matlab)
Graph (Matlab, Pajek)
Data processed for different possible scenarios.
Raw XML file
Cleaned data from Pascal ePrints server.
Data is given as a list of papers; each paper is described by:
Title
Abstract
Year of publication
List of authors
Each author is described by a unique Pascal Id and an institution.
<paper id="2080" year="2006">
  <title>Synthesis of Maximum…</title>
  <abstract>In this presentation…</abstract>
  <subjects>
    <subject id="CS">Computati…</subject>
    <subject id="LO">Learning…</subject>
    <subject id="TA">Theory …</subject>
  </subjects>
  <authors>
    <author id="452" institution_id="1">Sandor Szedmak</author>
    <author id="1" institution_id="1">John Shawe-Taylor</author>
  </authors>
  <institutions>
    <institution id="1">Universit…</institution>
  </institutions>
</paper>
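A record like the one above could be read with Python's standard library roughly as follows. This is only a sketch: the element and attribute names come from the example, the sample text values are shortened placeholders, and how the raw file wraps its `<paper>` elements (the root element) is not shown here, so only a single record is parsed.

```python
import xml.etree.ElementTree as ET

# One record in the layout shown above; text values are shortened placeholders.
SAMPLE = """
<paper id="2080" year="2006">
  <title>Synthesis of Maximum...</title>
  <abstract>In this presentation...</abstract>
  <subjects>
    <subject id="CS">Computati...</subject>
  </subjects>
  <authors>
    <author id="452" institution_id="1">Sandor Szedmak
    </author>
    <author id="1" institution_id="1">
    John Shawe-Taylor</author>
  </authors>
  <institutions>
    <institution id="1">Universit...</institution>
  </institutions>
</paper>
"""

def parse_paper(elem):
    """Turn one <paper> element into a plain dictionary."""
    return {
        "id": elem.get("id"),
        "year": int(elem.get("year")),
        "title": (elem.findtext("title") or "").strip(),
        "abstract": (elem.findtext("abstract") or "").strip(),
        "subjects": [s.get("id") for s in elem.iterfind("subjects/subject")],
        "authors": [
            {
                "id": a.get("id"),
                "institution_id": a.get("institution_id"),
                # Author names may carry stray line breaks, so strip them.
                "name": (a.text or "").strip(),
            }
            for a in elem.iterfind("authors/author")
        ],
    }

paper = parse_paper(ET.fromstring(SAMPLE.strip()))
```

Papers that only have a year (no full publish date) fit this layout directly, since only the `year` attribute is read.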
Bag-of-words
Covered scenarios:
Document == Paper
Document == Author
Document == Institution
Available formats:
TextGarden: text file where one line equals one document
Matlab: data available in the form of a sparse term-document matrix
TextGarden (www.textmining.net):
Format:
Document_name !Subject DocumentList
Example:
Support_Vector_Machine_to_synthesise_kernels !Machine_Vision !Theory_and_Algorithms Support Vector Machine to synthesise kernels -- Suppose we are given two sets of …
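A line in this format could be split into its parts along these lines. The sketch assumes that fields are whitespace-separated, that the first token is the document name, and that every following token starting with `!` is a subject; `parse_textgarden_line` is a hypothetical helper, not part of TextGarden.

```python
def parse_textgarden_line(line):
    """Split one line into (name, subjects, text).

    Assumed layout: 'Document_name !Subject ... document text'.
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    name = tokens[0]
    subjects = []
    i = 1
    # Collect the run of '!'-prefixed subject tokens after the name.
    while i < len(tokens) and tokens[i].startswith("!"):
        subjects.append(tokens[i][1:])
        i += 1
    text = " ".join(tokens[i:])
    return name, subjects, text

line = (
    "Support_Vector_Machine_to_synthesise_kernels !Machine_Vision "
    "!Theory_and_Algorithms Support Vector Machine to synthesise kernels "
    "-- Suppose we are given two sets of"
)
name, subjects, text = parse_textgarden_line(line)
```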
Matlab:
Sparse matrix saved in a text file; it can be simply read into Matlab by:
X = spconvert(load('papers.dat'));
Documents are columns in the matrix.
Names of columns (document names) and rows (words) are provided.
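Outside Matlab, the same file can be read with a few lines of Python. The sketch below assumes the layout that `spconvert` expects, one `row column value` triplet per line with 1-based indices, and keeps the matrix as a dictionary; `load_triplets` and `document_column` are hypothetical helper names and the sample values are made up.

```python
def load_triplets(lines):
    """Build a {(row, col): value} dictionary from 'row col value' lines."""
    entries = {}
    n_rows = n_cols = 0
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        # Indices may be written as floats (e.g. 1.0000e+00), so go via float.
        row, col = int(float(parts[0])), int(float(parts[1]))
        value = float(parts[2])
        n_rows, n_cols = max(n_rows, row), max(n_cols, col)
        if value != 0.0:
            # Repeated (row, col) pairs are summed, as spconvert does.
            entries[(row, col)] = entries.get((row, col), 0.0) + value
    return entries, (n_rows, n_cols)

def document_column(entries, col):
    """Return one document (a column) as {word_row: weight}."""
    return {r: v for (r, c), v in entries.items() if c == col}

sample = ["1 1 2", "3 1 1", "2 2 5", "3 2 0"]  # made-up triplets
matrix, shape = load_triplets(sample)
first_document = document_column(matrix, 1)
```

With a real file the call would look like `load_triplets(open("papers.dat"))`; a trailing zero-valued triplet only fixes the matrix size.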
Graph
Covered scenarios:
Vertex == Word, Edge == Co-Appearance
Vertex == Author, Edge == Co-Authors
Vertex == Institution, Edge == Collaboration
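As an illustration of the Vertex == Author scenario, co-author edges can be derived from the author lists of the papers. The sketch assumes that an edge weight is the number of papers two authors share; how the provided graph files weight their edges is not stated here, and the author-id lists are hypothetical.

```python
from collections import Counter
from itertools import combinations

def coauthor_edges(papers):
    """Count, for every unordered author pair, the number of shared papers."""
    edges = Counter()
    for author_ids in papers:
        # sorted(set(...)) removes duplicates and fixes the pair orientation.
        for a, b in combinations(sorted(set(author_ids)), 2):
            edges[(a, b)] += 1
    return edges

papers = [["452", "1"], ["1", "7"], ["452", "1", "7"]]  # hypothetical id lists
edges = coauthor_edges(papers)
```

The same function gives institution collaboration edges if each paper is represented by its list of institution ids instead.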
Available formats:
Matlab: data available in the form of a sparse adjacency matrix
Pajek: software for network analysis
Matlab:
Sparse matrix saved in a text file; it can be simply read into Matlab by:
X = spconvert(load('words.dat'));
Names of vertices (words, authors, institutions) are provided.
Pajek: Can be downloaded from:
vlado.fmf.uni-lj.si/pub/networks/pajek
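For use outside Pajek, a network file can be read with a short parser. The sketch below assumes the common Pajek `.net` layout (a `*Vertices` section with numbered, optionally quoted labels, followed by `*Edges` or `*Arcs` lines with an optional weight); whether the challenge files use exactly this layout is an assumption, and the sample labels are hypothetical.

```python
import shlex

def read_pajek(lines):
    """Return ({vertex_id: label}, [(source, target, weight), ...])."""
    labels, edges = {}, []
    section = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("*"):
            # Section headers such as '*Vertices 3' or '*Edges'.
            section = line.split()[0].lower()
            continue
        fields = shlex.split(line)  # keeps quoted labels together
        if section == "*vertices":
            labels[int(fields[0])] = fields[1] if len(fields) > 1 else fields[0]
        elif section in ("*edges", "*arcs"):
            weight = float(fields[2]) if len(fields) > 2 else 1.0
            edges.append((int(fields[0]), int(fields[1]), weight))
    return labels, edges

sample = [
    "*Vertices 3",
    '1 "John Shawe-Taylor"',
    '2 "Sandor Szedmak"',
    '3 "Author C"',
    "*Edges",
    "1 2 3",
    "2 3",
]
labels, edges = read_pajek(sample)
```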
Submissions
The results can be: images, movies, Web sites, VRML files, executables (Windows, Linux), etc.
For an interactive tool, also provide a video showing the use of the tool on the Pascal ePrints data.
Evaluation
Usability of visualization – The goal is to assess the usability of a particular visualization in different practical contexts.
Innovativeness – The goal is to estimate how innovative the ideas used for the visualization are.
Aesthetics of the image – Here we are aiming to identify the "nicest" images from the challenge.
General voting by Pascal researchers over the web about "who likes what".
Since all the criteria are subjective, we will hire experts to judge the quality.
Each of the criteria will generate a separate ranking.
Part II: Examples
Visualization example 1/2: Document Atlas
Bag-of-words approach: Document == Author
An author is described by the sum of all the abstracts from the papers they co-authored.
We construct a separate profile for papers from 2004 and for papers from 2005.
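The author profiles described above can be sketched as summed word counts, one profile per author and year. The tokenisation (lowercase alphabetic words, no stop-word removal or term weighting) and the sample papers are assumptions for illustration; the preprocessing used for Document Atlas is not specified here.

```python
import re
from collections import Counter, defaultdict

def author_profiles(papers):
    """Map (author_id, year) to the summed word counts of that author's abstracts."""
    profiles = defaultdict(Counter)
    for paper in papers:
        words = re.findall(r"[a-z]+", paper["abstract"].lower())
        # Every co-author receives the words of the whole abstract.
        for author_id in paper["authors"]:
            profiles[(author_id, paper["year"])].update(words)
    return profiles

papers = [  # hypothetical records
    {"year": 2004, "authors": ["1", "452"], "abstract": "kernel methods for text"},
    {"year": 2005, "authors": ["1"], "abstract": "kernel learning theory"},
    {"year": 2005, "authors": ["1"], "abstract": "text kernel"},
]
profiles = author_profiles(papers)
```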
Dimensionality reduction
Documents are mapped from bag-of-words space to two dimensions in two steps:
Latent Semantic Indexing: 13,000 dim => 110 dim
Multidimensional Scaling: 110 dim => 2 dim
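To make the second step concrete, here is a small pure-Python sketch of classical (Torgerson) multidimensional scaling, which places points in two dimensions so that their pairwise distances are preserved as well as possible. Classical MDS with power iteration is swapped in here for brevity; the MDS variant used in Document Atlas is not stated, the LSI step is omitted, and the distance matrix is a toy example.

```python
import math

def classical_mds(dist, dims=2, iters=200):
    """Embed points in `dims` dimensions from a symmetric distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double-centred Gram matrix: b[i][j] = -(sq[i][j] - row_i - row_j + total) / 2
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = [float(i + 1) for i in range(n)]  # fixed start keeps runs deterministic
        lam = 0.0
        for _ in range(iters):  # power iteration for the leading eigenvector
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract along this axis
                lam = 0.0
                break
            v = [x / norm for x in w]
            bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            lam = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords

# Toy distances between three points lying on a line at positions 0, 3 and 4.
dist = [[0, 3, 4], [3, 0, 1], [4, 1, 0]]
coords = classical_mds(dist)
```

For matrices of realistic size a linear-algebra library would be the practical choice; the loops above are only meant to show the computation.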
The background reflects the density of documents.
Background words
Each part of the map is assigned a keyword which is most representative for the documents in the area.
We get a “map” of the topics covered within the documents.
In the case of Pascal ePrints data, areas on the map correspond to the research areas covered within the Pascal Network.
Time dynamics
For each author we have a profile for each of the years 2004 and 2005.
By showing the difference we can see how authors’ research focus developed between 2004 and 2005.
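One simple way to express such a difference in numbers is to compare the share of each word in the two yearly profiles. This is an illustrative stand-in, not necessarily how the map shows the change; `focus_shift` and the sample counts are hypothetical.

```python
from collections import Counter

def normalise(counts):
    """Convert raw word counts into shares that sum to 1."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

def focus_shift(profile_before, profile_after, top=3):
    """Return (rising, falling) words with their change in share."""
    before, after = normalise(profile_before), normalise(profile_after)
    delta = {w: after.get(w, 0.0) - before.get(w, 0.0)
             for w in set(before) | set(after)}
    # Largest increase first; ties are broken alphabetically.
    ranked = sorted(delta.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top], ranked[::-1][:top]

profile_2004 = Counter(kernel=3, text=1)  # hypothetical counts
profile_2005 = Counter(kernel=1, text=3)
rising, falling = focus_shift(profile_2004, profile_2005, top=1)
```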
Co-Authorships
Live Demo
Visualization example 2/2: IST World
Web portal developed within the IST World EU project
Uses search and visualization methods to:
discover the main research areas and collaborations within the PASCAL organizations
produce recommendations on which papers to read (e.g. papers on image recognition, or kernel trick)
find the right reviewers for a new paper (e.g. a paper on "brain computer interface") and assess their competence
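The reviewer search can be pictured as ranking author profiles by their similarity to the text of the new paper. The sketch below uses plain bag-of-words cosine similarity as an assumption; the method actually used by IST World is not described here, and the author names and profile texts are hypothetical.

```python
import math
import re
from collections import Counter

def bag(text):
    """Lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_reviewers(new_text, profiles, top=3):
    """Return up to `top` author names ordered by similarity to the new paper."""
    query = bag(new_text)
    scored = [(cosine(query, bag(text)), name) for name, text in profiles.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [name for score, name in scored[:top] if score > 0]

profiles = {  # hypothetical author profiles (concatenated abstracts)
    "author_a": "brain computer interface signal classification",
    "author_b": "kernel methods for text mining",
    "author_c": "computer vision image recognition",
}
reviewers = rank_reviewers("a paper on brain computer interface", profiles)
```

The same ranking, applied to paper abstracts instead of author profiles, gives a basic paper recommendation.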
Research areas
Institutions are placed on the map of research areas from the Pascal Network.
The example shows which areas are closely related to JSI.
Collaborations
Collaboration of institutions
Collaboration of authors working on “text mining”
Paper Recommendation
Competence Search
Live Demo
Thank you!