Brainiac: a Graph-based Literature Visualization
Miguel Alexandre Lourenço dos Santos
Thesis to obtain the Master of Science Degree in
Information Systems and Computer Engineering
Supervisors: Profª Sandra Pereira Gama and Prof. Hugo Alexandre Ferreira
Examination Committee
Chairperson: Prof. Luís Manuel Antunes Veiga
Supervisor: Profª Sandra Pereira Gama
Members of the Committee: Profª Ana Paula Boler Cláudio
November 2017
Acknowledgments
I would like to thank my parents for their friendship, encouragement and caring over all these years,
for always being there for me through thick and thin and without whom this project would not be possible.
I would also like to thank my grandparents, aunts, uncles and cousins for their understanding and support
throughout all these years.
I would also like to acknowledge my dissertation supervisors Profª Sandra Gama and Prof. Hugo
Ferreira for their insight, support and sharing of knowledge that has made this Thesis possible, by
keeping me motivated to carry out the development of this project.
I would like to thank my colleagues, Tomás Alves, Rodrigo Veríssimo, and Luís Fonseca, for their
time, knowledge, dedication and friendship. You're simply the best.
Last but not least, to all my friends and colleagues that helped me grow as a person and were always
there for me during the good and bad times in my life. Thank you.
To each and every one of you – Thank you.
Abstract
Nowadays, users face the problem of having too much information available. A user trying to research a
new topic will face a collection of context-specific documents, and exploring this collection may require
knowledge of specific concepts that is only available to more experienced users. In this work, we
address this problem in the neuroscience context by creating a visualization, in collaboration with Instituto
de Biofísica e Engenharia Biomédica (IBEB), that helps users analyze a collection of documents by
indicating documents that may be similar. The developed visualization has the potential to help users in
this context: by interacting with different views, it allows combining a document search by similarity with
a search by different topics. We conducted an evaluation to measure the usability of the developed application
and its utility, in order to validate the visualized data. The results from the usability test were very good, with
no obvious interface problems. Validation of the processed data also shows good results, with room for
improvement given some errors detected in text processing.
Keywords
Text Visualization; Information Visualization; Visual Analytics; Text Processing; Document Collection;
Neuroscience.
Resumo
Nowadays, users are confronted with too much available information. When a user searches for
information on a new topic, they will find a large quantity of documents which often require experience
in the field to interpret, or to filter the most relevant content. This work aims to solve this problem in
the field of Neuroscience by creating a visualization, in collaboration with IBEB, that allows users to
analyze a collection of documents, indicating documents that may be similar and of interest to the user.
The developed visualization shows potential to help users in this context through the exploration of the
different views developed for this purpose. By combining these views, it is then possible to search for
documents by similarity or by a given topic. Finally, the final solution was evaluated with respect to its
usability and utility, in order to validate the results that can be visualized in the application. The results
were good, with no obvious errors in the user interface. The validation of the processed data also
showed good results, leaving room for possible improvement regarding some errors detected during
this evaluation.
Palavras Chave
Information Visualization; Neuroscience; Text Visualization; Document Collections;
the hyponyms. A search function is provided, enabling the user to highlight nodes matching the query.
Additionally, it allows for a semantic zoom through a fisheye view that collapses the nodes farthest from the
root node. The full text of the document is available at the bottom of the interface (Figure 2.3), which
can be used through the linked visualization.
By providing an overview of the document content, users are able to compare multiple documents,
by having the trees rooted on the same synset, showing the difference between documents’ content.
Enabling users to view this comparison allows different applications, such as plagiarism detection, doc-
ument categorization and authorship attribution.
Lastly, there is the Word Tree [2], introduced as a visualization focused on exploring repetitive text.
It takes the form of a tree structure, with the words that follow a particular search term arranged
spatially, as seen in figure 2.4. It was designed to exploit the interest behind visualizing unstructured text
in the Many Eyes website, where users could upload and visualize their own data.
The design is compared to an interactive form of the keyword-in-context technique, since the visual
design makes it easy to spot repetition in the contextual words that follow a phrase, as well as having a
natural tree structure, while having clear ways to interact with the visualization. Since it was intended to
allow users to test the visualization in the Many Eyes site, users had to rapidly comprehend the visual
design, else they could just ignore the visualization altogether.
The layout of the tree consists in the typical branching, to associate it with the tree structure, and,
similarly to word clouds, font size is used to indicate word or phrase frequency. Branches from the
tree continue until the frequency is equal to one, instead of stopping at the first unique phrase.
Figure 2.4: Word Tree [2] visualization.
Once
users enter a search query and the tree is generated, they are able to explore the tree. While exploring,
hovering a particular word or phrase reveals additional information, while clicking on an individual word
allows the user to adjust the phrase or the root of the tree that is being shown.
Some scalability issues arise with the usage of font scale to show common words. As the text input
size increases, text readability becomes a concern, since having very common words as the source of
the tree reduces the text scaling to a point where it becomes barely readable. To handle this, the authors
chose not to display the less relevant branches, although this could turn into a problem, as there could be some
loss of information when removing certain extensions.
The Word Tree was made available on the Many Eyes site, where users could upload and visualize
their own text documents. Although the site is not available anymore, at the time, users started taking
advantage of its similarity to word clouds to grasp a general understanding of the text, which varied from
a collection of Twitter posts to newsgroup discussions. Although this visualization was intended to
analyze unstructured text, the authors realized that the users started using structured data to exploit the
tree structure. Being accessible in the site also helped to get feedback, which was generally positive
and provided some suggestions to the design, such as the option to ignore punctuation marks and stop
words (words that do not contribute to content, providing unnecessary information, such as articles and
prepositions), the ability to drill down from the tree structure to the plain text, to see the uses of particular
words or phrases, and to show a network of the connections between two words or phrases.
Overall, the Word Tree was considered a flexible solution to visualize both unstructured and structured
data, with a good reaction from users, and future work included combining the word tree with some
other text visualization, since the user starts by staring at a blank page, waiting for a search query, and
improving the design to be able to handle larger datasets.
2.2 Document Corpora Visualization
When the scope expands from a single document to the complete collection, visualizations tend to
be extended to a more exploratory search, while not disregarding search methods.
The simpler features that can be used are derived from the metadata. The PaperLens [3] system
was devised to visualize trends and connections in conference papers, extracting the authors, topics
and citations of these papers.
Figure 2.5: Layout of the PaperLens [3] visualization.
Regarding the visualization layout, it provides distinct views of the dataset, as shown on Figure 2.5.
Users are able to find popular papers sorted by year and topic (Figure 2.5a), or retrieve a list of papers
from a specific research area, by selecting a topic, or by author(s) (Figure 2.5c) shown in the selected
authors region (Figure 2.5b), whose work is differentiated by the colors attributed. The design allows
the user to discern the most influential papers by topic, resorting to a list of the most referenced papers
(Figure 2.5f). Lastly, a co-authorship graph is provided, enabling users to explore relations between
authors in the collection (Figure 2.5d). The implementation allows interaction between the visualizations,
as selecting a topic, paper or author will load related items in the remaining visual representations.
User case studies were overall positive, despite showing a few design issues, specifically in the
author search, which would find substring matches while not allowing searches for first or last names.
There were also some issues noted regarding consistent behavior throughout the layout, and users not
understanding the purpose of the initial layout, considering some segments as "recreational", as mentioned
by the authors.
In conclusion, most of these issues were easily solved by adopting a simpler design, yet the scaling
concern was well-founded, as the representation used in this visualization was not able to depict the dataset
as it expanded.
The Bohemian Bookshelf [16] is an additional example of a visual representation that resorts to the
usage of metadata. This system is laid out as a digital book collection, designed to tackle accidental
discoveries – serendipity. This is accomplished by allowing a “shelf browsing” like experience, which
has been shown to inspire serendipitous discoveries. The "shelf like" browsing is attained with the
different visualizations acting as a whole, offering multiple access points due to different perspectives
from the views, drawing attention with the visually distinct visualizations and providing distinct, yet playful,
approaches to information exploration.
Figure 2.6: The Bohemian Bookshelf's [16] visualization layout. On the left side, there are the Keyword Chains on top and the Timelines on the bottom. On the right, there is the Cover Circle view on top, with the Author Spiral on the bottom. Lastly, the Book Pile is in the middle of the layout.
Different factors are usually associated with these discoveries, such as observational skills, open-
mindedness (receptiveness to unexpected information), knowledge and perseverance, as well as exter-
nal factors, for instance coincidence and influence of other people or systems [17–20]. Libraries and
physical bookshelves improve serendipity, due to exploratory sense present in these systems. From
here, the authors derived a few design considerations to promote serendipity, namely, multiple access
points, which correlated to open-mindedness and the researcher's eagerness to analyze data from
diverse perspectives, juxtaposition or adjacency of information, multiple pathways, and curiosity and play.
The layout of the visualization consists mainly of five different perspectives on the dataset, the Cover
Color Circle, Keyword Chains, Timelines, Book Pile and the Author Spiral, as depicted in Figure 2.6.
The first visualization, the Cover Color Circle, provides a first look at the books through their cover color,
showing an averaged color of each cover's image. Books are displayed as circles, grouped by the respective
calculated colors, in a circular layout, and hovering a specific book will provide the user with a preview
of the book’s cover.
The second, Keyword Chains, exploits keyword usage to represent content, simplifying the catego-
rization and search. It displays the selected book in the center, and distinct keywords branching out.
Each keyword is followed by a book title, which will be followed by another keyword and so on, forming
the keyword chain, which can be focused on a different book title by clicking on it, or on a keyword,
restructuring the tree around the corresponding book.
Thirdly, the Timelines visualization displays the association between the book’s publication year and
the time period depicted. The layout consists of two timelines, with the upper one representing the
publication year, and the lower indicating the focus of the book’s content. The books are illustrated as
circles with respective color in each of the timelines, with a line connecting both of them that shows the
relation between publication year and the initial time period the book covers.
The Book Pile looks to provide further insight into the physical aspects of the books, with each one
being represented as a square, its page count expressed in its edge length, and its color borrowed from
the book. Books with fewer pages are represented at the bottom, while thicker ones are shown on top.
Lastly, the Author Spiral displays the books by authors' names in a list that rolls up into a spiral at both
ends, due to space issues. As the names start spiraling, they are replaced by circles that represent
the books in the library, again expressing the book’s color. Clicking on a text label or circle will show a
preview of the book, similar to what was mentioned earlier.
All of these visualizations are interlinked and, combined, bring different perspectives to the user, as
actions taken on a specific one will cause the rest to adapt; for example, when selecting a book, it will be
highlighted in all views. However, there are some issues regarding scalability. For example, increasing
the collection’s size will be costly performance wise. Overall, this visualization takes a more playful
approach to data exploration while taking into account the serendipity concept, which could be further
developed and evaluated with case studies.
The source text of the documents in the collection can be used to visualize the dataset; however, due
to the large nature of the collection, this rapidly becomes unfeasible. Computed features such as word
similarity or topic similarity are instead used to derive a likeness metric that compares documents
and projects the differences onto a visualization, which by itself may lead to trust problems [4]. While the
first similarity metric utilizes directly the source text, the second takes into account related terms used,
which is convenient when the documents do not use the same exact words.
An example of this is the Dissertation Browser [4], a visual analysis tool developed to investigate
collaboration between different academic departments. The adopted approach resided in detecting
shared language or terms across publications of various areas, seeing that the authors mention the
different vocabulary across distinct areas.
Figure 2.7: Landscape view on the Dissertation Browser [4] system.
This visualization consisted of three different “views”: Landscape view, Department view and Thesis
view. The first one, the Landscape view (Figure 2.7), encoded each department as a circle, with its area
representing the number of published theses, and the distance between each circle defining the similarity
among them. This view was intended to reveal patterns in the university’s research areas, however,
being subject to a projection, it led to trust issues.
In response, the second visualization, the Department View (Figure 2.8), was conceived, which enables
focusing on a single area and evaluating the similarity between that department and the remaining ones.
Identical to the first design, distance also encodes the similarity, in a radial layout.
The last visualization, the Thesis View (Figure 2.9), aimed to validate the results from the similarity
measure observed previously. Identically, it is focused on a single department, displaying theses from the
area and the most resembling ones from other departments. This visualization is shown by clicking on
the focused area in the Department View and provides insight into the similarity between two areas and
which theses account for this proximity.
Figure 2.8: Department view on the Dissertation Browser [4] system.
Figure 2.9: Thesis view on the Dissertation Browser [4] system.
User tests revealed the first view was not suitable, as the trust issues generated by the projection
artifacts led to users wrongly identifying unusual trends. In addition, the Department view allowed
users to identify similarities they did not expect, with both word similarity and topic similarity, which
demonstrated that both these features could have decreased accuracy. Utilizing the Thesis view revealed
the cause of this reduced accuracy: topic similarity would position Biology close to Computer Science
due to the existence of computational biology, while word similarity would mark two departments as
resembling each other in situations where they used the same rare words.
Jigsaw [5] is a separate system that provides different visual representations of computed features.
It produces a summary of the collection or of a single document, a measure of similarity between
documents, and clustering; it identifies entities and connections among them, and possible related entities for
further investigation. Additionally, it allows for document sentiment analysis, which provides insight into
sentiment, subjectivity, polarity and other attributes.
The authors take into account the two factors introduced by the Dissertation Browser – interpretation
and trust. Since the visualizations are based on computed features, it is important to understand how
accurate the results are, assuring that users make trustworthy inferences from their interpretation.
Figure 2.10: Jigsaw's [5] List View. Shows the conference, year, author, concept and keywords associated. In the bottom figure, the concept graph is selected, showing connected years, concepts and authors.
The List View (Figure 2.10) provides a data cleaning phase, where users are able to select a list
of documents of interest, as well as presenting the user with the most important relations in the dataset. Fig-
ure 2.10 shows an example of how the documents can be clustered (conference, year...), although this
can be personalized according to the context of the documents.
Selecting documents in the List View will allow for interaction in the remaining perspectives. In the
Document Viewer (Figure 2.11), users have access to the source text of the document as well as
related information of the selected documents, including a summary of all the chosen documents and a
summary of the single document selected from the list.
Figure 2.11: Jigsaw's [5] Document Viewer shows a summary of the loaded documents (left panel) at the top, and summarizes the selected document on the right in the Summary panel.
The Document Clustering View provides another approach to a selection of documents, allowing the
user to differentiate the main topics by identifying the clustering results, with some advanced options to
personalize the clustering of the documents.
personalize the clustering of the documents.
In terms of document similarity, the Document Grid Viewer (Figure 2.12) provides users with the
ability to compare a set of documents to another one, ordering them by the selected measure, in this case
the similarity.
Lastly, the Word Tree Viewer (Figure 2.14) shows the occurrences of a specific word, as well as
the common phrases it is associated with. This visual representation has been reviewed previously, as the
Word Tree [2].
Figure 2.12: Jigsaw's [5] Document Grid Viewer displays the documents in a grid, ordered by similarity according to the selected document.
Figure 2.13: Jigsaw's [5] Document Cluster View displays the different clusters of similar documents.
This work demonstrated different combinations of text analysis and interactive visualization to aid the
user in exploring a specific document collection. Although pre-processing the whole dataset could be a
potential scaling issue, the visualization itself is generally fluid, supporting a variety of different areas,
for instance, aviation documents, source code files, fraud investigations, and, as discussed in some use
cases, academic research and consumer reviews. One important caveat is the lack of user evaluations,
which could have identified more apparent issues with the visualization.
A follow-on system is the PaperVis [6] visualization, proposed as a solution to the abundance of
information when investigating a certain research field and obtaining sizeable amounts of related
papers. It represents relevant papers as a graph, using modified radial space-filling and bullseye-view
techniques, and provides several visual cues, such as node colours, sizes and boundaries, to represent each
paper's relevance (Figure 2.15E).
There are some features provided to enhance the visualization as an exploration tool: efficient
screen usage, adopting ideas from radial space filling and the bullseye view layout; visual indications to
distinguish results, with a specific paper or keyword at the centre and the other papers organised relative
to it; a user-friendly interface, allowing the user to explore different views and analyze results at will; and
a history mechanism, to prevent users from feeling lost in the visualization.
The interface provides a clear way to change between different modes (Figure 2.15A) and change
other configuration options, an area to review exploration history (Figure 2.15B), a filter and selection
control (Figure 2.15C), and details of the currently selected paper (Figure 2.15D).
In the first mode, citation-reference mode, the main visualization produces a radial view, with the
paper of interest located in the center and the rest of the papers distributed within ten bin circles
around the selected one, where the distance to the center is defined by the relevance of the document,
characterized by the citations and references it has. Citations are revealed for a single document by
clicking on it, while a double-click re-centers the whole graph on the new paper.
Figure 2.14: Jigsaw's [5] Word Tree Viewer shows the occurrences of a specific word, followed by the most common phrases where the word appeared.
In the second mode, keyword mode, users are able to find documents that share keywords,
or use keywords as a cluster category, thus being able to discover pertinent contributions in a certain
research field. By selecting a keyword, the system will load all related documents, and display the
keyword at the root, with appropriate papers surrounding the root node, with importance also being
calculated according to citations.
In the last mode, mixed mode, papers are loaded just like in the first mode, except the layout is
arranged similarly to the keyword mode, as well as the process used to link other papers.
There were some issues identified, related to the time needed to load the dataset into
memory, which took around six seconds according to the authors. This is explained by the complexity behind
the ordering in Citation-Reference mode, and the clustering algorithm in the Keyword mode.
The proposed visualization made literature review an easier task, and the three modes provided
implement different ways to explore the papers in the dataset. The authors mention a possible improve-
ment to the design, by arranging multiple focus points in the center.
Approaching techniques that focus on the subject of the documents, there is the ThemeRiver [7]
visualization, which allows the user to identify a document collection’s thematic content and its variations
over time, as well as the relative strength of the themes. These are shown in the context of a timeline
with corresponding external events, allowing the user to recognize patterns, relationships or trends in
the visualization, as shown in figure 2.16.
Figure 2.15: The PaperVis [6] visualization layout.
Figure 2.16: The ThemeRiver [7] visualization.
One of the goals behind the design was to enable users to quickly find patterns, and, by using
more familiar visual metaphors, this discovery is made easier for the user. The river metaphor was chosen
as a means to display time progression, while also representing the theme’s relative strength, utilizing
the flow, composition and width. Here, the separate “currents” in the flow illustrate each of the themes
depicted in the collection, the thickness describes the variations in strength and horizontal distance
symbolizes time change. Smooth boundaries and color are necessary to ease the tracking and comparison
of specific currents in the flow.
The strength of each separate river current is calculated by reviewing, for example, the number of
documents containing the theme word in each period. Alternatively, the number of occurrences of the theme
words, as a substitute for document frequency, could also stand for theme strength. The authors mention
the difficulties behind picking the colors for each theme, since there were several factors to take
into account. Colors for distinct themes needed to have some contrast, to be able to
distinguish the currents, while also considering the possibility of having a high number of topics in the
collection. The solution was to sort the colors into groups of related themes and show the color family
attributed to each group.
User tests were overall positive, confirming that the chosen metaphor was easy to understand and
useful to identify macro trends, but not as appropriate when dealing with minor patterns. Users
proposed features from the histogram against which the main visualization was tested, mostly to be able to
see the actual numeric values behind the abstraction, referring to the trust aspect mentioned before.
Along with these results, the visualization could be improved by increasing the performance in order to
support more interactions and improve the control users have on the system.
Following the ThemeRiver system, there is the FacetAtlas [8], a visual representation of both local
and global patterns in the set of documents, displayed with a graph and a density map (Figure 2.17), in
order to provide context. It also allows users a more interactive experience, with the ability to search for
specific terms, which will render a new graph. To properly understand the design choices, it is necessary
to explain that facets are considered to be classes of entities, which are instances of a particular concept
from the data, and relations are simply connections between these entities.
Brainiac is an application focused on visualizing a collection of documents. It is a tool developed in
collaboration with IBEB to help users explore the content of a group of documents, allowing the user to
potentially identify documents of interest by arranging documents based on their similarity and topics.
The development of this visualization followed an iterative and incremental process, focusing on
user feedback to improve its usability and main features. As such, there were two main testing phases
in this process: an informal testing phase, where the focus was gathering feedback from the users, and
a formal one, aiming to measure the usability of the final version of the application.
This chapter describes the final solution and how it was constructed. The first section describes the
system's architecture, defining each of the components needed, and naming the existing interactions
between each one. Then, the backend text processing used on the collection of documents is explained. Finally, the
last section discusses how the processed data is used in each of the application’s visualizations, and
how they interact with each other.
3.1 Architecture
The system was designed with two main components: the backend, which is responsible for the text
processing, and the frontend, which serves the user the web app that contains the visualization.
The depicted architecture is a generic one, commonly used in these kinds of web applications. Although
there are usually additional elements in these kinds of systems, they are usually related to security,
management or communication, and are not included as they are not relevant in the scope of this
project.
A normal interaction with the system is depicted in figure 3.1. The user starts by making a request to
the frontend server, through the browser, getting the necessary files to render the web app. Then, the
browser makes the necessary requests to the backend server, either fetching the preprocessed data,
documents, or to upload new files.
Figure 3.1: Architecture of the final solution. Documents are stored in the backend, where they are processed. The backend also serves these documents and the initiation file to the frontend. The latter contains the application code, run in the client's browser, which makes the necessary requests to the backend.
The backend server was developed with Node.js¹, and serves as an Application Programming Interface (API) for the application. We opted to use this platform due to Javascript already being used exten-
sively in the frontend, which facilitated the development seeing that we could use the same language to
develop both elements of the system. This server is used for the main processing in the visualization.
Its main functions are to hold the document corpus, allow the upload of files by the user, serve the
main initialization file, and allow the querying of specific words to measure their distance to each of the
documents.
The frontend was developed utilizing React² and D3³. React is a library used to build the interface
of the application, while D3 is a library that helps us create the visualizations. React allows some abstraction
from HTML, letting us add or change components without worrying too much about breaking the
interface. As such, it allows adding new UI elements as needed, facilitating prototype iteration.
3.2 Backend Document Processing
This section describes how the backend handles the processing of the collection of documents. Each
stage of this process is defined in figure 3.2. The following subsections describe each of these stages
in detail, following the order depicted in the figure.
The documents were gathered with the help of professor Hugo Ferreira, from IBEB, who listed a
few topics of interest to guide the construction of an initial collection of documents. These subjects
were intended to help the search in search engines such as Google Scholar or Pubmed, and create
a small document database with articles from these topics. This database was intended to help with
the development, including the informal testing phase, and the usability tests, and as such, not much
time was focused on creating a big collection. These studies are usually stored in Portable Document
Format (PDF) to facilitate access, which prompts an initial stage that converts them to plain
text.
3.2.1 Text Extraction
As mentioned, in the database created, documents are in PDF format. Since these files may
contain not only text, but also images, hyperlinks, videos, embedded fonts, and executable scripts, they are
stored in binary. In addition, the elements included are usually accompanied by a set of formatting and
other describing components needed, so that the result from rendering is the same across platforms. To
deal with this file format, the text needs to be extracted, while ignoring other elements such as images
and formatting elements that cannot be used to classify the document.
¹ https://nodejs.org
² https://reactjs.org/
³ https://d3js.org/
Figure 3.2: Stages in the document collection processing pipeline. The first two stages in the process – Text Extraction and Tokenization – are applied to single documents, in order to extract the terms of each document and form a bag-of-words representation. The last two stages are applied to this representation, in order to try and extract content.
This extraction is done through a Python script that takes each document and converts it into a plain
text file with all the text from the original file. Initially, we opted to use C++, as this processing involved
performance-intensive computation. However, it was decided to migrate the development to Python, as
this language allows for faster iterations on the code, meaning the development time was focused on the
processing and not on the language specifics.
The script uses textract⁴, a Python library that enables us to extract text from documents, PDF
files in this case. Since these files can have different encodings, the result of the extraction may
contain some unwanted characters when converting to UTF-8. One specific example of this problem
happened with some documents that contained math symbols from study comparisons, such as ">" or
"≤", which, when converted, produced incorrect output in the resulting text file, corresponding to numeric
characters (for instance, "≤" extracting as "6").
In order to remove some of these incorrect characters, a set of rules was put in place, before the resulting
text was saved, that allowed us to remove characters that did not contribute to the actual content of the
file. These rules are used to discard digits, punctuation and symbols, hyperlinks, and some words that do
not contribute to the content itself. The removed characters are not only products of wrong conversion,
but also parts of the document, like references or citations that are sometimes merged into the words.
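To make this stage concrete, the following is a minimal sketch of the extraction and cleanup step. The thesis does not list the exact rule set, so the patterns below (hyperlinks, citation markers, digits and symbols) are illustrative:

```python
import re
import textract  # Python library used to extract text from PDF files

def extract_clean_text(pdf_path):
    # textract.process returns the extracted text as bytes
    raw = textract.process(pdf_path).decode("utf-8", errors="ignore")
    # Illustrative cleanup rules: discard hyperlinks, citation markers,
    # digits, punctuation and other symbols that carry no content
    text = re.sub(r"https?://\S+", " ", raw)        # hyperlinks
    text = re.sub(r"\[\d+(,\s*\d+)*\]", " ", text)  # citation markers like [3] or [3, 7]
    text = re.sub(r"[^A-Za-z\s]", " ", text)        # digits, punctuation, symbols
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace

if __name__ == "__main__":
    print(extract_clean_text("document.pdf")[:300])
```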
3.2.2 Tokenization
After all the files are converted into plain text, the second stage involves reading all the documents
and fitting them into the model, to obtain the necessary results for the visualization. This is done with a
second Python script that reads each of the converted files into memory, in a bag-of-words representation.
This model is a simplistic approach used in natural language processing in which the text is
represented as a set (bag) of words. The text is stripped of any punctuation and newline characters, and
document, but it is offset by the frequency of the word in the whole corpus. This offset helps us measure
how important a term is in the collection, as it allows scaling down the weight of frequent terms across
the collection, while simultaneously scaling up the uncommon ones.
The recurrence of each term t is calculated by simply measuring term frequency in each document
d from our collection D, and it is normalized to take into account the total length of the document.
Therefore, larger documents will not get higher scores due to their larger term frequencies. Then, the
inverse document frequency idf is calculated by taking the logarithm of the total number of documents
N divided by the number of documents that contain the term being weighted, n_t. The final result is
calculated by multiplying tf by idf, as seen in (3.1).
\mathrm{tf}(t, d) = f_{t,d}
\mathrm{idf}(t, D) = \log \frac{N}{n_t}
\text{tf-idf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D) \qquad (3.1)
By fitting the whole tokenized document collection into the tf-idf model from the scikit-learn library,
we obtain a term-document matrix that reflects the weights for all the terms in all documents. Using the
cosine similarity, we can measure the similarity between two vectors on the matrix, which consequently
allows us to measure the similarity between two documents in the collection. Using this method, we
obtain a new matrix with the similarity values between each pair of documents in the collection, allowing
the creation of links between strongly related documents.
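For reference, the cosine similarity used here follows the standard definition: given the tf-idf vectors d_i and d_j of two documents,

\mathrm{sim}(d_i, d_j) = \cos\theta = \frac{d_i \cdot d_j}{\lVert d_i \rVert \, \lVert d_j \rVert}

so values close to 1 indicate documents that use similarly weighted terms, while values close to 0 indicate unrelated documents.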
One additional stage was added that, for each document, separates every other document into different
levels of similarity. This was initially done on the client side, but it was changed so that all processing
is done on the same side (the backend).
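A minimal sketch of this similarity computation with scikit-learn follows, assuming the plain-text documents from the previous stage are available in a list; the 0.3 link threshold below is illustrative, as the thesis instead separates documents into several similarity levels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical input: plain-text content of each document in the collection
docs = ["temporal lobe atrophy in parkinson disease ...",
        "regional volumetric change with cognitive decline ..."]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)   # term-document matrix of tf-idf weights
sim = cosine_similarity(tfidf)           # document-to-document similarity matrix

# Keep links between strongly related document pairs (illustrative threshold)
links = [(i, j, float(sim[i, j]))
         for i in range(sim.shape[0])
         for j in range(i + 1, sim.shape[0])
         if sim[i, j] > 0.3]
```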
3.2.4 Clustering and Top Word Extraction
To complement these links, a cluster analysis is performed on the resulting tf-idf matrix. This task
involves grouping a set of objects so that each group includes objects that are more similar to
each other than they are to those in other groups. In the context of the document collection, it allows grouping
together documents that are similar to each other, creating clusters of documents on different topics,
helping users identify relationships in the collection.
To create these groups, k-means clustering is used. K-means is a general-purpose clustering
algorithm that tries to separate the samples into groups of equal variance. This method, however,
requires the number of clusters to be specified beforehand, and it is not guaranteed to reach a global
optimum. This implies that the final result will depend not only on the number of clusters specified, but
also on the initial placement of the centers of each cluster, which may lead to different results as the model is run
several times.
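A sketch of this clustering step with scikit-learn, reusing the tfidf matrix from the earlier sketch; the number of clusters is an assumption, and n_init re-runs the algorithm with different initial centers to mitigate the dependence on center placement:

```python
from sklearn.cluster import KMeans

k = 5  # assumed number of clusters; k-means requires it up front
km = KMeans(n_clusters=k, n_init=10, random_state=0)  # 10 runs, best result kept
labels = km.fit_predict(tfidf)  # cluster index assigned to each document
```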
In order to evaluate the results from this clustering, a projection onto a 2D space is needed, since
the vectors representing each document have a very high dimension count. The process of reducing the
number of dimensions of a vector while preserving information is designated dimensionality reduction,
and it usually consists of either selecting a subset of all the features, or computing new features from
the existing ones. Although this helps visualize results from the tf-idf model, reducing a high number
of dimensions to only two can lead to the loss of information relating to the document vectors. This loss
of information, in turn, will cause a distortion on the resulting graph visualization of the collection of
documents. Certain patterns may appear, as a result of these artifacts, which influences the analysis of
the results.
Taking this into account, different dimensionality reduction methods were used, so that it is possible to
compare results without relying too heavily on any single method's distortion. Document vectors were reduced using Latent
Semantic Analysis (LSA), Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor
Embedding (t-SNE), down to two dimensions. Both LSA and PCA perform a linear reduction on the
data, using Singular Value Decomposition (SVD) of the data, while t-SNE performs a nonlinear reduction.
The comparison between these algorithms can be seen in figures 3.3, 3.3(a) to 3.3(c).
As clustering in a high number of dimensions can be problematic [23], the results of each
dimensionality reduction are again clustered, for further analysis. The results from this clustering can be seen
in figures 3.4, 3.4(a) to 3.4(c).
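A sketch of the three projections with scikit-learn, again reusing the tfidf matrix from the earlier sketch; parameter values are illustrative (for instance, t-SNE's perplexity must be smaller than the number of documents):

```python
from sklearn.decomposition import TruncatedSVD, PCA
from sklearn.manifold import TSNE

lsa_2d = TruncatedSVD(n_components=2).fit_transform(tfidf)  # LSA via truncated SVD

dense = tfidf.toarray()                  # PCA and t-SNE expect dense input
pca_2d = PCA(n_components=2).fit_transform(dense)
tsne_2d = TSNE(n_components=2, perplexity=30).fit_transform(dense)  # nonlinear
```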
An additional stage added to the process involved computing the top terms by topic, or
cluster. Taking the results from clustering the tf-idf matrix, it is possible to obtain the most relevant
features of each cluster. Since, in this case, features are equivalent to terms, this stage stores the most pertinent words
of each identified cluster. To compute the top words by topic, topics were first extracted by fitting the tf-idf
features into the Non-Negative Matrix Factorization (NMF) model. In order to get a more complete set
of results, Latent Dirichlet Allocation (LDA) topic extraction is also performed, with term count features
instead of tf-idf features, as the scaling idf property would disproportionately change the weights of
words with this model.
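A sketch of the two topic extraction variants described above, reusing docs, vectorizer and tfidf from the earlier sketches: NMF fitted on the tf-idf features, and LDA fitted on raw term counts. The number of topics and of top words per topic are assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

n_topics = 5  # assumed number of topics

# NMF on tf-idf features
nmf = NMF(n_components=n_topics, random_state=0).fit(tfidf)
terms = vectorizer.get_feature_names_out()  # recent scikit-learn API
for topic in nmf.components_:
    print([terms[i] for i in topic.argsort()[-10:]])  # top 10 words per topic

# LDA on term-count features, since idf scaling would skew its weights
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(counts)
```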
As these top words were extracted using different methods, a common measure between the words
and the documents had to be created, so that words could be sorted according to their relevance to each
document. With the trained tf-idf model, it is possible to measure the cosine similarity between each
word and each document. This method returns, for each word, an array with its similarity to each
document. This process was later altered so that the script could take a single word as input,
allowing the similarity between the specified word and each document to be measured, and letting
users add new words to the list.
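A sketch of this word-to-document measure, transforming a single word with the trained tf-idf vectorizer from the earlier sketch and comparing it against every document vector:

```python
def word_similarities(word):
    # Vectorize the single word in the space of the trained tf-idf model
    vec = vectorizer.transform([word])
    # One cosine similarity value per document in the collection
    return cosine_similarity(vec, tfidf)[0]

scores = word_similarities("alzheimer")  # e.g., a word supplied by the user
```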
(a) Results of TF-IDF clustering, using LSA dimensionality reduction to display results in the graph. (b) Results of TF-IDF clustering, using PCA dimensionality reduction to display results in the graph. (c) Results of TF-IDF clustering, using t-SNE dimensionality reduction to display results in the graph.
Figure 3.3: Results of TF-IDF clustering, reduced by LSA, PCA and t-SNE for comparison.
(a) k-means applied after dimensionality reduction, in this case with LSA. (b) k-means applied after dimensionality reduction, in this case with PCA. (c) k-means applied after dimensionality reduction, in this case with t-SNE.
Figure 3.4: Results of k-means clustering applied after the dimensionality reduction, for comparison.
As mentioned in the architecture subsection, this processing is done mainly on the backend server,
which serves as an API. In order to pass the processed data to the visualization, the script stores
everything into a JavaScript Object Notation (JSON) file, which facilitates the interpretation when reading
the file in Javascript. This file will include the array of documents in a “nodes” property, while having
the calculated links in a "links" property. Information relative to each document is aggregated into each
document object, such as the cluster to which it belongs, its title and abstract, and the similarity levels
of every other document in relation to itself. Here, the cluster information used was derived from the original
k-means, performed on the tf-idf matrix. Since dimensionality reduction can lead to loss of information,
it was decided not to use this method, as projection onto a 2D space could place documents that
are not similar at all close together. In figure 3.3, all of the dimensionality reduction algorithms show instances of this,
as documents from different clusters are occasionally placed close together.
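A sketch of how such a JSON file could be assembled from the earlier results; only the "nodes" and "links" property names are given in the text, so the per-document fields below are illustrative:

```python
import json

titles = ["Document 1", "Document 2"]  # hypothetical document titles

data = {
    "nodes": [{"id": i, "title": titles[i], "cluster": int(labels[i])}
              for i in range(len(docs))],
    "links": [{"source": i, "target": j, "value": s} for (i, j, s) in links],
}

with open("collection.json", "w") as f:
    json.dump(data, f)
```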
3.3 Brainiac: a Graph-Based Literature Visualization
Following the text processing described in the previous section, all the data is available and being
served in the backend. The frontend can request the main file, and use the processed data in the
visualization. This section describes the frontend component of the application. Additionally, it presents
the tasks that were derived from the meetings with professor Hugo Ferreira, from IBEB, as well as the
feedback from these meetings and from the first and informal testing phase.
As mentioned in the beginning of this chapter, the development of this visualization followed an
iterative model, focusing on users' feedback, specifically meetings with professor Hugo Ferreira and an
informal testing phase, to guide the design of the application. This testing session did not focus on
validating or evaluating the usability of the application, but simply on gathering feedback from target
users.
This section will describe in detail the main phases in the development of this visualization, namely
the gathering of requirements from IBEB, through professor Hugo Ferreira, the initial version, the testing phase
and the feedback collected, and, at the end, the final version of the application.
3.3.1 Gathering the requirements
As mentioned in the beginning of this chapter, this application was developed in cooperation with
IBEB, specifically with professor Hugo Ferreira. There were initial meetings aiming to obtain a list of
requirements for the visualization, and further sessions aimed at gathering feedback, whether on possible
new features or on changes of approach to already implemented features.
From these requirements, a list of tasks was derived, to focus the development of the visualization:
• Search for a specific document;
• Filter documents by date of publication;
• Method to identify similar documents;
• Identifying topics in documents, and separating documents based on these identified topics;
• Ability to provide a brief overview or summary of a specific document;
• Integration with a search engine, such as Pubmed or Google Scholar ;
• Differentiate different types of studies in the area: clinical trials, guidelines, meta-analyses or
systematic reviews, for example;
• Differentiate between the different evidence levels that are usually attributed to these studies,
specifically clinical trials.
This list of requirements was later used to create the list of tasks used in both testing phases, and
as a guideline for the design of the visualization.
3.3.2 Initial Version
Initially, the application consisted mainly of the sidebar and the three visualizations: the Network
(Figure 3.5.A), the Cluster Layout (Figure 3.5.B) and the Timeline (Figure 3.5.C). The sidebar (Fig-
ure 3.5.D) contained a list of documents, a search feature, and a Words per Topic feature that was
disabled, due to not being ready for testing. This was the version used in the informal testing
phase, although there had been previous feedback from professor Hugo Ferreira regarding some UI elements,
such as the coloring used in the interface.
3.3.2.A The Network
The Network visualization focuses on showing the user the documents in the collection, as nodes,
and their similarity between each other as links between nodes, with these being computed as described
in the previous subsection.
By double clicking on a node in the Network, users were able to center a specific node, arranging the
remaining documents in different rings around the centered node, as seen in figure 3.6. This rearrangement
places documents taking into account their similarity with the center node, and displayed a simple
moving animation on each node, so that the user understood what was happening with the state of the
visualization. There are four different orbits around the center, with documents being placed closer to
the center as their similarity with the centered node increases, and with their placement evenly
distributed inside each ring.
Figure 3.5: Brainiac's initial main view. There are three main visualizations: the Network (A), the Cluster Layout (B), and the Timeline (C). The sidebar (D) lists all documents in the present database, and allows the user to search for specific keywords to add new documents to the visualization.
Figure 3.6: Example of the Network centering feature. By double clicking a node, it is centered in the visualization, arranging the remaining documents in an orbit-like disposition, with the most similar placed in a closer orbit, and the less similar in a more distant one.
There was an initial idea of having the documents actually orbit around the centered document when in this
mode, instead of being in fixed positions. This was later discarded, as it would eventually become too
confusing for users to deal with all the moving nodes, with no actual value being added by this kind of
feature.
3.3.2.B The Cluster Layout
The Cluster Layout displays the documents color coded by the cluster they belong to. It is a simple
visualization designed to represent documents by cluster, and as such, their positioning does
not reflect any computed measure. Nodes are simply placed randomly on the display, and a force was
created to keep each node close to its corresponding cluster's elements.
Both these visualizations (the Network and the Cluster Layout) allow the user to scroll and pan the
view of the nodes. The zoom is implemented as a semantic zoom, instead of the standard graphical
zoom, as it allows the user to view detail without distortion of the elements, instead of simply scaling
up or down the view, as seen in figure 3.7.
Figure 3.7: Example of semantic zoom applied to the Network and Cluster Layout. Node sizes are not scaled up, only the distances between each one, displaying a higher level of detail to the user.
3.3.2.C The Timeline
Finally, the Timeline places each document according to its publication year. Only the year is
considered in this case, as there was somewhat of a lack of consistency in the dates of some documents.
Some had information on their year, but not month or day, so only the year was taken into account
in the collection, in order to keep the results consistent. Contrary to the Network and Cluster Layout,
this visualization does not implement zooming or any kind of axis scaling. Such a feature was not
implemented because a similar one, allowing users to filter documents by year, already existed; it is described
in subsection 3.3.2.E, alongside other designed interactions. Since documents have a full date
available in the details, some documents were changed to include month and day, despite these not being
specified in their metadata, so that details are consistent across the collection.
3.3.2.D UI Components
Regarding the rest of the interface, the sidebar contained, as it was mentioned at the beginning of
this subsection, a simple document list, and a search feature. This document list showed all the documents
currently in the database, and allowed the user to change the way they were sorted: by title
or by date.
The search feature allowed an integration with a search engine, in this case Pubmed, to search for
new documents using the input query. This would search Pubmed and list the top ten results that
were returned. The interface, seen in figure 3.8, allowed users to select which documents they were
interested in adding to the visualization, including their abstracts. After choosing the desired studies
and opting to update the visualization, the application takes a few seconds to process and update the
visualization.
Figure 3.8: Interface displayed to the user after using the search function on the sidebar. It displays the top results returned and allows the user to select any number of these to add to the visualization.
Figure 3.9: Example of layout rearrangement. Moving a window in the grid indicates on the background where it will be placed. If the user tries to move over an existing window, like in the example, the old window is placed in a new free location, normally beneath the window being moved.
The three visualizations are placed into a responsive layout that allows users to manipulate the
arrangement of the visualizations. The layout works as a simple grid, where users can grab one of the
visualizations' title bars and move it, as seen in figure 3.9, or resize it, as seen in figure 3.10. The layout
arranges itself, so if the user tries to move or resize a window over an existing one, the new one takes
priority and occupies that window's position, while the old one is placed below the moved window. Since it
works as a grid, it snaps into the nearest possible position, as can be seen in both figures, and
gives feedback on where the window will be placed in the grid and what its size will be.
3.3.2.E Main Interactions
To facilitate the user’s navigation, several interactions between each visualization were designed. By
hovering a specific document, all the nodes representing the document in the remaining visualizations
become highlighted, as seen in figure 3.11. This hover works similarly in the Timeline, although users
are always able to hover the closest item while the mouse is within the visualization's bounds.
Figure 3.10: Example of window resizing. Resizing a window in the grid indicates on the background the size that it will assume. If the user tries to resize over an existing window, like in the example, the old window rearranges itself, normally beneath the window being resized.
This ability to hover the closest node was implemented due to the initial size of the displayed nodes
in this visualization. As a result of their smaller size, it was hard to hover a specific document, and with
this detail the difficulty was reduced, since it no longer required the user to hover the node with precision.
Figure 3.11: Example of hover interaction: nodes were highlighted by increasing their radius by a few pixels and changing their color to red. Links were highlighted by changing their color to red as well.
This interaction also highlights the document on the list (Figure 3.11.A), but this only occurs if the
document's entry is already visible in the list. Automatic scrolling to the hovered node was disabled
by default, due to its confusing nature: every time a user hovered a new document, intentionally
or not, the document list would scroll as needed to display the document. Instead, a subtle modification
was added, so that hovering a node while holding the control key on the keyboard makes the document list
scroll to the focused document. A tip explaining this behavior was displayed whenever the user
hovered a document without holding the key.
As mentioned before, the Timeline allowed the user to filter documents based on their year of
publication. By dragging a box on the visualization, it would filter out documents that did not belong to the
selected time interval. Documents that were excluded by the applied filter did not appear in the
document list until the filter was changed or removed. However, nodes corresponding to filtered documents
only had their opacity changed, so they appeared "grayed out" in the visualization, while still influencing
the force layout responsible for each visualization, as seen in figure 3.12.
Figure 3.12: Example of the filter interaction in the Timeline. Nodes that are not included in the selected period are grayed out of the visualization.
3.3.3 Informal Testing
This informal testing phase was a formative evaluation, aiming to assess the usability of the initial
version of the visualization with representative users. The goal behind this specific phase was to identify
usability problems with the visualization, in order to improve the user experience in the final solution. As
such, a list of tasks was derived from the requirements that required the user to interact with different
components of the visualization, without focusing on quantitative data such as task execution time
or the number of errors made in the completion of each task.
This subsection will go over the participants and the procedure, and discuss the feedback obtained
from the different users that participated in this test.
3.3.3.A Participants
Subjects were recruited by professor Hugo Ferreira, in order to gather users with contextual knowledge.
There was a total of 5 users, with ages ranging between 23 and 33 years old. Of these users, only
one did not have contextual knowledge.
3.3.3.B Procedure
The tests were performed in a laboratory at IBEB. The purpose of the test, and what they would be
doing, was explained to each participant. They were given a brief explanation of the visualization,
describing its overall layout and the meaning behind each particular view,
as well as the main interactions between each one. Users were asked to "think out loud", manifesting
their opinions on the interface and giving any feedback they could think of. Following this short
description of the application, they were given 5 minutes to explore the visualization, and encouraged to
try different interactions in order to familiarize themselves with the interface.
After this exploratory period, subjects were asked to perform a series of tasks. They were given one
task at a time, receiving the next one when the current was completed. The list of predefined tasks is as
follows:
1. Identify the year with most publications;
2. Identify one of the documents that has the most relations;
3. Identify one of the biggest clusters of documents;
(a) Give example of two documents belonging to that cluster;
4. Identify two of the documents published between 2000 and 2010;
5. Identify the year of publication of the document named “Distinct Brain Networks underlie cognitive
dysfunction in Parkinson and Alzheimer diseases”;
6. Center the network visualization on the document named “Regional volumetric change in Parkin-
son’s disease with cognitive decline”;
7. Give two examples of documents belonging to the same cluster as document named “Structural
Brain Changes in Parkinson Disease With Dementia”;
8. Give two examples of documents that are related to “Temporal lobe atrophy on MRI in Parkinson
disease with dementia”;
9. Create a new visualization with a query for documents relevant to “Alzheimer”;
3.3.3.C Discussion
There were many problems regarding the interface, both reported by the users and detected by observing task execution.
First, some users displayed confusion regarding the meaning of the Cluster Layout and of the clusters it presented, as it did not have any particular interactions with the rest of the visualization. They indicated that they would like to see how the clusters were positioned in the Timeline or the Network.
There were also problems when trying to identify the document a certain node represented, as hovering had been disabled by mistake. This forced users to hover while pressing the control key, which revealed problems with participants that had less experience with computers. These users did not understand how to automatically scroll the list with this hover property, and as such had some difficulty identifying document titles from the visualization.
When asked to look for a specific title, most users wrongly tried to search for it using the search function. It was not obvious to users, even with a description in the input box, that the search function was meant to search for new documents. Without a way to filter documents, subjects were forced to manually scroll the list in search of the required document.
After identifying the required document in the sidebar list, some tasks required users to identify some property of that document in one of the visualizations. A specific example is task 6, which asked users to search for a document and center the network on it. After manually locating the document in the sidebar list, users needed to memorize its location in the Network and only then center it, since the highlight would disappear as soon as the mouse left the node.
In conclusion, this testing phase identified a few critical problems that slowed down the users’ execution. The solutions for these problems are described in the next subsection, which presents the final version of the application.
3.3.4 Final Version
The final version of the application did not include any major changes to the main visualizations, besides the new topic magnets and the file uploader interface, accessible through the sidebar. The Network, Cluster Layout and Timeline, as seen in figure 3.13, did not have major usability problems, and as such appear similar to the initial version (Figure 3.5).
This version changed the coloring of the UI and of the nodes, to facilitate the identification of the different states and of the different clusters in the Cluster Layout. It also reinstated the popup, allowing users to simply hover a node to determine the document’s title, as seen in figure 3.14. This hover function was slightly changed, since it previously overwrote the color of the node, especially in the Cluster Layout, which prevented users from identifying the cluster the hovered node belonged to. The new hover function no longer prompts the user to use the control key to scroll the document list, since that behavior was moved to a new state. Regarding the problems pointed out in the Cluster Layout, hovering a node in this visualization now also highlights documents that belong to the same cluster on the Network and Timeline, as seen in figure 3.15.
Figure 3.13: Brainiac’s main view
Figure 3.14: Example of hovering a document in the Network visualization. It works similarly in the Timeline view.
The second problem associated with hovering a node was the requirement to memorize a node’s location. In order to fix this problem, a new state was introduced: focusing a node. By clicking on a node, it is possible to change its state to appear as “focused”, as seen in figure 3.16. Focused nodes are very similar to hovered nodes, but since hovering requires the user to keep the mouse over the node, the user may instead click on the node to focus it. In this new state, nodes are highlighted with a different color from the hovering state. Focusing also changes the node’s border, since in the Cluster Layout visualization the cluster to which a node belongs is identified by its fill color. Contrary to hovering, focusing a node scrolls the document list on the sidebar so that the corresponding entry is selected, although it does not display the popup unless the user actively hovers the node.
Figure 3.15: Example of hovering a document in the Cluster Layout, highlighting documents that belong to the same cluster in the remaining visualizations.
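The following is a minimal sketch of how these node states could be modelled, assuming a simple enumeration; the concrete colors are placeholders rather than the ones used in Brainiac.

```python
from enum import Enum

class NodeState(Enum):
    DEFAULT = 0
    HOVERED = 1   # transient: lasts only while the mouse stays on the node
    FOCUSED = 2   # persistent: toggled by clicking, survives mouse movement

def node_style(state: NodeState, cluster_color: str) -> dict:
    # Highlights are applied to the border so the fill can keep encoding
    # the cluster the node belongs to; the hex values are placeholders.
    strokes = {NodeState.HOVERED: "#ffb300", NodeState.FOCUSED: "#e53935"}
    return {"fill": cluster_color, "stroke": strokes.get(state, "none")}
```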
In the Cluster Layout, zooming out also allows the user to collapse the nodes into their corresponding clusters, so that the user can work with the clusters directly (see Figure 3.17). Some users mentioned that the animation behind this feature could be a little frustrating at times, if the user zoomed out a bit too much by mistake. In order to solve this, the limits that triggered the node collapse were tweaked, as well as the animation duration for both zooming out and zooming in, so that users do not feel they are wasting time when this happens by accident.
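A small amount of hysteresis between the collapse and expand thresholds is one way to implement this tweak; the sketch below uses illustrative values on a normalized zoom scale, not the ones actually chosen.

```python
COLLAPSE_BELOW = 0.45   # collapse nodes into their clusters under this scale
EXPAND_ABOVE = 0.60     # only expand again once the user zooms back past this

def update_collapse(zoom_scale: float, collapsed: bool) -> bool:
    # The gap between the two thresholds prevents the layout from
    # flickering when the user overshoots the boundary slightly.
    if not collapsed and zoom_scale < COLLAPSE_BELOW:
        return True
    if collapsed and zoom_scale > EXPAND_ABOVE:
        return False
    return collapsed
```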
The search function that allowed users to add new documents by querying Pubmed was removed in this last version. This function presented a few problems that limited its usefulness, mainly the inability to fetch the full document from the search engine. In the initial version, only the abstracts were used for comparison with existing documents; however, an abstract does not contain enough content to make accurate assumptions about similarity within the collection. Another problem with this method was the fact that Google Scholar blocked requests from the Python script dealing with the fetching, due to its policy regarding bots. These issues led to the removal of the feature, changing it into a new file uploading interface that lets users manually add their own documents (Figure 3.18).
Initially, the only way to add new document files to the visualization was to manually place them in the server folder, adding the details to the document.json file, or to use the integration with Pubmed to search for new documents. These approaches were not optimal, as mentioned in the previous paragraph. The new file uploader provides a menu where the user is able to add new files to the visualization, but is required to provide additional details: title, date, authors and the abstract (Figure 3.20). This is a consequence of the metadata processing: since it does not produce consistent results when extracting the required fields, the user is trusted to enter correct data, improving the information in the visualization.
Figure 3.16: Example of focusing a document in the Network visualization.
Figure 3.17: Example of zooming out on the Cluster Layout, which collapses nodes into their corresponding cluster.
In the sidebar, a new document filter was added, allowing users to search for a specific title. Originally, this feature filtered the whole visualization, graying out documents not matching the input query. This was removed in later stages, since the feature’s main function is to search for documents: filtering the whole visualization did not make sense and forced users to clear the query before interacting with the visualization.
3.3.4.A Topic Magnets
The topic magnets submenu lists the top relevant words in the collection. These words are gathered in the final stages of the backend processing, described in Subsection 3.2. Words in this list act as objects that can be dragged onto the Cluster Layout visualization, where an item with that word is created.
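A minimal sketch of how such a list of top relevant words can be gathered is shown below, using scikit-learn’s TF-IDF as an illustrative stand-in for the backend pipeline of Subsection 3.2; the function name and parameters are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_terms(texts, k=20):
    # Rank terms by their aggregated TF-IDF weight over the whole
    # collection and keep the k highest-scoring ones.
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(texts)          # documents x terms
    scores = np.asarray(tfidf.sum(axis=0)).ravel()
    terms = vectorizer.get_feature_names_out()
    return [terms[i] for i in np.argsort(scores)[::-1][:k]]
```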
Figure 3.18: File uploader interface. There is also a list of documents in the collection.
These new items work as magnets, although they are disabled by default. Double clicking on a new object activates it, attracting the documents to itself. As mentioned, the words were measured against each document in the visualization, and the resulting list of similarities is used to vary the attraction between each document and that magnet. This can be used to analyze which documents are related to a specific term, as seen in figure 3.21.
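A minimal sketch of this attraction, assuming each document and each magnet already has a term vector, could scale the pull by cosine similarity on every simulation tick; the strength constant is an illustrative assumption.

```python
import numpy as np

def magnet_force(doc_pos, doc_vec, magnet_pos, magnet_vec, strength=0.05):
    # Cosine similarity between the document and the magnet's term vector
    # scales the pull toward the magnet's position, so related documents
    # drift toward it while unrelated ones barely move.
    sim = float(np.dot(doc_vec, magnet_vec)) / (
        np.linalg.norm(doc_vec) * np.linalg.norm(magnet_vec) + 1e-12)
    return strength * sim * (np.asarray(magnet_pos) - np.asarray(doc_pos))
```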
Users are able to add new words to the list, allowing them to check which documents are closer to a specific topic. The visualization takes a moment to update the list with the new word, due to the processing required, but after this delay users are able to freely use the added topic.
These word objects do not interact with the Network or the Timeline due to the nature of these visualizations. The Network was the first option considered to host the topic magnets. However, it was already used for the centering technique, which led to the magnets being applied in the Cluster Layout instead. The Timeline, on the other hand, was designed to take advantage of the positioning of each node to reveal the publication year of each document. Since the technique described in this subsection requires the positioning of each node to illustrate the similarity between documents and magnets, combining these two techniques would defeat their purpose.
3.3.5 Discussion
The Brainiac visualization allows users to explore a collection of documents in the neuroscience context. It presents different views of the collection, namely the Network, which connects similar documents, the Timeline, which positions documents based on their publication year, and, finally, the Cluster Layout, which differentiates documents based on their clusters. This clustering is based on document content, so users are able to interpret groups of documents as different topics. Additionally, it allows users to create magnet objects based on a specific term, making it possible to explore the documents closest to a specific topic.
Figure 3.19: File uploader interface showing the details provided by selecting one of the document entries.
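As an illustration of this content-based grouping, the following is a minimal sketch using scikit-learn; it is a stand-in for, not a description of, the actual pipeline of Section 3.2, and the number of clusters is an arbitrary assumption.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_documents(texts, n_clusters=6):
    # Vectorize the document contents and group them with k-means;
    # each returned label can be read as a topic-like cluster id.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(tfidf)
```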
By providing a range of interactions between these views, Brainiac focuses on helping users find similar or related documents. Users can start with a specific document in mind, or create a topic and explore the documents closest to it. The cluster arrangement shows documents in the same group, which can help users discover documents within that topic that they did not know about before.
To put this work in context, we compare Brainiac with the existing systems reviewed in Section 2, using a similar table, as seen in table 3.1. We use the same concepts to categorize our visualization, namely the ability to read the original document, to present patterns from the collection, to present an overview of the document, to compare different documents, to present extracted features, to search for specific terms or phrases, and the ability to zoom in on details.
Specifically, in Brainiac the user is easily able to read the original document, by double clicking on a specific document in the sidebar, or by following the “Open Document” button presented in the small overview. Since this overview can become intrusive in the visualization, it is only presented when the document is hovered in the sidebar, instead of in one of the available views.
We consider that the visualization fails to allow the user to explicitly compare documents, as there is currently no way to select two or more documents and compare their properties or topics. Since we do not extract features such as entities from the collection’s content, we also consider that the “features” concept is not present in the visualization. Since the search function currently present in the application only allows filtering documents by their name, we consider this feature not to be present either, as it is not possible for users to search for a specific term or topic. Finally, the zoom component is considered to be present, as it was implemented in both the Network view and the Cluster Layout. It could be interesting to add this zooming feature to the Timeline, should the scalability of our solution create problems with the number of documents present in this view.
Figure 3.20: File uploader interface. The title, date, author list and abstract are fields the user needs to fill, as metadata extraction is not consistent.
Figure 3.21: Example of a topic magnet attracting documents based on their relation with the topic.
Table 3.1: Comparison between the reviewed visualizations and the developed solution.
With the visualization and interface’s main points accomplished, we decided to start a formal testing phase, aimed at evaluating the usability of the visualization and its utility as a tool to aid in interpreting a collection of documents. This chapter describes this formal testing phase: the participants, for both the usability tests and the case studies, the evaluation and the procedure.
4.1 Usability Tests
As mentioned, usability testing is used as a tool to evaluate a product by testing it with representative users. It can be seen as an irreplaceable usability practice, since it gives direct input on how real users use the system [24]. The goal behind this process is to identify usability problems, while gathering qualitative and quantitative data to determine the participants’ level of comfort and satisfaction with the product. Thus, this section goes over the participants, the procedure for these tests, the final results of this testing phase and the discussion of those results.
4.1.1 Participants
Subjects were recruited through standard procedures, including direct contact and word of mouth. Subjects included anyone interested in participating who was at least 18 years old. Each participant was asked to sign a consent form.
During this testing phase, 16 tests were performed. In the first test, technical problems prevented the user from performing the last task. All the tests were conducted between 08h30 and 20h00. None of the subjects that completed the test had professional experience in neuroscience; however, since the test focused on the interface aspect of the visualization, this did not impact the results.
4.1.2 Procedure
The tests were performed in a laboratory on the Alameda campus of Instituto Superior Técnico (IST). The purpose of the study, and what they would be doing, was explained to each participant. Subjects were asked to fill in a consent form, allowing the recording of their actions in the visualization during the test.
After filling in the form, the meaning behind each of the visualizations presented in the application was explained to users, namely the Network, the Cluster Layout and the Timeline, as well as the interactions between them. After this brief summary, participants were given 5 minutes to explore the application’s interface, experimenting with the described functionalities. At the beginning of this phase, a simple script was run to start recording the user’s actions on screen, for later reference.
Following this exploratory phase, subjects were asked to perform a series of predefined tasks, for which the assistant measured the time taken and the number of errors made during execution. Participants were given a single task at a time, receiving the next one when the current one was completed. The list of predefined tasks is as follows:
1. Identify the year with the most publications;
2. Identify one of the documents that has the most relations in terms of similarity;
3. Identify one of the biggest clusters of documents;
(a) Give two examples of documents belonging to that cluster;
4. Filter documents between 2000 and 2010 and identify two documents belonging to that time span;
5. Identify the year of publication of the document named “Distinct Brain Networks underlie cognitive
dysfunction in Parkinson and Alzheimer diseases”;
6. Center the network visualization on the document named “Regional volumetric change in Parkin-
son’s disease with cognitive decline”;
7. Give two examples of documents belonging to the same cluster as document named “Structural
Brain Changes in Parkinson Disease With Dementia”;
8. Give two examples of documents that are related to “Temporal lobe atrophy on MRI in Parkinson
disease with dementia”;
9. Zoom out the cluster view;
(a) Identify the most recent cluster;
(b) Identify a cluster dispersed along the Timeline;
10. Create a new Topic Magnet with “Alzheimer”;
(a) Identify two documents related to the topic;
(b) Identify the closest cluster to the topic;
(c) Create a new topic magnet with “Parkinson” and place it on the opposite end of the previously
created magnet;
(d) Identify the closest document to the new topic;
11. Upload the given document and update the visualization with the newly added document;
(a) Center the network on the new document, and identify two related documents;
(b) Identify the cluster the document belongs to;
With the completion of this set of tasks, users were asked to fill in the System Usability Scale (SUS) questionnaire. The SUS was used to measure the application’s usability, and consists of a ten-item questionnaire using a Likert scale to give an overview of how the user felt about the system [25]. Testers were then given candy as compensation and thanked for their time.
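For reference, the standard SUS scoring procedure [25] can be summarized by the following sketch; the example answers are purely illustrative.

```python
def sus_score(responses):
    # Standard SUS scoring: odd-numbered items contribute (score - 1),
    # even-numbered items (5 - score); the sum is scaled by 2.5 so the
    # final score lies on a 0-100 scale.
    assert len(responses) == 10
    total = sum(r - 1 if i % 2 == 0 else 5 - r   # i is 0-based: even i = odd item
                for i, r in enumerate(responses))
    return total * 2.5

print(sus_score([5, 2, 4, 1, 5, 2, 5, 1, 4, 2]))  # illustrative answers -> 87.5
```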
4.1.3 Results
The distribution of the time taken in each task can be seen in figure 4.1. The first tasks given (Tasks 1 to 3, including 3a) were very easy, as they did not require the user to make changes to the initial state of the visualization. Each of these tasks required the user to look at the initial state of one of the visualizations and identify something.
The first task was simple and, apart from two users that did not immediately understand the objective, no users had problems reaching the correct answer within a few seconds, with a mean of four. In the second task, some users first looked at the Cluster Layout before understanding that the correct response required them to find a document in the Network. This can be seen in the box plot for this task (Fig. 4.1(a)), where the distribution is more disperse, which is also the case for task 3.
In the first task, there were no detected errors. However, tasks 2 through 3a each had at least one user making errors during execution.
The fourth task required users to apply a filter in the Timeline and identify documents. The box plot for this task shows a more compressed distribution of the time taken, with only two users reported as making an error in the completion of this task.
Tasks 5 through 8 required users to search for a specific document and make the same kind of observations as the first group of tasks. As such, the first task of this group, task 5, had a very disperse distribution, with almost all users making at least one error. Contrary to task 5, the rest of the group displayed a more compact distribution, with fewer users making errors.
Tasks 9, 9a and 9b required the user to combine the Cluster Layout and the Timeline. The first simply required the user to zoom out on the Cluster Layout and, as such, does not present a sparse distribution, although there were still a few errors. Task 9a presents a more disperse distribution, while 9b is denser in the box plot.
The next group of tasks, 10 through 10d, required users to work with the Topic Magnets in the sidebar. These tasks presented a denser execution time distribution, although there were outliers corresponding to users who did not understand the task at first. Only a few users made errors in the execution of these tasks and, overall, the distribution of task times is compact.
Lastly, tasks 11 to 11b also displayed less variation in their spreads, with errors detected only in the first two tasks of the group.
From the SUS questionnaires, usability was measured with a mean score of 82.5 across all users (see figure 4.2), with a standard deviation of 9.287, indicating that results do not vary much from the mean. Research indicates Web-based SUS scores to be, on average, 68 [26]. Since the score in this testing phase reached an above-average 82.5, with a low standard deviation, it can be concluded that users were satisfied with the usability of the application, apart from the identified errors.
Figure 4.1: Distribution of time taken in each task: (a) Tasks 1 to 9; (b) Tasks 9a to 11b.
Figure 4.2: A comparison of the adjective ratings, acceptability scores, and school grading scales, in relation to the average SUS score [27]. The questionnaires place this visualization at 82.5, marked A in the figure.
4.1.4 Discussion
In general, the results were very good. The execution times were low and, in general, did not present a disperse distribution, with only simple errors being made when users did not understand a specific interaction right away. The results did not point to any obvious usability problem, although the analysis pointed to UI elements that required subtle changes to improve the user interface.
In the first group of tasks, tasks 1 and 3a were completed without problems; however, tasks 2 and 3 have a wider spread, which could be attributed to the wording of the requested tasks.
In the second task, users were asked to identify one of the documents that displayed “the most relations”. Some users tried to determine the single document with the most relations and were unsure which to pick, which caused the larger spread on that task. Something similar occurred in the third task, which led to some users completing it in just a few seconds, while others tried to compare the number of elements of each cluster.
As mentioned, the fourth task required users to filter the Timeline. While the task itself did not present a significant problem, it caused the large spread on task 5, since the filter applied in task 4 was often still active. Because the filter applies not only to the visualization but also to the document list on the sidebar, many users did not remember at first to remove the filter from the Timeline before searching for the required document. Other users searched for the required document by scrolling through the list, without using the document filter. This could be attributed to some users not noticing that they could search the document list by typing the name of the document, as the input box may not be obvious at first glance.
The rest of the tasks that required the user to identify a property of a specific document did not have such a large spread, as users had already removed the filter. However, some users noticed a bug in the document filter: it did not match any documents if the query started with a lowercase letter. This flaw was not obvious at first, and led some users to search by manually scrolling through the list looking for the needed title.
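The fix for such a bug amounts to comparing both strings case-insensitively, as the following minimal sketch shows, assuming documents expose a title field.

```python
def filter_documents(docs, query):
    # casefold() lowercases both sides, so a query such as "parkinson"
    # now matches "Parkinson disease ..." regardless of how it is typed.
    q = query.casefold()
    return [doc for doc in docs if q in doc.title.casefold()]
```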
Tasks 9a and 9b required the user to hover each cluster in the Cluster Layout and follow its spread on the Timeline. However, some participants did not understand that they could exploit the hover interaction to quickly identify the solution. Some tried to manually scan the Timeline, identify which cluster each document belonged to, and estimate the answers, which led to high execution times in those cases.
The group of tasks that involved creating new topic magnets did not present any significant problem. The times measured in tasks 10 and 10c include the time needed for the preprocessing required in the backend, which normally added around 20 seconds to the completion time. Due to technical problems that two users faced with the preprocessing, they repeated the task, although with no significant improvement that could skew the results.
The last group of tasks involved the upload of a new document to the visualization. One of the users experienced technical problems in this task and, due to time constraints, did not perform any of the tasks in this group. The first task, 11, also included the time required to upload and process the new document, which explains the higher execution times displayed. Most users identified a problem with the file uploader interface: after filling in the required details, they did not understand how to proceed with the upload. As such, the positioning of the upload button was changed to improve the clarity of the process of adding new documents to the collection.
The rest of the tasks in this group, tasks 11a and 11b, did not present any significant findings, as users simply had to repeat earlier tasks on the new document.
In conclusion, following participants through the given list of tasks led to finding some subtle problems with the UI, such as elements that were not sufficiently highlighted, like the document filter on the sidebar. Other problems were promptly identified by the users themselves. One example, the positioning of the upload button in the file uploader interface, was already mentioned, but two users also noted that it could be hard to distinguish cluster colors on the Timeline when comparing a darker green with the default node color.
4.2 Case Studies
These studies aimed at testing the utility of the visualization. Since this visualization was designed as a tool to aid the exploration of a collection of documents, the case studies were performed to evaluate whether the results from the visualization were correct. The goal behind this process is to verify and consolidate the user’s context knowledge, or even to possibly uncover unexpected data patterns. Thus, this section goes over the participants, the procedure and the results of this testing phase, as well as a discussion regarding these results.
4.2.1 Participants
Contrary to the usability testing phase described in Section 4.1, participants were required to have some context knowledge, in order to validate the information being displayed. With this in mind, subjects were recruited with the help of professor Hugo Ferreira, from IBEB.
During this phase, two case studies were performed. The tests were conducted between 11h30 and 12h30. The first tester was a Ph.D. student and the second an MSc student, both from Biomedical and Biophysics Engineering. Although they did not have direct experience with the topics included in the collection of documents, both subjects had enough experience in the area to understand the main topics of each document.
4.2.2 Procedure
The tests were performed in a laboratory at IBEB. The purpose of the study, and what they would be doing, was explained to each subject. Participants were asked to fill in a consent form, allowing the recording of their actions in the visualization during the test.
After filling in the form, users were given an explanation of the meaning of each visualization, namely the Network, the Cluster Layout and the Timeline, as well as the interactions between them. After this brief summary, participants were given 5 minutes to freely explore the application’s interface, experimenting with the described functionalities. At the beginning of this exploratory period, a simple script was run to start recording the user’s actions, for later reference.
Following this phase, subjects were given 15 minutes to freely explore the visualization, aiming to validate or disprove the visualization’s results.
4.2.3 Results
Both participants focused their analysis mainly on the Network’s relations, trying to understand whether the existing links could be validated. The second user also tried to verify the clustering displayed in the Cluster Layout, including the topic magnets that work with this visualization.
In general, the first user thought the results were good, as similar documents were correctly linked together. However, he was also able to find documents that should not have been linked at all. A particular example was a link between two studies that both mentioned Alzheimer’s disease, although their focus was different: one focused on MRI scans, the other on the influence of microglial activation on Alzheimer’s disease.
The second user’s feedback was in line with the first subject’s, as he also pointed out documents whose focal points did not match, although a secondary topic allowed the similarity connection to exist.
4.2.4 Discussion
In general, the results displayed were in accordance with what was expected, but the existence of wrongly linked nodes is cause for concern. Even though the wrongly linked nodes dealt with the same topics (in the specified example, Alzheimer’s disease), the focal point of each study needs to be taken into account.
In conclusion, the visualization can be an asset, with the potential to help users guide their research in this area. By providing certain interesting documents as focus points, users can better direct their efforts at what they are looking for, without losing the exploratory sense that would exist without this tool. However, in order to improve the trust users can place in this tool, further tweaking of the text processing pipeline may be needed. Additional testing is also required to further validate the utility of this visualization, focusing on the clustering and the topic magnets, as they may present interesting results.
5 Conclusion
Nowadays, users face the problem of too much information being available. A user trying to research a new topic will face a collection of context-specific documents, and exploring this collection may require knowledge of specific concepts that is only available to more experienced users. With that in mind, different visualizations were reviewed. These tried to help users understand the content of a single document or make sense of a whole collection of documents, usually helping the user see what kinds of topics are present in the visualization. Combining the comparison between the reviewed visualizations with the requirements gathered from professor Hugo Ferreira, from IBEB, a list of tasks was derived that helped guide the development of the application. The development followed an iterative model that relied on feedback collected from users to improve the visualization’s usability. An informal testing phase took place in order to gather feedback and detect possible usability problems before the final usability tests. Finally, a formal testing phase took place, consisting of two parts: usability tests and case studies. The former focused on measuring the usability of the application, while the latter aimed to validate the utility of the developed solution.
From this formal testing phase, we can conclude that all the defined objectives were met, with good final results. The main objective was to build a visualization that allows users to analyze the content of, and the similarity between, the documents in a collection. Several intermediate objectives were defined to guide the development, including building a database of documents, designing and developing the layout of the application, and evaluating the final solution.
The database was completed with the help of professor Hugo Ferreira, who gave guidelines on which topics to search for. This database was meant to support the development of the application and, as such, did not contain a very diverse collection of documents. The design and development were also completed successfully, although not all of the requirements collected from IBEB were implemented. Lastly, the evaluation of the final solution, through the formal testing phase, ended with good results regarding both the usability of the application and its utility, although there is room for improvement.
Although the scope of this project was mainly aimed at working with documents from the neuroscience context, it can easily be applied to other subjects as well. The development followed a generic approach, so that our work can be applied to a different area of expertise without much effort. Admittedly, some of the requirements that were not implemented would have contributed to narrowing the focus of this work, but due to time constraints, the requirements were prioritized in such a way that this generic approach was possible.
Additionally, there are some concerns regarding the scalability of this project. As mentioned, the development focused mainly on a smaller collection to aid our work on document processing. Lacking a bigger document collection, the scalability of our solution was not taken into account during design and development. As such, it is important to keep in mind that the application may present performance issues as the collection grows, and the visualizations may display a larger amount of visual clutter, complicating the interpretation of the dataset. As future work, it would be interesting to improve this aspect of our solution. A possible course would be to change the visualization so that it is possible to hide or collapse unrelated documents, in order to avoid displaying too many nodes at the same time.
Furthermore, there is additional work that involves improving the backend text processing described in Section 3.2, specifically extending the system to use bigrams and trigrams. By using these contiguous word sequences, the text analysis would be able to take context into account when measuring similarity between documents, which could be used to solve the wrongly linked nodes in the Network and improve the existing connections (a minimal sketch is given below). There could also be further work on improving the method of adding new documents to the visualization. This could follow professor Hugo Ferreira’s idea of integrating the visualization with a search engine such as Pubmed or Google Scholar, with a procedure to automatically fetch the full document. Alternatively, another method would be to allow the drag and drop of files into the visualization, with automatic fetching of metadata from the file or from an online database, removing this concern from the user.
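As mentioned above, a minimal sketch of an n-gram-based similarity pipeline, using scikit-learn as an illustrative stand-in for the backend of Section 3.2, could look as follows.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def ngram_similarity(texts):
    # ngram_range=(1, 3) adds bigrams and trigrams to the vocabulary, so
    # multi-word expressions such as "temporal lobe atrophy" are matched
    # as units and contribute context to the pairwise similarity matrix.
    vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
    tfidf = vectorizer.fit_transform(texts)
    return cosine_similarity(tfidf)
```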
Bibliography
[1] C. Collins, S. Carpendale, and G. Penn, “Docuburst: Visualizing document content using language
structure,” in Computer graphics forum, vol. 28, no. 3. Wiley Online Library, 2009, pp. 1039–1046.
[2] M. Wattenberg and F. B. Viégas, “The word tree, an interactive visual concordance,” IEEE transac-
tions on visualization and computer graphics, vol. 14, no. 6, pp. 1221–1228, 2008.
[3] M. Spindler and R. Dachselt, “Paperlens: advanced magic lens interaction above the tabletop,” in
Proceedings of the ACM International Conference on Interactive Tabletops and Surfaces. ACM,
2009, p. 7.
[4] J. Chuang, D. Ramage, C. Manning, and J. Heer, “Interpretation and trust: Designing model-driven
visualizations for text analysis,” in Proceedings of the SIGCHI Conference on Human Factors in
Computing Systems. ACM, 2012, pp. 443–452.
[5] C. Görg, Z. Liu, J. Kihm, J. Choo, H. Park, and J. Stasko, “Combining computational analyses and
interactive visualization for document exploration and sensemaking in jigsaw,” IEEE Transactions
on Visualization and Computer Graphics, vol. 19, no. 10, pp. 1646–1663, 2013.
[6] J.-K. Chou and C.-K. Yang, “Papervis: Literature review made easy,” in Computer Graphics Forum,
vol. 30, no. 3. Wiley Online Library, 2011, pp. 721–730.
[7] S. Havre, E. Hetzler, P. Whitney, and L. Nowell, “Themeriver: Visualizing thematic changes in large
document collections,” IEEE transactions on visualization and computer graphics, vol. 8, no. 1, pp.
9–20, 2002.
[8] N. Cao, J. Sun, Y.-R. Lin, D. Gotz, S. Liu, and H. Qu, “Facetatlas: Multifaceted visualization for
rich text corpora,” IEEE transactions on visualization and computer graphics, vol. 16, no. 6, pp.
1172–1181, 2010.
[9] S. Lehmann, U. Schwanecke, and R. Dörner, “Interactive visualization for opportunistic exploration
of large document collections,” Information Systems, vol. 35, no. 2, pp. 260–269, 2010.
[10] G. Marchionini, “Exploratory search: from finding to understanding,” Communications of the ACM,
vol. 49, no. 4, pp. 41–46, 2006.
[11] R. W. White, B. Kules, S. M. Drucker et al., “Supporting exploratory search, introduction, special
issue, communications of the acm,” Communications of the ACM, vol. 49, no. 4, pp. 36–39, 2006.
[12] D. A. Keim, J. Kohlhammer, G. Ellis, and F. Mansmann, Mastering the information age-solving
problems with visual analytics. Florian Mansmann, 2010.
[13] K. A. Cook and J. J. Thomas, “Illuminating the path: The research and development agenda for
visual analytics,” Pacific Northwest National Laboratory (PNNL), Richland, WA (US), Tech. Rep.,
2005.
[14] F. B. Viégas, M. Wattenberg, and J. Feinberg, “Participatory visualization with wordle,” IEEE trans-
actions on visualization and computer graphics, vol. 15, no. 6, pp. 1137–1144, 2009.
[15] G. A. Miller, “Wordnet: a lexical database for english,” Communications of the ACM, vol. 38, no. 11,
pp. 39–41, 1995.
[16] A. Thudt, U. Hinrichs, and S. Carpendale, “The bohemian bookshelf: supporting serendipitous book
discoveries through information visualization,” in Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems. ACM, 2012, pp. 1461–1470.
[17] P. André, J. Teevan, S. T. Dumais et al., “Discovery is never by chance: designing for (un) serendip-
ity,” in Proceedings of the seventh ACM conference on Creativity and cognition. ACM, 2009, pp.
305–314.
[18] A. Foster and N. Ford, “Serendipity and information seeking: an empirical study,” Journal of Docu-
mentation, vol. 59, no. 3, pp. 321–340, 2003.
[19] T. Gup, “Technology and the end of serendipity,” The Chronicle of Higher Education, vol. 44, no. 21,
p. A52, 1997.
[20] E. G. Toms, “Serendipitous information retrieval.” in DELOS Workshop: Information Seeking,
Searching and Querying in Digital Libraries. Zurich, 2000.
[21] Y. Hassan-Montero and V. Herrero-Solana, “Improving tag-clouds as visual information retrieval
interfaces,” in International conference on multidisciplinary information sciences and technologies.
Citeseer, 2006, pp. 25–28.
[22] J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of massive datasets. Cambridge university
press, 2014.
[23] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, “On the surprising behavior of distance metrics in
high dimensional spaces,” in ICDT, vol. 1. Springer, 2001, pp. 420–434.
[24] J. Nielsen, Usability engineering. Elsevier, 1994.
[25] J. Brooke et al., “Sus-a quick and dirty usability scale,” Usability evaluation in industry, vol. 189, no.
194, pp. 4–7, 1996.
[26] A. Bangor, P. T. Kortum, and J. T. Miller, “An empirical evaluation of the system usability scale,” Intl.
Journal of Human–Computer Interaction, vol. 24, no. 6, pp. 574–594, 2008.
[27] A. Bangor, P. Kortum, and J. Miller, “Determining what individual sus scores mean: Adding an
adjective rating scale,” Journal of usability studies, vol. 4, no. 3, pp. 114–123, 2009.