Brainiac: a Graph-based Literature Visualization
Miguel Alexandre Lourenço dos Santos
Thesis to obtain the Master of Science Degree in
Information Systems and Computer Engineering
Supervisors: Profª Sandra Pereira Gama and Prof. Hugo Alexandre Ferreira
Examination Committee
Chairperson: Prof. Luís Manuel Antunes Veiga
Supervisor: Profª Sandra Pereira Gama
Members of the Committee: Profª Ana Paula Boler Cláudio
November 2017
Acknowledgments
I would like to thank my parents for their friendship, encouragement and caring over all these years,
for always being there for me through thick and thin and without whom this project would not be possible.
I would also like to thank my grandparents, aunts, uncles and cousins for their understanding and support
throughout all these years.
I would also like to acknowledge my dissertation supervisors Profª Sandra Gama and Prof. Hugo
Ferreira for their insight, support and sharing of knowledge that has made this Thesis possible, by
keeping me motivated to carry out the development of this project.
I would like to thank my colleagues, Tomás Alves, Rodrigo Veríssimo, and Luís Fonseca, for their
time, knowledge, dedication and friendship. You're simply the best.
Last but not least, to all my friends and colleagues that helped me grow as a person and were always
there for me during the good and bad times in my life. Thank you.
To each and every one of you – Thank you.
Abstract
Nowadays, users face the problem of having too much information available. A user trying to research a
new topic will face a collection of context-specific documents, and exploring this collection may require
knowledge of specific concepts that is only available to more experienced users. In this work, we
address this problem in the neuroscience context by creating a visualization, in collaboration with Instituto
de Biofísica e Engenharia Biomédica (IBEB), that helps users analyze a collection of documents by
indicating documents that may be similar. The developed visualization has the potential to help users in
this context: by interacting with different views, it allows combining a document search by similarity with
a search by different topics. We conducted an evaluation to measure the usability of the developed application
and its utility, in order to validate the visualized data. The results from the usability test were very good, with
no obvious interface problems. Validation of the processed data also shows good results, with room for
improvement given some errors detected in text processing.
Keywords
Text Visualization; Information Visualization; Visual Analytics; Text Processing; Document Collection;
Neuroscience.
Resumo
Nowadays, users are confronted with too much available information. When a user searches for
information on a new topic, they will find a large quantity of documents which often require experience
in the field to interpret, or to filter the most relevant content. This work aims to solve this problem in
the field of Neuroscience by creating a visualization, in collaboration with IBEB, that allows users to
analyze a collection of documents, indicating documents that may be similar and of interest to the user.
The developed visualization shows potential to help users in this context through the exploration of the
different views developed for this purpose. By combining these views, it is then possible to search for
documents by similarity or by a given topic. Finally, the final solution was evaluated with respect to its
usability and utility, in order to validate the results that can be visualized in the application. The results
were good, with no obvious errors in the user interface. The validation of the processed data also
showed good results, leaving room for possible improvement regarding some errors detected during
this evaluation.
Palavras Chave
Information Visualization; Neuroscience; Text Visualization; Document Collections;
the hyponyms. A search function is provided, enabling the user to highlight nodes matching the query.
Additionally, it allows for a semantic zoom through a fisheye view that collapses the nodes farthest from the
root node. The full text of the document is available at the bottom of the interface (Figure 2.3), which
can be used through the linked visualization.
By providing an overview of the document content, users are able to compare multiple documents,
by having the trees rooted on the same synset, showing the difference between documents’ content.
Enabling users to view this comparison allows different applications, such as plagiarism detection, doc-
ument categorization and authorship attribution.
Lastly, there is the Word Tree [2], introduced as a visualization focused on exploring repetitive text.
It takes the form of a tree structure, with the words that follow a particular search term arranged
spatially, as seen in figure 2.4. It was designed to exploit the interest behind visualizing unstructured text
in the Many Eyes website, where users could upload and visualize their own data.
The design is compared to an interactive form of the keyword-in-context technique, since the visual
design makes it easy to spot repetition in the contextual words that follow a phrase, as well as having a
natural tree structure, while having clear ways to interact with the visualization. Since it was intended to
allow users to test the visualization in the Many Eyes site, users had to rapidly comprehend the visual
design, else they could just ignore the visualization altogether.
The layout of the tree consists in the typical branching, to associate it with the tree structure, and,
similarly to word clouds, font size is used to indicate word or phrase frequency. Branches from the
tree continue until the frequency is equal to one, instead of stopping at the first unique phrase.
Figure 2.4: Word Tree [2] visualization.
Once
users enter a search query and the tree is generated, they are able to explore the tree. While exploring,
hovering a particular word or phrase reveals additional information, while clicking on an individual word
allows the user to adjust the phrase or the root of the tree that is being shown.
Some scalability issues arise with the usage of font scale to show common words. As the text input
size increases, text readability becomes a concern, since having very common words as the source of
the tree reduces the text scaling to a point where it becomes barely readable. To handle this, the authors
chose not to display the less relevant branches, although this could turn into a problem, as there could be some
loss of information when removing certain extensions.
The Word Tree was made available on the Many Eyes site, where users could upload and visualize
their own text documents. Although the site is not available anymore, at the time, users started taking
advantage of its similarity to word clouds to grasp a general understanding of the text, which varied from
a collection of Twitter posts to newsgroup discussions. Although this visualization was intended to
analyze unstructured text, the authors realized that the users started using structured data to exploit the
tree structure. Being accessible in the site also helped to get feedback, which was generally positive
and provided some suggestions to the design, such as the option to ignore punctuation marks and stop
words (words that do not contribute to content, providing unnecessary information, such as articles and
prepositions), the ability to drill down from the tree structure to the plain text, to see the uses of particular
words or phrases, and to show a network of the connections between two words or phrases.
Overall, the Word Tree was considered a flexible solution to visualize both unstructured and structured
data, with a good reaction from users, and future work included combining the word tree with some
other text visualization, since the user starts by staring at a blank page, waiting for a search query, and
improving the design to be able to handle larger datasets.
2.2 Document Corpora Visualization
When the scope expands from a single document to the complete collection, visualizations tend to
be extended to a more exploratory search, while not disregarding search methods.
The simpler features that can be used are derived from the metadata. The PaperLens [3] system
was devised to visualize trends and connections in conference papers, extracting the authors, topics
and citations of these papers.
Figure 2.5: Layout of the PaperLens [3] visualization.
Regarding the visualization layout, it provides distinct views of the dataset, as shown on Figure 2.5.
Users are able to find popular papers sorted by year and topic (Figure 2.5a), or retrieve a list of papers
from a specific research area, by selecting a topic, or by author(s) (Figure 2.5c) shown in the selected
authors region (Figure 2.5b), whose work is differentiated by the colors attributed. The design allows
the user to discern the most influential papers by topic, resorting to a list of the most referenced papers
(Figure 2.5f). Lastly, a co-authorship graph is provided, enabling users to explore relations between
authors in the collection (Figure 2.5d). The implementation allows interaction between the visualizations,
as selecting a topic, paper or author will load related items in the remaining visual representations.
User case studies were overall positive, despite showing a few design issues, specifically in the
author search, which would find substring matches while not allowing searches for first or last names.
There were also some issues noted regarding consistent behavior throughout the layout, and users not
understanding the purpose of the initial layout, considering some segments as "recreational", as mentioned
by the authors.
In conclusion, most of these issues were easily solved by adopting a simpler design, yet the scaling
concern was well-founded, as the representation used in this visualization was not able to depict the dataset
as it expanded.
The Bohemian Bookshelf [16] is an additional example of a visual representation that resorts to the
usage of metadata. This system is laid out as a digital book collection, designed to tackle accidental
discoveries – serendipity. This is accomplished by allowing a “shelf browsing” like experience, which
has been shown to inspire serendipitous discoveries. The "shelf like" browsing is attained with the
different visualizations acting as a whole, offering multiple access points due to different perspectives
from the views, drawing attention with the visually distinct visualizations and providing distinct, yet playful,
approaches to information exploration.
Figure 2.6: The Bohemian Bookshelf's [16] visualization layout. On the left side, there are the Keyword Chains on top and the Timelines on the bottom. On the right, there is the Cover Circle view on top, with the Author Spiral on the bottom. Lastly, the Book Pile is in the middle of the layout.
Different factors are usually associated with these discoveries, such as observational skills, open-
mindedness (receptiveness to unexpected information), knowledge and perseverance, as well as exter-
nal factors, for instance coincidence and influence of other people or systems [17–20]. Libraries and
physical bookshelves improve serendipity, due to exploratory sense present in these systems. From
here, the authors derived a few design considerations to promote serendipity, namely, multiple access
points, which correlated to open-mindedness and the researcher's eagerness to analyze data from
diverse perspectives, juxtaposition or adjacency of information, multiple pathways, and curiosity and play.
The layout of the visualization consists mainly of five different perspectives on the dataset, the Cover
Color Circle, Keyword Chains, Timelines, Book Pile and the Author Spiral, as depicted in Figure 2.6.
The first visualization, the Cover Color Circle, provides a first look at the books through their cover color,
showing an averaged color of each cover's image. Books are displayed as circles, grouped by the respective
calculated colors, in a circular layout, and hovering a specific book will provide the user with a preview
of the book’s cover.
The second, Keyword Chains, exploits keyword usage to represent content, simplifying the catego-
rization and search. It displays the selected book in the center, and distinct keywords branching out.
Each keyword is followed by a book title, which will be followed by another keyword and so on, forming
the keyword chain, which can be focused on a different book title by clicking on it, or on a keyword,
restructuring the tree around the corresponding book.
Thirdly, the Timelines visualization displays the association between the book’s publication year and
the time period depicted. The layout consists of two timelines, with the upper one representing the
publication year, and the lower indicating the focus of the book’s content. The books are illustrated as
circles with respective color in each of the timelines, with a line connecting both of them that shows the
relation between publication year and the initial time period the book covers.
The Book Pile looks to provide further insight into the physical aspects of the books, with each one
being represented as a square, its page count expressed in its edge length, and its color borrowed from
the book. Books with fewer pages are represented at the bottom, while thicker ones are shown on top.
Lastly, the Author Spiral displays the books by authors' names in a list that rolls up into a spiral at both
ends, due to space issues. As the names start spiraling, they are replaced by circles that represent
the books in the library, again expressing the book’s color. Clicking on a text label or circle will show a
preview of the book, similar to what was mentioned earlier.
All of these visualizations are interlinked and, combined, bring different perspectives to the user, as
actions taken on a specific one will cause the rest to adapt; for example, when selecting a book, it will be
highlighted in all views. However, there are some issues regarding scalability. For example, increasing
the collection’s size will be costly performance wise. Overall, this visualization takes a more playful
approach to data exploration while taking into account the serendipity concept, which could be further
developed and evaluated with case studies.
The source text of the documents in the collection can be used to visualize the dataset; however, due
to the large nature of the collection, this rapidly becomes unfeasible. Computed features such as word
similarity or topic similarity are instead used to derive a likeness metric that compares documents
and projects the differences onto a visualization, which by itself may lead to trust problems [4]. While the
first similarity metric utilizes directly the source text, the second takes into account related terms used,
which is convenient when the documents do not use the same exact words.
An example of this is the Dissertation Browser [4], a visual analysis tool developed to investigate
collaboration between different academic departments. The adopted approach resided in detecting
shared language or terms across publications of various areas, seeing that the authors mention the
different vocabulary across distinct areas.
Figure 2.7: Landscape view on the Dissertation Browser [4] system.
This visualization consisted of three different “views”: Landscape view, Department view and Thesis
view. The first one, the Landscape view (Figure 2.7), encoded each department as a circle, with its area
representing the number of published theses, and the distance between each circle defining the similarity
among them. This view was intended to reveal patterns in the university’s research areas, however,
being subject to a projection, it led to trust issues.
In response, the second visualization, the Department View (Figure 2.8), was conceived, which enables
focusing on a single area and evaluating the similarity between that department and the remaining ones.
Identical to the first design, distance also encodes the similarity, in a radial layout.
The last visualization, the Thesis View (Figure 2.9), aimed to validate the results from the similarity
measure observed previously. Identically, it is focused on a single department, displaying theses from the
area and the most resembling ones from other departments. This visualization is shown by clicking on
the focused area in the Department View and provides insight into the similarity between two areas and
which theses account for this proximity.
Figure 2.8: Department view on the Dissertation Browser [4] system.
Figure 2.9: Thesis view on the Dissertation Browser [4] system.
User tests revealed the first view was not suitable, as the trust issues generated by the projection
artifacts led to users wrongly identifying unusual trends. In addition, the Department view allowed
users to identify similarities they did not expect, with both word similarity and topic similarity, which
demonstrated that both these features could have decreased accuracy. Utilizing the Thesis view revealed
the cause of this reduced accuracy: topic similarity would position Biology close to Computer Science
due to the existence of computational biology, while word similarity would mark two departments as
resembling each other in situations where they used the same rare words.
Jigsaw [5] is a separate system that provides different visual representations of computed features.
It produces a summary of the collection or of a single document, a measure of similarity between
documents, and clustering; it identifies entities and connections among them, and possible related entities for
further investigation. Additionally, it allows for document sentiment analysis, which provides insight into
sentiment, subjectivity, polarity and other attributes.
The authors take into account the two factors introduced by the Dissertation Browser – interpretation
and trust. Since the visualizations are based on computed features, it is important to understand how
accurate the results are, assuring that users make trustworthy inferences from their interpretation.
Figure 2.10: Jigsaw's [5] List View. Shows the conference, year, author, concept and keywords associated. In the bottom figure, the concept graph is selected, showing connected years, concepts and authors.
The List View (Figure 2.10) provides a data cleaning phase, where users are able to select a list
of documents of interest, as well as presenting the user with the most important relations in the dataset. Fig-
ure 2.10 shows an example of how the documents can be clustered (conference, year...), although this
can be personalized according to the context of the documents.
Selecting documents in the List View will allow for interaction in the remaining perspectives. In the
Document Viewer (Figure 2.11), users have access to the source text of the document as well as
related information of the selected documents, including a summary of all the chosen documents and a
summary of the single document selected from the list.
Figure 2.11: Jigsaw's [5] Document Viewer shows a summary of the loaded documents (left panel) at the top, and summarizes the selected document on the right in the Summary panel.
The Document Clustering View provides another approach to a selection of documents, allowing the
user to differentiate the main topics by identifying the clustering results, with some advanced options to
personalize the clustering of the documents.
personalize the clustering of the documents.
In terms of document similarity, the Document Grid Viewer (Figure 2.12) provides users with the
ability to compare a set of documents to another one, ordering them by the selected measure, in this case
the similarity.
Lastly, the Word Tree Viewer (Figure 2.14) shows the occurrences of a specific word, as well as
the common phrases it is associated with. This visual representation has been reviewed previously, as the
Word Tree [2].
Figure 2.12: Jigsaw's [5] Document Grid Viewer displays the documents in a grid, ordered by similarity according to the selected document.
Figure 2.13: Jigsaw's [5] Document Cluster View displays the different clusters of similar documents.
This work demonstrated different combinations of text analysis and interactive visualization to aid the
user in exploring a specific document collection. Although pre-processing the whole dataset could be a
potential scaling issue, the visualization itself is generally fluid, supporting a variety of different areas,
for instance, aviation documents, source code files, fraud investigations, and, as discussed in some use
cases, academic research and consumer reviews. One important caveat is the lack of user evaluations,
which could have identified more apparent issues with the visualization.
A follow-on system is the PaperVis [6] visualization, proposed as a solution to the abundance of
information when investigating a certain research field and obtaining sizeable amounts of related
papers. It represents relevant papers as a graph, using modified radial space-filling and bullseye-view
techniques, and provides several visual cues, such as node colours, sizes and boundaries, to represent each
paper's relevance (Figure 2.15E).
There are some features provided to enhance the visualization as an exploration tool: efficient
screen usage, adopting ideas from radial space filling and the bullseye view layout; visual indications to
distinguish results, with a specific paper or keyword at the centre and the other papers organised relative
to it; a user-friendly interface, allowing the user to explore different views and analyze results at will; and
a history mechanism, to prevent users from feeling lost in the visualization.
The interface provides a clear way to change between different modes (Figure 2.15A) and change
other configuration options, an area to review exploration history (Figure 2.15B), a filter and selection
control (Figure 2.15C), and details of the currently selected paper (Figure 2.15D).
In the first mode, citation-reference mode, the main visualization produces a radial view, with the
paper of interest located in the center and the rest of the papers distributed within ten bin circles
around the selected one, where the distance to the center is defined by the relevance of the document,
characterized by the citations and references it has. Citations are revealed for a single document by
clicking on it, while a double-click re-centers the whole graph on the new paper.
Figure 2.14: Jigsaw's [5] Word Tree Viewer shows the occurrences of a specific word, followed by the most common phrases where the word appeared.
In the second mode, keyword mode, users are able to find documents that share keywords,
or use keywords as a cluster category, thus being able to discover pertinent contributions in a certain
research field. By selecting a keyword, the system will load all related documents, and display the
keyword at the root, with appropriate papers surrounding the root node, with importance also being
calculated according to citations.
In the last mode, mixed mode, papers are loaded just like in the first mode, except the layout is
arranged similarly to the keyword mode, as well as the process used to link other papers.
There were some issues identified, related to the time needed to load the dataset into
memory, which took around six seconds according to the authors. This is explained by the complexity behind
the ordering in Citation-Reference mode, and the clustering algorithm in the Keyword mode.
The proposed visualization made literature review an easier task, and the three modes provided
implement different ways to explore the papers in the dataset. The authors mention a possible improve-
ment to the design, by arranging multiple focus points in the center.
Approaching techniques that focus on the subject of the documents, there is the ThemeRiver [7]
visualization, which allows the user to identify a document collection’s thematic content and its variations
over time, as well as the relative strength of the themes. These are shown in the context of a timeline
with corresponding external events, allowing the user to recognize patterns, relationships or trends in
the visualization, as shown in figure 2.16.
Figure 2.15: The PaperVis [6] visualization layout.
Figure 2.16: The ThemeRiver [7] visualization.
One of the goals behind the design was to enable users to quickly find patterns, and, by using
more familiar visual metaphors, this discovery is made easier for the user. The river metaphor was chosen
as a means to display time progression, while also representing the theme’s relative strength, utilizing
the flow, composition and width. Here, the separate “currents” in the flow illustrate each of the themes
depicted in the collection, the thickness describes the variations in strength and horizontal distance
symbolizes time change. Smooth boundaries and color are necessary to ease the tracking and comparison
of specific currents in the flow.
The strength of each separate river current is calculated by reviewing, for example, the number of
documents containing the theme word in each period. Alternatively, the number of occurrences of the theme
words, as a substitute for document frequency, could also stand for theme strength. The authors mention
the difficulties behind picking the colors for each theme, since there were several factors to take
into account. Colors for distinct themes needed to have some contrast, to be able to
distinguish the currents, while also considering the possibility of having a high number of topics in the
collection. The solution was to sort the colors into groups of related themes and show the color family
attributed to each group.
User tests were overall positive, confirming that the chosen metaphor was easy to understand and
useful to identify macro trends, but not as appropriate when dealing with minor patterns. Users
proposed features from the histogram against which the main visualization was tested, mostly to be able to
see the actual numeric values behind the abstraction, referring to the trust aspect mentioned before.
Along with these results, the visualization could be improved by increasing the performance in order to
support more interactions and improve the control users have on the system.
Following the ThemeRiver system, there is the FacetAtlas [8], a visual representation of both local
and global patterns in the set of documents, displayed with a graph and a density map (Figure 2.17), in
order to provide context. It also allows users a more interactive experience, with the ability to search for
specific terms, which will render a new graph. To properly understand the design choices, it is necessary
to explain that facets are considered to be classes of entities, which are instances of a particular concept
from the data, and relations are simply connections between these entities.
Brainiac is an application focused on visualizing a collection of documents. It is a tool developed in
collaboration with IBEB to help users explore the content of a group of documents, allowing the user to
potentially identify documents of interest by arranging documents based on their similarity and topics.
The development of this visualization followed an iterative and incremental process, focusing on
user feedback to improve its usability and main features. As such, there were two main testing phases
in this process: an informal testing phase, where the focus was gathering feedback from the users, and
a formal one, aiming to measure the usability of the final version of the application.
This chapter describes the final solution and how it was constructed. The first section describes the
system's architecture, defining each of the components needed, and naming the existing interactions
between each one. Then, the backend text processing used on the collection of documents is explained. Finally, the
last section discusses how the processed data is used in each of the application’s visualizations, and
how they interact with each other.
3.1 Architecture
The system was designed with two main components: the backend, which is responsible for the text
processing, and the frontend, which serves the user the web app that contains the visualization.
The depicted architecture is a generic one, commonly used in these kinds of web applications. Although
there are usually additional elements in these kinds of systems, they are usually related to security,
management or communication, and are not included as they are not relevant in the scope of this
project.
A normal interaction with the system is depicted in figure 3.1. The user starts by making a request to
the frontend server, through the browser, getting the necessary files to render the web app. Then, the
browser makes the necessary requests to the backend server, either fetching the preprocessed data,
documents, or to upload new files.
Figure 3.1: Architecture of the final solution. Documents are stored in the backend, where they are processed. The backend also serves these documents and the initiation file to the frontend. The latter contains the application code, run in the client's browser, which makes the necessary requests to the backend.
The backend server was developed with Node.js¹, and serves as an Application Programming Interface (API) for the application. We opted to use this platform due to Javascript already being used exten-
sively in the frontend, which facilitated the development seeing that we could use the same language to
develop both elements of the system. This server is used for the main processing in the visualization.
Its main functions are to hold the document corpus, allow the upload of files by the user, serve the
main initialization file, and allow the querying of specific words to measure their distance to each of the
documents.
The frontend was developed utilizing React² and D3³. React is a library used to build the interface
of the application, while D3 is a library that helps us create the visualizations. React allows some abstraction
from HTML, letting us add or change components without worrying too much about breaking the
interface. As such, it allows adding new UI elements as needed, facilitating prototype iteration.
3.2 Backend Document Processing
This section describes how the backend handles the processing of the collection of documents. Each
stage of this process is defined in figure 3.2. The following subsections describe each of these stages
in detail, following the order depicted in the figure.
The documents were gathered with the help of professor Hugo Ferreira, from IBEB, who listed a
few topics of interest to guide the construction of an initial collection of documents. These subjects
were intended to help the search in search engines such as Google Scholar or Pubmed, and create
a small document database with articles from these topics. This database was intended to help with
the development, including the informal testing phase, and the usability tests, and as such, not much
time was focused on creating a big collection. These studies are usually stored in Portable Document
Format (PDF) to facilitate access, which prompts an initial stage that converts them to plain
text.
3.2.1 Text Extraction
As mentioned, in the database created, documents are in PDF format. Since these files may
contain not only text, but also images, hyperlinks, videos, embedded fonts, and executable scripts, they are
stored in binary. In addition, the elements included are usually accompanied by a set of formatting and
other describing components needed, so that the result from rendering is the same across platforms. To
deal with this file format, the text needs to be extracted, while ignoring other elements such as images
and formatting elements that cannot be used to classify the document.
¹ https://nodejs.org
² https://reactjs.org/
³ https://d3js.org/
Figure 3.2: Stages in the document collection processing pipeline. The first two stages in the process – Text Extraction and Tokenization – are applied to single documents, in order to extract the terms of each document and form a bag-of-words representation. The last two stages are applied to this representation, in order to try and extract content.
This extraction is done through a Python script that takes each document and converts it into a plain
text file with all the text from the original file. Initially, we opted to use C++, as this processing involved
performance-intensive computation. However, it was decided to migrate the development to Python, as
this language allows for faster iterations on the code, meaning the development time was focused on the
processing and not on the language specifics.
The script uses textract⁴, a Python library that enables us to extract text from documents, PDF
files in this case. Since these files can have different encodings, the result of the extraction may
contain some unwanted characters when converting to UTF-8. One specific example of this problem
happened with some documents that contained math symbols from study comparisons, such as ">" or
"≤", which, when converted, produced incorrect output in the resulting text file, corresponding to numeric
characters (for instance, "≤" extracting as "6").
In order to remove some of these incorrect characters, a set of rules was put in place, before the resulting
text was saved, that allowed us to remove characters that did not contribute to the actual content of the
file. These rules are used to discard digits, punctuation and symbols, hyperlinks, and some words that do
not contribute to the content itself. The removed characters are not only products of wrong conversion,
but also parts of the document, like references or citations that are sometimes merged into the words.
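To make this stage concrete, the following is a minimal sketch of the extraction and cleanup step. The thesis does not list the exact rule set, so the patterns below (hyperlinks, citation markers, digits and symbols) are illustrative:

```python
import re
import textract  # Python library used to extract text from PDF files

def extract_clean_text(pdf_path):
    # textract.process returns the extracted text as bytes
    raw = textract.process(pdf_path).decode("utf-8", errors="ignore")
    # Illustrative cleanup rules: discard hyperlinks, citation markers,
    # digits, punctuation and other symbols that carry no content
    text = re.sub(r"https?://\S+", " ", raw)        # hyperlinks
    text = re.sub(r"\[\d+(,\s*\d+)*\]", " ", text)  # citation markers like [3] or [3, 7]
    text = re.sub(r"[^A-Za-z\s]", " ", text)        # digits, punctuation, symbols
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace

if __name__ == "__main__":
    print(extract_clean_text("document.pdf")[:300])
```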
3.2.2 Tokenization
After all the files are converted into plain text, the second stage involves reading all the documents
and fitting them into the model, to obtain the necessary results for the visualization. This is done with a
second Python script that reads each of the converted files into memory, in a bag-of-words representation.
This model is a simplistic approach used in natural language processing in which the text is
represented as a set (bag) of words. The text is stripped of any punctuation and newline characters, and
document, but it is offset by the frequency of the word in the whole corpus. This offset helps us measure
how important a term is in the collection, as it allows scaling down the weight of frequent terms across
the collection, while simultaneously scaling up the uncommon ones.
The recurrence of each term t is calculated by simply measuring term frequency in each document
d from our collection D, and it is normalized to take into account the total length of the document.
Therefore, larger documents will not get higher scores due to their larger term frequencies. Then, the
inverse document frequency idf is calculated by taking the logarithm of the total number of documents
N divided by the number of documents that contain the term being weighted, n_t. The final result is
calculated by multiplying tf by idf, as seen in (3.1).
\mathrm{tf}(t, d) = f_{t,d}
\mathrm{idf}(t, D) = \log \frac{N}{n_t}
\text{tf-idf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D) \qquad (3.1)
By fitting the whole tokenized document collection into the tf-idf model from the scikit-learn library,
we obtain a term-document matrix that reflects the weights for all the terms in all documents. Using the
cosine similarity, we can measure the similarity between two vectors on the matrix, which consequently
allows us to measure the similarity between two documents in the collection. Using this method, we
obtain a new matrix with the similarity values between each pair of documents in the collection, allowing
the creation of links between strongly related documents.
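For reference, the cosine similarity used here follows the standard definition: given the tf-idf vectors d_i and d_j of two documents,

\mathrm{sim}(d_i, d_j) = \cos\theta = \frac{d_i \cdot d_j}{\lVert d_i \rVert \, \lVert d_j \rVert}

so values close to 1 indicate documents that use similarly weighted terms, while values close to 0 indicate unrelated documents.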
One additional stage was added that, for each document, separates every other document into different
levels of similarity. This was initially done on the client side, but it was changed so that all processing
is done on the same side (the backend).
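A minimal sketch of this similarity computation with scikit-learn follows, assuming the plain-text documents from the previous stage are available in a list; the 0.3 link threshold below is illustrative, as the thesis instead separates documents into several similarity levels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical input: plain-text content of each document in the collection
docs = ["temporal lobe atrophy in parkinson disease ...",
        "regional volumetric change with cognitive decline ..."]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)   # term-document matrix of tf-idf weights
sim = cosine_similarity(tfidf)           # document-to-document similarity matrix

# Keep links between strongly related document pairs (illustrative threshold)
links = [(i, j, float(sim[i, j]))
         for i in range(sim.shape[0])
         for j in range(i + 1, sim.shape[0])
         if sim[i, j] > 0.3]
```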
3.2.4 Clustering and Top Word Extraction
To complement these links, a cluster analysis is performed on the resulting tf-idf matrix. This task
involves grouping a set of objects so that each group includes objects that are more similar to
each other than they are to those in other groups. In the context of the document collection, it allows grouping
together documents that are similar to each other, creating clusters of documents on different topics,
helping users identify relationships in the collection.
To create these groups, k-means clustering is used. K-means is a general-purpose clustering
algorithm that tries to separate the samples into groups of equal variance. This method, however,
requires the number of clusters to be specified beforehand, and it is not guaranteed to reach a global
optimum. This implies that the final result will depend not only on the number of clusters specified, but
also on the initial placement of the centers of each cluster, which may lead to different results as the model is run
several times.
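A sketch of this clustering step with scikit-learn, reusing the tfidf matrix from the earlier sketch; the number of clusters is an assumption, and n_init re-runs the algorithm with different initial centers to mitigate the dependence on center placement:

```python
from sklearn.cluster import KMeans

k = 5  # assumed number of clusters; k-means requires it up front
km = KMeans(n_clusters=k, n_init=10, random_state=0)  # 10 runs, best result kept
labels = km.fit_predict(tfidf)  # cluster index assigned to each document
```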
In order to evaluate the results from this clustering, a projection onto a 2D space is needed, since
the vectors representing each document have a very high dimension count. The process of reducing the
number of dimensions of a vector while preserving information is designated dimensionality reduction,
and it usually consists of either selecting a subset of all the features, or computing new features from
the existing ones. Although this helps visualize results from the tf-idf model, reducing a high number
of dimensions to only two can lead to the loss of information relating to the document vectors. This loss
of information, in turn, will cause a distortion on the resulting graph visualization of the collection of
documents. Certain patterns may appear, as a result of these artifacts, which influences the analysis of
the results.
Taking this into account, different dimensionality reduction methods were used, so that it is possible to
compare results without relying too heavily on any single method's distortion. Document vectors were reduced using Latent
Semantic Analysis (LSA), Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor
Embedding (t-SNE), down to two dimensions. Both LSA and PCA perform a linear reduction on the
data, using Singular Value Decomposition (SVD) of the data, while t-SNE performs a nonlinear reduction.
The comparison between these algorithms can be seen in figures 3.3, 3.3(a) to 3.3(c).
As clustering in a high number of dimensions can be problematic [23], the results of each
dimensionality reduction are again clustered, for further analysis. The results from this clustering can be seen
in figures 3.4, 3.4(a) to 3.4(c).
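A sketch of the three projections with scikit-learn, again reusing the tfidf matrix from the earlier sketch; parameter values are illustrative (for instance, t-SNE's perplexity must be smaller than the number of documents):

```python
from sklearn.decomposition import TruncatedSVD, PCA
from sklearn.manifold import TSNE

lsa_2d = TruncatedSVD(n_components=2).fit_transform(tfidf)  # LSA via truncated SVD

dense = tfidf.toarray()                  # PCA and t-SNE expect dense input
pca_2d = PCA(n_components=2).fit_transform(dense)
tsne_2d = TSNE(n_components=2, perplexity=30).fit_transform(dense)  # nonlinear
```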
An additional stage added to the process involved computing the top terms by topic, or
cluster. Taking the results from clustering the tf-idf matrix, it is possible to obtain the most relevant
features of each cluster. Since, in this case, features are equivalent to terms, this stage stores the most pertinent words
of each identified cluster. To compute the top words by topic, topics were first extracted by fitting the tf-idf
features into the Non-Negative Matrix Factorization (NMF) model. In order to get a more complete set
of results, Latent Dirichlet Allocation (LDA) topic extraction is also performed, with term count features
instead of tf-idf features, as the scaling idf property would disproportionately change the weights of
words with this model.
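A sketch of the two topic extraction variants described above, reusing docs, vectorizer and tfidf from the earlier sketches: NMF fitted on the tf-idf features, and LDA fitted on raw term counts. The number of topics and of top words per topic are assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

n_topics = 5  # assumed number of topics

# NMF on tf-idf features
nmf = NMF(n_components=n_topics, random_state=0).fit(tfidf)
terms = vectorizer.get_feature_names_out()  # recent scikit-learn API
for topic in nmf.components_:
    print([terms[i] for i in topic.argsort()[-10:]])  # top 10 words per topic

# LDA on term-count features, since idf scaling would skew its weights
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(counts)
```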
As these top words were extracted using different methods, a common measure between the words
and the documents had to be created, so that words could be sorted according to their relevance to each
document. With the trained tf-idf model, it is possible to measure the cosine similarity between each
word and each document. This method returns, for each word, an array with its similarity to each
document. This process was later altered so that the script could take a single word as input,
allowing the similarity between the specified word and each document to be measured, and letting
users add new words to the list.
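A sketch of this word-to-document measure, transforming a single word with the trained tf-idf vectorizer from the earlier sketch and comparing it against every document vector:

```python
def word_similarities(word):
    # Vectorize the single word in the space of the trained tf-idf model
    vec = vectorizer.transform([word])
    # One cosine similarity value per document in the collection
    return cosine_similarity(vec, tfidf)[0]

scores = word_similarities("alzheimer")  # e.g., a word supplied by the user
```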
(a) Results of TF-IDF clustering, using LSA dimensionality reduction to display results in the graph. (b) Results of TF-IDF clustering, using PCA dimensionality reduction to display results in the graph. (c) Results of TF-IDF clustering, using t-SNE dimensionality reduction to display results in the graph.
Figure 3.3: Results of TF-IDF clustering, reduced by LSA, PCA and t-SNE for comparison.
(a) k-means applied after dimensionality reduction, in this case with LSA. (b) k-means applied after dimensionality reduction, in this case with PCA. (c) k-means applied after dimensionality reduction, in this case with t-SNE.
Figure 3.4: Results of k-means clustering applied after the dimensionality reduction, for comparison.
As mentioned in the architecture subsection, this processing is done mainly on the backend server,
which serves as an API. In order to pass the processed data to the visualization, the script stores
everything into a JavaScript Object Notation (JSON) file, which facilitates the interpretation when reading
the file in Javascript. This file will include the array of documents in a “nodes” property, while having
the calculated links in a "links" property. Information relative to each document is aggregated into each
document object, such as the cluster to which it belongs, its title and abstract, and the similarity levels
of every other document in relation to itself. Here, the cluster information used was derived from the original
k-means, performed on the tf-idf matrix. Since dimensionality reduction can lead to loss of information,
it was decided not to use this method, as projection onto a 2D space could place documents that
are not similar at all close together. In figure 3.3, all of the dimensionality reduction algorithms show instances of this,
as documents from different clusters are occasionally placed close together.
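A sketch of how such a JSON file could be assembled from the earlier results; only the "nodes" and "links" property names are given in the text, so the per-document fields below are illustrative:

```python
import json

titles = ["Document 1", "Document 2"]  # hypothetical document titles

data = {
    "nodes": [{"id": i, "title": titles[i], "cluster": int(labels[i])}
              for i in range(len(docs))],
    "links": [{"source": i, "target": j, "value": s} for (i, j, s) in links],
}

with open("collection.json", "w") as f:
    json.dump(data, f)
```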
3.3 Brainiac: a Graph-Based Literature Visualization
Following the text processing described in the previous section, all the data is available and being
served in the backend. The frontend can request the main file, and use the processed data in the
visualization. This section describes the frontend component of the application. Additionally, it presents
the tasks that were derived from the meetings with professor Hugo Ferreira, from IBEB, as well as the
feedback from these meetings and from the first and informal testing phase.
As mentioned in the beginning of this chapter, the development of this visualization followed an
iterative model, focusing on users' feedback, specifically meetings with professor Hugo Ferreira and an
informal testing phase, to guide the design of the application. This testing session did not focus on
validating or evaluating the usability of the application, but simply on gathering feedback from target
users.
This section will describe in detail the main phases in the development of this visualization, namely
the gathering of requirements from IBEB, through professor Hugo Ferreira, the initial version, the testing phase
and the feedback collected, and, at the end, the final version of the application.
3.3.1 Gathering the requirements
As mentioned in the beginning of this chapter, this application was developed in cooperation with
IBEB, specifically with professor Hugo Ferreira. There were initial meetings aiming to obtain a list of
requirements for the visualization, and further sessions aimed at gathering feedback, whether on possible
new features or on changes of approach to already implemented features.
From these requirements, a list of tasks was derived, to focus the development of the visualization:
• Search for a specific document;
• Filter documents by date of publication;
• Method to identify similar documents;
• Identifying topics in documents, and separating documents based on these identified topics;
• Ability to provide a brief overview or summary of a specific document;
• Integration with a search engine, such as Pubmed or Google Scholar ;
• Differentiate different types of studies in the area: clinical trials, guidelines, meta-analyses or
systematic reviews, for example;
• Differentiate between the different evidence levels that are usually attributed to these studies,
specifically clinical trials.
This list of requirements was later used to create the list of tasks used in both testing phases, and
as a guideline for the design of the visualization.
3.3.2 Initial Version
Initially, the application consisted mainly of the sidebar and the three visualizations: the Network
(Figure 3.5.A), the Cluster Layout (Figure 3.5.B) and the Timeline (Figure 3.5.C). The sidebar (Fig-
ure 3.5.D) contained a list of documents, a search feature, and a Words per Topic feature that was
disabled, due to not being ready for testing. This was the version used in the informal testing
phase, although there had been previous feedback from professor Hugo Ferreira regarding some UI elements,
such as the coloring used in the interface.
3.3.2.A The Network
The Network visualization focuses on showing the user the documents in the collection, as nodes,
and their similarity between each other as links between nodes, with these being computed as described
in the previous subsection.
By double clicking on a node in the Network, users were able to center a specific node, arranging the
remaining documents in different rings around the centered node, as seen in figure 3.6. This rearrangement
places documents taking into account their similarity with the center node, and displayed a simple
moving animation on each node, so that the user understood what was happening with the state of the
visualization. There are four different orbits around the center, with documents being placed closer to
the center as their similarity with the centered node increases, and with their placement evenly
distributed inside each ring.
Figure 3.5: Brainiac's initial main view. There are three main visualizations: the Network (A), the Cluster Layout (B), and the Timeline (C). The sidebar (D) lists all documents in the present database, and allows the user to search for specific keywords to add new documents to the visualization.
Figure 3.6: Example of the Network centering feature. By double clicking a node, it is centered in the visualization, arranging the remaining documents in an orbit-like disposition, with the most similar placed in a closer orbit, and the less similar in a more distant one.
There was an initial idea of having the documents actually orbit around the centered document when in this
mode, instead of being in fixed positions. This was later discarded, as it would eventually become too
confusing for users to deal with all the moving nodes, with no actual value being added by this kind of
feature.
3.3.2.B The Cluster Layout
The Cluster Layout displays the documents color coded by the cluster they belong to. It is a simple
visualization designed to represent documents by cluster, and as such, their positioning does
not reflect any computed measure. Nodes are simply placed randomly on the display, and a force was
created to keep each node close to its corresponding cluster's elements.
Both these visualizations (the Network and the Cluster Layout) allow the user to scroll and pan the
view of the nodes. The zoom is implemented as a semantic zoom, instead of the standard graphical
zoom, as it allows the user to view detail without distortion of the elements, instead of simply scaling
up or down the view, as seen in figure 3.7.
Figure 3.7: Example of semantic zoom applied to the Network and Cluster Layout. Node sizes are not scaled up, only the distances between each one, displaying a higher level of detail to the user.
3.3.2.C The Timeline
Finally, the Timeline places each document according to its publication year. Only the year is
considered in this case, as there was somewhat of a lack of consistency in the dates of some documents.
Some had information on their year, but not month or day, so only the year was taken into account
in the collection, in order to keep the results consistent. Contrary to the Network and Cluster Layout,
this visualization does not implement zooming or any kind of axis scaling. Such a feature was not
implemented because a similar one, allowing users to filter documents by year, already existed; it is described
in subsection 3.3.2.E, alongside other designed interactions. Since documents have a full date
available in the details, some documents were changed to include month and day, despite these not being
specified in their metadata, so that details are consistent across the collection.
3.3.2.D UI Components
Regarding the rest of the interface, the sidebar contained, as it was mentioned at the beginning of
this subsection, a simple document list, and a search feature. This document list showed all the documents
currently in the database, and allowed the user to change the way they were sorted: by title
or by date.
The search feature allowed an integration with a search engine, in this case Pubmed, to search for
new documents using the input query. This would search Pubmed and list the top ten results that
were returned. The interface, seen in figure 3.8, allowed users to select which documents they were
interested in adding to the visualization, including their abstracts. After choosing the desired studies
and opting to update the visualization, the application takes a few seconds to process and update the
visualization.
Figure 3.8: Interface displayed to the user after using the search function on the sidebar. It displays the top results returned and allows the user to select any number of these to add to the visualization.
Figure 3.9: Example of layout rearrangement. Moving a window in the grid indicates on the background where it will be placed. If the user tries to move over an existing window, like in the example, the old window is placed in a new free location, normally beneath the window being moved.
The three visualizations are placed into a responsive layout that allows users to manipulate the
arrangement of the visualizations. The layout works as a simple grid, where users can grab one of the
visualizations' title bars and move it, as seen in figure 3.9, or resize it, as seen in figure 3.10. The layout
arranges itself, so if the user tries to move or resize a window over an existing one, the new one takes
priority and occupies that window's position, while the old one is placed below the moved window. Since it
works as a grid, it snaps into the nearest possible position, as can be seen in both figures, and
gives feedback on where the window will be placed in the grid and what its size will be.
3.3.2.E Main Interactions
To facilitate the user’s navigation, several interactions between each visualization were designed. By
hovering a specific document, all the nodes representing the document in the remaining visualizations
become highlighted, as seen in figure 3.11. This hover works similarly in the Timeline, although users
are always able to hover the closest item while the mouse is within the visualization's bounds.
Figure 3.10: Example of window resizing. Resizing a window in the grid indicates on the background the size that it will assume. If the user tries to resize over an existing window, like in the example, the old window rearranges itself, normally beneath the window being resized.
This ability to hover the closest node was implemented due to the initial size of the displayed nodes
in this visualization. As a result of their smaller size, it was hard to hover a specific document, and with
this detail the difficulty was reduced, since it no longer required the user to hover the node with precision.
Figure 3.11: Example of hover interaction: nodes were highlighted by increasing their radius by a few pixels and changing their color to red. Links were highlighted by changing their color to red as well.
This interaction also highlights the document on the list (Figure 3.11.A), but this only occurs if the
document's entry is already visible in the list. Automatic scrolling to the hovered node was disabled
by default, due to its confusing nature: every time a user hovered a new document, intentionally
or not, the document list would scroll as needed to display the document. Instead, a subtle modification
was added, so that hovering a node while holding the control key on the keyboard makes the document list
scroll to the focused document. A tip explaining this behavior was displayed whenever the user
hovered a document without holding the key.
As mentioned before, the Timeline allowed the user to filter documents based on their year of
publication. By dragging a box on the visualization, it would filter out documents that did not belong to the
selected time interval. Documents that were excluded by the applied filter did not appear in the
document list until the filter was changed or removed. However, nodes corresponding to filtered documents
only had their opacity changed, so they appeared "grayed out" in the visualization, while still influencing
the force layout responsible for each visualization, as seen in figure 3.12.
Figure 3.12: Example of the filter interaction in the Timeline. Nodes that are not included in the selected period are grayed out of the visualization.
3.3.3 Informal Testing
This informal testing phase was a formative evaluation, aiming to assess the usability of the initial
version of the visualization with representative users. The goal behind this specific phase was to identify
usability problems with the visualization, in order to improve the user experience in the final solution. As
such, a list of tasks was derived from the requirements that required the user to interact with different
components of the visualization, without focusing on quantitative data such as task execution time
or the number of errors made in the completion of each task.
This subsection will go over the participants and the procedure, and discuss the feedback obtained
from the different users that participated in this test.
3.3.3.A Participants
Subjects were recruited by professor Hugo Ferreira, in order to gather users with contextual knowledge.
There was a total of 5 users, with ages ranging between 23 and 33 years old. Of these users, only
one did not have contextual knowledge.
3.3.3.B Procedure
The tests were performed in a laboratory at IBEB. The purpose of the test, and what they would be
doing, was explained to each participant. They were given a brief explanation of the visualization,
describing its overall layout and the meaning behind each particular view,
as well as the main interactions between each one. Users were asked to "think out loud", manifesting
their opinions on the interface and giving any feedback they could think of. Following this short
description of the application, they were given 5 minutes to explore the visualization, and encouraged to
try different interactions in order to familiarize themselves with the interface.
After this exploratory period, subjects were asked to perform a series of tasks. They were given one
task at a time, receiving the next one when the current was completed. The list of predefined tasks is as
follows:
1. Identify the year with most publications;
2. Identify one of the documents that has the most relations;
3. Identify one of the biggest clusters of documents;
(a) Give example of two documents belonging to that cluster;
4. Identify two of the documents published between 2000 and 2010;
5. Identify the year of publication of the document named “Distinct Brain Networks underlie cognitive
dysfunction in Parkinson and Alzheimer diseases”;
6. Center the network visualization on the document named “Regional volumetric change in Parkin-
son’s disease with cognitive decline”;
7. Give two examples of documents belonging to the same cluster as document named “Structural
Brain Changes in Parkinson Disease With Dementia”;
8. Give two examples of documents that are related to “Temporal lobe atrophy on MRI in Parkinson
disease with dementia”;
9. Create a new visualization with a query for documents relevant to “Alzheimer”;
3.3.3.C Discussion
There were many problems regarding the interface, both reported by the users and detected by observing task execution.
First, some users displayed confusion regarding the meaning of the Cluster Layout and of the clusters it presented, as it did not have any particular interactions with the rest of the visualization. They indicated that they would like to see how the clusters were positioned in the Timeline or the Network.
There were also problems when trying to identify the document a certain node represented, as hovering had been disabled by mistake. This forced users to hover while pressing the control key, which revealed problems with participants that had less experience with computers. These users did not understand how to automatically scroll the list with this hover property, and as such had some difficulty identifying document titles from the visualization.
When asked to look for a specific title, most users wrongly tried to search for it using the search function. It was not obvious to users, even with a description in the input box, that the search function was meant to search for new documents. Without a way to filter documents, subjects were forced to manually scroll the list in search of the required document.
After identifying the required document in the sidebar list, some tasks required users to identify some property of that document in one of the visualizations. A specific example is task 6, which asked users to search for a document and center the network on it. After manually locating the document in the sidebar list, users needed to memorize its location in the Network and only then center it, since the highlight would disappear as soon as the mouse left the node.
In conclusion, this testing phase identified a few critical problems that slowed down the users’ execution. The solutions for these problems are described in the next subsection, which presents the final version of the application.
3.3.4 Final Version
The final version of the application did not include any major changes to the main visualizations, besides the new topic magnets and the file uploader interface, accessible through the sidebar. The Network, Cluster Layout and Timeline, as seen in figure 3.13, did not have major usability problems, and as such appear similar to the initial version (Figure 3.5).
This version changed the coloring of the UI and of the nodes, to facilitate the identification of the different states and of the different clusters in the Cluster Layout. It also reinstated the popup, allowing users to simply hover a node to determine the document’s title, as seen in figure 3.14. This hover function was slightly changed, since it previously overwrote the color of the node, especially in the Cluster Layout, which prevented users from identifying the cluster the hovered node belonged to. The new hover function no longer prompts the user to use the control key to scroll the document list, since that behavior was moved to a new state. Regarding the problems pointed out in the Cluster Layout, hovering a node in this visualization now also highlights documents that belong to the same cluster on the Network and Timeline, as seen in figure 3.15.
Figure 3.13: Brainiac’s main view
Figure 3.14: Example of hovering a document in the Network visualization. It works similarly in the Timeline view.
The second problem associated with hovering a node was the requirement to memorize a node’s location. In order to fix this problem, a new state was introduced: focusing a node. By clicking on a node, it is possible to change its state to appear as “focused”, as seen in figure 3.16. Focused nodes are very similar to hovered nodes, but since hovering requires the user to keep the mouse over the node, the user may instead click on the node to focus it. In this new state, nodes are highlighted with a different color from the hovering state. Focusing also changes the node’s border, since in the Cluster Layout visualization the cluster to which a node belongs is identified by its fill color. Contrary to hovering, focusing a node scrolls the document list on the sidebar so that the corresponding entry is selected, although it does not display the popup unless the user actively hovers the node.
Figure 3.15: Example of hovering a document in the Cluster Layout, highlighting documents that belong to the same cluster in the remaining visualizations.
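The following is a minimal sketch of how these node states could be modelled, assuming a simple enumeration; the concrete colors are placeholders rather than the ones used in Brainiac.

```python
from enum import Enum

class NodeState(Enum):
    DEFAULT = 0
    HOVERED = 1   # transient: lasts only while the mouse stays on the node
    FOCUSED = 2   # persistent: toggled by clicking, survives mouse movement

def node_style(state: NodeState, cluster_color: str) -> dict:
    # Highlights are applied to the border so the fill can keep encoding
    # the cluster the node belongs to; the hex values are placeholders.
    strokes = {NodeState.HOVERED: "#ffb300", NodeState.FOCUSED: "#e53935"}
    return {"fill": cluster_color, "stroke": strokes.get(state, "none")}
```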
In the Cluster Layout, zooming out also allows the user to collapse the nodes into their corresponding clusters, so that the user can work with the clusters directly (see Figure 3.17). Some users mentioned that the animation behind this feature could be a little frustrating at times, if the user zoomed out a bit too much by mistake. In order to solve this, the limits that triggered the node collapse were tweaked, as well as the animation duration for both zooming out and zooming in, so that users do not feel they are wasting time when this happens by accident.
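A small amount of hysteresis between the collapse and expand thresholds is one way to implement this tweak; the sketch below uses illustrative values on a normalized zoom scale, not the ones actually chosen.

```python
COLLAPSE_BELOW = 0.45   # collapse nodes into their clusters under this scale
EXPAND_ABOVE = 0.60     # only expand again once the user zooms back past this

def update_collapse(zoom_scale: float, collapsed: bool) -> bool:
    # The gap between the two thresholds prevents the layout from
    # flickering when the user overshoots the boundary slightly.
    if not collapsed and zoom_scale < COLLAPSE_BELOW:
        return True
    if collapsed and zoom_scale > EXPAND_ABOVE:
        return False
    return collapsed
```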
The search function that allowed users to add new documents by querying Pubmed was removed in this last version. This function presented a few problems that limited its usefulness, mainly the inability to fetch the full document from the search engine. In the initial version, only the abstracts were used for comparison with existing documents; however, an abstract does not contain enough content to make accurate assumptions about similarity within the collection. Another problem with this method was the fact that Google Scholar blocked requests from the Python script dealing with the fetching, due to its policy regarding bots. These issues led to the removal of the feature, changing it into a new file uploading interface that lets users manually add their own documents (Figure 3.18).
Initially, the only way to add new document files to the visualization was to manually place them in the server folder, adding the details to the document.json file, or to use the integration with Pubmed to search for new documents. These approaches were not optimal, as mentioned in the previous paragraph. The new file uploader provides a menu where the user is able to add new files to the visualization, but is required to provide additional details: title, date, authors and the abstract (Figure 3.20). This is a consequence of the metadata processing: since it does not produce consistent results when extracting the required fields, the user is trusted to enter correct data, improving the information in the visualization.
Figure 3.16: Example of focusing a document in the Network visualization.
Figure 3.17: Example of zooming out on the Cluster Layout, which collapses nodes into their corresponding cluster.
In the sidebar, a new document filter was added, allowing users to search for a specific title. Originally, this feature filtered the whole visualization, graying out documents not matching the input query. This was removed in later stages, since the feature’s main function is to search for documents: filtering the whole visualization did not make sense and forced users to clear the query before interacting with the visualization.
3.3.4.A Topic Magnets
The topic magnets submenu lists the top relevant words in the collection. These words are gathered in the final stages of the backend processing, described in Subsection 3.2. Words in this list act as objects that can be dragged onto the Cluster Layout visualization, where an item with that word is created.
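A minimal sketch of how such a list of top relevant words can be gathered is shown below, using scikit-learn’s TF-IDF as an illustrative stand-in for the backend pipeline of Subsection 3.2; the function name and parameters are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_terms(texts, k=20):
    # Rank terms by their aggregated TF-IDF weight over the whole
    # collection and keep the k highest-scoring ones.
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(texts)          # documents x terms
    scores = np.asarray(tfidf.sum(axis=0)).ravel()
    terms = vectorizer.get_feature_names_out()
    return [terms[i] for i in np.argsort(scores)[::-1][:k]]
```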
Figure 3.18: File uploader interface. There is also a list of documents in the collection.
These new items work as magnets, although they are disabled by default. Double clicking on a new object activates it, attracting the documents to itself. As mentioned, the words were measured against each document in the visualization, and the resulting list of similarities is used to vary the attraction between each document and that magnet. This can be used to analyze which documents are related to a specific term, as seen in figure 3.21.
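A minimal sketch of this attraction, assuming each document and each magnet already has a term vector, could scale the pull by cosine similarity on every simulation tick; the strength constant is an illustrative assumption.

```python
import numpy as np

def magnet_force(doc_pos, doc_vec, magnet_pos, magnet_vec, strength=0.05):
    # Cosine similarity between the document and the magnet's term vector
    # scales the pull toward the magnet's position, so related documents
    # drift toward it while unrelated ones barely move.
    sim = float(np.dot(doc_vec, magnet_vec)) / (
        np.linalg.norm(doc_vec) * np.linalg.norm(magnet_vec) + 1e-12)
    return strength * sim * (np.asarray(magnet_pos) - np.asarray(doc_pos))
```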
Users are able to add new words to the list, allowing them to check which documents are closer to a specific topic. The visualization takes a moment to update the list with the new word, due to the processing required, but after this delay users are able to freely use the added topic.
These word objects do not interact with the Network or the Timeline due to the nature of these visualizations. The Network was the first option considered to host the topic magnets. However, it was already used for the centering technique, which led to the magnets being applied in the Cluster Layout instead. The Timeline, on the other hand, was designed to take advantage of the positioning of each node to reveal the publication year of each document. Since the technique described in this subsection requires the positioning of each node to illustrate the similarity between documents and magnets, combining these two techniques would defeat their purpose.
3.3.5 Discussion
The Brainiac visualization allows users to explore a collection of documents in the neuroscience context. It presents different views of the collection, namely the Network, which connects similar documents, the Timeline, which positions documents based on their publication year, and, finally, the Cluster Layout, which differentiates documents based on their clusters. This clustering is based on document content, so users are able to interpret groups of documents as different topics. Additionally, it allows users to create magnet objects based on a specific term, making it possible to explore the documents closest to a specific topic.
Figure 3.19: File uploader interface showing the details provided by selecting one of the document entries.
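As an illustration of this content-based grouping, the following is a minimal sketch using scikit-learn; it is a stand-in for, not a description of, the actual pipeline of Section 3.2, and the number of clusters is an arbitrary assumption.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_documents(texts, n_clusters=6):
    # Vectorize the document contents and group them with k-means;
    # each returned label can be read as a topic-like cluster id.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(tfidf)
```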
By providing a range of interactions between these views, Brainiac focuses on helping users find similar or related documents. Users can start with a specific document in mind, or create a topic and explore the documents closest to it. The cluster arrangement shows documents in the same group, which can help users discover documents within that topic that they did not know about before.
To put this work in context, we compare Brainiac with the existing systems reviewed in Section 2, using a similar table, as seen in table 3.1. We use the same concepts to categorize our visualization, namely the ability to read the original document, to present patterns from the collection, to present an overview of the document, to compare different documents, to present extracted features, to search for specific terms or phrases, and the ability to zoom in on details.
Specifically, in Brainiac the user is easily able to read the original document, by double clicking on a specific document in the sidebar, or by following the “Open Document” button presented in the small overview. Since this overview can become intrusive in the visualization, it is only presented when the document is hovered in the sidebar, instead of in one of the available views.
We consider that the visualization fails to allow the user to explicitly compare documents, as there is currently no way to select two or more documents and compare their properties or topics. Since we do not extract features such as entities from the collection’s content, we also consider that the “features” concept is not present in the visualization. Since the search function currently present in the application only allows filtering documents by their name, we consider this feature not to be present either, as it is not possible for users to search for a specific term or topic. Finally, the zoom component is considered to be present, as it was implemented in both the Network view and the Cluster Layout. It could be interesting to add this zooming feature to the Timeline, should the scalability of our solution create problems with the number of documents present in this view.
Figure 3.20: File uploader interface. The title, date, author list and abstract are fields the user needs to fill, as metadata extraction is not consistent.
Figure 3.21: Example of a topic magnet attracting documents based on their relation with the topic.
Table 3.1: Comparison between the reviewed visualizations and the developed solution.
With the visualization and interface’s main points accomplished, we decided to start a formal testing phase, aimed at evaluating the usability of the visualization and its utility as a tool to aid in interpreting a collection of documents. This chapter describes this formal testing phase: the participants, for both the usability tests and the case studies, the evaluation and the procedure.
4.1 Usability Tests
As mentioned, usability testing is used as a tool to evaluate a product by testing it with representative users. It can be seen as an irreplaceable usability practice, since it gives direct input on how real users use the system [24]. The goal behind this process is to identify usability problems, while gathering qualitative and quantitative data to determine the participants’ level of comfort and satisfaction with the product. Thus, this section goes over the participants, the procedure for these tests, the final results of this testing phase and the discussion of those results.
4.1.1 Participants
Subjects were recruited through standard procedures, including direct contact and word of mouth. Subjects included anyone interested in participating who was at least 18 years old. Each participant was asked to sign a consent form.
During this testing phase, 16 tests were performed. In the first test, technical problems prevented the user from performing the last task. All the tests were conducted between 08h30 and 20h00. None of the subjects that completed the test had professional experience in neuroscience; however, since the test focused on the interface aspect of the visualization, this did not impact the results.
4.1.2 Procedure
The tests were performed in a laboratory on the Alameda campus of Instituto Superior Técnico (IST). The purpose of the study, and what they would be doing, was explained to each participant. Subjects were asked to fill in a consent form, allowing the recording of their actions in the visualization during the test.
After filling in the form, the meaning behind each of the visualizations presented in the application was explained to users, namely the Network, the Cluster Layout and the Timeline, as well as the interactions between them. After this brief summary, participants were given 5 minutes to explore the application’s interface, experimenting with the described functionalities. At the beginning of this phase, a simple script was run to start recording the user’s actions on screen, for later reference.
Following this exploratory phase, subjects were asked to perform a series of predefined tasks, for which the assistant measured the time taken and the number of errors made during execution. Participants were given a single task at a time, receiving the next one when the current one was completed. The list of predefined tasks is as follows:
1. Identify the year with the most publications;
2. Identify one of the documents that has the most relations in terms of similarity;
3. Identify one of the biggest clusters of documents;
(a) Give two examples of documents belonging to that cluster;
4. Filter documents between 2000 and 2010 and identify two documents belonging to that time span;
5. Identify the year of publication of the document named “Distinct Brain Networks underlie cognitive
dysfunction in Parkinson and Alzheimer diseases”;
6. Center the network visualization on the document named “Regional volumetric change in Parkin-
son’s disease with cognitive decline”;
7. Give two examples of documents belonging to the same cluster as document named “Structural
Brain Changes in Parkinson Disease With Dementia”;
8. Give two examples of documents that are related to “Temporal lobe atrophy on MRI in Parkinson
disease with dementia”;
9. Zoom out the cluster view;
(a) Identify the most recent cluster;
(b) Identify a cluster dispersed along the Timeline;
10. Create a new Topic Magnet with “Alzheimer”;
(a) Identify two documents related to the topic;
(b) Identify the closest cluster to the topic;
(c) Create a new topic magnet with “Parkinson” and place it on the opposite end of the previously
created magnet;
(d) Identify the closest document to the new topic;
11. Upload the given document and update the visualization with the newly added document;
(a) Center the network on the new document, and identify two related documents;
(b) Identify the cluster the document belongs to;
With the completion of this set of tasks, users were asked to fill in the System Usability Scale (SUS) questionnaire. The SUS was used to measure the application’s usability, and consists of a ten-item questionnaire using a Likert scale to give an overview of how the user felt about the system [25]. Testers were then given candy as compensation and thanked for their time.
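For reference, the standard SUS scoring procedure [25] can be summarized by the following sketch; the example answers are purely illustrative.

```python
def sus_score(responses):
    # Standard SUS scoring: odd-numbered items contribute (score - 1),
    # even-numbered items (5 - score); the sum is scaled by 2.5 so the
    # final score lies on a 0-100 scale.
    assert len(responses) == 10
    total = sum(r - 1 if i % 2 == 0 else 5 - r   # i is 0-based: even i = odd item
                for i, r in enumerate(responses))
    return total * 2.5

print(sus_score([5, 2, 4, 1, 5, 2, 5, 1, 4, 2]))  # illustrative answers -> 87.5
```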
4.1.3 Results
The distribution of the time taken in each task can be seen in figure 4.1. The first tasks given (Tasks 1 to 3, including 3a) were very easy, as they did not require the user to make changes to the initial state of the visualization. Each of these tasks required the user to look at the initial state of one of the visualizations and identify something.
The first task was simple and, apart from two users that did not immediately understand the objective, no users had problems reaching the correct answer within a few seconds, with a mean of four. In the second task, some users first looked at the Cluster Layout before understanding that the correct response required them to find a document in the Network. This can be seen in the box plot for this task (Fig. 4.1(a)), where the distribution is more disperse, which is also the case for task 3.
In the first task, there were no detected errors. However, tasks 2 through 3a each had at least one user making errors during execution.
The fourth task required users to apply a filter in the Timeline and identify documents. The box plot for this task shows a more compressed distribution of the time taken, with only two users reported as making an error in the completion of this task.
Tasks 5 through 8 required users to search for a specific document and make the same kind of observations as the first group of tasks. As such, the first task of this group, task 5, had a very disperse distribution, with almost all users making at least one error. Contrary to task 5, the rest of the group displayed a more compact distribution, with fewer users making errors.
Tasks 9, 9a and 9b required the user to combine the Cluster Layout and the Timeline. The first simply required the user to zoom out on the Cluster Layout and, as such, does not present a sparse distribution, although there were still a few errors. Task 9a presents a more disperse distribution, while 9b is denser in the box plot.
The next group of tasks, 10 through 10d, required users to work with the Topic Magnets in the sidebar. These tasks presented a denser execution time distribution, although there were outliers corresponding to users who did not understand the task at first. Only a few users made errors in the execution of these tasks and, overall, the distribution of task times is compact.
Lastly, tasks 11 to 11b also displayed less variation in their spreads, with errors detected only in the first two tasks of the group.
From the SUS questionnaires, usability was measured with a mean score of 82.5 across all users (see figure 4.2), with a standard deviation of 9.287, indicating that results do not vary much from the mean. Research indicates Web-based SUS scores to be, on average, 68 [26]. Since the score in this testing phase reached an above-average 82.5, with a low standard deviation, it can be concluded that users were satisfied with the usability of the application, apart from the identified errors.
Figure 4.1: Distribution of time taken in each task: (a) Tasks 1 to 9; (b) Tasks 9a to 11b.
Figure 4.2: A comparison of the adjective ratings, acceptability scores, and school grading scales, in relation to the average SUS score [27]. The questionnaires place this visualization at 82.5, marked A in the figure.
4.1.4 Discussion
In general, the results were very good. The execution times were low and, in general, did not present a disperse distribution, with only simple errors being made when users did not understand a specific interaction right away. The results did not point to any obvious usability problem, although the analysis pointed to UI elements that required subtle changes to improve the user interface.
In the first group of tasks, tasks 1 and 3a were completed without problems; however, tasks 2 and 3 have a wider spread, which could be attributed to the wording of the requested tasks.
In the second task, users were asked to identify one of the documents that displayed “the most relations”. Some users tried to determine the single document with the most relations and were unsure which to pick, which caused the larger spread on that task. Something similar occurred in the third task, which led to some users completing it in just a few seconds, while others tried to compare the number of elements of each cluster.
As mentioned, the fourth task required users to filter the Timeline. While the task itself did not present a significant problem, it caused the large spread on task 5, since the filter applied in task 4 was often still active. Because the filter applies not only to the visualization but also to the document list on the sidebar, many users did not remember at first to remove the filter from the Timeline before searching for the required document. Other users searched for the required document by scrolling through the list, without using the document filter. This could be attributed to some users not noticing that they could search the document list by typing the name of the document, as the input box may not be obvious at first glance.
The rest of the tasks that required the user to identify a property of a specific document did not have such a large spread, as users had already removed the filter. However, some users noticed a bug in the document filter: it did not match any documents if the query started with a lowercase letter. This flaw was not obvious at first, and led some users to search by manually scrolling through the list looking for the needed title.
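The fix for such a bug amounts to comparing both strings case-insensitively, as the following minimal sketch shows, assuming documents expose a title field.

```python
def filter_documents(docs, query):
    # casefold() lowercases both sides, so a query such as "parkinson"
    # now matches "Parkinson disease ..." regardless of how it is typed.
    q = query.casefold()
    return [doc for doc in docs if q in doc.title.casefold()]
```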
Tasks 9a and 9b required the user to hover each cluster in the Cluster Layout and follow its spread on the Timeline. However, some participants did not understand that they could exploit the hover interaction to quickly identify the solution. Some tried to manually scan the Timeline, identify which cluster each document belonged to, and estimate the answers, which led to high execution times in those cases.
The group of tasks that involved creating new topic magnets did not present any significant problem. The times measured in tasks 10 and 10c include the time needed for the preprocessing required in the backend, which normally added around 20 seconds to the completion time. Due to technical problems that two users faced with the preprocessing, they repeated the task, although with no significant improvement that could skew the results.
The last group of tasks involved the upload of a new document to the visualization. One of the users experienced technical problems in this task and, due to time constraints, did not perform any of the tasks in this group. The first task, 11, also included the time required to upload and process the new document, which explains the higher execution times displayed. Most users identified a problem with the file uploader interface: after filling in the required details, they did not understand how to proceed with the upload. As such, the positioning of the upload button was changed to improve the clarity of the process of adding new documents to the collection.
The rest of the tasks in this group, tasks 11a and 11b, did not present any significant findings, as users simply had to repeat earlier tasks on the new document.
In conclusion, following participants through the given list of tasks led to finding some subtle problems with the UI, such as elements that were not sufficiently highlighted, like the document filter on the sidebar. Other problems were promptly identified by the users themselves. One example, the positioning of the upload button in the file uploader interface, was already mentioned, but two users also noted that it could be hard to distinguish cluster colors on the Timeline when comparing a darker green with the default node color.
4.2 Case Studies
These studies aimed at testing the utility of the visualization. Since this visualization was designed as a tool to aid the exploration of a collection of documents, the case studies were performed to evaluate whether the results from the visualization were correct. The goal behind this process is to verify and consolidate the user’s context knowledge, or even to possibly uncover unexpected data patterns. Thus, this section goes over the participants, the procedure and the results of this testing phase, as well as a discussion regarding these results.
4.2.1 Participants
Contrary to the usability testing phase described in Section 4.1, participants were required to have some context knowledge, in order to validate the information being displayed. With this in mind, subjects were recruited with the help of professor Hugo Ferreira, from IBEB.
During this phase, two case studies were performed. The tests were conducted between 11h30 and 12h30. The first tester was a Ph.D. student and the second an MSc student, both from Biomedical and Biophysics Engineering. Although they did not have direct experience with the topics included in the collection of documents, both subjects had enough experience in the area to understand the main topics of each document.
4.2.2 Procedure
The tests were performed in a laboratory at IBEB. The purpose of the study, and what they would be doing, was explained to each subject. Participants were asked to fill in a consent form, allowing the recording of their actions in the visualization during the test.
After filling in the form, users were given an explanation of the meaning of each visualization, namely the Network, the Cluster Layout and the Timeline, as well as the interactions between them. After this brief summary, participants were given 5 minutes to freely explore the application’s interface, experimenting with the described functionalities. At the beginning of this exploratory period, a simple script was run to start recording the user’s actions, for later reference.
Following this phase, subjects were given 15 minutes to freely explore the visualization, aiming to validate or disprove the visualization’s results.
4.2.3 Results
Both participants focused their analysis mainly on the Network’s relations, trying to understand whether the existing links could be validated. The second user also tried to verify the clustering displayed in the Cluster Layout, including the topic magnets that work with this visualization.
In general, the first user thought the results were good, as similar documents were correctly linked together. However, he was also able to find documents that should not have been linked at all. A particular example was a link between two studies that both mentioned Alzheimer’s disease, although their focus was different: one focused on MRI scans, the other on the influence of microglial activation on Alzheimer’s disease.
The second user’s feedback was in line with the first subject’s, as he also pointed out documents whose focal points did not match, although a secondary topic allowed the similarity connection to exist.
4.2.4 Discussion
In general, the results displayed were in accordance with what was expected, but the existence of wrongly linked nodes is cause for concern. Even though the wrongly linked nodes dealt with the same topics (in the specified example, Alzheimer’s disease), the focal point of each study needs to be taken into account.
In conclusion, the visualization can be an asset, with the potential to help users guide their research in this area. By providing certain interesting documents as focus points, users can better direct their efforts at what they are looking for, without losing the exploratory sense that would exist without this tool. However, in order to improve the trust users can place in this tool, further tweaking of the text processing pipeline may be needed. Additional testing is also required to further validate the utility of this visualization, focusing on the clustering and the topic magnets, as they may present interesting results.
5 Conclusion
Nowadays, users face the problem of too much information being available. A user trying to research a new topic will face a collection of context-specific documents, and exploring this collection may require knowledge of specific concepts that is only available to more experienced users. With that in mind, different visualizations were reviewed. These tried to help users understand the content of a single document or make sense of a whole collection of documents, usually helping the user see what kinds of topics are present in the visualization. Combining the comparison between the reviewed visualizations with the requirements gathered from professor Hugo Ferreira, from IBEB, a list of tasks was derived that helped guide the development of the application. The development followed an iterative model that relied on feedback collected from users to improve the visualization’s usability. An informal testing phase took place in order to gather feedback and detect possible usability problems before the final usability tests. Finally, a formal testing phase took place, consisting of two parts: usability tests and case studies. The former focused on measuring the usability of the application, while the latter aimed to validate the utility of the developed solution.
From this formal testing phase, we can conclude that all the defined objectives were met, with good final results. The main objective was to build a visualization that allows users to analyze the content of, and the similarity between, the documents in a collection. Several intermediate objectives were defined to guide the development, including building a database of documents, designing and developing the layout of the application, and evaluating the final solution.
The database was completed with the help of professor Hugo Ferreira, who gave guidelines on which topics to search for. This database was meant to support the development of the application and, as such, did not contain a very diverse collection of documents. The design and development were also completed successfully, although not all of the requirements collected from IBEB were implemented. Lastly, the evaluation of the final solution, through the formal testing phase, ended with good results regarding both the usability of the application and its utility, although there is room for improvement.
Although the scope of this project was mainly aimed at working with documents from the neuroscience context, it can easily be applied to other subjects as well. The development followed a generic approach, so that our work can be applied to a different area of expertise without much effort. Admittedly, some of the requirements that were not implemented would have contributed to narrowing the focus of this work, but due to time constraints, the requirements were prioritized in such a way that this generic approach was possible.
Additionally, there are some concerns regarding the scalability of this project. As mentioned, the development focused mainly on a smaller collection to aid our work on document processing. Lacking a bigger document collection, the scalability of our solution was not taken into account during design and development. As such, it is important to keep in mind that the application may present performance issues as the collection grows, and the visualizations may display a larger amount of visual clutter, complicating the interpretation of the dataset. As future work, it would be interesting to improve this aspect of our solution. A possible course would be to change the visualization so that it is possible to hide or collapse unrelated documents, in order to avoid displaying too many nodes at the same time.
Furthermore, there is additional work that involves improving the backend text processing described in Section 3.2, specifically extending the system to use bigrams and trigrams. By using these contiguous word sequences, the text analysis would be able to take context into account when measuring similarity between documents, which could be used to solve the wrongly linked nodes in the Network and improve the existing connections (a minimal sketch is given below). There could also be further work on improving the method of adding new documents to the visualization. This could follow professor Hugo Ferreira’s idea of integrating the visualization with a search engine such as Pubmed or Google Scholar, with a procedure to automatically fetch the full document. Alternatively, another method would be to allow the drag and drop of files into the visualization, with automatic fetching of metadata from the file or from an online database, removing this concern from the user.
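As mentioned above, a minimal sketch of an n-gram-based similarity pipeline, using scikit-learn as an illustrative stand-in for the backend of Section 3.2, could look as follows.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def ngram_similarity(texts):
    # ngram_range=(1, 3) adds bigrams and trigrams to the vocabulary, so
    # multi-word expressions such as "temporal lobe atrophy" are matched
    # as units and contribute context to the pairwise similarity matrix.
    vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
    tfidf = vectorizer.fit_transform(texts)
    return cosine_similarity(tfidf)
```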
Bibliography
[1] C. Collins, S. Carpendale, and G. Penn, “Docuburst: Visualizing document content using language
structure,” in Computer graphics forum, vol. 28, no. 3. Wiley Online Library, 2009, pp. 1039–1046.
[2] M. Wattenberg and F. B. Viégas, “The word tree, an interactive visual concordance,” IEEE transac-
tions on visualization and computer graphics, vol. 14, no. 6, pp. 1221–1228, 2008.
[3] M. Spindler and R. Dachselt, “Paperlens: advanced magic lens interaction above the tabletop,” in
Proceedings of the ACM International Conference on Interactive Tabletops and Surfaces. ACM,
2009, p. 7.
[4] J. Chuang, D. Ramage, C. Manning, and J. Heer, “Interpretation and trust: Designing model-driven
visualizations for text analysis,” in Proceedings of the SIGCHI Conference on Human Factors in
Computing Systems. ACM, 2012, pp. 443–452.
[5] C. Görg, Z. Liu, J. Kihm, J. Choo, H. Park, and J. Stasko, “Combining computational analyses and
interactive visualization for document exploration and sensemaking in jigsaw,” IEEE Transactions
on Visualization and Computer Graphics, vol. 19, no. 10, pp. 1646–1663, 2013.
[6] J.-K. Chou and C.-K. Yang, “Papervis: Literature review made easy,” in Computer Graphics Forum,
vol. 30, no. 3. Wiley Online Library, 2011, pp. 721–730.
[7] S. Havre, E. Hetzler, P. Whitney, and L. Nowell, “Themeriver: Visualizing thematic changes in large
document collections,” IEEE transactions on visualization and computer graphics, vol. 8, no. 1, pp.
9–20, 2002.
[8] N. Cao, J. Sun, Y.-R. Lin, D. Gotz, S. Liu, and H. Qu, “Facetatlas: Multifaceted visualization for
rich text corpora,” IEEE transactions on visualization and computer graphics, vol. 16, no. 6, pp.
1172–1181, 2010.
[9] S. Lehmann, U. Schwanecke, and R. Dörner, “Interactive visualization for opportunistic exploration
of large document collections,” Information Systems, vol. 35, no. 2, pp. 260–269, 2010.
[10] G. Marchionini, “Exploratory search: from finding to understanding,” Communications of the ACM,
vol. 49, no. 4, pp. 41–46, 2006.
[11] R. W. White, B. Kules, S. M. Drucker et al., “Supporting exploratory search, introduction, special
issue, communications of the acm,” Communications of the ACM, vol. 49, no. 4, pp. 36–39, 2006.
[12] D. A. Keim, J. Kohlhammer, G. Ellis, and F. Mansmann, Mastering the information age-solving
problems with visual analytics. Florian Mansmann, 2010.
[13] K. A. Cook and J. J. Thomas, “Illuminating the path: The research and development agenda for
visual analytics,” Pacific Northwest National Laboratory (PNNL), Richland, WA (US), Tech. Rep.,
2005.
[14] F. B. Viégas, M. Wattenberg, and J. Feinberg, “Participatory visualization with wordle,” IEEE trans-
actions on visualization and computer graphics, vol. 15, no. 6, pp. 1137–1144, 2009.
[15] G. A. Miller, “Wordnet: a lexical database for english,” Communications of the ACM, vol. 38, no. 11,
pp. 39–41, 1995.
[16] A. Thudt, U. Hinrichs, and S. Carpendale, “The bohemian bookshelf: supporting serendipitous book
discoveries through information visualization,” in Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems. ACM, 2012, pp. 1461–1470.
[17] P. André, J. Teevan, S. T. Dumais et al., “Discovery is never by chance: designing for (un) serendip-
ity,” in Proceedings of the seventh ACM conference on Creativity and cognition. ACM, 2009, pp.
305–314.
[18] A. Foster and N. Ford, “Serendipity and information seeking: an empirical study,” Journal of Docu-
mentation, vol. 59, no. 3, pp. 321–340, 2003.
[19] T. Gup, “Technology and the end of serendipity,” The Chronicle of Higher Education, vol. 44, no. 21,
p. A52, 1997.
[20] E. G. Toms, “Serendipitous information retrieval.” in DELOS Workshop: Information Seeking,
Searching and Querying in Digital Libraries. Zurich, 2000.
[21] Y. Hassan-Montero and V. Herrero-Solana, “Improving tag-clouds as visual information retrieval
interfaces,” in International conference on multidisciplinary information sciences and technologies.
Citeseer, 2006, pp. 25–28.
[22] J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of massive datasets. Cambridge university
press, 2014.
[23] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, “On the surprising behavior of distance metrics in
high dimensional spaces,” in ICDT, vol. 1. Springer, 2001, pp. 420–434.
[24] J. Nielsen, Usability engineering. Elsevier, 1994.
[25] J. Brooke et al., “Sus-a quick and dirty usability scale,” Usability evaluation in industry, vol. 189, no.
194, pp. 4–7, 1996.
[26] A. Bangor, P. T. Kortum, and J. T. Miller, “An empirical evaluation of the system usability scale,” Intl.
Journal of Human–Computer Interaction, vol. 24, no. 6, pp. 574–594, 2008.
[27] A. Bangor, P. Kortum, and J. Miller, “Determining what individual sus scores mean: Adding an
adjective rating scale,” Journal of usability studies, vol. 4, no. 3, pp. 114–123, 2009.