Helping Intelligence Analysts make Connections
M. Shahriar Hossain, Christopher Andrews, Naren Ramakrishnan, and Chris North
Department of Computer Science, Virginia Tech, Blacksburg, VA 24061
Email: {msh, cpa, naren, north}@vt.edu
Abstract
Discovering latent connections between seemingly unconnected documents and constructing “stories” from scattered pieces of evidence are staple tasks in intelligence analysis. We have worked with government intelligence analysts to understand the strategies they use to make connections. Beyond techniques like clustering that aim to provide an initial broad summary of large document collections, an important goal of analysts in this domain is to assimilate and synthesize fine-grained information from a smaller set of foraged documents. Further, analysts’ domain expertise is crucial because it provides rich contextual background for making connections, and thus the goal of KDD is to augment human discovery capabilities, not supplant them. We describe a visual analytics system we have built—Analyst’s Workspace (AW)—that integrates browsing tools with a storytelling algorithm in a large screen display environment. AW helps analysts systematically construct stories of desired fidelity from document collections and helps marshal evidence as longer stories are constructed.
Introduction
What do the April ’07 shootings at Virginia Tech, Bernard Madoff’s Ponzi scheme uncovered in Dec ’08, and the March ’09 recall of Zencore plus have in common? They are all extreme happenings that lead us to question: ‘Why didn’t somebody connect the dots?’ Our ongoing failures to do so have led to these and many other, arguably avoidable, catastrophes. Yet, piecing together a story between seemingly disconnected pieces of information remains an elusive skill and an understudied task.
Storytelling is an accepted metaphor in analytical reasoning and in visual analytics (Thomas and Cook (eds.) 2005). Many software tools exist to support story building activities (Eccles et al. 2008; Hsieh and Shipman 2002; Wright et al. 2006). Analysts are able to lay out evidence according to spatial cues and incrementally build connections between pieces of evidence. Such connections can then be chained together to create stories which serve either as end hypotheses or as templates of reasoning that can then be prototyped.
Copyright © 2011, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
However, there are severe limitations to human sensemaking capabilities, even on gigapixel-sized displays, when confronted with massive haystacks of data. Algorithmic support to help sift through the myriad of possibilities is crucial here. At the same time, storytelling is not entirely automatable since it is an exploratory activity and the analyst brings in valuable intuition and contextual cues to direct the story building process. Hence it is imperative that we view storytelling as a collaborative enterprise between algorithmic and human capabilities.
The focus of this paper is on exploring document collections, and we present a visual analytics system called Analyst’s Workspace (AW) that aids intelligence analysts in exploring connections and building stories between possibly disparate end points. Our key contributions are:
1. Design considerations that have emerged from a detailed user study with five analysts working on intelligence analysis tasks.
2. New algorithms that find stories through document collections and also help marshal evidence to support discovered stories.
3. Implementation of both interactive visualization and algorithmic storytelling support in AW, and a case study over a public domain dataset.
How Analysts make Connections
We recently had the opportunity to interview and perform a study with five intelligence analysts currently employed at a government organization. The detailed results are presented and discussed in (Andrews et al. 2010). We begin by describing qualitative lessons from the interviews, followed by a study of the analysts’ strategies in solving analysis tasks.
Interviews with Analysts
For the purpose of this paper, it suffices to note that the goal of the interviews was to typify how analysts approached the large quantities of data they were required to sift through, and to learn what tools they used and how they used them. From these interviews, the most interesting fact that emerged was that the analysts largely used software tools only at the beginning and at the end of their analysis.
Basic search tools were used to filter down a dataset at the start of their analysis. At the end of the analysis, presentation tools (such as PowerPoint) would be used to create reports. For the middle of the analytic process, where the actual sensemaking occurs, the analysts in our study reported that they tended to print out reports and other source materials. This allowed them to easily read the documents, annotate them with notes and highlights, sort them into physical folders, stack them in meaningful ways on the desk, and even lay all the documents out on a large table where they could be organized and rapidly skimmed.

Figure 1: How intelligence analysts make connections (from Pirolli and Card 2005).
A formal way to characterize the above observations is with reference to the schematic of Pirolli and Card (Pirolli and Card 2005). As Fig. 1 shows, the process by which intelligence analysts make connections is frequently tentative and evolutionary, with structures developing as understanding of the data increases. There are two ‘subloops’ in Fig. 1: information foraging and sensemaking. Most analytic systems, such as IN-SPIRE (PNNL), Jigsaw (HCII), ThemeRiver (Havre et al. 2002), and NetLens (Kang et al. 2007), focus on support for the information foraging loop, leaving the sensemaking to the analyst. Other tools, such as Analyst’s Notebook (i2group), Sentinel Visualizer (FMS, Inc.), Entity Workspace (Bier et al. 2006), and Palantir (Khurana et al. 2009), focus more on the sensemaking loop, and while many of them ostensibly support foraging, the analysts reported using these tools primarily for late stage sensemaking and presentation.
The key problem with this separation of the two halves of the sensemaking process is that the schematic is not meant to be a state diagram – it is a representation of some of the thought processes and structures that are identifiable during sensemaking and a description of how they relate. There is an overall trend from a collection of raw data to a final report, but in between, the analyst should be ranging widely across the entire process, building up an understanding through progressive foraging and structuring.
User Study
The tendency of analysts to resort to non-software methods for information organization suggested to us the potential for exploring the use of large screen displays and how they can be integrated into the sensemaking process. If the sensemaking can be drawn back into the computational realm, it provides the opportunity to better support the analysts.

Figure 2: A user works with Analyst’s Workspace on a 32 megapixel display.
We conducted a detailed user study with a large 32 megapixel (10,240×3,200) display, which consists of a 4×2 grid of 30″ LCD panels, each with a maximum resolution of 2560×1600. All of the panels in the display are driven by a single computer, allowing us to run conventional desktop applications on the display without modification. The display is configured for single-user use and is slightly curved around the user, who sits in the center, with the freedom to rotate around to access all parts of the display (Fig. 2).
For the study, we employed the VAST (Symposium on Visual Analytics Science and Technology) 2006 Challenge dataset. This dataset contains approximately 240 documents, which are primarily synthetic news stories from a fictitious city newspaper. Although this is a relatively small dataset, most of it is actually noise, with only about ten of the documents being relevant to uncovering the plot. Another feature of this dataset is that even if the analyst uncovers all ten documents, some analysis is still required to actually determine the nature of the synthetic threat.
Five analysts were presented with the dataset as a directory of files, with only the search facilities of Windows XP’s File Explorer, WordPad for reading and annotating documents, and a simple image viewer for the couple of images included in the dataset. We asked them to uncover the buried plot using any approach that they desired, using the space afforded by the display in any way that they found useful.
A key conclusion from this study was that the large display was treated in a fundamentally different way from conventional displays. Conventional displays typically constrain the user to working with one or two applications or documents at a time. Interaction in this environment is primarily application oriented. The large display, on the other hand, permits the user to work with a large number of applications and documents simultaneously. In our study, we found that this simple change encouraged users to adopt a more document-centric approach, working with the documents in a fashion more akin to the way one would interact with physical pieces of paper laid out on a physical desk. We found that our subjects freely moved documents around the space, creating a form of “semantic layer” over the document collection, in which position on the display helped to convey additional semantics, such as relationships between the documents. Using space to encode extra information about the relationship between objects has a rich history, rooted in human perceptual abilities (Kirsh 1995). A primary advantage of the use of space for this purpose is that it is very flexible and allows the user to express transitory or questionable relationships in a visually salient structure without committing to a strict and potentially confining structure (Shipman and Marshall 1999).

Figure 3: An active session in Analyst’s Workspace. Full text documents and entities share the space, with a mixture of spatial metaphors, such as clusters, graphs, and timelines all in evidence. The yellow lines are the links of the derived social network.
For example, most of the analysts used the space to cluster the documents that they found important. The interesting feature of these clusters is that they were frequently vague and grouped documents on an assortment of different levels. For instance, documents in the same workspace could be clustered because they related to a particular person or place, because they had a related theme such as weapons, or even because of how the analyst regarded the documents (e.g., many of the analysts created a pile of documents that they thought were probably junk but seemed related enough that they did not want to close them and lose them). Sometimes, clusters would form without the analyst having any clear thought about why the documents in the collection might fit together.
While the study demonstrated the appeal of working spatially for sensemaking, it is worth noting that most analysts did not solve the analysis task. At the end of most sessions, the analysts had all identified the major themes and created representative structures, but they did not connect the dots to put the entire story together. Here, we can point to the impoverished foraging support, which could not help them to identify the critical linchpins that would draw the whole story together.
The above observations motivated us to develop a visual analytics environment—Analyst’s Workspace (AW)—and to open the door to algorithmic assistance for foraging connections within AW. AW i) closely mimics the information organization layouts employed by analysts, ii) relates multiple representations to accommodate different strategies of exploration, and iii) provides automated algorithmic assistance for foraging connections and hypothesis generation.
Analyst’s Workspace
AW provides the user with a plethora of interaction tools for use with large screen displays (e.g., familiar click-and-drag, selection rectangles, multi-click selections) as well as information organization facilities (e.g., graph layout, temporal ordering). Because these operations are local, they only affect the local area or the currently selected documents and hence enable the analyst to freely mix spatial metaphors (see Fig. 3).
While the primary visual elements in AW are full text documents, we also provide support at the entity level. Documents are marked up based on extracted entities, and the analyst can use context menus to quickly identify new entities and create aliases between entities. Double clicking an entity of interest in a document opens an entity object, which is initially displayed as a list of documents in which that entity appears. Entities can also be collapsed down to a representational icon, and AW automatically draws links between entities when they co-occur in a document. These two features allow the analyst to rapidly construct and explore social networks, which are commonly used tools in intelligence analysis.
AW also provides basic facilities for text-based search. Search results are displayed as lists of matching documents in the space, like the entities. The documents are color coded to tell the analyst the state of a document: open, previously viewed, or never viewed.
Visual links play a strong role in AW. They allow a number of relationships to be expressed, freeing spatial proximity to be used to express more complex relationships more directly related to making sense of the dataset.
Figure 4: AW’s entity browser, here showing the people identified in the dataset, sorted by the number of documents in which each appears.

While Analyst’s Workspace is designed to support a flexible approach to sensemaking, it does encourage a particular analytic approach that we observed being used by the analysts. This is a strategy that Kang et al. (2009) referred to as “Find a Clue, Follow the Trail”. In this strategy, the analyst identifies some starting place and then branches out the investigation from that point, following keywords and entities.
In AW, a starting point can be provided by the entity browser (Fig. 4), which allows the analyst to order entities by the number of occurrences in the dataset. The analyst opens an entity and gets a list of documents in which the entity appears. The analyst then works through these documents, opening new entities or performing searches as new clues are found. Since all of the search results are independent objects in the space and there is a visual record of which documents have been visited, AW can support both a breadth-first and a depth-first search through the information. As the investigation progresses, the analyst uses the space to arrange the information as it is uncovered, building and rebuilding structures to reflect his or her current understanding of the underlying narrative.
While this approach has been shown to be fairly effective (Kang et al. 2009), it does not permit greater characterization of the dataset and does not support more complex questions that the analyst might ask. For example, this approach relies entirely on the analyst to pick the right keywords and entities to “chase,” and can miss less direct lines of investigation. It is common for terrorists to use multiple aliases or code words that can easily thwart this approach. However, it is possible that common patterns of behavior or other document similarities might help the analyst to uncover some of these connections.
The analyst may also need the discovery of paths through the dataset to be more efficient. For example, the analyst may have uncovered that a revolutionary in South America shares the same last name as a farmer in the Pacific Northwest who has been implicated in some nefarious affairs, and wishes to ask if there is any link between them or if their last name is a coincidence. An exhaustive background check of the two men is possible through AW if the dataset is relatively small, but it is an indirect and time consuming process.
Algorithmic Support for Storytelling
We attempted to formalize and support the ways by which an analyst conducts unstructured discovery, chases leads, and marshals evidence to support or refute potentially promising chains. Our story generation framework is exploratory in nature: given starting and ending documents of interest, it explores candidate documents for path following, and uses heuristics to admissibly estimate the potential for paths to lead to the desired destination. The generated paths are then presented to the AW analyst, who can choose to revise them or adapt them for his/her purposes.
A story between documents d1 and dn is a sequence of intermediate documents d2, d3, ..., dn−1 such that every neighboring pair of documents satisfies some user defined criteria. Given a story connecting a start and an end document (see Fig. 7(a)), analysts perform one of two tasks: they either aim to strengthen the individual connections, possibly leading to a longer chain (see Fig. 7(b)), or alternatively they seek to organize evidence around the given connection (see Fig. 7(c)). We use the notions of distance threshold and clique size to mimic these behaviors. We designed our storytelling algorithm to work with these two criteria, which are under the AW analyst’s control. (These are not magic parameters whose values have to be tuned but are rather controls that mimic the natural process by which analysts tighten or strengthen their hypotheses.)
The distance threshold refers to the maximum acceptable distance between two neighboring documents in a story. Lower distance thresholds impose stricter requirements and lead to longer paths. The clique size threshold refers to the minimum size of the clique that every pair of neighboring documents must participate in. Thus, greater clique sizes impose greater neighborhood constraints and lead to longer paths. See Fig. 7(d) for a new path with both a stricter clique size and a stricter distance threshold. These two parameters hence essentially map the story finding problem to one of uncovering clique paths in the underlying induced similarity network between documents.
We use the term “clique chain” to refer to a story along with its surrounding connections of evidence. In contrast, a story constitutes only the junction points between consecutive cliques. Another way to characterize them is that a clique chain constitutes many stories.
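The two constraints can be written down as simple predicates over an arbitrary document distance function (e.g., the Soergel distance used later). This is a sketch under the assumption that consecutive cliques must share at least one junction document; the function names are hypothetical.

```python
from itertools import combinations

def valid_clique(docs, dist, theta):
    """Every pair of documents in the clique must satisfy the
    distance threshold theta."""
    return all(dist(a, b) <= theta for a, b in combinations(docs, 2))

def valid_clique_chain(cliques, dist, theta, k):
    """A clique chain is valid when every clique has at least k members,
    satisfies theta pairwise, and consecutive cliques share at least one
    junction document (the story passes through those junctions)."""
    if any(len(c) < k or not valid_clique(c, dist, theta) for c in cliques):
        return False
    return all(set(a) & set(b) for a, b in zip(cliques, cliques[1:]))

# Toy check with numbers as "documents" and absolute difference as distance:
ok = valid_clique_chain([[0, 1], [1, 2]], lambda a, b: abs(a - b), 1.0, 2)
```

Tightening theta (smaller) or k (larger) shrinks the set of valid chains, which is exactly how the analyst strengthens a hypothesis.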
Fig. 5 describes the steps involved in generating stories for interaction by the AW analyst. For document modeling, we use a bag-of-words (vector) representation where the terms are weighted by tf-idf with cosine normalization. Our search framework has three key computational stages:
1. construction of a concept lattice,
2. generation of promising candidates for path following, and
3. evaluation of candidates for their potential to lead to the destination.
Figure 5: Pipeline of the storytelling framework in AW: input documents undergo stop-word removal and stemming, followed by document modeling and concept lattice generation; a heuristic search, guided by the analyst’s input, then generates stories.
Figure 6: A dataset and its concept lattice. The value in a cell (dj, tx) of the term-document matrix indicates the frequency of term x in document j. Each concept in the lattice pairs a term set with the set of documents containing all of those terms (e.g., concept C7 pairs terms {A, D, E} with documents {1, 3, 5, 7}).
Of these, the first stage can be viewed as a startup cost that can be amortized over multiple path finding tasks. The second and third stages are organized as part of an A* search algorithm that begins with the starting document, uses the concept lattice to identify candidates satisfying the distance and clique size requirements, and evaluates them heuristically for their promise in leading to the end document.
Concept Lattice Construction
The concept lattice is a data structure that models conceptual clusters of document and term overlaps; it is used here as a quick lookup of potential neighbors that will satisfy the distance threshold and clique constraints. Given a (weighted) term-document matrix, we use the CHARM-L (Zaki and Ramakrishnan 2005) closed set mining algorithm on a boolean version of this matrix to generate a concept lattice. Each concept is a pair (document set, term set), as shown in Fig. 6. Further, we order the document list for each concept by the number of terms. Note that we can find an approximate set of nearest neighbors for a document d from the document list of the concept containing d and the longest term set.
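The neighbor lookup described in the last sentence can be sketched directly over (document set, term set) pairs; producing the concepts themselves (e.g., with CHARM-L) is outside this sketch, and the helper name is hypothetical.

```python
def approx_neighbors(d, concepts):
    """concepts: iterable of (doc_set, term_set) pairs from closed set
    mining. Approximate nearest neighbors of document d: the other
    documents of the concept containing d whose term set is longest,
    i.e., the documents sharing the most terms with d."""
    containing = [c for c in concepts if d in c[0]]
    if not containing:
        return set()
    docs, _terms = max(containing, key=lambda c: len(c[1]))
    return docs - {d}

# Concepts in the style of Fig. 6 (document ids paired with shared terms):
concepts = [
    ({3, 7}, {"A", "D", "E", "F", "G"}),
    ({3, 5, 7}, {"A", "D", "E", "F"}),
    ({1, 3, 5, 7}, {"A", "D", "E"}),
]
neighbors = approx_neighbors(5, concepts)  # documents 3 and 7
```

The lookup is approximate because term overlap only correlates with the weighted distance; candidates are still checked against the distance threshold afterwards.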
Successor Generation
Successor generation is the task of, given a document, using the distance threshold and clique size requirements to identify a set of possible successors for path following. Note that this does not use the end document in its computation.
The basic idea of our successor generation approach is, in addition to finding a good set of successor nodes for a given document d, to have a sufficient number of them so that, combinatorially, they contribute a desired number of cliques. With a clique size constraint of k, it is not sufficient to merely pick the top k neighbors of the given document, since the successor generation function expects multiple clique candidates. (Note that, even if we picked the top k neighbors, we would still need to subject them to a check to verify that every pair satisfies the distance threshold.) Given that this function expects b clique candidates (where b is the branching factor), a minimum of m documents must be identified, where m is given by the solution to the inequalities:

C(m−1, k) < b and C(m, k) ≥ b

where C(·, ·) denotes the binomial coefficient.
For a given document, we pick the top m candidate documents from the concept lattice, and our successor generator forms combinations of size k from these m documents to obtain a total of b k-cliques. Since m is calculated using the two inequalities, the total number of such combinations is equal to or slightly greater than b (but never less than b). Each clique is given an average distance score calculated from the distances between the documents of the clique and the current document d. This aids in returning a priority queue of exactly b candidate k-cliques.
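The inequalities pin down m as the smallest candidate count whose size-k combinations cover the branching factor; a direct search for m might look like this (a sketch, not the paper’s code):

```python
from math import comb

def min_candidates(b, k):
    """Smallest m with C(m, k) >= b; then C(m-1, k) < b automatically,
    matching the two inequalities for the number of top neighbor
    documents needed to yield b candidate k-cliques."""
    m = k  # C(k, k) = 1 is the smallest nonzero value
    while comb(m, k) < b:
        m += 1
    return m

m = min_candidates(10, 3)  # C(4,3) = 4 < 10 <= C(5,3) = 10, so m = 5
```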
We evaluated our successor generation mechanism by comparing it to brute force nearest neighbor search and to cover tree based (Beygelzimer et al. 2006) nearest neighbor search. We found that our concept lattice based successor generation mechanism works faster than these other approaches (details omitted due to space limitations). Therefore we adopt the concept lattice in our successor generation procedure.
Evaluating Candidates
We now have a basket of candidates that are close to the current document, and we must determine which of these has the potential to lead to the destination document. The primary criterion of optimality for the A* search procedure of our framework is the cumulative Soergel distance of the path. The Soergel distance between two documents d1 and d2 is given by:

D(d1, d2) = Σ_t |w_{t,d1} − w_{t,d2}| / Σ_t max(w_{t,d1}, w_{t,d2})

where w_{t,di} indicates the weight of term t in document di. We use the straight-line Soergel distance for the heuristic
Figure 7: A sample story illustrating the impact of changing clique size and distance threshold. The goal is to connect bio-weapon scientist Boris Bugarov with money launderer Pieter Dopple. As the distance and clique size thresholds are experimented with, we observe surrounding evidence connecting Pieter Dopple with militant Islamic groups.
(a) Clique size = 2, distance threshold = 0.99: a story with weak connections.
(b) Clique size = 2, distance threshold = 0.96: a story with stricter links.
(c) Clique size = 3, distance threshold = 0.99: a better story with a small amount of surrounding evidence.
(d) Clique size = 4, distance threshold = 0.95: a better story with more surrounding evidence.
and, because it obeys the triangle inequality, it can be shown that this will never overestimate the cost of a path from any document d to the goal. Therefore our heuristic is admissible and our A* search will yield the optimal path.
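The Soergel distance itself is straightforward to compute over sparse term-weight vectors; a minimal sketch:

```python
def soergel(w1, w2):
    """Soergel distance between two term-weight dicts:
    sum_t |w1[t] - w2[t]|  /  sum_t max(w1[t], w2[t]).
    0.0 for identical vectors and 1.0 for vectors sharing no terms,
    assuming non-negative weights such as tf-idf."""
    terms = set(w1) | set(w2)
    num = sum(abs(w1.get(t, 0.0) - w2.get(t, 0.0)) for t in terms)
    den = sum(max(w1.get(t, 0.0), w2.get(t, 0.0)) for t in terms)
    return num / den if den else 0.0
```

In the A* search, a path’s priority would then be the accumulated Soergel distance so far plus soergel(current, goal) as the admissible straight-line term.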
It is important to note that our algorithm never explicitly computes or materializes the underlying network of similarities at any time. As a result, it is very easy for the AW analyst to vary the clique size and distance thresholds to analyze different stories for the same start and end pairs.
Experimental Results
We conduct both quantitative and qualitative evaluation of AW’s visual and algorithmic support for storytelling. The questions we seek to assess are:
1. What is the interplay between distance threshold and clique size constraints in story construction? How does our heuristic fare in reference to an uninformed search and as a function of the constraints?
2. What is the quality of stories discovered by our algorithm?
3. How do the algorithmically discovered stories compare to those found by analysts?
4. How can analysts mix-and-match algorithmic capabilities with their intuitive expertise in story construction?
For our experiments, we used an analysis exercise (Hughes 2005) developed at the Joint Military Intelligence College. The exercise dataset is sometimes referred to as the Atlantic Storm dataset.
Evaluating Story Construction
To study the relationship between distance threshold and clique size constraints, we generated thousands of stories with different distance and clique size requirements from the Atlantic Storm dataset, and computed the maximum clique size for which at least one story was found. As expected, we see an anti-monotonic relationship: it becomes more difficult to marshal evidence as distance thresholds get stricter (Fig. 8).
To study the performance of AW’s heuristic over a non-heuristic based search, we picked 1000 random start-end document pairs from our document collection and generated stories with different distance threshold and clique size requirements. The non-heuristic search is simply a breadth-first search version of our A* search framework (in other words, the heuristic returns zero for all inputs). Fig. 9 compares average runtimes of AW’s heuristic based search against the non-heuristic search. From top to bottom, the three plots of Fig. 9 depict the average runtimes respectively as functions of story length, distance threshold, and clique size. Astute readers might expect a monotonic increase of average runtime with longer stories in Fig. 9 (top). Stories tend to
Figure 8: Atlantic Storm dataset: interplay between distance threshold and clique size constraints. The plot shows, for distance thresholds θ from 0.76 to 0.99, the maximum clique size (0 to 25) for which at least one clique chain was found; smaller θ implies a stricter distance requirement, and larger clique size implies a stricter clique requirement.
become longer with stringent distance threshold and clique size. Further stringency, however, results in broken stories (the length of the story theoretically becomes infinite). As a result, we found a smaller number of longer stories than
Figure 9: The impact of the heuristic. We used 1000 random start-end pairs to compare the performance of AW’s heuristic search against uninformed search. From top to bottom, the three panels plot the average time (sec) to discover stories, with and without the heuristic, as functions of story length l, distance threshold θ, and clique size k.
Figure 10: (left) A dispersion plot of an ideal story; the dispersion coefficient ϑ = 1.0. (right) A dispersion plot of a non-ideal story of the same length; ϑ = 1 − 3/(8×8) − 1/(8×7) = 0.94.
the shorter ones. In all the plots of Fig. 9, we calculated the average time over only the discovered stories. Since most of the long stories were found quickly by our algorithms, the curves of Fig. 9 (top) increase first and then decrease instead of being monotonically increasing. All the plots of Fig. 9 show that the heuristic yields significant gains over the uninformed search.
Evaluating Story Quality
It is difficult to objectively evaluate the quality of stories. Here, we adopt Swanson's complementary but disjoint (CBD) hypothesis (Swanson 1991) and assess the pairwise Soergel distance between documents in a story, between consecutive as well as non-consecutive documents. An ideal story is one that meets the Soergel distance threshold θ only between consecutive pairs, whereas a non-ideal story "over-satisfies" the distance threshold and meets it even between non-consecutive pairs. As shown in Fig. 10 (left), an ideal story has only diagonal entries in its dispersion plot (contrast with Fig. 10 (right)). If the n documents of a story are d_0, d_1, ..., d_{n−1}, then our formula for the dispersion coefficient is given by:
ϑ = 1 − (1/(n−2)) Σ_{i=0}^{n−3} Σ_{j=i+2}^{n−1} disp(d_i, d_j)

where

disp(d_i, d_j) = 1/(n+i−j), if D(d_i, d_j) ≤ θ; 0, otherwise

and D(d_i, d_j) denotes the Soergel distance between documents d_i and d_j.
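As a concrete sketch of this computation (a minimal illustration, not the authors' implementation; the term-weight vector representation of documents is an assumption), the Soergel distance and the dispersion coefficient can be written as:

```python
def soergel(u, v):
    """Soergel distance between two nonnegative term-weight vectors:
    sum_i |u_i - v_i| / sum_i max(u_i, v_i); 0 = identical, 1 = disjoint."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(max(a, b) for a, b in zip(u, v))
    return num / den if den else 0.0

def dispersion_coefficient(docs, theta):
    """Dispersion coefficient of a story d_0, ..., d_{n-1}: penalize every
    non-consecutive pair (j >= i + 2) that still meets the distance
    threshold theta, weighting each violation by 1/(n + i - j).
    An ideal story scores 1.0."""
    n = len(docs)
    if n < 3:
        return 1.0  # no non-consecutive pairs to penalize
    penalty = 0.0
    for i in range(n - 2):
        for j in range(i + 2, n):
            if soergel(docs[i], docs[j]) <= theta:
                penalty += 1.0 / (n + i - j)
    return 1.0 - penalty / (n - 2)
```

For instance, a three-document story whose only non-consecutive pair is mutually dissimilar scores 1.0, while three identical documents make that pair meet any threshold and drive the coefficient to 0.0.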
Table 1: Sample story fragments from an analyst. How did our algorithm fare in discovering them? For each fragment, the table marks whether it was found directly by the algorithm, found in the clique path, or found by merging stories.

Story fragments:
- FBI_30, FBI_35, FBI_41, CIA_43
- CIA_41, CIA_34, CIA_39, NSA_09, NSA_16
- CIA_01, CIA_05, CIA_34, CIA_41, CIA_17
- CIA_39, NSA_22
- NSA_11, NSA_18, NSA_16
- CIA_06, CIA_22, CIA_21
- CIA_24, FBI_24
- NSA_06, CIA_32, CIA_42
- NSA_16, CIA_38, CIA_42

We also compute p-values for each generated story. Recall that at each step of the A* search we build a queue of candidate documents by investigating the corresponding concepts of the concept lattice. To calculate the p-value of a clique of size k, we randomly select k−1 documents from the entire candidate pool and check whether all the edges of the formed k-clique satisfy the distance threshold θ, iterating the test 50,000 times. This allows us to find p-values down to 2×10⁻⁵. We repeat this process for every junction document of a discovered clique chain. The overall p-value of a clique chain is calculated by multiplying the p-values of every clique in the chain.
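The randomized test above can be sketched as follows (an illustrative implementation, not the authors' code; `distance` stands in for any pairwise distance function, and the candidate pool is a hypothetical input):

```python
import itertools
import random

def clique_pvalue(junction_doc, candidate_pool, k, theta, distance,
                  trials=50000, rng=None):
    """Estimate the p-value of a size-k clique around a junction document:
    the fraction of trials in which k-1 documents drawn at random from the
    candidate pool, together with the junction document, form a k-clique
    all of whose edges satisfy the distance threshold theta. With 50,000
    trials the smallest resolvable nonzero p-value is 2e-5."""
    rng = rng or random.Random(0)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(candidate_pool, k - 1) + [junction_doc]
        if all(distance(a, b) <= theta
               for a, b in itertools.combinations(sample, 2)):
            hits += 1
    return hits / trials

def chain_pvalue(junction_pvalues):
    """Overall p-value of a clique chain: the product of the p-values
    computed at each junction document of the chain."""
    p = 1.0
    for pv in junction_pvalues:
        p *= pv
    return p
```

A small p-value indicates that random draws from the pool rarely form a threshold-satisfying clique, i.e., the discovered clique is unlikely to arise by chance.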
Story Validation
We have depicted stories with different distance and clique-size requirements in Fig. 7. The story connects a Russian bio-weapon scientist (Boris Bugarov) with a money launderer (Pieter Dopple) who has ties to militant Islamic groups. In Table 1 we compare some discovered stories with fragments put together by analysts. The inputs from the analysts are not complete stories but rather scattered, piecewise connections. The table illustrates that all the stories were discovered by our algorithm, with two kinds of exceptions: either the story was not in the directly discovered path but was present in the clique chain (i.e., the story did not exhibit the same junction points), or the fragment can be discovered by merging multiple stories together. This demonstrates the potential of our heuristic in helping AW analysts discover stories algorithmically.

Figure 11: Illustration of AW usage. Panel annotations: the analyst requests a story connecting a pair of interesting documents; unsatisfied with the strength of the connection, the analyst requests information about documents in the surrounding neighborhood (i.e., within the local clique); having explored the local neighborhood, the analyst has identified two additional documents that form a more meaningful connection and extends the original story; the generated story between the two endpoints, where the system has identified two linking documents and connected them into a linked story; a list of the neighbors of the third document, with lines providing visual links to open documents; new connections manually added to extend the story.
Illustration of AW Usage
Fig. 11 shows an example of the usage of AW and our algorithms. In this scenario, the analyst requests a story connecting a pair of interesting documents. The algorithm returns a story, but the analyst is not satisfied with parts of it. The analyst then requests information about documents in the surrounding neighborhood (i.e., within the local clique) of an intermediate document. Having explored the local neighborhood, the analyst identifies two additional documents that form a more meaningful connection and extends the original story. The two story fragments of Table 1 that were not directly found by the algorithm could be modified by the analyst in this way to obtain more meaningful stories.
Related Literature
We organize related work in this space under various categories.

Relationships via associations: Jayadevaprakash et al. (2005) advocate a transitive method to generate an association graph for finding relationships between non-cooccurring text objects; they favor transitive methods because these do not require expensive training by human experts. Similarly, our approach does not require expensive training, but we situate our methods in a visual analytics setting with intelligence experts providing active feedback in the discovery process. Vaka and Mukhopadhyay (2009) describe a method to extract transitive associations among diseases and herbs related to Ayurveda. The method is based on a text-mining technique designed for discovering transitive associations among biological objects; it uses a vocabulary discovery method over a subset of PubMed corpora to associate herbs and diseases. Thaicharoen (2009) aims to discover relational knowledge in the form of frequent relational patterns and relational association rules from disjoint sets of literature. Although the aims of Vaka and Mukhopadhyay and of Thaicharoen are somewhat similar to our objective, we focus on finding connecting chains in an induced similarity network of documents rather than finding a chain of associations via external knowledge.
Topic-based hypothesis generation: Jin et al. (2007) present a tool based on link analysis and text mining methodologies to detect links between two topics across two individual documents. Srinivasan (2004) presents text mining algorithms built within the framework established by Swanson (1991); the algorithms generate ranked term lists where the key terms represent novel relationships between topics. Although we do not conduct explicit topic modeling in our work, the requirement to impose clique constraints in story construction essentially helps transduce slowly between topics.
Classification and clustering for hypothesis generation: Glance et al. (2005) describe a system that gathers specific types of online content and delivers analytics based on classification, natural language processing, and other mining technologies in a marketing intelligence application. Faro et al. (2009) propose a clustering method aimed at discovering hidden relationships for hypothesis generation and suitable for semi-interactive querying. Our method does not depend on classification or clustering for information organization but harnesses CBD structures to find chains between documents of different clusters.
Connecting the dots: The "connecting the dots" problem has appeared in the literature in different guises and for different applications: cellular networks (Brassard et al. 1980), social networks (Faloutsos et al. 2004), image collections (Heath et al. 2010), and document collections (Das-Neves et al. 2005; Kumar et al. 2006; Shahaf and Guestrin 2010). Our work explicitly harnesses CBD structures, whereas many of these works focus on contexts with weaker dispersion requirements. For instance, the model proposed by Shahaf and Guestrin (2010) explicitly requires a connecting thread of commonality through all documents in a story.
Discussion
We have described a visual analytics system (AW) that provides both exploratory and algorithmic support for analysts in making connections. Privacy considerations prohibit us from describing the new applications that AW is being used for, but the experimental results demonstrate its range of capabilities. Future work is geared toward more mixed-initiative facilities for story generation and probabilistic methods to accommodate richer forms of analyst feedback. We are also working toward techniques for automatic story summarization and concept map generation.
Acknowledgments
This work is supported in part by the Institute for Critical Technology and Applied Science, Virginia Tech, and the US National Science Foundation through grant CCF-0937133.
References
Andrews, C.; Endert, A.; and North, C. 2010. Space to Think: Large High-resolution Displays for Sensemaking. In CHI '10, 55–64.
Beygelzimer, A.; Kakade, S.; and Langford, J. 2006. Cover Trees for Nearest Neighbor. In ICML '06, 97–104.
Bier, E.; Ishak, E.; and Chi, E. 2006. Entity Workspace: An Evidence File That Aids Memory, Inference, and Reading. In ISI '06, 466–472.
Brassard, J.-P., and Gecsei, J. 1980. Path Building in Cellular Partitioning Networks. In ISCA '80, 44–50.
Das-Neves, F.; Fox, E. A.; and Yu, X. 2005. Connecting Topics in Document Collections with Stepping Stones and Pathways. In CIKM '05, 91–98.
Eccles, R.; Kapler, T.; Harper, R.; and Wright, W. 2008. Stories in GeoTime. Info. Vis. 7(1):3–17.
Faloutsos, C.; McCurley, K. S.; and Tomkins, A. 2004. Fast Discovery of Connection Subgraphs. In KDD '04, 118–127.
Faro, A.; Giordano, D.; Maiorana, F.; and Spampinato, C. 2009. Discovering Genes-diseases Associations from Specialized Literature using the Grid. Trans. Info. Tech. Biomed. 13:554–560.
FMS, Inc. FMS Advanced Systems Group, Sentinel Visualizer. Last accessed: May 26, 2011, http://www.fmsasg.com/.
Glance, N.; Hurst, M.; Nigam, K.; Siegler, M.; Stockton, R.; and Tomokiyo, T. 2005. Deriving Marketing Intelligence from Online Discussion. In KDD '05, 419–428.
Havre, S.; Hetzler, E.; Whitney, P.; and Nowell, L. 2002. ThemeRiver: Visualizing Thematic Changes in Large Document Collections. IEEE TVCG 8(1):9–20.
HCII. Human Computer Interaction Institute, Carnegie Mellon University, Jigsaw. Last accessed: May 26, 2011, http://www.hcii.cmu.edu/mhci/projects/jigsaw.
Heath, K.; Gelfand, N.; Ovsjanikov, M.; Aanjaneya, M.; and Guibas, L. 2010. Image Webs: Computing and Exploiting Connectivity in Image Collections. In CVPR, 3432–3439.
Hsieh, H., and Shipman, F. M. 2002. Manipulating Structured Information in a Visual Workspace. In UIST '02, 217–226.
Hughes, F. J. 2005. Discovery, Proof, Choice: The Art and Science of the Process of Intelligence Analysis, Case Study 6, "All Fall Down". Unpublished report.
i2group. The Analyst's Notebook. Last accessed: May 26, 2011, http://www.i2group.com/us.
Jayadevaprakash, N.; Mukhopadhyay, S.; and Palakal, M. 2005. Generating Association Graphs of Non-cooccurring Text Objects using Transitive Methods. In SAC '05, 141–145.
Jin, W.; Srihari, R. K.; and Ho, H. H. 2007. A Text Mining Model for Hypothesis Generation. In ICTAI '07, 156–162.
Kang, H.; Plaisant, C.; Lee, B.; and Bederson, B. B. 2007. NetLens: Iterative Exploration of Content-actor Network Data. Info. Vis. 6(1):18–31.
Kang, Y.; Görg, C.; and Stasko, J. 2009. The Evaluation of Visual Analytics Systems for Investigative Analysis: Deriving Design Principles from a Case Study. In VAST, 139–146.
Khurana, H.; Basney, J.; Bakht, M.; Freemon, M.; Welch, V.; and Butler, R. 2009. Palantir: A Framework for Collaborative Incident Response and Investigation. In IDtrust '09, 38–51.
Kirsh, D. 1995. The Intelligent Use of Space. Artif. Intell. 73(1-2):31–68.
Kumar, D.; Ramakrishnan, N.; Helm, R. F.; and Potts, M. 2006. Algorithms for Storytelling. In KDD '06, 604–610.
Pirolli, P., and Card, S. 2005. The Sensemaking Process and Leverage Points for Analyst Technology as Identified through Cognitive Task Analysis. In ICIA '05.
PNNL. Pacific Northwest National Laboratory, INSPIRE visual document analysis. Last accessed: May 26, 2011, http://in-spire.pnl.gov.
Shahaf, D., and Guestrin, C. 2010. Connecting the Dots between News Articles. In KDD '10, 623–632.
Shipman, F. M., and Marshall, C. C. 1999. Formality Considered Harmful: Experiences, Emerging Themes, and Directions on the Use of Formal Representations in Interactive Systems. CSCW 8:333–352.
Srinivasan, P. 2004. Text Mining: Generating Hypotheses from MEDLINE. J. Am. Soc. Inf. Sci. Technol. 55:396–413.
Swanson, D. R. 1991. Complementary Structures in Disjoint Science Literatures. In SIGIR '91, 280–289.
Thaicharoen, S. 2009. Text Association Mining with Cross-sentence Inference, Structure-based Document Model and Multi-relational Text Mining. Ph.D. Dissertation, Univ. of Colorado at Denver.
Thomas, J. J., and Cook, K. A., eds. 2005. Illuminating the Path: The Research and Development Agenda for Visual Analytics. IEEE Computer Society Press.
Vaka, H. G. G., and Mukhopadhyay, S. 2009. Hypotheses Generation Pertaining to Ayurveda Using Automated Vocabulary Generation and Transitive Text Mining. In NBIS '09, 200–205.
Wright, W.; Schroh, D.; Proulx, P.; Skaburskis, A.; and Cort, B. 2006. The Sandbox for Analysis: Concepts and Methods. In CHI '06, 801–810.
Zaki, M. J., and Ramakrishnan, N. 2005. Reasoning About Sets Using Redescription Mining. In KDD '05, 364–373.