Computer Forum Poster - Stanford Computer Forum

Text Visualization Techniques for Assessing Topic Model QualityJason Chuang, Christopher D. Manning, Jeffrey Heer

Statistical Topic ModelsLATENT DIRICHLET ALLOCATION

α z wθ

β

MN

Figure 1: Graphical model representation of LDA. The boxes are “plates” representing replicates.The outer plate represents documents, while the inner plate represents the repeated choiceof topics and words within a document.

where p(zn |θ) is simply θi for the unique i such that zin = 1. Integrating over θ and summing overz, we obtain the marginal distribution of a document:

p(w |α,β) =Z

p(θ |α)

N

∏n=1∑znp(zn |θ)p(wn |zn,β)

!

dθ. (3)

Finally, taking the product of the marginal probabilities of single documents, we obtain the proba-bility of a corpus:

p(D |α,β) =M

∏d=1

Z

p(θd |α)

Nd

∏n=1∑zdnp(zdn |θd)p(wdn |zdn,β)

!

dθd .

The LDA model is represented as a probabilistic graphical model in Figure 1. As the figuremakes clear, there are three levels to the LDA representation. The parameters α and β are corpus-level parameters, assumed to be sampled once in the process of generating a corpus. The variablesθd are document-level variables, sampled once per document. Finally, the variables zdn and wdn areword-level variables and are sampled once for each word in each document.

It is important to distinguish LDA from a simple Dirichlet-multinomial clustering model. Aclassical clustering model would involve a two-level model in which a Dirichlet is sampled oncefor a corpus, a multinomial clustering variable is selected once for each document in the corpus,and a set of words are selected for the document conditional on the cluster variable. As with manyclustering models, such a model restricts a document to being associated with a single topic. LDA,on the other hand, involves three levels, and notably the topic node is sampled repeatedly within thedocument. Under this model, documents can be associated with multiple topics.

Structures similar to that shown in Figure 1 are often studied in Bayesian statistical modeling,where they are referred to as hierarchical models (Gelman et al., 1995), or more precisely as con-ditionally independent hierarchical models (Kass and Steffey, 1989). Such models are also oftenreferred to as parametric empirical Bayes models, a term that refers not only to a particular modelstructure, but also to the methods used for estimating parameters in the model (Morris, 1983). In-deed, as we discuss in Section 5, we adopt the empirical Bayes approach to estimating parameterssuch as α and β in simple implementations of LDA, but we also consider fuller Bayesian approachesas well.

997

Blei et al. Latent Dirichlet allocation.Journal of Machine Learning Research. 2003.

Large-Scale Analysis of Text Corpora

NATURE METHODS | VOL.8 NO.6 | JUNE 2011 | 443

CORRESPONDENCE

framework that is based on scientific research rather than NIH administrative and categorical designations.

We found that topic-based categories are not strictly associ-ated with the missions of individual Institutes but instead cut across the NIH, albeit in varying proportions consistent with each Institute’s distinct mission (Supplementary Table 1). The graphical map layout (Fig. 1) shows a global research structure that is logically coherent but only loosely related to Institute organization (Supplementary Table 1).

We describe four example use cases (Supplementary Data). First, we show a query using an algorithm-derived category relevant to angiogenesis (Supplementary Fig. 1). Unlike standard keyword-based searches, this type of query allows retrieval of grants that are truly focused on a particular research area. In addition, the resulting graphical clusters reveal clear patterns in the relation-ships between the retrieved grants and the multiple Institutes fund-ing this research. Second, we examine an NIH peer review study section. The database categories and clusters clarify the complex relationship between the NIH Institutes and the centralized NIH peer review system, which is distinct and independent from the Institutes. Third, we show an analysis of the NIH RCDC category ‘sleep research’ in conjunction with the database topics, the latter

Database of NIH grants using machine-learned categories and graphical clustering

To the Editor: Information on research funding is important to various groups, including investigators, policy analysts, advo-cacy organizations and, of course, the funding agencies them-selves. But informatics resources devoted to research funding are currently limited. In particular, there is a need for informa-tion on grants from the US National Institutes of Health (NIH), the world’s largest single source of biomedical research funding, because of its large number of awards (~80,000 each year) and its complex organizational structure. NIH’s 25 grant-awarding Institutes and Centers have distinct but overlapping missions, and the relationship between these missions and the research they fund is multifaceted. Because there is no comprehensive scheme that characterizes NIH research, navigating the NIH funding landscape can be challenging.

At present, NIH offers information on awarded grants via the RePORTER website (http://projectreporter.nih.gov/). For each award, RePORTER provides keyword tags, plus ~215 categorical designations assigned to grants via a partially auto-mated system known as the NIH research, condition and disease categorization (RCDC) process (http://report.nih.gov/rcdc/categories/). But keyword searches are not optimal for various information needs and analyses, and the RCDC cat-egories are only intended to meet specific NIH reporting requirements, rather than to comprehensively characterize the entire NIH research portfolio.

To facilitate navigation and discovery of NIH-funded research, we created a data-base (https://app.nihmaps.org/) in which we use text mining to extract latent cat-egories and clusters from NIH grant titles and abstracts. This categorical informa-tion is discovered using two unsupervised machine-learning techniques. The first is topic modeling, a Bayesian statistical method that discerns meaningful catego-ries from unstructured text. The second is a graph-based clustering method that produces a two-dimensional visualized output, in which grants are grouped based on their overall topic- and word-based similarity to one another. The database allows specific queries within a contextual

Figure 1 | Graphically clustered NIH grants, as rendered from a screenshot of the NIHMaps user interface. NIH awards (here showing grants from 2010; ~80,000 documents) were scored for their overall topic and word similarity, and the resulting document distance calculations were used to seed a graphing algorithm. Grants are represented as dots, color-coded by NIH Institute and are clustered based on shared thematic content. For acronyms and separate views with each Institute highlighted, see the legend for Supplementary Table 1. Labels in black were automatically derived from review assignments of the underlying documents. Labels in red indicate a global structure that was reproducible using multiple different algorithm settings.

Talley et al. Database of NIH grants using machine-learned categories and graphical clustering. Nature Methods. 2011

Modify ModelParameter search

Sensitivity analysisUpdate assumptions

Machine LearningUnsupervised

Semi-supervised

Verify ModelExpert feedback

Validate known facts

Termite | Topic Model VisualizationTopic models aid analysis of text corpora by identifying latent topics based on co-occurring words. Real-world deployments of topic models, however, often require intensive expert verification and model refinement. In this paper we present Termite, a visual analysis tool for assessing topic model quality. Termite uses a tabular layout to promote comparison of terms both within and across latent topics. We contribute a novel saliency measure for selecting relevant terms and a seriation algorithm that both reveals clustering structure and promotes the legibility of related terms.

Saliency MeasureDisplaying informative terms

Frequent (left) vs. salient (right) terms.Our saliency measure ranks tree, context, tasks,

focus, networks above the more frequent but less informative words based, paper, approach,

technique, method. Discriminative terms enable speedier identification: Topic 6 concerns focus+context techniques; this topical composition is

ambiguous when examining the frequent terms.

Seriation Method for Text DataPreservation of reading order & early terminationTerms ordered by frequency (left) vs. our seriation technique (right).Seriation reveals clusters of terms and aids identification of coherent concepts such as Topic 2 (parallel coordinates), Topic 17 (network visualization), Topic 25 (treemaps), and Topic 41 (graph layout). Our term similarity measure embeds word ordering and favors reading order (online communities, social networks, aspect ratio, etc).

Word Similarity MeasureCo-occurrence & collocation likelihood

We define an asymmetric similarity measure to account forco-occurrence and collocation likelihood between all pairs of

words. Collocation defines the probability that a phrase (sequence of words) occurs more often in a corpus than would

be expected by chance, and is an asymmetric measure. For example, “social networks” is a likely phrase; “networks

social” is not. Incorporating collocation favors adjacent words that form meaningful phrases, in the correct reading order.

The Termite SystemMatrix view & drill-down by topicWhen a topic is selected in the term-topic matrix (left), the systems visualizes the word frequency distribution relative to the full corpus (middle) and shows the most representa-tive documents (right). The term-topic matrix shows term distributions for all latent topics. Unlike lists of per-topic words (the current standard practice), matrices support comparison across both topics and terms. We use circular area to encode term probabilities. Texts typically exhibit long tails of low probability words. Area has a higher dynamic range than length encodings (quadratic vs. linear scaling) and curvature enables perception of area even when circles overlap.

Based on Termite: Visualization Techniques for Assessing Textual Topic Models by Jason Chuang, Christopher D. Manning, Jeffrey Heer. In the proceedings of AVI 2012: International Working Conference on Advanced Visual Interfaces. Capri Islands, Italy. May, 2012.

Computer Forum Poster - Stanford Computer Forum

Documents

Computer Forum Poster - Stanford Computer Forum