
A Latent Variable Model for Geographic Lexical Variation

Jacob Eisenstein  Brendan O’Connor  Noah A. Smith  Eric P. Xing
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213, USA

{jacobeis,brendano,nasmith,epxing}@cs.cmu.edu

Abstract

The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional distinctions. Applied to a new dataset of geotagged microblogs, our model recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency. The model also enables prediction of an author’s geographic location from raw text, outperforming both text regression and supervised topic models.

1 Introduction

Sociolinguistics and dialectology study how language varies across social and regional contexts. Quantitative research in these fields generally proceeds by counting the frequency of a handful of previously-identified linguistic variables: pairs of phonological, lexical, or morphosyntactic features that are semantically equivalent, but whose frequency depends on social, geographical, or other factors (Paolillo, 2002; Chambers, 2009). It is left to the experimenter to determine which variables will be considered, and there is no obvious procedure for drawing inferences from the distribution of multiple variables. In this paper, we present a method for identifying geographically-aligned lexical variation directly from raw text. Our approach takes the form of a probabilistic graphical model capable of identifying both geographically-salient terms and coherent linguistic communities.

One challenge in the study of lexical variation is that term frequencies are influenced by a variety of factors, such as the topic of discourse. We address this issue by adding latent variables that allow us to model topical variation explicitly. We hypothesize that geography and topic interact, as “pure” topical lexical distributions are corrupted by geographical factors; for example, a sports-related topic will be rendered differently in New York and California. Each author is imbued with a latent “region” indicator, which both selects the regional variant of each topic, and generates the author’s observed geographical location. The regional corruption of topics is modeled through a cascade of logistic normal priors—a general modeling approach which we call cascading topic models. The resulting system has multiple capabilities, including: (i) analyzing lexical variation by both topic and geography; (ii) segmenting geographical space into coherent linguistic communities; (iii) predicting author location based on text alone.

This research is only possible due to the rapid growth of social media. Our dataset is derived from the microblogging website Twitter,1 which permits users to post short messages to the public. Many users of Twitter also supply exact geographical coordinates from GPS-enabled devices (e.g., mobile phones),2 yielding geotagged text data. Text in computer-mediated communication is often more vernacular (Tagliamonte and Denis, 2008), and as such it is more likely to reveal the influence of geographic factors than text written in a more formal genre, such as news text (Labov, 1966).

We evaluate our approach both qualitatively and quantitatively.

1 http://www.twitter.com
2 User profiles also contain self-reported location names, but we do not use that information in this work.


We investigate the topics and regions that the model obtains, showing both common-sense results (place names and sports teams are grouped appropriately), as well as less-obvious insights about slang. Quantitatively, we apply our model to predict the location of unlabeled authors, using text alone. On this task, our model outperforms several alternatives, including both discriminative text regression and related latent-variable approaches.

2 Data

The main dataset in this research is gathered from the microblog website Twitter, via its official API. We use an archive of messages collected over the first week of March 2010 from the “Gardenhose” sample stream,3 which then consisted of 15% of all public messages, totaling millions per day. We aggressively filter this stream, using only messages that are tagged with physical (latitude, longitude) coordinate pairs from a mobile client, and whose authors wrote at least 20 messages over this period. We also filter to include only authors who follow fewer than 1,000 other people, and have fewer than 1,000 followers. Kwak et al. (2010) find dramatic shifts in behavior among users with social graph connectivity outside of that range; such users may be marketers, celebrities with professional publicists, news media sources, etc. We also remove messages containing URLs to eliminate bots posting information such as advertising or weather conditions. For interpretability, we restrict our attention to authors inside a bounding box around the contiguous U.S. states, yielding a final sample of about 9,500 users and 380,000 messages, totaling 4.7 million word tokens. We have made this dataset available online.4

Informal text from mobile phones is challenging to tokenize; we adapt a publicly available tokenizer5 originally developed for Twitter (O’Connor et al., 2010), which preserves emoticons and blocks of punctuation and other symbols as tokens. For each user’s Twitter feed, we combine all messages into a single “document.” We remove word types that appear in fewer than 40 feeds, yielding a vocabulary of 5,216 words. Of these, 1,332 do not appear in the English, French, or Spanish dictionaries of the spell-checking program aspell.

3 http://dev.twitter.com/pages/streaming_api
4 http://www.ark.cs.cmu.edu/GeoTwitter
5 http://tweetmotif.com

Every message is tagged with a location, but most messages from a single individual tend to come from nearby locations (as they go about their day); for modeling purposes we use only a single geographic location for each author, simply taking the location of the first message in the sample.

The authors in our dataset are fairly heavy Twitter users, posting an average of 40 messages per day (although we see only 15% of this total). We have little information about their demographics, though from the text it seems likely that this user set skews towards teens and young adults. The dataset covers each of the 48 contiguous United States and the District of Columbia.

3 Model

We develop a model that incorporates two sources of lexical variation: topic and geographical region. We treat the text and geographic locations as outputs from a generative process that incorporates both topics and regions as latent variables.6 During inference, we seek to recover the topics and regions that best explain the observed data.

At the base level of the model are “pure” topics (such as “sports”, “weather”, or “slang”); these topics are rendered differently in each region. We call this general modeling approach a cascading topic model; we describe it first in general terms before moving to the specific application to geographical variation.

3.1 Cascading Topic Models

Cascading topic models generate text from a chain of random variables. Each element in the chain defines a distribution over words, and acts as the mean of the distribution over the subsequent element in the chain. Thus, each element in the chain can be thought of as introducing some additional corruption. All words are drawn from the final distribution in the chain.

At the beginning of the chain are the priors, followed by unadulterated base topics, which may then be corrupted by other factors (such as geography or time). For example, consider a base “food” topic that emphasizes words like dinner and delicious; the corrupted “food-California” topic would place weight on these words, but might place extra emphasis on other words like sprouts.

6 The region could be observed by using a predefined geographical decomposition, e.g., political boundaries. However, such regions may not correspond well to linguistic variation.

The path through the cascade is determined by a set of indexing variables, which may be hidden or observed. As in standard latent Dirichlet allocation (Blei et al., 2003), the base topics are selected by a per-token hidden variable z. In the geographical topic model, the next level corresponds to regions, which are selected by a per-author latent variable r.

Formally, we draw each level of the cascade from a normal distribution centered on the previous level; the final multinomial distribution over words is obtained by exponentiating and normalizing. To ensure tractable inference, we assume that all covariance matrices are uniform diagonal, i.e., aI with a > 0; this means we do not model interactions between words.
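To make the cascade concrete, the following sketch draws a two-level chain, a base topic and one corrupted variant, and converts the final level into word probabilities. This is hypothetical NumPy code with illustrative dimensions, not the implementation used in this paper:

    import numpy as np

    rng = np.random.default_rng(0)
    W = 1000                               # vocabulary size (illustrative)

    # Level 1: base topic drawn around a prior mean a, with variance b^2 = 1.
    a = np.zeros(W)
    mu = rng.normal(a, 1.0)                # mu ~ N(a, b^2 I)

    # Level 2: corrupted variant centered on the base topic.
    sigma2 = 0.01                          # uniform diagonal variance (placeholder value)
    eta = rng.normal(mu, np.sqrt(sigma2))  # eta ~ N(mu, sigma^2 I)

    # Exponentiate and normalize; all words are drawn from this final level.
    beta = np.exp(eta) / np.exp(eta).sum()
    tokens = rng.choice(W, size=10, p=beta)

Because each level is a small normal perturbation of the previous one, the final word distribution stays close to the base topic except where the corruption re-weights specific terms.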

3.2 The Geographic Topic Model

The application of cascading topic models to geographical variation is straightforward. Each document corresponds to the entire Twitter feed of a given author during the time period covered by our corpus. For each author, the latent variable r corresponds to the geographical region of the author, which is not observed. As described above, r selects a corrupted version of each topic: the kth base topic has mean µ_k, with uniform diagonal covariance σ²_k; for region j, we can draw the regionally-corrupted topic from the normal distribution, η_jk ∼ N(µ_k, σ²_k I).

Because η is normally-distributed, it lies not in the simplex but in R^W. We deterministically compute multinomial parameters β by exponentiating and normalizing: β_jk = exp(η_jk) / Σ_i^W exp(η_jk^(i)). This normalization could introduce identifiability problems, as there are multiple settings for η that maximize P(w|η) (Blei and Lafferty, 2006a). However, this difficulty is obviated by the priors: given µ and σ², there is only a single η that maximizes P(w|η)P(η|µ, σ²); similarly, only a single µ maximizes P(η|µ)P(µ|a, b²).

The observed latitude and longitude, denoted y, are normally distributed, conditioned on the region, with mean ν_r and precision matrix Λ_r indexed by the region r. The region index r is itself drawn from a single shared multinomial ϑ. The model is shown as a plate diagram in Figure 1.

Given a vocabulary size W, the generative story is as follows:

• Generate base topics: for each topic k < K,
  – Draw the base topic from a normal distribution with uniform diagonal covariance: µ_k ∼ N(a, b²I).
  – Draw the regional variance from a Gamma distribution: σ²_k ∼ G(c, d).
  – Generate regional variants: for each region j < J,
    ∗ Draw the region-topic η_jk from a normal distribution with uniform diagonal covariance: η_jk ∼ N(µ_k, σ²_k I).
    ∗ Convert η_jk into a multinomial distribution over words by exponentiating and normalizing: β_jk = exp(η_jk) / Σ_i^W exp(η_jk^(i)), where the denominator sums over the vocabulary.

• Generate regions: for each region j < J,
  – Draw the spatial mean ν_j from a normal distribution.
  – Draw the precision matrix Λ_j from a Wishart distribution.

• Draw the distribution over regions ϑ from a symmetric Dirichlet prior, ϑ ∼ Dir(α_ϑ 1).

• Generate text and locations: for each document d,
  – Draw topic proportions from a symmetric Dirichlet prior, θ ∼ Dir(α1).
  – Draw the region r from the multinomial distribution ϑ.
  – Draw the location y from the bivariate Gaussian, y ∼ N(ν_r, Λ_r).
  – For each word token,
    ∗ Draw the topic indicator z ∼ θ.
    ∗ Draw the word token w ∼ β_rz.
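The generative story can be exercised end to end with a forward sampler. The following is a minimal sketch in NumPy; all dimensions and hyperparameter values are illustrative assumptions rather than the settings used in our experiments, and the Gamma draw assumes a shape/rate convention for G(c, d):

    import numpy as np

    rng = np.random.default_rng(0)
    W, K, J, D = 1000, 10, 13, 50      # vocabulary, topics, regions, documents (illustrative)
    a = np.zeros(W)
    b2, c, d_rate = 1.0, 2.0, 200.0
    alpha, alpha_reg = 0.5, 1.0

    # Base topics and their regional variants.
    mu = rng.normal(a, np.sqrt(b2), size=(K, W))           # mu_k ~ N(a, b^2 I)
    sigma2 = rng.gamma(c, 1.0 / d_rate, size=K)            # sigma_k^2 ~ G(c, d)
    eta = np.stack([rng.normal(mu[k], np.sqrt(sigma2[k]), size=(J, W))
                    for k in range(K)], axis=1)            # eta_jk ~ N(mu_k, sigma_k^2 I)
    beta = np.exp(eta) / np.exp(eta).sum(-1, keepdims=True)

    # Regions: spatial parameters and the shared region distribution.
    nu = rng.normal(0.0, 10.0, size=(J, 2))                # spatial means
    Lambda = np.stack([np.eye(2)] * J)                     # precisions (Wishart draw replaced by identity for brevity)
    vartheta = rng.dirichlet(alpha_reg * np.ones(J))

    docs, locations = [], []
    for _ in range(D):
        theta = rng.dirichlet(alpha * np.ones(K))          # topic proportions
        r = rng.choice(J, p=vartheta)                      # latent region
        y = rng.multivariate_normal(nu[r], np.linalg.inv(Lambda[r]))
        z = rng.choice(K, size=100, p=theta)               # per-token topics
        w = np.array([rng.choice(W, p=beta[r, k]) for k in z])
        docs.append(w); locations.append(y)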

4 Inference

We apply mean-field variational inference: a fully-factored variational distribution Q is chosen to minimize the Kullback-Leibler divergence from the true distribution. Mean-field variational inference with conjugate priors is described in detail elsewhere (Bishop, 2006; Wainwright and Jordan, 2008); we restrict our focus to the issues that are unique to the geographic topic model.

[Figure 1 omitted: plate diagram for the geographic topic model.]

µ_k: log of base topic k’s distribution over word types
σ²_k: variance parameter for regional variants of topic k
η_jk: region j’s variant of base topic µ_k
θ_d: author d’s topic proportions
r_d: author d’s latent region
y_d: author d’s observed GPS location
ν_j: region j’s spatial center
Λ_j: region j’s spatial precision
z_n: token n’s topic assignment
w_n: token n’s observed word type
α: global prior over author-topic proportions
ϑ: global prior over region classes

Figure 1: Plate diagram for the geographic topic model, with a table of all random variables. Priors (besides α) are omitted for clarity, and the document indices on z and w are implicit.

We place variational distributions over all latent variables of interest: θ, z, r, ϑ, η, µ, σ², ν, and Λ, updating each of these distributions in turn, until convergence. The variational distributions over θ and ϑ are Dirichlet, and have closed-form updates: each can be set to the sum of the expected counts, plus a term from the prior (Blei et al., 2003). The variational distributions q(z) and q(r) are categorical, and can be set proportional to the expected joint likelihood—to set q(z) we marginalize over r, and vice versa.7 The updates for the multivariate Gaussian spatial parameters ν and Λ are described by Penny (2001).

4.1 Regional Word Distributions

The variational region-topic distribution over η_jk is normal, with uniform diagonal covariance for tractability. Throughout we will write ⟨x⟩ to indicate the expectation of x under the variational distribution Q. Thus, the vector mean of the distribution q(η_jk) is written ⟨η_jk⟩, while the variance (uniform across i) of q(η_jk) is written V(η_jk).

To update the mean parameter ⟨η_jk⟩, we maximize the contribution to the variational bound L from the relevant terms:

L[⟨η_jk^(i)⟩] = ⟨log p(w | β, z, r)⟩ + ⟨log p(η_jk^(i) | µ_k^(i), σ²_k)⟩,   (1)

7 Thanks to the naïve mean field assumption, we can marginalize over z by first decomposing across all N_d words and then summing over q(z).

with the first term representing the likelihood of the observed words (recall that β is computed deterministically from η) and the second term corresponding to the prior. The likelihood term requires the expectation ⟨log β⟩, but this is somewhat complicated by the normalizer Σ_i^W exp(η^(i)), which sums over all terms in the vocabulary. As in previous work on logistic normal topic models, we use a Taylor approximation for this term (Blei and Lafferty, 2006a).

The prior on η is normal, so the contribution from the second term of the objective (Equation 1) is −½ ⟨σ_k^{-2}⟩ ⟨(η_jk^(i) − µ_k^(i))²⟩. We introduce the following notation for expected counts: N(i, j, k) indicates the expected count of term i in region j and topic k, and N(j, k) = Σ_i N(i, j, k). After some calculus, we can write the gradient ∂L/∂⟨η_jk^(i)⟩ as

N(i, j, k) − N(j, k) ⟨β_jk^(i)⟩ − ⟨σ_k^{-2}⟩ (⟨η_jk^(i)⟩ − ⟨µ_k^(i)⟩),   (2)

which has an intuitive interpretation. The first two terms represent the difference in expected counts for term i under the variational distributions q(z, r) and q(z, r, β): this difference goes to zero when β_jk^(i) perfectly matches N(i, j, k)/N(j, k). The third term penalizes η_jk^(i) for deviating from its prior µ_k^(i), but this penalty is proportional to the expected inverse variance ⟨σ_k^{-2}⟩. We apply gradient ascent to maximize the objective L. A similar set of calculations gives the gradient for the variance of η; these are described in a forthcoming appendix.
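Once the expected counts are assembled, the gradient in Equation 2 is a few array operations. A hypothetical NumPy rendering (variable names and shapes are our own; ⟨β⟩ is approximated by the softmax of the variational means, in the spirit of the Taylor approximation above):

    import numpy as np

    def eta_gradient(N_ijk, eta_mean, mu_mean, inv_sigma2):
        """Gradient of the bound with respect to <eta_jk> (Equation 2).

        N_ijk:      (J, K, W) expected count of each term per region and topic
        eta_mean:   (J, K, W) variational means <eta_jk>
        mu_mean:    (K, W)    variational means <mu_k>
        inv_sigma2: (K,)      expected inverse variances <sigma_k^{-2}>
        """
        N_jk = N_ijk.sum(axis=-1, keepdims=True)        # N(j, k)
        beta = np.exp(eta_mean)
        beta /= beta.sum(axis=-1, keepdims=True)        # approximate <beta_jk>
        prior_pull = inv_sigma2[None, :, None] * (eta_mean - mu_mean[None])
        return N_ijk - N_jk * beta - prior_pull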


4.2 Base Topics

The base topic parameters are µ_k and σ²_k; in the variational distribution, q(µ_k) is normally distributed and q(σ²_k) is Gamma distributed. Note that µ_k and σ²_k affect only the regional word distributions η_jk. An advantage of the logistic normal is that the variational parameters over µ_k are available in closed form:

⟨µ_k^(i)⟩ = ( b² Σ_j^J ⟨η_jk^(i)⟩ + ⟨σ²_k⟩ a^(i) ) / ( b² J + ⟨σ²_k⟩ )

V(µ_k) = ( b^{-2} + J ⟨σ_k^{-2}⟩ )^{-1},

where J indicates the number of regions. The expectation of the base topic µ incorporates the prior and the average of the generated region-topics—these two components are weighted respectively by the expected variance of the region-topics ⟨σ²_k⟩ and the prior topical variance b². The posterior variance V(µ) is a harmonic combination of the prior variance b² and the expected variance of the region topics.

The variational distribution over the region-topic variance σ²_k has Gamma parameters. These parameters cannot be updated in closed form, so gradient optimization is again required. The derivation of these updates is more involved, and is left for a forthcoming appendix.
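These two updates transcribe directly into code; a hypothetical NumPy version (shapes as in the gradient sketch above):

    import numpy as np

    def update_mu(eta_mean, sigma2_mean, inv_sigma2_mean, a, b2):
        """Closed-form variational update for base topic means and variances.

        eta_mean:        (J, K, W) variational means <eta_jk>
        sigma2_mean:     (K,)      expected variances <sigma_k^2>
        inv_sigma2_mean: (K,)      expected inverse variances <sigma_k^{-2}>
        a:               (W,)      prior mean; b2 is the prior variance b^2
        """
        J = eta_mean.shape[0]
        mu_mean = (b2 * eta_mean.sum(axis=0) + sigma2_mean[:, None] * a[None]) \
                  / (b2 * J + sigma2_mean[:, None])
        mu_var = 1.0 / (1.0 / b2 + J * inv_sigma2_mean)   # V(mu_k), uniform across terms
        return mu_mean, mu_var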

5 Implementation

Variational scheduling and initialization are important aspects of any hierarchical generative model, and are often under-discussed. In our implementation, the variational updates are scheduled as follows: given expected counts, we iteratively update the variational parameters on the region-topics η and the base topics µ, until convergence. We then update the geographical parameters ν and Λ, as well as the distribution over regions ϑ. Finally, for each document we iteratively update the variational parameters over θ, z, and r until convergence, obtaining expected counts that are used in the next iteration of updates for the topics and their regional variants. We iterate an outer loop over the entire set of updates until convergence.

We initialize the model in a piecewise fashion. First we train a Dirichlet process mixture model on the locations y, using variational inference on the truncated stick-breaking approximation (Blei and Jordan, 2006). This automatically selects the number of regions J, and gives a distribution over each region indicator r_d from geographical information alone. We then run standard latent Dirichlet allocation to obtain estimates of z for each token (ignoring the locations). From this initialization we can compute the first set of expected counts, which are used to obtain initial estimates of all parameters needed to begin variational inference in the full model.
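A rough equivalent of this piecewise initialization can be assembled from off-the-shelf components; the sketch below uses scikit-learn as a stand-in (hypothetical code; our implementation uses its own variational routines):

    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture
    from sklearn.decomposition import LatentDirichletAllocation

    locations = np.random.rand(200, 2) * 10        # (D, 2) author coordinates (toy data)
    X = np.random.randint(0, 3, size=(200, 500))   # (D, W) term counts (toy data)

    # Truncated Dirichlet process mixture over locations: components with
    # negligible weight are effectively pruned, selecting J automatically.
    dp = BayesianGaussianMixture(
        n_components=20,
        weight_concentration_prior_type="dirichlet_process").fit(locations)
    q_r = dp.predict_proba(locations)              # initial region distributions

    # Standard LDA over the text, ignoring locations, for initial topics.
    lda = LatentDirichletAllocation(n_components=10).fit(X)
    q_z = lda.transform(X)                         # per-document topic proportions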

The prior a is the expected mean of each topic µ; for each term i, we set a^(i) = log N(i) − log N, where N(i) is the total count of i in the corpus and N = Σ_i N(i). The variance prior b² is set to 1, and the prior on σ² is the Gamma distribution G(2, 200), encouraging minimal deviation from the base topics. The symmetric Dirichlet prior on θ is set to ½, and the symmetric Dirichlet parameter on ϑ is updated from weak hyperpriors (Minka, 2003). Finally, the geographical model takes priors that are linked to the data: for each region, the mean is very weakly encouraged to be near the overall mean, and the covariance prior is set by the average covariance of clusters obtained by running K-means.
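The prior mean is just the corpus log-frequency vector; as a one-line illustration (toy counts):

    import numpy as np

    counts = np.array([120.0, 45.0, 3.0, 900.0])   # N(i): corpus count of each term (toy values)
    a = np.log(counts) - np.log(counts.sum())      # a^(i) = log N(i) - log N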

6 Evaluation

For a quantitative evaluation of the estimated relationship between text and geography, we assess our model’s ability to predict the geographic location of unlabeled authors based on their text alone.8 This task may also be practically relevant as a step toward applications for recommending local businesses or social connections. A randomly-chosen 60% of authors are used for training, 20% for development, and the remaining 20% for final evaluation.

6.1 Systems

We compare several approaches for predicting author location; we divide these into latent variable generative models and discriminative approaches.

8 Alternatively, one might evaluate the attributed regional memberships of the words themselves. While the Dictionary of American Regional English (Cassidy and Hall, 1985) attempts a comprehensive list of all regionally-affiliated terms, it is based on interviews conducted from 1965–1970, and the final volume (covering Si–Z) is not yet complete.


6.1.1 Latent Variable Models

Geographic Topic Model This is the full version of our system, as described in this paper. To predict the unseen location y_d, we iterate until convergence on the variational updates for the hidden topics z_d, the topic proportions θ_d, and the region r_d. From r_d, the location can be estimated as

y_d = argmax_y Σ_j^J p(y | ν_j, Λ_j) q(r_d = j).

The development set is used to tune the number of topics and to select the best of multiple random initializations.
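The mode of this Gaussian mixture has no closed form; one simple approximation (our own simplification, not necessarily the procedure used in our experiments) scores each region center under the full mixture and returns the best:

    import numpy as np
    from scipy.stats import multivariate_normal

    def predict_location(q_r, nu, Lambda):
        """Approximate argmax_y sum_j p(y | nu_j, Lambda_j) q(r_d = j).

        q_r: (J,) region posterior; nu: (J, 2) means; Lambda: (J, 2, 2) precisions.
        """
        covs = [np.linalg.inv(L) for L in Lambda]
        def mixture_density(y):
            return sum(w * multivariate_normal.pdf(y, m, c)
                       for w, m, c in zip(q_r, nu, covs))
        scores = [mixture_density(m) for m in nu]
        return nu[int(np.argmax(scores))]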

Mixture of Unigrams A core premise of our approach is that modeling topical variation will improve our ability to understand geographical variation. We test this idea by fixing K = 1, running our system with only a single topic. This is equivalent to a Bayesian mixture of unigrams in which each author is assigned a single, regional unigram language model that generates all of his or her text. The development set is used to select the best of multiple random initializations.

Supervised Latent Dirichlet Allocation In a more subtle version of the mixture-of-unigrams model, we model each author as an admixture of regions. Thus, the latent variable attached to each author is no longer an index, but rather a vector on the simplex. This model is equivalent to supervised latent Dirichlet allocation (Blei and McAuliffe, 2007): each topic is associated with equivariant Gaussian distributions over the latitude and longitude, and these topics must explain both the text and the observed geographical locations. For unlabeled authors, we estimate latitude and longitude by estimating the topic proportions and then applying the learned geographical distributions. This is a linear prediction

f(z_d; a) = (z_d^T a^lat, z_d^T a^lon)

for an author’s topic proportions z_d and topic-geography weights a ∈ R^2K.

6.1.2 Baseline Approaches

Text Regression We perform linear regression to discriminatively learn the relationship between words and locations. Using term frequency features x_d for each author, we predict locations with word-geography weights a ∈ R^2W:

f(x_d; a) = (x_d^T a^lat, x_d^T a^lon)

Weights are trained to minimize the sum of squared Euclidean distances, subject to L1 regularization:

Σ_d (x_d^T a^lat − y_d^lat)² + (x_d^T a^lon − y_d^lon)² + λ^lat ||a^lat||_1 + λ^lon ||a^lon||_1

The minimization problem decouples into two separate latitude and longitude models, which we fit using the glmnet elastic net regularized regression package (Friedman et al., 2010), which obtained good results on other text-based prediction tasks (Joshi et al., 2010). Regularization parameters were tuned on the development set. The L1 penalty outperformed L2 and mixtures of L1 and L2.
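Since glmnet is an R package, a Python sketch of the same decoupled L1 regression (using scikit-learn's Lasso as a stand-in; its coordinate-descent path and tuning differ from glmnet's) looks like:

    import numpy as np
    from sklearn.linear_model import Lasso

    X = np.random.rand(100, 500)        # (D, W) term-frequency features (toy data)
    Y = np.random.rand(100, 2) * 10     # (D, 2) latitude/longitude (toy data)

    # The squared-error objective decouples into one L1 model per coordinate.
    lat_model = Lasso(alpha=0.1).fit(X, Y[:, 0])
    lon_model = Lasso(alpha=0.1).fit(X, Y[:, 1])
    pred = np.column_stack([lat_model.predict(X), lon_model.predict(X)])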

Note that for both word-level linear regression here, and the topic-level linear regression in SLDA, the choice of squared Euclidean distance dovetails with our use of spatial Gaussian likelihoods in the geographic topic model, since optimizing a is equivalent to maximum likelihood estimation under the assumption that locations are drawn from equivariant circular Gaussians centered around each linear prediction f(x_d; a). We experimented with decorrelating the location dimensions by projecting y_d into the principal component space, but this did not help text regression.

K-Nearest Neighbors Linear regression is a poor model for the multimodal density of human populations. As an alternative baseline, we applied supervised K-nearest neighbors to predict the location y_d as the average of the positions of the K most similar authors in the training set. We computed term-frequency inverse-document frequency features and applied cosine similarity over their first 30 principal components to find the neighbors. The choices of principal components, IDF weighting, and neighborhood size K = 20 were tuned on the development set.
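A sketch of this baseline (hypothetical scikit-learn code; the exact feature pipeline in our experiments may differ in details):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.metrics.pairwise import cosine_similarity

    def knn_predict(tfidf_train, loc_train, tfidf_test, k=20, n_components=30):
        """Predict each test author's location as the mean location of the
        k training authors most cosine-similar in a 30-dim PCA space."""
        pca = PCA(n_components=n_components).fit(tfidf_train)
        train_z = pca.transform(tfidf_train)
        test_z = pca.transform(tfidf_test)
        sims = cosine_similarity(test_z, train_z)
        neighbors = np.argsort(-sims, axis=1)[:, :k]
        return loc_train[neighbors].mean(axis=1)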

6.2 Metrics

Our principal error metrics are the mean and median distance between the predicted and true location in kilometers.9 Because the distance error may be difficult to interpret, we also report accuracy of classification by state and by region of the United States.

9 For convenience, model training and prediction use latitude and longitude as an unprojected 2D Euclidean space. However, properly measuring the physical distance between points on the Earth’s surface requires computing or approximating the great circle distance; we use the Haversine formula (Sinnott, 1984). For the continental U.S., the relationship between degrees and kilometers is nearly linear, but extending the model to a continental scale would require a more sophisticated approach.
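The Haversine great-circle distance used in this evaluation is standard; for reference:

    import numpy as np

    def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
        """Great-circle distance between two (lat, lon) points in kilometers."""
        lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
        h = (np.sin((lat2 - lat1) / 2) ** 2
             + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
        return 2 * radius_km * np.arcsin(np.sqrt(h))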


System                   Mean Dist. (km)   Median Dist. (km)   Region (4-way, %)   State (49-way, %)
Geographic topic model         900                494                  58                  24
Mixture of unigrams            947                644                  53                  19
Supervised LDA                1055                728                  39                   4
Text regression                948                712                  41                   4
K-nearest neighbors           1077                853                  37                   2
Mean location                 1148               1018                   –                   –
Most common class                –                  –                  37                  27

Table 1: Location prediction results; lower scores are better on the regression task, higher scores are better on the classification task. Distances are in kilometers. Mean location and most common class are computed from the test set. Both the geographic topic model and supervised LDA use the best number of topics from the development set (10 and 5, respectively).

Our data includes the 48 contiguous states plus the District of Columbia; the U.S. Census Bureau divides these states into four regions: West, Midwest, Northeast, and South.10 Note that while major population centers straddle several state lines, most region boundaries are far from the largest cities, resulting in a clearer analysis.

6.3 Results

As shown in Table 1, the geographic topic model achieves the strongest performance on all metrics. All differences in performance between systems are statistically significant (p < .01) using the Wilcoxon-Mann-Whitney test for regression error and the χ² test for classification accuracy. Figure 2 shows how performance changes as the number of topics varies.

Note that the geographic topic model and the mixture of unigrams use identical code and parametrization – the only difference is that the geographic topic model accounts for topical variation, while the mixture of unigrams sets K = 1. These results validate our basic premise that it is important to model the interaction between topical and geographical variation.

Text regression and supervised LDA perform especially poorly on the classification metric. Both methods make predictions that are averaged across each word in the document: in text regression, each word is directly multiplied by a feature weight; in supervised LDA the word is associated with a latent topic first, and then multiplied by a weight. For these models, all words exert an influence on the predicted location, so uninformative words will draw the prediction towards the center of the map. This yields reasonable distance errors but poor classification accuracy. We had hoped that K-nearest neighbors would be a better fit for this metric, but its performance is poor at all values of K. Of course it is always possible to optimize classification accuracy directly, but such an approach would be incapable of predicting the exact geographical location, which is the focus of our evaluation (given that the desired geographical partition is unknown). Note that the geographic topic model is also not trained to optimize classification accuracy.

10 http://www.census.gov/geo/www/us_regdiv.pdf

[Figure 2 omitted: plot of median regression error (km) versus the number of topics, comparing the geographic topic model, supervised LDA, and the mean-location baseline.]

Figure 2: The effect of varying the number of topics on the median regression error (lower is better).



“basketball”
  base topic: PISTONS KOBE LAKERS game DUKE NBA CAVS STUCKEY JETS KNICKS
  Boston+: CELTICS victory BOSTON CHARLOTTE
  N. California+: THUNDER KINGS GIANTS
  New York+: NETS KNICKS
  Los Angeles+: #KOBE #LAKERS AUSTIN
  Lake Erie+: CAVS CLEVELAND OHIO BUCKS od COLUMBUS

“popular music”
  base topic: album music beats artist video #LAKERS ITUNES tour produced vol
  Boston+: playing daughter PEARL alive war comp
  N. California+: pimp trees clap SIMON dl
  Los Angeles+: #LAKERS load HOLLYWOOD imm MICKEY TUPAC
  Lake Erie+: premiere prod joint TORONTO onto designer CANADA village

“daily life”
  base topic: tonight shop weekend getting going chilling ready discount waiting iam
  Boston+: BOSTON
  N. California+: mountain seee 6am OAKLAND
  New York+: BRONX iam cab
  Los Angeles+: omw tacos hr HOLLYWOOD
  Lake Erie+: burr stink CHIPOTLE tipsy

“emoticons”
  base topic: :) haha :d :( ;) :p xd :/ hahaha hahah
  Boston+: ;p gna loveee
  N. California+: pues hella koo SAN fckn
  New York+: oww
  Los Angeles+: af papi raining th bomb coo HOLLYWOOD
  Lake Erie+: ;d blvd BIEBER hve OHIO

“chit chat”
  base topic: lol smh jk yea wyd coo ima wassup somethin jp
  Boston+: ese exam suttin sippin
  N. California+: hella flirt hut iono OAKLAND
  New York+: wasssup nm
  Los Angeles+: wyd coo af nada tacos messin fasho bomb
  Lake Erie+: foul WIZ salty excuses lames officer lastnight

Table 2: Example base topics (top line) and regional variants. For the base topics, terms are ranked by log-odds compared to the background distribution. The regional variants show words that are strong compared to both the base topic and the background. Foreign-language words are shown in italics, while terms that are usually in proper nouns are shown in SMALL CAPS. See Table 3 for definitions of slang terms; see Section 7 for more explanation and details on the methodology.

Figure 3: Regional clustering of the training set obtained by one randomly-initialized run of the geographical topic model. Each point represents one author, and each shape/color combination represents the most likely cluster assignment. Ellipses represent the regions’ spatial means and covariances. The same model and coloring are shown in Table 2.


7 Analysis

Our model permits analysis of geographical variation in the context of topics that help to clarify the significance of geographically-salient terms. Table 2 shows a subset of the results of one randomly-initialized run, including five hand-chosen topics (of 50 total) and five regions (of 13, as chosen automatically during initialization). Terms were selected by log-odds comparison. For the base topics we show the ten strongest terms in each topic as compared to the background word distribution. For the regional variants, we show terms that are strong both regionally and topically: specifically, we select terms that are in the top 100 compared to both the background distribution and to the base topic. The names for the topics and regions were chosen by the authors.

Nearly all of the terms in column 1 (“basketball”) refer to sports teams, athletes, and place names—encouragingly, terms tend to appear in the regions where their referents reside. Column 2 contains several proper nouns, mostly referring to popular music figures (including PEARL from the band Pearl Jam).11 Columns 3–5 are more conversational. Spanish-language terms (papi, pues, nada, ese) tend to appear in regions with large Spanish-speaking populations—it is also telling that these terms appear in topics with emoticons and slang abbreviations, which may transcend linguistic barriers. Other terms refer to people or subjects that may be especially relevant in certain regions: tacos appears in the southern California region and cab in the New York region; TUPAC refers to a rap musician from Los Angeles, and WIZ refers to a rap musician from Pittsburgh, not far from the center of the “Lake Erie” region.

A large number of slang terms are found to have strong regional biases, suggesting that slang may depend on geography more than standard English does. The terms af and hella display especially strong regional affinities, appearing in the regional variants of multiple topics (see Table 3 for definitions). Northern and Southern California use variant spellings koo and coo to express the same meaning.

11 This analysis is from an earlier version of our dataset that contained some Twitterbots, including one from a Boston-area radio station. The bots were purged for the evaluation in Section 6, though the numerical results are nearly identical.

af: as fuck (very)
coo: cool
dl: download
fasho: for sure
gna: going to
hella: very
hr: hour
iam: I am
ima: I’m going to
imm: I’m
iono: I don’t know
jk: just kidding
jp: just playing (kidding)
koo: cool
lames: lame (not cool) people
lol: laugh out loud
nm: nothing much
od: overdone (very)
omw: on my way
smh: shake my head
suttin: something
wassup: what’s up
wyd: what are you doing?

Table 3: A glossary of non-standard terms from Table 2. Definitions are obtained by manually inspecting the context in which the terms appear, and by consulting www.urbandictionary.com.

While research in perceptual dialectology does confirm the link of hella to Northern California (Bucholtz et al., 2007), we caution that our findings are merely suggestive, and a more rigorous analysis must be undertaken before making definitive statements about the regional membership of individual terms. We view the geographic topic model as an exploratory tool that may be used to facilitate such investigations.

Figure 3 shows the regional clustering on the training set obtained by one run of the model. Each point represents an author, and the ellipses represent the bivariate Gaussians for each region. There are nine compact regions for major metropolitan areas, two slightly larger regions that encompass Florida and the area around Lake Erie, and two large regions that partition the country roughly into north and south.

8 Related Work

The relationship between language and geography has been a topic of interest to linguists since the nineteenth century (Johnstone, 2010). An early work of particular relevance is Kurath’s (1949) Word Geography of the Eastern United States, in which he conducted interviews and then mapped the occurrence of equivalent word pairs such as stoop and porch. The essence of this approach—identifying variable pairs and measuring their frequencies—remains a dominant methodology in both dialectology (Labov et al., 2006) and sociolinguistics (Tagliamonte, 2006).


Within this paradigm, computational techniques are often applied to post hoc analysis: logistic regression (Sankoff et al., 2005) and mixed-effects models (Johnson, 2009) are used to measure the contribution of individual variables, while hierarchical clustering and multidimensional scaling enable aggregated inference across multiple variables (Nerbonne, 2009). However, in all such work it is assumed that the relevant linguistic variables have already been identified—a time-consuming process involving considerable linguistic expertise. We view our work as complementary to this tradition: we work directly from raw text, identifying both the relevant features and coherent linguistic communities.

An active recent literature concerns geotagged information on the web, such as search queries (Backstrom et al., 2008) and tagged images (Crandall et al., 2009). This research identifies the geographic distribution of individual queries and tags, but does not attempt to induce any structural organization of either the text or geographical space, which is the focus of our research. More relevant is the work of Mei et al. (2006), in which the distribution over latent topics in blog posts is conditioned on the geographical location of the author. This is somewhat similar to the supervised LDA model that we consider, but their approach assumes that a partitioning of geographical space into regions is already given.

Methodologically, our cascading topic model is designed to capture multiple dimensions of variability: topics and geography. Mei et al. (2007) include sentiment as a second dimension in a topic model, using a switching variable so that individual word tokens may be selected from either the topic or the sentiment. However, our hypothesis is that individual word tokens reflect both the topic and the geographical aspect. Sharing this intuition, Paul and Girju (2010) build topic-aspect models for the cross product of topics and aspects. They do not impose any regularity across multiple aspects of the same topic, so this approach may not scale when the number of aspects is large (they consider only two aspects). We address this issue using cascading distributions; when the observed data for a given region-topic pair is low, the model falls back to the base topic. The use of cascading logistic normal distributions in topic models follows earlier work on dynamic topic models (Blei and Lafferty, 2006b; Xing, 2005).

9 Conclusion

This paper presents a model that jointly identifies words with high regional affinity, geographically-coherent linguistic regions, and the relationship between regional and topic variation. The key modeling assumption is that regions and topics interact to shape observed lexical frequencies. We validate this assumption on a prediction task in which our model outperforms strong alternatives that do not distinguish regional and topical variation.

We see this work as a first step towards an unsupervised methodology for modeling linguistic variation using raw text. Indeed, in a study of morphosyntactic variation, Szmrecsanyi (2010) finds that by the most generous measure, geographical factors account for only 33% of the observed variation. Our analysis might well improve if non-geographical factors were considered, including age, race, gender, income, and whether a location is urban or rural. In some regions, estimates of many of these factors may be obtained by cross-referencing geography with demographic data. We hope to explore this possibility in future work.

Acknowledgments

We would like to thank Amr Ahmed, Jonathan Chang, Shay Cohen, William Cohen, Ross Curtis, Miro Dudík, Scott Kiesling, Seyoung Kim, and the anonymous reviewers. This research was enabled by Google’s support of the Worldly Knowledge project at CMU, AFOSR FA9550010247, ONR N0001140910758, NSF CAREER DBI-0546594, NSF IIS-0713379, and an Alfred P. Sloan Fellowship.

References

L. Backstrom, J. Kleinberg, R. Kumar, and J. Novak. 2008. Spatial variation in search engine queries. In Proceedings of WWW.

C. M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

D. M. Blei and M. I. Jordan. 2006. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1:121–144.


D. M. Blei and J. Lafferty. 2006a. Correlated topic models. In NIPS.

D. M. Blei and J. Lafferty. 2006b. Dynamic topic models. In Proceedings of ICML.

D. M. Blei and J. D. McAuliffe. 2007. Supervised topic models. In NIPS.

D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent Dirichlet allocation. JMLR, 3:993–1022.

M. Bucholtz, N. Bermudez, V. Fung, L. Edwards, and R. Vargas. 2007. Hella Nor Cal or totally So Cal? The perceptual dialectology of California. Journal of English Linguistics, 35(4):325–352.

F. G. Cassidy and J. H. Hall. 1985. Dictionary of American Regional English, volume 1. Harvard University Press.

J. Chambers. 2009. Sociolinguistic Theory: Linguistic Variation and its Social Significance. Blackwell.

D. J. Crandall, L. Backstrom, D. Huttenlocher, and J. Kleinberg. 2009. Mapping the world’s photos. In Proceedings of WWW, pages 761–770.

J. Friedman, T. Hastie, and R. Tibshirani. 2010. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1).

D. E. Johnson. 2009. Getting off the GoldVarb standard: Introducing Rbrul for mixed-effects variable rule analysis. Language and Linguistics Compass, 3(1):359–383.

B. Johnstone. 2010. Language and place. In R. Mesthrie and W. Wolfram, editors, Cambridge Handbook of Sociolinguistics. Cambridge University Press.

M. Joshi, D. Das, K. Gimpel, and N. A. Smith. 2010. Movie reviews and revenues: An experiment in text regression. In Proceedings of NAACL-HLT.

H. Kurath. 1949. A Word Geography of the Eastern United States. University of Michigan Press.

H. Kwak, C. Lee, H. Park, and S. Moon. 2010. What is Twitter, a social network or a news media? In Proceedings of WWW.

W. Labov, S. Ash, and C. Boberg. 2006. The Atlas of North American English: Phonetics, Phonology, and Sound Change. Walter de Gruyter.

W. Labov. 1966. The Social Stratification of English in New York City. Center for Applied Linguistics.

Q. Mei, C. Liu, H. Su, and C. X. Zhai. 2006. A probabilistic approach to spatiotemporal theme pattern mining on weblogs. In Proceedings of WWW, page 542.

Q. Mei, X. Ling, M. Wondra, H. Su, and C. X. Zhai. 2007. Topic sentiment mixture: modeling facets and opinions in weblogs. In Proceedings of WWW.

T. P. Minka. 2003. Estimating a Dirichlet distribution. Technical report, Massachusetts Institute of Technology.

J. Nerbonne. 2009. Data-driven dialectology. Language and Linguistics Compass, 3(1).

B. O’Connor, M. Krieger, and D. Ahn. 2010. TweetMotif: Exploratory search and topic summarization for Twitter. In Proceedings of ICWSM.

J. C. Paolillo. 2002. Analyzing Linguistic Variation: Statistical Models and Methods. CSLI Publications.

M. Paul and R. Girju. 2010. A two-dimensional topic-aspect model for discovering multi-faceted topics. In Proceedings of AAAI.

W. D. Penny. 2001. Variational Bayes for d-dimensional Gaussian mixture models. Technical report, University College London.

D. Sankoff, S. A. Tagliamonte, and E. Smith. 2005. Goldvarb X: A variable rule application for Macintosh and Windows. Technical report, Department of Linguistics, University of Toronto.

R. W. Sinnott. 1984. Virtues of the Haversine. Sky and Telescope, 68(2).

B. Szmrecsanyi. 2010. Geography is overrated. In S. Hansen, C. Schwarz, P. Stoeckle, and T. Streck, editors, Dialectological and Folk Dialectological Concepts of Space. Walter de Gruyter.

S. A. Tagliamonte and D. Denis. 2008. Linguistic ruin? LOL! Instant messaging and teen language. American Speech, 83.

S. A. Tagliamonte. 2006. Analysing Sociolinguistic Variation. Cambridge University Press.

M. J. Wainwright and M. I. Jordan. 2008. Graphical Models, Exponential Families, and Variational Inference. Now Publishers.

E. P. Xing. 2005. On topic evolution. Technical Report 05-115, Center for Automated Learning and Discovery, Carnegie Mellon University.