arXiv:1705.02750v1 [cs.CL] 8 May 2017

Density Estimation for Geolocation via Convolutional Mixture Density Network

Hayate Iso, Shoko Wakamiya, Eiji Aramaki
Nara Institute of Science and Technology

{iso.hayate.id3,wakamiya,aramaki}@is.naist.jp

Abstract

Nowadays, geographic information related to Twitter is crucially important for fine-grained applications. However, the amount of geographic information available on Twitter is low, which makes many applications challenging to pursue. Under such circumstances, estimating the location of a tweet is an important goal of this study. Unlike most previous studies, which treat estimation of a pre-defined district as a classification task, this study employs a probability distribution to represent richer information about the tweet: not only its location but also the ambiguity of that location. To realize this modeling, we propose the convolutional mixture density network (CMDN), which uses text data to estimate the mixture model parameters. Experimentally obtained results reveal that CMDN achieved the highest prediction performance among the compared methods for predicting exact coordinates. It also provides a quantitative representation of the location ambiguity of each tweet, which works properly for extracting reliable location estimations.

1 Introduction

Geographic information related to Twitter enriches the availability of data resources. Such information is indispensable for various practical applications such as early earthquake detection (Sakaki et al., 2010), infectious disease dispersion assessment (Broniatowski et al., 2013), and regional user behavior assessment during an election period (Caldarelli et al., 2014). However, application performance depends strongly on the number of geo-tagged tweets, which account for fewer than 0.5% of all tweets (Cheng et al., 2010).

To extend the possibilities of geographic information, a great deal of effort has been devoted to specifying geolocation automatically (Han et al., 2014). These studies are classified roughly into two types: User-level prediction (Cheng et al., 2010; Han et al., 2014; Jurgens et al., 2015; Rahimi et al., 2015) and Message-level prediction (Dredze et al., 2016; Liu and Huang, 2016; Priedhorsky et al., 2014). The former predicts a user's residential area; the latter predicts the place that the user mentioned. This study targets the latter problem, message-level prediction, which involves the following three levels of difficulty.

First, most tweets lack the information needed to identify the true geolocation. In general, many tweets do not include geolocation-identifiable words. Therefore, it is difficult even for humans to identify a geolocation (Figure 1b).

Next, some location names are ambiguous because one word can refer to multiple locations. For example, places called "Portland" exist in several locations worldwide. Similarly, this ambiguity also arises within a single country, as shown in Figure 2. Although additional context words are necessary to identify the exact location, many tweets do not include such clue words. For such tweets, real-valued point estimation is expected to be degraded by regression towards the mean (Stigler, 1997).

Finally, even if a user states a word that represents an exact location, the user is not necessarily there. When the user describes several places in the tweet, contextual comprehension of the tweet is needed to identify the true location (Figure 1c).

In contrast to most studies, this study resolves these issues with a density estimation approach. A salient benefit of density estimation is that it enables comprehension of the uncertainty of the tweet's location, because the uncertainty propagates from the estimated distribution, and it properly handles tweets whose probability is distributed over multiple points.

Figure 1: Original tweet locations with the distributions estimated by our proposed method, CMDN. The intersection of the dotted lines represents the true tweet location; the red diamond represents the location estimated by CMDN; the blue and yellow triangles represent the CNN for regression with ℓ2- and ℓ1-loss. (a) "I got lost at Umeda station.. Here is almost like a jungle..." likelihood: 195.5, distance: 2.15 (km). (b) "I've already slept a lot, but I still feel sleepy..." likelihood: 7.0, distance: 174.4 (km). (c) "I will not return to Fukuoka for the first time in Osaka." likelihood: 75.6, distance: 13.5 (km).

Figure 1 shows each estimated density as a heatmap. When a tweet includes plenty of clues, the estimated distribution is concentrated near the true location (Figure 1a), and vice versa (Figure 1b). Furthermore, the density-based approach can represent multimodal output, whereas the regression-based approach cannot (Figure 1c).

The density-based approach provides additional benefits for practical application. The estimated density attaches an estimation reliability to each tweet as a likelihood value. For reliable estimations, the estimated density provides a high likelihood (Figures 1a and 1c), and vice versa (Figure 1b).

To realize this modeling, we propose the Convolutional Mixture Density Network (CMDN), a method for estimating the geolocation density from text data. CMDN extracts valuable features using a convolutional neural network architecture and converts these features into mixture density parameters. Our experimentally obtained results reveal not only high prediction performance but also that the reliability measure works properly for filtering out uncertain estimations.

2 Related work

Social media geolocation has been undertaken on various platforms such as Facebook (Backstrom et al., 2010), Flickr (Serdyukov et al., 2009), and Wikipedia (Lieberman and Lin, 2009). Among them, Twitter geolocation is the predominant field because of its availability (Han et al., 2014).

Twitter geolocation methods are not confined to text information. Diverse information can facilitate geolocation performance, such as meta-information (time (Dredze et al., 2016) and estimated age and gender (Pavalanathan and Eisenstein, 2015)), social network structures (Jurgens et al., 2015; Rahimi et al., 2015), and user movement (Liu and Huang, 2016). Many studies, however, have attempted user-level geolocation, not message-level. Although user-level geolocation is certainly effective for some applications, message-level geolocation supports fine-grained analyses.

Priedhorsky et al. (2014) attempted density estimation for message-level geolocation by estimating word-independent Gaussian mixture models and then combining them to derive each tweet's density. Although that paper proposed many weight estimation methods, most of them depend strongly on locally distributed words. Our proposed method, CMDN, estimates the geolocation density from the text sequence in an end-to-end manner and considers the marginal context of the tweet via the CNN.

3 Convolutional Neural Network for Regression (Unimodal output)

We start by introducing the convolutional neural network (CNN) (Fukushima, 1980; LeCun et al., 1998) for a regression problem, which directly estimates the real-valued output, as our baseline method. Our regression formulation is almost identical to the CNN for document classification of Kim (2014); we merely remove the softmax layer and replace the loss function.

Figure 2: The places where the word "Sakurajima" was mentioned in the training data, projected on the map. The word "Sakurajima" is mentioned in multiple places because it refers to multiple locations.

We assume that a tweet has length $L$ (padded where necessary) and that $w_i$ represents the word at the $i$-th index. We project the words $w_i$ to vectors $x_i$ through an embedding matrix $W_e$, $x_i = W_e w_i \in \mathbb{R}^d$, where $d$ represents the size of the embeddings. We compose the sentence vector $x_{1:L}$ by concatenating the word vectors $x_i$:

$$x_{1:L} = x_1 \oplus x_2 \oplus \cdots \oplus x_L \in \mathbb{R}^{Ld},$$

where $\oplus$ represents vector concatenation.
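As a concrete illustration, the embedding lookup and concatenation above can be sketched in numpy. All numbers here (vocabulary size, dimensions, word indices) are toy values for illustration, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, L = 10, 4, 5                 # toy vocabulary size, embedding dim, tweet length
W_e = rng.normal(size=(V, d))      # embedding matrix W_e

word_ids = [3, 1, 4, 1, 5]         # w_1..w_L as vocabulary indices (one-hot implied)
X = W_e[word_ids]                  # x_i = W_e w_i, one row per word
x_1L = X.ravel()                   # x_{1:L} = x_1 (+) x_2 (+) ... (+) x_L
```

Indexing the embedding matrix with the word indices is equivalent to multiplying by one-hot vectors, and `ravel` performs the concatenation into $\mathbb{R}^{Ld}$.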

To extract valuable features from sentences, we apply the filter matrix $W_f$ to every part of the sentence vector with window size $\ell$:

$$W_f = [w_{f,1}, w_{f,2}, \ldots, w_{f,m}]^\top,$$
$$c_{i,j} = \phi(w_{f,j}^\top x_{i:i+\ell-1} + b_{f,j}),$$

where $m$ stands for the number of feature maps and $\phi$ represents the activation function. As described herein, we employ ReLU (Nair and Hinton, 2010) as the activation function.

For each filter's output, we apply 1-max pooling to extract the most probable window feature $c_j$. Then we compose the abstracted feature vector $h$ by concatenating the pooled features $c_j$, as shown below:

$$c_j = \max_{i \in \{1, \ldots, L-\ell+1\}} c_{i,j},$$
$$h = [c_1, c_2, \ldots, c_m]^\top.$$

Finally, we estimate the real-valued output $\hat{y}$ using the abstracted feature vector $h$ as

$$\hat{y} = W_r h + b_r \in \mathbb{R}^q,$$

where $q$ signifies the output dimension, $W_r \in \mathbb{R}^{q \times m}$ denotes the regression weight matrix, and $b_r \in \mathbb{R}^q$ represents the bias vector for regression.

To optimize the regression model, we use two loss functions between the true value $y$ and the estimated value $\hat{y}$. One is the $\ell_2$-loss,

$$\sum_{n=1}^{N} \| y_n - \hat{y}_n \|_2^2,$$

and the other is the $\ell_1$-loss, a loss function robust to outliers,

$$\sum_{n=1}^{N} \| y_n - \hat{y}_n \|_1 = \sum_{n=1}^{N} \sum_{q'=1}^{q} | y_{n,q'} - \hat{y}_{n,q'} |,$$

where $N$ represents the sample size. The $\ell_1$-loss shrinks outlier effects on the estimation more than the $\ell_2$-loss does.
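To make the regression baseline concrete, here is a minimal numpy sketch of the forward pass (convolution over windows, 1-max pooling, linear output) and both losses for a single example. The dimensions, random weights, and target coordinates are hypothetical toy values, not the paper's trained settings (the paper uses 300-dimensional embeddings, windows of 3–5, and 128 filters):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: tweet length L, embedding dim d, window size ell,
# number of feature maps m, output dimension q (q = 2 for coordinates).
L, d, ell, m, q = 12, 8, 3, 4, 2

# Embedded tweet: one row per word vector x_i.
X = rng.normal(size=(L, d))

# Filter matrix W_f: m filters, each spanning a window of ell word vectors.
W_f = rng.normal(size=(m, ell * d))
b_f = np.zeros(m)

relu = lambda z: np.maximum(z, 0.0)

# Convolution: apply every filter to every window x_{i:i+ell-1}.
C = np.stack([relu(W_f @ X[i:i + ell].ravel() + b_f)
              for i in range(L - ell + 1)])      # shape (L - ell + 1, m)

# 1-max pooling over window positions gives the abstracted feature h.
h = C.max(axis=0)                                # shape (m,)

# Linear output layer: real-valued coordinate estimate.
W_r = rng.normal(size=(q, m))
b_r = np.zeros(q)
y_hat = W_r @ h + b_r

# The two training losses compared in the paper, for one example:
y_true = np.array([34.69, 135.50])               # hypothetical coordinates
l2_loss = np.sum((y_true - y_hat) ** 2)
l1_loss = np.sum(np.abs(y_true - y_hat))
```

In training, these per-example losses would be summed over the corpus and minimized by gradient descent.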

4 Convolutional Mixture Density Network (Multimodal output)

In the previous section, we introduced the CNN method for the regression problem. This regression formulation is good for addressing well-defined problems in which input and output have a one-to-one correspondence. Our tweet corpus fundamentally does not fulfill this assumption; therefore, a more flexible model must be used to represent the richer information.

In this section, we propose a novel architecture for text-to-density estimation, the Convolutional Mixture Density Network (CMDN), which is an extension of the Mixture Density Network of Bishop (1994). In contrast to the regression approach, which directly represents the output values $y$, CMDN can accommodate more complex information as a probability distribution. CMDN estimates the parameters of a Gaussian mixture model, $\pi_k, \mu_k, \Sigma_k$, where $k$ indexes the mixture components, using the same abstracted features $h$ as in the CNN regression (Section 3).

4.1 Parameter estimation by neural network

Presuming that the density consists of $K$ components of the multivariate normal distribution $\mathcal{N}(y|\mu_k, \Sigma_k)$, the number of parameters of each $q$-dimensional normal distribution is $p = \frac{q(q+3)}{2}$: $q$ parameters each for the mean $\mu_k$ and for the diagonal values of the covariance matrix $\Sigma_k$, and $\frac{q(q-1)}{2}$ correlation parameters $\rho$ between output dimensions.


For $q = 2$, each component's parameters are represented as

$$\mu_k = \begin{pmatrix} \mu_{k,1} \\ \mu_{k,2} \end{pmatrix}, \qquad \Sigma_k = \begin{pmatrix} \sigma_{k,1}^2 & \rho_k \sigma_{k,1} \sigma_{k,2} \\ \rho_k \sigma_{k,1} \sigma_{k,2} & \sigma_{k,2}^2 \end{pmatrix}.$$

To estimate these parameters, we first project the hidden layer $h$ into the required parameter space $\theta$ as

$$\theta = W_p h + b_p \in \mathbb{R}^{Kp},$$

where $W_p \in \mathbb{R}^{Kp \times m}$ and $b_p \in \mathbb{R}^{Kp}$ respectively denote the weight matrix and bias vector.

Although $\theta$ has a sufficient number of parameters to represent the $K$-component mixture model, these raw values cannot be inserted directly as the parameters of the multivariate normal distribution $\mathcal{N}(y|\mu_k, \Sigma_k)$ and its mixture weight $\pi_k$.

The mixture weights $\pi_k$ must be positive and must sum to 1. The variance parameters $\sigma$ must be positive real values. The correlation parameter $\rho_k$ must lie in $(-1, 1)$. For this purpose, we transform the real-valued outputs $\theta$ into the valid range for each parameter of the mixture density.

4.2 Parameter conversion

For simplicity, we first decompose $\theta$ into per-component parameters $\theta_k$ as

$$\theta = \theta_1 \oplus \theta_2 \oplus \cdots \oplus \theta_K,$$
$$\theta_k = (\theta_{\pi_k}, \theta_{\mu_{k,1}}, \theta_{\mu_{k,2}}, \theta_{\sigma_{k,1}}, \theta_{\sigma_{k,2}}, \theta_{\rho_k}).$$

To restrict each parameter's range, we convert each raw parameter in $\theta_k$ as

$$\pi_k = \mathrm{softmax}(\theta_{\pi_k}) = \frac{\exp(\theta_{\pi_k})}{\sum_{k'=1}^{K} \exp(\theta_{\pi_{k'}})} \in (0, 1),$$
$$\mu_{k,j} = \theta_{\mu_{k,j}} \in \mathbb{R},$$
$$\sigma_{k,j} = \mathrm{softplus}(\theta_{\sigma_{k,j}}) = \ln\bigl(1 + \exp(\theta_{\sigma_{k,j}})\bigr) \in (0, \infty),$$
$$\rho_k = \mathrm{softsign}(\theta_{\rho_k}) = \frac{\theta_{\rho_k}}{1 + |\theta_{\rho_k}|} \in (-1, 1).$$

The original MDN paper (Bishop, 1994) and its well-known application to handwriting generation (Graves, 2013) used an exponential function to transform the variance parameter $\sigma$ and a hyperbolic tangent for the correlation parameter $\rho$. However, we use softplus (Glorot et al., 2011) for the variance and softsign (Glorot and Bengio, 2010) for the correlation. With the exponential and hyperbolic tangent, gradient values often explode or vanish; replacing the activation functions in the output layer prevents these gradient problems. Our proposed transformation is effective for achieving rapid convergence and stable learning.
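The parameter conversion above can be sketched as follows for $q = 2$. The raw values in `theta` are arbitrary illustrative numbers; the column ordering of $\theta_k$ is an assumption made for the sketch:

```python
import numpy as np

def softplus(z):
    # ln(1 + exp(z)), computed stably; maps R -> (0, inf) for sigma.
    return np.logaddexp(0.0, z)

def softsign(z):
    # z / (1 + |z|); maps R -> (-1, 1) for the correlation rho.
    return z / (1.0 + np.abs(z))

def convert(theta):
    """Map raw network outputs to valid mixture parameters (q = 2).

    theta: array of shape (K, 6), columns assumed ordered as
    (theta_pi, theta_mu1, theta_mu2, theta_sigma1, theta_sigma2, theta_rho).
    """
    t_pi = theta[:, 0]
    pi = np.exp(t_pi - t_pi.max())       # softmax over the K mixture weights
    pi = pi / pi.sum()
    mu = theta[:, 1:3]                   # identity: means are unconstrained
    sigma = softplus(theta[:, 3:5])      # positive standard deviations
    rho = softsign(theta[:, 5])          # correlations in (-1, 1)
    return pi, mu, sigma, rho

# Raw outputs for K = 3 components (arbitrary numbers for illustration).
theta = np.array([[ 0.2,  34.7, 135.5, -1.0, 0.5,  2.0],
                  [-0.3,  33.6, 130.4,  0.0, 0.0, -4.0],
                  [ 1.1,  35.7, 139.7,  2.0, 1.0,  0.0]])
pi, mu, sigma, rho = convert(theta)
```

Whatever real values the network emits, the converted parameters always satisfy the mixture-density constraints, which is what makes end-to-end training stable.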

4.3 Loss function for parameter estimation

To optimize the mixture density model, we use the negative log-likelihood as the training loss:

$$-\sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(y_n | \mu_k, \Sigma_k) \right).$$
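A minimal sketch of this loss for $q = 2$, with the bivariate normal density written out from the $(\sigma_{k,1}, \sigma_{k,2}, \rho_k)$ parameterization above. The mixture parameters and sample points are hypothetical values for illustration:

```python
import numpy as np

def bivariate_normal_pdf(y, mu, sigma, rho):
    """Density of N(y | mu, Sigma) with Sigma built from (sigma_1, sigma_2, rho)."""
    z1 = (y[0] - mu[0]) / sigma[0]
    z2 = (y[1] - mu[1]) / sigma[1]
    det = 1.0 - rho ** 2
    quad = (z1 ** 2 - 2.0 * rho * z1 * z2 + z2 ** 2) / det
    norm = 2.0 * np.pi * sigma[0] * sigma[1] * np.sqrt(det)
    return np.exp(-0.5 * quad) / norm

def nll(Y, pi, mu, sigma, rho):
    """Negative log-likelihood of the K-component mixture over samples Y."""
    total = 0.0
    for y in Y:
        mix = sum(pi[k] * bivariate_normal_pdf(y, mu[k], sigma[k], rho[k])
                  for k in range(len(pi)))
        total -= np.log(mix)
    return total

# Hypothetical two-component mixture over (lat, lon).
pi = np.array([0.6, 0.4])
mu = np.array([[34.7, 135.5], [35.7, 139.7]])
sigma = np.array([[0.5, 0.5], [0.5, 0.5]])
rho = np.array([0.0, 0.0])

loss_near = nll([np.array([34.7, 135.5])], pi, mu, sigma, rho)
loss_far  = nll([np.array([40.0, 141.0])], pi, mu, sigma, rho)
# A sample near a dense component incurs a lower loss than a distant one.
```

Minimizing this loss drives the mixture to place probability mass on the true locations.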

5 Experiments

In this section, we formalize our problem setting and demonstrate the effectiveness of our proposed model.

5.1 Problem setting

This study explores CMDN performance from two perspectives.

Our first experiment predicts the geographic coordinates from which each user tweeted, using only the content of a single tweet. We evaluate the mean and median of the distances, measured using Vincenty's formula (Vincenty, 1975), between the estimated and true geographic coordinates over the whole dataset.

The second experiment filters out unreliable estimations quantitatively using a likelihood-based threshold. Each estimated density assigns a likelihood value to every location. We designate the likelihood value as a reliability indicator for the respective estimated location. Then we remove estimations starting from the lowest likelihood value and recalculate both the mean and median. This indicator correctly filters out unreliable estimations if these statistics decrease monotonically.

Location estimation by the estimated density. In contrast to the regression approach, it is necessary to specify an estimation point from the estimated density of each tweet. In accordance with Bishop (1994), we employ the mode of the estimated density as the estimated location $\hat{y}$. The mode of the probability distribution can be found by numerical optimization, but this incurs overly high costs for scalable estimation. As a simple and scalable approximation for seeking the mode of the estimated density, we restrict the search space to the mean values of the mixture components, as shown below:

$$\hat{y} = \operatorname*{argmax}_{y \in \{\mu_1, \ldots, \mu_K\}} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(y | \mu_k, \Sigma_k).$$

Table 1: Dataset statistics.
# of tweets: 24,633,478
# of users: 276,248
Average # of words: 16.0
# of vocabulary: 351,752
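The restricted mode search can be sketched as follows: evaluate the mixture density at each component mean and return the best-scoring mean. The mixture parameters are hypothetical values; the density helper repeats the $q = 2$ parameterization used in Section 4:

```python
import numpy as np

def pdf2(y, mu, sigma, rho):
    # Bivariate normal density from (sigma_1, sigma_2, rho).
    z1 = (y[0] - mu[0]) / sigma[0]
    z2 = (y[1] - mu[1]) / sigma[1]
    det = 1.0 - rho ** 2
    quad = (z1 ** 2 - 2.0 * rho * z1 * z2 + z2 ** 2) / det
    return np.exp(-0.5 * quad) / (2.0 * np.pi * sigma[0] * sigma[1] * np.sqrt(det))

def approximate_mode(pi, mu, sigma, rho):
    """Restrict the mode search to the K component means (Section 5.1)."""
    scores = [sum(pi[k] * pdf2(m, mu[k], sigma[k], rho[k])
                  for k in range(len(pi)))
              for m in mu]
    return mu[int(np.argmax(scores))]

# Hypothetical mixture: a sharp component and a broad one.
pi = np.array([0.7, 0.3])
mu = np.array([[34.7, 135.5], [43.1, 141.3]])
sigma = np.array([[0.3, 0.3], [1.5, 1.5]])
rho = np.array([0.0, 0.0])
y_hat = approximate_mode(pi, mu, sigma, rho)
```

This costs only $K^2$ density evaluations per tweet instead of a full numerical optimization.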

5.2 Dataset

Our tweet corpus consists of 24,633,478 Japanese tweets posted from July 14, 2011 to July 31, 2012. The corpus statistics are presented in Table 1. We split the corpus randomly into 20M tweets for training, 2M for development, and 2M for testing.

5.3 Comparative models

We compare our proposed model's effectiveness through controlled experiments that replace the model components one by one. We also provide simple baseline performance. The model configurations are presented in Table 2.

Mean: Mean value of the training data locations.
Median: Median value of the training data locations.
Enet: Elastic Net regression (Zou and Hastie, 2005), which consists of ordinary least squares regression with ℓ2- and ℓ1-regularization.
MLP-l2: Multi-layer perceptron with ℓ2-loss (Minsky and Papert, 1969).
MLP-l1: Multi-layer perceptron with ℓ1-loss.
CNN-l2: Convolutional neural network for regression with ℓ2-loss, based on Kim (2014).
CNN-l1: Convolutional neural network for regression with ℓ1-loss.
MDN: Mode value of the Mixture Density Network (Bishop, 1994).
CMDN: Mode value of the Convolutional Mixture Density Network (proposed).

Table 2: Model configurations.
# of mixture components: 50
Embedding dimension: 300
Window sizes: 3, 4, 5
Each filter size: 128
Dropout rate (Srivastava et al., 2014): 0.2
Batch size: 500
Learning rate: 0.0001
Optimization: Adam (Kingma and Ba, 2014)

Table 3: The prediction performance. Lower values indicate better estimation.

Method  | Mean (km) | Median (km)
CMDN    | 159.4     | 10.7
CNN-l1  | 147.5     | 28.2
CNN-l2  | 166.5     | 80.5
MDN     | 251.4     | 92.9
MLP-l1  | 224.0     | 75.0
MLP-l2  | 226.8     | 140.3
Enet    | 197.2     | 144.6
Mean    | 279.4     | 173.9
Median  | 253.7     | 96.1

5.4 Results

5.4.1 Geolocation performance

The experimental geolocation performance results are presented in Table 3.

Overall, our proposed model CMDN provides the lowest median error distance, while CNN-l1 provides the lowest mean error distance; CMDN's mean error distance is close to that of CNN-l1. Both CMDN and CNN-l1 outperform all other comparative models.

Regarding the feature extraction part, the CNN-based models consistently achieved better prediction performance than the vanilla MLP-based models, as measured by both the mean and the median.

5.4.2 Likelihood-based threshold

Figure 3 shows how the likelihood-based threshold affects both the mean and median statistics. Both statistics consistently decrease as the likelihood lower bound increases; they decrease dramatically when the likelihood is between $10^1$ and $10^2$. Results for several likelihood bounds are presented in Figure 4. Although the model produces many incorrect predictions over the whole dataset (blue), the likelihood-based threshold correctly prunes the outlier estimations.

Figure 3: Likelihood-based threshold results with 95% confidence intervals for our proposed model, CMDN. To extract reliable (lower error distance) geolocated tweets, we can increase the lower bound of the likelihood. The standard deviations of both the mean and the median are calculated using bootstrap methods (Efron, 1979).

Figure 4: Output log-distance distributions with several likelihood thresholds (overall, likelihood > 80, likelihood > 160). A smaller log-distance represents a better prediction.
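The filtering procedure itself is simple: keep only tweets whose estimated density assigns a likelihood above the lower bound, then recompute the error statistics over the survivors. The per-tweet likelihoods and distances below are hypothetical numbers for illustration (only the first three mirror the examples in Figure 1):

```python
import numpy as np

# Hypothetical per-tweet results: likelihood of the estimated density at the
# predicted point, and distance (km) to the true location.
likelihoods = np.array([195.5, 7.0, 75.6, 0.3, 160.2, 12.4])
distances   = np.array([2.15, 174.4, 13.5, 890.0, 1.1, 240.7])

def filtered_errors(lower_bound):
    """Keep only estimations whose likelihood exceeds the threshold,
    then report mean and median error over the survivors."""
    kept = distances[likelihoods > lower_bound]
    return kept.mean(), np.median(kept)

mean_all, med_all = filtered_errors(0.0)
mean_80, med_80 = filtered_errors(80.0)
# Raising the lower bound prunes unreliable (typically high-error) estimates.
```

If the likelihood is a valid reliability indicator, both statistics should decrease as the bound rises, which is the behavior reported in Figure 3.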

6 Discussion

Our proposed model provides high accuracy on our experimental data. Moreover, the likelihood-based threshold proves to be a consistent indicator of estimation certainty. In this section, we further explore the properties of CMDN from the viewpoint of the difference between loss functions.

Loss function difference

We compare the CNN-based models with different loss functions: ℓ2- and ℓ1-loss for the regression models CNN-l2 and CNN-l1, and the negative log-likelihood for the mixture density model CMDN. The output distance distributions are shown in Figure 5.

Although the ℓ2-loss shows the worst mean and median performance among the CNN-based models (Table 3), it yields the lowest outlier ratio. Consequently, the ℓ2-loss produces the most conservative estimates for our skewed corpus.

We can infer that the difference between the competitive model CNN-l1 and our proposed model CMDN lies in their estimation aggressiveness. The median of CMDN is remarkably lower than that of CNN-l1, but the mean of CMDN is slightly larger. The reason for this is the difference in the loss functions' behavior on data with multiple candidates. Even though the ℓ1-loss is robust to outliers, its estimation deteriorates when several candidates appear with similar probabilities. In contrast, CMDN can accommodate multiple candidates as multiple mixture components; it assigns appropriate probabilities to the candidates and picks the most probable point as the prediction. In short, CMDN's predictions are more aggressive than CNN-l1's, and consequently CMDN's median error is lower.

An important shortcoming of aggressive prediction is that the error grows large when the estimation fails, and the mean is strongly affected by such outlier estimations. Therefore, CMDN's mean value is higher than CNN-l1's.

However, CMDN can overcome this shortcoming with the likelihood-based threshold, which filters out the outliers first. We therefore conclude that the negative log-likelihood for the mixture density is a better loss than the ℓ2- and ℓ1-losses for regression.

7 Future work

Figure 5: Loss function difference: lower values represent better predictions. Our proposed model, CMDN (blue), tends to make the most aggressive estimations.

This study assessed the performance of our proposed model, CMDN, using text data alone. Although text-only estimation is readily applicable to existing resources, there is still room to improve the prediction performance. The winner of the Twitter Geolocation Prediction Shared Task, Miura et al. (2016), proposed a unified architecture that handles several kinds of metadata, such as the user location, user description, and time zone, for predicting geolocations. CMDN can integrate this information in the same manner.

Furthermore, Liu and Huang (2016) report that the user's home location strongly affects location prediction for a single tweet. For example, a routine tweet is fundamentally unpredictable from text content alone, but if the user's home location is known, it is a valuable clue for evaluating the tweet. As future work, we plan to develop a unified architecture that incorporates user movement information using a recurrent neural network.

In contrast, our objective function might no longer be useful for world-scale geolocation because it approximates spherical coordinates in real coordinate space, and this approximation error tends to grow for larger-scale geolocation inference. We will explore our method's geolocation performance on world-scale datasets such as the W-NUT data (Han et al., 2016).

8 Conclusion

This study clarified the capabilities of the density estimation approach to Twitter geolocation. Our proposed model, CMDN, not only performed with high accuracy on our experimental data; it also extracted reliable geolocated tweets using likelihood-based thresholds. Results show that CMDN requires only the tweet message content to identify its geolocation, obviating the preparation of meta-information. Consequently, CMDN can contribute to extending the fields in which geographic information applications can be used.

References

Lars Backstrom, Eric Sun, and Cameron Marlow. 2010. Find me if you can: Improving geographical prediction with social and spatial proximity. In Proceedings of the 19th International Conference on World Wide Web. ACM, pages 61–70.

Christopher Bishop. 1994. Mixture density networks. Technical report.

David A. Broniatowski, Michael J. Paul, and Mark Dredze. 2013. National and local influenza surveillance through Twitter: An analysis of the 2012–2013 influenza epidemic. PLoS ONE 8(12). https://doi.org/10.1371/journal.pone.0083672.

Guido Caldarelli, Alessandro Chessa, Fabio Pammolli, Gabriele Pompa, Michelangelo Puliga, Massimo Riccaboni, and Gianni Riotta. 2014. A multi-level geographical study of Italian political elections from Twitter data. PLoS ONE 9(5):1–11. https://doi.org/10.1371/journal.pone.0095809.

Zhiyuan Cheng, James Caverlee, and Kyumin Lee. 2010. You are where you tweet: A content-based approach to geo-locating Twitter users. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM '10). ACM, New York, NY, USA, pages 759–768. https://doi.org/10.1145/1871437.1871535.

Mark Dredze, Miles Osborne, and Prabhanjan Kambadur. 2016. Geolocation for Twitter: Timing matters. In North American Chapter of the Association for Computational Linguistics (NAACL).

B. Efron. 1979. Bootstrap methods: Another look at the jackknife. The Annals of Statistics, pages 1–26.

Kunihiko Fukushima. 1980. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36(4):193–202.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, volume 9, pages 249–256.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In AISTATS, volume 15, page 275.

Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Bo Han, Paul Cook, and Timothy Baldwin. 2014. Text-based Twitter user geolocation prediction. Journal of Artificial Intelligence Research 49:451–500. https://doi.org/10.1613/jair.4200.

Bo Han, AI Hugo, Afshin Rahimi, Leon Derczynski, and Timothy Baldwin. 2016. Twitter geolocation prediction shared task of the 2016 workshop on noisy user-generated text. WNUT 2016, page 213.

David Jurgens, Tyler Finethy, James McCorriston, Yi Tian Xu, and Derek Ruths. 2015. Geolocation prediction in Twitter using social networks: A critical analysis and review of current practice. In ICWSM, pages 188–197.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pages 1746–1751. http://www.aclweb.org/anthology/D14-1181.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.

Michael D. Lieberman and Jimmy J. Lin. 2009. You are where you edit: Locating Wikipedia contributors through edit histories. In ICWSM.

Zhi Liu and Yan Huang. 2016. Where are you tweeting?: A context and user movement based approach. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM '16). ACM, New York, NY, USA, pages 1949–1952. https://doi.org/10.1145/2983323.2983881.

Marvin Minsky and Seymour Papert. 1969. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA, USA.

Yasuhide Miura, Motoki Taniguchi, Tomoki Taniguchi, and Tomoko Ohkuma. 2016. A simple scalable neural networks based model for geolocation prediction in Twitter. WNUT 2016, page 235.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814.

Umashanthi Pavalanathan and Jacob Eisenstein. 2015. Confounds and consequences in geotagged Twitter data. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP). http://www.aclweb.org/anthology/D/D15/D15-1256.pdf.

Reid Priedhorsky, Aron Culotta, and Sara Y. Del Valle. 2014. Inferring the origin locations of tweets with quantitative confidence. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing. ACM, pages 1523–1536.

Afshin Rahimi, Duy Vu, Trevor Cohn, and Timothy Baldwin. 2015. Exploiting text and network context for geolocation of social media users. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Denver, Colorado, pages 1362–1367. http://www.aclweb.org/anthology/N15-1153.

Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. 2010. Earthquake shakes Twitter users: Real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web (WWW '10). ACM, New York, NY, USA, pages 851–860. https://doi.org/10.1145/1772690.1772777.

Pavel Serdyukov, Vanessa Murdock, and Roelof Van Zwol. 2009. Placing Flickr photos on a map. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, pages 484–491.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958.

Stephen M. Stigler. 1997. Regression towards the mean, historically considered. Statistical Methods in Medical Research 6(2):103–114.

Thaddeus Vincenty. 1975. Direct and inverse solutions of geodesics on the ellipsoid with application of nested equations. Survey Review 23(176):88–93.

Hui Zou and Trevor Hastie. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(2):301–320.