Latent Geographic Feature Extraction
from Social Media
Christian Sengstock*, Michael Gertz
Database Systems Research Group
Heidelberg University, Germany
November 8, 2012
Definitions | Data Charac and Norm | Latent Geographic Feature Extraction | Experiments | Conclusions
Social Media is a huge and increasing source of unstructured and uncertain geographic information
Effort to make data usable:
  (Structured) Information Extraction
    Place/event extraction from Flickr [Rattenbury SIGIR'07]
    Event trajectory extraction from Twitter [Sakaki WWW'10]
  Spatial Analysis
    Spatio-temporal forecasting using Flickr [Jin MM'12]
    Study ecological phenomena [Zhang WWW'12]
Nov 8 2 / 33
General Motivation: Extract spatial variables from unstructured and noisy geographic information sources
Figure: Φ(l) and |Φ(l)| over locations (axes l1, l2); sources: Flickr, Twitter, Wikipedia
This work: Framework for unsupervised extraction of informative spatial variables (dimensions of geographic semantics) from Social Media
Outline
1 Definitions and Problem Statement
    Geographic Feature
    Signal Estimation
    Problem Statement
2 Data Characteristics and Normalization
    Distribution Characteristics
    Normalization
    Geographic Feature Types
3 Latent Geographic Feature Extraction
    Dimensionality Reduction
    Framework
4 Experiments
    Technique Comparison
    Normalization Influence
    Exploration Task
5 Conclusions
Terminology
Geographic Feature f
  A dimension representing some semantics of a location (e.g., temperature, population, number of restaurants)
  Sampled (measured) at any location l in geographic space W (→ spatial variable)
Geographic Feature Sensor φ and Signal φ(l) of f:
$$\phi : W \to \mathbb{R}^+, \quad l \mapsto \phi(l)$$
Terminology
Set of geographic features f1, ..., fp defines a Multivariate Geographic Feature Sensor:
$$\Phi := (\phi_1, \ldots, \phi_p)^T$$
Spatial sampling scheme (measurements) L = (l1, ..., ln) defines a Location Sampling Matrix:
$$X_{n \times p} = (\Phi(l_1), \ldots, \Phi(l_n))^T = \begin{pmatrix} \phi_1(l_1) & \cdots & \phi_p(l_1) \\ \vdots & \ddots & \vdots \\ \phi_1(l_n) & \cdots & \phi_p(l_n) \end{pmatrix}$$
Terminology
A Social Media Collection D consists of documents:
$$d_i = (X, u, l, t)$$
  X: Bag of document features (terms, tags, image features, ...)
  u: User
  l: Location
  t: Timestamp
Assumption: Features with geographic meaning aggregate in subsets of geographic space → high signal
Signal Estimation
Every document feature f1, ..., fp is a possibly meaningful/meaningless geographic feature
Intuition of geographic feature signal φi(l): Number of users using feature fi around location l ∈ W (motivated in the next section)
Estimation of φi by non-parametric 2D-histogram estimator on regular grid C of bandwidth w
  Small w → Capture small-scale variation/phenomena
  Large w → Capture large-scale variation/phenomena
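The signal estimation described on the Signal Estimation slide can be sketched as follows: documents are binned into a regular grid of cell width w, and the signal φi(l) of a feature is taken as the number of distinct users who used it in the cell containing l. The function name `estimate_signals`, the tuple layout of a document, and the toy data are assumptions made for illustration; they are not taken from the original framework.

```python
from collections import defaultdict
from math import floor


def estimate_signals(docs, w):
    """Count distinct users per (feature, grid cell).

    docs: iterable of (features, user, (lon, lat), timestamp) tuples
          (assumed layout, mirroring d = (X, u, l, t)).
    w:    grid cell width (bandwidth) in degrees.
    Returns {feature: {cell: number_of_distinct_users}}.
    """
    users = defaultdict(set)  # (feature, cell) -> set of users seen there
    for features, user, (lon, lat), _timestamp in docs:
        cell = (floor(lon / w), floor(lat / w))
        for f in set(features):
            users[(f, cell)].add(user)

    signals = defaultdict(dict)
    for (f, cell), seen in users.items():
        signals[f][cell] = len(seen)
    return dict(signals)


# Hypothetical toy collection: one user uploads several 'beach' photos
# at the same place, a second user uploads one nearby.
docs = [
    (["beach", "sunset"], "alice", (-118.49, 34.01), 1),
    (["beach"], "alice", (-118.49, 34.01), 2),
    (["beach"], "alice", (-118.49, 34.01), 3),
    (["beach"], "bob", (-118.48, 34.02), 4),
    (["desert"], "carol", (-116.10, 33.90), 5),
]

signals = estimate_signals(docs, w=1.0)
```

Counting users rather than documents means the three uploads by the same user contribute only once to the 'beach' signal of that cell.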
Problem Statement
Problem:
  Given high-dimensional geographic feature signal Φ from a Social Media collection (all terms/tags)
  → Features might be meaningless, redundant, noisy
Goal:
  Unsupervised extraction of a small number of informative geographic features
Applications:
  Prepare data for learning tasks that cannot handle high-dimensional data
  Discover hidden spatial variables in the data
Dataset
Two Flickr datasets covering US and LA
Document features: Tags (pre-filtered by minimum user frequency)
Spatial resolution: US (1.0 degree), LA (0.01 degree)
Spatial Distribution Characteristics
Figure: F(l): number of features, D(l): number of documents, U(l): number of users, Fd(l): number of distinct features.
Exponential characteristics of spatial feature distribution
Users ∼ distinct features / documents ∼ features
Spatial Feature Distribution: 'beach'
Figure: F(l, f): number of occurrences of feature f = beach, U(l, f): number of users using f = beach
Some users contribute a large number of documents
Estimating the signal on the basis of users is less biased (more robust)
Normalization
Exponential distribution characteristics → Few locations dominate the signals' spatial distribution
Normalization transforms the signal into a more natural domain
  Logging: $\phi'_i(l) := \log(\phi_i(l) + 1)$
  Binarization: $\phi'_i(l) := \mathbf{1}\{\phi_i(l) > 0\}$
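A minimal sketch of the two normalization strategies, applied to a made-up signal of one feature over five grid cells. It assumes the logging transform is meant as log(φ + 1), so that cells with a zero count stay finite; the helper names and the `top_share` measure are illustrative and not part of the original work.

```python
from math import log


def logging_norm(x):
    """Logging: log(x + 1), assuming the +1 sits inside the logarithm."""
    return log(x + 1)


def binarize(x):
    """Binarization: 1 if the feature occurs at the location, else 0."""
    return 1 if x > 0 else 0


def top_share(values):
    """Share of the total signal held by the strongest location."""
    return max(values) / sum(values)


# Hypothetical raw signal (user counts) of one feature over five grid cells.
raw = [0, 1, 3, 40, 900]

logged = [logging_norm(x) for x in raw]
binary = [binarize(x) for x in raw]

shares = {
    "none": top_share(raw),
    "logging": top_share(logged),
    "binarization": top_share(binary),
}
```

In this toy signal the dominant cell holds most of the raw mass, a smaller share after logging, and an equal share with every other occupied cell after binarization, which is the sense in which binarization is the stronger normalization.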
Geographic Feature Types
Geographic Feature Types: Classes of geographic features with similar geographic semantics [Sengstock ACMGIS'11]
  Global: Same intensity as baseline distribution (number of users) → Not interesting to discriminate between locations
  Regional: Widely spread in geographic space but different from baseline → Interesting to discriminate between large subsets in geographic space
  Landmark: Occurring only in small subsets of geographic space → Interesting to discriminate between single small subset and the rest
Depends on area of interest W and spatial resolution w.
Geographic Feature Types
Entropy over locations of spatial signal Xi as geographic feature type statistic for fi:
  Large entropy → Signal widely spread / smoothly distributed
  Small entropy → Signal peaky / occurs in small areas
Figure: Ordered entropies H[Xi] for tag features of US Flickr dataset
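The entropy statistic can be illustrated with a short sketch. It assumes that a feature's signal is first normalized to sum to one over all locations and that Shannon entropy with the natural logarithm is used; the slides do not state either detail, and the function name `spatial_entropy` and both example signals are hypothetical.

```python
from math import log


def spatial_entropy(signal):
    """Shannon entropy (in nats) of a signal normalized over locations.

    Assumes the signal has at least one non-zero entry.
    """
    total = sum(signal)
    probs = [x / total for x in signal if x > 0]
    return -sum(p * log(p) for p in probs)


# Hypothetical signals over the same eight grid cells.
widespread = [5, 5, 5, 5, 5, 5, 5, 5]   # evenly spread over all cells
peaky = [0, 0, 0, 40, 0, 0, 0, 0]       # concentrated in a single cell

h_wide = spatial_entropy(widespread)    # equals log(8), the maximum for 8 cells
h_peak = spatial_entropy(peaky)         # equals 0
```

Under these assumptions a widely spread feature reaches the maximum entropy for the grid, while a feature confined to one cell has entropy zero, matching the landmark end of the ordering.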
Dimensionality Reduction
Describe high-dimensional data by k ≪ p dimensions while preserving statistical properties of data
General formulation (generative latent factor model):
$$X_{n \times p} = S_{n \times k} A_{k \times p}$$
  S: Latent factor (component) values for each record
  A: Combination of latent factors by original features
Latent Geographic Feature Sensor/Signal:
$$\tilde{\Phi}(l)_{k \times 1} := A_{k \times p} \, \Phi(l)_{p \times 1}$$
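One concrete way to obtain factors S and A of this form is a truncated singular value decomposition, used here as a stand-in because it needs only NumPy and maps a location's signal to the latent space by a plain matrix product (PCA would additionally center the columns of X). The matrix X below is random toy data, and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical location sampling matrix: n = 6 locations, p = 4 features.
X = rng.poisson(lam=3.0, size=(6, 4)).astype(float)
k = 2  # number of latent geographic features

# Truncated SVD: X is approximated by U_k diag(s_k) Vt_k.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
A = Vt[:k]              # k x p: weights of the original features per latent factor
S = U[:, :k] * s[:k]    # n x k: latent factor values per location

X_approx = S @ A        # rank-k approximation of X

# Latent signal at the first sampled location: A applied to Phi(l1).
phi_l1 = X[0]           # Phi(l1), a vector of length p
latent_l1 = A @ phi_l1  # vector of length k, equal to the first row of S
```

Because the rows of A are orthonormal in this decomposition, applying A to the signal of a sampled location reproduces the corresponding row of S.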
Dimensionality Reduction
Spatial distributions in the data reflect spatial phenomena
Statistical structure in X depends on spatial co-occurrence of features
Reducing the location sampling matrix X:
  Latent factors represent dominant spatial distributions (correlated features collapse, non-dominant features diminish)
  Latent factors describe distinct spatial distributions
Latent geographic features describe signals of dominant and distinct geographic phenomena
Framework
Technique Comparison
Study of dimensionality reduction techniques preserving different statistical properties
  Principal Component Analysis (PCA): Components are statistically uncorrelated
  Sparse Principal Component Analysis (SPCA): Components are statistically uncorrelated and sparse (α ≪ p non-zero entries)
  Independent Component Analysis (ICA): Components are statistically independent
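The PCA property named above (uncorrelated components) can be checked numerically with a small NumPy sketch. It covers PCA only; SPCA and ICA need dedicated solvers and are not shown. The synthetic data, with two strongly correlated features and one independent feature, is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 200 locations, 3 features, the first two strongly correlated.
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

Xc = X - X.mean(axis=0)                   # PCA operates on centered columns
cov = np.cov(Xc, rowvar=False)            # p x p covariance of the features
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]         # reorder by descending variance
A = eigvecs[:, order].T                   # rows are the principal directions
S = Xc @ A.T                              # component values per location

score_cov = np.cov(S, rowvar=False)       # covariance of the components
off_diag = score_cov - np.diag(np.diag(score_cov))
```

The off-diagonal entries of `score_cov` vanish up to rounding error, so the two correlated input features collapse into a single dominant component while the components themselves stay uncorrelated.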
Technique Comparison
Qualitative Evaluation
a Extraction of k = 20 latent geographic features
b Manual labeling of extracted features on the basis of component weights and spatial signal distribution
c Identification of 'informative' latent features
d Selection of similar latent features of other techniques on the basis of highest component weights
e Comparison of component weights and signal distribution
Technique Comparison
Figure: SPCA features 'landscape' (top), 'beach' (center), 'desert' (bottom)
Technique Comparison
Figure: Comparison for 'beach': PCA (top), ICA (center), SPCA (bottom)
Normalization Influence
Extracted latent geographic features can be of different types (global, landmark, regional)
Calculation of entropy over k = 20 extracted features for each technique and each normalization strategy (none, logging, binarization)
SPCA and ICA shift towards more regional features under stronger normalization
Normalization Influence
Figure: Normalization: None (top), Logging (center), Binarization (bottom)
Exploration Task
Extraction of informative geographic features for Los Angeles using Flickr
Exploration Setting
SPCA feature extraction
Normalization as exploration parameter
Exploration Task
Los Angeles Landmark Features
Exploration Task
Los Angeles Regional Features
Conclusions
Summary
  General framework for unsupervised extraction of informative spatial variables (dimensions of geographic semantics) from Social Media
  PCA ≺ ICA ≺ SPCA
  Transformation of informative geographic feature extraction into a problem of high-dimensional statistics in geographic feature space
  Extraction of spatial variables of different types (landmarks, regional) by normalization
Ongoing Work
  Extrinsic evaluation of parametrization and techniques (e.g., spatial classification task)
  (Semi-)supervised feature extraction