
Latent Geographic Feature Extraction

from Social Media

Christian Sengstock* Michael Gertz

Database Systems Research Group

Heidelberg University, Germany

November 8, 2012


Definitions Data Charac and Norm Latent Geographic Feature Extraction Experiments Conclusions

Social Media is a huge and increasing source of unstructured and uncertain geographic information

Effort to make data usable:

(Structured) Information Extraction
    Place/event extraction from Flickr [Rattenbury SIGIR'07]
    Event trajectory extraction from Twitter [Sakaki WWW'10]
Spatial Analysis
    Spatio-temporal forecasting using Flickr [Jin MM'12]
    Study ecological phenomena [Zhang WWW'12]

Nov 8 2 / 33


General Motivation: Extract spatial variables from unstructured and noisy geographic information sources

[Figure: signals Φ(l) and |Φ(l)| at locations l1, l2; sources: Flickr, Twitter, Wikipedia]

This work: Framework for unsupervised extraction of informative spatial variables (dimensions of geographic semantics) from Social Media


Outline

1 Definitions and Problem Statement

2 Data Characteristics and Normalization

3 Latent Geographic Feature Extraction

4 Experiments

5 Conclusions


Outline

1 Definitions and Problem Statement
    Geographic Feature
    Signal Estimation
    Problem Statement

2 Data Characteristics and Normalization
    Distribution Characteristics
    Normalization
    Geographic Feature Types

3 Latent Geographic Feature Extraction
    Dimensionality Reduction
    Framework

4 Experiments
    Technique Comparison
    Normalization Influence
    Exploration Task

5 Conclusions


Terminology

Geographic Feature f

A dimension representing some semantics of a location (e.g., temperature, population, number of restaurants)

Sampled (measured) at any location l in geographic space W (→ spatial variable)

Geographic Feature Sensor φ and Signal φ(l) of f:

\[ \phi : W \to \mathbb{R}^{+}, \qquad l \mapsto \phi(l) \]


Terminology

Set of geographic features $f_1, \ldots, f_p$ defines a Multivariate Geographic Feature Sensor:

\[ \Phi := (\phi_1, \ldots, \phi_p)^T \]

Spatial sampling scheme (measurements) $L = (l_1, \ldots, l_n)$ defines a Location Sampling Matrix:

\[
X_{n \times p} = (\Phi(l_1), \ldots, \Phi(l_n))^T =
\begin{pmatrix}
\phi_1(l_1) & \cdots & \phi_p(l_1) \\
\vdots & \ddots & \vdots \\
\phi_1(l_n) & \cdots & \phi_p(l_n)
\end{pmatrix}
\]
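As a minimal sketch of the definition above, the location sampling matrix can be built by evaluating every sensor at every sampled location. The names `sampling_matrix`, `phi_beach`, and `phi_city`, the (longitude, latitude) encoding of a location, and the toy sensor values are assumptions made for illustration only; they do not come from the slides.

```python
from typing import Callable, List, Sequence, Tuple

Location = Tuple[float, float]        # assumed encoding: (longitude, latitude)
Sensor = Callable[[Location], float]  # one sensor phi_i : W -> R+


def sampling_matrix(sensors: Sequence[Sensor],
                    locations: Sequence[Location]) -> List[List[float]]:
    """Return X with X[j][i] = phi_i(l_j): one row per location, one column per feature."""
    return [[phi(l) for phi in sensors] for l in locations]


# Two toy sensors (hypothetical values, purely illustrative).
def phi_beach(l: Location) -> float:
    return 5.0 if l[0] < -118.0 else 0.0


def phi_city(l: Location) -> float:
    return 2.0


L = [(-118.5, 34.0), (-117.0, 34.1), (-118.2, 33.9)]
X = sampling_matrix([phi_beach, phi_city], L)
print(X)  # [[5.0, 2.0], [0.0, 2.0], [5.0, 2.0]]
```

Each row of `X` is the multivariate signal Φ(l) at one sampled location, so `X` has n = 3 rows and p = 2 columns here.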


Terminology

A Social Media Collection D consists of documents:

\[ d_i = (X, u, l, t) \]

X: Bag of document features (terms, tags, image features, ...)
u: User
l: Location
t: Timestamp

Assumption: Features with geographic meaning aggregate in subsets of geographic space → high signal


Signal Estimation

Every document feature $f_1, \ldots, f_p$ is a possibly meaningful/meaningless geographic feature

Intuition of geographic feature signal $\phi_i(l)$: Number of users using feature $f_i$ around location $l \in W$ (motivated in the next section)

Estimation of $\phi_i$ by a non-parametric 2D-histogram estimator on a regular grid C of bandwidth w

Small w → Capture small scale variation/phenomena

Large w → Capture large scale variation/phenomena
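The histogram estimator described above can be sketched as follows: each document is assigned to a grid cell of width w, and the signal of a feature in a cell is the number of distinct users who used that feature there. The function names, the document tuple layout (timestamps omitted), and the sample documents are assumptions for illustration, not the authors' implementation.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Assumed document layout: (bag of features, user id, (lon, lat)).
Document = Tuple[Set[str], str, Tuple[float, float]]
Cell = Tuple[int, int]


def cell_of(location: Tuple[float, float], w: float) -> Cell:
    """Map a (lon, lat) location to the index of its grid cell of width w."""
    lon, lat = location
    return (math.floor(lon / w), math.floor(lat / w))


def estimate_signals(docs: Iterable[Document], w: float) -> Dict[str, Dict[Cell, int]]:
    """For every feature, count the distinct users per grid cell."""
    users = defaultdict(lambda: defaultdict(set))
    for features, user, location in docs:
        c = cell_of(location, w)
        for f in features:
            users[f][c].add(user)
    return {f: {c: len(us) for c, us in cells.items()}
            for f, cells in users.items()}


docs = [
    ({"beach", "sun"}, "u1", (-118.49, 34.01)),
    ({"beach"}, "u1", (-118.48, 34.02)),  # same user again: counted once
    ({"beach"}, "u2", (-118.47, 34.03)),
    ({"desert"}, "u3", (-116.20, 33.80)),
]
signals = estimate_signals(docs, w=1.0)
print(signals["beach"])   # {(-119, 34): 2}
print(signals["desert"])  # {(-117, 33): 1}
```

Counting users rather than documents means the second upload by `u1` does not inflate the 'beach' signal. Choosing a smaller `w` splits the same documents over more cells and so captures smaller-scale variation.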


Problem Statement

Problem:

Given high-dimensional geographic feature signal Φ from a Social Media collection (all terms/tags)

→ Features might be meaningless, redundant, noisy

Goal:

Unsupervised extraction of small number of informative geographic features

Applications:

Prepare data for learning tasks that cannot handle high-dimensional data

Discover hidden spatial variables in the data


Dataset

Two Flickr datasets covering US and LA

Document features: Tags (pre-filtered by minimum user frequency)

Spatial resolution: US (1.0 degree), LA (0.01 degree)


Spatial Distribution Characteristics

Figure: F(l): Number of features, D(l): Number of documents, U(l): Number of users, Fd(l): Number of distinct features.

Exponential characteristics of spatial feature distribution

Number of users ∼ number of distinct features; number of documents ∼ number of features


Spatial Feature Distribution: 'beach'

Figure: F(l, f): Number of occurrences of feature f = beach, U(l, f): Number of users using f = beach

Some users contribute large number of documents

Estimating the signal on the basis of users is less biased (more robust)


Normalization

Exponential distribution characteristics → Few locations dominate the signals' spatial distribution

Normalization transforms the signal into a more natural domain

Logging:
\[ \phi'_i(l) := \log(\phi_i(l) + 1) \]

Binarization:
\[ \phi'_i(l) := \mathbb{1}\{\phi_i(l) > 0\} \]
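A minimal sketch of the two normalization strategies, applied element-wise to a location sampling matrix. It assumes the logging transform is read as log(φ_i(l) + 1), which stays defined where the signal is zero; the function names and the toy matrix are illustrative assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]


def log_normalize(X: Matrix) -> Matrix:
    """Logging: phi' = log(phi + 1), so a zero signal stays zero."""
    return [[math.log(v + 1.0) for v in row] for row in X]


def binarize(X: Matrix) -> Matrix:
    """Binarization: phi' = 1 if phi > 0 else 0."""
    return [[1.0 if v > 0 else 0.0 for v in row] for row in X]


# Toy user counts: the first column is dominated by a single location.
X = [[0.0, 9.0],
     [100.0, 1.0]]
X_log = log_normalize(X)
X_bin = binarize(X)
print(X_bin)  # [[0.0, 1.0], [1.0, 1.0]]
```

Logging compresses the gap between the dominant cell (100 users) and the others, while binarization discards intensity entirely and keeps only where a feature occurs.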


Geographic Feature Types

Geographic Feature Types: Classes of geographic features with similar geographic semantics [Sengstock ACMGIS'11]

Global: Same intensity as baseline distribution (number of users) → Not interesting to discriminate between locations

Regional: Widely spread in geographic space but different from baseline → Interesting to discriminate between large subsets in geographic space

Landmark: Occurring only in small subsets of geographic space → Interesting to discriminate between single small subset and the rest

The type of a feature depends on the area of interest W and the spatial resolution w.


Geographic Feature Types

Entropy over locations of spatial signal $X_i$ as geographic feature type statistic for $f_i$:

large entropy → Signal widely spread / smoothly distributed

small entropy → Signal peaky / occurs in small areas

Figure: Ordered entropies H[Xi ] for tag features of US Flickr dataset
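The entropy statistic can be sketched by treating one column of the location sampling matrix as a distribution over locations. The slides do not spell out the normalization or the logarithm base, so dividing by the column sum and using base 2 are assumptions here, as is the function name `signal_entropy`.

```python
import math
from typing import Sequence


def signal_entropy(column: Sequence[float]) -> float:
    """Shannon entropy (bits) of a signal normalized to sum to 1 over locations."""
    total = sum(column)
    if total <= 0:
        return 0.0
    h = 0.0
    for v in column:
        if v > 0:
            p = v / total
            h -= p * math.log2(p)
    return h


widely_spread = [5.0, 5.0, 5.0, 5.0]  # same intensity at every location
peaky = [20.0, 0.0, 0.0, 0.0]         # occurs at a single location
print(signal_entropy(widely_spread))  # 2.0
print(signal_entropy(peaky))          # 0.0
```

A uniform signal over four locations reaches the maximum of log2(4) = 2 bits, while a signal confined to one location has zero entropy, matching the widely spread versus peaky reading above.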


Dimensionality Reduction

Describe high-dimensional data by $k \ll p$ dimensions while preserving statistical properties of data

General formulation (generative latent factor model):

\[ X_{n \times p} = S_{n \times k} A_{k \times p} \]

S: Latent factor (component) values for each record

A: Combination of latent factors by original features

Latent Geographic Feature Sensor/Signal:

\[ \tilde{\Phi}(l)_{k \times 1} := A_{k \times p} \, \Phi(l)_{p \times 1} \]
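To make the factorization concrete, the sketch below computes a rank-k factorization with a truncated SVD in NumPy and then projects one location's signal into the latent space. The truncated SVD (without mean centering) is a stand-in chosen for brevity, not one of the techniques the slides compare; the function names and the toy matrix are assumptions.

```python
import numpy as np


def latent_factors(X: np.ndarray, k: int):
    """Rank-k factorization X ~ S @ A using a truncated SVD (no centering)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    S = U[:, :k] * s[:k]  # n x k: latent factor values per location
    A = Vt[:k, :]         # k x p: weights of original features per factor
    return S, A


def latent_signal(A: np.ndarray, phi_l: np.ndarray) -> np.ndarray:
    """Latent sensor: map a p-dimensional signal Phi(l) to k latent features."""
    return A @ phi_l


# Toy matrix with two spatial patterns: features 0 and 1 co-occur, feature 2 is separate.
X = np.array([[4.0, 4.0, 0.0],
              [2.0, 2.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 0.0, 6.0]])
S, A = latent_factors(X, k=2)
print(np.allclose(S @ A, X))                      # True: X has rank 2
print(np.allclose(latent_signal(A, X[0]), S[0]))  # True: rows of A are orthonormal
```

The two correlated columns collapse into a single latent factor, which is the behavior the next slide describes for co-occurring features.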


Dimensionality Reduction

Spatial distributions in the data reflect spatial phenomena

Statistical structure in X depends on spatial co-occurrence of features

Reducing the location sampling matrix X:

Latent factors represent dominant spatial distributions (correlated features collapse, non-dominant features diminish)

Latent factors describe distinct spatial distributions

Latent geographic features describe signals of dominant and distinct geographic phenomena


Framework


Technique Comparison

Study of dimensionality reduction techniques preserving different statistical properties

Principal Component Analysis (PCA): Components are statistically uncorrelated

Sparse Principal Component Analysis (SPCA): Components are statistically uncorrelated and sparse ($\alpha \ll p$ non-zero entries)

Independent Component Analysis (ICA): Components are statistically independent


Technique Comparison

Qualitative Evaluation

a Extraction of k = 20 latent geographic features

b Manual labeling of extracted features on basis of component weights and spatial signal distribution

c Identification of 'informative' latent features

d Selection of similar latent features of other techniques on basis of highest component weights

e Comparison of component weights and signal distribution
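Step d above can be sketched as a small matching routine: list the highest-weighted original features of a reference component, then pick the component of another technique whose top features overlap most. Ranking by absolute weight, measuring similarity by set overlap, and the helper names `top_features` and `most_similar` are assumptions for illustration; the slides only say the selection is based on the highest component weights.

```python
from typing import List, Sequence


def top_features(weights: Sequence[float], names: Sequence[str], m: int) -> List[str]:
    """Names of the m original features with the largest absolute weight."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return [names[i] for i in order[:m]]


def most_similar(reference: Sequence[float],
                 candidates: Sequence[Sequence[float]],
                 names: Sequence[str], m: int) -> int:
    """Index of the candidate component sharing the most top-m features with the reference."""
    ref = set(top_features(reference, names, m))
    overlaps = [len(ref & set(top_features(c, names, m))) for c in candidates]
    return max(range(len(candidates)), key=lambda j: overlaps[j])


# Hypothetical component weights over five tags.
names = ["beach", "ocean", "surf", "desert", "sand"]
spca_beach = [0.8, 0.5, 0.3, 0.0, 0.0]
pca_components = [[0.1, 0.0, 0.1, 0.9, 0.4],
                  [0.6, 0.6, 0.4, 0.1, 0.2]]

print(top_features(spca_beach, names, 3))                 # ['beach', 'ocean', 'surf']
print(most_similar(spca_beach, pca_components, names, 3)) # 1
```

The matched pair can then be compared on component weights and on the spatial distribution of its signal, as in step e.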


Technique Comparison

SPCA features: 'landscape' (top), 'beach' (center), 'desert' (bottom)


Technique Comparison

Comparison 'beach': PCA (top), ICA (center), SPCA (bottom)


Normalization Influence

Extracted latent geographic features can be of different types (global, landmark, regional)

Calculation of entropy over k = 20 extracted features for each technique and each normalization strategy (none, logging, binarization)

SPCA and ICA show response towards more regional features for stronger normalization


Normalization Influence

Normalization: None (top), Logging (center), Binarization (bottom)


Exploration Task

Extraction of informative geographic features for Los Angeles using Flickr

Exploration Setting

SPCA feature extraction

Normalization as exploration parameter


Exploration Task

Los Angeles Landmark Features


Exploration Task

Los Angeles Regional Features


Conclusions

Summary

General framework for unsupervised extraction of informative spatial variables (dimensions of geographic semantics) from Social Media

PCA ≺ ICA ≺ SPCA

Transformation of informative geographic feature extraction into a problem of high-dimensional statistics in geographic feature space

Extraction of spatial variables of different types (landmarks, regional) by normalization

Ongoing Work

Extrinsic evaluation of parametrization and techniques (e.g., spatial classification task)

(Semi-)supervised feature extraction
