Traffic Speed Data Investigation with Hierarchical Modeling Tomonari MASADA Nagasaki University [email protected]
Real-Time Traffic Speed Data | NYC Open Data
https://data.cityofnewyork.us/Transportation/Real-Time-Traffic-Speed-Data/xsat-x5sa
Traffic speed measurements at 128 streets
(Regrettably, no longer maintained)
Problem 1
• Traffic speed data show a clear
one-day periodicity.
• However, many different traffic speed
distribution patterns can also be observed
within each day.
Solution 1 [Masada+ 14]
• We take intuition from topic models
in text mining.
–The data set of each day should be
modeled as a mixture of many
different speed distributions.
Latent Dirichlet Allocation (LDA) [Blei+ 03]
• LDA clusters at the word-token level,
not at the document level.
• Each document is modeled as a mixture of
many different word probability distributions.
topic <-> word probability distribution
document <-> topic probability distribution
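[Figure: LDA illustration. Topics t1, t2, t3 each have a word probability distribution (φk1, φk2, φk3, φk4) over the words v1, v2, v3, v4. Document j has topic probabilities θj1, θj2, θj3. Example word tokens: v3, v1, v3, v2, v2.]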
An important difference
• Words are discrete entities.
– LDA uses a multinomial distribution to model
each per-topic word distribution.
• Speeds (in mph) are continuous entities.
– Our model uses a gamma distribution to model
each per-topic speed distribution.
Comparison with LDA
• word token
<-> speed measurement (in mph)
• topic (multinomial)
<-> topic (gamma)
• document
<-> document (24 hrs from midnight)
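The correspondence above can be read as a generative process. The following is a minimal sketch, assuming plain LDA-style per-document topic probabilities and one gamma distribution per topic. The function name `generate_day` and all parameter values are hypothetical and are not taken from the slides.

```python
import random

def generate_day(theta_d, shapes, scales, n_measurements, rng):
    """Generate (topic, speed) pairs for one document (one day).

    theta_d: topic probabilities of the document
    shapes, scales: per-topic gamma parameters (illustrative values only)
    """
    topics = list(range(len(theta_d)))
    day = []
    for _ in range(n_measurements):
        # Draw a topic for this measurement, as LDA does for a word token.
        k = rng.choices(topics, weights=theta_d)[0]
        # Draw a speed in mph from the gamma distribution of that topic.
        # random.gammavariate takes (shape, scale).
        speed = rng.gammavariate(shapes[k], scales[k])
        day.append((k, speed))
    return day

if __name__ == "__main__":
    rng = random.Random(0)
    for k, speed in generate_day([0.5, 0.3, 0.2], [20.0, 40.0, 60.0],
                                 [0.5, 0.5, 0.5], 5, rng):
        print(k, round(speed, 1))
```

Each measurement gets its own topic, so a single day mixes several speed distributions, which mirrors the word-token-level clustering of LDA.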
Problem 2
• Traffic speed data may be similar
at the same time point of day.
• Traffic speed data may be similar
for streets whose locations are close
to one another.
TRINH = TRaffic speed INvestigation
with Hierarchical modeling
• Make topic probabilities dependent on
time points and on locations
– probability that the speed measured by the sensor
s at the time point t is assigned to the topic k
\[ \theta_{dtk} \equiv \frac{\exp(m_{dk} + \lambda_{ks} + \tau_{kt})}{\sum_{k'} \exp(m_{dk'} + \lambda_{k's} + \tau_{k't})} \]
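The definition above is a softmax over topics of the sum of three scores. A minimal sketch of that computation for one fixed document d, sensor s, and time point t follows. The function name `topic_probabilities` and the example values are hypothetical.

```python
import math

def topic_probabilities(m_d, lam_s, tau_t):
    """Softmax over topics k of m_dk + lambda_ks + tau_kt.

    m_d[k], lam_s[k], tau_t[k] hold the document, sensor, and
    time-point scores of topic k for fixed d, s, and t.
    """
    scores = [m + lam + tau for m, lam, tau in zip(m_d, lam_s, tau_t)]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

if __name__ == "__main__":
    # Illustrative scores for K = 3 topics.
    print(topic_probabilities([0.2, -0.1, 0.0], [0.5, 0.0, -0.5], [0.0, 0.3, 0.1]))
```

Because only the sum of the three scores matters, adding the same constant to every topic leaves the probabilities unchanged.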
Parameters
• 𝑚𝑑𝑘
– How strongly the document d favors the topic k
• 𝜆𝑘𝑠
– How strongly the sensor s favors the topic k
• 𝜏𝑘𝑡
– How strongly the time point t (of day) favors the
topic k
Priors for parameters ("hierarchical")
• 𝑚𝑑𝑘
–K Gaussian priors
• 𝜆𝑘𝑠
–K Gaussian process priors
• 𝜏𝑘𝑡
–K Gaussian process priors
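To illustrate what a Gaussian process prior over time points implies, the sketch below draws one vector of 24 hourly values for a single topic. The squared-exponential kernel, its hyperparameters, and the function names are assumptions made for illustration. The slides do not state which kernel is used.

```python
import math
import random

def covariance_matrix(points, variance=1.0, length_scale=2.0, jitter=1e-6):
    """Squared-exponential covariance (an assumed kernel) plus diagonal jitter."""
    n = len(points)
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            diff = (points[i] - points[j]) / length_scale
            cov[i][j] = variance * math.exp(-0.5 * diff * diff)
        cov[i][i] += jitter  # keeps the matrix numerically positive definite
    return cov

def cholesky(a):
    """Lower-triangular factor low such that low * low^T equals a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][p] * low[j][p] for p in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def sample_gp(points, rng):
    """Draw one zero-mean GP sample at the given points."""
    low = cholesky(covariance_matrix(points))
    z = [rng.gauss(0.0, 1.0) for _ in points]
    return [sum(low[i][j] * z[j] for j in range(i + 1)) for i in range(len(points))]

if __name__ == "__main__":
    hours = list(range(24))
    tau_k = sample_gp(hours, random.Random(0))
    print([round(v, 2) for v in tau_k])
```

Nearby hours receive similar values under such a prior, which is one way to encode the similarity at close time points. A prior over sensors could use distances between sensor locations in the same manner.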
Inference by MCMC
• Sample from the posterior distribution
– Slice sampling for topic probability
parameters 𝑚𝑑𝑘, 𝜆𝑘𝑠, and 𝜏𝑘𝑡
–Metropolis-Hastings for hyperparameters
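As an illustration of slice sampling, the sketch below implements one univariate update with the stepping-out and shrinkage procedure. It targets a toy one-dimensional density, not the posterior of the model. Which slice sampling variant the authors use is not stated in the slides, so this choice and the name `slice_sample` are assumptions.

```python
import math
import random

def slice_sample(log_density, x0, rng, width=1.0, max_steps=50):
    """One univariate slice-sampling update (stepping out, then shrinkage)."""
    # Draw the slice height uniformly under the density at x0 (in log space).
    log_y = log_density(x0) + math.log(1.0 - rng.random())
    # Place an interval of the given width at random around x0.
    left = x0 - width * rng.random()
    right = left + width
    # Step out until both ends fall outside the slice.
    steps = max_steps
    while steps > 0 and log_density(left) > log_y:
        left -= width
        steps -= 1
    steps = max_steps
    while steps > 0 and log_density(right) > log_y:
        right += width
        steps -= 1
    # Sample from the interval, shrinking it toward x0 after each rejection.
    while True:
        x1 = left + (right - left) * rng.random()
        if log_density(x1) >= log_y:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1

if __name__ == "__main__":
    rng = random.Random(0)
    x, draws = 0.0, []
    for _ in range(2000):
        x = slice_sample(lambda v: -0.5 * v * v, x, rng)  # standard normal
        draws.append(x)
    print(sum(draws) / len(draws))
```

In a model with many parameters, an update of this kind would be applied to one parameter at a time with the others held fixed, using the log posterior as a function of that parameter.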
Comparison experiment
• Log likelihood per measurement
– Larger is better.
• Data
–May 27 to June 16, 2013 (three weeks)
• Data files were downloaded every minute.
–20% of the measurements were held out for testing
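The evaluation measure can be sketched as follows, assuming each test measurement comes with its topic probabilities and each topic has one gamma distribution. The function names and the shape/scale parameterization are hypothetical choices for this sketch.

```python
import math

def gamma_logpdf(x, shape, scale):
    """Log density of a gamma distribution (shape/scale form) at x > 0."""
    return ((shape - 1.0) * math.log(x) - x / scale
            - math.lgamma(shape) - shape * math.log(scale))

def log_likelihood_per_measurement(speeds, thetas, shapes, scales):
    """Average log likelihood of test speeds under a gamma mixture.

    speeds[i]: i-th test measurement in mph
    thetas[i]: topic probabilities that apply to the i-th measurement
    shapes[k], scales[k]: gamma parameters of topic k
    """
    total = 0.0
    for x, theta in zip(speeds, thetas):
        terms = [math.log(p) + gamma_logpdf(x, a, b)
                 for p, a, b in zip(theta, shapes, scales) if p > 0.0]
        top = max(terms)  # log-sum-exp over topics
        total += top + math.log(sum(math.exp(v - top) for v in terms))
    return total / len(speeds)

if __name__ == "__main__":
    speeds = [12.0, 35.5, 48.0]
    thetas = [[0.7, 0.3], [0.4, 0.6], [0.1, 0.9]]
    print(log_likelihood_per_measurement(speeds, thetas, [20.0, 80.0], [0.6, 0.6]))
```

Dividing by the number of test measurements makes the value comparable across test sets of different sizes.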
What we achieved
• We obtained an MCMC inference for a topic model
whose topic probabilities are defined by
combining multiple factors.
• The factors are correlated via Gaussian process priors.
– Our model can also be applied to other types of
metadata indicating intrinsic similarity of data.
Summary
• We proposed a topic model for traffic data analysis.
• Sensor locations and measurement timestamps
affect topic assignment.
• TRINH achieves better likelihood in earlier iterations.
• However, TRINH gives worse likelihood in later
iterations.