DeepTransport: Learning Spatial-Temporal Dependency for Traffic Condition Forecasting

Xingyi Cheng†∗, Ruiqing Zhang†, Jie Zhou, Wei Xu
Baidu Research - Institute of Deep Learning

[email protected] {zhangruiqing01,zhoujie01,wei.xu}@baidu.com

Abstract

Predicting traffic conditions has recently been explored as a way to relieve traffic congestion. Several pioneering approaches have been proposed based on traffic observations of the target location as well as its adjacent regions, but they obtain somewhat limited accuracy due to a lack of mining of road topology. To address this effect-attenuation problem, we propose taking into account the traffic of surrounding locations (wider than the adjacent range). We propose an end-to-end framework called DeepTransport, in which Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) are utilized to obtain spatial-temporal traffic information within a transport network topology. In addition, an attention mechanism is introduced to align spatial and temporal information. Moreover, we constructed and released a real-world large traffic condition dataset with 5-minute resolution. Our experiments on this dataset demonstrate that our method captures the complex relationship in the temporal and spatial domains. It significantly outperforms traditional statistical methods and a state-of-the-art deep learning method.

Introduction

With the development of location-acquisition and wireless devices, a vast amount of data with spatial transport networks and timestamps can be collected by mobile phone map apps. The majority of map apps can tell users real-time traffic conditions, as shown in Figure 1. However, current traffic conditions alone are not enough for effective route planning; a traffic system that predicts future road conditions may be more valuable.

In the past, there have been mainly two approaches to traffic prediction: time-series analysis based on classical statistics, and data-driven methods based on machine learning. Most of the former methods are univariate; they predict the traffic of a place at a certain time. The fundamental work was Auto Regressive Integrated Moving Average (ARIMA) (Ahmed and Cook 1979) and its variations (Pan, Demiryurek, and Shahabi 2012; Williams and Hoel 1999). Motivated by the fact (Williams 2001) that traffic evolution is a temporal-spatial phenomenon, multivariate methods with both temporal and spatial features were proposed. (Stathopoulos and Karlaftis 2003) developed a model that feeds on data from upstream detectors to improve the predictions of downstream locations. However, many statistics are needed in such methods. On the other hand, data-driven methods (Jeong et al. 2013; Vlahogianni, Karlaftis, and Golias 2005) fit a single model from vector-valued observations including historical scalar measurements with trend, seasonal, cyclical, and calendar variations. For instance, (Deng et al. 2016) expressed traffic patterns by mapping road attributes to a latent space. However, the linear model there is limited in its ability to extract effective features.

∗This work was done before leaving Baidu. Email: [email protected].
†Main contribution.

Figure 1: A real-time traffic network example from a commercial map app. The network includes many locations, and the color depth (green, yellow, red, dark red) illustrates the condition of each location (a stretch of road).

Neural networks and deep learning have been demonstrated as a unified learning framework for feature extraction and data modeling. Since their application to this topic, significant progress has been made in related work. Firstly, the temporal and spatial dependencies between observations in time and space are complex and can be strongly nonlinear. While statistical methods frequently fail when dealing with nonlinearity, neural networks are powerful enough to capture very complex relations (LeCun, Bengio, and Hinton 2015). Secondly, neural networks can be trained with raw data in an end-to-end manner. Hand-crafting engineered features that extract all information from data spread across time and space is laborious; data-driven neural networks extract features without the need for such statistical features, e.g., the mean or variance over all adjacent locations of the current location. The advantage of neural networks for traffic prediction was recognized by researchers long ago. Some early work (Chang and Su 1995; Innamaa 2000) simply put observations into the input layer, or took sequential features into consideration (Dia 2001a) to capture temporal patterns in time series. Only in the last few years has deep learning been applied, for instance Deep Belief Networks (DBN) (Huang et al. 2014) and Stacked Autoencoders (SAEs) (Lv et al. 2015). However, the input data in these works are directly concatenated from different locations, which ignores the spatial relationship. In general, the existing methods either concern themselves only with the time series or make little use of spatial information. Depending on the traffic condition of a "narrow" spatial range undoubtedly degrades prediction accuracy. To achieve a better understanding of spatial information, we propose to solve this problem by taking the intricate topological graph as a key feature in traffic condition forecasting, especially for long prediction horizons.

Taking any target location as the center of radiation, surrounding locations of the same order form a "width" region, and regions of different orders constitute a "depth" sequence. We propose a double sequential deep learning model to explore traffic condition patterns. This model adopts a combination of convolutional neural networks (CNN) (LeCun, Bengio, and others 1995) and recurrent networks with long short-term memory (LSTM) units (Hochreiter and Schmidhuber 1997) to deal with spatial dependencies. The CNN is responsible for maintaining the "width" structure, while the LSTM handles the "depth" structure. To depict the complicated spatial dependency, we utilize an attention mechanism to model the relationships between time and space.

The main contributions of the paper are summarized as follows:

• We introduce a novel deep architecture to enable temporal and dynamical spatial modeling for traffic condition forecasting.

• We propose the necessity of aligning spatial and temporal information and introduce an attention mechanism into the model to quantify their relationship. The obtained attention weights are helpful for daily traveling and path planning.

• Experimental results demonstrate that the proposed model significantly outperforms existing methods based on deep learning and time-series forecasting.

• We also release a large real-world traffic dataset (millions of records) with topological networks and temporal traffic conditions.1

Preliminary

In this section, we briefly revisit the traffic prediction problem and introduce the notation used in this work.

Common Notations and Definitions

A traffic network can be represented as a graph in two ways: either monitor the traffic flow of crossings, taking crossings as nodes and roads as edges, or conversely, monitor the condition of roads, taking roads as nodes and crossings as connecting edges. The latter representation is adopted in our work. Taking Figure 2(a) as an example, each colored node corresponds to a stretch of road in a map app.

1 https://github.com/cxysteven/MapBJ

Figure 2: Traffic condition. (a) A plain graph at a time point; (b) a graph with time series. Five colors in this graph denote five states for visual display: green (1, fluency), yellow (2, slow), red (3, congestion), and dark red (4, extreme congestion). "27180" is the ID number of a location (road section).

We consider a graph consisting of weighted vertices and directed edges. Denote the graph as G = ⟨V, E⟩, where V is the set of vertices and E ⊆ {(u, v) | u ∈ V, v ∈ V} is the set of edges, with (u, v) an ordered pair. A location (vertex) v at any time point t has one of five traffic condition states c(v, t) ∈ {0, 1, 2, 3, 4}, expressing not-released, fluency, slow, congestion, and extreme congestion respectively. Figure 2(b) presents an example of road traffic at three time points in an area.

Observations: Each vertex in the graph is associated with a feature vector, which consists of two parts: time-varying variables O and time-invariant variables F. The time-varying variables, which characterize the traffic network dynamically, are traffic flow observations aggregated over 5-minute intervals. The time-invariant variables are static features, natural properties that do not change with time, such as the number of input and output degrees of a road, its length, its speed limit, and so forth.

In particular, the time-varying and time-invariant variables are denoted as:

$$O_{v,t} = \begin{bmatrix} c(v,t) \\ c(v,t-1) \\ \vdots \\ c(v,t-p) \end{bmatrix}, \qquad F_v = \begin{bmatrix} f_{v,1} \\ f_{v,2} \\ \vdots \\ f_{v,k} \end{bmatrix} \qquad (1)$$

where $c(v,t)$ is the traffic condition of vertex $v$ at time $t$, $p$ is the length of the historical measurement, and $f_{v,k}$ are the time-invariant features.

Figure 3: An example of a directed graph and the order slot notation in DeepTransport. (a) The directed graph of a location; (b) the upstream-flow and downstream-flow neighbors of L4 within order 2.

Order Slot: In a path of the directed graph, the number of edges required to travel from one vertex to another is called the order. Vertices of the same order constitute an order slot. Directly linked vertices are termed first-order neighbors; second-order spatial neighbors of a vertex are the first-order neighbors of its first-order neighbors, and so forth. For any vertex in our directed graph, we define the incoming traffic flow as its upstream flow and the outflow as its downstream flow. Take Figure 3(a) as an example: L4 is the target location to be predicted. L3 is the first-order downstream vertex of L4. L1 and L2 form the first-order downstream set of L3, and they constitute the second-order slot of L4. Each vertex in the traffic flow that goes in one direction is affected by its upstream flow and downstream flow. The first- and second-order slots of L4 are shown in Figure 3(b). Introducing the dimension of time series, any location $L_{v,t}$ is composed of two vectors, $O_{v,t}$ and $F_v$. Any order slot consists of some locations:

$$L_{v,t} = \begin{bmatrix} O_{v,t} \\ F_v \end{bmatrix}, \qquad X^j_{v,t} = \begin{bmatrix} L^T_{u_1,t} \\ L^T_{u_2,t} \\ \vdots \\ L^T_{u_k,t} \end{bmatrix} \qquad (2)$$

where each location index $u_\cdot$ is one of the $j$th-order neighbors of $v$.
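For concreteness, the following sketch shows how these vectors could be assembled from raw records; the container names (`conditions`, `static`, `neighbors_j`) are hypothetical, not from the paper:

```python
import numpy as np

def location_vector(v, t, p, conditions, static):
    """L_{v,t} of Eq. (2): p+1 past conditions stacked with static features."""
    O = np.array([conditions[v][t - i] for i in range(p + 1)])  # O_{v,t}, Eq. (1)
    return np.concatenate([O, static[v]])                       # [O_{v,t}; F_v]

def order_slot_matrix(neighbors_j, t, p, conditions, static):
    """X^j_{v,t} of Eq. (2): one row L^T_{u,t} per j-th order neighbor u."""
    return np.stack([location_vector(u, t, p, conditions, static)
                     for u in neighbors_j])
```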

Perceptive Radius: The maximum order number controls the perceptive scope of the target location. It is an important hyperparameter describing spatial information; we call it the perceptive radius and denote it as $r$.

Problem Definition: According to the above notation, we define the problem as follows: predict a sequence of traffic flow $L_{v,t+h}$ for prediction horizon $h$ given the historical observations of $L_{v',t'}$, where $v' \in \text{neighbor}(v, r)$, $t' \in \{t-p, \cdots, t\}$, $r \in \{0, \cdots, R\}$ is the perceptive radius, and $p$ is the length of the historical measurement.

Model

As shown in Figure 4, our model consists of four parts: the upstream flow observation module (left), the target location module (middle), the downstream flow observation module (right), and the training cost module (top). In this section, we detail the working process of each module.

Spatial-temporal Relation Construction

Since the traffic condition of a road is strongly impacted by its upstream and downstream flow, we use a convolutional subnetwork and a recurrent subnetwork to maintain the road topology in the proposed model.

Convolutional Layer The CNN is used to extract temporal and "width" spatial information. As demonstrated in the example of Figure 3, when feeding into our model, L4's first-order upstream neighbor L5 should be copied twice, because there are two paths to L4, namely [L6, L5] and [L2, L5]. With the exponential growth of paths, the model suffers from high dimensionality and intensive computation. Therefore, we employ a convolution operation with multiple encoders and shared weights (LeCun, Bengio, and others 1995). To further reduce the parameter space while maintaining independence among vertices of the same order, we set the convolution stride to the convolution kernel window size, which is equal to the length of a vertex's observation representation.
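Concretely, the slot inputs can be built by enumerating all directed paths up to the perceptive radius, duplicating a vertex once per path it lies on, as with L5 above. A small sketch under an assumed adjacency container:

```python
# Sketch: enumerate upstream paths of length r ending at `target`,
# duplicating vertices that appear on several paths (as L5 above).
# `predecessors` maps a vertex to the vertices flowing into it (assumed name).
def upstream_paths(target, predecessors, r):
    paths = [[target]]
    for _ in range(r):
        paths = [[u] + p for p in paths for u in predecessors[p[0]]]
    return paths  # each path: [r-th order vertex, ..., 1st order, target]

# Example from Figure 3(a): L5 flows into L4; L6 and L2 flow into L5.
preds = {"L4": ["L5"], "L5": ["L6", "L2"], "L6": [], "L2": []}
print(upstream_paths("L4", preds, 2))
# [['L6', 'L5', 'L4'], ['L2', 'L5', 'L4']]  -> L5 appears twice
```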

The non-linear convolutional features are obtained as follows:

$$e^r_{up,q} = \sigma(W_{up,q} * U_{v,t} + b_{up,q}), \qquad (3)$$

$$e^r_{down,q} = \sigma(W_{down,q} * D_{v,t} + b_{down,q}), \qquad (4)$$

where $U_{v,t} = [X^1_{v,t}, \cdots, X^r_{v,t}]$ (upstream neighbors only) denotes the upstream input matrix and $D_{v,t}$ the downstream input matrix. $e^r_{\cdot,q}$ is the $r$th-order vector of the upstream or downstream module, where $q \in \{1, 2, \ldots, m\}$ and $m$ is the number of feature maps. We set $e^r_{up} = [e^r_{up,1}, \cdots, e^r_{up,m}]$ with $e^r_{up} \in \mathbb{R}^{l \times m}$, where $l$ is the number of observations in a slot; $e^r_{down}$ is obtained similarly. The weights $W$ and biases $b$ compose the parameters of the CNN subnetworks. $\sigma$ represents a nonlinear activation; we empirically adopt the tanh function here.
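Since the stride equals the kernel window, each filter sees exactly one vertex at a time, so the shared-weight convolution reduces to the same affine map applied independently to every vertex in the slot. A minimal sketch of Eq. (3)-(4) in that form, with assumed shapes:

```python
import numpy as np

def slot_convolution(slot, W, b):
    """Shared-weight convolution over one order slot.

    slot : (l, d) array - l vertices, each a flattened observation
           vector of length d (the kernel window size).
    W    : (m, d) array - m shared filters; b : (m,) bias.
    With stride == kernel width d, the convolution applies the same
    affine map to each vertex, keeping vertices independent.
    Returns (l, m): one m-dimensional feature vector per vertex.
    """
    return np.tanh(slot @ W.T + b)  # tanh activation, as in the paper
```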

Recurrent Layer The RNN is utilized to represent each path that goes into the target location (upstream path) or out from the target location (downstream path). The use of RNNs for traffic prediction has been investigated for a long time: (Dia 2001b) used a time-lag RNN for short-term speed prediction (from 20 seconds to 15 minutes), and (Lint, Hooqendoorn, and Zuvlen 2002) adopted an RNN to model state-space dynamics for travel time prediction. In our proposed method, since the upstream flow runs from high order to low order while the downstream flow runs the other way, the outputs of the CNN layers in the upstream and downstream modules are fed into the RNN layers separately.

The structure along the vehicle flow direction uses an LSTM with "peephole" connections to encode a path as a sequential representation. In the LSTM, the forget gate f controls which parts of the memory cell c to erase, the input gate i helps ingest new information, and the output gate o exposes the internal memory state. Specifically, given an $r$th-order slot matrix $e^r_{down} \in \mathbb{R}^{l \times m}$, the LSTM maps it to a hidden representation $h^r_{down} \in \mathbb{R}^{l \times d}$ as follows:

$$\begin{bmatrix} \tilde{c}^r \\ o^r \\ i^r \\ f^r \end{bmatrix} = \begin{bmatrix} \tanh \\ \sigma \\ \sigma \\ \sigma \end{bmatrix} \left( W_p \begin{bmatrix} e^r \\ h^{r-1} \end{bmatrix} + b_p \right), \qquad (5)$$

$$c^r = \tilde{c}^r \odot i^r + c^{r-1} \odot f^r, \qquad (6)$$

$$h^r = [o^r \odot \tanh(c^r)]^T, \qquad (7)$$

where $e^r \in \mathbb{R}^{l \times m}$ is the input at the $r$th order step; $W_p \in \mathbb{R}^{4d \times (m+d)}$ and $b_p \in \mathbb{R}^{4d}$ are the parameters of the affine transformation; $\sigma$ denotes the logistic sigmoid function and $\odot$ denotes elementwise multiplication.

Figure 4: An example of the model architecture. There are three slots in the upstream and downstream modules respectively, each with an input vertex length of two. The convolution operation has four shared feature maps, followed by max-pooling and an LSTM in each module, with slot attention weights α3, α2, α1 and α'1, α'2, α'3. The middle block shows the target location propagated through a fully-connected operation, and a multi-task module with four cost layers (15-, 30-, 45-, and 60-minute square error) forms the top block. Conv: convolution; FC: fully-connected.

The updates of the upstream and downstream LSTM units can be written precisely as follows:

$$h^r_{down} = \text{LSTM}(h^{r-1}_{down}, e^r_{down}, \theta_p), \qquad (8)$$

$$h^r_{up} = \text{LSTM}(h^{r+1}_{up}, e^r_{up}, \theta_p), \qquad (9)$$

where LSTM(·, ·, ·) is shorthand for Eq. (5-7) and $\theta_p$ represents all the parameters of the LSTM.
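A minimal sketch of how Eq. (8)-(9) could be realized with an off-the-shelf LSTM, treating order slots as sequence steps and the l parallel paths as the batch; the shapes and variable names are assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

m, d = 16, 32        # CNN feature maps m, hidden size d (assumed)
R, l = 3, 4          # perceptive radius R, paths per slot l (assumed)
lstm = nn.LSTM(input_size=m, hidden_size=d)   # shared parameters theta_p

e = torch.randn(R, l, m)                      # e^1..e^R from the CNN layer
h_down, _ = lstm(e)                           # Eq. (8): low -> high order
h_up_rev, _ = lstm(torch.flip(e, dims=[0]))   # Eq. (9): high -> low order
h_up = torch.flip(h_up_rev, dims=[0])         # re-index so h_up[r] matches e^r
```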

Slot Attention To get the representation of each order slot, max-pooling is performed on the output of the LSTM. As $h^r$ represents the status sequence of the vertices in the corresponding order slot, we pool over each order slot to get $r$ slot embeddings $S_{up} = [s^1_{up}, \cdots, s^r_{up}]$ and $S_{down} = [s^1_{down}, \cdots, s^r_{down}]$. Since different order slots have different effects on the target prediction, we introduce an attention mechanism to align these embeddings. Given the target location's hidden representation $g$, we get the $j$th slot attention weight (Bahdanau, Cho, and Bengio 2014; Rocktäschel et al. 2015) as follows:

$$\alpha_j = \frac{\exp a(g, s^j)}{\sum_{k=1}^{r} \exp a(g, s^k)}. \qquad (10)$$

We parametrize the model $a$ as a feedforward neural network that computes the relevance between the target location and the corresponding order slot; the weights $\alpha_j$ are normalized by a softmax function. Writing $\text{ATTW}(s^j)$ as shorthand for Eq. (10), we get the upstream and downstream hidden representations as weighted sums over these slots:

$$z_{down} = \sum_{j=1}^{r} \text{ATTW}(s^j_{down}) \, s^j_{down}, \qquad (11)$$

$$z_{up} = \sum_{j=1}^{r} \text{ATTW}(s^j_{up}) \, s^j_{up}. \qquad (12)$$

Lastly, we concatenate $z_{up}$, $z_{down}$, and the target location's hidden representation $g$, and send them to the cost layer.
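A sketch of Eq. (10)-(12) under assumed dimensions; `score_net`, standing in for the relevance model a(g, s), is a hypothetical two-layer feedforward network:

```python
import torch
import torch.nn as nn

d = 32  # slot/target embedding size (assumed)
score_net = nn.Sequential(nn.Linear(2 * d, d), nn.Tanh(), nn.Linear(d, 1))

def slot_attention(g, slots):
    """g: (d,) target representation; slots: (r, d) slot embeddings.
    Returns z = sum_j alpha_j * s_j and the weights alpha."""
    pairs = torch.cat([g.expand(slots.size(0), -1), slots], dim=1)  # (r, 2d)
    scores = score_net(pairs).squeeze(-1)                           # a(g, s_j)
    alpha = torch.softmax(scores, dim=0)                            # Eq. (10)
    z = (alpha.unsqueeze(1) * slots).sum(dim=0)                     # Eq. (11)/(12)
    return z, alpha
```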

Top Layers with Multi-task Learning

The choice of cost function on the top layer is tightly coupled with the choice of output unit. We simply use the square error to fit the future conditions of the target locations.

Multi-task learning was first introduced by (Huang et al. 2014) for the traffic forecasting task. It can be considered as soft constraints imposed on the parameters arising out of several tasks (Evgeniou and Pontil 2004). These additional training examples push the parameters of the model towards values that generalize well when part of the model is shared across tasks. Forecasting future traffic conditions is a multi-task problem as time goes on, since different time points correspond to different tasks. In the DeepTransport model, all parameters are shared except the computation of the attention weights and the affine transformations of the output layer.
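As an illustration, the top block can be read as four square-error heads over the shared representation; the head layout below is our sketch under assumed dimensions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

d = 32
horizons = [15, 30, 45, 60]  # minutes; one task per horizon
heads = nn.ModuleList([nn.Linear(3 * d, 1) for _ in horizons])  # on [z_up; g; z_down]

def multitask_loss(rep, targets):
    """rep: (batch, 3d) concatenated representation;
    targets: (batch, 4) future conditions at the four horizons."""
    losses = [((head(rep).squeeze(-1) - targets[:, i]) ** 2).mean()
              for i, head in enumerate(heads)]
    return sum(losses)  # summed square-error costs, one per task
```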


Experiments

Dataset

We adopt the snowball sampling method (Biernacki and Waldorf 1981) to collect an urban areal dataset in Beijing from a commercial map app, named "MapBJ". The dataset provides traffic conditions in {fluency, slow, congestion, extreme congestion}. It contains about 349 locations, collected from March 2016 to June for every five minutes. We select the first two months of data for training and the remaining half month for testing. Besides the traffic topological graph and the time-varying traffic conditions, we also provide the speed limit of each road. Since the speed limits of different roads may be very distinct, and the location segmentation method regards this as an important reference index, we introduce a time-invariant feature called limit level and discretize it into four classes.

Evaluation

Evaluation is based on the quadratic weighted Cohen's Kappa (Ben-David 2008), a criterion for evaluating the performance of categorical sorting.

In our problem, the quadratic weighted Cohen's Kappa is characterized by three 4 × 4 matrices: the observed matrix O, the expected matrix E, and the weight matrix w. Given rater A (ground truth) and rater B (prediction), $O_{i,j}$ denotes the number of records rated i by A and j by B, $E_{i,j}$ indicates how many samples with label i are expected to be rated as j by B, and $w_{i,j}$ is the weight of the rating difference:

$$w_{i,j} = \frac{(i-j)^2}{(N-1)^2}, \qquad (13)$$

where N is the number of categories; we have N = 4 in our problem. From these three matrices, the quadratic weighted kappa is calculated as:

$$\kappa = 1 - \frac{\sum_{i,j} w_{i,j} O_{i,j}}{\sum_{i,j} w_{i,j} E_{i,j}}. \qquad (14)$$

This metric typically ranges from 0 (random agreement between raters) to 1 (complete agreement between raters).
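Eq. (13)-(14) translate directly into code. A minimal sketch, assuming 0-indexed class labels (the paper's states {1, 2, 3, 4} shifted down by one):

```python
import numpy as np

def quadratic_weighted_kappa(truth, pred, n_classes=4):
    """Quadratic weighted Cohen's kappa, Eq. (13)-(14).
    truth, pred: integer sequences with values in {0..n_classes-1}."""
    O = np.zeros((n_classes, n_classes))
    for a, b in zip(truth, pred):
        O[a, b] += 1
    # Expected matrix from the outer product of the marginal histograms.
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / len(truth)
    i, j = np.indices((n_classes, n_classes))
    w = (i - j) ** 2 / (n_classes - 1) ** 2          # Eq. (13)
    return 1 - (w * O).sum() / (w * E).sum()         # Eq. (14)
```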

Implementation Details

Since the condition value ranges over {1, 2, 3, 4}, a multi-class classification loss could be used as the objective function. However, a cost layer with softmax cross-entropy does not take into account the magnitude of the rating. Thus, the square error loss is applied as the training objective. A disadvantage of using linear regression directly is that the predicted value may fall outside the range {1, 2, 3, 4}. However, we can avoid this problem by label projection as follows:

We performed a statistical analysis of the state distribution of the training data. Fluency occupies 88.2% of all records; fluency and slow together occupy about 96.7%; fluency, slow, and congestion occupy about 99.5%; and extreme congestion is so rare that it accounts for only 0.5%. Therefore, we rank the prediction results in ascending order and set the first 88.2% to fluency, 88.2%-96.7% to slow, 96.7%-99.5% to congestion, and 99.5%-100% to extreme congestion.
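A sketch of this rank-based projection, using the stated training-set shares as cut points; the function and constant names are ours:

```python
import numpy as np

CUTS = [0.882, 0.967, 0.995]  # cumulative shares: fluency / slow / congestion

def project_labels(scores):
    """scores: (n,) raw regression outputs. Returns labels in {1..4}."""
    order = np.argsort(scores)            # ascending: smoother traffic first
    labels = np.empty(len(scores), dtype=int)
    bounds = [int(c * len(scores)) for c in CUTS]
    labels[order[:bounds[0]]] = 1                   # fluency
    labels[order[bounds[0]:bounds[1]]] = 2          # slow
    labels[order[bounds[1]:bounds[2]]] = 3          # congestion
    labels[order[bounds[2]:]] = 4                   # extreme congestion
    return labels
```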

We embed all observations into 32-dimensional continuous vectors. Training is performed by back-propagation using Adam (Kingma and Ba 2014). Parameters are initialized with uniformly distributed random variables, and we use a batch size of 1100 across 11 CPU threads, with each thread processing 100 records. All models are trained until convergence. Besides, there are two important hyperparameters in our model: the length of the historical measurement p and the perceptive radius r, which control the temporal and spatial magnitude respectively.

Choosing Hyperparameters

We intuitively suppose that expanding the perceptive radius would improve prediction accuracy, but it also increases the amount of computation, so it is necessary to explore the correlation between the target location and its corresponding rth-order neighbors.

Mutual Information (MI) measures the degree of correlation between two random variables. When MI is 0, the two random variables are completely irrelevant. When MI reaches its maximum value, it equals the entropy of one of them, and the uncertainty of the other variable can be eliminated. MI is defined as

$$MI(X;Y) = H(X) - H(X|Y) = \sum_{x \in X, y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}, \qquad (15)$$

where H(X) and H(X|Y) are the marginal entropy and conditional entropy respectively. MI describes how much uncertainty is reduced.

Dividing MI by the average entropy of the two variables gives the normalized mutual information (NMI) in [0, 1]:

$$NMI(X;Y) = \frac{2\,MI(X,Y)}{H(X) + H(Y)}. \qquad (16)$$

We calculated the NMI between the observations of each vertex and its rth-order neighbors over all time points. The NMI gradually decreases as the order increases: it is 0.116, 0.052, 0.038, 0.035, and 0.034 for r in {1, 2, 3, 4, 5} respectively, and hardly changes for r > 5.
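For reference, a minimal sketch of Eq. (15)-(16) for two discrete condition sequences (plain-Python containers assumed):

```python
import numpy as np
from collections import Counter

def nmi(x, y):
    """Normalized mutual information for two discrete sequences x, y
    (e.g. condition states in {0..4}), per Eq. (15)-(16)."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = sum(c / n * np.log((c / n) / (px[a] / n * py[b] / n))
             for (a, b), c in pxy.items())
    hx = -sum(c / n * np.log(c / n) for c in px.values())
    hy = -sum(c / n * np.log(c / n) for c in py.values())
    return 2 * mi / (hx + hy)
```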

Therefore, we set the two hyperparameters as p ∈ {3, 6, 12, 18} (corresponding to 15, 30, 60, and 90 minutes of past measurements at a 5-minute record interval) and r ∈ {1, 2, 3, 4, 5}.

Effects of Hyperparameters

Figure 5 shows the averaged quadratic weighted kappa for the corresponding prediction horizons. Figure 5(a) illustrates that 1) a closer prediction horizon always performs better, and 2) as r increases, its impact on the prediction also increases. This can be seen from the slope between r = 1 and r = 5: the slope at 60 min is greater than that of the same segment at 15 min. Figure 5(b) takes the 60-min estimation as an example, indicating that the predictive effect does not increase monotonically with the length of measurement p; the same result is obtained at other time points. This is because an increase in p brings an increase in the number of parameters, which leads to overfitting.

Figure 5: Averaged quadratic weighted kappa over the perceptive radius r and the length of historical measurement p on validation data. (a) Prediction with p = 12: the plotted kappa values for r = 1, ..., 5 are 0.6787, 0.6829, 0.6858, 0.6870, 0.6890 (15 min); 0.6047, 0.6114, 0.6192, 0.6233, 0.6267 (30 min); 0.5388, 0.5494, 0.5611, 0.5684, 0.5724 (45 min); and 0.4790, 0.4925, 0.5079, 0.5152, 0.5259 (60 min). (b) 60-minute prediction for p ∈ {3, 6, 12, 18}. The left panel illustrates that as the perceptive radius r increases, the longer-horizon predictions gain more; the right panel shows that the optimal p should be chosen in view of the perceptive radius.

Comparison with Other Methods

We compared DeepTransport with four representative approaches: Random Walk (RW), Autoregressive Integrated Moving Average (ARIMA), Feed-forward Neural Networks (FNN), and Stacked Autoencoders (SAEs).

RW: In this baseline, the traffic condition at the next moment is estimated as a random walk from the current condition plus white noise (a normal variable with zero mean and unit variance).

ARIMA: ARIMA (Ahmed and Cook 1979) is a common statistical method for learning and predicting future values of time-series data. We take a grid search over all admissible values of p, d, and q up to p = 5, d = 2, and q = 5.

FNN: We also implemented a Feed-forward Neural Network (FNN) with a single hidden layer and an output layer with a regression cost. The hidden layer has 32 neurons, and four output neurons cover the prediction horizons. The hyperbolic tangent function and a linear transfer function are used for the activation and the output respectively.

SAEs: We also implemented SAEs (Lv et al. 2015), one of the most effective deep learning based methods for traffic condition forecasting. It concatenates the observations of all locations into a large input vector. SAEs can also be viewed as a pre-training version of the FNN with a large input vector proposed by (Polson and Sokolov 2017). The stacked autoencoder is configured with four layers of [256, 256, 256, 256] hidden units for pre-training. After that, a multi-task linear regression model is trained on the top layer.

Besides, we also provide the results of DeepTransport with two configurations: r = 1, p = 12 (DeepTransport-R1P12) and r = 5, p = 12 (DeepTransport-R5P12).

Table 1 shows the results of our model and the baselines on MapBJ. In summary, the models that use spatial information (SAEs, DeepTransport) achieve significantly higher performance than those that do not (RW, ARIMA, FNN), especially at longer prediction horizons. On the other hand, SAEs take a fully-connected form, which assumes that any couple of locations connect directly to each other and thus neglects the topology of the transport network. On the contrary, DeepTransport considers the traffic structure, resulting in higher performance than these baselines and demonstrating that our proposed model has good generalization performance.

Table 1: Model performance comparison (quadratic weighted kappa) at various future time points.

Model                | 15-min | 30-min | 45-min | 60-min | Avg.
RW                   | 0.5106 | 0.4474 | 0.3917 | 0.3427 | 0.4231
ARIMA                | 0.6716 | 0.5943 | 0.5389 | 0.4545 | 0.5648
FNN-P12              | 0.6729 | 0.5960 | 0.5292 | 0.4689 | 0.5667
SAEs                 | 0.6782 | 0.6157 | 0.5553 | 0.4919 | 0.5852
DeepTransport-R1P12  | 0.6787 | 0.6114 | 0.5494 | 0.4925 | 0.5841
DeepTransport-R5P12  | 0.6889 | 0.6267 | 0.5724 | 0.5259 | 0.6035

Figure 6: Average attention weight alignments, quantifying the spatial-temporal dependency relationships. Rows give the perceptive radius (order slot) from 5 down to 1; columns give the prediction horizon in minutes (15, 30, 45, 60).

(a) Downstream attention weights:
r=5: 0.19  0.27  0.30  0.33
r=4: 0.16  0.15  0.17  0.20
r=3: 0.14  0.12  0.11  0.13
r=2: 0.15  0.15  0.13  0.14
r=1: 0.36  0.31  0.29  0.20

(b) Upstream attention weights:
r=5: 0.10   0.16  0.16  0.18
r=4: 0.093  0.11  0.11  0.10
r=3: 0.24   0.22  0.22  0.22
r=2: 0.22   0.18  0.17  0.15
r=1: 0.34   0.34  0.35  0.35

The downstream alignments capture our intuition that as the prediction time increases, the attention weights shift from low-order slots to higher ones. In the upstream alignments, the model pays more attention to lower orders because traffic flow at higher orders is dispersed.

Slot Attention Weights

DeepTransport can also reveal the influence of each slot on the target location by inspecting the slot attention weights. Figure 6 illustrates the attention weights over prediction minutes and perceptive radius, averaged over all target locations. For the downstream order slots, as shown in Figure 6(a), it can be seen that as the prediction time increases, the attention weights shift from low-order slots to higher ones. On the other side, Figure 6(b) shows that the upstream first-order slot has the most impact on the target location for any future time. To capture this intuition, we use a sandglass as a metaphor for the spatial-temporal dependencies of traffic flow. The flowing sand passes through the aperture of a sandglass just as traffic flows through the target location. For the downstream part, the sand first sinks to the bottom; after a period, this accumulated sand affects the aperture, just like congestion accumulating from the higher orders to the lower orders. Thus, when we predict the long-period condition of the target location, our model prefers to refer to the current conditions of higher-order slots. The upstream part is a little different: higher-order slots are no longer important references because traffic flow at higher orders is dispersed, and the target location may not be the only channel for upstream traffic flow. The nearest locations are the ones that can directly affect the target location, just like the sand gathering at the aperture of the sandglass. So the future condition of the target location puts more attention on the lower orders. Although the higher-order rows receive less attention in the upstream module, there is still a gradual change as the prediction minutes increase.

Case Study

For office workers, it might be more valuable to know when traffic congestion comes and when the traffic condition will ease. We analyze the model performance over time in Figure 7, which shows the Root Mean Square Error (RMSE) between the ground truth and the prediction results of RW, ARIMA, SAEs, and DeepTransport-R5P12. There are two peak periods, during the morning and evening rush hours. We draw three points from this figure:

1. During flat periods, especially in the early morning, there is almost no difference between the models, as almost all roads are fluent.

2. Rush hours are usually used to test the effectiveness of models. When the prediction horizon is 15 minutes, DeepTransport has lower errors than the other models, and its advantage is more obvious when predicting a farther point in time (60-minute prediction).

3. After the traffic peak, it is helpful to know when the traffic condition will ease. The results just after the traffic peaks show that DeepTransport predicts better over these periods.

Related Works

There has been a long thread of statistical models based on solid mathematical foundations for traffic prediction. ARIMA (Ahmed and Cook 1979) and its many variants (Kamarianakis and Vouton 2003; Kamarianakis and Prastacos 2005; Kamarianakis, Shen, and Wynter 2012) played a central role due to their effectiveness and interpretability. However, statistical methods rely on a set of constraining assumptions that may fail when dealing with complex and highly nonlinear data. (Karlaftis and Vlahogianni 2011) compare the differences and similarities between statistical methods and neural networks in transportation research.

To our knowledge, the first deep learning approach to traffic prediction was published by (Huang et al. 2014); they used a hierarchical structure with a Deep Belief Network (DBN) at the bottom and a (multi-task) regression layer on top. Afterward, (Lv et al. 2015) used a deep stacked autoencoder (SAEs) model for traffic prediction. A comparison between SAEs and DBN for traffic flow prediction was investigated by (Tan et al. 2016). More recently, (Polson and Sokolov 2017) concatenated all observations into a large vector and fed them to a Feed-forward Neural Network (FNN) that predicted future traffic conditions at each location.

Figure 7: Model comparison with RMSE over time when the prediction horizon equals 3 (15-minute, panel a) and 12 (60-minute, panel b). Curves: RW, ARIMA, SAEs, DeepTransport-R5P12; the x-axis covers the day from shortly after midnight to late evening in 30-minute steps.

On other spatial-temporal tasks, several recent deep learning works attempt to capture both temporal and spatial information. DeepST (Zhang et al. 2016) uses convolutional neural networks to predict citywide crowd flows. Meanwhile, ST-ResNet (Zhang, Zheng, and Qi 2016) uses a residual neural network framework to forecast the surrounding crowds in each region of a city. These works partition a city into an I × J grid map based on longitude and latitude (Lint, Hooqendoorn, and Zuvlen 2002), where each grid cell denotes a region. However, MapBJ provides the traffic network in the form of road sections instead of longitude and latitude, and the road partition method should consider the speed limit level rather than cutting roads equally by length. Due to these differences in data granularity, we do not follow these methods for traffic forecasting.

Conclusions

In this paper, we demonstrate the importance of using road temporal and spatial information in traffic condition forecasting. We proposed a novel deep learning model (DeepTransport) to learn the spatial-temporal dependency. The model not only adopts two sequential models (CNN and RNN) to capture spatial-temporal information but also applies an attention mechanism to quantify the spatial-temporal dependency relationships. We further released a large real-world traffic condition dataset including millions of recordings. Our experiments show that DeepTransport significantly outperforms previous statistical and deep learning methods for traffic forecasting.


References

[Ahmed and Cook 1979] Ahmed, M. S., and Cook, A. R. 1979. Analysis of freeway traffic time-series data by using Box-Jenkins techniques. Number 722.

[Bahdanau, Cho, and Bengio 2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

[Ben-David 2008] Ben-David, A. 2008. Comparison of classification accuracy using Cohen's weighted kappa. Expert Systems with Applications 34(2):825-832.

[Biernacki and Waldorf 1981] Biernacki, P., and Waldorf, D. 1981. Snowball sampling: Problems and techniques of chain referral sampling. Sociological Methods & Research 10(2):141-163.

[Chang and Su 1995] Chang, G.-L., and Su, C.-C. 1995. Predicting intersection queue with neural network models. Transportation Research Part C: Emerging Technologies 3(3):175-191.

[Deng et al. 2016] Deng, D.; Shahabi, C.; Demiryurek, U.; Zhu, L.; Yu, R.; and Liu, Y. 2016. Latent space model for road networks to predict time-varying traffic. arXiv preprint arXiv:1602.04301.

[Dia 2001a] Dia, H. 2001a. An object-oriented neural network approach to short-term traffic forecasting. European Journal of Operational Research 131(2):253-261.

[Dia 2001b] Dia, H. 2001b. An object-oriented neural network approach to short-term traffic forecasting. European Journal of Operational Research 131(2):253-261.

[Evgeniou and Pontil 2004] Evgeniou, T., and Pontil, M. 2004. Regularized multi-task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 109-117. ACM.

[Hochreiter and Schmidhuber 1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735-1780.

[Huang et al. 2014] Huang, W.; Song, G.; Hong, H.; and Xie, K. 2014. Deep architecture for traffic flow prediction: Deep belief networks with multitask learning. IEEE Transactions on Intelligent Transportation Systems 15(5):2191-2201.

[Innamaa 2000] Innamaa, S. 2000. Short-term prediction of traffic situation using MLP-neural networks. In Proceedings of the 7th world congress on intelligent transport systems, Turin, Italy, 6-9.

[Jeong et al. 2013] Jeong, Y.-S.; Byon, Y.-J.; Castro-Neto, M. M.; and Easa, S. M. 2013. Supervised weighting-online learning algorithm for short-term traffic flow prediction. IEEE Transactions on Intelligent Transportation Systems 14(4):1700-1707.

[Kamarianakis and Prastacos 2005] Kamarianakis, Y., and Prastacos, P. 2005. Space-time modeling of traffic flow. Computers & Geosciences 31(2):119-133.

[Kamarianakis and Vouton 2003] Kamarianakis, Y., and Vouton, V. 2003. Forecasting traffic flow conditions in an urban network: Comparison of multivariate and univariate approaches. Transportation Research Record 1857(1):74-84.

[Kamarianakis, Shen, and Wynter 2012] Kamarianakis, Y.; Shen, W.; and Wynter, L. 2012. Real-time road traffic forecasting using regime-switching space-time models and adaptive lasso. Applied Stochastic Models in Business and Industry 28(4):297-315.

[Karlaftis and Vlahogianni 2011] Karlaftis, M. G., and Vlahogianni, E. I. 2011. Statistical methods versus neural networks in transportation research: Differences, similarities and some insights. Transportation Research Part C: Emerging Technologies 19(3):387-399.

[Kingma and Ba 2014] Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[LeCun, Bengio, and Hinton 2015] LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436-444.

[LeCun, Bengio, and others 1995] LeCun, Y.; Bengio, Y.; et al. 1995. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks 3361(10).

[Lint, Hooqendoorn, and Zuvlen 2002] Lint, J. W. C. V.; Hooqendoorn, S. P.; and Zuvlen, H. J. V. 2002. Freeway travel time prediction with state-space neural networks: Modeling state-space dynamics with recurrent neural networks. Transportation Research Record: Journal of the Transportation Research Board 1811(1):347-369.

[Lv et al. 2015] Lv, Y.; Duan, Y.; Kang, W.; Li, Z.; and Wang, F.-Y. 2015. Traffic flow prediction with big data: A deep learning approach. IEEE Transactions on Intelligent Transportation Systems 16(2):865-873.

[Pan, Demiryurek, and Shahabi 2012] Pan, B.; Demiryurek, U.; and Shahabi, C. 2012. Utilizing real-world transportation data for accurate traffic prediction. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, 595-604. IEEE.

[Polson and Sokolov 2017] Polson, N. G., and Sokolov, V. O. 2017. Deep learning for short-term traffic flow prediction. Transportation Research Part C: Emerging Technologies 79:1-17.

[Rocktäschel et al. 2015] Rocktäschel, T.; Grefenstette, E.; Hermann, K. M.; Kočiský, T.; and Blunsom, P. 2015. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664.

[Stathopoulos and Karlaftis 2003] Stathopoulos, A., and Karlaftis, M. G. 2003. A multivariate state space approach for urban traffic flow modeling and prediction. Transportation Research Part C: Emerging Technologies 11(2):121-135.

[Tan et al. 2016] Tan, H.; Xuan, X.; Wu, Y.; Zhong, Z.; and Ran, B. 2016. A comparison of traffic flow prediction methods based on DBN. In CICTP 2016, 273-283.

[Vlahogianni, Karlaftis, and Golias 2005] Vlahogianni, E. I.; Karlaftis, M. G.; and Golias, J. C. 2005. Optimized and meta-optimized neural networks for short-term traffic flow prediction: A genetic approach. Transportation Research Part C: Emerging Technologies 13(3):211-234.

[Williams and Hoel 1999] Williams, B. M., and Hoel, L. A. 1999. Modeling and forecasting vehicular traffic flow as a seasonal stochastic time series process. Technical report.

[Williams 2001] Williams, B. 2001. Multivariate vehicular traffic flow prediction: Evaluation of ARIMAX modeling. Transportation Research Record: Journal of the Transportation Research Board.

[Zhang et al. 2016] Zhang, J.; Zheng, Y.; Qi, D.; Li, R.; and Yi, X. 2016. DNN-based prediction model for spatio-temporal data. In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 92. ACM.

[Zhang, Zheng, and Qi 2016] Zhang, J.; Zheng, Y.; and Qi, D. 2016. Deep spatio-temporal residual networks for citywide crowd flows prediction. arXiv preprint arXiv:1610.00081.