StreetNet: Preference Learning with Convolutional Neural ...ctlu/Publication/2018/ACM-GIS-2018-Kaiqun.pdf · and law enforcement agencies....

StreetNet: Preference Learning with Convolutional NeuralNetwork on Urban Crime Perception

Kaiqun FuVirginia Tech

7054 Haycock RoadFalls Church, Virginia 22043

[email protected]

Zhiqian ChenVirginia Tech


[email protected]

Chang-Tien LuVirginia Tech


[email protected]

ABSTRACTOne can infer from the broken window theory that the perceptionof a city street’s safety level relies signicantly on the visual appear-ance of the street. Previous works have addressed the feasibility ofusing computer vision algorithms to classify urban scenes. Mostof the existing urban perception predictions focus on binary out-comes such as safe or dangerous, wealthy or poor. However, binarypredictions are not representative and cannot provide informativeinferences such as the potential crime types in certain areas. Inthis paper, we explore the connection between urban perceptionand crime inferences. We propose a convolutional neural network(CNN) - StreetNet to learn crime rankings from street view images.The learning process is formulated on the basis of preference learn-ing and label ranking settings. We design a street view imagesretrieval algorithm to improve the representation of urban percep-tion. A data-driven, spatiotemporal algorithm is proposed to ndunbiased label mappings between the street view images and thecrime ranking records. Extensive evaluations conducted on imagesfrom dierent cities and comparisons with baselines demonstratethe eectiveness of our proposed method.

CCS CONCEPTS•Information systems→ Geographic information systems; Datamining; •Computing methodologies→ Learning to rank; Percep-tion;

KEYWORDSpreference learning, street view, convolutional neural networks,spatial analysis

ACM Reference format:Kaiqun Fu, Zhiqian Chen, and Chang-Tien Lu. 2018. StreetNet: PreferenceLearning with Convolutional Neural Network on Urban Crime Perception.In Proceedings of 26th ACM SIGSPATIAL International Conference on Advancesin Geographic Information Systems, Seattle, WA, USA, November 6–9, 2018(SIGSPATIAL ’18), 10 pages.DOI: 10.1145/3274895.3274975

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor prot or commercial advantage and that copies bear this notice and the full citationon the rst page. Copyrights for components of this work owned by others than ACMmust be honored. Abstracting with credit is permitted. To copy otherwise, or republish,to post on servers or to redistribute to lists, requires prior specic permission and/or afee. Request permissions from [email protected] ’18, Seattle, WA, USA© 2018 ACM. 978-1-4503-5889-7/18/11. . . $15.00DOI: 10.1145/3274895.3274975

1 INTRODUCTIONThe broken window theory is a criminological theory of the norm-setting and signaling eect of urban disorder and vandalism onadditional crime and anti-social behavior. The theory was rst pro-posed by James Wilson and George Kelling in The Atlantic Monthlyin March 1982 [42]; quotes: Consider a building with a few brokenwindows. If the windows are not repaired, the tendency is for van-dals to break a few more windows. Eventually, they may even breakinto the building, and if it’s unoccupied, perhaps become squattersor light res inside. Similar to the rapid development of the ideathat social network’s justify the six degrees of separation theoryin sociology, the broken window theory in criminology may ndits endorsement in our era of big data. Previous studies on urbancrime analysis [22, 36, 37] have addressed signicant associationsbetween the locations of crime oenses and the categories of theoenses. However, all those works neglect the impact of street viewimages on urban safety perception problems.

Theft:Robbery:Burlary:

(a) Burglaries in Washington D. C.

Theft:Burlary:

Robbery:

(b) Robberies in Washington D. C.

Robbery:Theft:

Burlary:

(c) Burglaries in New York City

Assault:Burlary:

Robbery:

(d) Robberies in New York City

Figure 1: Spatial Distributions of Dierent Categories ofCrime

With the advent of image-based crowdsourcing services such asFlickr and Instagram, users can easily generate image data. Panoramic

image services such as Google Street View are also ubiquitously ac-cessible from the Internet. Previous studies have presented spatialcorrelations between crime levels and residences of oenders due tothe fact that most oenders prefer to commit illegal activities closeby, and oenders will follow the same criminal patterns while theyare committing illegal activities. For example, a person with bur-glary records is likely to commit burglary in the future [3]. Furtherassumptions are made for our exploration that dierent types ofcrimes will aect the urban visual appearance in a variety of ways.For instance, convenience stores located in suburban areas withhigh robbery rates will be equipped with substantial barriers oreven bulletproof armor; a lot of grati will be witnessed in placeswith inadequate law enforcement. Figure 1 illustrates crime rateheat maps for dierent oense categories for both Washington D.C. and New York City. Figure 1 also shows that dierent areas ofthe cities are represented by dierent urban perceptions, and suchdistinctions can be utilized to infer hidden crime rankings. This isthe main focus of our paper.

Learning crime rankings from urban perception or street viewimages can be challenging. This poses the following three issues:1) Features of images for crime ranking are not explicit. Rep-resentations of urban appearances from street view images varysubstantially due to changes in camera direction and imaging andlighting conditions. Previous studies [11, 43] in image classicationextract features from clustered bag-of-visual-words (BOVW) meth-ods, but the extracted features are not interpretable. 2) Prior ge-ographical knowledge should be considered for street viewimages retrieval. Learning hidden knowledge from street viewimage datasets is dierent from traditional image classicationproblems. The selection of camera directions signicantly aectsthe prediction results. Optimal camera directions are ones that areperpendicular to the streets’ direction because this direction canminimize the noise introduced by recorded vehicles or pedestri-ans. 3) Lack of labeled datasets. To learn crime rankings fromstreet view images, reasonable and unbiasedly labeled trainingdata is required. However, there are no existing labeled streetview image data available for crime ranking tasks. In addition,the techniques of feature engineering request tedious labor, andthe human-annotated corpora are insucient for training practi-cable models to identify crime rankings. 4) Insucient urbanperception study on multi-label analysis. Previous studies onurban perceptions only focus on binary classications of incomeand safety levels for neighborhoods [16, 32, 33], which provide lessinformative inspections than the multi-label learning for residentsand law enforcement agencies.

The methods proposed in this paper eectively address the above-mentioned issues. The proposed convolutional neural network ex-tracts hidden features from street view images, and we formulatethe crime rank learning problem under the preference learningframework. Improving upon previous works on safety level predic-tion [32], we demonstrate the feasibility of inferring crime rankingknowledge from cities’ visual appearances.According to the pre-vious assumption, we utilize spatiotemporal correlations betweenstreet view images and criminal oenses to construct crime rankinglabels. The major contributions of this paper can be summarized asfollows:

•Propose a convolutional neural network (CNN) based pref-erence learning approach for crime ranking inference fromstreet view images: A convolutional neural network is proposedand trained on street view images labeled with crime rankings frommultiple cities. We formulate the problem under the settings ofpreference learning and label ranking.

•Develop a street view image retrieval algorithm with im-proved abilities in representing actual urbanperceptions: Anecient street view image retrieval algorithm is designed and imple-mented while generating the image datasets. The retrieved imagedatasets provide better urban perception representations than theprevious datasets for those places. Such improvement assists ourmodel in achieving a better prediction performance.

•Design a data-driven spatiotemporal street view imageand crime ranking labeling strategy: A spatiotemporal basedstreet view images and criminal oense record mapping algorithmis designed for labeling the images. The proposed labeling schemeis more representative and ecient than previous methods becausethe process is unbiased and systematic.

•Extensive experiments andmake comparisons to validatethe eectiveness and eciency of the proposed techniques:We compare our proposed convolutional neural network with var-ious methods. Conventional methods for learning label rankingsare selected for comparisons. Evaluations of various metrics anddetailed case study analysis are presented illustrating the eective-ness of the proposed method. Interesting discoveries for street viewimages’ perception radius (in feet) are also presented and discussed.

The rest of our paper is structured as follows. Related worksare reviewed in section 2. In section 3, we describe the problemsetup of our work. In section 4, we present a detailed discussion ofour proposed methods for predicting potential crime rankings fromstreet view images. In section 5, extensive experiment evaluationsand comparisons are presented. In the last section, we discuss ourconclusion and directions for future work.

2 RELATEDWORKSIn this section, we provide a detailed review of the current stateof research for urban crime perception problem. There are sev-eral threads of related work of this paper: urban perception fromstreet view imagery data, scene recognition and classication, andpreference learning on multi-label learning.

Urban Perception. The earliest studies on urban perception [3,4] indicate strong spatial coherence between the locations of illegaloenses and the residences of the oenders; these studies conrmthat oenders who commit robberies, residential burglaries, theftsfrom vehicles, and assaults are more likely to target their currentand former residential area than similar areas they never livedin. Previous works [10, 13, 20, 40] addressed problems of regionalpublic safety and urban appearance perception. For example, corre-lations between a high initial level of homicide and losses in totalpopulation are observed [30] in suburban areas adjacent to a largecity like Chicago. However, without street view images under a city-wide coverage, these previous works drew conclusions based onexperiments with small datasets (160 manually taken photographs),which is insucient for mining latent patterns for the majorityof the urban appearances. In contrast, the method proposed in

this paper is trained on 44,694 street view images from two cities:Washington, D.C., and New York City.

Recent branches of works in urban perception applied computervision and deep learning techniques to improve the resolution,precision, and scale. Ordonez et al. [34] proposed a regressionmodel to predict the perceptual characteristics of places for wealth,uniqueness, and safety. The proposed model utilized features suchas Gist, SIFT and Fisher Vectors. Such hand-craft features werenot representative enough for large street view datasets and wereoutperformed by deep learning-based algorithms. Dubey et al. [12]proposed a convolutional neural network to quantify the urban per-ception along six perceptual attributes: safe, lively, boring, wealthy,depressing and beautiful. Andersson et al. [1] proposed a novel4-Cardinal Siamese convolutional neural network to predict urbancrime rates. However, this model applied four pre-trained VGG-16architecture, which is not representative of the urban perceptiontasks. Liu et al. [26] also proposed a convolutional neural networkfor urban safety perception based on the crime dataset. Most of thedeep learning based urban perception methods for safety inferencefocus on crime rate prediction and safety level comparison. Sub-jective labels are inevitably introduced in the previous works asmost of the evaluations of the studies are performed by humans.In this paper, we inspirationally address the correlations betweenthe urban appearance and the crime types. We also objectivelylabeled and evaluated based on ocial crime records as the goldenstandard.

Scene Recognition and Classication. Previous works havedemonstrated the feasibility of considering images of the appear-ance of city streets as an indicator of hidden urban inferences suchas safety, wealth, and aesthetics [2, 14, 49]. Several previous workshave proposed computer vision techniques based on supervisedclassication algorithms such as SVM or convolutional neural net-works (CNN) for predicting the safety level of a specic urban area.Although the question, “Does this place look safe?" has been re-solved, previously proposed works only consider binary classes ofsafety levels or solving the safety index regression problem. Var-ious research ideas on street view images have been proposed inrecent years. Zamir et al. [45, 46] proposed a street view image loca-tion retrieval approach with SIFT vocabulary trees and generalizedminimum clique graphs. Similar research problems of recognizingobjects such as street numbers [17], storefronts [31], and otherobject recognition [15, 44, 48] also were addressed recently. Otherworks focusing on 3-D reconstruction and city modeling based onstreet level imagery have been proposed [7, 29].

Preference Learning. Preference learning algorithm for rank-ing was previously proposed in [19] for multi-label learning prob-lems. Previous researchers utilized constraints derived from multi-label instances to enforce that the ranking of relevant classes ishigher than the irrelevant ones. Based on the proposed preferencelearning structure, further applications of multiple-object detectionand image tag ranking problems [23, 24] have been studied undersuch a problem setting. Most of the previous works in multi-labelranking applied the pairwise model [6]. However, the pairwisemodel for learning label preferences often suers from the expen-sive computation. We formulate crime preference learning using aconvolutional neural network. Such a design exploits convolutional

layer’s advantages for image feature extraction and deep neuralnetwork’s learning ability for multi-label tasks.

3 PROBLEM STATEMENTWith machine learning algorithms on a huge street view imagedataset, it may be feasible for a human to perceive or inference thetypes of criminal acts that are mostly likely to be committed to himin a certain area.

Consider a setting where potential crime rankings are inferredbased on the given street view image in a certain area. We namethis procedure a perceptional crime rank inference.

In the problem setting, we are given a street view image space Iand a nite set of crime labels C = c1, c2, ..., ck . The assumptionhas been made that there is a hidden correlation between the phys-ical appearances of the city areas and the crime rankings in thoseareas. We denote the training dataset ofn inferences asDn ⊆ I×C.

The general goal is to learn a “crime ranker” in the form of aI → SC mapping, where the output space SC is given by the set ofall permutations of the set of crimes C. Thus, label ranking can beseen as a generalization of conventional classication, a completeranking is associated with a street view image I :

cπ −1I (1)I cπ −1I (2)

I ...I cπ −1I (k )(1)

where πI is a permutation of c1, c2, ..., ck such that π−1I (i) isthe position of crime ci i in the ranking associated with the givenstreet view I .

We formulate the problem of crime ranking from street viewimages as a pointwise preference learning problem on dierentcrime types. The goal is to learn a relevance score fi (I ) = reliprediction function for each crime type ci from the street viewimages, and a set of pairwise preferences of the form ciI c j fromthe training data Dn . Such an outcome suggests that for streetview image I , ci is preferred to c j . For each rank judgment on crimepairs ci and c j , the goal is to estimate a function f ∈ I → R andF = f | fi (I ) > fj (I ) ⇔ ciI c j ; (i , j), where fi represents aprediction function for crime type i . To generalize the proposedproblem, we present the following:

f ∗ ∈ argminf ∈F

∑Dn

R∗ (f ) + Ω (f ) (2)

where R∗(f ) corresponds to the empirical risk whose perfor-mance is controlled by the selection of the loss function. A generalrepresentation of the empirical loss is given by:

R∗(f ) =1|Dn |

∑(I ,C)∈Dn

L(y, f (I )) (3)

To compare with the baseline algorithms, we discuss loss func-tion selections for the conventional rank learning settings. The lossfunction L(y, f (I )) in the empirical risk determines the descendingdirection of the learning process. Note that y is the true relevancescore of an image I for a given crime type. Under the pairwise pref-erence learning setting, various loss functions can be chosen. In thispaper, two loss functions are considered: 1) the logistic loss/crossentropy loss and 2) the squared Hinge loss for the SVMs. Both lossfunctions are smooth and convex. Consequentially, squared hingeloss and logistic loss are formed respectively:

Lhinдe2 =∑Dn

max2(1 − ϕ(wT FI + b), 0) (4)

Ωl2 (f ) = λ‖w‖2 (5)

In Equation 5, Ω(f ) is the regularization term for controllingthe complicity of the model. For the SVM classiers, only l2 normregularization is utilized, shown in Equation 5, where λ is the trade-o parameter controlling the complexity of the model.

4 METHODOLOGYIn this section, we discuss the design of the proposed convolutionalneural network and its training and solution processes. We alsoprovide detailed discussions of the direction-based, street viewimage retrieval algorithm.

4.1 StreetNetIn conventional image classication tasks, performance is greatlydependent on feature selection. However, information loss is in-evitably introduced to the classier with such feature extractionmechanisms. In contrast, convolutional neural networks signi-cantly keep complete image information. We propose an convo-lutional neural network - StreetNet for crime type inference fromstreet view images. The structure of the proposed network is pre-sented in Figure 2. The rst several layers of the neural network areconvolutional layers, and they can be considered as feature extrac-tion operators on the images globally. The dierence between ourconvolutional neural network based rank learning and other point-wise rank learning algorithms is that we can learn the relevancescore simultaneously for dierent crime types. This advantageis introduced by the structures of the fully connected layers andoutput layer of our convolutional neural network.

4.1.1 Latent Features Extraction. Convolutional layers are im-plemented for extracting latent features of street view images. AConvolutional layer performs a convolution operation with a ltersize of k × k on the output of its previous layer. The convolutionallayer is represented:

Inj = f©«Ln−1∑i=1

In−1i ∗Wni j + b

njª®¬ (6)

where I is the image feature matrix, n represents the nth layer ofthe convolutional neural network; W is the attened lter with asize of k × k ; bnj is the bias of the feature lter W; f is the speciedactivation function; and ∗ is the 2D convolution operation. Themax-pooling layer calculates the maximum activation on the areasthat are not overlapping with the lter W. The max-pooling layerdown- samples the street view images by the size of the lter.

4.1.2 Hidden Features Classification. Fully connected layers areutilized for inferring relevance scores from the extracted latentfeatures. For each crime type, our goal is to learn a regressionof the relevance score for the given street view image. A linearoperation with weight matrix w and bias b is performed on theoutput features of the last convolutional layer. The result of thislinear operation is fed into a rectied linear unit (ReLU ) activationfunction. For each hidden node in the fully connected layer, ReLU

outputs an activation. In the last output layer, we sum the acti-vations and multiply the sum of the activations by a vector of 1s.While training, the root-means- squared-error (RMSE) is selectedas the loss function for the fully connected layers. The design ofour convolutional neural network is shown in Figure 2.

4.1.3 Parameter Optimization. Various selections of optimiza-tion methods are available to optimize the empirical risk minimiza-tion problem in convolutional neural networks. In our experiment,we use AdaDelta [47], a variation of gradient descent, for optimizingthe neural network.

The AdaDelta on the other hand restricts the window of accu-mulated past gradients to some xed size w. This method reducesthe aggressively decreasing learning rate compared to the previousmethods. For representation simplicity, we dene: дt = ∇wR∗(w).The updating expectation E[д2]t at time t depends on the previousexpectation and the current gradient:

E[д2]t = γE[д2]t−1 + (1 − γ )дt 2 (7)

where γ is similar to the momentum term. In our settings, weset γ to 0.9, and we set the learning rate η to 0.05. We can rewritethe parameter update vector term:

∆wt = −η√

E[д2]t + ϵдt (8)

where ϵ is a smoothing term that avoids division by zero. As thedenominator is just the root mean squared error criterion of thegradient.

The RMS[∆w]t is approximated with the root mean squarederror of parameter updates until the previous time step. Then thenal AdaDelta updated rule is:

∆wt = −RMS[∆w]t−1RMS[д]t

дt (9)

wt+1 = wt + ∆wt (10)By using the AdaDelta method, our model is less dependent on

the learning rate determination, since it is diminished from theupdate rule.

4.2 Direction based Street View RetrievalTo reduce street view image noise introduced by recorded vehiclesor pedestrians, we select camera directions perpendicular to thestreets’ directions. The street view image retrieval process con-siders urban roadway structures as geographical prior knowledge.Under such consideration, the camera directions for the retrievedstreet view images are always perpendicular to the direction of theroadway. Compared to existing street view image datasets withxed compass directions (UCF Google Street View Dataset [46],SUN dataset [35]), our dataset preserves a better representation ofthe real urban perception. Such improvement can be quantiedexplicitly from the experiment results in the following section ofthis paper.

Details of direction based street view retrieval are presented inFigure 3, the red dots represent the crime point locations reportedfrom the crime record datasets; the arrows represent the directions;and the dashed blue lines represent the roadway networks. This

Data Layer

Identity Shortcut

FC FC Output

Layer

Conv Layer 10Conv Layer 9Conv Layer 8Conv Layer 7Conv Layer 6Conv Layer 5Conv Layer 4Conv Layer 3Conv Layer 2Conv Layer 1

Dropout

Max pooling

Dropout

Max pooling

Dropout

Max pooling

Dropout

Max pooling

Identity Shortcut Identity Shortcut Identity Shortcut Identity Shortcut

Figure 2: Representation of the Convolutional Neural Network Structure

Crime Point 2

Sample Point 4

Sample Point 5

Sample Point 6Crime Point 1

Sample Point 1

Sample Point 2

Sample Point 3

Figure 3: Street View Points Sampling from Crime Records

algorithm preserves the urban surroundings with better represen-tations; most of the previous works did not consider the directionsof the street view images [38]. Directions that are perpendicularto the roadway are calculated based on the topological structureinformation of the target city. CycloMedia GlobalSpotter API1 takesthe directions as queries retrieving the street view images.

Based on the given topological structures of the target cities’roadways, the structure can be represented by a shapele2 or aspatial database: Shp = r1, r2, r3, ..., rn where ri represents oneroad in the target city. The procedure of identifying the directionsperpendicular to the roads is presented in Algorithm 1, where theoperations <Ps , ri>, proj(ri , Ps ), and perp(Pv , ri ) are spatial func-tions. <Ps , ri> calculates the spatial distance between the samplepoint Ps and the road ri ; the function proj(ri , Ps ) nds the projec-tion point of Ps on the road ri ; the function perp(Pv , ri ) returns thedirections that are perpendicular to the tangent line of the road riat the tangency location Pv [25].

4.3 Crime Rank LabelingWhile calibrating ground truth street view images dataset withcrime rankings, we build spatiotemporal associations between o-cial crime records datasets of two cities and the street view images.We utilize crime records datasets of Washington D. C. and NewYork City. In the ocial crime records datasets, key informationof a crime record such as reported time, oense type, and geolo-cation specied by latitude and longitude is provided. Street viewimages with timestamps and geolocations are labeled with a lo-calized crime density ranking. For a given street view image I tisiwith a timestamp ti and geolocation pair si = lat, lon, we dene

1https://globespotter.cyclomedia.com/us/2http://doc.arcgis.com/en/arcgis-online/reference

Algorithm 1: Direction based Street View Retrieval

Rc : Crime Records;Shp : Topological Structure of the Roadways;Ps : Location of a Sampled Point;Function Directions(shp, Ps )

for Ps ∈ Crime Ranдe : Rc doClosest Distance : Dc ←∞;for all ri ∈ Shp do

if <Ps , ri> < Dc thenDc ← <Ps , ri> ; // update closest road

endendPv ← proj(ri , Ps ) ; // project direction to road

enddirect ion1, direct ion2← perp(Pv , ri );return direct ion1, direct ion2

for Ci ∈ Crime Records : Rc doPs1 , Ps2 , ..., Psn ← Gaussian(Ci .дeom, Std );for Psi ∈ Ps1 , Ps2 , ..., Psn do

direct ion1, direct ion2← Directions(Shp, Psi );Imд_Retr (direct ion1, direct ion2)

endend

a time window τ and a radius r . The local crime set is dened:C = Ctc

sc |ti − τ < tc ≤ ti + τ and dist(Sc , Si ) ≤ r , where Ctcsc rep-

resents the crime record with a report time at tc and a location at sc ;the function dist() returns the distance between two points. Thenthe crime types are ranked based on the local crime density witha descending order. The density for crime type k is calculated by:Dk = |Ck |/|C|. The labeling process is presented in Figure 4. Wemanipulate the radius parameter r and generate street view crimeranking datasets under multiple levels of resolutions. The radiusselected are 1 thousand feet and 2 thousand feet, which generatesfour datasets for two cities: DC-1k, DC-2k, NYC-1k, and NYC-2k.

5 EXPERIMENTIn this section, we present the experiment environment, dataset in-troduction, evaluation metrics and comparison methods, extensiveexperimental analysis, and discussions of case studies.

5.1 Experimental Environment and DatasetsThe convolutional neural network model is implemented utilizingboth Cae and Keras frameworks respectively. All convolutionalneural network experiments were proceeded on a NVIDIA Tesla

Dataset Method nDCG@3 nDCG@5 nDCG@7 P@3 P@5 P@7 MAP

DC-1k

rSVM-HOG 0.6026 0.5620 0.6951 0.3421 0.6058 0.8549 0.4433rSVM-SIFT 0.5882 0.6465 0.7098 0.3084 0.6752 0.9433 0.4880RLS-HOG 0.5154 0.5312 0.6178 0.2865 0.5475 0.8322 0.4672RLS-SIFT 0.6052 0.7054 0.7984 0.3612 0.7097 0.8579 0.5896AlexNet 0.5598 0.5713 0.7823 0.4476 0.6884 0.8831 0.4568VGGNet 0.6119 0.6546 0.6802 0.3782 0.6790 0.9217 0.5337

PlacesNet 0.6251 0.6619 0.7682 0.4146 0.6853 0.8964 0.6209StreetNet 0.6809 0.7530 0.8210 0.4353 0.7079 0.9393 0.6340

NYC-1k



DC-2k



NYC-2k



Table 1: Crime Ranking Performance

Figure 4: Image Label Strategies

K20 GPU. For support vector machine and regression models, werun the experiments on an Intel Core i7-4790 3.60GHz CPU with32 GB memory. Standard libraries such as LibSVM and LibLINEARare utilized as baseline methods.

The experiments are conducted on street view datasets of two ma-jor locations: Washington, D.C., and New York City. We trained ourproposed models on a set of 44,694 images for the physical appear-ances for the street view, which is signicantly more images than

in previous works [18, 39]. The street view images for the Wash-ington, D.C., area are obtained from the CycloMedia GlobalSpotterAPI. The CycloMedia GlobalSpotter is an interactive web-basedapplication that provides access to CycloMedia’ s panoramic streetlevel images. The Atlas PanoramaRendering Service of the Cyclo-Media GlobalSpotter API provides a controllable RESTful API forrequesting street view images. The retrieved street view images aredirected horizontally and vertically after being given geo-locationand a spatial reference index.

Street view images for New York City were extracted from theGoogle Street View data set [46]. Total of 23,764 images are pro-vided by the New York City Google Street View data set. Thereare 5,941 unique location points contained in this data set, eachlocation consists of 4 directions, and each direction represents oneview. Each image from the data set is geo-tagged with latitude andlongitude. Note that the image quality of the New York City GoogleStreet View data set is lower than the Cyclomedia GlobalSpottergenerated street view data set, and the camera view compass direc-tions for the New York City Street View dataset are xed to 0, 90,180, and 270. As we will show in the later sections, this cameradirection mismatch to the street direction shows its insuciencyin representing the actual street view.

Crime record datasets for Washington, D.C.,3 and New YorkCity4 are utilized for extracting the spatiotemporal correlationsbetween the street view images and the crime types. Nine types ofcommon crimes are considered as ranking labels: theft, theft fromauto, robbery, motor vehicle theft, burglary, assault with dangerousweapon, sex abuse, homicide, and arson5. 36,484 cases of criminaloenses in Washington, D.C., and 102,327 cases in New York Cityare collected.

5.2 Baseline MethodsWe compare the proposed method to the two major branches ofmethods in urban perception and scene recognition areas. Firstly,we implement hand-craft feature extraction methods on traditionalsupervised learning methods. These methods include ranking-SVMwith HOG features (rSVM-HOG), ranking-SVM with SIFT features(rSVM-SIFT ), regularized least squares with HOG features (RLS-HOG), and regularized least squares with SIFT features (RLS-SIFT ).Note that we utilize RBF kernel while solving the ranking-SVM.HOG [9] is a popular feature descriptor used in computer visionand image processing for multiple purposes of learning tasks. Inour experiments, descriptor blocks with a size of 8 by 8 are utilizedfor HOG feature generation. Using scale invariant feature trans-form (SIFT) [27] as a key point extraction mechanism has becomeincreasingly popular in recent years. variousus previous workshave justied its eectiveness [5, 28]. In this paper, SIFT key pointsare extracted for each street view image to construct a bag of keypoints; this method is referred to as the bag of words paradigm [8].The other branch of baseline methods is deep regression networksfor urban perception problems. Baseline method of this branchincludes AlexNet [21], VGGNet [41], and the PlacesNet [49]. Weused the pre-trained models of the deep regression networks andne-tuned all these baseline methods on our street view imagedataset separately.

5.3 Evaluation MetricsThe eectivenesses of the nDCG@k , Precision@k , and MAP areanalyzed for all the comparison methods and the proposed method.

5.3.1 nDCG@k. The rst metric is normalized discounted cu-mulative gain at top k (nDCG@k) to evaluate the accuracy of thecrime ranking produced by a given crime ranking prediction model.nDCG@k was rst dened as an information retrieval (IR) evalua-tion metric to consider the degree of relevance in retrieved results.The more relevant results retrieved at top positions in the rankwould accumulate higher score to the top k gain. This metricis chosen because it is suited for crime rankings that have mul-tiple levels of assessment. For a given ground truth crime rankc1, c2, ..., ck and its prediction c1, c2, ..., ˆck , the relevance scores ˆrel1, ˆrel2, ..., ˆrelk of the prediction ranking are rstly permutatedby the indexes of the ground truth; then nDCG@k is measured onthe permutation in the form of:

3http://data.octo.dc.gov/4https://data.cityofnewyork.us/Public-Safety/Historical-New-York-City-Crime-Data/hqhv-9zeg5http://crimemap.dc.gov/CrimeDenitions.aspx

nDCG@k =1Z

k∑i=1

2r eli − 1loд(1 + i)

(11)

The term Z is a normalization factor derived from a perfectranking of top k articles so that it would yield a nDCG@k of 1.

5.3.2 Precision@k. Precision measures in (IR) consider the num-ber of relevant documents among the top k documents. In ourevaluation, relevant crime types in the predicted crime rankingrefer to the crime types that are also presented in the ground truthcrime ranking at cutting o point k . However, unlike nDCG@k , thePrecision@k measurement is incapable of capturing the order of in-formation within the top k rankings. The Precision@k is measuredin the form of:

P@k =|Predicted_Crimes@k ∩Ground_Truth_Crimes@k |

k

5.3.3 MAP. Mean average precision (MAP ) for a set of streetview images is the mean of the average precision scores for eachstreet view image. MAP has been shown to have especially gooddiscrimination and stability. The MAP of a given set of rankings iscalculated:

MAP =1|I |

|I |∑j=1

1k

k∑i=1

Precision@i (12)

where I is the complete set of the street view images for valida-tion.

5.4 Experimental AnalysisIn this section, we demonstrate the results of the crime type predic-tion from street view urban perceptions. Experimental evaluationsof our proposed methods and extensive comparisons to the baselinemethods are conducted.

5.4.1 Crime Ranking Prediction. As shown in Table 1, our pro-posed StreetNet outperforms the baseline methods in general. Suchperformance increase is even more signicant when the parameterk is relatively small for both nDCG@k and precision@k. This resultalso implies that the process of feature selection and extraction iscritical for image label ranking tasks, and the convolutional lay-ers in the convolutional neural networks achieve better featureextraction.

We compare the performance of our proposed convolutionalneural network and competing methods for crime ranking predic-tion on dierent datasets. We generate four datasets out of twomajor cities, Washington, D.C., and New York City, with two levelsof street view perception radius: 1,000 feet and 2,000 feet.

Table 1 shows that crime ranking prediction performance gener-ally decreases as the street view perception radius becomes larger.For example, comparing theDC-1k andDC-2k datasets, the nDCG@kscore of the DC-1k is 4% greater than the score of DC-2k; for themetric precision@k , the prediction results of the DC-1k also out-perform the results of DC-2k by 2% in general; the MAP of theDC-1k also exceed DC-2k by 6%. From the previous experimentalobservations, we nd that the increase of the precision@k metricis not as signicant as the increase of the nDCG@k metric. This

0 5 10 15 20 25

Radius

0.4

0.5

0.6

0.7

0.8

0.9

nDCG@5

DC

0 5 10 15 20 25

Radius

0.4

0.5

0.6

0.7

0.8

0.9

Precision@5

DC

0 5 10 15 20 25

Radius

0.3

0.35

0.4

0.45

0.5

0.55

0.6

0.65

0.7

MAP

DC

0 5 10 15 20 25

Radius

0.3

0.4

0.5

0.6

0.7

0.8

0.9

nDCG@5

NYC

0 5 10 15 20 25

Radius

0.45

0.5

0.55

0.6

0.65

0.7

0.75

0.8

0.85Precision@5

NYC

0 5 10 15 20 25

Radius

0.3

0.35

0.4

0.45

0.5

0.55

0.6

0.65

0.7

MAP

NYC

rSVM-HOG rSVM-SIFT RLS-HOG RLS-SIFT StreetNetAlexNet VGGNet PlacesNet

Figure 5: Street View Perception Radius Analysis

may be caused by the dierent properties of the evaluation metrics:nDCG@k considers ordering of the crime relevance scores, whileprecision@k is calculated based on the number of crime intersectedwith the true crime set.

From the crime type ranking prediction results, interesting per-formance patterns can be observed. Firstly, for metrics nDCG@kand Precision@k, when the ranking parameter k is relatively small(k = 3 or k = 5), some of the hand-craft features based methodscan outperform the deep neural networks (AlexNet, VGG-16, andPlacesNet) . On the other hand, when the ranking parameter k isset to be relatively large (k = 7), the deep neural networks canoutperform the hand-craft features based methods. Secondly, thePlacesNet achieves better performance than other baseline methodswhen trained on the NYC-1k and NYC-2k datasets. This is becausethe pre-trained PlacesNet model was trained on an imagery datasetwith higher diversity. While handling street view images on urbanperception task, the PlacesNet will converge faster.

5.4.2 Street View Perception Radius Analysis. Further analysisof the correlations between the selection of the radius and theevaluation metrics are studied. The results help us to learn, in anempirical way, the best crime rank representation area (resolution)of a given street view image. Such a nding is highly practical inthe study of urban perception. For example, given a street viewimage with geo-location, one is always interested in questions suchas: Can the street view represent the crime rank for the whole

city? Or can the street view only represent the crime rank for asmall neighborhood? What is the resolution? Figure 5 shows theevaluation metrics results by varying the selections of the streetview perception radius. We nd that for dierent learning methods,the optimal radius varies. For nDCG@5, our proposed convolutionalneural network outperforms other comparison methods, and theoptimal radius for our method locates at 1,200 feet; ranking-SVMwith SIFT features locates its optimal selection of radius at 900 feetfor the same metric. In order to achieve the best of precision@5,our method locates the optimal radius selection at 1,500 feet; andranking-SVM at 1,000 feet.

5.4.3 Direction-based Street View Image Retrieval Analysis. Asproposed in the methodology section, the direction-based streetview image retrieval algorithm is applied for retrieving the streetviews with higher- quality urban representation. The street viewimage datasets DC-1k and DC-2k are retrieved by our algorithm;on the other hand, the other two street view image datasets areretrieved with xed compass directions of 0, 90, 180, and 270.As in Table 1, we nd that the performances of our method onthe NYC-1k and NYC-2k datasets is not as stable. This result isintuitive, because a tremendous amount of noise can be introducedto street view images with camera directions not perpendicularto the streets. For example, if the camera direction is always thesame as the street’s direction, the retrieved street view image willalways present the street surface or the sky. In other words, the

(a) Theft / Other (b) Robbery (c) Burglary (d) Robbery (e) Vehicle Theft

(f) Vehicle Theft (g) Burglary (h) Burglary (i) Robbery (j) Theft / Other

Figure 6: Crime Type Inferences from Street Views

resulting street view images will not be representative enough forthe real street view; image content like the front of a store or theappearance of a building will be neglected.

5.5 Case Studies DiscussionsIn this section, a number of interesting crime ranking predictionpatterns are observed discovered by the proposed convolutionalneural network. The top 5 crime types with the highest relevancescores learned are listed for each input street view image in Fig-ure 6. The corresponding relevance scores are also presented. FromFigure 6, we can nd interesting correlations between urban ap-pearance and the predicted crime rankings. For example, crimetypes such as robbery and motor vehicle theft are more likely to beinferred from the street views of downtown areas. Such ndingsare presented in Figures 6(a), 6(b), and 6(f). On the other hand, fromthe street view images of residential areas or suburbs, crime typessuch as burglary and theft are more likely to be predicted by ourapproach; the results are shown in Figures 6(c), 6(g), and 6(h). Aspresented in Figure 1, the predicted crime ranking results for bothdowntown areas and the suburbs t the crime distributions in thoseplaces. The consistency with ocial crime records indicates thefeasibility of inferring crime rankings or other safety informationfrom street view images or other forms of urban perception.

In order to test the performance of our model, we manuallyextract street view images from Google Maps, and selected areas

with no public crime records accessible. The results are shown inFigures 6(d), 6(e), 6(i), and 6(j). Similar crime ranking predictionpatterns can be witnessed from these results. From these tests, weshow that our model is highly practical for application scenariossuch as 1) areas and cities with no easy access to public crimerecords data and 2) end users traveling to an unfamiliar area withno idea how safe it is.

6 CONCLUSIONThis paper presents a novel convolutional neural network solu-tion to the problem of inferring crime rankings from street viewimages of an area. The convolutional neural network model isdesigned based on the settings of a preference learning framework.By taking road structure data as prior knowledge, the proposeddirection-based street view image retrieval method presents betterpreservation of urban perceptions. By exploiting the spatiotempo-ral correlations between the street view images and ocial crimerecords datasets, we generate labeled training data in a data-drivenway, which greatly reduces bias. Comparisons with previous im-age feature extraction and ranking learning algorithms show thatthe proposed convolutional neural network approach outperformsthe baseline methods in learning crime rankings from street viewimages. Extensive experiments based on multiple street view im-age datasets and crime records conrm the feasibility of inferring

hidden knowledge such as crime ranking from urban perceptiondata.

REFERENCES[1] Virginia O Andersson, Marco AF Birck, and Ricardo M Araujo. 2017. Investigating

Crime Rate Prediction Using Street-Level Images and Siamese ConvolutionalNeural Networks. In Latin American Workshop on Computational Neuroscience.Springer, 81–93.

[2] Sean M Arietta, Alexei A Efros, Ravi Ramamoorthi, and Maneesh Agrawala. 2014.City forensics: Using visual elements to predict non-visual city attributes. IEEEtransactions on visualization and computer graphics 20, 12 (2014), 2624–2633.

[3] Wim Bernasco. 2010. Modeling micro-level crime location choice: Applicationof the discrete choice framework to crime at places. Journal of QuantitativeCriminology 26, 1 (2010), 113–138.

[4] Wim Bernasco. 2010. A sentimental journey to crime: Eects of residentialhistory on crime location choice. Criminology 48, 2 (2010), 389–416.

[5] Anna Bosch, Andrew Zisserman, and Xavier Muñoz. 2006. Scene classicationvia pLSA. In European conference on computer vision. Springer, 517–530.

[6] Gang Chen, Yangqiu Song, Fei Wang, and Changshui Zhang. 2008. Semi-supervised multi-label learning by solving a sylvester equation. In Proceedings ofthe 2008 SIAM International Conference on Data Mining. SIAM, 410–419.

[7] Nico Cornelis, Bastian Leibe, Kurt Cornelis, and Luc Van Gool. 2008. 3d urbanscene modeling integrating recognition and reconstruction. International Journalof Computer Vision 78, 2-3 (2008), 121–141.

[8] Gabriella Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and Cé-dric Bray. 2004. Visual categorization with bags of keypoints. In Workshop onstatistical learning in computer vision, ECCV, Vol. 1. Prague, 1–2.

[9] Navneet Dalal and Bill Triggs. 2005. Histograms of oriented gradients for humandetection. In 2005 IEEE Computer Society Conference on Computer Vision andPattern Recognition (CVPR’05), Vol. 1. IEEE, 886–893.

[10] Marco De Nadai, Radu Laurentiu Vieriu, Gloria Zen, Stefan Dragicevic, NikhilNaik, Michele Caraviello, Cesar Augusto Hidalgo, Nicu Sebe, and Bruno Lepri.2016. Are safer looking neighborhoods more lively?: A multimodal investigationinto urban life. In Proceedings of the 2016 ACM on Multimedia Conference. ACM,1127–1135.

[11] Thomas Deselaers, Lexi Pimenidis, and Hermann Ney. 2008. Bag-of-visual-wordsmodels for adult image classication and ltering. In Pattern Recognition, 2008.ICPR 2008. 19th International Conference on. IEEE, 1–4.

[12] Abhimanyu Dubey, Nikhil Naik, Devi Parikh, Ramesh Raskar, and César AHidalgo. 2016. Deep learning the city: Quantifying urban perception at a globalscale. In European Conference on Computer Vision. Springer, 196–212.

[13] Kaiqun Fu, Chang-Tien Lu, Rakesh Nune, and Jason Xianding Tao. 2015. Steds:Social Media Based Transportation Event Detection with Text Summarization..In ITSC. 1952–1957.

[14] Timnit Gebru, Jonathan Krause, Yilun Wang, Duyun Chen, Jia Deng, Erez Lieber-man Aiden, and Li Fei-Fei. 2017. Using deep learning and Google Street View toestimate the demographic makeup of neighborhoods across the United States.Proceedings of the National Academy of Sciences (2017), 201700035.

[15] Timnit Gebru, Jonathan Krause, Yilun Wang, Duyun Chen, Jia Deng, and LiFei-Fei. 2017. Fine-Grained Car Detection for Visual Census Estimation.. In AAAI,Vol. 2. 6.

[16] Edward L Glaeser, Scott Duke Kominers, Michael Luca, and Nikhil Naik. 2018.Big data and big cities: The promises and limitations of improved measures ofurban life. Economic Inquiry 56, 1 (2018), 114–137.

[17] Ian J Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet.2013. Multi-digit number recognition from street view imagery using deepconvolutional neural networks. arXiv preprint arXiv:1312.6082 (2013).

[18] Asaad Hakeem, Roberto Vezzani, Mubarak Shah, and Rita Cucchiara. 2006. Esti-mating geospatial trajectory of a moving camera. In 18th International Conferenceon Pattern Recognition (ICPR’06), Vol. 2. IEEE, 82–87.

[19] Sariel Har-Peled, Dan Roth, and Dav Zimak. 2002. Constraint classication formulticlass classication and ranking. Urbana 51 (2002), 61801.

[20] Taoran Ji, Kaiqun Fu, Nathan Self, Chang-Tien Lu, and Naren Ramakrishnan.2018. Multi-task Learning for Transit Service Disruption Detection. In IEEE/ACMInternational Conference on Advances in Social Networks Analysis and Mining(ASONAM’18).

[21] Alex Krizhevsky, Ilya Sutskever, and Georey E Hinton. 2012. Imagenet classica-tion with deep convolutional neural networks. In Advances in neural informationprocessing systems. 1097–1105.

[22] Ickjai Lee and Peter Phillips. 2008. Urban crime analysis through areal categorizedmultivariate associations mining. Applied Articial Intelligence 22, 5 (2008), 483–499.

[23] Xirong Li, Cees GM Snoek, and Marcel Worring. 2010. Unsupervised multi-feature tag relevance learning for social image retrieval. In Proceedings of theACM International Conference on Image and Video Retrieval. ACM, 10–17.

[24] Dong Liu, Xian-Sheng Hua, Linjun Yang, Meng Wang, and Hong-Jiang Zhang.2009. Tag ranking. In Proceedings of the 18th international conference on Worldwide web. ACM, 351–360.

[25] Meiling Liu, Kaiqun Fu, Chang-Tien Lu, Guangsheng Chen, and Huiqiang Wang.2014. A search and summary application for trac events detection based ontwitter data. In Proceedings of the 22nd ACM SIGSPATIAL International Conferenceon Advances in Geographic Information Systems. ACM, 549–552.

[26] Xiaobai Liu, Qi Chen, Lei Zhu, Yuanlu Xu, and Liang Lin. 2017. Place-centricVisual Urban Perception with Deep Multi-instance Regression. In Proceedings ofthe 2017 ACM on Multimedia Conference. ACM, 19–27.

[27] David G Lowe. 2004. Distinctive image features from scale-invariant keypoints.International journal of computer vision 60, 2 (2004), 91–110.

[28] Yong Luo, Dacheng Tao, Chang Xu, Dongchen Li, and Chao Xu. 2013. Vector-Valued Multi-View Semi-Supervsed Learning for Multi-Label Image Classica-tion.. In 27th AAAI Conference on Articial Intelligence (AAAI). AAAI, 647–653.

[29] Branislav Micusik and Jana Kosecka. 2009. Piecewise planar city 3D modelingfrom street view panoramic sequences. In Computer Vision and Pattern Recogni-tion, 2009. CVPR 2009. IEEE Conference on. IEEE, 2906–2912.

[30] Jerey D Moreno and Robert J Sampson. 1997. Violent crime and the spatialdynamics of neighborhood transition: Chicago, 1970–1990. Social forces 76, 1(1997), 31–64.

[31] Yair Movshovitz-Attias, Qian Yu, Martin C Stumpe, Vinay Shet, Sacha Arnoud,and Liron Yatziv. 2015. Ontological supervision for ne grained classication ofstreet view storefronts. In 2015 IEEE Conference on Computer Vision and PatternRecognition (CVPR). IEEE, 1693–1702.

[32] Nikhil Naik, Jade Philipoom, Ramesh Raskar, and César Hidalgo. 2014.Streetscore–Predicting the Perceived Safety of One Million Streetscapes. In2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.IEEE, 793–799.

[33] Nikhil Naik, Ramesh Raskar, and César A Hidalgo. 2016. Cities are physical too:Using computer vision to measure the quality and impact of urban appearance.American Economic Review 106, 5 (2016), 128–32.

[34] Vicente Ordonez and Tamara L Berg. 2014. Learning high-level judgments ofurban perception. In European Conference on Computer Vision. Springer, 494–510.

[35] Genevieve Patterson, Chen Xu, Hang Su, and James Hays. 2014. The sun attributedatabase: Beyond categories for deeper scene understanding. InternationalJournal of Computer Vision 108, 1-2 (2014), 59–81.

[36] Peter Phillips and Ickjai Lee. 2011. Crime analysis through spatial areal aggre-gated density patterns. Geoinformatica 15, 1 (2011), 49–74.

[37] Peter Phillips and Ickjai Lee. 2012. Mining co-distribution patterns for largecrime datasets. Expert Systems with Applications 39, 14 (2012), 11556–11563.

[38] Lorenzo Porzi, Samuel Rota Bulò, Bruno Lepri, and Elisa Ricci. 2015. Predictingand understanding urban perception with convolutional neural networks. InProceedings of the 23rd ACM international conference on Multimedia. ACM, 139–148.

[39] Grant Schindler, Matthew Brown, and Richard Szeliski. 2007. City-scale locationrecognition. In 2007 IEEE Conference on Computer Vision and Pattern Recognition.IEEE, 1–7.

[40] Herbert W Schroeder and LM Anderson. 1984. Perception of personal safety inurban recreation sites. Journal of Leisure Research 16, 2 (1984), 178.

[41] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional net-works for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).

[42] James Q Wilson and George L Kelling. 1982. Broken windows. Critical issues inpolicing: Contemporary readings (1982), 395–407.

[43] Jun Yang, Yu-Gang Jiang, Alexander G Hauptmann, and Chong-Wah Ngo. 2007.Evaluating bag-of-visual-words representations in scene classication. In Pro-ceedings of the international workshop on Workshop on multimedia informationretrieval. ACM, 197–206.

[44] Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. 2016. Building a LargeScale Dataset for Image Emotion Recognition: The Fine Print and The Benchmark.In 30th AAAI Conference on Articial Intelligence (AAAI). AAAI, 308–314.

[45] Amir Roshan Zamir and Mubarak Shah. 2010. Accurate image localization basedon google maps street view. In European Conference on Computer Vision. Springer,255–268.

[46] Amir Roshan Zamir and Mubarak Shah. 2014. Image geo-localization basedon multiplenearest neighbor feature matching usinggeneralized graphs. IEEEtransactions on pattern analysis and machine intelligence 36, 8 (2014), 1546–1558.

[47] Matthew D Zeiler. 2012. ADADELTA: an adaptive learning rate method. arXivpreprint arXiv:1212.5701 (2012).

[48] Man Zhang, Ran He, Dong Cao, Zhenan Sun, and Tieniu Tan. 2016. Simultane-ous Feature and Sample Reduction for Image-Set Classication. In 30th AAAIConference on Articial Intelligence (AAAI). AAAI, 1401–1407.

[49] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva.2014. Learning deep features for scene recognition using places database. InAdvances in neural information processing systems. 487–495.

StreetNet: Preference Learning with Convolutional Neural ...ctlu/Publication/2018/ACM-GIS-2018-Kaiqun.pdf · and law enforcement agencies....

Documents