Transport mode detection using Cellular Signaling Data
(Case study of Graz and Vienna, Austria)
GEO 620 Master’s Thesis
Author: Kimberley Chin Jiaqi
16-721-185
Supervised by:
Dr Haosheng Huang, University of Zurich
Christopher Horn, Invenium Data Insights
Dr. Ivan Kasanicky, Parkbob
Faculty representative: Prof. Dr. Robert Weibel
29.06.2018
Department of Geography, University of Zurich
ACKNOWLEDGEMENTS
As I embark on the final leg of my studies at the University of Zurich, I would like to express my heartfelt thanks to everyone who has taught, helped and supported me throughout my studies and master’s thesis. This journey would have been a lot harder without:
• My supervisor Dr Haosheng Huang. Thank you for all the support, ideas and time you have given me during this past year. Without your critical input and constant guidance this would not have been possible.
• Co-supervisors Christopher Horn of Invenium Data Insights and Dr. Ivan Kasanicky of Parkbob. Many thanks for giving me the opportunity to work on a topic in such an exciting field, and for all your valuable insights and suggestions. Special thanks go to Christopher for providing their own data as well as data from Alexandra Lechner of TU Graz.
• My dear friends and family, and especially Nick, for proofreading and, more importantly, for the much-needed emotional support and encouragement you have given in more ways than one.
Contact
Author
Kimberley Chin
Rousseaustrasse 61, 8037 Zürich, Switzerland
kimberleychin@hotmail.com

Supervisor
Dr Haosheng Huang
Geographic Information Science (GIS)
Department of Geography
University of Zurich
Winterthurerstrasse 190
8057 Zürich, Switzerland
haosheng.huang@geo.uzh.ch

Co-Supervisor
Christopher Horn
Science Tower
Waagner-Biro-Strasse 100/11
8020 Graz, [email protected]

Co-Supervisor
Dr Ivan Kasanicky
Parkbob GmbH
Treustrasse 22-24
1200 Vienna, [email protected]
ABSTRACT
The rise of new Big Data sources such as cellular network data has allowed us to observe and comprehend human behavior, and the interactions between people and their environment, on a much deeper level. This leads to both new research opportunities and new challenges. Transport mode detection plays a key role, directly or indirectly, in many fields such as urban planning, epidemiology and transportation science. Improving travel demand surveys is an important driving factor and motivation in this research. The scalability of any alternative to such surveys, in terms of data collection and processing, is a critical consideration in its development. Researchers have looked to Global Positioning Systems (GPS) in the form of loggers or GPS-enabled mobile phones, as well as Call Detail Records (CDR), as alternatives. While these methods have shown promising results, they are not without flaws. The aim of this research is thus to design a methodology that can detect modes of transportation from another, lesser-known type of data: cellular signaling data. Cellular signaling data does not require overhead, as it can be described as data crumbs left over by everyday usage of one’s cellphone. Based on the results, we can present a deeper understanding of the data characteristics and their potential for understanding human mobility flows in cities. This research presents a set of supervised and unsupervised methods that are applied to data collected in Vienna and Graz (Austria) in two separate data collection campaigns by groups of 2 and 9 participants. The results from the proposed methods show promise and are comparable to existing GPS studies with the same aim. The best performing method, a hybrid of rule-based heuristics and a supervised random forest, managed to correctly distinguish between U-Bahn, S-Bahn, car, bike and walk modes 73% of the time. Rule-based methods perform especially well on rail modes (U-Bahn/S-Bahn). For the more similar modes (cars and bikes, for example), random forest does best at distinguishing between them. While unsupervised methods are not able to achieve the same accuracies, the results are still comparable, with a 68% accuracy achieved with the partitioning-around-medoids technique.

Keywords: Transport mode detection, cellular signaling data, fuzzy logic, rule-based heuristic, random forest, unsupervised clustering, principal component analysis
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
CONTACT
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
  1.1 Context and Motivation
  1.2 Problem statement and research aims
  1.3 Main expected outcomes
  1.4 Thesis structure
CHAPTER 2 BACKGROUND AND RELATED WORK
  2.1 The world of transportation
  2.2 Transport mode detection
  2.3 Transport mode detection using Global Positioning Systems data
    2.3.1 Pre-processing
    2.3.2 Mode detection
  2.4 Transport mode detection using cellular network data
    2.4.1 Mobile phone network structure
    2.4.2 Data Generation
    2.4.3 Spatial and Temporal Granularity
    2.4.4 Pre-processing
    2.4.5 Mode detection
  2.5 Variable selection
  2.6 Ethical issues
  2.7 Summary and Research Gaps
CHAPTER 3 DATA AND ITS CHARACTERISTICS
  3.1 Cellular Signaling Data
    3.1.1 Spatial resolution
    3.1.2 Temporal resolution
  3.2 Other data
CHAPTER 4 METHODOLOGY
  4.1 Methodological procedure
  4.2 Computing environment
  4.3 Pre-processing
  4.4 Feature list
    4.4.1 Features at the observation level
    4.4.2 Features at the trajectory level
  4.5 Rule-based heuristics
  4.6 Fuzzy Logic Systems
  4.7 Machine Learning
  4.8 Unsupervised Learning
  4.9 Variable selection
CHAPTER 5 RESULTS
  5.1 Validation
  5.2 Parameter settings
    5.2.1 Fuzzy Logic Systems
    5.2.2 Random Forest
    5.2.3 Unsupervised K-means and PAM with RF
  5.3 Pre-processing and trip-segmentation
  5.4 Supervised methods
    5.4.1 With RBH vs. without RBH
    5.4.2 Random Forest vs Fuzzy Logic
    5.4.3 Variables for mode detection
  5.5 Unsupervised methods
  5.6 Summary of main results
CHAPTER 6 DISCUSSION
  6.1 Research Question 1
  6.2 Research Question 2
  6.3 Limitations of the study
CHAPTER 7 CONCLUSION AND FUTURE WORK
REFERENCES
Personal Declaration
LIST OF FIGURES
Figure 1 Rules used in Gong et al.'s paper in the mode detection process (Gong et al., 2012)
Figure 2 Rules used in Bohte & Maat’s paper for mode detection (Bohte and Maat, 2009)
Figure 3 Fuzzy Logic membership functions generated by human expertise (Axhausen and Schüssler, 2009)
Figure 4 a) Location Area and Base Stations b) Periodic updates c) Handovers d) Mobility location updates (Calabrese et al., 2015)
Figure 5 Illustration of data pre-processing to extract stay locations (Jiang et al., 2013) with distance and time constraints of 300 m and 10 minutes
Figure 6 Detection of stays using geometry (Widhalm et al., 2015)
Figure 7 Trip data clustered to two subgroups, driving and public transit. The arrows show the average travel times of the subgroups. Black lines are the travel times as reported by Google Maps. (Wang et al., 2010)
Figure 8 Framework proposed by Qu et al. (2015) for mode detection with CDR data through a speed split, augmenting the dataset with transportation network information and utility approximations
Figure 9 Average Euclidean distance between subsequent observations during stationary, walking and driving periods. (Sohn et al., 2006)
Figure 10 Example of first two components based on two variables, word length and number of lines in dictionary definition. Green lines represent the vectors whose main direction leads to the maximum sum of squared distances from the points to the vectors (Abdi and Williams, 2010)
Figure 11 Visualization of CSD points in Vienna
Figure 12 Visualization of CSD points in Graz
Figure 13 Examples of differing spatial resolution of observations, each color representing a different U-Bahn trajectory
Figure 14 Distribution of raw CSD observations to corresponding GPS observations
Figure 15 CSD (yellow) and corresponding GPS trajectory (black) when commuting between cities where cellular coverage is not as strong
Figure 16 Distribution of raw CSD observations to corresponding GPS observations in Graz and Vienna
Figure 17 Distribution of the first quartiles, medians and third quartiles of the time intervals between observations of each user
Figure 18 Example of two U-Bahn trajectories. The points are individual observations of the user taking the U-Bahn. Blue circles represent a trip in the morning and pink circles represent trips in the afternoon. Red tracks symbolize the U-Bahn network
Figure 19 Example of CSD (pentagons) generated for walk (yellow) and bike (blue) compared to their corresponding GPS tracks (circles)
Figure 20 Methodological Framework
Figure 21 Example of attributes extracted for each observation
Figure 22 Version 1 of rule-based algorithm when all considerations as mentioned in section 4.1 are taken into account, and when concepts from other GPS studies using rule-based heuristics are borrowed
Figure 23 Cumulative distribution function of ubahndist. The diagram shows all U-Bahn trajectories have an average distance of less than 200 m to the U-Bahn network for the Vienna dataset
Figure 24 Cumulative distribution of percentile95acc of trajectories of various modes in Vienna
Figure 25 Cumulative distribution of vel.rolling2.median of trajectories of various modes in Vienna
Figure 26 Second version of rule-based algorithm, simplified and improved
Figure 27 Trapezoidal membership functions and parameter values based on normalised data. Different colors mean different memberships: blue-small, red-medium, green-high
Figure 28 Plot of Z-scores after Boruta algorithm is run on variables. Blue boxplots represent the minimal, average and maximum Z score of a shadow attribute. Red and green boxplots represent Z scores of respectively rejected and confirmed attributes. Yellow are the tentative variables
Figure 29 Variables ranked based on their contributions to the first principal component in the dataset for this study (Vienna)
Figure 30 Variables ranked based on their contributions to the second principal component in the dataset for this study (Vienna)
Figure 31 Comparison of eigenvalues and values from the Broken-stick model
Figure 32 Sensitivity analysis for FLS parameters of RB_FLEL method on the Viennese dataset
Figure 33 Sensitivity analysis for RF parameters
Figure 34 Results for Vienna dataset
Figure 35 Results for Graz data
Figure 36 Precision of each mode in Vienna
Figure 37 Recall of each mode in Vienna
Figure 38 F1 of each mode in Vienna
Figure 39 Precision of all modes in Graz
Figure 40 Recall of all modes in Graz
Figure 41 F1 statistic for all modes in Graz
Figure 42 CDF for 95th percentile speeds in private modes in Vienna
Figure 43 Example of unmatched bike start point (red circle). Small points represent GPS observations, larger pentagons represent CSD observations. Blue indicates bicycle and green indicates cars
Figure 44 Precision, Recall and F1-score for RB_PAM and RB_KMEANS in Viennese dataset
LIST OF TABLES
Table 1 Examples of fuzzy rules for mode detection (Axhausen and Schüssler, 2009)
Table 2 Distance of raw CSD to corresponding GPS points in (all points and in the urban study areas)
Table 3 List of mode detection methods proposed and their abbreviations
Table 4 Overview of main R-packages used
Table 5 Spatial resolution of CSD points in metres
Table 6 List of attributes extracted from each observation
Table 7 List of spatial features extracted for each trajectory
Table 8 List of descriptive motion features extracted from each trajectory
Table 9 Parameter values of membership functions of normalized data
Table 10 Importance of components in PCA of Vienna
Table 11 Interpretation of Cohen's Kappa (McHugh, 2012)
Table 12 Code for methods
Table 13 Table of number of trips extracted from first data collection (A)
Table 14 Table of number of trips extracted from second data collection (C)
Table 15 Total mode shares of data (A + C)
Table 16 Precision, Recall and F1 values for each mode in Vienna
Table 17 Precision, Recall and F1 values for each mode in Graz
Table 18 Confusion matrix of Rule-Based Heuristics + Random Forest Vienna
Table 19 Confusion matrix of Rule-Based Heuristics + Fuzzy Logic with Existing Literature Vienna
Table 20 Confusion matrix of Rule-Based Heuristics + Fuzzy Logic with variables from existing literature Graz
Table 21 Confusion matrix of Rule-Based Heuristics + Random Forest Graz
Table 22 Comparison of speed and acceleration profiles of walk trajectories in Graz and Vienna
Table 23 Confusion matrix of trajectories that are assigned modes in the FLEL step of RB_FLEL
Table 24 Confusion matrix of trajectories that are assigned modes in the RF step of RB_RF
Table 25 Variables selected for FL by RF and existing literature (in no particular order), and variables selected by PCA for unsupervised methods
Table 26 Example of medoids of each cluster for PAM and the mode assigned
Table 27 Cluster centroids for K-means
Table 28 Confusion matrix of trajectories whose modes are assigned by the PAM step
Table 29 Confusion matrix of trajectories whose modes are assigned by the K-means step
Table 30 Performance of unsupervised methods in the Viennese dataset
Table 31 Performance of unsupervised methods in the Graz dataset
Table 32 Overview of the composition of clusters generated in PAM on Graz dataset
Table 33 Overview of the composition of clusters generated in K-means on Graz dataset
CHAPTER 1 Introduction
1.1 Context and Motivation
“Research in human movement in time and space has been around for at least over five decades” - (Weiner, 1986).
Research in human movement has been given a huge helping hand by the rise of new Big Data sources such as mobile phone call detail records or social media records with location tags. We now have the ability to observe and comprehend human behavior and how people interact with their environment at an unprecedented level of detail. This leads to both new research opportunities and new challenges. Such location-based data can give us valuable insights into human movement in both time and space once new techniques are developed to harness this potential (Zook et al., 2015).

The global spread of mobile technologies for communication has brought the world closer together whilst also resulting in an unparalleled data source capable of describing diverse dealings in the world of human and social behavior. One example of this is Call Detail Records (CDRs), the byproduct of billing services for calls, which include timestamps and location coordinates of these transmissions. These widespread datasets can reveal compelling information on patterns on an individual as well as a collective scale (Blondel et al., 2015). This is a passive data type: such data are usually by-products of existing structures, generated for purposes other than research but potentially usable for it (Chen et al., 2016). Other passive data types include social media data that have been posted voluntarily by online users (Gonzalez et al., 2008) and transit card data used in public transport systems (Hasan et al., 2013; Liu et al., 2009).

One important group of beneficiaries of this data is those in the field of transport science. Urban planners, policy makers and transport management are interested in how people travel, how infrastructure and the environment affect movement, and of course, in obtaining a realistic picture of travel demand. In doing so, many other fields can benefit from it. In this vein, these
mobile phone traces have been used to further several aims such as estimating human mobility patterns and population distribution (Calabrese et al., 2015; Gonzalez et al., 2008, 2010; Sevtsuk and Ratti, 2010; Reades et al., 2007), analysing urban activities (Jiang et al., 2013; Ricciato et al., 2017; Widhalm et al., 2015), generating origin-destination flows (Calabrese et al., 2011; Horn et al., 2017; Kalatian and Shafahi, 2016; Tettamanti et al., 2012; Wang et al., 2010), and many more. All these pursuits are of great interest to the field of transportation science, and planners are looking into how best to use this information for cities’ transportation systems. Singapore, a rapidly growing, densely populated metropolitan city, is using decade-long demand forecasts for its public transportation, among other planning sectors such as land use and urban redevelopment planning. Its Smart Mobility 2030¹ initiative is leading the way in incorporating such location data to plan the nation's transit needs and is continuing to grow in this area.

When we have a better understanding of human and traffic flows, we gain greater insights and management capabilities for traffic congestion, health monitoring, elderly care and even epidemiology. One way this can be done is by understanding the transport mode choices of people, which is achieved by collecting information through travel demand surveys or travel diaries. Improving travel demand surveys is an important driving factor and motivation in this research. The scalability of any alternative, in terms of data collection and processing, is a critical consideration in its development. For example, traditional surveys may take the form of manual collection and labeling, like telephone interviews and questionnaires. Inaccuracies are introduced as a result. Researchers have looked to Global Positioning Systems (GPS) in the form of loggers or GPS-enabled mobile phones, as well as CDRs, as alternatives.

Another motivation is context-aware location-based services (LBS). Transportation modes such as walking, cycling or train denote certain characteristics of a user. One use of this knowledge is targeted and customized advertisements that may be deployed to the relevant markets. As people have begun to see the large potential of these datasets, and as our telecommunications infrastructure has improved by leaps and bounds, so has the quality of the cellular network data that comes with it. A step up from the usual CDR data is Cellular Signaling Data (CSD). CSD consists not only of normal CDRs but also of other cellular-network-related data, including both network and event-driven data (section 3.4.2). With an additional map-matching step, CSD ultimately offers an increased spatial and temporal resolution. Similar to the ad targeting strategies of some social media platforms, there are potential avenues to generate more revenue with more targeted and customized services. Telecommunication providers have taken notice and begun to invest in developing techniques to mine information and provide access to this data. As such, this thesis will attempt to achieve the aims of transport mode detection with CSD.

This research is in collaboration with Parkbob, a rapidly growing company that delivers context-aware parking information to drivers and uses predictive models with real-time, crowdsensed data to provide parking availability information. While this is in the realm of LBS, the motivation for this research is largely driven by the desire to understand the transport demand that drives the need for this LBS, and hence lies more in the vein of transportation science.

¹ https://www.lta.gov.sg/content/dam/ltaweb/corp/RoadsMotoring/files/SmartMobility2030.pdf
1.2 Problem statement and research aims
With the aims of transport mode detection in mind, CSD offers many more opportunities in terms of types of modes and performance due to its higher quality. However, despite this passive data type having the advantages of large sample sizes and long observation periods, it also has obvious weaknesses: cell phone traces can be sparsely sampled in time during idle periods, and they might provide only a low spatial resolution and include noise stemming from pure signal movement. Therefore the data has to be carefully processed to extract trip origins and destinations. Access is also hindered by varying privacy and business-sensitivity considerations. Many of the previous studies involving cellular network data for mobility analysis have been limited to CDRs and have come up with methods to alleviate the impact of these challenges (Qu et al., 2015; Wang et al., 2010). It was only recently that companies have allowed access to this new cellular signaling data, whose greater detail means much more information can be mined compared to CDR. The main challenge here, however, is handling the still much lower spatial and temporal resolutions associated with this passive data type without having to actively solicit other supplementary data in a time- and resource-intensive manner. Privacy concerns also mean that existing studies do not have ground truth data to evaluate their results. Because of this, the number of modes that have been distinguished using passive mobile phone data has been rather limited, usually to motorized versus non-motorized, or private versus public transportation modes.
As such, the aim of this thesis is to overcome the restrictions of active data types (GPS) and passive data types (CDR-only) for mode detection and to propose novel methods for this relative newcomer, CSD, while also accounting for its low and irregular spatial and temporal resolution. The study area consists of two major cities in Austria, Vienna and Graz. The modes of transport of interest are both private and public transportation modes: car, bike, walk, tram, S-Bahn (commuter trains) and U-Bahn (metro). This is also limited by the amount of actual data made available to this study. Methods that are taken from existing CDR and GPS studies, whether in part or in whole, will be adjusted so that they are more suitable to deal with the unique data characteristics of CSD.

This thesis will consist of two main parts. First, several methods will be developed by combining popular mode detection methods proposed in existing studies. This will include both scenarios in which labels are available and the more likely ones in which they are not. The second part will evaluate the performance of these proposed approaches and compare them based on various performance metrics.
Research Question 1: Development and Implementation: How can various modes of transportation (walk, bus, tram, car) be detected from cellular signaling data (CSD) considering its lower and more irregular spatial and temporal resolution?

Hypothesis 1: Existing methods, tailored to the unique characteristics of this dataset, can detect various modes of transportation. This is possible by distinguishing between their spatiotemporal characteristics and by using complementary common-sense information such as the locations of transport networks. The supervised mode detection methods developed here follow a combination of rule-based heuristics and fuzzy logic systems or machine learning. The unsupervised mode detection methods follow a clustering approach combined with unsupervised random forests. These methods will use variables selected through various variable selection measures.
Following the implementation of these developed methods, the next question addresses their performance and quality and determines the best method that should be adopted for mode detection in urban areas for these particular modes of interest. Using various performance metrics, the results of these proposed methods will be compared against each other as well as against those in existing studies to give an idea of how well the chosen method performs.
Research Question 2: Evaluation and comparison: How do these proposed methods (RQ1) perform and compare against each other? Which is the best method of mode detection for detecting these modes of transportation? What are the most useful features for transport mode detection using CSD?
Hypothesis 2: The results of these algorithms will be compared against ground truth provided by the data collectors’ annotations. Due to the noisier and dirtier characteristics of CSD as compared to GPS data, it is likely that the inclusion of contextual data such as GIS data of the transportation network to supplement the CSD will lead to better performances of the methods.
1.3 Main expected outcomes
The main contribution of this thesis is thus a methodology to detect modes of transportation from CSD data based on their spatial and temporal features. A set of methods that are permutations of various existing approaches will be proposed with the goal of finding the one best suited to CSD after evaluating their results. This will be a novel contribution to the field as transport mode detection using CSD is still very much in its infancy.
1.4 Thesis structure
A summary of related works will be presented in the next chapter, Chapter 2. It will also highlight the pros and cons of existing mode detection methods and how lessons learned from them are used in the development and creation of our proposed methods. Following this, Chapter 3 will give an overview of the data and Chapter 4 the methods proposed for mode detection in this study. The evaluation results will be presented in Chapter 5, along with sensitivity analyses of the chosen parameters. Upon discussion of these various methods in Chapter 6, the method(s) deemed best and most suited for CSD will be recommended. Both research questions as well as limitations of the study will also be discussed in this chapter. Lastly, Chapter 7 concludes the study with a summary as well as considerations for future work.
CHAPTER 2 BACKGROUND AND RELATED WORK
2.1 The world of transportation
Recent decades of massive population growth combined with the huge influx of urban migration have called for the need to manage our urban resources in a more effective way to streamline already limited resources. Transportation is one of these key issues that growing populations have to grapple with, as it is an unavoidable aspect of everyday life. Motivated by the need to better serve society, cities need to be able to forecast future travel demand so as to channel the right investments in the right volumes to the right places, such as to large-scale transportation projects. Much effort has gone into seeking to develop models that predict where and when people travel to, how they do it, and what affects these choices. Information like how transport networks perform in terms of congestion and flow, or how route and mode choices respond to road pricing schemes, is extremely important in helping dense cities keep up with the increasing pressures and demands of a growing population. By evaluating that information, decision makers are offered valuable insights into urban activities and movement flows, enabling them to make the best and most prudent decisions for the good of the people they serve (Rasmussen et al., 2015). For example, urban planners can make a city more livable by mitigating congestion and planning for developments to cater to high volumes of people. Transport planners can understand the mobility patterns in greater depth, knowing times and locations of traffic hotspots on the roads, as well as on the transit networks. Even companies who want to streamline their products and services can direct them to those who need them most. In the grander scheme of things, models that can predict how people travel in time and space can help in the fight to reduce our global carbon footprint. Our reliance on motorized forms of transportation is one of the key driving forces of climate change, further motivating transport researchers and providers to strive towards more sustainable transport networks, ones that promote the use of non-motorized or public transportation, for example.
2.2 Transport mode detection
A key way to achieve a greater understanding of human travel patterns is an understanding of the modes of transportation people take, as well as their corresponding temporal distributions. This problem has been tackled differently based on what the objective of the researchers is and can be summarised into three main branches (Prelipcean et al., 2017). First is Location-Based Services (LBS), whereby the goal is to detect the mode as close to real time as possible so that important and relevant information can be given to the commuters or interested parties at the suitable times and places, such as with Parkbob’s smart car parking application². This “on-demand” kind of mode detection is in line with many cities’ aspirations toward a fully functional smart city, where resources can be allocated on the fly to places or people who require them. Another huge and long-standing branch is transportation science, which aims to generate reliable and usable statistical data on usage of the transport system. This data is used as the foundation to answer many city planning questions and further many of the applications mentioned previously. With such valuable uses of this data, transportation scientists have continually tried to gather this information, mainly in the form of actively solicited travel diaries, or paper, internet and phone surveys (Rojas et al., 2016; Shen and Stopher, 2014a). These traditional approaches have proven to be inaccurate, time-consuming and resource intensive as they rely on people manually self-reporting their daily activities, travels, and corresponding schedules. Often, these get under-reported due to forgetfulness or the amount of effort required (Bohte and Maat, 2009). Transportation scientists are motivated to overcome these problems and automate part of this data collection, as the positive impact of good quality data on transportation mode usage is compelling. Thirdly, transport mode detection is of great interest to the field of human geography, whose objectives are largely to enrich these datasets with domain-specific semantics such as associated Points-Of-Interest. This field of research has methods of mode detection similar to those of transportation science, but has outputs that cover a huge scope, using these human mobility trajectories to answer a myriad of questions such as linguistic evolution or human interaction patterns (Prelipcean et al., 2017). This research will have aims more in line with those of transportation science, in developing methods to collect information on people’s mode choices. The applications, however, are motivated to support Parkbob, whose services lie more in the LBS realm.
2 http://www.parkbob.com/
2.3 Transport mode detection using Global Positioning Systems data
In May 2000, the US government decided to remove selective availability of Global Positioning Systems (GPS), a military measure that intentionally degraded GPS signals for security reasons. Now GPS devices can determine locations with accuracies of less than 10 m (Bohte and Maat, 2009) for various purposes in the civilian world. Some examples include agriculture, to accurately monitor yield data or to enable work in poor visibility or weather, and aviation, for the continuous provision of reliable and accurate information on flights as well as for more efficient air traffic management (Kaplan and Hegarty, 2005). Disaster management is another area that benefits from this technology. When little information is available, GPS makes mapping of disaster zones possible. Flood and earthquake prediction capabilities are also improved with this technology (Kaplan and Hegarty, 2005). Transportation is a great beneficiary of GPS, with more accurate positioning leading to better schedule adherence and transport demand estimation, for example.
As compared to more conventional means of collecting such data through surveys and travel diaries, collection via GPS devices alleviates many of the former’s shortcomings, on top of providing greater opportunities in terms of the quality and quantity of data collected. More comprehensive information on origins, destinations and the routes taken between them, trip start and end times as well as trip lengths can be more realistically obtained, as respondents do not rely on memory or need to make the effort to pen down their schedule. The data tends to be more accurate and independent of the respondent’s perception of durations, distances, and departure/arrival times (Rojas et al., 2016; Shen and Stopher, 2014a). Underreporting is also avoided as the GPS logger captures all movements of participants (Stopher et al., 2008). This also means that data collection can be done over a prolonged period of time. Furthermore, GPS can also be used in conjunction with traditional methods as a means of verification. These advantages have led to the rise of incorporating these technologies as a supplementary tool or to completely replace travel surveys. Now, GPS units are commonplace and are accurate and lightweight enough to make this a feasible alternative for data collection, facilitating more complete analyses (Bolbol et al., 2012; Gong et al., 2012). More complex travel patterns can be mined from information on mode, such as what combinations are taken, route choices in multi-modal trips and how they vary on different days or at different times.
2.3.1 Pre-processing
The raw GPS datasets collected are extremely large, sometimes containing logs in the order of millions. This also includes irrelevant data such as when the person is not travelling. Combined with issues like signal loss and cold starts (when the device is turned back on and takes time to recalculate information), a set of pre-processing techniques must be applied to turn this dataset into a comprehensible information source. This usually entails cleaning the data of noise and then segmenting it into individual trajectories, or trips with start and end points (also known as Segment Identification or SI). A commonly used method for this step is the use of rule-based algorithms, often by identifying stop points and assigning them as start/end points of the trajectory (Shen and Stopher, 2014b). Many studies use a threshold of 120 seconds as the minimum time a person must be in the same place for it to be considered a start or end point, which could be an activity or a mode changing point (Chung and Shalaby, 2005; Gong et al., 2012; Stopher et al., 2008). Waiting times at traffic lights or bus stops tend to be lower than that, making it a reasonable criterion. To date, this rule is still being used, but is also supplemented with other rules. For example, Schüssler & Axhausen (2009) combine this threshold and point density as their criteria for activity detection. Activity locations are detected when observations meet two criteria:
(1) low speeds (
2.3.2 Mode detection
Earlier GPS studies differentiated between walking, driving and motorized modes (Bohte and Maat, 2009; Chung and Shalaby, 2005), but more recent studies have begun to detect public transportation modes as well (Axhausen and Schüssler, 2009), and many go a step further and classify these motorized modes into the various modes of public transportation like buses or trains (Gong et al., 2012; Rasmussen et al., 2015; Stenneth et al., 2011). Input variables that determine modes, such as average, maximum and standard deviation of speed, acceleration measures, average dwell time and average heading change, are frequently used in this stage (Gonzalez et al., 2010; Stenneth et al., 2011; Xiao et al., 2015). Due to the growing popularity of geospatial data, many mode detection studies also incorporate data from external sources such as the transportation network or real time public transport information (Asgari et al., 2016; Gong et al., 2012; Stenneth et al., 2011; Tsui and Shalaby, 2006). Especially in times of heavy traffic when movement is slow, it is difficult to infer mode solely from velocity. The areas where a bus or tram can be are generally fixed. This type of data fusion with GIS data has proven to make mode detection more robust and produce much better results than the baseline method without contextual information.
When it comes to research more in the line of transportation science, there seems to be a preference for inferring modes using rule-based methods (Bohte and Maat, 2009; Chung and Shalaby, 2005; Gong et al., 2012) and fuzzy logic systems (Axhausen and Schüssler, 2009; Rasmussen et al., 2015). Some also use supervised learning methods such as Random Forests. These three types of methods will be described in greater depth in the following section.
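To make the input variables listed above more concrete, the following is a minimal sketch of how speed, acceleration and heading-change features might be derived from one GPS segment. The function names, the tuple layout `(t_seconds, lat, lon)` and the choice of the haversine distance are assumptions made for illustration only; they are not taken from any of the cited studies.

```python
import math
from statistics import mean, pstdev

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, used by the haversine formula


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 coordinates."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees within [0, 360)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlmb)
    return math.degrees(math.atan2(y, x)) % 360.0


def segment_features(points):
    """points: time-ordered list of (t_seconds, lat, lon) for one segment."""
    speeds, bearings, dts = [], [], []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(points, points[1:]):
        dt = t1 - t0
        if dt <= 0:  # skip duplicated or out-of-order timestamps
            continue
        speeds.append(haversine_m(la0, lo0, la1, lo1) / dt)
        bearings.append(bearing_deg(la0, lo0, la1, lo1))
        dts.append(dt)
    if not speeds:
        raise ValueError("need at least two points with increasing timestamps")
    # Acceleration: change in speed divided by the duration of the later step.
    accels = [abs(v1 - v0) / dt for v0, v1, dt in zip(speeds, speeds[1:], dts[1:])]
    # Heading change: smallest angle between consecutive bearings (0 to 180).
    turns = [min(abs(b1 - b0), 360.0 - abs(b1 - b0)) for b0, b1 in zip(bearings, bearings[1:])]
    return {
        "mean_speed": mean(speeds),          # m/s
        "max_speed": max(speeds),            # m/s
        "sd_speed": pstdev(speeds),          # m/s
        "max_abs_accel": max(accels, default=0.0),  # m/s^2
        "mean_heading_change": mean(turns) if turns else 0.0,  # degrees
    }
```

For a track moving due north at constant pace, such as `[(0, 0.0, 0.0), (10, 0.001, 0.0), (20, 0.002, 0.0)]`, the sketch yields a mean speed of roughly 11.1 m/s with zero speed deviation and zero heading change.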
Rule-based heuristics
Gong et al. (2012) use a rule-based GIS algorithm that automatically processes GPS data to detect 5 modes. The algorithm also recognizes whether mode transfers within a trip are feasible. By combining GIS data like street centerlines, bus routes and stops, subway lines, stations and station entrances, this method is able to achieve a promising 82.6% accuracy. First, trajectories are split into segments by identifying stop and mode change points. Through a set of hierarchical rules, walk modes are inferred first based on speeds. Next, by comparison to the public transportation network, rail followed by bus modes are inferred. The rest are considered as car modes. Thresholds used are based on the specifications of the city. For example, in the study area of NYC, the maximum length of trains was 184 m, so the threshold for maximum distance to a station to be considered rail mode was set at 200 m to account for the fact that the user could be at the end of the train while it stopped at the station. Other rules derived from context-aware information include the third rule, where the maximum speed and acceleration of an express bus in New York City are 88 km/h and 1.5 m/s², respectively. The full set of rules can be seen in Figure 1. The study showed promising results, but noted that the relatively lower accuracy of bus and car mode identification was due to the dense street networks and the consequences of the urban canyon effect, which sometimes causes a parallel shift of GPS observations (Gong et al., 2012). This can lead to misclassification of bus modes as car or walk modes. Map-matching techniques derived from Chung and Shalaby's (2005) paper were also applied to match walk segments to street segments. Furthermore, that paper developed a trip reconstruction tool using GPS data with a rule-based algorithm as well, which achieved an accuracy of 92% across all four modes of interest. Bohte and Maat (2009) also use straightforward rules (Figure 2) on measures of maximum and average trip speeds to infer modes, starting from the slower modes, walk, then bicycle and car, followed by public train modes, due to their characteristic location that is constrained by the rail network. Stopher et al. (2008) manage to achieve an impressive 95% accuracy with another hierarchical set of rules together with external transport network data. Furthermore, the studies found that the distinction between bus and car modes was very sensitive to their specific rules, and due to the similar speed profiles of both modes, there is usually a high tradeoff between success rates for one mode and the other (Bohte and Maat, 2009; Gong et al., 2012).
Figure 1 Rules used in Gong et al.'s paper in the mode detection process (Gong et al., 2012)
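The hierarchical ordering described for Gong et al. (2012), walk first, then rail, then bus, with car as the remainder, can be sketched as a simple cascade. This is only an illustration of the ordering, not a reproduction of the rules in Figure 1: the 200 m station distance and the 88 km/h and 1.5 m/s² bus limits come from the text above, whereas the walking speed limit, the bus-route share and all field names are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    avg_speed_kmh: float
    max_speed_kmh: float
    max_accel_ms2: float
    start_dist_to_station_m: float  # distance from segment start to nearest rail station
    end_dist_to_station_m: float    # distance from segment end to nearest rail station
    share_near_bus_route: float     # fraction of points close to a bus route (0..1)


RAIL_STATION_MAX_M = 200.0   # from the text (train length of 184 m, rounded up)
BUS_MAX_SPEED_KMH = 88.0     # from the text (express bus in New York City)
BUS_MAX_ACCEL_MS2 = 1.5      # from the text
WALK_MAX_AVG_KMH = 10.0      # ASSUMPTION: placeholder, not a value from the study
BUS_MIN_ROUTE_SHARE = 0.8    # ASSUMPTION: placeholder, not a value from the study


def classify(seg: Segment) -> str:
    """Apply the rules in hierarchical order; the first matching rule wins."""
    if seg.avg_speed_kmh <= WALK_MAX_AVG_KMH:
        return "walk"
    if (seg.start_dist_to_station_m <= RAIL_STATION_MAX_M
            and seg.end_dist_to_station_m <= RAIL_STATION_MAX_M):
        return "rail"
    if (seg.share_near_bus_route >= BUS_MIN_ROUTE_SHARE
            and seg.max_speed_kmh <= BUS_MAX_SPEED_KMH
            and seg.max_accel_ms2 <= BUS_MAX_ACCEL_MS2):
        return "bus"
    return "car"  # everything not matched above
```

Because bus and car segments differ only in the last rule, small changes to the placeholder thresholds shift segments between the two classes, which mirrors the sensitivity reported by the studies.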
Figure 2 Rules used in Bohte & Maat’s paper for mode detection (Bohte and Maat, 2009)
Another study by Kasahara et al. applied a rule-based mode detection method, but detected high-speed modes first, and assigned modes to the individual observations instead of trips (Kasahara et al., 2017). Observations of the same mode are subsequently merged into trips, provided the time period is less than a certain threshold. However, despite the high performance, some opine that this method struggles with low generalizability, as rules obtained from one city may not be applicable to another city due to various reasons like the built environment affecting GPS signals (or the density of cell towers, affecting overall coverage and signal strength). Nevertheless, the simplicity and comprehensibility of these methods mean that it is feasible to derive parameters for each city (lengths of trains or average distance between stops from the city's transport provider, for example).
Fuzzy Logic Systems
Fuzzy logic (FL) systems are powerful predictive models as they can handle uncertainty and vagueness in a way that is understandable by humans. However, the success of a fuzzy expert system lies in the proper selection of its functions and parameters, which is usually done manually (Das and Winter, 2016a). Unlike crisp sets with hard border values, fuzzy set theory assigns membership values to an element, introducing the concept of partial membership of that element in a set, or a number of sets. The way FL is used in these studies is mostly subjective, as the empirical approach to generating these rules is dependent on human judgment to define them. Schüssler & Axhausen (2009) use an open source FL platform to generate trapezoidal membership functions of their fuzzy variables. The variables were median of speed, 95th percentile speed and acceleration, and were explicitly chosen over average values to make the algorithms more robust against outliers. Figure 3 shows the membership functions of each variable. A minimum of one rule is defined for each mode based on these membership functions, as seen by the examples in Table 1.
Ambiguity and fuzziness are intentionally introduced through the rules as well as through the overlapping membership functions. This can be especially useful for modes that have variable speed profiles, such as buses, which start and stop frequently and whose speed changes depending on whether they are in the city center, where the stops are close together, or in residential areas, where there are longer stretches between stops (Tsui and Shalaby, 2006). Modes are finally inferred based on the membership values from the aggregated membership functions. Rasmussen et al. (2015) applied a similar technique to their study area in Copenhagen using the same variables, but with values derived from their own expert knowledge and analysis. They also combined the FL system with a rule-based method to first sieve out rail trips, based on the assumption that rail lines are characteristically different from road networks and thus a trajectory aligned with rail lines has a high possibility of being a rail trip. The study found that the FL rules were still insufficient to effectively distinguish between bus and car modes. In response to that, they measured the alignment of the identified GPS stops with that of the bus routes. Both studies applied feedback mechanisms to correct implausible combinations such as car to bicycle or car-bus-car to more realistic modes. For example, if the sequence of modes were car-bus-car, the algorithm would flag it and reclassify the trajectories as a car trip. High spatial and temporal granularity allows for shorter and more detailed trips that may constitute one single journey to be identified, allowing for this method to act as a suitable feedback algorithm.
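The feedback step for implausible mode sequences can be sketched as a lookup over consecutive triples. Only the car-bus-car example is taken from the text; the table structure, the function name and the decision to correct only the middle element are assumptions made for this illustration.

```python
# Implausible triple -> replacement for the middle element.
# Only car-bus-car is mentioned in the text; further entries would be added per study.
IMPLAUSIBLE = {("car", "bus", "car"): "car"}


def correct_sequence(modes, implausible=IMPLAUSIBLE):
    """Return a copy of the mode sequence with flagged middle modes reclassified."""
    fixed = list(modes)
    for i in range(1, len(fixed) - 1):
        key = (fixed[i - 1], fixed[i], fixed[i + 1])
        if key in implausible:
            fixed[i] = implausible[key]
    return fixed
```

Applied to `["walk", "car", "bus", "car", "walk"]`, the sketch returns `["walk", "car", "car", "car", "walk"]`, leaving plausible transitions such as walk to car untouched.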
Figure 3 Fuzzy Logic membership functions generated by human expertise (Axhausen and Schüssler, 2009)
Table 1 Examples of fuzzy rules for mode detection (Axhausen and Schüssler, 2009)
The paper did not report any accuracy measures of this probabilistic method, but compared the results with the official census data on travel behavior that was released a few years prior to the study, and concluded that this form of mode detection yields realistic and reasonable results. A study released after that designed a more complex FL system so as to classify more modes. A few more fuzzy variables were added to the list, including proximity to a network like the railway or bus network. This method was able to detect walking, bicycling, car, ferry boat, sailboat, train, subway, bus, tram and flight modes using GPS data and these fuzzy variables with an accuracy of 91.6% (Biljecki, 2010). The fuzzy system had certainty factors applied to each result to measure the confidence of the inference. One drawback of FL systems is that these rules tend not to take into account inter-variable correlation, and as the rules are generated using experts’ understanding of the field, any class (mode) additions to the model would be extremely costly (Elkan et al., 1994). This may prove to be a problem when trying to transfer this method that was designed for GPS data to CSD. However, it is still possible to construct an FL model without expert knowledge, given a set of input and output pairs. The task then is fundamentally akin to determining a system that provides the best fit to these pairs (Mendel, 1997), and will be explored further in the next chapter.
Machine Learning
Due to certain limitations of setting rules and algorithms in programs, some studies have turned to machine learning methods instead. These methods commonly include neural networks and tree-based models, among a few others. Gonzalez et al. applied a reduced, subsampled GPS dataset to neural networks to infer modes. The subset consisted only of critical points, which were characterized by heading change and a minimum speed, so as to remove redundant data and minimize processing. The inputs chosen for the neural network algorithm were the common variables such as acceleration, speed, distances between stop locations and dwell time, as well as GPS-specific variables like estimated horizontal accuracy uncertainty (Gonzalez et al., 2010). The neural network was able to learn to distinguish between walk, car and bus trips. However, the paper noted that the critical points used in the proposed algorithm were insufficient to achieve a good result. Tsui & Shalaby (2006) proposed a hybrid method that combines a neural network with a fuzzy logic system for mode detection. The fuzzy variables chosen were similar to those of the other FL studies described above, with the inclusion of data quality. However, the parameters of these variables and their membership functions were set by a neural network algorithm (NEFCLASS-J). Their work managed to identify modes (walk, bus, bicycle, car, rail) with an overall accuracy of 91%, though the performance of bus modes was relatively poor due to the considerable overlap of characteristics with other travel modes. These, combined with the large variability of movements in buses, were cited as reasons for this poorer performance. The paper, however, did not report on the mode share of the actual data. Several works also include temporal measures in their learning methods, such as time of day, to give context to a probability model to estimate mode choice (Liao et al., 2007). Stenneth et al. (2011) incorporate live bus and train times when inferring between stationary, walking, cycling, bus, driving, and train modes. They extracted variables such as average speed, heading change and acceleration, as well as context-aware information like average bus line closeness, rail line closeness and bus stop closeness. The authors ran these variables through several learning algorithms (Random Forest, Decision Tree, Naïve Bayes, Bayesian Network and Multilayer Perceptron) and found that Random Forest had the best performance. A few important strengths of RF that are relevant to this study are that it is one of the highest performing machine learning algorithms in terms of accuracy and can run efficiently on large datasets. Estimates of which variables are important are also included in the output, which can be useful for purposes like dimension reduction (Degenhardt et al., 2017). It is also able to generate pairwise proximities between data points that can be used as input in other classification methods like unsupervised k-medoid clustering.
2.4 Transport mode detection using cellular network data
Pervasive technologies such as mobile phones create datasets that give an inside look at how people use the city’s infrastructure. Urban planning is one of the greater beneficiaries of the analysis of this collective personal location data. Mobile phone traces contribute to a massive pool of passive data that can provide knowledge on the whereabouts and movements of individual users. According to the Global System for Mobile Communications Association (GSMA), an international trade body that represents the interests of the world’s mobile network operators, there were about 7.7 billion mobile connections held by 5 billion unique subscribers in 2017. 465 million of these subscribers reside in Europe alone³. This large potential has not gone unnoticed, as many of these operators have begun to experiment with new business models that would generate revenue from both their mobile subscribers as well as other customers such as traffic analysis, advertising and marketing, and social networking companies. As such, it is no surprise that the sharing of such mobile data with research communities has started to pick up speed (Calabrese et al., 2015). The main hurdle is the lower spatial resolution and the inconsistent and sometimes sparse sampling of the data. As such, this data requires a specialized set of techniques for extracting valuable and usable information from it. There are various types of cellular network data such as call detail records and cellular signaling data. The latter, which is the one that will be used in this research, is known by many names, including floating cellular/phone data, sightings data, and so on. Furthermore, this data type can have varying properties depending on whether the phone is connected to a 2G, 3G or 4G network. This gives an indication of how recently this data type has been incorporated into such research fields. For example, Sevtsuk and Ratti (2010) address how coarse-grained call volume data in Rome can be used to tease out properties of user mobility, where they found regularity and patterns in urban mobility at different times of the day, as well as on different days of the week. Other indicators like demographic, economic and infrastructural indicators were used to supplement and account for these patterns. Travel routes can also be estimated using cellular network based Voronoi cells and map matching (Tettamanti et al., 2012). Other studies also use the fluctuations of
3 https://www.gsma.com/newsroom/press-release/number-mobile-subscribers-worldwide-hits-5-billion/
signal strength in GSM cellular data to estimate more precise geographic coordinates for these purposes (Thiagarajan et al., 2011).
Transport mode detection is another area of research using cellular network data that is still in its early stages. This is largely due to the data’s lower spatial and temporal granularity, the main challenges to the computation of specific measurements on speed. However, there have been a few studies attempting to estimate coarse speeds according to the rate of change of the connected cells, as well as the distances between them (Gonzalez et al., 2008; Reddy et al., 2010a; Sohn et al., 2006). Others have made sense of coarse CDR data by clustering travel times of trips into the corresponding transportation mode clusters (Kalatian and Shafahi, 2016; Wang et al., 2010). The next sections will delve deeper into how cellular network data is generated, processed, and used for transport mode detection.
2.4.1 Mobile phone network structure
The basic network structure is composed of a Core Network (CN) and a Radio Access Network (RAN). The CN is divided into a Circuit-Switched (CS) domain for activities like voice calls and a Packet-Switched (PS) domain for packet data transfers. Radio communication occurs between the mobile phones (terminals) and the base station serving that cell. As such, cells are the smallest spatial entities in the cellular network, with a geographic coverage that varies from magnitudes of meters (microcells) up to several kilometers (macrocells). Several cells together make up a Location Area (LA) (Janecek et al., 2012; Miao et al., 2016). This structure is illustrated in Figure 4.
2.4.2 Data Generation
Mobile phone positioning occurs whenever a terminal communicates with the network, essentially when a user uses their phone (Chen et al., 2016). Calabrese et al. (2015) categorize mobile phone data into two types, event driven and network driven. Event driven data is generated during billed activities such as when calls or texts are made, or when data is being transferred while browsing the Internet, for example. These include Call Detail Records (CDRs) and Internet Protocol detail records (IPDRs) respectively. At this stage, the terminal is said to be in active state, whereby the voice call or data connection is open. At any given time, the majority of mobile terminals are not in the active state, but in the idle state. Even terminals that have their data connection permanently switched on remain in the idle state, switching to the active state only during packet bursts, like data downloads (Janecek et al., 2012). The information in each event that is ultimately recorded in the data depends largely on the mobile phone provider that operates the network. For example, CDRs could include the IDs of the callers, receivers and cell towers, and start and end time stamps. Similarly, IPDRs will consist of information on Internet usage and other cellular data related activities.
Network driven data is also known as floating cellular/phone data or signaling data and is generated whenever a phone is localized, i.e., during different types of location updates (Figure 4). Periodic updates occur at intervals determined by the telecommunications provider to generate regular information on which cell tower the terminal is currently connected to. Handovers are generated when an active terminal moves between two cells and lastly, mobility location updates are generated when a terminal moves between two location areas. As such, depending on the state (active vs. idle) of the terminal, the spatial granularity of the data recorded can be at the cell level or at the location area level.
Figure 4 a) Location Area and Base Stations b) Periodic updates c) Handovers d) Mobility location updates (Calabrese et al., 2015)
2.4.3 Spatial and Temporal Granularity
The frequency of the data depends on the type of mobile data generated and is largely user dependent. Chen et al. (2016) found that the frequency of both event and network driven data types displays high heterogeneity in the number of times the phone was localized or a call was made, for example, whereby the majority of users have a small number of records and only a few have a large number. Each voice call generates one CDR. However, that same call might generate multiple network driven data points if multiple cells are traversed during the duration of the call. Periodic updates are typically in the order of a few hours (Widhalm et al., 2015). For purposes of clarity, from this point on the term cellular signaling data (CSD) will refer to both event driven and network driven data.
A study that used CDRs found that the average time interval between each event was about 8 hours (Gonzalez et al., 2008). These intervals accurately represent the time intervals between each call and are much longer than those in datasets that include network driven data as well. In the latter, as a single call might trigger multiple network driven events, these events tend to be more clustered together in terms of the times they were recorded (Chen et al., 2016). In Calabrese et al.'s (2011) paper, the average inter-event time interval in data that included network driven data was found to be 260 minutes, with the average of the quartiles’ medians being less than 1.5 hours. As such, the data was fine enough for the researchers to identify stops as short as that time interval. However, it is also important to note that stops of less than 1.5 hours will be missed. This might add to inaccuracies of the processed data, especially since some household travel surveys define a stop exceeding 5 minutes to be an activity that should be recorded (Chen et al., 2016). As for location area updates, there may be cases whereby no updates are sent from the mobile phone despite large amounts of movement if the location area covers a large area, some several hundreds of kilometers.
In terms of spatial granularity, these data types differ from GPS data in the sense that the location information has to be estimated using various methods. As this means that the cell phone events only contain approximated locations, this is significantly less accurate than that of GPS data (Horn et al., 2014). Triangulation is often used and results in coordinates that do not correspond to the cell tower location but are an estimate of the terminal's position. Measures such as received signal strength, transmission time and angles of multiple towers are used in the estimation if there are multiple base stations in available range (Chen et al., 2016). The algorithms used here are usually undisclosed by the mobile provider and use both event driven and network data. Furthermore, most mobile providers do not disclose the structure and organization of their cellular networks, or the spatial extents of each cell or LA, which vary depending on the density of towers and the level of urbanization (Widhalm et al., 2015). Experiments show that the spatial resolution ranges from the order of a few meters (Chen et al., 2016) to about 300 m (Calabrese et al., 2011; Jiang et al., 2013) or 500 m (Horn et al., 2014) in urban areas where the density of cell towers is much higher, to several kilometers (Horn et al., 2014; Widhalm et al., 2015) in rural, less heavily populated areas.
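The difference between the mean and the median inter-event interval discussed in this section can be illustrated with a short sketch. The function name and the example trace are hypothetical; the point is only that a few long gaps between bursts of clustered events pull the mean far above the median.

```python
from statistics import mean, median


def inter_event_minutes(timestamps):
    """Return the gaps, in minutes, between consecutive events of one user.

    timestamps: event times in seconds, in any order.
    """
    ts = sorted(timestamps)
    return [(b - a) / 60.0 for a, b in zip(ts, ts[1:])]


# Hypothetical trace: three events within two minutes, then one event ten hours later.
gaps = inter_event_minutes([0, 60, 120, 36000])
summary = {"mean_min": mean(gaps), "median_min": median(gaps)}
```

For this made-up trace the mean gap is 200 minutes while the median is 1 minute, which is why a median-based summary can suggest a much finer temporal resolution than the mean.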
2.4.4 Pre-processing
Several pre-processing techniques must first be applied before valuable information on human patterns can be extracted. These vary depending on the research aims, but generally, noise reduction techniques in the form of filters or clustering are applied to filter out inaccuracies in the data. Next, these observations are segmented into individual segments, whereas some GPS studies assign modes to the observations and then group similar consecutive observations into a mode trajectory (Kasahara et al., 2017; Reddy et al., 2010b).
Noise reduction
Pre-processing needs to be done to reduce noise. This is done to lower the influence on the final analysis of outliers that arise from various phenomena. One such phenomenon, known as the ping-pong effect, occurs when the terminal bounces back and forth between multiple base stations while the user is not moving (Fiadino et al., 2012; Miao et al., 2016). This occurs due to fluctuations in the received signal strength and hence leads to oscillation between different cell towers despite the user being stationary. These fluctuations in signal strength are also likely to have an impact on the estimated triangulated position, leading to what appear to be drifts and shifts in the location of the data points. Also, in the event that there are several cell towers whose signals reach a terminal, the connection of this device may hop between these towers. This means that outliers can suddenly occur kilometers away within an unrealistically short period of time. While some outliers (like the above) can be the result of localization errors, others can be intentionally triggered for privacy protection reasons. For instance, arbitrary events can be inserted into the mobile traces to prevent the creation of movement profiles. These events are part of efforts towards privacy protection and are called temporary mobile subscriber identities or TMSI (3GPP, 2010)4.
One approach to noise reduction is pattern-based recognition, which requires information on which cell tower the phone is connected to (Iovan et al., 2013; Schlaich et al., 2010). Users with high oscillations between cell towers are identified with a proposed “jumpiness rule” using the number of updates and area codes. Another approach is speed-based correction. A threshold is chosen to distinguish between what is a reasonable speed and what is not. Instances that produce values that exceed this threshold are flagged. This can be done in a number of ways, as explored by Horn et al. (2014). They tested a series of filtering techniques including a recursive naïve filter, a recursive look-ahead filter and a Kalman filter. Outliers are identified as data points where the calculated speeds are exceptionally fast, and the threshold was set to 250 km/h. The recursive naïve filter simply removes any outliers from a sequential stream of events, whereas the recursive look-ahead filter accounts for the possibility that the event before the outlying event is the outlier instead. The Kalman filter is more complex in that it takes a probabilistic approach, and is a popular choice in data prediction tasks including traffic modeling using GPS and other sensor data (Faragher, 2012). It produces estimates of unknown variables in noisy time series through approximating joint probability distributions over the variables in their time frames (Kalman, 1960). Results show that the recursive filters outperformed the Kalman filter, and one of the reasons proposed was the temporal sparseness of the cellular signaling data. As such, the former are more suitable as a noise reduction method for more irregular data like CSD.
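The recursive naïve filter can be illustrated with a short sketch. This is not the implementation of Horn et al. (2014): the event layout `(t_seconds, lat, lon)`, the function names and the use of great-circle distance are assumptions made here for illustration, and only the 250 km/h threshold is taken from the text.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two WGS84 points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def speed_kmh(a, b):
    """Implied speed between two events given as (t_seconds, lat, lon)."""
    dt_h = (b[0] - a[0]) / 3600.0
    if dt_h <= 0:
        # Assumption: events without a positive time gap are treated as outliers.
        return float("inf")
    return haversine_km(a[1], a[2], b[1], b[2]) / dt_h

def naive_speed_filter(events, max_kmh=250.0):
    """Keep an event only if the speed from the last kept event is plausible."""
    kept = []
    for ev in events:
        if not kept or speed_kmh(kept[-1], ev) <= max_kmh:
            kept.append(ev)
    return kept

# Hypothetical trace: three events in Graz and one implausible jump to Vienna.
trace = [
    (0, 47.070, 15.440),
    (60, 47.071, 15.441),
    (120, 48.210, 16.370),  # would require several thousand km/h
    (180, 47.072, 15.442),
]
cleaned = naive_speed_filter(trace)
```

Because each event is compared with the last event that was kept, a single bad fix is dropped and the stream continues from the previous valid position. A look-ahead variant would additionally test whether removing the earlier event instead yields a more plausible sequence.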
4 3rd Generation Partnership Project; Technical Specification Group Core Network and Terminals; Numbering, addressing and identification
Trip extraction
In many of the studies using cellular network data to mine human patterns, individual trips are first extracted before they are analyzed for mode detection or aggregated for larger-scale analysis such as urban activity analysis. To achieve this, key places must be identified. On top of assigning users to these locations, a distinction must be made as to whether these places are stops or whether the user is merely passing through them. The former can be places of activities, like work, home or leisure; origins or destinations when trying to generate Origin-Destination (OD) matrices; or more exact locations as start and end points of trips.
Assigning these start and end points to specific locations can be done using a few methods. A more straightforward and commonly used approach is to use centroids of cell areas if the cell tower locations are known (Gonzalez et al., 2008; Tettamanti et al., 2012). Many other studies use stop detection to filter out significant places, with the rationale being that if a person stops there for a reasonable time period, these places are important in human patterns. Due to the noisy and raw nature of cellular network data, the same event can sometimes be registered as many consecutive events that could be related to various locations in its surroundings. These locations are filtered out by a set of rules pertaining to space, space and time, or speed. A commonly used method here is spatial and temporal clustering. Wang et al. (2010) worked with CDRs, and used an incremental clustering algorithm to extract these stop locations. A radius corresponding to the estimated positioning error (set at 1 km) and a minimum dwell time were defined as the thresholds to form clusters. The medoids of these clusters were found and the remaining points in the clusters were deleted. Consequently, these medoids were set as the start and end points of each trip. Jiang et al. (2013) apply similar thresholds to cellular signaling data, but with finer thresholds of 300 m and 10 minutes to detect stay locations, as illustrated in Figure 5.
Figure 5 Illustration of data pre-processing to extract stay locations (Jiang et al., 2013) with distance and time constraints of 300 m and 10 minutes.
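The idea of combining a distance and a dwell-time threshold can be sketched as below. This is a deliberately simplified illustration and not the algorithm of Jiang et al. (2013) or Wang et al. (2010): it assumes time-ordered points in a projected coordinate system in meters, anchors each candidate group at its first point, and summarizes a stay by the mean position rather than the medoid. The function name and data layout are hypothetical.

```python
import math

def detect_stays(points, max_dist_m=300.0, min_dwell_s=600.0):
    """Find stays in time-ordered points given as (t_seconds, x_m, y_m).

    A stay is a run of consecutive points that all lie within max_dist_m
    of the first point of the run and that spans at least min_dwell_s.
    Returns a list of (t_start, t_end, x_mean, y_mean).
    """
    stays = []
    i, n = 0, len(points)
    while i < n:
        j = i + 1
        # Extend the run while points remain close to the anchor point i.
        while j < n and math.dist(points[i][1:], points[j][1:]) <= max_dist_m:
            j += 1
        group = points[i:j]
        if group[-1][0] - group[0][0] >= min_dwell_s:
            x_mean = sum(p[1] for p in group) / len(group)
            y_mean = sum(p[2] for p in group) / len(group)
            stays.append((group[0][0], group[-1][0], x_mean, y_mean))
            i = j  # continue after the stay
        else:
            i += 1  # pass-by point, try the next anchor
    return stays

# Hypothetical trace: a stay, a fast movement, then a second stay.
trace = [
    (0, 0.0, 0.0), (300, 50.0, 0.0), (700, 100.0, 50.0),
    (900, 2000.0, 0.0), (1000, 4000.0, 0.0),
    (1100, 6000.0, 0.0), (1200, 6050.0, 10.0), (2000, 6100.0, 0.0),
]
stays = detect_stays(trace)
```

The start and end points of a trip can then be taken from consecutive stays, with the points in between forming the trip itself.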
Another study by Widhalm et al. (2015) applies a similar clustering algorithm to both CDRs and cellular signaling data but incorporates the geometry of the trajectory to filter out pass-by locations. Figure 6 shows an example of how B in I) is not detected as a stop whereas B in II) is. As the study was mining urban activity patterns, the reasoning behind this was that significant extra distances travelled are often motivated by an activity.
Figure 6 Detection of stays using geometry (Widhalm et al., 2015)
2.4.5 Mode detection
Once these trips are extracted, they are ready to be analyzed to identify the trip modes. In order to do so, information on these trips is extracted as trip features. Unlike GPS studies that assign modes to individual observations, most cellular network studies only do so after grouping these observations into segments, treating the segments rather than the individual observations as the smallest unit for which to infer modes. Due to the infancy of mode detection using mobile phone data, the studies are usually limited to CDR-only data, and the existing methods used here can be classified into unsupervised k-means clustering, rule-based heuristics and machine learning.
K-means clustering
A well-known method of mode detection using cellular network data is k-means clustering of travel times. Wang et al. (2010) worked with anonymized CDRs in their study. Start and end points of the trips were assigned to cells in a grid. Trips with the same ODs were grouped together. Trips over 63 minutes were removed as they were deemed to be too long a travelling time within the study area. K-means unsupervised clustering was then performed on the travel times of each of these OD groups, with distinctions between weekdays and weekends. K-means is a centroid-based clustering algorithm that uses the mean value of each cluster (centroid) to represent the cluster (Figure 7). The goal of K-means is thus to reduce the sum of squared errors between the individual objects in the cluster and their centroids (Hastie et al., 2009a). The clustering partitioned the records into two separate clusters corresponding to the modes of interest, namely driving and public transport (Wang et al., 2010), where the cluster with the shorter travel time is assigned to driving and vice versa. There is an assumption of single-modal trips here, similar to many mode detection studies of this nature. The error of the inference is then calculated as the average of the differences between the travel times obtained from the clustering and those reported by Google Maps. Silhouette values, which measure how well associated the cluster members are to the representative of the cluster, were also computed, and these indicated a good performance of the model. Due to the lack of official census data of the city with regards to transportation mode, the model could not be validated against such official records.
Kalatian and Shafahi (2016) used a similar approach and also detected walking. Their study worked on anonymized signaling data and grouped the trips into traffic zones instead of grid cells, and separated them by time of day to account for traffic. Grouping them by the hour meant that there were insufficient records to perform clustering well. As such, they were grouped into trips occurring at similar hours of the day, such as when people commute home from 4 PM to 9 PM. However, while the paper stated that validation was done against surveyed data collected by the city, the results of that validation were not included in the paper.
These clustering methods only use one feature of the trips, the travel time, and assigning modes to these clusters may not be so straightforward. In a city with a well-integrated public transport system, such as many major cities in Europe, travel times by private and public motorized modes can be extremely similar, and public transport is not necessarily the slower of the two.
Figure 7 Trip data clustered to two subgroups, driving and public transit. The arrows show the average travel times of the subgroups. Black lines are the travel times as reported by Google Maps. (Wang et al., 2010)
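To make the clustering step concrete, the following sketch runs k-means with two clusters on the travel times of a single OD group and labels the cluster with the shorter mean travel time as driving. It is an illustration under assumptions, not the code of Wang et al. (2010): the travel times are invented, the quantile-based initialization is a choice made here, and the function name is hypothetical.

```python
def kmeans_1d(values, k=2, max_iter=100):
    """Lloyd's algorithm on scalar values; returns (centroids, labels)."""
    ordered = sorted(values)
    # Assumption: start from evenly spaced quantiles of the data.
    centroids = [float(ordered[int((i + 0.5) * len(ordered) / k)]) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: (v - centroids[c]) ** 2) for v in values]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, labels

# Hypothetical travel times in minutes for trips sharing the same OD pair.
travel_min = [12, 14, 15, 13, 16, 28, 30, 33, 31, 29]
centroids, labels = kmeans_1d(travel_min)

# The cluster with the shorter mean travel time is labeled as driving.
driving = min(range(len(centroids)), key=lambda c: centroids[c])
modes = ["driving" if lab == driving else "public transport" for lab in labels]
```

The sketch also makes the limitation discussed above visible: the mode labels rest entirely on the ordering of the two cluster means, so they are only meaningful where driving is reliably faster than public transport.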
Rule-based mode split
Qu et al. (2015) worked with CDR data to detect transportation mode using a rule-based mode split algorithm that combined speed, trip distance and a logit model. The paper focused on estimating transportation mode shares at the traffic zone level of the city, and only looked at commutes between work and home. These home-work trips were extracted through a longer observational period of 3 weeks, which was possible as the dataset was not subject to anonymization every 24 hours. By approximating the home and work areas as places where the user is mostly found between 8 pm and 7 am and between 9 am and 5 pm respectively, the travel times are subsequently estimated as the time difference between the latest time one is found at home and the earliest time one is found at work. This is to account for the fact that it is unlikely that a user makes a call just before leaving or upon arrival. As a result, they were able to estimate the travel distances and times between home and work, and subsequently from these two values, the speed. Here, the distinction is also between driving, public transportation and walking, where each trip only constitutes one of these modes.
Figure 8 Framework proposed by Qu et al. (2015) for mode detection with CDR data through a speed split, augmenting the dataset with transportation network information and utility approximations.
Based on the assumption that 15 km/h is the maximum speed of a non-motorized mode, Figure 8 shows the speed rule used to split the trips into high- and low-speed trips. High-speed trips that lie far from the underlying public transportation network, as measured by their average distance to it, are counted as car trips. The rest are fed through the logit model. For the low-speed trips, the distinction is made using trip lengths based on the rationale that people do not walk for more than 3 km. Those longer than 3 km are also fed through the logit model, which is a discrete choice model that predicts an individual's choice based on utility or attractiveness. For example, in the study area of Boston, it is regarded as more attractive to use public transportation in the central Boston region and cars in the surrounding suburban region. This differs from many GPS studies using rule-based algorithms, which usually detect the slowest modes first. This can be attributed to the better resolution of GPS data, enabling more representative measurements of slower speeds in modes like walking, with less chance of data inaccuracies resulting in higher speeds. Linear relations between census data and the predictions for each census tract are used to evaluate the performance of the model, and the model does well for identifying car modes, but not for the other two. Also, while some areas observe high prediction accuracies, others have larger deviations from the survey data. One reason cited was the confounding effects of other factors such as income and land use that may have caused larger errors, especially in their logit model.
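The decision rules described for Qu et al. (2015) can be summarized in a short sketch. Only the 15 km/h and 3 km thresholds come from the text. The distance threshold to the public transport network (`pt_dist_threshold_km`), the function names and the utilities passed to the logit function are hypothetical placeholders, since the values used in the original study are not given here.

```python
import math

def split_mode(distance_km, duration_h, avg_dist_to_pt_km, pt_dist_threshold_km=1.0):
    """Rule-based pre-split of a home-work trip.

    Returns "car", "walk", or "logit" when the trip has to be passed on
    to the discrete choice model.
    """
    speed = distance_km / duration_h
    if speed > 15.0:  # faster than any non-motorized mode
        # Assumption: "far from the network" means above a fixed threshold.
        if avg_dist_to_pt_km > pt_dist_threshold_km:
            return "car"
        return "logit"
    if distance_km <= 3.0:  # people rarely walk more than 3 km
        return "walk"
    return "logit"

def binary_logit_prob(utility_a, utility_b):
    """Probability of choosing option a over b in a binary logit model."""
    return 1.0 / (1.0 + math.exp(-(utility_a - utility_b)))

# Hypothetical trips: (distance_km, duration_h, avg distance to PT network in km).
short_walk = split_mode(2.0, 0.5, 0.2)
far_from_pt = split_mode(20.0, 0.5, 3.0)
near_pt = split_mode(10.0, 0.5, 0.1)
slow_long = split_mode(5.0, 1.0, 0.1)
```

Trips returned as "logit" would then be assigned a mode according to the choice probabilities, with utilities that depend on the attractiveness of each mode in the zones concerned.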
Machine learning
Machine learning is sometimes used in mode detection studies using mobile phone data, more specifically the GPS and accelerometer data collected from phone applications. However, a study by Sohn et al. (2006) applied some of these machine learning methods to cellular network data, or more specifically GSM data. A special mobile application was created for this study to capture this data. The dataset they generated was labeled with the modes and included signal strength values, cell IDs, as well as the channel numbers of at most 7 of the nearest cell towers. The method used here assumed that a user is stationary when the observations have a consistent set of towers and signal strengths, and moving when there are changes in these sets. They also found that the Euclidean distances between consecutive observations were proportional to the speed of movement. In essence, a theoretical fingerprint of the signal strengths and constituent cell towers was created for each observation, and from this, seven features were chosen to train the model. These included the Euclidean distance, the correlation of signal strengths from common cell towers and the number of common cell towers between two measurements. The remaining variables were various descriptive statistics of Euclidean distances in the various windows of measurements. The classifiers were trained with a boosted logistic regression technique with a single-node decision tree. Overall the model performed well with an accuracy of 85%, though the identified modes were only stationary, walk and drive. The signal strength information, however, is not available to this study, as it was collected by an app developed by the researchers. There is still value in their work in terms of important variables that can be used in our study.
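Two of the fingerprint features mentioned above, the number of common cell towers and the Euclidean distance between signal strengths, can be sketched as follows. How Sohn et al. (2006) treat towers that appear in only one of the two measurements is not described here, so this sketch assumes the distance is computed over the common towers only. The dictionary layout, cell IDs and signal values are hypothetical.

```python
import math

def fingerprint_features(obs_a, obs_b):
    """Compare two measurements given as {cell_id: signal_strength_dbm}."""
    common = sorted(set(obs_a) & set(obs_b))
    # Assumption: the distance is taken over towers seen in both measurements.
    euclidean = math.sqrt(sum((obs_a[c] - obs_b[c]) ** 2 for c in common))
    return {"n_common": len(common), "euclidean": euclidean}

# Hypothetical consecutive measurements of the surrounding cell towers.
previous = {"A": -60.0, "B": -70.0, "C": -80.0}
current = {"A": -63.0, "B": -74.0, "D": -90.0}
features = fingerprint_features(previous, current)
```

A stationary user would be expected to show many common towers and a small distance between consecutive measurements, whereas a moving user would show fewer common towers and larger distances.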
In a more recent study, Asgari et al. (2016) developed an unsupervised algorithm that enables the mapping of coarse mobile phone traces over a multimodal transportation network, where mobile trajectories are the observations and the hidden states to be predicted are nodes of the multilayer graph. This unsupervised hidden Markov model (HMM) completes the originally sparse trajectory and enriches it with the used modes by leveraging the transportation layer types and their topological properties (i.e. route complexity). The transition probability describes how likely an individual is to move from one hidden state to another using factors like edge type, speed, and length. The model performs well and shows that using the transport network improves performance. Other studies have also attempted to match the observations to the underlying network.
Figure 9 Average Euclidean distance between subsequent observations during stationary, walking and driving periods. (Sohn et al., 2006)
2.5 Variable selection
Methods like machine learning, FL systems and unsupervised clustering have proven to be powerful tools for classification tasks in both GPS and mobile phone data studies. Especially for data with high dimensionality, selecting a reduced set of relevant variables is often ideal if the objective is to build a classification model for the purposes of identification. This reduces the processing time and storage space needed. Furthermore, the selected variables may also provide a suggested framework for future studies using CSD. To the best of our knowledge of existing work using CSD, there seem to be no documented cases of variable selection processes, or descriptions of such methods. As such, this paper has explored a few techniques that could be relevant to our research aims. Two options are explored here: Random Forest (RF), a tree-based model for which labels are required, and Principal Component Analysis (PCA), for which they are not.
Random Forest
Random Forest is a tree-based learning model that has been shown to perform well in mode detection studies using GPS data. Labels are used in the generation of the random forest (RF). These algorithms have the power to handle high-dimensional data, and a key strength is that they output the most significant variables, which is especially relevant if the objective is variable selection. Another benefit of using RF is that it is able to account for and balance errors in imbalanced datasets, where one class may be disproportionately more represented than another.
Tree-based models use labels to build their trees, by splitting the population into two or more homogeneous sets based on the most important variable. This is decided by using the Gini index or entropy to evaluate the quality of a particular split, and is usually used in classification problems rather than regression ones (James et al., 2013). The Gini index is defined as:
$$\mathrm{Gini}(n) = \sum_{c} p_{c,n} \, (1 - p_{c,n})$$
where $p_{c,n}$ is the proportion of individuals that have class $c$ at node $n$. Gini is lowest when all observations in the node belong to the same class, and increases as the observations of the same node have a more even class distribution. The mean decrease of Gini, or the information gain for splitting at node $n$ on variable $x_i$, is defined as the difference between the impurity of the node and the weighted average of the impurities of its child nodes:
$$\mathrm{Gain}(x_i, n) = \mathrm{Gini}(x_i, n) - w_L \, \mathrm{Gini}(x_i, n_L) - w_R \, \mathrm{Gini}(x_i, n_R)$$
Here $\mathrm{Gini}(x_i, \cdot)$ denotes the Gini index of the given node under the split on $x_i$, $n_L$ and $n_R$ are the left and right child nodes of parent node $n$, and the weights assigned to the left and right nodes are $w_L$ and $w_R$ respectively. Based on this calculation, the variable $x_i$ that yields the largest gain, and hence the lowest impurity in the child nodes, is selected to be the basis of the split at node $n$. An alternative to the Gini index is the mean decrease in accuracy. Each tree that uses a particular attribute computes the loss of accuracy separately, and the average of all losses of accuracy is then calculated (Degenhardt et al., 2017).
For example, consider a sample of 100 children with the variables gender, height and age, half of whom eat meat while the other half do not. The most important variable for determining their meat consumption status would be the one that produces the most homogeneous or pure sets after a split based on that variable, where one resulting set has a high percentage of non-meat eaters and the other has a high percentage of meat eaters.
RF is an extension of these decision trees in that it grows multiple trees. The definition of an RF algorithm is “RF is a classifier consisting of a collection of tree-structured classifiers {h(x, Θk), k = 1,...} where the {Θk} are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x.” (Breiman, 2001, p. 2). Each tree will have one vote that will count towards the final classification. While RF is generally considered a supervised machine learning method, it can also be adapted for unsupervised learning to derive a proximity matrix from unlabeled data (Shi and Horvath, 2006). This will be elaborated further in Section 4.8.
As for its role in variable selection, there are various approaches proposed to identify the most important variables based on this ranking. Degenhardt et al. (2017) did a comparative study of these methods and concluded that the Boruta method was the most powerful approach; it will be described further in Section 4.9. The overarching concept is to add randomness to the system and, by collecting results from this system, to lessen the deceptive impacts of random fluctuations and correlations, providing a better picture of which attributes are really important (Kursa et al., 2010).
Principal Component Analysis (PCA)
PCA is an unsupervised process of transforming data by plotting it on different axes so as to derive a smaller set of representative variables, or principal components (Abdi and Williams, 2010). The aim of PCA is to explain as much variation as possible in the data. Since labeled data is not required, PCA is especially useful for determining inputs to methods such as unsupervised clustering. These principal components are the axes along which the data is most spread out when projected onto them. In order to find these lines, one derives eigenvectors and eigenvalues, which come in pairs. The eigenvector is a direction and the eigenvalue is a number representing how much variance there is in the data in that direction, or how spread out the data is in that direction. The first principal component is thus the eigenvector with the highest eigenvalue, and can also be seen as the line that is closest to the original data (Abdi and Williams, 2010).
Figure 10 Example of first two components based on two variables, word length and number of lines in dictionary definition. Green lines represent the vectors whose main direction leads to the maximum sum of squared distances from the points to the vectors (Abdi and Williams, 2010).
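For two variables, the eigenvalue and eigenvector pairs can be computed in closed form, which makes the idea easy to follow. The sketch below is an illustration only: it assumes the sample covariance matrix is used (with n - 1 in the denominator), the function name is hypothetical, and the data are invented so that the two variables are perfectly correlated.

```python
import math

def pca_2d(xs, ys):
    """Eigenvalues and first principal axis of two variables.

    Returns ((lam1, lam2), v1) with lam1 >= lam2 and v1 a unit vector.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Entries of the symmetric sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mid = (a + c) / 2
    rad = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = mid + rad, mid - rad
    # Eigenvector belonging to lam1; axis-aligned if the variables are uncorrelated.
    if b != 0:
        v = (b, lam1 - a)
    else:
        v = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    return (lam1, lam2), (v[0] / norm, v[1] / norm)

# Invented data in which the second variable is exactly twice the first.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]
(lam1, lam2), v1 = pca_2d(xs, ys)
explained = lam1 / (lam1 + lam2)  # share of total variance on the first axis
```

Because the invented points lie on a single line, the first component carries all of the variance and the second eigenvalue is zero. With real data the share explained by each component indicates how many components are worth keeping.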
PCA is commonly used as a dimension reduction technique for large datasets, where redundant variables can be discarded with little loss of variation. Each of the resulting principal components will have differing contributions from different variables. The first principal component can be dominated by a few variables, and this can be the basis of how the analysis is interpreted. These variables can give an indication of what key combinations of variables account for a high proportion of total variance in the data. But because the data is orthogonally transformed onto a new coordinate system and because the values are scaled, the new variables (principal components) do not explicitly represent the system-produced variables; hence applying PCA to the dataset might cause it to lose interpretability. Nevertheless, PCA can be used to select those variables that contain the most information. King and Jackson (1999) did a comparative study on the best method of variable selection using PCA, so that the reduced subset that remains is as representative of the original dataset as possible. The results of the study were conclusive, and they recommended the B4 method, which worked best when complemented with the Broken-stick model criterion for the number of variables to select (Jackson, 1993; King and Jackson, 1999). This will be described in Section 4.9.
2.6 Ethical issues
Despite overcoming many of the shortcomings of GPS, the use of CSD in any form has many ethical implications that need to be carefully considered before charging forward with the use of this type of data. Location data from mobile phones can reveal a lot about a person: not just where one works and lives, but also the places one visits. Activities like participation in protests, or alcohol consumption based on the frequency of being located in bars and pubs, can be inferred and assumed of a user. This includes their schedules as well (Carter et al., 2015). Unsurprisingly, this is a major concern, especially when such information is made available to applications that serve third parties such as commercial companies (Calabrese et al., 2015). To combat this, regulations like the General Data Protection Regulation5 were put in place in May 2018, dictating that the data telecommunication companies release must be treated such that it is impossible to associate the location data with a cellphone number. Exceptions include cases whereby the consent of the users who are tracked is explicitly conveyed. As such, researchers develop methods that comply with these regulations. Some of these attempts include location obfuscation, where locations are slightly altered but remain useful for services (Krumm, 2009). Another key change in the GDPR is the strengthening of privacy-by-design principles by making them a legal requirement. Companies are now required to include data protection from the onset of designing a new system, rather than as an additional feature at the end.
5 The General Data Protection Regulation (GDPR) aims to protect all EU citizens from privacy and data breaches in an increasingly data-driven world. It replaces the Data Protection Directive, which was first established in 1995.
However, despite the efforts made to deliberately encumber any form of matching of these trajectories to individual users, it can be argued that they are insufficient to truly protect users’ privacy. While it is the predictability and repetitiveness of human behavior that make mobile phone data valuable in terms of mobility research, it is the very same set of traits that makes it difficult to completely anonymise this data. A recent study found that just 4 spatiotemporal observations are required to uniquely identify 95% of the individuals in their tests (de Montjoye et al., 2013). As such, moving forward, more complex techniques should be designed and implemented to protect individual privacy. Furthermore, there is also the issue of “group privacy”, where people can be targeted on the basis of the social group they belong to. For example, certain groups may be represented more in mobile phone datasets depending on their age, gender, ethnicity etc., as indirect reasons for their level of mobile phone activity at particular times or particular places (Calabrese et al., 2015). The implications of being recognized as a result of identifying with a particular social group start to become a concern (Letouzé et al., 2015).
We acknowledge that these constraints to individual privacy are important issues to consider. In the years to come, it is expected that the scrutiny on data mining and its associated privacy concerns will continue to increase. Users of this data must be sensitive to their methodologies and how legal privacy issues might impact them (Calabrese et al., 2015). With rising consumer concern, there might be legal challenges that this field runs into if these concerns are not adequately addressed with the implementation of more effective design frameworks devoted to privacy protection.
2.7 Summary and Research Gaps
A major motivation for pursuing this research is that CSD is a passive data type. When carrying out important urban movement analyses whose results will have an impact on decisions made by city and transport planners, this data must be as representative of the population in question as possible. Since cellular network data already exists and much of the population already carries personal mobile devices, th