Dempsy - Stream Based “Big Data” Applied to Traffic
Feb 22, 2016
Traffic End-to-End
Incident and Event Data
DOT Sensor / Flow Data
Traffic.com Sensor Network
Collection
Probe Data
Fusion
In-Vehicle
Wireless
Dissemination
Internet
Television
Radio
Historic Data
Data Fusion
Sensor Data Collection
Probe Data Collection
ProbeCollector
Metro Mapping
Algorithms
Archive
Handlers
Veh. Identifier, Speed, Heading, Location (lat, lon), Metro ID
Road network
...Third party
probe data provider
Metro 1 Mapmatch
Metro 2 Mapmatch
Metro 3 Mapmatch
Metro Mapping
3rd Party ProbeCollector
Metro Mapmatchers
Veh. Identifier, Speed, Heading, Location (lat, lon), Metro ID, Navteq Edge ID, Location along edge
Overview of Arterial Model
Probe Data
Map Matcher
Path Analysis
Travel Time Allocation
Arterial Travel Times
Arterial Traffic Data
• Map Matcher: Matches the probe data to the road network in real time, with associated probabilities.
• Path Analysis: Routes between pairs of probable matches across chains and applies a Hidden Markov Model to the results to determine the most likely path through a set of points.
• Travel Time Allocation: Assigns the path travel times to the appropriate arterial segments.
• Arterial Model: Combines expected values with the allocated travel times and previous estimates into the current estimate.
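The path analysis step can be illustrated with a small Viterbi decoder, the standard way to find the most likely state sequence in a Hidden Markov Model. This is only a sketch under stated assumptions: the candidate segments, the emission values (standing in for map-match probabilities) and the transition score (standing in for routability between candidates) are all hypothetical, not the production model.

```python
def viterbi(candidates, emission, transition):
    """Return (score, path) for the most likely sequence of road segments.

    candidates: one list of candidate segment ids per probe point
    emission:   emission[t][seg] = probability that point t lies on seg
    transition: transition(prev_seg, seg) = score for driving prev_seg -> seg
    """
    # best[seg] = (score of the best path ending at seg, that path)
    best = {seg: (emission[0][seg], [seg]) for seg in candidates[0]}
    for t in range(1, len(candidates)):
        step = {}
        for seg in candidates[t]:
            # Pick the predecessor that maximises the score of reaching seg.
            score, path = max(
                ((p * transition(prev, seg), prev_path)
                 for prev, (p, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            step[seg] = (score * emission[t][seg], path + [seg])
        best = step
    return max(best.values(), key=lambda item: item[0])


# Hypothetical example: three probe points, two candidate segments each.
candidates = [["A", "B"], ["A", "C"], ["C", "D"]]
emission = [{"A": 0.6, "B": 0.4}, {"A": 0.3, "C": 0.7}, {"C": 0.7, "D": 0.3}]
connected = {("A", "A"), ("A", "C"), ("B", "C"), ("C", "C"), ("C", "D")}

def transition(prev, seg):
    # Assumed score: high if a route exists between the segments, low otherwise.
    return 0.9 if (prev, seg) in connected else 0.01

score, path = viterbi(candidates, emission, transition)
print(path, round(score, 5))
```

Keeping the transition as a function rather than a matrix leaves room for a routing call (for example A*) to supply the score between two candidate matches.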
Width of the Road
• Center the normal distribution over the probe reported location
• Compute the distance from the peak of the distribution to the edges of the road.
  o It is possible to estimate road width from the number of lanes
• Integral of the normal distribution gives the probability of the probe being on that road.
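$$P = \frac{1}{\sigma\sqrt{2\pi}} \int_{a}^{b} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx$$
(a, b: edges of the road; μ: probe reported location; σ: standard deviation of the distribution)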
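The road-width calculation can be sketched numerically using the closed form of the normal integral. The lane width of 3.5 m, the one-dimensional treatment across the road, and the function names are assumptions made for illustration; the source only states that road width can be estimated from the number of lanes.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def on_road_probability(distance_to_centerline_m, lanes, sigma_m, lane_width_m=3.5):
    """Probability mass of the probe's location distribution that falls on the road.

    The distribution is centred on the probe reported location (taken as 0),
    and the road edges sit half a road width either side of the centerline.
    """
    half_width = lanes * lane_width_m / 2.0  # width estimated from lane count
    a = distance_to_centerline_m - half_width
    b = distance_to_centerline_m + half_width
    return normal_cdf(b, 0.0, sigma_m) - normal_cdf(a, 0.0, sigma_m)

# Hypothetical probe with 10 m position error, near a two-lane road.
print(on_road_probability(0.0, lanes=2, sigma_m=10.0))   # probe on the centerline
print(on_road_probability(20.0, lanes=2, sigma_m=10.0))  # probe 20 m away
```

A wider road or a smaller position error both raise the probability, which matches the intuition behind weighting candidate matches by road width.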
Real Life Examples
Technology Survey
• Streams Processing Engines
• Hadoop / Map Reduce
• “Distributed Actors Model”
Technology Survey
• Streams processing engines
  o Oracle, IBM, SQLStream
  o Not a good fit. More for relational data processing.
• Hadoop Map Reduce
  o Not a good fit for low latency computations (15 to 30 minutes per batch)
  o HBase Co-processors are a possibility but more of a hack
• Actors Model
  o S4, Akka, Storm
  o Just what we need
Dempsy – Distributed Elastic Message Processing System
• POJO based Actors programming abstraction eliminates synchronization bugs
• Framework handles messaging and distribution
• Fine grained partitioning of work
• Elastic
• Fault tolerant
Dempsy – Distributed Elastic Message Processing System
• Separation of concerns – scale agnostic apps versus scale aware platform
• Support code quality goals (guidelines, reuse, design patterns, etc.)
• Functional programming (-like)
• Map Reduce (-like)
• Distributed Actors Model (-like)
Dempsy
MP Container
ZooKeeper
MP Container Cluster
Distributor
MP Container
MP Container
ZooKeeper
MP Container Cluster
MP Container
System Characteristics - DevOps
• Manage every node and every process in exactly the same way. E.g. arterial, path analyzer, map matcher look the same to an operations person.
• Everything runs on exactly the same hardware
• Scale elastically. To increase throughput, just add a machine to the cluster – no extra work required. The system can even be automatically scaled as load increases.
• Robust failure handling – no real-time manual intervention required when nodes fail.
• Development, QA and Integration teams can use a pool of resources rather than dedicated resources. The pool can grow elastically as required by overlapping project schedules
Example – Traffic Processing
Map Matching and Path Analysis as an Example
• Algorithm decomposition
  o Discrete Business Logic Components
    1. Map Matching
    2. Vehicle Accumulation
    3. Path Analysis (currently A* routing)
  o MP Addressing
    1. Tile based addressing
    2. Addressing by vehicle id
    3. Tile based addressing
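The addressing scheme for the three components can be sketched as key functions plus a key-to-node mapping. Everything named here is an assumption for illustration: the tile size, the probe field names, and the CRC-based routing are hypothetical and are not taken from Dempsy's own routing strategy.

```python
import math
import zlib

TILE_SIZE_DEG = 0.01  # hypothetical tile size in degrees

def tile_key(lat, lon, tile_size=TILE_SIZE_DEG):
    """Key for map matching and path analysis: the tile containing the point."""
    return (math.floor(lat / tile_size), math.floor(lon / tile_size))

def vehicle_key(probe):
    """Key for vehicle accumulation: all probes of one vehicle share a processor."""
    return probe["vehicle_id"]

def node_for_key(key, node_count):
    """Map a key to a node index with a stable hash (an assumed routing scheme)."""
    return zlib.crc32(repr(key).encode("utf-8")) % node_count

probe = {"vehicle_id": "veh-17", "lat": 41.8781, "lon": -87.6298}
stage_keys = [
    ("map matching", tile_key(probe["lat"], probe["lon"])),
    ("vehicle accumulation", vehicle_key(probe)),
    ("path analysis", tile_key(probe["lat"], probe["lon"])),
]
for stage, key in stage_keys:
    print(stage, key, "-> node", node_for_key(key, node_count=5))
```

Because the key alone decides where a message goes, the same probe can be handled on different nodes at each stage, and the choice of key granularity determines how evenly work spreads across the cluster.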
Adaptor
MapMatch
MP
MapMatcher
Singleton
Vehicle Accumulator
MP
PathAnalyzer
Singleton
PathAnalyzer
MP
TravelTime
Singleton
TravelTime
MP
TrafficState
Singleton
TrafficState
MP
Linkset AstarGraph
Traffic History
Segment
Table
Key: tile, x 40k
Key: probeId, x 10M
Key: tile, x 40k
Key: tile, x 40k
Key: segment Id, x 2M
x 50, x 50, x 50, x 50
TrafficReporter
OLTP
x 9
Every 60 seconds
Analytics
Extract
x 1
Quality & Audit Logs App Logs
Distributed Log Collection
Distributed File Storage
Dempsy – Arterial Model Example
Dempsy Proof Of Concept Results
Dempsy Testing and Analysis
• Decomposed Arterial (MegaVM) into Dempsy Message processors
• Implemented first two stages of Arterial, Map Match and Path Analysis
• Implemented Message Processors as trivial POJOs around existing mapmatch and path analysis libraries
• Wrapped into a Dempsy Application
• Front ended with Dempsy Adaptor to read probe data from files and inject them into Dempsy
• Deployed to Amazon EC2 to prove out scaling, collect performance data, and analyze behavior of system under load
• Three main rounds of testing
  1. Original HornetQ Transport (Sprint 6.2)
  2. Lighter weight TCP/Socket Based Transport (Sprint 6.3)
  3. More finely grained Message Keys (Sprint 6.3)
Distributed Map Match / Path Analyzer Testing
• Ran multiple tests on EC2 with increasing number of Dempsy Nodes
• Scaled Map Match in Parallel
• Used a constant number of Probe Readers, empirically set at 3
Test 1: HornetQ Transport
Stack Width (# Map/Path Nodes)    Throughput (Probes per Second)
1                                 6,281
2                                 12,499
3                                 18,840
4                                 25,716
5                                 28,123
[Chart: Probes/Sec by stack width, 1 to 5 nodes]
Test 2: TCP Transport
Stack Width (# Map/Path Nodes)    Throughput (Probes per Second)
1                                 14,498
2                                 27,742
3                                 32,211
4                                 53,169
5                                 49,207
[Chart: Probes/Sec by stack width, 1 to 5 nodes]
Test 3: TCP w/ Small Tiles Transport
Stack Width (# Map/Path Nodes)    Throughput (Probes per Second)
1                                 14,400
2                                 27,581
3                                 41,839
4                                 55,725
5                                 68,285
6                                 81,967
[Chart: Probes/Sec by stack width, 1 to 6 nodes]
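One way to compare the three test rounds is scaling efficiency: throughput at n nodes divided by n times the single-node throughput, so 1.0 means perfectly linear scaling. The sketch below uses the throughput figures reported for Tests 1 to 3; the metric itself and the names in the code are an editorial choice, not part of the original analysis.

```python
# Probes per second by stack width, as reported for each test round.
RESULTS = {
    "HornetQ": [6281, 12499, 18840, 25716, 28123],
    "TCP": [14498, 27742, 32211, 53169, 49207],
    "TCP, small tiles": [14400, 27581, 41839, 55725, 68285, 81967],
}

def scaling_efficiency(throughputs):
    """Throughput at n nodes relative to n times the single-node throughput."""
    base = throughputs[0]
    return [t / (base * n) for n, t in enumerate(throughputs, start=1)]

for name, series in RESULTS.items():
    print(name, [round(e, 2) for e in scaling_efficiency(series)])
```

On these figures the small-tile run stays at roughly 95% of linear scaling out to six nodes, while the TCP run with coarser keys drops to about 68% at five nodes, which is consistent with finer-grained message keys spreading work more evenly.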
Development Life Cycle
• Write Message Processor (MP) prototypes
• Configuration using the Dependency Injection container of your choice (currently supports Spring).
• Develop using one node or pseudo distributed mode
• No messaging code to write
• No queues
• No synchronization
• Narrow scope of concern – each processing element deals with only a limited set of data. There may be millions of processing elements.
• Simple debugging and unit testing
Trade-offs
• There’s no free lunch
  o Sacrifice guaranteed delivery, message ordering, message uniqueness
  o Gain response time
  o Gain simple clustering
  o Gain memory efficiency (no queuing)
  o Gain lower latency under load
• Where does this work
  o Statistically based analytics
  o Techniques where sacrificing input data quantity only lowers output quality, rather than making the output wrong
• Where doesn’t this work
  o Transaction based systems
  o Techniques where a missing or repeated message produces ‘false’ results (e.g. bank transactions)
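The trade-off can be made concrete with a small simulation: a statistical estimate barely moves when a few messages are dropped, whereas a transactional total is simply wrong. The travel times below are synthetic values generated for illustration, and the 5% loss rate is an arbitrary assumption.

```python
import random
import statistics

def mean_after_loss(samples, loss_rate, rng):
    """Mean of the samples that survive independent random message loss."""
    kept = [s for s in samples if rng.random() >= loss_rate]
    return statistics.mean(kept)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
# Synthetic travel times in seconds; not measured data.
travel_times = [rng.gauss(60.0, 10.0) for _ in range(10_000)]

full_mean = statistics.mean(travel_times)
lossy_mean = mean_after_loss(travel_times, loss_rate=0.05, rng=rng)
print(f"all messages: {full_mean:.2f} s, 5% dropped: {lossy_mean:.2f} s")

# A running balance, by contrast, is wrong after any loss.
deposits = [100] * 20
balance_with_loss = sum(deposits[:-1])  # one message dropped
print(sum(deposits), balance_with_loss)
```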
[Figure: message processor lifecycle state diagram. Labels recovered from the diagram:
 Prototype: Start, Construct, Startup (@Start), Prototype Ready
 Instance creation: clone() on message, or explicit instantiation
 Activation: @Activate (Activate / No Activate), Ready
 While Ready: @MessageHandler (message), @Output (scheduled output, complete output), @Evictable (scheduled evict check, no eviction)
 Shutdown: @Passivate (eviction, Passivate, Elasticity), finalize (jvm gc)
 Legend: Message Processor Prototype, Message Processor, Proposed Addition, Future Addition]
DEMPSY – MP LIFECYCLE DIAGRAM
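The lifecycle can be approximated with a toy container that clones a prototype the first time a key is seen, activates the clone, and later evicts it. This is a Python stand-in written for illustration only: the class and method names are hypothetical, the methods merely mirror the roles of @Activate, @MessageHandler, @Output, @Evictable and @Passivate, and none of this is Dempsy's actual API.

```python
import copy

class VehicleAccumulator:
    """Hypothetical message processor: one instance per vehicle id."""
    def __init__(self):
        self.key = None
        self.count = 0

    def activate(self, key):      # role of @Activate
        self.key = key

    def handle(self, message):    # role of @MessageHandler
        self.count += 1

    def output(self):             # role of @Output
        return (self.key, self.count)

    def should_evict(self):       # role of @Evictable
        return self.count >= 3

    def passivate(self):          # role of @Passivate
        pass


class MiniContainer:
    """Toy container: routes each message to the instance that owns its key."""
    def __init__(self, prototype, key_of):
        self.prototype = prototype
        self.key_of = key_of
        self.instances = {}

    def dispatch(self, message):
        key = self.key_of(message)
        mp = self.instances.get(key)
        if mp is None:
            mp = copy.deepcopy(self.prototype)  # clone() on first message for a key
            mp.activate(key)
            self.instances[key] = mp
        mp.handle(message)

    def run_output(self):
        return sorted(mp.output() for mp in self.instances.values())

    def run_eviction(self):
        evictable = [k for k, mp in self.instances.items() if mp.should_evict()]
        for key in evictable:
            self.instances.pop(key).passivate()


container = MiniContainer(VehicleAccumulator(), key_of=lambda m: m["vehicle"])
for vehicle in ["v1", "v2", "v1", "v1"]:
    container.dispatch({"vehicle": vehicle})
print(container.run_output())
container.run_eviction()
print(list(container.instances))
```

Since each instance only ever sees messages for its own key, the processor code needs no locks or queues, which is the point of the "no synchronization" claim in the development life cycle.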