Top Banner
GET STARTED WITH DATA SCIENCE BY ANALYZING TRAFFIC DATA FROM CALIFORNIA HIGHWAYS Cirrus Shakeri, Ph.D. Inventurist LLC 1 With contribution and support from: Aerospike Team: Brian Bulkowski, Monica Pal, Dash Desai, and Ondrej Jaura GraphLab Team: Danny Bickson and Eduardo Rosini
47

Get Started with Data Science by Analyzing Traffic Data from California Highways

Jul 15, 2015

Download

Data & Analytics

Aerospike, Inc.
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Get Started with Data Science by Analyzing Traffic Data from California Highways

GET STARTED WITH DATA SCIENCE BY ANALYZING TRAFFIC DATA FROM CALIFORNIA HIGHWAYS

Cirrus Shakeri, Ph.D. Inventurist LLC

1

With contribution and support from:

•  Aerospike Team: Brian Bulkowski, Monica Pal, Dash Desai, and Ondrej Jaura

•  GraphLab Team: Danny Bickson and Eduardo Rosini

Page 2: Get Started with Data Science by Analyzing Traffic Data from California Highways

The Short Story •  What we wanted to achieve

•  To understand how data science works and what results can be achieved •  How repeatable and reusable the data science techniques can be •  Broaden the use of Aerospike in Realtime Analytics

•  The starting point: data •  Hard to find the right dataset •  Datasets from California highways •  Application example: realtime detection of traffic incidents (a hypothesis)

•  The non-glamorous part of data science •  Understanding the data and the domain in reasonable depth •  Data cleansing and preparation

•  The glamorous part •  The algorithms– can we treat machine learning as a black box? •  The reality of Data Science hits: We need a data scientist!

•  Glimpses of hope •  Some promising results but the hard work still remains—please join the effort

2

Page 3: Get Started with Data Science by Analyzing Traffic Data from California Highways

THE LONGER STORY Our lessons learned and how you can apply these techniques in your applications

3

Page 4: Get Started with Data Science by Analyzing Traffic Data from California Highways

DATA

4

Page 5: Get Started with Data Science by Analyzing Traffic Data from California Highways

Search for the right dataset • Criteria

•  Open, realtime, interesting, not-a-toy dataset, and not-creepy!

• California highways sensor data •  16,490 stations, 41,470 sensors across California highways

5

Courtesy of California Department of Transportation

Start from Data

Page 6: Get Started with Data Science by Analyzing Traffic Data from California Highways

Checklist for understanding the data •  Download a sample and study it

•  what are the attributes, what is the range of values, … •  How is data structured?

•  tables, documents, graph, etc. •  How do we access it?

•  API, file downloads, … •  How is the quality of data?

•  missing values, incorrect values, … •  How is data generated?

•  How fast? Do we have access to the historical data? How much of it? •  Are there legal and ethical issues in analyzing the data

•  privacy, security, …

6 Understand the Dataset

Page 7: Get Started with Data Science by Analyzing Traffic Data from California Highways

Data format and structure “d04_text_station_raw_2014_10_29.txt”

7

CSV format 8,403,282 lines 565.5 MB

Understand the Dataset

Page 8: Get Started with Data Science by Analyzing Traffic Data from California Highways

What exactly is in the data

8

station ID

Average speed for lane 1: 44 miles per hour

lane 2 lane 3 lane 4 lane 5

vehicle flow for lane 1: 17 vehicles passed in 30 sec on lane 1

average occupancy for lane 1 lane 1 was occupied 18% of time in 30 sec

Understand the Dataset

Page 9: Get Started with Data Science by Analyzing Traffic Data from California Highways

Deeper study of the data • GraphLab + iPython Notebook

•  Both open source and free

9

www.dato.com ipython.org/notebook.html

Pick Tools

Page 10: Get Started with Data Science by Analyzing Traffic Data from California Highways

Load Sample Data

10

graphlab.SFrame.read_csv(url_d04_text_station_raw_2014_10_29, ...) ... PROGRESS: Unable to parse line "10/29/2014 00:44:36,400728,2,... PROGRESS: Unable to parse line "10/29/2014 00:00:05,400402,2,... PROGRESS: Unable to parse line "10/29/2014 01:29:57,401542,1,... PROGRESS: Unable to parse line "10/29/2014 00:44:36,401211,0,... PROGRESS: Read 778502 lines. Lines per second: 174789 PROGRESS: Read 3707451 lines. Lines per second: 340087 PROGRESS: Read 6514640 lines. Lines per second: 377548 PROGRESS: 577297 lines failed to parse correctly PROGRESS: Finished parsing file /Users/cirrus/Documents/Aerospike/... PROGRESS: Parsing completed. Parsed 7825984 lines in 19.8226 secs.

Understand the Dataset

Page 11: Get Started with Data Science by Analyzing Traffic Data from California Highways

Study Range of Values in the Dataset

11

!

Understand the Dataset

Page 12: Get Started with Data Science by Analyzing Traffic Data from California Highways

Lesson Learned: Need a fast database •  Exploring sample files worked as a start •  Next we needed to slice and dice a large amounts of the traffic data in

various ways •  How is speed at a specific station changes over the course of a year at a

specific time of day? •  Are there repeatable seasonal or weekly patterns? •  What about traffic behavior along a specific highway, say US-101 in Bay Area? •  ...

•  This requires loading a large amount of historical data in a database and running queries •  Exploring data should be interactive so the database must be fast

•  Current plan •  Load a good portion of the highway traffic data into Aerospike •  Provide developer community free API access •  Experiment with the data and collaborative on building applications •  The motivation is to open up the use of the Aerospike technology to different

applications especially in realtime analytics

12 Pick Tools

Page 13: Get Started with Data Science by Analyzing Traffic Data from California Highways

Good Data vs. Bad Data

13 Check the Data Quality

Page 14: Get Started with Data Science by Analyzing Traffic Data from California Highways

Imputed Data: 5-min aggregate

14 Check the Data Quality

Page 15: Get Started with Data Science by Analyzing Traffic Data from California Highways

Two more datasets • Station locations

•  Incident data

15 Add More Datasets

Page 16: Get Started with Data Science by Analyzing Traffic Data from California Highways

Detecting Incidents in Realtime • Can we detect accidents based on sensor data?

•  Is there a value in such a prediction? •  Can we detect accidents faster than the way they are reported

now? •  Is it doable? Has been done before?

• Can we ‘predict’ accidents?

• A better understanding of the domain is needed

16 Generate Application Ideas

Page 17: Get Started with Data Science by Analyzing Traffic Data from California Highways

STUDY THE DOMAIN

17

Page 18: Get Started with Data Science by Analyzing Traffic Data from California Highways

Literature Survey •  Types of incidents: Collision, Debris, Break-down

18 Study the Domain

Reference: “REAL-TIME DETECTION OF ROAD TRAFFIC INCIDENTS”, by PERO ŠKORPUT, et al

Page 19: Get Started with Data Science by Analyzing Traffic Data from California Highways

Literature Survey

19 Study the Domain

Reference: “REAL-TIME DETECTION OF ROAD TRAFFIC INCIDENTS”, by PERO ŠKORPUT, et al

Page 20: Get Started with Data Science by Analyzing Traffic Data from California Highways

Can incidents be detected in near realtime? (under 1 minute)

20

“The survey responses point to a general consensus that the unacceptably high rates of false alarms generated by available incident detection algorithms is the major deterrent.”

Study the Domain

Page 21: Get Started with Data Science by Analyzing Traffic Data from California Highways

Algorithmic Approaches • Comparative algorithms

•  Current traffic data is compared with the ‘normal’ threshold

• Statistical algorithms •  Current traffic data is compared with forecasted data

•  Traffic theory based algorithms •  Rapid change in speed while flow or occupancy do not change

considerably (based on ‘catastrophe theory’)

• Advanced algorithms •  AI and neural networks

21 Study the Domain

Reference: “OVERVIEW TO SOME INCIDENT DETECTION ALGORITHMS: A COMPARATIVE EVALUATION WITH ISTANBUL FREEWAY DATA” by Onur Deniz, et al

Page 22: Get Started with Data Science by Analyzing Traffic Data from California Highways

EXPLORE THE DATA

22

Page 23: Get Started with Data Science by Analyzing Traffic Data from California Highways

Study Patterns and Correlations

23 Explore the Data

Page 24: Get Started with Data Science by Analyzing Traffic Data from California Highways

Exploring the Data – Speed

24 Explore the Data

Page 25: Get Started with Data Science by Analyzing Traffic Data from California Highways

Exploring the Data – Occupancy

25 Explore the Data

Page 26: Get Started with Data Science by Analyzing Traffic Data from California Highways

Exploring the Data – Flow

26 Explore the Data

Page 27: Get Started with Data Science by Analyzing Traffic Data from California Highways

FORMULATE THE DATA SCIENCE APPROACH

27

Page 28: Get Started with Data Science by Analyzing Traffic Data from California Highways

Detecting Incidents as a Classification Problem

•  Features •  Started simple: Flow, occupancy and speed

• Next: prepare training data

28

Any individual record classified as incident or no-incident:

Prototype the Solution

Page 29: Get Started with Data Science by Analyzing Traffic Data from California Highways

Classifiers in GraphLab • Classification models supported in GraphLab

•  support vector machines (SVM) •  classify instances based on a linear function of the features

•  logistic regression •  estimates the probability of instances belonging to a class as a

logistic function of a linear combination of features •  boosted trees

•  combines the results of a set of base classification decision trees

•  neural networks •  middle layers of the network compute composite, intermediary

features

• Also, a smart interface that selects the right model based on the data

29 Prototype the Solution

Page 30: Get Started with Data Science by Analyzing Traffic Data from California Highways

PREPARE TRAINING DATA TRAIN THE MODEL EVALUATE THE MODEL

30

Page 31: Get Started with Data Science by Analyzing Traffic Data from California Highways

Why do we need training data? • Supervised learning: a classifier is trained via a dataset

that is labeled with positive and negative instances

31

downstream occupancy / speed

upst

ream

occ

upan

cy /

spee

d

incident

no-incident

?

Prototype the Solution

Page 32: Get Started with Data Science by Analyzing Traffic Data from California Highways

Training Data – the end result 32 Prototype the Solution

Page 33: Get Started with Data Science by Analyzing Traffic Data from California Highways

Train the Classifier

•  This is the ‘smart interface’ where GraphLab decides which classifier to us

• Note that the features are simply flow, speed and occupancy

•  Is this model oversimplified?

33

model = graphlab.classifier.create(final_training_data_d0_2014_10_29, target='incident_happened’, features=['total_flow', 'avg_speed', 'avg_occupancy'])

Prototype the Solution

Page 34: Get Started with Data Science by Analyzing Traffic Data from California Highways

First results were embarrassing! • Basically no incident could be detected in the test data •  The reality of data science hit!

•  Leveraged the GraphLab on-line community •  Worked with GraphLab Data Scientist: Danny Bickson

• Made some changes •  Added more features

•  Station id, Num of lanes, Direction of travel, Absolute postmile, Hour/Min/Day that incident occurred

•  Balanced the negative and positive examples •  Tagged station data in a 3-mile vicinity of an incident as positive

instances •  Created classifiers for specific highways (e.g., US-101) •  …

34 Prototype the Solution

Page 35: Get Started with Data Science by Analyzing Traffic Data from California Highways

Training Data Preparation Steps •  Load CSV files for 5-min aggregate sensor data

•  From Oct 29 – Nov 4, 2014 •  In District 4 (Bay Area)

• Prepare sensor dataset •  Filter for highway US-101

•  1,861,626 records •  Convert data types

•  String to datatime •  Split timestamp and add as separate features •  Add day of week as extra feature •  Drop records with unknown values

•  1,447,195 records

35 Prototype the Solution

Page 36: Get Started with Data Science by Analyzing Traffic Data from California Highways

Training Data Preparation Steps •  Load CSV file for sensor locations

•  920 sensor locations for US101S in District 4

• Convert highway data to match sensor data •  SR101S à highway = 101, direction of travel = ‘S’

•  Join with sensor data •  Load CSV file for incidents from Oct 29 – Nov 4

•  Covert datatypes, drop records with unknown values, … •  Find sections of highways 101 in District 4 •  Filter for incident data for 101S in District 4

•  176 incidents

• Create a nearest neighbor model for sensor locations •  To query the nearest sensors to an incident

36 Prototype the Solution

Page 37: Get Started with Data Science by Analyzing Traffic Data from California Highways

Training Data Preparation Steps •  For each incident on highway 101S in Bay Area between Oct

29 and Nov 4: •  Find the sensor stations in the 3-miles radius •  For each sensor in the 3 miles radius

•  Extract the sensor data for any timestamp that falls within the ‘duration of the incident’

•  Add sensor data to the ‘positive examples’ in the training data •  28,460 records

•  Note: upstream/downstream stations not taken into account

•  ‘Negative Examples’ data (no incidents) is un-proportionally larger (10 times) •  Extract a 10% sample of the negative examples data

•  24,026 records •  Add sampled sensor data for negative examples to the training data

•  52,486 records

37 Prototype the Solution

Page 38: Get Started with Data Science by Analyzing Traffic Data from California Highways

SVM – Results

38

Data: +--------------+-----------------+-------+ | target_label | predicted_label | count | +--------------+-----------------+-------+ | 0 | 0 | 3746 | | 0 | 1 | 12330 | | 1 | 0 | 3522 | | 1 | 1 | 3711 | +--------------+-----------------+-------+ [4 rows x 3 columns] , 'accuracy': 0.31991934445922177}

Prototype the Solution

Page 39: Get Started with Data Science by Analyzing Traffic Data from California Highways

Boosted Trees – Results

39

Data: +--------------+-----------------+-------+ | target_label | predicted_label | count | +--------------+-----------------+-------+ | 0 | 0 | 15082 | | 0 | 1 | 994 | | 1 | 0 | 5574 | | 1 | 1 | 1659 | +--------------+-----------------+-------+ [4 rows x 3 columns] , 'accuracy': 0.7182204298768716}

Prototype the Solution

Page 40: Get Started with Data Science by Analyzing Traffic Data from California Highways

Ideas for further improving the results •  Work with ‘raw’ station data instead of 5-min aggregates •  Further explore the data for entire 2014 and entire California highways

•  It’s not easy when you are dealing with files: load into Aerospike to rescue •  Add more features

•  difference between values in neighboring stations •  difference between consecutive times at the same location •  ...

•  Try different models for rush-hour and off-peak hours •  Take into account seasonal effects •  Validate the actual time of incident in training data •  Include upstream and downstream data as separate features •  Investigate causality

•  Is an incident causing slow-down or is it the other way around?

•  Stretch goal: Can we ‘predict’ incidents before happening? •  Leading indicators for accidents •  Improve safety of self-driving cars

40 Prototype the Solution

Page 41: Get Started with Data Science by Analyzing Traffic Data from California Highways

THE LAST WORD(S)

41

Page 42: Get Started with Data Science by Analyzing Traffic Data from California Highways

Data Science – it’s iterative

42

Data

Domain

Problem

Data Science

Prototype

Product

Last Word

Page 43: Get Started with Data Science by Analyzing Traffic Data from California Highways

43 Last Word

Page 44: Get Started with Data Science by Analyzing Traffic Data from California Highways

Take Away 1: Realtime Analytics • Realtime (fast) vs. offline

analytics (slow) •  Getting closer to how human brain

makes decisions

The next big thing: •  The line between capturing and

analyzing data is disappearing •  And technologies such as Aerospike

and GraphLab are helping!

44 Last Word

Page 45: Get Started with Data Science by Analyzing Traffic Data from California Highways

Take Away 2: Where does all this lead? 1.  More intelligence for us

•  ‘Augmenting Human Intellect’ •  Doug Engelbart

2.  More automation for machines •  ‘The Second Machine Age’

•  Erik Brynjolfsson, Andrew McAfee

45 Last Word

Page 46: Get Started with Data Science by Analyzing Traffic Data from California Highways

THANK YOU Cirrus Shakeri, Ph.D. Co-Founder, Inventurist +1 650-380-9794 twitter.com/cirrus_shakeri www.linkedin.com/in/cshakeri/ [email protected]

46

Page 47: Get Started with Data Science by Analyzing Traffic Data from California Highways

Location (mile) (Abs PM)

Time

00:00:00

23:55:00

0.0 5.0 10.0

Incident

Incident start time In

cide

nt

loc

atio

n

Stat

ion

loc

atio

n 5-min aggregate end timestamp

5 min

Distance bet stations

5-min aggregate start timestamp

Visualize an incident at a time and location

47