STAT 6620 Asmar Farooq and Weizhong Li
Project# 1
Abstract
The main purpose of this project is to predict the delay status of flights using the KNN algorithm and
to predict the length of the arrival delay using a regression tree. The data used to develop the
models come from the American Statistical Association’s website at
http://stat-computing.org/dataexpo/2009/the-data.html. The data span 1987 to 2008. The flight dataset
records 29 variables and 24,117,234 observations. Instead of building a model for the entire dataset,
the model team focuses on predicting flight delay status for JFK airport in New York. This decision
reflects the team’s view that airport capacity, local temperature patterns, and local air travel
demand affect the number of delayed flights differently at each airport in the country. A model built
for an individual airport is therefore more practical and useful for that airport.
Based on the modeling exercise, the model team concludes that the KNN algorithm is a poor
tool for estimating the delay status of flights, with a prediction error rate of more than 90 percent
for flights that are actually delayed. On the other hand, the regression tree algorithm is an
appropriate model: its predicted arrival delays have a correlation of 0.88 with the actual arrival
delays in the test set.
Data Summary
For each year, the following 29 features are given.
Feature            Type         Feature            Type
Year               Categorical  DepDelay           Numerical
Month              Categorical  Origin             Categorical
DayofMonth         Categorical  Dest               Categorical
DayOfWeek          Categorical  Distance           Numerical
DepTime            Categorical  TaxiIn             Numerical
CRSDepTime         Categorical  TaxiOut            Numerical
ArrTime            Categorical  Cancelled          Categorical
CRSArrTime         Categorical  CancellationCode   Categorical
UniqueCarrier      Categorical  Diverted           Categorical
FlightNum          Categorical  CarrierDelay       Categorical
TailNum            Categorical  WeatherDelay       Categorical
ActualElapsedTime  Numerical    NASDelay           Categorical
CRSElapsedTime     Numerical    SecurityDelay      Categorical
AirTime            Numerical    LateAircraftDelay  Categorical
ArrDelay           Numerical
The table below summarizes the mean and standard deviation (Sd) of the numerical features in the
1987 data, grouped by month.
                   Oct              Nov              Dec
Feature            Mean    Sd       Mean    Sd       Mean    Sd
ActualElapsedTime  100.71  60.85    102.16  61.52    103.72  63.08
CRSElapsedTime     88.53   60.65    100.62  61.21    101.73  61.82
ArrDelay           6.00    18.56    8.47    24.46    13.99   32.18
DepDelay           5.00    19.79    7.13    20.62    12.10   29.86
Distance           587.99  496.51   590.80  497.70   594.98  500.14
AirTime            N/A for 1987
TaxiIn             N/A for 1987
TaxiOut            N/A for 1987
Below, the counts and relative frequencies for a few of the categorical features are listed. The
remaining categorical features have 31 or more levels, so tabulating every level would be
impractical. For example, FlightNum has 2,161 levels.
str(x2)
'data.frame': 1311826 obs. of 13 variables:
$ Month : int 10 10 10 10 10 10 10 10 10 10 ...
$ DayofMonth : Factor w/ 31 levels "1","2","3","4",..: 14 15 17 18 19 21 22 23 24 25 ...
$ DayOfWeek : Factor w/ 7 levels "1","2","3","4",..: 3 4 6 7 1 3 4 5 6 7 ...
$ DepTime : Factor w/ 1430 levels "1","2","3","4",..: 451 439 451 439 459 438 438 441 454 439 ...
$ CRSDepTime : Factor w/ 1174 levels "1","5","6","8",..: 209 209 209 209 209 209 209 209 209 209 ...
$ ArrTime : Factor w/ 1440 levels "1","2","3","4",..: 552 543 558 527 562 528 532 542 548 531 ...
$ CRSArrTime : Factor w/ 1301 levels "1","2","3","4",..: 390 390 390 390 390 390 390 390 390 390 ...
$ UniqueCarrier: Factor w/ 14 levels "AA","AS","CO",..: 10 10 10 10 10 10 10 10 10 10 ...
$ FlightNum : Factor w/ 2161 levels "1","2","3","4",..: 1359 1359 1359 1359 1359 1359 1359 1359 1359
$ Origin : Factor w/ 237 levels "ABE","ABQ","ACV",..: 198 198 198 198 198 198 198 198 198 198 ...
$ Dest : Factor w/ 237 levels "ABE","ABQ","ACV",..: 207 207 207 207 207 207 207 207 207 207 ...
$ Cancelled : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Diverted : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
Day of Week
October November December
Level Counts Rel. Freq. Level Counts Rel. Freq. Level Counts Rel. Freq.
1 59,243 0.1321 1 73,057 0.1728 1 58,411 0.1326
2 59,214 0.1320 2 58,441 0.1382 2 72,583 0.1648
3 59,076 0.1317 3 58,763 0.1390 3 72,396 0.1644
4 73,966 0.1649 4 55,614 0.1315 4 71,331 0.1620
5 73,739 0.1644 5 54,637 0.1292 5 56,537 0.1284
6 67,256 0.1499 6 52,767 0.1248 6 53,347 0.1211
7 56,126 0.1251 7 69,524 0.1644 7 55,798 0.1267
Total 448,620 1.0000 Total 422,803 1.0000 Total 440,403 1.0000
Unique Carrier
October November December
Level Counts Rel. Freq. Level Counts Rel. Freq. Level Counts Rel. Freq.
DL 63,104 0.1407 DL 60,150 0.1423 DL 62,559 0.142
AA 56,091 0.125 AA 53,200 0.1258 AA 55,830 0.1267
UA 52,952 0.118 UA 48,702 0.1152 UA 50,970 0.1157
CO 42,756 0.0953 CO 39,408 0.0932 CO 40,838 0.0927
PI 39,228 0.0874 PI 37,707 0.0892 PI 39,547 0.0898
NW 37,590 0.0838 EA 34,865 0.0825 EA 36,863 0.0837
EA 37,048 0.0826 NW 34,342 0.0812 NW 36,341 0.0825
US 32,293 0.072 US 31,006 0.0733 US 31,515 0.0715
TW 23,823 0.0531 TW 22,125 0.0523 TW 23,792 0.054
WN 21,738 0.0485 WN 20,237 0.0479 WN 20,000 0.0454
HP 15,026 0.0335 HP 14,939 0.0353 HP 15,434 0.035
PS 14,405 0.0321 PS 13,540 0.032 PS 13,761 0.0312
AS 7,432 0.0166 AS 6,967 0.0165 AS 7,007 0.0159
PA 5,134 0.0114 PA 5,615 0.0133 PA 6,036 0.0137
Total 448,620 1 Total 422,803 1 Total 440,493 1
Cancelled
October November December
Level Counts Rel. Freq. Level Counts Rel. Freq. Level Counts Rel. Freq.
0 445,619 0.9933 0 417,612 0.9877 0 428,910 0.9739
1 3,001 0.0067 1 5,191 0.0123 1 11,493 0.0261
Total 448,620 1.0000 Total 422,803 1.0000 Total 440,403 1.0000
Diverted
October November December
Level Counts Rel. Freq. Level Counts Rel. Freq. Level Counts Rel. Freq.
0 447,781 0.9982 0 421,708 0.9974 0 438,522 0.9957
1 829 0.0018 1 1,095 0.0026 1 1,881 0.0043
Total 448,610 1.0000 Total 422,803 1.0000 Total 440,403 1.0000
Model Variable Construction
To clean the data, we excluded all observations with an NA value in airtime, as well as
observations where the flight was cancelled, because those observations do not add any information.
We exclude observations with an NA value in airtime because we consider taxiout (the time it takes
a flight to leave the terminal and take off from the airport) and taxiin (the time it takes a flight
to land and reach the terminal) to be important variables for explaining flight delay status and
arrival delay time. In general, observations with a null airtime also have null taxiin and taxiout
values.
After cleaning the data, five additional variables are added to the dataset. The first one
is the delay flag. A flight is considered late if either of the following two conditions is true:
1. The actual elapsed time is more than 30 minutes longer than the scheduled elapsed time.
2. The actual arrival time is more than 30 minutes later than the scheduled arrival time.
The 30-minute threshold was chosen based on personal experience and the assumption that a passenger
expects to wait at least 30 minutes before considering a flight delayed. The delay flag is a
binary variable that takes the value 1 for a delayed flight and 0 otherwise, and it is used as the
response variable in the KNN model.
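The delay rule can be sketched in R as follows. This is only an illustration on made-up flights: the helper names `hhmm_to_minutes` and `delay_flag` are hypothetical, and converting the HHMM clock values to minutes before subtracting is an assumption of this sketch (the SQL in the appendix subtracts the raw values). Flights that arrive after midnight are not handled.

```r
# Convert an HHMM clock value (e.g. 1745) to minutes after midnight.
# Assumption of this sketch; the appendix SQL compares the raw HHMM integers.
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + (hhmm %% 100)
}

# Hypothetical helper: 1 if either delay condition holds, 0 otherwise.
delay_flag <- function(actual_elapsed, crs_elapsed, arr_time, crs_arr_time,
                       threshold = 30) {
  elapsed_over <- (actual_elapsed - crs_elapsed) > threshold
  arrival_over <- (hhmm_to_minutes(arr_time) -
                     hhmm_to_minutes(crs_arr_time)) > threshold
  as.integer(elapsed_over | arrival_over)
}

# Made-up flights for illustration only.
flights <- data.frame(
  ActualElapsedTime = c(200, 245, 180),
  CRSElapsedTime    = c(195, 200, 185),
  ArrTime           = c(1410, 1650, 1905),
  CRSArrTime        = c(1400, 1600, 1820)
)
flights$delay <- with(flights, delay_flag(ActualElapsedTime, CRSElapsedTime,
                                          ArrTime, CRSArrTime))
print(flights)
```

In this toy data the first flight is on time, the second is flagged by the elapsed-time condition, and the third by the arrival-time condition.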
Three temperature measures for JFK (mean, minimum, and maximum temperature) are added using the
R package “weatherData”. Common experience tells the model team that temperature plays a
big role, especially for airports located in places with adverse winter conditions, of which JFK
is one.
Lastly, a variable called “Total number of flights per day” is created. The variable reflects the
daily usage of the airport. The model team suspects that on days when more flights need to take off
from an airport, the probability of delay differs from that on days when fewer flights than normal
use the airport.
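A minimal sketch of how such a per-day count could be attached to every flight record in R is shown below. The report itself builds this variable in SQL (see the appendix); the data frame here is made up, and only the column name `num_flight` follows the appendix.

```r
# Made-up departure dates for illustration only.
flights <- data.frame(
  date = as.Date(c("2008-01-01", "2008-01-01", "2008-01-02",
                   "2008-01-03", "2008-01-03", "2008-01-03"))
)

# For each row, count how many rows share the same date.
# ave() applies FUN within each group and returns a vector aligned with the rows.
flights$num_flight <- ave(seq_len(nrow(flights)),
                          as.character(flights$date),
                          FUN = length)
print(flights)
```

Each row then carries the number of flights recorded on its own date, which mirrors the effect of the group-by and join in the SQL view.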
All in all, we dropped 19 variables, either for the reasons above or because we found them to be
of no use in explaining flight delays. Those 19 variables are Year, ArrTime, CRSArrTime,
UniqueCarrier, FlightNum, TailNum, ActualElapsedTime, CRSElapsedTime, AirTime, Origin, Dest,
Cancelled, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay,
LateAircraftDelay.
By the end of this process, we had a little more than 150K observations (150,694), of which
about 36K flights were delayed. The observations were then split into a training set with 80 percent
of the observations (120,555) and a test set with the remaining observations (30,139).
Delay Status Classification Model
The KNN algorithm is used to predict the delay flag (ArrivedLate). The following results were
generated in R.
Variables: airtime, distance, taxiin, taxiout, max temperature, min temperature, number of flights.
Normalization method: min-max normalization.
Model confusion matrices for k = 10, k = 100, k = 200, k = 300, and k = 400. Actual values are listed
in the rows and predicted values in the columns.
[Figures: confusion matrices for k = 10, 100, 200, 300, and 400]
K = 500 does not work.
As the number of nearest neighbors increases, the number of incorrectly classified on-time
flights decreases while the number of incorrectly classified delayed flights increases. The
model results are unsatisfactory, as the error rate for flights that were actually delayed is more
than 20 percent for all of the models. Based on the above results, KNN is not a good model for
predicting the delay status of flights.
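The error rate for delayed flights is a per-class quantity and can differ a lot from the overall error rate. The sketch below shows how both could be read off a confusion matrix in base R. The labels are made up for illustration and are not the project's results.

```r
# Made-up labels for illustration only: 1 = delayed, 0 = on time.
actual    <- factor(c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 0, 0, 0, 0, 0, 0, 1), levels = c(0, 1))

# Rows are actual values, columns are predicted values.
confusion <- table(actual = actual, predicted = predicted)

# Share of each actual class that was misclassified.
class_error <- 1 - diag(confusion) / rowSums(confusion)

# Share of all flights that were misclassified.
overall_error <- 1 - sum(diag(confusion)) / sum(confusion)

print(confusion)
print(class_error)
print(overall_error)
```

When one class is much smaller than the other, as with roughly 36K delayed flights out of about 150K, a classifier can have a moderate overall error rate and still miss most of the delayed flights.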
Prediction for Delay Arrival Time
The regression tree algorithm is used to predict the arrival delay time. The explanatory variables
include: month, day of month, day of week, departure time, scheduled departure time, airtime, taxiin,
taxiout, max temperature, mean temperature, min temperature, number of flights, departure delay.
[Figure: correlation matrix between the numerical variables]
[Figure: regression tree graph]
[Output: model effectiveness measure, RMSE]
[Output: model effectiveness measure, correlation with the test set]
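The two effectiveness measures can be sketched in base R as follows. The vectors are made up for illustration, and the helper name `rmse` is hypothetical (the appendix uses `RMSE` from the caret package).

```r
# Hypothetical helper: root mean squared error between predictions and observations.
rmse <- function(predicted, observed) {
  sqrt(mean((predicted - observed)^2))
}

# Made-up arrival delays in minutes, for illustration only.
observed  <- c(10, -5, 42, 0, 75)
predicted <- c(12, -3, 35, 4, 70)

model_rmse <- rmse(predicted, observed)   # typical size of a prediction error, in minutes
model_cor  <- cor(observed, predicted)    # linear association between predictions and observations

print(model_rmse)
print(model_cor)
```

RMSE is expressed in the same units as the arrival delay, while the correlation measures how closely predictions track the observed delays. Neither is a share of correctly predicted flights.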
Summary
In summary, after cleaning and preparing the data set, the KNN algorithm was applied to predict
whether a flight at JFK would be delayed. While the KNN algorithm is simple to apply, the error
rate it generated, at more than 20 percent, is unacceptable. Furthermore, increasing the number of
nearest neighbors k worsened the model’s ability to predict late flights, because more late flights
were misclassified as on-time flights.
On the other hand, the regression tree algorithm generated better results and produced an
easy-to-follow graph. According to the graph, the main cause of a late arrival is a delay in
departure. In other words, if a flight leaves late, it is very likely to arrive late. Moreover,
according to our model, shorter taxi-out time and airtime result in fewer delays, as common sense
would suggest. The correlation of 0.88 between the test data and the model’s predictions suggests
that this model is acceptable and is a good candidate for predicting late flights at JFK airport.
Since this model was only applied to JFK airport, our team suggests applying the same
technique to other airports to further evaluate the effectiveness of the regression tree algorithm
for airline flights.
APPENDIX
R CODE
## Loading Data
library(RODBC)
myconn <- odbcConnect('project')
flight_data <- sqlQuery(myconn, "select * from
[project].[dbo].[jfk_revised]")
close(myconn)
str(flight_data)
attach(flight_data)
fit_data <- data.frame(factor(Month),factor(DayofMonth),factor(DayOfWeek),
round(DepTime/100, digits=0), round(CRSDepTime/100, digits=0),
AirTime, Distance, TaxiIn, TaxiOut,
Max_TemperatureF, Mean_TemperatureF, Min_TemperatureF,num_flight,
factor(delay),
ArrDelay, DepDelay)
names(fit_data)<- c('month', 'dayofmonth', 'dayofweek', 'deptime',
'crsdpetime', 'airtime', 'distance', 'taxiin','taxiout'
,'maxt', 'meant', 'mint', 'num_flight', 'delay',
'arrdelay', 'depdelay')
set.seed(12345)
fit_data <- fit_data[order(runif(150694)), ]
fit_training <- fit_data[1:120555,]
fit_test <- fit_data[120556:150694,]
##KNN
## Normalize Function
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
knn_training_full <- fit_training[complete.cases(fit_training),]
knn_training <- knn_training_full[,-(1:5)]
knn_training_cor <- knn_training[,-(9:11)]
knn_training <- as.data.frame(lapply(knn_training[,-(9:11)], normalize))
knn_test_full <- fit_test[complete.cases(fit_test),]
knn_test <- knn_test_full[,-(1:5)]
knn_test <- as.data.frame(lapply(knn_test[,-(9:11)], normalize))
library(class)
fit_pred <- knn(train = knn_training, test = knn_test,
cl = knn_training_full[,14], k=10)
library(gmodels)
CrossTable(x = knn_test_full$delay, y = fit_pred,
prop.chisq=FALSE)
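## Repeat KNN for k = 100, 200, ..., 1000 and print each confusion matrix
k_values <- seq(100,1000,100)
lapply(k_values,function(k_i){
  ## use the current value k_i, not the whole vector of k values
  knn_i <- knn(train = knn_training, test = knn_test,
               cl = knn_training_full[,14], k=k_i)
  CrossTable(x = knn_test_full$delay, y = knn_i,
             prop.chisq=FALSE)
})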
## Regression Tree
library(rpart)
m.rpart <- rpart(arrdelay~ ., data = fit_training)
library(rpart.plot)
rpart.plot(m.rpart, digits = 3)
library(caret)
rpart.pred<- predict(m.rpart, fit_test)
RMSE(rpart.pred, fit_test$arrdelay)
cor(fit_test$arrdelay, rpart.pred)
cor(knn_training_cor)
SQL CODE
USE [project]
GO
/****** Object: View [dbo].[vw_jfk] Script Date: 6/7/2015 11:08:32 AM
******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE view [dbo].[vw_jfk] as
/* First Chain: filtering out for where origin = jfk , create a date variable
to aggregate number of flights taken off at jfk.*/
with cte1 as
(SELECT [Year]
,[Month]
,[DayofMonth]
,[DayOfWeek]
,[DepTime]
,[CRSDepTime]
,[ArrTime]
,[CRSArrTime]
,[UniqueCarrier]
,[FlightNum]
,[TailNum]
,[ActualElapsedTime]
,[CRSElapsedTime]
,[AirTime]
,[ArrDelay]
,[DepDelay]
,[Origin]
,[Dest]
,[Distance]
,[TaxiIn]
,[TaxiOut]
,[Cancelled]
,[CancellationCode]
,[Diverted]
,[CarrierDelay]
,[WeatherDelay]
,[NASDelay]
,[SecurityDelay]
,[LateAircraftDelay]
,DATEFROMPARTS([year],[month],[dayofmonth]) as [date]
FROM [project].[dbo].[part1_data]
where origin = 'jfk'
)
/* joining the first chain with the min, mean, and max temperature while
filtering out cancelled flights and null value for airtime.
The resultant dataset will have taxiin and taxiout fields filled with
values.*/
,cte2 as
(select a.*
,b.[Max_TemperatureF]
,b.[Mean_TemperatureF]
,b.[Min_TemperatureF]
from cte1 a left join [project].[dbo].[temp] b on a.[date] = b.[date]
where cancelled <> 1 and airtime <>'NA'
)
/*creating the delay flag using the second chain*/
,cte3 as
(select a.*
,case when cast(a.[ActualElapsedTime] as int) - cast(a.[CRSElapsedTime]
as int) > 30 then '1'
when cast(a.[ArrTime] as int)- cast(a.[CRSArrTime] as int) >30 then
'1'
else '0' end as [delay]
from cte2 a
)
/*aggregating the number of flights taken place in JFK on a daily basis*/
,cte4 as
(select [date]
,count([date]) as [num_flight]
from cte3
group by [date]
)
/*combining chain number three and four into a final table*/
,cte5 as
(select a.*
,b.[num_flight]
from cte3 a left join cte4 b on a.[date] = b.[date]
)
select * from cte5
GO