KNN and Regression Tree

STAT 6620 Asmar Farooq and Weizhong Li

Project# 1

Abstract

The main purpose of this project is to predict the delay status of flights using the KNN algorithm and
to predict the length of the arrival delay using a regression tree. The data used to develop the
models come from the American Statistical Association’s website at
http://stat-computing.org/dataexpo/2009/the-data.html. The data span 1987 to 2008, and the flight
dataset records 29 variables and 24,117,234 observations. Instead of building a model for the entire
dataset, the model team focuses on predicting the delay status of flights for JFK airport in New York.
The team made this choice because airport capacity, temperature patterns, and local air travel demand
differ from airport to airport and are expected to affect the number of delayed flights differently at
each one. A model built for an individual airport is therefore more practical and useful for that
airport than a single model for the whole country.

Based on the modeling exercise, the team concludes that the KNN algorithm is a poor tool for
estimating the delay status of flights, with a prediction error rate of more than 90 percent for
flights that are actually delayed. The regression tree algorithm, on the other hand, is an appropriate
model: the correlation between its predicted arrival delays and the actual arrival delays in the test
set is 0.88.

Data Summary

For each year, the dataset provides the following 29 features.

Feature            Type         Feature            Type
Year               Categorical  DepDelay           Numerical
Month              Categorical  Origin             Categorical
DayofMonth         Categorical  Dest               Categorical
DayOfWeek          Categorical  Distance           Numerical
DepTime            Categorical  TaxiIn             Numerical
CRSDepTime         Categorical  TaxiOut            Numerical
ArrTime            Categorical  Cancelled          Categorical
CRSArrTime         Categorical  CancellationCode   Categorical
UniqueCarrier      Categorical  Diverted           Categorical
FlightNum          Categorical  CarrierDelay       Categorical
TailNum            Categorical  WeatherDelay       Categorical
ActualElapsedTime  Numerical    NASDelay           Categorical
CRSElapsedTime     Numerical    SecurityDelay      Categorical
AirTime            Numerical    LateAircraftDelay  Categorical
ArrDelay           Numerical

The table below gives the mean and standard deviation of the numerical features for the 1987 data,
grouped by month.

                    Oct              Nov              Dec
Feature             Mean     Sd      Mean     Sd      Mean     Sd
ActualElapsedTime   100.71   60.85   102.16   61.52   103.72   63.08
CRSElapsedTime      88.53    60.65   100.62   61.21   101.73   61.82
ArrDelay            6.00     18.56   8.47     24.46   13.99    32.18
DepDelay            5.00     19.79   7.13     20.62   12.10    29.86
Distance            587.99   496.51  590.80   497.70  594.98   500.14
AirTime             N/A for 1987
TaxiIn              N/A for 1987
TaxiOut             N/A for 1987
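Per-month means and standard deviations like those in the table can be computed with base R. The sketch below is only an illustration on a tiny made-up data frame; the object name `toy` and its values are hypothetical and are not taken from the flight data.

```r
# Hypothetical toy data: two flights per month, values invented for illustration
toy <- data.frame(
  Month    = c(10, 10, 11, 11, 12, 12),
  ArrDelay = c(4, 8, 6, 10, 12, 16)
)

# tapply() applies a function to ArrDelay within each level of Month
month_mean <- tapply(toy$ArrDelay, toy$Month, mean)
month_sd   <- tapply(toy$ArrDelay, toy$Month, sd)

# Combine into one table with one row per month
round(cbind(Mean = month_mean, Sd = month_sd), 2)
```

Passing `na.rm = TRUE` as an extra argument to `tapply()` would be one way to skip missing values, which matters for columns such as AirTime that are not available in 1987.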

Below, the counts and relative frequencies for a few of the categorical features are listed. The
remaining categorical features have 31 or more levels, so tabulating every level is impractical. For
example, FlightNum has 2161 levels.

str(x2)

'data.frame': 1311826 obs. of 13 variables:

$ Month : int 10 10 10 10 10 10 10 10 10 10 ...

$ DayofMonth : Factor w/ 31 levels "1","2","3","4",..: 14 15 17 18 19 21 22 23 24 25 ...

$ DayOfWeek : Factor w/ 7 levels "1","2","3","4",..: 3 4 6 7 1 3 4 5 6 7 ...

$ DepTime : Factor w/ 1430 levels "1","2","3","4",..: 451 439 451 439 459 438 438 441 454 439 ...

$ CRSDepTime : Factor w/ 1174 levels "1","5","6","8",..: 209 209 209 209 209 209 209 209 209 209 ...

$ ArrTime : Factor w/ 1440 levels "1","2","3","4",..: 552 543 558 527 562 528 532 542 548 531 ...

$ CRSArrTime : Factor w/ 1301 levels "1","2","3","4",..: 390 390 390 390 390 390 390 390 390 390 ...

$ UniqueCarrier: Factor w/ 14 levels "AA","AS","CO",..: 10 10 10 10 10 10 10 10 10 10 ...

$ FlightNum : Factor w/ 2161 levels "1","2","3","4",..: 1359 1359 1359 1359 1359 1359 1359 1359 1359

$ Origin : Factor w/ 237 levels "ABE","ABQ","ACV",..: 198 198 198 198 198 198 198 198 198 198 ...

$ Dest : Factor w/ 237 levels "ABE","ABQ","ACV",..: 207 207 207 207 207 207 207 207 207 207 ...

$ Cancelled : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

$ Diverted : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
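Counts and relative frequencies for a categorical feature can be produced with `table()` and `prop.table()`. This is a minimal sketch on a made-up factor; the vector `day_of_week` and its values are hypothetical.

```r
# Hypothetical day-of-week codes for seven flights (levels 1 to 7)
day_of_week <- factor(c(1, 1, 2, 3, 3, 3, 7), levels = 1:7)

counts   <- table(day_of_week)             # count per level, zero where unused
rel_freq <- round(prop.table(counts), 4)   # counts divided by their total

# One row per level, plus the number of levels (what str() reports for a factor)
cbind(Counts = counts, Rel.Freq = rel_freq)
nlevels(day_of_week)
```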

Day of Week

         October              November             December
Level    Counts   Rel. Freq.  Counts   Rel. Freq.  Counts   Rel. Freq.
1        59,243   0.1321      73,057   0.1728      58,411   0.1326
2        59,214   0.1320      58,441   0.1382      72,583   0.1648
3        59,076   0.1317      58,763   0.1390      72,396   0.1644
4        73,966   0.1649      55,614   0.1315      71,331   0.1620
5        73,739   0.1644      54,637   0.1292      56,537   0.1284
6        67,256   0.1499      52,767   0.1248      53,347   0.1211
7        56,126   0.1251      69,524   0.1644      55,798   0.1267
Total    448,620  1.0000      422,803  1.0000      440,403  1.0000

Unique Carrier

         October              November             December
Level    Counts   Rel. Freq.  Counts   Rel. Freq.  Counts   Rel. Freq.
DL       63,104   0.1407      60,150   0.1423      62,559   0.142
AA       56,091   0.125       53,200   0.1258      55,830   0.1267
UA       52,952   0.118       48,702   0.1152      50,970   0.1157
CO       42,756   0.0953      39,408   0.0932      40,838   0.0927
PI       39,228   0.0874      37,707   0.0892      39,547   0.0898
NW       37,590   0.0838      34,342   0.0812      36,341   0.0825
EA       37,048   0.0826      34,865   0.0825      36,863   0.0837
US       32,293   0.072       31,006   0.0733      31,515   0.0715
TW       23,823   0.0531      22,125   0.0523      23,792   0.054
WN       21,738   0.0485      20,237   0.0479      20,000   0.0454
HP       15,026   0.0335      14,939   0.0353      15,434   0.035
PS       14,405   0.0321      13,540   0.032       13,761   0.0312
AS       7,432    0.0166      6,967    0.0165      7,007    0.0159
PA       5,134    0.0114      5,615    0.0133      6,036    0.0137
Total    448,620  1           422,803  1           440,493  1

Cancelled

         October              November             December
Level    Counts   Rel. Freq.  Counts   Rel. Freq.  Counts   Rel. Freq.
0        445,619  0.9933      417,612  0.9877      428,910  0.9739
1        3,001    0.0067      5,191    0.0123      11,493   0.0261


Diverted

         October              November             December
Level    Counts   Rel. Freq.  Counts   Rel. Freq.  Counts   Rel. Freq.
0        447,781  0.9982      421,708  0.9974      438,522  0.9957
1        829      0.0018      1,095    0.0026      1,881    0.0043


Model Variable Construction

To clean the data, we excluded all observations with an NA value in AirTime as well as observations
where the flight was cancelled, since those observations add no information. We exclude observations
with an NA value in AirTime because we consider TaxiOut (the time a flight takes to leave the terminal
and take off from the airport) and TaxiIn (the time a flight takes to reach the terminal after
landing) important variables for explaining flight delay status and arrival delay time. In general,
observations with a null AirTime also have null TaxiIn and TaxiOut.

After cleaning the data, five additional variables are added to the dataset. The first is the delay
flag. A flight is considered late if either of the following two conditions is true:

Actual elapsed time is more than 30 minutes longer than the scheduled elapsed time.

Actual arrival time is more than 30 minutes later than the scheduled arrival time.

The 30-minute threshold was chosen based on personal experience and the assumption that a passenger
expects to wait at least 30 minutes before considering a flight delayed. The delay flag is a binary
variable with values 0 (on time) and 1 (delayed) and is used as the response variable in the KNN
model.
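The two conditions above can be sketched as a small R function. This is an illustration only, not the code used for the report: the function names `hhmm_to_min` and `delay_flag` are hypothetical, and the conversion of hhmm clock values to minutes is an assumption of this sketch (the SQL view in the appendix subtracts the raw ArrTime and CRSArrTime values directly). Flights that arrive after midnight are not handled here.

```r
# Convert an hhmm clock value (e.g. 1005 for 10:05) to minutes since midnight
hhmm_to_min <- function(hhmm) (hhmm %/% 100) * 60 + (hhmm %% 100)

# Returns 1 when either condition exceeds the threshold (in minutes), else 0
delay_flag <- function(actual_elapsed, crs_elapsed, arr_time, crs_arr_time,
                       threshold = 30) {
  elapsed_over <- (actual_elapsed - crs_elapsed) > threshold
  arrival_over <- (hhmm_to_min(arr_time) - hhmm_to_min(crs_arr_time)) > threshold
  as.integer(elapsed_over | arrival_over)
}

# Three hypothetical flights: on time, long elapsed time, late arrival
flags <- delay_flag(actual_elapsed = c(120, 200, 95),
                    crs_elapsed    = c(110, 160, 90),
                    arr_time       = c(1005, 1530, 1845),
                    crs_arr_time   = c(955, 1500, 1800))
flags
```

The first flight arrives at 10:05 against a schedule of 9:55, which is a 10-minute difference once both values are converted to minutes, so it is not flagged.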

Three temperature measures for JFK (mean, minimum, and maximum temperature) are added using the
R package called “weatherData”. Common experience tells the model team that temperature plays a
big role, especially for airports located in locales with adverse winter conditions, of which JFK
is one.

Lastly, a variable called “Total number of flights per day” is created. The variable reflects the
usage rate of the airport on a daily basis. The model team suspects that the probability of delay on
days when more flights need to take off from the airport differs from that on days when fewer flights
than normal are using the airport.
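A daily flight count of this kind can be attached to every row in R as well as in SQL. The sketch below is an illustration under assumptions: the data frame `flights` and its values are made up, and only the column name `num_flight` follows the appendix code.

```r
# Hypothetical flights on two days
flights <- data.frame(
  date     = as.Date(c("2008-01-01", "2008-01-01", "2008-01-01",
                       "2008-01-02", "2008-01-02")),
  depdelay = c(5, 40, 0, 12, 3)
)

# ave() returns, for each row, the sum of the 1s over all rows sharing that date,
# which is the number of flights on that row's day
flights$num_flight <- ave(rep(1L, nrow(flights)),
                          as.character(flights$date), FUN = sum)
flights
```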

All in all, we dropped 18 variables, either for the reasons above or because we found them to be of
no use in explaining flight delays. Those 18 variables are Year, ArrTime, CRSArrTime,
UniqueCarrier, FlightNum, TailNum, ActualElapsedTime, CRSElapsedTime, Origin, Dest,
Cancelled, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay,
LateAircraftDelay. AirTime is retained because both models use it as a predictor.

By the end of this process, we had a little more than 150K observations (150,694), of which about
36K were delayed flights. The observations are then split into a training set with 80 percent of the
observations (120,555) and a test set with the remaining 30,139.
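An 80/20 random split can be sketched as follows. This is an alternative illustration using `sample()` on a made-up data frame; the appendix code instead shuffles the rows with `order(runif(...))` and slices by position. The names `toy`, `training`, and `test` are hypothetical.

```r
set.seed(12345)

n   <- 1000                          # hypothetical number of rows
toy <- data.frame(id = seq_len(n))

# Draw 80 percent of the row indices without replacement
train_idx <- sample(n, size = floor(0.8 * n))
training  <- toy[train_idx, , drop = FALSE]
test      <- toy[-train_idx, , drop = FALSE]

c(training = nrow(training), test = nrow(test))
```

With the report's 150,694 observations, `floor(0.8 * 150694)` gives 120,555, the training size stated above.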

Delay Status Classification Model

ArrivedLate (the delay flag) is predicted using the KNN algorithm. The following results were
generated in R.

Variables: airtime, distance, taxiin, taxiout, max temperature, min temperature, number of flights.

Normalization method: min-max normalization.
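Min-max normalization rescales each variable to the range 0 to 1 by computing (x - min) / (max - min), so that variables measured on large scales, such as distance, do not dominate the KNN distance calculation. The sketch below illustrates the formula on made-up values. The function name `min_max` and its `lo` and `hi` arguments are hypothetical, and reusing the training minimum and maximum for the test set is shown as one possible variant; the appendix code scales the training and test sets separately.

```r
# Min-max scaling; lo and hi default to the range of x itself
min_max <- function(x, lo = min(x, na.rm = TRUE), hi = max(x, na.rm = TRUE)) {
  (x - lo) / (hi - lo)
}

train_taxiout <- c(10, 20, 30, 50)   # hypothetical training values
test_taxiout  <- c(15, 50, 60)       # hypothetical test values

train_scaled <- min_max(train_taxiout)

# Variant: scale the test values with the training minimum and maximum
test_scaled <- min_max(test_taxiout,
                       lo = min(train_taxiout),
                       hi = max(train_taxiout))

train_scaled
test_scaled
```

With a shared range, a test value above the training maximum scales to more than 1, which keeps test and training points on the same footing when distances are computed.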

Model confusion matrices for k = 10, 100, 200, 300, and 400. Actual values are listed on the vertical
side and predicted values on the horizontal side.

[Confusion matrices for K=10, K=100, K=200, K=300, and K=400]

K=500 did not work.

As the number of nearest neighbors increases, the number of on-time flights incorrectly classified
as delayed decreases, while the number of delayed flights incorrectly classified as on time
increases. The model results are unsatisfactory: the error rate for delayed flights is more than 20
percent for every model. Based on these results, KNN is not a good model for predicting the delay
status of flights.
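The per-class error rates discussed above can be read off a confusion matrix as follows. The labels in this sketch are invented for illustration and are not the report's results; the object names are hypothetical.

```r
# Hypothetical actual and predicted labels (0 = on time, 1 = delayed)
actual    <- factor(c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1), levels = c(0, 1))
predicted <- factor(c(0, 0, 0, 0, 0, 1, 0, 0, 1, 1), levels = c(0, 1))

# Rows are actual values, columns are predicted values
cm <- table(actual = actual, predicted = predicted)

# Share of truly delayed flights that were predicted as on time
miss_rate_delayed <- cm["1", "0"] / sum(cm["1", ])

# Share of truly on-time flights that were predicted as delayed
false_alarm_rate <- cm["0", "1"] / sum(cm["0", ])

# Share of all flights that were misclassified
overall_error <- 1 - sum(diag(cm)) / sum(cm)

cm
c(miss_rate_delayed = miss_rate_delayed,
  false_alarm_rate  = false_alarm_rate,
  overall_error     = overall_error)
```

Reporting the two class-wise rates separately matters here because delayed flights are the minority class, so a low overall error can hide a high miss rate for delayed flights.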

Prediction for Delay Arrival Time

A regression tree is used to predict the arrival delay time. The explanatory variables
include: month, day of month, day of week, departure time, scheduled departure time, airtime, taxiin,
taxiout, max temperature, mean temperature, min temperature, number of flights, departure delay.
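A minimal sketch of fitting a regression tree and scoring it by RMSE and by correlation with the held-out values is shown here. The data are simulated purely for illustration: the data frame `toy`, the way `arrdelay` is generated, and all values are assumptions of this sketch, not the JFK data or the report's results.

```r
library(rpart)

set.seed(1)
n <- 500

# Simulated predictors (hypothetical), in minutes
toy <- data.frame(
  depdelay = round(rexp(n, rate = 1 / 15)),
  taxiout  = round(runif(n, 10, 40)),
  airtime  = round(runif(n, 60, 360))
)
# Simulated response: arrival delay driven mostly by departure delay
toy$arrdelay <- toy$depdelay + 0.5 * (toy$taxiout - 25) + rnorm(n, sd = 5)

# 80/20 split
train_idx <- sample(n, size = 0.8 * n)
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# method = "anova" requests a regression tree for a numeric response
tree <- rpart(arrdelay ~ ., data = toy_train, method = "anova")
pred <- predict(tree, newdata = toy_test)

# Root mean squared error and correlation between predictions and actual values
rmse <- sqrt(mean((pred - toy_test$arrdelay)^2))
r    <- cor(pred, toy_test$arrdelay)
c(RMSE = rmse, correlation = r)
```

RMSE is expressed in the units of the response (minutes here), while the correlation only measures how closely predictions track the actual values, so the two measures complement each other.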

Correlation Matrix between the numerical variables:

Regression graph:

Model effectiveness measure, RMSE:

Model effectiveness measure, correlation with the test set:

Summary

In summary, after cleaning and preparing the data set, the KNN algorithm was applied to predict
whether a flight at JFK would be delayed. While the KNN algorithm is simple to apply, the error rate
it generated is unacceptable at more than 20%. Furthermore, increasing k, the number of nearest
neighbors, worsened the model’s ability to predict late flights, as more late flights were
misclassified as on-time flights.

On the other hand, the regression tree algorithm generated better results and produced an
easy-to-follow graph. According to the graph, the main cause of a late arrival is a delay in
departure. In other words, if a flight leaves late, it is very likely to arrive late. Moreover,
according to our model, shorter taxi-out time and airtime result in fewer delays, as common sense
would suggest. The correlation of 0.88 between the test data and the model’s predictions suggests
that this model is acceptable and is a good candidate for predicting late arrivals at JFK airport.

Since this model was applied only to JFK airport, our team suggests applying the same technique to
other airports to further evaluate the effectiveness of the regression tree algorithm for airline
flights.

APPENDIX

R CODE

## Loading Data
library(RODBC)

myconn <- odbcConnect('project')
flight_data <- sqlQuery(myconn,
    "select * from [project].[dbo].[jfk_revised]")
close(myconn)
str(flight_data)

## Build the modeling data frame; with() replaces attach() so that
## flight_data is not left on the search path
fit_data <- with(flight_data, data.frame(
    month      = factor(Month),
    dayofmonth = factor(DayofMonth),
    dayofweek  = factor(DayOfWeek),
    deptime    = round(DepTime / 100, digits = 0),     # hour of departure
    crsdeptime = round(CRSDepTime / 100, digits = 0),  # scheduled hour
    airtime    = AirTime,
    distance   = Distance,
    taxiin     = TaxiIn,
    taxiout    = TaxiOut,
    maxt       = Max_TemperatureF,
    meant      = Mean_TemperatureF,
    mint       = Min_TemperatureF,
    num_flight = num_flight,
    delay      = factor(delay),
    arrdelay   = ArrDelay,
    depdelay   = DepDelay))

## Shuffle the 150694 rows, then take the first 80 percent for training
set.seed(12345)
fit_data <- fit_data[order(runif(nrow(fit_data))), ]
fit_training <- fit_data[1:120555, ]
fit_test <- fit_data[120556:150694, ]

## KNN
## Min-max normalization: rescale a numeric vector to [0, 1]
normalize <- function(x) {
    return((x - min(x)) / (max(x) - min(x)))
}

## Keep complete rows; drop the five calendar/time columns (1:5), then
## drop delay, arrdelay and depdelay (9:11) so only numeric predictors remain
knn_training_full <- fit_training[complete.cases(fit_training), ]
knn_training <- knn_training_full[, -(1:5)]
knn_training_cor <- knn_training[, -(9:11)]
knn_training <- as.data.frame(lapply(knn_training[, -(9:11)], normalize))

knn_test_full <- fit_test[complete.cases(fit_test), ]
knn_test <- knn_test_full[, -(1:5)]
## Note: the test set is scaled with its own minimum and maximum
knn_test <- as.data.frame(lapply(knn_test[, -(9:11)], normalize))

library(class)
## Column 14 of the full training set is the delay flag
fit_pred <- knn(train = knn_training, test = knn_test,
                cl = knn_training_full[, 14], k = 10)

library(gmodels)
CrossTable(x = knn_test_full$delay, y = fit_pred, prop.chisq = FALSE)

## Repeat for k = 100, 200, ..., 1000; k must be the current value x,
## not the whole vector i
i <- seq(100, 1000, 100)
lapply(i, function(x) {
    knn_i <- knn(train = knn_training, test = knn_test,
                 cl = knn_training_full[, 14], k = x)
    CrossTable(x = knn_test_full$delay, y = knn_i, prop.chisq = FALSE)
})

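## Regression Tree
library(rpart)
## Fit a regression tree for arrival delay on all other columns
m.rpart <- rpart(arrdelay ~ ., data = fit_training)

library(rpart.plot)
rpart.plot(m.rpart, digits = 3)

library(caret)
## Predict on the test set, then score with RMSE and correlation
rpart.pred <- predict(m.rpart, fit_test)
RMSE(rpart.pred, fit_test$arrdelay)
cor(fit_test$arrdelay, rpart.pred)

## Correlation matrix of the numeric KNN predictors
cor(knn_training_cor)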

SQL CODE

USE [project]

GO

/****** Object: View [dbo].[vw_jfk] Script Date: 6/7/2015 11:08:32 AM

******/

SET ANSI_NULLS ON

GO

SET QUOTED_IDENTIFIER ON

GO

CREATE view [dbo].[vw_jfk] as

/* First chain: keep only flights where origin = JFK and create a date variable
used to aggregate the number of flights taking off at JFK. */

with cte1 as

(SELECT [Year]

,[Month]

,[DayofMonth]

,[DayOfWeek]

,[DepTime]

,[CRSDepTime]

,[ArrTime]

,[CRSArrTime]

,[UniqueCarrier]

,[FlightNum]

,[TailNum]

,[ActualElapsedTime]

,[CRSElapsedTime]

,[AirTime]

,[ArrDelay]

,[DepDelay]

,[Origin]

,[Dest]

,[Distance]

,[TaxiIn]

,[TaxiOut]

,[Cancelled]

,[CancellationCode]

,[Diverted]

,[CarrierDelay]

,[WeatherDelay]

,[NASDelay]

,[SecurityDelay]

,[LateAircraftDelay]

,DATEFROMPARTS([year],[month],[dayofmonth]) as [date]

FROM [project].[dbo].[part1_data]

where origin = 'jfk'

)

/* Joining the first chain with the min, mean, and max temperature while
filtering out cancelled flights and NA values for airtime.
The resulting dataset will have the taxiin and taxiout fields filled with
values. */

,cte2 as

(select a.*

,b.[Max_TemperatureF]

,b.[Mean_TemperatureF]

,b.[Min_TemperatureF]

from cte1 a left join [project].[dbo].[temp] b on a.[date] = b.[date]

where cancelled <> 1 and airtime <>'NA'

)

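/* Creating the delay flag using the second chain. ArrTime and CRSArrTime are
stored as hhmm clock values, so their difference is a difference of hhmm
integers rather than a true count of minutes. */
,cte3 as
(select a.*
,case when cast(a.[ActualElapsedTime] as int) - cast(a.[CRSElapsedTime] as int) > 30 then '1'
      when cast(a.[ArrTime] as int) - cast(a.[CRSArrTime] as int) > 30 then '1'
      else '0' end as [delay]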

from cte2 a

)

/*aggregating the number of flights taking off from JFK on a daily basis*/

,cte4 as

(select [date]

,count([date]) as [num_flight]

from cte3

group by [date]

)

/*combining chain number three and four into a final table*/

,cte5 as

(select a.*

,b.[num_flight]

from cte3 a left join cte4 b on a.[date] = b.[date]

)

select * from cte5

GO
