DATA ANALYSIS METHODS Mid Term Onkar Deshmukh M06156153 [email protected] Abstract
Data Analysis Methods

Nov 19, 2015

Ashish Boora

Analyzing the Flight Data
  • DATA ANALYSIS METHODS Mid Term

    Onkar Deshmukh M06156153

    [email protected]

    Abstract

  • 1

    DATA ANALYSIS METHODS

    Table of Contents

    1. Purpose
    2. Understanding the data
    3. Data Cleansing
    4. Plots and Data Visualization
    5. Correlation and Variable Selection
    6. Effect of Variables on landing distance
    6.1 Effect of Aircraft make on landing distance
    6.2 Effect of speed_ground on landing distance
    7. Model Fitting and Regression Analysis
    7.1 Initial Model with all variables
    7.2 Model with only aircraftmodel, speed_ground and height
    7.3 Squaring the values for speed_ground and Improving Model
    8. Summary

    1. Purpose

    The purpose of this project is to analyze the given landing.csv data and study which factors affect landing distance and how. A further goal is to come up with a best-fit model relating those factors to landing distance.

    2. Understanding the data

    Objective: Understand the data given to us. This section gives a basic overview of the data: what kinds of values it holds, how many variables there are, and their summary statistics such as mean, minimum and maximum. Histograms help us understand the frequency distribution of each variable.

    R Code:

    > FlightData <- read.csv("landing.csv")
    > dim(FlightData)

    Output:

    [1] 800 8

    R Code

    > summary(FlightData)

    > attach(FlightData)

    Output:

    > par(mfrow = c(2,4))
    > hist(duration)
    > hist(no_pasg)
    > hist(speed_ground)
    > hist(speed_air)
    > hist(height)
    > hist(pitch)
    > hist(distance)

    Observations:

    The dataset contains 8 variables and 800 observations.

    Two types of aircraft exist in the dataset.

    All variables except Speed_air are completely populated. Speed_air has 600 N/A values, which we may need to handle separately.

    Looking at the minimum and maximum values of these variables, we see that some observations fall outside the recommended ranges for Duration, Speed_ground, Speed_air, Height and distance.

    Decision:

    We need to cleanse the data: keep the observations that abide by the rules given in the requirement document and filter out those whose values fall beyond the threshold for any variable.
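The checks behind these observations can be sketched as follows. This is a minimal illustration on a tiny made-up data frame (the name `flights` and every value in it are assumptions, not the landing.csv data); it counts missing values per column and lists each column's minimum and maximum.

```r
# Synthetic stand-in for the flight data; all values are made up for illustration.
flights <- data.frame(
  duration     = c(35, 120, 150, 95),
  speed_ground = c(28, 80, 145, 100),
  speed_air    = c(NA, NA, 146, 101),
  height       = c(5, 30, 25, 40),
  distance     = c(900, 1500, 6500, 2500)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(flights))
print(na_counts)

# Minimum (row 1) and maximum (row 2) of each column, ignoring missing values.
ranges <- sapply(flights, range, na.rm = TRUE)
print(ranges)
```

Comparing `ranges` with the recommended thresholds shows at a glance which variables contain out-of-range observations.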

    3. Data Cleansing

    Objective: As noted in the previous section, the file has a few observations that are not in line with the recommended threshold values. For our analysis we need to clean up this data. The data cleansing rules specified in the requirement document are:

    Duration of a normal flight should be greater than 40 minutes

    Speed_ground and Speed_air should be no less than 30 MPH and no greater than 140 MPH

    Height should be at least 6 meters at the threshold of the runway

    Landing distance should be less than 6000 feet, the stated limit for the length of the airport runway

    R Code:

    > rule1 rule6 140)
    > rule7 rule8 6000)
    > rulei1 rulei2 rulei3 rulei4 rulei5 rulei6 FlightDataClean
    > dim(FlightDataClean)
    > detach(FlightData)
    > attach(FlightDataClean)

    R Output:

    > dim(FlightDataClean)
    [1] 781 8

    Observations:

    19 observations have been filtered out. 5 observations have a duration of less than 40 minutes. 2 observations have speed_ground less than 30 MPH and 1 observation has speed_ground greater than 140 MPH. 10 observations have height less than 6 meters. 1 observation has a landing distance greater than 6000 feet.

    Conclusion:

    We have cleaner data for our analysis
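One way the rules above could be applied is sketched here. It is not the author's original code: the data frame `flights` and its values are made up, and keeping rows whose Speed_air is missing is an assumption (chosen because most Speed_air values are N/A).

```r
# Synthetic stand-in for the flight data; all values are made up for illustration.
flights <- data.frame(
  duration     = c(35, 120, 150, 95, 60, 200),
  speed_ground = c(60, 80, 145, 100, 25, 110),
  speed_air    = c(NA, NA, 146, 101, NA, 112),
  height       = c(20, 30, 25, 4, 35, 30),
  distance     = c(900, 1500, 6500, 2500, 800, 3000)
)

# TRUE for rows that satisfy every rule.
# Assumption: a missing speed_air does not disqualify a row.
keep <- with(flights,
  duration > 40 &
  speed_ground >= 30 & speed_ground <= 140 &
  (is.na(speed_air) | (speed_air >= 30 & speed_air <= 140)) &
  height >= 6 &
  distance < 6000
)

flights_clean <- flights[keep, ]
removed <- nrow(flights) - nrow(flights_clean)
print(dim(flights_clean))
```

Building a single logical vector keeps each rule visible on its own line and makes it easy to count how many rows each rule removes.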

    4. Plots and Data Visualization

    Objective: Graphically understand the impact of different variables on landing distance.

    R Code:

    > pairs(FlightDataClean)

    Output:

    Observation:

    Speed_air and speed_ground show a prominent pattern against distance

    The other variables don't show a meaningful pattern

    Conclusion:

    Preliminary graphical analysis suggests that speed_air and speed_ground have an effect on landing distance

    5. Correlation and Variable Selection

    Objective: Understand the correlation between all the variables and select the variables to use in our linear model. If two variables are highly correlated, we can use either one of them. Also, as described in section 2, speed_air has too many missing values, so we need to find out whether an alternate variable, highly correlated with speed_air, can be used in its place.

    R Code:

    > cor(FlightDataClean[,2:8])
    > cor(FlightDataClean[,2:8], use = "pairwise.complete.obs")
    > plot(speed_ground, speed_air)
    > speed_diff = speed_air - speed_ground
    > summary(speed_diff)
    > hist(speed_diff)

    R Output:

    Observation:

    Speed_ground and Speed_air have a very strong positive correlation.

    Because of the 600 missing values in Speed_air, its correlation with other variables can't be determined directly, so we need to drop the N/A values before computing the correlation.

    The plot of Speed_air against Speed_ground is linear.

    Speed_air is N/A for speeds below 90 MPH; it is populated only above 90 MPH.

    Conclusion:

    The missing values in Speed_air are definitely an issue for the analysis, and we can't just drop this variable. However, Speed_ground has a correlation coefficient of 0.989 with Speed_air, which means that we can use Speed_ground as a substitute for Speed_air. This eliminates the missing-value issue without losing significant information.
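The effect of missing values on `cor()` can be illustrated with simulated data. The sketch below assumes a made-up relationship between the two speeds and mimics the pattern where speed_air is missing below 90; the numbers are synthetic, not the values from landing.csv.

```r
set.seed(1)

# Simulated speeds; speed_air tracks speed_ground closely by construction.
speed_ground <- runif(200, min = 40, max = 130)
speed_air <- speed_ground + rnorm(200, sd = 1.5)

# Mimic the missingness pattern: speed_air is recorded only at higher speeds.
speed_air[speed_ground < 90] <- NA
dat <- data.frame(speed_ground, speed_air)

# Default behaviour: any pair involving a column with NA yields NA.
cor_default <- cor(dat)

# Pairwise-complete: each correlation uses only rows where both values exist.
cor_pairwise <- cor(dat, use = "pairwise.complete.obs")

print(cor_default)
print(cor_pairwise)
```

With `use = "pairwise.complete.obs"` the correlation is computed only over the rows where speed_air was recorded, so it describes the relationship in that speed range only.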

    6. Effect of Variables on landing distance

    6.1 Effect of Aircraft make on landing distance

    Objective: Understand the effect of aircraft model on landing distance.

    R Code:

    > aircraftmodel <- rep(0, nrow(FlightDataClean))
    > aircraftmodel[which(aircraft == "boeing")] <- 1
    > plot(distance~aircraftmodel)

    Output:

    Observation: Based on the given data, landing distance for Boeing aircraft shows an upward shift compared to Airbus aircraft.

    Conclusion: Landing distances for Boeing aircraft appear to span a higher range than those for Airbus aircraft.
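A 0/1 indicator for aircraft make lets the shift between the two groups be read off directly. The sketch below uses six made-up observations (the distances are assumptions for illustration only) and shows that the regression coefficient on the indicator equals the difference between the group means.

```r
# Made-up observations for illustration; not taken from landing.csv.
aircraft <- c("airbus", "boeing", "boeing", "airbus", "boeing", "airbus")
distance <- c(1200, 1900, 2300, 1000, 2100, 1400)

# Indicator variable: 1 for boeing, 0 for airbus.
aircraftmodel <- ifelse(aircraft == "boeing", 1, 0)

# Mean landing distance per aircraft make.
group_means <- tapply(distance, aircraft, mean)

# Regressing distance on the indicator: the slope is the boeing-minus-airbus shift.
fit <- lm(distance ~ aircraftmodel)
shift <- unname(coef(fit)["aircraftmodel"])

print(group_means)
print(shift)
```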

    6.2 Effect of speed_ground on landing distance

    Objective: Understand the effect of speed_ground on landing distance.

    Rcode:

    plot(distance~speed_ground)

    Output:

    Observation:

    From the graph and the correlation, we can conclude that speed_ground has an effect on landing distance. In the range 80-120, distance seems to increase linearly with speed_ground. In the range 40-80, the relationship seems to be nonlinear.

    Conclusion: To explain the nonlinear part of the graph, a quadratic component needs to be included when we fit a model relating speed_ground to distance.

    7. Model Fitting and Regression Analysis

    7.1 Initial Model with all variables

    Objective: Define a model fitted on all the variables present in the cleaned-up dataset.

    Rcode:

    > Model1 <- lm(distance ~ aircraftmodel + duration + no_pasg + speed_ground + height + pitch)
    > summary(Model1)

    Observation:

    Null hypothesis: the variables (regressors) have no impact on the response (landing distance).

    The p-values for aircraftmodel, speed_ground and height are less than 0.05, so we can reject the null hypothesis for these variables at the 5% significance level. That is, aircraftmodel, speed_ground and height appear to have an impact on landing distance in the model we are trying to fit.

    The R-squared value is 0.856. In the ideal scenario, an R-squared of 1 would mean that the model fits the given data perfectly, so 0.856 is reasonably close to 1.

    Conclusion: Based on the above observations we can conclude that:

    The R-squared value of 0.856 indicates that 85.6% of the variability in landing distance is explained by the model and its variables.

    At the 5% significance level we have enough evidence to believe that aircraftmodel, speed_ground and height do have an impact on landing distance, so we need to consider the effect of these variables separately. We also need to monitor adjusted R-squared to decide whether a model improves on explaining the variability.
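The quantities discussed above (p-values, R-squared and adjusted R-squared) can be pulled out of a fitted model programmatically. The sketch below runs on simulated data whose coefficients and noise level are assumptions chosen for illustration; `no_pasg` is generated independently of the response.

```r
set.seed(42)
n <- 200

# Simulated predictors; distance depends on speed_ground and height only.
speed_ground <- runif(n, min = 40, max = 130)
height <- runif(n, min = 10, max = 50)
no_pasg <- round(runif(n, min = 40, max = 80))
distance <- 40 * speed_ground + 14 * height + rnorm(n, sd = 300)

fit <- lm(distance ~ speed_ground + height + no_pasg)
fit_summary <- summary(fit)

# Coefficient table: estimate, standard error, t value and p-value per term.
p_values <- fit_summary$coefficients[, "Pr(>|t|)"]

# Terms whose p-value is below the 5% significance level.
significant <- names(p_values)[p_values < 0.05]

r_squared <- fit_summary$r.squared
adj_r_squared <- fit_summary$adj.r.squared

print(significant)
print(c(r_squared = r_squared, adj_r_squared = adj_r_squared))
```

Adjusted R-squared penalizes extra regressors, which is why it is the more useful figure when comparing models with different numbers of variables.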

    7.2 Model with only aircraftmodel, speed_ground and height

    Objective: Reduce the number of variables from the previously built model and analyze the impact on the goodness of fit of the new model. We consider only 3 variables: aircraftmodel, speed_ground and height. We also plot the residuals against these 3 variables.

    Rcode:

    > Model2 <- lm(distance ~ aircraftmodel + speed_ground + height)
    > summary(Model2)
    > Residuals1 <- residuals(Model2)
    > par(mfrow=c(1,3))
    > plot(Residuals1~aircraftmodel)
    > plot(Residuals1~speed_ground)
    > plot(Residuals1~height)

    Output:

    Observations:

    The residual plot for speed_ground has a nonlinear, U-shaped pattern, which suggests a quadratic relationship.

    Because aircraftmodel has only two discrete values, 0 and 1, its residuals are plotted at only those two positions. No meaningful conclusion can be drawn from this plot at this point.

    The residual plot for height has a random, non-symmetric pattern, so it is difficult to draw a meaningful conclusion.

    Conclusion:

    We need to improve our model by addressing the nonlinear residual plot for speed_ground. To capture the nonlinearity shown in the curved graph, we need to include a nonlinear component in the model so that it explains more of the variability.

    We can keep the height and aircraftmodel variables as they are in the model.
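A U-shaped residual plot of this kind can be reproduced with simulated data. In the sketch below the curved relationship between `x` and `y` is an assumption made up for illustration; fitting a straight line to it leaves residuals that are positive at both ends of the range and negative in the middle.

```r
set.seed(3)

# Simulated curved relationship (coefficients are made up for illustration).
x <- runif(250, min = 40, max = 130)
y <- 2500 - 60 * x + 0.6 * x^2 + rnorm(250, sd = 150)

# Fit a straight line even though the true relationship is curved.
fit <- lm(y ~ x)
res <- residuals(fit)

# Average residual in the low, middle and high parts of the range.
low  <- mean(res[x < 60])
mid  <- mean(res[x >= 75 & x <= 95])
high <- mean(res[x > 115])
print(c(low = low, mid = mid, high = high))

# Residuals against the predictor; the dashed line marks zero.
plot(x, res, xlab = "x", ylab = "residual")
abline(h = 0, lty = 2)
```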

    7.3 Squaring the values for speed_ground and Improving Model

    Objective: As described in the previous section, we square the values of speed_ground to capture the nonlinear nature of the curve. We also monitor the R-squared and adjusted R-squared values of this new model.

    Rcode:

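    > speed_ground_sqr <- speed_ground^2
    > model3 <- lm(distance ~ aircraftmodel + speed_ground + speed_ground_sqr + height)
    > summary(model3)
    > residuals2 <- residuals(model3)
    > par(mfrow=c(1,4))
    > plot(residuals2~aircraftmodel)
    > plot(residuals2~height)
    > plot(residuals2~speed_ground)
    > plot(residuals2~speed_ground_sqr)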

    Observations:

    The R-squared and adjusted R-squared values have gone up; both are now 0.9776.

    The p-values for all the variables in the model are less than 0.05.

    The residual plots for speed_ground and speed_ground_sqr are randomly distributed.

    Conclusion:

    The R-squared value of 0.9776 indicates that 97.76% of the variability in landing distance is explained by this model.

    This model is the best choice among the models discussed so far.
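The gain from adding a squared term can be checked by comparing adjusted R-squared between a linear and a quadratic fit. The sketch below uses simulated data with an assumed curved relationship; it illustrates the comparison and does not reproduce the 0.9776 figure.

```r
set.seed(7)
n <- 300

# Simulated curved relationship (coefficients are made up for illustration).
speed_ground <- runif(n, min = 40, max = 130)
distance <- 2500 - 60 * speed_ground + 0.6 * speed_ground^2 + rnorm(n, sd = 150)

# Model with a linear term only.
linear_fit <- lm(distance ~ speed_ground)

# Model with an added squared term.
speed_ground_sqr <- speed_ground^2
quad_fit <- lm(distance ~ speed_ground + speed_ground_sqr)

adj_linear <- summary(linear_fit)$adj.r.squared
adj_quad <- summary(quad_fit)$adj.r.squared

print(c(linear = adj_linear, quadratic = adj_quad))
```

An equivalent formulation writes the squared term inline as `I(speed_ground^2)` in the formula, which avoids creating a separate variable.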

    8. Summary

    Based on the analysis, we can conclude that:

    speed_ground and speed_air are highly correlated and both seem to have an impact on landing distance.

    From the data and the regression analysis, we can't rule out that height has an impact on landing distance.

    Referring to the plots, speed_ground has a strong relationship with landing distance. Part of the graph points to a linear relationship and part to a nonlinear one. The U-shaped residual plot for speed_ground reinforces that there is a nonlinear, quadratic relationship between speed_ground and landing distance. Hence, we need to include a nonlinear component in our model to find the most accurately fitting model.

    In the end, the model that includes a squared term of speed_ground has a very high R-squared value (0.9776). The nonlinear model from section 7.3 is therefore a better fit than the other models we discussed and explains most of the variability in landing distance.