Predicting the performance of US Airline carriers

Applied Business Analytics & Business Intelligence (BIT 5534)
Submitted by:
Group 4
Akash Yadav & Suresh Malhotra
Agenda
1. Problem Definition
2. Data Preparation
3. Data Exploration
4. Modeling & Analysis
5. Model Comparison & Selection
6. Recommendation
7. Customized Models & Future Work
Virginia Tech
Business Problem
Problem: Insufficient information is available on flight carriers & their flights.
Traveler's concern:
Which flight carrier is better, and what are the chances of a flight delay?
Our Goal:
• Predict flight carriers' future performance based on delay time.
• Determine the main causes of flight delay & suggest improvements.
What's in it for customers or travelers:
• Able to make flight reservations for time-critical business meetings.
• Easier choice of connecting flights and airports, ensuring better service.
• Educate themselves to make better decisions when making a flight reservation.
Note: Data Source: (http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp?pn=1)
Data Definition & Preparation
Variable Dictionary
Attribute          Data Type    Description
year               Nominal      Time horizon of data.
month              Nominal      Period of year.
carrier            Nominal      Airline company code.
carrier Name       Nominal      Name of airline carrier.
airport            Nominal      Airport code.
airport Name       Nominal      Name of airport.
arr-flights        Continuous   No. of flights arriving.
carrier_ct         Continuous   Flights delayed due to airline's own issues.
weather_ct         Continuous   Flights delayed due to weather issues.
nas_ct             Continuous   Flights delayed due to National Aviation System issues/problems.
security_ct        Continuous   Flights delayed due to security issues.
late_aircraft_ct   Continuous   Flights delayed due to previous flight delay.
arr_delay          Continuous   Delay time (Target variable)
Data preparation flow (diagram):
• Independent Variables
• Missing values? Remove Missing Value Cases
• Scrub off Outliers
• Training / Validation
• Data Exploration (next step)
• Multivariate Analytical Models
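The preparation steps named in the flow (remove missing-value cases, scrub off outliers, split into training and validation data) can be sketched as below. This is a minimal illustration, not the procedure the group actually ran: the 1.5 x IQR outlier rule, the 30% validation fraction, the fixed seed, and the function names are all assumptions, and rows are assumed to be plain dictionaries keyed by the attribute names in the variable dictionary.

```python
import random
import statistics


def drop_missing(rows, columns):
    """Keep only rows where every listed column has a value."""
    return [r for r in rows if all(r.get(c) is not None for c in columns)]


def remove_outliers(rows, column, k=1.5):
    """Drop rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR].

    The IQR rule is an assumed choice; the slides do not say how
    outliers were identified.
    """
    values = [r[column] for r in rows]
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [r for r in rows if low <= r[column] <= high]


def train_validation_split(rows, validation_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and split it into (training, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]
```

Applying the three functions in that order mirrors the flow: rows with gaps are removed first so that the quartiles used for the outlier rule are computed on complete cases only.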
Data Exploration
1) carrier_ct, nas_ct & late_aircraft_ct: Better predictors of flight delay. (Red Boxes)
Reason: Small variation from the mean (red trend lines) & a more compact pattern of data between
arr_delay and these variables.
2) weather_ct & security_ct: Least contribution to the prediction objective.
Reason: Large variation from the mean (red trend lines) & a more scattered pattern of data between arr_delay
and these variables.
3) Small or no redundancy:
Reason: The plots among the independent variables are scattered, which indicates weak correlation
between them. (This is a good sign, as it supports the independence of the variables.)
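The visual reading of the scatter plots can be backed by a number. The sketch below computes the Pearson correlation coefficient between two columns; values near +1 or -1 correspond to the compact patterns described above, and values near 0 to the scattered ones. The function names are hypothetical, and the slides do not state which statistic, if any, was computed alongside the plots.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlations_with_target(rows, predictors, target):
    """Correlation of each predictor column with the target column."""
    y = [r[target] for r in rows]
    return {p: pearson([r[p] for r in rows], y) for p in predictors}
```

Calling `pearson` on pairs of predictors gives the same kind of check for redundancy: weakly correlated predictors carry largely separate information.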
Models & Analysis
Linear Regression
• Uses a linear combination of the 5 independent variables to predict the target.
• Plot: predicted values (from the regression) of avg. flight delay against actual delay values.
Principal Component Analysis (PCA)
• Creates new variables from the given variables. Better for data mining.
• 2 new variables are sufficient for analysis, as they cover 79% of the variation in the data.
Cluster Analysis
• Clusters data with similar attributes/characteristics.
• Hierarchical: estimated no. of clusters = 20
• K-means: optimal number of clusters = 21
Decision Tree
• Creates groups of data with similar attributes. Each data split is based on a threshold value of a variable.
• No. of splits = 35
Neural Network
• Creates a hidden layer of new variables that receive data from the current variables.
• Uses a regression-like approach and gives predicted values as output.
• Plot: output vs. actual values of delay.
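The statement that a small number of new variables covers most of the variation rests on the cumulative share of variance explained by the leading principal components. A minimal sketch of that calculation follows; the eigenvalues in the usage note are purely illustrative and are not taken from the group's analysis, and the function name is hypothetical.

```python
def components_needed(eigenvalues, threshold):
    """Return (k, share): the smallest number of leading components whose
    cumulative share of total variance reaches `threshold`, and that share.

    `eigenvalues` are the variances of the principal components.
    """
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += value
        if running / total >= threshold:
            return k, running / total
    return len(eigenvalues), 1.0
```

For example, with the made-up eigenvalues `[3.0, 1.0, 0.5, 0.3, 0.2]` and a threshold of 0.75, the first two components are enough, since together they account for about 80% of the total variance.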
Model & Analysis (continued…)
R2: A higher value means the model accounts for (explains) more of the variation in the data.
(Refer plots below)
RMSE: Root mean square error.
A low RMSE means the predicted values are close to the actual values, i.e. the deviation from the
actual values is small.
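Both metrics can be computed directly from the actual and predicted delay values. The sketch below uses the standard definitions (R2 = 1 - SS_res / SS_tot, RMSE = square root of the mean squared error); the function names are hypothetical. Note that `r_squared` returns a fraction, whereas the results table appears to report R2 as a percentage.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot
```

A model that always predicts the mean of the actual values scores R2 = 0, and a perfect model scores R2 = 1 with RMSE = 0, which is why high R2 and low RMSE are read together.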
Linear Regression & Neural Network Model - Best
Because R2 is already high for every model, a 5-6 point increase matters a lot. In addition,
RMSE is lowest for these two models.
Note 1: All the models perform quite well, as shown by the high R2 and low RMSE values.
Note 2: Performance is consistent across the Training & Validation sets, which indicates that the
data preparation was sound and the models are acceptable.
Modeling Technique             Training   Training   Validation   Validation   Configuration
                               RSquare    RMSE       RSquare      RMSE
Linear Regression              95.5       1851       95.6         1805         5 Independent Variables
Principal Component Analysis   90.5       2688       90.8         2617         2 Principal Components
Cluster Analysis               87.8       3053       87.5         3042         21 Clusters
Decision Trees                 90.9       2639       90.3         2694         35 Splits
Neural Networks                96.2       1703       96.02        1728         1 Hidden Layer, 6 Nodes, Learning Rate = 0.1, Transform Covariates
Plots: Neural; PCA
Model Comparison & Selection
Neural Network Model - Selected
Plot legend: neural; linear regression
Why Neural Network?
• Compared with the linear regression model, the neural network model is slightly better on R2 and RMSE.
• The profiler shows that the variables' variation profiles match the desired profiles more closely for the neural model.
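The selection can be reproduced from the reported metrics. The sketch below ranks the five models by validation RMSE and also reports the training-to-validation change in R2 as a rough consistency check. The numbers are the ones reported in the results table; ranking by validation RMSE and the function names are assumptions made for illustration, since the slides describe the choice only qualitatively.

```python
# (training R2, training RMSE, validation R2, validation RMSE) as reported in the deck.
RESULTS = {
    "Linear Regression": (95.5, 1851, 95.6, 1805),
    "Principal Component Analysis": (90.5, 2688, 90.8, 2617),
    "Cluster Analysis": (87.8, 3053, 87.5, 3042),
    "Decision Trees": (90.9, 2639, 90.3, 2694),
    "Neural Networks": (96.2, 1703, 96.02, 1728),
}


def rank_by_validation_rmse(results):
    """Model names ordered from lowest to highest validation RMSE."""
    return sorted(results, key=lambda name: results[name][3])


def r2_gap(results):
    """Training R2 minus validation R2 per model (small gaps suggest stability)."""
    return {name: round(m[0] - m[2], 2) for name, m in results.items()}


ranking = rank_by_validation_rmse(RESULTS)
gaps = r2_gap(RESULTS)
```

Under this ranking the neural network comes first and linear regression second, consistent with the selection stated above.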
Recommendation & Key points
• Good to know:
- Although we have focused on delay time, the model has the potential to predict flight cancellations
and other performance metrics.
• How to ensure an efficient model:
- It's critical to explore and prepare the data carefully.
- Treating missing values and removing outliers is of utmost importance for model stability.
- The model should be built on the training data and then tested on the validation data set. (Modify the model if needed.)
• What else could be done or added to the scope:
- Additional predictor variables can be included, such as the distance between arrival & departure airports, air
traffic, etc.
- Use data from different sources and different time horizons.
- Focus on specific airports or air carriers.
- Use forecast models for prediction.
Customized Models and Future Work
Air Carrier               R2      RMSE   LogWorth                    Comment & LogWorth
AA  American Airlines     97.38   1554   A - 125, B - 152, C - 181   All significant. Security - 4
DL  Delta Air Lines       95.79   1568   A - 555, B - 334, C - 135   All significant. Security - 3.4
WN  Southwest Airlines    95.68   1485   A - 125, B - 152, C - 181   All significant. Security - 4.2
AS  United Airlines       98.07   2554   A - 737, B - 409, C - 114   Security_ct insignificant.
B6  JetBlue               96.82   1784   A - 318, B - 165, C - 71    All significant. Security - 20

Busiest Airport           R2      RMSE   LogWorth                    Comment & LogWorth
ATL  Atlanta, GA          96.72   1378   A - 190, B - 179, C - 55    Security_ct insignificant.
LAX  Los Angeles, CA      97.19   2002   A - 344, B - 246, D - 123   All significant. Security - 31
ORD  Chicago, IL          96.07   2819   A - 177, B - 145, D - 31    Weather and security - Low
DFW  Dallas, TX           97.12   776    A - 95, B - 480, D - 126    Weather and security - Low
JFK  New York, NY         96.10   2287   A - 156, B - 125, D - 55    All significant. Security - 4.9

Best Airport (US)         R2      RMSE   LogWorth                    Comment & LogWorth
SLC  Salt Lake City       97.6    1157   A - 201, B - 292, D - 135   All significant. Security - 17
DCA  Washington           95.4    1179   A - 118, B - 235, C - 117   All significant. Security - 2.2
SEA  Seattle-Tacoma       97.5    1026   A - 228, B - 290, D - 128   All significant. Security - 1.4
PDX  Portland             95.3    810    A - 157, B - 283, D - 123   All significant. Security - 4.5
MSP  Minneapolis          97.1    1297   A - 180, B - 221, D - 64    All significant. Security - 4.7

Delta & Airports          R2      RMSE   LogWorth                    Comment & LogWorth
DL  LAX                   96.28   1120   A - 18, B - 23, D - 10      Security insignificant
DL  ORD                   88.29   1065   A - 25, B - 3.7, D - 4.4    Security - Zeroed
DL  SLC                   94.5    1868   A - 7.5, B - 17, D - 8.4    All significant.
DL  DCA                   88.28   1140   A - 11, B - 5.6, D - 10     Security insignificant
DL  MSP                   98.4    2183   B - 4.8, C - 8.1, D - 7.4   Nas_ct insignificant.

A = nas_ct; B = late_aircraft_ct; C = weather_ct; D = carrier_ct
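The slides do not define LogWorth. Assuming it follows the usual definition, LogWorth = -log10(p-value), larger values mean stronger evidence that a predictor matters, and a LogWorth of 2 corresponds to p = 0.01. The sketch below converts between the two; the 0.05 significance level and the function names are assumptions, not the cut-off the group used to label a variable "insignificant".

```python
import math


def logworth(p_value):
    """LogWorth under the assumed definition: -log10(p-value)."""
    if not 0 < p_value <= 1:
        raise ValueError("p-value must be in (0, 1]")
    return -math.log10(p_value)


def p_value_from_logworth(value):
    """Invert the assumed definition: p = 10 ** (-LogWorth)."""
    return 10 ** (-value)


def is_significant(value, alpha=0.05):
    """True if the p-value implied by a LogWorth is below alpha (assumed level)."""
    return p_value_from_logworth(value) < alpha
```

Read this way, the three-digit LogWorth values in the tables correspond to extremely small p-values, while the single-digit security values are much weaker by comparison even when they still pass the threshold.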
The Dashboard Story
(Future Scope of Work)
1. Individual models to be prepared for each flight carrier, each airport, and each combination of
flight carrier + airport for monthly predictions.
2. Predictions from the models will be used to provide information to travelers.
3. Descriptive models will report the current performance of flight carriers and airports
based on historical data.
4. Interactive visual information delivery is the goal!
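Preparing a separate model per carrier, per airport, and per carrier + airport combination starts with partitioning the records by those keys. The sketch below shows that grouping step, with the average arrival delay per group as a simple descriptive summary standing in for a fitted model. It assumes rows are dictionaries with `carrier`, `airport`, and `arr_delay` keys as in the variable dictionary; the function names are hypothetical.

```python
from collections import defaultdict


def group_rows(rows, keys):
    """Partition rows by the tuple of values found under `keys`."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return dict(groups)


def mean_delay_by_group(rows, keys, target="arr_delay"):
    """Average of the target column within each group."""
    return {
        group: sum(r[target] for r in members) / len(members)
        for group, members in group_rows(rows, keys).items()
    }
```

Calling `mean_delay_by_group(rows, ["carrier"])`, `mean_delay_by_group(rows, ["airport"])`, and `mean_delay_by_group(rows, ["carrier", "airport"])` yields the three views listed in item 1; adding `"month"` to the keys would give the monthly breakdown.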