Predicting the performance of US Airline carriers
Applied Business Analytics & Business Intelligence (BIT 5534)
Submitted by: Group 4, Akash Yadav & Suresh Malhotra
Virginia Tech

Agenda
1. Problem Definition
2. Data Preparation
3. Data Exploration
4. Modeling & Analysis
5. Model Comparison & Selection
6. Recommendation
7. Customized Models & Future Work

Business Problem
Problem: Too little information is available about flight carriers and their flights.
Traveler's concern: Which flight carrier is better, and what are the chances of a flight delay?
Our Goal:
• Predict flight carriers' future performance based on delay time.
• Determine the main causes of flight delay & suggest improvements.
• What's in it for customers or travelers:
- Able to make flight reservations for time-critical business meetings.
- Easier choice of connecting flights and airports, ensuring better service.
- Educate themselves to make better decisions when making a flight reservation.
Note: Data Source: (http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp?pn=1)

Data Definition & Preparation
Variable Dictionary

| Attribute | Data Type | Description |
|---|---|---|
| year | Nominal | Time horizon of data. |
| month | Nominal | Period of year. |
| carrier | Nominal | Airline company code. |
| carrier Name | Nominal | Name of airline carrier. |
| airport | Nominal | Airport code. |
| airport Name | Nominal | Name of airport. |
| arr_flights | Continuous | No. of flights arriving. |
| carrier_ct | Continuous | Flights delayed due to the airline's own issues. |
| weather_ct | Continuous | Flights delayed due to weather issues. |
| nas_ct | Continuous | Flights delayed due to National Aviation System issues/problems. |
| security_ct | Continuous | Flights delayed due to security issues. |
| late_aircraft_ct | Continuous | Flights delayed due to a previous flight's delay. |
| arr_delay | Continuous | Delay time (target variable). |

Independent variables: carrier_ct, weather_ct, nas_ct, security_ct, late_aircraft_ct.

[Flow diagram: Missing values? → Remove missing-value cases → Scrub off outliers → Training / Validation split → Data Exploration (next step) → Multivariate Analytical Models]

Data Exploration
1) carrier_ct, nas_ct & late_aircraft_ct: Better predictors of flight delay (red boxes in the scatterplot matrix).
Reason: Small variation from the mean (red trend lines) and a more compact spread of data between arr_delay and these variables.
2) weather_ct & security_ct: Contribute least to the prediction objective.
Reason: Large variation from the mean (red trend lines) and a more scattered pattern of data between arr_delay and these variables.
3) Little or no redundancy.
Reason: The plots among the independent variables are scattered and widespread, which indicates weak correlation. (This is a good sign, as it supports the independence of the variables.)

Models & Analysis
• Linear Regression: Uses a linear combination of the 5 independent variables to predict the target. The plot shows predicted values (using regression) of avg. flight delay against actual values of delay.
• PCA Analysis: Creates new variables from the given variables. Better for data mining. 2 new variables are sufficient for analysis, as they cover 79% of the variation in the data.
• Cluster Analysis: Creates groups of data with similar attributes/characteristics. Hierarchical: estimated no. of clusters = 20. K-means: optimal number of clusters = 21.
• Decision Tree: Splits the data based on threshold values of the variables. No. of splits = 35.
• Neural Network: Creates a hidden layer of new variables that receive data from the current variables. Uses a regression-like approach and gives prediction values as output. The plot shows output vs. actual values of delay.

Model & Analysis (continued)
R²: A higher value means the model accounts for, or explains, most of the variation in the data.
RMSE (root mean square error): A low RMSE means the predicted values are close to the actual values, i.e. the deviation from the actual value is small.

Linear Regression & Neural Network Model - Best
Because R² is already high, the 5-6% increase these two models deliver matters a lot. Their RMSE is also lower than that of the other models.
Note 1: All the models perform quite well, as is evident from the high R² and low RMSE values.
Note 2: Performance is consistent across the Training & Validation sets, which indicates that data preparation was good and the models are acceptable.

| Modeling Technique | Configuration | Training RSquare (%) | Training RMSE | Validation RSquare (%) | Validation RMSE |
|---|---|---|---|---|---|
| Linear Regression | 5 Independent Variables | 95.5 | 1851 | 95.6 | 1805 |
| Principal Component Analysis | 2 Principal Components | 90.5 | 2688 | 90.8 | 2617 |
| Cluster Analysis | 21 Clusters | 87.8 | 3053 | 87.5 | 3042 |
| Decision Trees | 35 Splits | 90.9 | 2639 | 90.3 | 2694 |
| Neural Networks | 1 Hidden Layer, 6 Nodes, Learning Rate = 0.1, Transform Covariates | 96.2 | 1703 | 96.02 | 1728 |

Model Comparison & Selection
Neural Network Model - Selected
[Plot: neural vs. linear regression]
Why Neural Network?
• Compared with the linear regression model, the Neural Network model is slightly better on both R² and RMSE.
• The profiler shows that the variables' variation profiles match the desired profiles much more closely for the Neural model.
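The two comparison metrics above, R² and RMSE, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the deck's actual tooling: the function names `r_squared` and `rmse` are hypothetical, and the delay values are made up for the example rather than taken from the BTS data. The deck reports R² as a percentage, so the fraction returned here would be multiplied by 100 to compare.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: square root of the mean squared residual."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Fraction of the variation in `actual` explained by `predicted`."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("actual values have no variation")
    return 1 - ss_res / ss_tot


# Made-up arr_delay values, for illustration only (not from the BTS data).
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

rmse_value = rmse(actual, predicted)
r2_value = r_squared(actual, predicted)
print(f"RMSE = {rmse_value:.1f}, R2 = {100 * r2_value:.1f}%")
```

Computing both metrics separately on the training and the validation set, as the table does, is what shows whether a model generalizes or merely fits the training data.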
Recommendation & Key points
• Good to know:
- Although we have focused on delay time, the model has the potential to predict flight cancellations and other performance metrics.
• How to ensure an efficient model:
- It's critical to explore and prepare the data efficiently.
- Treating missing values and removing outliers is of utmost importance for model stability.
- The model should be built on training data and then tested on the validation data set. (Modify the model if needed.)
• What else could be done or added to the scope:
- Include additional predictor variables, such as distance between arrival & departure airports, air traffic, etc.
- Use data from a different source and a different time horizon.
- Focus on specific airports or air carriers.
- Use forecast models for prediction.

Customized Models and Future Work

| Air Carrier | R2 | RMSE | Logworth | Comment & LogWorth |
|---|---|---|---|---|
| AA American Airlines | 97.38 | 1554 | A - 125, B - 152, C - 181 | All significant. Security - 4 |
| DL Delta Air Lines | 95.79 | 1568 | A - 555, B - 334, C - 135 | All significant. Security - 3.4 |
| WN Southwest Airlines | 95.68 | 1485 | A - 125, B - 152, C - 181 | All significant. Security - 4.2 |
| AS United Airlines | 98.07 | 2554 | A - 737, B - 409, C - 114 | Security_ct insignificant. |
| B6 JetBlue | 96.82 | 1784 | A - 318, B - 165, C - 71 | All significant. Security - 20 |

| Busiest Airport | R2 | RMSE | Logworth | Comment & LogWorth |
|---|---|---|---|---|
| ATL Atlanta, GA | 96.72 | 1378 | A - 190, B - 179, C - 55 | Security_ct insignificant. |
| LAX Los Angeles, CA | 97.19 | 2002 | A - 344, B - 246, D - 123 | All significant. Security - 31 |
| ORD Chicago, IL | 96.07 | 2819 | A - 177, B - 145, D - 31 | Weather and security - Low |
| DFW Dallas, TX | 97.12 | 776 | A - 95, B - 480, D - 126 | Weather and security - Low |
| JFK New York, NY | 96.10 | 2287 | A - 156, B - 125, D - 55 | All significant. Security - 4.9 |

| Best Airport (US) | R2 | RMSE | Logworth | Comment & LogWorth |
|---|---|---|---|---|
| SLC Salt Lake City | 97.6 | 1157 | A - 201, B - 292, D - 135 | All significant. Security - 17 |
| DCA Washington | 95.4 | 1179 | A - 118, B - 235, C - 117 | All significant. Security - 2.2 |
| SEA Seattle-Tacoma | 97.5 | 1026 | A - 228, B - 290, D - 128 | All significant. Security - 1.4 |
| PDX Portland | 95.3 | 810 | A - 157, B - 283, D - 123 | All significant. Security - 4.5 |
| MSP Minneapolis | 97.1 | 1297 | A - 180, B - 221, D - 64 | All significant. Security - 4.7 |

| Delta & Airports | R2 | RMSE | Logworth | Comment & LogWorth |
|---|---|---|---|---|
| DL LAX | 96.28 | 1120 | A - 18, B - 23, D - 10 | Security insignificant |
| DL ORD | 88.29 | 1065 | A - 25, B - 3.7, D - 4.4 | Security - Zeroed |
| DL SLC | 94.5 | 1868 | A - 7.5, B - 17, D - 8.4 | All significant. |
| DL DCA | 88.28 | 1140 | A - 11, B - 5.6, D - 10 | Security insignificant |
| DL MSP | 98.4 | 2183 | B - 4.8, C - 8.1, D - 7.4 | Nas_ct insignificant. |

A = nas_ct; B = late_aircraft_ct; C = weather_ct; D = carrier_ct

The Dashboard Story (Future Scope of Work)
1. Individual models to be prepared for each flight carrier, each airport, and each combination of flight carrier + airport, for monthly predictions.
2. Predictions from the models will be used to provide information to travelers.
3. Descriptive models will report the current performance of flight carriers and airports based on historical data.
4. Interactive visual information delivery is the goal!
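The first item of the future work, one model per carrier, per airport, and per carrier + airport pair, can be sketched as a group-then-fit loop. This is only an illustrative sketch under loud assumptions: `fit_line` and `fit_per_group` are hypothetical helper names, the rows are invented rather than taken from the BTS data, and a single predictor (`late_aircraft_ct`) with ordinary least squares stands in for the deck's multi-variable models, whose tooling the deck does not name.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor has no variation in this group")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_per_group(rows, key_fields, predictor="late_aircraft_ct",
                  target="arr_delay"):
    """Group rows by the given fields and fit one line per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[f] for f in key_fields)].append(row)
    models = {}
    for key, members in groups.items():
        xs = [r[predictor] for r in members]
        ys = [r[target] for r in members]
        models[key] = fit_line(xs, ys)  # (intercept, slope)
    return models


# Invented rows for illustration only; not actual BTS records.
rows = [
    {"carrier": "DL", "airport": "LAX", "late_aircraft_ct": 10.0, "arr_delay": 500.0},
    {"carrier": "DL", "airport": "LAX", "late_aircraft_ct": 20.0, "arr_delay": 1000.0},
    {"carrier": "DL", "airport": "SLC", "late_aircraft_ct": 5.0, "arr_delay": 300.0},
    {"carrier": "DL", "airport": "SLC", "late_aircraft_ct": 15.0, "arr_delay": 700.0},
    {"carrier": "AA", "airport": "LAX", "late_aircraft_ct": 8.0, "arr_delay": 480.0},
    {"carrier": "AA", "airport": "LAX", "late_aircraft_ct": 12.0, "arr_delay": 720.0},
]

by_carrier = fit_per_group(rows, ["carrier"])
by_airport = fit_per_group(rows, ["airport"])
by_pair = fit_per_group(rows, ["carrier", "airport"])

for key, (intercept, slope) in sorted(by_pair.items()):
    print(key, f"intercept={intercept:.1f}", f"slope={slope:.1f}")
```

Keying the models by a tuple of fields lets the same function serve all three groupings; smaller groups, such as a single carrier at a single airport, leave fewer rows per model, which may be one reason the Delta & Airports fits above vary more than the carrier-level ones.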