Navigating the Skies: A Data-Driven Exploration of Flight Pricing and Airline Performance
Introduction
Fasten your seatbelts! We’re embarking on a data-driven journey through the complex world of air travel. This project delves into the interplay between flight pricing, airline performance, and environmental impact. By analyzing a comprehensive dataset, we aim to shed light on the factors that influence your travel decisions. The analysis looks at the following:
Airlines: Which carriers offer the best value for your money, taking into account price, service, and reliability?
Dates and Days: Does flexibility in travel dates significantly impact costs? Does flying on weekdays differ from weekends?
Cities: How do origin and destination cities influence pricing and flight options?
Duration: Is opting for shorter, more direct flights always the most economical choice?
Stops: Does the number of layovers significantly impact cost and convenience?
Meal and Baggage: How do these add-ons affect the ticket price, and are they always worth the expense?
Emissions Data: For the environmentally conscious traveller, we’ll analyse the carbon footprint associated with different airlines and flight options.
Dataset Collection
Yatra.com is a popular Indian online travel agency offering flight booking services. Manual browsing wouldn’t suffice to gather data that goes beyond the listed prices. This is where Selenium comes in. Selenium is a powerful tool for automating web browser interactions, allowing us to navigate dynamic content, extract specific data points and iterate efficiently.
XPath in Selenium is a powerful technique used to traverse and interact with the HTML structure of a web page. It provides a standardized way to navigate HTML and XML documents, allowing testers to locate elements for automation testing precisely.
We import Selenium for browser automation and pandas for data manipulation, and define lists to store the extracted details (airlines, prices, etc.). Travel dates and city combinations for the desired routes are also established. We then navigate to yatra.com, enter the cities and the desired dates, and use Selenium’s WebDriverWait to ensure elements are loaded before interacting with them. Using XPaths, we loop through each visible flight block on the page to locate specific data elements, and finally export the data to a CSV file to build the dataset.
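The extract-and-export loop can be sketched without a browser. This is a minimal illustration, not the scraper used for the project: it runs the standard library's limited XPath support over a made-up snippet of markup, and the class names (`flight`, `airline`, `price`, `stops`) are hypothetical rather than taken from yatra.com. In a Selenium script, `driver.find_elements(By.XPATH, ...)` would play the role of `findall` here.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up markup standing in for one results page; the class names are
# hypothetical and do not come from yatra.com.
PAGE = """
<div id="results">
  <div class="flight">
    <span class="airline">IndiGo</span>
    <span class="price">5499</span>
    <span class="stops">Non Stop</span>
  </div>
  <div class="flight">
    <span class="airline">Air India</span>
    <span class="price">6120</span>
    <span class="stops">1 Stop</span>
  </div>
</div>
"""

FIELDS = ["airline", "price", "stops"]


def extract_flights(markup):
    """Return one dict per flight block found in the markup."""
    root = ET.fromstring(markup.strip())
    rows = []
    # Locate every flight block anywhere below the root element.
    for block in root.findall(".//div[@class='flight']"):
        # Relative XPath: search for each field inside the current block only.
        row = {
            field: block.find(f".//span[@class='{field}']").text.strip()
            for field in FIELDS
        }
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text (an open file handle works the same way)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


flights = extract_flights(PAGE)
csv_text = to_csv(flights)
print(csv_text)
```

Searching relative to each block (the leading `.` in the XPath) keeps the fields of one flight together, which matters when a page lists many flights with identical markup.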
Dataset: <link>
By utilizing Selenium’s capabilities for automated data extraction, this project unlocks valuable insights into the world of flight travel, empowering users to make informed and responsible choices.
Data Cleaning
1. Removing Unwanted Entries
We begin by identifying and eliminating data points where the destination is recorded as “Nagpur”, since these appear to be a glitch of the web scraping process rather than actual flight information.
2. Standardizing Data Formats
We convert the “departure time” feature from its current format into a standard datetime format. Both “emissions” and “day” columns undergo transformations to extract numerical values. We remove “kgs” from the emission values and “day” from the day entries. The “duration” feature is processed to extract the actual flight duration in minutes. This conversion facilitates further analysis involving flight times and durations.
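A minimal pandas sketch of these cleaning steps follows. The column names and the raw string formats (`"132 kgs"`, `"day 1"`, `"2h 35m"`, `"06:15"`) are assumptions made for illustration; the scraped dataset may use different names and formats.

```python
import pandas as pd

# Toy rows with assumed column names and raw formats.
raw = pd.DataFrame({
    "destination": ["Mumbai", "Nagpur", "Chennai"],
    "departure_time": ["06:15", "09:40", "21:05"],
    "emissions": ["132 kgs", "118 kgs", "97 kgs"],
    "day": ["day 1", "day 2", "day 3"],
    "duration": ["2h 35m", "1h 50m", "55m"],
})


def clean_flights(df):
    """Apply the cleaning steps described above to a copy of the frame."""
    # Drop rows whose destination is the spurious "Nagpur" value.
    out = df[df["destination"] != "Nagpur"].copy()

    # Parse the departure time into a datetime (the date part is a placeholder).
    out["departure_time"] = pd.to_datetime(out["departure_time"], format="%H:%M")

    # Keep only the numeric part of the emissions and day columns.
    out["emissions"] = (
        out["emissions"].str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)
    )
    out["day"] = out["day"].str.extract(r"(\d+)", expand=False).astype(int)

    # Convert "2h 35m" style durations to minutes; either part may be missing.
    parts = out["duration"].str.extract(r"(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?")
    parts = parts.fillna("0").astype(int)
    out["duration_min"] = parts[0] * 60 + parts[1]

    return out.reset_index(drop=True)


clean = clean_flights(raw)
print(clean[["destination", "emissions", "day", "duration_min"]])
```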
By implementing these data cleaning steps, we ensure the flight information is standardized and ready for meaningful analysis and insights.
Exploratory Data Analysis
1. Emissions
The impact of air travel on the environment, particularly CO2 emissions, is a growing concern.
The density distribution graph shows that most flights have a carbon footprint of about 130 kg of CO2. Assuming a typical gasoline-powered car emits roughly 400 g of CO2 per mile, this is about the same as driving 325 miles (around 520 kilometers).
There is a visible correlation between flight distance and emissions: longer routes tend to emit more CO2. Accordingly, the Chennai-Bangalore route has the lowest average emissions.
2. Airline Strength
The pie chart shows that Indigo accounts for 39.36% of the flights in the dataset, followed by Air India with 27% and Vistara with 17.25%. Indigo holds about 55% of the overall domestic airline market, so its lead in the dataset reflects its dominance in the domestic market.
3. Stops
This graph shows the distribution of flights by number of stops (i.e., non-stop, 1 stop, 2 stops). About 66.5% of the flights have no layovers and 31.8% have 1 stop. The average ticket price also depends on the number of stops: a 2-stop flight costs approximately Rs 2000 more than a non-stop flight, which suggests that non-stop flights tend to be cheaper.
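The two summaries behind this chart, the share of flights per stop category and the average price per category, can be computed with a couple of pandas calls. The sketch below uses a handful of made-up rows and assumed column names (`stops`, `price`), so its numbers are illustrative only and are not the dataset's figures.

```python
import pandas as pd

# Made-up rows; the real dataset has many more flights.
flights = pd.DataFrame({
    "stops": ["Non Stop", "Non Stop", "1 Stop", "Non Stop", "2 Stops", "1 Stop"],
    "price": [4500, 5200, 6100, 4800, 7000, 5900],
})

# Share of flights in each stop category, as a percentage.
stop_share = flights["stops"].value_counts(normalize=True).mul(100).round(1)

# Average ticket price per stop category.
mean_price = flights.groupby("stops")["price"].mean()

print(stop_share)
print(mean_price)
```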
4. Airport Taxes
An airport tax is a tax levied on passengers for passing through an airport and is usually included in the price of an airline ticket. The taxes that airports charge are used to pay for the operation and maintenance of the airport. Kolkata Airport has the highest tax cost per ticket with Rs.2048 whereas Mumbai Airport has the lowest cost at Rs.1261.5.
5. Prices
This graph reflects the pricing strategies followed by different airlines. Akasa Air has the lowest median price of all the airlines and therefore offers the cheapest flights. Vistara, Indigo, Air India and several other airlines have close medians, which points to more direct price competition for passengers on similar routes. Airlines like Air India Express have high medians in this dataset but lower availability.
This is a scatter plot of flight duration against price. Apart from a few outliers, clusters can be visually differentiated based on the distribution of prices. Most flights fall in the Rs. 4000 to Rs. 8000 price range.
Statistical Tests
1. Heatmap
A heatmap represents the magnitude of individual values within a dataset as colors. It’s interesting to note that the ticket price is more strongly correlated with emissions than with the distance of travel. Journey duration and emissions are also highly correlated, since duration includes layovers.
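The numbers shown in such a heatmap are pairwise correlation coefficients. As a rough sketch, they can be computed with `DataFrame.corr`; the columns and values below are invented for illustration, and the resulting matrix could then be passed to a plotting function such as seaborn's `heatmap`.

```python
import pandas as pd

# Invented values; column names are assumptions, not the dataset's own.
flights = pd.DataFrame({
    "price": [4500, 5200, 6100, 4800, 7000, 5900],
    "duration_min": [95, 120, 210, 100, 320, 190],
    "emissions_kg": [98, 110, 150, 101, 190, 140],
})

# Pearson correlation between every pair of numeric columns.
corr = flights.corr(method="pearson")

# Features ranked by how strongly they correlate with price.
price_corr = corr["price"].drop("price").sort_values(ascending=False)

print(corr.round(2))
print(price_corr.round(2))
```

A correlation matrix only measures linear association between columns, so a high value between price and emissions does not by itself show that one drives the other.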
2. ANOVA Test
ANOVA is a statistical method used to analyze the differences between the means of two or more groups. It calculates an F-statistic by comparing between-group variability to within-group variability.
An ANOVA test was carried out on the following feature pairs for hypothesis testing. In all three cases, we could reject the null hypothesis because of a high F-statistic and a p-value < 0.05, which indicates that the features in each pair are dependent on each other:
i) Price of the ticket and Time of flight
ii) Check In baggage allowed and emissions
iii) Meal Option and Price of the ticket
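The comparison of between-group and within-group variability behind the F-statistic can be written out by hand. The sketch below is a plain-Python illustration on invented ticket prices grouped by a hypothetical meal option; it computes only the F-statistic. Obtaining the p-value additionally needs the F distribution, which a library routine such as `scipy.stats.f_oneway` provides.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F-statistic for a list of numeric groups."""
    k = len(groups)                    # number of groups
    n = sum(len(g) for g in groups)    # total number of observations
    grand_mean = mean(x for g in groups for x in g)

    # Between-group variability: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variability: spread of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Invented prices (Rs) for three hypothetical meal options.
prices_by_meal = {
    "free meal": [6200, 6500, 6100, 6400],
    "paid meal": [5200, 5000, 5400, 5100],
    "no meal": [4600, 4700, 4400, 4500],
}

f_stat = one_way_anova_f(list(prices_by_meal.values()))
print(f"F-statistic: {f_stat:.2f}")
```

A large F-statistic means the group means differ by much more than the scatter inside each group would explain, which is what leads to rejecting the null hypothesis of equal means.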
Modelling
The Random Forest algorithm is a powerful supervised machine learning technique for both classification and regression problems. It operates by combining the predictions of multiple decision trees to generate a final, more robust prediction.
Hyperparameters are settings that control the learning process of a machine learning algorithm and are not learned from the data itself. Selecting the right hyperparameters is crucial for achieving optimal performance.
Grid Search CV is a common technique used to find the best hyperparameter combination. It systematically evaluates every combination of the supplied hyperparameter values using cross-validation and selects the one with the best average score across the folds.
- The number of trees (n_estimators) in the forest affects the model’s accuracy and its computational cost. Too few trees can give unstable, less accurate predictions, while adding more trees mainly increases training time with diminishing gains in accuracy.
- The maximum depth (max_depth) of each tree determines its complexity. Deeper trees can capture more complex relationships but are also more prone to overfitting.
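A minimal sketch of tuning these two hyperparameters with scikit-learn's `GridSearchCV` is shown below. The data is synthetic and generated on the spot, the feature names are assumptions, and the grid values are arbitrary examples rather than the ones used for the project's model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the flight data (not the real dataset).
rng = np.random.default_rng(0)
n = 300
duration_min = rng.uniform(60, 360, n)
stops = rng.integers(0, 3, n)
emissions_kg = 0.5 * duration_min + rng.normal(0, 10, n)
price = 3000 + 8 * duration_min + 900 * stops + rng.normal(0, 300, n)

X = np.column_stack([duration_min, stops, emissions_kg])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.25, random_state=0
)

# Candidate values for the two hyperparameters discussed above.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [4, 8, None],  # None lets each tree grow until its leaves are pure
}

# Each combination is scored with 3-fold cross-validation on the training set.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)

best_params = search.best_params_
# R^2 of the refitted best model on data the search never saw.
test_r2 = search.best_estimator_.score(X_test, y_test)

print("Best parameters:", best_params)
print(f"Held-out R^2: {test_r2:.3f}")
```

Keeping a separate test split outside the grid search gives an estimate of performance that is not biased by the hyperparameter selection itself.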
Our model aims to forecast future flight prices based on the features available in our dataset.