Airline Data Analysis by Roshan Parajuli

A short introduction to the dataset:

The dataset contains 2019 flight data for Indian airlines, which is verified in the cells below. In its raw state, the dataset has 11 columns and 10683 rows. The number of columns will increase after encoding during preprocessing, and the number of rows will decrease once duplicate records are removed.
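
This can be checked right after loading the file. A minimal sketch, assuming the raw data sits in an Excel file with the hypothetical name Data_Train.xlsx:

```python
import pandas as pd

# Hypothetical file name; adjust to wherever the raw dataset is stored.
df = pd.read_excel("Data_Train.xlsx")

print(df.shape)   # expected: (10683, 11) -> 10683 rows, 11 columns
print(df.dtypes)  # column names with their raw data types
```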

The Airline column consists of the names of the different airlines that flew the given routes over the span of a year.

Date of Journey indicates the date the flight departed from the Source towards the Destination along the route mentioned in the other columns of the dataset.

Departure Time is the time the plane took off, stored as a time value, whereas Arrival Time contains the time as well as the date when the flight arrived the next day.
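
A common way to make these fields usable is to split them into hour and minute parts. A minimal sketch, assuming the columns are named Dep_Time and Arrival_Time (names assumed) and df is the DataFrame loaded above:

```python
import pandas as pd

# Dep_Time looks like "22:20"; Arrival_Time may carry a date, e.g. "01:10 22 Mar".
df["Dep_Hour"] = pd.to_datetime(df["Dep_Time"]).dt.hour
df["Dep_Minute"] = pd.to_datetime(df["Dep_Time"]).dt.minute
df["Arrival_Hour"] = pd.to_datetime(df["Arrival_Time"]).dt.hour
df["Arrival_Minute"] = pd.to_datetime(df["Arrival_Time"]).dt.minute

# The raw string columns can then be dropped.
df = df.drop(columns=["Dep_Time", "Arrival_Time"])
```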

The Total Stops column can be derived from the number of segments in the route the plane flew.

There are only two NaN values, and the Additional Info column carries no information in more than 50 percent of the rows. When present, the column usually contains details about in-flight meals, baggage and layovers.

Price is the final column; it indicates the price of the airline's service and can be used as the target variable for prediction.

Airline

Categorical data

Date_of_journey

Source and Destination

Routes and Stops

It can be seen that the number of total stops can be derived from the routes themselves, so one of these columns can be dropped in the data preprocessing stage.
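
One way to confirm this is to count the airports in the Route string and compare the result with Total_Stops. A minimal sketch, assuming the columns are named Route and Total_Stops and the route segments are separated by "→" (all assumptions about the raw file):

```python
# Stops = number of airports in the route minus the two endpoints.
route_stops = df["Route"].str.split("→").str.len() - 2

# Map the textual Total_Stops labels to numbers for comparison.
stops_map = {"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}
mismatches = (route_stops != df["Total_Stops"].map(stops_map)).sum()
print("Rows where Route and Total_Stops disagree:", mismatches)

# If the two agree, one of the columns is redundant and can be dropped.
df = df.drop(columns=["Route"])
```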

Additional_Info

Since 78% of the data in this column is empty or consists of "No info", which is of no use, it is safe to drop the column, as the data that does exist is not crucial.
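
A minimal sketch of how that share can be checked before dropping the column (column name assumed to be Additional_Info):

```python
# Share of rows that carry no useful information; the label "No info"
# appears with inconsistent casing in some versions of this dataset.
no_info_share = df["Additional_Info"].str.lower().eq("no info").mean()
print(f"Share of 'No info' rows: {no_info_share:.0%}")

# Drop the column once it is judged to add nothing.
df = df.drop(columns=["Additional_Info"])
```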

Departure Time, Arrival Time and Duration

Price

Removing duplicate data

Handling categorical data

The data in the Airline column is nominal, since the order of the categories does not matter, so it is handled with one-hot encoding.
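
A minimal sketch of one-hot encoding the Airline column with pandas:

```python
import pandas as pd

# One binary column per airline; drop_first removes one redundant category
# to avoid the dummy-variable trap.
airline_dummies = pd.get_dummies(df["Airline"], prefix="Airline", drop_first=True)
df = pd.concat([df.drop(columns=["Airline"]), airline_dummies], axis=1)
```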

Finding outliers

Feature selection

Here, Total_Stops is the most important feature.
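
The text does not state which method produced this ranking; a tree-based importance score is one common choice. A sketch with scikit-learn's ExtraTreesRegressor, assuming the features have already been encoded to numbers and Price is the target:

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

X = df.drop(columns=["Price"])  # encoded, numeric feature matrix (assumed)
y = df["Price"]

model = ExtraTreesRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Rank features by importance; Total_Stops is expected to come out on top.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```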

Profiling report

Data Wrangling

Data wrangling is the process of transforming data from its original "raw" form into a more readily usable format. It prepares the data for analysis and is also known as data cleaning, data remediation or data munging.

It can be either a manual or an automated process. When the dataset is immense, manual data wrangling becomes very tedious and needs to be automated.

It consists of six steps:
Step 1: Discovering
In this step, the data is understood more deeply. It is also a way of familiarizing yourself with the data so it can be passed on to the later steps. During this phase, some patterns in the data can be identified, and the issues with the dataset also become known. Values that are unnecessary, missing or incomplete are identified so they can be addressed.
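
With pandas, this familiarisation step usually comes down to a few inspection calls; a minimal sketch:

```python
# Quick look at structure, basic statistics, missing values and duplicates.
df.info()
print(df.describe(include="all"))
print(df.isnull().sum())
print(df.duplicated().sum())
```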

Step 2: Structuring
Raw data is very likely to be haphazard and unstructured, and it needs to be restructured properly. Data is rearranged to make computation as well as analysis easier.

Step 3: Cleaning
In this step, the data is cleaned for high-quality analysis. Null values have to be changed and formatting standardized. It also includes deleting empty rows and removing outliers, ensuring there are as few errors as possible.
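
A minimal sketch of this step for the airline data, assuming the column names used earlier (the Date_of_Journey format is also an assumption):

```python
import pandas as pd

# Drop the handful of rows that contain NaN values.
df = df.dropna()

# Standardise the formatting of text columns: trim whitespace, unify casing.
for col in ["Airline", "Source", "Destination"]:
    df[col] = df[col].str.strip().str.title()

# Standardise the journey date into a proper datetime type.
df["Date_of_Journey"] = pd.to_datetime(df["Date_of_Journey"], format="%d/%m/%Y")
```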

Step 4: Enriching
In this step, it is determined whether all the data necessary for the project is present. If not, the dataset is extended by merging it with another dataset or simply incorporating values from other datasets. If the data is complete, the enriching step is optional. Once new data is added from another dataset, the steps of discovering, structuring, cleaning and enriching need to be repeated.
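
A minimal sketch of enriching the flight data by merging in a second, entirely hypothetical dataset keyed on the airline name:

```python
import pandas as pd

# Hypothetical lookup table with one extra attribute per airline.
airline_ratings = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "Jet Airways"],
    "Rating": [4.1, 3.8, 3.9],  # illustrative values only
})

# A left join keeps every flight record and attaches the rating where available.
df = df.merge(airline_ratings, on="Airline", how="left")
```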

Step 5: Validating
In this step, the state of the data (its consistency and quality) is verified. If no unresolved issues are found, the data is ready to be analysed.
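
Simple consistency checks can be written as assertions; a small sketch with illustrative rules:

```python
# Illustrative validation rules; adjust to the project's own quality criteria.
assert df.isnull().sum().sum() == 0, "unexpected missing values remain"
assert df.duplicated().sum() == 0, "duplicate rows remain"
assert (df["Price"] > 0).all(), "non-positive prices found"
```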

Step 6: Publishing
It is the final step where the data that has been validated is published. It can be published in different file formats ready for analysis by the organisation or an individual.
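
A small sketch of publishing the validated data in two common formats (file names hypothetical):

```python
# Export the validated dataset for downstream analysis.
df.to_csv("airline_data_clean.csv", index=False)
df.to_excel("airline_data_clean.xlsx", index=False)
```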

Sources: https://www.trifacta.com/data-wrangling/
https://online.hbs.edu/blog/post/data-wrangling

Data Cleaning

It is a process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated or improperly formatted.

When data from multiple datasets are combined, there is a strong possibility that the same data is replicated or even mislabeled. Feeding such data into machine learning models is bound to give unexpected output.

Cleaning data is fairly simple and straightforward. There are five steps involved in cleaning data. Although the techniques may vary according to the types of data a company stores, the basic steps are common across all types of data.

Step 1: Removing duplicate and irrelevant observations
Data is collected through different sources, be it specific datasets, web scraping or any other medium. During the collection of data, some observations might be repeated due to human error. Sometimes the datasets contain more information than is actually needed. Such irrelevant and duplicate data is dealt with in this step.
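
A minimal sketch of this step in pandas:

```python
# Drop exact duplicate rows, keeping the first occurrence.
df = df.drop_duplicates(keep="first").reset_index(drop=True)

# Drop columns judged irrelevant to the analysis
# (Additional_Info is used as an example from earlier in this notebook).
df = df.drop(columns=["Additional_Info"], errors="ignore")
```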

Step 2: Fixing structural errors
Data collected might have several structural issues, including typos, incorrect data types and inconsistent naming conventions, which decrease the quality and reliability of the data.
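
A minimal sketch of the kinds of fixes involved (the column names and label corrections are illustrative):

```python
import pandas as pd

# Standardise a column name.
df = df.rename(columns={"Date_of_Journey": "date_of_journey"})

# Fix category labels broken by typos, stray spaces or inconsistent casing.
df["Airline"] = df["Airline"].str.strip().replace({"Air india": "Air India"})

# Correct data types, e.g. parse text dates into datetimes.
df["date_of_journey"] = pd.to_datetime(df["date_of_journey"], dayfirst=True)
```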

Step 3: Filtering unwanted outliers
There might be some pieces of data that don't match the rest, called outliers. They can be caused by improper data entry. However, that doesn't mean every outlier is incorrect: its validity needs to be determined, and if the outlier is proven to be irrelevant, it should be removed.
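
A minimal sketch of flagging outliers in the Price column with the interquartile-range rule (the rule itself is an assumed choice; the text does not prescribe a method):

```python
# Flag values far outside the middle 50% of prices.
q1, q3 = df["Price"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df["Price"] < lower) | (df["Price"] > upper)]
print(f"{len(outliers)} potential outliers; inspect them before removing anything.")

# Drop them only once they are judged to be invalid rather than genuine extremes.
df = df[(df["Price"] >= lower) & (df["Price"] <= upper)]
```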

Step 4: Handling missing data
Missing (NaN) fields are generally not accepted by many algorithms, so they have to be handled. There are two ways to do so. The first is dropping the entire observation; it is not optimal, but necessary in some cases. Another option is to fill in the missing data by inference from other observations; this is also not ideal, because the filled value is then based on an assumption rather than an actual observation.
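
Both options map directly onto pandas calls; a minimal sketch:

```python
# Option 1: drop every observation that contains a missing value.
df_dropped = df.dropna()

# Option 2: impute from other observations, e.g. the median price
# and the most frequent number of stops (imputation rules are illustrative).
df_filled = df.copy()
df_filled["Price"] = df_filled["Price"].fillna(df_filled["Price"].median())
df_filled["Total_Stops"] = df_filled["Total_Stops"].fillna(df_filled["Total_Stops"].mode()[0])
```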

Step 5: Validating the data
After all the above steps are performed, the data should be validated and its quality should be questioned. It should also be carefully observed whether the data brings any insights to light or not.

Sources: https://www.tableau.com/learn/articles/what-is-data-cleaning

Data Integration

It is a preprocessing method that involves merging data from different sources to form a single data store, such as a data warehouse.

Issues in Data Integration:

  1. Schema Integration and object matching
    For example: different datasets might contain the same information under different labels, which might not be visible at first glance.

  2. Redundancy
    For example: some attributes might be redundant; if a dataset contains both the date of birth and the age, age is redundant because it can simply be derived from the date of birth (see the sketch after this list).

  3. Detection and resolution of data value conflicts
    For example: if a column in one dataset contains prices in NPR and another dataset contains prices in USD, the values will conflict unless the NPR figures are properly converted to USD.
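
A small sketch of the redundancy case from item 2, using a hypothetical table that stores both date of birth and age:

```python
import pandas as pd

# Hypothetical merged dataset with both attributes.
people = pd.DataFrame({
    "date_of_birth": pd.to_datetime(["1990-05-01", "1985-11-23", "2000-02-14"]),
    "age": [35, 39, 25],  # illustrative values
})

# Age can be recomputed from date of birth, so keeping both is redundant.
today = pd.Timestamp("2025-06-01")
derived_age = (today - people["date_of_birth"]).dt.days // 365

print((derived_age == people["age"]).all())  # True -> "age" can be dropped
people = people.drop(columns=["age"])
```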

Sources: https://www.youtube.com/watch?v=UKUq7hZdZUw

Data Reduction

The amount of data the world is generating right now is tremendous. Data reduction, or dimensionality reduction, reduces the p dimensions of the data to a subset of k dimensions where k << p. The outcomes are less computation/training time, less storage space, better performance for most algorithms, and the removal of redundant features, which helps with multicollinearity, among other benefits. There are many techniques for this, including the missing value ratio, low variance filter, high correlation filter, random forest, backward feature elimination, forward feature selection, factor analysis, Principal Component Analysis, Independent Component Analysis and so on.
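
Among the listed techniques, Principal Component Analysis is perhaps the most common; a minimal sketch with scikit-learn, applied to the numeric columns of the flight data (the 95% variance threshold is an assumed choice):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Numeric feature matrix with p columns; Price is kept aside as the target.
X = df.select_dtypes("number").drop(columns=["Price"])

# PCA is scale-sensitive, so standardise the features first.
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain 95% of the variance (k << p).
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X.shape[1], "original features ->", X_reduced.shape[1], "components")
```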

Sources: https://www.analyticsvidhya.com/blog/2018/08/dimensionality-reduction-techniques-python/

Data Transformation

Data transformation is the process of changing the format, structure, or values of data. Thanks to data transformation, the data becomes easier for both humans and computers to use, data quality is greatly improved, and compatibility is facilitated between applications, systems, and types of data. However, there are a few challenges associated with it: the transformation can be expensive in terms of computing resources and licensing, a lack of expertise can cause more problems than it solves, and the process can be resource-intensive.
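
A small sketch of a typical value transformation, assuming a skewed numeric column such as Price:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A log transform compresses the long right tail of a skewed price distribution.
df["Price_log"] = np.log1p(df["Price"])

# Min-max scaling rescales values to [0, 1] for magnitude-sensitive algorithms.
df["Price_scaled"] = MinMaxScaler().fit_transform(df[["Price"]]).ravel()
```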

Sources: https://www.stitchdata.com/resources/data-transformation/