Airline Data Analysis by Roshan Parajuli

Short introduction about dataset:

The dataset contains flight data for Indian airlines from 2019, which is verified in the cells below. In its raw state, the dataset has 11 columns and 10683 rows. The number of columns will increase after encoding during preprocessing, and the number of rows will decrease once duplicate rows are removed.

The Airline column contains the names of the airlines that flew the given routes over the span of a year.

Date of Journey indicates the date on which the plane flew from the Source to the Destination along the route given in the other columns of the dataset.

The departure time is the time at which the plane took off and contains only a time, whereas the arrival time contains the time and, if the plane arrived the next day, the date as well.
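A minimal sketch of how these two columns could be split into numeric features. The column names (`Dep_Time`, `Arrival_Time`), the string formats, and the sample rows are assumptions made for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; column names and string formats are assumptions.
df = pd.DataFrame({
    "Dep_Time": ["22:20", "05:50"],
    "Arrival_Time": ["01:10 22 Mar", "13:15"],
})

# Departure time: parse "HH:MM" and keep hour and minute as integers.
dep = pd.to_datetime(df["Dep_Time"], format="%H:%M")
df["Dep_Hour"] = dep.dt.hour
df["Dep_Minute"] = dep.dt.minute

# Arrival time: the first whitespace-separated token is the time.
arr = pd.to_datetime(df["Arrival_Time"].str.split().str[0], format="%H:%M")
df["Arrival_Hour"] = arr.dt.hour
df["Arrival_Minute"] = arr.dt.minute

# A trailing date (anything after a space) marks an arrival on a later day.
df["Arrives_Later_Day"] = df["Arrival_Time"].str.strip().str.contains(" ")

print(df)
```

Splitting the time into hour and minute keeps the features numeric, and the boolean flag preserves the information carried by the optional date.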

The total stops column can be derived from the route column, since the stops the plane made are listed in the route itself.

There are only two NaN values in the dataset, and the additional info column carries no information in more than 50 percent of the rows. When it is filled in, the column usually contains details about meal service, baggage, and layovers.

Price is the final column. It gives the price charged by the airline for the flight and can be used as the target variable for prediction.

Airline

Categorical data

Date_of_journey

Source and Destination

Routes and Stops

It can be seen that the number of total stops can be derived from the route itself. One of these two columns can therefore be dropped in the data preprocessing stage.
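The redundancy can be checked with a sketch like the one below. It assumes that routes are written as airport codes separated by "→" and that the stops column uses labels such as "non-stop" and "2 stops"; both formats, the column names, and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows; the "→" separator and the stop labels are assumptions.
df = pd.DataFrame({
    "Route": ["BLR → DEL", "CCU → IXR → BBI → BLR", "DEL → BOM → COK"],
    "Total_Stops": ["non-stop", "2 stops", "1 stop"],
})

# Stops = airports on the route minus the origin and the destination.
stops_from_route = df["Route"].str.split("→").str.len() - 2

# Turn the text labels into integers so the two columns can be compared.
stops_from_label = (
    df["Total_Stops"]
    .replace("non-stop", "0 stops")
    .str.extract(r"(\d+)", expand=False)
    .astype(int)
)

# True when every row agrees, meaning one of the two columns is redundant.
redundant = bool((stops_from_route == stops_from_label).all())
print(redundant)
```

If the check holds on the full data, keeping the numeric stop count and dropping the route string is usually the simpler choice for modelling.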

Additional_Info

Since 78% of the data in this column is either empty or consists of "No info", which is of no use, it is safe to drop the column, as the values that do exist are not crucial.
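A sketch of how the uninformative share could be measured before dropping the column. The sample rows and the `Price` values are hypothetical, and treating differently cased spellings of "No info" as the same value is an assumption.

```python
import pandas as pd

# Hypothetical rows; the real column has many more values.
df = pd.DataFrame({
    "Additional_Info": ["No info", "In-flight meal not included",
                        "No info", None, "No Info"],
    "Price": [3897, 7662, 13882, 6218, 4500],
})

# Normalise whitespace and case so spelling variants are counted together.
info = df["Additional_Info"].str.strip().str.lower()

# A row is uninformative when it is missing or says "no info".
uninformative = info.isna() | (info == "no info")
share = uninformative.mean()
print(f"Uninformative share: {share:.0%}")

# Drop the column once the share is judged too high to be useful.
df = df.drop(columns=["Additional_Info"])
```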

Departure Time, Arrival Time and Duration

Price

Removing duplicate data
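As a minimal sketch, exact duplicate rows can be counted and removed with pandas as shown below. The sample rows and column names are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the first and third rows are exact duplicates.
df = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "IndiGo"],
    "Source": ["Banglore", "Kolkata", "Banglore"],
    "Price": [3897, 7662, 3897],
})

rows_before = len(df)

# duplicated() flags every repeat of an earlier row.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)

print(rows_before, n_duplicates, len(df))
```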

Handling categorical data

The data in the airline column is nominal, since the order of the categories does not matter, so it is handled with one-hot encoding.
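One way to apply one-hot encoding is `pandas.get_dummies`, sketched below. The airline names and prices are hypothetical sample values, and the notebook may use a different encoder.

```python
import pandas as pd

# Hypothetical rows with a nominal Airline column.
df = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "Jet Airways", "IndiGo"],
    "Price": [3897, 7662, 13882, 4500],
})

# Each airline becomes its own 0/1 column; the original column is removed.
encoded = pd.get_dummies(df, columns=["Airline"], prefix="Airline", dtype=int)

print(encoded.columns.tolist())
```

Each distinct airline adds one column, which is why the column count grows after this step.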

Finding outliers

Feature selection