Introduction
As an out-of-state college student, I have spent a lot of time in airports and a lot of time flying back and forth across the country. I grew up in Ohio and the fastest way for me to visit my family is by plane. I want to investigate 2022 statistics for the most popular airports in the United States.
I created an extensive table for the top 200 ranked airports in the United States. This data was available for download from bts.gov and faa.gov.
I began working with the ranked-list CSV file, which included the rank, airport name, and number of enplaned passengers that year. Enplaned passengers are the passengers who boarded airplanes at a specific airport.
Eventually, I added ICAO codes, IATA codes, official airport names, state, elevation, coordinate data, and statistics for each airport. The resulting CSV file has 200 rows and 15 columns.
File Link
In the file DataCleaning.ipynb, I cleaned and organized the data into a .csv file.
Process
Necessary Packages
import pandas as pd
import numpy as np
import requests
import re
import json
Starting Point
The first file I read in was AirportRankings2022.xlsx.
rankings
- U.S. Airports, ranked by 2022 Systemwide* Scheduled Enplanements Passenger numbers in millions (000,000)
- Rankings source
rankings = pd.read_excel(partialpath+'AirportRankings2022.xlsx')
From this table, I decided to only keep the first three columns:
2022 Rank, Airport, & 2022 Enplaned Passengers.
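Keeping only the first three columns can be done with positional indexing. The sketch below uses a toy stand-in for the rankings sheet; the extra column names and all values are illustrative, not from the actual file.

```python
import pandas as pd

# Toy stand-in for the rankings sheet; the last two columns are
# hypothetical examples of columns that get dropped.
rankings = pd.DataFrame({
    '2022 Rank': [1, 2],
    'Airport': ['Atlanta GA: Hartsfield-Jackson Atlanta International',
                'Dallas/Fort Worth TX: Dallas/Fort Worth International'],
    '2022 Enplaned Passengers': [45.4, 35.3],
    '2021 Rank': [1, 2],            # dropped
    'Percent Change': [0.23, 0.18]  # dropped
})

# Keep only the first three columns by position.
rankings = rankings.iloc[:, :3]
```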
Adding IATA Codes and Cleaning
Next, I needed to add IATA codes for each airport, which would let me use an API later to fetch more information for the table.
IATA codes are three-letter location identifiers for airports. They are used specifically for "airline timetables, reservations, and baggage tags". For example, Salt Lake City International Airport has the IATA code SLC.
airportLOCID
- data sourced from faa.gov
I used airportLOCID IATA codes and Airport Names to replace the Airport column in my table with IATA codes. This was all done in a for-loop.
rankedAirports = rankings['Airport']
rankedAirportNames = rankedAirports.str.extract(r':\s(.*)$')
rankedLOCID = []
for i in range(len(rankedAirportNames)):
    if rankedAirportNames[0][i] == 'Tri Cities WA':
        rankedLOCID += ['TRI']  # I ended up having to hard code this to fix an error.
    else:
        try:
            rankedLOCID += [cleanLOCID.loc[cleanLOCID['Airport'] == rankedAirportNames[0][i], 'LOCID'].iloc[0]]
        except IndexError:
            rankedLOCID += [rankedAirportNames[0][i]]
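The regex in the loop above captures everything after the "City ST: " prefix. A minimal demonstration, using made-up entries in the same "City ST: Official Name" format:

```python
import pandas as pd

# Two sample entries in the rankings format.
rankedAirports = pd.Series(['Atlanta GA: Hartsfield-Jackson Atlanta International',
                            'Denver CO: Denver International'])

# str.extract with one capture group returns a DataFrame whose
# column 0 holds the captured text (the official airport name).
rankedAirportNames = rankedAirports.str.extract(r':\s(.*)$')
```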
Then, I referenced data from rankings, originatingPassengers, & airportPassengersBoarded to create lists to use as columns.
rank = rankings['2022 Rank']
conversionfactor = 1000000
numPassengers = rankings['2022 Enplaned Passengers']*conversionfactor
Missing IATA Codes
At this point, the rankedLOCID list was still missing IATA codes for some of the airports. In the previous for-loop, I used try: and except IndexError: so that if an IATA code could not be found, I kept the airport name, which marked the rows I still needed to replace another way.
I used the airportLOCID data first, but it did not match every airport perfectly. Some airports are known by multiple names, which caused one of the issues in the previous for-loop. IATA codes are well known and easily accessible, so instead of manually searching and replacing these values, I used ChatGPT to create a text file of the missing codes and replaced the remaining airport names with a similar for-loop.
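The second replacement pass might look like the sketch below. The file name and its "Airport Name,IATA" line format are assumptions, and the entries shown are stand-ins for the actual missing codes.

```python
# Build a name -> IATA lookup from the text file of missing codes.
# (File format assumed: one "Airport Name,IATA" pair per line; the
# list literal stands in for reading the actual file.)
fileLines = ["Dallas Love Field,DAL", "Chicago Midway International,MDW"]
missingCodes = {}
for line in fileLines:
    name, code = line.strip().split(',')
    missingCodes[name] = code

# Entries still holding an airport name get swapped for their code;
# entries that are already IATA codes pass through unchanged.
rankedLOCID = ['ATL', 'Dallas Love Field', 'Chicago Midway International']
cleanedRankedLOCID = []
for entry in rankedLOCID:
    cleanedRankedLOCID.append(missingCodes.get(entry, entry))
```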
Concat Lists for Table
Next, I combined rank, cleanedRankedLOCID (IATA values), and numPassengers into a DataFrame.
cleanRank = pd.concat([rank, pd.DataFrame(cleanedRankedLOCID), numPassengers], axis=1)
cleanRank.rename(columns={0: 'Airport'}, inplace=True)
I named the column with IATA codes Airport, but I eventually changed it to IATA for clarity.
New Columns
I wanted to add more columns, so I used the DataFrame: originatingPassengers.
OriginatingPassengers
- U.S. Airports ranked by 2022 Originating Domestic Passengers
- Source: Bureau of Transportation Statistics, Origin & Destination Survey
- DB1B Ticket, Based on 10 Percent Ticket Sample
- O&D numbers are not comparable to T-100 Market Enplanement numbers
I noticed that originatingPassengers has a Code column that matches up with my Airport column (IATA codes). I used a for-loop to create a list of Originating Domestic Passengers for each airport's IATA code. Then I added the new column to my DataFrame and renamed it.
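The lookup loop might look like this sketch. The DataFrames here are toy stand-ins with made-up passenger counts; the real columns come from the BTS tables described above.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are illustrative).
originatingPassengers = pd.DataFrame({
    'Code': ['ATL', 'DFW'],
    'Originating Domestic Passengers': [18_000_000, 14_000_000]})
cleanRank = pd.DataFrame({'Airport': ['DFW', 'ATL']})

# For each ranked airport's IATA code, pull the matching
# originating-passengers figure (None if no match is found).
ogDomPassengers = []
for iata in cleanRank['Airport']:
    match = originatingPassengers.loc[
        originatingPassengers['Code'] == iata,
        'Originating Domestic Passengers']
    ogDomPassengers.append(match.iloc[0] if len(match) else None)
```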
cleanedData = pd.concat([cleanRank, pd.DataFrame(ogDomPassengers)], axis=1)
cleanedData.rename(columns={0: '2022 Originating Domestic Passengers'}, inplace=True)
First API Request
At this step, my table cleanedData had 4 columns: 2022 Rank, Airport, 2022 Enplaned Passengers, & 2022 Originating Domestic Passengers.
I wanted to access coordinates and airport specs from the airportdb.io API, but was not able to because I did not have a column for ICAO codes.
ICAO codes are another form of location identification for airports. ICAO codes are specific to the region and country of an airport's location. They are distinct and different from IATA codes. The second API I used needed ICAO codes in order to make a valid request.
After reading my API key into a variable called apiKey, I used it to request all the ICAO codes.
I encountered a couple of errors in trying to do this because of issues with my IATA codes. For example, Glacier Park International Airport uses both GPI and FCA.
To remedy this easily, I hard coded for the airports that kept having issues.
base_url = "https://aviation-reference-data.p.rapidapi.com/airports/"
endpoints = cleanedData['Airport']
cleanedICAO = [['IATA', 'ICAO']]
for i in range(len(endpoints)):
    if endpoints[i].strip() == 'GPI':    # Glacier Park International
        icao = 'KGPI'
        iata = 'GPI'
    elif endpoints[i].strip() == 'VPS':  # Eglin AFB Destin Fort Walton Beach
        icao = 'KVPS'
        iata = 'VPS'
    else:
        url = base_url + endpoints[i].strip()
        headers = {"X-RapidAPI-Key": apiKey,
                   "X-RapidAPI-Host": "aviation-reference-data.p.rapidapi.com"}
        r = requests.get(url, headers=headers)
        icao = r.json()['icaoCode']
        iata = r.json()['iataCode']
    cleanedICAO += [[iata, icao]]
I cleaned up the DataFrame of ICAO and IATA codes, then used pandas merge with my main DataFrame, cleanedData.
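That cleanup and merge might be sketched as below. The list of pairs and the toy cleanedData frame are stand-ins; the real cleanedICAO list is built by the API loop above.

```python
import pandas as pd

# cleanedICAO is a list of [IATA, ICAO] pairs whose first entry is
# the header row, so split that off as the column names.
cleanedICAO = [['IATA', 'ICAO'], ['SLC', 'KSLC'], ['ATL', 'KATL']]
icaoDF = pd.DataFrame(cleanedICAO[1:], columns=cleanedICAO[0])

# Toy main table: the Airport column still holds IATA codes here.
cleanedData = pd.DataFrame({'Airport': ['ATL', 'SLC'], '2022 Rank': [1, 25]})

# Left-merge so every ranked airport keeps its row even if a code
# were missing from icaoDF.
cleanedData = pd.merge(cleanedData, icaoDF, left_on='Airport',
                       right_on='IATA', how='left')
```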
Fixing Errors
In reviewing my table, I realized there were some issues I needed to take care of.
- No matter how hard I tried, I could not fix the GPI/FCA Glacier vs. Guapi Airport issue when using the API request.
- Like I mentioned above, Airport GPI is Glacier Park International Airport, which has 2 codes: GPI and FCA.
- In this step, I realized I needed to update the IATA from GPI to FCA in order for the merge later in my code with airportStats to work correctly.
- I hard coded the IATA to be GPI before I updated it here, but the more efficient way would have been to hard code FCA in the previous step. The information will all be correct, regardless.
cleanedData['Airport'] = cleanedData['Airport'].replace('GPI', 'FCA')
cleanedData.loc[cleanedData['Airport'] == 'FCA', 'ICAO'] = 'KGPI'
Second API Request
Next, I used a second API to request Longitude, Latitude, Elevation, and State columns.
After reading the API token into a variable called apiToken, I made my request for all 200 airports.
I created a new DataFrame called airportDF to hold all the information from this API request.
url_template = "https://airportdb.io/api/v1/airport/{ICAO}?apiToken={apiToken}"
icao = cleanedData['ICAO']
airportDF = pd.DataFrame()
for i in range(len(icao)):
    url = url_template.format(ICAO=icao[i], apiToken=apiToken)
    r = requests.get(url)
    if i == 0:
        airportDF = pd.json_normalize(r.json())
    else:
        airportDF = pd.concat([airportDF, pd.json_normalize(r.json())], axis=0)
Merge on airportDF
After I created airportDF from the API request, I merged the new columns into my main DataFrame to create cleanDF, then reset the index.
cleanDF = pd.merge(cleanedData, airportDF[['iata_code', 'icao_code', 'name', 'latitude_deg', 'longitude_deg', 'elevation_ft', 'region.name']], left_on='ICAO', right_on='icao_code', how='left')
cleanDF.reset_index(drop=True, inplace=True)
Cleaning up
Next, I cleaned up my main DataFrame, cleanDF. Since I merged on ICAO code, I had two columns of the same information. I dropped the duplicate columns and renamed the remaining columns.
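A sketch of that cleanup, using a one-row toy frame with the column names produced by the merge (the rename targets follow the final column names used later in the post, which is an assumption about the exact mapping):

```python
import pandas as pd

# Toy post-merge frame: iata_code/icao_code duplicate IATA/ICAO.
cleanDF = pd.DataFrame({
    'IATA': ['SLC'], 'ICAO': ['KSLC'],
    'iata_code': ['SLC'], 'icao_code': ['KSLC'],
    'name': ['Salt Lake City International Airport'],
    'latitude_deg': [40.79], 'longitude_deg': [-111.98],
    'elevation_ft': [4227], 'region.name': ['Utah']})

# Drop the duplicated code columns, then rename the airportDB
# columns to the final names used in the exported table.
cleanDF = cleanDF.drop(columns=['iata_code', 'icao_code'])
cleanDF = cleanDF.rename(columns={'name': 'Airport',
                                  'latitude_deg': 'Latitude',
                                  'longitude_deg': 'Longitude',
                                  'elevation_ft': 'Elevation',
                                  'region.name': 'State'})
```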
I wanted to add more interesting information about each airport. After searching online for available data, I found this website: https://www.stratosjets.com/blog/us-airport-rankings/.
I was able to access a table that included information for 2022 U.S. airports, including…
- Flights Per Year
- Delays %
- Canceled %
- Avg. Delay (Mins)
- Security Delay (Mins)
I copied the data into a .txt file named AirportStats.txt and merged it with cleanDF to create a new DataFrame, finalDF.
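Reading the copied table back in might look like this. The tab delimiter is an assumption about how the data pasted, and the single row of figures is illustrative, not the real ATL statistics.

```python
import pandas as pd
from io import StringIO

# Stand-in for AirportStats.txt (assumed tab-separated; values
# are illustrative only).
sample = ("Code\tFlights Per Year\tDelays %\tCanceled %\t"
          "Avg. Delay (Mins)\tSecurity Delay (Mins)\n"
          "ATL\t707661\t19.0\t2.3\t55.0\t0.2\n")

# In the notebook this would be pd.read_csv(partialpath+'AirportStats.txt', sep='\t').
airportStats = pd.read_csv(StringIO(sample), sep='\t')
```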
finalDF = pd.merge(cleanDF, airportStats[['Code', 'Flights Per Year', 'Delays %', 'Canceled %', 'Avg. Delay (Mins)', 'Security Delay (Mins)']], left_on='IATA', right_on='Code', how='left')
More Cleaning
I did some cleaning and organization on the finalDF DataFrame, including:
- dropping duplicate columns
- converting percentages to proportions
- renaming and reordering the columns
finalDF = finalDF.drop('Code', axis=1)
finalDF['Delays %'] = finalDF['Delays %'] / 100
finalDF['Canceled %'] = finalDF['Canceled %'] / 100
columnOrder = ['2022 Rank', 'Airport', 'IATA', 'ICAO', '2022 Enplaned Passengers',
               '2022 Originating Domestic Passengers', 'Flights Per Year', 'Delays %',
               'Canceled %', 'Avg. Delay (Mins)', 'Security Delay (Mins)', 'Latitude',
               'Longitude', 'State', 'Elevation']
finalDF = finalDF[columnOrder]
finalDF.rename(columns={'Delays %': 'Delays', 'Canceled %': 'Canceled'}, inplace=True)
Duplicate Rows
Upon previewing the finalDF DataFrame, I realized it was 203 rows long, three more than the expected 200.
I removed the duplicates in this step.
finalDF.drop_duplicates(subset=['IATA'], keep='first', inplace=True, ignore_index=True)
Export
Finally, I exported my DataFrame to a .csv file for data visualization.
finalDF.to_csv('2022Airports.csv', index=False)
Note: I use the variable partialpath to refer to the folder path location on my local machine. All of these files can be viewed at my Project Repo.
Check out my EDA in this blog post.