This page was generated from getting_started/use_case_1_airline_connectedness.ipynb.

Use Case Tutorial 1: Well-Connected US Regions

This is a tutorial on how to find the most well-connected regions of the U.S. via air travel.

The U.S. Bureau of Transportation Statistics provides data on monthly air travel from all certificated U.S. air carriers and makes it available here. The 2018 air travel data used for this tutorial can be downloaded here. We chose 2018 data to avoid any impact COVID-19 might’ve had on travel.

We will utilize this data to determine which areas in the U.S. are most well-connected using betweenness centrality.

Data Preprocessing

Before looking at the data, we’ll need to import some libraries.

import metagraph as mg
import pandas as pd

Let’s see what the data looks like.

RAW_DATA_CSV = './data/airtravel/raw_data.csv' # https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=258
raw_data_df = pd.read_csv(RAW_DATA_CSV)
raw_data_df.head()

	FREIGHT	DISTANCE	UNIQUE_CARRIER	AIRLINE_ID	UNIQUE_CARRIER_NAME	ORIGIN_AIRPORT_ID	ORIGIN_AIRPORT_SEQ_ID	ORIGIN_CITY_MARKET_ID	...	DEST_AIRPORT_SEQ_ID	DEST_CITY_MARKET_ID	DEST	DEST_CITY_NAME	DEST_STATE_ABR	DEST_STATE_FIPS	DEST_STATE_NM	DEST_WAC	MONTH	Unnamed: 26
0	410.0	616.0	WN	19393.0	Southwest Airlines Co.	13851	1385103	33851	...	1069302	30693	BNA	Nashville, TN	TN	47	Tennessee	54	6	NaN
1	184.0	2592.0	WN	19393.0	Southwest Airlines Co.	14307	1430705	30721	...	1289208	32575	LAX	Los Angeles, CA	CA	6	California	91	6	NaN
2	87.0	2445.0	WN	19393.0	Southwest Airlines Co.	14679	1467903	33570	...	1025702	30257	ALB	Albany, NY	NY	36	New York	22	6	NaN
3	10.0	432.0	WN	19393.0	Southwest Airlines Co.	14730	1473003	33044	...	1299206	32600	LIT	Little Rock, AR	AR	5	Arkansas	71	6	NaN
4	100.0	129.0	WN	19393.0	Southwest Airlines Co.	14747	1474703	30559	...	1405702	34057	PDX	Portland, OR	OR	41	Oregon	92	6	NaN

5 rows × 27 columns

A city market is a region that one or more airports serve. For example, New York City has many airports (and it’s sometimes cheaper to fly into and out of different ones), but all of its airports serve the same region / city market.

Since we’re mostly concerned with where passengers will end up going (and not which airport they choose), we will view city markets as the regions of interest.
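To make the idea of collapsing airports into city markets concrete, here is a minimal standard-library sketch. The airport-to-market IDs are the ones that appear in the results table near the end of this tutorial; the flight records and passenger counts are made up for illustration, and the variable names are not part of the tutorial's code.

```python
from collections import defaultdict

# Airport code -> city market ID (IDs as shown in the tutorial's results table).
AIRPORT_TO_MARKET = {
    "JFK": 31703, "LGA": 31703, "EWR": 31703,  # New York city market
    "LAX": 32575, "BUR": 32575,                # Los Angeles city market
}

# Hypothetical (origin airport, destination airport, passengers) records.
flights = [
    ("JFK", "LAX", 300),
    ("EWR", "LAX", 200),
    ("LGA", "BUR", 50),
    ("LAX", "JFK", 280),
]

# Sum passengers per (origin market, destination market) pair,
# ignoring which specific airport was used.
market_flow = defaultdict(int)
for origin, dest, passengers in flights:
    key = (AIRPORT_TO_MARKET[origin], AIRPORT_TO_MARKET[dest])
    market_flow[key] += passengers

print(dict(market_flow))
```

Three different New York airports end up contributing to a single New York to Los Angeles edge, which is the view of the data the rest of the tutorial works with.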

We will define a region as being well-connected if many people travel in and out of it.

Let’s filter out the columns we don’t need for finding well-connected regions, along with any flight paths that carried zero passengers (these are usually cargo flights).

RELEVANT_COLUMNS = [
    'PASSENGERS',
    'ORIGIN_AIRPORT_ID', 'ORIGIN_AIRPORT_SEQ_ID', 'ORIGIN_CITY_MARKET_ID', 'ORIGIN', 'ORIGIN_CITY_NAME', 'ORIGIN_STATE_ABR', 'ORIGIN_STATE_NM',
    'DEST_AIRPORT_ID',   'DEST_AIRPORT_SEQ_ID',   'DEST_CITY_MARKET_ID',   'DEST',   'DEST_CITY_NAME',   'DEST_STATE_ABR',   'DEST_STATE_NM',
]
relevant_df = raw_data_df[RELEVANT_COLUMNS]
relevant_df = relevant_df[relevant_df.PASSENGERS != 0.0]
relevant_df.head()

	PASSENGERS	ORIGIN_AIRPORT_ID	ORIGIN_AIRPORT_SEQ_ID	ORIGIN_CITY_MARKET_ID	ORIGIN	ORIGIN_CITY_NAME	ORIGIN_STATE_ABR	ORIGIN_STATE_NM	DEST_AIRPORT_ID	DEST_AIRPORT_SEQ_ID	DEST_CITY_MARKET_ID	DEST	DEST_CITY_NAME	DEST_STATE_ABR	DEST_STATE_NM
44447	1.0	12523	1252306	32523	JNU	Juneau, AK	AK	Alaska	11545	1154501	31545	ELV	Elfin Cove, AK	AK	Alaska
44448	1.0	12523	1252306	32523	JNU	Juneau, AK	AK	Alaska	11619	1161902	31619	EXI	Excursion Inlet, AK	AK	Alaska
44449	1.0	12610	1261001	32610	KAE	Kake, AK	AK	Alaska	10204	1020401	30204	AGN	Angoon, AK	AK	Alaska
44450	1.0	11298	1129806	30194	DFW	Dallas/Fort Worth, TX	TX	Texas	11292	1129202	30325	DEN	Denver, CO	CO	Colorado
44451	1.0	15991	1599102	35991	YAK	Yakutat, AK	AK	Alaska	14828	1482805	34828	SIT	Sitka, AK	AK	Alaska

We’ll want to have our data in an edge list format where the city markets are the nodes so that we can import this data into metagraph.

We’ll use betweenness centrality to measure connectedness, since it counts how many shortest paths pass through a node. For this to serve our goal, routes carrying more passengers need to have lower total weight, so that the shortest paths follow the busiest routes. More elegant metrics might be considered in practice, but for simplicity we’ll use 1/number_of_passengers as the edge weight in this tutorial.
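A small worked example may help show why inverse passenger counts make busy routes "short". The sketch below uses three made-up regions and hypothetical passenger counts, and a plain Dijkstra search (the helper name `shortest_path_cost` is ours, not metagraph's).

```python
import heapq

def shortest_path_cost(edges, source, target):
    """Return the minimum total weight from source to target (Dijkstra)."""
    graph = {}
    for (origin, dest), weight in edges.items():
        graph.setdefault(origin, []).append((dest, weight))
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return float("inf")

# Hypothetical passenger counts between made-up regions A, B and C.
passengers = {("A", "C"): 10, ("A", "B"): 1000, ("B", "C"): 1000}
inverse_weights = {route: 1 / count for route, count in passengers.items()}

direct_cost = inverse_weights[("A", "C")]                               # 1/10
via_b_cost = inverse_weights[("A", "B")] + inverse_weights[("B", "C")]  # 1/1000 + 1/1000
cheapest = shortest_path_cost(inverse_weights, "A", "C")

print(direct_cost, via_b_cost, cheapest)
```

The lightly used direct route costs 0.1, while the two busy legs through B cost 0.002 in total, so the shortest path from A to C runs through B. That is the behaviour we want: heavily travelled regions sit on many shortest paths.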

We’ll create an edge list with such weights using pandas.

passenger_flow_df = relevant_df[['ORIGIN_CITY_MARKET_ID', 'DEST_CITY_MARKET_ID', 'PASSENGERS']]
passenger_flow_df = passenger_flow_df.groupby(['ORIGIN_CITY_MARKET_ID', 'DEST_CITY_MARKET_ID']) \
                        .PASSENGERS.sum() \
                        .reset_index()
passenger_flow_df['INVERSE_PASSENGER_COUNT'] = 1 / passenger_flow_df.PASSENGERS
assert len(passenger_flow_df[passenger_flow_df.INVERSE_PASSENGER_COUNT != passenger_flow_df.INVERSE_PASSENGER_COUNT]) == 0, "Edge list has NaN weights."
passenger_flow_df.head()

	ORIGIN_CITY_MARKET_ID	DEST_CITY_MARKET_ID	PASSENGERS	INVERSE_PASSENGER_COUNT
0	30005	30349	4.0	0.250000
1	30005	31214	10.0	0.100000
2	30005	31517	193.0	0.005181
3	30005	35731	7.0	0.142857
4	30006	30056	5.0	0.200000
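The assertion in the cell above checks for NaN weights by comparing the column with itself. This works because NaN is the only floating-point value that is not equal to itself. It also matters here because a NaN passenger count would have passed the earlier `PASSENGERS != 0.0` filter. A minimal standard-library illustration (the sample values are made up):

```python
import math

values = [0.25, math.nan, 0.1]

# NaN is the only float for which `v != v` is True.
nan_by_self_comparison = [v != v for v in values]
nan_by_isnan = [math.isnan(v) for v in values]

# NaN compares unequal to everything, including 0.0,
# so a "!= 0.0" filter keeps NaN rows rather than dropping them.
survives_zero_filter = math.nan != 0.0

print(nan_by_self_comparison, nan_by_isnan, survives_zero_filter)
```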

The data identifies city markets by ID rather than by name, because a single market can contain multiple cities. It’d therefore be useful to build a mapping from city market IDs to city names and airports so that we can contextualize our findings.

origin_city_market_id_info_df = relevant_df[['ORIGIN_CITY_MARKET_ID', 'ORIGIN', 'ORIGIN_CITY_NAME']] \
                                    .rename(columns={'ORIGIN_CITY_MARKET_ID': 'CITY_MARKET_ID',
                                                     'ORIGIN': 'AIRPORT',
                                                     'ORIGIN_CITY_NAME': 'CITY_NAME'})
dest_city_market_id_info_df = relevant_df[['DEST_CITY_MARKET_ID', 'DEST', 'DEST_CITY_NAME']] \
                                    .rename(columns={'DEST_CITY_MARKET_ID': 'CITY_MARKET_ID',
                                                     'DEST': 'AIRPORT',
                                                     'DEST_CITY_NAME': 'CITY_NAME'})
city_market_id_info_df = pd.concat([origin_city_market_id_info_df, dest_city_market_id_info_df])
city_market_id_info_df = city_market_id_info_df.groupby('CITY_MARKET_ID').agg({'AIRPORT': set, 'CITY_NAME': set})
city_market_id_info_df.head()

	AIRPORT	CITY_NAME
CITY_MARKET_ID
30005	{05A}	{Little Squaw, AK}
30006	{06A}	{Kizhuyak, AK}
30007	{KLW}	{Klawock, AK}
30009	{HOM, 09A}	{Homer, AK}
30010	{1B1}	{Hudson, NY}

Which region is travelled through the most?

We’re going to determine which region is travelled through the most using betweenness centrality, as it measures exactly that. Other centrality algorithms are available, but we’ll use only betweenness centrality in this tutorial.

We’ll first create a metagraph graph for the data.

passenger_flow_edge_map = mg.wrappers.EdgeMap.PandasEdgeMap(passenger_flow_df,
                                                           'ORIGIN_CITY_MARKET_ID',
                                                           'DEST_CITY_MARKET_ID',
                                                           'INVERSE_PASSENGER_COUNT',
                                                           is_directed=True)
passenger_flow_graph = mg.algos.util.graph.build(passenger_flow_edge_map)

Note that we use the inverse passenger count as the weights to ensure that the shortest paths are the paths that have the most passengers.

Let’s calculate the betweenness centrality.

betweenness_centrality = mg.algos.centrality.betweenness(passenger_flow_graph, normalize=False)
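To make concrete what an unnormalized betweenness score counts, here is a minimal pure-Python sketch of Brandes' algorithm for directed graphs with positive weights. It is an illustration only and is not metagraph's implementation; the function name, the made-up regions A, B and C, and their passenger counts are all assumptions for the example.

```python
import heapq
import math
from collections import defaultdict

def betweenness(edges):
    """Unnormalized betweenness for a directed graph with positive weights.

    `edges` maps (origin, dest) -> weight. Returns {node: score}.
    """
    graph = defaultdict(list)
    nodes = set()
    for (origin, dest), weight in edges.items():
        graph[origin].append((dest, weight))
        nodes.update((origin, dest))

    scores = dict.fromkeys(nodes, 0.0)
    for source in nodes:
        # 1. Dijkstra: shortest distance from source to each reachable node.
        dist = {}
        queue = [(0.0, source)]
        while queue:
            d, node = heapq.heappop(queue)
            if node in dist:
                continue
            dist[node] = d
            for neighbor, weight in graph[node]:
                if neighbor not in dist:
                    heapq.heappush(queue, (d + weight, neighbor))

        # 2. Count shortest paths (sigma) and record predecessors, visiting
        #    nodes nearest-first so each sigma is final before it is read.
        #    isclose() tolerates float rounding when two paths tie.
        order = sorted(dist, key=dist.get)
        sigma = defaultdict(float)
        sigma[source] = 1.0
        preds = defaultdict(list)
        for node in order:
            for neighbor, weight in graph[node]:
                if math.isclose(dist[node] + weight, dist[neighbor]):
                    sigma[neighbor] += sigma[node]
                    preds[neighbor].append(node)

        # 3. Brandes' accumulation: walk back from the farthest nodes,
        #    crediting each predecessor with its share of the paths.
        delta = defaultdict(float)
        for node in reversed(order):
            for pred in preds[node]:
                delta[pred] += sigma[pred] / sigma[node] * (1.0 + delta[node])
            if node != source:
                scores[node] += delta[node]
    return scores

# Hypothetical passenger counts, the same in both directions.
passengers = {("A", "B"): 1000, ("B", "A"): 1000,
              ("B", "C"): 1000, ("C", "B"): 1000,
              ("A", "C"): 10, ("C", "A"): 10}
scores = betweenness({route: 1 / n for route, n in passengers.items()})
print(scores)
```

With inverse passenger counts as weights, the shortest path from A to C and the one from C to A both run through B, so B scores 2.0 while A and C score 0.0. Because the score is unnormalized, it is simply a count of shortest paths (split fractionally when several tie), which is why the scores in this tutorial grow into the hundreds of thousands.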

Let’s look at the results and find the highest scores (which would give us the city market IDs that are most travelled through).

number_of_best_scores = 15
best_betweenness_centrality_node_vector = mg.algos.util.nodemap.sort(betweenness_centrality, ascending=False, limit=number_of_best_scores)
best_betweenness_centrality_node_set = mg.algos.util.nodeset.from_vector(best_betweenness_centrality_node_vector)
best_betweenness_centrality_node_to_score_map = mg.algos.util.nodemap.select(betweenness_centrality, best_betweenness_centrality_node_set)
best_betweenness_centrality_node_to_score_map

<metagraph.plugins.numpy.types.NumpyNodeMap at 0x7fa4ea428450>

We now have a mapping between city market IDs and their centrality scores in best_betweenness_centrality_node_to_score_map, which is a NumpyNodeMap. Since NumpyNodeMap stores its mapping in a specialized layout for performance reasons, inspecting its internals to view the mapping’s values is not straightforward. Luckily, metagraph allows us to translate it to a Python dictionary, which is significantly easier to inspect.

best_betweenness_centrality_node_to_score_map = mg.translate(best_betweenness_centrality_node_to_score_map, mg.types.NodeMap.PythonNodeMapType)
best_betweenness_centrality_node_to_score_map

{30070: 62402.0,
 30113: 75327.0,
 30154: 56833.0,
 30194: 121807.0,
 30299: 349232.0,
 30325: 107586.0,
 30397: 144922.0,
 30466: 45699.0,
 30559: 465677.0,
 30977: 206250.0,
 31517: 90409.0,
 31703: 337885.0,
 32457: 46094.0,
 32467: 48068.0,
 32575: 494817.0}
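Once the scores are in a plain dictionary, ranking them needs nothing beyond the standard library. The sketch below uses `heapq.nlargest` on a handful of the scores shown above; the variable names are ours.

```python
import heapq

# A subset of the city market ID -> score pairs from the output above.
scores = {
    30194: 121807.0,
    30299: 349232.0,
    30559: 465677.0,
    31703: 337885.0,
    32575: 494817.0,
}

# Take the three highest-scoring markets, best first.
top_three = heapq.nlargest(3, scores.items(), key=lambda item: item[1])
print(top_three)
```

This yields the markets 32575, 30559 and 30299, in that order.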

Now that we have the city market IDs with the best scores, let’s find out which regions those city market IDs correspond to using the mapping from city market IDs to city names and airports we made earlier.

best_betweenness_centrality_scores_df = pd.DataFrame(best_betweenness_centrality_node_to_score_map.items()).rename(columns={0:'CITY_MARKET_ID', 1:'BETWEENNESS_CENTRALITY_SCORE'}).set_index('CITY_MARKET_ID')
best_betweenness_centrality_scores_df.join(city_market_id_info_df).sort_values('BETWEENNESS_CENTRALITY_SCORE', ascending=False)

	BETWEENNESS_CENTRALITY_SCORE	AIRPORT	CITY_NAME
CITY_MARKET_ID
32575	494817.0	{LAX, SMO, SNA, HHR, LGB, BUR, ONT, VNY}	{Santa Ana, CA, Los Angeles, CA, Van Nuys, CA,...
30559	465677.0	{BFI, SEA, LKE, KEH}	{Kenmore, WA, Seattle, WA}
30299	349232.0	{ANC, DQL, MRI}	{Anchorage, AK}
31703	337885.0	{LGA, ISP, EWR, JRB, HPN, JRA, JFK, TSS, SWF}	{Islip, NY, New York, NY, Newark, NJ, Newburgh...
30977	206250.0	{LOT, GYY, ORD, PWK, DPA, MDW}	{Chicago/Romeoville, IL, Chicago, IL, Gary, IN}
30397	144922.0	{FTY, ATL, PDK, QMA}	{Kennesaw, GA, Atlanta, GA}
30194	121807.0	{RBD, ADS, FWH, FTW, AFW, DAL, DFW}	{Dallas/Fort Worth, TX, Dallas, TX, Fort Worth...
30325	107586.0	{APA, DEN}	{Denver, CO}
31517	90409.0	{FBK, EIL, MTX, A01, FAI}	{Fairbanks/Ft. Wainwright, AK, Fairbanks, AK}
30113	75327.0	{BET}	{Bethel, AK}
30070	62402.0	{KDK, ADQ}	{Kodiak, AK}
30154	56833.0	{ACK}	{Nantucket, MA}
32467	48068.0	{FXE, FLL, OPF, TMB, MIA, MPB}	{Miami, FL, Fort Lauderdale, FL}
32457	46094.0	{OAK, CCR, SFO, SJC}	{San Jose, CA, San Francisco, CA, Oakland, CA,...
30466	45699.0	{AZA, AZ3, PHX, GYR, SCF}	{Goodyear, AZ, Phoenix, AZ, Glendale, AZ}

This is what we’d expect: highly populated areas like Los Angeles are the most travelled-through regions.

However, it’s surprising that Anchorage is more travelled through than a hub like Dallas!

There’s a good explanation for Anchorage ranking so highly: because Alaska is so sparsely populated, a well-connected road network was never built. To travel between cities in Alaska, air travel is therefore often the only option. More information can be found here.
