This page was generated from getting_started/use_case_1_airline_connectedness.ipynb.

Use Case Tutorial 1: Well-Connected US Regions

This tutorial shows how to find the regions of the U.S. that are best connected by air travel.

The U.S. Bureau of Transportation Statistics provides data on monthly air travel from all certificated U.S. air carriers and makes it available here. The 2018 air travel data used for this tutorial can be downloaded here. We chose 2018 data to avoid any impact COVID-19 might have had on travel.

We will use this data to determine which areas in the U.S. are best connected, using betweenness centrality as the measure.

Data Preprocessing

Before looking at the data, we’ll need to import some libraries.


import metagraph as mg
import pandas as pd

Let’s see what the data looks like.


RAW_DATA_CSV = './data/airtravel/raw_data.csv' # https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=258
raw_data_df = pd.read_csv(RAW_DATA_CSV)
raw_data_df.head()

PASSENGERS FREIGHT MAIL DISTANCE UNIQUE_CARRIER AIRLINE_ID UNIQUE_CARRIER_NAME ORIGIN_AIRPORT_ID ORIGIN_AIRPORT_SEQ_ID ORIGIN_CITY_MARKET_ID ... DEST_AIRPORT_SEQ_ID DEST_CITY_MARKET_ID DEST DEST_CITY_NAME DEST_STATE_ABR DEST_STATE_FIPS DEST_STATE_NM DEST_WAC MONTH Unnamed: 26
0 0.0 410.0 0.0 616.0 WN 19393.0 Southwest Airlines Co. 13851 1385103 33851 ... 1069302 30693 BNA Nashville, TN TN 47 Tennessee 54 6 NaN
1 0.0 184.0 0.0 2592.0 WN 19393.0 Southwest Airlines Co. 14307 1430705 30721 ... 1289208 32575 LAX Los Angeles, CA CA 6 California 91 6 NaN
2 0.0 87.0 0.0 2445.0 WN 19393.0 Southwest Airlines Co. 14679 1467903 33570 ... 1025702 30257 ALB Albany, NY NY 36 New York 22 6 NaN
3 0.0 10.0 0.0 432.0 WN 19393.0 Southwest Airlines Co. 14730 1473003 33044 ... 1299206 32600 LIT Little Rock, AR AR 5 Arkansas 71 6 NaN
4 0.0 100.0 0.0 129.0 WN 19393.0 Southwest Airlines Co. 14747 1474703 30559 ... 1405702 34057 PDX Portland, OR OR 41 Oregon 92 6 NaN

5 rows × 27 columns

A city market is a region served by one or more airports. For example, New York City has many airports (and it’s sometimes cheaper to fly into and out of different airports), but all of its airports serve the same region / city market.

Since we’re mostly concerned with where passengers will end up going (and not which airport they choose), we will view city markets as the regions of interest.

We will define a region as being well-connected if many people travel in and out of it.

Let’s drop the columns that aren’t needed for finding the well-connected regions, along with any flight paths with zero passengers (these are usually flights transporting packages).


RELEVANT_COLUMNS = [
    'PASSENGERS',
    'ORIGIN_AIRPORT_ID', 'ORIGIN_AIRPORT_SEQ_ID', 'ORIGIN_CITY_MARKET_ID', 'ORIGIN', 'ORIGIN_CITY_NAME', 'ORIGIN_STATE_ABR', 'ORIGIN_STATE_NM',
    'DEST_AIRPORT_ID',   'DEST_AIRPORT_SEQ_ID',   'DEST_CITY_MARKET_ID',   'DEST',   'DEST_CITY_NAME',   'DEST_STATE_ABR',   'DEST_STATE_NM',
]
relevant_df = raw_data_df[RELEVANT_COLUMNS]
relevant_df = relevant_df[relevant_df.PASSENGERS != 0.0]
relevant_df.head()

PASSENGERS ORIGIN_AIRPORT_ID ORIGIN_AIRPORT_SEQ_ID ORIGIN_CITY_MARKET_ID ORIGIN ORIGIN_CITY_NAME ORIGIN_STATE_ABR ORIGIN_STATE_NM DEST_AIRPORT_ID DEST_AIRPORT_SEQ_ID DEST_CITY_MARKET_ID DEST DEST_CITY_NAME DEST_STATE_ABR DEST_STATE_NM
44447 1.0 12523 1252306 32523 JNU Juneau, AK AK Alaska 11545 1154501 31545 ELV Elfin Cove, AK AK Alaska
44448 1.0 12523 1252306 32523 JNU Juneau, AK AK Alaska 11619 1161902 31619 EXI Excursion Inlet, AK AK Alaska
44449 1.0 12610 1261001 32610 KAE Kake, AK AK Alaska 10204 1020401 30204 AGN Angoon, AK AK Alaska
44450 1.0 11298 1129806 30194 DFW Dallas/Fort Worth, TX TX Texas 11292 1129202 30325 DEN Denver, CO CO Colorado
44451 1.0 15991 1599102 35991 YAK Yakutat, AK AK Alaska 14828 1482805 34828 SIT Sitka, AK AK Alaska

We’ll want to have our data in an edge list format where the city markets are the nodes so that we can import this data into metagraph.

We’ll use betweenness centrality to determine connectedness since it measures how many shortest paths go through a node. For this to serve our goal, routes that carry more passengers need to have less total weight, so that the shortest paths are the most heavily travelled ones. More elegant metrics might be considered in practice, but for the sake of simplicity we’ll use 1/number_of_passengers as the edge weight in this tutorial.
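To see why inverse passenger counts make busy routes "short", here is a minimal plain-Python sketch. The three regions and their passenger counts are made up for illustration, and `shortest_path` is a hypothetical helper written for this example, not part of metagraph.

```python
import heapq

# Hypothetical passenger counts between three made-up regions.
passengers = {
    ("A", "B"): 1000,
    ("B", "C"): 1000,
    ("A", "C"): 10,
}

def shortest_path(edge_weights, source, target):
    """Dijkstra's algorithm; returns (total_weight, path) or (inf, [])."""
    adjacency = {}
    for (u, v), weight in edge_weights.items():
        adjacency.setdefault(u, []).append((v, weight))
    queue = [(0.0, source, [source])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == target:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in adjacency.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (dist + weight, neighbor, path + [neighbor]))
    return float("inf"), []

# Busy routes get small weights, quiet routes get large ones.
inverse_weights = {edge: 1 / count for edge, count in passengers.items()}
cost, path = shortest_path(inverse_weights, "A", "C")
print(path, cost)
```

The direct A to C route carries only 10 passengers (weight 0.1), so the shortest path goes through B along the two busy routes (total weight 0.002). B is therefore the node that would collect betweenness credit for this pair.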

We’ll create an edge list with such weights using pandas.


passenger_flow_df = relevant_df[['ORIGIN_CITY_MARKET_ID', 'DEST_CITY_MARKET_ID', 'PASSENGERS']]
passenger_flow_df = passenger_flow_df.groupby(['ORIGIN_CITY_MARKET_ID', 'DEST_CITY_MARKET_ID']) \
                        .PASSENGERS.sum() \
                        .reset_index()
passenger_flow_df['INVERSE_PASSENGER_COUNT'] = 1 / passenger_flow_df.PASSENGERS
assert not passenger_flow_df.INVERSE_PASSENGER_COUNT.isna().any(), "Edge list has NaN weights."
passenger_flow_df.head()

   ORIGIN_CITY_MARKET_ID  DEST_CITY_MARKET_ID  PASSENGERS  INVERSE_PASSENGER_COUNT
0                  30005                30349         4.0                 0.250000
1                  30005                31214        10.0                 0.100000
2                  30005                31517       193.0                 0.005181
3                  30005                35731         7.0                 0.142857
4                  30006                30056         5.0                 0.200000
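The raw data can contain several rows for the same pair of city markets (for example, different airports within one market), which is why the passenger counts are summed before being inverted. Here is a small sketch of that aggregation on made-up rows; the IDs and counts are hypothetical.

```python
import pandas as pd

# Hypothetical rows: two flights from market 1 to market 2, one the other way.
toy_df = pd.DataFrame({
    "ORIGIN_CITY_MARKET_ID": [1, 1, 2],
    "DEST_CITY_MARKET_ID": [2, 2, 1],
    "PASSENGERS": [30.0, 70.0, 50.0],
})

# Sum passengers per (origin, destination) pair, then invert to get weights.
toy_flow_df = (
    toy_df.groupby(["ORIGIN_CITY_MARKET_ID", "DEST_CITY_MARKET_ID"])
    .PASSENGERS.sum()
    .reset_index()
)
toy_flow_df["INVERSE_PASSENGER_COUNT"] = 1 / toy_flow_df.PASSENGERS
print(toy_flow_df)
```

The two rows from market 1 to market 2 collapse into a single edge with 100 passengers and a weight of 0.01, while the opposite direction stays a separate edge because the graph is directed.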

Since the data identifies regions by city market ID rather than by name (an airport can serve a region containing multiple cities), it’d be useful to build a mapping from city market IDs to city names and airports so that we can contextualize our findings.


origin_city_market_id_info_df = relevant_df[['ORIGIN_CITY_MARKET_ID', 'ORIGIN', 'ORIGIN_CITY_NAME']] \
                                    .rename(columns={'ORIGIN_CITY_MARKET_ID': 'CITY_MARKET_ID',
                                                     'ORIGIN': 'AIRPORT',
                                                     'ORIGIN_CITY_NAME': 'CITY_NAME'})
dest_city_market_id_info_df = relevant_df[['DEST_CITY_MARKET_ID', 'DEST', 'DEST_CITY_NAME']] \
                                    .rename(columns={'DEST_CITY_MARKET_ID': 'CITY_MARKET_ID',
                                                     'DEST': 'AIRPORT',
                                                     'DEST_CITY_NAME': 'CITY_NAME'})
city_market_id_info_df = pd.concat([origin_city_market_id_info_df, dest_city_market_id_info_df])
city_market_id_info_df = city_market_id_info_df.groupby('CITY_MARKET_ID').agg({'AIRPORT': set, 'CITY_NAME': set})
city_market_id_info_df.head()

CITY_MARKET_ID  AIRPORT     CITY_NAME
30005           {05A}       {Little Squaw, AK}
30006           {06A}       {Kizhuyak, AK}
30007           {KLW}       {Klawock, AK}
30009           {HOM, 09A}  {Homer, AK}
30010           {1B1}       {Hudson, NY}

Which region is travelled through the most?

We’ll determine which region is travelled through the most using betweenness centrality, since it measures exactly that. There are a variety of algorithms to choose from, but we’ll use only betweenness centrality in this tutorial.

We’ll first create a metagraph graph for the data.


passenger_flow_edge_map = mg.wrappers.EdgeMap.PandasEdgeMap(passenger_flow_df,
                                                           'ORIGIN_CITY_MARKET_ID',
                                                           'DEST_CITY_MARKET_ID',
                                                           'INVERSE_PASSENGER_COUNT',
                                                           is_directed=True)
passenger_flow_graph = mg.algos.util.graph.build(passenger_flow_edge_map)

Note that we use the inverse passenger count as the weights to ensure that the shortest paths are the paths that have the most passengers.

Let’s calculate the betweenness centrality.


betweenness_centrality = mg.algos.centrality.betweenness(passenger_flow_graph, normalize=False)
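To make concrete what an unnormalized betweenness score counts, here is a plain-Python sketch of Brandes’ algorithm for a weighted directed graph. It is not metagraph’s implementation; the `betweenness` function and the toy passenger counts below are hypothetical and written only for this illustration.

```python
import heapq
from collections import defaultdict

def betweenness(edge_weights):
    """Unnormalized betweenness centrality for a directed graph.

    edge_weights maps (source, target) -> positive weight.
    Path endpoints do not count toward a node's score.
    """
    adjacency = defaultdict(list)
    nodes = set()
    for (u, v), weight in edge_weights.items():
        adjacency[u].append((v, weight))
        nodes.update((u, v))

    scores = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        # Dijkstra from s, counting shortest paths (sigma) and predecessors.
        dist, sigma, preds = {s: 0.0}, {s: 1}, {s: []}
        order, done = [], set()
        queue = [(0.0, s)]
        while queue:
            d, v = heapq.heappop(queue)
            if v in done:
                continue
            done.add(v)
            order.append(v)
            for w, weight in adjacency[v]:
                new_dist = d + weight
                if w not in dist or new_dist < dist[w]:
                    dist[w], sigma[w], preds[w] = new_dist, sigma[v], [v]
                    heapq.heappush(queue, (new_dist, w))
                elif new_dist == dist[w]:  # another equally short path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = dict.fromkeys(order, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                scores[w] += delta[w]
    return scores

# Hypothetical passenger counts: A and B are linked mostly via hub H.
passengers = {
    ("A", "H"): 100, ("H", "A"): 100,
    ("B", "H"): 100, ("H", "B"): 100,
    ("A", "B"): 5, ("B", "A"): 5,
}
toy_scores = betweenness({edge: 1 / n for edge, n in passengers.items()})
print(toy_scores)
```

With inverse passenger counts as weights, both the A to B and B to A shortest paths run through H, so H scores 2 and the other nodes score 0. Note that detecting ties with exact float equality is fragile; a production implementation may handle this differently.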

Let’s look at the results and find the highest scores (which would give us the city market IDs that are most travelled through).


number_of_best_scores = 15
best_betweenness_centrality_node_vector = mg.algos.util.nodemap.sort(betweenness_centrality, ascending=False, limit=number_of_best_scores)
best_betweenness_centrality_node_set = mg.algos.util.nodeset.from_vector(best_betweenness_centrality_node_vector)
best_betweenness_centrality_node_to_score_map = mg.algos.util.nodemap.select(betweenness_centrality, best_betweenness_centrality_node_set)
best_betweenness_centrality_node_to_score_map

<metagraph.plugins.numpy.types.NumpyNodeMap at 0x7fa4ea428450>

We now have a mapping between city market IDs and their centrality scores in best_betweenness_centrality_node_to_score_map, which is a NumpyNodeMap. Since NumpyNodeMap stores its mapping in a non-trivial fashion for performance reasons, inspecting its internals to view the mapping’s values is not straightforward. Luckily, metagraph allows us to translate it to a Python dictionary, which is significantly easier to inspect.


best_betweenness_centrality_node_to_score_map = mg.translate(best_betweenness_centrality_node_to_score_map, mg.types.NodeMap.PythonNodeMapType)
best_betweenness_centrality_node_to_score_map

{30070: 62402.0,
 30113: 75327.0,
 30154: 56833.0,
 30194: 121807.0,
 30299: 349232.0,
 30325: 107586.0,
 30397: 144922.0,
 30466: 45699.0,
 30559: 465677.0,
 30977: 206250.0,
 31517: 90409.0,
 31703: 337885.0,
 32457: 46094.0,
 32467: 48068.0,
 32575: 494817.0}
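Once the scores are in a plain dictionary, ranking them needs nothing beyond the standard library. As a small sketch, using a few of the entries from the output above:

```python
import heapq

# A few entries copied from the dictionary output above.
scores = {
    30070: 62402.0,
    30299: 349232.0,
    30559: 465677.0,
    31703: 337885.0,
    32575: 494817.0,
}

# Pick the three (city_market_id, score) pairs with the highest scores.
top_three = heapq.nlargest(3, scores.items(), key=lambda item: item[1])
print(top_three)
```

This returns the pairs in descending order of score, which the dictionary itself does not guarantee since it is keyed by city market ID.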

Now that we have the city market IDs with the best scores, let’s find out which regions those city market IDs correspond to using the mapping from city market IDs to city names and airports we made earlier.


best_betweenness_centrality_scores_df = pd.DataFrame(best_betweenness_centrality_node_to_score_map.items()) \
                                            .rename(columns={0: 'CITY_MARKET_ID', 1: 'BETWEENNESS_CENTRALITY_SCORE'}) \
                                            .set_index('CITY_MARKET_ID')
best_betweenness_centrality_scores_df.join(city_market_id_info_df).sort_values('BETWEENNESS_CENTRALITY_SCORE', ascending=False)

CITY_MARKET_ID  BETWEENNESS_CENTRALITY_SCORE  AIRPORT                                        CITY_NAME
32575           494817.0                      {LAX, SMO, SNA, HHR, LGB, BUR, ONT, VNY}       {Santa Ana, CA, Los Angeles, CA, Van Nuys, CA,...
30559           465677.0                      {BFI, SEA, LKE, KEH}                           {Kenmore, WA, Seattle, WA}
30299           349232.0                      {ANC, DQL, MRI}                                {Anchorage, AK}
31703           337885.0                      {LGA, ISP, EWR, JRB, HPN, JRA, JFK, TSS, SWF}  {Islip, NY, New York, NY, Newark, NJ, Newburgh...
30977           206250.0                      {LOT, GYY, ORD, PWK, DPA, MDW}                 {Chicago/Romeoville, IL, Chicago, IL, Gary, IN}
30397           144922.0                      {FTY, ATL, PDK, QMA}                           {Kennesaw, GA, Atlanta, GA}
30194           121807.0                      {RBD, ADS, FWH, FTW, AFW, DAL, DFW}            {Dallas/Fort Worth, TX, Dallas, TX, Fort Worth...
30325           107586.0                      {APA, DEN}                                     {Denver, CO}
31517           90409.0                       {FBK, EIL, MTX, A01, FAI}                      {Fairbanks/Ft. Wainwright, AK, Fairbanks, AK}
30113           75327.0                       {BET}                                          {Bethel, AK}
30070           62402.0                       {KDK, ADQ}                                     {Kodiak, AK}
30154           56833.0                       {ACK}                                          {Nantucket, MA}
32467           48068.0                       {FXE, FLL, OPF, TMB, MIA, MPB}                 {Miami, FL, Fort Lauderdale, FL}
32457           46094.0                       {OAK, CCR, SFO, SJC}                           {San Jose, CA, San Francisco, CA, Oakland, CA,...
30466           45699.0                       {AZA, AZ3, PHX, GYR, SCF}                      {Goodyear, AZ, Phoenix, AZ, Glendale, AZ}

This is largely what we’d expect: highly populated areas like Los Angeles are the most travelled through.

However, it’s surprising that Anchorage is more travelled through than a hub like Dallas!

There’s a good explanation for Anchorage being so heavily travelled through: since Alaska is so sparsely populated, a well-connected road infrastructure was never built. Thus, air travel is often the only option for travelling between cities in Alaska. More information can be found here.
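One way to build intuition for this is to look at a hub-and-spoke network, where small communities are only reachable through a single hub. The sketch below uses a made-up network (the village names and their number are hypothetical and not derived from the BTS data) and counts how many village-to-village trips must pass through the hub.

```python
from collections import deque
from itertools import permutations

def bfs_path(adjacency, source, target):
    """Return one fewest-hops path from source to target, or None."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency[node]:
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

# Hypothetical network: five villages, each connected only to one hub.
villages = ["v1", "v2", "v3", "v4", "v5"]
adjacency = {"hub": list(villages), **{v: ["hub"] for v in villages}}

# Count ordered village pairs whose path has the hub as an intermediate stop.
through_hub = 0
for source, target in permutations(villages, 2):
    path = bfs_path(adjacency, source, target)
    if "hub" in path[1:-1]:
        through_hub += 1
print(through_hub)
```

All 5 × 4 = 20 ordered village pairs route through the hub, so its betweenness grows roughly with the square of the number of communities it serves, regardless of how few passengers each route carries.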