{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Use Case Tutorial 1: Well-Connected US Regions\n",
    "\n",
    "This is a tutorial on how to find the most well-connected regions of the U.S. via air travel.\n",
    "\n",
    "The U.S. Bureau of Transportation Statistics provides data on monthly air travel from all certificated U.S. air carriers and makes it available [here](https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=258). The 2018 air travel data used for this tutorial can be downloaded [here](https://transtats.bts.gov/ftproot/TranStatsData/403537556_T_T100D_MARKET_US_CARRIER_ONLY.zip). We chose 2018 data to avoid any impact COVID-19 might’ve had on travel.\n",
    "\n",
    "We will utilize this data to determine which areas in the U.S. are most well-connected using betweenness centrality."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Preprocessing\n",
    "\n",
    "Let’s first look at the data.\n",
    "\n",
    "First, we’ll need to import some libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import metagraph as mg\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let’s see what the data looks like."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PASSENGERS</th>\n",
       "      <th>FREIGHT</th>\n",
       "      <th>MAIL</th>\n",
       "      <th>DISTANCE</th>\n",
       "      <th>UNIQUE_CARRIER</th>\n",
       "      <th>AIRLINE_ID</th>\n",
       "      <th>UNIQUE_CARRIER_NAME</th>\n",
       "      <th>ORIGIN_AIRPORT_ID</th>\n",
       "      <th>ORIGIN_AIRPORT_SEQ_ID</th>\n",
       "      <th>ORIGIN_CITY_MARKET_ID</th>\n",
       "      <th>...</th>\n",
       "      <th>DEST_AIRPORT_SEQ_ID</th>\n",
       "      <th>DEST_CITY_MARKET_ID</th>\n",
       "      <th>DEST</th>\n",
       "      <th>DEST_CITY_NAME</th>\n",
       "      <th>DEST_STATE_ABR</th>\n",
       "      <th>DEST_STATE_FIPS</th>\n",
       "      <th>DEST_STATE_NM</th>\n",
       "      <th>DEST_WAC</th>\n",
       "      <th>MONTH</th>\n",
       "      <th>Unnamed: 26</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.0</td>\n",
       "      <td>410.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>616.0</td>\n",
       "      <td>WN</td>\n",
       "      <td>19393.0</td>\n",
       "      <td>Southwest Airlines Co.</td>\n",
       "      <td>13851</td>\n",
       "      <td>1385103</td>\n",
       "      <td>33851</td>\n",
       "      <td>...</td>\n",
       "      <td>1069302</td>\n",
       "      <td>30693</td>\n",
       "      <td>BNA</td>\n",
       "      <td>Nashville, TN</td>\n",
       "      <td>TN</td>\n",
       "      <td>47</td>\n",
       "      <td>Tennessee</td>\n",
       "      <td>54</td>\n",
       "      <td>6</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.0</td>\n",
       "      <td>184.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2592.0</td>\n",
       "      <td>WN</td>\n",
       "      <td>19393.0</td>\n",
       "      <td>Southwest Airlines Co.</td>\n",
       "      <td>14307</td>\n",
       "      <td>1430705</td>\n",
       "      <td>30721</td>\n",
       "      <td>...</td>\n",
       "      <td>1289208</td>\n",
       "      <td>32575</td>\n",
       "      <td>LAX</td>\n",
       "      <td>Los Angeles, CA</td>\n",
       "      <td>CA</td>\n",
       "      <td>6</td>\n",
       "      <td>California</td>\n",
       "      <td>91</td>\n",
       "      <td>6</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>87.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2445.0</td>\n",
       "      <td>WN</td>\n",
       "      <td>19393.0</td>\n",
       "      <td>Southwest Airlines Co.</td>\n",
       "      <td>14679</td>\n",
       "      <td>1467903</td>\n",
       "      <td>33570</td>\n",
       "      <td>...</td>\n",
       "      <td>1025702</td>\n",
       "      <td>30257</td>\n",
       "      <td>ALB</td>\n",
       "      <td>Albany, NY</td>\n",
       "      <td>NY</td>\n",
       "      <td>36</td>\n",
       "      <td>New York</td>\n",
       "      <td>22</td>\n",
       "      <td>6</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>432.0</td>\n",
       "      <td>WN</td>\n",
       "      <td>19393.0</td>\n",
       "      <td>Southwest Airlines Co.</td>\n",
       "      <td>14730</td>\n",
       "      <td>1473003</td>\n",
       "      <td>33044</td>\n",
       "      <td>...</td>\n",
       "      <td>1299206</td>\n",
       "      <td>32600</td>\n",
       "      <td>LIT</td>\n",
       "      <td>Little Rock, AR</td>\n",
       "      <td>AR</td>\n",
       "      <td>5</td>\n",
       "      <td>Arkansas</td>\n",
       "      <td>71</td>\n",
       "      <td>6</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.0</td>\n",
       "      <td>100.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>129.0</td>\n",
       "      <td>WN</td>\n",
       "      <td>19393.0</td>\n",
       "      <td>Southwest Airlines Co.</td>\n",
       "      <td>14747</td>\n",
       "      <td>1474703</td>\n",
       "      <td>30559</td>\n",
       "      <td>...</td>\n",
       "      <td>1405702</td>\n",
       "      <td>34057</td>\n",
       "      <td>PDX</td>\n",
       "      <td>Portland, OR</td>\n",
       "      <td>OR</td>\n",
       "      <td>41</td>\n",
       "      <td>Oregon</td>\n",
       "      <td>92</td>\n",
       "      <td>6</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 27 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   PASSENGERS  FREIGHT  MAIL  DISTANCE UNIQUE_CARRIER  AIRLINE_ID  \\\n",
       "0         0.0    410.0   0.0     616.0             WN     19393.0   \n",
       "1         0.0    184.0   0.0    2592.0             WN     19393.0   \n",
       "2         0.0     87.0   0.0    2445.0             WN     19393.0   \n",
       "3         0.0     10.0   0.0     432.0             WN     19393.0   \n",
       "4         0.0    100.0   0.0     129.0             WN     19393.0   \n",
       "\n",
       "      UNIQUE_CARRIER_NAME  ORIGIN_AIRPORT_ID  ORIGIN_AIRPORT_SEQ_ID  \\\n",
       "0  Southwest Airlines Co.              13851                1385103   \n",
       "1  Southwest Airlines Co.              14307                1430705   \n",
       "2  Southwest Airlines Co.              14679                1467903   \n",
       "3  Southwest Airlines Co.              14730                1473003   \n",
       "4  Southwest Airlines Co.              14747                1474703   \n",
       "\n",
       "   ORIGIN_CITY_MARKET_ID  ... DEST_AIRPORT_SEQ_ID DEST_CITY_MARKET_ID DEST  \\\n",
       "0                  33851  ...             1069302               30693  BNA   \n",
       "1                  30721  ...             1289208               32575  LAX   \n",
       "2                  33570  ...             1025702               30257  ALB   \n",
       "3                  33044  ...             1299206               32600  LIT   \n",
       "4                  30559  ...             1405702               34057  PDX   \n",
       "\n",
       "    DEST_CITY_NAME DEST_STATE_ABR  DEST_STATE_FIPS  DEST_STATE_NM  DEST_WAC  \\\n",
       "0    Nashville, TN             TN               47      Tennessee        54   \n",
       "1  Los Angeles, CA             CA                6     California        91   \n",
       "2       Albany, NY             NY               36       New York        22   \n",
       "3  Little Rock, AR             AR                5       Arkansas        71   \n",
       "4     Portland, OR             OR               41         Oregon        92   \n",
       "\n",
       "   MONTH Unnamed: 26  \n",
       "0      6         NaN  \n",
       "1      6         NaN  \n",
       "2      6         NaN  \n",
       "3      6         NaN  \n",
       "4      6         NaN  \n",
       "\n",
       "[5 rows x 27 columns]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "RAW_DATA_CSV = './data/airtravel/raw_data.csv' # https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=258\n",
    "raw_data_df = pd.read_csv(RAW_DATA_CSV)\n",
    "raw_data_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A city market is a region that an airport supports. For example, New York City has many airports (and it’s sometimes cheaper to fly into and out of different airports), but all of their airports serve the same region / city market.\n",
    "\n",
    "Since we’re mostly concerned with where passengers will end up going (and not which airport they choose), we will view city markets as the regions of interest.\n",
    "\n",
    "We will define a region as being well-connected if many people travel in and out of it.\n",
    "\n",
    "Let’s filter out all the irrelevant information not required for finding the well-connected regions and any flight paths with zero passengers (these flights are usually flights transporting packages)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PASSENGERS</th>\n",
       "      <th>ORIGIN_AIRPORT_ID</th>\n",
       "      <th>ORIGIN_AIRPORT_SEQ_ID</th>\n",
       "      <th>ORIGIN_CITY_MARKET_ID</th>\n",
       "      <th>ORIGIN</th>\n",
       "      <th>ORIGIN_CITY_NAME</th>\n",
       "      <th>ORIGIN_STATE_ABR</th>\n",
       "      <th>ORIGIN_STATE_NM</th>\n",
       "      <th>DEST_AIRPORT_ID</th>\n",
       "      <th>DEST_AIRPORT_SEQ_ID</th>\n",
       "      <th>DEST_CITY_MARKET_ID</th>\n",
       "      <th>DEST</th>\n",
       "      <th>DEST_CITY_NAME</th>\n",
       "      <th>DEST_STATE_ABR</th>\n",
       "      <th>DEST_STATE_NM</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>44447</th>\n",
       "      <td>1.0</td>\n",
       "      <td>12523</td>\n",
       "      <td>1252306</td>\n",
       "      <td>32523</td>\n",
       "      <td>JNU</td>\n",
       "      <td>Juneau, AK</td>\n",
       "      <td>AK</td>\n",
       "      <td>Alaska</td>\n",
       "      <td>11545</td>\n",
       "      <td>1154501</td>\n",
       "      <td>31545</td>\n",
       "      <td>ELV</td>\n",
       "      <td>Elfin Cove, AK</td>\n",
       "      <td>AK</td>\n",
       "      <td>Alaska</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44448</th>\n",
       "      <td>1.0</td>\n",
       "      <td>12523</td>\n",
       "      <td>1252306</td>\n",
       "      <td>32523</td>\n",
       "      <td>JNU</td>\n",
       "      <td>Juneau, AK</td>\n",
       "      <td>AK</td>\n",
       "      <td>Alaska</td>\n",
       "      <td>11619</td>\n",
       "      <td>1161902</td>\n",
       "      <td>31619</td>\n",
       "      <td>EXI</td>\n",
       "      <td>Excursion Inlet, AK</td>\n",
       "      <td>AK</td>\n",
       "      <td>Alaska</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44449</th>\n",
       "      <td>1.0</td>\n",
       "      <td>12610</td>\n",
       "      <td>1261001</td>\n",
       "      <td>32610</td>\n",
       "      <td>KAE</td>\n",
       "      <td>Kake, AK</td>\n",
       "      <td>AK</td>\n",
       "      <td>Alaska</td>\n",
       "      <td>10204</td>\n",
       "      <td>1020401</td>\n",
       "      <td>30204</td>\n",
       "      <td>AGN</td>\n",
       "      <td>Angoon, AK</td>\n",
       "      <td>AK</td>\n",
       "      <td>Alaska</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44450</th>\n",
       "      <td>1.0</td>\n",
       "      <td>11298</td>\n",
       "      <td>1129806</td>\n",
       "      <td>30194</td>\n",
       "      <td>DFW</td>\n",
       "      <td>Dallas/Fort Worth, TX</td>\n",
       "      <td>TX</td>\n",
       "      <td>Texas</td>\n",
       "      <td>11292</td>\n",
       "      <td>1129202</td>\n",
       "      <td>30325</td>\n",
       "      <td>DEN</td>\n",
       "      <td>Denver, CO</td>\n",
       "      <td>CO</td>\n",
       "      <td>Colorado</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44451</th>\n",
       "      <td>1.0</td>\n",
       "      <td>15991</td>\n",
       "      <td>1599102</td>\n",
       "      <td>35991</td>\n",
       "      <td>YAK</td>\n",
       "      <td>Yakutat, AK</td>\n",
       "      <td>AK</td>\n",
       "      <td>Alaska</td>\n",
       "      <td>14828</td>\n",
       "      <td>1482805</td>\n",
       "      <td>34828</td>\n",
       "      <td>SIT</td>\n",
       "      <td>Sitka, AK</td>\n",
       "      <td>AK</td>\n",
       "      <td>Alaska</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       PASSENGERS  ORIGIN_AIRPORT_ID  ORIGIN_AIRPORT_SEQ_ID  \\\n",
       "44447         1.0              12523                1252306   \n",
       "44448         1.0              12523                1252306   \n",
       "44449         1.0              12610                1261001   \n",
       "44450         1.0              11298                1129806   \n",
       "44451         1.0              15991                1599102   \n",
       "\n",
       "       ORIGIN_CITY_MARKET_ID ORIGIN       ORIGIN_CITY_NAME ORIGIN_STATE_ABR  \\\n",
       "44447                  32523    JNU             Juneau, AK               AK   \n",
       "44448                  32523    JNU             Juneau, AK               AK   \n",
       "44449                  32610    KAE               Kake, AK               AK   \n",
       "44450                  30194    DFW  Dallas/Fort Worth, TX               TX   \n",
       "44451                  35991    YAK            Yakutat, AK               AK   \n",
       "\n",
       "      ORIGIN_STATE_NM  DEST_AIRPORT_ID  DEST_AIRPORT_SEQ_ID  \\\n",
       "44447          Alaska            11545              1154501   \n",
       "44448          Alaska            11619              1161902   \n",
       "44449          Alaska            10204              1020401   \n",
       "44450           Texas            11292              1129202   \n",
       "44451          Alaska            14828              1482805   \n",
       "\n",
       "       DEST_CITY_MARKET_ID DEST       DEST_CITY_NAME DEST_STATE_ABR  \\\n",
       "44447                31545  ELV       Elfin Cove, AK             AK   \n",
       "44448                31619  EXI  Excursion Inlet, AK             AK   \n",
       "44449                30204  AGN           Angoon, AK             AK   \n",
       "44450                30325  DEN           Denver, CO             CO   \n",
       "44451                34828  SIT            Sitka, AK             AK   \n",
       "\n",
       "      DEST_STATE_NM  \n",
       "44447        Alaska  \n",
       "44448        Alaska  \n",
       "44449        Alaska  \n",
       "44450      Colorado  \n",
       "44451        Alaska  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "RELEVANT_COLUMNS = [\n",
    "    'PASSENGERS',\n",
    "    'ORIGIN_AIRPORT_ID', 'ORIGIN_AIRPORT_SEQ_ID', 'ORIGIN_CITY_MARKET_ID', 'ORIGIN', 'ORIGIN_CITY_NAME', 'ORIGIN_STATE_ABR', 'ORIGIN_STATE_NM',\n",
    "    'DEST_AIRPORT_ID',   'DEST_AIRPORT_SEQ_ID',   'DEST_CITY_MARKET_ID',   'DEST',   'DEST_CITY_NAME',   'DEST_STATE_ABR',   'DEST_STATE_NM',\n",
    "]\n",
    "relevant_df = raw_data_df[RELEVANT_COLUMNS]\n",
    "relevant_df = relevant_df[relevant_df.PASSENGERS != 0.0]\n",
    "relevant_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We’ll want to have our data in an edge list format where the city markets are the nodes so that we can import this data into `metagraph`.\n",
    "\n",
    "We’ll use betweenness centrality to determine connectedness since it is a metric of how many shortest paths go through a node. In order to use betweenness centrality effectively for our goal, we’ll want paths with less total weight to be the ones denoting paths with more passengers. More elegant metrics might be considered in practice, but we’ll use `1/number_of_passengers` for the weights in this tutorial for the sake of simplicity.\n",
    "\n",
    "We’ll create an edge list with such weights using pandas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ORIGIN_CITY_MARKET_ID</th>\n",
       "      <th>DEST_CITY_MARKET_ID</th>\n",
       "      <th>PASSENGERS</th>\n",
       "      <th>INVERSE_PASSENGER_COUNT</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>30005</td>\n",
       "      <td>30349</td>\n",
       "      <td>4.0</td>\n",
       "      <td>0.250000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>30005</td>\n",
       "      <td>31214</td>\n",
       "      <td>10.0</td>\n",
       "      <td>0.100000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>30005</td>\n",
       "      <td>31517</td>\n",
       "      <td>193.0</td>\n",
       "      <td>0.005181</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>30005</td>\n",
       "      <td>35731</td>\n",
       "      <td>7.0</td>\n",
       "      <td>0.142857</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>30006</td>\n",
       "      <td>30056</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.200000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ORIGIN_CITY_MARKET_ID  DEST_CITY_MARKET_ID  PASSENGERS  \\\n",
       "0                  30005                30349         4.0   \n",
       "1                  30005                31214        10.0   \n",
       "2                  30005                31517       193.0   \n",
       "3                  30005                35731         7.0   \n",
       "4                  30006                30056         5.0   \n",
       "\n",
       "   INVERSE_PASSENGER_COUNT  \n",
       "0                 0.250000  \n",
       "1                 0.100000  \n",
       "2                 0.005181  \n",
       "3                 0.142857  \n",
       "4                 0.200000  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "passenger_flow_df = relevant_df[['ORIGIN_CITY_MARKET_ID', 'DEST_CITY_MARKET_ID', 'PASSENGERS']]\n",
    "passenger_flow_df = passenger_flow_df.groupby(['ORIGIN_CITY_MARKET_ID', 'DEST_CITY_MARKET_ID']) \\\n",
    "                        .PASSENGERS.sum() \\\n",
    "                        .reset_index()\n",
    "passenger_flow_df['INVERSE_PASSENGER_COUNT'] = passenger_flow_df.PASSENGERS.map(lambda passenger_count: 1/passenger_count)\n",
    "assert len(passenger_flow_df[passenger_flow_df.INVERSE_PASSENGER_COUNT != passenger_flow_df.INVERSE_PASSENGER_COUNT]) == 0, \"Edge list has NaN weights.\"\n",
    "passenger_flow_df.head()"
   ]
  },
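  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the weighting concrete, here is a small sketch with made-up passenger counts for three hypothetical city markets `A`, `B`, and `C` (none of these names or numbers come from the dataset). It compares the total weight of the direct route from `A` to `C` with the route through `B`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Made-up passenger counts for illustration only (not taken from the dataset).\n",
    "toy_passengers = {('A', 'B'): 1000, ('B', 'C'): 500, ('A', 'C'): 2}\n",
    "\n",
    "# Same weighting as in the tutorial: weight = 1 / number_of_passengers.\n",
    "toy_weights = {edge: 1 / count for edge, count in toy_passengers.items()}\n",
    "\n",
    "# Total weight of the direct route versus the route through B.\n",
    "direct_cost = toy_weights[('A', 'C')]\n",
    "via_b_cost = toy_weights[('A', 'B')] + toy_weights[('B', 'C')]\n",
    "\n",
    "print('A -> C directly:', direct_cost)\n",
    "print('A -> B -> C:    ', via_b_cost)\n",
    "print('Route through B is shorter:', via_b_cost < direct_cost)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The busy two-hop route has a far smaller total weight than the lightly travelled direct route, so a shortest-path search prefers it, and `B` gets credit for lying on that path."
   ]
  },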
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since the data has city market IDs and don’t have names because an airport can serve regions containing multiple cities, it’d be useful to get a mapping from city market IDs to city names and airports so that we can contextualize our findings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>AIRPORT</th>\n",
       "      <th>CITY_NAME</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CITY_MARKET_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>30005</th>\n",
       "      <td>{05A}</td>\n",
       "      <td>{Little Squaw, AK}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30006</th>\n",
       "      <td>{06A}</td>\n",
       "      <td>{Kizhuyak, AK}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30007</th>\n",
       "      <td>{KLW}</td>\n",
       "      <td>{Klawock, AK}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30009</th>\n",
       "      <td>{HOM, 09A}</td>\n",
       "      <td>{Homer, AK}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30010</th>\n",
       "      <td>{1B1}</td>\n",
       "      <td>{Hudson, NY}</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   AIRPORT           CITY_NAME\n",
       "CITY_MARKET_ID                                \n",
       "30005                {05A}  {Little Squaw, AK}\n",
       "30006                {06A}      {Kizhuyak, AK}\n",
       "30007                {KLW}       {Klawock, AK}\n",
       "30009           {HOM, 09A}         {Homer, AK}\n",
       "30010                {1B1}        {Hudson, NY}"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "origin_city_market_id_info_df = relevant_df[['ORIGIN_CITY_MARKET_ID', 'ORIGIN', 'ORIGIN_CITY_NAME']] \\\n",
    "                                    .rename(columns={'ORIGIN_CITY_MARKET_ID': 'CITY_MARKET_ID',\n",
    "                                                     'ORIGIN': 'AIRPORT',\n",
    "                                                     'ORIGIN_CITY_NAME': 'CITY_NAME'})\n",
    "dest_city_market_id_info_df = relevant_df[['DEST_CITY_MARKET_ID', 'DEST', 'DEST_CITY_NAME']] \\\n",
    "                                    .rename(columns={'DEST_CITY_MARKET_ID': 'CITY_MARKET_ID',\n",
    "                                                     'DEST': 'AIRPORT',\n",
    "                                                     'DEST_CITY_NAME': 'CITY_NAME'})\n",
    "city_market_id_info_df = pd.concat([origin_city_market_id_info_df, dest_city_market_id_info_df])\n",
    "city_market_id_info_df = city_market_id_info_df.groupby('CITY_MARKET_ID').agg({'AIRPORT': set, 'CITY_NAME': set})\n",
    "city_market_id_info_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Which region is travelled through the most?\n",
    "\n",
    "We’re going to determine which region is travelled through the most using betweenness centrality as it measures exactly that. There are a variety of algorithms to choose from, but we’ll stick to using solely betweenness centrality for this tutorial.\n",
    "\n",
    "We’ll first create a metagraph graph for the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "r = mg.resolver\n",
    "passenger_flow_edge_map = r.wrappers.EdgeMap.PandasEdgeMap(passenger_flow_df, \n",
    "                                                           'ORIGIN_CITY_MARKET_ID', \n",
    "                                                           'DEST_CITY_MARKET_ID', \n",
    "                                                           'INVERSE_PASSENGER_COUNT',\n",
    "                                                           is_directed=True)\n",
    "passenger_flow_graph = r.algos.util.graph.build(passenger_flow_edge_map)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we use the inverse passenger count as the weights to ensure that the shortest paths are the paths that have the most passengers.\n",
    "\n",
    "Let’s calculate the betweenness centrality."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "betweenness_centrality = r.algos.centrality.betweenness(passenger_flow_graph, normalize=False)"
   ]
  },
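  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of what an unnormalized betweenness score counts, the sketch below brute-forces it on a made-up three-node graph. It is not how metagraph computes the score. It assumes every ordered pair of nodes has a single cheapest path (the usual definition splits credit between tied paths), and all names and numbers are hypothetical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from itertools import permutations\n",
    "\n",
    "# Made-up passenger counts in both directions (not taken from the dataset).\n",
    "toy_passengers = {\n",
    "    ('A', 'B'): 1000, ('B', 'A'): 1000,\n",
    "    ('B', 'C'): 500,  ('C', 'B'): 500,\n",
    "    ('A', 'C'): 2,    ('C', 'A'): 2,\n",
    "}\n",
    "# Same weighting as in the tutorial: weight = 1 / number_of_passengers.\n",
    "toy_weights = {edge: 1 / count for edge, count in toy_passengers.items()}\n",
    "toy_nodes = sorted({node for edge in toy_weights for node in edge})\n",
    "\n",
    "def path_cost(path):\n",
    "    # Sum the edge weights along a path; return None if a hop does not exist.\n",
    "    cost = 0.0\n",
    "    for hop in zip(path, path[1:]):\n",
    "        if hop not in toy_weights:\n",
    "            return None\n",
    "        cost += toy_weights[hop]\n",
    "    return cost\n",
    "\n",
    "toy_betweenness = {node: 0 for node in toy_nodes}\n",
    "for source, target in permutations(toy_nodes, 2):\n",
    "    others = [node for node in toy_nodes if node not in (source, target)]\n",
    "    # Enumerate every simple path from source to target (fine for tiny graphs only).\n",
    "    candidates = []\n",
    "    for k in range(len(others) + 1):\n",
    "        for middle in permutations(others, k):\n",
    "            path = (source, *middle, target)\n",
    "            cost = path_cost(path)\n",
    "            if cost is not None:\n",
    "                candidates.append((cost, path))\n",
    "    if candidates:\n",
    "        best_cost, best_path = min(candidates)\n",
    "        # Credit every intermediate node on the cheapest path.\n",
    "        for node in best_path[1:-1]:\n",
    "            toy_betweenness[node] += 1\n",
    "\n",
    "print(toy_betweenness)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here `B` scores 2 because the cheapest routes from `A` to `C` and from `C` to `A` both pass through it, while `A` and `C` score 0."
   ]
  },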
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let’s look at the results and find the highest scores (which would give us the city market IDs that are most travelled through)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<metagraph.plugins.numpy.types.NumpyNodeMap at 0x7fa4ea428450>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "number_of_best_scores = 15\n",
    "best_betweenness_centrality_node_vector = r.algos.util.nodemap.sort(betweenness_centrality, ascending=False, limit=number_of_best_scores)\n",
    "best_betweenness_centrality_node_set = r.algos.util.nodeset.from_vector(best_betweenness_centrality_node_vector)\n",
    "best_betweenness_centrality_node_to_score_map = r.algos.util.nodemap.select(betweenness_centrality, best_betweenness_centrality_node_set)\n",
    "best_betweenness_centrality_node_to_score_map"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now have a mapping between city market IDs and their centrality scores in `best_betweenness_centrality_node_to_score_map`, which is a `NumpyNodeMap`. Since `NumpyNodeMap` stores it's mapping in a non-trivial fashion for performance reasons, it's non-trivial to inspect its internals to view the mapping's values. Luckily, metagraph allows us to translate it to a Python dictionary, which is significantly easier to inspect."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{30070: 62402.0,\n",
       " 30113: 75327.0,\n",
       " 30154: 56833.0,\n",
       " 30194: 121807.0,\n",
       " 30299: 349232.0,\n",
       " 30325: 107586.0,\n",
       " 30397: 144922.0,\n",
       " 30466: 45699.0,\n",
       " 30559: 465677.0,\n",
       " 30977: 206250.0,\n",
       " 31517: 90409.0,\n",
       " 31703: 337885.0,\n",
       " 32457: 46094.0,\n",
       " 32467: 48068.0,\n",
       " 32575: 494817.0}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "best_betweenness_centrality_node_to_score_map = r.translate(best_betweenness_centrality_node_to_score_map, r.types.NodeMap.PythonNodeMapType)\n",
    "best_betweenness_centrality_node_to_score_map"
   ]
  },
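  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dictionary is keyed by city market ID, so it is not ordered by score. As a small sketch (the variable names are ours, and the three values are copied by hand from the output above), the entries can be ranked with Python’s built-in `sorted`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Three of the scores shown in the output above, copied by hand for illustration.\n",
    "example_scores = {30299: 349232.0, 30559: 465677.0, 32575: 494817.0}\n",
    "\n",
    "# Sort (city_market_id, score) pairs from highest to lowest score.\n",
    "ranked_scores = sorted(example_scores.items(), key=lambda item: item[1], reverse=True)\n",
    "\n",
    "for city_market_id, score in ranked_scores:\n",
    "    print(city_market_id, score)"
   ]
  },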
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have the city market IDs with the best scores, let’s find out which regions those city market IDs correspond to using the mapping from city market IDs to city names and airports we made earlier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>BETWEENNESS_CENTRALITY_SCORE</th>\n",
       "      <th>AIRPORT</th>\n",
       "      <th>CITY_NAME</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CITY_MARKET_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>32575</th>\n",
       "      <td>494817.0</td>\n",
       "      <td>{LAX, SMO, SNA, HHR, LGB, BUR, ONT, VNY}</td>\n",
       "      <td>{Santa Ana, CA, Los Angeles, CA, Van Nuys, CA,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30559</th>\n",
       "      <td>465677.0</td>\n",
       "      <td>{BFI, SEA, LKE, KEH}</td>\n",
       "      <td>{Kenmore, WA, Seattle, WA}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30299</th>\n",
       "      <td>349232.0</td>\n",
       "      <td>{ANC, DQL, MRI}</td>\n",
       "      <td>{Anchorage, AK}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31703</th>\n",
       "      <td>337885.0</td>\n",
       "      <td>{LGA, ISP, EWR, JRB, HPN, JRA, JFK, TSS, SWF}</td>\n",
       "      <td>{Islip, NY, New York, NY, Newark, NJ, Newburgh...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30977</th>\n",
       "      <td>206250.0</td>\n",
       "      <td>{LOT, GYY, ORD, PWK, DPA, MDW}</td>\n",
       "      <td>{Chicago/Romeoville, IL, Chicago, IL, Gary, IN}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30397</th>\n",
       "      <td>144922.0</td>\n",
       "      <td>{FTY, ATL, PDK, QMA}</td>\n",
       "      <td>{Kennesaw, GA, Atlanta, GA}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30194</th>\n",
       "      <td>121807.0</td>\n",
       "      <td>{RBD, ADS, FWH, FTW, AFW, DAL, DFW}</td>\n",
       "      <td>{Dallas/Fort Worth, TX, Dallas, TX, Fort Worth...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30325</th>\n",
       "      <td>107586.0</td>\n",
       "      <td>{APA, DEN}</td>\n",
       "      <td>{Denver, CO}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31517</th>\n",
       "      <td>90409.0</td>\n",
       "      <td>{FBK, EIL, MTX, A01, FAI}</td>\n",
       "      <td>{Fairbanks/Ft. Wainwright, AK, Fairbanks, AK}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30113</th>\n",
       "      <td>75327.0</td>\n",
       "      <td>{BET}</td>\n",
       "      <td>{Bethel, AK}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30070</th>\n",
       "      <td>62402.0</td>\n",
       "      <td>{KDK, ADQ}</td>\n",
       "      <td>{Kodiak, AK}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30154</th>\n",
       "      <td>56833.0</td>\n",
       "      <td>{ACK}</td>\n",
       "      <td>{Nantucket, MA}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32467</th>\n",
       "      <td>48068.0</td>\n",
       "      <td>{FXE, FLL, OPF, TMB, MIA, MPB}</td>\n",
       "      <td>{Miami, FL, Fort Lauderdale, FL}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32457</th>\n",
       "      <td>46094.0</td>\n",
       "      <td>{OAK, CCR, SFO, SJC}</td>\n",
       "      <td>{San Jose, CA, San Francisco, CA, Oakland, CA,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30466</th>\n",
       "      <td>45699.0</td>\n",
       "      <td>{AZA, AZ3, PHX, GYR, SCF}</td>\n",
       "      <td>{Goodyear, AZ, Phoenix, AZ, Glendale, AZ}</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                BETWEENNESS_CENTRALITY_SCORE  \\\n",
       "CITY_MARKET_ID                                 \n",
       "32575                               494817.0   \n",
       "30559                               465677.0   \n",
       "30299                               349232.0   \n",
       "31703                               337885.0   \n",
       "30977                               206250.0   \n",
       "30397                               144922.0   \n",
       "30194                               121807.0   \n",
       "30325                               107586.0   \n",
       "31517                                90409.0   \n",
       "30113                                75327.0   \n",
       "30070                                62402.0   \n",
       "30154                                56833.0   \n",
       "32467                                48068.0   \n",
       "32457                                46094.0   \n",
       "30466                                45699.0   \n",
       "\n",
       "                                                      AIRPORT  \\\n",
       "CITY_MARKET_ID                                                  \n",
       "32575                {LAX, SMO, SNA, HHR, LGB, BUR, ONT, VNY}   \n",
       "30559                                    {BFI, SEA, LKE, KEH}   \n",
       "30299                                         {ANC, DQL, MRI}   \n",
       "31703           {LGA, ISP, EWR, JRB, HPN, JRA, JFK, TSS, SWF}   \n",
       "30977                          {LOT, GYY, ORD, PWK, DPA, MDW}   \n",
       "30397                                    {FTY, ATL, PDK, QMA}   \n",
       "30194                     {RBD, ADS, FWH, FTW, AFW, DAL, DFW}   \n",
       "30325                                              {APA, DEN}   \n",
       "31517                               {FBK, EIL, MTX, A01, FAI}   \n",
       "30113                                                   {BET}   \n",
       "30070                                              {KDK, ADQ}   \n",
       "30154                                                   {ACK}   \n",
       "32467                          {FXE, FLL, OPF, TMB, MIA, MPB}   \n",
       "32457                                    {OAK, CCR, SFO, SJC}   \n",
       "30466                               {AZA, AZ3, PHX, GYR, SCF}   \n",
       "\n",
       "                                                        CITY_NAME  \n",
       "CITY_MARKET_ID                                                     \n",
       "32575           {Santa Ana, CA, Los Angeles, CA, Van Nuys, CA,...  \n",
       "30559                                  {Kenmore, WA, Seattle, WA}  \n",
       "30299                                             {Anchorage, AK}  \n",
       "31703           {Islip, NY, New York, NY, Newark, NJ, Newburgh...  \n",
       "30977             {Chicago/Romeoville, IL, Chicago, IL, Gary, IN}  \n",
       "30397                                 {Kennesaw, GA, Atlanta, GA}  \n",
       "30194           {Dallas/Fort Worth, TX, Dallas, TX, Fort Worth...  \n",
       "30325                                                {Denver, CO}  \n",
       "31517               {Fairbanks/Ft. Wainwright, AK, Fairbanks, AK}  \n",
       "30113                                                {Bethel, AK}  \n",
       "30070                                                {Kodiak, AK}  \n",
       "30154                                             {Nantucket, MA}  \n",
       "32467                            {Miami, FL, Fort Lauderdale, FL}  \n",
       "32457           {San Jose, CA, San Francisco, CA, Oakland, CA,...  \n",
       "30466                   {Goodyear, AZ, Phoenix, AZ, Glendale, AZ}  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "best_betweenness_centrality_scores_df = pd.DataFrame(best_betweenness_centrality_node_to_score_map.items()).rename(columns={0:'CITY_MARKET_ID', 1:'BETWEENNESS_CENTRALITY_SCORE'}).set_index('CITY_MARKET_ID')\n",
    "best_betweenness_centrality_scores_df.join(city_market_id_info_df).sort_values('BETWEENNESS_CENTRALITY_SCORE', ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is what we'd expect. Highly populated areas like Los Angeles are the most traveled through areas.\n",
    "\n",
    "However, it's surprising that Anchorage is more travelled through than a hub like Dallas!\n",
    "\n",
    "There’s a good explanation for Anchorage being a very travelled through region: Since Alaska is so sparsely populated, a well-connected road infrastructure was never built. Thus, to travel between cities in Alaska, air travel is often the only option. More information can be found [here](https://en.wikipedia.org/wiki/List_of_airports_in_Alaska)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  },
  "nbsphinx": {
   "execute": "never"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
