{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Use Case Tutorial 2: Kevin Bacon(s) of 2019\n",
    "\n",
    "This is a tutorial on how to find the most well-connected Netflix cast member of 2019.\n",
    "\n",
    "Bacon’s Law is a concept claiming that most people in the Hollywood film industry can be linked through their film roles to Kevin Bacon within six steps.\n",
    "\n",
    "We’ll go over how to find out who are the centers of the the Netflix film world, similar to how Bacon is the center of the Hollywood film industry.\n",
    "\n",
    "Well use a Kaggle dataset containing all the TV shows and movies on Netflix as of 2019. The dataset can be found [here](https://www.kaggle.com/shivamb/netflix-shows)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preprocess Data\n",
    "\n",
    "The raw data is in a tabular format with columns for movies, cast members, directors, release dates, countries of release, etc.\n",
    "\n",
    "We’ll want to put it in a graph-friendly format. In particular, we’ll want to convert it to an edge list format.\n",
    "\n",
    "First, we’ll import some necessary libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import networkx as nx\n",
    "import metagraph as mg\n",
    "from collections import Counter\n",
    "from typing import Union"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a look at the raw data provided."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>show_id</th>\n",
       "      <th>type</th>\n",
       "      <th>title</th>\n",
       "      <th>director</th>\n",
       "      <th>cast</th>\n",
       "      <th>country</th>\n",
       "      <th>date_added</th>\n",
       "      <th>release_year</th>\n",
       "      <th>rating</th>\n",
       "      <th>duration</th>\n",
       "      <th>listed_in</th>\n",
       "      <th>description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>81145628</td>\n",
       "      <td>Movie</td>\n",
       "      <td>Norm of the North: King Sized Adventure</td>\n",
       "      <td>Richard Finn, Tim Maltby</td>\n",
       "      <td>Alan Marriott, Andrew Toth, Brian Dobson, Cole...</td>\n",
       "      <td>United States, India, South Korea, China</td>\n",
       "      <td>September 9, 2019</td>\n",
       "      <td>2019</td>\n",
       "      <td>TV-PG</td>\n",
       "      <td>90 min</td>\n",
       "      <td>Children &amp; Family Movies, Comedies</td>\n",
       "      <td>Before planning an awesome wedding for his gra...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>80117401</td>\n",
       "      <td>Movie</td>\n",
       "      <td>Jandino: Whatever it Takes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Jandino Asporaat</td>\n",
       "      <td>United Kingdom</td>\n",
       "      <td>September 9, 2016</td>\n",
       "      <td>2016</td>\n",
       "      <td>TV-MA</td>\n",
       "      <td>94 min</td>\n",
       "      <td>Stand-Up Comedy</td>\n",
       "      <td>Jandino Asporaat riffs on the challenges of ra...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>70234439</td>\n",
       "      <td>TV Show</td>\n",
       "      <td>Transformers Prime</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Peter Cullen, Sumalee Montano, Frank Welker, J...</td>\n",
       "      <td>United States</td>\n",
       "      <td>September 8, 2018</td>\n",
       "      <td>2013</td>\n",
       "      <td>TV-Y7-FV</td>\n",
       "      <td>1 Season</td>\n",
       "      <td>Kids' TV</td>\n",
       "      <td>With the help of three human allies, the Autob...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>80058654</td>\n",
       "      <td>TV Show</td>\n",
       "      <td>Transformers: Robots in Disguise</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Will Friedle, Darren Criss, Constance Zimmer, ...</td>\n",
       "      <td>United States</td>\n",
       "      <td>September 8, 2018</td>\n",
       "      <td>2016</td>\n",
       "      <td>TV-Y7</td>\n",
       "      <td>1 Season</td>\n",
       "      <td>Kids' TV</td>\n",
       "      <td>When a prison ship crash unleashes hundreds of...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>80125979</td>\n",
       "      <td>Movie</td>\n",
       "      <td>#realityhigh</td>\n",
       "      <td>Fernando Lebrija</td>\n",
       "      <td>Nesta Cooper, Kate Walsh, John Michael Higgins...</td>\n",
       "      <td>United States</td>\n",
       "      <td>September 8, 2017</td>\n",
       "      <td>2017</td>\n",
       "      <td>TV-14</td>\n",
       "      <td>99 min</td>\n",
       "      <td>Comedies</td>\n",
       "      <td>When nerdy high schooler Dani finally attracts...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    show_id     type                                    title  \\\n",
       "0  81145628    Movie  Norm of the North: King Sized Adventure   \n",
       "1  80117401    Movie               Jandino: Whatever it Takes   \n",
       "2  70234439  TV Show                       Transformers Prime   \n",
       "3  80058654  TV Show         Transformers: Robots in Disguise   \n",
       "4  80125979    Movie                             #realityhigh   \n",
       "\n",
       "                   director  \\\n",
       "0  Richard Finn, Tim Maltby   \n",
       "1                       NaN   \n",
       "2                       NaN   \n",
       "3                       NaN   \n",
       "4          Fernando Lebrija   \n",
       "\n",
       "                                                cast  \\\n",
       "0  Alan Marriott, Andrew Toth, Brian Dobson, Cole...   \n",
       "1                                   Jandino Asporaat   \n",
       "2  Peter Cullen, Sumalee Montano, Frank Welker, J...   \n",
       "3  Will Friedle, Darren Criss, Constance Zimmer, ...   \n",
       "4  Nesta Cooper, Kate Walsh, John Michael Higgins...   \n",
       "\n",
       "                                    country         date_added  release_year  \\\n",
       "0  United States, India, South Korea, China  September 9, 2019          2019   \n",
       "1                            United Kingdom  September 9, 2016          2016   \n",
       "2                             United States  September 8, 2018          2013   \n",
       "3                             United States  September 8, 2018          2016   \n",
       "4                             United States  September 8, 2017          2017   \n",
       "\n",
       "     rating  duration                           listed_in  \\\n",
       "0     TV-PG    90 min  Children & Family Movies, Comedies   \n",
       "1     TV-MA    94 min                     Stand-Up Comedy   \n",
       "2  TV-Y7-FV  1 Season                            Kids' TV   \n",
       "3     TV-Y7  1 Season                            Kids' TV   \n",
       "4     TV-14    99 min                            Comedies   \n",
       "\n",
       "                                         description  \n",
       "0  Before planning an awesome wedding for his gra...  \n",
       "1  Jandino Asporaat riffs on the challenges of ra...  \n",
       "2  With the help of three human allies, the Autob...  \n",
       "3  When a prison ship crash unleashes hundreds of...  \n",
       "4  When nerdy high schooler Dani finally attracts...  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "RAW_DATA_CSV = './data/kevin_bacon/netflix_titles.csv' # https://www.kaggle.com/shivamb/netflix-shows\n",
    "raw_data_df = pd.read_csv(RAW_DATA_CSV)\n",
    "raw_data_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We’ll only consider movies since multiple cast members can work on the same TV show but may not ever see each other on set.\n",
    "\n",
    "We’ll also only consider U.S. movies since cast members from different countries often do not work together.\n",
    "\n",
    "We’ll necessarily need to remove any rows with missing data as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>show_id</th>\n",
       "      <th>title</th>\n",
       "      <th>director</th>\n",
       "      <th>cast</th>\n",
       "      <th>country</th>\n",
       "      <th>date_added</th>\n",
       "      <th>release_year</th>\n",
       "      <th>rating</th>\n",
       "      <th>duration</th>\n",
       "      <th>listed_in</th>\n",
       "      <th>description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>81145628</td>\n",
       "      <td>Norm of the North: King Sized Adventure</td>\n",
       "      <td>Richard Finn, Tim Maltby</td>\n",
       "      <td>Alan Marriott, Andrew Toth, Brian Dobson, Cole...</td>\n",
       "      <td>United States, India, South Korea, China</td>\n",
       "      <td>September 9, 2019</td>\n",
       "      <td>2019</td>\n",
       "      <td>TV-PG</td>\n",
       "      <td>90 min</td>\n",
       "      <td>Children &amp; Family Movies, Comedies</td>\n",
       "      <td>Before planning an awesome wedding for his gra...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>80125979</td>\n",
       "      <td>#realityhigh</td>\n",
       "      <td>Fernando Lebrija</td>\n",
       "      <td>Nesta Cooper, Kate Walsh, John Michael Higgins...</td>\n",
       "      <td>United States</td>\n",
       "      <td>September 8, 2017</td>\n",
       "      <td>2017</td>\n",
       "      <td>TV-14</td>\n",
       "      <td>99 min</td>\n",
       "      <td>Comedies</td>\n",
       "      <td>When nerdy high schooler Dani finally attracts...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>70304989</td>\n",
       "      <td>Automata</td>\n",
       "      <td>Gabe Ibáñez</td>\n",
       "      <td>Antonio Banderas, Dylan McDermott, Melanie Gri...</td>\n",
       "      <td>Bulgaria, United States, Spain, Canada</td>\n",
       "      <td>September 8, 2017</td>\n",
       "      <td>2014</td>\n",
       "      <td>R</td>\n",
       "      <td>110 min</td>\n",
       "      <td>International Movies, Sci-Fi &amp; Fantasy, Thrillers</td>\n",
       "      <td>In a dystopian future, an insurance adjuster f...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>70304990</td>\n",
       "      <td>Good People</td>\n",
       "      <td>Henrik Ruben Genz</td>\n",
       "      <td>James Franco, Kate Hudson, Tom Wilkinson, Omar...</td>\n",
       "      <td>United States, United Kingdom, Denmark, Sweden</td>\n",
       "      <td>September 8, 2017</td>\n",
       "      <td>2014</td>\n",
       "      <td>R</td>\n",
       "      <td>90 min</td>\n",
       "      <td>Action &amp; Adventure, Thrillers</td>\n",
       "      <td>A struggling couple can't believe their luck w...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>70299204</td>\n",
       "      <td>Kidnapping Mr. Heineken</td>\n",
       "      <td>Daniel Alfredson</td>\n",
       "      <td>Jim Sturgess, Sam Worthington, Ryan Kwanten, A...</td>\n",
       "      <td>Netherlands, Belgium, United Kingdom, United S...</td>\n",
       "      <td>September 8, 2017</td>\n",
       "      <td>2015</td>\n",
       "      <td>R</td>\n",
       "      <td>95 min</td>\n",
       "      <td>Action &amp; Adventure, Dramas, International Movies</td>\n",
       "      <td>When beer magnate Alfred \"Freddy\" Heineken is ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     show_id                                    title  \\\n",
       "0   81145628  Norm of the North: King Sized Adventure   \n",
       "4   80125979                             #realityhigh   \n",
       "6   70304989                                 Automata   \n",
       "9   70304990                              Good People   \n",
       "11  70299204                  Kidnapping Mr. Heineken   \n",
       "\n",
       "                    director  \\\n",
       "0   Richard Finn, Tim Maltby   \n",
       "4           Fernando Lebrija   \n",
       "6                Gabe Ibáñez   \n",
       "9          Henrik Ruben Genz   \n",
       "11          Daniel Alfredson   \n",
       "\n",
       "                                                 cast  \\\n",
       "0   Alan Marriott, Andrew Toth, Brian Dobson, Cole...   \n",
       "4   Nesta Cooper, Kate Walsh, John Michael Higgins...   \n",
       "6   Antonio Banderas, Dylan McDermott, Melanie Gri...   \n",
       "9   James Franco, Kate Hudson, Tom Wilkinson, Omar...   \n",
       "11  Jim Sturgess, Sam Worthington, Ryan Kwanten, A...   \n",
       "\n",
       "                                              country         date_added  \\\n",
       "0            United States, India, South Korea, China  September 9, 2019   \n",
       "4                                       United States  September 8, 2017   \n",
       "6              Bulgaria, United States, Spain, Canada  September 8, 2017   \n",
       "9      United States, United Kingdom, Denmark, Sweden  September 8, 2017   \n",
       "11  Netherlands, Belgium, United Kingdom, United S...  September 8, 2017   \n",
       "\n",
       "    release_year rating duration  \\\n",
       "0           2019  TV-PG   90 min   \n",
       "4           2017  TV-14   99 min   \n",
       "6           2014      R  110 min   \n",
       "9           2014      R   90 min   \n",
       "11          2015      R   95 min   \n",
       "\n",
       "                                            listed_in  \\\n",
       "0                  Children & Family Movies, Comedies   \n",
       "4                                            Comedies   \n",
       "6   International Movies, Sci-Fi & Fantasy, Thrillers   \n",
       "9                       Action & Adventure, Thrillers   \n",
       "11   Action & Adventure, Dramas, International Movies   \n",
       "\n",
       "                                          description  \n",
       "0   Before planning an awesome wedding for his gra...  \n",
       "4   When nerdy high schooler Dani finally attracts...  \n",
       "6   In a dystopian future, an insurance adjuster f...  \n",
       "9   A struggling couple can't believe their luck w...  \n",
       "11  When beer magnate Alfred \"Freddy\" Heineken is ...  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movies_df = raw_data_df[raw_data_df['type']=='Movie'].drop(columns=['type']).dropna()\n",
    "movies_df = movies_df[movies_df.country.str.contains('United States')]\n",
    "movies_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All the cast members for a movie are in the same cell.\n",
    "\n",
    "To have the data in an edge list format, we’ll need to use Pandas to reformat the data to have rows where each cast member cell contains exactly one cast member. This will mean that a movie will have multiple rows (one for each cast member)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cast</th>\n",
       "      <th>show_id</th>\n",
       "      <th>title</th>\n",
       "      <th>director</th>\n",
       "      <th>country</th>\n",
       "      <th>date_added</th>\n",
       "      <th>release_year</th>\n",
       "      <th>rating</th>\n",
       "      <th>duration</th>\n",
       "      <th>listed_in</th>\n",
       "      <th>description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Alan Marriott</td>\n",
       "      <td>81145628</td>\n",
       "      <td>Norm of the North: King Sized Adventure</td>\n",
       "      <td>Richard Finn, Tim Maltby</td>\n",
       "      <td>United States, India, South Korea, China</td>\n",
       "      <td>September 9, 2019</td>\n",
       "      <td>2019</td>\n",
       "      <td>TV-PG</td>\n",
       "      <td>90 min</td>\n",
       "      <td>Children &amp; Family Movies, Comedies</td>\n",
       "      <td>Before planning an awesome wedding for his gra...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Andrew Toth</td>\n",
       "      <td>81145628</td>\n",
       "      <td>Norm of the North: King Sized Adventure</td>\n",
       "      <td>Richard Finn, Tim Maltby</td>\n",
       "      <td>United States, India, South Korea, China</td>\n",
       "      <td>September 9, 2019</td>\n",
       "      <td>2019</td>\n",
       "      <td>TV-PG</td>\n",
       "      <td>90 min</td>\n",
       "      <td>Children &amp; Family Movies, Comedies</td>\n",
       "      <td>Before planning an awesome wedding for his gra...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Brian Dobson</td>\n",
       "      <td>81145628</td>\n",
       "      <td>Norm of the North: King Sized Adventure</td>\n",
       "      <td>Richard Finn, Tim Maltby</td>\n",
       "      <td>United States, India, South Korea, China</td>\n",
       "      <td>September 9, 2019</td>\n",
       "      <td>2019</td>\n",
       "      <td>TV-PG</td>\n",
       "      <td>90 min</td>\n",
       "      <td>Children &amp; Family Movies, Comedies</td>\n",
       "      <td>Before planning an awesome wedding for his gra...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Cole Howard</td>\n",
       "      <td>81145628</td>\n",
       "      <td>Norm of the North: King Sized Adventure</td>\n",
       "      <td>Richard Finn, Tim Maltby</td>\n",
       "      <td>United States, India, South Korea, China</td>\n",
       "      <td>September 9, 2019</td>\n",
       "      <td>2019</td>\n",
       "      <td>TV-PG</td>\n",
       "      <td>90 min</td>\n",
       "      <td>Children &amp; Family Movies, Comedies</td>\n",
       "      <td>Before planning an awesome wedding for his gra...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jennifer Cameron</td>\n",
       "      <td>81145628</td>\n",
       "      <td>Norm of the North: King Sized Adventure</td>\n",
       "      <td>Richard Finn, Tim Maltby</td>\n",
       "      <td>United States, India, South Korea, China</td>\n",
       "      <td>September 9, 2019</td>\n",
       "      <td>2019</td>\n",
       "      <td>TV-PG</td>\n",
       "      <td>90 min</td>\n",
       "      <td>Children &amp; Family Movies, Comedies</td>\n",
       "      <td>Before planning an awesome wedding for his gra...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               cast   show_id                                    title  \\\n",
       "0     Alan Marriott  81145628  Norm of the North: King Sized Adventure   \n",
       "0       Andrew Toth  81145628  Norm of the North: King Sized Adventure   \n",
       "0      Brian Dobson  81145628  Norm of the North: King Sized Adventure   \n",
       "0       Cole Howard  81145628  Norm of the North: King Sized Adventure   \n",
       "0  Jennifer Cameron  81145628  Norm of the North: King Sized Adventure   \n",
       "\n",
       "                   director                                   country  \\\n",
       "0  Richard Finn, Tim Maltby  United States, India, South Korea, China   \n",
       "0  Richard Finn, Tim Maltby  United States, India, South Korea, China   \n",
       "0  Richard Finn, Tim Maltby  United States, India, South Korea, China   \n",
       "0  Richard Finn, Tim Maltby  United States, India, South Korea, China   \n",
       "0  Richard Finn, Tim Maltby  United States, India, South Korea, China   \n",
       "\n",
       "          date_added  release_year rating duration  \\\n",
       "0  September 9, 2019          2019  TV-PG   90 min   \n",
       "0  September 9, 2019          2019  TV-PG   90 min   \n",
       "0  September 9, 2019          2019  TV-PG   90 min   \n",
       "0  September 9, 2019          2019  TV-PG   90 min   \n",
       "0  September 9, 2019          2019  TV-PG   90 min   \n",
       "\n",
       "                            listed_in  \\\n",
       "0  Children & Family Movies, Comedies   \n",
       "0  Children & Family Movies, Comedies   \n",
       "0  Children & Family Movies, Comedies   \n",
       "0  Children & Family Movies, Comedies   \n",
       "0  Children & Family Movies, Comedies   \n",
       "\n",
       "                                         description  \n",
       "0  Before planning an awesome wedding for his gra...  \n",
       "0  Before planning an awesome wedding for his gra...  \n",
       "0  Before planning an awesome wedding for his gra...  \n",
       "0  Before planning an awesome wedding for his gra...  \n",
       "0  Before planning an awesome wedding for his gra...  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def expand_dataframe_list_values_for_column(df: pd.DataFrame, column_name: Union[str, int]) -> pd.DataFrame:\n",
    "    return df.apply(lambda x: pd.Series(x[column_name].split(', ')), axis=1) \\\n",
    "                  .stack() \\\n",
    "                  .reset_index(level=1, drop=True) \\\n",
    "                  .to_frame(column_name) \\\n",
    "                  .join(df.drop(columns=[column_name]))\n",
    "\n",
    "movies_df = expand_dataframe_list_values_for_column(movies_df, 'cast')\n",
    "\n",
    "movies_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One thing to note is that there might be some movies, e.g. autobiographies, who have names that overlap with those of the actors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cast</th>\n",
       "      <th>show_id</th>\n",
       "      <th>title</th>\n",
       "      <th>director</th>\n",
       "      <th>country</th>\n",
       "      <th>date_added</th>\n",
       "      <th>release_year</th>\n",
       "      <th>rating</th>\n",
       "      <th>duration</th>\n",
       "      <th>listed_in</th>\n",
       "      <th>description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1383</th>\n",
       "      <td>Jimi Hendrix</td>\n",
       "      <td>653673</td>\n",
       "      <td>Jimi Hendrix</td>\n",
       "      <td>Joe Boyd</td>\n",
       "      <td>United States</td>\n",
       "      <td>November 1, 2019</td>\n",
       "      <td>1973</td>\n",
       "      <td>R</td>\n",
       "      <td>102 min</td>\n",
       "      <td>Documentaries, Music &amp; Musicals</td>\n",
       "      <td>Jimi Hendrix's family, friends, and fellow mus...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1383</th>\n",
       "      <td>Eric Clapton</td>\n",
       "      <td>653673</td>\n",
       "      <td>Jimi Hendrix</td>\n",
       "      <td>Joe Boyd</td>\n",
       "      <td>United States</td>\n",
       "      <td>November 1, 2019</td>\n",
       "      <td>1973</td>\n",
       "      <td>R</td>\n",
       "      <td>102 min</td>\n",
       "      <td>Documentaries, Music &amp; Musicals</td>\n",
       "      <td>Jimi Hendrix's family, friends, and fellow mus...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1383</th>\n",
       "      <td>Billy Cox</td>\n",
       "      <td>653673</td>\n",
       "      <td>Jimi Hendrix</td>\n",
       "      <td>Joe Boyd</td>\n",
       "      <td>United States</td>\n",
       "      <td>November 1, 2019</td>\n",
       "      <td>1973</td>\n",
       "      <td>R</td>\n",
       "      <td>102 min</td>\n",
       "      <td>Documentaries, Music &amp; Musicals</td>\n",
       "      <td>Jimi Hendrix's family, friends, and fellow mus...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1969</th>\n",
       "      <td>Benji</td>\n",
       "      <td>296682</td>\n",
       "      <td>Benji</td>\n",
       "      <td>Joe Camp</td>\n",
       "      <td>United States</td>\n",
       "      <td>March 6, 2018</td>\n",
       "      <td>1974</td>\n",
       "      <td>G</td>\n",
       "      <td>86 min</td>\n",
       "      <td>Children &amp; Family Movies, Classic Movies</td>\n",
       "      <td>After lovable abandoned mutt Benji is adopted ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1969</th>\n",
       "      <td>Deborah Walley</td>\n",
       "      <td>296682</td>\n",
       "      <td>Benji</td>\n",
       "      <td>Joe Camp</td>\n",
       "      <td>United States</td>\n",
       "      <td>March 6, 2018</td>\n",
       "      <td>1974</td>\n",
       "      <td>G</td>\n",
       "      <td>86 min</td>\n",
       "      <td>Children &amp; Family Movies, Classic Movies</td>\n",
       "      <td>After lovable abandoned mutt Benji is adopted ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                cast  show_id         title  director        country  \\\n",
       "1383    Jimi Hendrix   653673  Jimi Hendrix  Joe Boyd  United States   \n",
       "1383    Eric Clapton   653673  Jimi Hendrix  Joe Boyd  United States   \n",
       "1383       Billy Cox   653673  Jimi Hendrix  Joe Boyd  United States   \n",
       "1969           Benji   296682         Benji  Joe Camp  United States   \n",
       "1969  Deborah Walley   296682         Benji  Joe Camp  United States   \n",
       "\n",
       "            date_added  release_year rating duration  \\\n",
       "1383  November 1, 2019          1973      R  102 min   \n",
       "1383  November 1, 2019          1973      R  102 min   \n",
       "1383  November 1, 2019          1973      R  102 min   \n",
       "1969     March 6, 2018          1974      G   86 min   \n",
       "1969     March 6, 2018          1974      G   86 min   \n",
       "\n",
       "                                     listed_in  \\\n",
       "1383           Documentaries, Music & Musicals   \n",
       "1383           Documentaries, Music & Musicals   \n",
       "1383           Documentaries, Music & Musicals   \n",
       "1969  Children & Family Movies, Classic Movies   \n",
       "1969  Children & Family Movies, Classic Movies   \n",
       "\n",
       "                                            description  \n",
       "1383  Jimi Hendrix's family, friends, and fellow mus...  \n",
       "1383  Jimi Hendrix's family, friends, and fellow mus...  \n",
       "1383  Jimi Hendrix's family, friends, and fellow mus...  \n",
       "1969  After lovable abandoned mutt Benji is adopted ...  \n",
       "1969  After lovable abandoned mutt Benji is adopted ...  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movies_df[movies_df.title.isin(movies_df.cast)].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let’s make sure that the names of movies and actors don’t overlap so that we don’t have any problems with name collisions. We’ll accomplish this by assigning actor IDs and movie IDs (which do not overlap)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "actors = movies_df.cast.unique()\n",
    "movies = movies_df.title.unique()\n",
    "\n",
    "actor_id_to_actor = actors\n",
    "actor_to_id = dict(map(reversed, enumerate(actors)))\n",
    "\n",
    "movie_id_to_movie = dict(((len(actors)+relative_movie_id, movie) for relative_movie_id, movie in enumerate(movies)))\n",
    "movie_to_id = {movie: movie_id for movie_id, movie in movie_id_to_movie.items()}\n",
    "\n",
    "movies_df['actor_id'] = movies_df.cast.map(lambda actor: actor_to_id[actor])\n",
    "movies_df['movie_id'] = movies_df.title.map(lambda movie: movie_to_id[movie])\n",
    "\n",
    "assert len(set(movies_df.actor_id).intersection(movies_df.movie_id)) == 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have the data in an edgelist format (where edges connect cast members to movies) we want to put the data into a graph format. Since actors and movies are disjoint, we’ll create a bipartite graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "nx_actor_to_movie_graph = nx.from_pandas_edgelist(movies_df, 'actor_id', 'movie_id')\n",
    "\n",
    "actor_ids = list(actor_to_id.values())\n",
    "movie_ids = list(movie_to_id.values())\n",
    "\n",
    "actor_id_to_movie_id_graph = mg.wrappers.BipartiteGraph.NetworkXBipartiteGraph(nx_actor_to_movie_graph, [actor_ids, movie_ids])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the above graph is a bipartite graph of cast members and movies. Since we want a graph where the edges connect actors who’ve worked together on a movie, we’ll use bipartite graph projection to generate an actor-to-actor graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "actor_partition_label = 0\n",
    "actor_id_to_actor_id_graph = mg.algos.bipartite.graph_projection(actor_id_to_movie_id_graph, actor_partition_label)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The actor partition label is 0 because the actors are the 0th element of `[actor_ids, movie_ids]` that was passed into the bipartite graph initializer, i.e. `mg.wrappers.BipartiteGraph.NetworkXBipartiteGraph`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Find The Kevin Bacon(s)\n",
    "\n",
    "We’re going to find the Kevin Bacons.\n",
    "\n",
    "We’ll refer to the maximum number of hops a cast member needs to reach all other cast members as the Kevin Bacon distance.\n",
    "\n",
    "We’ll refer to the cast members who have the smallest Kevin Bacon distance the Kevin Bacons.\n",
    "\n",
    "To find the Kevin Bacons, we’ll first have to find all the connected components (since we don’t exactly have a Kevin Bacon if our graph is disconnected)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "cc_node_label_mapping = r.algos.clustering.connected_components(actor_id_to_actor_id_graph)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a look at the connected component results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(cc_node_label_mapping)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0, 0),\n",
       " (1, 0),\n",
       " (2, 0),\n",
       " (3, 0),\n",
       " (4, 0),\n",
       " (5, 0),\n",
       " (6, 0),\n",
       " (7, 0),\n",
       " (8, 0),\n",
       " (9, 0)]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(cc_node_label_mapping.items())[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "249"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(set(cc_node_label_mapping.values())) # number of connected components"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It looks like we have 249 connected components. Since we can't find the Kevin Bacon of a disconnected graph, let's find the Kevin Bacon of the largest connected component. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "label_counts = Counter()\n",
    "for label in cc_node_label_mapping.values():\n",
    "    label_counts[label] += 1\n",
    "largest_cc_label, _ = max(label_counts.items(), key = lambda pair: pair[1])\n",
    "largest_cc_node_set = {node for node, label in cc_node_label_mapping.items() if label == largest_cc_label}\n",
    "largest_cc_subgraph = mg.algos.subgraph.extract_subgraph(actor_id_to_actor_id_graph, largest_cc_node_set)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now need to find each actor’s Kevin Bacon distance.\n",
    "\n",
    "Our graph is currently a NetworkX graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "metagraph.plugins.networkx.types.NetworkXGraph"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(largest_cc_subgraph)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "NetworkX represents graphs using hash tables, which can be slow due to spatial locality issues. We can use a Scipy graph, which represents graphs via sparse adjacency matrices, to achieve better spatial locality and faster runtime performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "largest_cc_subgraph = largest_cc_subgraph.translate(\"ScipyGraph\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to compute each actor's Kevin Bacon distance, we'll need to find the shortest path lengths between every pair of actors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "largest_cc_subgraph = mg.algos.util.graph.assign_uniform_weight(largest_cc_subgraph, 1.0)\n",
    "_, lengths_graph = mg.algos.traversal.all_pairs_shortest_paths(largest_cc_subgraph)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`lengths_graph` is a fully connected graph where each edge weight between two nodes represents the length in the original graph between the two nodes. We can calculate the Kevin Bacon distance of an actor ID node by taking the max over all the node's edges. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "actor_id_to_kevin_bacon_distance = mg.algos.util.graph.aggregate_edges(lengths_graph, np.maximum, 0, True, True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we have all the Kevin Bacon distances from every cast member, we can find the smallest Kevin Bacon distance. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "min_kevin_bacon_dist = mg.algos.util.nodemap.reduce(actor_id_to_kevin_bacon_distance, np.minimum)\n",
    "min_kevin_bacon_dist"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From here, we can determine the Kevin Bacon(s)!\n",
    "\n",
    "We'll do this by finding all the actors who have a Kevin Bacon distance equal to `min_kevin_bacon_dist`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['John Michael Higgins',\n",
       " 'Robert Forster',\n",
       " 'Jim Sturgess',\n",
       " 'Sam Worthington',\n",
       " 'Ryan Kwanten',\n",
       " 'Anthony Hopkins',\n",
       " 'Ben Kingsley',\n",
       " 'Nicolas Cage',\n",
       " 'Lindsay Burdge',\n",
       " 'Jason Sudeikis']"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kevin_bacon_ids = mg.algos.util.nodemap.filter(actor_id_to_kevin_bacon_distance, lambda distance: distance == min_kevin_bacon_dist)\n",
    "kevin_bacons = [actor_id_to_actor[actor_id] for actor_id in kevin_bacon_ids.value]\n",
    "kevin_bacons[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see what fraction of the largest connected component in the Netflix 2019 film industry the Kevin Bacons make up. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "295"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(kevin_bacons)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7824"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(largest_cc_node_set)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.037704498977505115"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(kevin_bacons) / len(largest_cc_node_set)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It turns out that 3.8% of Netflix’s largest connected component are Kevin Bacons. It seems that being a Kevin Bacon in the Netflix film world is not as rare as one might initially believe!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  },
  "nbsphinx": {
   "execute": "never"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}