This page was generated from getting_started/use_case_2_netflix_kevin_bacon.ipynb.

Use Case Tutorial 2: Kevin Bacon(s) of 2019

This is a tutorial on how to find the most well-connected Netflix cast member of 2019.

Bacon’s Law is a concept claiming that most people in the Hollywood film industry can be linked through their film roles to Kevin Bacon within six steps.

We’ll go over how to find the centers of the Netflix film world, similar to how Bacon is the center of the Hollywood film industry.

We’ll use a Kaggle dataset containing all the TV shows and movies on Netflix as of 2019. The dataset can be found at https://www.kaggle.com/shivamb/netflix-shows.

Preprocess Data

The raw data is in a tabular format with columns for movies, cast members, directors, release dates, countries of release, etc.

We’ll want to put it in a graph-friendly format. In particular, we’ll want to convert it to an edge list format.

First, we’ll import some necessary libraries.

import pandas as pd
import numpy as np
import networkx as nx
import metagraph as mg
from collections import Counter
from typing import Union

Let’s take a look at the raw data provided.

RAW_DATA_CSV = './data/kevin_bacon/netflix_titles.csv' # https://www.kaggle.com/shivamb/netflix-shows
raw_data_df = pd.read_csv(RAW_DATA_CSV)
raw_data_df.head()

	show_id	type	title	director	cast	country	date_added	release_year	rating	duration	listed_in	description
0	81145628	Movie	Norm of the North: King Sized Adventure	Richard Finn, Tim Maltby	Alan Marriott, Andrew Toth, Brian Dobson, Cole...	United States, India, South Korea, China	September 9, 2019	2019	TV-PG	90 min	Children & Family Movies, Comedies	Before planning an awesome wedding for his gra...
1	80117401	Movie	Jandino: Whatever it Takes	NaN	Jandino Asporaat	United Kingdom	September 9, 2016	2016	TV-MA	94 min	Stand-Up Comedy	Jandino Asporaat riffs on the challenges of ra...
2	70234439	TV Show	Transformers Prime	NaN	Peter Cullen, Sumalee Montano, Frank Welker, J...	United States	September 8, 2018	2013	TV-Y7-FV	1 Season	Kids' TV	With the help of three human allies, the Autob...
3	80058654	TV Show	Transformers: Robots in Disguise	NaN	Will Friedle, Darren Criss, Constance Zimmer, ...	United States	September 8, 2018	2016	TV-Y7	1 Season	Kids' TV	When a prison ship crash unleashes hundreds of...
4	80125979	Movie	#realityhigh	Fernando Lebrija	Nesta Cooper, Kate Walsh, John Michael Higgins...	United States	September 8, 2017	2017	TV-14	99 min	Comedies	When nerdy high schooler Dani finally attracts...

We’ll only consider movies since multiple cast members can work on the same TV show but may not ever see each other on set.

We’ll also only consider U.S. movies since cast members from different countries often do not work together.

We’ll also need to remove any rows with missing data.

movies_df = raw_data_df[raw_data_df['type']=='Movie'].drop(columns=['type']).dropna()
movies_df = movies_df[movies_df.country.str.contains('United States')]
movies_df.head()

	show_id	title	director	cast	country	date_added	release_year	rating	duration	listed_in	description
0	81145628	Norm of the North: King Sized Adventure	Richard Finn, Tim Maltby	Alan Marriott, Andrew Toth, Brian Dobson, Cole...	United States, India, South Korea, China	September 9, 2019	2019	TV-PG	90 min	Children & Family Movies, Comedies	Before planning an awesome wedding for his gra...
4	80125979	#realityhigh	Fernando Lebrija	Nesta Cooper, Kate Walsh, John Michael Higgins...	United States	September 8, 2017	2017	TV-14	99 min	Comedies	When nerdy high schooler Dani finally attracts...
6	70304989	Automata	Gabe Ibáñez	Antonio Banderas, Dylan McDermott, Melanie Gri...	Bulgaria, United States, Spain, Canada	September 8, 2017	2014	R	110 min	International Movies, Sci-Fi & Fantasy, Thrillers	In a dystopian future, an insurance adjuster f...
9	70304990	Good People	Henrik Ruben Genz	James Franco, Kate Hudson, Tom Wilkinson, Omar...	United States, United Kingdom, Denmark, Sweden	September 8, 2017	2014	R	90 min	Action & Adventure, Thrillers	A struggling couple can't believe their luck w...
11	70299204	Kidnapping Mr. Heineken	Daniel Alfredson	Jim Sturgess, Sam Worthington, Ryan Kwanten, A...	Netherlands, Belgium, United Kingdom, United S...	September 8, 2017	2015	R	95 min	Action & Adventure, Dramas, International Movies	When beer magnate Alfred "Freddy" Heineken is ...
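
The filtering steps above can be tried on a small table first. The sketch below uses an invented four-row DataFrame (the titles, directors, and cast values are placeholders, not rows from the Netflix dataset) to show that `dropna()` removes any row with at least one missing value and that `str.contains` keeps co-productions that list the United States among several countries.

```python
import pandas as pd

# Toy stand-in for the Netflix table; every value here is invented for illustration.
toy_df = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "title": ["A", "B", "C", "D"],
    "director": ["Dir 1", None, None, "Dir 4"],
    "cast": ["X, Y", "Z", "X", "Y, Z"],
    "country": ["United States, India", "United States", "United States", "France"],
})

# Keep movies only, then drop every row that has at least one missing value.
toy_movies = toy_df[toy_df["type"] == "Movie"].drop(columns=["type"]).dropna()

# Keep rows whose country string mentions the United States (co-productions included).
toy_movies = toy_movies[toy_movies["country"].str.contains("United States")]

print(toy_movies["title"].tolist())
```

Only movie "A" survives: "B" is a TV show, "C" has no director, and "D" was not released in the United States.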

All the cast members for a movie are in the same cell.

To get the data into an edge list format, we’ll use Pandas to reshape it so that each cast cell contains exactly one cast member. As a result, each movie will span multiple rows (one per cast member).

def expand_dataframe_list_values_for_column(df: pd.DataFrame, column_name: Union[str, int]) -> pd.DataFrame:
    return df.apply(lambda x: pd.Series(x[column_name].split(', ')), axis=1) \
                  .stack() \
                  .reset_index(level=1, drop=True) \
                  .to_frame(column_name) \
                  .join(df.drop(columns=[column_name]))

movies_df = expand_dataframe_list_values_for_column(movies_df, 'cast')

movies_df.head()

cast	show_id	title	director	country	date_added	release_year	rating	duration	listed_in	description
Alan Marriott	81145628	Norm of the North: King Sized Adventure	Richard Finn, Tim Maltby	United States, India, South Korea, China	September 9, 2019	2019	TV-PG	90 min	Children & Family Movies, Comedies	Before planning an awesome wedding for his gra...
Andrew Toth	81145628	Norm of the North: King Sized Adventure	Richard Finn, Tim Maltby	United States, India, South Korea, China	September 9, 2019	2019	TV-PG	90 min	Children & Family Movies, Comedies	Before planning an awesome wedding for his gra...
Brian Dobson	81145628	Norm of the North: King Sized Adventure	Richard Finn, Tim Maltby	United States, India, South Korea, China	September 9, 2019	2019	TV-PG	90 min	Children & Family Movies, Comedies	Before planning an awesome wedding for his gra...
Cole Howard	81145628	Norm of the North: King Sized Adventure	Richard Finn, Tim Maltby	United States, India, South Korea, China	September 9, 2019	2019	TV-PG	90 min	Children & Family Movies, Comedies	Before planning an awesome wedding for his gra...
Jennifer Cameron	81145628	Norm of the North: King Sized Adventure	Richard Finn, Tim Maltby	United States, India, South Korea, China	September 9, 2019	2019	TV-PG	90 min	Children & Family Movies, Comedies	Before planning an awesome wedding for his gra...
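
As an aside, recent versions of Pandas (0.25 and later) provide `DataFrame.explode`, which can perform the same reshaping more directly. The following is a minimal sketch on an invented two-row table, not the tutorial's own method; unlike the helper above, it keeps the original column order rather than moving `cast` to the front.

```python
import pandas as pd

# Invented toy data: two titles with comma-separated cast lists.
toy_df = pd.DataFrame({
    "title": ["A", "B"],
    "cast": ["X, Y", "Z"],
})

# Split each cast string into a list, then emit one row per list element.
exploded = toy_df.assign(cast=toy_df["cast"].str.split(", ")).explode("cast")

print(exploded)
```

Each exploded row keeps the index of the row it came from, so title "A" appears twice under index 0.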

One thing to note is that some movies, e.g. biographical documentaries, have titles that are identical to the names of cast members.

movies_df[movies_df.title.isin(movies_df.cast)].head()

	cast	show_id	title	director	country	date_added	release_year	rating	duration	listed_in	description
1383	Jimi Hendrix	653673	Jimi Hendrix	Joe Boyd	United States	November 1, 2019	1973	R	102 min	Documentaries, Music & Musicals	Jimi Hendrix's family, friends, and fellow mus...
1383	Eric Clapton	653673	Jimi Hendrix	Joe Boyd	United States	November 1, 2019	1973	R	102 min	Documentaries, Music & Musicals	Jimi Hendrix's family, friends, and fellow mus...
1383	Billy Cox	653673	Jimi Hendrix	Joe Boyd	United States	November 1, 2019	1973	R	102 min	Documentaries, Music & Musicals	Jimi Hendrix's family, friends, and fellow mus...
1969	Benji	296682	Benji	Joe Camp	United States	March 6, 2018	1974	G	86 min	Children & Family Movies, Classic Movies	After lovable abandoned mutt Benji is adopted ...
1969	Deborah Walley	296682	Benji	Joe Camp	United States	March 6, 2018	1974	G	86 min	Children & Family Movies, Classic Movies	After lovable abandoned mutt Benji is adopted ...

Let’s make sure that the names of movies and actors don’t overlap so that we don’t have any problems with name collisions. We’ll accomplish this by assigning actor IDs and movie IDs (which do not overlap).

actors = movies_df.cast.unique()
movies = movies_df.title.unique()

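# actors is an array, so indexing it by actor ID returns the actor's name
actor_id_to_actor = actors
actor_to_id = {actor: actor_id for actor_id, actor in enumerate(actors)}

# Movie IDs start at len(actors) so that they never collide with actor IDs
movie_id_to_movie = {len(actors) + relative_movie_id: movie for relative_movie_id, movie in enumerate(movies)}
movie_to_id = {movie: movie_id for movie_id, movie in movie_id_to_movie.items()}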

movies_df['actor_id'] = movies_df.cast.map(lambda actor: actor_to_id[actor])
movies_df['movie_id'] = movies_df.title.map(lambda movie: movie_to_id[movie])

assert len(set(movies_df.actor_id).intersection(movies_df.movie_id)) == 0
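
To see why offsetting the movie IDs avoids collisions, here is a small sketch with two actors and two movies, where one movie title is identical to an actor's name. The variable names prefixed with `toy_` are introduced only for this illustration.

```python
# Toy names: "Jimi Hendrix" is both a cast member and a movie title.
toy_actors = ["Jimi Hendrix", "Eric Clapton"]
toy_movies = ["Jimi Hendrix", "Benji"]

# Actors get IDs 0..len(toy_actors)-1.
toy_actor_to_id = {actor: actor_id for actor_id, actor in enumerate(toy_actors)}

# Movies get IDs starting right after the last actor ID.
toy_movie_to_id = {movie: len(toy_actors) + i for i, movie in enumerate(toy_movies)}

print(toy_actor_to_id["Jimi Hendrix"], toy_movie_to_id["Jimi Hendrix"])
```

The same string maps to ID 0 as an actor and ID 2 as a movie, so the two roles become distinct nodes in the graph.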

Now that we have the data in an edge list format (where edges connect cast members to movies), we want to put the data into a graph format. Since every edge connects an actor to a movie and the actor IDs and movie IDs are disjoint, we’ll create a bipartite graph.

nx_actor_to_movie_graph = nx.from_pandas_edgelist(movies_df, 'actor_id', 'movie_id')

actor_ids = list(actor_to_id.values())
movie_ids = list(movie_to_id.values())

r = mg.resolver
actor_id_to_movie_id_graph = r.wrappers.BipartiteGraph.NetworkXBipartiteGraph(nx_actor_to_movie_graph, [actor_ids, movie_ids])

Note that the above graph is a bipartite graph of cast members and movies. Since we want a graph where the edges connect actors who’ve worked together on a movie, we’ll use bipartite graph projection to generate an actor-to-actor graph.

actor_partition_label = 0
actor_id_to_actor_id_graph = r.algos.bipartite.graph_projection(actor_id_to_movie_id_graph, actor_partition_label)

The actor partition label is 0 because the actors are the 0th element of [actor_ids, movie_ids] that was passed into the bipartite graph initializer, i.e. r.wrappers.BipartiteGraph.NetworkXBipartiteGraph.
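
To illustrate what bipartite projection does, the sketch below uses plain NetworkX (`networkx.algorithms.bipartite.projected_graph`) on an invented graph of three actors and two movies. It is only an analogy for the Metagraph call above, not a description of how Metagraph implements it.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Invented toy data: actors 0, 1, 2 and movies 3, 4.
toy_actor_ids = [0, 1, 2]
toy_movie_ids = [3, 4]

toy_bipartite = nx.Graph()
toy_bipartite.add_nodes_from(toy_actor_ids, bipartite=0)
toy_bipartite.add_nodes_from(toy_movie_ids, bipartite=1)
# Actors 0 and 1 appear in movie 3; actors 1 and 2 appear in movie 4.
toy_bipartite.add_edges_from([(0, 3), (1, 3), (1, 4), (2, 4)])

# Project onto the actors: two actors are linked if they share at least one movie.
toy_actor_graph = bipartite.projected_graph(toy_bipartite, toy_actor_ids)

print(toy_actor_graph.has_edge(0, 1), toy_actor_graph.has_edge(0, 2))
```

Actors 0 and 1 are connected because they share movie 3, while actors 0 and 2 never appear together and so have no direct edge.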

Find The Kevin Bacon(s)

We’re going to find the Kevin Bacons.

We’ll refer to the maximum number of hops a cast member needs to reach any other cast member as that cast member’s Kevin Bacon distance (in graph theory terms, the node’s eccentricity).

We’ll refer to the cast members who have the smallest Kevin Bacon distance as the Kevin Bacons.

To find the Kevin Bacons, we’ll first have to find all the connected components, since no cast member can reach every other cast member if the graph is disconnected.
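
These two definitions correspond to the standard graph notions of eccentricity and graph center. As a small illustration, assuming plain NetworkX rather than the Metagraph calls used in this tutorial, here they are on a five-node path graph:

```python
import networkx as nx

# Path graph 0 - 1 - 2 - 3 - 4 (a toy example, not the actor graph).
toy_path = nx.path_graph(5)

# Eccentricity of a node: the largest hop count from it to any other node.
toy_eccentricity = nx.eccentricity(toy_path)

# Center: the nodes whose eccentricity is minimal.
toy_center = nx.center(toy_path)

print(toy_eccentricity)
print(toy_center)
```

The end nodes need four hops to reach each other, while the middle node reaches everything within two hops, so node 2 is the only "Kevin Bacon" of this toy graph.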

cc_node_label_mapping = r.algos.clustering.connected_components(actor_id_to_actor_id_graph)

Let’s take a look at the connected component results.

type(cc_node_label_mapping)

dict

list(cc_node_label_mapping.items())[:10]

[(0, 0),
 (1, 0),
 (2, 0),
 (3, 0),
 (4, 0),
 (5, 0),
 (6, 0),
 (7, 0),
 (8, 0),
 (9, 0)]

len(set(cc_node_label_mapping.values())) # number of connected components

It looks like we have 249 connected components. Since we can’t find the Kevin Bacon of a disconnected graph, let’s find the Kevin Bacon of the largest connected component.

label_counts = Counter()
for label in cc_node_label_mapping.values():
    label_counts[label] += 1
largest_cc_label, _ = max(label_counts.items(), key = lambda pair: pair[1])
largest_cc_node_set = {node for node, label in cc_node_label_mapping.items() if label == largest_cc_label}
largest_cc_subgraph = r.algos.subgraph.extract_subgraph(actor_id_to_actor_id_graph, largest_cc_node_set)
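
The label counting above can also be written more compactly with `Counter.most_common`. The sketch below runs on an invented node-to-label mapping (the `toy_` names are not part of the tutorial) and selects the nodes of the largest component in the same way.

```python
from collections import Counter

# Invented mapping from node ID to connected component label.
toy_cc_labels = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2}

# Counter counts how many nodes carry each label in one pass.
toy_label_counts = Counter(toy_cc_labels.values())

# most_common(1) returns a list with the single (label, count) pair of highest count.
toy_largest_label, toy_largest_size = toy_label_counts.most_common(1)[0]

toy_largest_nodes = {node for node, label in toy_cc_labels.items() if label == toy_largest_label}

print(toy_largest_label, toy_largest_size, toy_largest_nodes)
```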

We now need to find each actor’s Kevin Bacon distance.

Our graph is currently a NetworkX graph.

type(largest_cc_subgraph)

metagraph.plugins.networkx.types.NetworkXGraph

NetworkX represents graphs using nested hash tables (dictionaries), which can be slow because of poor spatial locality. We can instead use a SciPy graph, which represents the graph as a sparse adjacency matrix, to get better spatial locality and faster runtime performance.

largest_cc_subgraph = r.translate(largest_cc_subgraph, r.wrappers.Graph.ScipyGraph)

In order to compute each actor’s Kevin Bacon distance, we’ll need to find the shortest path lengths between every pair of actors.

largest_cc_subgraph = r.algos.util.graph.assign_uniform_weight(largest_cc_subgraph, 1.0)
_, lengths_graph = r.algos.traversal.all_pairs_shortest_paths(largest_cc_subgraph)

lengths_graph is a fully connected graph in which the weight of the edge between two nodes is the shortest path length between those two nodes in the original graph. We can calculate the Kevin Bacon distance of an actor ID node by taking the maximum over the weights of all of that node’s edges.

actor_id_to_kevin_bacon_distance = r.algos.util.graph.aggregate_edges(lengths_graph, np.maximum, 0, True, True)
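
For intuition, the same two steps (all-pairs shortest path lengths, then a per-node maximum) can be sketched directly with SciPy on a tiny sparse adjacency matrix. This uses `scipy.sparse.csgraph.shortest_path` on an invented four-node path graph and is not necessarily what Metagraph does internally.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

# Toy path graph 0 - 1 - 2 - 3 stored as a sparse adjacency matrix.
rows = [0, 1, 2]
cols = [1, 2, 3]
weights = [1.0, 1.0, 1.0]
toy_adjacency = csr_matrix((weights, (rows, cols)), shape=(4, 4))

# All-pairs hop counts; directed=False treats every stored edge as undirected.
toy_lengths = shortest_path(toy_adjacency, directed=False, unweighted=True)

# Kevin Bacon distance of each node: the largest entry in its row.
toy_kevin_bacon_distance = toy_lengths.max(axis=1)

print(toy_kevin_bacon_distance)
```

Note that the dense distance matrix has one entry per pair of nodes, so memory use grows quadratically with the number of cast members.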

Once we have the Kevin Bacon distance of every cast member, we can find the smallest Kevin Bacon distance.

min_kevin_bacon_dist = r.algos.util.nodemap.reduce(actor_id_to_kevin_bacon_distance, np.minimum)
min_kevin_bacon_dist

From here, we can determine the Kevin Bacon(s)!

We’ll do this by finding all the actors who have a Kevin Bacon distance equal to min_kevin_bacon_dist.

kevin_bacon_ids = r.algos.util.nodemap.filter(actor_id_to_kevin_bacon_distance, lambda distance: distance == min_kevin_bacon_dist)
kevin_bacons = [actor_id_to_actor[actor_id] for actor_id in kevin_bacon_ids.value]
kevin_bacons[:10]

['John Michael Higgins',
 'Robert Forster',
 'Jim Sturgess',
 'Sam Worthington',
 'Ryan Kwanten',
 'Anthony Hopkins',
 'Ben Kingsley',
 'Nicolas Cage',
 'Lindsay Burdge',
 'Jason Sudeikis']
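
The reduce-then-filter pattern used above can be mirrored with a plain Python dictionary. The sketch below uses invented names and distances, purely to show that several nodes can tie for the minimum.

```python
# Invented mapping from cast member to Kevin Bacon distance.
toy_distances = {"a": 3, "b": 2, "c": 2, "d": 3}

# Reduce: the smallest Kevin Bacon distance over all cast members.
toy_min_distance = min(toy_distances.values())

# Filter: every cast member whose distance equals that minimum.
toy_kevin_bacons = [name for name, distance in toy_distances.items() if distance == toy_min_distance]

print(toy_min_distance, toy_kevin_bacons)
```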

Let’s see what fraction of the largest connected component in the Netflix 2019 film industry the Kevin Bacons make up.

len(kevin_bacons)

len(largest_cc_node_set)

len(kevin_bacons) / len(largest_cc_node_set)

0.037704498977505115

It turns out that about 3.8% of the cast members in Netflix’s largest connected component are Kevin Bacons. It seems that being a Kevin Bacon in the Netflix film world is not as rare as one might initially believe!
