This page was generated from getting_started/use_case_3_ecommerce_customer_interests.ipynb.
Use Case Tutorial 3: Customer Interest Clustering¶
This is a tutorial on how to cluster customers based on their interests and purchases, an analysis that marketing teams are frequently interested in.
We’ll show how graph analytics can be used to gain insights about the interests of customers by finding communities of customers who’ve bought similar products.
We’ll accomplish this by creating a bipartite graph of customers and products, using a graph projection to create a graph of customers linked to other customers who’ve bought the same product, and using Louvain community detection to find the communities.
We’ll be using ecommerce transaction data from a U.K. retailer provided by the University of California, Irvine. The data can be found here.
Data Preprocessing¶
Let’s first look at the data.
First, we’ll need to import some libraries.
import metagraph as mg
import pandas as pd
import networkx as nx
Let’s see what the data looks like.
RAW_DATA_CSV = './data/ecommerce/data.csv' # https://www.kaggle.com/carrie1/ecommerce-data
data_df = pd.read_csv(RAW_DATA_CSV, encoding="ISO-8859-1")
data_df.head()
|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
Let’s clean the data: parse the invoice dates, drop rows with missing customer IDs, and verify that no missing values remain.
data_df.InvoiceDate = pd.to_datetime(data_df.InvoiceDate, format="%m/%d/%Y %H:%M")
# NaN != NaN, so this drops exactly the rows whose CustomerID is missing
data_df.drop(data_df.index[data_df.CustomerID != data_df.CustomerID], inplace=True)
assert len(data_df[data_df.isnull().any(axis=1)]) == 0, "Raw data contains NaN"
data_df = data_df.astype({'CustomerID': int}, copy=False)
data_df.head()
|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
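The drop above relies on the fact that NaN is the only value not equal to itself, so data_df.CustomerID != data_df.CustomerID is True exactly on the rows with a missing customer ID. A minimal illustration with made-up values:

```python
import pandas as pd

# NaN is the only value that is not equal to itself
s = pd.Series([17850.0, float("nan"), 13047.0])

mask = s != s            # True exactly where the value is NaN
idiomatic = s.isna()     # the more idiomatic spelling of the same check

print(mask.tolist())       # [False, True, False]
print(idiomatic.tolist())  # [False, True, False]
```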
Note that some of these transactions are for returns (denoted by negative quantity values).
data_df[data_df.Quantity < 1].head()
|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 141 | C536379 | D | Discount | -1 | 2010-12-01 09:41:00 | 27.50 | 14527 | United Kingdom |
| 154 | C536383 | 35004C | SET OF 3 COLOURED FLYING DUCKS | -1 | 2010-12-01 09:49:00 | 4.65 | 15311 | United Kingdom |
| 235 | C536391 | 22556 | PLASTERS IN TIN CIRCUS PARADE | -12 | 2010-12-01 10:24:00 | 1.65 | 17548 | United Kingdom |
| 236 | C536391 | 21984 | PACK OF 12 PINK PAISLEY TISSUES | -24 | 2010-12-01 10:24:00 | 0.29 | 17548 | United Kingdom |
| 237 | C536391 | 21983 | PACK OF 12 BLUE PAISLEY TISSUES | -24 | 2010-12-01 10:24:00 | 0.29 | 17548 | United Kingdom |
Though customers may have returned these products, they did initially purchase them (which reflects an interest in the product), so we’ll keep the initial purchases. However, we’ll remove the return transactions, which also removes the discount transactions.
data_df.drop(data_df.index[data_df.Quantity <= 0], inplace=True)
data_df.head()
|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
Community Detection¶
Let’s now find the communities of customers with similar purchases / interests.
First, we’ll need to create a bipartite graph of customers and products.
We’ll use metagraph’s default resolver, which dispatches the mg.algos calls below to a concrete backend.
Let’s take a look at the nodes of the bipartite graph we’re going to create.
customer_ids = data_df['CustomerID']
stock_codes = data_df['StockCode']
customer_ids.head()
0 17850
1 17850
2 17850
3 17850
4 17850
Name: CustomerID, dtype: int64
stock_codes.head()
0 85123A
1 71053
2 84406B
3 84029G
4 84029E
Name: StockCode, dtype: object
Our customer ids are ints, but our stock codes are not ints.
Ideally, our graph will have nodes of all the same type since some hardware backends might require this. This isn’t strictly necessary here, but it’s good practice to do this in order to avoid any potential problems any specific backend might have.
We can make our graph nodes all have the same type by mapping our original customer ids and stock codes to node ids and making a graph of those node ids. We can do this with metagraph.NodeLabels.
all_nodes = pd.concat([customer_ids, stock_codes]).unique()
node_ids = range(len(all_nodes))
node_labels = mg.NodeLabels(node_ids, all_nodes)
node_labels maps the customer ids or stock codes to node ids as shown below.
first_customer_id = customer_ids.iloc[0]
first_customer_id
17850
first_customer_id_node_id = node_labels[first_customer_id]
first_customer_id_node_id
0
first_stock_code = stock_codes.iloc[0]
first_stock_code
'85123A'
first_stock_code_node_id = node_labels[first_stock_code]
first_stock_code_node_id
4339
node_labels.ids maps node ids back to customer ids or stock codes as shown below.
node_labels.ids[first_customer_id_node_id]
17850
node_labels.ids[first_stock_code_node_id]
'85123A'
assert node_labels.ids[first_customer_id_node_id] == first_customer_id
assert node_labels.ids[first_stock_code_node_id] == first_stock_code
Let’s now create our bipartite graph.
customer_id_node_ids = [node_labels[customer_id] for customer_id in customer_ids]
stock_code_node_ids = [node_labels[stock_code] for stock_code in stock_codes]
edges = zip(customer_id_node_ids, stock_code_node_ids)
nx_bipartite_graph = nx.Graph()
nx_bipartite_graph.add_edges_from(edges)
bipartite_graph = mg.wrappers.BipartiteGraph.NetworkXBipartiteGraph(
nx_bipartite_graph,
[customer_id_node_ids, stock_code_node_ids]
)
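It can be worth sanity-checking that the two node sets really form a bipartite split. A toy illustration (with hypothetical node ids standing in for the ones produced by node_labels) using NetworkX’s bipartite utilities:

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy stand-in: customers are nodes 0 and 1, products are 100 and 101
g = nx.Graph()
g.add_edges_from([(0, 100), (0, 101), (1, 101)])

# Verify that the customer node set is one side of a bipartite split
print(bipartite.is_bipartite_node_set(g, {0, 1}))  # True
```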
Next, we’ll need to use a graph projection to create a graph of customers linked to other customers who’ve bought the same product.
customer_similarity_graph = mg.algos.bipartite.graph_projection(bipartite_graph, 0)
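To illustrate what the projection does (independently of metagraph’s API), here is the same idea with NetworkX’s projected_graph on a toy bipartite graph: two customers become adjacent whenever they share at least one product.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy bipartite graph: customers c1-c3, products p1-p2
g = nx.Graph([("c1", "p1"), ("c2", "p1"), ("c3", "p2")])

# Project onto the customer side: c1 and c2 share p1, c3 shares nothing
customers = bipartite.projected_graph(g, ["c1", "c2", "c3"])
edges = sorted(tuple(sorted(e)) for e in customers.edges())
print(edges)  # [('c1', 'c2')]
```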
We now have an unweighted graph of customers, but Louvain community detection requires edge weights. A more elegant weighting might be chosen in practice, but we’ll simply assign every edge a weight of 1 for this tutorial.
customer_similarity_graph = mg.algos.util.graph.assign_uniform_weight(
customer_similarity_graph,
1.0
)
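The equivalent operation in plain NetworkX (shown on a toy graph, since assign_uniform_weight is metagraph-specific) is a single set_edge_attributes call:

```python
import networkx as nx

# Toy graph standing in for the projected customer graph
g = nx.Graph([(0, 1), (1, 2)])

# Assign every edge the same weight of 1.0
nx.set_edge_attributes(g, 1.0, "weight")
print(g[0][1]["weight"])  # 1.0
```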
Now, we’ll use Louvain community detection to find communities of customers with similar purchases.
community_labels, modularity_score = mg.algos.clustering.louvain_community(
customer_similarity_graph
)
community_labels is a mapping from node IDs to their community labels.
Let’s see what community labels we have and how many there are.
type(community_labels)
dict
set(community_labels.values())
{0, 1, 2, 3}
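As a standalone illustration of what Louvain does (a sketch assuming NetworkX 2.8+, which ships louvain_communities; this is not metagraph’s implementation), two triangles joined by a single bridge edge are recovered as two communities:

```python
import networkx as nx

# Two triangles joined by one bridge edge (2, 3): a graph with an
# unambiguous two-community structure
g = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])

# Louvain maximizes modularity; a fixed seed makes the run deterministic
communities = nx.community.louvain_communities(g, seed=42)
print(sorted(sorted(c) for c in communities))
```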
Let’s now merge the labels into our dataframe.
data_df['CustomerCommunityLabel'] = data_df.CustomerID.map(
lambda customer_id: community_labels[node_labels[customer_id]]
)
data_df.sample(10)
|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | CustomerCommunityLabel |
|---|---|---|---|---|---|---|---|---|---|
| 174080 | 551739 | 22916 | HERB MARKER THYME | 2 | 2011-05-04 10:58:00 | 0.65 | 18118 | United Kingdom | 0 |
| 337560 | 566450 | 22977 | DOLLY GIRL CHILDRENS EGG CUP | 12 | 2011-09-12 16:12:00 | 1.25 | 15673 | United Kingdom | 2 |
| 346336 | 567184 | 22173 | METAL 4 HOOK HANGER FRENCH CHATEAU | 1 | 2011-09-18 15:41:00 | 3.29 | 16033 | United Kingdom | 0 |
| 493119 | 578155 | 22208 | WOOD STAMP SET THANK YOU | 1 | 2011-11-23 11:32:00 | 0.83 | 12748 | United Kingdom | 0 |
| 406740 | 571828 | 22812 | PACK 3 BOXES CHRISTMAS PANNETONE | 8 | 2011-10-19 11:52:00 | 1.95 | 16440 | United Kingdom | 2 |
| 482809 | 577484 | 23295 | SET OF 12 MINI LOAF BAKING CASES | 1 | 2011-11-20 11:52:00 | 0.83 | 13536 | United Kingdom | 2 |
| 303813 | 563555 | 22201 | FRYING PAN BLUE POLKADOT | 1 | 2011-08-17 13:21:00 | 4.25 | 16755 | United Kingdom | 3 |
| 525561 | 580632 | 23552 | BICYCLE PUNCTURE REPAIR KIT | 2 | 2011-12-05 12:16:00 | 2.08 | 16360 | United Kingdom | 2 |
| 406377 | 571747 | 22585 | PACK OF 6 BIRDY GIFT TAGS | 12 | 2011-10-19 10:59:00 | 1.25 | 13849 | United Kingdom | 2 |
| 344763 | 567097 | 23355 | HOT WATER BOTTLE KEEP CALM | 8 | 2011-09-16 13:23:00 | 4.95 | 13323 | United Kingdom | 0 |
We now have clusters of customers who’ve bought similar products and can market to these interests.
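One way to act on these labels (a sketch using standard pandas groupby on a toy stand-in for data_df, since the columns are the same) is to find each community’s most-purchased product:

```python
import pandas as pd

# Toy stand-in for data_df with the columns used above
df = pd.DataFrame({
    "Description": ["WHITE METAL LANTERN", "WHITE METAL LANTERN",
                    "DOLLY GIRL CHILDRENS EGG CUP", "DOLLY GIRL CHILDRENS EGG CUP",
                    "METAL 4 HOOK HANGER FRENCH CHATEAU"],
    "Quantity": [6, 4, 12, 3, 1],
    "CustomerCommunityLabel": [0, 0, 2, 2, 0],
})

# Total quantity per product within each community, then the top seller
totals = df.groupby(["CustomerCommunityLabel", "Description"])["Quantity"].sum()
top_per_community = totals.groupby(level="CustomerCommunityLabel").idxmax()
print(top_per_community.to_dict())
```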