This page was generated from getting_started/use_case_3_ecommerce_customer_interests.ipynb.

Use Case Tutorial 3: Customer Interest Clustering

This is a tutorial on how to perform customer clustering based on the interests and purchases of customers.

Marketing teams are frequently interested in this kind of analysis.

We’ll show how graph analytics can be used to gain insights about the interests of customers by finding communities of customers who’ve bought similar products.

We’ll accomplish this by creating a bipartite graph of customers and products, using a graph projection to create a graph of customers linked to other customers who’ve bought the same product, and using Louvain community detection to find the communities.

We’ll be using e-commerce transaction data from a U.K. retailer, provided by the University of California, Irvine. A copy of the data is available at https://www.kaggle.com/carrie1/ecommerce-data.

Data Preprocessing

First, we’ll need to import some libraries.


import metagraph as mg
import pandas as pd
import networkx as nx

Let’s see what the data looks like.


RAW_DATA_CSV = './data/ecommerce/data.csv' # https://www.kaggle.com/carrie1/ecommerce-data
data_df = pd.read_csv(RAW_DATA_CSV, encoding="ISO-8859-1")
data_df.head()

   InvoiceNo  StockCode  Description                          Quantity  InvoiceDate     UnitPrice  CustomerID  Country
0  536365     85123A     WHITE HANGING HEART T-LIGHT HOLDER   6         12/1/2010 8:26  2.55       17850.0     United Kingdom
1  536365     71053      WHITE METAL LANTERN                  6         12/1/2010 8:26  3.39       17850.0     United Kingdom
2  536365     84406B     CREAM CUPID HEARTS COAT HANGER       8         12/1/2010 8:26  2.75       17850.0     United Kingdom
3  536365     84029G     KNITTED UNION FLAG HOT WATER BOTTLE  6         12/1/2010 8:26  3.39       17850.0     United Kingdom
4  536365     84029E     RED WOOLLY HOTTIE WHITE HEART.       6         12/1/2010 8:26  3.39       17850.0     United Kingdom

Let’s clean the data by parsing the invoice dates, dropping rows with a missing CustomerID, and checking that no other missing values remain.


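# Parse the invoice timestamps.
data_df.InvoiceDate = pd.to_datetime(data_df.InvoiceDate, format="%m/%d/%Y %H:%M")
# Drop rows with a missing CustomerID (NaN is the only value not equal to itself).
data_df.drop(data_df.index[data_df.CustomerID != data_df.CustomerID], inplace=True)
# Verify that no other missing values remain.
assert len(data_df[data_df.isnull().any(axis=1)])==0, "Raw data contains NaN"
# CustomerID was read as float because of the NaNs; convert it to int.
data_df = data_df.astype({'CustomerID': int}, copy=False)
data_df.head()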

   InvoiceNo  StockCode  Description                          Quantity  InvoiceDate          UnitPrice  CustomerID  Country
0  536365     85123A     WHITE HANGING HEART T-LIGHT HOLDER   6         2010-12-01 08:26:00  2.55       17850       United Kingdom
1  536365     71053      WHITE METAL LANTERN                  6         2010-12-01 08:26:00  3.39       17850       United Kingdom
2  536365     84406B     CREAM CUPID HEARTS COAT HANGER       8         2010-12-01 08:26:00  2.75       17850       United Kingdom
3  536365     84029G     KNITTED UNION FLAG HOT WATER BOTTLE  6         2010-12-01 08:26:00  3.39       17850       United Kingdom
4  536365     84029E     RED WOOLLY HOTTIE WHITE HEART.       6         2010-12-01 08:26:00  3.39       17850       United Kingdom
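The comparison data_df.CustomerID != data_df.CustomerID in the cleaning step works because NaN is the only floating-point value that is not equal to itself, so the comparison is true exactly for the rows with a missing customer id. A minimal pure-Python sketch of the same idea, using made-up values rather than the real data:

```python
import math

# Hypothetical customer ids; NaN stands in for a missing value.
customer_ids = [17850.0, float("nan"), 13047.0, float("nan")]

# NaN is the only float that is not equal to itself.
is_missing = [cid != cid for cid in customer_ids]

# math.isnan gives the same answer and states the intent more directly.
assert is_missing == [math.isnan(cid) for cid in customer_ids]

# Keep only the rows with a customer id, converted to int.
kept = [int(cid) for cid, missing in zip(customer_ids, is_missing) if not missing]
print(kept)
```

In pandas, data_df.CustomerID.isna() builds the same boolean mask.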

Note that some of these transactions are for returns (denoted by negative quantity values).


data_df[data_df.Quantity < 1].head()

     InvoiceNo  StockCode  Description                      Quantity  InvoiceDate          UnitPrice  CustomerID  Country
141  C536379    D          Discount                         -1        2010-12-01 09:41:00  27.50      14527       United Kingdom
154  C536383    35004C     SET OF 3 COLOURED FLYING DUCKS   -1        2010-12-01 09:49:00  4.65       15311       United Kingdom
235  C536391    22556      PLASTERS IN TIN CIRCUS PARADE    -12       2010-12-01 10:24:00  1.65       17548       United Kingdom
236  C536391    21984      PACK OF 12 PINK PAISLEY TISSUES  -24       2010-12-01 10:24:00  0.29       17548       United Kingdom
237  C536391    21983      PACK OF 12 BLUE PAISLEY TISSUES  -24       2010-12-01 10:24:00  0.29       17548       United Kingdom

Though customers may have returned these products, they did initially purchase them, which reflects an interest in the product, so we’ll keep the initial purchases. However, we’ll remove the return transactions themselves (which also removes any discount transactions).


data_df.drop(data_df.index[data_df.Quantity <= 0], inplace=True)
data_df.head()

   InvoiceNo  StockCode  Description                          Quantity  InvoiceDate          UnitPrice  CustomerID  Country
0  536365     85123A     WHITE HANGING HEART T-LIGHT HOLDER   6         2010-12-01 08:26:00  2.55       17850       United Kingdom
1  536365     71053      WHITE METAL LANTERN                  6         2010-12-01 08:26:00  3.39       17850       United Kingdom
2  536365     84406B     CREAM CUPID HEARTS COAT HANGER       8         2010-12-01 08:26:00  2.75       17850       United Kingdom
3  536365     84029G     KNITTED UNION FLAG HOT WATER BOTTLE  6         2010-12-01 08:26:00  3.39       17850       United Kingdom
4  536365     84029E     RED WOOLLY HOTTIE WHITE HEART.       6         2010-12-01 08:26:00  3.39       17850       United Kingdom

Community Detection

Let’s now find the communities of customers with similar purchases and interests.

First, we’ll need to create a bipartite graph of customers and products.

We’ll use metagraph’s default resolver throughout, which is what the mg.algos calls below go through.

Let’s take a look at the nodes of the bipartite graph we’re going to create.


customer_ids = data_df['CustomerID']
stock_codes = data_df['StockCode']

customer_ids.head()

0    17850
1    17850
2    17850
3    17850
4    17850
Name: CustomerID, dtype: int64

stock_codes.head()

0    85123A
1     71053
2    84406B
3    84029G
4    84029E
Name: StockCode, dtype: object

Our customer ids are ints, but our stock codes are strings.

Ideally, all of the nodes in our graph will have the same type, since some hardware backends might require this. This isn’t strictly necessary here, but it’s good practice because it avoids potential problems with any specific backend.

We can make our graph nodes all have the same type by mapping our original customer ids and stock codes to node ids and making a graph of those node ids. We can do this with metagraph.NodeLabels.


all_nodes = pd.concat([customer_ids, stock_codes]).unique()
node_ids = range(len(all_nodes))
node_labels = mg.NodeLabels(node_ids, all_nodes)

node_labels maps the customer ids or stock codes to node ids as shown below.


first_customer_id = customer_ids.iloc[0]
first_customer_id

17850

first_customer_id_node_id = node_labels[first_customer_id]
first_customer_id_node_id

0

first_stock_code = stock_codes.iloc[0]
first_stock_code

'85123A'

first_stock_code_node_id = node_labels[first_stock_code]
first_stock_code_node_id

4339

node_labels.ids maps node ids to customer ids or stock codes as shown below.


node_labels.ids[first_customer_id_node_id]

17850

node_labels.ids[first_stock_code_node_id]

'85123A'

assert node_labels.ids[first_customer_id_node_id] == first_customer_id
assert node_labels.ids[first_stock_code_node_id] == first_stock_code
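The two-way lookup shown above can be pictured as a pair of plain dictionaries. The sketch below is only an illustration of the idea with made-up labels; it is not how metagraph.NodeLabels is implemented, and the names label_to_id and id_to_label are hypothetical.

```python
# Mixed-type labels, as in the tutorial: int customer ids and str stock codes.
labels = [17850, 13047, "85123A", "71053"]

# label -> node id (plays the role of node_labels[label])
label_to_id = {label: node_id for node_id, label in enumerate(labels)}

# node id -> label (plays the role of node_labels.ids[node_id])
id_to_label = dict(enumerate(labels))

# A round trip through both mappings returns the original label.
for label in labels:
    assert id_to_label[label_to_id[label]] == label

print(label_to_id)
```

Because the stock codes are strings and the customer ids are ints, a stock code such as "71053" can never collide with a customer id that happens to have the same digits.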

Let’s now create our bipartite graph.


customer_id_node_ids = [node_labels[customer_id] for customer_id in customer_ids]
stock_code_node_ids = [node_labels[stock_code] for stock_code in stock_codes]
edges = zip(customer_id_node_ids, stock_code_node_ids)

nx_bipartite_graph = nx.Graph()
nx_bipartite_graph.add_edges_from(edges)
bipartite_graph = mg.wrappers.BipartiteGraph.NetworkXBipartiteGraph(
    nx_bipartite_graph,
    [customer_id_node_ids, stock_code_node_ids]
)

Next, we’ll need to use a graph projection to create a graph of customers linked to other customers who’ve bought the same product.


customer_similarity_graph = mg.algos.bipartite.graph_projection(bipartite_graph, 0)
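To make the projection step concrete, here is a minimal pure-Python sketch of what projecting a bipartite customer/product graph onto its customers means: two customers are linked whenever they share at least one product. The purchase pairs are made up, and this illustrates the concept only, not metagraph’s implementation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (customer, product) purchase pairs.
purchases = [
    ("alice", "mug"), ("bob", "mug"), ("bob", "lamp"),
    ("carol", "lamp"), ("dave", "clock"),
]

# Group customers by the product they bought.
buyers_by_product = defaultdict(set)
for customer, product in purchases:
    buyers_by_product[product].add(customer)

# Link every pair of customers that share at least one product.
customer_edges = set()
for buyers in buyers_by_product.values():
    for a, b in combinations(sorted(buyers), 2):
        customer_edges.add((a, b))

print(sorted(customer_edges))
```

In this toy data, dave is the only buyer of his product, so he ends up with no links to other customers.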

We now have an unweighted graph of customers, in which two customers are linked if they bought the same product. Louvain community detection requires weights. A more elegant approach might be taken in practice, but for this tutorial we’ll simply assign every edge a weight of 1.


customer_similarity_graph = mg.algos.util.graph.assign_uniform_weight(
    customer_similarity_graph,
    1.0
)
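One possible refinement over a uniform weight of 1, which is not part of this tutorial’s pipeline, is to weight each pair of customers by the number of distinct products they both bought. The sketch below shows that idea in pure Python on made-up purchases; feeding such weights into metagraph is left out because it would depend on APIs not covered here.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical (customer, product) purchase pairs.
purchases = [
    ("alice", "mug"), ("bob", "mug"),
    ("alice", "lamp"), ("bob", "lamp"), ("carol", "lamp"),
]

# Group customers by product; a set ignores repeat purchases of the same product.
buyers_by_product = defaultdict(set)
for customer, product in purchases:
    buyers_by_product[product].add(customer)

# Count, for each customer pair, how many products they have in common.
shared_products = Counter()
for buyers in buyers_by_product.values():
    for pair in combinations(sorted(buyers), 2):
        shared_products[pair] += 1

print(dict(shared_products))
```

Here alice and bob share two products, so their edge would carry twice the weight of the edges to carol.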

Now, we’ll use Louvain community detection to find communities of customers based on the products they purchased.


community_labels, modularity_score = mg.algos.clustering.louvain_community(
    customer_similarity_graph
)

community_labels is a mapping from node IDs to their community labels.

Let’s see how many community labels we have and what they are.


type(community_labels)

dict

set(community_labels.values())

{0, 1, 2, 3}
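The louvain_community call above also returns a modularity score. Assuming that score follows the standard definition of modularity (an assumption, not something confirmed here for every metagraph backend), it compares the fraction of edges that fall inside communities with the fraction expected if edges were placed at random among nodes of the same degrees. A small worked example on a toy unweighted graph:

```python
from collections import defaultdict

# Toy graph: two triangles joined by a single edge (2-3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
# Hypothetical community assignment: one community per triangle.
community = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

m = len(edges)                     # total number of edges
internal_edges = defaultdict(int)  # edges with both ends in the same community
degree_sum = defaultdict(int)      # sum of node degrees per community

for u, v in edges:
    degree_sum[community[u]] += 1
    degree_sum[community[v]] += 1
    if community[u] == community[v]:
        internal_edges[community[u]] += 1

# Q = sum over communities c of L_c / m - (d_c / (2m))^2
modularity = sum(
    internal_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
    for c in degree_sum
)
print(round(modularity, 4))
```

Each triangle has 3 internal edges and a degree sum of 7 out of 14, so the score is 2 * (3/7 - 1/4) = 5/14. Higher values indicate communities that are more densely connected internally than chance would suggest.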

Let’s now merge the labels into our dataframe.


data_df['CustomerCommunityLabel'] = data_df.CustomerID.map(
    lambda customer_id: community_labels[node_labels[customer_id]]
)
data_df.sample(10)

        InvoiceNo  StockCode  Description                         Quantity  InvoiceDate          UnitPrice  CustomerID  Country         CustomerCommunityLabel
174080  551739     22916      HERB MARKER THYME                   2         2011-05-04 10:58:00  0.65       18118       United Kingdom  0
337560  566450     22977      DOLLY GIRL CHILDRENS EGG CUP        12        2011-09-12 16:12:00  1.25       15673       United Kingdom  2
346336  567184     22173      METAL 4 HOOK HANGER FRENCH CHATEAU  1         2011-09-18 15:41:00  3.29       16033       United Kingdom  0
493119  578155     22208      WOOD STAMP SET THANK YOU            1         2011-11-23 11:32:00  0.83       12748       United Kingdom  0
406740  571828     22812      PACK 3 BOXES CHRISTMAS PANNETONE    8         2011-10-19 11:52:00  1.95       16440       United Kingdom  2
482809  577484     23295      SET OF 12 MINI LOAF BAKING CASES    1         2011-11-20 11:52:00  0.83       13536       United Kingdom  2
303813  563555     22201      FRYING PAN BLUE POLKADOT            1         2011-08-17 13:21:00  4.25       16755       United Kingdom  3
525561  580632     23552      BICYCLE PUNCTURE REPAIR KIT         2         2011-12-05 12:16:00  2.08       16360       United Kingdom  2
406377  571747     22585      PACK OF 6 BIRDY GIFT TAGS           12        2011-10-19 10:59:00  1.25       13849       United Kingdom  2
344763  567097     23355      HOT WATER BOTTLE KEEP CALM          8         2011-09-16 13:23:00  4.95       13323       United Kingdom  0

We now have clusters of customers who’ve bought similar products and can market to these interests.
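As one possible next step, the products that are bought most within each community could be tallied to get a rough picture of that community’s interests. The sketch below does this in pure Python on a handful of made-up rows; the rows, quantities, and variable names are hypothetical and are not results from the dataset.

```python
from collections import Counter, defaultdict

# Hypothetical (community label, product description, quantity) rows.
rows = [
    (0, "HERB MARKER THYME", 2),
    (0, "HOT WATER BOTTLE KEEP CALM", 8),
    (0, "HERB MARKER THYME", 5),
    (2, "PACK OF 6 BIRDY GIFT TAGS", 12),
    (2, "BICYCLE PUNCTURE REPAIR KIT", 2),
]

# Total quantity purchased per product within each community.
quantity_by_community = defaultdict(Counter)
for label, description, quantity in rows:
    quantity_by_community[label][description] += quantity

# The single most-purchased product in each community.
top_product = {
    label: counts.most_common(1)[0]
    for label, counts in quantity_by_community.items()
}
print(top_product)
```

On the real dataframe, the same tally could be driven by the CustomerCommunityLabel, Description, and Quantity columns.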