This page was generated from getting_started/use_case_3_ecommerce_customer_interests.ipynb.
Use Case Tutorial 3: Customer Interest Clustering¶
This is a tutorial on how to cluster customers based on their interests and purchases, an analysis that marketing teams are frequently interested in.
We’ll show how graph analytics can be used to gain insights about the interests of customers by finding communities of customers who’ve bought similar products.
We’ll accomplish this by creating a bipartite graph of customers and products, using a graph projection to create a graph of customers linked to other customers who’ve bought the same product, and using Louvain community detection to find the communities.
We’ll be using ecommerce transaction data from a U.K. retailer provided by the University of California, Irvine. The data can be found here.
Data Preprocessing¶
Let’s first look at the data.
First, we’ll need to import some libraries.
import metagraph as mg
import pandas as pd
import networkx as nx
Let’s see what the data looks like.
RAW_DATA_CSV = './data/ecommerce/data.csv' # https://www.kaggle.com/carrie1/ecommerce-data
data_df = pd.read_csv(RAW_DATA_CSV, encoding="ISO-8859-1")
data_df.head()
|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
Let’s clean the data: parse the invoice dates, drop rows with missing customer IDs, and verify that no missing values remain.
data_df.InvoiceDate = pd.to_datetime(data_df.InvoiceDate, format="%m/%d/%Y %H:%M")
# NaN != NaN, so this drops exactly the rows whose CustomerID is missing
data_df.drop(data_df.index[data_df.CustomerID != data_df.CustomerID], inplace=True)
assert len(data_df[data_df.isnull().any(axis=1)]) == 0, "Raw data contains NaN"
data_df = data_df.astype({'CustomerID': int}, copy=False)
data_df.head()
|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
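The drop above relies on the fact that NaN is the only value not equal to itself, so data_df.CustomerID != data_df.CustomerID is True exactly on the rows with a missing customer ID. A minimal illustration with made-up values:

```python
import pandas as pd

# NaN is the only value that is not equal to itself
s = pd.Series([17850.0, float("nan"), 13047.0])

mask = s != s            # True exactly where the value is NaN
idiomatic = s.isna()     # the more idiomatic spelling of the same check

print(mask.tolist())       # [False, True, False]
print(idiomatic.tolist())  # [False, True, False]
```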
Note that some of these transactions are for returns (denoted by negative quantity values).
data_df[data_df.Quantity < 1].head()
|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 141 | C536379 | D | Discount | -1 | 2010-12-01 09:41:00 | 27.50 | 14527 | United Kingdom |
| 154 | C536383 | 35004C | SET OF 3 COLOURED FLYING DUCKS | -1 | 2010-12-01 09:49:00 | 4.65 | 15311 | United Kingdom |
| 235 | C536391 | 22556 | PLASTERS IN TIN CIRCUS PARADE | -12 | 2010-12-01 10:24:00 | 1.65 | 17548 | United Kingdom |
| 236 | C536391 | 21984 | PACK OF 12 PINK PAISLEY TISSUES | -24 | 2010-12-01 10:24:00 | 0.29 | 17548 | United Kingdom |
| 237 | C536391 | 21983 | PACK OF 12 BLUE PAISLEY TISSUES | -24 | 2010-12-01 10:24:00 | 0.29 | 17548 | United Kingdom |
Though customers may have returned these products, they did initially purchase them (which reflects an interest in the product), so we’ll keep the initial purchases. However, we’ll remove the return transactions, which also removes the discount transactions.
data_df.drop(data_df.index[data_df.Quantity <= 0], inplace=True)
data_df.head()
|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
Community Detection¶
Let’s now find the communities of customers with similar purchases / interests.
First, we’ll need to create a bipartite graph of customers and products.
We’ll use metagraph’s default resolver, which dispatches the mg.algos calls below to a concrete backend.
Let’s take a look at the nodes of the bipartite graph we’re going to create.
customer_ids = data_df['CustomerID']
stock_codes = data_df['StockCode']
customer_ids.head()
0 17850
1 17850
2 17850
3 17850
4 17850
Name: CustomerID, dtype: int64
stock_codes.head()
0 85123A
1 71053
2 84406B
3 84029G
4 84029E
Name: StockCode, dtype: object
Our customer ids are ints, but our stock codes are not ints.
Ideally, our graph will have nodes of all the same type since some hardware backends might require this. This isn’t strictly necessary here, but it’s good practice to do this in order to avoid any potential problems any specific backend might have.
We can make our graph nodes all have the same type by mapping our original customer ids and stock codes to node ids and making a graph of those node ids. We can do this with metagraph.NodeLabels.
all_nodes = pd.concat([customer_ids, stock_codes]).unique()
node_ids = range(len(all_nodes))
node_labels = mg.NodeLabels(node_ids, all_nodes)
node_labels maps the customer ids or stock codes to node ids as shown below.
first_customer_id = customer_ids.iloc[0]
first_customer_id
17850
first_customer_id_node_id = node_labels[first_customer_id]
first_customer_id_node_id
0
first_stock_code = stock_codes.iloc[0]
first_stock_code
'85123A'
first_stock_code_node_id = node_labels[first_stock_code]
first_stock_code_node_id
4339
node_labels.ids maps node ids back to customer ids or stock codes as shown below.
node_labels.ids[first_customer_id_node_id]
17850
node_labels.ids[first_stock_code_node_id]
'85123A'
assert node_labels.ids[first_customer_id_node_id] == first_customer_id
assert node_labels.ids[first_stock_code_node_id] == first_stock_code
Let’s now create our bipartite graph.
customer_id_node_ids = [node_labels[customer_id] for customer_id in customer_ids]
stock_code_node_ids = [node_labels[stock_code] for stock_code in stock_codes]
edges = zip(customer_id_node_ids, stock_code_node_ids)
nx_bipartite_graph = nx.Graph()
nx_bipartite_graph.add_edges_from(edges)
bipartite_graph = mg.wrappers.BipartiteGraph.NetworkXBipartiteGraph(
nx_bipartite_graph,
[customer_id_node_ids, stock_code_node_ids]
)
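It can be worth sanity-checking that the two node sets really form a bipartite split. A toy illustration (with hypothetical node ids standing in for the ones produced by node_labels) using NetworkX’s bipartite utilities:

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy stand-in: customers are nodes 0 and 1, products are 100 and 101
g = nx.Graph()
g.add_edges_from([(0, 100), (0, 101), (1, 101)])

# Verify that the customer node set is one side of a bipartite split
print(bipartite.is_bipartite_node_set(g, {0, 1}))  # True
```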
Next, we’ll need to use a graph projection to create a graph of customers linked to other customers who’ve bought the same product.
customer_similarity_graph = mg.algos.bipartite.graph_projection(bipartite_graph, 0)
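To illustrate what the projection does (independently of metagraph’s API), here is the same idea with NetworkX’s projected_graph on a toy bipartite graph: two customers become adjacent whenever they share at least one product.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy bipartite graph: customers c1-c3, products p1-p2
g = nx.Graph([("c1", "p1"), ("c2", "p1"), ("c3", "p2")])

# Project onto the customer side: c1 and c2 share p1, c3 shares nothing
customers = bipartite.projected_graph(g, ["c1", "c2", "c3"])
edges = sorted(tuple(sorted(e)) for e in customers.edges())
print(edges)  # [('c1', 'c2')]
```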
We now have an unweighted graph of customers, but Louvain community detection requires edge weights. A more elegant weighting might be chosen in practice, but we’ll simply assign every edge a weight of 1 for this tutorial.
customer_similarity_graph = mg.algos.util.graph.assign_uniform_weight(
customer_similarity_graph,
1.0
)
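The equivalent operation in plain NetworkX (shown on a toy graph, since assign_uniform_weight is metagraph-specific) is a single set_edge_attributes call:

```python
import networkx as nx

# Toy graph standing in for the projected customer graph
g = nx.Graph([(0, 1), (1, 2)])

# Assign every edge the same weight of 1.0
nx.set_edge_attributes(g, 1.0, "weight")
print(g[0][1]["weight"])  # 1.0
```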
Now, we’ll use Louvain community detection to find communities of customers with similar purchases.
community_labels, modularity_score = mg.algos.clustering.louvain_community(
customer_similarity_graph
)
community_labels is a mapping from node IDs to their community labels.
Let’s see what community labels we have and how many there are.
type(community_labels)
dict
set(community_labels.values())
{0, 1, 2, 3}
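As a standalone illustration of what Louvain does (a sketch assuming NetworkX 2.8+, which ships louvain_communities; this is not metagraph’s implementation), two triangles joined by a single bridge edge are recovered as two communities:

```python
import networkx as nx

# Two triangles joined by one bridge edge (2, 3): a graph with an
# unambiguous two-community structure
g = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])

# Louvain maximizes modularity; a fixed seed makes the run deterministic
communities = nx.community.louvain_communities(g, seed=42)
print(sorted(sorted(c) for c in communities))
```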
Let’s now merge the labels into our dataframe.
data_df['CustomerCommunityLabel'] = data_df.CustomerID.map(
lambda customer_id: community_labels[node_labels[customer_id]]
)
data_df.sample(10)
|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | CustomerCommunityLabel |
|---|---|---|---|---|---|---|---|---|---|
| 174080 | 551739 | 22916 | HERB MARKER THYME | 2 | 2011-05-04 10:58:00 | 0.65 | 18118 | United Kingdom | 0 |
| 337560 | 566450 | 22977 | DOLLY GIRL CHILDRENS EGG CUP | 12 | 2011-09-12 16:12:00 | 1.25 | 15673 | United Kingdom | 2 |
| 346336 | 567184 | 22173 | METAL 4 HOOK HANGER FRENCH CHATEAU | 1 | 2011-09-18 15:41:00 | 3.29 | 16033 | United Kingdom | 0 |
| 493119 | 578155 | 22208 | WOOD STAMP SET THANK YOU | 1 | 2011-11-23 11:32:00 | 0.83 | 12748 | United Kingdom | 0 |
| 406740 | 571828 | 22812 | PACK 3 BOXES CHRISTMAS PANNETONE | 8 | 2011-10-19 11:52:00 | 1.95 | 16440 | United Kingdom | 2 |
| 482809 | 577484 | 23295 | SET OF 12 MINI LOAF BAKING CASES | 1 | 2011-11-20 11:52:00 | 0.83 | 13536 | United Kingdom | 2 |
| 303813 | 563555 | 22201 | FRYING PAN BLUE POLKADOT | 1 | 2011-08-17 13:21:00 | 4.25 | 16755 | United Kingdom | 3 |
| 525561 | 580632 | 23552 | BICYCLE PUNCTURE REPAIR KIT | 2 | 2011-12-05 12:16:00 | 2.08 | 16360 | United Kingdom | 2 |
| 406377 | 571747 | 22585 | PACK OF 6 BIRDY GIFT TAGS | 12 | 2011-10-19 10:59:00 | 1.25 | 13849 | United Kingdom | 2 |
| 344763 | 567097 | 23355 | HOT WATER BOTTLE KEEP CALM | 8 | 2011-09-16 13:23:00 | 4.95 | 13323 | United Kingdom | 0 |
We now have clusters of customers who’ve bought similar products and can market to these interests.
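One way to act on these labels (a sketch using standard pandas groupby on a toy stand-in for data_df, since the columns are the same) is to find each community’s most-purchased product:

```python
import pandas as pd

# Toy stand-in for data_df with the columns used above
df = pd.DataFrame({
    "Description": ["WHITE METAL LANTERN", "WHITE METAL LANTERN",
                    "DOLLY GIRL CHILDRENS EGG CUP", "DOLLY GIRL CHILDRENS EGG CUP",
                    "METAL 4 HOOK HANGER FRENCH CHATEAU"],
    "Quantity": [6, 4, 12, 3, 1],
    "CustomerCommunityLabel": [0, 0, 2, 2, 0],
})

# Total quantity per product within each community, then the top seller
totals = df.groupby(["CustomerCommunityLabel", "Description"])["Quantity"].sum()
top_per_community = totals.groupby(level="CustomerCommunityLabel").idxmax()
print(top_per_community.to_dict())
```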