Customer segmentation and retention analysis of online retail data.

DATA

I used the online retail II dataset from ML Repository from Kaggle. The dataset contains the transactions of online retail between 2009 and 2010. The company mainly sells unique all-occasion gift-ware, and the majority of the customers consist of wholesalers.



OVERVIEW

I filtered the data with less than 1 quantity to work with only purchased products (excluding the returned products). Then I looked at the sum of frequency and the purchase price grouped by country, product and year. United Kingdom ranked the first in frequency, followed by EIRE, Germany and France. As for the amount, United Kingdom ranked the first, followed by EIRE, Netherlands and Germany. Although German customers had higher frequency of purchase, higher amount of sales were coming from the Dutch customers. The most frequently bought products were ‘white-hanging heart t-light holder’, followed by 'regency cake stand 3 tier’ and ‘strawberry ceramic trinket box.’ Products with the highest amount of sales were ‘manual’, ‘dotcom postage’ and ‘regency cake stand 3 tier.’ Most of the sales were from 2010 since the 2009 data only comprised one month of transaction.








CUSTOMER RETENTION

After getting an overview of the data, I separated the UK dataset since it contained the major market. Then, I analyzed the customer retention through Repeat Purchase Rate (RPR), Purchase Frequency (PF), and Time Between Purchases(TBP) metrics.

  • Repeat Purchase Rate (RPR) = Number of customers that bought more than once/ Total number of customers.
  • Purchase Frequency (PF) = Number of orders in the last 365 days/ Number of unique customers over the last 365 days.
  • Time Between Purchases(TBP) = 365 days/ Purchase Frequency.
The Repeat Purchase Rate (RPR) of the UK customers were 97.34% in 2009, and 97.87% in 2010. Compared to the average retention rate in retail industry (63%), the business performed significantly above the average. Part of the reason could be attributed to the majority of the customers being wholesalers, who need to buy products that they can trust, regularly. The Purchase Frequency (PF), that is, the average number of orders per customer, was 1.93 orders in 2009 and 4.83 orders in 2010. Time Between Purchases(TBP) was 188 days in 2009 and 75 days in 2010.






CUSTOMER SEGMENTATION

To segment the customers, I used the behavioral based approach of using RFM metrics: recency, frequency, and monetization.

  • Recency: Number of days between max date in the invoice column and the date of the last purchase for each customer.
  • Frequency: Number of orders for each customer.
  • Monetization: Sum of purchase price for each customer.
I assigned scores to each of the customers based on the quartile they were placed in. As for recency, customers in the lower quartile were assigned higher scores, whereas for frequency and monetization, customers in the upper quartile were assigned higher scores. Then, I drew an elbow graph to find out the best number of segmentations that I can get. I decided to go with 6 clusters, so that each characteristics of the segments are unique enough.




With this number (6 clusters), I employed k-means clustering to segment the customers. You can see the different segments of the customers in the 3d graph below. I named the segments as following:

  • High recency, High frequency, High monetization = MVC: Most Valuable Customers.
  • High recency, High frequency, Mid~Low monetization = VC: Valuable Customers.
  • High recency, Mid frequency, Low monetization = Newcomers.
  • Mid~Low recency, High frequency, High~Mid monetization = Previous MVC.
  • Mid~Low recency, High~Low frequency, High-Low monetization = Previous VC.
  • Low recency, Low frequency, Low-Mid monetization = Hibernating.
    ‘VC’ (22.9%) was the biggest segment, closely followed by ‘Previous VC’(21.4%), then ‘Hibernating’(15.6%), ‘Newcomers'(14.6%), ‘MVC’(13.0%), and ‘Previous VC’(12.5%).