-Anand Modi, Sundar R, Thivyaa Mohan, Nishant S
This project involves the analysis of a dataset that contains information about 2240 customers of a departmental store. The data is fictional and was made to practice Exploratory and Statistical Analysis in a marketing context. With the analysis that follows, we aim to extract insights about:
- Customer profiles
- Product preferences
- Campaign successes/failures
- Channel performance
- ID — Customer ID
- Year_Birth — Customer’s Year of Birth
- Education — Customer’s Education Level
- Marital_Status — Customer’s Marital Status
- Income — Yearly Household Income
- Kidhome — Number of children at the customer’s house
- Teenhome — Number of teenagers at the customer’s house
- Dt_Customer — Date of customer’s enrolment with the company
- Recency — Number of days since last purchase
- MntXYZ — Amount spent on XYZ related products in last 2 years
- NumXYZ — Number of purchases made through XYZ medium
- AcceptedCmpX – 1 if the customer accepted the offer in the Xth campaign, 0 otherwise
- Response — 1 if the customer accepted the last campaign offer
- Complain — 1 if the customer complained in the last 2 years
- Country — country the customer belongs to
Cleaning and preprocessing
Looking at the nulls first
It looks like income is the only feature that needs to be dealt with for further analysis.
After the initial cleaning, we impute all missing values for income with the median and take care of outliers using quantile-based flooring and capping.
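The income step described above can be sketched as follows. This is a minimal illustration on made-up values, not the project's actual code; the 5th and 95th percentile cut-offs are an assumption, since the text does not say which quantiles were used.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Income column (values are made up for illustration).
df = pd.DataFrame({"Income": [52000.0, 61000.0, np.nan, 48000.0, 666666.0,
                              1500.0, 57000.0, np.nan, 63000.0, 55000.0]})

# 1) Impute missing incomes with the median of the observed values.
median_income = df["Income"].median()
df["Income"] = df["Income"].fillna(median_income)

# 2) Quantile-based flooring and capping.
#    The 0.05 / 0.95 quantiles are an assumed choice, not taken from the source.
lower, upper = df["Income"].quantile([0.05, 0.95])
df["Income"] = df["Income"].clip(lower=lower, upper=upper)

print(df["Income"].describe())
```

Clipping keeps every row in the dataset while limiting the pull of extreme incomes on means and regression fits.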
We use Year_Birth to extract the age of the customer (= 2021-Year_Birth).
We replace all the age outliers with the median age. Next up, we clean the Marital_Status feature.
After doing some digging, we found that the dataset consisted of only three samples that belong to the country ME (Mexico). We decided to remove them as they were producing misleading results while we performed the country based statistical analysis.
- Add up Kidhome and Teenhome to give us the total number of dependents per house
- Add up all accepted campaigns to yield the total number of accepted campaigns
- Add up the amount spent on each individual product to extract the total amount spent by the customer for the past 2 years
- Add up catalogue, deals, store and web purchases to get the total number of purchases made by the customer over 2 years
- One Hot Encode: Age group, marital status, country and education
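The feature engineering steps listed above can be sketched in pandas roughly as follows. The toy frame only contains a subset of the real columns, and the new column names (`Dependents`, `TotalAccepted`, `TotalSpent`, `TotalPurchases`) are hypothetical. Whether `Response` was counted among the accepted campaigns is not stated, so this sketch sums only the `AcceptedCmpX` columns.

```python
import pandas as pd

# Toy frame with a subset of the dataset's columns (values are made up).
df = pd.DataFrame({
    "Kidhome": [0, 1, 2],
    "Teenhome": [1, 0, 1],
    "AcceptedCmp1": [0, 1, 0],
    "AcceptedCmp2": [0, 0, 0],
    "MntWines": [100, 20, 5],
    "MntFruits": [10, 0, 3],
    "NumWebPurchases": [4, 2, 1],
    "NumStorePurchases": [6, 3, 2],
    "Country": ["SP", "US", "SP"],
    "Education": ["PhD", "Master", "Graduation"],
})

# Total dependents per household.
df["Dependents"] = df["Kidhome"] + df["Teenhome"]

# Total accepted campaigns (assumption: only the AcceptedCmpX columns are summed).
df["TotalAccepted"] = df.filter(like="AcceptedCmp").sum(axis=1)

# Total amount spent across all MntXYZ product columns.
df["TotalSpent"] = df.filter(like="Mnt").sum(axis=1)

# Total purchases across channels; the regex keeps only Num...Purchases columns.
df["TotalPurchases"] = df.filter(regex=r"^Num.*Purchases$").sum(axis=1)

# One-hot encode the categorical features.
encoded = pd.get_dummies(df, columns=["Country", "Education"])
print(encoded.columns.tolist())
```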
We used linear regression to model NumStorePurchases as a function of the other features. The top contributing features and the direction of their contributions were identified with permutation importance and SHAP analysis.
The SHAP summary suggests the following:
- The higher the total number of purchases, the higher the number of store purchases
- The lower the number of catalogue, web and deals purchases, the higher the number of store purchases
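The regression and permutation importance steps can be illustrated on synthetic data. This is only a sketch under stated assumptions: the purchase counts are randomly generated, the variable names are hypothetical, and the SHAP step is not shown. In this toy setup the total is the exact sum of the four channels, so store purchases equal the total minus the other three, which reproduces the sign pattern described above.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic purchase counts per channel (made up, not the real data).
rng = np.random.default_rng(0)
n = 500
web = rng.poisson(4, n)
catalog = rng.poisson(3, n)
deals = rng.poisson(2, n)
store = rng.poisson(6, n)
total = web + catalog + deals + store

feature_names = ["total", "web", "catalog", "deals"]
X = np.column_stack([total, web, catalog, deals])
y = store

# Fit an ordinary least squares model of store purchases.
model = LinearRegression().fit(X, y)

# Permutation importance: drop in R^2 when one feature column is shuffled.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

for name, coef, imp in zip(feature_names, model.coef_, result.importances_mean):
    print(f"{name:8s} coef={coef:+.2f} importance={imp:.3f}")
```

Because the target here is an exact linear combination of the predictors, the coefficients come out as +1 for the total and -1 for each other channel.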
We explore the relationship between the country and the products bought. “Country” is a nominal variable with more than two categories. “Products” is a collection of continuous variables. Analysing the relationship between these variables calls for a one-way Multivariate Analysis of Variance, aka one-way MANOVA. MANOVA assumes that the dependent variables (the amounts spent on the various products) are multivariate normally distributed within each group of the independent variable (country). It also assumes that the population covariance matrices of the groups are equal; we checked this product by product with Levene’s test for homogeneity of variances. To conduct the analysis, these are the steps we used:
- Amount spent on each product standardised by country.
- Levene’s test for each type of product to check for homoscedasticity. The results indicated that the variance of each product’s sales does not differ significantly across countries, which supports going ahead with the MANOVA.
- MANOVA was performed with MntXYZ as dependent variables and Country as the independent variable.
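The Levene step can be sketched with `scipy.stats.levene`. The samples below are randomly generated stand-ins for one product's standardised spend in three countries, so the numbers say nothing about the real data. A deliberately widened group is included only to show what a violation looks like.

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(42)

# Made-up standardised spend on one product for three countries.
groups = {c: rng.normal(loc=0.0, scale=1.0, size=200) for c in ["SP", "US", "CA"]}

# Null hypothesis: all groups have equal variance.
stat_equal, p_equal = levene(*groups.values())
print(f"similar spreads:   W={stat_equal:.3f}, p={p_equal:.3f}")

# Contrast case: one group with a much larger spread.
wide_group = rng.normal(loc=0.0, scale=5.0, size=200)
stat_unequal, p_unequal = levene(groups["SP"], groups["US"], wide_group)
print(f"one wide group:    W={stat_unequal:.3f}, p={p_unequal:.3g}")
```

A p-value above 0.05 means the equal-variance hypothesis is not rejected for that product; the test would be repeated for each MntXYZ column.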
The test statistics returned by the MANOVA, together with their p-values, suggest that the independent variable (country) makes no significant contribution to the changes in the dependent variables (MntWines, MntFruits, …).
- Wilks’ Λ is about 0.985, which suggests that roughly 98.5% of the variance in the dependent variables is not explained by the independent variable; the p-value is also much greater than 0.05.
- Pillai’s trace is a number between 0 and 1. The closer it is to 1, the more it suggests that the independent variable influences the dependent variables. But here it’s just 0.0151.
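To make the two statistics concrete, here is a small NumPy sketch that computes Wilks' Λ and Pillai's trace directly from the within-group and total sums-of-squares matrices. The data are random with no country effect built in, so the values only illustrate the "no effect" pattern; this is not the project's MANOVA code, and the p-values are not computed here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: 3 countries x 100 customers, 2 product-spend variables.
countries = np.repeat(["SP", "US", "CA"], 100)
Y = rng.normal(size=(300, 2))

def sscp(M):
    """Sums-of-squares-and-cross-products matrix around the column means."""
    centered = M - M.mean(axis=0)
    return centered.T @ centered

T = sscp(Y)                                                    # total
W = sum(sscp(Y[countries == c]) for c in np.unique(countries)) # within groups
B = T - W                                                      # between groups

# Wilks' lambda: share of generalised variance NOT explained by the grouping.
wilks_lambda = np.linalg.det(W) / np.linalg.det(T)

# Pillai's trace: trace of B (B + W)^-1; near 0 means little group effect.
pillai_trace = np.trace(B @ np.linalg.inv(T))

print(f"Wilks' lambda  = {wilks_lambda:.4f}")
print(f"Pillai's trace = {pillai_trace:.4f}")
```

When the group effect is weak, 1 - Λ and Pillai's trace are nearly equal, which is consistent with the reported values of about 0.985 and 0.0151.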
One of the prompts given with the dataset was to check whether people prefer buying gold products in person at the store. We found that Spearman’s ρ between gold spending and store purchases is greater than Pearson’s r. Since Spearman’s ρ captures any monotonic relationship while Pearson’s r captures only linear ones, we concluded that the two are likely related in a non-linear fashion.
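The reasoning behind comparing the two coefficients can be seen on a toy example. The data below are an artificial, perfectly monotonic but curved relationship, not the gold and store purchase columns.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Artificial monotonic but non-linear relationship (exponential growth).
x = np.arange(1, 21, dtype=float)
y = np.exp(x / 3)

pearson_r = pearsonr(x, y)[0]      # measures linear association
spearman_rho = spearmanr(x, y)[0]  # measures monotonic (rank) association

print(f"Pearson r    = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
```

Spearman's ρ works on ranks, so it reaches 1 for any strictly increasing relationship, while Pearson's r falls below 1 as soon as the curve departs from a straight line.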
Is there a link between the country and campaigns accepted? To answer this, we bin the countries into two categories: developing and developed. Acceptance across all campaigns is added up to give a binary figure (campaign_binary), where 1 stands for at least one campaign accepted and 0 otherwise. A Chi-square test of independence is then performed on the resulting 2x2 crosstab (countries vs campaign_binary).
The test returned a chi-square statistic below the critical value for the chosen confidence level (0.95) and degrees of freedom ((2-1)×(2-1) = 1), with a p-value > 0.05. So there is no statistically significant relationship between country and campaigns accepted.
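The decision rule above can be sketched with SciPy. The counts in the 2x2 table are invented for illustration and are not the project's crosstab.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Invented 2x2 crosstab: rows = developed / developing,
# columns = campaign_binary 0 / 1.
table = np.array([[300, 100],
                  [290, 110]])

# For 2x2 tables SciPy applies Yates' continuity correction by default.
stat, p_value, dof, expected = chi2_contingency(table)

# Critical value at the 0.95 confidence level for the table's degrees of freedom.
critical = chi2.ppf(0.95, df=dof)
reject = stat > critical

print(f"chi2={stat:.3f}, critical={critical:.3f}, dof={dof}, p={p_value:.3f}")
print("reject independence" if reject else "fail to reject independence")
```

Comparing the statistic with the critical value and comparing the p-value with 0.05 are two views of the same decision, so they always agree.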
Campaign Acceptance Prediction
We used a logistic regression model to predict marketing campaign acceptance from customer data. The response to the last campaign, “Response”, is chosen as the target, and the other features (after appropriate feature selection) are used as predictors. The model’s hyperparameters are tuned using grid search with cross-validation.
The model returned an accuracy of 99.8% on the test data.
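A grid search over a logistic regression can be set up along these lines with scikit-learn. Everything in this sketch is an assumption made for illustration: the two synthetic predictors, the simulated target, the scaling step and the grid of `C` values. It will not reproduce the accuracy reported above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic customers: acceptance rises with income, falls with dependents.
rng = np.random.default_rng(0)
n = 600
income = rng.normal(52000, 15000, n)
dependents = rng.integers(0, 4, n)
logit = (income - 52000) / 15000 - 0.8 * (dependents - 1.5)
response = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([income, dependents])
X_train, X_test, y_train, y_test = train_test_split(
    X, response, test_size=0.25, random_state=0, stratify=response
)

# Scale features, then fit logistic regression; tune the regularisation strength C.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}  # assumed grid
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# Accuracy is measured on data the grid search never saw.
test_accuracy = grid.score(X_test, y_test)
print(grid.best_params_, f"test accuracy = {test_accuracy:.3f}")
```

Keeping the scaler inside the pipeline means it is refitted on each cross-validation fold, which avoids leaking information from the validation folds into training.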
The above figure suggests that there isn’t too much variability in the spread of the amount spent by customers on different products. This is in line with what we observed with the MANOVA test conducted earlier.
Here’s something interesting. Judging by the total number of purchases, Spain seems to be the store’s biggest market. But, when we looked at the mean amount people were spending, we found that some markets like the US were outperforming Spain a little.
Upon comparing the success rates of the campaigns, we found that some markets with good potential for the store (going by the mean amount spent by customers) had lower acceptance rates than Spain.
Given below is a heatmap with variables that we wanted to compare. Each cell is annotated with the Spearman correlation scores of the corresponding row and column. We concluded that the store’s website must be improved to attract more online purchases as it has considerably negative correlations with some key parameters.
We binned the age of the customers and observed interesting results by performing category-specific analysis.
We observed that, of all the age categories, Gen-Y.2 and BBoomers showed the most linear relationship between the total number of purchases (x) and Income (y).
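Age binning of this kind can be done with `pandas.cut`. The bin edges and labels below are hypothetical placeholders, since the cut-offs behind categories such as Gen-Y.2 and BBoomers are not stated here.

```python
import pandas as pd

# Toy birth years; age is derived as 2021 - Year_Birth, as in the cleaning step.
df = pd.DataFrame({"Year_Birth": [1950, 1968, 1979, 1985, 1994]})
df["Age"] = 2021 - df["Year_Birth"]

# Hypothetical bin edges and labels, chosen only for illustration.
bins = [0, 30, 45, 60, 120]
labels = ["under_30", "30_to_44", "45_to_59", "60_plus"]

# right=False makes each bin include its left edge and exclude its right edge.
df["Age_Group"] = pd.cut(df["Age"], bins=bins, labels=labels, right=False)
print(df)
```

Once the groups exist, per-category statistics follow from an ordinary `groupby` on the new column.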
The higher the number of dependants in the household, the less likely that family is to buy from the store.
Given all these factors, we came up with brief marketing strategies to improve overall campaign acceptance.
- The most recent marketing campaign in Mexico was highly successful (>60 percent acceptance), although this figure rests on only three customers. The same model can be implemented in future advertising campaigns.
- Acceptance of ad campaigns is positively correlated with income and negatively correlated with having children or teenagers. Create two targeted advertising campaigns, one for high-income persons without children or teenagers and the other for lower-income folks with children or teenagers.
- Wines and meats seem to be the most successful products (i.e., the average customer spent the most money on these items). Concentrate ad campaigns on increasing sales of less popular items.
- Deals and catalogue purchases are the low-performing channels.
- In-store purchases are the most effective channel.
- To reach more customers, concentrate ad campaigns on the most successful channels.
- Churn prediction: a logistic regression model with around 97% accuracy was created to predict whether a given customer will accept or reject the next campaign.