Marketing Analytics

-Anand Modi, Sundar R, Thivyaa Mohan, Nishant S

Sep 23, 2021

This project analyses a dataset containing information about 2,240 customers of a department store. The data is fictional and was created for practising exploratory and statistical analysis in a marketing context. With the analysis that follows, we aim to extract insights about:

Customer profiles
Product preferences
Campaign successes/failures
Channel performance

Data Description

ID — Customer ID
Year_Birth — Customer’s year of birth
Education — Customer’s Education Level
Marital_Status — Customer’s Marital Status
Income — Yearly Household Income
Kidhome — Number of children at the customer’s house
Teenhome — Number of teenagers at the customer’s house
Dt_Customer — Date of customer’s enrolment with the company
Recency — Number of days since last purchase
MntXYZ — Amount spent on XYZ related products in last 2 years
NumXYZ — Number of purchases made through XYZ medium
AcceptedCmpX — 1 if the customer accepted the offer in the Xth campaign, 0 otherwise
Response — 1 if the customer accepted the last campaign offer, 0 otherwise
Complain — 1 if the customer complained in the last 2 years, 0 otherwise
Country — country the customer belongs to

Cleaning and preprocessing

Looking at the nulls first

It looks like income is the only feature that needs to be dealt with for further analysis.

We have to remove all non-numeric characters from Income to make it an integer variable

Long tailed distribution with a slight right skew

After the initial cleaning, we impute all missing values for income with the median and take care of outliers using quantile-based flooring and capping.
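The income clean-up described above can be sketched roughly as follows. This is a minimal illustration on made-up values, not the code used for the project; the helper name `clean_income` and the 5th/95th percentile cut-offs are assumptions, since the text does not say which quantiles were used.

```python
import pandas as pd

def clean_income(income: pd.Series, lower_q: float = 0.05, upper_q: float = 0.95) -> pd.Series:
    """Strip non-numeric characters, impute the median, then floor and cap at quantiles."""
    # Keep only digits and the decimal point, e.g. "$84,835.00 " -> "84835.00".
    # Missing entries become empty strings here and NaN in the next step.
    stripped = income.astype(str).str.replace(r"[^0-9.]", "", regex=True)
    numeric = pd.to_numeric(stripped, errors="coerce")
    # Median imputation for the missing values
    numeric = numeric.fillna(numeric.median())
    # Quantile-based flooring and capping (cut-offs are an assumption)
    lower, upper = numeric.quantile(lower_q), numeric.quantile(upper_q)
    return numeric.clip(lower=lower, upper=upper)

# Made-up sample with one missing value and one extreme value
raw = pd.Series(["$84,835.00 ", "$57,091.00 ", None, "$666,666.00 ", "$21,474.00 "])
income = clean_income(raw)
print(income)
```

Capping keeps every row in the dataset while limiting the pull of extreme incomes on later statistics.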

We use Year_Birth to extract the age of the customer (= 2021-Year_Birth).

Boxplot of the age suggests that there are some outliers

We replace all the outliers with the median age. Next up, we clean the Marital_Status feature.

We replace “YOLO”, “Absurd”, and “Alone” with “Single”.
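A rough sketch of the age and marital status clean-up, on a small made-up table. The text does not say how outliers were identified, so the 1.5 × IQR boxplot rule used here is an assumption.

```python
import pandas as pd

# Made-up rows for illustration only
df = pd.DataFrame({
    "Year_Birth": [1970, 1985, 1893, 1990, 1962, 1978],
    "Marital_Status": ["Married", "YOLO", "Single", "Absurd", "Alone", "Together"],
})

# Age as of 2021, as in the text
df["Age"] = 2021 - df["Year_Birth"]

# Assumption: an outlier is any point outside the 1.5 * IQR boxplot whiskers
q1, q3 = df["Age"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["Age"] < q1 - 1.5 * iqr) | (df["Age"] > q3 + 1.5 * iqr)
df.loc[is_outlier, "Age"] = int(df["Age"].median())

# Fold the unusual marital status labels into "Single"
df["Marital_Status"] = df["Marital_Status"].replace(
    {"YOLO": "Single", "Absurd": "Single", "Alone": "Single"}
)
print(df)
```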

After doing some digging, we found that the dataset contains only three samples from the country ME (Mexico). We decided to remove them, as they were producing misleading results in the country-based statistical analysis.

Feature Engineering

Add up Kidhome and Teenhome to give us the total number of dependents per house
Add up all accepted campaigns to yield the total number of accepted campaigns
Add up the amount spent on each individual product to extract the total amount spent by the customer for the past 2 years
Add up catalogue, deals, store and web purchases to get the total number of purchases made by the customer over 2 years
One-hot encode age group, marital status, country and education
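The feature engineering steps above could look roughly like this in pandas. The new column names (`Dependents`, `TotalAccepted`, `TotalMnt`, `TotalPurchases`) are hypothetical, the rows are made up, and the sketch assumes that the spending, campaign and purchase columns follow the `Mnt…`, `AcceptedCmp…` and `Num…Purchases` naming from the data description.

```python
import pandas as pd

# Made-up rows with a subset of the columns
df = pd.DataFrame({
    "Kidhome": [0, 1, 2],
    "Teenhome": [1, 0, 1],
    "AcceptedCmp1": [0, 1, 0],
    "AcceptedCmp2": [0, 1, 0],
    "MntWines": [200, 50, 10],
    "MntFruits": [20, 5, 1],
    "NumWebPurchases": [3, 1, 2],
    "NumStorePurchases": [6, 2, 3],
    "Education": ["PhD", "Master", "Graduation"],
    "Country": ["SP", "US", "SP"],
})

# Select column groups by naming pattern (an assumption about the schema)
mnt_cols = [c for c in df.columns if c.startswith("Mnt")]
cmp_cols = [c for c in df.columns if c.startswith("AcceptedCmp")]
purchase_cols = [c for c in df.columns if c.startswith("Num") and c.endswith("Purchases")]

df["Dependents"] = df["Kidhome"] + df["Teenhome"]
df["TotalAccepted"] = df[cmp_cols].sum(axis=1)
df["TotalMnt"] = df[mnt_cols].sum(axis=1)
df["TotalPurchases"] = df[purchase_cols].sum(axis=1)

# One-hot encode the categorical columns present in this toy table
encoded = pd.get_dummies(df, columns=["Education", "Country"])
print(encoded.columns.tolist())
```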

Statistical Analysis

We used linear regression to model NumStorePurchases as a function of the other features. The top contributing features and the nature of their contributions are highlighted with permutation importance and SHAP analysis.

Weights of features used while predicting the number of store purchases (the higher the weight, the more the feature contributes to the result)

SHAP analysis capturing the effect of each feature on the output

The SHAP summary suggests the following:

The more total purchases, the more store purchases
The fewer catalogue, web and deals purchases, the more store purchases
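The regression and permutation importance steps can be sketched with scikit-learn as below. The data is synthetic and built so that store purchases equal total purchases minus web and catalogue purchases; `TotalPurchases` is a hypothetical column name, and the SHAP step is left out to keep the example dependency-light.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the customer table
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "NumWebPurchases": rng.integers(0, 10, n),
    "NumCatalogPurchases": rng.integers(0, 10, n),
    "Recency": rng.integers(0, 100, n),
})
store = rng.integers(0, 13, n)
X["TotalPurchases"] = X["NumWebPurchases"] + X["NumCatalogPurchases"] + store
y = store  # target: number of store purchases

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Signed coefficients show the direction of each feature's contribution
coef = pd.Series(model.coef_, index=X.columns)

# Permutation importance: drop in R^2 when one feature is shuffled
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
importances = pd.Series(result.importances_mean, index=X.columns)

print(coef.round(3))
print(importances.sort_values(ascending=False).round(3))
```

On this synthetic data the coefficient of total purchases is positive and those of the other channels are negative, which mirrors the pattern in the SHAP summary.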

We explore the relationship between the country and the products bought. “Country” is a nominal variable with more than two categories, and “Products” is a collection of continuous variables, so the analysis calls for a one-way multivariate analysis of variance (one-way MANOVA). MANOVA assumes that the dependent variables (amounts spent on the various products) follow a multivariate normal distribution within each group of the independent variable (country), and that the population covariance matrices of the groups are equal. We checked the second assumption with Levene’s test for homogeneity of variances, which looks at one product at a time. These are the steps we used:

Standardise the amount spent on each product by country.
Run Levene’s test on each product type to check for homoscedasticity. The results indicated that the variance of each product’s sales does not differ significantly across countries, which supports going ahead with the MANOVA.
Run the MANOVA with the MntXYZ columns as dependent variables and Country as the independent variable.
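A minimal sketch of the Levene and MANOVA steps with SciPy and statsmodels, run on synthetic data with only two product columns. The standardisation step is left out because the text does not specify how it was done, and the country codes are placeholders.

```python
import numpy as np
import pandas as pd
from scipy.stats import levene
from statsmodels.multivariate.manova import MANOVA

# Synthetic stand-in data; the real analysis would use the cleaned customer table
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "Country": rng.choice(["SP", "US", "CA"], size=n),
    "MntWines": rng.normal(300, 100, n),
    "MntFruits": rng.normal(30, 10, n),
})
products = ["MntWines", "MntFruits"]

# Levene's test per product: a large p-value means no evidence of unequal variances
levene_p = {}
for col in products:
    groups = [g[col].to_numpy() for _, g in df.groupby("Country")]
    _, levene_p[col] = levene(*groups)

# One-way MANOVA with the product columns as dependent variables
manova = MANOVA.from_formula(" + ".join(products) + " ~ Country", data=df)
stats_table = manova.mv_test().results["Country"]["stat"]
wilks = float(stats_table.loc["Wilks' lambda", "Value"])
pillai = float(stats_table.loc["Pillai's trace", "Value"])
p_value = float(stats_table.loc["Wilks' lambda", "Pr > F"])

print(levene_p)
print(stats_table)
```

The `stat` table holds Wilks' lambda, Pillai's trace, the Hotelling-Lawley trace and Roy's greatest root, each with its F value and p-value.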

Results of one-way MANOVA implemented using statsmodels

The p-values of the statistics returned by the MANOVA suggest that the independent variable (country) has no significant effect on the dependent variables (MntWines, MntFruits, …).

Wilks’ Λ suggests that country leaves about 98.5% of the variance in the dependent variables unexplained, and the p-value is also much greater than 0.05.
Pillai’s trace is a number between 0 and 1. The closer it is to 1, the more it suggests that the independent variable influences the dependent variables. But here it’s just 0.0151.

One of the prompts given with the dataset was to check whether people prefer buying gold products in person at the store. We compared the amount spent on gold products with the number of store purchases and found that Spearman’s ρ is greater than Pearson’s r for these two variables, which suggests that the two are related monotonically but not linearly.

Approximate relation described by a 2nd order curve
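The comparison of the two correlation coefficients and the second-order fit can be illustrated as follows. The values are synthetic and deliberately curved; they are not taken from the dataset.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic, monotonic but curved relation (illustrative values only)
store_purchases = np.arange(1, 14)
gold_spend = np.exp(store_purchases / 2.0)

r, _ = pearsonr(store_purchases, gold_spend)     # measures linear association
rho, _ = spearmanr(store_purchases, gold_spend)  # measures monotonic association

# Second-order polynomial fit; coefficients are ordered from x^2 down to the constant
coeffs = np.polyfit(store_purchases, gold_spend, deg=2)
fitted = np.polyval(coeffs, store_purchases)

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
print("Quadratic coefficients:", coeffs.round(3))
```

Spearman's ρ works on ranks, so it reaches 1 for any strictly increasing relation, while Pearson's r drops as the relation bends away from a straight line.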

Is there a link between the country and campaigns accepted? To answer this, we bin the countries into two categories: developing and developed. Acceptances across all campaigns are added up and converted to a binary flag (campaign_binary): 1 if at least one campaign was accepted and 0 otherwise. A chi-square test of independence is then performed on the 2x2 crosstab (country group vs campaign_binary).

The test returned a chi-square statistic below the critical value for the given confidence level (0.95) and degrees of freedom ((2-1)*(2-1)=1), and a p-value > 0.05, so there is no statistically significant relation between country and campaigns accepted.
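A small sketch of the chi-square test of independence with SciPy. The counts are invented, and `country_group` is a hypothetical column name for the developed/developing bins.

```python
import pandas as pd
from scipy.stats import chi2, chi2_contingency

# Invented counts: 50 customers per country group
df = pd.DataFrame({
    "country_group": ["developed"] * 50 + ["developing"] * 50,
    "campaign_binary": [1] * 18 + [0] * 32 + [1] * 13 + [0] * 37,
})

# 2x2 contingency table: country group vs at least one campaign accepted
table = pd.crosstab(df["country_group"], df["campaign_binary"])

# For 2x2 tables SciPy applies Yates' continuity correction by default
stat, p_value, dof, expected = chi2_contingency(table)

# Critical value at the 0.95 confidence level for the given degrees of freedom
critical = chi2.ppf(0.95, dof)

print(table)
print(f"chi2 = {stat:.3f}, critical = {critical:.3f}, dof = {dof}, p = {p_value:.3f}")
```

A statistic below the critical value (equivalently, a p-value above 0.05) means the null hypothesis of independence is not rejected.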

Campaign Acceptance Prediction

We used a logistic regression model to predict marketing campaign acceptance using customer data. The last campaign, “Response”, is chosen as the target, and the other features (after appropriate feature selection) are used as predictors. The model is tuned using grid search with cross-validation over its hyperparameters.

Iterative results of hyperparameter tuning

The model returned an accuracy of 99.8% on the test data.
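The modelling setup can be sketched with scikit-learn as below. The data is synthetic, and the scaling step, the grid over `C`, the five folds and the 75/25 split are all assumptions, as the text does not list the settings that were searched.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictors and the "Response" target
X, y = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid over the regularisation strength
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Evaluating on a held-out test split that the grid search never sees keeps the reported accuracy separate from the tuning process.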

EDA

The last campaign enjoyed the greatest success rate

Wine is the store’s best performing product followed by meat products

Amount spent on each product scaled down to a (0,1) interval

The above figure suggests that there isn’t too much variability in the spread of the amount spent by customers on different products. This is in line with what we observed with the MANOVA test conducted earlier.

Here’s something interesting. Judging by the total number of purchases, Spain seems to be the store’s biggest market. But, when we looked at the mean amount people were spending, we found that some markets like the US were outperforming Spain a little.

Upon comparing the success rates of the campaigns, we found that some markets with good potential for the store (going by the mean amount spent by customers) had lower acceptance rates than Spain.

Given below is a heatmap of the variables that we wanted to compare. Each cell is annotated with the Spearman correlation coefficient of the corresponding row and column. We concluded that the store’s website must be improved to attract more online purchases as it has considerably negative correlations with some key parameters.
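The numbers behind such a heatmap come from a Spearman correlation matrix, which pandas can compute directly. The values below are made up, and `TotalMnt` is a hypothetical name for the total amount spent.

```python
import pandas as pd

# Made-up values for illustration
df = pd.DataFrame({
    "Income": [10, 20, 30, 40, 50],
    "TotalMnt": [1, 4, 9, 16, 30],
    "NumWebPurchases": [5, 4, 3, 2, 1],
})

# Pairwise Spearman rank correlations between all numeric columns
corr = df.corr(method="spearman")
print(corr)
```

The resulting square table can be passed to any plotting library's heatmap function to produce the annotated figure.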

We binned the age of the customers and observed interesting results by performing category-specific analysis.

We observed that Gen-Y.2 and BBoomers had the most linear relation among all categories when comparing the total number of purchases (x) with income (y).
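One way to bin ages and compare linearity per group is sketched below. The bin edges and the `Gen-Y.1` and `Gen-X` labels are hypothetical (only Gen-Y.2 and BBoomers are named in the text), the rows are made up, and Pearson's r is used here as the measure of linearity, which is an assumption.

```python
import pandas as pd

# Made-up customers
df = pd.DataFrame({
    "Age": [25, 28, 29, 35, 38, 60, 65, 70],
    "TotalPurchases": [5, 10, 15, 8, 12, 4, 9, 20],
    "Income": [20, 40, 60, 30, 50, 10, 50, 55],
})

# Hypothetical bin edges and labels; intervals are closed on the right
bins = [0, 30, 40, 56, 120]
labels = ["Gen-Y.2", "Gen-Y.1", "Gen-X", "BBoomers"]
df["AgeGroup"] = pd.cut(df["Age"], bins=bins, labels=labels)

# Pearson correlation between total purchases and income within each age group
corr_by_group = {
    name: group["TotalPurchases"].corr(group["Income"])
    for name, group in df.groupby("AgeGroup", observed=True)
}
print(corr_by_group)
```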

The higher the number of dependants in the household, the less likely that family is to buy from the store.

Households with 2–3 dependants spend significantly less

Marketing Strategies

Given all these factors, we came up with brief marketing strategies to improve overall campaign acceptance.

The most recent marketing campaign in Mexico was highly successful (>60 percent acceptance), so the same model can be implemented in future advertising campaigns. Note that this figure is based on only three customers.
Acceptance of ad campaigns is positively correlated with income and negatively correlated with having children or teenagers. Create two targeted advertising campaigns, one for high-income customers without children or teenagers and the other for lower-income customers with children or teenagers.
Wines and meats are the most successful products (i.e., the average customer spent the most money on these items). Concentrate ad campaigns on increasing sales of less popular items.
Deals and catalogue purchases are the low-performing channels.
In-store purchases are the most effective channel.
To reach more customers, concentrate ad campaigns on the most successful channels.
Campaign acceptance prediction: a logistic regression model with around 97% accuracy predicts whether a given customer will accept or reject the next campaign.