Marketing Analytics

This project analyses a dataset containing information about 2,240 customers of a department store. The data is fictional and was created for practising exploratory and statistical analysis in a marketing context. Through the analysis that follows, we aim to extract insights about:

  • Customer profiles
  • Product preferences
  • Campaign successes/failures
  • Channel performance

Data Description

  • ID — Customer ID
  • Year_Birth — Customer’s Year of Birth
  • Education — Customer’s Education Level
  • Marital_Status — Customer’s Marital Status
  • Income — Yearly Household Income
  • Kidhome — Number of children at the customer’s house
  • Teenhome — Number of teenagers at the customer’s house
  • Dt_Customer — Date of customer’s enrolment with the company
  • Recency — Number of days since last purchase
  • MntXYZ — Amount spent on XYZ related products in last 2 years
  • NumXYZ — Number of purchases made through XYZ medium
  • AcceptedCmpX — 1 if the customer accepted the offer in the Xth campaign, 0 otherwise
  • Response — 1 if the customer accepted the last campaign offer
  • Complain — 1 if the customer complained in the last 2 years
  • Country — country the customer belongs to

Cleaning and preprocessing

Looking at the nulls first

Number of nulls by column

It looks like income is the only feature that needs to be dealt with for further analysis.

We have to remove all non-numeric characters from Income to make it an integer variable
Long-tailed distribution with a slight right skew

After the initial cleaning, we impute all missing values for income with the median and take care of outliers using quantile-based flooring and capping.
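
The income cleaning described above can be sketched roughly as follows. This is a minimal illustration on made-up values, not the code behind the article: the raw string format and the 5th/95th percentile cut-offs are assumptions, since the text does not say which quantiles were used.

```python
import pandas as pd

# Toy stand-in for the raw Income column (values and string format are made up).
raw = pd.Series(["$84,835.00 ", "$57,091.00 ", None,
                 "$32,474.00 ", "$666,666.00 ", "$21,474.00 "])

# Drop everything except digits and the decimal point, then convert to numbers.
income = pd.to_numeric(raw.str.replace(r"[^0-9.]", "", regex=True), errors="coerce")

# Impute missing values with the median.
income = income.fillna(income.median())

# Quantile-based flooring and capping (5th/95th percentiles are an assumed choice).
low, high = income.quantile(0.05), income.quantile(0.95)
income_capped = income.clip(lower=low, upper=high)

print(income_capped)
```

Clipping keeps every row and only pulls the extreme values in to the chosen quantiles, which is why it is often preferred over dropping outliers on a small dataset.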

We use Year_Birth to extract the age of the customer (= 2021-Year_Birth).

Boxplot of the age suggests that there are some outliers

We replace all the outliers with the median age. Next up, we clean the Marital_Status feature.

We replace “YOLO”, “Absurd”, and “Alone” with “Single”
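
A small sketch of the age and marital status clean-up, on made-up rows. The 1.5 × IQR rule used here to flag outliers is an assumption (it is the rule a boxplot draws its whiskers with); the article only says that outliers were spotted on a boxplot.

```python
import pandas as pd

# Made-up rows; only the column names come from the dataset description.
df = pd.DataFrame({
    "Year_Birth": [1970, 1985, 1893, 1992, 1960, 1978, 1982, 1975, 1968, 1988],
    "Marital_Status": ["Married", "YOLO", "Single", "Absurd", "Alone",
                       "Together", "Divorced", "Married", "Single", "Together"],
})

# Age as defined in the text.
df["Age"] = 2021 - df["Year_Birth"]

# Flag outliers with the 1.5 * IQR rule (assumed) and replace them with the median age.
q1, q3 = df["Age"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["Age"] < q1 - 1.5 * iqr) | (df["Age"] > q3 + 1.5 * iqr)
df.loc[is_outlier, "Age"] = int(df["Age"].median())

# Fold the odd marital status labels into "Single".
df["Marital_Status"] = df["Marital_Status"].replace(
    {"YOLO": "Single", "Absurd": "Single", "Alone": "Single"}
)

print(df)
```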

After doing some digging, we found that the dataset contained only three samples from the country ME (Mexico). We decided to remove them because they produced misleading results in the country-based statistical analysis.

Feature Engineering

  • Add up Kidhome and Teenhome to give us the total number of dependents per house
  • Add up all accepted campaigns to yield the total number of accepted campaigns
  • Add up the amount spent on each individual product to extract the total amount spent by the customer for the past 2 years
  • Add up catalogue, deals, store and web purchases to get the total number of purchases made by the customer over 2 years
  • One-hot encode age group, marital status, country and education
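
The steps above can be sketched with pandas along these lines. The rows are made up, the derived column names (Dependents, TotalAccepted, TotalMnt, TotalPurchases) are hypothetical, and the sketch assumes the accepted-campaign total covers only the AcceptedCmpX columns, which the text does not spell out.

```python
import pandas as pd

# Made-up rows; MntXYZ / NumXYZ columns follow the naming pattern in the data description.
df = pd.DataFrame({
    "Kidhome": [0, 1, 2],
    "Teenhome": [1, 0, 1],
    "AcceptedCmp1": [0, 1, 0],
    "AcceptedCmp2": [0, 1, 0],
    "MntWines": [300, 20, 5],
    "MntFruits": [40, 3, 0],
    "NumWebPurchases": [4, 2, 1],
    "NumStorePurchases": [8, 3, 2],
    "Education": ["PhD", "Graduation", "Master"],
    "Country": ["SP", "US", "SP"],
})

# Total dependents per household.
df["Dependents"] = df["Kidhome"] + df["Teenhome"]

# Row-wise sums over every column that matches a naming pattern.
df["TotalAccepted"] = df.filter(regex=r"^AcceptedCmp").sum(axis=1)
df["TotalMnt"] = df.filter(regex=r"^Mnt").sum(axis=1)
df["TotalPurchases"] = df.filter(regex=r"^Num.*Purchases$").sum(axis=1)

# One-hot encode the categorical columns.
encoded = pd.get_dummies(df, columns=["Education", "Country"])

print(encoded.columns.tolist())
```

Selecting columns by pattern keeps the sums correct if more product or channel columns are present than in this toy frame.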

Statistical Analysis

We used linear regression to model NumStorePurchases as a function of the other features. The top contributing features and the nature of their contributions are highlighted with permutation importance and SHAP analysis.

Weights of the features used to predict the number of store purchases (the higher the weight, the more the feature contributes to the result)
SHAP analysis capturing the effect of each feature on the output

The SHAP summary suggests the following:

  • The more total purchases, the more store purchases
  • The fewer catalogue, web and deals purchases, the more store purchases
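
For readers who want to try the idea, here is a rough sketch of a linear regression followed by permutation importance. It uses scikit-learn's `permutation_importance`, which may not be the tool used for the figure above, and the data is synthetic with a relationship built in by hand, so the numbers say nothing about the real dataset. The SHAP step is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Synthetic predictors (names borrowed from the article, values made up).
X = pd.DataFrame({
    "TotalPurchases": rng.integers(5, 40, n),
    "NumWebPurchases": rng.integers(0, 10, n),
    "Income": rng.normal(50_000, 15_000, n),
})
# Synthetic target: rises with total purchases, falls with web purchases.
y = 0.6 * X["TotalPurchases"] - 0.8 * X["NumWebPurchases"] + rng.normal(0, 1, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and measure the drop in R^2.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
importances = pd.Series(result.importances_mean, index=X.columns)
importances = importances.sort_values(ascending=False)

print(importances)
```

Permutation importance gives the size of each feature's contribution but not its direction; the sign of the fitted coefficient (or a SHAP summary plot) is needed for that.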

We explore the relationship between the country and the products bought. “Country” is a nominal variable with more than two categories, and “Products” is a collection of continuous variables, so the analysis calls for a one-way multivariate analysis of variance (one-way MANOVA). MANOVA assumes that the dependent variables (the amounts spent on the various products) follow a multivariate normal distribution within each group of the independent variable (country), and that the population covariance matrices of the groups are equal. We checked the second assumption one product at a time with Levene’s test for homogeneity of variances. These are the steps we used:

  • Amount spent on each product standardised by country.
  • Levene’s test for each type of product to check for homoscedasticity. The results indicated that the variance of each product’s sales does not differ significantly across countries, which is the statistical nod for going ahead with the MANOVA.
  • MANOVA was performed with MntXYZ as dependent variables and Country as the independent variable.
Results of one-way MANOVA implemented using statsmodels
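
A minimal sketch of the Levene check and the one-way MANOVA, using `scipy.stats.levene` and the `MANOVA` class from statsmodels. The data is synthetic (every country is drawn from the same distribution) and only two product columns are used. The standardisation step is left out because the text does not say how it was done.

```python
import numpy as np
import pandas as pd
from scipy.stats import levene
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(1)
n = 300

# Synthetic spend data; country codes and amounts are made up.
df = pd.DataFrame({
    "Country": rng.choice(["SP", "US", "CA"], size=n),
    "MntWines": rng.gamma(2.0, 150.0, n),
    "MntFruits": rng.gamma(2.0, 15.0, n),
})
products = ["MntWines", "MntFruits"]

# Levene's test per product: do the variances differ across countries?
levene_p = {}
for col in products:
    groups = [g[col].to_numpy() for _, g in df.groupby("Country")]
    levene_p[col] = levene(*groups).pvalue
print(levene_p)

# One-way MANOVA: product spends as dependent variables, Country as the factor.
manova = MANOVA.from_formula("MntWines + MntFruits ~ Country", data=df)
res = manova.mv_test()

# Table with Wilks' lambda, Pillai's trace and the other test statistics for Country.
stat_table = res.results["Country"]["stat"]
print(stat_table)
```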

The p-values of the statistics returned by the MANOVA suggest that the independent variable (country) makes no significant contribution to the changes in the dependent variables (MntWines, MntFruits, …).

  • Wilks’ Λ suggests that 98.5% of the variation in the dependent variables is not explained by the independent variable, and the p-value is much greater than 0.05.
  • Pillai’s trace is a number between 0 and 1. The closer it is to 1, the more it suggests that the independent variable influences the dependent variables. But here it’s just 0.0151.

One of the prompts given with the dataset was to check whether people prefer buying gold products in person at the store. We found that Spearman’s ρ is greater than Pearson’s r for these two variables, so it is likely that the two are related in a monotonic but non-linear fashion.

Approximate relation described by a 2nd order curve
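
The comparison of the two coefficients and the second-order fit can be reproduced in a few lines. This is a sketch on synthetic data where the quadratic shape is built in by hand; `store` and `gold` are stand-ins for the store purchase count and the amount spent on gold products.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(2)

# Synthetic stand-ins: gold spend grows quadratically with store purchases (by construction).
store = rng.integers(0, 14, 400).astype(float)
gold = 2.0 * store**2 + rng.normal(0, 20, 400)

# Pearson measures linear association, Spearman measures monotonic association.
r, _ = pearsonr(store, gold)
rho, _ = spearmanr(store, gold)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")

# Least-squares fit of a 2nd-order polynomial: gold ~ a*store^2 + b*store + c.
a, b, c = np.polyfit(store, gold, deg=2)
print(f"a = {a:.2f}, b = {b:.2f}, c = {c:.2f}")
```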

Is there a link between the country and campaigns accepted? To answer this, we bin the countries into two categories: developing and developed. Acceptances across all campaigns are added up and reduced to a binary figure (campaign_binary), where 1 stands for at least one campaign accepted and 0 otherwise. A chi-square test of independence is then performed on the resulting 2x2 crosstab (countries vs campaign_binary).


The test returned a chi-square statistic below the critical value for the chosen confidence level (0.95) and degrees of freedom ((2−1)×(2−1) = 1), and a p-value greater than 0.05, so there is no statistically significant relation between country and campaigns accepted.
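
Here is a rough sketch of the same test with `scipy.stats.chi2_contingency`. The data is random, the split of countries into "developed" and "developing" is a hypothetical mapping (the article does not list it), and `TotalAccepted` is a hypothetical name for the summed campaign acceptances.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2, chi2_contingency

rng = np.random.default_rng(5)
n = 500

# Random stand-in data; country codes and acceptance counts are made up.
df = pd.DataFrame({
    "Country": rng.choice(["SP", "US", "IND", "SA"], size=n),
    "TotalAccepted": rng.choice([0, 0, 0, 1, 2], size=n),
})

# Hypothetical grouping of countries into two bins.
developed = {"SP", "US"}
df["country_group"] = np.where(df["Country"].isin(developed), "developed", "developing")

# 1 if at least one campaign was accepted, 0 otherwise.
df["campaign_binary"] = (df["TotalAccepted"] > 0).astype(int)

# 2x2 crosstab and chi-square test of independence.
table = pd.crosstab(df["country_group"], df["campaign_binary"])
stat, p_value, dof, expected = chi2_contingency(table)

# Critical value at the 0.95 confidence level for the table's degrees of freedom.
critical = chi2.ppf(0.95, dof)
print(table)
print(f"chi2 = {stat:.3f}, critical = {critical:.3f}, p = {p_value:.3f}, dof = {dof}")
```

Comparing the statistic with the critical value and comparing the p-value with 0.05 are two views of the same decision, so they always agree.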

Campaign Acceptance Prediction

We used a logistic regression model to predict marketing campaign acceptance using customer data. The last campaign, “Response”, is chosen as the target, and the other features (after appropriate feature selection) are used as predictors. The model is tuned using grid search with cross-validation over its hyperparameters.

Iterative results of hyperparameter tuning
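
A minimal sketch of logistic regression tuned with grid search and cross-validation in scikit-learn. The data comes from `make_classification` rather than the customer table, and the grid over the regularisation strength `C` is an assumed example; the article does not list the hyperparameters that were searched.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary target standing in for "Response".
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           weights=[0.85], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale the features, then fit a logistic regression.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Assumed grid: only the inverse regularisation strength C is searched here.
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Accuracy of the best model on the held-out test split.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Putting the scaler inside the pipeline means it is refitted on each training fold, so no information from the validation fold leaks into the scaling.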

The model returned an accuracy of 99.8% on the test data.


The last campaign enjoyed the greatest success rate
Wine is the store’s best performing product followed by meat products
Amount spent on each product scaled down to a (0,1) interval

The above figure suggests that there isn’t too much variability in the spread of the amount spent by customers on different products. This is in line with what we observed with the MANOVA test conducted earlier.

Here’s something interesting. Judging by the total number of purchases, Spain seems to be the store’s biggest market. But, when we looked at the mean amount people were spending, we found that some markets like the US were outperforming Spain a little.

Upon comparing the success rates of the campaigns, we found that some markets with good potential for the store (going by the mean amount spent by customers) had lower acceptance rates than Spain.

Given below is a heatmap of the variables that we wanted to compare. Each cell is annotated with the Spearman correlation coefficient of the corresponding row and column. We concluded that the store’s website must be improved to attract more online purchases, as it has considerably negative correlations with some key parameters.
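
The matrix behind such a heatmap can be computed directly with pandas. This is a small sketch on synthetic data: the three columns are only examples and the values are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 200

# Synthetic columns; store purchases are tied to income by construction.
income = rng.normal(50_000, 15_000, n)
df = pd.DataFrame({
    "Income": income,
    "NumStorePurchases": np.round(income / 10_000 + rng.normal(0, 1, n)),
    "NumWebPurchases": rng.integers(0, 10, n),
})

# Pairwise Spearman rank correlations between all columns.
corr = df.corr(method="spearman")
print(corr.round(2))
```

The resulting frame can be handed to any heatmap plotting function with annotations switched on.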

We binned the age of the customers and observed interesting results by performing category-specific analysis.

Spread of Age

We observed that Gen-Y.2 and BBoomers had the most linear relationship of all the age groups when comparing the total number of purchases (x) with income (y).

Gen-Y.2 (Pearson’s R = 0.88)
BBoomers (Pearson’s R = 0.84)
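
Binning the ages and computing Pearson’s r per group can be sketched as below. The bin edges and group labels are hypothetical, since the article does not give the boundaries of its age groups, and the data is synthetic with a linear relation built in.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 300

# Synthetic customers; income is linear in total purchases plus noise (by construction).
df = pd.DataFrame({
    "Age": rng.integers(25, 80, n),
    "TotalPurchases": rng.integers(1, 40, n),
})
df["Income"] = 2_000 * df["TotalPurchases"] + rng.normal(0, 8_000, n)

# Hypothetical bin edges and labels.
bins = [0, 35, 50, 65, 120]
labels = ["group_a", "group_b", "group_c", "group_d"]
df["AgeGroup"] = pd.cut(df["Age"], bins=bins, labels=labels)

# Pearson's r between total purchases and income within each age group.
r_by_group = (
    df.groupby("AgeGroup", observed=True)[["TotalPurchases", "Income"]]
      .apply(lambda g: g["TotalPurchases"].corr(g["Income"]))
)
print(r_by_group.round(2))
```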

The higher the number of dependants in the household, the less likely that family is to buy from the store.

Households with 2–3 dependants spend significantly less

Marketing Strategies

Given all these factors, we came up with brief marketing strategies to improve overall campaign acceptance.

  • The most recent marketing campaign in Mexico was highly successful (>60 percent acceptance), although this figure rests on only three customers. The same model can be tried in future advertising campaigns.
  • Acceptance of ad campaigns is positively correlated with income and negatively correlated with having children or teenagers. Create two targeted advertising campaigns: one for high-income customers without children or teenagers and the other for lower-income customers with children or teenagers.
  • Wines and meats seem to be the most successful products (i.e., the average customer spent the most money on these items). Concentrate ad campaigns on increasing sales of less popular items.
  • Deals and catalogue purchases are the low-performing channels.
  • In-store purchases are the most effective channel.
  • To reach more customers, concentrate ad campaigns on the most successful channels.
  • Acceptance prediction: a logistic regression model with around 97% accuracy was created to predict whether a given customer will accept or reject the next campaign.




The Business Club Of NITT
