Market Basket Analysis

Jan 11, 2024

By Avantika, Vibha and Pragati

INTRODUCTION

Market basket analysis is one of the most common and useful data analysis techniques in sales and marketing. Merchants use it to analyze their customers’ buying behavior, such as which items consumers are likely to purchase together from their stores, which helps retailers make informed decisions. Effective analysis can increase a retailer’s profitability, service quality, and customer satisfaction, and it provides value-added information that can be used to improve decision-making. This project concentrates on descriptive analysis of customer purchase patterns, items that are frequently purchased together, and items that are frequently purchased from the store. Our dataset includes variables relating to orders as well as order timing. This analysis will enable Instacart to enhance the user experience by suggesting which products should be placed in the basket.

OBJECTIVE

The project’s main goal is to help Instacart suppliers better understand the purchasing habits of their current customers. To estimate likely future purchasing behavior, retailers must first study existing customer activity. Customer transaction data can be used to evaluate and visualize purchase behavior and to support assortment planning and inventory management, in order to retain customers, improve revenue, and enhance customer relationships.

Brief Overview of the Dataset

To undertake the analysis, we took the datasets from Kaggle, where they were provided by Instacart. The data includes over 3 million orders from over 200,000 Instacart customers. The product and customer data cover 50,000 unique products, multiple product aisles and departments, and the day of the week and time of each purchase.

A total of six datasets were provided, giving information on customer transactions and order details.

Aisles: This dataset provides information about aisles, such as aisle names and aisle IDs, and how products were organized within them.
Department: This dataset provides information on the departments, such as department names and department IDs.
Order_Products_prior/Order_Products_train: Order_Products_prior contains the details of all prior orders, including information on orders, products, and reordered items. Order_Products_train has the same structure as Order_Products_prior and serves as the training dataset.
Orders: This dataset has information about customer orders, such as order ID, order number, weekday of the order, hour of the order, user ID, and days since the prior order.
Products: This dataset gives information on the products sold to customers, such as product name, product ID, aisle and department.

When the dataset was checked for missing values using pandas, 2,078,068 of the 33,819,106 datapoints were missing.
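A check of this kind can be sketched with pandas as follows. The tiny table below is a hypothetical stand-in for the orders data (the column names follow the description above, the values are made up), so the counts it prints are not the ones reported in this article.

```python
import pandas as pd

# Hypothetical miniature of the orders table; the real file has millions of rows.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "user_id": [10, 10, 11, 11],
    "order_number": [1, 2, 1, 2],
    "days_since_prior_order": [None, 7.0, None, 30.0],
})

missing_per_column = orders.isna().sum()       # missing values in each column
total_missing = int(missing_per_column.sum())  # missing values in the whole table
total_datapoints = orders.size                 # rows * columns

print(missing_per_column)
print(f"{total_missing} missing out of {total_datapoints} datapoints")
```

Looking at the per-column counts first is usually more informative than the total, since it shows whether the gaps are concentrated in a single column.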

Exploratory Data Analysis

Following data preparation and cleaning, exploratory data analysis of the dataset is important to determine the relation between different columns.

*The figure above illustrates that, on average, it takes 7 days for a user to reorder after their last order.*

Most users reorder products on the same day as their previous purchase.

*Most users’ reorder ratios lie between 41% and 50%, which means user retention is around 17%.*

*Reorder low: 0–33.33%. Reorder medium: 33.33–66.66%. Reorder high: 66.66–100%. From the graph it is clear that most people’s reorder frequency falls in the medium range.*

*Products ordered at 7 am are the most likely to be reordered. The reordering percentage is above 60% for orders placed between 5 am and 10 am.*

***Products ordered on Monday (coded as 1) are the most likely to be reordered.***

Only products added to the cart in positions 1 to 23 are reordered at a significant rate, and the reordering percentage gradually decreases with cart position. The first and second products that go into the cart are reorders more than 68% of the time.
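As a small illustration of the three bands, the function below maps a user’s reorder ratio to a label. The function name and the example user are hypothetical and only meant to show the thresholds.

```python
def reorder_bucket(ratio):
    """Map a reorder ratio in [0, 1] to the low / medium / high bands."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    if ratio < 1 / 3:
        return "low"
    if ratio < 2 / 3:
        return "medium"
    return "high"

# Hypothetical user: 9 of the 20 products they bought were reorders.
user_ratio = 9 / 20
print(reorder_bucket(user_ratio))
```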

*Number of orders placed from each department*

***Products from the dairy eggs department are reordered the most.***

*Top 15 aisles based on number of products ordered*

*Products from the milk aisle are reordered the most.*

***Raw Veggie Wrappers are reordered the most***

Usually more products are ordered 7 days after the previous purchase, but there is only a 69% chance that such a product is a reorder. 73% of products purchased are likely to be reordered on the same day.

*The most orders are placed between 1 pm and 2 pm on Sundays and at 10 am on Saturdays.*

Data Modelling

Association rule learning is an unsupervised learning technique that determines how strongly one data item depends on another, so that these dependencies can be used to increase profitability. It attempts to discover meaningful relationships between the variables in the dataset, using several criteria to decide which relationships are interesting.

How does Association Rule Learning work?

Association rule learning works on the concept of if-then statements, such as: if A, then B.

Here the “if” element is called the antecedent, and the “then” element is called the consequent. A relationship between two single items is said to have single cardinality. Association rule learning is all about creating rules, and as the number of items in a rule increases, the cardinality increases accordingly. To measure the associations between thousands of data items, several metrics are used:

Support refers to how frequently an itemset X appears in the dataset. It is defined as the proportion of the T transactions that include X, and may be expressed as Support(X) = Freq(X) / T.

Confidence represents how often the rule has been found to be true, that is, how frequently X and Y appear together in the dataset given that X has already occurred. It is the ratio of the number of transactions containing both X and Y to the number of transactions containing X: Confidence(X → Y) = Freq(X, Y) / Freq(X).

Lift is the ratio of the observed support of X and Y together to the support that would be expected if X and Y were independent of one another: Lift(X → Y) = Support(X, Y) / (Support(X) × Support(Y)).

There are three possible cases:

If Lift = 1, the occurrences of the antecedent and the consequent are independent of one another.
If Lift > 1, the two itemsets are positively dependent on one another, and the value indicates the degree of dependence.
If Lift < 1, one item serves as a substitute for the other, meaning that the presence of one has a negative effect on the presence of the other.
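The three metrics can be sketched in a few lines of plain Python. The baskets and item names below are made up for illustration and are not taken from the Instacart data; the function names are likewise just for this example.

```python
# Hypothetical baskets, each one a set of items bought together.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(transactions, itemset):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

print("support(bread, milk) =", support(baskets, {"bread", "milk"}))
print("confidence(bread -> milk) =", confidence(baskets, {"bread"}, {"milk"}))
print("lift(bread -> milk) =", lift(baskets, {"bread"}, {"milk"}))
```

In this toy data the lift of bread → milk comes out slightly below 1, so the two items appear together a little less often than independence would predict.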

Apriori Algorithm

The Apriori principle states that if an itemset is frequent, all of its subsets are also frequent. An itemset is considered frequent if its support meets the minimum support threshold. Apriori employs an approach known as level-wise search. Initially, the database is scanned to find the frequent 1-itemsets, those that meet the minimum support.

Frequent 1-itemsets are used to identify frequent 2-itemsets, and so on, until no further frequent k-itemsets can be found. Apriori uses a breadth-first search to count the support of candidate itemsets, relying on the anti-monotone property that every subset of a frequent itemset must also be frequent.

In this part of the analysis we focus on the overall purchasing behavior of users, providing insight into the quantity and variety of products they order, irrespective of whether a product is a reorder or not. This gives us a general idea of users’ preferences.
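The level-wise search described above can be sketched as follows. This is a minimal teaching version under simple assumptions (small in-memory baskets, made-up item names), not the implementation used for the results in this article.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent_among(candidates):
        # One pass over the data to count each candidate itemset.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    # Level 1: frequent single items.
    current = frequent_among({frozenset([i]) for t in transactions for i in t})
    frequent = dict(current)
    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = frequent_among(candidates)
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical baskets for illustration only.
baskets = [
    ["banana", "strawberries", "avocado"],
    ["banana", "strawberries"],
    ["banana", "avocado"],
    ["strawberries", "avocado"],
    ["banana", "strawberries", "avocado"],
]
result = apriori(baskets, min_support=0.6)
for itemset, supp in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
```

Note that each level requires another pass over all transactions, which is where the cost of Apriori comes from on large datasets.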

*5 most frequent items that occur in a basket*

A lift greater than 1 indicates that the presence of A has a positive effect on the likelihood of B being purchased, suggesting a stronger association between A and B than would be expected by chance. A lift of 1 means there is no association. From the above table, we can conclude that the following combinations of two items have a strong association: Honeycrisp Apple & Apple Honeycrisp Organic, and Organic Yellow Onion & Organic Garlic.

Among combinations of four items, Organic Raspberries, Organic Strawberries, Organic Garlic and Organic Yellow Onion have a high lift value above 4, so they have a strong association and are likely to occur together.

Leverage is the difference between the observed support of X and Y together and the support expected if they were independent. A positive leverage value indicates that X and Y occur together more frequently than would be expected by chance, suggesting a positive association. Limes & Large Lemon and Organic Garlic & Organic Baby Spinach occur together frequently.

These are the top 5 products that are usually purchased with bananas. These combinations occur together frequently in a basket.

A high conviction value (> 1) indicates that the consequent Y depends strongly on the antecedent X, implying a strong association. Organic Lemon and Organic Hass Avocado have a strong association.

Conclusion: Organic Raspberries, Strawberries, Organic Strawberries and Organic Hass Avocado have a confidence of 86%, meaning that in 86% of the transactions containing the rule’s antecedent, the consequent appears as well.

Inference: The white node represents Organic Strawberries and has 9 connections. Since Organic Strawberries have more connections than other products, they are frequently bought together with other products.

Conclusion: Organic products have the most connections. We can conclude that customers purchase more organic products from the store.
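Leverage and conviction can be computed from the same support counts as the other metrics. The sketch below uses the standard definitions, leverage = Support(X, Y) − Support(X) × Support(Y) and conviction = (1 − Support(Y)) / (1 − Confidence(X → Y)); the baskets are made up and the function name is hypothetical.

```python
import math

def leverage_and_conviction(transactions, antecedent, consequent):
    """Return (leverage, conviction) for the rule antecedent -> consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    x, y = set(antecedent), set(consequent)
    n = len(transactions)
    supp_x = sum(x <= set(t) for t in transactions) / n
    supp_y = sum(y <= set(t) for t in transactions) / n
    supp_xy = sum((x | y) <= set(t) for t in transactions) / n
    conf = supp_xy / supp_x
    leverage = supp_xy - supp_x * supp_y
    # A rule that always holds (confidence 1) has infinite conviction.
    conviction = math.inf if conf == 1 else (1 - supp_y) / (1 - conf)
    return leverage, conviction

# Hypothetical baskets for illustration only.
baskets = [
    {"limes", "lemon"},
    {"limes", "lemon"},
    {"limes"},
    {"spinach"},
    {"lemon", "spinach"},
]
lev, conv = leverage_and_conviction(baskets, {"limes"}, {"lemon"})
print(f"leverage = {lev:.3f}, conviction = {conv:.3f}")
```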

This part of the analysis focuses specifically on products that users choose to reorder, capturing patterns related to user loyalty and repeat purchases.

Inference: These 5 products are reordered most frequently, as they have high support values; support denotes how often a product occurs across baskets.

Positive leverage indicates that A and B occur together more frequently than expected by chance. So Bag of Organic Bananas, Organic Strawberries and Organic Hass Avocado are frequently reordered together.

F-P Growth Algorithm

The FP-growth algorithm compresses transaction records into a compact tree data structure called the FP-tree, offering an alternative way to find frequent itemsets. It turns the dataset into a prefix-tree representation in which transactions that share leading items share a path, whereas the Apriori technique relies on candidate generation and testing. After the FP-tree is built, it is divided into conditional FP-trees, one for each frequent item, and each conditional FP-tree can be mined and assessed individually. In this way, FP-growth reduces the search for long frequent patterns to recursively finding shorter ones and concatenating them with their suffix.

From this algorithm it can be inferred that limes, organic strawberries and organic baby spinach occur frequently together. The inferences from both algorithms are very similar.

Why is the F-P Growth algorithm better than the Apriori algorithm?

The Apriori algorithm has several drawbacks, chiefly its computational cost and the time required to complete a run. This happens because, in order to build the itemsets, the algorithm must repeatedly scan the entire dataset. FP-growth, by contrast, scans the dataset only twice to build the FP-tree and does not generate candidate itemsets.
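To make the tree-building and conditional-tree steps concrete, here is a compact sketch of FP-growth in plain Python. It is a simplified illustration (class and function names are our own, the baskets are made up), not the implementation behind the results in this article.

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(weighted_transactions, min_count):
    """Build an FP-tree from (items, count) pairs; return (header table, frequent items)."""
    # First scan: count items and keep only the frequent ones.
    item_counts = defaultdict(int)
    for items, count in weighted_transactions:
        for item in set(items):
            item_counts[item] += count
    frequent = {i: c for i, c in item_counts.items() if c >= min_count}

    # Second scan: insert each transaction, frequent items sorted by descending count.
    root = Node(None, None)
    header = defaultdict(list)  # item -> tree nodes that hold this item
    for items, count in weighted_transactions:
        ordered = sorted((i for i in set(items) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += count
            node = child
    return header, frequent

def fp_growth(weighted_transactions, min_count, suffix=frozenset()):
    """Yield (itemset, count) for every frequent itemset."""
    header, frequent = build_tree(weighted_transactions, min_count)
    for item, count in frequent.items():
        itemset = suffix | {item}
        yield itemset, count
        # Conditional pattern base: the prefix paths that lead to `item`.
        conditional = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.append((path, node.count))
        # Mine the conditional FP-tree built from those paths.
        yield from fp_growth(conditional, min_count, itemset)

# Hypothetical baskets for illustration only; each has weight 1.
baskets = [
    ["banana", "strawberries", "avocado"],
    ["banana", "strawberries"],
    ["banana", "avocado"],
    ["strawberries", "avocado"],
    ["banana", "strawberries", "avocado"],
]
patterns = dict(fp_growth([(b, 1) for b in baskets], min_count=3))
for itemset, count in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The original data is only read while the first tree is built; every recursive call works on the much smaller conditional pattern base.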

ECLAT Algorithm

The ECLAT algorithm is a depth-first search method. It uses a vertical database layout: rather than explicitly listing all transactions, it stores each item together with its cover (also known as its tidlist), the set of transactions that contain the item. It then employs an intersection-based approach to calculate the support of an itemset.

For small itemsets, it takes up less space than Apriori. This method works well for small datasets and generates frequent patterns faster than Apriori.
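The vertical layout and the intersection step can be sketched as follows. This is a minimal illustration with made-up baskets and our own function names; the support of an itemset is simply the size of the intersection of its items’ tidlists.

```python
def eclat(transactions, min_count):
    """Return {frozenset(itemset): support count} using tidlist intersections."""
    # Vertical layout: item -> set of transaction IDs (its tidlist).
    tidlists = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # `candidates` holds (item, tidlist) pairs that can extend `prefix`.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[itemset] = len(tids)
            # Depth-first: intersect with the tidlists of the remaining items.
            narrower = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids
                if len(shared) >= min_count:
                    narrower.append((other, shared))
            extend(itemset, narrower)

    start = sorted(
        ((item, tids) for item, tids in tidlists.items() if len(tids) >= min_count),
        key=lambda pair: pair[0],
    )
    extend(frozenset(), start)
    return frequent

# Hypothetical baskets for illustration only.
baskets = [
    ["banana", "strawberries", "avocado"],
    ["banana", "strawberries"],
    ["banana", "avocado"],
    ["strawberries", "avocado"],
    ["banana", "strawberries", "avocado"],
]
found = eclat(baskets, min_count=3)
for itemset, count in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```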