Demand Forecasting in Supply Chain Management: A Time-Series Approach (2/2)
Project Leaders — Sundar Raghavan, Anand Modi, Sabarni Paul (SIGMA, The Business Club of NIT Trichy)
In this article, we’ll perform some time series analysis and subsequent forecasting on retail data. The related datasets and code can be found here. We have two years’ worth of data (2010–12) on 45 anonymised retail stores. The goal is to derive insights and finally forecast department-wide sales for the year 2013.
Cleaning and preprocessing
There are three datasets (stores, features and sales) which we clean individually (if needed) and then merge appropriately for further analysis. Let’s take a look at them first.
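Throughout the article we assume the usual analysis imports and that the three files have been loaded into dataframes; a minimal setup sketch is shown below (the file names are assumptions, so adjust the paths to match your copy of the dataset).
# Assumed imports and file names; adjust paths as needed for your copy of the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df_features = pd.read_csv('features.csv')
df_sales = pd.read_csv('sales.csv')
df_stores = pd.read_csv('stores.csv')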
df_features.head()
There are some null values; let’s inspect further.
df_features.isna().sum()
We have to fill in the null values so that we can use these features in further analysis. Let’s look at how the CPI and Unemployment features behave to decide on a suitable imputation technique.
df_features.CPI.plot();
df_features.Unemployment.plot();
CPI and Unemployment are continuous only over certain short intervals in their plots. Upon further inspection, it was found that they are continuous within each store.
So if we impute CPI and Unemployment by store, using linear interpolation, we’ll get a linear approximation for the null values, which should be pretty reasonable given the spread of the features.
# Interpolate the null values linearly, one store at a time
for i in range(1, 46):
    df_features.loc[df_features.Store == i] = df_features[df_features.Store == i].interpolate()
This should do it. Let’s check store #20 again.
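One quick way to run that check (just a sanity check, not part of the pipeline):
# CPI for store 20 should now plot as a continuous line with no gaps
df_features[df_features.Store == 20].CPI.plot();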
The same holds for Unemployment. Now, moving on to the markdowns: we can simply set the null values to 0, as markdowns are price reductions made by retailers in an attempt to sell more, so if the data isn’t present, it is reasonable to assume that there was no markdown. In this dataset, the columns with indices 4–8 are the markdown columns.
df_features[df_features.columns[4:9]] = df_features[df_features.columns[4:9]].fillna(0)
Now we can move on to the stores and sales datasets.
It was found that the above datasets don’t have any null values, so we can now merge all three datasets as follows.
df_all_1 = df_features.merge(df_sales, 'right', on = ['Date', 'Store', 'IsHoliday'])
df_all = df_all_1.merge(df_stores, 'left', on = 'Store')
df_all = df_all.sort_values('Date')
df_all.reset_index(inplace = True)
First, we merge features and sales on their common columns (Date, Store and IsHoliday), using a right join on sales. This is because sales has data only till 2012, whereas features extends into 2013; we’ll train our model on the data till 2012 and forecast the 2013 sales. After this, we merge the combined sales and features data with stores, sort the final dataset by date, and reset the index.
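As an optional sanity check, we can confirm that the merged dataframe no longer contains null values:
# All features should now be fully populated after imputation and merging
df_all.isna().sum()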
Let’s do some simple mapping for converting categorical features into numeric ones.
df_all.replace({'IsHoliday':{True:1, False:0}}, inplace=True)
df_all.replace({'Type':{'A':3, 'B':2, 'C':1}}, inplace=True)
EDA
To begin the time series analysis, we group our data by date, and since we’re forecasting overall sales, we aggregate the other features appropriately.
df_by_date = df_all.groupby('Date', as_index=False).agg({'Temperature': 'mean', 'Fuel_Price': 'mean', 'CPI': 'mean', 'Unemployment': 'mean', 'Weekly_Sales': 'sum', 'IsHoliday': 'mean'})
df_by_date.Date = pd.to_datetime(df_by_date.Date, errors='coerce')
df_by_date.set_index('Date', inplace=True)
We create a new dataframe where we group each feature by date and with a suitable aggregate measure for each of them. After that, we convert the Date feature to a pandas DateTime object and set it as the index. This is how the dataframe looks.
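The view referred to above can be reproduced with a quick call (just for inspection, not part of the pipeline):
# Peek at the first few rows of the date-indexed, aggregated dataframe
df_by_date.head()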
Notice that the dates don’t have a clear-cut frequency (daily, weekly, monthly, etc.). To apply certain time series analysis techniques, we need our data to have a proper frequency, so we resample our dataset to a weekly frequency.
df_by_date_new = df_by_date.resample('W').mean().fillna(method='bfill')
The above code resamples the old dataset and creates a new one with a weekly frequency. Of course, the new dataset will have some null values due to the uneven distribution of the old data. We use the ‘bfill’ (backfill) method, which fills each missing value with the next non-missing value that follows it.
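As a small illustration of how backfilling behaves (a toy example, not part of the project code):
# Backfill propagates the next valid observation backwards over the gaps
s = pd.Series([1.0, np.nan, np.nan, 4.0])
s.fillna(method='bfill')   # -> 1.0, 4.0, 4.0, 4.0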
Now we can finally decompose this time series using python’s statsmodels library.
from statsmodels.tsa.seasonal import seasonal_decompose

multi_plot = seasonal_decompose(df_by_date_new['Weekly_Sales'], model = 'add', extrapolate_trend='freq')
Seasonal decomposition splits the time series into its estimated trend, seasonal, and residual components. We also plot the original data to see which components influence the observed series the most.
multi_plot.observed.plot(title = 'weekly sales')
multi_plot.trend.plot(title = 'trend')
multi_plot.seasonal.plot(title = 'seasonal')
multi_plot.resid.plot(title = 'residual')
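Alternatively, the DecomposeResult object returned by seasonal_decompose exposes a plot() method that draws all four panels in a single figure:
# Draws observed, trend, seasonal and residual panels in one figure
multi_plot.plot();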
As can be observed, the series is strongly influenced by the seasonal component.
Let’s inspect correlations among the features of the newly created dataset.
sns.heatmap(df_by_date_new.corr('spearman'), annot = True)
- Strong positive correlation between Fuel_Price and CPI
- Strong negative correlations between Unemployment and Fuel_Price, and between Unemployment and CPI
- Surprisingly, the unemployment rate doesn’t really seem to affect weekly sales (at least not directly), suggesting that the stores might be overstaffed.
Let’s see how holidays affect weekly sales.
sns.boxplot(data = df_by_date, x = 'IsHoliday', y = 'Weekly_Sales');
Analysing by store type
df_by_store = df_all.groupby('Store').agg({'Weekly_Sales': 'sum', 'Type': 'max'})
sns.boxplot(data = df_by_store, x = 'Type', y = 'Weekly_Sales')
Monthly sales
# Date may still be a string at this point; convert it so the .dt accessor works
df_all.Date = pd.to_datetime(df_all.Date, errors='coerce')
monthly_sales = df_all.groupby(df_all.Date.dt.month).agg({'Weekly_Sales':'sum'})
sns.barplot(x=monthly_sales.index, y=monthly_sales.Weekly_Sales);
Finding the most overperforming and underperforming departments
df_by_dept = df_all.groupby('Dept').agg({'Weekly_Sales':'sum'})
df_by_dept.sort_values(by = 'Weekly_Sales', ascending = False, inplace = True)
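With the departments sorted by total sales, the extremes can be read off directly (a small convenience sketch):
# Top 5 departments by total sales (overperformers)
df_by_dept.head()
# Bottom 5 departments by total sales (underperformers)
df_by_dept.tail()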
Forecasting using the Holt-Winters Model
The Holt-Winters model uses exponential smoothing to make predictions. Exponential smoothing is a technique for smoothing time series data by assigning weights that decrease exponentially with the age of the observations, unlike the simple moving average, which gives equal weight to all observations in its window. Holt-Winters (triple) exponential smoothing applies exponential smoothing three times, once each for the level, trend, and seasonal components, which is useful when a high-frequency signal has to be removed.
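To make the idea of exponentially decaying weights concrete, here is a minimal sketch of simple (single) exponential smoothing; the smoothing factor alpha is an illustrative choice, not a value used by the model below.
# Minimal sketch of simple exponential smoothing on a plain list or array:
# each smoothed value is a weighted average of the current observation and the
# previous smoothed value, so older observations get exponentially smaller weights.
def simple_exponential_smoothing(series, alpha=0.3):
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed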
Let’s first train the model and test it on the 2012 portion of the data, so that we know how accurate it is. After this, we can train it on all of the data up to 2012 and forecast 2013 sales. We have 154 weekly samples in total, which we split into two parts: 120 for training and 34 (falling in 2012) for testing.
from statsmodels.tsa.holtwinters import ExponentialSmoothing as es

fit_model = es(df_by_date_new['Weekly_Sales'][:120], trend = 'add',
               seasonal = 'add', seasonal_periods = 52).fit()
prediction = fit_model.forecast(34)
Let’s see how good our model’s predictions are.
plt.plot(df_by_date_new.index[120:], prediction, label = 'predicted')
plt.plot(df_by_date_new.index[120:], df_by_date_new.Weekly_Sales[120:], label = 'actual')
plt.legend();
def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print("Mean Absolute Percentage Error = {a}%".format(
    a=mean_absolute_percentage_error(df_by_date_new.Weekly_Sales[120:], prediction)))
Mean Absolute Percentage Error = 4.786445001829046%
Now that we know that our error is reasonable, we can go ahead and forecast 2013 sales.
fit_model = es(df_by_date_new['Weekly_Sales'][:-2], trend = 'add',
               seasonal = 'add', seasonal_periods = 52).fit()
preds_2013 = fit_model.forecast(56)

plt.plot(df_by_date_new.index, df_by_date_new.Weekly_Sales)
plt.plot(preds_2013, '--')
plt.legend(['2010-2012 actual', '2013 forecast'])