
Nielsen Case Study Project

Aug 17, 2015

Page 1: Nielsen Case Study Project

Nielsen Case Study Project

University of Calcutta

Department of Statistics, 2nd Year (4th Semester)

University Roll No: 32

Contact No: 9038362966

Email: [email protected]

3/31/2015

Subhodeep Mukherjee

This report contains a detailed analysis of the FMCG sales data given as a case study project.

Page 2: Nielsen Case Study Project

Contents

1. Introduction
2. Data Cleaning and Data Manipulation
   - Missing Data Estimation
   - Tackling the Problem of Scale
3. Analysis
4. Prediction
5. Scenario Building
6. Appendix
7. References

Page 3: Nielsen Case Study Project

Introduction

We are given a dataset on monthly FMCG sales for the years 2012-2014. We are also given monthly data on macroeconomic indicators (GDP Growth Rate, Consumer Price Index, Producer Price Index and Industrial Production Index), as well as on crude oil prices in Indian Rupees per barrel and sugar prices in Indian Rupees per kg.

We are further supplied with the information that sales is a function of GDP, the Consumer Price Index, the Producer Price Index, the Industrial Production Index, crude oil prices, sugar prices and distribution.

On the basis of the supplied information we have to fulfil two objectives:

1. Determine how FMCG sales are impacted by movements in these factors.
2. Predict sales for the next 3 quarters.

Page 4: Nielsen Case Study Project

Data Cleaning and Data Manipulation:

While dealing with the data we face four main problems:

1. The data contains missing values, i.e. the covariates (sugar prices, CPI, IPI and PPI) contain missing values. While sugar prices contain 4 missing values, the remaining three contain 19 missing values each.

2. The data also has a problem of scale: the GDP growth rate is given on a quarterly basis while the other variables are given on a monthly basis, so a problem of data conversion exists.

3. For prediction we are not supplied with enough future data on the covariates.

4. The first value is given for Jan-13 instead of Jan-12.

For the 3rd problem we carry out an individual time series analysis on each covariate and obtain future values by time series forecasting.

For the 4th problem the value is omitted from further analysis, as it has been entered wrongly and keeping it would lead to overestimation.

The 1st and 2nd problems are discussed in detail below.

Page 5: Nielsen Case Study Project

1. Missing Data Estimation:

The problem of missing data can be tackled in the following ways:

Omit the missing data and carry out the analysis on the remaining data set.

The problem with this approach is that if we omit the missing observations we are left with very few data points, as a result of which the precision of the estimates obtained will be poor and they cannot be interpreted reliably.

Fit a trend line to the data and forecast the missing values in reverse order.

This can be a good way to solve the problem of missingness. But in the data set provided we have to estimate 19 missing values for three variables, and trend-based values may be quite unreliable for the later of the 19 values. Also, monthly data exhibits seasonality, which is hard to remove, making the analysis more tedious.

Impute the missing data using an iterative technique.

We can use the multiple imputation technique for missing data. Though difficult to execute by hand, this method can be carried out very easily using statistical software (for more details on this technique please refer to the Appendix).

Therefore we use the last method to deal with this problem.
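The imputation itself is performed below with the mi package in R. Purely to illustrate the idea behind multiple imputation, the sketch below (Python, with invented numbers and a simple normal imputation model, not mi's actual algorithm) draws m completed data sets and pools an estimate of the mean using Rubin's rules:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy series with missing entries (values invented for illustration).
x = np.array([23.8, 23.9, np.nan, 24.1, 24.3, np.nan, 25.2, 24.3, 23.8, np.nan])
obs = x[~np.isnan(x)]
miss = np.isnan(x)

m = 5                          # number of imputed data sets
means, variances = [], []
for _ in range(m):
    xi = x.copy()
    # Draw each missing value from a normal model fitted to the observed data.
    xi[miss] = rng.normal(obs.mean(), obs.std(ddof=1), miss.sum())
    means.append(xi.mean())
    # Within-imputation variance of the mean for this completed data set.
    variances.append(xi.var(ddof=1) / len(xi))

# Rubin's rules: pool the estimates and combine the two sources of variance.
q_bar = np.mean(means)                   # pooled estimate
w = np.mean(variances)                   # within-imputation variance
b = np.var(means, ddof=1)                # between-imputation variance
total_var = w + (1 + 1 / m) * b
print(q_bar, total_var)
```

The between-imputation term b is exactly what an analysis on a single filled-in data set would miss.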

Page 6: Nielsen Case Study Project

So, as discussed above, we use the multiple imputation technique to impute the missing values in our dataset. We use the "mi" library in R; applying the "mi" command to the variables having missing observations, we obtained the missing values. From the completed datasets we took the last one. The completed dataset is given below:

Month    Sugar Prices (Rs/kg)   CPI        PPI        IPI
Jan-13   23.77472               136.6593   177.8132   -0.26981
Feb-12   23.84381               144.2253   183.3302   -3.96654
Mar-12   23.94445               136.3832   178.7977   5.984776
Apr-12   24.09106               139.9016   180.4378   -1.56093
May-12   24.30464               140.4724   182.6549   2.312564
Jun-12   24.85651               134.1633   177.1054   1.387773
Jul-12   27.90287               141.3516   182.1839   -1.68607
Aug-12   25.20971               135.4464   178.053    1.492746
Sep-12   24.34879               143.571    183.6189   -3.0486
Oct-12   23.84106               137.8738   179.8705   2.990728
Nov-12   23.33333               134.5311   177.8108   4.766912
Dec-12   23.15673               138.0879   179.6758   -0.04168
Jan-13   22.60486               132.115    176.9094   1.22403
Feb-13   21.61148               143.4237   183.685    -0.37781
Mar-13   22.00883               135.2534   177.929    4.402716
Apr-13   21.19205               139.4739   180.4845   2.324022
May-13   21.16998               136.9631   178.974    2.376946
Jun-13   21.83223               137.6496   179.0504   1.950357
Jul-13   22.56071               135.4691   178.5486   6.793973
Aug-13   24.06181               134.6      177.5      2.63
Sep-13   24.76821               136.2      179.7      0.43
Oct-13   25.58499               137.6      180.3      2.76
Nov-13   24.54746               139.4      181.5      -1.16
Dec-13   22.45033               138        179.2      -1.32
Jan-14   21.39073               137.4      178.9      -0.16
Feb-14   23.22296               137.3      178.9      0.8
Mar-14   24.08389               138.1      179.8      -1.8
Apr-14   24.26049               139.1      180.2      -0.5
May-14   23.88521               139.9      181.7      3.4
Jun-14   23.90728               141.2      182.6      4.7

Page 7: Nielsen Case Study Project

2. Tackling the Problem of Scale:

The GDP growth rates given in the data set are given below:

Year   Q1    Q2    Q3    Q4
2012   8     6.7   6.1   5.3
2013   5.5   4.4   4.8   4.7
2014   4.7   4.6

We can see that the data is given on Quarterly basis. But the other variables are given on a monthly basis. So to carry out our analysis further we can do two things.

Firstly we can convert the entire dataset into monthly dataset and carry out our analysis on that data.

Secondly we can convert the GDP growth data set into monthly dataset and carry out our analysis.

In the first case the main problem is that the number of data points will be reduced and the precision of the estimates will suffer: a statistical model fitted on more data points generally fits better than one fitted on fewer, since information is lost in the latter case.

So we consider the second method, i.e. converting the GDP growth data into a monthly dataset. But how can we do it? Again, in two ways:

1. Fit a quarterly trend equation on the given data and convert that trend into a monthly trend equation and impute the monthly values from that equation.

2. Assume that the GDP growth rate rises very slowly, almost linearly, and does not change much between months. Under this assumption, divide each quarterly value by three (as each quarter consists of three months) and assign the result to each of those months.

In the first case the values will be more appropriate, but the methodology is more tedious: first we have to fit a trend line, then convert it into its monthly counterpart, and finally compute the monthly values. Moreover, we would be ignoring seasonal effects, and including them would make the analysis more tedious and time-consuming.

Page 8: Nielsen Case Study Project

In the second case, though we lose some precision and accuracy, we trade this off against the fact that the GDP growth rate usually changes very slowly, being a long-term phenomenon; the analysis also becomes easier and less time-consuming, since we only divide by an integer, and our concern is the analysis of sales, not the GDP figures themselves. Moreover, for prediction we need quarterly values, so after obtaining the predicted monthly sales we sum them to get quarterly figures, and the error we commit is reduced by this summing-up process.

Therefore we use the second method over the first. The computed GDP values are given below:

Month    GDP Growth Rate
Jan-13   2.67
Feb-12   2.67
Mar-12   2.67
Apr-12   2.23
May-12   2.23
Jun-12   2.23
Jul-12   2.03
Aug-12   2.03
Sep-12   2.03
Oct-12   1.77
Nov-12   1.77
Dec-12   1.77
Jan-13   1.83
Feb-13   1.83
Mar-13   1.83
Apr-13   1.47
May-13   1.47
Jun-13   1.47
Jul-13   1.6
Aug-13   1.6
Sep-13   1.6
Oct-13   1.57
Nov-13   1.57
Dec-13   1.57
Jan-14   1.57
Feb-14   1.57
Mar-14   1.57
Apr-14   1.53
May-14   1.53
Jun-14   1.53
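The divide-by-three conversion can be sketched in a couple of lines (Python used purely for illustration; the quarterly figures are the ones given in the data set):

```python
# Quarterly GDP growth rates from the data set (2012 Q1 .. 2014 Q2).
quarterly = [8.0, 6.7, 6.1, 5.3, 5.5, 4.4, 4.8, 4.7, 4.7, 4.6]

# Spread each quarterly figure equally over its three months.
monthly = [round(q / 3, 2) for q in quarterly for _ in range(3)]

print(monthly[:6])  # → [2.67, 2.67, 2.67, 2.23, 2.23, 2.23]
```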

Page 9: Nielsen Case Study Project

Analysis:

Here we are given time series data on some variables (including the FMCG sales figures and some other factors) and asked to show how FMCG sales are impacted by movements of these factors.

To answer this we have to frame the Sales as a function of the other factors.

For simplicity we assume a linear model, with the FMCG sales figures as the response variable and the other factors as explanatory variables. A regression model on time series covariates of this kind is known as a dynamic regression model.

Let us take the following variables:

y1,t: Value Offtake at time t.

y2,t: Number of Stores at time t.

x1,t: Crude oil Prices in Indian Rupee per Barrel at time t.

x2,t: Sugar Prices in Indian Rupee per kg at time t

x3,t: GDP Growth Rate at time t

x4,t: CPI at time t

x5,t: PPI at time t

x6,t: IPI at time t

Here the y's are the response variables and the x's the explanatory variables.

Now we see that there are two response variables and six explanatory variables in the model. A question arises as to which of the two response variables we should take in our model. There is also the problem that the set of stores does not remain constant between any two time points, i.e. some stores available at time t may not be there at time t+1, while new stores may appear at time t+1. How do we account for their effect on the sales figures, which are strongly influenced under these conditions?

We can solve these problems by two procedures:

Page 10: Nielsen Case Study Project

1. Take y2,t as a covariate along with the other covariates to explain y1,t, and also take y1,t-1 as a covariate along with the other covariates to explain y2,t. This is similar to a simultaneous regression model.

2. Divide y1,t by y2,t and take zt = y1,t/y2,t as a new response measuring sales per store, assuming that sales are distributed equally among the stores at time t.

But in the first case, as the two response variables are related and are regressed on some variables present in both models, the problems of multicollinearity and cointegration may arise.

So we shall use the 2nd method to carry out the analysis.

We take our model as

Δzt=β0+β1 Δx1,t+⋯+β6 Δx6,t+et

where the βi are regression coefficients and Δ denotes differencing, which we apply to make the data stationary.

We assume that the errors are autocorrelated.

The dynamic regression model has the following steps:

1. Check that the response variable and all predictors are stationary. If not, apply differencing until all variables are stationary. Where appropriate, use the same differencing for all variables to preserve interpretability.

2. Fit the regression model with AR(2) errors for non-seasonal data, or ARIMA(2,0,0)(1,0,0)m errors for seasonal data.

3. Calculate the errors from the fitted regression model and identify an appropriate ARMA model for them.

4. Re-fit the entire model using the new ARMA model for the errors.

5. Check that the et series looks like white noise.

6. Finally, select the best model by AIC.
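Steps 2-4 of the algorithm above can be sketched in outline. The sketch below is illustrative only: it uses simulated data, an OLS fit, and a Cochrane-Orcutt style AR(1) correction, not the exact SARIMA-error fit reported later.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Simulated stationary predictor and a response with AR(1) errors
# (illustrative data only, not the case study's series).
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.6 * e[t - 1] + rng.normal(scale=0.5)
z = 1.0 + 2.0 * x + e

# Step 2: fit the regression (here simply by OLS).
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, z, rcond=None)

# Step 3: identify an error model -- estimate an AR(1) coefficient
# from the lag-1 autocorrelation of the residuals.
r = z - X @ beta
phi = np.sum(r[1:] * r[:-1]) / np.sum(r[:-1] ** 2)

# Step 4: re-fit on quasi-differenced data (Cochrane-Orcutt style).
Xs = X[1:] - phi * X[:-1]
zs = z[1:] - phi * z[:-1]
beta_refit, *_ = np.linalg.lstsq(Xs, zs, rcond=None)

print(phi, beta_refit)
```

In practice the whole cycle (including the white-noise check and AIC comparison of step 5 and 6) is automated by time series routines in statistical software such as R.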

Following the algorithm given above, we first have to check whether the variables are stationary. To check stationarity we use the (augmented) Dickey-Fuller test on each variable. If the null hypothesis of a unit root is accepted, i.e. the variable is nonstationary, we difference it until it becomes stationary.

Page 11: Nielsen Case Study Project

We find that all the variables are nonstationary (the null hypothesis of the Dickey-Fuller test is accepted in each case).

For this reason we difference zt, x3,t, x4,t and x5,t once, and x1,t, x2,t and x6,t twice, to get stationary time series.
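The differencing itself is a one-liner; the series below is illustrative, not the actual data:

```python
import numpy as np

# Illustrative series only (not the actual crude oil data).
x1 = np.array([6789.4, 7176.9, 7121.3, 6539.4, 6319.1, 6267.9])

d1 = np.diff(x1)        # first difference, as applied to z, x3, x4 and x5
d2 = np.diff(x1, n=2)   # second difference, as applied to x1, x2 and x6

# Each round of differencing shortens the series by one observation.
print(len(d1), len(d2))
```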

Now we follow the algorithm given above on these differenced values and obtain the following results:

α1 = -0.0753
α2 = 0.0563
αs1 = 0.9986
β0 = 0.0011
β1 = 0
β2 = -0.0004
β3 = -0.0020
β4 = 0.0019
β5 = -0.0026
β6 = -0.0001
γ1 = -1.0000
γs1 = -0.6803

Thus we can see that a SARIMA (Seasonal Autoregressive Integrated Moving Average) model of order (2,0,1) and seasonal order (1,0,1), with period of seasonality 12, on the errors is appropriate, as it has the minimum AIC = -217.84. Here the α's denote the AR coefficients, αs1 is the seasonal AR coefficient, the β's are the regression coefficients, γ1 is the MA coefficient and γs1 is the seasonal MA coefficient. By the Ljung-Box test we can see that the error part is white noise.

We also plot the fitted values against the actual values.

Page 12: Nielsen Case Study Project

[Figure: residuals of the fitted model plotted against time (2012.5-2014.5), fluctuating around zero between roughly -0.02 and 0.02.]

The plot given above also shows that the fit is good.

By Δzt we mean the change in sales per store per unit time; similarly, the differenced covariates measure the change in each factor per unit time. The regression coefficients β2, β3, β5 and β6 are negative, which shows that the change in sales per store is negatively affected by changes in these factors (sugar prices, GDP growth rate, PPI and IPI), i.e. if these factors increase, the sales figures decrease.

We can interpret this by saying that as sugar prices increase, the cost of the commodity also increases, and the sales figures decrease because the good becomes costlier.

The same logic applies to the PPI, which measures changes in the prices charged by producers: if it increases, the cost of goods increases, thereby decreasing their sales.

Now, if the GDP growth rate decreases, the prices of commodities increase due to inflation, as output cannot fulfil the needs of the country; the same commodity is then sold at a higher price, thereby increasing the sales figures.

The IPI influences GDP and hence influences the sales figures similarly.

Now β4 = 0.0019, which shows that the change in sales per store is positively affected by the change in this factor (CPI). This can be interpreted as follows: with an increase in the CPI, the price paid by consumers for the basket of goods increases, i.e. consumers are paying more for the goods, which increases the sales figures.

In the plot of fitted and actual values, the blue line represents the original values while the black line represents the fitted values.

Page 13: Nielsen Case Study Project

Here β1 = 0. Hence we can say that changes in the crude oil price do not affect sales very much.

Lastly, we know that the number of stores always affects sales: with more stores, sales will be higher, and for this reason we divided sales by the number of stores at the start of the analysis. Monthly changes in the number of stores, however, affect sales only slowly.

Page 14: Nielsen Case Study Project

Prediction:

For prediction purpose we first have to obtain the predicted value of the explanatory variables. For this we have to fit time series models on these covariates.

Fitting a time series model on crude oil prices: The best fitted model is SARIMA(0,2,0)(0,1,0) with AIC = 239.75.

The predicted values for the next 9 months are given below:

Jul-14   Aug-14   Sep-14   Oct-14   Nov-14   Dec-14   Jan-15   Feb-15   Mar-15
6789.36  7176.93  7121.3   6539.42  6319.06  6267.92  6263.51  6203.41  5821.31

Fitting a time series model on sugar prices: The best fitted model is ARIMA(1,2,0) with AR coefficient = -0.3722 and AIC = 102.64.

The predicted values for the next 9 months are given below:

Jul-14   Aug-14   Sep-14    Oct-14   Nov-14    Dec-14    Jan-15   Feb-15    Mar-15
23.7814  23.7106  23.61944  23.5358  23.44933  23.36391  23.2781  23.19244  23.10672

Fitting a time series model on CPI: The best fitted model is ARIMA(2,1,0) with drift and AIC = 152.09.

ar1 = -1.0132, ar2 = -0.3136, drift = 0.0763

The predicted values for the next 9 months are given below:

Jul-14  Aug-14    Sep-14  Oct-14    Nov-14    Dec-14    Jan-15    Feb-15    Mar-15
143.7   140.9369  143.13  141.9522  142.6354  142.4902  142.6007  142.7119  142.7422

Fitting a time series model on PPI: The best fitted model is ARIMA(1,1,0) with drift and AIC = 133.07.

Page 15: Nielsen Case Study Project

ar1 = -0.7534, drift = 0.1221

The predicted values for the next 9 months are given below:

Jul-14  Aug-14    Sep-14    Oct-14    Nov-14    Dec-14    Jan-15   Feb-15    Mar-15
184.6   183.3073  184.4952  183.8143  184.5413  184.2076  184.673  184.5364  184.8533

Fitting a time series model on IPI: The best fitted model is ARIMA(2,2,0) with AIC = 159.14.

ar1 = -1.3598, ar2 = -0.7281

The predicted values for the next 9 months are given below:

Jul-14  Aug-14    Sep-14   Oct-14   Nov-14   Dec-14   Jan-15   Feb-15   Mar-15
3.4     7.642473  1.24454  2.11361  5.22728  7.18221  9.80662  8.36434  4.525297

Note: The value for Jul-2014 is supplied in the given data set.

Fitting a time series model on the GDP growth rate: For the GDP growth rate we fit an ARIMA model on the quarterly data and compute the predicted values for the next three quarters. The best fitted model is ARIMA(0,1,0) with drift = -0.3778 and AIC = 17.46.

The predicted values for the next 3 quarters are: 4.222222, 3.844444, 3.466667.

Now we divide each by 3 and assign the result to the corresponding months.

The predicted values for the next 9 months are given below:

Jul-14  Aug-14  Sep-14  Oct-14    Nov-14    Dec-14    Jan-15    Feb-15    Mar-15
1.4074  1.4074  1.4074  1.281481  1.281481  1.281481  1.155556  1.155556  1.155556

Now, combining these predicted values with the observed values, we check stationarity through the Dickey-Fuller test, difference the variables to make the entire data stationary, and extract the values of the variables from Jul-2014 to Mar-2015.

Page 16: Nielsen Case Study Project

[We difference x3,t, x4,t and x5,t once, and x1,t, x2,t and x6,t twice, to get stationary time series.]

Now these variables are used in the dynamic regression model evaluated earlier to generate the predicted values of Δzt.

The predicted values of Δzt (change in sales per store) are given below:

Month    Predicted Value
Jul-14   0.009638649
Aug-14   0.001584345
Sep-14   -0.003331575
Oct-14   0.005307183
Nov-14   -0.000312891
Dec-14   0.007996624
Jan-15   -0.008294165
Feb-15   -0.018496504
Mar-15   0.028595458

Fitting a time series model on the number of stores: The best fitted model is SARIMA(0,2,1)(0,1,0) with AIC = 434.47.

Coefficients: ma1 = -0.7237

The predicted values for the next 9 months are given below:

Month    Predicted Value
Jul-14   8748512
Aug-14   8727997
Sep-14   8711155
Oct-14   8690370
Nov-14   8668353
Dec-14   8648771
Jan-15   8627323
Feb-15   8609123
Mar-15   8587109

Page 17: Nielsen Case Study Project

Now, writing yt for the sales (value offtake) and xt for the number of stores, the following relation holds:

Δzt = Δ(yt/xt) = (yt/xt) - (yt-1/xt-1)

so that

(yt/xt) = Δzt + (yt-1/xt-1), i.e. yt = xt * (Δzt + (yt-1/xt-1))

For prediction purposes we can write the relation as:

yt+1 = xt+1 * (Δzt+1 + (yt/xt))

So we have the present value of x, its predicted values, and the predicted values of Δz. Therefore, using the relation given above, we can find the predicted values of y, i.e. the predicted sales figures.
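The back-transformation can be sketched as follows. The store predictions are the first three from the table above; the starting sales and store figures and the rounded Δz values are placeholders for illustration, not the report's actual inputs.

```python
# Back-transform predicted changes in sales-per-store (dz) into sales (y),
# via y[t] = x[t] * (dz[t] + y[t-1]/x[t-1]).
y_prev, x_prev = 1_980_000.0, 8_760_000.0   # last observed sales and stores (assumed)
x_pred = [8748512, 8727997, 8711155]        # predicted no. of stores
dz_pred = [0.0096, 0.0016, -0.0033]         # predicted changes in sales per store

sales = []
for x_t, dz_t in zip(x_pred, dz_pred):
    y_t = x_t * (dz_t + y_prev / x_prev)
    sales.append(round(y_t, 2))
    y_prev, x_prev = y_t, x_t             # roll the recursion forward

print(sales)
```

Each month's prediction feeds the next, so errors in Δz accumulate through the recursion.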

The predicted sales figures are given below:

Month    Predicted Value
Jul-14   1987896.66
Aug-14   1997063.26
Sep-14   1964187.76
Oct-14   2005622.55
Nov-14   1997829.06
Dec-14   2062476.90
Jan-15   1985805.74
Feb-15   1822377.85
Mar-15   2063270.25

The predicted sales figures for the next three quarters are obtained by summing three consecutive predicted monthly figures, as shown in the following table:

Quarter   Predicted Value Offtake   Predicted No. of Stores
Q3        5949147.68                26187664
Q4        6065928.51                26007494
Q1        5871453.84                25823555

Page 18: Nielsen Case Study Project
Page 19: Nielsen Case Study Project

Scenario Building:

We cannot ignore the fact that the values we have predicted may not hold in reality. They will hold only if the same conditions prevail, but in reality this is often not the case.

So we have to see how far our prediction gets influenced if the condition changes due to some drastic economic activity.

Let us assume that, due to some economic disaster, the GDP growth rate decreases by 2% more than our predicted value. The predicted sales figures will then be:

Quarter   Predicted Value Offtake
Q3        5950145.17
Q4        6067188.58
Q1        5872572.66

We see that the predicted value of sales increases. We can interpret this by the fact that if the GDP growth rate decreases, commodity prices increase due to inflation, as output cannot fulfil the needs of the country; the same commodity is then sold at a higher price, thereby increasing the sales figures.

Now assume that, as before, the GDP growth rate decreases by 2% more than our predicted value, and that the number of stores also decreases by 10%. The predicted sales figures will then be:

Quarter   Predicted Value Offtake
Q3        5355130.97
Q4        5460595.15
Q1        5285439.95

We know that as the number of stores decreases, sales of goods also decrease. This fact is reflected above in the decreased sales figures.

Page 20: Nielsen Case Study Project

Appendix:

1. Multiple Imputation:

Multiple imputation provides a useful strategy for dealing with data sets with missing values. Instead of filling in a single value for each missing value, Rubin's (1987) multiple imputation procedure replaces each missing value with a set of plausible values that represent the uncertainty about the right value to impute. The multiply imputed data sets are then analyzed using standard complete-data procedures, and the results from these analyses are combined. No matter which complete-data analysis is used, the process of combining results from the different imputed data sets is essentially the same, and it yields valid statistical inferences that properly reflect the uncertainty due to missing values.
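The combining step can be written explicitly. With point estimates Q̂i and variance estimates Ui from the m completed data sets, Rubin's rules give

```latex
\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, \qquad
\bar{U} = \frac{1}{m}\sum_{i=1}^{m}U_i, \qquad
B = \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{Q}_i-\bar{Q}\right)^2,
```

and the total variance of the pooled estimate is T = Ū + (1 + 1/m)·B, where Ū is the within-imputation variance and B the between-imputation variance.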

Page 21: Nielsen Case Study Project

References:

1. Statistical software: R, Minitab

2. Wikipedia, for definitions

3. CRAN Task View on time series analysis in R

4. https://www.otexts.org/

5. Journal articles on multiple imputation

Page 22: Nielsen Case Study Project

Thank You