Guide to Linear Regression

19 Oct 2015

Linear regression is a statistical modelling technique used to estimate the relationship between one or more independent variables and a dependent variable.

The dependent variable is the target of interest, for example sales or transactions, which is continuous and random in nature. If instead the target is a yes/no outcome, such as whether a customer is likely or unlikely to respond to a promotional offer, or likely or unlikely to default on a loan taken from a bank, it is discrete in nature. In such a case, linear regression is not appropriate and we may have to use logistic regression instead.

The independent variables are the factors that influence the target variable. Such factors can be fixed or incremental, continuous or discrete. For example, what causes sales to change? Price could be one factor. The regular price acts as a base factor, one that applies even if no promotions take place. As soon as we apply a discount, the discount becomes an incremental factor: it captures what additional sales, over and above those at the base price, are generated when a discount is given on the product.

Linear regression helps us study the isolated impact of one factor, in this case the discount, on sales while controlling for the other factors that are likely to affect sales. In statistical terms, this is what a linear regression looks like:

Y = a + bX + e

Here Y, on the left-hand side, represents the dependent variable, that is, the sales of the product under consideration. ‘a’ represents the intercept, in other words the part of sales that stays constant. For example, even when there is no variation in price and no promotional offers are running, some sales will still happen simply because the product is available in the market. Since in a linear regression we are mostly interested in what causes sales to change and in the magnitude of that impact, not much focus is given to the intercept term.
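As a minimal sketch of how the intercept and the slope in the equation above can be estimated, the snippet below fits a straight line with NumPy. The price and sales figures are made up for illustration and were generated without noise, so the fitted values come out very close to a = 5000 and b = -1000.

```python
import numpy as np

# Hypothetical weekly data: price in dollars and unit sales.
# The sales were generated as 5000 - 1000 * price, with no noise.
price = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
sales = np.array([3000.0, 2500.0, 2000.0, 1500.0, 1000.0])

# np.polyfit with degree 1 fits a straight line by least squares and
# returns the coefficients from the highest power down: [slope, intercept].
b, a = np.polyfit(price, sales, 1)

print(f"intercept a = {a:.1f}")  # sales level implied at a price of zero
print(f"slope b     = {b:.1f}")  # change in sales per one-dollar price change
```

With real data the points would not sit exactly on a line, and the intercept would be read as the baseline level of sales rather than as a literal forecast at a price of zero.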

‘X’ in this equation stands for the factors that affect the target variable. We could have multiple factors X1, X2, X3, ..., Xn in the equation, each with its own coefficient ‘b’ representing the magnitude of the impact that factor has on Y. For example, price, promotional activities, seasonality, trend and distribution of the product can all have an impact on sales. Alternatively, our dependent variable could be price, and we would ask what causes it to change. In that case, competitor price, promotional activities for the target and competitor products, and seasonality can all have an impact on the price of a product.
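With several factors, each one gets its own coefficient. The sketch below is one possible way to estimate them, using NumPy's least-squares solver on a small made-up data set with two factors, price and a promotion flag. The variable names and numbers are illustrative assumptions, and the sales are generated without noise so the known coefficients are recovered.

```python
import numpy as np

# Hypothetical data: price in dollars and a 0/1 flag for promotion weeks.
price = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 2.0])
promo = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])

# Sales generated from known coefficients, with no noise:
# intercept 5000, price effect -1000, promotion effect +300.
sales = 5000.0 - 1000.0 * price + 300.0 * promo

# Design matrix: a column of ones for the intercept, then one column per factor.
X = np.column_stack([np.ones_like(price), price, promo])

# Least-squares estimate of [a, b_price, b_promo].
coefs, _, rank, _ = np.linalg.lstsq(X, sales, rcond=None)
a, b_price, b_promo = coefs

print(f"a = {a:.1f}, b_price = {b_price:.1f}, b_promo = {b_promo:.1f}")
```

Each coefficient is the effect of its own factor with the other factor held fixed, which is what allows the impact of the promotion to be separated from the impact of price.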

‘b’ is the beta coefficient, which says that a one-unit increase in X will cause Y to change by b units, with the other factors held constant. For example, suppose b is -1000. This means that when the price of a product increases by one unit, let’s say from $2.00 to $3.00, unit sales decline by 1000 units. In linear regression we are essentially trying to estimate this beta coefficient.
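The arithmetic behind this example can be checked in a few lines. The coefficient b = -1000 comes from the example above. The intercept of 5000 and the helper `predict_sales` are assumptions added only so that the numbers can be computed.

```python
def predict_sales(a: float, b: float, price: float) -> float:
    """Predicted unit sales from the fitted line, ignoring the error term."""
    return a + b * price

a = 5000.0   # hypothetical intercept, not taken from the article
b = -1000.0  # beta coefficient from the example above

sales_at_2 = predict_sales(a, b, 2.00)  # 3000.0
sales_at_3 = predict_sales(a, b, 3.00)  # 2000.0

# A one-unit price increase changes predicted sales by exactly b.
change = sales_at_3 - sales_at_2
print(change)  # -1000.0
```

Whatever intercept is chosen, the difference between the two predictions is always b, because the intercept cancels out.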

‘e’ refers to the error term. What we are essentially doing in a linear regression is coming up with a predicted Y using historical data. When we substitute the values of the factors from the existing data, together with the estimates from the model, into the right-hand side, we get the predicted sales from the regression. We also have the actual sales in the data that we are using for the analysis. We then fit the line whose predictions best match the actual sales. Since sales is a random variable, not everything that influences sales can be captured by the linear regression model. Whatever cannot be captured is represented by ‘e’, the error term. The assumption is that the average of this error over the weeks in the data is close to zero: in some weeks the model may overpredict and in others underpredict, but the net average should be close to zero.
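The snippet below illustrates the error term on made-up weekly data in which actual sales do not fall exactly on a line. It is a sketch under those assumptions, not a full diagnostic. For a least-squares fit that includes an intercept, the errors on the data used for fitting average to zero by construction, apart from floating-point rounding.

```python
import numpy as np

# Hypothetical weekly prices and actual unit sales (not exactly linear).
price = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
actual_sales = np.array([3050.0, 2480.0, 1990.0, 1530.0, 950.0])

# Fit the line Y = a + b * X by least squares.
b, a = np.polyfit(price, actual_sales, 1)

# Predicted sales from the fitted line, and the error for each week.
predicted_sales = a + b * price
errors = actual_sales - predicted_sales

# Some weeks are underpredicted (positive error), others overpredicted.
print("errors per week:", np.round(errors, 1))

# The errors cancel out on average.
mean_error = errors.mean()
print("mean error is close to zero:", abs(mean_error) < 1e-6)
```

A mean error near zero does not mean that the individual errors are small, so it is usually worth looking at their spread as well.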

If you found this article interesting, why don’t you also check out our 5 Super Tips to Improve Your Linear Regression Models.
