When we talk about regression, we look for an equation (linear or otherwise) that best fits the data. This equation is built purely by analyzing the sample, and it lets us predict the value of Y (the dependent variable) from the value of X (the independent variable).
This equation doesn’t fit every data point exactly, so there is some distance between each observation in the sample and the line/curve plotted by the equation. This distance is the residual: the part of the data that the regression model does not explain. We can see a graphical example of this in image1.
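To make the idea concrete, here is a minimal sketch in plain Python that fits a line y = a + b·x by ordinary least squares and computes the residuals (observed Y minus predicted Y). The function names and the sample data are illustrative assumptions, not something from the article:

```python
def fit_line(xs, ys):
    # Ordinary least squares for a line y = a + b*x.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(X, Y) / variance(X); intercept from the means.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

def residuals(xs, ys, a, b):
    # Residual = observed Y minus the Y predicted by the regression line.
    return [y - (a + b * x) for x, y in zip(xs, ys)]

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(xs, ys)
res = residuals(xs, ys, a, b)
```

Note that with least squares the residuals always sum to (approximately) zero; what residual analysis studies is not their sum but their individual sizes and patterns.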
Residual analysis studies these distances and helps validate the goodness of the regression model. For example, the R-squared value (which we examine during regression analysis) measures the goodness of fit of the model, but calculating this value alone isn’t sufficient. We also need to analyze the residuals themselves and look at their characteristics.
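For reference, R-squared can be sketched as one minus the ratio between the sum of squared residuals and the total sum of squares around the mean of Y. The data and names below are illustrative assumptions:

```python
def r_squared(ys, predictions):
    # R^2 = 1 - SS_res / SS_tot: the share of the variance in Y
    # that the fitted model explains.
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

ys = [2.1, 3.9, 6.2, 7.8, 10.1]
preds = [2.0, 4.0, 6.0, 8.0, 10.0]  # values predicted by some fitted line
r2 = r_squared(ys, preds)           # close to 1: the line fits well
```

A high R-squared like this one says nothing about *how* the small residuals are distributed, which is exactly why the checks below are still needed.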
Looking at the residuals, we can find:
- A pattern in the residuals: the regression model probably doesn’t fit the observations correctly, and we may have selected the wrong kind of model (for example, a linear regression model when the data follow a curve);
- Heteroscedasticity: this happens when the standard deviation of the dependent variable, measured across values of the independent variable, is not constant. In other words, the spread of the errors increases (or decreases) at some point along the graph. With heteroscedasticity, a standard linear regression model is not appropriate;
- Outliers: they can distort the definition of the regression model. For this reason, we usually need to identify and remove them before fitting the model.
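The heteroscedasticity check above can be sketched with a rough, illustrative test: sort the residuals along the independent variable, split them in half, and compare the spread of each half (in the spirit of a Goldfeld–Quandt test). A ratio far from 1 suggests the error variance changes along X. The data and the idea of "far from 1" here are assumptions for illustration, not a formal statistical test:

```python
def spread_ratio(xs, res):
    # Order the residuals along X, split in half, and compare the
    # sum of squared residuals of the upper half vs the lower half.
    pairs = sorted(zip(xs, res))
    half = len(pairs) // 2
    def ss(part):
        return sum(r * r for _, r in part)
    return ss(pairs[half:]) / ss(pairs[:half])

# Residuals whose magnitude grows with X (a fan-shaped pattern):
xs = [1, 2, 3, 4, 5, 6]
res = [0.1, -0.2, 0.4, -0.9, 1.5, -2.0]
ratio = spread_ratio(xs, res)  # much larger than 1 for this fan shape
```

In practice the simplest diagnostic is visual: plot the residuals against X (or against the predicted values) and look for patterns, fan shapes, or isolated extreme points.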