Regression

Predicting unknown futures

Regression is a go-to method in data analytics because it offers the tantalising promise of being able to predict future unknowns. Identifying cause and effect is a process that is hardwired into our approach to problem solving. If we can identify a cause of a problem, we have a chance of intervening and altering the path of the future.

Regression allows us to explore two different variables and the way that they interact. In the simplest cases, we can easily measure the impact of variables such as water, sunlight, temperature and soil condition on the quality of the plant that grows. With enough trial and error, we can probably come up with an optimal level of watering and position in the garden to make healthy and happy plants more likely - not guaranteed, but more likely.

Think back to your school days, when you measured the heights of your classmates and plotted them against their month of birth. Although there was some variation, you could draw a line going upwards to show that, in general, the taller children were also the older children. Plain logic tells us that age affects height, and not the other way around.

The line is known as the line of best fit. A machine will try a position for the line and then measure the ‘error’: the distance of each data point from the line. We call this approach the ‘method of least squares’. We don’t much care whether a data point falls below the line (negative) or above it (positive), so instead of using the raw distance (+2 or -6.5) we use the square of that distance (4 or 42.25). Squaring immediately gets rid of negative values. The line of best fit, or regression line, is the line positioned so that the squared distances, added together, measure the least.
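
To make this idea concrete, here is a minimal sketch in Python (a natural choice for data analytics work). The ages and heights are invented for illustration, echoing the classroom example above, and numpy’s polyfit is just one common way to compute the least-squares line.

    import numpy as np

    # Invented data: age in months and height in cm for ten children
    age = np.array([112, 114, 116, 118, 120, 121, 122, 124, 125, 127])
    height = np.array([138, 139, 141, 140, 144, 143, 146, 147, 149, 150])

    # Degree-1 polyfit finds the slope and intercept that minimise
    # the sum of squared vertical distances from each point to the line
    slope, intercept = np.polyfit(age, height, 1)

    # The 'errors' (residuals): how far each point sits from the line
    predicted = slope * age + intercept
    residuals = height - predicted

    # Squaring removes the sign, so points below the line count
    # the same as points above it
    sum_of_squares = np.sum(residuals ** 2)

    print(f"Best fit: height = {slope:.2f} * age + {intercept:.2f}")
    print(f"Sum of squared errors: {sum_of_squares:.2f}")

Any other candidate line would give a larger sum of squared errors; that is all ‘least squares’ means.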

A little terminology

You have probably heard the phrase ‘correlation is not causation’. When we say that two things correlate, it means that we have found two things where an increase in the value of one seems to be accompanied by a corresponding increase or decrease in the value of the other. It is important to bear in mind that statistics alone can never prove causality. Perhaps, if there is a time element involved and one thing consistently happens first, we might assume that it is causal. Generally, human observation and common sense are our main sources of insight.

Below are some key terms you will meet when interpreting the output of a regression.

  • The coefficient is the amount by which one unit of the independent variable, like the number of hours of study, changes the dependent variable, such as the score in a test. For example, each hour of study may be associated with a 2-point increase in scores, on average. (The sketch after this list shows how these values appear in practice.)

    NB Sometimes the coefficient might be negative. For example, an increase of one kilo of vegetables in a diet might be associated with a decrease of 1 pound in body weight by the end of the week. In other words, you lose weight.

  • Things that we measure in the real world will vary naturally. For example, in our primary school height chart, some children of the exact same age will be taller or shorter. R-squared is the statistical measure that represents the proportion of the variance (or difference) that can be explained by the independent variable. For example, an R-squared of 0.6 would mean that a tremendous amount - a whopping 60% of the variance - can be explained by the influence of that one independent variable.

  • Heteroscedasticity. This rather grand-sounding word simply means that the accuracy of the prediction may vary depending on what point along the line we look at. For example, the points may be tightly clustered towards the lower levels, and widely spaced apart at the higher levels. We would do well to remember this inaccuracy and varied performance of our prediction when we come to use it. You may be given a weighted least squares model to help with decision making.

  • There will always be some variance naturally occurring. Many models, rather than produce only a line of best fit, may also show you the ‘confidence interval’. It is up to you to decide how confident you want to be about where that line correctly falls. Statistically, 95% confidence is a fairly popular choice: if the study were repeated many times, the interval would capture the true line in 19 out of 20 cases. But you may prefer to work with 99% confidence. This can be shown as a shaded area around the line of best fit.
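
The sketch below pulls these terms together. It is only an illustration with invented numbers; statsmodels is one widely used Python library that reports the coefficient, R-squared and confidence intervals from a single fitted model.

    import numpy as np
    import statsmodels.api as sm

    # Invented data: hours of study and test scores for twelve students
    hours = np.array([1, 2, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9])
    score = np.array([52, 55, 58, 60, 63, 61, 66, 70, 68, 74, 75, 80])

    # Ordinary least squares; add_constant lets the model fit an intercept
    model = sm.OLS(score, sm.add_constant(hours)).fit()

    print(model.params)          # intercept and coefficient (points per hour)
    print(model.rsquared)        # proportion of variance explained
    print(model.conf_int(0.05))  # 95% confidence intervals for both estimates

The same library also offers sm.WLS, a weighted least squares model of the kind mentioned above, which can down-weight the regions where the points are widely scattered.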

It is important not to rush to the conclusion that you have found a causal relationship, as tempting as it may be. There are many things that could explain what you have observed, the most common being that there may be an unseen, unrecorded third variable that is the true cause. For example, it is well known that ice-cream sales tend to increase in proportion with the murder rate in many large cities. The unobserved variable would be consistently hot temperatures, which cause both tempers to flare and people to seek out cold treats. Once you start to carry out multiple regression, looking at the impact of a range of complex factors on one variable, it is rare that any one variable is the sole predictor.
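
A small, hypothetical simulation makes the point. Everything here is invented: temperature drives both quantities, so they correlate strongly with each other even though neither causes the other.

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # Hypothetical data: hot weather drives both outcomes
    temperature = rng.uniform(10, 35, size=200)
    ice_cream_sales = 20 * temperature + rng.normal(0, 40, 200)
    incidents = 0.5 * temperature + rng.normal(0, 2, 200)

    # The two outcomes correlate strongly with each other...
    print(np.corrcoef(ice_cream_sales, incidents)[0, 1])

    def residuals(y):
        # Remove the part of y explained by temperature
        line = np.poly1d(np.polyfit(temperature, y, 1))
        return y - line(temperature)

    # ...but once temperature's influence is removed, the apparent
    # relationship between the residuals largely disappears
    print(np.corrcoef(residuals(ice_cream_sales), residuals(incidents))[0, 1])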

When you observe a correlation, it is more usual simply to state that you have observed an association, and leave it to others to reach their own conclusions about whether it is causal.

Avoiding mono-causal explanations

There is an old Korean story about a farmer and his son that has much to teach us about evaluating the present and the future.