In this post, we are going to talk about linear regression, one of the most popular techniques in predictive modeling. In fact, this is where it all begins. I am sure the diagram below seems familiar.
Linear regression is used when we want to describe the relationship between a dependent variable y and an independent variable x, and there are many examples of this. For example, suppose you are trying to predict how large a loan can be sanctioned to a customer based on their credit rating. The credit rating is the independent variable x, and the loan amount the customer is eligible for is the dependent variable y.
For a student, we might want to check whether we can predict exam scores from the average number of hours studied each day. Here, hours studied each day is x and the exam score is y. There are numerous examples of such relationships.
We draw a straight line that represents all the red points here. This line has a slope of β1 and an intercept of β0, which is the point where the line crosses the y-axis. We can put this in the form of the equation given below:
y = β0 + β1x
Here y is the value being predicted from x.
In this equation, both the intercept and the slope matter.
Now let’s look at what happens if we pay attention to only one of these. If we fix only the slope and leave the intercept free, we can draw many lines that are parallel to each other. That is one way to represent the points.
What if we fix only the intercept and not the slope? In that case, we can draw many lines with the same intercept and varying slopes.
So, at the end of the day, to settle on one line that fairly represents all these points, we have to pin down both the slope and the intercept. But there are infinitely many combinations of slope and intercept.
How do we decide which combination to choose? This is where the ordinary least squares method comes in. As discussed, the line is trying to represent these points, but not all the points lie on the line. They are some distance away from it.
If we draw a vertical segment from each point to the line, we can measure that distance. And what is that distance? It is a deviation: the difference between the observed value and the value predicted by the line. It is also known as the residual (error). For each given x, this difference is the vertical distance of the point from the line. Note that ordinary least squares measures the distance along the y-axis, not perpendicular to the line.
Now the question is, how do we decide on the line?
If we look at these distances, they can be positive or negative, depending on whether the point is above or below the line, so simply adding them up would let them cancel each other out.
So we square each distance and take the sum of the squares. We look for the line for which this sum of squared distances is the smallest, and that line is the regression line.
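To make the idea concrete, here is a minimal sketch of ordinary least squares for one x, written in plain Python. The function names (`fit_simple_ols`, `sum_squared_residuals`) and the hours/scores numbers are made up for illustration; they are not data from this post. The sketch uses the standard closed-form solution for the slope and intercept, and then checks that nudging either one makes the sum of squared residuals larger.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the line that minimizes the sum of squared residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope = co-variation of x and y divided by variation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    # The least squares line always passes through (x_mean, y_mean).
    intercept = y_mean - slope * x_mean
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Made-up toy data: hours studied per day (x) and exam score (y).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 58.0, 61.0, 70.0, 74.0]

b0, b1 = fit_simple_ols(hours, scores)
best = sum_squared_residuals(hours, scores, b0, b1)

# Same slope with a shifted intercept (a parallel line), and same intercept
# with a different slope: both fit worse than the least squares line.
parallel = sum_squared_residuals(hours, scores, b0 + 1.0, b1)
rotated = sum_squared_residuals(hours, scores, b0, b1 + 0.5)

print("intercept:", round(b0, 3), "slope:", round(b1, 3))
print("sum of squared residuals:", round(best, 3), round(parallel, 3), round(rotated, 3))
```

Any other combination of slope and intercept you try on this data should give a larger sum of squared residuals than the fitted line, which is exactly what "least squares" means.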
As long as we have a single x explaining y, we call it simple linear regression. But the same concept can be extended to n variables: y is then predicted with the help of several variables rather than just one. This is called multiple linear regression.
Often in real-life scenarios, the outcome of interest can only be explained with the help of a number of independent variables, not one variable alone.
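As a rough sketch of how the same least squares idea extends to several x’s, the snippet below solves the so-called normal equations in plain Python. The helper names (`solve_linear_system`, `fit_multiple_ols`) are hypothetical, and the data is generated on purpose from y = 1 + 2*x1 + 3*x2 so that we can see the fit recover those coefficients. In practice you would more likely use a library such as scikit-learn or statsmodels.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build an augmented matrix so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_multiple_ols(rows, ys):
    """Return [b0, b1, ..., bn] for y = b0 + b1*x1 + ... + bn*xn."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 gives the intercept
    k = len(design[0])
    # Normal equations: (X^T X) b = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up toy data, generated exactly from y = 1 + 2*x1 + 3*x2.
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
ys = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in rows]

coefs = fit_multiple_ols(rows, ys)
print([round(c, 6) for c in coefs])
```

Because the toy data has no noise, the fitted coefficients should come out very close to 1, 2 and 3. With real data the points will not sit exactly on a plane, and the fit gives the coefficients with the smallest sum of squared residuals.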
Let’s discuss one of the most popular use cases of regression.
The Moneyball challenge
It is about Moneyball, a book published in 2003 that became a Hollywood movie released in 2011. It is based on a true story. The Oakland A’s had once been a rich team, but faced stringent budget cuts in 1995.
Low budgets in sports imply an inability to retain or attract star players. Any player in form charges a lot of money, and you can afford such a player only if your budget allows it.
The interesting part, which turned the story into a best-selling book and a hit movie, is that despite all this the Oakland A’s continued to improve each year from 1997 to 2001. They kept performing and made it to the playoffs every year from 2000 to 2003. If you follow baseball, you know that making the playoffs means a lot.
The best part was that they delivered a performance equivalent to that of teams with three times their budget. It is a story of belief, a story of intelligence, a story of how to use the available assets to the best advantage.
Let’s understand how this was made possible.
The general manager of the team, Billy Beane, engaged a Harvard graduate named Paul DePodesta to help him with this task. All of this seems easy to us now, in hindsight, since this is 2020. Things are not that easy when people are actually facing the challenge.
This is a beautiful case of applying linear regression to solve a real-world problem.
Level 1: The first task before Paul DePodesta was to figure out the total number of wins needed. Based on historical data, he found a good linear relationship between the number of wins and the run difference.
In fact, run difference, which is the difference between the number of runs scored and the number of runs allowed, turned out to be a very good predictor of the number of wins.
But how do we estimate runs scored and runs allowed? Here a second level of linear regression equations was applied. For example, runs scored could be predicted using on-base percentage and slugging percentage.
RS = a0 + a1(On-base %) + a2(Slugging)
There is no need to worry if this jargon is not familiar to you. At the end of the day, we are trying to understand how a problem can be solved step by step using regression.
Similarly, runs allowed (RA) can be predicted very well using the opponents’ on-base percentage and the opponents’ slugging percentage.
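The two levels described above can be chained together. Below is a small sketch of that structure, assuming each level is an already fitted linear model. The `LinearModel` class, the `predict_wins` function, and every coefficient and input value here are placeholders invented so the arithmetic is easy to follow; they are not the coefficients Paul DePodesta used and are not estimated from any real data.

```python
from dataclasses import dataclass


@dataclass
class LinearModel:
    """y = intercept + sum(coef_i * x_i), with coefficients from a fitted regression."""
    intercept: float
    coefs: tuple

    def predict(self, *xs):
        if len(xs) != len(self.coefs):
            raise ValueError("expected %d inputs" % len(self.coefs))
        return self.intercept + sum(c * x for c, x in zip(self.coefs, xs))


def predict_wins(rs_model, ra_model, wins_model, obp, slg, opp_obp, opp_slg):
    """Chain the second-level models (runs) into the first-level model (wins)."""
    runs_scored = rs_model.predict(obp, slg)            # RS from on-base % and slugging
    runs_allowed = ra_model.predict(opp_obp, opp_slg)   # RA from the opponents' figures
    run_difference = runs_scored - runs_allowed
    return wins_model.predict(run_difference)           # wins from run difference


# PLACEHOLDER coefficients and inputs, for illustration only (not real estimates).
rs_model = LinearModel(intercept=-800.0, coefs=(2500.0, 1500.0))
ra_model = LinearModel(intercept=-800.0, coefs=(2500.0, 1500.0))
wins_model = LinearModel(intercept=80.0, coefs=(0.1,))

wins = predict_wins(rs_model, ra_model, wins_model,
                    obp=0.340, slg=0.430, opp_obp=0.310, opp_slg=0.380)
print("predicted wins with placeholder numbers:", round(wins, 1))
```

The point of the sketch is the shape of the solution rather than the numbers: each regression answers one small question, and the output of one level becomes the input of the next.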
Let’s compare what was predicted using the historical data with what was actually achieved.
Paul DePodesta predicted that the Oakland A’s needed to score between 800 and 820 runs, and they actually scored 800. Similarly, he estimated that runs allowed should be between 650 and 670, and the actual figure was 653.
Paul DePodesta claimed that if the team won more than 95 games, it would make it to the playoffs.
But with all the considerations that went into forming the team based on the numbers, the team achieved much more: it won 103 games.
This brought wide attention to a stream of analytics in baseball known as sabermetrics. The Oakland A’s clearly had an edge for 6-7 years, until other teams gradually caught up. When they started, they were pioneers in applying sabermetrics to building a team.
Consider a team that spent 30 million USD performing on par with teams that spent 80 to 90 million USD. That difference of roughly 60 million dollars was compensated for by the intelligence behind the decisions, a different way of thinking. That is the beautiful thing about analytics: it gives you the power to make the right decisions with data instead of making random guesses.
Analytics is about being better than random and this is a beautiful example.
Coefficient of determination (R squared value)
It is an important measure of a model’s performance. For a simple linear regression, R squared is simply the square of Pearson’s correlation coefficient. The calculation becomes a little more complex as you move to multiple linear regression.
An R-squared value of 80% means that 80% of the variability in our dependent variable can be explained by the independent variables in our model. When forming a linear equation for y in terms of a given set of x’s, you might include many x’s, but not all of them are necessarily important. R squared helps us understand whether the x’s we have taken into account are sufficient to explain y, our outcome of interest.
For example, in the case of loan amount eligibility, the bank might want to consider multiple x’s. These could of course include your credit rating, the number of years you have been associated with the bank, the balance carried forward each month, the kind of deposits you hold with the bank and whether they are secure, and many other factors such as your age and number of dependents. y is explained by the best combination of x’s. These x’s are not picked at random; they are recommended by subject matter experts in the area.
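Here is a small sketch of how R squared can be computed for a simple linear regression, and a check that it matches the square of Pearson’s correlation coefficient. The function names and the toy hours/scores data are made up for illustration. R squared is computed as 1 minus the ratio of the unexplained variation (sum of squared residuals) to the total variation of y around its mean.

```python
from math import sqrt


def fit_line(xs, ys):
    """Least squares intercept and slope for one x."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return y_mean - slope * x_mean, slope


def r_squared(ys, predictions):
    """Share of the variability in y that the model explains."""
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))  # unexplained
    ss_tot = sum((y - y_mean) ** 2 for y in ys)                   # total
    return 1.0 - ss_res / ss_tot


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between x and y."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up toy data: hours studied per day (x) and exam score (y).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 58.0, 61.0, 70.0, 74.0]

intercept, slope = fit_line(hours, scores)
predicted = [intercept + slope * x for x in hours]

r2 = r_squared(scores, predicted)
r = pearson_r(hours, scores)
print("R squared:", round(r2, 4), "Pearson r squared:", round(r * r, 4))
```

On this toy data the two numbers agree, which is the relationship described above for simple linear regression. With several x’s there is no single Pearson coefficient to square, but the 1 minus ratio definition still applies.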
Assumptions of linear regression
- The relationship between the dependent variable and the independent variable(s) should be linear and additive. In the left part of the diagram, a straight line can reasonably be fitted through the points. If the points follow a curve, as in the right part of the diagram, we would be force-fitting a line through them. So whenever we work with linear regression, we first check whether x and y can be associated linearly.
- The independent variables should not be correlated with each other. Correlation measures the degree and direction of association between two variables; when the independent variables are correlated with each other, it is called collinearity. We don’t want x’s in our model that predict each other. If our independent variables are related to each other, the model becomes redundant. So we want to add only variables that add value to our model. Adding a variable never decreases the R square value, even when the variable contributes little, which is why adjusted R square is used to keep a check on the number of variables. As an analytics professional, you don’t want to complicate your model unnecessarily.
- The error terms must have constant variance. This is known as homoscedasticity. When we plot the residuals (the differences between the observed values and the values predicted by the regression line) against the predicted values, their spread should be roughly constant. As you can see on the left side, the spread of the errors is nearly constant and stays within a tight band. On the right side, the residuals form a cone shape, which shows that their variance is not constant with respect to the predicted values.
- The error terms must be normally distributed.
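
The adjusted R square mentioned in the collinearity point above can be sketched with its usual formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of independent variables. The function name and the example numbers below are hypothetical, chosen only to show how the penalty works.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R squared: penalizes R squared for the number of predictors."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / dof


# Hypothetical comparison on 50 observations:
# a 3-variable model with R squared 0.80 versus a 10-variable model with 0.81.
adj_small = adjusted_r_squared(0.80, n_obs=50, n_predictors=3)
adj_large = adjusted_r_squared(0.81, n_obs=50, n_predictors=10)

print("3 variables: ", round(adj_small, 4))
print("10 variables:", round(adj_large, 4))
```

In this made-up comparison the larger model has the higher R square but the lower adjusted R square, which suggests that the seven extra variables are not pulling their weight.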