(A Note for my SJC MAT 151 class – Summer 2018)
What we would have talked about on Friday, had we met for class, was Linear Regression.
Consider this Body Fat Data Set I found, which compares patients’ body fat measurements to other measurements on their bodies. The first sheet in the Excel spreadsheet is simply all of the data I was given. I was curious whether any of these variables were correlated with body fat in a significant way…
In the second sheet you will notice that I isolated Body Fat and Ankle (“size”?). I’m not really sure what measurement they were using on the ankle and it really isn’t important for our purposes. Anyway, if you check out a video like this one you should see that it is fairly simple to create the scatterplot with corresponding trendline and coefficient.
Let’s take a moment to understand how the trendline is chosen.
Choosing the Line of Best Fit
Suppose you are given a collection of data points with two variables. For our purposes, let’s consider the third sheet of the spreadsheet above where the $x$ variable is the Body Fat and the $y$ variable is the Age of the individual. Imagine that we proposed two different candidates for the line of best fit:
To be clear, the 5 data points that the red line is passing through are supposed to be the same 5 data points that the blue line is passing through. I just thought that separating the two pictures was clearer than drawing this:
At any rate, here is how we figure out which line, red or blue, is the better fit:
- For each of the five points in the data set, find the vertical distance between the data point and the red line.
- Square each of these vertical distances.
- Add up these squares of vertical distances. Label this total sum SSR, for “sum of squares on the red line”.
- Repeat the first three steps for the blue line and label this total sum SSB.
- Whichever line has a smaller sum of squares, SSR for the red line, or SSB for the blue line, is the better fit.
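The steps above can be sketched in a few lines of Python. This is only an illustration: the five points and the two candidate lines are made up by me and are not taken from the Body Fat spreadsheet, and `sum_of_squares` is a hypothetical helper name.

```python
# Five made-up (x, y) points; these are NOT from the Body Fat spreadsheet.
points = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]

def sum_of_squares(slope, intercept, data):
    """Add up the squared vertical distances from each point to the line."""
    total = 0.0
    for x, y in data:
        predicted = slope * x + intercept   # height of the line at this x
        total += (y - predicted) ** 2       # squared vertical distance
    return total

# Two hypothetical candidate lines.
ssr = sum_of_squares(1.0, 1.0, points)   # "red" line:  y = 1.0x + 1.0
ssb = sum_of_squares(0.5, 3.0, points)   # "blue" line: y = 0.5x + 3.0

better = "red" if ssr < ssb else "blue"
print(ssr, ssb, better)
```

For these made-up numbers SSR is 2.0 and SSB is 4.75, so the red line is the better fit.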
The beautiful thing is that mathematicians have figured out how to imagine all such possible lines, figure out their hypothetical SSX (i.e. sum of squares for line X), and then find out which of these potential lines has the minimum value for SSX. If you wanted to learn this procedure, you would have to know some Calculus or Linear Algebra.
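You do not need the Calculus to use the result. Here is a minimal sketch, assuming the standard least-squares formulas for a line with one $x$ variable, applied to five made-up points (again, not the spreadsheet data).

```python
# Five made-up (x, y) points; these are NOT from the Body Fat spreadsheet.
points = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

# Standard least-squares formulas for the line y = slope * x + intercept.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
s_xx = sum((x - mean_x) ** 2 for x, _ in points)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Sum of squared vertical distances for this best-fitting line.
ss_best = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
print(slope, intercept, ss_best)
```

For these points the line comes out to roughly $y = 0.9x + 1.3$ with a sum of squares of about 1.9, and no other straight line through this data can have a smaller sum.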
Understanding the Correlation Coefficient
While calculating the sum of the squares of the vertical distances, SSX, for a proposed line of best fit, mathematicians are also able to calculate a very handy number: $R^2$, the square of the correlation coefficient. Consider the Body Fat Data Set from above, where in the second sheet, I had Excel compute a trend line between Body Fat ($x$) and Ankle ($y$) measurement. Notice that it states on the chart: $R^2 = 0.071$. We will understand what this means for this example and then I will give you a general statement below. Suppose I measured my ankle, and I found that it measured “25”. The trend line has an equation relating $x$ and $y$, so plugging in $y = 25$ and solving for $x$ gives me a predicted Body Fat.
The problem is, since $R^2 = 0.071$, this method of solving for my Body Fat from my Ankle measurement will only work 7.1% of the time!!
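That’s the point: $R^2$, the square of the correlation coefficient, tells us how useful this whole process will be in linking the two variables using the trend line. Strictly speaking, it measures the fraction of the variation in $y$ that the trend line accounts for, so saying the line “works” a certain percent of the time is only a loose way of reading it. Since we can’t ever expect the points to all lie perfectly on a line, we can’t really expect this model to work perfectly. However, a higher value tells us that this linear model will be more useful.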
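If you are curious where a number like this comes from, here is a minimal sketch of one standard way to compute $R^2$ for a least-squares line: compare the sum of squares around the line with the sum of squares around the plain average of $y$. The points are made up, and `r_squared` is a hypothetical helper name; I am assuming this matches what Excel prints on the chart for a linear trend line.

```python
# Five made-up (x, y) points; these are NOT from the Body Fat spreadsheet.
points = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]

def r_squared(data):
    """R^2 of the least-squares line through the data."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n

    # Least-squares slope and intercept.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    s_xx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x

    # Squared vertical distances to the line, and to the flat line y = mean_y.
    ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in data)
    ss_total = sum((y - mean_y) ** 2 for _, y in data)
    return 1 - ss_residual / ss_total

r2 = r_squared(points)
print(round(r2, 4))
```

For these points the value is about 0.81: the line removes about 81% of the squared distance you would have if you just guessed the average $y$ every time.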
Does anything predict Body Fat?
I spent a few minutes messing around with the data and didn’t find anything which correlated to body fat very well: Ankle measurement works 7.1% of the time, Age works 8.4% of the time; however, taking the difference of Abdomen and Hip seemed to predict Body Fat 56.56% of the time! That’s still a miserable percentage; you might as well flip a coin (not that flipping a coin makes sense here but…) but at least it’s better. Can you find any better percentages from comparing other variables I had not considered yet? Or maybe we shouldn’t even focus on Body Fat!? Why was I looking at that anyway? All psychological analysis of my priorities aside, can you find any other two variables in that data set that have an $R^2$ higher than 0.9? Note: I didn’t check whether such a pair exists; this is an open-ended question!
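If you would rather not click through every pair of columns by hand, here is a rough sketch of how the search could be automated. The column names `a`, `b`, `c` and their values are made up; you would have to paste in the real columns from the spreadsheet yourself. The formula used is the square of the usual correlation coefficient, which I am assuming agrees with Excel's value for a linear trend line.

```python
from itertools import combinations

# Made-up columns standing in for the spreadsheet; names and values are hypothetical.
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 3, 5, 4, 6],
    "c": [2, 4, 6, 8, 10],
}

def r_squared(xs, ys):
    """Square of the correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy ** 2 / (s_xx * s_yy)

# Try every pair of columns once and keep the ones above the 0.9 threshold.
strong_pairs = []
for name_1, name_2 in combinations(columns, 2):
    value = r_squared(columns[name_1], columns[name_2])
    if value > 0.9:
        strong_pairs.append((name_1, name_2, value))

print(strong_pairs)
```

With these made-up columns only the pair `a` and `c` passes, because `c` is exactly twice `a`.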
Why using the word “predict” is problematic
You should have heard this phrase before:
Correlation vs Causation
The point is that even if the correlation between Body Fat and, let’s say, Eyebrow Size were strong… even if the $R^2$ were 99%, who is to say that the body fat doesn’t cause a change in eyebrow size, rather than the other way around? That is to say, just because 99% of the time these two are related does not mean that the mathematics reveals which is the “chicken” and which is the “egg” (which is a bad example because if saying which were the chicken and which were the egg were helpful then there probably wouldn’t be much controversy over which came first!).
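One way to see that the mathematics cannot tell the chicken from the egg: the number comes out the same no matter which variable you call $x$ and which you call $y$. Here is a small sketch with made-up numbers (not the spreadsheet data), using the square of the usual correlation coefficient; `r_squared` is a hypothetical helper name.

```python
def r_squared(xs, ys):
    """Square of the correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy ** 2 / (s_xx * s_yy)

# Made-up measurements; NOT from the Body Fat spreadsheet.
first = [1, 2, 3, 4, 5]
second = [2, 3, 5, 4, 6]

forward = r_squared(first, second)    # treat "first" as x
backward = r_squared(second, first)   # treat "second" as x
print(forward, backward)
```

Both directions give the same value, so a high number by itself says nothing about which variable is driving the other.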