###### (A Note for my SJC MAT 151 class – Summer 2018)

What we would have talked about on Friday, had we met for class, was *Linear Regression*.

Consider this **Body Fat Data Set** I found, which compares patients’ body fat measurements to other measurements on their bodies. The first sheet in the Excel spreadsheet is simply all of the data I was given. I was curious whether any of these variables were *correlated* with body fat in a significant way…

In the second sheet you will notice that I isolated *Body Fat* and *Ankle* (“size”?). I’m not really sure what measurement they were using on the ankle, and it really isn’t important for our purposes. Anyway, if you check out a video like this one you should see that it is fairly simple to create the scatterplot with corresponding *trendline* and *coefficient*.

Let’s take a moment to understand how the trendline is chosen.

### Choosing the Line of Best Fit

Suppose you are given a collection of data points with two variables. For our purposes, let’s consider the third sheet of the spreadsheet above, where one variable is the *Body Fat* and the other variable is the *Age* of the individual. Imagine that we proposed two different candidates for the line of best fit:

To be clear, the 5 data points that the red line is passing through are supposed to be the same 5 data points that the blue line is passing through; I just thought that separating the two pictures was clearer than drawing this:

At any rate, here is how we figure out which line, **red** or **blue**, is the *better fit:*

1. For each of the five points in the data set, find the **vertical distance** between the data point and the **red** line.
2. **Square** each of these **vertical distances**.
3. Add up these **squares** of **vertical distances**. Label this total sum **SSR**, for “sum of squares on the red line”.
4. Complete steps 1-3 for the **blue** line and label this total sum **SSB**.
5. Whichever line has the smaller sum of squares, SSR for the red line or SSB for the blue line, is the **better fit**.
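If you would rather see the steps above as a computation, here is a minimal Python sketch. The five data points and the two candidate lines are made up for illustration (they are not taken from the spreadsheet), and the function name `sum_of_squares` is my own.

```python
def sum_of_squares(points, slope, intercept):
    """Sum of squared vertical distances from each point to y = slope*x + intercept."""
    total = 0.0
    for x, y in points:
        predicted = slope * x + intercept   # height of the line at this x
        total += (y - predicted) ** 2       # squared vertical distance
    return total

# Hypothetical data points (NOT from the Body Fat spreadsheet)
points = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]

# Two hypothetical candidate lines
ssr = sum_of_squares(points, slope=1.0, intercept=1.0)  # "red" line:  y = x + 1
ssb = sum_of_squares(points, slope=0.5, intercept=2.0)  # "blue" line: y = 0.5x + 2

better = "red" if ssr < ssb else "blue"
print(ssr, ssb, better)
```

For these made-up numbers the red line has the smaller sum of squares, so it would be declared the better fit of the two.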

The beautiful thing is that mathematicians have figured out how to imagine all such possible lines, figure out their hypothetical SSX (i.e. the sum of squares for line X), and then find out which of these potential lines has the **minimum** value for SSX. If you wanted to learn this procedure, you would have to know some *Calculus* or *Linear Algebra*.
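For the curious, the result of that minimization can be written as a short formula: the slope is the sum of the products of the deviations from the means divided by the sum of the squared $x$ deviations, and the intercept makes the line pass through the point of means. Here is a rough Python sketch of that standard least-squares formula, again on made-up data; the function name `least_squares_line` is just my label for it.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the line minimizing the sum of squared vertical distances."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sum of products of deviations, and sum of squared x deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x   # line passes through (mean_x, mean_y)
    return slope, intercept

# Hypothetical data points (NOT from the Body Fat spreadsheet)
points = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]

slope, intercept = least_squares_line(points)
print(slope, intercept)
```

This should agree with the trendline Excel draws for the same points, since Excel's linear trendline is also a least-squares fit.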

### Understanding the Correlation Coefficient

While calculating the *sum of the squares of the vertical distances*, **SSX**, for a proposed line of best fit, mathematicians are also able to calculate a very handy number: $R^2$, or the **correlation coefficient** (strictly speaking, $R^2$ is the *square* of the correlation coefficient $r$, but it is the number Excel prints on the chart). Consider the **Body Fat Data Set** from above, where in the second sheet I had Excel compute a trend line between **Body Fat (x)** and **Ankle (y)** measurement. Notice that the chart states the value of $R^2$, which is about $0.071$. We will understand what this means for this example and then I will give you a general statement below. Suppose I measured my ankle, and I found that it measured “25”. The trend line has an equation of the form $y = mx + b$, with the particular slope and intercept printed on the chart.

Plugging $y = 25$ into that equation and solving for $x$ gives me a predicted **Body Fat**. The problem is, since $R^2 \approx 0.071$, this method of solving for my **Body Fat** from my **Ankle** measurement will only work about 7.1% of the time!!

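That’s the point: the **correlation coefficient**, $R^2$, tells us how useful this whole process will be in linking the two variables using the trend line. More precisely, $R^2$ is the fraction of the variation in $y$ that the trend line accounts for, so saying the model “works” some percentage of the time is a loose way of saying that the line explains that percentage of the variation. Since we can’t ever expect the points to all lie perfectly in a line, we can’t ever really expect this model to work perfectly. However, a higher $R^2$ value tells us that this linear model will be more useful.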
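To make the connection with the sum of squares concrete, here is a small Python sketch of the usual formula for this number: one minus the ratio of the best-fit line's sum of squares to the sum of squared distances from the mean of $y$. The data are made up, and the function name `r_squared` is my own.

```python
def r_squared(points):
    """R^2 of the least-squares line: 1 - (SS of residuals) / (SS about the mean of y)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared vertical distances to the best-fit line
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    # Sum of squared vertical distances to the horizontal line y = mean_y
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot

# Hypothetical data points (NOT from the Body Fat spreadsheet)
points = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]

r2 = r_squared(points)
print(r2)
```

A value near 1 means the best-fit line leaves very little of the variation in $y$ unexplained; a value near 0 means the line does barely better than just guessing the average.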

### Does anything predict Body Fat?

I spent a few minutes messing around with the data and didn’t find anything which correlated with body fat very well: **Ankle** measurement works 7.1% of the time and **Age** works 8.4% of the time; however, taking the difference of **Abdomen** and **Hip** seemed to predict **Body Fat** 56.56% of the time! That’s still a miserable percentage; you might as well flip a coin (not that flipping a coin makes sense here but…) but at least it’s better. Can you find any better percentages from comparing other variables I had not considered yet? Or maybe we shouldn’t even focus on Body Fat!? Why was I looking at that anyway? All psychological analysis of my priorities aside, can you find any other two variables in that data set that have an $R^2$ higher than 0.9? **Note:** I didn’t check that such a pair exists; this is an open-ended question!
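If you get tired of making one chart at a time, the search can be automated. The sketch below is one possible way to do it in Python: it computes $R^2$ for every pair of columns and sorts the pairs from best to worst. The column names and numbers are placeholders I invented; you would have to type in (or load) the real columns from the spreadsheet yourself.

```python
from itertools import combinations

def r_squared(xs, ys):
    """R^2 of the least-squares line between two equal-length columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)   # square of the correlation coefficient r

# Placeholder columns (NOT from the Body Fat spreadsheet)
columns = {
    "col_a": [1, 2, 3, 4, 5],
    "col_b": [2, 4, 6, 8, 10],   # exactly 2 * col_a, so these lie on a perfect line
    "col_c": [5, 3, 6, 2, 4],
}

# R^2 for every pair of columns, largest first
results = sorted(
    ((r_squared(columns[a], columns[b]), a, b) for a, b in combinations(columns, 2)),
    reverse=True,
)
for value, a, b in results:
    print(a, b, round(value, 4))
```

For a straight-line fit between two variables, $R^2$ comes out the same no matter which variable you call $x$, which is why each pair only needs to be checked once.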

### Why using the word “predict” is problematic

You should have heard this phrase before:

Correlation vs. Causation

The point is that even if the **correlation coefficient** between **Body Fat** and, let’s say, **Eyebrow Size**… even if the $R^2$ were 99%, who is to say that the body fat doesn’t *cause* a change in eyebrow size? That is to say, just because 99% of the time these two are *related* does not mean that the mathematics reveals which is the “chicken” and which is the “egg” (which is a bad example, because if saying which were the chicken and which were the egg were helpful then there probably wouldn’t be much controversy over which came first!).