Introduction
Data rarely fit a straight line exactly. Usually, you must be satisfied with rough predictions. Typically, you have a set of data whose scatter plot appears to fit a straight line. This line is called a line of best fit or least-squares regression line.
Collaborative Exercise
If you know a person’s pinky (smallest) finger length, do you think you could predict that person’s height? Collect data from your class (pinky finger length, in inches). The independent variable, x, is pinky finger length and the dependent variable, y, is height. For each set of data, plot the points on graph paper. Make your graph big enough and use a ruler. Then, by eye, draw a line that appears to fit the data. For your line, pick two convenient points and use them to find the slope of the line. Find the y-intercept of the line by extending your line so it crosses the y-axis. Using the slopes and the y-intercepts, write your equation of best fit. Do you think everyone will have the same equation? Why or why not? According to your equation, what is the predicted height for a pinky length of 2.5 inches?
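If you would like to check your by-eye line numerically, the same arithmetic takes only a few lines of Python. This is just a sketch: the two points below are hypothetical stand-ins for whatever convenient points you read off your own graph.

```python
# Two convenient points read off a hand-drawn line (hypothetical values).
x1, y1 = 2.0, 61.0   # pinky finger length (in), height (in)
x2, y2 = 2.8, 68.0

m = (y2 - y1) / (x2 - x1)   # slope of the by-eye line
b = y1 - m * x1             # y-intercept

print(f"by-eye line: y = {m:.2f}x + {b:.2f}")
print(f"predicted height for a 2.5-inch pinky: {m * 2.5 + b:.1f} inches")
```

Because everyone draws a slightly different line and picks different points, the slope and intercept printed here will differ from classmate to classmate, which is exactly the point of the exercise.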
Example 12.6
A random sample of 11 statistics students produced the data in Table 12.3, where x is the third exam score out of 80 and y is the final exam score out of 200. Can you predict the final exam score of a random student if you know the third exam score?
x (third exam score) | y (final exam score) |
---|---|
65 | 175 |
67 | 133 |
71 | 185 |
71 | 163 |
66 | 126 |
75 | 198 |
67 | 153 |
70 | 163 |
71 | 159 |
69 | 151 |
69 | 159 |
SCUBA divers have maximum dive times they cannot exceed when going to different depths. The data in Table 12.4 show different depths in feet, with the maximum dive times in minutes. Use your calculator to find the least squares regression line and predict the maximum dive time for 110 feet.
x (depth) | y (maximum dive time) |
---|---|
50 | 80 |
60 | 55 |
70 | 45 |
80 | 35 |
90 | 25 |
100 | 22 |
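If a graphing calculator is not handy, the same least-squares fit can be computed directly from the data. The sketch below uses plain Python and the standard least-squares formulas (introduced later in this section); it is one way to do the calculation, not the only one.

```python
# Depth (ft) and maximum dive time (min) from Table 12.4.
depth = [50, 60, 70, 80, 90, 100]
time = [80, 55, 45, 35, 25, 22]

n = len(depth)
x_bar = sum(depth) / n
y_bar = sum(time) / n

# Least-squares slope and intercept.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(depth, time)) / \
    sum((x - x_bar) ** 2 for x in depth)
a = y_bar - b * x_bar

print(f"yhat = {a:.2f} + {b:.2f}x")
print(f"predicted maximum dive time at 110 ft: {a + b * 110:.1f} minutes")
```

Run as written, this gives a line of roughly ŷ = 127.24 − 1.11x and a predicted maximum dive time of about 4.7 minutes at 110 feet, which a calculator's linear-regression routine should also report, up to rounding.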
The third exam score, x, is the independent variable, and the final exam score, y, is the dependent variable. We will plot a regression line that best fits the data. If each of you were to fit a line by eye, you would draw different lines. We can obtain a line of best fit using either the median–median line approach or by calculating the least-squares regression line.
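Before fitting any line, it helps to look at the scatter plot of the data. A minimal sketch, assuming matplotlib is available:

```python
import matplotlib.pyplot as plt

# Third exam (x) and final exam (y) scores from Table 12.3.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

plt.scatter(x, y)
plt.xlabel("third exam score (out of 80)")
plt.ylabel("final exam score (out of 200)")
plt.title("Third exam vs. final exam")
plt.show()
```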
Let’s first find the line of best fit for the relationship between the third exam score and the final exam score using the median–median line approach. Remember that these are the data from Example 12.6 after the ordered pairs have been listed in order of their x values. When multiple data points share the same x value, they are listed from least to greatest y (see the data values where x = 71). We first divide the scores into three groups containing approximately equal numbers of x values, with the first and third groups containing the same number. The x values must be in ascending order, with the corresponding y values recorded alongside them. To find the median of each group, however, we must rearrange the y values within that group from least to greatest. Table 12.5 shows the correct ordering of the x values but does not show a reordering of the y values.
x (third exam score) | y (final exam score) |
---|---|
65 | 175 |
66 | 126 |
67 | 133 |
67 | 153 |
69 | 151 |
69 | 159 |
70 | 163 |
71 | 159 |
71 | 163 |
71 | 185 |
75 | 198 |
With this set of data, the first and last groups each have four x values and four corresponding y values. The second group has three x values and three corresponding y values. We need to organize the x and y values per group and find the median x and y values for each group. Let’s now write out our y values for each group in ascending order. For group 1, the y values in order are 126, 133, 153, and 175. For group 2, the y values are already in order. For group 3, the y values are also already in order. We can represent these data as shown in Table 12.6, but notice that we have broken the ordered pairs — (65, 126) is not a data point in our original set.
Group | x (third exam score) | y (final exam score) | Median x value | Median y value |
---|---|---|---|---|
1 | 65, 66, 67, 67 | 126, 133, 153, 175 | 66.5 | 143 |
2 | 69, 69, 70 | 151, 159, 163 | 69 | 159 |
3 | 71, 71, 71, 75 | 159, 163, 185, 198 | 71 | 174 |
When this is completed, we can write the ordered pairs for the median values. This allows us to find the slope and y-intercept of the median–median line.
The ordered pairs are (66.5, 143), (69, 159), and (71, 174).
The slope can be calculated using the formula $m = \frac{y_2 - y_1}{x_2 - x_1}$. Substituting the median x and y values from the first and third groups gives $m = \frac{174 - 143}{71 - 66.5}$, which simplifies to $m \approx 6.9$.
The y-intercept may be found using the formula $b = \frac{\sum y_{\text{median}} - m\sum x_{\text{median}}}{3}$: the sum of the three median y values minus the slope times the sum of the three median x values, all divided by three.
The sum of the median x values is 206.5, and the sum of the median y values is 476. Substituting these sums and the slope into the formula gives $b = \frac{476 - 6.9(206.5)}{3}$, which simplifies to $b \approx -316.3$.
The line of best fit is represented as y=mx+b.
Thus, the equation can be written as y = 6.9x − 316.3.
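The whole median–median procedure is short enough to script. Below is a minimal sketch in Python, assuming the 4–3–4 grouping described above:

```python
from statistics import median

# Third exam (x) and final exam (y) scores, ordered by x as in Table 12.5.
data = [(65, 175), (66, 126), (67, 133), (67, 153), (69, 151), (69, 159),
        (70, 163), (71, 159), (71, 163), (71, 185), (75, 198)]

# Split into three groups of 4, 3, and 4 points.
groups = [data[:4], data[4:7], data[7:]]

# Median x and median y within each group (median() sorts the y values itself).
med_points = [(median(x for x, _ in g), median(y for _, y in g)) for g in groups]
# med_points is [(66.5, 143), (69, 159), (71, 174)]

# Slope from the first- and third-group median points.
(x1, y1), _, (x3, y3) = med_points
m = (y3 - y1) / (x3 - x1)

# Intercept: (sum of median y values - slope * sum of median x values) / 3.
b = (sum(y for _, y in med_points) - m * sum(x for x, _ in med_points)) / 3

print(f"median-median line: y = {m:.2f}x + ({b:.2f})")
```

Because the script keeps the unrounded slope (31/4.5 ≈ 6.89) when computing the intercept, it reports an intercept near −315.5 rather than the −316.3 obtained above with the rounded slope of 6.9.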
The median–median line may also be found using your graphing calculator. You can enter the x and y values into two separate lists; choose Stat, Calc, Med-Med; and then press Enter. The slope, a, and y-intercept, b, will be provided. The calculator shows a slight deviation from the previous manual calculation as a result of rounding. Rounding to the nearest tenth, the calculator gives the median–median line as y = 6.9x − 315.5. Each point of data is of the form (x, y), and each point on the line of best fit using least-squares linear regression has the form (x, ŷ).
The ŷ is read y hat and is the estimated value of y. It is the value of y obtained using the regression line. It is not generally equal to y from data, but it is still important because it can help make predictions for other values.
The term y0 – ŷ0 = ε0 is called the error or residual. It is not an error in the sense of a mistake. The absolute value of a residual measures the vertical distance between the actual value of y and the estimated value of y. In other words, it measures the vertical distance between the actual data point and the predicted point on the line, or it measures how far the estimate is from the actual data value.
If the observed data point lies above the line, the residual is positive and the line underestimates the actual data value for y. If the observed data point lies below the line, the residual is negative and the line overestimates the actual data value for y.
In Figure 12.7, y0 – ŷ0 = ε0 is the residual for the point shown. Here the point lies above the line and the residual is positive.
ε = the Greek letter epsilon
For each data point, you can calculate the residuals or errors, yi – ŷi = εi for i = 1, 2, 3, . . . , 11.
Each |ε| is a vertical distance.
For the example about the third exam scores and the final exam scores for the 11 statistics students, there are 11 data points. Therefore, there are 11 ε values. If you square each ε and add, you get the sum of ε squared from i = 1 to i = 11, as shown below.
$(\varepsilon_1)^2 + (\varepsilon_2)^2 + \dots + (\varepsilon_{11})^2 = \sum_{i=1}^{11} \varepsilon_i^2.$
This is called the sum of squared errors (SSE).
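As a concrete illustration, the sketch below computes the eleven residuals and the SSE for the exam data, using the median–median line found earlier as the candidate line; any other slope and intercept could be substituted to see how the SSE changes.

```python
# Candidate line: the median-median line found earlier.
m, b = 6.9, -316.3

# Third exam (x) and final exam (y) scores from Table 12.3.
data = [(65, 175), (66, 126), (67, 133), (67, 153), (69, 151), (69, 159),
        (70, 163), (71, 159), (71, 163), (71, 185), (75, 198)]

# Residual for each point: epsilon_i = y_i - yhat_i.
residuals = [y - (m * x + b) for x, y in data]
sse = sum(e ** 2 for e in residuals)   # sum of squared errors

for (x, y), e in zip(data, residuals):
    print(f"x = {x}, y = {y}, residual = {e:+.1f}")
print(f"SSE = {sse:.1f}")
```

The line of best fit is the particular choice of slope and intercept that makes this SSE as small as possible.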
Using calculus, you can determine the values of a and b that make the SSE a minimum. When you make the SSE a minimum, you have determined the points that are on the line of best fit. It turns out that the line of best fit has the equation

$\hat{y} = a + bx,$

where
$a = \bar{y} - b\bar{x}$

and

$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}.$
The sample means of the x values and the y values are $\bar{x}$ and $\bar{y}$, respectively. The best-fit line always passes through the point $(\bar{x}, \bar{y})$.
The slope (b) can also be written as $b = r\left(\frac{s_y}{s_x}\right)$, where $s_y$ is the standard deviation of the y values and $s_x$ is the standard deviation of the x values. Here r is the correlation coefficient, which measures the strength and direction of the linear relationship between the x and y values. It will be discussed in more detail in the next section.
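Putting these formulas into code makes them concrete. The sketch below computes a and b for the third-exam/final-exam data and checks that the slope also equals $r\left(\frac{s_y}{s_x}\right)$:

```python
from math import sqrt

# Third exam (x) and final exam (y) scores from Table 12.3.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept from the formulas above.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = s_xy / s_xx
a = y_bar - b * x_bar

# Equivalent form of the slope: b = r * (s_y / s_x), with sample standard deviations.
s_yy = sum((yi - y_bar) ** 2 for yi in y)
r = s_xy / sqrt(s_xx * s_yy)
s_x = sqrt(s_xx / (n - 1))
s_y = sqrt(s_yy / (n - 1))

print(f"yhat = {a:.2f} + {b:.2f}x")                        # least-squares line
print(f"b = {b:.4f}, r*(s_y/s_x) = {r * s_y / s_x:.4f}")   # the two slopes agree
```

For these data the output is approximately ŷ = −173.51 + 4.83x, and the two slope computations agree, as they must.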