Section 1.6 Regression

Subsection 1.6.1 Terminology: the term "football-shaped"

The text uses the term "football-shaped" or "shaped like a football" very loosely throughout Ch 8–12 to describe a scatter diagram that has linear association. Worse, the term "football-shaped" is sometimes used to imply homoscedasticity. Even worse, "football-shaped" is used to imply that data in thin vertical strips in the scatter diagram is normally distributed (see the box on p.197). Because of this vagueness and ambiguity, we will instead use the terms "shows linear association", "homoscedastic", and "data in thin vertical strips is normally distributed" for these various attributes that may or may not apply to 2-variable data. In particular, in the box on p.197, and also in item 7 of the Ch 11 summary on p.201, replace the first sentence with "Suppose that a scatter diagram shows linear association, is homoscedastic, and data in thin vertical strips is normally distributed."

Subsection 1.6.2 Examples

Here are some basic problem types about regression.
  1. (Ch 10) The scatter diagram for data called \(X,Y\) shows a linear association, with the following summary statistics.
    \begin{align*} \AVE(X)\amp =20.13 \amp \amp \amp \AVE(Y)\amp =15.86\\ \SD(X)\amp =3.65 \amp \amp \amp \SD(Y)\amp =4.90\\ \amp \amp r\amp=.82 \amp \amp \end{align*}
    1. Estimate the average \(Y\) value for data whose \(X\) value is near \(24.8\text{.}\)
    2. Estimate the average \(X\) value for data whose \(Y\) value is near \(21.0\text{.}\)
  2. (Ch 10) Suppose that data called \(V,W\) have a linear association with a correlation of \(r=.65\text{,}\) and that \(V,W\) are both approximately normally distributed. Estimate the average percentile rank for \(W\) data whose \(V\) value is close to the 80th percentile.
  3. (Ch 11) Continuing problem 1, suppose now that the \(Y\) data is normally distributed within each thin vertical strip, and that the scatter diagram is homoscedastic. For data whose \(X\) value is near \(24.8\text{,}\) estimate the percent of \(Y\) data in the range \(15.86\) and higher.
Here are the solutions, using the drawing method solution steps taught in the textbook (drawings: images/regression_sample_solutions.pdf).
  1. Both parts use a regression line.
    1. Use the facts that the regression line passes close to the vertical centers of thin vertical strips, passes through the point of averages \((\AVE(X),\AVE(Y))\text{,}\) and has slope equal to \(r\frac{\SD(Y)}{\SD(X)}\text{.}\) Let \((A,B)=(\AVE(X),\AVE(Y))\) be the point of averages and let \((P,Q)\) be the point on the regression line with \(P=24.8\text{,}\) so that \(Q\) is the \(Y\)-coordinate we are looking for. The run is \(P-A=4.67\text{,}\) the rise is \(\text{(slope)(run)}=.82\frac{4.90}{3.65}(4.67)\approx 5.14\text{,}\) and \(Q=\text{rise } + B \approx 21.0\text{.}\)
    2. We switch the horizontal and vertical roles of \(X\) and \(Y\) so that we can find the desired \(X\) value as the center of a thin vertical strip. Use the point of averages \((\AVE(Y),\AVE(X))\) and the slope \(r\frac{\SD(X)}{\SD(Y)}\approx .611\text{.}\) The run is \(21.0-15.86= 5.14\text{,}\) the rise is \(\text{(slope)(run)} =(.611)(5.14)\approx 3.14 \text{,}\) and the desired prediction is \(20.13+3.14= 23.27\text{.}\)
  2. First, find the \(z\) value for the 80th percentile of the normal data \(V\text{.}\) Let’s call this value \(A\text{.}\) There is \(20\%\) of the \(V\) data to the right of \(A\text{,}\) so there is \(100-2(20)=60\%\) of the \(V\) data between \(-A\) and \(+A\text{.}\) The normal table gives \(A\approx .84\text{,}\) so the horizontal run from \(\AVE(V)\) to the 80th percentile \(V\) value is \(.84\) standard units for \(V\text{.}\) The vertical rise along the regression line is \(r(.84)=0.65(.84)\approx .55\) standard units for \(W\text{.}\) Now convert the \(z\) value \(.55\) to a percentile. The normal table shows that the area between \(-.55\) and \(+.55\) is about \(42\%\text{,}\) so the area to the left of \(.55\) is about \(50+42/2\approx 71\%\text{.}\)
  3. This question is about the \(Y\) data in the thin vertical strip that corresponds to \(X=24.8\text{.}\) Let’s call this data \(U\text{.}\) From problem 1a, we have the estimate \(\AVE(U)=21.0\text{.}\) From the assumption of homoscedasticity, the rms error \(4.90\sqrt{1-.82^2}\approx 2.80\) for the regression line is a good estimate for \(\SD(U)\text{.}\) Using the fact that \(U\) is normally distributed, we are looking for the area under a normal curve that lies to the right of \(z=\frac{15.86-21.0}{2.80}\approx -1.83\text{.}\) The normal table shows that the area between \(-1.83\) and \(+1.83\) is about \(93\%\text{,}\) so the area to the right of \(z=-1.83\) is about \(50+93/2\approx 96.5\%\) or about \(97\%\text{.}\)
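
The arithmetic for the first two sample problems can be checked with a short Python sketch. The function names `regression_estimate` and `percentile_rank_estimate` are made up for this illustration and do not come from the textbook. The standard library's `NormalDist` stands in for the normal table, so results may differ slightly from table lookups.

```python
from statistics import NormalDist


def regression_estimate(ave_x, sd_x, ave_y, sd_y, r, p):
    """Estimate the average Y value for data whose X value is near p."""
    slope = r * sd_y / sd_x   # slope of the regression line for Y on X
    run = p - ave_x           # horizontal distance from the point of averages
    return ave_y + slope * run


def percentile_rank_estimate(r, percentile):
    """Estimate the percentile rank of W for data at a given percentile of V.

    Assumes both variables are approximately normally distributed.
    """
    std = NormalDist()                    # standard normal curve
    z_v = std.inv_cdf(percentile / 100)   # standard units for V
    z_w = r * z_v                         # rise along the regression line
    return 100 * std.cdf(z_w)


# Problem 1a: estimate Y for X near 24.8.
q = regression_estimate(20.13, 3.65, 15.86, 4.90, 0.82, 24.8)

# Problem 1b: swap the roles of X and Y, then estimate X for Y near 21.0.
x_est = regression_estimate(15.86, 4.90, 20.13, 3.65, 0.82, 21.0)

# Problem 2: V at the 80th percentile, r = .65.
rank = percentile_rank_estimate(0.65, 80)

print(f"{q:.2f} {x_est:.2f} {rank:.0f}")
```

Rounded, the three values agree with the worked answers: about \(21.0\text{,}\) \(23.27\text{,}\) and the 71st percentile.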

Subsection 1.6.3 Regression practice problems

Exercises

1.
(Ch 10) The scatter diagram for data called \(X\) and \(Y\) shows linear association. For each row in the table below, use the regression method to estimate the \(Y\) value for data whose \(X\) value is near the number in the column called \(P\text{.}\) The first row is complete, and shows the results of the sample problem 1a above. Fill in the remaining values in the column labeled "\(Y\text{ est.}\)".
\begin{align*} \AVE(X) \amp\spacer\amp \SD(X) \amp\spacer\amp \AVE(Y) \amp\spacer\amp \SD(Y) \amp\spacer\amp r \amp\spacer\amp P \amp\spacer\amp Y\text{ est.} \\ \rule{.7in}{.1ex} \amp \amp \rule{.6in}{.1ex} \amp \amp \rule{.7in}{.1ex} \amp \amp \rule{.6in}{.1ex} \amp \amp \rule{.5in}{.1ex} \amp \amp \rule{.5in}{.1ex} \amp \amp \rule{.5in}{.1ex} \\ 20.13 \amp\spacer\amp 3.65 \amp\spacer\amp 15.86 \amp\spacer\amp 4.90 \amp\spacer\amp .82 \amp\spacer\amp 24.8 \amp\spacer\amp 21.00\\ 20.13 \amp\spacer\amp 3.65 \amp\spacer\amp 15.86 \amp\spacer\amp 4.90 \amp\spacer\amp .82 \amp\spacer\amp 17.5 \amp\spacer\amp \\ 103.20 \amp\spacer\amp 4.71 \amp\spacer\amp 26.78 \amp\spacer\amp 5.78 \amp\spacer\amp -.53 \amp\spacer\amp 110.0 \amp\spacer\amp \\ 103.20 \amp\spacer\amp 4.71 \amp\spacer\amp 26.78 \amp\spacer\amp 5.78 \amp\spacer\amp -.53 \amp\spacer\amp 95.0 \amp\spacer\amp \end{align*}
Answer.
\begin{align*} \AVE(X) \amp\spacer\amp \SD(X) \amp\spacer\amp \AVE(Y) \amp\spacer\amp \SD(Y) \amp\spacer\amp r \amp\spacer\amp P \amp\spacer\amp Y\text{ est.}\\ \rule{.7in}{.1ex} \amp \amp \rule{.6in}{.1ex} \amp \amp \rule{.7in}{.1ex} \amp \amp \rule{.6in}{.1ex} \amp \amp \rule{.5in}{.1ex} \amp \amp \rule{.5in}{.1ex} \amp \amp \rule{.5in}{.1ex} \\ 20.13 \amp\spacer\amp 3.65 \amp\spacer\amp 15.86 \amp\spacer\amp 4.90 \amp\spacer\amp .82 \amp\spacer\amp 24.8 \amp\spacer\amp 21.0\\ 20.13 \amp\spacer\amp 3.65 \amp\spacer\amp 15.86 \amp\spacer\amp 4.90 \amp\spacer\amp .82 \amp\spacer\amp 17.5 \amp\spacer\amp 12.97\\ 103.2 \amp\spacer\amp 4.71 \amp\spacer\amp 26.78 \amp\spacer\amp 5.78 \amp\spacer\amp -.53 \amp\spacer\amp 110.0 \amp\spacer\amp 22.36\\ 103.2 \amp\spacer\amp 4.71 \amp\spacer\amp 26.78 \amp\spacer\amp 5.78 \amp\spacer\amp -.53 \amp\spacer\amp 95.0 \amp\spacer\amp 32.11 \end{align*}
2.
(Ch 11) Continuing the previous problem, suppose now that the \(Y\) data is normally distributed within each thin vertical strip, and that the scatter diagram is homoscedastic. For data whose \(X\) value is near the value in the column \(P\text{,}\) estimate the percent of \(Y\) data in the range \(\AVE(Y)\) and higher. The first row is complete, and shows the results of the sample problem 3 above. Fill in the remaining values in the column labeled "\(\text{ est. percent}\)".
\begin{align*} \AVE(X) \amp\spacer\amp \SD(X) \amp\spacer\amp \AVE(Y) \amp\spacer\amp \SD(Y) \amp\spacer\amp r \amp\spacer\amp P \amp\spacer\amp \text{ est. percent}\\ \rule{.7in}{.1ex} \amp \amp \rule{.6in}{.1ex} \amp \amp \rule{.7in}{.1ex} \amp \amp \rule{.6in}{.1ex} \amp \amp \rule{.5in}{.1ex} \amp \amp \rule{.5in}{.1ex} \amp \amp \rule{.9in}{.1ex} \\ 20.13 \amp\spacer\amp 3.65 \amp\spacer\amp 15.86 \amp\spacer\amp 4.90 \amp\spacer\amp .82 \amp\spacer\amp 24.8 \amp\spacer\amp 97\%\\ 20.13 \amp\spacer\amp 3.65 \amp\spacer\amp 15.86 \amp\spacer\amp 4.90 \amp\spacer\amp .82 \amp\spacer\amp 17.5 \amp\spacer\amp \\ 103.2 \amp\spacer\amp 4.71 \amp\spacer\amp 26.78 \amp\spacer\amp 5.78 \amp\spacer\amp -.53 \amp\spacer\amp 110.0 \amp\spacer\amp \\ 103.2 \amp\spacer\amp 4.71 \amp\spacer\amp 26.78 \amp\spacer\amp 5.78 \amp\spacer\amp -.53 \amp\spacer\amp 95.0 \amp\spacer\amp \end{align*}
Answer.
\begin{align*} \AVE(X) \amp\spacer\amp \SD(X) \amp\spacer\amp \AVE(Y) \amp\spacer\amp \SD(Y) \amp\spacer\amp r \amp\spacer\amp P \amp\spacer\amp \text{ est. percent}\\ \rule{.7in}{.1ex} \amp \amp \rule{.6in}{.1ex} \amp \amp \rule{.7in}{.1ex} \amp \amp \rule{.6in}{.1ex} \amp \amp \rule{.5in}{.1ex} \amp \amp \rule{.5in}{.1ex} \amp \amp \rule{.9in}{.1ex} \\ 20.13 \amp\spacer\amp 3.65 \amp\spacer\amp 15.86 \amp\spacer\amp 4.90 \amp\spacer\amp .82 \amp\spacer\amp 24.8 \amp\spacer\amp 97\%\\ 20.13 \amp\spacer\amp 3.65 \amp\spacer\amp 15.86 \amp\spacer\amp 4.90 \amp\spacer\amp .82 \amp\spacer\amp 17.5 \amp\spacer\amp 15\%\\ 103.2 \amp\spacer\amp 4.71 \amp\spacer\amp 26.78 \amp\spacer\amp 5.78 \amp\spacer\amp -.53 \amp\spacer\amp 110.0 \amp\spacer\amp 18\%\\ 103.2 \amp\spacer\amp 4.71 \amp\spacer\amp 26.78 \amp\spacer\amp 5.78 \amp\spacer\amp -.53 \amp\spacer\amp 95.0 \amp\spacer\amp 86\% \end{align*}
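
The answer tables for both exercises can be checked with a few lines of Python. This is a minimal sketch, not the hand calculation taught in the textbook: `strip_summary` is a hypothetical helper name, and `NormalDist` from the Python standard library replaces the normal table. It assumes, as exercise 2 does, that the scatter diagram is homoscedastic and that data in thin vertical strips is normally distributed.

```python
from math import sqrt
from statistics import NormalDist

# Each row: AVE(X), SD(X), AVE(Y), SD(Y), r, P (copied from the exercise tables).
ROWS = [
    (20.13, 3.65, 15.86, 4.90, 0.82, 24.8),
    (20.13, 3.65, 15.86, 4.90, 0.82, 17.5),
    (103.2, 4.71, 26.78, 5.78, -0.53, 110.0),
    (103.2, 4.71, 26.78, 5.78, -0.53, 95.0),
]


def strip_summary(ave_x, sd_x, ave_y, sd_y, r, p):
    """Return (Y estimate, percent of strip data at AVE(Y) or higher)."""
    # Center of the thin vertical strip at X = p: the regression estimate.
    center = ave_y + r * sd_y / sd_x * (p - ave_x)
    # Spread of the strip: the rms error of the regression line.
    spread = sd_y * sqrt(1 - r**2)
    # Area under the strip's normal curve to the right of AVE(Y).
    percent = 100 * (1 - NormalDist(center, spread).cdf(ave_y))
    return center, percent


results = [strip_summary(*row) for row in ROWS]
for center, percent in results:
    print(f"{center:.2f}  {percent:.0f}%")
```

Because the sketch does not round intermediate values, an estimate can differ from the tabulated answer in the last digit. The second row, for example, comes out as about \(12.96\) rather than \(12.97\text{.}\)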