
Section 2.2 Regression

This section is coordinated with Freedman, Statistics [1], Chapters 8–12.

Subsection 2.2.1 RMS Error of Fit

Given two lists of numbers \(X,Y\) with the same number of entries, and given a linear function \(y=L(x)=ax+b\) (\(a\) and \(b\) are constants), the rms error for the line \(L\) is the rms of the list \(Y-L(X)\text{.}\) (For example, if \(A\) is an entry in the list \(X\) and \(B\) is the corresponding entry in the list \(Y\text{,}\) then the corresponding entry in the list \(Y-L(X)\) is \(B-(aA+b)\text{.}\)) We interpret the rms error for a line as a measure of how well \(L\) performs as a tool for predicting the \(Y\) value that corresponds to a given \(X\) value. Given an \(X\) value \(A\) whose actual corresponding \(Y\) value is \(B\text{,}\) the prediction \(L(A)\) has a "prediction error" equal to \(B-L(A)\text{.}\) The line \(L\) is a good prediction tool if the rms error for \(L\) is low, and a bad prediction tool if the rms error for \(L\) is high. We interpret the rms error for a line \(L\) as the size of the typical error that we would observe if we used \(L\) to predict an unknown \(Y\) value for a given \(X\) value.
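To make the definition concrete, here is a minimal sketch in Python. The lists X, Y and the line \(y=2x\) are hypothetical example data, and ave, rms, and rms_error are helper names introduced here for illustration.

    def ave(nums):
        """Average (mean) of a list of numbers."""
        return sum(nums) / len(nums)

    def rms(nums):
        """Root-mean-square of a list: square root of the average square."""
        return ave([v * v for v in nums]) ** 0.5

    def rms_error(X, Y, a, b):
        """rms of the list Y - L(X), where L(x) = a*x + b."""
        return rms([y - (a * x + b) for x, y in zip(X, Y)])

    X = [1, 2, 3, 4, 5]
    Y = [2.1, 3.9, 6.2, 7.8, 10.1]
    print(rms_error(X, Y, 2, 0))  # size of the typical prediction error for y = 2x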

Subsection 2.2.2 The Regression Line

Among all possible lines, there is a single line with the least possible rms error, called the regression line for \(Y\) on \(X\). This means that the regression line is the best possible line to use as a tool for predicting the unknown \(Y\) value corresponding to a given \(X\) value. The regression line passes through the point \((\AVE(X),\AVE(Y))\text{,}\) called the point of averages for the lists \(X,Y\text{,}\) and has slope equal to \(r\cdot \SD(Y)/\SD(X)\text{,}\) where \(r\) is the number
\begin{equation*} r = \AVE\left(\left(\frac{X-\AVE(X)}{\SD(X)}\right) \left(\frac{Y-\AVE(Y)}{\SD(Y)}\right)\right) \end{equation*}
called the correlation coefficient for the pair of lists \(X,Y\text{.}\) The rms error for the regression line is
\begin{equation*} \text{rms error for regression line } = \SD(Y) \cdot \sqrt{1-r^2}. \end{equation*}
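The following sketch computes the regression line and its rms error directly from these formulas; the data lists are hypothetical, and ave, sd, and corr are helper names introduced here for convenience.

    def ave(nums):
        return sum(nums) / len(nums)

    def sd(nums):
        """rms deviation from the average (the SD of Freedman's text)."""
        m = ave(nums)
        return ave([(v - m) ** 2 for v in nums]) ** 0.5

    def corr(X, Y):
        """Correlation coefficient: average product of values in standard units."""
        mx, my, sx, sy = ave(X), ave(Y), sd(X), sd(Y)
        return ave([((x - mx) / sx) * ((y - my) / sy) for x, y in zip(X, Y)])

    X = [1, 2, 3, 4, 5]
    Y = [2.1, 3.9, 6.2, 7.8, 10.1]

    r = corr(X, Y)
    slope = r * sd(Y) / sd(X)
    intercept = ave(Y) - slope * ave(X)   # the line passes through the point of averages
    rms_err = sd(Y) * (1 - r * r) ** 0.5

    print(slope, intercept, rms_err)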

Subsection 2.2.3 The Scatter Diagram

The list of pairs \((X,Y)\) formed from corresponding entries of the lists \(X,Y\text{,}\) plotted as points in the plane, is called the scatter diagram of the lists \(X,Y\text{.}\) The scatter diagram is said to show linear association if the points are scattered roughly symmetrically about a linear axis. The size of \(|r|\) determines the tightness or looseness of scattering of the points \((X,Y)\) about the regression line. For \(|r|\) near 1, points in the scatter diagram are tightly clustered about the regression line. As \(|r|\) approaches 0, the scatter diagram is more and more loosely clustered about the regression line.
A thin vertical strip in the scatter diagram is the set of points \((X,Y)\) whose \(X\)-coordinate lies in the range \(A\pm e\text{,}\) where \(A\) is a number within the range of entries of the list \(X\) and \(e\) is a small increment. If the sets of \(Y\) values in thin vertical strips have similar SDs as \(A\) varies, the scatter diagram is said to be homoscedastic. Otherwise, the scatter diagram is said to be heteroscedastic.
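As a rough numeric check of homoscedasticity, one can compare the SD of the \(Y\) values inside thin vertical strips at several values of \(A\text{.}\) The sketch below does this for simulated data; the strip half-width \(e=0.5\text{,}\) the centers \(A=2,5,8\text{,}\) and the helper name strip_sd are hypothetical choices made here for illustration.

    import random

    def ave(nums):
        return sum(nums) / len(nums)

    def sd(nums):
        m = ave(nums)
        return ave([(v - m) ** 2 for v in nums]) ** 0.5

    def strip_sd(X, Y, A, e):
        """SD of the Y values whose X value lies in the range A +/- e."""
        in_strip = [y for x, y in zip(X, Y) if A - e <= x <= A + e]
        return sd(in_strip) if len(in_strip) > 1 else None

    random.seed(0)
    X = [random.uniform(0, 10) for _ in range(1000)]
    Y = [2 * x + random.gauss(0, 1) for x in X]  # constant vertical spread by construction

    for A in (2, 5, 8):
        print(A, strip_sd(X, Y, A, e=0.5))  # similar SDs across strips suggest homoscedasticity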

Subsection 2.2.4 More on Prediction and Error

Let’s write \(\REGR(A)\) for the \(y\)-coordinate of a point on the regression line for the \(x\)-value \(A\text{.}\) (The notation \(\REGR\) is not standard; we use it here merely for convenience.) Here’s a formula:
\begin{equation*} \REGR(A) = r \cdot \left(\frac{A-\AVE(X)}{\SD(X)}\right) \cdot \SD(Y) + \AVE(Y) \end{equation*}
A residual is another name for a vertical error \(Y-\REGR(X)\) for a data point \((X,Y)\text{.}\) The graph of residuals is the set of points \((X,Y-\REGR(X))\text{.}\) You can see linear association and homoscedasticity (or the lack of either one) very plainly in the graph of residuals: data lists \(X,Y\) have a linear association if the graph of residuals is roughly symmetric across the \(X\)-axis; and the \(X,Y\) data are homoscedastic if the graph of residuals has fairly even densities and vertical ranges in thin vertical strips across the horizontal range of \(X\text{.}\)
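The formula for \(\REGR\) translates directly into code. In this sketch the data lists are hypothetical and make_regr is a helper name introduced here; it returns the prediction function, which is then used to compute the residuals.

    def ave(nums):
        return sum(nums) / len(nums)

    def sd(nums):
        m = ave(nums)
        return ave([(v - m) ** 2 for v in nums]) ** 0.5

    def corr(X, Y):
        mx, my, sx, sy = ave(X), ave(Y), sd(X), sd(Y)
        return ave([((x - mx) / sx) * ((y - my) / sy) for x, y in zip(X, Y)])

    def make_regr(X, Y):
        """Return the function REGR for the regression line of Y on X."""
        r, mx, my, sx, sy = corr(X, Y), ave(X), ave(Y), sd(X), sd(Y)
        return lambda a: r * ((a - mx) / sx) * sy + my

    X = [1, 2, 3, 4, 5]
    Y = [2.1, 3.9, 6.2, 7.8, 10.1]
    regr = make_regr(X, Y)
    residuals = [y - regr(x) for x, y in zip(X, Y)]  # vertical errors Y - REGR(X)
    print(regr(3), residuals)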
Table 2.2.1 shows a comparison of three lines and their rms errors. All three lines pass through the point of averages. The SD line is defined to have positive slope if \(r\gt 0\) and negative slope if \(r\lt 0\text{.}\)
Table 2.2.1. Comparison of Lines and Prediction Errors
  line             | slope                       | rms error
  regression line  | \(r\cdot \SD(Y)/\SD(X)\)    | \(\SD(Y) \sqrt{1-r^2}\)
  SD line          | \(\pm \SD(Y)/\SD(X)\)       | \(\SD(Y) \sqrt{2(1-|r|)}\)
  average line     | \(0\)                       | \(\SD(Y)\)
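The closed-form rms errors in Table 2.2.1 can be checked numerically: for each line through the point of averages, compute the rms error directly from the data and compare it with the formula. The data lists in this sketch are hypothetical.

    def ave(nums):
        return sum(nums) / len(nums)

    def sd(nums):
        m = ave(nums)
        return ave([(v - m) ** 2 for v in nums]) ** 0.5

    def corr(X, Y):
        mx, my, sx, sy = ave(X), ave(Y), sd(X), sd(Y)
        return ave([((x - mx) / sx) * ((y - my) / sy) for x, y in zip(X, Y)])

    def rms_error(X, Y, slope):
        """rms error for the line through the point of averages with the given slope."""
        b = ave(Y) - slope * ave(X)
        return ave([(y - (slope * x + b)) ** 2 for x, y in zip(X, Y)]) ** 0.5

    X = [1, 2, 3, 4, 5]
    Y = [2.1, 3.9, 6.2, 7.8, 10.1]
    r, sx, sy = corr(X, Y), sd(X), sd(Y)
    sign = 1 if r > 0 else -1  # the SD line takes the sign of r

    print(rms_error(X, Y, r * sy / sx), sy * (1 - r * r) ** 0.5)           # regression line
    print(rms_error(X, Y, sign * sy / sx), sy * (2 * (1 - abs(r))) ** 0.5) # SD line
    print(rms_error(X, Y, 0), sy)                                          # average line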

Subsection 2.2.5 Facts about the correlation coefficient \(r\)

  • The value of \(r\) is in the range \(-1\) to \(1\text{.}\) The value of \(|r|\) is 1 precisely when all the points \((X,Y)\) on the scatter diagram lie on a straight line.
  • The correlation coefficient is unitless. The value of \(r\) does not change if you rescale \(X\) or \(Y\text{.}\)
  • It is tempting to summarize data using averages, but this has a misleading effect on correlation. Suppose that the points in a scatter diagram are subdivided into subsets, and that each subset is replaced by a single point
    \begin{equation*} \text{(average of the } X \text{ values in the subset, average of the } Y \text{ values in the subset)}. \end{equation*}
    The correlation for this smaller data set of average points is usually higher (in absolute value) than the correlation for the original scatter diagram. This phenomenon is called ecological correlation. (A numeric illustration follows this list.)
  • A strong correlation provides no evidence of a causal link between the variables \(X\) and \(Y\text{.}\)
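Here is a small simulation of ecological correlation, under assumed data (a noisy linear relation) and an assumed grouping (ten subsets of thirty consecutive points in \(X\)-order); the correlation of the averaged points typically comes out noticeably higher than the raw correlation, because averaging smooths away the vertical scatter within each subset.

    import random

    def ave(nums):
        return sum(nums) / len(nums)

    def sd(nums):
        m = ave(nums)
        return ave([(v - m) ** 2 for v in nums]) ** 0.5

    def corr(X, Y):
        mx, my, sx, sy = ave(X), ave(Y), sd(X), sd(Y)
        return ave([((x - mx) / sx) * ((y - my) / sy) for x, y in zip(X, Y)])

    random.seed(0)
    X = sorted(random.uniform(0, 10) for _ in range(300))
    Y = [2 * x + random.gauss(0, 4) for x in X]
    print("r for all points:    ", corr(X, Y))

    # Replace each run of 30 consecutive points by its point of averages.
    GX = [ave(X[i:i + 30]) for i in range(0, 300, 30)]
    GY = [ave(Y[i:i + 30]) for i in range(0, 300, 30)]
    print("r for average points:", corr(GX, GY))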

Subsection 2.2.6 The Regression Effect and the Regression Fallacy

Start with an \(X\)-value \(A\text{,}\) somewhere above or below the value \(\AVE(X)\text{.}\) Use the regression line for \(Y\) on \(X\) to predict the average \(Y\) value for data points whose \(X\) value is near \(A\text{;}\) call this prediction \(B\text{.}\) Now use the regression line for \(X\) on \(Y\) to predict the average \(X\) value for data points whose \(Y\) value is near \(B\text{;}\) call this prediction \(C\text{.}\) It will always turn out that \(C\) is between \(\AVE(X)\) and \(A\text{:}\) in fact, \(C=\AVE(X)+r^2\left(A-\AVE(X)\right)\text{,}\) and for real data the value of \(|r|\) is less than 1. This is called the regression effect. Any attempt to explain why \(C\) is between \(\AVE(X)\) and \(A\) by any reason other than the simple fact that \(|r|\lt 1\) is called a regression fallacy.
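A short numeric sketch of the regression effect, on hypothetical data: predict \(B\) from \(A\) with the line of \(Y\) on \(X\text{,}\) then predict \(C\) from \(B\) with the line of \(X\) on \(Y\text{,}\) and observe that \(C\) lands between \(\AVE(X)\) and \(A\text{.}\)

    def ave(nums):
        return sum(nums) / len(nums)

    def sd(nums):
        m = ave(nums)
        return ave([(v - m) ** 2 for v in nums]) ** 0.5

    def corr(X, Y):
        mx, my, sx, sy = ave(X), ave(Y), sd(X), sd(Y)
        return ave([((x - mx) / sx) * ((y - my) / sy) for x, y in zip(X, Y)])

    X = [1, 2, 3, 4, 5, 6]
    Y = [2.0, 4.1, 5.2, 8.3, 9.0, 12.5]
    r, mx, my, sx, sy = corr(X, Y), ave(X), ave(Y), sd(X), sd(Y)

    A = 6.0
    B = my + r * sy * (A - mx) / sx   # regression line of Y on X
    C = mx + r * sx * (B - my) / sy   # regression line of X on Y
    print(A, C, mx)  # C = mx + r**2 * (A - mx), so C sits between AVE(X) and A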