Descriptive Statistics

Section 2.1 Descriptive Statistics

This section is coordinated with Freedman, Statistics [1], Chapters 3–6.

Subsection 2.1.1 The Histogram

A histogram for a list of numbers \(X\) is a drawing made up of rectangles. The base of each rectangle is an interval, called a class interval, on a horizontal axis whose units are the same as those of \(X\) (that is, if the numbers in \(X\) are dollars or years, then the units of the horizontal axis are dollars or years, etc.). The area of a rectangle whose base is the class interval from \(A\) to \(B\) is the fraction of the numbers in the list \(X\) that lie between \(A\) and \(B\text{.}\) For example, if \(40\%\) of the numbers in the list \(X\) are in the range 10 to 30, then a rectangle in a histogram for \(X\) whose base is the class interval from 10 to 30 has area \(0.4 = 40\%\text{,}\) and therefore has a height of \(0.4/(30-10) = 0.02\text{,}\) that is, 2 percent per unit of \(X\text{.}\) The units of the vertical axis in a histogram, percent per unit of \(X\text{,}\) are called density units. Rectangles in a histogram must be non-overlapping, and the total area of all the rectangles must be \(100\%\text{.}\) [See Example 1 on p.40 of [1] for an illustration.]
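The height calculation above can be sketched in a few lines of Python. The function name `histogram_heights`, the sample list, and the rule for which endpoint belongs to which class interval are assumptions made for illustration; they do not come from the text.

```python
def histogram_heights(data, edges):
    """Return the height (fraction per unit of X) of each rectangle.

    Assumption: each class interval includes its left endpoint and excludes
    its right endpoint, except the last one, which includes both.
    """
    n = len(data)
    heights = []
    for i in range(len(edges) - 1):
        left, right = edges[i], edges[i + 1]
        is_last = (i == len(edges) - 2)
        count = sum(
            1 for x in data
            if left <= x < right or (is_last and x == right)
        )
        area = count / n                         # fraction of the list in this interval
        heights.append(area / (right - left))    # area divided by base width
    return heights


# Hypothetical list: 4 of the 10 numbers (40%) lie in the range 10 to 30.
data = [12, 15, 22, 28, 35, 40, 45, 50, 55, 60]
edges = [10, 30, 70]
heights = histogram_heights(data, edges)

# Total area of all rectangles, which should be 1 (that is, 100%).
total_area = sum(h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights))
```

With this list, the rectangle over the class interval from 10 to 30 has area 0.4 and base 20, matching the worked example in the paragraph above.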

Subsection 2.1.2 Average, Median, Percentiles

The average (also called the mean) of a list of numbers is the sum of all the numbers divided by the number of entries in the list. We write \(\AVE(X)\) to denote the average of a list \(X\text{.}\) The average of the list \(X\) turns out to be the balance point of a histogram for \(X\text{,}\) in the following sense. Imagine the rectangles of the histogram are made of uniformly thick clay. Glue the rectangles together along the edges where they meet. This solid clay histogram balances on a fulcrum placed at the average point on the base.

Technical assumption: the numbers that fall within each rectangle base interval must be close to evenly distributed within that interval for this balance point interpretation to be valid.

[See Figure 6 on p.63 of [1] for an illustration.]
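As a rough numerical check of the balance point interpretation, the sketch below computes the balance point of a histogram as the sum of each rectangle's area times the midpoint of its base, and compares it with the average of the list. The names `average` and `balance_point` and the sample list are hypothetical; the list was chosen so that the entries in each class interval are evenly spread, as the technical assumption requires.

```python
def average(values):
    """Sum of the entries divided by the number of entries."""
    return sum(values) / len(values)


def balance_point(edges, areas):
    """Balance point of a histogram made of uniformly thick rectangles.

    Each rectangle balances at the midpoint of its base, so the whole
    histogram balances at the area-weighted average of the midpoints
    (the total area is assumed to be 1).
    """
    total = 0.0
    for i, area in enumerate(areas):
        midpoint = (edges[i] + edges[i + 1]) / 2
        total += area * midpoint
    return total


# Hypothetical list: four entries evenly spread in 0 to 4, two in 4 to 8.
data = [0.5, 1.5, 2.5, 3.5, 5.0, 7.0]
edges = [0, 4, 8]
areas = [4 / 6, 2 / 6]   # fraction of the list in each class interval

ave = average(data)
balance = balance_point(edges, areas)
```

Here `ave` and `balance` agree up to floating-point rounding. If the entries were bunched toward one end of a class interval, the two values would generally differ.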

Now suppose we have a histogram for a list \(X\) that has no rectangles of height zero. Suppose that \(L\) is the location, along the horizontal axis, of the left edge of the left-most rectangle in the histogram, and \(R\) is the location of the right edge of the right-most rectangle. A number \(A\) in the range

\begin{equation*} L \leq A \leq R \end{equation*}

is said to have \(p\)-th percentile rank if the area of the histogram on the interval from \(L\) to \(A\) is \(p\) percent. The median of the list \(X\) is the number with 50th percentile rank. The interquartile range of a list \(X\) is the length of the interval from \(A\) to \(B\text{,}\) where \(A\) has 25th percentile rank and \(B\) has 75th percentile rank.

Note: these definitions of percentile and median depend on the choice of histogram.

[See Example 10 on p. 90 of [1] for an illustration.]
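The percentile definition can be turned into a small calculation: accumulate rectangle areas from the left edge \(L\) until the target area is reached, then move proportionally into the rectangle where the target falls. This is a minimal sketch; the function name `percentile_point` and the two-rectangle histogram are assumptions for illustration only.

```python
def percentile_point(edges, areas, p):
    """Return the number A such that the histogram area from L to A is p percent.

    edges: locations of the rectangle edges, left to right.
    areas: area (as a fraction) of each rectangle; all are assumed nonzero.
    """
    target = p / 100.0
    cumulative = 0.0
    for i, area in enumerate(areas):
        if cumulative + area >= target:
            left, right = edges[i], edges[i + 1]
            # Area grows linearly with distance inside a single rectangle.
            return left + (target - cumulative) / area * (right - left)
        cumulative += area
    return edges[-1]   # guards against rounding when p is 100


# Hypothetical histogram: 40% of the list in 10 to 30, 60% in 30 to 70.
edges = [10, 30, 70]
areas = [0.4, 0.6]

q1 = percentile_point(edges, areas, 25)
median = percentile_point(edges, areas, 50)
q3 = percentile_point(edges, areas, 75)
interquartile_range = q3 - q1
```

For this histogram the number with 25th percentile rank is 22.5, since the area from 10 to 22.5 is \(12.5 \times 0.02 = 0.25\text{.}\) A different choice of class intervals for the same list would generally give different values.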

Subsection 2.1.3 Operations on Lists

Given lists \(X,Y\) with the same number of entries, and given constants \(a,b\text{,}\) we can form new lists \(X^2\text{,}\) \(aX+b\text{,}\) and \(XY\) by applying the indicated operations entry by entry. For example, if \(X = -2,1,3\) and \(Y = 1,4,0\text{,}\) then we have the following lists.

\begin{align*} X \amp = -2,1,3\\ Y \amp = 1,4,0\\ X^2 \amp = (-2)^2,1^2,3^2 = 4,1,9\\ 2X - 3 \amp = 2(-2)-3, 2(1)-3, 2(3)-3 = -7,-1,3\\ XY \amp = (-2)(1),(1)(4),(3)(0) = -2,4,0 \end{align*}

The root mean square of a list \(X\text{,}\) denoted \(\rms(X)\), is the square root of the average of the list \(X^2\text{.}\) Here is \(\rms(X)\) in symbols.

\begin{equation*} \rms(X) = \sqrt{\AVE(X^2)} \end{equation*}
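The list operations and the root mean square can be reproduced with ordinary Python lists. This is a minimal sketch using the lists \(X = -2,1,3\) and \(Y = 1,4,0\) from the example; the function names `average` and `rms` are chosen here for illustration.

```python
import math


def average(values):
    """Sum of the entries divided by the number of entries."""
    return sum(values) / len(values)


def rms(values):
    """Square root of the average of the list of squares."""
    return math.sqrt(average([v * v for v in values]))


X = [-2, 1, 3]
Y = [1, 4, 0]
a, b = 2, -3

X_squared = [x * x for x in X]          # the list X^2
aX_plus_b = [a * x + b for x in X]      # the list 2X - 3
XY = [x * y for x, y in zip(X, Y)]      # entry-by-entry product

rms_X = rms(X)                          # sqrt((4 + 1 + 9) / 3)
```

Here \(\AVE(X^2) = 14/3\text{,}\) so \(\rms(X) = \sqrt{14/3}\text{,}\) which is roughly 2.16.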

Subsection 2.1.4 Standard Deviation and Standard Units

The standard deviation of a list \(X\text{,}\) denoted \(\SD(X)\), is the rms of the list \(X-\AVE(X)\text{.}\) The list \(X-\AVE(X)\) is called the list of deviations from average. Here is \(\SD(X)\) in symbols.

\begin{equation*} \SD(X) = \rms(X-\AVE(X)) \end{equation*}

The standard deviation is interpreted as the (absolute) size of a typical error, that is, the typical distance from the average, in a game of chance where you draw entries from the list \(X\) at random.
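A short sketch of the standard deviation computation, following the definition step by step: form the list of deviations from average, then take its rms. The list `X` below is a hypothetical example chosen so that the arithmetic is easy to follow by hand.

```python
import math


def average(values):
    """Sum of the entries divided by the number of entries."""
    return sum(values) / len(values)


def rms(values):
    """Square root of the average of the list of squares."""
    return math.sqrt(average([v * v for v in values]))


def sd(values):
    """Standard deviation: the rms of the deviations from average."""
    ave = average(values)
    deviations = [v - ave for v in values]
    return rms(deviations)


X = [1, 3, 4, 5, 7]                           # hypothetical list, AVE(X) = 4
deviations = [x - average(X) for x in X]      # -3, -1, 0, 1, 3
sd_X = sd(X)                                  # sqrt((9 + 1 + 0 + 1 + 9) / 5) = 2
```

Note that this definition divides by the number of entries in the list, not by one less than that number, so it corresponds to what many software packages call the population standard deviation.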

The list \(X\) in standard units is the list \(\frac{X-\AVE(X)}{\SD(X)}\text{.}\)
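Converting a list to standard units is a matter of subtracting the average from each entry and dividing by the standard deviation. The sketch below uses a hypothetical list and assumes \(\SD(X)\) is not zero; the function names are illustrative.

```python
import math


def average(values):
    """Sum of the entries divided by the number of entries."""
    return sum(values) / len(values)


def sd(values):
    """Standard deviation: the rms of the deviations from average."""
    ave = average(values)
    return math.sqrt(average([(v - ave) ** 2 for v in values]))


def standard_units(values):
    """Return the list (X - AVE(X)) / SD(X); assumes SD(X) is not zero."""
    ave = average(values)
    spread = sd(values)
    return [(v - ave) / spread for v in values]


X = [1, 3, 4, 5, 7]            # hypothetical list with AVE(X) = 4, SD(X) = 2
Z = standard_units(X)          # -1.5, -0.5, 0.0, 0.5, 1.5
```

A list in standard units always has average 0 and standard deviation 1, which gives a quick way to check the conversion.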

Subsection 2.1.5 The Standard Normal Distribution

A list \(X\) is said to follow a normal distribution if, for every entry \(A\) in the list \(X\text{,}\) the percent of entries of the list \((X-\AVE(X))/\SD(X)\) (that is, \(X\) in standard units) that lie between the pair of numbers

\begin{equation*} -z=-\frac{A-\AVE(X)}{\SD(X)},\;\;+z=+\frac{A-\AVE(X)}{\SD(X)} \end{equation*}

is close to the percent given for \(z\) in the standard normal table in the Appendix of the textbook.
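One way to explore this definition numerically is to compare, for a few values of \(z\text{,}\) the percent of entries of a list (in standard units) between \(-z\) and \(+z\) with the corresponding area under the standard normal curve. The sketch below makes several assumptions that are not part of the text: it computes the table value from the error function, `math.erf`, instead of reading it from a printed table; it checks only a handful of \(z\) values, not one for every entry of the list; and it uses a randomly generated list for illustration.

```python
import math
import random


def average(values):
    return sum(values) / len(values)


def sd(values):
    ave = average(values)
    return math.sqrt(average([(v - ave) ** 2 for v in values]))


def percent_within(values, z):
    """Percent of entries, in standard units, that lie between -z and +z."""
    ave, spread = average(values), sd(values)
    units = [(v - ave) / spread for v in values]
    inside = sum(1 for u in units if -z <= u <= z)
    return 100.0 * inside / len(units)


def normal_table_percent(z):
    """Area under the standard normal curve between -z and +z, in percent."""
    return 100.0 * math.erf(z / math.sqrt(2))


# Hypothetical list: 10000 draws from a bell-shaped random number generator.
random.seed(0)
X = [random.gauss(50, 10) for _ in range(10000)]

for z in (0.5, 1.0, 2.0):
    observed = percent_within(X, z)
    expected = normal_table_percent(z)
    print(f"z = {z}: observed {observed:.1f}%, normal curve {expected:.1f}%")
```

For \(z = 1\) the normal curve value is about 68 percent, and for \(z = 2\) about 95 percent. How close is "close enough" is a judgment call that this sketch does not settle.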
