Histograms

A useful way of studying numerical data is to plot histograms.  A histogram is a bar chart that plots the number of occurrences that fall into each of several bins or categories.  Here is a simple example:  The following table lists fifty numbers randomly chosen between 0 and 1.
 
0.219134 0.987375 0.220627 0.262194 0.439981
0.414747 0.137787 0.747472 0.17967 0.544239
0.899229 0.046705 0.962411 0.123689 0.304441
0.411475 0.224437 0.935955 0.277133 0.432346
0.072928 0.698489 0.573209 0.557312 0.843534
0.026029 0.638002 0.324959 0.636505 0.776848
0.114812 0.725069 0.040069 0.823524 0.323655
0.397221 0.312105 0.81174 0.222747 0.38988
0.871722 0.313783 0.396813 0.560247 0.337068
0.898394 0.995081 0.764259 0.314586 0.125965
 
The following plot shows how many of these numbers fall between 0 and 0.1, between 0.1 and 0.2, between 0.2 and 0.3, etc.  This is a typical histogram.

How this can be done using Excel:  We assume the numbers for which we wish to draw a histogram occupy some range of cells in the spreadsheet.  In a convenient range of cells put the category numbers 0.1, 0.2, 0.3, etc up to 0.9 (or whatever categories you want; in general, put in the top limit of each category except the last category).  Suppose your block of data is in cells A1:E10 (i.e., the first 10 rows and 5 columns).  Suppose your category numbers are in cells G1:G9.  Nearby select (highlight) a range of cells of the same size but one more cell (say H1:H10).  Then type in the function =FREQUENCY(A1:E10,G1:G9) and then press control-shift-enter (hold control and shift down while pressing enter).  Into the range of cells you've highlighted will appear the numbers for the histogram (counts giving how many numbers fit into each category).  (See below.)  The last number will give how many numbers are bigger than 0.9.  Now you can use the Chart Wizard to plot a bar chart for these numbers -- suitably labled, that will be the histogram.  (Make the horizontal axis be the category numbers 0.1, 0.2, 0.3 etc.)

One interesting point:  One might try practicing drawing a histogram by first creating a batch of instant random numbers.  Random numbers can be created with the function =RAND().  (This will produce numbers between 0.0 and 1.0.)  Simply enter that formula into a cell, and copy it to a range of cells to produce lots of random numbers.  (If you instead want, say,  whole numbers between 0 and 100 instead of fractions, try =ROUND(100*RAND(),0).)  But something will happen that may be annoying:  Everytime you change something in the spreadsheet, the entire spreadsheet may be recalculated.  This will cause all your random numbers to instantly change.  If you don't want all your random numbers to keep changing, copy all the cells containing them (that is, select or highlight them, and then press control-c, or select copy on the Edit menu), and then paste them to another range of cells of the same size -- but don't do an ordinary paste.  Instead, do a Paste Special (in the Edit menu); you'll get a little dialog asking you what you would like to paste.  Select paste value.  This will only paste the number in the cell and not the formula.  So you'll end up with a batch of numbers that do not keep changing all the time.

Example:  We can use Excel to create a histogram for the outcomes for the "unfair die".  We can also produce a relative frequency histogram, which shows what percent of the occurrences fall into each category.  This can be done by dividing each count by the total number of occurrences, using simple spreadsheet cell functions, before drawing the chart.

Example:  We can use Excel to create histograms for the outcomes of the sum of two dice.

Example:  We can use Excel to create a histogram for the number of heads when three coins are tossed.

 For each of these examples, the histogram provides an immediate visual display showing which outcomes are most likely (or at least, which ones occurred most often).  Histograms can help us understand random variables with continuous probability distributions, which we now consider.
 

Continuous probability distributions

Up to now, we've only considered experiments or circumstances with finitely many outcomes, and whole numbers only.  For example, when three dice are rolled, there are exactly 216 possible outcomes.  But many random processes generate infinitely many possible outcomes.  A simple example is the type of spinner often included in board games.  One spins the metal arrow, and it comes to rest on a random location, a random number between 0 and 1 say.  (See the spreadsheet Spinner.xls, which simulates such a spinner.  [This spreadsheet is part of the Excel add-in Insight.xla by Sam Savage.])  Another example might be provided by light bulbs.  A certain brand of light bulbs might be tested by lighting them until they burn out; the lifespans of the bulbs will be some variety of real numbers (not necessarily whole numbers of course).  Another example might be the exact arrival times of customers at a checkout line at a store.

Example:  The spreadsheet Spinner.xls demonstrates a uniform distribution.  Every time the spinner is spun, a randomly chosen number between 0 and 1 is produced.  Each number is equally likely to appear.

Example: The spreadsheet cell function =RAND( )  generates a random number between 0.0 and 1.0.  One can type this into a cell, then copy and paste into a whole range of cells, to generate a large number of random numbers for which a histogram can be drawn.  (See Winston/Albright p.587 for a detailed discussion, including issues like freezing the values of the random numbers, or tying them to the histogram so that if the numbers are recomputed, the histogram automatically updates.)  One sees that each column of the histogram is about the same height; the salient property of a uniform distribution is that outcomes are equally likely in categories of equal width.  That is, with the uniform distribution, a random number is equally likely to fall between, say, 0.2 and 0.3 as it is to fall between 0.6 and 0.7..

Example:   Repeating the previous example with the cell function = -1*Mean*LN(RAND( ))  instead of just =RAND( ) produces a rather different random distribution.  Here Mean stands for a number chosen by the user, and LN is the natural logarithm function.  The random distribution that results is known as the exponential distribution.  If you draw a histogram of such a distribution, you will find that columns tend not to be the same height: earlier columns tend to be tallest.  For example, try Mean equal to 5.0 and plot a histogram with categories from 0.0 to 1.0, from 1.0 to 2.0, up to 9.0 to 10.0 or greater.  The column for 0.0 to 1.0 is apt to be highest.  That is, numbers are more likely to be fall into that category than into the other categories.

The "shape" of the exponential distribution is described by its "density function" y = lexp(-lx)  where l is the reciprocal of the number Mean.  (Here l is the Greek letter Lambda.)  The exponential function in this formula explains the name of the distribution.  If histograms are plotted, the height of the columns tend to fall according to the graph of this function.

The exponential distribution is used to model various phenomena.  One application might be the failure of components or devices such as light bulbs.  The number Mean would be the average time until a light bulb fails.  See the spreadsheet Expon.xls for a demonstration of this.  Another application (where the distribution is an extremely good model) is the decay of radioactive atoms.  Another application is the length of time to the arrival of the next customer.  Here Mean would be the average time for the next customer to appear.