The Normal Distribution

In many situations, random numbers or data are distributed in such a way that their histograms have a shape like a mound:  there will be some columns in the middle that are high, but the columns on either side drop off to essentially zero in height.  So most of the numbers cluster in the middle, and there are not many numbers that are very small or very large.  Simple situations where this might be true are the prices of homes in a neighborhood, or the weights of cows at a dairy farm.  A good model for situations like this is often the normal distribution, also known as the Gaussian distribution in honor of the German mathematician C. F. Gauss.  It is usually appropriate when an outcome depends on many different, more-or-less independent factors.  For example, the weight of a dairy cow might depend on various genetic factors, variations in diet, exposure to different viruses or bacteria, etc.  The normal distribution is also commonly used to model errors in measurements.

The "probability density function" for the normal distribution has the following formula:

f(x) = (1/sqrt(2 pi)) e^(-x^2/2)

The graph of this function is the famous "bell curve":

This graph is supposed to give the typical shape of a histogram governed by a normal distribution.  This is actually the "standard" normal distribution, arranged to have mean 0 and standard deviation 1.  Adjusting this curve in a simple way will produce distributions appropriate for situations with other means and standard deviations.
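The "simple adjustment" mentioned above can be sketched in a few lines of Python.  This is only an illustration:  the function name normal_pdf and the sample values m = 4.5 and s = 0.5 are assumptions chosen for the example.  Shifting by the mean m and rescaling by the standard deviation s turns the standard bell curve into the curve for any other normal distribution.

```python
import math

def normal_pdf(x, m=0.0, s=1.0):
    """Density of a normal distribution with mean m and standard deviation s.

    With the defaults m=0 and s=1 this is the standard bell curve.
    """
    z = (x - m) / s  # shift by the mean, rescale by the standard deviation
    return math.exp(-0.5 * z * z) / (s * math.sqrt(2.0 * math.pi))

# Height of the standard bell curve at its peak (x = 0).
print(round(normal_pdf(0.0), 4))

# Same peak for an illustrative distribution with mean 4.5 and s = 0.5.
# Halving the spread doubles the height, so the total area stays 1.
print(round(normal_pdf(4.5, m=4.5, s=0.5), 4))
```

The curve is symmetric about its mean, so normal_pdf(-1.0) and normal_pdf(1.0) return the same value.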

A simple probability experiment that results in a distribution similar to the normal distribution is counting the number of heads obtained when a number of coins are tossed.  The outcome depends on all of the coins, each of which makes a relatively small contribution.  If you toss 20 coins, you will probably get about 10 heads; you could get 1 or 2 heads, or 19 or 20 heads, but that is much less likely.  So a histogram of the number of heads will approximate the bell curve of the normal distribution.  While coin tossing might not be a very interesting situation, the exact same "binomial" distribution is used in statistics to study the number of successes in a sequence of repeated experiments, each of which has the same probability of success.  (Each such experiment is called a "Bernoulli trial".)
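To see how close the coin-tossing histogram is to a bell curve, the sketch below compares the exact binomial probabilities for 20 fair coins with a normal density.  It assumes the standard facts that the number of heads has mean n*p and standard deviation sqrt(n*p*(1-p)); the function names are made up for this illustration.

```python
import math

n = 20   # number of fair coins tossed
p = 0.5  # probability of heads on each coin

def binomial_pmf(k, n, p):
    """Exact probability of k heads in n tosses."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def normal_pdf(x, m, s):
    """Normal density with mean m and standard deviation s."""
    z = (x - m) / s
    return math.exp(-0.5 * z * z) / (s * math.sqrt(2.0 * math.pi))

m = n * p                        # mean number of heads
s = math.sqrt(n * p * (1 - p))   # standard deviation of the number of heads

# One row per outcome: exact probability, normal approximation, text histogram.
for k in range(n + 1):
    exact = binomial_pmf(k, n, p)
    approx = normal_pdf(k, m, s)
    print(f"{k:2d}  {exact:.4f}  {approx:.4f}  {'#' * round(exact * 200)}")
```

The rows near 10 heads carry the long bars, and the bars at both ends shrink to nothing, which is the mound shape described above.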

Situations where the normal distribution would not be a good model include times between arrivals of customers.  Recall that the exponential distribution is an appropriate model there.  Histograms of the exponential distribution tend to be "asymmetrical" (lop-sided):  the columns on the left tend to be taller than the columns on the right.  Also, the columns don't drop off to zero on the left.  (The first column is apt to be tallest.)  Another situation where the normal distribution is not so good would be weight data of livestock for a farmer who owns sheep and cows.  Plotting such data in a histogram would produce two mounds, one for sheep and the other for cows.  Such a distribution is called "bimodal".
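The lop-sided shape of exponential data is easy to see in a small simulation.  The sketch below is only an illustration:  the mean gap of 2 minutes, the sample size, and the bin width are all made-up values, and the random seed is fixed just so the run is repeatable.

```python
import random

random.seed(1)   # fixed seed so the run is repeatable
mean_gap = 2.0   # hypothetical mean time (minutes) between customer arrivals

# random.expovariate takes the rate, which is 1 / mean.
gaps = [random.expovariate(1.0 / mean_gap) for _ in range(10_000)]

# Count how many gaps fall in each 1-minute column; longer gaps are ignored.
width = 1.0
counts = [0] * 8
for g in gaps:
    column = int(g // width)
    if column < len(counts):
        counts[column] += 1

# Text histogram: one '#' per 100 observations.
for i, c in enumerate(counts):
    print(f"{i * width:4.1f}-{(i + 1) * width:4.1f}  {'#' * (c // 100)}")
```

The first column comes out tallest and the columns shrink steadily to the right, so there is no mound in the middle.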

The bell curve of the normal distribution is an example of a probability density function.  It can be used to find the probability of an event under this distribution.  Suppose we wish to find the probability that a number from the normal distribution is between -0.6 and 1.4.  What we are supposed to do is find the area underneath the graph of the function and above the x-axis, between x = -0.6 and x = 1.4.
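The area in question can be approximated directly by adding up thin trapezoids under the curve.  The sketch below does this for the interval from -0.6 to 1.4; the helper name area_under_curve and the number of steps are assumptions made for this illustration, not a standard routine.

```python
import math

def normal_pdf(x):
    """Standard normal density (mean 0, standard deviation 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def area_under_curve(f, a, b, steps=10_000):
    """Approximate the area under f between a and b with the trapezoid rule."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))      # the two end points count half
    for i in range(1, steps):
        total += f(a + i * h)        # interior points count fully
    return total * h

area = area_under_curve(normal_pdf, -0.6, 1.4)
print(round(area, 3))  # 0.645
```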

Texts on probability theory and statistics often have tables that give the area under the density function of the normal distribution.  Those tables can then be used to find probabilities.  A common arrangement is to tabulate values of a function F(x), where F(x) gives the area under the graph to the left of x.  So the probability that a number from the normal distribution is between a and b is given by F(b) - F(a).  (The function F is an example of a cumulative distribution function, also called the cumulative probability.  It gives the probability that a randomly selected number from the distribution is less than x.)  Another common arrangement is to tabulate the area from 0 to x.  Of course, statistical software will find these probabilities automatically.  In Excel, the NORMDIST function can be used to find values of this function.
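Outside of tables and Excel, the function F can be computed from the error function, which most programming languages provide.  The sketch below uses Python's math.erf; the names F, prob_between and area_from_zero are chosen for this illustration only.

```python
import math

def F(x):
    """Area under the standard normal density to the left of x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_between(a, b):
    """Probability that a standard normal number lies between a and b."""
    return F(b) - F(a)

def area_from_zero(x):
    """The other common table layout: area between 0 and x."""
    return F(x) - 0.5

print(round(F(-0.6), 3), round(F(1.4), 3))   # 0.274 0.919
print(round(prob_between(-0.6, 1.4), 3))     # 0.645
```

A table of the second kind can be converted to the first by adding 0.5, since half of the area lies to the left of 0.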

To handle more general normal distributions (with mean m and standard deviation s), we use z scores.  The idea is this: If we draw random numbers Z from a standard normal distribution, the random numbers X = sZ+m will happen to have mean m and standard deviation s.  (Numbers drawn from the standard normal distribution itself will have mean 0 and standard deviation 1.)  So if you happen to have numbers X with mean m and standard deviation s, compute the numbers Z = (X - m)/s.  These will have mean 0 and standard deviation 1, so we can use the standard normal distribution to model them.
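The two conversions can be checked with a quick simulation.  The values m = 4.5 and s = 0.5 below are just illustrative choices, and the seed is fixed only to make the run repeatable; the sample means and standard deviations will be close to, but not exactly, the target values.

```python
import random
import statistics

random.seed(0)    # fixed seed so the run is repeatable
m, s = 4.5, 0.5   # illustrative mean and standard deviation

# Draw from the standard normal distribution, then rescale: X = sZ + m.
Z = [random.gauss(0.0, 1.0) for _ in range(50_000)]
X = [s * z + m for z in Z]

# Going the other way, z-scores bring X back to mean 0, standard deviation 1.
Z_back = [(x - m) / s for x in X]

print(round(statistics.mean(X), 2), round(statistics.stdev(X), 2))
print(round(statistics.mean(Z_back), 2), round(statistics.stdev(Z_back), 2))
```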

Example:  Suppose the heights of 5-year-old pine trees on a certain tree farm are normally distributed with mean 4.5 m and standard deviation 0.5 m.  What is the probability that a randomly selected tree has height between 4.2 m and 5.2 m?  Solution:  Here m = 4.5 and s = 0.5.  So if X = 4.2 then Z = (X - m)/s = (4.2 - 4.5)/0.5 = -0.6 and if X = 5.2 then Z = (X - m)/s = (5.2 - 4.5)/0.5 = 1.4.  It turns out that F(-0.6) = 0.274 and F(1.4) = 0.919, so the probability in question is 0.919 - 0.274 = 0.645.  Note that we have just found the area shown in the graph above.

In Excel, we can use the NORMDIST function:  the value of F(-0.6) is given by =NORMDIST(-0.6,0,1,TRUE).  Actually, the NORMDIST function can be used without finding z-scores first:  =NORMDIST(value,mean,standard deviation,TRUE) gives the cumulative probability for X = value, for a normal distribution with the given mean and standard deviation.  That is, it gives the probability that a random number selected from the normal distribution with that mean and standard deviation will be less than value.  So with mean 4.5 and standard deviation 0.5, the cumulative probability of 4.2 is given by =NORMDIST(4.2,4.5,0.5,TRUE) and the cumulative probability of 5.2 is given by =NORMDIST(5.2,4.5,0.5,TRUE).  You get 0.274 and 0.919, and again the solution is 0.919 - 0.274 = 0.645.  Here, =NORMDIST(4.2,4.5,0.5,TRUE) gives the probability that a pine tree will be 4.2 m or less in height.  [The 'TRUE' that goes in the last slot of the NORMDIST function tells it to compute this type of probability.]
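For readers without Excel, a rough Python counterpart can be built on statistics.NormalDist from the standard library (Python 3.8 or later).  The wrapper name normdist and its argument order are assumptions made here to mirror the spreadsheet formula; passing FALSE in Excel's last slot returns the height of the density curve instead of the cumulative probability, and the sketch imitates that.

```python
from statistics import NormalDist

def normdist(value, mean, standard_dev, cumulative=True):
    """Rough stand-in for Excel's NORMDIST(value, mean, standard_dev, cumulative)."""
    dist = NormalDist(mu=mean, sigma=standard_dev)
    # cumulative=True  -> probability of a result at or below value
    # cumulative=False -> height of the density curve at value
    return dist.cdf(value) if cumulative else dist.pdf(value)

# Pine tree example: mean 4.5 m, standard deviation 0.5 m.
low = normdist(4.2, 4.5, 0.5)
high = normdist(5.2, 4.5, 0.5)
print(round(low, 3), round(high, 3), round(high - low, 3))  # 0.274 0.919 0.645
```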

Example:  The weights of fish from a lake happen to be normally distributed with mean 3 kg and standard deviation 0.6 kg.  What is the probability that a randomly caught fish will have weight between 2 kg and 4 kg?  Solution:  Here we just use Excel.  We use =NORMDIST(2,3,0.6,TRUE) to get 0.048.  This is the probability that a fish is 2 kg or less in weight.  We use =NORMDIST(4,3,0.6,TRUE) to get 0.952.  This is the probability that a fish is 4 kg or less in weight.  Therefore the probability that a fish is between 2 kg and 4 kg in weight is 0.952 - 0.048 = 0.904.

It is worth giving z-scores for this problem.  Here m = 3 and s = 0.6.  So if X = 2 then Z = (X - m)/s = (2 - 3)/0.6 = -1.67 and if X = 4 then Z = (X - m)/s = (4 - 3)/0.6 = 1.67.  It turns out that F(-1.67) = 0.047 and F(1.67) = 0.952, so the probability in question is 0.952 - 0.047 = 0.905.  (This is the area between -1.67 and 1.67 under the standard normal density function.  Rounding the z-scores to two decimal places can shift the last digit of the answer slightly.)
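The fish example can be worked both ways in Python to see how much the rounding of z-scores matters.  This sketch assumes Python 3.8 or later for statistics.NormalDist; the variable names are chosen for this illustration.

```python
from statistics import NormalDist

# Direct route: work with the fish weights themselves.
fish = NormalDist(mu=3.0, sigma=0.6)
direct = fish.cdf(4.0) - fish.cdf(2.0)

# Table route: convert to z-scores, rounded to two decimals as in a printed table.
standard = NormalDist()                  # mean 0, standard deviation 1
z_low = round((2.0 - 3.0) / 0.6, 2)      # -1.67
z_high = round((4.0 - 3.0) / 0.6, 2)     # 1.67
via_table = standard.cdf(z_high) - standard.cdf(z_low)

print(round(direct, 4), round(via_table, 4))
```

The two answers agree to about two decimal places (roughly 0.904 against 0.905); the small gap comes entirely from rounding the z-scores.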

Example:  SAT (Scholastic Aptitude Test) scores are supposed to have mean 1000 and standard deviation 150.  Assuming they are normally distributed, find the probability that an SAT score is greater than 1200.  Solution:  Here, we just use Excel:  =NORMDIST(1200,1000,150,TRUE) gives 0.909.  This is the probability of an SAT score being 1200 or less.  Therefore the probability of an SAT score being above 1200 is 1 - 0.909 = 0.091.  By the way, the z-score here is (1200 - 1000)/150, or about 1.33.  The area under the standard normal density function to the left of this z-score is 0.909.
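The "greater than" calculation follows the same pattern in code:  find the cumulative probability, then subtract it from 1.  A minimal sketch, assuming Python 3.8 or later for statistics.NormalDist:

```python
from statistics import NormalDist

sat = NormalDist(mu=1000, sigma=150)

at_or_below = sat.cdf(1200)       # probability of a score of 1200 or less
above = 1.0 - at_or_below         # the rest of the area lies above 1200
z = (1200 - sat.mean) / sat.stdev

print(round(at_or_below, 3), round(above, 3), round(z, 2))  # 0.909 0.091 1.33
```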

Incidentally, people often refer to an outcome as being within one standard deviation of the mean, or two standard deviations of the mean, etc.  This is a handy way of thinking about how unusual an outcome is.  One standard deviation means a z-score of 1; it turns out that in a normal distribution, about 68% of the data lies within one standard deviation of the mean.  Two standard deviations means a z-score of 2; in a normal distribution, about 95% of the data lies within two standard deviations of the mean.
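These percentages are just areas under the standard normal curve, so they can be checked the same way as the earlier examples.  The sketch below assumes Python 3.8 or later; the helper name within is made up, and the three-standard-deviation row is an extra not discussed above.

```python
from statistics import NormalDist

standard = NormalDist()  # mean 0, standard deviation 1

def within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(k, round(within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```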