# Busted!

“If toast always lands butter-side down, and cats always land on their feet, what happens if you strap toast on the back of a cat and drop it?”
– Steven Wright

I’m a fan of the TV show “Mythbusters.” In case you’re not familiar with it (have you been living in a cave?), they test common “myths” and “urban legends” like “Can yodeling trigger an avalanche” (no) and “Can you really freeze your tongue to a pole in cold weather?” (yes). Myths that are verified are classified “confirmed,” those that are contradicted are “busted,” and when the result is inconclusive it’s classified “plausible.”

The Mythbusters team (led by Jamie Hyneman and Adam Savage) do an excellent job. They approach each question logically, applying both healthy skepticism and open-mindedness. They’re quite expert as engineers of the near-impossible (most of them having extensive experience in “special effects” for film and TV). They apply sound scientific principles and measuring instruments. All these are good scientific qualities, but perhaps their most popular quality is that they’re so good at blowing things up (usually, intentionally).

But there’s one scientific discipline on which I think they consistently fail: statistics. Consider for example the show where they tested the myth that “toast always lands butter-side down.” They built (several) machines to drop toast and in true scientific fashion, they tested their machines on toast with no butter as a “control” to see whether or not their setup was biased to favor “up” or “down” regardless of the presence of butter. In one of the tests, out of 10 pieces of (unbuttered) toast 3 landed “up” and 7 landed “down.” They considered this clear evidence of a bias in favor of “down,” indicating that the design required adjustment.

But is the evidence so clear? Suppose the machine were perfectly unbiased, with exactly equal probability (50/50) of the toast landing “up” and “down.” What are the odds of getting 3 up and 7 down? Or of getting 3 of one side (up or down) and 7 of the other (down or up)? Or of getting a result that extreme or more so?

We can compute the probability $P_k$ of getting $k$ of a given result (like “up”) out of $n$ tries, when the probability for a single try is $p$, using the binomial distribution:

$P_k = {n! \over k! (n-k)!} p^k (1-p)^{n-k}$.

The exclamation point indicates the “factorial” of a number, which is the product of all the whole numbers from 1 up to that number:

$n! = (1)(2)(3)(4) \cdots (n-2)(n-1)n$.

Let’s compute the results for $n=10$ tries, assuming a 50/50 chance of “up” or “down” so that $p=0.5$, for all possible values of $k$ (from 0 meaning no “up” results, to 10 meaning all “up” results). We can graph the probabilities here:
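If you’d like to reproduce these numbers yourself, here’s a minimal Python sketch. The function name `binomial_pmf` and the text-bar display are just illustrative choices, not anything from the show; the arithmetic is the binomial formula above with $n=10$ and $p=0.5$.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k 'up' results in n independent tries."""
    # comb(n, k) is n! / (k! (n-k)!)
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.5
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(probs):
    bar = "#" * round(prob * 100)  # roughly one '#' per percentage point
    print(f"{k:2d} up, {n - k:2d} down: {prob:.4f} {bar}")
```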

Clearly an even split (5 of each) is the most likely outcome; the chance of getting that result is nearly 1-out-of-4. The next most likely outcomes are the 4-6 splits; 4up-6down and 4down-6up each have a slightly greater than 1-out-of-5 chance of occurring. It’s worth noticing that because there are two ways to get a 4-6 split, the chance of getting one or the other is 41%, so it’s more likely that you’ll get one of the 4-6 splits than a 5-5 split. Still, 5-5 is a little more likely than either 4-6 split on its own.

But there’s also a significant chance of a 3-7 split. Both 3up-7down and 3down-7up will occur slightly more than 10% of the time (nearly 12% in fact). Again there are two ways to get a 3-7 split, so the probability of splitting 3-7 (either way) is twice that, or 0.2344. With a 23.44% chance of a 3-7 split when the machine is unbiased, such a result can hardly be called “clear evidence of a bias”!

The Mythbusters team operated on the idea that with 10 tries, a 3-7 split was “extreme” enough (i.e., different enough from an even split) to be considered a clear sign of difference from 50/50. By that measure, the only cases which are not so extreme are a 4-6 split (either way) or a 5-5 split. The chance of that happening when the machine is unbiased is 0.6562. Hence if the machine is unbiased, with 10 tries there’s only a 66% chance we’d conclude it was unbiased. The chance is more than 1/3 (34.4%) that we’d conclude, incorrectly, that there was bias even though there is none.
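The split probabilities quoted above can be checked with a few lines of Python. This is only a sketch: the helper name `split_probability` is made up for illustration, and it assumes a perfectly fair machine ($p=0.5$).

```python
from math import comb

def split_probability(smaller: int, n: int = 10) -> float:
    """Chance of a smaller-vs-(n - smaller) split, in either direction, for a fair machine."""
    one_way = comb(n, smaller) / 2**n
    # An exactly even split can happen only one way; every other split has a mirror image.
    return one_way if 2 * smaller == n else 2 * one_way

for smaller in range(5, -1, -1):
    print(f"{smaller}-{10 - smaller} split: {split_probability(smaller):.4f}")

# The rule used on the show: a 3-7 split (or anything more lopsided) counts as bias.
false_alarm = sum(split_probability(s) for s in range(0, 4))
print(f"chance of declaring bias when there is none: {false_alarm:.4f}")
```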

### Statistical Testing

In this case we’re performing a statistical test of the null hypothesis: the machine is unbiased so there’s an equal chance for the toast to land “up” and “down.” The data are that 3 of 10 samples were “up” while 7 of 10 were “down.” The actual test used was that any case as extreme as (or more extreme than) a 3-7 split (either way) would be considered sufficient evidence to reject the null hypothesis, concluding that the machine was biased. As we’ve seen, with this test there’s a 34.4% chance we’ll reject the null hypothesis even when it’s true.

No matter what test we had used, there’s almost always some chance of mistakenly rejecting the null hypothesis. It’s even possible (although extremely unlikely) that the machine is unbiased but — entirely by accident, mind you — all 10 tries land “up” or all 10 tries land “down.” However, with an unbiased machine such an event is so unlikely that it’s less plausible than the idea that the machine is biased, hence it’s logical to conclude that the null hypothesis of “unbiased” is faulty. This is the essential idea behind hypothesis testing in statistics.

It’s usually possible to compute the probability of rejecting the null hypothesis even though it’s true; we did exactly that for the toast-dropping experiment. This particular kind of mistake — rejecting the null hypothesis in spite of its being true — is called a type I error. The complementary mistake, accepting the null hypothesis in spite of its being false, is called a type II error.

The chance of a type I error (which is what we computed for the toast experiment) is called the size of the test. If we take the chance of a type II error and subtract it from 1, we get the power of the test. We generally hope to minimize both kinds of error; we want a small size so there’s little chance of rejecting the null hypothesis when it’s true, and we want high power so there’s little chance of accepting the null hypothesis when it’s false.

Although it’s usually possible to compute the chance of a type I error (just as we did for the toast experiment), it’s usually not possible to compute the chance of a type II error. That’s because we usually don’t know the details of the case for which the null hypothesis is false. In the 10-times-toast-dropping experiment, for example, the chance of a type II error depends strongly on just how likely “up” is compared to “down” for a single slice of toast — but we don’t know what that is. If the chance of “up” is zero (so the toast always lands butter-side down) then we’re sure to get 0 “up” and 10 “down”. In that case, as long as a 10-0 split causes us to reject the null hypothesis, we’re sure to reject the null, and the power of the test is 1 (there’s no chance of a type II error).

If, on the other hand, the actual probability of “up” is 0.500000001, then the null hypothesis (that $p=0.5$) is indeed false and should be rejected. But the actual probability is so close to the null-hypothesis probability that it’ll be extremely difficult to discriminate between them. In this case, there’s a high probability of accepting the null hypothesis even though we should, technically, reject it (the machine is biased, but only slightly so). The high probability of a type II error means the test has low power.
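To make the dependence on the unknown $p$ concrete, here’s a rough sketch that computes how often the rule “reject on a 3-7 split or anything more lopsided” fires for a few values of the true $p$. The values of $p$ tried and the function names are illustrative assumptions; in a real experiment we wouldn’t know the true $p$, which is the whole problem.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k 'up' results in n tries."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def rejection_probability(p_true: float, n: int = 10, cutoff: int = 3) -> float:
    """Chance of a split at least as lopsided as cutoff-(n - cutoff), given the true p."""
    rejected = [k for k in range(n + 1) if min(k, n - k) <= cutoff]
    return sum(binomial_pmf(k, n, p_true) for k in rejected)

for p_true in (0.0, 0.1, 0.3, 0.5, 0.500000001):
    chance = rejection_probability(p_true)
    print(f"true p = {p_true}: null rejected with probability {chance:.5f}")
```

At $p=0.5$ the number that comes out is the size of the test; at any other $p$ it is the power.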

We were able to compute the probability of a type I error (the “size” of the test) even without knowing the “true” value of $p$. For this reason, many statistical tests are based solely on selecting a “good” or “reasonable” value for the chance of a type I error. In fact, the de facto standard in scientific research is usually 0.05: if the probability of the observed result, or a more extreme one, is less than 0.05 (1/20) when the null hypothesis is true, then the observation is considered sufficiently unlikely to reject the null hypothesis. Another way to say this is that there’s only a 5% “false-alarm probability,” or that the null hypothesis is rejected at “95% confidence.”

What’s the appropriate test when dropping 10 slices of toast, for the null hypothesis that $p=0.5$? A 0-10 split is the most extreme case we can observe, and the chance of that is 1/1024 = 0.000977. There are two cases of a 0-10 split (0up-10down and 0down-10up) so the chance of either one is 1/512 = 0.001953. Hence the chance of a 0-10 split either way is far less than 0.05, and we can consider them to be sufficient evidence of bias in the toast-dropping mechanism.

The chance of a 1-9 split is 0.009766, and again there are two such cases (1up-9down and 1down-9up) so the chance of either is 0.01953. The chance of observing either 1-9 split or more extreme is the combined chance of either 1-9 split or either 0-10 split, which is 0.02148. This again is less than 0.05, so any case as extreme as, or more extreme than, a 1-9 split, is unlikely enough that we should reject the null hypothesis for a 95% confidence test.

The chance of a 2-8 split is 0.043945. The chance of either 2-8 split (2up-8down or 2down-8up) is 0.08789. The chance of a result which is as extreme as, or more extreme than, a 2-8 split, is 0.109375. This is not less than 0.05, so a 2-8 split is not sufficiently unlikely to reject the null hypothesis. Hence the correct test, for a 95% confidence level, would be to reject the null when the split is 1-9 or 0-10, but not in any other case. In fact the probability of a 2-8 split or more extreme isn’t even less than 0.1, so it’s not even strong enough to reject the null hypothesis at 90% confidence, let alone at 95% confidence. The 3-7 case actually observed isn’t even close to strong enough.
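The bookkeeping in the last three paragraphs is mechanical enough to hand to a computer. A sketch, assuming a fair machine and a 0.05 cutoff (the variable names are arbitrary):

```python
from math import comb

n, alpha = 10, 0.05
tails = {}      # smaller side of the split -> chance of that split or a more lopsided one
running = 0.0
for smaller in range(n // 2 + 1):
    # An even split happens one way; any other split has a mirror image.
    ways = comb(n, smaller) * (1 if 2 * smaller == n else 2)
    running += ways / 2**n
    tails[smaller] = running

for smaller, tail in tails.items():
    verdict = "reject the null" if tail < alpha else "not enough evidence"
    print(f"{smaller}-{n - smaller} split or more extreme: {tail:.6f} -> {verdict}")
```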

### p-values

We calculated earlier that when the null hypothesis is true, the probability of getting a result as extreme as, or more extreme than, the observed 3-7 split is 0.344. This would be strong enough if we were testing for a false-alarm probability of 34.4%, i.e., for a test at 65.6% confidence. However, intuition strongly suggests (and much experience confirms) that relying on tests that have only 65.6% confidence (that have a 34.4% chance of being wrong) is a bad idea. There’s a reason that 95% confidence is the de facto standard. Sometimes tests are applied more loosely, only requiring 90% confidence, but I’m not aware of any application which considered less than 70% confidence to be reliable evidence.

The chance of the observed-result-or-more-extreme (in this case, 0.344) is often called the “p-value” of the result. If the p-value is less than our cutoff value (in most cases 0.05) the null hypothesis is rejected; otherwise it is not. If the p-value is ridiculously small, say one out of a billion, then we often conclude that the observed result is so unlikely that the null hypothesis is simply not believable. Very strong results, like rejecting the null hypothesis at 99.9% confidence, correspond to p-values less than 0.001.

Some researchers prefer simply to state the p-value, without referring to any specific cutoff value. This often happens when the p-value is so low (say, 0.001 or less) that the conclusion can be considered “obvious.”
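As a sketch of the calculation for the fair-toast null hypothesis, the function below (hypothetical name `two_sided_p_value`) adds up the probability of every outcome at least as far from an even split as the one observed. It only handles the $p=0.5$ null, where “more extreme” is unambiguous because the distribution is symmetric.

```python
from math import comb

def two_sided_p_value(k: int, n: int) -> float:
    """p-value for k 'up' results in n tries, under the null hypothesis p = 0.5."""
    observed_gap = abs(k - n / 2)
    # Count every outcome at least as far from an even split as the observed one.
    extreme = sum(comb(n, j) for j in range(n + 1) if abs(j - n / 2) >= observed_gap)
    return extreme / 2**n

p_val = two_sided_p_value(3, 10)
print(f"p-value for 3 up out of 10: {p_val:.4f}")
print("reject at the 0.05 cutoff" if p_val < 0.05 else "do not reject at the 0.05 cutoff")
```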

### Confidence Intervals

Hypothesis testing isn’t as popular with statisticians as it used to be. For one thing, it’s rare that we know the alternative hypothesis precisely enough to know how unlikely the null hypothesis is compared to the alternative. We saw this in the case of the “toast test,” that if the single-slice probability isn’t exactly 0.5 we don’t know what it is, so we don’t know how likely a 3-7 split is — is it impossible because the true value is $p=0$, or is it almost exactly the same as the null-hypothesis likelihood because the true value is $p=0.5000000001$?

If we can specify the alternative hypothesis more precisely, or if we can assign “prior probabilities” to the various possible alternatives, it may be possible directly to compare the likelihood of getting the observed result under the null hypothesis to its likelihood under various alternatives. We may even be able to estimate the “relative likelihood” of various alternative hypotheses.

Another approach is simply to use the available data to estimate the single-slice probability $p$, and a confidence interval for that estimate. We might, for example, estimate a 95% confidence interval as a lower and upper limit $L$ and $U$, such that it’s 95% likely that the true probability value falls in the range from $L$ to $U$.

There are often multiple ways to estimate these confidence limits. For the binomial distribution (which is relevant to the problem of falling toast), the most-often taught method in elementary statistics courses is to estimate the single-slice probability as the number of “up” events divided by the total number of tries

$\hat p = k/n$.

We then estimate the standard deviation of this estimate according to a well-known formula

$\hat \sigma = \sqrt{\hat p (1 - \hat p) / n}$.

We then define lower and upper confidence limits $L$ and $U$ by taking the basic estimate plus or minus 1.96 standard deviations

$\hat p \pm 1.96 \hat \sigma$.

The minus sign gives the lower limit $L$, the plus sign gives us the upper limit $U$.

However, in many instances this isn’t really a good estimate; in fact, in some cases it’s downright bad. Suppose none of the slices of toast is observed to land “up.” Then our estimated single-slice probability is

$\hat p = 0/n = 0$.

The standard deviation is then estimated as

$\hat \sigma = \sqrt{ (0) (1) /n } = 0$.

Then the lower and upper confidence limits are $L=0$ and $U=0$. This is the same as saying that we’re 95% sure the value is $p=0$. We can see how wrong this is in the case that we test only one slice so $n=1$. If it lands “down” then $k=0$ so our 95% confidence interval is from 0 to 0 — but it’s obviously wrong to conclude from testing only one slice, that there’s a 95% chance the probability is exactly equal to $p=0$!
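Here’s the naive recipe as a short Python sketch (the function name `naive_interval` is just a label for illustration), including the degenerate cases just described:

```python
from math import sqrt

def naive_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Textbook interval: p_hat plus or minus z estimated standard deviations."""
    p_hat = k / n
    sigma_hat = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * sigma_hat, p_hat + z * sigma_hat

for k, n in [(3, 10), (0, 10), (0, 1)]:
    low, high = naive_interval(k, n)
    print(f"{k} up out of {n}: L = {low:.3f}, U = {high:.3f}")
```

Whenever $k=0$ the interval collapses to a single point, no matter how few slices were dropped.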

The “naive” confidence interval is actually pretty good when the sample size ($n$) is large and the estimated probability isn’t too close to zero or one. Still, no matter what the observed data, there are better ways to estimate a confidence interval for the binomial distribution. One excellent choice is the Clopper-Pearson interval, which is a “secure” estimate, meaning that no matter what the true value is, it’s at least 95% likely that the estimated confidence interval will include it. The Clopper-Pearson intervals are actually conservative: for many values of $p$ it’s more than 95% likely our confidence interval will include it.
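For the curious, here’s one way to compute a Clopper-Pearson interval with nothing but the Python standard library. It’s a sketch, not a reference implementation: it finds the limits by bisection on the binomial tail probabilities, and the helper names (`binom_cdf`, `solve_decreasing`, `clopper_pearson`) are made up for this example. Statistics packages typically get the same limits from quantiles of the beta distribution.

```python
from math import comb

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def solve_decreasing(f, target: float, iterations: int = 60) -> float:
    """Bisection on [0, 1] for f(p) = target, where f decreases as p increases."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    alpha = 1 - confidence
    # Lower limit: the p at which seeing k or more "up" has probability alpha/2,
    # i.e. P(X <= k - 1) = 1 - alpha/2.
    lower = 0.0 if k == 0 else solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper limit: the p at which seeing k or fewer "up" has probability alpha/2.
    upper = 1.0 if k == n else solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper

for k, n in [(0, 1), (0, 10), (3, 10)]:
    low, high = clopper_pearson(k, n)
    print(f"{k} up out of {n}: L = {low:.4f}, U = {high:.4f}")
```

Unlike the naive interval, a single slice landing “down” no longer pins the estimate to exactly zero.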

I prefer to make what’s called a Bayesian estimate of the likelihood of various values of $p$, and base my confidence interval on that. This gives intervals very nearly the same as the Clopper-Pearson intervals, but not quite so conservative.

### Back to Toast

What do these confidence intervals say about the toast test? If we observe 3 out of 10 land “up,” then the Bayesian confidence interval extends from 0.11 to 0.61, meaning that it’s 95% likely the true $p$ value is somewhere in that range. Notice that the null hypothesis value $p=0.5$ is inside that interval, so we don’t really have much evidence that the true value is different from its null-hypothesis value.

We can also notice that the 95% confidence interval is quite large. All we can really say with confidence, based on the available evidence, is that the single-slice probability is between 0.11 and 0.61 — quite a wide range! This emphasizes that for this problem, a sample of only 10 tries isn’t really big enough to narrow down the possibilities very much. If we really want to know, with any precision, whether the toast-dropping machine is biased, we should run more tests.

The Mythbusters team elaborated their machine, and finally ran tests on both unbuttered (“control”) toast and buttered toast. Testing 24 slices of each they observed 11up-13down for unbuttered toast and an even split 12up-12down with buttered toast. They concluded that butter doesn’t make toast more likely to land upside-down when it’s dropped. But even this sample is still too small to be very informative. For buttered toast, the 12-12 split gives a confidence interval of 0.31 to 0.69. So it’s possible that $p=0.31$, in which case toast is twice as likely to land buttered-side-down as buttered-side-up, and the results of the test are entirely consistent with that case. That doesn’t really answer the question with much precision. Again, more tests are called for.

They also dropped toast, not from table height, but from the roof of their workshop to the ground. Dropping 48 of each case (unbuttered and buttered) they observed 26up-22down unbuttered and 29up-19down with buttered toast. The confidence intervals are: for unbuttered toast 0.40 to 0.67, for buttered toast 0.46 to 0.73. They hypothesized that for a lot of the buttered toast that landed butter side up, the buttered side was pressed in, forming a cup that affected the way the toast dropped. But clearly both confidence intervals include $p=0.5$ so there’s really no evidence that either case favors up over down, and there’s no real evidence that buttered and unbuttered toast land differently.
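The intervals quoted in this section can be approximately reproduced with the sketch below. It rests on two assumptions the text above doesn’t spell out: a flat (uniform) prior on $p$, and an “equal-tailed” interval that cuts 2.5% of the posterior probability off each end. Under a flat prior the posterior is a beta distribution, Beta($k+1$, $n-k+1$), and for whole-number parameters its cumulative probability can be written as a binomial tail, which keeps the code free of external libraries. The function names are illustrative.

```python
from math import comb

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def posterior_cdf(x: float, k: int, n: int) -> float:
    """P(true p <= x) after k 'up' in n tries, ASSUMING a flat prior.

    The posterior is Beta(k + 1, n - k + 1); its CDF at x equals
    P(Y >= k + 1) for Y ~ Binomial(n + 1, x).
    """
    return 1 - binom_cdf(k, n + 1, x)

def bayesian_interval(k: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Equal-tailed interval (an assumption), found by bisection on the posterior CDF."""
    tail = (1 - confidence) / 2

    def quantile(q: float) -> float:
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if posterior_cdf(mid, k, n) < q:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    return quantile(tail), quantile(1 - tail)

experiments = {
    "first control run (3 up of 10)": (3, 10),
    "table height, unbuttered (11 up of 24)": (11, 24),
    "table height, buttered (12 up of 24)": (12, 24),
    "rooftop, unbuttered (26 up of 48)": (26, 48),
    "rooftop, buttered (29 up of 48)": (29, 48),
}
for label, (k, n) in experiments.items():
    low, high = bayesian_interval(k, n)
    inside = "includes" if low <= 0.5 <= high else "excludes"
    print(f"{label}: {low:.2f} to {high:.2f} ({inside} p = 0.5)")
```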

Again, more samples are needed to reach a precise conclusion. It’s surprising how often results which seem to be meaningful turn out not to be when subjected to proper analysis. Intuition is a great tool for generating ideas, but not for testing them. But hey, that’s why we invented statistics.

In any case, the notion that toast always lands butter-side-down is Busted!