Type I and Type II Errors

In statistics, type I and type II errors are the two kinds of mistaken conclusion that can arise in statistical inference when chance variation in the data leads to a wrong decision. A type I error is concluding that the default claim under test is false when it is actually true (e.g. a jury finding an innocent person guilty, a ‘false positive’); a type II error is concluding that the default claim is true when it is actually false (e.g. a jury finding a guilty person not guilty, a ‘false negative’ or simply a ‘miss’).

Usually a type I error leads one to conclude that a thing or relationship exists when really it doesn’t: for example, that a patient has a disease being tested for when really the patient does not have the disease, or that a medical treatment cures a disease when really it doesn’t. Examples of type II errors would be a blood test failing to detect the disease it was designed to detect, in a patient who really has the disease; or a clinical trial of a medical treatment failing to show that the treatment works when really it does.

In statistical test theory the notion of statistical error is an integral part of hypothesis testing. The test requires an unambiguous statement of a null hypothesis, which usually corresponds to a default ‘state of nature,’ for example ‘this person is healthy,’ ‘this accused is not guilty,’ or ‘this product is not broken.’ The alternative hypothesis is the negation of the null hypothesis, for example ‘this person is not healthy,’ ‘this accused is guilty,’ or ‘this product is broken.’ The result of the test either rejects the null hypothesis, a ‘positive’ result (not healthy, guilty, broken), or fails to reject it, a ‘negative’ result (healthy, not guilty, not broken). If the result of the test corresponds with reality, then a correct decision has been made; if it does not, an error has occurred. Because the decision rests on statistical evidence, the result is never, except in very rare cases, guaranteed to be free of error.
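
To make this framework concrete, here is a minimal simulation sketch in Python (assuming NumPy and SciPy are available; the sample size, effect size, significance level, and number of trials are illustrative choices, not values from the text). It runs many experiments under a true and under a false null hypothesis and counts how often each kind of error occurs.

```python
# Minimal sketch of the hypothesis-testing framework described above.
# H0: the population mean is 0. We simulate many experiments, run a
# one-sample t-test on each, and count wrong rejections (type I errors,
# when H0 is really true) and wrong non-rejections (type II errors,
# when H0 is really false). All numeric settings are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 30, 10_000

def rejection_rate(true_mean):
    """Fraction of simulated experiments in which H0 (mean = 0) is rejected."""
    rejections = 0
    for _ in range(trials):
        sample = rng.normal(loc=true_mean, scale=1.0, size=n)
        _, p_value = stats.ttest_1samp(sample, popmean=0.0)
        rejections += p_value < alpha
    return rejections / trials

print("Type I error rate (H0 true): ", rejection_rate(true_mean=0.0))      # close to alpha
print("Type II error rate (H0 false):", 1 - rejection_rate(true_mean=0.5)) # the 'miss' rate
```

With these settings the observed type I error rate sits close to the chosen significance level, while the type II error rate depends on how large the real effect is relative to the sample size.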

A type I error, also known as an error of the first kind or a false positive, occurs when the null hypothesis is true but is erroneously rejected: we come to believe a falsehood. Type I errors are philosophically a focus of skepticism (critical thinking) and of Occam’s razor (the preference for simpler explanations). In terms of folk tales, an investigator committing a type I error is ‘crying wolf’ when no wolf is in sight (raising a false alarm).

A type II error, also known as an error of the second kind or a false negative, occurs when the null hypothesis is false but erroneously fails to be rejected: we fail to believe a truth, failing to assert what is present, a miss. In terms of folk tales, the investigator fails to see the wolf (‘failing to raise an alarm’). A false negative is a test result indicating that a condition is absent when it is actually present. A common example is a guilty prisoner freed from jail: the question ‘Is the prisoner guilty?’ actually has a positive answer (yes, he is guilty), but the trial fails to detect this and wrongly decides the prisoner is not guilty.
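
The four possible outcomes can be restated as a small lookup on two yes/no questions; the Python sketch below is purely illustrative and simply encodes the definitions given above.

```python
# Outcomes of a test, keyed by (is the null hypothesis actually true?,
# did the test reject it?). Labels follow the definitions in the text.
outcomes = {
    (True,  False): "correct decision (true negative)",
    (True,  True):  "type I error (false positive / 'crying wolf')",
    (False, True):  "correct decision (true positive)",
    (False, False): "type II error (false negative / 'missing the wolf')",
}

def classify(null_is_true: bool, null_rejected: bool) -> str:
    return outcomes[(null_is_true, null_rejected)]

print(classify(null_is_true=True,  null_rejected=True))   # type I error
print(classify(null_is_true=False, null_rejected=False))  # type II error
```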

Both types of error are problems for individuals, corporations, and data analysis. A false positive (with a null hypothesis of health) in medicine causes unnecessary worry or treatment, while a false negative gives the patient the dangerous illusion of good health, and the patient might not get an available treatment. A false positive in manufacturing quality control (with a null hypothesis of a product being well made) discards a product that is actually well made, while a false negative stamps a broken product as operational. A false positive (with a null hypothesis of no effect) in scientific research suggests an effect that is not actually there, while a false negative fails to detect an effect that is there.

Based on the real-life consequences of an error, one type may be more serious than the other. For example, NASA engineers would rather throw out an electronic circuit that is really fine than use one on a spacecraft that is actually broken. In that situation a type I error increases the cost, but a type II error would risk the entire mission. On the other hand, criminal courts set a high bar for proof and procedure and sometimes acquit someone who is guilty rather than convict someone who is innocent. In totalitarian states, the opposite may occur, with a preference to jail someone innocent rather than allow an actual dissident to roam free. Each system makes its own choice regarding where to draw the line. Minimizing decision errors is not a simple issue: for any given sample size, the effort to reduce one type of error generally results in increasing the other type. The only way to reduce both types of error, without just improving the test, is to increase the sample size, and this may not be feasible.
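
The trade-off can be sketched numerically, assuming a one-sided z-test with known standard deviation (an illustrative setting, not one taken from the text): for a fixed sample size, tightening the type I error rate alpha inflates the type II error rate beta, while a larger sample shrinks beta at every alpha.

```python
# beta (type II error rate) for a one-sided z-test of H0: mu = 0 at level
# alpha, when the true mean is delta and sigma is known. Illustrative only.
from scipy.stats import norm

def type_ii_rate(alpha, n, delta=0.5, sigma=1.0):
    critical = norm.ppf(1 - alpha)                    # rejection threshold in z units
    return norm.cdf(critical - delta * n**0.5 / sigma)

for n in (20, 50, 100):
    for alpha in (0.10, 0.05, 0.01):
        print(f"n={n:3d}  alpha={alpha:.2f}  beta={type_ii_rate(alpha, n):.3f}")
```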

The extent to which a test shows that the ‘speculated hypothesis’ has (or has not) been nullified is called its ‘significance level’; the more stringent the significance level a result meets, the less likely it is that the phenomenon in question could have been produced by chance alone. British statistician Sir Ronald Aylmer Fisher stressed that the ‘null hypothesis’: ‘… is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.’ The probability that an observed positive result is a false positive (as contrasted with its being a true positive) may be calculated using Bayes’ theorem. The key insight of Bayes’ theorem is that the true rates of false positives and false negatives are not a function of the accuracy of the test alone, but also of the actual rate or frequency of occurrence of the condition within the tested population; often, the more powerful factor is the actual rate of the condition within the sample being tested.
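
The Bayes’ theorem point can be shown in a few lines of Python; the sensitivity, specificity, and prevalence figures below are invented for illustration and are not taken from the text.

```python
# P(condition absent | test positive) via Bayes' theorem: the chance that an
# observed positive is a false positive depends on the base rate of the
# condition, not only on the accuracy of the test. Numbers are illustrative.
def prob_false_positive(prevalence, sensitivity, specificity):
    p_pos_if_present = sensitivity
    p_pos_if_absent = 1 - specificity
    p_positive = (p_pos_if_present * prevalence
                  + p_pos_if_absent * (1 - prevalence))
    return p_pos_if_absent * (1 - prevalence) / p_positive

# The same 99%-sensitive, 99%-specific test, applied at different base rates:
for prevalence in (0.20, 0.01, 0.001):
    fp = prob_false_positive(prevalence, sensitivity=0.99, specificity=0.99)
    print(f"prevalence {prevalence:.3f}: P(false | positive) = {fp:.2%}")
```

Even with a test that is 99% sensitive and 99% specific, a positive result in a population where the condition is very rare is more likely than not to be a false positive.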

Security vulnerabilities are an important consideration in the task of keeping computer data safe while maintaining access to that data for appropriate users. Security measures attempt to avoid type I errors (false positives) that classify authorized users as impostors, and type II errors (false negatives) that classify impostors as authorized users. A false positive occurs when spam filtering or spam blocking techniques wrongly classify a legitimate email message as spam and, as a result, interfere with its delivery. While most anti-spam tactics can block or filter a high percentage of unwanted emails, doing so without creating significant false-positive results is a much more demanding task. A false negative occurs when a spam email is not detected as spam but is classified as non-spam. A low number of false negatives is an indicator of the efficiency of spam filtering.
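
As a sketch of how these spam-filtering errors are typically tallied (the messages and verdicts below are made up for illustration):

```python
# Count false positives (legitimate mail marked as spam) and false negatives
# (spam that slips through) by comparing the filter's verdicts to true labels.
actual    = ["spam", "ham", "spam", "ham", "ham",  "spam", "ham", "ham"]
predicted = ["spam", "ham", "ham",  "ham", "spam", "spam", "ham", "ham"]

false_positives = sum(a == "ham"  and p == "spam" for a, p in zip(actual, predicted))
false_negatives = sum(a == "spam" and p == "ham"  for a, p in zip(actual, predicted))

print("false positives (good mail blocked):", false_positives)  # 1
print("false negatives (spam delivered):   ", false_negatives)  # 1
```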

False positives are found routinely every day in airport security screening, which is ultimately a visual inspection system. The installed security alarms are intended to prevent weapons being brought onto aircraft, yet they are often set to such high sensitivity that they alarm many times a day for minor items such as keys, belt buckles, loose change, mobile phones, and tacks in shoes. The ratio of false positives (identifying an innocent traveler as a terrorist) to true positives (detecting a would-be terrorist) is therefore very high, and because almost every alarm is a false positive, the positive predictive value of these screening tests is very low. The relative costs of false results determine how willing the test designers are to let these events occur. Because the cost of a false negative in this scenario is extremely high (not detecting a bomb being brought onto a plane could result in hundreds of deaths) whilst the cost of a false positive is relatively low (a reasonably simple further inspection), the most appropriate test is one with low statistical specificity but high statistical sensitivity (one that allows a high rate of false positives in return for minimal false negatives).
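
That design choice can be sketched as a simple threshold decision; the score distributions below are invented for illustration and are not a model of any real scanner.

```python
# Lowering the alarm threshold raises sensitivity (fewer missed threats) at
# the cost of specificity (more false alarms on harmless items). Illustrative.
import numpy as np

rng = np.random.default_rng(1)
harmless = rng.normal(loc=1.0, scale=1.0, size=100_000)  # keys, belt buckles, ...
threats  = rng.normal(loc=3.0, scale=1.0, size=1_000)    # items that must be caught

for threshold in (3.0, 2.0, 1.0):
    sensitivity = np.mean(threats > threshold)    # fraction of threats detected
    specificity = np.mean(harmless <= threshold)  # fraction of harmless items passed
    print(f"threshold {threshold}: sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")
```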

In the practice of medicine, there is a significant difference between the applications of screening and testing. Screening involves relatively cheap tests that are given to large populations, none of whom manifest any clinical indication of disease (e.g., Pap smears). Testing involves far more expensive, often invasive, procedures that are given only to those who manifest some clinical indication of disease, and are most often applied to confirm a suspected diagnosis. For example, most states in the USA require newborns to be screened for phenylketonuria and hypothyroidism, among other congenital disorders. Although they display a high rate of false positives, the screening tests are considered valuable because they greatly increase the likelihood of detecting these disorders at a far earlier stage. Also, the simple blood tests used to screen possible blood donors for HIV and hepatitis have a significant rate of false positives; however, physicians use much more expensive and far more precise tests to determine whether a person is actually infected with either of these viruses.

Perhaps the most widely discussed false positives in medical screening come from the breast cancer screening procedure mammography. The US rate of false positive mammograms is up to 15%, the highest in the world. One consequence of the high false positive rate in the US is that, in any 10-year period, half of the American women screened receive a false positive mammogram. False positive mammograms are costly, with over $100 million spent annually in the U.S. on follow-up testing and treatment. They also cause women unneeded anxiety. As a result of the high false positive rate in the US, as many as 90–95% of women who get a positive mammogram do not have the condition. The lowest rate in the world is in the Netherlands, 1%. The lowest rates are generally in Northern Europe, where mammography films are read twice and a high threshold for additional testing is set (the high threshold decreases the power of the test).

False negatives and false positives are significant issues in medical testing. False negatives may provide a falsely reassuring message to patients and physicians that disease is absent when it is actually present. This sometimes leads to inappropriate or inadequate treatment of both the patient and their disease. A common example is relying on cardiac stress tests to detect coronary atherosclerosis, even though cardiac stress tests are known to detect only limitations of coronary artery blood flow due to advanced stenosis (abnormal narrowing of the blood vessel). False negatives produce serious and counter-intuitive problems, especially when the condition being searched for is common. If a test with a false negative rate of only 10% is used to test a population with a true occurrence rate of 70%, many of the negative results returned by the test will be false.
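
A quick back-of-the-envelope check of that last claim, under one added assumption that is not stated in the text (a specificity of 95%):

```python
# Share of negative results that are false, given the rates quoted above and
# an assumed specificity of 95% (illustrative assumption only).
prevalence          = 0.70   # true occurrence rate in the tested population
false_negative_rate = 0.10   # so sensitivity = 0.90
specificity         = 0.95   # assumed for illustration

false_negatives = prevalence * false_negative_rate
true_negatives  = (1 - prevalence) * specificity
share_false = false_negatives / (false_negatives + true_negatives)

print(f"{share_false:.1%} of negative results are false")  # roughly 1 in 5
```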

In systems theory an additional type III error is often defined: asking the wrong question and using the wrong null hypothesis. Florence Nightingale David, a sometime colleague of both Neyman and Pearson at University College London, made a humorous aside at the end of her 1947 paper, suggesting that, in the case of her own research, perhaps Neyman and Pearson’s ‘two sources of error’ could be extended to a third: ‘I have been concerned here with trying to explain what I believe to be the basic ideas [of my ‘theory of the conditional power functions’], and to forestall possible criticism that I am falling into error (of the third kind) and am choosing the test falsely to suit the significance of the sample.’ In 1948, Frederick Mosteller argued that a ‘third kind of error’ was required to describe circumstances he had observed, namely: ‘correctly rejecting the null hypothesis for the wrong reason.’

In 1957, Allyn W. Kimball, a statistician with the Oak Ridge National Laboratory, proposed his ‘error of the third kind’ as being ‘the error committed by giving the right answer to the wrong problem.’ Mathematician Richard Hamming expressed his view that ‘It is better to solve the right problem the wrong way than to solve the wrong problem the right way.’ Harvard economist Howard Raiffa described an occasion when he, too, ‘fell into the trap of working on the wrong problem.’ In 1974, Ian Mitroff and Tom Featheringham extended Kimball’s category, arguing that ‘one of the most important determinants of a problem’s solution is how that problem has been represented or formulated in the first place.’ They defined type III errors as either ‘the error … of having solved the wrong problem … when one should have solved the right problem’ or ‘the error … [of] choosing the wrong problem representation … when one should have … chosen the right problem representation.’ The 2009 book ‘Dirty Rotten Strategies’ by Ian I. Mitroff and Abraham Silvers described type III (developing good answers to the wrong questions) and type IV (deliberately selecting the wrong questions for intensive and skilled investigation) errors.

In 1969, Raiffa jokingly suggested ‘a candidate for the error of the fourth kind: solving the right problem too late.’ In 1970, L. A. Marascuilo and J. R. Levin proposed a ‘type IV error,’ which they defined in a Mosteller-like manner as the mistake of ‘the incorrect interpretation of a correctly rejected hypothesis’; which, they suggested, was the equivalent of ‘a physician’s correct diagnosis of an ailment followed by the prescription of a wrong medicine.’
