Problem With Running Significance Tests Over and Over Again
Abstract
Testing many null hypotheses in a single study results in an increased probability of detecting a significant finding just by chance (the problem of multiplicity). Debates have raged over many years with regard to whether to correct for multiplicity and, if so, how it should be done. This article first discusses how multiple tests lead to an inflation of the α level, and then explores the following different contexts in which multiplicity arises: testing for baseline differences in various types of studies, having >1 outcome variable, conducting statistical tests that produce >1 P value, taking multiple "peeks" at the data, and unplanned, post hoc analyses (i.e., "data dredging," "fishing expeditions," or "P-hacking"). It then discusses some of the methods that have been proposed for correcting for multiplicity, including single-step procedures (e.g., Bonferroni); multistep procedures, such as those of Holm, Hochberg, and Šidák; false discovery rate control; and resampling approaches. Note that these various approaches address different aspects and are not necessarily mutually exclusive. For example, resampling methods could be used to control the false discovery rate or the family-wise error rate (as defined later in this article). However, the use of one of these approaches presupposes that we should correct for multiplicity, which is not universally accepted, and the article presents the arguments for and against such "correction." The final section brings together these threads and presents suggestions with regard to when it makes sense to use the corrections and how to do so.
INTRODUCTION
In 2010, Bennett et al. (1) used fMRI, involving "a 6-parameter rigid-body affine realignment of the fMRI time series, coregistration of the data to a T1-weighted anatomical image, and 8 mm full-width at half-maximum (FWHM) Gaussian smoothing" (whatever all that means) to demonstrate that 3 particular voxels in the brain images showed significant signal changes when the subject was shown photographs of people expressing specific emotions in social situations. Perhaps the reason that this finding did not result in headlines in the major news sources was that the subject in this study was a salmon; not only that, it was a dead one. Another research finding that failed to garner significant press coverage was by Austin et al. (2). By using a database containing 223 of the most common diagnoses, they found that people born under the sign of Sagittarius had an increased risk of fractures of the humerus, whereas Leos had a higher probability of gastrointestinal hemorrhage than did those born under the other astrological signs combined.
Fortunately for the reputation of science (and the scientists), neither of these articles meant for their conclusions to be taken seriously. Rather, their aim was to highlight the problem of multiple comparisons, also known as multiplicity. This is an issue that arises when many statistical tests are performed within a single study; with each test that is run, the probability of finding statistical significance just by chance increases, so it becomes progressively more difficult to separate out true differences or associations from those due to chance. In this article, I discuss 1) why it is a problem, 2) under what circumstances multiplicity rears its head, 3) various ways of correcting for multiplicity, 4) the controversy with regard to correcting for multiplicity, and 5) offer some suggestions regarding when and how we should correct for it.
Bear in mind that this article is written from what is called the "frequentist" perspective; that is, that the probability of an event is determined by its relative frequency observed in a study. There are other perspectives, primarily the Bayesian, in which prior probabilities are taken into account, but I do not discuss these in this article.
WHY IS MULTIPLICITY A PROBLEM?
If we adopt an α level of 0.05, then by definition, assuming that all of the null hypotheses (H0s) are true, on average 5% of the statistical tests will show a significant difference or association. The more tests that are run, the greater the likelihood that at least one will be significant by chance, and the question that arises is the probability that this will occur. If 3 tests are conducted, each of which can have 1 of 2 results (significant or not), there are 2^3 = 8 possible outcomes, which are shown in the left-most columns of Table 1, labeled "Test result." The probabilities associated with each result are shown in the next 3 columns, called "Test probability," and the last column is the probability of that outcome. So, the probability for outcome 2 (not significant, not significant, significant) would be 0.95 × 0.95 × 0.05 = 0.045125. Note that there are 2 essential assumptions to these calculations: 1) all of the null hypotheses are true and 2) all of the statistical tests are valid under the null, meaning that they yield P-value distributions that are uniform on the interval [0,1].
TABLE 1
The 8 possible outcomes of 3 tests assuming the null hypothesis is true
| | Test result | | | Test probability | | | |
| Outcome | A | B | C | A | B | C | Outcome probability |
| 1 | NS | NS | NS | 0.95 | 0.95 | 0.95 | 0.857375 |
| 2 | NS | NS | Sig | 0.95 | 0.95 | 0.05 | 0.045125 |
| 3 | NS | Sig | NS | 0.95 | 0.05 | 0.95 | 0.045125 |
| 4 | NS | Sig | Sig | 0.95 | 0.05 | 0.05 | 0.002375 |
| 5 | Sig | NS | NS | 0.05 | 0.95 | 0.95 | 0.045125 |
| 6 | Sig | NS | Sig | 0.05 | 0.95 | 0.05 | 0.002375 |
| 7 | Sig | Sig | NS | 0.05 | 0.05 | 0.95 | 0.002375 |
| 8 | Sig | Sig | Sig | 0.05 | 0.05 | 0.05 | 0.000125 |
| Total | | | | | | | 1.000000 |
We can answer the question of the number of outcomes with at least 1 significant finding when there really is none by adding up the probabilities of all of the rows that have one or more of them (i.e., rows 2–8). We can, but this can become laborious when there are 5 tests (2^5 = 32 outcomes) and borders on the masochistic when there are 10 of them (2^10 = 1024 outcomes). Fortunately, there's a much easier way. No matter how many tests there are, the sum of the outcome probabilities is always 1; that is, there is a 100% chance that one of those outcomes will occur. So, we can simply subtract the result from the first row (no significant tests) from 1, and we will get the same answer. In other words:

Pr(at least 1 significant test) = 1 − Pr(no significant tests),

where Pr means "probability."
The probability in row 1 is 0.95^3. To generalize a bit, if there were k tests performed, then the probability would be 0.95^k. We can generalize even further. We said that the test would be wrong 5% of the time, but we're not limited to that; we can use whatever value, such as 1% or 10%. If we designate the false-positive rate as α, then the probability is (1 − α)^k. That means that the probability of at least one test being significant is:

Pr(at least 1 significant test) = 1 − (1 − α)^k.

The more tests there are, the greater the probability that one will be significant, as shown in Figure 1. If you run enough tests, you're almost guaranteed to find something significant.
Figure 1
Probability that at least one test will be positive, for varying numbers of tests and an α = 0.05.
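To make the arithmetic concrete, the short Python sketch below (added here for illustration; it is not part of the original article) enumerates the 8 outcomes of Table 1 and then evaluates 1 − (1 − α)^k for the numbers of tests that Figure 1 plots:

```python
from itertools import product

alpha = 0.05

# Enumerate the 2^3 = 8 possible patterns of significant/not significant
# results for 3 tests, assuming every null hypothesis is true (Table 1).
total = 0.0
at_least_one = 0.0
for pattern in product([False, True], repeat=3):   # True = significant
    prob = 1.0
    for significant in pattern:
        prob *= alpha if significant else (1 - alpha)
    total += prob
    if any(pattern):
        at_least_one += prob

print(round(total, 6))                 # 1.0: the 8 outcome probabilities sum to 1
print(round(at_least_one, 6))          # 0.142625: the sum of rows 2-8 of Table 1
print(round(1 - (1 - alpha) ** 3, 6))  # the same value, via 1 - Pr(no significant tests)

# The quantity plotted in Figure 1: 1 - (1 - alpha)^k for increasing k
for k in (1, 5, 10, 20, 50, 100):
    print(k, round(1 - (1 - alpha) ** k, 3))
```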
WHEN MULTIPLICITY CAN ARISE
Multiplicity can arise under a number of different circumstances. It is important to differentiate among them, because the answer to whether or not to correct differs from one situation to the next. The various situations are:
-
1) Testing for baseline differences in a randomized controlled trial (RCT).2 The variables usually consist of various demographic factors, as well as variables that may be possible confounders.
-
2) Looking for differences between or among groups on a number of outcome measures.
-
3) Running a statistical procedure that yields >1 P value, such as factorial or repeated-measures ANOVA, multiple regression, and so on.
-
4) Peeking at data. This involves analyzing the results before all of the participants have been entered to decide whether more people need to be added to reach significance.
-
5) Interim analyses. These are most often planned ahead of time to see if the study should be ended early.
-
6) Fishing expeditions; that is, unplanned searches for differences between groups or relations among variables, as well as unplanned subgroup analyses.
Within this list of situations, we can make a number of distinctions. The first is between confirmatory data analysis and exploratory data analysis. The former consists of testing hypotheses that have been specified a priori, and the results dictate whether the study is deemed successful or not. In contrast, the researcher does not have explicit hypotheses in the latter case but is rather searching for relations or differences after the fact. The distinction is clear-cut at the extremes but can become blurry in practice. For instance, the Cardiovascular Health Study (3) was a prospective cohort study looking at the relation between metabolic syndrome and cardiovascular disease. Without recourse to the original proposal (assuming it exists), it is hard to know if subsequent analyses of the data looking at the effects of gender and race are exploratory or confirmatory. This is one reason that some journals now insist on seeing the proposal as part of the review process.
The second distinction that is sometimes made is between the family-wise error rate (FWER; i.e., the number of false discovery errors in a family of tests) and the experiment-wise error rate (i.e., the number of false positives in the entire study). Imagine that we did a study in which participants were randomly assigned to one of 4 diets, and we are looking at 2 different outcomes after 6 mo: the change in BMI and satisfaction with the diet, measured on some scale. Each of the variables would be analyzed with a 1-factor ANOVA. A significant F ratio would indicate that there is a difference between the groups but would not tell us where the difference lies. To determine that, we would have to do 6 post hoc tests—group 1 compared with 2, group 1 compared with 3, group 1 compared with 4, group 2 compared with 3, group 2 compared with 4, and group 3 compared with 4—and each of those would have its own P level. This would be referred to as a family of tests. On the other hand, the study as a whole has 2 outcome measures, each tested with its own ANOVA, so that there may be inflation of the α level at the experiment level.
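As a concrete illustration of such a family of tests, here is a minimal Python sketch; the data are invented purely for illustration, and plain t tests with a Bonferroni threshold stand in for the dedicated post hoc procedures described in the next section:

```python
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical change-in-BMI data for the 4 diet groups (made-up values)
group_means = {"1": -1.0, "2": -1.2, "3": -2.0, "4": -0.5}
groups = {g: rng.normal(loc=mu, scale=1.5, size=25) for g, mu in group_means.items()}

# Omnibus 1-factor ANOVA: tells us whether the groups differ, not where
f_stat, p_omnibus = stats.f_oneway(*groups.values())
print(f"omnibus F = {f_stat:.2f}, P = {p_omnibus:.4f}")

# The family of 6 post hoc pairwise comparisons, each with its own P level
for a, b in combinations(groups, 2):
    t_stat, p = stats.ttest_ind(groups[a], groups[b])
    flag = " *" if p < 0.05 / 6 else ""   # significant at the Bonferroni-adjusted level
    print(f"group {a} vs group {b}: P = {p:.4f}{flag}")
```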
Yet again, however, reality rears its ugly head to blur this distinction, at 3 levels. First, these terms are not used consistently from one article to the next; some authors use the 2 terms as synonyms and would describe the latter situation as a family of tests because they arose from the same study or experiment. Second, what constitutes an "experiment"? Would it include later replications? All of the studies conducted by that author examining the same question? Studies of the same question by other research groups? There is no easy answer. Finally, as we will discuss later, if we correct for post hoc tests, as in the case of the ANOVA, why don't we correct for the multiple t tests that accompany a multiple regression? In this article, we will use the term FWER to encompass both definitions.
HOW TO CORRECT FOR MULTIPLICITY
Given all the ways in which multiplicity can arise, how can we correct for it? In fact, there are many ways, which can roughly be divided into 4 areas: post hoc tests run after a significant ANOVA, those that try to correct for many independent analyses in a study, those that try to control the false discovery rate (FDR; which will be defined a bit later), and those based on resampling procedures.
There are a multitude of post hoc tests, such as the Studentized range test, Fisher's least significant difference, Tukey's honestly significant difference and his wholly significant difference, the Newman-Keuls test, Dunnett's t, and many others. All of them are variations of t tests, with different ways of trying to control the overall α level (4). The Newman-Keuls (also called the Student-Newman-Keuls, or S-N-K) is the default option in programs such as SPSS, perhaps because it is the most powerful of the techniques (i.e., has the highest likelihood of finding a difference between means) but does not control the FWER when there are >3 groups (5).
Among the techniques that correct for all of the statistical tests run in a study, arguably the most widely used one is the Bonferroni correction, which is an application of the Bonferroni inequality. It was named after the Italian mathematician Carlo Emilio Bonferroni and probably was first introduced into the statistical world by Olive Dunn (6, 7). Its popularity is due to a number of factors. First, it is the essence of simplicity. If we want to preserve an overall FWER of αFWER, then we divide α by the number of tests being done (k). That is:

αB = αFWER / k.
So, if there were 10 statistical tests and we want to restrain the FWER at 0.05, we would apply a Bonferroni-adjusted α level (αB) of 0.05/10 = 0.005 for each. Second, it is very flexible and can be used with any type of statistical test, not just ANOVAs.
Unfortunately, there is a steep price to pay for these benefits, and that is the extremely conservative nature of the correction. As the probability of a type I error (stating there was an effect when none was present) decreases, that of a type II error (concluding that there was no effect of the intervention or no association between variables when in fact there is one) increases. This loss of power is due to a number of reasons. First, it assumes that the null hypothesis is true for all of the tests, and this is unreasonable, most especially after a significant omnibus F test. Second, it assumes that all of the tests are independent, which is not true when pairwise comparisons are run, as is the case with the post hoc tests after an ANOVA. For example, if there are 3 groups, A, B, and C, then the comparisons would consist of A vs. B, A vs. C, and B vs. C.
A number of modifications to the Bonferroni have been proposed, such as the Šidák-Bonferroni (8), which uses the following value:

αS–B = 1 − (1 − α)^(1/k).
However, the result is very similar to the simpler Bonferroni value; if there are 10 tests, then αB = 0.005, whereas the Šidák-Bonferroni–adjusted α level (αS–B) = 0.00511, which is one reason it is rarely used.
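As a minimal sketch (added for illustration, not taken from the article), here are the two adjusted per-test α levels for the 10-test example just described:

```python
def bonferroni_alpha(alpha_fwer, k):
    """Per-test alpha that keeps the family-wise error rate at alpha_fwer."""
    return alpha_fwer / k

def sidak_alpha(alpha_fwer, k):
    """Sidak-Bonferroni per-test alpha; exact when the k tests are independent."""
    return 1 - (1 - alpha_fwer) ** (1 / k)

print(bonferroni_alpha(0.05, 10))  # 0.005
print(sidak_alpha(0.05, 10))       # about 0.0051, barely different from the Bonferroni value
```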
One difficulty with the Bonferroni class of corrections is that they become increasingly conservative if the outcomes are correlated with one another. Another problem is more philosophical and arises in areas in which many statistical tests are performed (sometimes running into the thousands), often with relatively few subjects, such as genomics and brain scanning. A certain number of false-positive results is tolerable, because they would be discarded when the study is replicated. The more relevant quantity to control is the positive FDR (pFDR); that is, the proportion of false positives among the set of rejected null hypotheses (which are referred to as discoveries).
The difference between controlling the FWER and the pFDR can be seen by referring to Table 2. When the aim is to control the FWER, we are concerned with the proportion of type I errors or false discoveries (cell C) relative to the total number of true null hypotheses (cells A + C). However, when the objective is to control the pFDR, the concern is the proportion in cell C relative to the total number of rejected null hypotheses (cells C + D). Thus, the pFDR is the expected proportion of false positives among all of the significant statistical tests. Phrased another way, if we use an α level of 0.05, then we expect that ≤5% of all tests will result in type I errors. However, by using the pFDR approach, we expect that ≤5% of the significant tests will be false positives. Benjamini and Hochberg (9), who coined the term FDR, stated that this approach is more powerful than the Bonferroni and is better at separating the important few from the many trivial effects tested.
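The step-up rule that Benjamini and Hochberg proposed is not spelled out in the text above; as a rough Python sketch (the P values are made up for illustration), it works as follows:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up rule: keep the expected proportion of false
    discoveries among the rejected null hypotheses at or below q."""
    p = np.asarray(pvals, dtype=float)
    k = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, k + 1) / k       # q * i/k for the i-th smallest p
    passing = np.nonzero(p[order] <= thresholds)[0]
    reject = np.zeros(k, dtype=bool)
    if passing.size:
        reject[order[:passing.max() + 1]] = True   # reject every p up to the largest that passes
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.490]
print(benjamini_hochberg(pvals))   # rejects the 2 smallest; Bonferroni (0.05/10 = 0.005) rejects only 1
```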
TABLE 2
The outcomes from many tests of significance
| | H0 true | H0 false | Total |
| Not called significant | A: true negative | B: false negative | A + B |
| Called significant | C: false discovery | D: true discovery | C + D |
| Total | A + C | B + D | k |
The Holm, or Holm-Bonferroni, procedure (10) was proposed to try to mitigate the problem of the often-conservative nature of the Bonferroni correction. Although it was developed before the term FDR was introduced, it can be seen as the first—and now perhaps best known—of the techniques to control the pFDR. In contrast to the single-step Bonferroni technique, the Holm method is a sequential, step-down one, in which the per-test α level is changed for each test. If there are k significance tests, then their associated P levels are rank ordered, so that P1 ≤ P2 ≤ … ≤ Pk. The first value is evaluated against the criterion of α/k. If it is significant, then the next uses the criterion of α/(k − 1). This is continued with decreasing values of k until a P value is reached that is not significant, and then it and all larger values of P are declared nonsignificant. This somewhat labor-intensive task has been made easier by the availability of a number of free online and downloadable computer programs.
The Hochberg method (11), on the other hand, is a step-up procedure, which begins by testing Pk (i.e., the largest P level) against the criterion of α. If it is significant, testing is stopped, and all smaller P levels will also be significant. If the result is not significant, then Pk−1 is evaluated with α/2. Again, if the result is significant, testing is stopped and P levels smaller than this would be significant. If it is not significant, then Pk−2 is compared against α/3, and so forth, until significance is reached. Of the 3 procedures, the Bonferroni is the most conservative, the Hochberg the least, and the Holm method falls between them.
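The two stepwise procedures are easy to sketch in code; the following is a minimal Python illustration (the P values are invented) rather than a full implementation:

```python
import numpy as np

def holm(pvals, alpha=0.05):
    """Holm step-down procedure; returns True where H0 is rejected."""
    p = np.asarray(pvals, dtype=float)
    k = len(p)
    order = np.argsort(p)                    # indices from smallest to largest p
    reject = np.zeros(k, dtype=bool)
    for step, idx in enumerate(order):       # step 0 tests the smallest p against alpha/k
        if p[idx] <= alpha / (k - step):
            reject[idx] = True
        else:
            break                            # first nonsignificant p: it and all larger ones fail
    return reject

def hochberg(pvals, alpha=0.05):
    """Hochberg step-up procedure; returns True where H0 is rejected."""
    p = np.asarray(pvals, dtype=float)
    k = len(p)
    order = np.argsort(p)
    reject = np.zeros(k, dtype=bool)
    for step in range(k - 1, -1, -1):        # start with the largest p, tested against alpha/1
        if p[order[step]] <= alpha / (k - step):
            reject[order[:step + 1]] = True  # it and all smaller p values are significant
            break
    return reject

pvals = [0.040, 0.045]
print(holm(pvals))      # [False False]: 0.040 > 0.05/2, so nothing is rejected
print(hochberg(pvals))  # [ True  True]: 0.045 <= 0.05/1, so both are rejected
```

With these 2 P values the Bonferroni criterion (0.025 per test) and the Holm procedure reject nothing, whereas the Hochberg procedure rejects both, consistent with the ordering of conservatism just described.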
A different class of approaches to correcting for multiplicity is based on resampling procedures. One variant of this technique, called bootstrapping, was introduced by Efron and Tibshirani (12). It involves drawing repeated random samples from the data (sometimes numbering 1000 or even 10,000 samples) and computing the parameters for each sample (e.g., the mean), which can then be averaged, thus yielding estimates of the variability of the parameters across different bootstrapped samples. It may seem somewhat odd drawing so many samples from a small data set, but it's possible because the samples are drawn with replacement. That means that if participant number 24 is selected, we actually leave her data in the data set where it can be drawn again. Thus, each sample is somewhat different from the others. Because of this, it was not feasible until desktop computers became widely available and software was developed that can perform the calculations. It was first applied to the issue of multiplicity by Westfall and Young (13, 14) and gained popularity in the field of genetics, where microarray studies can generate thousands of hypotheses tested simultaneously (15).
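As a bare-bones sketch of the resampling idea itself (the data and the number of resamples are invented; this illustrates the basic bootstrap, not Westfall and Young's multiplicity adjustment):

```python
import numpy as np

rng = np.random.default_rng(42)

# A hypothetical small sample, e.g., change in BMI for 12 participants (made-up values)
sample = np.array([-1.2, 0.4, -0.8, -2.1, 0.0, -1.5, 0.7, -0.9, -1.1, -0.3, -1.8, 0.2])

n_boot = 10_000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # Draw a resample of the same size *with replacement*: a selected participant's
    # value stays in the data set and can be drawn again.
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[b] = resample.mean()

print(sample.mean())                           # observed mean
print(boot_means.std(ddof=1))                  # bootstrap estimate of the SE of the mean
print(np.percentile(boot_means, [2.5, 97.5]))  # simple percentile 95% CI
```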
The major advantage of resampling approaches is that they take into account the estimated dependency among the test statistics, which is not true for some other corrections for multiplicity (e.g., the Bonferroni correction), and therefore they tend to be more powerful (less conservative). Resampling methods also make minimal assumptions about the underlying distribution. They can also be applied to single-step (i.e., Bonferroni-type), multistep (Holm and Hochberg), and FDR approaches. The mathematics of resampling techniques are beyond the scope of this article; interested readers should read one of the cited texts (13–15).
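To show how resampling can fold that dependency into the correction, here is a hedged sketch of a single-step, maxT-style permutation adjustment in the spirit of Westfall and Young (13, 14); the function name, simulated data, and number of permutations are illustrative assumptions rather than their actual algorithm:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def max_t_adjusted_p(group_a, group_b, n_perm=5000):
    """Single-step maxT resampling adjustment for comparing 2 groups on several
    (possibly correlated) outcomes; arrays have shape (n_subjects, n_outcomes)."""
    t_obs = np.abs(stats.ttest_ind(group_a, group_b, axis=0).statistic)
    data = np.vstack([group_a, group_b])
    n_a = group_a.shape[0]
    max_t = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(data.shape[0])            # shuffle the group labels
        a, b = data[perm[:n_a]], data[perm[n_a:]]
        max_t[i] = np.abs(stats.ttest_ind(a, b, axis=0).statistic).max()
    # Adjusted P: how often the largest permuted statistic exceeds each observed one
    return np.array([(max_t >= t).mean() for t in t_obs])

# Hypothetical example: 2 groups of 20, 3 highly correlated outcomes, no true effect
cov = [[1.0, 0.8, 0.8], [0.8, 1.0, 0.8], [0.8, 0.8, 1.0]]
a = rng.multivariate_normal([0, 0, 0], cov, size=20)
b = rng.multivariate_normal([0, 0, 0], cov, size=20)
print(max_t_adjusted_p(a, b))
```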
SHOULD WE CORRECT FOR MULTIPLICITY?
The discussion of how to correct for multiplicity has made the implicit assumption that we should correct for it, but this is by no means a position accepted by everyone. In its favor, Moyé (16) holds that "Type I error accumulates with each executed hypothesis test and must be controlled by the investigators" (p. 354); Cormier and Pagano (17) state that "The more tests we want to make, the more conservative we have to be in order to preserve our overall significance level α" (p. 333); and Blakesley et al. (18) write that "Failure to control type I errors when examining multiple outcomes may yield false inferences, which may slow or sidetrack research progress" (p. 256).
On the other side of the debate, Rothman (19) argues that correcting for multiplicity is predicated on 2 assumptions: 1) that the principal cause of unusual findings is chance and 2) that no one would want to further investigate phenomena that may have been caused by chance. Rothman disputes both of these beliefs. With regard to the first, he states that "Scientists assume instead that the universe is governed by natural laws, and that underlying the variability that we observe is a network of factors related to one another through causal connections. To entertain the universal null hypothesis is, in effect, to suspend belief in the real world and thereby to question the premise of empiricism" (p. 45). As for the second, he writes that "Being impressed by an extreme finding should not be considered a mistake in a universe brimming with interrelated phenomena. The possibility that we may be misled is inherent to the trial-and-error process of science; we might avoid all such errors by eschewing science completely, but then we learn nothing" (p. 46).
The danger is that, by correcting for multiplicity, we increase the probability of a type II error, and thus may overlook potentially interesting findings. It should be noted, though, that the argument against correcting rests on the assumption that another group will attempt to replicate any chance findings and fail. However, some journals will not publish replication studies, and many tenure and promotions committees place more value on original research, so this correction process may not take place.
What Rothman overlooks is that the purpose of null hypothesis significance testing is not simply to reject or not reject the null hypothesis but also to determine both the direction of the effect and its magnitude. Furthermore, as Cohen points out in his delightfully titled paper, "The earth is round (p < .05)" (20), there is a difference between the null hypothesis (the hypothesis to be nullified) and the nil hypothesis (nothing is going on). Most often, the two are the same, but they need not be. We can state that our null hypothesis is that a correlation is at or below a certain value (as is often done in determining the reliability of a scale) or that a difference between groups is a given amount or more (as is done in noninferiority trials).
A different argument against correcting for multiplicity is offered by Schulz and Grimes (21). Suppose that an RCT of preoperative parenteral nutrition resulted in an increase in noninfectious complications (e.g., 22), and that this finding was significant at the 0.04 level. Now also assume that the investigators looked at a second outcome, length of hospitalization, which was also significant at the 0.04 level. As would be expected, these 2 outcomes are highly correlated, which is the case for many endpoints in clinical trials. The fact that both outcomes resulted in significant findings in the same direction might reinforce our confidence in the results. However, were we to apply a Bonferroni-type correction, neither result would be significant, which is counterintuitive. However, a weakness of this argument is that the 2 (or more) dependent variables are unlikely to be independent if they are measures of conceptually related outcomes. Taken to the extreme, imagine testing for a treatment effect on weight loss in pounds and then testing for a treatment effect in the same data set on weight loss in kilograms. Because kilograms is just a linear transformation of pounds, the 2 tests will yield identical P values, so obtaining 2 P values of 0.04 would not tell us anything different from one P value of 0.04. This example may be absurd, but it points out the difficulty in trying to argue that testing multiple related outcomes in some manner obviates the multiple testing problem. It is true that when the outcomes are highly related, use of a multiple testing procedure that incorporates that dependency into the correction may be apt. Both sides of the argument make valid points, leading one prominent statistician, Doug Altman (23), to conclude that "It is hard to see views such as [the ones just cited] being reconciled" (p. 2383).
WHEN MIGHT WE CORRECT OR NOT CORRECT FOR MULTIPLICITY
Perhaps a way out of this quandary can be found by looking at the various conditions in which multiple hypothesis testing arises, described earlier, and discussing when it may or may not make sense to correct for multiplicity. The first situation occurs when we look for baseline differences in an RCT. This is rationalized for 2 reasons: as a check on the randomization procedure and to determine whether any variables should be used as covariates in subsequent analyses. We can actually deal with this quite easily—it shouldn't be done. Despite the fact that such testing is almost universally reported as the first table in any RCT, we (4) and others (24) believe that it is misguided on both grounds. When we run a statistical test after an intervention, we are testing 2 hypotheses: that any difference was due to chance (H0) or that it was due to the fact that the groups were treated differently (the alternative hypothesis). However, at baseline, there is no alternative; if differences exist, they must be due to chance (assuming that the randomization process has not been subverted, either in some nefarious way or innocently, such as by replacing dropouts or not reading the instructions). Hence, the P level is meaningless; the probability that chance was responsible is 100%, irrespective of the value printed out by the computer. The second argument, that variables that differ at baseline should be used as covariates, is equally misguided. Whether or not a variable should be included as a covariate should be based on theory or previous knowledge of its influence on the outcome and not on inspection of baseline differences (25). Furthermore, because including covariates usually reduces the within-group variance and increases the precision of the estimate of the treatment effect (4), it is often recommended to include prognostic variables as covariates, even if they do not differ significantly between the groups.
Looking for group differences among the outcome variables, the second situation, is the most fraught with difficulties because it is on the basis of these analyses that the study is said to have provided evidence to reject the main null hypotheses or not, and it is here that the differences between the varying viewpoints are sharpest. To reiterate, the debate is between finding differences that are actually due to chance compared with overlooking potentially useful findings. The most sensible advice was given by Schulz and Grimes (21): "Researchers should restrict the number of primary endpoints tested. They should specify a priori the primary endpoint or endpoints in their protocol" (p. 1592). That is, the problem of correcting for multiplicity is eliminated by making it unnecessary; there are only a small number of endpoints (ideally, one) and they will likely be correlated. Because of this, Bonferroni-type corrections could undermine the conclusions, as pointed out earlier, because the outcomes would reinforce each other rather than having one lessen the significance of the other (bearing in mind the injunction that the variables should not simply be the same outcome measured in different ways). Their injunction about specifying the outcomes a priori is to preclude substituting a significant secondary result for a primary one that was not significant; a practice that was (and is) all too common (e.g., 26). It is for this reason that many medical journals now require all trials to be registered before the first patient is enrolled, and some journals require authors to include the protocol when the results are submitted for publication.
However sensible Schulz and Grimes' (21) advice is with regard to limiting the number of outcomes, it is infeasible when evaluating complex interventions. These are defined as ones with "several interacting components" (27, p. 6) and are often used in health services, public health, and social policy research. The guidelines promulgated by the Medical Research Council in Great Britain state that "Identifying a single primary outcome may not make best use of the data; a range of measures will be needed, and unintended consequences picked up where possible" (27, p. 7). For example, the Moving to Opportunity project (28) was an RCT that evaluated the effect of neighborhood on the development of obesity and diabetes. There was a wide range of outcomes indicating health outcomes, including height, weight, and concentration of glycated hemoglobin. Because these outcomes were all specified a priori in the protocol, there was no correction for multiplicity.
The third situation, post hoc tests after a significant omnibus test, is a somewhat confusing one. On the one hand, most statistical packages provide ≥1 of the post hoc tests mentioned earlier (e.g., Newman-Keuls, the honestly significant difference) after an ANOVA, and this appears to be accepted practice. On the other hand, multiple linear regression with categorical predictor variables is mathematically identical to ANOVA (29), whereby group differences are reflected in the b or β weights of a dummy-coded variable. However, it is highly unusual to see any correction for multiplicity applied to the t tests of these weights. Why this difference? In the words of Tevye the Milkman (from Fiddler on the Roof), Tradition!
Actually, this "explanation" is less facetious than it may first appear. Nearly 60 y ago, Lee J Cronbach (30) wrote about the 2 "disciplines" of psychology: the experimental and the observational. The former tries to exert all possible control over a study, views within-group variance as noise to be minimized as much as possible, and examines just a few hypotheses at a time; in contrast, the latter looks at nature as it is, welcomes variance as necessary to explore between-person differences, and studies many variables at the same time (hence leading to the aphorism, "One person's error variance is another person's occupation"). The experimentalists favored ANOVA-type statistics, whereas the observationists relied primarily on correlations and regressions. This may be the origin of the difference, with the first discipline concerned about spurious findings and the second welcoming unexpected results, and is reflected in the more contemporary differing viewpoints of Blakesley et al. (18), on the one hand, and Rothman (19), on the other.
Situation 4 involves "peeking" at the data; that is, analyzing them partway through the study to determine whether the sample size needs to be increased. This practice is most definitely one that should be avoided entirely. The issue, as Armitage et al. (31) pointed out, is that, assuming that H0 is true, the probability of a significant test result grows rapidly with each analysis. The problem is compounded by the fact that the sample size is increased after each nonsignificant result, making the likelihood of finding significance even greater. Eventually, given enough peeks at the data, the researcher will find what he or she is looking for. The only accepted practice is to determine the sample size a priori, and to stick with that.
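A small simulation sketch (sample sizes, number of looks, and number of simulated trials are arbitrary choices for illustration) shows how quickly repeated peeking inflates the type I error rate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_sims, alpha = 5000, 0.05
start_n, step, max_looks = 20, 20, 5      # per-group sizes and number of "peeks"

false_positives = 0
for _ in range(n_sims):
    # H0 is true: both groups are drawn from the same distribution
    a = list(rng.normal(0, 1, start_n))
    b = list(rng.normal(0, 1, start_n))
    for _look in range(max_looks):
        if stats.ttest_ind(a, b).pvalue < alpha:
            false_positives += 1          # the researcher stops as soon as P < 0.05
            break
        a.extend(rng.normal(0, 1, step))  # otherwise, add participants and peek again
        b.extend(rng.normal(0, 1, step))

print(false_positives / n_sims)           # well above the nominal 0.05
```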
The fifth situation, interim analyses, also involves peeking at the data, but is most often built into many large RCTs at the design stage (32). The rationale is primarily an ethical one. If an analysis partway through the study shows that the new intervention will not prove to be superior to the comparison (either placebo or treatment as usual) when the full sample size is reached, or if one group has significantly more adverse outcomes than the other, it would be unethical to continue with the study. The lack of effectiveness was the reason that the tolbutamide and diet arm of the University Group Diabetes Program (33) was dropped halfway through the trial. Conversely, if the new intervention is clearly superior, then it would be equally unethical to deny it to those in the comparator condition. For instance, the Multicenter Automatic Defibrillator Implantation Trial (34) was ended early because the interim analysis showed a significantly greater reduction in all-cause mortality for those given an implantable cardioverter-defibrillator.
Various schemes have been proposed to protect the overall α level, including the use of the same significance level for each interim analysis (35) or a gradually decreasing criterion (36). Common practice now appears to be to split α into 2 parts, α1 for the interim analysis and α2 for the final one (assuming only 1 interim analysis), so that α1 + α2 = 0.05. Following the recommendation of Peto et al. (37), a very stringent criterion is used for α1 (e.g., <0.001). The rationale for this is that trials that are ended early tend to overestimate the effect size and underestimate the CI (38). As a result, interim analyses should be used primarily when 1) there is serious risk of harm from adverse events or from delaying the new treatment and 2) when they have been built into the study from the beginning for reasons of efficiency, and then 3) used only with the greatest of caution.
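The effect of splitting α in this way can be checked with a simulation sketch (one interim look at half the data; the sample size and number of simulated trials are arbitrary choices for illustration, not a recommended design):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def overall_type_1_error(alpha_interim, alpha_final, n_per_group=100, n_sims=10_000):
    """Simulate a 2-group trial with one interim look at half the data (H0 true);
    return how often the trial is declared positive at either look."""
    positives = 0
    half = n_per_group // 2
    for _ in range(n_sims):
        a = rng.normal(0, 1, n_per_group)
        b = rng.normal(0, 1, n_per_group)
        if stats.ttest_ind(a[:half], b[:half]).pvalue < alpha_interim:
            positives += 1      # stopped early for apparent benefit
        elif stats.ttest_ind(a, b).pvalue < alpha_final:
            positives += 1      # significant at the final analysis
    return positives / n_sims

print(overall_type_1_error(0.001, 0.049))  # close to 0.05: the stringent interim criterion barely touches alpha
print(overall_type_1_error(0.05, 0.05))    # noticeably above 0.05: an uncorrected peek inflates the error rate
```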
It is in the final situation, involving unplanned and subgroup analyses, that the tension between chance findings on the one hand and overlooking potentially interesting observations on the other is most acute. The problem in fact goes beyond simply doing and reporting a large number of analyses after the fact. As Gelman and Loken (39) pointed out, even without going on a "fishing expedition," a study can have a large number of "researcher degrees of freedom," potentially involving choice of statistical test and whether and how to categorize continuous variables, nonlinear transformations, outlier removal, etc. The (often undisclosed) use of such techniques in search of statistically significant results is referred to as "P value fiddling" or "P-hacking." That is, there is always a very large number of potential analyses (e.g., analyzing the data by gender, age, comorbidity status, previous history), of which the researcher may perform only a few, but the choice is often conditional on the data themselves. Even a cursory look at the data may lead the researcher to decide that it is not worth running some analyses because the difference looks quite small. The issue is that although no formal statistical procedures have been performed, informal, "eyeball" tests were run. Thus, there is a greater likelihood that the statistical tests that actually were run will be significant. Compounding the problem, the comparisons that were rejected as not likely to be fruitful are not counted when correcting for multiplicity, further increasing the probability of a type I error.
Perhaps the most prudent course of action in these circumstances would consist of 3 parts. First, there should be a correction for the number of tests that were actually performed. The correction could be either a Bonferroni-type one or a pFDR type. Second, both the corrected and uncorrected P levels should be reported, so that the readers are able to determine for themselves which tests they would regard as statistically significant or not. Finally, any conclusions based on such analyses should be clearly reported as tentative and hypothesis generating, rather than as hypothesis testing.
SUMMARY AND RECOMMENDATIONS
Whether or not to correct for multiplicity is an issue that shows no sign of early resolution. The arguments on both sides are compelling. Not correcting for it increases the probability of spurious significant findings, possibly resulting in time and resources being wasted chasing down false leads. On the other hand, correcting for multiplicity may have the opposite effect, in which potentially interesting observations are discarded as chance findings. After reviewing the various situations in which multiplicity can arise, the following recommendations are offered, in the full realization that they may be contested and debated:
-
1) The decision regarding whether or not to correct for multiple testing is a philosophical one, and there is no way to prove that correcting is or is not the right thing to do unless one specifies one's values (e.g., one's preferences regarding the potential commission of certain types of errors) in advance.
-
2) In determining whether groups differ at baseline in an RCT, the significance levels are meaningless (unless one suspects that the randomization was not legitimately implemented) and therefore significance testing should not be done.
-
3) When assessing the outcomes of a clinical trial, some degree of judgment is necessary. Ideally, there will be only a small number of outcomes, which have been specified a priori and are most likely correlated with one another. In such cases, correcting for multiplicity may be judged to be unnecessary and counterproductive; if the outcomes are in the same direction, they would strengthen confidence in the results. However, if many endpoints are used, some investigators would find the potential for FWER inflation unacceptable.
-
4) Correcting for many P values within a single statistic, such as complex ANOVA designs and multiple regression, appears to be dictated more by habit and tradition than by logic. Post hoc tests corrected for multiple testing are routinely used in the former instance and almost never in the latter, even though the two techniques are mathematically identical. It is unlikely that these practices will change in the foreseeable future.
-
5) "Peeking" at the data to determine whether a larger sample size is needed is poor practice and should never be done unless a preplanned interim analysis strategy is used that protects the overall α level. On the other hand, interim analyses are an integral part of many RCTs to decide whether a trial should be ended early because of futility (i.e., the intervention will not show benefit even with the full sample size), an excess of adverse events in one group, or the clear superiority of the intervention, meaning that it would be unethical to withhold it from the comparison group. However, because trials that are ended early often overestimate the true effect size and underestimate the width of the CI, this should be done only using methods designed for such situations (e.g., 40).
-
6) When conducting unplanned, post hoc analyses of the data, including subgroup analyses, correcting for multiplicity should be used. It is strongly recommended that both the corrected and uncorrected P values be reported and that all findings be reported as tentative and hypothesis generating, rather than hypothesis testing.
-
7) If corrections are used, the Bonferroni type is simple but not optimal and should generally not be a first choice with a large number of tests. Holm and Hochberg methods are superior in this regard. Indeed, the researcher may wish to change the criterion with regard to which outcomes are significant and apply the pFDR approach. Resampling techniques are very promising for those with the wherewithal to implement them, and their use will likely become more widespread as they are incorporated into the commonly used software packages.
TO READ FURTHER
In addition to the articles listed in References, other useful books about correcting (or not correcting) for multiplicity include the following:
-
Dmitrienko A, Molenberghs G, Chuang-Stein C, Offen WW. Analysis of clinical trials using SAS: a practical guide. Cary (NC): SAS Institute; 2005.
-
Dmitrienko A, Tamhane AC, Bretz F, editors. Multiple testing problems in pharmaceutical statistics. Boca Raton (FL): Chapman & Hall/CRC; 2010.
-
Dudoit S, van der Laan MJ. Multiple testing procedures with applications to genomics. New York: Springer-Verlag; 2008.
-
Hochberg Y, Tamhane AC. Multiple comparison procedures. New York: Wiley; 1987.
-
Hsu J. Multiple comparisons: theory and methods. Boca Raton (FL): Chapman & Hall/CRC; 1996.
-
Toothaker LE. Multiple comparisons for researchers. Newbury Park (CA): Sage; 1991.
The author did not declare any conflicts of interest.
REFERENCES
1. Bennett CM, Baird AA, Miller MB, Wolford GL. Neural correlates of interspecies perspective taking in the post-mortem Atlantic salmon: an argument for proper multiple comparisons correction. Journal of Serendipitous and Unexpected Results 2010;1:1–5.
2. Austin PC, Mamdani MM, Juurlink DN, Hux JE. Testing multiple statistical hypotheses resulted in spurious associations: a study of astrological signs and health. J Clin Epidemiol 2006;59:964–9.
3. McNeill AM, Katz R, Girman CJ, Rosamond WD, Wagenknecht LE, Barzilay JI, Tracy RP, Savage PJ, Jackson SA. Metabolic syndrome and cardiovascular disease in older people: the Cardiovascular Health Study. J Am Geriatr Soc 2006;54:1317–24.
4. Norman GR, Streiner DL. Biostatistics: the bare essentials. 4th ed. Shelton (CT): PMPH USA; 2015.
5. Seaman MA, Levin JR, Serlin RC. New developments in pairwise multiple comparisons: some powerful and practicable procedures. Psychol Bull 1991;110:577–86.
6. Dunn OJ. Estimation of the medians for dependent variables. Ann Math Stat 1959;30:192–7.
7. Dunn OJ. Multiple comparisons among means. J Am Stat Assoc 1961;56:52–64.
8. Šidák ZK. Rectangular confidence regions for the means of multivariate normal distributions. J Am Stat Assoc 1967;62:626–33.
9. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 1995;57:289–300.
10. Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat 1979;6:65–70.
11. Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika 1988;75:800–2.
12. Efron B, Tibshirani R. An introduction to the bootstrap. New York: Chapman & Hall; 1993.
13. Westfall PH, Young SS. p Value adjustments for multiple tests in multivariate binomial models. J Am Stat Assoc 1989;84:780–6.
14. Westfall PH, Young SS. Resampling-based multiple testing: examples and methods for p-value adjustment. New York: Wiley; 1993.
16. Moyé LA. P-value interpretation and alpha allocation in clinical trials. Ann Epidemiol 1998;8:351–7.
17. Cormier KD, Pagano M. Multiple comparisons: a cautionary tale about the dangers of fishing expeditions. Nutrition 1999;15:332–3.
18. Blakesley RE, Mazumdar S, Dew MA, Houck PR, Tang G, Reynolds CF, Butters MA. Comparisons of methods for multiple hypothesis testing in neuropsychological research. Neuropsychology 2009;23:255–64.
19. Rothman KJ. No adjustments are needed for multiple comparisons. Epidemiology 1990;1:43–6.
20. Cohen J. The earth is round (p < .05). Am Psychol 1994;49:997–1003.
21. Schulz KF, Grimes DA. Multiplicity in randomised trials I: endpoints and treatments. Lancet 2005;365:1591–5.
22. Bozzetti F, Gavazzi C, Miceli R, Rossi N, Mariani L, Cozzaglio L, Bonfanti G, Piacenza S. Perioperative total parenteral nutrition in malnourished, gastrointestinal cancer patients: a randomized, clinical trial. JPEN J Parenter Enteral Nutr 2000;24:7–14.
23. Altman DG. Statistics in medical journals: some recent trends. Stat Med 2000;19:3275–89.
24. Altman DG. Comparability of randomised groups. Statistician 1985;34:125–36.
25. Roberts C, Torgerson DJ. Baseline imbalance in randomised controlled trials. BMJ 1999;319:185.
26. Marti-Carvajal A. Taking aim at a moving target: when a study changes in the middle. In: Streiner DL, Sidani S, editors. When research goes off the rails: why it happens and what to do about it. New York: Guilford; 2010. p. 299–303.
28. Ludwig J, Sanbonmatsu L, Gennetian L, Adam E, Duncan GJ, Katz LF, Kessler RC, Kling JR, Lindau ST, Whitaker RC. Neighborhoods, obesity, and diabetes—a randomized social experiment. N Engl J Med 2011;365:1509–19.
29. Cohen J. Multiple regression as a general data-analytic system. Psychol Bull 1968;70:426–43.
30. Cronbach LJ. The two disciplines of scientific psychology. Am Psychol 1957;12:671–84.
31. Armitage P, McPherson CK, Rowe BC. Repeated significance tests on accumulating data. J R Stat Soc Ser A-G 1969;132(2):235–44.
32. Coffey CS. Statistical concepts for the stroke community: you may have worked on more adaptive designs than you think. Stroke 2015;46:e26–8.
33. Meinert CL, Knatterud GL, Prout TE, Klimt CR. The University Group Diabetes Program. A study of the effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes. Diabetes 1970;19(Suppl 2):789–830.
34. Moss AJ, Zareba W, Hall WJ, Klein H, Wilber DJ, Cannom DS, Daubert JP, Higgins SL, Brown MW, Andrews ML. Prophylactic implantation of a defibrillator in patients with myocardial infarction and reduced ejection fraction. N Engl J Med 2002;346:877–83.
35. Pocock SJ. Group sequential methods in the design and analysis of clinical trials. Biometrika 1977;64:191–9.
36. O'Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics 1979;35:549–56.
37. Peto R, Pike MC, Armitage P, Breslow NE, Cox DR, Howard SV, Mantel N, McPherson K, Peto J, Smith PG. Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design. Br J Cancer 1976;34:585–612.
38. Pocock S, White I. Trials stopped early: too good to be true? Lancet 1999;353:943–4.
40. Bowalekar S. Adaptive designs in clinical trials. Perspect Clin Res 2011;2:23–7.
ABBREVIATIONS
-
FDR
false discovery rate
-
FWER
family-wise error rate
-
H0
null hypothesis
-
pFDR
positive false discovery rate
-
RCT
randomized controlled trial
Author notes
1 The author reported no funding received for this study.
© 2015 American Society for Nutrition
Source: https://academic.oup.com/ajcn/article/102/4/721/4564678