Problem With Running Significance Tests Over and Over Again
Abstract
Testing many null hypotheses in a single study results in an increased probability of detecting a significant finding just by chance (the problem of multiplicity). Debates have raged over many years with regard to whether to correct for multiplicity and, if so, how it should be done. This article first discusses how multiple tests lead to an inflation of the α level, and then explores the following different contexts in which multiplicity arises: testing for baseline differences in various types of studies, having >1 outcome variable, conducting statistical tests that produce >1 P value, taking multiple "peeks" at the data, and unplanned, post hoc analyses (i.e., "data dredging," "fishing expeditions," or "P-hacking"). It then discusses some of the methods that have been proposed for correcting for multiplicity, including single-step procedures (e.g., Bonferroni); multistep procedures, such as those of Holm, Hochberg, and Šidák; false discovery rate control; and resampling approaches. Note that these various approaches address different aspects and are not necessarily mutually exclusive. For example, resampling methods could be used to control the false discovery rate or the family-wise error rate (as defined later in this article). However, the use of one of these approaches presupposes that we should correct for multiplicity, which is not universally accepted, and the article presents the arguments for and against such "correction." The final section brings together these threads and presents suggestions with regard to when it makes sense to use the corrections and how to do so.
INTRODUCTION
In 2010, Bennett et al. (1) used fMRI, involving "a 6-parameter rigid-body affine realignment of the fMRI time series, coregistration of the data to a T1-weighted anatomical image, and 8 mm full-width at half-maximum (FWHM) Gaussian smoothing" (whatever all that means) to demonstrate that 3 particular voxels in the brain images showed significant signal changes when the subject was shown photographs of people expressing specific emotions in social situations. Perhaps the reason that this finding did not result in headlines in the major news sources was that the subject in this study was a salmon; not only that, it was a dead one. Another research finding that failed to garner significant press coverage was by Austin et al. (2). By using a database containing 223 of the most common diagnoses, they found that people born under the sign of Sagittarius had an increased risk of fractures of the humerus, whereas Leos had a higher probability of gastrointestinal hemorrhage than did those born under the other astrological signs combined.
Fortunately for the reputation of science (and the scientists), neither of these articles meant for their conclusions to be taken seriously. Rather, their aim was to highlight the problem of multiple comparisons, also known as multiplicity. This is an issue that arises when many statistical tests are performed within a single study; with each test that is run, the probability of finding statistical significance just by chance increases, so it becomes progressively more difficult to separate out true differences or associations from those due to chance. In this article, I discuss 1) why it is a problem, 2) under what circumstances multiplicity rears its head, 3) various ways of correcting for multiplicity, 4) the controversy with regard to correcting for multiplicity, and 5) offer some suggestions regarding when and how we should correct for it.
Bear in mind that this article is written from what is called the "frequentist" perspective; that is, that the probability of an event is determined by its relative frequency observed in a study. There are other perspectives, primarily the Bayesian, in which prior probabilities are taken into account, but I do not discuss these in this article.
WHY IS MULTIPLICITY A PROBLEM?
If we adopt an α level of 0.05, then by definition, assuming that all of the null hypotheses (H0s) are true, on average 5% of the statistical tests will show a significant difference or association. The more tests that are run, the greater the likelihood that at least one will be significant by chance, and the question that arises is the probability that this will occur. If 3 tests are conducted, each of which can have 1 of 2 results (significant or not), there are 2^3 = 8 possible outcomes, which are shown in the left-most columns of Table 1, labeled "Test result." The probabilities associated with each result are shown in the next 3 columns, called "Test probability," and the last column is the probability of that outcome. So, the probability for outcome 2 (not significant, not significant, significant) would be 0.95 × 0.95 × 0.05 = 0.045125. Note that there are 2 essential assumptions to these calculations: 1) all of the null hypotheses are true and 2) all of the statistical tests are valid under the null, meaning that they yield P-value distributions that are uniform on the interval [0,1].
TABLE 1
The 8 possible outcomes of 3 tests assuming the null hypothesis is true
| | Test result | | | Test probability | | | |
| Outcome | A | B | C | A | B | C | Outcome probability |
| 1 | NS | NS | NS | 0.95 | 0.95 | 0.95 | 0.857375 |
| 2 | NS | NS | Sig | 0.95 | 0.95 | 0.05 | 0.045125 |
| 3 | NS | Sig | NS | 0.95 | 0.05 | 0.95 | 0.045125 |
| 4 | NS | Sig | Sig | 0.95 | 0.05 | 0.05 | 0.002375 |
| 5 | Sig | NS | NS | 0.05 | 0.95 | 0.95 | 0.045125 |
| 6 | Sig | NS | Sig | 0.05 | 0.95 | 0.05 | 0.002375 |
| 7 | Sig | Sig | NS | 0.05 | 0.05 | 0.95 | 0.002375 |
| 8 | Sig | Sig | Sig | 0.05 | 0.05 | 0.05 | 0.000125 |
| Total | | | | | | | 1.000000 |
We can answer the question of the number of outcomes with at least 1 significant finding when there really is none by adding up the probabilities of all of the rows that have one or more of them (i.e., rows 2–8). We can, but this can become laborious when there are 5 tests (2^5 = 32 outcomes) and borders on the masochistic when there are 10 of them (2^10 = 1024 outcomes). Fortunately, there's a much easier way. No matter how many tests there are, the sum of the outcome probabilities is always 1; that is, there is a 100% chance that one of those outcomes will occur. So, we can simply subtract the result from the first row (no significant tests) from 1, and we will get the same answer. In other words:

Pr(at least 1 significant test) = 1 − Pr(no significant tests),

where Pr means "probability."
The probability in row 1 is 0.95^3. To generalize a bit, if there were k tests performed, then the probability would be 0.95^k. We can generalize even further. We said that the test would be wrong 5% of the time, but we're not limited to that; we can use whatever value, such as 1% or 10%. If we designate the false-positive rate as α, then the probability is (1 − α)^k. That means that the probability of at least one test being significant is:

Pr(at least 1 significant test) = 1 − (1 − α)^k.

The more tests there are, the greater the probability that one will be significant, as shown in Figure 1. If you run enough tests, you're almost guaranteed to find something significant.
Figure 1
Probability that at least one test will be positive, for varying numbers of tests and an α = 0.05.
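To make the arithmetic concrete, the short Python sketch below (added here for illustration; it is not part of the original article) enumerates the 8 outcomes of Table 1 and then evaluates 1 − (1 − α)^k for the numbers of tests that Figure 1 plots:

```python
from itertools import product

alpha = 0.05

# Enumerate the 2^3 = 8 possible patterns of significant/not significant
# results for 3 tests, assuming every null hypothesis is true (Table 1).
total = 0.0
at_least_one = 0.0
for pattern in product([False, True], repeat=3):   # True = significant
    prob = 1.0
    for significant in pattern:
        prob *= alpha if significant else (1 - alpha)
    total += prob
    if any(pattern):
        at_least_one += prob

print(round(total, 6))                 # 1.0: the 8 outcome probabilities sum to 1
print(round(at_least_one, 6))          # 0.142625: the sum of rows 2-8 of Table 1
print(round(1 - (1 - alpha) ** 3, 6))  # the same value, via 1 - Pr(no significant tests)

# The quantity plotted in Figure 1: 1 - (1 - alpha)^k for increasing k
for k in (1, 5, 10, 20, 50, 100):
    print(k, round(1 - (1 - alpha) ** k, 3))
```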
WHEN MULTIPLICITY CAN ARISE
Multiplicity can arise under a number of different circumstances. It is important to differentiate among them, because the answer to whether or not to correct differs from one situation to the next. The various situations are:
-
1) Testing for baseline differences in a randomized controlled trial (RCT).2 The variables usually consist of various demographic factors, as well as variables that may be possible confounders.
-
2) Looking for differences between or among groups on a number of outcome measures.
-
3) Running a statistical procedure that yields >1 P value, such as factorial or repeated-measures ANOVA, multiple regression, and so on.
-
4) Peeking at data. This involves analyzing the results before all of the participants have been entered to decide whether more people need to be added to reach significance.
-
5) Interim analyses. These are most often planned ahead of time to see if the study should be ended early.
-
6) Fishing expeditions; that is, unplanned searches for differences between groups or relations among variables, as well as unplanned subgroup analyses.
Within this list of situations, we can make a number of distinctions. The first is between confirmatory data analysis and exploratory data analysis. The former consists of testing hypotheses that have been specified a priori, and the results dictate whether the study is deemed successful or not. In contrast, the researcher does not have explicit hypotheses in the latter case but is rather searching for relations or differences after the fact. The distinction is clear-cut at the extremes but can become blurry in practice. For instance, the Cardiovascular Health Study (3) was a prospective cohort study looking at the relation between metabolic syndrome and cardiovascular disease. Without recourse to the original proposal (assuming it exists), it is hard to know if subsequent analyses of the data looking at the effects of gender and race are exploratory or confirmatory. This is one reason that some journals now insist on seeing the proposal as part of the review process.
The second distinction that is sometimes made is between the family-wise error rate (FWER; i.e., the number of false discovery errors in a family of tests) and the experiment-wise error rate (i.e., the number of false positives in the entire study). Imagine that we did a study in which participants were randomly assigned to one of 4 diets, and we are looking at 2 different outcomes after 6 mo: the change in BMI and satisfaction with the diet, measured on some scale. Each of the variables would be analyzed with a 1-factor ANOVA. A significant F ratio would indicate that there is a difference between the groups but would not tell us where the difference lies. To determine that, we would have to do 6 post hoc tests—group 1 compared with 2, group 1 compared with 3, group 1 compared with 4, group 2 compared with 3, group 2 compared with 4, and group 3 compared with 4—and each of those would have its own P level. This would be referred to as a family of tests. On the other hand, the study as a whole has 2 outcome measures, each tested with its own ANOVA, so that there may be inflation of the α level at the experiment level.
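As a concrete illustration of such a family of tests, here is a minimal Python sketch; the data are invented purely for illustration, and plain t tests with a Bonferroni threshold stand in for the dedicated post hoc procedures described in the next section:

```python
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical change-in-BMI data for the 4 diet groups (made-up values)
group_means = {"1": -1.0, "2": -1.2, "3": -2.0, "4": -0.5}
groups = {g: rng.normal(loc=mu, scale=1.5, size=25) for g, mu in group_means.items()}

# Omnibus 1-factor ANOVA: tells us whether the groups differ, not where
f_stat, p_omnibus = stats.f_oneway(*groups.values())
print(f"omnibus F = {f_stat:.2f}, P = {p_omnibus:.4f}")

# The family of 6 post hoc pairwise comparisons, each with its own P level
for a, b in combinations(groups, 2):
    t_stat, p = stats.ttest_ind(groups[a], groups[b])
    flag = " *" if p < 0.05 / 6 else ""   # significant at the Bonferroni-adjusted level
    print(f"group {a} vs group {b}: P = {p:.4f}{flag}")
```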
Yet again, however, reality rears its ugly head to blur this distinction, at 3 levels. First, these terms are not used consistently from one article to the next; some authors use the 2 terms as synonyms and would describe the latter situation as a family of tests because they arose from the same study or experiment. Second, what constitutes an "experiment"? Would it include later replications? All of the studies conducted by that author examining the same question? Studies of the same question by other research groups? There is no easy answer. Finally, as we will discuss later, if we correct for post hoc tests, as in the case of the ANOVA, why don't we correct for the multiple t tests that accompany a multiple regression? In this article, we will use the term FWER to encompass both definitions.
HOW TO CORRECT FOR MULTIPLICITY
Given all the ways in which multiplicity can arise, how can we correct for it? In fact, there are many ways, which can roughly be divided into 4 areas: post hoc tests run after a significant ANOVA, those that try to correct for many independent analyses in a study, those that try to control the false discovery rate (FDR; which will be defined a bit later), and those based on resampling procedures.
There are a multitude of post hoc tests, such as the Studentized range test, Fisher's least significant difference, Tukey's honestly significant difference and his wholly significant difference, the Newman-Keuls test, Dunnett's t, and many others. All of them are variations of t tests, with different ways of trying to control the overall α level (4). The Newman-Keuls (also called the Student-Newman-Keuls, or S-N-K) is the default option in programs such as SPSS, perhaps because it is the most powerful of the techniques (i.e., has the highest likelihood of finding a difference between means) but does not control the FWER when there are >3 groups (5).
Among the techniques that correct for all of the statistical tests run in a study, arguably the most widely used one is the Bonferroni correction, which is an application of the Bonferroni inequality. It was named after the Italian mathematician Carlo Emilio Bonferroni and probably was first introduced into the statistical world by Olive Dunn (6, 7). Its popularity is due to a number of factors. First, it is the essence of simplicity. If we want to preserve an overall FWER of αFWER, then we divide α by the number of tests being done (k). That is:

αB = αFWER / k.
So, if there were 10 statistical tests and we want to restrain the FWER at 0.05, we would apply a Bonferroni-adjusted α level (αB) of 0.05/10 = 0.005 for each. Second, it is very flexible and can be used with any type of statistical test, not just ANOVAs.
Unfortunately, there is a steep price to pay for these benefits, and that is the extremely conservative nature of the correction. As the probability of a type I error (stating there was an effect when none was present) decreases, that of a type II error (concluding that there was no effect of the intervention or no association between variables when in fact there is one) increases. This loss of power is due to a number of reasons. First, it assumes that the null hypothesis is true for all of the tests, and this is unreasonable, most especially after a significant omnibus F test. Second, it assumes that all of the tests are independent, which is not true when pairwise comparisons are run, as is the case with the post hoc tests after an ANOVA. For example, if there are 3 groups, A, B, and C, then the comparisons would consist of A vs. B, A vs. C, and B vs. C.
A number of modifications to the Bonferroni have been proposed, such as the Šidák-Bonferroni (8), which uses the following value:

αS–B = 1 − (1 − α)^(1/k).
However, the result is very similar to the simpler Bonferroni value; if there are 10 tests, then αB = 0.005, whereas the Šidák-Bonferroni–adjusted α level (αS–B) = 0.00511, which is one reason it is rarely used.
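As a minimal sketch (added for illustration, not taken from the article), here are the two adjusted per-test α levels for the 10-test example just described:

```python
def bonferroni_alpha(alpha_fwer, k):
    """Per-test alpha that keeps the family-wise error rate at alpha_fwer."""
    return alpha_fwer / k

def sidak_alpha(alpha_fwer, k):
    """Sidak-Bonferroni per-test alpha; exact when the k tests are independent."""
    return 1 - (1 - alpha_fwer) ** (1 / k)

print(bonferroni_alpha(0.05, 10))  # 0.005
print(sidak_alpha(0.05, 10))       # about 0.0051, barely different from the Bonferroni value
```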
One difficulty with the Bonferroni class of corrections is that they become increasingly conservative if the outcomes are correlated with one another. Another problem is more philosophical and arises in areas in which many statistical tests are performed (sometimes running into the thousands), often with relatively few subjects, such as genomics and brain scanning. A certain number of false-positive results is tolerable, because they would be discarded when the study is replicated. The more relevant quantity to control is the positive FDR (pFDR); that is, the proportion of false positives among the set of rejected null hypotheses (which are referred to as discoveries).
The difference between controlling the FWER and the pFDR can be seen by referring to Table 2. When the aim is to control the FWER, we are concerned with the proportion of type I errors or false discoveries (cell C) relative to the total number of true null hypotheses (cells A + C). However, when the objective is to control the pFDR, the concern is the proportion in cell C relative to the total number of rejected null hypotheses (cells C + D). Thus, the pFDR is the expected proportion of false positives among all of the significant statistical tests. Phrased another way, if we use an α level of 0.05, then we expect that ≤5% of all tests will result in type I errors. However, by using the pFDR approach, we expect that ≤5% of the significant tests will be false positives. Benjamini and Hochberg (9), who coined the term FDR, stated that this approach is more powerful than the Bonferroni and is better at separating the important few from the many trivial effects tested.
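The step-up rule that Benjamini and Hochberg proposed is not spelled out in the text above; as a rough Python sketch (the P values are made up for illustration), it works as follows:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up rule: keep the expected proportion of false
    discoveries among the rejected null hypotheses at or below q."""
    p = np.asarray(pvals, dtype=float)
    k = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, k + 1) / k       # q * i/k for the i-th smallest p
    passing = np.nonzero(p[order] <= thresholds)[0]
    reject = np.zeros(k, dtype=bool)
    if passing.size:
        reject[order[:passing.max() + 1]] = True   # reject every p up to the largest that passes
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.490]
print(benjamini_hochberg(pvals))   # rejects the 2 smallest; Bonferroni (0.05/10 = 0.005) rejects only 1
```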
TABLE 2
The outcomes from many tests of significance
| | H0 true | H0 false | Total |
| Not called significant | A: true negative | B: false negative | A + B |
| Called significant | C: false discovery | D: true discovery | C + D |
| Total | A + C | B + D | k |
The Holm, or Holm-Bonferroni, procedure (10) was proposed to try to mitigate the problem of the often-conservative nature of the Bonferroni correction. Although it was developed before the term FDR was introduced, it can be seen as the first—and now perhaps best known—of the techniques to control the pFDR. In contrast to the single-step Bonferroni technique, the Holm method is a sequential, step-down one, in which the per-test α level is changed for each test. If there are k significance tests, then their associated P levels are rank ordered, so that P1 ≤ P2 ≤ … ≤ Pk. The first value is evaluated against the criterion of α/k. If it is significant, then the next uses the criterion of α/(k − 1). This is continued with decreasing values of k until a P value is reached that is not significant, and then it and all larger values of P are declared nonsignificant. This somewhat labor-intensive task has been made easier by the availability of a number of free online and downloadable computer programs.
The Hochberg method (11), on the other hand, is a step-up procedure, which begins by testing Pk (i.e., the largest P level) against the criterion of α. If it is significant, testing is stopped, and all smaller P levels will also be significant. If the result is not significant, then Pk−1 is evaluated with α/2. Again, if the result is significant, testing is stopped and P levels smaller than this would be significant. If it is not significant, then Pk−2 is compared against α/3, and so forth, until significance is reached. Of the 3 procedures, the Bonferroni is the most conservative, the Hochberg the least, and the Holm method falls between them.
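The two stepwise procedures are easy to sketch in code; the following is a minimal Python illustration (the P values are invented) rather than a full implementation:

```python
import numpy as np

def holm(pvals, alpha=0.05):
    """Holm step-down procedure; returns True where H0 is rejected."""
    p = np.asarray(pvals, dtype=float)
    k = len(p)
    order = np.argsort(p)                    # indices from smallest to largest p
    reject = np.zeros(k, dtype=bool)
    for step, idx in enumerate(order):       # step 0 tests the smallest p against alpha/k
        if p[idx] <= alpha / (k - step):
            reject[idx] = True
        else:
            break                            # first nonsignificant p: it and all larger ones fail
    return reject

def hochberg(pvals, alpha=0.05):
    """Hochberg step-up procedure; returns True where H0 is rejected."""
    p = np.asarray(pvals, dtype=float)
    k = len(p)
    order = np.argsort(p)
    reject = np.zeros(k, dtype=bool)
    for step in range(k - 1, -1, -1):        # start with the largest p, tested against alpha/1
        if p[order[step]] <= alpha / (k - step):
            reject[order[:step + 1]] = True  # it and all smaller p values are significant
            break
    return reject

pvals = [0.040, 0.045]
print(holm(pvals))      # [False False]: 0.040 > 0.05/2, so nothing is rejected
print(hochberg(pvals))  # [ True  True]: 0.045 <= 0.05/1, so both are rejected
```

With these 2 P values the Bonferroni criterion (0.025 per test) and the Holm procedure reject nothing, whereas the Hochberg procedure rejects both, consistent with the ordering of conservatism just described.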
A different class of approaches to correcting for multiplicity is based on resampling procedures. One variant of this technique, called bootstrapping, was introduced by Efron and Tibshirani (12). It involves drawing repeated random samples from the data (sometimes numbering 1000 or even 10,000 samples) and computing the parameters for each sample (e.g., the mean), which can then be averaged, thus yielding estimates of the variability of the parameters across different bootstrapped samples. It may seem somewhat odd drawing so many samples from a small data set, but it's possible because the samples are drawn with replacement. That means that if participant number 24 is selected, we actually leave her data in the data set where it can be drawn again. Thus, each sample is somewhat different from the others. Because of this, it was not feasible until desktop computers became widely available and software was developed that can perform the calculations. It was first applied to the issue of multiplicity by Westfall and Young (13, 14) and gained popularity in the field of genetics, where microarray studies can generate thousands of hypotheses tested simultaneously (15).
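As a bare-bones sketch of the resampling idea itself (the data and the number of resamples are invented; this illustrates the basic bootstrap, not Westfall and Young's multiplicity adjustment):

```python
import numpy as np

rng = np.random.default_rng(42)

# A hypothetical small sample, e.g., change in BMI for 12 participants (made-up values)
sample = np.array([-1.2, 0.4, -0.8, -2.1, 0.0, -1.5, 0.7, -0.9, -1.1, -0.3, -1.8, 0.2])

n_boot = 10_000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # Draw a resample of the same size *with replacement*: a selected participant's
    # value stays in the data set and can be drawn again.
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[b] = resample.mean()

print(sample.mean())                           # observed mean
print(boot_means.std(ddof=1))                  # bootstrap estimate of the SE of the mean
print(np.percentile(boot_means, [2.5, 97.5]))  # simple percentile 95% CI
```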
The major advantage of resampling approaches is that they take into account the estimated dependency among the test statistics, which is not true for some other corrections for multiplicity (e.g., the Bonferroni correction), and therefore they tend to be more powerful (less conservative). Resampling methods also make minimal assumptions about the underlying distribution. They can also be applied to single-step (i.e., Bonferroni-type), multistep (Holm and Hochberg), and FDR approaches. The mathematics of resampling techniques are beyond the scope of this article; interested readers should read one of the cited texts (13–15).
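To show how resampling can fold that dependency into the correction, here is a hedged sketch of a single-step, maxT-style permutation adjustment in the spirit of Westfall and Young (13, 14); the function name, simulated data, and number of permutations are illustrative assumptions rather than their actual algorithm:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def max_t_adjusted_p(group_a, group_b, n_perm=5000):
    """Single-step maxT resampling adjustment for comparing 2 groups on several
    (possibly correlated) outcomes; arrays have shape (n_subjects, n_outcomes)."""
    t_obs = np.abs(stats.ttest_ind(group_a, group_b, axis=0).statistic)
    data = np.vstack([group_a, group_b])
    n_a = group_a.shape[0]
    max_t = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(data.shape[0])            # shuffle the group labels
        a, b = data[perm[:n_a]], data[perm[n_a:]]
        max_t[i] = np.abs(stats.ttest_ind(a, b, axis=0).statistic).max()
    # Adjusted P: how often the largest permuted statistic exceeds each observed one
    return np.array([(max_t >= t).mean() for t in t_obs])

# Hypothetical example: 2 groups of 20, 3 highly correlated outcomes, no true effect
cov = [[1.0, 0.8, 0.8], [0.8, 1.0, 0.8], [0.8, 0.8, 1.0]]
a = rng.multivariate_normal([0, 0, 0], cov, size=20)
b = rng.multivariate_normal([0, 0, 0], cov, size=20)
print(max_t_adjusted_p(a, b))
```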
SHOULD WE CORRECT FOR MULTIPLICITY?
The discussion of how to correct for multiplicity has made the implicit assumption that we should correct for it, but this is by no means a position accepted by everyone. In its favor, Moyé (16) holds that "Type I error accumulates with each executed hypothesis test and must be controlled by the investigators" (p. 354); Cormier and Pagano (17) state that "The more tests we want to make, the more conservative we have to be in order to preserve our overall significance level α" (p. 333); and Blakesley et al. (18) write that "Failure to control type I errors when examining multiple outcomes may yield false inferences, which may slow or sidetrack research progress" (p. 256).
On the other side of the debate, Rothman (19) argues that correcting for multiplicity is predicated on 2 assumptions: 1) that the principal cause of unusual findings is chance and 2) that no one would want to further investigate phenomena that may have been caused by chance. Rothman disputes both of these beliefs. With regard to the first, he states that "Scientists assume instead that the universe is governed by natural laws, and that underlying the variability that we observe is a network of factors related to one another through causal connections. To entertain the universal null hypothesis is, in effect, to suspend belief in the real world and thereby to question the premise of empiricism" (p. 45). As for the second, he writes that "Being impressed by an extreme finding should not be considered a mistake in a universe brimming with interrelated phenomena. The possibility that we may be misled is inherent to the trial-and-error process of science; we might avoid all such errors by eschewing science completely, but then we learn nothing" (p. 46).
The danger is that, by correcting for multiplicity, we increase the probability of a type II error, and thus may overlook potentially interesting findings. It should be noted, though, that the argument against correcting rests on the assumption that another group will attempt to replicate any chance findings and fail. However, some journals will not publish replication studies, and many tenure and promotions committees place more value on original research, so this correction process may not take place.
What Rothman overlooks is that the purpose of null hypothesis significance testing is not simply to reject or not reject the null hypothesis but also to determine both the direction of the effect and its magnitude. Furthermore, as Cohen points out in his delightfully titled paper, "The earth is round (p < .05)" (20), there is a difference between the null hypothesis (the hypothesis to be nullified) and the nil hypothesis (nothing is going on). Most often, the two are the same, but they need not be. We can state that our null hypothesis is that a correlation is at or below a certain value (as is often done in determining the reliability of a scale) or that a difference between groups is a given amount or more (as is done in noninferiority trials).
A different argument against correcting for multiplicity is offered by Schulz and Grimes (21). Suppose that an RCT of preoperative parenteral nutrition resulted in an increase in noninfectious complications (e.g., 22), and that this finding was significant at the 0.04 level. Now also assume that the investigators looked at a second outcome, length of hospitalization, which was also significant at the 0.04 level. As would be expected, these 2 outcomes are highly correlated, which is the case for many endpoints in clinical trials. The fact that both outcomes resulted in significant findings in the same direction might reinforce our confidence in the results. However, were we to apply a Bonferroni-type correction, neither result would be significant, which is counterintuitive. However, a weakness of this argument is that the 2 (or more) dependent variables are unlikely to be independent if they are measures of conceptually related outcomes. Taken to the extreme, imagine testing for a treatment effect on weight loss in pounds and then testing for a treatment effect in the same data set on weight loss in kilograms. Because kilograms is just a linear transformation of pounds, the 2 tests will yield identical P values, so obtaining 2 P values of 0.04 would not tell us anything different from one P value of 0.04. This example may be absurd, but it points out the difficulty in trying to argue that testing multiple related outcomes in some manner obviates the multiple testing problem. It is true that when the outcomes are highly related, use of a multiple testing procedure that incorporates that dependency into the correction may be apt. Both sides of the argument make valid points, leading one prominent statistician, Doug Altman (23), to conclude that "It is hard to see views such as [the ones just cited] being reconciled" (p. 2383).
WHEN MIGHT WE CORRECT OR NOT CORRECT FOR MULTIPLICITY
Perhaps a way out of this quandary can be found by looking at the various conditions in which multiple hypothesis testing arises, described earlier, and discussing when it may or may not make sense to correct for multiplicity. The first situation occurs when we look for baseline differences in an RCT. This is rationalized for 2 reasons: as a check on the randomization procedure and to determine whether any variables should be used as covariates in subsequent analyses. We can actually deal with this quite easily—it shouldn't be done. Despite the fact that such testing is almost universally reported as the first table in any RCT, we (4) and others (24) believe that it is misguided on both grounds. When we run a statistical test after an intervention, we are testing 2 hypotheses: that any difference was due to chance (H0) or that it was due to the fact that the groups were treated differently (the alternative hypothesis). However, at baseline, there is no alternative; if differences exist, they must be due to chance (assuming that the randomization process has not been subverted, either in some nefarious way or innocently, such as by replacing dropouts or not reading the instructions). Hence, the P level is meaningless; the probability that chance was responsible is 100%, irrespective of the value printed out by the computer. The second argument, that variables that differ at baseline should be used as covariates, is equally misguided. Whether or not a variable should be included as a covariate should be based on theory or previous knowledge of its influence on the outcome and not on inspection of baseline differences (25). Furthermore, because including covariates usually reduces the within-group variance and increases the precision of the estimate of the treatment effect (4), it is often recommended to include prognostic variables as covariates, even if they do not differ significantly between the groups.
Looking for group differences among the outcome variables, the second situation, is the most fraught with difficulties because it is on the basis of these analyses that the study is said to have provided evidence to reject the main null hypotheses or not, and it is here that the differences between the varying viewpoints are sharpest. To reiterate, the debate is between finding differences that are actually due to chance compared with overlooking potentially useful findings. The most sensible advice was given by Schulz and Grimes (21): "Researchers should restrict the number of primary endpoints tested. They should specify a priori the primary endpoint or endpoints in their protocol" (p. 1592). That is, the problem of correcting for multiplicity is eliminated by making it unnecessary; there are only a small number of endpoints (ideally, one) and they will likely be correlated. Because of this, Bonferroni-type corrections could undermine the conclusions, as pointed out earlier, because the outcomes would reinforce each other rather than having one lessen the significance of the other (bearing in mind the injunction that the variables should not simply be the same outcome measured in different ways). Their injunction about specifying the outcomes a priori is to preclude substituting a significant secondary result for a primary one that was not significant; a practice that was (and is) all too common (e.g., 26). It is for this reason that many medical journals now require all trials to be registered before the first patient is enrolled, and some journals require authors to include the protocol when the results are submitted for publication.
However sensible Schulz and Grimes' (21) advice is with regard to limiting the number of outcomes, it is infeasible when evaluating complex interventions. These are defined as ones with "several interacting components" (27, p. 6) and are often used in health services, public health, and social policy research. The guidelines promulgated by the Medical Research Council in Great Britain state that "Identifying a single primary outcome may not make best use of the data; a range of measures will be needed, and unintended consequences picked up where possible" (27, p. 7). For example, the Moving to Opportunity project (28) was an RCT that evaluated the effect of neighborhood on the development of obesity and diabetes. There was a wide range of outcomes indicating health outcomes, including height, weight, and concentration of glycated hemoglobin. Because these outcomes were all specified a priori in the protocol, there was no correction for multiplicity.
The third situation, post hoc tests after a significant omnibus test, is a somewhat confusing one. On the one hand, most statistical packages provide ≥1 of the post hoc tests mentioned earlier (e.g., Newman-Keuls, the honestly significant difference) after an ANOVA, and this appears to be accepted practice. On the other hand, multiple linear regression with categorical predictor variables is mathematically identical to ANOVA (29), whereby group differences are reflected in the b or β weights of a dummy-coded variable. However, it is highly unusual to see any correction for multiplicity applied to the t tests of these weights. Why this difference? In the words of Tevye the Milkman (from Fiddler on the Roof), Tradition!
Actually, this "explanation" is less facetious than it may first appear. Nearly 60 y ago, Lee J Cronbach (30) wrote about the 2 "disciplines" of psychology: the experimental and the observational. The former tries to exert all possible control over a study, views within-group variance as noise to be minimized as much as possible, and examines just a few hypotheses at a time; in contrast, the latter looks at nature as it is, welcomes variance as necessary to explore between-person differences, and studies many variables at the same time (hence leading to the aphorism, "One person's error variance is another person's occupation"). The experimentalists favored ANOVA-type statistics, whereas the observationists relied primarily on correlations and regressions. This may be the origin of the difference, with the first discipline concerned about spurious findings and the second welcoming unexpected results, and is reflected in the more contemporary differing viewpoints of Blakesley et al. (18), on the one hand, and Rothman (19), on the other.
Situation 4 involves "peeking" at the data; that is, analyzing them partway through the study to determine whether the sample size needs to be increased. This practice is most definitely one that should be avoided entirely. The issue, as Armitage et al. (31) pointed out, is that, assuming that H0 is true, the probability of a significant test result grows rapidly with each analysis. The problem is compounded by the fact that the sample size is increased after each nonsignificant result, making the likelihood of finding significance even greater. Eventually, given enough peeks at the data, the researcher will find what he or she is looking for. The only accepted practice is to determine the sample size a priori, and to stick with that.
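A small simulation sketch (sample sizes, number of looks, and number of simulated trials are arbitrary choices for illustration) shows how quickly repeated peeking inflates the type I error rate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_sims, alpha = 5000, 0.05
start_n, step, max_looks = 20, 20, 5      # per-group sizes and number of "peeks"

false_positives = 0
for _ in range(n_sims):
    # H0 is true: both groups are drawn from the same distribution
    a = list(rng.normal(0, 1, start_n))
    b = list(rng.normal(0, 1, start_n))
    for _look in range(max_looks):
        if stats.ttest_ind(a, b).pvalue < alpha:
            false_positives += 1          # the researcher stops as soon as P < 0.05
            break
        a.extend(rng.normal(0, 1, step))  # otherwise, add participants and peek again
        b.extend(rng.normal(0, 1, step))

print(false_positives / n_sims)           # well above the nominal 0.05
```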
The fifth situation, interim analyses, also involves peeking at the data, but is most often built into many large RCTs at the design stage (32). The rationale is primarily an ethical one. If an analysis partway through the study shows that the new intervention will not prove to be superior to the comparison (either placebo or treatment as usual) when the full sample size is reached, or if one group has significantly more adverse outcomes than the other, it would be unethical to continue with the study. The lack of effectiveness was the reason that the tolbutamide and diet arm of the University Group Diabetes Program (33) was dropped halfway through the trial. Conversely, if the new intervention is clearly superior, then it would be equally unethical to deny it to those in the comparator condition. For instance, the Multicenter Automatic Defibrillator Implantation Trial (34) was ended early because the interim analysis showed a significantly greater reduction in all-cause mortality for those given an implantable cardioverter-defibrillator.
Various schemes have been proposed to protect the overall α level, including the use of the same significance level for each interim analysis (35) or a gradually decreasing criterion (36). Common practice now appears to be to split α into 2 parts, α1 for the interim analysis and α2 for the final one (assuming only 1 interim analysis), so that α1 + α2 = 0.05. Following the recommendation of Peto et al. (37), a very stringent criterion is used for α1 (e.g., <0.001). The rationale for this is that trials that are ended early tend to overestimate the effect size and underestimate the CI (38). As a result, interim analyses should be used primarily when 1) there is serious risk of harm from adverse events or from delaying the new treatment and 2) when they have been built into the study from the beginning for reasons of efficiency, and then 3) used only with the greatest of caution.
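The effect of splitting α in this way can be checked with a simulation sketch (one interim look at half the data; the sample size and number of simulated trials are arbitrary choices for illustration, not a recommended design):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def overall_type_1_error(alpha_interim, alpha_final, n_per_group=100, n_sims=10_000):
    """Simulate a 2-group trial with one interim look at half the data (H0 true);
    return how often the trial is declared positive at either look."""
    positives = 0
    half = n_per_group // 2
    for _ in range(n_sims):
        a = rng.normal(0, 1, n_per_group)
        b = rng.normal(0, 1, n_per_group)
        if stats.ttest_ind(a[:half], b[:half]).pvalue < alpha_interim:
            positives += 1      # stopped early for apparent benefit
        elif stats.ttest_ind(a, b).pvalue < alpha_final:
            positives += 1      # significant at the final analysis
    return positives / n_sims

print(overall_type_1_error(0.001, 0.049))  # close to 0.05: the stringent interim criterion barely touches alpha
print(overall_type_1_error(0.05, 0.05))    # noticeably above 0.05: an uncorrected peek inflates the error rate
```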
It is in the final situation, involving unplanned and subgroup analyses, that the tension between chance findings on the one hand and overlooking potentially interesting observations on the other is most acute. The problem in fact goes beyond simply doing and reporting a large number of analyses after the fact. As Gelman and Loken (39) pointed out, even without going on a "fishing expedition," a study can have a large number of "researcher degrees of freedom," potentially involving choice of statistical test and whether and how to categorize continuous variables, nonlinear transformations, outlier removal, etc. The (often undisclosed) use of such techniques in search of statistically significant results is referred to as "P value fiddling" or "P-hacking." That is, there is always a very large number of potential analyses (e.g., analyzing the data by gender, age, comorbidity status, previous history), of which the researcher may perform only a few, but the choice is often conditional on the data themselves. Even a cursory look at the data may lead the researcher to decide that it is not worth running some analyses because the difference looks quite small. The issue is that although no formal statistical procedures have been performed, informal, "eyeball" tests were run. Thus, there is a greater likelihood that the statistical tests that actually were run will be significant. Compounding the problem, the comparisons that were rejected as not likely to be fruitful are not counted when correcting for multiplicity, further increasing the probability of a type I error.
Perhaps the most prudent course of action in these circumstances would consist of 3 parts. First, there should be a correction for the number of tests that were actually performed. The correction could be either a Bonferroni-type one or a pFDR type. Second, both the corrected and uncorrected P levels should be reported, so that the readers are able to determine for themselves which tests they would regard as statistically significant or not. Finally, any conclusions based on such analyses should be clearly reported as tentative and hypothesis generating, rather than as hypothesis testing.
SUMMARY AND RECOMMENDATIONS
Whether or not to correct for multiplicity is an issue that shows no sign of early resolution. The arguments on both sides are compelling. Not correcting for it increases the probability of spurious significant findings, possibly resulting in time and resources being wasted chasing down false leads. On the other hand, correcting for multiplicity may have the opposite effect, in which potentially interesting observations are discarded as chance findings. After reviewing the various situations in which multiplicity can arise, the following recommendations are offered, in the full realization that they may be contested and debated:
-
1) The decision regarding whether or not to correct for multiple testing is a philosophical one, and there is no way to prove that correcting is or is not the right thing to do unless one specifies one's values (e.g., one's preferences regarding the potential commission of certain types of errors) in advance.
-
2) In determining whether groups differ at baseline in an RCT, the significance levels are meaningless (unless one suspects that the randomization was not legitimately implemented) and therefore significance testing should not be done.
-
3) When assessing the outcomes of a clinical trial, some degree of judgment is necessary. Ideally, there will be only a small number of outcomes, which have been specified a priori and are most likely correlated with one another. In such cases, correcting for multiplicity may be judged to be unnecessary and counterproductive; if the outcomes are in the same direction, they would strengthen confidence in the results. However, if many endpoints are used, some investigators would find the potential for FWER inflation unacceptable.
-
4) Correcting for many P values within a single statistic, such as complex ANOVA designs and multiple regression, appears to be dictated more by habit and tradition than by logic. Post hoc tests corrected for multiple testing are routinely used in the former instance and almost never in the latter, even though the two techniques are mathematically identical. It is unlikely that these practices will change in the foreseeable future.
-
5) "Peeking" at the data to determine whether a larger sample size is needed is poor practice and should never be done unless a preplanned interim analysis strategy is used that protects the overall α level. On the other hand, interim analyses are an integral part of many RCTs to decide whether a trial should be ended early because of futility (i.e., the intervention will not show benefit even with the full sample size), an excess of adverse events in one group, or the clear superiority of the intervention, meaning that it would be unethical to withhold it from the comparison group. However, because trials that are ended early often overestimate the true effect size and underestimate the width of the CI, this should be done only using methods designed for such situations (e.g., 40).
-
6) When conducting unplanned, post hoc analyses of the data, including subgroup analyses, correcting for multiplicity should be used. It is strongly recommended that both the corrected and uncorrected P values be reported and that all findings be reported as tentative and hypothesis generating, rather than hypothesis testing.
-
7) If corrections are used, the Bonferroni type is simple but not optimal and should generally not be a first choice with a large number of tests. Holm and Hochberg methods are superior in this regard. Indeed, the researcher may wish to change the criterion with regard to which outcomes are significant and apply the pFDR approach. Resampling techniques are very promising for those with the wherewithal to implement them, and their use will likely become more widespread as they are incorporated into the commonly used software packages.
TO READ FURTHER
In addition to the articles listed in References, other useful books about correcting (or not correcting) for multiplicity include the following:
-
Dmitrienko A, Molenberghs G, Chuang-Stein C, Offen WW. Analysis of clinical trials using SAS: a practical guide. Cary (NC): SAS Institute; 2005.
-
Dmitrienko A, Tamhane AC, Bretz F, editors. Multiple testing problems in pharmaceutical statistics. Boca Raton (FL): Chapman & Hall/CRC; 2010.
-
Dudoit S, van der Laan MJ. Multiple testing procedures with applications to genomics. New York: Springer-Verlag; 2008.
-
Hochberg Y, Tamhane AC. Multiple comparison procedures. New York: Wiley; 1987.
-
Hsu J. Multiple comparisons: theory and methods. Boca Raton (FL): Chapman & Hall/CRC; 1996.
-
Toothaker LE. Multiple comparisons for researchers. Newbury Park (CA): Sage; 1991.
The author did not declare any conflicts of interest.
REFERENCES
1. Bennett CM, Baird AA, Miller MB, Wolford GL. Neural correlates of interspecies perspective taking in the post-mortem Atlantic salmon: an argument for proper multiple comparisons correction. Journal of Serendipitous and Unexpected Results 2010;1:1–5.
2. Austin PC, Mamdani MM, Juurlink DN, Hux JE. Testing multiple statistical hypotheses resulted in spurious associations: a study of astrological signs and health. J Clin Epidemiol 2006;59:964–9.
3. McNeill AM, Katz R, Girman CJ, Rosamond WD, Wagenknecht LE, Barzilay JI, Tracy RP, Savage PJ, Jackson SA. Metabolic syndrome and cardiovascular disease in older people: the Cardiovascular Health Study. J Am Geriatr Soc 2006;54:1317–24.
4. Norman GR, Streiner DL. Biostatistics: the bare essentials. 4th ed. Shelton (CT): PMPH USA; 2015.
5. Seaman MA, Levin JR, Serlin RC. New developments in pairwise multiple comparisons: some powerful and practicable procedures. Psychol Bull 1991;110:577–86.
6. Dunn OJ. Estimation of the medians for dependent variables. Ann Math Stat 1959;30:192–7.
7. Dunn OJ. Multiple comparisons among means. J Am Stat Assoc 1961;56:52–64.
8. Šidák ZK. Rectangular confidence regions for the means of multivariate normal distributions. J Am Stat Assoc 1967;62:626–33.
9. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 1995;57:289–300.
10. Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat 1979;6:65–70.
11. Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika 1988;75:800–2.
12. Efron B, Tibshirani R. An introduction to the bootstrap. New York: Chapman & Hall; 1993.
13. Westfall PH, Young SS. p Value adjustments for multiple tests in multivariate binomial models. J Am Stat Assoc 1989;84:780–6.
14. Westfall PH, Young SS. Resampling-based multiple testing: examples and methods for p-value adjustment. New York: Wiley; 1993.
16. Moyé LA. P-value interpretation and alpha allocation in clinical trials. Ann Epidemiol 1998;8:351–7.
17. Cormier KD, Pagano M. Multiple comparisons: a cautionary tale about the dangers of fishing expeditions. Nutrition 1999;15:332–3.
18. Blakesley RE, Mazumdar S, Dew MA, Houck PR, Tang G, Reynolds CF, Butters MA. Comparisons of methods for multiple hypothesis testing in neuropsychological research. Neuropsychology 2009;23:255–64.
19. Rothman KJ. No adjustments are needed for multiple comparisons. Epidemiology 1990;1:43–6.
20. Cohen J. The earth is round (p < .05). Am Psychol 1994;49:997–1003.
21. Schulz KF, Grimes DA. Multiplicity in randomised trials I: endpoints and treatments. Lancet 2005;365:1591–5.
22. Bozzetti F, Gavazzi C, Miceli R, Rossi N, Mariani L, Cozzaglio L, Bonfanti G, Piacenza S. Perioperative total parenteral nutrition in malnourished, gastrointestinal cancer patients: a randomized, clinical trial. JPEN J Parenter Enteral Nutr 2000;24:7–14.
23. Altman DG. Statistics in medical journals: some recent trends. Stat Med 2000;19:3275–89.
24. Altman DG. Comparability of randomised groups. Statistician 1985;34:125–36.
25. Roberts C, Torgerson DJ. Baseline imbalance in randomised controlled trials. BMJ 1999;319:185.
26. Marti-Carvajal A. Taking aim at a moving target: when a study changes in the middle. In: Streiner DL, Sidani S, editors. When research goes off the rails: why it happens and what to do about it. New York: Guilford; 2010. p. 299–303.
28. Ludwig J, Sanbonmatsu L, Gennetian L, Adam E, Duncan GJ, Katz LF, Kessler RC, Kling JR, Lindau ST, Whitaker RC. Neighborhoods, obesity, and diabetes—a randomized social experiment. N Engl J Med 2011;365:1509–19.
29. Cohen J. Multiple regression as a general data-analytic system. Psychol Bull 1968;70:426–43.
30. Cronbach LJ. The two disciplines of scientific psychology. Am Psychol 1957;12:671–84.
31. Armitage P, McPherson CK, Rowe BC. Repeated significance tests on accumulating data. J R Stat Soc Ser A-G 1969;132(2):235–44.
32. Coffey CS. Statistical concepts for the stroke community: you may have worked on more adaptive designs than you think. Stroke 2015;46:e26–8.
33. Meinert CL, Knatterud GL, Prout TE, Klimt CR. The University Group Diabetes Program. A study of the effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes. Diabetes 1970;19(Suppl 2):789–830.
34. Moss AJ, Zareba W, Hall WJ, Klein H, Wilber DJ, Cannom DS, Daubert JP, Higgins SL, Brown MW, Andrews ML. Prophylactic implantation of a defibrillator in patients with myocardial infarction and reduced ejection fraction. N Engl J Med 2002;346:877–83.
35. Pocock SJ. Group sequential methods in the design and analysis of clinical trials. Biometrika 1977;64:191–9.
36. O'Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics 1979;35:549–56.
37. Peto R, Pike MC, Armitage P, Breslow NE, Cox DR, Howard SV, Mantel N, McPherson K, Peto J, Smith PG. Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design. Br J Cancer 1976;34:585–612.
38. Pocock S, White I. Trials stopped early: too good to be true? Lancet 1999;353:943–4.
40. Bowalekar S. Adaptive designs in clinical trials. Perspect Clin Res 2011;2:23–7.
ABBREVIATIONS
-
FDR
false discovery rate
-
FWER
family-wise error rate
-
H0
null hypothesis
-
pFDR
positive false discovery rate
-
RCT
randomized controlled trial
Author notes
1 The author reported no funding received for this study.
© 2015 American Society for Nutrition
Source: https://academic.oup.com/ajcn/article/102/4/721/4564678