Send any comments to peter@biostat.ucsf.edu._
The article below was adapted in 2008 by Nancy Hills and may need updating. More recent lecture notes are available at the parent page
Almost no one would really defend these first two statements:
1. The p-value tells you whether an observed difference, effect or association is real or not.
2. If the result is not statistically significant, that proves there is no difference.
These are too naive and clearly wrong. We all know that just because a result could have arisen by chance alone does not mean that it must have arisen by chance alone. That would be very bad logic.
But how about the following statement:
3. If the result is not statistically significant, you "have to" conclude that there is no difference. (And you certainly can't claim that there is any suggestion of an effect.)
This statement may seem a bit more defensible, because it resembles what people are taught about statistical hypothesis testing and "accepting" the null hypothesis. This may seem only fair: you made an attempt and came up short, so you must admit failure. The problem is that in practice, this has the same operational consequences as the two clearly incorrect statements above. If you are interested in getting at the truth rather than following a notion of "fair play" in a hypothesis testing game, then believing in this will not serve you well. Unfortunately, some reviewers and editors seem to feel that it is very important to enforce such "fair play".
This is still a poor approach, because reasoning via p-values and power is convoluted and unreliable. One problem is that power calculations are usually inaccurate, since they have to be based on assumptions that are hard to know in advance. Inaccuracy is theoretically inevitable and empirically verified. A study of randomized clinical trials in 4 top medical journals ( NEJM, JAMA, Annals of Internal Medicine, and Lancet) found that more than half used assumed SD's off by enough to produce a >2-fold difference in sample size.[1]
Of course, one could do better by re-estimating power after the study is completed. But the assumptions needed for power calculations are still not fully known, and post-hoc power calculations are not considered meaningful. The CONSORT guidelines for reporting randomized clinical trials specifically warn against this practice, stating that, "There is little merit in a post hoc calculation of statistical power using the results of a trial".[2] Why is this not worth doing? Because there is a simpler and better alternative: Confidence intervals show simply and directly what possibilities are reasonably consistent with the observed data, and are widely acknowledged to be superior and sufficient. (Additional references here.)
Here are some other situations that make it tempting to believe that a large p-value must be conclusive:
p>0.05 + Large N = No effect
p>0.05 + Huge Expense = No effect
p>0.05 + Massive Disappointment = No effect
But not if contradicted by the CIs! Sometimes we want to believe that a study must be conclusive, because it was such a good attempt or because it looks like it should be conclusive or because nothing as good will ever be done again. But these considerations carry no scientific weight and cannot overrule what is shown by the CI. If the CI is wide enough to leave doubt about the conclusion, then we are stuck with that uncertainty.
Treatment A: 16 deaths in 100
Treatment B: 8 deaths in 100
These produce the following analyses:
Odds ratio: 2.2, 95% CI 0.83 to 6.2, p=0.13
Risk difference: 8.0%, 95% CI -0.9% to 16.9%
This was reported as "no difference in death rates", presumably based on the p-value of 0.13. This type of interpretation is alarmingly common, but the difference is not zero, which would really be "no difference"; it is 8%.
A sensible interpretation would be:
"Our study suggests an important benefit of Treatment B, but this did not reach statistical significance."
Preeclampsia: RR 1.20 (0.82 - 1.75) This certainly suggests the the vitamins are not effective, because the estimate is a 20% increase in the outcome. But the CI does include values that would constitute some effectiveness, so the conclusion may be a bit overstated.
Growth restriction: RR 0.87 (0.66 - 1.16) Here, we have a big problem. The estimate is a 13% reduction in the outcome, so the definitive statement that vitamins do not reduce this outcome is contradicted by the study's own data. Vitamins did appear to reduce this outcome, and the CI extends to a fairly substantial 34% reduction in risk.
Serious outcomes: RR 0.79 (0.61 - 1.02) The same problem is present here, and it is even more severe. An observed 21% reduction in the most important outcome has been interpreted as definitive evidence against effectiveness. If we knew that this observed estimate were correct, then vitamin supplementation, or at least further study, would probably be worthwhile. In fact, the data in the paper correspond to an estimate of needing to treat 39 women for each serious outcome prevented, a rate that would almost certainly make treatment worthwhile.
The primary result was an estimated 9% reduction in risk of invasive breast cancer (Hazard Ratio = 0.91, 95% CI 0.83 - 1.01, p=0.07). An accurate sound bite would have been, "Lowering fat appears to reduce risk, but study not definitive." An interesting additional result was a reduction in the risk of breast cancer mortality (Hazard Ratio = 0.77, 95% CI 0.48 -1.22). The estimate here is a more substantial reduction in risk, but the uncertainty is wider. If this estimate turned out to be true, this would be very important. Unfortunately, the authors chose to primarily emphasize the fact that the p-value was >0.05. This gave the clear (and incorrect) impression that the evidence favors no benefit of a low-fat diet. The primary conclusion in the JAMA abstract was "a low-fat dietary pattern did not result in a statistically significant reduction in invasive breast cancer risk." The overemphasis of the the fact that the p-value was >0.05 promoted considerable misunderstanding.
This practice is often followed, but then ignored when data is interpreted. The above examples provided estimates and confidence intervals, but these were ignored in the major conclusions, which were based only on the fact that the p-values were >0.05.
But what if the results are less conclusive, such as those in the vitamin study discussed above? For the following results:
"Our results suggest that Vitamin C and E supplementation may substantially reduce the risk of growth restriction and the risk of death or other serious outcomes in the infant, but confidence intervals were too wide to rule out the possibility of no effect."
This interpretation reflects the key facts that 1) the estimates indicate big enough protective effects to be important and 2) the uncertainty around them is too large to permit a strong conclusion that any protective effect exists.
What would have happened if the vitamin paper had been submitted with this more reasonable interpretation? The paper probably wouldn't have ended up in NEJM! Unfortunately, this more accurate interpretation would probably have greatly reduced the paper's chance of acceptance.
The tension between fair interpretation and marketability for high-profile journals is the "elephant in the room" when it comes to conflict of interest. We are all under pressure to make our papers seem as interesting as possible. Despite the careful attention to financial conflicts of interest in medical research, the conflict between scientifically accurate interpretation versus maximizing interest, getting into a high prestige journal, and generally attracting attention (and citations) is largely unrecognized. It is always present, can have a strong influence on presentation and interpretation, and yet gets little attention.
The p-value fallacy can help make negative studies seem more conclusive and interesting. The most prestigious journals tend to prefer results that seem conclusive, perhaps because they are trying to serve clinicians in addition to scientists, and clinicians can make the most use of clear results. Although there is a lot of pressure to make results seem as interesting as possible, this should only go so far. Using the p-value fallacy to make a study seem definitive in one direction instead of suggestive in the other direction would clearly be going too far. In many cases, this is probably not deliberate. In this case, the authors may have felt that p>0.05 was definitive because the study was large and expensive, or perhaps because they had done a power calculation (but their assumptions were wildly off, as usual with power calculations).
Be vigilant and be honest! The usual safeguards against bias due to conflict of interest are disclosure and correspondingly increased vigilance. Because this conflict is always present, the only obvious response is to always be vigilant.
In the case of the Women's Health Initiative, we care about the biological effect of dietary fat and about actual cases of breast cancer that could be prevented.The disconnect between the author's statements and how they were interpreted illustrates why this Best Practice is important.
WHI conclusion: "a low-fat dietary pattern did not result in a statistically significant reduction in invasive breast cancer risk...However, the nonsignificant trends...indicate that longer, planned, non-intervention follow-up may yield a more definitive comparison."
The week after the article mentioned above, Newsweek published a follow-up concerning the difficulties that the press and the public have in understanding scientific results, particularly about diet research. (Scientists also have difficulty with these issues.) Despite this focus, the writers still did not understand what the WHI article stated, perhaps because they assumed--quite reasonably, but incorrectly--that the article must be addressing the real-world question.
Newsweek's interpretation of the WIH conclusion: "The conclusion of the breast-cancer study--that a low-fat diet did not lower risk--was fairly nuanced. It suggested that if the women were followed for a longer time, there might be more of an effect ."
Both the major conclusion from the abstract and the caveat that followed it concerned statistical significance rather than what is really true. Although the "nonsignificant trends" were mentioned, their implications for the important issues were not discussed. The Newsweek writers mis-translated these into more relevant--but incorrect--statements. The two phrases highlighted in red in the two statements above are not the same, and the statements highlighted in green also do not match--the authors meant that the difference may reach p<0.05, not that it will get bigger.
Because the WIH authors chose to completely neglect any direct assessment of the implications of their findings for what may really be true, they made serious misunderstandings virtually inevitable.
Best Practice 3 and Best Practice 2 are complementary. Following Best Practice 2 will usually keep you on track for Best Practice 3, and vice versa.
Perform searches for the words "no" and "not". Whole word searches on these two terms should find most negative interpretations of statistical analyses. Check each sentence found and ask yourself the following questions:
Many detailed examples of how to word interpretations that reflect estimates and CIs are available online.
Figure 1.
Interpretation:"We found strong evidence against any substantial harm or benefit." Because we have strong evidence against any values outside the CI, both these cases argue strongly that any effect is clinically unimportant. Note that this is true even though one is statistically significant.
Figure 2.
Interpretation: In this figure, there is suggestion of substantial benefit, because the estimate would be an important benefit if true. However, there may be no effect (not statistically significant), because the CI includes no effect.
Which of these results would be more exciting? The lower one, even though it has wider uncertainty, because the estimate is so much better.
Figure 3.
Interpretation: In figure 3, there is strong evidence of a benefit (statistically significant). A substantial benefit appears likely, but the CI is too wide to rule out a clinically unimportant benefit, i.e., the CI includes some benefits that would be too small to be clinically important.
Figure 4.
Interpretation: In this figure, we see strong evidence of substantial clinical benefit. This is the most satisfying type of result. Even the upper confidence bound is in the substantial benefit range.
Figure 5.
Interpretation: In this figure, no conclusions are possible due to a very wide CI. This is the least satisfying type of result. There is very little information in the study data.
Also see online resource at https://ctspedia.org/do/view/CTSpedia/ResultsInterpretation
Final wording: "Estimated effects of DHEA on lean body mass, fat mass, and bone density were small, but the confidence intervals around them were too wide to rule out effects large enough to be important."
I find that modifications like this are needed in the majority of papers that I am asked to co-author.
Be sure to use statements like "no interaction terms of treatment with other predictors in the model had p<0.1" rather than "there were no interactions of treatment with other predictors in the model", which would be an instance of the p-value fallacy. Another example is "We checked linearity assumptions by adding quadratic terms for each linear predictor, and none had p<0.05", not "there was no non-linearity", which again would be based on the p-value fallacy.
Here is an example from a paper on estimating past hepatitis C infection risk from reported risk factor histories:[10]
"We note that the confidence intervals were not narrow enough to rule out potentially important interactions, but in the absence of strong evidence for such interactions we focus on the simpler models without them."
"Our results do not support..."
"We found no evidence for..."
"Our data did not confirm..."
"There is no scientific evidence that BSE [Mad Cow Disease] can be transmitted to humans or that eating beef causes it in humans." --Prime Minister John Major, 1995
Of course, it turned out the BSE was transmitted to humans, and over 150 people died from it. "There is no evidence" is commonly used to give the impression that there is evidence on the other side. Although this and similar phrases sound "scientific", they promote sloppy reasoning.
Wording similar to this is popular in politics and advertising, because it gives the misleading impression of a strong case against a conclusion when no such case exists, without being technically incorrect. For the same reasons, the strange popularity of these phrases in scientific writing is disturbing. Whle such phrases may be technically correct, they are bound to be misread as implying evidence against an association or effect. They often involve or promote the p-value fallacy. And when there is strong evidence against an effect, they are too weak.
This no only prevents deception, but also can make statements clearer and stronger. Oddly, investigators often understate their conclusions using weak phrasing. The phrases quoted above seem like safe, conventional ways to state interpretations, despite their drawbacks. I still find that these are the phrases that first pop into my mind when I think about how to state a finding.
FRAM, a nationwide study of fat abnormalities in HIV, was a large study that carefully investigated changes in fat in various anatomical sites among persons with HIV. Its results strongly contradicted established thinking in this area, which was that visceral fat (known as "VAT") increased as peripheral fat decreased and the two were causally linked. Among other results supporting its conclusion of no reciprocal change was the following, showing that peripheral fat loss did not have any substantial association with central fat gain (note upper confidence bound of 1.06):
Peripheral fat loss association with central fat gain, OR: 0.71, CI: 0.47 to 1.06, p=0.10.
Despite the strong results, an early draft stated that "our results do not support the existence of a single syndrome with reciprocal findings". This was revised to read more appropriately: "We found evidence against any reciprocal increase in VAT in HIV-infected persons with peripheral lipoatrophy". [11]
Another example comes from a study of the safety of cannabinoids in persons with treated HIV:
Marijuana effect on log10 VL: -0.06 (-0.26 to 0.13)
Dronabinol: -0.07 (-0.24 to 0.06)
These upper confidence bounds were considered too small to be important, so this was strong evidence against any substantial harm.
In the first draft, the stated conclusion read, "Overall there was no evidence that cannabinoids increased HIV RNA levels over the 21-day study period".
The final draft more appropriately stated that "This study provides evidence that short-term use of cannabinoids, either oral or smoked, does not substantially elevate viral load in individuals with HIV infection".[12]
A good example is from the WHI study of diet and breast cancer that we have been discussing:
"There were departures from the design assumptions that likely reduced study power...If the WHI design assumptions are revised to take into account these departures [less dietary fat reduction], projections are that breast cancer incidence in the intervention group would be 8% to 9% lower than in the comparison group [and] the trial would be somewhat underpowered (projected power of approximately 60%) to detect a statistically significant difference, which is consistent with the observed results."
This illustrates the contorted sort of reasoning that speculation about low power requires. What exactly are they trying to say?
Their intended meaning seems to boil down to the following: "There might be a 9% reduction in risk. We could have missed it because power was only 60%." This speculation is completely pointless, because the conclusion is better supported, and much clearer, from the examination of the estimate and the confidence interval. The HR=0.91, so of course a 9% reduction is possible. It's what they actually saw!
Let's be clear on what happened here. World-class researchers reporting a very high-profile (and hugely expensive) study lapsed into convoluted and completely pointless reasoning. This is a dramatic illustration of how exclusive focus on whether p<0.05 leads to easily avoided problems. Had the authors followed Best Practice 2, no such speculation would have been required, and the implications of their results would have been much clearer. To repeat:
Consider a "negative" study of vitamin E in diabetics that claimed to have proven that vitamin E supplementation does not prevent cancer (this was based on the p-value fallacy, by the way). The study used ITT: "To reduce bias, we included continuing follow-up from those who declined active participation in the study extension and stopped taking the study medication". [13]
But ITT produces underestimates of actual biological effects: it is biased toward no effect. Thus, in addition to ignoring their estimates and CIs, they based a negative conclusion on an approach that is biased in that direction. This is very different from still having a positive finding despite some bias in the other direction, which is where ITT analysis works well.
This is an area where the WHI study did reasonably well. They used specialized methods to attempt to estimate what effect the fat-lowering intervention would have had if it were followed as recommended. These methods try to avoid the self-selection bias that simple per-protocol analyses (or observational studies) would have, while also avoiding the biases of ITT analysis.
The WHI estimate of effect if adherent to a low-fat diet was estimated as a breast cancer Hazard Ratio of 0.85 (0.71 to 1.02). This was a bit lower than the 0.91 from the primary analysis, but it still just misses p<0.05. They went further in trying to account for adherence to the intervention, but balked at going beyond saying that the use of a more stringent adherence definition "leads to even smaller HR estimates and to 95% CIs that exclude 1".
So a specific Best Practice is to
Two closely related problems are:
Following are examples of how to justify not using multiple comparisons adjustments, both for papers and in grant proposals:
Example text for responding to a manuscript review or for inclusion in a paper:
"Although we examine many differences and issues, we report nominal p-values, without adjustment for multiple testing. Such adjustment would be focused on avoidance of one or more results with p<0.05 in the case where all differences are truly zero,[14-16] which is an extremely unrealistic hypothesis about the state of nature in our situation. In addition, adjustment would require that each result detract from the others, but there are clear biological relationships among many of the issues that we examine, and these permit coherent sets of findings to reinforce each other rather than detract from one another. Thus, multiple comparison adjustment would do exactly the wrong thing in such cases.[17] We therefore rely on scientific judgment rather than formal adjustment methods to indicate where caution is warranted despite findings with p<0.05."
Example text for inclusion in a grant proposal:
"Although this Aim involves many different measures, we do not plan formal adjustments for multiple comparisons. This is because we expect many measures to show statistically significant differences, and that directions and magnitudes of differences (perhaps including some with p>0.05) will fit a biologically coherent pattern. In this case, each result will reinforce the other, rather than detracting from one another as required by formal multiple comparisons adjustments such as the Bonferroni method. Conversely, if only one or a very few measures reach statistical significance and their directions and/or magnitudes do not coherently fit with << our substantive theory >>, then we will note that the result(s) with p<0.05 lack biological plausibility and could be due to chance despite meeting the conventional cutoff for statistical significance."
An investigator who has just concluded an animal experiment is puzzled and worried. The experiment included:
Because this had a specific focus on certain pairwise comparisons to address the scientific questions, he had done t- tests and estimated pairwise differences, obtaining positive results that he thought made sense. But he thought that he "had to" perform ANOVA, and this produced a p-value a bit larger than 0.05. So he thought that to be "rigorous" he would have to reach the opposite conclusion of what he found with the focused analyses. In fact, the focused results were what mattered; unfortunately, there is a risk that reviewers may think otherwise.
In fact, reviewers often state flatly that omnibus tests and multiple comparisons adjustments must be used when in fact those approaches would be very inappropriate.
Scenario 2:
A primarily descriptive study that examined effects of four different administration routes had many positive results, not a single one that was likely due to chance. However, the reviewer stated:
"Repeated measures analysis of variance should be completed. *Only if the time-by-treatment interaction is significant *, should time-specific comparisons be made. Then multiple comparison procedures, such as Tukey's test, should be used rather than repeated t- tests."
This would treat p>0.05 on the unfocused omnibus test of time-by-treatment interaction as a reliable indicator that no important differences are present: Problem 1.
The reviewer's comment, particularly the phrase highlighted in red, may sound rigorous, but it is only "rigorous" in the sense of being rigid or harsh, not in the sense of being exactly precise. It requires extreme conservatism--not accuracy--which could result in missing or understating important findings.
Scenario 3:
Another study of the biology of morphine addiction had a great deal of scientific structure that omnibus tests or multiple comparisons adjustments would not take into account. The design of the study was very complex, involving:
A reviewer of the manuscript ignored the consistency of the findings and wrote the following strident comment:
"The statistical analyses are naive. The authors compute what appear to be literally dozens of t- tests without any adjustment to the α level--indeed *the probability of obtaining false positives grows with the number of such tests computed *. The authors should have conducted ANOVAs followed by the appropriate post-hoc tests. Their decision to simple [ sic ] compute t- tests on all possible combinations of means is statistically unacceptable."
The statement highlighted in red is incorrect. The chance of obtaining at least one false positive increases if the null hypothesis holds for all comparisons. False positives in general do not become more likely, and the chance of getting many false positives that all fit together in a coherent biological theory is extremely small. This is a clear case where the results of multiple analyses all reinforce each other rather than detracting from each other as required by omnibus tests and multiple comparisons adjustments.The probability of obtaining multiple positive results exactly where expected and negative results exactly where expected does not grow; it becomes vanishingly small.
Striving for the following best practices will often lead to much better analyses and interpretations than use of omnibus tests and multiple comparisons adjustments.
Best Practice 1 and Best Practice 2, obtaining estimates and CIs and taking them into account when interpreting results, will often be helpful for achieving accurate interpretation. They will usually steer you away from overly conservative, automatic methods based only on p-values.
EXAMPLE 1: Using body mass index (BMI) as a predictor of central fat.
In the case of fat abnormalities in HIV infected persons, early studies controlled for BMI when assessing whether peripheral fat loss was associated with central fat gain. Because fat amounts contribute directly to BMI, this caused a spurious association.
Those with low peripheral fat were less likely to have high central fat. But the opposite picture emerges if you control for BMI. Those with low values of both peripheral and central fat had their low central fat "explained" by a low BMI, thereby leaving the estimated association to be determined only by the rare cases with low peripheral fat and high central fat. This field of inquiry was seriously misguided for many years because of this seemingly obvious problem.
EXAMPLE 2: Total time on treatment as a (fixed) predictor of survival time.
Patients can only be treated if they are alive. Those who die early will necessarly have their total time on treatment limited by how long they lived. Someone who died after 2 days will have at most 2 days of treatment. Conversely, someone with 5 years of treatment must have lived at least 5 years. Proportion of time on treatment, although less obviously biased, also may depend on survival time and be potentially biased.
EXAMPLE 3: Number of rejection episodes as a predictor of post-transplant survival.
Someone who died on day 1 would have at most 1 rejection episode, while someone with 5 episodes would have to have lived quite a bit longer.
To avoid these problems, either
1) ensure that the outcome is not part of the definition of a predictor, and vice versa, or
2) be very careful and clear with interpretation.
In general, it is simplest and safest to make sure that outcome and predictors are fully distinct. For time-to-event analyses, use time-dependent covariates, defined only using measurements up to the present. In general any fixed (non-time-dependent) covariates should be known from information only up to time zero. And even for time-dependent covariates, any use of "look-ahead" information is likely to cause trouble.
There is sometimes confusion about whether to show SDs or SEs. Show SEs (or CIs, which are usually better) to indicate the precision of estimates. Show SDs to indicate the variability in the population.
For odds ratios, relative risks, and hazard ratios, do one of the following:
Do not use "p<" for values of 0.001 or more; use "=". That is, don't say p<0.01 when you could say p=0.0058.
Never use "p=NS" or "p>0.1". This gives needlessly vague information and encourages Problem 1.
Do not show chi-squared or other statistics that provide the same information as p-values (but are less interpretable). These add no information and clutter presentation of results. They may seem to add some technical cachet, but leaving out unimportant details actually conveys a better impression of technical savvy.
For example, an OR of 1.0051 per 1 cell/mm³ increase in CD3 count is very hard to interpret. It is also hard to rescale this by eye, because the OR for a 100 cell increase is (1.0051)¹ºº, which most people cannot calculate in their heads (it's 1.66).
To avoid this problem, rescale numeric predictors before running regression models (or use lincom, in Stata, afterwards). For eample, make a new variable cd4per100=cd4/100.
Avoid the use of the word "significant" by itself. Use "statistically significant" if meaning p<0.05; use "important", "substantial", or "clinically significant" if that is the intended meaning. As noted under Problem 1, some journals reserve the word "significant" used by itself to mean "statistically significant". If they do not allow the full term, just avoid using it at all (this may be a good strategy anyway).
Use "Relative Hazard" or "Hazard Ratio" for proportional hazards model results, not "Relative Risk"; this is sometimes used for analysis of a binary outcome.
Use "Mann-Whitney" instead of "Wilcoxon" or "Wilcoxon rank-sum" to avoid confusion with "Wilcoxon signed-rank". Both "Mann-Whitney" and "Wilcoxon rank-sum" are used, due to the near simultaneous, independent development of the method.
Prevention of missing data is best. If is is unavoidable, however,
Problem 2. Misleading and vague phrasing
Problem 3. Speculation about low power
Problem 4. Exclusive reliance on intent-to-treat analysis
Problem 5. Reliance on omnibus tests
Problem 6. Overuse of multiple comparisons adjustments
Best Practice 2. Ensure that major conclusions reflect the estimates and the uncertainty around them.
Best Practice 2a. Never interpret large p-values as establishing negative conclusions.
Best Practice 3. Discuss the implications of your findings for what may be true in general. Do not focus on "statistical significance" as if it were an end in itself.
Best Practice 4. State what you did find or learn, not what you didn't.
Best Practice 5. Learn as much as you can from your data.
Best Practice 5a. Also do per-protocol analyses, especially if:
Best Practice 6a. Rely on scientific considerations to guard against overinterpretation of findings with p<0.05.
Best Practice 6b. Acknowledge the desirability of independent replication, particularly for unexpected findings.
Best Practice 7. Choose accuracy over conservatism whenever possible.
2. Altman DG, Schulz KF, Moher D, Egger M, Davidoff F, Elbourne D, PC, Lang T, for the CONSORT Group. The revised CONSORT statement for reporting randomized trails: explanation and elaboration. Ann Intern Med. 2001 134:663-94. Page 670.
3. Tukey JW, Tightening the clinical trial. Controlled Clinical Trials. 1993; 14:266-285. P. 281.
4. Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med. 1994; 121:200-6.
5. Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. American Statistician. 2001; 55:19-34.
6. Senn SJ. Power is indeed irrelevant in interpreting completed studies. BMJ. 2002; 325:1304.
7. Rumbold AR, Crowther CA, Haslam RR, Dekker GA, Robinson JS. Vitamins C and E and the risks of preeclampsia and perinatal complications. NEJM. 2006; 354: 1796-1806.
8. Women's Health Initiative. Low-fat dietary pattern and risk of invasive breast cancer. JAMA. 2006; 295:629-42.
[[http://www.ncbi.nlm.nih.gov/pubmed/16467232?ordinalpos=1&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_DefaultReportPanel.Pubmed_RVDocSum] ]]
9. Newman TB, Liljestrand P, Jeremy RJ; Jaundice and Infant Feeding Study Team. Outcomes among newborns with total serum bilirubin levels of 25 mg per deciliter or more. NEJM. 2006; 354: 1889-1900. http://content.nejm.org/cgi/reprint/354/18/1889.pdf
10. Bacchetti P, Tien PC, Seaberg EC, O'Brien TR, Augenbraun MH, Kral AH, Busch MP, Edlin BR. Estimating past hepatitis C infection risk from reported risk factor histories: implications for imputing age of infection and modeling fibrosis progression. BMC Infectious Diseases, 7: 145, doi:10.1186/1471-2334-7-145, 2007).
11. Bacchetti P, Gripshover B, Grunfeld C, et al. Fat distribution in men with HIV infection. JAIDS, 2005; 40 (2):121-31.
12. Abrams DI, Hilton JF, Leiser RJ , Shade SB, et al. Short-term effects of cannabinoids in patients with HIV-1 infection: a randomized, placebo-controlled clinical trial. Ann Intern Med, 2003; 139:258-66.
13. Lonn E, Bosch J, Yusuf S, Sheridan P, et al., for the HOPE and HOPE-TOO Trial Investigators. Effects of long-term vitamin E supplementation on cardiovascular events and cancer: a randomized controlled trial. JAMA. 2005; 293: 1338-47.
14. Rothman KJ. No adjustments are needed for multiple comparisons. Epidemiology, 1990; 1 (1):43-6.
15. Savitz DA, Olshan, AF. Multiple comparisons and related issues in the interpretation of epidemiologic data. Am J Epidemiol, 1995; 142 (9); 904-8.
16. Perneger TV. What's wrong with Bonferroni adjustments. Br Med J, 1998; 316 (7139): 1236-8.
17. Bacchetti P. Peer review of statistics in medical research: the other problem. Br Med J, 2002; 324:1271-723.
18. Wittes J. On looking at subgroups. Circulation, 2009; 119:912-15.