
Bacchetti P. Current sample size conventions: flaws, harms, and alternatives.

A PDF from BioMed Central is available here.

The conventional expectation is that a study must have at least 80% power or else be considered scientifically unsound and even unethical [1]. Some challenges to this dogma have been based on the idea that some information is better than none and that even a small amount of inconclusive information may contribute to a later systematic review [2-4], but conventions remain entrenched, and failing to anticipate systematic reviews is only one aspect of one of three fundamental flaws. I present here a wider challenge to current conventions, including how they cause serious harm. Alternatives could produce both better studies and fairer peer review of proposed studies.

Experience with previous related papers suggests that many readers will immediately formulate objections or counterarguments. Anticipating and pre-empting all these is not possible, but I comment on two of the more likely ones here.

Unfortunately for the standard approach, the real relationship is radically different from a threshold, instead having a concave shape that continually flattens, reflecting diminishing marginal returns. This characteristic shape was recently verified for a wide variety of measures of projected value that have been proposed for use in sample size planning, including power [5]. Falling short of any particular arbitrary goal, notably 80% power, is therefore not the calamity presumed by conventional thinking. The lack of any threshold undercuts the foundation of current standards--they guard against a non-existent danger.
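One way to see the concave shape is to note that many proposed measures of projected value track the precision of a study's estimates, which grows only with the square root of the sample size. The following Python sketch uses one illustrative value measure (the reciprocal of a 95% confidence interval half-width for a mean); this is just one of the measures examined in [5], not the only one, and the standard deviation here is an arbitrary placeholder.

```python
import math

def precision(n, sd=1.0):
    """Reciprocal of the 95% CI half-width for a mean: grows like sqrt(n)."""
    return math.sqrt(n) / (1.96 * sd)

# Halving the sample size keeps 1/sqrt(2) ~ 71% of this value measure,
# not 50% -- diminishing marginal returns rather than a threshold.
ratio = precision(500) / precision(1000)
print(round(ratio, 3))  # 0.707
```

Because the curve keeps this shape at every sample size, no cutoff separates "adequate" from "inadequate" studies.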

Figure 1. Qualitative depiction of how sample size influences a study's projected scientific and/or practical value. A threshold-shaped relationship (dashed line) would create a meaningful distinction between adequate and inadequate sample sizes, but such a relationship does not exist. The reality (solid line) is qualitatively different, exhibiting diminishing marginal returns. Under the threshold myth, cutting a sample size in half could easily change a valuable study into an inadequate one, but in reality such a cut will always preserve more than half of the projected value.

Inaccuracy of sample size calculations is not only theoretically inevitable [6, 8] but also empirically verified. One careful study of assumed standard deviations in a seemingly best-case scenario--randomized trials published in four leading general medical journals--found that about a quarter had >5-fold inaccuracy in sample size and more than half had >2-fold inaccuracy [9]. Another study recently found high rates of inaccuracy along with many other problems [10]. This problem cannot be solved by simply trying harder to pinpoint where 80% power will be achieved, because inaccuracy is inherent in the conventional framework.
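The magnitude of these inaccuracies follows from the quadratic dependence of the required sample size on the assumed standard deviation. A minimal sketch using the standard normal-approximation formula for comparing two means (the effect size and standard deviations here are hypothetical values chosen only for illustration):

```python
def n_per_group(delta, sd, z_alpha=1.959964, z_beta=0.841621):
    """Normal-approximation sample size per group for a two-sided 5% test
    with 80% power: n = 2 * ((z_alpha + z_beta) * sd / delta) ** 2."""
    return 2 * ((z_alpha + z_beta) * sd / delta) ** 2

# If the true SD turns out to be double the assumed SD, the required n is
# 4-fold larger, so the planned "80% power" study is 4-fold too small.
n_assumed = n_per_group(delta=5.0, sd=10.0)   # ~63 per group
n_true = n_per_group(delta=5.0, sd=20.0)      # ~251 per group
```

A mere 2-fold misjudgment of the standard deviation thus produces the 4-fold sample size inaccuracy of the kind documented empirically [9].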

False assurance leads directly to the following logic: "Sample size is adequate to ensure a definitive result, the result is not definitively positive (i.e., p>0.05), therefore the result is definitively negative." I have encountered many researchers who believe this logic, and the widespread practice of considering power when interpreting "negative" studies [15] seems aimed at determining when this reasoning can be applied. This resolves the design-use mismatch, but in the wrong way, by focusing only on whether p<0.05. Although investigators usually report estimates, confidence intervals, and attained p-values, they often ignore these very informative results when interpreting their studies. For example, a study of vitamin C and E supplementation in pregnancy reported rates of infant death or other serious outcomes that implied one outcome prevented for every 39 women treated [16]. The authors nevertheless concluded definitively that supplementation "does not reduce the rate," because the p-value was 0.07. Interpreting p>0.05 as indicating that the results actually observed must be an illusion is very poor reasoning, but I find it in most draft manuscripts I review and many published articles I read. Interpretation of p<0.05 as ensuring that an observed effect is real and accurate also seems to be widespread, despite being unreliable [17].
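To illustrate why estimates and confidence intervals are more informative than the p-value alone, the sketch below computes a risk difference, Wald 95% confidence interval, p-value, and number needed to treat. The event counts are hypothetical, chosen only to be on the same scale as the vitamin trial (935 per group, roughly one outcome prevented per 39 treated, p just above 0.05); they are not the trial's actual data.

```python
import math

def risk_difference_summary(events_ctrl, n_ctrl, events_trt, n_trt):
    """Risk difference, 95% Wald CI, two-sided z-test p-value, and NNT."""
    p_c = events_ctrl / n_ctrl
    p_t = events_trt / n_trt
    diff = p_c - p_t                      # absolute risk reduction
    se = math.sqrt(p_c * (1 - p_c) / n_ctrl + p_t * (1 - p_t) / n_trt)
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    z = diff / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    nnt = 1 / diff                        # number needed to treat
    return diff, ci, p_value, nnt

# Hypothetical counts (not the published trial data), 935 per group:
diff, ci, p, nnt = risk_difference_summary(112, 935, 88, 935)
```

With numbers like these, the estimate suggests a worthwhile benefit (NNT near 39) and the confidence interval barely includes zero; interpreting such a result as definitively negative simply because p exceeds 0.05 discards the information the study actually produced.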

Table 1. Sample layout of sensitivity analysis. Shown are possible study results with a given sample size (935 per group, based on the vitamin study discussed above [16]), for a yes or no outcome. Rows have differing assumptions concerning precision of the estimates, ranging from high precision (top row) to low precision (bottom row). For a continuous outcome, the rows would instead be for optimistic (small), expected, and pessimistic (large) standard deviations.

The entries in the table are exactly the key results that interpretation should focus on when the study is completed, so this properly aligns planning with eventual use. The middle row can be a best guess such as would be used for conventional calculations; the other rows should reflect a reasonable range of uncertainty, which will depend on what is already known about the topic being studied. For the columns, inclusion of the intermediate case is important, because this will often include the most problematic or disappointing potential results. The vitamin study [16] paired a safe and inexpensive intervention with a severe outcome, so even results in the middle column would be regarded as encouraging; the actual completed study landed essentially in box 7, which should have been interpreted as very encouraging even though not definitive. Boxes 8 and 9 will usually be the least useful, but as noted above (False assurance), the risk of disappointing results is always present and should not be considered a flaw in study design.
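The precision entries for rows of such a table can be generated directly. A sketch for a yes/no outcome with 935 per group, where each row corresponds to an assumed outcome rate ranging from low (high precision) to high (low precision); the rates used here are hypothetical placeholders, not the published table's values.

```python
import math

def rd_ci_halfwidth(p_ctrl, p_trt, n=935):
    """95% CI half-width for a risk difference with n subjects per group."""
    se = math.sqrt(p_ctrl * (1 - p_ctrl) / n + p_trt * (1 - p_trt) / n)
    return 1.96 * se

# One assumed outcome rate per table row (hypothetical values): lower rates
# give narrower intervals, i.e., higher precision in the top row.
halfwidths = [rd_ci_halfwidth(p, p) for p in (0.05, 0.10, 0.20)]
```

Tabulating the resulting intervals around optimistic, intermediate, and null observed differences fills in the nine boxes discussed above.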

Research training should not present current conventions as unquestionable dogma. Although trainees must know about the culture they will have to face, education about sample size should be balanced. For example, this article could be discussed.

When preparing a study proposal, courageous investigators could use an alternative approach from the previous section. This may be most practical for highly innovative proposals where standard power calculations would most clearly be meaningless. For other studies, detailed value-of-information methods may be convincing when the effort they require can be devoted. In many cases, it may be safer to supplement standard power calculations with more meaningful reasoning regarding sample size. This avoids dishonesty and at least gives reviewers the option of focusing on what really matters, and the juxtaposition of standard and alternative reasoning may help promote recognition of the standard approach's inadequacies.

Stemming criticism of sample size in the peer review process is necessary to allow alternative approaches to take hold. Reviewers should usually refrain from criticizing sample size and should challenge fellow reviewers who do. If fellow reviewers feel that a study is only half as large as it should be, remind them that this does not mean that the study is doomed to be worthless; instead, it will have *more* than half the projected value that it would with the doubled size. Sample size criticism is currently too easy and convenient; challenging fellow reviewers will help to change this.

Reports of completed studies should not include power calculations, and guidelines that have the effect of requiring that they be reported [11] should be changed to instead discourage their reporting. Reporting power calculations has been justified as a way to disclose the primary outcome and the original target sample size [21, 32], but these can be stated directly without any reference to a power calculation [33]. Because power calculations are often not the real reason for the chosen sample size, providing them for completed studies does not promote--but rather subverts--full, transparent reporting. In addition, power is irrelevant for interpreting completed studies [15, 20, 34, 35], because estimates and confidence intervals allow more direct and reliable interpretation. Reporting power calculations inevitably gives the impression that they matter for interpretation, which serves to reinforce the widespread misconception that they allow any result with p>0.05 to be interpreted as proving the null hypothesis [33].

**Competing Interests**

The author declares that he has no competing interests.

**Acknowledgements**

This publication was supported by United States National Institutes of Health Grant Number UL1 RR024131. Its contents are solely the responsibility of the author and do not necessarily represent the official views of the National Institutes of Health. I thank Andrew Vickers of the Memorial Sloan-Kettering Cancer Center for helpful comments on a previous draft of this paper.

2. Edwards SJL, Lilford RJ, Braunholtz D, Jackson J: **Why "underpowered" trials are not necessarily unethical**. *Lancet* 1997, **350**:804-807.

3. Guyatt GH, Mills EJ, Elbourne D: **In the era of systematic reviews, does the size of an individual trial still matter?** *PLoS Medicine* 2008, **5**:3-5.

4. Vail A: **Experiences of a biostatistician on a UK research ethics committee**. *Statistics in Medicine* 1998, **17**:2811-2814.

5. Bacchetti P, McCulloch CE, Segal MR: **Simple, defensible sample sizes based on cost efficiency**. *Biometrics* 2008, **64**:577-585. Available here.

6. Kraemer HC, Mintz J, Noda A, Tinklenberg J, Yesavage JA: **Caution regarding the use of pilot studies to guide power calculations for study proposals**. *Archives of General Psychiatry* 2006, **63**:484-489.

7. Horrobin DF: **Are large clinical trials in rapidly lethal diseases usually unethical?** *Lancet* 2003, **361**:695-697.

8. Matthews JNS: **Small clinical trials: are they all bad?** *Statistics in Medicine* 1995, **14**:115-126.

9. Vickers AJ: **Underpowering in randomized trials reporting a sample size calculation**. *Journal of Clinical Epidemiology* 2003, **56**:717-720.

10. Charles P, Giraudeau B, Dechartres A, Baron G, Ravaud P: **Reporting of sample size calculation in randomised controlled trials: review**. *British Medical Journal* 2009, **338**:b1732.

11. Altman DG, Schulz KF, Moher D, Egger M, Davidoff F, Elbourne D, Gotzsche PC, Lang T, for the CONSORT Group: **The revised CONSORT statement for reporting randomized trials: explanation and elaboration**. *Annals of Internal Medicine* 2001, **134**:663-694.

12. Gardner MJ, Altman DG: **Confidence intervals rather than P values: estimation rather than hypothesis testing**. *British Medical Journal* 1986, **292**:746-750.

13. Goodman SN: **P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate**. *American Journal of Epidemiology* 1993, **137**:485-496.

14. Prentice RL, Caan B, Chlebowski RT, Patterson R, Kuller LH, Ockene JK, Margolis KL, Limacher MC, Manson JE, Parker LM *et al*: **Low-fat dietary pattern and risk of invasive breast cancer: the Women's Health Initiative randomized controlled dietary modification trial**. *JAMA* 2006, **295**:629-642.

15. Hoenig JM, Heisey DM: **The abuse of power: the pervasive fallacy of power calculations for data analysis**. *American Statistician* 2001, **55**:19-24.

16. Rumbold AR, Crowther CA, Haslam RR, Dekker GA, Robinson JS: **Vitamins C and E and the risks of preeclampsia and perinatal complications**. *New England Journal of Medicine* 2006, **354**:1796-1806.

17. Ioannidis JPA: **Why most published research findings are false**. *PLoS Medicine* 2005, **2**:696-701.

18. Detsky AS: **Using cost-effectiveness analysis to improve the efficiency of allocating funds to clinical trials**. *Statistics in Medicine* 1990, **9**:173-184.

19. Senn S: **Statistical Issues in Drug Development**, 2nd edn. Chichester, England; Hoboken, NJ: John Wiley & Sons; 2007.

20. Goodman SN, Berlin JA: **The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results**. *Annals of Internal Medicine* 1994, **121**:200-206.

21. Schulz KF, Grimes DA: **Epidemiology 1 - Sample size calculations in randomised trials: mandatory and mystical**. *Lancet* 2005, **365**:1348-1353.

22. Norman GR, Streiner DL: **PDQ Statistics**, 3rd edn. Hamilton, Ont.: B.C. Decker; 2003.

23. Bacchetti P: **Peer review of statistics in medical research: the other problem**. *British Medical Journal* 2002, **324**:1271-1273.

24. Panel on Scientific Boundaries for Review: **Recommendations for change at the NIH's Center for Scientific Review: Phase 1 report**. http://www.csr.nih.gov/EVENTS/summary012000.htm, 2000, accessed January 31, 2010.

25. Bacchetti P, Wolf LE, Segal MR, McCulloch CE: **Ethics and sample size**. *American Journal of Epidemiology* 2005, **161**:105-110. Available here.

26. Bacchetti P, Wolf LE, Segal MR, McCulloch CE: **Bacchetti et al. respond to "Ethics and sample size - another view"**. *American Journal of Epidemiology* 2005, **161**:113.

27. Bacchetti P, Wolf LE, Segal MR, McCulloch CE: **Re: "Ethics and sample size" - reply**. *American Journal of Epidemiology* 2005, **162**:196.

28. Breslow N: **Are statistical contributions to medicine undervalued?** *Biometric Bulletin* 2002, **19**:1-2. http://www.tibs.org/WorkArea/showcontent.aspx?id=660, accessed September 27, 2009.

29. Bacchetti P, McCulloch CE, Segal MR: **Simple, defensible sample sizes based on cost efficiency - rejoinder**. *Biometrics* 2008, **64**:592-594.

30. Willan AR: **Optimal sample size determinations from an industry perspective based on the expected value of information**. *Clinical Trials* 2008, **5**:587-594.

31. Willan AR, Pinto EM: **The value of information and optimal clinical trial design**. *Statistics in Medicine* 2005, **24**:1791-1806.

32. Altman DG, Moher D, Schulz KF: **Peer review of statistics in medical research: reporting power calculations is important**. *British Medical Journal* 2002, **325**:492.

33. Bacchetti P: **Peer review of statistics in medical research: author's thoughts on power calculations**. *British Medical Journal* 2002, **325**:492-493.

34. Senn SJ: **Power is indeed irrelevant in interpreting completed studies**. *British Medical Journal* 2002, **325**:1304.

35. Tukey JW: **Tightening the clinical trial**. *Controlled Clinical Trials* 1993, **14**:266-285.

-- PeterBacchetti - 08 Jan 2012


Topic revision: r8 - 24 Jan 2014 - 16:19:02 - PeterBacchetti
