I believe that another important limitation of multiple comparison adjustments is that they cannot take subject-matter knowledge into account. An important case where this matters is when multiple results all fit a coherent pattern. For example, suppose treatment A appears superior to treatment B for survival, quality of life, health care costs, toxicity, and days of work missed. If each of these five comparisons had P=0.015, then a naive Bonferroni adjustment would give them all P=0.075, which could be misinterpreted as evidence against superiority. In many studies, the ensemble of results may reinforce one another, instead of detracting from one another as automatically assumed by multiple comparison adjustment methods. I’ve described some other examples in some lecture notes posted at CTSpedia.
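The Bonferroni arithmetic in this example can be checked with a minimal sketch. The function name `bonferroni` and the list of outcomes are illustrative assumptions; the only inputs taken from the comment are the five outcomes and the unadjusted P of 0.015.

```python
def bonferroni(p_values):
    """Multiply each P-value by the number of comparisons, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# The five outcomes named in the example, each with an unadjusted P of 0.015.
outcomes = [
    "survival",
    "quality of life",
    "health care costs",
    "toxicity",
    "days of work missed",
]
adjusted = bonferroni([0.015] * len(outcomes))

for name, p in zip(outcomes, adjusted):
    print(f"{name}: adjusted P = {p:.3f}")
```

Every adjusted value lands above 0.05 even though all five results point the same way, which is the pattern the adjustment cannot see.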
I had thought that such considerations were usually not relevant in extremely high-multiplicity genetic studies, but a trainee recently told me about such a case where subject-matter knowledge was important. The smallest P-value was just larger than the conventional cutoff (10^-7.5, I think), but that SNP happened to be one of the few (out of the more than a million tested) known to relate to a plausible mechanism of the disease being studied. Requiring the conventional P-value cutoff for such a special finding does not make any sense, but her group was having difficulty getting the study published.
I agree that the “essential disconnect” that Jeffrey writes about is very important, and I expect that this will indeed come up in the next one (or two) journal club discussions. I’ve discussed this as one of the three fundamental flaws in current sample size conventions; see article reprinted at CTSpedia.
The 4 to 1 ratio of Type II to Type I error was proposed by Jacob Cohen in a 1965 book chapter. He argued (with some caveats) that Type I errors were often about 4 times worse than Type II errors, so with alpha fixed by convention at 0.05, beta should be set to 0.20. I think this is poor reasoning, even if the valuation of the errors follows the 4 to 1 ratio. I don’t think that optimization of the tradeoff between alpha and beta given a fixed N will generally result in alpha and beta following that ratio, as implicitly assumed. More importantly, in sample size planning, we are not trading off alpha and beta with a fixed N. We are instead trading off reduction in beta versus increasing cost and burdening more participants. Optimization would therefore require considering the relevant quantities, costs and burdens. In fact, some colleagues and I have argued in detail that planning sample size based only on cost makes more sense than the supposed ideal of planning it without any consideration of cost. Properly considering the tradeoffs also has ethical implications, as we pointed out in a pending letter to the editor regarding the article being discussed.
Re both Peter and Knut’s comments on 3.9: I suspect these lines of thought will continue naturally into our next meeting. Let’s set aside the complications of multiplicity for a moment: Both Peter & Knut observe that the relative cost of inferential errors is context-dependent, yet the scientific community has become quite rigid with respect to alpha (demanding 0.05) and is becoming so with respect to beta (0.20 becoming standard). Why should this 4-to-1 ratio be universally imposed? It contributes to meaningless power calculations where the error rates and the feasible N are all constrained, so that the target effect size is forced to be unrealistically large (not the “minimum clinically meaningful value” or even a good prior guess at the actual value). The only remaining option in such cases is to abandon the study. The latter action may sometimes be the correct ethical decision, and the statistician may need to advocate for it to an investigator who really wants to do the study. Unless statisticians can influence journals, funding agencies, etc., to accept well-reasoned alternative choices for error rates, or move to a different inferential framework, we will continually face these difficult situations.
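To make the constraint concrete, here is a rough sketch of how fixing alpha, beta, and the feasible N determines the effect size that a power calculation must target. It assumes a two-arm study with equal group sizes, a two-sided test, a standardized mean difference, and the usual normal approximation; the function name and the feasible N of 20 per group are hypothetical and not taken from the discussion.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_effect(n_per_group, alpha=0.05, beta=0.20):
    """Standardized difference detectable with power 1 - beta.

    Normal-approximation formula for comparing two means with equal
    group sizes and a two-sided test at level alpha.
    """
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(1 - beta)) * sqrt(2 / n_per_group)


feasible_n = 20  # hypothetical number of participants available per group

conventional = min_detectable_effect(feasible_n)  # alpha 0.05, beta 0.20
relaxed = min_detectable_effect(feasible_n, alpha=0.10, beta=0.30)

print(f"conventional error rates: effect size {conventional:.2f} SD")
print(f"relaxed error rates:      effect size {relaxed:.2f} SD")
```

With all three inputs fixed, the target effect size is simply whatever the formula returns, whether or not it is clinically plausible; relaxing the error rates is the only way to bring it down without more participants.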
The essential disconnect between Neyman-Pearson hypothesis testing and the goals of most scientific studies is highlighted by the related point on dichotomizing results and quantification of evidence. The simple-vs-simple hypothesis testing model used for power calculations requires an alternative point hypothesis, which may be unrealistic (as above) and even if realistic will play no special role in the actual analysis, where it is just another point in the space of alternative hypotheses that are really under consideration. At that point, focus falls on the p-value as a measure of strength of evidence, a role for which it is ill-suited. Why the shift in focus? Because choosing between point hypotheses is so rarely what we want to do; we want to summarize evidence, most often over a range of hypotheses. Someone may sometimes have to make actual decisions based on that evidence, such as which set of genes to pursue, but the need to express the evidence is present regardless of whether a decision must follow. Mary asks what measure of evidence should be used. I find the likelihood ratio the most convincing (see work by Jeffrey Blume such as his 2002 tutorial in Statistics in Medicine, or the book by his advisor Richard Royall, “Statistical Evidence”).
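As a small illustration of summarizing evidence over a range of hypotheses with likelihood ratios, the sketch below assumes normally distributed data with a known standard deviation. The function name and all numbers are made up for illustration and are not drawn from the cited tutorial or book.

```python
from math import exp


def normal_likelihood_ratio(sample_mean, n, mu_a, mu_b, sigma):
    """Likelihood ratio supporting mean mu_a over mean mu_b.

    Assumes n independent normal observations with known standard
    deviation sigma, summarized by their sample mean.
    """
    log_lr = n / (2 * sigma**2) * (
        (sample_mean - mu_b) ** 2 - (sample_mean - mu_a) ** 2
    )
    return exp(log_lr)


# Hypothetical data: 25 observations, sample mean 0.4, known sigma of 1.
sample_mean, n, sigma = 0.4, 25, 1.0

# Evidence for each candidate mean relative to a mean of zero.
for mu in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6):
    lr = normal_likelihood_ratio(sample_mean, n, mu_a=mu, mu_b=0.0, sigma=sigma)
    print(f"mean {mu:.1f} vs 0.0: likelihood ratio {lr:.2f}")
```

No single alternative is privileged here: the same data yield a graded measure of support for every candidate value, with no yes/no cutoff required.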
Re 3.9, I agree with Peter and would like to expand on this a bit. Sometimes the problem is not the lack, but the abuse, of testing for multiplicity. Most "genomic studies", such as genome-wide association and expression studies, are better seen as selection procedures (Bechhofer 1954) rather than confirmatory tests. As the aim in selection procedures is to balance power vs the size of the selected set, rather than the level, "adjusting" p-values to the number of tests used is irrelevant for the set being selected, and the choice of an essentially arbitrary cutoff (e.g., 10^-7.5, irrespective of chip density) is misleading. In particular, a larger set (equivalent to a higher p-value cutoff) should be chosen in a smaller study to ensure sufficient power. Statisticians should be as vigilant in preventing the abuse of "correction for multiplicity" in selection studies as in enforcing its appropriate use in confirmatory statistics.
Re 3.5, the focus on the "assumption [of a] Gaussian distribution" may also be misleading. First, least-squares methods are highly robust against deviations of the empirical distribution of residuals from the Gaussian distribution (Scheffe 1959). Second, lack of a "significant" result of a test for deviation does not prove the null hypothesis (of a Gaussian distribution). Hence, requiring "empirical verification" of assumptions could create the problems it is supposed to address. Finally, the focus on the Gaussian distribution may result in other assumptions being overlooked, such as the adequacy of the measure of central tendency being used (arithmetic mean, geometric mean, median, ...). Which measure of central tendency to choose can rarely be decided (or verified) based on the data. Instead, knowledge of the subject matter needs to be applied to select this measure. In particular, an approximate answer to the correct question may be better than the exact answer to the wrong question.
I like Peter's point about thinking about the "amounts of evidence" but how do we code this? Laura Lee Johnson and I have had a few conversations about how to do this type of scoring or coding. What do people suggest that we do?
Title_Discussion_Topic  Balancing false discovery and missed discovery 
Name_Topic_Initiator  Peter Bacchetti 
Online_Journal_Club_Meeting  Meeting 1 
Description  Problem to be explored 
I’m not clear on the phrase in the second paragraph of section 3.9, “biased towards false discovery.” What would no bias toward false discovery look like? The ideal false discovery rate is clear (zero), but it is not attainable except at the cost of forgoing all discovery. I believe that great care is needed to find the right balance of risk of false discovery and risk of missing true discoveries. For example, when the payoff from a true discovery is huge and the cost of a false discovery is trivial, leaning strongly toward further investigation of leads could be warranted, even if the vast majority won’t work out. A related point is that the concept of “discovery” seems to needlessly assume dichotomization of results as yes or no. In general, I would argue that it is better to think in terms of amounts of evidence (in keeping with the title of the section). 
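One rough way to express this balance is as a break-even calculation. The sketch below assumes a simple expected-value framing, which is an illustrative simplification and not part of the original description: a lead is worth pursuing when its probability of being true, times the payoff, exceeds the probability of being false, times the cost. The function name and the payoff and cost figures are hypothetical.

```python
def break_even_probability(payoff_if_true, cost_if_false):
    """Smallest probability of a true lead at which pursuit breaks even.

    Pursuit has positive expected value when
    p * payoff_if_true > (1 - p) * cost_if_false,
    which rearranges to p > cost / (payoff + cost).
    """
    return cost_if_false / (payoff_if_true + cost_if_false)


# Hypothetical units: a huge payoff and a trivial cost of a false lead.
threshold = break_even_probability(payoff_if_true=1000, cost_if_false=1)
print(f"pursue leads with probability above {threshold:.4f}")

# When payoff and cost are comparable, much stronger evidence is needed.
balanced = break_even_probability(payoff_if_true=1, cost_if_false=1)
print(f"pursue leads with probability above {balanced:.2f}")
```

Under these assumed numbers, following up leads stays worthwhile even when nearly all of them fail, which matches the point that the acceptable rate of false discovery depends on the stakes.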
Disclaimer  The views expressed within CTSpedia are those of the author and must not be taken to represent policy or guidance on the behalf of any organization or institution with which the author is affiliated. 