From P Values to Bayesian Statistics, It's All in the Numbers

Steven Ross
The Scientist, Apr. 26, 2004

On first consideration, it seems a straightforward question: How effective and safe is drug A in treating condition B? But the design and analysis of the clinical trials that set out to answer this question are far from straightforward, involving an overwhelming number of variables.

First, the subjects: Any group of human beings will show boundless variation in both genetic makeup and nongenetic factors such as age and lifestyle.

Then the disease: Behind the convenient categorization, each case of the "same" disease is as unique as the patient in terms of stage, underlying cause, previous treatment, and host interaction.

The impact of the tested drug will be influenced by dose, the patient's metabolism, genetics, and compliance with the trial regimen. Even seemingly trivial variability in the way individuals in different centers implement the trial design will add to the uncertainty and the inevitable errors in reading, recording, and analyzing data.

Extracting meaning from this noisy data is an industry in itself, and one with its fair share of controversies. Most prominent are the definitions of significance, including the appropriateness of relying on P values; the interpretation of trial results involving multiple drugs; and the so-called meta-analysis, which pools results from different trials of the same drug.

Are the problems serious? "We don't know how many drugs are approved that shouldn't be. We don't know how many drugs are not approved that should be. Are there drugs that hurt people? No doubt. But we don't know what they are," says statistics professor Joseph B. Kadane, Carnegie Mellon University, Pittsburgh, Pa.

PANNING P

Although the P value appears to be clean, clear-cut, and comforting, "There is a recognition that you don't want to put [only] a P value behind a clinical trial result," says Robert T. O'Neill, director of the Office of Biostatistics at the FDA's Center for Drug Evaluation and Research. The P value is the probability of obtaining results at least as extreme as those observed if chance alone were at work. The standard for clinical trials--and almost everything else in the statistics world--has been that a result is significant when P is less than 0.05; that is, when results that extreme would arise by chance alone less than one time in 20.
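For readers who want to see the convention in action, here is a minimal sketch (with made-up numbers, not data from any trial) comparing a simulated drug arm against a placebo arm with a two-sample t-test:

```python
# A sketch of the P < 0.05 convention: comparing a hypothetical outcome
# measure between simulated drug and placebo arms.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
drug = rng.normal(loc=1.4, scale=1.0, size=50)     # treated arm (simulated)
placebo = rng.normal(loc=1.0, scale=1.0, size=50)  # control arm (simulated)

result = stats.ttest_ind(drug, placebo)
print(f"t = {result.statistic:.2f}, P = {result.pvalue:.3f}")
# "Significant" by convention means results this extreme would arise by
# chance alone less than 1 time in 20.
print("significant at 0.05" if result.pvalue < 0.05 else "not significant at 0.05")
```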

The P value's relevance is being questioned because some, and often many, of the sources of variation in clinical trials are overlooked in the analysis. For example, in an open trial of a heart medication, the fact that a person who receives drug A doesn't have a heart attack could be due partly or entirely to genetic predisposition, lifestyle, or the medication; yet the statistical analysis might be interpreted to credit the drug. And it can cut both ways: A drug that is effective in a small subpopulation of patients may fail to reach significance, because the majority of the trial subjects have a subtly different, undetected, and unsusceptible disease.
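A small simulation makes the dilution concrete. The numbers below are assumptions for illustration only: a drug that helps just the 10% of patients who carry the susceptible form of the disease.

```python
# A sketch of how a drug effective only in a small, undetected
# subpopulation can fail to reach P < 0.05 in the full trial.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100                                   # patients per arm
responder = rng.random(n) < 0.10          # assumed: 10% have the susceptible form
effect = np.where(responder, 2.0, 0.0)    # drug works only in that subgroup
drug = rng.normal(0.0, 1.0, n) + effect
placebo = rng.normal(0.0, 1.0, n)

# Whole-trial comparison: the subgroup's benefit is diluted away.
print("all patients: P = %.3f" % stats.ttest_ind(drug, placebo).pvalue)
# If the subgroup could be identified, the effect is plain.
# (Exact P values vary with the random seed.)
print("responders:   P = %.3f" % stats.ttest_ind(drug[responder], placebo[responder]).pvalue)
```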


MULTIPLEXING MIRE

Often, several drugs will be used in combination in a Phase II trial, or in a trial conducted after drug approval to expand label uses. The idea is to save money or to accelerate the approval of drugs that are to be taken as a last resort (these are almost always used in combination with existing treatments, or after existing treatments have failed). "The FDA is under pressure from all sides and ... a large portion of the FDA approval system is now funded by industry to speed approval," says Steve Goodman, Johns Hopkins University, and the new editor of Clinical Trials.

The statistical limitations of multiple-drug trials, though well understood, are subject to some judgment calls. Multidrug trials work best when the treatments work independently, says Finlay McAlister of the University of Alberta Hospital. If the drug effects are not additive, or if one drug sharply affects the other's actions, then the statistical power can be dramatically diminished. "Our general feeling," says O'Neill, "is that you have to take seriously that you may have an interaction between A and B. ... The company may be kidding themselves if they are not recognizing that they have a multiplicity."
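To see what taking the multiplicity seriously can look like, here is a sketch of a two-by-two factorial analysis in which an assumed A-by-B interaction is tested explicitly; the effect sizes and model are illustrative, not drawn from any cited trial:

```python
# A sketch of the multiplicity concern: in a 2x2 factorial design
# (drug A crossed with drug B), an A-by-B interaction appears as the
# A:B term in a linear model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 400
A = rng.integers(0, 2, n)                 # 1 = patient received drug A
B = rng.integers(0, 2, n)                 # 1 = patient received drug B
# Assumed effects: each drug helps on its own, but B blunts A's action.
y = 1.0 * A + 1.0 * B - 0.8 * A * B + rng.normal(0.0, 1.0, n)

df = pd.DataFrame({"y": y, "A": A, "B": B})
fit = smf.ols("y ~ A * B", data=df).fit()  # "A * B" expands to A + B + A:B
print(fit.summary().tables[1])             # the A:B row tests the interaction
```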

However, McAlister conducted a systematic review of this issue last year in JAMA,1 and he says that only two of the 31 trials he reviewed demonstrated a clear interaction between two drugs that would have made analysis problematic.

The FDA can require more testing, and does. But it rarely issues what it calls a "refusal to file" notice, flatly turning down a trial that a drug company deems acceptable, say Goodman and O'Neill.

Meta-analysis, another approach that deals in multiples, is ubiquitous in the medical literature. A meta-analysis pools the results of many trials of the same treatment, but the approach can be sloppy. In a course that Goodman teaches at Johns Hopkins, he gives his students a published meta-analysis and tells them to study the same question. "Two-thirds to three-quarters of the time the verdict of the group is that [the study] was crap, and a large percent [of the class] is totally mystified as to why the studies were done in the first place." The heart of a meta-analysis, he says, is to study the reasons for differences among the studies. Meta-analyses get published, says O'Neill, because "doctors love them. That's a whole different problem. You talk about [needing] remedial education ..."
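The arithmetic at the core of a basic fixed-effect meta-analysis is simple enough to sketch: each trial's effect estimate is weighted by the inverse of its variance, and a heterogeneity statistic flags the between-study differences Goodman wants studied. The figures below are made up:

```python
# A sketch of inverse-variance (fixed-effect) meta-analysis, pooling
# hypothetical effect estimates from several trials of one treatment.
import numpy as np

effects = np.array([0.30, 0.10, 0.45, 0.20])  # per-trial effect estimates (made up)
ses = np.array([0.15, 0.20, 0.25, 0.10])      # their standard errors (made up)

w = 1.0 / ses**2                              # weight = inverse variance
pooled = np.sum(w * effects) / np.sum(w)
se_pooled = np.sqrt(1.0 / np.sum(w))
Q = np.sum(w * (effects - pooled) ** 2)       # Cochran's Q: between-trial heterogeneity
print(f"pooled effect = {pooled:.3f}, 95% CI +/- {1.96 * se_pooled:.3f}, Q = {Q:.2f}")
```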

Statisticians are beginning to use other methods for calculating drug-trial results, which may provide predictive information sooner by requiring data from fewer patients than traditional analyses. One of the more popular methods, Bayesian statistics, incorporates prior knowledge and accumulated experience into probability calculations. This method has been used widely in the approval of medical devices. As clinicians use the devices, they amass knowledge that can be incorporated into later clinical trials to create a model for the outcome against which the statistics can be tested. "To my mind, the most important reason for thinking about statistics in a Bayesian way is that Bayesian statistics is internally consistent," wrote Kadane in 1995.2
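Here is a minimal sketch of the Bayesian mechanics, using a conjugate Beta-Binomial model with assumed, illustrative numbers: prior experience with a treatment is encoded as a Beta distribution, and new trial outcomes update it.

```python
# A sketch of Bayesian updating: prior experience folded into a new
# trial's analysis via a conjugate Beta-Binomial model.
from scipy import stats

# Assumed prior: earlier experience suggests roughly a 60% response
# rate, encoded here as Beta(12, 8).
a_prior, b_prior = 12, 8
successes, failures = 30, 15              # hypothetical new-trial outcomes

# Conjugacy makes the update a simple addition of counts.
posterior = stats.beta(a_prior + successes, b_prior + failures)
lo, hi = posterior.interval(0.95)
print(f"posterior mean response rate: {posterior.mean():.3f}")
print(f"95% credible interval: ({lo:.3f}, {hi:.3f})")
```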

Steve Ross (SSR3@columbia.edu) is an associate professor, Professional Practice, Columbia University Graduate School of Journalism.

References
1. F.A. McAlister et al., "Analysis and Reporting of Factorial Trials," JAMA, 289:2545-53, 2003.

2. J.B. Kadane, "Prime time for Bayes," Control Clin Trials, 16:313-18, 1995.