Technical Comments
Comment on "The Consensus Coding Sequences of Human Breast and Colorectal Cancers"
Sjöblom et al. (Research Article, 13 October 2006, p. 268) reported nearly 200 novel cancer genes said to have a 90% probability of being involved in colon or breast cancer. However, their analysis raises two statistical concerns. When these concerns are addressed, few genes with significantly elevated mutation rates remain. Although the biological methodology in Sjöblom et al. is sound, more samples are needed to achieve sufficient power.
1 Broad Institute of Massachusetts Institute of Technology and Harvard University, Cambridge, MA 02142, USA.
2 Department of Statistics, Stanford University, Stanford, CA 94305, USA. 3 Dana-Farber Cancer Institute, Boston, MA 02115, USA. 4 Department of Medicine, Children's Hospital Boston, Boston, MA 02115, USA. 5 Howard Hughes Medical Institute, Chevy Chase, MD 20815, USA. 6 Harvard Medical School, Boston, MA 02115, USA. 7 Department of Health Research and Policy, Stanford University, Stanford, CA 94305, USA. 8 Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA.
* These authors contributed equally to this work.
Sjöblom et al. (1) reported the first genome-wide effort to identify genes mutated in cancer. They also introduced a two-stage design: they screened a large set of genes (13,023) for somatic mutations in a discovery set (11 breast and 11 colorectal cancers) and then screened, in a validation set (24 breast or colorectal tumors), only the small subset of genes that harbored at least one somatic mutation in the discovery set. They identified genes as candidate cancer genes (CAN genes) by applying a statistical model designed to assess the likelihood that the observed somatic nonsynonymous mutations would occur by chance. The analysis employed the false discovery rate (FDR) procedure of Benjamini and Hochberg (2) and assumed a background mutation rate of µ = 1.2 × 10⁻⁶.
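For readers unfamiliar with the Benjamini and Hochberg procedure, the following is a minimal sketch of its standard step-up form, not the code used by Sjöblom et al. The function name `benjamini_hochberg` and the example P values are illustrative assumptions. The default level q = 0.10 corresponds to the 10% FDR discussed in this comment.

```python
def benjamini_hochberg(p_values, q=0.10):
    """Standard Benjamini-Hochberg step-up procedure (illustrative sketch).

    Returns a list of booleans, in the input order, that is True where
    the corresponding null hypothesis is rejected at FDR level q.
    """
    m = len(p_values)
    # Indices of the P values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank

    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical per-gene P values, for illustration only.
    example = [0.001, 0.008, 0.039, 0.041, 0.042,
               0.060, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(example, q=0.05))
```

The inputs to this procedure must be valid P values, that is, tail probabilities under the null model; the guarantee on the FDR does not hold for other kinds of scores.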
The Sjöblom et al. analysis yielded rank-ordered lists of candidate genes with 122 and 69 genes in breast and colorectal cancers, respectively. These genes were said to have a 90% chance of being true cancer genes, that is, harboring mutations at a frequency significantly greater than expected by chance, based on the FDR approach (that is, FDR
First, the authors incorrectly apply the FDR formula. The formula requires the tail probabilities [Prob(X
Second, the analysis is highly sensitive to the background mutation rate µ used in the statistical model (see Supporting Online Material). Different tumors and cell lines may have different background mutation rates, and accurate estimation of µ requires large amounts of sequence data generated from the same tumor population. Sjöblom et al. estimated µ based on a different, smaller data set. However, an estimate based on their own data yields substantially higher mutation rates, by factors of about 1.9 and 1.4 in breast and colorectal cancers, respectively (estimated in two ways; see SOM). If these rates are inserted into the analysis, the number of candidate genes falls to only 1 for breast cancer and 11 for colorectal cancer. Only four of these genes were not previously reported as mutated in cancer.
We also note that the analysis assumes that µ is constant across the genome. It is well known that the germline mutation rate shows regional variation (see SOM), and similar variation could be estimated in cancer from silent mutations in adjacent sites. Such variation would bias the discovery screen to select genes with higher background mutation rates; therefore, an increased effective value of µ should be used in calculating significance. Allowing for plausible variation among genes (CV = 0.4) would increase the effective value of µ by a factor of more than 1.3. The candidate lists would be reduced to only known cancer genes.
We note that the authors have recently performed a simulation study (3) based on the empirical Bayes or plug-in approach of Efron et al. (4) as an alternative way to estimate the FDR of their gene lists. The simulation is said to indicate that the results of Sjöblom et al. are conservative, that is, that the true FDR is even lower than 10%.
However, we have discovered that their simulation study contains a subtle but important statistical shortcoming (SOM). Specifically, their analysis uses a score (CaMP score) for each gene that is highly sensitive to the presence of true cancer genes in the data. Therefore, a simulation that assumes no true cancer genes cannot be used to estimate the FDR in settings in which even a single true cancer gene exists. Replacing their CaMP score with one that does not suffer from this functional dependency among genes yields much higher FDRs. We emphasize that the mathematical shortcomings discussed above do not simply reflect different but reasonable approaches to the analysis but are fundamental statistical problems. We also suggest other statistical tests to detect candidate cancer genes, some of which are more powerful than the one above (see SOM, Appendices A to D). Using our estimated background mutation rates, even these more powerful tests yield few candidate genes.
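The sensitivity to the background mutation rate µ can be illustrated with a small calculation. The sketch below assumes a simple Poisson model for the number of mutations in a gene, which is not necessarily the model used by Sjöblom et al. or in the SOM. The gene length (1,500 bases), the number of tumors (11), and the observed count (2 mutations) are hypothetical values chosen for illustration. Only µ = 1.2 × 10⁻⁶ and the scaling factors 1.4 and 1.9 come from the text.

```python
import math


def poisson_upper_tail(k, lam):
    """Tail probability P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    # P(X >= k) = 1 - P(X <= k - 1)
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def tail_sensitivity(k, bases_screened, mu, factors):
    """Map each scaling factor f to P(X >= k) under rate f * mu * bases_screened."""
    return {f: poisson_upper_tail(k, f * mu * bases_screened) for f in factors}


if __name__ == "__main__":
    MU = 1.2e-6              # background rate quoted in the text
    BASES = 1500 * 11        # hypothetical: 1,500-base gene in 11 tumors
    OBSERVED = 2             # hypothetical number of observed mutations
    for factor, p in tail_sensitivity(OBSERVED, BASES, MU, (1.0, 1.4, 1.9)).items():
        print(f"mu x {factor}: P(X >= {OBSERVED}) = {p:.3e}")
```

When the expected count is small, the probability of seeing k or more mutations scales roughly as the k-th power of µ, so a modest upward revision of µ can move many genes from below to above a fixed significance threshold.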
After correcting the statistical analysis and using a background mutation rate that better fits the data, one cannot conclude that the
Supporting Online Material
www.sciencemag.org/cgi/content/full/317/5844/1500b/DC1
SOM Text
Figs. S1 and S2
Tables S1 to S7
Data Tables
References
Received for publication 11 December 2006. Accepted for publication 22 August 2007.
Science. ISSN 0036-8075 (print), 1095-9203 (online)