Retrospective Analyses of Pivotal Prospective Studies With Population Segmentation—Statistically Based Inferences and Clinical Relevance

Retrospective analyses of pivotal prospective studies are important for verifying the inferences made as a result of the original studies and for generating new hypotheses. However, careful attention should be given to the comprehensiveness and completeness of a retrospective analysis and how it is ultimately used. A recent retrospective analysis of the Digital Mammographic Imaging Screening Trial (DMIST) underscores several important points related to inference generation and generalization of the results on the basis of summary performance indexes, as well as the importance of incorporating a clinically relevant perspective when generating inferences primarily on the basis of statistical test results. This article highlights three important points related to (1) the use of performance indexes (namely, area under the receiver-operating characteristic curve), (2) applied statistical methods (namely, Bonferroni corrections for multiple comparison), and (3) practical conclusions (namely, consideration of all possible inferences that could be generated from the data), as well as possible implications and limitations of these retrospective analyses. The discussion in this paper is based on one specific retrospective analysis of a prospective study, but the topics addressed are quite basic, general, and potentially applicable to a number of retrospective analyses of data that are experimentally ascertained during pivotal prospective studies, as well as during observer performance studies.

In a recent comprehensive “retrospective” analysis of the Digital Mammographic Imaging Screening Trial (DMIST) ( ), several conclusions and inferences were made largely on the basis of statistical outcomes of comparisons of performance levels between conventional or film mammography and full-field digital mammography (FFDM) in subsets of the originally studied population ( ). The inferences stated in the report ( ) could be interpreted very differently. The possible implications that could be derived from analyses of similarly ascertained observer performance data are reviewed and discussed using the work by Pisano et al ( ) as a specific example because the three primary issues addressed here may be common to other retrospective analyses of prospective studies, as well as the analyses of observer performance studies that are de facto becoming the standard in our field. Specifically, this article focuses on three primary points that should not be ignored when analyses similar to those published in the paper by Pisano et al are performed ( ).

Performance indexes, baseline performance levels, and fitted receiver-operating characteristic (ROC) curves

Pisano et al ( ) stated that film mammography in the younger group (age < 50 years) of pre- or perimenopausal women with dense breasts had an area under the curve (AUC; estimated as the area under the fitted binormal ROC curve) of 0.544, with a 95% confidence interval (CI) of 0.397 to 0.684. This performance level was compared to that of FFDM, with an AUC of 0.791 (Table 1), and a statistically significant difference was reported (P = .0015). The first question that arises is whether the investigators themselves believe that in this particular subgroup of women, the performance of film mammography was not different from “chance,” or an AUC of 0.50, a value that lies well inside the reported CI. If the authors believe that film mammography does perform at this level for this subgroup of women, there should have been some discussion, on the basis of their own results, regarding whether we should immediately and strongly recommend that women aged < 50 years who are pre- or perimenopausal with dense breasts be totally excluded from undergoing screen-film mammography. Surely, there is some radiation risk, related anxiety, and cost associated with annual mammography that is no better than flipping a coin. This group constituted 61.4% (7,315 of 11,915) of all young (age < 50 years) pre- or perimenopausal women in DMIST. If this is the case, we could potentially and easily determine the nature of the breast density by a one-time very low exposure or by other means. This very issue of the considerations for performance baseline levels of the reference practice, in this case film mammography, was recently addressed elsewhere ( ).

Table 1

Performance Levels for Digital and Film Mammography as Measured by Area Under the Receiver-operating Characteristic Curve (AUC) for the Two Subgroups of Interest ( )

| Subgroup | Digital Mammography, AUC (95% CI) | Film Mammography, AUC (95% CI) | AUC Difference | P Value |
| --- | --- | --- | --- | --- |
| Age < 50 y and premenopausal or perimenopausal with dense breasts | 0.791 (0.691–0.869) | 0.544 (0.397–0.684) | 0.247 | .0015 |
| Age > 65 y with nondense breasts | 0.705 (0.578–0.811) | 0.877 (0.804–0.929) | −0.172 | .0025 |

CI, confidence interval.

Clinical screening practices in this country, and more so worldwide, operate primarily in only a small part (ie, high specificity, or low false-positive rate) of the ROC domain. Consequently, the collected data often lead to estimated ROC points concentrated in the left part of the ROC space (eg, as in the original analysis of DMIST [2]). In such cases, there is an increased possibility for “hooks” in binormal ROC curves (eg, as was shown in the ROC curves of the original analysis of DMIST for several subsets [2]), and the use of parametric or nonparametric methods could lead to different conclusions ( ). Using the provided values of sensitivity and specificity for film mammography for the subgroup of women aged < 50 years and pre- or perimenopausal with dense breasts (0.273 and 0.889, respectively), one can compute an estimated area under the ROC “curve” obtained by connecting the binary ROC point to the trivial points at (0,0) and (1,1). This area is equal to (Youden’s index)/2 + 0.5, where Youden’s index is sensitivity + specificity − 1 ( ). Here, the estimated AUC is (0.273 + 0.889 − 1)/2 + 0.5 = 0.581, which is greater than the binormal AUC of 0.544. Given that the experimental point is close to the fitted curve, this suggests that there is a “hook” in the fitted binormal curve.
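The single-point calculation described above can be checked with a short sketch. The function names are illustrative assumptions, not taken from the cited report; only the sensitivity (0.273) and specificity (0.889) come from the text. The area is computed geometrically, as a triangle plus a trapezoid, and then compared with the Youden-based shortcut.

```python
def single_point_auc(sensitivity, specificity):
    """Area under the two-segment ROC "curve" (0,0) -> (FPR, TPR) -> (1,1)."""
    fpr = 1.0 - specificity
    tpr = sensitivity
    # Triangle under the first segment plus trapezoid under the second.
    return 0.5 * fpr * tpr + 0.5 * (1.0 - fpr) * (tpr + 1.0)


def youden_index(sensitivity, specificity):
    """Youden's index J = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


# Values reported for film mammography in the subgroup discussed in the text.
sens, spec = 0.273, 0.889

area = single_point_auc(sens, spec)
via_youden = youden_index(sens, spec) / 2.0 + 0.5

print(f"geometric area: {area:.3f}")
print(f"J/2 + 0.5:      {via_youden:.3f}")
```

Both routes give the same value, 0.581 to three decimals, because the geometric area simplifies algebraically to (sensitivity + specificity)/2.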

Thus, because screening practices in general operate at the very left-hand side of the ROC domain (high specificity range), investigators should present the actual ROC curves and, as important, the actual experimentally ascertained operating points of the radiologists ( ). In many situations, valid and clinically relevant interpretation of results cannot be achieved otherwise. Furthermore, the AUC might not be as relevant in this case as the partial area ( ). However, this comment applies to a large number of studies, and a detailed discussion of this issue is beyond the scope of this paper.
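To illustrate why the partial area may be more informative than the full AUC when a fitted binormal curve has a “hook,” the following is a minimal sketch. It assumes the standard binormal form TPR = Φ(a + b·Φ⁻¹(FPR)). The parameters a = 0.2 and b = 0.5, and the false-positive cutoff of 0.2, are hypothetical values chosen only for illustration; they are not the fitted DMIST parameters, which are not given here.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def binormal_tpr(fpr, a, b):
    """TPR on a binormal ROC curve: Phi(a + b * Phi^-1(FPR))."""
    if fpr <= 0.0:
        return 0.0
    if fpr >= 1.0:
        return 1.0
    return _STD.cdf(a + b * _STD.inv_cdf(fpr))


def partial_auc(a, b, fpr_max, steps=2000):
    """Trapezoidal area under the binormal curve for FPR in [0, fpr_max]."""
    h = fpr_max / steps
    total = 0.0
    for i in range(steps):
        left = binormal_tpr(i * h, a, b)
        right = binormal_tpr((i + 1) * h, a, b)
        total += 0.5 * (left + right) * h
    return total


def full_auc(a, b):
    """Closed-form area under the whole binormal curve."""
    return _STD.cdf(a / (1.0 + b * b) ** 0.5)


# Hypothetical parameters (assumption, not from the source); b < 1 produces
# a hook, with the curve falling below the chance diagonal at high FPR.
A, B = 0.2, 0.5

print(f"full AUC:              {full_auc(A, B):.3f}")
print(f"partial AUC, FPR<=0.2: {partial_auc(A, B, 0.2):.4f}")
print(f"TPR at FPR = 0.9:      {binormal_tpr(0.9, A, B):.3f}")
```

With these assumed parameters, the curve lies above the diagonal in the high-specificity region where screening operates but drops below it at high false-positive rates, so the full AUC is pulled down by a part of the curve that has little clinical relevance.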


Statistically based study conclusions must be carefully assessed and considered in context


Consistency, reproducibility, and the possible undue weight of a “pivotal” study


References

1. Pisano E.D., Hendrick R.E., Yaffe M.J., et al., DMIST Investigators Group: Diagnostic accuracy of digital versus film mammography: exploratory analysis of selected population subgroups in DMIST. Radiology 2008; 246: pp. 376-383.
2. Pisano E.D., Gatsonis C., Hendrick E., et al., Digital Mammographic Imaging Screening Trial (DMIST) Investigators Group: Diagnostic performance of digital versus film mammography for breast-cancer screening. N Engl J Med 2005; 353: pp. 1773-1783.
3. Gur D.: Imaging technology and practice assessment studies: importance of the baseline or reference performance level. Radiology 2008; 247: pp. 8-11.
4. Gur D., Bandos A.I., Rockette H.E.: Comparing areas under the receiver operating characteristic curves: the potential impact of the “last” experimentally measured operating point. Radiology 2008; 247: pp. 12-15.
5. Youden W.J.: An index for rating diagnostic tests. Cancer 1950; 3: pp. 32-35.
6. McClish D.: Analyzing a portion of the ROC curve. Med Decis Making 1989; 9: pp. 190-195.
7. Stojadinovic A., Nissan A., Shriver C.D., et al.: Electrical impedance scanning as a new breast cancer risk stratification tool for young women. J Surg Oncol 2008; 97: pp. 112-120.
8. Gur D.: Digital mammography: do we need to convert now? Radiology 2007; 245: pp. 10-11.
