This is a response to a recent thought-provoking paper by Gur and Rockette that raises issues regarding the applicability of the free-response receiver-operating characteristic (FROC) paradigm to imaging system evaluations. Unlike many diagnostic tests, most tasks in diagnostic imaging provide information about the location(s) of disease in addition to its presence or absence. The receiver-operating characteristic (ROC) method, however, considers only disease presence or absence and disregards location. For some clinical tasks the ROC method is appropriate: detecting diffuse pulmonary fibrosis, for example, does not involve focal lesions and is appropriately analyzed with ROC. However, tasks such as detecting lung nodules on chest radiography or microcalcifications on screening mammography, which involve detecting localized and possibly multiple lesions, are more appropriately handled using FROC analysis.
By way of disclosure, having worked in this area since about 1984, I am vested in FROC methodology. There are two other location-specific paradigms not mentioned in Gur and Rockette’s paper: the location ROC (LROC) paradigm and the region-of-interest (ROI) paradigm. In the LROC paradigm, the radiologist provides an overall rating and marks the most suspicious region. In the ROI paradigm, the investigator segments the image into ROIs, and the radiologist rates each ROI for the presence of disease. Like FROC, these paradigms were developed to address the localization and multiple-lesion limitations of ROC methodology, and most of the issues attributed to FROC apply to these methods as well.
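To make these distinctions concrete, the sketch below (mine, not taken from any of the cited papers) shows the kind of datum each paradigm records for a single image; all field names and values are hypothetical.

```python
# Illustrative sketch of the data each paradigm collects for one image.
# Field names and values are hypothetical, chosen only to show the contrast.

roc_datum  = {"case_rating": 3}                                   # presence/absence only
lroc_datum = {"case_rating": 3, "most_suspicious_xy": (112, 87)}  # rating plus one mark
froc_data  = [{"xy": (112, 87), "rating": 4},                     # any number of
              {"xy": (40, 201), "rating": 2}]                     # mark-rating pairs
roi_data   = {"ROI_1": 1, "ROI_2": 4, "ROI_3": 1}                 # one rating per ROI
```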
Neglect of location information implies suboptimal measurement precision (ie, low statistical power), which diminishes the ability to detect differences between modalities, the most common application of observer performance studies. Early analysis tools that I developed drew fair criticism because they ignored correlations. This issue was resolved in 2004 by the jackknife alternative FROC (JAFROC) method, which was demonstrated to have substantially higher power than the ROC method and passed rigorous statistical validation. As evidenced by recent journal publications and proceedings papers at a major international conference on medical imaging, JAFROC is gaining acceptance. However, resistance to it is also increasing, which is to be expected as part of normal scientific discourse. Gur and Rockette have done a service to the imaging community by expressing their concerns publicly, and I am grateful to the editor for giving me the opportunity to respond.
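For readers unfamiliar with the figure of merit underlying JAFROC, here is a minimal sketch of its computation: the probability that a lesion-localization rating exceeds the highest non-lesion rating on a normal case, estimated by a Wilcoxon-like statistic with ties scored as 0.5. The function and variable names are mine, and the jackknife machinery used for significance testing is omitted.

```python
# Minimal sketch of the JAFROC figure of merit (after Chakraborty and Berbaum, 2004).
# An unmarked normal case or unmarked lesion contributes a rating of -infinity.

def jafroc_fom(normal_case_marks, lesion_ratings):
    """normal_case_marks: list over normal cases, each a list of
       non-lesion (false-positive) mark ratings.
       lesion_ratings: one lesion-localization rating per lesion."""
    highest_fp = [max(marks) if marks else float("-inf")
                  for marks in normal_case_marks]
    score, total = 0.0, 0
    for fp in highest_fp:                 # highest non-lesion rating per normal case
        for ll in lesion_ratings:         # every lesion rating
            score += 1.0 if ll > fp else (0.5 if ll == fp else 0.0)
            total += 1
    return score / total if total else 0.0

# Example: two normal cases (one marked, one unmarked) and three lesions
print(jafroc_fom([[1.2], []], [2.5, 0.8, 3.1]))  # 0.833...
```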
I agree with the authors on some issues: the ambiguity of the acceptance target (how close a mark must be to a lesion to be counted as a true-positive finding); the suitability of the figure of merit for multiple lesions with different clinical significances; handling multiple views per case; simulator-related issues, such as distributional assumptions and the lack of accounting for satisfaction of search; and so on. However, I choose to regard these as research opportunities and thank the authors for laying out a detailed research road map. This research, quite apart from its obvious application to modality assessment, could substantially extend our understanding of medical decision making. I am making progress in some of these areas, but others need to get involved. Unfortunately, if one accepts the authors’ premise that ROC is more clinically relevant than FROC for location-specific tasks such as screening mammography, then few will be motivated to do research in FROC analysis.
When a screening radiologist refers a patient to a colleague for further investigation of a possible breast lesion, the location of the lesion and which breast is involved are crucial. Screening programs require the documentation of lesion characteristics, including location, in addition to the overall recommendation to “recall” or “return to screening.” The location(s) identified at screening guide the subsequent diagnostic workup and the decision to biopsy the lesion(s). Just knowing that a woman has a malignancy somewhere in her breasts is obviously less helpful to the mammographer doing the diagnostic workup than knowing the locations and types of abnormalities detected at screening by a colleague. Neglecting location information can lead to the scenario in which, in the ROC paradigm, the radiologist is credited for detecting an abnormal condition, when in fact a lesion was missed and a normal structure was mistaken for a lesion (“right for the wrong reason”). The clinical consequences of both mistakes are serious: the undetected cancer is allowed to grow, and a biopsy is performed at the wrong location (see point 1 in the following discussion).
Because claims are being made that ROC is clinically more relevant than FROC in some scenarios, a definition of clinical relevance is needed. Evaluation methods form a six-level hierarchy of efficacies: 1) technical, 2) diagnostic, 3) diagnostic thinking, 4) therapeutic, 5) patient outcome, and 6) societal. I will interpret the “clinical relevance” of a measurement as its level in this hierarchy. The difficulty and cost of measurement increase as one moves up the hierarchy. At the lowest level, technical efficacy (eg, spatial resolution) is easiest to measure. Level 2 ROC measurements have a reputation for being time-consuming and costly. Level 3 measurements, such as positive predictive value, are even more laborious. One way of showing clinical relevance is to perform measurements at the higher level and show that they confirm the lower-level measurements regarding which modality is superior. If the performance difference is small, demonstrating clinical relevance can be very difficult. As an example, the initial optimistic expectations for computer-aided diagnosis in mammography, which were based on ROC studies, have not been confirmed in some large-scale clinical trials. Because it is difficult to prove the clinical relevance of ROC, one can hardly claim that it is more relevant than FROC.

Black and Dwyer studied the issue of global versus local measures of accuracy and their effects on the post-test probability of disease, a level 3 measure. They considered mediastinal lymph node metastasis, which is more likely to be present in the right lower paratracheal region than in the left. As expected, the post-test probability was higher if the radiologist knew that lymph node metastasis was found in the right lower paratracheal region rather than the left, and this knowledge influenced the subsequent action (eg, biopsy or surgery). However, if the location information was ignored, the post-test probability was the same for the two cases. Black and Dwyer concluded that “the local versus global distinction supports the commonsense notion that information pertaining to the anatomic distribution of disease is crucial for test interpretation.”
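Black and Dwyer’s point can be made concrete with a short worked example. The sketch below applies Bayes’ rule to obtain the post-test probability of disease, the level 3 quantity at issue; the test characteristics and site-specific priors are hypothetical numbers chosen only to show how the reported location changes the result.

```python
# Hedged worked example: post-test probability via Bayes' rule.
# All numbers are hypothetical; only the formula is standard.

def post_test_probability(pretest_prob, sensitivity, specificity):
    lr_pos = sensitivity / (1.0 - specificity)          # positive likelihood ratio
    pretest_odds = pretest_prob / (1.0 - pretest_prob)
    post_odds = pretest_odds * lr_pos                   # Bayes' rule in odds form
    return post_odds / (1.0 + post_odds)

SENS, SPEC = 0.80, 0.90                 # hypothetical test characteristics
prior = {"right lower paratracheal": 0.30,   # metastasis more likely here
         "left lower paratracheal": 0.05}    # than here (illustrative priors)

for region, p in prior.items():
    print(region, round(post_test_probability(p, SENS, SPEC), 3))
# A location-blind ("global") positive result can use only a pooled prior,
# so the two regions would collapse to a single intermediate probability.
```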
References
1. Gur D., Rockette H.E.: Performance assessment of diagnostic systems under the FROC paradigm: experimental, analytical, and results interpretation issues. Acad Radiol 2008; 15: pp. 1312-1315.
2. Bunch P.C., Hamilton J.F., Sanderson G.K.: A free-response approach to the measurement and characterization of radiographic-observer performance. J Appl Photogr Eng 1978; 4: pp. 166-171.
3. Swensson R.G.: Unified measurement of observer performance in detecting and localizing target objects on images. Med Phys 1996; 23: pp. 1709-1725.
4. Obuchowski N.A., Lieber M.L., Powell K.A.: Data analysis for detection and localization of multiple abnormalities with application to mammography. Acad Radiol 2000; 7: pp. 516-525.
5. Chakraborty D.P., Berbaum K.S.: Observer studies involving detection and localization: modeling, analysis and validation. Med Phys 2004; 31: pp. 2313-2330.
6. Fryback D.G., Thornbury J.R.: The efficacy of diagnostic imaging. Med Decis Making 1991; 11: pp. 88-94.
7. Receiver operating characteristic analysis in medical imaging. ICRU Report 79. J ICRU 2008; 8(1).
8. Fenton J.J., Taplin S.H., Carney P.A., et al.: Influence of computer-aided detection on performance of screening mammography. N Engl J Med 2007; 356: pp. 1399-1409.
9. Astley S.M., Gilbert F.J.: Computer-aided detection in mammography. Clin Radiol 2004; 59: pp. 390-399.
10. Black W.C., Dwyer A.J.: Local versus global measures of accuracy: an important distinction for diagnostic imaging. Med Decis Making 1990; 10: pp. 266-273.
11. Song T., Bandos A.I., Rockette H.E., Gur D.: On comparing methods for discriminating between actually negative and actually positive subjects with FROC type data. Med Phys 2008; 35: pp. 1547-1558.
12. Bandos A.I., Rockette H.E., Song T., Gur D.: Area under the free-response ROC curve (FROC) and a related summary index. Biometrics. In press.
13. Wagner R.F., Metz C.E., Campbell G.: Assessment of medical imaging systems and computer aids: a tutorial review. Acad Radiol 2007; 14: pp. 723-748.
14. Chakraborty D.P.: Validation and statistical power comparison of methods for analyzing free-response observer performance studies. Acad Radiol 2008; 15: pp. 1554-1566.
15. Dodd L.E., Wagner R.F., Armato S.G., et al.: Assessment methodologies and statistical issues for computer-aided diagnosis of lung nodules in computed tomography: contemporary research topics relevant to the lung image database consortium. Acad Radiol 2004; 11: pp. 462-475.
16. Neisser U.: Cognitive psychology. New York: Appleton-Century-Crofts, 1967.
17. Chakraborty D.P.: A search model and figure of merit for observer data acquired according to the free-response paradigm. Phys Med Biol 2006; 51: pp. 3449-3462.
18. Rutter C.M.: Bootstrap estimation of diagnostic accuracy with patient-clustered data. Acad Radiol 2000; 7: pp. 413-419.