Correlation Between Screening Mammography Interpretive Performance on a Test Set and Performance in Clinical Practice

Rationale and Objectives

Evidence is inconsistent about whether radiologists’ interpretive performance on a screening mammography test set reflects their performance in clinical practice. This study aimed to estimate the correlation between test set and clinical performance and determine if the correlation is influenced by cancer prevalence or lesion difficulty in the test set.

Materials and Methods

This institutional review board-approved study randomized 83 radiologists from six Breast Cancer Surveillance Consortium registries to assess one of four test sets of 109 screening mammograms each; 48 radiologists completed a fifth test set of 110 mammograms 2 years later. Test sets differed in the number of cancer cases and the difficulty of lesion detection. Test set sensitivity and specificity were estimated using woman-level and breast-level recall, with cancer status and expert opinion as gold standards. Clinical performance was estimated using woman-level recall with cancer status as the gold standard. Spearman rank correlations between test set and clinical performance were estimated with 95% confidence intervals (CIs).
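
To make the analysis concrete, here is a minimal sketch, assuming long-format interpretation data with one row per radiologist and examination; the data frames, column names, and toy values are hypothetical and not taken from the study. It computes each radiologist's woman-level sensitivity on the test set and in clinical practice and then the Spearman rank correlation between the two, analogous to the correlations reported below.

```python
# Minimal sketch (not the authors' code): per-radiologist woman-level sensitivity
# on a test set and in clinical practice, compared with a Spearman rank correlation.
# The data frames, column names, and toy values below are hypothetical.
import pandas as pd
from scipy.stats import spearmanr

def woman_level_sensitivity(df):
    """Per-reader sensitivity: proportion of cancer examinations that were recalled."""
    cancers = df[df["cancer"] == 1]
    return cancers.groupby("radiologist")["recalled"].mean()

# Toy interpretations: one row per (radiologist, examination).
test_set = pd.DataFrame({
    "radiologist": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "recalled":    [1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1],
    "cancer":      [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0],
})
clinical = pd.DataFrame({
    "radiologist": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "recalled":    [1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0],
    "cancer":      [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0],
})

paired = pd.concat(
    {"test_set": woman_level_sensitivity(test_set),
     "clinical": woman_level_sensitivity(clinical)},
    axis=1,
).dropna()
rho, p = spearmanr(paired["test_set"], paired["clinical"])
print(f"Spearman rho = {rho:.2f}, p = {p:.2f}")
```

The study additionally restricted these comparisons (e.g., to radiologists with enough clinical cancer cases) and used breast-level recall and expert opinion as alternative definitions; the sketch only illustrates the basic pairing of test set and clinical performance per reader.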

Results

For test sets with fewer cancers (N = 15) that were more difficult to detect, correlations relative to expert recall were weak to moderate for sensitivity (woman level = 0.46, 95% CI = 0.16, 0.69; breast level = 0.35, 95% CI = 0.03, 0.61) and weak for specificity (0.24, 95% CI = 0.01, 0.45). Correlations for test sets with more cancers (N = 30) were close to 0 and not statistically significant.

Conclusions

Correlations between screening performance on a test set and performance in clinical practice are not strong. Test set performance more accurately reflects performance in clinical practice if cancer prevalence is low and lesions are challenging to detect.

Introduction

The interpretive performance of screening mammography varies extensively among US radiologists. Because US radiologists have relatively low interpretive volume, on average, and often do not work up their own recalled cases, they have limited opportunities to learn, directly or indirectly, whether the women they recalled or did not recall on screening mammograms experienced benign or malignant outcomes. A test set of selected mammography images could be an efficient method for assessing radiologists' skill and identifying potential opportunities for improvement. Additionally, test sets could help radiologists meet Part 2 of the American Board of Radiology's Maintenance of Certification requirements (Lifelong Learning and Self-Assessment).

Findings from prior studies are inconsistent about whether interpretive performance on screening mammography test sets is correlated with performance in clinical practice, possibly because of small samples (of radiologists or images) and variability in test set composition, performance measures evaluated, and statistical approaches used. In a study of 27 US radiologists who interpreted a test set of 113 film screening mammography examinations (30 with cancer), Rutter and Taplin found moderate correlation between the specificity of screening mammography interpreted in clinical and test settings (0.41; 95% Bayesian credible interval: 0.16, 0.62), but no evidence of correlation between clinical and test set sensitivity (−0.18; 95% Bayesian credible interval: −0.27, 0.59). In contrast, Soh et al. found significant, moderate correlations of 0.30–0.57 between several clinical audit measures and two test set measures (location sensitivity and the jackknife free-response receiver operating characteristic figure of merit) for a test set of 60 cases (20 with cancer) read by 20 radiologists, but no correlation with test set specificity. Similarly, Scott et al. found significant, moderate correlations of 0.29–0.41 between several performance measures on the PERFORMS test set and clinical performance among 39 readers in the UK. None of these prior studies evaluated the influence of breast cancer prevalence or lesion difficulty on the strength of the correlations.

Materials and Methods

Study Population

Test Set Development

TABLE 1

Test Set Composition

| | Test Set 1 | Test Set 2 | Test Set 3 | Test Set 4 | Test Set 5 |
|---|---|---|---|---|---|
| Number of exams | 109 | 109 | 109 | 109 | 110 |
| Screens with cancer, N | 15 | 15 | 30 | 30 | 15 |
| **Difficulty to detect, N (%)** | | | | | |
| Obvious | 5 (33%) | 3 (20%) | 10 (33%) | 6 (20%) | 3 (20%) |
| Intermediate | 8 (53%) | 7 (47%) | 16 (53%) | 14 (47%) | 7 (47%) |
| Subtle | 2 (13%) | 5 (33%) | 4 (13%) | 10 (33%) | 5 (33%) |
| **Finding type, N (%)** | | | | | |
| Mass | 4 (27%) | 3 (20%) | 9 (30%) | 6 (20%) | 3 (20%) |
| Calcification | 7 (47%) | 6 (40%) | 11 (37%) | 10 (33%) | 6 (40%) |
| Asymmetric densities | 2 (13%) | 4 (27%) | 5 (17%) | 8 (27%) | 4 (27%) |
| Architectural distortion | 2 (13%) | 2 (13%) | 5 (17%) | 6 (20%) | 2 (13%) |
| Screens without cancer, N | 94 | 94 | 79 | 79 | 95 |
| Recalled by experts\*, N (%) | 21 (22%) | 21 (22%) | 21 (27%) | 21 (27%) | 14 (15%) |
| Other noncancers, N (%) | 73 (78%) | 73 (78%) | 58 (73%) | 58 (73%) | 81 (85%) |
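
For orientation, the counts above set the denominators of the woman-level measures: cancer examinations for sensitivity and noncancer examinations for specificity. As a hypothetical worked example (the recall counts below are invented for illustration and not taken from the study), a reader of Test Set 1 who recalled 12 of the 15 cancer examinations and 33 of the 94 noncancer examinations would have

$$\text{sensitivity} = \frac{12}{15} = 80\%, \qquad \text{specificity} = \frac{94 - 33}{94} \approx 65\%.$$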

Radiologists’ Interpretive Performance on the Test Set

Clinical Performance on the Test Set Exams

Radiologists’ Interpretive Performance in Clinical Practice

Analysis

Results

TABLE 2

Characteristics of Participating Radiologists ( N = 83)

| Characteristic | N | % |
|---|---|---|
| **Breast imaging specialist** | | |
| Yes | 5 | 6 |
| No | 78 | 94 |
| **Fellowship training in breast or women’s imaging** | | |
| Yes | 7 | 8 |
| No | 76 | 92 |
| **Main practice with academic radiology group** | | |
| Yes | 7 | 8 |
| No | 76 | 92 |
| **Years interpreting mammography** | | |
| 1–5 | 15 | 18 |
| 6–10 | 12 | 15 |
| 11–20 | 36 | 43 |
| >20 | 20 | 24 |
| **Average days per week working in breast imaging** | | |
| ≤1 | 21 | 25 |
| 2 | 19 | 23 |
| 3 | 18 | 22 |
| 4 | 9 | 11 |
| 5 | 16 | 19 |
| **Mammographic examinations interpreted per week** | | |
| ≤10 | 2 | 2 |
| 11–49 | 17 | 21 |
| 50–99 | 31 | 37 |
| 100–199 | 20 | 24 |
| ≥200 | 13 | 16 |

TABLE 3

Mean (95% Confidence Interval) Sensitivity and Specificity on the Test Sets Compared to Performance for Same Screening Examinations in Clinical Practice, Using Woman-level Recall with Cancer Status as the Gold Standard

| | Test Set 1: Lower Cancer Prevalence and Less Difficult Lesions | Test Set 2: Lower Cancer Prevalence and More Difficult Lesions | Test Set 3: Higher Cancer Prevalence and Less Difficult Lesions | Test Set 4: Higher Cancer Prevalence and More Difficult Lesions | Test Set 5: Lower Cancer Prevalence and More Difficult Lesions |
|---|---|---|---|---|---|
| Number of radiologists who took the test set | 22 | 24 | 17 | 20 | 48 |
| **Sensitivity** | | | | | |
| Test set | 83% (68%–92%) | 75% (61%–85%) | 74% (63%–82%) | 72% (60%–81%) | 74% (62%–84%) |
| Clinical\* | 73% (48%–89%) | 80% (53%–94%) | 83% (68%–92%) | 83% (69%–92%) | 80% (53%–93%) |
| P value | 0.23 | 0.60 | 0.21 | 0.12 | 0.59 |
| **Specificity** | | | | | |
| Test set | 65% (58%–71%) | 62% (55%–69%) | 69% (61%–77%) | 65% (57%–72%) | 68% (62%–74%) |
| Clinical\* | 55% (44%–66%) | 55% (44%–66%) | 47% (36%–58%) | 47% (36%–58%) | 62% (51%–72%) |
| P value | 0.07 | 0.19 | 0.0001 | 0.001 | 0.22 |

Figure 1. Woman-level sensitivity in clinical practice versus woman-level test set sensitivity among cases recalled by the expert panel. Symbols 1–5 indicate the test set number. Test set 1 had lower cancer prevalence and less difficult lesions, test sets 2 and 5 had lower cancer prevalence and more difficult lesions, test set 3 had higher cancer prevalence and less difficult lesions, and test set 4 had higher cancer prevalence and more difficult lesions. CI, confidence interval.

Figure 2. Woman-level specificity in clinical practice versus woman-level test set specificity among cases not recalled by the expert panel. Symbols 1–5 indicate the test set number. Test set 1 had lower cancer prevalence and less difficult lesions, test sets 2 and 5 had lower cancer prevalence and more difficult lesions, test set 3 had higher cancer prevalence and less difficult lesions, and test set 4 had higher cancer prevalence and more difficult lesions. CI, confidence interval.

TABLE 4

Spearman Rank Correlation (95% Confidence Interval) Between Radiologists’ Test Set and Clinical Performance Measures

| | Test Sets 1, 2, 5: Lower Cancer Prevalence | Test Sets 3, 4: Higher Cancer Prevalence | Test Sets 1, 3: Less Difficult Lesions | Test Sets 2, 4, 5: More Difficult Lesions | Test Sets 2, 5: Lower Cancer Prevalence and More Difficult Lesions |
|---|---|---|---|---|---|
| **Sensitivity** | | | | | |
| Number of radiologists with at least 10 cancer cases in clinical practice | 52 | 26 | 27 | 51 | 36 |
| Test set cancer cases: woman level, correlation | 0.21 (−0.06, 0.46) | −0.13 (−0.49, 0.27) | 0.04 (−0.34, 0.42) | 0.13 (−0.16, 0.39) | **0.37 (0.04, 0.62)** |
| Test set cancer cases: breast level, correlation | 0.21 (−0.07, 0.45) | −0.10 (−0.47, 0.30) | 0.09 (−0.30, 0.46) | 0.10 (−0.18, 0.37) | 0.32 (−0.01, 0.59) |
| Test set cases recalled by experts: woman level, correlation | **0.33 (0.06, 0.55)** | −0.28 (−0.60, 0.12) | −0.04 (−0.41, 0.35) | 0.21 (−0.07, 0.46) | **0.46 (0.16, 0.69)** |
| Test set cases recalled by experts: breast level, correlation | 0.25 (−0.03, 0.49) | −0.29 (−0.61, 0.11) | 0.02 (−0.36, 0.39) | 0.09 (−0.19, 0.36) | **0.35 (0.03, 0.61)** |
| **Specificity** | | | | | |
| Number of radiologists with at least 100 noncancer cases in clinical practice | 94 | 37 | 39 | 92 | 72 |
| Test set noncancer cases, correlation | **0.24 (0.04, 0.42)** | 0.05 (−0.28, 0.37) | −0.05 (−0.36, 0.27) | **0.26 (0.06, 0.44)** | **0.28 (0.05, 0.48)** |
| Test set cases not recalled by experts, correlation | 0.19 (−0.01, 0.38) | 0.08 (−0.25, 0.39) | −0.07 (−0.38, 0.25) | **0.23 (0.03, 0.42)** | **0.24 (0.01, 0.45)** |

Numbers in bold are statistically significantly different from zero.
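
The excerpt does not state how the confidence intervals in Table 4 were computed. As a rough plausibility check, the standard large-sample Fisher z-transform approximation for a rank correlation (an assumption on my part, not necessarily the authors' method) yields intervals of roughly the reported width; the sketch below illustrates it for the 0.46 woman-level sensitivity correlation from test sets 2 and 5.

```python
# Generic sketch: approximate 95% CI for a Spearman rank correlation via the
# Fisher z-transform. This is a common textbook approximation offered only as a
# consistency check; it is not necessarily the method used in the study.
import numpy as np
from scipy.stats import norm

def spearman_ci(r, n, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for a rank correlation r from n readers."""
    z = np.arctanh(r)                  # Fisher z-transform of the correlation
    se = 1.0 / np.sqrt(n - 3)          # large-sample standard error
    crit = norm.ppf(1 - alpha / 2)     # normal critical value (about 1.96)
    return np.tanh(z - crit * se), np.tanh(z + crit * se)

# Example: r = 0.46 among the 36 radiologists with at least 10 clinical cancer cases.
lo, hi = spearman_ci(0.46, 36)
print(f"({lo:.2f}, {hi:.2f})")  # roughly (0.15, 0.69), close to the (0.16, 0.69) reported in Table 4
```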

Discussion

Acknowledgments

Supplementary Data

Appendix S1

References

  • 1. Elmore J.G., Jackson S., Abraham L., et. al.: Variability in interpretive performance at screening mammography and radiologist characteristics associated with accuracy. Radiology 2009; 253: pp. 641-651.

  • 2. Rosenberg R.D., Yankaskas B.C., Abraham L.A., et. al.: Performance benchmarks for screening mammography. Radiology 2006; 241: pp. 55-66.

  • 3. Smith-Bindman R., Miglioretti D.L., Rosenberg R., et. al.: Physician workload in mammography. AJR Am J Roentgenol 2008; 190: pp. 526-532.

  • 4. Lewis R.S., Sunshine J.H., Bhargavan M.: A portrait of breast imaging specialists and of the interpretation of mammography in the United States. AJR Am J Roentgenol 2006; 187: pp. W456-W468.

  • 5. Buist D.S., Anderson M.L., Smith R.A., et. al.: Effect of radiologists’ diagnostic work-up volume on interpretive performance. Radiology 2014; 273: pp. 351-364.

  • 6. American Board of Radiology: Maintenance of certification. Version 2.2.20. Available at https://www.theabr.org/moc-gen-landing. Accessed May 10, 2016.

  • 7. Soh B.P., Lee W.B., Mello-Thoms C., et. al.: Certain performance values arising from mammographic test set readings correlate well with clinical audit. J Med Imaging Radiat Oncol 2015; 59: pp. 403-410.

  • 8. Scott H.J., Evans A., Gale A.G., et. al.: The relationship between real life breast screening and an annual self assessment scheme. In: Sahiner B., Manning D.J. (eds): Medical Imaging 2009: Image Perception, Observer Performance, and Technology Assessment. Lake Buena Vista, FL: Society of Photo-Optical Instrumentation Engineers, 2009; Proc SPIE 7263: 72631E.

  • 9. Rutter C.M., Taplin S.: Assessing mammographers’ accuracy. A comparison of clinical and test performance. J Clin Epidemiol 2000; 53: pp. 443-450.

  • 10. Geller B.M., Bogart A., Carney P.A., et. al.: Educational interventions to improve screening mammography interpretation: a randomized controlled trial. AJR Am J Roentgenol 2014; 202: pp. W586-W596.

  • 11. Carney P.A., Bogart T.A., Geller B.M., et. al.: Association between time spent interpreting, level of confidence, and accuracy of screening mammography. AJR Am J Roentgenol 2012; 198: pp. 970-978.

  • 12. Onega T., Anderson M.L., Miglioretti D.L., et. al.: Establishing a gold standard for test sets: variation in interpretive agreement of expert mammographers. Acad Radiol 2013; 20: pp. 731-739.

  • 13. American College of Radiology: Breast Imaging Reporting and Data System Atlas (BI-RADS Atlas). 4th ed. Reston, VA: American College of Radiology, 2003.

  • 14. Miglioretti D.L., Heagerty P.J.: Marginal modeling of multilevel binary data with time varying covariates. Biostatistics 2004; 5: pp. 381-398.

  • 15. Miglioretti D.L., Heagerty P.J.: Marginal modeling of nonnested multilevel data using standard software. Am J Epidemiol 2007; 165: pp. 453-463.

  • 16. Gur D., Rockette H.E., Armfield D.R., et. al.: Prevalence effect in a laboratory environment. Radiology 2003; 228: pp. 10-14.

  • 17. Gur D., Bandos A.I., Fuhrman C.R., et. al.: The prevalence effect in a laboratory environment: changing the confidence ratings. Acad Radiol 2007; 14: pp. 49-53.

  • 18. Evans K.K., Birdwell R.L., Wolfe J.M.: If you don’t find it often, you often don’t find it: why some cancers are missed in breast cancer screening. PLoS ONE 2013; 8: pp. e64366.

  • 19. Gur D., Bandos A.I., Cohen C.S., et. al.: The “laboratory” effect: comparing radiologists’ performance and variability during prospective clinical and laboratory mammography interpretations. Radiology 2008; 249: pp. 47-53.

  • 20. Soh B.P., Lee W., McEntee M.F., et. al.: Screening mammography: test set data can reasonably describe actual clinical reporting. Radiology 2013; 268: pp. 46-53.

  • 21. Duffy F.D., Holmboe E.S.: Self-assessment in lifelong learning and improving performance in practice: physician know thyself. JAMA 2006; 296: pp. 1137-1139.
