Establishing a Gold Standard for Test Sets

Rationale and Objectives

Test sets for assessing and improving radiologic image interpretation have been used for decades and typically evaluate performance relative to gold standard interpretations by experts. To assess test sets for screening mammography, a gold standard for whether a woman should be recalled for additional workup is needed, given that interval cancers may be occult on mammography and some findings ultimately determined to be benign require additional imaging to determine if biopsy is warranted. Using experts to set a gold standard assumes little variation occurs in their interpretations, but this has not been explicitly studied in mammography.

Materials and Methods

Using digitized films from 314 screening mammography examinations (n = 143 cancer cases) performed in the Breast Cancer Surveillance Consortium, we evaluated interpretive agreement among three expert radiologists who independently assessed whether each examination should be recalled, as well as the lesion location, finding type (mass, calcification, asymmetric density, or architectural distortion), and interpretive difficulty in the recalled images.

Results

Agreement among the three expert pairs on recall/no recall was higher for cancer cases (mean 74.3% ± 6.5%) than for noncancer cases (mean 62.6% ± 7.1%). Complete agreement on recall, lesion location, finding type, and difficulty ranged from 36.4% to 42.0% for cancer cases and from 43.9% to 65.6% for noncancer cases. Two of three experts agreed on recall and lesion location for 95.1% of cancer cases and 91.8% of noncancer cases, but all three experts agreed on only 55.2% of cancer cases and 42.1% of noncancer cases.

Conclusion

Variability in expert interpretation is notable. A minimum of three independent experts, combined with a consensus process, should be used to establish any gold standard interpretation for test sets, especially for noncancer cases.

In radiology, test sets have been used for decades to assess and improve interpretive performance. Typically, the gold standard for interpretation is based either on observed patient outcomes or on expert review, in which a panel of experts comes to consensus on the interpretation. In the latter case, the consensus decision becomes the gold standard and provides the basis for measuring individual performance. Little is known about the extent to which expert radiologists vary in their interpretive assessments. Importantly, agreement among mammography experts has not been examined in the context of test set development for screening mammography, although test sets are frequently used for educational purposes.

The high prevalence of screening mammography in the population and the wide variability in radiologists’ interpretive performance make testing radiologists’ interpretive ability clinically important. Test sets are also useful for evaluating interventions aimed at improving interpretation, because changes in screening mammography performance are difficult to assess in clinical practice given the low within-practice breast cancer prevalence and the long lag time before true cancer status is known. For screening mammography test sets, breast cancer status may be considered the ultimate gold standard, but it unrealistically applies a diagnostic standard of performance to a screening test. A gold standard based on whether the examination should be recalled for additional workup, together with the location and type of any significant findings, would be more clinically relevant, because some interval cancers are occult on the prior mammogram and some findings ultimately determined to be benign require additional imaging to decide whether biopsy is warranted. The nature of screening makes it difficult to define clear, objective criteria for a recall decision or for identification of significant findings, yet using biopsy results within 1 year of screening as the gold standard unrealistically treats all false negatives and false positives as avoidable errors.

Materials and Methods

Protection of Study Subjects

Mammograms Reviewed by Experts

Expert Review Process

Analysis

Figure 1. Expert agreement: all three experts recalled the same lesion, as indicated by the three click (▴) locations. LMLO, left mediolateral oblique; RMLO, right mediolateral oblique.

Figure 2. No expert agreement: two experts recalled, but different lesions, as shown by the click (▴) locations in different breasts. LMLO, left mediolateral oblique; RMLO, right mediolateral oblique.

Figure 3. Expert agreement in need of review: three experts recalled and two agreed, but it is unclear whether the third expert indicated the same lesion, as shown by the three click (▴) locations. LCC, left craniocaudal; RCC, right craniocaudal.

Table 1

Measurement of Expert Agreement and Gold Standard Development

| Five Successive Levels of Agreement | Agreement Criteria |
|---|---|
| Woman-level | Recall |
| Breast-level | Recall and laterality |
| Lesion-level | Recall, laterality, location |
| Finding-level | Recall, laterality, location, finding type |
| Complete agreement | Recall, laterality, location, finding type, difficulty |
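
To make the five successive levels concrete, the sketch below scores the level of agreement reached by two hypothetical expert reads. This is a minimal illustration only: the `ExpertRead` fields, the exact-match rule for lesion location, and the handling of joint no-recall decisions are assumptions, not the study's actual matching procedure (which relied on marked click locations and review of unclear cases, as in Figures 1–3).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExpertRead:
    """One expert's read of a single screening exam (field names are illustrative)."""
    recall: bool                      # recall vs. no recall (woman-level decision)
    laterality: Optional[str] = None  # "L" or "R" for the recalled breast
    lesion_id: Optional[str] = None   # stand-in for a matched lesion location
    finding: Optional[str] = None     # "mass", "calcification", "asymmetry", "distortion"
    difficulty: Optional[str] = None  # interpretive difficulty rating


def agreement_level(a: ExpertRead, b: ExpertRead) -> str:
    """Return the highest successive level of agreement reached by two reads (Table 1)."""
    if a.recall != b.recall:
        return "none"
    if not a.recall:
        # Both said no recall; the tables appear to count such pairs at every level.
        return "complete"
    level = "woman"
    if a.laterality == b.laterality:
        level = "breast"
        if a.lesion_id == b.lesion_id:
            level = "lesion"
            if a.finding == b.finding:
                level = "finding"
                if a.difficulty == b.difficulty:
                    level = "complete"
    return level


# Example: same left-breast lesion recalled, but the finding type differs.
r1 = ExpertRead(True, "L", "lesion-1", "mass", "moderate")
r2 = ExpertRead(True, "L", "lesion-1", "asymmetry", "moderate")
print(agreement_level(r1, r2))  # -> "lesion"
```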

Two Methods for Establishing Gold Standard Interpretation

Results

Table 2

Characteristics of Women Whose Mammograms Were Reviewed by the Panel of Experts

| Characteristic | n | % |
|---|---|---|
| Total | 314 | 100 |
| Age | | |
| 40–44 | 47 | 15 |
| 45–49 | 57 | 18.2 |
| 50–54 | 64 | 20.4 |
| 55–59 | 66 | 21 |
| 60–64 | 50 | 15.9 |
| 65–69 | 30 | 9.6 |
| Current hormone therapy use | | |
| No | 192 | 63.8 |
| Yes | 109 | 36.2 |
| (Missing)‡ | 13 | (4.1) |
| Postmenopausal | | |
| No | 101 | 32.6 |
| Yes | 209 | 67.4 |
| (Missing)‡ | 4 | (1.3) |
| Breast density† | | |
| BI-RADS 1 | 11 | 4.4 |
| BI-RADS 2 | 93 | 34.3 |
| BI-RADS 3 | 140 | 50.1 |
| BI-RADS 4 | 28 | 10.2 |
| (Missing)‡ | 42 | (13.4) |
| Cancer within a year of screen∗ | | |
| No | 171 | 54.5 |
| Yes | 143 | 45.5 |

BI-RADS, Breast Imaging Reporting and Data System.

Table 3

Cancer Characteristics of Cancer Cases Reviewed by the Panel of Experts

| Characteristic | n | % |
|---|---|---|
| Number of cancers | 143 | 100 |
| Cancer histologic type | | |
| Ductal carcinoma in situ | 27 | 18.9 |
| All invasive | 116 | 81.1 |
| Cancer size∗ (mm) | | |
| ≤5 | 13 | 11.9 |
| 6–10 | 24 | 22.0 |
| 11–15 | 25 | 22.9 |
| 16–20 | 22 | 20.2 |
| >20 | 25 | 22.9 |
| Unknown† | 7 | (6.0) |
| Axillary lymph node status∗ | | |
| Negative | 79 | 71.2 |
| Positive | 32 | 28.8 |
| Unknown† | 5 | (4.3) |
| Grade∗ | | |
| 1: Well-differentiated | 20 | 20.2 |
| 2: Moderately differentiated | 46 | 46.5 |
| 3: Poorly differentiated | 32 | 32.3 |
| 4: Undifferentiated | 1 | 1.0 |
| Unknown† | 17 | (14.7) |
| ER/PR status∗ | | |
| ER+/PR+ | 60 | 71.4 |
| ER+/PR− | 11 | 13.1 |
| ER−/PR+ | 0 | 0.0 |
| ER−/PR− | 13 | 15.5 |
| Unknown† | 32 | (27.6) |

ER, estrogen receptor; PR, progesterone receptor.

Table 4

Pairwise Agreement of Expert Reviews

| Agreement, n (%) | No Cancer (n = 171): Pair 1 and 2 | No Cancer: Pair 1 and 3 | No Cancer: Pair 2 and 3 | Cancer (n = 143): Pair 1 and 2 | Cancer: Pair 1 and 3 | Cancer: Pair 2 and 3 |
|---|---|---|---|---|---|---|
| Recall (woman-level) | | | | | | |
| Agree to recall | 29 (17.0) | 12 (7.0) | 17 (9.9) | 104 (72.7) | 77 (53.9) | 82 (57.3) |
| Agree on no-recall | 70 (40.9) | 109 (63.7) | 84 (49.1) | 13 (9.1) | 25 (17.5) | 18 (12.6) |
| Disagree on recall | 72 (42.1) | 50 (29.2) | 70 (40.9) | 26 (18.2) | 41 (28.7) | 43 (30.1) |
| Overall agreement | | | | | | |
| Woman-level | 99 (57.9) | 121 (70.8) | 101 (59.1) | 117 (81.8) | 102 (71.3) | 100 (69.9) |
| Breast-level | 96 (56.1) | 121 (70.8) | 99 (57.9) | 110 (76.9) | 95 (66.4) | 96 (67.1) |
| Lesion-level | 86 (50.3) | 118 (69.0) | 97 (56.7) | 105 (73.4) | 94 (65.7) | 95 (66.4) |
| Finding-level | 80 (46.8) | 116 (67.8) | 95 (55.6) | 81 (56.6) | 81 (56.6) | 80 (55.9) |
| Complete | 75 (43.9) | 112 (65.5) | 92 (53.8) | 53 (37.1) | 60 (42.0) | 52 (36.4) |
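
The recall/no-recall agreement summarized in the abstract as mean 74.3% ± 6.5% for cancers and 62.6% ± 7.1% for noncancers appears to correspond to the three woman-level overall agreement percentages above; under that assumption, the short check below (using the sample standard deviation) reproduces both reported values.

```python
from statistics import mean, stdev

# Woman-level overall pairwise agreement (%) from Table 4,
# for radiologist pairs (1,2), (1,3), and (2,3).
cancer = [81.8, 71.3, 69.9]
noncancer = [57.9, 70.8, 59.1]

for label, pcts in [("cancer", cancer), ("noncancer", noncancer)]:
    print(f"{label}: {mean(pcts):.1f}% ± {stdev(pcts):.1f}%")
# cancer: 74.3% ± 6.5%
# noncancer: 62.6% ± 7.1%
```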

Table 5

Agreement among All Three Experts

| Recall / Agreement, n (%) | All 3 Agree: No Cancer (n = 171) | All 3 Agree: Cancer (n = 143) | Majority (Any 2 of 3): No Cancer (n = 171) | Majority (Any 2 of 3): Cancer (n = 143) |
|---|---|---|---|---|
| Recall | | | | |
| Agree on no-recall | 64 (37.4) | 13 (9.1) | 135 (78.9) | 30 (21.0) |
| Disagree on recall | 96 (56.1) | 55 (38.5) | N/A | N/A |
| Agree to recall | 11 (6.4) | 75 (52.4) | 36 (21.1) | 113 (79.0) |
| Overall agreement | | | | |
| Woman-level | 75 (43.9) | 88 (61.5) | 171 (100.0) | 143 (100.0) |
| Breast-level | 75 (43.9) | 81 (56.6) | 166 (97.1) | 139 (97.2) |
| Lesion-level | 72 (42.1) | 79 (55.2) | 157 (91.8) | 136 (95.1) |
| Finding-level | 70 (40.9) | 58 (40.6) | 151 (88.3) | 126 (88.1) |
| Complete | 66 (38.6) | 30 (21.0) | 147 (86.0) | 105 (73.4) |
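
The "majority opinion" columns treat agreement by any two of the three experts as sufficient, whereas the "all 3 agree" columns require unanimity. A minimal sketch of those two decision rules for the recall decision alone is shown below; the helper names are hypothetical, and the actual gold-standard development also reconciled laterality, lesion location, finding type, and difficulty, which this toy version ignores.

```python
from collections import Counter


def majority_recall(reads):
    """2-of-3 majority rule on the recall decision (True = recall)."""
    votes = Counter(bool(r) for r in reads)
    return votes[True] >= 2


def unanimous_recall(reads):
    """All-three-agree rule; returns None when the experts split."""
    decisions = set(bool(r) for r in reads)
    return decisions.pop() if len(decisions) == 1 else None


# Example: the experts split 2-to-1 in favor of recall.
reads = [True, True, False]
print(majority_recall(reads))   # True  -> majority gold standard says recall
print(unanimous_recall(reads))  # None  -> no unanimous gold standard
```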

Table 6

Agreement in Finding Type When Experts Recall the Same Lesion

| Findings | 3 Experts Recalled: n | 3 Experts Recalled: %∗ | 2 of 3 Experts Recalled: n | 2 of 3 Experts Recalled: %∗ |
|---|---|---|---|---|
| All agree | 51 | 69 | 38 | 70 |
| All C | 21 | 41 | 13 | 34 |
| All M | 19 | 37 | 10 | 26 |
| All AD | 7 | 14 | 4 | 11 |
| All AS | 4 | 8 | 11 | 29 |
| Disagree | 23 | 31 | 16 | 30 |
| C, M | 1 | 4 | 2 | 13 |
| C, AD | 0 | 0 | 4 | 25 |
| C, AS | 3 | 13 | 0 | 0 |
| M, AD | 1 | 4 | 2 | 13 |
| M, AS | 16 | 70 | 6 | 38 |
| AS, AD | 2 | 9 | 2 | 13 |
| Total | 74 | 100 | 54 | 100 |

AD, architectural distortion; AS, asymmetry; C, calcification; M, mass.
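
One possible way to tally the agreement patterns in Table 6, given the finding-type labels assigned by the experts who recalled the same lesion, is sketched below; the input format is an assumption, and only the "All X" versus mixed-type categorization mirrors the table.

```python
from collections import Counter

# Finding-type codes from Table 6: C = calcification, M = mass,
# AD = architectural distortion, AS = asymmetry.


def finding_pattern(findings):
    """Label the finding-type agreement among experts recalling the same lesion."""
    distinct = sorted(set(findings))
    if len(distinct) == 1:
        return f"All {distinct[0]}"     # e.g. "All M"
    return ", ".join(distinct)          # e.g. "AS, M"


lesions = [["M", "M", "M"], ["M", "AS", "M"], ["C", "C"]]
print(Counter(finding_pattern(f) for f in lesions))
# Counter({'All M': 1, 'AS, M': 1, 'All C': 1})
```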

Discussion

Acknowledgments

References

  • 1. Elmore J.G., Miglioretti D.L., Carney P.A.: Does practice make perfect when interpreting mammography? Part II. JNCI 2003; 95: pp. 250-252.

  • 2. Nodine C.F., Kundel H.L., Mello-Thomas C., et. al.: How experience and training influence mammography expertise. Acad Radiol 1999; 6: pp. 575-585.

  • 3. Elmore J.G., Wells C.K., Lee C.H., et. al.: Variability in radiologists’ interpretations of mammograms. N Engl J Med 1994; 331: pp. 1493-1499.

  • 4. Elmore J.G., Jackson S.L., Abraham L., et. al.: Variability in interpretive performance at screening mammography and radiologists’ characteristics associated with accuracy. Radiology 2009; 253: pp. 641-651.

  • 5. Baker J.A., Kornguth P.J., Floyd C.E.: Breast imaging reporting and data system standardized mammography lexicon: observer variability in lesion description. Am J Roentgenol 1996; 166: pp. 773-778.

  • 6. Berg W.A., Campassi C., Langenberg P., et. al.: Breast Imaging Reporting and Data System: inter- and intraobserver variability in feature analysis and final assessment. Am J Roentgenol 2000; 174: pp. 1769-1777.

  • 7. Kerlikowske K., Grady D., Barclay J., et. al.: Variability and accuracy in mammographic interpretation using the American College of Radiology Breast Imaging Reporting and Data System. J Nat Cancer Inst 1998; 90: pp. 1801-1809.

  • 8. Kundel H.L., Polansky M.: Measurement of observer agreement. Radiology 2003; 228: pp. 303-308.

  • 9. Elmore J.G., Wells C.K., Lee C.H., et. al.: Variability in radiologists’ interpretations of mammograms. N Engl J Med 1994; 331: pp. 1493-1499.

  • 10. Ciatto S., Houssami N., Apruzzese A., et. al.: Reader variability in reporting breast imaging according to BI-RADS ® assessment categories (the Florence experience). Breast 2006; 15: pp. 44-51.

  • 11. Revesz G., Kundel H.L., Bonitatibus M.: The effect of verification on the assessment of imaging techniques. Invest Radiol 1990; 25: pp. 461-464.

  • 12. Ballard-Barbash R., Taplin S.H., Yankaskas B.C., et. al.: Breast Cancer Surveillance Consortium: a national mammography screening and outcomes database. AJR Am J Roentgenol 1997; 169: pp. 1001-1008.

  • 13. National Cancer Institute. Breast Cancer Surveillance Consortium Homepage. http://breastscreening.cancer.gov/. Accessed April 14, 2009.

  • 14. Carney P.A., Geller B.M., Moffett H., et. al.: Current medico-legal and confidentiality issues in large multi-center research programs. Am J Epidemiol 2000; 152: pp. 371-378.

  • 15. Carney P.A., Bogart T.A., Geller B.M., et. al.: Association between time spent interpreting, level of confidence, and accuracy of screening mammography. AJR Am J Roentgenol 2012; 198: pp. 970-978.

  • 16. American College of Radiology (ACR): ACR BI-RADS – Mammography. In: ACR Breast Imaging Reporting and Data System, Breast Imaging Atlas. Reston, VA: American College of Radiology; 2003.

  • 17. Hukkinen K., Kivisaari L., Vehmas T.: Impact of the number of readers on mammography interpretation. Acta Radiol 2006; 47: pp. 655-659.

  • 18. Shaw C.M., Flanagan F.L., Fenlon H.M., et. al.: Consensus review of discordant findings maximizes cancer detection rate in double-reader screening mammography: Irish National Breast Screening Program experience. Radiology 2009; 250: pp. 354-362.

  • 19. Duijm L.E., Louwman M.W., Groenewoud J.H., et. al.: Inter-observer variability in mammography screening and effect of type and number of readers on screening outcome. Br J Cancer 2009; 100: pp. 901-907.

  • 20. Robinson P.J.A., Wilson D., Coral A., et. al.: Variation between experienced observers in the interpretation of accident and emergency radiographs. Br J Radiol 1999; 72: pp. 323-330.

  • 21. Elsheikh T.M., Asa S.L., Chan J.K.C., et. al.: Interobserver and intraobserver variation among experts in the diagnosis of thyroid follicular lesions with borderline nuclear features of papillary carcinoma. Am J Clin Pathol 2008; 130: pp. 736-744.

  • 22. Brealey S., Scally A.J.: Bias in plain film reading performance studies. Br J Radiol 2001; 74: pp. 307-316.

  • 23. Brealey S.D., Scally A.J., Hahn S., et. al.: Evidence of reference standard related bias in studies of plain radiograph reading performance: a meta-regression. Br J Radiol 2007; 80: pp. 406-413.

  • 24. Sickles E.A.: The American College of Radiology’s mammography interpretive skills assessment (MISA) examination. Semin Breast Dis 2003; 6: pp. 133-139.
