Determinants of Difficulty and Discriminating Power of Image-based Test Items in Postgraduate Radiological Examinations

Rationale and Objectives

The psychometric characteristics of image-based test items in radiological written examinations are not well known. In this study, we explored the difficulty and discriminating power of these test items in postgraduate digital radiological examinations.

Materials and Methods

We reviewed test items of seven Dutch Radiology Progress Tests (DRPTs) taken from October 2013 to April 2017. The DRPT is a semiannual formative examination, required for all Dutch radiology residents. We assessed several stimulus and response characteristics of the test items. Response formats included true/false, single-best-answer multiple-choice with 2, 3, 4, or ≥5 answer options, pick-N multiple-choice, drag-and-drop, and long-list-menu. We calculated item P values and item-rest correlation (Rir) values to assess difficulty and discriminating power. We performed linear regression analysis on the image-based test items to investigate whether P and Rir values were significantly related to stimulus and response characteristics. We also compared psychometric indices between image-based and text-alone test items.
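
The paper does not publish analysis code, but both item indices are standard. The following is a minimal Python/numpy sketch, assuming a dichotomously scored candidates-by-items matrix, of how an item P value (proportion correct) and an item-rest correlation (Rir) can be computed; the simulated matrix at the end is purely illustrative.

```python
import numpy as np

def item_p_values(scores: np.ndarray) -> np.ndarray:
    """Item difficulty: proportion of candidates answering each item correctly.

    scores: candidates x items matrix of 0/1 item scores.
    """
    return scores.mean(axis=0)

def item_rest_correlations(scores: np.ndarray) -> np.ndarray:
    """Item discrimination: correlation of each item score with the rest score,
    i.e. the candidate's total score on all remaining items."""
    n_items = scores.shape[1]
    rir = np.empty(n_items)
    for i in range(n_items):
        rest_score = np.delete(scores, i, axis=1).sum(axis=1)
        rir[i] = np.corrcoef(scores[:, i], rest_score)[0, 1]
    return rir

# Illustrative use with simulated responses (200 candidates, 5 items):
rng = np.random.default_rng(0)
sim = (rng.random((200, 5)) < 0.7).astype(int)
print(item_p_values(sim))           # values near the simulated proportion correct
print(item_rest_correlations(sim))  # near zero here, because simulated items are independent
```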

Results

P and Rir values of image-based items (n = 369) were significantly related to the type of response format (P < .001), but not to the specific DRPT from which the item was obtained, the radiological subspecialty domain, the nonvolumetric or volumetric character of the images, or the context-rich or context-free character of the stimulus. When the type of response format was accounted for, the difficulty and discriminating power of image-based items did not differ significantly from those of text-alone items (n = 881). Test items with a relatively large number of answer options were generally more difficult and discriminated better between high- and low-performing candidates.
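
As a rough sketch of how such a regression can be set up (the DataFrame and its column names are hypothetical, and the data are synthetic stand-ins, not the study's data), item P values could be regressed on the categorical item characteristics with statsmodels as below; the same pattern applies to Rir values.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: one row per image-based item, hypothetical column names.
rng = np.random.default_rng(1)
n = 369
items = pd.DataFrame({
    "p_value": rng.uniform(0.3, 0.9, n),
    "response_format": rng.choice(["TF_MC2", "MC3", "MC4", "MC5plus", "drag_drop", "long_list"], n),
    "drpt": rng.choice([f"DRPT_{i}" for i in range(1, 8)], n),
    "subspecialty": rng.choice(["cardiothoracic", "neuro_hn", "abdominal", "msk"], n),
    "image_type": rng.choice(["volumetric", "nonvolumetric"], n),
    "stimulus": rng.choice(["context_rich", "context_free"], n),
})

# Regress item difficulty on the categorical stimulus and response characteristics.
model = smf.ols(
    "p_value ~ C(response_format) + C(drpt) + C(subspecialty) + C(image_type) + C(stimulus)",
    data=items,
).fit()
print(model.summary())  # coefficients and P values per characteristic
```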

Conclusion

In postgraduate radiological written examinations, difficulty and discriminating power of image-based test items are related to the type of response format and are comparable to those of text-alone items. We recommend a response format with a relatively large number of answer options to optimize psychometric indices of radiological image-based test items.

Introduction

Radiological knowledge and visual skills are central to the profession of a radiologist and are therefore trained during residency. To assess whether residents have reached sufficient competence in these domains, both workplace assessments and written examinations are essential. In many countries, written examinations are included in radiology residency programs. These examinations usually contain image-based test items in addition to text-alone items. Although radiological images traditionally consisted of x-ray photos, they are nowadays largely obtained through cross-sectional techniques such as computed tomography and magnetic resonance imaging.

Test Item Characteristics

Viewed from the perspective of test theory, test items in examinations can be characterized by their stimulus and their response format. The stimulus refers to what is being asked by the test item and is considered the main determinant of which type of competence is tested. The stimulus may be context-rich or context-free: context-rich items consist of a (case) scenario and ask for decisions related to that scenario, whereas context-free items mainly test factual knowledge. The response format refers to how the candidate's answer is captured. Examples of response formats are true/false items, multiple-choice questions with a single best answer option, and long-list-menu items with a large number of predefined answer options. All answers of the candidate together lead to a test score for the examination.

According to classical test theory, the observed test score is the sum of the true score (the candidate's actual knowledge) and measurement error. Sources of measurement error are the test itself, the testee, and the tester. The amount of measurement error is estimated by calculating the reliability of the test; if measurement error is reduced, the observed score of the candidate will approach the true score. Postexamination item analysis is recommended to investigate how test items perform in light of the objectives of the examination. In classical test theory, this analysis typically involves computing item difficulty and discrimination indices. Obviously, such psychometric characteristics have to be sound if the objective of the examination is to identify candidates who have not attained the expected learning outcomes.
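
For concreteness, a common reliability estimate in this classical-test-theory framework is Cronbach's alpha. The sketch below is not taken from the paper; it simply shows how alpha can be computed from a candidates-by-items score matrix, with a simulated matrix for illustration.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Reliability estimate: alpha = k/(k-1) * (1 - sum of item variances / variance of total scores).

    scores: candidates x items matrix of item scores.
    """
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

# Illustration: 300 candidates, 20 items that all reflect a common latent ability.
rng = np.random.default_rng(0)
ability = rng.normal(size=(300, 1))
items = (ability + rng.normal(size=(300, 20)) > 0).astype(int)
print(cronbach_alpha(items))  # relatively high alpha, since items share a common signal
```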

Quality of Items with and without Images

Aim of the Study

Methods

Dutch Radiology Progress Test

Data Collection

Statistical Analysis

Institutional Review Board Approval

Results

Figure 1, Example of a test item with a nonvolumetric image and a long-list-menu response format. The test item includes an x-ray photo of a 19-year-old man. The candidate is asked to give the most likely diagnosis. In this example, the letters “Calc” were typed, after which a drop-down menu appeared with answer options that contain this sequence of letters. These options are automatically derived from a much longer list of answer options.
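
The type-ahead behavior described in Figure 1 can be illustrated with a few lines of code. This is a toy sketch, not the DRPT examination software, and the diagnosis list is a hypothetical excerpt.

```python
# Hypothetical excerpt; the real long-list menu contains far more predefined options.
DIAGNOSES = [
    "Calcaneal stress fracture",
    "Calcific tendinitis",
    "Osteochondritis dissecans",
    "Osteoid osteoma",
]

def filter_options(typed: str, options=DIAGNOSES) -> list[str]:
    """Return the answer options whose text contains the typed letters."""
    return [o for o in options if typed.lower() in o.lower()]

print(filter_options("Calc"))  # ['Calcaneal stress fracture', 'Calcific tendinitis']
```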

Figure 2, Example of a drag-and-drop test item. The candidate is asked to place the marker in an anatomical abdominal structure.

Figure 3, Example of a test item with a volumetric image and a single best answer multiple-choice response format. The test item includes a volumetric T1-weighted magnetic resonance imaging scan after intravenous administration of contrast agent. The candidate can scroll through the volumetric image. The candidate is asked to give the most likely diagnosis from a list of four answer options.

TABLE 1

Overall Characteristics of Test Items

| Characteristic | Number of Items (%) |
| --- | --- |
| Total number of test items | 1280 |
| Subspecialty domain | |
| Cardiothoracic radiology | 274 (21%) |
| Neuro- and head-and-neck radiology | 264 (21%) |
| Abdominal radiology | 248 (19%) |
| Musculoskeletal radiology | 212 (17%) |
| Pediatric radiology | 104 (8%) |
| Breast radiology | 74 (6%) |
| Interventional radiology | 72 (6%) |
| Nuclear medicine | 32 (3%) |
| Images in test items | |
| Yes | 376 (29%) |
| Nonvolumetric | 235 (18%) |
| Volumetric | 141 (11%) |
| No | 904 (71%) |
| Stimulus format | |
| Context-rich | 471 (37%) |
| Context-free | 809 (63%) |
| Response format | |
| True/false | 912 (71%) |
| Single best answer multiple-choice | 292 (23%) |
| 2 options | 14 (1%) |
| 3 options | 32 (3%) |
| 4 options | 138 (11%) |
| ≥5 options | 108 (8%) |
| Pick-N multiple-choice | 8 (1%) |
| Drag-and-drop | 22 (2%) |
| Long-list-menu | 46 (4%) |
| Test items with flaws in post-examination feedback round | 23 (2%) |

Values are given as number, with percentage of total number of items in parentheses.

TABLE 2

Subspecialty Domain and Stimulus Format in Image-based Test Items vs Text-alone Test Items

| | Image-based Test Items (n = 369) | Text-alone Test Items (n = 881) | Chi-square Test |
| --- | --- | --- | --- |
| Subspecialty domain | | | n.s. |
| Cardiothoracic radiology | 73 (20%) | 190 (22%) | |
| Neuro- and head-and-neck radiology | 75 (20%) | 183 (21%) | |
| Abdominal radiology | 70 (19%) | 171 (19%) | |
| Musculoskeletal radiology | 64 (17%) | 146 (17%) | |
| Pediatric radiology | 29 (8%) | 74 (8%) | |
| Breast radiology | 28 (8%) | 45 (5%) | |
| Interventional radiology | 20 (5%) | 51 (6%) | |
| Nuclear medicine | 10 (3%) | 21 (2%) | |
| Stimulus format | | | P < .001 |
| Context-rich | 315 (85%) | 144 (16%) | |
| Context-free | 54 (15%) | 737 (84%) | |

n.s., not significant.

Values are given as number, with percentage in parentheses.
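
The chi-square result for stimulus format can be reproduced directly from the counts in Table 2 (315/54 context-rich/context-free for image-based items vs 144/737 for text-alone items). A minimal scipy sketch:

```python
from scipy.stats import chi2_contingency

# Rows: image-based, text-alone; columns: context-rich, context-free (counts from Table 2).
counts = [[315, 54],
          [144, 737]]
chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.1e}")  # p is far below .001, as reported
```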

TABLE 3

P Values in Image-based Test Items vs Text-alone Test Items, Categorized According to Type of Response Format

| Type of Response Format | n (Image-based) | P Value, median (Q1–Q3) | n (Text-alone) | P Value, median (Q1–Q3) | Mann-Whitney U Test |
| --- | --- | --- | --- | --- | --- |
| TF/MC 2 options \* | 112 | .72 (.59–.81) | 795 | .72 (.57–.81) | n.s. |
| MC 3 options | 16 | .58 (.41–.84) | 16 | .74 (.62–.83) | n.s. |
| MC 4 options | 88 | .64 (.52–.75) | 50 | .64 (.46–.73) † | n.s. |
| MC ≥5 options | 87 | .61 (.42–.74) ‡ | 20 | .54 (.38–.73) ‡ | n.s. |
| Drag-and-drop | 22 | .59 (.40–.79) | 0 | — | — |
| Long-list-menu | 44 | .56 (.43–.75) † | 0 | — | — |

MC, multiple-choice; n.s., not significant; Q1, 25th percentile; Q3, 75th percentile; TF, true or false.
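
As a sketch of the comparison behind Table 3, assuming the per-item P values were available as two arrays per response-format category, a Mann-Whitney U test could be run as follows. The arrays below are simulated placeholders, not the study's data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Simulated placeholders for, e.g., the MC 4 options row (n = 88 image-based, n = 50 text-alone).
rng = np.random.default_rng(2)
p_image_based = rng.uniform(0.4, 0.9, 88)
p_text_alone = rng.uniform(0.4, 0.9, 50)

u_stat, p_val = mannwhitneyu(p_image_based, p_text_alone)  # two-sided by default
print(u_stat, p_val)
```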

TABLE 4

Rir Values in Image-based Test Items vs Text-alone Test Items, Categorized According to Type of Response Format

| Type of Response Format | n (Image-based) | Rir (mean ± SD) | n (Text-alone) | Rir (mean ± SD) | Student t Test |
| --- | --- | --- | --- | --- | --- |
| TF/MC 2 options \* | 112 | 0.17 ± 0.13 | 795 | 0.15 ± 0.12 | n.s. |
| MC 3 options | 16 | 0.15 ± 0.14 | 16 | 0.18 ± 0.13 | n.s. |
| MC 4 options | 88 | 0.23 ± 0.12 | 50 | 0.22 ± 0.14 ‡ | n.s. |
| MC ≥5 options | 87 | 0.26 ± 0.12 ‡ | 20 | 0.21 ± 0.12 | n.s. |
| Drag-and-drop | 22 | 0.28 ± 0.09 † | 0 | — | — |
| Long-list-menu | 44 | 0.37 ± 0.12 § | 0 | — | — |

MC, multiple-choice; n.s., not significant; Rir, item-rest correlation; SD, standard deviation; TF, true or false.

Discussion

Acknowledgment

References

  • 1. Alderson P.O., Becker G.J.: The new requirements and testing for American Board of Radiology certification in diagnostic radiology. Radiology 2008; 248: pp. 707-709.

  • 2. Ilyas S., Beatie A., Pettet G., et. al.: Junior Radiologists’ Forum (JRF): national trainee survey. Clin Radiol 2014; 69: pp. 952-958.

  • 3. Di Marco L., Conway W.F., Chapin R.: Radiology resident education in France from medical school through board certification. J Am Coll Radiol 2015; 12: pp. 1097-1102.

  • 4. Ravesloot C.J., van der Schaaf M.F., Kruitwagen C.L., et. al.: Predictors of knowledge and image interpretation skill development in radiology residents. Radiology 2017; 284: pp. 758-765.

  • 5. Schuwirth L.W., van der Vleuten C.P.: Different written assessment methods: what can be said about their strengths and weaknesses?. Med Educ 2004; 38: pp. 974-979.

  • 6. van Bruggen L., Manrique-van Woudenbergh M., Spierenburg E., et. al.: Preferred question types for computer-based assessment of clinical reasoning: a literature study. Perspect Med Educ 2012; 1: pp. 162-171.

  • 7. Cerutti B., Blondon K., Galetto A.: Long-menu questions in computer-based assessments: a retrospective observational study. BMC Med Educ 2016; 16: pp. 55.

  • 8. Tavakol M., Dennick R.: Post-examination analysis of objective tests. Med Teach 2011; 33: pp. 447-458.

  • 9. Tavakol M., Dennick R.: Post-examination interpretation of objective test data: monitoring and improving the quality of high-stakes examinations: AMEE guide no. 66. Med Teach 2012; 34: pp. e161-e175.

  • 10. De Champlain A.F.: A primer on classical test theory and item response theory for assessments in medical education. Med Educ 2010; 44: pp. 109-117.

  • 11. Schuwirth L.W., van der Vleuten C.P.: General overview of the theories used in assessment: AMEE guide no. 57. Med Teach 2011; 33: pp. 783-797.

  • 12. Hays R.B., Hamlin G., Crane L.: Twelve tips for increasing the defensibility of assessment decisions. Med Teach 2015; 37: pp. 433-436.

  • 13. Holland J., O’Sullivan R., Arnett R.: Is a picture worth a thousand words: an analysis of the difficulty and discrimination parameters of illustrated vs. text-alone vignettes in histology multiple choice questions. BMC Med Educ 2015; 15: pp. 184.

  • 14. Vorstenbosch M.A., Klaassen T.P., Kooloos J.G., et. al.: Do images influence assessment in anatomy? Exploring the effect of images on item difficulty and item discrimination. Anat Sci Educ 2013; 6: pp. 29-41.

  • 15. Notebaert A.J.: The effect of images on item statistics in multiple choice anatomy examinations. Anat Sci Educ 2017; 10: pp. 68-78.

  • 16. Ravesloot C.J., van der Schaaf M.F., van Schaik J.P., et. al.: Volumetric CT-images improve testing of radiological image interpretation skills. Eur J Radiol 2015; 84: pp. 856-861.

  • 17. Ravesloot C., van der Schaaf M., Haaring C., et. al.: Construct validation of progress testing to measure knowledge and visual skills in radiology. Med Teach 2012; 34: pp. 1047-1055.

  • 18. Swanson D.B., Holtzman K.Z., Allbee K., et. al.: Psychometric characteristics and response times for content-parallel extended-matching and one-best-answer items in relation to number of options. Acad Med 2006; 81: pp. S52-S55.

  • 19. Phipps S.D., Brackbill M.L.: Relationship between assessment item format and item performance characteristics. Am J Pharm Educ 2009; 73: pp. 146.

  • 20. Schneid S.D., Armour C., Park Y.S., et. al.: Reducing the number of options on multiple-choice questions: response time, psychometrics and standard setting. Med Educ 2014; 48: pp. 1020-1027.

  • 21. van der Gijp A., Ravesloot C.J., van der Schaaf M.F., et. al.: Volumetric and two-dimensional image interpretation show different cognitive processes in learners. Acad Radiol 2015; 22: pp. 632-639.

  • 22. van der Gijp A., van der Schaaf M.F., van der Schaaf I.C., et. al.: Interpretation of radiological images: towards a framework of knowledge and skills. Adv Health Sci Educ Theory Pract 2014; 19: pp. 565-580.

  • 23. Wrigley W., van der Vleuten C.P., Freeman A., et. al.: A systemic framework for the progress test: strengths, constraints and issues: AMEE guide no. 71. Med Teach 2012; 34: pp. 683-697.

  • 24. Swanson D.B., Hawkins R.E.: Using written examinations to assess medical knowledge and its application. In: Holmboe E.S., Durning S.J., Hawkins R.E. (eds): Practical guide to the evaluation of clinical competence. Philadelphia, PA: Elsevier; 2018. pp. 113-139.

  • 25. Anderson L.W., Krathwohl D.R., Airasian P.W., et. al.: A taxonomy for learning, teaching, and assessing: a revision of Bloom’s taxonomy of educational objectives. New York: Longman; 2001.
