
What Factors Affect the Discrepancy Rate Between Preliminary Resident Interpretations of Neuroimaging Studies and the Final Attending Interpretation?

This month’s issue of Academic Radiology contains an educational research study by Sistrom and Deitte (1) on the performance of radiology residents in the interpretation of after-hours and overnight neuroimaging studies over a period of approximately 5 years. The researchers performed a systematic, prospective review of nearly 22,000 preliminary interpretations of computed tomography (CT) and magnetic resonance (MR) imaging studies of the head, orbits, face, neck, and spine, and obtained discrepancy data from the attending neuroradiologist who created the final radiology report. The major results of this study can be described in two separate sections in this editorial. First, this study confirms the results of several smaller studies that demonstrate a relatively low rate of disagreement between the preliminary resident interpretation and the final attending neuroradiologist interpretation. In this study, the authors found an overall disagreement rate of 3.91%, with the vast majority of the discrepant findings deemed to have minimal clinical impact (3.05%). The rate of disagreements involving findings deemed to have significant clinical impact was 0.86%, and the mean number of significant disagreements remained stable at about three per month despite increasing case volume over the study period. In conjunction with the prior research studies, these results support the longstanding policy at many academic medical centers of allowing residents the privilege of providing initial interpretations of cross-sectional neuroimaging studies for patient care purposes during the hours when attending neuroradiologists cannot be physically present.
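
The headline numbers are easy to verify from the figures quoted above. The short Python sketch below recomputes the rates from illustrative counts back-calculated from the ~22,000 studies and the reported percentages; the counts are approximations for demonstration, not the raw data from the paper.

```python
# Back-of-the-envelope check of the discrepancy rates quoted above.
# The counts are illustrative, derived from ~22,000 studies and the
# percentages reported in the editorial, not the paper's raw data.

total_studies = 22_000            # approximate number of preliminary interpretations
minimal_discrepancies = 671       # ~3.05% of total (illustrative count)
significant_discrepancies = 189   # ~0.86% of total (illustrative count)
study_months = 62                 # length of the study period

overall_rate = (minimal_discrepancies + significant_discrepancies) / total_studies
significant_per_month = significant_discrepancies / study_months

print(f"Overall disagreement rate: {overall_rate:.2%}")                      # ~3.91%
print(f"Significant disagreements per month: {significant_per_month:.1f}")   # ~3
```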

Second, Sistrom and Deitte (1) used this large dataset to assess factors that may affect the rate of agreement or disagreement (hereafter referred to as the “level of agreement”) between the preliminary resident interpretation and the final attending interpretation. Based upon their a priori knowledge as radiology educators, they identified six measurable factors and incorporated them into a logistic regression model. The logistic regression analysis tests the hypothesis that one or more of these factors can predict the level of agreement between the preliminary resident interpretation and the final attending interpretation of neuroimaging studies. Two of these factors identify the parties involved in the interpretation of the neuroimaging studies: the resident who provides the preliminary interpretation and the attending neuroradiologist who renders the final interpretation and also determines the presence or absence of agreement with the preliminary interpretation. Three other factors are the imaging modality (CT or MR), the area examined (head, face, neck, or spine), and the month of the 62-month study period in which the study was performed. For the sixth factor, the authors devised a measure (called “quartiles”) to assess the effect of learning upon the agreement rate for each individual resident over time. Because residents are believed to gain experience and develop greater acumen in neuroimaging throughout the residency years, one might expect a steady improvement in the rate of agreement between their initial interpretations and the final attending interpretations, thereby yielding quantitative evidence of improvement in resident performance.
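
For readers who want to see what such a model looks like in practice, the following is a minimal sketch of a comparable logistic regression in Python using statsmodels. The file name and column names (resident_id, attending_id, modality, body_area, study_month, quartile, and the binary outcome agree) are hypothetical placeholders, not the authors’ actual variables.

```python
# Sketch of a logistic regression with the six predictors described above.
# All column names and the input file are hypothetical; 'agree' is 1 when
# the attending agreed with the resident's preliminary interpretation.
import pandas as pd
import statsmodels.formula.api as smf

cases = pd.read_csv("preliminary_reads.csv")  # hypothetical per-study dataset

model = smf.logit(
    "agree ~ C(resident_id) + C(attending_id) + C(modality)"
    " + C(body_area) + study_month + C(quartile)",
    data=cases,
)
result = model.fit()
print(result.summary())  # coefficients and p-values for each predictor
```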

Analytically, the logistic regression model defines these six factors as independent variables and the level of agreement as the outcome variable. The regression analysis (including the “goodness-of-fit” analysis) revealed that five of the six independent variables were significantly associated (at the P < .05 level) with the outcome variable; only the quartiles variable (which describes the resident learning effect) showed no significant association with the level of agreement. The goodness-of-fit analysis demonstrated that the model represents a reasonable fit of the observed data. (For those readers poring over the statistical details, it is important to realize that a P value greater than 0.05 (yes, P > .05) in the Hosmer-Lemeshow test for goodness of fit means that the null hypothesis of no significant difference between estimated and observed frequencies is not rejected, leading to the conclusion that this model does present a reasonable fit of the observed data.) However, the authors also note that the model’s adjusted R-square value (a “pseudo-correlation coefficient”) is only 0.11, indicating that the model does not explain a large portion of the variance in the likelihood of agreement between residents and attending radiologists. The authors suggest that the unexplained variation may be due to factors associated with the neuroimaging cases themselves, or to error. Given the reasonable assumption that learning must have some effect on resident performance (even if not shown in this analysis), one can propose that each radiology resident actually has a different neuroimaging learning experience with respect to the specific cases encountered during call rotations during his or her residency. These differences could be related to frequency effects (eg, some residents encounter higher percentages of studies with abnormal findings than other residents), order effects (eg, some residents encounter many more “tough cases” earlier in residency than others), and idiosyncratic timing effects (eg, a resident may have just learned or read about a particular neuroimaging entity or finding that then happens to occur more frequently during his or her call shifts). This unexplained variation in agreement rate may serve as a source of future investigation, and perhaps the nature of residents’ actual learning experiences could be defined more precisely and measured with greater accuracy.
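
To make the goodness-of-fit discussion concrete, the sketch below implements a Hosmer-Lemeshow test by hand (statsmodels does not ship one) and reports McFadden’s pseudo R-squared, which is one common variant and may not be the exact “pseudo-correlation coefficient” used by the authors. It assumes the hypothetical cases data frame and fitted result from the previous sketch.

```python
# Hosmer-Lemeshow goodness-of-fit test using deciles of predicted risk,
# applied to the hypothetical fitted model from the previous sketch.
import numpy as np
import pandas as pd
from scipy.stats import chi2

def hosmer_lemeshow(y_true, y_prob, groups=10):
    """Return the Hosmer-Lemeshow chi-square statistic and its p-value."""
    df = pd.DataFrame({"y": np.asarray(y_true), "p": np.asarray(y_prob)})
    df["decile"] = pd.qcut(df["p"], groups, duplicates="drop")
    grouped = df.groupby("decile", observed=True)
    observed = grouped["y"].sum()     # observed agreements per risk group
    expected = grouped["p"].sum()     # expected agreements per risk group
    n = grouped.size()
    stat = (((observed - expected) ** 2) / (expected * (1 - expected / n))).sum()
    dof = len(observed) - 2
    return stat, chi2.sf(stat, dof)

stat, p_value = hosmer_lemeshow(cases["agree"], result.predict())
print(f"Hosmer-Lemeshow statistic {stat:.2f}, p = {p_value:.3f}")  # p > .05 suggests adequate fit
print(f"McFadden pseudo R-squared: {result.prsquared:.2f}")        # small values = little variance explained
```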

What is the most surprising result of the logistic regression analysis? The measured factor that most affected the level of agreement between preliminary resident interpretations and final attending interpretations was the group of attending neuroradiologists. In other words, differences among individual attending neuroradiologists in determining the presence or absence of agreement with the radiology resident accounted for the largest source of variance defined in this regression model. How can this be true? A non-radiologist might consider the obvious explanation that the nine attending neuroradiologists at the University of Florida did not exhibit high levels of interobserver agreement in interpretation of the ∼22,000 neuroimaging studies over 62 months. Although the scientific way to rebut this explanation would be to perform a multireader research study that uses the κ statistic to measure interobserver variation, most academic neuroradiologists would not believe that this explanation has sufficient face validity to warrant such an undertaking. However, a more tenable explanation can be advanced based upon the fact that attending neuroradiologists actually had to perform two tasks in this study. Besides providing the final interpretation, the attending neuroradiologist needed to fill out a form to indicate agreement or disagreement with a resident’s preliminary interpretation. Because fellowship training in neuroradiology provides oversight in the production of radiology reports but not in distinguishing degrees of disagreement with the reports of other radiologists, it is reasonable to expect a discernible degree of variation among the nine attending neuroradiologists in determining what constitutes a “minimal” disagreement and what constitutes a “significant” disagreement. For example, many neuroradiologists would consider a resident’s failure to mention certain benign findings, such as a concha bullosa (aerated middle turbinate), hyperostosis frontalis interna (ie, bilateral frontal hyperostosis), or an enlarged cisterna magna, to be clinically immaterial and therefore not even a “disagreement with minimal clinical impact.” However, other neuroradiologists might consider one or more of these omissions to represent a discrepancy. Similarly, there could be a difference of opinion about whether to classify a resident’s failure to diagnose mild cerebral atrophy in a 40-year-old individual as a “disagreement with minimal clinical impact” or a “disagreement with significant impact.” Because premature cerebral atrophy can be a harbinger of human immunodeficiency virus (HIV) infection in some patients, this finding could be construed as having significant clinical impact. Finally, attending neuroradiologists may have had different thresholds for marking a disagreement on the written form as a means of prompting immediate feedback to residents about specific cases.
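
For illustration, the κ statistic mentioned above can be computed as follows for a pair of readers. The two readers’ labels below are invented for demonstration, and a true multireader study would use an extension such as Fleiss’ kappa rather than the pairwise Cohen’s kappa shown here.

```python
# Illustration of the kappa statistic for interobserver agreement between
# two readers; the labels are hypothetical categorical reads, not study data.
from sklearn.metrics import cohen_kappa_score

reader_a = ["normal", "abnormal", "abnormal", "normal", "normal", "abnormal"]
reader_b = ["normal", "abnormal", "normal",   "normal", "normal", "abnormal"]

kappa = cohen_kappa_score(reader_a, reader_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance-level agreement
```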

Detailed subanalyses based on the odds ratios for each predictor variable in the logistic regression revealed an unexpected finding. After accounting for all other factors, the level of disagreement in preliminary interpretation of the neuroimaging studies differed very little among residents. When the resident with the lowest percent agreement with attending neuroradiologists (88.8% concordance) was used as the reference standard, the residents with the highest percent agreement with attendings (98% to 99% concordance) did not show a statistically significant difference in the level of agreement compared with the reference resident. (Only three residents differed significantly from the reference resident at the 95% level of confidence, and their concordance rates with attending radiologists were all around 90%, numerically close to that of the reference resident.) Given the large number of preliminary resident interpretations in this study, there presumably was sufficient statistical power to detect a significant difference among residents in the preliminary interpretation of after-hours and weekend neuroimaging studies, if one had truly existed.
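
As a rough illustration of how such per-predictor odds ratios are obtained, the sketch below exponentiates the coefficients and confidence limits of the hypothetical fitted model from the earlier statsmodels sketch; it is not the authors’ actual analysis.

```python
# Odds ratios and 95% confidence intervals from the hypothetical fitted
# logit model 'result' defined in the earlier sketch.
import numpy as np

odds_ratios = np.exp(result.params)                                  # OR relative to each reference level
conf_int = np.exp(result.conf_int()).rename(columns={0: "2.5%", 1: "97.5%"})
conf_int["odds ratio"] = odds_ratios
print(conf_int)  # an interval spanning 1.0 implies no significant difference from the reference
```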

The inability to demonstrate a statistically significant difference among residents in preliminary interpretation of the after-hours and weekend neuroimaging studies runs counter to the belief of many academic neuroradiologists that some radiology residents have greater aptitude and affinity for interpretation of neuroimaging studies than their peers. This apparent contradiction can be reconciled if we consider the performance of residents in preliminary interpretation of after-hours neuroimaging studies to be an imperfect “natural” test of resident abilities. This study indicates that apparent numerical differences in resident performance in preliminary interpretations (ie, between 89% and 99% level of agreement with attending radiologists) are not statistically significant. Furthermore, these differences may be due to factors beyond the control of the residents and cannot be attributed solely to differences in the acumen and skill of the individual resident. Therefore, radiology educators and program directors would be well advised to heed the call by Drs. Sistrom and Deitte to exhibit “caution in using simple rates of disagreement to compare residents for remediation or evaluation.”

References

1. Sistrom C, Deitte L. Factors affecting attending agreement with resident early readings of computed tomography and magnetic resonance imaging of the head, neck, and spine. Acad Radiol 2008;15:934-941.