What is the Reliability of the SCID-I?

The accompanying table summarizes the most comprehensive published reliability studies of the SCID-I. (A more complete list of studies reporting on the reliability of the various versions of the SCID is contained here.) Reliability for categorical constructs, such as the DSM-IV diagnoses assessed by the SCID, is reported in terms of Kappa, a statistic that corrects for chance agreement. Kappa values above .70 are considered to reflect good agreement; values from .50 to .70, fair agreement; and values below .50, poor agreement. As the table shows, the range of Kappas across studies and diagnoses is enormous. Many factors influence the reliability of an interview instrument such as the SCID; we address some of these below.
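To make the chance correction concrete, here is a minimal Python sketch of Cohen's kappa for two raters who each assign one categorical diagnosis per subject. The two-rater setup, the example ratings and the function names `cohens_kappa` and `describe_kappa` are illustrative assumptions, not the computation used by any particular study in the table; studies with more than two raters per subject may use other variants of kappa. The labels returned by `describe_kappa` follow the cut-offs given above.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who each assign one label per subject."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of subjects given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: for each label, the product of the two raters'
    # marginal proportions, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / n**2
    if p_chance == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_chance) / (1 - p_chance)


def describe_kappa(kappa):
    """Verbal label using the cut-offs quoted in the text."""
    if kappa > 0.70:
        return "good"
    if kappa >= 0.50:
        return "fair"
    return "poor"


# Hypothetical ratings for 20 subjects: the raters disagree on 2 of them.
rater_1 = ["MDD"] * 10 + ["none"] * 10
rater_2 = ["MDD"] * 9 + ["none"] + ["MDD"] + ["none"] * 9
k = cohens_kappa(rater_1, rater_2)
print(f"kappa = {k:.2f} ({describe_kappa(k)})")  # kappa = 0.80 (good)
```

In this hypothetical example the raters agree on 90% of subjects, but given each rater's marginal rates, 50% agreement would be expected by chance alone, so Kappa comes out at .80 rather than .90.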

Joint interviews vs. a test/retest design. In some studies, a subject is interviewed by one clinician while others observe (either in person or by reviewing a tape) and then make independent ratings. Joint interviews produce the highest reliability figures, both because all raters hear exactly the same story and because whether the interviewer follows a skip instruction reveals to the observers how the interviewer rated the preceding items. A more stringent test of reliability (test/retest) has the subject interviewed on two different occasions by two different interviewers. This design tends to yield lower reliability because the subject may, even when asked the same questions, tell different stories to the two interviewers (information variance), resulting in divergent ratings.

Interviewer training. Raters who are well trained, and particularly raters who train and work together, are likely to agree more closely on their ratings. Notably, the professional discipline of the interviewer (e.g., psychiatrist, psychologist, social worker) does not appear to contribute to differences in reliability. An ongoing training and quality assurance program, such as the one in place at the UCLA Research Center for Major Mental Illness (see the Ventura, 1998 reference in "Background Articles on the SCID-I and SCID-II: Description, Administration, Training"), has demonstrated that a high level of reliability (e.g., Kappas of at least .75 on symptoms and 90% accuracy in diagnosis) can be maintained as interviewers leave and new interviewers are trained.

Subject population. Patients with the most severe and florid psychiatric disorders (e.g., patients repeatedly hospitalized with Schizophrenia or Bipolar Disorder) are likely to yield more reliable SCID diagnoses than subjects with milder conditions that border on normality. This is because minor disagreements between raters have the greatest effect on the diagnosis when the severity of the disorder is just at the diagnostic threshold. For example, for a patient with exactly 5 of the 9 symptoms of a major depressive episode, a disagreement about a single criterion makes the difference between a diagnosis of Major Depressive Disorder and Depressive Disorder NOS, whereas a one-item disagreement for a patient with 7 of 9 symptoms would not change the diagnosis at all. Furthermore, studies that screen out subjects who are poor historians or who have exceptionally complex histories of psychopathology will also report higher reliability than studies without any pre-screening.
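The threshold effect in the depression example can be checked with a few lines of arithmetic. This sketch deliberately reduces the DSM-IV criteria to a bare count of symptoms rated present (at least 5 of 9) and ignores the other requirements for a major depressive episode, such as the two-week duration; `meets_threshold` and `diagnosis_flips` are hypothetical helpers written for illustration only.

```python
# Simplified model (assumption): a major depressive episode is "diagnosed"
# whenever at least 5 of the 9 symptom criteria are rated present.
# Duration, impairment, the core-symptom rule and exclusions are ignored.
N_SYMPTOMS = 9
THRESHOLD = 5


def meets_threshold(n_present: int) -> bool:
    """True if the number of symptoms rated present reaches the cut-off."""
    return n_present >= THRESHOLD


def diagnosis_flips(n_present: int) -> bool:
    """Hypothetical helper: would a second rater who scores exactly one
    symptom differently (one more or one fewer) reach a different verdict?"""
    verdict = meets_threshold(n_present)
    neighbours = [m for m in (n_present - 1, n_present + 1) if 0 <= m <= N_SYMPTOMS]
    return any(meets_threshold(m) != verdict for m in neighbours)


for n in (5, 7):
    print(f"{n} of {N_SYMPTOMS} symptoms: one-item disagreement changes diagnosis? {diagnosis_flips(n)}")
# 5 of 9 symptoms: one-item disagreement changes diagnosis? True
# 7 of 9 symptoms: one-item disagreement changes diagnosis? False
```

Scanning every possible count shows that, under this simplified rule, only patients with 4 or 5 symptoms sit close enough to the cut-off for a single-item disagreement to change the diagnosis.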

Base rates. The base rates of the diagnoses in the population being studied affect the reported reliability. If the error of measurement for a diagnostic instrument is constant, reliability varies directly with the base rates. It is thus harder to obtain good reliability for a rare diagnosis than for a common diagnosis. For example, SCID reliability for Major Depressive Disorder will be higher in a Mood Disorders Clinic than in a community sample, in which the base rate of Major Depressive Disorder is much lower.
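The base-rate effect can be illustrated under one simple model of constant measurement error: two raters who each, independently, recognize a true case with sensitivity .90 and correctly rule out a non-case with specificity .90. These accuracy figures and the `expected_kappa` helper are assumptions chosen for illustration, not values estimated from the SCID studies. Under this model the raters' raw agreement stays at 82% at every base rate, yet the expected Kappa falls as the diagnosis becomes rarer.

```python
def expected_kappa(base_rate, sensitivity=0.9, specificity=0.9):
    """Expected Cohen's kappa for two raters whose errors are independent
    and whose accuracy (assumed sensitivity/specificity) is fixed."""
    p, se, sp = base_rate, sensitivity, specificity
    # Probability that both raters call a subject positive / both negative.
    both_pos = p * se**2 + (1 - p) * (1 - sp)**2
    both_neg = p * (1 - se)**2 + (1 - p) * sp**2
    observed = both_pos + both_neg
    # Each rater's marginal probability of a positive call.
    q = p * se + (1 - p) * (1 - sp)
    chance = q**2 + (1 - q)**2
    return (observed - chance) / (1 - chance)


for rate in (0.50, 0.30, 0.10, 0.05):
    print(f"base rate {rate:.2f}: kappa = {expected_kappa(rate):.2f}")
# base rate 0.50: kappa = 0.64
# base rate 0.30: kappa = 0.60
# base rate 0.10: kappa = 0.39
# base rate 0.05: kappa = 0.25
```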

Reference (See Below) Skre et al., 1991 Zanarini et al., 2000 Zanarini et al., 2000 Segal et al., 1995 Williams et al., 1992 Zanarini et al., 2001 Zanarini et al., 2001 Lobbestael et al., 2010
Population Studied N=54 N=27 N=52 N=40 N=592; Mixed Inpt, Outpt, Non-Pt. N=45 N=30 N=151
Design of Reliability Study Joint;
84 Rater-Pairs from 4 sites
7-10 Day Interval Test-Retest Joint;
1-3 Week Interval Test-Retest Joint;
Observed Live
7-10 Day Interval Test-Retest Joint; Audio-Tape
Major Depressive Disorder .93 .80 .61 .90 .64 .90 .73 .66
Dysthymic Disorder .88 .76 .35 .53 .40 .91 .60 .81
Bipolar Disorder .79 .84
Schizophrenia .94 .65
Alcohol Dependence/Abuse .96 1.0 .77 .75 1.0
Other Substance Dependence/Abuse .85 1.0 .76 .84 .95 .77 .77
Panic Disorder .88 .65 .65 .80 .58 .88 .82 .67
Social Phobia .72 .63 .59 .47 .86 .53 .83
OCD .40 .57 .60 .59 .70 .42 .65
GAD .95 .63 .44 .56 .73 .63 .75
PTSD .77 .88 .78 1.0 1.0 .77
Any Somatoform Disorder -.03 .84
Any Eating Disorder .77 .64 .61
Agoraphobia     .60
Specific Phobia     .83

Complete References

Lobbestael J, Leurgans M, Arntz A. Inter-rater reliability of the Structured Clinical Interview for DSM-IV Axis I Disorders (SCID I) and Axis II Disorders (SCID II). Clin Psychol Psychother 2010 Mar 21.

Segal DL, Kabacoff RI, Hersen M, Van Hasselt VB, Ryan CF. Update on the Reliability of Diagnosis in Older Psychiatric Outpatients Using the Structured Clinical Interview for DSM-III-R. Journal of Clinical Geropsychology 1995;1:313-321.

Skre I, Onstad S, Torgersen S, Kringlen E. High interrater reliability for the Structured Clinical Interview for DSM-III-R Axis I (SCID-I). Acta Psychiatr Scand 1991;84(2):167-173.

Williams JBW, Gibbon M, First MB, Spitzer RL, Davies M, Borus J, Howes MJ, Kane J, Pope HG, Rounsaville B, Wittchen H. The Structured Clinical Interview for DSM-III-R (SCID). II. Multi-site test-retest reliability. Arch Gen Psychiatry 1992;49:630-636.

Zanarini MC, Frankenburg FR. Attainment and maintenance of reliability of axis I and axis II disorders over the course of a longitudinal study. Compr Psychiatry 2001;42(5):369-374.

Zanarini MC, Skodol AE, Bender D, Dolan R, Sanislow C, Schaefer E, Morey LC, Grilo CM, Shea MT, McGlashan TH, Gunderson JG. The Collaborative Longitudinal Personality Disorders Study: reliability of axis I and II diagnoses. J Personal Disord 2000;14(4):291-299.