In 2000, national health care expenditures were approximately $1.3 trillion, with a projected twofold increase by 2010 (1). More than 7 percent of this amount was for mental health care, totaling an estimated $91 billion (2). To curb this alarming increase in costs and to ensure high-quality care, a greater emphasis on treatment accountability emerged from legislative and accreditation bodies, public agencies, and consumers (3,4,5). In this "age of accountability," it is becoming standard practice for mental health providers to implement outcome management programs to understand the relationship between services, cost, and patient change (6,7). Such programs require instruments that are standardized, psychometrically sound, easy to use, practical, and available at a low cost (6,7,8,9,10).
Those who are interested in using outcome management are met with a cornucopia of possibilities. As Hermann and colleagues (11,12) note, more than 50 stakeholders have proposed more than 300 measures for quality assessment, leading some to recommend common measures or methods (13,14,15,16). A vital distinction to maintain as one enters this literature is Donabedian's tripartite framework (17), which categorizes quality assessment into measures of structure, process, and outcome.
Measures that assess the process of health care delivery are plentiful and have received more attention than outcome measures (11). This level of attention has been attributed to the lower cost of process measures and their ability to provide quick feedback to administrators and clinicians (18,19). However, measures of outcome may be a more direct indicator of quality than either structure or process (20). Moreover, recent advances in computerized outcome management systems now provide the same real-time feedback that was once the domain of process measures (21,22,23). Ellwood (24) opined that outcome measures empower psychiatrists' management of patient care. Others have suggested, "Psychiatrists and mental health care administrators who use outcome assessment to study and apply principles of continuous quality management daily will probably experience better efficiency, greater effectiveness, lower costs, and more satisfied patients" (18).
The fact that outcome assessment receives less frequent attention may also be due to confusion surrounding definitions (25), coupled with the enormous number of measures available from which to select (26). One promising method for dealing with the bewildering number of outcome measures is to restrict focus to measures that are designed for the target patient population. Unfortunately, clinical reality often demands that the target population be defined beyond a single diagnosis or grouping, such as depression or mood disorder. For example, Erbes and colleagues (27) evaluated and recommended outcome instruments for a Department of Veterans Affairs hospital, necessarily considering a broad range of diagnoses. In this article we focus on psychiatric inpatients with diagnoses of severe and persistent mental illness and propose a similar method.
Persons with diagnoses of severe and persistent mental illness have been considered a difficult population to track from an outcomes perspective (28,29). The terms "severe" and "persistent" have been operationalized as "functional limitations in activities for daily living, social interaction, concentration, and adaptation to change in the environment" and likely to "last for 12 months or more," respectively (30). These patients number between one and five million; have diagnoses of schizophrenia, schizoaffective disorder, bipolar disorder, major depression, autism, or obsessive-compulsive disorder; and cost health care systems billions annually (31).
Despite the extensive impact on resources and the increasing focus on accountability, active debate persists about which outcome measures to use with this patient population. Several studies have evaluated the utility and effectiveness of individual measures (32,33,34,35,36), whereas others have focused on comparisons between a limited number of instruments (37,38,39,40,41). Still, little consensus exists on which measures to select. Particular challenges include norms that provide interpretive meaning, sensitivity to change among patients who are expected to demonstrate little improvement, and robust psychometrics needed to capture subtle patient change.
Even if an outcome measure succeeds in addressing these challenges, it must also meet a reasonable standard of clinical utility by minimizing time devoted to data collection and other direct costs. This article focuses on the challenges associated with selecting an outcome measure. A companion paper in the State Mental Health Policy column of this issue of Psychiatric Services (42) addresses pragmatic aspects of implementing an outcome management program in a large state-run psychiatric hospital.
Although a host of resources for selecting quality measures are available (6,11,14,18,43,44), their direct applications pose complex challenges. Accordingly, we propose and illustrate an approach that was used to guide a multidisciplinary work group at a state psychiatric hospital in its selection of outcome measures (see the box below). We do not presume to evaluate the enormous number of outcome measures (26), nor are we recommending a specific set of measures. Rather, we describe a method that proved fruitful in our evaluation of a myriad of recommendations and measures.
A proposed method for selecting outcome measures for inpatients with severe and persistent mental illness
Step 1: Identify the patient population targeted for outcome assessment
Step 2: Identify relevant outcome measures by tabulating those used in randomized clinical trials and effectiveness studies treating the targeted patient population
Step 3: Identify a finite and manageable number of selection criteria focused on restrictions of the setting (for example, resources), accuracy of the decision to be made, and the target patient population
Step 4: Evaluate each outcome measure on the criteria by using standards that match available resources
Step 5: Select one or more outcome measures
The first step is to identify the targeted population. This step is critical in the evaluation of instruments, because many instruments have not been extended to or tested with distinct patient populations; for example, norms, construct validity, and sensitivity to change have not been established. Invariably, the number of instruments is reduced at step 1. Step 2 further limits the universe of instruments by considering only measures that are repeatedly used in efficacy or effectiveness studies with the targeted population. This step has obvious advantages and disadvantages. Advantages include the increased probability of selecting measures that will successfully capture change in a targeted population. Indeed, average change across studies on a measure can be quantified by using a pre- to posttreatment effect size. This metric provides one index for comparing instruments, which is useful because instruments vary in sensitivity to change (44), especially when used with different populations (45). An average effect size also gives clinicians a baseline against which to benchmark expectations for patient gains.
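The pre- to posttreatment effect size described above can be sketched in a few lines of Python. This is a minimal illustration, not the computation used in the survey: the text does not say which standard deviation was used to standardize change or whether studies were weighted, so the sketch assumes the pretreatment standard deviation and an unweighted mean. The function names and the scores are hypothetical.

```python
from statistics import mean, stdev


def pre_post_effect_size(pre_scores, post_scores):
    """Standardized pre- to posttreatment change for one study.

    Assumption: change is standardized by the pretreatment standard
    deviation; the article does not say which standardizer was used.
    Positive values mean scores fell, which is improvement on a
    symptom measure where lower scores are better.
    """
    return (mean(pre_scores) - mean(post_scores)) / stdev(pre_scores)


def average_effect_size(studies):
    """Unweighted mean of per-study effect sizes (also an assumption)."""
    return mean(pre_post_effect_size(pre, post) for pre, post in studies)


# Hypothetical symptom scores for two studies (not data from the article).
studies = [
    ([50, 54, 58, 62], [44, 48, 52, 56]),
    ([40, 44, 48, 52], [36, 40, 44, 48]),
]
print(round(average_effect_size(studies), 2))
```

An average computed this way is the kind of benchmark the paragraph describes: a clinician can compare a unit's observed change on a measure with the average change reported for that measure in the literature.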
The empirical filter in step 2 has notable disadvantages. If widely employed, this filter could lead to stagnation in the field by discouraging the use of promising new instruments. However, older measures that have presumably survived the test of time often serve as the standard for new instruments. This approach may also underemphasize certain domains, such as functioning, and overemphasize others, such as symptoms. However, our proposal is not intended to balance all potential outcome domains, and we refer readers to other sources (18,46). Rather, our goal was pragmatic—identifying measures that have been successfully used with a target population to capture meaningful patient change.
Step 3 acknowledges the numerous and competing criteria proffered as selection guidelines, such as broad domain coverage, robust psychometrics, cost, and clinical utility. Indeed, in the illustration that follows, we identified 24 criteria offered by experts. Once again, a pragmatic approach to clinical practice requires a restricted set of criteria that are highly relevant to the clinical setting. A frequently endorsed measure in step 2 may be simply impractical for a particular clinical setting. Thus step 3 criteria temper decision making by considering the clinical setting. Step 4 applies these criteria to the measures uncovered in step 2 and often results in a reordering of the measures that are considered best. A measure that is identified in step 2 as being highly endorsed may drop precipitously in rank after the criteria in step 3 are considered. Selection (step 5) is thus the end result of integrating (step 4) empirical performance (step 2) with the demands of the clinical setting (step 3).
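The reordering that step 4 can produce may be easier to see in a small sketch. Everything below is hypothetical: the measure names, study counts, criteria, the 1-to-3 rating scale, and the equal weighting are assumptions made for illustration and do not come from the work group's evaluation.

```python
# Hypothetical measures and ratings, for illustration only.
candidates = {
    "Measure A": {"studies": 40, "ratings": {"cost": 1, "time": 1, "norms": 3}},
    "Measure B": {"studies": 25, "ratings": {"cost": 3, "time": 3, "norms": 3}},
    "Measure C": {"studies": 10, "ratings": {"cost": 3, "time": 2, "norms": 2}},
}


def rank_by_frequency(measures):
    """Step 2: order measures by how often the literature used them."""
    return sorted(measures, key=lambda m: measures[m]["studies"], reverse=True)


def rank_by_criteria(measures):
    """Step 4: order measures by total rating on the step 3 criteria.

    Assumption: ratings run from 1 (poor) to 3 (good) and are summed
    with equal weight; ties fall back to literature frequency.
    """
    def key(m):
        return (sum(measures[m]["ratings"].values()), measures[m]["studies"])

    return sorted(measures, key=key, reverse=True)


print(rank_by_frequency(candidates))
print(rank_by_criteria(candidates))
```

In this toy example the most frequently used measure falls to last place once cost and administration time are rated, which is the kind of drop in rank the paragraph describes. A real application would substitute the setting's own criteria and might weight them unequally.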
Representing multiple perspectives may produce better decisions about outcome systems (6,27). Accordingly, a group of academically based researchers, hospital administrators, mental health care providers, and patient advocates formed a team to use the aforementioned method to select optimal measures for a state hospital. More specifically, our principal aims were to survey treatment studies to identify measures in frequent use for our target population, identify relevant literature-based selection criteria, and review the outcome measures by using the proposed criteria.
Step 1: identifying the target population
Patients at our facility typically have diagnoses of psychotic illnesses (54 percent have schizophrenia, delusional disorders, or schizoaffective disorders) and mood disorders (23 percent have major depression, bipolar disorder, or dysthymia) (47). Thus outcome measures not directly normed for this population were excluded, a procedure that biased our sample of measures in favor of the target population.
Step 2: survey of relevant literature
Our interest in calibrating our selection process with measures used in efficacy or effectiveness treatment studies led to a computerized search of more than 30 bibliographic databases, including PsycINFO, MEDLINE, Social SciSearch, and ERIC. The search terms "severe and persistent and mental and ill or SPMI," "severe and mental and ill or SMI," and "schizophrenia and outcome and inpatient" yielded nearly 500 citations published between 1990 and 2002. Studies conducted before 1990 were excluded to limit bias toward older instruments, and instruments were not counted more than once if they were used in multiple articles associated with a single investigation.
Only 94 citations (20 percent) were treatment evaluation studies that used standardized outcome measures (Table 1). Excluded citations were conceptual or policy papers, reviews, studies of process or structure measures, and studies that used nonstandardized outcome measures, such as dropout rates, cost, and recidivism. The sample produced 110 measures, with 11 (10 percent) used at least four times, three (3 percent) used three times, 15 (14 percent) used twice, and 81 (74 percent) used just once. Interestingly, 25 measures (23 percent) were investigator created; it has been argued that such measures provide meaningless comparative information (48).
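The tabulation behind these figures can be sketched as follows. The first part shows one way to count each measure only once per investigation, using hypothetical records; the second part recomputes the percentages from the counts reported in the text. The record layout and the names are assumptions for illustration, not the work group's actual procedure.

```python
from collections import Counter

# Hypothetical (investigation, measure) records; not the article's data.
records = [
    ("inv1", "BPRS"), ("inv1", "GAF"),
    ("inv1", "BPRS"),  # a second article from the same investigation
    ("inv2", "BPRS"), ("inv3", "SANS"),
]

# Converting to a set drops repeats, so each measure counts once per investigation.
usage = Counter(measure for _, measure in set(records))


def share(count, total):
    """Percentage of total, rounded to the nearest whole percent."""
    return round(100 * count / total)


# Number of measures in each frequency band, as reported in the text.
bands = {"four or more": 11, "three": 3, "two": 15, "one": 81}
total = sum(bands.values())
percents = {band: share(n, total) for band, n in bands.items()}

print(usage["BPRS"])
print(total, percents)
```

The four bands sum to the 110 measures reported, and the rounded shares match the percentages in the text. Because each share is rounded separately, they add to 101 rather than 100.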
Three observations were drawn from the results of the survey (Table 1). First, the clinician-rated measures that were used most often included the Brief Psychiatric Rating Scale (BPRS) (49), the Global Assessment of Functioning (GAF) (50), the Positive and Negative Syndrome Scale (PANSS) (51), and the Scale for the Assessment of Negative Symptoms (SANS) (52). The BPRS was used twice as frequently (44 percent) as the PANSS and the SANS.
Second, very few self-report instruments had repeated use. Only two self-report outcome measures surfaced in four or more studies: the Symptom Checklist-90-Revised (SCL-90-R) (53) and the Quality of Life Interview (QOLI) (54). Third, multiple-measure assessment was preferred over single-measure protocols by a ratio of 2 to 1. This finding mirrors recommendations that a single source may be less reliable because each source contributes a valid yet potentially divergent perspective (55). Although the average number of measures used per study was 2.8 (range, one to 13), what was striking was the number of studies (N=30) that used a single outcome instrument.
Step 3: selection of criteria
The absence of a standard for selecting outcome instruments led to an integration of criteria suggested by six sources (6,9,10,55,56,57). These experts offered 24 criteria for selecting optimal outcome measures, from which we chose seven on the basis of frequency of endorsement and fit with our setting. In no particular hierarchical order, the criteria were applicability to the target population, availability of training protocol and materials, appropriate norms to ensure interpretability of scores, psychometric integrity (that is, adequate reliability and validity), cost, administration time, and sensitivity to change. Each is discussed more fully below.
Step 4: comparison of frequently used instruments
A summary of our evaluation of six of the most frequently used clinician and self-report instruments is presented in Tables 2 and 3 (58,59,60). (This summary is restricted to six instruments because of space limitations.) A brief examination of each measure and greater explication of the seven criteria follow.
BPRS. The BPRS satisfied our target population criterion, given that it was created to provide rapid assessments of psychopathology for inpatient populations. Its extensive use in the literature has produced ready-made norms for a variety of populations. There are two revisions of the original 16-item version (49): an 18-item version (61) and a 24-item expanded version (BPRS-E) (62). Each version has produced four similar symptom factors (manic hostility, withdrawal-retardation [negative symptoms], thinking disturbance [positive symptoms], and depression-anxiety) that match typical patient characteristics of state psychiatric hospitals (47).
Clinician-rated scales can provide greater consistency across patients and diagnoses than self-report measures, thereby producing more reliable systemwide evaluations (63). However, this consistency is directly related to the quality of the training material available for ensuring adequate interrater reliability, which is a clear strength of the BPRS (44,57). Indeed, good to moderate interrater reliability is evident (37,64,65), along with moderate test-retest reliability (64) and good internal consistency (65). The literature was also largely supportive of the instrument's construct and concurrent validity (64,66,67,68,69,70). We ranked the clinical utility of the BPRS as high, because it was normed on clinical populations, available at no cost, and very sensitive to change (average d=1.21). Its greatest shortcoming was the resource drain associated with a clinician-rated instrument, an issue addressed in the companion paper in the State Mental Health Policy column in this issue (42).
GAF. As a standard part of the diagnostic protocol (71), the GAF is the most widely used measure of psychiatric patient function (33), with the extant literature providing a wealth of normative data. Introduced as a revised version of the Global Assessment Scale (72), the GAF allows clinicians to rate global patient functioning on a single scale ranging from 1 (persistent danger of severely hurting self or others) to 100 (absence of symptoms to minimal symptoms). Research has reported interrater reliability coefficients that range from modest to excellent (73,74,75,76) as well as moderate to high concurrent validity estimates (76,77). From a clinical utility perspective, the GAF was viewed as comparable to the BPRS, being normed on inpatient and outpatient populations, available at no cost, very quick to administer, and very sensitive to change (d=1.10). As with the BPRS, consistency of ratings requires the implementation of rater training and periodic consistency checks.
PANSS. The PANSS was developed "as an instrument for measuring the prevalence of positive and negative syndromes in schizophrenia" (51). It consists of the BPRS-18 (61) plus 12 items from the Psychopathology Rating Scale (78). Clinicians rate patients' symptoms with use of 30 items that aggregate on four scales: positive symptoms, negative symptoms, composite, and general psychopathology. Research has reported evidence of acceptable construct and concurrent validity (79,80), good internal consistency reliability, moderate test-retest reliability (51), and interrater reliability coefficients that range from high to moderate (37,80). Clinical utility was rated lower, because the instrument is lengthier to administer, more costly to use ($32 for a set of 25 questionnaires), and normed on a narrower population. Nevertheless, the instrument appears to be very sensitive to change in our analysis (d=1.23).
SANS. The SANS was developed by Andreasen (52,81) as a measure of negative symptoms among patients with schizophrenia. Clinicians use 30 items that are aggregated on five subscales: affective flattening or blunting, alogia, apathy, asociality, and inattention. This instrument has adequate construct and concurrent validity coefficients (35,80), good internal consistency reliability (52), moderate 24-month test-retest reliability (82), and interrater reliability coefficients that range from moderate to high (52,69,82). The instrument was ranked the lowest because of its moderate clinical utility: it was normed on a single population, is somewhat lengthy, and is only moderately sensitive to change (d=.68). However, it is available at no cost.
SCL-90-R. Originally designed for use with psychiatric outpatients, the SCL-90-R (53) has enjoyed widespread use in clinical and research settings, producing a wealth of normative data. Patients respond to 90 items that are aggregated on nine symptom dimensions (somatization, obsessive-compulsivity, interpersonal sensitivity, depression, anxiety, hostility, phobic anxiety, paranoid ideation, and psychoticism) and three global scales (the global severity index, the positive symptom distress index, and the positive symptom total). Research has shown little evidence of construct validity for this instrument (53,83), although the instrument has shown good internal consistency, test-retest reliability (83,84,85), and moderate concurrent validity with the BPRS (86).
The SCL-90-R was viewed as having moderate clinical utility; being normed on community, outpatient, and inpatient populations; being quick to administer; and being moderately sensitive to change (d=.69). Considerations that lowered its rank were cost ($41 per 50 hand-scored answer sheets) and its reliance on self-report from the target population. Self-report measures require less staff time and permit consumer-focused outcome assessment, because patients are empowered to report on their symptoms and expectations about treatment (87). Disadvantages include an insufficient clinical picture as a result of the dependence on patients' ability to accurately describe their condition, which at times is doubtful because of denial, minimization of symptoms, or responder bias (88).
QOLI. The QOLI (54) is a highly structured interview developed to assess current quality of life and global well-being among populations with chronic mental illness. This instrument is made up of objective and subjective questions that allow the patient to rate his or her current situation and satisfaction with life. The QOLI has good construct validity (89,90), moderate concurrent validity (91), moderate test-retest reliability (51), and high internal consistency (92). It was ranked low on clinical utility because of its length, cost (a pay-per-use structure), and low sensitivity to change (d=.02). However, the latter was based on a single study, and norms were available for community and inpatient populations.
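The sensitivity-to-change figures quoted for the six instruments can be gathered and ordered in a short sketch. The effect sizes are the values given in the text. The cutoffs are an assumption: the text does not state how "very," "moderately," and "low" were defined, so the sketch borrows the conventional 0.8 and 0.5 thresholds for large and medium effects, which happen to reproduce the labels used in the text.

```python
# Average pre- to posttreatment effect sizes (d) quoted in the text.
effect_sizes = {
    "BPRS": 1.21, "GAF": 1.10, "PANSS": 1.23,
    "SANS": 0.68, "SCL-90-R": 0.69, "QOLI": 0.02,
}


def sensitivity_label(d):
    """Label an effect size using conventional cutoffs.

    Assumption: the article does not state its cutoffs; 0.8 and 0.5 are
    the conventional thresholds for large and medium effects.
    """
    if d >= 0.8:
        return "very sensitive"
    if d >= 0.5:
        return "moderately sensitive"
    return "low sensitivity"


# Order instruments from most to least sensitive to change.
ranked = sorted(effect_sizes, key=effect_sizes.get, reverse=True)
labels = {name: sensitivity_label(effect_sizes[name]) for name in ranked}

for name in ranked:
    print(f"{name}: d={effect_sizes[name]:.2f} ({labels[name]})")
```

Sensitivity to change is only one of the seven criteria, so this ordering is not a final ranking. The QOLI figure in particular rests on a single study.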
Step 5: selecting measures
Clinician-rated instruments clearly outnumbered self-report instruments in our analysis. Lachar and colleagues (67) explained that clinician-rated measures have recently achieved an advantage over self-report in hospitals because of the disabling psychopathology patients now must exhibit to justify hospitalization. The impairment of newly admitted patients negatively affects their ability to complete even a brief self-report measure. However, the accuracy of information from clinician-completed measures must be balanced against the resource drain. The BPRS and the GAF require the least time to administer, and the BPRS, the GAF, and the PANSS appear to be equally sensitive to change, yet all instruments require mastery of training materials and demonstrated reliability to produce meaningful information about outcomes.
The team openly acknowledged the limitations of the sample, including a time frame that may have disadvantaged newer instruments, such as the Multnomah Community Ability Scale and the Outcome Questionnaire. Furthermore, although the frequency count allowed us to easily calibrate against findings in the extant literature, it may have inadvertently excluded potentially useful instruments because of their infrequent use in our sample; for example, the Medical Outcomes Study SF-36 and the Addiction Severity Index appeared in two studies each. However, infrequent use portends unknown properties such as sensitivity to change and normative characteristics.
Three issues affected our final recommendations. First, global single-scale assessments, such as the GAF, are frequently used because they are simple to administer and provide immediate feedback (73). However, these scales suffer limitations in accuracy as a result of combining patients' symptoms and functioning in a single rating (93), leading some to question their accuracy with our target population (33,94).
Second, as with most publicly funded facilities, we have limited resources. As a mental health agency focused on improving service delivery from both an organizational and a consumer-oriented perspective (87), we were aware of the considerable discussion about the importance and effectiveness of self-report and clinician-rated instruments (95,96,97,98). At face value, our survey suggests the BPRS and the SCL-90-R as the best clinician-rated and self-report outcome instruments. However, concerns about financial resources, administration time, staff support, staff competence, and training led to active debate. Our adoption of the BPRS led to infrastructure realignment to address these concerns, as detailed in our companion paper (42).
Finally, our survey suggested the SCL-90-R as a self-report tool, but its use raised two concerns: meaningfulness of the outcome data given patient impairment, and cost. When patients are physically unable or unwilling, because of malingering or resistance, to complete a self-report assessment, data may be too erratic (item endorsement at both ends of the range) to facilitate meaningful interpretation. Nevertheless, we adopted an alternative self-report measure because of cost issues, and Earnshaw and colleagues (42) detail how we dealt with data accuracy concerns.