Mental health intervention research requires clear and accurate specification of the independent variables that are actually operating in studies. For example, inferences about effects in experimental studies depend on fidelity: therapists' adherence to the intended treatment, their competence to apply it, and sufficient differentiation across conditions (1,2). Although empirical verification of fidelity has been reported infrequently in psychological treatment research (3), fidelity has recently received greater attention in research on community-based psychosocial interventions for persons with serious mental illnesses. These program-based interventions are inherently more complex and less amenable to full specification in manuals than interventions delivered by a single clinician; they often include elements related to the organization, caseload, types of treatments and other services provided, and interactions with other programs (4). Fidelity measurement strategies and measures have been developed and used for research and practice with a wide range of such programs (5–11).
The principal research uses of program fidelity measures are to monitor and ensure adherence to particular interventions and to identify their critical ingredients. They also serve as operational syntheses of prior research and as vehicles to disseminate information to the field about essential features of evidence-based practices (4). The demands of multiple uses pose significant challenges for the design of fidelity measures. One is selection of features to include. Critical ingredients are identified through theory and empirical research addressing the active mechanisms that are expected to yield intended outcomes. Because such ingredients might include multiple organizational levels, contributing mechanisms, and assessment points, program-based models present developers with a multitude of options. A second challenge is the need to balance effectiveness, or the degree to which fidelity measures and methods capture the essential features of an intervention reliably and validly, with efficiency, or the degree to which the tools can be applied cost-effectively such that the real gains from use in ordinary settings warrant the effort required to use them (12). Because of the complexity of treatments, contexts, and possible uses, developers of fidelity measures have made a wide range of choices in balancing effectiveness and efficiency.
Consideration of a number of conceptual frameworks within which fidelity measures operate suggests reasons for such variation. Within the classic structure-process-outcome quality framework (13), fidelity measures typically include both structure and process elements. Although fidelity measures have sometimes emphasized more accessible structural features—for example, group size and duration of treatment—less tangible processes may be essential to program integrity (7), and overemphasizing structure poses risks to both research and practice (14–16). Such misplaced emphasis can follow from weak theory, because fidelity measures have been described as representing program theory or theory of action of the intervention (14,17). What is included in a fidelity measure will thus depend on what actions are considered essential and at what level the intervention is defined. In some cases, this represents a departure from program theory as such, which specifies mechanisms of change, to include implementation theory, which specifies how a program is carried out (18).
A more recent model of implementation research, with primary domains of intervention strategies, implementation strategies, and three types of outcome domains—implementation, service, and client—would place fidelity as one of a number of implementation outcomes (19). However, some fidelity measures, including two described here, have addressed implementation features in three or four of these five domains. This implementation research framework itself draws on a model for assessing change at four levels: individual, group or team, organization, and larger system and environment (20). Again, fidelity measures can span multiple levels. Finally, a recent heuristic model for ensuring quality of implementation of evidence-based practices proposes four main strategic categories: policy and administration, training and consultation, team operations, and program evaluation (21). Fidelity assessment is placed in the last category, but it can also support other strategies.
In this article, we describe four recently developed fidelity measures for community-based interventions for people with serious mental illness to illustrate a range of approaches within this context. Overall they reflect advances in effective measurement of critical processes, but they differ in terms of where and how they focus within those frameworks for theory, quality, and implementation. These measures are summarized in Table 1 and listed along a continuum of complexity of program levels.
Cognitive Therapy for Psychosis Adherence Scale
Between 25% and 40% of people with a schizophrenia spectrum disorder experience persistent psychotic symptoms (22,23), which are associated with high levels of distress and functional impairment, as well as increased vulnerability to relapse (24,25). To address this problem, cognitive-behavioral therapy (CBT) for psychosis was adapted based on the principles of CBT initially developed for the treatment of depression and anxiety (26,27). These principles emphasize treatment components, such as “normalizing” psychotic symptoms, teaching effective coping strategies for persistent symptoms, and critically examining and challenging thoughts and beliefs underlying psychotic symptoms (28–30). Over the past two decades, more than 30 randomized controlled trials have evaluated the effects of CBT for psychosis, and results indicate significant reduction of psychotic, negative, mood, and social anxiety symptoms (31). CBT for psychosis is a recommended treatment for schizophrenia, both in the most recent guidelines from the National Institute for Health and Clinical Excellence in Great Britain (32) and in recommendations from the Schizophrenia Patient Outcomes Research Team in the United States (33).
To evaluate therapist adherence to the elements of CBT for psychosis defined by Fowler and colleagues (29), Startup and colleagues (34) developed the Cognitive Therapy for Psychosis Adherence Scale (CTPAS), which includes 12 items, each rated on a 7-point Likert scale, with assessments based on audiotapes of treatment sessions. Ratings on the scale pertain to specific therapist behaviors, such as “assessing psychotic experiences” and “validity testing.” Startup and colleagues (34) demonstrated that reliable ratings could be obtained with the CTPAS. A principal components factor analysis indicated two factors, corresponding to focus on problems and focus on delusions. This scale was used to document therapist fidelity to CBT for psychosis in two clinical trials (35,36).
The CTPAS was subsequently revised (R-CTPAS) by adding nine items and changing the rating scale to provide separate ratings of therapist adherence to the CBT for psychosis model and the frequency of specific therapist activities (37). Adherence is conceptualized as competent delivery of therapist activities described in the manual (29), as defined by practices that are individualized to the client's presenting problems, matched to the client's understanding, and carried out collaboratively. Frequency items are recorded for specific therapist activities, regardless of whether the activities are adherent to the manual or not. High interrater reliability ratings were obtained. A principal components factor analysis of the presence of specific therapist activities demonstrating adherence to the model yielded three factors corresponding to “engagement and assessment,” “relapse prevention,” and “formulation and schema work.” Concurrent validity was shown by demonstrating moderate associations between ratings on the CTPAS and the Cognitive Therapy Scale (38), which was developed to evaluate fidelity to CBT for depression. The CTPAS has been used to ensure adherence of therapists delivering CBT for psychosis in randomized controlled trials (39) and to compare the skills of clinicians working on a research project with those in routine clinical practice (37).
Strengths Model Fidelity Scale
The purpose of the strengths model of case management, first formulated in the early 1980s, is to help people with psychiatric disabilities to attain the goals that they set themselves by identifying, securing, and sustaining the range of resources, both environmental and personal, that are needed to live, play, and work in a normally interdependent way in the community (40). The focus is on individual and community strengths and assets in the service of goal achievement. The strengths model has been the subject of four experimental or quasi-experimental studies (41–44) and five nonexperimental studies (45–49). Results have been consistently positive, with reduction in symptoms and improved social functioning being the most frequent findings. This body of research has been criticized for small samples and the varied measures employed (50). Of particular concern is the lack of systematic monitoring of intervention implementation.
The impetus to develop a Strengths Model Fidelity Scale (SM-FS) was threefold. First, future research on the strengths model needed a reliable method for monitoring implementation of the intervention. Second, the mental health authority in Kansas created an enhanced Medicaid reimbursement rate for providers who delivered high-fidelity strengths model case management, and other states were pursuing similar arrangements. They needed a reliable method for ascertaining fidelity. Third, because the idea of strengths-based practice has gained such currency, there was a need to distinguish between the rhetoric of programs and actual practice.
The SM-FS contains three major domains: structure (for example, caseload size and use of group supervision), supervision (for example, field mentoring and review and feedback on the use of clinical tools), and clinical practice (for example, use of the strengths assessment and personal recovery plan, use of naturally occurring community resources, and hope-inducing behavior) (51). The measure uses the 5-point anchored-scale format used in many fidelity measures (11). Possible scores range from 11 to 55, with 45 defined as good fidelity. It uses multiple sources of data, including case records; interviews with consumers, case managers, and supervisors; and direct observation of practice. SM-FS has face validity with expert item reviews.
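The scoring arithmetic described above can be illustrated with a minimal sketch. It assumes 11 items (implied by the 11-to-55 range), each rated from 1 to 5, and it assumes that a total of 45 or higher counts as good fidelity; the text gives the value 45 but not whether the cutoff is inclusive. The function names are hypothetical and are not part of the published scale.

```python
from typing import Sequence

N_ITEMS = 11               # implied by the 11-55 score range in the text
MIN_RATING, MAX_RATING = 1, 5
GOOD_FIDELITY = 45         # value from the text; treating it as ">=" is an assumption


def smfs_total(ratings: Sequence[int]) -> int:
    """Sum the item ratings after checking item count and rating range."""
    if len(ratings) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} item ratings, got {len(ratings)}")
    for r in ratings:
        if not MIN_RATING <= r <= MAX_RATING:
            raise ValueError(f"rating {r} is outside {MIN_RATING}-{MAX_RATING}")
    return sum(ratings)


def is_good_fidelity(total: int) -> bool:
    """Apply the assumed inclusive threshold of 45."""
    return total >= GOOD_FIDELITY


# Hypothetical ratings for one team, for illustration only.
example = [4, 4, 5, 4, 4, 4, 5, 4, 4, 4, 4]
example_total = smfs_total(example)
```

Under these assumptions, a team rated 4 on every item would total 44 and fall just short of the good-fidelity threshold.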
One study showed the predictive validity of SM-FS for team performance in terms of consumer outcomes (52). The core outcomes included psychiatric hospitalization, competitive employment, involvement in higher education, and independent living. Fidelity reviews were conducted at baseline and then every six months during the first 18 months of implementation. Each review was conducted by at least two consultant-trainers. Interrater reliability (intraclass correlation) between the two raters of the fidelity scale was .97, representing a high level of agreement. Internal consistency (Cronbach's alpha) for the 11 items was .98. Consumer outcomes were reported by the participating team case managers when fidelity reviews occurred. The data set comprised 14 case management teams representing ten agencies that served an average of 953 consumers diagnosed as having a serious mental illness over an 18-month period. The study found that consumer outcomes improved over time and that the improvement was explained by the increase in the fidelity score, which indicated predictive validity. Concurrent correlations between the fidelity score and outcomes were in the expected directions, which also supported the associations between fidelity and outcomes.
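For readers less familiar with the internal consistency statistic reported above, the following is a minimal sketch of the standard Cronbach's alpha formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores). It is not the analysis code used in the cited study; the function name and the small data set are hypothetical and serve only to show the computation.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(scores: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a matrix with one row per review and one column per item."""
    if len(scores) < 2:
        raise ValueError("need at least two rows (reviews)")
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("need at least two items")
    if any(len(row) != n_items for row in scores):
        raise ValueError("all rows must have the same number of items")

    # Sample variance of each item (column) across reviews.
    item_vars = [variance([row[j] for row in scores]) for j in range(n_items)]
    # Sample variance of the total score (row sum) across reviews.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")

    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical two-item ratings from four reviews, for illustration only.
toy_scores = [[1, 2], [2, 1], [3, 4], [4, 3]]
toy_alpha = cronbach_alpha(toy_scores)
```

Alpha approaches 1 when items rise and fall together across reviews, which is why a very high value is informative only if the items are meant to reflect a single underlying construct.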
Illness Management and Recovery Program Fidelity Scale
The illness management and recovery (IMR) program was developed to teach illness self-management skills to people with severe mental illness (53). A comprehensive review of illness self-management strategies was first conducted that identified five empirically supported interventions: psychoeducation, behavioral tailoring for medication adherence, development of a relapse prevention plan, coping skills training, and social skills training to improve social support (54). These strategies were then incorporated into a comprehensive, integrated program that included ten different “modules” or topic areas aimed at teaching illness self-management strategies to help clients achieve personally meaningful recovery goals. The IMR program can be implemented either individually or in groups, and completion generally requires four to five months of twice-weekly meetings or nine to ten months of weekly meetings.
The IMR Fidelity Scale (IMR-FS) was developed to evaluate the adherence of clinicians to the principles of the IMR program. In contrast to the R-CTPAS, which focuses on evaluating the fidelity of individual clinicians to the CBT for psychosis treatment model, the IMR-FS focuses on evaluating the fidelity of an overall program (that is, all the clinicians together) to the principles and defining elements of IMR. The IMR-FS includes 13 items, each rated on 5-point behaviorally anchored scales, that tap a combination of specific structural aspects regarding how the program should be delivered (number of people in sessions, program length, comprehensiveness of curriculum, and provision of handouts), the provision of specific empirically supported components in IMR sessions (psychoeducation, behavioral tailoring, relapse prevention plan, coping skills training, and social skills training), and adherence to specific principles that guide implementation of the overall IMR program (goal setting and follow-up; use of educational, motivational, and cognitive-behavioral teaching strategies; and involvement of significant others). Ratings are usually conducted by two assessors on the basis of a combination of inspection of charts; meetings with clinicians, clients, and supervisors; and direct (limited) observation of IMR sessions.
Good interrater reliability has been shown for the IMR-FS, which was also found to be sensitive to change over two years after training and consultation in the IMR program across 12 community mental health centers participating in the National Implementing Evidence-Based Practices project (11). The IMR-FS has also been used to document fidelity to the IMR model in three randomized controlled trials comparing IMR with usual services (55–57). Of interest, in one of these studies IMR was implemented at 12 sites, of which nine showed high fidelity to the IMR program (56). When analyses were restricted to the nine high-fidelity sites, somewhat stronger effects were found than in the intent-to-treat analyses that included all 12 sites.
Validation work on the IMR-FS has yet to be conducted, although several approaches are possible. Research could be conducted to evaluate whether total scores on the IMR-FS at different agencies providing the IMR program are related to improvements in domains targeted by the program, such as illness self-management, hospitalization, or functioning. In addition, research could evaluate whether ratings on some of the items of the IMR-FS are significantly related to independent fidelity measures tapping the same constructs. For example, one would expect that higher scores on the IMR-FS item on motivational teaching strategies would be related to greater clinical competence on the motivational interviewing subscale of the Yale Adherence and Competence Scale (58).
The Tool for Measurement of ACT
Assertive community treatment (ACT) was developed as a comprehensive program to provide the full array of treatments, services, and supports needed by persons with severe mental disorders and significant psychiatric disabilities to establish and maintain fulfilling lives in the community (59,60). The program is the single point of responsibility for enrolled consumers; has a small caseload of approximately 100 consumers shared across a multidisciplinary team of ten to 12 members; and provides highly individualized, integrated services in vivo, whenever, wherever, and for as long as needed in consumers' daily lives. The model incorporates carefully specified procedures to track and respond to consumer needs, deploying staff as needed. As definitions for optimal treatment and expectations for treatment goals have changed over time, the practice of ACT has also evolved, incorporating other evidence-based practices in treatment (10) within an overall recovery orientation (61).
After preliminary fidelity measures were developed (6,62), the Dartmouth Assertive Community Treatment Scale (DACTS) (7), although developed for a particular study (63), became the standard fidelity measure for ACT and has been used widely in studies (11,64,65). Because it was available before publication of the first ACT manual (60) and had a clear and accessible format and protocol, it was frequently used as a guide to implementing the program despite the authors' assertions that some key processes were not assessed. Although not problematic in its original application (66), the scale's emphasis on structural features and omission of some critical processes risked weaker implementation and research inferences elsewhere, especially as the ACT model evolved.
The Tool for Measurement of ACT (TMACT) (16) was designed to address these issues. It assesses use of evidence-based practices—for example, supported employment and integrated treatment for co-occurring disorders—within the ACT model, includes items addressing consumer recovery orientation, and strengthens measurement of team functioning. It has 47 items in six subscales that define operations and structure, core team, specialist team, core practices, evidence-based practices, and person-centered planning and practices. A protocol specifies the fidelity assessment process and provides interview questions, rules for scoring all items, and formats for collecting data and providing feedback. Items on each evidence-based practice are derived from respective full fidelity scales (11). ACT staff function as both specialists and generalists informed by others' specialist services, so staff roles are assessed relative to other staff as well as to consumers. Recovery orientation is built into items assessing person-centered planning and practices and is more generally reflected throughout the measure in assessing the focus of treatment and interactions with consumers.
DACTS and TMACT scores were compared for ten teams over 18 months (16). Significant differences between the two measures varied over time and were a function of lower fidelity in key areas not measured by the DACTS, confirming the TMACT as a more comprehensive and higher standard than the DACTS and as more sensitive to change.
Advances in research on community-based interventions will depend in part on advances in our ability to measure whether they are being delivered as intended. The fidelity measures described above for four intervention models for people with severe mental disorders represent improvements in this respect. Although all of these measures include structural elements, they also include assessment of specific processes demonstrated or hypothesized as critical to successful delivery of the intervention. They were designed for use in a number of research purposes, such as validating inclusion of sites or practitioners in studies, indicating the strength of the intervention, or identifying critical ingredients. And their intended uses go beyond research: one or more is used to accredit programs for enhanced reimbursement rates, to certify individual clinicians, or as a tool for training and quality improvement.
The measures differ in important respects, following from differences in program and implementation theory underlying their respective interventions. At the programmatically simplest level, the CBT for psychosis intervention is specified strictly in terms of dyadic interaction. At the other extreme of programmatic complexity, the ACT model includes specifications for program-level structures and processes theoretically required to ensure optimal delivery of services at the dyadic level. Current fidelity measurement as exemplified by these recent measures expands beyond the respective niches suggested by recent frameworks for implementation science and quality improvement (19–21).
This broad practical and conceptual scope in what we currently define as fidelity measurement suggests an important future need. There are calls for refinement in program and implementation theory, as well as development of measures of implementation fidelity (17,19). The field would gain from greater clarity in concept and definition. The term “fidelity” has merit as representing a general concept, but we would benefit from articulation of a typology of fidelity measurement linked to emerging frameworks for implementation and quality.
The role of fidelity measures in research could also be better clarified. For example, McGrew (67) faulted the TMACT authors for including items that had not been individually demonstrated to predict outcomes in ACT. The authors' response was that fidelity measures had rarely been prevalidated at the item level and that the TMACT was just the sort of refinement of program theory called for on the basis of related evidence and necessary to move the science of this intervention forward (17,68). Further consideration of this issue is warranted.
Validation is a related need. The four measures described here have made variable use of one or more of the approaches described by Mowbray and colleagues (14): reliability, structural analysis, known groups, convergent validity, and outcome prediction. These approaches must be used judiciously. Internal consistency, for example, may apply poorly to measurement of domains that do not represent a single underlying construct, outcome prediction may be uninformative or misleading when program variation is insufficient, and overall test-retest and interrater reliability may present practical challenges for programwide assessments. However, several of these, especially convergent validity, could more routinely apply to measure components. And validation of choice of method would be important and feasible in complex programs, for example, by evaluating program-level items against aggregated results from individual-level items.
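One way to picture the last suggestion is sketched below. The sketch assumes that, for a given construct, each program has a single program-level item rating and a set of clinician-level ratings, that the clinician-level ratings are aggregated by a simple mean, and that agreement is summarized with a Pearson correlation across programs. All names and data are hypothetical; other aggregation rules or agreement statistics could be equally defensible.

```python
from statistics import mean
from typing import Dict, List, Sequence


def aggregate_by_program(clinician_ratings: Dict[str, List[float]]) -> Dict[str, float]:
    """Average the individual-level ratings within each program (assumed aggregation rule)."""
    return {program: mean(ratings) for program, ratings in clinician_ratings.items()}


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a sequence has zero variance")
    return sxy / (sxx * syy) ** 0.5


# Hypothetical ratings for three programs, for illustration only.
program_item = {"A": 2, "B": 3, "C": 5}
clinician_items = {"A": [2, 2], "B": [3, 4, 2], "C": [5, 5, 4]}

aggregated = aggregate_by_program(clinician_items)
programs = sorted(program_item)
agreement = pearson_r([program_item[p] for p in programs],
                      [aggregated[p] for p in programs])
```

A high correlation in such a comparison would suggest that the less costly program-level rating tracks what individual-level assessment would show, whereas a low one would argue for retaining the more intensive method.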
In the absence of the suggested theoretical and empirical work, it would be difficult to judge the respective choices made within the four fidelity measures in balancing effectiveness and efficiency. The descriptions above and the information presented in Table 1 indicate an adaptation of both coverage and methods to the domains assessed. In the normal scientific context of testing and refinement, increasing effectiveness should also yield improved efficiency over time. All four entail considerable effort, albeit of varying types, suggesting the importance of additional work to quantify the value added and establish cost-effectiveness. How much effort in fidelity assessment is warranted, and to what degree of precision, is unclear. However, there is substantial evidence that higher fidelity is generally correlated with better outcomes; establishment of high fidelity early on should yield substantial benefit. Even a full TMACT assessment with consultative feedback requires less than .5% of the annual effort of a team, a modest marginal cost for expected treatment improvement from assessment of either a new team or one showing intermediate performance, and there is reasonable concern and some evidence that diluted fidelity measurement in an environment of complex incentives may weaken both practice and research findings (16,68). Further work is needed on development of low-risk strategies for titration of ongoing fidelity assessment efforts (64).
Fidelity measurement in mental health services research is at a promising if uncertain point. New measures are being developed to measure and guide fidelity of emerging and enhanced practices in serving persons with serious mental illnesses in the community. Four recent measures illustrate this progress. At the same time, the context of use is rapidly changing as an emerging implementation science begins to articulate frameworks for addressing the compelling translational challenge of developing the necessary knowledge to establish and maintain evidence-based practices in usual care settings. Further refinement and clarification of the science and practice of fidelity measurement, along with an expanded view of its useful place in these frameworks, should be a part of that development.