Published Online: https://doi.org/10.1176/appi.ps.202000364

Abstract

Objective: The Patient Health Questionnaire–9 (PHQ-9) is commonly used to assess depression symptoms, but its associated treatment success criteria (i.e., metrics) are inconsistently defined. The authors aimed to analyze the impact of metric choice on outcomes and discuss implications for clinical practice and research.

Methods: Analyses included three overlapping and nonexclusive time cohorts of adult patients with depression treated in 33 organizations between 2008 and 2018. Average depression improvement rates were calculated according to eight metrics. Organization-level rank orders defined by these metrics were calculated and correlated.

Results: The 12-month cohort had higher rates of metrics indicating treatment success than did the 3- and 6-month cohorts; the degree of improvement varied by metric, although all organization-level rank orders were highly correlated.

Conclusions: Different PHQ-9 treatment metrics are associated with disparate improvement rates. Organization-level rankings defined by different metrics are highly correlated. Consistency of metric use may be more important than specific metric choice.

HIGHLIGHTS

  • Different Patient Health Questionnaire–9 (PHQ-9) treatment success criteria (i.e., metrics) are used in clinical practice to denote depression response or remission, and the choice of metric substantially affects observed treatment improvement rates.

  • Organization-level rankings for depression response and remission vary depending on metric choice, but their rankings are highly correlated, meaning that different metrics tend to rank organizations concordantly.

  • When comparing PHQ-9 symptom improvement rates across organizations, metric consistency is imperative, whereas the specific metric chosen is of secondary importance.

Depression is a common behavioral health problem that is a leading cause of disability, lower quality of life, diminished productivity, and reduced employment rates globally (1). In the United States, a recent study estimated that the total economic burden of major depressive disorder has reached at least $210.5 billion per year, a 21.5% increase from 2005 (2).

Although a variety of evidence-based practices have been validated for the treatment of depression in diverse settings, it remains challenging to accurately and reliably measure treatment outcomes at the organization level. This challenge occurs for several reasons: depression symptoms often improve or worsen regardless of intervention (3); the literature is divided on how to define treatment response and remission, with various treatment success definitions used in research and clinical practice (4–10); and it remains unclear exactly how changes in rating scale scores affect the outcomes most important to patients (e.g., quality of life and social connectivity). Nevertheless, outcome ascertainment for depression treatment is becoming increasingly important as measurement-based care and value-based payment become priorities for health organizations nationwide. There is a consequent growing need for pragmatic and scalable ways to assess treatment progress.

Of the numerous validated rating scales for depression, the Patient Health Questionnaire–9 (PHQ-9) (7) has consistently been one of the most used and validated in primary care, specialty behavioral health, and research settings (11). As a result, it is the instrument of choice for this investigation. The PHQ-9 has nine items that are each scored from 0 to 3, for a maximum score of 27; higher scores indicate a greater severity of depression symptoms.
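The scoring rule described above can be sketched in a few lines. This is a minimal illustration only; the function name phq9_total and the example item scores are hypothetical and do not come from the study data.

```python
def phq9_total(item_scores):
    """Sum nine PHQ-9 item scores (each 0-3) into a total score of 0-27."""
    if len(item_scores) != 9:
        raise ValueError("the PHQ-9 has exactly nine items")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each item is scored 0, 1, 2, or 3")
    return sum(item_scores)


# Hypothetical responses for one patient (not study data).
example_total = phq9_total([2, 2, 1, 3, 1, 2, 1, 1, 0])
print(example_total)
```

Higher totals indicate greater symptom severity, so every metric discussed in this report is a function of one or two such totals (baseline and follow-up).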

Often, the PHQ-9 is used as a way to ascertain the severity of baseline depression symptoms and to track patients’ progress over time with treatment. In the literature, treatment success is usually quantified by using the terms “response” (or “partial response”) and “remission.” Although its definition is not universal, remission (on the PHQ-9 scale) is often defined as achieving a score <5 (for a patient with a previous score in a category suggestive of depression symptoms) (12). The literature is more divided on treatment response, with different definitions described across studies (4–10). Some definitions include a single criterion, whereas others include multiple components. Importantly, some definitions specify a minimum baseline PHQ-9 score, whereas others do not (8).

Organizations incorporating measurement-based care are tasked with choosing metrics, often one each for response and remission (although sometimes only one total metric is used). These decisions may be influenced by research studies, standardized organization-based recommendations (e.g., those from the National Committee for Quality Assurance’s Healthcare Effectiveness Data and Information Set), or the perceived frequency of metric use (with ≥50% change and score <5 most commonly used for response and remission, respectively). Organizations could take advantage of the lack of standardization nationwide and choose metrics that are easier to achieve, thereby making their clinical programs appear more successful, although this has never been formally demonstrated.

Table 1 includes eight PHQ-9 depression treatment success criteria (i.e., metrics) that have been described in the literature or identified in this study. Metric 5 (≥50% decrease from baseline and score <10) was originally proposed by Kroenke and colleagues (9) as the PHQ-9 metric for “clinically significant improvement” (i.e., response) because these criteria would be consistent with the established Hamilton Depression Rating Scale metric. The same study established metric 8 (score <5) as the remission metric (although this specific term was not used) by defining scores <5 as “nondepressed” (9). Additionally, metric 2 (absolute decrease of ≥5) was based on previous literature demonstrating that the minimal clinically important difference for the PHQ-9 is between 2.59 and 4.78 (5). The other metrics, however, have little to no supporting empirical evidence.
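To make the eight definitions concrete, the sketch below expresses each metric in Table 1 as a yes/no rule on a baseline and a follow-up PHQ-9 score. It is an illustration under stated assumptions, not the authors' analysis code: the names METRICS and metrics_met are hypothetical, no minimum baseline score is enforced, and a baseline of 0 is treated as never meeting a percentage-decrease criterion.

```python
def halved(baseline, follow_up):
    """True if the score fell by at least 50% from baseline.

    Integer arithmetic avoids floating-point rounding at exactly 50%.
    Assumption: a baseline of 0 cannot show a percentage decrease.
    """
    return baseline > 0 and 2 * (baseline - follow_up) >= baseline


# Metric number -> rule taking (baseline, follow_up) PHQ-9 totals.
METRICS = {
    1: lambda b, f: halved(b, f),
    2: lambda b, f: b - f >= 5,
    3: lambda b, f: f < 10,
    4: lambda b, f: halved(b, f) or f < 10,
    5: lambda b, f: halved(b, f) and f < 10,
    6: lambda b, f: b - f >= 5 and f < 10,
    7: lambda b, f: halved(b, f) and b - f >= 5,
    8: lambda b, f: f < 5,
}


def metrics_met(baseline, follow_up):
    """Return {metric number: True/False} for one patient."""
    return {number: rule(baseline, follow_up) for number, rule in METRICS.items()}


# Hypothetical patient: baseline 16, follow-up 9.
print(metrics_met(16, 9))
```

For this hypothetical patient, a 7-point drop to a score of 9 satisfies metrics 2, 3, 4, and 6 but none of the metrics that require a ≥50% decrease, which illustrates how the same clinical course can count as a success or a failure depending on the definition.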

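TABLE 1. Depression response and remission rates for the 3-, 6-, and 12-month time cohorts across eight treatment success metrics and 33 organizations^a

| Metric and source | 3 months: N | 3 months: M %^b | 3 months: SD^c | 6 months: N | 6 months: M %^b | 6 months: SD^c | 12 months: N | 12 months: M %^b | 12 months: SD^c |
|---|---|---|---|---|---|---|---|---|---|
| Metric 1: ≥50% decrease from baseline (4) | 6,838 | 34 | 9 | 4,215 | 37 | 9 | 1,279 | 39 | 11 |
| Metric 2: absolute decrease of ≥5 (5) | 9,495 | 48 | 8 | 5,771 | 51 | 8 | 1,780 | 54 | 9 |
| Metric 3: score <10 (4, 7) | 9,758 | 49 | 10 | 5,706 | 50 | 11 | 1,714 | 52 | 11 |
| Metric 4: ≥50% decrease from baseline or score <10 (6) | 10,225 | 51 | 10 | 5,981 | 53 | 10 | 1,789 | 54 | 11 |
| Metric 5: ≥50% decrease from baseline and score <10 (7–9) | 6,371 | 32 | 9 | 3,940 | 35 | 10 | 1,204 | 36 | 11 |
| Metric 6: absolute decrease of ≥5 and score <10 (8, 10) | 6,533 | 33 | 9 | 4,023 | 36 | 10 | 1,238 | 37 | 12 |
| Metric 7: ≥50% decrease from baseline and absolute decrease ≥5 | 6,281 | 32 | 9 | 3,919 | 35 | 9 | 1,214 | 37 | 11 |
| Metric 8: score <5 (7–9) | 4,452 | 22 | 8 | 2,722 | 24 | 8 | 774 | 23 | 10 |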

^a Full sample, N=33 organizations, N=145 clinics, N=36,887 patients; 3 months, N=33 organizations, N=135 clinics, N=19,862 patients; 6 months, N=33 organizations, N=130 clinics, N=11,303 patients; 12 months, N=32 organizations, N=113 clinics, N=3,308 patients. Organizations were excluded from the sample at each time point when they did not have at least one clinic meeting inclusion criteria; clinics were excluded from the sample at each time point when they did not have at least one patient with an available Patient Health Questionnaire–9 (PHQ-9) score. Initial as well as 3-, 6-, and 12-month time cohort follow-up mean±SD PHQ-9 scores weighted by clinic size: initial, 15.1±1.4; 3 months, 10.6±2.0; 6 months, 10.3±2.1; 12 months, 10.2±2.3.

^b Mean percentage improvement for each depression treatment success definition across all included clinics weighted by clinic size.

^c Standard deviation across all included clinics weighted by clinic size.
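The clinic-size weighting described in the table footnotes can be sketched as follows. The footnotes do not give the exact formula, so this example assumes the common frequency-weighted form, in which each clinic's rate is weighted by its number of patients and the weighted variance is divided by the total weight. The function name and the three example clinics are hypothetical.

```python
import math


def weighted_mean_sd(rates, sizes):
    """Mean and SD of clinic-level rates, weighted by clinic size.

    Assumption: SD = sqrt(sum(n * (rate - mean)^2) / sum(n)).
    """
    total = sum(sizes)
    mean = sum(rate * n for rate, n in zip(rates, sizes)) / total
    variance = sum(n * (rate - mean) ** 2 for rate, n in zip(rates, sizes)) / total
    return mean, math.sqrt(variance)


# Hypothetical clinics: response rates (%) and numbers of patients.
clinic_rates = [30.0, 40.0, 50.0]
clinic_sizes = [100, 200, 100]
mean_rate, sd_rate = weighted_mean_sd(clinic_rates, clinic_sizes)
print(round(mean_rate, 1), round(sd_rate, 1))
```

Weighting by clinic size keeps very small clinics, whose rates are noisy, from dominating the summary statistics.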

One study used data from a 114-person collaborative care randomized controlled trial to compare outcomes based on metric 5 with structured interviews and three other depression metrics. In general, all measured metrics were found to have good agreement (κ>0.60) (8). The authors also reported that metrics combining multiplicative terms (50% change) or absolute terms (≥5-point change) with the requirement of a score <10 tended to classify the same patients as improved or not improved (8). However, unlike metrics defined by multiplicative terms, those predicated on absolute score changes do not “penalize” organizations with higher average baseline PHQ-9 scores.

In this investigation, we leveraged 10 years of longitudinal PHQ-9 data from the University of Washington’s Advancing Integrated Mental Health Solutions (AIMS) Center to analyze the extent to which different depression response and remission metrics influence organization-level performance. We then discuss the implications of these findings for measurement-based care, health systems, and research.

Methods

For years, the AIMS Center has supported practices implementing the collaborative care model (CoCM), an evidence-based practice for the treatment of common behavioral health problems in medical settings. Part of this support has included development and dissemination of the Care Management Tracking System (CMTS) (13), a specialized treatment registry that records contact details with patients and facilitates measurement-based care (e.g., tracks PHQ-9 depression scores over time). With the written consent of participating organizations, we compiled a data set of 36,887 adult patients with depressive symptoms who were treated in one of 145 primary care clinics (across 33 organizations) and who had depression outcomes tracked using CMTS between 2008 and 2018. Health care organizations and clinics were located across nine states; approximately 83% (N=120) of the clinics were in urban areas (as defined by the Federal Office of Rural Health Policy), and 64% (N=93) were federally qualified health centers (FQHCs). Analysis of this deidentified data set was granted exemption status by the University of Washington Institutional Review Board (ID STUDY00005907).

Our analysis, which was conducted at the organization and clinic levels, included all patients aged ≥18 years who had at least two documented PHQ-9 scores: one at baseline and one or more within the following 12 months. For each of the 3-, 6-, and 12-month time points, the PHQ-9 score closest to and within 30 days of that time point was extracted and recorded as the score for that time point. Using these criteria, we created three overlapping and nonexclusive time cohorts (3, 6, and 12 months), each including patients who had follow-up scores at that time point. Of note, these time cohorts did not represent a single cohort followed longitudinally over time. For example, an included patient could have a baseline PHQ-9 score and follow-up scores at 3 and 12 months but not at 6 months. Such a patient would be included in the 3- and 12-month time cohorts but not the 6-month time cohort.
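The cohort assignment rule can be illustrated with a short sketch. Several details are assumptions made for the example rather than facts from the report: months are converted to days as 91, 182, and 365; ties in distance are broken in favor of the earlier visit; and the function name cohort_scores and the visit dates are hypothetical.

```python
from datetime import date

# Assumption: day offsets used to represent 3, 6, and 12 months.
TARGET_DAYS = {3: 91, 6: 182, 12: 365}
WINDOW_DAYS = 30


def cohort_scores(baseline_date, follow_ups):
    """Pick, for each time point, the follow-up score closest to the target
    date and no more than 30 days from it.

    follow_ups is a list of (visit_date, phq9_score) pairs.
    Returns {months: score} only for time points with a qualifying score.
    """
    selected = {}
    for months, target in TARGET_DAYS.items():
        candidates = []
        for visit_date, score in follow_ups:
            offset = (visit_date - baseline_date).days
            distance = abs(offset - target)
            if distance <= WINDOW_DAYS:
                candidates.append((distance, visit_date, score))
        if candidates:
            # Smallest distance wins; the earlier visit wins a tie (assumption).
            selected[months] = min(candidates)[2]
    return selected


# Hypothetical patient with follow-up scores near 3 and 12 months only.
visits = [(date(2015, 4, 1), 11), (date(2015, 4, 20), 9), (date(2016, 1, 10), 6)]
print(cohort_scores(date(2015, 1, 5), visits))
```

This hypothetical patient contributes to the 3- and 12-month cohorts but not the 6-month cohort, matching the example in the text.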

Further inclusion and exclusion criteria were applied at the clinic level. In the 3-, 6-, and 12-month time cohorts, a clinic’s data were included if at least one of its patients had a recorded PHQ-9 score. At 3, 6, and 12 months, 135, 130, and 113 clinics met inclusion criteria, respectively. These clinics corresponded to 33, 33, and 32 organizations as well as 19,862, 11,303, and 3,308 patients, respectively. Missing race and gender data were imputed at the clinic level. (For baseline characteristics of organizations and clinics in each of the three time cohorts, see the online supplement to this report.)

First, mean improvement rates defined by the eight depression response and remission metrics (and weighted by clinic size) were calculated for the 3-, 6-, and 12-month cohorts. Next, to analyze the impact of metric choice on comparative organization-level performance, all 33 organizations in the 6-month time cohort were ranked according to their improvement rates across the eight metrics. We calculated ranks by using empirical Bayes predictions from a random-intercept logistic regression model with reliability adjustment, a strategy that has been used with other health outcome rankings (14). We chose to calculate rankings using this random-effects, model-based approach (as opposed to using raw values or direct sample means) for two reasons: it reduced the impact of chance-driven uncertainty from small samples in certain groups because the between-groups variability was estimated by using data from all groups, and it made the rankings more reproducible over time (14). Finally, Spearman’s rank-order correlation coefficients of predicted ranks from different metrics for the 6-month time cohort were calculated.
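The intuition behind reliability adjustment can be shown with a deliberately simplified sketch. It is not the random-intercept logistic regression model used in this analysis: it substitutes a linear, method-of-moments shrinkage on the probability scale, in which each organization's raw rate is pulled toward the overall rate in proportion to how unreliable that raw rate is. All function names and example counts are hypothetical.

```python
def shrunken_rates(successes, totals):
    """Shrink each organization's raw success rate toward the overall rate.

    Simplified stand-in for empirical Bayes prediction:
    reliability = between-organization variance /
                  (between-organization variance + sampling variance).
    """
    raw = [x / n for x, n in zip(successes, totals)]
    overall = sum(successes) / sum(totals)
    k = len(raw)
    # Binomial sampling variance of each raw rate, using the overall rate.
    sampling_var = [overall * (1 - overall) / n for n in totals]
    observed_var = sum((p - overall) ** 2 for p in raw) / (k - 1)
    # Method-of-moments estimate, floored at zero.
    between_var = max(observed_var - sum(sampling_var) / k, 0.0)
    adjusted = []
    for p, v in zip(raw, sampling_var):
        reliability = between_var / (between_var + v) if between_var > 0 else 0.0
        adjusted.append(overall + reliability * (p - overall))
    return adjusted


def rank_descending(values):
    """Rank 1 = highest value."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


# Hypothetical organizations: patients meeting a metric, patients assessed.
example_successes = [9, 180, 400]
example_totals = [10, 400, 1000]
adjusted = shrunken_rates(example_successes, example_totals)
ranks = rank_descending(adjusted)
print([round(a, 3) for a in adjusted], ranks)
```

In this sketch the organization with only 10 patients is pulled toward the overall rate far more than the larger organizations, which is the property that makes model-based rankings less sensitive to chance results in small samples.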

Results

Across all eight metrics, the 12-month cohort had higher rates of metrics indicating treatment success than the 3- and 6-month cohorts. Additionally, rates within time cohorts varied substantially by response or remission definition. In the 3-month cohort, for example, depression response rates ranged from 32% (metric 7) to 51% (metric 4), whereas the remission rate was 22% (metric 8). Additionally, response rates appeared to form two clusters: metrics 2, 3, and 4 were similar (ranging from 48% to 51% in the 3-month cohort), as were metrics 1, 5, 6, and 7 (ranging from 32% to 34% in the 3-month cohort). Similar ranges were observed for the 6- and 12-month time cohorts. All treatment response and remission rates, in addition to mean initial and time cohort follow-up PHQ-9 scores, are presented in Table 1. (For the 6-month time cohort correlation matrix that was calculated with Spearman’s rank-order correlation coefficient, see the online supplement.) All pairwise rank-order correlation coefficients were positive, with a mean of 0.86, and only three of the 28 coefficients were <0.75. Metric 2 was least correlated with the others. These results broadly demonstrate that across metrics, organization-level rank orders were highly correlated.
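As an illustration of the correlation step, the sketch below computes Spearman's rank-order correlation coefficient with only the Python standard library and confirms that eight metrics yield 28 distinct pairs. The two rank vectors are hypothetical and are not the study's organization rankings.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)


# Hypothetical ranks of five organizations under two different metrics.
ranks_metric_a = [1, 2, 3, 4, 5]
ranks_metric_b = [2, 1, 3, 5, 4]
rho = spearman(ranks_metric_a, ranks_metric_b)

# Eight metrics give 8 * 7 / 2 = 28 distinct pairs to correlate.
n_pairs = len(list(combinations(range(1, 9), 2)))
print(round(rho, 2), n_pairs)
```

In the hypothetical example, two adjacent pairs of organizations swap places between metrics and the coefficient remains 0.8, showing that modest reshuffling is compatible with a high rank correlation.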

Discussion

In this analysis of PHQ-9 scores from 33 organizations and three time cohorts across nine states, we found that choice of depression treatment success metric led to markedly different rates of improvement. Response rates for the 3-month time cohort ranged from 32% to 51% depending on choice of metric, whereas the remission rate was 22%. At the same time, we found that organization-level rank orders defined by performance on eight different metrics were uniformly positively correlated (with coefficients largely >0.75). These findings lead to two primary conclusions.

First, it is of paramount importance for organizations to be compared with benchmarks or with one another using the same depression response or remission metric. For example, if organization A uses metric 4 (≥50% decrease from baseline or score <10) and organization B uses metric 1 (≥50% decrease from baseline), their improvement rates cannot be meaningfully compared. We would expect organization A, with a compound metric including an “or” logical operator, to have the higher response rate even if the true rates of improvement were equivalent. This finding also has similar implications for research, in which consistent PHQ-9 definitions for depression treatment response, in particular, are lacking.

Second, our Spearman’s rank-order correlation coefficient findings suggest that the eight metrics all largely appear to tell the same story with regard to depression treatment response and remission. Of note, metric 2, although still positively correlated with the other metrics, was positively correlated to a lesser extent. One possible explanation is that it is the only assessed metric defined solely by an absolute change in PHQ-9 score over time. Depending on the initial PHQ-9 score, treatment success metrics defined in this way can be more or less challenging to achieve relative to those defined multiplicatively (i.e., ≥50% decrease from baseline). Citing similar reasoning, a recent study advocated for the use of multiplicative PHQ-9 metrics over commonly used threshold metrics (i.e., score <5) (15).
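The dependence on baseline score can be made explicit with a small worked example. Assuming whole-number PHQ-9 totals, the sketch compares the minimum point drop needed to satisfy metric 1 (≥50% decrease from baseline) with the fixed 5-point drop needed for metric 2. The function name required_drops is hypothetical.

```python
def required_drops(baseline):
    """Minimum whole-point decrease needed to meet metric 1 and metric 2."""
    metric_1_drop = (baseline + 1) // 2  # smallest integer >= baseline / 2
    metric_2_drop = 5                    # fixed absolute decrease
    return metric_1_drop, metric_2_drop


for baseline in (8, 10, 15, 20):
    print(baseline, required_drops(baseline))
```

Below a baseline of 10, the percentage criterion requires a smaller drop than the absolute criterion; above 10, it requires a larger one. A patient starting at 20 needs a 10-point drop to meet metric 1 but only a 5-point drop to meet metric 2, which is one way organizations with higher average baseline scores can fare differently under the two kinds of definition.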

Regardless of which metric is chosen, however, similar organizations in this study tended to be ranked favorably and were deemed to be high performing. This finding suggests that inter- and intraorganization consistency of metric use may be more important than which specific metric is chosen. In other words, organization-level depression outcome measurement and research efforts should, above all else, strive to compare the metric equivalent of “apples with apples.”

At the same time, our results show that differently defined response rates tend to cluster across metrics: metrics 2, 3, and 4 were similar, as were metrics 1, 5, 6, and 7. This finding suggests that when comparing organizations using different metrics within each of these clusters, there is perhaps less concern about the impact of metric choice. Furthermore, given that the eight metrics in this investigation are highly correlated, one possible application of our findings could be to provide a theoretical “conversion factor” for different metrics. For example, on the basis of the 3-month cohort data in Table 1, one could expect that a response rate defined by metric 2 (48%) would be roughly 1.5 times that defined by metric 5 (32%), with no true differences among patients’ clinical statuses.
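The conversion-factor idea amounts to a ratio of two rates from the same cohort. The sketch below uses the 3-month mean percentages from Table 1; the dictionary and function names are hypothetical, and the resulting factors are descriptive ratios from this one data set rather than validated conversion constants.

```python
# 3-month mean improvement rates (%) by metric number, from Table 1.
THREE_MONTH_RATES = {1: 34, 2: 48, 3: 49, 4: 51, 5: 32, 6: 33, 7: 32, 8: 22}


def conversion_factor(from_metric, to_metric, rates=THREE_MONTH_RATES):
    """Ratio of the rate under to_metric to the rate under from_metric."""
    return rates[to_metric] / rates[from_metric]


factor = conversion_factor(5, 2)  # 48 / 32
print(factor)
```

Ratios between metrics in the same cluster stay close to 1 (for example, metric 1 versus metric 5), whereas ratios across clusters are closer to 1.5.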

The findings in this investigation are limited by the real-world nature of this AIMS Center data set, which had missing information, including follow-up PHQ-9 scores and patient demographic characteristics. This limitation is evidenced by the comparative sizes of the time cohorts, with the 3-, 6-, and 12-month cohorts including roughly 50%, 30%, and 10% of the full sample, respectively. Although these findings were not surprising, they highlight the challenges associated with consistent longitudinal outcome ascertainment in real-world outpatient settings. Of note, we were able to reduce the impact of missing gender and race data in this investigation through imputation methods. Additionally, our sample of 33 organizations across multiple states remains one of the largest and most diverse real-world CoCM implementation data sets to date. Approximately half of included patients were persons of color, and two-thirds of clinics were FQHCs. Finally, our relative lack of patient-level exclusion criteria supports the external validity and generalizability of our results.

Conclusions

Our findings demonstrate that the choice of PHQ-9 response or remission metric substantially affects observed treatment improvement rates. Furthermore, organization-level rankings for depression response and remission vary depending on choice of metric, but their rankings are highly correlated. We therefore conclude that in organization-level PHQ-9 response or remission rate comparisons, metric consistency is imperative, whereas the specific metric chosen is of secondary importance.

Department of Psychiatry and Behavioral Sciences, University of Washington School of Medicine, Seattle (Carlo, Arao, Vredevoogd, Fortney, Powers, Russo, Unützer); Department of Biostatistics, and Department of Health Services, University of Washington School of Public Health, Seattle (Chan); U.S. Department of Veterans Affairs, Health Services Research and Development, Center of Innovation for Veteran-Centered and Value-Driven Care, Seattle (Fortney)
Send correspondence to Dr. Carlo.

Dr. Carlo was supported by a postdoctoral fellowship from the National Institutes of Health (6T32-MH-073553-15).

The authors thank the 33 organizations that agreed to share their deidentified Care Management Tracking System registry data for this work. Through their generosity, a collaborative care implementation data set of more than 35,000 adult patients was created, providing a tremendous opportunity to help study and improve care for depression.

The authors report no financial relationships with commercial interests.

References

1. Schoenbaum M, Unützer J, McCaffrey D, et al.: The effects of primary care depression treatment on patients’ clinical status and employment. Health Serv Res 2002; 37:1145–1158

2. Greenberg PE, Fournier AA, Sisitsky T, et al.: The economic burden of adults with major depressive disorder in the United States (2005 and 2010). J Clin Psychiatry 2015; 76:155–162

3. Cuijpers P: The challenges of improving treatments for depression. JAMA 2018; 320:2529–2530

4. Katzelnick DJ, Duffy FF, Chung H, et al.: Depression outcomes in psychiatric clinical practice: using a self-rated measure of depression severity. Psychiatr Serv 2011; 62:929–935

5. Löwe B, Unützer J, Callahan CM, et al.: Monitoring depression treatment outcomes with the Patient Health Questionnaire-9. Med Care 2004; 42:1194–1201

6. Ziring JP, Black J, Gogia S, et al.: It’s time to rethink how we measure remission from depression. NEJM Catalyst (Epub Feb 5, 2017)

7. Kroenke K, Spitzer RL: The PHQ-9: a new depression diagnostic and severity measure. Psychiatr Ann 2002; 32:509–515

8. McMillan D, Gilbody S, Richards D: Defining successful treatment outcome in depression using the PHQ-9: a comparison of methods. J Affect Disord 2010; 127:122–129

9. Kroenke K, Spitzer RL, Williams JB: The PHQ-9: validity of a brief depression severity measure. J Gen Intern Med 2001; 16:606–613

10. New York State Medicaid Collaborative Care Provider Certification. New York, New York State Office of Mental Health, 2020. https://aims.uw.edu/nyscc/sites/default/files/NYS Medicaid Collaborative Care Program Provider Certification 2020_0.docx. Accessed May 20, 2020

11. Screening Tools. Rockville, MD, Substance Abuse and Mental Health Services Administration, Health Resources and Services Administration Center for Integrated Health Solutions, 2019. https://www.samhsa.gov/integrated-health-solutions. Accessed May 20, 2020

12. Frank E, Prien RF, Jarrett RB, et al.: Conceptualization and rationale for consensus definitions of terms in major depressive disorder. Remission, recovery, relapse, and recurrence. Arch Gen Psychiatry 1991; 48:851–855

13. Care Management Tracking System (CMTS). Seattle, University of Washington, Advancing Integrated Mental Health Solutions Center, 2020. https://aims.uw.edu/resource-library/care-management-tracking-system-cmts. Accessed May 20, 2020

14. Dimick JB, Ghaferi AA, Osborne NH, et al.: Reliability adjustment for reporting hospital outcomes with surgery. Ann Surg 2012; 255:703–707

15. Coley RY, Boggs JM, Beck A, et al.: Defining success in measurement-based care for depression: a comparison of common metrics. Psychiatr Serv 2020; 71:312–318