Editorials

Predicting Conversion to Psychosis Using Machine Learning: Are We There Yet?

In the present issue of the American Journal of Psychiatry, Smucny et al. suggest that predictive algorithms for psychosis using machine learning (ML) methods may already achieve a clinically useful level of accuracy (1). In support of this perspective, these authors report on the results of an analysis using the North American Prodrome Longitudinal Study, Phase 3 (NAPLS3) data set (2), which they accessed through the National Institute of Mental Health Data Archive (NDAR). This is a large multisite study of youth at clinical high risk for psychosis followed up on multiple occasions with clinical, cognitive, and biomarker assessments. Several ML approaches were compared with each other and with Cox (time-to-event) and logistic regression using the clinical, neurocognitive, and demographic features from the NAPLS2 individualized risk calculator (3), with salivary cortisol also tested as an add-on biomarker. When these variables were analyzed using Cox and logistic regression, the model applied to the NAPLS3 cohort attained a level of predictive accuracy comparable to that observed in the original NAPLS2 cohort (overall accuracy in the 66%–68% range). However, several ML algorithms produced nominally better results, with a random forest model performing best (overall accuracy in the 90% range). Although a predictive algorithm with 90% or higher predictive accuracy would clearly have greater clinical utility than one with substantially lower accuracy, several issues remain to be resolved before it can be determined whether ML methods have in fact attained this utility “threshold.”

First and foremost, an ML model’s expected real-world performance can be ascertained only when the model is tested in an independent sample or data set that it has never before encountered. ML methods are very adept at finding apparent structure in data that predicts an outcome, but if that structure is idiosyncratic to the training data set, the model will fail to generalize to other contexts and thus not be useful, a problem known as “overfitting” (4). Internal cross-validation methods are not sufficient to overcome this problem, since the model “sees” all of the training data at some point in the process, even if a portion is left out on any particular iteration (5). Overfitting is indicated by a large drop in model accuracy when moving from the internally cross-validated training data set to an external, independent validation test. Smucny et al. (1) acknowledge the need for such an external replication test before the utility of the ML models they evaluated, which relied only on internal cross-validation, can be fully appreciated.
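To make the distinction concrete, here is a minimal sketch on purely synthetic data (not the authors' pipeline; a random forest is used only as a stand-in for a flexible ML model) comparing an internally cross-validated accuracy estimate with accuracy on held-out data the fitted model has never encountered:

```python
# Minimal sketch on synthetic data (not the authors' pipeline): compare an
# internal cross-validation estimate with accuracy on never-seen data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for a high-risk cohort: a few informative features,
# many noisy ones, and a ~15% minority ("converter") class.
X, y = make_classification(n_samples=400, n_features=50, n_informative=5,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_ext, y_train, y_ext = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=500, random_state=0)

# Internal cross-validation: every training observation is "seen" at some
# point during model development.
internal = cross_val_score(model, X_train, y_train, cv=5,
                           scoring="balanced_accuracy").mean()

# External validation: fit once on the training set, then score on data the
# model has never encountered.
model.fit(X_train, y_train)
external = balanced_accuracy_score(y_ext, model.predict(X_ext))

print(f"Internal CV balanced accuracy: {internal:.2f}")
print(f"External balanced accuracy:    {external:.2f}")
# A large drop from the internal to the external estimate signals overfitting.
```

The gap between the two estimates tends to widen as more candidate models and feature combinations are tried on the same training data, which is the selection-bias problem described by Cawley and Talbot (4).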

Is there likely to be a large drop in accuracy of the ML models reported by Smucny et al. (1) when such an external validation test is performed? On one hand, they limited consideration to a small number of features that have previously been shown to predict psychosis in numerous independent samples (i.e., the variables in the NAPLS2 risk calculator [3]). This choice mitigates the overfitting issue to some extent, because the features used in model building are already filtered (based on prior work) to be highly likely to predict conversion to psychosis, both individually and when combined in a regression model. On the other hand, the ML models employed in the study use various approaches to find higher-order interactive and nonlinear combinations of this set of feature variables that maximally discriminate outcome groups. This aspect increases the risk of overfitting, given that a very large number of such higher-order interactive effects are assessed in model building, with relatively few subjects available to represent each unique permutation, a problem known as the “curse of dimensionality” (6). Tree-based methods such as the random forest model that performed best in the NAPLS3 data set are not immune to this problem and, in fact, are particularly vulnerable to it when applied to data sets with relatively small numbers of individuals with the outcome of interest (7).
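To get a feel for the arithmetic, the following back-of-the-envelope sketch (the feature and converter counts are hypothetical, chosen only to mirror an approximately 10:1 ratio of converters to predictors, and are not taken from NAPLS3) counts how quickly candidate interaction "cells" multiply relative to the converters available to populate them:

```python
# Back-of-the-envelope illustration of the "curse of dimensionality".
# Counts are hypothetical (roughly a 10:1 converter-to-predictor ratio),
# not figures from NAPLS3.
from math import comb

n_features = 8      # a small, pre-specified predictor set
n_converters = 80   # hypothetical number of minority-class (converter) cases

for order in (2, 3, 4):
    n_interactions = comb(n_features, order)   # candidate feature subsets of this order
    cells = n_interactions * 2 ** order        # "low"/"high" cells implied by each subset
    print(f"order-{order} interactions: {n_interactions:3d} "
          f"-> {cells:5d} cells to populate with {n_converters} converters")
```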

The relatively low base rate of conversion to psychosis (i.e., 10%–15%), even in a sample selected to be at elevated risk as in NAPLS3, creates another problem for ML methods; namely, such models can achieve high levels of predictive accuracy in the training data set simply by guessing that each case is a nonconverter. Smucny et al. (1) attempt to overcome this issue using a synthetic minority oversampling technique that effectively up-samples the minority class (in this case, converters to psychosis) to the point that it has 50% representation in the synthetic sample (8). Although this approach is very helpful in preventing ML models from defaulting to prediction of the majority class, its use in computing cross-validation performance metrics is likely to be highly misleading, given that real-world application of the model is unlikely to occur in a context in which there is a 50:50 ratio of future converters and nonconverters. Rather, the model will be applied to compute the likelihood of conversion for newly ascertained clinical high risk (CHR) individuals, and those individuals will derive from a population in which the base rate of conversion is ∼15%. It is now well established that the same predictive model will result in different risk distributions (and, thereby, different thresholds in model-predicted risk for making binary predictions) in samples that vary in base rates of conversion to psychosis (9). Given this, a 90% predictive accuracy achieved by an ML algorithm in a synthetically derived sample in which the base rate of psychosis conversion is artificially set to 50% is highly unlikely to generalize to an independent, real-world CHR sample, at least as ascertained using current approaches.
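The sketch below illustrates the difference on synthetic data, assuming the SMOTE procedure of reference 8 as implemented in the imbalanced-learn package; it contrasts cross-validation run on a data set that was balanced before splitting with cross-validation in which up-sampling is confined to the training folds and each test fold retains its original, imbalanced base rate:

```python
# Synthetic data with a ~15% minority ("converter") class; not NAPLS3 data.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=20, n_informative=4,
                           weights=[0.85, 0.15], random_state=0)

clf = RandomForestClassifier(n_estimators=300, random_state=0)

# Misleading: balance the whole data set first, then cross-validate.
# Synthetic cases derived from would-be test subjects leak into training,
# and every test fold has an artificial 50:50 base rate.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
score_balanced = cross_val_score(clf, X_bal, y_bal, cv=5,
                                 scoring="balanced_accuracy").mean()

# More realistic: up-sample only within each training fold; every test fold
# keeps its original, imbalanced base rate.
pipe = Pipeline([("smote", SMOTE(random_state=0)), ("clf", clf)])
score_realistic = cross_val_score(pipe, X, y, cv=5,
                                  scoring="balanced_accuracy").mean()

print(f"CV on the pre-balanced data:  {score_balanced:.2f}")
print(f"CV at the original base rate: {score_realistic:.2f}")
```

Only the latter arrangement approximates the circumstances under which the model would actually confront newly ascertained CHR individuals.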

When developing the NAPLS2 risk calculator, the investigators made purposeful decisions to allow the resulting algorithm to be applied validly in scaling the risk of newly ascertained CHR individuals (3). Key among these decisions was to avoid using the NAPLS2 data set to test different possible models, which would then have necessitated an external validation test. Rather, a small number of predictor variables was chosen based on their empirical associations with conversion to psychosis in prior studies, and Cox regression was employed to generate an additive multivariate model of predicted risk (i.e., no interactive or nonlinear combinations of the variables were included). As a result, the ratio of converters to predictor variables was 10:1 (helping to ensure adequate representation of the scale values of each predictor in the minority class), and there was no need to use a synthetic sampling approach, given that Cox regression is well suited to prediction of low base rate outcomes. The predictor variables chosen for inclusion are ones that are easily ascertained in standard clinical settings and have a high level of acceptability (face validity) for use in clinical decision making. It is important to note that the NAPLS2 model has been shown to replicate (in terms of area under the curve or concordance index) when applied to multiple external independent data sets (10).
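For readers less familiar with this modeling choice, a minimal sketch follows, using simulated data, hypothetical variable names, and the lifelines package as one common Cox implementation; it does not reproduce the actual NAPLS2 variables or coefficients. The model is purely additive, and the fitted hazard function can return an individualized predicted risk over a fixed horizon for a new CHR case:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 500

# Simulated data with hypothetical predictor names; ~15% of cases convert.
df = pd.DataFrame({
    "symptom_severity": rng.normal(size=n),
    "verbal_memory": rng.normal(size=n),
    "functional_decline": rng.normal(size=n),
    "months_followed": rng.exponential(scale=24, size=n),  # time to conversion or censoring
    "converted": rng.binomial(1, 0.15, size=n),            # 1 = converted, 0 = censored
})

# Additive Cox model: main effects only, no interaction or nonlinear terms.
cph = CoxPHFitter()
cph.fit(df, duration_col="months_followed", event_col="converted")
cph.print_summary()

# Individualized predicted risk of conversion within 24 months for a new case.
new_case = df.drop(columns=["months_followed", "converted"]).iloc[[0]]
risk_24mo = 1.0 - cph.predict_survival_function(new_case, times=[24]).iloc[0, 0]
print(f"Predicted 24-month conversion risk: {risk_24mo:.2f}")
```

Because the hazard model conditions directly on the observed event and censoring times, no artificial rebalancing of converters and nonconverters is needed to accommodate the low base rate.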

Nevertheless, two issues continue to limit the utility of the NAPLS2 risk calculator. One is that it will generate differently shaped risk distributions in samples that vary in conversion risk and in the distributions of the individual predictor variables, making it problematic to apply the same threshold of predicted risk for binary predictions across samples that differ in these ways (9, 11). However, it appears possible to derive comparable prediction metrics across samples with differing conversion risks when considering the relative recency of onset or worsening of attenuated positive symptoms at the baseline assessment (11). A more recent onset or worsening of attenuated positive symptoms defines a subgroup of CHR individuals with a higher average predicted risk and higher overall transition rate, and in whom particular putative illness mechanisms, in this case an accelerated rate of cortical thinning (12), appear to be differentially relevant (11).
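A small numerical illustration of this point (the likelihood ratios and base rates below are hypothetical, not estimates from any of the cited cohorts): converting the same test result to a predicted risk via Bayes' rule yields systematically different risks, and therefore different binary-decision thresholds, when the base rate of conversion changes:

```python
# Hypothetical numbers, not estimates from any cited cohort: the same test
# result (a fixed likelihood ratio) maps to different predicted risks when
# the base rate of conversion differs, via Bayes' rule on the odds scale.
def posterior_risk(likelihood_ratio: float, prevalence: float) -> float:
    prior_odds = prevalence / (1.0 - prevalence)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

for lr in (1.0, 3.0, 10.0):                              # hypothetical likelihood ratios
    risk_enriched = posterior_risk(lr, prevalence=0.30)  # an enriched development sample
    risk_typical = posterior_risk(lr, prevalence=0.15)   # a typical CHR application sample
    print(f"LR = {lr:4.1f}: predicted risk {risk_enriched:.2f} at a 30% base rate "
          f"vs {risk_typical:.2f} at a 15% base rate")
```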

The second rate-limiting issue for the utility of the NAPLS2 risk calculator is that its performance in terms of sensitivity, specificity, and balanced accuracy, even when accounting for recency of onset of symptoms, is still in the 65%–75% range. Although ML methods represent one approach that, if externally validated, could conceivably yield predictive models at the 90% or higher level of accuracy, such models would continue to have the disadvantage of being relatively opaque (“black box”) with respect to how the underlying predictor variables aggregate in defining risk, and for that reason they may not be used as readily in clinical practice. Alternatively, it may be possible to rely on more transparent analytic approaches to achieve the needed level of accuracy. It has recently been demonstrated that integrating information on short-term (baseline to 2-month follow-up) change on a single clinical variable (e.g., deterioration in odd behavior/appearance) improves the performance of the NAPLS2 risk calculator to levels of sensitivity, specificity, and balanced accuracy above 90%, a range that would support its use in clinical trial design and clinical decision making (13). Importantly, although the Cox regression component of this algorithm has been externally validated, the incorporation of short-term clinical change (via mixed-effects growth modeling) requires replication in an external data set.

Smucny et al. (1) are to be congratulated on a well-motivated and well-executed analysis of the NAPLS3 data set. It is heartening to see such creative uses of this unique shared resource for our field bear fruit, reinforcing the value of open science. As we move forward toward the time and place in which prediction models of psychosis and related outcomes have utility for clinical decision making, whether those models rely on machine learning methods or more traditional approaches, it will be crucial to insist on external validation of results before deciding that we are, in fact, “there.”

Clark L. Hull Professor of Psychology and Professor of Psychiatry, Yale University, New Haven, Conn.
Send correspondence to Dr. Cannon.

Dr. Cannon reports no financial relationships with commercial interests.

References

1. Smucny J, Davidson I, Carter CS: Are we there yet? Predicting conversion to psychosis using machine learning. Am J Psychiatry 2023; 180:836–840

2. Addington J, Liu L, Brummitt K, et al.: North American Prodrome Longitudinal Study (NAPLS 3): methods and baseline description. Schizophr Res 2022; 243:262–267

3. Cannon TD, Yu C, Addington J, et al.: An individualized risk calculator for research in prodromal psychosis. Am J Psychiatry 2016; 173:980–988

4. Cawley GC, Talbot NLC: On over-fitting in model selection and subsequent selection bias in performance evaluation. J Mach Learn Res 2010; 11:2079–2107

5. Arlot S, Celisse A: A survey of cross-validation procedures for model selection. Statist Surv 2010; 4:40–79

6. Hughes G: On the mean accuracy of statistical pattern recognizers. IEEE Trans Inf Theor 1968; 14:55–63

7. Peng Y, Nagata MH: An empirical overview of nonlinearity and overfitting in machine learning using COVID-19 data. Chaos Solitons Fractals 2020; 139:110055

8. Chawla NV, Bowyer KW, Hall LO, et al.: SMOTE: Synthetic Minority Over-Sampling Technique. J Artif Intell Res 2002; 16:321–357

9. Koutsouleris N, Worthington M, Dwyer DB, et al.: Toward generalizable and transdiagnostic tools for psychosis prediction: an independent validation and improvement of the NAPLS-2 risk calculator in the multisite PRONIA cohort. Biol Psychiatry 2021; 90:632–642

10. Worthington MA, Cannon TD: Prediction and prevention in the clinical high-risk for psychosis paradigm: a review of the current status and recommendations for future directions of inquiry. Front Psychiatry 2021; 12:770774

11. Worthington MA, Collins MA, Addington J, et al.: Improving prediction of psychosis in youth at clinical high-risk: pre-baseline symptom duration and cortical thinning as moderators of the NAPLS2 risk calculator. Psychol Med 2023:1–9

12. Collins MA, Ji JL, Chung Y, et al.: Accelerated cortical thinning precedes and predicts conversion to psychosis: the NAPLS3 longitudinal study of youth at clinical high-risk. Mol Psychiatry 2023; 28:1182–1189

13. Worthington MA, Addington J, Bearden CE, et al.: Dynamic prediction of outcomes for youth at clinical high risk for psychosis: a joint modeling approach. JAMA Psychiatry 2023:e232378