Taking Issue

Large Data Sets Can Be Dangerous!

Published Online: https://doi.org/10.1176/appi.ps.54.2.133

Researchers generally believe in the advantages of having more data, often as an antidote to problems with recruiting, retention, and statistical power. Yet the increasing availability of large administrative databases and computerized clinical records and the easy manipulation of data by computerized statistical packages have created a different set of problems that journal reviewers now encounter more commonly. Among the problems are poor quality of data, statistical significance without meaningfulness, the use of multiple tests that capitalize on chance, and post hoc interpretations.

First, data collected for purposes other than research, such as billing or clinical records, are rarely of research quality. To complicate matters, researchers often have little information on the reliability and validity of such data. The danger, and a common occurrence, is that invalid data are used for invalid analyses that lead to invalid conclusions.

Second, very large samples yield numerous statistically significant but meaningless associations for a variety of well-documented reasons, such as similar biases that apply across the measures. Statistically significant findings are unimportant when they reflect measurement errors or represent tiny differences that do not approach clinical significance. Without studying measurement accuracy and specifying a meaningful difference a priori, researchers sometimes synthesize a pattern of trivial findings into a publishable paper.
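To make the point concrete, the sketch below (my illustration, not part of the editorial) simulates two groups of 500,000 records whose means differ by a clinically negligible 0.3 points on a hypothetical 0-100 symptom scale; the t-test is overwhelmingly significant even though the effect size is trivial.

```python
# Illustrative sketch (hypothetical numbers): with a large enough sample,
# a clinically trivial difference reaches conventional statistical significance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
n = 500_000  # hypothetical sample size typical of administrative databases

# Two groups differing by 0.3 points on a 0-100 symptom scale (sd = 15),
# far below any plausible threshold for clinical relevance.
group_a = rng.normal(loc=50.0, scale=15.0, size=n)
group_b = rng.normal(loc=50.3, scale=15.0, size=n)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
cohens_d = (group_b.mean() - group_a.mean()) / np.sqrt(
    (group_a.var(ddof=1) + group_b.var(ddof=1)) / 2
)

print(f"p = {p_value:.2e}")           # typically far below 0.05
print(f"Cohen's d = {cohens_d:.3f}")  # about 0.02: a negligible effect
```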

Third, with computers and large data sets, the temptation to sift through numerous associations and pick out the ones that seem to fit the investigators' hypotheses—or, even worse, the ones that seem to cohere according to post hoc explanations—is ever present. Many investigators do not report all the tests they have run or all the variables they have examined and do not correct for multiple tests. The inevitable result is a proliferation of type I errors.
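A minimal simulation, using entirely hypothetical numbers of my own choosing, shows how quickly unadjusted multiple testing manufactures "findings": testing 100 outcome variables that have no true association with an arbitrary exposure still yields roughly five results with p < 0.05 by chance alone.

```python
# Simulation sketch: 100 outcome variables with no real association,
# yet about 5 cross p < 0.05 when no correction for multiple tests is made.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n_subjects, n_outcomes = 2_000, 100

exposure = rng.integers(0, 2, size=n_subjects)        # arbitrary binary "exposure"
outcomes = rng.normal(size=(n_subjects, n_outcomes))  # pure noise: no true effects

p_values = np.array([
    stats.ttest_ind(outcomes[exposure == 1, j], outcomes[exposure == 0, j]).pvalue
    for j in range(n_outcomes)
])

print("Uncorrected 'significant' findings:", np.sum(p_values < 0.05))               # ~5 expected
print("Bonferroni-corrected findings:     ", np.sum(p_values < 0.05 / n_outcomes))  # ~0 expected
```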

Fourth, large existing data sets encourage investigators to look for research questions that fit the data—usually imperfectly—rather than find data that can answer a meaningful question. For example, investigators are tempted to use whatever comparison group exists rather than a group that makes sense on the basis of logic and a priori hypotheses.

What is to be done? Researchers can emphasize research ethics, oversight by senior researchers, the criterion of common sense in research training, more quality and less quantity of publications, and adherence to scientific standards. Mental health journals are necessarily adopting new standards for disclosure and review, such as the use of effect sizes and corrections for multiple tests.
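As one hedged sketch of what such reporting standards might look like in practice (the editorial does not prescribe a particular method), each comparison could be reported with its effect size and a multiplicity-adjusted p-value; the Holm adjustment used below is one common choice, and the numbers are purely hypothetical.

```python
# Sketch of reporting each test with an effect size and an adjusted p-value.
import numpy as np
from statsmodels.stats.multitest import multipletests

def summarize_tests(p_values, effect_sizes, alpha=0.05):
    """Pair raw p-values with effect sizes and Holm-adjusted p-values."""
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=alpha, method="holm")
    for i, (p, d, p_adj, sig) in enumerate(zip(p_values, effect_sizes, p_adjusted, reject)):
        print(f"test {i}: p = {p:.4f}, adjusted p = {p_adj:.4f}, "
              f"Cohen's d = {d:.2f}, significant after correction: {sig}")

# Hypothetical results from five prespecified comparisons.
summarize_tests(
    p_values=np.array([0.001, 0.03, 0.04, 0.20, 0.65]),
    effect_sizes=np.array([0.45, 0.10, 0.08, 0.05, 0.01]),
)
```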