Taking Issue

Large Data Sets Can Be Dangerous!

Published Online: https://doi.org/10.1176/appi.ps.54.2.133

Researchers generally believe in the advantages of having more data, often as an antidote to problems with recruiting, retention, and statistical power. Yet the increasing availability of large administrative databases and computerized clinical records and the easy manipulation of data by computerized statistical packages have created a different set of problems that journal reviewers now encounter more commonly. Among the problems are poor quality of data, statistical significance without meaningfulness, the use of multiple tests that capitalize on chance, and post hoc interpretations.

First, data collected for purposes other than research, such as billing or clinical records, are rarely of research quality. To complicate matters, researchers often have little information on the reliability and validity of such data. The danger, and a common occurrence, is that invalid data are used for invalid analyses that lead to invalid conclusions.

Second, very large samples yield numerous statistically significant but meaningless associations for a variety of well-documented reasons, such as similar biases that apply across the measures. Statistically significant findings are unimportant when they reflect measurement errors or represent tiny differences that do not approach clinical significance. Without studying measurement accuracy and specifying a meaningful difference a priori, researchers sometimes synthesize a pattern of trivial findings into a publishable paper.
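To make the point concrete, the sketch below (my illustration, not part of the editorial) simulates two groups of 500,000 records whose means differ by a clinically negligible 0.3 points on a hypothetical 0-100 symptom scale; the t-test is overwhelmingly significant even though the effect size is trivial.

```python
# Illustrative sketch (hypothetical numbers): with a large enough sample,
# a clinically trivial difference reaches conventional statistical significance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
n = 500_000  # hypothetical sample size typical of administrative databases

# Two groups differing by 0.3 points on a 0-100 symptom scale (sd = 15),
# far below any plausible threshold for clinical relevance.
group_a = rng.normal(loc=50.0, scale=15.0, size=n)
group_b = rng.normal(loc=50.3, scale=15.0, size=n)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
cohens_d = (group_b.mean() - group_a.mean()) / np.sqrt(
    (group_a.var(ddof=1) + group_b.var(ddof=1)) / 2
)

print(f"p = {p_value:.2e}")           # typically far below 0.05
print(f"Cohen's d = {cohens_d:.3f}")  # about 0.02: a negligible effect
```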

Third, with computers and large data sets, the temptation to sift through numerous associations and pick out the ones that seem to fit the investigators' hypotheses—or, even worse, the ones that seem to cohere according to post hoc explanations—is ever present. Many investigators do not report all the tests they have run or all the variables they have examined and do not correct for multiple tests. The inevitable result is a proliferation of type I errors.
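A minimal simulation, using entirely hypothetical numbers of my own choosing, shows how quickly unadjusted multiple testing manufactures "findings": testing 100 outcome variables that have no true association with an arbitrary exposure still yields roughly five results with p < 0.05 by chance alone.

```python
# Simulation sketch: 100 outcome variables with no real association,
# yet about 5 cross p < 0.05 when no correction for multiple tests is made.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n_subjects, n_outcomes = 2_000, 100

exposure = rng.integers(0, 2, size=n_subjects)        # arbitrary binary "exposure"
outcomes = rng.normal(size=(n_subjects, n_outcomes))  # pure noise: no true effects

p_values = np.array([
    stats.ttest_ind(outcomes[exposure == 1, j], outcomes[exposure == 0, j]).pvalue
    for j in range(n_outcomes)
])

print("Uncorrected 'significant' findings:", np.sum(p_values < 0.05))               # ~5 expected
print("Bonferroni-corrected findings:     ", np.sum(p_values < 0.05 / n_outcomes))  # ~0 expected
```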

Fourth, large existing data sets encourage investigators to look for research questions that fit the data—usually imperfectly—rather than find data that can answer a meaningful question. For example, investigators are tempted to use whatever comparison group exists rather than a group that makes sense on the basis of logic and a priori hypotheses.

What is to be done? Researchers can emphasize research ethics, oversight by senior researchers, the criterion of common sense in research training, more quality and less quantity of publications, and adherence to scientific standards. Mental health journals are necessarily adopting new standards for disclosure and review, such as the use of effect sizes and corrections for multiple tests.
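As one hedged sketch of what such reporting standards might look like in practice (the editorial does not prescribe a particular method), each comparison could be reported with its effect size and a multiplicity-adjusted p-value; the Holm adjustment used below is one common choice, and the numbers are purely hypothetical.

```python
# Sketch of reporting each test with an effect size and an adjusted p-value.
import numpy as np
from statsmodels.stats.multitest import multipletests

def summarize_tests(p_values, effect_sizes, alpha=0.05):
    """Pair raw p-values with effect sizes and Holm-adjusted p-values."""
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=alpha, method="holm")
    for i, (p, d, p_adj, sig) in enumerate(zip(p_values, effect_sizes, p_adjusted, reject)):
        print(f"test {i}: p = {p:.4f}, adjusted p = {p_adj:.4f}, "
              f"Cohen's d = {d:.2f}, significant after correction: {sig}")

# Hypothetical results from five prespecified comparisons.
summarize_tests(
    p_values=np.array([0.001, 0.03, 0.04, 0.20, 0.65]),
    effect_sizes=np.array([0.45, 0.10, 0.08, 0.05, 0.01]),
)
```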