Taking Issue
Large Data Sets Can Be Dangerous!
Robert E. Drake, M.D., Ph.D.; Gregory J. McHugo, Ph.D.
Psychiatric Services 2003; doi: 10.1176/appi.ps.54.2.133

Researchers generally believe in the advantages of having more data, often as an antidote to problems with recruiting, retention, and statistical power. Yet the increasing availability of large administrative databases and computerized clinical records and the easy manipulation of data by computerized statistical packages have created a different set of problems that journal reviewers now encounter more commonly. Among the problems are poor quality of data, statistical significance without meaningfulness, the use of multiple tests that capitalize on chance, and post hoc interpretations.

First, data collected for purposes other than research (for example, for billing or for clinical records) are rarely of research quality. To complicate matters, researchers often have little information on the reliability and validity of such data. The danger is that invalid data are used for invalid analyses that lead to invalid conclusions, a common occurrence.

Second, very large samples yield numerous statistically significant but meaningless associations for a variety of well-documented reasons, such as similar biases that apply across the measures. Statistically significant findings are unimportant when they reflect measurement errors or represent tiny differences that do not approach clinical significance. Without studying measurement accuracy and specifying a meaningful difference a priori, researchers sometimes synthesize a pattern of trivial findings into a publishable paper.
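The gap between statistical and clinical significance can be illustrated with a rough calculation. The sketch below is not from the original article; it assumes a simple two-sample z-test with equal group sizes and unit variance, and the effect size of 0.02 standard deviations and the function name `two_sided_p` are hypothetical choices made only for illustration.

```python
import math


def two_sided_p(d: float, n_per_group: int) -> float:
    """Two-sided p-value for a two-sample z-test.

    Assumes two equal-sized groups with known unit variance, so the
    standardized mean difference d has standard error sqrt(2 / n).
    """
    z = abs(d) * math.sqrt(n_per_group / 2)
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


# Hypothetical, clinically trivial difference: 0.02 standard deviations.
TRIVIAL_EFFECT = 0.02

for n in (100, 100_000):
    p = two_sided_p(TRIVIAL_EFFECT, n)
    print(f"n per group = {n:>7}: p = {p:.6f}")
```

Under these assumptions the same trivial difference is nowhere near significant with 100 subjects per group but falls well below conventional thresholds with 100,000 per group, which is why a meaningful difference needs to be specified before the analysis rather than inferred from the p-value.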

Third, with computers and large data sets, the temptation to sift through numerous associations and pick out the ones that seem to fit the investigators' hypotheses—or, even worse, the ones that seem to cohere according to post hoc explanations—is ever present. Many investigators do not report all the tests they have run or all the variables they have examined and do not correct for multiple tests. The inevitable result is a proliferation of type 1 errors.
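The arithmetic behind this proliferation is easy to sketch. The example below is an illustration rather than anything taken from the article: it assumes the tests are independent, uses the Bonferroni correction as one common (and conservative) adjustment, and the function names and the choice of 20 tests are hypothetical.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Chance of at least one false positive across m independent tests,
    each run at significance level alpha, when every null is true."""
    return 1 - (1 - alpha) ** m


def bonferroni_significant(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Flag p-values that survive a Bonferroni correction (threshold alpha / m)."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical scenario: 20 uncorrected tests at alpha = 0.05.
m, alpha = 20, 0.05
print(f"uncorrected: {familywise_error_rate(alpha, m):.3f}")
print(f"Bonferroni:  {familywise_error_rate(alpha / m, m):.3f}")
print(bonferroni_significant([0.001, 0.01, 0.04]))
```

With 20 independent tests, the chance of at least one spurious finding is well over one-half unless the threshold is adjusted. The calculation only works if the number of tests is known, which is why reporting every test that was run matters as much as the correction itself.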

Fourth, large existing data sets encourage investigators to look for research questions that fit the data—usually imperfectly—rather than find data that can answer a meaningful question. For example, investigators are tempted to use whatever comparison group exists rather than a group that makes sense on the basis of logic and a priori hypotheses.

What is to be done? Researchers can emphasize research ethics, oversight by senior researchers, the criterion of common sense in research training, more quality and less quantity of publications, and adherence to scientific standards. Mental health journals are necessarily adopting new standards for disclosure and review, such as the use of effect sizes and corrections for multiple tests.




