In Reply: We are in complete agreement with the comments of Drs. Pandiani and Banks and with many of those of Dr. Segal. They have described the other side of the same coin. Theirs is the more commonly viewed side, the one that inspires numerous and increasing efforts to make use of existing large data sets, and the one that is presumably familiar to most readers of the journal. We were asked by the editor of Psychiatric Services to illuminate the dark side precisely because it is less well known.
Our contention is that large data sets "can be" dangerous, not that they are inherently dangerous. Our goal was not to stigmatize research using large data sets but rather to remind the scientific community of the frequently overlooked limitations of large data sets and of the seductive ways that they can lead investigators astray. A parallel editorial in the March 2003 issue of Scientific American suggests that the same concerns are pertinent in other areas of science (1). As the editors of Scientific American point out, the dangers of information overload, poor data quality, and capitalization on chance abound.
Often where there is opportunity there is liability. The use of large data sets presents many opportunities for the advancement of knowledge that is relevant for practice and policy, but it also requires careful attention to data quality and the disciplined application of statistical and inferential methods. The warnings in our editorial address the latter issues, which, judging from the journal's experience with manuscripts submitted for publication and from our own experience as peer reviewers, appear to be less salient to some users of large data sets. We share the optimism of Drs. Pandiani and Banks about the potential of large data sets, and we wish that articles sent for review showed the same admirable attention to quality.