Back in April, we ran into an opinion piece by Gary Smith titled "Believe in Science? Bad Big-Data Studies May Shake Your Faith." (Yes, I do save up articles in my Blog Idea Bank. Sometimes they're too out of date by the time I get around to them, but often they are still relevant.) The link is to the original article at Bloomberg, but if you've hit your free three-article limit, you should be able to find it in many other places, including the New York Times and our own Orlando Sentinel.
Smith's thesis is that the current availability of so much data that can be mined, refined, poked, prodded, and manipulated is making for bad science. There's some good science, too, but a lot that is simply nonsense—and we don't know which is which.
The short article is worth reading in its entirety, but here are some quotes to pique your interest.
The cornerstone of the scientific revolution is the insistence that claims be tested with data, ideally in a randomized controlled trial. ... Today, the problem is not the scarcity of data, but the opposite. We have too much data, and it is undermining the credibility of science.
Luck is inherent in random trials. ... Researchers consequently calculate the probability (the p-value) that the outcomes might happen by chance. A low p-value indicates that the results cannot easily be attributed to the luck of the draw. [This number was arbitrarily decided in the 1920s to be 5%.] ... The “statistically significant” certification needed for publication, funding and fame ... is not a difficult hurdle. Suppose that a hapless researcher calculates the correlations among hundreds of variables, blissfully unaware that the data are all, in fact, random numbers. On average, one out of 20 correlations will be statistically significant, even though every correlation is nothing more than coincidence.
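Smith's one-in-20 claim is easy to check for yourself with a quick simulation. The sketch below (my own illustration, not from the article; the sample sizes and the 5% threshold are arbitrary choices) correlates pairs of columns of pure random noise and counts how many come out "statistically significant":

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_vars, n_obs = 50, 100
data = rng.standard_normal((n_obs, n_vars))  # pure noise, no real relationships

# Test every pair of columns for a correlation
p_values = []
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        _, p = stats.pearsonr(data[:, i], data[:, j])
        p_values.append(p)

significant = sum(p < 0.05 for p in p_values)
print(f"{significant} of {len(p_values)} pairs 'significant' at p < 0.05")
# Roughly 5% of the 1,225 pairs clear the bar, despite the data being random
```

Run it with different seeds and the count wobbles, but it hovers around 5% every time, which is exactly the point.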
All too often, [researchers] correlate what are essentially randomly chosen variables. This haphazard search for statistical significance even has a name: data mining. As with random numbers, the correlation between randomly chosen, unrelated variables has a 5% chance of being fortuitously statistically significant. Data mining can be augmented by manipulating, pruning and otherwise torturing the data to get low p-values. To find statistical significance, one need merely look sufficiently hard. Thus, the 5% hurdle has had the perverse effect of encouraging researchers to do more tests and report more meaningless results.
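The arithmetic behind "look sufficiently hard" is stark. If each unrelated test has a 5% chance of a fluke significance, the chance that at least one of n independent tests comes up "significant" is 1 − 0.95ⁿ (a back-of-the-envelope calculation of my own; real tests are rarely fully independent, which only changes the numbers, not the moral):

```python
# Probability of at least one false positive among n independent
# tests, each run at the 5% significance level: 1 - 0.95**n
for n in (1, 5, 20, 100):
    p_any = 1 - 0.95 ** n
    print(f"{n:>3} tests -> {p_any:.0%} chance of at least one 'significant' result")
# prints 5%, 23%, 64%, and 99% respectively
```

With 20 tests you are more likely than not to "find" something; with 100, it's a near certainty.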
A team led by John Ioannidis looked at attempts to replicate 34 highly respected medical studies and found that only 20 were confirmed. The Reproducibility Project attempted to replicate 97 studies published in leading psychology journals and confirmed only 35. The Experimental Economics Replication Project attempted to replicate 18 experimental studies reported in leading economics journals and confirmed only 11.
It is tempting to believe that more data means more knowledge. However, the explosion in the number of things that are measured and recorded has magnified beyond belief the number of coincidental patterns and bogus statistical relationships waiting to deceive us.