Google's Big Data Flu Flop

This article in Science finds that the real-world predictive power of Google Flu Trends (GFT) has been pretty unimpressive. And the reasons behind this failure are not hard to understand, nor were they hard to predict. Anyone who's ever worked with clinical trial data will see this one coming:

The initial version of GFT was a particularly problematic marriage of big and small data. Essentially, the methodology was to find the best matches among 50 million search terms to fit 1152 data points. The odds of finding search terms that match the propensity of the flu but are structurally unrelated, and so do not predict the future, were quite high. GFT developers, in fact, report weeding out seasonal search terms unrelated to the flu but strongly correlated to the CDC data, such as those regarding high school basketball. This should have been a warning that the big data were overfitting the small number of cases—a standard concern in data analysis. This ad hoc method of throwing out peculiar search terms failed when GFT completely missed the nonseasonal 2009 influenza A–H1N1 pandemic.

The Science authors have a larger point to make as well:

"Big data hubris" is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis. Elsewhere, we have asserted that there are enormous scientific possibilities in big data. However, quantity of data does not mean that one can ignore foundational issues of measurement and construct validity...
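The mechanism behind those spurious matches is easy to demonstrate. The sketch below is a synthetic toy in Python with NumPy, not GFT or CDC data, and the candidate pool is scaled far down from 50 million terms: it screens thousands of seasonal-but-unrelated series against a noisy seasonal "flu" target and counts how many correlate strongly purely because their timing happens to line up, which is the high-school-basketball problem the GFT developers reported weeding out by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative scale only: 1152 weekly observations (as in the GFT fit) and a
# candidate pool scaled down from ~50 million search terms so the demo runs fast.
n_weeks = 1152
n_terms = 10_000

weeks = np.arange(n_weeks)
annual = 2 * np.pi * weeks / 52.0

# Synthetic "flu activity": an annual cycle plus noise (not CDC data).
flu = np.sin(annual) + 0.3 * rng.standard_normal(n_weeks)

# Candidate predictors: seasonal series with random phases plus noise --
# stand-ins for queries like "high school basketball" that track the calendar
# but have no causal link to influenza.
phases = rng.uniform(0.0, 2.0 * np.pi, size=n_terms)
terms = np.sin(annual[None, :] + phases[:, None]) \
        + 0.3 * rng.standard_normal((n_terms, n_weeks))

# Pearson correlation of every candidate series with the target.
flu_z = (flu - flu.mean()) / flu.std()
terms_z = (terms - terms.mean(axis=1, keepdims=True)) / terms.std(axis=1, keepdims=True)
corrs = terms_z @ flu_z / n_weeks

print("unrelated seasonal terms with |r| > 0.8:", int((np.abs(corrs) > 0.8).sum()))
print("strongest spurious correlation:", round(float(np.abs(corrs).max()), 3))
```

With 50 million real candidates instead of a few thousand synthetic ones, the pool of accidental matches only gets deeper, which is why hand-pruning the oddest terms was never going to hold up out of sample.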
Source: In the Pipeline