When Data Science Goes Wrong
In the past week, I came across two interesting articles about bad or manipulated data. The first was about a “famous” Cornell food researcher whose three-decade career is being called into question after a batch of uncovered emails suggested that he and his research team were manipulating data to reach a certain p-value, the threshold at which findings can be considered statistically significant. It sounded like he wasn’t changing the numbers but was slicing and dicing the data in different ways until a relationship turned up.
I saw the second article within the same week, which is an amazing coincidence since I don’t normally see many articles about bad data. It was about the false link between vaccination and autism that was published twenty years ago and is still influencing parents today: even though the link has since been debunked, some parents still won’t vaccinate their children for fear of it. The weaknesses of the study were numerous: only 12 subjects were in the study; it was not really a scientific study but a collection of medical histories or stories; the findings were never replicated; and data may have been altered.
What struck me about the first article, and prompted me to write this post, is the idea that slicing and dicing the data is bad science. Here’s the question on my mind: the Target story from a few years ago, about certain purchases predicting pregnancy, must have come from some kind of slicing-and-dicing analysis rather than from stating a hypothesis first and then testing it. Where does that fall? Is that okay? The store was conducting the analyses without publishing them as scientific discoveries, so I presume it is okay.
Most of the data science books that I have read (there have not been many) seem to describe two kinds of data science work in companies. In the first, a manager has a question about something and you do research to answer it. In the second, you are not really researching a question but performing a general fishing expedition to see if something strange pops up. I do a lot of the second kind, mainly out of curiosity, to see what is going on from a big-picture point of view. The fishing expedition helps me get a flavor of parts of the business, especially if I can’t talk to the people. Thinking it through logically, I believe it is okay to slice and dice the data to look for interesting patterns as long as you are not claiming the result as a scientific study. For a scientific study, you would have to follow rigorous scientific protocols to make sure the findings are solid and replicable before you can claim they are scientifically valid.
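To convince myself why slicing and dicing is risky when the goal is a significant p-value, I find a small simulation helpful. The sketch below is my own illustration, not the method used in either article: it compares pairs of groups drawn from pure random noise, so there is no real relationship to find, and counts how many comparisons still come out “significant.” The function names, the group sizes, and the use of a simple z-test approximation are all assumptions I made for the example.

```python
import math
import random
import statistics


def two_sided_p_value(group_a, group_b):
    """Approximate two-sided p-value for a difference in means (z-test).

    This is a large-sample approximation chosen for simplicity; a real
    analysis would more likely use a t-test from a statistics library.
    """
    mean_diff = statistics.fmean(group_a) - statistics.fmean(group_b)
    std_err = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    z = abs(mean_diff) / std_err
    # For a standard normal, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


def count_false_positives(num_slices=100, group_size=200, alpha=0.05, seed=0):
    """Count 'significant' results when every slice is pure noise.

    Each slice compares two groups drawn from the same distribution,
    so any p-value below alpha is a false positive by construction.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_slices):
        group_a = [rng.gauss(0, 1) for _ in range(group_size)]
        group_b = [rng.gauss(0, 1) for _ in range(group_size)]
        if two_sided_p_value(group_a, group_b) < alpha:
            hits += 1
    return hits


if __name__ == "__main__":
    slices = 100
    hits = count_false_positives(num_slices=slices)
    print(f"{hits} of {slices} noise-only slices had p < 0.05")
```

With a 0.05 threshold, roughly one slice in twenty should look significant by chance alone, so the more ways you cut the data, the more likely it is that something pops up. That seems fine for a fishing expedition where a pattern is only a lead to follow up on, but it is the reason a finding from this kind of search would need a fresh, pre-stated test before being called scientific.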