Is Your Data Lying to You?
Data has never mattered more. The internet of things is digitizing everything from your light bulbs to your toilet, producing trillions of bits of data every day. Social media, credit card transactions, toll-booth traffic: everything makes data. With so much raw information, big data analysis has exploded, and companies everywhere are seeking ways to turn this data into a product, a platform, or a competitive advantage.
But in the age of big data, many businesses face two fundamental questions: “How much do I trust the data?” and “What do I do with it?”
The first question is pretty straightforward: trust the data, question the analysis.
What’s true, what’s a lie, and what’s statistics?
A 2009 survey from the University of Edinburgh found that more than a third of scientists admitted to, let’s call them, “questionable” research practices, including withholding data to alter results or simply altering it because of “gut feelings.” If you agree with Daniel Kahneman’s point of view that gut feelings are subject to subconscious biases and therefore often wrong, you’re probably experiencing your own gut feeling: a sick-to-your-gut feeling.
But wait! Did you see what just happened?
If you’re like most people, you didn’t think to stop and question the stat I pulled from that survey. You simply took the information as presented, trusting the scientists who conducted the survey, and thus the narrative the claim supported. You likely didn’t pause to question how this survey was conducted: How large was the sample? How were the results benchmarked against the entire scientific community? How was the data adjusted to overcome any sample bias? Was the survey conducted in person? By email? With multiple-choice questions? Free-form answers?
Businesses should think hard about the data they employ to understand their customers and markets. Even if numbers don’t lie, the systems that deliver those numbers are human-made. You can’t conduct your own independent statistical analysis every time you’re presented with information. So you have to operate in a world where insights resulting from data may be accidentally, or even intentionally, skewed.
Now, if you’ve taken a Statistics 101 class, then you’re aware of the problems with analyzing data. And you learned how researchers and data scientists are supposed to combat faulty analysis. So while it appears some may take liberties with these standards, many others respect the limits of data and work diligently to protect their conclusions from faulty analysis.
At Jumpshot, for example, we analyze millions of data points of online behavior every day to provide insights into internet-wide consumer behavior. Our clients use this data to make major strategic decisions and optimize their advertising campaigns. But we know that our broad and diverse and powerful data set has bias. All samples do. The panel can under- or over-represent some users, skewing the data and therefore the results.
To correct for this, we weight our data according to known traffic patterns reported by a broad swath of domains. We then use machine learning models and constantly tweaked algorithms to understand patterns in our panelists’ behavior in context and how that behavior relates to total traffic online. Then, to represent the internet population’s behavior as a whole, our data science team uses those patterns to generate statistical calibrations for every domain, country, and platform (desktop or mobile) we report on, every day.
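To make the idea of weighting concrete, here is a minimal sketch of one textbook approach, post-stratification, in Python. It is not Jumpshot's actual method, which the article describes only at a high level. The panel, the platform shares, and the function names are all hypothetical and chosen purely for illustration.

```python
from collections import Counter


def poststratification_weights(panel_groups, population_share):
    """Return one weight per group: population share / panel share."""
    counts = Counter(panel_groups)
    total = len(panel_groups)
    return {
        group: population_share[group] / (count / total)
        for group, count in counts.items()
    }


def weighted_rate(panelists, weights):
    """Weighted share of panelists who visited a (hypothetical) site."""
    visited_weight = sum(weights[group] for group, visited in panelists if visited)
    total_weight = sum(weights[group] for group, _ in panelists)
    return visited_weight / total_weight


# Hypothetical panel of (platform, visited_site) pairs.
# Desktop users are over-represented: 80% of the panel.
panelists = (
    [("desktop", True)] * 60
    + [("desktop", False)] * 20
    + [("mobile", True)] * 5
    + [("mobile", False)] * 15
)

# Assumed "known" population split, e.g. from an external benchmark.
population_share = {"desktop": 0.4, "mobile": 0.6}

weights = poststratification_weights(
    [group for group, _ in panelists], population_share
)
raw = sum(visited for _, visited in panelists) / len(panelists)
adjusted = weighted_rate(panelists, weights)

print(f"raw visit rate:      {raw:.2f}")
print(f"weighted visit rate: {adjusted:.2f}")
```

In this made-up panel the raw visit rate is 0.65, but once each group is weighted to match its assumed share of the population, the estimate drops to 0.45. The over-represented desktop users were pulling the unweighted number up.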
The process, which you can read more about, helps ensure that our raw data doesn’t fall into the traps outlined above. We work hard to uncover how our data is biased, and in doing so, we understand how to control for that bias.