Over the last year, I’ve spoken with hundreds of employers interested in hiring data scientists, particularly data scientists with advanced degrees. Many employers and hiring managers have heard that big data is the “hot new thing.” But as with all “hot new things,” there is as much misinformation about data science as there is fact. Here are three misconceptions about big data and data science that I often encounter:
1. Big data is statistics and business intelligence with more data. There’s nothing new here.
This is a view often held by those with limited or no software development experience, and it is plainly false. The perfect analogy for this is ice. Ice is just cold water, right? There’s nothing new here. However, cooling down water doesn’t just change a quantitative property (temperature); it drastically changes its qualitative properties (transforming a liquid into a solid). The same can be said of more data. Big data strains and ultimately breaks the old paradigms of computation. With big data, the data no longer fits into RAM and the traditional BI calculations would take years to complete. Parallelization and distributed computation are the obvious answers to scaling, but they are not always easy: even a simple statistical tool like logistic regression does not easily parallelize. Distributed statistical computation is as different from traditional business analytics as ice is from water.
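One way to read the logistic regression claim is sketched below in Python with NumPy. This is an illustration under stated assumptions, not a description of any particular production system: the data is synthetic, the "shards" are just slices of one in-memory array standing in for separate machines, and the names `shard_gradient` and `fit_sharded` are hypothetical. A mean can be assembled from one pass of per-shard partial sums, whereas logistic regression has no closed-form solution, so every gradient step needs results from all shards before the next step can begin.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration only).
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = (rng.random(1000) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def shard_gradient(X_shard, y_shard, w):
    """Un-normalized log-loss gradient computed on a single shard."""
    return X_shard.T @ (sigmoid(X_shard @ w) - y_shard)


def fit_sharded(shards, n_features, n_rows, lr=0.5, n_iters=200):
    """Gradient descent where each step sums gradients from every shard."""
    w = np.zeros(n_features)
    for _ in range(n_iters):
        # Synchronization point: all shards must report before w can move.
        grad = sum(shard_gradient(Xs, ys, w) for Xs, ys in shards)
        w -= lr * grad / n_rows
    return w


# Four slices stand in for four machines.
shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))

# A mean needs only one round of partial sums.
sharded_mean = sum(Xs.sum(axis=0) for Xs, _ in shards) / len(X)

# Logistic regression needs one such round per iteration.
w_sharded = fit_sharded(shards, X.shape[1], len(y))
w_single = fit_sharded([(X, y)], X.shape[1], len(y))

print("sharded mean matches:", np.allclose(sharded_mean, X.mean(axis=0)))
print("sharded weights match:", np.allclose(w_sharded, w_single, atol=1e-6))
```

The sharded fit reaches the same weights as the single-machine fit, but only by paying for 200 rounds of communication instead of one. That repeated coordination is one source of the difficulty.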
2. Data scientists are just rebranded software engineers.
Sometimes engineers with strong software development backgrounds will rebrand as data scientists for the salary premium. This can lead to subpar results. At the simplest level, debugging stats bugs becomes much harder. Engineers are trained to spot and solve programming bugs. But without a solid background in probability and statistics, they often have a hard time solving statistical bugs. Your code might be just fine but if you didn’t reweight your training examples correctly, your predictions will be off.
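To make the reweighting point concrete, here is a small Python sketch using NumPy. Everything in it is an assumption chosen for illustration: a synthetic population with roughly 10 percent positives, and a training set built by keeping every positive but only 10 percent of the negatives, which is a common way to shrink a dataset. The code runs without error either way; only the reweighted estimate recovers the true rate.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed population: about 10% of examples are positive.
population = rng.random(100_000) < 0.10

# Build a training set by keeping all positives and 10% of negatives.
negative_keep_rate = 0.10
keep_prob = np.where(population, 1.0, negative_keep_rate)
kept = rng.random(population.size) < keep_prob
train_labels = population[kept].astype(float)

# The "statistical bug": treating the downsampled set as representative.
naive_rate = train_labels.mean()

# The fix: weight each kept negative by the inverse of its keep rate.
weights = np.where(train_labels == 1.0, 1.0, 1.0 / negative_keep_rate)
weighted_rate = np.average(train_labels, weights=weights)

print(f"true positive rate:     {population.mean():.3f}")
print(f"naive estimate:         {naive_rate:.3f}")
print(f"reweighted estimate:    {weighted_rate:.3f}")
```

No exception is raised and no unit test on the code path fails in the naive version, which is what makes this kind of bug hard to spot without statistical training.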
At a higher level, engineers are well trained to build simple, discrete, rules-based models. But these models are ill-suited to deriving the more subtle insights from continuous-valued data, and they leave money on the table. Solid statistical chops are necessary to overcome these challenges and build the next generation of scalable predictive models.
3. Data scientists don’t need to understand the business; the data will tell you everything.
People with machine-learning backgrounds often succumb to this one, in part because machine learning is so powerful. But it is not omnipotent. Searching for all possible correlations is time-consuming, not to mention statistically problematic. Data scientists need to be guided by business intuition to help them distinguish between spurious correlations and real ones. Lack of domain expertise can lead to ill-founded conclusions (“more police officers lead to higher crime rates”) that prompt bad policy recommendations (“cut the policing staff in high-crime neighborhoods”). Finally, having business intuition is also important for convincing key stakeholders. These stakeholders might not be data scientists but are usually domain experts: talking about your correlations in a language they can understand is key to getting the kind of institutional buy-in that is necessary for data science to achieve its promise.
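Why searching every possible correlation is statistically problematic can be seen in a short Python sketch with NumPy. The setup is an assumption made for illustration: 50 observations, a target that is pure noise, and 1,000 candidate features that are also pure noise, so any correlation found is spurious by construction.

```python
import numpy as np

rng = np.random.default_rng(7)

n_obs, n_features = 50, 1000

# Target and features are independent noise: no real relationship exists.
y = rng.normal(size=n_obs)
X = rng.normal(size=(n_obs, n_features))

# Pearson correlation of every feature with the target.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
y_std = (y - y.mean()) / y.std()
correlations = X_std.T @ y_std / n_obs

# Rough 5% significance cutoff for a single test at this sample size.
threshold = 1.96 / np.sqrt(n_obs)
n_flagged = int((np.abs(correlations) > threshold).sum())
strongest = float(np.abs(correlations).max())

print(f"features flagged as 'significant': {n_flagged} of {n_features}")
print(f"strongest spurious correlation:    {strongest:.2f}")
```

Dozens of features clear the usual significance bar purely by chance. Nothing in the data distinguishes them from real effects, which is where knowledge of the business has to come in.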
Big data and data science are about building the right models, and that takes the right combination of engineering, statistical, and business skills. Without all three, your data scientists will not be able to achieve everything they set out to do.
Michael Li is founder and executive director of data science fellowship program The Data Incubator. He was formerly a data science lead at both Foursquare and Andreessen Horowitz and spent time as a NASA researcher and Wall Street quant. You can follow him on Twitter @tianhuil.