As machine learning evolves, we need to update the definition of 'data scientist'

In the early days of machine learning, hiring good statisticians was the key challenge for AI projects. Now, machine learning has evolved from its early focus on statistics to more emphasis on computation. As the process of building algorithms has become simpler and the applications for AI technology have grown, human resources professionals in AI face a new challenge. Not only are data scientists in short supply, but what makes a successful data scientist has changed.

Divergence between statistical models and neural networks

As recently as six years ago, there were minimal differences between statistical models (usually logistic regressions) and neural networks. The neural network had a slightly larger separation capacity (statistical performance) at the cost of being a black box. Since they had similar potential, the choice of whether to use a neural network or a statistical model was determined by the requirements of each scenario and by the type of professional available to create the algorithm.

More recently, though, neural networks have evolved to support many layers. This deep learning allows for, among other things, effective and novel exploitation of unstructured data such as text, voice, images, and videos. Increased processing capacity, image identifiers, simultaneous translators, text interpreters, and other innovations have set neural networks further apart from statistical models. With this evolution comes the need for data scientists with new skills.

Unchanging elements of building algorithms

Despite the changes in algorithm structures and capabilities, the process of constructing high-quality predictive models still follows a series of steps that hasn't changed much. More important than the fit and method used is the ability to perform each step of this process efficiently and creatively.

Process to build a supervised algorithm

Field interviews. Data scientists are not usually experts in the subject they are working on. Instead, they are experts on the accuracy and precision required to create the algorithms for various corporate or academic decision-making processes. However, the requirement today is that data scientists develop an understanding of the problem the algorithm is meant to solve, so interviews with subject matter experts focused on that particular problem are essential. Now, data scientists can work on neural networks that span a broad range of knowledge areas, from predicting the mortality of African butterflies to deciding when and where to publish advertising for seniors. This means that today's data scientists must be able and eager to learn from experts on many subjects.

Understanding the problem. Each prediction hinges on a wealth of factors, all of which the data scientist must know about in order to understand the causal relationships among them. For example, to predict which applicants will default on their loans, the data scientist must know to ask questions such as:

Why do people default?
Are they planning to default when they apply?
Do defaulters have outsize debt relative to their income?
Is there fraud in the application process?
Is there sales pressure to apply for the loan?

These are some of the many questions to ask on this topic, and there are long lists of questions for every machine learning process. A data scientist who only wants to create algorithms without talking in depth with those involved in the phenomenon being explored will have a limited ability to create effective algorithms.

Identifying relevant information. As a data scientist sifts through the answers to these types of questions, he or she must also be skilled at picking out the information that may explain the phenomenon. A well-trained, inquisitive data scientist will also seek out related data online via search, crawler, and API to pinpoint the most relevant predictive factors.

Sampling. Statistical knowledge — on top of computational knowledge, experience, and judgment — matters for the definition of the response variable, the separation of the database, the certification of past data use, the separation of data between adjustment, validation and testing, and other sampling steps. However, the computational approach supports the use of the ever-larger databases that are required for the construction of complex algorithms. Therefore, both statistical and computational skill sets are a must for today’s data scientists.
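As a rough illustration of the separation described above, the sketch below divides a data set into adjustment, validation, and test sets while keeping the proportion of each response class roughly equal across the three. It assumes that "adjustment" means the data used to fit the model (often called training data). The function name, the 60/20/20 fractions, and the loan records are hypothetical and chosen only for the example.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split rows into adjustment (training), validation, and test sets.

    Each class of the response variable is shuffled and split separately,
    so the three sets keep roughly the same class proportions.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Group the rows by the value of the response variable.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, validation, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_val = round(len(group) * fractions[1])
        train.extend(group[:n_train])
        validation.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # whatever remains
    return train, validation, test


# Hypothetical loan data in which 10% of applicants defaulted.
loans = [{"id": i, "defaulted": i % 10 == 0} for i in range(1000)]
train, validation, test = stratified_split(loans, "defaulted")
```

Splitting each class separately matters most when the response is rare, as defaults usually are: a purely random split could leave the validation or test set with too few defaulters to evaluate the model reliably.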

Variable work. This is the only step that's been eliminated in the transition from the statistical approach to the computational one, and with my background in statistics, I miss it. The artisanal design of variables is an extremely creative stage that also generates a lot of insight about the phenomenon under study. With neural networks, this step is no longer necessary, but its elimination puts more responsibility on the data scientist to understand the phenomenon under study.

Adjustment and assessment. This step has been transformed, requiring more connectivity and effort than academic technical knowledge. In the computational approach, adjustments and evaluations are primarily based on community research plus trial and error. With the impossibility of a mathematical understanding of the causal relationship implicit in the equations, professionals should know how to search communities for network architectures that best fit their activities. Once they find something applicable, trial and error takes over until there's a satisfactory explanation of the phenomenon.

Implementation. In this step, the data scientist's IT knowledge and rapport with subject matter experts are critical. All those APIs, internal data extractions, and crawlers aren't easy to deploy with precision and stability and without errors. For example, if a crawler was used, it must run without production errors in the future, and if the source changes, the crawler will need maintenance. More than an algorithm, today's data scientist designs new applications that must be monitored and maintained.
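One simple way to notice that a source has changed is to check every record a crawler returns against the fields the algorithm expects before the record reaches the model. The sketch below is a minimal, hypothetical example of such a check. The field names and types are assumptions made for illustration and do not describe any particular system.

```python
# Hypothetical fields the model expects from the crawler, with their types.
REQUIRED_FIELDS = {"applicant_id": str, "monthly_income": float}


def find_schema_problems(record, required=REQUIRED_FIELDS):
    """Return a list of human-readable problems found in one crawled record."""
    problems = []
    for name, expected_type in required.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            problems.append(
                f"{name}: expected {expected_type.__name__}, got {actual}"
            )
    return problems


# A well-formed record produces no problems.
ok = find_schema_problems({"applicant_id": "a1", "monthly_income": 3200.0})

# A record from a changed source: the income field has disappeared.
broken = find_schema_problems({"applicant_id": "a2"})
```

A monitoring job could count how many records report problems each day and alert the team when that count rises, which is usually the first sign that the crawler needs maintenance.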

Based on the new requirements for each step, it's clear that thoroughness, creativity, and holistic vision are now the hallmarks of a great data scientist, much more than expertise in linear algebra. That doesn't rule out experienced statisticians, of course. They often adapt easily to these changes, delving deeper into IT and its languages and architectures. The computational school also creates professionals fully capable of performing well, as long as they combine research and understanding of the problem with the ability to think probabilistically.

Traditionalists may insist that statisticians make the best data-science hires. But I believe that curiosity, a breadth of academic knowledge, and the willingness to engage with others in the pursuit of information are more important to the role of the modern data scientist than statistical training, because neural network creation requires a focus far broader than the algorithms themselves.

Bernardo Lustosa is Partner, cofounder, and COO at ClearSale.