In the slow process of developing machine learning models, data scientists and data engineers need to work together, yet they often work at cross purposes. As ludicrous as it sounds, I’ve seen models take months to get to production because the data scientists were waiting for data engineers to build production systems to suit the model, while the data engineers were waiting for the data scientists to build a model that worked with the production systems.
A previous article by VentureBeat reported that 87% of machine learning projects don’t make it into production, and that a combination of data concerns and lack of collaboration was a primary factor. On the collaboration side, the tension between data engineers and data scientists, and the way they work together, can lead to unnecessary frustration and delays. While team alignment and empathy building can alleviate these tensions, adopting some of the developing MLOps technologies can help address the issues at their root.
Scoping the Problem
Before we dive into solutions, let’s lay out the problem in more detail. Scientists and engineers (data and otherwise) have always been like cats and dogs, oil and water. A simple web search of “scientists vs engineers” will lead you to a lengthy debate about which group is more prestigious. Engineers are tasked with construction, operation and maintenance, so they focus on the simplest, most efficient and reliable systems possible. On the other hand, scientists are tasked with doing whatever it takes to build the most accurate models, so they want access to all the data, and they want to manipulate it in unique, sophisticated ways.
Instead of fixating on the differences, I find it’s much more productive to acknowledge they’re both immensely valuable and to think about how we can use each of their talents to the fullest capacity. By focusing on the things that unify data scientists and data engineers — a dedication to timely, quality information and well-designed systems — the two sides can foster a more collaborative environment. And by understanding each other’s pain points, the two teams can build empathy and understanding to make working together easier. There are also emerging tools and systems that can help bridge the gap between these two camps and help them meet more readily in the middle.
MLOps is an emerging area that applies the ideas and principles of DevOps practices to the data science and machine learning ecosystem. It lifts the burden of building and maintenance off of data engineers, while providing flexibility and freedom for data scientists. This is a win-win solution. Let’s take a look at some common problems, and the tools that are emerging to more effectively solve them.
Model orchestration. The first major hurdle when trying to put a model into production is deployment: where to build it, how to host it, and how to manage it. This is largely an engineering problem, so when you have a team of data scientists and data engineers, it typically falls to the data engineers.
Building this system takes weeks, if not months, and that is time the data or ML engineers could have spent improving data flows or improving models. Model orchestration platforms standardize model deployment frameworks and help make this step significantly easier. While companies like Facebook can invest resources in platforms like FBLearner to handle model orchestration, this is less feasible for smaller or emerging companies. Thankfully, open source systems have started to emerge to handle the process, notably MLflow and Kubeflow. Both of these systems use containerization to help manage the infrastructure side of model deployment.
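To make "standardizing deployment" a little more concrete, here is a minimal sketch in plain Python. It is not the API of MLflow, Kubeflow or FBLearner. The names Model, ModelRegistry, ConstantModel and LinearModel are hypothetical and exist only for illustration. The idea it shows is that once every model honors one small contract, the serving side can be built once and data scientists can swap versions in without new engineering work.

```python
from typing import Dict, Protocol, Tuple


class Model(Protocol):
    """Hypothetical contract: anything with this predict method can be served."""

    def predict(self, features: Dict[str, float]) -> float: ...


class ConstantModel:
    """Baseline that ignores its inputs."""

    def __init__(self, value: float) -> None:
        self.value = value

    def predict(self, features: Dict[str, float]) -> float:
        return self.value


class LinearModel:
    """Bias plus a weighted sum of the named features (missing ones count as 0)."""

    def __init__(self, weights: Dict[str, float], bias: float) -> None:
        self.weights = weights
        self.bias = bias

    def predict(self, features: Dict[str, float]) -> float:
        return self.bias + sum(
            w * features.get(name, 0.0) for name, w in self.weights.items()
        )


class ModelRegistry:
    """Toy registry: versioned models, one of which is marked as production."""

    def __init__(self) -> None:
        self._versions: Dict[Tuple[str, int], Model] = {}
        self._production: Dict[str, int] = {}

    def register(self, name: str, model: Model) -> int:
        # Versions count up from 1 independently for each model name.
        latest = max((v for (n, v) in self._versions if n == name), default=0)
        self._versions[(name, latest + 1)] = model
        return latest + 1

    def promote(self, name: str, version: int) -> None:
        if (name, version) not in self._versions:
            raise KeyError(f"unknown model version: {name} v{version}")
        self._production[name] = version

    def predict(self, name: str, features: Dict[str, float]) -> float:
        # Callers never need to know which version is live.
        return self._versions[(name, self._production[name])].predict(features)


registry = ModelRegistry()
v1 = registry.register("churn", ConstantModel(0.5))
v2 = registry.register("churn", LinearModel({"tenure": 0.25}, bias=1.0))
registry.promote("churn", v2)
score = registry.predict("churn", {"tenure": 4.0})
```

A real platform adds packaging, hosting and rollback on top of this, but the division of labor is the same: engineers own the registry and the serving path, and data scientists own whatever sits behind predict.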
Feature stores. The second major hurdle to taking a model from the lab to production lies with the data. Oftentimes, models are trained using historical data housed in a data warehouse but queried with data from a production database. Discrepancies between these systems cause models to perform poorly or not at all and often require significant data engineering work to re-implement things in the production database.
I’ve personally spent weeks building out and prototyping impactful features that never made it to production because the data engineers didn’t have the bandwidth to productionize them. Feature stores, or data stores built specifically to support the training and productionization of machine learning models, are working to alleviate this issue by ensuring that data and features built in the lab are immediately production-ready. Data scientists have the peace of mind that their models are getting built, and data engineers don’t have to worry about keeping two different systems perfectly in line. Larger corporations like Uber and Airbnb have built their own feature stores (Michelangelo and Zipline, respectively), and vendors that sell pre-built solutions have also emerged. Logical Clocks, for example, offers a feature store for its Hopsworks platform. And my team at Kaskada is building a feature store for event-based data.
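The core idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not how Michelangelo, Zipline, Hopsworks or Kaskada actually work: Purchase, FeatureRegistry and spend_last_7_days are hypothetical names, and the events live in a list rather than a warehouse or a production database. What it illustrates is that a feature is defined once, and both the training path and the serving path call that same definition, so the two cannot drift apart.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Purchase:
    """Hypothetical raw event: one purchase by one user."""

    user_id: str
    amount: float
    day: int  # day number on some shared calendar


FeatureFn = Callable[[List[Purchase], int], float]


class FeatureRegistry:
    """Toy stand-in for a feature store: one definition, two read paths."""

    def __init__(self) -> None:
        self._features: Dict[str, FeatureFn] = {}

    def register(self, name: str, fn: FeatureFn) -> None:
        self._features[name] = fn

    def compute(
        self, events: List[Purchase], user_id: str, as_of_day: int
    ) -> Dict[str, float]:
        # Only events strictly before as_of_day are visible, so a training
        # row built for a past day cannot peek at later data.
        visible = [e for e in events if e.user_id == user_id and e.day < as_of_day]
        return {name: fn(visible, as_of_day) for name, fn in self._features.items()}


def spend_last_7_days(events: List[Purchase], as_of_day: int) -> float:
    """Total spend on events at most 7 days before as_of_day."""
    return sum(e.amount for e in events if as_of_day - e.day <= 7)


registry = FeatureRegistry()
registry.register("spend_last_7_days", spend_last_7_days)

events = [
    Purchase("u1", 10.0, day=1),
    Purchase("u1", 5.0, day=6),
    Purchase("u1", 20.0, day=9),
    Purchase("u2", 7.0, day=8),
]

# Training path: rebuild the feature as it looked on a past day.
training_row = registry.compute(events, "u1", as_of_day=8)

# Serving path: the same definition, evaluated as of "today" (day 10 here).
serving_row = registry.compute(events, "u1", as_of_day=10)
```

Because the data scientist registers spend_last_7_days once, there is nothing for a data engineer to re-implement against the production database later.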
DataOps. There’s no experience quite like getting paged late at night because your model is behaving strangely. After briefly checking the model service, you come to the inevitable conclusion: something has changed with the data.
I’ve had variations on the following conversation more times than I like to admit:
- Data Engineer: “Your model is throwing errors. Why is it broken?”
- Data Scientist: “It’s not, the data stream is broken and needs to be fixed.”
- Data Engineer: “OK, let me know which data stream and I can fix it.”
- Data Scientist: “I don’t know where the problem is, just that there is one.”
Finding the issue is like finding a needle in a haystack. Fortunately, new frameworks and tools are coming into place that set up monitoring and testing for data and data sources and can save valuable time. Great Expectations is one of these emerging tools to improve how databases are built, documented, and monitored. Databand.ai is another company entering the data pipeline monitoring space; in fact, the company has published a blog post that explores in greater detail why traditional pipeline monitoring solutions don’t work for data engineering and data science.
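As a rough illustration of what this kind of data testing buys you, here is a minimal sketch in plain Python. It deliberately does not use the Great Expectations or Databand.ai APIs. The helpers not_null, in_range and validate, and the "signups" stream, are hypothetical names made up for this example. The point is that every failure message names the stream, the row and the column, which is exactly the answer the data scientist in the conversation above could not give.

```python
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]
# A check returns an error message for a bad row, or None if the row passes.
Check = Callable[[Row], Optional[str]]


def not_null(column: str) -> Check:
    def check(row: Row) -> Optional[str]:
        return f"{column} is null" if row.get(column) is None else None

    return check


def in_range(column: str, low: float, high: float) -> Check:
    def check(row: Row) -> Optional[str]:
        value = row.get(column)
        if value is None:
            return None  # missing values are the job of not_null
        if not low <= value <= high:
            return f"{column}={value} outside [{low}, {high}]"
        return None

    return check


def validate(stream_name: str, rows: List[Row], checks: List[Check]) -> List[str]:
    """Run every check on every row and collect located failure messages."""
    failures: List[str] = []
    for index, row in enumerate(rows):
        for check in checks:
            message = check(row)
            if message is not None:
                failures.append(f"{stream_name}[{index}]: {message}")
    return failures


rows = [
    {"user_id": "u1", "age": 34},
    {"user_id": None, "age": 29},
    {"user_id": "u3", "age": 212},
]
checks = [not_null("user_id"), in_range("age", 0, 120)]
failures = validate("signups", rows, checks)
for failure in failures:
    print(failure)
```

Run at the point where data enters the pipeline, checks like these turn "something has changed with the data" into a specific stream and column that a data engineer can go fix.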
By using tools to reduce the complexity of asks and by growing empathy and trust between data scientists and data engineers, data scientists can be empowered to deliver without overly burdening data engineers. Both teams can focus on what they do best and what they enjoy about their jobs, instead of fighting with each other. These tools can help turn a combative relationship into a collaborative one where everyone ends up happy.
Max Boyd is a Data Science Lead at Kaskada. He has built and deployed models as a Data Scientist and Machine Learning Engineer at several Seattle-area tech startups in HR, finance and real estate.