How to wrangle data and manage your AI pipeline

Rahul Singhal, who led IBM Watson products and now serves as chief product officer at Innodata, has a few strong beliefs about AI. One is that Google CEO Sundar Pichai is right: AI will have more impact on society than electricity. The other is a saying you've probably heard before: "garbage in, garbage out."

Managing an AI pipeline is all about the data, he believes. In fact, Singhal says 80% of every AI budget should be spent specifically on ensuring you have high-quality training data. Because Innodata, a data engineering company, specializes in creating and annotating datasets (and sometimes customized models) for its clients, he of course has a vested interest in this spend. But it's true that data is messy, difficult to obtain, and has historically been plagued by bias, a problem that widely affects the ethics and success of AI today. A model is nothing without good data.

VentureBeat recently chatted with Singhal for his thoughts on how enterprises can best approach data, launch AI initiatives, and manage AI pipelines. He also pulled back the curtain on the company's own processes and approach to bias and explainability.

This interview has been edited for brevity and clarity.

VentureBeat: Tell me about the AI journey Innodata has been on over the past few years. What made the company decide to incorporate more AI, and what were you looking to achieve with it?

Rahul Singhal: Innodata has been investing heavily in AI for the last six years, and we've built a lot of interesting AI models to automate the content transformation journey for our clients. Three years ago when the CEO, Jack Abuhoff, asked me to join Innodata, one of my premises for joining was that AI is not going to be successful if you don't have three key ingredients. First, you need to have lots of proprietary content, or access to proprietary content. Second, you need to have lots of subject matter experts and an ability to create pristine quality training data. And third, you need data scientists who are training these models and can then lead and build a large AI pipeline.

Innodata had two of those, so in 2019, we started our journey of building clean training data for AI and machine learning. And we've been fairly successful over the last two and a half years. We've truly transformed the business, taking our domain experts and processes to now serve a larger market of data scientists looking to build these kinds of models. So we use domain experts in financial services, social media companies, health care, pharma, and other large accounts where 80% of the projects were failing because of a lack of clean, annotated data.

VentureBeat: And when you're creating datasets and models for clients, what are the key steps?

Singhal: For any model we create, it starts with having content that is good enough for training. For example, we're working with a startup that is looking to train our AI model to ensure that a webcam in a high-security environment is able to recognize somebody taking a picture. So we needed to create a diverse pool of datasets with different ethnicities, objects, angles, and formats like cell phones and laptops.

The second step we help our clients with is, do you know what you're annotating on? This is really about the labels. If I'm predicting something for a client, do you have the right ontologies and taxonomies?

The third step, once you've got the content, is actually creating and annotating that content. When you think about the AI pipeline today, 90% of the work is done with supervised learning, so you do need to provide a large amount of annotated training data. And that's where we use a pool of 3,500 global experts and our processes. We built an annotation platform with arbitration built in, and that allows our teams and customers to look at that data and ensure it's been annotated with the right quality metrics.

And then the fourth step is building the model. Some of our clients want to build a model, so we'll give them the training data. Others want us to build it, so we bring in data scientists for the build as well.

VentureBeat: What about mitigating bias and building in explainability? How did these considerations come into play?

Singhal: Bias and explainability are both big problems. We use quality metrics like category distribution and different labels. We also have agreement and disagreement rates between annotators, which allows a data scientist to know which datasets need more data for accuracy and which are probably overfitting or underfitting the model. It's not really the definition of active learning, but it is kind of active learning, for lack of a better word.
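
To make the quality metrics mentioned in this answer concrete, here is a minimal sketch of how category distribution and annotator agreement rates could be computed. It is an illustration only, not Innodata's platform: the function names, the sample labels, and the 0.5 review threshold are all assumptions, and the agreement measure is simple pairwise agreement rather than a chance-corrected statistic.

```python
from collections import Counter
from itertools import combinations


def category_distribution(labels):
    """Return the share of each label among all labels given."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def pairwise_agreement(item_labels):
    """Fraction of annotator pairs that assigned the same label to one item."""
    pairs = list(combinations(item_labels, 2))
    if not pairs:
        return 1.0  # a single annotator trivially agrees with themselves
    return sum(a == b for a, b in pairs) / len(pairs)


def flag_for_review(annotations, threshold=0.5):
    """List items whose agreement falls below the (hypothetical) threshold."""
    return sorted(
        item
        for item, labels in annotations.items()
        if pairwise_agreement(labels) < threshold
    )


# Hypothetical data: three annotators label each document.
annotations = {
    "doc-1": ["cash", "cash", "cash"],
    "doc-2": ["cash", "debt", "equity"],
    "doc-3": ["debt", "debt", "cash"],
}

all_labels = [label for labels in annotations.values() for label in labels]
distribution = category_distribution(all_labels)
needs_review = flag_for_review(annotations)
```

A skewed `distribution` hints that some categories are underrepresented, while the items in `needs_review` are the ones annotators disagreed on and that might go to arbitration or need clearer labeling guidelines.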

And the real research our team is working on is around whether we can use algorithms to automatically identify what datasets to use. So if you have 100,000 records, can machine algorithms find those 10,000 documents that need to be annotated that will provide the highest value for model building? That's a very hot area our AI team is working on. And the way we think about it is that when we take data in, we're able to extract the metadata. And the way we make it explainable is through what we call "transparency to source." So if you have a financial statement, for example, and we've trained a model to identify how much cash is on hand, we're able to transparently show where that data point resides. So we're looking to make it more explainable from a model building perspective -- what data went in and how the model actually came to those predictions. That is where I think AI explainability is going. We're not there yet, but I think that's the journey we're on.
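
The interview does not say which algorithm the team uses to pick records for annotation. One common heuristic for this kind of selection is uncertainty sampling: rank records by how unsure a current model is about them, then annotate the most uncertain ones first. The sketch below assumes a model that already outputs class probabilities per record; the record IDs, the probabilities, and the `select_for_annotation` helper are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_annotation(predictions, budget):
    """Return the `budget` record IDs with the most uncertain predictions."""
    ranked = sorted(
        predictions,
        key=lambda record_id: entropy(predictions[record_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical model outputs: probability of each of two classes per record.
predictions = {
    "rec-a": [0.98, 0.02],
    "rec-b": [0.55, 0.45],
    "rec-c": [0.80, 0.20],
    "rec-d": [0.50, 0.50],
}

# With a budget of 2, the records the model is least sure about are chosen.
chosen = select_for_annotation(predictions, budget=2)
```

Scaled up, the same idea would take 100,000 scored records and a budget of 10,000, on the assumption that labels for the records the model finds hardest add the most value per annotation.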

VentureBeat: Earlier you mentioned those top three premises you feel are really important for success with AI. But what are some small details or smaller considerations you found are really important for managing and establishing an AI pipeline? What might people not think of?

Singhal: One of the biggest factors for success or failure for any AI project, I find, has nothing to do with technology. It's management and leadership. It takes effort, time, and top-down leadership to drive AI into a production environment. So my perspective is that the companies that are going to be successful are the ones with executive leadership that is ready to go on that journey to truly build AI products and integrate them into their systems and workflows.

VentureBeat: Based on your experience, what advice would you offer other enterprises looking to launch or further develop their AI efforts?

Singhal: Identify the business problem you're trying to solve, be certain about how AI can solve it, and be ready to make those changes within your environment. Plan it well in advance. And ensure you have the right domain experts creating the right training data. Spend 80% of your budget on ensuring that you have high-quality training data and 20% on training the models. Because it's garbage in, garbage out.
