How to migrate to Snowflake without getting ‘data drunk’

In case you haven't heard, the cloud is booming. And when it comes to cloud storage and analytics, in particular, Snowflake is benefiting from the blizzard. In its latest financial disclosure, the company reported 4,532 customers and 110% year-over-year revenue growth.

Even though migration is only the first step when it comes to embracing the cloud, getting it right is crucial to setting any business up for success. And there's a lot to consider: governance, customizations, aligning stakeholders, and building out a team to make it happen. Plus, there's the fact that Snowflake's unlimited storage and compute make it easy to rack up a big bill.

To get a better idea of how one company prepared for its migration to Snowflake, we chatted with Salim Syed, senior director of data engineering at Capital One. He pulled back the curtain on the company's migration, which kicked off in 2017. The team has made several updates over the years, he says, and it's been a successful journey resulting in almost 27% cost savings.

This interview has been edited for brevity and clarity.

VentureBeat: Capital One set out to migrate to Snowflake because you saw some potential benefits, of course. But what challenges did you anticipate? Were there any drawbacks you felt you had to solve for regarding how this would impact Capital One's data and means of working with that data?

Salim Syed: It's a good question. And yes. Snowflake's architecture, with its separation of storage and compute, was different from any other data warehouse we had worked with. In the past, we didn't have to manage compute separately; we just gave our users access to the database. But we knew that Snowflake provides unlimited storage and unlimited compute, and if we didn't manage how to provision that and build proper controls and governance around it, then we would lose track of the cost and of governance. So that's one thing.

The other thing is we didn't want the centralized team to be a bottleneck, with 6,000 users requesting access to the data warehouse and compute separately. So we started thinking about how we could make this more self-service and give the ownership of data and infrastructure to the businesses to manage their own environments, but also ensure governance, cost control, and best practices are built in. And so that led to our journey building all these tools that help us manage Snowflake better.

VentureBeat: And what's this concern about becoming "data drunk"?

Syed: As we move to the cloud, the amount of data we're seeing now is, I can't even ... maybe 50 times more than what we ever had on-premises. So the amount of data and the variety of data is just continually increasing, and Snowflake allows you to basically store as much data as you want and run as much analytics as you want. So that's the term we came up with about how our analysts will basically use whatever resources we give them. When analysts work with data, they're basically creating subsets of data and storing them in their private sandboxes. And what happens is when you allow analysts, data scientists, or whoever to just continue to create more and more storage, you lose control of that data. And so we also very specifically wanted to make sure that any data that is created outside of our production systems by our users is well-governed. We know exactly what that data is, who should have access to it, how it's shared, how long to keep the data, the metadata -- we require that all of that is captured so that we are still growing really fast but also making sure we're still well-governed.

VentureBeat: So the concern around getting "data drunk" is more about the control than the amount of data?

Syed: It's both. The cost is one aspect because you can end up spending a lot, whereas in the past, you didn't. It wasn't pay as you go, but rather you bought a license for a year and just used it -- it didn't matter how much. With Snowflake and AWS cloud, the more you use it, the more you end up paying. So it's very important to make sure you're using the compute as efficiently as possible. On the other side, governance and control is also very important when you have such a variety of data and so many different types of data. In order for us to be well governed, we have to satisfy not only the cyber folks but regulators, the database administration team, and all the different stakeholders.

VentureBeat: Speaking of regulators, does the fact that Capital One sits in a heavily regulated industry have any impact?

Syed: I think Capital One was in a better place because we are such a heavily regulated company, so we understand risk management better than others. But what really changed as part of our migration was scaling governance because now we're just dealing with exponentially more data. Historically, governance can become a bottleneck and can stifle your innovation because everyone has to maintain the central team that enforces governance, and everyone has to follow that. So our challenge was how do we federate and simplify governance? And how do we hide all the bureaucracy that goes on and make it transparent so our users can still access the data and innovate while making sure that all the governance activities are taken care of behind the scenes? That's what we really focused on during our migration. And you asked about other companies. Even if it's not a regulated company, it's becoming such an important part of every organization. All that information is going to be super valuable no matter [whether] it's regulated.

VentureBeat: So let's get into your solutions. How did you go about not just putting in controls, but streamlining the process?

Syed: We built the tools because we knew cost would become a big issue if we didn't. But the idea was that you are federating the ownership and management to the business while enforcing central policies and using centralized tools. So the question was, how can you keep it flexible so that the lines of business can still adjust and don't just reject it? That's where it really started.

Then the journey went from infrastructure management in Snowflake to data management. We wanted to make sure that on the producer side, for example, the experience was seamless -- that you could ingest data from all the different sources and that one single workflow would get your data, register metadata, identify the sensitivity of columns, and classify columns and fields. And then, beyond where the data will be stored, it would cover how the data will get updated and what transformations will happen. We just wanted to make that whole experience easy. And then while that was happening, we basically enabled all the data governance things so businesses don't have to reinvest and can just configure their workflow and use our ingestion process.

We really thought about the data discovery part too. We needed to build a system where you could find the data easily by seeing what other people in your role have searched for, so we used machine learning to figure that out. And then once you find the data that's relevant to you, we give you information on whether you can trust the data: how often it has been updated, when it was last updated, what the values are, who accesses the data, etc. We wanted to remove all that bureaucracy and make a seamless end-to-end application.

VentureBeat: And what did this all look like in terms of the people involved? Did you have a dedicated team? Which types of experts would you say are a must to have involved in this sort of undertaking?

Syed: It all starts from leadership. You have to have leadership's buy-in so all the lines of business understand it's the way you're going. And yeah, absolutely. You will have to build a team of data engineers, application developers, UI architects, and people who understand governance and the pain points. A huge product team. So it was definitely a combination of teams that were brought in, and we also constantly engaged with the lines of business to make sure we were addressing their needs as well.

VentureBeat: Has all this carried you well up to today? Have you had to make any updates or changes?

Syed: We've definitely learned a lot along the way and made adjustments. For example, we had initially created some patterns for data producers to load the data. And we gave the lines of business the rules of the road and said they can do it on their own. But over time, we realized it was really hard to enforce this and know who was or wasn't following the rules. So we made centralized tooling for this, but also addressed the concerns of the lines of business by making sure it would be highly configurable and flexible. But I feel like we're now in a really good position and seeing the benefits. Almost 50,000 hours of manual work we used to do is now done by this application, and we've seen almost 27% cost savings. And we're seeing usage continue to go up, with 5-6 times more queries being run.

VentureBeat: What takeaways do you have from this experience? Is there anything you wish you had known earlier on in the process?

Syed: For anyone who's trying to make a migration or data transformation to the cloud, know it's hard to put the genie back in the bottle. So it's really important to think ahead on how you're going to deploy the governance.

Update at 6:30am Pacific: We updated the first paragraph to mention analytics and data storage, rather than just data warehousing.
