Inside Twitter’s growing interest in Google Cloud

Twitter earlier this month announced it would be expanding its partnership with Google, moving more data and workloads from its servers to the Google Data Cloud.

While Twitter doesn't yet have plans to port its entire infrastructure to the cloud, its growing relationship with Google's Data Cloud highlights some of the key challenges companies face as their data stores grow and how employing the right cloud strategy can help them solve these challenges.

From on-premises to the cloud

Before its interest in the cloud, Twitter had long been running on its own solid IT infrastructure. Servers and datacenters on five continents stored and processed hundreds of petabytes of data, served hundreds of millions of users, and had the capacity to scale with the company's growth. Twitter also developed many in-house tools for data analysis. But in 2016, the company became interested in exploring the benefits of moving all or part of its data to the cloud.

"The advantages, as we saw them, were the ability to leverage new cloud offerings and capabilities as they became available, elasticity and scalability, a broader geographical footprint for locality and business continuity, reducing our footprint, and more," Twitter senior manager of software engineering Joep Rottinghuis wrote in a blog post in 2019.

After evaluating several options, Twitter partnered with Google Cloud to adopt a hybrid approach in which Twitter kept its immediate operations on its own servers and ported some of its data and workloads to the cloud.

"Large companies depend on collecting massive amounts of data, deriving insights and building experiences on top of this data in order to run the day-to-day aspects of their business and scale as they grow," Google Cloud product management director Sudhir Hasbe told VentureBeat. "This is very similar to what Google does. At Google, we have nine applications with more than 1 billion monthly active users. Over the past 15-plus years, we have built tools and solutions to process large amounts of data and derive value from it to ensure the best possible experience for our users."

The partnership, which officially started in 2018, involved migrating Twitter's "ad-hoc clusters" and "dedicated dense storage clusters" to Google Cloud. Ad-hoc clusters serve special, one-off queries, and the dedicated clusters store less frequently accessed data.

Democratizing data analysis

One of the key demands Google Cloud has helped address is the democratization of data analysis and mining at Twitter. In essence, Twitter wanted to enable its developers, data scientists, product managers, and researchers to derive insights from its constantly growing database of tweets.

Twitter's previous data analysis tools, such as Scalding, required a programming background, which made them inaccessible to less technical users. And tools such as Presto and Vertica had problems dealing with large-scale data.

The partnership with Google gave Twitter's employees access to tools like BigQuery and Dataflow. BigQuery is a cloud-based data warehouse with built-in machine learning tools and the capability to run queries on petabytes of data. Dataflow enables companies to collect massive streams of data and process them in real time.

"BigQuery and Dataflow are two examples that do not have open source or Twitter-developed counterparts. These are additional capabilities that our developers, PMs, researchers, and data scientists can take advantage of to enable learning much faster," Twitter platform leader Nick Tornow told VentureBeat.

Twitter currently stores hundreds of petabytes of data in BigQuery, all of which can be accessed and queried via simple SQL-based web interfaces.

"Many internal use cases, including the vast majority of data science and ML use cases, may start with SQL but will quickly need to graduate to more powerful data processing frameworks," Tornow said. "The BigQuery Storage API is an important capability for enabling these use cases."

Breaking the silos

One of the key problems many organizations face is having their data stored in different silos and separate systems. This scattered structure makes it difficult to run queries and perform analysis tasks that require access to data across silos.

"Talking to many CIOs over the past few years, I have seen that there is a huge issue of data silos being created across organizations," Hasbe said. "Many organizations use Enterprise Data Warehouse for their business reporting, but it is very expensive to scale, so they put a lot of valuable data like clickstream or operational logs in Hadoop. Using this structure made it difficult to analyze all the data."

Hasbe added that merely moving silos to the cloud is not enough, as the data needs to be connected to provide a full scope of insights into an organization.

In the case of Twitter, siloed data required the extra effort of developing intermediate jobs to consolidate data from separate sources into larger workloads. The introduction of BigQuery helped remove many of these intermediate jobs by providing interoperability across different data sources. BigQuery can seamlessly query data stored across various sources, such as BigQuery Storage, the Google Cloud Storage data lake, data lakes from cloud providers like Amazon and Microsoft, and Google Cloud Databases.
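To make the idea of querying across silos concrete, here is a minimal sketch of the general pattern. It is an analogy only: it uses Python's built-in sqlite3 module, not BigQuery, and the table names, columns, and sample rows are invented for illustration. Two separate databases stand in for two silos, and a single SQL statement joins them directly, with no intermediate job copying data from one into the other.

```python
import sqlite3

# Hypothetical sample data; none of it comes from Twitter or Google.
ACCOUNTS = [(1, "NL"), (2, "US"), (3, "US")]
CLICKS = [(1, 5), (2, 3), (3, 4), (2, 1)]


def build_silos() -> sqlite3.Connection:
    """Create two separate in-memory databases that stand in for data silos."""
    conn = sqlite3.connect(":memory:")
    # Attach the second database first, before any INSERT opens a transaction.
    conn.execute("ATTACH DATABASE ':memory:' AS logs")

    # Silo 1 (main database): a warehouse-style table of accounts.
    conn.execute("CREATE TABLE accounts (user_id INTEGER PRIMARY KEY, country TEXT)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", ACCOUNTS)

    # Silo 2 (attached database): clickstream-style events.
    conn.execute("CREATE TABLE logs.clicks (user_id INTEGER, clicks INTEGER)")
    conn.executemany("INSERT INTO logs.clicks VALUES (?, ?)", CLICKS)
    conn.commit()
    return conn


def clicks_by_country(conn: sqlite3.Connection) -> list:
    """Join both silos in one query instead of consolidating them first."""
    query = """
        SELECT a.country, SUM(c.clicks) AS total_clicks
        FROM accounts AS a
        JOIN logs.clicks AS c ON c.user_id = a.user_id
        GROUP BY a.country
        ORDER BY a.country
    """
    return conn.execute(query).fetchall()


conn = build_silos()
print(clicks_by_country(conn))  # [('NL', 5), ('US', 8)]
```

The point of the sketch is the shape of the query, not the engine: once one query layer can address both stores, the consolidation step disappears and analysts work in plain SQL.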

"The landscape is still fragmented, but BigQuery, in particular, has played an important role in helping to democratize data at Twitter," Tornow said. "Importantly, we have found that BigQuery provides a managed data warehouse experience at a substantially larger scale than legacy solutions can support."

An evolving relationship

Today, Twitter still runs its main operations on its own servers. But its relationship with Google has evolved and expanded over the last three years. "In some cases, we will move workloads as-is to the cloud. In other cases, we will rewrite workloads to take advantage of the managed services we're onboarding on," Tornow said. "Additionally, we are seeing our developers at Twitter come up with new use cases to take advantage of the streaming capabilities offered by Dataflow, as an example."

Google has also benefited immensely from onboarding a customer as big as Twitter. Throughout the partnership, Twitter has submitted feature requests in areas such as storage, computation slot allocation, and dashboards, which have helped Google better understand how it can improve its data analytics tools.

Under the new deal announced this month, Twitter will move its processing clusters, which run regular production jobs with dedicated capacity, to Google Cloud. The expanded partnership will also include the transition of offline analytics and machine learning workloads to Google Cloud. Machine learning already plays a key role in a wide range of tasks at Twitter, including image classification, natural language processing, content moderation, and recommender systems. Now Twitter will be able to leverage Google's vast array of tools and specialized hardware to improve its machine learning capabilities.

"GCP's ML hardware and managed services will accelerate our ability to improve our models and apply ML in additional product surfaces," Tornow said. "Improvements in our ML applications often connect directly to improved experience for people using Twitter, such as presenting more relevant timelines or more proactive action on abusive content."

How to prepare for the cloud

Google's cloud business is still trailing behind Amazon and Microsoft. But in the past few years, the tech giant has managed to snatch several big-ticket customers, including Wayfair, Etsy, and the Home Depot. Working with Twitter and these companies has helped the Google Cloud team draw important lessons on cloud migration. Hasbe summarizes these into three key tips for organizations considering moving to the cloud:

Break down the silos. "Focus on all data, not just one type of data when you move to the cloud," Hasbe said.
Build for today but plan for the future. "Many organizations are hyper-focused on use cases they are using today and moving them as-is to the cloud," Hasbe said, adding that cloud migration should be an opportunity to plan for long-term modernization and transformation. "Organizations have to live with the platform they pick for years if not decades," he said.
Focus on business value-driven use cases. "Don't boil the ocean and create a data lake. Start small and pick a use case that has real business value. Deliver that value end to end. This will enable business leaders to see the ROI, enable your teams to get confident in their new abilities, and importantly reduce your time to value or failure ... You can learn and pivot as you go," Hasbe said.

Finally, Hasbe stressed that the responsibility for driving innovation cannot fall only on technology teams. "It has to involve all parts of the organization. Hence, having commitment from leadership across business and technology is key," he said.