How Netflix built its real-time data infrastructure

What makes Netflix, Netflix? Creating compelling original programming, analyzing its user data to serve subscribers better, and letting people consume content in the ways they prefer, according to Investopedia's analysis.

While few people would disagree, probably not many are familiar with the backstory of what enables the analysis of Netflix user and operational data to serve subscribers better. During Netflix’s global hyper-growth, business and operational decisions rely on faster logging data more than ever, says Zhenzhong Xu.

Xu joined Netflix in 2015 as a founding engineer on the real-time data Infrastructure team, and later led the stream processing engines team. He developed an interest in real-time data in the early 2010s, and has since believed there is much value yet to be uncovered in this area.

Recently, Xu left Netflix to pursue a similar but expanded vision in the real-time machine learning space. Xu refers to the development of Netflix's real-time data Infrastructure as an iterative journey, taking place between 2015 and 2021. He breaks down this journey in four evolving phases.

Phase 1: Rescuing Netflix logs from the failing batch pipelines (2015)

Phase 1 involved rescuing Netflix logs from the failing batch pipelines. In this phase, Xu's team built a streaming-first platform from the ground up to replace the failing pipelines.

The role of Xu and his team was to provide leverage by centrally managing foundational infrastructure, enabling product teams to focus on business logic.

In 2015, Netflix already had about 60 million subscribers and was aggressively expanding its international presence. The platform team knew that promptly scaling the platform leverage would be the key to sustaining the skyrocketing subscriber growth.

As part of that imperative, Xu's team had to figure out how to help Netflix scale its logging practices. At that time, Netflix had more than 500 microservices, generating more than 10PB data every day.

Collecting that data serves Netflix by enabling two types of insights. First, it helps gain business analytics insights (e.g., user retention, average session length, what’s trending, etc.). Second, it helps gain operation insights (e.g., measuring streaming plays per second to quickly and easily understand the health of Netflix systems) so developers can alert or perform mitigations.

Data has to be moved from the edge where it's generated to some analytical store, Xu says. The reason is well-known to all data people: microservices are built to serve operational needs, and use online transactional processing (OLTP) stores. Analytics require online analytical processing (OLAP).

Using OLTP stores for analytics wouldn't work well and would also degrade the performance of those services. Hence, there was a need to move logs reliably in a low-latency fashion. By 2015, Netflix's logging volume had increased to 500 billion events/day (1PB of data ingestion), up from 45 billion events/day in 2011.

The existing logging infrastructure (a simple batch pipeline platform built with Chukwa, Hadoop, and Hive) was failing rapidly against the increasing weekly subscriber numbers. Xu's team had about six months to develop a streaming-first solution. To make matters worse, they had to pull it off with six team members.

Furthermore, Xu notes that at that time, the streaming data ecosystem was immature. Few technology companies had proven successful streaming-first deployments at the scale Netflix needed, so the team had to evaluate technology options and experiment, and decide what to build and what nascent tools to bet on.

It was in those years that the foundations for some of Netflix's homegrown products such as Keystone and Mantis were laid. Those products got a life of their own, and Mantis was later open-sourced.

Phase 2: Scaling to hundreds of data movement use cases (2016)

A key decision made early on had to do with decoupling concerns rather than ignoring them. Xu's team separated concerns between operational and analytics use cases by evolving Mantis (operations-focused) and Keystone (analytics-focused) separately, but created room to interface both systems.

They also separated concerns between producers and consumers. They did that by introducing producer/consumer clients equipped with standardized wire protocol and simple schema management to help decouple the development workflow of producers and consumers. It later proved to be an essential aspect in data governance and data quality control.

Starting with a microservice-oriented single responsibility principle, the team divided the entire infrastructure into messaging (streaming transport), processing (stream processing), and control plane. Separating component responsibilities enabled the team to align on interfaces early on, while unlocking productivity by focusing on different parts concurrently.

In addition to resource constraints and an immature ecosystem, the team initially had to deal with the fact that analytical and operational concerns are different. Analytical stream processing focuses on correctness and predictability, while operational stream processing focuses more on cost-effectiveness, latency, and availability.

Furthermore, cloud-native resilience for a stateful data platform is hard. Netflix had already operated on AWS cloud for a few years by the time Phase 1 started. However, they were the first to get a stateful data platform onto the containerized cloud infrastructure, and that posed significant engineering challenges.

After shipping the initial Keystone MVP and migrating a few internal customers, Xu's team gradually gained trust and the word spread to other engineering teams. Streaming gained momentum in Netflix, as it became easy to move logs for analytical processing and to gain on-demand operational insights. It was time to scale for general customers, and that presented a new set of challenges.

The first challenge was increased operation burden. White-glove assistance was initially offered to onboard new customers. However, it quickly became unsustainable given the growing demand. The MVP had to evolve to support more than just a dozen customers.

The second challenge was the emergence of diverse needs. Two major groups of customers emerged. One group preferred a fully managed service that’s simple to use, while another preferred flexibility and needed complex computation capabilities to solve more advanced business problems. Xu notes that they could not do both well at the same time.

The third challenge, Xu observes honestly, was that the team broke pretty much all their dependent services at some point due to the scale — from Amazon's S3 to Apache Kafka and Apache Flink. However, one of the strategic choices made previously was to co-evolve with technology partners, even if not in an ideal maturity state.

That includes partners who Xu notes were leading the stream processing efforts in the industry, such as LinkedIn, where the Apache Kafka and Samza projects were born. Simultaneously, the company formed to commercialize Kafka;Data Artisans, the company, formed to commercialize Apache Flink, later renamed to Ververica.

Choosing the avenue of partnerships enabled the team to contribute to open-source software for their needs while leveraging the community’s work. In terms of dealing with challenges related to containerized cloud infrastructure, the team partnered up with the Titus team.

Xu also details more key decisions made early on, such as choosing to build an MVP product focusing on the first few customers. When exploring the initial product-market fit, it’s easy to get distracted. However, Xu writes, they decided to help a few high-priority, high-volume internal customers and worry about scaling the customer base later.

Phase 3: Supporting custom needs and scaling beyond thousands of use cases (2017 - 2019)

Again, Xu's team made some key decisions that helped them throughout Phase 2. They chose to focus on simplicity first versus exposing infrastructure complexities to users, as that enabled the team to address most data movement and simple streaming ETL use cases while enabling users to focus on the business logic.

They chose to invest in a fully managed multitenant self-service versus continuing with manual white-glove support. In Phase 1, they chose to invest in building a system that expects failures and monitors all operations, versus delaying the investment. In Phase 2, they continued to invest in DevOps, aiming to ship platform changes multiple times a day as needed.

Circa 2017, the team felt they had built a solid operational foundation: Customers were rarely notified during their on-calls, and all infrastructure issues were closely monitored and handled by the platform team. A robust delivery platform was in place, helping customers to introduce changes into production in minutes.

Xu notes Keystone (the product they launched) was very good at what it was originally designed to do: a streaming data routing platform that’s easy to use and almost infinitely scalable. However, it was becoming apparent that the full potential of stream processing was far from being realized. Xu's team constantly stumbled upon new needs for more granular control on complex processing capabilities.

Netflix, Xu writes, has a unique freedom and responsibility culture where each team is empowered to make its own technical decisions. The team chose to expand the scope of the platform, and in doing so, faced some new challenges.

The first challenge was that custom use cases require a different developer and operation experience. For example, Netflix recommendations cover things ranging from what to watch next, to personalized artworks and the best location to show them.

These use cases involve more advanced stream processing capabilities, such as complex event/processing time and window semantics, allowed lateness, and large-state checkpoint management. They also require more operational support, more flexible programming interfaces, and infrastructure capable of managing local states in the TBs.

The second challenge was balancing between flexibility and simplicity. With all the new custom use cases, the team had to figure out the proper level of control exposure. Furthermore, supporting custom use cases dictated increasing the degree of freedom of the platform. That was the third challenge — increased operation complexity.

Last, the team’s responsibility was to provide a centralized stream processing platform. But due to the previous strategy to focus on simplicity, some teams had already invested in their local stream processing platforms using unsupported technology – “going off the paved path”, in Netflix terminology. Xu's team had to convince them to move back to their managed platform. That, namely the central vs. local platform, was the fourth challenge.

At Phase 3, Flink was introduced in the mix, managed by Xu's team. The team chose to build a new product entry point, but refactored existing architecture versus building a new product in isolation. Flink served as this entry point, and refactoring helped minimize redundancy.

Another key decision was to start with streaming ETL and observability use cases, versus tackling all custom use cases all at once. These use cases are the most challenging due to their complexity and scale, and Xu felt that it made sense to tackle and learn from the most difficult ones first.

The last key decision made at this point was to share operation responsibilities with customers initially and gradually co-innovate to lower the burden over time. Early adopters were self-sufficient, and white-glove support helped those who were not. Over time, operational investments such as autoscaling and managed deployments were added to the mix.

Phase 4: Expanding stream processing responsibilities (2020 - present)

As stream processing use cases expanded to all organizations in Netflix, new patterns were discovered, and the team enjoyed early success. But Netflix continued to explore new frontiers and made heavy investments in content production and more gaming. Thus, a series of new challenges emerged.

The first challenge is the flip side of team autonomy. Since teams are empowered to make their own decisions, many teams in Netflix end up using various data technologies. Diverse data technologies made coordination difficult. With many choices available, it is human nature to put technologies in dividing buckets, and frontiers are hard to push with dividing boundaries, Xu writes.

The second challenge is that the learning curve gets steeper. With an ever-increasing amount of available data tools and continued deepening specialization, it is challenging for users to learn and decide what technology fits into a specific use case.

The third challenge, Xu notes, is that machine learning practices aren't leveraging the full power of the data platform. All previously mentioned challenges add a toll on machine learning practices. Data scientists’ feedback loops are long, data engineers’ productivity suffers, and product engineers have challenges sharing valuable data. Ultimately, many businesses lose opportunities to adapt to the fast-changing market.

The fourth and final challenge is the scale limits on the central platform model. As the central data platform scales use cases at a superlinear rate, it’s unsustainable to have a single point of contact for support, Xu notes. It’s the right time to evaluate a model that prioritizes supporting the local platforms that are built on top of the central platform.

Xu extracted valuable lessons from this process, some of which may be familiar to product owners, and applicable beyond the world of streaming data. Lessons such as having a psychologically safe environment to fail, deciding what not to work on, educating users to become platform champions, and not cracking under pressure. VentureBeat encourages interested readers to refer to Xu's account in its entirety.

Xu also sees opportunities unique to real-time data processing in Phase 4 and beyond. Data streaming can be used to connect worlds, raise abstraction by combining the best of both simplicity and flexibility, and better cater to the needs of machine learning. He aims to continue on this journey focusing on the latter point, currently working on a startup called Claypot.

Phase 1: Rescuing Netflix logs from the failing batch pipelines (2015)

Phase 2: Scaling to hundreds of data movement use cases (2016)

Phase 3: Supporting custom needs and scaling beyond thousands of use cases (2017 - 2019)

Phase 4: Expanding stream processing responsibilities (2020 - present)

More