Why Netflix Removed Kafka from Tudum’s Architecture

Analytics India Magazine (Ankush Das)

Netflix launched Tudum in 2021, promising fans a richer experience: exclusive behind-the-scenes footage, first looks, interviews, and more. Named after Netflix's iconic 'tudum' sound, the official fan destination attracts over 20 million users each month, the company revealed.

Tudum is not just a blog; it is a high-performance publishing platform. Behind the scenes, however, Netflix engineers were fighting latency issues caused by a distributed, eventually consistent architecture.

Initially built on a Command Query Responsibility Segregation (CQRS) model with Apache Kafka, Tudum’s setup favoured scale over speed. But when editors had to wait minutes to preview a change, the trade-off no longer made sense. The company decided to migrate to its in-house technology, as stated in its blog post.

That rethinking led Netflix to replace a multi-component read pipeline with a bold new approach: an in-memory system called RAW Hollow. The result was less I/O, faster page loads, and previews within seconds.

Kafka Was King, Until It Wasn’t

The original Tudum architecture used a classic event-driven CQRS design. Editors pushed updates via a third-party CMS, triggering a pipeline that published updates to Kafka. The system then ran validations, transformed the content, and passed it through multiple services before it appeared on the website.

This was fine for scale, but not for speed. A simple change had to wait in line through ingestion, Kafka, database writes, cache refresh, and finally rendering. Caches were meant to improve read speed, but they ironically became the bottleneck for seeing new content. As the number of content elements grew, so did the refresh lag.
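The write path described above can be sketched as a chain of sequential stages whose latencies simply add up. The stage names and per-stage timings below are illustrative assumptions (only the minute-scale cache refresh is suggested by the article), not Netflix's actual numbers:

```python
# Hypothetical sketch of the original Tudum write path: each stage must
# finish before the next begins, so end-to-end preview latency is the
# sum of every hop, and the slowest stage dominates.
STAGES = {
    "cms_ingestion": 0.05,       # assumed latencies, in seconds
    "kafka_publish": 0.02,
    "validation_transform": 0.10,
    "database_write": 0.05,
    "cache_refresh": 60.0,       # dominated by the periodic near-cache sweep
    "page_render": 0.03,
}

def end_to_end_latency(stages: dict) -> float:
    """Sequential pipeline: total delay is the sum of all stage latencies."""
    return sum(stages.values())

total = end_to_end_latency(STAGES)
bottleneck = max(STAGES, key=STAGES.get)
print(f"worst-case preview delay: {total:.2f}s, bottleneck: {bottleneck}")
```

Under these assumed numbers, every stage except the cache refresh is nearly free; optimising Kafka or the database barely moves the total, which is why the cache became the focus.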

“In our performance profiling, we found the source of delay was our Page Data Service, which acted as a facade for an underlying Key Value Data Abstraction database,” the blog read.

While developers could optimise cache refresh intervals, they couldn’t escape the deeper problem: the architecture’s reliance on too many sequential I/O operations. With 60 keys and a 60-second refresh interval, the near cache updated one key per second. 

“This was problematic for previewing recent modifications, as these changes were only reflected with each cache refresh. As Tudum’s content grew, cache refresh times increased, further extending the delay,” the company mentioned in the blog post.
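The refresh-lag arithmetic quoted above is easy to make concrete. Assuming the near cache refreshes keys one at a time at a fixed rate, the time for a full sweep, and therefore the worst-case preview delay, grows linearly with the number of content elements:

```python
# Sketch of the near-cache refresh lag described in the article.
# The linear model (fixed refresh rate, one key at a time) is an
# illustrative assumption consistent with the quoted numbers.
def sweep_time(num_keys: int, keys_per_second: float = 1.0) -> float:
    """Seconds for one full refresh sweep at a fixed per-key refresh rate."""
    return num_keys / keys_per_second

# The article's example: 60 keys at one key per second -> 60 s sweep,
# so a freshly edited key could sit stale for up to a full minute.
assert sweep_time(60) == 60.0
# As Tudum's content grew, the sweep (and the preview delay) grew with it.
assert sweep_time(300) == 300.0
```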

RAW Hollow for Simplicity

Netflix’s answer was RAW Hollow, a compressed, in-memory object database co-located with each service node. Designed for small to medium-sized datasets with high read requirements and low mutation frequency, RAW Hollow fits Tudum’s needs perfectly.

Instead of pulling data from a key-value store via a cached facade, Tudum’s services now read directly from in-memory structures. With compression, even three years of data could be stored in just 130 MB of memory. Moreover, with support for strong read-after-write consistency, editors could see their changes almost instantly, without sacrificing performance for site visitors.
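The core idea is easy to illustrate. The toy class below is a minimal sketch of a co-located, compressed in-memory store, not the actual RAW Hollow API: reads are local dictionary lookups, records are compressed at rest, and a write is visible to the very next read, which is what gives editors instant previews:

```python
# Minimal sketch of a compressed in-memory store with read-after-write
# consistency. Class and method names are hypothetical; RAW Hollow's
# real API and compression scheme are not public in this article.
import json
import zlib

class InMemoryStore:
    def __init__(self):
        self._data = {}  # key -> zlib-compressed JSON record

    def put(self, key: str, record: dict) -> None:
        """Write a record; it is immediately visible to subsequent reads."""
        self._data[key] = zlib.compress(json.dumps(record).encode())

    def get(self, key: str) -> dict:
        """Read straight from local memory: no network hop, no cache tier."""
        return json.loads(zlib.decompress(self._data[key]))

store = InMemoryStore()
store.put("page:home", {"title": "Tudum", "modules": ["hero", "news"]})
# Read-after-write: no refresh cycle to wait for.
assert store.get("page:home")["title"] == "Tudum"
```

Because the whole dataset lives on every service node, the failure modes of a remote cache (staleness, invalidation, refresh sweeps) simply disappear, which is the trade Netflix describes.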

The company ditched the Kafka pipelines, the Apache Cassandra database reads, and the cache invalidation woes. The new architecture embedded RAW Hollow directly into Tudum’s microservices, drastically simplifying the stack.

Performance Unlocked with Less Complexity

Notably, page load times dropped from 1.4 seconds to 0.4 seconds. The editing workflow went from sluggish to nearly real time. Moreover, the microservices that powered search and personalisation could now make decisions almost instantly, since all the relevant data was already in memory.

However, this speed came at the cost of tighter coupling. Tudum’s Page Construction Service now directly depends on RAW Hollow’s in-memory state. Netflix engineers acknowledged the trade-off: it works for Tudum, but could limit flexibility if the approach were shared more broadly.

Yet, the benefits outweighed the drawbacks. As the blog notes, by holding the complete dataset in memory, Netflix “eliminated an entire class of problems” tied to caching, latency, and inconsistent previews.

Netflix’s Tudum shows that CQRS isn’t always about eventual consistency; it can be reimagined for low latency as well. RAW Hollow turned the problem inside out: if reads are slow, don’t just cache them, move the data into memory entirely.

Caching is notoriously hard to get right; RAW Hollow offered something better than a smarter cache by removing the need for one.
