Why Discord Moved Away from Redis and Rebuilt Search on Kube…

Analytics India Magazine (Ankush Das)

Whether it is Netflix’s ‘Java secret’ or Uber’s use of Kubernetes, the technology behind popular platforms used by individuals and enterprises worldwide is always interesting to examine.

For Discord, what started as a performant Elasticsearch-based system in 2017 eventually cracked under the immense pressure of the company’s growth, forcing the engineering team to reimagine their search architecture completely.

When Success Becomes a Problem

For many years, Discord’s message search ran efficiently on Elasticsearch clusters. These clusters stored messages with each Discord server and direct message stream assigned to its own shard. However, as the platform scaled, fundamental limitations surfaced that couldn’t be patched with incremental improvements.

Discord’s original search infrastructure was designed with solid engineering principles, but even the best-laid plans can buckle under exponential growth. The Redis-backed message indexing queue, which had performed admirably in the early days, became a critical failure point as the volume of messages increased. 

“When our indexing queue got backed up for any reason, which happened often on Elasticsearch node failure, the Redis cluster became a source of failure that began dropping messages once CPU maxed out with too many messages in the queue,” the Discord team explained in a blog post.

The bulk indexing strategy, originally optimised for performance, created unexpected vulnerabilities at scale. Because each batch mixed messages bound for many different indices and nodes, failures became coupled: the loss of a single node could cause a large share of bulk operations to fail.

“If one node fails, assuming an equal distribution, the odds of a given batch having at least one message going to that failed node are ~40%. This means that a single-node failure leads to ~40% of our bulk index operations failing!” Discord’s team highlighted in the blog.
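The ~40% figure follows from simple probability. Below is a minimal sketch, assuming for illustration a cluster of 100 nodes and bulk batches of 50 messages (neither value is stated in this article), with each message routed to a node uniformly at random and independently of the others. The function name is hypothetical.

```python
def batch_failure_probability(num_nodes: int, batch_size: int, failed_nodes: int = 1) -> float:
    """Probability that at least one message in a batch targets a failed node.

    Assumes every message lands on a node chosen uniformly and independently.
    """
    healthy_fraction = (num_nodes - failed_nodes) / num_nodes
    # The batch succeeds only if all of its messages land on healthy nodes.
    return 1.0 - healthy_fraction ** batch_size


# Illustrative values only: 100 nodes, 50 messages per batch, 1 failed node.
p = batch_failure_probability(num_nodes=100, batch_size=50)
print(f"{p:.1%}")  # prints 39.5%
```

Under these assumptions, the risk grows with batch size: the more destinations a single bulk request mixes together, the more likely it is to touch the one failed node.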

The company explained that the system grew so large and fragile that essential maintenance had become nearly impossible. The inability to perform rolling restarts or software upgrades meant that Discord was stuck running outdated versions, missing both security patches and performance improvements. According to the blog post, patching the Log4Shell vulnerability required taking the search system offline while all Elasticsearch nodes were restarted with updated configurations.

Kubernetes’s Familiarity  

While Discord had been successfully running stateless services on Kubernetes, the search infrastructure represented their first significant stateful workload migration. The company discovered that the Elastic Kubernetes Operator provided the orchestration capabilities to manage complex Elasticsearch deployments at scale.

“With the Elasticsearch Operator, we would easily be able to define our cluster topology and configuration, and deploy the Elasticsearch cluster onto our Kubernetes nodepool,” the blog post stated.

Kubernetes brought immediate operational benefits that addressed many of Discord’s pain points—OS upgrades became automatic, rolling restarts could be performed safely, and the granular resource allocation helped optimise costs.

More importantly, it enabled Discord to implement its multi-cluster “cell” architecture, where smaller, more manageable Elasticsearch clusters could be deployed and managed independently.

The cell architecture represented a complete departure from the monolithic cluster approach. Instead of managing massive clusters of more than 200 nodes, Discord now operates 40 smaller clusters organised into logical cells. Each cluster within a cell runs dedicated node types with specific roles, so that master-eligible nodes have sufficient resources for coordination while ingest nodes can scale dynamically to handle traffic spikes.
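As a rough illustration of dedicated node roles, the sketch below builds a manifest for the Elastic operator’s `Elasticsearch` resource, which groups nodes into `nodeSets` and assigns roles through `node.roles`. The cluster name, version string, node counts, and helper function are all placeholder assumptions, not Discord’s actual configuration.

```python
import json


def elasticsearch_manifest(name: str, version: str, masters: int, data: int, ingest: int) -> dict:
    """Build an ECK Elasticsearch manifest with one nodeSet per dedicated role."""

    def node_set(set_name: str, count: int, roles: list) -> dict:
        return {"name": set_name, "count": count, "config": {"node.roles": roles}}

    return {
        "apiVersion": "elasticsearch.k8s.elastic.co/v1",
        "kind": "Elasticsearch",
        "metadata": {"name": name},
        "spec": {
            "version": version,
            "nodeSets": [
                node_set("master", masters, ["master"]),  # cluster coordination only
                node_set("data", data, ["data"]),         # holds the shards
                node_set("ingest", ingest, ["ingest"]),   # scaled separately from data
            ],
        },
    }


# Placeholder values for illustration only.
manifest = elasticsearch_manifest("search-cell-example", "8.13.0", masters=3, data=6, ingest=2)
print(json.dumps(manifest, indent=2))
```

Keeping each role in its own node set means the count for one role can change without touching the others, which is the property the cell design relies on.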

Indexing Trillions of Messages

The migration to Pub/Sub from Redis for message queuing solved the message-dropping problem while enabling more sophisticated routing strategies. Discord implemented a message router that intelligently batches messages by their destination cluster and index, ensuring that bulk operations remain isolated and resilient to individual node failures.
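The batching idea can be sketched as follows. This is a simplified illustration, not Discord’s router: the `Message` fields, the function name, and the batch size of 50 are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

Destination = Tuple[str, str]  # (cluster name, index name)


@dataclass(frozen=True)
class Message:
    message_id: int
    cluster: str  # hypothetical destination cluster
    index: str    # hypothetical destination index
    body: str


def batch_by_destination(
    messages: Iterable[Message], max_batch_size: int = 50
) -> List[Tuple[Destination, List[Message]]]:
    """Group messages so every batch targets exactly one (cluster, index) pair."""
    grouped: Dict[Destination, List[Message]] = defaultdict(list)
    for message in messages:
        grouped[(message.cluster, message.index)].append(message)

    batches: List[Tuple[Destination, List[Message]]] = []
    for destination, queued in grouped.items():
        # Split large groups into chunks of at most max_batch_size.
        for start in range(0, len(queued), max_batch_size):
            batches.append((destination, queued[start:start + max_batch_size]))
    return batches


incoming = [
    Message(1, "cell-a", "guilds-01", "hello"),
    Message(2, "cell-b", "dms-07", "hi"),
    Message(3, "cell-a", "guilds-01", "world"),
]
for destination, batch in batch_by_destination(incoming):
    print(destination, [m.message_id for m in batch])
```

Because each batch touches only one cluster and index, a failed node can only fail the batches addressed to the index it hosts, rather than a large share of all bulk operations.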

This architectural flexibility unlocked new search capabilities that were previously impossible. Cross-message search, a long-requested feature, became feasible. 

For Discord’s largest communities, dubbed ‘Big Freaking Guilds’ or BFGs, the new architecture provides dedicated resources and multi-shard indices to handle billions of messages. These exceptional cases get their own Elasticsearch cell with optimised configurations, ensuring that massive guilds don’t impact performance for smaller communities while still providing fast search capabilities.

The changes brought significant improvements: indexing throughput doubled, median query latency fell from 500 ms to below 100 ms, and cluster upgrades became seamless, with zero service interruptions.

The transformation results demonstrate the power of thoughtful infrastructure evolution. Discord now processes trillions of messages with improved performance metrics across the board, all while maintaining the flexibility to handle edge cases and future growth. 

The post Why Discord Moved Away from Redis and Rebuilt Search on Kubernetes appeared first on Analytics India Magazine.
