Why Groq Loves Mixture of Experts Models
Analytics India Magazine (Supreeth Koundinya)
Mixture-of-Experts (MoE) architectures power most of today’s frontier AI models, at least the ones whose internals we know about, thanks to their open-weights nature.
This includes models from DeepSeek, Moonshot AI’s Kimi, and even OpenAI’s recently announced gpt-oss series.
For context, an MoE architecture activates only a subset of parameters per token while retaining a large total parameter count. And for companies like Groq, which have built their entire business around inference, MoE models are a perfect match for the company’s LPU (Language Processing Unit) chips, according to CEO Jonathan Ross.
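As a rough illustration (a minimal sketch, not Groq’s stack or any specific model’s implementation; the sizes and NumPy routing below are purely illustrative), a top-k router picks a few experts per token, so only those experts’ weights participate in the computation:

```python
import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    """Minimal top-k MoE routing sketch.

    x:       (tokens, d_model) token activations
    gate_w:  (d_model, n_experts) router weights
    experts: list of (d_model, d_model) expert weight matrices
    """
    scores = x @ gate_w                          # router logits, one per expert per token
    top_k = np.argsort(scores, axis=-1)[:, -k:]  # indices of the k highest-scoring experts
    out = np.zeros_like(x)
    for t, expert_ids in enumerate(top_k):
        w = np.exp(scores[t, expert_ids])
        w /= w.sum()                             # softmax over the selected experts only
        for weight, e in zip(w, expert_ids):
            out[t] += weight * (x[t] @ experts[e])  # only k experts' weights are touched
    return out

# Illustrative sizes: 8 experts in total, but each token activates just 2 of them.
rng = np.random.default_rng(0)
d_model, n_experts = 16, 8
x = rng.normal(size=(4, d_model))
gate_w = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
print(moe_layer(x, gate_w, experts, k=2).shape)  # (4, 16)
```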
Groq’s LPUs are hardware systems designed specifically for AI inference, and they outperform traditional GPU systems in output speed.
Ross was in Bengaluru recently at an event hosted by Lightspeed Ventures India, where he explained how MoE models align directly with the company’s fundamental advantages. While GPUs struggle with MoE architectures due to memory bottlenecks, Ross said that Groq’s LPUs thrive on them for inference.
“If you look at NVIDIA GPUs, you can calculate their performance on a problem by calculating how long it takes to load the weights and the KV [key-value] cache from the HBM [High Bandwidth Memory]. They have way more compute than they have memory,” said Ross.
GPUs achieve performance through batching: running the same model weights across many users’ requests simultaneously. But MoE models disrupt this efficiency.
Ross said that with mixture-of-experts models, each input only uses a small subset of the model’s weights, so when you process one batch after another, the GPU often has to load a completely different set of weights.
That breaks the normal batching advantage, where the same weights can be reused across many inputs, so you end up moving more data. To compensate, GPU deployments use larger batches, but this increases latency.
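A toy simulation makes the effect visible (the expert counts and batch sizes below are invented, not measurements of any real model or GPU): as the batch grows, so does the number of distinct experts, and therefore the number of distinct weight sets that must be fetched, eroding the reuse that dense-model batching relies on.

```python
import numpy as np

# Toy illustration only: count how many distinct experts a batch touches when
# each token independently routes to its top-k experts. Figures are invented.
rng = np.random.default_rng(0)
n_experts, k, trials = 64, 2, 1_000

for batch_size in (1, 4, 16, 64, 256):
    touched = []
    for _ in range(trials):
        picks = {int(e) for _ in range(batch_size)
                 for e in rng.choice(n_experts, size=k, replace=False)}
        touched.append(len(picks))
    print(f"batch={batch_size:4d} -> ~{np.mean(touched):5.1f} of {n_experts} expert weight sets loaded")
```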
Groq’s architecture sidesteps this entirely. “In our case, we have everything in the memory on the chips, so we don’t have to read from an external memory. We can still use smaller batching and still get good performance,” said Ross.
The LPU stores model weights in on-chip SRAM rather than external DRAM, eliminating the memory bandwidth penalties that plague GPU-based MoE inference.
Ross demonstrated this with Moonshot AI’s new open-weights MoE language model, Kimi K2, running on 4,000 chips. “Each chip has a small part of the model. Each chip does a very small amount of computation and then hands it off to the next set of chips to do their part. Almost like an assembly line or a factory,” he described.
This distributed approach creates a memory efficiency advantage: instead of requiring 500 copies of the model across GPU clusters (which would necessitate 500 terabytes of memory), Groq’s architecture needs only one terabyte in total, since each chip holds just its portion.
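The arithmetic behind those figures can be reconstructed with a simplifying assumption: a roughly 1 TB model (about what a 1-trillion-parameter model like Kimi K2 would occupy at 8-bit precision), either replicated across GPU clusters or sharded once across the 4,000 chips.

```python
# Back-of-the-envelope reconstruction of the figures Ross cited.
# Assumption: the model occupies ~1 TB, roughly a 1T-parameter model
# such as Kimi K2 stored at 8-bit precision.
model_size_tb = 1.0
gpu_replicas = 500      # independent copies of the model across GPU clusters
groq_chips = 4_000      # chips sharing a single sharded copy

gpu_memory_tb = gpu_replicas * model_size_tb              # 500 copies -> 500 TB
groq_memory_tb = model_size_tb                            # one copy   -> 1 TB total
weights_per_chip_gb = model_size_tb * 1_000 / groq_chips  # ~0.25 GB of weights per chip

print(gpu_memory_tb, groq_memory_tb, round(weights_per_chip_gb, 2))  # 500.0 1.0 0.25
```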
The economics reveal how different Groq’s architecture is from GPU-based inference. Ross explained that his team debated whether to charge the same price for both their 20B and 120B parameter MoE models, since the actual serving costs are nearly identical. Both models have similar active parameter counts despite the massive difference in total parameters.
“And I really wanted to serve both of them at the same price, just to show the market how different our architecture is,” said Ross.
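A rough per-token compute comparison shows why the serving costs can be so close, using approximate total and active parameter counts reported for the gpt-oss models (the figures below are indicative, not exact):

```python
# Per-token decode compute scales with *active* parameters (~2 FLOPs per
# active parameter per generated token), not with total parameters.
models = {
    "gpt-oss-20b":  {"total_b": 21,  "active_b": 3.6},   # approximate reported figures
    "gpt-oss-120b": {"total_b": 117, "active_b": 5.1},
}
for name, p in models.items():
    gflops_per_token = 2 * p["active_b"]  # billions of active params -> GFLOPs
    print(f"{name}: {p['total_b']}B total / {p['active_b']}B active "
          f"-> ~{gflops_per_token:.1f} GFLOPs per generated token")
```

On that arithmetic, a nearly 6x gap in total parameters shrinks to well under 2x in per-token compute.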
Lessons from AlphaGo
Before he set out to found Groq, Ross was an integral part of the Google team that built the Tensor Processing Unit (TPU).
Ross said that during his time at Google, his team was approached by a group that would not reveal which competition it was entering, only that it carried a million-dollar prize purse. The group had lost a test game badly and subsequently contacted Google to ask whether its TPUs were fast enough.
When Google said they were, and the group decided to port its AI model over to the TPUs, it revealed that the model was AlphaGo and that it would be playing a world champion in Go in 30 days. “He [world champion] had played the test game on a GPU, and he had beaten the GPUs badly — it wasn’t even close,” said Ross.
“So, 30 days later, we had it on the TPUs at Google, and we won four out of five games,” added Ross.
The famous shoulder-hit move by AlphaGo was a 1-in-10,000-probability play that was present in its training data. Ross said that when they later tested the model on a GPU, it could not generate that move, even though it existed in the training data and the model running on TPUs had produced it in actual gameplay.
The insight connected Go to language models in a fundamental way: “Go is a sequence of moves on a board, and you have almost 300 different moves you can pick from. An LLM is the same thing, except you have about 100,000 moves that you can pick from.”
Both problems involve sequential decision-making where each choice affects future options. “Just like you can’t pick the 30th move in Go until you figure out what the 29th is, you can’t pick the 100th token until you know what the 99th is,” Ross explained.
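In code, the dependency Ross describes is the standard autoregressive decoding loop; the sketch below stands in a placeholder scoring function for a real model (the function name and vocabulary size are illustrative):

```python
import numpy as np

VOCAB_SIZE = 100_000   # an LLM picks from roughly 100,000 "moves" (tokens); Go has about 300

def next_token_logits(context, rng):
    # Placeholder for a real model: score every possible next token
    # given everything generated so far.
    return rng.normal(size=VOCAB_SIZE)

def generate(prompt_tokens, n_new, seed=0):
    rng = np.random.default_rng(seed)
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        logits = next_token_logits(tokens, rng)
        tokens.append(int(np.argmax(logits)))  # token N depends on tokens 1..N-1
    return tokens

print(generate([1, 2, 3], n_new=5))
```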
This sequential dependency is where his architecture thesis crystallised: “CPUs are good at sequential, GPUs are good at parallel, and LPUs are a perfect blend between both of them.”
The Competition
That said, Ross has on multiple occasions dismissed claims that GPU makers such as NVIDIA are at war with inference-focused companies like Groq.
In a podcast episode earlier this year, he stated that AI model training must be conducted on GPUs and noted that if Groq were to deploy high volumes of lower-cost inference chips, the demand for training would increase.
“The more inference you have, the more training you need, and vice versa,” he said.
While Groq offers access to LPUs via GroqCloud, the company faces intense competition from Cerebras, which has claimed on several occasions to deliver the fastest inference speeds of any provider.
According to Artificial Analysis, a benchmarking platform, Cerebras’ lead in output speed holds across several AI models, with the company delivering over 3,000 tokens per second on MoE models such as OpenAI’s latest gpt-oss-120b.

The company’s Wafer-Scale Engine (WSE) incorporates massive on-chip memory, targeting both throughput and scale, while also supporting model training. Cerebras tends to excel on larger AI models.
However, Groq has its own advantages: it still offers lower latency and a quicker time to first token, and the company also claims to provide the lowest cost per token.

For deploying smaller models, which are increasingly preferred for agentic AI systems today, Groq also offers a much larger context window than Cerebras for models like Llama 4 Maverick (over 100k tokens versus 32k).