Are You Wasting Money on NVIDIA or AMD GPUs?

Analytics India Magazine (Supreeth Koundinya)

Companies and startups invest far more than pennies in GPUs, so it is only fair to evaluate thoroughly which ones fit the bill and which do not. That applies especially to inference, the critical process of running trained AI models to generate outputs.

But existing benchmarks fall short. They report tokens per second for a model on a GPU, but rarely specify the task or workload behind that number.

GPU software stacks also receive updates weekly. These updates significantly affect performance, yet many inference benchmarks do not account for them.

How SemiAnalysis Is Solving the Problem

SemiAnalysis, a leading semiconductor market research firm led by Dylan Patel, has announced a benchmark to address these challenges.

InferenceMAX is an automated benchmark that runs nightly to track AI inference performance across NVIDIA (GB200 NVL72, B200, H200, H100) and AMD (MI355X, MI325X, MI300X) hardware systems. It is expected to support Google TPUs and AWS Trainium in the next two months. 

The benchmark is directly supported by NVIDIA, AMD, Microsoft, OpenAI, Together AI, CoreWeave, Nebius, PyTorch Foundation, Supermicro, Crusoe, HPE, TensorWave, vLLM, SGLang, and others. 

The hardware systems are benchmarked on three AI models: Meta’s Llama 3.3 70B, OpenAI’s GPT-OSS 120B, and DeepSeek’s R1 (the May 2025 release). 

“There’s no benchmark that is real-world today,” said Patel in a pre-release briefing that AIM was a part of. 

“There are benchmarks out there where vendors cherry-pick a specific point of inference, or a certain point of performance,” he added, indicating the motivation to build InferenceMAX.

Software Updates Matter as Much as Hardware

New GPU variants bring substantial performance differences, but updates to the underlying software matter just as much.

Frameworks like SGLang, vLLM, TensorRT-LLM, CUDA, and ROCm achieve continuous improvement in performance through kernel-level optimisations, distributed inference, and scheduling strategies.

The frameworks update frequently, and InferenceMAX captures their impact in real time. It also accounts for FP4 and FP8 precisions when applicable. 

In addition, the benchmark evaluates GPUs on a variety of tasks that mirror actual deployment scenarios — simple chats, reasoning tasks, and document summarisation. 

Simple chats use 1,024 tokens in and 1,024 tokens out, whereas reasoning tasks take 1,024 input tokens but produce 8,192 output tokens, reflecting verbose, extended responses. 

Document summarisation flips this ratio (8,192 tokens in and 1,024 tokens out), challenging models to distil lengthy, data-heavy inputs into concise summaries.
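
To make these input/output ratios concrete, here is a minimal sketch assuming a simple dictionary representation; the names and structure are illustrative, not InferenceMAX’s actual configuration format.

```python
# Illustrative sketch of the three workload shapes described above.
# Token counts come from the article; everything else is hypothetical.
WORKLOADS = {
    "chat":          {"input_tokens": 1024, "output_tokens": 1024},
    "reasoning":     {"input_tokens": 1024, "output_tokens": 8192},
    "summarisation": {"input_tokens": 8192, "output_tokens": 1024},
}

def total_tokens_per_request(workload: str) -> int:
    """Total tokens processed per request for a given workload shape."""
    shape = WORKLOADS[workload]
    return shape["input_tokens"] + shape["output_tokens"]

for name in WORKLOADS:
    print(name, total_tokens_per_request(name))
```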

“People tend to cherry-pick one specific thing; they’ll make everything the exact same length for input and output, and then that ends up showing numbers that are not realistic,” said Patel, indicating how benchmarks should account for different workloads. 

The Economics of Inference

At the briefing, Patel explained inference operations with an analogy of a bus and a racecar. The former can transport many people at a lower speed, while the latter can carry one person at higher speeds.

Inference systems can operate anywhere between these two extremes: GPUs can be configured to serve many users slowly (high throughput) or a few users quickly (high interactivity). 

The benchmark therefore provides a Pareto frontier curve that shows the best possible tradeoffs between throughput and latency. 

Operating on the frontier means maximising revenue potential by either serving more users with the same infrastructure or delivering faster responses to the same number of users. 

On the frontier, neither metric can be improved without sacrificing the other.

Operating below this curve means wasting money and power without serving enough users, said Patel. 
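
As an illustration of the idea, here is a minimal sketch, with made-up data points, of how a throughput-versus-interactivity Pareto frontier can be extracted from a set of benchmark measurements; this is not InferenceMAX’s code, only the underlying concept.

```python
# Extract the Pareto frontier from (throughput, interactivity) measurements.
# Both metrics are treated as higher-is-better; the numbers are invented.
from typing import List, Tuple

# Each point: (tokens/s per GPU = throughput, tokens/s per user = interactivity)
measurements: List[Tuple[float, float]] = [
    (12000, 20), (10000, 45), (7000, 80), (9000, 30), (4000, 120), (6500, 60),
]

def pareto_frontier(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Keep every point that no other point beats on both metrics."""
    frontier = []
    for p in points:
        dominated = any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in points)
        if not dominated:
            frontier.append(p)
    return sorted(frontier)

print(pareto_frontier(measurements))
# -> [(4000, 120), (7000, 80), (10000, 45), (12000, 20)]
```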

Using SemiAnalysis’s data centre and GPU research models, InferenceMAX evaluates performance based on total cost of ownership (TCO) to reveal true economic efficiency. These metrics are rarely captured in standard benchmarks. 

It calculates TCO per million tokens by incorporating server costs, networking costs, software licensing, data centre expenses, and power consumption to determine the actual cost per hour of running inference on specific hardware. 

The platform normalises performance along two key axes: TCO per million tokens versus end-to-end latency, and TCO per million tokens versus tokens per second per user. 

“Running an H200 costs $1/hour on a capital cost, 35 cents an hour on a power and operational cost. That ends up being ~$1.40. How many tokens can you make? What is the cost per million tokens that’s delivered here?” said Patel, providing an example. 
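
To make that arithmetic concrete, here is a minimal sketch of the cost-per-million-tokens calculation, using the hourly figures from Patel’s example and a hypothetical throughput number; the full InferenceMAX TCO model also folds in networking, software licensing, and data centre costs, as noted above.

```python
# Back-of-the-envelope cost per million tokens, following Patel's H200 example.
# Hourly cost figures come from the quote; the throughput value is hypothetical.
capital_cost_per_hour = 1.00      # USD/hour, amortised hardware cost
power_and_ops_per_hour = 0.35     # USD/hour, power and operational cost
tco_per_gpu_hour = capital_cost_per_hour + power_and_ops_per_hour  # roughly $1.35-1.40

throughput_tokens_per_sec = 2500  # hypothetical per-GPU throughput for one workload
tokens_per_hour = throughput_tokens_per_sec * 3600

tco_per_million_tokens = tco_per_gpu_hour / (tokens_per_hour / 1_000_000)
print(f"${tco_per_million_tokens:.2f} per million tokens")  # $0.15 at these numbers
```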

The platform will soon add other industry-standard evaluations, such as the MATH-500 benchmark and the GPQA diamond evaluations. 

What About Groq and Cerebras?

AIM asked Patel if there would be support for other ASIC providers, such as Groq and Cerebras, which have been posting record-breaking inference performance. Notably, Cerebras recently raised $1.1 billion in funding.

Patel said he’d like to have them on board, but they would need to open up their code or allow SemiAnalysis to implement open-source code for their hardware. 

“With GPUs, today, there are open source frameworks and libraries. With TPUs and Trainium, there are significant efforts from both of these teams to make open source frameworks and libraries,” said Patel. 

The benchmark itself is open source and does not directly compensate contributors; Patel noted that providing the compute to run it is an expensive affair for vendors. 

Leaders from AMD, NVIDIA, and OpenAI have endorsed InferenceMAX. “Open collaboration is driving the next era of AI innovation. The open-source InferenceMAX benchmark gives the community transparent, nightly results that inspire trust and accelerate progress,” said Lisa Su, CEO of AMD, in a statement. 

“By benchmarking frequently, InferenceMAX gives the industry a transparent view of LLM inference performance on real-world workloads,” said Jensen Huang, CEO of NVIDIA. 

SemiAnalysis has released a dashboard to assess GPUs based on benchmark results and also published a deep dive for further reading on the benchmark. 

Debates have already begun on social media over InferenceMAX results, and the platform has opened a new avenue for companies to showcase, if not boast about, their performance against the competition. 
