Reading Between the Lines of Chinese LLMs
Analytics India Magazine (Supreeth Koundinya)

Understanding the reality of open weights AI models, and how China feels about LLMs.
Chinese models look formidable on paper, but once benchmarks refresh and real users settle into habits, the gap between performance and adoption reappears.
On OpenRouter—a multi-model inference platform that tracks real-world usage across more than 100 trillion tokens—Chinese models surged to nearly 30% of routed tokens at their peak. This spike was driven by the launches of DeepSeek, Qwen and Kimi, which were widely framed as breakthroughs for open weights AI.
Over the whole November-to-November window, however, that share averaged closer to 13%, while proprietary models still accounted for roughly 70% of total usage.
This disconnect between headline performance and lived adoption is the starting point of Gavin Leech’s research, Paper AI Tigers.
Benchmarks Peak Before Users Commit
Leech, an AI researcher and co-author of The Scaling Era: An Oral History of AI, 2019–2025, focused on how performance changes once benchmarks refresh.
Using AIME, a high-effort mathematics benchmark that refreshes annually while keeping difficulty broadly constant, he compared model performance across successive iterations. Chinese models showed markedly larger drops when moving from AIME 2024 to AIME 2025: on average, their scores fell by 13.7 percentage points, or about 21%, while Western models dropped 4.5 points, roughly 10%.
For instance, DeepSeek R1 fell from 89.3% on AIME 2024 to 76.0% on AIME 2025, a drop of more than 13 percentage points following the benchmark refresh.
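To keep the two kinds of figures straight, here is a minimal Python sketch of the arithmetic, using only the DeepSeek R1 scores quoted above:

aime_2024, aime_2025 = 89.3, 76.0  # DeepSeek R1 scores cited above, in %

point_drop = aime_2024 - aime_2025            # drop in percentage points
relative_drop = point_drop / aime_2024 * 100  # drop relative to the 2024 score

print(f"{point_drop:.1f} points, a {relative_drop:.0f}% relative decline")
# -> 13.3 points, a 15% relative decline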

Image Credit: Gavin Leech
“There’s nothing special about AIME, I picked it just because it’s high-effort and it updates,” Leech said in an interaction with AIM about his research. “It gives us nice properties: it’s novel, of equal difficulty, and the data is clean.”
The distinction matters because most benchmarks are static, public, and often embedded in training data.
“I imagine some of them are intentionally training on test data (to build hype, Silicon Valley style) and others are just in too much of a rush to decontaminate the pretraining corpus properly,” Leech said.
Why Benchmark Scores Overstate Real Capability
Evaluation choices further distort results. Metrics like avg@64 average a model's accuracy over 64 sampled runs per question, while pass@k-style scoring credits a model if any of its attempts succeeds; both can mask the run-to-run instability seen in single-pass, real-world use.
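A minimal Python sketch, with hypothetical per-run results, shows how the two summary styles diverge from what a single attempt looks like:

# Hypothetical correctness of 8 sampled runs on one question (True = solved).
runs = [True, False, True, False, False, True, False, False]

avg_at_k = sum(runs) / len(runs)       # avg@k: mean accuracy across runs -> 0.375
pass_at_k = 1.0 if any(runs) else 0.0  # pass@k: credit if ANY run succeeds -> 1.0

print(f"avg@{len(runs)} = {avg_at_k:.3f}, pass@{len(runs)} = {pass_at_k:.1f}")

A model that solves a question in only three of eight samples looks flawless under pass@k and mediocre under avg@k; a user making a single attempt experiences something closer to the latter, with the added risk of landing on a failing run.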
Leech argues that a perfectly uncontaminated benchmark is largely theoretical. Even when test questions are excluded from training, modern models still absorb the same underlying facts and patterns, allowing them to solve “new” questions that are mere rephrasings of what they have already seen.
“This is profoundly difficult to correct for,” he said. Even granting a fully decontaminated comparison, Leech expects the gap to widen rather than shrink.
He does, however, acknowledge improvements in Chinese models during 2025, while noting that the latest releases may already reflect exposure to AIME 2025.
“Qwen3 seems less contaminated than Qwen2.5, for instance,” he said. He plans to repeat the analysis when AIME 2026 arrives.
But there’s more.
Where Chinese Models Truly Perform Well
Based on OpenRouter’s usage patterns, Chinese models perform strongly on well-defined, repeatable workloads.
By late 2025, roleplay—which typically involves long system prompts, explicit character constraints and tightly scoped narrative objectives—accounted for roughly 33% of Chinese open-source usage, while programming and technical tasks made up nearly 40%. These are domains where prompts are structured, objectives are explicit, and variance is constrained. Even when programming usage rises, it tends to be volatile week-to-week.
By the end of 2025, more than half of all tokens on OpenRouter flowed through reasoning-optimised models, with tool-calling now routine in high-value workflows. That traffic is dominated by Claude, Gemini, xAI’s Grok variants, and a small number of Western open models.
Fatigue with open-source models is evident across the ecosystem.
Varun Mayya, founder of Aeos Labs, recently described stepping back from actively testing new open-source models altogether. “Two years ago, I tried everything,” he said in a YouTube video. “Now I don’t—most of them are simply behind.”
The issue, he argued, is not ideology but the speed of iteration. “You test an open model, then ChatGPT gets better. Then Claude gets better. Then Gemini gets better.”
Once users settle on a system that is “good enough” and improving quickly, experimentation rarely turns into retention.
Retention Favours Systems, Not Experiments
OpenRouter’s cohort and retention data show that models embedded in durable workflows develop stable user bases over time.
Claude 4 Sonnet and Gemini 2.5 Pro display relatively consistent repeat usage, while DeepSeek’s traffic is more episodic, characterised by sharp, release-driven spikes followed by drop-offs.
Beyond usage data, Leech draws a clear distinction between raw capability and readiness for deployment. Chinese labs, he says, are “closer on capability than product, and closer on product than on legal and quasi-legal compliance.”
This indicates that in practice, Chinese models may perform well in isolation but struggle to meet the reliability and compliance standards expected in large-scale deployments.
Leech also points to latency. In browser use, Chinese models are often slower than Western systems, a gap he links to inference bottlenecks from chip controls.
For example, Artificial Analysis data shows DeepSeek V3 generating fewer than 30 tokens per second, whereas Gemini 2.5 Flash and Grok 4 Fast generate more than 200.
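The practical effect of that throughput gap is easy to sketch; the response length below is a hypothetical figure, while the speeds are the rough bounds cited above:

# Rough output-token throughput (tokens/second) from the figures cited above.
throughput = {"DeepSeek V3": 30, "Gemini 2.5 Flash": 200, "Grok 4 Fast": 200}

response_tokens = 2_000  # hypothetical length of one long response

for model, tps in throughput.items():
    # Wall-clock time to stream the response, ignoring network and prefill latency.
    print(f"{model}: ~{response_tokens / tps:.0f} s")
# -> roughly 67 s versus 10 s for the same response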
Leech makes a related point about context windows and cost. While many Chinese models advertise long maximum contexts, the amount they can reliably reason over at once is far smaller—often closer to 5-10% of the headline figure.
The result is that low per-token pricing does not always translate into lower per-task cost.
On aggregated benchmark runs tracked by SemiAnalysis, DeepSeek R1 consumed nearly 100 million output tokens to complete the evaluation suite, whereas Claude Opus 4 and GPT-4.1 required a fraction of that.

Even at a steep per-token discount, longer generations and heavier reasoning traces can erase much of the apparent savings.
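A back-of-the-envelope sketch in Python makes the point; the prices and the second token count are illustrative assumptions, and only DeepSeek R1's roughly 100-million-token figure comes from the comparison above:

def task_cost(output_tokens, usd_per_million_tokens):
    # Per-task cost = tokens generated x price per token.
    return output_tokens / 1_000_000 * usd_per_million_tokens

cheap_verbose = task_cost(100_000_000, 2.20)  # low per-token price, long reasoning traces
pricey_terse = task_cost(15_000_000, 15.00)   # assumed: ~7x the price, far fewer tokens

print(f"Cheap-but-verbose run: ${cheap_verbose:,.0f}")  # ~$220
print(f"Pricier-but-terse run: ${pricey_terse:,.0f}")   # ~$225

On these illustrative numbers, a roughly sevenfold per-token discount is almost entirely consumed by the extra tokens the longer generations require.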
But Why?
Leech attributes the brittleness observed in benchmarks, the low adoption reported by OpenRouter, and poor real-world reliability to technical constraints rather than to structural inferiority.
“This year (at the country level) they had 5–10x less compute than the Western labs, and so their models are probably undertrained,” Leech said. As a result, Western labs have had more resources to train larger and more robust models.
He added that Chinese labs are still developing expertise in key areas such as pre-training and reinforcement learning. “They will catch up eventually.”
These constraints matter at the global edge. But inside China, a very different adoption logic is taking shape.
Inside China, Adoption Follows a Different Logic
“Overall, Chinese AI continues to take a pragmatic approach and is really focused on diffusing into the real economy, especially after the announcement of the AI Plus initiative,” said Grace Shao, a tech analyst and founder of AI Proem, in an interaction with AIM.
This government-led initiative, formally announced in the March 2024 Government Work Report, sets a specific target for 70% of key industrial sectors, such as manufacturing and agriculture, to adopt AI-driven digital workflows by 2027.
She stated that, as a company, Alibaba has taken a very deliberate enterprise strategy and is already seeing revenue growth in its cloud business.
Recent financial reports validate this approach, with Alibaba Cloud recording 34% year-over-year revenue growth, fuelled by AI-related product revenue, which has sustained triple-digit growth for several consecutive quarters.
“Tencent is seeing ad revenue and value-added services grow. For the startups, there has been a lot of noise around the potential IPO for Moonshot and MiniMax,” she said.
Specifically, Tencent reported a 17% increase in ad revenue driven by AI-enhanced targeting, while Moonshot and MiniMax are reportedly preparing for listings in Hong Kong as early as 2026 to secure capital for the model wars.
Looking back at the year, Shao stated that startups have focused on consumer-facing apps, such as companion bots and image-editing apps.
“But those apps have not been able to find revenue and distribution,” Shao added. “So you’re actually seeing the first wave of those startups slowly dying down, especially when big tech launches a similar function within their 1+ billion MAU ecosystem.”
A significant example of this consolidation occurred when Zhipu AI reportedly disbanded its consumer product R&D centre in late 2025, laying off or reassigning staff to focus strictly on government and enterprise contracts.
This threat arises from big tech companies in China integrating models such as DeepSeek into existing social media applications, including WeChat.
She also points out how the former COO of Zhipu has started his own company targeting enterprise customers.
This new venture, Yuanli Intelligence (Yoolee AI), founded by Zhang Fan, recently raised $8 million to build autonomous “digital workforce” agents for business workflows rather than consumer chatbots.
This move signals a broader industry pivot, with top talent and capital fleeing the saturated consumer “model wars” for high-value, specialised B2B applications that avoid direct competition with big tech ecosystems.
“Other players like Kuaishou’s Kling are focused on a very niche vertical—serving the creatives. So what you’re seeing is that companies are moving away from a general chatbot direction, especially now that there is Doubao, Kimi, and even Qwen the app (and to some degree Yuanbao in WeChat) competing for that China ‘ChatGPT’ crown.”
Zixuan Li, director of product and generative AI strategy at Z.ai (Zhipu AI), draws a clear contrast between Chinese and Western anxieties around AI. He recently appeared in an interview with ChinaTalk.
In China, Li said, the concern is concentrated among developers and largely focuses on jobs rather than on abstract long-term risk. “They fear the most because they try out the new models and new products more frequently than the general public, so they can feel the power.”
Outside that group, concern is far less pronounced. “Many people use DeepSeek and other chatbots to brainstorm ideas or polish writing, but they don’t believe that this work can replace them,” he added.
On how closely the general public follows AI trends, Li said, “I believe that people just know about DeepSeek. Maybe only one million people follow the latest trend, and a billion people do their work daily and are not impacted by AI. The more you learn about it, the more fear you have.”