Comparing open-source models like Qwen against Anthropic's models is absolutely foolish. First, Anthropic has never disclosed the actual parameter count or architecture of its models.
Second, it's well known that these open-source models more or less distill from other models and use MoE, which lets them run at much lower computational cost. Using Qwen as a comparison point only proves the blog post's author is foolish. The article devotes such a large portion to discussing Qwen on OpenRouter that I find it hard to take seriously.
Anthropic is obviously also aware of the benefits of MoE and of distilling a larger model into a smaller one, so they could run a model the same size as Alibaba's for the same inference cost if they wanted to. Or they could run a slightly larger model for a slightly higher cost. They definitely aren't running a much larger model (except potentially as a teacher for distillation training), because then they wouldn't be able to hit the output speeds they're hitting.
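The cost argument here can be made concrete: in a top-k MoE layer, each token is routed through only k of the n experts, so per-token compute scales with the *active* parameters rather than the total. A toy NumPy sketch (illustrative shapes and names only, not any lab's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_forward(x, experts_w, router_w, top_k=2):
    """Route one token to its top_k experts; only those experts run.

    x:         (d_model,) one token embedding
    experts_w: (n_experts, d_model, d_ff) expert weight matrices
    router_w:  (d_model, n_experts) router projection
    """
    logits = x @ router_w                 # score every expert
    top = np.argsort(logits)[-top_k:]     # indices of the k best experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                  # softmax over the chosen experts
    # Only top_k expert matmuls execute, not all n_experts.
    out = sum(g * (x @ experts_w[i]) for g, i in zip(gates, top))
    return out, top

n_experts, d_model, d_ff, top_k = 8, 16, 32, 2
experts = rng.normal(size=(n_experts, d_model, d_ff))
router = rng.normal(size=(d_model, n_experts))
x = rng.normal(size=d_model)

out, chosen = moe_forward(x, experts, router, top_k)
total_params = n_experts * d_model * d_ff
active_params = top_k * d_model * d_ff
print(f"params touched per token: {active_params}/{total_params}")
# → params touched per token: 1024/4096
```

With 8 experts and top-2 routing, each token pays for a quarter of the layer's weights, which is the whole "same capability, lower inference cost" mechanism the thread is arguing about.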
They are fully aware, but they're playing a different game. R&D isn't something where you flip a parameter and suddenly get what the efficiency-oriented pipelines deliver.
Chinese models were built under constraints. As we know, limitations lead to innovation. So the "Chinese" R&D invested in optimisations. Teacher models were already there, so they likely built the best distillation processes, along with the best MoE implementations. They actually published many of these works.
Nuance, sure.
Anthropic/OpenAI could revise their philosophy to adopt efficiency.
But momentum can't be underestimated. Plus, dollars-per-optimisation is a different math altogether; it's not only about access to the latest Nvidia GPUs. At $400k a pop per engineer per year, plus health coverage and pension contributions, hardware efficiency doesn't weigh as much as making sure engineering focuses on... the raw power factor, I suppose.
Every company is subject to constraints. A bigger budget is not an infinite budget. And there is no tradeoff between efficiency and raw power. An optimization that lets you build a similarly powerful model for less money also lets you build a more powerful model for the same amount of money.
Honestly, I wonder what you think closed LLM companies do R&D on if not optimizations. And the nature of research is that most ideas that sound good turn out duds, so they already need to have an established process for testing many ideas quickly. Now if somebody publishes a new idea they haven't tried yet, setting up an experiment to try it out is just a routine task... But they aren't going to tell anybody the results, just quietly integrate it if it works.
I concede we can't be sure what they do, since it's proprietary. Aside from leaks, which give us a sense of the philosophy.
It's clear to me the economics would make the likes of OpenAI and Anthropic focus on raw power over optimisations. I never meant they wouldn't optimise anything, but they hit diminishing returns earlier than a company like Alibaba, or even Mistral.
The Chinese models were trained in a context of compute scarcity. So for them it wasn't the same as "routine" optimisations; it was optimisations or nothing.
A year or two later, those optimisations allowed their models to be somewhat on par with the raw-power models from the US providers.
Now, despite papers being published, a design is rather sticky; it's not as simple as plugging in an optimisation another lab came up with. It depends on the optimisation: perhaps a multi-head attention variant wasn't that big of a deal to add in, but MoE would have been far less easy.
Another rather peculiar point: why did the Department of War reject Anthropic and label it a supply chain risk entity, yet award the contract to OpenAI, despite the "prohibited content" policy in Anthropic's announcement being almost identical to OpenAI's stated policy?