Counterpoint: Google is paying Apple $20b/year to keep themselves as the default search engine in iOS. Android's biggest competitor is iOS. They are now effectively funding their competition on a core product interface. I see this as strategically devastating.
It's strategically devastating because no small number of users choose Apple precisely because they do not trust Google, and now they have no choice but to have Google AI on board their devices.
I respect Google's engineering, and I'm aware that fundamental technologies such as Protocol Buffers and FlatBuffers are unavoidably woven into the software fabric, but this is avoidable.
I'm surprised Google aren't paying Apple for this.
> no small number of users choose Apple because they do not trust Google
Unfortunately, it probably actually is a small number comparatively. Or at least I would need to see some sort of real data to say anything different.
I feel like people who distrust Google probably wouldn't trust Apple enough to give them their data either? Why would you distrust one but not the other?
Apple is still in the business of selling devices, not customer data. With Google being an external company, I bet there'll be an extensive permissions system so you can limit what the AI can do (or turn it off altogether).
Yes, but I may want to use Apple Intelligence, and now I have to use Google intelligence instead. This is not the product I paid thousands of dollars for.
Second, I'm developing privacy-focused apps that were going to use the foundation models. Now I need to seriously reconsider this.
Just don't update your phone; they'll probably switch it on without asking, like they do with Apple Intelligence. Or use CarPlay, for which Siri is required.
Is Android really iOS's competition? I feel like the competition is less Android itself and more the vendors who use Android. Every Android phone feels different. Android doesn't even compete on performance anymore; the chips are quite far behind. The target audiences of the two feel different lately.
It ISN'T in this day and age. People don't switch back and forth between iOS and Android like it's still 2010. They use whatever they got locked into with their first smartphone, or wherever Apple's green/blue-bubble issue pushed them, or what their family handed down, or what their close friend groups happened to use.
People who've been using iOS for 6+ years will 98% stick with iOS for their next purchase and won't even bother to look at Android, no matter what features Android adds.
The Android vs iOS war is as dead as the console war. There's no competition anymore, it's just picking one from a duopoly of vendor lock-ins.
Even if the EU were to break some of the lock-ins, people have familiarity bias and will stick with the inertia of what they're used to, so it will not move the market-share needle one bit.
Of course Android is iOS's competition. Android is also 75% of the market, which Apple surely wants a bigger piece of.
Performance? We are many years past the point where anybody cared about performance. I am writing this on an iPhone 11 Pro and the experience is almost exactly the same as on a current iPhone.
You know what's not the same? Android has become a pretty great OS. I recently got an older Pixel to see how GrapheneOS works and was surprised by Android (which I hadn't seen for a decade). iOS, on the other hand, has recently gone through a very bad UI redesign for no reason.
IMHO the main thing Apple has going for it is that Google is a spyware company and Apple is still mainly a hardware company. But if Apple decides to funnel their users' data to Gemini… well, good luck.
My bar for super-rough is Servo, which doesn't have password autofill… and doesn't render the Orion page right.
Orion is less rough, but the color scheme doesn't work, and it doesn't have an omnibar (as in: type in the address bar, press Enter, and it shows search results).
1. Chinese models typically focus on text. US and EU models also bear the cross of handling images, and often voice and video. Supporting all of those adds training cost not spent on further reasoning; it ties one hand behind your back in order to be more generally useful.
2. The gap seems small because so many benchmarks get saturated so fast. But towards the top, every additional percentage point on a benchmark represents a significantly better model.
On the second point, I worked on a leaderboard that both normalizes scores and predicts unknown scores, to help improve comparisons between models on various criteria: https://metabench.organisons.com/
You can notice that, while Chinese models are quite good, the gap to the top is still significant.
However, the US models are typically much more expensive for inference, and Chinese models do have a niche on the Pareto frontier of cheaper but serviceable models (even though US models are also eating into that part of the frontier).
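For the curious, here is a rough sketch of the general "normalize, then predict the unknown scores" idea (my own toy illustration in Python; the actual metabench method may well differ, and the numbers are made up):

```python
# Toy sketch (not the metabench code): z-score each benchmark column so scores
# are comparable, then fill the missing model x benchmark cells with a low-rank
# fit on the observed cells.
import numpy as np

rng = np.random.default_rng(0)
scores = np.array([            # rows = models, cols = benchmarks, NaN = not run
    [88.0, 71.0, np.nan],
    [85.0, np.nan, 62.0],
    [np.nan, 64.0, 55.0],
])

# 1) Normalize each benchmark (mean 0, std 1) so they live on the same scale.
mean = np.nanmean(scores, axis=0)
std = np.nanstd(scores, axis=0)
z = (scores - mean) / std

# 2) Predict missing cells with a rank-k factorization, fit by gradient descent
#    on the observed cells only.
k, lr, steps = 2, 0.05, 2000
U = rng.normal(scale=0.1, size=(z.shape[0], k))
V = rng.normal(scale=0.1, size=(z.shape[1], k))
mask = ~np.isnan(z)
for _ in range(steps):
    err = np.where(mask, U @ V.T - np.nan_to_num(z), 0.0)
    U -= lr * err @ V
    V -= lr * err.T @ U
predicted = np.where(mask, z, U @ V.T) * std + mean   # keep known, fill unknown
print(np.round(predicted, 1))
```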
Nothing you said helps with the issue of valuation. Yes, the US models may be better by a few percentage points, but how can they justify being so costly, both operationally and in investment terms? Over the long run, this is a business, and you don't make money by being first; you have to be more profitable overall.
I think the investment race here is an "all-pay auction"*. Lots of investors have looked at the ultimate prize — basically winning something larger than the entire present world economy forever — and think "yes".
But even assuming that we're on the right path for that (which we may not be) and assuming that nothing intervenes to stop it (which it might), there may be only one winner, and that winner may not have even entered the game yet.
> investors have looked at the ultimate prize — basically winning something larger than the entire present world economy
This is what people like Altman want investors to believe. It seems like any other snake-oil scam because it doesn't match the reality of what he delivers.
Qwen Image and Image Edit were among the best image models until Nano Banana Pro came along. I have tried some open image models and can confirm that the Chinese models are easily the best or very close to the best, but right now the Google model is even better... we'll see if the Chinese labs catch up again.
I'd say Google still hasn't caught up on the smaller model side at all, but we've all been (rightfully) wowed enough by Pro to ignore that for now.
Nano Banana Pro starts at 15 cents per image at <2K resolution, and it is not strictly better than Seedream 4.0; yet the latter does 4K for 3 cents per image.
Add in the power of fine-tuning on their open weight models and I don't know if China actually needs to catch up.
I finetuned Qwen Image on 200 generations from Seedream 4.0 that were cleaned up with Nano Banana Pro, and got results that were as good as, and more reliable than, either model could achieve on its own.
FWIW, Qwen Z-Image is much better than Seedream, and people (redditors) are saying it's better than Nano Banana in their first trials. It's also 7B, I think, and open.
I've used and finetuned Z-Image Turbo: it's nowhere near Seedream, or even Qwen-Image when the latter is finetuned (it also doesn't do image editing yet).
It is very good for its size and speed, and I'm excited for the Edit and Base variants... but Reddit has been a bit "over-excited" because it runs on their small GPUs and isn't overly resistant to porn.
Not true at all. Qwen has a VLM (Qwen2-VL-Instruct), which is the backbone of ByteDance's TARS computer-use model. Both Alibaba (Qwen) and ByteDance are Chinese.
Also, DeepSeek got a ton of attention with their OCR paper a month ago, which was an explicit example of using images rather than text.
The scales are a bit murky here, but if we look at the 'Coding' metric, we see that Kimi K2 outperforms Sonnet 4.5 - and Sonnet is still considered the price-performance darling, I think, even today?
I haven't tried these models, but in general there have been lots of cases where a model performs much worse IRL than the benchmarks would suggest (certain Chinese models and GPT-OSS have been guilty of this in the past).
• For both Kimi K2 and for Sonnet, there's a non-thinking and a thinking version.
Sonnet 4.5 Thinking is better than Kimi K2 non-thinking, but the K2 Thinking model came out recently, and beats it on all comparable pure-coding benchmarks I know: OJ-Bench (Sonnet: 30.4% < K2: 48.7%), LiveCodeBench (Sonnet: 64% < K2: 83%), they tie at SciCode at 44.8%. It is a finding shared by ArtificialAnalysis: https://artificialanalysis.ai/models/capabilities/coding
• The reason developers love Sonnet 4.5 for coding, though, is not just the quality of the code. They use Cursor, Claude Code, or some other system such as GitHub Copilot, all of which are increasingly agentic. On the Agentic Coding criterion, Sonnet 4.5 Thinking scores much higher.
By the way, you can look at the Table tab to see all known and predicted results on benchmarks.
The table is confusing. It is not clear what is known and what is predicted (and how it is predicted). Why not measure the missing pieces instead of predicting them: is it too expensive, or is the tooling missing?
Qwen, Hunyuan, and WAN are three of the major competitors in the vision, text-to-image, and image-to-video spaces. They are quite competitive; right now WAN is only behind Google's Veo in the image-to-video rankings on LMArena, for example.
Because they are open source, I believe Chinese models are less prone to censorship, since US corporations can add censorship in several ways simply by virtue of controlling a closed model.
It's not about an LLM being prone to anything, but more about the way an LLM is fine-tuned (which can be subject to the requirements of those wielding political power).
Yes, they are extremely likely to be prone to censorship based on the training. Try running them locally with something like LM Studio and asking questions the government is uncomfortable about. I originally thought the bias was in the GUI, but it's baked into the model itself.
Indeed. A mouse that runs through a maze may be right to say that it is constantly hitting a wall, yet it makes constant progress.
An example is citing Mr Sutskever's interview this way:
> in my 2022 “Deep learning is hitting a wall” evaluation of LLMs, which explicitly argued that the Kaplan scaling laws would eventually reach a point of diminishing returns (as Sutskever just did)
which is misleading, since Sutskever said it didn't hit a wall in 2022[0]:
> Up until 2020, from 2012 to 2020, it was the age of research. Now, from 2020 to 2025, it was the age of scaling
The larger point that Mr Marcus makes, though, is that the maze has no exit.
> there are many reasons to doubt that LLMs will ever deliver the rewards that many people expected.
That is something most scientists disagree with. In fact, the ongoing progress of LLMs has already delivered tremendous utility, which may already justify the investment.
Why RVQ though, rather than using the raw VAE embedding?
If I compare rvq-without-quantization-v4.png with rvq-2-level-v4.png, the quality seems oddly similar, but the former takes a 32-sized vector, while the latter takes two 32-sized (one-hot) vectors (2 = number of levels, 32 = number of quantization cluster centers). Isn't that more?
I had a part about this but I took it out: for compression, you could keep the embeddings unquantized and it would still compress quite well, depending on the embedding dimension and the number of quantization levels.
But categorical distributions are better for modelling. It's a little difficult to explain here without using diagrams. The intuition is that if you try to have a model predict the next embedding and not the next token, you can't model multimodal distributions - you'll end up predicting the mean of the possible continuations and not the mode, which is not what you want.
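To make that mean-vs-mode point concrete, here is a toy sketch (my own, in PyTorch, not from the blog): fitting a single continuous prediction to a bimodal target with MSE lands on the mean, between the two modes, while a categorical distribution over bins recovers both modes.

```python
# Toy example: bimodal targets, regression vs. categorical prediction.
import torch

torch.manual_seed(0)
# Half the "continuations" are near -1, half near +1.
targets = torch.cat([torch.randn(500) * 0.1 - 1.0, torch.randn(500) * 0.1 + 1.0])

# 1) Regression: a single free value trained with MSE collapses to the mean.
pred = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([pred], lr=0.05)
for _ in range(2000):
    opt.zero_grad()
    ((pred - targets) ** 2).mean().backward()
    opt.step()
print("MSE prediction:", pred.item())  # ~0.0: the mean, which is neither mode

# 2) Categorical: logits over bins trained with cross-entropy keep both modes.
bins = torch.linspace(-1.5, 1.5, 31)
labels = torch.bucketize(targets, bins)
logits = torch.zeros(len(bins) + 1, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.05)
for _ in range(2000):
    opt.zero_grad()
    torch.nn.functional.cross_entropy(logits.expand(len(labels), -1), labels).backward()
    opt.step()
probs = torch.softmax(logits, dim=0)
print("top bins:", probs.topk(2).indices.tolist())  # the bins around -1 and +1
```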
At the bottom of the blog, I link two articles that do make continuous embeddings work. One of them is the Kyutai paper Continuous Audio Language Models: https://arxiv.org/abs/2509.06926
Hmm, I think a mixture of beta distributions could work just as well as a categorical here. I'm going to train it with PixelRNN, but it's going to take hours or days to train (it's a very inefficient and unparallelizable architecture). I'll report back tomorrow.
After another 24 hours of training and around 100 epochs, we get down to 4.4 bits/dim and colors are starting to emerge[1]. However, an issue a friend brought up is that the log-likelihood of a beta distribution weights values near 0 and 1 much more heavily:
This means we should see most outputs be pure colors: black, white, red, blue, green, cyan, magenta, or yellow. 3.6% of the channels are 0 or 255, up from 1.4% after 50 epochs[2]. Apparently, an earth-mover loss might be better:
E_{x ~ output distribution}[|correct - x|]
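(Just to spell out one way to compute that loss in practice: a minimal sketch of my own, using a single Beta rather than the full mixture, estimating the expectation by Monte Carlo with reparameterized samples so the gradient reaches the predicted parameters.)

```python
# Monte Carlo estimate of E_{x ~ Beta(alpha, beta)}[|correct - x|].
import torch
from torch.distributions import Beta

def earth_mover_loss(alpha, beta, correct, n_samples=64):
    """alpha, beta: predicted Beta parameters; correct: target channel in [0, 1]."""
    x = Beta(alpha, beta).rsample((n_samples,))   # reparameterized, so gradients flow
    return (correct - x).abs().mean(dim=0)

# Example: a Beta whose mass sits far from the target yields a large loss.
alpha = torch.tensor(2.0, requires_grad=True)
beta = torch.tensor(5.0, requires_grad=True)
loss = earth_mover_loss(alpha, beta, correct=torch.tensor(0.9))
loss.backward()
print(loss.item(), alpha.grad.item(), beta.grad.item())
```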
I could retrain this for another day or two, but PixelRNN is really slow, and I want to use my GPU for other things. Instead, I trained a 50x faster PixelCNN for 50 epochs with this new loss and... it just went to the average pixel value (0.5). There's probably a way to train a mixture of betas, but I haven't figured it out yet.
Okay, so my PixelCNN masking was wrong... which is why it went to the mean. The earth-mover loss did get better results than negative log-likelihood, but I found a better solution!
The issue with negative log-likelihood was that the neural network could optimize solely around zero and one, because the density has poles there. The key insight is that the color value in the image is not exactly zero or one. If we are given #00, all we really know is that the real-world brightness was between #00 and #01, so we should be integrating the probability density function from 0 to 1/256 to get the likelihood.
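Here is a minimal sketch of that discretized likelihood (my own work-around: it approximates the per-bin integral numerically with torch.trapezoid instead of an exact Beta CDF, which, as noted below, PyTorch doesn't really provide; function names are mine):

```python
# log P(channel == k) under Beta(alpha, beta), treating the 8-bit value k as
# the interval [k/256, (k+1)/256) and integrating the PDF over that bin.
import torch
from torch.distributions import Beta

def discretized_beta_log_likelihood(alpha, beta, value_8bit, n_points=9):
    lo = value_8bit.float() / 256.0
    hi = (value_8bit.float() + 1.0) / 256.0
    t = torch.linspace(0.0, 1.0, n_points)
    xs = lo.unsqueeze(-1) + (hi - lo).unsqueeze(-1) * t   # sample points in the bin
    xs = xs.clamp(1e-6, 1 - 1e-6)                         # keep the PDF finite
    pdf = Beta(alpha.unsqueeze(-1), beta.unsqueeze(-1)).log_prob(xs).exp()
    prob = torch.trapezoid(pdf, xs, dim=-1)               # numerical integral over the bin
    return torch.log(prob.clamp_min(1e-12))

# A channel value of 0 no longer gets an unbounded likelihood just because the
# Beta density has a pole at x = 0.
print(discretized_beta_log_likelihood(torch.tensor(0.7), torch.tensor(2.0), torch.tensor(0)))
```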
It turns out PyTorch does not have a good implementation of Beta.cdf(), so I had to roll my own. Realistically, I just asked the chatbots to tell me what good algorithms there were and to write me code. I ended up with two:
(1) There's a known continued fraction form for the CDF, so combined with Lentz' algorithm it can be computed.
(2) Apparently there's a pretty good closed-form approximation as well (Temme [1]).
The first one was a little unstable in training, but worked well enough (output: [2], color hist: [3]). The second was a little more stable in training, but had issues with NaNs near zero and one, so I had to clamp things there, which makes it a little less accurate (output: [4], color hist: [5]).
The bits/dim gets down to ~3.5 for both of these, which isn't terrible, but there's probably something that can be done better to get it below 3.0. I don't have any clean code to upload, but I'll probably do that tomorrow and edit (or reply to) this comment. But that's it for the experiments!
Anyway, the point of this experiment was because this sentence was really bothering me:
> But categorical distributions are better for modelling.
And when I investigated why you said that, it turns out the PixelRNN authors used a mixture of Gaussians, and even said they're probably losing some bits because Gaussians go out of bounds and need to be clipped! So I really wanted to say, "seems like a skill issue, just use Beta distributions," but then I had to go check whether that really did work. My hypothesis was that Betas should work even better than a categorical distribution, because a categorical model has to learn that nearby output values are indeed nearby, while this is baked into the Beta model. We see the issue show up in the PixelRNN paper, where their outputs are very noisy compared to mine (histogram for a random pixel: [6]).
> DeepSeek models cost more to use than comparable U.S. models
They compare DeepSeek V3.1 to GPT-5 mini. Those have very different sizes, which makes it a weird choice. I would expect a comparison with GPT-5 High, which would likely have produced the opposite finding, given the high cost of GPT-5 High and the relatively similar results.
Granted, DeepSeek typically focuses on a single model at a time, instead of OpenAI's approach of offering a suite of models at varying costs. So there is no DeepSeek model similar to GPT-5 mini, unlike Alibaba, which has Qwen 30B A3B. Still, a weird choice.
Besides, DeepSeek has shown with 3.2 that it can cut prices in half through further fundamental research.
> CAISI chose GPT-5-mini as a comparator for V3.1 because it is in a similar performance class, allowing for a more meaningful comparison of end-to-end expenses.
mosh is hard to get into. There are many subtle bugs; one that I ran into is that it fails to connect when the LC_ALL variables diverge between the client and the server[0]. On top of that, development seems abandoned. Finally, when running a terminal multiplexer, the predictive system breaks the panes, which is distracting.
I wonder why so many governments sign with a company that, even if the contract says it will not leak information to the US government, is required to yield any information the US requests, without even being able to notify its client, regardless of where the servers themselves are located.
What evidence is there that Palantir is lobbying for Chat Control? I can't find anything online.
I know you said "probably", but is your speculation based on anything? To me that would be considerably worse than just selling surveillance and investigation software to governments.
Past Mistral investors: JCDecaux (urban advertising), the CMA CGM CEO (maritime logistics), the Iliad CEO (Internet service provider), Salesforce (customer relationship management), Samsung (electronics), Cisco (network hardware), NVIDIA (chip designer)[0]. I agree ASML is a surprising choice, but I guess investments are not necessarily directly connected to the company's purpose.
BTW, I generated that list by asking my default search engine, which is Mistral Le Chat: thanks to Cerebras chips, the responses are so fast that it has become competitive with asking Google Search. A lot of comments claim it is worse, but in my experience it is the fastest, and for all but very advanced mathematical questions it has similar quality to its best competitors. Even LMArena's Elo indicates it wins 46% of the time against ChatGPT.
The list seems to be missing a couple of other notable investors: Eric Schmidt (former Google CEO), Andreessen Horowitz, Lightspeed Venture Partners, General Catalyst and Microsoft (only $16M).