

zozbot234 14 days ago | on: DeepSeek 4 Flash local inference engine for Metal

The basic argument is that its KV cache is roughly an order of magnitude more compact than that of previous Chinese models, which were already very compact compared to the likes of Gemma 4 (though that example is a bit of an extreme). If you pair this with the basics of how to maximize LLM inference performance at scale (recently covered in a video lecture on the Dwarkesh Patel YouTube podcast), the case for doing slow batched inference on-prem with DeepSeek V4, perhaps even with memory offload, becomes quite obvious as I see it. Of course, I'd like to be proven wrong!
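
To make the arithmetic behind that argument concrete, here is a rough back-of-the-envelope sketch. It assumes a plain grouped-query attention layout, where the cache holds one key and one value vector per KV head per layer for every token. Every number in it (layer count, head count, head dimension, memory budget, context length, and the flat 10x reduction) is a made-up placeholder, not a published figure for DeepSeek V4 or any other model.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies (keys + values, all layers)."""
    # Factor 2 = one key tensor plus one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(cache_budget_bytes: int, bytes_per_token: int,
                             context_len: int) -> int:
    """How many full-length sequences fit in a fixed KV cache budget."""
    return cache_budget_bytes // (bytes_per_token * context_len)


# Hypothetical baseline model: 48 layers, 8 KV heads, head_dim 128, fp16 cache.
baseline = kv_bytes_per_token(layers=48, kv_heads=8, head_dim=128)

# Hypothetical "order of magnitude more compact" cache (assumed flat 10x).
compact = baseline // 10

budget = 64 * 1024**3   # assumed 64 GiB set aside for the KV cache
context = 32_768        # assumed context length per sequence

print("baseline bytes/token:", baseline)
print("baseline batch size: ", max_concurrent_sequences(budget, baseline, context))
print("compact batch size:  ", max_concurrent_sequences(budget, compact, context))
```

With these placeholder numbers the same memory budget holds roughly ten times as many concurrent sequences, which is the lever that matters for batched throughput: the weights are read once per step regardless of batch size, so a larger batch spreads that cost over more tokens.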

gghh 14 days ago

Right, Dwarkesh's episode with Reiner Pope. Didn't watch the full video, but as soon as I saw both of them going to an old-school blackboard with actual chalk in hand, I could tell they meant business hehe :) Thanks for recommending the vid and for the info about DS V4.
