Hacker News | lostmsu's comments

I don't know what's behind the wall I'm sitting next to right now, but I'm reasonably sure there's a street. I'm also reasonably sure the comment about "you've been dead" is a very accurate prediction.

That wall is concrete and material. Death is not. I am reasonably sure you can predict the street with great accuracy while still having zero idea what lies in wait for us after we die. A false equivalence.


I never saw that happen in Codex, so there's a good chance that OpenClaw does something wrong. My main suspicion would be that it does not pass back thinking traces.

Anecdata, but I see this in Codex all the time. It takes about two rounds before it realises it's supposed to continue.

I started seeing this a lot more with GPT 5.4. 5.3-codex is really good about patiently watching and waiting on external processes like CI, or managing other agents async. 5.4 keeps on yielding its turn to me for some reason even as it says stuff like "I'm continuing to watch and wait."

H2 is already a fuel

He meant lossy compression

From a security-minded user perspective it makes sense to destroy keys when instead of a single entity I receive updates from I get another entity that is not equivalent, and half of my previous entity thinks that the other half is sus.

[flagged]


It wasn't an intelligence agency compromise; it was a business partner compromise, by a partner who intended to violate the privacy and security of their users. Nothing about this was done out of spite. I'm not sure where you're getting that from. You just seem to be attacking people's character for making the right choice given the circumstances.

Break it up

> And they know what revolutions mean in Russia.

In retrospect, they would mean saving 400k+ young men from dying, approx. 200k+ on each side. But Navalny wasn't a revolutionary (his mistake: if you have a death wish, there are more effective methods than peaceful protesting).


You can also tell Claude/Codex/whatever to look up previous conversations in respective folders.

Yes, I go even further. In-repo, I have a chats folder that my /done skill fills with ~"what we did, and didn't accomplish in this chat. Blah blah (a few more instructions) - finish with a great hand off to the next chat to continue the work." I run that anytime I approach 50% of the context window, as all models get dumb at that point. Then /clear, then /effort max just to be safe, then "please ingest chats/2026-01-01-00-00-what-we-did.md and proceed." It's a very purposeful custom /compress that works far better in my experience. If I ever hit auto-compress, I have failed as a Claude jockey.

No you did not. You got 207 tok/s on an RTX 3090 with speculative decoding which, generally speaking, is not the same quality as serving the model without it.

Greedy-only decoding is even worse. There's a reason every public model comes with suggested sampling parameters. When you don't use them, output tends to degrade severely. In your case simply running a 14B model on the same hardware with the tools you compare against would probably be both faster and produce output of higher quality.
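To make the "suggested sampling parameters" point concrete, here is a minimal sketch (Python with NumPy; the function name and defaults are mine, not from any particular runtime) of temperature plus nucleus (top-p) sampling, the kind of thing greedy argmax skips:

```python
import numpy as np

def sample_token(logits, temperature=0.7, top_p=0.9, rng=None):
    """Sample a token id using temperature + nucleus (top-p) sampling.

    temperature < 1 sharpens the distribution; top_p keeps only the
    smallest set of most-probable tokens whose cumulative probability
    reaches top_p, then renormalizes over that set.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    # Nucleus filter: most probable tokens covering top_p of the mass.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]
    kept = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept))
```

Models typically ship with values like these in their model card; using argmax instead collapses the distribution and tends to produce repetitive, degraded text.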


Speculative decoding doesn't degrade output quality. The distribution it produces is exactly the same if you do it correctly. The original paper on it clearly talks about this. [0]

Speculative decoding is the same as speculative execution on CPUs. As long as you walk back an incorrect prediction (i.e. the speculated tokens weren't accepted), everything is mathematically exactly the same. It just uses more parallelism (specifically, higher arithmetic intensity).

[0] https://arxiv.org/abs/2211.17192
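For reference, the losslessness comes from the accept/resample rule in that paper. A minimal sketch (Python/NumPy, with both models' distributions given as plain arrays; the function name is mine) of one verification step:

```python
import numpy as np

def verify_draft_token(p, q, draft_token, rng):
    """One accept/resample step of speculative sampling.

    p: target-model distribution over the vocabulary,
    q: draft-model distribution the token was sampled from.
    Accept the draft token with probability min(1, p/q); otherwise
    resample from the residual max(0, p - q), renormalized. The
    returned token is distributed exactly according to p.
    """
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual))
```

However bad the draft model is, the output distribution matches the target model exactly; only the acceptance rate (and hence the speedup) suffers.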


Why is it that speculative decoding lowers quality? My understanding of it is that you use a small/distilled fast model to predict the next token; when it doesn't match, you generate more. Checking against the large model is quick.

This should maintain exactly the quality of the original model, no?


AFAIU it's not that checking against the large model is quick (in the usual P!=NP sense that checking an answer is easier than finding one). It's that you can batch your checks. So you speculate the next 5 tokens, and then the large model can process the batch of prefixes [...,n+1], [...,n+2], [...,n+3], [...,n+4], [...,n+5] in one go. If you guessed right for a prefix, you turned a sequential problem (computing the next token from the current prefix) into a parallel one (processing multiple prefixes together) that the GPU likes. If you guessed wrong, you have to throw away the suffix starting at the wrong guess, and you wasted some extra energy computing.
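A toy sketch of that batched check (Python; `target_logits_fn` is a hypothetical stand-in for one parallel forward pass of the large model over all positions, and greedy acceptance is used for simplicity):

```python
import numpy as np

def verify_batch_greedy(target_logits_fn, prefix, speculated):
    """Greedy batched verification for speculative decoding.

    Runs the target model once over prefix + all speculated tokens,
    then accepts the longest prefix of speculated tokens that the
    target model itself would have produced greedily.

    target_logits_fn(tokens) -> logits of shape (len(tokens), vocab),
    where logits[j] scores the token at position j + 1.
    """
    tokens = prefix + speculated
    logits = target_logits_fn(tokens)  # one call covers all positions
    accepted = []
    for i, tok in enumerate(speculated):
        # The target's greedy pick for this position.
        predicted = int(np.argmax(logits[len(prefix) + i - 1]))
        if predicted != tok:
            break  # discard this token and everything after it
        accepted.append(tok)
    return accepted
```

If all five guesses are accepted you got five tokens for the price of one sequential step; a mismatch at position k still yields k correct tokens plus the target model's own token for position k.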

I looked it up, and you are correct with regard to the specific algorithm used. In general, there are also approximate algorithms for speculative decoding.

Greedy decoding means it is still not ready though.


> speculative decoding which, generally speaking, is not the same quality as serving the model without it.

I've never heard of ANY speculative decoding that wasn't lossless. If it was lossy it'd be called something else.

This page is just a port of DFLASH to GGUF format; it only implements greedy decoding like you said, so the outputs will be inferior, but not inferior to greedy decoding on the original model. Though that's just a matter of implementing temperature, top_k, etc.

