foobar10000's comments | Hacker News

GLM 4.7 supports it - and in my experience, for Claude Code an 80%+ hit rate on the speculative tokens is reasonable. So it is a significant speed-up.

Claude Code Router

Yep - this is why, when running it in a dev container, I just use ZFS and set up a 1-minute auto-snapshot - created as root, so the agent generally cannot blow it away. And cc/codex/gemini know how to deal with ZFS snapshots and revert from them.

Of course, if you give an agentic loop root access in YOLO mode, then I am not sure how to help...
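
A minimal sketch of that setup in Python - the dataset name and retention window are assumptions, it has to run as root on the host, and in practice a cron entry or zfs-auto-snapshot does the same job:

  # autosnap.py - take a ZFS snapshot every minute and prune old ones.
  # Sketch only: dataset name and retention are placeholders; run as root.
  import subprocess, time
  from datetime import datetime, timezone

  DATASET = "tank/devhome"   # hypothetical dataset backing the dev container
  KEEP = 120                 # keep ~2 hours of 1-minute snapshots

  def auto_snapshots():
      out = subprocess.run(
          ["zfs", "list", "-H", "-t", "snapshot", "-o", "name", "-s", "creation", DATASET],
          check=True, capture_output=True, text=True).stdout
      return [s for s in out.splitlines() if s.startswith(DATASET + "@auto-")]

  while True:
      stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
      subprocess.run(["zfs", "snapshot", f"{DATASET}@auto-{stamp}"], check=True)
      for old in auto_snapshots()[:-KEEP]:   # drop the oldest beyond retention
          subprocess.run(["zfs", "destroy", old], check=True)
      time.sleep(60)

Reverting is then a zfs rollback to the last good @auto- snapshot, or pulling individual files back out of the dataset's .zfs/snapshot directory.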


Well - lists of tuples. Otherwise known as a graph :)
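
(Toy illustration of the point - an edge list of tuples and the adjacency view it encodes; the names are made up.)

  # A list of tuples as an edge list, i.e. a graph.
  edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]

  adjacency = {}
  for src, dst in edges:
      adjacency.setdefault(src, []).append(dst)

  print(adjacency)   # {'a': ['b', 'c'], 'b': ['c'], 'c': ['d']}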

A question on the 100+ tps - is this for short prompts? For large contexts that generate a chunk of tokens at context sizes of 120k+, I was seeing 30-50 tps - and that's with a 95% KV cache hit rate. I'm wondering if I'm simply doing something wrong here...


Depends on how well the speculator predicts the output for your prompts, assuming you're using speculative decoding — weird prompts are slower, but e.g. TypeScript code diffs should be very fast. For SGLang, you also want a larger chunked prefill size and larger max batch sizes for CUDA graphs than the defaults, IME.
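
For reference, the kind of launch I mean looks roughly like this - the model paths are placeholders and exact flag names/defaults vary by SGLang version, so treat it as a sketch rather than copy-paste:

  python -m sglang.launch_server \
    --model-path <target-model> \
    --speculative-algorithm EAGLE \
    --speculative-draft-model-path <draft-model> \
    --chunked-prefill-size 16384 \
    --cuda-graph-max-bs 64

The larger chunked prefill helps long-context prefill throughput, and raising the CUDA-graph batch cap keeps decode running on graphs when more requests are in flight.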


In our case - agentic loop optimization of kernels. Works like a dream - after all, you have a perfect Python spec for validation, the kernel is small (under 10 kLOC), and the entire thing is testable. The trick is to have enough back-test data so that things like cache behavior are taken into account. We ended up with different kernel versions for batch back-tests vs. real-time work - which was interesting. Five years ago we would have hired about 10 people for the job - now 2.
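
The validation side of that loop has roughly the following shape (a sketch - reference_spec, candidate_fn, and backtest_inputs stand in for whatever your actual harness provides):

  # Sketch: every candidate kernel must match the Python reference spec on
  # recorded back-test inputs before its timings are even considered.
  import numpy as np

  def reference_spec(x: np.ndarray) -> np.ndarray:
      # The "perfect Python spec" - slow but unambiguous.
      return np.cumsum(x, axis=-1)

  def validate(candidate_fn, backtest_inputs, atol=1e-6):
      for x in backtest_inputs:
          expected = reference_spec(x)
          got = candidate_fn(x)        # the compiled kernel, wrapped for Python
          if not np.allclose(got, expected, atol=atol):
              return False             # reject; the failing case goes back to the agent
      return True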


How many man-hours would it take 10 people to do this job?


IMO for some things RAG works great, and for others you may need full attention - hence the completely disparate experiences with RAG.

As an example, if one is chunking inputs into a RAG system, one is basically hardcoding a feature based on locality - which may or may not work. If it works - as in, it is a good feature (the attention matrix is really tail-heavy, LSTMs would work, etc.) - then hey, vector DBs work beautifully. But for many of the things where people have trouble with RAG, the locality assumption is heavily violated - and there you _need_ the full-on attention matrix.
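
Concretely, the chunking step is where that locality assumption gets baked in - a toy sketch, with embed() standing in for a real embedding model:

  # Toy chunked retrieval: fixed windows assume the evidence for a query is
  # local to one chunk. embed() is a placeholder, not a real model.
  import numpy as np

  def embed(text: str) -> np.ndarray:
      rng = np.random.default_rng(abs(hash(text)) % (2**32))
      v = rng.standard_normal(64)
      return v / np.linalg.norm(v)

  def chunk(doc: str, size: int = 500) -> list[str]:
      # The locality assumption lives here: one window is presumed to hold the answer.
      return [doc[i:i + size] for i in range(0, len(doc), size)]

  def retrieve(query: str, doc: str, k: int = 3) -> list[str]:
      chunks = chunk(doc)
      q = embed(query)
      scores = [float(q @ embed(c)) for c in chunks]
      return [c for _, c in sorted(zip(scores, chunks), reverse=True)[:k]]

When the evidence is smeared across distant chunks, no amount of top-k tuning on something like this recovers it - that's the case where you need the model attending over the full context.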


Did you try adding the Codex CLI as an MCP server, so that Claude uses it as an MCP client instead of pasting to it? Something like "claude mcp add codex-high -- codex -c model_reasoning_effort="high" -m "gpt-5-codex" mcp-server"?

I’ve had good luck with it - was wondering if that makes the workflow faster/better?


Yeah I've looked into that kind of thing. In general I don't love the pattern where a coding agent calls another agent automatically. It's hard to control and I don't like how the session "disappears" after the call is done. It can be useful to leave that Codex window open for one more question.

One tool that solves this is RepoPrompt MCP. You can have Sonnet 4.5 set up a call to GPT-5-Pro via API and then that session stays persisted in another window for you to interact with, branch, etc.


In fairness, assuming 4/8 ports in, 4 ports out, operating at some ungodly GHz using a custom GaAs or SiGe chip, and working in gearboxed scrambled space with very clever input MAC-prefix mapping and output scrambled/gearboxed precomputation, one _could_ do around 8 ns optimistically from the start of the first 64b/66b block (the 0x55... preamble) for 10G. There's some stuff about preamble shrink that makes it wonky, but it is doable. And interestingly enough, for 25G-R one can comfortably do this in about 4 ns. I am not in fact aware of such a beastie existing for 25G, but I have seen one or two for 10G.
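
As a sanity check on those numbers, a single 66-bit block at the 10GBASE-R and 25GBASE-R line rates (10.3125 and 25.78125 Gbd) serializes in:

  # Serialization time of one 64b/66b block at the 10G and 25G line rates.
  for name, baud in [("10GBASE-R", 10.3125e9), ("25GBASE-R", 25.78125e9)]:
      print(name, round(66 / baud * 1e9, 2), "ns per 66-bit block")
  # 10GBASE-R 6.4 ns per 66-bit block
  # 25GBASE-R 2.56 ns per 66-bit block

which is what puts the ~8 ns and ~4 ns figures above in "one block plus a bit of decision logic" territory.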

Surprisingly, if ChatGPT is prompted _juust_ right, it will even give you a good way to do this.


Yes he very much does :)


