More

rcxdude · 2026-01-15T00:55:25 1768438525

11PM is pretty standard but it varies quite a lot. Some places can be open much later, it depends on what the local council will license.

rcxdude · 2026-01-14T21:58:40 1768427920

The complication is that it doesn't work reliably. You can train an LLM with special tokens for delimiting different kinds of information (and indeed most non-'raw' LLMs have this in some form or another now), but they don't exactly isolate the concepts rigorously. It'll still follow instructions in 'user input' sometimes, and more often if that input is designed to manipulate the LLM in the right way.

rcxdude · 2026-01-14T21:55:40 1768427740

Part of the issue is reads can exfiltrate data as well (just stuff it into a request url). You need to also restrict what online information the agent can read, which makes it a lot less useful.

rcxdude · 2026-01-14T00:16:16 1768349776

Stability and accuracy, when applied to clocks, are generally about dynamic range, i.e. how good is the scale with which you are measuring time. So if you're talking about nanoseconds across a long time period, seconds or longer, then yeah, you probably should care about your clock. But when you're measuring nanoseconds out of a millisecond or microsecond, it really doesn't matter that much and you're going to be OK with the average crystal oscillator in a PC. (and if you're measuring a 10% difference like in the article, you're going to be fine with a mechanical clock as your reference if you can do the operation a billion times in a row).

jacquesm · 2026-01-14T00:35:36 1768350936

This setup is a user space program on a machine that is not exclusively dedicated to the test running all kinds of interrupts (and other tasks) left, right and center through the software under test.

loeg · 2026-01-14T03:35:51 1768361751

For something like this, you can just take several trials and look at the minimum observed time, which is when there will have been ~no interruptions.

https://github.com/facebook/folly/blob/main/folly/docs/Bench...

jacquesm · 2026-01-14T03:50:40 1768362640

You don't actually know that for sure. You have only placed a new upper bound.

loeg · 2026-01-14T05:24:51 1768368291

This seems like more of a philosophical argument than a practical one.

jacquesm · 2026-01-14T11:46:04 1768391164

No, it is a very practical one and I'm actually surprised that you don't see it that way. Benchmarking is hard, and if you don't understand the basics then you can easily measure nonsense.

jerrinot · 2026-01-14T12:42:01 1768394521

You raise a fair point about the percentiles. Those are reported as point estimates without confidence intervals and the implied precision overstates what system clock can deliver.

The mean does get proper statistical treatment (t-distribution confidence interval), but you're right that JMH doesn't compute confidence intervals for percentiles. Reporting p0.00 with three significant figures is ... optimistic.

That said I think the core finding survives this critique. The improvement shows up consistently across ~11 million samples at every percentile from p0.50 through p0.999.

jacquesm · 2026-01-14T15:38:12 1768405092

Yes, I would expect the 'order of magnitude' value to be relatively close but the absolute values to be very imprecise.

menaerus · 2026-01-14T18:19:59 1768414799

You can compute the confidence intervals all you want but if you can't be sure, in one or another way, that what you're observing (measuring) in your experiment is what you actually wanted to measure (signal), not even confidence interval would help you there to distinguish between the signal and noise.

That said, at your CPU base frequency, 80ns is ~344 cycles, 70ns is ~300 cycles. That's ~40 cycles of difference. That's on the order of ~2x CPU pipeline flushes due to branch mispredictions. Or another example is RDTSCP which, at least on Intel CPUs, forces all prior instructions to retire before executing, and it prevents speculative execution of following instructions until theirs results are available. This can also impose a 10-30 cycle penalty. Both of these can interfere with the measurements of the scale you have so there is a possibility that you're measuring these effects instead of the optimization you thought you implemented.

I am not saying that this is the case, I am just saying it's possible. Since the test is simple enough I would eliminate other similar CPU level gotchas that can screw your hypothesis testing up. In more complex scenarios I would have to consider them as well.

The only reliable way I found to be sure what is really happening is to read the codegen. And I do that _before_ each test run, or to be more precise after each recompile, because compilers do crazy transformations with our code, even when just moving a naively looking function few lines above or adding some naive boolean flag. If I don't do that, I could again end up measuring, observing, and finally drawing the conclusion that I implemented a speedup without realizing that the compiler in that last case decided to eliminate half of the code because of that innocuous boolean flag. Just an example.

radix tree lookup looks interesting and it would be interesting to see at what exact instruction does it idle on. I had a case where the function would be sitting idle, reproducible, but when you look into the function there is nothing obvious you can optimize. It turned out that the CPU pipeline was so saturated that there were no more available CPU ports for the instruction this function was idling for. The fix was to rewrite code elsewhere but in vicinity of this function. This is something flamegraphs can never show you, which is partly the reason I had never been a huge fan of.

rcxdude · 2026-01-13T18:50:57 1768330257

Assuming AI is at all useful it's likely to be used for safety-critical software development. Safety-critical processes aren't likely to care about LLM involvement much at all, much like they don't generally care about competence of those doing the work already.

rcxdude · 2026-01-13T12:46:30 1768308390

Yeah, the general theme was the laws seem simple enough but the devil is in the details. Pretty much every story is about them going wrong in some way (to give another example: what happens if a robot is so specialised and isolated it does not recognise humans?)

rcxdude · 2026-01-13T10:55:34 1768301734

It would, but it would also result in a bunch of users getting hacked through prompt injection attacks.

rcxdude · 2026-01-12T11:55:35 1768218935

From his description of the approach I suspect its also to smooth over sharp edges that the grid optimization doesn't like so much.

rcxdude · 2026-01-12T09:55:16 1768211716

Have you looked at Eurocircuits? Not quite as big, but similar kind of thing.

magicalhippo · 2026-01-12T10:41:54 1768214514

Similar in how you order, not similar in price at all. Order of magnitude more expensive for hobby-grade boards.

rcxdude · 2026-01-12T09:54:22 1768211662

Really? Most of the electronics I work on get made in the EU. There are a few decent options, even. It's not dead, even if China is much bigger.