I'm at 1137 with one hour with opus now...
Pipelined vectorized hash, speculation, static code for each stage, epilogues and prologues for each stage-to-stage...
I think I'm going to get sub 900 since i just realized i can in-parallel compute whether stage 5 of the hash is odd just by looking at bits 16 and 0 of stage 4 with less delay.....
I think I can hit #1 (current #1 is 1000). sub 900 not possible though.
Let me put down my thought process: You have to start to think of designing a 6-slot x8-len vector pipeline doing 48 hashes in parallel first which needs at least 10 steps —- if you convert three stages to multiply adds and do parallel XORs for the other three) —- the problem with 10 cycle hashing is you need to cram 96 scalar xors along side your vector pipeline, so that will use all 12 ALUs for 8 of those cycles. Leaving you only 24 more scalar ops per hash cycle which isn’t enough for the 48 tree value xors..
so you must use at least 11 steps per hash, with 96 xors (including the tree value xor) done in the scalar alus using 8 steps, and giving 3*12 Alu ops per hash cycle. You need 12 more ops per hash to do odd/even, so you must be 12 stages, and just do all of the hash ops in valu, 4 cycles of 12 alus doing modulo, 8 cycles x 12 alus free
With 12 steps and 48 parallel you’re absolute minimum could be 4096/48 x 12 = 1,024 cycles, since stage 10 can be optimized (you don’t need the odd/even modulo cycle, and can use some of those extra scalar cycles to pre-xor the constant can save you ~10 cycles. 1024 gonna be real hard, but I can imagine shenanigans to get it down to 1014, sub-1000 possible by throwing more xor to the scalar alus.
I performed a similar analysis to you and found it very difficult to imagine sub-1000. Your comment I think convinced me that it may be possible, though. Interesting.
I'm below the threshold for recruiting but not below Claude at the moment. Not sure where I am going wrong.
Here’s some other hints:
combine hash stages 2 and 3, it can be two muladds and a XOR
For the first several
rounds (when every tree value is in use) Combine the stage 5 XOR with the subsequent round’s tree XORs. You can determine even/odd in hash stage 5 starting with a ^ (a>>16) without Xoring the constant, then you can only need one XOR, this saves you a ton of XORs
Create separate instruction bundles for the first round, rounds 1-5 (combining hash stages 5 XOR with next round tree XORs) and 6-9 (not every tree node is used anymore), round 10 round 11-14 and round 15 and combine them.
you can use add_imm in parallel to load consts.
stage 0 you have to do load the tree first and the vals, by later stages when everything is in scratch, you could use 12 scalar XORs and 6 vector XORs on scratch.
once you vload vals, you can start to do XORs but can only advance so much at a time, so I’m starting to work on getting hash stages moving to different rounds faster to hide the initial vloads and get to the heavy load section sooner and spread the load pain.
take advantage of index collisions, optimizing round 0 and 11, speculative pre-loading, and the early branch predictor (which now I am doing looking at bits output at stage 3)
Very Nice. There is an issue with panning on the million point demo -- it currently does not redraw until the dragging velocity is below some threshold, but it should seem like the points are just panned into frame. It is probably enough to just get rid of the dragging velocity threshold, but sometimes helps to cache an entire frame around the visible range
When I first read it I thought wait, 3% of 6 is 0.18, but then I realized no I'm a dork because 6 is the age of the kid, whereas the number 100 is written as a word hundred, hence I decided to write "HN poster responds:" with quotes around my first non-coffee aided thought because I thought it was funny. I guess I should have just made that full statement, but I do have a tendency to rather oblique communication strategies.
on edit: basically because I thought hah, this is the kind of mistake I always see poor tired folks make on HN and making the dumb comment and here I am making it!! This is a classic moment!
Whatever happened to just firing up a calculator app that's already on the device you were using? Or bashing "100/3" into the search box in your OS or browser?
Do you ask ChatGPT how long to cook spaghetti instead of reading it off the package you just took the spaghetti out of? Honest question.
Wasn't most of that caused by that one change in 2022 to how R&D expenses are depreciated, thus making R&D expenses (like retaining dev staff) less financially attractive?
Yes, because US big tech have regional offices in loads of other countries too, fired loads of those developers at the same time and so the US job market collapse affected everyone.
And since then there's been a constant doom and gloom narrative even before AI started.
Same thing happened to farmers during the industrial revolution, same thing happened to horse drawn carriage drivers, same thing happened to accountants when Excel came along, mathmaticins, and on and on the list goes. Just part of human peogress.
I don't know, I go back and forth a bit. The thing that makes me skeptical is this: where is the training data that contains the experiences and thought processes that senior developers, architects, and engineering managers go through to gain the insight they hold?
I don't have all the variables in (financials of openai debt etc) but a few articles mention that they leverage part of their work to {claude,gemini,chatgpt} code agents internally with good results. it's a first step in a singularity like ramp up.
People think they'll have jobs maintaining AI output but i don't see how maintaining is that harder than creating for a llm able to digest requirements and codebase and iterate until a working source runs.
I don't think either, people forget that agents are also developing.
Back then, we put all the source code into AI to create things, then we manually put files into context, now it looks for needed files on their own. I think we can do even better by letting AI create a file and API documentation and only read the file when really needed. And select the API and documentation it needs and I bet there is more possible, including skills and MCP on top.
So, not only LLMs are getting better, but also the software using it.
# Tell the driver to completely ignore the NVLINK and it should allow the GPUs to initialise independently over PCIe !!!! This took a week of work to find, thanks Reddit!
I needed this info, thanks for putting it up. Can this really be an issue for every data center?
Good lunch spot for a nudnik
reply