
> S&S Deli in Cambridge

Good lunch spot for a nudnik


I'm at 1137 after one hour with Opus now... pipelined vectorized hash, speculation, static code for each stage, epilogues and prologues for each stage-to-stage transition...

I think I'm going to get sub-900, since I just realized I can compute in parallel whether stage 5 of the hash is odd just by looking at bits 16 and 0 of stage 4, with less delay...


Submit it to the leaderboard: https://www.kerneloptimization.fun/


I think I can hit #1 (the current #1 is 1000). Sub-900 isn't possible, though.

Let me put down my thought process. You have to start by designing a 6-slot x 8-lane vector pipeline doing 48 hashes in parallel, which needs at least 10 steps (if you convert three stages to multiply-adds and do parallel XORs for the other three). The problem with 10-cycle hashing is that you need to cram 96 scalar XORs alongside your vector pipeline, which uses all 12 ALUs for 8 of those cycles, leaving you only 24 more scalar ops per hash cycle, and that isn't enough for the 48 tree-value XORs.

So you must use at least 11 steps per hash, with 96 XORs (including the tree-value XOR) done in the scalar ALUs over 8 steps, leaving 3 x 12 ALU ops per hash cycle. You need 12 more ops per hash to do odd/even, so you end up at 12 steps: just do all of the hash ops in the VALU, with 4 cycles of 12 ALUs doing the modulo and 8 cycles x 12 ALUs free.

With 12 steps and 48 hashes in parallel, your absolute minimum would be 4096/48 x 12 = 1,024 cycles. Stage 10 can be optimized (you don't need the odd/even modulo cycle, and using some of those extra scalar cycles to pre-XOR the constant can save you ~10 cycles). 1,024 is gonna be real hard, but I can imagine shenanigans to get it down to 1,014; sub-1,000 is possible by throwing more XORs to the scalar ALUs.
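For anyone checking the arithmetic, here's the back-of-the-envelope version (Python just for illustration; the 4096 items, 48 hashes in flight, and 12 steps per hash cycle are the numbers from the analysis above):

    # Back-of-the-envelope lower bound using the numbers from the analysis above.
    ITEMS = 4096                 # total hashes to compute
    PARALLEL_HASHES = 48         # 6 vector slots x 8 lanes
    STEPS_PER_HASH_CYCLE = 12    # steps per hash once odd/even handling is included

    hash_cycles = ITEMS / PARALLEL_HASHES            # ~85.33 hash cycles
    print(hash_cycles * STEPS_PER_HASH_CYCLE)        # 1024.0, the floor quoted above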


> sub 900 not possible though.

I performed a similar analysis to yours and found it very difficult to imagine sub-1,000. Your comment has, I think, convinced me it may be possible, though. Interesting.

I'm below the threshold for recruiting but not below Claude at the moment. Not sure where I am going wrong.


Here are some other hints: combine hash stages 2 and 3; together they can be two muladds and a XOR.

For the first several rounds (when every tree value is in use), combine the stage 5 XOR with the subsequent round's tree XORs. You can determine even/odd in hash stage 5 starting from a ^ (a >> 16) without XORing in the constant, so you only need one XOR; this saves you a ton of XORs.

Create separate instruction bundles for the first round; rounds 1-5 (combining the hash stage 5 XOR with the next round's tree XORs); rounds 6-9 (not every tree node is used anymore); round 10; rounds 11-14; and round 15, then combine them.

You can use add_imm in parallel to load constants. In stage 0 you have to load the tree first and then the vals; by later stages, when everything is in scratch, you can use 12 scalar XORs and 6 vector XORs on scratch. Once you vload the vals you can start doing XORs, but you can only advance so much at a time, so I'm working on moving hash stages into different rounds faster to hide the initial vloads, get to the heavy-load section sooner, and spread the load pain.
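To make the even/odd trick concrete, here's a tiny sketch, assuming stage 5 looks like a ^= (a >> 16) ^ CONST (the constant and the exact stage definition here are placeholders, not the real ones): the low bit of the stage 5 result depends only on bit 0 and bit 16 of the stage 4 value plus one fixed constant bit, so odd/even is known before the full stage runs.

    # Hypothetical stage 5: a5 = a4 ^ (a4 >> 16) ^ CONST (32-bit).
    # If that's the shape, bit 0 of a5 = bit0(a4) ^ bit16(a4) ^ bit0(CONST),
    # so even/odd is known from just two bits of the stage 4 value.
    import random

    CONST = 0x9E3779B9            # placeholder constant, not the real one
    MASK32 = 0xFFFFFFFF

    def stage5(a4):
        return (a4 ^ (a4 >> 16) ^ CONST) & MASK32

    def stage5_is_odd_early(a4):
        bit0 = a4 & 1
        bit16 = (a4 >> 16) & 1
        return (bit0 ^ bit16 ^ (CONST & 1)) == 1

    for _ in range(1000):
        a4 = random.getrandbits(32)
        assert (stage5(a4) & 1 == 1) == stage5_is_odd_early(a4)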


Why do you need an X account for it? Seems like a ridiculous requirement


How do you avoid the load bottleneck?


======================================================================
BROADCAST LOAD SCHEDULE
======================================================================
Round | Unique | Load Strategy
------|--------|------------------------------------------
   0  |    1   | 1 broadcast → all 256 items
   1  |    2   | 2 broadcasts → groups
   2  |    4   | 4 broadcasts → groups
   3  |    8   | 8 broadcasts → groups
   4  |   16   | 16 broadcasts → groups
   5  |   32   | 32 broadcasts → groups
   6  |   63   | 63 loads (sparse, use indirection)
   7  |  108   | 108 loads (sparse, use indirection)
   8  |  159   | 159 loads (sparse, use indirection)
   9  |  191   | 191 loads (sparse, use indirection)
  10  |  224   | 224 loads (sparse, use indirection)
  11  |    1   | 1 broadcast → all 256 items
  12  |    2   | 2 broadcasts → groups
  13  |    4   | 4 broadcasts → groups
  14  |    8   | 8 broadcasts → groups
  15  |   16   | 16 broadcasts → groups

Total loads with grouping: 839
Total loads naive:         4096
Load reduction:            4.9x


Take advantage of index collisions, optimize rounds 0 and 11, do speculative pre-loading, and use the early branch predictor (which I'm now doing by looking at bits output at stage 3).
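In code terms, the grouping in that schedule looks roughly like this (the actual per-round index computation is omitted; only the "load each unique tree index once and fan it out" pattern is what the table above is counting):

    # Group items by the tree index they need this round, so each unique
    # index is loaded (or broadcast) once instead of once per item.
    from collections import defaultdict

    def plan_loads(round_indices):
        """round_indices: one tree index per item (256 items per round)."""
        groups = defaultdict(list)
        for item, idx in enumerate(round_indices):
            groups[idx].append(item)
        # One load of tree[idx] per unique index, then broadcast to its group.
        return list(groups.items())

    # Round 0: every item reads the root -> a single broadcast.
    print(len(plan_loads([0] * 256)))                       # 1
    # A sparse later round with 224 distinct indices -> 224 loads.
    print(len(plan_loads(list(range(224)) + [0] * 32)))     # 224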


It's actually pretty funny, since Opus will suggest both of these with enough prying (though with a single prompt it might not try it).


Very nice. There is an issue with panning on the million-point demo: it currently does not redraw until the dragging velocity drops below some threshold, but it should look like the points are simply panned into frame. It is probably enough to just get rid of the dragging-velocity threshold, though it sometimes helps to cache an entire frame around the visible range.
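Rough sketch of the cached-frame idea (framework-agnostic; the demo's actual API isn't known to me, so all the names here are made up):

    # Render a frame covering the visible range plus a margin, translate the
    # cached image while dragging, and only re-render when the pan leaves the
    # cached region. All names here are illustrative.
    MARGIN = 0.5    # cache 50% extra range on each side of the viewport

    class PanCache:
        def __init__(self, render_fn):
            self.render_fn = render_fn      # draws points for a given x-range
            self.cached_range = None
            self.cached_image = None

        def draw(self, view_min, view_max, blit):
            width = view_max - view_min
            if (self.cached_range is None
                    or view_min < self.cached_range[0]
                    or view_max > self.cached_range[1]):
                lo, hi = view_min - MARGIN * width, view_max + MARGIN * width
                self.cached_image = self.render_fn(lo, hi)    # expensive redraw
                self.cached_range = (lo, hi)
            offset = view_min - self.cached_range[0]
            blit(self.cached_image, offset)     # cheap translate, no redraw

    cache = PanCache(lambda lo, hi: f"points[{lo:.1f},{hi:.1f}]")
    cache.draw(0.0, 10.0, lambda img, off: print(img, off))
    cache.draw(1.0, 11.0, lambda img, off: print(img, off))   # cached image reused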


This won't cost the city too much; there are only like a hundred kids under 6 in this city, and 3% of them are mine.


When people say "there are barely any kids," they're often describing the outcome of past policy choices, not a reason to avoid changing them


HN poster responds: "You have 0.18 kids under 6! That seems unlikely!"


Am I missing the joke? ChatGPT tells me 3% of 100 is 3, not 0.18.


When I first read it I thought, wait, 3% of 6 is 0.18, but then I realized no, I'm a dork, because 6 is the age of the kid, whereas the number 100 is written as the word "hundred." Hence I decided to write "HN poster responds:" with quotes around my first non-coffee-aided thought, because I thought it was funny. I guess I should have just spelled that out, but I do have a tendency toward rather oblique communication strategies.

On edit: basically because I thought, hah, this is the kind of mistake I always see poor tired folks on HN make in a dumb comment, and here I am making it!! This is a classic moment!


I gave your reply the most generous interpretation and read it in the ironic way, as you point out in the edit.


thanks!


.18 is 3% of 6. This might mean something, but I don't know what.


Ten months out of six years is 0.14, so it isn't quite prenatal benefits.

What happens if an unborn baby has rights to go to preschool, but the birthing parent can't?

Is an unborn child a US citizen yet?


The next number in the sequence 3, 6, 18 is 72, but I doubt it means anything.


>ChatGPT tells me 3% of 100 is 3

Sweet baby Jesus in his high chair.

Whatever happened to just firing up a calculator app that's already on the device you were using? Or bashing "3% of 100" into the search box in your OS or browser?

Do you ask ChatGPT how long to cook spaghetti instead of reading it off the package you just took the spaghetti out of? Honest question.


Off topic: what are you trying to signal by saying ChatGPT helped you with arithmetic here?

Is it supposed to give more weight to what you are saying?


I read it as a side joke about how people overuse ChatGPT for trivial tasks.


I hope that's the case!


> ChatGPT tells me 3% of 100 is 3,

FYI: % (percent) literally means "out of a hundred".


I think the joke is people trying to figure out why 0.18. I, personally, enjoy it.


You’re missing something if you asked ChatGPT that.


No, they have their irony fully deployed, not missing anything.


Hard to be sure on HN


nah, it just means you get 18% of childcare costs paid.


I did write this 20 years ago https://fpgacomputing.blogspot.com/2006/05/methods-for-recon...

The vendor tools are still a barrier to the high-end FPGA's hardened IP


Python won because of enforced whitespace. It solved a social problem that other languages punted to linters by baking readability into the spec.


The effect of these tools is people losing their software jobs (down 35% since 2020). Unemployed devs aren’t clamoring to go use AI on OSS.


Wasn't most of that caused by that one change in 2022 to how R&D expenses are amortized, making R&D spending (like retaining dev staff) less financially attractive?

Context: This news story https://news.ycombinator.com/item?id=44180533


Yes! Even though it's only a US tax rule, it somehow applied to the whole world! That's how mighty the US is!

Or could it be that, after the growth and build-out, we are in maintenance mode and need fewer people?

Just food for thought


Yes, because US big tech has regional offices in loads of other countries too and fired loads of those developers at the same time, so the US job-market collapse affected everyone.

And since then there's been a constant doom-and-gloom narrative, which started even before AI did.


Probably also the end of ZIRP, and some "AI washing" to give the illusion of progress.


The same thing happened to farmers during the Industrial Revolution, to horse-drawn carriage drivers, to accountants when Excel came along, to mathematicians, and on and on the list goes. Just part of human progress.


I keep asking ChatGPT when LLMs will reach 95% software-creation automation; the answer is ten years.


I don't think that long, but yeah, I give it five years.

Two years, and 3/4 will not be needed anymore.


I don't know, I go back and forth a bit. The thing that makes me skeptical is this: where is the training data that contains the experiences and thought processes that senior developers, architects, and engineering managers go through to gain the insight they hold?


I don't have all the variables in hand (OpenAI's financials, debt, etc.), but a few articles mention that they offload part of their work to {Claude, Gemini, ChatGPT} code agents internally with good results. It's a first step in a singularity-like ramp-up.

People think they'll have jobs maintaining AI output, but I don't see how maintaining is much harder than creating for an LLM that can digest requirements and a codebase and iterate until working source runs.


I don't think it will take that long either; people forget that the agents are also developing.

Back then we put all the source code into the AI to create things; then we manually put files into context; now it finds the files it needs on its own. I think we can do even better by letting the AI write file and API documentation and only read a file when it's really needed, selecting just the APIs and docs it needs. And I bet more is possible, including skills and MCP on top.

So not only are the LLMs getting better, but so is the software using them.


Cool!

Constraint propagation from SICP is a great reference here:

https://sicp.sourceacademy.org/chapters/3.3.5.html
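For a taste of the idea, here's a toy sketch in the connector/constraint style of that chapter (not code from it, and not the solver's code):

    # Toy connector/constraint network in the spirit of SICP 3.3.5: when a
    # connector learns a value it notifies its constraints, and a constraint
    # fills in whichever of its connectors is still unknown.
    class Connector:
        def __init__(self, name):
            self.name, self.value, self.constraints = name, None, []

        def set(self, value):
            if self.value is None:
                self.value = value
                for c in self.constraints:
                    c.propagate()

    class Adder:
        """Enforces a + b = total, filling in the missing connector."""
        def __init__(self, a, b, total):
            self.a, self.b, self.total = a, b, total
            for conn in (a, b, total):
                conn.constraints.append(self)

        def propagate(self):
            a, b, t = self.a.value, self.b.value, self.total.value
            if a is not None and b is not None and t is None:
                self.total.set(a + b)
            elif a is not None and t is not None and b is None:
                self.b.set(t - a)
            elif b is not None and t is not None and a is None:
                self.a.set(t - b)

    x, y, z = Connector("x"), Connector("y"), Connector("z")
    Adder(x, y, z)
    x.set(3); z.set(10)
    print(y.value)    # 7, propagated from x + y = z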


I wasn't aware of this chapter, but I did use constraint propagation for the solver (among other things), thanks!


# Tell the driver to completely ignore the NVLINK and it should allow the GPUs to initialise independently over PCIe !!!! This took a week of work to find, thanks Reddit!

I needed this info, thanks for putting it up. Can this really be an issue for every data center?


Doesn’t this prevent the GPUs from talking to each other over the high speed link?


I'll find out soon, but without this hack, the GPUs are non-functional.


I implemented a PDP-11 in 2007-10, and I can still read PDP-11 octal.

