> The root cause reason for this project existing is to show that GPU
> programming is not synonymous with CUDA (or the other offloading
> languages).
1. The ability to use a particular library does not reflect much on which languages can be used.
2. Once you have PTX as a backend target for a compiler, obviously you can use all sorts of languages on the frontend - which NVIDIA's drivers and libraries won't even know about. Or you can just use PTX as your language - making your point that GPU programming is not synonymous with CUDA C++.
> It's nominally to help people run existing code on GPUs.
I'm worried you might be right. But - we should really not encourage people to run existing CPU-side code on GPUs, that's rarely (or maybe never?) a good idea.
> Raw C++ isn't an especially sensible language to program GPUs in
> but it's workable and I think it's better than CUDA.
CUDA is an execution ecosystem. The programming language for writing kernel code is "CUDA C++", which _is_ C++, plus a few builtin functions ... or maybe I'm misunderstanding this sentence.
GPU offloading languages - CUDA, OpenMP, etc. - work something like:
1. Split the single source into host parts and gpu parts
2. Optionally mark up some parts as "kernels", i.e. have entry points
3. Compile them separately, maybe for many architectures
4. Emit a bunch of metadata for how they're related
5. Embed the GPU code in marked up sections of the host executable
6. Embed some startup code to find GPUs into the x64 parts
7. At runtime, go crawling around the ELF sections launching kernels
This particular library (which happens to be libc) is written in C++ and compiled with -ffreestanding and --target=amdgpu to LLVM bitcode. If you build a test, it compiles to an amdgpu ELF file - no x64 code in it, no special metadata, no elf-in-elf structure. The entry point is called _start. There's a small "loader" program which initialises HSA (or CUDA) and passes it the address of _start.
I'm not convinced by the clever cut-up-and-paste-together convenience style embraced by CUDA or OpenMP. This approach brings the lack of magic to the forefront. It also means we can wire it into OpenMP etc. once the reviews go through, so users of those suddenly find that fopen works.
CUDA C++ _can_ work like that. But I would say that these are mostly kiddie wheels for convenience. And because, in GPU programming, performance is king, most (?) kernel developers are likely to eventually need to drop those wheels. And then:
* No single source (although some headers might be shared)
* Kernels are compiled and linked at runtime, for the platform you're on - but also, in the general case, with extra definitions not known a priori (which differ for different inputs / over the course of running your program), and which have a massive effect on the code.
* You may or may not use some kind of compiled-kernel caching mechanism, but you certainly don't have all possible combinations of targets and definitions available, since that would be millions of compiled kernels.
It should also be mentioned that OpenCL never included the kiddie wheels to begin with, although I have to admit that makes it less convenient to start working with.