Cool project! How do you think about targeting hardware-specific ISAs directly? There’s an interesting paper from Citadel (https://arxiv.org/pdf/1804.06826) that highlights inefficiencies in nvcc for the Volta architecture. Do you see Luminal’s search-based paradigm eventually extending beyond outperforming handwritten kernels, towards actually competing with NVIDIA’s compiler optimizations at the PTX level?
I don't suppose you have an eye towards Verilog in the long term?
I'm curious as to the breadth of possibilities that could be searched. I would imagine something like this could invent flash attention if it cast its net wide enough, but that is a pretty broad net. [Edit: I scrolled back and saw flash attention was explicitly mentioned, cool stuff]
Equality saturation (something Luminal uses at its core) is also a topic in hardware synthesis and verification. It could enable something like dynamic hardware generation, rather than just kernel generation. For example, see this thesis [1] by Samuel Coward of Imperial.
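For anyone unfamiliar with the idea: equality saturation grows a set of provably equivalent expressions by applying rewrite rules until nothing new appears, then extracts the cheapest form, instead of greedily rewriting and losing alternatives. Here's a toy sketch in Python (my own illustration, not Luminal's implementation; a real system uses an e-graph for sharing, and the rules and costs here are made up):

```python
# Toy equality saturation: expand the set of equivalent terms to a
# fixpoint, then extract the cheapest one. Terms are nested tuples.

def step(t):
    """Yield terms equal to t after one top-level rewrite."""
    if t[0] == "mul":
        _, a, b = t
        yield ("mul", b, a)                 # commutativity
        if b == ("const", 2):
            yield ("shl", a, ("const", 1))  # strength reduction: x*2 -> x<<1

def saturate(term):
    """Grow the equivalence set until no rewrite adds a new term."""
    seen, frontier = {term}, {term}
    while frontier:
        frontier = {r for t in frontier for r in step(t)} - seen
        seen |= frontier
    return seen

def cost(t):
    """Made-up cost model: multiplies are expensive, shifts cheap."""
    op_cost = {"mul": 4, "shl": 1}.get(t[0], 0)
    return op_cost + sum(cost(c) for c in t[1:] if isinstance(c, tuple))

best = min(saturate(("mul", ("var", "x"), ("const", 2))), key=cost)
print(best)  # -> ('shl', ('var', 'x'), ('const', 1))
```

The key property is that commutativity and strength reduction coexist in the set, so extraction can pick the globally cheapest form; a greedy rewriter that applied commutativity first would never see the shift. Real e-graph engines (egg, the equality saturation work in that thesis) do this with shared subterms so the set stays compact.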
If you're looking for a high-level introduction to GPU development on Apple silicon, I would recommend learning Metal, Apple's GPU programming framework, analogous to CUDA for Nvidia hardware. I ported GPU-Puzzles (a collection of exercises designed to teach GPU programming fundamentals, originally written for CUDA) [1] to Metal [2]. I think it's a very accessible introduction to Metal and writing GPU kernels.
I recently ported this to Metal for Apple Silicon computers. If you're interested in learning GPU programming on an M series Mac, I think this is a very accessible option. Thanks to Sasha for making this!