Agreed, understanding how a method works and how it is implemented helps with developing an intuition for its limitations -- what it can and can't do.
But the topic under discussion is so incredibly complex that even researchers at the companies mentioned do not fully understand it. This is like saying let's learn how combustion inside airplane engines works to get a better understanding of what airplanes can do.
Is it not better to focus your limited time on things that you can understand?
I disagree here: setting up a large-scale pretraining run is super complex if you have to manage your own distributed computing platform, but looking at what the training data looks like and how it is fed into an LLM is not that complex. If you are developing a product based on or with LLMs, it's worth spending a few hours to understand it on a big-picture level. I mean, look at how many people are confused about why LLMs a) hallucinate facts, b) sometimes copy text passages verbatim, and c) probably shouldn't be used as scientific calculators, etc. All that would be much clearer if you knew how they are trained.
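To make that big-picture level concrete, here's a minimal sketch (not from the article, with toy token IDs) of how pretraining data is typically fed to an LLM: the text is tokenized into IDs, and the targets are simply the inputs shifted by one position, so the model learns next-token prediction.

```python
import torch

# Toy example: pretend these token IDs came out of a tokenizer.
token_ids = torch.tensor([464, 2068, 7586, 21831, 18045, 625, 262, 16931, 3290])

# For next-token prediction, the targets are the inputs shifted by one:
# the model predicts token t+1 from all tokens up to t.
inputs = token_ids[:-1]
targets = token_ids[1:]

# In an actual training loop (model omitted here), the loss would be
# cross-entropy between the model's logits and the shifted targets:
# logits = model(inputs.unsqueeze(0))  # (1, seq_len, vocab_size)
# loss = torch.nn.functional.cross_entropy(
#     logits.view(-1, logits.size(-1)), targets.view(-1)
# )
```

Seeing that the objective is nothing more than predicting the next token over large amounts of text already goes a long way toward explaining a), b), and c) above.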
I wouldn't pretrain from scratch, but continued pretraining is pretty popular for adapting LLMs to recent and/or custom data. (Sometimes this is referred to as 'finetuning'; however, it's not to be confused with 'instruction finetuning'.)
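As a rough sketch of what continued pretraining can look like in practice (the model name and data file below are just placeholders, and the Hugging Face API details are from memory, so treat them as assumptions): you load an already-pretrained checkpoint and keep training it with the same next-token objective on your custom corpus.

```python
# Hypothetical continued-pretraining sketch using Hugging Face tools;
# model name and data file are placeholders.
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)
from datasets import load_dataset

model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers have no pad token

raw = load_dataset("text", data_files={"train": "my_custom_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train_ds = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# mlm=False gives the standard causal (next-token) objective,
# i.e., the same objective as the original pretraining run.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="continued-pretraining", num_train_epochs=1),
    train_dataset=train_ds,
    data_collator=collator,
)
trainer.train()
```

The only real difference from pretraining is that you start from the pretrained weights instead of a random initialization.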
Quoting from the README, it embraces other executors, including torch.compile, and also works with multiple GPUs:
> Thunder is a source-to-source compiler for PyTorch. It makes PyTorch programs faster by combining and using different hardware executors at once (ie: nvFuser, torch.compile, cuDNN, and TransformerEngine FP8).
> Works on single accelerators and in multi-GPU settings. Thunder aims to be usable, understandable, and extensible.
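For reference, the README's usage example is roughly the following; the exact entry point (`thunder.jit`) may have changed since, so treat the API details as an assumption:

```python
import torch
import thunder

def foo(a, b):
    return a + b

# Thunder traces the function and dispatches pieces of it to its
# executors (nvFuser, torch.compile, cuDNN, ...) behind the scenes.
jfoo = thunder.jit(foo)

a = torch.randn(2, 2)
b = torch.randn(2, 2)
print(jfoo(a, b))
```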
Yes, it's 8.5B params if you account for weight tying, and 9.3B if you count the embedding layer and output layer weights separately as shown in the 2nd figure in the article. In the paper, I think they justified 7B by only counting the non-embedding parameters (7,751,248,896), which is kind of cheating in my opinion, because if you do that, then Llama 2 is basically a 5B-6B param model.
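For anyone who wants to double-check the arithmetic, here's a quick back-of-the-envelope calculation; the embedding count assumes Gemma's roughly 256,000-token vocabulary and 3072 embedding dimension, while the non-embedding count is the number quoted above:

```python
# Back-of-the-envelope check of the parameter counts discussed above.
non_embedding = 7_751_248_896        # non-embedding parameters (from the paper)
embedding = 256_000 * 3_072          # ~0.79B parameters (assumed vocab x hidden dim)

with_weight_tying = non_embedding + embedding          # input/output matrix shared
without_weight_tying = non_embedding + 2 * embedding   # counted separately

print(f"{with_weight_tying / 1e9:.2f}B")     # ~8.54B
print(f"{without_weight_tying / 1e9:.2f}B")  # ~9.32B
```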
Yes, that's correct. It's 9.3B parameters if you count the embedding layer and the final projection layer separately. However, since they used weight tying, the adjusted count is 8.5B, as discussed in the article.
Yes, it's definitely unfair to count it as a 7B model. In that case, we could call Llama 2, which is 6.6B parameters, a 6B (or even 5B) parameter model.
Not sure, but in general, it looks like ZipLoRA is only useful in specific contexts, like when you have two different tasks you want to optimize for (such as style and content in a vision context). DoRA is more general: it basically decomposes the pretrained weight into a magnitude and a direction, applies the LoRA update to the directional component, and then renormalizes and rescales it, which gives much better performance. According to the paper, it even works great at low ranks, which also effectively makes it even more parameter-efficient than OG LoRA.
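As a simplified sketch of the idea (following the m/V naming, but not claiming this is the paper's exact formulation): DoRA splits the weight into a magnitude vector m and a directional matrix V, applies the LoRA update to the directional part, normalizes it column-wise, and rescales it with the learned m.

```python
import torch

torch.manual_seed(123)
out_dim, in_dim, rank = 8, 16, 2

W0 = torch.randn(out_dim, in_dim)      # frozen pretrained weight
A = torch.randn(rank, in_dim) * 0.01   # trainable LoRA factor
B = torch.zeros(out_dim, rank)         # trainable LoRA factor (zero init)

# Magnitude vector m: one scale per column, initialized from W0's column norms.
m = W0.norm(p=2, dim=0, keepdim=True)

# Directional matrix V: pretrained weight plus the low-rank update,
# normalized column-wise so only the direction is left.
V = W0 + B @ A
V_unit = V / V.norm(p=2, dim=0, keepdim=True)

# DoRA-style merged weight: learned magnitude times unit-norm direction.
W_dora = m * V_unit
```

At initialization (B = 0), W_dora reproduces W0 exactly; during training only m, A, and B are updated.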
I just read the article, nice write-up! I think it would benefit from a short explanation of what the magnitude vector (m) and the directional matrix (V) are; I'm not familiar with that kind of decomposition.
Not related to the article but tangentially relevant: would it be possible to train a LoRA or DoRA with a high rank and then use SVD to check whether the rank is too high and truncate to a better value of r? Maybe even use different ranks for different layers after some training?
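For what it's worth, here's a rough sketch of what that check could look like for a plain LoRA update (purely illustrative, the factors here are random): merge B and A, run an SVD, look at how quickly the singular values decay, and truncate accordingly. Per-layer ranks would just mean repeating this per layer.

```python
import torch

torch.manual_seed(123)
out_dim, in_dim, trained_rank = 64, 64, 32

# Stand-ins for trained LoRA factors (random here, just to show the mechanics).
A = torch.randn(trained_rank, in_dim)
B = torch.randn(out_dim, trained_rank)

delta_W = B @ A                                  # merged low-rank update
U, S, Vh = torch.linalg.svd(delta_W, full_matrices=False)

# Smallest rank that keeps e.g. 95% of the squared singular-value mass.
energy = torch.cumsum(S**2, dim=0) / torch.sum(S**2)
r_new = int(torch.searchsorted(energy, torch.tensor(0.95)).item()) + 1
print(f"suggested rank: {r_new} (trained with rank {trained_rank})")

# Truncated factors at the smaller rank.
A_new = torch.diag(S[:r_new]) @ Vh[:r_new]       # (r_new, in_dim)
B_new = U[:, :r_new]                             # (out_dim, r_new)
```

With actually trained adapters, a sharp drop in the singular values would suggest the chosen r was higher than needed.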