Agreed, understanding how a method works and how it is implemented helps with developing an intuition for its limitations -- what it can and can't do.
But the topic under discussion is so incredibly complex that even researchers at the companies mentioned do not fully understand it. This is like saying let's learn how combustion inside airplane engines works to get a better understanding of what airplanes can do.
Is it not better to focus your limited time on things that you can understand?
I disagree here: setting up a large-scale pretraining run is super complex if you have to manage your own distributed computing platform, but looking at what the training data looks like and how it is fed into an LLM is not that complex. If you are developing a product based on or with LLMs, it's worth spending a few hours to understand it on a big-picture level. I mean, look at how many people are confused about why LLMs a) hallucinate facts, b) sometimes copy text passages verbatim, and c) probably shouldn't be used as scientific calculators, etc. All that would be much clearer if you knew how they are trained.
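To make that big-picture level concrete, here's a minimal sketch (not from the article, with toy token IDs) of how pretraining data is typically fed to an LLM: the text is tokenized into IDs, and the targets are simply the inputs shifted by one position, so the model learns next-token prediction.

```python
import torch

# Toy example: pretend these token IDs came out of a tokenizer.
token_ids = torch.tensor([464, 2068, 7586, 21831, 18045, 625, 262, 16931, 3290])

# For next-token prediction, the targets are the inputs shifted by one:
# the model predicts token t+1 from all tokens up to t.
inputs = token_ids[:-1]
targets = token_ids[1:]

# In an actual training loop (model omitted here), the loss would be
# cross-entropy between the model's logits and the shifted targets:
# logits = model(inputs.unsqueeze(0))  # (1, seq_len, vocab_size)
# loss = torch.nn.functional.cross_entropy(
#     logits.view(-1, logits.size(-1)), targets.view(-1)
# )
```

Seeing that the objective is nothing more than predicting the next token over large amounts of text already goes a long way toward explaining a), b), and c) above.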
I wouldn't pretrain from scratch, but continued pretraining is pretty popular for adapting LLMs to recent and/or custom data. (Sometimes this is referred to as 'finetuning'; however, it's not to be confused with 'instruction finetuning'.)
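As a rough sketch of what continued pretraining can look like in practice (the model name and data file below are just placeholders, and the Hugging Face API details are from memory, so treat them as assumptions): you load an already-pretrained checkpoint and keep training it with the same next-token objective on your custom corpus.

```python
# Hypothetical continued-pretraining sketch using Hugging Face tools;
# model name and data file are placeholders.
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)
from datasets import load_dataset

model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers have no pad token

raw = load_dataset("text", data_files={"train": "my_custom_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train_ds = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# mlm=False gives the standard causal (next-token) objective,
# i.e., the same objective as the original pretraining run.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="continued-pretraining", num_train_epochs=1),
    train_dataset=train_ds,
    data_collator=collator,
)
trainer.train()
```

The only real difference from pretraining is that you start from the pretrained weights instead of a random initialization.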
Quoting from the README, it embraces other executors, including torch.compile, and also works with multiple GPUs:
> Thunder is a source-to-source compiler for PyTorch. It makes PyTorch programs faster by combining and using different hardware executors at once (ie: nvFuser, torch.compile, cuDNN, and TransformerEngine FP8).
> Works on single accelerators and in multi-GPU settings. Thunder aims to be usable, understandable, and extensible.
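For reference, the README's usage example is roughly the following; the exact entry point (`thunder.jit`) may have changed since, so treat the API details as an assumption:

```python
import torch
import thunder

def foo(a, b):
    return a + b

# Thunder traces the function and dispatches pieces of it to its
# executors (nvFuser, torch.compile, cuDNN, ...) behind the scenes.
jfoo = thunder.jit(foo)

a = torch.randn(2, 2)
b = torch.randn(2, 2)
print(jfoo(a, b))
```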
Yes, it's 8.5B params if you account for weight tying, and 9.3B if you count the embedding layer and output layer weights separately as shown in the 2nd figure in the article. In the paper, I think they justified 7B by only counting the non-embedding parameters (7,751,248,896), which is kind of cheating in my opinion, because if you do that, then Llama 2 is basically a 5B-6B param model.
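For anyone who wants to double-check the arithmetic, here's a quick back-of-the-envelope calculation; the embedding count assumes Gemma's roughly 256,000-token vocabulary and 3072 embedding dimension, while the non-embedding count is the number quoted above:

```python
# Back-of-the-envelope check of the parameter counts discussed above.
non_embedding = 7_751_248_896        # non-embedding parameters (from the paper)
embedding = 256_000 * 3_072          # ~0.79B parameters (assumed vocab x hidden dim)

with_weight_tying = non_embedding + embedding          # input/output matrix shared
without_weight_tying = non_embedding + 2 * embedding   # counted separately

print(f"{with_weight_tying / 1e9:.2f}B")     # ~8.54B
print(f"{without_weight_tying / 1e9:.2f}B")  # ~9.32B
```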
Yes, that's correct. It's 9.3B parameters if you count the embedding layer and the final projection layer separately. However, since they used weight tying, the adjusted count is 8.5B, as discussed in the article.
Yes, it's definitely unfair to count it as a 7B model. In that case, we could call Llama 2, which is 6.6B parameters, a 6B (or even 5B) parameter model.
Not sure, but in general, it looks like ZipLoRA is only useful in specific contexts, like when you have two different tasks you want to optimize for (such as style and content in a vision context). DoRA is more general: it basically decomposes the pretrained weight into a magnitude and a direction, applies the LoRA update to the directional component, and then renormalizes and rescales it, which gives much better performance. According to the paper, it even works great at low ranks, which also effectively makes it even more parameter-efficient than OG LoRA.
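As a simplified sketch of the idea (following the m/V naming, but not claiming this is the paper's exact formulation): DoRA splits the weight into a magnitude vector m and a directional matrix V, applies the LoRA update to the directional part, normalizes it column-wise, and rescales it with the learned m.

```python
import torch

torch.manual_seed(123)
out_dim, in_dim, rank = 8, 16, 2

W0 = torch.randn(out_dim, in_dim)      # frozen pretrained weight
A = torch.randn(rank, in_dim) * 0.01   # trainable LoRA factor
B = torch.zeros(out_dim, rank)         # trainable LoRA factor (zero init)

# Magnitude vector m: one scale per column, initialized from W0's column norms.
m = W0.norm(p=2, dim=0, keepdim=True)

# Directional matrix V: pretrained weight plus the low-rank update,
# normalized column-wise so only the direction is left.
V = W0 + B @ A
V_unit = V / V.norm(p=2, dim=0, keepdim=True)

# DoRA-style merged weight: learned magnitude times unit-norm direction.
W_dora = m * V_unit
```

At initialization (B = 0), W_dora reproduces W0 exactly; during training only m, A, and B are updated.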
I just read the article, nice write-up! I think it would benefit from a short explanation of what the magnitude vector (m) and the directional matrix (V) are; I'm not familiar with that kind of decomposition.
Not related to the article but tangentially relevant: would it be possible to train a LoRA or DoRA with a high rank and then use SVD to check whether the rank is too high and truncate to a better value of r? Maybe even use different ranks for different layers after some training?
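For what it's worth, here's a rough sketch of what that check could look like for a plain LoRA update (purely illustrative, the factors here are random): merge B and A, run an SVD, look at how quickly the singular values decay, and truncate accordingly. Per-layer ranks would just mean repeating this per layer.

```python
import torch

torch.manual_seed(123)
out_dim, in_dim, trained_rank = 64, 64, 32

# Stand-ins for trained LoRA factors (random here, just to show the mechanics).
A = torch.randn(trained_rank, in_dim)
B = torch.randn(out_dim, trained_rank)

delta_W = B @ A                                  # merged low-rank update
U, S, Vh = torch.linalg.svd(delta_W, full_matrices=False)

# Smallest rank that keeps e.g. 95% of the squared singular-value mass.
energy = torch.cumsum(S**2, dim=0) / torch.sum(S**2)
r_new = int(torch.searchsorted(energy, torch.tensor(0.95)).item()) + 1
print(f"suggested rank: {r_new} (trained with rank {trained_rank})")

# Truncated factors at the smaller rank.
A_new = torch.diag(S[:r_new]) @ Vh[:r_new]       # (r_new, in_dim)
B_new = U[:, :r_new]                             # (out_dim, r_new)
```

With actually trained adapters, a sharp drop in the singular values would suggest the chosen r was higher than needed.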