
The paper's Table 7 shows DyT reducing overall LLaMA 7B inference time by 7.8% and training time by 8.2%. That is not insignificant.


But LLM performance scales according to the log of compute, so yeah it’s pretty insignificant. I think we’ve reached a bit of a plateau.


You can get a free trial of Stratechery Plus right now through Asianometry:

https://stratechery.passport.online/member/plan/4ycW4SE71Cy6...

Source: https://substack.com/home/post/p-154928959


Thanks! I used to subscribe but it's been a while.


You have to store the KV cache, not the tokens. For Gemma 27B (probably slightly larger than Flash), this would be:

  Size of KV cache = 2 * (num_layers) * (num_kv_heads * dim_head) * seq_length * precision

  8-bit Gemma 27B KV cache = 2 * (46) * (16 * 144) * 1e6 * 1 byte ≈ 200 GB
Note that this doesn't take into account further optimizations that Google might be using.

Formula: https://developer.nvidia.com/blog/mastering-llm-techniques-i...

Gemma 27B config: https://huggingface.co/google/gemma-2-27b/blob/main/config.j...
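
If you want to plug in other models, here is the same formula as a small Python sketch (a back-of-the-envelope estimate; the Gemma numbers are the ones quoted above, taken from the linked config):

    def kv_cache_bytes(num_layers, num_kv_heads, dim_head, seq_length, bytes_per_value):
        # Factor of 2: both keys and values are cached per layer
        return 2 * num_layers * num_kv_heads * dim_head * seq_length * bytes_per_value

    size = kv_cache_bytes(num_layers=46, num_kv_heads=16, dim_head=144,
                          seq_length=1_000_000, bytes_per_value=1)  # 8-bit cache
    print(f"{size / 1e9:.0f} GB")  # ≈ 212 GB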


Is there an easy-to-understand source or paper about how this caching works?



Ask ChatGPT to explain how KV caching works. What they are doing is essentially the same thing, with a few more engineering details.
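
If it helps, here is a minimal single-head sketch of the idea (random toy projections, identity W_Q for brevity): at each decoding step, only the newest token's key and value are computed; everything from earlier steps is reused from the cache.

    import numpy as np

    d = 8
    W_K, W_V = np.random.randn(d, d), np.random.randn(d, d)
    K_cache, V_cache = [], []

    def decode_step(x):  # x: embedding of the newest token, shape (d,)
        K_cache.append(x @ W_K)  # compute K/V only for the new token
        V_cache.append(x @ W_V)
        K, V = np.stack(K_cache), np.stack(V_cache)
        scores = K @ x / np.sqrt(d)
        p = np.exp(scores - scores.max())
        p /= p.sum()
        return p @ V  # attention output for the newest position

    for _ in range(5):
        out = decode_step(np.random.randn(d))
    print(len(K_cache))  # 5 cached keys, none recomputed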


Works using math CSS injection [1]:

    ```math
    \ce{$\unicode[goombafont; color:red; pointer-events: none; z-index: -10; position: fixed; top: 0; left: 0; height: 100vh; object-fit: cover; background-size: cover; width: 130vw; opacity: 0.5; background: url('https://github.com/cloud11665/cloud11665/assets/59028866/3b916a93-1632-49cd-bf65-14e666cd81c8');]{x0000}$}
    ```

[1]: https://raw.githubusercontent.com/cloud11665/cloud11665/mast...


Yeah, it seems like MathJax just puts whatever is in the square brackets of the Unicode tag into the CSS font-family without escaping it beforehand.


Tiktoken added support for GPT-4o: https://github.com/openai/tiktoken/commit/9d01e5670ff50eb74c...

It has an increased vocab size of 200k.
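
If you want to poke at it yourself, something like this should work (assuming the o200k_base encoding name from the linked commit):

    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")  # GPT-4o's tokenizer
    print(enc.n_vocab)                  # ~200k entries
    print(enc.encode("Hello, world!"))  # token IDs under the new vocab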


Does that imply they retrained the foundation model from scratch? I thought changing the tokenization was something you couldn't really retrofit to an existing model. I mean sure they might have initialized the weights from the prior GPT-4 model but it'd still require a lot of retraining.


Yeah and they say as much in the blog.


For posterity, GPT-3.5/4's tokenizer was 100k. The benefit of a larger tokenizer is more efficient tokenization (and therefore cheaper/faster) but with massive diminishing returns: the larger tokenizer makes the model more difficult to train but tends to reduce token usage by 10-15%.


Oh interesting, does that mean languages other than English won't be paying such a large penalty in terms of token lengths?

With previous tokenizers there was a notable increase in the number of tokens needed to represent non-English sentences: https://simonwillison.net/2023/Jun/8/gpt-tokenizers/


Yep. Non-English text gets a much bigger cost drop and speedup compared to English. Has always been a bummer that GPT-4 is like 5x slower and more expensive in Japanese, etc.


Just found there's a whole section about that in this post: https://openai.com/index/hello-gpt-4o/

It says "Japanese 1.4x fewer tokens (from 37 to 26)" - some other languages get much bigger improvements though, best is "Gujarati 4.4x fewer tokens (from 145 to 33)".
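
These numbers are easy to reproduce locally with tiktoken; a quick sketch comparing the old and new encodings (exact counts depend on the sentence you pick):

    import tiktoken

    old = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-3.5
    new = tiktoken.get_encoding("o200k_base")   # GPT-4o
    for text in ["Hello, how are you?", "こんにちは、お元気ですか？"]:
        print(len(old.encode(text)), "->", len(new.encode(text)), text)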


How are they able to use such a brand name, Tiktoken? Is it because TikTok is Chinese? It's almost as if Apple released a "Facebooken" library for something entirely unrelated to Facebook.


That's not the right analogy. The "tok" in "Tiktoken" comes from "token", not "TikTok".


And the "tik" comes from TikTok.


Lots of those tokens would have to be pixel patches and sound samples, right?


Yep. Since it’s multimodal. Pictures, text, audio all go into token space.


Seems like they are working on adding that capability:

> We're exploring whether we can responsibly provide the ability to generate NSFW content in age-appropriate contexts through the API and ChatGPT.

Link to section: https://cdn.openai.com/spec/model-spec-2024-05-08.html#dont-...


A lot of transformer explanations fail to mention what makes self attention so powerful.

Unlike traditional neural networks with fixed weights, self-attention layers adaptively weight connections between inputs based on context. This allows transformers to accomplish in a single layer what would take traditional networks multiple layers.
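
To make "adaptively weight" concrete, here is a toy sketch (projections omitted) showing that the connection weights are computed from the input itself, so they change whenever the context changes, whereas a dense layer's weight matrix is fixed after training:

    import numpy as np

    def attention_weights(X):
        # Each row: how much one token attends to every other token
        scores = X @ X.T / np.sqrt(X.shape[1])
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    X = np.array([[1.0, 0.0], [0.0, 1.0]])
    print(attention_weights(X))  # change X and the weights change with it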


In case it’s confusing for anyone to see “weight” as a verb and a noun so close together, there are indeed two different things going on:

1. There are the model weights, aka the parameters. These are what get adjusted during training to do the learning part. They always exist.

2. There are attention weights. These are part of the transformer architecture and they “weight” the context of the input. They are ephemeral. Used and discarded. Don’t always exist.

They are both typically 32-bit floats in case you’re curious but still different concepts.


I always thought the verb was "weigh" not "weight", but apparently the latter is also in the dictionary as a verb.

Oh well... it seems like it's more confusing than I thought https://www.merriam-webster.com/wordplay/when-to-use-weigh-a...


“To weight” is to assign a weight (e.g., to weight variables differently in a model), whereas “to weigh” is to observe and/or record a weight (as a scale does).


A few other cases of this sort of thing:

affect (n). an emotion or feeling. "She has a positive affect."

effect (n). a result or change due to some event. "The effect of her affect is to make people like her."

affect (v). to change or modify [X], have an effect upon [X]. "The weather affects my affect."

effect (v). to bring about [X] or cause [X] to happen. "Our protests are designed to effect change."

Also:

cost (v). to require a payment or loss of [X]. "That apple will cost $5." Past tense cost: "That apple cost $5."

cost (v). to estimate the price of [X]. "The accounting department will cost the construction project at $5 million." Past tense costed. "The accounting department costed the construction project at $5 million."


I think in most deployments, they're not fp32 by the time you're doing inference on them; they've been quantized, possibly down to 4 bits or even fewer.

On the training side I wouldn't be surprised if they were bf16 rather than fp32.


I think a good way of explaining #2 is “weight” in the sense of a weighted average


None of this seems obvious from just reading the original "Attention Is All You Need" paper. Is there a more in-depth explanation of how this adaptive weighting works?


The audience of this paper is other researchers who already know the concept of attention, which was already very well known in the field. Such research papers never explain these things again, since researchers either know them already or can read the cited sources; the papers focus on the actual research question. In this case, the question was simply: can we get away with just using attention, without the LSTM? Before that, everyone was using both together.

I think learning this by following the historical development can be helpful. E.g. in this case, learn the concept of attention, specifically cross-attention, first. That is this paper: Bahdanau, Cho, Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate", 2014, https://arxiv.org/abs/1409.0473

That paper introduces it. But even that is maybe quite dense, and to really grasp it, it helps to reimplement those things.

These papers are always dense, because of the space constraints imposed by the conferences, max 9 pages or so. To get a more detailed overview, you can study the authors' code, or other resources. There is a lot now on these topics, whole books, etc.


Which books cover this topic exclusively? Thanks


This is frequently a topic here on HN. E.g.:

https://udlbook.github.io/udlbook/ (https://news.ycombinator.com/item?id=38424939)

https://fleuret.org/francois/lbdl.html (https://news.ycombinator.com/item?id=35767789)

https://www.fast.ai/ (https://news.ycombinator.com/item?id=24237207)

https://d2l.ai/ (https://news.ycombinator.com/item?id=38428225)

Some more:

https://news.ycombinator.com/item?id=35543774

There is a lot more. Just google for "deep learning", and you'll find a lot of content. And most of that will cover attention, as it is a really basic concept now.


Thanks for the udl book (Understanding Deep Learning), that looks like a really great starting point.


To add to the excellent resources that have already been posted, Chapter 9 of Jurafsky and Martin's "Speech and Language Processing" has a nice overview of attention, and the next chapter talks specifically about the Transformer architecture: https://web.stanford.edu/~jurafsky/slp3/


I doubt any.


It’s definitely not obvious no matter how smart you are! The common metaphor used is it’s like a conversation.

Imagine you read one comment in some forum, posted in a long conversation thread. It wouldn’t be obvious what’s going on unless you read more of the thread right?

A single paper is like a single comment, in a thread that goes on for years and years.

For example, why don’t papers explain what tokens/vectors/embedding layers are? Well, they did already, except that comment in the thread came 2013 with the word2vec paper!

You might think: wth? To keep up with this, someone would have to spend a huge part of their time just reading papers. So yeah, that's kind of what researchers do.

The alternative is to try to find where people have distilled down the important information or summarized it. That’s where books/blogs/youtube etc come in.


Is there a way of finding interesting "chains" of such papers, short of scanning the references / "cited by" page?

(For example, Google Scholar lists 98797 citations for Attention is all you need!)


As a prerequisite to the attention paper? One to check out is:

A Survey on Contextual Embeddings https://arxiv.org/abs/2003.07278

Embeddings are sort of what all this stuff is built on so it should help demystify the newer papers (it’s actually newer than the attention paper but a better overview than starting with the older word2vec paper).

Then after the attention paper an important one is:

Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165

I’m intentionally trying to not give a big list because they’re so time-consuming. I’m sure you’ll quickly branch out based on your interests.


I found these notes very useful. They also contain a nice summary of how LLMs/transformers work. It doesn't help that people can't resist taking a concept that has been around for decades (kernel smoothing) and giving it a fancy new name (attention).

http://bactra.org/notebooks/nn-attention-and-transformers.ht...


It's just as bad as "convolutional neural networks" instead of "images being scaled down".


"Convolution" is a pretty well-established word for taking an operation and applying it sliding-window-style across a signal. Convnets are basically just a bunch of Hough transforms with learned convolution kernels.
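
A minimal illustration of that sliding-window sense of the word, with toy values:

    import numpy as np

    def conv1d(signal, kernel):
        # Slide the kernel across the signal, taking a dot product at each position
        k = len(kernel)
        return np.array([signal[i:i + k] @ kernel for i in range(len(signal) - k + 1)])

    print(conv1d(np.array([1., 2., 3., 4., 5.]), np.array([1., 0., -1.])))  # [-2. -2. -2.]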


I struggled to get an intuition for this, but on another HN thread earlier this year I saw a recommendation for Sebastian Raschka's series, starting with this video: https://www.youtube.com/watch?v=mDZil99CtSU and maybe the next three or four. It was really helpful for getting a sense of the original 2014 concept of attention, which is easier to understand but less powerful (https://arxiv.org/abs/1409.0473), and then how it gets powerful with the more modern notion of attention. So if you have a reasonable intuition for "regular" ANNs, I think this is a great place to start.


Turns out Attention is all you need isn't all you need!

(I'm sorry)


softmax(Q @ K.T) gives you a probability matrix of shape [seq, seq]. Think of it like an adjacency matrix whose edge weights are flow probabilities. Hence: semantic routing of parts of X, reduced with V.

where

- Q = X @ W_Q [query]

- K = X @ W_K [key]

- V = X @ W_V [value]

- X [input]

hence

attn_head_i = softmax(Q @ K.T / normalizing term) @ V

Each head corresponds to a different concurrent routing system

The transformer just adds normalization and mlp feature learning parts around that.
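
In code, one head of that routing system looks roughly like this (toy random matrices, no masking or multi-head plumbing):

    import numpy as np

    rng = np.random.default_rng(0)
    seq, d_model, d_head = 4, 8, 8
    X = rng.normal(size=(seq, d_model))       # input
    W_Q = rng.normal(size=(d_model, d_head))
    W_K = rng.normal(size=(d_model, d_head))
    W_V = rng.normal(size=(d_model, d_head))

    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(d_head)        # the normalizing term
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)     # softmax: the [seq, seq] routing matrix
    attn_head = probs @ V                     # route parts of X, reduced with V
    print(probs.sum(axis=-1))                 # each row sums to 1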


Just to add on: a good way to learn these terms is to look at the history of neural networks rather than looking at the transformer architecture in a vacuum.

This [1] post from 2021 goes over attention mechanisms as applied to RNN / LSTM networks. It's visual and goes into a bit more detail, and I've personally found RNN / LSTM networks easier to understand intuitively.

[1] https://medium.com/swlh/a-simple-overview-of-rnn-lstm-and-at...


It would be interesting to have attention visualized as well, similar to how it's done in BertViz:

https://github.com/jessevig/bertviz


I really enjoyed playing with BertViz. Another similar visual exploration tool I found recently is in Edwin Chen's blog. I am pretty sure this is the best explanation of LSTMs. I think more tutorials should use this visual approach.

http://blog.echen.me/2017/05/30/exploring-lstms/

There is also https://playground.tensorflow.org/


As an example, INT8 support in WebGPU would enable running quantized models, allowing larger LLMs to run locally in the browser.

See Limitations section here: https://fleetwood.dev/posts/running-llms-in-the-browser
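
For anyone unfamiliar with what quantization buys you, a minimal symmetric int8 sketch (toy weights, not WebGPU code):

    import numpy as np

    w = np.random.randn(4, 4).astype(np.float32)
    scale = np.abs(w).max() / 127.0
    w_q = np.round(w / scale).astype(np.int8)  # stored at 1 byte per weight
    w_d = w_q.astype(np.float32) * scale       # dequantized for compute
    print(np.abs(w - w_d).max())               # small reconstruction error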



Nice, imagine all these games on an iPad…


Imagine all these games on an Apple TV… I'm assuming at some point those will also start using M chips. Excited to see if Apple can enter the console market.


Could you just AirPlay from the Mac to the Apple TV? Not sure about input lag in that situation, but it'd be interesting to see.


That would be wonderful!


On iPad they would need their controls redone; it's possible, but an extra step.

I'm also getting the feeling the iPad is quickly falling out of favor; I haven't seen one in ages except on my parents' dinner table.


You can pair your Xbox or PlayStation controller with an iPad.


Also keyboard and mouse!


Those are terrible controls for most games.


Only for RTS; most other games actually favor the controller, due to the aim assist included in FPS games, for example.

Having to run on consoles makes pretty much any game controller-compliant.


If they were a good kind of controls, they would not require an "assist".


You can pair a mouse and keyboard with an iPad too. You can even plug them in…


GeForce Now should allow you to do that right now :)


Yeah, but you need to subscribe to it, you need a stable low-latency connection, etc.

Flights will be way more fun when you can just pop down with your iPad (or Vision Pro) and not have to also bring along your Switch or Steam Deck.


So you need to spend $3k to get playable framerates, which are possible with a 3060 laptop, maybe even a 3050 Ti, as they can do 1080p; this just seems to be 900p.


The point is that you'll be able to play games on the machine you already have. Yes, a PC with GPU is going to be better and people that are really into gaming will probably always opt for that. But there's a big casual market out there too.


Actually, you can play these games on an Xbox Series S at 1440p easily. I wonder if people would just buy a game console (if they don't already have one) for the more demanding titles. Most people don't need a processor that is nearly as powerful as the M1 Max, and I doubt anyone is going to spend extra money on a computer just for a GPU that doesn't even play games as well as a $300 console.


> Actually, you can play these games on an Xbox Series S at 1440p easily.

The Xbox Series S version of Cyberpunk runs at 30 FPS with a dynamic resolution between 2304x1296 and 2560x1440 on quality mode and at 60 FPS with a dynamic resolution between 1410x800 and 1920x1080 on performance mode. If you were to run it with a fixed resolution of 1440p, then you'd definitely not be averaging 30 FPS.


There are options if you just want to play the game, yes, but Apple did the work here and met developers the wrong half of the way, IMO. If you just want to play with higher settings and FPS on the computer you have, less emulation is better, as impressive as this might be. A Vulkan driver would be less emulation and more performance all around, I think. Also, $300 can buy a lot of games if games can be made to run well with minimal work.


The last thing I still use my windows machine for is the occasional gaming session. I'd love to be able to be rid of it forever. This seems like a positive step in that direction.


I play Cyberpunk with an aging AMD RX 570 and get a consistent 45 FPS running at 3K with high quality settings.

I'm not an Apple fan, but I have to admit the Apple chip is getting incredible frame rates, considering it has no discrete GPU.


I'm already spending $3k for my laptop so that I can develop.

This means I won't _also_ have to spend several thousand dollars on a gaming PC in addition.


Yes, but then you have a gaming laptop...

