
> approximately the same magnitude

and they really do mean that: their results show +/- 1 on log10 plots.


I don't think this is an accurate characterization of the error magnitude? Their error plots (from appendix 3) all show `log_10(|Y - \dot{Y}|)` as having a median of ~-3 (a difference of 0.001) and a max of ~-1.5 (a difference of ~0.035), and this is with only 3 Taylor terms.

> self-attention is efficiently computable to arbitrary precision with constant cost per token

This paper at least aspires to reproduce 'true' attention, which distinguishes it from many of the others. TBD if it's successful in that.


It can't be successful at that any more than 1+1 can equal 3. Fundamentally, if every token wants to be able to look at every previous token without loss of information, it must be O(n^2); N tokens looking at N tokens is quadratic. Any sub-quadratic attention must therefore lose some information and be unable to support perfect recall on longer sequences.

> N tokens looking at N tokens is quadratic

Convolving two arrays can be done perfectly accurately in O(n log n), despite every element being combined with every other element.
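
For instance, a quick numpy sketch (the arrays here are arbitrary stand-ins):

    import numpy as np

    # Direct convolution touches every pair (i, j): O(n^2) for length-n inputs.
    a = np.random.rand(1024)
    b = np.random.rand(1024)
    direct = np.convolve(a, b)

    # FFT-based convolution: multiply in the frequency domain, O(n log n) overall.
    n = len(a) + len(b) - 1
    via_fft = np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(b, n), n)

    assert np.allclose(direct, via_fft)   # agree up to floating-point error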

Or consider the even more basic sum of products a[i] * b[j] for all possible i, j:

    total = 0
    for i in range(len(a)):
        for j in range(len(b)):
            total += a[i] * b[j]
This can be computed in linear time as sum(a) * sum(b).
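
A quick sanity check (the example arrays are arbitrary):

    a, b = [1.0, 2.0, 3.0], [4.0, 5.0]            # arbitrary inputs
    quadratic = sum(x * y for x in a for y in b)  # O(len(a) * len(b))
    linear = sum(a) * sum(b)                      # O(len(a) + len(b))
    assert quadratic == linear == 54.0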

Your logic that 'the result contains terms of all pairs, therefore the algorithm must be quadratic' simply doesn't hold.


One of my favorite bits of my PhD dissertation was factoring an intractable 3-dimensional integral

\iiint f(x, y, z) dx dy dz = \int [\int g(x, y) dx]*[\int h(y, z) dz] dy

which greatly accelerated numerical integration (O(n^2) rather than O(n^3)).
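
A minimal numerical sketch of the same trick (g, h, and the grids below are stand-ins, not the actual functions from the dissertation):

    import numpy as np

    # Assume the integrand factors as f(x, y, z) = g(x, y) * h(y, z).
    def g(x, y):   # stand-in for the real g
        return np.exp(-(x - y) ** 2)

    def h(y, z):   # stand-in for the real h
        return np.cos(y * z) ** 2

    x = y = z = np.linspace(-3.0, 3.0, 400)
    dx = dy = dz = x[1] - x[0]

    # Factored form: one integral over x and one over z for each y, then an
    # outer integral over y -- O(n^2) evaluations instead of the O(n^3) a
    # brute-force triple sum over the full (x, y, z) grid would need.
    inner_x = g(x[None, :], y[:, None]).sum(axis=1) * dx   # integral over x for each y
    inner_z = h(y[:, None], z[None, :]).sum(axis=1) * dz   # integral over z for each y
    result = (inner_x * inner_z).sum() * dy                # outer integral over y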

My advisor was not particularly impressed and objectively I could have skipped it and let the simulations take a bit longer (quite a bit longer--this integration was done millions of times for different function parameters in an inner loop). But it was clever and all mine and I was proud of it.


This brings me back to DSP class; man, learning about the FFT was eye-opening.

Convolution is a local operation.

Attention is a global operation.


That's like saying sorting can be done in O(n) because radix sort exists. If you assume some structure, you lose generality, i.e. there'll be some problems it's no longer able to solve. It can no longer approximate any arbitrary function that needs perfect memory over the sequence.
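
To make the structure-vs-generality trade concrete, here's a textbook LSD radix sort sketch; the linear bound only holds because the keys are assumed to be fixed-width, non-negative integers:

    def radix_sort(xs, key_bits=32, radix_bits=8):
        # O(n * key_bits / radix_bits) = O(n) for fixed key_bits -- but only
        # because of the assumed structure (bounded-width integer keys).
        mask = (1 << radix_bits) - 1
        for shift in range(0, key_bits, radix_bits):
            buckets = [[] for _ in range(1 << radix_bits)]
            for v in xs:
                buckets[(v >> shift) & mask].append(v)
            xs = [v for bucket in buckets for v in bucket]
        return xs

    data = [170, 45, 75, 90, 802, 24, 2, 66]
    assert radix_sort(data) == sorted(data)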

I'm not saying if the paper is correct or not (since I can't tell), but I don't think your argument really holds. Consider applying it to multiplication:

Fundamentally, multiplication needs to look at every pair of digits from the two input numbers. It must be O(n^2); N digits looking at N other digits is quadratic. Any sub-quadratic multiplication must therefore lose some information.


Integer multiplication x * y can be trivially done in O(k): k = log₂(min(x, y)). This is because we can do addition in constant time, adding all bits in parallel.

By combining many more adding units, we can do (fixed-size) multiplication in constant time, too: https://en.wikipedia.org/wiki/Dadda_multiplier


Multiplication can be sub-quadratic using Karatsuba's algorithm.
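
A textbook sketch of the divide-and-conquer step (pure Python, base 10 for readability):

    def karatsuba(x, y):
        # Three recursive half-size multiplications instead of four,
        # giving roughly O(n^1.585) digit operations for n-digit inputs.
        if x < 10 or y < 10:
            return x * y
        m = max(len(str(x)), len(str(y))) // 2
        x_hi, x_lo = divmod(x, 10 ** m)
        y_hi, y_lo = divmod(y, 10 ** m)
        lo = karatsuba(x_lo, y_lo)
        hi = karatsuba(x_hi, y_hi)
        mid = karatsuba(x_lo + x_hi, y_lo + y_hi) - hi - lo
        return hi * 10 ** (2 * m) + mid * 10 ** m + lo

    assert karatsuba(123456789, 987654321) == 123456789 * 987654321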

Doesn't that have to do with how many bits you allow in the actual calculation in physical reality?

Well, for multiplication, complexity is defined in terms of the number of digits/bits directly. For attention, complexity is defined in terms of the number of input vectors, which are all at fixed precision. I don't understand what happens to the method proposed in the paper at higher precision (since I don't understand the paper), but in reality it doesn't matter since there is no value in anything over float16 for machine learning.

Multiplication has some properties like being cumulative. If we assume the sequence has any specific properties then we no longer have a general sequence model.

I think you meant commutative.

Attention also has some specific properties.

And sometimes results are just unexpected. Did you know that anything a Turing machine can do in t time steps, a different Turing machine can do in O(sqrt(t log t)) memory cells? https://news.ycombinator.com/item?id=44055347


Your argument just assumes there is no latent structure that can be exploited. That's a big assumption.

It's a necessary assumption for the universal approximation property; if you assume some structure then your LLM can no longer solve problems that don't fit into that structure as effectively.

Neural nets are structured as matrix multiplications, yet they are universal approximators.

You're missing the non-linear activations.

But language does have structure, as does logic and reasoning. Universal approximation is great when you don't know the structure and want to brute force search to find an approximate solution. That's not optimal by any stretch of the imagination though.

That argument could also be used to say that the FFT's time complexity of O(n log n) should be impossible.

It's like claims of room-temperature superconductors or Millennium Prize solutions. Earth-shattering if true. It'd be such a black swan. Terrible for Nvidia.

Well, we solved one of the Millennium Prize problems (honestly kinda quickly) so maybe there's hope :)

50MW might be one aisle of a really dense DC. A single rack might draw 120kW.

The physical constraints aren’t insane for black body radiators. The engineering to run radiators at 90C in space OTOH…
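
Rough back-of-the-envelope numbers (a sketch only; the 50 MW load, 90 C radiator temperature, and emissivity are assumptions, and absorbed sunlight is ignored):

    SIGMA = 5.67e-8               # Stefan-Boltzmann constant, W / (m^2 K^4)
    P = 50e6                      # heat to reject, W (the 50 MW figure above)
    T = 90 + 273.15               # radiator temperature, K
    emissivity = 0.9              # assumed high-emissivity coating

    flux = emissivity * SIGMA * T ** 4   # ~890 W/m^2 radiated per side
    area = P / flux                      # ~56,000 m^2 of radiator
    print(f"{flux:.0f} W/m^2 -> {area:,.0f} m^2")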

Space is empty, not cold.

> Space is empty, not cold.

The "dark" side of the JWST has a temperature of about 40 K (-233 C).


1kW TDP chips need LESS cooling?

Yes, Rubin reportedly can deal with running significantly hotter.

That makes radiating a much more practical approach to cooling it.
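
The T^4 scaling is what makes the difference; a quick illustration (the temperatures are placeholders, not actual chip specs):

    # Radiated flux grows with the fourth power of absolute temperature,
    # so a hotter radiator rejects the same heat with much less area.
    cooler, hotter = 70 + 273.15, 100 + 273.15    # hypothetical radiator temps, K
    print((hotter / cooler) ** 4)                 # ~1.4x more heat per m^2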


I see what you’re saying - higher design temp radiates better despite more energy overall to dissipate.

> I see what you’re saying - higher design temp radiates better despite more energy overall to dissipate.

Yes, running hotter will cause more energy to be radiated.

but

These parts are not at all designed to radiate heat - just look at the surface area of the package with respect to the amount of power they consume.


I think OP was saying hotter part -> hotter radiator attached to the part, not that the part itself will radiate significantly.

> I think OP was saying hotter part -> hotter radiator attached to the part, not that the part itself will radiate significantly.

Hmm, surely the radiator can run at arbitrary temperatures w.r.t. the objects being cooled? I'm assuming a heat pump etc. is already part of the design.


Same reason humans write in higher-level languages instead of machine code? Each additional unit of program text costs energy at write time, so there's a bias toward more compact _representations_ of programs, even if they're less efficient at runtime.

> Bullion advocates argue that exchanging dollars for physical gold is a currency exchange rather than a consumption purchase.

One can argue that until they're blue in the face, but it'd still be wrong. Gold is a commodity, and if you're buying it shell-packed at Costco you probably should be paying sales tax on it.


Really, why have the PCI/CPU artifice at all? Apple and Nvidia have the right idea: put the MPP on the same die/package as the CPU.


> put the MPP on the same die/package as the CPU.

That would help in latency-constrained workloads, but I don't think it would make much of a difference for AI or most HPC applications.


We need low-power but high-PCIe-lane-count CPUs for that, purely for shoving models from NVMe to the GPU.


Sure, some fraction of withdrawn water is retained due to contamination. That's counted in the consumption numbers of the source study? The operative question is still whether DCs and related power generation consume a concerning amount of water.

