
Strictly speaking: no. The "forward pass" terminology does not imply that there exists a "reverse pass" that does the same kind of computation. Rather, it's describing two different kinds of computation, and the direction they occur in.

The forward pass is propagating from inputs to outputs, computing the thing the model was trained for. The reverse/backwards pass is propagating from outputs back to inputs, but it's calculating the gradients of the parameters for training (roughly: how much changing each parameter in isolation affects the output, and whether that change moves the output closer to the desired training output). The result of the "reverse pass" isn't a set of inputs, but a set of annotations on the model's parameters that guide their adjustment.
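
To make that concrete, here's a minimal sketch in plain Python: a one-parameter "model" y = w * x with a squared-error loss, everything hand-rolled and illustrative rather than any framework's API. The forward pass maps input to output; the backward pass produces a gradient for the parameter, not a reconstruction of the input.

    # Minimal sketch of forward vs. backward pass for a one-parameter
    # "model" y = w * x with squared-error loss. Purely illustrative.

    def forward(w, x):
        # Forward pass: inputs -> output, the thing the model is trained for.
        return w * x

    def backward(w, x, y_target):
        # Backward pass: starting from the loss at the output, compute the
        # gradient of the loss with respect to the parameter w. The result
        # annotates the parameter ("nudge w this way"); it does not recover
        # the input.
        y = forward(w, x)
        loss = (y - y_target) ** 2
        dloss_dy = 2 * (y - y_target)  # d(loss)/d(y)
        dloss_dw = dloss_dy * x        # chain rule: d(y)/d(w) = x
        return loss, dloss_dw

    w = 0.5
    loss, grad = backward(w, x=3.0, y_target=6.0)
    w -= 0.1 * grad  # one gradient-descent step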

The computations of the forward pass are not trivially reversible (e.g. they include additions, which destroy information about the operand values). As a sibling thread points out, you can still probabilistically explore what inputs _could_ produce a given output, and get some information back that way, but it's a lossy process.

And of course, you could train a "reverse" model, one that predicts the prefix of a sequence given a suffix (trivially: it's the same suffix prediction problem, but you train it on reversed sequences). But that would be a separate model trained from scratch on that task, and in that model the prefix prediction would be its forward pass.
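
A toy illustration of that trick, with made-up token IDs: the training examples are the usual (prefix, next token) pairs, just built from the reversed sequence.

    # Toy sketch: a "reverse" model is an ordinary next-token model trained
    # on reversed sequences. Token IDs here are made up.

    def reverse_training_examples(token_ids):
        rev = list(reversed(token_ids))
        # Predicting the "next" token of the reversed sequence means
        # predicting the *previous* token of the original one.
        return [(rev[:i], rev[i]) for i in range(1, len(rev))]

    examples = reverse_training_examples([101, 42, 7, 9])
    # -> [([9], 7), ([9, 7], 42), ([9, 7, 42], 101)]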


I do want to see ChatGPT running upwards on my screen now, predicting earlier and earlier words in a futile attempt to explain a nonsense conclusion. We could call it ChatJeopardy.

Both, with caveats. The attention computation is fundamentally quadratic: for every token in the sequence, you're doing a computation that involves every other token in the sequence. So it's O(N) per token, O(N^2) for the whole sequence.

The big mitigation for this is that in causal transformers (i.e. all the chatbot type applications, where each token is only allowed to see tokens before it), you're running inference repeatedly on the same prefix in order to grow it by one token at a time. So if you cache the computations for tokens 0..N-1, on each inference pass you only have to compute O(N) for the newly added token at the end of the sequence.
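
A rough numpy sketch of the difference, single head, no projections, purely illustrative rather than any real implementation: the full pass builds an N x N score matrix, while the incremental pass only scores the newest query against the cached keys/values.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention_full(Q, K, V):
        # O(N^2): every token attends over every earlier token.
        N, d = Q.shape
        scores = Q @ K.T / np.sqrt(d)
        scores[np.triu(np.ones((N, N), dtype=bool), k=1)] = -np.inf  # causal mask
        return softmax(scores) @ V

    def attention_incremental(q_new, K_cache, V_cache):
        # O(N) for the newest token: K_cache/V_cache hold the keys and values
        # for all previous tokens (plus this one), reused from earlier passes.
        d = q_new.shape[-1]
        scores = q_new @ K_cache.T / np.sqrt(d)
        return softmax(scores) @ V_cache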

That's why caching (and caching charges) appear so prominently everywhere in the pricing of inference.

In practice, caching is most beneficial at inference time, because you typically have relatively long conversations that start with the same cacheable prefix (the system prompt). At training time the same optimization can apply, but you're typically not pushing the same prefixes through the model repeatedly so you end up paying the quadratic cost more often.

The quadratic cost of attention is the fundamental compute bottleneck for transformer architectures, which is why there's research like this trying to find shortcuts in computing attention, as well as research into completely new primitives to replace attention (e.g. SSM, which is O(N) on a cold cache and O(1) on a warm cache).


I don't know about all hyperscalers, but I have knowledge of one of them that has a large enough fleet of atomic frequency standards to warrant dedicated engineering. Several dozen frequency standards at least, possibly low hundreds. Definitely not one per machine, but also not just one per datacenter.

As you say, the goal is to keep the system clocks on the server fleet tightly aligned, to enable things like TrueTime. But also to have sufficient redundancy and long enough holdover in the absence of GNSS (usually due to hardware or firmware failure on the GNSS receivers) that the likelihood of violating the SLA on global time uncertainty is vanishingly small.

The "global" part is what pushes towards having higher end frequency standards, they want to be able to freewheel for O(days) while maintaining low global uncertainty. Drifting a little from external timescales in that scenario is fine, as long as all their machines drift together as an ensemble.

The deployment I know of was originally rubidium frequency standards disciplined by GNSS, but later that got upgraded to cesium standards to increase accuracy and holdover performance. Likely using an "industrial grade" cesium standard that's fairly readily available, very good but not in the same league as the stuff NIST operates.


I wrote MetalLB, a bare metal load-balancer for Kubernetes, because I needed one for myself. It gained some popularity because for a couple years, it was the only way to get working L4 LB outside of clouds. These days I believe a couple of the CNIs added support for external BGP peering and integration with k8s's LB machinery, but that came years later.

As a result, I became network troubleshooting tech support for a large chunk of people trying to run kubernetes on bare metal. If you've not looked at k8s's networking, debugging even your own cluster is a nightmare, never mind debugging someone else's over slack. Usually that meant simultaneously giving them a crash course in intermediate/advanced networking, things like asymmetric routing and tracing packets through netfilter, just so I could then tell them that networks can't do the thing they wanted and no amount of new features I could add would change that.

Meanwhile companies selling bare metal k8s services started bundling MetalLB, but kept sending their customers to my bugtracker instead of taking some of the load themselves.

The experience burned me out severely. It's been several years and I still have a visceral negative reaction to the idea of open-sourcing code I wrote, and when I infrequently do, the projects come with a fairly blunt "contributions not welcome" message and a disabled issue tracker. I handed over the keys to MetalLB a long while back now. I hope the new maintainers and the project are doing okay.

I'll mention a positive of that time as well, to balance it out: as an experiment I opened a pinned issue asking happy users to drop me a note (https://github.com/metallb/metallb/issues/5), and many did. It was nice occasionally getting a notification that wasn't a complaint or demand for support. At one point someone left me a note that it was being used to support research projects at NASA JPL and DARPA. That was pretty neat.


In retrospect, could you have made some good money from that support?


BGP is the canonical way to take your entire global network offline within seconds. Glancing at BGP looking glasses, Starlink's prefixes seem to still be announced, but there could still be an accidental blackhole or routing loop within their AS, or something broken in one of their transit providers.

No idea if that's what's going on, but routing protocols are one of a few effectively global control planes that can go wrong very quickly like this.


If it were BGP/routing, you would think we'd still be able to get a signal and the modem would think it's healthy (although maybe not if the issue prevented us from obtaining our public IP); we just wouldn't be able to route to any dst. In the current case we don't have a signal (orange light on the modem).


Yes, before the dropout my traffic was coming from a downlink station in Bulgaria, on an IP on AS14593.

Traceroute from "the internet" back to that IP reaches AS14593 just fine, and my endpoint doesn't get beyond the first hop of the local starlink router.

Whatever it is, it doesn't look like a peering problem


https://mtr.ping.pe ftw for MTR from "the internet" ? :)


From various monitoring points I have on multiple internet connections.

One of the promises of Starlink was that traffic would stay in space as long as possible before being downlinked, giving far lower latency. Alas, that hasn't happened yet, and traffic will run thousands of miles in the wrong direction before being downlinked. For example, from one location to another I see 360ms via Starlink but just 200ms rtt via local provision (5G p2p wireless, then optical). Another location used to downlink in Lagos, but now downlinks in Nairobi, meaning traffic to Lagos routes Nairobi -> Marseille -> Lagos, taking far longer than it used to. A shame really.


Does the orange light specifically mean no RF link at all? Or does it include anything that prevents the modem from getting an IP address and route configuration? If the latter, BGP could still be at fault if it took out access to the control planes on the ground. But again all just guessing, from the outside all I see is the BGP routes are still being announced, and everyone seems to be seeing 100% packet loss and zero traffic.


Right, good point, that could be the case. Those were just my assumptions, and probably jumping to conclusions on my part in speculating that orange means no signal (I don't actually have any idea :) ). I imagine it could be any of what you said too.


That sounds bad…


A network of satellites gives you entirely new and exciting ways of taking your network offline, such as bricking them with a firmware update and no way to actually get up there and fix it.


Imagine being that dev who pushed the bad patch - Crowdstrike but x100


Crowdstrike’s fuckup was at the company level.

Whoever is doing immediate global deployments and/or any prod deployments without verified testing is just wrong as a corporate culture.


Elon saw yesterday's article about The Promised LAN, and said "What if we connected that instead of the Internet...?" https://news.ycombinator.com/item?id=44661682#44663409


Instead, it seems he's created the LAN of the lost


it's pretty amazing the amount of damage a BGP oopsie can do. Also, you can fit pretty much all the BGP admins for the entire Internet in one large room.


If by “room” you mean “Wembley stadium”, maybe.


Messing up with BGP makes you feel alive, that’s for sure.

But hey, if you haven’t caused an incident yet, that just means you’re still in onboarding. Those SLA downtime budgets are there to be spent.


I rebooted my terminal and I can’t tell for sure if it sees any satellites. It looks like it doesn't.

It says it didn’t, and it says the “which way is down” thing hasn’t converged. Occasionally, the signal to noise ratio light in the app goes gray which means < 3.

It also rebooted itself.

Before the first reboot, 30% of pings went through. It’s almost like the azimuth or some other timely but cached data was corrupted.


It's always a bad route that was introduced during a planned upgrade


I thought it was always DNS


They are/were scanning random public projects and filing unsolicited bug reports against them to chide them about profanity in their source code. The linked thread is just a bunch of the victims of that spam having some fun with the format.


Thanks for clarifying, I was puzzled.


In the case of this attack, somewhere between 40s and 30min of physical access, depending on how the card was set up. In the case of a hotel, the spicy card to clone would be the cleaning staff's, which conveniently also admits a reasonable explanation for the card going temporarily missing (e.g. abandon it one corridor over, oops must have dropped it while doing the rounds).

Depending on the specifics of a deployment, I'm guessing you could also use the card secrets to mint new cards that authenticate correctly to facility readers, but contain different information? But I don't know nearly enough about how these cards get used to know how much flexibility you get there.


> But I don't know nearly enough about how these cards get used to know how much flexibility you get there.

A lot of systems still just use the UID.

Physical security/door access control is still completely disconnected from IT security, despite these systems relying on software for the last 20 years. As a result, buyers of such systems generally have no knowledge of the risks, or of how to test for vulnerabilities.

I bet systems which rely on the UID only (something even the card manufacturer specifically warns against in their datasheet) are still being sold, and lots are definitely still out there. Such a card is trivial to clone and requires only a single read; no cracking is needed, because the UID isn't designed to be private to begin with.
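
For illustration only, the UID-only scheme boils down to the first check below, while a sane design at least forces a challenge-response against a per-card secret. This is hypothetical pseudocode with made-up interfaces, not any vendor's API.

    import hmac, hashlib, os

    # Hypothetical sketch, not any real access-control product's API.

    def check_uid_only(card_uid, allowed_uids):
        # The UID the reader sees is trusted as-is; anyone who reads the
        # card once can replay or emulate this value.
        return card_uid in allowed_uids

    def check_challenge_response(card, card_key):
        # The card must prove it holds a secret key, e.g. by MACing a fresh
        # random challenge. Cloning now requires extracting the key.
        challenge = os.urandom(16)
        response = card.sign(challenge)  # hypothetical card interface
        expected = hmac.new(card_key, challenge, hashlib.sha256).digest()
        return hmac.compare_digest(response, expected)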


I know of only one access system that is built on Mifare and does not use the UID, and that thing uses a file on the card as a bitfield of which doors it can open.


The paper reports that the same backdoor seems to be present in some NXP and Infineon SKUs as well, including some manufactured in Europe.


They could have licensed the IP from the same company.


Possibly so. It just means that based on the report's findings, even if you'd decided to play it safe and buy exclusively from NXP directly (the creators of this ecosystem and owners of the MIFARE trademark), it looks like you could still end up with backdoored hardware.


Sorry if I was being unclear with my compound snark, but using a MIFARE Classic of any provenance would be a firing offense for the CISO of my daydream company.


Indeed. Alas (or fortunately depending which colour team you work on), fully broken Mifare Classic is still all over the place, and likewise the "hardened" variant broken in this paper :(


What's a good alternative? How much more expensive is it?


MIFARE DESFire is an option. At a general-public reseller, I found 100 DESFire cards sold for 146€ (tax excluded), while 100 of the equivalent versions as MIFARE Classic are sold for 109€ (tax excluded). That's a difference of 37 cents per card; MIFARE Classic is about 25% less expensive than MIFARE DESFire. I guess the difference increases with the quantity you buy at once.


NXP would probably want to steer you away from mifare classic in the first place, wouldn't they?


Maybe for greenfield deployment… but there’s all the existing infrastructure to support.

I still see classic being installed for door/gate systems in American apartments that are under active construction in 2024. Presumably that’s because resellers either don’t know better or they just have a massive inventory.


I still see new apartment buildings with Sentex or Linear call boxes with the factory master passwords. I don't think these guys are crack security experts.


They found the exact same backdoor key present on old NXP and Infineon cards produced as early as 1996. See p.11:

> But, quite surprisingly, some other cards, aside from the Fudan ones, accept the same backdoor authentication commands using the same key as for the FM11RF08!

> ...

> - Infineon SLE66R35 possibly produced at least during a period 1996-2013;

> - NXP MF1ICS5003 produced at least between 1998 and 2000;

> - NXP MF1ICS5004 produced at least in 2001.

> ...

> Additionally, what are we to make of the fact that old NXP and Infineon cards share the very same backdoor key?


I think it's more likely those NXP/Infineon parts are counterfeits. Look at A.12: there are early cards that don't NACK $F000 but claim to be NXP or Infineon, behavior counter to legit parts. It looks like the Chinese copies started to chameleon that behavior later as well.


... or used the IP without licensing.


Please, name these academic Unixes! I would love to go see what they do. Down-thread there's a mention of minix, which does the normal thing: demand paging, context switches that only swap the page table directory pointer, and process memory images that are moved around the storage hierarchy indirectly, through page faults. Which other academic Unix or Unix-like did you have in mind?

Linux is indeed not the owner of the concept of PID 0. It's fortunate that I didn't say that! It is, however, not frequently involved with paging in and out memory.


xv6 and its many forks are what I'm thinking about

You address this somewhat in the post:

> Going back to the Wikipedia article, it seems the author of that edit wanted to write “swapping”, in the classic Unix V5 sense of swapping out whole processes as a consequence of scheduling. But the edit didn’t clarify that “swapping” was being used in an archaic sense that was likely to confuse the modern reader.

> context switches only the page table directory pointer

Swapping out the PTD pointer is exactly what I'm thinking of. I'm wrong, because I didn't have the common colloquial meaning of "swapping" (paging out memory to disk) in mind.

I think it's a little strange that such a meaning has come to dominate; at least in a classroom setting it is still fairly common to describe the operation of the scheduler as "swapping pages".


Yeah, admittedly it's confusing terminology generally, because it's still natural to say you're swapping the page tables out when you do a context switch on current systems. I probably do at some point in the post!

The distinction I was trying to get at was that, in early Unix, all process bytes were being actively streamed to and from disk as part of scheduling because the hardware didn't yet have a concept of virtual memory. So, if you wanted to make a program ready to run, you had to fully load it into memory, and shove anything else out of the way right then and there. That makes the scheduling function 5% deciding what should run, and 95% playing memory sokoban to make that happen.

OTOH, on systems with paged virtual memory, the scheduler is almost entirely "what's a good thing to run?", and implementing that decision is updating a couple of pointers. The only place the memory hierarchy creeps in, is if the scheduling algorithm wants to be fancy and account for things like NUMA nodes in its ranking of tasks.
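
Caricatured in Python-ish pseudocode (illustrative only, nothing here resembles real kernel code):

    # Caricature of the two schemes; illustrative only.

    def switch_to_swapping_unix(next_proc, core, disk, core_size):
        # Early Unix "swapping": the whole image must be in core to run, so
        # scheduling can mean streaming other whole images out to disk first.
        # core/disk are dicts of {process_name: image_bytes}.
        while sum(len(img) for img in core.values()) + len(disk[next_proc]) > core_size:
            victim, image = core.popitem()     # shove someone out...
            disk[victim] = image               # ...streaming their whole image to disk
        core[next_proc] = disk.pop(next_proc)  # stream this image back in
        return next_proc                       # now ready to run

    def switch_to_paged_vm(next_proc, cpu):
        # Paged virtual memory: the switch is basically a pointer update;
        # non-resident pages fault in lazily afterwards.
        cpu["page_table_root"] = next_proc
        return next_proc

The first one spends its time shoveling bytes around; the second just updates a pointer.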

I think it's reasonable, looking at it in isolation, to describe this part of the kernel as a "swapper", or the operation as "swapping". I think where it turns into a bear trap is when presenting these concepts to folks less familiar with kernel internals, where words like "swap" and "pages" are firmly the domain of the memory subsystem. And so, if I hand them a task and say "this is the swapper", IMO the majority will interpret that as being a component of virtual memory management, and they wouldn't be at fault for thinking that.

Empirically this happened in the 2008 wikipedia edit: "swapping" mutated to "paging" because in modern vmm-land that's a valid synonym, and that in turn became "this task is sometimes called 'sched' for historical reasons, and it handles paging" on the web. And cue a decade of confused students and stackoverflow users asking followups like "but if this task does paging, why does linux have all these kswapd threads?" That to me suggests that, for better or worse, the memory subsystem owns those words now, and the rest of the kernel has to be very careful if it uses them to mean something else, if it wants to avoid casual onlookers creating false associations. Something something naming things is still the hardest thing in computer science :)


It's not a problem, never has been. Nix mirrors all source bundles it pulls from third parties and caches them. cache.nixos.org has a copy of all the sources needed to build not just current HEAD, but also past commits (although deep history might start getting pruned for cost control soon, iiuc).
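
The reason a cache/mirror is just as trustworthy as the upstream download is that every source fetch is pinned to a content hash up front. Very roughly the idea, as a plain-Python sketch (placeholder names, not Nix's actual fetcher):

    import hashlib, urllib.request

    # Sketch of the content-addressing idea behind pinned source fetches:
    # the expected hash is recorded ahead of time, so it doesn't matter
    # whether the bytes come from upstream or from a cache/mirror.

    def fetch_pinned(url, expected_sha256):
        data = urllib.request.urlopen(url).read()
        actual = hashlib.sha256(data).hexdigest()
        if actual != expected_sha256:
            raise ValueError("hash mismatch: got " + actual)
        return data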

The Software Heritage archive also has an up to date mirror of xz's repo: https://archive.softwareheritage.org/browse/origin/directory...


Are they storing source archives for each version? I'd think mirroring the actual repo might take less space than a bunch of copies of source archives.

Though it looks like git only uses deflate on pack files. Someone should write a patch to add lzma support. :-)


In this instance then, it sounds like it has cached the source with a backdoor in it, and anyone using it is potentially exposing themselves to a very public problem right now.


Also no. It was rolled back hours ago, and cache.nixos.org keeps all past builds so it didn't even need rebuilding.

Orthogonal to that, the backdoor was irrelevant to nix in at least three different ways: the malicious build logic targeted rpm/deb build environments and so didn't trigger in nix's build sandbox, the backdoor code makes assumptions about filesystem layout that are invalid on nixos and so wouldn't have activated anyway, and nix doesn't include the downstream patch that results in the backdoor even getting into sshd's address space. Still got rolled back out of an abundance of caution, but nix got lucky that the attacker didn't bother targeting it the way they did debian and rpm-based distros.


The issue is that the author had been contributing for 2 years.

Debian is considering reverting to a version prior to his involvement, or switching; cf: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1068024


They do that out of caution, not because that's what needs to be done. No reason to panic if your distro doesn't go as far.

