> In fact, some of the Lisp Machines had garbage collection implemented in hardware and allocated everything including stack frames and binding environments in the heap.
I wrote a lot of code for CADR, Symbolics and D-Machine Lispms and I’m not aware of any of them having hardware or microcode GC. That would be crazy anyway. Tagging and other support sure, but you could say that the x86 has hardware support through its pager too.
As for stack frames on the heap, think of how important function calling is, and the impact of allocation, cache pressure, etc. of such an approach. They all used PDLs like any other machine.
Now InterLisp-D did spaghetti the stack when you made a closure, at least for some time. I know this because I triggered it with a pathological case of making thousands of closures, bringing any machine to its knees. But a spaghetti stack is a different thing from allocating frames from the heap.
The Scheme-79 chip (https://dspace.mit.edu/handle/1721.1/6334) was doing GC in microcode. According to Steele & Sussman it typically spent 80% of its time collecting. :-) RMS invented "phantom stacks" to reduce the amount of garbage due to closures; I'm not sure this was ever implemented anywhere.
Generational GC might make heap-allocating your activation records a viable option, even if you're not satisfied with CPython levels of performance. I mean in a sense that's what Chicken does, right?
That is the argument made in Andrew Appel’s paper “Garbage Collection Can Be Faster than Stack Allocation” (https://www.cs.princeton.edu/~appel/papers/45.pdf), and it’s the compilation strategy used by SML/NJ, as set out in Appel’s book “Compiling with Continuations”.
It's pretty hard to beat the pushj/popj approach, except that you have to lock out the GC before returning (as it may be scanning the activation records to find roots).
This has always been my favorite garbage collector. Not because of any performance benefits; I’ve just always liked the elegance of it.
At the dawn of the Squeak Smalltalk era, there was some company (I forget which) that did a hybrid Baker’s-treadmill implementation for some real-time stuff.
I did a simplistic implementation in C once for an image processing pipeline to manage labeled regions. Once you get it correct, it’s pretty slick. I used a Fibonacci sizer for the different ring sizes.
It says that alloc() is O(1) and thus real-time... but then later on says that advance() must be called k times per alloc in order to avoid exhausting the free space while still having work on the grey list. This makes it not quite real-time; either you call advance() k times inside alloc (which it recommends), in which case the latency isn't constant but varies with the number of reachable objects from the object you are popping from the work list, or you risk running out of heap and an eventual loop of draining the grey list to flip.
In either case, that doesn't seem quite as useful as the claimed "alloc is just a pointer chase", and not much better than other GCs that don't claim to be real-time; it makes it soft real-time (an upper bound on instructions) rather than hard real-time (an exact number of instructions), and most other incremental GCs also do a bounded amount of work per operation.
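To make the alloc/advance coupling being discussed concrete, here is a minimal sketch in C. This is an illustration, not Baker's actual algorithm: it assumes fixed-size two-pointer (CONS) cells, and it stands in for the grey segment of the ring with a plain singly linked worklist; all names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

enum color { WHITE, GREY, BLACK };

typedef struct cell {
    struct cell *car, *cdr;  /* the two child pointers of a CONS cell */
    enum color col;
    struct cell *next;       /* free-list / grey-worklist link */
} cell;

static cell *grey_head = NULL;  /* grey segment, as a simple list here */
static cell *free_head = NULL;  /* free segment */
static const int K = 2;         /* advance() steps per alloc() */

/* Shade a cell: white -> grey, push it on the worklist. */
static void shade(cell *c) {
    if (c && c->col == WHITE) {
        c->col = GREY;
        c->next = grey_head;
        grey_head = c;
    }
}

/* advance(): blacken one grey cell by shading its children. The cost is
   bounded by the cell's pointer count (2 here), which is why fixed-size
   cells give a constant-time step. */
static void advance(void) {
    cell *c = grey_head;
    if (!c) return;
    grey_head = c->next;
    shade(c->car);
    shade(c->cdr);
    c->col = BLACK;
}

/* alloc(): do K steps of collection work, then unlink a free cell.
   New cells are allocated black, so the collector never reclaims them
   in the current cycle. */
static cell *alloc(cell *car, cell *cdr) {
    for (int i = 0; i < K; i++) advance();
    cell *c = free_head;
    if (!c) return NULL;  /* heap exhausted */
    free_head = c->next;
    c->car = car;
    c->cdr = cdr;
    c->col = BLACK;
    return c;
}
```

With fixed-size cells each advance() really is a constant number of instructions; the critique above applies once objects can have arbitrarily many pointers, because then one advance() step is no longer constant-cost without further chunking.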
It also sounds like it requires a mutex on read barriers for multithreaded programs, which is pretty bad and expensive. Most modern GC read barriers are just setting a bit in a card table, which doesn't require a lock.
As far as I understand it, it's real-time if you don't adjust k dynamically to avoid running out of memory.
The heap size you need is R*(1+1/k), where R is the amount of reachable data. So if k=1, then you need to treat a half-full heap as an out-of-memory condition, which is comparable to quite a few GCs. If you set k to e.g. 4 instead, you can use 80% of the heap.
If you're ok with losing real-time guarantees you can instead dynamically increase k as needed.
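The sizing arithmetic above can be checked directly: if the heap must be at least R*(1+1/k) for R words of reachable data, the fraction of the heap that live data may occupy is R / (R*(1+1/k)) = k/(k+1). A trivial sketch (the function name is just for illustration):

```c
#include <assert.h>

/* Fraction of the heap usable for live data, given k advance() steps
   per alloc(): k / (k + 1). */
static double usable_fraction(int k) {
    return (double)k / (double)(k + 1);
}
```

So k=1 gives 1/2 (the half-full-heap OOM condition), and k=4 gives 4/5, i.e. the 80% figure.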
> It says that alloc() is O(1) and thus real-time...but then later on says that advance() must be called k times per alloc in order to avoid exhausting the free space while still having work on the gray list. This makes it not quite real-time; either you call advance() k times inside alloc (which it recommends) in which case it isn't constant latency but variable on the number of reachable objects from the object you are popping from the work list, or you risk running out of heap and have an eventual loop of draining the grey list to flip.
The original description is for CONS cells, which have exactly 2 pointers; the notes on generalization later in TFA state that if you want to support arbitrary sized objects, advance() needs to be made incremental for larger objects.
Sorry, you're right, I was thinking of write barrier schemes.
Off the top of my head, I think most non-moving GCs now use only write barriers and no read barriers, and since writes are less common than reads, that's maybe even better than what I said. https://webkit.org/blog/7122/introducing-riptide-webkits-ret... for example only uses a write barrier, and so does Lua's tricolor GC.
The closest modern analogue would maybe be something like Metronome, which is a moving GC that also advertises as real-time, but has a read barrier they intentionally tried to make very cheap and optimizable - I guess you could also coalesce Treadmill read barriers to only one mutex + color transition for multiple fields of an object all at once, though.
I don't know about little known. A lot of concurrent collector designs take some key points from the treadmill, and it was certainly still standard grad school reading if you were in systems 15 years ago.
but definitely a must-read if you care about that kind of thing
The Dijkstra, Lamport, Martin, Scholten and Steffens "on-the-fly" collector [0] is the most inspirational, I think, with its analogy of "colors" for objects. Baker first designed a serial "real-time" copying collector [1], showing that three areas in copying are equivalent to the colors; then I think the lists of objects in the Treadmill are supposed to correspond to those areas in the copying collector. Instead of moving objects and advancing a scan pointer to change the color of objects, objects are rearranged in the doubly linked-list.
I have seen it mentioned before in other articles/implementations of garbage collectors. I think anyone interested in GC will come across it, although certainly not as often as the ones that are currently popular.
I thought based on the title it would be a simple GC that worked similarly to a real treadmill.
The GC would move used objects to the front (as if they were running forward).
The unused objects would either be moved (or left in place) to the back (as if they stopped running).
All memory in the back would be considered free space when alloc-ing.
But the article quickly says that's not going to happen since it's "in-place", and I have no idea what my idea would already be called.
The question isn’t whether there’s some way to do arithmetic on raw pointers, but whether you can take a pointer to a subobject and expect the GC to keep the object alive for you. In Go you can do this, but you can’t in Java or C#, I believe.
Although it isn't visible from the language level the JVM internally does have "derived pointers" which are normally stored on the stack and point to an offset within an object. The GC can figure this out because it has a map for what the stack entries mean.
'The' JVM can internally have whatever it likes; so long as it is not exposed at the language level, nothing prevents an implementation from being devised which does not have such things internally, and which has a treadmill.
Java used to do something like that for strings, where taking a substring of a string created an object that pointed into the original string. But that was only for strings, and it didn't expose the pointer, so the implementation could keep a pointer to the original string around for the GC to use.
I'm a bit worried about the amount of pointer chasing here. Presumably on modern architectures that would kill cache locality? Or am I missing something?
IIRC the Treadmill has always had appalling constant factors; its appeal is that, unlike simpler or earlier GC algorithms, it has a bounded worst-case execution time.
On my first skim through the article it seemed like objects were getting copied between regions, i.e. when they change color. I guess the key is one of the first sentences "all objects in the heap are organised in a cyclic double-linked list", so there is a layer of indirection, and the circle you see is not the actual address space as I initially assumed, but the linked list. Did I understand correctly?
Double-linked list (Baker calls it "Knuth's double-linked list") allows O(1) insertions and deletions (hence moving) without actually moving objects and without additional indirection. Each object has a header that contains pointers to the next and previous objects in the list. To remove obj from the list do { obj->next->prev, obj->prev->next = obj->prev, obj->next } (I use "parallel assignment" for simplicity), similarly for insertion, see https://en.wikipedia.org/wiki/Doubly_linked_list . So the circles in the diagram are not actual address space (in the sense that object addresses are not increasing monotonically as you go clockwise), but there is no additional indirection because the header with the forward and backward pointers is part of the object.
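The "parallel assignment" above can be spelled out sequentially. This is a generic intrusive doubly linked ring in C, a sketch assuming (as the comment describes) that the two ring pointers live in a header embedded in every object; the function names are illustrative.

```c
#include <stddef.h>

/* Header embedded at the front of every heap object. */
typedef struct node {
    struct node *next, *prev;
    /* ... object payload would follow the header ... */
} node;

/* Make a one-element ring: a node linked to itself. */
static void ring_init(node *n) {
    n->next = n->prev = n;
}

/* Unlink obj from whatever ring it is currently on, in O(1). This is
   { obj->next->prev, obj->prev->next = obj->prev, obj->next }
   written out step by step. */
static void unlink_node(node *obj) {
    obj->next->prev = obj->prev;
    obj->prev->next = obj->next;
}

/* Insert obj immediately after pos, in O(1). Unlinking from one segment
   and relinking after another segment's boundary pointer is how the
   treadmill "moves" an object without copying it. */
static void insert_after(node *pos, node *obj) {
    obj->prev = pos;
    obj->next = pos->next;
    pos->next->prev = obj;
    pos->next = obj;
}
```

Changing an object's color is then just unlink_node() followed by insert_after() at the target segment: four pointer stores plus two, with the object's address never changing.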
IIRC, there are two approaches to managing your treadmill.
You either use one treadmill with a fixed-size object header/pointer. The “guts” of the object are placed on a heap which gets spaghettified and fragmented pretty fast, especially if you’re not going to move bodies, because… real time.
The other approach is to use multiple treadmills/rings, one per object size class. In this way, your heap fragmentation/locality improves, but at the added cost of indirecting on object size.
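One way the size-to-ring indirection might look — this is an assumption for illustration, not something from Baker's paper — is segregation by power-of-two size classes, so picking a ring is a small computation on the allocation path:

```c
#include <stddef.h>

#define NUM_RINGS 8
#define MIN_CLASS 16  /* smallest cell size in bytes; illustrative */

/* Map a requested size to a ring index; ring i holds cells of
   (MIN_CLASS << i) bytes. Requests beyond the largest class are
   clamped to the last ring here for simplicity. */
static int ring_index(size_t size) {
    size_t c = MIN_CLASS;
    int i = 0;
    while (c < size && i < NUM_RINGS - 1) {
        c <<= 1;
        i++;
    }
    return i;
}
```

The price is internal fragmentation (a 17-byte object occupies a 32-byte cell), traded against the external fragmentation of the single-treadmill scheme.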
It does mean that in a short period your heap will be a knotted spaghetti of objects with virtually pessimal locality as you chase pointers around the loop.
That's true, the worst case is bad. But if mutator locality of reference is reasonable, the scanner will preserve it. After all, Baker rejected his earlier copying design for a reason.
I don't know a lot about GC and didn't know about this before I saw this on HN, but I think it is super elegant. One thing I didn't understand: Are all objects assumed to be exactly the same size? Since allocation is just the move of a pointer, that appears to be the case. But how would that work if one would want to implement Treadmill for a language like Java where object sizes vary?
Variable size objects are addressed at the very end of the post: "Support for variable-sized objects requires a separate cyclic list for each size...".
> That implies that you have to divide up the memory ahead of time into an area for each size?
No, it doesn't: the treadmill maintains linked lists of objects, so you can use any underlying allocator to feed objects into the different linked lists.
Cheney's copying collector? You're gonna have huge garbage-to-live ratio with immutable objects, and this algorithm is pretty amazing for such scenarios.
I did too, but haven't had my coffee yet. I mean, I could see someone collecting them and selling them online for a profit or something. They aren't all that hard to fix. I've taken a couple I found curbside headed for the landfill, fixed them, and sold them locally on Facebook groups (I wouldn't even attempt to ship a treadmill/elliptical).