Treadmill garbage collector by H. Baker (cofault.com)
163 points by oecumena on July 26, 2022 | hide | past | favorite | 59 comments


> In fact, some of the Lisp Machines had garbage collection implemented in hardware and allocated everything including stack frames and binding environments in the heap.

I wrote a lot of code for CADR, Symbolics and D-Machine Lispms and I’m not aware of any of them having hardware or microcode GC. That would be crazy anyway. Tagging and other support sure, but you could say that the x86 has hardware support through its pager too.

As for stack frames on the heap: think of how important function calling is, and of the allocation and cache impact of such an approach. They all used PDLs like any other machine.

Now InterLisp-D did spaghetti the stack when you made a closure, at least for some time. I know this because I triggered it with a pathological case of making thousands of closures, bringing any machine to its knees. But a spaghetti stack is a different thing from allocating frames from the heap.


The Scheme-79 chip (https://dspace.mit.edu/handle/1721.1/6334) was doing GC in microcode. According to Steele & Sussman it typically spent 80% of its time collecting. :-) RMS invented "phantom stacks" to reduce the amount of garbage due to closures; I'm not sure this was ever implemented anywhere.


I forgot about that chip.

yeah, I don’t think anyone looked at the phantom stack idea. Perhaps interlisp should have.


Generational GC might make heap-allocating your activation records a viable option, even if you're not satisfied with CPython levels of performance. I mean in a sense that's what Chicken does, right?


That is the argument made by Andrew Appel's paper "Garbage Collection Can Be Faster than Stack Allocation" (https://www.cs.princeton.edu/~appel/papers/45.pdf), and it's the compilation strategy used by SML/NJ, as set out in Appel's book "Compiling with Continuations".


Well, I know, but I think he turned out to be wrong in that case. Fast enough, maybe.


It's pretty hard to beat the pushj/popj approach, except that you have to lock out the GC before returning (as it may be scanning the activation records to find roots).


Yeah, whatever Appel thinks, the stack is always faster. But sometimes the heap penalty is affordable. Before generational GC it was ridiculous.


For microcoded GC, also see https://en.wikipedia.org/wiki/Flex_machine


This has always been my favorite garbage collector. Not because of any performance benefits; I’ve just always liked the elegance of it.

At the dawn of the Squeak Smalltalk era, there was some company (forget which) that did a hybrid Baker's treadmill implementation for some real-time stuff.

I did a simplistic implementation in C once for an image processing pipeline to manage labeled regions. Once you get it correct, it’s pretty slick. I used a Fibonacci sizer for the different ring sizes.


I'm put to mind of Mike Pall's design (tragically unwritten) for a quad-color GC for LuaJIT: http://wiki.luajit.org/New-Garbage-Collector#gc-algorithms_q...

I'm just hearing of the Treadmill for the first time, so I can't compare the design in detail yet, but the two pictures tell a lot of that story.


It says that alloc() is O(1) and thus real-time, but then later says that advance() must be called k times per alloc to avoid exhausting the free space while there is still work on the grey list. That makes it not quite real-time: either you call advance() k times inside alloc (which it recommends), in which case the latency isn't constant but varies with the number of objects reachable from the object you pop off the work list, or you risk running out of heap and eventually have to loop draining the grey list in order to flip.

In either case that doesn't seem quite as useful as the claimed "alloc is just a pointer chase", and not much better than other GCs that don't claim to be real-time; it makes the collector soft real-time (an upper bound on instructions) rather than hard real-time (an exact number of instructions). Most other incremental GCs also bound the work they do per operation.

It also sounds like it requires a mutex on read barriers for multithreaded programs, which is pretty bad and expensive. Most modern GC read barriers are just setting a bit in a card table, which doesn't require a lock.


As far as I understand it, it's real-time if you don't adjust k dynamically to avoid running out of memory.

The heap size you need is R*(1+1/k). So if k=1, you need to treat a half-full heap as an out-of-memory condition, which is comparable to quite a few GCs. If you set k to e.g. 4 instead, you can use 80% of the heap.

If you're ok with losing real-time guarantees you can instead dynamically increase k as needed.


> It says that alloc() is O(1) and thus real-time...but then later on says that advance() must be called k times per alloc in order to avoid exhausting the free space while still having work on the gray list. This makes it not quite real-time; either you call advance() k times inside alloc (which it recommends) in which case it isn't constant latency but variable on the number of reachable objects from the object you are popping from the work list, or you risk running out of heap and have an eventual loop of draining the grey list to flip.

The original description is for CONS cells, which have exactly 2 pointers; the notes on generalization later in TFA state that if you want to support arbitrary sized objects, advance() needs to be made incremental for larger objects.


Are you sure? Setting a bit in a card table sounds more like a write barrier.


Sorry, you're right, I was thinking of write barrier schemes.

Off the top of my head, I think most non-moving GCs now use only write barriers, not read barriers; combined with writes being less common than reads, that's maybe even better than what I said. https://webkit.org/blog/7122/introducing-riptide-webkits-ret... for example only uses a write barrier, and so does Lua's tricolor GC.

The closest modern analogue would maybe be something like Metronome, which is a moving GC that also advertises as real-time, but has a read barrier they intentionally tried to make very cheap and optimizable - I guess you could also coalesce Treadmill read barriers to only one mutex + color transition for multiple fields of an object all at once, though.


Treadmill is a "real-time" in-place garbage collection algorithm designed by H. Baker. It is simple, elegant, efficient and surprisingly little known.


I don't know about little known. A lot of concurrent collector designs take key points from the treadmill, and it was certainly still standard grad-school reading if you were in systems 15 years ago.

but definitely a must-read if you care about that kind of thing


The Dijkstra, Lamport, Martin, Scholten and Steffens "on-the-fly" collector [0] is the most inspirational, I think, with its analogy of "colors" for objects. Baker first designed a serial "real-time" copying collector [1], showing that three areas in copying are equivalent to the colors; then I think the lists of objects in the Treadmill are supposed to correspond to those areas in the copying collector. Instead of moving objects and advancing a scan pointer to change the color of objects, objects are rearranged in the doubly linked-list.

[0] https://lamport.azurewebsites.net/pubs/garbage.pdf

[1] https://plover.com/~mjd/misc/hbaker-archive/RealTimeGC.html


I have seen it mentioned before in other articles/implementations of garbage collectors. I think anyone interested in GC will come across it, although certainly not as often as the ones that are currently popular.


He also wrote an experimental Lisp with linear types, a precursor to Rust's borrow system: https://news.ycombinator.com/item?id=20369522


I thought based on the title it would be a simple GC that worked similarly to a real treadmill.

The GC would move used objects to the front (as if they were running forward). Unused objects would be moved (or left in place) toward the back (as if they had stopped running). All memory at the back would be considered free space when allocating.

But the article quickly says that's not going to happen, since it's "in-place", and I have no idea what my idea would already be called.


A mark–compact algorithm? Like all copying collectors, it doesn't work that well with large live sets.


Am I correct in assuming that "no pointer arithmetic" and "read barrier on every pointer access" are the reasons this approach isn't more widely used?


No. Most languages with good GCs do not have pointer arithmetic, and many good GCs, especially real-time ones, have a read barrier.


.NET, Go, Java, Eiffel and Common Lisp all expose ways to do pointer arithmetic, either via unsafe code blocks, or unsafe runtime APIs.

I bet there isn't a language around (with GC) that tops their GC implementations.


The question isn’t whether there’s some way to do arithmetic on raw pointers, but whether you can take a pointer to a subobject and expect the GC to keep the object alive for you. In Go you can do this, but you can’t in Java or C#, I believe.



The Java link doesn't describe a way to have a pointer to a subobject. In Java you must always point at the start of an object.

In Go, objects are segregated by allocation size, so the GC can round subobject pointers to keep the surrounding object alive.


Although it isn't visible from the language level the JVM internally does have "derived pointers" which are normally stored on the stack and point to an offset within an object. The GC can figure this out because it has a map for what the stack entries mean.


'The' JVM can internally have whatever it likes; so long as it is not exposed at the language level, nothing prevents an implementation from being devised which does not have such things internally, and which has a treadmill.


JNI and Unsafe also exist, it isn't only about the language alone.


Oh cool, I didn't realize the stack maps were so advanced. Now I'm trying to imagine which optimizations made this necessary/useful.


Java used to do something like that for strings: taking a substring of a string created an object that pointed into the original string. But that was only for strings, and it didn't expose the pointer, so the implementation could keep a pointer to the original string around for the GC to use.

C# has Span<T> nowadays, but it is guaranteed to live only on the stack, making it easier for the GC to keep the object it points into alive while the Span lives (https://docs.microsoft.com/en-us/dotnet/api/system.span-1?vi...)


Java doesn't even have subobjects in this sense, so you're quite right, you can't do this!


It has JNI, Unsafe and, incoming, Panama.

All provide ways to shoot yourself in the foot if not used properly.


I guess it is not more widely used, because more complex generational collectors are often more efficient.


I'm a bit worried about the amount of pointer chasing here. Presumably on modern architectures that would kill cache locality? Or am I missing something?


IIRC the Treadmill has always had appalling constant factors; its appeal is that, unlike any simpler or earlier GC algorithm, it has a worst-case execution time.


On my first skim through the article it seemed like objects were getting copied between regions, i.e. when they change color. I guess the key is one of the first sentences "all objects in the heap are organised in a cyclic double-linked list", so there is a layer of indirection, and the circle you see is not the actual address space as I initially assumed, but the linked list. Did I understand correctly?


A doubly linked list (Baker calls it "Knuth's double-linked list") allows O(1) insertions and deletions (and hence O(1) moves between lists) without actually moving objects in memory and without additional indirection. Each object has a header that contains pointers to the next and previous objects in the list. To remove obj from the list, do { obj->next->prev, obj->prev->next = obj->prev, obj->next } (I use "parallel assignment" for simplicity); insertion is similar, see https://en.wikipedia.org/wiki/Doubly_linked_list . So the circles in the diagram are not the actual address space (object addresses do not increase monotonically as you go clockwise), but there is no additional indirection, because the header with the forward and backward pointers is part of the object.


IIRC, there are two approaches to managing your treadmill.

One is a single treadmill with a fixed-size object header/pointer. The "guts" of the objects are placed on a heap, which gets spaghettified and fragmented pretty fast, especially if you're not going to move bodies, because… real time.

The other is to use multiple treadmills/rings, one per object size class. That improves heap fragmentation/locality, at the added cost of indirecting on object size.


It does mean that in a short period your heap will be a knotted spaghetti of objects with virtually pessimal locality as you chase pointers around the loop.


That's true, the worst case is bad. But if mutator locality of reference is reasonable, the scanner will preserve it. After all, Baker rejected his earlier copying design for a reason.


Yes


I don't know a lot about GC and didn't know about this before I saw this on HN, but I think it is super elegant. One thing I didn't understand: Are all objects assumed to be exactly the same size? Since allocation is just the move of a pointer, that appears to be the case. But how would that work if one would want to implement Treadmill for a language like Java where object sizes vary?


Variable size objects are addressed at the very end of the post: "Support for variable-sized objects requires a separate cyclic list for each size...".


That implies that you have to divide up the memory ahead of time into an area for each size? If you get the proportions wrong you will waste memory.

Languages with only one object size are not exactly in vogue at the moment, so this is a major limitation.


> That implies that you have to divide up the memory ahead of time into an area for each size?

No, it doesn't, as it maintains linked lists of objects, so you can use any underlying allocator to feed you objects to add to the different linked lists.


You can extend or shrink a circular list on the go, depending on need. You can plan to divide memory ahead of time, but you don't have to.


Thanks, I somehow missed that line.


Anyone know of a garbage collector specialised to immutable objects?

No old objects containing pointers to new objects and no mutation while scanning hazards.


If there are no pointers from "old" to "new" objects, then there are no cycles and (compiler-assisted) reference counting will handle it.


Cheney's copying collector? You're gonna have huge garbage-to-live ratio with immutable objects, and this algorithm is pretty amazing for such scenarios.


Why would you garbage collect something that is going to change in the first place? By definition there isn't a way to reach it any more.


Honestly thought this was going to be a story about someone who collected/gathered treadmills from landfills.


I did too, but I haven't had my coffee yet. I mean, I could see someone collecting them and selling them online for a profit or something. They aren't all that hard to fix. I've fixed a couple I found curbside headed for the landfill and sold them locally on Facebook groups (I wouldn't even attempt to ship a treadmill/elliptical).


I thought this was literally a physical garbage collection device made from a repurposed treadmill.


Yep, was expecting something like Mr. Trash Wheel: https://www.mrtrashwheel.com/



