I don't speak for bopbopbop7, but I will say this: my experience of using Claude Code has been that it can do much longer tasks than the METR benchmark implies are possible.
The flip side is that if those benchmark tasks are representative of software engineering as a whole, I would expect there to be a lot of other tasks where it absolutely sucks.
This expectation is further supported by how often people pop up in conversations like this to say that a given LLM falls flat on its face even on something they consider simple, and that it cost them more time than it saved.
As with supposedly "full" self-driving on Teslas, the anecdotes about failure modes are much more interesting than the successes: one person whose commute/coding problem happens to be easy may mistake their own circumstances for the norm. Until it works everywhere, it doesn't work everywhere.
When I experiment with vibe coding (as in, properly unsupervised), it can break down large tasks into small ones and churn through each sub-task well enough that it can do, by itself, a task I'd expect to take most of a sprint. That said, it seems to do these things at a level of "that'll do" rather than "amazing!", but it does do them.
But I am very much aware this is like all the people posting "well my Tesla commute doesn't need any interventions!" in response to all the people pointing out how it's been a decade since Musk said "I think that within two years, you'll be able to summon your car from across the country. It will meet you wherever your phone is … and it will just automatically charge itself along the entire journey."
It works on my [use case], but we can't always ship my [use case].
From my perspective, it's not the worst analogy. In both cases, some people were forecasting an exponential trend into the future and sounding an alarm, while most people seemed to be discounting the exponential effect. Covid's doubling time was ~3 days, whereas the AI capabilities doubling time seems to be about 7 months.
I think disagreement in threads like this often traces back to a miscommunication about the state of things today (or historically) versus where the trend is heading. Skeptics are usually saying: capabilities are not good _today_ (or worse: capabilities were not good six months ago when I last tested them; see this OP, which is pre-Opus 4.5). Capabilities forecasters are saying: given the trend, what will things be like in 2026-2027?
The "COVID-19's doubling time was ≈3 days" figure was the output of an epidemiological model, based on solid and empirically-validated theory, based on hundreds of years of observations of diseases. "AI capabilities' doubling time seems to be about 7 months" is based on meaningless benchmarks, corporate marketing copy, and subjective reports contradicted by observational evidence of the same events. There's no compelling reason to believe that any of this is real, and plenty of reason to believe it's largely fraudulent. (Models from 2, 3, 4 years ago based on the "it's fraud" concept are still showing high predictive power today, whereas the models of the "capabilities forecasters" have been repeatedly adjusted.)
The article does not say at any point which model was used. This is the most basic and important piece of information when talking about the capabilities of a model, and it probably belongs in the title.
It does (unless the previous comment was edited? Currently it says Opus 4.1): https://www.anthropic.com/news/claude-opus-4-1. You can see it in the 'more models' list on the main Claude website, or in Claude Console.
Seems like a really interesting project! I don't understand what's going on with latency vs durability here. The benchmarks [1] report ~1ms latency for sequential writes, but that's just not possible with S3. So presumably writes are not being confirmed to storage before confirming the write to the client.
What is the durability model? The docs don't talk about intermediate storage. SlateDB does confirm writes to S3 by default, but I assume that's not happening here?
SlateDB offers different durability levels for writes. By default writes are buffered locally and flushed to S3 when the buffer is full or the client invokes flush().
The durability profile before sync should be pretty close to a local filesystem. There’s in-memory buffering on writes; data is synced when fsync is issued, when the in-memory buffer exceeds its threshold, or when a timeout expires.
Thanks, that makes sense. I dug into the benchmark source and it's not fsyncing, so only some of the files will be durable by the time the benchmark finishes. The benchmark docs might benefit from discussing this, or from benchmarking both cases? O_SYNC / fsync before file close is an important use case (roughly the pattern sketched below).
edit: A quirk with the use of NFSv3 here is that there's no specific close op. So, if I understand right, ZeroFS' "close-to-open consistency" doesn't imply durability on close (and can't, unless every NFS op is durable before returning), only on fsync. Whereas EFS and (I think?) Azure Files do have this property.
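To make that fsync-before-close point concrete, here's a minimal sketch of the pattern I have in mind (Java NIO; the mount point and file name are hypothetical, and this is not the benchmark's actual code):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class DurableWrite {
        public static void main(String[] args) throws IOException {
            // Hypothetical mount point and file name, purely for illustration.
            Path path = Path.of("/mnt/zerofs/example.dat");

            try (FileChannel ch = FileChannel.open(path,
                    StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE,
                    StandardOpenOption.TRUNCATE_EXISTING)) {
                ch.write(ByteBuffer.wrap("hello".getBytes()));
                // force(true) is the fsync: it returns only once data (and
                // metadata) have been flushed by the underlying filesystem.
                // Without it, the write may still be sitting in a buffer
                // when the file is closed.
                ch.force(true);
            }
            // Closing the channel by itself is not a durability barrier here;
            // as discussed above, NFSv3 has no close operation to hook into.
        }
    }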
There's an NFSv3 COMMIT operation, combined with a "durability" marker on writes. fsync could translate to COMMIT, but if writes are marked as "durable", COMMIT is not called by common clients, and if writes are marked as non-durable, COMMIT is called after every operation, which kind of defeats the point. When you use NFS with ZeroFS, you cannot really rely on "fsync".
I'd recommend using 9P when that matters, which has proper semantics there. One property of ZeroFS is that any file you fsync actually syncs everything else too.
I think your example reflects well on oss-20b, not poorly. It may well show that they've been successful in separating reasoning from knowledge. You don't _want_ your small reasoning model to waste weights memorizing minutiae.
Some of the comments so far seem to be misunderstanding this submission. As I understand it:
1. Custom scaffolding (system prompt and tools) using Qwen3-32B achieved 13.75% on Terminal-Bench. No training was involved.
2. The author has built an RL system, but it has not been used for anything due to cost limitations.
So there's actually no result related to training here. It is well known that the scaffolding used can have a large impact on benchmark outcomes (the Terminal-Bench leaderboard also demonstrates this [1]).
This actually intersects with two of my current interests. In production, we have been seeing rare ThreadPoolExecutor hangs (JDK 17) during shutdown, roughly around the standard shutdown idiom sketched below. After a lot of debugging, I increasingly suspect it may be an actual JDK issue. But this type of issue is extremely hard to reason about in production, and I've never successfully reproduced it locally. (It's not clear to me that it's the same issue as in the post, since it's not a scheduled executor.)
Separately, we're looking at using fray for concurrency property testing, as a way to reliably catch concurrency issues in a distributed system by simulating it within a single JVM.
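To give a rough picture of the first of those (the shutdown hangs): they show up around the standard shutdown idiom, something like the sketch below. This is only a guess at the general shape, with placeholder work and timeouts, not our actual production code.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ShutdownSketch {
        public static void main(String[] args) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            pool.submit(() -> {
                // placeholder for real work
            });

            pool.shutdown(); // stop accepting new tasks, let queued ones finish
            // This is roughly where we see the (rare) hang: awaitTermination
            // doesn't return even though all tasks appear to have completed.
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                pool.shutdownNow(); // interrupt anything still running
            }
        }
    }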
Argumentum ad populum, but I have the impression that most computer scientists, at least, do not find Searle's argument at all convincing. Too many people for whom GEB was a formative book.