More

anorwell · 2026-01-02T02:04:56 1767319496

What do you think about the METR 50% task length results? About benchmark progress generally?

ben_w · 2026-01-02T11:02:33 1767351753

I don't speak for bopbopbop7, but I will say this: my experience of using Claude Code has been that it can do much longer tasks than the METR benchmark implies are possible.

The converse of this is that if those tasks are representative of software engineering as a whole, I would expect a lot of other tasks where it absolutely sucks.

This expectation is further supported by the number of times people pop up in conversations like this to say for any given LLM that it falls flat on its face even for something the poster thinks is simple, that it cost more time than it saved.

As with supposedly "full" self driving on Teslas, the anecdotes about the failure modes are much more interesting than the success: one person whose commute/coding problem happens to be easy, may mistake their own circumstances for normal. Until it does work everywhere, it doesn't work everywhere.

When I experiment with vibe coding (as in, properly unsupervised), it can break down large tasks into small ones and churn through each sub-task well enough, such that it can do a task I'd expect to take most of a sprint by itself. Now, that said, I will also say it seems to do these things a level of "that'll do" not "amazing!", but it does do them.

But I am very much aware this is like all the people posting "well my Tesla commute doesn't need any interventions!" in response to all the people pointing out how it's been a decade since Musk said "I think that within two years, you'll be able to summon your car from across the country. It will meet you wherever your phone is … and it will just automatically charge itself along the entire journey."

It works on my [use case], but we can't always ship my [use case].

anorwell · 2025-12-14T17:13:56 1765732436

https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...

From my perspective, it's not the worst analogy. In both cases, some people were forecasting an exponential trend into the future and sounding an alarm, while most people seemed to be discounting the exponential effect. Covid's doubling time was ~3 days, whereas the AI capabilities doubling time seems to be about 7 months.

I think disagreement in threads like this often can trace back to a miscommunication about the state today / historically versus. Skeptics are usually saying: capabilities are not good _today_ (or worse: capabilities were not good six months ago when I last tested it. See: this OP which is pre-Opus 4.5). Capabilities forecasters are saying: given the trend, what will things be like in 2026-2027?

wizzwizz4 · 2025-12-14T18:26:36 1765736796

The "COVID-19's doubling time was ≈3 days" figure was the output of an epidemiological model, based on solid and empirically-validated theory, based on hundreds of years of observations of diseases. "AI capabilities' doubling time seems to be about 7 months" is based on meaningless benchmarks, corporate marketing copy, and subjective reports contradicted by observational evidence of the same events. There's no compelling reason to believe that any of this is real, and plenty of reason to believe it's largely fraudulent. (Models from 2, 3, 4 years ago based on the "it's fraud" concept are still showing high predictive power today, whereas the models of the "capabilities forecasters" have been repeatedly adjusted.)

anorwell · 2025-12-07T22:00:54 1765144854

The article does not say at any point which model was used. This is the most basic important information when talking about the capabilities of a model, and probably belongs in the title.

thecr0w · 2025-12-07T23:32:45 1765150365

Whoops, I'm very dumb. It's Opus 4.1. I updated the blog post and credited you for the correction. Thank you!

dweekly · 2025-12-08T03:39:47 1765165187

That model does not exist. Do you mean Opus 4.5?

retsibsi · 2025-12-08T05:59:37 1765173577

> That model does not exist.

It does (unless the previous comment was edited? Currently it says Opus 4.1): https://www.anthropic.com/news/claude-opus-4-1. You can see it in the 'more models' list on the main Claude website, or in Claude Console.

thecr0w · 2025-12-08T20:17:58 1765225078

yep, this is what I used.

dweekly · 2025-12-12T20:13:53 1765570433

Whoops, my bad. Sorry.

hu3 · 2025-12-08T04:11:14 1765167074

Opus GPT 4.1 Pro Maverick DeepK2

anorwell · 2025-12-03T20:29:33 1764793773

But only in the the tip (nightly) build. I'm somewhat tempted to switch to them for this.

cpach · 2025-12-03T21:05:22 1764795922

A while ago I compiled Ghostty from HEAD, because it had a bug fix I cared for. It was a very stable and pleasant experience. No hassle whatsoever.

memco · 2025-12-03T21:42:45 1764798165

If you'd like you can also use `tip` as the update channel to get the nightly build binary without having to compile it yourself: https://ghostty.org/docs/config/reference#auto-update-channe...

cpach · 2025-12-04T07:23:04 1764832984

Ah. Cool!

ilvez · 2025-12-04T11:21:44 1764847304

If you need to do that again note that there is asdf plugin as well ;-)

For Linux compiling is actually the only way to get tip.

anorwell · 2025-09-09T02:17:18 1757384238

Seems like a really interesting project! I don't understand what's going on with latency vs durability here. The benchmarks [1] report ~1ms latency for sequential writes, but that's just not possible with S3. So presumably writes are not being confirmed to storage before confirming the write to the client.

What is the durability model? The docs don't talk about intermediate storage. Slatedb does confirm writes to S3 by default, but I assume that's not happening?

[1] https://www.zerofs.net/zerofs-vs-juicefs

Shakahs · 2025-09-09T03:45:11 1757389511

SlateDB offers different durability levels for writes. By default writes are buffered locally and flushed to S3 when the buffer is full or the client invokes flush().

https://slatedb.io/docs/design/writes/

Eikon · 2025-09-09T06:14:58 1757398498

The durability profile before sync should be pretty close to a local filesystem. There’s (in-memory) buffering happening on writes, then when fsync is issued or when we exceed the in-memory threshold or we exceed a timeout, data is sync-ed.

anorwell · 2025-09-09T12:23:55 1757420635

Thanks, makes sense. I found the benchmark src to see it's not fsyncing, so only some of the files will be durable by the time the benchmark is done. The benchmark docs might benefit from discussing this or benchmarking both cases? O_SYNC / fsync before file close is an important use case.

edit: A quirk with the use of NFSv3 here is that there's no specific close op. So, if I understand right, ZeroFS' "close-to-open consistency" doesn't imply durability on close (and can't unless every NFS op is durable before returning), only on fsync. Whereas EFS and (I think?) azure files do have this property.

Eikon · 2025-09-09T13:15:25 1757423725

There's an NFSv3 COMMIT operation, combined with a "durability" marker on writes. fsync could translate to COMMIT, but if writes are marked as "durable", COMMIT is not called by common clients, and if writes are marked as non-durable, COMMIT is called after every operation, which kind of defeats the point. When you use NFS with ZeroFS, you cannot really rely on "fsync".

I'd recommend using 9P when that matters, which has proper semantics there. One property of ZeroFS is that any file you fsync actually syncs everything else too.

anorwell · 2025-08-06T00:07:30 1754438850

I think your example reflects well on oss-20b, not poorly. It (may) show that they've been successful in separating reasoning from knowledge. You don't _want_ your small reasoning model to waste weights memorizing minutiae.

anorwell · 2025-07-29T13:30:10 1753795810

Some of the comments so far seem to be misunderstanding this submission. As I understand it:

1. Custom scaffolding (system prompt and tools) using Qwen3-32B achieved 13.75% on Terminal-Bench. No training was involved.

2. The author has built an RL system, but it has not been used for anything due to cost limitations.

So there's actually no result related to training here. It well known that the scaffolding used can have a large impact on benchmark outcomes (the Terminal bench leaderboard also demonstrates this [1]).

[1] https://www.tbench.ai/leaderboard

esafak · 2025-07-29T14:38:12 1753799892

It looks like the submission has two aspects that are being conflated.

1. Tooling for training a terminal agent.

2. An agent that was _not_ trained with this tooling but prompt engineered. I could not find the author's discussion on this point.

anorwell · 2025-06-09T13:40:43 1749476443

This actually intersects with two of my current interests. We have, in production, rarely been seeing ThreadPoolExecutor hangs (JDK17) during shutdown. After a lot of debugging, I've been suspecting more and more that it may be an actual JDK issue. But, this type of issue is extremely hard to reason about in production, and I've never successfully reproduced it locally. (It's not clear to me that it's the same issue as in the post, since it's not a scheduled executor.)

Separately, we're looking at using fray for concurrency property testing, as a way to reliably catch concurrency issues in a distributed system by simulating it within a single JVM.

anorwell · 2025-06-01T01:10:57 1748740257

Nor does a neuron.

Argumentum ad populum, I have the impression that most computer scientists, at least, do not find Searle's argument at all convincing. Too many people for whom GEB was a formative book.

anorwell · 2025-05-30T11:58:32 1748606312

Is it any good? Perhaps we can ask Opus to review it to find out.

karmakurtisaani · 2025-05-30T12:00:16 1748606416

It's full of made up scenarios and anecdotes. Make it a little bit worse and it'll reach Malcolm Gladwell level.

IshKebab · 2025-05-30T12:39:29 1748608769

I skimmed one chapter and it just decayed into lists. Why do LLMs love lists so much?

Cyphase · 2025-05-30T12:49:14 1748609354

Top 13 Reasons LLMs Love Lists

[Click to begin the slideshow]

__alexs · 2025-05-30T12:25:15 1748607915

Gemini seems to think it's very interesting.