FWIW I work at Steel (not the OP). While we’ve been iterating on the “right shape” for agent tooling, I’ve been building a benchmark harness to measure how different surfaces affect real web task completion: raw API context, CLI-only, opinionated “skills” (structured outputs + artifact capture), and combinations.
If you’ve run agents on the open web, I’d love suggestions for nasty-but-representative workflows to include in the benchmark.
This rings true: with every new model update I find myself leaving behind full workflows I’ve built. The article is really great, and I do admire the system, even if it is overengineered in places, but it already reads like last quarter’s workflow. These days, letting Codex 5.3 xhigh chug for 30 minutes on my super long dictated prompt seems to do the trick, and I’m hearing 5.4 is a meaningfully better model. For fully autonomous scaffolding of new projects toward a first prototype, I also have my own version of a very simple Ralph loop that gets fed a gpt-pro super spec file.
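For anyone unfamiliar, the Ralph-loop pattern is just "re-feed the same spec to the agent until it says it's done." A minimal sketch (the `run_agent` callable and the `DONE` marker convention are my own assumptions, not any specific CLI):

```python
from pathlib import Path

def ralph_loop(spec_path, run_agent, max_iters=10, done_marker="DONE"):
    """Repeatedly feed the full spec to the agent; stop when its output
    contains the completion marker or the iteration budget runs out."""
    spec = Path(spec_path).read_text()
    for i in range(1, max_iters + 1):
        output = run_agent(spec)  # e.g. shell out to your agent CLI here
        if done_marker in output:
            return i              # converged after i iterations
    return None                   # budget exhausted, still not done
```

In practice `run_agent` would shell out to whatever coding agent you use, with the agent instructed to print the marker once the spec is satisfied.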
For no special reason, besides that I could, I’ve slop-coded this ephemeral VM orchestrator for AI agents, which I use inside any agent to manipulate and maintain my coding VMs on Proxmox. It could probably make sense to simplify it further and move from Proxmox to something like this. Link: https://github.com/nibzard/agentlab
This is exciting, but as some already commented, I had to read and check everything twice to figure it out. A strong feedback loop is the ultimate unlock for AI agents, and having twins is exactly the right approach.
For sure! Just ask "why" enough times and you will find the root. The main issue is how many people actually do that, and it’s becoming even more critical now.
Reproducibility is a fascinating topic for me, and today, with AI coding agents, we could have automated reproducibility, at least in some fields. The concept they touch on in the paper, post-publication verification, could replace or add onto existing research valorization.
I thought radiologists need to know what to look for in order to diagnose something? Do they brute force every potential condition in the body that can be detected with an MRI?
Exactly, because an MRI is not a simple "shows problems" machine. It provides a very simplified model of certain aspects of the state of the body. We very often can't know if parts of that state are a health problem or not.
To my knowledge, studies have not shown any benefit from regular full-body MRIs. You might find a problem, or you might find a non-problem and, in the process of fixing it (via operation or medication), create a problem. Those two effects seem to balance each other out on average.
> I thought radiologists need to know what to look for in order to diagnose something? Do they brute force every potential condition in the body that can be detected with an MRI?
No, when they read a scan, they're supposed to read everything visible for every problem. Think of it this way: if you break your leg and they take an MRI, do you want the radiologist to miss a tumor because he was focused on the break?
About how many "parameters" do they evaluate roughly for a full body scan? And is one typically qualified to evaluate across the entire body or do they specialize in different areas of the body?
I don't know, but I've heard from doctors (many times, sometimes quite forcefully) that it's a radiologist's job to call out all abnormalities on the full image they get, and the reasoning makes sense.
I suppose a full body MRI would be very expensive and take a lot of time to read.
Strong point. I’m considering tagging patterns better and adding fields like “model/toolchain-specific” and “last validated (month/year).” Things change fast; for example, “context anxiety” is likely less relevant now and should be reframed that way (or retired).