nielstron's comments | Hacker News

Hey thanks for your review, a paper author here.

Regarding the 4% improvement for human written AGENTS.md: this would be huge indeed if it were a _consistent_ improvement. However, for example on Sonnet 4.5, performance _drops_ by over 2%. Qwen3 benefits most and GPT-5.2 improves by 1-2%.

The LLM-generated prompts follow the coding agent recommendations. We also show an ablation over different prompt types, and none have consistently better performance.

But ultimately I agree with your post. In fact, we do recommend writing a good AGENTS.md, manually and in a targeted way. This is emphasized, for example, at the end of our abstract and in the conclusion.


Without measuring quality of output, this seems irrelevant to me.

My use of CLAUDE.md is to get Claude to avoid making stupid mistakes that will require subsequent refactoring or cleanup passes.

Performance is not a consideration.

If anything, beyond CLAUDE.md I add agent harnesses that often increase the time and tokens used many times over, because my time is more expensive than the agents.


CLAUDE.md isn't a silver bullet either, I've had it lose context a couple of questions deep. I do like GSD[1] though, it's been a great addition to the stack. I also use multiple, different LLMs as a judge for PRs, which captures a load of issues too.

[1] https://github.com/gsd-build/get-shit-done


In this context, "performance" means "does it do what we want it to do" not "does it do it quickly". Quality of output is what they're measuring, speed is not a consideration.


The point is that whether it does what you tell it in a single iteration is less important than whether it avoids stupid mistakes. Any serious use will put it in a harness.


My point is that you misread the comment you replied to. (By the way, on page 2 of the paper: "we evaluate each LLM only within its corresponding harness.")

> My point is that you misread the comment you replied to.

I'm not the person you replied to.

> (By the way, on page 2 of the paper: "we evaluate each LLM only within its corresponding harness.")

That has zero relevance to my comment or to the type of harnesses I talked about in the comment you replied to, nor in my comment up-thread.


The only people I have replied to in this thread were vidarh, vidarh, and now vidarh again. I thought you were all the same person?

You're measuring binary outcomes, so you can use a beta distribution to understand the distribution of possible success rates given your observations, and thereby provide a confidence interval on the observed success rates. This would help us see whether that 4% improvement is statistically significant, or whether it is likely to be noise.
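A quick sketch with the Python standard library (the success counts below are illustrative, not the paper's actual numbers): sampling from the Beta posterior under a uniform Beta(1, 1) prior gives a credible interval on each success rate.

```python
import random
from statistics import quantiles

random.seed(0)

# Illustrative counts, not the paper's actual numbers:
successes_base, trials = 65, 130   # baseline run
successes_md = 70                  # run with a human-written AGENTS.md

def credible_interval(successes, trials, n_draws=100_000):
    """95% credible interval on the success rate, Beta(1, 1) prior."""
    draws = [random.betavariate(1 + successes, 1 + trials - successes)
             for _ in range(n_draws)]
    qs = quantiles(draws, n=40)    # cut points at 2.5%, 5%, ..., 97.5%
    return qs[0], qs[-1]           # 2.5th and 97.5th percentiles

lo_b, hi_b = credible_interval(successes_base, trials)
lo_m, hi_m = credible_interval(successes_md, trials)
print(f"baseline:  [{lo_b:.3f}, {hi_b:.3f}]")
print(f"AGENTS.md: [{lo_m:.3f}, {hi_m:.3f}]")
```

With only ~130 trials the intervals come out roughly 17 percentage points wide, so a 4-point gap sits comfortably inside the overlap.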


I’ve only ever gotten, like, slight wording suggestions from reviewers. I wish they would write things like this instead—it is possibly meaningful and eminently do-able (doesn’t even require new data!).


Taking a slightly closer look at the paper: you've got K repositories and create a set of test cases within each repository, totaling 130-ish tests. There may be some 'repository-level' effects - i.e., tasks may be easier in some repos than in others.

Modeling the overall success rate then requires some hierarchical modeling. You can consider each repository as a weighted coin, and each test within a repository as flip of that particular coin. You want to estimate the overall probability of getting heads, when choosing a coin at random and then flipping it.

Here are some Gemini hints on how to proceed with getting the confidence interval using hierarchical Bayes: https://gemini.google.com/corp/app/e9de6a12becc57f6

(Still no need for further data!)
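A crude stdlib-only sketch of the "coin per repository" view, with made-up per-repo counts (a full partial-pooling model would be better, but this captures the repo-level structure): draw each repo's rate from its own Beta posterior, then average to get the chance of heads when picking a repo-coin at random.

```python
import random
from statistics import mean, quantiles

random.seed(0)

# Hypothetical per-repository (successes, trials); the paper's real counts differ.
repos = [(12, 20), (5, 15), (18, 25), (9, 30), (21, 40)]

def overall_rate_draw():
    # Draw each repo's success rate from its Beta posterior (uniform prior),
    # then average across repos.
    return mean(random.betavariate(1 + s, 1 + n - s) for s, n in repos)

draws = [overall_rate_draw() for _ in range(50_000)]
qs = quantiles(draws, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
print(f"95% interval on the overall success rate: [{qs[0]:.3f}, {qs[-1]:.3f}]")
```

Averaging posterior draws per repo weights every repository equally, which is exactly the "choose a coin at random" framing; pooling all tests into one binomial would instead weight repos by test count.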


> Regarding the 4% improvement for human written AGENTS.md: this would be huge indeed if it were a _consistent_ improvement. However, for example on Sonnet 4.5, performance _drops_ by over 2%. Qwen3 benefits most and GPT-5.2 improves by 1-2%.

Ok, so that's interesting in itself. Apologies if you go into this in the paper - I haven't had time to read it yet - but does this tell us something about the models themselves? Is there a benchmark lurking here? It feels like this is revealing something about the training, but I'm not sure exactly what.


It could... but as pointed out by others, the significance is unclear, and per-model results have even fewer samples than the benchmark average. So: maybe :)


Thank you for turning up here and replying!

> The LLM-generated prompts follow the coding agent recommendations. We also show an ablation over different prompt types, and none have consistently better performance.

I think the LLM-generated AGENTS.md files recommended by coding agents are almost without exception really bad, because to perform well, the AGENTS.md needs to point out the _non_-obvious. Every single LLM-generated AGENTS.md I've seen - including from certain vendors who at one point shipped automatic AGENTS.md generation out of the box - wrote about the obvious things! The literal opposite of what you want. Indeed a complete and utter waste of tokens that does nothing but induce context rot.

I believe this is because creating a good one consumes a massive amount of resources and requires some engineering for any non-trivial codebase. You'd need multiple full-context iterations and a large number of thinking tokens.

On top of that, and I've said this elsewhere, most of the best stuff to put in AGENTS.md is things that can't be inferred from the repo: answers to "Is this intentional?", "Why is this the case?" and so on. Obviously, neither the LLM nor a new-to-the-project human could know these or add them to the file. And the gains from this are also hard to capture with your performance metric, because they're not really about solving issues; they're often about direction, or about the how rather than the what.

As for the extra tokens, the right AGENTS.md can save lots of tokens, but that requires thinking hard about them. Which system/business logic would take the agent 5 different file reads to properly understand, but which we can summarize in 3 sentences?


Yes that's a great summary and I agree broadly.

Note that by different prompt types I refer to different types of meta-prompts used to generate the AGENTS.md. All of these are quite useless. Some additional experiments not in the paper showed that other automated approaches are also useless ("memory"-creating methods, broadly speaking).


I will read the paper, but I am curious whether the methods promoted by engineers/researchers at OpenAI for models like Codex 5.2/5.3 work. That is, is having a separate agent look at recent agent sessions, deduce problems the agents ran into, and update AGENTS.md (or, more likely, the indexed docs referenced in an AGENTS.md) actually helpful? A priori, that seems like the main kind of meta-prompting/harness you might expect to work more robustly.


This is the life of an LLM researcher. We literally ran the last experiments only a month ago, on what were the latest models back then...


Exactly my thoughts... the model should just auto ingest README and CONTRIBUTING when started.


You could have claude --init create this hook, and then it gets into the context at start and on resume

Or create it in some other way

    {
      "hookSpecificOutput": {
        "hookEventName": "SessionStart",
        "additionalContext": "<contents of your file here>"
      }
    }
I thought it was such a good suggestion that I made this just now and made it global, to inject the README at startup / resume / post-compact - I'll see how it works out

https://gist.github.com/lawless-m/fa5d261337dfd4b5daad4ac964...

    #!/bin/bash
    # ~/.claude/hooks/inject-readme.sh

    README="$(pwd)/README.md"

    if [ -f "$README" ]; then
      CONTENT=$(jq -Rs . < "$README")
      echo "{\"hookSpecificOutput\":{\"hookEventName\":\"SessionStart\",\"additionalContext\":${CONTENT}}}"
      exit 0
    else
      echo "README.md not found" >&2
      exit 1
    fi
with this hook

    {
      "hooks": {
        "SessionStart": [
          {
            "matcher": "startup|clear|compact",
            "hooks": [
              {
                "type": "command",
                "command": "~/.claude/hooks/inject-readme.sh"
              }
            ]
          }
        ]
      }
    }


Unlike other content, what you put in here survives compaction


And that makes total sense. Honestly, having worked with Opus 4.6 for a few days now, it really feels like a competent coworker, but one that needs some explicit conventions to follow - exactly like when onboarding a new IC! So I think there is a bright side here: this will force having proper and explicit contribution rules and conventions, both for humans and robots


Hey, paper author here. We did try to get an even sample - we include both SWE-bench repos (which are large, popular, and mostly human-written) and a sample of smaller, more recent repositories with existing AGENTS.md files (these tend to contain LLM-written code, of course). Our findings generalize across both samples. What is arguably missing are small repositories of completely human-written code, but these are quite difficult to obtain nowadays.


Why stick to python-only repositories though?


To reduce the number of variables to account for. To be able to finish the paper this year, and not next century. To work with a familiar language and environment. To use a language heavily represented in the training data.

I mean, it's not that hard to understand why.


[flagged]


All research is conducted in constraints. It's not hard to understand those constraints by simply thinking.

Besides, one could actually open the research, and scroll to section 5 where they acknowledge the need to expand beyond Python:

--- start quote ---

5. Limitations and Future Work

While our work addresses important shortcomings in the literature, exciting opportunities for future research remain.

# Niche programming languages

The current evaluation is focused heavily on Python. Since this is a language that is widely represented in the training data, much detailed knowledge about tooling, dependencies, and other repository specifics might be present in the models’ parametric knowledge, nullifying the effect of context files. Future work may investigate the effect of context files on more niche programming languages and toolchains that are less represented in the training data, and known to be more difficult for LLMs

--- end quote ---


You still did not answer my question and you're still being a d*ck. I understand now why - because you have no idea what I am talking about.


Hey, a paper author here :) I agree - if you know LLMs well, it shouldn't be too surprising that autogenerated context files don't help. Yet this is the default recommendation from major AI companies, which we wanted to scrutinize.

> Their definition of context excludes prescriptive specs/requirements files.

Can you explain a bit what you mean here? If the context file specifies a desired behavior, we do check whether the LLM follows it, and this seems generally to work (Section 4.3).


Debunking the Claims of K2-Think https://www.sri.inf.ethz.ch/blog/k2think



noted. we'll make sure to criticize turing complete type systems more thoroughly next time :))


Yes, this work is super cool too! Note that LSPs cannot guarantee resolving the necessary types that we use to ensure the prefix property, which we leverage to avoid backtracking and generation loops.


thank you!

