More

comboy · 2026-04-14T20:18:40 1776197920

We are sorry, but your print resembles random princess from Disney too much (actually, we won't tell you which). Just following the law you know..

comboy · 2026-04-14T19:33:47 1776195227

Unrelated, but Claude was performing so tragically last few days, maybe week(s), but days mostly, that I had to reluctantly switch. Reluctantly because I enjoy it. Even the most basic stuff, like most python scripts it has to rerun because of some syntax error.

The new reality of coding took away one of the best things for me - that the computer always just does what it is told to do. If the results are wrong it means I'm wrong, I made a bug and I can debug it. Here.. I'm not a hater, it's a powerful tool, but.. it's different.

scandinavian · 2026-04-15T06:19:27 1776233967

I'm not a big user, but I have been doing some vibe-ish coding for a PoC the past few days, and I'm astonished at how bad it is at python in particular (Opus 4.6 High).

* It likes to put inline imports everywhere, even though I specify in my CLAUDE.md that it should not.

* We use ruff and pyright and require that all problems are addressed or at least ignored for a good reason, but it straight up #noqa ignores all issues instead.

* For typing it used the builtin 'any' instead of typing.Any which is nonsense.

* I asked it to add a simple sum of a column from a related database table, but instead of using a calculated sum in SQL it did a classic n+1 where it gets every single row from the related table and calculates the sum in python.

Just absolute beginner errors.

Lord-Jobo · 2026-04-15T11:49:24 1776253764

It really does have some disgusting inline behaviors. I’ve also seen it do some really bizarre stuff with SQL

lovlar · 2026-04-15T08:01:19 1776240079

Clajjan is this you?

taspeotis · 2026-04-14T22:57:21 1776207441

https://marginlab.ai/trackers/claude-code/

comboy · 2026-04-15T11:14:17 1776251657

I think API is fine, likely only subscription is affected. Not to mention trivial heuristics to differentiate repeated API calls / same data and potential CLI usage although that would be true malice.

It seemed to me that it was performing better through opencode using API but did not test extensively.

chillacy · 2026-04-15T02:08:09 1776218889

If SWE Bench is public then Anthropic is at a minimum probably also looking at their SWE bench scores when making changes, I'd trust more a tracker which runs a private benchmark not known to Anthropic.

bluegatty · 2026-04-14T21:12:12 1776201132

Codex with 5.4 xhigh. It's a bad communicator but does the job.

elAhmo · 2026-04-14T23:27:31 1776209251

You mean codex (client) with GPT 5.4 xhigh? I am using Codex 5.3 (model) through Cursor, waiting for Codex 5.4 model as I had great experience so far with 5.3.

bluegatty · 2026-04-14T23:52:39 1776210759

yes codex. it has 5.4.

winrid · 2026-04-15T05:10:25 1776229825

It's bad at long running tasks.

bluegatty · 2026-04-15T05:19:37 1776230377

Yes and no. It's bad because of shorter context but it does have auto-compaction which was much better than Claude. If you provide it documentation to work from and re-reference, it works long-running.

Honestly - 'every inch of IQ delta' seems to be worth it over anything else.

I'm a long time Claude Code supporter - and I'm ashamed to admit how instantly I dropped it when discovering how much better 5.4 is.

I don't trust Claude anymore for anything that requires heavy thinking - Codex always finds flaws in the logic.

But this happens every few months.

pacha3000 · 2026-04-14T19:52:48 1776196368

I'm the first to be tired of everyone, for every model, that says "uuuh became dumber" because I didn't believe them

... until this week! Opus is struggling worse than Sonnet those last two weeks.

saghm · 2026-04-14T21:57:08 1776203828

Forget the agent itself being dumber: right now I'm getting an "API error: usage limit exceeded" message whenever I try anything despite my usage showing as 26% for the session limit and 8% for the week (with 0/5 routines, which I guess is what this thread is about). This is with the default model and effort, and Claude Code is saying I need to turn on extra usage for it to work. Forget that, I just canceled my subscription instead.

There's utility in LLMs for coding, but having literally the entire platform vibe-coded is too much for me. At this point, I might genuinely believe they're not intentionally watering anything down, because it's incredibly believable that they just have no clue how any of it works anymore.

jpcompartir · 2026-04-14T21:10:59 1776201059

Likewise, I foolishly assumed everybody else was just doing it wrong.

But this week I've lost count of the times I've had to say something along the lines of: "Can you check our plan/instructions, I'm pretty sure I said we need to do [this thing] but you've done [that thing]..."

And get hit with a "You're absolutely right...", which virtually never happened for me. I think maybe once since Opus 4-6.

spoiler · 2026-04-15T06:23:51 1776234231

Honestly, I thought it was a skill issue too, but it just turns out I wasn't using it enough.

I started a new job recently, so I'm asking it a lot of questions about the codebase, sometimes just to confirm my understanding and often it came up with wrong conclusions that would send me down rabbit holes only to find out it was wrong.

On a side project I gave it literally a formula and told it to run it with some other parameters. It was doing its usual "let me get to know the codebase" then a "I have a good understanding of the codebase" speech, only to follow it up with "what you're asking is not possible" I'm like... No, I know it's possible I implemented it already, just use it in more places only to get the same "o ye ur right, I missed that... Blabla"

Yeah, it's gotten pretty bad...

redanddead · 2026-04-15T06:51:18 1776235878

They track our frustration, which is probably really good coding data. The reason why it's painful is because that's data annotation, it's literally a job people get paid to do, yet we're paying to do it. If they need good data, they just turn the models to shit and gaslight everyone

girvo · 2026-04-14T21:41:05 1776202865

My favourite was, Opus 4.6 last night (to be fair peak IST time, late afternoon my time), the first prompt with a small context: jams a copy-pasted function in between a bunch of import statements, doesn't even wire up it's own function and calls it done. Wild, I've not seen failure states like that since old Sonnet 4

data-ottawa · 2026-04-15T01:18:44 1776215924

Yesterday I had my biggest Opus WTF.

I asked Opus 4.6 to help me get GPU stats in btop on nixos. Opus's first approach was to use patchelf to monkey patch the btop binary. I had to redirect it to just look the nix wiki and add `nixpkgs.config.rocmSupport = true;`.

But the approach of modifying a compiled binary for a configuration issue is bizarre.

pxc · 2026-04-15T03:06:09 1776222369

It does stuff like this all the time. It loves doing this with scripts with sed, so I'm not surprised to hear about it trying to do this with binaries. It's definitely wilder, though

spoiler · 2026-04-15T06:32:22 1776234742

It frequently gets indentation wrong on projects, then tries to write sed/awk scripts. Can't get it right, then write a python script that reformats the whole file on stdout, makes sure the indentation is correct, then writes requests an edit snippet.

And you might be thinking. Well, you should use a code formatter! But I do!

And then you might say, well surely you forgot to mention it in you AGENTS/CLAUDE file. Nope, it's there, multiple times even in different sections because once was apparently not enough.

And lastly, surely if I'm watching this cursed loop unfold and am approving edits manually, like some bogan pleb, I can steer it easily... Well, let me tell ya... I tried stopping it and injecting hints about the formatter, and it stick for a minute before it goes crazy again. Or sometimes it rereads the file and just immediately fucks up the formatting.

I think when this shit happens, it probably uses like 3x more tokens.

For a Rust project, it recently stated analysing binaries in the target as directory a first instinct, instead of looking at the code...

Good grief.

comboy · 2026-04-14T20:21:09 1776198069

Pretty reassuring to hear that. I was skeptical too, there's a lot of variables like some crap added to memory specific skill or custom instructions interfering with the workflow and what not. But now it was like a toddler that consumes money when talking.

timacles · 2026-04-14T21:45:03 1776203103

It’s quite an interesting business model actually that the worse it performs to a degree the more money it makes you because of the token churn

combyn8tor · 2026-04-14T21:54:06 1776203646

In my experience Opus and Claude have declined significantly over the past few weeks. It actually feels like dealing with an employee that has become bored and intentionally cuts corners.

rishabhaiover · 2026-04-15T00:18:36 1776212316

And the worse part is the company is gaslighting people when they report it

qingcharles · 2026-04-14T21:56:26 1776203786

Is it? Or is it the task you're trying to do? Opus 4.6 has been staggeringly good for me this last week, both inside Claude Code and through Antigravity until I used up my quota.

SoMomentary · 2026-04-15T01:43:58 1776217438

I think some of this comes down to undeclared A/B testing. I've had the worst week of interactions I have ever had using Claude Code. The whole week whenever I have a session that isn't failing miserably I seem to get tapped for a session survey but on any that are out and out shitting the bed it never asks. It has felt a little surreal. I'd love to see a product wide stats graph for swearing, I would 100% believe that it is hitting an all time high but maybe I'm just a victim of a bad A/B round.

oefrha · 2026-04-15T03:08:50 1776222530

Oh I’ve been getting a lot more of those too lately even though I dismiss it every time. Wonder if I should report not satisfied every time so that I get routed to something better…

pacha3000 · 2026-04-15T11:47:00 1776253620

Usually, Claude code with Opus checks by itself the right tools to check the docs, for Svelte for example. So what it gives me is usually flawless.

And right now, I have to remind it every time that the MCP exists, and even then it cannot manage to find a routing bug I have with Sveltekit.

Did a lot of Sveltekit with Opus in the past, and I didn't have to think about it, Opus always got it right easily. Until now

bicepjai · 2026-04-15T03:07:16 1776222436

Yes totally agree it’s regurgitating crazy expansive text like book author who needs to publish 10 books a day

comboy · 2026-04-12T18:47:39 1776019659

I have the same bias as the parent. I'd rather pay $50 one time than $9 a year even if I throw it away after 4 years.

But the main reason I wouldn't install it despite being happy customizing linux is that it's yet another black box I need to trust and that knows way too much. It's really insane how much you need to compromise your security on macos to have a decent developer experience.

cactusplant7374 · 2026-04-12T20:38:27 1776026307

It's not economical. Lifetime sales for a lifetime unlock would probably be under $100. So not worth it for the developer.

comboy · 2026-04-12T18:02:46 1776016966

To be fair, I can do 3 balls effortlessly, but I can't do 1 ball like it is in this description, I just have a lot of error correction, enough to do it pretty much indefinitely. But I cannot reliably throw it accurately to the other hand.

Our software stack is the opposite of that.

comboy · 2026-04-12T14:59:24 1776005964

You just convinced me to try it. Claude just copy pastes, does search and replace, zero abstractions and I'm the one that needs to think about the edge cases.

dns_snek · 2026-04-13T15:19:02 1776093542

You may think that's a good thing but it's not. Codex is great at coming up with solutions to problems that don't exist and failing to find solution to problems that do. In the end you have 300 new lines of code and nothing to show for it.

stavros · 2026-04-12T16:01:43 1776009703

That's why I have Claude write the code and Codex review.

bdangubic · 2026-04-12T16:05:43 1776009943

that’s like having oleg kiselyov’s code reviewed by my middle school daughter :)

stavros · 2026-04-12T16:08:21 1776010101

I didn't know your middle school daughter is a genius coder, congratulations!

comboy · 2026-04-12T14:57:15 1776005835

Any good reasonable alternatives? Gemini is like prodigious 3yo hopeless for my projects, anybody tested some opencode with kimi or something?

eurekin · 2026-04-12T15:14:15 1776006855

I'm adding two extra gpus to my local rig. Turns out qwen 3.5 122b is already enough to handle (finish with moderate guidance) non-planning parts of my tasks.

cdelsolar · 2026-04-13T02:26:28 1776047188

what kinda gpus are you using?

comboy · 2026-04-12T04:00:20 1775966420

Interesting that in Chinese, classical writing is associated with terseness. That definitely wouldn't work that way in other languages I know.

rjh29 · 2026-04-12T06:42:06 1775976126

Newspaper headlines?

comboy · 2026-04-06T20:00:47 1775505647

It's not a well-written book. It's an interesting book (more like a story).

FrustratedMonky · 2026-04-06T20:02:21 1775505741

Oh no, a book that tells a story.

comboy · 2026-04-04T18:07:56 1775326076

Hey, you seem to have similar view on this. I know ideas are cheap but hear me out:

You talk with agent A it only modifies this spec, you still chat and can say "make it prettier" but that agent only modifies the spec, the spec could also separate "explicit" from "inferred".

And of course agent B which builds only sees the spec.

User actually can care about diffs generated by agent A again, because nobody wants to verify diffs on agents generated code full of repetition and created by search and replace. I believe if somebody implements this right it will be the way things are done.

And of course with better models spec can be used to actually meaningfully improve the product.

Long story short what industry misses currently and what you seem to be understanding is that intent is sacred. It should be always stored, preferably verbatim and always with relevant context ("yes exactly" is obviously not enough). Current generation of LLMs can already handle all that. It would mean like 2-3x cost but seem so much worth it (and the cost on the long run could likely go below 1x given typical workflows and repetitions)

beshrkayali · 2026-04-04T19:52:32 1775332352

Right, the spec/build separation is exactly the idea and Ossature is already built that way on the build side.

I agree a dedicated layer for intent capture makes a lot of sense. I thought about that as well, I am just not fully convinced it has to be conversational (or free-form conversational). Writing a prompt to get the right spec change is still a skill in itself, and it feels like it'd just be shifting the problem upstream rather than actually solving it. A structured editing experience over specs feels like it'd be more tractable to me. But the explicit vs inferred distinction you mention is interesting and worth thinking through more.

comboy · 2026-04-04T20:01:45 1775332905

The spec manually crafted the user is ideal.

It's just that we're lazy. After being able to chat, I don't see people going back. You can't just paste some error into the specs, you can't paste it image and say it make it look more like this. Plus however well designed the spec, something like "actually make it always wait for the user feedback" can trigger changes in many places (even for the sake of removing contradictions).

ithkuil · 2026-04-04T22:15:05 1775340905

The spec can be wrong for many reasons:

1. You can write a spec that builds something that is not what you actually wanted

2. You can write spec that is incoherent with itself or with the external world

3. You can write a spec that doesn't have sufficient mechanical sympathy with the tooling you have and so it requires you to all spec out more and more of the surrounding tech than you practically can.

All of those issues can be addressed by iterating on the spec with the help of agents. It's just an engineering practice, one that we have to become better at understanding

beshrkayali · 2026-04-05T14:01:44 1775397704

All three of these are real. The audit pass in Ossature is meant to catch the first two before generation starts, it reads across all specs and flags underspecified behavior, missing details, and contradictions. You resolve those, update the specs, and re-audit until the plan is clean. It's not perfect but it shifts a lot of the discovery earlier in the process.

The third point is harder. You still need to know your tooling well enough to write a spec that works with it. That part hasn't gone away.

whattheheckheck · 2026-04-05T03:32:54 1775359974

And what is a spec other than a program in a programming language? How do you prove the code artifact matches the spec or state machine

comboy · 2026-04-05T08:51:47 1775379107

Program defines the exact computer instructions. Most of the time you don't care about that level of detail. You just have some intent and some constraints.

Say "I want HN client for mobile", "must notify me about comments", you see it and you add "should support dark mode". Can you see how that is much less than anything in any programming language?

visarga · 2026-04-05T05:26:12 1775366772

My own approach also has intent sitting at the top: intent justifies plan justifies code justifies tests. And the other way around, tests satisfy code, satisfy plan, satisfy intent. These threads bottom up and top down are validated by judge agents.

I also make individual tasks md files (task.md) which makes them capable of carrying intent, plan, but not just checkbox driven "- [ ]" gates, they get annotated with outcomes, and become a workbook after execution. The same task.md is seen twice by judge agents which run without extra context, the plan judge and the implementation judge.

I ran tests to see which component of my harness contributes the most and it came out that it is the judges. Apparently claude code can solve a task with or without a task file just as well, but the existence of this task file makes plans and work more auditable, and not just for bugs, but for intent follow.

Coming back to user intent, I have a post user message hook that writes user messages to a project scoped chat_log.md file, which means all user messages are preserved (user text << agent text, it is efficient), when we start a new task the chat log is checked to see if intent was properly captured. I also use it to recover context across sessions and remember what we did last.

Once every 10-20 tasks I run a retrospective task that inspects all task.md files since last retro and judges how the harness performs and project goes. This can detect things not apparent in task level work, for example when using multiple tasks to implement a more complex feature, or when a subsystem is touched by multiple tasks. I think reflection is the one place where the harness itself and how we use it can be refined.

    claude plugin marketplace add horiacristescu/claude-playbook-plugin

    source at https://github.com/horiacristescu/claude-playbook-plugin/tree/main

beshrkayali · 2026-04-05T10:51:58 1775386318

The hierarchy you describe (intent -> plan -> code -> tests) maps well to how Ossature works. The difference is that your approach builds scaffolding around Claude Code to recover structure that chat naturally loses, whereas Ossature takes chat out of the generation pipeline entirely. Specs are the source of truth before anything is generated, so there's no drift to compensate for, the audit and build plan handle that upfront.

The judge finding is interesting though. Right now verification during build for each task in Ossature is command-based, compile, tests, that kind of thing. A judge checking spec-to-code fidelity rather than (or maybe in addition to?) runtime correctness is worth thinking about.

visarga · 2026-04-05T14:39:50 1775399990

Yes, judges should not just look for bugs, they should also validate intent follow, but that can only happen when intent was preserved. I chose to save the user messages as a compromise, they are probably 10 or 100x smaller than full session. I think tasks themselves are one step lower than pure user intent. Anyway, if you didn't log user messages you can still recover them from session files if they have not been removed.

One interesting data point - I counted word count in my chat messages vs final code and they came out about 1:1, but in reality a programmer would type 10x the final code during development. From a different perspective I found I created 10x more projects since I relied on Claude and my harness than before. So it looks user intent is 10x more effective than manual coding now.

4b11b4 · 2026-04-05T00:01:37 1775347297

close

miki123211 · 2026-04-05T06:43:17 1775371397

See also: https://juxt.github.io/allium/ (not affiliated in any way, just an interesting project)

I'm using something similar-ish that I build for myself (much smaller, less interesting, not yet published and with prettier syntax). Something like:

    a->b # b must always be true if a is true
    a<->b # works both ways
    a=>b # when a happens, b must happen
    a->fail, a=> fail # a can never be true / can never happen
    a # a is always true

So you can write:

    Product.alcoholic? Product in Order.lineItems -> Order.customer.can_buy_alcohol?
    u1 = User(), u2=User(), u1 in u2.friends -> u2 in u1.friends
    new Source() => new Subscription(user=Source.owner, source=Source)
    Source.subscriptions.count>0 # delete otherwise

This is a much more compact way to write desired system properties than writing them out in English (or Allium), but helps you reason better about what you actually want.

beshrkayali · 2026-04-05T14:55:23 1775400923

Allium looks interesting, making behavioral intent explicit in a structured format rather than prose is very close to what I'm trying to do with Ossature actually.

Ossature uses two markdown formats, SMD[1] for describing behavior and AMD for structure (components, file paths, data models). AMDs[2] link back to their parent SMD so behavior and structure stay connected. Both are meant to be written, reviewed, and/or owned by humans, the LLM only reads the relevant parts during generation. One thing I am thinking about for the future is making the template structure for this customizable per project, because "spec" means different things to different teams/projects. Right now the format is fixed, but I am thinking about a schema-based way to declare which sections are required, their order, and basic content constraints, so teams can adapt the spec structure to how they think about software without having to learn a grammar language to do it (though maybe peg-based underneath anyway, not sure).

The formal approach you describe is probably more precise for expressing system properties. Would be interesting to see how practical it is to maintain it as a project grows.

1: https://docs.ossature.dev/specs/smd.html

2: https://docs.ossature.dev/specs/amd.html

4b11b4 · 2026-04-05T00:01:11 1775347271

yep but spec isn't the root

comboy · 2026-04-04T17:45:34 1775324734

GPUs can do graphics too?

aobdev · 2026-04-04T21:54:25 1775339665

I can’t tell if you’re making a joke about the current state of AI and GPUs or refuting the purpose of this driver