
Claude Sonnet 4.5 is _way_ better than previous sonnets and as good as Opus for the coding and research tasks I do daily.

I rarely use Google search anymore, both because LLMs have that ability embedded and because the chatbots are good at looking through the swill that search results have become.



"it's better at coding" is not useful information, sorry. i'd love to hear tangible ways it's actually better. does it still succumb to coding itself in circles, taking multiple dependencies to accomplish the same task, applying inconsistent, outdated, or non-idiomatic patterns for your codebase? has compliance with claude.md files and the like actually improved? what is the round trip time like on these improvements - do you have to have a long conversation to arrive at a simple result? does it still talk itself into loops where it keeps solving and unsolving the same problems? when you ask it to work through a complex refactor, does it still just randomly give up somewhere in the middle and decide there's nothing left to do? does it still sometimes attempt to run processes that aren't self-terminating to monitor their output and hang for upwards of ten minutes?

my experience with claude and its ilk is that they are insanely impressive in greenfield projects and collapse quickly in legacy codebases. they can be a force multiplier in the hands of someone who actually knows what they're doing, i think, but even the evidence for that is pretty shaky: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...

the pitch that "if i describe the task perfectly in absolute detail it will accomplish it correctly 80% of the time" doesn't appeal to me as a particularly compelling justification for the level of investment we're seeing. actually writing the code is the simplest part of my job. if i've done all the thinking already, i can just write the code. there's very little need for me to then filter that through a computer with an overly-verbose description of what i want.

as for your search results issue: i don't entirely disagree that google is unusable, but having switched to kagi... again, i'm not sure the order of magnitude more complexity of searching via an LLM is justified? maybe i'm just old, but i like a list of documents presented without much editorializing. google has been a user-hostile product for a long time, and its quality collapse in recent years has been well-documented, but this seems a lot more like a story of "a tool we relied on has gotten measurably worse" and not a story of "this tool is meaningfully better at accomplishing the same task." i'll hand it to chatgpt/claude that they are about as effective as google was at directing me to the right thing circa a decade ago, when it was still a functional product - but that brings me back to the point that "man, this is a lot of investment and expense to arrive at the same result way more indirectly."


You asked for a single anecdote of llms getting better at daily tasks. I provided two. You dismissed them as not valuable _to you_.

It’s fine that your preferences are such that you don’t value the models or the improvements we’ve seen. It’s troubling that you use that to suggest there haven’t been improvements.


you didn't provide an anecdote. you just said "it's better." an anecdote would be "claude 4 failed in x way, and claude 4.5 succeeds consistently." "it is better" is a statement of fact with literally no support.

the entire thrust of my statement was "i only hear nonspecific, vague vibes that it's better with literally no information to support that concept" and you replied with two nonspecific, vague vibes. sorry i don't find that compelling.

"troubling" is a wild word to use in this scenario.


My one-shot rate for unattended prompts (triggered via GitHub Actions) has gone from about 2 in 3 to about 4 in 5 with my upgrade to 4.5 in the codebase I program in the most (one built largely pre-AI). These are highly biased toward tasks I expect AI to do well.

Since the upgrade I don’t use Opus at all for planning and design tasks. Anecdotally, Sonnet gives me the same level of performance on those: I can choose either model, and I no longer choose Opus. Sonnet is dramatically cheaper.

What’s troubling is that you made a big deal about not hearing any stories of improvements, as if your bar for said stories were very low, then immediately raised the bar when given them. It makes it hard to know what level of detail you actually want.


Requesting concrete examples isn’t a high bar. “Autopilot got better” tells me effectively nothing. “Autopilot can now handle stoplights” does.


“but i cannot point to a single person who seems to think that they are accomplishing real-world tasks with GPT5 better than they were with GPT4.”

I don’t use OpenAI stuff but I seem to think Claude is getting better for accomplishing the real world tasks I ask of it.


Specifics are worth talking about. I just felt it unfair to complain about raising the bar when you didn’t initially reach it.

In your own words: “You asked for a single anecdote of llms getting better at daily tasks.”

Which is already less specific than their request: “i'd love to hear tangible ways it's actually better.”

Saying “is getting better for accomplishing the real world tasks I ask of it” brings nothing to the discussion and is exactly the kind of vague statement they were initially complaining about. If LLMs are really improving, it’s not a major hurdle to say something meaningful about what specifically is getting better. /tilting at windmills


Here's one. I have a head-to-head "benchmark" that involves generating a React web app to display a Gantt chart, add tasks, layer overlapping tasks, read and write to files, etc. I compared implementing this application using Claude Code with Opus 4.1 / Sonnet 4 (scenario 1) against Claude Code 2 with Sonnet 4.5 (scenario 2).
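
To give a flavor of the task, the overlap layering boils down to something like this (a minimal TypeScript sketch; the Task fields and the first-fit approach are my illustration, not the benchmark's exact spec):

    // Assign each task to the first row whose last task ends before this one starts.
    // Field names and the greedy first-fit strategy are illustrative only.
    interface Task {
      name: string;
      start: number; // e.g. day offset from project start
      end: number;
    }

    function layerTasks(tasks: Task[]): Task[][] {
      const rows: Task[][] = [];
      const sorted = [...tasks].sort((a, b) => a.start - b.start);
      for (const task of sorted) {
        // Within a row, end times are non-decreasing, so checking the last task suffices.
        const row = rows.find(r => r[r.length - 1].end <= task.start);
        if (row) row.push(task);
        else rows.push([task]);
      }
      return rows;
    }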

The scenario 1 setup could complete the application but it had about 3 major and 3 minor implementation problems. Four of those were easily fixed by pointing them out, but two required significant back and forth with the model to resolve.

The scenario 2 setup completed the application and there were four minor issues, all of which were resolved with one or two corrective prompts.

Toy program, single run-through, common cases, stochastic parrot, yadda yadda, but the difference was noticeable in this direct comparison, and in other work I've done with the model I see a similar improvement.

Take from that what you will.


so to clarify your case, you are having it generate a new application, from scratch, and then benchmarking the quality of the output and how fast it got to the solution you were seeking?

i will concede that in this arena, there does seem to be meaningful improvement.

i said this in one of my comments in this thread, but the place i routinely see the most improvement in output from LLMs (and find they perform best) for code generation is in greenfield projects, particularly ones whose development starts with an agent. some assumptions that make me side-eye this result (not yours in particular, just any benchmark that follows this model):

- the codebase, as long as a single agent and model are working on it, is probably suited to that model's biases and thus implicitly easier for it to work in and "understand."

- the codebase is likely relatively contained and simple.

- the codebase probably doesn't cross domains or require specialized knowledge of services or APIs that aren't already well-documented on the internet or weren't built by the tool.

these are definitely assumptions, but i'm fairly confident in their accuracy.

one of the key issues i've had approaching these agents is that all my "start with an LLM and continue" projects actually start incredibly impressively! i was pretty astounded even on the first version of claude code - i had claude building a service, a web management interface AND a react native app, in concert, to deliver an entire end-to-end application. it was great! early iteration was fast, particularly in the "mess around and find out what happens" phase of development.

where it collapsed, however, was when the codebase got really big, and when i started getting very opinionated about outcomes. my claude.md file grew and grew and seemed to enforce less and less behavior, and claude became less and less likely to successfully refactor or reuse code. this also tracks with my general understanding of what an LLM may be good or bad at - it can only hold so much context, and only as textual examples, not very effectively as concepts or mental models. this ultimately limits its ability to reason about complex architecture. it rapidly became faster for me to just make the changes i envisioned, and then claude became more of a refactoring tool that i very narrowly applied when i was too lazy to do the text wrangling myself.

i do believe that for rapid prototyping - particularly the case of "product manager trying to experiment and figure out some UX" - these tools will likely be invaluable, if they can remain cost effective.

the idea that i can use this, regularly, in the world of "things i do in my day-to-day job" seems a lot more far-fetched, and i don't feel like the models have gotten meaningfully better at accomplishing those tasks. there's one notable exception: "explaining focused areas of the code", or acting as a turbo-charged grep that finds the area in the codebase where a given thing happens. i'd say that the roughly 60-70% success rate i see in those tasks is still a massive time savings to me because it focuses me on the right thing and my brain can fill in the rest of the gaps by reading the code. still, i wouldn't say its track record is phenomenal, nor do i feel like the progress has been particularly quick. it's been small, incremental improvements over a long period of time.

i don't doubt you've seen an improvement in this case (which is, as you admit, a benchmark), but it seems like LLMs keep performing better on benchmarks while that result isn't, as far as i can see, translating into improved performance on the day-to-day of building things or accomplishing real-world tasks. specifically in the case of GPT5, where this started, i have heard very little if any feedback on what it's better at that doesn't amount to "some things that i don't do." it is perfectly reasonable to respond that GPT5 is a unique flop and other model iterations aren't as bad - i accept this is one specific product from one specific company - but i personally don't feel like i'm seeing meaningful evidence to support that assertion.


Thank you for the thoughtful response. I really appreciate the willingness to discuss what you've seen in your experience. I think your observations are pretty much exactly correct in terms of where agents do best. I'd qualify them in just a couple of areas:

1. In my experience, Claude Code (I've used several other models and tools, but CC performs the best for me so that's my go-to) can do well with APIs and services that are proprietary as long as there's some sort of documentation for them it can get to (internal, Swagger, etc.), and you ensure that the model has that documentation prominently in context.

2. CC can also do well with brownfield development, but the scope _has_ to be constrained, either to a small standalone program or a defined slice of a larger application where you can draw real boundaries.

The best illustration I've seen of this is in a project that is going through final testing prior to release. The original "application" (I use the term loosely) was a C# DLL used to generate data-driven prescription monitoring program reporting.

It's not ultra-complicated, but there's a two-step process where you retrieve the report configuration data, then use that data to drive retrieval and assembly of the data elements needed for the final report. Formatting can differ based on state, on the data available (reports with no data need special formatting), and on whether you're outputting in the context of transmission or for user review.
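
In rough pseudocode (TypeScript here purely for compactness; the real module is C#, and every name below is invented), the formatting step looks something like this:

    // Step 2 of the flow: given the per-state config fetched in step 1, assemble
    // and format the report. Names and format details are invented for illustration.
    interface ReportConfig {
      state: string;               // e.g. "OH"
      requiredElements: string[];  // data elements this state's report expects
    }

    type Purpose = "transmission" | "review";

    function formatReport(
      config: ReportConfig,
      data: Record<string, string>,
      purpose: Purpose
    ): string {
      const present = config.requiredElements.filter(n => n in data);
      if (present.length === 0) {
        // Reports with no data get their own formatting path.
        return purpose === "transmission"
          ? `${config.state}|NO-DATA`
          : `No data to report for ${config.state}.`;
      }
      const body = config.requiredElements
        .map(n => `${n}=${data[n] ?? ""}`)
        .join("|");
      return purpose === "transmission"
        ? `${config.state}|${body}`
        : `${config.state} report (review):\n${body}`;
    }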

The original DLL was written in a very simplistic way, with no testing and no way to exercise the program without invoking it from its link points embedded in our main application. Fixing bugs and testing those fixes were both very painful, as for a production release we had to test all 50 states across a range of different data conditions, and do so by automating the parent application.

I used Claude Code to refactor this module, add DI and testing, and add a CLI that could easily exercise the logic in all of the supported configurations. It took probably $50 worth of tokens (this was before I had a Max account, so it was full price) over the course of a few hours, most of which I spent in other meetings.
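
The refactored shape, again sketched in TypeScript with invented names rather than the real code, is roughly a data-access interface injected into the report builder, plus a small CLI-style loop that can exercise every state without the parent application:

    // The injected data source is the seam that makes the CLI and the tests possible.
    // Everything here is an invented illustration of the structure, not the real code.
    interface ReportDataSource {
      fetchConfig(state: string): Promise<{ state: string; requiredElements: string[] }>;
      fetchElements(names: string[]): Promise<Record<string, string>>;
    }

    async function buildReport(db: ReportDataSource, state: string): Promise<string> {
      const config = await db.fetchConfig(state);                   // step 1: configuration
      const data = await db.fetchElements(config.requiredElements); // step 2: data assembly
      const body = config.requiredElements
        .map(n => `${n}=${data[n] ?? ""}`)
        .join("|");
      return `${config.state}|${body}`;
    }

    // A canned source lets a test suite or CLI run every state and data condition.
    const canned: ReportDataSource = {
      fetchConfig: async state => ({ state, requiredElements: ["patient", "prescriber", "drug"] }),
      fetchElements: async names =>
        Object.fromEntries(names.map(n => [n, `fake-${n}`] as const)),
    };

    (async () => {
      for (const state of ["OH", "TX", "CA"]) { // in practice, all 50
        console.log(await buildReport(canned, state));
      }
    })();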

The final result did exhibit some classic LLM problems -- some of the tests were overspecific, it restructured without always fully cleaning up the existing functions, and it messed up a couple of paths through the business logic that I needed to debug and fix. But it easily saved me a couple days of wrestling with it myself, as I'm not super strong with C#. Our development teams are fully committed, and if I hadn't used CC for this it wouldn't have gotten done at all. Being able to run this on the side and get a 90% result I could then take to the finish line has real value for us, as the improved testing alone will see an immediate payback with future releases.

This isn't a huge application by any means, but it's one example of where I've seen real value that is hitting production, and it seems representative of a decently large category of line-of-business modules. I don't think there's any reason this wouldn't replicate on similarly scoped products.


The biggest issue with Sonnet 4.5 is that it's chatty as fuuuck. It just won't shut up; it keeps producing massive markdown "reports" and "summaries" of every single minor change, wasting precious context.

With Sonnet 4 I rarely ran out of quota unexpectedly, but 4.5 chews through whatever little Anthropic gives us weekly.



