
I actually hope to find better answers here than on the Cursor forum, where people seem to be basically saying "it's your fault" instead of answering the actual question, which is about trust, process, and real-world use of agents.

So far it's just reinforcing my feeling that none of this is actually used at scale. We use AI as relatively dumb companions, let them go wilder on side projects which have looser constraints, and agents are pure hype (or for very niche use cases).



The reason OP is getting terrible results is that he's using Cursor, and Cursor is designed to ruthlessly prune context to curtail costs.

Unlike the model providers, Cursor has to pay the retail price for LLM usage. They're fighting an ugly marginal price war. If you're paying more for inference than your competitors, you have to choose to either 1) deliver performance equal to your competitors at a loss or 2) economize by feeding smaller contexts to the model providers.

Cursor is not transparent about how it handles context. From my experience, it's clear that they prune conversations so aggressively that it's not uncommon for Cursor to have to re-read the same file multiple times in the same conversation just to know what's going on.

My advice to anyone using Cursor is to just stop wasting your time. The code it generates creates so much debt. I've moved on to Codex and Claude and I couldn't be happier.


What deal is GitHub Copilot getting then? They also offer all SOTA models. Or is the performance of those models also worse there?


GitHub Copilot is likely running models at or close to cost, given that Azure serves all those models. I haven't used Copilot in several months so I can't speak to its performance. My perception back then was that its underperformance relative to peers was because Microsoft was relatively late to the agentic coding game.


> Or is the performance of those models also worse there?

The context and output limits are heavily shrunk down on GitHub Copilot[0]. That's why, for example, Sonnet 4.5 performs noticeably worse under Copilot than in Claude Code.

[0] https://models.dev/?search=sonnet+4.5


Exactly, the actual business value is way smaller than people think, and it's honestly frustrating. Yes, they can write boilerplate; yes, they sometimes do better than humans in well-understood areas. But it's negligible considering all the huge issues that come with them: Big Tech vendor lock-in, data poisoning, unverifiable information, death of authenticity, death of creativity, ignorance of LLM evangelists, power hungriness in a time where humanity should look at how to decrease emissions, theft of original human work, theft of data that Big Tech has been getting away with for way too long. It's puzzling to me how people actually think this is a net benefit to humanity.


The best I can tell you (from working with LLMs) is that... it's complicated.

There are moments where spending 10 min on a good prompt saves me 2 hrs of typing, and it finishes that in the time it takes me to go make myself a cup of coffee (~10 min). Those are the good moments.

Then there are moments where it's more like 30 min savings for 10 min of prompting. Those are still pretty good.

Then there are plenty of moments where spending 10 mins on a prompt saves me about 15 mins of work. But I have to wait 5 mins for the result, so it ends up being a wash, except with the downside that I didn't really write it myself, so the actual details of the solution aren't fully internalized.

There are also plenty of moments where the result at first glance looks good or even great, but once I start reviewing and fixing things it still ends up being a wash.

I actually find it quite difficult to judge result quality, because at first glance it always looks pretty decent. Sometimes once you start reviewing, that turns out to be true; other times I'm like "well, it needs some tweaking" and subsequently spend an hour tweaking.

Now I think the problem is that the response is akin to gambling / conditioning in a sense. Every prompt has a smallish chance to trigger a great result, and since the average result is still about 25% faster (my gut feeling based on what I've 'written' the last few months working with Claude Code) it's just very tempting to pull that slot machine lever even in tasks that I know I will most likely type faster than I can prompt.

I did find a place where (to me, at least) it almost certainly adds value: I find it difficult to think about code during meetings (I really need my attention in the meetings I do) but I can send a few quick prompts for small stuff during meetings and don't really have to context switch. This alone is a decent productivity booster. Refactorings that would've been a 'maybe, one day' can now just be triggered. Best case I spend 10 minutes reviewing and accept it. Worst case I just throw it away.


Most of the issues you listed are moral, not technical. Especially "power hungriness in a time where humanity should look at how to decrease emissions": this may be what you think humanity should do, but that is just that, what you think.

I derive a lot of business value from them, many of my colleagues do too. Many programmers that were good at writing code by hand are having lots of success with them, for example Thorsten Ball, Simon Willison, Mitchell Hashimoto. A recent example from Mitchell Hashimoto: https://mitchellh.com/writing/non-trivial-vibing.

> It's puzzling to me how people actually think this is a net benefit to humanity.

I've used them personally to quickly spin up a microblog where I could post my travel pictures and thoughts. The idea of making the interface like Twitter (since that's what I use and know) was mine; not wanting to expose my family and friends to any specific predatory platform like Twitter, Instagram, etc. was also mine; Supabase as the backend was from a colleague (helped a lot!); the code was all Claude. The result is that they were able to enjoy my website, including my grandparents, who just had to paste a URL to get to the site. I like to think of it as a perhaps very small but net benefit for a very small part of humanity.


Is it a moral judgement to say that when the stove is on fire, we shouldn't be pouring more grease on it?

Is it a moral judgement to say that you shouldn't pick up a bear cub with its mother nearby?

If neither of these are moral judgements, then why would it be a moral judgement to say that humanity should be seeking to reduce its emissions? Just because you personally don't like it, and want to keep doing whatever you like?


Pouring grease on a fire will make it worse. Picking up a bear cub when the mother is nearby will increase (by how much I don't know) the risk of getting attacked by a bear. Both of those sentences sound to me like genuine descriptions of the reality we live in. "Climate change is real and caused by human emissions" is another one of those, though a bit less precise, as "climate change" is itself less precise. Saying we should or shouldn't do something is something different.

Also, you can increase power capacity by a lot while reducing emissions, with stuff like solar panels or nuclear power.


So moral issues are not relevant? Typical tech enthusiast mindset unfortunately...


If climate change doesn't matter, why the hell should anyone care about your vibe-coded personal Twitter clone?


What if I'm vibe engineering a solution to global warming? Does it cancel itself out?


no


> I've used them personally to quickly spin up a microblog where I could post my travel pictures and thoughts.

Sorry, but this is a perfect example of the typical demographic that currently boosts the usage of these non-tools: trivial, almost unnecessary use cases, of service to no one but yourself (maybe friends and family too). You could also have spun up a simple microblog on one of the many blogging platforms, with trivial UI complexity, low cost, and a much smaller environmental impact.


What specific improvements are you hoping for? Without them (in the original forum post) giving concrete examples, prompts, or methodology – just stating "I write good prompts" – it's hard to evaluate or even help them.

They came in primed against agentic workflows. That is fine. But they also came in without providing anything that might have given other people the chance to show that their initial assumptions were flawed.

I've been working with agents daily for several months. Still learning what fails and what works reliably.

Key insights from my experience:
- You need a framework (like agent-os or similar) to orchestrate agents effectively
- Balance between guidance and autonomy matters
- Planning is crucial, especially for legacy codebases

Recent example: Hit a wall with a legacy system where I kept maxing out the context window with essential background info. After compaction, the agent would lose critical knowledge and repeat previous mistakes.

Solution that worked:
- Structured the problem properly
- Documented each learning/discovery systematically
- Created specialized sub-agents for specific tasks (keeps context windows manageable)

Only then could the agent actually help navigate that mess of legacy code.
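
To make the "documented each learning" part concrete: this is a rough sketch of the shape it takes for me (a toy illustration, not my exact setup; the file path, field names, and helpers are invented). Discoveries go into a small JSON log that gets pasted into every fresh sub-agent prompt.

    // Toy sketch: persist discoveries so a context reset or compaction
    // doesn't erase hard-won knowledge about the legacy system.
    import { existsSync, readFileSync, writeFileSync } from "node:fs";

    interface Learning {
      topic: string;    // e.g. "billing cron", "legacy auth flow"
      finding: string;  // the fact that was established
      evidence: string; // file/line or command output backing it up
    }

    const LOG_PATH = "docs/agent-learnings.json"; // hypothetical location

    function loadLearnings(): Learning[] {
      return existsSync(LOG_PATH)
        ? (JSON.parse(readFileSync(LOG_PATH, "utf8")) as Learning[])
        : [];
    }

    export function recordLearning(entry: Learning): void {
      const all = loadLearnings();
      all.push(entry);
      writeFileSync(LOG_PATH, JSON.stringify(all, null, 2));
    }

    // Prepended to each sub-agent's task prompt so it starts from known
    // facts instead of rediscovering (or contradicting) them.
    export function learningsPreamble(): string {
      return loadLearnings()
        .map((l) => `- [${l.topic}] ${l.finding} (see: ${l.evidence})`)
        .join("\n");
    }

The exact format doesn't matter much; what matters is that every new conversation or sub-agent starts from the log instead of burning context rediscovering the same facts.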


So at what point are you doing more work on the agent than working on the code directly? And what are you losing in the process of shifting from code author to LLM manager?

My experience is that once I switch to this mode, when something blows up I'm basically stuck with a bunch of code that I only sort of know, even though I reviewed it. I just don't have the same insight as I would if I had written the code, no ownership, even if it was committed in my name. Any misconceptions I had about how things work, I will still have, because I never had to work through the solution, even if I ended up with a final working one.


My thoughts exactly. They generate a pile of sludge if left to their own devices. Sludge that will take you incredible amounts of time to understand.

The amount of tech debt these things accumulate unchecked is massive.

In some places it doesn’t matter, in some places it matters a lot.


With all that additional work, would you say you have been more cost-effective than just doing these tasks yourself with an AI companion?


sounds like a huge waste of time


YMMV


I've had agents find several production bugs that slipped past me (as I couldn't dedicate enough time to chase down relatively obscure and isolated bug reports).

Of course there are many more bugs they'll currently not find, but when this strategy costs next to nothing (compared to a SWE spending an hour spelunking) and still works sometimes, the trade-off looks pretty good to me.


From a cursory (heh) reading of the Cursor forum, it is clear that the participants in the chat are treating AI like the Adeptus Mechanicus treats the Omnissiah... the machine spirits aren't cooperating with them, though.


That kind of comment would be more meaningful and get better responses if it came with a practical example: some reasonable real-world problem and how the author tried to solve it using an LLM but failed.


I would love to give a quick primer on how I'm using agents:

I'll usually have a main line of work I'm focused on. I'll describe the current behavior and the desired changes (need to plumb this var through these functions to use here). "GPT-5 thinking high" is pretty precise, so if you clearly indicate what you want, it usually does exactly what I request. (If this isn't happening for you, make sure you don't have other context in your codebase that confuses it.)

While it's working, I'll often prompt another line of work, usually explicitly requesting that it not make changes, but without switching to ask mode. It will do most of the work to figure out what changes would need to be made, and it summarizes them helpfully, which allows me to correct it if it's wrong. You can repeat this for as long as the existing models are busy.

Types of prompts that work well:

Questions: "what's the function or component for doing X?", "where else do we use this pattern?"

Bug prompts: anything that would take you <2h to fix should be promptable in a single prompt. Note that you'll get slightly different responses even with the same prompt, so if at first you don't succeed, you might explain what went wrong, ask it to improve your prompt, and then try again from scratch. (People don't reset context often enough.)

Larger-scale architecture / plans: for these I would recommend switching to plan mode and spending some time going back and forth. Often it will get confused, so take your progress (ideally as an .md file) and bring it to a new conversation to keep iterating.

You can even have it suggest Jira tickets, etc.

Understanding different models is important: Claude 4.5 (and most Claude models since 3.5) really want to do stuff, and if you leave them unchecked they'll usually do way more than you asked. If they perceive themselves to be blocked on a failing test, they might delete it or change it to be useless. That said, they're really extraordinary models when you want a quick prototype fleshed out and you don't want to make all of the decisions. GPT-5 thinking high is my personal favorite (Codex 5 thinking high is also very good in the Codex plugin in VS Code). Create new context often.


Best things about Claude: it will often figure out a good feedback loop where it can build + test and get quick feedback about whether the thing is working. This works best in Claude Code but can be effective in Cursor too.

Best things about GPT: the precision. I don't even care that they're slow; it just lets me queue up more work.

Best things about Codex: it's a little smarter at handling very hard or very easy tasks. It might spend less time on easy tasks and even more time on hard ones.

Best things about Grok: speed, plus LeetCode-style ability.

All of them tend to benefit from a feedback loop if you can give them great tests or good static analysis, etc., but they will cheat if you let them (e.g., falling back to "any" in TypeScript).
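
To make that concrete, here's a toy illustration (my own made-up example, not from any real codebase) of what the cheat looks like and why happy-path tests won't catch it:

    // Toy illustration of the classic TypeScript cheat.
    interface Invoice {
      id: string;
      total: number;
    }

    // Honest version: the shape is actually checked at the boundary.
    export function parseInvoice(raw: unknown): Invoice {
      const c = raw as { id?: unknown; total?: unknown } | null;
      if (typeof c?.id === "string" && typeof c?.total === "number") {
        return { id: c.id, total: c.total };
      }
      throw new Error("invalid invoice payload");
    }

    // The cheat: compiles fine, happy-path tests still pass, and the type
    // system is silently switched off for everything downstream.
    export function parseInvoiceCheat(raw: any): Invoice {
      return raw;
    }

Turning on strict mode in tsconfig and adding a lint rule like @typescript-eslint/no-explicit-any to CI closes off that escape hatch, which gives the model a feedback loop it can't quietly game.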


I've used this analogy many times:

Codex + GPT-5-high is an offshore consultant. You give it the spec and it'll do the work and come back with something.

Claude is built like a pair programmer: it chats while it works, and you can easily interrupt it without breaking the flow.

Codex is clearly more thorough, it's _excellent_ at picking apart Sonnet 4.5 code and finding the subtle gotchas it leaves behind when it just plows to a result.

And like you said, Claude is results first. It'll get where you want it to go, even if it has to mock the whole application to get the tests to pass. =)


You are spot on and summed it up perfectly.

I am using language models as much as anyone, and they work, but they don't work the way the marketing and popular delusion behind them pretend they do.

The best book on LLMs and agentic AI is Extraordinary Popular Delusions and the Madness of Crowds by Charles Mackay.



