> Commented [9]:
> This is fundamentally untrue. An LLM can certainly spit out thousands of lines of code, but "opens the app itself" is definitely up for question, as is "clicks the buttons", considering how unreliable basically every computer-use LLM is.
> "It iterates like a developer would, fixing and refining until it's satisfied" is just a bald-faced lie. What're you talking about? This is not what these models do, nor what Codex or Claude Code does.
> This is a clever and sinister way to write, because it abuses the soft edges of the truth: while coding LLMs can test products, or scan/fix some bugs, this suggests they A) do this autonomously without human input, B) do this correctly every time (or ever!), C) follow some sort of internal "standard", and D) that all of this just happens without any human involvement.
---
Ummm. Yeah, no. This actually works. No idea why bozos who obviously don't use the tools write about how the tools don't do this or that. Yes they do. I know because I use them. Today's best agentic harnesses can absolutely do all of the above. Not perfect by any means, not every time, but enough to be useful to me. As some people say "stop larping". If you don't know how a tool works, or what it can do, why the hell would you comment on something so authoritatively? This is very bad.
(I'll make a note that the original article was written by a 100% certified grifter. I happened to be online on LocalLlama when that whole debacle happened. He's a quack. No doubt about it. But from the quote I just pasted, so is the commenter. Quacks commenting on quacks. This is so futile)
> With a chess engine, you could ask any practitioner in the 90's what it would take to achieve "Stage 4" and they could estimate it quite accurately as a function of FLOPs and memory bandwidth.
And the same practitioners said, right after Deep Blue, that Go is NEVER gonna happen. Too large. The search space is just not computable. We'll never do it. And yeeeet...
No, that's more capabilities than sandboxing. You want fine-grained capabilities such that every "thread" the model runs gets the minimum access required to do its job.
The problem is that it seems (at least for now) a very hard problem, even for very constrained workflows. It seems even harder for "open-ended" / dynamic workflows. This gets more complicated the more you think about it, and there's a very small (maybe 0 in some cases) intersection of "things it can do safely" and "things I need it to do".
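To make the capabilities idea concrete, here's a purely illustrative sketch of per-thread minimum access. Everything here (`Capability`, `TaskContext`, the tool strings) is made up for illustration, not any real agent framework's API:

```python
# Hypothetical sketch: each agent "thread" carries only the capabilities
# its task needs; every tool call is checked against that set.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Capability:
    tool: str   # e.g. "fs.read", "net.fetch" (illustrative names)
    scope: str  # e.g. a path prefix or host allowlist


@dataclass
class TaskContext:
    caps: frozenset = field(default_factory=frozenset)

    def check(self, tool: str, target: str) -> bool:
        # Allowed only if some capability covers both the tool and the target.
        return any(c.tool == tool and target.startswith(c.scope) for c in self.caps)


# A "summarize the repo docs" thread gets read access to ./docs and nothing else:
ctx = TaskContext(frozenset({Capability("fs.read", "./docs")}))
assert ctx.check("fs.read", "./docs/intro.md")
assert not ctx.check("fs.write", "./docs/intro.md")   # no write capability
assert not ctx.check("fs.read", "./secrets/key.pem")  # out of scope
```

Even in this toy form you can see the problem from above: the intersection of "scopes I can safely grant" and "scopes the task actually needs" shrinks fast once the workflow is open-ended.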
Nah, that's just reddit. At this point it's safer to take anything that's popular on reddit as either outright wrong or so heavily out of context that it's not relevant.
Oh, sure, I learned a long time ago that Reddit is a very reliable anti-indicator. But given that HN isn't nearly as bad (but there are moments), it's still strange that people would just repeat something about someone else that they could disprove for themselves in 30 seconds.
For deep dives into AI stuff, Google DeepMind's podcast with Hannah Fry is very good (but obviously limited to Goog stuff). I also like Lex for his tech/AI podcasts. Much better interviewer IMO; Dwarkesh talks way too much and injects too many of his own "insights" for my taste. I'm listening to a podcast to hear what the guests have to say, not the host.
For more lightweight, news-ish podcasts that I listen to while walking/driving/riding the train, in no particular order: AI & I (up-to-date trends, relevant guests), The AI Daily Brief (formerly The AI Breakdown; this one is more to keep in touch with what's been released in the past month), and whatever other random stuff YT pops up for me from listening to these 4 regularly.
I can't think of an interviewer who interjects their viewpoint more, and tries harder to get his guests to acknowledge/agree with his typically shallow analysis, than Lex. The only redeeming quality of his podcast is the guests he gets. I don't think Dwarkesh is great, but he's leagues better.
> Trouble is some benchmarks only measure horse power.
IMO it's the other way around. Benchmarks only measure applied horsepower on a set plane, with no friction, and your elephant is a point sphere. Goog's models have always punched above what the benchmarks said in real-world use at high context. They don't focus on "agentic this" or "specialised that", but the raw models, with good guidance, are workhorses. I don't know any other model where you can throw lots of docs at it and get proper context following and data extraction from wherever the data sits to wherever you need it.
Yeah, these are made possible largely by better use at high context lengths. You also need a step that gathers all the Ns and selects the best ideas / parts and compiles the final output. Goog have been SotA at useful long context for a while now (since 2.5 I'd say). Many others have come with "1M context", but their usefulness after 100k-200k is iffy.
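The "gather all the Ns, select the best parts, compile the final output" step can be sketched roughly like this. It's a toy sketch: `generate` and `judge` are hypothetical stand-ins for model calls, not a real API, and a real pipeline would merge pieces of several candidates rather than just pick one:

```python
# Illustrative best-of-n selection: sample n candidates, score each,
# return the top one. The final "compile" stage described above would
# sit where max() is, merging the best parts instead of picking a winner.
from typing import Callable


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              judge: Callable[[str, str], float],
              n: int = 8) -> str:
    samples = [generate(prompt) for _ in range(n)]
    return max(samples, key=lambda s: judge(prompt, s))


# Toy demo: "generation" cycles canned drafts, the "judge" prefers longer ones.
drafts = iter(["ok", "better answer", "meh"])
pick = best_of_n("q", lambda p: next(drafts), lambda p, s: len(s), n=3)
# pick == "better answer"
```

Note that the judge/compile stage is exactly where long-context quality matters: it has to read all N candidates at once.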
What's even more interesting than maj@n or best-of-n is pass@n. For a lot of applications you can frame the question and search space such that pass@n is your success rate. Think security exploit finding. Or optimisation problems with quick checks (better algos, kernels, infra routing, etc.). It doesn't matter how good your pass@1 or avg@n is; all you care about is that you find more as you spend more time. Literally throwing money at the problem.
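Under the simplifying assumption of independent attempts with a fixed per-attempt success rate p, pass@n is just 1 - (1 - p)^n, which is why "throwing money at the problem" works whenever a single hit is all you need:

```python
# pass@n under the (idealised) assumption of independent attempts
# with constant per-attempt success rate p.
def pass_at_n(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** n


# Even a weak pass@1 compounds quickly:
for n in (1, 10, 100):
    print(n, round(pass_at_n(0.05, n), 3))
# -> 1 0.05
# -> 10 0.401
# -> 100 0.994
```

Real attempts aren't independent (models repeat their favourite wrong answers), so this is an upper bound on how fast pass@n grows, but the qualitative point holds.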
I keep seeing this and I don't think I agree. We outsource thinking every day. Companies do this every day. I don't study the weather myself; I check an app and bring an umbrella if it says it's gonna rain. My team trusts each other to do some thinking in their area, and present bits sideways/upwards. We delegate lots of things. We collaborate on lots of things.
What needs to be clear is who owns what. I never send something I wouldn't stand by. Not in a correctness sense (I have been, am, and likely will be wrong about any number of things) but more in a "yeah, that is my output, and I stand by it now" kind of way. Tomorrow it might change.
Also remember that Google quip: "it's hard to edit an empty file". We have always used tools to help us. From scripts saved here and there, to shortcuts, to macros, IDE setups, extensions and so on. We "think once" and then try not to re-think every little detail. We'd go nowhere otherwise.
IMO it helps to take a scenario and then imagine every task is being delegated to a randomized impoverished human remote contractor, with the same (lack of) oversight and involvement by the user.
There's a strong overlap between the things which are bad (unwise, reckless, unethical, fraudulent, etc.) in both cases.
> We outsource thinking everyday. [...] What needs to be clear is who owns what.
Also once you have clarity, there's another layer where some owning/approval/delegation is not permissible.
For example, a student ordering "make me a 3 page report on the Renaissance." Whether the order went to another human or an LLM, it is still cheating, and that wouldn't change even if they carefully reviewed it and gave it a stamp of careful approval.
Right. I don’t think I disagree with anything you’ve said here.
However, if I had an idea and just fobbed the idea off to an LLM who fleshed it out and posted it to my blog, would you want to read the result? Do you want to argue against that idea if I never even put any thought into it and maybe don’t even care?
I’m like you in this regard. If I used an LLM to write something I still “own” the publishing of that thing. However, not everyone is like this.
Managers and business owners outsource thinking to their employees and they deserve huge paychecks for it. Entrepreneurs do it and we celebrate them. But an invention that allows the peon to delegate to an automaton? That’s where I draw the line.
> The result is OK. It has all the features I asked for, and includes document sharing, collaborative editing in real time, support for fonts and line spacing, etc. etc. I could not have paid a developer $170 and got this. The problem, of course, is that, while abstractly impressive, this is completely useless
Well, what would you expect from a few hours of running in a loop with these constraints?
> This project exists to build a document editor from the ground up. Violating these constraints defeats the entire purpose.
> FORBIDDEN dependencies (do NOT install or use these):
> Rich text editor frameworks: ProseMirror, Slate, Quill, TipTap, Draft.js, CKEditor, TinyMCE, Lexical, or any similar library
> CRDT/OT libraries: Yjs, Automerge, ShareDB, OT.js, or any similar library
> Full CSS frameworks: Bootstrap, Tailwind, Material UI (small utility libs for specific needs are OK)
> ORMs: Prisma, TypeORM, Sequelize (use raw SQL or a thin query builder)
I can't help but wonder what you thought you would achieve, and how getting "mostly what you asked for" is still disappointing to you.
> there is no taste being applied.
There are 0 lines in AGENT_PROMPT.md about "taste". You have instructed something/someone more on how to build than on what to build.
Your goals are (from a quick skim):
- The goal of this project is ultimately to generate a working alternative to Google Docs with the same functionality.
- You are an autonomous software engineer building AltDocs, a from-scratch alternative to Google Docs.
I see a FEATURES.md file, but it's not clear whether this is from you or expanded by the model. It seems pretty slim.
All in all, I don't get the "disappointment". It seems, from your blog post, that the "model" did most of the things you asked for. The disappointment might come from what you asked for, more than from the "model" being bad... To paraphrase a line from a sitcom: "Damn, Andrew, I can't control the weather!" :)