yes I always thought it's an easy thing. but I changed my mind recently when I had to deal with it.
A lot little things you need to think of. For example.
Client sends a request. The database is temporarily down. The server catches the exception and records the key status as FAILED. The client retries the request (as they should for a 500 error). The server sees the key exists with status FAILED and returns the error again-forever. Effectively "burned" the key on a transient error.
others like:
- you may have Namespace Collisions for users... (data leaks)
- when not using transactions only redis locking you have different set of problem
- the client needs to be implmented correctly. Like client sees timout and generates a new key, and exactly once processing is broken
- you may have race conditions with resource deletes
- using UUID vs keys build from object attributes (different set of issues)
I mean the list can get very long with little details..
None of those are really unsolvable problems. I think though the issue it seems everyone in this thread is having is you can't wrap a non idempotent function to make it idempotent no matter how hard you try you have to design your system around it.
You've some wrong assumption. One is that you are wrong about Poland / Greece wages. In 2026 Polish worker actually earns more than a Greece worker for the same role. Something like 25% more in Poland
Also Polands power grid is quite old and hasn't kept pace with demand. The grid operator last year had to reject thousands of requests for new connections
Not sure you can replace Claude with DeepSeek V4 that easily and have same results.
From what I see while building my own agentic system in Elixir, the problem is in training for your specific harness/contracts. Claude/GPT-style models seem to be trained around very specific contracts used by the harness like tool call formats, planning structure, patching, reading files, recovering from errors, and knowing when to stop.
In practice, you either need a very strong general model that can infer and follow those contracts (expensive), or a weaker model that has been fine-tuned / trained specifically on your own agent contracts. Otherwise, the whole thing becomes flaky very quickly. And I suspect with Deepseek V4 you may get last options.
Idk, my recent experience with Claude is that 4.7 barely knows how to use basic bash tools - how to properly check when programs have finished running, even basic stuff like how to run pytest suites and read the failed tests from the output without re-running the suite to specifically look for them. It's shockingly dumb for all of the tooling they've built into Claude Code (the useless Monitoring tool that blocks bash polling/sleeping that actually works, etc.).
I finally get fed up and started using GPT 5.5 the past 4 days and its a breath a fresh air despite feeling much more minimal. With Claude I had to write so many hooks to enforce behaviors it wouldn't remember and it lacked common sense on. GPT 5.5 does a much better job with things like knowing the AWS CDK CLI can hang on long CloudFormation deployments and it should actively check the deployment status using CloudFormation API rather than hanging for 30+ minutes - and it does this all without asking.
Maybe there's better tooling built into Codex too, but at least on the surface level it seems like how smart the model is makes a significant difference because Claude has more tools than I can count and still struggles to use "grep".
Edit: Like just now - I can't tell you how many times I day I see this sequence:
"Sorry, I'll run in parallel"
"Error editing file"
"File must be read first"
Repeat 10x for the 10 subagents Claude spawned and then it gets stuck until you press escape and it says "You rejected the parallel agents. Running directly now"
Its semi public, but I probably publish it soon once its less embarrassing.
Its an Elixir agent runtime with a thin Go TUI (bubble-tea). Im building it mostly to explore agent orchestration: planner/workers/finalizer flows, local file/code-edit tools, MCP tools, permission gates, run context, compaction, and eventually larger swarms. Erlang/Elixir is interesting for this because the actor/supervision model maps pretty naturally to lots of isolated agents and long-running supervised tasks.
As i said, The main lesson so far is that everything around contracts is much more fragile than I expected unless you use a very strong model. Planners return Markdown instead of JSON, tools get called with subtly wrong args, subagents repeat broken tool calls, finalizers lie about success after workers failed. And various permissions may be interpreted by agents in unexpexted way
I also started with too many modes too early instead of making agentic path extremely solid. That made me understand better why these codebases become huge: there are endless corner cases if you want a harness to work across models, providers, tools...
Stronger models hide a lot of harness weakness and weaker models expose. Making weaker models good enough requires a surprising amount of contract hardening. But that hardening tends to make the system better for stronger models too.
Also elixir http stack was causing a lot of problems (needed to use gun eventually)
Thank you for the writeup, integration with a TUI sounds great. Have you played with Jido (it's built on ReqLLM)? OpenAI also has an interesting Elixir orchestration project (surprisingly).
Thanks! I wasn't aware of Jido or ReqLLM before. ReqLLM looks especially promising, and I will likely use it. At the moment, I'm only integrated with OpenRouter.
There are certainly quirks, but identifying and conforming to those quirks is not that complex. E.g. I had Kimi "fix" my harness to work better with Kimi by pointing it at the (open source) kimi-cli + web search and telling it to figure out which differences might matter (it made compaction more aggressive, and worked around some known looping issues (by triggering compaction if it spotted looping tool calls). Largely addressing the quirks tend to harden the harness for other models too. But, yeah, it is more work to make the smaller models work with instead of against the harness.
I hope they collaborate with open source harness providers (Pi, Opencode) and train models with those. So next generations will have better integration and better overall quality.
Mini paper: that future isn’t the AI replacing humans. its about humans drowning in cheap artifacts.
New unit of measurement proposed: verification debt.
Also introduces: Recursive Garbage → model collapse
This appears to just be the output of LLMs itself? It credits GPT-5.2 and Gemini 3 exclusively as authors, has a public domain license (appropriate for AI output) and is only several paragraphs in length.
Which proves its own points! Absolutely genius! The cost asymmetry of producing and checking for garbage truly is becoming a problem in the recent years, with the advent of LLMs and generative AI in general.
I feel like this means that working in any group where individuals compete against each other results in an AI vs AI content generation competition, where the human is stuck verifying/reviewing.
Not a dig on your (very sensible) comment, but now I always do a double take when I see anyone effusively approving of someone else's ideas. AI turned me into a cynical bastard :(
Yes, I did it as a joke inspired by the PRISM release. But unexpectedly, it makes a good point. And the funny part for was that the paper lists only LLMs as authors.
Also, in a world where AI output is abundant, we humans become the scarce resource the "tools" in the system that provide some connectivity to reality (grounding) for LLM
Plot twist: humans become the new Proof of Work consensus mechanism. Instead of GPUs burning electricity to hash blocks, we burn our sanity verifying whether that Medium article was written by a person or a particularly confident LLM.
"Human Verification as a Service": finally, a lucrative career where the job description is literally "read garbage all day and decide if it's authentic garbage or synthetic garbage." LinkedIn influencers will pivot to calling themselves "Organic Intelligence Validators" and charge $500/hr to squint at emails and go "yeah, a human definitely wrote this passive-aggressive Slack message."
The irony writes itself: we built machines to free us from tedious work, and now our job is being the tedious work for the machines. Full circle. Poetic even. Future historians (assuming they're still human and not just Claude with a monocle) will mark this as the moment we achieved peak civilization: where the most valuable human skill became "can confidently say whether another human was involved."
Bullish on verification miners. Bearish on whatever remains of our collective attention span.
Human CAPTCHA exists to figure out whether your clients are human or not, so you can segment them and apply human pricing. Synthetics, of course, fall into different tiers. The cheaper ones.
The illussion of reasoning was terrible paper. 2^n-1 how it could fit in context size. I tried o3 and he gave me python script saying that inserting all moves is to much for context window. completely different results.
I think that their point was that the problem is easily solvable by humans without code, and shows the ability to chain steps together to achieve a goal.
Is it easily solvable by humans without code? I suspect if you asked a human to write down all the steps in order to solve a Tower of Hanoi with 12 disks they would also give up before completing it. Writing code that produces the correct output is the only realistic way to solve that purely due to the amount of output required.
Not sure why I am being downvoted. I am simply saying that we know there is a defined algorithm for solving Tower of Hanoi, and the source code for it is widely available. So, o3 producing the code as an answer, demonstrates even less intelligence, as it means it is either memorized or copied from the internet. I don't see how this point counters the paper at all.
I believe what they are trying to show in that paper, is that as the chain of operations approaches a large amount (their proxy for complexity), an LLM will inevitable fail. Humans don't have infinite context either, but they can still solve the Tower Of Hanoi without need to resort to either pen or paper, or coding.
I didn't downvote. T the problem with the paper is that it asks the model to output all moves for, say, 15 disks 2 ^ 15 - 1 = 32767
32767 moves in a single prompt. That's not testing reasoning. That’s testing whether the model can emit a huge structured output without error, under a context window limit.
The authors then treat failure to reproduce this entire sequence as evidence that the model can't reason. But that’s like saying a calculator is broken because its printer jammed halfway through printing all prime numbers under 10000.
For me o3 returning Python code isn’t a failure. It’s a smart shortcut. The failure is in the benchmark design. This benchmark just smells.
> That’s testing whether the model can emit a huge structured output without error, under a context window limit.
Agreed. But to be fair, 1) a relatively simple algorithm can do it, and more importantly 2) a lot of people are trying to build products around doing exactly this (emit large structured output without error).
I agree 15 disks is very difficult for a human, probably on a sheer stamina level; but I managed to do 8 in about 15 minutes by playing around (I.e. no practice). They do state that there is a massive drop in performance at this point.
Remember that with Towers of Hanoi every extra disk doubles the number of moves required. So 15 discs is 128x more moves. If you did eight in 15m then fifteen would take you 32 hours.
The same for me. I only knew how to assign variables, use for loops, if->then, and use poke command. And from this specific point I started thinking about myself as programmer event that the only thing I wrote with C64 basic was a ball moving on the screen. :)