syntex · 2026-05-18T20:06:52 1779134812

Be careful when transposing game-learned behaviors into real life.

syntex · 2026-05-10T11:14:29 1778411669

Yes, I always thought it was an easy thing, but I changed my mind recently when I had to deal with it.

There are a lot of little things you need to think about. For example:

Client sends a request. The database is temporarily down. The server catches the exception and records the key status as FAILED. The client retries the request (as it should for a 500 error). The server sees the key exists with status FAILED and returns the error again, forever. A transient error has effectively "burned" the key.

Others, like:

- you may have namespace collisions between users (data leaks)
- when you use only Redis locking instead of transactions, you have a different set of problems
- the client needs to be implemented correctly; e.g. the client sees a timeout and generates a new key, and exactly-once processing is broken
- you may have race conditions with resource deletes
- using UUIDs vs keys built from object attributes (a different set of issues)

I mean, the list of little details can get very long.
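A minimal sketch of one way to avoid the "burned key" case described above. Everything here is hypothetical and for illustration only: the in-memory `IdempotencyStore`, the `TransientError` marker, and the `handle` method are made-up names, and concurrent requests with the same key are ignored. The idea is that only terminal outcomes are recorded, so a retry after a transient failure runs the operation again instead of replaying the stored error.

```python
class TransientError(Exception):
    """Hypothetical marker for retryable failures (e.g. database temporarily down)."""


class IdempotencyStore:
    """Hypothetical in-memory store; a real one would live in a database."""

    def __init__(self):
        self._results = {}

    def handle(self, key, operation):
        # Replay the stored outcome if this key already reached a terminal state.
        if key in self._results:
            return self._results[key]
        try:
            result = ("COMPLETED", operation())
        except TransientError:
            # Record nothing: the key stays usable for the client's retry.
            raise
        except Exception as exc:
            # A permanent failure is a terminal outcome and is safe to replay.
            result = ("FAILED", str(exc))
        self._results[key] = result
        return result
```

With this shape, a first call that raises `TransientError` leaves no trace, and the retry with the same key executes the operation once and stores its result for later replays.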

asdfaoeu · 2026-05-10T12:05:15 1778414715

None of those are really unsolvable problems. I think the issue everyone in this thread seems to be having, though, is that you can't wrap a non-idempotent function to make it idempotent, no matter how hard you try; you have to design your system around it.

dataflow · 2026-05-10T11:48:45 1778413725

> The database is temporarily down. The server catches the exception and records the key status as FAILED.

This is the bug regardless of idempotency, right? It should be recording something like RESOURCE_UNAVAILABLE.

syntex · 2026-05-08T14:14:41 1778249681

You've made some wrong assumptions. One is about Poland / Greece wages: in 2026 a Polish worker actually earns more than a Greek worker for the same role, something like 25% more.

Also, Poland's power grid is quite old and hasn't kept pace with demand. Last year the grid operator had to reject thousands of requests for new connections.

syntex · 2026-05-04T08:39:43 1777883983

Not sure you can replace Claude with DeepSeek V4 that easily and get the same results.

From what I see while building my own agentic system in Elixir, the problem is in training for your specific harness/contracts. Claude/GPT-style models seem to be trained around very specific contracts used by the harness like tool call formats, planning structure, patching, reading files, recovering from errors, and knowing when to stop.

In practice, you either need a very strong general model that can infer and follow those contracts (expensive), or a weaker model that has been fine-tuned / trained specifically on your own agent contracts. Otherwise, the whole thing becomes flaky very quickly. And I suspect that with DeepSeek V4 you may end up with the latter option.

o10449366 · 2026-05-04T09:45:29 1777887929

Idk, my recent experience with Claude is that 4.7 barely knows how to use basic bash tools - how to properly check when programs have finished running, even basic stuff like how to run pytest suites and read the failed tests from the output without re-running the suite to specifically look for them. It's shockingly dumb for all of the tooling they've built into Claude Code (the useless Monitoring tool that blocks bash polling/sleeping that actually works, etc.).

I finally got fed up and started using GPT 5.5 for the past 4 days, and it's a breath of fresh air despite feeling much more minimal. With Claude I had to write so many hooks to enforce behaviors it wouldn't remember and where it lacked common sense. GPT 5.5 does a much better job with things like knowing the AWS CDK CLI can hang on long CloudFormation deployments and that it should actively check the deployment status using the CloudFormation API rather than hanging for 30+ minutes, and it does this all without asking.

Maybe there's better tooling built into Codex too, but at least on the surface level it seems like how smart the model is makes a significant difference because Claude has more tools than I can count and still struggles to use "grep".

Edit: Like just now. I can't tell you how many times a day I see this sequence:

"Sorry, I'll run in parallel"

"Error editing file"

"File must be read first"

Repeat 10x for the 10 subagents Claude spawned and then it gets stuck until you press escape and it says "You rejected the parallel agents. Running directly now"

rirze · 2026-05-04T13:44:09 1777902249

I’m finding great success having Claude design and review code but having codex actually implement it.

cpursley · 2026-05-04T09:21:34 1777886494

I'd love to learn more about the system you’re building out in Elixir and your learnings, if any of it is public.

syntex · 2026-05-04T12:36:15 1777898175

It's semi-public, but I'll probably publish it soon, once it's less embarrassing.

It's an Elixir agent runtime with a thin Go TUI (Bubble Tea). I'm building it mostly to explore agent orchestration: planner/workers/finalizer flows, local file/code-edit tools, MCP tools, permission gates, run context, compaction, and eventually larger swarms. Erlang/Elixir is interesting for this because the actor/supervision model maps pretty naturally to lots of isolated agents and long-running supervised tasks.

As I said, the main lesson so far is that everything around contracts is much more fragile than I expected unless you use a very strong model. Planners return Markdown instead of JSON, tools get called with subtly wrong args, subagents repeat broken tool calls, finalizers lie about success after workers failed. And various permissions may be interpreted by agents in unexpected ways.

I also started with too many modes too early instead of making the agentic path extremely solid. That made me understand better why these codebases become huge: there are endless corner cases if you want a harness to work across models, providers, tools...

Stronger models hide a lot of harness weakness and weaker models expose it. Making weaker models good enough requires a surprising amount of contract hardening. But that hardening tends to make the system better for stronger models too.
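For illustration, a sketch of what one piece of this contract hardening might look like, covering the "planner returns Markdown instead of JSON" case. The plan schema (a `steps` list whose items carry a string `tool` and an object `args`) and the name `parse_plan` are assumptions made up for this example, not taken from the runtime described above, which is written in Elixir.

```python
import json
import re

# Matches a Markdown code fence, optionally tagged "json", around the payload.
FENCE = re.compile("`" * 3 + r"(?:json)?\s*(.*?)" + "`" * 3, re.DOTALL)


def parse_plan(raw: str) -> dict:
    """Validate planner output against a hypothetical plan contract."""
    # Tolerate a Markdown fence wrapped around otherwise valid JSON.
    match = FENCE.search(raw)
    payload = match.group(1) if match else raw
    try:
        plan = json.loads(payload)
    except json.JSONDecodeError as exc:
        raise ValueError(f"planner did not return JSON: {exc}") from exc
    if not isinstance(plan, dict) or not isinstance(plan.get("steps"), list):
        raise ValueError("plan must be an object with a 'steps' list")
    for i, step in enumerate(plan["steps"]):
        if not isinstance(step, dict):
            raise ValueError(f"step {i} must be an object")
        if not isinstance(step.get("tool"), str) or not isinstance(step.get("args"), dict):
            raise ValueError(f"step {i} needs a string 'tool' and an object 'args'")
    return plan
```

A harness could catch the `ValueError` and send its message back to the planner as a correction prompt rather than passing a malformed plan on to the workers.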

Also, the Elixir HTTP stack was causing a lot of problems (I eventually needed to use gun).

cpursley · 2026-05-04T12:52:52 1777899172

Thank you for the writeup, integration with a TUI sounds great. Have you played with Jido (it's built on ReqLLM)? OpenAI also has an interesting Elixir orchestration project (surprisingly).

syntex · 2026-05-04T13:26:42 1777901202

Thanks! I wasn't aware of Jido or ReqLLM before. ReqLLM looks especially promising, and I will likely use it. At the moment, I'm only integrated with OpenRouter.

cpursley · 2026-05-04T14:46:39 1777905999

Yeah, I use ReqLLM in my product and some side projects. So far, so good.

vidarh · 2026-05-04T10:38:34 1777891114

There are certainly quirks, but identifying and conforming to those quirks is not that complex. E.g. I had Kimi "fix" my harness to work better with Kimi by pointing it at the (open source) kimi-cli + web search and telling it to figure out which differences might matter (it made compaction more aggressive, and worked around some known looping issues by triggering compaction if it spotted looping tool calls). Largely, addressing the quirks tends to harden the harness for other models too. But, yeah, it is more work to make the smaller models work with instead of against the harness.
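A rough sketch of how spotting looping tool calls could work, so that a harness knows when to trigger compaction. This is not the actual change Kimi made; the `LoopDetector` class, its `window` and `threshold` parameters, and the idea of fingerprinting calls by tool name plus canonical JSON arguments are all assumptions for illustration.

```python
import json
from collections import deque


class LoopDetector:
    """Flags when the same tool call repeats within a short window (hypothetical)."""

    def __init__(self, window: int = 6, threshold: int = 3):
        self.threshold = threshold
        # Only the most recent `window` calls are remembered.
        self.recent = deque(maxlen=window)

    def record(self, tool: str, args: dict) -> bool:
        """Record a call; return True if the harness should compact now."""
        # Canonical fingerprint so argument order does not matter.
        fingerprint = (tool, json.dumps(args, sort_keys=True))
        self.recent.append(fingerprint)
        return self.recent.count(fingerprint) >= self.threshold
```

The harness would call `record` before dispatching each tool call and run its compaction step whenever it returns `True`.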

dandaka · 2026-05-04T09:54:38 1777888478

I hope they collaborate with open source harness providers (Pi, Opencode) and train models with those. So next generations will have better integration and better overall quality.

syntex · 2026-05-03T10:51:18 1777805478

These benchmarks mean very little. The real test is model + harness, i.e. an agentic system that can fulfill given goals.

syntex · 2026-04-21T06:19:48 1776752388

hallucinates in pretty much every answer

syntex · 2026-01-27T23:08:19 1769555299

The Post-LLM World: Fighting Digital Garbage https://archive.org/details/paper_20260127/mode/2up

Mini paper: the future isn’t about AI replacing humans. It's about humans drowning in cheap artifacts. New unit of measurement proposed: verification debt. Also introduces: Recursive Garbage → model collapse

(a little joke on Prism)

Springtime · 2026-01-28T02:38:21 1769567901

> The Post-LLM World: Fighting Digital Garbage https://archive.org/details/paper_20260127/mode/2up

This appears to just be the output of LLMs itself? It credits GPT-5.2 and Gemini 3 exclusively as authors, has a public domain license (appropriate for AI output) and is only several paragraphs in length.

doodlesdev · 2026-01-28T03:58:01 1769572681

Which proves its own points! Absolutely genius! The cost asymmetry of producing and checking for garbage truly is becoming a problem in recent years, with the advent of LLMs and generative AI in general.

parentheses · 2026-01-28T06:15:44 1769580944

Totally agree!

I feel like this means that working in any group where individuals compete against each other results in an AI vs AI content generation competition, where the human is stuck verifying/reviewing.

dormento · 2026-01-28T12:58:22 1769605102

> Totally agree!

Not a dig on your (very sensible) comment, but now I always do a double take when I see anyone effusively approving of someone else's ideas. AI turned me into a cynical bastard :(

syntex · 2026-01-28T08:48:33 1769590113

Yes, I did it as a joke inspired by the PRISM release. But unexpectedly, it makes a good point. And the funny part for me was that the paper lists only LLMs as authors.

Also, in a world where AI output is abundant, we humans become the scarce resource: the "tools" in the system that provide some connection to reality (grounding) for LLMs.

mrbonner · 2026-01-28T01:23:09 1769563389

Plot twist: humans become the new Proof of Work consensus mechanism. Instead of GPUs burning electricity to hash blocks, we burn our sanity verifying whether that Medium article was written by a person or a particularly confident LLM.

"Human Verification as a Service": finally, a lucrative career where the job description is literally "read garbage all day and decide if it's authentic garbage or synthetic garbage." LinkedIn influencers will pivot to calling themselves "Organic Intelligence Validators" and charge $500/hr to squint at emails and go "yeah, a human definitely wrote this passive-aggressive Slack message."

The irony writes itself: we built machines to free us from tedious work, and now our job is being the tedious work for the machines. Full circle. Poetic even. Future historians (assuming they're still human and not just Claude with a monocle) will mark this as the moment we achieved peak civilization: where the most valuable human skill became "can confidently say whether another human was involved."

Bullish on verification miners. Bearish on whatever remains of our collective attention span.

kinduff · 2026-01-28T01:37:30 1769564250

Human CAPTCHA exists to figure out whether your clients are human or not, so you can segment them and apply human pricing. Synthetics, of course, fall into different tiers. The cheaper ones.

direwolf20 · 2026-01-28T01:35:02 1769564102

Bullish on verifiers who accept money to verify fake things

syntex · 2025-11-21T15:09:13 1763737753

I see the author is decorating the website for Christmas :)

syntex · 2025-06-10T14:35:40 1749566140

The illusion of reasoning was a terrible paper. 2^n - 1 moves: how could that fit in the context size? I tried o3 and it gave me a Python script, saying that listing all the moves is too much for the context window. Completely different results.

roboboffin · 2025-06-10T15:31:55 1749569515

I think that their point was that the problem is easily solvable by humans without code, and shows the ability to chain steps together to achieve a goal.

jwitthuhn · 2025-06-10T18:27:58 1749580078

Is it easily solvable by humans without code? I suspect if you asked a human to write down all the steps in order to solve a Tower of Hanoi with 12 disks they would also give up before completing it. Writing code that produces the correct output is the only realistic way to solve that purely due to the amount of output required.

roboboffin · 2025-06-10T16:17:03 1749572223

Not sure why I am being downvoted. I am simply saying that we know there is a defined algorithm for solving Tower of Hanoi, and the source code for it is widely available. So o3 producing the code as an answer demonstrates even less intelligence, as it means it is either memorized or copied from the internet. I don't see how this point counters the paper at all.

I believe what they are trying to show in that paper is that as the chain of operations grows large (their proxy for complexity), an LLM will inevitably fail. Humans don't have infinite context either, but they can still solve the Tower of Hanoi without needing to resort to pen and paper or coding.

syntex · 2025-06-10T17:30:13 1749576613

I didn't downvote. The problem with the paper is that it asks the model to output all moves for, say, 15 disks: 2^15 - 1 = 32767

32767 moves in a single prompt. That's not testing reasoning. That’s testing whether the model can emit a huge structured output without error, under a context window limit.

The authors then treat failure to reproduce this entire sequence as evidence that the model can't reason. But that’s like saying a calculator is broken because its printer jammed halfway through printing all prime numbers under 10000.

For me o3 returning Python code isn’t a failure. It’s a smart shortcut. The failure is in the benchmark design. This benchmark just smells.
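For reference, the kind of short script in question might look like the sketch below. This is the standard recursive Tower of Hanoi solution, not the actual code o3 returned; the peg names `A`, `B`, `C` are arbitrary. It generates the moves lazily and confirms the 2^n - 1 count.

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Yield the moves that transfer n disks from source to target."""
    if n == 0:
        return
    # Move the n-1 smaller disks out of the way onto the spare peg.
    yield from hanoi(n - 1, source, spare, target)
    # Move the largest disk directly to the target.
    yield (source, target)
    # Move the n-1 smaller disks from the spare peg onto the target.
    yield from hanoi(n - 1, spare, target, source)


for n in (3, 8, 15):
    count = sum(1 for _ in hanoi(n))
    print(n, count, count == 2**n - 1)
# prints:
# 3 7 True
# 8 255 True
# 15 32767 True
```

The program is a handful of lines, while its output for 15 disks is 32767 moves long, which is the gap between describing the solution and emitting it in full.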

daveguy · 2025-06-10T18:24:29 1749579869

> That’s testing whether the model can emit a huge structured output without error, under a context window limit.

Agreed. But to be fair, 1) a relatively simple algorithm can do it, and more importantly 2) a lot of people are trying to build products around doing exactly this (emit large structured output without error).

roboboffin · 2025-06-10T17:43:53 1749577433

No worries, I wasn’t saying it to you directly.

I agree 15 disks is very difficult for a human, probably on a sheer stamina level; but I managed to do 8 in about 15 minutes by playing around (i.e. no practice). They do state that there is a massive drop in performance at this point.

teach · 2025-06-10T19:51:49 1749585109

Remember that with Towers of Hanoi every extra disk doubles the number of moves required. So 15 discs is 128x more moves. If you did eight in 15m then fifteen would take you 32 hours.
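The arithmetic above can be checked with a few lines, assuming the time per move stays constant (which ignores fatigue and mistakes):

```python
def moves(n):
    """Minimum number of moves for a Tower of Hanoi with n disks."""
    return 2**n - 1


ratio = moves(15) / moves(8)   # 32767 / 255, roughly 128.5
hours = 15 * ratio / 60        # 15 minutes scaled by the ratio, in hours
print(round(ratio, 1), round(hours, 1))
# prints: 128.5 32.1
```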

syntex · on May 11, 2025

The same for me. I only knew how to assign variables, use for loops, if->then, and use the POKE command. And from that point I started thinking of myself as a programmer, even though the only thing I wrote in C64 BASIC was a ball moving on the screen. :)