Hacker News | chisleu's comments

An open-source, code-first Go toolkit for building, evaluating, and deploying sophisticated AI agents with flexibility and control.


Total tangent, but I got to ride in some of these on a recent trip to India and I was really impressed with the build quality and utilitarian usefulness of the design.


Here is the demo video on it. The segment with sound input -> sound output, translating the video's audio into another language, was the most impressive display I've seen yet.

https://www.youtube.com/watch?v=_zdOrPju4_g


Because of the prompt processing speed, small models like Qwen 3 Coder 30b a3b are the sweet spot for the Mac platform right now, which means a 32 or 64 GB Mac is all you need to use Cline or your favorite agent locally.


Yes, I use LM Studio daily with Qwen 3 30b a3b. I can't believe how good it is locally.


Can you use your Qwen instance in CLIs like Claude Code, Codex, or whatever open-source coding agent?

Or do you have to copy and paste into LM Studio?


Yeah you can, so long as you're hosting your local LLM through something with an OpenAI-compatible API (which is a given for almost all local servers at this point, including LM Studio).

https://opencode.ai and https://github.com/QwenLM/qwen-code both allow you to configure any API as the LLM provider.
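
For example, anything that already speaks the OpenAI protocol can just be pointed at the local server. A rough sketch in Python, assuming LM Studio's default http://localhost:1234/v1 endpoint and a placeholder model id:

  # Rough sketch: talk to a local LM Studio server through the OpenAI client.
  # Assumes LM Studio's default endpoint (http://localhost:1234/v1); the model
  # id below is a placeholder -- use whatever id your server actually reports.
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

  resp = client.chat.completions.create(
      model="qwen3-coder-30b-a3b",  # placeholder model id
      messages=[{"role": "user", "content": "Write a hello-world HTTP server."}],
  )
  print(resp.choices[0].message.content)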

That said, running agentic workloads on local LLMs will be a short and losing battle against context size if you don't have hardware bought specifically for this purpose. You can get it running and it will work for several autonomous actions, but not for nearly as long as a hosted frontier model will.


Unfortunately, IDE integration like this tends to be very prefill intensive (more math than memory). That puts Apple Silicon at a disadvantage without the feature that we’re talking about. Presumably the upcoming M5 will also have dedicated matmul acceleration in the GPU. This could potentially change everything in favor of local AI, particularly on mobile devices like laptops.


Cline has a new "compact" prompt enabled for their LM Studio integration, which greatly alleviates the long-system-prompt prefill problem, especially for Macs, which suffer from low compute (though it disables MCP server usage; presumably the lost part of the prompt is what made that work well).

It worked better for me when I tested it, and Cline is supposedly adding it to the Ollama integration. I suspect that type of alternate local configuration will proliferate into adjacent projects like Roo, Kilo, Continue, etc.

Apple adding hardware to speed it up will be even better, the next time I buy a new computer.


LM Studio lets you run a model as a local API (OpenAI-compatible REST server).
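
A quick sanity check that it really is just the standard OpenAI REST surface - a sketch assuming the default port, listing whatever models are loaded:

  # Sketch: the local server exposes the standard OpenAI-style REST routes.
  # Assumes LM Studio's default address (http://localhost:1234).
  import requests

  # List whatever models the server currently has loaded.
  models = requests.get("http://localhost:1234/v1/models").json()
  print([m["id"] for m in models["data"]])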


I've been using GLM 4.5 and GLM 4.5 Air for a while now. The Air model is light enough to run on a MacBook Pro and is useful for Cline. I can run the full GLM model on my Mac Studio, but the TPS is so slow that it's only useful for chatting. So I hooked up OpenRouter to try it, but didn't have the same success. Any of the open-weight models I try through OpenRouter give substandard results. I get better results from Qwen 3 Coder 30b a3b locally than I get from Qwen 3 Coder 480b through OpenRouter.

I'm really concerned that some of the providers are using quantized versions of the models so they can run more models per card and larger batches of inference.


> I get better results from Qwen 3 Coder 30b a3b locally than I get from Qwen 3 Coder 480b through OpenRouter. I'm really concerned that some of the providers are using quantized versions of the models so they can run more models per card and larger batches of inference.

This doesn't match my experience precisely, but I've definitely had cases where some of the providers had consistently worse output for the same model than others; the solution there was to figure out which ones those were and denylist them in the UI.

As for quantized versions, you can check it for each model and provider, for example: https://openrouter.ai/qwen/qwen3-coder/providers

You can see that these providers run FP4 versions:

  * DeepInfra (Turbo)

And these providers run FP8 versions:

  * Chutes
  * GMICloud
  * NovitaAI
  * Baseten
  * Parasail
  * Nebius AI Studio
  * AtlasCloud
  * Targon
  * Together
  * Hyperbolic
  * Cerebras

I will say that it's not all bad and my experience with FP8 output has been pretty decent, especially when I need something done quickly and choose to use Cerebras - provided their service isn't overloaded, their TPS is really, really good.

You can also request specific precision on a per request basis: https://openrouter.ai/docs/features/provider-routing#quantiz... (or just make a custom preset)
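
A rough sketch of what that looks like in a request body - the "provider" block's field names ("quantizations", "ignore") are my reading of the provider-routing docs linked above, so double-check them there:

  # Rough sketch of a per-request provider preference on OpenRouter.
  # The "provider" block's field names ("quantizations", "ignore") follow my
  # reading of the provider-routing docs -- verify against them first.
  import os
  import requests

  resp = requests.post(
      "https://openrouter.ai/api/v1/chat/completions",
      headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
      json={
          "model": "qwen/qwen3-coder",
          "messages": [{"role": "user", "content": "Refactor this function."}],
          "provider": {
              "quantizations": ["fp16", "bf16"],  # skip fp8/fp4 deployments
              "ignore": ["SomeProvider"],         # hypothetical denylist entry
          },
      },
  )
  print(resp.json()["choices"][0]["message"]["content"])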


Interesting. Thanks for sharing. What about qwen3-coder on Cerebras? I'm happy to pay the $50 for the speed as long as results are good. How does it compare with glm-4.5?


I wish that Cerebras had a direct pay-per-use API option instead of pushing you towards OpenRouter and HuggingFace (the former sometimes throws 429s, so either the speed is great or there is no speed): https://www.cerebras.ai/pricing - but I imagine that for most folks their subscription would be more than enough!

As for how Qwen3 Coder performs, there's always SWE-bench: https://www.swebench.com/

By the numbers:

  * it sits between Gemini 2.5 Pro and GPT-5 mini
  * it beats out Kimi K2 and the older Claude Sonnet 3.7
  * but loses out to Claude Sonnet 4 and GPT-5

Personally, I find it sufficient for most tasks (from recommendations and questions to as close to vibe coding as I get) on a technical level. GLM 4.5 isn't on the site at the time of writing this, but they should match one another pretty closely. Feeling-wise, I still very much prefer Sonnet 4 to everything else, but it's both expensive and way slower than Cerebras (not even close).

Update: it also seems the Growth plan on their page says "Starting from 1500 USD / month", which is a bit silly when the new cheapest subscription is 50 USD / month.


Quantization matters a lot more than r/locallama wants to believe. Here's Qwen3 Coder vs Qwen3 Coder @fp8: https://brokk.ai/power-ranking?version=openround-2025-08-20&...


Yeah, I too have heard similar concerns about open models on OpenRouter, but I haven't been able to verify them, as I don't use it a lot.


(OpenRouter COO here) We are starting to test this and verify the deployments. More to come on that front -- but long story short is that we don't have good evidence that providers are doing weird stuff that materially affects model accuracy. If you have data points to the contrary, we would love them.

We are heavily incentivized to prioritize/make transparent high-quality inference and have no incentive to offer quantized/poorly-performing alternatives. We certainly hear plenty of anecdotal reports like this, but when we dig in we generally don't see it.

An exception is when a model is first released -- for example, this terrific work by Artificial Analysis: https://x.com/ArtificialAnlys/status/1955102409044398415

It does take providers time to learn how to run the models in a high quality way; my expectation is that the difference in quality will be (or already is) minimal over time. The large variance in that case was because GPT OSS had only been out for a couple of weeks.

For well-established models, our (admittedly limited) testing has not revealed much variance between providers in terms of quality. There is some, but it's not like we see a couple of providers 'cheating' by secretly quantizing and clearly serving less intelligent versions of the model. We're going to get more systematic about it, though, and perhaps will uncover some surprises.


> We ... have no incentive to offer quantized/poorly-performing alternatives

However, your providers do have such an incentive.


So what's the deal with Chutes and all the throttling and errors? It seems like users are losing their minds over this, at least judging from all the Reddit threads I'm seeing.


What's Chutes?


Cheap provider on OpenRouter:

https://openrouter.ai/provider/chutes


Ahh. Thanks


Unsolicited advice: Why doesn't OpenRouter provide hosting services for OSS models that guarantee non-quantised versions of the LLMs? Would be a win-win for everyone.


Would make very little business sense at this point - currently they have an effective monopoly on routing. Hosting would just make them one provider among a few dozen. It would make the other providers less likely to offer their services through OpenRouter. It would come with lots of concerns that OpenRouter would favor routing towards their own offerings. It would be a huge distraction from their core business, which is still rapidly growing. Would need massive capital investment. And another thousand reasons I haven't thought of.


In fact, I thought that's what OpenRouter was doing - hosting them all along.


This is going to improve the quality of LLM responses for users. I'm for this.


> You don’t have to randomize the first part of your object keys to ensure they get spread around and avoid hotspots.

As of when? According to internal support, this is still required as of 1.5 years ago.
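
For anyone who hasn't seen it, the old guidance amounted to salting the front of the key so objects spread across partitions. A rough sketch of both layouts with boto3 (bucket and key names made up):

  # Sketch of the old "randomize the key prefix" guidance vs. a natural layout.
  # Bucket and key names are made up for illustration.
  import hashlib
  import boto3

  s3 = boto3.client("s3")
  natural_key = "logs/2024/06/01/host-42.json"

  # Old guidance: a short hash up front spreads keys across partitions.
  salt = hashlib.md5(natural_key.encode()).hexdigest()[:4]
  s3.put_object(Bucket="example-bucket", Key=f"{salt}/{natural_key}", Body=b"{}")

  # What the article says is fine now: just use the natural key.
  s3.put_object(Bucket="example-bucket", Key=natural_key, Body=b"{}")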


I think there is some nuance needed here. If you ask support to partition your bucket, they will be a bit annoying if you ask for specific partition points and the first part of the prefix is not randomised. They tried to push me to refactor the bucket first to randomise the beginning of the prefix, but eventually they did it.

The auto partitioning is different. It can isolate hot prefixes on its own and can intelligently pick the partition points. The problem is that the process is slow, and you can be throttled for more than a day before it kicks in.


> but eventually they did it

They can do this with manual partitioning indeed. I've done it before, but it's not ideal because the auto partitioner will scale beyond almost anything AWS will give you with manual partitioning unless you have 24/7 workloads.

> you can be throttled for more than a day before it kicks in

I expect that this depends on your use case. If you are dropping content that you need to scale out to tons of readers, that is absolutely the case. If you are dropping tons of content with well-distributed reads, then the auto partitioner is The Way.


He's not talking about the prefix, just the beginning of the object key.


The prefix is not separate from the object key. It's part of it. There's no randomization that needs to be done on either anymore.


And indeed the bucket is not separate from the object key. The API separates it logically "for humans", but it's all one big string.


/agree

We are in the infancy of LLM technology.


How was he doing "complex agentic coding" when the APIs have such extreme context and throughput limitations?


Holy shit, it does. The scene with him inventing the new compression algorithm basically foreshadowed the gooning that would follow local LLM availability.

