Hacker News | disiplus's comments

I have them all. They're just not as good. Whoever tells you that looked only at the benchmarks, not at real use. They all fall short at some point.

Kimi K2.5 is the best of them, but it's still not at the level of what Anthropic released with Opus 4.5.


We’ll have to give it 3 weeks.

I think in the West we assume everything is blocked. But, for example, if you book an eSIM before you visit, you already get direct access to Western services, because the traffic is routed through servers elsewhere. Hong Kong is totally different: people there basically use WhatsApp and Google Maps, and everything worked when I was there.

But also yes, the parent is right: HF is more or less inaccessible, and ModelScope is frequently cited as the mirror to use (although many Chinese labs seem to treat HF as the mirror, and ModelScope as the "real" origin).

Yeah, they're the good guys. I suspect the open-source work is mostly advertising to help them sell consulting and services to enterprises. Otherwise, offering the work they do for free doesn't make sense.

Haha, for now our primary goal is to expand the market for local AI and educate people on doing RL, fine-tuning, and running quants :)

Amazing work and people should really appreciate that the opportunity costs of your work are immense (given the hype).

On another note: I'm a bit paranoid about quantization. People are no longer good at discerning model quality at these levels of "intelligence", and I don't think a vibe check really catches the nuances. How hard would it be to systematically evaluate the different quantizations? E.g. on the Aider benchmark that you used in the past?

I was recently trying Qwen 3 Coder Next, and there are benchmark numbers in your article, but they seem to be for the official checkpoint, not the quantized ones. It's not even really clear (and chatbots mistake them for benchmarks of the quantized versions, btw).

I think systematic/automated benchmarks would really bring the whole effort to the next level. Basically something like the bar chart from the Dynamic Quantization 2.0 article, but kept up to date with all kinds of recent models.


Thanks! Yes, we actually did think about that - sadly it can get quite expensive. Perplexity benchmarks over short context lengths with small datasets are doable, but they're not an accurate measure. We're currently investigating the most efficient course of action for evaluating quants - will keep you posted!
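For context on why the perplexity route is cheap: perplexity is just the exponential of the mean per-token negative log-likelihood, so comparing a quant against the full-precision checkpoint only requires one forward pass each over a small dataset. A minimal sketch of the metric itself (the model and dataset are whatever you plug in; the function below is a hypothetical helper, not anyone's actual tooling):

```python
# Sketch: perplexity from per-token negative log-likelihoods.
# Feed it NLLs from the BF16 model and from a quant, and the gap
# between the two perplexities is a cheap quality signal.
import math

def perplexity(nlls):
    """exp(mean NLL) over a list of per-token negative log-likelihoods."""
    return math.exp(sum(nlls) / len(nlls))

# Sanity check: if every token gets probability 1/4, NLL = ln(4)
# and perplexity is exactly 4.
print(perplexity([math.log(4)] * 10))
```

The catch, as noted above, is that perplexity over short contexts correlates only loosely with downstream quality, which is why it can't replace a benchmark like Aider.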

> How hard would it be to systematically evaluate the different quantizations? E.g. on the Aider benchmark that you used in the past?

Very hard. $$$

The benchmarks are not cheap to run. It'll cost a lot to run them for each quant of each model.


Yes, sadly very expensive :( Maybe a select few quants could happen - we're still figuring out the most economical and efficient way to benchmark!

Roughly how much does it cost to run one of the popular benchmarks? Are we talking $1,000, $10,000, or $100k?

Oh, it's more the time that's the issue - each benchmark takes roughly 1-3 hours to run on 8 GPUs, so running all quants per model release can be quite painful.

Assume AWS spot pricing of roughly $20/hr for 8 B200 GPUs; that's about $20-60 per quant per benchmark run. If the benchmark covers BF16, 8-bit, 6-bit, 5-bit, 4-bit, 3-bit, and 2-bit, that's 7 runs, so roughly $140 (at 1 hour each) to $420 (at 3 hours each) per model. Time-wise, that's 7 hours to about a day.

We could run them after a model release which might work as well.

This is also on 1 benchmark.
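The back-of-envelope math above can be written out directly. The rate, run times, and quant list are the assumptions stated in the thread, not measured figures:

```python
# Assumptions from the thread: ~$20/hr for an 8x B200 spot node,
# 1-3 hours per benchmark run, 7 precisions per model release.
HOURLY_RATE = 20  # USD/hr for the whole 8-GPU node (assumed spot price)
QUANTS = ["BF16", "8bit", "6bit", "5bit", "4bit", "3bit", "2bit"]

def cost_per_model(hours_per_run, n_benchmarks=1):
    """Total cost to benchmark every quant of one model."""
    return HOURLY_RATE * hours_per_run * len(QUANTS) * n_benchmarks

print(cost_per_model(1))  # 140  (best case: 1 hr per run, 1 benchmark)
print(cost_per_model(3))  # 420  (worst case: 3 hrs per run, 1 benchmark)
```

Multiplying by several benchmark suites per release is where it stops being pocket change.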


This would be amazing

Working on it! :)

I hope that is exactly what is happening. It benefits them, and it benefits us.

Even if it's a different session, it can be enough. That said, I had times where it rewrote tests "because my implementation was now different, so the tests needed to be updated" - so you even have to explicitly prompt it not to touch the tests.

And then verify that it obeyed the prompt!

Someone needs to build an agentic tool that does strict, enforced TDD.


As a father of two boys, I can give you some feedback. The AI stories you generate will probably be crap and not worth paying for. What my kids love is when I put them into a scene: I take a picture of them and then generate them in a jungle or whatever setting with Gemini's Nano Banana. They want me to print those out; I know it's temporary, but it's fun for us all. So you could combine those two things.

Yes - we do combine both, and you can upload photos to get your kids into the book.

Tell that to the LinkedIn crowd. They keep doing it, and they don't credit it, but I assume at least 60% of other people can tell.

To be fair, LinkedIn has always been filled to the brim with unhinged slop, even before AI was a thing.

That's by design, you know, with all those huge security implications. Now imagine if it were that easy to set up, install, and use.

They are a source of data, so they're not entirely without value.


You're putting a higher price on general user chatter than on ad income? Everything is a source of data. How much anyone can benefit from each data point on offer isn't as much of a given.


This is real. I'm a freelancer, and I used a small invoicing platform to create invoices for my customers. At "work" I build accounting systems and ERPs. So with AI, why would I pay monthly for invoicing when I can build it myself? After a day I had invoicing working - the simple flow where you get a PDF out. Then I started implementing double-entry bookkeeping. And support for different tax systems. And then we need a sales part, then a CRM, then a warehouse module. Then projects to track time, and so on. Now I have a full SaaS that I don't need, and I'm not going to waste time competing in that market. Now I'm thinking of putting it up as open source.
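The double-entry core mentioned above is small in principle: every transaction is a set of postings whose debits and credits sum to zero. A hypothetical minimal sketch (account names and the cents-as-integers convention are illustrative, not from the actual system):

```python
# Minimal double-entry core: a transaction is a balanced set of
# postings; applying it updates per-account running balances.
from dataclasses import dataclass

@dataclass
class Posting:
    account: str
    amount: int  # cents; positive = debit, negative = credit

def post(ledger, postings):
    """Apply a balanced transaction to a dict of account balances."""
    if sum(p.amount for p in postings) != 0:
        raise ValueError("unbalanced transaction")
    for p in postings:
        ledger[p.account] = ledger.get(p.account, 0) + p.amount
    return ledger

# Issuing an invoice for 100.00: receivable goes up, revenue is credited.
ledger = post({}, [Posting("accounts_receivable", 10000),
                   Posting("revenue", -10000)])
print(ledger)  # {'accounts_receivable': 10000, 'revenue': -10000}
```

The invariant that no unbalanced transaction can ever be posted is what makes the rest (tax systems, CRM, warehouse) bookkeeping-safe to bolt on.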


"Invoicing for freelancers" has about as many solutions as to-do lists or ticket systems. Just use what you built if it works; open-sourcing it is likely to get zero interest among the thousands of other options.


What would be interesting, as I find myself building the same thing, is to learn about your decision matrix - how you chose what to build. I've even got the contractor working on my house logging his hours using this thing. I'm considering adding a Stripe Connect integration to do payouts with it!

I'm downloading it as we speak to try running it on a 32 GB 5090 + 128 GB DDR5. I'll compare it to GLM 4.7-Flash, which was my local model of choice.


Likewise curious to hear how it goes! 80B seems too big for a 5090; I'd be surprised if it runs well unquantized.


Interested to hear how this goes!

