Looks great, but looking at the benchmark, I can’t help but think about how crazy good dots-ocr is as a model. Too bad they’re not as open as the DeepSeek team, because it’s so good and I’d love to know how it was trained.
Did we read the same graph? DeepSeek Gundam 200 dpi appeared to get similar perf as dots-ocr, but with fewer tokens needed. The x-axis is inverted, descending with distance from the origin.
You should be more concerned that Chinese labs can train models that are just as good for 10X less because Americans treat the USD’s status as the global reserve currency as the ultimate bitter lesson. Who needs better math and engineering when you can print money to buy more GPUs???
The author literally said GPT is spending 10x more for equivalent results. That really means ChatGPT had that level of capability at that cost a year or two ago. Smaller domain models are better at focused tasks in their area, but they can’t be generalized.
Yeah, it’s hilarious to be having this conversation about MLEs while attributing the bad outcomes to anything other than poorly designed reward functions, i.e. management. If an engineer burned millions on failed training runs because they did a shit job of creating a policy that maximized for the desired outcome, they’d get canned, but that’s just a Tuesday for your average MBA with VC backing.
Gemini just eclipsed ChatGPT to be #1 on the Apple App Store for these kinds of apps. The 2.5 Pro series is also good/SOTA at coding, but unfortunately poorly trained for the agentic workflows that have become predominant.
Can you generate 8-bit AVR assembly code to multiply two 24-bit posit numbers?
You get some pretty funny results from the models that have no idea what a posit is. It's usually pretty easy to tell whether they know what they're supposed to be doing. I haven't had a success yet (haven't tried for a while though). Some of them have come pretty close, but usually it's trying to squeeze more than 8 bits of data into a register that brings them down.
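For anyone curious what the models keep fumbling, here's a minimal C sketch (not AVR assembly, and not a full posit multiplier) of the part they trip over: a 24-bit significand has to live in three 8-bit bytes, and the multiply has to be built from 8×8→16 partial products with explicit carry propagation, the same shape AVR's MUL/ADC sequence would take. The function name `mul24x24` and the test values are mine; posit decoding (sign/regime/exponent/fraction) and rounding are deliberately omitted.

```c
#include <stdint.h>
#include <stdio.h>

/* Multiply two 24-bit unsigned values, each held as three little-endian
 * bytes, into a 48-bit product held as six bytes. Only 8x8->16 partial
 * products are used, mirroring what an 8-bit AVR's MUL instruction gives
 * you (16-bit result in r1:r0). This is just the significand multiply;
 * it is not a posit implementation. */
static void mul24x24(const uint8_t a[3], const uint8_t b[3], uint8_t out[6])
{
    uint32_t acc[6] = {0};

    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            uint16_t p = (uint16_t)a[i] * (uint16_t)b[j]; /* 8x8 -> 16 */
            acc[i + j]     += p & 0xFF;  /* low byte of partial product  */
            acc[i + j + 1] += p >> 8;    /* high byte of partial product */
        }
    }

    /* Propagate carries byte by byte, as the assembly would with ADC. */
    uint32_t carry = 0;
    for (int k = 0; k < 6; k++) {
        uint32_t v = acc[k] + carry;
        out[k] = (uint8_t)(v & 0xFF);
        carry  = v >> 8;
    }
}

int main(void)
{
    /* 0x123456 * 0xABCDEF, checked against a native 64-bit multiply. */
    const uint8_t a[3] = {0x56, 0x34, 0x12};
    const uint8_t b[3] = {0xEF, 0xCD, 0xAB};
    uint8_t out[6];

    mul24x24(a, b, out);

    uint64_t got = 0;
    for (int k = 5; k >= 0; k--)
        got = (got << 8) | out[k];

    uint64_t want = (uint64_t)0x123456 * 0xABCDEF;
    printf("got  %012llx\nwant %012llx\n",
           (unsigned long long)got, (unsigned long long)want);
    return got == want ? 0 : 1;
}
```

Even this is only the easy half: the regime/exponent extraction and the final rounding are where a 24-bit posit gets genuinely fiddly on an 8-bit target.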
Yeah, so it’d be interesting to see whether, provided the correct context (your understanding of its error pattern), it can accomplish this.
One thing you learn quickly about working with LLMs is that they have these kinds of baked-in biases, some of which are very fixed and tied to their very limited ability to engage in novel reasoning (cc François Chollet), while others are far more loosely held/correctable. If it sticks with the errant pattern even when provided the proper context, it probably isn’t something an off-the-shelf model can handle.