Hacker News | Bjorkbat's comments

Somewhat related, but I’ve been feeling as of late what can best be described as “benchmark fatigue”.

The latest models can score something like 70% on SWE-bench verified and yet it’s difficult to say what tangible impact this has on actual software development. Likewise, they absolutely crush humans at sport programming but are unreliable software engineers on their own.

What does it really mean that an LLM got gold on this year’s IMO? What if it means pretty much nothing at all besides the simple fact that this LLM is very, very good at IMO style problems?


As far as I can tell, the actual advancement here is the methodology used to create a model tuned for this problem domain, and how efficient that method is. In theory, that makes it easier to build other problem-domain-specific models.

That a highly tuned model designed to solve IMO problems can solve IMO problems is impressive, maybe, but yeah it doesn't really signal any specific utility otherwise.


Alright, it does look pretty charming, and I especially like that it's open-source since pretty much anyone buying a domestic robot is likely to be a tinkerer of some sort, but at the same time it reminds me of the Jibo (https://robotsguide.com/robots/jibo).

For those who don't remember (I couldn't remember the name, only the face, had to look hard for it) it was a desktop robot released in 2014 that was hyped pretty hard at the time. It didn't help that the company that launched it was founded by a fairly well-known MIT professor.

And yeah, it was a flop. The $900 price tag wasn't helping things, but neither was the fact that it didn't really do anything that an Alexa couldn't. You bought it solely because you really liked the idea of robots and thought it was cool, not at all for its value around the house.

I'm not gonna dunk on this too hard since it's probably just a fun company side-project, but I might change my tune if they get too high on hype.


I recall it as less an evolution and more a complete tonal shift the moment o3 was evaluated on ARC-AGI. I remember Sam making some dumb post on Twitter suggesting they had beaten the benchmark internally, and Francois calling him out on his vagueposting. As soon as they publicly released the scores, it was like he was all-in on reasoning.

Which I have to admit I was kind of disappointed by.


What exactly is "reasoning"?


In this context I believe it refers to models that are trained to generate an internal dialogue that is then fed back in as additional input. This cycle might be performed several times before generating the final output text.

This is in contrast to the way that GPT-2/3/“original 4” work, which is by repeatedly generating the next finalized token based on the full dialogue thus far.
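
A minimal sketch of that loop, assuming hypothetical names (the generate function below stands in for a single model call, not any vendor's actual API):

    def generate(prompt: str) -> str:
        """Stand-in for one forward pass that returns the model's text."""
        raise NotImplementedError  # a real implementation would call the model here

    def answer_with_reasoning(question: str, rounds: int = 3) -> str:
        transcript = question
        for _ in range(rounds):
            # The model produces an internal "thought", which is appended to
            # its own input and fed back in on the next pass.
            thought = generate(transcript + "\nThink it through:")
            transcript += "\n" + thought
        # Only after the internal dialogue does it emit the visible answer.
        return generate(transcript + "\nFinal answer:")

The non-reasoning models are effectively just the inner generate call run once over the conversation so far.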


Who invented internal dialogue?


Something missed in arguments such as these is that measuring fair use involves considering the impact on the potential market for a rightsholder's present and future works. In other words: can it be proven that what you are doing meaningfully deprives the author of future income?

Now, in theory, you learning from an author's works and competing with them in the same market could meaningfully deprive them of income, but it's a very difficult argument to prove.

On the other hand, with AI companies it's an easier argument to make. If Anthropic trained on all of your books (which is somewhat likely if you're a fairly popular author) and you saw a substantial loss of income after the release of one of their better models (presumably because people are just using the LLM to write their own stories rather than buy your stuff), then it's a little bit easier to connect the dots. A company used your works to build a machine that competes with you, which arguably violates the fair use principle.

Gets to the very principle of copyright, which is that you shouldn't have to compete against "yourself" because someone copied you.


> a consideration of impact on the potential market for a rightsholder's present and future works

This is one of those mental gymnastics exercises that makes copyright law so obtuse and effectively unenforceable.

As an alternative, imagine a scriptwriter buys a textbook on orbital mechanics, while writing Gravity (2013). A large number of people watch the finished film, and learn something about orbital mechanics, therefore not needing the textbook anymore, causing a loss of revenue for the textbook author. Should the author be entitled to a percentage of Gravity's profit?

We'd be better off abolishing everything related to copyright and IP law altogether. These laws might've made sense back in the days of the printing press, but they're just nonsensical nowadays.


Well I mean you're constructing very convoluted and weak examples.

I think, in your example, the obvious answer is no, they're not entitled to any profits of Gravity. How could you possibly prove Gravity has anything to do with someone reading, or not reading, a textbook? You can't.

However, AI participates in the exact same markets it trains from. That's obviously very different. It is INTENDED to DIRECTLY replace the things it trains on.

Meaning, not only does an LLM output directly replace the textbook it was trained on, but that behavior is the sole commercial goal of the company. That's why they're doing it, and that's the only reason they're doing it.


> It is INTENDED to DIRECTLY replace the things it trains on.

Maybe this is where I'm having trouble. You say "exact same markets" -- how is a print book the exact same market as a web/mobile text-generating human-emulating chat companion? If that holds, why can't I say a textbook is the exact same market as a film?

I could see the argument if someone published a product that was fine-tuned on a specific book, and marketed as "use this AI instead of buying this book!", but that's not the case with any of the current services on the market.

I'm not trying to be combative, just trying to understand.. they seem like very different markets to me.


> how is a print book the exact same market as a web/mobile text-generating human-emulating chat companion? If that holds, why can't I say a textbook is the exact same market as a film?

Because the medium is actually the same. The content of a book is not paper, or a cover. It's text, and specifically the information in that text.

LLMs are intended to directly compete with and outright replace that use case. I don't need a textbook on, say, anatomy, because ChatGPT can structure and explain anatomy for me, often with essentially the same content slightly rearranged.

This doesn't really hold for fictional books, nor does it hold for movies.

Watching a movie and reading a book are inherently different experiences, which cannot replace one another. Reading a textbook and asking ChatGPT about topic X is, for all intents and purposes, the same experience. Especially since, remember, most textbooks are online today.


Is it? If a teacher reads a book, then gives a lecture on that topic, that's decidedly not the same experience. Which step about that process makes it not the same experience? Is it the fact that they read the book using their human brain and then formed words in a specific order? Is it the fact that they're saying it out loud that's transformative? If we use ChatGPT's TTS feature, why is that not the same thing as a human talking about a topic after they read a book since it's been rearranged?


Well there's multiple reasons why it's not the same experience. It's a different medium, but it's also different content. The textbook may be used as a jumping-off point, supplemented by decades of real-life experience the professor has.

And, I think, the elephant in the room with these discussions: we cannot just compare ChatGPT to a human. That's not a foregone conclusion and, IMO, no, you can't just do that. You have to justify it.

Humans are special. Why? Because we are Human. Humans have different and additional rights which machines, and programs, do not have. If we want to extend our rights to machines, we can do that... but not for free. Oh no, you must justify that, and it's quite hard. Especially when said machines appear to work against Humans.


Personally I think a more effective analogy would be if someone used a textbook and created an online course / curriculum effective enough that colleges stop recommending the purchase of said textbook. It's honestly pretty difficult to imagine a movie having a meaningful impact on the sale of textbooks since they're required for high school / college courses.

So here's the thing, I don't think a textbook author going against a purveyor of online courseware has much of a chance, nor do I think it should have much of a chance, because it probably lacks meaningful proof that their works made a contribution to the creation of the courseware. Would I feel differently if the textbook author could prove in court that a substantial amount of their material contributed to the creation of the courseware, and when I say "prove" I mean they had receipts to prove it? I think that's where things get murky. If you can actually prove that your works made a meaningful contribution to the thing that you're competing against, then maybe you have a point. The tricky part is defining meaningful. An individual author doesn't make a meaningful contribution to the training of an LLM, but a large number of popular and/or prolific authors can.

You bring up a good point, interpretation of fair use is difficult, but at the end of the day I really don't think we should abolish copyright and IP altogether. I think it's a good thing that creative professionals have some security in knowing that they have legal protections against having to "compete against themselves".


> An individual author doesn't make a meaningful contribution to the training of an LLM, but a large number of popular and/or prolific authors can.

That's a point I normally use to argue against authors being entitled to royalties on LLM outputs. An individual author's marginal contribution to an LLM is essentially nil, and could be removed from the training set with no meaningful impact on the model. It's only the accumulation of a very large amount of works that turns into a capable LLM.


Yeah, this is something I find kind of tricky. I definitely believe that AI companies should get permission from rightsholders to train on their works, but actually compensating them for their works seems pointless. To make the royalties worthwhile you'd have to raise the cost per query to an absolutely absurd level.


The amounts are not the only problem; there's no good way to measure which input in the training contributed to what degree to the output. I wouldn't be surprised if it turns out it's fundamentally impossible.

Paying everyone a flat rate per query is probably the only way you could do it; any other approach is either going to be contested as unfair in some way, or will be too costly to implement. But then, a flat rate is only fair if it covers everyone in proportion to their contribution, which will get diluted by the portion of training data that's not obviously attributable, like Internet comments or Wikipedia or public domain stuff or internally generated data, so I doubt authors would see any meaningful royalties from this anyway. The only thing it would do is make LLMs much more expensive for society to use.
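
Purely as a back-of-envelope illustration of that dilution (every number below is invented, just to show the shape of the problem):

    # Back-of-envelope only; every figure here is an assumption.
    queries_per_day = 100_000_000        # assumed global query volume
    royalty_per_query = 0.001            # $0.001 of each query set aside for authors
    attributable_share = 0.2             # fraction of training data traceable to named authors
    authors = 1_000_000                  # assumed number of identifiable rightsholders

    daily_pool = queries_per_day * royalty_per_query * attributable_share
    per_author_per_year = daily_pool / authors * 365
    print(f"${per_author_per_year:.2f} per author per year")  # ~$7.30

You can push the per-query royalty up until the payout looks meaningful, but then you've just made every query dramatically more expensive, which is the other horn of the dilemma.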


> it's a good thing that creative professionals have some security in knowing that they have legal protections

This argument would make sense if it was across the board, but it's impossible (and pretty ridiculous) to enforce in basically anything except very narrow types of media.

Let's say I come up with a groundbreaking workout routine. Some guy in the gym watches me for a while, adopts it, then goes on to become some sort of bodybuilding champion. I wouldn't be entitled to a portion of his winnings, that would be ridiculous.

Let's say I come up with a cool new fashion style. Someone sees my posts on insta and starts dressing similarly, then ends up with a massive following and starts making money in a modelling career. I wouldn't be entitled to a portion of their income, that would be ridiculous.

And yet, for some reason, media is special.


Honestly feels like the whole Soham Parekh thing on Twitter is one giant joke with the one sincere / honest remark being the original from @Suhail.

Like, I can't wrap my head around this many people having some kind of experience with a single guy whose claim to fame is basically gaming the interview process at an incredible number of Y Combinator startups.


Yeah, I'm surprised someone who's been working at over 50 companies in only 3 years wasn't caught sooner. Some of the stories are wild enough that they had to have been shared with others at the time.


Founders don't like to go around advertising that they got tricked by a scammer. They're trying to impress everyone and raise money. Telling the whole world that you got scammed is not a good look.


Years ago I was hired as an Engineering Manager at a small company, and within about the first month on the job I had to have the awkward conversation about firing two of the employees on my team.

See, it turned out that the boss worked remote 4-1/2 days of the week, and the employees were in the office.

One would show up at 10, take a two-hour lunch around 11:30, and leave around 2:30. He did not work remotely. This employee was, as always, behind on his work.

The second wasn't even a programmer. He just lied on his resume and got the job over beers and a handshake. He was a graphic designer who passed off a few WordPress template installs as his portfolio.

To keep a long story short, the owner spent months doing everything but firing the two employees, demanding I try to teach the designer some computer science while ignoring the other scammer. He refused to believe he had paid these men six figures for years on end, insisting that I must be coming in here to lie and wreck his company, that I had left my cushy high-frequency trading job just to ruin his startup.

When I asked the sole good engineer on the team what in the hell was going on at this company she simply told me “Oh, the old manager just did all the work for the other two guys since he was their buddy and hired them originally.”


It doesn't have to be the whole world, just their inner circle. If people are that reluctant to admit fault (50 times!), then that's a dismal statement on how often we see truth in society overall.


Yeah, I was about to say, it sounds a lot like this guy is just riding an intense high from getting Claude to build some side-project he's been putting off, which I feel is like 90% of all cases where someone writes a post like this.

But then I never really hear any update on whether the high is still there or if it's tapered off and now they're hitting reality.


For sure.

Fwiw, I use Claude Pro in my own side project. When it’s great, I think it’s a miracle. But when it hits a problem it’s a moron.

Recently I was fascinated to see it (a) integrate swagger into a golang project in two minutes with full docs added for my api endpoints, and then (b) spend 90 minutes unable to figure out that it couldn’t align a circle to the edge of a canvas because it was moving the circle in increments of 20px and the canvas was 150px wide.

Where it’s good it’s a very good tool, where it’s bad it’s very bad indeed.
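
That circle bug is pure arithmetic, which is what made it so maddening: with a 150px canvas and a 20px step, no whole number of steps ever lands on the edge. A toy reconstruction of the failure mode (my guess at the logic, not the actual generated code):

    # Toy reconstruction of the failure, not the real project code.
    canvas_width = 150
    step = 20

    reachable = list(range(0, canvas_width + 1, step))
    print(reachable)                      # [0, 20, 40, 60, 80, 100, 120, 140]
    print(canvas_width in reachable)      # False: multiples of 20 never hit 150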


It's a smart purchase, it's just that I don't see how these datasets factor into super-intelligence. I don't think you can create a super-intelligent AI with more human data, even if it's high-quality data from paid human contributors.

Unless we've watered down the definition of super-intelligent AI. To me, super-intelligence means an AI whose intelligence dwarfs anything theoretically possible from a human mind. Borderline God-like. I've noticed that some people refer to super-intelligent AI as simply AI that's about as intelligent as Albert Einstein in effectively all domains. In the latter case, maybe you could get there with a lot of very, very good data, but it's still a leap of imagination for me.


I think this is kind of a philosophical distinction to a lot of people: the assumption is that a computer that can reason like a smart person but still runs at the speed of a computer would appear superintelligent to us. Speed is already the way we distinguish supercomputers from normal ones.


I'd say superintelligence is more about producing deeper insight, making more abstract links across domains, and advancing the frontiers of knowledge than about doing stuff faster. Thinking speed correlates with intelligence to some extent, but at the higher end the distinction between speed and quality becomes clear.


If anything, "abstract links across domains" is the one area where even very low-intelligence AIs will still have an edge, simply because any AI trained on general text has "learned" a whole lot of random knowledge about lots of different domains; more than any human could easily acquire. But again, this is true of AIs no matter how "smart" they are. Not related to any "super intelligence" specifically.

Similarly, "deeper insight" may be surfaced occasionally simply by making a low-intelligence AI 'think' for longer, but this is not something you can count on under any circumstances, which is what you may well expect from something that's claimed to be "super intelligent".


I don't think current models are capable of making abstract links across domains. They can latch onto superficial similarities, but I have yet to see an instance of a model making an unexpected and useful analogy. It's a high bar, but I think that's fair for declaring superintelligence.

In general, I agree that these models are in some sense extremely knowledgeable, which suggests they are ripe for producing productive analogies if only we can figure out what they're missing compared to human-style thinking. Part of what makes it difficult to evaluate the abilities of these models is that they are wildly superhuman in some ways and quite dumb in others.


I think they can make abstract links across domains.

Like the prompt "How can a simplicial complex be used in the creation of black metal guitar music?" https://chatgpt.com/share/684d52c0-bffc-8004-84ac-95d55f7bdc...

It is really more of a value judgement of the utility of the answer to a human.

Some kind of automated discovery, across all domain pairs, of answers that a human would find useful seems almost like the definition of an intractable problem.

Superintelligence just seems like marketing to me in this context. As if AGI is so 2024.


> It's a high bar, but I think that's fair for declaring superintelligence.

I have to disagree because the distinction between "superficial similarities" and genuinely "useful" analogies is pretty clearly one of degree. Spend enough time and effort asking even a low-intelligence AI about "dumb" similarities, and it'll eventually hit a new and perhaps "useful" analogy simply as a matter of luck. This becomes even easier if you can provide the AI with a lot of "context" input, which is something that models have been improving at. But either way it's not superintelligent or superhuman, just part of the general 'wild' weirdness of AIs as a whole.


I think you misunderstood what I meant about setting a high bar. First, passing the bar is a necessary but not sufficient condition for superintelligence. Secondly, by "fair for" I meant it's fair to set a high bar, not that this particular bar is the one fair bar for measuring intelligence. It's obvious that usefulness of an analogy generator is a matter of degree. Eg, a uniform random string generator is guaranteed to produce all possible insightful analogies, but would not be considered useful or intelligent.

I think you're basically agreeing with me. Ie, current models are not superintelligent. Even though they can "think" super fast, they don't pass a minimum bar of producing novel and useful connections between domains without significant human intervention. And, our evaluation of their abilities is clouded by the way in which their intelligence differs from our own.


I don't know about "useful" but this answer from o3-pro was nicely-inspired, I thought: https://chatgpt.com/share/684c805d-ef08-800b-b725-970561aaf5...

I wonder if the comparison is actually original.


Comparing the process of research to tending a garden or raising children is fairly common. This is an iteration on that theme. One thing I find interesting about this analogy is that there's a strong sense of the model's autoregressiveness here in that the model commits early to the gardening analogy and then finds a way to make it work (more or less).

The sorts of useful analogies I was mostly talking about are those that appear in scientific research involving actionable technical details. Eg, diffusion models came about when folks with a background in statistical physics saw some connections between the math for variational autoencoders and the math for non-equilibrium thermodynamics. Guided by this connection, they decided to train models to generate data by learning to invert a diffusion process that gradually transforms complexly structured data into a much simpler distribution -- in this case, a basic multidimensional Gaussian.

I feel like these sorts of technical analogies are harder to stumble on than more common "linguistic" analogies. The latter can be useful tools for thinking, but tend to require some post-hoc interpretation and hand waving before they produce any actionable insight. The former are more direct bridges between domains that allow direct transfer of knowledge about one class of problems to another.
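
To make that diffusion example concrete: the "forward" half of that recipe is nothing more than repeatedly adding Gaussian noise until the data is indistinguishable from noise, and the model is trained to run the process in reverse. A toy version (schedule values are made up, numpy only):

    import numpy as np

    # Toy forward diffusion: structured data drifts toward a standard Gaussian.
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=1000)       # stand-in for "complexly structured" data
    betas = np.linspace(1e-4, 0.05, 200)    # assumed noise schedule

    for beta in betas:
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1 - beta) * x + np.sqrt(beta) * noise   # one diffusion step

    print(round(x.mean(), 3), round(x.std(), 3))  # close to mean 0, std 1
    # The generative model learns to invert these steps, recovering
    # structure from pure noise.

The point being that the cross-domain analogy there was load-bearing, not decorative.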


> The sorts of useful analogies I was mostly talking about are those that appear in scientific research involving actionable technical details. Eg, diffusion models came about when folks with a background in statistical physics saw some connections between the math for variational autoencoders and the math for non-equilibrium thermodynamics.

These connections are all over the place, but they tend to be obscured and disguised by gratuitous divergences in language and terminology across different communities. I think it remains to be seen if LLMs can be genuinely helpful here, even though you are restricting to a rather narrow domain (math-heavy hard sciences) and one where human practitioners may well have the advantage. It's perhaps more likely that as formalization of math-heavy fields becomes more widespread, these analogies will be routinely brought out as a matter of refactoring.


My POV, speed + good evaluation are all you need. Infinite monkeys and Shakespeare.


> It's a smart purchase, it's just that I don't see how these datasets factor into super-intelligence.

It's a smart purchase for the data, and it's a roadblock for the other AI hyperscalers. Meta gets Scale's leading datasets and gets to lock the other players out of purchasing them. It slows down OpenAI, Anthropic, et al.

These are just good chess moves. The "super-intelligence" bit is just hype/spin for the journalists and layperson investors.


> These are just good chess moves. The "super-intelligence" bit is just hype/spin for the journalists and layperson investors.

Which is kind of what I figured, but I was curious if anyone disagreed.


Super-intelligent game-playing AIs, for decades, were trained on human data.


I'll believe that AI is anywhere near as smart as Albert Einstein in any domain whatsoever (let alone science-heavy ones, where the tiniest details can be critical to any assessment) when it stops making stuff up with the slightest provocation. Current 'AI' is nothing more than a toy, and treating it as super smart or "super intelligent" may even be outright dangerous. I'm way more comfortable with the "stochastic parrot" framing, since we all know that parrots shouldn't always be taken seriously.


Earlier today in a conversation about how AI ads all look the same, I described them as 'clouds of usually' and 'a stale aftertaste of many various things that weren't special'.

If you have a cloud of usually, there may be perfectly valid things to do with it: study it, use it for low-value normal tasks, make a web page or follow a recipe. Mundane ordinary things not worth fussing over.

This is not a path to Einstein. It's more relevant to ask whether it will have deleterious effects on users to have a compliant slave at their disposal, one that is not too bright but savvy about many menial tasks. This might be bad for people to get used to, and in that light the concerns about ethical treatment of AIs are salient.


> I'm way more comfortable with the "stochastic parrot" framing, since we all know that parrots shouldn't always be taken seriously.

First, comfort isn't a great gauge for truth.

Second, many of us have seen this metaphor and we're done with it, because it confuses more than it helps. For commentary, you could do worse than [1] and [2]. I think this comment from [2] by "dr_s" is spot on:

    > There is no actual definition of stochastic parrot, it's just a derogatory
    > definition to downplay "something that, given a distribution to sample
    > from and a prompt, performs a kind of Markov process to repeatedly predict
    > the most probable next token".
    >
    > The thing that people who love to sneer at AI like Gebru don't seem to
    > get (or willingly downplay in bad faith) is that such a class of functions
    > also include thing that if asked "write me down a proof of the Riemann
    > hypothesis" says "sure, here it is" and then goes on to win a Fields
    > medal. There are no particular fundamental proven limits on how powerful
    > such a function can be. I don't see why there should be.
I suggest this: instead of making the stochastic parrot argument, make a specific prediction: what level of capabilities are out of reach? Give your reasons, too. Make your writing public and see how you do. I agree with "dr_s" -- I'm not going to bet against the capabilities of transformer based technologies, especially not ones with tool-calling as part of their design.

To go a step further, some counter-arguments take the following shape: "If a transformer of size X doesn't have capability C, wait until they get bigger." I get it: this argument can feel unsatisfying to the extent it is open-ended with no resolution criteria. (Nevertheless, increasing scale has indeed been shown to make many problems shallow!) So, if you want to play the game honestly, require specific, testable predictions. For example, ask a person to specify what size X' will yield capability C.

[1]: https://www.lesswrong.com/posts/HxRjHq3QG8vcYy4yy/the-stocha...

[2]: https://www.lesswrong.com/posts/7aHCZbofofA5JeKgb/memetic-ju...


btw, question

Isn't stochastic parrot just a modern reframing of Searle's Chinese room, or am I oversimplifying here?


Fair enough. If you aren't willing to give your friend $14 billion to join your company so you can hang out more, then are you two really friends?


Wang didn't get $14b, he only owns about 15% of Scale. We also don't know how much he sold. He could have sold all of his stock (netting him around $4.5b), none, or something in the middle.


I could see a friend giving me 4% of his net worth on those terms, esp. if I might return a few percent, in the fullness of time.


It also seems like a sort of different situation with Zuck because, I’m pretty sure, he’ll still be able to get by with only 96% of his net worth.


If a friend doesn’t give you 4% of their net worth, how can you be certain you are one of their 25 closest friends?


I don't actually think this is the case, but nonetheless I think it would be kind of funny if LLMs somehow "discovered" linguistic relativity (https://en.wikipedia.org/wiki/Linguistic_relativity).


Related, when o3 finally came out ARC-AGI updated their graph because it didn’t perform nearly as well as the version of o3 that “beat” the benchmark.

https://arcprize.org/blog/analyzing-o3-with-arc-agi


The o3-preview test was with very expensive amounts of compute, right? I remember it was north of $10k, so it makes sense it did better.


Point remains though, they crushed the benchmark using a specialized model that you’ll probably never have access to, whether personally or through a company.

They inflated expectations and then released to the public a model that underperforms.


They revealed the price points for running those evaluations. IIRC the "high" level of reasoning cost tens of thousands of dollars if not more. I don't think they really inflated expectations. In fact a lot of what we learned is that ARC-AGI probably isn't a very good AGI evaluation (it claims to not be one, but the name suggests otherwise).

