
The current generation of AI code generators is loaded with ethics and copyright problems. Plus, there's the "copying something without understanding" angle.

So far, the most advanced tools have been template-based generators and real-time static checkers plus language servers. AI makes things way more complicated than they need to be.

It's not only GPL code bleeding into MIT projects. It's also source-available code bleeding into AI models. These things leak their training data like crazy: ask the right question and you get functions from the training set verbatim.

Combined, this is a huge problem. It's not that these problems are individually OK; they're huge already. The resulting problem is the sum of already-huge problems.



Everyone in the industry has been copying code from Stack Overflow (and generally shitting on the concept of IP altogether) for years and nobody cared, but the moment LLMs come out, everybody is a copyright stickler. Give me a break.


> Everyone in the industry

I doubt this generalization applies to codebases like the *BSDs or the Linux kernel.


The code on Stack Overflow is already licensed under Creative Commons, plus people put their code there with the intent that it be shared and used.

There are GPL projects which provide people their livelihoods, because they get grants to develop that code under a GPL license. Ingesting the same code into a model sans its license not only infringes on the license, but lets this code seep into places where it shouldn't (by design), and puts that developer's livelihood in jeopardy.

Companies frowned upon the GPL for years because of the liability it carries, and now they can feast on that code through these models.

The same goes for source-available repositories. Those companies put their code out for viewing only, not for reproduction and introduction into other codebases. These systems infringe on those licenses too, and attack those companies' business models.

I'm not a copyright stickler. I just respect people and the choices they made about their code.

P.S.: I can share the tweets of that researcher if you want.


> The code on Stack Overflow is already licensed with Creative Commons

It is CC-BY-SA so it requires attribution (+ share alike).[1] That is the hard part with code written by LLMs.

[1] https://stackoverflow.com/help/licensing


Considering all the code I write is GPL-licensed and I always write a comment on top of SO-inspired code blocks with their respective URLs, I don't think I'm doing anything wrong.

Update: CC BY-SA 4.0 lists GPLv3 as a compatible license, and I use GPLv3 exclusively, so I'm in the clear.
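
To illustrate, here is a minimal sketch of what that attribution convention can look like at the top of a GPLv3-licensed C++ file. The URL and the helper function are placeholders for illustration, not a real answer:

    // SPDX-License-Identifier: GPL-3.0-or-later
    //
    // The function below is adapted from a Stack Overflow answer
    // (https://stackoverflow.com/a/XXXXXXX -- placeholder, link the actual answer).
    // SO content is licensed CC BY-SA 4.0, which lists GPLv3 as a compatible
    // license, so keeping this file under GPLv3 satisfies both sides.
    #include <string>

    // Hypothetical SO-inspired helper, shown only to mark where the comment goes.
    std::string trim_whitespace(const std::string& input);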

You're right that LLMs know neither the provenance nor the licenses of the things they generate. I think that part of the problem is ignored not just because it's hard, but because it's convenient to ignore, too.


If my subordinate were copying code from Stack Overflow without attribution, I would be annoyed enough to send a grouchy email. Behavior like that is bad hacker citizenship, and bad for long-term maintenance. You should at least include a hyperlink to the SO question.

I also think SO is different when it comes to mindless copy-pasting. Outside of rote beginner stuff, it's infrequent that someone has the exact same question as you and that the best answer works by simple copy-pasting. Often the modification is simple enough that even GPT can do it :) But making sure the SO question is relevant, and modifying the answer accordingly, is a check on understanding that LLMs don't really have. In particular, an SO answer might be "wildly wrong" syntactically but essentially correct semantically. LLMs can give you the exact opposite problem.


The last time I copied something from SO it was this:

> DENSE_RANK() OVER (ORDER BY TotalMark DESC) AS StudRank

And then I filled in my column names and alias. This is 90% of what is happening with LLM/SO copying. Copy/paste of syntax like this absolutely does not need a link or attribution, and is in no way copyrightable in the first place.


This is not 90% of what's happening with LLMs. Everyone I've seen using LLMs was requesting whole algorithms, or "program boilerplate" that isn't really boilerplate at all but is full of logic.

Case in point: https://x.com/docsparse/status/1581461734665367554

This is not akin to copying a 2-line trick from SO.

On the other hand, the most significant piece of code I copied from SO was using two istream iterators to automatically tokenize an incoming string. 5-6 lines at most.

That block has a 10+ line comment on top of it that not only explains how it works, but also links to the original answer on SO.
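
For reference, a minimal, self-contained sketch of that kind of block, assuming std::istream_iterator over a std::istringstream (the common SO approach, not the poster's actual code; the attribution comment is a placeholder):

    #include <iostream>
    #include <iterator>
    #include <sstream>
    #include <string>
    #include <vector>

    int main() {
        // Adapted from a Stack Overflow answer on whitespace tokenization
        // (placeholder -- the real file would link the original answer's URL).
        // istream_iterator<std::string> reads whitespace-separated tokens from
        // the stream until it compares equal to the default-constructed
        // end-of-stream iterator.
        std::string input = "tokenize this incoming string";
        std::istringstream stream(input);
        std::vector<std::string> tokens(std::istream_iterator<std::string>{stream},
                                        std::istream_iterator<std::string>{});

        for (const auto& token : tokens)
            std::cout << token << '\n';
    }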



