
The current generation of AI code generators is loaded with ethics and copyright problems. Plus, there's the "copying something without understanding" angle.

So far, the most advanced tools have been template-based generators and real-time static checkers plus language servers. AI makes things way more complicated than they need to be.

It's not only GPL code bleeding into MIT projects. It's also source-available code bleeding into AI models. These things leak their training data like crazy: ask the right question and you get functions from the training set verbatim.

Combined, this is a huge problem. It's not that these problems are individually OK; they're huge already. The resulting problem is the sum of already-huge problems.



Everyone in the industry has been copying code from Stack Overflow (and generally shitting on the concept of IP altogether) for years and nobody cared, but the moment LLMs come out, everybody is a copyright stickler. Give me a break.


> Everyone in the industry

I doubt this generalization applies to codebases like the *BSDs or the Linux kernel.


The code on Stack Overflow is already licensed under Creative Commons, plus people put their code there with the intent that it be shared and used.

There are GPL projects which provide people their livelihoods, because they get grants to develop that code under a GPL license. Ingesting the same code into a model sans its license not only infringes on the license, but lets this code seep into places where it shouldn't (by design), and puts that developer's livelihood in jeopardy.

Companies frowned upon the GPL for years because of the liability it carries, and now they can feast on that code through these models.

The same goes for source-available repositories. Those companies put their code out for viewing only, not for reproduction and introduction into other codebases. These systems infringe on those licenses too, and attack those companies' business models.

I'm not a copyright stickler. I just respect people and the choices they made about their code.

P.S.: I can share the tweets of that researcher if you want.


> The code on Stack Overflow is already licensed with Creative Commons

It is CC-BY-SA so it requires attribution (+ share alike).[1] That is the hard part with code written by LLMs.

[1] https://stackoverflow.com/help/licensing


Considering all the code I write is GPL-licensed and I always write a comment on top of SO-inspired code blocks with their respective URLs, I don't think I'm doing anything wrong.

Update: CC BY-SA 4.0 lists GPLv3 as a compatible license, and I use GPLv3 exclusively, so I'm in the clear.
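
To illustrate, here is a minimal sketch of what that attribution convention can look like at the top of a GPLv3-licensed C++ file. The URL and the helper function are placeholders for illustration, not a real answer:

    // SPDX-License-Identifier: GPL-3.0-or-later
    //
    // The function below is adapted from a Stack Overflow answer
    // (https://stackoverflow.com/a/XXXXXXX -- placeholder, link the actual answer).
    // SO content is licensed CC BY-SA 4.0, which lists GPLv3 as a compatible
    // license, so keeping this file under GPLv3 satisfies both sides.
    #include <string>

    // Hypothetical SO-inspired helper, shown only to mark where the comment goes.
    std::string trim_whitespace(const std::string& input);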

You're right that LLMs know neither the provenance nor the licenses of the things they generate. I think that part of the problem is ignored not just because it's hard, but because it's convenient to ignore, too.


If my subordinate were copying code from Stack Overflow without attribution, I would be annoyed enough to send a grouchy email. Behavior like that is bad hacker citizenship, and bad for long-term maintenance. You should at least include a hyperlink to the SO question.

I also think SO is different when it comes to mindless copy-pasting. Outside of rote beginner stuff, it's infrequent that someone has the exact same question as you and that the best answer works by simple copy-pasting. Often the modification is simple enough that even GPT can do it :) But making sure the SO question is relevant, and modifying the answer accordingly, is a check on understanding that LLMs don't really have. In particular, an SO answer might be "wildly wrong" syntactically but essentially correct semantically. LLMs can give you the exact opposite problem.


The last time I copied something from SO it was this:

> DENSE_RANK() OVER (ORDER BY TotalMark DESC) AS StudRank

And then I filled in my column names and alias. This is 90% of what is happening with LLM/SO copying. Copy/paste of syntax like this absolutely does not need a link or attribution, and is in no way copyrightable in the first place.


This is not 90% of what's happening with LLMs. Everyone I've seen using LLMs was requesting whole algorithms, or "program boilerplate" that isn't really boilerplate at all but is full of logic.

Case in point: https://x.com/docsparse/status/1581461734665367554

This is not akin to copying a 2-line trick from SO.

On the other hand, the most significant piece of code I copied from SO was using two istream iterators to automatically tokenize an incoming string. 5-6 lines at most.

That block has a 10+ line comment on top of it that not only explains how it works, but also links to the original answer on SO.
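
For reference, a minimal, self-contained sketch of that kind of block, assuming std::istream_iterator over a std::istringstream (the common SO approach, not the poster's actual code; the attribution comment is a placeholder):

    #include <iostream>
    #include <iterator>
    #include <sstream>
    #include <string>
    #include <vector>

    int main() {
        // Adapted from a Stack Overflow answer on whitespace tokenization
        // (placeholder -- the real file would link the original answer's URL).
        // istream_iterator<std::string> reads whitespace-separated tokens from
        // the stream until it compares equal to the default-constructed
        // end-of-stream iterator.
        std::string input = "tokenize this incoming string";
        std::istringstream stream(input);
        std::vector<std::string> tokens(std::istream_iterator<std::string>{stream},
                                        std::istream_iterator<std::string>{});

        for (const auto& token : tokens)
            std::cout << token << '\n';
    }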



