Mezzanine from Massive Attack, The Miseducation of Lauryn Hill, Big Pun’s Capital Punishment, You’ve Come a Long Way Baby from Fatboy Slim, and Hello Nasty from the Beastie Boys. The K&D Sessions from Kruder and Dorfmeister. Stunt by Barenaked Ladies. Then there’s one of my personal favorites, Mermaid Avenue from Billy Bragg and Wilco.
You mentioned rock. How about Hellbilly Deluxe from Rob Zombie? Follow the Leader by Korn. The New Radicals released Maybe You’ve Been Brainwashed Too. Garbage’s Version 2.0, and Lenny Kravitz 5, and Van Halen III. What’s more rock than Walking into Clarksdale by Page & Plant?
The US dollar being the reserve currency means printing more of it will add to inflation slightly slower than otherwise. It doesn’t mean we can just add trillions or quadrillions of dollars of liquidity to the economy with no added productivity and see no impact.
The US dollar is now also the oil exchange currency, and the US is a major oil producer so harder to squeeze. Remember what else happened in 1973 and 1979?
I find it distasteful and disturbing that copyright infringement by the people training the LLM in violation of a license is considered contamination by the licensed code. It’s not contamination. The code didn’t seep into your codebase. If the LLM was trained in such a way that it reproduces portions of code long enough to be protectable, then the license was violated by humans. The liability for the problem doesn’t lie on the shoulders of the contributors to the originally licensed code. It lies on the people inserting it into your codebase without following the terms of the license.
The article also singles out the GPL repeatedly as a source of contamination. It doesn’t mention source-available proprietary licenses. It doesn’t mention code put online with no clear license, which according to the Berne Convention and the laws in at least the United States is automatically copyright protected with no license for use by others at all. It doesn’t talk about attribution for BSD-style or CC-SA-Attribution licenses. There’s no mention of leaked proprietary code. It just singles out GPL as some sort of unique problem.
This seems quite shoddy and biased for an article by someone who’s writing about the law.
If the trained LLM spits out large, recognizable portions of licensed code and you use it in your product, don’t count on that case to keep you from having to defend yourself in court. The court found in Bartz v. Anthropic that training was fair use. They also found that pirating content to train against was not fair use, and Anthropic paid $1,500,000,000 in a settlement.
There are licenses on most software source code. If you redistribute works derived from that code, you must abide by those licenses or you are violating the copyright. That’s what’s meant by “piracy” here.
Now if you have an LLM that has trained on code and learned to actually write new software, only small snippets too short to be protected by copyright should be identical between the training material and the output. However, if you’re getting output that is substantial in size and recognizably derivative from the original that’s an issue that hasn’t yet as far as I’m aware been settled in court. One would hope the major player LLMs don’t copy and paste large functional chunks of existing programs.
It would certainly seem to me that the code you sell after using an LLM should meet the same standards for difference in implementation as if it was written by a human. That should apply to both copyright protection and patent protection.
The seller of the code has no visibility on the training set of the LLM. If the situation you're describing ends up being illegal, responsibility should fall on the LLM provider to provide tools to detect such overlap with their training sets, and on the clients to run the tools.
The provider of the LLM should want to enable this and to take on that responsibility (I mean take it from the clients), otherwise no one will want to use the tool. Maybe there could be AI tool-use lawsuit insurance, but I feel like that's worse than the copyright infringement detection tool for everyone involved.
I can see the tool happening in the EU, but basically nowhere else. That goes especially for the US, where the government sees "AI dominance" as a national priority and a national security priority.
I find it pretty horrible that a company can pay a mere fine that is a small percentage of its total funding in exchange for materially benefiting from a conspiracy to commit a series of criminal acts.
If Anthropic hadn’t pirated training materials would they even exist? Would they still have been as competitive?
Would they still have gotten every bit of VC funding in anticipation of future successes derived in part from past crimes?
What’s next? Armed bank robbery when VC funding dries up?
Also, fair use is much more limited in the EU. Don't know how it applies here or if there were any rulings. Are you going to stop doing business with the EU (and Japan etc.)?
I thought fair use was decided on a case by case basis, and could not be guaranteed? If true, wouldn't that mean that in other cases it could be ruled differently?
I don't have the exact ruling in front of me, but IIRC the judge pretty clearly said that training a model was fair use. IIRC, he declared it "quintessentially transformative".
The case by case basis was about acquisition and possession of the copyrighted material. Anthropic pirated a large number of books and illegally stored digital copies of many that they did purchase legally. The training being protected doesn't give them the right to violate copyright in that way.
Google, for example, purchased print versions of their training material and had a small army of employees digitize them and then delete the digital copies when they were done. That hasn't been challenged AFAIK, but would likely have been found to be not a violation. That's I think what was meant by case by case basis.
It's like if someone breaks into my house and I shoot them with my gun, that's very likely self defense, but if I'm not allowed to own a gun, I may still end up in trouble with the law.
Whether or not you’re pirating and making illegal copies of something depends greatly on the terms under which you’re allowed to make those copies. You can copy GPL-licensed code all day every day so long as you abide by the license. The same is true of the BSD licenses, MIT, ISC, Apache, et cetera.
If you’re copying or making substantially derivative works of them outside the terms of the license, you’re violating the copyright.
> If you’re copying or making substantially derivative works of them outside the terms of the license, you’re violating the copyright
I don't disagree with that.
What I'm saying is that the judge ruled that training a model using copyrighted books wasn't derivative. It was transformative, so the training wasn't a copyright violation.
He then went on to say that the way Anthropic acquired and handled that material was a copyright violation because Anthropic pirated and copied a large number of books that were not under a license like the ones you mentioned. They downloaded a bunch of books you would find at most bookstores and then actually purchased copies of them much later once they were accused of violating copyrights.
I'm just trying to make that clear because I've heard a lot of people who don't understand that the violation wasn't about the act of training or material they used, it was just how they acquired the training material.
That was one case in front of one judge. It’s weak precedent if it’s precedent at all.
Also, the reasoning behind it being transformative instead of derivative is that the output isn’t supposed to be large, unchanged chunks of the input. There’s no actual guarantee your small model run under OpenClaw won’t recreate whole modules of the input.
Other than putting something into the public domain I don't really know any open source licence that doesn't require at least attribution. One can assume that 99.9% of training data had some sort of license requirements, so just blindly using it is a copyright violation. People just don't seem to care.
It is probably fair to say that a huge share of FOSS code is licensed under the GPL, much larger than the share of source-available proprietary licensed code.
That's the wrong metric, however. Thousands of small pet repos are unlikely to have more code than a single Chromium repo (mostly LGPL), Linux, Qt, etc.
Is GPL a larger share of source out there than BSD, MIT, ISC, CC, BSL, Apache, and source available combined? Enough bigger that it is repeatedly mentioned as a singular issue without so much as the words “or other licenses”?
You would assume that there is more proprietary code available to read on the internet than GPL code? Do you have any rationale for that assumption?
Basically all GPL code is available on the web and there is a vast amount of it. I barely see any current non-FOSS code on the internet, although I think it would be fair to count the big projects who have been using pseudo-OSS licenses lately as proprietary. Wouldn't a safer assumption be a ratio of 10:1 or 100:1 for lines of GPL vs. lines of "shared source?"
Basically true if you add "to paying customers", there's no obligation to publish it otherwise. You can even sell your GPL software on DVD if you like.
Rather, I suspect viral copyleft is why this lawyer is focusing on the GPL. It's the only(?) FOSS license that can force a proprietary codebase into the open.
Slashdot is still online and updating. Some of us still use optical drives from time to time (especially those of us with an existing stock of M-Disc media for long-term archiving).
Gateway was purchased by Acer, so it’s not like they just disappeared. Not any more than DEC, SBC, or Studebaker anyway. They were just absorbed.