Hacker News | ascorbic's comments

Using them was allowed as fair use – it was the downloading of the pirated copies that was infringement. That's why Anthropic switched to scanning paper books.

> That's why Anthropic switched to scanning paper books.

After they threw away all the tainted data from the pirated books, right?


No, because the judge ruled that the training was fair use and the model itself wasn't infringing.

That sounds pretty applicable to this case, right? _Access_ to Claude is illicit, but distilling is not. Distilling is fair use.

Yes, as part of the settlement

Have you a source for that? Because everything I've read tells me that they paid out a settlement, but there's no mention of deleting the training data or the models that were tainted, e.g. [0]

[0] https://www.theguardian.com/technology/2025/sep/05/anthropic...


> Using them was allowed as fair use

That is only relevant in the US, and even there it is still not clear-cut whether the fair use doctrine applies to all these scenarios. Outside of the US the situation is also quite different: for example, take a look at the recent ruling on GEMA vs OpenAI in Germany.

The reality is that the copyright issue with generative AI is very complex and reaching anything resembling a conclusion will take much more than a few opinion paragraphs from an American district judge.


Isn't scanning also a form of copyright infringement? You are making a digital copy of a book, which is the same thing as downloading a book from the internet...

I think that we can run a perhaps silly thought experiment.

Suppose that I have a nearly perfect memory and I could remember all the books I read. Suppose also that I have a million year life span so I could read 7 million books. Then, what happens if at the end of all of those years, or at any earlier moment, I answer questions from people and I exploit commercially the knowledge I gathered reading those books? Would my reading those books be study or copyright infringement? Remember the nearly perfect memory hypothesis.

Of course it's a bit silly because the time to train an LLM and the time I need to read all those books differ by orders of magnitude, and that changes the perspective. Who would complain to me today if their heirs lose some money in 7 million AD? Who would even notice that I started that million-year-long endeavor? Who's going to be there to ask me questions by then? Humans? Birds? Lizards? And I can say that I am studying like everybody else before me, but does an LLM study? And I am sure there are many other nuances.

Anyway, I don't think that scanning is any different than photons hitting my retina. The difference is in what happens next: the faithfulness of memory, the amount of knowledge, the speed of accumulating it. After all a huge amount of quantity can become quality.


Can I pay for a movie, hit record, sleep in the theatre and play it back when I get home? I pinky promise that I will close my eyes while recording. It's still the same photons hitting my own camera retina.

Many of us here are software developers by profession or hobby, and we know better than regular folks that scale changes everything and can break our assumptions and our business if we design something for the wrong scale.

Yet why do we still want to insist that a human and a machine are the same and that the same rules apply when it comes to AI, though we know they operate at different speeds and scales?


This is a bit of a trick question. The law is explicitly written to make this illegal. If it was not explicit, it most likely would be legal by time shifting precedent.

https://www.law.cornell.edu/uscode/text/18/2319B


The illegal part would be reciting the stuff you memorized to other people. Copyright doesn’t prevent you from making a copy as long as you don’t distribute it afaik.

Copyright is about exclusive publication, production, sale, or distribution.

An LLM is just a really, really big, really, really elaborate "choose your own adventure" book.

You aren't a book.


> Suppose also that I have a million year life span

But that's what makes the usual analogies with humans fail from the start. The laws were made with the assumption that they apply to humans, which are a known quantity. This breaks down when you apply them to systems with vastly increased (and ever increasing) capabilities.

> Anyway, I don't think that scanning is any different than photons hitting my retina.

If I ask you 10 years from now to give me a completely accurate depiction of what your retina registered yesterday at 5:52 PM, will you be able to? And can you give me a copy?


The thought experiment falls apart immediately by the mere fact that, even given all the other fantastical abilities such as perfect memory and impossible lifespan, you can still only answer one question at a time. As has been repeated ad nauseam, scale puts a hard stop on the comparison of LLMs to humans.

Let’s switch up your scenario. Let’s say the subject isn’t a human with machine-like qualities but instead a computer with human-like limitations. All the books were fed to that one computer, and for technical reasons it cannot be duplicated and can only answer one question at a time. Suddenly the infringement isn’t as problematic and the ways to commercially exploit that data are minimal.

Furthermore, even with perfect memory it would take time to read all those books, you’d never keep up with everything released in a single year. Nor would you be able to reproduce everything perfectly due to required time and lack of ability (perfectly recalling a painting or photograph does not mean you have the skills to make an exact copy).

All these comparisons are silly and useless anyway (though in your particular case I think you are arguing in good faith). Computers are not human. If a person was caught killing animals of an endangered species and used as a defence “but what about the natural predators in that habitat? I’m just doing the same as them”, we’d rightfully see through the bullshit and scoff at such an obviously flawed comparison.


TLDR: It's just like a human, if a human were fundamentally different.

How is it different than reading the book, and writing down a copy, and publishing it as your work? Even without selling it, but then on top, selling it too. It isn't. There is no thought experiment that absolves the copyright and citation laundering.

And the systematic nature of the excerpt service makes the excerpts different from fair use quotes. A reference quote is not a service that can reproduce the entire work, and the reference quote cites the actual source of the insight/wisdom/research/poetry/etc.

The only thought experiment is why might someone even try to excuse this activity? I can think of a few.


No, there is a famous law case to prove that's allowed: https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

Copyright protects the presentation of knowledge, not the knowledge itself, which is uncopyrightable in almost all jurisdictions.

As long as the book was a legal copy, that is allowed legally.


Here we have a 15% limit on scanning for fair use

As long as it is destructive, and the digital copy is access-restricted to equal the licenses or physical copies destroyed, then it falls under fair use.

I'm pretty sure every book I've seen has a page that says you're not allowed to copy/scan/photograph it.

that per-se doesn't mean you are bound by it.

> That's why Anthropic switched to scanning paper books.

Could they not just subscribe to the academic publishers like universities do? Or buy eBooks? I don't understand how the "scanning" part is relevant here other than used physical books being cheaper perhaps?


Bulk second-hand books are a lot cheaper than ebooks. Also not all books are available as ebooks, and ebooks have terms of service that presumably prevent them being used for training.

If using the books is fair use, then distilling the model, which is just a derived product of those books is also fair use.

These companies are trying to have their cake and eat it too.


Hmm, training on a book’s text smears the content all over the weights, merging it with all other texts. The original text isn’t intentionally supposed to be reproducible in any larger part (although IIRC models were able to emit fairly large chunks verbatim).

Quite unlike that, training on behavior purportedly approximately replicates the behavior. It gets replicated intentionally as a whole.

IANAL, but I see a significant difference here: intent to copy a significant part as a whole into a competing product surely shouldn’t fit under the legal concept of fair use, no matter whether scanning books for LLM training fits or not.

Whether such things (behaviors) are copyrightable - and should they be so - is another interesting question. Those aren’t algorithms or databases (stuff clearly and explicitly covered in many copyright laws), those are human expectation models, something like how we train animals or teach our own.


It's the exact same training process for both of your examples. I don't really see how you can claim books are not replicated, but that output from other LLMs is.

Process is the same, but intent is not. One intent is to extract information from the book for better general eloquence and overall awareness - not for replicating the book itself (ability to recall verbatim fragments is a side effect, not the goal). Another intent is to replicate the behavior, carry it over using training.

Again - IANAL - but in my understanding (and I spent some time reading up on this), the legal concept of fair use is all about the intent behind how copyrighted material is used. It's all copying or distribution, but the law does make a distinction about what and why.


> Hmm, training on a book’s text smears the content all over the weights, merging it with all other texts. The original text isn’t intentionally supposed to be reproducible in any larger part (although IIRC models were able to emit fairly large chunks verbatim).

I agree with that, however that doesn't make the output copyrightable then.

I think these AI companies live in a legal fantasy where they can take any content they want, put it into the mixer without caring about copyright and then what comes out of it is somehow copyrighted.

They have to pick one or the other: either the content's copyright taints the model, or it doesn't but the model isn't subject to copyright.

> those are human expectation models, something like how we train animals or teach our own.

But more importantly, made by machines, and one of the requirements for copyright is the human factor.


> I agree with that, however that doesn't make the output copyrightable then.

It could be - databases are copyrightable. It's long established that if you put some effort into categorizing and processing information, you get rights for that work. Basically, you can get rights to a phone book or a map, even if individual bits are not copyrightable. You can also get rights on a compilation or a catalog of other copyrighted works - although original authors' rights remain. But - there's a legal trick to avoid liability even if you infringe: fair use doctrine.

> where they can take any content they want, put it into the mixer without caring about copyright and then what comes out of it is somehow copyrighted.

Yes. It's not a legal fantasy, though - that's what they had actually pulled off, as far as I understand it (and, again, IANAL, just a layman who's interested in this stuff a little bit). They argued their work is so highly transformative to allow fair use doctrine to shield from liability on copyright infringement claims. And courts seemed to agree, making this fantasy a reality. Just because that's how legal system works.

Model is still a derived work (AFAIK there's no legal way to clear that) of all the books and articles and whatever else is copyrightable (plus a ton more of non-copyrightable stuff), but there's no liability for training on all that stuff, because courts had ruled - and that happens on a case-per-case basis - that it falls under fair use.

And there's the difference: now Anthropic argues that copying the behavior verbatim is not transformative enough to shield Alibaba from liability by invoking fair use. Now it's up to the courts (if they sue and don't just do the PR dances) to check it out.

Disclaimer: first, I'm not a legal expert, and second - I'm not arguing whether anything is right or wrong, just mapping what happened or being argued to what I know about copyright.


> I think these AI companies live in a legal fantasy where they can take any content they want, put it into the mixer without caring about copyright and then what comes out of it is somehow copyrighted.

The mixer you're talking about is what they seem to claim to be transformative use, no? Unless I'm misunderstanding something, it's not a legal fantasy.


> The mixer you're talking about is what they seem to claim to be transformative use, no? Unless I'm misunderstanding something, it's not a legal fantasy.

If it's transformative use, then it's transformative use of ... what exactly? Copyrighted works? I think the law is pretty clear on what happens on transformative use of copyrighted works.


> it's transformative use of ... what exactly? Copyrighted works?

Yes. Among other stuff, but non-copyrighted stuff is not exactly an issue so it can be left out of our focus most of the time.

> I think the law is pretty clear on what happens on transformative use of copyrighted works.

Ah, if only - it's not. You could be mixing it up with concept of derived work - that's where the law is pretty clear (I think). AFAIK (IANAL), transformativeness is merely a suggestive factor for fair use consideration, and then it's all "whatever court decides" with a bunch of guidelines and precedents.


Probably, yes. It's likely just a breach in their terms of service. You'll note that they're not suing them – they're trying to get the government to do their work for them.

In a different world it is not fair use. The benefits of the crime should be always taken off. If you isolate the training and pirating, you may say that it was fair, but that completely misses the point. The sole purpose of pirating (aka crime) was to train the models.

Copyright infringement isn't usually a crime.

Yet you can get jailed?

You don't need to use a lawyer to draw up the docs unless you have special requirements: you can use the proforma memorandum (it's auto-filled if you apply online) and adopt the model articles of association.

- https://assets.publishing.service.gov.uk/media/5a7da236e5274...

- https://www.gov.uk/guidance/model-articles-of-association-fo...


There are dozens of us!

Sure, somebody else holds your identity, but it's pretty easy to control it yourself. By its nature if you're using somebody to host your stuff, you're trusting them with it. I made Cirrus so you can self-host your PDS for free, but you still need to trust Cloudflare to run it.

It’s great that things like this exist, but as long as this is how identities work on ATProto it’s unfortunately going to be a niche thing.

The easy way is to create an account on Bluesky. That's as simple as creating an account on any other social network. The important bit though is that you can then, if you want, migrate to a different host and take your identity with you, or if you're brave you can self-host. This is by its nature something for power users, but regular users can use Bluesky accounts without ever knowing or caring about PDSs, DIDs or signing keys etc.

You can host it for free on Cloudflare using my Cirrus PDS: https://cirrus.earth/

For anyone curious about the issues discussed in this topic, see:

Identity and your signing key https://cirrus.earth/concepts/identity/

> Cloudflare secrets are write-only: once set, they cannot be retrieved through the dashboard or the API. This is good for security and bad for recovery. The wizard prints the key exactly once during pds init. Save it then.

Also:

Back up your signing key https://cirrus.earth/guides/back-up-signing-key/


It's also worth noting that if you're using `did:plc` for your identity (the default for Bluesky and most places) then you can rotate your signing key if you lose it. It's only `did:web` that makes it impossible to recover or rotate a key.
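
To make the two identity types a bit more concrete, here is a minimal sketch of where the DID document (which lists the current signing key) would be fetched from for each method. It assumes the public PLC directory at `plc.directory` and the standard `did:web` convention of a `/.well-known/did.json` file on the domain; the function name `did_document_url` is hypothetical, and path-based `did:web` identifiers are deliberately not handled.

```python
from urllib.parse import unquote

# Assumption: the public PLC directory; other deployments may use another host.
PLC_DIRECTORY = "https://plc.directory"


def did_document_url(did: str) -> str:
    """Return the URL where the DID document for `did` would be fetched."""
    if did.startswith("did:plc:"):
        # did:plc documents are served by a directory service.
        return f"{PLC_DIRECTORY}/{did}"
    if did.startswith("did:web:"):
        ident = did[len("did:web:"):]
        if ":" in ident:
            # Extra colon-separated segments denote a path; out of scope here.
            raise ValueError("path-based did:web is not handled in this sketch")
        # A port, if any, is percent-encoded in the DID ("%3A" for ":").
        host = unquote(ident)
        return f"https://{host}/.well-known/did.json"
    raise ValueError(f"unsupported DID method: {did}")


print(did_document_url("did:plc:abc123"))       # placeholder identifier
print(did_document_url("did:web:example.com"))  # placeholder domain
```

The identifiers in the two `print` calls are placeholders, not real accounts.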

Brilliant, so happy to learn about this!

> Cloudflare may play this smart: force bots to pay for access

https://developers.cloudflare.com/agents/tools/payments/


Most people reading this, then

Since sync 1.1 last year, you can run a relay on a relatively small VPS


There are:

- 221 with over 5 accounts

- 74 with over 20 accounts

- 19 with over 250 accounts

- 8 with over 1000 accounts.

And only a handful of those have open signups (13 with open signups have >50 users).

Many of them are actually ActivityPub instances with a PDS bridge, e.g., https://join.wafrn.net/

And most of the other open signup instances are also primarily designed as their own social network, just using AT proto as a compatibility layer, e.g., https://sprk.so/ https://haruhwa.com/ (which is an invite-based, snapchat-style ephemeral social network), https://surf.social/, https://pckt.blog/ (a microblogging platform), aesthetic.computer (a collaborative programming/art platform)

That leaves only bluesky, blacksky, eurosky, selfhosted.social, self.surf and npmx.social.

Even during Facebook's heyday, the unsuccessful diaspora/friendica/gnu social/etc networks had more decentralization than that.


Spark, pckt (and leaflet, tangled) etc aren't using atproto as a compatibility layer: they're fully-fledged apps built on the network.

Right, but I think a typical Bluesky user never sees data from there in their feeds?

Don't get me wrong, I think it's pretty cool that you can run all these different apps and have them store their data on their own PDSes. And theoretically it's possible for everyone in my Bluesky feed to be on their own PDS and use different apps. But the question from a Mastodon point of view is: is that the case in practice, and if not, how likely is it that there will ever be a significant portion of non-Bluesky posts in an average microblog feed, on atproto?


The decision as to whether mass screening is justified or not is complex, and varies a lot by test/condition/population etc. Luckily there are lots of smart people whose job it is to do these calculations.

In your list, 1-4 are common enough, the tests are accurate enough, the costs of intervention are low enough and the benefits of early intervention are high enough to justify screening, which is why they do generally screen for them, at least in higher risk groups. The other two are more mixed, which is why mass screening is less common.

All the evidence for full body scans is that they are not justified for asymptomatic people. The false positives are high, the costs of these false positives are high, and the improved outcomes are too low to justify them. If you want one, go ahead, but realise that almost anything it finds is likely to be either a false positive or something unlikely to ever cause a problem, and you'd have to deal with the worry and invasive tests and even surgery in aid of something that may never cause any trouble.
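
The core of that calculation can be sketched with Bayes' rule. This is only an illustration: the prevalence, sensitivity and specificity below are made-up numbers, not figures for any real test or condition, and real screening decisions also weigh costs and outcomes that this ignores.

```python
def positive_predictive_value(prevalence: float,
                              sensitivity: float,
                              specificity: float) -> float:
    """Probability that a person with a positive result actually has the condition."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


# Hypothetical inputs: a rare condition (0.5% of those screened) and a test
# that catches 90% of cases and correctly clears 95% of healthy people.
ppv = positive_predictive_value(prevalence=0.005,
                                sensitivity=0.90,
                                specificity=0.95)
print(f"Chance a positive result is real: {ppv:.1%}")  # prints 8.3%
```

With these assumed numbers, more than nine in ten positives are false, even though the test looks accurate on paper. Raising the prevalence (for example by screening only a higher risk group) is what moves the result most.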

