
1 trillion web pages archived is quite an achievement. But...there's no way to search them? You have to know which URL you want to pull from the archive, which reduces the usefulness of the service. I'd like to search through all those trillion pages for, say, the name of an artist, or for a filename, or for image content.


That would be hell to index


I imagine it would be no different from current indexing strategies with a temporal aspect baked in: each snapshot would act almost like a separate site, and the results could be rolled up by domain after the fact.
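To make that concrete, here's a rough sketch of what "a temporal aspect baked in" could look like: a toy in-memory inverted index whose postings carry a capture timestamp, so the same URL can match once per snapshot and hits can be filtered by date and rolled up per page. All names here are made up for illustration; this says nothing about how the Wayback Machine is actually built.

```python
# Toy inverted index with a temporal aspect: every posting records when the
# snapshot was captured, so searches can be limited to a date range and the
# hits for one URL can be rolled up across captures. Illustrative only.
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Posting:
    url: str
    captured_at: datetime


class TemporalIndex:
    def __init__(self):
        # term -> set of (url, capture time) postings
        self._postings = defaultdict(set)

    def add_snapshot(self, url: str, captured_at: datetime, text: str) -> None:
        for term in text.lower().split():
            self._postings[term].add(Posting(url, captured_at))

    def search(self, term, since=None, until=None):
        """Return {url: [postings]} for a term, optionally limited to a date range."""
        hits = [
            p for p in self._postings.get(term.lower(), set())
            if (since is None or p.captured_at >= since)
            and (until is None or p.captured_at <= until)
        ]
        by_url = defaultdict(list)
        for p in sorted(hits, key=lambda p: p.captured_at):
            by_url[p.url].append(p)  # roll up all captures of the same page
        return dict(by_url)


if __name__ == "__main__":
    idx = TemporalIndex()
    idx.add_snapshot("http://example.com/gallery", datetime(2009, 5, 1),
                     "gallery page for the artist jane doe")
    idx.add_snapshot("http://example.com/gallery", datetime(2015, 8, 3),
                     "gallery page, artist section removed")
    # Only the 2009 capture falls inside the requested date range.
    print(idx.search("artist", until=datetime(2010, 1, 1)))
```

A real engine would of course use a date-typed field per document rather than Python sets, but the shape of the problem is the same: capture time is just one more filter on the postings.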


If it were a commercial problem, e.g. one Google had, it would be solved.

The reality is that many things don't exist simply because no one is paid to build them.


Given how much AI companies have benefited by leeching off of IA and Common Crawl, it's a shame there isn't at least some money flowing back in.


I remember this functionality existing on Kagi or something. But I can't find it.


Consider the privacy implications of that. It would effectively create a parallel web where `robots.txt` counts for nothing and where it becomes - retroactively - impossible to delete one's site. Yes, there's ultimately no way to prevent it happening, given that the data is public. But to make the existing IA searchable is IMO just a terrible idea.


Actually, I believe the IA respects robots.txt retroactively, e.g. putting something on the disallow list now removes scrapes of that same page from a year ago from public access in the Wayback Machine, but I'd love to be corrected on that.


IIRC the IA no longer cares about robots.txt after it kept getting abused [1] to take down older pages. You can still request to take down pages, but it needs a form and a reason. [2]

(Remember, robots.txt is not a privacy measure; it's supposed to be something that prevents crawlers from getting stuck in tar pits! There's a small sketch of this after the links below.)

[1] https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

[2] https://help.archive.org/help/how-do-i-request-to-remove-som...
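On that tar-pit point: robots.txt is purely advisory, a set of hints that polite crawlers choose to consult before fetching. A minimal sketch using Python's stdlib parser (the rules, bot name, and domain below are made up):

```python
# robots.txt is advice to crawlers, not access control: a well-behaved bot
# asks the parser before fetching, but nothing technically stops a fetch.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 10
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

for path in ("/index.html", "/private/diary.html"):
    ok = parser.can_fetch("ExampleBot", f"https://example.com{path}")
    print(f"{path}: {'fetch' if ok else 'skip (by convention only)'}")
```

Which is also why it was never a privacy mechanism: whether the IA honors it, retroactively or otherwise, is a policy choice, not something the protocol enforces.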


Useful to know. My more general position, which apparently is not much shared here, is that removing one's site from the internet has historically meant that the site stops being accessible, stops being indexed, and stops being findable with a simple search. If, going forward, we're going to revise that norm, IMO it would be polite at least to respect it retroactively.


That seems in conflict with the idea that once something's been released, it can't ever truly be unreleased.


It may do. I remember looking into it and not getting a definitive answer. The issue here is that taking a site offline has surely been widely understood as the ultimate robots.txt `Disallow` instruction to search engines. IMO we should respect that.


Related: https://wiki.archiveteam.org/index.php/Robots.txt

(Also, consider that when you forbid such functionality, the only thing that happens is that its development becomes private. It's like DRM: it only hurts legitimate customers.)


I use GPT web search, and I usually ask it to find textbooks from IA. It works really well for textbooks, but I'm not sure about web pages in general.



