
1 trillion web pages archived is quite an achievement. But...there's no way to search them? You have to know which URL you want to pull from the archive, which reduces the usefulness of the service. I'd like to search through all those trillion pages for, say, the name of an artist, or for a filename, or for image content.


That would be hell to index


I imagine it would be no different from current indexing strategies with a temporal aspect baked in: each snapshot would act almost like a separate site, and the results could be rolled up by domain after the fact.
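To make that concrete, here's a rough sketch of what "a temporal aspect baked in" could look like: a toy in-memory inverted index whose postings carry a capture timestamp, so the same URL can match once per snapshot and hits can be filtered by date and rolled up per page. All names here are made up for illustration; this says nothing about how the Wayback Machine is actually built.

```python
# Toy inverted index with a temporal aspect: every posting records when the
# snapshot was captured, so searches can be limited to a date range and the
# hits for one URL can be rolled up across captures. Illustrative only.
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Posting:
    url: str
    captured_at: datetime


class TemporalIndex:
    def __init__(self):
        # term -> set of (url, capture time) postings
        self._postings = defaultdict(set)

    def add_snapshot(self, url: str, captured_at: datetime, text: str) -> None:
        for term in text.lower().split():
            self._postings[term].add(Posting(url, captured_at))

    def search(self, term, since=None, until=None):
        """Return {url: [postings]} for a term, optionally limited to a date range."""
        hits = [
            p for p in self._postings.get(term.lower(), set())
            if (since is None or p.captured_at >= since)
            and (until is None or p.captured_at <= until)
        ]
        by_url = defaultdict(list)
        for p in sorted(hits, key=lambda p: p.captured_at):
            by_url[p.url].append(p)  # roll up all captures of the same page
        return dict(by_url)


if __name__ == "__main__":
    idx = TemporalIndex()
    idx.add_snapshot("http://example.com/gallery", datetime(2009, 5, 1),
                     "gallery page for the artist jane doe")
    idx.add_snapshot("http://example.com/gallery", datetime(2015, 8, 3),
                     "gallery page, artist section removed")
    # Only the 2009 capture falls inside the requested date range.
    print(idx.search("artist", until=datetime(2010, 1, 1)))
```

A real engine would of course use a date-typed field per document rather than Python sets, but the shape of the problem is the same: capture time is just one more filter on the postings.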


If it were a commercial problem, e.g. one Google had, it would be solved.

The reality is that many things don't exist simply because no one is paid to build them.


Given how much AI companies have benefited by leeching off of IA and Common Crawl, it's a shame there isn't at least some money flowing back in.


I remember this functionality existing on Kagi or something. But I can't find it.


Consider the privacy implications of that. It would effectively create a parallel web where `robots.txt` counts for nothing and where it becomes - retroactively - impossible to delete one's site. Yes, there's ultimately no way to prevent it happening, given that the data is public. But to make the existing IA searchable is IMO just a terrible idea.


Actually, I believe the IA respects robots.txt retroactively, e.g. putting something on the disallow list now removes scrapes of that same page from a year ago from public access in the Wayback Machine, but I'd love to be corrected on that.


IIRC the IA no longer cares about robots.txt after it kept getting abused [1] to take down older pages. You can still request to take down pages, but it needs a form and a reason. [2]

(Remember, robots.txt is not a privacy measure; it's supposed to be something that prevents crawlers from getting stuck in tar pits! There's a small sketch of this after the links below.)

[1] https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

[2] https://help.archive.org/help/how-do-i-request-to-remove-som...
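On that tar-pit point: robots.txt is purely advisory, a set of hints that polite crawlers choose to consult before fetching. A minimal sketch using Python's stdlib parser (the rules, bot name, and domain below are made up):

```python
# robots.txt is advice to crawlers, not access control: a well-behaved bot
# asks the parser before fetching, but nothing technically stops a fetch.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 10
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

for path in ("/index.html", "/private/diary.html"):
    ok = parser.can_fetch("ExampleBot", f"https://example.com{path}")
    print(f"{path}: {'fetch' if ok else 'skip (by convention only)'}")
```

Which is also why it was never a privacy mechanism: whether the IA honors it, retroactively or otherwise, is a policy choice, not something the protocol enforces.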


Useful to know. My more general position, which apparently is not much shared here, is that removing one's site from the internet has historically meant that the site stops being accessible, stops being indexed, and stops being findable with a simple search. If, going forward, we're going to revise that norm, IMO it would be polite at least to respect it retroactively.


That seems in conflict with the idea that once something's been released, it can't ever truly be unreleased.


It may do. I remember looking into it and not getting a definitive answer. The issue here is that taking a site offline has surely been widely understood as the ultimate robots.txt `Disallow` instruction to search engines. IMO we should respect that.


Related: https://wiki.archiveteam.org/index.php/Robots.txt

(Also, consider that when you forbid such functionality, the only thing that happens is that its development becomes private. It's like DRM: it only hurts legitimate customers.)


I use GPT web search, and I usually ask it to find textbooks from IA. It works really well for textbooks, but I'm not sure about web pages in general.



