Hacker News | danielheath's comments

Usually after you solve the POW challenge, sites let you make a lot of requests before asking you to complete another.

They showed up when the AI money did. The evidence is circumstantial, but… some of them are remarkably well engineered (from a “how difficult is it to identify this traffic” perspective), in a way that never existed before. (I have been running a quite sizeable site for 8 years, with over 200k registered users, and you don’t need to register to use 99% of it.)

I run a quite large website and there are a few patterns.

The usage is extremely quick, and follows easy-to-spot patterns. We noticed a spike in bounce rate.

They never come from Google, and the badly programmed ones just crawl several pages at a time, faster than a user could.

Then there are the crazy spikes in visits from specific countries, pretty much scraping the entire content, often from pools of IPs. In some cases we had unexplained (meaning: it wasn't viral or a marketing campaign) sustained traffic increases of 30%.

There's also the fact they don't interact with the complicated widgets, so zero XHR requests other than analytics pings.

They also don't cause spikes in Google Analytics, so I assume it's blocked, but they show up in logs and in the internal analytics.

It's not enough to DDOS the website at all, but it's a lot of noise in statistics that we gotta learn to filter.
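A rough sketch of how the patterns above might be turned into a filter for statistics (not for blocking). Everything named here is hypothetical: the `Session` fields, the `looks_automated` helper, and the thresholds are illustrative assumptions, not what the site actually runs.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """Hypothetical per-visitor summary built from server logs."""
    page_times: list[float]  # timestamps (seconds) of page requests
    xhr_count: int           # XHR requests other than analytics pings
    referrer: str            # referrer of the first request, "" if none


def looks_automated(s: Session, min_gap: float = 1.0, min_pages: int = 5) -> bool:
    """Flag sessions that fetch several pages faster than a person could,
    never touch the interactive widgets, and did not arrive from Google.
    The thresholds are assumptions, not measured values."""
    if len(s.page_times) < min_pages:
        return False
    times = sorted(s.page_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    too_fast = max(gaps) < min_gap       # every page came within min_gap of the last
    no_widgets = s.xhr_count == 0        # zero non-analytics XHR
    from_google = "google." in s.referrer
    return too_fast and no_widgets and not from_google
```

Requiring all three signals at once is a deliberate choice: any single one of them is easy for a real person to trip.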


> They never come from Google, and the badly programmed ones just crawl several pages at a time, faster than a user could.

I’ve triggered this kind of “bot protection” right here on Hacker News many times. I did that by having a bunch of Hacker News pages open and then closing and reopening my browser. I’ve also triggered it by opening a bunch of links in the background too quickly. I’ve also triggered it by reading the article, then clicking back and upvoting/favouriting too quickly. I’m also located in Singapore, which people have started to advocate for blocking here recently.

A single non-bot legitimate user can easily trigger these kinds of heuristics just by using the site in a way you don’t expect. This can affect some users disproportionately more than others, e.g. disabled people who need to use assistive technology.


Oh I also do this all the time.

What I mean by "too fast" is opening 50 pages in the span of two or three milliseconds.

Either way, I'm not blocking. The CDN is handling the traffic alright.


I hate that sort of thing - when I rolled my own proof-of-work bot protection (providers wanted $$$$), I set it up so that

A) you'd have to open >200 tabs to trigger it, and B) if any tab solves the proof-of-work, any that are still waiting to do so reload in the background.
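For anyone unfamiliar with the idea, here is a minimal hashcash-style sketch of proof-of-work, assuming SHA-256 and a leading-zero-bits target. It is not the setup described above: the function names and the challenge format are made up for illustration, and it doesn't cover the cross-tab reload, which would live in browser code.

```python
import hashlib
import os


def make_challenge() -> str:
    """Server side: a random challenge string (hex)."""
    return os.urandom(16).hex()


def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a hash digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce.
    Expected cost is roughly 2**difficulty hashes."""
    nonce = 0
    while not verify(challenge, nonce, difficulty):
        nonce += 1
    return nonce
```

The asymmetry is the point: solving takes thousands of hashes, checking takes one.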


Yes, circumstantial is exactly the point; it's easy to use AI as a scapegoat because it's something popular to hate on.

It's circumstantial evidence, but Occam's Razor also applies.

It's not a hostile DOS in the traditional sense (I've mitigated a few of those) - no "pay us to make it stop", no pattern to the requests other than "fetch every unique URL a few times".

It wasn't happening until financial incentives to gather large datasets for AI training appeared.

Bad actors (using residential proxies & claiming to be a real browser) mostly showed up after folk started blocking ones that identified themselves as AI scrapers.

It's obvious to blame AI training because there's a shortage of better explanations. Who else would be paying for these (expensive) residential botnets, only to use them to (eg) web-scrape wikipedia (which offers free downloads of its content in a structured format)?

The simplest explanation of the technical behavior is "a bot coded to follow every link it sees & save the results", and the simplest explanation of the motive to run such a bot is "to train a large language model".


> no "pay us to make it stop"

"use Cloudflare to make it stop"


Or fastly, or akamai, or bunny, or any number of other providers.

Cloudflare are merely the cheapest of the bunch.


Exactly. They (and most of all, Big G) stand to profit greatly from this browser discrimination. What better way to make more sites use them than by launching DDoS attacks in the name of "AI scraping"?

Spent a bit of time with it.

Biggest peeve so far: It's very easy to build the 'premium' version of a building (eg police HQ instead of police station) and utterly annihilate your city budget - with no ability to cancel / undo.

"Click on the correct-looking-but-actually-wrong button functionally ends your game" is... not great.

Other than that, I'm really enjoying it.


I see. I'm adding an option to "disable" a building, so you'll be able to suspend a city service due to budget constraints or any other reason.

I more meant "I had 200m in the bank and accidentally spent 170m of it to start constructing a police HQ".

If you bulldoze a building when it's only 2% of the way through being constructed, you should probably be able to get 98% of the resources spent back - whether housing demand, money, whatever you spent.

Possibly 100% back until you hit 5%, just to allow for those mis-clicks.
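Roughly what I have in mind, as a sketch. The `bulldoze_refund` name, the `progress` fraction, and the `grace` parameter are my own inventions here, not anything from the game's code.

```python
def bulldoze_refund(cost: float, progress: float, grace: float = 0.05) -> float:
    """Refund for demolishing a building that is `progress` (0.0 to 1.0) complete.

    Below the grace threshold the full cost comes back, to forgive mis-clicks.
    After that, only the unbuilt fraction is refunded.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0 and 1")
    if progress < grace:
        return cost
    return cost * (1.0 - progress)
```

With the 170m police HQ bulldozed at 2% built, this gives the full 170m back. The same rule could apply to any other resource spent, such as housing demand.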


One has 1:2 fanout, the other has 1:50 fanout.


Yes, we should be.

My computer should run programs when I tell it to run them.

Don’t blunt _every_ tool just to make them harder to cut yourself on.


I hope you're in the very small minority of people who rigorously manage untrusted downloads and whitelist every binary, because you're operating an appliance from the 1970s, sticking a metal fork into an un-earthed toaster. Most people need help from their operating system.


then we, the very small minority, want a button to disable that help.


Increased metadata isn't tool blunting in itself though, even if MacOS uses it for being... annoying is one way of saying it.

Provenance information bundled into a file is not the worst idea in the world IMO. We have created/modified timestamps on files already, right? There's definitely the question of "why" but hey if more of my binaries just had at least a tag about who put them there that would be a win in my book.

Not an argument for doing what MacOS does, just an argument that the info would be nice to have.


I sincerely agree. By the way, thanks for lending your machine for my "Network-Retransmission-and-Compute-as-a-service" network.


It’s not blunting a tool, it’s sheathing it. Modern software requires too much proxied trust for this attitude to work.


I mean, it’s been tried; a reading of relevant historical texts would give you lots of ammunition to support either argument.


The product is automation; the tech is llms


It feels snappy _compared with most sites_.

That's the point!

It feels _very_ sluggish if I try it after spending some time using a windows 98 VM, or a library catalog from 1990.


Not if you were browsing the internet of that time using a 28k8 modem.


Every time there's a new discussion of some arm board, I compare the price / features / power use with the geekom n100 SBC I picked up a while back.

As far as I can tell, the OrangePi 6 remains distinctly uncompetitive with SBCs based on low-end intel chips.

- Orange pi consumes much more power (despite being an arm CPU)
- A bit faster on some benchmarks, a bit slower on others
- Intel SBC is about 60% the price, and comes with case + storage
- Intel SBC runs mainline linux and everything has working drivers


I get what you’re saying, but Chaucer was not in _my_ lifetime.

