"the larger the trial size, the smaller the outcome"
I find this a bit surprising. Could there be something else affecting the accuracy of larger trials? Perhaps they are not as careful, or are cutting corners somewhere?
Maybe. That could be the case. But even ignoring all confounding factors, this phenomenon can arise from chance alone; you can reproduce it with numerical experiments. It's one of the meanings of "the Law of Small Numbers".
Sure, it could just be luck. But if there are several successful small studies and several unsuccessful large ones (no idea whether that's the case here), we should probably look for a better explanation.
It doesn't require a better explanation: publication bias means null results don't make it into the literature; run enough small, low-quality trials and you'll find a big effect sooner or later.
Then the supposed big effect attracts attention and, ultimately, properly designed studies, which show no effect.
Just my hypothesis, but I wonder if larger sample sizes capture a more diverse population.
A study with 1,000 individuals is likely a poor representation of a species of 8.2 billion. I understand that studies do their best to use a diverse population, but I often question how successful many of them are at this endeavor.
If that's the case, we should question whether different homogeneous population subgroups respond differently to the substance under test. After all, we don't want to know the "average temperature of patients in a hospital", do we?
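To make that "hospital average" point concrete, here's a toy example (Python, with made-up numbers, not from any actual study): pooling two subgroups that respond in opposite directions gives a mean near zero, even though neither subgroup is unaffected.

```python
# Toy illustration (made-up numbers): an overall average can hide
# opposite responses in two homogeneous subgroups.
group_a = [0.8, 0.9, 1.0, 0.7]      # responds positively to the substance
group_b = [-0.9, -0.8, -1.0, -0.7]  # responds negatively to it

pooled = group_a + group_b
print(sum(pooled) / len(pooled))    # ~0.0 -> "no effect on average"
print(sum(group_a) / len(group_a))  # +0.85 in subgroup A
print(sum(group_b) / len(group_b))  # -0.85 in subgroup B
# The pooled mean is the "average temperature of patients in a hospital":
# a real number that describes nobody in the study.
```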
No, the other way around. It's the combination of two well-known effects. Well, three if you're uncharitable.
1. Small studies are more likely to give anomalous results by chance. If I pick three people at random, it's not that surprising if I happen to get three women. It would be a very different story if I sampled 1,000 people.
2. Studies that show any positive result tend to get published, and ones that don't tend to get binned.
Put those together and you see a lot of tiny studies with spurious positive results (the quick simulation below makes this concrete). When you do a proper study, the effect goes away. Exactly as you would expect.
The less charitable effect is "they made it up". It happens.
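To see how points 1 and 2 combine, here's a simulation sketch (Python; all parameters are made up for illustration): a treatment with zero true effect, tested in many small trials and a few large ones, where only results over an arbitrary "exciting" threshold get published.

```python
import random

def run_trial(n, true_effect=0.0, noise=1.0):
    """Mean observed effect over n participants; the true effect is zero."""
    return sum(random.gauss(true_effect, noise) for _ in range(n)) / n

random.seed(0)
small = [run_trial(20) for _ in range(200)]   # many cheap, tiny trials
large = [run_trial(2000) for _ in range(5)]   # a few properly powered trials

threshold = 0.3                               # only "exciting" results get written up
published = [e for e in small if e > threshold]

print(f"small trials run: {len(small)}, 'published': {len(published)}")
print(f"mean published effect: {sum(published) / len(published):.2f}")
print(f"large-trial effects: {[round(e, 2) for e in large]}")
# Typical outcome: a handful of tiny trials clear the bar purely by chance
# and report a sizable effect, while the large trials sit near zero.
```

Run it with different seeds and the pattern holds: the "published" small-trial average sits well above zero even though the true effect is exactly zero.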
Absolutely. They have dramatically worsened the world, with little to no net positive impact. Nearly every positive impact (if not all of them) has an associated negative that dwarfs it.
LLMs aren't going anywhere, but the world would be a better place if they hadn't been developed. Even if they had more positive impacts, those would not outweigh the massive environmental degradation they are causing or the massive disincentive they created against researching other, more useful forms of AI.
IMO LLMs have been a net negative on society, including my life. But I'm merely pointing out the stark contrast on this website, and the fact that we can choose to live differently.
I am not anti-AI, nor unhappy about how any current LLM works. I'm unhappy about how AI is used and abused to our collective detriment. LLM scraper spam leading to increased centralization and failures with wider impact is just one example.
Your position is similar to saying that medical drugs have been a net negative on society, because some drugs have been used and abused to collective detriment (and other negative effects, such as doctors prescribing pills instead of suggesting lifestyle changes). Does it mean that we would be better off without any medical drugs?
My position is that the negatives outweigh the positives, and I don't appreciate your straw man response. It's clear your question is not genuine and you're here to be contrarian.
A solid secondary option is making LLM scraping for training opt-in, and/or compensating sites that were/are scraped for training data. Hell, maybe then you wouldn't knock websites over and incentivize them to use Cloudflare in the first place.
But that means LLM researchers would have to respect other people's IP, which hasn't been high on their to-do lists so far.
bUt ThAT dOeSn'T sCaLe - not my fuckin problem chief. If you as an LLM developer are finding your IP banned or you as a web user are sick of doing "prove you're human" challenges, it isn't the website's fault. They're trying to control costs being arbitrarily put onto them by a disinterested 3rd party who feels entitled to their content, which it costs them money to deliver. Blame the asshole scraping sites left and right.
Edit: and you wouldn't even need to go THAT far. I scrape a whole bunch of sites for some tools I built and a homemade news aggregator. My IP has never been flagged because I keep the number of requests down wherever possible, and rate-limit them so they're more in line with human-like browsing. Like so much of this could be solved with basic fucking courtesy.
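Not the poster's actual setup, just a minimal sketch of what that kind of courtesy looks like (Python with the standard library plus `requests`; the site, user agent, and delay below are all hypothetical): check robots.txt, identify yourself, and space out requests.

```python
import time
import urllib.robotparser
import requests

BASE = "https://example.com"                        # hypothetical target site
USER_AGENT = "my-hobby-aggregator/1.0 (contact: me@example.com)"
DELAY_SECONDS = 5                                   # far slower than a bulk crawler

robots = urllib.robotparser.RobotFileParser()
robots.set_url(f"{BASE}/robots.txt")
robots.read()

def polite_get(path):
    """Fetch a page only if robots.txt allows it, then pause before the next request."""
    url = f"{BASE}{path}"
    if not robots.can_fetch(USER_AGENT, url):
        return None                                 # respect the site's stated wishes
    resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    time.sleep(DELAY_SECONDS)                       # crude rate limit: one request at a time
    return resp

for page in ["/news", "/news?page=2"]:
    r = polite_get(page)
    if r is not None:
        print(page, r.status_code)
```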
Not to speak for the other poster, but... That's not a good-faith question.
Most of the problems on the internet in 2025 aren't because of one particular technology. They're because the modern web was based on gentleman's agreements and handshakes, and since those things have now gotten in the way of exponential profit increases on behalf of a few Stanford dropouts, they're being ignored writ large.
CF being down wouldn't be nearly as big of a deal if their service wasn't one of the main ways to protect against LLM crawlers that blatantly ignore robots.txt and other long-established means to control automated extraction of web content. But, well, it is one of the main ways.
Would it be one of the main ways to protect against LLM web scraping if we investigated one of the LLM startups for what is arguably a violation of the Computer Fraud and Abuse Act, arrested their C-suite, and sent each member to a medium-security federal prison (I don't know, maybe Leavenworth?) for multiple years after a fair trial?
I'm sure there will be an investigation... by the SEC, when the bubble pops and takes the S&P with it. No prison though; probably jobs at the next Ponzi scheme.
b) Do you have a citation for that? My understanding is that while some weights can go to zero and effectively be removed, no network architecture or training method actually used in production allows arbitrary connections.
Mainly because the global video data corpus is more than 100,000 times larger than the global text corpus, so you would need to train much larger models for much longer than current LLMs.