Hacker News | bndr's comments

I run a small startup called SEOJuice, where I need to crawl a lot of pages all the time, and I can say that the biggest issue with crawling is the blocking and how much you need to invest to circumvent Cloudflare and similar services just to get access to any website. Bandwidth and storage are the smallest cost factors.

Even though, in my case, users add their own domains, it still took me quite a bit of time to reach a 99% chance of crawling a website — with a mix of residential proxies, captcha solvers, rotating user-agents, and stealth Chrome binaries; otherwise I would get a 403 immediately with no HTML being served.
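For the curious, the rotation layer is conceptually simple. A minimal sketch (the user-agent strings, proxy URLs, and retry limit are made up, and a fetch callback stands in for the real HTTP client):

```python
import itertools

# Illustrative pools -- real lists would be much larger and refreshed often.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
]
PROXIES = ["http://res-proxy-1:8080", "http://res-proxy-2:8080"]

def attempt_plan(max_attempts=4):
    """Yield (user_agent, proxy) pairs to try, one per retry."""
    pairs = itertools.cycle(
        (ua, proxy) for ua in USER_AGENTS for proxy in PROXIES
    )
    return [next(pairs) for _ in range(max_attempts)]

def fetch_with_rotation(fetch, url, max_attempts=4):
    """fetch(url, ua, proxy) -> (status, body); rotate identity on a 403."""
    for ua, proxy in attempt_plan(max_attempts):
        status, body = fetch(url, ua, proxy)
        if status != 403 and body:
            return status, body
    return 403, None
```

The captcha-solver and stealth-browser pieces slot in behind the `fetch` callback; the loop above only captures the "keep changing who you appear to be until something works" idea.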


Very interesting!

Yes, in this day and age, I can definitely see web pages being harder to crawl by search engines (and SEO companies, AI agents, and other users of automated web-crawling technology) than they were in the early days of the Internet, due to many possible causes -- many of which you've excellently described!

In other words, there's more to be aware of for anyone writing a search engine (or search-engine-like software -- SEO tools, AI agents, etc.) than there was in the early days of the Internet, when everything was plain unencrypted HTTP and most URLs were easily accessible without having to jump through additional hoops...

Which leads me to wonder... on the one hand, a website owner may not want bots and other automated software agents spidering their site (we have robots.txt for this), but on the flip side, most business owners DO want publicity and easy accessibility for sales and marketing purposes; thus, they'd never want to issue a 403 (or other error code) for any public-facing product webpage...

Thus there may be a market for testing public-facing business/product websites for faulty "I can't give you that web page for whatever reason" error codes, from a wide variety of clients in a wide variety of locations around the world.

That market is related to the market for testing if a website is up and functioning properly (the "uptime market"), again, from a wide variety of locations around the world, using a wide variety of browsers...

So, a very interesting post!

Also (for future historians!) compare all of the restrictive factors which may prevent access to a public-facing web page today vs. Tim Berners-Lee's original vision for the web, which was basically to let scientists (and other academic types!) SHARE their data PUBLICLY with one another!

(Things have changed... a bit! :-) )

Anyway, a very interesting post, and a very interesting article -- for both present and future Search Engine programmers!


> the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar … mix of residential proxies, captcha solvers, rotating user-agents, stealth chrome binaries

I would like to register my hatred and contempt for what you do. I sincerely hope you suffer drastic consequences for your antisocial behavior.


Please elaborate: why exactly is it antisocial? Because Cloudflare decides who can or can't access a user's website, even when that user specifically signed up for my service?

But how does that work?

Does Cloudflare force firewall rules for those who choose to use it for their websites?

If the tool that does the crawling identifies itself properly, does Cloudflare block it even if users do not tell Cloudflare to block it?


It intentionally circumvents the explicit desires of those who own the websites being exploited. It is nonconsensual. It says “fuck you, yes” to a clearly-communicated “please no”.

OP literally said that users add their domains, meaning they are explicitly ASKING OP to scrape their websites.

Users sign up for my service.

You employ residential proxies. As such, you enable and exploit the ongoing destruction of the Internet commons. Enjoy the money!

Cloudflare and Big Tech are primary contributors to the impairment and decline of the Internet commons for moats, control, and profit; you are upset at the wrong parties.

Why not both?

This is kind of like getting upset with people who go to ATMs because drug dealers transact in cash lol.

I would argue that the ability to crawl and scrape is core to the original ethos of the internet, and all the hoops people jump through to block non-abusive scraping of content are in fact more anti-social than circumventing those mechanisms.

I wonder if circumvention is legal. It's so odd. In the US it seems you can just do this, whereas if I started something like this in the EU, I don't think I could.

In Italy it's a crime punishable by up to 12 years in prison to access any protected computer system without authorization, especially if it causes a DoS for the owner.

Consider the case of self-hosting a web service on a low-performance server while abusive crawlers loop endlessly fetching data (which was happening when I was self-hosting GitLab!).

https://www.brocardi.it/codice-penale/libro-secondo/titolo-x...


Can't your users just whitelist your IPs?

I'm in a similar boat, and getting customers to whitelist IPs is always a big ask. In the best case they call their "tech guy"; in the worst case it's a department far away, and the request has to go through 3 layers of review before someone adapts some Cloudflare / Akamai rules.

And then you'd better make sure your IP is stable and that your cloud provider isn't changing any IP assignments in the future, or you'll have to contact all your clients again with the same ask.


They're mostly non-technical/marketing people, but yes that would be a solution. I try to solve the issue "behind the scenes" so for them it "just works", but that means building all of these extra measures.

Would it make sense to offer the more technically minded customers a discount if they set up an IP whitelist, with a tutorial you could provide? A discount in exchange for reduced costs to you?

The right solution is to be registered with Cloudflare, but then getting the customer to reach the person who handles their Cloudflare settings (a few clicks) is the hard part.

Blocking seems really popular. I wonder if it coincides with Stack Overflow closing.

Just stop scraping. I'll do everything to block you.

> in my case, users add their own domains

Seems like they're only scraping websites their clients specifically ask them to


Now you've gamified it :)

It's a pretty easy game to win as the blocker. If an IP generates too many 404s by requesting pages that don't exist, just ban it for a month. I actually got the idea from a Hacker News comment. I'm also thinking that if you crawl too many pages you should get banned as well.

There's no point in playing tug of war against unethical actors, just ban them and be done with it.

I don't think it's an uncommon opinion to behave this way, nor are the crawlers users I want to help in any capacity.
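A minimal sketch of that ban rule, with made-up thresholds (a real deployment would persist state and expire the counters, not keep them in a dict):

```python
import time
from collections import defaultdict

BAN_SECONDS = 30 * 24 * 3600   # "a month" -- illustrative
MAX_404S = 20                  # tolerance before banning -- illustrative

class NotFoundBanner:
    """Ban an IP once it racks up too many 404s on nonexistent pages."""

    def __init__(self):
        self.miss_count = defaultdict(int)
        self.banned_until = {}

    def is_banned(self, ip, now=None):
        now = time.time() if now is None else now
        return self.banned_until.get(ip, 0) > now

    def record_404(self, ip, now=None):
        now = time.time() if now is None else now
        self.miss_count[ip] += 1
        if self.miss_count[ip] > MAX_404S:
            self.banned_until[ip] = now + BAN_SECONDS
            self.miss_count[ip] = 0
```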


So you're blocking the absolute bottom of the barrel scrapers and feel like you 'won' because you don't even notice any scraper that isn't complete trash.

Then again, why block them if they don't cause any issues in the first place? Instead of going ballistic on IPs you don't vibe with, you could also just do proper rate limiting.
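For reference, "proper rate limiting" here can be as simple as a per-IP token bucket; this is an illustrative sketch (rate and capacity are made-up numbers), not anyone's production config:

```python
class TokenBucket:
    """Per-IP token bucket: refill `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate=5.0, capacity=20.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = {}   # ip -> remaining tokens
        self.last = {}     # ip -> last request timestamp

    def allow(self, ip, now):
        # Refill proportionally to the time elapsed since the last request.
        tokens = self.tokens.get(ip, self.capacity)
        elapsed = now - self.last.get(ip, now)
        tokens = min(self.capacity, tokens + elapsed * self.rate)
        self.last[ip] = now
        if tokens >= 1.0:
            self.tokens[ip] = tokens - 1.0
            return True
        self.tokens[ip] = tokens
        return False
```

Unlike an outright ban, this throttles aggressive clients while leaving well-behaved ones (and legitimate users behind a shared IP) unaffected.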


If you think the game is played on a single IP address, you are not adept enough to be weighing in on this discussion.

What if the crawler is using a shared IP and you end up blocking legitimate users along with the bad actor?

He said "it's pretty easy", probably not realizing there are whole industries on both sides of that cat and mouse game, making it not easy.

I've been working on the same tool since 2024, when I figured it would be a good time to build a tool for all the people building their own tools -- eventually they will need to market them.

So I built an SEO/GEO Automation Tool for Small to Mid-Size Businesses who don't have a full-time team for that. [0]

The goal is to provide teams visibility across all channels — Search and AI — and give them the tools needed to outrank their competition. So far so good: the fully bootstrapped venture has grown over the last year, and I've built quite a few big features — a sophisticated audit system, AI Responses Monitoring, Crawler Analytics, Competitors Monitoring, etc.

[0] https://seojuice.io


Adding a bit of context as well: this started out as an internal linking tool but grew into something more based on customer feedback — the database has now reached about 10TB of data about keywords, pages, AI responses, etc., where I know who was ranking where and why.

And I'm trying to offer this "data advantage" to website owners so they can grow; it's also something that will be hard to replicate (at least quickly) with AI.


Oh wow, that was painful to read. I especially liked this analysis part:

> Different naming conventions (DW_OP_* vs DW_op_*)


Clearly not copied! Look at the case difference! Duh!


Hey HN!

I’m working on SEOJuice [1], an automated tool for internal linking and on-page SEO optimizations. It's designed to make life a little easier for indie founders and small business owners who don’t have time to dig deep into SEO.

So far, I’ve managed to scale it to $3,000 MRR, and recently made the move from the cloud to Hetzner, which has been a game-changer for cost efficiency. We’re running across multiple servers now, and handling everything from link analysis to on-page updates with a bit more control.

The journey’s been a mix of hands-on coding (and a lot of coffee) and constant optimization. It’s been challenging but incredibly fun to see how much can be automated without compromising on quality.

Happy to chat more about the tech stack or any of the growth pains if anyone’s interested!

[1] https://seojuice.io


Oh wow, my package on the front page again. Glad that it's still being used.

This was written 10 years ago when I was struggling with pulling and installing projects that didn't have any requirements.txt. It was frustrating and time-consuming to get everything up and running, so I decided to fix it — and apparently many other developers had the same issue.

[Update]: Though I do think the package is already at a level where it does one thing and does it well, I'm still looking for maintainers to improve it and move it forward.
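For anyone curious what a tool like this does under the hood, here's a simplified, hypothetical sketch: parse each .py file's AST and collect top-level import names. (The real package does more -- e.g., mapping import names to PyPI distribution names and filtering out the standard library.)

```python
import ast
from pathlib import Path

def imported_packages(source: str) -> set:
    """Collect top-level names from import statements in one source file."""
    tree = ast.parse(source)
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import os.path" -> "os"
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # skip relative imports ("from . import x"); keep "from flask import Flask"
            names.add(node.module.split(".")[0])
    return names

def scan_project(root: str) -> set:
    """Union of imported package names across all .py files under root."""
    pkgs = set()
    for path in Path(root).rglob("*.py"):
        pkgs |= imported_packages(path.read_text())
    return pkgs
```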


> This was written 10 years ago when I was struggling with pulling and installing projects that didn't have any requirements.txt

And 10 years later this is still a common problem!


Not if you use proper Python environment and package-management tools like pdm or Poetry.


Hey everyone, I know the HN community is very polarizing, and the discussions here are always great to read through, as both sides are always eager to prove the other wrong. I think we need more of that in the community: people not being afraid to disagree.

I'm really curious to hear your thoughts and experiences.


I have a bit of a niggle about the use of the word "polarizing". That word implies things that I think are harmful overall, such as being unwilling to work with people you disagree with.

That said, I agree that it's important to express your opinions and stand by things you think are right. It's equally important to listen to those who disagree with you and take what they say as additional data that may (or may not) lead you to modify your opinion. At the very least, openly and honestly listening to others will inform you as to why they have a differing opinion. "Everyone seems crazy if you don't understand their point of view."

Also, "compromise" isn't a dirty word. It's how we get anything done.


Looking for feedback.

Cheers :)


If you see any bugs, please let me know. The product is quite fresh.


Why not just retrain the Python team in another language? I mean, software engineers are not really language-specific; they can learn other languages if needed.


They were maintaining Python itself and were likely very well compensated (as one would expect). It'd be a waste to have these devs do product development.


All SWEs at the same level at Google are making the same compensation (with some exceptions for high-flying AI researchers). The Python SWEs certainly weren't making more than anyone else.


That's not true at all. (Excluding the factor of location,) the compensation of a SWE depends not only on level but also on tenure, on performance rating (and the history of ratings), and on stock-market fluctuations (whether the stock price was low or high when the stock was granted).

One of the rumors is that the better compensated you are at your level, the more likely you are to be targeted for layoff, because cutting you saves the most engineering cost.


None of those depend on the project you're working on, which is my point.


All SWEs at Google are well compensated. Not all of them would be a good fit for maintaining Python.


Then I don't understand the point you were making in your first post.


A waste of talent, not of cash.


That's the thing, it's not clear that the Python core engineers are more talented than other Google SWEs on average. You have all sorts of talented engineers working on all sorts of random projects within Google.


They have three months to find new roles/teams. Their employment only ends if they can't.


Like finding a lunch table to sit at on your first day of school


Assuming this wasn't financially motivated.


https://vadimkravcenko.com

Mostly I help developers grow — I share my thoughts as a CTO about building digital products, growing teams, scaling development, and in general being a good technical founder.

Some of the popular posts are:

- https://vadimkravcenko.com/shorts/things-they-didnt-teach-yo... - Things they didn't teach you at the university

- https://vadimkravcenko.com/shorts/project-estimates/ - Rules of thumb for Project Estimations

- https://vadimkravcenko.com/shorts/contracts-you-should-never... - Contracts you should never sign.

Most of the blog posts have ended up on the Frontpage here, here's the list: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

Cheers, Vadim

