Hacker News | bndr's comments

I run a small startup called SEOJuice, where I need to crawl a lot of pages all the time, and I can say that the biggest issue with crawling is the blocking and how much you need to invest to circumvent Cloudflare and similar services just to get access to any website. Bandwidth and storage are the smallest cost factors.

Even though, in my case, users add their own domains, it still took me quite a bit of time to reach a 99% chance of crawling a website — with a mix of residential proxies, captcha solvers, rotating user-agents, and stealth Chrome binaries; otherwise I would get a 403 immediately with no HTML being served.
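For the curious, the rotation layer is conceptually simple. A minimal sketch (the user-agent strings, proxy URLs, and retry limit are made up, and a fetch callback stands in for the real HTTP client):

```python
import itertools

# Illustrative pools -- real lists would be much larger and refreshed often.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
]
PROXIES = ["http://res-proxy-1:8080", "http://res-proxy-2:8080"]

def attempt_plan(max_attempts=4):
    """Yield (user_agent, proxy) pairs to try, one per retry."""
    pairs = itertools.cycle(
        (ua, proxy) for ua in USER_AGENTS for proxy in PROXIES
    )
    return [next(pairs) for _ in range(max_attempts)]

def fetch_with_rotation(fetch, url, max_attempts=4):
    """fetch(url, ua, proxy) -> (status, body); rotate identity on a 403."""
    for ua, proxy in attempt_plan(max_attempts):
        status, body = fetch(url, ua, proxy)
        if status != 403 and body:
            return status, body
    return 403, None
```

The captcha-solver and stealth-browser pieces slot in behind the `fetch` callback; the loop above only captures the "keep changing who you appear to be until something works" idea.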


Very interesting!

Yes, in this day and age, I can definitely see web pages being harder to crawl by search engines (and SEO companies, AI agents, and other users of automated web-crawling technology) than they were in the early days of the Internet, due to many possible causes -- many of which you've excellently described!

In other words, there's more to be aware of for anyone writing a search engine (or search-engine-like software -- SEO tools, AI agents, etc.) than there was in the early days of the Internet, when everything was plain unencrypted HTTP and most URLs were easily accessible without having to jump through additional hoops...

Which leads me to wonder... on the one hand, a website owner may not want bots and other automated software agents spidering their site (we have robots.txt for this), but on the flip side, most business owners DO want publicity and easy accessibility for sales and marketing purposes; thus, they'd never want to issue a 403 (or other error code) for any public-facing product webpage...

Thus there may be a market for testing public-facing business/product websites for faulty "I can't give you that web page for whatever reason" error codes, from a wide variety of clients in a wide variety of locations around the world.

That market is related to the market for testing if a website is up and functioning properly (the "uptime market"), again, from a wide variety of locations around the world, using a wide variety of browsers...

So, a very interesting post!

Also (for future historians!) compare all of the restrictive factors which may prevent access to a public-facing web page today vs. Tim Berners-Lee's original vision for the web, which was basically to let scientists (and other academic types!) SHARE their data PUBLICLY with one another!

(Things have changed... a bit! :-) )

Anyway, a very interesting post, and a very interesting article -- for both present and future Search Engine programmers!


> the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar … mix of residential proxies, captcha solvers, rotating user-agents, stealth chrome binaries

I would like to register my hatred and contempt for what you do. I sincerely hope you suffer drastic consequences for your antisocial behavior.


Please elaborate: why exactly is it antisocial? Because Cloudflare decides who can or can't access a user's website, even when that user specifically signed up for my service?

But how does that work?

Does Cloudflare force firewall rules for those who choose to use it for their websites?

If the tool that does the crawling identifies itself properly, does Cloudflare block it even if users do not tell Cloudflare to block it?


It intentionally circumvents the explicit desires of those who own the websites being exploited. It is nonconsensual. It says “fuck you, yes” to a clearly-communicated “please no”.

OP literally said that users add their domains, meaning they are explicitly ASKING OP to scrape their websites.

Users sign up for my service.

You employ residential proxies. As such, you enable and exploit the ongoing destruction of the Internet commons. Enjoy the money!

Cloudflare and Big Tech are primary contributors to the impairment and decline of the Internet commons for moats, control, and profit; you are upset at the wrong parties.

Why not both?

This is kind of like getting upset with people who go to ATMs because drug dealers transact in cash lol.

I would argue that the ability to crawl and scrape is core to the original ethos of the internet, and all the hoops people jump through to block non-abusive scraping of content are in fact more anti-social than circumventing those mechanisms.

I wonder if circumvention is legal. It's so odd. In the US it seems you can just do this, whereas if I started something like this in the EU, I don't think I could.

In Italy it's a crime punishable by up to 12 years in prison to access any protected computer system without authorization, especially if it causes a DoS for the owner.

Consider the case of self-hosting a web service on a low-performance server while abusive crawlers loop endlessly fetching data (which was happening when I was self-hosting GitLab!).

https://www.brocardi.it/codice-penale/libro-secondo/titolo-x...


Can't your users just whitelist your IPs?

I'm in a similar boat, and getting customers to whitelist IPs is always a big ask. In the best case they call their "tech guy"; in the worst case it's a department far away, and the request has to go through 3 layers of review before someone adapts some Cloudflare / Akamai rules.

And then you'd better make sure your IP is stable and that your cloud provider isn't changing any IP assignments in the future, or you'll have to contact all your clients again with the same ask.


They're mostly non-technical/marketing people, but yes that would be a solution. I try to solve the issue "behind the scenes" so for them it "just works", but that means building all of these extra measures.

Would it make sense to offer the more technically minded customers a discount if they set up an IP whitelist, with a tutorial you could provide? A discount in exchange for reduced costs to you?

The right solution is to be registered with Cloudflare, but then getting the customer to reach the person who handles their Cloudflare settings (a few clicks) is the hard part.

Blocking seems really popular. I wonder if it coincides with Stack Overflow closing.

Just stop scraping. I'll do everything to block you.

> in my case, users add their own domains

Seems like they're only scraping websites their clients specifically ask them to


Now you've gamified it :)

It's a pretty easy game to win as the blocker. If an IP generates too many 404s by requesting pages that don't exist, just ban it for a month. I actually got the idea from a Hacker News comment. I'm also thinking that if you crawl too many pages you should get banned as well.

There's no point in playing tug of war against unethical actors, just ban them and be done with it.

I don't think it's an uncommon opinion to behave this way, nor are the crawlers users I want to help in any capacity.
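A minimal sketch of that ban rule, with made-up thresholds (a real deployment would persist state and expire the counters, not keep them in a dict):

```python
import time
from collections import defaultdict

BAN_SECONDS = 30 * 24 * 3600   # "a month" -- illustrative
MAX_404S = 20                  # tolerance before banning -- illustrative

class NotFoundBanner:
    """Ban an IP once it racks up too many 404s on nonexistent pages."""

    def __init__(self):
        self.miss_count = defaultdict(int)
        self.banned_until = {}

    def is_banned(self, ip, now=None):
        now = time.time() if now is None else now
        return self.banned_until.get(ip, 0) > now

    def record_404(self, ip, now=None):
        now = time.time() if now is None else now
        self.miss_count[ip] += 1
        if self.miss_count[ip] > MAX_404S:
            self.banned_until[ip] = now + BAN_SECONDS
            self.miss_count[ip] = 0
```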


So you're blocking the absolute bottom of the barrel scrapers and feel like you 'won' because you don't even notice any scraper that isn't complete trash.

Then again, why block them if they don't cause any issues in the first place? Instead of going ballistic on IPs you don't vibe with, you could also just do proper rate limiting.
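For reference, "proper rate limiting" here can be as simple as a per-IP token bucket; this is an illustrative sketch (rate and capacity are made-up numbers), not anyone's production config:

```python
class TokenBucket:
    """Per-IP token bucket: refill `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate=5.0, capacity=20.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = {}   # ip -> remaining tokens
        self.last = {}     # ip -> last request timestamp

    def allow(self, ip, now):
        # Refill proportionally to the time elapsed since the last request.
        tokens = self.tokens.get(ip, self.capacity)
        elapsed = now - self.last.get(ip, now)
        tokens = min(self.capacity, tokens + elapsed * self.rate)
        self.last[ip] = now
        if tokens >= 1.0:
            self.tokens[ip] = tokens - 1.0
            return True
        self.tokens[ip] = tokens
        return False
```

Unlike an outright ban, this throttles aggressive clients while leaving well-behaved ones (and legitimate users behind a shared IP) unaffected.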


If you think the game is played on a single IP address, you are not adept enough to be weighing in on this discussion.

What if the crawler is using a shared IP and you end up blocking legitimate users along with the bad actor?

He said "it's pretty easy", probably not realizing there are whole industries on both sides of that cat and mouse game, making it not easy.

I've been working on the same tool since 2024, when I figured it would be a good time to build a tool for all the people building their own tools -- eventually they will need to market them.

So I built an SEO/GEO Automation Tool for Small to Mid-Size Businesses who don't have a full-time team for that. [0]

The goal is to provide teams visibility across all channels — Search and AI — and give them the tools needed to outrank their competition. So far so good: the fully bootstrapped venture has grown over the last year, and I've built quite a few big features — a sophisticated audit system, AI Responses Monitoring, Crawler Analytics, Competitors Monitoring, etc.

[0] https://seojuice.io


Adding a bit of context as well: this started out as an internal linking tool but grew into something more based on customer feedback — the database has now reached about 10TB of data about keywords, pages, AI responses, etc., where I know who was ranking where and why.

And I'm trying to offer this "data advantage" to website owners so they can grow; it's also something that will be hard to replicate (at least quickly) with AI.


Oh wow, that was painful to read. I especially liked this analysis part:

> Different naming conventions (DW_OP_* vs DW_op_*)


Clearly not copied! Look at the case difference! Duh!


Hey HN!

I’m working on SEOJuice [1], an automated tool for internal linking and on-page SEO optimizations. It's designed to make life a little easier for indie founders and small business owners who don’t have time to dig deep into SEO.

So far, I’ve managed to scale it to $3,000 MRR, and recently made the move from the cloud to Hetzner, which has been a game-changer for cost efficiency. We’re running across multiple servers now, and handling everything from link analysis to on-page updates with a bit more control.

The journey’s been a mix of hands-on coding (and a lot of coffee) and constant optimization. It’s been challenging but incredibly fun to see how much can be automated without compromising on quality.

Happy to chat more about the tech stack or any of the growth pains if anyone’s interested!

[1] https://seojuice.io


Oh wow, my package on the front page again. Glad that it's still being used.

This was written 10 years ago when I was struggling with pulling and installing projects that didn't have any requirements.txt. It was frustrating and time-consuming to get everything up and running, so I decided to fix it — and apparently many other developers had the same issue.

[Update]: Though I do think the package is already at a level where it does one thing and does it well, I'm still looking for maintainers to improve it and move it forward.
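For anyone curious what a tool like this does under the hood, here's a simplified, hypothetical sketch: parse each .py file's AST and collect top-level import names. (The real package does more -- e.g., mapping import names to PyPI distribution names and filtering out the standard library.)

```python
import ast
from pathlib import Path

def imported_packages(source: str) -> set:
    """Collect top-level names from import statements in one source file."""
    tree = ast.parse(source)
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import os.path" -> "os"
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # skip relative imports ("from . import x"); keep "from flask import Flask"
            names.add(node.module.split(".")[0])
    return names

def scan_project(root: str) -> set:
    """Union of imported package names across all .py files under root."""
    pkgs = set()
    for path in Path(root).rglob("*.py"):
        pkgs |= imported_packages(path.read_text())
    return pkgs
```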


> This was written 10 years ago when I was struggling with pulling and installing projects that didn't have any requirements.txt

And 10 years later this is still a common problem!


Not if you use proper Python environment and package-management tools like pdm or Poetry.


Hey everyone, I know the HN community is very polarizing, and the discussions here are always great to read through, as both sides are always eager to prove the other wrong. I think we need more of that in the community: people not being afraid to disagree.

I'm really curious to hear your thoughts and experiences.


I have a bit of a niggle about the use of the word "polarizing". That word implies things that I think are harmful overall, such as being unwilling to work with people you disagree with.

That said, I agree that it's important to express your opinions and stand by things you think are right. It's equally important to listen to those who disagree with you and take what they say as additional data that may (or may not) lead you to modify your opinion. At the very least, openly and honestly listening to others will inform you as to why they have a differing opinion. "Everyone seems crazy if you don't understand their point of view."

Also, "compromise" isn't a dirty word. It's how we get anything done.


Looking for feedback.

Cheers :)


If you see any bugs, please let me know. The product is quite fresh.


Why not just retrain the Python team in another language? I mean, software engineers are not really language-specific; they can learn other languages if needed.


They were maintaining Python itself and were likely very well compensated (as one would expect). It'd be a waste to have these devs do product development.


All SWEs at the same level at Google are making the same compensation (with some exceptions for high-flying AI researchers). The Python SWEs certainly weren't making more than anyone else.


That's not true at all. (Excluding the factor of location,) the compensation of a SWE depends not only on level but also on tenure, on performance rating (and the history of ratings), and on stock-market fluctuations (whether the stock price was low or high when the stock was granted).

One of the rumors is that the better compensated you are at your level, the more likely you are to be targeted for layoff, because cutting you saves the most engineering cost.


None of those depend on the project you're working on, which is my point.


All SWEs at Google are well compensated. Not all of them would be a good fit for maintaining Python.


Then I don't understand the point you were making in your first post.


A waste of talent, not of cash.


That's the thing, it's not clear that the Python core engineers are more talented than other Google SWEs on average. You have all sorts of talented engineers working on all sorts of random projects within Google.


They have three months to find new roles/teams. Their employment only ends if they can't.


Like finding a lunch table to sit at on your first day of school


Assuming this wasn't financially motivated.


https://vadimkravcenko.com

Mostly I help developers grow — I share my thoughts as a CTO about building digital products, growing teams, scaling development, and in general being a good technical founder.

Some of the popular posts are:

- https://vadimkravcenko.com/shorts/things-they-didnt-teach-yo... - Things they didn't teach you at the university

- https://vadimkravcenko.com/shorts/project-estimates/ - Rules of thumb for Project Estimations

- https://vadimkravcenko.com/shorts/contracts-you-should-never... - Contracts you should never sign.

Most of the blog posts have ended up on the Frontpage here, here's the list: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

Cheers, Vadim

