Dumb question: I keep seeing posts about how ~"the volume of AI scrapers is maki...

senko · 2026-01-24T18:56:06 1769280966

> There must be a ton of new full-web datasets out there, right?

Sadly, no. There's CommonCrawl (https://commoncrawl.org/), which is still, sadly, far removed from a "full-web dataset."

So everyone runs their own search instead, hammering the sites, going into gray areas (you either ignore robots.txt or your results suck), etc. It's a tragedy of the commons that keeps Google entrenched: https://senkorasic.com/articles/ai-scraper-tragedy-commons

Terretta · 2026-01-23T17:39:24 1769189964

> the volume of AI scrapers is making hosting untenable

Aside from that potential, it's also not true.

A Pentium Pro or PIII SSE with circa 1998-99 Apache happily delivers a billion hits a month w/o breaking a sweat, unless you think generating pages for every visit is better than generating pages when they change.
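
For scale, here is a rough back-of-the-envelope conversion of "a billion hits a month" into an average request rate. It assumes a 30-day month and perfectly even traffic, which real traffic is not, so treat it as a floor on peak load rather than a benchmark.

```python
# Average request rate implied by "a billion hits a month".
# Assumptions (not from the thread): 30-day month, evenly spread traffic.
HITS_PER_MONTH = 1_000_000_000
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds

avg_rps = HITS_PER_MONTH / SECONDS_PER_MONTH
print(f"{avg_rps:.0f} requests/second on average")  # → 386
```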

Tenemo · 2026-01-23T18:42:54 1769193774

I think it is true that it is a real problem (EDIT: but it doesn't necessarily make "hosting untenable"), but you are correct to point out that modern pages tend to be horribly optimized (and that's the source of the problem). Even "dynamic" pages using React/Next.js etc. could be pre-rendered and/or cached and/or distributed via CDNs. A simple cache or a CDN should be enough to handle pretty much any scraping traffic unless you need to do some crazy logic on every page visit – which should almost never be the case on public-facing sites. As an example, my personal site is technically written in React, but it's fully pre-rendered and doesn't even serve JS – it can handle huge amounts of bot/scraping traffic via its CDN.
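
To make the "render when it changes, not on every visit" idea concrete, here is a minimal in-process sketch. `PageCache`, `render_page`, and the integer `version` are hypothetical names for illustration only; a real site would more likely rely on a static build step, HTTP cache headers, or a CDN.

```python
from typing import Callable, Dict, Tuple


class PageCache:
    """Re-render a page only when its content version changes."""

    def __init__(self, render: Callable[[str, int], str]) -> None:
        self._render = render
        # path -> (version that was rendered, rendered HTML)
        self._pages: Dict[str, Tuple[int, str]] = {}
        self.render_count = 0  # how many times we actually did the work

    def get(self, path: str, version: int) -> str:
        cached = self._pages.get(path)
        if cached is not None and cached[0] == version:
            return cached[1]  # cache hit: no rendering cost
        html = self._render(path, version)
        self.render_count += 1
        self._pages[path] = (version, html)
        return html


def render_page(path: str, version: int) -> str:
    # Stand-in for an expensive template or React pre-render step.
    return f"<html><body>{path} v{version}</body></html>"


cache = PageCache(render_page)
for _ in range(10_000):           # 10,000 bot hits on unchanged content
    cache.get("/about", version=1)
cache.get("/about", version=2)    # content edited once
print(cache.render_count)         # → 2
```

Ten thousand hits cost one render; only the edit triggers a second one, which is the property that makes heavy bot traffic cheap to serve.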

consumer451 · 2026-01-23T21:48:46 1769204926

OK, I agree with both of you. I am an old who is aware of NGINX and C10k. However, my question is: what are the economic or technical difficulties that prevent one of these new web-scale crawlers from releasing og-pagerank-api.com? We all love to complain about modern Google SERP, but what actually prevents that original Google experience from happening, in 2026? Is it not possible?

Or, is that what orgs like Perplexity are doing, but with an LLM API? Meaning that they have their own indexes, but the original q= SERP API concept is a dead end in the market?

Tone: I am asking genuine questions here, not trying to be snarky.

arantius · 2026-01-24T02:42:45 1769222565

What prevents it is that the web in 2026 is very different than it was when OG pagerank became popular (because it was good). Back then, many pages linked to many other pages. Now a significant amount of content (newer content, which is often what people want) is either only in video form, or in a walled garden with no links, neither in nor out of the walls. Or locked up in an app, not out on the general/indexable/linkable web. (Yes, of course, a lot of the original web is still there. But it's now a minority at best.)

Also, of course, the amount of spam-for-SEO (pre-slop slop?) as a proportion of what's out there has grown over time.

IOW: Google has "gotten worse" because the web has gotten worse. Garbage in, garbage out.

consumer451 · 2026-01-24T03:32:38 1769225558

Thanks for the reply. I mentioned tech, but forgot about time. Yeah, that makes solid sense.

> Or locked up in an app...

I believe you may have at least partially meant Discord, for which I personally have significant hate. Not really for the owners/devs, but why in the heck would any product owner want to hide the knowledge of how to use their app on a closed platform? No search engine can find it, no LLM can learn from it(?). Lost knowledge. I hate it so much. Yes, user engagement, but knowledge vs. engagement is the battle of our era, and knowledge keeps losing.

r/anything is so much better than a Discord server, especially in the age of "Software 3.0"

consumer451 · 2026-01-23T22:16:32 1769206592

Please see my reply to the other child comment. That is my actual question, apologies for not being more clear.