In 2026, “website security” increasingly means “deciding which machines are allowed to read your site, and under what terms.” The web is being re-engineered by bots that don’t just index—they extract, answer elsewhere, and (in the agent era) start clicking things like a drunk intern with root access.
Thesis: the “AI bot onslaught” is not a moral panic about automation. It’s a sudden incentive shift: scraping now directly powers model training, AI search, and agentic products, so the traffic is bigger, more aggressive, and less reciprocal. Hosts and CDNs respond the only way they can at scale: crank the WAF, block first, and accept collateral damage.
1) What changed (and why it changed so fast)
Bots have always been part of the web’s background radiation. Search crawlers, SEO parasites, price scrapers, credential stuffers, random “research projects” running on a Raspberry Pi in a dorm room—standard internet wildlife.
What changed in the last two years is that crawling became a direct input to competitive advantage. Not “we might sell more ads.” Not “our index gets better.” More like: “if we vacuum up enough of the open web, our model becomes meaningfully more capable, and we win a platform war.” That turns crawling from nuisance into strategy.
Three forces landed at the same time
- LLM training appetite: more data (especially high-quality, recent, and niche) improves coverage and reduces embarrassing blind spots.
- AI search appetite: answer engines need freshness, recall, and broad indexing—often without the old search bargain of sending traffic back.
- Agentic AI appetite: agents don’t just read pages; they attempt workflows. They log in, search, paginate, retry, click, submit. In logs, an agent can look less like “crawler” and more like “a human who drank six espressos and is now speedrunning your website.”
This is why the shift feels sudden. The web didn’t become automated overnight. The value of automation did—and it brought a lot of new money, tooling, and operational aggression with it.
2) The new bot taxonomy: it’s not one problem
One reason site owners feel blindsided is that “AI bots” is a junk-drawer label. Even the big AI vendors now split bots by purpose because the behaviours, ethics, and controls differ.
2.1 Training crawlers (bulk ingestion)
These are the classic “vacuum cleaner” bots—high volume, wide coverage, often hitting older content too. Some identify themselves. Plenty don’t. The key behaviour is extraction without referral: the site pays the cost, the bot operator gets the asset.
2.2 Search/indexing crawlers (freshness + recall)
These bots behave more like traditional search engines: repeated crawling, sitemap usage, structured discovery. The twist is that the product is no longer “ten blue links”—it’s an answer box that may reduce click-through. That changes the economics and pushes publishers toward defensive moves.
2.3 User-directed fetchers (“go read this page for me”)
AI products increasingly retrieve pages on-demand when users ask questions. OpenAI, for example, documents separate agents for search and training, and also a user-initiated fetcher (ChatGPT-User) that may retrieve content in response to user actions—and notes that robots.txt rules may not apply in the same way for user-initiated requests.
That’s not a judgement—it’s a warning: your “crawler policy” and your “user access policy” are merging into one messy control surface.
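Robots.txt is still the lingua franca for expressing these policies, even as its authority erodes. A minimal sketch of how per-bot rules evaluate, using Python’s standard-library parser; the user-agent tokens `GPTBot` and `OAI-SearchBot` are ones OpenAI has documented, but vendors change their crawler names, so treat the list as illustrative and check current docs before relying on it:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block a training crawler outright,
# allow a search crawler, and keep /private/ off-limits to everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/article"))          # False
print(rp.can_fetch("OAI-SearchBot", "https://example.com/article"))   # True
print(rp.can_fetch("SomeOtherBot", "https://example.com/private/x"))  # False
```

Note what this can’t do: it expresses a request, not an enforcement. User-initiated fetchers and agents may not consult it at all, which is exactly why the control surface is migrating to WAFs and CDNs.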
2.4 Agents (interactive automation)
Agents are where things get spicy. They don’t just request URLs. They explore your site like an impatient QA engineer: lots of requests, non-linear navigation, retries, form submissions, and weird edge paths. WAFs and bot managers are not built to be gentle philosophers—they’re built to keep the site up. So they will treat agent-like behaviour as hostile unless proven otherwise.
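To make the “impatient QA engineer” pattern concrete, here is a deliberately crude heuristic of the kind a bot manager might start from: flag clients that combine a high request rate with highly non-linear navigation (nearly every request hitting a new path). The thresholds and signals are invented for illustration; real systems layer on TLS fingerprints, JS challenges, and behavioural models.

```python
from collections import defaultdict, deque
import time

WINDOW_SECONDS = 10
MAX_REQUESTS = 30          # sustained ~3 req/s trips the rate check
MIN_DISTINCT_RATIO = 0.8   # almost every request hits a new path

# client_id -> deque of (timestamp, path) within the sliding window
history = defaultdict(deque)

def looks_agent_like(client_id, path, now=None):
    """Return True if this client's recent traffic looks agent-like."""
    now = time.monotonic() if now is None else now
    window = history[client_id]
    window.append((now, path))
    # Drop requests that have aged out of the window.
    while window and now - window[0][0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) < MAX_REQUESTS:
        return False
    distinct = len({p for _, p in window})
    return distinct / len(window) >= MIN_DISTINCT_RATIO

# A burst of 40 requests to 40 different URLs within a second trips it:
flagged = False
for i in range(40):
    flagged = looks_agent_like("203.0.113.7", f"/page/{i}", now=float(i) * 0.02)
print(flagged)  # True
```

The trouble, of course, is that a power user with twelve open tabs can look similar, which is why blunt versions of this logic produce the false positives discussed below.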
3) Why the web is breaking: costs, risk, and the collapse of the old bargain
3.1 The cost side is real (ask Wikimedia)
One of the cleanest public datapoints comes from the Wikimedia Foundation. They describe how request volume has shifted, with heavy growth driven by scraping bots collecting training data, and report that bandwidth used for downloading multimedia content grew by 50% since January 2024—not from humans, but largely from automated programs scraping Wikimedia Commons.
Translation: even a mature, well-engineered, traffic-hardened nonprofit is feeling meaningful strain from crawler behaviour that doesn’t resemble classic “human spikes.” It’s not that Wikimedia can’t scale. It’s that the crawler traffic pattern is unprecedented, persistent, and expensive.
3.2 The risk side is also real (WAFs don’t block “just” bots)
At the infrastructure layer, bot traffic and abuse traffic share a lot of surface area: automated probing, credential stuffing, vulnerability scanning, injection attempts, and high-rate behaviour. So defensive systems get tuned to stop the bad stuff, and in the process they sometimes block the “fine” stuff too: webhooks, uptime monitoring, developer tools, accessibility tools, and—occasionally—ordinary humans behind a sketchy IP reputation.
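Much of this defence is bluntly rate-based at the edge. A minimal nginx sketch, assuming a reverse-proxy setup with an upstream named `backend`; the numbers are illustrative, not recommendations:

```nginx
# Shared state keyed by client IP: 10 MB of counters,
# steady-state allowance of 5 requests/second per IP.
limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

server {
    listen 80;

    location / {
        # Absorb short bursts (up to 20 requests) before rejecting;
        # return 429 instead of the default 503 so well-behaved
        # clients know to back off rather than retry immediately.
        limit_req zone=perip burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://backend;
    }
}
```

Keying on IP is exactly why this misfires: a webhook sender, a corporate NAT, or a monitoring service can all exceed a per-IP budget while doing nothing wrong.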
3.3 The old bargain is dying: crawl no longer implies referral
Search used to be a mutually beneficial scam: websites fed the index, the index sent traffic back, and everyone pretended that was a fair trade.
Now, crawlers increasingly feed systems that answer in place. The website pays the bandwidth and infra bill; the user gets the answer elsewhere; the publisher gets… character development.
4) The defensive response: “Welcome to the WAF state”
When a hosting provider says they’re upgrading WAF rules and some sites may see 403 errors during testing, that’s not a quirky ops note. That’s the normal operational signature of an arms race.
4.1 Why providers harden by default
- Shared fate: on shared platforms, one noisy bot wave can degrade performance for thousands of unrelated sites.
- Support economics: it’s cheaper to block aggressively and handle edge-case tickets than to let the platform be dragged into the sea.
- Liability optics: if the provider is seen as “not protecting customers,” the brand cost is worse than a few false positives.
4.2 Proof that “block AI bots” has become a mainstream knob
Cloudflare now offers a dedicated Block AI bots feature that blocks verified bots classified as AI crawlers, plus some unverified bots that behave similarly. It’s presented as a normal security configuration—because it is.

The important subtext: once a CDN ships a one-click switch, it’s no longer a niche complaint. It’s a category. And categories tend to spread.
5) How we got here: the rapid escalation mechanics
It’s tempting to narrate this as “AI companies are scraping.” True but incomplete. The real story is a convergence of capability, commoditisation, and incentives.
5.1 Capability: better automation that looks like humans
Modern bots don’t need to announce themselves as Python-urllib/2.7 anymore. They can run headless browsers, spoof fingerprints, rotate user agents, distribute load across IP space, and behave plausibly enough to pass crude filters.
5.2 Commoditisation: everyone can do it
Scraping at scale used to be an enterprise hobby. Now it’s a weekend project with a credit card. That means the web isn’t dealing with a few crawlers; it’s dealing with an ecosystem.
5.3 Incentives: the rewards are outsized
If your model improves because you ingested more niche documentation, more local news, more images, more forum threads—your product gets better. If your AI search gets better, you keep the user. If your agent completes tasks, you win retention. The payoff function is steep.
6) Conflicting forces: who wants what (and why they can’t all get it)
This isn’t a single conflict. It’s several competing, partially incompatible visions of the web.
- AI vendors want coverage, freshness, and multimodal corpora—ideally without paying per-site negotiation costs.
- Publishers want monetisation and attribution—plus the right to say “no” without being punished in visibility.
- Users want convenience—and rarely care who subsidises the compute and bandwidth.
- Hosts/CDNs want stable platforms and predictable costs—so they prefer blunt controls.
- Regulators want fairness, transparency, and safety—while moving at the speed of committee minutes.
These forces are now colliding in the same place: the HTTP request. That’s why the “AI era” feels less like a new app wave and more like a re-litigation of the web’s foundations.
7) Future scenarios (12–36 months): what could the web become?
Here are plausible trajectories. The interesting question isn’t “which one happens?” It’s “which ones can coexist without the whole system turning into a permissions bureaucracy?”
Scenario A: Paywalled crawling (tollbooths everywhere)
Large publishers negotiate licensing and access fees. Smaller sites either block everything, accept being scraped, or get bundled into intermediated deals they didn’t negotiate. This is the “cable TV” future of the web: packages, gatekeepers, and a lot of money for people who already have leverage.
Scenario B: Bot passports (identity and attestation)
Verified bots become the only bots that get reliable access. Expect published IP ranges, signed tokens, and “good bot” registries. This is the “governance becomes the product” future: crawling becomes an IAM problem.
Scenario C: The dark forest web (default-deny)
More sites require JS challenges, logins, or simply block unknown automation. The open web shrinks; walled gardens grow. Humans still browse, but machines increasingly get the “no” version of the internet.
Scenario D: Negotiated machine interfaces (APIs for agents, not HTML scraping)
Sites expose structured endpoints for summarisation, quoting, and retrieval. Agents stop scraping HTML and start buying access. Optimistic version: a cleaner web. Cynical version: the API economy, but with more legal disclaimers and fewer hobby projects.
Scenario E: Adversarial content (poison, traps, and legal warfare)
As blocking rises, some sites attempt countermeasures: trap pages, misleading content, dynamic payloads. That risks collateral damage to legitimate indexing and archiving—and pushes the arms race toward more expensive, more intrusive bot detection.
8) Informed opinion: the web is becoming an access-controlled substrate
My bet: we’re not heading to a single outcome. We’re heading to stratification.
- Tier 1: major platforms with negotiated access, verified bots, and paid crawling.
- Tier 2: long-tail publishers who block aggressively and rely on human traffic, newsletters, or social distribution.
- Tier 3: everything else that becomes quietly harvested, mirrored, and commoditised—until it disappears behind a login.
And floating above it all: agents. Because once agents exist, the definition of “a user” gets blurry. If a user asks an agent to read a page, is that a bot request or a user request? The answer determines whether robots.txt remains a meaningful concept or becomes a historical curiosity—like dial-up modems or corporate shame.
What to watch
- More CDN knobs: “Block AI bots” will become as common as DDoS protection settings.
- Granular bot identities: more vendors split bots by purpose (training vs search vs user fetch) to reduce blowback.
- Licensing infrastructure: intermediaries offering “pay-per-crawl” or “verified access” marketplaces.
- Legal precedent: outcomes that clarify whether training and AI search are treated as indexing, copying, or something new.
- Agent etiquette standards: rate limits, caching norms, and “agent-friendly” interfaces—if the industry chooses civilisation over chaos.
Sources
Wikimedia Foundation (Diff) — How crawlers impact the operations of the Wikimedia projects
OpenAI — Overview of OpenAI Crawlers
Anthropic — Does Anthropic crawl data from the web, and how can site owners block the crawler?
Cloudflare — Block AI Bots
AWS — How to manage AI Bots with AWS WAF and enhance security