Today’s gonna be a bit of a ranter. The whole AI-Crawler situation has gone from bad to awful, potentially ending the web as we know it. AI companies like OpenAI (ChatGPT) and Anthropic (Claude) have gotten so used to stealing that they’re no longer afraid to cause extreme costs to others for their own gain. Many of them now employ crawlers that include methods to bypass limits and filters. It needs to stop.
I was personally hit by OpenAI a few weeks ago. Their crawler managed to generate 49 TB of traffic, costing me about 13€ for that day, and judging by the access logs, Anthropic later tried to do the same. And it seems I’m not the only one, with Uberspace being hit even worse. rileyb3d appears to have been harassed in a similar way and was forced to take down their own website entirely. And you know it’s gotten really bad when Cloudflare, of all companies, makes their AI-Crawler protection tool available to Free users. They only open things up to Free users when the problem is widespread, so even Cloudflare has had enough now.
This situation has completely spiraled out of control for everyone, and I don’t see much of a future for the free web if it continues. AI companies no longer care about copyright, licensing or anything similar, and it’s only going to get worse until governments wake up. Any work you publish is being used to train AI models, no matter whether your license allows it or requires payment. None of them care, and lawsuits are piling up.
Anyway, bypassing the whole ranting part, I’d like to share how to identify AI-Crawlers and protect yourself a bit. The “Captain Obvious” protection is to not publicly post anything you don’t want used in AI model training, but that’s not what I’m aiming for. Over the few weeks after the OpenAI DDoS, I’ve been able to identify a few of the bots by their User Agent:
- OpenAI (ChatGPT) uses OpenAI, GPT, OAI-SearchBot, and OAI-Bot. Filter out any User Agent containing these, and you shut down most of their AI-Crawling.
- Anthropic (Claude) so far only uses claudebot, with no other User Agents matching their IPs.
- OpenAI also appears to occasionally use facebook.com/externalhit, as it has been tracked to match their IP range in the same hour as an OpenAI crawling attempt.
- Amazon appears to only use amazonbot and facebook.com/externalhit.
- Many open source AI-Crawlers use GPT, OpenAI, facebook.com/externalhit, and even Chrome/124.0.6367.118. Not even proper browser user agents are safe from this greed.
If you’re using nginx, you should consider filtering on a global level. I’m apparently not the first to need this either: MariusV (on GitHub) appears to have done prior work for similar reasons, and my own variant of his scripts simply adds the known bots to it. Include it, then redirect the bots to AI generated slop with if ($bad_crawler) and if ($bad_referrer) (nginx needs the space after if). Bet they’ll enjoy poisoning their own training data.
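
For anyone who wants to see roughly what that looks like without pulling in the full scripts, here is a minimal sketch. It is not MariusV's actual list, just a hand-rolled map built from the User Agents above. The server name, the /var/www/decoy path and the slop.html file are made up for illustration, and $bad_referrer is left out because it comes from the referenced scripts.

```nginx
# Goes in the http {} context (e.g. a file included from nginx.conf).
# $bad_crawler becomes 1 when the User-Agent contains one of the
# substrings listed above; "~*" means case-insensitive regex match.
map $http_user_agent $bad_crawler {
    default                       0;
    ~*openai                      1;
    ~*gpt                         1;
    ~*oai-searchbot               1;
    ~*oai-bot                     1;
    ~*claudebot                   1;
    ~*amazonbot                   1;
    ~*facebook\.com/externalhit   1;
    # Exact browser build seen in my logs. Assumption: you accept that
    # real visitors on this Chrome version get caught too.
    ~*Chrome/124\.0\.6367\.118    1;
}

server {
    listen 80;
    server_name example.com;          # hypothetical

    # Server-level rewrite runs once, before location matching, so the
    # decoy page itself does not trigger another rewrite.
    if ($bad_crawler) {
        rewrite ^ /slop.html last;    # hypothetical decoy page
    }

    location = /slop.html {
        root /var/www/decoy;          # hypothetical path
    }

    location / {
        root /var/www/html;           # hypothetical path
    }
}
```

Keep in mind that substring matches like gpt are deliberately broad and will also catch anything else with those letters in its User Agent, and that blocking facebook.com/externalhit kills link previews on Facebook. If you’d rather just drop the requests instead of feeding them slop, swapping the rewrite for return 403; works too.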
