Blocking AI bots and controlling crawlers

by Phil Hawksworth

AI offers some incredible opportunities as a tool for developers, but we might not want every AI service out there scraping everything we publish on our websites to use as training content. There are ways to tell these bots not to crawl our data, and Netlify Edge Functions can help make this pretty straightforward.

#TL;DR

We’ll look at two important mechanisms for blocking AI bots from scraping your sites, and implement them with a generated config file and an Edge Function.

#Deploy and play

If you prefer to go straight to deploying your own copy of an example, you can do that by clicking the button below.

Deploy to Netlify

The simplest way to disallow bots from crawling your sites is to state this in a robots.txt file served from the root of your site. A robots.txt file is designed to instruct web crawlers about what content on your site they can and cannot access.

#Generating your robots.txt file for convenience

The robots.txt file will need to include a rule declaration for every known AI bot you wish to ban from scraping your content. Since the list of known AI bots is rather long, and likely to get longer, it can be helpful to generate the file in order to avoid typos and errors. It also means we can reuse the same single list of AI bots for something else… we’ll get to that later.

Most, if not all, web frameworks make it trivial to generate a file from some data. For the sake of illustration, I’ve not used a framework for this example, and instead made a tiny build script which adds the following declaration to a robots.txt file for every item it finds in a separate list of User Agent strings for known AI crawlers.

The declaration we need for each bot:

User-agent: AGENT_NAME
Disallow: /

Here’s our list of AI bots held in a .json file for convenience:

agents.json
[
  "AdsBot-Google",
  "Amazonbot",
  "anthropic-ai",
  "Applebot",
  "AwarioRssBot",
  "AwarioSmartBot",
  "Bytespider",
  "CCBot",
  "ChatGPT",
  "ChatGPT-User",
  ...
]

There’s nothing special about that little Node script to make the file. You can see it here if you’re curious: build.js
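
To give a sense of what such a script can look like, here’s a minimal sketch. It isn’t the exact build.js from the example repo, and it assumes agents.json sits alongside the script with the site publishing from a public directory, so adjust the paths to suit your own setup.

build.js
// A minimal sketch of generating robots.txt from agents.json
const fs = require("fs");

// Read the shared list of known AI crawler User Agent strings
const agents = JSON.parse(fs.readFileSync("./agents.json", "utf8"));

// Build one "User-agent / Disallow" rule block per agent
const rules = agents
  .map((agent) => `User-agent: ${agent}\nDisallow: /`)
  .join("\n\n");

// Write the result to the publish directory so it is served from the site root
fs.writeFileSync("./public/robots.txt", rules + "\n");
console.log(`Wrote robots.txt rules for ${agents.length} agents`);

Run it as part of your build command (for example, node build.js before your usual build step) so the generated robots.txt stays in step with the list of agents.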

Serve the resulting robots.txt file from the root of your site, and AI crawlers should honor it and not scrape the content of your site.

“Should”.

Sadly, not all AI products respect the rules found in a robots.txt file, so we need to reach for another option for additional confidence:

#Blocking HTTP requests based on the User Agent String using an Edge Function

Edge Functions give us a low latency, high performance way to filter the requests being made to any resources in our sites.

Adding an Edge Function to a site is as simple as adding a TypeScript or JavaScript file to your site at this location, where Netlify knows to look for your Edge Functions:

/netlify/edge-functions/

Here’s an Edge Function that checks the User Agent string of the incoming HTTP request against our list of known AI bots, returning an HTTP 401 response to those requests while letting all other requests proceed as normal. It uses the same list of AI bots that we created to feed our robots.txt file. So that’s handy.

/netlify/edge-functions/bots.ts
import { Config } from "@netlify/edge-functions";
import agents from "../../agents.json" with { type: "json" };

export default async (request: Request) => {
  // Get the user agent string of the requester (fall back to an empty string if absent)
  const ua = request.headers.get("user-agent") || "";

  // Check against our list of known AI bots
  const isBot = agents.some((agent) =>
    ua.toLowerCase().includes(agent.toLowerCase())
  );

  // If the requester is an AI bot, disallow with a 401
  if (isBot) {
    return new Response(null, { status: 401 });
  }

  // Otherwise, continue with the request as normal
  return;
};

// This edge function is executed for all requests across the site
export const config: Config = {
  path: "/*",
};
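
If you want to sanity check this locally, you can serve the site with the Netlify CLI by running netlify dev (Edge Functions run as part of the local dev server, on http://localhost:8888 by default) and then make a request that spoofs one of the listed user agents, for example curl -A "CCBot" http://localhost:8888/. That request should come back with a 401, while an ordinary browser request is served as normal.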

Sadly, some bots have been found to mis-report their names in their User Agent strings, so we can’t rely on that technique alone either. Doubling up and using both of these techniques should do the trick.

#Turning this into a utility

If you operate multiple sites, you might find this kind of facility useful more than once. It feels like a good contender to be packaged up as an integration which could be enabled with a couple of clicks for any of your sites.

To learn about how to do that, I’d recommend this guide on creating a Netlify Integration to insert Edge Functions into a site.

#Acknowledgements

This guide was inspired by the approaches taken and documented in these great posts: