How to Block AI Bots with robots.txt in 2026
Why Block AI Bots?
AI companies train their large language models by crawling the web and ingesting website content. In 2026, this practice has become a significant concern for publishers, creators, and businesses who do not want their content used for AI training without consent or compensation.
The major AI crawlers include:
- GPTBot — OpenAI’s crawler for training data
- ChatGPT-User — OpenAI’s crawler for real-time browsing features
- ClaudeBot — Anthropic’s crawler
- Google-Extended — Google’s crawler for Gemini AI training
- Bytespider — ByteDance’s crawler for training purposes
- CCBot — Common Crawl’s bot, used by many AI training datasets
- Meta-ExternalAgent — Meta’s AI training crawler
Blocking these bots does not affect your search engine rankings. Google has explicitly stated that blocking Google-Extended has no impact on your appearance in Google Search results. The regular Googlebot, which indexes your site for search, is a separate user agent.
Using robots.txt to Block AI Crawlers
The robots.txt file lives at the root of your website (e.g., https://example.com/robots.txt) and tells web crawlers which parts of your site they may access. Compliance is voluntary (a bot can simply ignore the file), but most major AI companies state that their crawlers honor robots.txt directives.
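Because robots.txt always sits at the site root, its location can be derived from any page URL. A minimal sketch using Python's standard library (example.com is a placeholder):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the canonical robots.txt URL for the site hosting page_url."""
    parts = urlsplit(page_url)
    # robots.txt lives at the root of scheme + host, never in a subdirectory
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com/blog/some-post"))
# https://example.com/robots.txt
```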
Basic Syntax
Each rule in robots.txt has two parts: a User-agent directive specifying which bot the rule applies to, and a Disallow directive specifying what is off-limits.
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
The above blocks GPTBot, ClaudeBot, and Google-Extended from your entire site. The / means everything.
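You can verify rules like these locally with Python's standard-library urllib.robotparser before deploying (a sketch; the URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The named AI bots are blocked everywhere...
print(rp.can_fetch("GPTBot", "https://example.com/any/page"))    # False
# ...but a bot not listed (and with no * group) is unaffected
print(rp.can_fetch("Googlebot", "https://example.com/any/page")) # True
```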
Block All Known AI Bots
Here is a comprehensive block list for 2026:
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: PerplexityBot
Disallow: /
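Rather than hand-typing each User-agent/Disallow pair, you can generate the block list programmatically (a sketch; the bot list mirrors the one above, and build_block_list is a hypothetical helper):

```python
# User-agent tokens to block, matching the list above
AI_BOTS = [
    "GPTBot", "ChatGPT-User", "ClaudeBot", "Google-Extended",
    "Bytespider", "CCBot", "Meta-ExternalAgent", "anthropic-ai",
    "Applebot-Extended", "cohere-ai", "PerplexityBot",
]

def build_block_list(bots):
    """Emit one robots.txt group per bot, each blocking the whole site."""
    groups = [f"User-agent: {bot}\nDisallow: /" for bot in bots]
    return "# Block AI training crawlers\n" + "\n\n".join(groups) + "\n"

print(build_block_list(AI_BOTS))
```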
Allow Search Engines, Block AI
Make sure you are not accidentally blocking search engine crawlers. Your robots.txt should still allow Googlebot, Bingbot, and other search engine bots:
# Allow search engines
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
# Block AI crawlers
User-agent: GPTBot
Disallow: /
If you use a blanket User-agent: * with Disallow: /, you will block every bot that does not have its own User-agent group, search engines included: crawlers follow the most specific matching group, so the * rules apply only to bots you have not named elsewhere. Be specific about which bots you want to block.
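This precedence behavior is easy to demonstrate with urllib.robotparser: a bot with its own User-agent group follows that group and ignores the * rules, while everything else falls through to the blanket block (a sketch with placeholder URLs):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot matches its own group, so the blanket * block never applies to it
print(rp.can_fetch("Googlebot", "https://example.com/page"))  # True
# GPTBot has no specific group, so it falls through to the * rules
print(rp.can_fetch("GPTBot", "https://example.com/page"))     # False
```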
Using the seokit Robots.txt Generator
Building a robots.txt file manually is error-prone. A misplaced directive can accidentally block search engines or leave AI bots unblocked.
The seokit Robots.txt Generator provides a visual interface with one-click AI bot blocking. Select the bots you want to block from a checklist, configure your sitemap URL, and download a valid robots.txt file. The tool validates your configuration and warns about potential issues.
Partial Blocking
You may want AI bots to access some pages but not others. For example, you might allow your homepage and product pages while blocking your blog content:
User-agent: GPTBot
Disallow: /blog/
Disallow: /articles/
Allow: /
This tells GPTBot it may access everything except the /blog/ and /articles/ directories: the more specific Disallow rules take precedence over the general Allow: /.
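The same parser confirms the partial-blocking behavior (a sketch; the paths are examples):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /blog/
Disallow: /articles/
Allow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/blog/post-1"))     # False
print(rp.can_fetch("GPTBot", "https://example.com/products/widget")) # True
```

Note that Python's parser applies rules in file order (first match wins) while Google applies the most specific (longest) matching rule; both conventions give the same answer for this file.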
Limitations of robots.txt
It Is Advisory, Not Enforced
robots.txt is a protocol based on good faith, and any bot can technically ignore it. However, major AI companies risk reputational damage and potential legal exposure if they violate robots.txt directives, so compliance is high among legitimate crawlers.
It Does Not Block Scraping
robots.txt prevents well-behaved bots from crawling. It does not prevent someone from manually copying your content or using tools that ignore robots.txt. For stronger protection, consider additional measures like rate limiting, authentication, or legal notices.
It Is Not Retroactive
Blocking GPTBot today does not remove content that was already crawled and ingested into a training dataset. It only prevents future crawling.
Additional Protection Measures
Meta Tags
The noai and noimageai values are an informal, emerging convention rather than part of the official robots meta-tag standard, and crawler support varies; treat them as a supplementary page-level signal:
<meta name="robots" content="noai, noimageai">
Use the seokit Meta Tag Generator to build these tags correctly.
HTTP Headers
Some sites use the X-Robots-Tag HTTP header for resources like PDFs and images that cannot contain meta tags.
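For example, in nginx you might attach the header to file types that cannot carry a meta tag (a sketch; the file-extension list is an assumption, and noai / noimageai carry the same limited-support caveat as the meta-tag form):

```nginx
# Send X-Robots-Tag on PDFs and images, which cannot embed <meta> tags
location ~* \.(pdf|jpe?g|png|gif|webp)$ {
    add_header X-Robots-Tag "noai, noimageai";
}
```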
Legal Notices
Add a clear statement to your terms of service prohibiting use of your content for AI training. This creates a legal basis for enforcement.
Generate Your robots.txt Now
Protecting your content from AI crawlers takes less than a minute. Use the seokit Robots.txt Generator to build a valid robots.txt file with AI bot blocking, sitemap references, and proper search engine access. Download it and upload it to your site’s root directory.