robots.txt Guide: Syntax, Examples & Common Mistakes

What Is robots.txt?

The robots.txt file is a plain text file placed at the root of your website (e.g., https://example.com/robots.txt) that tells search engine crawlers which parts of your site they are allowed to access. It follows the Robots Exclusion Protocol, a long-informal convention (standardized as RFC 9309 in 2022) that nearly all legitimate crawlers respect.

Despite its simplicity, robots.txt is one of the most commonly misconfigured files on the web. A single misplaced directive can accidentally block Google from indexing your most important pages, or conversely, expose administrative areas you intended to keep private. This robots.txt guide covers everything you need to know: syntax, practical examples, and the mistakes that hurt your SEO.

robots.txt Syntax Reference

User-agent

The User-agent directive specifies which crawler the following rules apply to. Use * to target all crawlers:

User-agent: *

To target a specific crawler:

User-agent: Googlebot

Disallow

The Disallow directive blocks a crawler from accessing a specific path:

Disallow: /admin/
Disallow: /private/

An empty Disallow: directive means nothing is blocked (the crawler can access everything):

User-agent: *
Disallow:

Allow

The Allow directive permits access to a specific path within a disallowed directory. Googlebot and most modern crawlers support this:

Disallow: /images/
Allow: /images/public/

This blocks all of /images/ except the /images/public/ subdirectory.

Sitemap

The Sitemap directive points crawlers to your XML sitemap:

Sitemap: https://example.com/sitemap.xml

You can include multiple sitemap references. Always use the full URL, not a relative path.
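You can verify that your sitemap references parse correctly with Python's standard-library robots.txt parser, urllib.robotparser. A minimal sketch, using placeholder URLs:

```python
from urllib import robotparser

ROBOTS = """\
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-blog.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# site_maps() (Python 3.8+) returns every Sitemap URL in file order.
print(rp.site_maps())
# → ['https://example.com/sitemap.xml', 'https://example.com/sitemap-blog.xml']
```

Sitemap lines are independent of any User-agent group, which is why the parser collects them separately from the access rules.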

Comments

Lines starting with # are comments:

# Block crawlers from admin area
Disallow: /admin/

Practical robots.txt Examples

Allow Everything (Default)

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

This is appropriate for most public websites. No pages are blocked, and the sitemap location is specified.

WordPress

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-json/
Disallow: /?s=
Disallow: /search/

Sitemap: https://example.com/sitemap.xml

This blocks the WordPress admin area (while allowing admin-ajax.php, which some themes need), core includes, plugin directories, and internal search result pages that can generate duplicate content.

Shopify

User-agent: *
Disallow: /admin
Disallow: /cart
Disallow: /orders
Disallow: /checkouts/
Disallow: /checkout
Disallow: /collections/*sort_by*
Disallow: /collections/*+*
Disallow: /collections/*%2B*
Disallow: /collections/*%2b*
Disallow: /search
Disallow: /account

Sitemap: https://example.com/sitemap.xml

Shopify-specific rules block administrative pages, cart/checkout flows, filtered collection URLs that create duplicate content, and search pages. Note that the * wildcard in these paths matches any sequence of characters; wildcard matching is supported by Google, Bing, and most major crawlers, but not by every parser.

Static Site (Astro, Next.js, Gatsby)

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

Static sites typically have no administrative areas or dynamic duplicate content, so an open robots.txt with just a sitemap reference is usually sufficient.

Blocking AI Crawlers

If you want to prevent AI training bots from scraping your content:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

Note that the general * rule still allows search engines to crawl. The specific AI bot rules only block named crawlers. Use the seokit robots.txt generator to create rules for all known AI crawlers.
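You can sanity-check per-crawler rules like these with Python's urllib.robotparser. A sketch with placeholder URLs, using a trimmed version of the file above:

```python
from urllib import robotparser

ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow:
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# Named AI crawlers are blocked everywhere...
print(rp.can_fetch("GPTBot", "https://example.com/article"))     # → False
print(rp.can_fetch("CCBot", "https://example.com/article"))      # → False
# ...while any crawler not named falls through to the open * rule.
print(rp.can_fetch("Googlebot", "https://example.com/article"))  # → True
```

This confirms the key property of the pattern: a user agent is matched against its own group if one exists, and only falls back to the * group otherwise.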

Common robots.txt Mistakes

Blocking Your Entire Site

User-agent: *
Disallow: /

This blocks all crawlers from your entire site. It is the correct choice during development or for private sites, but accidentally deploying it to production is catastrophic for SEO: once Google can no longer crawl your content, your pages will start dropping out of the index.

Fix: Always audit your robots.txt after deployment. Verify it with the robots.txt report in Google Search Console.

Blocking CSS and JavaScript Files

Disallow: /css/
Disallow: /js/

Modern search engines render pages to understand their content. Blocking CSS and JavaScript prevents rendering, which means Google sees your page as broken and cannot index it properly.

Fix: Remove these rules. There is no SEO benefit to blocking static assets.

Using robots.txt for Security

robots.txt is publicly accessible. Listing paths like /admin/, /secret-api/, or /backup/ in your robots.txt actually advertises these paths to anyone who reads the file. It is a security anti-pattern.

Fix: Use proper authentication and access controls for sensitive areas. robots.txt is for crawler management, not security.

Incorrect Path Syntax

Paths in robots.txt are case-sensitive and must start with /. These are not equivalent:

Disallow: /Admin/     # Blocks /Admin/ only
Disallow: /admin/     # Blocks /admin/ only

Crawlers match robots.txt paths case-sensitively regardless of how your server treats case. On case-sensitive servers (Linux), /Admin/ and /admin/ are different directories; on case-insensitive servers (Windows), both serve the same content, so a rule for one casing leaves the other crawlable. Always match the exact case used in your canonical URLs.
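Python's urllib.robotparser matches paths case-sensitively, like the major crawlers, which makes the mismatch easy to demonstrate (URLs are placeholders):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /admin/"])

# The rule matches only the exact casing written in the file.
print(rp.can_fetch("*", "https://example.com/admin/settings"))  # → False
print(rp.can_fetch("*", "https://example.com/Admin/settings"))  # → True (not blocked!)
```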

No Sitemap Directive

Omitting the Sitemap directive is not technically an error, but it misses an easy win. While you should also submit your sitemap through Google Search Console and Bing Webmaster Tools, the robots.txt sitemap directive provides an additional discovery mechanism.

Conflicting Rules

When multiple rules match a URL, most crawlers use the most specific rule (the one with the longest path match). However, behavior is not perfectly consistent across all crawlers:

Disallow: /products/
Allow: /products/featured/

Googlebot will allow /products/featured/ because the Allow rule is more specific. But some lesser-known crawlers may not handle this correctly. Test your rules to ensure they work as intended.
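The divergence is easy to observe: Python's standard-library parser, urllib.robotparser, applies the first matching rule in file order rather than the longest match, so it resolves this exact example differently than Googlebot does:

```python
from urllib import robotparser

RULES = """\
User-agent: *
Disallow: /products/
Allow: /products/featured/
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

# Googlebot allows this URL (longest match wins); Python's first-match
# parser hits Disallow: /products/ first and blocks it instead.
print(rp.can_fetch("*", "https://example.com/products/featured/"))  # → False
print(rp.can_fetch("*", "https://example.com/about"))               # → True
```

For rules like this, placing the Allow line before the Disallow line makes first-match parsers agree with Googlebot, which is a cheap way to keep the file portable.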

How to Test Your robots.txt

Google Search Console

The URL Inspection tool in Google Search Console shows whether a URL is blocked by robots.txt. For a comprehensive test, use the robots.txt report to validate your entire file.

seokit robots.txt Generator

The seokit robots.txt generator lets you build and validate your robots.txt file interactively. It generates the correct syntax for your CMS, includes AI bot blocking options, and validates the output before you deploy it.

Manual Verification

After deploying your robots.txt, visit https://yoursite.com/robots.txt in a browser to verify the file is accessible and contains the correct directives. Check that the file returns a 200 status code, not a 404 or 500.
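A small script can automate that spot check. This is a sketch, not official tooling; robots_txt_ok is a hypothetical helper name:

```python
import urllib.error
import urllib.request

def robots_txt_ok(base_url: str, timeout: float = 10.0) -> bool:
    """Return True if <base_url>/robots.txt answers with HTTP 200."""
    url = base_url.rstrip("/") + "/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers 404/500 (HTTPError), DNS failures, and timeouts.
        return False
```

A 404 or 500 makes urlopen raise HTTPError (a subclass of URLError), so the function reports any error response as a failure rather than crashing.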

robots.txt vs Meta Robots vs X-Robots-Tag

Understanding when to use each:

  • robots.txt — Controls whether crawlers can access specific URL paths. It prevents crawling but does not prevent indexing if the URL is discovered through links.
  • Meta robots tag — An HTML <meta> tag that controls whether a specific page should be indexed or followed. Use this to prevent indexing of pages that crawlers can access.
  • X-Robots-Tag — An HTTP header that serves the same function as the meta robots tag but can be applied to non-HTML files like PDFs and images.

For pages you want to keep out of search results, use the meta robots noindex tag rather than robots.txt. Blocking a page with robots.txt prevents Google from seeing the noindex tag, which can paradoxically result in the URL appearing in search results.
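Concretely, the two noindex mechanisms look like this (illustrative values; remember the crawler must be allowed to fetch the page to see either one):

```
In the page's HTML <head>:

    <meta name="robots" content="noindex">

As an HTTP response header, which also works for PDFs and images:

    X-Robots-Tag: noindex
```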

Conclusion

Your robots.txt file is a small but powerful part of your SEO infrastructure. Get it right and crawlers efficiently index your important content. Get it wrong and you could be invisible to search engines or exposing paths you meant to keep hidden.

Generate a properly formatted robots.txt with the seokit robots.txt generator, validate your meta tags with the meta tag generator, and audit your overall technical SEO setup to ensure crawlers can access everything they need.