
robots.txt (RFC 9309): The Web's Crawl Access Control

RFC 9309 standardizes the Robots Exclusion Protocol, defining how publishers control crawler access to their content. This guide covers the specification's technical details, matching rules, error handling, and how it complements AIPREF usage preferences.

Jithin Raj & Originary Team

Summary

The Robots Exclusion Protocol, standardized as RFC 9309 in September 2022, is the web's mechanism for crawl access control. It tells automated clients (crawlers, bots, agents) which URL paths they may fetch from an origin.

Critical: robots.txt is not access authorization or security. It is a cooperative signal. Listing paths in robots.txt makes them discoverable. Use real authentication for sensitive resources.

RFC 9309 clarifies syntax, matching rules, error handling, and caching behavior that were ambiguous in the original 1994 specification. AIPREF builds on this foundation by adding usage preference semantics via HTTP headers and robots.txt directives.

What RFC 9309 Standardizes

Location and Format

  • Served at /robots.txt from the origin root
  • Must be UTF-8 encoded
  • Content-Type must be text/plain
  • Crawlers must be able to parse at least 500 kibibytes of the file (a fetch-and-parse sketch follows this list)
  • The path /robots.txt is always implicitly allowed
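
To see these constraints in practice, here is a minimal fetch-and-parse sketch in Python. It is a rough illustration only: the example.com URL is a placeholder and error handling is omitted. It reads at most 500 KiB of the body, checks the media type, and decodes the content as UTF-8.

# Sketch: fetch /robots.txt and apply the format constraints above.
# example.com is a placeholder; adapt the URL and add error handling as needed.
from urllib.request import urlopen

MAX_PARSE_BYTES = 500 * 1024  # crawlers must parse at least 500 KiB

with urlopen("https://example.com/robots.txt") as resp:
    media_type = resp.headers.get_content_type()   # expect "text/plain"
    body = resp.read(MAX_PARSE_BYTES)              # ignore anything past the limit
    text = body.decode("utf-8", errors="replace")  # the file must be UTF-8

print(media_type, len(text.splitlines()), "lines parsed")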

Groups and Rules

A robots.txt file consists of one or more groups. Each group:

  • Begins with one or more User-agent: lines specifying which crawlers the rules apply to (matching of the User-agent token is case-insensitive)
  • Contains Allow: and Disallow: rules for URL path patterns

If no group applies to a crawler, all access is allowed by default, as the sketch below illustrates.
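
As a rough sketch of how a crawler might select its group (the parsed groups mapping and the helper name are assumptions standing in for a real parser), the snippet below matches the User-agent token case-insensitively, falls back to the * group, and treats a missing group as allow-all.

# Sketch: choose the applicable group for a crawler (simplified).
# `groups` (User-agent token -> list of rules) stands in for a parsed robots.txt file.
def select_group(groups: dict[str, list[str]], product_token: str) -> list[str]:
    token = product_token.lower()
    for agent, rules in groups.items():
        if agent.lower() == token:        # User-agent matching is case-insensitive
            return rules
    return groups.get("*", [])            # no applicable group: empty rules mean allow-all

groups = {
    "GPTBot": ["Disallow: /"],
    "*": ["Allow: /"],
}
print(select_group(groups, "gptbot"))      # ['Disallow: /']
print(select_group(groups, "ExampleBot"))  # ['Allow: /']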

Matching Rules

RFC 9309 defines precise matching behavior:

  • Case-sensitive path matching: /private and /Private are different paths (an evaluation sketch follows this list)
  • Longest match wins: Most specific rule applies when multiple patterns match
  • Wildcard *: Matches zero or more characters
  • End anchor $: Matches end of URL path
  • When Allow and Disallow have equal specificity: Allow takes precedence
  • Comments: Lines starting with # are ignored
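
The sketch below approximates this evaluation in Python. It is a simplification, not a complete RFC 9309 parser: specificity is measured as pattern length, and the wildcard-to-regex translation is naive.

# Sketch: evaluate Allow/Disallow rules with the matching semantics above
# (longest match wins, Allow wins ties, * and $ wildcards). Simplified.
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    # Escape the pattern, then restore * (zero or more characters) and a trailing $ (end anchor).
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = re.escape(body).replace(r"\*", ".*")
    return re.compile("^" + regex + ("$" if anchored else ""))

def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    # rules: list of ("allow" | "disallow", pattern); paths are matched case-sensitively.
    best_len, allowed = -1, True          # no matching rule: access is allowed
    for kind, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            length = len(pattern)
            if length > best_len or (length == best_len and kind == "allow"):
                best_len, allowed = length, (kind == "allow")
    return allowed

rules = [("disallow", "/private/"), ("allow", "/private/press/")]
print(is_allowed(rules, "/private/notes.html"))       # False
print(is_allowed(rules, "/private/press/2024.html"))  # True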

Fetch Errors and Caching

RFC 9309 provides clear guidance for handling fetch errors and caching:

  • 4xx status (file unavailable or doesn't exist): the crawler MAY access any resources (no restrictions)
  • 5xx status (server or network error): treat as complete disallow until the file is reachable again
  • Redirects (file has moved): follow up to a reasonable limit and evaluate the file in the context of the original origin
  • Caching: avoid frequent refetches; cache for up to 24 hours, longer if the file is unreachable; standard HTTP cache-control applies

Important: 4xx vs 5xx Semantics

A 404 Not Found means "no restrictions": crawlers may proceed. A 503 Service Unavailable means "assume everything is disallowed" until the file becomes reachable again. This distinction is critical for correct crawler behavior.
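
One compact way to encode this distinction is a status-to-policy mapping, sketched below with illustrative policy names that are not taken from the RFC.

# Sketch: map robots.txt fetch outcomes to a crawl policy per the table above.
# The CrawlPolicy names are illustrative, not RFC terminology.
from enum import Enum

class CrawlPolicy(Enum):
    USE_FILE = "apply the rules in the fetched file"
    ALLOW_ALL = "no restrictions"
    DISALLOW_ALL = "assume complete disallow"

def policy_for_status(status: int) -> CrawlPolicy:
    # Redirects (3xx) are assumed to have been followed before this point.
    if 200 <= status < 300:
        return CrawlPolicy.USE_FILE
    if 400 <= status < 500:              # unavailable (e.g. 404): crawler may access anything
        return CrawlPolicy.ALLOW_ALL
    if 500 <= status < 600:              # unreachable (e.g. 503): treat as complete disallow
        return CrawlPolicy.DISALLOW_ALL
    return CrawlPolicy.DISALLOW_ALL      # conservative default for anything unexpected

print(policy_for_status(404).value)  # no restrictions
print(policy_for_status(503).value)  # assume complete disallow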

What robots.txt Does NOT Do

Security Warning

RFC 9309 explicitly states: "These rules are not a form of access authorization."

  • It does not provide authentication or authorization. Malicious actors can ignore robots.txt. Use proper authentication (passwords, tokens, sessions) for sensitive resources.
  • Listing paths exposes them publicly. A line like Disallow: /admin/ tells everyone your admin panel is at /admin/
  • It does not control usage after access. robots.txt only controls whether a crawler fetches content. It says nothing about training, indexing, or other downstream usage. That's where AIPREF comes in.

How AIPREF Complements robots.txt

RFC 9309 handles crawl access. AIPREF (draft-ietf-aipref-attach) adds usage preference semantics. They work together:

robots.txt Role

Controls which URL paths crawlers may fetch. Binary yes/no decision per path.

AIPREF Role

Expresses how content may be used after access (training, search, etc.) via Content-Usage headers and robots.txt directives.

Combined Example

User-agent: *
Allow: /
Disallow: /internal/
Content-Usage: train-ai=n
Content-Usage: /public/ train-ai=y

This configuration keeps /internal/ off limits to crawlers (RFC 9309), while expressing usage preferences: default no AI training, but training allowed for /public/ (AIPREF). The AIPREF draft explicitly updates RFC 9309 to add the Content-Usage directive.
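
On the HTTP side, the same preference can be attached as a response header. The sketch below is a toy Python server that sets a Content-Usage header; the handler and the header value (reusing the y/n form from the example above) are illustrative, so consult the AIPREF drafts for the normative header syntax.

# Sketch: attach a Content-Usage response header alongside served content.
# The header value mirrors the robots.txt example above; treat it as illustrative,
# since the normative syntax is defined by the AIPREF drafts.
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Usage", "train-ai=n")  # default preference: no AI training
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()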

Copy-Paste Cookbook

1. Minimal Allow All

User-agent: *
Allow: /

Explicitly allows all crawlers to access all paths.

2. Block Subtree with Carve-Out

User-agent: *
Disallow: /private/
Allow: /private/press/

Blocks /private/ but allows /private/press/ (longest match wins).

3. Wildcards and End Anchors

User-agent: *
Disallow: /*.bak$
Disallow: /tmp/*
Allow: /tmp/public/

* matches zero or more characters; $ anchors the pattern to the end of the URL path. Note that path patterns must begin with /, which is why the .bak rule starts with /*.
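
As a quick check of how these two patterns behave, the snippet below approximates them with regular expressions; the translation is a simplification of RFC 9309 matching, intended only to show which paths the rules catch.

# Sketch: approximate the two Disallow patterns above with regexes.
import re

bak_rule = re.compile(r"^/.*\.bak$")   # Disallow: /*.bak$
tmp_rule = re.compile(r"^/tmp/.*")     # Disallow: /tmp/*

print(bool(bak_rule.match("/backups/site.bak")))     # True: blocked
print(bool(bak_rule.match("/backups/site.bak.gz")))  # False: $ requires the path to end in .bak
print(bool(tmp_rule.match("/tmp/cache/x")))          # True: blocked
print(bool(tmp_rule.match("/tmp/public/report")))    # True here, but the Allow carve-out wins (longest match)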

4. Target Specific Crawler

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /

Blocks GPTBot specifically while allowing all other crawlers.

5. Combine Crawl Control with AIPREF Preferences

User-agent: *
Allow: /
Disallow: /private/
Content-Usage: train-ai=n, search=y
Content-Usage: /research/ train-ai=y

Combines RFC 9309 crawl rules with AIPREF usage preferences for path-specific control.

Quick Testing Checklist

  1. Verify the file is accessible (a script automating this and the next check follows the checklist):
    curl -sI https://example.com/robots.txt

    Should return 200 OK with Content-Type: text/plain

  2. Check UTF-8 encoding: Ensure file is saved as UTF-8, not Latin-1 or other encodings
  3. Validate rule precedence: Test URLs where Allow and Disallow patterns overlap to confirm longest-match behavior
  4. Test error scenarios: Verify 4xx returns allow-all behavior, 5xx returns disallow-all
  5. If using AIPREF: Confirm Content-Usage lines are within the correct group and properly formatted
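
The script below automates the first two checklist items. It is a convenience check under stated assumptions (example.com is a placeholder), not a conformance test.

# Sketch: automate checklist items 1-2 (accessibility, media type, UTF-8 encoding).
# example.com is a placeholder; point the function at your own origin.
from urllib.request import urlopen
from urllib.error import HTTPError

def check_robots(origin: str) -> None:
    url = origin.rstrip("/") + "/robots.txt"
    try:
        with urlopen(url) as resp:
            print("status:", resp.status)                            # expect 200
            print("content-type:", resp.headers.get_content_type())  # expect text/plain
            resp.read().decode("utf-8")                              # raises if not valid UTF-8
            print("encoding: valid UTF-8")
    except HTTPError as err:
        # Per RFC 9309: 4xx means no restrictions, 5xx means assume complete disallow.
        print("fetch failed with status:", err.code)
    except UnicodeDecodeError:
        print("encoding: not valid UTF-8")

check_robots("https://example.com")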

Non-Standard Extensions

Some crawlers support additional directives that are not part of RFC 9309:

  • Crawl-delay: - Rate limiting (supported by some crawlers, not standard)
  • Sitemap: - Sitemap location (widely supported, not in RFC 9309)
  • Host: - Preferred host (not standard)

Use these with caution. They may be ignored by some crawlers and are not guaranteed to work consistently.

Bottom Line

Keep robots.txt as your durable control surface for crawler access. RFC 9309 makes the rules predictable under redirects, errors, and caching.

Use AIPREF to express how content may be used after access. Together, they reduce ambiguity for publishers and responsible crawlers.

Remember: robots.txt is cooperative signaling, not security. Use real authentication for sensitive resources.


Need help implementing robots.txt and AIPREF?

Learn how Originary helps publishers combine crawl control with verifiable usage preferences and cryptographic receipts.