robots.txt (RFC 9309): The Web's Crawl Access Control
RFC 9309 standardizes the Robots Exclusion Protocol, defining how publishers control crawler access to their content. This guide covers the specification's technical details, matching rules, error handling, and how it complements AIPREF usage preferences.
Summary
The Robots Exclusion Protocol, standardized as RFC 9309 in September 2022, is the web's mechanism for crawl access control. It tells automated clients (crawlers, bots, agents) which URL paths they may fetch from an origin.
Critical: robots.txt is not access authorization or security. It is a cooperative signal. Listing paths in robots.txt makes them discoverable. Use real authentication for sensitive resources.
RFC 9309 clarifies syntax, matching rules, error handling, and caching behavior that were ambiguous in the original 1994 specification. AIPREF builds on this foundation by adding usage preference semantics via HTTP headers and robots.txt directives.
What RFC 9309 Standardizes
Location and Format
- Served at `/robots.txt` from the origin root
- Must be UTF-8 encoded
- Content-Type must be `text/plain`
- Crawlers must parse at least the first 500 kibibytes, so keep the file within that size
- The path `/robots.txt` is always implicitly allowed
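A quick way to confirm both the location and the declared media type is a headers-only request; this is a minimal check using `example.com` as a placeholder origin:

```
curl -sI https://example.com/robots.txt

# Expected response headers (abridged):
# HTTP/1.1 200 OK
# Content-Type: text/plain; charset=utf-8
```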
Groups and Rules
A robots.txt file consists of one or more groups. Each group:
- Begins with one or more `User-agent:` lines specifying which crawlers the rules apply to
- Contains `Allow:` and `Disallow:` rules for URL path patterns
- User-agent matching is case-insensitive
- If no applicable group is found, all access is allowed by default
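As an illustrative sketch of the group structure (the crawler name here is a placeholder), a file with one group for a specific bot and a catch-all group for everyone else might look like this:

```
# Group 1: applies only to a crawler identifying itself as "ExampleBot" (placeholder name)
User-agent: ExampleBot
Disallow: /drafts/

# Group 2: catch-all group for all other crawlers
User-agent: *
Allow: /
Disallow: /internal/
```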
Matching Rules
RFC 9309 defines precise matching behavior:
- Case-sensitive path matching: `/private` ≠ `/Private`
- Longest match wins: the most specific (longest) matching rule applies when multiple patterns match
- Wildcard `*`: matches zero or more characters
- End anchor `$`: matches the end of the URL path
- When Allow and Disallow matches have equal specificity: Allow takes precedence
- Comments: lines starting with `#` are ignored
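The precedence logic is easy to approximate in code. The following is a simplified Python sketch of longest-match-wins with an Allow tie-break; it translates `*` and `$` into a regular expression and ignores details such as percent-encoding normalization, so treat it as an illustration rather than a full RFC 9309 parser:

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Test a robots.txt path pattern (supporting * and $) against a URL path."""
    regex = "".join(c if c in "*$" else re.escape(c) for c in pattern)
    regex = regex.replace("*", ".*")
    if regex.endswith("$"):
        regex = regex[:-1] + r"\Z"       # $ anchors the pattern to the end of the path
    return re.match(regex, path) is not None

def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """rules is a list of ('allow' | 'disallow', pattern) pairs for one group."""
    best_len, verdict = -1, True         # no matching rule means access is allowed
    for kind, pattern in rules:
        if rule_matches(pattern, path):
            longer = len(pattern) > best_len
            tie_to_allow = len(pattern) == best_len and kind == "allow"
            if longer or tie_to_allow:
                best_len, verdict = len(pattern), (kind == "allow")
    return verdict

# /private/ is blocked, but the longer /private/press/ rule carves it back out
rules = [("disallow", "/private/"), ("allow", "/private/press/")]
print(is_allowed(rules, "/private/report.pdf"))     # False
print(is_allowed(rules, "/private/press/kit.zip"))  # True
```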
Fetch Errors and Caching
RFC 9309 provides clear guidance for handling fetch errors and caching:
| Situation | Meaning | Crawler Behavior |
|---|---|---|
| 4xx status | File unavailable or doesn't exist | Crawler MAY access any resources (no restrictions) |
| 5xx status | Server or network error | Treat as complete disallow until reachable |
| Redirects | File has moved | Follow up to a reasonable limit, evaluate in origin context |
| Caching | Avoid frequent refetches | Cache up to 24 hours; may extend if unreachable; standard HTTP cache-control applies |
Important: 4xx vs 5xx Semantics
A 404 Not Found means "no restrictions" - crawlers may proceed. A 503 Service Unavailable means "assume everything is disallowed" until the file is reachable. This distinction is critical for proper crawler behavior.
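A hedged sketch of how a crawler might apply this distinction when fetching robots.txt, using only Python's standard library; a production crawler would also handle redirect limits and caching, which are omitted here:

```python
import urllib.error
import urllib.request

def fetch_robots_policy(origin: str) -> str:
    """Return 'allow_all', 'disallow_all', or the robots.txt body for parsing."""
    url = origin.rstrip("/") + "/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            # RFC 9309 requires parsing at least 500 kibibytes of the file
            return resp.read(500 * 1024).decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        if 400 <= err.code < 500:
            return "allow_all"       # 4xx: file unavailable, no restrictions apply
        return "disallow_all"        # 5xx: assume everything is disallowed for now
    except urllib.error.URLError:
        return "disallow_all"        # network error: same conservative treatment as 5xx
```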
What robots.txt Does NOT Do
Security Warning
RFC 9309 explicitly states: "These rules are not a form of access authorization."
- It does not provide authentication or authorization. Malicious actors can ignore robots.txt. Use proper authentication (passwords, tokens, sessions) for sensitive resources.
- Listing paths exposes them publicly. A line like `Disallow: /admin/` tells everyone your admin panel is at `/admin/`.
- It does not control usage after access. robots.txt only controls whether a crawler fetches content. It says nothing about training, indexing, or other downstream usage. That's where AIPREF comes in.
How AIPREF Complements robots.txt
RFC 9309 handles crawl access. AIPREF (draft-ietf-aipref-attach) adds usage preference semantics. They work together:
robots.txt Role
Controls which URL paths crawlers may fetch. Binary yes/no decision per path.
AIPREF Role
Expresses how content may be used after access (training, search, etc.) via Content-Usage headers and robots.txt directives.
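As an illustration, an HTTP response carrying such a preference might look like the following; the `Content-Usage` header name comes from the AIPREF attach draft, while the value syntax here simply mirrors this guide's examples and should be checked against the current draft:

```
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Content-Usage: train-ai=n, search=y
```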
Combined Example
```
User-agent: *
Allow: /
Disallow: /internal/
Content-Usage: train-ai=n
Content-Usage: /public/ train-ai=y
```
This configuration keeps /internal/ off limits to crawlers (RFC 9309), while expressing usage preferences: default no AI training, but training allowed for /public/ (AIPREF). The AIPREF draft explicitly updates RFC 9309 to add the Content-Usage directive.
Copy-Paste Cookbook
1. Minimal Allow All
```
User-agent: *
Allow: /
```
Explicitly allows all crawlers to access all paths.
2. Block Subtree with Carve-Out
```
User-agent: *
Disallow: /private/
Allow: /private/press/
```
Blocks /private/ but allows /private/press/ (longest match wins).
3. Wildcards and End Anchors
```
User-agent: *
Disallow: /*.bak$
Disallow: /tmp/*
Allow: /tmp/public/
```
`*` matches any characters, `$` anchors to the end of the path. Note the leading `/` in `/*.bak$`: RFC 9309 path patterns begin with `/`.
4. Target Specific Crawler
```
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
```
Blocks GPTBot specifically while allowing all other crawlers.
5. Combine Crawl Control with AIPREF Preferences
```
User-agent: *
Allow: /
Disallow: /private/
Content-Usage: train-ai=n, search=y
Content-Usage: /research/ train-ai=y
```
Combines RFC 9309 crawl rules with AIPREF usage preferences for path-specific control.
Quick Testing Checklist
- Verify the file is accessible: `curl -sI https://example.com/robots.txt` should return `200 OK` with `Content-Type: text/plain`
- Check UTF-8 encoding: Ensure the file is saved as UTF-8, not Latin-1 or another encoding
- Validate rule precedence: Test URLs where Allow and Disallow patterns overlap to confirm longest-match behavior (see the sketch after this list)
- Test error scenarios: Verify that 4xx responses lead to allow-all behavior and 5xx responses to disallow-all
- If using AIPREF: Confirm `Content-Usage` lines sit within the correct group and are properly formatted
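For the precedence check, Python's built-in `urllib.robotparser` makes a quick, rough harness. Be aware that it predates RFC 9309 and uses simple first-match, prefix-only rules (no `*` or `$` support), so list more specific Allow lines before broader Disallow lines if you want its answers to agree with longest-match semantics:

```python
from urllib.robotparser import RobotFileParser

# Parse rules from a string instead of fetching them over the network
parser = RobotFileParser()
parser.parse("""
User-agent: *
Allow: /private/press/
Disallow: /private/
""".splitlines())

print(parser.can_fetch("*", "https://example.com/private/report.pdf"))     # False
print(parser.can_fetch("*", "https://example.com/private/press/kit.zip"))  # True
```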
Non-Standard Extensions
Some crawlers support additional directives that are not part of RFC 9309:
- `Crawl-delay:` - Rate limiting (supported by some crawlers, not standard)
- `Sitemap:` - Sitemap location (widely supported, not in RFC 9309)
- `Host:` - Preferred host (not standard)
Use these with caution. They may be ignored by some crawlers and are not guaranteed to work consistently.
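For reference, this is roughly how those lines are written when they are used; the sitemap URL is a placeholder, and per-crawler support varies:

```
User-agent: *
Crawl-delay: 10
Allow: /

Sitemap: https://example.com/sitemap.xml
```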
Bottom Line
Keep robots.txt as your durable control surface for crawler access. RFC 9309 makes the rules predictable under redirects, errors, and caching.
Use AIPREF to express how content may be used after access. Together, they reduce ambiguity for publishers and responsible crawlers.
Remember: robots.txt is cooperative signaling, not security. Use real authentication for sensitive resources.
Further Reading
Need help implementing robots.txt and AIPREF?
Learn how Originary helps publishers combine crawl control with verifiable usage preferences and cryptographic receipts.