robots.txt (RFC 9309): crawl access control for the web
Critical: robots.txt is not access authorization or security. It is a cooperative signal. Listing paths in robots.txt makes them discoverable. Use real authentication for sensitive resources.
RFC 9309 clarifies syntax, matching rules, error handling, and caching behavior that were ambiguous in the original 1994 specification. AIPREF builds on this foundation by adding usage preference semantics via HTTP headers and robots.txt directives.
What RFC 9309 standardizes
Location and format
- Served at /robots.txt from the origin root.
- Must be UTF-8 encoded.
- Content-Type must be text/plain.
- Crawlers must be able to parse at least 500 kibibytes of the file.
- The path /robots.txt is always implicitly allowed.
Groups and rules
A robots.txt file consists of one or more groups. Each group:
- Begins with one or more User-agent: lines specifying which crawlers the rules apply to.
- Contains Allow: and Disallow: rules for URL path patterns.
- User-agent matching is case-insensitive.
- If no group matches a crawler and there is no * group, no rules apply and all access is allowed.
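The group rules above can be sketched in Python. This is a minimal illustration rather than a complete parser: the names `Group`, `parse_groups`, and `select_rules` are hypothetical, the crawler's product token is matched by exact case-insensitive comparison, and directives other than User-agent, Allow, and Disallow are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """One robots.txt group: its user-agent tokens and its rules."""
    agents: list[str]
    # Each rule is ("allow" | "disallow", path pattern).
    rules: list[tuple[str, str]] = field(default_factory=list)


def parse_groups(text: str) -> list[Group]:
    """Split robots.txt text into groups (simplified sketch)."""
    groups: list[Group] = []
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            # Consecutive User-agent lines share one group; a User-agent
            # line that follows rules starts a new group.
            if current is None or current.rules:
                current = Group(agents=[])
                groups.append(current)
            current.agents.append(value.lower())
        elif key in ("allow", "disallow") and current is not None:
            current.rules.append((key, value))
    return groups


def select_rules(groups: list[Group], product_token: str) -> list[tuple[str, str]]:
    """Rules for this crawler: its own groups, else the '*' groups, else none."""
    token = product_token.lower()
    specific = [g for g in groups if token in g.agents]
    chosen = specific or [g for g in groups if "*" in g.agents]
    return [rule for g in chosen for rule in g.rules]
```

An empty result from `select_rules` corresponds to the "no rules apply" case, so the caller would treat every path as allowed.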
Matching rules
RFC 9309 defines precise matching behavior:
- Case-sensitive path matching. /private is not the same as /Private.
- Longest match wins. The most specific rule applies when multiple patterns match.
- Wildcard *. Matches zero or more characters.
- End anchor $. Matches the end of the URL path.
- Equal specificity. When an Allow rule and a Disallow rule match equally, Allow takes precedence.
- Comments. Lines starting with # are ignored.
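As a rough illustration of these matching rules, the sketch below translates each pattern into a regular expression and picks the longest matching rule, preferring Allow on a tie. The function names are hypothetical, rules are assumed to arrive as ("allow" | "disallow", pattern) pairs, pattern length is counted in characters rather than octets, and percent-encoding normalization is left out.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex matched from the path start."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape literal text, then let each '*' match zero or more characters.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + (r"\Z" if anchored else ""))


def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Apply longest-match-wins with Allow winning ties; default is allow."""
    if path == "/robots.txt":
        return True  # implicitly allowed
    best_len = -1
    best_allow = True
    for kind, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            allow = kind == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, best_allow = len(pattern), allow
    return best_allow
```

For example, with the rules `[("disallow", "/private/"), ("allow", "/private/press/")]`, a path under /private/press/ matches both patterns and the longer Allow pattern decides the outcome.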
Fetch errors and caching
RFC 9309 provides clear guidance for handling fetch errors and caching:
4xx status
Meaning: file unavailable or does not exist. Crawler behavior: crawler MAY access any resources (no restrictions).
5xx status
Meaning: server or network error. Crawler behavior: treat as complete disallow until reachable.
Redirects
Meaning: file has moved. Crawler behavior: follow at least five consecutive redirects, then apply the rules found in the context of the original host.
Caching
Meaning: avoid frequent refetches. Crawler behavior: cache up to 24 hours; may extend if unreachable; standard HTTP cache-control applies.
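The error and caching behavior above can be expressed as a small decision function. This is a sketch under stated assumptions: `Policy`, `policy_for_status`, and `cache_is_fresh` are hypothetical names, the status is the final one after redirects have been followed, and any status outside 2xx and 4xx is conservatively treated like a 5xx.

```python
from enum import Enum


class Policy(Enum):
    USE_RULES = "parse the file and apply its rules"
    ALLOW_ALL = "no restrictions"
    DISALLOW_ALL = "assume complete disallow"


MAX_CACHE_SECONDS = 24 * 60 * 60  # 24 hours


def policy_for_status(status: int) -> Policy:
    """Map the final HTTP status of the robots.txt fetch to a crawl policy."""
    if 200 <= status < 300:
        return Policy.USE_RULES
    if 400 <= status < 500:
        return Policy.ALLOW_ALL
    # 5xx, plus anything unexpected (an assumption of this sketch).
    return Policy.DISALLOW_ALL


def cache_is_fresh(age_seconds: float, origin_reachable: bool) -> bool:
    """A cached copy is usable for 24 hours, or longer while the origin is unreachable."""
    if not origin_reachable:
        return True
    return age_seconds < MAX_CACHE_SECONDS
```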
A 404 Not Found means "no restrictions": crawlers may proceed. A 503 Service Unavailable means "assume everything is disallowed" until the file is reachable. This distinction is critical for proper crawler behavior.
What robots.txt does NOT do
Security warning. RFC 9309 explicitly states: "These rules are not a form of access authorization."
- It does not provide authentication or authorization. Malicious actors can ignore robots.txt. Use proper authentication (passwords, tokens, sessions) for sensitive resources.
- Listing paths exposes them publicly. A line like Disallow: /admin/ tells everyone your admin panel is at /admin/.
- It does not control usage after access. robots.txt only controls whether a crawler fetches content. It says nothing about training, indexing, or other downstream usage. That is where AIPREF comes in.
How AIPREF complements robots.txt
RFC 9309 handles crawl access. AIPREF (draft-ietf-aipref-attach) adds usage preference semantics. They work together:
robots.txt role
Controls which URL paths crawlers may fetch. Binary yes/no decision per path.
AIPREF role
Expresses how content may be used after access (training, search, etc.) via Content-Usage headers and robots.txt directives.
Combined example
    User-agent: *
    Allow: /
    Disallow: /internal/
    Content-Usage: train-ai=n
    Content-Usage: /public/ train-ai=y
This configuration keeps /internal/ off limits to crawlers (RFC 9309), while expressing usage preferences: default no AI training, but training allowed for /public/ (AIPREF). The AIPREF draft explicitly updates RFC 9309 to add the Content-Usage directive.
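To show how a crawler might read the Content-Usage lines in this example, here is a small parsing sketch. It follows only the line shapes shown in this document (an optional leading path, then comma-separated key=value pairs) and is not derived from the AIPREF draft's grammar; the function name `parse_content_usage` is hypothetical.

```python
from typing import Optional


def parse_content_usage(value: str) -> tuple[Optional[str], dict[str, str]]:
    """Parse the value of one Content-Usage line.

    Returns (path, preferences). The path is None when the line carries no
    path, which this sketch takes to mean the default for the group.
    """
    value = value.strip()
    path = None
    if value.startswith("/"):
        # Assumption: a leading '/' token is a path, separated by a space.
        path, _, value = value.partition(" ")
    preferences = {}
    for item in value.split(","):
        key, sep, setting = item.strip().partition("=")
        if sep:
            preferences[key.strip()] = setting.strip()
    return path, preferences
```

Under these assumptions, the value `/public/ train-ai=y` yields the path /public/ with train-ai set to y, while `train-ai=n` yields no path and train-ai set to n.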
Copy-paste cookbook
1. Minimal allow all
    User-agent: *
    Allow: /
Explicitly allows all crawlers to access all paths.
2. Block subtree with carve-out
    User-agent: *
    Disallow: /private/
    Allow: /private/press/
Blocks /private/ but allows /private/press/ (longest match wins).
3. Wildcards and end anchors
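    User-agent: *
    Disallow: /*.bak$
    Disallow: /tmp/*
    Allow: /tmp/public/
* matches zero or more characters, and $ anchors the pattern to the end of the path. RFC 9309 path patterns begin with /, so the extension rule is written /*.bak$ rather than *.bak$.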
4. Target specific crawler
    User-agent: GPTBot
    Disallow: /

    User-agent: *
    Allow: /
Blocks GPTBot specifically while allowing all other crawlers.
5. Combine crawl control with AIPREF preferences
    User-agent: *
    Allow: /
    Disallow: /private/
    Content-Usage: train-ai=n, search=y
    Content-Usage: /research/ train-ai=y
Combines RFC 9309 crawl rules with AIPREF usage preferences for path-specific control.
Quick testing checklist
- Verify the file is accessible: curl -sI https://example.com/robots.txt
  It should return 200 OK with Content-Type: text/plain.
- Check UTF-8 encoding. Ensure the file is saved as UTF-8, not Latin-1 or another encoding.
- Validate rule precedence. Test URLs where Allow and Disallow patterns overlap to confirm longest-match behavior.
- Test error scenarios. Confirm that a 4xx response is treated as allow-all and a 5xx response as disallow-all.
- If using AIPREF. Confirm Content-Usage lines are within the correct group and properly formatted.
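The first two checks can also be scripted. The sketch below uses only the Python standard library; `check_robots_response` and `fetch_and_check` are hypothetical helper names, and the checks cover status, media type, and UTF-8 validity only.

```python
import urllib.error
import urllib.request


def check_robots_response(status: int, content_type: str, body: bytes) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if status != 200:
        problems.append(f"expected 200 OK, got {status}")
    # Ignore parameters such as "; charset=utf-8" when comparing media types.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "text/plain":
        problems.append(f"expected text/plain, got {media_type or 'no Content-Type'}")
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("body is not valid UTF-8")
    return problems


def fetch_and_check(url: str) -> list[str]:
    """Fetch a robots.txt URL and run the checks above (needs network access)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return check_robots_response(
                response.status,
                response.headers.get("Content-Type", ""),
                response.read(),
            )
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses arrive as exceptions; check them the same way.
        return check_robots_response(
            error.code, error.headers.get("Content-Type", ""), error.read()
        )
```

Calling `fetch_and_check("https://example.com/robots.txt")` would return an empty list when the response passes all three checks. Keeping the checks in a separate pure function makes them easy to run against saved responses without network access.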
Non-standard extensions
Some crawlers support additional directives that are not part of RFC 9309:
- Crawl-delay: Rate limiting (supported by some crawlers, not standard).
- Sitemap: Sitemap location (widely supported, not in RFC 9309).
- Host: Preferred host (not standard).
Use these with caution. They may be ignored by some crawlers and are not guaranteed to work consistently.
What PEAC does not do
- PEAC does not author or maintain RFC 9309; that work belongs to the IETF.
- PEAC does not enforce robots.txt upstream of the publisher; enforcement stays at the origin and its infrastructure.
- PEAC does not block crawlers, throttle requests, or replace WAF, CDN, or auth rules.
- PEAC does not assert that a crawler obeyed robots.txt; it carries a signed record of what the agent attested at the boundary.
- PEAC does not replace AIPREF; it composes with AIPREF so adherence can be recorded and verified offline.
Bottom line
Keep robots.txt as your durable control surface for crawler access. RFC 9309 makes the rules predictable under redirects, errors, and caching.
Use AIPREF to express how content may be used after access. Together, they reduce ambiguity for publishers and responsible crawlers.
Remember: robots.txt is cooperative signaling, not security. Use real authentication for sensitive resources.
Further reading
- RFC 9309: Robots Exclusion Protocol - official IETF specification (September 2022).
- RFC 9309 on IETF Datatracker - full text with errata and discussion.
- AIPREF Attachment Specification - how AIPREF extends RFC 9309 with Content-Usage.
- AIPREF: AI Usage Preferences - comprehensive guide to the AIPREF specification.