AI bot detection and crawler signals
"AI detection" is having a moment. But most people mean one of two things:
- Content authenticity. Is this content real, or did an AI model generate or alter it?
- Traffic detection. Is this visitor a human, or an AI bot quietly crawling my site or API?
Those are different jobs. Both matter. Both are easy to get wrong if you rely only on classifiers and vibes.
Originary takes a different view: every time an AI system touches your data, there should be a clear, verifiable trail of what happened. That trail needs to work for developers, auditors, partners, and automated agents at the same time.
1. What "agent and crawler identification" really covers
People often bundle three separate capabilities under "AI detection":
Fake vs real (content authenticity)
Classifying whether a text, image, audio, or video file was generated or altered by an AI model, usually with a probability score.
Model fingerprinting (who generated this)
Inferring which model family or vendor likely produced the artifact, or using watermarks and statistical fingerprints to attribute it.
Bot and agent detection (who is calling me)
Detecting that an incoming request is from an AI agent or crawler, not from a person in a browser, and understanding which agent, under what declared purpose.
Agent and crawler identification is the missing visibility layer between your content and the growing universe of AI crawlers, copilots, and headless agents.
2. Why "detection-only" is not enough
There is real value in content-level detection and model fingerprinting. But they have hard limits:
- It is an arms race. As models improve, naive classifiers become less reliable. A detector that feels strong this quarter may be unreliable next quarter. (We have seen 20%+ false positive drops in under 6 months.)
- Scores are not proof. A "0.84 likelihood of AI" score is a hint. It is not a signed record that will stand up in an audit, complaint, or partner review.
- No policy, no economics. Even if you know something is AI-generated, that does not tell you whether the agent respected your usage policy, paid you for access, or is allowed to keep the data.
- Detection lag. By the time you detect unauthorized AI training on your content, the model is already deployed. You cannot un-train it.
Enterprises, regulators, and serious publishers need more than yes/no classification:
- Machine-readable policies that agents can parse
- Cryptographic proof that access followed those terms
- A chain linking suspicious outputs back to access events
- An audit trail that survives review (not just server logs you control)
That is where Originary and PEAC push beyond detection-only to detection + policy + signed records.
3. The four pillars of useful agent and crawler identification
In practice, agent and crawler identification becomes powerful when you combine four signal types:
- Pillar 1: metadata
- Pillar 2: model fingerprints
- Pillar 3: access events
- Pillar 4: artifact repository
3.1 Metadata: the quiet truth-teller
Metadata is "data about the data." For agent and crawler identification, you care about at least two layers:
File / media layer
- EXIF data, container metadata (images/audio/video)
- C2PA provenance, content credentials
- Timestamps, edit history, device hints
- Gotcha: easily stripped unless embedded + signed
Transport layer
- HTTP headers, TLS fingerprints, ASN ranges
- User-Agent, model hints, API keys
- Rate patterns, timing, geo
On its own, metadata can be spoofed. Combined with signed records, it becomes a strong integrity check. In PEAC, metadata is not an afterthought - effective AI preference policies (AIPREF) are discovered and snapshotted into every record, so audits are self-contained.
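As a rough illustration of the transport layer, the sketch below pulls a few self-declared hints out of one request. The function name, the returned keys, and the token list are assumptions made for this example; the tokens are a small, non-exhaustive sample of substrings that some AI crawlers put in their User-Agent, and every one of these signals can be spoofed.

```python
# Illustrative, non-exhaustive substrings that some AI crawlers declare
# in their User-Agent. Treat this list as an assumption, not a registry.
DECLARED_AI_TOKENS = ("GPTBot", "ClaudeBot", "CCBot", "PerplexityBot")


def transport_signals(headers: dict, remote_ip: str) -> dict:
    """Collect transport-layer hints from one request (hypothetical helper).

    Nothing here is proof: a client can send any User-Agent it likes.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    user_agent = lowered.get("user-agent", "")
    declared = next(
        (tok for tok in DECLARED_AI_TOKENS if tok.lower() in user_agent.lower()),
        None,
    )
    return {
        "remote_ip": remote_ip,
        "user_agent": user_agent,
        # None means "nothing declared", not "this is a human".
        "declared_agent": declared,
    }
```

A hint like `declared_agent` is only useful as an input to later checks, for example comparing it against what a signed record says.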
3.2 Model fingerprints: which model touched this
Model fingerprinting tries to answer: which model family or vendor produced this artifact?
- Risk and compliance. Some models may be disallowed for regulated data.
- Attribution and economics. Different pricing for different model types.
- Cross-checking claims. Detect mismatches between claims and reality.
In Originary's world, model fingerprints feed into policy and records: policies can say "allow research use from approved models, block others." Records include which model was declared at access time.
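A policy such as "allow research use from approved models, block others" can be sketched as a small check against the declared model. This is a minimal illustration, not the AIPREF or PEAC policy format: the class, field names, and the "allow"/"deny" return values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelPolicy:
    """Hypothetical policy shape: which declared models may do what."""
    approved_models: frozenset
    allowed_purposes: frozenset


def evaluate(policy: ModelPolicy, declared_model: Optional[str], purpose: str) -> str:
    """Return "allow" or "deny" for one access attempt."""
    # An agent that declares no model cannot match an approved list.
    if declared_model is None:
        return "deny"
    if declared_model not in policy.approved_models:
        return "deny"
    if purpose not in policy.allowed_purposes:
        return "deny"
    return "allow"
```

The decision relies on what the agent declared, so it is only as trustworthy as that declaration; the value of recording the declared model is that a later mismatch becomes visible.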
3.3 Access: every AI call as a verifiable event
This is the most undervalued pillar. Traditional logs tell you IP, path, timestamp. That is not enough for AI agents and 402-style paid access.
In a PEAC-aware environment, each AI call becomes a structured, signed event:
- agent_id -> which agent or client called you
- agent_type -> crawler, copilot, aggregator, training pipeline
- model_id -> declared model family in use
- policy_version -> which policy applied
- enforcement -> e.g. http-402 for payment-gated access
- payment -> rail, amount, currency, provider evidence
- aipref -> snapshot of AI usage preferences in effect
- issued_at -> when the record was generated
The PEAC kernel signs records using Ed25519 and ships them in a PEAC-Receipt header, ready for offline or online verification.
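To make the event shape concrete, here is a minimal sketch that assembles a record with the field names listed above and serialises it deterministically, so that a signer and a verifier would operate on identical bytes. It does not reproduce the PEAC wire format and does not perform the Ed25519 signature itself (the Python standard library has no Ed25519 implementation); the function names and the choice of sorted-key JSON are assumptions for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_access_record(agent_id, agent_type, model_id, policy_version,
                        enforcement, payment, aipref):
    """Assemble one access event using the field names from the list above."""
    return {
        "agent_id": agent_id,
        "agent_type": agent_type,
        "model_id": model_id,
        "policy_version": policy_version,
        "enforcement": enforcement,
        "payment": payment,      # dict: rail, amount, currency, provider evidence
        "aipref": aipref,        # dict: snapshot of preferences in effect
        "issued_at": datetime.now(timezone.utc).isoformat(),
    }


def signing_input(record: dict) -> bytes:
    """Serialise with sorted keys and fixed separators (an assumed convention)
    so the same record always yields the same bytes."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def record_digest(record: dict) -> str:
    """SHA-256 of the signing input; handy for logs, but not a signature."""
    return hashlib.sha256(signing_input(record)).hexdigest()
```

In a real deployment the bytes from `signing_input` would be what an Ed25519 key signs; the digest here is only a stable identifier.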
3.4 Artifact repository: cases, not random files
Once you have detection and rich access events, you need somewhere to put them. An artifact repository is:
- A structured library of artifacts: requests, responses, media, forensics, and records.
- Grouped into cases or projects: incidents, audits, fraud investigations.
- Enriched with metadata, fingerprints, and PEAC records.
This lets banks, insurers, publishers, and regulators reconstruct what happened, show chain-of-custody evidence for review, and re-run analyses when policies change. Originary's goal: your live AI traffic and artifact repository are two views of the same records layer.
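One way to picture "cases, not random files" is a small data structure that stores a content hash with each artifact, so anyone re-examining a case can check that the bytes they hold are the bytes that were filed. The class and method names below are hypothetical and only sketch the idea; they are not Originary's repository design.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Artifact:
    kind: str        # e.g. "request", "response", "media", "forensics", "record"
    sha256: str      # hash of the content at filing time
    metadata: dict


@dataclass
class Case:
    """A group of related artifacts, such as one incident or audit."""
    case_id: str
    artifacts: list = field(default_factory=list)

    def add(self, kind: str, content: bytes, **metadata) -> Artifact:
        """File an artifact and remember its content hash."""
        artifact = Artifact(kind, hashlib.sha256(content).hexdigest(), dict(metadata))
        self.artifacts.append(artifact)
        return artifact

    def matches(self, index: int, content: bytes) -> bool:
        """Check that `content` is byte-identical to what was filed."""
        return self.artifacts[index].sha256 == hashlib.sha256(content).hexdigest()
```

A hash alone shows integrity, not origin; pairing each artifact with a signed record is what would tie it back to a specific access event.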
4. How Originary + PEAC change agent and crawler identification in practice
4.1 Publish policies that agents can actually read
Every PEAC-aware service exposes a discovery file at /.well-known/peac.txt that advertises protocol version, payment rails, record requirements, and verification endpoints.
AIPREF policies describe how your content may be used. These are snapshotted into every record. AI agents can no longer pretend they did not know your terms.
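From the agent's side, the discovery step amounts to fetching /.well-known/peac.txt and reading its fields before requesting content. The sketch below parses a simple "key: value" text body. That line format, and the idea of lower-casing keys, are assumptions made for illustration; consult the PEAC specification for the actual file syntax and field names.

```python
def parse_discovery(text: str) -> dict:
    """Parse an assumed "key: value" discovery file into a dict.

    Blank lines and lines starting with "#" are skipped. This is a
    hypothetical format, not the PEAC specification.
    """
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if ":" not in line:
            continue  # ignore lines that do not look like key: value
        # Split on the first colon only, so URL values keep their "https:".
        key, value = line.split(":", 1)
        fields[key.strip().lower()] = value.strip()
    return fields
```

An agent could run this once per origin, cache the result, and refuse to proceed when the fields it needs are missing.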
4.2 Enforce and measure with HTTP 402 and records
When an AI agent hits a protected resource, it receives an HTTP 402 Payment Required response. Once the agent pays or proves entitlement, the PEAC kernel issues a signed record binding: what was accessed, who accessed it, which policy applied, and payment details.
Agent and crawler identification becomes not just "yes, that looked like a bot" but "yes, that bot paid, under these terms, here is the verified record."
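The 402 flow described in this section can be sketched as a pure function that decides the status and headers for one request. The protected path prefix and the two injected callables (`verify_payment`, `issue_receipt`) are hypothetical stand-ins for a payment provider check and for the PEAC kernel; only the 402 status and the PEAC-Receipt header name come from the text above.

```python
from http import HTTPStatus
from typing import Callable, Optional

PROTECTED_PREFIX = "/articles/"  # hypothetical: which paths are payment-gated


def handle_request(path: str,
                   payment_proof: Optional[str],
                   verify_payment: Callable[[str], bool],
                   issue_receipt: Callable[[str], str]):
    """Return (status, headers) for one request.

    verify_payment and issue_receipt are placeholders supplied by the caller.
    """
    if not path.startswith(PROTECTED_PREFIX):
        return HTTPStatus.OK, {}
    # No proof, or proof that fails verification: ask the agent to pay.
    if payment_proof is None or not verify_payment(payment_proof):
        return HTTPStatus.PAYMENT_REQUIRED, {}
    # Paid or entitled: serve the resource and attach the signed record.
    return HTTPStatus.OK, {"PEAC-Receipt": issue_receipt(path)}
```

Keeping the decision separate from the web framework makes it easy to test, and makes it clear that payment verification and record signing are delegated rather than implemented here.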
4.3 Give good agents a way to prove they are good
Most serious AI agents want a clean way to respect content owners. Originary + PEAC give them that path: pre-fetch peac.txt, integrate 402 flows, attach records when passing data downstream.
That is agent and crawler identification as positive infrastructure rather than only defensive heuristics.
4.4 Make bad or ambiguous agents stand out
Once good agents follow rules and produce records, what remains is easier to handle: crawlers ignoring peac.txt, tools spoofing user-agents, traffic with no records. These become clear anomalies. You can throttle, block, or escalate based on evidence rather than suspicion.
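On the operator's side, the anomalies named above can be turned into a simple triage rule. This is an illustrative sketch: the input flags, the label names, and the order of the checks are assumptions, and the logic lives in the operator's own tooling rather than in PEAC.

```python
from typing import Optional


def triage(declared_agent: Optional[str],
           network_matches_declaration: bool,
           fetched_policy: bool,
           has_record: bool) -> str:
    """Label one traffic source using evidence the operator already holds.

    All inputs and labels are hypothetical; tune them to your own signals.
    """
    # Claims to be a known agent but arrives from an unexpected network.
    if declared_agent is not None and not network_matches_declaration:
        return "spoof-suspect"
    # A signed record exists for this access.
    if has_record:
        return "recorded"
    # Never fetched the discovery file before crawling.
    if not fetched_policy:
        return "ignored-policy"
    # Read the policy but produced no record.
    return "no-record"
```

What to do with each label (throttle, block, or review) remains the operator's decision.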
5. What PEAC does not do
- PEAC does not run a model registry, score agents, or rank crawlers.
- PEAC does not classify traffic; classifiers and fingerprints stay where they are.
- PEAC does not block, throttle, or enforce; those decisions stay with the operator.
- PEAC does not assert an agent identity is "real"; it carries a signed record of what each agent attested at the boundary.
- PEAC does not replace your WAF, edge rules, or fraud platform; it produces a portable signed record alongside them.
6. Where this is going next
This post is the high-level overview. We will follow up with a focused series on metadata, access events, fingerprinting, and artifact repositories.
Explore the building blocks:
- AIPREF - machine-readable AI usage preferences.
- x402 / HTTP 402 - payment gating for machine actions.
- PEAC records - verifiable access records.