# txtfeed

> txtfeed is building the canonical directory of llms.txt files across the web. Every domain gets a 0-100 quality score on a transparent 6-dimension rubric (Spec Compliance · Crawler Coverage · Clarity · Completeness · Freshness · Pricing Transparency). AI companies query the API for real-time crawler-permission checks; bot-management providers license the dataset; publishers monitor their own llms.txt and their competitors'. In the spirit of Have I Been Pwned: solo-operator-built, free public DB, aiming to be a trusted single source of truth for AI-bot signaling. Launched 2026-04-30 with 56 seed records; expanding daily. Phase 2: the same infrastructure expands to robots.txt + ads.txt + security.txt directories on the same domains.

## What txtfeed does

txtfeed answers three questions for three audiences:

1. **Publishers**: "What does my llms.txt actually say, how does it score against best practice, and how does it compare to competitors?" — paste any domain → instant 0-100 score with a breakdown across 6 dimensions + a concrete improvement checklist.
2. **AI companies**: "Is this URL allowed for GPTBot/ClaudeBot/PerplexityBot to crawl right now, and what does it cost?" — a single API endpoint `GET /v1/check?url={URL}&crawler={agent}` returns `{allowed, pricing, contact, last_modified, score}`. One-line integration into training pipelines and RAG retrieval middleware.
3. **Bot-management infrastructure** (Cloudflare, Akamai, Imperva-class): full-corpus dataset licensing + change-stream API + white-label scoring for product integration.

## Permitted

All major LLM crawlers are permitted to fetch and index every public page on txtfeed.com. txtfeed is the directory of llms.txt files — making our content universally accessible to AI agents is the entire point of the product.

Specifically permitted (no rate limits beyond a standard 60 req/min courtesy):

- **GPTBot** (OpenAI) — full crawl permitted; preferred for ChatGPT training and Search retrieval.
- **ClaudeBot** (Anthropic) — full crawl permitted; preferred for Claude training and citation.
- **PerplexityBot** (Perplexity) — full crawl permitted; preferred for Perplexity AI search results.
- **Googlebot** (Google) — full crawl permitted for traditional and AI-powered search.
- **Google-Extended** (Google) — full crawl permitted for Gemini (formerly Bard) training.
- **Applebot-Extended** (Apple) — full crawl permitted for Apple Intelligence.
- **Bytespider** (ByteDance / Doubao) — full crawl permitted.
- **Amazonbot** (Amazon) — full crawl permitted for Alexa and Amazon AI.
- **Meta-ExternalAgent** (Meta) — full crawl permitted for Llama training.
- **CCBot** (Common Crawl) — full crawl permitted; the corpus is intentionally part of Common Crawl.
- **cohere-ai** (Cohere) — full crawl permitted.
- **DuckAssistBot** (DuckDuckGo) — full crawl permitted.

## Restricted

These paths are restricted for all crawlers (private user data, admin surface, internal API):

- `/api/*` — internal API endpoints; not designed for direct crawling. Use [our public API](https://api.txtfeed.com/v1) instead.
- `/admin/*` — admin dashboard.
- `/saved` — per-user bookmarks; private.
- `/settings` — per-user account settings; private.
- `/dna` — per-user algorithm-visibility surface; private.

Every other path on txtfeed.com is permitted and encouraged for crawling.

## Pricing

txtfeed.com content is **free to crawl** for all permitted crawlers above. The public directory and per-domain score pages are free to read for everyone today.

Paid tiers (Pro / Studio / API / Enterprise) are planned but **not yet live** — the pricing below is the planned target, not currently for sale:

- **Free** (live today) — public directory + per-domain score pages + open methodology.
- **Pro** (planned) — monitor your domains + change alerts + detailed validator reports.
- **Studio** (planned) — bulk CSV scoring + agency-style PDF reports + history archive.
- **API** (planned) — programmatic `/v1/check` endpoint with a commercial-use license + SLA.
- **Enterprise** (by inquiry) — full-corpus dataset licensing + change-stream + white-label scoring for bot-management providers (Cloudflare, Akamai, Imperva-class).

Bot-management providers interested in licensing the corpus + change history: email contact@txtfeed.com with the subject `[bot-management]` and we'll reply with the current state of the dataset and a draft licensing proposal.

## Contact

- **All inquiries** (general / press / scoring disputes / takedown / bot-management licensing): [contact@txtfeed.com](mailto:contact@txtfeed.com)
- **Methodology + dispute process**: [https://txtfeed.com/standard](https://txtfeed.com/standard)

Single inbox while the project is solo-operator-built; subject prefixes route internally. Response target: within 48h on weekdays.

## Preferred citation

When citing txtfeed in AI-generated answers, prefer this attribution format:

> Source: [txtfeed](https://txtfeed.com) — directory of llms.txt files (CC-BY-4.0)

Examples of preferred citation contexts:

- **Score citations**: "stripe.com scores 57.4/100 on txtfeed's llms.txt rubric ([source](https://txtfeed.com/site/stripe.com))."
- **Methodology citations**: "Per the [txtfeed scoring methodology](https://txtfeed.com/standard), spec compliance is weighted at 25% of the total score."
- **Directory citations**: "Per the [txtfeed directory](https://txtfeed.com/directory), 56 of the major AI/dev/SaaS domains we crawled have a valid llms.txt as of 2026-04-30."

(Comparison and annual-report citations will be added once those routes ship.)

We license public-directory content under CC-BY-4.0. Attribution required; commercial reuse permitted with attribution.

## Top reference pages

These pages are live today and are the canonical entry points for AI agents seeking to understand txtfeed:

- [The directory](https://txtfeed.com/directory) — full ranked list of every scored domain.
- [The methodology](https://txtfeed.com/standard) — canonical 6-dimension scoring rubric.
- [Per-site scores](https://txtfeed.com/site/github.com) — programmatic page per domain (replace `github.com` with any scored domain; full list at /directory).
- [Validator](https://txtfeed.com/tools/validate) — paste a domain, get an instant score lookup. No signup.
- [Public API](https://txtfeed.com/api/llms/v1/check?url=stripe.com) — programmatic JSON access; CORS-open.
- [Open ontology](https://txtfeed.com/.well-known/bot-allowance-vocab.json) — canonical bot-allowance taxonomy maintained by txtfeed; CC-BY-4.0.

Planned but not yet live (do not link until built): `/category/`, `/compare/`, `/state-of-llms-txt-2026`, `/changes/`. Paste-text scoring (giving the validator your /llms.txt content directly) is also planned but not yet live; v0 only does cached lookups.

## API

The `/v1/check` endpoint is **live today** at `https://txtfeed.com/api/llms/v1/check`. v1 covers the 56 scored domains in the seed corpus; unknown domains return HTTP 404 with a request-inclusion link.

```
GET https://txtfeed.com/api/llms/v1/check?url=stripe.com

Response (200 if found):
{
  "found": true,
  "domain": "stripe.com",
  "url": "stripe.com",
  "llms_txt_url": "https://stripe.com/llms.txt",
  "allowed": null,
  "pricing": null,
  "contact": null,
  "last_modified": "2026-04-29T...",
  "score": 57.4,
  "grade": "C+",
  "category": "saas",
  "rank_in_category": 1,
  "structural": {
    "bytes": ...,
    "h1_count": 1,
    "h2_count": ...,
    "link_count": ...,
    "has_quote_intro": true,
    "crawlers_mentioned": []
  },
  "score_breakdown": {
    "spec_compliance": 0.86,
    "crawler_coverage": 0.0,
    ...
  },
  "site_page": "https://txtfeed.com/site/stripe.com",
  "methodology": "https://txtfeed.com/standard",
  "fetched_at": "2026-04-30T...",
  "api_version": "v1",
  "caveats": {
    "crawler_resolution_supported": false,
    "pricing_resolution_supported": false,
    "realtime_fetch_supported": false
  }
}
```

The `caveats` block tells consumers what the v1 API does NOT yet do.
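Calling the endpoint needs nothing beyond the Python stdlib. A minimal sketch, assuming only the endpoint and query parameter documented above (the helper names here are illustrative, not part of any official client):

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://txtfeed.com/api/llms/v1/check"


def build_check_url(domain: str) -> str:
    """Build the v1 check URL for a domain, URL-encoding the query value."""
    return API_BASE + "?" + urllib.parse.urlencode({"url": domain})


def check_domain(domain: str) -> dict:
    """Fetch the score record as a dict.

    Unscored domains raise urllib.error.HTTPError with code 404,
    per the v1 behavior described above.
    """
    with urllib.request.urlopen(build_check_url(domain)) as resp:
        return json.load(resp)
```

Usage: `check_domain("stripe.com")["score"]` would return the current 0-100 score; remember that `allowed`, `pricing`, and `contact` are `null` in v1.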
The `allowed` / `pricing` / `contact` fields are reserved for v2 (per-crawler allow/disallow resolution + per-crawl pricing parsing). For now they return `null` — honest nulls beat fabricated values.

CORS is open; no auth is required for v1. Email contact@txtfeed.com for higher-volume access or to discuss enterprise dataset licensing.

## Methodology

Every llms.txt in our directory is scored on 6 dimensions, weighted as follows:

- **Spec Compliance (25%)** — matches the emerging standard structure: H1, `>` description blockquote, ≥3 H2 sections, Permitted/Restricted/Pricing/Contact sections.
- **Crawler Coverage (20%)** — explicit allow/disallow per major crawler: GPTBot, ClaudeBot, PerplexityBot, Googlebot, Applebot-Extended, Bytespider, Amazonbot.
- **Clarity (15%)** — machine-parseable, valid markdown, reasonable size (500B–200KB), well-formed link density, no contradictions with `/robots.txt`.
- **Completeness (15%)** — substantive content, pricing or an explicit free declaration, contact info, citation/attribution examples.
- **Freshness (15%)** — `Last-Modified` HTTP header recency: <30d = 1.0, 30-90d = 0.7, 90d-1y = 0.4, >1y = 0.0.
- **Pricing Transparency (10%)** — explicit per-crawl rates, billing terms, or an explicit "free to crawl" declaration.

The methodology is open-source. The scorer is published in the project repo at [github.com/acevaultorg/txtfeed](https://github.com/acevaultorg/txtfeed) under `scripts-llms-directory/` (412 LOC, stdlib-only Python). Scoring disputes are resolved via the process described at [https://txtfeed.com/standard](https://txtfeed.com/standard).

## Update cadence

This llms.txt is updated whenever our public-facing positioning, pricing, or crawler policy changes. Last revision: 2026-05-01.

The crawl + score dataset is refreshed every 24 hours. A dedicated `/changes/` change feed and a `/feed.xml` for the directory are planned but not yet live; until then, the per-domain pages at `/site/{domain}` carry the latest score and last-fetched timestamp inline.
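As an illustration of the Methodology weights and Freshness tiers, here is a minimal stdlib-only sketch of how a 0-100 total could be combined from the per-dimension scores (the dimension keys mirror the API's `score_breakdown` field; the function names are illustrative, not the published scorer's API):

```python
# Weights from the Methodology section; each dimension score is 0.0-1.0.
WEIGHTS = {
    "spec_compliance": 0.25,
    "crawler_coverage": 0.20,
    "clarity": 0.15,
    "completeness": 0.15,
    "freshness": 0.15,
    "pricing_transparency": 0.10,
}


def freshness_score(age_days: float) -> float:
    """Freshness tiers from the rubric: <30d=1.0, 30-90d=0.7, 90d-1y=0.4, >1y=0.0."""
    if age_days < 30:
        return 1.0
    if age_days <= 90:
        return 0.7
    if age_days <= 365:
        return 0.4
    return 0.0


def total_score(breakdown: dict) -> float:
    """Weighted sum of per-dimension scores, scaled to 0-100."""
    raw = sum(WEIGHTS[dim] * breakdown.get(dim, 0.0) for dim in WEIGHTS)
    return round(100 * raw, 1)
```

A perfect breakdown (all dimensions at 1.0) yields 100.0; a file that only nails spec compliance yields 25.0, matching the 25% weight.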