llms.txt: The New robots.txt for the AI Era
A new standard is emerging to help websites communicate with AI systems. Here's what llms.txt is, why it matters, and how to implement it today.
SigilEdge Team
In 1994, a simple text file changed the web forever. robots.txt gave website owners a way to communicate with search engine crawlers, telling them what to index and what to ignore. It became a universal standard that every major search engine respects to this day.
Thirty years later, we need something similar for AI.
Enter llms.txt: an emerging specification that helps websites communicate with large language models and the AI systems built on them. It’s not yet as universal as robots.txt, but it’s gaining traction fast, and early adopters are positioning themselves as preferred sources for AI citation.
The Problem llms.txt Solves
AI systems interact with websites differently than traditional search crawlers. When Googlebot visits your site, it's indexing content for a search results page. When GPTBot or ClaudeBot visits, it might be doing any of the following:
- Training data collection: Gathering content to improve future model versions
- Real-time retrieval: Fetching current information to answer user questions (RAG)
- Citation sourcing: Looking for authoritative content to reference in responses
The problem? Website owners have no standardized way to communicate their preferences for these different use cases. You might be fine with AI citing your content but not want it used for training. You might want attribution in a specific format. You might have licensing terms that apply.
robots.txt doesn’t solve this. It’s binary (allow or disallow crawling) and doesn’t address the nuances of AI interaction.
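For contrast, here is roughly everything robots.txt can say to an AI crawler, using the same directive syntax shown later in this article: a blanket yes or no, with nothing about training, retrieval, or attribution.
User-agent: GPTBot
Disallow: /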
What Is llms.txt?
llms.txt is a proposed standard (documented at llmstxt.org) that provides AI-specific instructions for how language models should interact with your content. Think of it as robots.txt for the AI era, but with richer semantics.
The file lives at the root of your domain (e.g., https://example.com/llms.txt) and contains structured information about:
- Identity: Who you are and how to contact you about AI-related matters
- Permissions: What content AI can access, cite, or use for training
- Attribution: How you want to be credited when cited
- Licensing: Terms that apply to AI use of your content
- Preferences: Specific instructions for AI interaction
Here’s a basic example:
# llms.txt for example.com
# Documentation: https://llmstxt.org
# Site identity
name: Example Company
url: https://example.com
contact: ai@example.com
# Content permissions
allow: /blog/*
allow: /docs/*
allow: /products/*
disallow: /internal/*
disallow: /drafts/*
# Citation preferences
citation-format: "Source: Example Company (example.com)"
require-attribution: true
# Training permissions
training: disallow
The Key Sections Explained
Identity Block
name: Example Company
url: https://example.com
contact: ai@example.com
This tells AI systems who owns the content and how to reach you for AI-specific inquiries. The contact email is particularly useful. Some AI companies are proactively reaching out to website owners about content usage, and having a dedicated contact makes this smoother.
Access Permissions
allow: /blog/*
allow: /docs/*
disallow: /internal/*
Similar to robots.txt, but specifically for AI crawlers. This is separate from your search engine permissions. You might want Google to index your pricing page but prefer AI systems not to scrape and regurgitate your pricing structure.
Citation Preferences
citation-format: "Source: Example Company (example.com)"
require-attribution: true
link-back: true
This is where llms.txt really differentiates itself. You can specify:
- citation-format: The exact text format you want when AI cites you
- require-attribution: Whether citation is mandatory for use
- link-back: Whether you want a hyperlink included (relevant for AI tools that support links, like Perplexity)
Training Permissions
training: disallow
rag: allow
A critical distinction that robots.txt can’t make:
- training: Whether your content can be used to train future AI models
- rag: Whether your content can be retrieved in real-time to answer questions (Retrieval-Augmented Generation)
Many publishers want their content cited in real-time answers (good for visibility) but don’t want it absorbed into training data without compensation. This distinction matters.
Licensing
license: CC-BY-4.0
commercial-use: allowed
derivative-works: with-attribution
For sites with specific licensing terms, this section clarifies how AI systems should treat your content legally.
Why This Matters Now
Several forces are converging to make llms.txt increasingly important:
1. AI Companies Are Listening
OpenAI, Anthropic, and others have published named crawlers that respect robots.txt directives. They're actively looking for signals about website owner preferences. Having a clear llms.txt file positions you as a cooperative, authoritative source, potentially improving your chances of being cited.
2. Legal Pressure Is Building
Multiple lawsuits are working through courts regarding AI training data. The outcome is uncertain, but the direction is clear: content creators want more control over how their work is used. llms.txt provides a proactive way to declare your terms before regulations force the issue.
3. Citation Is Becoming Currency
As AI answers replace traditional search results, citation becomes the new ranking. Being cited by ChatGPT or Perplexity drives brand awareness even when users don’t click through. A clear llms.txt that facilitates proper attribution helps ensure you get credit.
4. Differentiation Opportunity
Most websites don’t have an llms.txt file yet. Early adopters can establish themselves as AI-forward, potentially receiving preferential treatment from AI systems that recognize cooperative sources.
How to Implement llms.txt
Step 1: Create the File
Create a plain text file named llms.txt in your site’s root directory. Start with the basics:
# llms.txt for yoursite.com
# Learn more: https://llmstxt.org
name: Your Company Name
url: https://yoursite.com
contact: ai@yoursite.com
# What content is available
allow: /blog/*
allow: /products/*
allow: /about
# How to cite us
citation-format: "Source: Your Company (yoursite.com)"
require-attribution: true
# Training policy
training: disallow
rag: allow
Step 2: Define Your Content Zones
Think about your site in terms of what you want AI to access:
Typically allow:
- Blog posts and articles
- Product descriptions
- Documentation
- Public-facing informational pages
Typically disallow:
- Internal tools and dashboards
- User-generated content you don’t own
- Paywalled content
- Staging or draft content
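Mapped into llms.txt, those zones might look like the sketch below. The exact paths are placeholders; substitute your own site structure:
# Public, citable content
allow: /blog/*
allow: /products/*
allow: /docs/*
# Keep AI systems out of these
disallow: /app/*
disallow: /members/*
disallow: /drafts/*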
Step 3: Decide Your Training Stance
This is a business decision. Consider:
- Allow training: Maximum exposure, but you lose control over how content is used
- Disallow training, allow RAG: Your content can be cited in real-time but won’t be baked into model weights
- Disallow both: Maximum control, minimum AI visibility
Most commercial sites are choosing the middle path: allow real-time retrieval and citation, disallow training data usage.
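In llms.txt terms, that middle path is just two lines:
training: disallow
rag: allow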
Step 4: Specify Attribution
Be specific about how you want to be credited:
citation-format: "According to [Your Brand] (yoursite.com)"
or
citation-format: "[Your Brand], retrieved from yoursite.com"
The format should be natural enough to fit in AI-generated prose while clearly identifying your brand.
Step 5: Deploy and Test
Upload the file to your site root and verify it’s accessible at https://yoursite.com/llms.txt. Test by fetching it in a browser or with curl:
curl https://yoursite.com/llms.txt
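If you want to go a step beyond curl, here's a minimal Python sketch that fetches the file and flags missing identity fields. It assumes the informal key: value layout used in the examples above; there is no official validator, so treat this as a convenience check, not a spec compliance test.
# check_llms_txt.py - sanity-check a site's llms.txt
# Assumes the informal "key: value" layout shown in this article;
# the spec is still evolving, so this is a convenience check only.
import sys
import urllib.request

EXPECTED_KEYS = {"name", "url", "contact"}  # identity basics from the examples above

def check(domain: str) -> None:
    url = f"https://{domain}/llms.txt"
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read().decode("utf-8")

    keys = set()
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        if ":" in line:
            keys.add(line.split(":", 1)[0].strip())

    print(f"Fetched {url} ({len(body)} bytes)")
    missing = EXPECTED_KEYS - keys
    if missing:
        print(f"Missing identity keys: {', '.join(sorted(missing))}")
    else:
        print("Identity block looks complete.")
    print(f"Directives found: {', '.join(sorted(keys))}")

if __name__ == "__main__":
    check(sys.argv[1] if len(sys.argv) > 1 else "example.com")
Run it as python check_llms_txt.py yoursite.com once the file is deployed.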
Advanced Patterns
Per-Section Rules
For complex sites, you can specify different rules for different sections:
# Blog - open for citation
section: /blog/*
allow: true
training: disallow
rag: allow
citation-format: "From the Example Blog"
# Docs - require link-back
section: /docs/*
allow: true
require-link: true
citation-format: "Example Docs (example.com/docs)"
# Products - structured data preferred
section: /products/*
allow: true
prefer-structured: true
Structured Content Hints
You can tell AI systems about your content structure:
structured-data: schema.org
sitemap: https://example.com/sitemap.xml
feed: https://example.com/feed.xml
This helps AI crawlers find and parse your content more effectively.
Update Frequency
Signal how often your content changes:
update-frequency: daily
cache-duration: 24h
This helps AI systems know when to re-fetch content for accurate real-time answers.
Current Adoption Status
Let’s be direct: llms.txt is not yet a universal standard. As of early 2025:
- No AI company has officially committed to respecting llms.txt in the way they respect robots.txt
- Adoption is growing among forward-thinking publishers and tech companies
- The specification is still evolving with community input
So why implement it now?
- Signal intent: Having an llms.txt clearly communicates your preferences, which could matter in future legal or business contexts
- Early mover advantage: When (not if) AI companies start respecting these files, early adopters will be ahead
- Minimal effort: It’s a single text file that takes minutes to create
- No downside: Having the file doesn’t hurt anything
The Relationship to robots.txt
llms.txt doesn’t replace robots.txt. They serve different purposes:
| Aspect | robots.txt | llms.txt |
|---|---|---|
| Target | Search engine crawlers | AI/LLM systems |
| Primary function | Crawl permissions | Usage permissions + preferences |
| Attribution | Not applicable | Specify citation format |
| Training vs. retrieval | No distinction | Explicit separation |
| Licensing | Not addressed | Can specify terms |
| Adoption | Universal | Emerging |
You should have both files, with potentially different rules. For example:
robots.txt:
User-agent: *
Allow: /
User-agent: GPTBot
Disallow: /pricing/
llms.txt:
allow: /
disallow: /pricing/
training: disallow
rag: allow
citation-format: "Source: Example (example.com)"
What Comes Next
The llms.txt specification will likely evolve as:
- AI companies respond: Expect announcements about support (or explanations for why not)
- Legal clarity emerges: Court decisions and regulations will shape what control website owners have
- Tooling develops: Expect plugins, validators, and generators for common CMS platforms
- The spec matures: Additional fields and capabilities will be added based on real-world needs
The companies implementing llms.txt now are the ones who will shape how it evolves. By participating early, you’re not just adopting a standard; you’re helping define it.
Getting Started Today
Here’s your action plan:
- Create a basic llms.txt with your site identity, content permissions, and citation preferences
- Deploy it to your site root and verify it's accessible
- Review your robots.txt to ensure AI crawler permissions align with your intent
- Monitor AI crawler traffic in your logs to understand how AI systems interact with your content (see the sketch after this list)
- Test your citability by asking AI assistants questions your content should answer
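For the monitoring step, here's a rough sketch of what that can look like. The log path and the user-agent list are assumptions (the crawler names come from those mentioned in this article); adapt both to your server setup:
# count_ai_crawlers.py - tally AI crawler hits in an access log
# The log path and user-agent list are assumptions; adapt to your server.
from collections import Counter

AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]
LOG_PATH = "/var/log/nginx/access.log"  # hypothetical path

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        for bot in AI_CRAWLERS:
            if bot in line:
                hits[bot] += 1

for bot, count in hits.most_common():
    print(f"{bot}: {count} requests")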
The AI search era is here. The websites that communicate clearly with AI systems (through structured data, clean content, and explicit preferences like llms.txt) will be the ones that get cited, credited, and discovered.
The websites that don’t will become invisible training data, their expertise absorbed without attribution.
Which will you be?
Want to optimize your entire site for AI citation without manual configuration? Join the SigilEdge beta and we’ll handle the technical implementation, including AI-optimized content delivery, with zero code changes.