llms.txt: The New robots.txt for the AI Era

A new standard is emerging to help websites communicate with AI systems. Here's what llms.txt is, why it matters, and how to implement it today.

SigilEdge Team

In 1994, a simple text file changed the web forever. robots.txt gave website owners a way to communicate with search engine crawlers, telling them which pages to crawl and which to ignore. It became a universal standard that every major search engine respects to this day.

Thirty years later, we need something similar for AI.

Enter llms.txt: an emerging specification that helps websites communicate with large language models and the AI systems built on them. It’s not yet as universal as robots.txt, but it’s gaining traction fast, and early adopters are positioning themselves as preferred sources for AI citation.

The Problem llms.txt Solves

AI systems interact with websites differently than traditional search crawlers do. When Googlebot visits your site, it's indexing content for a search results page. When GPTBot or ClaudeBot visits, it might be doing any of the following:

  • Training data collection: Gathering content to improve future model versions
  • Real-time retrieval: Fetching current information to answer user questions (RAG)
  • Citation sourcing: Looking for authoritative content to reference in responses

The problem? Website owners have no standardized way to communicate their preferences for these different use cases. You might be fine with AI citing your content but not want it used for training. You might want attribution in a specific format. You might have licensing terms that apply.

robots.txt doesn’t solve this. It’s binary (allow or disallow crawling) and doesn’t address the nuances of AI interaction.

What Is llms.txt?

llms.txt is a proposed standard (documented at llmstxt.org) that provides AI-specific instructions for how language models should interact with your content. Think of it as robots.txt for the AI era, but with richer semantics.

The file lives at the root of your domain (e.g., https://example.com/llms.txt) and contains structured information about:

  • Identity: Who you are and how to contact you about AI-related matters
  • Permissions: What content AI can access, cite, or use for training
  • Attribution: How you want to be credited when cited
  • Licensing: Terms that apply to AI use of your content
  • Preferences: Specific instructions for AI interaction

Here’s a basic example:

# llms.txt for example.com
# Documentation: https://llmstxt.org

# Site identity
name: Example Company
url: https://example.com
contact: ai@example.com

# Content permissions
allow: /blog/*
allow: /docs/*
allow: /products/*
disallow: /internal/*
disallow: /drafts/*

# Citation preferences
citation-format: "Source: Example Company (example.com)"
require-attribution: true

# Training permissions
training: disallow
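
The spec is still informal and there's no official parser, so as a rough illustration, here's a minimal Python sketch of how a crawler might read this key-value format (the repeated-key handling is an assumption based on the example above):

# Minimal llms.txt parser sketch. The format isn't finalized, so this
# simply collects key-value directives, letting keys like "allow" repeat.
from collections import defaultdict

def parse_llms_txt(text: str) -> dict:
    directives = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        key, sep, value = line.partition(":")
        if sep and value.strip():
            directives[key.strip().lower()].append(value.strip())
    return dict(directives)

sample = "name: Example Company\nallow: /blog/*\nallow: /docs/*\ntraining: disallow"
print(parse_llms_txt(sample))
# {'name': ['Example Company'], 'allow': ['/blog/*', '/docs/*'], 'training': ['disallow']}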

The Key Sections Explained

Identity Block

name: Example Company
url: https://example.com
contact: ai@example.com

This tells AI systems who owns the content and how to reach you for AI-specific inquiries. The contact email is particularly useful. Some AI companies are proactively reaching out to website owners about content usage, and having a dedicated contact makes this smoother.

Access Permissions

allow: /blog/*
allow: /docs/*
disallow: /internal/*

Similar to robots.txt, but specifically for AI crawlers. This is separate from your search engine permissions. You might want Google to index your pricing page but prefer AI systems not to scrape and regurgitate your pricing structure.
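
The spec doesn't yet pin down wildcard semantics, but assuming robots.txt-style glob patterns, a crawler's path check might look like this sketch (fnmatch is a stand-in for whatever matching rules the final spec adopts):

# Sketch of allow/disallow path matching, assuming shell-style globs.
from fnmatch import fnmatch

ALLOW = ["/blog/*", "/docs/*"]
DISALLOW = ["/internal/*"]

def is_allowed(path: str) -> bool:
    if any(fnmatch(path, pattern) for pattern in DISALLOW):
        return False  # an explicit disallow always wins
    return any(fnmatch(path, pattern) for pattern in ALLOW)

print(is_allowed("/blog/my-post"))      # True
print(is_allowed("/internal/secrets"))  # False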

Citation Preferences

citation-format: "Source: Example Company (example.com)"
require-attribution: true
link-back: true

This is where llms.txt really differentiates itself. You can specify:

  • citation-format: The exact text format you want when AI cites you
  • require-attribution: Whether citation is mandatory for use
  • link-back: Whether you want a hyperlink included (relevant for AI tools that support links, like Perplexity)
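
No answer engine has confirmed it consumes these fields, but to make the intent concrete, here's a hypothetical sketch of a RAG pipeline appending a publisher's declared citation to a generated answer (field names mirror the examples above):

# Hypothetical sketch: honoring a publisher's citation preferences.
# No AI system officially supports these fields today.
def attach_citation(answer: str, prefs: dict) -> str:
    parts = []
    if "citation-format" in prefs:
        parts.append(prefs["citation-format"])
    if prefs.get("link-back") == "true" and "url" in prefs:
        parts.append(prefs["url"])
    return (answer + "\n\n" + " ".join(parts)) if parts else answer

prefs = {
    "citation-format": "Source: Example Company (example.com)",
    "link-back": "true",
    "url": "https://example.com",
}
print(attach_citation("Widgets ship in 3-5 business days.", prefs))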

Training Permissions

training: disallow
rag: allow

A critical distinction that robots.txt can’t make:

  • training: Whether your content can be used to train future AI models
  • rag: Whether your content can be retrieved in real-time to answer questions (Retrieval-Augmented Generation)

Many publishers want their content cited in real-time answers (good for visibility) but don’t want it absorbed into training data without compensation. This distinction matters.

Licensing

license: CC-BY-4.0
commercial-use: allowed
derivative-works: with-attribution

For sites with specific licensing terms, this section clarifies how AI systems should treat your content legally.

Why This Matters Now

Several forces are converging to make llms.txt increasingly important:

1. AI Companies Are Listening

OpenAI, Anthropic, and others have implemented robots.txt directives for their crawlers. They’re actively looking for signals about website owner preferences. Having a clear llms.txt file positions you as a cooperative, authoritative source, potentially improving your chances of being cited.

2. Legal Pressure Is Mounting

Multiple lawsuits are working through the courts regarding AI training data. The outcome is uncertain, but the direction is clear: content creators want more control over how their work is used. llms.txt provides a proactive way to declare your terms before regulation forces the issue.

3. Citation Is Becoming Currency

As AI answers replace traditional search results, citation becomes the new ranking. Being cited by ChatGPT or Perplexity drives brand awareness even when users don’t click through. A clear llms.txt that facilitates proper attribution helps ensure you get credit.

4. Differentiation Opportunity

Most websites don’t have an llms.txt file yet. Early adopters can establish themselves as AI-forward, potentially receiving preferential treatment from AI systems that recognize cooperative sources.

How to Implement llms.txt

Step 1: Create the File

Create a plain text file named llms.txt in your site’s root directory. Start with the basics:

# llms.txt for yoursite.com
# Learn more: https://llmstxt.org

name: Your Company Name
url: https://yoursite.com
contact: ai@yoursite.com

# What content is available
allow: /blog/*
allow: /products/*
allow: /about

# How to cite us
citation-format: "Source: Your Company (yoursite.com)"
require-attribution: true

# Training policy
training: disallow
rag: allow

Step 2: Define Your Content Zones

Think about your site in terms of what you want AI to access:

Typically allow:

  • Blog posts and articles
  • Product descriptions
  • Documentation
  • Public-facing informational pages

Typically disallow:

  • Internal tools and dashboards
  • User-generated content you don’t own
  • Paywalled content
  • Staging or draft content

Step 3: Decide Your Training Stance

This is a business decision. Consider:

  • Allow training: Maximum exposure, but you lose control over how content is used
  • Disallow training, allow RAG: Your content can be cited in real-time but won’t be baked into model weights
  • Disallow both: Maximum control, minimum AI visibility

Many commercial sites choose the middle path: allowing real-time retrieval and citation while disallowing training use.

Step 4: Specify Attribution

Be specific about how you want to be credited:

citation-format: "According to [Your Brand] (yoursite.com)"

or

citation-format: "[Your Brand], retrieved from yoursite.com"

The format should be natural enough to fit in AI-generated prose while clearly identifying your brand.

Step 5: Deploy and Test

Upload the file to your site root and verify it’s accessible at https://yoursite.com/llms.txt. Test by fetching it in a browser or with curl:

curl https://yoursite.com/llms.txt
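
For something more repeatable than eyeballing curl output, a small script can confirm the file is reachable and contains the fields you expect (the field names here follow this article's examples, not a ratified spec):

# Quick deployment check: fetch llms.txt and confirm expected fields.
import urllib.request

EXPECTED = ("name:", "url:", "contact:", "citation-format:")

def check_llms_txt(site: str) -> None:
    with urllib.request.urlopen(f"{site}/llms.txt", timeout=10) as resp:
        body = resp.read().decode("utf-8")
        print(f"HTTP {resp.status}, {len(body)} bytes")
    for field in EXPECTED:
        print(f"  {field:<18} {'ok' if field in body else 'MISSING'}")

check_llms_txt("https://yoursite.com")  # replace with your domain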

Advanced Patterns

Per-Section Rules

For complex sites, you can specify different rules for different sections:

# Blog - open for citation
section: /blog/*
  allow: true
  training: disallow
  rag: allow
  citation-format: "From the Example Blog"

# Docs - require link-back
section: /docs/*
  allow: true
  require-link: true
  citation-format: "Example Docs (example.com/docs)"

# Products - structured data preferred
section: /products/*
  allow: true
  prefer-structured: true

Structured Content Hints

You can tell AI systems about your content structure:

structured-data: schema.org
sitemap: https://example.com/sitemap.xml
feed: https://example.com/feed.xml

This helps AI crawlers find and parse your content more effectively.

Update Frequency

Signal how often your content changes:

update-frequency: daily
cache-duration: 24h

This helps AI systems know when to re-fetch content for accurate real-time answers.

Current Adoption Status

Let’s be direct: llms.txt is not yet a universal standard. As of early 2025:

  • No AI company has officially committed to respecting llms.txt in the way they respect robots.txt
  • Adoption is growing among forward-thinking publishers and tech companies
  • The specification is still evolving with community input

So why implement it now?

  1. Signal intent: Having an llms.txt clearly communicates your preferences, which could matter in future legal or business contexts
  2. Early mover advantage: When (not if) AI companies start respecting these files, early adopters will be ahead
  3. Minimal effort: It’s a single text file that takes minutes to create
  4. No downside: Having the file doesn’t hurt anything

The Relationship to robots.txt

llms.txt doesn’t replace robots.txt. They serve different purposes:

Aspect                 | robots.txt             | llms.txt
Target                 | Search engine crawlers | AI/LLM systems
Primary function       | Crawl permissions      | Usage permissions + preferences
Attribution            | Not applicable         | Specify citation format
Training vs. retrieval | No distinction         | Explicit separation
Licensing              | Not addressed          | Can specify terms
Adoption               | Universal              | Emerging

You should have both files, with potentially different rules. For example:

robots.txt:

User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /pricing/

llms.txt:

allow: /
disallow: /pricing/
training: disallow
rag: allow
citation-format: "Source: Example (example.com)"

What Comes Next

The llms.txt specification will likely evolve as:

  1. AI companies respond: Expect announcements of support (or explanations of why not)
  2. Legal clarity emerges: Court decisions and regulations will shape what control website owners have
  3. Tooling develops: Expect plugins, validators, and generators for common CMS platforms
  4. The spec matures: Additional fields and capabilities will be added based on real-world needs

The companies implementing llms.txt now are the ones who will shape how it evolves. By participating early, you’re not just adopting a standard; you’re helping define it.

Getting Started Today

Here’s your action plan:

  1. Create a basic llms.txt with your site identity, content permissions, and citation preferences
  2. Deploy it to your site root and verify it’s accessible
  3. Review your robots.txt to ensure AI crawler permissions align with your intent
  4. Monitor AI crawler traffic in your logs to understand how AI systems interact with your content (see the sketch after this list)
  5. Test your citability by asking AI assistants questions your content should answer
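
As a starting point for step 4, here's a sketch that tallies AI crawler hits in a standard web server access log (the log path is a placeholder; adjust it for your setup):

# Sketch for step 4: tally AI crawler hits in an access log.
from collections import Counter

AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended")
hits = Counter()

with open("/var/log/nginx/access.log") as log:  # placeholder path
    for line in log:
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1

for bot, count in hits.most_common():
    print(f"{bot:<16} {count}")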

The AI search era is here. The websites that communicate clearly with AI systems (through structured data, clean content, and explicit preferences like llms.txt) will be the ones that get cited, credited, and discovered.

The websites that don’t will become invisible training data, their expertise absorbed without attribution.

Which will you be?


Want to optimize your entire site for AI citation without manual configuration? Join the SigilEdge beta and we’ll handle the technical implementation, including AI-optimized content delivery, with zero code changes.
