The New Scrapers: Understanding AI User Agents

In the pre-AI era, robots.txt management was simple: you either let Googlebot and Bingbot crawl your site, or you blocked them. Today, the crawler ecosystem is far more complex. Dozens of new bots crawl the web not to index and rank your pages, but to collect your articles, documentation, and other intellectual property as training data for AI models.

This has sparked an ongoing battle between content creators and AI companies. Fortunately, you can state which bots may crawl your site by adding rules for these newer user agents to your robots.txt file.

Identifying the Key AI Crawlers

Before writing rules, you need to understand which agents are scraping your site. Here are the most prominent AI crawlers in 2026:

GPTBot: OpenAI's primary crawler used to gather training data for ChatGPT and future models.
ChatGPT-User: OpenAI's agent that fetches web pages on demand when a ChatGPT user's question calls for live information, rather than crawling on its own schedule. Blocking it prevents ChatGPT from referencing your live site in user chats.
ClaudeBot: Anthropic's crawler used to train Claude models.
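Google-Extended: Google's opt-out token for keeping your site content out of Gemini model training. It is a robots.txt control token rather than a separate crawler: pages are still fetched by Google's existing user agents, and Google documents that the token does not affect a site's inclusion or ranking in Google Search.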
PerplexityBot: The bot used by Perplexity AI to crawl and summarize live web pages for search queries.
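
To see which of these agents already visit a site, one option is to scan the access log for their tokens. The sketch below assumes the combined log format, where the User-Agent is the last quoted field. The helper names and sample log lines are hypothetical, and the user-agent strings are simplified illustrations rather than exact copies of what each bot sends. Google-Extended is left out because Google documents it as a robots.txt control token that never appears in a request header.

```python
import re
from collections import Counter

# Tokens from the list above that can appear in a User-Agent header.
AI_CRAWLER_TOKENS = ("GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot")

# Assumption: combined log format, where the User-Agent is the last quoted field.
_LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def identify_crawler(user_agent):
    """Return the first AI crawler token found in a User-Agent string, else None."""
    lowered = user_agent.lower()
    for token in AI_CRAWLER_TOKENS:
        if token.lower() in lowered:
            return token
    return None


def count_ai_hits(log_lines):
    """Count requests per AI crawler token across access-log lines."""
    counts = Counter()
    for line in log_lines:
        match = _LAST_QUOTED_FIELD.search(line)
        if match is None:
            continue  # line is not in the assumed format
        token = identify_crawler(match.group(1))
        if token is not None:
            counts[token] += 1
    return counts


# Hypothetical sample lines (documentation IP range, simplified user agents).
SAMPLE_LOG = [
    '203.0.113.7 - - [12/Jan/2026:10:00:00 +0000] "GET /docs HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.7 - - [12/Jan/2026:10:00:02 +0000] "GET /blog HTTP/1.1" 200 8042 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [12/Jan/2026:10:01:10 +0000] "GET /docs HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.20 - - [12/Jan/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/130.0"',
]

if __name__ == "__main__":
    for token, hits in count_ai_hits(SAMPLE_LOG).most_common():
        print(f"{token}: {hits}")
```

Matching on the User-Agent field only, rather than the whole log line, avoids counting requests whose URL happens to contain a bot name.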

Configuring robots.txt for Selective Access

You can choose to block all AI training crawlers while still allowing standard search engines (like Google and Bing) to index your pages. Here is a recommended production configuration:

# Allow search engines to crawl everything except private areas
User-agent: Googlebot
Disallow: /admin/
Disallow: /private/

User-agent: Bingbot
Disallow: /admin/

# Block AI crawlers from scraping training data
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Allow real-time search synthesis bots (so your pages can still be cited)
User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /
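
Before deploying a file like this, it can help to check that each agent gets the access you intended. The sketch below feeds the configuration above to Python's standard-library `urllib.robotparser`; the `example.com` URLs and the helper names `build_parser` and `access_report` are placeholders. Treat the result as a sanity check only: Python's parser applies the first matching rule in a group, which can differ from how an individual crawler resolves more complicated rule sets.

```python
from urllib.robotparser import RobotFileParser

# The configuration from this section, with comments removed.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /admin/
Disallow: /private/

User-agent: Bingbot
Disallow: /admin/

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def access_report(parser, agents, url):
    """Map each agent name to whether the parser lets it fetch the URL."""
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    agents = ["Googlebot", "Bingbot", "GPTBot", "ClaudeBot",
              "Google-Extended", "PerplexityBot", "ChatGPT-User"]
    # Placeholder URLs: one public page and one admin page.
    for url in ("https://example.com/blog/post", "https://example.com/admin/login"):
        print(url)
        for agent, allowed in access_report(parser, agents, url).items():
            print(f"  {agent}: {'allowed' if allowed else 'blocked'}")
```

Because the file has no `User-agent: *` group, any bot not named in it is allowed by default, so the two `Allow: /` groups mainly serve to document intent.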

Beyond Robots.txt: Edge-Level Blocking

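Remember that robots.txt is a request, not an enforcement mechanism: compliance is voluntary, and rogue or poorly configured crawlers often ignore it. For stronger protection, add edge-level firewall rules. With a platform like Cloudflare or Next.js Edge Middleware, you can inspect the incoming User-Agent header and reject requests that match known scrapers with a 403 Forbidden response before they consume your server resources. Because a User-Agent header can be spoofed, treat this as a filter for bots that identify themselves honestly rather than a guarantee. Google-Extended cannot be blocked this way, since it is a robots.txt token that never appears in a request header.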
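
The User-Agent check itself is small. The sketch below expresses it as Python WSGI middleware purely to illustrate the logic; it is not Cloudflare or Next.js code, and the names `BLOCKED_TOKENS`, `BlockScrapers`, `demo_app`, and `status_for` are hypothetical. The blocklist assumes you want to reject the same training crawlers the robots.txt example disallows.

```python
import re

# Assumption: the same training crawlers that the robots.txt example disallows.
BLOCKED_TOKENS = ("GPTBot", "ClaudeBot")
_BLOCKED = re.compile("|".join(re.escape(t) for t in BLOCKED_TOKENS), re.IGNORECASE)


def is_blocked(user_agent):
    """True if the User-Agent string contains a blocked crawler token."""
    return bool(user_agent) and _BLOCKED.search(user_agent) is not None


class BlockScrapers:
    """WSGI middleware that answers 403 before the wrapped app runs."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Stand-in for the real site: always answers 200."""
    body = b"Hello\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def status_for(app, user_agent):
    """Call a WSGI app with a minimal fake request and return its status line."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": "/",
               "HTTP_USER_AGENT": user_agent}
    list(app(environ, start_response))  # consume the response body
    return captured["status"]


if __name__ == "__main__":
    app = BlockScrapers(demo_app)
    for ua in ("Mozilla/5.0 (compatible; GPTBot/1.0)",
               "Mozilla/5.0 (X11; Linux x86_64) Firefox/130.0"):
        print(ua, "->", status_for(app, ua))
```

On a real edge platform the same test would live in that platform's rule language or middleware API, where it runs before the request reaches your origin server.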

Robots.txt in 2026: How to Control AI Crawlers Scraping Your Content
