What is robots.txt?

robots.txt refers to a plain-text file placed at the root of a website that tells search engine crawlers (and other well-behaved bots) which parts of the site they’re permitted or forbidden to crawl. Introduced as the Robots Exclusion Protocol in 1994 and formalised as IETF RFC 9309 in 2022, robots.txt is the oldest and simplest piece of SEO infrastructure - and one of the most commonly misconfigured, because its directives are subtler than they look.

The core directives

User-agent. Specifies which crawler the rules that follow apply to. User-agent: * applies to all crawlers; User-agent: Googlebot applies only to Google's crawler.

Disallow. Tells the named crawler not to fetch URLs matching a given path prefix. Disallow: /admin/ forbids crawling of anything under /admin/.

Allow. Carves out exceptions from a Disallow. Useful when a broad Disallow blocks something that should remain crawlable. Disallow: /private/ followed by Allow: /private/blog/ blocks everything under /private/ except /private/blog/.

Sitemap. Points crawlers to the location of the site’s XML sitemap. Sitemap: https://example.com/sitemap.xml.

Crawl-delay. Requests a minimum pause between crawler requests. Respected by Bing and Yandex; ignored by Google.
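
Put together, a minimal robots.txt using all five directives might look like the sketch below; the paths, delay value, and sitemap URL are illustrative.

    User-agent: *
    Disallow: /admin/
    Disallow: /private/
    Allow: /private/blog/
    Crawl-delay: 10
    Sitemap: https://example.com/sitemap.xml

Rules apply to the User-agent group they sit under; the Sitemap line is independent of any group and can appear anywhere in the file.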

The most important misconception

robots.txt controls crawling, not indexing.

Blocking a URL in robots.txt prevents crawlers from fetching it. But if external sites link to that URL, search engines can still index the URL itself - without its content - and show it in results with a note like “no information available for this page”. robots.txt is the wrong tool for keeping pages out of the index.

The right tool is a noindex meta tag (or X-Robots-Tag header) on the page itself. But for that to work, the crawler has to be allowed to fetch the page and see the noindex directive - which means robots.txt must not block it. This is the most common intermediate-level SEO mistake: blocking URLs in robots.txt in the belief that it removes them from the index. It doesn't.
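
As a sketch, the page-level alternative is either of the following; the page must remain crawlable for the directive to be seen.

    <meta name="robots" content="noindex">

Or, as an HTTP response header (useful for PDFs and other non-HTML resources):

    X-Robots-Tag: noindex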

Common robots.txt patterns

Block staging and internal tools. Disallow: /staging/, Disallow: /internal/. Keeps crawl budget focused on public content.

Block faceted-navigation parameter combinations. On large e-commerce sites, URL parameters generate huge numbers of low-value crawlable URLs. robots.txt can cut these out of the crawl entirely, while canonical URLs handle the indexing signal.
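
A sketch of the wildcard form - the parameter names here are illustrative, and * matches any run of characters under RFC 9309:

    User-agent: *
    # Block any URL whose query string contains these facet parameters
    Disallow: /*?*sort=
    Disallow: /*?*filter=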

Disallow AI-training crawlers. User-agent: GPTBot, User-agent: Claude-Web, etc. A growing practice among publishers who want their content indexed in search but not scraped for model training.
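
A sketch using the user-agent names mentioned above; check each vendor's documentation for its current crawler names, as they change.

    User-agent: GPTBot
    Disallow: /

    User-agent: Claude-Web
    Disallow: /

Search crawlers are unaffected, since these rules are scoped to those two user agents only.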

Point to the sitemap. Always include the sitemap directive, even if the sitemap is submitted separately via search engine webmaster tools. It costs nothing and helps lesser-known crawlers.

Common failure modes

Four robots.txt mistakes that cause real SEO damage:

Disallowing the whole site by accident. Disallow: / forbids crawling of everything. This ships accidentally during launches and staging misconfigurations. If your entire organic traffic drops to near-zero after a deploy, robots.txt is the first thing to check.
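
The entire failure is two lines, which is why it slips through review so easily:

    User-agent: *
    Disallow: /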

Blocking CSS and JavaScript. Older SEO advice suggested blocking /assets/ or /js/ folders to conserve crawl budget. Modern Google renders pages fully to assess them; blocking resources means Google sees a broken version and ranking drops. Never block resource files needed for rendering.
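
If an /assets/ block like the one described above already exists, a safer interim fix is to carve the rendering files back out (a sketch; removing the block entirely is better still):

    User-agent: *
    Disallow: /assets/
    # Carve CSS and JS back out so Google can render pages fully
    Allow: /assets/*.css
    Allow: /assets/*.js

The longer, more specific Allow rules take precedence over the shorter Disallow for matching URLs.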

Using robots.txt for secrets. "Secret" admin URLs listed in robots.txt are publicly visible to anyone who reads the file. Security by obscurity fails doubly here: the disallow rule doesn't protect the URLs, and the file itself advertises exactly where they are.

Contradicting redirect rules. Blocking a URL in robots.txt prevents Google from seeing the redirect from it. If you’re trying to migrate old URLs to new ones, let old URLs remain crawlable long enough for Google to see the 301s.

A practical worked example

A B2B SaaS launching a rebrand disallowed the entire old URL path structure in robots.txt on launch day, reasoning that the new URLs were the new source of truth. Two weeks in, organic traffic had dropped 60%. Cause: Google couldn’t crawl the old URLs, so it never saw the 301 redirects. The old URLs stayed indexed but pointed nowhere; the new URLs hadn’t yet accumulated enough ranking signals. Fix: remove the Disallow, let Google re-crawl old URLs and follow the 301s, and wait six weeks for re-indexing. The robots.txt move had delayed the migration by roughly two months.

We built Penfriend to produce content that never needs to hide behind robots.txt. Every page generated is content the brand would want indexed; robots.txt is reserved for its actual purpose (blocking staging, admin, and parameter variants) rather than used as damage control.

Here's how we can help you

Want a glossary just like this?

Get in touch for our DFY glossary service.