Robots.txt is a small text file at the root of your domain that controls how search engine crawlers interact with your site. It sounds simple — and the syntax is indeed simple — but robots.txt errors are some of the most severe technical SEO mistakes a site can make.

Accidentally blocking Googlebot from your entire site is more common than you'd think.

What Robots.txt Does

Robots.txt tells crawlers which parts of your site they can and cannot access. It does not control indexing — a page can still be indexed by Google if other pages link to it, even if it's blocked in robots.txt. To prevent indexing, you need noindex meta tags or x-robots-tag headers.

Robots.txt controls crawl access. Noindex controls indexation. They're complementary, not interchangeable.
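As a quick reference, the noindex directive takes two forms. The meta-tag form goes in the page's `<head>`:

```
<meta name="robots" content="noindex">
```

and the header form is sent in the HTTP response:

```
X-Robots-Tag: noindex
```

Note the interaction: for either directive to take effect, the page must *not* be blocked in robots.txt, because Google has to crawl the page before it can see the noindex instruction.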

Basic Robots.txt Syntax

A basic robots.txt file looks like this:

```
User-agent: *
Disallow: /admin/
Disallow: /checkout/
Allow: /

Sitemap: https://example.com/sitemap.xml
```

- User-agent: * — applies to all crawlers
- Disallow: — paths that shouldn't be crawled
- Allow: — explicit exceptions to disallow rules
- Sitemap: — reference to your XML sitemap

Common Robots.txt Mistakes

Blocking the Entire Site

```
User-agent: *
Disallow: /
```

This blocks all crawlers from your entire site. It's the most catastrophic robots.txt error and is surprisingly common — often left in place after development and forgotten.

Blocking CSS and JavaScript

Blocking your CSS and JavaScript files prevents Google from rendering your pages correctly, which can hurt rankings significantly. Never block:

- /wp-content/themes/ (WordPress)
- Your main CSS and JS files
- Any assets Google needs to render your pages
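One well-known WordPress pattern illustrates the right balance: block the admin area but carve out an explicit exception for the AJAX endpoint that front-end scripts depend on (these are WordPress default paths):

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```

This is the shape WordPress itself generates for its virtual robots.txt — the Allow rule is more specific than the Disallow, so it wins for that one file.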

Conflicting Rules

When multiple rules apply to the same URL, Google follows the most specific rule. Conflicting rules can create unexpected behaviour. Test every important URL pattern against your robots.txt.
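Google's documented precedence — the rule with the longest matching path wins, with ties resolved in favour of Allow — can be sketched in a few lines of Python. This is a simplified model for building intuition (it ignores `*` wildcards and `$` anchors), not Google's actual implementation:

```python
def is_allowed(path, rules):
    """Decide crawl access the way Google documents it: the rule
    with the longest matching path wins, and an equal-length Allow
    beats a Disallow. Simplified: no wildcard (*) or anchor ($)
    support."""
    winner, winner_len = "allow", -1  # no matching rule => allowed
    for directive, pattern in rules:
        if path.startswith(pattern):
            if len(pattern) > winner_len or (
                len(pattern) == winner_len and directive == "allow"
            ):
                winner, winner_len = directive, len(pattern)
    return winner == "allow"

rules = [("disallow", "/blog/"), ("allow", "/blog/seo-")]
print(is_allowed("/blog/seo-guide", rules))   # True: Allow is more specific
print(is_allowed("/blog/other-post", rules))  # False: only Disallow matches
print(is_allowed("/about", rules))            # True: no rule matches
```

Running your own important URL patterns through a check like this makes conflicting rules visible before Googlebot finds them.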

What to Block

Block paths that:

- Have no SEO value (admin areas, internal search results, checkout pages)
- Would waste crawl budget (infinite pagination, faceted navigation generating millions of URLs)
- Contain sensitive information that shouldn't be crawled
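A sketch of what that might look like in practice — the paths and query parameters here are hypothetical, so substitute your own platform's URLs:

```
User-agent: *
Disallow: /admin/
Disallow: /checkout/
Disallow: /search
Disallow: /*?filter=
Disallow: /*?sort=
```

The `*` wildcard in paths is supported by Google and other major crawlers, making it a practical way to block faceted-navigation URL patterns in bulk.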

Testing Your Robots.txt

Always test your robots.txt before and after any changes:

1. Use Google Search Console's robots.txt report (the standalone robots.txt Tester has been retired)
2. Test your most important URLs explicitly
3. After making changes, wait 24-48 hours for Googlebot to re-fetch your robots.txt
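For quick local checks, Python's standard library ships a robots.txt parser. It implements the original first-match specification rather than Google's longest-match precedence, so treat it as a sanity check, not a perfect Googlebot simulation:

```python
from urllib import robotparser

# Paste your live robots.txt here, or fetch it with rp.set_url() + rp.read()
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /checkout/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Test your most important URLs explicitly
print(rp.can_fetch("*", "https://example.com/admin/users"))  # False
print(rp.can_fetch("*", "https://example.com/blog/post"))    # True
```

Dropping a script like this into CI is a cheap guard against the "Disallow: /" left over from staging.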

The Crawl Budget Consideration

For large sites (100,000+ pages), robots.txt plays an important role in crawl budget management. By preventing crawlers from accessing duplicate, thin, or unimportant pages, you ensure Google spends its crawl budget on the content that matters.

For most small to medium sites, crawl budget is not a limiting factor — Google will eventually crawl everything. But smart robots.txt configuration is still good practice.

Specifying Your Sitemap

Always include a Sitemap directive in your robots.txt. This helps search engines discover your sitemap even if you haven't submitted it via Search Console:

```
Sitemap: https://example.com/sitemap.xml
```

Use the absolute URL including your protocol (https).
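If you have more than one sitemap (or a sitemap index), list each on its own line — the filenames here are illustrative:

```
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-pages.xml
```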

Monitoring Robots.txt Issues

Check Google Search Console's Page indexing report (formerly Coverage) for pages marked as "Blocked by robots.txt". If important pages appear there, audit your robots.txt immediately.

Regular robots.txt audits should be part of your technical SEO maintenance routine — especially after site migrations, platform changes, or when new sections of your site are launched.