Robots.txt is a small text file at the root of your domain that controls how search engine crawlers interact with your site. It sounds simple — and the syntax is indeed simple — but robots.txt errors are some of the most severe technical SEO mistakes a site can make.

Accidentally blocking Googlebot from your entire site is more common than you'd think.

What Robots.txt Does

Robots.txt tells crawlers which parts of your site they can and cannot access. It does not control indexing — a page can still be indexed by Google if other pages link to it, even if it's blocked in robots.txt. To prevent indexing, you need noindex meta tags or x-robots-tag headers.

Robots.txt controls crawl access. Noindex controls indexation. They're complementary, not interchangeable.
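As a quick reference, the noindex directive takes two forms. The meta-tag form goes in the page's `<head>`:

```
<meta name="robots" content="noindex">
```

and the header form is sent in the HTTP response:

```
X-Robots-Tag: noindex
```

Note the interaction: for either directive to take effect, the page must *not* be blocked in robots.txt, because Google has to crawl the page before it can see the noindex instruction.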

Basic Robots.txt Syntax

A basic robots.txt file looks like this:

```
User-agent: *
Disallow: /admin/
Disallow: /checkout/
Allow: /

Sitemap: https://example.com/sitemap.xml
```

- User-agent: * — applies to all crawlers
- Disallow: — paths that shouldn't be crawled
- Allow: — explicit exceptions to disallow rules
- Sitemap: — reference to your XML sitemap

Common Robots.txt Mistakes

Blocking the Entire Site

```
User-agent: *
Disallow: /
```

This blocks all crawlers from your entire site. It's the most catastrophic robots.txt error and is surprisingly common — often left in place after development and forgotten.

Blocking CSS and JavaScript

Blocking your CSS and JavaScript files prevents Google from rendering your pages correctly, which can hurt rankings significantly. Never block:

- /wp-content/themes/ (WordPress)
- Your main CSS and JS files
- Any assets Google needs to render your pages
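One well-known WordPress pattern illustrates the right balance: block the admin area but carve out an explicit exception for the AJAX endpoint that front-end scripts depend on (these are WordPress default paths):

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```

This is the shape WordPress itself generates for its virtual robots.txt — the Allow rule is more specific than the Disallow, so it wins for that one file.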

Conflicting Rules

When multiple rules apply to the same URL, Google follows the most specific rule. Conflicting rules can create unexpected behaviour. Test every important URL pattern against your robots.txt.
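Google's documented precedence — the rule with the longest matching path wins, with ties resolved in favour of Allow — can be sketched in a few lines of Python. This is a simplified model for building intuition (it ignores `*` wildcards and `$` anchors), not Google's actual implementation:

```python
def is_allowed(path, rules):
    """Decide crawl access the way Google documents it: the rule
    with the longest matching path wins, and an equal-length Allow
    beats a Disallow. Simplified: no wildcard (*) or anchor ($)
    support."""
    winner, winner_len = "allow", -1  # no matching rule => allowed
    for directive, pattern in rules:
        if path.startswith(pattern):
            if len(pattern) > winner_len or (
                len(pattern) == winner_len and directive == "allow"
            ):
                winner, winner_len = directive, len(pattern)
    return winner == "allow"

rules = [("disallow", "/blog/"), ("allow", "/blog/seo-")]
print(is_allowed("/blog/seo-guide", rules))   # True: Allow is more specific
print(is_allowed("/blog/other-post", rules))  # False: only Disallow matches
print(is_allowed("/about", rules))            # True: no rule matches
```

Running your own important URL patterns through a check like this makes conflicting rules visible before Googlebot finds them.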

What to Block

Block paths that:

- Have no SEO value (admin areas, internal search results, checkout pages)
- Would waste crawl budget (infinite pagination, faceted navigation generating millions of URLs)
- Contain sensitive information that shouldn't be crawled
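A sketch of what that might look like in practice — the paths and query parameters here are hypothetical, so substitute your own platform's URLs:

```
User-agent: *
Disallow: /admin/
Disallow: /checkout/
Disallow: /search
Disallow: /*?filter=
Disallow: /*?sort=
```

The `*` wildcard in paths is supported by Google and other major crawlers, making it a practical way to block faceted-navigation URL patterns in bulk.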

Testing Your Robots.txt

Always test your robots.txt before and after any changes:

1. Use Google Search Console's robots.txt report (the standalone robots.txt Tester has been retired)
2. Test your most important URLs explicitly
3. After making changes, wait 24-48 hours for Googlebot to re-fetch your robots.txt
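For quick local checks, Python's standard library ships a robots.txt parser. It implements the original first-match specification rather than Google's longest-match precedence, so treat it as a sanity check, not a perfect Googlebot simulation:

```python
from urllib import robotparser

# Paste your live robots.txt here, or fetch it with rp.set_url() + rp.read()
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /checkout/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Test your most important URLs explicitly
print(rp.can_fetch("*", "https://example.com/admin/users"))  # False
print(rp.can_fetch("*", "https://example.com/blog/post"))    # True
```

Dropping a script like this into CI is a cheap guard against the "Disallow: /" left over from staging.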

The Crawl Budget Consideration

For large sites (100,000+ pages), robots.txt plays an important role in crawl budget management. By preventing crawlers from accessing duplicate, thin, or unimportant pages, you ensure Google spends its crawl budget on the content that matters.

For most small to medium sites, crawl budget is not a limiting factor — Google will eventually crawl everything. But smart robots.txt configuration is still good practice.

Specifying Your Sitemap

Always include a Sitemap directive in your robots.txt. This helps search engines discover your sitemap even if you haven't submitted it via Search Console:

```
Sitemap: https://example.com/sitemap.xml
```

Use the absolute URL including your protocol (https).
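If you have more than one sitemap (or a sitemap index), list each on its own line — the filenames here are illustrative:

```
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-pages.xml
```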

Monitoring Robots.txt Issues

Check Google Search Console's Page indexing report (formerly Coverage) for pages marked as "Blocked by robots.txt". If important pages appear there, audit your robots.txt immediately.

Regular robots.txt audits should be part of your technical SEO maintenance routine — especially after site migrations, platform changes, or when new sections of your site are launched.