Configure robots.txt for Crawl Control and SEO | OpsBlu Docs


The robots.txt file is the first thing search engine crawlers request when visiting your domain. It controls which parts of your site crawlers can access, making it one of the highest-leverage files for SEO. A single misconfigured line can deindex an entire site section.

How robots.txt Works

Located at https://yourdomain.com/robots.txt, this plain text file contains directives that instruct compliant crawlers which URL paths to avoid. It does not prevent indexing on its own; it only blocks crawling. A page blocked in robots.txt can still appear in search results if other pages link to it. To keep a page out of the index entirely, let it be crawled and add a noindex meta tag or X-Robots-Tag header instead.

Syntax Reference

User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /search?
Allow: /admin/public/

User-agent: Bingbot
Crawl-delay: 1

Sitemap: https://example.com/sitemap.xml
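You can sanity-check rules like these offline with Python's standard-library robotparser. One caveat: CPython resolves overlapping rules first-match in file order, while Googlebot uses longest-match, so Allow/Disallow pairs like the /admin/ example above may be resolved differently; the sketch below sticks to unambiguous rules.

```python
from urllib.robotparser import RobotFileParser

# Rules adapted from the syntax reference above, minus the Allow
# override (CPython matches first rule in file order, not longest).
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /search
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# /cart/view matches the Disallow: /cart/ prefix, so it is blocked.
cart_ok = parser.can_fetch("ExampleBot", "https://example.com/cart/view")
# /blog/post matches no rule, so it is allowed.
blog_ok = parser.can_fetch("ExampleBot", "https://example.com/blog/post")
```

The same parser powers tools that batch-test URL lists against a staged file before deploy.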

Key Directives

  • User-agent - Specifies which crawler the rules apply to. Use * for all crawlers.
  • Disallow - Blocks crawling of the specified path prefix. Disallow: / blocks the entire site.
  • Allow - Overrides a broader Disallow rule for a specific path. Googlebot supports this; not all crawlers do.
  • Crawl-delay - Requests a minimum delay in seconds between requests. Bing honors it, but Googlebot ignores this directive and manages its crawl rate automatically.
  • Sitemap - Declares your XML sitemap location. The URL must be absolute (including the scheme) and is case-sensitive; the directive can appear anywhere in the file, independent of any User-agent group.

What to Block

Block paths that waste crawl budget on low-value pages:

  • Internal search results (/search?) - These create infinite URL combinations with no unique content.
  • Admin and staging areas (/admin/, /staging/) - No indexing value, and exposing admin paths is a security risk.
  • Shopping cart and checkout (/cart/, /checkout/) - Session-specific pages that change per user.
  • Faceted navigation parameters (e.g. ?color=, ?size=) - Hundreds of filter combinations dilute crawl budget.
  • Print and PDF versions (/print/, /*.pdf$) - If these duplicate existing page content.
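Put together, a crawl-budget section covering the cases above might look like this (paths are illustrative; note that the * and $ wildcards are supported by Googlebot and Bingbot but not by every crawler):

```
User-agent: *
Disallow: /search?
Disallow: /admin/
Disallow: /staging/
Disallow: /cart/
Disallow: /checkout/
Disallow: /*?color=
Disallow: /*?size=
Disallow: /print/
Disallow: /*.pdf$
```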

What NOT to Block

Never block resources that Googlebot needs to render your pages:

  • CSS and JavaScript files - Blocking these prevents Google from rendering your site, which harms rankings.
  • Image directories - Blocking images removes them from Google Images and reduces visual search traffic.
  • Pages you want indexed - This seems obvious, but CMS migrations and staging-to-production transitions frequently leave Disallow: / in place.

Validation and Testing

Google Search Console includes a robots.txt report under Settings > Crawling. It shows which versions of your file Google has fetched, when they were fetched, and any parse warnings or errors, so you can confirm the live file matches what you deployed.

Test before deploying. Stage your robots.txt changes and validate with the tester before pushing to production. A single typo in a Disallow rule can block critical pages.

# Quick validation: fetch and inspect
curl -s https://yourdomain.com/robots.txt
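Beyond eyeballing the fetched file, a small pre-deploy check can assert that critical pages stay crawlable. A minimal sketch using the standard library (the path list and domain are placeholders for your own):

```python
from urllib.robotparser import RobotFileParser

# Pages that must remain crawlable in production (illustrative list).
CRITICAL_PATHS = ["/", "/products/widget", "/blog/launch-post"]

def blocked_paths(robots_txt: str, site: str = "https://example.com") -> list:
    """Return the critical paths this robots.txt would block for Googlebot."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in CRITICAL_PATHS
            if not parser.can_fetch("Googlebot", site + path)]

# A leftover staging rule blocks every critical path:
leftover = blocked_paths("User-agent: *\nDisallow: /\n")
# A properly scoped rule blocks none of them:
scoped = blocked_paths("User-agent: *\nDisallow: /admin/\n")
```

Wire this into CI and fail the deployment (non-zero exit) whenever the returned list is non-empty.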

Common Mistakes

Blocking CSS/JS

# WRONG - breaks rendering
Disallow: /wp-content/themes/
Disallow: /assets/js/

Google needs these files to understand your page layout and content.
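If a blocked directory also contains assets, crawlers that support wildcards (Googlebot, Bingbot) let you re-allow just the assets; Google resolves Allow/Disallow conflicts in favor of the longest matching rule:

```
User-agent: *
Disallow: /wp-content/themes/
Allow: /wp-content/themes/*.css$
Allow: /wp-content/themes/*.js$
```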

Leftover Staging Rules

After migrating from staging to production, the Disallow: / rule from your staging environment may still be present. This blocks all crawling.

Using robots.txt for Security

robots.txt is publicly readable. Do not use it to hide sensitive directories. Use authentication or server-level access controls instead.

Missing robots.txt

A missing robots.txt file returns a 404, which crawlers interpret as "crawl everything." This is technically fine but means you lose control over crawl budget allocation. Every production site should have an explicit robots.txt file. Note that a 5xx response is different: Googlebot temporarily stops crawling the site when robots.txt returns a server error.
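A minimal explicit file that allows everything while still declaring a sitemap (an empty Disallow value means nothing is disallowed):

```
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
```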

Monitoring

Check Google Search Console's "Crawl Stats" report monthly to verify that Googlebot is respecting your rules and not wasting time on blocked paths. If crawl requests suddenly spike for disallowed paths, your robots.txt may have been overwritten during a deployment.