XML Sitemap Priority, Changefreq, and Crawl Rate | OpsBlu Docs

XML Sitemap Priority, Changefreq, and Crawl Rate

Configure XML sitemaps that actually influence crawl behavior. Covers lastmod accuracy, sitemap segmentation, and why Google ignores priority and changefreq.

XML sitemaps help search engines discover and understand the structure of your site. However, most sites misconfigure their sitemaps with incorrect signals that waste crawl budget or mislead crawlers. The key insight: Google ignores the priority and changefreq fields entirely. Only loc and lastmod matter.

What Google Actually Uses from Your Sitemap

Fields That Matter

Field        Used by Google      Notes
loc          Yes                 The canonical URL of the page
lastmod      Yes, if accurate    Must reflect actual content changes, not auto-generated timestamps
changefreq   No                  Google has explicitly confirmed they ignore this field
priority     No                  Google has explicitly confirmed they ignore this field

Google's John Mueller confirmed in 2023: "We essentially ignore changefreq and priority. We focus on the URLs themselves and the lastmod date."

Why lastmod Accuracy Matters

Google uses lastmod to decide whether to re-crawl a page. If your CMS sets lastmod to the current date every time the sitemap regenerates (regardless of content changes), Google learns that your lastmod values are unreliable and starts ignoring them.

<!-- BAD: Auto-generated lastmod that doesn't reflect real changes -->
<url>
  <loc>https://example.com/about</loc>
  <lastmod>2024-11-15</lastmod> <!-- Updated daily, content unchanged since 2023 -->
</url>

<!-- GOOD: lastmod reflects actual content modification -->
<url>
  <loc>https://example.com/about</loc>
  <lastmod>2023-06-20</lastmod> <!-- Last real content edit -->
</url>
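One way to keep lastmod honest is to tie it to a hash of the page content rather than to the sitemap generation time. A minimal sketch, assuming a small JSON state file alongside the generator (the file name, function name, and state layout are illustrative, not from any particular CMS):

```python
import hashlib
import json
from datetime import date
from pathlib import Path

STATE_FILE = Path("lastmod_state.json")  # hypothetical state file

def lastmod_for(url, content, today=None):
    """Return a lastmod date that only advances when content actually changes."""
    today = today or date.today().isoformat()
    digest = hashlib.sha256(content.encode()).hexdigest()

    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    entry = state.get(url)

    if entry is None or entry["hash"] != digest:
        # Content is new or changed: record the new hash and bump lastmod
        entry = {"hash": digest, "lastmod": today}
        state[url] = entry
        STATE_FILE.write_text(json.dumps(state))

    return entry["lastmod"]
```

Run this at sitemap generation time: unchanged pages keep their old lastmod even when the sitemap itself is regenerated daily, which is exactly the property Google rewards.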

Sitemap Structure Best Practices

Segmentation by Content Type

Large sites should split sitemaps by content type for better diagnostics and crawl management:

<!-- sitemap-index.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2024-11-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
    <lastmod>2024-11-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2024-11-14</lastmod>
  </sitemap>
</sitemapindex>

Benefits of segmentation:

  • Search Console reports sitemap-level indexation stats, so you can quickly see which content types have indexation problems
  • Smaller sitemaps load faster and are less likely to time out during crawling
  • You can update individual sitemaps without regenerating the entire index
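Generating the index itself is straightforward with the standard library. A sketch (the child file names and lastmod values are illustrative):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap_index(children):
    """children: list of (loc, lastmod) tuples, one per child sitemap."""
    ET.register_namespace("", SITEMAP_NS)  # serialize with a default xmlns
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for loc, lastmod in children:
        sm = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(sm, f"{{{SITEMAP_NS}}}loc").text = loc
        ET.SubElement(sm, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(index, encoding="unicode", xml_declaration=True)

xml = build_sitemap_index([
    ("https://example.com/sitemap-pages.xml", "2024-11-01"),
    ("https://example.com/sitemap-blog.xml", "2024-11-15"),
])
```

Feed each child's lastmod from the newest lastmod among the URLs it contains, so the index accurately tells Google which segment changed.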

Size Limits

  • Maximum 50,000 URLs per sitemap file (Google's hard limit)
  • Maximum 50MB uncompressed per file (practical limit is lower)
  • Use gzip compression: Serve as sitemap.xml.gz to reduce transfer time
  • No limit on sitemap index entries: You can have hundreds of child sitemaps
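These limits translate directly into a chunking step at generation time. A sketch that splits a URL list into compliant gzipped files (the 50,000 ceiling is Google's; the function name and file-naming scheme are illustrative):

```python
import gzip
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000  # Google's hard limit per file

def write_sitemaps(urls, prefix="sitemap", max_urls=MAX_URLS_PER_SITEMAP):
    """Split urls into <= max_urls files, gzip each, return the file names."""
    ET.register_namespace("", SITEMAP_NS)
    filenames = []
    for i in range(0, len(urls), max_urls):
        urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
        for loc in urls[i:i + max_urls]:
            url_el = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
            ET.SubElement(url_el, f"{{{SITEMAP_NS}}}loc").text = loc
        name = f"{prefix}-{i // max_urls + 1}.xml.gz"
        with gzip.open(name, "wb") as f:
            f.write(ET.tostring(urlset, encoding="UTF-8", xml_declaration=True))
        filenames.append(name)
    return filenames
```

The returned file names are exactly what you feed into the sitemap index from the previous section.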

Only Include Indexable URLs

Every URL in your sitemap should be:

  • Returning a 200 status code (no redirects, 404s, or server errors)
  • Canonical (the URL itself, not a duplicate that canonicalizes elsewhere)
  • Indexable (not blocked by robots.txt and not marked noindex)

Including non-indexable URLs in your sitemap wastes crawl budget and signals poor site hygiene to Google.

# Validate sitemap URLs
import requests
from lxml import etree

def audit_sitemap(sitemap_url):
    resp = requests.get(sitemap_url, timeout=30)
    resp.raise_for_status()
    root = etree.fromstring(resp.content)
    ns = {'s': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

    # Collect <loc> values once instead of re-querying the tree
    locs = [el.text for el in root.findall('.//s:url/s:loc', ns)]

    issues = []
    for loc in locs:
        try:
            # HEAD keeps the audit cheap; allow_redirects=False flags redirects as issues
            page = requests.head(loc, timeout=10, allow_redirects=False)
            if page.status_code != 200:
                issues.append(f"{loc} -> {page.status_code}")
        except requests.RequestException as e:
            issues.append(f"{loc} -> ERROR: {e}")

    print(f"Checked {len(locs)} URLs, {len(issues)} issues")
    for issue in issues:
        print(f"  {issue}")

Influencing Crawl Rate

Since priority and changefreq are ignored, how do you actually influence how often Google crawls specific pages?

Internal Link Structure

Pages with more internal links, and links from higher-authority pages, get crawled more frequently. This is a stronger crawl signal than anything in the sitemap.
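You can get a rough picture of which pages your site links to most by counting internal hrefs per page. A stdlib-only sketch that extracts internal links from one HTML document (a real audit would crawl the whole site and aggregate counts per target):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class InternalLinkCollector(HTMLParser):
    """Collect internal link targets from one HTML page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.host = urlparse(base_url).netloc
        self.internal = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)  # resolve relative links
        if urlparse(absolute).netloc == self.host:
            self.internal.append(absolute)

def internal_links(base_url, html):
    parser = InternalLinkCollector(base_url)
    parser.feed(html)
    return parser.internal
```

Aggregating these lists across all crawled pages with collections.Counter shows which URLs receive the fewest internal links; those are the pages Google is least likely to recrawl promptly.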

Fresh Content Signals

Pages that genuinely change frequently (news feeds, pricing pages, inventory) get re-crawled more often when Google detects the pattern -- but only if lastmod values have been historically accurate.

Crawl Rate Settings in Search Console

For large sites experiencing crawl pressure, Search Console historically offered a crawl rate limiter under Settings > Crawling, but Google deprecated that tool in early 2024. To slow Googlebot down now, temporarily return 503 or 429 responses, or use Google's crawl-problem report form. There is no supported way to raise the crawl rate above what Google's scheduler decides.

Server Response Time

Fast-responding servers get more crawl budget. If your TTFB is under 200ms, Google will crawl more pages per session than if your server takes 2 seconds to respond.

Sitemap Submission and Monitoring

Submit sitemaps through Search Console and reference them in robots.txt:

# robots.txt
Sitemap: https://example.com/sitemap-index.xml
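The standard library's robotparser can confirm the directive is actually visible to crawlers (the site_maps() helper requires Python 3.8+). A small sketch:

```python
from urllib.robotparser import RobotFileParser

def sitemaps_in_robots(robots_txt):
    """Parse robots.txt text and return any Sitemap: URLs it declares."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.site_maps() or []
```

In practice you would fetch https://example.com/robots.txt and pass its body in; an empty result means crawlers discovering your site via robots.txt will never see the sitemap.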

Monitor sitemap health in Search Console under Sitemaps:

  • Submitted vs. Indexed: If indexed count is significantly lower than submitted, you have indexation issues
  • Last read date: If Google has not read your sitemap in 30+ days, check for accessibility issues
  • Errors: Fix any reported parsing errors immediately