XML sitemaps help search engines discover your URLs and understand the structure of your site. Many sites, however, misconfigure their sitemaps with signals that waste crawl budget or mislead crawlers. The key insight: Google ignores the `priority` and `changefreq` fields entirely. Only `loc` and `lastmod` matter.
## What Google Actually Uses from Your Sitemap

### Fields That Matter

| Field | Used by Google | Notes |
|---|---|---|
| `loc` | Yes | The canonical URL of the page |
| `lastmod` | Yes, if accurate | Must reflect actual content changes, not auto-generated timestamps |
| `changefreq` | No | Google has explicitly confirmed it ignores this field |
| `priority` | No | Google has explicitly confirmed it ignores this field |
Google's John Mueller confirmed in 2023: "We essentially ignore changefreq and priority. We focus on the URLs themselves and the lastmod date."
### Why `lastmod` Accuracy Matters
Google uses lastmod to decide whether to re-crawl a page. If your CMS sets lastmod to the current date every time the sitemap regenerates (regardless of content changes), Google learns that your lastmod values are unreliable and starts ignoring them.
```xml
<!-- BAD: Auto-generated lastmod that doesn't reflect real changes -->
<url>
  <loc>https://example.com/about</loc>
  <lastmod>2024-11-15</lastmod> <!-- Updated daily, content unchanged since 2023 -->
</url>

<!-- GOOD: lastmod reflects actual content modification -->
<url>
  <loc>https://example.com/about</loc>
  <lastmod>2023-06-20</lastmod> <!-- Last real content edit -->
</url>
```
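One way to keep lastmod honest is to derive it from the content itself rather than from the regeneration time. A minimal sketch in Python, assuming a JSON file as the persistence layer (the file name and the idea of hashing the rendered page are illustrative, not tied to any particular CMS):

```python
# Only bump lastmod when the page's content hash actually changes.
import hashlib
import json
from datetime import date
from pathlib import Path

STATE_FILE = Path("sitemap_state.json")  # assumed persistence location

def lastmod_for(url: str, content: str) -> str:
    """Return the stored lastmod for url, updating it only on real changes."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    entry = state.get(url)
    if entry is None or entry["hash"] != digest:
        # Content changed (or first sighting): record today's date
        entry = {"hash": digest, "lastmod": date.today().isoformat()}
        state[url] = entry
        STATE_FILE.write_text(json.dumps(state))
    return entry["lastmod"]
```

Regenerating the sitemap daily then leaves lastmod untouched for pages whose rendered content has not changed.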
## Sitemap Structure Best Practices

### Segmentation by Content Type
Large sites should split sitemaps by content type for better diagnostics and crawl management:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap-index.xml -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2024-11-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
    <lastmod>2024-11-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2024-11-14</lastmod>
  </sitemap>
</sitemapindex>
```
Benefits of segmentation:
- Search Console reports sitemap-level indexation stats, so you can quickly see which content types have indexation problems
- Smaller sitemaps load faster and are less likely to time out during crawling
- You can update individual sitemaps without regenerating the entire index
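A segmented index like the one above takes only a few lines to generate. An illustrative sketch, with the segment names and dates taken from the example:

```python
# Build a sitemap index from per-content-type child sitemaps.
from xml.sax.saxutils import escape

def build_sitemap_index(base_url, segments):
    """segments maps a segment name (e.g. 'blog') to its lastmod date."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for name, lastmod in segments.items():
        lines.append('  <sitemap>')
        lines.append(f'    <loc>{escape(base_url)}/sitemap-{escape(name)}.xml</loc>')
        lines.append(f'    <lastmod>{lastmod}</lastmod>')
        lines.append('  </sitemap>')
    lines.append('</sitemapindex>')
    return '\n'.join(lines)

print(build_sitemap_index("https://example.com",
                          {"pages": "2024-11-01", "blog": "2024-11-15"}))
```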
### Size Limits

- Maximum 50,000 URLs per sitemap file (the sitemap protocol's hard limit, which Google enforces)
- Maximum 50MB uncompressed per file (the practical limit is lower)
- Use gzip compression: serve the file as `sitemap.xml.gz` to reduce transfer time
- Sitemap index entries are also capped: an index file can list up to 50,000 child sitemaps, far more than most sites need
### Only Include Indexable URLs
Every URL in your sitemap should be:
- Returning HTTP 200
- Self-canonicalized (the canonical tag points to itself)
- Not blocked by robots.txt
- Not tagged with `noindex`
Including non-indexable URLs in your sitemap wastes crawl budget and signals poor site hygiene to Google.
```python
# Validate sitemap URLs
import requests
from lxml import etree

def audit_sitemap(sitemap_url):
    resp = requests.get(sitemap_url, timeout=30)
    root = etree.fromstring(resp.content)
    ns = {'s': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

    issues = []
    for url_elem in root.findall('.//s:url', ns):
        loc = url_elem.find('s:loc', ns).text
        try:
            # HEAD without redirects: a sitemap URL should return 200 directly
            page = requests.head(loc, timeout=10, allow_redirects=False)
            if page.status_code != 200:
                issues.append(f"{loc} -> {page.status_code}")
        except requests.RequestException as e:
            issues.append(f"{loc} -> ERROR: {e}")

    print(f"Checked {len(root.findall('.//s:url', ns))} URLs, {len(issues)} issues")
    for issue in issues:
        print(f"  {issue}")
```
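The script above only checks status codes. The `noindex` and self-canonicalization conditions require inspecting the HTML. A sketch of that check, written against a raw HTML string so it can be combined with any fetching strategy (the function name is a hypothetical helper):

```python
# Check a page's HTML for noindex and non-self canonical tags.
from lxml import html as lxml_html

def indexability_issues(url, html_text):
    """Return a list of reasons this page should not be in the sitemap."""
    doc = lxml_html.fromstring(html_text)
    issues = []
    # A robots meta tag containing noindex disqualifies the URL
    for meta in doc.xpath('//meta[@name="robots"]/@content'):
        if "noindex" in meta.lower():
            issues.append("noindex meta tag")
    # The canonical tag, if present, should point back at the URL itself
    canonicals = doc.xpath('//link[@rel="canonical"]/@href')
    if canonicals and canonicals[0].rstrip("/") != url.rstrip("/"):
        issues.append(f"canonical points elsewhere: {canonicals[0]}")
    return issues
```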
## Influencing Crawl Rate
Since priority and changefreq are ignored, how do you actually influence how often Google crawls specific pages?
### Internal Link Weight
Pages with more internal links and links from higher-authority pages get crawled more frequently. This is a stronger crawl signal than anything in the sitemap.
### Fresh Content Signals

Pages that genuinely change frequently (news feeds, pricing pages, inventory) get re-crawled more often once Google detects the pattern, but only if lastmod values have been historically accurate.
### Crawl Rate Settings in Search Console

Search Console historically offered a crawl rate limiter in its settings, but Google deprecated the tool and removed it in early 2024. It only ever reduced the maximum crawl rate; it could not raise crawling above Google's default. If you need to slow Googlebot temporarily today, Google's documented mechanism is to return 500, 503, or 429 responses, which cause the crawler to back off.
### Server Response Time
Fast-responding servers get more crawl budget. If your TTFB is under 200ms, Google will crawl more pages per session than if your server takes 2 seconds to respond.
## Sitemap Submission and Monitoring
Submit sitemaps through Search Console and reference them in robots.txt:
```text
# robots.txt
Sitemap: https://example.com/sitemap-index.xml
```
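To confirm the directive is actually present in a deployed robots.txt, a small parser sketch (a hypothetical helper, not part of any library; fetching the file is left to the caller):

```python
# Extract Sitemap: directives from a robots.txt body.
def sitemap_lines(robots_txt):
    """Return the URLs declared via Sitemap: directives, in order."""
    return [line.split(":", 1)[1].strip()
            for line in robots_txt.splitlines()
            if line.lower().startswith("sitemap:")]
```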
Monitor sitemap health in Search Console under Sitemaps:
- Submitted vs. Indexed: If indexed count is significantly lower than submitted, you have indexation issues
- Last read date: If Google has not read your sitemap in 30+ days, check for accessibility issues
- Errors: Fix any reported parsing errors immediately