Sitemap XML
Sitemap XML refers to an XML-formatted file listing the URLs on a website that the site owner wants search engines to know about, along with optional metadata for each URL (last modification date, change frequency, priority). The sitemap is not a browsing tool for humans - it’s a discovery signal for crawlers. A sitemap doesn’t force indexing, but it dramatically increases the chance that a crawler finds every URL worth indexing, especially on large or poorly interlinked sites.
The core file structure
A valid XML sitemap is a simple document:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page-1/</loc>
    <lastmod>2026-02-14</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
Only the loc element is required; the rest are optional. In practice, Google ignores changefreq and priority entirely - they’re legacy fields that persist in the spec but carry no ranking weight. lastmod, by contrast, Google does use when the values are accurate and the site has a track record of keeping them accurate.
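The structure above is simple enough to generate with a standard library. Here is a minimal sketch in Python using xml.etree.ElementTree; build_sitemap is a hypothetical helper (not a standard API), and the URL and date are the example values from the snippet above:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """entries: list of (loc, lastmod) tuples; returns the sitemap XML as a string."""
    # Register the sitemap namespace as the default so no prefix is emitted.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        if lastmod:  # lastmod is optional per the spec
            ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    # Prepend the XML declaration explicitly for a predictable header.
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            + ET.tostring(urlset, encoding="unicode"))

doc = build_sitemap([("https://example.com/page-1/", "2026-02-14")])
print(doc)
```

Omitting changefreq and priority, as this sketch does, loses nothing in practice given that Google disregards both.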
Size limits and sitemap indexes
A single sitemap file is capped at 50,000 URLs or 50MB uncompressed (whichever comes first). Larger sites split into multiple sitemaps and use a sitemap index - an XML file that lists the locations of the individual sitemaps:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-blog.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-products.xml</loc></sitemap>
</sitemapindex>
A sitemap index can reference up to 50,000 sitemaps, each with up to 50,000 URLs - effectively 2.5 billion URLs, far more than any realistic site needs.
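The splitting logic follows directly from the limits above. A sketch in Python, where plan_sitemaps and build_index are hypothetical helpers and the file-naming scheme (sitemap-1.xml, sitemap-2.xml, …) is an assumption, not a convention the spec mandates:

```python
MAX_URLS = 50_000  # per-sitemap URL cap from the protocol

def plan_sitemaps(urls, base="https://example.com"):
    """Split a URL list into spec-sized chunks; return (chunk, sitemap_url) pairs."""
    chunks = [urls[i:i + MAX_URLS] for i in range(0, len(urls), MAX_URLS)]
    return [(chunk, f"{base}/sitemap-{n}.xml")
            for n, chunk in enumerate(chunks, start=1)]

def build_index(sitemap_urls):
    """Render a sitemap index listing the child sitemap locations."""
    entries = "\n".join(
        f"  <sitemap><loc>{u}</loc></sitemap>" for u in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</sitemapindex>"
    )
```

A 120,000-URL site would yield three child sitemaps (50,000 + 50,000 + 20,000) plus one index file. Note the 50MB uncompressed cap also applies, so very long URLs may force smaller chunks than MAX_URLS.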
Sitemap variants
Three specialised sitemap types worth knowing:
Image sitemaps. Extend the URL entries with <image:image> tags to surface imagery Google might miss in its page rendering. Most useful for e-commerce and photography-heavy sites.
Video sitemaps. Surface video content with duration, thumbnail, and transcript metadata. Important for sites where video is the primary content.
News sitemaps. A specific format for Google News-eligible sites, with tight time-window and article-specific metadata requirements. Required for Google News inclusion.
How sitemaps are submitted
Four discovery mechanisms:
robots.txt reference. Add Sitemap: https://example.com/sitemap.xml to your robots.txt. Every crawler finds it.
Google Search Console submission. Upload the sitemap URL directly in GSC. Gives visibility into indexing status per sitemap, which is operationally useful.
Bing Webmaster Tools. Bing’s equivalent submission interface, with similar diagnostic benefits.
Standard crawl discovery. If your sitemap lives at /sitemap.xml, many crawlers check that conventional location even without an explicit reference - though this is a convention, not a guarantee.
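Put together, a robots.txt that advertises the sitemap might look like this (the Disallow path is a hypothetical example; the Sitemap directive must use the full absolute URL):

```
# https://example.com/robots.txt
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
```

The Sitemap line is independent of any User-agent block, so it can sit anywhere in the file.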
What to include and exclude
The sitemap should list indexable, canonical, live URLs - nothing else. Four common mistakes:
Including non-canonical URLs. URLs with UTM parameters, filter combinations, or session IDs. The sitemap should list only the canonical versions.
Including redirected URLs. URLs that return 301s or 302s shouldn’t appear in sitemaps - they waste crawl budget and confuse the signal.
Including blocked URLs. URLs disallowed in robots.txt or carrying a noindex tag shouldn’t appear. Listing them in a sitemap while blocking them elsewhere sends contradictory signals.
Including deleted URLs. 404s in the sitemap are a common clean-up task. Keep sitemap entries current.
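The redirect and deleted-URL mistakes above are mechanically checkable. A sketch of such an audit, where audit_entries is a hypothetical helper and fetch_status is an injected callable (plug in urllib, requests, or a stub), since catching the canonical and noindex mistakes would additionally require inspecting page markup:

```python
def audit_entries(urls, fetch_status):
    """Bucket sitemap URLs by HTTP status: ok / redirect / missing.

    fetch_status: callable mapping a URL to its HTTP status code.
    """
    report = {"ok": [], "redirect": [], "missing": []}
    for url in urls:
        status = fetch_status(url)
        if status in (301, 302, 307, 308):
            report["redirect"].append(url)  # list the redirect target instead
        elif status in (404, 410):
            report["missing"].append(url)   # deleted page: drop the entry
        else:
            report["ok"].append(url)
    return report
```

Anything in the redirect or missing buckets is a sitemap entry to fix; the ok bucket still needs a canonical/noindex check before it can be trusted.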
Why sitemap discipline matters for SEO
Three reasons:
Crawl efficiency on large sites. For a site with 50,000+ URLs, crawlers can’t discover every page through link graph alone. The sitemap is the difference between comprehensive indexing and patchy coverage.
Faster indexing of new content. A freshly published page listed in the sitemap with an updated lastmod gets discovered and indexed faster than one waiting to be found through organic crawl.
Diagnostics via Search Console. The “Coverage” and “Indexing” reports in GSC compare sitemap URLs against what’s actually indexed. A large gap (“submitted but not indexed”) signals content-quality or technical issues that need investigation.
Sitemap hygiene checklist
Six things worth auditing quarterly:
- Sitemap is accessible and returns 200 OK
- Sitemap matches the current set of canonical, live URLs
- lastmod values reflect real content changes, not deploy timestamps
- Sitemap is referenced in robots.txt
- Sitemap is submitted in Search Console and Bing Webmaster Tools
- Indexed-versus-submitted ratio in GSC is above 80%; investigate if it drops
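The last checklist item reduces to simple arithmetic; a tiny sketch, where index_ratio_ok is a hypothetical helper and the counts would come from the Search Console indexing report:

```python
def index_ratio_ok(indexed, submitted, threshold=0.8):
    """True if the indexed-versus-submitted ratio meets the 80% threshold."""
    return submitted > 0 and indexed / submitted >= threshold
```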
We built Penfriend to produce content with clean URL structures that integrate naturally into sitemap generation - every new piece gets listed, every URL is stable, every last-modified timestamp reflects real content changes.
Related terms
- XML Sitemap - the same concept; common alternative term
- robots.txt - the companion file that controls crawler access and can reference the sitemap
- Canonical URL - sitemaps should only list canonicals
- Search Engine Optimization (SEO) - sitemap hygiene is a foundational SEO practice
- On-Page Optimization - sitemap work is part of technical on-page SEO
