Google is usually very coy when addressing SEO, so it was really refreshing to see a simply-worded response from Google employee John Mueller about how the search engine uses the sitemap.xml file.
Mueller starts out by explaining what the sitemap file does:
A Sitemap file helps search engines to discover new and updated URLs on your website. In particular, if your website is fairly large, then this can help them to be able to focus on the new & updated content, instead of having to blindly crawl through everything to see if anything has changed. That can result in new content being found much faster, which can be quite noticeable especially if the site is larger or more complex.
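To make the "new and updated URLs" idea concrete, a sitemap is just an XML list of URLs, each optionally tagged with a last-modified date that crawlers can use to skip unchanged pages. Here's a minimal sketch that builds one with Python's standard library (the URLs and dates are made up for illustration):

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a sitemap.xml string for a list of (loc, lastmod) pairs."""
    urlset = ET.Element("urlset", xmlns=NS)
    for loc, lastmod in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # lastmod is what lets crawlers focus on new & updated content
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical URLs for illustration only
sitemap = build_sitemap([
    ("https://example.com/", "2013-01-01"),
    ("https://example.com/blog/new-post", "2013-01-15"),
])
```

In practice your CMS or a generator tool produces this file for you, but the format really is this simple.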
None of that information is especially interesting. But it gets a bit more interesting…
Find the number of indexed URLs for your website: These statistics are recalculated daily and very accurate. You can find these in the Sitemaps detail page.
Interesting. I wonder why Google would recalculate these every day.
Discover canonicalization issues: If the numbers there don’t match up, that’s frequently a sign that you’re specifying URLs in the Sitemap file that don’t match what we find during our crawling. That’s usually a sign that you need to work on canonicalization.
This is handy. Canonicalization issues are notoriously hard to pin down, let alone verify that you've fixed, so this is a helpful tool to that end.
Help with canonicalization: When we find multiple URLs on your site that show identical content, we will give any URL that’s listed in a Sitemap an extra edge, even if you don’t use other canonicalization methods.
Well, that's clear. Sitemaps are a tie-breaker.
Find badly-indexed parts of your site: These counts are supplied per Sitemap file, so you can create separate Sitemap files for logical sections of your site, to discover areas where Google isn’t indexing as much as you’d like.
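Since the indexed counts are reported per sitemap file, the trick is to split your URLs into one file per logical section of the site. A sketch of that grouping, assuming sections map to the first path segment (a simplification; real sites may need smarter rules):

```python
from collections import defaultdict
from urllib.parse import urlparse

def split_by_section(urls):
    """Group URLs by their first path segment, so each section can get
    its own sitemap file (e.g. sitemap-blog.xml, sitemap-products.xml)."""
    sections = defaultdict(list)
    for url in urls:
        path = urlparse(url).path.strip("/")
        section = path.split("/")[0] or "root"
        sections[section].append(url)
    return dict(sections)

# Hypothetical URLs for illustration only
groups = split_by_section([
    "https://example.com/",
    "https://example.com/blog/post-1",
    "https://example.com/blog/post-2",
    "https://example.com/products/widget",
])
```

Submit each resulting file separately (or via a sitemap index) and the Sitemaps detail page will show you, section by section, where indexing falls short.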
This is an idea people have talked about for a while (this post on Video Sitemaps comes to mind), but one you don't see in use very often.
Prioritize crawl errors: In the crawl errors section, URLs that were specified in Sitemaps files are listed separately. Since you specifically supplied these URLs, we assume that you want them indexed, and that any crawl errors there are important.
When you give a URL to Google in your sitemap, any crawl errors on that page are listed separately and given heightened priority.
Finally Mueller came back to the discussion and added an addendum…
Edited to add: another thing that I personally find extremely useful with regards to Sitemaps is that if you’re not generating them directly with your CMS, you invariably find out a lot about how your website is crawlable, and what kind of URLs are discovered during that process. I’ve seen many cases where crawling a website with a tool on your side (eg a Sitemaps generator) will bubble up issues that you might miss otherwise, be that session-IDs in URLs, duplicate content through URL differences, infinite spaces (such as endless calendars), or even parts of a site that aren’t linked at all.
We've found the same thing Mueller is talking about while doing SEO audits for clients. Running a spider through your site (especially one like Screaming Frog for Mac or Xenu for Windows) is one of the most enlightening things you can do for any site. Over time, you develop a sixth sense for how search engines view a site and spot problems quickly.
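The kind of issue a spider or sitemap generator surfaces can be sketched with simple URL normalization: collapse variants that differ only by session IDs or query-parameter order, and anything that still maps to the same page is a duplicate-content candidate. The parameter names and URLs below are assumptions for illustration, not what any particular tool uses:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Assumed session/tracking parameter names; real sites vary
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}

def normalize(url):
    """Strip session-ID parameters and sort the remaining query string,
    so duplicate-content URL variants collapse to one canonical form."""
    parts = urlparse(url)
    params = [(k, v) for k, v in parse_qsl(parts.query)
              if k.lower() not in SESSION_PARAMS]
    return urlunparse(parts._replace(query=urlencode(sorted(params))))

def find_duplicates(discovered):
    """Map each normalized URL to the crawled variants that share it."""
    seen = {}
    for url in discovered:
        seen.setdefault(normalize(url), []).append(url)
    return {k: v for k, v in seen.items() if len(v) > 1}

# Hypothetical crawl results for illustration only
dupes = find_duplicates([
    "https://example.com/page?sid=abc123",
    "https://example.com/page?sid=def456",
    "https://example.com/page?b=2&a=1",
    "https://example.com/page?a=1&b=2",
])
```

Real crawlers do far more (following links, detecting infinite calendar spaces, finding orphaned pages), but even this much is often enough to spot the session-ID and parameter-order duplicates Mueller mentions.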
All in all, it's nice to see Google employees giving concrete answers in public forums. Hopefully this continues!