Sitemaps
Just like the name says, a sitemap is a map to a web site. Simple. Really, just a list of links that point to all the pages. There are two kinds of sitemap we’ll discuss here: HTML and XML. We’ll start with HTML sitemaps, then move on to the more exotic XML variety.
HTML sitemaps are placed on a regular web page and typically linked from the home page. Conceivably, this sort of reference could prove useful for people looking for specific information on a large or complex site. And in fact, up to 25% of internet users used to rely on sitemaps at least some of the time to find content. I say “used to” because that number peaked around 2002. Since then, sitemap use has declined steadily, falling to around 7% in 2008 and to somewhat less than that today. So if nobody’s really using your HTML sitemap, why do you need it?
The answer is, of course, search marketing. Search spiders aren’t very smart (as we’ve noted here before). They have trouble following certain kinds of links and reading some sorts of link text. Sometimes they get trapped in loops they can’t get out of. Sometimes they index vast numbers of dynamically generated pages that don’t really exist. Sometimes they skip entire sections of a site. An HTML sitemap—properly designed—provides an easy set of pathways into the site for spiders to follow.
And by properly designed, we mean that HTML sitemaps should:
- be made of nothing but text links.
- contain no links other than the map links (no need for normal page navigation here).
- be built on a logical structure.
- contain no more than 50 links per page*. If you have more links than that, separate your sitemap into multiple levels. For instance, you can make sitemap_1 with links to the category and subcategory pages only, then additional sitemap pages with links to the subcategory/product pages. Or you can break them into multiple pages based on a simple alphabetical sort. Or whatever. Just be sure to link the multiple sitemaps to each other (a bare-bones example follows below).
- be prominently linked from the home page. Sitemap links in the footer are okay as long as there isn’t much content above them on the page. We prefer linking to the sitemap from above the main header whenever possible.
- be kept up to date. Larger sites should consider investing in scripts or other technology to automate their sitemaps. Generating them dynamically will ensure that the links are always current.
- be linked from every indexable page on the site. If a spider comes into your site for the first time from somewhere in the deep pages, this will help it crawl back up the structure to find the rest of them.
* Reason: some spiders will only follow and index a set number of links per page, always starting from the first they encounter. This number is different for different search engines, but 50 seems pretty safe. This is also the reason to place your sitemap link at the top of the page. If your homepage has 50+ links on it before you get to the sitemap, some engines may never see it.
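To make those rules concrete, here’s a bare-bones sketch of what the body of an HTML sitemap page might look like. The categories and file names are invented for illustration; the point is plain text links, a logical grouping, and a link on to the next sitemap page:

<h1>Site Map</h1>
<h2>Products</h2>
<ul>
  <li><a href="/products/">Products overview</a></li>
  <li><a href="/products/widgets.html">Widgets</a></li>
  <li><a href="/products/gadgets.html">Gadgets</a></li>
</ul>
<h2>Support</h2>
<ul>
  <li><a href="/support/">Support overview</a></li>
  <li><a href="/support/faq.html">FAQ</a></li>
</ul>
<p><a href="/sitemap_2.html">Site map, page 2</a></p>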
XML Sitemaps
In 2005, Google unveiled what they called the “Sitemaps Protocol.” The idea was to create a single format for building a sitemap file that all (or at least “most”) search engines could use to find and index pages that might be otherwise difficult to crawl. This protocol uses XML as a formatting medium. It’s simple enough to code by hand, but robust enough to support dynamic, database-driven systems.
At first, only Google crawled sitemap.xml files, but they encouraged webmasters to create and publish them by opening a submission service. You would build an XML sitemap, upload it to your web server, then submit the URL to Google via their webmaster interface. The Goog would crawl it, and—in theory—follow all the links and index all your pages.
It actually worked rather well. Pretty soon, all the web pros were calling the system “Google Sitemaps” and uploading and submitting like crazy. With so many sitemaps installed on so many websites, it wasn’t long before the other major engines adopted the protocol.
Are XML sitemaps a magic bullet?
No. Don’t be silly. But they are useful additions to a website’s structural navigation, especially for complex architectures that may be resistant to spider crawls. We’ve used them on many sites and find that a valid XML sitemap can lead to a faster, more accurate indexing.
So what is this thing?
It’s really a pretty simple construction. You could easily make one without any understanding of XML at all. The Sitemaps Protocol dictates a text file, with the extension “xml,” using this template:
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
    <lastmod>2006-11-18</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
Every page on your site that you want crawled gets its own <url> entry, with the address between <loc></loc> markers. You do not have to set every parameter. This would be a valid sitemap for a site with a home page and three internal pages:
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
  </url>
  <url>
    <loc>http://example.com/page.html</loc>
  </url>
  <url>
    <loc>http://example.com/page2.html</loc>
  </url>
  <url>
    <loc>http://example.com/page3.html</loc>
  </url>
</urlset>
The other parameters (lastmod, changefreq, and priority) are nice ideas, but ideas we’ve never seen have any effect. So use ‘em or don’t. You can write an XML sitemap with any text editor. Just be sure to save it with "utf-8" encoding and with the name sitemap.xml. (To save in "utf-8" encoding in Notepad, click “Save As” and you’ll find an encoding pull-down menu at the very bottom of the box.)
And wait! It can be even simpler! The Sitemaps Protocol also accepts a simple list of URLs in a plain text file, like so:
http://example.com/
http://example.com/page.html
http://example.com/page2.html
http://example.com/page3.html
(The file would be named “sitemap.txt” instead of “sitemap.xml” and must also be "utf-8" encoded.)
And wait again! Even simpler than that! There are a host of online tools that will turn a list of URLs into an XML sitemap, or even spider your site for you and produce the sitemap file from that.
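If you’d rather script it yourself, it doesn’t take much. Here’s a rough sketch in Python; the URL list is a placeholder, so feed it your real pages:

# build_sitemap.py: turn a plain list of URLs into a sitemap.xml file
# (a minimal sketch; the URLs below are placeholders)
from xml.sax.saxutils import escape

urls = [
    "http://example.com/",
    "http://example.com/page.html",
    "http://example.com/page2.html",
    "http://example.com/page3.html",
]

lines = ['<?xml version="1.0" encoding="utf-8"?>',
         '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
for url in urls:
    # escape() keeps characters like & from breaking the XML
    lines.append("  <url><loc>%s</loc></url>" % escape(url))
lines.append("</urlset>")

# the protocol requires utf-8 encoding
with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")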
There are just a couple of rules to be mindful of:
- Sitemap files cannot be larger than 10 MB, uncompressed
- Sitemap files can be compressed as a gzip file
- The maximum number of URLs per file is 50,000
- Multiple sitemaps can be linked together with a “Master Sitemap” (what the protocol calls a sitemap index file; there’s an example after this list)
- Sitemaps should not contain duplicate URLs
- Sitemaps should be referenced in your robots.txt file using this notation:
Sitemap: <sitemap_location>
(of course, “sitemap_location” would be the actual URL of your sitemap file, e.g., Sitemap: http://example.com/sitemap.xml)
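That “Master Sitemap” is, in the protocol’s own terms, a sitemap index file: one more XML file listing the locations of your other sitemap files. A minimal example, with invented file names:

<?xml version="1.0" encoding="utf-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://example.com/sitemap_1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://example.com/sitemap_2.xml</loc>
  </sitemap>
</sitemapindex>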
When you have the file ready, you should use one of the many XML sitemap verification services. An invalid sitemap won’t help much.
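If you want a quick sanity check of your own first, any XML parser will at least catch broken markup. A rough sketch in Python (this tests well-formedness only, not conformance to the sitemap schema):

# quick well-formedness check for sitemap.xml
import xml.etree.ElementTree as ET

try:
    ET.parse("sitemap.xml")  # raises ParseError on malformed XML
    print("sitemap.xml is well-formed")
except ET.ParseError as err:
    print("sitemap.xml has a problem:", err)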
Should you submit the file to search engines?
You can. If your site is brand new, it might help. But if you’ve done it right—complete with an entry in the robots.txt file—you really shouldn’t have to. Google, Bing, and Yahoo all know where you live.
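If you do submit, the major engines have also historically accepted a simple HTTP “ping” carrying your sitemap’s URL. Google’s, for example, has looked like the line below, though endpoints change over time, so treat it as illustrative rather than gospel:

http://www.google.com/ping?sitemap=http://example.com/sitemap.xml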