Sitemaps

Just like the name says, a sitemap is a map to a web site. Simple. Really, it’s just a list of links that point to all the pages. There are two kinds of sitemap we’ll discuss here: HTML and XML. We’ll start with HTML sitemaps, then get into the more exotic XML variety.

HTML sitemaps are placed on a regular web page and typically linked from the home page. Conceivably, this sort of reference could prove useful for people who are looking for specific information on a large or complex site. And in fact, up to 25% of internet users used to rely on sitemaps at least some of the time to find content. I say “used to” because that number hit its height in about 2002. Since then, sitemap use has declined steadily, falling to around 7% in 2008 and to something lower still today. So if nobody’s really using your HTML sitemap, why do you need it?

The answer is, of course, search marketing. Search spiders aren’t very smart (as we’ve noted here before). They have trouble following certain kinds of links and reading some sorts of link text. Sometimes they get trapped in loops they can’t get out of. Sometimes they index vast numbers of dynamically generated pages that don’t really exist. Sometimes they skip entire sections of a site. An HTML sitemap—properly designed—provides an easy set of pathways into the site for spiders to follow.

And by properly designed, we mean that HTML sitemaps should:

* Contain no more than 50 links per page, splitting the map across multiple pages if necessary.
* Be linked from the top of the home page.

Reason: some spiders will only follow and index a set number of links per page, always starting from the first they encounter. This number is different for different search engines, but 50 seems pretty safe. This is also the reason to place your sitemap link at the top of the page. If your home page has 50+ links on it before you get to the sitemap link, some engines may never see it.
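To make that concrete, here’s a minimal sketch of what such a page might look like. The page names are hypothetical; the point is simply plain, crawlable anchor links kept under the 50-link ceiling:

<!-- sitemap.html: plain anchor links a spider can follow, fewer than 50 per page -->
<h1>Site Map</h1>
<ul>
  <li><a href="http://example.com/">Home</a></li>
  <li><a href="http://example.com/page.html">Page One</a></li>
  <li><a href="http://example.com/page2.html">Page Two</a></li>
  <li><a href="http://example.com/page3.html">Page Three</a></li>
</ul>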

XML Sitemaps

In 2005, Google unveiled what they called the “Sitemaps Protocol.” The idea was to create a single format for building a sitemap file that all (or at least “most”) search engines could use to find and index pages that might otherwise be difficult to crawl. This protocol uses XML as a formatting medium. It’s simple enough to code by hand, but robust enough to support dynamic, database-driven systems.

At first, only Google crawled sitemap.xml files, but they encouraged webmasters to create and publish them by opening a submission service. You would build an XML sitemap, upload it to your web server, then submit the URL to Google via their webmaster interface. The Goog would crawl it, and—in theory—follow all the links and index all your pages.

It actually worked rather well. Pretty soon, all the web pros were calling the system “Google Sitemaps” and uploading and submitting like crazy. With so many sitemaps installed on so many websites, it wasn’t long before the other major engines adopted the protocol.

Are XML sitemaps a magic bullet?

No. Don’t be silly. But they are useful additions to a website’s structural navigation, especially for complex architectures that may be resistant to spider crawls. We’ve used them on many sites and find that a valid XML sitemap can lead to a faster, more accurate indexing.

So what is this thing?

It’s really a pretty simple construction. You could easily make one without any understanding of XML at all. The Sitemaps Protocol specifies a plain text file, with the extension “.xml,” using this template:

<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
    <lastmod>2006-11-18</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

Every page on your site that you want crawled gets its own <url> entry, with the address between <loc></loc> markers. You do not have to set every parameter. This would be a valid sitemap for a site with a home page and three internal pages:

<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
  </url>
  <url>
    <loc>http://example.com/page.html</loc>
  </url>
  <url>
    <loc>http://example.com/page2.html</loc>
  </url>
  <url>
    <loc>http://example.com/page3.html</loc>
  </url>
</urlset>

The other parameters (lastmod, changefreq, and priority) are nice ideas, but ideas we’ve never seen have any effect. So use ’em or don’t. You can write an XML sitemap with any text editor. Just be sure to save it with “utf-8” encoding and with the name sitemap.xml. (To save in “utf-8” encoding in Notepad, click “Save As” and you’ll find it in a pull-down menu at the very bottom of the box.)

And wait! It can be even simpler! The Sitemaps Protocol also accepts a simple list of URLs in a plain text file, like:

http://example.com/
http://example.com/page.html
http://example.com/page2.html
http://example.com/page3.html

(The file would be named “sitemap.txt” instead of “sitemap.xml” and also must be “utf-8” encoded.)

And wait again! Even simpler than that! There are a host of online tools that will turn a list of URLs into an XML sitemap, or even spider your site for you and produce the sitemap file from that.
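If you’d rather not trust a third-party tool, the conversion is easy to script yourself. Here’s a minimal sketch in Python; the input file name urls.txt is an assumption (one URL per line), and the output follows the template above:

# make_sitemap.py: a rough sketch, not a polished tool.
# Reads urls.txt (one URL per line) and writes sitemap.xml.
from xml.sax.saxutils import escape

with open("urls.txt", encoding="utf-8") as f:
    urls = [line.strip() for line in f if line.strip()]

# The protocol requires entity-escaped URLs; escape() handles &, <, and >.
entries = "\n".join(
    "  <url>\n    <loc>%s</loc>\n  </url>" % escape(url) for url in urls
)

# Write with utf-8 encoding, as the protocol demands.
with open("sitemap.xml", "w", encoding="utf-8") as out:
    out.write('<?xml version="1.0" encoding="utf-8"?>\n')
    out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    out.write(entries + "\n")
    out.write("</urlset>\n")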

There are just a couple of rules to be mindful of:

* A single sitemap file can list at most 50,000 URLs and can be no larger than 10MB uncompressed. Larger sites split their URLs across multiple sitemap files and tie them together with a sitemap index file (example below).
* A sitemap can only include URLs from its own directory or below, which is why sitemap.xml normally lives in the site root.

When you have the file ready, you should run it through one of the many XML sitemap verification services. An invalid sitemap won’t help much.
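For reference, a sitemap index file uses the same XML dialect; it just points at sitemap files instead of pages:

<?xml version="1.0" encoding="utf-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://example.com/sitemap1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://example.com/sitemap2.xml</loc>
  </sitemap>
</sitemapindex>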

Should you submit the file to search engines?

You can. If your site is brand new, it might help. But if you’ve done it right—complete with an entry in the robots.txt file—you really shouldn’t have to. Google, Bing, and Yahoo all know where you live.
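That robots.txt entry, for the record, is a single line pointing at wherever the sitemap file lives (using the example.com address from the samples above):

Sitemap: http://example.com/sitemap.xml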