"Sitemaps" are smaller step forward than we had hoped
There has been a lot of press about the XML sitemaps standard, now that many of the major Internet search portals have agreed to adopt it. You can visit the official site and some of the posts covering the major engines' adoption of it.
But my excitement was tempered by the limited scope of what the standard covers. The sitemaps format does let you list URLs of individual pages on your site, along with some meta data about them; but despite its name, it does not let you provide a true hierarchical "map" of your site. So you cannot, for example, indicate which pages are in the "Marketing" section vs. which pages are in "Support". It's mostly a straight list of uncategorized URLs; it conveys no structure or topology, no "map".
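To make the "flat list" point concrete, here's a tiny illustrative sitemap. The element names and the namespace are the ones the published 0.9 spec defines; the URLs and values are just made up for the example:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/support/faq.html</loc>
        <lastmod>2007-06-01</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.5</priority>
      </url>
      <url>
        <loc>http://www.example.com/marketing/press.html</loc>
        <lastmod>2007-05-15T09:30:00+00:00</lastmod>
        <changefreq>monthly</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>

Notice that nothing in the file says the first page belongs to "Support" and the second to "Marketing" - you can only guess that from the URL paths.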
I'm not "trashing it", it IS an improvement for spiders, they will know which URLs to fetch in a much more efficient way. And the folks at sitemaps.org have also got to the big guys to pay attention and adopt it. I want to make it clear that I strongly commend them for this. But lets review an older standard, and look at sitemaps 0.9, and then perhaps start talking about sitemaps 2.0... :-)
For those of you who remember Ultraseek, it had a "standard" 10 years ago for listing the URLs on your site, so that spiders knew what to fetch and when each page had last changed. It was called sitelist.txt. It also lacked any concept of site structure, but it was meant to be a simple standard, so structure was beyond its scope.
So how does the new XML sitemaps standard compare to the old Ultraseek sitelist.txt spec? Based on my preliminary review, it looks like this:
Similar:
* They both list out individual URLs.
* They both list the last modification date (in different formats; more on that later).
* Neither conveys any structure about the site, like Marketing vs. Sales vs. Support.
Pros of the new sitemaps standard:
+ Also allows for a "priority" field, so you can tell the spider which pages are more important. It's a value from 0.0 to 1.0 - in effect, 0% to 100%. I suspect some companies are going to have trouble deciding that ANY of their pages deserve less than a 100% priority. Fortunately it only affects the weighting within that one site - you can't "cheat" and give your pages "1,000%" priority to bump competitors off the results, though I suspect some will try. :-)
+ You can tell the spider how often you think the page will change, vs. letting the spider guess for itself.
+ The date and time are human readable. You can just put the date (and omit the time) if you want, though for frequently updated sites I think you'd want to keep the time component for minute-by-minute accuracy. This would be especially true if enterprise search engine vendors also adopt this standard. With the timezone omitted, I'm not sure whether it assumes GMT (there's a small conversion sketch just after this list).
= (neutral) This format is XML, vs. Ultraseek's delimited text file. People who love XML will call this a win, but for such simple data the old Ultraseek format was much easier to parse. XML can easily convey a hierarchy, so there is potential for future functionality - a true "map" - but for now it seems like overkill to me.
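On the date-format question, here's a minimal Python sketch (mine, not anything from either spec) that turns an old-style Unix epoch timestamp into the human-readable W3C Datetime string the new format expects, with an explicit +00:00 offset so there's no guessing about GMT:

    import time

    def epoch_to_w3c(epoch_seconds):
        # Format a Unix epoch timestamp (old sitelist.txt style) as a
        # W3C Datetime string (new sitemaps <lastmod> style), pinned to UTC.
        return time.strftime("%Y-%m-%dT%H:%M:%S+00:00", time.gmtime(epoch_seconds))

    print(epoch_to_w3c(1181065380))   # -> 2007-06-05T17:43:00+00:00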
Pros for the older sitelist.txt / Ultraseek format:
+ More compact
+ Also included the size of the document
+ Deleted documents could be indicated with a new size of "0" bytes, so the spider would know to DELETE that URL from its index.
= (neutral) Dates were in the numeric Unix epoch format - easy for a computer to parse, but impossible for humans to read. Since humans rarely view the file directly, and in 1995 performance was a bigger concern, I think there was some merit to this simpler format back then.
This last feature, knowing that a document is now GONE without needing to try refetching it, seems like a big advantage for the older format. I hope the new format will adopt it as well.
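I don't have the old Ultraseek spec in front of me, so take this as a sketch of the idea rather than the exact syntax, but each sitelist.txt line boiled down to roughly a last-modified time (Unix epoch), a size in bytes, and a URL - with a size of 0 meaning "this one is gone":

    1181001600  18342  http://www.example.com/support/faq.html
    1180656000  52811  http://www.example.com/marketing/press.html
    1181065380      0  http://www.example.com/products/old-widget.html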
What I'd like to see in the sitemaps 2.0 standard:
* Allow me to put URLs into a hierarchy, and perhaps the spec could suggest some common nodes, such as:
    Support
        Downloads
        FAQs
        Contacting Support
    Marketing
        Press Releases
        White Papers
        Product Brochures
    Products
        Super Widget
        Beta Zapper
        Gamma Wave
    ...
We have a simple in-house format for representing this; it could be adapted to XML (see the speculative sketch after this list).
* Provide additional meta data about the page, such as title, author, summary, product code
* Identify one or more subject categories, perhaps based on a DMOZ trove/hive/path location.
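Purely as a strawman - none of these elements exist in any spec, they're just me sketching - a hierarchical sitemap with extra metadata might look something like this:

    <!-- Speculative sketch only: <siteoutline>, <section>, <title>, <author>,
         <summary>, and <category> are invented for illustration; they're not
         part of the sitemaps 0.9 standard. -->
    <siteoutline>
      <section name="Support">
        <section name="FAQs">
          <url>
            <loc>http://www.example.com/support/faq.html</loc>
            <lastmod>2007-06-01</lastmod>
            <title>Frequently Asked Questions</title>
            <author>Support Team</author>
            <summary>Answers to common Super Widget questions.</summary>
            <category>Computers/Software</category>
          </url>
        </section>
      </section>
      <section name="Marketing">
        <section name="Press Releases">
          ...
        </section>
      </section>
    </siteoutline>

Nesting like that is the one thing XML is genuinely good at, so the format is already most of the way there.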
An advantage of having some of this data repeated in this format, vs. fetching the actual page and looking at its standard meta data, is that a spider could build an overview of a site much more quickly. It could get an overview of 1,000 pages in one fetch, in the time it might normally take to fetch just 2 or 3 pages.
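As a rough illustration of the "overview in one fetch" idea, here's a short Python sketch (mine, not anything official) that pulls a single sitemap file and summarizes every URL it lists:

    import urllib.request
    import xml.etree.ElementTree as ET

    # Namespace used by every element in a sitemaps 0.9 file.
    NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    def sitemap_overview(sitemap_url):
        # One HTTP fetch returns metadata for every page the sitemap lists.
        data = urllib.request.urlopen(sitemap_url).read()
        root = ET.fromstring(data)
        for entry in root.findall(NS + "url"):
            loc = entry.findtext(NS + "loc")
            lastmod = entry.findtext(NS + "lastmod", default="(not given)")
            priority = entry.findtext(NS + "priority", default="0.5")  # spec default
            print(loc, lastmod, priority)

    # Hypothetical URL, just for illustration.
    sitemap_overview("http://www.example.com/sitemap.xml")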
I looked at a lot of these metadata/sitemap formats a few years ago, including a carefully-designed (but not very robot-friendly) one from Australia (AGLS, http://www.agls.gov.au/).
Two quick thoughts.
First, it isn't clear whether the separate metadata is authoritative over the document metadata or whether it is a potentially-stale cache of it. Either approach is useful; ambiguity is not.
Second, this is all solved by RSS (pronounced "Atom"). Atom feeds are verbose, but they are clear and supported by lots of software. Any real search engine should already be reading and mining feeds anyway.
Posted by: Walter Underwood | June 05, 2007 at 05:43 PM