Web Search and Dates, again....
Miles likes to brag that our first argument with the Google founders goes back to 2000/2001, about dates, and he's blogged about date issues before.
I was reminded of this old argument today. We had a power glitch in Cupertino a little while ago, so I went online to see if there was any news - after Twittering about it of course!
Google brought back a top result with a promising title... from 2004:
http://www.sfgate.com/cgi-bin/article.cgi?f=/g/archive/2004/10/01/power01.DTL
Whatever... the other search portals had the same or similarly old results, so I'm not gonna sit and bash Google. In all fairness this is a tough problem to fix 100%, and all search engines have issues with bad dates.
AND, if no paper or blog talked about it, then there's nothing to "find" anyway. Google can't find what isn't there.
Going to Twitter search, I found my own posting, and then some guy asking about driving times to Cupertino.
But freshness of content remains a problem.
I have a compromise suggestion for "the powers that be". How about this: when a spider makes a guess as to the proper date of a story, it also attaches a "confidence" to that guess. For example, if the URL encodes a date, then I'd call that high confidence. The same goes for a newspaper byline. At the other end is the case where a web server gives the current date and time every time a page is fetched; that date is clearly not connected to the content, so it gets a lower confidence. And then there'd be some default weight for the first time a spider encounters a piece of content.
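To make the idea concrete, here is a rough sketch of what a date guess with a confidence might look like. Everything in it is my own illustration: the DateGuess class, the guess_date function, the URL pattern and the confidence numbers are made up for the example, not anything a real spider is known to use.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DateGuess:
    value: date
    confidence: float  # 0.0 to 1.0; the numbers below are illustrative only
    source: str        # which signal produced the guess


# Hypothetical pattern: matches a /YYYY/MM/DD/ path segment inside a URL.
URL_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def guess_date(
    url: str,
    byline_date: Optional[date] = None,
    first_seen: Optional[date] = None,
    http_date: Optional[date] = None,
) -> Optional[DateGuess]:
    """Return the most trusted date signal available, with a confidence."""
    match = URL_DATE.search(url)
    if match:
        try:
            year, month, day = (int(part) for part in match.groups())
            return DateGuess(date(year, month, day), 0.9, "url")
        except ValueError:
            pass  # digits that are not a real calendar date; try other signals
    if byline_date is not None:
        return DateGuess(byline_date, 0.8, "byline")
    if first_seen is not None:
        # Default weight: the first time the spider encountered the content.
        return DateGuess(first_seen, 0.5, "first_seen")
    if http_date is not None:
        # Servers often report "now" on every fetch, so trust this least.
        return DateGuess(http_date, 0.2, "http_header")
    return None
```

Run against the SFGate URL above, guess_date picks up 2004-10-01 from the path and tags it as a high-confidence "url" guess. A ranking function could then discount a stale or low-confidence date instead of treating every date the same.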
Even if a page doesn't have a date, a spider can derive a "bare minimum" date based on the dates of other pages that link to it.
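A minimal sketch of that fallback, assuming the spider already has dates for at least some of the linking pages. The function name earliest_inbound_date is hypothetical, and the approach only holds up if those linking-page dates are themselves trustworthy and the link was there when the linking page was published.

```python
from datetime import date
from typing import Iterable, Optional


def earliest_inbound_date(
    linking_page_dates: Iterable[Optional[date]],
) -> Optional[date]:
    """Return the earliest known date among pages that link to an undated page.

    The undated page presumably existed by then, so this gives a rough bound
    even when the page itself carries no date at all.
    """
    known = [d for d in linking_page_dates if d is not None]
    return min(known) if known else None
```

For example, if three pages link to an undated story and two of them are dated 2009-04-20 and 2009-03-02, the story gets 2009-03-02 as its fallback date. Linking pages with no known date are simply ignored.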
===
Great point Mark! We often point out that the 'first seen on' date is good, but 'date of first link into' is a good hint as well. Often spiders do not maintain much in the way of page-to-page state, but those that do can certainly add this as a validation!
Posted by: MarkH | April 23, 2009 at 01:30 AM