May 05, 2008

The problem with alerts - Google or otherwise

I use Google Alerts to keep an eye on current events. Over the weekend I got an alert: "AMEC uses Verity's K2". Since Verity is now part of former competitor Autonomy, and because K2 is generally not being actively marketed, I decided to read the article. Sure enough, the content is dated January 2004, but Google Alerts thinks it is brand new. So I have to conclude that either the publisher just changed something on the page, or Google is only now finding that document. Either way, Google thinks this is news, and in reality it isn't.

Not long after we started SearchButton.com, we met the Google founders Sergey and Larry. Mark Bennett, my co-founder at SearchButton and here at New Idea Engineering, asked about the then-young Google's handling of dates and recency, and the Google guys took the position that date wasn't that important. This has led to a couple of energetic email exchanges over the last few years, but my recent alert illustrates the problem that Google, and most other search technologies, have in generating really useful alerts. In fact, this subject was of such relevance to enterprise search owners that we ran an article about the importance of dates in the first issue of our enterprise search newsletter in April of 2003.

One could argue that Sergey and Larry are right, seeing as how they own a 767 and I'm lucky to be zipping around in a Cessna 172. But as the content on the internet gets older and older, this date issue will be more and more of a problem. Heck, go search the web for 'java printing', and you find the top article talks about Java 1.2 and Java 1.4. Is that really the newest information available?

In fairness, the Google Search Appliance does now allow you to include document recency in its ranking - so maybe they are coming around to our point of view - but this is not a Google problem: nearly every public and enterprise search crawler has the same problem.

It's tricky for an automated spider to recognize a new document. First, web servers lie about dates all the time, either through misconfiguration or because dynamic content makes a page's date 'just now' even if the document is years old. Technology can use checksums and other techniques to fingerprint a document, but even then, how can a crawler be as smart as a 5th grader?

Mark has long maintained that search engines need to capture a 'first seen on' date internally, as well as fingerprint the important content area of a web page. These two steps would allow a crawler to recognize when the content of a page has actually changed, as opposed to when the web server merely says it was changed. This kind of stuff isn't easy - but it's yet another thing that illustrates why enterprise search is so much more complex than plain old Internet search.
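To make those two steps concrete, here is a rough Python sketch of what a crawler might keep per URL. Everything in it is my own illustration, not how Google or any real engine does it: the names `fingerprint`, `PageRecord` and `CrawlIndex` are made up, SHA-256 is just one possible checksum, and the sketch assumes something upstream has already pulled out the "important content area" as plain text, which is the genuinely hard part.

```python
import hashlib
import re
from dataclasses import dataclass
from datetime import datetime


def fingerprint(main_text: str) -> str:
    """Checksum of the main content, ignoring case and whitespace differences."""
    normalized = re.sub(r"\s+", " ", main_text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


@dataclass
class PageRecord:
    first_seen: datetime       # when the crawler first fetched this URL
    content_changed: datetime  # last crawl at which the fingerprint differed
    digest: str                # fingerprint of the main content area


class CrawlIndex:
    """Hypothetical per-URL store; a real crawler would persist this."""

    def __init__(self):
        self._records = {}

    def observe(self, url: str, main_text: str, crawl_time: datetime) -> str:
        """Record a crawl and report 'new', 'changed' or 'unchanged'.

        The web server's own Last-Modified date is deliberately ignored;
        only the crawler's clock and the content fingerprint are trusted.
        """
        digest = fingerprint(main_text)
        record = self._records.get(url)
        if record is None:
            self._records[url] = PageRecord(crawl_time, crawl_time, digest)
            return "new"
        if record.digest != digest:
            record.digest = digest
            record.content_changed = crawl_time
            return "changed"
        return "unchanged"

    def record(self, url: str) -> PageRecord:
        return self._records[url]
```

With something like this, an alerting system could fire only on "new" or "changed" results and skip pages whose server date moved while the content stayed put. It does not help when a crawler stumbles on an old page for the first time; for that case you would still need to read a date out of the content itself.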

Is any of this important to your enterprise search application?
