July 21, 2008

New site for quality search tools and components

We're happy to announce that we've kicked off the beta of a new site to help the intranet, customer-facing, and local search community by providing a directory of the best open source, no-cost, low-cost, and commercial software tools, components, and products.

That site, mentioned by Steve Arnold this morning in his interview with Mark and me, is

SearchComponentsOnline

We're beginning to post the tools we've been following for a few months now, and will have many more over the coming days, weeks, and months. Let us know if there's a tool you want to see listed by replying with a comment.

June 18, 2008

Search Quality: You Can't Improve What You Don't Measure

In our latest survey of new newsletter subscribers, we found that 29% had no formal metrics for measuring the quality of search results. Search metrics allow you to keep search on the right track and can be a powerful tool for managing your systems. They are a wonderful source of insights and trends. We thought we would share a few that we think work well. Many of these are covered in greater depth in Interpreting Your Search Activity Reports in the Enterprise Search newsletter.

  • Count the number of people who use search
  • Count the total number of searches
  • Count the number of searches that return zero results
  • Collect user feedback on the top 100 searches
  • Track email complaints about search
  • Measure the number of clicks on navigators (navigation menu items)
  • Business Goals
    • Reduce call volume (normalized for growth in customer base) by enabling self-service from search: results are good enough to reduce calls.
    • Reduce e-mail volume (again adjusted for growth in customer base) by enabling self-service from search: results are good enough to reduce e-mails.
    • Revenue
    • Add-on revenue
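
As a rough illustration, the first three counts can be pulled from a search log in a few lines. This is only a sketch: the log format (one user ID, query, and result count per search), the sample rows, and the function name `search_metrics` are assumptions made for the example, not the output of any particular search engine.

```python
from collections import Counter

# Hypothetical log format: one (user_id, query, result_count) tuple per search.
SAMPLE_LOG = [
    ("u1", "vacation policy", 12),
    ("u2", "vacation policy", 12),
    ("u1", "expense form", 0),
    ("u3", "401k", 3),
    ("u3", "expence form", 0),
]


def search_metrics(log):
    """Compute basic activity counts from (user_id, query, result_count) rows."""
    users = {user for user, _, _ in log}
    zero_hits = [query for _, query, count in log if count == 0]
    total = len(log)
    return {
        "unique_users": len(users),
        "total_searches": total,
        "zero_result_searches": len(zero_hits),
        # Guard against an empty log so the rate is always defined.
        "zero_result_rate": len(zero_hits) / total if total else 0.0,
        "top_zero_result_queries": Counter(zero_hits).most_common(10),
    }


if __name__ == "__main__":
    for name, value in search_metrics(SAMPLE_LOG).items():
        print(f"{name}: {value}")
```

Tracking the same numbers period over period is what turns them into a trend you can manage against.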

March 17, 2008

Search 2.0 - example of odd suggestion of related product based on social factors

I was looking at Wi-Fi cameras on Amazon tonight.

When looking at the D-Link Wireless Internet Camera, I happened to notice the "Customers Who Bought This Item Also Bought" section, suggesting 3 other items:
1: D-Link Securicam ... etc...
2: D-Link power adapter...
3: *** Levi's Big & Tall Cargo Pants ***  :-)

Apparently enough portly network IT folks want to look stylish that Amazon has found a statistical correlation.

I don't have a problem with this, though it's a bit off topic.

On one popular DVD site, the system that suggests related DVDs is said to be "blind" to subject matter. Correlations are drawn between DVDs strictly by their inventory ID. Usually, a comedy will be statistically related to other comedies, because people who like one movie tend to like the other. But a blind correlation system doesn't care that they are both comedies, just that those who rent DVD A also rent DVD B.

But, if people who rented a certain comedy also happened to rent a dog training DVD, the system would be just as happy to suggest it. That's the case with Amazon tonight, I believe.
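
To make the "blind" idea concrete, here is a minimal sketch of item-to-item co-occurrence counting. It is not how Amazon or any DVD site actually works; the rental data, the inventory IDs, and the function names are invented for illustration. Notice that the code never looks at a title or genre, only at which IDs appear together.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical rental histories; items are opaque inventory IDs.
# Assume dvd-101 and dvd-102 are comedies and dvd-900 is a dog training DVD.
RENTALS = {
    "c1": {"dvd-101", "dvd-102"},
    "c2": {"dvd-101", "dvd-102", "dvd-900"},
    "c3": {"dvd-101", "dvd-900"},
    "c4": {"dvd-102", "dvd-103"},
}


def co_occurrence(histories):
    """Count how many customers rented each pair of items together."""
    counts = defaultdict(Counter)
    for items in histories.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def also_rented(counts, item_id, limit=3):
    """Return the items most often rented alongside item_id."""
    pairs = counts.get(item_id, Counter()).items()
    # Sort by count (highest first), then by ID so ties are deterministic.
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))[:limit]


if __name__ == "__main__":
    print(also_rented(co_occurrence(RENTALS), "dvd-101"))
```

In this toy data the dog training DVD is suggested for the comedy just as strongly as the other comedy is, because two customers happened to rent both.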

There are 3 main reasons that two seemingly unrelated items might be thought by a computer to be "related" based on this blind statistical approach.

1: They are related in some logical way that humans care about. Perhaps a DVD in the comedy genre featured a dog as a central character, and good-hearted dog owners who had bought the dog training DVD also particularly enjoyed the comedy that featured a dog.

2: This is an anomaly:  they are not related in any real way, and this is just a statistical fluke.  This can happen especially in smaller data sets.

3: A small connection (either logical or random) has been artificially amplified by positive feedback. In my Amazon example, perhaps the wireless network camera and the pants were both added near the same time, or some other fluke caused them to be accidentally associated. But then the portly IT folks tended to see the pants, and enough of them saw that they looked nice, noticed the "Big & Tall" moniker, and decided to check them out, or perhaps mentioned the pants to their other portly IT brethren. So the fluke has now been amplified. Depending on the system, it might show up on other related networking equipment pages.

More on case #3: spurious signal amplification through early positive feedback

In the case of the camera and pants both appealing to portly IT folks, you could argue that this is a "good" thing, a happy accident, serendipity.

But this can work the other way. Suppose, in the example of the comedy starring a dog and the dog training video, that the title of the comedy is misleading. Suppose an original customer was looking for dog training videos with a simple keyword search, and suppose the comedy was called "How to Train a Dog". The user accidentally clicks on it, sees that it's a comedy and not a training video, and leaves. But the system still notices that click-through, and later shows the comedy title in the results list to another dog trainer, this time a bit higher, and they are also fooled and click through. Now TWO people have made that mistake, but a simplistic algorithm is starting to gain confidence in that association. The cycle continues until the comedy shows up at the top of the dog training videos and everybody clicks on it. This mistake, and let's assume it really is a mistake that no humans would want, is now "cemented" into the system.

In this case, perhaps the system is weighting this "Implicit Feedback" too heavily.  Folks clicked on the comedy title, but none of them specifically said "yes, this is related".
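
One way to picture the weighting question is a score that counts clicks and explicit "yes, this is related" confirmations separately. The sketch below is purely illustrative: the tallies, the item names, and the weights are made up, and a real system would have to tune them against its own data.

```python
# Hypothetical feedback tallies for results shown on "dog training" searches.
# "confirmed" means a user explicitly said the result was relevant.
FEEDBACK = {
    "how-to-train-a-dog-comedy": {"clicks": 40, "confirmed": 0},
    "puppy-obedience-basics": {"clicks": 15, "confirmed": 5},
    "leash-training-101": {"clicks": 10, "confirmed": 4},
}


def rank_by_feedback(feedback, click_weight, confirm_weight):
    """Order item IDs by a weighted sum of implicit and explicit feedback."""
    def score(item_id):
        stats = feedback[item_id]
        return click_weight * stats["clicks"] + confirm_weight * stats["confirmed"]

    # Highest score first; break ties by ID so the order is deterministic.
    return sorted(feedback, key=lambda item_id: (-score(item_id), item_id))


if __name__ == "__main__":
    # Clicks only: the misleading comedy title wins.
    print(rank_by_feedback(FEEDBACK, click_weight=1.0, confirm_weight=0.0))
    # Clicks discounted, confirmations counted: the real training videos win.
    print(rank_by_feedback(FEEDBACK, click_weight=0.1, confirm_weight=1.0))
```

With clicks alone the comedy ranks first. Once clicks are discounted and confirmations count, it drops to last, since nobody ever confirmed it.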

I think this happens with popularity rankings on Google sometimes. An early mediocre page is linked to and clicked on repeatedly, which generates more and more links to it. Or the page for an older version of software was popular and widely linked to. The newer version is substantially different, but people still keep getting the older documentation, and are thus annoyed and confused. In this case, a few humans went out of their way to create the links, so this is "Explicit Feedback". This happened to me with Nutch 0.7x vs. Nutch 0.9x, and I saw another recent mention, though I don't recall the details.

There are likely workarounds for all of these edge cases, tweaks that could be applied. But in general, this is a problem that needs to be watched over as folks retool Web 2.0 techniques for the enterprise and for search. And for every tweak, new edge cases may result. So, we're not against these things, but we just think they need periodic monitoring, and if a bad association is accidentally formed, the operator should have an easy way to override it and remove the association.
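
An operator override does not need to be elaborate. A minimal sketch, assuming suggestions arrive as a list of item IDs and that the operator keeps a set of blocked pairs (both are assumptions for this example, as are the item names):

```python
def remove_blocked(item_id, suggestions, blocked_pairs):
    """Drop suggestions that an operator has marked as bad for this item."""
    return [
        other for other in suggestions
        if frozenset((item_id, other)) not in blocked_pairs
    ]


# Hypothetical operator-maintained override list; pairs are unordered,
# so blocking camera/pants also blocks pants/camera.
BLOCKED = {frozenset(("wifi-camera", "cargo-pants"))}

if __name__ == "__main__":
    suggestions = ["securicam", "power-adapter", "cargo-pants"]
    print(remove_blocked("wifi-camera", suggestions, BLOCKED))
    # prints ['securicam', 'power-adapter']
```

Filtering at display time like this leaves the underlying statistics untouched, so an override can be lifted later without recomputing anything.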

April 25, 2007

The Most Important Taxonomy for your Web Site

Taxonomies continue to be popular in companies, but I have to wonder if they are really that useful for the majority of organizations. I can’t tell you how many times I’ve had otherwise intelligent people tell me “We plan to implement a corporate enterprise search solution – as soon as we finish our taxonomy project”. When I hear this, I know search won’t be happening for at least two years, and in the meantime every visitor to the web site suffers. Usually I spend a few seconds feeling sorry for their users and/or employees, but then I realize that innovative companies with real work to do are moving ahead full speed with Enterprise Search 2.0 platforms and I feel better.

Taxonomies generally fall into one of two categories: subject-based taxonomies and content-based taxonomies.

Subject or "Domain" taxonomies attempt to completely describe all of the terms in a field, as well as the relationships between the terms. Typically these relationships are hierarchical, and they are the kind of taxonomies we use to classify knowledge - the kind of taxonomies your biology teacher would talk about. You need a real subject matter expert to create a useful subject-based taxonomy. And whatever you do, don't hire two (or more) subject experts, because they will never agree on the taxonomy.

Content-based taxonomies are organized using existing content. Organization charts, computer directory/folder structures, and social tagging content are typically 'content-based' taxonomies. These taxonomies are often built by humans - you do it yourself when you decide what folders to use on your computer. But they can also be built automatically with tools that many search and content management vendors sell.

Whether you go with a subject or a content taxonomy for your company, hooking it into your enterprise search technology will be a trick. This is the dirty little secret of the search software business: There are few, if any, commercial engines that can really take advantage of a complex taxonomy. What do you do with it, after all? Do you tag every document with the full taxonomy of terms in the hierarchy for every term in the document? Do you think that somehow the search engine will automatically know what to do with the taxonomy, and look up and down the taxonomy tree to find related terms? Verity had a great concept when they invented Topics in the late 80s, but since then even they have lost some of the taxonomy emphasis.
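
For readers wondering what "looking up and down the taxonomy tree" might even mean in practice, here is one naive interpretation: expand a query term with its parent and its direct children before searching. The toy taxonomy and the function name are ours, and this is a sketch of the idea, not a description of how any commercial engine handles taxonomies.

```python
# Hypothetical toy taxonomy: each term maps to its parent (None for the root).
PARENT = {
    "animals": None,
    "mammals": "animals",
    "birds": "animals",
    "dogs": "mammals",
    "cats": "mammals",
}


def expand_term(term, parent=PARENT):
    """Return the term plus its direct parent and direct children, if any."""
    if term not in parent:
        # Unknown terms are passed through unchanged.
        return [term]
    related = [term]
    if parent[term] is not None:
        related.append(parent[term])
    related.extend(sorted(child for child, up in parent.items() if up == term))
    return related


if __name__ == "__main__":
    print(expand_term("mammals"))
    # prints ['mammals', 'animals', 'cats', 'dogs']
```

Even this tiny example raises the questions above: how far up or down should the expansion go, and should a broader term like "animals" count as much as the term the user actually typed?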

We think there is a third kind of taxonomy that is even more important than the traditional subject and content taxonomies: we call it a Behavior-Based Taxonomy.

Really, the reason most companies want a taxonomy is to help people find content. You can probably keep several experts and a bunch of computers working for years to anticipate every possible term and every possible hierarchy that someone on your internet or intranet site may use. But we think the most important taxonomy on any web site is the list of search terms that people actually use when they search a site.

If your search engine can provide great results for the ‘top 100’ queries on your site, you have a lot of happy users. Why do you think search experts at trade shows have finally started talking about your search logs? You can't know what your behavior-based taxonomy (BBT) is unless you are monitoring your search activity at least quarterly. Verify that the ‘top 100’ queries are working fine - either with organic search results or with featured links (or best bets or result promotion, depending on your search vendor).
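
A first pass at that check can be scripted. The sketch below assumes a log of (query, result count) pairs and a set of queries that already have featured links. It treats "returns at least one result or has a featured link" as a stand-in for "working fine", which is a simplifying assumption; judging real result quality still takes a human look.

```python
from collections import Counter


def top_queries_report(log, featured_links, top_n=100):
    """List the top_n queries with their counts and a rough 'covered' flag.

    log: iterable of (query, result_count) pairs (hypothetical format).
    featured_links: set of normalized queries that have a featured link.
    """
    normalized = [(query.strip().lower(), count) for query, count in log]
    frequency = Counter(query for query, _ in normalized)
    # Keep the most recent result count seen for each query.
    last_result_count = dict(normalized)
    report = []
    for query, searches in frequency.most_common(top_n):
        covered = last_result_count[query] > 0 or query in featured_links
        report.append((query, searches, covered))
    return report


if __name__ == "__main__":
    sample_log = [
        ("Vacation Policy", 12),
        ("vacation policy ", 12),
        ("vacation policy", 12),
        ("expense form", 0),
        ("expense form", 0),
        ("401k", 0),
    ]
    for row in top_queries_report(sample_log, featured_links={"401k"}):
        print(row)
```

Any row flagged as not covered is a candidate for a featured link or a content fix.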

Keep your Behavior-Based Taxonomy up to date, and your search users will be satisfied!
