4 posts from March 2008

March 19, 2008

Advanced Duplicate Detection (also related to spam detection and clustering)

We need to do a dedicated article about this area, but I wanted to share some material here that we have written about it, and that will likely re-appear in a future article.

In our recent newsletter article, we covered the problem of generic duplicate detection in search, and then duplicate detection in federated search.

In a SearchDev posting, Mark talked more about why checksums aren't always enough for duplicate detection (messages 485 and 490).
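As a rough illustration of the checksum problem (not necessarily the argument made in those messages), here is a minimal Python sketch. It assumes two hypothetical documents that differ by a single trailing word, and it swaps in a common alternative, word shingles compared by Jaccard overlap, to show what a checksum misses. The function names and sample text are made up for this example.

```python
import hashlib


def checksum(text: str) -> str:
    """Exact-match fingerprint: any byte-level change yields a different digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences, after lowercasing and splitting on whitespace."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Overlap between two shingle sets: 1.0 means identical, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical near-duplicates: the second copy only adds one trailing word.
doc_a = "The quick brown fox jumps over the lazy dog near the river bank"
doc_b = "The quick brown fox jumps over the lazy dog near the river bank today"

same_checksum = checksum(doc_a) == checksum(doc_b)
similarity = jaccard(shingles(doc_a), shingles(doc_b))

print("checksums match:", same_checksum)
print("shingle similarity:", round(similarity, 3))
```

The checksums disagree even though the two texts share almost all of their three-word shingles, so an exact-match test would treat them as unrelated documents.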

March 17, 2008

Search 2.0 - example of odd suggestion of related product based on social factors

I was looking at wireless (WiFi) cameras on Amazon tonight.

When looking at the D-Link Wireless Internet Camera, I happened to notice the "Customers Who Bought This Item Also Bought" section, suggesting 3 other items:
1: D-Link Securicam ... etc...
2: D-Link power adapter...
3: *** Levi's Big & Tall Cargo Pants ***  :-)

Apparently enough portly network IT folks want to look stylish that Amazon has found a statistical correlation.

I don't have a problem with this, though it's a bit off topic.

On one popular DVD site, the system that suggests related DVDs is said to be "blind" in terms of the subject matter.  Correlations are drawn between DVDs strictly by their inventory ID.  Usually, a comedy will be statistically related to other comedies, because people who liked one movie tended to like the other.  But a blind correlation system doesn't care that they are both comedies, only that those who rent DVD A also rent DVD B.

But if people who rented a certain comedy also happened to rent a dog training DVD, the system would be just as happy to suggest it.  That's the case with Amazon tonight, I believe.
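To make the "blind" idea concrete, here is a small Python sketch of co-occurrence counting by item ID. It is only a guess at the general technique, not a description of how Amazon or any DVD site actually works, and the baskets and item IDs are invented for the example.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(baskets):
    """Count how often each pair of item IDs appears in the same basket."""
    pair_counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always stored under the same (a, b) key.
        for a, b in combinations(sorted(set(basket)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def also_bought(item, pair_counts, top_n=3):
    """Items most often seen with `item`, ranked purely by co-occurrence count."""
    related = Counter()
    for (a, b), n in pair_counts.items():
        if a == item:
            related[b] += n
        elif b == item:
            related[a] += n
    return [other for other, _ in related.most_common(top_n)]


# Hypothetical purchase history; the system sees only opaque IDs, not categories.
baskets = [
    ["camera", "securicam", "power_adapter"],
    ["camera", "securicam"],
    ["camera", "power_adapter", "cargo_pants"],
    ["camera", "cargo_pants"],
    ["securicam", "power_adapter"],
]
pair_counts = co_occurrence(baskets)
suggestions = also_bought("camera", pair_counts)
print(suggestions)
```

Nothing in the code knows that pants are not networking gear. Two shared baskets are enough to put them on the camera's suggestion list.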

There are 3 main reasons that two seemingly unrelated items might be thought by a computer to be "related" based on this blind statistical approach.

1: They are related in some logical way that humans care about.  Perhaps a DVD in the Comedy genre featured a dog as a central character, and good-hearted dog owners who had bought the dog training DVD also particularly enjoyed the comedy that featured a dog.

2: This is an anomaly:  they are not related in any real way, and this is just a statistical fluke.  This can happen especially in smaller data sets.

3: A small connection (either logical or random) has been artificially amplified by positive feedback.  In my Amazon example, perhaps the wireless network camera and the pants were both added near the same time, or some other fluke caused them to be accidentally associated.  But then the portly IT folks tended to see the pants, and enough of them saw that they looked nice, noticed the "Big & Tall" moniker, and decided to check them out, or perhaps mentioned the pants to their other portly IT brethren.  So the fluke has now been amplified.  Depending on the system, it might show up on other related networking equipment pages.
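For case 2, one common guard (an assumption on my part, not something any of these sites is known to use) is to ignore pairs that have been seen together only a handful of times. The sketch below uses made-up counts and a hypothetical `min_support` threshold.

```python
def confident_pairs(pair_counts, item_counts, min_support=5):
    """Keep pairs seen together at least `min_support` times; report P(b | a)."""
    result = {}
    for (a, b), together in pair_counts.items():
        if together >= min_support:
            result[(a, b)] = together / item_counts[a]
    return result


# Hypothetical numbers: one stray co-purchase versus a well-supported pairing.
pair_counts = {
    ("camera", "cargo_pants"): 1,
    ("camera", "securicam"): 40,
}
item_counts = {"camera": 50}

kept = confident_pairs(pair_counts, item_counts)
print(kept)
```

The threshold trades away some serendipity: a genuinely interesting but rare pairing is filtered out along with the flukes.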

More on case # 3, spurious signal amplification through early positive feedback

In the case of the camera and pants both appealing to portly IT folks, you could argue that this is a "good" thing, a happy accident, serendipity.

But this can work the other way.  Suppose in the example of the comedy starring a dog, and the dog training video, that the title of the comedy is misleading.  Suppose an original customer was looking for dog training videos with a simple keyword search.  And suppose the comedy was called "How to Train a Dog".  The user accidentally clicks on it, sees that it's a comedy and not a training video, and leaves.  But the system still notices that click through, and later tries again showing the comedy video title in the results list to another dog trainer, this time a bit higher, and they are also fooled and click through.  Now TWO people have made that mistake, but a simplistic algorithm is starting to get confidence in that association and this cycle continues, until the comedy shows up at the top of the dog training videos and everybody clicks on it.  This mistake, and let's assume it really is a mistake that no humans would want, is now "cemented" into the system.
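The cycle can be sketched as a toy simulation in Python. Everything here is hypothetical: the titles, the starting scores, and the assumption that every searcher clicks the misleading title. The only point is to show how the size of the per-click boost decides whether the mistake gets cemented.

```python
def simulate_feedback(rounds, click_boost=0.75):
    """Rank by score; every click adds `click_boost` to the clicked title's score."""
    # Hypothetical starting scores: the comedy begins below the real training videos.
    scores = {
        "Dog Training Basics": 3.0,
        "Puppy Obedience": 2.5,
        "How to Train a Dog (comedy)": 2.0,
    }
    comedy = "How to Train a Dog (comedy)"
    history = []
    for _ in range(rounds):
        ranking = sorted(scores, key=scores.get, reverse=True)
        history.append(ranking.index(comedy) + 1)  # 1-based position this round
        # Assumption: each searcher is fooled by the title and clicks it,
        # and the system counts that click as a vote for relevance.
        scores[comedy] += click_boost
    return history


heavy = simulate_feedback(rounds=4)                   # clicks weighted heavily
light = simulate_feedback(rounds=4, click_boost=0.1)  # clicks weighted lightly
print("heavy weighting:", heavy)
print("light weighting:", light)
```

With the heavier boost the comedy climbs from third place to first within three rounds. With the lighter one it stays at the bottom over the same period, although it would still creep upward eventually if nothing else corrected it.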

In this case, perhaps the system is weighting this "Implicit Feedback" too heavily.  Folks clicked on the comedy title, but none of them specifically said "yes, this is related".

I think this happens with popularity rankings on Google sometimes.  An early mediocre page is linked to and clicked on repeatedly, which generates more and more links to it.  Or the page for an older version of software was popular and widely linked to.  The newer version is substantially different, but people still keep getting the older documentation, and are thus annoyed and confused.  In this case, a few humans went out of their way to create the links, so this is "Explicit Feedback".  This happened to me with Nutch 0.7x vs. Nutch 0.9x, and I saw another recent mention, though I don't recall the details.

There are likely workarounds for all of these edge cases; tweaks that could be applied.  But in general, this is a problem that needs to be watched over as folks retool Web 2.0 techniques for the enterprise and for search.  And for every tweak, new edge cases may result.  So, we're not against these things, but we just think they need periodic monitoring, and if a bad association is accidentally formed, the operator should have an easy way to override it and remove the association.
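An operator override can be as simple as a block list applied after the statistics have run. This is a minimal Python sketch with hypothetical names (`apply_overrides`, `blocked_pairs`), not a feature of any particular product.

```python
def apply_overrides(suggestions, item, blocked_pairs):
    """Drop any suggestion an operator has explicitly blocked for this item."""
    return [s for s in suggestions if frozenset((item, s)) not in blocked_pairs]


# Hypothetical block list; frozenset makes the pair order-independent.
blocked_pairs = {frozenset(("camera", "cargo_pants"))}

raw = ["securicam", "power_adapter", "cargo_pants"]
cleaned = apply_overrides(raw, "camera", blocked_pairs)
print(cleaned)
```

Filtering at display time leaves the underlying counts untouched, so the operator can lift the block later without having to rebuild anything.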

March 03, 2008

New Idea Engineering Presentation at Fast Forward 2008

If you missed New Idea Engineering at Fast Forward 2008 in Orlando, Mark Bennett and Miles Kehoe detailed information security issues for enterprise search and ways to address them in "Protecting Confidential Information - Addressing Information Security Issues for Enterprise Search".  You can register and download the white paper "Mapping Security Requirements to Enterprise Search" at http://www.ideaeng.com/pub/wp/.


Quick Summary

The proliferation and increasing power of enterprise search requires companies to pay more attention to protecting confidential information such as customer and personnel information, intellectual property, and undisclosed strategies. Appropriate access should be addressed at the document, sub-document, and sub-field levels.  The white paper highlights actual "gotchas" that have been seen at consumer sites and that you can learn from.

Here’s the full presentation:  FASTForward08 Slides

If you did not attend FASTForward08 and would like a copy of the slides, email me and I'll send them to you.

Deep Web proposes federation resource site

Sol Lederman of Deep Web Technologies wants to create a one-stop demo center for federation technology and has invited all of the major vendors to participate.

Federated search is becoming increasingly popular as more corporate customers look for ways to deliver results from multiple enterprise search installations, often from many different vendors. Sometimes the issue is technical, sometimes political, but nearly all companies have three or more search vendor technologies running somewhere behind the firewall.

The one thing we'd like to have seen in Sol's challenge is security, since that's what we think separates the winners from the also-rans in federation. It's not always easy, but it is 'real world' in companies. Nonetheless, a demo site where users can compare vendor solutions 'apples to apples' on the same data sources would be nice.

By the way, we've seen some confusion among our customers and prospects on the subject, so we've taken a shot at defining 'federated search' in our Enterprise Search newsletter. We hope that helps some.