
March 17, 2008

Search 2.0 - example of odd suggestion of related product based on social factors

I was looking at Wi-Fi cameras on Amazon tonight.

When looking at the D-Link Wireless Internet Camera, I happened to notice the "Customers Who Bought This Item Also Bought" section, suggesting 3 other items:
1: D-Link Securicam ... etc...
2: D-Link power adapter...
3: *** Levi's Big & Tall Cargo Pants ***  :-)

Apparently enough portly network IT folks want to look stylish that Amazon has found a statistical correlation.

I don't have a problem with this, though it's a bit off topic.

On one popular DVD site, the system that suggests related DVDs is said to be "blind" in terms of the subject matter.  Correlations are drawn between DVDs strictly by their inventory ID.  Usually, a comedy will be statistically related to other comedies, because people who like one movie tend to also like the other.  But a blind correlation system doesn't care that they are both comedies, just that those who rent DVD A also rent DVD B.

But if people who rented a certain comedy also happened to rent a dog training DVD, the system would be just as happy to suggest it.  I believe that's what happened with Amazon tonight.
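To make the "blind" idea concrete, here is a minimal sketch of co-occurrence counting by item ID.  It is not Amazon's or any DVD site's actual algorithm; the function names and the order history are made up for illustration.  The point is only that the code never looks at what the items are.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_cooccurrence(baskets):
    """Count how often each pair of item IDs appears in the same basket."""
    counts = defaultdict(Counter)
    for basket in baskets:
        # De-duplicate so an item bought twice in one order counts once.
        for a, b in combinations(sorted(set(basket)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def also_bought(counts, item_id, top_n=3):
    """Return the items most often seen alongside item_id, most frequent first."""
    return [other for other, _ in counts[item_id].most_common(top_n)]

# Hypothetical order history; the IDs are opaque labels to the algorithm.
baskets = [
    ["camera", "securicam", "power_adapter"],
    ["camera", "securicam"],
    ["camera", "power_adapter", "cargo_pants"],
    ["camera", "cargo_pants"],
    ["camera", "securicam", "cargo_pants"],
]
counts = build_cooccurrence(baskets)
suggestions = also_bought(counts, "camera")
```

In this toy data the pants co-occur with the camera as often as the second camera does, so they land in the suggestions purely on the numbers.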

There are 3 main reasons that two seemingly unrelated items might be thought by a computer to be "related" based on this blind statistical approach.

1: They are related in some logical way that humans care about.  Perhaps a DVD in the Comedy genre featured a dog as a central character, and good-hearted dog owners who had bought the dog training DVD also particularly enjoyed the comedy that featured a dog.

2: This is an anomaly:  they are not related in any real way, and this is just a statistical fluke.  This can happen especially in smaller data sets.

3: A small connection (either logical or random) has been artificially amplified by positive feedback.  In my Amazon example, perhaps the wireless network camera and the pants were both added near the same time, or some other fluke caused them to be accidentally associated.  But then the portly IT folks tended to see the pants, and enough of them saw that they looked nice, noticed the "Big & Tall" moniker, and decided to check them out, or perhaps mentioned the pants to their other portly IT brethren.  So the fluke has now been amplified.  Depending on the system, it might show up on other related networking equipment pages.
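For the second reason, the statistical fluke, one common guard is to require a minimum amount of evidence and to check that the pair co-occurs more often than chance would predict.  The sketch below uses a support threshold plus "lift" (observed co-occurrence divided by the co-occurrence expected if the two items were independent).  The thresholds are arbitrary assumptions, and this filter does nothing about the third reason: an amplified association can have plenty of support.

```python
def lift(pair_count, count_a, count_b, total_baskets):
    """Observed co-occurrence divided by what independence would predict.

    Assumes count_a and count_b are both greater than zero.
    """
    expected = count_a * count_b / total_baskets
    return pair_count / expected

def is_credible(pair_count, count_a, count_b, total_baskets,
                min_support=5, min_lift=1.5):
    """Accept a pair only if it has enough evidence and beats chance."""
    if pair_count < min_support:
        return False  # too few observations to trust, however high the lift
    return lift(pair_count, count_a, count_b, total_baskets) >= min_lift

# Hypothetical counts out of 1000 orders.
tiny_sample = is_credible(pair_count=2, count_a=3, count_b=2, total_baskets=1000)
real_signal = is_credible(pair_count=40, count_a=200, count_b=100, total_baskets=1000)
just_popular = is_credible(pair_count=50, count_a=500, count_b=100, total_baskets=1000)
```

The first pair is rejected for lack of evidence, the second passes, and the third is rejected because two popular items co-occur exactly as often as chance predicts.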

More on case #3: spurious signal amplification through early positive feedback

In the case of the camera and pants both appealing to portly IT folks, you could argue that this is a "good" thing, a happy accident, serendipity.

But this can work the other way.  Suppose in the example of the comedy starring a dog, and the dog training video, that the title of the comedy is misleading.  Suppose an original customer was looking for dog training videos with a simple keyword search.  And suppose the comedy was called "How to Train a Dog".  The user accidentally clicks on it, sees that it's a comedy and not a training video, and leaves.  But the system still notices that click-through, and later tries again, showing the comedy title in the results list to another dog trainer, this time a bit higher, and they are also fooled and click through.  Now TWO people have made that mistake, but a simplistic algorithm is starting to gain confidence in that association.  The cycle continues until the comedy shows up at the top of the dog training videos and everybody clicks on it.  This mistake, and let's assume it really is a mistake that no humans would want, is now "cemented" into the system.

In this case, perhaps the system is weighting this "Implicit Feedback" too heavily.  Folks clicked on the comedy title, but none of them specifically said "yes, this is related".
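The cycle above, and the effect of weighting implicit feedback less heavily, can be shown with a toy simulation.  Everything here is an assumption made for illustration: the starting scores, the rule that every searcher is fooled by the title, and the `bounce_weight` knob that controls how much a click counts when the visitor leaves right away.

```python
def simulate_misleading_clicks(rounds, bounce_weight=1.0):
    """Return the comedy's rank (1 = top) as seen by each successive searcher.

    bounce_weight is the score added for a click followed by an immediate exit:
    1.0 treats every click as an endorsement, 0.0 ignores bounced clicks.
    """
    comedy = "How to Train a Dog (comedy)"
    # Hypothetical starting relevance scores for the query "dog training".
    scores = {"training_a": 3.0, "training_b": 2.0, "training_c": 1.0, comedy: 0.5}
    ranks_seen = []
    for _ in range(rounds):
        ranking = sorted(scores, key=scores.get, reverse=True)
        ranks_seen.append(ranking.index(comedy) + 1)
        # Each searcher is fooled by the title, clicks, then leaves at once.
        scores[comedy] += bounce_weight
    return ranks_seen

naive = simulate_misleading_clicks(5)                          # [4, 3, 2, 1, 1]
discounted = simulate_misleading_clicks(5, bounce_weight=0.0)  # [4, 4, 4, 4, 4]
```

With every click counted in full, the comedy climbs from fourth to first within four searches.  When bounced clicks count for nothing, it stays where it started.  A real system would need some signal, such as time on page, to decide what counts as a bounce.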

I think this happens with popularity rankings on Google sometimes.  An early mediocre page is linked to and clicked on repeatedly, which generates more and more links to it.  Or the page for an older version of software was popular and widely linked to.  The newer version is substantially different, but people still keep getting the older documentation, and are thus annoyed and confused.  In this case, a few humans went out of their way to create the links, so this is "Explicit Feedback".  This happened to me with Nutch 0.7x vs. Nutch 0.9x, and I saw another recent mention, though I don't recall the details.

There are likely workarounds for all of these edge cases; tweaks that could be applied.  But in general, this is a problem that needs to be watched over as folks retool Web 2.0 techniques for the enterprise and for search.  And for every tweak, new edge cases may result.  So, we're not against these things, but we just think they need periodic monitoring, and if a bad association is accidentally formed, the operator should have an easy way to override it and remove the association.
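As a rough idea of what an operator override might look like, here is a small sketch.  The class and method names are hypothetical and not taken from any particular product.  The block list is applied at read time, on top of whatever the statistics produced, so an override survives the next time the associations are recomputed.

```python
class RelatedItems:
    """Wraps learned associations with an operator-maintained block list."""

    def __init__(self, learned):
        # learned: dict mapping item ID -> list of related item IDs, best first.
        self.learned = learned
        self.blocked = set()

    def block(self, item_id, related_id):
        """Operator override: suppress this pair in both directions."""
        self.blocked.add(frozenset((item_id, related_id)))

    def unblock(self, item_id, related_id):
        """Undo an earlier override."""
        self.blocked.discard(frozenset((item_id, related_id)))

    def related(self, item_id):
        """Learned suggestions for item_id, minus any blocked pairs."""
        return [r for r in self.learned.get(item_id, [])
                if frozenset((item_id, r)) not in self.blocked]

# Hypothetical learned associations.
store = RelatedItems({
    "camera": ["securicam", "power_adapter", "cargo_pants"],
    "cargo_pants": ["camera"],
})
store.block("camera", "cargo_pants")
after_block = store.related("camera")
```

After the `block` call the pants no longer appear on the camera page, and the camera no longer appears on the pants page, while the underlying counts are left untouched for later review.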
