
3 posts from October 2007

October 29, 2007

Google Appliance Growing Up?


The newest version of the Google Search Appliance (GSA) is available, and it's starting to look like a pretty decent solution for more and more corporations.

Google released Version 5, which provides what they call "Universal Search", in October. The newest release for the entire GSA line (except the Mini) includes a number of excellent enterprise features: enhanced security; parametric search; Wiki KeyMatch, a social tagging feature for best bets; and OneBox, a search federation tool.

GSA security now includes Windows Integrated Authentication (WIA) and a security API for customizing special security needs. It handles security both at crawl time and at search time. It fully respects data store security from all sources, so users only see the documents, best bets, parametric results, and features which they have permission to view.
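
To make the search-time half of that concrete, here is a minimal sketch of result trimming in Python. It is not Google's security API: the `Result` class, the `allowed_groups` field, and the `trim_results` function are names we made up for illustration, and a real deployment would ask the source repository for permissions rather than carry them on the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    url: str
    # Hypothetical field: the groups allowed to read this document.
    allowed_groups: frozenset


def trim_results(results, user_groups):
    """Keep only the results that share at least one group with the user."""
    user_groups = frozenset(user_groups)
    return [r for r in results if r.allowed_groups & user_groups]


results = [
    Result("http://intranet/hr/salaries.xls", frozenset({"hr"})),
    Result("http://intranet/handbook.html", frozenset({"hr", "staff"})),
]

# A user who is only in "staff" never sees the HR-only spreadsheet.
visible = trim_results(results, {"staff"})
print([r.url for r in visible])
```

The same filter would have to apply to best bets and parametric counts as well, or the counts alone could leak the existence of restricted documents.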

The parametric code in Universal Search is based on open source code available from Google (http://code.google.com/p/parametric/). In demos, it looks like most of the parametric search implementations we've seen, so we'll have more to say once we have a chance to drill down.

The odd feature in this release is Wiki KeyMatch. Essentially it lets any employee tag a search result list by adding "best bet" suggestions to the top of the result list for a given query. It looks like anyone can suggest a related or better result for any query. Apparently this has worked well inside Google for a while, and Google folks say it's great. Administrators are notified when new tags are added or updated, and the best bet does show who created the tag. As Jimmy Wales says about his Wikipedia product, anyone posting understands that if the best bet is not useful or appropriate it's going to be removed; so in a sense, any author who wants his or her best bet to survive had better make it good. I have to admit the corporate managers I've talked to are a bit skeptical; but where it works, it can potentially start using the 'wisdom of the crowd' to get better results.
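
For readers who think better in code, the workflow described above might be sketched like this in Python. Everything here is our own illustration, not Google's implementation: `KeyMatchStore`, `BestBet`, and the `notify_admin` callback are hypothetical names, and we assume queries are matched after simple lowercasing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class BestBet:
    url: str
    author: str  # displayed with the suggestion, so authorship is visible


class KeyMatchStore:
    def __init__(self, notify_admin):
        self._bets = defaultdict(list)      # normalized query -> list of BestBet
        self._notify_admin = notify_admin   # called on every new suggestion

    @staticmethod
    def _normalize(query):
        return query.strip().lower()

    def suggest(self, query, url, author):
        """Any employee may attach a best bet to a query."""
        bet = BestBet(url, author)
        self._bets[self._normalize(query)].append(bet)
        self._notify_admin(query, bet)
        return bet

    def remove(self, query, url):
        """Administrators can drop a suggestion that is not useful."""
        key = self._normalize(query)
        self._bets[key] = [b for b in self._bets[key] if b.url != url]

    def lookup(self, query):
        """Best bets to show above the regular results for this query."""
        return list(self._bets.get(self._normalize(query), []))
```

The notification callback and the visible author name are what keep the crowd honest: a bad suggestion is both attributable and easy to remove.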

OneBox is a search federation application that provides a way to combine results from a number of different corporate data sources, as well as from Google Apps. As one of the Google folks said recently, "One Box is a way of pulling in live data (such as employee info, salesforce.com data, business objects data) right into your search results."
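
The general idea of federation can be sketched in a few lines of Python. This is only an illustration of the pattern, not the OneBox API: the `federated_search` function and the two stand-in providers are hypothetical, and real providers would call out to live systems.

```python
def federated_search(query, providers):
    """Ask each provider for live results and group them by source name."""
    combined = {}
    for name, provider in providers.items():
        try:
            combined[name] = provider(query)
        except Exception:
            # One failing back end should not break the whole results page.
            combined[name] = []
    return combined


def employee_directory(query):
    # Stand-in for a live employee lookup (hypothetical data).
    people = {"smith": ["Pat Smith, ext. 1234"]}
    return people.get(query.lower(), [])


def crm_accounts(query):
    # Stand-in for a back end that happens to be down.
    raise TimeoutError("CRM back end unavailable")


results = federated_search(
    "Smith",
    {"directory": employee_directory, "crm": crm_accounts},
)
print(results)
```

Grouping by source rather than merging into one ranked list sidesteps the hard problem of comparing scores from unrelated systems.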

Google has connectors for SharePoint, Documentum, Livelink, and FileNet, as well as for Google Apps. They provide an API so you can write your own, and we're sure third party developers are busy working on them now. The Google-provided connectors are free; but third party connectors may be priced depending on how the developer wants to market them.

Finally, Google also seems to have improved their existing "data biasing" to allow administrators to 'query tweak' using URL patterns and document recency.
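
We have not seen how Google implements this internally, so treat the following Python as a rough sketch of what URL-pattern and recency biasing could look like. The `URL_BIAS` table, the multipliers, and the one-year half-life are all assumptions of ours.

```python
from datetime import date
from fnmatch import fnmatchcase

# Hypothetical bias table: URL glob pattern -> score multiplier.
URL_BIAS = {
    "http://intranet/policies/*": 1.5,   # promote current policy pages
    "http://intranet/archive/*": 0.5,    # demote archived material
}


def recency_factor(last_modified, today, half_life_days=365):
    """Halve the factor for every half_life_days of document age."""
    age_days = max((today - last_modified).days, 0)
    return 0.5 ** (age_days / half_life_days)


def biased_score(base_score, url, last_modified, today):
    """Scale the engine's base score by a URL bias and a recency bias."""
    url_factor = 1.0
    for pattern, multiplier in URL_BIAS.items():
        if fnmatchcase(url, pattern):
            url_factor = multiplier
            break  # first matching pattern wins
    return base_score * url_factor * recency_factor(last_modified, today)


today = date(2007, 10, 29)
print(biased_score(1.0, "http://intranet/policies/travel.html", today, today))
print(biased_score(1.0, "http://intranet/archive/old.html", date(2006, 10, 29), today))
```

Multiplying the base score, rather than replacing it, keeps the engine's own relevance in charge and lets the administrator nudge it.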

The only bad news for small users and corporate departments is that the new upgrade and features are not (yet) available for the popular Google Mini.

If you looked at the Google offerings a while ago and they didn’t meet your needs, you may want to take a second look. It looks like they’ve started to come of age in the enterprise search market.



October 28, 2007

Explicit Tagging on the net; Implicit tagging in companies

Collaboration is a hot topic among folks who prognosticate about the future of enterprise search. It's made such a positive difference for internet search at sites like Flickr and among the Facebook-type networking sites of the world, it only makes sense that it should be able to help enterprise search as well. In his recent book "Everything is Miscellaneous", David Weinberger talks about the value of meta-data in organizing content now that everything is in electrons rather than in the physical world of atoms.

He talks about Flickr as an example of meta-tagging - explicit tagging - improving retrieval. Flickr has 8 million registered users, but on a recent visit to the site, it reported that 5000 pictures had been uploaded in the minute just prior to my visit. That must take an incredible amount of tagging just to break even!

By the way, one benefit of massive meta-tagging, as Weinberger points out, is that companies like Flickr, using some smarts behind the scenes, can effectively "learn" to associate "Golden Gate Bridge" and "San Francisco". This is no doubt similar to the technology that leading enterprise search vendors are beginning to incorporate around "fact extraction".

A Problem of Scale

Even though we are fans of explicit tagging as implemented in Flickr, the problem we see with tagging in the enterprise is the significantly smaller base of potential taggers inside the firewall.

Let's consider Cisco, a high tech company with a large store of online content and a highly motivated, highly technical employee base. They have about 60,000 employees; let’s assume that 5% of those employees would actively tag documents given the ability. That means there may be up to 3,000 active collaborators over time. But with an intranet of millions of pages, it's going to be a while before any significant number of pages has useful tags.

The good news is that we do see the solution in the intranet as a more implicit form of 'tagging': document views for a given query. In the same way that someone tagging a picture on Flickr is adding an explicit "vote" associating a picture with a term, a corporate user is entering an implicit "vote" for a document when he or she opens a document after a search. That is, the user "tags" the document in question with the search term(s) used to find it. When we can find a way to automagically tweak the relevance of a document for a given term without having to do any special handling, then collaborative technology will have found a niche in enterprise search.

Of course, there are always fringe cases: what if a user opens a document and finds the document is totally wrong? Won't that rank a document higher? The answer is yes - but the fact that the boost is a tiny one means the document will only marginally have a better score. And trust us, if you provide a document feedback capability for your users, you'll hear about the bad documents and you can offset the "mistaken" tags with the "thumbs down" votes. We think over time, this implicit tagging will work far better in the corporate environment, even if human-provided explicit tags will continue to be better indicators.
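
Here is a minimal Python sketch of the implicit tagging idea, assuming a tiny per-click boost and a larger "thumbs down" penalty. The class name, the constants, and the simple whitespace tokenizing are our own assumptions; we don't know of a vendor that does it exactly this way.

```python
from collections import defaultdict

CLICK_BOOST = 0.01          # assumed: each click is only a tiny vote
THUMBS_DOWN_PENALTY = 0.05  # assumed: explicit feedback outweighs a click


class ImplicitTags:
    def __init__(self):
        # (search term, document id) -> accumulated boost
        self._votes = defaultdict(float)

    @staticmethod
    def _terms(query):
        return query.lower().split()

    def record_click(self, query, doc_id):
        """Opening a document after a search 'tags' it with the query terms."""
        for term in self._terms(query):
            self._votes[(term, doc_id)] += CLICK_BOOST

    def record_thumbs_down(self, query, doc_id):
        """Explicit negative feedback offsets mistaken clicks."""
        for term in self._terms(query):
            self._votes[(term, doc_id)] -= THUMBS_DOWN_PENALTY

    def adjusted_score(self, base_score, query, doc_id):
        """Scale the engine's score by the votes for these terms."""
        boost = sum(self._votes.get((t, doc_id), 0.0) for t in self._terms(query))
        return base_score * max(0.0, 1.0 + boost)


tags = ImplicitTags()
tags.record_click("vacation policy", "doc1")
print(tags.adjusted_score(1.0, "vacation", "doc1"))
```

Because each click moves the score by only one percent in this sketch, a single mistaken open barely matters, while a document that many users open for the same query climbs steadily.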

Now to see if any vendors are using that technology yet. Do you know of any?

October 19, 2007

Is Gartner missing a trend?

The new Gartner 'Magic Quadrant' report for Information Access Technology, released last month, shows few surprises in the actual vendor chart. But the report goes on to explain that, of the open-source search engines, "none of them are significant enough to threaten the commercial market". They go on to specifically mention Lucene, saying "enterprises don't consider it a significant alternative". We beg to differ.

Gartner does talk about IBM's strong support of Lucene; and they do say that, if IBM invests substantially in the technology, Lucene may reach its potential. However, we see a number of companies already placing their bets on Lucene - although here I am considering the Apache 'Lucene-Solr-Nutch' franchise as a single, related set of tools. The list of Lucene users we know includes start-up vertical search companies that don't have much money; but we also see some well-funded and growing public companies which are choosing to build their skills in-house for total control over their own search destiny. Netflix, Monster.com, and Pearson Scott Foresman are just a few of the companies that use Lucene-Solr-Nutch and are incredibly happy with their choice. And more are looking every day.

The open source path may not be right for every company. Lucene is a toolkit, and we tell our customers that "some assembly is required". It is still weak on filters for document formats, offers weak stemmer support, and has no integrated support for document security. Lucene and Solr don't include a spider/crawler, although Nutch is always available for that. And while there are wrappers for other popular languages, you will probably want some developers who know Java pretty well. But once you have the right skills in-house, it provides pretty good search in a lightweight, portable application.

We agree with Gartner when they say support from a major vendor like IBM would be a major benefit to the Lucene franchise; but we don't think it's necessary. Think about this: Lucene included a parametric search capability months before the Google Search Appliance did. And the Lucene franchise features search term highlighting; completely tunable relevance and a transparent relevance algorithm; and the capability of fine tuning just about everything to work exactly as you want it. It may be a toolkit, but it sure is a pretty good one for many environments.

It's not like Gartner to miss the wave completely; maybe they are just not listening to the same people we've been talking with in the corporate world.