August 31, 2012

Some interesting NLP Projects, Semantics, Disambiguation, etc.

Some interesting Natural Language Processing, Semantics, Sentiment Analysis, and other projects at a German university (though thankfully the page is in English).


August 21, 2012

Mind the gap

A few weeks ago, a former client asked me about the 'lay of the land' in enterprise search - which companies were the ones to be considered for evaluation. It's something I'm frequently asked, and one big reason why I strive to stay current with all of the leading commercial and open source vendors in the market.

As I pulled together the list, it occurred to me that recent consolidation has led to an odd situation: there is no longer a 'mid-market' in enterprise search.

Under $25,000 (US), there are a number of free and low-cost open source options (SearchBlox and my employer LucidWorks come to mind).

Google has discontinued its low cost (blue) search appliance, and raised the cost of its regular (yellow) one to apparently be well above $25K.

We also have the old-school major commercial vendors - like FAST (now Microsoft SharePoint Search), Autonomy (now HP), Endeca (now Oracle), and finally Vivisimo (now IBM). Trend or not, these enterprise search products command a high initial outlay, often significant implementation costs, and high ongoing 'support' costs once you've rolled them out. Looks like the mid-market is gone.

So now the question is: What do you get for the difference in price? I'd suggest not much in the way of capability; nothing in terms of scalability; and very very little in the way of flexibility.  I guess it's 'caveat emptor' - buyer beware!

What about some products/projects I haven't mentioned? Well, the focus of my article here is on enterprise search. Great candidates like Coveo are 'Windows only', which disqualifies them from my list. I suppose you could consider the GSA as not enterprise ready, but I think appliances make the OS issue irrelevant. I've also omitted mentioning other projects because they have not yet shipped a 'Version 1.0' release - that's testware, no matter who it's from. And I'm sure there are open source projects where a single person is making all the calls - I don't consider that enterprise ready either.

I’ll be looking for the day when the big guys start value pricing their software licenses and help bring the market into line with today’s reality.

If you think I've unfairly represented the market, let me know - I'm not shy about posting comments that differ with my viewpoint.




August 20, 2012

What "Totally Automatic!" Really Means: AI / NLP / Machine Learning Considerations in Search Technology

Many advanced search technologies use statistical machine learning, other numerical methods, or related techniques that come close to delivering on the claims AI software companies made in the past. While progress is being made, there are several points to consider when looking at these systems:

  • Can you override the system’s default behavior? Some vendors’ claims of “completely automatic” may actually mean it operates as a black box, with few diagnostic tools or adjustments. Such systems may not be suitable to put directly in front of customers, or at least not for driving the central content on a page.
  • Detect vs. Judge - Does the system simply detect trends and changes, or is it also making value judgments about those changes? Statistical methods have made much more progress on the former than on the latter. For high value customer experiences, it’s better for the computer to prioritize things for human operators to look at, and perhaps offer operators various convenient actions they can select from.
  • Supervised vs. Unsupervised – Although there’s a technical definition for this distinction, it really boils down to whether you can train the system and/or have predefined categories, etc., or whether the system is totally automatic. Although totally automatic sounds like less work, it’s less likely to give impressive results for primary customer facing activities.
  • Pretty numbers and graphs – There’s a tendency for some software companies to bring forth grids of floating point numbers, out to 6 or 8 decimal places, as proof that their software works well. Or they may claim things like “our software improves relevancy by 30%!” A good POC or A/B testing is a much better “proof” of software efficacy.
  • Machine generated graphics are a mixed bag. Graphs that simply reinforce arbitrary relevancy improvements aren’t really more useful than the numeric claims.
    However, graphics that help to visualize large amounts of data in innovative ways, spotting trends and differences, especially if they are interactive and can “drill down” into particular areas, can be very helpful. Of course they still require decent input, and reasonably trained people to interpret the graphs.
    Also, simple “cluster” graphs, while potentially useful, are no longer novel tools by themselves, and have rarely been exposed directly to end users, despite the suggestive demos of search vendors. To really leverage data visualization tools, companies need to staff and train at a higher level than for just “running a search appliance”.
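To make the A/B testing point above a little more concrete, here is a minimal sketch of how you might check whether a claimed "30% improvement" holds up. It assumes you are comparing click-through rates between your current relevancy configuration (A) and a candidate (B), and it uses a standard two-proportion z-test. The function name and all of the numbers are made up for illustration; they are not from any vendor or real deployment.

```python
import math


def two_proportion_z_test(clicks_a, searches_a, clicks_b, searches_b):
    """Compare click-through rates of variant B against variant A.

    Returns (relative_lift, z_score, two_sided_p_value).
    Hypothetical helper for illustration only.
    """
    rate_a = clicks_a / searches_a
    rate_b = clicks_b / searches_b

    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (clicks_a + clicks_b) / (searches_a + searches_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / searches_a + 1 / searches_b))

    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))

    relative_lift = (rate_b - rate_a) / rate_a
    return relative_lift, z, p_value


# The same "30% lift" on a tiny sample and on a large one (made-up counts).
small = two_proportion_z_test(clicks_a=10, searches_a=20, clicks_b=13, searches_b=20)
large = two_proportion_z_test(clicks_a=10000, searches_a=20000, clicks_b=13000, searches_b=20000)

for label, (lift, z, p) in (("small sample", small), ("large sample", large)):
    print(f"{label}: lift={lift:.0%}, z={z:.2f}, p={p:.4f}")
```

The headline lift is identical in both cases, but only the large sample gives you statistical grounds to believe it. That is the kind of question to put to a vendor who quotes a single impressive percentage.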

August 09, 2012

Changes at Lucid Imagination

This morning Lucid Imagination announced that it is changing its name and simplifying its product family.

Effective today, Lucid Imagination becomes LucidWorks. The company remains the place to go for the highest quality open source search platforms, with their product offerings falling under the moniker LucidWorks Product Suite. The product suite is structured into two areas - LucidWorks Search and LucidWorks Big Data.

LucidWorks Search is available on premises or in the cloud (Amazon and Azure), while LucidWorks Big Data is currently on premises or Amazon. LucidWorks strives to provide a more advanced feature set than the open source version of Solr. For example, the LucidWorks Search product release includes a user interface for creating and managing collections; click-through relevancy boosting; and better security implementation than that found in the open source release. Dealing with LucidWorks provides you features that you'd expect in an enterprise search platform well ahead of their availability in the open source release. LucidWorks continues to sell services and training for Solr for those of you who don't need the advanced capabilities in LucidWorks' products.

Finally, LucidWorks is launching a new support site, SearchHub, which will take the place of the existing devZone Community Portal. It will be up in the near future, but you can sign up to be notified as soon as logins are available.

LucidWorks is the way to go for fully supported Solr.



August 06, 2012

Exceptions to the 'the search platform you already have' post

Last week I wrote a post suggesting that the least expensive search platform you can purchase is the one you already have - by fixing it up. Since then, I've heard from people as near as Cupertino and as far away as Hungary pointing out a flaw that I will share with you now: What if the search platform you are using no longer has support from the vendor? Maybe FAST ESP, or Verity K2, or Ultraseek, or so many others. Or what if the platform you have now was the wrong one for your search application? What do you do then?

As the bartender says in Jimmy Stewart's movie Harvey, "It depends".

Just because your platform is no longer supported by your vendor doesn't automatically make it a misfit. We have customers who purchased FAST ESP under perpetual license and they are still humming along quite well, thank you. Heck, I even know of a few companies using Verity's K2 - again under perpetual license - and they are doing fine. Are these options ideal? No. For example, they cannot index the latest PDF format, which causes some problems. Are there workarounds? Of course - but they really should upgrade when they can.

If the platform just isn't right, then maybe it's time to start looking around. How can you tell? Well, one way is physical limits: your license is good for 5 million docs, and you now have 6 million. Upgrade the count, or switch. What if the nature of your content has changed or you changed content management systems? Maybe time to look around.

In any case, I stand by my assertion that, above a certain level of functionality, just about any search engine can be made to run pretty darn well - although I'm willing to add the caveat "in most situations". There - I've said it. Disagree? Send me an email or leave a comment!