13 posts categorized "Nutch"

January 12, 2009

Virtualization and Search: Performance Tests Summary

We've been researching Virtualization and Search, and have recently presented on the topic at a couple of shows.  We wanted to measure the performance penalty of running a spider on a virtual machine instead of a physical machine.  This is a summary of our findings.

What we expected to find in terms of a performance penalty:

  • Estimated 3 to 20% penalty
  • Leaning towards 20% given the heavy disk IO

Actual results:

  • We found an approximate 10% average penalty.
  • The spread was fairly wide: individual tests measured between 0% and 17%, but always under the 20% upper bound we had estimated.  (A few runs actually came in below zero percent, but we labeled those as outliers.)
  • Overall the performance was better than expected, and certainly reasonable for many applications.

The test and environment:

  • HP mid tier workstation, dual core, AMD Athlon 64 X2 4400+
  • 8 Gigs memory, local SATA disk (non RAID)
  • Microsoft Windows Server 2008 64-bit for both host and guest, using MS Hyper-V.
  • Sun's 64-bit JVM v6 set to 1 Gig max (which it did not fully respect)
  • Nutch 0.9 stock distribution
  • Dataset was the Enron public emails, approximately half a million emails in 1.5 Gigs of source data, served by IIS on a separate local machine
  • Email files were mapped to the text filter
  • Data was fetched and indexed into Lucene by Nutch
  • Wall-clock time ranged from 31 to 35 minutes for all tests, with the physical (non-virtual) tests showing the widest measured deviation
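
For readers who want to run a similar comparison, here is a minimal sketch of one common way to turn paired run times into a penalty percentage. The summary above does not spell out how the penalty was calculated, so the formula (relative slowdown in wall-clock time) is an assumption, and the timings below are made-up illustrative values, not the measured data.

```python
from statistics import mean


def penalty_pct(physical_minutes, virtual_minutes):
    """Relative slowdown of the virtual run versus the physical run, in percent.

    Assumed definition: (virtual - physical) / physical * 100.
    """
    return (virtual_minutes - physical_minutes) / physical_minutes * 100.0


# Hypothetical paired timings in minutes (physical, virtual).
# These are illustrative values only, NOT the results from the tests above.
runs = [(31.0, 34.1), (32.0, 35.0), (33.0, 32.8)]

penalties = [penalty_pct(p, v) for p, v in runs]
average_penalty = mean(penalties)

for (p, v), pct in zip(runs, penalties):
    print(f"physical={p:.1f} min  virtual={v:.1f} min  penalty={pct:+.1f}%")
print(f"average penalty: {average_penalty:.1f}%")
```

With this definition, a run where the virtual machine finishes faster than the physical one produces a negative penalty, which is the kind of result we labeled as an outlier.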

Drop us an email if you'd like the full PDF when published.

January 10, 2008

Updated 2008 Enterprise Search Vendor Roundup

Jan. 10, 2008 - San Jose, CA, USA 

Microsoft announced on January 8 that it was acquiring FAST Search, forcing New Idea Engineering to amend our January 4th article "2008 Enterprise Search Vendors: The New Fab 4 ... and 1/2" (http://www.ideaeng.com/pub/entsrch/2008/number_01/article01.html). The announcement validates our original assessment and reinforces that search is mission critical for corporations, driving Microsoft to invest in better search technology.

Some Highlights from NIE's 2008 Enterprise Search Vendor Roundup
 
Autonomy IDOL and FAST Search continue to hold the high end. K2 and Ultraseek are finally retiring.
Google's new version 5 appliance has arrived in the enterprise search mainstream.
Endeca is moving from the ecommerce side and had one of the most impressive search demos at ESS West 2007.
Lucene/Nutch/Solr (LNS) open source search engines continue to gain customer mindshare.
Microsoft, with its acquisition, moves in as Tier 1.
IBM and Oracle still not there.
 
Autonomy IDOL and FAST Search continue to hold the high end, evolving into "search platforms" that go beyond traditional drop in applications. The two leaders from earlier this decade, K2 and Ultraseek, are fading.

Google's new version 5 appliance has arrived in the enterprise search mainstream. While the new version won't satisfy every requirement, it addresses many of the earlier integration issues that had held it back. Expect to see the Google logo on a lot more enterprise portals.

Endeca has created some slick administration tools, doing very well in a head-to-head comparison with Autonomy and FAST despite those vendors' continued progress in this area.  As the importance of administration continues to increase, we are more enthusiastic about Endeca in the Enterprise space.

Open source tools based on Lucene, including Nutch and Solr (LNS), are increasingly considered by companies, especially in niches that need to micromanage document relevancy and rating. Lucene and its derivatives are increasingly embedded in other software packages and services, to the point that many users won't even realize they're using it.

We had expected IBM to be the next entrant into the "Tier 1" lineup, based on their iPhrase acquisition. To our surprise, when we saw IBM at ESS East 2007, they were featuring one of their older engines, the OmniFind Enterprise Edition. IBM OmniFind is still not one of our new Fab 4 and 1/2.

Dieselpoint, Intellisearch, Recommind, ISYS, ZyLAB, Vivisimo, Siderean and Exalead have strong presences in niche markets.
 
To read the full article ... 2008 Enterprise Search Vendors: The New Fab 4 ... and 1/2. http://www.ideaeng.com/pub/entsrch/2008/number_01/article01.html

October 19, 2007

Is Gartner missing a trend?

The new Gartner 'Magic Quadrant' report for Information Technology, released last month, shows few surprises in the actual vendor chart. But the report goes on to explain that, of the open-source search engines, "none of them are significant enough to threaten the commercial market". They go on to specifically mention Lucene, saying "enterprises don't consider it a significant alternative". We beg to differ.

Gartner does talk about IBM's strong support of Lucene; and they do say that, if IBM invests substantially in the technology, Lucene may reach its potential. However, we see a number of companies already placing their bets on Lucene - although here I am considering the Apache 'Lucene-Solr-Nutch' franchise as a single, related set of tools.  The list of Lucene users we know includes start-up vertical search companies that don't have much money; but we also see some well-funded and growing public companies which are choosing to build their skills in-house for total control over their own search destiny. Netflix, Monster.Com, and Pearson Scott Foresman are just a few of the companies that use Lucene-Solr-Nutch and are incredibly happy with their choice. And more are looking every day.

The open source path may not be right for every company. Lucene is a toolkit, and we tell our customers that "some assembly is required". It is still weak on filters for document formats, it offers weak stemmer support, and it has no integrated support for document security. Lucene and Solr don't include a spider/crawler, although Nutch is always available for that. And while there are wrappers for other popular languages, you will probably want some developers who know Java pretty well. But once you have the right skills in-house, it provides pretty good search in a lightweight, portable application.


We agree with Gartner that support from a major vendor like IBM would be a major benefit to the Lucene franchise, but we don't think it's necessary. Think about this: Lucene included a parametric search capability months before the Google Search Appliance did. The Lucene franchise also features search term highlighting, completely tunable relevance with a transparent relevance algorithm, and the ability to fine-tune just about everything to work exactly as you want it. It may be a toolkit, but it is a pretty good one for many environments.

It's not like Gartner to miss the wave completely; maybe they are just not listening to the same people we've been talking with in the corporate world.