8 posts categorized "Nutch"

November 08, 2011

Are you spending too much on enterprise search?

If your organization uses enterprise search, or if you are in the market for a new search platform, you may want to attend our webinar next week, "Are you spending too much for search?" The one-hour session will address:

  • What do users expect?
  • Why not just use Google?
  • How much search do you need?
  • Is an RFI a waste of time?   

Date: Wednesday, November 16 2011

Time: 11AM Pacific Standard Time / 1900 UTC

Register today!

August 09, 2011

So how many machines does *your* vendor suggest for a 100,000,000+ document dataset?

We've been chatting with folks lately about really large data sets.  Clients who have a problem, and vendors who claim they can help.

But a basic question keeps coming up - not licensing - but "how many machines will we need?"  Not everybody can put their data on a public cloud, and private clouds can't always spit out a dozen virtual machines to play with, plus duplicates of that for dev and staging, so it's not quite as trivial as some folks think.

The Tier-1 vendors can handle hundreds of millions of docs, sure, but usually on quite a few machines, plus of course their premium licensing, and some non-trivial setup at that point.

And as much as we love Lucene, Solr, Nutch and Hadoop, our tests show you need a fair number of machines if you're going to turn around a half billion docs in less than a week.
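To put the "half billion docs in less than a week" target in perspective, here is a back-of-the-envelope sketch in Python. The per-machine rate of 50 docs/sec is purely a placeholder assumption for illustration, not a number from our tests; plug in whatever your own benchmarks or your vendor's quotes suggest.

```python
import math

SECONDS_PER_DAY = 86_400


def required_docs_per_second(total_docs: int, days: float) -> float:
    """Sustained throughput needed to process total_docs within the given number of days."""
    return total_docs / (days * SECONDS_PER_DAY)


def machines_needed(total_docs: int, days: float, docs_per_second_per_machine: float) -> int:
    """Machine count, assuming throughput scales linearly with machines (optimistic)."""
    needed = required_docs_per_second(total_docs, days)
    return math.ceil(needed / docs_per_second_per_machine)


if __name__ == "__main__":
    total_docs = 500_000_000      # "a half billion docs"
    days = 7                      # "less than a week"
    assumed_rate = 50.0           # ASSUMPTION: docs/sec per machine, illustrative only

    rate = required_docs_per_second(total_docs, days)
    print(f"Required sustained rate: {rate:.0f} docs/sec")
    print(f"Machines at {assumed_rate:.0f} docs/sec each: "
          f"{machines_needed(total_docs, days, assumed_rate)}")
```

The linear-scaling assumption is generous; real clusters lose some throughput to coordination, and you still need to multiply by however many dev and staging copies you want.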

And beyond indexing time, once you start doing 3 or 4 facet filters, you also hit another performance knee.

We've got 4 Tier-2 vendors on our "short list" that might be able to reduce machine counts by a factor of 10 or more over the Tier-1 and open source guys.  But we'd love to hear your experiences.

December 05, 2010

Share your successes at ESS East next May

Our friends over at InfoToday who run the successful Enterprise Search Summit conferences have asked us to announce that the date for submitting papers to their Spring show in New York in May 2011 has been extended until Wednesday, December 8. You can find out what they are looking for and how to submit your proposal online at http://www.enterprisesearchsummit.com/Spring2011/CallForSpeakers.aspx.

Michelle Manafy, who runs the program again next May, really likes to have speakers who have found creative and successful ways to select, deploy, or manage ongoing enterprise search operations. We've co-presented with several of our customers in the past, and trust me, it's great fun and not bad for your career! And - no promises - the weather at ESS East has been great for just about every year - and we've been there for nearly 6 years now!

A friend told me something years ago that I've always found helpful; I hope you'll take it to heart: 'Everything you know, someone else needs to know'. Don't worry if your search project isn't perfect, or that someone will find fault with what you've done. Trust me: there are many organizations newer to enterprise search than you are, and anything you found helpful will surely be valuable for them as well. And you get to attend all of the sessions, so you might learn more as well! A 'win-win' situation if I've ever seen one!

See you in New York!

/s/Miles

 

 

July 22, 2010

Document filters webinar July 28 2010

ISYS is hosting a webinar on Wednesday, July 28 at 1PM Eastern to talk about the role document filters play in successful search indexing and display. You can register now.

Of course, as a search technology company, ISYS has enjoyed great success, particularly among law enforcement where search has to work right at a reasonable price. We've always liked their technology and their approach.

But like every search platform, ISYS needed filters to convert so-called 'binary' formats like Microsoft Office, PDF, or even Photoshop files into a stream of text - after all, today's search platforms primarily operate on words in textual format. But ISYS looked at the market at the time, and found that two of their competitors, Autonomy and Oracle, owned the best of the filter technologies.

Like any company, they made a 'make or buy' decision, and in their case, making their own filters was the right answer for them, and possibly for you. You see, ISYS decided to start selling their filter technology independent of their search platform, so now you can acquire some really great filtering and viewing technology for just about any search engine, 'off the shelf'. Their customers include other vendors with the need to extract text from various types of content, not just search vendors but also eDiscovery and eCompliance companies and many others who don’t want to pay excessive prices for technology - and who want really great filtering at a reasonable cost.

Then, a few years back, ISYS decided that open source platforms Lucene and Solr - which had no filters - needed them as well. So now you can buy a great filter pack 'off the shelf' with no huge volume commitment - no volume commitment at all! And you can get world class filtering for your open source search project.

Come hear ISYS, the guys from Lucid Imagination, and us here at New Idea Engineering talk about the critical role of filters in your search applications. See you then!

/s/Miles

June 06, 2009

Impressions of first Lucene/Solr SF Meetup

Kudos to Carl, our NIE Marketeer and de facto social director, for getting us to attend.  It was well worth it, and conveniently coincided with Gilbane.

The Good:

  • VERY entertaining, very informative.  Lots of good info about upcoming versions of Lucene and Solr, including additional performance tweaks.
  • A friendly, supportive bunch of like-minded nerds, and I mean this in the best possible way.
  • Also discussions of other related Apache projects.  We're all gonna need a cheat sheet pretty soon to keep track of it all.
  • Lucene/Solr will soon have implemented much of the core features of Autonomy IDOL, Endeca, FAST, etc.  They really ought to be spying.  :-)

Personally I think Otis & co. might wanna fly out for the next one.  I also think Dieselpoint ought to attend and talk about Open Pipeline.  If we get up enough energy maybe we could even volunteer to do that next time, we're on the board after all, but this is really Chris's baby.

The Not-so-Good:

  • About 50 terms that clients would not understand.  Don't get me wrong, we love the Map/Reduce, Bayesian, K-Means, SVD stuff, but most corporate clients would be lost.
  • Not much for Enterprise Packaging.  Ironically it's the mundane aspects of search, from a non-developer standpoint, that are still not on the horizon.  Not a criticism of the developers, they have what they need.
  • Not much about Nutch.  Nutch 1.0 is out, along with rumors of a revised admin GUI, but not much coverage here.

Impressions of Lucid Imagination:

This event was sponsored by Lucid, a company that recently got funding for bringing commercial packaging and services to the open source search world, and their senior staff includes quite a few of the core committers.

  • A very sincere bunch of guys.
  • They haven't sold their souls to corporate America; I think their "geek cred" is still well intact.
  • Probably will not be filling in enterprise packaging pot holes any time soon.
  • Do they understand the Enterprise Market?

Also a shout out to LinkedIn and IBM for giving back to the open source community.

There was also an "open mic" segment, and I'd like to give a shout to Avi Rappaport - I agree 1,000%, "stop words bad!" (or at least the blind use of index time stop words)
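To make the stop word complaint concrete, here is a minimal Python sketch of what blind index-time stop word removal does to certain queries. The tiny stop list and the `analyze` function are made up for illustration; they are not Lucene's analyzer or its actual stop list, just the same general idea.

```python
# ASSUMPTION: a small illustrative stop list, not any engine's real default.
STOP_WORDS = frozenset({"a", "an", "and", "be", "is", "not", "of", "or", "the", "to"})


def analyze(text: str, stop_words: frozenset = frozenset()) -> list[str]:
    """Lowercase, split on whitespace, and drop any token found in stop_words."""
    return [token for token in text.lower().split() if token not in stop_words]


if __name__ == "__main__":
    for phrase in ["To be or not to be", "The Who", "enterprise search"]:
        kept = analyze(phrase, STOP_WORDS)
        print(f"{phrase!r} -> {kept}")
```

With this stop list, the Hamlet quote is reduced to no tokens at all, and "The Who" becomes indistinguishable from any document containing "who". Once those words are dropped at index time, no amount of query-side cleverness can bring them back.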


Surprises:

  • Not much of a threat to Google Appliance, due to packaging.  Yes, Google scales with their Map/Reduce and relevancy algorithms, and the open source guys have responded, but that's not the stuff that makes Google tick these days.
  • And despite the impressive and rapidly evolving core technologies, also not a real threat to the other Tier One vendors like FAST and Autonomy.  More on this seeming contradiction in a bit.
  • The Tier 2 vendors of the world, Attivio, Exalead, Dieselpoint, etc. DO need to pay attention.  There is a place for Tier 2 vendors, but they need to mind what the open source products do and do not provide more carefully.
  • It's really cool to see IBM willing to contribute so aggressively to the open source search engines, even though they sell several of their own.  A naive person might think they are competing with themselves, sabotaging their own sales guys, but they're a lot smarter than that.  They are not selling their commercial search products as pure search; those technologies are always part of a larger (and more expensive) grand business solution.  They know what they're doing!

For similar reasons, still not a huge threat to Autonomy, MS/FAST, Endeca, etc. on corporate services.  I said earlier that the Apache projects are implementing a lot of the "secret sauce" that launched Autonomy and Endeca, etc, so you'd think this represents "a clear and present danger", but Mike Lynch's secret algorithms are not why people buy IDOL anymore.  Things like giant reference accounts, professional services, and commercial grade spiders have a lot more to do with why big companies still pay six figures for search technology.

And speaking of surprises and Lucid Imagination, I wanna circle back to their PR a few months back when they got their funding and launched their company.  They talked about relevancy in their press releases!?  Wow... Yes, Lucene and Solr have some good traction there, but that specific competitive advantage has been used by almost every commercial search vendor in the past 15 years, including Verity, Autonomy and Google!

I would've expected them to say something like "we're gonna do for Lucene what RedHat did for Linux" - this would have been a very clear business-oriented proposition, though to be fair lots of companies have used that business model as well.  It wouldn't be original, but would be more business centric.  Then again, I'm not in Marketing, and their VC's obviously liked their pitch, so what do I know!

/s/Mark

January 12, 2009

Virtualization and Search: Performance Tests Summary

We've been researching Virtualization and Search, and have recently presented on the topic at a couple of shows.  We wanted to measure the performance penalty of running a spider on a virtual machine instead of a physical machine.  This is a summary of our findings.

What we expected to find in terms of a performance penalty:

  • Estimated 3 to 20% penalty
  • Leaning towards 20% given the heavy disk IO

Actual results:

  • We found an approximate 10% average penalty.
  • There was a pretty wide margin: various tests measured between 0% and 17%, but always under the 20% we had estimated.  (Actually less than zero percent in some cases, but we labeled those as outliers.)
  • Overall the performance was better than expected, and certainly reasonable for many applications.

The test and environment:

  • HP mid tier workstation, dual core, AMD Athlon 64 X2 4400+
  • 8 Gigs memory, local SATA disk (non RAID)
  • Microsoft Windows Server 2008 64-bit for both host and client, using MS Hyper-V.
  • Sun's 64-bit JVM v6 set to 1 Gig max (which it did not fully respect)
  • Nutch 0.9 stock distribution
  • Dataset was the Enron public emails, approx half million emails in 1.5 Gigs of source data, served by IIS on separate local machine
  • Email files were mapped to text filter
  • Data was fetched and indexed into Lucene by Nutch
  • Clock time ranged from 31 to 35 minutes for all tests, with the physical tests (non-virtual) giving the widest measured deviation
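For anyone who wants to run a similar comparison, this is roughly how a penalty percentage can be derived from clock times. It is a minimal Python sketch: the run times below are hypothetical placeholders chosen to fall in the 31 to 35 minute range, not our measured data, and we are assuming the penalty is defined as the extra clock time relative to the physical run.

```python
def virtualization_penalty(physical_minutes: float, virtual_minutes: float) -> float:
    """Percent extra clock time of the virtual run relative to the physical run.

    A negative value means the virtual run finished faster than the physical one.
    """
    return (virtual_minutes - physical_minutes) / physical_minutes * 100.0


if __name__ == "__main__":
    # HYPOTHETICAL (physical, virtual) clock times in minutes, for illustration only.
    runs = [(32.0, 35.2), (33.0, 33.0), (34.0, 33.5)]

    penalties = [virtualization_penalty(p, v) for p, v in runs]
    for (p, v), pct in zip(runs, penalties):
        print(f"physical={p:.1f} min  virtual={v:.1f} min  penalty={pct:+.1f}%")

    print(f"average penalty: {sum(penalties) / len(penalties):+.1f}%")
```

Note how the third hypothetical pair produces a negative penalty. With run-to-run variation of a few minutes on the physical machine alone, a single pair of runs says very little, which is why an average over several tests is the number worth quoting.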

Drop us an email if you'd like the full PDF when published.

January 10, 2008

Updated 2008 Enterprise Search Vendor Roundup

Jan. 10, 2008 - San Jose, CA, USA 

Microsoft announced they were acquiring FAST Search on January 8, forcing New Idea Engineering to amend our January 4th article "2008 Enterprise Search Vendors:  The new 'Fab4 ... and 1/2" (http://www.ideaeng.com/pub/entsrch/2008/number_01/article01.html). The announcement validates our original assessment and reinforces that search is mission critical for corporations, driving Microsoft to invest in a better search technology.

Some Highlights from NIE's 2008 Enterprise Search Vendor Roundup
 
  • Autonomy IDOL and FAST Search continue to hold the high end. K2 and Ultraseek are finally retiring.
  • Google's new version 5 appliance has arrived in the enterprise search mainstream.
  • Endeca is moving from the ecommerce side and had one of the most impressive search demos at ESS West 2007.
  • Lucene/Nutch/Solr (LNS) open source search engines continue to gain customer mindshare.
  • Microsoft, with its acquisition, moves in as Tier 1.
  • IBM and Oracle are still not there.
 
Autonomy IDOL and FAST Search continue to hold the high end, evolving into "search platforms" that go beyond traditional drop in applications. The two leaders from earlier this decade, K2 and Ultraseek, are fading.

Google's new version 5 appliance has arrived in the enterprise search mainstream. While the new version won't satisfy every requirement, it addresses many of the earlier integration issues that had held it back. Expect to see the Google logo on a lot more enterprise portals.

Endeca has created some slick administration tools, doing very well in a head-to-head comparison with Autonomy and FAST despite their continued progress in this area.  As the importance of administration continues to increase, we are more enthusiastic about them in the Enterprise space.

Open source tools based on Lucene, including Nutch and Solr (LNS) are increasingly considered by companies, especially in niches that need to micromanage document relevancy and rating. Lucene and its derivatives are increasingly embedded in other software packages and services, to the point that many users won't even realize they're using it.

We had expected IBM to be the next entrant into the "Tier 1" lineup, based on their iPhrase acquisition. To our surprise, when we saw IBM at ESS East 2007, they were featuring one of their older engines, the OmniFind Enterprise Edition. IBM OmniFind is still not one of our new Fab 4 and 1/2.

Dieselpoint, Intellisearch, Recommind, ISYS, ZyLAB, Vivisimo, Siderean and Exalead have strong presences in niche markets.
 
To read the full article ... 2008 Enterprise Search Vendors: The New Fab 4 ... and 1/2. http://www.ideaeng.com/pub/entsrch/2008/number_01/article01.html

October 19, 2007

Is Gartner missing a trend?

The new Gartner 'Magic Quadrant' report for Information Access Technology, released last month, shows few surprises in the actual vendor chart. But the report goes on to explain that, of the open-source search engines, "none of them are significant enough to threaten the commercial market". They go on to specifically mention Lucene, saying "enterprises don't consider it a significant alternative". We beg to differ.

Gartner does talk about IBM's strong support of Lucene; and they do say that, if IBM invests substantially in the technology, Lucene may reach its potential. However, we see a number of companies already placing their bets on Lucene - although here I am considering the Apache 'Lucene-Solr-Nutch' franchise as a single, related set of tools.  The list of Lucene users we know includes start-up vertical search companies that don't have much money; but we also see some well-funded and growing public companies which are choosing to build their skills in-house for total control over their own search destiny. Netflix, Monster.Com, and Pearson Scott Foresman are just a few of the companies that use Lucene-Solr-Nutch and are incredibly happy with their choice. And more are looking every day.

The open source path may not be right for every company. Lucene is a toolkit, and we tell our customers that "some assembly is required". It is still weak on filters for document formats, it offers weak stemmer support, and has no integrated support for document security. Lucene and Solr don't include a spider/crawler, although Nutch is always available for that. And while there are wrappers for other popular languages, you will probably want some developers who know Java pretty well. But once you have the right skills in-house, it provides pretty good search in a lightweight, portable application.


We agree with Gartner when they say support from a major vendor like IBM would be a major benefit to the Lucene franchise; but we don't think it's necessary. Think about this: Lucene included a parametric search capability months before the Google Search Appliance did. And the Lucene franchise features search term highlighting; completely tunable relevance and a transparent relevance algorithm; and the capability of fine tuning just about everything to work exactly as you want it. It may be a toolkit, but it is sure a pretty good one for many environments.

It's not like Gartner to miss the wave completely; maybe they are just not listening to the same people we've been talking with in the corporate world.