
August 09, 2011

So how many machines does *your* vendor suggest for 100,000,000+ document dataset?

We've been chatting with folks lately about really large data sets: clients who have a problem, and vendors who claim they can help.

But a basic question keeps coming up - not licensing, but "how many machines will we need?"  Not everybody can put their data on a public cloud, and private clouds can't always spin up a dozen virtual machines to play with, plus duplicates of those for dev and staging, so it's not quite as trivial as some folks think.

The Tier-1 vendors can handle hundreds of millions of docs, sure, but usually on quite a few machines, plus of course their premium licensing, and some non-trivial setup at that point.

And as much as we love Lucene, Solr, Nutch and Hadoop, our tests show you need a fair number of machines if you're going to turn around a half billion docs in less than a week.

And beyond indexing time, once you start doing 3 or 4 facet filters, you also hit another performance knee.

We've got 4 Tier-2 vendors on our "short list" that might be able to reduce machine counts by a factor of 10 or more over the Tier-1 and open source guys.  But we'd love to hear your experiences.
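As a rough sanity check on the "half billion docs in less than a week" target above, here's a back-of-envelope sketch. The corpus size and the one-week window come from the post; the per-node indexing rate is purely a hypothetical figure for illustration, not a benchmark of any vendor.

```python
# Back-of-envelope: sustained indexing rate needed to turn around
# half a billion documents in under a week.
docs = 500_000_000
seconds_per_week = 7 * 24 * 3600            # 604,800 s
required_rate = docs / seconds_per_week     # docs/sec across the whole cluster
print(f"cluster must sustain ~{required_rate:,.0f} docs/sec")  # ~827 docs/sec

# If a single node sustains, say, 50 docs/sec end-to-end (a made-up
# figure; real rates depend heavily on document size and pipeline),
per_node_rate = 50
nodes = -(-docs // (per_node_rate * seconds_per_week))  # ceiling division
print(f"at {per_node_rate} docs/sec/node you'd need ~{nodes} nodes")  # ~17 nodes
```

The point isn't the exact node count - it's that even modest per-node rates multiply into double-digit clusters at this scale, which is why the machine-count question matters more than licensing.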





We're a FAST OEM from before the MS purchase. With the final 5.3 release our internal targets are generally 100M docs/node. On the 4.3 release we went with 30M/node, but probably could have gotten that to 60 or 90 if we wanted to make our config a little more exotic. All with HTML previews, no less ;-)

That sort of density isn't easy to come by, and there are certainly bumps along the road. Not all customers understand the difference between indexing log files and email.

Not too familiar with the other vendor offerings or Solr, but would be very interested to see the comparison... always thought we stacked up quite well density-wise.

Thanks for contributing - wow, that count on ESP was way over the FAST suggested limit... but then, they licensed on document size and QPS, I believe. Still, much more in the way of efficient use of the iron.

Back in 2000 or so we built a 500m web page index with a farm of around 30 machines, based on what became Xapian. Our 'rule of thumb' for hardware is around 10-50m documents per server, but of course this depends very much on how big a document is - a single database row or a 1000 page PDF for example. Of course it's also dependent on the search load. Lucene/Solr more or less fits this model as well and generally requires a lot less hardware than FAST ESP for example (faceting being especially hardware intensive on that platform). However, it's important to remember that all search problems are different and all data grows like Topsy, so it's key to be able to scale economically in the future.
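Charlie's 10-50M docs-per-server rule of thumb can be applied directly to the 100M+ corpus from the post title. A minimal sketch, assuming only those two numbers (the wide range captures his caveats about document size and search load):

```python
# Apply the 10-50M docs/server rule of thumb to a 100M-doc corpus.
import math

corpus = 100_000_000

def servers_needed(docs, docs_per_server):
    """Round up: a partially filled server is still a server."""
    return math.ceil(docs / docs_per_server)

low = servers_needed(corpus, 50_000_000)   # optimistic: small docs, light query load
high = servers_needed(corpus, 10_000_000)  # pessimistic: big docs, heavy faceting
print(f"{low}-{high} servers for {corpus:,} docs")  # 2-10 servers
```

A 5x spread from a single rule of thumb is exactly why "how many machines?" has no one-line answer - and why economical scaling headroom matters more than the initial count.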

Excellent info, Charlie... and a good reminder of some of the parameters that can influence the call.
