So how many machines does *your* vendor suggest for a 100,000,000+ document dataset?
We've been chatting with folks lately about really large data sets. Clients who have a problem, and vendors who claim they can help.
But a basic question keeps coming up - not licensing, but "how many machines will we need?" Not everybody can put their data on a public cloud, and private clouds can't always spin up a dozen virtual machines to play with, plus duplicates of that set for dev and staging. So it's not quite as trivial as some folks think.
The Tier-1 vendors can handle hundreds of millions of docs, sure, but usually on quite a few machines, plus of course their premium licensing and some non-trivial setup at that scale.
And as much as we love Lucene, Solr, Nutch and Hadoop, our tests show you need a fair number of machines if you're going to turn around a half billion docs in less than a week.
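To give a feel for the arithmetic, here's a back-of-envelope sketch of the kind we run. The per-node indexing rate below is purely an assumption for illustration - substitute whatever your own Lucene/Solr benchmarks actually measure:

```python
# Back-of-envelope sketch: how many indexing nodes for N docs in D days?
# The per-node rate is an assumed placeholder, NOT a benchmark result;
# real rates vary widely with doc size, analysis chain, and enrichment.
import math

DOCS = 500_000_000            # half a billion documents
DEADLINE_DAYS = 7             # "less than a week"
DOCS_PER_NODE_PER_SEC = 100   # assumed sustained per-node indexing rate

seconds = DEADLINE_DAYS * 24 * 60 * 60
required_rate = DOCS / seconds                        # cluster-wide docs/sec
nodes = math.ceil(required_rate / DOCS_PER_NODE_PER_SEC)

print(f"Cluster must sustain ~{required_rate:,.0f} docs/sec")
print(f"At {DOCS_PER_NODE_PER_SEC} docs/sec/node, that's ~{nodes} indexing nodes")
```

And that's just raw indexing throughput - replication, re-indexing, and the dev/staging duplicates mentioned above multiply whatever number falls out.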
And beyond indexing time, once you start stacking 3 or 4 facet filters on a query, you hit another performance knee.
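For concreteness, here's roughly what that kind of query looks like against Solr; the host, collection, and field names are hypothetical, though the `q`, `fq`, and `facet.field` parameters are standard Solr. Each added `fq` filter narrows the result set, but the engine still has to recount every faceted field over what remains, which is where the knee shows up:

```python
# Illustrative Solr request with stacked facet filters.
# Hostname, collection, and field names below are made up for the example.
from urllib.parse import urlencode
from urllib.request import urlopen

params = urlencode([
    ("q", "*:*"),
    ("fq", "category:electronics"),   # facet filter 1
    ("fq", "brand:acme"),             # facet filter 2
    ("fq", "price:[100 TO 500]"),     # facet filter 3
    ("facet", "true"),
    ("facet.field", "category"),
    ("facet.field", "brand"),
    ("facet.field", "seller"),
    ("rows", "10"),
    ("wt", "json"),
])

url = f"http://localhost:8983/solr/docs/select?{params}"
with urlopen(url) as resp:
    print(resp.read()[:500])  # peek at the start of the JSON response
```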
We've got 4 Tier-2 vendors on our "short list" that might be able to reduce machine counts by a factor of 10 or more over the Tier-1 and open source guys. But we'd love to hear your experiences.