
January 04, 2012

My search platform ate my homework

In a recent article on infoworld.com, Peter Wayner wrote a nifty piece discussing 11 programming trends to watch. It's interesting in general, but one trend really rang true for me with respect to enterprise search.

He calls his 9th trend "Accuracy fades as scalability trumps all." He points out that most applications are fine with close approximations, based mainly on the assumption that at internet scale, if we miss an instance of something today, we'll probably see it again tomorrow. That brought to mind something I'm working on right now for a customer who needs 100% confidence in their search platform to meet some very stringent requirements. The InfoWorld article reminded me of a dirty little secret of nearly all enterprise search platforms, a secret you may not know (yet), but one that could be important to you.

Search platform developers make assumptions about your data, and most search platforms do not index all of your content... by design! Don't get me wrong: these assumptions let them produce pretty good accuracy every time, and even 100% accuracy sometimes. And pretty good is fine most of the time. In fact, as a friend told me years ago, sometimes 'marginally acceptable' is just fine.

The theory seems to be that a search index might miss a particular term in a few documents, but any really important use of the term will surely be indexed somewhere else, and users will get results from those other documents. In fact, some search platforms have picked an arbitrary size limit and won't index any content past it, even if that means missing major sections of large documents. Google is one of the few vendors that actually documents this: once the GSA has indexed 2 MB of text or 2.5 MB of HTML in a file, it stops indexing that file and 'discards' the rest. This curious behavior works most of the time for most data (although there is an odd twist that will bite you if you feed the GSA a large list of URLs or ODBC records). To be honest, most search platforms do this sort of trimming; they just don't mention it too often during the sales process.
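To make the consequence concrete, here is a minimal sketch of index-time truncation. The 2 MB cap mirrors the GSA text figure quoted above, but the function names and the implementation are purely illustrative, not any vendor's actual code:

```python
# Illustrative sketch: many engines cap how much of each document
# they index. The cap below echoes the GSA's 2 MB text limit; the
# rest is a hypothetical toy indexer, not a real product's behavior.

MAX_INDEXED_BYTES = 2 * 1024 * 1024  # 2 MB cap, as documented for the GSA

def build_index(docs):
    """Map each term to the set of doc IDs it was indexed from."""
    index = {}
    for doc_id, text in docs.items():
        # Anything past the cap is silently discarded at index time.
        truncated = text.encode("utf-8")[:MAX_INDEXED_BYTES].decode("utf-8", "ignore")
        for term in truncated.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

# A term that appears only past the cap is simply unfindable:
big_doc = ("filler " * 400_000) + "smoking-gun"   # term sits ~2.8 MB in
index = build_index({"contract.txt": big_doc})
print("smoking-gun" in index)  # False: the tail was never indexed
```

The point of the toy is the failure mode: the query returns nothing, and nothing in the result set hints that a document was truncated, which is exactly why this matters for eDiscovery-style requirements.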

Now, in legal markets like eDiscovery, it's pretty darned critical to get every document that contains a particular term. It's not OK to go to court and report that you missed one or more critical documents because your search engine truncates or ignores some terms or some documents. That excuse might have worked in elementary school or even in high school, but it just doesn't cut it in demanding enterprise search environments.

It may not be a problem for you; but if complete indexing is a requirement, be sure to include it in your RFI/RFQ documents.






A friend talked of how his teacher kept all his excuses in a shoe box during his Junior High years. They presented them to him when they kicked him out of school on an unusually rough day. Oh, the challenges of youth.
It would be interesting to know how many documents are actually skipped when an index is built. How full would the shoe box be for some companies, without them even knowing it? Thank goodness there's not a judge around all the time.
I think the real question is "How much corporate knowledge is lost when large documents are not available for recall?"
Thanks Del. Yep... there's a thread here similar to the 'known unknowns' and the 'unknown unknowns'.. which hurts more do you suppose? /Miles
