« Exalead acquired by Dassault Systèmes | Main | Yahoo declares Hadoop ready for the Enterprise »

July 07, 2010

In Defense of "grep" / auto-substring Matching :-)

As some of you know, grep is the Unix utility that, in its simplest form, looks literal strings in a file and prints out any matching lines. The database equivalent is the LIKE operator with percent signs before and after the string.

For years all of us fulltext search engine snobs have been saying "grep is not a search engine" (and by extension, neither is the LIKE operator in SQL), and that this type of literal matching is insufficient for real searching. For example, this type of simple matching won't get word variations like "run" and "ran", nor synonyms like "cool" and "cold".

From an implementation standpoint, the problem with grep is performance related, it scans every line of every file to check each pattern. This is super slow if you have billions of documents. Instead search engines index all the documents ahead of time and create a highly optimized search index. It consults that index, not the original source documents, to search for specific words.

But I find myself doing substring searches in a few of the systems I frequently use. In our CRM, when I don't remember the specific spelling of a person or company or product, I type in just 3 or 4 letters. This doesn't always work, sometimes it brings back junk, other times it misses the mark. But it's an easy search to edit and resubmit, so I can fire off 2 or 3 variations in short order. I also use substrings quite a bit when searching through source code. OpenGrok is a very nice Lucene based search engine, and uses proper word breaks, but sometimes it actually doesn't find things I'm looking for because it's looking, by default, for complete words. Whereas when you're in the Eclipse editor, it uses substring searching by default, and you can lookup substrings without thinking about it. Email is yet another application that, at least on some systems, starts looking up matches after just 2 or 3 letters. There's a special case, some systems will only match those 2 or 3 characters if they're at the start of a word, similar to many autocomplete instances.

I can hear some of you yelling "what about wildcards!?" - most engines will let you put abc* and match everything starting with abc. Search engines differ on whether or not you can use wildcards in the middle or start of the word, and some engines can do it IF you enable it. This is close... it's an improvement in that it doesn't do a linear scan of all the documents, it still consults the fulltext search index. But most folks forget to put the asterisk... or is it a percent sign? And can you put it in the middle or beginning, in your particular engine and configuration? Who knows!

So what's to be done? The good news is you really can "have your cake and eat it too!". Highly configurable search engines can be told to index the same text in several different ways. One internal index can have tokens that are the exact words. Another index can normalize the words down to lower case and perform "stemming", to normalize all the plurals to singular form, etc. These engines should also be able to be coaxed into storing all of the smaller chunks of words in yet another index. Of course substrings aren't as good as a full match. But search engines have an answer for this too! You can set the relevancy for these different indices with different weights. A substring match is OK... if there's nothing else... but if the full word matches, it should get extra credit, or an exact match scores even higher. And keep in mind you're not paying the performance penalty, it's using the index and not doing a literal scan of every file.

All this techno-babel, let's walk through an example:

You're text has the term sentence "There were marks on the surface.", and let's focus on the third word "marks". Then another sentence has "Mark wrote this blog post."

The word "marks" gets indexed several ways:

Exact index: marks

Stemmed index: mark

Single index: m a r k s

Double index: ma ar rk ks

Triples: mar ark rks

Then the term "Mark" is indexed as:

Exact index: Mark

Stemmed index: mark

Tuple index (combines the 1, 2 and 3): m a r k ma ar rk mar ark

Kinda techie, but you can see that, as long as the same rules are applied to the search terms, we can easily matching something.  If somebody doesn't remember if my name ended in a "c" or a "k", they can find me with just "mar". Now, if there's a million documents, that search will bring back LOTS of other documents with the substring "mar", albeit very quickly!

But if somebody searches for mark or Mark, extra credit will be given for matching more precise indices. Actual implementations would probably leave off the single letter index, the m, a, r and k stuff, as almost every document would have those. And this implementation would take more disk space, more time to index, etc. And they'd tend to bring back a lot of junk. But the good news is that folks wouldn't have to remember to add wildcard characters. In techie terms we'd say this "helps recall, but hurts precision". Another idea would be to NOT apply the substring matching by default, but perhaps offer a clickable option in the results list to "expand your search", which re-issues the same search with the substring turned on, an let the user decide.

Index-based automatic substring matches have its place, along with all of the other tools in the search engine arsenal. It's a nice option to have when searching over names, source code, chemicals, domain names, and other technical data. Whether it's turned on by default, and how it's weighted against better matches, are choices to be carefully weighed.


TrackBack URL for this entry:

Listed below are links to weblogs that reference In Defense of "grep" / auto-substring Matching :-):


Verify your Comment

Previewing your Comment

This is only a preview. Your comment has not yet been posted.

Your comment could not be posted. Error type:
Your comment has been saved. Comments are moderated and will not appear until approved by the author. Post another comment

The letters and numbers you entered did not match the image. Please try again.

As a final step before posting your comment, enter the letters and numbers you see in the image below. This prevents automated programs from posting comments.

Having trouble reading this image? View an alternate.


Post a comment

Comments are moderated, and will not appear until the author has approved them.