
September 13, 2012

Improved Tokenization for Punctuation Characters

Search applications that are geared towards technical users often have problems with searches like .net, C#, *.*, etc.

In some cases these can be handled solely at the application level. For example, many Unix utility command line options begin with a hyphen, which means “NOT” to many search engines, so users searching for "-verbose" will find every document EXCEPT the ones that discuss the verbose option.

Depending on the engine and its configuration, this can often be handled by just stripping off the minus sign before submitting the query to the engine.
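As a rough sketch of that application-level fix, the snippet below strips leading hyphens from each query term before the query is handed to the engine. The function name `strip_leading_minus` is made up for illustration, and the sketch assumes your users never rely on the hyphen as a deliberate NOT operator; if they do, you would need a smarter rule.

```python
def strip_leading_minus(query: str) -> str:
    """Remove leading hyphens from every whitespace-separated term.

    Hypothetical helper: assumes the hyphen is never meant as NOT.
    """
    terms = (term.lstrip("-") for term in query.split())
    # Drop terms that were nothing but hyphens.
    return " ".join(term for term in terms if term)


print(strip_leading_minus("-verbose"))
print(strip_leading_minus("grep --ignore-case"))
```

Note that hyphens inside a term (as in "ignore-case") are left alone; only the leading ones that an engine might read as negation are removed.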

If there's always additional text in the search, a cheap workaround is to just consistently drop the same punctuation characters at both index time and search time. As long as "TCP/IP" is consistently reduced to [ tcp, ip ], users will have a good chance of finding it.
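Here is a minimal sketch of that workaround. The key point is that one and the same function is called on documents at index time and on queries at search time. The name `normalize` and the exact regular expression are my own assumptions, not something any particular engine prescribes.

```python
import re

# Any run of characters that is neither a word character nor whitespace.
_PUNCT = re.compile(r"[^\w\s]+")


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    return _PUNCT.sub(" ", text.lower()).split()


# Index time and search time go through the same function,
# so both sides agree on the resulting tokens.
doc_tokens = normalize("Configuring TCP/IP on the server")
query_tokens = normalize("TCP/IP")
print(query_tokens)
print(all(token in doc_tokens for token in query_tokens))
```

The catch, of course, is that a query consisting only of punctuation normalizes to an empty list.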

But what if punctuation is all you have? What if somebody really needs to search for -(*), for example? What then? There's a strong tendency to balk at these use cases, to claim that they are obscure edge cases, and to rationalize why they should be ignored. But this edge case argument is old and stale - if your site truly needs to search punctuation-rich content, then supporting it may be worth the cost. Long search tails, which are common in technical search applications, can add up to a substantial percentage of overall traffic!

Many punctuation problems need to be handled at index time, either instead of or in addition to special search time logic. For example, if asterisks are important, they can be stored as actual tokens in the fulltext index. At search time the asterisks would also need to be handled appropriately, since most search engines would either ignore them or assume they are part of a wildcard search.

The point is that, regardless of what you do with asterisks at search time, they cannot be found at all if they didn’t make it to the index, and were instead discarded at index time.
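To make this concrete, here is a small sketch of a tokenizer that keeps each punctuation character as its own token instead of throwing it away. The name `tokenize_keep_punct` is hypothetical, and this is plain Python rather than any engine's analyzer API; in a real engine you would configure or write the equivalent analyzer, and you would also need to stop the query parser from treating the asterisk as a wildcard.

```python
import re

# A token is either a run of word characters or one single
# character that is neither a word character nor whitespace.
_TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize_keep_punct(text: str) -> list[str]:
    """Tokenize text, keeping each punctuation character as a token."""
    return _TOKEN.findall(text.lower())


print(tokenize_keep_punct("-(*)"))
print(tokenize_keep_punct("*.*"))
print(tokenize_keep_punct("C# and .net"))
```

If the same tokenizer runs over both documents and queries, a search for -(*) becomes a four-token phrase search that can actually match something.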

Token Overloading can be used to put multiple tokenized representations of the same text into the index. For example, if a hyphenated phrase like "XY-1234" is found in a source document at index time, it can be injected as [ (xy-1234), (xy, 1234), (xy,-,1234), (xy1234) ]. Although this inflates the index size, it gives maximum flexibility at search time.
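The four variants listed above can be generated with a few lines of code. This is only a sketch: `overload_hyphenated` is an illustrative name, it handles hyphens and nothing else, and it simply returns the variants as tuples. How the variants are actually injected into the index (typically as extra tokens at the same position) depends on your engine and is not shown here.

```python
def overload_hyphenated(term: str) -> list[tuple[str, ...]]:
    """Return several tokenized representations of a hyphenated term."""
    lowered = term.lower()
    parts = [part for part in lowered.split("-") if part]
    if len(parts) < 2:
        # Nothing to overload: not really a hyphenated term.
        return [(lowered,)]

    # Interleave the parts with explicit hyphen tokens.
    with_hyphens: list[str] = []
    for i, part in enumerate(parts):
        if i > 0:
            with_hyphens.append("-")
        with_hyphens.append(part)

    return [
        (lowered,),             # original, kept whole
        tuple(parts),           # split on the hyphen
        tuple(with_hyphens),    # split, hyphen kept as a token
        ("".join(parts),),      # hyphen removed, parts joined
    ]


for variant in overload_hyphenated("XY-1234"):
    print(variant)
```

With all four forms in the index, a user can type xy-1234, xy 1234, or xy1234 and still have a fair chance of landing on the same document.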

Don't confuse "don't need to do something" with "don't know how to do something". Get your punctuation problems sorted out properly!

Hmm... ironically our own web properties don't follow this advice, and we certainly do attract a number of techies! I could rationalize that punctuation search isn't a large percentage of our traffic, but the real reason is that we use hosted services and don't have full control over their search engines. But do as we say, not as we do, and remember that honest confessions are good for the soul.

