« Domain Name Registrar Search Tweak: Indicate that you already own It in Search Results | Main | Google Instant: Predictive queries »

September 04, 2010

Faster sorting for Farsi / "Iranian", Danish, Turkish, other atypical languages in Lucene/Solr

By default search engines sort results by relevance or "score", to try and bring the best match to the top of the results list. That's normally what users want, but occasionally you might want to sort by a different field, such as date, title or author. Lucene and Solr support this in various ways, as do many other search engines.

When it comes to sorting by titles or author names, most languages sort words with similar rules, and this is the character ordering that's built into Unicode. But a few languages are different, they may have different policies on accented characters, for example. Java includes to concept of "locale" to represent some language differences, such as currency and date formats, and it can also encode these differences in preferred order. However, apparently the performance isn't great, so sorting in some languages can be slow, or there may not be a locale for a specific language/dialect.

Lucene does include an alternate "collator" class that claims to fix this. It allows for non-default Unicode sorting rules, without the slowdown normally associated with locales. The doc mentions Farsi, Danish and Turkish as examples. Although I haven't tried it, since it's buried a bit in the code tree, I wanted to surface it in a post.

The top URL (in case formatting gets lost) is:


Usage scenarios are given in package.html



TrackBack URL for this entry:

Listed below are links to weblogs that reference Faster sorting for Farsi / "Iranian", Danish, Turkish, other atypical languages in Lucene/Solr:


Thanks for writing about this!

More documentation and examples are available here for the Solr integration (not yet in any Solr release):


Verify your Comment

Previewing your Comment

This is only a preview. Your comment has not yet been posted.

Your comment could not be posted. Error type:
Your comment has been saved. Comments are moderated and will not appear until approved by the author. Post another comment

The letters and numbers you entered did not match the image. Please try again.

As a final step before posting your comment, enter the letters and numbers you see in the image below. This prevents automated programs from posting comments.

Having trouble reading this image? View an alternate.


Post a comment

Comments are moderated, and will not appear until the author has approved them.