Search Relevancy for Japanese and Other CJK Text: Interesting Thread on SearchDev.org
There's a really nice discussion over on SearchDev.org about relevancy when searching Japanese text and other CJK languages. It touches on a lot of technical issues, including tokenization, thesauri, and character set normalization.
Folks are chiming in about how a number of different search engines handle this, including Autonomy IDOL, K2, Ultraseek and MarkLogic.
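For anyone new to the topic, here is a rough sketch of two of the issues mentioned above: character set normalization and tokenization. It is not taken from the thread and does not reflect how any of the engines named above work. It assumes Unicode NFKC folding as the normalization step and overlapping character bigrams as the tokenizer, which is one common fallback when no dictionary-based segmenter is available. The function names `normalize` and `char_bigrams` are made up for illustration.

```python
import unicodedata


def normalize(text: str) -> str:
    """Fold width variants with NFKC, then lowercase.

    NFKC maps full-width Latin letters to ASCII and half-width
    katakana to full-width katakana, so both sides of a query
    and a document compare equal after folding.
    """
    return unicodedata.normalize("NFKC", text).lower()


def char_bigrams(text: str) -> list:
    """Split text into overlapping character bigrams, ignoring whitespace.

    Japanese and Chinese text has no spaces between words, so a
    bigram tokenizer simply indexes every adjacent pair of characters.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # Zero or one character: return it as a single token (or nothing).
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


if __name__ == "__main__":
    print(normalize("ＡＢＣ"))        # full-width Latin letters
    print(normalize("ｶﾀｶﾅ"))        # half-width katakana
    print(char_bigrams("東京都"))    # Tokyo Metropolis
```

The last call hints at why relevancy gets tricky: the bigrams of 東京都 (Tokyo Metropolis) include 京都 (Kyoto), so a naive bigram index would match a query for Kyoto against documents that only mention Tokyo.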
The actual thread:
http://tech.groups.yahoo.com/group/search_dev/messages/718?threaded=1&m=e&var=1&tidx=1
It's a tad hard to read with all the quoted text, but well worth a full skim. Keep scrolling!