August 30, 2010

Today's Search Term: Stemming


stemming, lemmatization, normalize
Search engines use stemming as a means to determine the root of a given written word. Using a program or algorithm, all of the affixes of a word (prefixes and/or suffixes in the English language) are removed, leaving the root word. By implementing the rules of the given language, a stemmer lets inflected forms such as the third-person singular present in English (cries, from the verb cry) be indexed accurately under their root.

Stemmers become harder to design as the rules of the target language become more complex. For example, some languages have more verb and pronoun forms. Other languages do not always have clear breaks between words, and you can't do stemming until you've isolated the words!
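
To make the idea concrete, here is a minimal sketch of suffix stripping in Python. It is not the Porter algorithm or any production stemmer; the rule list, the minimum stem length and the function names are all assumptions made up for illustration, and a real stemmer needs far more rules and exceptions.

```python
# Naive suffix-stripping stemmer (illustrative only; not the Porter algorithm).
# Each rule is (suffix, replacement); the first rule that yields a long
# enough stem wins. The rules below are a tiny hypothetical sample.
SUFFIX_RULES = [
    ("ies", "y"),   # cries -> cry
    ("ing", ""),    # walking -> walk
    ("ed", ""),     # walked -> walk
    ("s", ""),      # cats -> cat
]

MIN_STEM_LENGTH = 3  # assumption: never reduce a word to fewer than 3 letters


def naive_stem(word):
    """Return a crude stem for an English word by stripping one suffix."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if len(candidate) >= MIN_STEM_LENGTH:
                return candidate
    return word


def tokenize(text):
    """Isolate words before stemming.

    Splitting on whitespace only works for languages with clear word breaks.
    """
    punctuation = ".,!?\"'"
    tokens = [t.strip(punctuation) for t in text.split()]
    return [t for t in tokens if t]


if __name__ == "__main__":
    print([naive_stem(t) for t in tokenize("The baby cries.")])
```

Even this toy version shows why the rules matter: a word like "glass" would lose its final "s", which is exactly the kind of exception a real stemmer has to handle.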


August 27, 2010

There's an Ant on your Southwest Leg!

The WSJ has an interesting article on how language affects how we think. I particularly liked the example of an indigenous language where anything you discuss involves absolute cardinal directions (north, south, east, west, etc.). You literally can't say "There is an ant on one of your legs". Instead you say something like "There's an ant on your southwest leg." To say hello you'd ask "Where are you going?", and an appropriate response might be, "A long way to the south-southwest. How about you?" If you don't know which way is which, you literally can't get past hello.

Dr. Kevin Lim reviewed Search Engine Society, a book which explores the effect search engines have on politics, culture and economics. He is not your typical reviewer, since he is also mentioned in the book for recording a large part of his life using cameras (one he wears, another on his desk pointed at him) while a GPS device tracks his movements.

Google throws its weight behind Voice Search by Stephen Lawson discusses how voice search is based on statistical models of what sequences of words are most likely to occur, and how they train a new language model. Another example of that would be Midomi, a web site where you search for music by singing a fragment of the song.
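
As a rough illustration of what "a statistical model of which word sequences are most likely" can mean, here is a minimal bigram-count sketch in Python. This is not Google's method; the training sentences, the "<s>" start marker and the function names are assumptions invented for the example, and real language models use far larger corpora plus smoothing.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # "<s>" is a hypothetical start-of-sentence marker.
        words = ["<s>"] + sentence.lower().split()
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    return counts


def most_likely_next(counts, prev):
    """Return the word seen most often after `prev`, or None if unseen."""
    followers = counts.get(prev.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up queries standing in for a training corpus.
    queries = ["pizza near me", "pizza near downtown", "pizza near me now"]
    model = train_bigrams(queries)
    print(most_likely_next(model, "near"))
```

A recognizer that is unsure whether it heard "me" or "knee" after "near" could use counts like these to prefer the sequence it has seen more often.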

Multilingual Search Engine Breaks Language Barriers discusses how the UNL Society uses the pivot language UNL to return a precise answer in the language in which the question was formulated. This still seems to be a research project, with some related projects such as LACE trying to extract data from parallel corpora as a cheaper way to populate a lexical database.

XBRL Across The Language Divide by Jennifer Zaino discusses how XBRL (eXtensible Business Reporting Language) may be one of the few areas that benefits from the Monnet project, which attempts to "provide a semantics-based solution for accessing information across language barriers". It tries to "build software that breaks the link between conceptual information and linguistic expressions (the labels that point back to concepts in ontologies) for each language." When that works, it makes it easier and quicker to perform analytics across multiple languages.

The Cross-Language Evaluation Forum (CLEF) is working on infrastructure for testing, tuning and evaluation of systems that retrieve information in European languages, and on benchmarks to help test them. One of its papers, for example, compares lexical and algorithmic stemming in 9 languages using Hummingbird SearchServer.

August 19, 2010

Microsoft has a ways to go in search...

So I discovered an article on a Microsoft forum today where someone was asking about the differences between the different versions of enterprise search. I posted a reply, with some link suggestions and a pointer to a previous posting here on our own blog.

Now, because of all the work we do with Microsoft's FAST product, some people think we see no wrong in Redmond. Well, read on.

A few minutes later, I wanted to go back and re-read the original posting; but, try as I might, I was unable to find the posting on the Microsoft forum search. The original question had a number of relatively unique terms, so I tried again. And again. No luck anywhere on the Microsoft MSDN site.

(By the way, it sometimes took up to 30 seconds to get a result back; something on the system social.msdn.microsoft.com takes forever. But the search itself, when it came back, reported it only took 0.2 seconds, so I felt much better. NOT. I noticed that if I hit the 'Search' button again in frustration after a long wait, the result came back immediately. Someone at Microsoft needs to be looking at this!)

I went back to the Google public site and, by using a bunch of unique terms, found the original post. My search? fs4sp fs14 fsis fsia reference. Only one document comes back even in Google, which may be a record.

The same search on the Microsoft forums returns ZERO hits - ironic since the document is posted on the Microsoft discussion forum. Bing returns a Japanese language page; and, to no surprise, Yahoo returns the same page. Both, by the way, are an HTTP error 403 page.

So it looks like Microsoft has its work cut out for it in the public-facing web search arena: If it cannot locate a posting (from April!) on its own forums, how can it hope to compete with Google?

August 06, 2010

Coveo Expresso - free Enterprise Search Lite for up to 50 users

Coveo released a beta version of Coveo Expresso. It is a free entry-level enterprise search application "designed to allow users to search through corporate emails, SharePoint, network file servers and desktop files from their mobile device or desktop." It's free for up to 50 users, 1 million emails and attachments, and 100,000 documents. Each user can use an Outlook sidebar for searching from within Outlook, a floating search bar for the desktop, a classic search page using their browser and/or a BlackBerry MIDlet (mobile search).

It is built on Coveo Enterprise Search Platform 6.0 and provides a simplified admin portal to centrally provision users with just a few clicks. The employee can then install or update Coveo Expresso to their desktop, Outlook and BlackBerry with one click. The company claims you can download and configure it in less than 45 minutes.

They sell several upgrade packs. The license can be expanded to 250 users, 5 million desktop files and email messages, and 1 million SharePoint and file share documents just by typing a new access code. Expresso can use Coveo’s Advanced Search Modules, which are highly configurable and scalable to billions of documents.

The free version of Coveo Expresso requires a permanent Internet connection to receive license keys, which are renewed every 7 days. If the keys are not renewed, it goes offline and users are unable to do any searches.

Barb Masher has an overview of the new features in Enterprise Search Platform 6.1. The two products share many features. A comparison of their features is available here.

Stephen Arnold recently posted about Coveo's Enterprise Search product winning the SIIA Codie award in the “Best Enterprise Search Engine Category” for the second time.

John Ragsdale has an interesting post (this was in January, before version 2.0 of CIAS was announced) about his spending some time with Coveo, an "emerging customer information access vendor, whom I will never again refer to as a search vendor". He argues that their customer search product provides so many possibilities to retrieve, manipulate and display data that it is "much more than a search engine, or a dash boarding tool, or a reporting platform, though it can do all of these things well."

August 04, 2010

First fully tested release of SMILA available

SMILA (SeMantic Information Logistics Architecture) is an Eclipse project that provides an extensible framework for building search applications to access unstructured information in the enterprise. It provides an integrated package based on Lucene that includes crawlers, connectors and the interfaces needed to manage it using existing infrastructure. The main goal of SMILA is to reduce the risk of investment and IT costs by providing a common development framework that can be used to build semantic applications and by standardizing a lot of the code.

SMILA attempts to provide economies of scale while providing the option to use highly specialized solutions or plug-ins as needed. It also provides the opportunity for a company to reuse interfaces from internal projects that use Lucene.

The first fully tested (to make certain there are no legal issues due to third party code) official release is available. Version 0.7 also adds Web Service API support and Solr integration (access to Apache Solr REST API). 

SMILA has been getting more German press (it was created by Empolis GmbH and Brox IT Solutions GmbH) in the last year but very little in this country. The last mention I spotted was as part of a 25-minute talk on Searching the Cloud - the EclipseRT Umbrella! at EclipseCon 2010 in March.

Version 0.9 is scheduled for November 30, 2010 and is supposed to include some more third party components (that have completed the IP process). It will be interesting to see if some of those components are from American companies and if they find a way to build bridges to other Eclipse projects that use semantic technologies. I found some newsgroup posts last year about creating a new Eclipse project to do that but nothing seems to have happened.

GitHub has a Chansonnier project based on SMILA, but it's part of the author's bachelor's degree thesis project. It is a search application that indexes songs imported from the web, with parameters like language and emotion. It's useful as a sample SMILA application that isn't part of the official distribution. The SMILA project has a lot of potential but hasn't found a way to appeal to a wider audience yet.

August 03, 2010

Open Source Search Engine in C#

Codeplex now features a search engine written in C#. The author’s intent:

BlueCurve enterprise search is a search engine written in C#.
The main goal is to provide an extensible and reliable search engine with dotnet like nutch.
Join me : bluecurveteam@gmail.com


Today's Search Term: Folksonomy

Taxonomy, Behavior Based Taxonomy
A type of taxonomy or other organization of content suggested by users.
For example, on popular photo sites, users can tag photos with descriptive words. These words can then be searched for. In the enterprise, some search systems allow employees to tag certain documents with keywords. Those documents are then found when other employees search for those terms.
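
The tagging idea above can be sketched as a small inverted index from tag to documents. This is only an illustration in Python, not how any particular search product stores tags; the class name, method names and document names are hypothetical.

```python
from collections import defaultdict


class TagIndex:
    """Maps user-supplied tags to the documents they were applied to."""

    def __init__(self):
        self._docs_by_tag = defaultdict(set)

    def tag(self, doc_id, *tags):
        """Record that a user tagged `doc_id` with one or more keywords."""
        for t in tags:
            # Normalize so "Budget" and "budget " count as the same tag.
            self._docs_by_tag[t.strip().lower()].add(doc_id)

    def search(self, term):
        """Return the documents tagged with `term`, sorted by id."""
        return sorted(self._docs_by_tag.get(term.strip().lower(), set()))


if __name__ == "__main__":
    index = TagIndex()
    index.tag("q3-report.doc", "Budget", "finance")  # hypothetical documents
    index.tag("plan.ppt", "budget")
    print(index.search("budget"))
```

Because the vocabulary comes from the users rather than from a predefined taxonomy, the only normalization here is lowercasing; a real system might also stem or merge near-duplicate tags.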

August 02, 2010

Some of Yahoo's most valuable assets might switch to Google Search

Yahoo Japan is one of Yahoo's most valuable assets, but it is not fully owned by Yahoo and is not obligated by Yahoo's recent agreement with Microsoft to use Bing. There are a lot of posts about Google trying to reach an agreement with Yahoo Japan but the best one seems to be this one by Kara Swisher. If they reach an agreement, Google would essentially control the Japanese search market.

The Alibaba Group owns Yahoo's name in China, and is partially owned by Yahoo. It's currently using Yahoo's search technology, but is also free to switch if it wants to. Yahoo Japan has partnered with Taobao (China's top ecommerce website and a subsidiary of the Alibaba Group) to list over eight million items in a Chinese-language TaoJapan section. That might cause a ripple effect if Yahoo Japan switches.

Yahoo Japan is very different from what somebody in the USA is used to. It's very localized, with what a non-Japanese visitor would consider a very cluttered site. Even Google (in Japan) has customized its sparse splash page and added links to numerous services to try to cater to Japanese users. Yahoo Japan scans passersby and puts personalized content on billboards. Supposedly the install CD from most Japanese ISPs sets the home page to Yahoo Japan, and few users bother to change it. Cheap 100Mbps residential broadband with an IP phone is also fairly standard. Why Yahoo! is more popular than Google in Japan has some more details.