4 posts categorized "Eclipse IDE"

May 10, 2012

Lucene Revolution: MS talks of being more open

At yesterday’s kickoff of Lucene Revolution 2012, Lucid CEO Paul Doscher introduced Gianugo Rabellino, Microsoft's Director of Open Source Communities. Gianugo said little about search per se, but he did confess to having been a fan of Lucene and Solr for a while now. In his talk, he told the audience that Microsoft has changed with respect to open source, and he went on to describe how the company has become more involved in open standards like HTML5 and CSS3, and in hardware specifications like USB. He went so far as to say 'Microsoft's survival depends on open source software'.

News to me, and perhaps to others in the room, was the extent to which Microsoft is supporting a number of open source products and languages. Gianugo reported that Linux is now a 'first-class guest operating system' on Microsoft Hyper-V, and that Microsoft provides support for PHP, Ruby on Rails, node.js and other projects on Azure (and presumably for 'on premises' systems).

A number of folks from large commercial organizations seemed to appreciate the news about Microsoft's shift towards supporting open source; but a number of the open-source folks in the room felt this offered little new, and some even felt it was an unrelated 'sales pitch'. Even though we are Microsoft partners, I'm glad to see more support for open source products like PHP and Linux.

The funniest part of the talk came as Gianugo was describing how SharePoint data was easily accessible to non-Microsoft search platforms. An attendee asked if he felt there was a role for other platforms to be used as the primary engine for search in SharePoint; as he paused to craft a reply, Paul Doscher (loudly) pronounced his belief that there was, much to the pleasure of the crowd.

There was not much else in the way of Microsoft news; but it was interesting to see how many people and how much effort Microsoft is putting into open source projects.



November 30, 2011

Odd Google Translate Encoding issue with Japanese

Was translating a comment in the Japanese SEN tokenization library.

It seems like if your text includes the Unicode right arrow character, Google somehow gets confused about the encoding.  Saw this on both Firefox and Safari.  Not a big deal, strangely comforting to see even the big guys trip up on character encodings.

OK: サセン
OK: チャセ
Not OK: サセンチャセ?


July 07, 2010

In Defense of "grep" / auto-substring Matching :-)

As some of you know, grep is the Unix utility that, in its simplest form, looks for literal strings in a file and prints out any matching lines. The database equivalent is the LIKE operator with percent signs before and after the string.

For years all of us fulltext search engine snobs have been saying "grep is not a search engine" (and by extension, neither is the LIKE operator in SQL), and that this type of literal matching is insufficient for real searching. For example, this type of simple matching won't get word variations like "run" and "ran", nor synonyms like "cool" and "cold".

From an implementation standpoint, the problem with grep is performance: it scans every line of every file to check each pattern. This is super slow if you have billions of documents. Search engines instead index all the documents ahead of time and create a highly optimized search index. At query time they consult that index, not the original source documents, to find specific words.
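To make the difference concrete, here's a minimal Python sketch of the two approaches. The two sample sentences, the naive tokenizer, and the function names are all made up for illustration; real engines use far more elaborate index structures.

```python
import re
from collections import defaultdict

# Two made-up sample documents, keyed by document id.
docs = {
    1: "There were marks on the surface.",
    2: "Mark wrote this blog post.",
}

def grep_scan(docs, needle):
    """grep-style: read every document on every query (linear in total text size)."""
    return sorted(doc_id for doc_id, text in docs.items() if needle in text)

def tokenize(text):
    """Deliberately naive tokenizer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Done once, ahead of time: map each whole word to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

index = build_index(docs)

print(grep_scan(docs, "ark"))              # [1, 2]  substring hit in both documents
print(sorted(index.get("marks", set())))   # [1]     one dictionary lookup, no scanning
print(sorted(index.get("ark", set())))     # []      a whole-word index misses substrings
```

The last line is the catch the rest of this post is about: the index is fast, but out of the box it only knows about complete words.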

But I find myself doing substring searches in a few of the systems I frequently use. In our CRM, when I don't remember the specific spelling of a person or company or product, I type in just 3 or 4 letters. This doesn't always work; sometimes it brings back junk, other times it misses the mark. But it's an easy search to edit and resubmit, so I can fire off 2 or 3 variations in short order. I also use substrings quite a bit when searching through source code. OpenGrok is a very nice Lucene-based search engine, and uses proper word breaks, but sometimes it actually doesn't find things I'm looking for because it's looking, by default, for complete words. Whereas when you're in the Eclipse editor, it uses substring searching by default, and you can look up substrings without thinking about it. Email is yet another application that, at least on some systems, starts looking up matches after just 2 or 3 letters. There's a special case: some systems will only match those 2 or 3 characters if they're at the start of a word, similar to many autocomplete instances.

I can hear some of you yelling "what about wildcards!?" - most engines will let you put abc* and match everything starting with abc. Search engines differ on whether or not you can use wildcards in the middle or start of the word, and some engines can do it IF you enable it. This is close... it's an improvement in that it doesn't do a linear scan of all the documents, it still consults the fulltext search index. But most folks forget to put the asterisk... or is it a percent sign? And can you put it in the middle or beginning, in your particular engine and configuration? Who knows!
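Why is a trailing wildcard usually the easy case? A rough sketch, assuming the engine keeps its list of indexed terms in sorted order (many do): every term starting with a given prefix sits in one contiguous run, so a binary search finds the start. The term list and the function name below are invented for illustration.

```python
from bisect import bisect_left

# Hypothetical term dictionary, kept sorted.
terms = sorted(["mar", "march", "mark", "marks", "market", "mask", "surface"])

def prefix_terms(sorted_terms, prefix):
    """Return every term that starts with prefix, i.e. what prefix* would expand to."""
    start = bisect_left(sorted_terms, prefix)   # binary search for the first candidate
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break                               # past the contiguous run of matches
        matches.append(term)
    return matches

print(prefix_terms(terms, "mark"))  # ['mark', 'market', 'marks']
```

A wildcard at the start or in the middle of a word gets no such help from the sort order, which is one plausible reason engines treat those cases differently or leave them switched off.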

So what's to be done? The good news is you really can "have your cake and eat it too!". Highly configurable search engines can be told to index the same text in several different ways. One internal index can have tokens that are the exact words. Another index can normalize the words down to lower case and perform "stemming", to normalize all the plurals to singular form, etc. These engines should also be able to be coaxed into storing all of the smaller chunks of words in yet another index. Of course substrings aren't as good as a full match. But search engines have an answer for this too! You can set the relevancy for these different indices with different weights. A substring match is OK... if there's nothing else... but if the full word matches, it should get extra credit, or an exact match scores even higher. And keep in mind you're not paying the performance penalty, it's using the index and not doing a literal scan of every file.

Enough techno-babble, let's walk through an example:

Your text has the sentence "There were marks on the surface.", and let's focus on the third word "marks". Then another sentence has "Mark wrote this blog post."

The word "marks" gets indexed several ways:

Exact index: marks

Stemmed index: mark

Single index: m a r k s

Double index: ma ar rk ks

Triples: mar ark rks

Then the term "Mark" is indexed as:

Exact index: Mark

Stemmed index: mark

Tuple index (combines the 1, 2 and 3): m a r k ma ar rk mar ark
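The breakdown above can be reproduced with a few lines of Python. This is only a sketch: the "stemmer" just strips a trailing "s", and the field names (exact, stemmed, single, double, triple) are labels borrowed from the lists above, not any particular engine's settings.

```python
def ngrams(word, n):
    """All contiguous chunks of length n, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def toy_stem(word):
    """Stand-in for a real stemmer: drops one trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def index_forms(word):
    """Every form of a word that would go into the separate indexes."""
    lowered = word.lower()
    return {
        "exact": word,
        "stemmed": toy_stem(lowered),
        "single": ngrams(lowered, 1),
        "double": ngrams(lowered, 2),
        "triple": ngrams(lowered, 3),
    }

for word in ("marks", "Mark"):
    print(word, index_forms(word))
```

Running it on "marks" gives the doubles ma, ar, rk, ks and the triples mar, ark, rks, matching the lists above.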

Kinda techie, but you can see that, as long as the same rules are applied to the search terms, we can easily match something.  If somebody doesn't remember if my name ended in a "c" or a "k", they can find me with just "mar". Now, if there are a million documents, that search will bring back LOTS of other documents with the substring "mar", albeit very quickly!

But if somebody searches for mark or Mark, extra credit will be given for matching more precise indices. Actual implementations would probably leave off the single letter index, the m, a, r and k stuff, as almost every document would have those. And this implementation would take more disk space, more time to index, etc. And substring matches would tend to bring back a lot of junk. But the good news is that folks wouldn't have to remember to add wildcard characters. In techie terms we'd say this "helps recall, but hurts precision". Another idea would be to NOT apply the substring matching by default, but perhaps offer a clickable option in the results list to "expand your search", which re-issues the same search with the substring matching turned on, and let the user decide.
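Here's a toy version of that "extra credit" idea in Python. The weights are numbers I picked out of the air, the stemmer is a stand-in, and real engines compute relevancy in much more sophisticated ways; the only point is that a substring-only match scores lowest and an exact match scores highest.

```python
# Hypothetical weights: more precise indexes earn more credit.
WEIGHTS = {"exact": 4.0, "stemmed": 2.0, "triple": 1.0}

def trigrams(word):
    """Set of 3-letter chunks (empty for words shorter than 3 letters)."""
    return {word[i:i + 3] for i in range(len(word) - 2)}

def toy_stem(word):
    """Stand-in for a real stemmer: drops one trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def index_forms(word):
    """The same rules are applied to indexed words and to search terms."""
    lowered = word.lower()
    return {
        "exact": {word},
        "stemmed": {toy_stem(lowered)},
        "triple": trigrams(lowered),
    }

def score(query, doc_word):
    """Add up the weight of every index in which the query overlaps the word."""
    q, d = index_forms(query), index_forms(doc_word)
    return sum(weight for field, weight in WEIGHTS.items() if q[field] & d[field])

print(score("mar", "Mark"))    # 1.0  substring match only
print(score("marc", "Mark"))   # 1.0  wrong last letter, still found via "mar"
print(score("mark", "Mark"))   # 3.0  stemmed + substring
print(score("Mark", "Mark"))   # 7.0  exact + stemmed + substring
```

Leaving "triple" out of WEIGHTS by default, and adding it back only when the user clicks "expand your search", would be one way to sketch the opt-in variation described above.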

Index-based automatic substring matching has its place, along with all of the other tools in the search engine arsenal. It's a nice option to have when searching over names, source code, chemicals, domain names, and other technical data. Whether it's turned on by default, and how it's weighted against better matches, are choices to be carefully weighed.

September 24, 2009

Fix error "getTextContent is undefined for the type Node" for Solr project in Eclipse IDE

The error:

"The method getTextContent() is undefined for the type Node"
You get 3 of these, in the source files ReutersService.java and TestConfig.java

A Web fix that doesn't work:

You'll see suggestions that org.w3c.dom.Node.getTextContent() is only available as of Java 1.5.  But when you check you see you ARE running with Java 1.5 or later.

You can quickly check this by right clicking on the project, choosing Properties -> Java Compiler, and confirming that 1.5 or above is selected in the drop down lists.

The fix, short story:

The order of the classpath needs to be tweaked in the Eclipse project; shove the xml-apis-1.0.b2.jar all the way to the bottom, past the built-in JVM libraries.

For more details, and how you would know this, read on!

Continue reading "Fix error "getTextContent is undefined for the type Node" for Solr project in Eclipse IDE" »