9 posts categorized "Java"

March 15, 2013

Open Source Search Myth 2: Potentially Expensive Customizations

This is part of a series addressing the misconception that open source search is too risky for companies to use. You can find the introduction to the series here; this is Part 2, and Part 3 covers Skills Required In House.

Part 2: Potentially Expensive Customization

Which is more expensive: open source or proprietary search platforms?

Commercial enterprise search vendors often quote man-years of effort to create and deploy what, in many cases, should be relatively straightforward site search.  Sure, there are tough issues: unusual security; the need to mark-up content as part of indexing; multi-language issues; and vaguely defined user requirements.

Not to single them out, but Autonomy implementations were legendary for taking years. Granted, this was usually eDiscovery search, so the sponsor - often a Chief Risk Officer - had no worries about budget. Anything that would keep the CRO and his/her fellow executives out of jail was reasonable. But even easier tasks, such as search-enabling an intranet site, took more time and effort than they needed to because no one scoped out the work. This is one reason so many IDOL projects hire large numbers of IDOL contractors for such long projects.

FAST was also famous for lengthy engagements. 

FAST once quoted a company we later worked with a one-year, $500K project to assist in moving from ESP Version 4.x to ESP Version 5.x. For all practical purposes, these two versions had the same user interface, the same API, and the same command line tools. Really? One year?

True story: I joked with one of the sales guys that FAST even wanted 6 months to roll out a web search for a small intranet; I thought two weeks was more like it. He put me on the spot a year later and challenged me to help one of his customers, and sure enough, we took almost a month to bring up search! But we had a constraint: the new FAST search had to be callable from the existing custom CMS, which had hard-coded calls to Verity K2 - the customer did not have time to re-write the CMS.

Thus, part of our SOW was to write a front-end that would accept search requests using the Verity K2 DLL; intercept the call; and perform the search in FAST ESP. Then, intercepting the K2 results list processing calls, it would deliver the FAST results to the CMS, which thought it was talking with Verity. And we did it in less than 20% of the time FAST wanted to index a generic HTML-based web site.

On the other hand, at LucidWorks we frequently have 5-day engagements to set up Solr and LucidWorks Search; index the user's content; and integrate results in the end-user application. I think for most engagements, other Solr and open source implementations are comparable.

Let me ask: which was the more "expensive" implementation?

November 30, 2011

Odd Google Translate Encoding issue with Japanese

Was translating a comment in the Japanese SEN tokenization library.

It seems like if your text includes the Unicode right arrow character, Google somehow gets confused about the encoding.  Saw this on both Firefox and Safari.  Not a big deal, strangely comforting to see even the big guys trip up on character encodings.

OK: サセン
OK: チャセ
Not OK: サセンチャセ?


November 29, 2011

10 Handy Things to Know about the Lucene / Solr Source Code

It's funny how certain facts are "obvious" to some folks, stuff they've known a long time, but come as a pleasant surprise to others.  Chances are you know at least half of these, but no harm in double checking!

  1. Although Lucene and Solr are available in binary form, most serious users are eventually going to need some custom code.  If you post questions on the mailing lists, I think the assumption is you're comfortable with compilers, source code control and patches.  So it's a good habit to get into early on.
  2. Lucene and Solr source code were combined a while back (circa March 2010), so it's now one convenient checkout.
  3. You'll want to be using the Java 6 JDK to work with recent versions of Lucene / Solr.
  4. Lucene/Solr use the ant build tool by default.  BUT did you know that the build file can also generate project files for Eclipse, IntelliJ and Maven?  So you can use your favorite tool.  (See the README.txt file for info and links)
  5. Lucene/Solr use the Subversion / SVN source code control system.  There are clients for Windows and plugins for Eclipse and IntelliJ. (Mac OS X has it built in)
  6. You're allowed to do read-only checkout without needing any sort of login - checkouts are completely open to the public.  This is news to folks who've used older or more secure systems.
  7. Although checking any changes back in would require a login, it's more common to post patches to the bug tracking system or mailing list, and then let the main committers review and check in the patch.  Even the read-only checkouts create enough information on your machine to generate patches from your local changes.
  8. Doing a checkout, either public or with a login, does not "lock" anything.  This is also a surprise to folks used to older systems.  This non-locking checkout is why anonymous users can be allowed to check out code - there's no need to coordinate checkouts.
  9. The read-only current source for the combined Lucene + Solr is at http://svn.apache.org/repos/asf/lucene/dev/trunk  Even though it's an http link, and can be browsed with a web browser, it's also a valid Subversion URL.
  10. The "contribute" wiki pages for Lucene and Solr have more info about the source code and patch process.

November 28, 2011

Solr Disk and Memory Size Estimator (Excel worksheet)

If you do a standard checkout of the Lucene/Solr codebase you also get a dev-tools directory.  One interesting tidbit in there is an Excel spreadsheet for estimating the RAM and disk requirements for a given set of data.  Be sure to notice the tabs along the bottom: tab 2 is for memory/RAM estimates, and tab 3 is for disk space.

Full URL: http://svn.apache.org/repos/asf/lucene/dev/trunk/dev-tools/size-estimator-lucene-solr.xls

July 30, 2011

Java 7: Five days is just not enough time

You may have heard that the recent release of Java 7 has what sound to me like some serious problems, which are discussed on the Lucid Imagination blog. The most telling line I found there:

"These problems were detected only 5 days before the official Java 7 release,
so Oracle had no time to fix those bugs, affecting also many more

Granted, this is from Uwe Schindler as quoted on Lucid's site - not directly from Oracle. But I have to wonder about any product that is released with known serious flaws like this when they ONLY had five days' notice. I've seen software halted hours before its intended release to investigate a potentially serious bug; did Oracle have to meet revenue at end of quarter, and 'damn the torpedoes'?  

I know Oracle is not (the old) Hewlett Packard, which had a legendary commitment to quality. When bugs were found in the earliest HP 3000, Dave Packard made his sales reps buy them back from customers - even from those customers who were happy with their purchase. The reason: the system did not perform as advertised. If every one of your users is going to be impacted, some in very subtle ways that may produce incorrect results - wouldn't you agree that five days is enough time to stop the presses?

Tell me what you think...



September 04, 2010

Faster sorting for Farsi / "Iranian", Danish, Turkish, other atypical languages in Lucene/Solr

By default search engines sort results by relevance or "score", to try and bring the best match to the top of the results list. That's normally what users want, but occasionally you might want to sort by a different field, such as date, title or author. Lucene and Solr support this in various ways, as do many other search engines.

When it comes to sorting by titles or author names, most languages sort words with similar rules, and this is the character ordering that's built into Unicode. But a few languages are different; they may have different policies on accented characters, for example. Java includes the concept of "locale" to represent some language differences, such as currency and date formats, and it can also encode these differences in preferred sort order. However, apparently the performance isn't great, so sorting in some languages can be slow, or there may not be a locale for a specific language/dialect.
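To make the locale idea concrete, here is a minimal sketch using the JDK's own java.text.Collator (not the Lucene collator class). The class name LocaleSortDemo, the helper sortFor and the sample titles are made up for illustration, and it assumes the JVM ships Danish collation rules, which standard JDK builds normally do.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class: compares plain String ordering with locale-aware ordering.
public class LocaleSortDemo {

    // Returns a sorted copy of the titles using the collation rules of the given locale.
    static List<String> sortFor(Locale locale, List<String> titles) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<String>(titles);
        Collections.sort(sorted, collator);
        return sorted;
    }

    // Returns a sorted copy using String.compareTo, i.e. raw UTF-16 code unit order.
    static List<String> sortNatural(List<String> titles) {
        List<String> sorted = new ArrayList<String>(titles);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        // \u00D8 is the letter O with stroke, written as an escape to avoid source encoding issues.
        List<String> titles = Arrays.asList("\u00D8rsted", "apple", "Zebra");

        System.out.println("Natural order: " + sortNatural(titles));
        System.out.println("Danish order:  " + sortFor(Locale.forLanguageTag("da-DK"), titles));
    }
}
```

Plain String ordering puts the uppercase "Zebra" ahead of the lowercase "apple", while the Danish collator ignores case at the first level and places the O with stroke after Z, as the Danish alphabet does. Each Collator comparison does noticeably more work than a raw character comparison, which is a likely source of the slowdown mentioned above.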

Lucene does include an alternate "collator" class that claims to fix this. It allows for non-default Unicode sorting rules, without the slowdown normally associated with locales. The doc mentions Farsi, Danish and Turkish as examples. Although I haven't tried it, since it's buried a bit in the code tree, I wanted to surface it in a post.

The top URL (in case formatting gets lost) is:


Usage scenarios are given in package.html


August 04, 2010

First fully tested release of SMILA available

SMILA (SeMantic Information Logistics Architecture) is an Eclipse project that provides an extensible framework for building search applications to access unstructured information in the enterprise. It provides an integrated package based on Lucene that includes crawlers, connectors and the interfaces needed to manage it using existing infrastructure. The main goal of SMILA is to reduce the risk of investment and IT costs by providing a common development framework that can be used to build semantic applications and by standardizing a lot of the code.

SMILA attempts to provide economies of scale while providing the option to use highly specialized solutions or plug-ins as needed. It also provides the opportunity for a company to reuse interfaces from internal projects that use Lucene.

The first fully tested (to make certain there are no legal issues due to third party code) official release is available. Version 0.7 also adds Web Service API support and Solr integration (access to Apache Solr REST API). 

SMILA has been getting more German press (it was created by Empolis GmbH and Brox IT Solutions GmbH) in the last year but very little in this country. The last I spotted was as part of a 25 minute talk on Searching the Cloud - the EclipseRT Umbrella! at EclipseCon 2010 in March.

Version 0.9 is scheduled for November 30, 2010 and is supposed to include some more third party components (that have completed the IP process). It will be interesting to see if some of those components are from American companies and if they find a way to build bridges to other Eclipse projects that use semantic technologies. I found some newsgroup posts last year about creating a new Eclipse project to do that but nothing seems to have happened.

GitHub has a Chansonnier project based on SMILA, but it's part of the author's bachelor's degree thesis project. It is a search application that indexes songs imported from the web, with parameters like language and emotion. It's useful as a sample SMILA application that isn't part of the official distribution. The SMILA project has a lot of potential but hasn't found a way to appeal to a wider audience yet.

January 20, 2010

Google I/O Open for registration!

Google has announced its Google I/O 2010 to be held in San Francisco May 19-20 at the Moscone Center.

I think this is their third such annual event, and it's always been a full two days of information. The good news is the price is $400 per person (until April 15), a bargain really. The bad news? You'll need to bring four or five people from your company to hit all of the sessions in each track!

This conference is VERY technical, VERY good. You get the most from it if you are a developer who knows Java, Ajax, Python, or the other technologies Google uses in its various products. You won't find much in the way of marketing fluff here: in our experience, most presenters are Google developers.

The conference is being held the same week that the Gilbane content management conference comes back to San Francisco. Bad timing for them, but good for you: you can probably walk to the nearby Westin at lunch and maybe catch the exhibits.

Last year, attendees received a free phone for development purposes on the Android OpSys; who knows what they might give away this year - besides the expected cool T-shirt!

Register at http://code.google.com/events/io/2010/.

September 24, 2009

Fix error "getTextContent is undefined for the type Node" for Solr project in Eclipse IDE

The error:

"The method getTextContent() is undefined for the type Node"
You get 3 of these, in the source files ReutersService.java and TestConfig.java

A Web fix that doesn't work:

You'll see suggestions that org.w3c.dom.Node.getTextContent() is only available as of Java 1.5.  But when you check you see you ARE running with Java 1.5 or later.

You can quickly check this by right-clicking on the project, choosing Properties -> Java Compiler, and confirming that 1.5 or above is selected in the drop-down lists.

The fix, short story:

The order of the classpath needs to be tweaked in the Eclipse project; shove the xml-apis-1.0.b2.jar all the way to the bottom, past the built-in JVM libraries.
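If you want to see for yourself which copy of org.w3c.dom.Node is winning, a rough diagnostic along these lines may help. It is only a sketch: the class name NodeOriginCheck and its helper methods are invented for illustration, and it inspects the runtime classpath of whatever launch configuration you run it under, which is related to but not the same thing as the compile-time build path order that Eclipse complains about.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

// Hypothetical diagnostic: reports where a class was loaded from and whether it has a given method.
public class NodeOriginCheck {

    // Describes the jar or directory a class came from; JVM built-in classes usually have no code source.
    static String describeOrigin(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "JVM built-in libraries";
        }
        return source.getLocation().toString();
    }

    // Returns true if the class or interface exposes a public method with the given name.
    static boolean hasMethod(Class<?> type, String methodName) {
        for (Method method : type.getMethods()) {
            if (method.getName().equals(methodName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Class<?> node = org.w3c.dom.Node.class;
        System.out.println("Node loaded from: " + describeOrigin(node));
        System.out.println("Has getTextContent(): " + hasMethod(node, "getTextContent"));
    }
}
```

If the reported origin is an old xml-apis jar and getTextContent() is missing, that jar is shadowing the newer interface bundled with the JVM, which matches the symptom described above.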

For more details, and how you would know this, read on!

Continue reading "Fix error "getTextContent is undefined for the type Node" for Solr project in Eclipse IDE" »