12 posts categorized "Open Source"

January 10, 2012

ISYS filters to be used for SAP Platforms

ISYS announced today that SAP has selected ISYS Document Filters to replace software from both Autonomy and Oracle in SAP's popular suite of analytical products.

ISYS, which has marketed an enterprise search product successfully for years, recognized the need for high-capability, low-cost document filters and packaged its internally developed technology as a product. Because of their capabilities, support and price, ISYS Document Filters have become the best choice for companies that need to extract content from hundreds of different formats.

We particularly like that the ISYS filters are lightweight, easy to implement, and priced such that any company can afford to use them in-house or bundled with a product. Large companies that use Lucene/Solr for search but insist on supported, up-to-date filtering technology can solve the problem at a competitive price with ISYS.

 

 

November 28, 2011

Solr Disk and Memory Size Estimator (Excel worksheet)

If you do a standard checkout of the Lucene/Solr codebase you also get a dev-tools directory.  One interesting tidbit in there is an Excel spreadsheet for estimating the RAM and disk requirements for a given set of data.  Be sure to notice the tabs along the bottom: tab 2 is for memory/RAM estimates, and tab 3 is for disk space.

Full URL: http://svn.apache.org/repos/asf/lucene/dev/trunk/dev-tools/size-estimator-lucene-solr.xls

November 08, 2011

Are you spending too much on enterprise search?

If your organization uses enterprise search, or if you are in the market for a new search platform, you may want to attend our webinar next week, "Are you spending too much for search?". The one-hour session will address:

  • What do users expect?
  • Why not just use Google?
  • How much search do you need?
  • Is an RFI a waste of time?   

Date: Wednesday, November 16, 2011

Time: 11AM Pacific Standard Time / 1900 UTC

Register today!

August 09, 2011

So how many machines does *your* vendor suggest for a 100,000,000+ document dataset?

We've been chatting with folks lately about really large data sets: clients who have a problem, and vendors who claim they can help.

But a basic question keeps coming up - not licensing - but "how many machines will we need?"  Not everybody can put their data on a public cloud, and private clouds can't always spit out a dozen virtual machines to play with, plus duplicates of that for dev and staging, so it's not quite as trivial as some folks think.

The Tier-1 vendors can handle hundreds of millions of docs, sure, but usually on quite a few machines, plus of course their premium licensing, and some non-trivial setup at that point.

And as much as we love Lucene, Solr, Nutch and Hadoop, our tests show you need a fair number of machines if you're going to turn around a half billion docs in less than a week.
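To put "half a billion docs in less than a week" in perspective, here is a rough back-of-envelope sketch of the sustained indexing rate that target implies. The per-machine rate in the example is a made-up placeholder, not a measurement from our tests; plug in whatever your own benchmarks show.

```python
import math

SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds


def required_docs_per_second(total_docs, seconds=SECONDS_PER_WEEK):
    """Sustained rate needed to index total_docs within the time window."""
    return total_docs / seconds


def machines_needed(total_docs, per_machine_rate, seconds=SECONDS_PER_WEEK):
    """Machines needed if each one sustains per_machine_rate docs/second.

    Assumes perfectly linear scaling, which real clusters rarely achieve.
    """
    return math.ceil(total_docs / (per_machine_rate * seconds))


if __name__ == "__main__":
    docs = 500_000_000
    rate = required_docs_per_second(docs)
    print(f"Required sustained rate: {rate:.0f} docs/second")

    # Hypothetical per-machine throughput, for illustration only.
    assumed_rate = 100
    print(f"Machines at {assumed_rate} docs/s each: {machines_needed(docs, assumed_rate)}")
```

That works out to roughly 827 documents per second, around the clock, with no allowance for crawl failures, re-indexing, or downtime. Any realistic plan needs headroom on top of that.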

And beyond indexing time, once you start doing 3 or 4 facet filters, you also hit another performance knee.

We've got 4 Tier-2 vendors on our "short list" that might be able to reduce machine counts by a factor of 10 or more over the Tier-1 and open source guys.  But we'd love to hear your experiences.

December 05, 2010

Share your successes at ESS East next May

Our friends over at InfoToday who run the successful Enterprise Search Summit conferences have asked us to announce that the date for submitting papers to their Spring show in New York in May 2011 has been extended until Wednesday, December 8. You can find out what they are looking for and how to submit your proposal online at http://www.enterprisesearchsummit.com/Spring2011/CallForSpeakers.aspx.

Michelle Manafy, who runs the program again next May, really likes to have speakers who have found creative and successful ways to select, deploy, or manage ongoing enterprise search operations. We've co-presented with several of our customers in the past, and trust me, it's great fun and not bad for your career! And - no promises - the weather at ESS East has been great just about every year - and we've been there for nearly 6 years now!

A friend told me something years ago that I've always found helpful; I hope you'll take it to heart: 'Everything you know, someone else needs to know'. Don't worry if your search project isn't perfect, or that someone will find fault with what you've done. Trust me: there are many organizations newer to enterprise search than you are, and anything you found helpful will surely be valuable for them as well. And you get to attend all of the sessions, so you might learn more as well! A 'win-win' situation if I've ever seen one!

See you in New York!

/s/Miles

 

 

September 30, 2010

Solritas

Erik Hatcher wrote about Solritas: Solr 1.4's Hidden Gem last year. Solritas is a fancy name for VelocityResponseWriter, derived from the word Celeritas. It provides a simple Velocity-template-based translation layer that you can use to build a search user interface within a Solr environment.

It's enabled by default in LucidWorks for Solr 1.4. Eric Pugh discusses some of its improvements in Notes from using LucidWorks for Solr Distro. It doesn't support auto-completion out of the box. This thread gives some examples of how to use jQuery's auto-complete with it.

Solritas is also mentioned in Erik Hatcher's post on Solr Search User Interface Examples and in the slides for the Rapid Prototyping with Solr presentation.

September 04, 2010

Faster sorting for Farsi / "Iranian", Danish, Turkish, other atypical languages in Lucene/Solr

By default search engines sort results by relevance or "score", to try and bring the best match to the top of the results list. That's normally what users want, but occasionally you might want to sort by a different field, such as date, title or author. Lucene and Solr support this in various ways, as do many other search engines.

When it comes to sorting by title or author name, most languages sort words by similar rules, and this is the character ordering that's built into Unicode. But a few languages differ; they may treat accented characters differently, for example. Java includes the concept of a "locale" to represent some language differences, such as currency and date formats, and it can also encode differences in preferred sort order. However, locale-aware sorting is apparently not fast, so sorting in some languages can be slow, and there may be no locale at all for a specific language or dialect.

Lucene does include an alternate "collator" class that claims to fix this. It allows for non-default Unicode sorting rules, without the slowdown normally associated with locales. The doc mentions Farsi, Danish and Turkish as examples. Although I haven't tried it, since it's buried a bit in the code tree, I wanted to surface it in a post.

The top URL (in case formatting gets lost) is:

http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/collation

Usage scenarios are given in package.html
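To illustrate why code point order isn't enough, here is a small Python sketch of the general idea: turn each string into a language-specific sort key once, then sort by comparing keys. This is not the Lucene collation API (I haven't tried it, as noted above). The `danish_sort_key` function and the sample titles are invented for illustration, and the alphabet is simplified: it ignores the "aa" digraph and other real Danish collation rules.

```python
import unicodedata

# Danish alphabet order: a-z, followed by æ, ø, å (simplified).
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(DANISH_ALPHABET)}


def danish_sort_key(text):
    """Map a string to a tuple of per-letter ranks in Danish order.

    Characters outside the alphabet sort after all letters, by code point.
    """
    normalized = unicodedata.normalize("NFC", text).lower()
    return tuple(RANK.get(ch, len(RANK) + ord(ch)) for ch in normalized)


titles = ["Århus", "Zebra", "Ærø", "Odense", "Østerbro"]

# Plain sort compares Unicode code points: Å (U+00C5) lands before Æ and Ø.
by_code_point = sorted(titles)

# Keyed sort follows the Danish alphabet: Å comes last.
by_danish = sorted(titles, key=danish_sort_key)

if __name__ == "__main__":
    print("Code point order:", by_code_point)
    print("Danish order:    ", by_danish)
```

The appeal of key-based sorting is that the expensive, language-aware work happens once per value, and every comparison after that is a cheap comparison of precomputed keys.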

 

August 27, 2010

There's an Ant on your Southwest Leg!

The WSJ has an interesting article on how language affects how we think.  I particularly liked the example of an indigenous language where anything you discuss involves absolute cardinal directions (north, south, east, west etc.). You literally can't say "There is an ant on one of your legs". Instead you say something like "There's an ant on your southwest leg." To say hello you'd ask "Where are you going?", and an appropriate response might be, "A long way to the south-southwest. How about you?" If you don't know which way is which, you literally can't get past hello.

Dr. Kevin Lim reviewed Search Engine Society, a book which explores the effect search engines have on politics, culture and economics. He is not your typical reviewer, since he is also mentioned in the book for recording a large part of his life using cameras (one he wears, another at his desk that points at him) while a GPS device tracks his movements.

Google throws its weight behind Voice Search by Stephen Lawson discusses how voice search is based on statistical models of which sequences of words are most likely to occur, and how Google trains a new language model. Another example of that would be Midomi, a web site where you search for music by singing a fragment of the song.
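As a toy illustration of what a statistical model of word sequences can look like, here is a minimal bigram counter in Python. It is only a sketch of the general idea, not a description of how Google's voice search works, and the tiny query corpus is invented.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def most_likely_next(counts, word):
    """Return the most frequent follower of word, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Invented sample queries, for illustration only.
corpus = [
    "pizza near me",
    "pizza near the office",
    "coffee near me",
    "weather near me today",
]
model = train_bigrams(corpus)

if __name__ == "__main__":
    print(most_likely_next(model, "near"))
```

A recognizer can use counts like these to prefer a likely word sequence over an acoustically similar but improbable one.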

Multilingual Search Engine Breaks Language Barriers discusses how the UNL Society uses the pivot language UNL to return a precise answer in the language in which the question was formulated. This still seems to be a research project, with some related projects such as LACE trying to extract data from parallel corpora as a cheaper way to populate a lexical database.

XBRL Across The Language Divide by Jennifer Zaino discusses how XBRL (eXtensible Business Reporting Language) may be one of the few areas that benefits from the Monnet project, which attempts to "provide a semantics-based solution for accessing information across language barriers". It tries to "build software that breaks the link between conceptual information and linguistic expressions (the labels that point back to concepts in ontologies) for each language." When that works, it makes it easier and quicker to perform analytics across multiple languages.

The Cross-Language Evaluation Forum (CLEF) is working on infrastructure for testing, tuning and evaluation of systems that retrieve information in European languages, and on benchmarks to help test it. One of its papers, for example, compares lexical and algorithmic stemming in 9 languages using Hummingbird SearchServer.

August 04, 2010

First fully tested release of SMILA available

SMILA (SeMantic Information Logistics Architecture) is an Eclipse project that provides an extensible framework for building search applications to access unstructured information in the enterprise. It provides an integrated package based on Lucene that includes crawlers, connectors and the interfaces needed to manage it using existing infrastructure. The main goal of SMILA is to reduce the risk of investment and IT costs by providing a common development framework that can be used to build semantic applications and by standardizing a lot of the code.

SMILA attempts to provide economies of scale while providing the option to use highly specialized solutions or plug-ins as needed. It also provides the opportunity for a company to reuse interfaces from internal projects that use Lucene.

The first fully tested (to make certain there are no legal issues due to third party code) official release is available. Version 0.7 also adds Web Service API support and Solr integration (access to Apache Solr REST API). 

SMILA has been getting more German press (it was created by Empolis GmbH and Brox IT Solutions GmbH) in the last year but very little in this country. The last I spotted was as part of a 25 minute talk on Searching the Cloud - the EclipseRT Umbrella! at EclipseCon 2010 in March.

Version 0.9 is scheduled for November 30, 2010 and is supposed to include some more third party components (that have completed the IP process). It will be interesting to see if some of those components are from American companies and if they find a way to build bridges to other Eclipse projects that use semantic technologies. I found some newsgroup posts last year about creating a new Eclipse project to do that but nothing seems to have happened.

GitHub has a Chansonnier project based on SMILA, but it's part of the author's bachelor's degree thesis project. It is a search application that indexes songs imported from the web, with parameters like language and emotion. It's useful as a sample SMILA application that isn't part of the official distribution. The SMILA project has a lot of potential but hasn't found a way to appeal to a wider audience yet.

July 13, 2010

Next Generation of Curating Tools

Daniel Tunkelang has an interesting post about how Freebase Gridworks and Needlebase can be used to curate data. One of the screencasts shows how to use Gridworks to merge similar names using various methods, split multi-valued facets, create new facets, and morph linear scales to log scales as needed.

Jon Udell demonstrates how useful the combination of Gridworks and the PowerPivot business intelligence add-in for Excel can be in PowerPivot + Gridworks = Wow!