
5 posts from July 2010

July 22, 2010

Document filters webinar July 28 2010

ISYS is hosting a webinar on Wednesday, July 28 at 1PM Eastern to talk about the role document filters play in successful search indexing and display. You can register now.

Of course, as a search technology company, ISYS has enjoyed great success, particularly among law enforcement customers, where search has to work right at a reasonable price. We've always liked their technology and their approach.

But like every search platform, ISYS needed filters to convert so-called 'binary' formats like Microsoft Office, PDF, or even Photoshop files into a stream of text - after all, today's search platforms primarily operate on words in textual format. But when ISYS looked at the market at the time, they found that two of their competitors, Autonomy and Oracle, owned the best of the filter technologies.

Like any company, they made a 'make or buy' decision, and in their case, making their own filters was the right answer for them - and possibly for you. You see, ISYS decided to start selling their filter technology independent of their search platform, so now you can acquire some really great filtering and viewing technology for just about any search engine, 'off the shelf'. Their customers include other vendors that need to extract text from various types of content - not just search vendors, but also eDiscovery and eCompliance companies and many others who don't want to pay excessive prices, and who want really great filtering at a reasonable cost.

Then, a few years back, ISYS decided that open source platforms Lucene and Solr - which had no filters - needed them as well. So now you can buy a great filter pack 'off the shelf' with no huge volume commitment - no volume commitment at all! And you can get world class filtering for your open source search project.

Come hear ISYS, the guys from Lucid Imagination, and us here at New Idea Engineering talk about the critical role of filters in your search applications. See you then!


July 20, 2010

Search in SharePoint 2010: Microsoft options

SharePoint 2010 seems to be gaining traction both among its existing customer base and with companies looking for new web content management (WCM) systems.

Great search is critical to a successful WCM deployment, and just about every serious search technology has a way to connect to SharePoint. Microsoft, not to be outdone by its competitors, has five distinct search engines for SharePoint 2010:

  • SharePoint Foundation 2010 Search
  • Search Server Express
  • Search Server 2010
  • SharePoint Server 2010
  • FAST Search Server 2010 for SharePoint

The Foundation search is basic search within SharePoint only, and is the new name for WSS search. And of course, the FAST search is the new implementation of FAST ESP with tight SharePoint integration, plus all of the scaling people expect from a top-flight enterprise search engine.

What's interesting is the other three options: Express, Search Server, and SharePoint Server. All three seem to be based on the same codebase, with limitations imposed more by licensing than by technology. Express is free, and scales from 300K to 10M documents, depending on which back-end SQL Server database is in use.

Search Server and SharePoint Server both scale to roughly 100M documents, and can scale out with multiple servers and instances. SharePoint Server is a superset, however, in that it supports integration with social content in SharePoint (think 'expert finder' and 'social proximity', among other capabilities).

Confused yet? There is a document - a wall chart, really - that explains the differences. You can get your own copy, in Visio file format, from the Microsoft Download Center.

Of course, there are also FAST Search for Internet Sites (FSIS) and FAST Search for Internal Applications (FSIA), which is really the existing FAST ESP 5.3.

If you think you're confused, be glad you're not a Microsoft - or partner - sales rep, trying to explain this to companies.

Stay tuned.


July 13, 2010

Next Generation of Curating Tools

Daniel Tunkelang has an interesting post about how Freebase Gridworks and Needlebase can be used to curate data. One of the screencasts shows how to use Gridworks to merge similar names using various methods, split multi-valued facets, create new facets, and morph linear scales to log scales as needed.

Jon Udell demonstrates how useful the combination of Gridworks and the PowerPivot business intelligence add-in for Excel can be in PowerPivot + Gridworks = Wow!


July 08, 2010

Yahoo declares Hadoop ready for the Enterprise

Alex Williams posted several months ago about Hadoop gaining more commercial acceptance. The open source distributed computing framework is used by Amazon, Facebook, Google, LinkedIn, Yahoo, Windows Azure and other major web/technology companies. IBM recently announced InfoSphere BigInsights, which uses Hadoop and BigSheets to analyze large volumes of data using a browser.

Cloudera is probably the best-known startup selling add-ons and support for Hadoop. An article in the NYTimes describes Datameer's Hadoop Helper, a tool that tries to make Hadoop easier for IT departments to use (without having to consult Hadoop wizards), combining their own interface with numerous mathematical and software tools. They also added support for the Katta indexing system to improve performance.

David Needle provides more details on Yahoo's recent announcement at the Hadoop Developer Conference that Hadoop is ready for broader enterprise use due to support being added for Kerberos and the Oozie workflow engine.

July 07, 2010

In Defense of "grep" / auto-substring Matching :-)

As some of you know, grep is the Unix utility that, in its simplest form, looks for literal strings in a file and prints out any matching lines. The database equivalent is the LIKE operator with percent signs before and after the string.

For years all of us fulltext search engine snobs have been saying "grep is not a search engine" (and by extension, neither is the LIKE operator in SQL), and that this type of literal matching is insufficient for real searching. For example, this type of simple matching won't get word variations like "run" and "ran", nor synonyms like "cool" and "cold".

From an implementation standpoint, the problem with grep is performance: it scans every line of every file to check each pattern. This is super slow if you have billions of documents. Instead, search engines index all the documents ahead of time and create a highly optimized search index. The engine consults that index, not the original source documents, to search for specific words.
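To make the difference concrete, here's a minimal sketch in Python - the documents and words are invented for illustration - contrasting a grep-style scan with a one-time inverted index lookup:

```python
# Minimal sketch: grep-style scan vs. an inverted index lookup.
# The documents here are made up for illustration.

docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "a quick red fox",
}

# grep-style: re-scan every document on every query
def scan_search(word):
    return [doc_id for doc_id, text in docs.items() if word in text.split()]

# build the index once, up front: word -> set of document ids
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

# index lookup: a single dictionary access per query term
def index_search(word):
    return sorted(index.get(word, set()))

print(scan_search("fox"))   # [1, 3]
print(index_search("fox"))  # [1, 3] - same answer, no document scan
```

Both return the same documents; the difference is that the scan's cost grows with the total text, while the lookup's cost doesn't.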

But I find myself doing substring searches in a few of the systems I frequently use. In our CRM, when I don't remember the specific spelling of a person, company, or product, I type in just 3 or 4 letters. This doesn't always work: sometimes it brings back junk, other times it misses the mark. But it's an easy search to edit and resubmit, so I can fire off 2 or 3 variations in short order. I also use substrings quite a bit when searching through source code. OpenGrok is a very nice Lucene-based search engine that uses proper word breaks, but sometimes it doesn't find things I'm looking for because, by default, it looks for complete words. The Eclipse editor, by contrast, uses substring searching by default, so you can look up substrings without thinking about it. Email is yet another application that, at least on some systems, starts looking up matches after just 2 or 3 letters. As a special case, some systems will only match those 2 or 3 characters if they're at the start of a word, similar to many autocomplete implementations.

I can hear some of you yelling "what about wildcards!?" - most engines will let you type abc* and match everything starting with abc. Search engines differ on whether you can use wildcards in the middle or at the start of a word, and some engines can do it IF you enable it. This is close... it's an improvement in that it doesn't do a linear scan of all the documents; it still consults the fulltext search index. But most folks forget to put the asterisk... or is it a percent sign? And can you put it in the middle or beginning, in your particular engine and configuration? Who knows!

So what's to be done? The good news is you really can "have your cake and eat it too!" Highly configurable search engines can be told to index the same text in several different ways. One internal index can hold tokens that are the exact words. Another index can normalize the words down to lower case and perform "stemming" to normalize all the plurals to singular form, etc. These engines can also be coaxed into storing all of the smaller chunks of words in yet another index. Of course a substring match isn't as good as a full match, but search engines have an answer for this too! You can give these different indices different relevancy weights. A substring match is OK... if there's nothing else... but if the full word matches, it should get extra credit, and an exact match scores even higher. And keep in mind you're not paying the performance penalty: the engine is using the index, not doing a literal scan of every file.

Enough techno-babble; let's walk through an example:

Your text has the sentence "There were marks on the surface." - let's focus on the third word, "marks". Then another sentence has "Mark wrote this blog post."

The word "marks" gets indexed several ways:

Exact index: marks

Stemmed index: mark

Single index: m a r k s

Double index: ma ar rk ks

Triples: mar ark rks

Then the term "Mark" is indexed as:

Exact index: Mark

Stemmed index: mark

Tuple index (combines the 1, 2 and 3): m a r k ma ar rk mar ark

Kinda techie, but you can see that, as long as the same rules are applied to the search terms, we can easily match something. If somebody doesn't remember whether my name ends in a "c" or a "k", they can find me with just "mar". Now, if there are a million documents, that search will bring back LOTS of other documents with the substring "mar" - albeit very quickly!
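Here's a quick Python sketch of the substring tokenization above - the code is a simplified illustration, not any particular engine's implementation - generating the doubles and triples for "marks" and "Mark" and showing that a query for "mar" overlaps both:

```python
# Sketch of character n-gram ("chunks of words") indexing; illustration only.

def ngrams(word, n):
    """All substrings of length n, e.g. ngrams('marks', 2) -> ['ma','ar','rk','ks']."""
    w = word.lower()
    return [w[i:i + n] for i in range(len(w) - n + 1)]

def substring_tokens(word):
    # doubles + triples, as in the example above (skipping the single letters)
    return set(ngrams(word, 2) + ngrams(word, 3))

# "mar" shares tokens with both "marks" and "Mark"
query = substring_tokens("mar")
print(sorted(query & substring_tokens("marks")))  # ['ar', 'ma', 'mar']
print(sorted(query & substring_tokens("Mark")))   # ['ar', 'ma', 'mar']
```

Any document whose tokens overlap the query's tokens is a candidate match, which is exactly why "mar" finds both spellings.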

But if somebody searches for mark or Mark, extra credit will be given for matching the more precise indices. Actual implementations would probably leave off the single-letter index - the m, a, r and k stuff - as almost every document would contain those. This implementation would also take more disk space, more time to index, etc., and it would tend to bring back a lot of junk. But the good news is that folks wouldn't have to remember to add wildcard characters. In techie terms, we'd say this "helps recall, but hurts precision". Another idea would be to NOT apply the substring matching by default, but perhaps offer a clickable option in the results list to "expand your search", which re-issues the same search with the substring matching turned on, and lets the user decide.
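The weighting idea can be sketched in a few lines of Python - the weights and the toy one-rule "stemmer" are invented for illustration, and real engines combine these scores in far more sophisticated ways:

```python
# Sketch of weighted matching across exact / stemmed / substring indices.
# The weights and the toy stemmer are invented for illustration.

WEIGHTS = {"exact": 3.0, "stemmed": 2.0, "substring": 1.0}

def stem(word):
    # toy stemmer: just strips a trailing 's' (real stemmers do much more)
    w = word.lower()
    return w[:-1] if w.endswith("s") else w

def trigrams(word):
    w = word.lower()
    return {w[i:i + 3] for i in range(len(w) - 2)}

def score(query, term):
    # each index that matches adds its weight to the score
    s = 0.0
    if query == term:
        s += WEIGHTS["exact"]
    if stem(query) == stem(term):
        s += WEIGHTS["stemmed"]
    if trigrams(query) & trigrams(term):
        s += WEIGHTS["substring"]
    return s

# An exact hit outranks a stemmed hit, which outranks a bare substring hit.
print(score("marks", "marks"))  # 6.0  (exact + stemmed + substring)
print(score("mark", "marks"))   # 3.0  (stemmed + substring)
print(score("mar", "marks"))    # 1.0  (substring only)
```

The point is the ordering, not the particular numbers: full-word and exact matches float to the top, while substring-only matches still show up rather than being missed entirely.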

Index-based automatic substring matching has its place, along with all of the other tools in the search engine arsenal. It's a nice option to have when searching over names, source code, chemicals, domain names, and other technical data. Whether it's turned on by default, and how it's weighted against better matches, are choices to be carefully weighed.