44 posts categorized "Technical"

March 09, 2010

Enterprise search engines: They're *not* all the same

We're in the process of doing a search engine evaluation for a large customer. That, by itself, isn't news: we do those quite a bit for companies large and small. No, what makes this project most interesting is that we are doing side-by-side comparisons of three leading search technologies using industry-standard data sets.

Our assumption going in was that, for out-of-box simple searches, all three engines would return pretty much the same set of results: after all, if TF/IDF (term frequency/inverse document frequency) was at the core of these technologies, they should be getting roughly the same result sets. Much to our surprise, if we look at the top 10 search results from each engine for a simple search, we get only about 15% overlap.
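
For readers who haven't looked under the hood, here is a minimal sketch of textbook TF/IDF scoring in Python. It is our own toy illustration, not the formula any of the three engines actually uses; the function name, the whitespace tokenizer, and the sample documents are all invented for the example.

```python
import math
from collections import Counter

def tf_idf_scores(query, docs):
    """Score each document against the query with a plain TF/IDF sum."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        score = 0.0
        for term in query.lower().split():
            if df[term] == 0:
                continue  # term appears in no document; contributes nothing
            idf = math.log(n_docs / df[term])  # rarer terms weigh more
            score += tf[term] * idf
        scores.append(score)
    return scores

# Hypothetical three-document corpus, invented for illustration.
docs = [
    "enterprise search engines compared",
    "search search search tips",
    "cooking with garlic",
]
scores = tf_idf_scores("enterprise search", docs)
```

Even in this toy version, choices like the tokenizer, the log base, and length normalization are left open, and real engines each make those choices differently.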

Let me explain it this way: if we retrieve ten search results for a specific query from one search engine, only 3 of the twenty results - 15% - were found by either of the other engines. In a typical list of 10 results, only 3 show up in more than one engine. We were especially amazed because we are going out of our way to use default parameters as much as possible: no entity extraction, no search tuning, no special synonyms or thesaurus terms.
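
To make the arithmetic concrete, here is a small Python sketch of one way to count overlap. It assumes one reading of the numbers above: take one engine's top 10 as the reference, then count how many of the 20 result slots from the other two engines hold a document that is also in the reference list. The function name and the document IDs are invented for illustration; this is not our actual test harness.

```python
def overlap_with_others(reference, others):
    """Return (hits, slots): how many result slots in the other engines'
    lists hold a document that also appears in the reference list."""
    ref = set(reference)
    hits = sum(1 for results in others for doc in results if doc in ref)
    slots = sum(len(results) for results in others)
    return hits, slots

# Hypothetical document IDs, invented purely for illustration.
engine_a = [f"doc{i}" for i in range(1, 11)]               # doc1..doc10
engine_b = ["doc1", "doc2"] + [f"b{i}" for i in range(8)]  # 2 shared with A
engine_c = ["doc3"] + [f"c{i}" for i in range(9)]          # 1 shared with A

hits, slots = overlap_with_others(engine_a, [engine_b, engine_c])
print(f"{hits} of {slots} = {hits / slots:.0%}")  # prints "3 of 20 = 15%"
```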

We're still too early in the process to understand what's behind this surprising situation: it's always possible the results are too tentative to make any judgments, or we could find an error in our methodology.  We're working on it, and we'll get back with any findings that we can share. If you have any explanations, leave a comment - we'd love to hear what you think.

/s/Miles

February 24, 2010

Enterprise Search Summit 2010 - DC

Even as we prepare for ESS East in New York (ESS NY from now on?), Information Today has issued its call for papers for the first ever ESS-DC to be held in Washington DC November 16-18 2010.

Follow this link to find background on what InfoToday is looking for; or jump right to the submissions page. Don't be shy: everyone who presents papers had, at one time, never done it before. What you know, someone else needs to know!

In our experience, the kind of content InfoToday likes is the information that can help an organization select or manage search and related technologies. Generally, real-world stories about how other companies and organizations have succeeded with search are the ones that attendees appreciate the most. 

We'll also be having a searchdev dinner at ESS DC this year. Details to come late in summer, but plan for it now!

Are you doing search now? Have you been successful getting it going on time and under budget? Tell your story. Submit your idea now!

January 20, 2010

Google I/O open for registration!

Google has announced its Google I/O 2010 to be held in San Francisco May 19-20 at the Moscone Center.

I think this is their third such annual event, and it's always been a full two days of information. The good news is the price is $400 per person (until April 15), a bargain really. The bad news? You'll need to bring four or five people from your company to hit all of the sessions in each track!

This conference is VERY technical, VERY good. You get the most from it if you are a developer and you know Java, Ajax, Python, or the other technologies Google uses in its various products. You won't find much in the way of marketing fluff here: in our experience, most presenters are Google developers.

The conference is being held the same week that the Gilbane content management conference comes back to San Francisco. Bad timing for them, but good for you: you can probably walk to the nearby Westin at lunch and maybe catch the exhibits.

Last year, attendees received a free phone for development purposes on the Android OpSys; who knows what they might give away this year - besides the expected cool T-shirt!

Register at http://code.google.com/events/io/2010/.

November 23, 2009

Webinar: Basics of Search and Relevancy with Solr

Lucid Imagination, the Lucene and Solr folks, are running a webinar featuring Mark Bennett, CTO of New Idea Engineering. The presentation is scheduled for Wednesday, December 2nd at 2:00PM Eastern/11AM Pacific time (1900 GMT if my calculations are correct). Read more about the event and register today!

The description of the sessions follows:

In this introductory technical presentation, renowned search expert Mark Bennett, CTO of search consultancy New Idea Engineering, will present practical tips and examples that web application developers can use to quickly get productive with Solr, including:

  • Working with the "web command line" to control your search
  • Understanding Solr's DISMAX parser
  • Using Solr's Explain output to tune your results relevance
  • Using Solr's Schema browser

Sign up today and get ready for some relevance.
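
If you haven't seen Solr's "web command line" before, it is just an HTTP query string. Here is a hedged Python sketch that builds such a URL. The request parameters (q, defType, qf, rows, fl, debugQuery) are standard Solr parameters; the host, port, and path assume a default local install, and the field names title and body are hypothetical, so substitute the fields from your own schema. The helper function is ours, not part of Solr.

```python
from urllib.parse import urlencode

def solr_dismax_url(base_url, user_query, query_fields, rows=10, debug=True):
    """Build a Solr select URL that uses the DisMax query parser."""
    params = {
        "q": user_query,                  # raw text typed by the user
        "defType": "dismax",              # select the DisMax parser
        "qf": " ".join(query_fields),     # fields to search, with optional ^boosts
        "rows": rows,                     # number of results to return
        "fl": "*,score",                  # return all stored fields plus the score
    }
    if debug:
        # Asks Solr to include its Explain output in the response.
        params["debugQuery"] = "true"
    return f"{base_url}?{urlencode(params)}"

# Assumed local install and hypothetical field names.
url = solr_dismax_url(
    "http://localhost:8983/solr/select",
    "enterprise search",
    ["title^2", "body"],
)
print(url)
```

Paste the resulting URL into a browser against a running Solr instance and look for the debug section of the response to see how each score was calculated.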

/s/Miles

KMWorld/ESS West moves to Washington DC 2010

Andrew McAfee of MIT and Harvard fame presented a great keynote last week at what may turn out to be the last ESS West. InfoToday has announced that KMWorld, and probably the Enterprise Search Summit, will take place in Washington, DC, next fall rather than in San Jose. ESS East is traditionally held in New York in May. While held at a smaller venue - the midtown Hilton Hotel - ESS East has always had a stronger feel to it, and apparently InfoToday will be looking for growth in the government sector.

Ironically, InfoToday recently acquired the Boston Search Engine Meeting from Infonortics, so they'll be running two shows in the spring (Boston in April and New York in May), leaving the west coast high and dry in terms of search conferences. Maybe the west coast companies are more comfortable in the 'do it yourself' search using Lucene and Solr; or maybe west coast companies just don't want to spend the time schmoozing at shows, when the real work gets done one-on-one.

In any case, look to Washington for KM World November 16-19 2010  at the Renaissance Washington DC Hotel. Should we call it ESS DC?

/s/Miles

November 18, 2009

SharePoint 2010 public beta now available

Microsoft has now released the beta of many (if not all) of the elements of SharePoint 2010 for testing. Don't be surprised that it's labeled 'Beta 2': This is the first public beta, although Beta 1 was available to select corporations and Microsoft MVP partners.

The release includes SharePoint, the 2010 version of Search Server, and most interestingly to us, the new release of FAST ESP in SharePoint. This latter beta includes what we believe is a full re-write of FAST ESP 5.3 to integrate it into SharePoint 2010, with a huge number of usability, management, and feature capabilities. Stay tuned for more on these differences and enhancements over the coming months.

Jie Li's GeekWorld site links to the Microsoft public download pages; but if your company is a Microsoft partner (MSPP) you'll find the downloads with other Microsoft code you can access.

You'll also find tips for getting everything working right on Jie Li's site as well as on Alex's SharePoint blog, where you'll find a pretty darned complete list of the 'gotchas' you should know about.

November 09, 2009

SearchDev Dinner in San Jose at ESS West

We've just put the final touches on the annual SearchDev dinner in conjunction with the Enterprise Search Summit West next week in San Jose, California. Anyone who attends the conference, or anyone in the Bay Area, is welcome to attend.

Lucid Imagination is sponsoring the dinner this year along with New Idea Engineering. It will be held on Wednesday night, November 18, at 6:30 PM in the San Jose Hilton, adjacent to the convention center.

Seats are limited, so if you think you will want to attend, please RSVP today to info(at)ideaeng.com with your name and names of the folks who will join you. Of course, replace the (at) with @...

Miles

November 06, 2009

Relevance by, for, and of the people...

Have you ever found yourself browsing a search result list, clicked on a result with a promising teaser, and been frustrated that the document didn't live up to its summary? Me too... you mutter 'this search sucks' to yourself, click the browser's Back link, and browse the result list again, hoping for a better result.

It seems the obsession with 'social search' has led a few of the best known search companies to tie click popularity back into the base relevance engine. Google recently announced Self-Learning Scorer as a new part of its latest Google Search Appliance update; and Microsoft announced similar interactive behavior ranking capability in both SharePoint and FAST ESP search - Behavioral Adaptation, one engineer called it.

Color us skeptical. We like the concept of click popularity, but we prefer to see it linked with a 'thumbs-up/thumbs-down' feedback mechanism. If people like the document they see, they won't bother telling you what a great job you did; but trust us, if it's not what they wanted, they will spend the extra few seconds to enter a negative vote. We've not been able to find out the details of the Google feature; Microsoft tells us that the recommendations have a 'time to live' of 30 days, so at least there's hope that crummy documents with great summaries won't fill the top spots of your search result lists.
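
To show what we mean, here is a minimal Python sketch of explicit thumbs-up/thumbs-down feedback with a 30-day time to live. To be clear, this is our own toy illustration and not how Google or Microsoft implement their features; the class, method names, and document names are all hypothetical.

```python
from datetime import datetime, timedelta

TIME_TO_LIVE = timedelta(days=30)  # the 30-day figure quoted above

class FeedbackBoost:
    """Tracks explicit votes per document and expires old ones."""

    def __init__(self):
        self._votes = {}  # doc_id -> list of (timestamp, +1 or -1)

    def vote(self, doc_id, thumbs_up, when):
        """Record one thumbs-up (+1) or thumbs-down (-1) vote."""
        self._votes.setdefault(doc_id, []).append((when, 1 if thumbs_up else -1))

    def boost(self, doc_id, now):
        """Net vote count, ignoring votes older than TIME_TO_LIVE."""
        cutoff = now - TIME_TO_LIVE
        return sum(v for ts, v in self._votes.get(doc_id, []) if ts >= cutoff)

# Hypothetical documents and votes, invented for illustration.
now = datetime(2009, 11, 6)
fb = FeedbackBoost()
fb.vote("misleading.pdf", thumbs_up=False, when=now - timedelta(days=2))
fb.vote("misleading.pdf", thumbs_up=False, when=now - timedelta(days=45))  # expired
fb.vote("handbook.doc", thumbs_up=True, when=now - timedelta(days=10))
```

A search front end could add the boost value to (or subtract it from) the engine's base relevance score; how much weight to give it is a tuning decision we leave open here.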

What do you think?

  

November 05, 2009

Call for Papers: Enterprise Search Summit East, May 2010

My friend Michelle Manafy over at Info Today has asked me to post their call for papers for Enterprise Search Summit East, May 11-12, 2010. ESS East has been one of the premier shows, and Michelle has updated the format to give attendees more face time with speakers and make the show more valuable.

If you're implementing search now, you're ahead of a lot of folks - share what you've learned! Submit a paper today! You've only got until November 30!

See you in New York!

/s/Miles

November 04, 2009

PDF - The New Legacy Data

In the old days companies referred to paper documents as "legacy data", boxes and boxes of important printed documents that were difficult to access.  If you've been in the industry for a while, you'll recall all the high speed scanning / OCR companies that cropped up to solve this problem.

Today virtually all documents and manuals are created electronically, and thanks to high-quality formats like Adobe PDF and numerous electronic distribution channels, documents tend to stay in digital format.

To me, PDF has replaced paper as the new "legacy" format - there's a ton of technical data now being published in this format.  And to paraphrase an old commercial, "data checks in, but it don't check out."

Of course there are ways to get all that tabular technical data back out of PDF and into more usable forms, but getting this right is not trivial and it's certainly not Adobe's priority.  We're not chiding them for this; their business model is clearly served by getting data INTO PDF. Acrobat can now export to XML, and other solutions can help you get the content out as well.

A similar case could be made for HTML, Word, Excel and PowerPoint.  Each of these formats has problems of its own.

PDF has some particular details that can thwart enterprise search:

  • Not all PDF files have searchable text, and users are generally unaware of the difference.
  • PDF files come in many dialects.
  • Tabular data in PDF is sometimes difficult for software to infer; humans easily see the rows and columns, but unlike other document formats, there is no intrinsic hierarchical document structure, just pixels, lines and text snippets with various X,Y coordinates.
  • Older PDF formats were not as capable when dealing with other languages, such as Arabic.

All of these issues have solutions, but all of them require some thought and careful tool selection.
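
As a rough illustration of the tabular-data point above, here is a Python sketch that rebuilds rows from positioned text snippets by grouping on the Y coordinate. It assumes some PDF extraction tool has already given you (x, y, text) tuples, with Y increasing up the page as in PDF's default coordinate space; the function name, the tolerance parameter, and the sample data are invented for the example. Real tables with wrapped cells, merged columns, or rotated text need considerably more care.

```python
def group_into_rows(snippets, tolerance=2.0):
    """Group (x, y, text) snippets into rows of text, top to bottom.

    A snippet joins the current row when its y is within `tolerance`
    of the y of the snippet that started that row."""
    rows = []
    # Highest y first (top of page), then left to right.
    for x, y, text in sorted(snippets, key=lambda s: (-s[1], s[0])):
        if rows and abs(rows[-1]["y"] - y) <= tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within each row, order the cells by x to recover the columns.
    return [[text for _, text in sorted(row["cells"])] for row in rows]

# Hypothetical snippets, as a PDF text extractor might report them.
snippets = [
    (200, 700.5, "Voltage"),
    (50, 700.0, "Part"),
    (50, 680.0, "A-100"),
    (200, 679.2, "5V"),
]
table = group_into_rows(snippets)
print(table)  # prints [['Part', 'Voltage'], ['A-100', '5V']]
```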

So the complexity of OCR has been replaced, on some level, with document filters, entity extraction, ETL and optimized full-text search.