
7 posts from November 2009

November 23, 2009

Webinar: Basics of Search and Relevancy with Solr

Lucid Imagination, the Lucene and Solr folks, are running a webinar featuring Mark Bennett, CTO of New Idea Engineering. The presentation is scheduled for Wednesday, December 2nd at 2:00 PM Eastern/11:00 AM Pacific time (1900 GMT if my calculations are correct). Read more about the event and register today!

The description of the session follows:

In this introductory technical presentation, renowned search expert Mark Bennett, CTO of search consultancy New Idea Engineering, will present practical tips and examples that web application developers can use to quickly get productive with Solr, including:

  • Working with the "web command line" to control your search
  • Understanding Solr's DISMAX parser
  • Using Solr's Explain output to tune your results relevance
  • Using Solr's Schema browser
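As a rough taste of the "web command line" idea, here is a minimal sketch of our own (not material from the webinar) that builds a Solr query URL selecting the DisMax parser and asking for Explain output. The host, port, and the field names `title` and `body` are assumptions for illustration; `defType`, `qf`, `fl`, and `debugQuery` are standard Solr request parameters.

```python
from urllib.parse import urlencode

def build_dismax_url(user_query, base_url="http://localhost:8983/solr/select"):
    """Build a Solr search URL that uses the DisMax query parser.

    base_url and the field names below are hypothetical; adjust them
    to match your own Solr instance and schema.
    """
    params = {
        "q": user_query,          # the raw text the user typed
        "defType": "dismax",      # select the DisMax query parser
        "qf": "title^2.0 body",   # fields to search, with title boosted (assumed fields)
        "fl": "id,title,score",   # fields to return, including the score
        "debugQuery": "true",     # include Explain output in the response
    }
    return base_url + "?" + urlencode(params)

url = build_dismax_url("enterprise search")
print(url)
```

Paste the resulting URL into a browser and tweak the parameters by hand to see how the ranking and the Explain section change.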

Sign up today and get ready for some relevance.


KMWorld/ESS West moves to Washington DC 2010

Andrew McAfee of MIT and Harvard fame presented a great keynote last week at what may turn out to be the last ESS West. InfoToday has announced that KMWorld, and probably the Enterprise Search Summit, will take place in Washington, DC, next fall rather than in San Jose. ESS East is traditionally held in New York in May. While held at a smaller venue - the midtown Hilton Hotel - ESS East has always had a stronger feel to it, and apparently InfoToday will be looking for growth in the government sector.

Ironically, InfoToday recently acquired the Boston Search Engine Meeting from Infonortics, so they'll be running two shows in the spring (Boston in April and New York in May), which leaves the west coast high and dry in terms of search conferences. Maybe the west coast companies are more comfortable with 'do it yourself' search using Lucene and Solr; or maybe west coast companies just don't want to spend the time schmoozing at shows, when the real work gets done one-on-one.

In any case, look to Washington for KMWorld, November 16-19, 2010, at the Renaissance Washington DC Hotel. Should we call it ESS DC?


November 18, 2009

SharePoint 2010 public beta now available

Microsoft has now released the beta of many (if not all) of the elements of SharePoint 2010 for testing. Don't be surprised that it's labeled 'Beta 2': this is the first public beta, although Beta 1 was available to select corporations and Microsoft MVP partners.

The release includes SharePoint, the 2010 version of Search Server, and most interestingly to us, the new release of FAST ESP in SharePoint. This latter beta includes what we believe is a full rewrite of FAST ESP 5.3 to integrate it into SharePoint 2010, with a huge number of usability, management, and feature enhancements. Stay tuned for more on these differences and enhancements over the coming months.

Jie Li's GeekWorld site links to the Microsoft public download pages; but if your company is a Microsoft partner (MSPP), you'll find the downloads with the other Microsoft code you can access.

You'll also find tips for getting everything working right on Jie Li's site as well as on Alex's SharePoint blog, where you'll find a pretty darned complete list of the 'gotchas' you should know about.

November 09, 2009

SearchDev Dinner in San Jose at ESS West

We've just put the final touches on the annual SearchDev dinner in conjunction with the Enterprise Search Summit West next week in San Jose, California. Anyone who attends the conference, or anyone in the Bay Area, is welcome to attend.

Lucid Imagination is sponsoring the dinner this year along with New Idea Engineering. It will be held on Wednesday night, November 18, at 6:30 PM in the San Jose Hilton, adjacent to the convention center.

Seats are limited, so if you think you will want to attend, please RSVP today to info(at)ideaeng.com with your name and names of the folks who will join you. Of course, replace the (at) with @...


November 06, 2009

Relevance by, for, and of the people...

Have you ever found yourself browsing a search result list, clicked on a result with a promising teaser, and been frustrated that the document didn't live up to its summary? Me too... you mutter 'this search sucks' to yourself, click the browser's Back link, and browse the result list again, hoping for a better result.

It seems the obsession with 'social search' has led a few of the best known search companies to tie click popularity back into the base relevance engine. Google recently announced Self-Learning Scorer as a new part of its latest Google Search Appliance update; and Microsoft announced a similar interactive behavior ranking capability in both SharePoint and FAST ESP search - Behavioral Adaptation, one engineer called it.

Color us skeptical. We like the concept of click popularity, but we prefer to see it linked with a 'thumbs-up/thumbs-down' feedback mechanism. If people like the document they see, they won't bother telling you what a great job you did; but trust us, if it's not what they wanted, they will spend the extra few seconds to enter a negative vote. We've not been able to find out the details of the Google feature; Microsoft tells us that the recommendations have a 'time to live' of 30 days, so at least there's hope that crummy documents with great summaries won't fill the top spots of your search result lists.
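To make the idea concrete, here is a minimal sketch of how we imagine clicks, explicit votes, and a 30-day 'time to live' could combine into a per-document boost. This is our own illustration, not how Google or Microsoft implement their features; the weights and the `feedback_boost` name are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: feedback expires after 30 days, mirroring the TTL mentioned above.
TTL = timedelta(days=30)

# Hypothetical weights: an explicit thumbs-down counts for more than a passive click.
WEIGHTS = {"click": 1.0, "up": 2.0, "down": -3.0}

def feedback_boost(events, now):
    """Sum the weights of all feedback events that are still within the TTL.

    events: list of (timestamp, kind) tuples, where kind is a key of WEIGHTS.
    """
    total = 0.0
    for timestamp, kind in events:
        if now - timestamp <= TTL:   # expired feedback is ignored
            total += WEIGHTS[kind]
    return total

now = datetime(2009, 11, 6)
events = [
    (now - timedelta(days=1), "click"),   # recent click on a promising teaser
    (now - timedelta(days=2), "down"),    # the document disappointed
    (now - timedelta(days=45), "up"),     # too old, no longer counted
]
print(feedback_boost(events, now))
```

In this sketch a single negative vote outweighs the click that preceded it, so a crummy document with a great summary drifts down rather than up.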

What do you think?


November 05, 2009

Call for Papers: Enterprise Search Summit East, May 2010

My friend Michelle Manafy over at Info Today has asked me to post their call for papers for Enterprise Search Summit East, May 11-12, 2010. ESS East has been one of the premier shows, and Michelle has updated the format to give attendees more face time with speakers and make the show more valuable.

If you're implementing search now, you're ahead of a lot of folks - share what you've learned! Submit a paper today! You've only got until November 30!

See you in New York!


November 04, 2009

PDF - The New Legacy Data

In the old days companies referred to paper documents as "legacy data", boxes and boxes of important printed documents that were difficult to access.  If you've been in the industry for a while, you'll recall all the high speed scanning / OCR companies that cropped up to solve this problem.

Today virtually all documents and manuals are created electronically, and thanks to high-quality formats like Adobe PDF and numerous electronic distribution channels, documents tend to stay in digital format.

To me, PDF has replaced paper as the new "legacy" format - there's a ton of technical data now being published in this format. And to paraphrase an old commercial, "data checks in, but it don't check out".

Of course there are ways to get all that tabular technical data back out of PDF and into more usable forms, but getting this right is not trivial, and it's certainly not Adobe's priority. We're not chiding them for this; their business model is clearly served by getting data INTO PDF. Acrobat can now export to XML, and other solutions can help you get the content out as well.

A similar case could be made for HTML, Word, Excel and PowerPoint. Each of these formats has problems of its own.

PDF has some particular details that can thwart enterprise search:

  • Not all PDF files have searchable text, and users are generally unaware of the difference.
  • PDF files come in many dialects.
  • Tabular data in PDF is sometimes difficult for software to infer; humans easily see the rows and columns, but unlike other document formats, there is no intrinsic hierarchical document structure, just pixels, lines and text snippets with various X,Y coordinates.
  • Older PDF formats were not as capable when dealing with other languages, such as Arabic.

All of these issues have solutions, but all of them require some thought and careful tool selection.
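To show why the tabular case takes some thought, here is a minimal sketch of one common heuristic: group text snippets into rows by their Y coordinate, then order each row by X. It assumes some PDF tool has already handed you `(x, y, text)` tuples; the function name and the tolerance value are our own inventions, and real documents need far more care (merged cells, wrapped lines, rotated text).

```python
def group_into_rows(snippets, y_tolerance=2.0):
    """Rebuild table rows from positioned text snippets.

    snippets: list of (x, y, text) tuples, as a PDF extraction tool might emit.
    y_tolerance: snippets whose y is within this distance of a row's first
                 snippet are treated as part of that row (assumed value).
    """
    rows = []
    # Visit snippets top to bottom (by y), then left to right (by x).
    for x, y, text in sorted(snippets, key=lambda s: (s[1], s[0])):
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["cells"].append((x, text))           # same visual row
        else:
            rows.append({"y": y, "cells": [(x, text)]})   # start a new row
    # Within each row, order the cells by their x position.
    return [[text for _, text in sorted(row["cells"])] for row in rows]

snippets = [
    (10, 100.0, "Part"), (80, 100.5, "Volts"),
    (10, 112.0, "A1"),   (80, 111.6, "5"),
]
print(group_into_rows(snippets))
```

Note that the second row only comes out right because of the tolerance: its two cells sit at slightly different Y values, which is exactly the kind of detail humans never notice and software trips over.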

So the complexity of OCR has been replaced, on some level, with document filters, entity extraction, ETL and optimized full-text search.