17 posts categorized "Solr"

February 24, 2010

Enterprise Search Summit 2010 - DC

Even as we prepare for ESS East in New York (ESS NY from now on?), Information Today has issued its call for papers for the first-ever ESS-DC, to be held in Washington DC, November 16-18, 2010.

Follow this link to find background on what InfoToday is looking for, or jump right to the submissions page. Don't be shy: everyone who presents papers had, at one time, never done it before. What you know, someone else needs to know!

In our experience, the kind of content InfoToday likes is the information that can help an organization select or manage search and related technologies. Generally, real-world stories about how other companies and organizations have succeeded with search are the ones that attendees appreciate the most. 

We'll also be having a searchdev dinner at ESS DC this year. Details to come late in the summer, but plan for it now!

Are you doing search now? Have you been successful getting it going on time and under budget? Tell your story. Submit your idea now!

November 23, 2009

Webinar: Basics of Search and Relevancy with Solr

Lucid Imagination, the Lucene and Solr folks, are running a webinar featuring Mark Bennett, CTO of New Idea Engineering. The presentation is scheduled for Wednesday, December 2nd at 2:00PM Eastern/11AM Pacific time (1900 GMT if my calculations are correct). Read more about the event and register today!

The description of the session follows:

In this introductory technical presentation, renowned search expert Mark Bennett, CTO of search consultancy New Idea Engineering, will present practical tips and examples that web application developers can use to quickly get productive with Solr, including:

  • Working with the "web command line" to control your search
  • Understanding Solr's DISMAX parser
  • Using Solr's Explain output to tune your results relevance
  • Using Solr's Schema browser

Sign up today and get ready for some relevance.

/s/Miles

November 09, 2009

SearchDev Dinner in San Jose at ESS West

We've just put the final touches on the annual SearchDev dinner in conjunction with the Enterprise Search Summit West next week in San Jose, California. Anyone who attends the conference, or anyone in the Bay Area, is welcome to attend.

Lucid Imagination is sponsoring the dinner this year along with New Idea Engineering. It will be held on Wednesday night, November 18, at 6:30 PM in the San Jose Hilton, adjacent to the convention center.

Seats are limited, so if you think you will want to attend, please RSVP today to info(at)ideaeng.com with your name and names of the folks who will join you. Of course, replace the (at) with @...

Miles

September 24, 2009

Fix error "getTextContent is undefined for the type Node" for Solr project in Eclipse IDE

The error:

"The method getTextContent() is undefined for the type Node"
You get 3 of these, in the source files ReutersService.java and TestConfig.java.

A Web fix that doesn't work:

You'll see suggestions that org.w3c.dom.Node.getTextContent() is only available as of Java 1.5.  But when you check, you see you ARE running with Java 1.5 or later.

You can quickly check this by right-clicking on the project, choosing Properties -> Java Compiler, and confirming that 1.5 or above is selected in the drop-down lists.

The fix, short story:

The order of the classpath needs to be tweaked in the Eclipse project; shove the xml-apis-1.0.b2.jar all the way to the bottom, past the built-in JVM libraries.

For more details, and how you would know this, read on!

Continue reading "Fix error "getTextContent is undefined for the type Node" for Solr project in Eclipse IDE" »

August 24, 2009

Bay Area Apache Lucene / Solr Meetup September 3

Another meet-up for San Francisco Bay Lucene and Solr users and developers is coming up Thursday night, September 3rd, in Mountain View, California.

This time, the Computer History Museum will be the venue, at 1401 North Shoreline, the old Silicon Graphics digs on the corner of 101 and Shoreline. Mark Bennett from New Idea Engineering will be one of many search luminaries on the agenda, including Walter Underwood, Brian Pinkerton, and others. Topics include relevancy in Solr, search at Netflix and digg, and search performance analysis.

The last meet-up in San Francisco was pretty full, so sign up now or you may find the meet-up has pushed past the quota of 48 people.

/s/Miles

July 21, 2009

Lucene: It's coming from inside the firewall!

We've done a number of projects helping large clients with search roadmap planning, including an audit of their various data sources. This is often an early step in implementing an enterprise search solution that will integrate diverse content across multiple sources.

On a number of our recent projects, an interesting thing has happened. As we've spoken with content owners, we've found an increasing number of Lucene implementations that no one knew about. This has often been a surprise to the people who brought us in, usually corporate IT.  Much like PCs infiltrated into corporations in the early days, it looks like Lucene is making its way into companies under the radar, often hacked in by a creative employee who just wants to get a simple search capability working, and who doesn't have time for a formal selection process or budget to purchase a commercial solution.

As we've written before, Lucene/Solr is getting to be a pretty decent search solution, although it's still a bit rough around the edges. This can't be a good sign for companies that market premium-priced search products.

Consider:

  • IBM offers the free IBM/Yahoo! search for up to 500,000 documents
  • Microsoft offers free Search Server Express as well as a higher-capacity Search Server
  • Google Site Search and Google Custom Search are free and low-cost hosted solutions that let you provide search for your site - or a group of sites - without spending much money

Finally, as Microsoft subsidiary FAST moves into the mid-range price sector with high-end capabilities, the price of enterprise search is dropping for many companies that might otherwise have had to sign six-figure deals for licensing alone. Add to this the free and low-cost supporting technologies - consider clustering engine Carrot^2, for example - and you've got a movement.

To paraphrase our long-time friend Deep Search, you can spend a bunch of money on a commercial search, then spend much more implementing it well; or you can find a free or low-cost solution and spend a bunch of money implementing it. Your call.

July 16, 2009

Lucene/Solr Meet-Up in New York City 7/22

MTV is hosting a New York City Lucene/Solr meet up featuring the guys from Lucid Imagination.
This seems to be similar to the meet-up in San Francisco last month, which we've written about in the blog. Unique to the New York event will be a talk by MTV's Michael Rosencrantz about how they are using Solr and the benefits they are seeing.

Visit the meet-up page to learn more and to sign up for the event. If you want to go, register now because the number of available seats is dropping fast.

We've used Solr and Lucene on a number of large projects, and it's certainly become a darned good search engine. In fact, the internal structures of the 1.4 release are almost identical to the internal structure of Verity back in its early days. The catch is that the project technology is really enterprise scale; the packaging still leaves something to be desired to really succeed in the corporate environment. It's getting to be better and better all the time, though, so stay tuned.

We found the San Francisco event really informative, although the crowd was not a typical corporate IT gathering. If you want to learn about what's new in Solr 1.4, and how a large media company uses open source search to its advantage, head for MTV. You'll also be able to brag about when you "were down at MTV..."

June 09, 2009

Enterprise search doesn't mean mortgaging the farm

Lynda Moulton, the Search Practice analyst at CMS firm Gilbane Group, really hit the mark on a recent blog post nominally about how advertising money can buy editorial space. It's true that many publications (and analyst firms) are happy getting paid by both sides. (Note: I was called out on this by Theresa Regli of CMS Watch, so I no longer say 'all analysts' for anything!)

In my opinion, the real news in Lynda's post is this: "there are dozens of enterprise search solutions that will serve you extremely well, with much lower cost of ownership" than with the big industry players. In fact, open source is beginning to penetrate the corporate veil, and while Lucene and Solr are not right for everyone, it looks like they've just about implemented what Mark and I consider Verity's "Topic 1.0" capabilities circa 1990. We went to a meet-up the other night that Mark has written about, and we were pretty impressed.

So before you decide you need to budget a half million dollars or more for search, consider what Walter Underwood, chief architect of Ultraseek and now search evangelist at Netflix, once told me. Paraphrasing: "You can download Solr then spend a ton of money customizing it; or you can spend a ton of money licensing enterprise search software, then spend a ton of money installing and customizing it. Your call."

But get help!

s/Miles

June 08, 2009

Enterprise Search Engine Optimization: eSEO

Last week at the Gilbane Conference in San Francisco, I participated in a panel "Search Survival Guide: Delivering Great Results" moderated by Hadley Reynolds of IDC. In the presentation, I offered a new view on improving enterprise search engine relevancy that I call eSEO.

The term SEO is well understood by - and widely practiced in - the corporate world.  The concept of SEO, as summarized by one of the Gilbane talks, states that "Key to the value of any Web content is the ability for people to find it". In the SEO world this is done by combining organic results and keyword placement - advertising - to improve placement, maintain ranking, and monitor search engine position - results - over time.

While we've been helping our customers improve their enterprise search results, it's hard to convince them that search results are not a problem they can solve once. I've decided to apply a new term to this process - Enterprise Search Engine Optimization, or eSEO. To paraphrase the role of SEO, eSEO is the process of combining organic results and best bets to deliver correct, relevant, timely content to enterprise search users - employees, customers, partners, investors, and others.

For both organic and best bets, the first step is to identify what we call the "top 100" queries. Start by creating a histogram that shows the top terms from your search engine. I hope you'll agree that if the top queries - whether 100, 50, or even 20 - deliver great results, you're on your way to having happy users. Talk to your content owners as you review the histogram, and ask them to identify the best result for each.
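As a rough illustration of the histogram step, here is a minimal Python sketch. It assumes you can export your search log as a plain list of query strings; the function names, the light normalization (lowercasing and collapsing whitespace), and the text-bar output are all our own choices for the example, not features of any particular search engine.

```python
from collections import Counter


def top_queries(queries, n=100):
    """Return the n most frequent queries as (query, count) pairs."""
    # Light normalization (an assumption): lowercase and collapse whitespace
    # so "Vacation Policy" and "vacation  policy" are counted together.
    normalized = (" ".join(q.lower().split()) for q in queries)
    counts = Counter(q for q in normalized if q)  # skip empty queries
    return counts.most_common(n)


def format_histogram(pairs, width=40):
    """Render (query, count) pairs as text bars scaled to the top query."""
    if not pairs:
        return []
    top_count = pairs[0][1]
    lines = []
    for query, count in pairs:
        bar = "#" * max(1, round(width * count / top_count))
        lines.append(f"{query:<30} {count:>6} {bar}")
    return lines
```

Feed it a month of queries, print the lines from `format_histogram(top_queries(log))`, and you have something concrete to walk through with your content owners.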

Once you have a list of queries and results, start the two-step process. First, tune the search engine using its native query tuning capabilities. This will change the shape of the histogram, and over time should start delivering better results. The bad news is that tuning like this doesn't position all of your top terms, and it would be silly to try to micro-manage the results for each. That's why search engines have best bets, which are the second step.

When you feel pretty good about the curve through query tuning, it's time to start setting up best bets - the "ad words" of eSEO. Limit the number of best bets to one or two at most - but remember that you can use other real estate like the rightmost column of the screen to suggest additional content. Some guidelines for best bets:

  • Use one or at most two best bets
  • Don't repeat a document already at the top of the organic results
  • Make sure your best bets respect security
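
The three guidelines above can be sketched as a small filter. This is only an illustration under stated assumptions: the dictionary fields (`url`, `allowed_groups`), the group-based security check, and the idea that "top of the organic results" means the first three hits are all hypothetical choices for the example, and your search engine will have its own way of expressing each.

```python
def select_best_bets(candidates, organic_urls, user_groups, max_bets=2, top_k=3):
    """Pick the best bets to show for one query and one user.

    candidates:   editor-chosen best bets, in priority order; each is a dict
                  with "url" and "allowed_groups" (hypothetical field names).
    organic_urls: URLs of the organic results, best first.
    user_groups:  security groups the current user belongs to.
    """
    top_organic = set(organic_urls[:top_k])
    groups = set(user_groups)
    shown = []
    for bet in candidates:
        # Don't repeat a document already at the top of the organic results.
        if bet["url"] in top_organic:
            continue
        # Respect security: skip bets the user is not allowed to see.
        if not groups & set(bet["allowed_groups"]):
            continue
        shown.append(bet)
    # Use one or at most two best bets.
    return shown[:max_bets]
```

The point of the security check is that a best bet is still a search result; if the user could not open the document from the organic list, it should not show up as a best bet either.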

Once you have tuned your search engine, and set up best bets for the most timely and actionable result, you're ready to roll it out. But then the ongoing part comes in: you need to review your search activity and best bets periodically. Usually, we'd suggest once a month for a while, then perhaps quarterly thereafter. You may find seasonal variations, and if you're not watching you'll miss a golden opportunity.

In Summary

1. eSEO is just as critical as SEO

  • Lost time and revenue
  • Legal exposure

2. Watch for trends over time: Search is not "fire and forget"

3. Make sure SEO doesn't impact your eSEO

  • Use fielded data that web search engines ignore for your tuning (e.g., 'Abstract' rather than 'Description').

This will get you started; but because your queries and your content change over time, it's a never-ending story. Some companies - ours included - have tools that can help. But no matter what, hang in there!

s/Miles


June 06, 2009

Impressions of first Lucene/Solr SF Meetup

Kudos to Carl, our NIE Marketeer and de facto social director, for getting us to attend; it was well worth it, and conveniently coincided with Gilbane.

The Good:

  • VERY entertaining, very informative.  Lots of good info about upcoming versions of Lucene and Solr, including additional performance tweaks.
  • A friendly, supportive bunch of like-minded nerds, and I mean this in the best possible way.
  • Also discussions of other related Apache projects.  We're all gonna need a cheat sheet pretty soon to keep track of it all.
  • Lucene/Solr will soon have implemented many of the core features of Autonomy IDOL, Endeca, FAST, etc.  They really ought to be spying.  :-)

Personally I think Otis & co. might wanna fly out for the next one.  I also think Dieselpoint ought to attend and talk about Open Pipeline.  If we get up enough energy maybe we could even volunteer to do that next time, we're on the board after all, but this is really Chris's baby.

The Not-so-Good:

  • About 50 terms that clients would not understand.  Don't get me wrong, we love the Map/Reduce, Bayesian, K-Means, SVD stuff, but most corporate clients would be lost.
  • Not much for Enterprise Packaging.  Ironically it's the mundane aspects of search, from a non-developer standpoint, that are still not on the horizon.  Not a criticism of the developers, they have what they need.
  • Not much about Nutch.  Nutch 1.0 is out, along with rumors of a revised admin GUI, but not much coverage here.

Impressions of Lucid Imagination:

This event was sponsored by Lucid, a company that recently got funding for bringing commercial packaging and services to the open source search world, and their senior staff includes quite a few of the core committers.

  • A very sincere bunch of guys.
  • They haven't sold their souls to corporate America; I think their "geek cred" is still well intact.
  • Probably will not be filling in enterprise packaging pot holes any time soon.
  • Do they understand the Enterprise Market?

Also a shout-out to LinkedIn and IBM for giving back to the open source community.

There was also an "open mic" segment, and I'd like to give a shout-out to Avi Rappoport - I agree 1,000%, "stop words bad!" (or at least the blind use of index-time stop words)


Surprises:

  • Not much of a threat to Google Appliance, due to packaging.  Yes, Google scales with their Map/Reduce and relevancy algorithms, and the open source guys have responded, but that's not the stuff that makes Google tick these days.
  • And despite the impressive and rapidly evolving core technologies, also not a real threat to the other Tier One vendors like FAST and Autonomy.  More on this seeming contradiction in a bit.
  • The Tier 2 vendors of the world, Attivio, Exalead, Dieselpoint, etc. DO need to pay attention.  There is a place for Tier 2 vendors, but they need to mind what the open source products do and do not provide more carefully.
  • It's really cool to see IBM willing to contribute so aggressively to the open source search engines, even though they sell several of their own.  A naive person might think they are competing with themselves, sabotaging their own sales guys, but they're a lot smarter than that.  They are not selling their commercial search products as pure search; those technologies are always part of a larger (and more expensive) grand business solution.  They know what they're doing!

For similar reasons, still not a huge threat to Autonomy, MS/FAST, Endeca, etc. on corporate services.  I said earlier that the Apache projects are implementing a lot of the "secret sauce" that launched Autonomy and Endeca, etc., so you'd think this represents "a clear and present danger", but Mike Lynch's secret algorithms are not why people buy IDOL anymore.  Things like giant reference accounts, professional services, and commercial-grade spiders have a lot more to do with why big companies still pay six figures for search technology.

And speaking of surprises and Lucid Imagination, I wanna circle back to their PR a few months back when they got their funding and launched their company.  They talked about relevancy in their press releases!?  Wow... Yes, Lucene and Solr have some good traction there, but that specific competitive advantage has been used by almost every commercial search vendor in the past 15 years, including Verity, Autonomy and Google!

I would've expected them to say something like "we're gonna do for Lucene what RedHat did for Linux" - this would have been a very clear business-oriented proposition, though to be fair lots of companies have used that business model as well.  It wouldn't be original, but would be more business-centric.  Then again, I'm not in Marketing, and their VCs obviously liked their pitch, so what do I know!

s/Mark