June 09, 2009

Enterprise search doesn't mean mortgaging the farm

Lynda Moulton, the Search Practice analyst at CMS firm Gilbane Group, really hit the mark on a recent blog post nominally about how advertising money can buy editorial space. It's true that many publications (and analyst firms) are happy getting paid by both sides. (Note: I was called out on this by Theresa Regli of CMS Watch, so I no longer say 'all analysts' for anything!)

In my opinion, the real news in Lynda's post is this: "there are dozens of enterprise search solutions that will serve you extremely well, with much lower cost of ownership" than with the big industry players. In fact, open source is beginning to penetrate the corporate veil, and while Lucene and Solr are not right for everyone, it looks like they've just about implemented what Mark and I consider Verity's "Topic 1.0" capabilities circa 1990. We went to a meet-up the other night that Mark has written about, and we were pretty impressed.

So before you decide you need to budget a half million dollars or more for search, consider what Walter Underwood, chief architect of Ultraseek and now search evangelist at Netflix, once told me. Paraphrasing: "You can download Solr, then spend a ton of money customizing it; or you can spend a ton of money licensing enterprise search software, then spend a ton of money installing and customizing it. Your call."

But get help!

s/Miles

June 08, 2009

Enterprise Search Engine Optimization: eSEO

Last week at the Gilbane Conference in San Francisco, I participated in a panel "Search Survival Guide: Delivering Great Results" moderated by Hadley Reynolds of IDC. In the presentation, I offered a new view on improving enterprise search engine relevancy that I call eSEO.

The term SEO is well understood by - and widely practiced in - the corporate world. The concept of SEO, as summarized by one of the Gilbane talks, states that "Key to the value of any Web content is the ability for people to find it". In the SEO world this is done by combining organic results and keyword placement - advertising - to improve placement, maintain ranking, and monitor search engine position - results - over time.

While we've been helping our customers improve their enterprise search results, it's hard to convince them that search results are not a problem they can solve once. I've decided to apply a new term to this process - Enterprise Search Engine Optimization, or eSEO. To paraphrase the role of SEO, eSEO is the process of combining organic results and best bets to deliver correct, relevant, timely content to enterprise search users - employees, customers, partners, investors, and others.

For both organic and best bets, the first step is to identify what we call the "top 100" queries. Start by creating a histogram that shows the top terms from your search engine. I hope you'll agree that if the top queries - whether 100, 50, or even 20 - deliver great results, you're on your way to having happy users. Talk to your content owners as you review the histogram, and ask them to identify the best result for each.
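A minimal sketch of that first step, assuming your search engine can export its query log as a plain list of query strings. The function names and the normalization rules (lowercasing, collapsing whitespace) are my own illustration, not a feature of any particular engine:

```python
from collections import Counter


def top_queries(queries, n=100):
    """Return the n most frequent queries as (query, count) pairs."""
    # Normalize so "Vacation Policy" and "vacation  policy" count together.
    normalized = (" ".join(q.lower().split()) for q in queries)
    counts = Counter(q for q in normalized if q)
    return counts.most_common(n)


def histogram_lines(pairs, width=40):
    """Render (query, count) pairs as text bars scaled to the top query."""
    if not pairs:
        return []
    peak = pairs[0][1]
    return [
        f"{query:<30} {'#' * max(1, round(width * count / peak))} {count}"
        for query, count in pairs
    ]


if __name__ == "__main__":
    # Hypothetical log entries, for illustration only.
    log = ["Vacation Policy", "vacation  policy", "401k",
           "expense report", "401k", "vacation policy"]
    for line in histogram_lines(top_queries(log, n=3)):
        print(line)
```

The printed list is what you would walk through with your content owners, one query at a time.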

Once you have a list of queries and results, start the two-step process. First, tune the search engine using its native query tuning capabilities. This will impact the shape of the histogram, and over time should start delivering better results. The bad news is that tuning like this doesn't position all of your top terms, and it would be silly to try to micro-manage the results for each. That's why search engines have best bets.

When you feel pretty good about the curve through query tuning, it's time to start setting up best bets - the "ad words" of eSEO. Limit the number of best bets to one or two at most - but remember that you can use other real-estate like the rightmost column of the screen to suggest additional content. Some guidelines for best bets:

  • Use one or at most two best bets
  • Don't repeat a document already at the top of the organic results
  • Make sure your best bets respect security
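Those three guidelines can be sketched as a small filter. This is an illustration only: the `BestBet` record, the group-based security check, and the choice to compare against the top three organic results are all assumptions, and a real engine would enforce security through its own access-control layer:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BestBet:
    url: str
    title: str
    # Hypothetical security model: an empty set means visible to everyone.
    allowed_groups: frozenset = frozenset()


def select_best_bets(candidates, organic_urls, user_groups, limit=2, top_n=3):
    """Pick at most `limit` best bets for one query and one user."""
    top_organic = set(organic_urls[:top_n])
    groups = set(user_groups)
    chosen = []
    for bet in candidates:
        # Don't repeat a document already at the top of the organic results.
        if bet.url in top_organic:
            continue
        # Respect security: skip bets this user is not allowed to see.
        if bet.allowed_groups and not (bet.allowed_groups & groups):
            continue
        chosen.append(bet)
        # Use one or at most two best bets.
        if len(chosen) == limit:
            break
    return chosen
```

Candidates are assumed to arrive in priority order, so the content owner's first choice wins whenever it passes both checks.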

Once you have tuned your search engine, and set up best bets for the most timely and actionable result, you're ready to roll it out. But then the ongoing part comes in: you need to review your search activity and best bets periodically. Usually, we'd suggest once a month for a while, then perhaps quarterly thereafter. You may find seasonal variations, and if you're not watching you'll miss a golden opportunity.

In Summary

1. eSEO is just as critical as SEO

  • Lost time and revenue
  • Legal exposure

2. Watch for trends over time: Search is not "fire and forget"

3. Make sure SEO doesn't impact your eSEO

  • Use fielded data that web search engines ignore for your tuning (e.g., 'Abstract' rather than 'Description').

This will get you started; but because your queries and your content change over time, it's a never-ending story. Some companies - ours included - have tools that can help. But no matter what, hang in there!

s/Miles


June 06, 2009

Impressions of first Lucene/Solr SF Meetup

Kudos to Carl, our NIE Marketeer and de facto social director, for getting us to attend. It was well worth it, and it conveniently coincided with Gilbane.

The Good:

  • VERY entertaining, very informative.  Lots of good info about upcoming versions of Lucene and Solr, including additional performance tweaks.
  • A friendly, supportive bunch of like-minded nerds, and I mean this in the best possible way.
  • Also discussions of other related Apache projects.  We're all gonna need a cheat sheet pretty soon to keep track of it all.
  • Lucene/Solr will soon have implemented many of the core features of Autonomy IDOL, Endeca, FAST, etc.  They really ought to be spying.  :-)

Personally I think Otis & co. might wanna fly out for the next one.  I also think Dieselpoint ought to attend and talk about Open Pipeline.  If we get up enough energy maybe we could even volunteer to do that next time, we're on the board after all, but this is really Chris's baby.

The Not-so-Good:

  • About 50 terms that clients would not understand.  Don't get me wrong, we love the Map/Reduce, Bayesian, K-Means, SVD stuff, but most corporate clients would be lost.
  • Not much for Enterprise Packaging.  Ironically it's the mundane aspects of search, from a non-developer standpoint, that are still not on the horizon.  Not a criticism of the developers, they have what they need.
  • Not much about Nutch.  Nutch 1.0 is out, along with rumors of a revised admin GUI, but not much coverage here.

Impressions of Lucid Imagination:

This event was sponsored by Lucid, a company that recently got funding for bringing commercial packaging and services to the open source search world, and their senior staff includes quite a few of the core committers.

  • A very sincere bunch of guys.
  • They haven't sold their souls to corporate America; I think their "geek cred" is still well intact.
  • Probably will not be filling in enterprise packaging pot holes any time soon.
  • Do they understand the Enterprise Market?

Also a shout out to LinkedIn and IBM for giving back to the open source community.

There was also an "open mic" segment, and I'd like to give a shout out to Avi Rappaport - I agree 1,000%, "stop words bad!" (or at least the blind use of index time stop words)


Surprises:

  • Not much of a threat to Google Appliance, due to packaging.  Yes, Google scales with their Map/Reduce and relevancy algorithms, and the open source guys have responded, but that's not the stuff that makes Google tick these days.
  • And despite the impressive and rapidly evolving core technologies, also not a real threat to the other Tier One vendors like FAST and Autonomy.  More on this seeming contradiction in a bit.
  • The Tier 2 vendors of the world, Attivio, Exalead, Dieselpoint, etc. DO need to pay attention.  There is a place for Tier 2 vendors, but they need to mind what the open source products do and do not provide more carefully.
  • It's really cool to see IBM willing to contribute so aggressively to the open source search engines, even though they sell several of their own.  A naive person might think they are competing with themselves, sabotaging their own sales guys, but they're a lot smarter than that.  They aren't selling their commercial search products as pure search; those technologies are always part of a larger (and more expensive) grand business solution.  They know what they're doing!

For similar reasons, still not a huge threat to Autonomy, MS/FAST, Endeca, etc. on corporate services.  I said earlier that the Apache projects are implementing a lot of the "secret sauce" that launched Autonomy and Endeca, etc, so you'd think this represents "a clear and present danger", but Mike Lynch's secret algorithms are not why people buy IDOL anymore.  Things like giant reference accounts, professional services, and commercial grade spiders have a lot more to do with why big companies still pay six figures for search technology.

And speaking of surprises and Lucid Imagination, I wanna circle back to their PR a few months back when they got their funding and launched their company.  They talked about relevancy in their press releases!?  Wow... Yes, Lucene and Solr have some good traction there, but that specific competitive advantage has been used by almost every commercial search vendor in the past 15 years, including Verity, Autonomy and Google!

I would've expected them to say something like "we're gonna do for Lucene what RedHat did for Linux" - this would have been a very clear business-oriented proposition, though to be fair lots of companies have used that business model as well.  It wouldn't be original, but would be more business centric.  Then again, I'm not in Marketing, and their VC's obviously liked their pitch, so what do I know!

s/Mark

February 03, 2009

Lucene Start-Up Lucid Imagination Funded

There's been a good deal of buzz lately about venture money investing in Lucid Imagination, a company that wants to be to search what Red Hat is to Linux.

We're excited to see a commercial venture committed to standardizing and supporting Lucene, the open source Apache project. And the company "employs about 10 of the Lucene/solr projects’s top 10 committers" - isn't that about 100%?

We've done a number of Lucene/Solr projects over the last four years, and we tell our customers that it's a good technology, but it's a toolkit. Lucid has apparently wrapped system monitoring and "tools for improved search relevancy" into their offering, and will support customers for "$12,000 to $18,000 per year" on subscription.

A good friend of ours who we sometimes call Deep Search uses Solr at his company and he loves it. But then, they have a relatively small number of structured XML documents extracted from their database, and their content doesn't update very often. Still, he says you can purchase a commercial engine and spend a lot of money and time implementing it; or you can download an open source engine for free, and then spend a lot of money and time implementing it. Your choice!

There are other free search technologies, both open source and proprietary, from companies as well known as IBM and Microsoft. And there are a number of other open source search technologies out there. Still, Lucid is to be congratulated for offering support for an exciting and growing search technology, and their guys obviously understand what it takes to write a search engine.

Bottom line: although we are fans of technology, search methodology is much more important than the underlying technology. Now comes the hard part: making it work.

January 13, 2009

Updated List of Free, Open Source and Low Cost Search Engines

Over at our partner site:
www.searchcomponentsonline.com/free-open-source-and-low-cost.html

January 12, 2009

Virtualization and Search: Performance Tests Summary

We've been researching Virtualization and Search, and have recently presented on the topic at a couple of shows.  We wanted to measure the performance penalty of running a spider on a virtual machine instead of a physical machine.  This is a summary of our findings.

What we expected to find in terms of a performance penalty:

  • Estimated 3 to 20% penalty
  • Leaning towards 20% given the heavy disk IO

Actual results:

  • We found an approximate 10% average penalty.
  • There was a pretty wide margin; various tests measured between 0% and 17%, but always under the 20% we had estimated.  (actually less than zero percent in some cases, but we labeled those as outliers)
  • Overall the performance was better than expected, and certainly reasonable for many applications.

The test and environment:

  • HP mid tier workstation, dual core, AMD Athlon 64 X2 4400+
  • 8 Gigs memory, local SATA disk (non RAID)
  • Microsoft Windows Server 2008 64-bit for both host and client, using MS Hyper-V.
  • Sun's 64-bit JVM v6 set to 1 Gig max (which it did not fully respect)
  • Nutch 0.9 stock distribution
  • Dataset was the Enron public emails, approx half million emails in 1.5 Gigs of source data, served by IIS on separate local machine
  • Email files were mapped to text filter
  • Data was fetched and indexed into Lucene by Nutch
  • Clock time ranged from 31 to 35 minutes for all tests, with the physical tests (non-virtual) giving the widest measured deviation
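For anyone repeating this kind of test, the penalty is just the extra clock time on the virtual machine relative to the physical run. A minimal sketch follows; the timings in it are made-up placeholders, not our measured data:

```python
from statistics import mean


def penalty_percent(physical_minutes, virtual_minutes):
    """Percent of extra clock time the virtual run took over the physical run."""
    return 100.0 * (virtual_minutes - physical_minutes) / physical_minutes


def summarize(runs):
    """Summarize (physical, virtual) timing pairs as min / max / mean penalty."""
    penalties = [penalty_percent(p, v) for p, v in runs]
    return {"min": min(penalties), "max": max(penalties), "mean": mean(penalties)}


if __name__ == "__main__":
    # Hypothetical (physical, virtual) clock times in minutes.
    runs = [(32.0, 35.2), (31.0, 34.1), (33.0, 33.0)]
    print(summarize(runs))
```

A negative value means the virtual run finished faster than the physical one, which is the kind of result we labeled as an outlier.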

Drop us an email if you'd like the full PDF when published.

Enterprise Search group on LinkedIn

There is a relatively active Enterprise Search Engine professional group over on LinkedIn - you might consider joining the group if you're there. The discussions have been about Open Source technology, visual search, Microsoft and FAST, federated search and more.

It's interesting that so many 'enterprise search' groups have grown so much in the last year or two, including searchdev.org - I guess it reflects both the fact that it's finally being recognized as a mission-critical capability and that there are so few places to go for information. Hopefully we'll see even more discussion and user participation in the coming year!


June 18, 2008

Search Quality: You Can't Improve What You Don't Measure

In our latest survey of new newsletter subscribers we found that 29% had no formal metrics for measuring quality of search results.  Search metrics allow you to keep search on the right track and can be a powerful tool for managing your systems.  They are a wonderful source for insights and trends.  We thought we would share a couple that we think work well. Many of these are covered in greater depth in Interpreting Your Search Activity Reports in the Enterprise Search newsletter.

  • Count the number of people who use search  
  • Count the total number of searches  
  • Count the number of zero search results  
  • User feedback on top 100 searches  
  • Track email complaints about search  
  • Measure number of clicks on navigators (navigation menu items)  
  • Business Goals
    • Reduce call volume (normalized for growth in customer base) by enabling self-service from search: results are good enough to reduce calls.
    • Reduce e-mail volume (again adjusted for growth in customer base) by enabling self-service from search: results are good enough to reduce e-mails.
    • Revenue
    • Add-on revenue
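The first three counts on the list are easy to pull from a search log. Here is a minimal sketch, assuming each log entry records who searched, what they typed, and how many results came back; the `SearchEvent` record and its field names are hypothetical, so map them to whatever your engine actually logs:

```python
from dataclasses import dataclass


@dataclass
class SearchEvent:
    user: str          # hypothetical field: user or session identifier
    query: str
    result_count: int  # number of results the engine returned


def search_metrics(events):
    """Compute basic search quality counts from a list of SearchEvent records."""
    total = len(events)
    zero = sum(1 for e in events if e.result_count == 0)
    return {
        "unique_users": len({e.user for e in events}),
        "total_searches": total,
        "zero_result_searches": zero,
        "zero_result_rate": zero / total if total else 0.0,
    }
```

Run it over the same period each month and compare: the trend matters more than any single number.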

May 08, 2008

A proposed standard for enterprise search

Dieselpoint has announced support for a technology it calls OpenPipeline, which can enhance the task virtually every enterprise search technology uses to get documents into the search index. They will be showing the pipeline at the upcoming Enterprise Search Summit on May 20-21 integrated with their new Dieselpoint Search 4.0, also on display.

The Dieselpoint press release claims:

OpenPipeline provides a common architecture for connectors to data sources, file filters, text analyzers and modules to distribute documents across a network. It is fully functional out of the box and includes an installer, a job scheduler, file scanner and crawlers, doc filters, and point and click interface with drag and drop module installation.

OpenPipeline is compatible with IBM's UIMA (Unstructured Information Management Architecture), and is designed to connect UIMA annotators to other systems.

Document processing can be centralized or parallelized as needed. The transport mechanism is simple, web-services XML over HTTP. RSS/Atom feeds are also possible.

The development philosophy behind OpenPipeline stresses simple, elegant design, and massive scalability. Minimal external dependencies and straightforward plug-in implementation ensure that the learning curve is low.

OpenPipeline can be downloaded without charge from http://www.OpenPipeline.org. It's available under the Apache License.


Making this technology open source makes sense. The core technology for an enterprise search company, their 'secret sauce', is optimizing the index and making search great, not creating new code to parse the latest version of Microsoft Office or of Documentum. With OpenPipeline available, presumably we will start to see pipeline stages created by a number of smaller companies and individuals, easing the burden on enterprise search companies. And companies that provide possible sources of data, like Content Management Systems, can create a single pipeline stage for their product that could work for every search technology, and be done with it.

To create a searchable index, all search technologies need to create a stream of text. If the source document is a binary file - Microsoft Word, for example - search vendors need to provide some way to read the format and convert it to text. The same is true of content stored in a relational database: each row represents a virtual document which needs to be extracted from the database and turned into a stream of text. This conversion is typically done as one stage of a pipeline. Other stages may include adding metadata, performing entity or sentiment extraction, or even enhanced language processing.
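To make the idea of stages concrete, here is a minimal sketch of a document pipeline. To be clear, this is not the OpenPipeline API; every name in it is made up for illustration, and a real conversion stage would call a proper file filter rather than just decoding bytes:

```python
def row_to_document(row, key="id"):
    """Treat one database row (a dict) as a virtual document."""
    body = " ".join(str(v) for k, v in row.items() if k != key)
    return {"id": row[key], "raw": body}


def convert_to_text(doc):
    """Conversion stage: turn the raw content into a stream of text."""
    raw = doc.get("raw", b"")
    if isinstance(raw, bytes):
        text = raw.decode("utf-8", errors="replace")
    else:
        text = str(raw)
    return {**doc, "text": text}


def add_metadata(doc):
    """Enrichment stage: attach simple metadata derived from the text."""
    return {**doc, "word_count": len(doc.get("text", "").split())}


def run_pipeline(docs, stages):
    """Pass each document through every stage, in order."""
    for doc in docs:
        for stage in stages:
            doc = stage(doc)
        yield doc


if __name__ == "__main__":
    rows = [{"id": 1, "title": "Expense policy", "body": "Submit receipts monthly"}]
    docs = (row_to_document(r) for r in rows)
    for doc in run_pipeline(docs, [convert_to_text, add_metadata]):
        print(doc["id"], doc["word_count"], doc["text"])
```

Because each stage takes a document and returns a document, a stage for entity extraction or language processing can be dropped into the list without touching the others.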

The concept of a 'pipeline' applies directly to many existing search technologies, each with a proprietary method of accessing content. On top of that, no search technology companies have cooperated with competitors to create standards. In the relational database world, standards have made life much better: consider ODBC and JDBC. Because of these standards, developers can write code that can connect to just about any relational database. Not so in search. Maybe this effort will help break the ice. Stay tuned...

As enterprise search users, are you glad to see an open source solution for part of the search puzzle?

May 05, 2008

The problem with alerts - Google or otherwise

I use Google alerts to keep an eye on current events. Over the weekend I got an alert: "AMEC uses Verity's K2" - Now, since Verity is part of former competitor Autonomy, and because K2 is generally not being actively marketed, I decided to read the article. Sure enough, the content is dated January 2004, but Google Alerts thinks it is brand new. So I have to conclude that either the publisher just changed something on the page, or Google is just finding that document - either way, Google thinks this is news and in reality, it isn't.

Not long after we started SearchButton.com, we met the Google founders Sergey and Larry. Mark Bennett, my co-founder at SearchButton and here at New Idea Engineering, asked about the then-young Google's handling of dates and recency, and the Google guys took the position that date wasn't that important. This has led to a couple of energetic email exchanges over the last few years, but my recent alert illustrates the problem Google - and most other search technologies - have in generating really useful alerts. In fact, this subject was of such relevance to enterprise search owners that we had an article about the importance of dates in the first issue of our enterprise search newsletter in April of 2003.

