2 posts categorized "Elastic"

June 28, 2017

Poor data quality gives search a bad rap

If you’re involved in managing the enterprise search instance at your company, there’s a good chance that you’ve heard at least some users complain about the poor results they see.

The common lament search teams hear is “Why didn’t we use Google?” In fact, we’ve seen the same complaints at sites that implemented the Google Search Appliance (GSA) but don’t use the Google logo and look.

We're often asked to come in and recommend a solution. Sometimes the problem is simply using the wrong search platform: not every platform handles every use case and requirement equally well. Occasionally, the problem is a poorly configured or misconfigured search instance, or simply one that hasn’t been managed properly. Even the renowned Google public search engine doesn’t run itself, and it is a poor comparison in any case: in recent years, Google search has become less of a search platform and more of a big data analytics engine.

Over the years, we’ve helped clients select, implement, and manage intranet search. In my opinion, the real problem with search lies elsewhere: poor data quality.

Enterprise data isn’t created with search in mind. There is little incentive for content authors to attach quality metadata in the properties fields of Adobe PDF Maker, Microsoft Office, and other document publishing tools. To make matters worse, there may be several versions of a given document as it goes through creation, editing, reviews, and updates. And often the early drafts, as well as the final version, are in the same directory or file share. Public-facing web site content very rarely has such issues.
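To make the two problems above concrete, here is a minimal Python sketch of the kind of audit a search team might run before indexing. Everything in it is an assumption for illustration: the record layout (`path`, `title`, `author`), the list of version marker words, and the function names are hypothetical and not taken from any particular search platform.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical marker words that often distinguish drafts of one document.
VERSION_MARKERS = re.compile(
    r"[\s_-]*(?<![A-Za-z0-9])(?:draft|final|copy|rev\d*|v\d+)(?![A-Za-z0-9])",
    re.IGNORECASE,
)


def missing_metadata(doc, required=("title", "author")):
    """Return the required metadata fields that are absent or blank."""
    return [f for f in required if not str(doc.get(f) or "").strip()]


def version_key(path):
    """Directory plus file stem, with version markers and separators normalized."""
    p = PurePosixPath(path)
    stem = VERSION_MARKERS.sub("", p.stem)
    stem = re.sub(r"[\s_-]+", " ", stem).strip().lower()
    return (str(p.parent), stem)


def probable_versions(docs):
    """Group paths that look like versions of one document in one directory."""
    groups = defaultdict(list)
    for doc in docs:
        groups[version_key(doc["path"])].append(doc["path"])
    return {key: paths for key, paths in groups.items() if len(paths) > 1}


# Made-up sample records, for illustration only.
docs = [
    {"path": "/share/hr/Travel Policy draft.docx", "title": "", "author": "J. Smith"},
    {"path": "/share/hr/Travel_Policy_v2.docx", "title": "Travel Policy", "author": None},
    {"path": "/share/hr/travel-policy-final.docx", "title": "Travel Policy", "author": "J. Smith"},
    {"path": "/share/hr/overdraft.pdf", "title": "Overdraft Rules", "author": "A. Lee"},
]

for doc in docs:
    gaps = missing_metadata(doc)
    if gaps:
        print(doc["path"], "is missing:", ", ".join(gaps))

for key, paths in probable_versions(docs).items():
    print("Possible versions of one document:", paths)
```

A filename heuristic like this will miss renamed drafts and can group unrelated files, so treat its output as a list of candidates for a person to review, not as a deduplication step.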

Sometimes content management systems make it easy to implement what is really ‘search engine optimization’, or SEO; but all too often the optimization is left to the enterprise search platform to work out.

We have an updated two-part series on data quality and search, starting here. We hope you find it helpful; let us know if you have any questions!

November 01, 2016

One search to rule them all

(Originally published on LinkedIn)

Lucene was ‘born’ in 1999, created by Doug Cutting; and in 2005, it became a top-level Apache project. That year, Gartner Group announced that the ‘Leaders’ on their Enterprise Search Magic Quadrant included Autonomy, FAST, Endeca, IBM Omnifind, and Verity. The Google Search Appliance was right on the cusp between ‘Challengers’ and ‘Leaders’. Not many people knew about Lucene; and few who did saw it as much more than a quirky little project.

Just a year later, Yonik Seeley and his employer, CNET Networks, published and donated the Solr search server to the Apache Software Foundation, where it became an incubator project in 2006; the Lucene and Solr projects later merged, in 2010. Also in 2006, Gartner narrowed the ‘Leaders’ in their Magic Quadrant for Search to Autonomy (which had acquired Verity the previous year), FAST, and Endeca.

Jump forward to the present. FAST is gone, acquired by Microsoft in 2008 and morphed into SharePoint Search. Hewlett-Packard acquired Autonomy in October of 2011, followed a few weeks later by Oracle’s acquisition of Endeca. Endeca is no longer available as a search platform; and Autonomy is mostly seen as a strategy to keep a large number of HP consultants fully employed, often on compliance applications.

Only a smattering of the commercial enterprise search platforms that flooded the market just a few years ago still exist. While Gartner continues to list 14 or 15 products in their Magic Quadrant Enterprise Search grid, about the only pure commercial products we still see are the Google Search Appliance and Recommind. And Google recently announced that the appliance is scheduled to go ‘end of life’ over the next few years. All of those bright yellow boxes will become really nice Dell servers by the end of 2018.

A new crop of search platforms has grown to fill the void.

As an open source product, Solr has grown in its capabilities, and is now widely used for enterprise search and data applications in corporations and government projects. SolrCloud extends it into a scalable, high-availability platform for demanding enterprise and data search applications.

Cloudera also bundles some interesting extra tools, including Solr, in their HUE bundle, which is free to download and free to use as long as you like. Cloudera runs a slightly older but stable release, 4.10; but with committers Yonik Seeley and Mark Miller, I suspect they’re in a good position.

Hortonworks, a Cloudera competitor, also offers Solr/SolrCloud in their releases, in partnership with Lucidworks, a company with a large number of committers on staff.

There are also three companies that have proprietary offerings based on open source technology.

Attivio, founded in 2007, is a “Leader” in the most recent Gartner Magic Quadrant for Enterprise Search. Their product, while not open source, nonetheless thrives by combining search, BI, data automation, analytics and more.

Elasticsearch has evolved into a strong platform for search and data analytics, and a number of organizations are finding it useful in some traditional enterprise search applications as well. Elastic has also integrated Kibana, a powerful graphical presentation tool that adds value for content analytics, not just search activity reporting.

Lucidworks Fusion is a relative newcomer to enterprise search. It includes many of the rich architectural features that enterprises expect: its ‘Anda’ crawler and connectors, an admin UI, and reporting. Some people see it as a contender to replace the Google Search Appliance.

The one thing that all of these ‘proprietary’ products have in common? They are based on Apache Lucene to deliver critical functionality. And when you consider all of the web sites that use some form of Lucene for their site search, I think you'd agree that it really is a powerful little package. It’s available for virtually any operating system, and can be integrated using just about any programming language from C/C++ to Java to Perl to Python to .NET.

Even more amazing is that these companies, which sell commercial products based on Lucene and compete in the marketplace, actually cooperate when it comes time to fix bugs or add new capabilities to Lucene. Given all of the commercial players that have closed their doors, leaving their customers to find replacement platforms, we’ve reached the point where open-source-based software really is the safe choice now. And universally, Lucene is the common element.

The quirky little search API Doug Cutting put together in 1999 has evolved into the engine that drives the leading search platforms used in big data, NoSQL, enterprise search, and search analytics. And it doesn’t look like it will be phased out any time soon.