December 04, 2017

Search Indices are Not Content Repositories

Recently on Quora, someone asked for help with a corrupt Elasticsearch index. A number of folks responded, all recommending that he simply rebuild the search index and move on.

The bad news, it turns out, was that this person didn't have any source documents: he was so impressed with what Elasticsearch did that he had been using it as his primary storage for content. When it crashed, his content was gone. This is not an indictment of Elasticsearch: it can happen to any complex software product, whether Elastic, Solr or SharePoint.

In my reply, I told him how sorry I was for his loss, and suggested he get to work restoring or recreating his content. I even offered to call and tell him how sorry I was for his loss. 

Then I launched into what I really felt I needed to say - there on his behalf, and here on yours. I suggested - no, actually I insisted - that you NEVER use ANY search index as your primary store for content. Let me be more specific: NEVER. EVER.

Some platforms, such as open source Solr and commercial software based on Solr (e.g., Lucidworks), have a reasonably robust ability to replicate the index over multiple servers or nodes, which provides some safety (I’m thinking SolrCloud here); others do not. But the replica is a copy of the INDEX, which is NOT your documents.

The search index is optimized for retrieval. Databases, CMS, file systems and other tech are for storage.

For one, I’m not sure any search engine stores the entire document of any type. Conceptually, most search indices have two ‘logical’ (if not physical) files.

You can think of one of these files as a database table with one row per document, holding field values (Title, Author, etc). This file generally also stores the URL, file name, or database row - basically ‘where do I go to find this full document?’ - and maybe a few other field values.

The second file is a list of all the words (excluding stop words) in all of your documents. Each word is stored once, along with a list of byte offsets in the document where the word appears (one offset for each instance of the word). It also has a pointer to every document that contains that word. Again: stop words are generally NOT indexed, so they are usually not in the index.
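To make the two logical files concrete, here is a minimal sketch in Python. It is a toy model, not the way any particular engine lays out its index: the names `build_index`, `search` and `STOP_WORDS`, the stop word list, and the sample documents are all made up for illustration, and character offsets stand in for byte offsets.

```python
from collections import defaultdict

# Illustrative stop word list; real engines ship their own (assumption).
STOP_WORDS = {"the", "a", "an", "and", "of", "is"}


def build_index(docs):
    """Build the two logical 'files' from {location: (title, full_text)}."""
    doc_table = {}  # file 1: one row per document, with a pointer to the original
    postings = defaultdict(lambda: defaultdict(list))  # file 2: word -> doc id -> offsets
    for doc_id, (location, (title, text)) in enumerate(docs.items()):
        doc_table[doc_id] = {"location": location, "title": title}
        cursor = 0
        for word in text.split():
            start = text.index(word, cursor)  # character offset of this occurrence
            cursor = start + len(word)
            term = word.strip(".,;:!?").lower()
            if term and term not in STOP_WORDS:  # stop words never reach the index
                postings[term][doc_id].append(start)
    # Return plain dicts so lookups cannot create entries by accident.
    return doc_table, {term: dict(hits) for term, hits in postings.items()}


def search(term, doc_table, postings):
    """Return where to find the full documents that contain the term."""
    return [doc_table[doc_id]["location"] for doc_id in postings.get(term.lower(), {})]


docs = {
    "https://example.com/a": ("Doc A", "The cat and the hat."),
    "file:///b.txt": ("Doc B", "A cat sat."),
}
doc_table, postings = build_index(docs)
locations = search("cat", doc_table, postings)
```

Note what the search hands back: locations, not documents. The index tells you where to look; the content itself still has to live somewhere else.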

(There is more detail in an older article on my website Relational Databases vs. Full-Text Search Engines - New Idea Engineering)

COULD you rebuild the full document? Well, it depends on the search platform. In most platforms I've seen, it would be difficult because stop words are not even stored. Recreating a document that omits ‘the’, ‘a’, ‘an’, ‘and’ etc. MIGHT be human readable but it is NOT the original document.
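Here is a rough sketch of what that kind of reconstruction looks like, assuming a toy index that keeps word positions but drops stop words. The function names, the stop word list and the sample sentence are hypothetical; a real engine also lowercases, stems and otherwise normalizes terms, which loses even more of the original.

```python
# Illustrative stop word list (assumption, not any engine's real list).
STOP_WORDS = {"the", "a", "an", "and"}


def index_terms(text):
    """Return {term: [word positions]} for one document, skipping stop words."""
    postings = {}
    for position, word in enumerate(text.lower().split()):
        if word not in STOP_WORDS:
            postings.setdefault(word, []).append(position)
    return postings


def reconstruct(postings):
    """Best-effort rebuild: put every indexed term back in position order."""
    placed = sorted(
        (position, term)
        for term, positions in postings.items()
        for position in positions
    )
    return " ".join(term for _, term in placed)


original = "The index and the repository are not the same thing"
rebuilt = reconstruct(index_terms(original))
print(rebuilt)
```

The rebuilt text is readable, but the stop words and the original capitalization are gone for good.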

Secondly, not all search engine indices are replicated for redundancy. The assumption is that if you lose the file system where the content lives, you can still search; you just can't retrieve any documents until you restore the original content.

And some platforms do not give you a way to access the index, short of searching. And a search index is an index, not a repository.

Finally, some platforms are better at redundant failover of indices than others. If the platform you use is one of those that do not have redundancy BY DEFAULT (like some very popular platforms), and you use that index as the primary data store for your documents, and the index dies... you’re what we used to call SOL - ‘sure outta luck’.

The moral of the story? DO NOT USE A SEARCH INDEX AS THE PRIMARY DATASTORE. Specific enough?
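The safer pattern can be sketched in a few lines of Python: write every document to a primary store first, and treat the index as something you can throw away and rebuild. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for your real database, CMS or file system, a plain dict stands in for the search index, and `open_store`, `save_document` and `rebuild_index` are made-up names.

```python
import sqlite3


def open_store():
    """Open the primary store. In-memory here; use durable storage in practice."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
    )
    return conn


def save_document(conn, title, body):
    """Write the full document to the primary store and return its id."""
    cursor = conn.execute(
        "INSERT INTO documents (title, body) VALUES (?, ?)", (title, body)
    )
    conn.commit()
    return cursor.lastrowid


def rebuild_index(conn):
    """Recreate the (toy) index from scratch: term -> set of document ids."""
    index = {}
    rows = conn.execute("SELECT id, title, body FROM documents ORDER BY id")
    for doc_id, title, body in rows:
        for term in set(f"{title} {body}".lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


store = open_store()
first = save_document(store, "Backups", "keep the source documents")
second = save_document(store, "Indexing", "rebuild the index from source")

index = rebuild_index(store)
index.clear()                 # simulate a corrupt or lost index
index = rebuild_index(store)  # the store still holds every document
```

With the content in a real repository, a dead index costs you a reindex, not your documents.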

October 11, 2017

A Search Center of Excellence is still relevant for managing enterprise search

I was having a conversation with an old friend who manages enterprise search for her organization, a biotech company back east. We've worked together on search projects going back to my days at Verity - for you young'uns, that is what we call BG: 'before Google'.

Sometime after Google but before Solr, "Centers of Excellence" or "COEs" had become very popular. Based on an engagement we did around that time, we decided we could define the rules and responsibilities of a Search Center of Excellence or SCOE: the team that manages the full breadth of operation and management for enterprise search. We began preaching the gospel of the SCOE at trade show events and on our blog, where you can find that original article.

My friend and I had a great conversation about how successful they had been managing three generations of search platforms now with the SCOE; and how they still maintain the responsibilities the SCOE assumed years back with only a few meetings a year to review how search is doing, address any concerns, and map out enhancements as they become available. 

It worked then, and it works now. The SCOE is a great idea! Let me know if you'd like to talk about it.

September 28, 2017

Enterprise Search Newsletter: September 2017

 

 

Welcome to Volume 7, Issue 2 of the Enterprise Search Newsletter

from New Idea Engineering, Inc.

This month we start with What's New which includes an update of the Google Search Appliance saga; the current renaissance in enterprise search; and an extended product line for what I'd argue is an industry leader. We also cover:

Winning Methodologies for Enterprise Search

Like many organizations, you probably have an existing enterprise search solution serving intranet or customer-facing content, perhaps even e-commerce. You had great expectations for the solution, but it hasn't worked out the way you had hoped. Your users are unhappy and complain about not being able to find the information they need. Maybe customers continue to call your support group for answers since they cannot find help on your website. Or your sales remained flat or even dropped after the roll-out of the new search. What can you do? more

Why Is Enterprise Search Difficult?

Companies, government agencies, and other organizations maintain huge amounts of information in electronic form including spreadsheets, policy manuals, and web pages just to mention a few. The content may be stored in file shares, websites, content management systems or databases, but without the ability to find this corporate knowledge, managing even a small company would be difficult. more

The Search Whisperers

Several years ago, Toyota ran an ad in the San Francisco Bay Area featuring the then recently retired Steve Young of San Francisco 49ers fame. In the advert, he is chatting with a woman at a party and the woman asks, "What do you do for a living, Steve?" Rather than answer directly, Young replies with a question: "Do you follow sports? Football?" When the woman answers that she doesn't, Young's (truthful) reply? "I'm a lawyer." more

 

Finally, I'd be remiss if I didn't mention my September 2017 column at CMS Wire or the upcoming Enterprise Search and Discovery conference in DC in November. I hope to see you there!

Feel free to contact me with your questions or suggestions!

 

www.ideaeng.com Copyright 2017: New Idea Engineering

June 28, 2017

Poor data quality gives search a bad rap

If you’re involved in managing the enterprise search instance at your company, there’s a good chance that you’ve heard at least some users complain about the poor results they see.

The common lament search teams hear is “Why didn’t we use Google?” In fact, we’ve seen the same complaints at sites that implemented the GSA but don’t use the Google logo and look.

We're often asked to come in and recommend a solution. Sometimes the problem is simply using the wrong search platform: not every platform handles every use case and requirement equally well. Occasionally, the problem is a poorly configured or misconfigured search, or simply an instance that hasn’t been managed properly. Even the renowned Google public search engine doesn’t happen by itself, though that is a poor example: in recent years, Google search has become less of a search platform and more of a big data analytics engine.

Over the years, we’ve been helping clients select, implement, and manage Intranet search. In my opinion, the problem with search is elsewhere: Poor data quality. 

Enterprise data isn’t created with search in mind. There is little incentive for content authors to attach quality metadata in the properties fields of Adobe PDF Maker, Microsoft Office, and other document publishing tools. To make matters worse, there may be several versions of a given document as it goes through creation, editing, reviews, and updates. And often the early drafts, as well as the final version, are in the same directory or file share. Very rarely does public-facing website content have such issues.

Sometimes content management systems make it easy to implement what is really ‘search engine optimization’ or SEO; but it seems all too often that the optimization is left to the enterprise search platform to work out.

We have an updated two-part series on data quality and search, starting here. We hope you find it helpful; let us know if you have any questions!

June 22, 2017

First Impressions on the new Forrester Wave

The new Forrester Wave™: Cognitive Search And Knowledge Discovery Solutions is out, and once again I think Forrester, along with Gartner and others, misses the mark on the real enterprise search market.

In the belief that sharing my quick first impressions will at least get a conversation going until I can write up a more complete analysis, I am going to share these first thoughts.

First, I am not wild about the new buzzterms 'cognitive search' and 'insight engines'. Yes, enterprise search can be intelligent, but it's not cognitive, which Webster defines as "of, relating to, or involving conscious mental activities (such as thinking, understanding, learning, and remembering)". HAL 9000 was cognitive software; "Did you mean" and "You might also like" are not cognition. And enterprise search has always provided insights into content, so why the new 'insight engines'?

Moving on, I agree with Forrester that Attivio, Coveo and Sinequa are among the leaders. Honestly, I wish Coveo were fully multi-platform, but they do have an outstanding cloud offering that in my mind addresses much of the issue.

However, unlike Forrester, I believe Lucidworks Fusion belongs right up there with the leaders. Fusion starts with a strong open source Solr-based core; an integrated administrative UI; a great search UI builder (with the recent acquisition of Twigkit); and multiple-platform support. (Yep, I worked there a few years ago, but well before the current product was created).

I count IDOL in with the 'Old Guard' along with Endeca, Vivisimo (‘Watson’) and perhaps others - former leaders still available, but offered by non-search companies, or removed from traditional enterprise search (Watson). And it will be interesting to see if IDOL and its new parent, Micro Focus, survive the recent shotgun wedding.

Tier 2, great search but not quite “full” enterprise search, includes Elastic (which I believe is in the enviable position as *the* platform for IoT), MarkLogic, and perhaps one or two more.

And there are several newer or perhaps less well-known search offerings like Algolia, Funnelback, Swiftype, Yippy and more. Don’t hold their size and/or youth against them; they’re quite good products.

No, I’d say the Forrester report is limited, and honestly a bit out of touch with the real enterprise search market. I know, I know; How do I really feel? Stay tuned, I've got more to say coming soon. What do you think? Leave a comment below!