7 posts categorized "Findability"

January 06, 2020

It's a new year: Time for better metadata!

The new year is a time when most of us resolve to make changes in our personal lives: losing weight, exercising more, spending more time with a spouse and/or the kids. We start the year with great energy to meet our goals, but sadly many of us fall short through the year.

This often happens in the enterprise as well. Improving internal search is a common resolution at this time of the year. For eCommerce sites, January generally means fewer site visitors once the holiday rush is done, so making changes won’t have a great impact on sales. For corporations, it’s a time of new budgets and great expectations; and more than a few of the clients we’ve worked with over the years tell me how poorly their internal search performs compared to public search sites like Google, Bing, and DuckDuckGo. Why do these search platforms work so well? And why can’t your site search match their success? It’s a numbers game. By definition, public search platforms index millions of sites, and many of these contain similar if not identical content. This makes it easy to find what you’re looking for because thousands of sites have relevant results for just about any query you may try.

Intranet sites are different. Usually, there is only one page with the information you are looking for. But often, content authors who have read about how to promote content on Google will add keywords using Microsoft Word’s “Properties” field in an effort to promote their documents. This attempt to ‘game’ the internal search platform generally interferes with the platform’s relevance functions and results in poor result relevance. Even the Document Properties that Microsoft Word provides by default can interfere with search effectiveness.

Years ago, we were working with a client who was interested in knowing which employees were contributing to the intranet content. When the data was processed, it turned out that an Administrative Assistant in Marketing had authored more documents than anyone else in the corporation. After a quick review, we discovered why this one person was apparently more prolific than any other employee. That person had created all of the template forms used throughout the company, so the Word Document Properties listed that employee’s name as the author of virtually every standard template throughout the company.

So in the spirit of the new year, I’d suggest that you spend a day or two performing a data audit to discover where your content – or lack thereof – is negatively impacting your enterprise search results. And if you find any doozies – I’d love to hear about them!
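
One simple check in such an audit is to count documents per author and flag anyone who "wrote" an implausible share of the content, which is how the template problem above would show up. This is only a minimal sketch: the `author` field name, the sample records, and the 25% threshold are all assumptions made for illustration, not part of any particular search platform.

```python
from collections import Counter


def author_shares(documents):
    """Return (author, count, share) tuples, most prolific author first."""
    counts = Counter(doc.get("author") or "(missing)" for doc in documents)
    total = sum(counts.values())
    return [(author, n, n / total) for author, n in counts.most_common()]


def suspicious_authors(documents, max_share=0.25):
    """Authors credited with more than max_share of all documents (assumed threshold)."""
    return [a for a, _, share in author_shares(documents) if share > max_share]


# Hypothetical sample: four documents inherit the template creator's name.
sample_docs = [
    {"path": "forms/expense.docx", "author": "Template Owner"},
    {"path": "forms/travel.docx", "author": "Template Owner"},
    {"path": "hr/review.docx", "author": "Template Owner"},
    {"path": "it/request.docx", "author": "Template Owner"},
    {"path": "sales/plan.docx", "author": "J. Smith"},
    {"path": "misc/notes.docx", "author": ""},
]

flagged = suspicious_authors(sample_docs)
```

Anyone who lands in `flagged` is worth a manual look: the name may belong to a template creator rather than to the real author of each document.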



December 10, 2019

A Working Vacation

The month of January is associated with the Roman god Janus who, with two heads, could look forward and back. That said, I find December a quiet time that provides the opportunity to review the current year and to plan the coming new year. As I tweeted yesterday at @miles_kehoe, this is the most stressful time of the year for most sites focused on eCommerce. Changes are generally 'off-limits' - even an hour offline can put a dent in sales.

But for those responsible for corporate internal and public-facing sites, this is the time to review content, identify potential changes, and even plan new content. And if planned well, the holidays are often a great time to update intranet sites: from late November through the new year, activity tends to slow for most corporate sites. Both IT and content staff should be using this quiet time to make changes, from updates to current content - the new vacation schedule is just one that comes to mind - to minor restructuring. (Note: while the holidays are a great time to roll out major changes, these should have been in planning months ago: it's a holiday, not a sabbatical!)

For the search team, this is time to review search activity: top queries, zero hits, misspellings, and synonyms come to mind as a minimum effort. It's also a good time to identify popular content, as well as content that was either never part of any search result or was included in result lists but never viewed.
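
A first pass at that review needs little more than the query log. The sketch below assumes the log can be exported as (query, hit count) pairs; that format and the sample entries are hypothetical, and real platforms expose this data in their own ways.

```python
from collections import Counter


def review_query_log(entries, top_n=3):
    """Summarize top queries and zero-hit queries from (query, hit_count) pairs."""
    # Normalize case and whitespace so "Vacation Schedule" and
    # "vacation schedule" count as the same query.
    normalized = [(query.strip().lower(), hits) for query, hits in entries]
    top_queries = Counter(q for q, _ in normalized).most_common(top_n)
    zero_hits = Counter(q for q, hits in normalized if hits == 0).most_common()
    return {"top_queries": top_queries, "zero_hit_queries": zero_hits}


# Hypothetical log export.
sample_log = [
    ("vacation schedule", 12),
    ("Vacation Schedule", 12),
    ("expense form", 4),
    ("vaction schedule", 0),
    ("vaction schedule", 0),
    ("401k", 0),
]

report = review_query_log(sample_log)
```

Repeated zero-hit queries are the candidates for spelling suggestions and synonyms: here the misspelled "vaction schedule" would be worth mapping to the query that does return results.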

So - December is nearly half over: take advantage of what is normally a quiet time for intranets and make that site better!

Happy Holidays!


May 31, 2016

The Findwise Enterprise Search and Findability Survey 2016 is open for business

Would you find it helpful to benchmark your Enterprise Search operations against hundreds of corporations, organizations and government agencies worldwide? Before you answer, would you find that information useful enough that you’d spend a few minutes answering a survey about your enterprise search practices? It seems like a pretty good deal to me to have real-world data from people just like yourself worldwide.

This survey, the results of which are useful, insightful, and actionable for search managers everywhere, provides insight into many of the critical areas of search.

Findwise, the Swedish company with offices there and in Denmark, Norway, Poland and London, is gathering data now for the 2016 version of their annual Enterprise Search and Findability Survey at http://bit.ly/1sY9qiE.

What sorts of things will you learn?

Past surveys give insight into the difference between companies with happy search users versus those whose employees prefer to avoid using internal search. One particularly interesting finding last year was that there are three levels of ‘search maturity’, identifiable by how search is implemented across content.

The least mature search organizations, roughly 25% of respondents, have search for specific repositories (silos), but they generally treat search as ‘fire and forget’: once it is installed, there is no ongoing oversight.

More mature search organizations, which represent about 60% of respondents, have one search for all silos; but maintaining and improving search technology gets very little staff attention.

The remaining 15% of organizations answering the survey invest in search technology and staff, and continuously attempt to improve search and findability. These organizations often have multiple search instances tailored for specific users and repositories.

One of my favorite findings a few years back was that a majority of enterprises have “one or less” full time staff responsible for search; and yet a similar majority of employees reported that search just didn’t work. The good news? Subsequent surveys have shown that staffing search with as few as 2 FTEs improves overall search satisfaction; and 3 FTEs seem to improve it strongly. And even more good news: Over the years, the trend in enterprise search shows that more and more organizations are taking search and findability seriously.

You can participate in the 2016 Findwise Enterprise Search and Findability Survey in just 10 or 15 minutes and you’ll be among the first to know what this year brings. Again, you’ll find the 2016 survey at http://bit.ly/1sY9qiE.

October 30, 2012

Link to cool story of Lucene/Solr 4.0's new fast Fuzzy Search

Interesting article with lots of links to other good resources.  Tells the story of a lot of open source cross-pollination and collaboration, automata, Levenshtein, and even a dash of Python - thanks Mike!


September 06, 2012

Got OGP? A Social Media Lesson for the Enterprise

    Anytime you decide to re-post that article hot off the virtual press from a site like nyt.com or Engadget to your social network of choice, odds are strong that its content crosses the news-media-to-social-media gap via a metadata standard called the Open Graph Protocol, or OGP.  OGP facilitates grabbing the article's title, its content-type, an image that will appear in the article's post on your profile, and the article's canonical URL.  It's a simple standard based on the usual HTML metadata tags that actually predate Facebook and Google+ by over a decade (OGP's metadata tags can be distinguished by the "og:" prefix on each property name, e.g. "og:title", "og:description", etc.).  And despite its Facebook origins, OGP's success should strongly inform enterprise metadata policies and practices in one basic, crucial area.

    The key to OGP's success on the public internet lies largely in its simplicity.  Implementing OGP requires the content creator to fill in just the four aforementioned metadata fields:

  • the content's URL (og:url)
  • its title (og:title)
  • its type (og:type)
  • a representative image (og:image)

     A great number of other OGP metadata fields certainly do exist, and should absolutely be taken advantage of, but only these four need to be defined in order for a page to be considered OGP-compliant.
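
     To make the four-field minimum concrete, here is a small sketch that reads a page's "og:" meta tags with Python's standard html.parser module and reports which of the four are missing.  The sample page and its example.com URL are made up for illustration, and this is one way to run the check, not an official OGP validator.

```python
from html.parser import HTMLParser

REQUIRED_OG = ("og:title", "og:type", "og:image", "og:url")


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> values from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        content = attr_map.get("content")
        if prop.startswith("og:") and content:
            # Keep the first value seen for each property.
            self.properties.setdefault(prop, content)


def missing_og_fields(html):
    """Return the required OGP properties that the page does not define."""
    parser = OpenGraphParser()
    parser.feed(html)
    return [name for name in REQUIRED_OG if name not in parser.properties]


# Hypothetical page that forgot its og:image tag.
sample_page = """
<html><head>
<meta property="og:title" content="Got OGP?" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://example.com/got-ogp" />
</head><body></body></html>
"""

missing = missing_og_fields(sample_page)
```

     An empty result means the page carries all four required properties; anything listed in `missing` still needs to be filled in.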

     What can we immediately learn here from OGP that applies to metadata in the enterprise?  The enterprise content-creation and/or content-import process should involve a clearly-defined and standardized minimum set of metadata fields that should be present in every document *before* that document is added to the CMS and/or indexed for search.  NYT.com certainly doesn't push out articles without proper OGP, and enterprise knowledge workers need to be equally diligent in producing documents with the proper metadata in place to find them again later!  Even if practical complications make that last proposition difficult, many Content Management Systems can be set up to suggest a basic set of default values automagically for the author to review at submission time.  Just having a simple, minimum spec in place for the metadata fields that are considered absolutely mandatory will generally improve base-line metadata quality considerably.
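
     A submission-time check along those lines might look like the sketch below.  The three required fields, the way defaults are derived from the filename and the submitter, and the function name are all assumptions chosen for illustration; your own minimum spec will differ, and a real CMS would hook this in through its own extension points.

```python
from datetime import date

# Hypothetical minimum spec; substitute the fields your users actually need.
REQUIRED_FIELDS = ("title", "author", "created")


def check_metadata(metadata, filename, submitter, today=None):
    """Return (missing_fields, suggested_defaults) for a document about to be indexed."""
    today = today or date.today()
    stem = filename.rsplit(".", 1)[0]
    defaults = {
        "title": stem.replace("_", " ").replace("-", " ").title(),
        "author": submitter,
        "created": today.isoformat(),
    }
    # A field counts as missing if it is absent, empty, or only whitespace.
    missing = [
        field for field in REQUIRED_FIELDS
        if not str(metadata.get(field) or "").strip()
    ]
    suggestions = {field: defaults[field] for field in missing}
    return missing, suggestions


missing, suggestions = check_metadata(
    {"title": "Q3 Plan", "author": " "},
    filename="q3_plan.docx",
    submitter="jdoe",
    today=date(2012, 9, 6),
)
```

     The suggestions are meant for the author to review and correct, not to be written into the document silently; a default author taken from the submitter can be just as misleading as one inherited from a template.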

    What should this minimum set of metadata fields include for your specific enterprise content? It's hard to make exact recommendations, but let's consider the problem that OGP's designers were trying to solve in the case of web-content metadata: people want a simple preview of the content they're sharing from some content source, with sufficient information to identify that content's basic subject-matter and provenance, and (perhaps most importantly!) a flashy image that stands out on their profile.  OGP's four basic requirements fit exactly these specs.  What information do your knowledge workers always need from their documents?  Perhaps the date-of-creation is a particularly relevant data-point for the work they're doing, or perhaps they often need to reference a document's author.  Whatever these fields might actually be, spending some time with the people who end up using your enterprise documents' metadata is the best way to find out.  And even if their baseline needs are dead simple, like the problem OGP manages to solve so succinctly, your default policy should be to just say NO to no metadata.  Your search engine will thank you.

    A natural question might arise from this case-study: should you actually just start using OGP in the enterprise?  It's not necessarily the best option, since intranet search-engine spiders and indexers might not know about OGP fields yet.  In any case, you'll definitely still want to have a regular title, description, etc. in your documents as well.  As of the time-of-writing, OGP is still best suited to the exact niche it was designed to operate in: the public internet.  Replicating the benefits it provides within the enterprise environment is an important goal.

March 23, 2012

Webinar: Is bad metadata costing you money?

We've planned a webinar that will help identify whether you have a metadata problem, what you can do to fix it, and how to justify the cost.

Despite what some vendors claim, enterprise search platforms rely on good metadata in order to deliver quality results. Yet few organizations have the resources to attack their metadata problems, so findability suffers and users lament "Why don't we use Google?"  Search managers know that even the Google Search Appliance, without quality metadata, can't deliver the internet search experience end users know, love, and trust. Yet it's hard to justify the time and effort to improve metadata in hopes of a better search experience.

In this webinar we will consider the issue of bad metadata, ways to address the problem, and some ideas on what the ROI can be. We will discuss:

  • Do you have a metadata problem?
  • How much is it costing you?
  • What is the risk of bad metadata?
  • What tools are available?
  • What will it cost to fix?  
  • What's the ROI of improved metadata?


We're hosting the webinar twice: Wednesday, April 11th at 11AM Pacific time (GMT-7), and again on Thursday, April 12th at 8:30AM Pacific time. Click the link for the session you'd like to attend.

See you then!

February 21, 2012

10 changes Wikipedia needs to become more Human and Search Engine Friendly

There's a really nice set of examples comparing JSON to other similar formats like YAML, Python, PHP, PLists, etc.  It was in a Wikipedia article, but you won't see it now unless you know to go looking through the version history (link in previous sentence).

This content had existed for quite a while in that article, and had been contributed to by many people.  One day in March 2011 one editor decided it was irrelevant and gutted that entire section.  The information was useful; I was actually looking for it today!  I happened to think of reviewing the version history since I was almost sure that's where it had been.

The editors at Wikipedia need to be able to delete content, for any number of reasons, and I'm sure it's a thankless job.  And there are procedures for handling disputed edits - I've pinged that editor who deleted it about maybe finding a new home for the content.  Also, ironically, I found an out of date copy of the page here that still has it.

I'm not in favor of a ton of rules, but I believe wholesale deletes of long-existing and thriving content should get some special attention.  To be clear, I'm not talking about content that's clearly wrong or profane or whatever, or gibberish that was just added.  How about as a first pass "a paragraph or more that has existed for more than 6 months (long-lived) and has been contributed to (not just corrected) by at least 3 people (thriving)".
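
That first-pass rule is simple enough to write down as a check.  This is only a sketch of the proposal: the function name is made up, "6 months" is approximated as 182 days, and telling real contributions from mere corrections is left to whoever builds the contributor list.

```python
from datetime import date, timedelta


def is_long_lived_and_thriving(first_added, contributors, today,
                               min_age_days=182, min_contributors=3):
    """Apply the proposed rule: older than ~6 months and 3+ distinct contributors."""
    long_lived = (today - first_added) > timedelta(days=min_age_days)
    thriving = len(set(contributors)) >= min_contributors
    return long_lived and thriving


# Hypothetical section added in early 2010 and reviewed in March 2011.
needs_review = is_long_lived_and_thriving(
    first_added=date(2010, 1, 15),
    contributors=["editor_a", "editor_b", "editor_c", "editor_a"],
    today=date(2011, 3, 1),
)
```

A delete that trips this check wouldn't be blocked, just routed for the extra attention described above.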

Human-Centric Policy & Tech Changes:

  • If the content's only "crime" is that it's somewhat off-topic then the person doing the deleting ought to find another home for it.  The editor could either move it to another page, or possibly even split up the current page; maybe they could "fork" the page into 2 pages, then cross link them, and then remove duplicate content, so then page 1 retains the original article and links to page 2, and then page 2 has the possibly-off-topic content. Yes, this would take more effort for the "deleting" editor, BUT what about the large amount of effort the multiple contributors put into it, including going the extra step to try and conform to Wikipedia's policies so that it would NOT get deleted?  I also suspect that senior editors, those more likely to consider wholesale deletes, are probably much more efficient at splitting up a page or moving content somewhere else - novice contributors might be unaware, or only vaguely aware, that such things are even possible.
  • Wikipedia should make it easier for contributors to find content of theirs that's been deleted.  This is a somewhat manual process now.  Obviously they don't want to promote "edit wars".
  • Wikipedia should generally track large deletes (maybe they do?)
  • Wikipedia should "speed up" its diff viewer.  It should run faster so you can zip through a bunch of changes, and maybe even include "diff thumbnails".  The UI makes sense to developers used to source code control systems, but is probably confusing to most others.  I realize this is all easier said than done!
  • Wikipedia should include some visual indication of "page flux".  It would be helpful, for example, if a young person could see at a glance that abortion and gun control are highly debated subjects between adults.
  • Wikipedia should be a bit more visually proactive in educating visitors that there are other versions of a page.  I'm sure Wikipedia veterans would say "more visible!? - there's a bold link in a tab at the top of every page!"  While that's true, it just doesn't convey it to casual visitors.  On the other hand, it shouldn't go too far and be annoying about it - like car GPS systems that make you agree to the legal disclaimer page every time you get in the car!

Search Engine Related Changes:

  • Wikipedia search (I use the Google option) should have an option to expand the scope of search to include deleted content.  This shouldn't be the default, and there are presentation issues to be considered.  Some deletes are in the middle of sentences, and there are multiple deletes and edits, etc, so I realize it's not quite as easy as it may sound.
  • There needs to be a better way to convey this additional data to public search engines, as well as representing it in their own built in engine.
  • Wikipedia should consider rendering full content and all changes inline, using some type of CSS / HTML5 dynamic mode that marks suspect or deprecated content with tags, instead of removing it.  Perhaps the search engines could also request this special version of the page and assign relevance accordingly.
  • Perhaps Wikipedia could offer some alternative domain name for this somewhat messier version of the data, something like "garaga.en.wikipedia.org".
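
As a rough illustration of the "mark it, don't remove it" idea, the sketch below compares two revisions word by word with Python's standard difflib module and wraps removed text in a tag instead of dropping it.  The `<del class="suspect">` markup and the sample sentences are assumptions for the example; this says nothing about how Wikipedia's own software stores or renders revisions.

```python
import difflib
from html import escape


def render_with_deletions(old_text, new_text):
    """Render the new revision, keeping deleted words inline inside <del> tags."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    parts = []
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op in ("delete", "replace"):
            # Text that existed in the old revision but not the new one.
            removed = escape(" ".join(old_words[a1:a2]))
            parts.append(f'<del class="suspect">{removed}</del>')
        if op in ("insert", "replace", "equal"):
            # Text present in the new revision.
            parts.append(escape(" ".join(new_words[b1:b2])))
    return " ".join(parts)


# Hypothetical before/after revisions of one paragraph.
old_revision = "JSON is simple. YAML is a superset."
new_revision = "JSON is simple."

marked_up = render_with_deletions(old_revision, new_revision)
```

A stylesheet could then hide or gray out the flagged text by default, and a search engine that asks for this version could still index it while weighting it lower.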

It's Not Just A or B:

  • Whenever I hear people lament the declining content contributions on Wikipedia I have to chuckle.  It's incredibly demoralizing to have content deleted after you took the time to contribute it.  If a new contributor on Wikipedia discovers 1 or 2 of their first edits promptly deleted, trust me, they're very unlikely to try again.  I know a number of people who have just given up.
  • Others would say that if you put more pressure on editors to not delete, then the overall quality of Wikipedia will go down, and raving nut-jobs and spammers will engulf the site.
  • The compromise is to flag content (which is very similar to tracking diffs) and give users and search engines some choice in whether they want to see "suspect" content or not.

This is about survival and participation.  When newer contributors have their content "flagged" vs. "deleted", with more explanations and recourse, they will still learn to uphold Wikipedia's quality standards without being too discouraged.  They'll hang around longer and practice more.

An analogy: Wikipedia's current policy is like potty training your kid with a stun gun - make one mistake and ZAP! - or don't bother and just keep going in your diaper like you've always done.

I understand and appreciate all the work that Wikipedia's volunteers do, but I think there are some constructive things that could be done better.