« August 2012 | Main | October 2012 »

6 posts from September 2012

September 21, 2012

Kipling on the future of Autonomy

There's a line in Rudyard Kipling's poem 'If" I always found inspiring:

"Or watch the things you gave your life to, broken,
And stoop and build'em up with worn-out tools;"

Born some 150 years ago in South Africa, Englishman Rudyard Kipling seems to have little to offer for Autonomy founder, former HP employee, and newly minted venture capitalist Mike Lynch. On the other hand, I think I may see an investment opportunity he may like.

Do you suppose he would invest in a company that grew to be a leader in its field, but which languished after acquisition by a giant firm seemingly past its prime? Would he seek to acquire such a company, perhaps at a significant discount to the price paid to acquire it just a few years ago - and make that company back into the respected - and feared - leader it once was? Let's say selling a company for $11B in 2011 and buying it back for $3B in 2013? 

I can't say; I'm just a small business owner with little experience in VC. On the other hand, think what a coup it would be for Autonomy to rise from the ashes, and confirm the genius of its founder.

Stay tuned.


September 18, 2012

Better Handling of Model Numbers and Software Versions

In a recent post I talked about different ways to tokenizer your data.  Today I'll extend that by talking about tokenizing text that has a small amout of structure in it, and the relationship between tokenziation and Entity Extraction.

Although this post talks about eCommerce related items Model Numbers and Version Numbers, this same logic could also be applied to dates, amounts of money, social security numbers, phone numbers, ISBN numbers, patent references, legal citations, etc. 

Better handling of Model Numbers

Note: Parts suppliers often have product names that look more like model numbers, so they might benefit from this as well.

It would be possible use field specific tokenization rules, in conjunction with search time logic, to allow for more superior partial matches. In a manner somewhat analogous to the previous section (Improved Tokenization for Punctuation), structured product names could be broken down into components, and also maintained in their original form, and overload the tokens in the index.

Search time patterns could also possibly enhance this search logic.

Potential advantages:

  • Ability to rank more exact matches (if the user types the longer form)
  • More predictable partial matches
  • Could enable normalized sorting and field collapsing
  • Could link from more specific to less specific and vice versa
  • Possibly improve autosuggest searches
  • Avoid use of wildcards (although this isn't a problem in some search engines)

Normalizing Version Numbers

Technical websites have a great deal of software and drivers, with many version numbers. Similar to the methods suggested for model numbers, these special numbers could be recognized and normalized as they're added to the index. Potential advantages:
  • Allow for proper version number sorting within one software component or driver (there is no absolute scale that’s comparable across disparate software)
  • Allow for proper partial matches
  • Allow for proper range searches
  • Possibly add an additional sentinel tokens for “latest” Entity Extraction / Normalization

Depending on the search engine, there might not be much implementation difference between normalizing model and version numbers as mentioned previously, and doing full entity extraction. However, regardless of implementation similarity, designing for full Entity Extraction elicits a more complete functional spec and UI treatment.

Benefits of full Entity Extraction over simple normalized tokenization:

  • Usually includes using the extracted entities in Faceted Navigation. If some silos already have good metadata for Facets but other silos lack it, this might allow those other silos to have almost comparable values for the same data type (via extraction vs. well defined metadata) and have more consistence coverage for faceted search.
  • Encourages further thought as to the preferred canonical internal representation and display format for each type of entity.
There is one potential issue with the first point, using entity extraction for faceted search: the text of a document may reference many valid entities, while the document itself is only primarily related to one or two of them, so there may be a tendency towards “Facet Inflation”. This can sometimes be mitigated by having several classes of the same facet, but where the scope of one type is more heavily restricted by having it only pull values from key parts of the document, such as title or model number.

September 13, 2012

Improved Tokenization for Punctuation Characters

Search appliations that are geared towards technical users often have problems with searches like .net, C#, *.*, etc.

In some cases these can be handled solely at the application level. For example, many Unix utility command line options begin with a hyphen, which means “NOT” to many search engines, so users searching for "-verbose" will find every document EXCEPT the ones that discuss the verbose option.

This can often be handled by just stripping off the minus sign before submitting the query to the engine. (depending on the engine and its configuration)

If there's always additional text in the search, a cheap workaround is to just consistently drop the same punctuation characters at both index time and search time. As long as "TCP/IP" is consistently reduced to [ tcp, ip ], users will have a good chance of finding it.

But what is punctuation is all you have? Somebody really needs to search for -(*) for example? What then? There's a strong tendency to balk at these uses cases, to claim that they are obscure edge cases, and rationalize why they should be ignored. But this edge case argument is old and stale - if your site truly needs to search punctuation rich content, then it may be worth the cost. Long search tails, which are common on technical search applications, can add up to substantial percentage of overall traffic!

Many punctuation problems need to be handled at index time, or in addition to special search time logic. For example, if asterisks are important, they can be stored as actual tokens in the fulltext index. At search time the asterisks would also need to be handled appropriately, since most search engines would either ignore them or assume they are part of a wildcard search.

The point is that, regardless of what you do with asterisks at search time, they cannot be found at all if they didn’t make it to the index, and were instead discarded at index time.

Token Overloading can be used to put multiple tokenized representations of the same text into the index. For example, a hyphenated phrase like "XY-1234" is found in a source document at index time, it can be injected as [ (xy-1234), (xy, 1234), (xy,-,1234), (xy1234) ]. Although this inflates the index size, it gives maximum flexibility at search time.

Don't confuse "don't need to do something" with "don't know how to do something", get your punctuation problems sorted out properly!

Hmm... ironically our own web properties don't follow this advice, and we certainly do attract a number of techies!  I could rationalize that punctuational search isn't a large percentage of our traffic, but the real reason is that we use hosted services and don't have full control over own their search engines.  But do as we say, not as we do, and remember that honest confessions are good for the soul.

September 11, 2012

Are you Tracking MRR? - "Mean Reciprocal Rank" Trend Monitoring

MRR is a simple numerical technique to monitor the overall relevancy performance of search engines over time. It is based on click-throughs in the search results, where a click on the top document is scored as 100%, a click on the second document is 50%, 3rd document is 33%, etc. These numbers are collected and averaged over units of time.

The absolute value of MRR is not necessarily the important statistic because each site has different content, different classes of users, and different search technology. However, the trend of MRR over time can allow a site to spot changes quickly. It can also be used to score "A/B" testing.

There are certainly more elaborate algorithms that can be used, and MRR doesn’t account for whether a user liked the document once they opened it. But having even possibly imperfect performance data that can be trended over time is better than having nothing.

Reference: http://en.wikipedia.org/wiki/Mean_reciprocal_rank

Walter Underwood (of Ultraseek, Netflix, MarkLogic fame) gave a presentation (in PPT/PowerPoint) of this topic a couple years ago about NetFlix's use of MRR.

September 06, 2012

Got OGP? A Social Media Lesson for the Enterprise

    Anytime you decide to re-post that article hot off the virtual press from a site like nyt.com or Endgadget to your social network of choice, odds are strong that its content crosses the news-media-to-social-media gap via a metadata standard called the Open Graph Protocol, or OGP.  OGP facilitates grabbing the article's title, its content-type, an image that will appear in the article's post on your profile, and the article's canonical URL.  It's a simple standard based on the usual HTML metadata tags that actually predate Facebook and Google+ by over a decade (OGP's metadata tags can be distinguished  by the "og:" prefix on each property name, e.g. "og:title", "og:description", etc.)  And despite its Facebook origins, OGP's success should strongly inform enterprise metadata policies and practices in one basic, crucial area.

    The key to OGP's success on the public internet lies largely in its simplicity.  Implementing OGP requires the content creator to fill in just the four aforementioned metadata fields:

  • the content's URL (og:url)
  • its title (og:title)
  • its type (og:type)
  • a representative image (og:image)

     A great number of other OGP metadata fields certainly do exist, and should absolutely be taken advantage of, but only these four need to be defined in order for a page to be considered OGP-compliant.

     What can we immediately learn here from OGP that applies to metadata in the enterprise?  The enterprise content-creation and/or content-import process should involve a clearly-defined and standardized minimum set of metadata fields that should be present in every document *before* that document is added into CMS and/or indexed for search.  NYT.com certainly doesn't push out articles without proper OGP, and enterprise knowledge workers need to be equally diligent in producing documents with the proper metadata in place to find them again later!  Even if practical complications make that last proposition difficult, many Content Management Systems can be setup to suggest a basic set of default values automagically for the author to review at submission time.  Just having a simple, minimum spec in place for the metadata fields that are considered absolutely mandatory will generally improve base-line metadata quality considerably.

    What should this minimum set of metadata fields include for your specific enterprise content? It's hard to make exact recommendations, but let's consider the problem that OGP's designers were trying to solve in the case of web-content metadata: people want a simple preview of the content they're sharing from some content source, with sufficient information to identify that content's basic subject-matter and providence, and (perhaps most importantly!) a flashy image that stands out on their profile.  OGP's four basic requirements fit exactly these specs.  What information do your knowledge workers always need from their documents?  Perhaps the date-of-creation is a particularly relevant data-point for the work they're doing, or perhaps they often need to reference a document's author.  Whatever these fields might actually be, spending some time with the people who end up using your enterprise documents' metadata is the best way to find out.  And even if their baseline needs are dead simple, like the problem OGP manages to solve so succinctly, your default policy should be to just say NO to no metadata.  Your search engine will thank you.

    A natural question might arise from this case-study: should you actually just start using OGP in the enterprise?  It's not necessarily the best option, since intranet search-engine spiders and indexers might not know about OGP fields yet.  In any case, you'll definitely still want to have a regular title, description, etc. in your documents as well.  As of the time-of-writing, OGP is still best suited to the exact niche it was desinged to operate in: the public internet.  Replicating the benefits it provides within the enterprise environment is an important goal.

September 05, 2012

The "Gotcha's" of Disk and Memory Issues with Search

Here are some performance problems that can be caused by shared disk resources or RAM memory.

SAN / NAS Disk Latency

Note: Both SAN and NAS are types of shared disk drives that can be used by multiple machines.

Many NIE customers are using SAN storage for search indices and sometimes for content. Although this is becoming a more common practice, there is one issue to be particularly vigilant for. Search uses storage somewhat differently than other applications such as multimedia storage; search makes many round trips to disk, so the latency of all these serial transactions can stack up, and latency can be more of an issue than raw bandwidth.

Symptoms of Serial Latency Issues:

  • Other applications on similar machines, or using the same storage, not reporting performance problems, but search indexing or retrieval is slow.
  • Large performance differences between the same search application running in different environments, such as dev vs. staging.
  • Sudden changes in search performance when only “minor” changes were made to systems or networking. Other problems with network storage are thankfully seen much less often in modern systems:
  • Filehandle limitations, or different limits between local and network filehandles.
  • Exact order of transactions not maintained
  • File locking and cascading system failures. Note that some search engines may still have file locking limitations with simultaneous transactions, regardless of where the storage is located.

Memory and Virtual Machines

Search engines can be heavy users of RAM. If servers are hosting multiple applications, or many virtual servers which are each running RAM intensive apps, performance can suffer. Most operational teams are aware of this problem and try to avoid it. But there are ways this can sneak up on even the best of teams:
  • Performance degrades slowly over time
  • Performance degrades sporadically is therefore harder to analyze
  • Search might simply the first application that is noticeably impacted by resource constraints
  • Performance is greatly impacted by other applications, but at irregular intervals, the “bump in the night”. Organizations are often unaware of their full application loading picture over the course of an entire week or month.
  • Memory and performance can be affected by the activities of other virtual servers running on the same host. If too many virtual servers are consuming lots of memory, the physical host may need to start swapping to disk or otherwise constrain or delay memory access.
  • Performance varies between different environments on seemingly similar systems. The similar systems may actually have very different sets of applications or run schedules. However, such differences can also be related to SAN or NAS storage issues.
  • Parts of the Linux OS may allocate unused memory for other purposes, so operators become accustomed to seeing low available memory. At later times it’s then harder to spot true memory shortages vs. the “normal” low memory cause.
  • Some part of the OS or application stack is accidently using a 32 bit subsystem instead of 64 bit, perhaps as the result of a recent software update.