69 posts categorized "Technical"

March 18, 2013

Solr 4 Training 3/27 in Northern Virginia/DC area

Interrupting my series on whether open source search is a good idea in the enterprise to tell you about an opportunity to attend LucidWorks' Solr Bootcamp in Reston, Virginia on Wednesday March 27. Lucid staff and Lucene/Solr committers Erick Erickson and Erik Hatcher will be there, along with Solr pro Joel Bernstein. Heck, I'll even be there!

The link is here; for readers of our blog, use discount code SOLR4VA-5OFF for a discount.

Course Outline:

  • What's new in Solr 4
  • Solr 4 Functional Overview
  • Solr Cloud Deep Dive
  • Solr 4 Expert Panel Case Studies
  • Workshop and Open lab

And ask the guys how you can get involved in Solr as a contributor or committer!


December 18, 2012

Last call for submitting papers to ESS NY

This Friday, December 21, is the last day for submitting papers and workshops to ESS in NY on May 21-22. See the information site at the Enterprise Search Summit Call for Speakers page.

If you work with enterprise search technologies (or supporting technologies), chances are the things you've learned would be valuable to other folks. If you have an in-depth topic, write it up as a 3 hour workshop; if you have a success story, or lessons learned you can share, submit a talk for a 30-45 minute session.

I have to say, this conference has enjoyed a multi-year run in terms of quality of talks and excellent Spring weather. See you in May?



October 30, 2012

Link to cool story of Lucene/Solr 4.0's new fast Fuzzy Search

Interesting article with lots of links to other good resources.  Tells the story of a lot of open source cross-pollination and collaboration, automatons, Levenshtein, and even a dash of Python - thanks Mike!


October 26, 2012

Deep Solr in London and New York

Last week I had the pleasure of conducting a workshop at the recent Enterprise Search Summit on open source tools including Solr, Lucene, and some of the commercial products based on these tools. To a lesser extent we also covered ElasticSearch, SearchBlox, Alcove9, and a few other platforms, as well as a number of open source and commercial tools that support enterprise search.

One thing many of the attendees had in common was that they had been experimenting with Lucene/Solr for a while, but many were skeptical that they were ready to dive into a deep project on their own.

While that sentiment is no problem for me - after all, we provide services around implementing both open source and commercial search to our customers - I know many companies want to have expertise in-house.

For those of you who are looking for those skills, you might be interested in a post I just saw from LucidWorks. They are offering a developer course titled 'Everything you always wanted to know about Solr' in both New York City and London, England during November. If you've been experimenting with Solr in-house (or on your own) for a while now, and you're ready to move to the next level, you might give some thought to registering for one of these classes.

It will cover the usual Solr topics, but also replication, sharding, and all the things you need to know to really use Solr in production search. Take a look and see if it's right for you.





September 18, 2012

Better Handling of Model Numbers and Software Versions

In a recent post I talked about different ways to tokenize your data.  Today I'll extend that by talking about tokenizing text that has a small amount of structure in it, and the relationship between tokenization and Entity Extraction.

Although this post talks about eCommerce-related items such as Model Numbers and Version Numbers, this same logic could also be applied to dates, amounts of money, social security numbers, phone numbers, ISBN numbers, patent references, legal citations, etc.

Better handling of Model Numbers

Note: Parts suppliers often have product names that look more like model numbers, so they might benefit from this as well.

It would be possible to use field-specific tokenization rules, in conjunction with search time logic, to allow for better partial matches. In a manner somewhat analogous to the previous section (Improved Tokenization for Punctuation), structured product names could be broken down into components while also being maintained in their original form, with all of these tokens overloaded into the index.

Search time patterns could also possibly enhance this search logic.

Potential advantages:

  • Ability to rank more exact matches (if the user types the longer form)
  • More predictable partial matches
  • Could enable normalized sorting and field collapsing
  • Could link from more specific to less specific and vice versa
  • Possibly improve autosuggest searches
  • Avoid use of wildcards (although this isn't a problem in some search engines)

Normalizing Version Numbers

Technical websites have a great deal of software and drivers, with many version numbers. Similar to the methods suggested for model numbers, these special numbers could be recognized and normalized as they're added to the index. Potential advantages:

  • Allow for proper version number sorting within one software component or driver (there is no absolute scale that’s comparable across disparate software)
  • Allow for proper partial matches
  • Allow for proper range searches
  • Possibly add an additional sentinel token for “latest”

Entity Extraction / Normalization

Depending on the search engine, there might not be much implementation difference between normalizing model and version numbers as mentioned previously, and doing full entity extraction. However, regardless of implementation similarity, designing for full Entity Extraction elicits a more complete functional spec and UI treatment.
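As a rough illustration of the simpler end of that spectrum, here is a minimal Python sketch of normalizing version numbers at index time so that plain string comparison gives proper sorting and range searches. The function name `version_sort_key` and the padding width are hypothetical choices for this example, and it assumes purely numeric, dot-separated components; suffixes such as "beta" are simply ignored.

```python
import re


def version_sort_key(version, width=6):
    """Zero-pad each numeric component so that string order matches version order.

    Assumption: only the digit runs matter; any letters are ignored.
    """
    parts = re.findall(r"\d+", version)
    return ".".join(part.zfill(width) for part in parts)


versions = ["2.10.1", "2.9", "10.0", "2.9.3"]
ordered = sorted(versions, key=version_sort_key)
# ordered is ["2.9", "2.9.3", "2.10.1", "10.0"]
```

The padded key would be stored in its own field next to the original text, and only compared within one software component or driver, since the scale is not meaningful across disparate software.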

Benefits of full Entity Extraction over simple normalized tokenization:

  • Usually includes using the extracted entities in Faceted Navigation. If some silos already have good metadata for Facets but other silos lack it, this might allow those other silos to have almost comparable values for the same data type (via extraction vs. well defined metadata) and have more consistent coverage for faceted search.
  • Encourages further thought as to the preferred canonical internal representation and display format for each type of entity.

There is one potential issue with the first point, using entity extraction for faceted search: the text of a document may reference many valid entities, while the document itself is only primarily related to one or two of them, so there may be a tendency towards “Facet Inflation”. This can sometimes be mitigated by having several classes of the same facet, but where the scope of one type is more heavily restricted by having it only pull values from key parts of the document, such as title or model number.

September 13, 2012

Improved Tokenization for Punctuation Characters

Search applications that are geared towards technical users often have problems with searches like .net, C#, *.*, etc.

In some cases these can be handled solely at the application level. For example, many Unix utility command line options begin with a hyphen, which means “NOT” to many search engines, so users searching for "-verbose" will find every document EXCEPT the ones that discuss the verbose option.

This can often be handled by just stripping off the minus sign before submitting the query to the engine (depending on the engine and its configuration).

If there's always additional text in the search, a cheap workaround is to just consistently drop the same punctuation characters at both index time and search time. As long as "TCP/IP" is consistently reduced to [ tcp, ip ], users will have a good chance of finding it.
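The cheap workaround above can be sketched in a few lines of Python. The function name `normalize` is hypothetical, and real engines do this inside their analyzer configuration; the only point is that the exact same function runs at index time and at search time.

```python
import re
from typing import List

# Anything that is not a word character or whitespace is treated as punctuation.
_PUNCTUATION = re.compile(r"[^\w\s]+")


def normalize(text: str) -> List[str]:
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return _PUNCTUATION.sub(" ", text.lower()).split()


index_tokens = normalize("Configuring TCP/IP")   # applied to document text
query_tokens = normalize("tcp/ip")               # applied to the user's query
```

Because a leading hyphen is punctuation too, `normalize("-verbose")` yields just `["verbose"]`, which also covers the minus-sign case mentioned earlier.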

But what if punctuation is all you have? Somebody really needs to search for -(*) for example? What then? There's a strong tendency to balk at these use cases, to claim that they are obscure edge cases, and rationalize why they should be ignored. But this edge case argument is old and stale - if your site truly needs to search punctuation rich content, then it may be worth the cost. Long search tails, which are common on technical search applications, can add up to a substantial percentage of overall traffic!

Many punctuation problems need to be handled at index time, or in addition to special search time logic. For example, if asterisks are important, they can be stored as actual tokens in the fulltext index. At search time the asterisks would also need to be handled appropriately, since most search engines would either ignore them or assume they are part of a wildcard search.

The point is that, regardless of what you do with asterisks at search time, they cannot be found at all if they didn’t make it to the index, and were instead discarded at index time.

Token Overloading can be used to put multiple tokenized representations of the same text into the index. For example, if a hyphenated phrase like "XY-1234" is found in a source document at index time, it can be injected as [ (xy-1234), (xy, 1234), (xy,-,1234), (xy1234) ]. Although this inflates the index size, it gives maximum flexibility at search time.
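Here is a minimal Python sketch of generating those overloaded forms for a hyphenated term. The function name `overload_tokens` is hypothetical, and it only produces the token sequences; how they get injected at the same position in the index depends entirely on the search engine.

```python
import re
from typing import List, Tuple


def overload_tokens(term: str) -> List[Tuple[str, ...]]:
    """Return alternate tokenizations of a hyphenated term, without duplicates."""
    lowered = term.lower()
    # Splitting on a captured hyphen keeps the hyphen as its own piece.
    pieces = [p for p in re.split(r"(-)", lowered) if p]
    words = [p for p in pieces if p != "-"]
    variants = [
        (lowered,),           # original form, e.g. xy-1234
        tuple(words),         # components only, e.g. xy, 1234
        tuple(pieces),        # components plus the hyphen, e.g. xy, -, 1234
        ("".join(words),),    # squashed form, e.g. xy1234
    ]
    unique = []
    for variant in variants:
        if variant not in unique:
            unique.append(variant)
    return unique


forms = overload_tokens("XY-1234")
```

A term with no hyphen collapses to a single form, so the index only grows for text that actually has the structure.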

Don't confuse "don't need to do something" with "don't know how to do something". Get your punctuation problems sorted out properly!

Hmm... ironically our own web properties don't follow this advice, and we certainly do attract a number of techies!  I could rationalize that punctuational search isn't a large percentage of our traffic, but the real reason is that we use hosted services and don't have full control over their search engines.  But do as we say, not as we do, and remember that honest confessions are good for the soul.

September 11, 2012

Are you Tracking MRR? - "Mean Reciprocal Rank" Trend Monitoring

MRR is a simple numerical technique to monitor the overall relevancy performance of search engines over time. It is based on click-throughs in the search results, where a click on the top document is scored as 100%, a click on the second document is 50%, 3rd document is 33%, etc. These numbers are collected and averaged over units of time.
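The scoring described above is easy to sketch in Python. The function name is hypothetical, and this version assumes you have already extracted the 1-based rank of the clicked result for each search in the time window; searches with no click are simply left out, which is a simplifying assumption you may not want to keep.

```python
from typing import Iterable, Optional


def mean_reciprocal_rank(clicked_ranks: Iterable[int]) -> Optional[float]:
    """Average of 1/rank over the clicks collected in one time window.

    Ranks are 1-based: a click on the top result scores 1.0, the second 0.5,
    the third about 0.33, and so on. Returns None if there were no clicks.
    """
    scores = [1.0 / rank for rank in clicked_ranks if rank >= 1]
    if not scores:
        return None
    return sum(scores) / len(scores)


# One click on the top result and one click on the second result.
mrr_today = mean_reciprocal_rank([1, 2])  # 0.75
```

Computing this once per day or week and plotting the series gives the trend line; the individual values matter less than sudden changes.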

The absolute value of MRR is not necessarily the important statistic because each site has different content, different classes of users, and different search technology. However, the trend of MRR over time can allow a site to spot changes quickly. It can also be used to score "A/B" testing.

There are certainly more elaborate algorithms that can be used, and MRR doesn’t account for whether a user liked the document once they opened it. But having even possibly imperfect performance data that can be trended over time is better than having nothing.

Reference: http://en.wikipedia.org/wiki/Mean_reciprocal_rank

Walter Underwood (of Ultraseek, Netflix, MarkLogic fame) gave a presentation (in PPT/PowerPoint) a couple of years ago about Netflix's use of MRR.

September 06, 2012

Got OGP? A Social Media Lesson for the Enterprise

    Anytime you decide to re-post that article hot off the virtual press from a site like nyt.com or Engadget to your social network of choice, odds are strong that its content crosses the news-media-to-social-media gap via a metadata standard called the Open Graph Protocol, or OGP.  OGP facilitates grabbing the article's title, its content-type, an image that will appear in the article's post on your profile, and the article's canonical URL.  It's a simple standard based on the usual HTML metadata tags that actually predate Facebook and Google+ by over a decade (OGP's metadata tags can be distinguished by the "og:" prefix on each property name, e.g. "og:title", "og:description", etc.).  And despite its Facebook origins, OGP's success should strongly inform enterprise metadata policies and practices in one basic, crucial area.

    The key to OGP's success on the public internet lies largely in its simplicity.  Implementing OGP requires the content creator to fill in just the four aforementioned metadata fields:

  • the content's URL (og:url)
  • its title (og:title)
  • its type (og:type)
  • a representative image (og:image)

     A great number of other OGP metadata fields certainly do exist, and should absolutely be taken advantage of, but only these four need to be defined in order for a page to be considered OGP-compliant.
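     To make the "minimum set" idea concrete, here is a small Python sketch that checks a page for the four required fields using only the standard library. The class and function names are hypothetical, and it assumes the tags appear as `<meta property="og:..." content="...">` elements in the page's HTML.

```python
from html.parser import HTMLParser

REQUIRED_FIELDS = ("og:url", "og:title", "og:type", "og:image")


class OGPCollector(HTMLParser):
    """Collects the first non-empty content value seen for each og: property."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("property") or ""
        content = attr_map.get("content")
        if name.startswith("og:") and content:
            self.properties.setdefault(name, content)


def missing_ogp_fields(html):
    """Return the required OGP fields that the page does not define."""
    collector = OGPCollector()
    collector.feed(html)
    return [name for name in REQUIRED_FIELDS if name not in collector.properties]


page = (
    '<html><head>'
    '<meta property="og:title" content="Hello">'
    '<meta property="og:type" content="article"/>'
    '</head><body></body></html>'
)
missing = missing_ogp_fields(page)  # og:url and og:image are absent here
```

     The same pattern, a short list of mandatory fields plus a check that runs before publishing, carries over to whatever fields your own content requires.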

     What can we immediately learn here from OGP that applies to metadata in the enterprise?  The enterprise content-creation and/or content-import process should involve a clearly-defined and standardized minimum set of metadata fields that should be present in every document *before* that document is added to a CMS and/or indexed for search.  NYT.com certainly doesn't push out articles without proper OGP, and enterprise knowledge workers need to be equally diligent in producing documents with the proper metadata in place to find them again later!  Even if practical complications make that last proposition difficult, many Content Management Systems can be set up to suggest a basic set of default values automagically for the author to review at submission time.  Just having a simple, minimum spec in place for the metadata fields that are considered absolutely mandatory will generally improve base-line metadata quality considerably.

    What should this minimum set of metadata fields include for your specific enterprise content? It's hard to make exact recommendations, but let's consider the problem that OGP's designers were trying to solve in the case of web-content metadata: people want a simple preview of the content they're sharing from some content source, with sufficient information to identify that content's basic subject-matter and provenance, and (perhaps most importantly!) a flashy image that stands out on their profile.  OGP's four basic requirements fit exactly these specs.  What information do your knowledge workers always need from their documents?  Perhaps the date-of-creation is a particularly relevant data-point for the work they're doing, or perhaps they often need to reference a document's author.  Whatever these fields might actually be, spending some time with the people who end up using your enterprise documents' metadata is the best way to find out.  And even if their baseline needs are dead simple, like the problem OGP manages to solve so succinctly, your default policy should be to just say NO to no metadata.  Your search engine will thank you.

    A natural question might arise from this case-study: should you actually just start using OGP in the enterprise?  It's not necessarily the best option, since intranet search-engine spiders and indexers might not know about OGP fields yet.  In any case, you'll definitely still want to have a regular title, description, etc. in your documents as well.  As of the time-of-writing, OGP is still best suited to the exact niche it was designed to operate in: the public internet.  Replicating the benefits it provides within the enterprise environment is an important goal.

September 05, 2012

The "Gotcha's" of Disk and Memory Issues with Search

Here are some performance problems that can be caused by shared disk resources or RAM.

SAN / NAS Disk Latency

Note: Both SAN and NAS are types of shared disk drives that can be used by multiple machines.

Many NIE customers are using SAN storage for search indices and sometimes for content. Although this is becoming a more common practice, there is one issue to be particularly vigilant for. Search uses storage somewhat differently than other applications such as multimedia storage; search makes many round trips to disk, so the latency of all these serial transactions can stack up, and latency can be more of an issue than raw bandwidth.
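One rough way to see the effect of many serial round trips is to time a series of small reads at random offsets in a large file on the storage in question. The Python sketch below is only an assumption-laden probe, not a benchmark: the function name is hypothetical, and the operating system's cache can hide the real storage latency, so it is most useful for comparing the same file on local disk vs. the shared volume, or the same test across environments.

```python
import os
import random
import time


def average_read_latency(path, reads=200, block_size=4096):
    """Average seconds per small read at a random offset, issued one at a time."""
    size = os.path.getsize(path)
    if size < block_size:
        raise ValueError("file is smaller than one block")
    total = 0.0
    # buffering=0 avoids Python-level buffering; the OS may still cache blocks.
    with open(path, "rb", buffering=0) as handle:
        for _ in range(reads):
            offset = random.randrange(0, size - block_size + 1)
            start = time.perf_counter()
            handle.seek(offset)
            handle.read(block_size)
            total += time.perf_counter() - start
    return total / reads
```

If the per-read average on the shared volume is many times the local figure while bulk copy speeds look similar, that points at latency rather than bandwidth.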

Symptoms of Serial Latency Issues:

  • Other applications on similar machines, or using the same storage, not reporting performance problems, but search indexing or retrieval is slow.
  • Large performance differences between the same search application running in different environments, such as dev vs. staging.
  • Sudden changes in search performance when only “minor” changes were made to systems or networking.

Other problems with network storage are thankfully seen much less often in modern systems:

  • Filehandle limitations, or different limits between local and network filehandles.
  • Exact order of transactions not maintained
  • File locking and cascading system failures. Note that some search engines may still have file locking limitations with simultaneous transactions, regardless of where the storage is located.

Memory and Virtual Machines

Search engines can be heavy users of RAM. If servers are hosting multiple applications, or many virtual servers which are each running RAM intensive apps, performance can suffer. Most operational teams are aware of this problem and try to avoid it. But there are ways this can sneak up on even the best of teams:
  • Performance degrades slowly over time
  • Performance degrades sporadically and is therefore harder to analyze
  • Search might simply be the first application that is noticeably impacted by resource constraints
  • Performance is greatly impacted by other applications, but at irregular intervals, the “bump in the night”. Organizations are often unaware of their full application loading picture over the course of an entire week or month.
  • Memory and performance can be affected by the activities of other virtual servers running on the same host. If too many virtual servers are consuming lots of memory, the physical host may need to start swapping to disk or otherwise constrain or delay memory access.
  • Performance varies between different environments on seemingly similar systems. The similar systems may actually have very different sets of applications or run schedules. However, such differences can also be related to SAN or NAS storage issues.
  • Parts of the Linux OS may allocate unused memory for other purposes, so operators become accustomed to seeing low available memory. At later times it’s then harder to spot true memory shortages vs. the “normal” low memory cause.
  • Some part of the OS or application stack is accidentally using a 32 bit subsystem instead of 64 bit, perhaps as the result of a recent software update.
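
For the Linux low-memory point above, one way to separate "normal" low free memory from a true shortage is to compare MemFree with MemAvailable in the text of /proc/meminfo. The Python sketch below assumes a kernel that reports MemAvailable; the function names are hypothetical, and the fallback of free plus buffers plus cache is only a crude approximation for older kernels.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        name, separator, rest = line.partition(":")
        if not separator:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def memory_summary(text):
    """Return free and available memory as percentages of total."""
    info = parse_meminfo(text)
    total = info["MemTotal"]
    free = info["MemFree"]
    # Assumption: when MemAvailable is missing, approximate it crudely.
    fallback = free + info.get("Buffers", 0) + info.get("Cached", 0)
    available = info.get("MemAvailable", fallback)
    return {
        "free_pct": 100.0 * free / total,
        "available_pct": 100.0 * available / total,
    }


sample = (
    "MemTotal:       1000 kB\n"
    "MemFree:         100 kB\n"
    "MemAvailable:    600 kB\n"
    "Buffers:          50 kB\n"
    "Cached:          400 kB\n"
)
summary = memory_summary(sample)  # low free memory, but plenty available
```

On a real server the text would come from reading /proc/meminfo; a low free percentage with a healthy available percentage is the "normal" case, while both being low is worth a closer look.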

December 12, 2011

New Phrase for determining Sentiment Analysis / Customer Interest

If you look up:

fedex "Package not due for delivery"

which is one of the status messages you can get when tracking a package, you'll see a lot of postings asking about it.

FYI: It means your new toy has arrived in the city you live in, but will NOT be delivered today, because they didn't promise to get it to you until tomorrow.  Whether this is to force customers into paying for express service, or simply a logistics issue, or a mix of the two, depends on your view of companies and I won't get into that here.

However, you'll notice a lot of the postings asking about it are from folks waiting for delivery of things they're very excited to get, often some big-ticket piece of shiny electronics.  They're dying for Fedex to deliver it - they're so anxious and upset about the delay that they're motivated enough to go online and search, and make ranting posts - all because their "toy" is delayed.

So we have a particular emotional response, often about an upscale product, with a reasonably distinct search phrase - cool!

Yes, yes, of course you could say that the customers are mad about the perceived injustice of it, the Occupy Wall Street spin, or that sometimes the package could be really important for other reasons, which are certainly valid points.  I'm not taking sides or passing judgement - and I discovered this today looking for a friend's overdue toy - that's not the point.  I'm just saying that I bet there's a good statistical correlation, and of course it wouldn't apply 100% of the time - which would actually be quite rare in such things.