
September 18, 2012

Better Handling of Model Numbers and Software Versions

In a recent post I talked about different ways to tokenize your data. Today I'll extend that by talking about tokenizing text that has a small amount of structure in it, and the relationship between tokenization and Entity Extraction.

Although this post talks about two eCommerce-related items, Model Numbers and Version Numbers, the same logic could also be applied to dates, amounts of money, social security numbers, phone numbers, ISBNs, patent references, legal citations, etc.

Better handling of Model Numbers

Note: Parts suppliers often have product names that look more like model numbers, so they might benefit from this as well.

It would be possible to use field-specific tokenization rules, in conjunction with search-time logic, to allow for better partial matches. In a manner somewhat analogous to the previous post (Improved Tokenization for Punctuation Characters), structured product names could be broken down into their components while also being kept in their original form, with all of these tokens overloaded into the index for the same field.

Search-time patterns that recognize model numbers in the user's query could further enhance this logic.
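As a rough illustration of the index-time half of this idea, here is a minimal Python sketch. The function name `model_number_tokens` and the sample model number `AB-123X` are made up for this example, and the splitting rule (runs of letters vs. runs of digits) is only one of many possible choices. It emits the original form, a punctuation-free form, cumulative prefixes, and the individual components, which is the kind of token overloading described above.

```python
import re


def model_number_tokens(model: str) -> list[str]:
    """Return index tokens for one model number (hypothetical scheme).

    Emits, in order: the original form, the form with punctuation removed,
    each cumulative prefix, and each individual component. Duplicates are
    dropped while preserving first-seen order.
    """
    original = model.strip().lower()
    # Components are runs of letters or runs of digits; punctuation is dropped.
    parts = re.findall(r"[a-z]+|[0-9]+", original)

    candidates = [original, "".join(parts)]
    # Cumulative prefixes, e.g. "ab", then "ab123" for parts ab / 123 / x.
    candidates += ["".join(parts[:i]) for i in range(1, len(parts))]
    candidates += parts

    seen: set[str] = set()
    tokens: list[str] = []
    for token in candidates:
        if token and token not in seen:
            seen.add(token)
            tokens.append(token)
    return tokens


print(model_number_tokens("AB-123X"))
```

Because the longest forms are indexed alongside the shorter ones, a query for the full model number can match the most specific token, while a query for just the family prefix still finds the item without wildcards.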

Potential advantages:

  • Ability to rank more exact matches higher (if the user types the longer form)
  • More predictable partial matches
  • Could enable normalized sorting and field collapsing
  • Could link from more specific to less specific and vice versa
  • Possibly improve autosuggest searches
  • Avoid use of wildcards (although this isn't a problem in some search engines)

Normalizing Version Numbers

Technical websites have a great deal of software and drivers, with many version numbers. Similar to the methods suggested for model numbers, these special numbers could be recognized and normalized as they're added to the index.

Potential advantages:

  • Allow for proper version number sorting within one software component or driver (there is no absolute scale that’s comparable across disparate software)
  • Allow for proper partial matches
  • Allow for proper range searches
  • Possibly add an additional sentinel token for “latest”

Entity Extraction / Normalization

Depending on the search engine, there might not be much implementation difference between normalizing model and version numbers as mentioned previously, and doing full entity extraction. However, regardless of implementation similarity, designing for full Entity Extraction elicits a more complete functional spec and UI treatment.
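To make the overlap concrete, here is a small Python sketch of recognizing version numbers in text and normalizing them into a sortable form. Everything in it is an assumption for illustration: the pattern only handles dotted numeric versions with two to four parts, and the names `extract_versions` and `sort_key` and the zero-padding widths are arbitrary choices, not any particular search engine's behavior.

```python
import re

# Dotted numeric versions with 2 to 4 parts, optionally prefixed by "v".
VERSION_RE = re.compile(r"\bv?(\d+(?:\.\d+){1,3})\b", re.IGNORECASE)


def extract_versions(text: str) -> list[str]:
    """Return the version strings found in free text, without any 'v' prefix."""
    return [match.group(1) for match in VERSION_RE.finditer(text)]


def sort_key(version: str, width: int = 5, depth: int = 4) -> str:
    """Zero-pad each part so plain string comparison orders versions correctly."""
    parts = tuple(int(p) for p in version.split("."))
    # Missing trailing parts are treated as 0, so "2.9" sorts like "2.9.0.0".
    padded = parts + (0,) * (depth - len(parts))
    return ".".join(f"{p:0{width}d}" for p in padded)


found = extract_versions("Driver v2.10.3 replaces 2.9")
print(found)
print(sorted(found, key=sort_key))
```

The padded key is what would be stored for sorting and range searches, while the extracted string is what would be displayed. A plain string sort would put 2.10.3 before 2.9; the padded keys put them in the expected order.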

Benefits of full Entity Extraction over simple normalized tokenization:

  • Usually includes using the extracted entities in Faceted Navigation. If some silos already have good metadata for Facets but other silos lack it, extraction might give those other silos nearly comparable values for the same data type (via extraction rather than well-defined metadata), providing more consistent coverage for faceted search.
  • Encourages further thought as to the preferred canonical internal representation and display format for each type of entity.

There is one potential issue with the first point, using entity extraction for faceted search: the text of a document may reference many valid entities while the document itself is primarily about only one or two of them, so there may be a tendency towards “Facet Inflation”. This can sometimes be mitigated by having several classes of the same facet, where the scope of one class is more heavily restricted by having it pull values only from key parts of the document, such as the title or model number field.
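The two-class approach might look something like the following Python sketch. The field names (`title`, `body`), the facet names (`model_primary`, `model_all`), the model-number pattern, and the sample document are all hypothetical; a real system would use its own schema and extraction rules.

```python
import re

# Hypothetical model-number shape: 2+ capital letters, optional hyphen,
# 2+ digits, optional trailing capital letter (e.g. "AB-123X").
MODEL_RE = re.compile(r"\b[A-Z]{2,}-?\d{2,}[A-Z]?\b")


def facet_values(doc: dict[str, str]) -> dict[str, list[str]]:
    """Build a narrow facet from the title and a broad facet from all text."""
    title = doc.get("title", "")
    body = doc.get("body", "")
    primary = set(MODEL_RE.findall(title))
    everything = primary | set(MODEL_RE.findall(body))
    return {
        "model_primary": sorted(primary),  # what the document is mainly about
        "model_all": sorted(everything),   # every model it mentions anywhere
    }


doc = {
    "title": "AB-123X installation guide",
    "body": "Also compatible with CD-456 and EF-789; replaces AB-123X.",
}
print(facet_values(doc))
```

The navigation UI could then default to the narrow facet, and offer the broad one only when the user explicitly asks for any mention of a model.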

