8 posts categorized "Search Term Definitions and Glossary"

October 30, 2012

Link to cool story of Lucene/Solr 4.0's new fast Fuzzy Search

Interesting article with lots of links to other good resources.  It tells the story of a lot of open source cross-pollination and collaboration, automata, Levenshtein distance, and even a dash of Python - thanks Mike!

http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html

September 18, 2012

Better Handling of Model Numbers and Software Versions

In a recent post I talked about different ways to tokenize your data.  Today I'll extend that by talking about tokenizing text that has a small amount of structure in it, and the relationship between tokenization and Entity Extraction.

Although this post talks about eCommerce-related items such as Model Numbers and Version Numbers, the same logic could also be applied to dates, amounts of money, social security numbers, phone numbers, ISBNs, patent references, legal citations, etc.

Better handling of Model Numbers

Note: Parts suppliers often have product names that look more like model numbers, so they might benefit from this as well.

It would be possible to use field-specific tokenization rules, in conjunction with search-time logic, to allow for better partial matches. In a manner somewhat analogous to the previous section (Improved Tokenization for Punctuation), structured product names could be broken down into components and also kept in their original form, with all of these tokens overloaded into the index.

Pattern matching at search time could further enhance this logic, for example by recognizing when a query looks like a model number.

Potential advantages:

  • Ability to rank more exact matches (if the user types the longer form)
  • More predictable partial matches
  • Could enable normalized sorting and field collapsing
  • Could link from more specific to less specific and vice versa
  • Possibly improve autosuggest searches
  • Avoid use of wildcards (although this isn't a problem in some search engines)

Normalizing Version Numbers

Technical websites have a great deal of software and drivers, with many version numbers. Similar to the methods suggested for model numbers, these special numbers could be recognized and normalized as they're added to the index. Potential advantages:
  • Allow for proper version number sorting within one software component or driver (there is no absolute scale that’s comparable across disparate software)
  • Allow for proper partial matches
  • Allow for proper range searches
  • Possibly add additional sentinel tokens for “latest”

Entity Extraction / Normalization

Depending on the search engine, there might not be much implementation difference between normalizing model and version numbers as mentioned previously, and doing full entity extraction. However, regardless of implementation similarity, designing for full Entity Extraction elicits a more complete functional spec and UI treatment.
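To make the overlap between normalization and extraction a bit more concrete, here is a minimal Python sketch. The regular expression, the function names and the zero-padded format are all assumptions made for illustration, not the behavior of any particular search engine. It pulls version-like strings out of free text and normalizes each one into a form where plain string comparison agrees with numeric order:

```python
import re

# Hypothetical pattern: two to four dot-separated numeric components,
# with an optional leading "v". It will also match other dotted numbers
# (IP addresses, for example), so a real extractor needs more context.
VERSION_RE = re.compile(r"\bv?(\d+(?:\.\d+){1,3})\b", re.IGNORECASE)


def normalize_version(version, width=5, parts=4):
    """Zero-pad each component so string order matches numeric order."""
    numbers = [int(p) for p in version.split(".")]
    numbers += [0] * (parts - len(numbers))  # treat "2.9" as "2.9.0.0"
    return ".".join(f"{n:0{width}d}" for n in numbers)


def extract_versions(text):
    """Return (original, normalized) pairs for each version-like string."""
    return [
        (m.group(1), normalize_version(m.group(1)))
        for m in VERSION_RE.finditer(text)
    ]


found = extract_versions("Driver v2.10.1 replaces 2.9")
```

With this padding, the normalized form of 2.9 sorts before that of 2.10, which a naive string sort of the raw values would get wrong. The original string is kept alongside the normalized one so it can still be displayed or matched exactly.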

Benefits of full Entity Extraction over simple normalized tokenization:

  • Usually includes using the extracted entities in Faceted Navigation. If some silos already have good metadata for Facets but other silos lack it, this might allow those other silos to have almost comparable values for the same data type (via extraction vs. well defined metadata) and have more consistent coverage for faceted search.
  • Encourages further thought as to the preferred canonical internal representation and display format for each type of entity.

There is one potential issue with the first point, using entity extraction for faceted search: the text of a document may reference many valid entities, while the document itself is only primarily related to one or two of them, so there may be a tendency towards “Facet Inflation”. This can sometimes be mitigated by having several classes of the same facet, but where the scope of one type is more heavily restricted by having it only pull values from key parts of the document, such as title or model number.

September 13, 2012

Improved Tokenization for Punctuation Characters

Search applications that are geared towards technical users often have problems with searches like .net, C#, *.*, etc.

In some cases these can be handled solely at the application level. For example, many Unix utility command line options begin with a hyphen, which means “NOT” to many search engines, so users searching for "-verbose" will find every document EXCEPT the ones that discuss the verbose option.

Depending on the engine and its configuration, this can often be handled by just stripping off the minus sign before submitting the query to the engine.
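As a rough sketch of that application-level fix, assuming the query arrives as a plain string and that your users never rely on the minus sign as a NOT operator (the function name is made up for this example):

```python
def strip_leading_hyphens(query):
    """Remove leading hyphens from each whitespace-separated term."""
    terms = (term.lstrip("-") for term in query.split())
    # Drop terms that were nothing but hyphens.
    return " ".join(term for term in terms if term)


cleaned = strip_leading_hyphens("tar -xvf --verbose")
```

Hyphens inside a term, as in "well-known", are left alone. Only the leading ones that an engine might read as NOT are removed.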

If there's always additional text in the search, a cheap workaround is to just consistently drop the same punctuation characters at both index time and search time. As long as "TCP/IP" is consistently reduced to [ tcp, ip ], users will have a good chance of finding it.
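The key to that workaround is using the very same function on both sides. Here is a toy Python illustration with an in-memory index. The names are hypothetical and the tokenizer only understands ASCII letters and digits, so treat it as a sketch rather than anything production ready:

```python
import re
from collections import defaultdict


def simple_tokens(text):
    """Lowercase, then split on any run of non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


index = defaultdict(set)  # token -> set of document ids


def add_document(doc_id, text):
    """Index time: tokenize with simple_tokens."""
    for token in simple_tokens(text):
        index[token].add(doc_id)


def search(query):
    """Search time: tokenize with the same function, then AND the terms."""
    tokens = simple_tokens(query)
    if not tokens:
        return set()
    return set.intersection(*(index.get(t, set()) for t in tokens))


add_document(1, "Configuring TCP/IP on Linux")
add_document(2, "IP address basics")
```

Because both sides reduce "TCP/IP" to [ tcp, ip ], a search for "tcp/ip" or "TCP IP" finds document 1 either way.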

But what if punctuation is all you have? What if somebody really needs to search for -(*), for example? What then? There's a strong tendency to balk at these use cases, to claim that they are obscure edge cases, and rationalize why they should be ignored. But this edge case argument is old and stale - if your site truly needs to search punctuation rich content, then it may be worth the cost. Long search tails, which are common on technical search applications, can add up to a substantial percentage of overall traffic!

Many punctuation problems need to be handled at index time, either instead of or in addition to special search time logic. For example, if asterisks are important, they can be stored as actual tokens in the fulltext index. At search time the asterisks would also need to be handled appropriately, since most search engines would either ignore them or assume they are part of a wildcard search.

The point is that, regardless of what you do with asterisks at search time, they cannot be found at all if they didn’t make it to the index, and were instead discarded at index time.
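One way to picture "making it to the index" is a tokenizer that emits each punctuation character as its own token instead of throwing it away. The pattern below is an assumption chosen for this sketch, not the rule set of any specific engine:

```python
import re

# Either a run of ASCII letters/digits, or one single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")


def tokens_with_punctuation(text):
    """Tokenize text, keeping every punctuation character as a token."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


tokens = tokens_with_punctuation("-(*)")
```

With this rule, "-(*)" becomes four tokens and "C#" becomes [ c, # ], so a phrase search over those tokens has something to match against. The query side still has to be tokenized the same way and kept away from the engine's wildcard handling.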

Token Overloading can be used to put multiple tokenized representations of the same text into the index. For example, if a hyphenated phrase like "XY-1234" is found in a source document at index time, it can be injected as [ (xy-1234), (xy, 1234), (xy,-,1234), (xy1234) ]. Although this inflates the index size, it gives maximum flexibility at search time.
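The four representations of "XY-1234" listed above can be generated with a few lines of Python. This is only a sketch for hyphenated terms, and the function name is invented for the example. How the variants are actually injected (same token position, separate fields, etc.) depends on your engine:

```python
import re


def overloaded_tokens(term):
    """Return several tokenized representations of one hyphenated term."""
    original = term.lower()
    parts = [p for p in original.split("-") if p]
    # A capturing group makes re.split keep the hyphens as tokens.
    with_hyphens = [t for t in re.split(r"(-)", original) if t]
    return [
        (original,),            # kept whole
        tuple(parts),           # split, hyphen dropped
        tuple(with_hyphens),    # split, hyphen kept as a token
        ("".join(parts),),      # hyphen removed and parts joined
    ]


variants = overloaded_tokens("XY-1234")
```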

Don't confuse "don't need to do something" with "don't know how to do something". Get your punctuation problems sorted out properly!

Hmm... ironically our own web properties don't follow this advice, and we certainly do attract a number of techies!  I could rationalize that punctuation search isn't a large percentage of our traffic, but the real reason is that we use hosted services and don't have full control over their search engines.  But do as we say, not as we do, and remember that honest confessions are good for the soul.

September 11, 2012

Are you Tracking MRR? - "Mean Reciprocal Rank" Trend Monitoring

MRR is a simple numerical technique to monitor the overall relevancy performance of search engines over time. It is based on click-throughs in the search results, where a click on the top document is scored as 100%, a click on the second document is 50%, 3rd document is 33%, etc. These numbers are collected and averaged over units of time.
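The arithmetic is small enough to sketch in Python. The input format here is an assumption: one (period, rank) pair per search, where rank is the 1-based position of the result that was clicked. Searches with no click at all are simply left out of this sketch, whereas the usual definition scores them as zero, so decide which you want before comparing numbers.

```python
from collections import defaultdict


def mean_reciprocal_rank(click_ranks):
    """Average of 1/rank over a list of 1-based clicked positions."""
    if not click_ranks:
        return 0.0
    return sum(1.0 / rank for rank in click_ranks) / len(click_ranks)


def mrr_by_period(clicks):
    """Group (period, rank) pairs by period and compute MRR for each."""
    grouped = defaultdict(list)
    for period, rank in clicks:
        grouped[period].append(rank)
    return {p: mean_reciprocal_rank(r) for p, r in sorted(grouped.items())}


trend = mrr_by_period([
    ("2012-09-10", 1),  # click on the top result: 100%
    ("2012-09-10", 2),  # click on the second result: 50%
    ("2012-09-11", 4),  # click on the fourth result: 25%
])
```

In this made-up sample the first day averages 75% and the second day 25%, which is the kind of drop you would want to investigate.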

The absolute value of MRR is not necessarily the important statistic because each site has different content, different classes of users, and different search technology. However, the trend of MRR over time can allow a site to spot changes quickly. It can also be used to score "A/B" testing.

There are certainly more elaborate algorithms that can be used, and MRR doesn’t account for whether a user liked the document once they opened it. But having even possibly imperfect performance data that can be trended over time is better than having nothing.

Reference: http://en.wikipedia.org/wiki/Mean_reciprocal_rank

Walter Underwood (of Ultraseek, Netflix, MarkLogic fame) gave a PowerPoint presentation on this topic a couple of years ago, covering Netflix's use of MRR.

June 03, 2011

Today's Search Term: Term Density

'Term density' is a calculated percentage of how frequently a term appears in a document, relative to the overall size of the document. This fixes a problem with simple term frequency calculations, which tend to favor long documents. For example, if a word appears 5 times in a 2 page document and 10 times in a 100 page document, the first document is probably still more relevant, even though it has 5 fewer occurrences of the term.
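A minimal Python sketch of the calculation, assuming that "size of the document" means its word count and that words are split with a simple regular expression (both are choices made for this example):

```python
import re


def term_density(term, text):
    """Fraction of the words in text that are exactly the given term."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    return words.count(term.lower()) / len(words)


density = term_density("search", "Search engines search text")
```

If we assume roughly 500 words per page, the 2 page document above scores 5 / 1,000 = 0.5%, while the 100 page document scores 10 / 50,000 = 0.02%, so the shorter document ranks higher despite having fewer raw occurrences.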

From the New Idea Engineering Glossary of Search-Related Terms

August 30, 2010

Today's Search Term: Stemming

stemming

Related Terms:  lemmatization, normalize
Search engines use stemming as a means to determine the root of a given written word. Using a program or algorithm, all of the affixes of a word (prefix and/or suffix in the English language) are removed, leaving the root word. By implementing the rules of the given language, obstacles such as the third-person singular present (as cries is of the verb cry) in the English language can be overcome, so that such words are accurately indexed.


Stemmers become harder to design as the rules of the target language become more complex. For example, some languages have more verb and pronoun forms. Other languages do not always have clear word breaks between each word, and you can't do stemming until you've isolated the words!
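To give a feel for what a rule-based stemmer does, here is a deliberately tiny Python sketch for English suffixes. The rule table is made up for this example and is far cruder than real stemmers such as the Porter stemmer. It just shows the "strip the affix, keep the root" idea, including the cries / cry case:

```python
# (suffix, replacement) pairs, checked in order. Toy rules only.
SUFFIX_RULES = [
    ("ies", "y"),    # cries -> cry
    ("sses", "ss"),  # classes -> class
    ("ss", "ss"),    # glass stays glass (guards the plain "s" rule)
    ("ing", ""),     # searching -> search
    ("ed", ""),      # searched -> search
    ("s", ""),       # engines -> engine
]


def naive_stem(word):
    """Apply the first matching suffix rule, leaving at least 2 letters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word


stems = [naive_stem(w) for w in ["cries", "cry", "searching", "engines"]]
```

Even in English this breaks quickly (it turns "running" into "runn", for instance), which hints at how much harder the job gets for languages with richer rules.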

 


Search Terms

NIE maintains a Glossary of Enterprise Search Terms related to the Business and Technology of Search on our site, which you can browse at your convenience. This is an active list, and we welcome your suggestions and additions!

Now we're going to select and post one of these each day or so in the blog. Some may be familiar but we hope some will be new to you. Enjoy!

August 03, 2010

Today's Search Term: Folksonomy

Folksonomy
Related Terms:  Taxonomy, Behavior Based Taxonomy
A type of taxonomy or other organization of content suggested by users.
For example, on popular photo sites, users can tag photos with descriptive words. These words can then be searched for. In the enterprise, some search systems allow employees to tag certain documents with key words. These terms are then found when other employees search for those terms.