6 posts categorized "Database and SQL with Search"

November 21, 2011

Google: Sometimes I really do want EXACT MATCHES

Disclaimer: Google attracts more of my annoyances simply because I use it so much.  And I'm confident they can do even better, so I'm helping by writing this stuff down!

My Complaint:

Back in my day, when you typed something in quotes into a search engine, you'd get an exact match!  Well... OK, sometimes that meant "phrase search" or "turn off stemming"... but still, if it was only a ONE WORD query, and I took the time to still put it in quotes, then the engine knew I was being VERY specific.

But now that everyone's flying with jet-packs and hover boards, search engines have decided that they know more than I do, and so when I use quotes, they seem to ignore them!

I can't give the exact query I was using, but let's say it'd been "IS_OF".  Google tries to talk me out of it with a "Showing results for (something else)", but then I click on the "Actually do what I said" hyperlink.  And even then it still doesn't comply.  In this made-up example, it'd still match I.S.O.F. and even span sentence gaps, as in "Do you know what that *is*?  *Of* course I do!"

The Technical Challenge:

To be fair, there are technical problems with trying to match arbitrary exact patterns of characters in a scalable way.  Punctuation presents a challenge, with many options for handling it.  And most engines use tokenization, which implies word breaks, which normally won't handle arbitrary substring matching.

At least with some engines, if you want to support both case-insensitive and case-sensitive matching, you have two different indexes, with the latter sometimes called a "casedex".  Other engines allow you to generate multiple overlapping tokens within the index, so "A-B" can be stored as separate A's and B's, as "AB", and also as the literal "A-B", so any form will match.
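To make the overlapping-token idea concrete, here's a minimal sketch (not any particular engine's API) of emitting every form of a punctuated term, so that queries for "A", "B", "AB", or the literal "A-B" all hit the same document:

```python
# Emit overlapping index tokens for a punctuated term: the literal
# form, the separate pieces, and the pieces joined together.
import re

def overlapping_tokens(term):
    tokens = {term.lower()}                        # literal form: "a-b"
    parts = [p for p in re.split(r"[^a-z0-9]+", term.lower()) if p]
    tokens.update(parts)                           # separate pieces: "a", "b"
    if len(parts) > 1:
        tokens.add("".join(parts))                 # joined form: "ab"
    return tokens

print(sorted(overlapping_tokens("A-B")))  # ['a', 'a-b', 'ab', 'b']
```

Index size grows with each extra form stored, which is exactly the trade-off discussed below.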

Some would say I'm really looking for the Unix "grep" command, or the SQL "LIKE" operator.  And by the way, those tools are VERY inefficient because they use linear searching instead of pre-indexing.  And if you tried to maintain a set of indexes to handle all permutations of case matching, punctuation, pattern matching, etc., you'd wind up with a giant index, possibly way larger than the source text.

But I do think Google has moved beyond at least some of these old limitations; they DO seem to find matches that go beyond simple token indices.

Could you store an efficient, scalable set of indices that capture enough information to accommodate both normal English words and complex, near-regex-level literal matching, and still have reasonable performance and reasonable index sizes?  In other words, "could you have your cake and eat it too"?  Well... you'd think a multi-billion-dollar company full of Stanford smarties certainly could! ;-)  But then the cost would need to be justified... and outlier use cases never survive that scrutiny.  As long as the underlying index supports finding celebrity names and lasagna recipes, and pairing them with appropriate ads, the 80% use cases are satisfied.

August 01, 2011

Google Refine, Google's open source ETL tool for data cleansing, with videos!

For any of you working with Entity Extraction, this might be of interest.  Google has open sourced some software from their FreeBase acquisition, formerly called Gridworks.  It lets you interactively clean up and transform data.  More importantly, it saves these steps as a reusable sequence in JSON format, so they can be reapplied to other data.

Here's the main page and wiki (and 3 intro videos):

It IS Open Source, here's the source code and license:

That type of UI makes me want to dust off our XPump code and retrofit into it...

July 07, 2010

In Defense of "grep" / auto-substring Matching :-)

As some of you know, grep is the Unix utility that, in its simplest form, looks for literal strings in a file and prints out any matching lines. The database equivalent is the LIKE operator with percent signs before and after the string.
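The grep/LIKE equivalence is easy to demonstrate with SQLite (the table and rows here are made up for illustration):

```python
# SQL's LIKE with leading and trailing percent signs behaves like a
# grep over every row: a linear scan for a substring.  Note SQLite's
# LIKE is case-insensitive for ASCII by default.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (body TEXT)")
conn.executemany("INSERT INTO docs VALUES (?)", [
    ("There were marks on the surface.",),
    ("Mark wrote this blog post.",),
    ("Nothing to see here.",),
])

rows = conn.execute(
    "SELECT body FROM docs WHERE body LIKE '%mark%'").fetchall()
print(len(rows))  # 2 -- both "marks" and "Mark" contain the substring
```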

For years all of us fulltext search engine snobs have been saying "grep is not a search engine" (and by extension, neither is the LIKE operator in SQL), and that this type of literal matching is insufficient for real searching. For example, this type of simple matching won't get word variations like "run" and "ran", nor synonyms like "cool" and "cold".

From an implementation standpoint, the problem with grep is performance-related: it scans every line of every file to check each pattern. This is super slow if you have billions of documents. Search engines instead index all the documents ahead of time, creating a highly optimized search index. The engine then consults that index, not the original source documents, when searching for specific words.
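A toy inverted index shows the difference: tokenize each document once up front, then answer word queries with a dictionary lookup instead of re-scanning every document on every query.

```python
# Build a tiny inverted index: each token maps to the set of document
# IDs containing it.  Lookups are then O(1) per term, independent of
# how many documents exist.
import re
from collections import defaultdict

docs = {
    1: "There were marks on the surface.",
    2: "Mark wrote this blog post.",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        index[word].add(doc_id)

print(sorted(index["mark"]))   # [2]
print(sorted(index["marks"]))  # [1] -- exact tokens only, no stemming
```

Notice the limitation the post describes: without extra processing, "mark" and "marks" land in different index entries.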

But I find myself doing substring searches in a few of the systems I frequently use. In our CRM, when I don't remember the specific spelling of a person, company, or product, I type in just 3 or 4 letters. This doesn't always work; sometimes it brings back junk, other times it misses the mark. But it's an easy search to edit and resubmit, so I can fire off 2 or 3 variations in short order.

I also use substrings quite a bit when searching through source code. OpenGrok is a very nice Lucene-based search engine that uses proper word breaks, but sometimes it doesn't find things I'm looking for precisely because it's looking, by default, for complete words. The Eclipse editor, on the other hand, uses substring searching by default, so you can look up substrings without thinking about it.

Email is yet another application that, at least on some systems, starts looking up matches after just 2 or 3 letters. There's a special case: some systems will only match those 2 or 3 characters if they appear at the start of a word, similar to many autocomplete implementations.

I can hear some of you yelling "what about wildcards!?" - most engines will let you type abc* and match everything starting with abc. Search engines differ on whether you can use wildcards in the middle or at the start of a word, and some engines can do it IF you enable it. This is close... it's an improvement in that it doesn't do a linear scan of all the documents; it still consults the fulltext search index. But most folks forget to put the asterisk... or is it a percent sign? And can you put it in the middle or at the beginning, in your particular engine and configuration? Who knows!
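For the curious, here's one way an engine can serve a trailing-wildcard query like abc* from its index without a linear scan. This is a toy sketch, not any real engine's implementation: keep the term dictionary sorted, and a prefix query becomes two binary searches bracketing a contiguous range.

```python
# All terms sharing a prefix form one contiguous slice of a sorted
# term dictionary, found with two binary searches.
import bisect

terms = sorted(["mar", "marine", "mark", "marks", "mars", "oak"])

def prefix_match(prefix):
    lo = bisect.bisect_left(terms, prefix)
    hi = bisect.bisect_left(terms, prefix + "\uffff")  # just past the range
    return terms[lo:hi]

print(prefix_match("mark"))  # ['mark', 'marks']
```

This is also why leading wildcards (*abc) are harder: the matching terms are scattered all over the sorted dictionary, which is why some engines index reversed terms to support them.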

So what's to be done? The good news is you really can "have your cake and eat it too!". Highly configurable search engines can be told to index the same text in several different ways. One internal index can have tokens that are the exact words. Another index can normalize the words down to lower case and perform "stemming", to normalize all the plurals to singular form, etc. These engines should also be able to be coaxed into storing all of the smaller chunks of words in yet another index. Of course substrings aren't as good as a full match. But search engines have an answer for this too! You can set the relevancy for these different indices with different weights. A substring match is OK... if there's nothing else... but if the full word matches, it should get extra credit, or an exact match scores even higher. And keep in mind you're not paying the performance penalty, it's using the index and not doing a literal scan of every file.

Enough techno-babble; let's walk through an example:

Your text has the sentence "There were marks on the surface.", and let's focus on the third word, "marks". Another sentence has "Mark wrote this blog post."

The word "marks" gets indexed several ways:

Exact index: marks

Stemmed index: mark

Single index: m a r k s

Double index: ma ar rk ks

Triples: mar ark rks

Then the term "Mark" is indexed as:

Exact index: Mark

Stemmed index: mark

Tuple index (singles, doubles, and triples combined): m a r k ma ar rk mar ark

Kinda techie, but you can see that, as long as the same rules are applied to the search terms, we can easily match things.  If somebody doesn't remember whether my name ends in a "c" or a "k", they can find me with just "mar". Now, if there are a million documents, that search will bring back LOTS of other documents with the substring "mar", albeit very quickly!
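The example above can be sketched in a few lines of Python. This is an illustration of the idea, not any engine's actual data structures; I'm using just trigrams here to keep it short:

```python
# Index "marks" and "Mark" under exact, lowercased, and
# character-trigram forms, then search with just "mar".
from collections import defaultdict

def trigrams(word):
    w = word.lower()
    return {w[i:i + 3] for i in range(len(w) - 2)}

docs = {1: "marks", 2: "Mark"}
exact, lowered, tri = {}, defaultdict(set), defaultdict(set)
for doc_id, word in docs.items():
    exact[word] = doc_id
    lowered[word.lower()].add(doc_id)
    for g in trigrams(word):
        tri[g].add(doc_id)

# A three-letter query hits both spellings via the trigram index
print(sorted(tri["mar"]))  # [1, 2]
```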

But if somebody searches for mark or Mark, extra credit will be given for matching the more precise indices. Actual implementations would probably leave off the single-letter index, the m, a, r and k stuff, as almost every document would contain those. And this implementation would take more disk space, more time to index, etc. And it would tend to bring back a lot of junk. But the good news is that folks wouldn't have to remember to add wildcard characters. In techie terms we'd say this "helps recall, but hurts precision". Another idea would be to NOT apply the substring matching by default, but perhaps offer a clickable option in the results list to "expand your search", which re-issues the same search with substring matching turned on and lets the user decide.
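The "extra credit" weighting can be sketched in a few lines too. The weights here are invented purely for illustration; real engines expose this kind of per-field or per-index boosting through their own configuration:

```python
# Score a query word against an indexed word across three tiers:
# exact (case-sensitive) match, lowercased match, and partial credit
# per shared character trigram.
def trigrams(w):
    return {w[i:i + 3] for i in range(len(w) - 2)}

def score(query, indexed_word):
    q, w = query.lower(), indexed_word.lower()
    s = 0.0
    if query == indexed_word:
        s += 5.0                                   # exact match
    if q == w:
        s += 3.0                                   # case-folded match
    s += 0.5 * len(trigrams(q) & trigrams(w))      # substring credit
    return s

print(score("mark", "Mark"))   # 4.0 -- case-folded match + 2 shared trigrams
print(score("mar", "marks"))   # 0.5 -- one shared trigram ("mar")
```

A full match outranks a substring-only match, exactly the "extra credit" behavior described above, while substring-only matches still surface when nothing better exists.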

Index-based automatic substring matching has its place, along with all of the other tools in the search engine arsenal. It's a nice option to have when searching over names, source code, chemicals, domain names, and other technical data. Whether it's turned on by default, and how it's weighted against better matches, are choices to be carefully weighed.

May 24, 2010

Does Maxxcat's New Search Appliance Challenge the Google Box?

Both Jessica Bratcher and Tim Grey have interesting posts about Maxxcat releasing several enterprise search appliances that are supposedly much faster, cheaper, and more extensible than the corresponding Google search appliance, with unlimited lifetime use.

The appliances were built from the ground up and run on a custom Linux platform. "On a 1 million document collection, the kernel can dispatch and resolve a multi-term query spanning the entire collection in as little as 100 usec." (Of course, anything under 500 msec would be fine for an end user.)

Maxxcat has also released a new version of its JDBC connector (Bobcat) that supports standard SQL and allows any JDBC compliant database to interface directly to a MaxxCAT appliance. The company claims "EX-5000 Enterprise Search appliances equipped with BobCAT are able to retrieve and index information from host systems at speeds in excess of 1GB/minute."

Their chief integration engineer stated: "We are working with a number of customers who have data in SQLServer, mySQL or Oracle Databases that we are able to easily consolidate and query against, even though the source databases and data models vary dramatically. This is simply not possible with conventional database software, which relies upon proprietary interfaces and does not handle unstructured data very well, if at all."

April 14, 2010

The NoSQL Movement

Our focus here is normally on fulltext search, but of course databases have been at the core of Information Retrieval for decades. But are those foundations getting old? This piece raises some factors to consider when weighing alternatives to relational databases and ACID guarantees, such as using a document store: SQL at 40: Ready for Retirement?

July 18, 2008

Microsoft Terms for SQL Server Search Components

I found a nice article about Microsoft Language Packs and MS SQL Server, including some info on Japanese and CJK handling, but another tidbit of info they had was how Microsoft refers to certain parts of their search engine:

  • What most vendors refer to as "indexing" MS refers to as "population" (into an index)
  • What most vendors call a "collection" Microsoft calls a "catalog" - we've seen other vendors use that term in the past.
  • And what most vendors call "tokenization" or "tokenizers", Microsoft calls "word breakers", which is actually a bit more descriptive to a non-programmer.

I actually wrote an article a few years ago comparing traditional relational databases to full-text search engines, which included a table of equivalent terms and concepts (near the end of the article).  If you're already familiar with databases, this will get you up to speed much faster!