
9 posts from November 2011

November 30, 2011

Odd Google Translate Encoding issue with Japanese

Was translating a comment in the Japanese SEN tokenization library.

It seems that if your text includes the Unicode right-arrow character (→), Google somehow gets confused about the encoding.  I saw this in both Firefox and Safari.  Not a big deal; it's strangely comforting to see even the big guys trip up on character encodings.

OK: サセン
OK: チャセ
Not OK: サセン→チャセ
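I obviously can't see what Google does internally, but the classic failure mode looks like this: UTF-8 bytes decoded with the wrong charset. A quick Python sketch (the cp1252 guess is purely my assumption, for illustration):

```python
# The Unicode right arrow (U+2192) is three bytes in UTF-8.
arrow_bytes = "→".encode("utf-8")
print(arrow_bytes)  # b'\xe2\x86\x92'

# If some component in the chain guesses the wrong charset (say cp1252),
# those three bytes come back as mojibake instead of the arrow.
garbled = arrow_bytes.decode("cp1252")
print(garbled)  # â†’
```

A "?" in rendered output usually means a transcoder hit bytes it couldn't represent and substituted a replacement character along the way.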


November 29, 2011

10 Handy Things to Know about the Lucene / Solr Source Code

It's funny how certain facts are "obvious" to some folks, stuff they've known a long time, but come as a pleasant surprise to others.  Chances are you know at least half of these, but no harm in double checking!

  1. Although Lucene and Solr are available in binary form, most serious users are eventually going to need some custom code.  If you post questions on the mailing lists, I think the assumption is you're comfortable with compilers, source code control and patches.  So it's a good habit to get into early on.
  2. Lucene and Solr source code were combined a while back (circa March 2010), so it's now one convenient checkout.
  3. You'll want to be using Java 6 JDK to work with recent versions of Lucene / Solr.
  4. Lucene/Solr use the ant build tool by default.  BUT did you know that the build file can also generate project files for Eclipse, IntelliJ and Maven?  So you can use your favorite tool.  (See the README.txt file for info and links.)
  5. Lucene/Solr use the Subversion / SVN source code control system.  There are clients for Windows and plugins for Eclipse and IntelliJ. (Mac OS X has it built in)
  6. You're allowed to do read-only checkout without needing any sort of login - checkouts are completely open to the public.  This is news to folks who've used older or more secure systems.
  7. Although checking any changes back in would require a login, it's more common to post patches to the bug tracking system or mailing list, and then let the main committers review and check in the patch.  Even a read-only checkout creates enough information on your machine to generate patches from your local changes.
  8. Doing a checkout, either public or with a login, does not "lock" anything.  This is also a surprise to folks used to older systems.  This non-locking checkout is why anonymous users can be allowed to checkout code - there's no need to coordinate checkouts.
  9. The read-only current source for the combined Lucene + Solr is at http://svn.apache.org/repos/asf/lucene/dev/trunk  Even though it's an http link, and can be browsed with a web browser, it's also a valid Subversion URL.
  10. The "contribute" wiki pages for Lucene and Solr have more info about the source code and patch process.

November 28, 2011

Solr Disk and Memory Size Estimator (Excel worksheet)

If you do a standard checkout of the Lucene/Solr codebase you also get a dev-tools directory.  One interesting tidbit in there is an Excel spreadsheet for estimating the RAM and disk requirements for a given set of data.  Be sure to notice the tabs along the bottom: tab 2 is for memory/RAM estimates, and tab 3 is for disk space.

Full URL: http://svn.apache.org/repos/asf/lucene/dev/trunk/dev-tools/size-estimator-lucene-solr.xls

November 22, 2011

Webinar: Improving SharePoint search with the FAST indexing pipeline

For those of you still at your desks this short Thanksgiving week, you might be interested in a webinar we'll be doing with our partner SurfRay early next month.

"Everyone knows that great metadata is key to a great user search experience, but what can you do if your existing content falls short? The FAST Search for SharePoint pipeline provides a way to enhance document metadata during the indexing process so your content has better metadata and users will experience better search results.

During the webinar we’ll talk about what the pipeline is, give examples of how it can improve your metadata, and describe some real-world scenarios where having access to the pipeline resulted in better search quality and happier users."

How can the indexing pipeline improve search quality? You'll have to come to the webinar to hear our take, but a hint: you can add to and improve a document's metadata during the indexing process, which means better search.

The webinar is planned for Friday, December 9 at 2PM Eastern/11AM Pacific.  You can register for the event now.

7 things GMail Search needs to change

My General Complaint:

If you've had a GMail account for many years, either for work or personal use, it's getting large enough that GMail's search is starting to break.

Any word you can think of to type in will match tons of useless results.  Eventually, as you try to think of more words to add, your result count goes to zero.

If you were lucky enough to have starred the email when you saw it, or can remember who might have sent it, or maybe the approximate timeframe, or maybe you think you might have sent the email in question from this account, you *might* have a chance.

A Tough Problem:

I realize this seems like classic precision and recall trouble, but Google is pretty smart, and they have a fair amount of metadata, and a lot of context about me, so there are some potential fixes to hang a hat on.

And some of my ideas involve labels/tags (GMail's equivalent of folders), but that assumes people are using labels, which I suspect many folks don't, or at least not beyond the default ones you get.  Well... sure, but they DO have them, and there's an automated rules engine in GMail to set them, so presumably a few people use tags / labels?  (Or maybe nobody does and, in hindsight, it's a legacy feature!?)  So, if you're going to have labels, and you've got even a few users who bother with them, then make them as useful as possible.  AND maybe make labels more visible, easier to set, more powerful, etc.

On To The Ideas:

1: Make it easier to refine search results.

Let's face it, as you accumulate more and more email, the odds of finding the email you want on the first screen of search results goes WAY down.

Google wisely uses most-recent-first sorting in search results, vs. their normal relevancy, in the GMail search UI.  I'm not sure why; this seems like an odd choice for them given all the bravado about Google's relevancy, but I'm guessing it was too weird to have email normally sorted by date in most parts of the UI, but have it switch back and forth between relevancy and date as you alternate between search and normal browsing.  Also, maybe they found it's more likely you're looking for a very recent email.  You could fold "freshness" into relevancy calculations, but just respecting date keeps it more consistent.

Yes, GMail does have some search options... I'll get to those, but suffice it to say they are very "non-iterative".

Other traditional filters should be facets as well: "Sent" emails, date ranges, "has attachments" (maybe even how many, their sizes, or types).

2: Promote form-based "Search options" to FULL Facets

You can limit your search to a subset of your email if you've Labeled it - this is the GMail equivalent of Folders.  But doing this is a hassle (see item 3), and you can't do this after the fact, once you're looking at results.

So, if you do a normal text search, and then remember you labeled it, you can't just click on the tags on the left of the results.  Those are for browsing, and will actually clear out your search terms.  These should be clickable drilldown facets, perhaps even with match counts in parentheses, and maybe some styling to make it clear that they will affect the current search results.

Yes, there's a syntax you can use:

label:your-label regular search terms

It's a nice option for advanced users who are accurate touch typists and remember the tag name they want, but this should also be easy from the UI.  Yes, there is an advanced search / search options form, but this brings me to item 3...

(read the rest of the ideas after the break)


November 21, 2011

Google: Sometimes I really do want EXACT MATCHES

Disclaimer: Google attracts more of my annoyances simply because I use it so much.  And I'm confident they can do even better, so I'm helping by writing this stuff down!

My Complaint:

Back in my day, when you typed something in quotes into a search engine, you'd get an exact match!  Well... OK, sometimes that meant "phrase search" or "turn off stemming"... but still, if it was only a ONE WORD query, and I took the time to still put it in quotes, then the engine knew I was being VERY specific.

But now that everyone's flying with jet-packs and hover boards, search engines have decided that they know more than I do, and so when I use quotes, they seem to ignore them!

I can't give the exact query I was using, but let's say it'd been "IS_OF".  Google tries to talk me out of it, doing a "Show results for (something else)", but then I click on the "Actually do what I said" hyperlink.  And even then it still doesn't.  In this contrived example, it'd still match I.S.O.F. and even span sentence gaps, as in "Do you know what that *is*?  *Of* course I do!"

The Technical Challenge:

To be fair, there's technical problems with trying to match arbitrary exact patterns of characters in a scalable way.  Punctuation presents a challenge, with many options.  And most engines use tokenization, which implies word breaks, which normally wouldn't handle arbitrary substring matching.

At least with some engines, if you want to support both case insensitive and case sensitive matching, you have two different indexes, with the latter sometimes being called a "casedex".  Other engines allow you to generate multiple overlapping tokens within the index, so "A-B" can be stored as both separate A's and B's, and also as "AB", and also as the literal "A-B", so any form will match.
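That overlapping-token idea can be sketched in a few lines; this is my own toy illustration in Python, not how any particular engine actually implements it:

```python
import re

def emit_tokens(word):
    """Emit overlapping index terms for one word, so a query in any
    form -- the pieces, the fused form, or the literal -- can match."""
    word = word.lower()
    parts = [p for p in re.split(r"[^0-9a-z]+", word) if p]
    tokens = set(parts)               # the separate pieces: 'a', 'b'
    if len(parts) > 1:
        tokens.add("".join(parts))    # the fused form: 'ab'
        tokens.add(word)              # the literal form: 'a-b'
    return tokens

print(emit_tokens("A-B"))  # contains 'a', 'b', 'ab', and 'a-b'
```

The cost of the trick is visible right away: one hyphenated word produced four index terms instead of one.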

Some would say I'm really looking for the Unix "grep" command, or the SQL "LIKE" operator.  And by the way, those tools are VERY inefficient because they use linear searching instead of pre-indexing.  And if you tried to have a set of indexes to handle all permutations of case matching, punctuation, pattern matching, etc., you'd wind up with a giant index, maybe way larger than the source text.
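To make the index-size trade-off concrete, here's a toy character-trigram index in Python; it gives grep-style substring lookup without a linear scan, but the postings quickly rival the source text in size (my own sketch, not any real engine's scheme):

```python
def ngram_index(text, n=3):
    """Map every character n-gram to the positions where it starts."""
    index = {}
    for i in range(len(text) - n + 1):
        index.setdefault(text[i:i + n], []).append(i)
    return index

text = "the quick brown fox jumps over the lazy dog"
idx = ngram_index(text)

# A substring query's trigrams narrow candidates without scanning the text.
print(idx["qui"])  # [4] -- "quick" starts at position 4

# One posting per position: the postings alone match the text in size,
# before you even add case variants or punctuation-stripped forms.
total_postings = sum(len(positions) for positions in idx.values())
print(total_postings)  # 41, vs. 43 characters of source text
```

Layer on case-folded, punctuation-stripped, and literal variants and you can see how the index balloons past the source.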

But I do think Google has moved beyond at least some of these old limitations; they DO seem to find matches that go beyond simple token indices.

Could you store an efficient, scalable set of indices that hold enough information to accommodate both normal English words and complex near-regex-level literal matching, and still have reasonable performance and reasonable index sizes?  In other words, "could you have your cake and eat it too"?  Well... you'd think a multi-billion-dollar company full of Stanford smarties certainly could! ;-)  But then the cost would need to be justified... and outlier use-cases never survive that scrutiny.  As long as the underlying index supports finding celebrity names and lasagna recipes, and pairing them with appropriate ads, the 80% use cases are satisfied.

November 20, 2011

Dell.com Site Search acting weird today

FYI: The Dell Zino is (was?) a small form factor machine that could be used as a portable server, probably a bit larger than a Mac Mini, but still portable.

If you go to Dell.com and use their search to do a one-word search for "zino" you get 8 results, but none of them are for that machine.  7 are for memory SIMMs, and the bottom result is for a different Dell machine, a small tower.  At first I was worried that perhaps they had discontinued the cute little guy.

I went to Google and one of the suggested searches was "dell zino discontinued"; aw... I was afraid of that.  But wait! The first page didn't actually say it was discontinued, it just had those words in a long discussion thread.  And the second result goes to the Zino page on Dell.com, and it is still listed, though I wasn't able to actually buy it.  When you click the "choose" link you're asked to choose a market segment, but the list was empty.  Maybe it's not for sale and their site search knows not to display it???

November 08, 2011

Are you spending too much on enterprise search?

If your organization uses enterprise search, or if you are in the market for a new search platform, you may want to attend our webinar next week "Are you spending too much for search?". The one hour session will address:

  • What do users expect?
  • Why not just use Google?
  • How much search do you need?
  • Is an RFI a waste of time?   

Date: Wednesday, November 16 2011

Time: 11AM Pacific Standard Time / 1900 UTC

Register today!

Pingar and New Idea Engineering Partnership

I'm happy to announce that our company, New Idea Engineering, has announced a partnership with Pingar, a New Zealand-based company that provides tools to extend and enhance the capabilities of enterprise search. New Idea Engineering is Pingar's first North American reseller.

Pingar markets libraries that provide tools for entity extraction, document summarization, redaction for key documents, autocomplete and a number of other capabilities that organizations can use to improve the user search experience.

In the developer area, Pingar provides access to view the various capabilities in action. For example, you can paste in the text of a document and see the summarization, or view the redaction, or any of the other Pingar capabilities. Developers can download an API key to test the code themselves. Pingar supports both C# and Java.

We'll be writing more about Pingar in action over the coming months.