69 posts categorized "Technical"

November 29, 2011

10 Handy Things to Know about the Lucene / Solr Source Code

It's funny how certain facts are "obvious" to some folks, stuff they've known a long time, but come as a pleasant surprise to others.  Chances are you know at least half of these, but no harm in double checking!

  1. Although Lucene and Solr are available in binary form, most serious users are eventually going to need some custom code.  If you post questions on the mailing lists, I think the assumption is you're comfortable with compilers, source code control and patches.  So it's a good habit to get into early on.
  2. Lucene and Solr source code were combined a while back (circa March 2010), so it's now one convenient checkout.
  3. You'll want to be using Java 6 JDK to work with recent versions of Lucene / Solr.
  4. Lucene/Solr use the ant build tool by default.  BUT did you know that the build file can also generate project files for Eclipse, IntelliJ and Maven?  So you can use your favorite tool.  (See the README.txt file for info and links.)
  5. Lucene/Solr use the Subversion / SVN source code control system.  There are clients for Windows and plugins for Eclipse and IntelliJ. (Mac OS X has it built in)
  6. You're allowed to do read-only checkout without needing any sort of login - checkouts are completely open to the public.  This is news to folks who've used older or more secure systems.
  7. Although checking any changes back in would require a login, it's more common to post patches to the bug tracking system or mailing list, and then let the main committers review and check in the patch.  Even the read-only checkouts create enough information on your machine to generate patches from your local changes.
  8. Doing a checkout, either public or with a login, does not "lock" anything.  This is also a surprise to folks used to older systems.  This non-locking checkout is why anonymous users can be allowed to check out code - there's no need to coordinate checkouts.
  9. The read-only current source for the combined Lucene + Solr is at http://svn.apache.org/repos/asf/lucene/dev/trunk  Even though it's an http link, and can be browsed with a web browser, it's also a valid Subversion URL.
  10. The "contribute" wiki pages for Lucene and Solr have more info about the source code and patch process.
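As a rough sketch of items 6 through 9, the snippet below only builds the two Subversion command lines involved: an anonymous checkout and a patch generated from local changes. It assumes the standard `svn` command-line client and that the trunk URL from item 9 still resolves; the directory and patch file names are made up for illustration.

```python
import shlex

# URL taken from item 9; assumed to still be served.
TRUNK_URL = "http://svn.apache.org/repos/asf/lucene/dev/trunk"


def checkout_command(target_dir, url=TRUNK_URL):
    """Anonymous, read-only checkout: no login and nothing gets locked."""
    return ["svn", "checkout", url, target_dir]


def patch_command():
    """'svn diff' prints local changes as a unified diff, i.e. a patch."""
    return ["svn", "diff"]


if __name__ == "__main__":
    # "lucene-trunk" and "my-change.patch" are hypothetical names.
    print(shlex.join(checkout_command("lucene-trunk")))
    print(shlex.join(patch_command()) + " > my-change.patch")
```

Run from inside the checked-out directory, the second command writes your local edits to a file you can attach to a bug tracker issue.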

November 28, 2011

Solr Disk and Memory Size Estimator (Excel worksheet)

If you do a standard checkout of the Lucene/Solr codebase you also get a dev-tools directory.  One interesting tidbit in there is an Excel spreadsheet for estimating the RAM and disk requirements for a given set of data.  Be sure to notice the tabs along the bottom: tab 2 is for memory/RAM estimates, and tab 3 is for disk space.

Full URL: http://svn.apache.org/repos/asf/lucene/dev/trunk/dev-tools/size-estimator-lucene-solr.xls

November 22, 2011

7 things GMail Search needs to change

My General Complaint:

If you've had a gmail account for many years, either for work or personal, it's getting large enough that GMail's search is starting to break.

Any word you can think of to type in will match tons of useless results.  Eventually, as you try to think of more words to add, your results count goes to zero.

If you were lucky enough to have starred the email when you saw it, or can remember who might have sent it, or maybe the approximate timeframe, or maybe you think you might have sent the email in question from this account, you *might* have a chance.

A Tough Problem:

I realize this seems like classic precision and recall troubles, but Google is pretty smart, and they have a fair amount of metadata, and a lot of context about me, so there are some potential fixes to hang a hat on.

And some of my ideas involve making use of labels/tags (Gmail's equivalent of folders), but that assumes that people are using labels, which I suspect many folks don't, or at least not beyond the default ones you get.  Well... sure, but they DO have them, and there's an automated rules engine in Gmail to set them, so presumably a few people use tags / labels?  (or maybe nobody does and, in hindsight, maybe it's a legacy feature!?) So, if you're going to have labels, and you've got even a few users who bother with them, then make them as useful as possible.  AND maybe make Labels more visible, maybe easier to set, more powerful, etc.

On To The Ideas:

1: Make it easier to refine search results.

Let's face it, as you accumulate more and more email, the odds of finding the email you want on the first screen of search results go WAY down.

Google wisely uses most-recent-first sorting in search results, vs. their normal relevancy, in the GMail search UI.  I'm not sure why; this seems like an odd choice for them given all the bravado about Google's relevancy, but I'm guessing it was too weird to have email normally sorted by date in most parts of the UI, but have it switch back and forth between relevancy and date as you alternate between search and normal browsing.  Also, maybe they found it's more likely you're looking for a very recent email.  You could fold "freshness" into relevancy calculations, but just respecting date keeps it more consistent.
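For what it's worth, here is one way "folding freshness into relevancy" could look. This is only a sketch under my own assumptions: an exponential decay with a made-up 30-day half-life, multiplied into whatever relevancy score the engine produced. Nothing here reflects how GMail actually ranks.

```python
def blended_score(relevancy, age_days, half_life_days=30.0):
    """Scale a relevancy score by an exponential freshness decay.

    half_life_days is a hypothetical tuning knob: a message that old
    keeps half of its relevancy score.
    """
    freshness = 0.5 ** (age_days / half_life_days)
    return relevancy * freshness


# Made-up messages with made-up relevancy scores.
messages = [
    {"subject": "old but on-topic", "relevancy": 0.9, "age_days": 365},
    {"subject": "recent, weaker match", "relevancy": 0.5, "age_days": 2},
]

ranked = sorted(
    messages,
    key=lambda m: blended_score(m["relevancy"], m["age_days"]),
    reverse=True,
)
```

With these numbers the recent message wins despite its weaker match, which shows the trade-off: the blend is harder to explain to a user than a plain date sort.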

Yes, GMail does have some search options... I'll get to those, but suffice to say they are very "non iterative".

Other traditional filters should be facets as well.  "Sent" emails, date ranges, "has attachments" (maybe even how many, sizes, or types)

2: Promote form-based "Search options" to FULL Facets

You can limit your search to a subset of your email if you've Labeled it - this is the GMail equivalent of Folders.  But doing this is a hassle (see item 3), and you can't do this after the fact, once you're looking at results.

So, if you do a normal text search, and then remember you labeled it, you can't just click on the tags on the left of the results.  Those are for browsing, and will actually clear out your search terms.  These should be clickable drilldown facets, perhaps even with match counts in parentheses, and maybe some stylizing to make it clear that they will affect the current search results.
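To make the drilldown idea concrete, here is a toy sketch of label facets with match counts. The mailbox, the field names and the naive substring matching are all invented for illustration; a real engine would compute the counts from its index.

```python
from collections import Counter


def matches(message, terms):
    """Naive text match: every term must appear somewhere in the text."""
    text = message["text"].lower()
    return all(term.lower() in text for term in terms)


def search(messages, terms, label=None):
    """Text search, optionally narrowed to one label (the drilldown click)."""
    hits = [m for m in messages if matches(m, terms)]
    if label is not None:
        hits = [m for m in hits if label in m["labels"]]
    return hits


def label_facets(hits):
    """Count how many of the current hits carry each label."""
    return Counter(label for m in hits for label in m["labels"])


# Hypothetical mailbox.
MAIL = [
    {"text": "Invoice for the search conference", "labels": ["receipts", "travel"]},
    {"text": "Conference hotel confirmation", "labels": ["travel"]},
    {"text": "Lunch on Friday?", "labels": ["personal"]},
    {"text": "Conference call notes", "labels": []},
]

hits = search(MAIL, ["conference"])
facets = label_facets(hits)  # counts to show next to each label
narrowed = search(MAIL, ["conference"], label="travel")  # same terms, one click
```

The point is that the search terms survive the click: the label only narrows the existing result set.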

Yes, there's a syntax you can use:

label:your-label regular search terms

It's a nice option for advanced users who are accurate touch typists and remember the tag name they want, but this should also be easy from the UI.  Yes, there is an advanced search / search options form, but this brings me to item 3...

(read the rest of the ideas after the break)

Continue reading "7 things GMail Search needs to change" »

November 21, 2011

Google: Sometimes I really do want EXACT MATCHES

Disclaimer: Google only attracts my annoyances more because I use it so much.  And I'm confident they can do even better, and so I'm helping by writing this stuff down!

My Complaint:

Back in my day, when you typed something in quotes into a search engine, you'd get an exact match!  Well... OK, sometimes that meant "phrase search" or "turn off stemming"... but still, if it was only a ONE WORD query, and I took the time to still put it in quotes, then the engine knew I was being VERY specific.

But now that everyone's flying with jet-packs and hover boards, search engines have decided that they know more than I do, and so when I use quotes, they seem to ignore them!

I can't give the exact query I was using, but let's say it'd been "IS_OF".  Google tries to talk me out of it, doing a "Show results for (something else)", but then I click on the "Actually do what I said" hyperlink.  And even then it still doesn't.  In this made-up example, it'd still match I.S.O.F. and even span sentence gaps, as in "Do you know what that *is*?  *Of* course I do!"

The Technical Challenge:

To be fair, there are technical problems with trying to match arbitrary exact patterns of characters in a scalable way.  Punctuation presents a challenge, with many options.  And most engines use tokenization, which implies word breaks, which normally wouldn't handle arbitrary substring matching.

At least with some engines, if you want to support both case insensitive and case sensitive matching, you have two different indexes, with the latter sometimes being called a "casedex".  Other engines allow you to generate multiple overlapping tokens within the index, so "A-B" can be stored as both separate A's and B's, and also as "AB", and also as the literal "A-B", so any form will match.
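Here's a small sketch of both ideas in that paragraph: emitting several overlapping forms of a punctuated token, and keeping a case-sensitive index next to a case-folded one. The function names and the splitting rule (break on anything that isn't a letter or digit) are my own assumptions, not any particular engine's behavior.

```python
import re


def token_variants(raw):
    """Return the forms under which one raw token might be indexed."""
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", raw) if p]
    variants = [raw]              # the literal, e.g. "A-B"
    variants.extend(parts)        # the separate pieces, e.g. "A" and "B"
    if len(parts) > 1:
        variants.append("".join(parts))  # the joined form, e.g. "AB"
    unique = []
    for v in variants:            # drop duplicates, keep order
        if v not in unique:
            unique.append(v)
    return unique


def build_indexes(docs):
    """Build a case-sensitive index (the "casedex") and a case-folded one."""
    exact, folded = {}, {}
    for doc_id, text in docs.items():
        for raw in text.split():
            for variant in token_variants(raw):
                exact.setdefault(variant, set()).add(doc_id)
                folded.setdefault(variant.lower(), set()).add(doc_id)
    return exact, folded


# Made-up documents.
DOCS = {1: "set the IS_OF flag", 2: "what that is? Of course"}
EXACT, FOLDED = build_indexes(DOCS)
```

Note how every extra variant is another index entry, which is where the index bloat comes from.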

Some would say I'm really looking for the Unix "grep" command, or the SQL "LIKE" operator.  And by the way, those tools are VERY inefficient because they use linear searching, instead of pre-indexing.  And if you tried to have a set of indexes to handle all permutations of case matching, punctuation, pattern matching, etc, you'd wind up with a giant index, maybe way larger than the source text.
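The trade-off can be sketched in a few lines. The grep-style scan below touches every document on every query but can find any substring; the token index answers with one dictionary lookup but only knows whole tokens. The documents and the whitespace tokenizer are made up for illustration.

```python
def grep_scan(docs, pattern):
    """Linear scan: examines the full text of every document per query."""
    return {doc_id for doc_id, text in docs.items() if pattern in text}


def build_inverted_index(docs):
    """Pre-index: map each lowercased whitespace token to its documents."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(token, set()).add(doc_id)
    return index


# Made-up documents.
DOCS = {
    1: "the IS_OF macro is defined here",
    2: "lasagna recipes",
    3: "is of course",
}
INDEX = build_inverted_index(DOCS)

scan_whole = grep_scan(DOCS, "IS_OF")       # finds doc 1, after reading everything
index_whole = INDEX.get("is_of", set())     # finds doc 1, with one lookup
scan_partial = grep_scan(DOCS, "S_O")       # arbitrary substring still works
index_partial = INDEX.get("s_o", set())     # the token index has no such entry
```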

But I do think Google has moved beyond at least some of these old limitations, they DO seem to find matches that go beyond simple token indices.

Could you store an efficient, scalable set of indices that store enough information to accommodate both normal English words and complex near-regex level literal matching, and still have reasonable performance and reasonable index sizes?  In other words "could you have your cake and eat it too"?  Well... you'd think a multi-billion-dollar company full of Stanford smarties certainly could! ;-)  But then the cost would need to be justified... and outlier use-cases never survive that scrutiny.  As long as the underlying index supports finding celebrity names and lasagna recipes, and pairing them with appropriate ads, the 80% use cases are satisfied.

November 20, 2011

Dell.com Site Search acting weird today

FYI: The Dell Zino is (was?) a small form factor machine that could be used as a portable server, probably a bit larger than a Mac Mini, but still portable.

If you go to Dell.com and use their search and do a one word search for "zino" you get 8 results, but none of them are for that machine.  7 are for memory SIMMs, and the bottom result is for a different Dell machine, a small tower.  At first I was worried that perhaps they had discontinued the cute little guy.

I went to Google and one of the suggested searches was "dell zino discontinued", aw... I was afraid of that... But wait! - The first page didn't actually say it was discontinued, it just had those words in a long discussion thread.  And the second result goes to the Zino page on Dell.com, and it is still listed, though I wasn't able to actually buy it.  When you click the "choose" link you're asked to choose a market segment but the list was empty.  Maybe it's not for sale and their site search knows not to display it???

August 01, 2011

Google Refine, Google's open source ETL tool for data cleansing, with videos!

For any of you working with Entity Extraction this might be of interest.  Google has open sourced some software from their FreeBase acquisition, formerly called Gridworks.  It lets you interactively clean up and transform data.  More importantly, it saves these steps into a reusable sequence in JSON format, so they could be reapplied to other data.
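To illustrate the general idea of recording cleanup steps as JSON and replaying them on other data, here is a toy sketch. The step names ("trim", "uppercase", "replace") and the JSON layout are invented for this example and are not Refine's actual operation format.

```python
import json

# Hypothetical recorded steps; not Google Refine's real schema.
STEPS_JSON = """
[
  {"op": "trim", "column": "name"},
  {"op": "uppercase", "column": "state"},
  {"op": "replace", "column": "name", "find": "Corp.", "with": "Corporation"}
]
"""


def apply_step(row, step):
    """Apply one recorded step to one row and return the updated copy."""
    column = step["column"]
    value = row[column]
    if step["op"] == "trim":
        value = value.strip()
    elif step["op"] == "uppercase":
        value = value.upper()
    elif step["op"] == "replace":
        value = value.replace(step["find"], step["with"])
    else:
        raise ValueError("unknown step: " + step["op"])
    return {**row, column: value}


def replay(rows, steps_json):
    """Replay the whole recorded sequence, in order, over every row."""
    steps = json.loads(steps_json)
    cleaned = []
    for row in rows:
        for step in steps:
            row = apply_step(row, step)
        cleaned.append(row)
    return cleaned


cleaned = replay([{"name": "  Acme Corp. ", "state": "ca"}], STEPS_JSON)
```

Because the steps live in data rather than code, the same sequence can be pointed at next month's file without redoing the interactive work.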

Here's the main page and wiki (and 3 intro videos):

It IS Open Source, here's the source code and license:

That type of UI makes me want to dust off our XPump code and retrofit into it...

July 07, 2011

Webinar: Customizing the SharePoint Advanced Search Page

Sorry for the late notice - I just discovered it myself today.

Josh Noble, SurfRay consultant and author of Pro SharePoint 2010 Search, will give a webinar on Customizing the Advanced Search Page on July 8 at 11AM Pacific, 2PM Eastern. If you're in Europe, it's worth staying up for!

Josh gave a related talk at SharePoint Saturday Sacramento a couple of weeks ago, and he really had some great tips and techniques. Register for the webinar now.




February 13, 2011

Humans versus Watson on Jeopardy Feb 14-16 2011

This week is a big one in search technology. Well, sort of - if you liked seeing IBM's 'Deep Blue' beat Garry Kasparov back in 1997.

For several years, a team at IBM has been working on a computer system - dubbed 'Watson' - that will be one of the featured players this week on the game show Jeopardy.

According to NOVA, Watson has passed the screening interview required of all players; and this week - Monday the 14th through Wednesday the 16th - Watson will take on the two best human players in Jeopardy history, Ken Jennings and Brad Rutter, in a historic match. The NOVA special, 'The Smartest Machine on Earth', tells the story in a captivating way without too much waving of the hands. It takes us through the low points and the ultimate high point, when, in a test round a few months ago, Watson soundly defeated two human players.

Watson is not connected to the Internet, so it's on its own at air time. The system is not voice-driven, so for input it receives the question in the form of a text stream when the director clicks the magic button to flip the question. Watson can buzz in like the human players, and it speaks the 'question' in a synthesized human voice. Because it cannot listen to the other players' wrong answers, the IBM support engineers 'notify' Watson when there was a wrong answer so it can use that information in its determination.

Watching the practice round linked above is interesting: they've overlaid Watson's answers even when it did not buzz in first; and it is uncanny how often Watson was right - just too late to buzz in.

This doesn't apply to search engines just yet; Watson is programmed for the nuances of the game show and isn't billed as an AI device. Still, it's interesting to see the work the IBM team put into getting Watson ready; and we'll see how it does this week.

Man versus machine: sounds like something right out of the Firesign Theatre's 'I Think We're All Bozos on This Bus' when 'Ah Clem' takes on the President and wins. Except that this time it might be a chance for revenge if Watson can pull it off: check it out this week, Monday through Wednesday!



November 08, 2010

Enterprise Search Summit DC November 15-18

The new home for the Fall ESS show is the Renaissance Hotel in downtown Washington, DC... so much for ESS-West! The new locale should bring a large number of new attendees and visitors, and a new co-located conference: SharePoint Symposium. InfoToday knows a trend when they see one!

In addition to the usual sessions provided to show sponsors, there are some interesting sessions by Tom Reamy of KAPS Group; Martin White of Intranet Focus; and eDiscovery expert Oz Benamram, CKO of White and Case LLP. Tony Byrne of Real Story Group will also be there, moderating the session I'll be participating in: Stump the Search Consultant on Wednesday afternoon November 17th.

I really expect the show to have a large number of government folks in attendance, given how hard it's been for these good folks to travel to previous ESS conferences in New York and San Jose. InfoToday reports higher pre-registration this year than in the past; and I'll be happy to find out I'm wrong about most of the attendees being government or government-related folks.

Come by the session Wednesday afternoon at 3PM; or leave a comment here if you want to get together.



September 04, 2010

Faster sorting for Farsi / "Iranian", Danish, Turkish, other atypical languages in Lucene/Solr

By default search engines sort results by relevance or "score", to try and bring the best match to the top of the results list. That's normally what users want, but occasionally you might want to sort by a different field, such as date, title or author. Lucene and Solr support this in various ways, as do many other search engines.

When it comes to sorting by titles or author names, most languages sort words with similar rules, and this is the character ordering that's built into Unicode. But a few languages are different; they may have different policies on accented characters, for example. Java includes the concept of "locale" to represent some language differences, such as currency and date formats, and it can also encode these differences in preferred order. However, apparently the performance isn't great, so sorting in some languages can be slow, or there may not be a locale for a specific language/dialect.

Lucene does include an alternate "collator" class that claims to fix this. It allows for non-default Unicode sorting rules, without the slowdown normally associated with locales. The doc mentions Farsi, Danish and Turkish as examples. Although I haven't tried it, since it's buried a bit in the code tree, I wanted to surface it in a post.
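To show why plain Unicode order isn't enough, here is a hand-rolled sketch using Danish, where the alphabet ends in æ, ø, å. This is not Lucene's collator and it ignores the finer points of real Danish collation; it just ranks each letter by its position in a hard-coded alphabet, and the word list is made up.

```python
# Simplified Danish alphabet order; real collation rules have more cases.
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(DANISH_ALPHABET)}


def danish_key(word):
    """Sort key: each letter's position in the alphabet, unknowns last."""
    return [RANK.get(ch, len(RANK)) for ch in word.lower()]


titles = ["åben", "zebra", "æble", "øl", "apple"]

by_code_point = sorted(titles)               # å sorts before æ and ø
by_danish = sorted(titles, key=danish_key)   # å sorts last, as Danes expect
```

Building a key per word like this is the same basic trick a collator uses: pay the language-specific cost once per value, then compare cheap keys during the sort.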

The top URL (in case formatting gets lost) is:


Usage scenarios are given in package.html