16 posts categorized "Google public search"

March 20, 2013

Open Source Search Myth 3: Skills Required In-House

This is part of a series addressing the misconception that open source search is too risky for companies to use. You can find the introduction to the series here; this is Part 3 of the series; for Part 2 click Potentially Expensive Customizations.

Part 3: Skills Required In-House

One of the hallmarks of enterprise software in general is that it is complex. People in large organizations who manage instances of enterprise search are no less likely than their non-technical peers to believe that "if Google can make search so good on the internet, enterprise search must be trivial". Sadly, that is the killer myth of search.

Google on the internet - or Bing or Baidu or whichever site you use and love - is good because of the supporting technology, NOT simply because of search. I'd wager that most of what people like about Google et al has very little to do with search and a great deal to do with constant monitoring and tweaking of the platform.

Consider: at the Google 'command line' (the search box), you can type in an arithmetic expression such as "2+3" and get 5. You can enter a FedEx tracking number and get a suggestion to link to FedEx for information. It's cool that Google provides those capabilities and others; but those features are there because Google has programs looking at search behavior for all of its users every day in order to understand user intent. When something unusual comes up, humans get involved and make judgments. When it makes sense, Google implements another capability - in front of the search engine, not within it.

Enterprise search is the same - except that very few companies invest money in managing and running their search; so no matter how well you tune it at the beginning, quality deteriorates over time. Enterprise search is not 'fire and forget'.

Any company that rolls out a mission-critical application and does NOT have its own skilled team in house is going to pay a consulting firm thousands of dollars a day forever. 'Nuff said.


July 23, 2012

Want search 'just like Google'?

This weekend I read a post by John Hefferman on Seth Earley's blog, which was related to a discussion going on at the LinkedIn Enterprise Search Engine Professionals group. The similarity may not jump right out at you, but let me try to piece together the logic in my brain that makes them one and the same.

If asked, most enterprise search users will tell you they want search "just like Google". In fact, web search is different from enterprise search - but it is possible to deliver a somewhat psychic 'Google-like' experience in the enterprise. It takes time, effort, and an understanding of what people mean by "just like Google". And yes, these are often hard to come by. (If you have a few minutes, see our webinar from earlier this year, "What enterprise users want from search".)

The LinkedIn discussion was started by Phil Lloyd of Standard Life in the UK, who asked about the effort required to keep an enterprise search system running reasonably well (in fact, he asked what the minimum effort was to implement FAST Search for SharePoint; but the discussion quickly became one of ongoing effort).

The answer, as valid for FAST as it is for any enterprise product, is, of course, "it depends".

You've just spent a large sum to license the platform. If you want it to work well, you'll staff it to run well and provide a return on the investment. If you don't care how well it works, it can be arbitrarily inexpensive. Heck, if it doesn't need to work well, you may never need to touch it again. After a few years, a new VP will be willing to toss it out and spend a large sum on a new and improved platform. Of course, without attention, it, too, will be doomed to failure.

How does this relate to the Google article? One reason Google on the public web is so good is that it has a huge data set to work with. But - and this is what most enterprise search owners don't get - Google also has armies of engineers and bots looking at search activity every day... in fact, perhaps even every second or two, 24 hours a day, 365 days a year, worldwide.

When you search Google for your FedEx tracking number, do you think that is in the search index? When Google recognizes a 12 digit number that matches the algorithm for FedEx tracking codes, it bypasses the search index and offers to link the user to FedEx. Did you know that if you change a single digit in that code there's an excellent chance you'll get a rare 'no hits' from Google?

How did they know this? Well, by noting a bunch of odd 12 digit numbers as searches... all with no hits and almost every one unique - some bot noticed that the subsequent search was for FedEx, and alerted a human engineer to answer the riddle and take action. In this case, the action was likely to contact FedEx to understand its algorithm in order to recognize a valid tracking code - and to insert logic ahead of the actual search to suggest a specific page on the FedEx site - the one that tracks that package.

If you want search that works just like Google, you need to invest some time. Maybe it's an average of one or two FTEs each year; maybe it's less. One thing is certain: when you first roll out search, it will take more effort than when it's been running for a year. But if you let search roll out and never look at it again - well, then you've managed to put in the minimum effort for search. And you've probably doomed your company to buying a new platform out of frustration in a few years. The choice is yours.

April 10, 2012

Autonomy 'King of the Cloud'

Years ago, my friend Jerry Gross in PR at HP related a funny story about how companies work with the press. He met with an editor of some electronics magazine to announce the new, improved memory chips that HP had created that actually provided 4K on a single chip! (This was a while ago!) 

Way back then this *was* news; but to Jerry and the reporter, it was yet another memory product to announce and the meeting was just specs and details. Jerry, a prankster at heart, decided to throw in a twist: he said that HP decided to use ROUND chips in this new product, rather than conventional rectangular ones. 

This piqued the reporter's interest: round chips? Yes, when you think about it, in rectangular chips, some of the bits are in a corner, so it takes longer for those bits to be accessed. By making the chip round, all the bits were equidistant from the center, so all could be accessed at the same speed!

The reporter was eating this up - something new and exciting! He wrote a quick paragraph for his publication before Jerry broke out laughing.  Luckily, their relationship was a good one and both had a great laugh about it. The round memory chip never made it to the world media.

Today I read an article in the London Business Weekly, reporting that Autonomy now has the "world's largest private cloud", more than "50 petabytes of data including web content, video, email and multimedia data". Granted, Autonomy has a great service business in hosting and search-enabling all sorts of multimedia content. But... I wonder if the reporter ever wondered out loud about some other rather large 'private clouds' - perhaps Google? Or Microsoft? Amazon?

Maybe none of these robust competitors are as big as Autonomy; maybe HP really became the cloud giant by acquiring Autonomy last year. Or maybe a round memory chip made it past a reporter today. What do you think?



February 21, 2012

10 changes Wikipedia needs to become more Human and Search Engine Friendly

There's a really nice set of examples comparing JSON to other similar formats like YAML, Python, PHP, PLists, etc.  It was in a Wikipedia article, but you won't see it now unless you know to go looking through the version history (link in previous sentence).

This content had existed for quite a while in that article, and had been contributed to by many people.  One day in March 2011, one editor decided it was irrelevant and gutted that entire section.  The information was useful - I was actually looking for it today!  I happened to think of reviewing the version history, since I was almost sure that's where it had been.

The editors at Wikipedia need to be able to delete content, for any number of reasons, and I'm sure it's a thankless job.  And there are procedures for handling disputed edits - I've pinged that editor who deleted it about maybe finding a new home for the content.  Also, ironically, I found an out of date copy of the page here that still has it.

I'm not in favor of a ton of rules, but I believe wholesale deletes of long-existing and thriving content should get some special attention.  To be clear, I'm not talking about content that's clearly wrong or profane or whatever, or gibberish that was just added.  How about, as a first pass: "a paragraph or more that has existed for more than 6 months (long-lived) and has been contributed to (not just corrected) by at least 3 people (thriving)".

Human-Centric Policy & Tech Changes:

  • If the content's only "crime" is that it's somewhat off-topic, then the person doing the deleting ought to find another home for it.  The editor could either move it to another page, or possibly even split up the current page; maybe they could "fork" the page into 2 pages, cross link them, and then remove duplicate content, so that page 1 retains the original article and links to page 2, and page 2 has the possibly-off-topic content. Yes, this would take more effort for the "deleting" editor, BUT what about the large amount of effort the multiple contributors put into it, going the extra step to try and conform to Wikipedia's policies so that it would NOT get deleted?  I also suspect that senior editors, those more likely to consider wholesale deletes, are probably much more efficient at splitting up a page or moving content somewhere else - novice contributors might be unaware, or only vaguely aware, that such things are even possible.
  • Wikipedia should make it easier for contributors to find content of theirs that's been deleted.  This is a somewhat manual process now.  Obviously they don't want to promote "edit wars".
  • Wikipedia should generally track large deletes (maybe they do?)
  • Wikipedia should "speed up" its diff viewer.  It should run faster so you can zip through a bunch of changes, and maybe even include "diff thumbnails".  The UI makes sense to developers used to source code control systems, but is probably confusing to most others.  I realize this is all easier said than done!
  • Wikipedia should include some visual indication of "page flux".  It would be helpful, for example, if a young person could see at a glance that abortion and gun control are highly debated subjects between adults.
  • Wikipedia should be a bit more visually proactive in educating visitors that there are other versions of a page.  I'm sure Wikipedia veterans would say "more visible!? - there's a bold link in a tab at the top of every page!"  While that's true, it just doesn't convey it to casual visitors.  On the other hand, it shouldn't go too far and be annoying about it - like car GPS systems that make you agree to the legal disclaimer page every time you get in the car!

Search Engine Related Changes:

  • Wikipedia search (I use the Google option) should have an option to expand the scope of search to include deleted content.  This shouldn't be the default, and there are presentation issues to be considered.  Some deletes are in the middle of sentences, and there are multiple deletes and edits, etc, so I realize it's not quite as easy as it may sound.
  • There needs to be a better way to convey this additional data to public search engines, as well as representing it in Wikipedia's own built-in engine.
  • Wikipedia should consider rendering full content and all changes inline, using some type of CSS / HTML5 dynamic mode that marks suspect or deprecated content with tags, instead of removing it.  Perhaps the search engines could also request this special version of the page and assign relevance accordingly.
  • Perhaps Wikipedia could offer some alternative domain name for this somewhat messier version of the data, something like "garaga.en.wikipedia.org".

It's Not Just A or B:

  • Whenever I hear people lament the declining content contributions on Wikipedia I have to chuckle.  It's incredibly demoralizing to have content you took the time to contribute deleted.  If a new contributor on Wikipedia discovers 1 or 2 of their first edits promptly deleted, trust me, they're very unlikely to try again.  I know a number of people who have just given up.
  • Others would say that if you put more pressure on editors to not delete, then the overall quality of Wikipedia will go down, and raving nut-jobs and spammers will engulf the site.
  • The compromise is to flag content (which is very similar to tracking diffs) and give users and search engines some choice in whether they want to see "suspect" content or not.
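Flagging really is mechanically close to what diff tools already compute. A toy sketch in Python (the `[flagged:...]` markers and the function name are mine, not anything Wikipedia actually uses) of merging two revisions while keeping, rather than dropping, the deleted text:

```python
import difflib

def flag_deletions(old_words, new_words):
    """Merge two revisions, keeping deleted runs but marking them as flagged."""
    merged = []
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words)
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op in ("equal", "insert"):
            merged.extend(new_words[b1:b2])
        else:  # "delete" or "replace": keep the old text, flagged, plus any new
            merged.extend(f"[flagged:{w}]" for w in old_words[a1:a2])
            merged.extend(new_words[b1:b2])
    return merged

old = "JSON compared to YAML and PLists".split()
new = "JSON compared to YAML".split()
print(" ".join(flag_deletions(old, new)))
# JSON compared to YAML [flagged:and] [flagged:PLists]
```

A renderer (or a search engine) could then decide whether to show, hide, or down-weight the flagged spans.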

This is about survival and participation.  When newer contributors have their content "flagged" vs. "deleted", with more explanations and recourse, they will still learn to uphold Wikipedia's quality standards without being too discouraged.  They'll hang around longer and practice more.

An analogy: Wikipedia's current policy is like potty training your kid with a stun gun - make one mistake and ZAP! - or don't bother and just keep going in your diaper like you've always done.

I understand and appreciate all the work that Wikipedia's volunteers do, but I think there are some constructive things that could be done better.

December 12, 2011

New Phrase for determining Sentiment Analysis / Customer Interest

If you look up:

fedex "Package not due for delivery"

which is one of the status messages you can get when tracking a package, you'll see a lot of postings asking about it.

FYI: It means your new toy has arrived in the city you live in, but will NOT be delivered today, because they didn't promise to get it to you until tomorrow.  Whether this is to force customers into paying for express service, or simply a logistics issue, or a mix of the two, depends on your view of companies and I won't get into that here.

However, you'll notice a lot of the postings asking about it are from folks waiting for delivery of things they're very excited to get, often some big-ticket piece of shiny electronics.  They're dying for FedEx to deliver it - they're so anxious and upset about the delay that they're motivated enough to go online and search, and make ranting posts - all because their "toy" is delayed.

So we have a particular emotional response, often about an upscale product, with a reasonably distinct search phrase - cool!

Yes, yes, of course you could say that the customers are mad about the perceived injustice of it, the Occupy Wall Street spin, or that sometimes the package could be really important for other reasons, which are certainly valid points.  I'm not taking sides or passing judgment - and I discovered this today looking for a friend's overdue toy - that's not the point.  I'm just saying that I bet there's a good statistical correlation, and of course it wouldn't apply 100% of the time - which would actually be quite rare in such things.

November 30, 2011

Odd Google Translate Encoding issue with Japanese

I was translating a comment in the Japanese SEN tokenization library.

It seems that if your text includes the Unicode right arrow character, Google somehow gets confused about the encoding.  I saw this in both Firefox and Safari.  Not a big deal - and it's strangely comforting to see even the big guys trip up on character encodings.

OK: サセン
OK: チャセ
Not OK: サセンチャセ?
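For what it's worth, here's a guess at the general class of bug, sketched in Python - I have no idea what Google's actual pipeline does, but classic mojibake looks a lot like this: UTF-8 bytes decoded with the wrong charset shred every multibyte character, the arrow included.

```python
text = "サセン→チャセ"              # kana plus the Unicode right arrow (U+2192)
utf8_bytes = text.encode("utf-8")

# If some stage mistakenly decodes those UTF-8 bytes as Latin-1, every
# multibyte character -- the arrow included -- turns into mojibake:
garbled = utf8_bytes.decode("latin-1")

# The damage is reversible only if you know exactly which mistake was made:
recovered = garbled.encode("latin-1").decode("utf-8")
assert recovered == text
```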


November 22, 2011

7 things GMail Search needs to change

My General Complaint:

If you've had a GMail account for many years, for work or personal use, it's getting large enough that GMail's search is starting to break.

Any word you can think of to type in will match tons of useless results.  Eventually, as you try to think of more words to add, your result count goes to zero.

If you were lucky enough to have starred the email when you saw it, or can remember who might have sent it, or maybe the approximate timeframe, or maybe you think you might have sent the email in question from this account, you *might* have a chance.

A Tough Problem:

I realize this seems like classic precision and recall troubles, but Google is pretty smart, and they have a fair amount of metadata, and a lot of context about me, so there are some potential fixes to hang a hat on.

And some of my ideas involve making use of labels/tags (GMail's equivalent of folders), but that assumes that people are using labels, which I suspect many folks don't, or at least not beyond the default ones you get.  Well... sure, but they DO have them, and there's an automated rules engine in GMail to set them, so presumably a few people use tags / labels?  (Or maybe nobody does and, in hindsight, maybe it's a legacy feature!?)  So, if you're going to have labels, and you've got even a few users who bother with them, then make them as useful as possible.  AND maybe make labels more visible, easier to set, more powerful, etc.

On To The Ideas:

1: Make it easier to refine search results.

Let's face it, as you accumulate more and more email, the odds of finding the email you want on the first screen of search results goes WAY down.

Google wisely uses most-recent-first sorting in search results, vs. their normal relevancy, in the GMail search UI.  I'm not sure why; this seems like an odd choice for them given all the bravado about Google's relevancy, but I'm guessing it was too weird to have email normally sorted by date in most parts of the UI, but have it switch back and forth between relevancy and date as you alternate between search and normal browsing.  Also, maybe they found it's more likely you're looking for a very recent email.  You could fold "freshness" into relevancy calculations, but just respecting date keeps it more consistent.
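If you did want to fold freshness in rather than hard-sorting by date, the usual trick is an exponential decay on document age. A sketch (the formula and the 30-day half-life are my illustration, not anything Google has published):

```python
def score(relevance, age_days, half_life_days=30.0):
    """Blend text relevance with recency: the score halves every half_life_days."""
    return relevance * 0.5 ** (age_days / half_life_days)

# With a 30-day half-life, a month-old email needs twice the text
# relevance of a brand-new one to rank equally.
print(score(1.0, 0))    # 1.0
print(score(1.0, 30))   # 0.5
```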

Yes, GMail does have some search options... I'll get to those, but suffice it to say they are very "non-iterative".

Other traditional filters should be facets as well: "Sent" emails, date ranges, "has attachments" (maybe even how many, what sizes, or what types).

2: Promote form-based "Search options" to FULL Facets

You can limit your search to a subset of your email if you've Labeled it - this is the GMail equivalent of Folders.  But doing this is a hassle (see item 3), and you can't do it after the fact, once you're looking at results.

So, if you do a normal text search, and then remember you labeled it, you can't just click on the tags on the left of the results.  Those are for browsing, and will actually clear out your search terms.  These should be clickable drilldown facets, perhaps even with match counts in parentheses, and maybe some styling to make it clear that they will affect the current search results.

Yes, there's a syntax you can use:

label:your-label regular search terms

It's a nice option for advanced users who are accurate touch typists and remember the tag name they want, but this should also be easy from the UI.  Yes, there is an advanced search / search options form, but this brings me to item 3...

(read the rest of the ideas after the break)


November 21, 2011

Google: Sometimes I really do want EXACT MATCHES

Disclaimer: Google only attracts my annoyances more because I use it so much.  And I'm confident they can do even better, and so I'm helping by writing this stuff down!

My Complaint:

Back in my day, when you typed something in quotes into a search engine, you'd get an exact match!  Well... OK, sometimes that meant "phrase search" or "turn off stemming"... but still, if it was only a ONE WORD query, and I took the time to put it in quotes anyway, then the engine knew I was being VERY specific.

But now that everyone's flying with jet-packs and hover boards, search engines have decided that they know more than I do, and so when I use quotes, they seem to ignore them!

I can't give the exact query I was using, but let's say it'd been "IS_OF".  Google tries to talk me out of it, doing a "Showing results for (something else)", but then I click on the "Actually do what I said" hyperlink.  And even then it still doesn't.  In this made-up example, it'd still match I.S.O.F. and even span sentence gaps, as in "Do you know what that *is*?  *Of* course I do!"

The Technical Challenge:

To be fair, there's technical problems with trying to match arbitrary exact patterns of characters in a scalable way.  Punctuation presents a challenge, with many options.  And most engines use tokenization, which implies word breaks, which normally wouldn't handle arbitrary substring matching.

At least with some engines, if you want to support both case insensitive and case sensitive matching, you have two different indexes, with the latter sometimes being called a "casedex".  Other engines allow you to generate multiple overlapping tokens within the index, so "A-B" can be stored as both separate A's and B's, and also as "AB", and also as the literal "A-B", so any form will match.
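The overlapping-token approach is easy to sketch (a toy illustration; real engines do this at index-build time and handle far more cases than a single separator):

```python
import re

def overlapping_tokens(term):
    """Index 'A-B' under every form a user might type: a, b, ab, and a-b."""
    term = term.lower()
    parts = [p for p in re.split(r"[^0-9a-z]+", term) if p]
    tokens = set(parts)          # the separate pieces: "a", "b"
    tokens.add("".join(parts))   # the fused form: "ab"
    tokens.add(term)             # the literal form: "a-b"
    return tokens

print(sorted(overlapping_tokens("A-B")))  # ['a', 'a-b', 'ab', 'b']
```

The cost, as noted below, is a bigger index: every multi-part term is stored several times over.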

Some would say I'm really looking for the Unix "grep" command, or the SQL "LIKE" operator.  And by the way, those tools are VERY inefficient because they use linear scanning instead of pre-indexing.  And if you tried to have a set of indexes to handle all permutations of case matching, punctuation, pattern matching, etc., you'd wind up with a giant index, maybe way larger than the source text.

But I do think Google has moved beyond at least some of these old limitations; they DO seem to find matches that go beyond simple token indices.

Could you store an efficient, scalable set of indices that hold enough information to accommodate both normal English words and complex near-regex-level literal matching, and still have reasonable performance and reasonable index sizes?  In other words, "could you have your cake and eat it too"?  Well... you'd think a multi-billion-dollar company full of Stanford smarties certainly could! ;-)  But then the cost would need to be justified... and outlier use-cases never survive that scrutiny.  As long as the underlying index supports finding celebrity names and lasagna recipes, and pairing them with appropriate ads, the 80% use cases are satisfied.

May 21, 2011

Google and the official search blog

A couple of days ago, Google started Inside Search, the 'official Google search blog'. It's not really enterprise search news, but because so many knowledge workers compare the behavior of their internal search platform with the Google public search experience, it may be worth monitoring for those whose job it is to keep enterprise search going.


February 02, 2011

Make your search engine seem psychic

People tell us that Google just seems to know what they want - it's almost psychic sometimes. If only every search engine could be like Google. Well, maybe it can.

Over the years, the functions performed by the actual 'search engine' have grown. At first, it was simply a search for an exact match - probably using punch card input. Then, over time, new and expanded capabilities were added, including stemming... synonyms... expanded query languages... weighting based on fields and metadata... and more. But no matter what the search technology provided, really demanding search consumers pushed the technology, often by wrapping extra processing around it both at index time and at query time. This let the most innovative search-driven organizations stay ahead of the competition. Two great examples today: LexisNexis and Factiva.

In fact, the magic that makes public Google search so good - and so much better than even the Google Search Appliance - is the armies of specialists analyzing query activity and adding specialized actions 'above' the search engine. 

One example of this many of us know well: enter a 12 digit number. If the format of the number matches the algorithm used by FedEx in creating tracking numbers, Google will offer to let you track that package directly from FedEx. For example, search for 796579057470 and you see a delivery record; change the last digit, and you get no hits. How do they know?

The folks at Google must have noticed lots of 12 digit numbers as queries; and being smart, they realized that many were FedEx tracking numbers. I imagine, working in conjunction with FedEx, Google implemented the algorithm - what makes a valid FedEx tracking number - and boosted that as a 'best bet'.
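You can reproduce the same trick in front of any engine. A minimal sketch in Python - the bare 12-digit regex is only a stand-in, since FedEx's real validation algorithm isn't public here, and the handler just returns a 'best bet' hint for the UI rather than hitting any real FedEx URL:

```python
import re

# Stand-in pattern only: the real FedEx check is more than "any 12 digits".
TRACKING_RE = re.compile(r"^\d{12}$")

def preprocess(query):
    """Runs in front of the search engine; returns a 'best bet' hint or None."""
    q = query.strip()
    if TRACKING_RE.match(q):
        return {"best_bet": "Track this package with FedEx", "number": q}
    return None  # fall through to the normal search index
```

The point is where the logic lives: ahead of the index, so the search engine itself never changes.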

Why is this important to you? Well, first it shows that Google.com is great in part because of the army of humans who review search activity, likely on a daily basis. Oh, sure, they have automated tools to help them out - with maybe 100 million queries every day, you'd need to automate too. They look for interesting trends and search behavior that lets them provide better answers.

Secondly, you can do the same sort of thing at your organization. Autonomy, Exalead, Microsoft, Lucene, and even the Google Search Appliance can all be improved with some custom code after the user query but before the results show up. Did the user type what looks like a name? Check the employee directory and suggest a phone number or an email address. Is the query a product name? Suggest the product page. You can make your search psychic.

Finally, does the query return no hits? You can tell what form the user was on when the search was submitted - so show something more useful than a generic 'No Hits' page. Was the query more than a single term? Look for any of the words, rather than all; make a guess at what the user wanted, based on the search form, previous searches, or whatever context you can find.
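That any-of-the-words fallback fits in a few lines over a toy inverted index (illustrative only; a real engine would do this in its query layer):

```python
def search(index, terms):
    """index maps term -> set of doc ids. Require all terms; on no hits, relax to any."""
    if not terms:
        return set()
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    if not hits and len(terms) > 1:
        hits = set.union(*(index.get(t, set()) for t in terms))
    return hits

index = {"expense": {1, 2}, "report": {2, 3}, "travel": {4}}
print(search(index, ["expense", "report"]))  # all terms matched
print(search(index, ["expense", "travel"]))  # no joint hits, fell back to any-of
```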

So how do you make your search engine seem psychic? Learn about query tuning and result list pre-processing; we've written a number of articles about query tuning in our newsletter alone.

But most importantly: mimic Google: work hard at it every day.