40 posts categorized "Web/Tech"

April 10, 2012

Autonomy 'King of the Cloud'

Years ago, my friend Jerry Gross in PR at HP related a funny story about how companies work with the press. He met with an editor of some electronics magazine to announce the new, improved memory chips that HP had created that actually provided 4K on a single chip! (This was a while ago!) 

Way back then this *was* news; but to Jerry and the reporter, it was yet another memory product to announce and the meeting was just specs and details. Jerry, a prankster at heart, decided to throw in a twist: he said that HP decided to use ROUND chips in this new product, rather than conventional rectangular ones. 

This piqued the reporter's interest: round chips? Yes, when you think about it, in rectangular chips, some of the bits are in a corner, so it takes longer for those bits to be accessed. By making the chip round, all bits were equidistant from the center so all could be accessed at the same speed!

The reporter was eating this up - something new and exciting! He wrote a quick paragraph for his publication before Jerry broke out laughing.  Luckily, their relationship was a good one and both had a great laugh about it. The round memory chip never made it to the world media.

Today I read an article in the London Business Weekly, reporting that Autonomy now has the "world’s largest private cloud", more than "50 petabytes of data including web content, video, email and multimedia data". Granted, Autonomy has a great service business in hosting and search-enabling all sorts of multimedia content. But ... I wonder if the reporter ever wondered out loud about some other rather large 'private clouds' - perhaps Google? Or Microsoft? Amazon? 

Maybe none of these robust competitors are as big as Autonomy; maybe HP really became the cloud giant by acquiring Autonomy last year. Or maybe a round memory chip made it past a reporter today. What do you think?

 

 

March 28, 2012

The importance of context in enterprise search

For years we have talked about the importance of context when it comes to enterprise search. We blogged about it as long ago as 2007, and we stressed that the context of the user, the content, and the query all need to be considered between the time the user clicks 'Search' and the time the search platform gets the extended query. As an example, we've used things like Google's special treatment of 12-digit numbers that match the algorithm for FedEx tracking numbers. 

Now it appears that Google has started plans to expand their use of context as published in the Wall Street Journal and called out in blog postings from Avalon's Joe Hilger and Mashable's Lance Ulanoff. Google's Amit Singhal spoke of the shift from keywords to meaning, a change not only at Google but, over time, in the enterprise search platforms most companies use internally every day.

As we talk about in a recent webinar 'Secrets your Search Vendor Won't Tell You', search platform vendors have always trailed user requirements; sometimes you just need to write your own custom code to create a search experience users are happy with. You often need to add your own pre-search processing code to analyze the user query and create an expanded query using the vendor-specific search operators; make the most of standard platform capabilities; and post-process the search result list in order to give your users a great, meaningful, helpful set of results and actions.
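To make that flow a little more concrete, here is a minimal Python sketch of the pre-search and post-search steps. Everything in it is an assumption for illustration: the function names, the `department` field, and the rule that any bare 12-digit number is treated as a possible tracking number are hypothetical, and `engine` stands in for whatever call your search platform actually provides. A real implementation would emit your vendor's own query operators.

```python
import re

# Assumption: any bare 12-digit number is treated as a possible tracking number.
TRACKING_NUMBER = re.compile(r"\d{12}")


def preprocess(raw_query, user):
    """Pre-search step: look at the query and the user before calling the engine."""
    query = raw_query.strip()
    if TRACKING_NUMBER.fullmatch(query):
        return {"intent": "package_tracking", "text": query, "boost": {}}
    # Hypothetical context: remember the user's department for later ranking.
    return {"intent": "search", "text": query,
            "boost": {"department": user.get("department")}}


def postprocess(results, expanded):
    """Post-search step: drop duplicate URLs, then favor the user's department."""
    seen = set()
    unique = []
    for doc in results:
        if doc["url"] not in seen:
            seen.add(doc["url"])
            unique.append(doc)
    dept = expanded["boost"].get("department")
    # sorted() is stable, so the engine's own order is kept within each group.
    return sorted(unique, key=lambda d: d.get("department") != dept)


def contextual_search(raw_query, user, engine):
    """Run the full flow; `engine` is any callable that takes query text."""
    expanded = preprocess(raw_query, user)
    if expanded["intent"] == "package_tracking":
        # Offer an action instead of a plain result list.
        return [{"url": "tracking:" + expanded["text"], "title": "Track package"}]
    return postprocess(engine(expanded["text"]), expanded)
```

The point of the sketch is only the shape: context goes in before the engine is called, and the result list is reshaped after it returns.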

At ESS New York in May, we're doing a pre-conference workshop that will take a deep dive into this process. We'll talk about how you can do this extended processing in several popular search platforms, and will include some representative examples of how you can implement this type of contextual enhancement for several popular search platforms. If you're going to be in New York anyway, come to the workshop!

/s/Miles

 

 

November 30, 2011

Odd Google Translate Encoding issue with Japanese

Was translating a comment in the Japanese SEN tokenization library.

It seems like if your text includes the Unicode right arrow character, Google somehow gets confused about the encoding.  Saw this on both Firefox and Safari.  Not a big deal, strangely comforting to see even the big guys trip up on character encodings.

OK: サセン
OK: チャセ
Not OK: サセンチャセ?


November 22, 2011

7 things GMail Search needs to change

My General Complaint:

If you've had a gmail account for many years, either for work or personal, it's getting large enough that GMail's search is starting to break.

Any word you can think of to type in will match tons of useless results.  Eventually, as you try to think of more words to add, your results count goes to zero.

If you were lucky enough to have starred the email when you saw it, or can remember who might have sent it, or maybe the approximate timeframe, or maybe you think you might have sent the email in question from this account, you *might* have a chance.

A Tough Problem:

I realize this seems like classic precision and recall troubles, but Google is pretty smart, and they have a fair amount of metadata, and a lot of context about me, so there's some potential fixes to hang a hat on.

And some of my ideas involve labels/tags (Gmail's equivalent of folders), but that assumes that people are using labels, which I suspect many folks don't, or at least not beyond the default ones you get.  Well... sure, but they DO have them, and there's an automated rules engine in Gmail to set them, so presumably a few people use tags / labels?  (or maybe nobody does and, in hindsight, maybe it's a legacy feature!?) So, if you're going to have labels, and you've got even a few users who bother with them, then make them as useful as possible.  AND maybe make Labels more visible, maybe easier to set, more powerful, etc.

On To The Ideas:

1: Make it easier to refine search results.

Let's face it, as you accumulate more and more email, the odds of finding the email you want on the first screen of search results go WAY down.

Google wisely uses most-recent-first sorting in search results, vs. their normal relevancy, in the GMail search UI.  I'm not sure why, this seems like an odd choice for them given all the bravado about Google's relevancy, but I'm guessing it was too weird to have email normally sorted by date in most parts of the UI, but have it switch back and forth between relevancy and date as you alternate between search and normal browsing.  Also, maybe they found it's more likely you're looking for a very recent email.  You could fold "freshness" into relevancy calculations, but just respecting date keeps it more consistent.

Yes, GMail does have some search options... I'll get to those, but suffice to say they are very "non iterative".

Other traditional filters should be facets as well.  "Sent" emails, date ranges, "has attachments" (maybe even how many, sizes, or types)

2: Promote form-based "Search options" to FULL Facets

You can limit your search to a subset of your email if you've Labeled it - this is the GMail equivalent of Folders.  But doing this is a hassle (see item 3), and you can't do this after the fact, once you're looking at results.

So, if you do a normal text search, and then remember you labeled it, you can't just click on the tags on the left of the results.  Those are for browsing, and will actually clear out your search terms.  These should be clickable drilldown facets, perhaps even with match counts in parentheses, and maybe some stylizing to make it clear that they will affect the current search results.

Yes, there's a syntax you can use:

label:your-label regular search terms

It's a nice option for advanced users who are accurate touch typists and remember the tag name they want, but this should also be easy from the UI.  Yes, there is an advanced search / search options form, but this brings me to item 3...
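For what it's worth, the drilldown facets I'm asking for in item 2 are not exotic. Here's a rough Python sketch of the idea, counting labels across the current hits and then narrowing by one of them without losing the search terms. The message layout (`text` and `labels` keys) and both function names are made up for illustration; this says nothing about how Gmail actually stores or searches mail.

```python
from collections import Counter


def search(messages, terms, label=None):
    """Return messages containing every term, optionally limited to one label."""
    wanted = [t.lower() for t in terms]
    hits = []
    for msg in messages:
        body = msg["text"].lower()
        if all(t in body for t in wanted) and (label is None or label in msg["labels"]):
            hits.append(msg)
    return hits


def label_facets(hits):
    """Count how many of the current hits carry each label, biggest first."""
    counts = Counter(label for msg in hits for label in msg["labels"])
    return counts.most_common()
```

Clicking a facet would just re-run `search` with the same terms plus the chosen `label`, which is exactly the iterative refinement the current UI doesn't offer.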

(read the rest of the ideas after the break)


November 21, 2011

Google: Sometimes I really do want EXACT MATCHES

Disclaimer: Google only attracts my annoyances more because I use it so much.  And I'm confident they can do even better, and so I'm helping by writing this stuff down!

My Complaint:

Back in my day, when you typed something in quotes into a search engine, you'd get an exact match!  Well... OK, sometimes that meant "phrase search" or "turn off stemming"... but still, if it was only a ONE WORD query, and I took the time to still put it in quotes, then the engine knew I was being VERY specific.

But now that everyone's flying with jet-packs and hover boards, search engines have decided that they know more than I do, and so when I use quotes, they seem to ignore them!

I can't give the exact query I was using, but let's say it'd been "IS_OF".  Google tries to talk me out of it, doing a "Show results for (something else)", but then I click on the "Actually do what I said" hyperlink.  And even then it still doesn't.  In this made-up example, it'd still match I.S.O.F. and even span sentence gaps, as in "Do you know what that *is*?  *Of* course I do!"

The Technical Challenge:

To be fair, there's technical problems with trying to match arbitrary exact patterns of characters in a scalable way.  Punctuation presents a challenge, with many options.  And most engines use tokenization, which implies word breaks, which normally wouldn't handle arbitrary substring matching.

At least with some engines, if you want to support both case insensitive and case sensitive matching, you have two different indexes, with the latter sometimes being called a "casedex".  Other engines allow you to generate multiple overlapping tokens within the index, so "A-B" can be stored as both separate A's and B's, and also as "AB", and also as the literal "A-B", so any form will match.
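Here's a small Python sketch of that overlapping-token idea. It's only an illustration under my own assumptions: the function name is made up, "punctuation" is taken to mean anything that isn't an ASCII letter or digit, and real engines have their own (much richer) tokenizer settings.

```python
import re


def token_variants(term, case_sensitive=False):
    """Return the overlapping forms an index could store for one raw term."""
    if not case_sensitive:
        term = term.lower()
    # Split on anything that is not an ASCII letter or digit.
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", term) if p]
    variants = {term}             # the literal form, e.g. "a-b"
    variants.update(parts)        # the separate pieces, e.g. "a" and "b"
    variants.add("".join(parts))  # the joined form, e.g. "ab"
    variants.discard("")          # happens when the term is all punctuation
    return variants
```

With the defaults, `token_variants("A-B")` gives the four lowercase forms `a`, `b`, `ab` and `a-b`. Calling it with `case_sensitive=True` is the sort of thing that would feed a separate "casedex".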

Some would say I'm really looking for the Unix "grep" command, or the SQL "LIKE" operator.  And by the way, those tools are VERY inefficient because they use linear searching, instead of pre-indexing.  And if you tried to have a set of indexes to handle all permutations of case matching, punctuation, pattern matching, etc, you'd wind up with a giant index, maybe way larger than the source text.

But I do think Google has moved beyond at least some of these old limitations, they DO seem to find matches that go beyond simple token indices.

Could you store an efficient, scalable set of indices that store enough information to accommodate both normal English words and complex near-regex level literal matching, and still have reasonable performance and reasonable index sizes?  In other words "could you have your cake and eat it too"?  Well... you'd think a multi-billion-dollar company full of Stanford smarties certainly could! ;-)  But then the cost would need to be justified... and outlier use-cases never survive that scrutiny.  As long as the underlying index supports finding celebrity names and lasagna recipes, and pairing them with appropriate ads, the 80% use cases are satisfied.

November 20, 2011

Dell.com Site Search acting weird today

FYI: The Dell Zino is (was?) a small form factor machine that could be used as a portable server, probably a bit larger than a Mac Mini, but still portable.

If you go to Dell.com and use their search and do a one word search for "zino" you get 8 results, but none of them are for that machine.  7 are for memory SIMMs, and the bottom result is for a different Dell machine, a small tower.  At first I was worried that perhaps they had discontinued the cute little guy.

I went to Google and one of the suggested searches was "dell zino discontinued", aw... I was afraid of that.. But wait! - The first page didn't actually say it was discontinued, it just had those words in a long discussion thread.  And the second result goes to the Zino page on Dell.com, and it is still listed, though I wasn't able to actually buy it.  When you click the "choose" link you're asked to choose a market segment but the list was empty.  Maybe it's not for sale and their site search knows not to display it???

August 01, 2011

Google Refine, Google's open source ETL tool for data cleansing, with videos!

For any of you working with Entity Extraction this might be of interest.  Google has open sourced some software from their Freebase acquisition, formerly called Gridworks.  It lets you interactively clean up and transform data.  More importantly, it saves these steps as a reusable sequence in JSON format, so they can be reapplied to other data.

Here's the main page and wiki (and 3 intro videos):

It IS Open Source, here's the source code and license:

That type of UI makes me want to dust off our XPump code and retrofit into it...

January 31, 2011

Great new tool for Pharmaceutical researchers

Our partners over at Raritan Technologies Inc. have recently released a great tool they developed using the Lexalytics, Inc. Salience toolkit. The product, Topic Explorer, provides a way for users to dig through content and explore concepts from Raritan's extensive knowledgebase of medical terminology, augmented by the text analytics capabilities provided by Lexalytics. Many of you will remember Lexalytics as the company that provided great sentiment analysis in the original FAST ESP product prior to the acquisition by Microsoft.

Raritan co-founder Ted Sullivan gives a great video demo of the product you should see.

What's really great about Topic Explorer is that it isn't limited to just pharma. With the right taxonomy, it can be a great research tool for just about any vertical - risk management, eDiscovery, patent research, and more.

Topic Explorer is a search technology neutral product, so it will work with your current solution whether you're using Lucene/Solr or a popular commercial technology. Contact Raritan at 908-668-8181 Extension 110. Tell them you read it here! 

September 03, 2010

Domain Name Registrar Search Tweak: Indicate that you already own It in Search Results

Many companies own lots of domain names, and manage them on one or two registrars.

When you do a search for a new domain, it'd be nice if they listed domains that you already own differently from the domains owned by others. It doesn't look like they do, at least not the ones I've played with. Granted, it's a little tricky for them sometimes, with different account associations or something.

It's not really the main domains you'd need help with, most people know their key domains by heart, but it's all those other domain suggestions they mix into the results. Their results include different suffixes or word variations. Some of these are only suggested if they're available, but they also show the top level domains with a clickable check box or red X.

So a search on a registrar can show 30 domains on a screen, some with red X's, even if they're taken by you.

If some already do, leave us a comment.
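The tweak itself is tiny. Here's a rough Python sketch of what I mean, with a third status next to the usual check box and red X. The function name, the three status strings, and the idea that the registrar already has a list of the domains on your account are all assumptions for illustration.

```python
def annotate(candidates, registered, my_domains):
    """Tag each suggested domain as 'yours', 'taken', or 'available'."""
    mine = {d.lower() for d in my_domains}
    # Anything on the account counts as taken even if the registry list missed it.
    taken = {d.lower() for d in registered} | mine
    rows = []
    for name in candidates:
        key = name.lower()  # domain names are case-insensitive
        if key in mine:
            status = "yours"
        elif key in taken:
            status = "taken"
        else:
            status = "available"
        rows.append((name, status))
    return rows
```

The tricky part is presumably not this loop but knowing what goes into `my_domains` when one company spreads its names across several accounts.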

July 22, 2010

Document filters webinar July 28 2010

ISYS is hosting a webinar on Wednesday, July 28 at 1PM Eastern to talk about the role document filters play in successful search indexing and display. You can register now.

Of course, as a search technology company, ISYS has enjoyed great success, particularly among law enforcement where search has to work right at a reasonable price. We've always liked their technology and their approach.

But like every search platform, ISYS needed filters to convert so-called 'binary' formats like Microsoft Office, PDF, or even Photoshop files into a stream of text - after all, today's search platforms primarily operate on words in textual format. But ISYS looked at the market at the time, and found that two of their competitors, Autonomy and Oracle, owned the best of the filter technologies.

Like any company, they made a 'make or buy' decision, and in their case, making their own filters was the right answer for them, and possibly for you. You see, ISYS decided to start selling their filter technology independent of their search platform, so now you can acquire some really great filtering and viewing technology for just about any search engine, 'off the shelf'. Their customers include other vendors with the need to extract text from various types of content, not just search vendors but also eDiscovery and eCompliance companies and many others who don’t want to pay excessive prices for technology - and who want really great filtering at a reasonable cost.

Then, a few years back, ISYS decided that open source platforms Lucene and Solr - which had no filters - needed them as well. So now you can buy a great filter pack 'off the shelf' with no huge volume commitment - no volume commitment at all! And you can get world class filtering for your open source search project.

Come hear ISYS, the guys from Lucid Imagination, and us here at New Idea Engineering talk about the critical role of filters in your search applications. See you then!

/s/Miles