18 posts categorized "Verity K2"

November 21, 2011

Google: Sometimes I really do want EXACT MATCHES

Disclaimer: Google only attracts my annoyances more because I use it so much.  And I'm confident they can do even better, and so I'm helping by writing this stuff down!

My Complaint:

Back in my day, when you typed something in quotes into a search engine, you'd get an exact match!  Well... OK, sometimes that meant "phrase search" or "turn off stemming"... but still, if it was only a ONE WORD query, and I took the time to still put it in quotes, then the engine knew I was being VERY specific.

But now that everyone's flying with jet-packs and hover boards, search engines have decided that they know more than I do, and so when I use quotes, they seem to ignore them!

I can't give the exact query I was using, but let's say it'd been "IS_OF".  Google tries to talk me out of it, doing a "Show results for (something else)", but then I click on the "Actually do what I said" hyperlink.  And even then it still doesn't.  In this made-up example, it'd still match I.S.O.F. and even span sentence gaps, as in "Do you know what that *is*?  *Of* course I do!"

The Technical Challenge:

To be fair, there are technical problems with trying to match arbitrary exact patterns of characters in a scalable way.  Punctuation presents a challenge, with many options.  And most engines use tokenization, which implies word breaks, which normally wouldn't handle arbitrary substring matching.

At least with some engines, if you want to support both case insensitive and case sensitive matching, you have two different indexes, with the latter sometimes being called a "casedex".  Other engines allow you to generate multiple overlapping tokens within the index, so "A-B" can be stored as both separate A's and B's, and also as "AB", and also as the literal "A-B", so any form will match.
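To make the overlapping-token idea concrete, here is a minimal sketch in Python. It is not how any particular engine does it; the function names (token_variants, build_variant_index) and the sample documents are made up for illustration, and a real engine would also track positions and case.

```python
import re
from collections import defaultdict


def token_variants(term):
    """Return the index-time variants for a term such as "A-B" (hypothetical helper)."""
    # Split on anything that is not a letter or digit, dropping empty pieces.
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", term) if p]
    variants = {term}                  # the literal form, e.g. "A-B"
    variants.update(parts)             # the separate pieces, e.g. "A" and "B"
    if parts:
        variants.add("".join(parts))   # the joined form, e.g. "AB"
    return variants


def build_variant_index(docs):
    """Map every variant of every whitespace-delimited term to the doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for raw in text.split():
            for variant in token_variants(raw):
                index[variant].add(doc_id)
    return index


# Made-up sample documents.
docs = {1: "part A-B ships", 2: "AB testing"}
index = build_variant_index(docs)
print(sorted(index["A-B"]), sorted(index["AB"]), sorted(index["A"]))
```

The price is index size: every punctuated term is stored several times, which is exactly the trade-off discussed here.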

Some would say I'm really looking for the Unix "grep" command, or the SQL "LIKE" operator.  And by the way, those tools are VERY inefficient because they use linear searching, instead of pre-indexing.  And if you tried to have a set of indexes to handle all permutations of case matching, punctuation, pattern matching, etc., you'd wind up with a giant index, maybe way larger than the source text.
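A rough Python sketch of the difference, under simplifying assumptions: the "index" here is just whitespace-delimited tokens, and the names linear_scan and build_token_index plus the sample documents are invented for illustration. The scan touches every document on every query but can match any substring; the index answers with a single lookup but only knows whole tokens.

```python
from collections import defaultdict


def linear_scan(docs, needle):
    """grep-style search: inspect every document on every query."""
    return {doc_id for doc_id, text in docs.items() if needle in text}


def build_token_index(docs):
    """Pre-index once: map each whitespace-delimited token to the doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[token].add(doc_id)
    return index


# Made-up sample documents.
docs = {
    1: "IS_OF flag set",
    2: "what that is? Of course",
    3: "I.S.O.F. report",
}
index = build_token_index(docs)

# Both approaches find the exact token.
print(linear_scan(docs, "IS_OF"), index.get("IS_OF", set()))

# Only the linear scan finds an arbitrary substring; the token index has no entry for it.
print(linear_scan(docs, "S_O"), index.get("S_O", set()))
```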

But I do think Google has moved beyond at least some of these old limitations; they DO seem to find matches that go beyond simple token indices.

Could you store an efficient, scalable set of indices that store enough information to accommodate both normal English words and complex near-regex level literal matching, and still have reasonable performance and reasonable index sizes?  In other words "could you have your cake and eat it too"?  Well... you'd think a multi-billion-dollar company full of Stanford smarties certainly could! ;-)  But then the cost would need to be justified... and outlier use-cases never survive that scrutiny.  As long as the underlying index supports finding celebrity names and lasagna recipes, and pairing them with appropriate ads, the 80% use cases are satisfied.

November 08, 2011

Are you spending too much on enterprise search?

If your organization uses enterprise search, or if you are in the market for a new search platform, you may want to attend our webinar next week "Are you spending too much for search?". The one hour session will address:

  • What do users expect?
  • Why not just use Google?
  • How much search do you need?
  • Is an RFI a waste of time?   

Date: Wednesday, November 16 2011

Time: 11AM Pacific Standard Time / 1900 UTC

Register today!

May 19, 2011

Content owners don't care about metadata

Or do they?

Our recent post about Booz & Company's 'men named Sarah' highlights just how important good metadata can be in order to provide a great search experience for employees and customers.

One of our customers who spoke at the recent ESS 2011 in New York provided some great insights into the problems organizations have getting employee content creators to include good metadata with their documents.

During the ESS talk, they reported that content owners don't really seem motivated when asked to help improve the overall intranet site by improving document metadata. However - and this is a big one - when a sub-site owner sees poor results on their own site, they are willing to invest the time to provide really good metadata.

[A bit of background: This customer provides a way for individual site owners within the organization to add search to their 'sub site' pretty much automatically - sort of a 'search as a service' within the enterprise.]

So if you've been thinking of adding the ability to search-enable sub-sites within your organization, but solving the relevance problem is your first task, you might reconsider your priorities!

/s/Miles

February 10, 2010

Acquisition Wednesday

As we hinted here last week, Autonomy has announced that it has acquired MicroLink, its 2007 Partner of the Year.

MicroLink, a major player for Autonomy and for Microsoft in the federal space, has been a reseller and implementation partner for both for years. As recently as last year, MicroLink started development of a very cool social search product that helped blur the lines between enterprise search and social search on the SharePoint platform, and had architected its application to sit on FAST as well as IDOL. There were even hints that they were eying a Lucene platform.

I would have loved to be able to hear the negotiations between Microsoft and Autonomy concerning access to internals of the FAST search engine currently being integrated tightly into SharePoint. The story we've heard is that the Microsoft negotiations contributed to the delay in the announcement, since apparently folks in both companies have had the news since at least Christmas.

It will be curious to see what happens now. We've always thought of MicroLink as a consulting firm, delivering implementation support. Mike Lynch, Autonomy's boss, has never had good things to say about consultants, and has certainly overseen the dwindling of the Verity consulting group he acquired a few years back. Either he's decided that independent consultants are bad but his own consultants are good, or he's hoping he can reduce his 'days outstanding' receivables by bringing the implementers in house. Let's hope it wasn't $55M spent just to gain access to the federal sales force.

/s/Miles

June 08, 2009

Enterprise Search Engine Optimization: eSEO

Last week at the Gilbane Conference in San Francisco, I participated in a panel "Search Survival Guide: Delivering Great Results" moderated by Hadley Reynolds of IDC. In the presentation, I offered a new view on improving enterprise search engine relevancy that I call eSEO.

The term SEO is well understood by - and widely practiced in - the corporate world.  The concept of SEO, as summarized by one of the Gilbane talks, states that "Key to the value of any Web content is the ability for people to find it". In the SEO world this is done by combining organic results and keyword placement - advertising - to improve placement, maintain ranking, and monitor search engine position - results - over time.

While we've been helping our customers improve their enterprise search results, it's hard to convince them that search results are not a problem they can solve once. I've decided to apply a new term to this process - Enterprise Search Engine Optimization, or eSEO. To paraphrase the role of SEO, eSEO is the process of combining organic results and best bets to deliver correct, relevant, timely content to enterprise search users - employees, customers, partners, investors, and others.

For both organic and best bets, the first step is to identify what we call the "top 100" queries. Start by creating a histogram that shows the top terms from your search engine. I hope you'll agree that if the top queries - whether 100, 50, or even 20 - deliver great results, you're on your way to having happy users. Talk to your content owners as you review the histogram, and ask them to identify the best result for each.
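If your search engine can export its query log, the histogram takes only a few lines. This Python sketch assumes the log is already a plain list of query strings; the function name top_queries and the sample log are made up, and the normalization (lowercasing, collapsing whitespace) is just one reasonable choice.

```python
from collections import Counter


def top_queries(queries, n=100):
    """Return the n most frequent queries, ignoring case and extra whitespace."""
    normalized = (" ".join(q.lower().split()) for q in queries)
    counts = Counter(q for q in normalized if q)
    return counts.most_common(n)


# Made-up sample log.
log = [
    "Benefits",
    "benefits ",
    "holiday schedule",
    "expense report",
    "benefits",
    "Expense  Report",
]

# Print a crude text histogram of the top queries.
for query, count in top_queries(log, n=3):
    print(f"{query:20} {'#' * count} ({count})")
```

Whether you merge variants like plurals or misspellings before counting is a judgment call; start simple and see what the raw histogram tells you.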

Once you have a list of queries and results, start the two-step process. First, tune the search engine using its native query tuning capabilities. This will impact the shape of the histogram, and over time should start delivering better results. The bad news is tuning like this doesn't position all of your top terms, and it would be silly to try to micro-manage the results for each. That's why search engines have best bets.

When you feel pretty good about the curve through query tuning, it's time to start setting up best bets - the "ad words" of eSEO. Limit the number of best bets to one or two at most - but remember that you can use other real-estate like the rightmost column of the screen to suggest additional content. Some guidelines for best bets:

  • Use one or at most two best bets
  • Don't repeat a document already at the top of the organic results
  • Make sure your best bets respect security
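Those three guidelines can be expressed as a small filter. This is only a sketch in Python: the record layout (a "url" plus a set of "allowed_groups"), the function name pick_best_bets, and the sample URLs and group names are all assumptions for illustration, and your engine's security model will certainly look different.

```python
def pick_best_bets(candidates, organic_results, user_groups, limit=2, top_n=1):
    """Choose up to `limit` best bets that the user may see and that
    do not repeat the top `top_n` organic results."""
    already_on_top = set(organic_results[:top_n])
    chosen = []
    for bet in candidates:
        if bet["url"] in already_on_top:
            continue  # don't repeat a document already at the top
        if not (bet["allowed_groups"] & user_groups):
            continue  # respect security: user shares no group with the document
        chosen.append(bet["url"])
        if len(chosen) == limit:
            break  # one or at most two best bets
    return chosen


# Made-up candidates for the query "benefits".
candidates = [
    {"url": "/hr/benefits", "allowed_groups": {"employees"}},
    {"url": "/hr/exec-comp", "allowed_groups": {"executives"}},
    {"url": "/hr/enroll", "allowed_groups": {"employees"}},
    {"url": "/hr/faq", "allowed_groups": {"employees"}},
]
organic = ["/hr/benefits", "/hr/other"]

print(pick_best_bets(candidates, organic, user_groups={"employees"}))
```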

Once you have tuned your search engine, and set up best bets for the most timely and actionable result, you're ready to roll it out. But then the ongoing part comes in: you need to review your search activity and best bets periodically. Usually, we'd suggest once a month for a while, then perhaps quarterly thereafter. You may find seasonal variations, and if you're not watching you'll miss a golden opportunity.

In Summary

1. eSEO is just as critical as SEO

  • Lost time and revenue
  • Legal exposure

2. Watch for trends over time: Search is not "fire and forget"

3. Make sure SEO doesn't impact your eSEO

  • Use fielded data that web search engines ignore for your tuning (e.g., 'Abstract' rather than 'Description').

This will get you started; but because your queries and your content change over time, it's a never-ending story. Some companies - ours included - have tools that can help. But no matter what, hang in there!

/s/Miles


June 06, 2009

Impressions of first Lucene/Solr SF Meetup

Kudos to Carl, our NIE Marketeer and de facto social director, for getting us to attend, well worth it, and conveniently coinciding with Gilbane.

The Good:

  • VERY entertaining, very informative.  Lots of good info about upcoming versions of Lucene and Solr, including additional performance tweaks.
  • A friendly, supportive bunch of like-minded nerds, and I mean this in the best possible way.
  • Also discussions of other related Apache projects.  We're all gonna need a cheat sheet pretty soon to keep track of it all.
  • Lucene/Solr will soon have implemented many of the core features of Autonomy IDOL, Endeca, FAST, etc.  They really ought to be spying.  :-)

Personally I think Otis & co. might wanna fly out for the next one.  I also think Dieselpoint ought to attend and talk about Open Pipeline.  If we get up enough energy maybe we could even volunteer to do that next time, we're on the board after all, but this is really Chris's baby.

The Not-so-Good:

  • About 50 terms that clients would not understand.  Don't get me wrong, we love the Map/Reduce, Bayesian, K-Means, SVD stuff, but most corporate clients would be lost.
  • Not much for Enterprise Packaging.  Ironically it's the mundane aspects of search, from a non-developer standpoint, that are still not on the horizon.  Not a criticism of the developers, they have what they need.
  • Not much about Nutch.  Nutch 1.0 is out, along with rumors of a revised admin GUI, but not much coverage here.

Impressions of Lucid Imagination:

This event was sponsored by Lucid, a company that recently got funding for bringing commercial packaging and services to the open source search world, and their senior staff includes quite a few of the core committers.

  • A very sincere bunch of guys.
  • They haven't sold their souls to corporate America, I think their "geek cred" is still well intact.
  • Probably will not be filling in enterprise packaging pot holes any time soon.
  • Do they understand the Enterprise Market?

Also a shout out to LinkedIn and IBM for giving back to the open source community.

There was also an "open mic" segment, and I'd like to give a shout out to Avi Rappoport - I agree 1,000%, "stop words bad!" (or at least the blind use of index time stop words)


Surprises:

  • Not much of a threat to Google Appliance, due to packaging.  Yes, Google scales with their Map/Reduce and relevancy algorithms, and the open source guys have responded, but that's not the stuff that makes Google tick these days.
  • And despite the impressive and rapidly evolving core technologies, also not a real threat to the other Tier One vendors like FAST and Autonomy.  More on this seeming contradiction in a bit.
  • The Tier 2 vendors of the world, Attivio, Exalead, Dieselpoint, etc. DO need to pay attention.  There is a place for Tier 2 vendors, but they need to mind what the open source products do and do not provide more carefully.
  • It's really cool to see IBM willing to contribute so aggressively to the open source search engines, even though they sell several of their own.  A naive person might think they are competing with themselves, sabotaging their own sales guys, but they're a lot smarter than that.  They aren't selling their commercial search products as pure search; those technologies are always part of a larger (and more expensive) grand business solution.  They know what they're doing!

For similar reasons, still not a huge threat to Autonomy, MS/FAST, Endeca, etc. on corporate services.  I said earlier that the Apache projects are implementing a lot of the "secret sauce" that launched Autonomy and Endeca, etc., so you'd think this represents "a clear and present danger", but Mike Lynch's secret algorithms are not why people buy IDOL anymore.  Things like giant reference accounts, professional services, and commercial grade spiders have a lot more to do with why big companies still pay six figures for search technology.

And speaking of surprises and Lucid Imagination, I wanna circle back to their PR a few months back when they got their funding and launched their company.  They talked about relevancy in their press releases!?  Wow... Yes, Lucene and Solr have some good traction there, but that specific competitive advantage has been used by almost every commercial search vendor in the past 15 years, including Verity, Autonomy and Google!

I would've expected them to say something like "we're gonna do for Lucene what RedHat did for Linux" - this would have been a very clear business-oriented proposition, though to be fair lots of companies have used that business model as well.  It wouldn't be original, but would be more business centric.  Then again, I'm not in Marketing, and their VC's obviously liked their pitch, so what do I know!

/s/Mark

March 12, 2009

Search Relevancy and Japanese text, CJK, interesting thread on SearchDev.org

A really nice discussion over on SearchDev.org about relevancy when searching Japanese text and other CJK languages.  Touches on a lot of technical issues including tokenization, thesaurus, character set normalization, etc.

Folks chiming in about how a number of different search engines handle this including Autonomy IDOL, K2, Ultraseek and MarkLogic.

The actual thread:
http://tech.groups.yahoo.com/group/search_dev/messages/718?threaded=1&m=e&var=1&tidx=1

A tad hard to read with all the quoted text, but well worth a full skim, keep scrolling!

March 02, 2009

Enterprise Search Resources

Search Resources

There's a great deal of activity going on in the enterprise search market - groups and resources popping up everywhere. We thought we'd provide a list of the ones we know and respect best; feel free to add your own suggestions as comments and we'll post them in a follow up.

User Forums

SearchDev.org: The independent search developer's forum. A forum on the business and technology of search.

SearchDev also has two technical forums for detailed vendor-specific questions dealing with everything from coding and scripting to problem resolution, with more in the works:

autonomy.searchdev.org

fast.searchdev.org

LinkedIn Groups

Enterprise Search Engine Professionals Group: A fast-growing LinkedIn group for people working in or involved with enterprise search in corporate environments worldwide. Search for it under the Groups menu.

Enterprise Search Summit Group: A new group run by Michelle Manafy at Information Today which will provide industry news and information as well as details and podcasts about upcoming ESS events.

Newsletters

Enterprise Search Newsletter: Produced by New Idea Engineering, this newsletter covers both business and technical issues of search, generally at a more detailed technical level. It covers all vendors, provides advice for improving your search, and includes Ask Dr Search who answers technical questions from subscribers.

Blogs

Enterprise Search Blog: A blog produced by New Idea Engineering that covers all topics around the business and technology of enterprise search including opinion, news, events and more.

The Noisy Channel: This insightful blog, run by Daniel Tunkelang, CTO of Endeca, has a perspective on technology of enterprise search from someone who knows search from the ground up.

Beyond Search: Run by search guru Steve Arnold, Beyond Search contains news, interviews, and opinion on the search market delivered

SearchTools:  Avi Rappoport runs this blog which summarizes new content from her website http://searchtools.com/ which covers almost every search technology known to mankind!

SLI Systems Blog: Hosted search service SLI Systems provides a newsletter that talks about the kinds of problems they see in working with their customers. http://www.sli-systems.com/newsletter.php

FAST Forward Blog: A blog run by FAST Search staffed by FAST, Microsoft, and independent bloggers who write about search and IT issues at http://www.fastforwardblog.com/.

Attivio: The search vendor has a useful blog that has good general information as well as Attivio-specific material.

Mark Logic Blog: Written by CEO Dave Kellogg, who shares interesting information about technology. A fun read, and always informative.

Vivisimo Blog: Vivisimo runs the 'Search Done Right' blog that provides great background information on enterprise search. Like Attivio's blog, this has great background information that anyone can benefit from reading.

Flax Blog: From Lemur Consulting in the UK, the creators of the Flax open source search technology. You'll find more than just Flax here, though, with good coverage of issues relevant to enterprise search in general. 

Gilbane Search Practice Blog: Written by Lynda Moulton, this is a good background blog for enterprise search as well. Gilbane holds two interesting content management conferences a year that include a search track that can be worthwhile.

Two other blogs I find most interesting are not directly related to enterprise search, but I find good value when I follow them:

Andrew McAfee, a Professor at Harvard Business School, writes about IT issues, and he always has interesting material.

John Battelle, author of 'The Search...', has an interesting blog as well, and it's always fun to follow what he's doing.

Trade Shows

Enterprise Search Summit New York: Every May, Information Today sponsors the premier show for enterprise search in New York City. If you only go to one show a year, this is the one to go to. That's also the advice we give to new vendors entering the marketplace. We'll be back again this year, speaking about how you can save money by making your existing search engine work rather than replacing it. By the way, you can listen to a preview of our talk, as well as talks by other speakers including Matt Brown of Forrester and Sid Probstein of Attivio.

Search Engine Meeting: Search Engine Meeting is an interesting show run by Infonortics from the UK. In its 14th year, this year's show returns to Boston on April 27-28; see you there!

January 22, 2009

Autonomy proposes to acquire Interwoven

This morning Autonomy announced that they will acquire California-based Interwoven for $16.20 a share, a nice premium over last night's close at $11.84. The deal, expected to close Q2-2009, is subject to the vote of shareholders of both companies. This comes after Interwoven announced that they expected their most recent quarter to pretty much meet their previous guidance.

It sounds like a great move. Autonomy continues its evolution from a search leader into a compliance leader. Interwoven has been on a buying spree itself over the last several years, and offers solutions in many different areas including Digital Asset Management, eDiscovery/Legal, records management and even enterprise search.

Since FAST's acquisition by Microsoft a year ago, we've wondered if Autonomy would be an interesting target for acquisition by someone like Oracle or even by Google; but it looks like Mike Lynch would rather grow the old-fashioned way - by top line growth and by acquiring companies whose products support and extend their business.


June 18, 2008

Search Quality: You Can't Improve What You Don't Measure

In our latest survey of new newsletter subscribers we found that 29% had no formal metrics for measuring quality of search results.  Search metrics allow you to keep search on the right track and can be a powerful tool for managing your systems.  They are a wonderful source for insights and trends.  We thought we would share a couple that we think work well. Many of these are covered in greater depth in Interpreting Your Search Activity Reports in the Enterprise Search newsletter.

  • Count the number of people who use search  
  • Count the total number of searches  
  • Count the number of zero search results  
  • User feedback on top 100 searches  
  • Track email complaints about search  
  • Measure number of clicks on navigators (navigation menu items)  
  • Business Goals  
    • Reduce call volume (normalized for growth in customer base) by enabling self-service from search: results are good enough to reduce calls.
    • Reduce e-mail volume (again adjusted for growth in customer base) by enabling self-service from search: results are good enough to reduce e-mails. 
    • Revenue       
    • Add-on revenue
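Several of the counts above fall straight out of the search log. Here is a minimal Python sketch, assuming each log entry records who searched, what they typed, and how many results came back; the SearchEvent layout, the function names, and the sample numbers are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SearchEvent:
    user: str
    query: str
    result_count: int


def search_metrics(events):
    """Basic counts: people who use search, total searches, zero-result searches."""
    total = len(events)
    zero = sum(1 for e in events if e.result_count == 0)
    return {
        "unique_users": len({e.user for e in events}),
        "total_searches": total,
        "zero_result_searches": zero,
        "zero_result_rate": zero / total if total else 0.0,
    }


def calls_per_customer(call_volume, customer_count):
    """Normalize support call volume for growth in the customer base."""
    return call_volume / customer_count


# Made-up sample data.
events = [
    SearchEvent("ann", "benefits", 12),
    SearchEvent("bob", "401k rollover", 0),
    SearchEvent("ann", "holiday schedule", 3),
    SearchEvent("cat", "expense report", 7),
]
print(search_metrics(events))

# Raw calls went up, but calls per customer went down.
print(calls_per_customer(1200, 4000), calls_per_customer(1300, 5000))
```

Comparing the normalized figure period over period is what tells you whether self-service is actually working, rather than just whether the customer base grew.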