11 posts from May 2010

May 27, 2010

Lucene and Solr Development Merged

The development of the Apache Lucene and Solr projects has merged. This has no impact on the packaging - there will still be separate Lucene and Solr jar files. However, it should result in tighter coordination between the two projects, less duplication of effort, and Solr users getting the latest Lucene improvements faster. Mark Miller and Shalin Mangar discuss this in more detail in their blogs.

We're glad to see this. Mark Bennett, our own in-house Solr guy, thinks that having a unified Java developer list will get questions answered more quickly and consistently. Many developers work on both projects.

May 26, 2010

Aardvark's interesting blend of Search and Social Networking

Vark.com (Aardvark) doesn't use search to try to answer your questions directly; instead, it uses search to route your question to humans who might be able to answer it. So this is two levels removed from the classic search engine usage model:

1: It's not trying to answer your question directly with search, it's just trying to find a person who might be able to answer it.

2: It assumes a high initial raw failure rate (for example, a high percentage of people are probably busy doing something else), so it builds in retry routing logic. It even allows humans to help with routing.

There are a lot of hidden details in that retry logic, and a lot of it leverages social networking software, looking at user profiles, friends of friends, previous successes and failures, etc. Many of those steps also mix in keyword search and related algorithms. On the surface it might seem like a "simple" hybrid once you see it spelled out, but they've done a lot of nice work on the details, which I suppose are proprietary.

A bit of etiquette - there's an assumption that you've already tried to find the answer on your own, perhaps with a Google, Yahoo or Bing search. You shouldn't be wasting other humans' time with questions that machines could have easily answered. Vark could qualify as a "research" engine, not just a search engine.

What Vark's network of gray-matter resources is best reserved for is the "why" or "how" or "what's the difference between..." type questions. These are the high-level, "wisdom"-type questions that keyword and NLP search engines still struggle with, and Vark has come up with a nice compromise.

This type of "expert locator" system has certainly been tried before, especially in the Enterprise Search market. Those older systems had the end goal of "fixing you up" with an expert via email. Vark's managing of the actual questions and answers is nice, and I imagine this will be the norm in enterprise offerings at some point, barring any intellectual property issues. Heck, I think Vark could offer their own enterprise version.

I'd be curious to see Vark actually give me the option of searching over the previously asked set of public questions and answers. Maybe if an immediate answer isn't forthcoming from the folks Vark has asked, it could come back and at least offer to run the search as a "plan B". If it's clearly labeled and optional, and asking another human remains the primary objective, I think folks might like it.

They're also building a valuable database of questions and answers to analyze and learn from. Q&A search engines have a particular problem with Vocabulary Mismatch: the specific words people use to ask questions are different from the words used by the people answering them. Some of this is linguistic in nature, and other times experts just use fancier or more precise terms than the novices asking the questions. I imagine they could mine their corpus and derive some useful relations. Even better, when Vark lets a user re-ask a question, they have a chance to collect multiple answers to the EXACT same question. And a multilingual Vark could do this for multiple languages. Presumably this is all in their business plan, plus stuff I can't fathom. Cool!

I hope Vark can continue to attract smart people!

May 25, 2010

Google and TV: "prevening" on the Big Bang Theory

When a popular TV show mentions an odd word, there's a tendency for people to look it up online and/or blog about it.

Our staff likes "The Big Bang Theory". One of the characters mentions the term "prevening", referring to the time between mid-afternoon and the early evening.

When I first heard the term:

  • Mon 5/24/2010, 10:49pm PDT
  • Google shows 7,000 hits
  • including an Urban Dictionary entry from 2008

When watching a rerun of a different episode this evening, I remembered this post and went back to check:

  • Mon 7/19/10 11:04pm PDT
  • Google shows 96,000 hits
  • more than a 10x increase, pretty cool

Some years later, while reviewing old blog posts, I checked again:

  • Thu 5/23/2013 17:13 PDT
  • Google shows 95,800 hits
  • Within 0.2% of the reading from 3 years ago, so I'll call it quiescent
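
For anyone who wants to check my arithmetic, here is a quick sketch in Python. The hit counts are the three readings above; the helper names `growth_factor` and `percent_change` are just made up for this illustration.

```python
# Hit counts copied from the three readings above.
readings = [
    ("2010-05-24", 7_000),
    ("2010-07-19", 96_000),
    ("2013-05-23", 95_800),
]


def growth_factor(old, new):
    """How many times larger the new count is than the old one."""
    return new / old


def percent_change(old, new):
    """Signed change from old to new, as a percentage of old."""
    return (new - old) / old * 100


first, second, third = (hits for _, hits in readings)

print(f"May to July 2010: {growth_factor(first, second):.1f}x")   # 13.7x
print(f"2010 to 2013: {percent_change(second, third):+.2f}%")     # -0.21%
```

So the "more than 10x" jump was closer to 14x, and the 2013 number is about a fifth of a percent below the 2010 one.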

I had another, more colorful entry from the Jay Leno show, but apparently it was too off-color for our blog. :-(

May 24, 2010

Microsoft's Site Search Survey

I went to Microsoft.com today to find some info on virtualization. A window popped up asking if I wanted to take a survey about their search. I'm guessing this is old news to some, but I thought it was nice. The survey was polite and short.

The one thing the survey didn't have was a general comments box. As it so happened, I was having a bit of a problem with their search on this particular day, nothing horrible, but I was gonna give a few details in the comments box. But the survey ended without one; I think they should add an optional "Comments/Search Details" box on the final submit-survey page.

For the record I think Microsoft's site search is decent, and to be fair they do have a LOT of products and a lot of versions to sort through.

Does MaxxCAT's New Search Appliance Challenge the Google Box?

Both Jessica Bratcher and Tim Grey have interesting posts about MaxxCAT releasing several enterprise search appliances that are supposedly much faster, cheaper, and more extensible than the corresponding Google search appliance, with unlimited lifetime use.

They were created from the ground up and run on a special Linux platform. "On a 1 million document collection, the kernel can dispatch and resolve a multi-term query spanning the entire collection in as little as 100 usec." (Of course, anything under 500 msec would be fine for an end user.)

MaxxCAT has also released a new version of its JDBC connector (BobCAT) that supports standard SQL and allows any JDBC-compliant database to interface directly to a MaxxCAT appliance. The company claims "EX-5000 Enterprise Search appliances equipped with BobCAT are able to retrieve and index information from host systems at speeds in excess of 1GB/minute."
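
To put the vendor's numbers in more familiar units, here is a back-of-the-envelope sketch in Python. It only converts the quoted figures; it assumes a decimal gigabyte (10^9 bytes), and the function names are invented for this example rather than taken from any MaxxCAT documentation.

```python
# Figures quoted above: 100 usec per query, a 500 msec end-user budget,
# and an indexing rate of 1 GB/minute (assumed here to be a decimal GB).
QUERY_LATENCY_S = 100e-6
USER_BUDGET_S = 500e-3
INDEX_RATE_BYTES_PER_MIN = 1e9


def headroom(latency_s, budget_s):
    """How many times the claimed latency fits into the user's budget."""
    return budget_s / latency_s


def mb_per_second(bytes_per_minute):
    """Convert a bytes-per-minute rate to decimal megabytes per second."""
    return bytes_per_minute / 60 / 1e6


print(f"Latency headroom: {headroom(QUERY_LATENCY_S, USER_BUDGET_S):,.0f}x")
print(f"Indexing rate: {mb_per_second(INDEX_RATE_BYTES_PER_MIN):.1f} MB/s")
```

In other words, the claimed query time is roughly 5,000 times faster than what an end user would notice, and 1GB/minute works out to a bit under 17 MB/s.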

Their chief integration engineer stated, "We are working with a number of customers who have data in SQLServer, mySQL or Oracle Databases that we are able to easily consolidate and query against, even though the source databases and data models vary dramatically. This is simply not possible with conventional database software, which relies upon proprietary interfaces and does not handle unstructured data very well, if at all."

May 19, 2010

Lucid should be the "Red Hat of *search*", not just of Solr/Lucene

Hear me out!

Obviously their core technical team is pivotal to that specific engine, and they'd be fools to waste even a drop of that momentum. But they're disproportionately attractive to folks who'd prefer to write their own code.

But those nerds' BOSSES have distinctly different problems. Virtually all companies above a certain size have multiple engines, and lots of other related data and IR problems.

Lucid's mix of open and commercial code, plus their team's skillset, could rapidly create unique tools to solve those problems, but not with their current uni-engine focus.

We like the Lucid team a lot, but they can and should accept a broader mission, one that attracts the pointy-haired bosses too!

May 17, 2010

Some flights have Fresh Baked Cookies, but we offer Fresh Baked Code!

We're all loving the in-flight Wifi these days. One of our folks is wrapping up a search project on his return flight and hit a bug. I've sent him a patch that I think should fix it. Ironically, he was VPN'd back into our office working on it.

I'm sure we're not the first folks to do this, but it's still cool. We've had a smart-ass saying around here: "Code - baked fresh daily!" But I think we'll have to update that now. American Airlines serves freshly baked cookies in First Class near the end of the flight, but now we can best them, and from coach!

Searching an Encrypted Cloud

Luke O'Connor has an interesting post about encrypted search. It discusses the “fully homomorphic” encryption system devised by IBM researcher Craig Gentry that was in the news last year, and the outlook for encrypted search. It was strictly a theoretical breakthrough: Gentry estimated that performing a Google search with encrypted keywords would increase the amount of computing time by a factor of about a trillion.

In "A Step Toward Better Cloud Security: Searchable Encryption", Abel Abram discusses a paper from the Microsoft Research Cryptography Group proposing a virtual private storage service as a solution.

Ray Lucchesi's post about securing the cloud also discusses several approaches towards searching encrypted data. So far there don't appear to be any viable approaches; all of them take tens of seconds for a single-word search.

May 11, 2010

Plink Acquired - Should Improve Google Goggles

Google Goggles is a visual search tool for smart phones. It's one of several recent search enhancements for Google Mobile mentioned in this post by Greg Sterling. A recent post of his discusses Google's acquisition of Plink (a UK-based startup that developed a visual search engine for Android) four months after the company’s public launch, and Google's plan to use the team to improve Google Goggles.

PlinkArt won the ‘people’s choice’ award in the IQPrize last year and then $100,000 in Google’s second Android Developer Challenge.

May 10, 2010

Google's Opt-Out Option for Behavioral Targeting

Last month Google announced that they would provide a browser plug-in to allow users to opt out of Google Analytics tracking. Joseph Stanhope's post explained why it was highly doubtful that this would do substantial harm to Google Analytics and its customers. Several posts, such as one by Felipe Miyata, suggested that this was an “insurance” move to silence opposition from privacy supporters, perhaps in preparation for doing more web analytics within the U.S. Federal Government.

Anil Batra's post has a quite different explanation - he suggests that it is really an attempt by Google to make more money by taking another step towards behavioral targeting.