
2 posts from February 2012

February 26, 2012

How many gigabytes of memory on your printer?

I read an article originally tweeted by @nickpatience, newly of search firm Recommind. In the FT article, HP's Mike Lynch talks about plans to introduce printers with embedded Autonomy IDOL.

At first, I had to chuckle. We've seen big systems brought to their knees indexing content with IDOL, and I imagined steam coming out of my HP laser printer as I print a long contract. (Maybe it was smoke... you know, printers need smoke to make them work. No, really. Ever seen a printer work after smoke came out of it?)

Then I realized that hundreds of companies bundle copies of IDOL with their products, and most implementations are quite successful with a relatively small footprint. And honestly, in another recent engagement, IDOL did provide the best 'out of the box' relevance. This is probably because of the way IDOL breaks documents into smaller units for indexing, and then reassembles them in the result list for human consumption.

But hang on for a minute. A printer with a search engine? I know IDOL is well known in eDiscovery applications, and I've also heard of cases where one team of lawyers will subpoena the disk drives from the opposing client's printers. Correct me if I'm wrong, but if I'm printing a document, isn't there a good chance it exists on file servers that are already indexed with IDOL (or one of its competitors)? I'd think there is an audit trail back to the original document... no?

And what is the interface, do you suppose? Federated results from an index within the printer? Traffic from the printer back to IDOL central servers to index the document as it passes through the network? I can imagine a way to reconstruct the document from the IDOL index, but that seems a bit extreme.

Anyway - it may just be that I'm too old-fashioned to understand this sort of thing. It feels to me like a technology - pardon me - in search of a market. I'd just as soon keep IDOL on my servers where I can understand what it's up to - and where it does a pretty darned good job!

What do you think?

 

February 21, 2012

10 changes Wikipedia needs to become more Human and Search Engine Friendly

There's a really nice set of examples comparing JSON to other similar formats like YAML, Python, PHP, PLists, etc.  It was in a Wikipedia article, but you won't see it now unless you know to go looking through the version history (link in previous sentence).

This content had existed for quite a while in that article, and many people had contributed to it.  One day in March 2011 one editor decided it was irrelevant and gutted that entire section.  The information was useful; I was actually looking for it today!  I happened to think of reviewing the version history since I was almost sure that's where it had been.

The editors at Wikipedia need to be able to delete content, for any number of reasons, and I'm sure it's a thankless job.  And there are procedures for handling disputed edits - I've pinged that editor who deleted it about maybe finding a new home for the content.  Also, ironically, I found an out of date copy of the page here that still has it.

I'm not in favor of a ton of rules, but I believe wholesale deletes of long-existing and thriving content should get some special attention.  To be clear, I'm not talking about content that's clearly wrong or profane or whatever, or gibberish that was just added.  How about as a first pass "a paragraph or more that has existed for more than 6 months (long-lived) and has been contributed to (not just corrected) by at least 3 people (thriving)".
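The first-pass rule above is simple enough to write down as a check. What follows is only a rough sketch: the `Paragraph` record and the `is_long_lived_and_thriving` function are hypothetical names invented for illustration, not anything Wikipedia actually exposes, and "6 months" is approximated as 183 days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Paragraph:
    """Hypothetical history record for one paragraph (an assumption, not a Wikipedia API)."""
    first_seen: datetime  # when the paragraph first appeared in the article
    contributors: set     # editors who added content, as opposed to just correcting it


def is_long_lived_and_thriving(paragraph, now,
                               min_age=timedelta(days=183),  # assumed stand-in for "6 months"
                               min_contributors=3):
    """Return True if deleting this paragraph should get special attention."""
    long_lived = (now - paragraph.first_seen) > min_age
    thriving = len(paragraph.contributors) >= min_contributors
    return long_lived and thriving
```

The hard part in practice would be deciding who "contributed" versus who merely "corrected", which this sketch assumes has already been worked out.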

Human-Centric Policy & Tech Changes:

  • If the content's only "crime" is that it's somewhat off-topic then the person doing the deleting ought to find another home for it.  The editor could either move it to another page, or possibly even split up the current page; maybe they could "fork" the page into 2 pages, then cross link them, and then remove duplicate content, so then page 1 retains the original article and links to page 2, and then page 2 has the possibly-off-topic content. Yes, this would take more effort for the "deleting" editor, BUT what about the large amount of effort the multiple contributors put into writing it, and into going the extra step to try and conform to Wikipedia's policies so that it would NOT get deleted?  I also suspect that senior editors, those more likely to consider wholesale deletes, are probably much more efficient at splitting up a page or moving content somewhere else - novice contributors might be unaware, or only vaguely aware, that such things are even possible.
  • Wikipedia should make it easier for contributors to find content of theirs that's been deleted.  This is a somewhat manual process now.  Obviously they don't want to promote "edit wars".
  • Wikipedia should generally track large deletes (maybe they do?)
  • Wikipedia should "speed up" its diff viewer.  It should run faster so you can zip through a bunch of changes, and maybe even include "diff thumbnails".  The UI makes sense to developers used to source code control systems, but is probably confusing to most others.  I realize this is all easier said than done!
  • Wikipedia should include some visual indication of "page flux".  It would be helpful, for example, if a young person could see at a glance that abortion and gun control are highly debated subjects among adults.
  • Wikipedia should be a bit more visually proactive in educating visitors that there are other versions of a page.  I'm sure Wikipedia veterans would say "more visible!? - there's a bold link in a tab at the top of every page!"  While that's true, it just doesn't convey it to casual visitors.  On the other hand, it shouldn't go too far and be annoying about it - like car GPS systems that make you agree to the legal disclaimer page every time you get in the car!
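Tracking large deletes, as suggested in the list above, could start from nothing more than a diff between two revisions. Here is a minimal sketch using Python's standard `difflib`. The function names, the choice to split on blank lines, and the 500-character threshold are all assumptions made for illustration; this is not how Wikipedia computes its own diffs.

```python
import difflib


def deleted_paragraphs(old_text, new_text):
    """Return paragraphs present in the old revision that were removed outright."""
    old = [p for p in old_text.split("\n\n") if p.strip()]
    new = [p for p in new_text.split("\n\n") if p.strip()]
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    removed = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only pure deletions count; "replace" opcodes are treated as edits.
        if tag == "delete":
            removed.extend(old[i1:i2])
    return removed


def is_large_delete(old_text, new_text, min_chars=500):
    """Flag a revision that removed at least min_chars of whole paragraphs (assumed threshold)."""
    return sum(len(p) for p in deleted_paragraphs(old_text, new_text)) >= min_chars
```

A revision flagged this way could then be listed somewhere for review, or shown to the original contributors.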

Search Engine Related Changes:

  • Wikipedia search (I use the Google option) should have an option to expand the scope of search to include deleted content.  This shouldn't be the default, and there are presentation issues to be considered.  Some deletes are in the middle of sentences, and there are multiple deletes and edits, etc, so I realize it's not quite as easy as it may sound.
  • There needs to be a better way to convey this additional data to public search engines, as well as representing it in their own built-in engine.
  • Wikipedia should consider rendering full content and all changes inline, using some type of CSS / HTML5 dynamic mode that marks suspect or deprecated content with tags, instead of removing it.  Perhaps the search engines could also request this special version of the page and assign relevance accordingly.
  • Perhaps Wikipedia could offer some alternative domain name for this somewhat messier version of the data, something like "garaga.en.wikipedia.org".
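To make the "mark it instead of removing it" idea concrete, here is a small sketch that compares two revisions word by word and wraps removed text in an HTML `<del>` element rather than dropping it. The function name and the `suspect` class name are hypothetical, and a real implementation would have to cope with wiki markup, not just plain text.

```python
import difflib
import html


def render_with_flags(old_text, new_text):
    """Render the new revision, keeping removed words inline inside <del> tags."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            # Text that the edit removed: keep it, but mark it as suspect.
            removed = html.escape(" ".join(old_words[i1:i2]))
            parts.append(f'<del class="suspect">{removed}</del>')
        if tag in ("insert", "replace", "equal"):
            # Text that is present in the new revision.
            parts.append(html.escape(" ".join(new_words[j1:j2])))
    return " ".join(parts)
```

A stylesheet could hide `del.suspect` by default and reveal it on request, and a crawler asking for the messier version would see everything with the flags intact.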

It's Not Just A or B:

  • Whenever I hear people lament the declining content contributions on Wikipedia I have to chuckle.  It's incredibly demoralizing for people to see content deleted that they took the time to contribute.  If a new contributor on Wikipedia discovers 1 or 2 of their first edits promptly deleted, trust me, they're very unlikely to try again.  I know a number of people that have just given up.
  • Others would say that if you put more pressure on editors to not delete, then the overall quality of Wikipedia will go down, and raving nut-jobs and spammers will engulf the site.
  • The compromise is to flag content (which is very similar to tracking diffs) and give users and search engines some choice in whether they want to see "suspect" content or not.

This is about survival and participation.  When newer contributors have their content "flagged" vs. "deleted", with more explanations and recourse, they will still learn to uphold Wikipedia's quality standards without being too discouraged.  They'll hang around longer and practice more.

An analogy: Wikipedia's current policy is like potty training your kid with a stun gun - make one mistake and ZAP! - or don't bother and just keep going in your diaper like you've always done.

I understand and appreciate all the work that Wikipedia's volunteers do, but I think there are some constructive things that could be done better.