
10 posts from January 2008

January 31, 2008

Adding a Search Box / Search Form to a TypePad Blog

(mostly raw notes, not fully tested, proofed)

It's nice to have a search box on your TypePad-based blog.  These notes were assembled from the TypePad help posts plus some of our own observations.

Step 1: TypePad FAQ: Adding Custom Content and HTML in Your Sidebar (AKA: a sidebar Notes TypeList)

Step 1.a: Creating a new TypeList

Step 1.b: Adding an item to a TypeList, and possibly giving it a blank name (to avoid having too many labels around the search form)

Step 2: TypePad FAQ: Adding Google Search to your Weblog (via a "Notes TypeList")

Step 2.a: add in the HTML

Step 2.b: CHANGE DOMAIN name in THREE places

Step 2.c: maybe change radio buttons to single hidden field just pointing to your domain

Step 2.d: maybe comment out div tag with white background and big google logo

Step 2.e: maybe change text of button

Step 2.f: maybe add target="_new" to <form> tag

(other tweaks to HTML  form)

Step 2.g: Go ahead and accept the offer "The first item has been added to your new list. Would you like to publish this list on a weblog or your About Page?" (Publish)

Step 3: Display the custom TypeList in your blog

Step 3.a: do it... see above.  FAQ is out of date, it's in the right column under "TypeLists", and may already be checked / enabled by default.

Step 3.b: May need to Adjust the Ordering (placement) of the Content

Step 3.c: May need to adjust the width (near the bottom of Step 2's link)

Step 3.d: may need to adjust names of forms, typelists, etc.
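
A rough sketch of what Step 2 boils down to, written as a small Python helper that prints the form HTML so the domain only has to be typed once. This is an untested assumption on our part, not the exact HTML from the TypePad FAQ: the function name build_search_form is made up, the `q` and `sitesearch` field names are what Google's classic site-search forms used, and `target="_blank"` is the standard keyword for what the notes call target="_new".

```python
from html import escape


def build_search_form(domain, button_text="Search", new_window=False):
    """Return HTML for a Google search form restricted to one domain.

    Hypothetical helper: the field names below are assumptions and may
    differ from the form in the TypePad FAQ.
    """
    # Step 2.f: optionally open the results in a new window.
    target = ' target="_blank"' if new_window else ""
    safe_domain = escape(domain, quote=True)
    safe_button = escape(button_text, quote=True)
    return (
        f'<form method="get" action="http://www.google.com/search"{target}>\n'
        f'  <input type="text" name="q" size="20" maxlength="255" />\n'
        # Step 2.c: a single hidden field instead of radio buttons.
        f'  <input type="hidden" name="sitesearch" value="{safe_domain}" />\n'
        # Step 2.e: custom button text.
        f'  <input type="submit" value="{safe_button}" />\n'
        f"</form>"
    )


print(build_search_form("example.typepad.com", button_text="Search this blog"))
```

Paste the printed HTML into the Notes TypeList item; example.typepad.com is a placeholder for your own blog's domain.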

(held as a draft pending actual testing)

Notes: Sample data for indexing, searching, analysis

Updated Feb 26, 08, added ref. from Chris C.
http://www.datawrangling.com/some-datasets-available-on-the-web.html

Wikipedia database download
Many formats and versions:
http://en.wikipedia.org/wiki/Wikipedia:Database_download
Example: From January 10, 2008
http://download.wikimedia.org/enwiki/20080103/
file pages-articles.xml.bz2 = 3.2 GB (claimed to be the version most folks want)
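
A file that size is best read as a stream rather than loaded whole. Below is a minimal sketch, assuming the dump is a bz2-compressed MediaWiki XML export in which each <page> element has a <title> and a revision <text>; iter_pages and local_name are our own made-up names, and we have not run this against the full 3.2 GB file.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(dump_path):
    """Yield (title, text) for each <page> in a bz2-compressed XML dump."""
    with bz2.open(dump_path, "rb") as stream:
        # "end" events fire once an element and all its children are parsed.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            title, text = "", ""
            for child in elem.iter():
                name = local_name(child.tag)
                if name == "title":
                    title = child.text or ""
                elif name == "text":
                    text = child.text or ""
            yield title, text
            # Drop the finished page's children so article text is not kept
            # in memory (the emptied <page> elements themselves remain).
            elem.clear()
```

Each (title, text) pair can then be handed to whatever indexer is being tested, for example `for title, text in iter_pages("pages-articles.xml.bz2"): ...`.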

Enron public emails:
Released in 2003
http://www.cs.cmu.edu/~enron/
UC Berkeley
http://bailando.sims.berkeley.edu/enron_email.html
Emails searchable with Lucene
http://orange.sims.berkeley.edu/~atf/enron/enron.cgi
"Enronic"
http://jheer.org/enron/v2/

Reuters Corpus Volume 1
http://trec.nist.gov/data/reuters/reuters.html

TREC and TREC Data
http://trec.nist.gov/
http://trec.nist.gov/data.html
Terabyte data (GOV2 collection): 426 GB, 25 million docs; $800 US, or is it 600 UK pounds? http://es.csiro.au/TRECWeb/gov2-summary.htm
http://ir.dcs.gla.ac.uk/test_collections/
http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html
http://www.mccurley.org/trec/
http://trec.nist.gov/data/terabyte/04.guidelines.html
http://es.csiro.au/TRECWeb/

CIA World Fact Book FactBook
Updated Feb-4-2007 Revised link here:
https://www.cia.gov/library/publications/the-world-factbook/index.html
Updated Feb-4-2007 In various Database and XML formats: (thanks Clint)
http://www.dbis.informatik.uni-goettingen.de/Mondial/
http://www.cia.gov/cia/publications/factbook/
http://www.cia.gov/search?NS-collection=World%20Factbook
Public domain, questions: (703) 482-0623
Easy download at: http://www.odci.gov/cia/download.html

Database: (though may have copyright restrictions, and not in generic format)
MS SQL: Northwind, pubs, AdventureWorks

Oracle: books database?

TODO: MySQL, PostgreSQL

Ideas for Electronic parts manufacturers: (not much luck so far)
TI, National Semi, AMD, Motorola, Cypress. Connectors: AMP and Molex

Sample data sets:
http://lib.stat.cmu.edu/datasets/ Boston housing prices (old)
http://lib.stat.cmu.edu/datasets/boston_corrected.txt
cars
http://lib.stat.cmu.edu/datasets/cars.data (a bit cryptic)
http://lib.stat.cmu.edu/datasets/cars.desc

colleges
http://lib.stat.cmu.edu/datasets/colleges/

Maybe ACM data
http://www.acm.org/

TODO: DMOZ category data, need to find link.

TODO: Bruce suggested US Patent Data, need to find link.

SOCNET data: Approx 10 M blog entries (via Sean)
Agreement http://www.blogpulse.com/www2006-workshop/datashare-agreement.pdf
Date: Fri, 16 Dec 2005 10:32:44 -0800
Sender: Social Networks Discussion Forum <[email protected]>
From: Eytan Adar
Subject: [SOCNET] Announcing: data availability for the 3rd Annual Workshop on the Weblogging Ecosystem
We are happy to announce the public availability of a substantial collection of blog data for research purposes. The data is being made available by Intelliseek/BlogPulse in conjunction with the 3rd Annual Workshop on the Weblogging Ecosystem. A DVD containing full text from nearly 1 million blogs can be requested by filling out the form at the workshop homepage: http://www.blogpulse.com/www2006-workshop/

January 24, 2008

Google Search Appliance (GSA) User Interface "Glitch"

Google's web-based administration application is nice and clean, what we've all come to expect from Google.  It reminds us of the easy-to-use Ultraseek web UI.

But one detail that might confuse new admins: many successful actions redisplay the exact same screen.

For example:

  • Edit the properties for a collection, for example by adding another URL.
  • Click the Save Collection Definition at the bottom of the screen.
  • ... and poof! ... you're still looking at the exact same screen.

This might make you wonder if you actually submitted the form.  "Did I actually click the button?" - "Let me try again..."

If you had sharp eyes you'd notice that the browser DID do a quick screen update, and the little activity animation in the upper right corner did flash for a second.

So what happened?  Well... it worked... it did exactly what you asked, it saved the changes; and in case you might want to make additional changes, it redisplayed the same config screen.  Since it worked fine (this is a Google product after all!), there was no error, so no reason to give any sort of alert (in their opinion).  On some other screens, such as the Create Collection form, you'll notice a slight change in the screen, when your newly created collection is listed in the table of collections.

I've seen this style of UI before, where success equals redisplay without error.  We even debated this back at SearchButton.

Since this is unlike what Windows applications do, which is where a majority of today's computer users cut their teeth, I would argue that it is "non standard" behavior, and tends to be confusing.  I freely admit that logically it makes perfect sense, I get the design philosophy, it's just that it's not what many folks expect.

A simple compromise I'd like to see Google take, which other similar UIs have adopted, is to at least put a confirmation message on the redisplayed screen.  Something like a little green banner saying "Your changes have been saved."  Of course if you then re-edit and re-save, the next screen would have the same green banner, and therefore still look the same.  At least this would give you some hint that the system is listening, and that you needn't worry.  Or I guess a timestamp could be included in the confirmation message.
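
To make the timestamp idea concrete, here is a tiny sketch in Python. It is purely illustrative and has nothing to do with how the GSA admin pages are actually built; save_banner is a made-up name.

```python
from datetime import datetime


def save_banner(saved_at):
    """Build banner text that changes on every save because of the time."""
    return f"Your changes have been saved at {saved_at:%H:%M:%S}."


print(save_banner(datetime.now()))
```

Two saves a few seconds apart would show two different banners, so the redisplayed screen no longer looks identical.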

Heck, even Google's wildly popular GMail application uses these types of banners.

Not a big deal.  Admins will get used to it and learn to "Trust the Google", but it's a small change that might help the newbies.


I'll be giving some screen shots to Dr. Search for his next article.

January 23, 2008

The Sad “Turn of the Century Craftsmanship” in Enterprise Search Software

Some of the biggest players in search still don’t have their act together when it comes to configuring and administering their expensive software offerings. Here in 2008 we’re still seeing complex and absurdly fragile software reminiscent of the mid to late 1990s (the "turn" of the 21st century).

It’s bad enough to require command line scripting for open source tools, but we’re seeing some commercial vendors embrace and expand this to “URL command line” tools. In one case, instead of editing a lengthy command in your favorite terminal program, the vendor has you manually edit encoded parameters in the URL of your browser to submit new documents! We also continue to see 10 or 20 page INI files from multiple vendors, where incrementing numerical strings are concatenated to form arrays.
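
To show why the numbered-key style is so fragile, here is a small Python sketch. The section name and the StartUrl0, StartUrl1, ... keys are invented for illustration and are not taken from any particular vendor's INI file; read_numbered_array is our own helper.

```python
import configparser
import re

# Hypothetical config: note that someone deleted StartUrl2 by hand.
SAMPLE_INI = """
[spider]
StartUrl0 = http://intranet.example.com/
StartUrl1 = http://wiki.example.com/
StartUrl3 = http://docs.example.com/
"""


def read_numbered_array(section, prefix):
    """Collect prefix0, prefix1, ... and report any gaps in the numbering."""
    # configparser lowercases option names, so match case-insensitively.
    pattern = re.compile(re.escape(prefix) + r"(\d+)$", re.IGNORECASE)
    found = {}
    for key, value in section.items():
        match = pattern.match(key)
        if match:
            found[int(match.group(1))] = value
    values = [found[i] for i in sorted(found)]
    missing = [i for i in range(max(found) + 1) if i not in found] if found else []
    return values, missing


parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SAMPLE_INI)
urls, gaps = read_numbered_array(parser["spider"], "StartUrl")
print(urls)
print("missing indexes:", gaps)
```

A reader that simply counts up from 0 and stops at the first missing key would silently drop everything after the gap, which is exactly the kind of surprise these files produce.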

Spider debugging is a particular sore point.  Would you like to know what happened to a particular URL that was supposed to have been indexed?  For many vendors the answer is still to run a “find” command and “zcat” your log files through “grep” (all Unix command line tools). It is shocking to see how outdated and finicky some of these very expensive products are.
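
For anyone stuck doing this by hand, the find / zcat / grep routine can be sketched in a few lines of Python. This assumes the spider writes gzip-compressed, plain-text logs somewhere under one directory; the function name and the log layout are our assumptions, not any vendor's documented format.

```python
import gzip
from pathlib import Path


def find_url_in_logs(log_dir, url):
    """Yield (file name, line number, line) for each log line mentioning url."""
    # rglob walks subdirectories, like "find"; gzip.open plays the "zcat" role.
    for path in sorted(Path(log_dir).rglob("*.gz")):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as log:
            for line_number, line in enumerate(log, start=1):
                # A plain substring test stands in for "grep".
                if url in line:
                    yield path.name, line_number, line.rstrip("\n")
```

It works, but it is still a workaround: the admin UI should be able to answer "what happened to this URL?" directly.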

And yes, every accused vendor still claims to be "working on it", with admin UI updates in the development pipeline.

January 22, 2008

Is hosted / managed search behind the Microsoft FAST acquisition?

What does the Microsoft acquisition of FAST mean for the industry?

We've been sifting through the information available around the web and from our contacts in the enterprise search space, and we are beginning to see some signals come through the noise.

First, this is still an early story in the consolidation that will continue to take place over the next few years. Microsoft will operate FAST as a 'wholly owned subsidiary', which may well be a first for Microsoft - I'm not sure.  In addition, FAST seems to be heads-down on their release schedule, with some cool stuff rumored to be on the way.

In the Microsoft-FAST conference call (which will be available at that link through June 9th, 2008 - scroll to the Teleconference link), Microsoft's Jeff Raikes had little to say about how the integration would go forward, but I thought he dropped an interesting hint while answering a question from a JP Morgan analyst. He was talking about how the two technologies might fit when he said:

"Obviously, we feel one of our great strengths is that we'll bring to customers the power of on-premise software with software services; that combination can bring  customers greater capability plus we can give customers the power of choice  in terms of deployment models. So without going in greater detail which I wouldn't be able to do today ... I can just simply say that part of what we will look at ... will be to marry the strengths that we have with our software plus services with what FAST is doing in Enterprise Search."

Now, taken by itself it all sounds pretty generic. But what caught my ear was his emphasis on the word "with" above - and on the parts about software as a service. Could it be Microsoft wants FAST for the hosted/managed enterprise search solution that FAST can offer its customers? No enterprise data center; no need for in-house expertise; no pesky updates to install; no load-balancing to manage. And FAST can offer this data center service either fully hosted or remotely by connecting into the enterprise and providing only search management services.

Could Jeff be admitting that Microsoft wants to look towards more enterprise services using a hosted model - say something like his friends in Mountain View offer? Only time will tell!

It sure makes the FASTForward'08 user conference a must-see event.

January 15, 2008

Google Public Search: still not the freshest...

You do it, you know you do!  Moving aside the gallons of milk at the front to find one at the back of the shelf with a longer expiration date.

I find myself doing that with Google's public search quite a bit.  Sure we all use Google's public search... but they STILL haven't sorted out the "dates thing".  We first debated this with one of the Google founders back in 2000.  Yes, it's hard to do dates "perfectly", I get that, but there's certainly room for improvement, at least tracking when a page was *first* *seen*, so you can tell it's at least N years old.

Context: I did these searches on Tues Jan 15, 2008.  After the Iowa caucuses and NH primaries, with California to vote in a few weeks.  Macworld is today, and the start of awards season in Hollywood.

Check out this lame-ness: (one example courtesy of Miles)

Look for: steve jobs keynote time
Top result: Live from WWDC 2006: Steve Jobs keynote - Engadget
He'll be speaking later today, 1/15/08, but this is from 2 years ago.

Look for: california propositions
Top result: decent, 2007, 2008
Second result is from 2005:
http://www.smartvoter.org/2005/11/08/ca/state/prop/

Look for: election results
First result: Virginia State Board of Elections : View Election Results
But I'm in California... no disrespect to the fine folks in Va, but their local elections are probably not what the average surfer is looking for.
Second result: CNN.com Election 2004
3 to 4 years old, amazing.

Look for: java for palm os
First site is pretty good, pointing to the IBM WebSphere site.
But the second result is from 2002!  (http://www.javaworld.com/javaworld/jw-05-2002/jw-0531-palm.html)

Search for: new england patriots score
First result OK, but second is from December 2007, and of course they have been playing in the post season since then.

To be fair, Google does get some other items spot on:

Look for: CES
Good, all from 2008

Look for: new season of lost
Good, most are recent

Look for: Iraq
Good, Wikipedia, CIA World Factbook, etc.

Look for: golden globe
Good, mostly points to main web sites.

I will say it again:

Yes, it is difficult when parsing random web text and HTTP headers to know with 100% certainty when the content was authored, for various technical reasons.  References to dates might be in regards to discussions of past events, etc.

BUT you can certainly figure out the first time your spider saw that content.  You might not know whether it was authored in February or June of 2007, but when it's 2009 you'll know it's at least 2 years out of date.  This isn't quite as easy as it sounds, as the text on web pages changes slightly, so raw "checksums" won't cut it.  But I'm sure some smart guys from Stanford could figure *something* out.
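
Here is one way the "first seen" idea could be sketched in Python. To be clear, this is our own illustration and not how Google or any other engine does it: instead of a raw checksum it compares overlapping three-word "shingles" and treats a page as the same content when most of them still match. The class name, the 0.8 threshold and the shingle size are all arbitrary assumptions.

```python
from datetime import date


def shingles(text, size=3):
    """Return the set of overlapping word n-grams in text."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Similarity of two sets: shared items divided by all items."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class FirstSeenIndex:
    """Remember the earliest crawl date for each URL's roughly-same content."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.records = {}  # url -> (first_seen date, shingles of that version)

    def observe(self, url, text, crawl_date):
        """Record a crawl and return the date this content was first seen."""
        current = shingles(text)
        previous = self.records.get(url)
        if previous and jaccard(previous[1], current) >= self.threshold:
            # Minor edits only: keep the original first-seen date.
            return previous[0]
        # New URL or substantially rewritten page: start the clock again.
        self.records[url] = (crawl_date, current)
        return crawl_date


index = FirstSeenIndex()
page = " ".join(f"word{i}" for i in range(40))
index.observe("http://example.com/story", page, date(2007, 2, 1))
print(index.observe("http://example.com/story", page + " updated", date(2009, 1, 15)))
```

The second crawl differs by one appended word, so the page keeps its 2007 first-seen date, while an exact checksum would have treated it as brand new.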

And when 4 digit years are part of the URL, with other numbers that look  like dates, that would often be another good hint.
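
A quick Python sketch of the URL hint, using two of the URLs from the examples above. The patterns are our own guesses at what "looks like a date" and only cover years 1990 through 2029; treat the result as a hint, not proof of when the page was written.

```python
import re
from datetime import date

# /2005/11/08/ style: a year followed by month and day path segments.
FULL_DATE = re.compile(r"/((?:19|20)\d{2})/(\d{1,2})/(\d{1,2})(?=/|$)")
# A standalone four-digit year from 1990 to 2029, not part of a longer number.
YEAR_ONLY = re.compile(r"(?<!\d)((?:199|20[0-2])\d)(?!\d)")


def date_hint_from_url(url):
    """Return (year, month, day), with None for unknown parts, or None."""
    match = FULL_DATE.search(url)
    if match:
        year, month, day = (int(group) for group in match.groups())
        try:
            date(year, month, day)  # reject impossible dates like 2005/13/40
            return year, month, day
        except ValueError:
            pass
    match = YEAR_ONLY.search(url)
    if match:
        return int(match.group(1)), None, None
    return None


print(date_hint_from_url("http://www.smartvoter.org/2005/11/08/ca/state/prop/"))
print(date_hint_from_url("http://www.javaworld.com/javaworld/jw-05-2002/jw-0531-palm.html"))
```

The first URL yields a full date and the second only a year, which is still enough to know the page is not about this week's keynote.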

Web content now goes back more than ten years.  All engines need to keep this in mind.

January 12, 2008

Microsoft-FAST: A kinder, gentler takeover?

The big news this week in enterprise search was the Microsoft acquisition of FAST Search - so at the risk of providing a MicrosoFast overdose, I'll toss out one more observation.

When Autonomy acquired competitor and larger search industry leader Verity in early November 2005, they announced that the conversion "was complete" a week later. While a number of clients might maintain it is still not complete, it was clear right from the start that K2 was being phased out, and that customers would be expected to convert to IDOL - either through the interim step of using the K2 emulation API or by converting directly to IDOL. 

Unlike Autonomy's acquisition, Microsoft has not announced plans to 'convert' customers over to its technology. In fact, it seems that FAST will operate as a wholly owned subsidiary of Microsoft for at least a year, and possibly beyond. This should give current and potential FAST customers some comfort that a major conversion - or replacement - is not planned in the near future.

When faced with a major conversion, recent history has shown that companies will take the step of evaluating all options - and not all customers will stay with their current vendor. Google and FAST both won some major deals after the Verity acquisition, so it remains to be seen what impact this deal will have. Stay tuned.

MicrosoFast and the open source conundrum

Another question coming from the merger of Microsoft and FAST Search: what's to become of all the open source technology embedded inside of FAST ESP?

I believe FAST uses MySQL, Tomcat, and certainly Python heavily, and it will be no small matter to 'replace' these items even over months or years. And there may be even deeper integration under the covers. Now, Microsoft has acquired companies based on Unix/Linux technologies before - Hotmail comes to mind - and they seem to have pretty much left that 'as is'; maybe they will do the same with FAST.


January 10, 2008

Updated 2008 Enterprise Search Vendor Roundup

Jan. 10, 2008 - San Jose, CA, USA 

Microsoft announced they were acquiring FAST Search on January 8, forcing New Idea Engineering to amend our January 4th article "2008 Enterprise Search Vendors: The New Fab 4 ... and 1/2" (http://www.ideaeng.com/pub/entsrch/2008/number_01/article01.html). The announcement validates our original assessment and reinforces that search is mission critical for corporations, driving Microsoft to invest in a better search technology.

Some Highlights from NIE's 2008 Enterprise Search Vendor Roundup
 
Autonomy IDOL and FAST Search continue to hold the high end. K2 and Ultraseek are finally retiring.
Google's new version 5 appliance has arrived in the enterprise search mainstream.
Endeca is moving from the ecommerce side and had one of the most impressive search demos at ESS West 2007.
Lucene/ Nutch/ Solr (LNS) open source search engines continue to gain customer mindshare.
Microsoft, with its acquisition, moves in as Tier 1.
IBM and Oracle still not there.
 
Autonomy IDOL and FAST Search continue to hold the high end, evolving into "search platforms" that go beyond traditional drop in applications. The two leaders from earlier this decade, K2 and Ultraseek, are fading.

Google's new version 5 appliance has arrived in the enterprise search mainstream. While the new version won't satisfy every requirement, it addresses many of the earlier integration issues that had held it back. Expect to see the Google logo on a lot more enterprise portals.

Endeca has created some slick administration tools, doing very well in a head-to-head comparison with Autonomy and FAST despite their continued progress in this area.  As the importance of administration continues to increase, we are more enthusiastic about them in the Enterprise space.

Open source tools based on Lucene, including Nutch and Solr (LNS) are increasingly considered by companies, especially in niches that need to micromanage document relevancy and rating. Lucene and its derivatives are increasingly embedded in other software packages and services, to the point that many users won't even realize they're using it.

We had expected IBM to be the next entrant into the "Tier 1" lineup, based on their iPhrase acquisition. To our surprise, when we saw IBM at ESS East 2007, they were featuring one of their older engines, the OmniFind Enterprise Edition. IBM OmniFind is still not one of our new Fab 4 and 1/2.

Dieselpoint, Intellisearch, Recommind, ISYS, ZyLAB, Vivisimo, Siderean and Exalead have strong presences in niche markets.
 
To read the full article ... 2008 Enterprise Search Vendors: The New Fab 4 ... and 1/2. http://www.ideaeng.com/pub/entsrch/2008/number_01/article01.html

January 08, 2008

Microsoft is acquiring FAST Search for $1.2 billion

Obviously if this goes through, and we expect it will, it moves Microsoft into the Tier 1 position and creates an interesting situation for Autonomy, and to a lesser extent the Google search appliance.

Links to ZDNet, Yahoo!, Reuters -- many others online.

This continues the trend of search engine acquisitions.  Could Endeca or Autonomy be next?  If so, would Oracle be the logical choice?  Google, IBM and now Microsoft already have well established enterprise packages.  Although Oracle has its own private brand of search, so did Microsoft, but that didn't stop them.