20 posts categorized "Attivio"

June 28, 2017

Poor data quality gives search a bad rap

If you’re involved in managing the enterprise search instance at your company, there’s a good chance that you’ve heard at least some users complain about the poor results they see.

The common lament search teams hear is “Why didn’t we use Google?” In fact, we’ve seen the same complaints at sites that implemented the GSA but don’t use the Google logo and look.

We're often asked to come in and recommend a solution. Sometimes the problem is simply using the wrong search platform: not every platform handles every use case and requirement equally well. Occasionally, the problem is a poorly configured or misconfigured search, or simply an instance that hasn’t been managed properly. Even the renowned Google public search engine doesn’t happen by itself, but even that is a poor example: in recent years, the Google search has become less of a search platform and more of a big data analytics engine.

Over the years, we’ve been helping clients select, implement, and manage Intranet search. In my opinion, the problem with search is elsewhere: Poor data quality. 

Enterprise data isn’t created with search in mind. There is little incentive for content authors to attach quality metadata in the properties fields of Adobe PDF Maker, Microsoft Office, and other document publishing tools. To make matters worse, there may be several versions of a given document as it goes through creation, editing, reviews, and updates. And often the early drafts, as well as the final version, are in the same directory or file share. Very rarely does public-facing web site content have such issues.

Sometimes content management systems make it easy to implement what is really ‘search engine optimization’ or SEO; but it seems all too often that the optimization is left to the enterprise search platform to work out.

We have an updated two-part series on data quality and search, starting here. We hope you find it helpful; let us know if you have any questions!

June 22, 2017

First Impressions on the new Forrester Wave

The new Forrester Wave™: Cognitive Search And Knowledge Discovery Solutions is out, and once again I think Forrester, along with Gartner and others, misses the mark on the real enterprise search market.

In the belief that sharing my quick first impression will at least start a conversation going until I can write up a more complete analysis, I am going to share these first thoughts.

First, I am not wild about the new buzzterms 'cognitive search' and 'insight engines'. Yes, enterprise search can be intelligent, but it's not cognitive, which Webster defines as "of, relating to, or involving conscious mental activities (such as thinking, understanding, learning, and remembering)". HAL 9000 was cognitive software; "Did you mean" and "You might also like" are not cognition. And enterprise search has always provided insights into content, so why the new 'insight engines'?

Moving on, I agree with Forrester that Attivio, Coveo and Sinequa are among the leaders. Honestly, I wish Coveo was fully multi-platform, but they do have an outstanding cloud offering that in my mind addresses much of the issue.

However, unlike Forrester, I believe Lucidworks Fusion belongs right up there with the leaders. Fusion starts with a strong open source Solr-based core; an integrated administrative UI; a great search UI builder (with the recent acquisition of Twigkit); and multiple-platform support. (Yep, I worked there a few years ago, but well before the current product was created).

I count IDOL in with the 'Old Guard' along with Endeca, Vivisimo (‘Watson’) and perhaps others - former leaders still available, but offered by non-search companies, or removed from traditional enterprise search (Watson). And it will be interesting to see if IDOL and its new parent, Micro Focus, survive the recent shotgun wedding.

Tier 2, great search but not quite “full” enterprise search, includes Elastic (which I believe is in the enviable position as *the* platform for IoT), MarkLogic, and perhaps one or two more.

And there are several newer or perhaps less-well known search offerings like Algolia, Funnelback, Swiftype, Yippy and more. Don’t hold their size and/or youth against them; they’re quite good products.

No, I’d say the Forrester report is limited, and honestly a bit out of touch with the real enterprise search market. I know, I know; How do I really feel? Stay tuned, I've got more to say coming soon. What do you think? Leave a comment below!

November 16, 2016

What features do your search users really want?

What features and capabilities do corporate end-users need from their search platform? Here's a radical concept: ask stakeholders what they want - and what they need - and make a list. No surprise: you'll have too much to do.

Try this: meet with stakeholders from each functional area of the organization. During each interview, ask people to tell you what internet search sites they use for personal browsing, and what capabilities of those sites they like best. As they name the desired features, write them on a white board.

Repeat this with representatives from every department, whether marketing, IT, support, documentation, sales, finance, shipping or others - really every group that will use the platform for a substantial part of their days. 

Once you have the list, ask for a little more help. Tell your users they each have 100 "Dev Dollars" (DD) to invest in new features, and ask them to spend whatever portion they want to pay for each feature - but all they have is 100 DD.

Now the dynamics get interesting. The really important features get the big bucks; the outliers get a pittance -  if anything. Typically, the top two or three features requested get between 40DD and 50DD; and that quickly trails off. 

I know - it sounds odd. These Dev Dollars have no true value - but people give a great deal of thought to assigning relative value to a list of capabilities - and it gives you a feature list with real priorities.
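
For anyone who wants to tally the white board afterwards, here is a minimal sketch of the arithmetic in Python. The department names, feature names, and amounts are made up for illustration, and rank_features is a hypothetical helper, not part of any search product.

```python
from collections import defaultdict

BUDGET = 100  # each stakeholder gets 100 Dev Dollars (DD)

def rank_features(allocations):
    """Sum each feature's Dev Dollars across stakeholders, highest first."""
    totals = defaultdict(int)
    for person, spend in allocations.items():
        # Nobody may invest more than the 100 DD they were given.
        if sum(spend.values()) > BUDGET:
            raise ValueError(f"{person} spent more than {BUDGET} DD")
        for feature, dollars in spend.items():
            totals[feature] += dollars
    # Sort by total descending, then by name so ties are stable.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical votes from three departments.
allocations = {
    "marketing": {"spelling suggestions": 50, "facets": 30, "best bets": 20},
    "support": {"spelling suggestions": 40, "facets": 45, "saved searches": 15},
    "finance": {"facets": 60, "spelling suggestions": 40},
}

ranked = rank_features(allocations)
for feature, dollars in ranked:
    print(f"{feature}: {dollars} DD")
```

With these made-up votes, two features collect almost all of the money and the rest get a pittance, which is the pattern described above.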

How do you discover what users really want? 

 

 

November 01, 2016

One search to rule them all

(Originally published on LinkedIn)

Lucene was ‘born’ in 1999, created by Doug Cutting; and in 2005, it became a top-level Apache project. That year, Gartner Group announced that the ‘Leaders’ on their Enterprise Search Magic Quadrant included Autonomy, FAST, Endeca, IBM Omnifind, and Verity. The Google Search Appliance was right on the cusp between ‘Challengers’ and ‘Leaders’. Not many people knew about Lucene; and few who did saw it as much more than a quirky little project.

Just a year later, Yonik Seeley and his employer, CNET Networks, published and donated the Solr search server to the Apache Software Foundation, where it became an incubator project in 2006; the two projects soon merged into a single top-level Apache project. That same year, Gartner narrowed the ‘Leaders’ in their 2006 Magic Quadrant for Search to Autonomy (which acquired Verity the previous year), FAST, and Endeca.

Jump forward to the present. FAST is gone, acquired by Microsoft in 2008 and morphed into SharePoint Search. Hewlett-Packard acquired Autonomy in October of 2011, followed a few weeks later by Oracle’s acquisition of Endeca. Endeca is no longer available as a search platform; and Autonomy is mostly seen as a strategy to keep a large number of HP consultants fully employed, often on compliance applications.

Only a smattering of the commercial enterprise search platforms that flooded the market just a few years back still exist. While Gartner continues to list 14 or 15 products in their Magic Quadrant Enterprise Search grid, about the only pure commercial products we see any more are the Google Search Appliance and Recommind. And Google recently announced that the appliance is scheduled to go ‘end of life’ over the next few years. All of those bright yellow boxes become really nice Dell servers by the end of 2018.

A new crop of search platforms has grown to fill the void.

As an open source product, Solr has grown in its capabilities, and is now widely used for enterprise search and data applications in corporations and government projects. Solr Cloud extends it into a scalable high-availability platform for demanding enterprise and data search applications.

Cloudera also bundles some interesting extra tools including Solr in their HUE bundle; free to download and free to use as long as you like. Cloudera runs a slightly older but stable release, 4.10; but with committers Yonik Seeley and Mark Miller, I suspect they’re in a good position.

Hortonworks, a Cloudera competitor, also offers Solr/Solr Cloud in their releases, in partnership with Lucidworks - a company with a large number of committers on staff.

There are also three companies that have proprietary offerings based on open source technology.

Attivio, founded in 2007, is a “Leader” in the most recent Gartner Magic Quadrant for Enterprise Search. Their product, while not open source, nonetheless thrives by combining search, BI, data automation, analytics and more.

Elasticsearch has evolved into a strong platform for search and data analytics, and a number of organizations are finding it useful in some traditional enterprise search applications as well. Elastic has also integrated Kibana, a powerful graphical presentation tool that adds value for content analytics, not just search activity reporting.

Lucidworks Fusion is a relative newcomer to enterprise search. It includes many of the rich architectural features that enterprises expect, including a powerful crawler, connectors, and reporting. With its ‘Anda’ crawler and connectors, admin UI, and reporting, some people see it as a contender to replace the Google Search Appliance.

The one thing that all of these ‘proprietary’ products have in common? They are based on Apache Lucene to deliver critical functionality. And when you consider all of the web sites that use some form of Lucene for their site search, I think you'd agree that it really is a powerful little package. It’s available for virtually any operating system, and can be integrated using just about any programming language from C/C++ to Java to Perl to Python to .NET.

Even more amazing is that these companies with commercial products based on Lucene – and who compete in the marketplace - actually cooperate when it comes time to fix bugs or add new capabilities to Lucene. Given all of the commercial players that have closed their doors - leaving their customers to find replacement platforms – we’ve reached the point where open-source-based software really is the safe choice now. And universally, Lucene is the common element.

The quirky little search API Doug Cutting put together in 1999 has evolved to be the platform that drives the leading search platforms used in big data, NoSQL, enterprise search, and search analytics. And it doesn’t seem like it’s going to be phasing out any time soon.

May 31, 2016

The Findwise Enterprise Search and Findability Survey 2016 is open for business

Would you find it helpful to benchmark your Enterprise Search operations against hundreds of corporations, organizations and government agencies worldwide? Before you answer, would you find that information useful enough that you’d spend a few minutes answering a survey about your enterprise search practices? It seems like a pretty good deal to me to have real-world data from people just like yourself worldwide.

This survey, the results of which are useful, insightful, and actionable for search managers everywhere, provides insight into many of the critical areas of search.

Findwise, the Swedish company with offices there and in Denmark, Norway, Poland and London, is gathering data now for the 2016 version of their annual Enterprise Search and Findability Survey at http://bit.ly/1sY9qiE.

What sorts of things will you learn?

Past surveys give insight into the difference between companies with happy search users versus those whose employees prefer to avoid using internal search. One particularly interesting finding last year was that there are three levels of ‘search maturity’, identifiable by how search is implemented across content.

The least mature search organizations, roughly 25% of respondents, have search for specific repositories (silos), but they generally treat search as ‘fire and forget’, and once installed, there is no ongoing oversight.

More mature search organizations, which represent about 60% of respondents, have one search for all silos; but maintaining and improving search technology gets very little staff attention.

The remaining 15% of organizations answering the survey invest in search technology and staff, and continuously attempt to improve search and findability. These organizations often have multiple search instances tailored for specific users and repositories.

One of my favorite findings a few years back was that a majority of enterprises have “one or less” full time staff responsible for search; and yet a similar majority of employees reported that search just didn’t work. The good news? Subsequent surveys have shown that staffing search with as few as 2 FTEs improves overall search satisfaction; and 3 FTEs seem to strongly improve overall satisfaction. And even more good news: Over the years, the trend in enterprise search shows that more and more organizations are taking search and findability seriously.

You can participate in the 2016 Findwise Enterprise Search and Findability Survey in just 10 or 15 minutes and you’ll be among the first to know what this year brings. Again, you’ll find the 2016 survey at http://bit.ly/1sY9qiE.

January 20, 2015

Your enterprise search is like your teenager

During a seminar a while back, I made this spontaneous claim. Recently, I made the comment again, and decided to back up my claim - which I’ll do here.

No, really – it’s true. Consider:

You can give your search platform detailed instructions, but it may or may not do things the way you meant:

Modern search platforms provide a console where you, as the one responsible for search, can enter all of the information needed to index content and serve up results. You tell it what repositories to index; what security applies to the various repositories; and how you want the results to look.  But did it? Does it give you a full report of what it did, what it was unable to do, and why?

You really have no idea what it’s doing – especially on weekends:

 Search platforms are notorious for the lack of operational information they provide.

Does your platform give you a useful report of what content was indexed successfully, and what was not – and why? And some platforms stop indexing files when they reach a certain size: do you know what content was not completely indexed?

When it does tell you, sometimes the information is incomplete: 

Your crawler tells you there were a bunch of ‘404’ errors because of a bad or missing URL; but will it tell you which page(s) had the bad link? Chances are it does not. 
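
If your crawler log happens to record the referring page for each fetch, you can build the missing report yourself. A minimal sketch in Python follows; the log format, the field names (referrer, url, status) and the helper broken_links_by_page are all assumptions for illustration, not the output of any particular crawler.

```python
from collections import defaultdict

def broken_links_by_page(crawl_log):
    """Group 404 URLs under the page that linked to them."""
    report = defaultdict(list)
    for entry in crawl_log:
        if entry["status"] == 404:
            report[entry["referrer"]].append(entry["url"])
    return dict(report)

# Hypothetical crawl log entries: one dict per fetched link.
crawl_log = [
    {"referrer": "/hr/index.html", "url": "/hr/benefits.pdf", "status": 200},
    {"referrer": "/hr/index.html", "url": "/hr/old-policy.html", "status": 404},
    {"referrer": "/it/faq.html", "url": "/it/vpn-setup.html", "status": 404},
    {"referrer": "/it/faq.html", "url": "/hr/old-policy.html", "status": 404},
]

report = broken_links_by_page(crawl_log)
for page, links in report.items():
    print(f"{page} has {len(links)} bad link(s): {', '.join(links)}")
```

The point of grouping by referring page is that content owners can fix the link at its source instead of hunting for it.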

They can be moody, and malfunction without any notice:

You schedule a full update of your index every weekend, and it has always worked flawlessly – as far as you know. Then, usually on a 3-day weekend, it fails. Why? See above.

When you talk to others who have search, theirs always sounds much better than yours:

As a conscientious search manager, you read about search, you attend webinars and conferences, and you always want to learn more. But you wonder why other search managers seem to describe their platform in glowing terms, and never seem to have any of the behavioral issues you live with every day. It kind of makes you wonder what you’re doing wrong with yours.

It costs more to maintain than you thought and it always needs updates:

When you first got the platform you knew there were ongoing expenses you’d have to budget – support, training, updates, consulting. But just like your kid who needs books, a computer, soccer coaching, and tuition, it’s always more than you budgeted. Sometimes way more!

You can buy insurance, but it never seems to cover what you really need:

Bear with me here: you get insurance for your kids in case they get sick or cause an accident, and you buy support and maintenance for your search platform.  But in the same way that you end up surprised that orthodontics are not fully covered, you may find out that help tuning the search platform, or making it work better, isn’t covered by the plan you purchased – in fact, it wasn’t even offered. QED.

It speaks a different vocabulary:

You want to talk with your kid and understand what’s going on; you certainly don’t want to look uncool. But like your kid, your search platform has a vocabulary that only barely makes sense to you. You know rows and columns, and thought you understood ‘fields’; but the search platform uses words you know but that don’t seem to be the same definition you’ve known from databases or CMS systems.

It's hard for one person to manage, especially when it's new:

Many surveys show that most companies have one (or less) full-time staff responsible for running the search engine – while the same companies claim search is ‘critical’ to their mission.  Search is hard to run, especially in the first few years when everything needs attention. You can always get outside help – not unlike day care and babysitters – but it just seems so much better if you could have a team to help manage and maintain search to make it behave better.

How it behaves reflects on you:

You’re the search manager and you’ve got the job to make search work “just like Google”.  You spent more than $250K to get this search engine, and the fact that it just doesn’t work well reflects badly on you and your career. You may be worried about a divorce.

It doesn’t behave like the last one:

People tend to be nostalgic, as are many search managers I know. They learned how to take care of the previous one, but this new one – well, it’s NOTHING like the earlier one. You need to learn its habits and behaviors, and often adjust your behavior to ensure peace at work.

You know if it messes up badly late at night, even on a weekend or a holiday, you’ll hear about it:

If customers or employees around the world use your search platform, there is no ‘down time’: when it’s having an issue, you’ll hear about it, and will be expected to solve the issue – NOW. You may even have IT staff monitoring the platform; but when it breaks in some odd and unanticipated way, you get the call. (And when does search ever fail in an expected way?)

 You may be legally responsible if it messes up:

Depending on what your search application is used for, you may find yourself legally responsible for a problem. Fortunately, the chances of you personally being at fault are slim, but if your company takes a hit for a problem that you hadn’t anticipated, you may have some ‘career risk’ of your own. Was secure content about the upcoming merger accidentally made public? Was content to be served only to your Swiss employees when they search from Switzerland exposed outside of the country? And you can’t even buy liability insurance for that kind of error.

When it’s good, you rarely hear about it; when it's bad, you’ll hear about it:

Seriously, how many of you have gotten a call from your CIO to tell you what a great experience he or she had on the new search platform? Do people want to take you to lunch because search works so well? If you answered ‘yes’ to either of these, I’d like to hear from you!

In my experience, people only go out of their way to give feedback on search when it’s not working well. It’s not “like Google”. Even though Google has hundreds of people and ‘bots’ examining every search query to try to make the result better, and you have only yourself and an IT guy.

You’ll hear. 

The work of managing it is never done:

The wonderful southern writer Ferrol Sams wrote:

“He's a good boy… I just can't think of enough things to tell him not to do.” Sound like your search platform? It will misbehave (or fail outright) in ways you never considered, and your search vendor will tell you “We’ve never seen a problem like that before”. Who has to get it fixed? You have to ask?

Once it moves away, you sometimes feel nostalgic:

Either you toss it out, or a major upgrade from your vendor comes along and the old search platform gets replaced. Soon, you’re wishing for the “Good old days” when you knew how cute and quirky the old one was, and you find yourself feeling nostalgic for it and wishing that it didn’t have to move out.

Do you agree with my premise? What have I missed?

August 21, 2014

More on the Gartner MQ: Fact or fiction?

There is a lively discussion going on over in the LinkedIn ‘Enterprise Search Engine Professionals’ group about the recent Gartner Magic Quadrant report on Enterprise Search. Whit Andrews, a Gartner Research VP, has replied that the Gartner MQ is not a 'pay to play'. I confess to having been the one who brought the topic up in these threads, at least, and I certainly thank Whit for clarifying the misunderstanding directly.

That said, two of my colleagues who are true search experts have raised some questions I thought should be addressed.

Charlie Hull of UK-based Flax says he's “unconvinced of the value of the MQ to anyone wanting a comprehensive … view of the options available in the search market”. And Otis Gospodnetić of New York-based Sematext asks "why (would) anyone bother with Gartner's reports. We all know they don't necessarily match the reality". I want to try to address those two very good points.

First, I'm not sure Gartner claims to be a comprehensive overview of the search market. Perhaps there are more thorough lists - my friends and colleagues Avi Rappoport and Steve Arnold both have more complete coverage. Avi, now at Search Technologies, still maintains www.searchtools.com with a list that is as much a history of search as a list of vendors. And Steve Arnold has a great deal of free content on his site as well as high quality technology overviews by subscription. Find links to both at arnoldit.com.

Nonetheless, Gartner does have published criteria, and being a paid subscriber is not one of them. Whit's fellow Gartner analyst French Caldwell calls that out on his blog. By the way, I have first-hand experience that Gartner is willing to cut some slack to companies that don't quite meet all of their guidelines for inclusion, and I think that adds credence to the claim that everything.

A more interesting question is one that Otis raises: “why would anyone bother with Gartner's reports”?

To answer that, let me paraphrase a well-known quote from the early days of computers: "No one ever got fired for following Gartner's advice". They are well known for having good if not perfect advice - and I'd suspect that in the fine print, Gartner even acknowledges the fallibility of their recommendations. And all of us know that in real life, you can't select software as complex as an enterprise search platform without a proof of concept in your environment and on your content.

The industry is full of examples where the *best* technology loses pretty consistently to 'pretty good' stuff backed by a major firm/analyst/expert. Otis, I know you're an expert, and I'd take what you say as gospel. A VP at a big corporation who is not familiar with search (or his company's detailed search requirements) may not do so. And anyone on that VP's staff who picks a platform based solely on what someone like you or me says probably faces some amount of career risk. That said, I think I speak for Otis and Charlie and others when I say I am glad that a number of folks have listened to our advice and are still fully employed!

So - in summary, I think we're all right. Whit Andrews and Gartner provide advice that large organizations trust because of the overall methodology of their evaluation. Everyone does know it's not infallible, so a smart company will use the 'trust but verify' approach. And they continue to trust you and me, but more so when Gartner or Forrester or one of the large national consulting companies confirms our recommendation. And if not, we have to provide a compelling reason why something else is better for them. And the longer we're successful with our clients, the more credible we become.

 

 

August 05, 2014

The unspoken "search user contract"

Search usability is a major difference between search that works and search that sucks. 

I recently had lunch with my longtime friend and associate Avi Rappoport from Search Tools Consulting. We had a great time exchanging stories about some of the search problems our clients have. She mentioned one customer with whom she was sharing best practices for laying out a result list. That brought to mind what I've called the 'search user contract', which users tacitly expect when they use your search on any site, internal or external.

If you are responsible for an instance of search running inside a firewall, even if it's outward facing, you have a problem your predecessors of 15 to 20 years ago* didn't have. Back then, most users didn't have experience with search except the one you provided - so they didn't have expectations of what it could be like.

Fast forward to the present. In addition to your intranet search, virtually everyone in your organization knows, uses, and often loves Amazon, Facebook, Google, Apple, eBay, and others. They know what really great search looks like. They expect you to suggest searches (or even products) on the fly! Search today knows misspelled words and what other products you might like. And as we start to see more machine learning in the enterprise space, it will get even harder.

But most importantly, almost all of the above sites follow the same unspoken user contract:

  • On the result list, the search box goes at the top, either across a wide swath of the browser window or in a smaller box on the left-hand side, near the top.
  • There is no more than one search box on the results page.
  • Search results, numbered or not, show a page title or product name and a meaningful description of the product or summary of the document. Sometimes the summary is just a snippet.
  • Words that cause the document to be returned are sometimes bolded in the summary.
  • Suggestions for the words and phrases you type show up just below the search box (or up in the URL field)
  • Facets, when available, go along the left-hand side and/or across the top, just under the search box. Occasionally they can be on the right of the result list.
  • Whether facets are displayed on the left or right of the screen, the numbers next to each facet indicate how many results will display when that facet is clicked.
  • Best bets and boosted or promoted results show up at the top of the result list and are generally recognizable as recommended or featured results.
  • Advertisements or special announcements appear on the right side of the result list.
  • Links to the 'next’ or ‘previous' results page appear at the bottom or less often at the top of the result list.
  • Generally, when there is a very long result list, there may be a limited number of results per page with "Next" and "Previous" links.
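
To make the facet-count item in the list above concrete, here is a minimal sketch in Python of counting results per facet value, so the number shown next to a facet matches what the user gets after clicking it. The documents, the "type" field, and the facet_counts helper are hypothetical; a real search platform would normally compute these counts for you.

```python
from collections import Counter

def facet_counts(results, field):
    """Count how many results carry each value of the given facet field."""
    # Documents without the field are skipped, so they never create a facet.
    return Counter(doc[field] for doc in results if doc.get(field))

# Hypothetical result list for one query.
results = [
    {"title": "Q3 earnings release", "type": "News"},
    {"title": "Travel policy", "type": "Policy"},
    {"title": "Expense policy", "type": "Policy"},
    {"title": "Org chart"},
]

counts = facet_counts(results, "type")
for value, n in counts.most_common():
    print(f"{value} ({n})")
```

Because only values that actually occur in the result list are counted, a facet with no matching results never shows up at all.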

Now it's time to look at your web sites - public facing as well as behind your firewall. Things we often see on internal or corporate sites include:

  • Spelling suggestions in small, dark font very close to the site background color, at the left edge of the content, just above facets. Users don't expect to look there for suggestions, and even if they do look, make the color stand out so users see it**. Don't make the user think!
  • An extra search form on the page; one at the top as 'part of our standard header block'; and one right above the result list to enable drill down. The results you see will differ depending on which field you type in. [The visitor is confused: which search button should be pressed to do a 'drill down' search? Again, don't make the user think]
  • Tabs for drilling into different content areas seem to be facets, but some of the tabs ('News') have no results. [Facets should only display if, by clicking on a facet, the user can see more content]

As I said at the top, we’ve found poor search user experience is a major reason employees and site visitors report that ‘search sucks’. One of the standard engagements we do is a Search Audit, which includes search usability in addition to a review of user requirements and expectations.

 

/s/Miles

 

*Yes, Virginia, there was enterprise search 20 or more years ago. Virtually none of those names still exist, but their technology is still touching you every day. Fulcrum, Verity, Excalibur and others were solving problems for corporations and government agencies; and of course Yahoo was founded in 1994.

**True story, with names omitted to protect the innocent. On a site where I was asked to deliver a search quality audit, ‘spelling suggestions’ was a top requested feature. They actually had spell suggestions, in grey letters in a dark black field with a dark green background, far to the left of the browser window. No one noticed them. You know who you are; you’re welcome!

 

July 21, 2014

Gartner MQ 2014 for Search: Surprise!

Funny, just last week I tweeted about how late the Gartner Magic Quadrant for Enterprise Search is this year. Usually it's out in March, and here it is, July.

Well, it's out - and boy does it have some surprises! My first take:

Coveo, a great search platform that runs on Windows only, is in the Leaders quadrant, and best overall in 'Completeness of Vision'. Don't get me wrong, it's a great search platform; but I guess completeness of vision does not include completeness of platform. Linux your flavor? Sorry.

HP/Autonomy IDOL is in the upper right quadrant as well, back strong as the top in 'Ability to Execute' and in the top three on 'Completeness of Vision'. IDOL has always reminded me of the reliable old Douglas DC-3, described by aviation enthusiasts as 'a collection of parts flying in loose formation', but it really does offer everything enterprise search needs. And, because it loves big hardware, everything that HP loves to sell.

BA Insight surprised me with their Knowledge Integration Platform at the top of the Visionaries quadrant. It enhances Microsoft SharePoint Search, or runs with a stand-alone version of Lucene. It's very cool, yes. But I sure don't think of it as a search engine. Do you? More on this later.

Attivio comes in solid in the lower right 'Visionaries' quadrant. I'd really expected to see them further along on both measures, so I'm surprised.

I'm really quite disappointed that Gartner places my former employer Lucidworks solidly in the lower left 'Niche players' quadrant. I think Lucidworks has a very good vision of where they want to go, and I think most enterprises will find it compelling once they take a look. I don’t think I'm biased when I say that this may be Gartner's big miss this year. And OK, I understand that, like BA Insight's Knowledge product, Lucidworks needs a search engine to run, but it feels more like a true search platform.

Big surprise: IHS, which I have always thought of as a publisher, has made it to the Gartner Niche quadrant as a search platform. Odd.

Other surprises: IBM in the Niche market quadrant, based on 'Ability to Execute'. Back at Verity, then-CEO Philippe Courtot got the Gartner folks to admit that the big component of Ability to Execute was really about how long you could fund the project, and I have to confess I figured IBM (and Google) as the MQ companies with the best cash position.

If you're not a Gartner client, I'm sorry you won't get the report or the insights of Whit Andrews (@WhitAndrews _), a long time search analyst who knows his stuff. You can still find the report from several vendors happy to let you download the Gartner MQ Search from them. Search Google and find the link you most prefer, or call your vendor for a full copy.

/s/Miles

What does it take to qualify as 'Big Data'?

If you've been on a deserted island for a couple of decades, you may not have heard the hot new buzz phrase: Big Data. And you may not have heard of "Hadoop", the application that accidentally solved the problem of Big Data.

Hadoop was originally designed as a way for the open source Nutch crawler to store its content prior to indexing. Nutch was fine for crawling sites; but if you wanted to crawl really massive data sets – say the Internet – you needed a better way to store the content (thank goodness Doug Cutting didn’t work at a database giant or we’d all be speaking SQL now!). GigaOm has a great series on the history of Hadoop (http://bit.ly/1jOMHiQ) that I recommend for anyone interested in how it all began and evolved.

After a number of false starts, brick walls, and subsequent successes, Hadoop is a technology that really enables what we now call ‘big data’, usually written as "Big Data". But what does this mean? After all, there are companies with a lot of data – and there are companies with limited content size that changes rapidly every day. But which of these really have data that meets the 'Big' definition?

Consider a company like AT&T or Xerox PARC, which licenses its technology to companies worldwide. As part of a license agreement, PARC agrees to defend its licensees if an intellectual property lawsuit ever crosses the transom. Both companies own tens of thousands of patents going back to their founding in the early 20th century. Just the digital content to support these patents and inventions must number in the tens of millions of documents, much of which is in formats no longer supported by any modern search platform. Heck, to Xerox, WordStar and Peachtext probably seem pretty recent! But about the only time they have to search that content is when a licensee needs help defending against an IP claim. I don’t know how often that is, but I’d bet less than a dozen times a year.

Now consider a retail giant like Amazon or Best Buy. In raw size, I’d bet Amazon has hundreds of millions of items to index: books, products, videos, tunes. Maybe more. But that’s not what makes Amazon successful. I think it’s the ability to execute billions of queries every day – again, maybe more – and return damn good results in well under a second, along with recommendations for related products. Best Buy actually has retail stores, so they have to keep not only purchase data but also buying patterns, so they know what products to stock in any given retail location.

A healthcare company like UnitedHealth must have its share of corporate intranet content. But unlike many corporations, these companies must process millions of medical transactions every week: doctor visits, prescriptions, test results, and more. They need to process these transactions, but they also must keep these transactions around for legally defined durations.

Finally, consider a global telecom company like Ericsson or Verizon. They’ve got the usual corporate intranet, I’m sure. They have financial transactions like Amazon and UHG. But they also have telecom transaction records that must count in the billions a month: phone calls and more. And given the politics of the world, many of these transactions have to be maintained and searchable for months, if not years.

These four companies have a number of common traits with respect to search; but each has its own specific demands. Which ones count as ‘big data’ as it’s usually defined? And which just have ‘a bunch of content’?

As it turns out, that’s a tough question. At one point, there was a consensus that ‘big data’ required three things, known as the “Three V’s of Big Data”. This escalated to the “5 V’s of Big Data”, then the “7 V’s”, and I’ve even seen some define the “10 V’s of Big Data”. Wow... and growing!

Let’s take a look at the various “V’s” that are commonly used to define ‘Big Data’.

Depending on who you ask, there are four, five, seven or more ‘requirements’ that define ‘big data’. These are usually referred to as the “Vs of Big Data”, and these usually include:

Volume: The scale of your data – basically, how many ‘entries’ or ‘items’ you have. For Xerox, how many patents; for a telecom company, how many phone ‘transactions’ there have been.

Variety: Basically this means how many different types of data you have. Amazon has mouse clicks, product views, unique titles, subscribers, financial transactions and more. For UHG and Ericsson, I’d guess the majority of their content is transactional: phone call metadata (originating and receiving phone number, duration of the call, time of day, etc.). In the enterprise, variety can also mean data format and structure. Some claim that 90% of enterprise data is unstructured, which adds yet another challenge.

Veracity: This boils down to whether the data is trustworthy and meaningful. I remember a survey HP did years ago to find out what predictors were useful to know whether a person walking into a random electronics store would walk out with an HP PC. Using HP products at work or at home were the big predictors; but the fact that the most likely day was Tuesday was perhaps spurious and not very valuable.

Velocity: How fast is the data coming in and/or changing. Amazon has a pretty good idea on any given day how many transactions they can expect, and Verizon knows how much call data they can expect. But things change: A new product becomes available, or a major world event triggers many more phone calls than usual.

Viability: If you want to track trends, you need to know what data points are the most useful in predicting the future. A good friend of mine bought a router on Amazon; and Amazon reported that people who bought that router also bought... men’s extra large jeans. Now, he tells me he did think they were nice jeans, but that signal may not have had long viability.

Value: How useful or important is the data in making a prediction, or in improving business decisions. That was easy!

Variability: This often refers to how internally consistent the data is. For a data point to be an accurate predictor, it should ideally be consistent across a wide range of content. Blood pressure, for example, is generally in a small range; and for a given patient, should be relatively consistent over time. When there is a change, UHG may want to understand the cause.

Visualization: Rows and columns of data can look pretty intimidating and it’s not easy to extract meaning from them. But as they say, ‘a picture is worth a thousand words’, so being able to see charts or graphs can help meaning and trends jump out at you.  I’d use Lucidworks’ SiLK product as an example of a great visualization tool for big data, but there are many others.

Validity: This seems like another way to say the data has veracity, but it may be a subtle point. If you’re recording click-thru data, or prescriptions, or intellectual property, you have to know that the data is accurate and internally consistent. In my HP anecdote above, is the fact that more people bought HP PCs on Tuesday a valid finding? Or is it simply noise? You’ll probably need a human researcher to make these kinds of calls.

Venue: With respect to Big Data, this means where the data came from and where it will be used. Content collected from automobiles and from airplanes may look similar in a lot of ways to the novice. In the same way, data from the public Internet versus data collected from a private cloud may look almost identical. But making decisions for your intranet based on data collected from Bing or Google may prove to be a risk.

Vocabulary: What describes or defines the various items of the data. Ericsson has to know which bits of data represent a phone number and which represent the time of day. Without some idea of the schema or taxonomy, we’ll be hard pressed to reach reasonable decisions from Big Data.

Volatility: This may seem like velocity above, but volatility in Big Data really means how long the data is valid and how long you need to keep it around. Healthcare companies may need to keep the data a lot longer than

Vagueness: This final one is credited to Venkat Krishnamurthy of YarcData just last month at the Big Data Innovation Summit here in Silicon Valley.  In a way, it addresses the confidence we can have in the results suggested by the data. Are we seeing real trends, or are we witnessing a black swan?

In the application of Big Data, not all of these various V’s are as valid or valuable to the casual (or serious) observer. But as in so many things, interpreting the data is up to the person making the call. Big Data is only a tool: use it wisely!

Some resources I used in collecting data for this article include the following web sites and blogs:

IBM’s Big Data & Analytics Hub 

MapR's Blog: Top 10 Big Data Challenges – A Serious Look at 10 Big Data V’s 

See also Dr. Kirk Borne’s Top 10 List on Data Science Central   

Bernard Marr’s LinkedIn post on The 5 Vs Everyone Must Know