« August 2013 | Main | August 2014 »

6 posts from July 2014

July 29, 2014

Big data: Salvation for enterprise search?

Or just another data source?

With all the acquisitions we've seen in enterprise search in the last several years, it's no wonder that the field looks boring to the casual observer. Most companies have gone through two or more complex, costly search implementations to a new search platform, users still complain, and in some quarters, there seems to be 'quality of search fatigue'. I acknowledge I'm biased, but I think enterprise search implemented and managed properly provides incredible value to corporations, their employees, and their customers/consumers. That said, a lot of companies seem to treat search as 'fire and forget', and after it's installed, it never gets the resources to get off the ground in a quality way.

It's no surprise then that the recent hype bubble in 'Big Data' has the attention of enterprise search companies as they see a way to convince an entirely new group of technologists that search is the way. 

It's certainly true that Hadoop's beginning was related to search - as a repository for web crawler Nutch in preparation for highly scalable indexing in Lucene/Solr no doubt. Hadoop and its zoo* of related tools certainly are designed for nerds. At best, it's a framework that sits on top of your physical disks; worst case it's a file system that does support authentication but not really security (in the LDAP/AFD sense). And it's a great tool to write 'jobs' to manipulate content in interesting ways to a data scientist. How is your Java? Python? Clojure? Better brush up.

The enterprise search vendors of the world certainly see the tremendous interest in Hadoop and 'big data' as a great opportunity to grow their business. And for the right use cases, most enterprise search platforms can address the problem. But remember that, to enterprise search, the content you store in Hadoop is simply content in a different repository: a new data source on the menu.

But remember, big data apps come with all the same challenges enterprise search has faced for years plus a few more. Users - even if data scientists and researchers - think web-based Google is search; and even though - as a group - this demographic may be more intelligent than your average search users, they still expect your search to 'just know". If you think babysitting your existing enterprise search solution is touch, wait until you see what billions of documents does for you.

And speaking of billions of records - how long does your current search take to index your current content? How long does it take to do a full index of your current content? Now extrapolate: how long will it take to index a few billion records? (Note: some vendors can provide a much faster 'pipe' for indexing massive content from Hadoop. Lucidworks and Cloudera are two of the companies I am familiar with; there may be others)

A failure in search? Well, it depends what you want. If you are going to treat Hadoop as a 'Ha-Dump' with all of your log files, all of your customer transactional data, hundreds of Twitter feeds for ever and ever, and add your existing enterprise data, you're going to have some time on your hands while the data gets indexed.

On the other hand, if you're smart about where your data goes, break it into 'data lakes' of related content, and use the right tool for each type of data, you won't be using your enterprise search platform for use cases better served with analytics tools that are part of the Apache Zoo; and you’ll still be doing pretty well. And in that universe, Hadoop is just another data source for search - and not the slow pipe through which all of your data has to flow.

Do you agree?


*If you get the joke, chances are you know a bit about the Apache project and open source software. If not, you may want to hold off and research before you download your first Hadoop VM.   


July 22, 2014

The role of professional services in software companies

I just saw a great piece on LinkedIn written by venture investor Mark Suster (@msuster) of Upfront Ventures. His piece is about the importance of professional services to building a VC backed company/product. I can add no more to his take (bolded text is on me):

Have an in-house professional services team that implements your software. It will bring down your overall margins but will produce profitable revenue. Most importantly in ensures long-term success. I wrote about The Importance of Professional Services here. I know, I know. Your favorite investor told you this was a bad idea. Trust me – you’ll thank me a few years from now if you control your own destiny and improve quality through services. If your investor worked inside of a SaaS company for years and disagrees with me then listen to them. If they’re a spreadsheet jockey then on this particular issue I promise you they are FOS. Spreadsheet quant does not equal success, properly implemented software does.

'Nuff said. I've never met Mark, but I like where he's coming from!

/s/ Miles

July 21, 2014

Gartner MQ 2014 for Search: Surprise!

Funny, just last week I tweeted about how late the Gartner Magic Quadrant for Enterprise Search is this year. Usually it's out in March, and here it is, July.

Well, it's out - and boy does it have some surprises! My first take:

Coveo, a great search platform that runs on Windows only, is in the Leaders quadrant, and best overall in the "Completeness of Vision". Don't get me wrong, it's a great search platform; but I guess completeness of vision does not include completeness of platform. Linux your flavor? Sorry.

HP/Autonomy IDOL is in the upper right quadrant as well, back strong as the top in 'Ability to Execute' and in the top three on 'Completeness of Vision'. IDOL has always reminded me of the reliable old Douglas DC-3, described by aviation enthusiasts as 'a collection of parts flying in loose formation', but it really does offer everything enterprise search needs. And, because it loves big hardware, everything that HP loves to sell.

BA Insight surprised me with their Knowledge Integration Platform at the top of the Visionaries quadrant. It enhances Microsoft SharePoint Search, or runs with a stand-alone version of Lucene. It's very cool, yes. But I sure don't think of it as a search engine. Do you? More on this later.

Attivio comes in solid in the lower right 'Visionaries' quadrant. I'd really expected to see them further along on both measures, so I'm surprised.

I'm really quite disappointed that Gartner places my former employer Lucidworks solidly in the lower left 'Niche players' quadrant. I think Lucidworks has a very good vision of where they want to go, and I think most enterprises will find it compelling once they take a look. I don’t think I'm biased when I say that this may be Gartner's big miss this year. And OK, I understand that, like BA Insight's Knowledge product, Lucidworks needs a search engine to run, but it feels more like a true search platform.

Big surprise: IHS, which I have always thought as a publisher, has made it to the Gartner Niche quadrant as a search platform. Odd.

Other surprises: IBM in the Niche market quadrant, based on 'Ability to Execute'. Back at Verity, then CEO Philippe Courtot got the Gartner folks to admit that the big component of Ability to Execute was really about how long you could fund the project and I have to confess I figured IBM (and Google) as the MQ companies with the best cash position.

If you're not a Gartner client, I'm sorry you won't get the report or the insights Whit Andrews (@WhitAndrews _), a long time search analyst who knows his stuff. You can still find the report from several vendors happy to let you download the Gartner MQ Search from them. Search Google and find the link you most prefer, or call your vendor for a full copy.


A new V for Big Data: Visitor

About an hour after I wrote my most recent post, What does it take to qualify as 'Big Data'? about the multiple Vs of Big Data. As you can imagine, I had words that start with the letter 'V' bouncing around in my head. Then it hit me: another V word for Big Data is Visitor

For years, I've been writing about the importance of context in search - basically data about the search user. Without context, we can show some basic full-text search results. With context, we try to get into the user's head and understand what the search terms mean to him or her. 

On the Internet, this usually means the physical location of the user IP address, previous searches he/she may have done, or products the searcher browsed. (Aside: Just last week, I was on Amazon looking for an adapter that would let me convert my desk so i could work standing up rather than sitting down. A few days later, I searched Google for a news story I had seen. One of the ads that showed up on the results? An ad from Amazon for a stand up desk adaptor. Now that's what I mean by 'context'!)

Inside the organization, user context might include things like the user's department, physical location, job title, native language, or product specialty. And if you want to do search right, you need to start finding a way to use that context. 

And as I was thinking about V words.. Visitor came to mind. Whether you call it context, signals, search and browse history or just environment, all of this is critical to successful implementation of enterprise search. This is true whether you are searching big data or your SharePoint repository. And as far as most search platforms go, you still have to do it yourself. 




What does it take to qualify as 'Big Data'?

If you've been on a deserted island for a couple of decades, you may not have heard the hot new buzz phrase: Big Data. And you many not have heard of "Hadoop", the application that accidentally solved the problem of Big Data.

Hadoop was originally designed as a way for the open source Nutch crawler to store its content prior to indexing. Nutch was fine for crawling sites; but of you wanted to crawl really massive data sets – say the Internet – you needed a better way to store the content (thank goodness Doug Cutting didn’t work at a database giant or we’d all be speaking SQL now!) GigaOm has a great series on the history of Hadoop http://bit.ly/1jOMHiQ I recommend for anyone interested in how it all began and evolved,

After a number of false starts, brick walls, and subsequent successes, Hadoop is a technology that really enables what we now call ‘big data’- usually written as "Big Data". But what does this mean?  After all, there are companies with a lot of data – and there are companies with limited content size that changes rapidly every day. But which of these really have data that meets the 'Big" definition.  

Consider a company like AT&T or Xerox PARC, which licenses its technology to companies worldwide. As part of a license agreement, PARC agrees to defend its licensees if an intellectual property lawsuit ever crosses the transom. Both companies own over tens of thousands patents going back to its founding in the early 20th century. Just the digital content to support these patents and inventions must number on the tens of millions of documents, much of which is in formats no longer supported by any modern search platform. Heck, to Xerox, WordStar and Peachtext probably seem pretty recent! But about the only time they have to access their content search is when a licensee needs help defending a licensee against an IP claim. I don’t know how often that is, but I’d bet less than a dozen times a year.

Now consider a retail giant like Amazon or Best Buy. In raw size, I’d bet Amazon has hundreds of millions of items to index: books, products, videos, tunes. Maybe more. But that’s not what makes Amazon successful. I think it’s the ability to execute billions of queries every day – again, maybe more – and return damn good results in well under a second, along with recommendations for related products. Best buy actually has retail stores, so they have to keep purchase data, but also buying patterns so they know what products to stock in any given retail location.

A healthcare company like UnitedHealth must have its share of corporate intranet content. But unlike many corporations, these companies must process millions of medical transactions every week: doctor visits, prescriptions, test results, and more. They need to process these transactions, but they also must keep these transactions around for legally defined durations.

Finally, consider a global telecom company like Ericsson or Verizon. They’ve got the usual corporate intranet, I’m sure. They have financial transactions like Amazon and UHG. But they also have telecomm transaction records that must count in the billions a month: phone calls and more. And given the politics of the world, many of these transactions have to be maintained and searchable for months, if not years.

These four companies have a number of common traits with respect to search; but each has its own specific demands. Which ones count as ‘big data’ as it’s usually defined? And which just have ‘a bunch of content?

As it turns out that’s a touch question. At one point, there was a consensus that ‘big data’ required three things, known as the “Three V’s of Big Data’. This escalated to the ‘5 V’s of Big Data’, then the “7 V’s”– and I’ve even seen some define the “10 V’s of Big Data”. Wow.. and growing!

Let’s take a look at the various “V’s” that are commonly used to define ‘Big Data’.

Depending on who you ask, there are four, five, seven or more ‘requirements’ that define ‘big data. These are usually referred to as the “Vs of Big Data”, and these usually include:

Volume: The scale of your data – basically, how many ‘entries’ or ‘items’, you have. For Xerox, how many patents; for a telecom company, how many phone ‘transactions’ have there been.   

Variety: Basically this means how many different types of data you have. Amazon has mouse clicks, product views, unique titles, subscribers, financial transactions and more. For UHG and Ericsson, I’d guess the majority of their content is transactional: phone call metadata (originating and receiving phone number, duration of the call, time of day, etc.). In the enterprise, variety can also mean data format and structure. Some claim that 90% of enterprise data is unstructured, which adds yet another challenge.

Veracity: The boils down whether the data is trustworthy and meaningful. I remember a survey HP did years ago to find out what predictors were useful to know whether a person waking into a random electronics store would walk out with an HP PC. Using HP products at work or at home we the big predictors; but the fact that the most likely day was Tuesday was perhaps spurious and not very valuable.

Velocity: How fast is the data coming in and/or changing. Amazon has a pretty good idea on any given day how many transactions they can expect, and Verizon knows how much call data they can expect. But things change: A new product becomes available, or a major world event triggers many more phone calls than usual.

Viability: If you want to track trends, you need to know what data points are the most useful in predicting the future. A good friend of mine bought a router on Amazon; and Amazon reported that people who bought that router also bought.. men’s extra large jeans. Now, he tells me he did think they were nice jeans, but that signal may not have had long viability.

Value: How useful or important is the data in making a prediction, or in improving business decisions. That was easy!

Variability: This often refers to how internally consistent the data is. To a data point as an accurate predictor, that data point is ideally consistent across the wide range of content. Blood pressure, for example, is generally in a small range; and for a given patient, should be relatively consistent over time. When there is a change, UHG may want to understand the cause.

Visualization: Rows and columns of data can look pretty intimidating and it’s not easy to extract meaning from them. But as they say, ‘a picture is worth a thousand words’, so being able to see charts or graphs can help meaning and trends jump out at you.  I’d use Lucidworks’ SiLK product as an example of a great visualization tool for big data, but there are many others.

Validity: This seems like another way to say the data has veracity, but it may be a subtle point. If you’re recording click-thru data, or prescriptions, or intellectual property, you have to know that the data is accurate and internally consistent. In my HP anecdote above, is the fact that more people bought HP PCs on Tuesday a valid finding? Or is it simply noise? You’ll probably need a human researcher to make these kinds of calls.

Venue: With respect to Big Data, this means where the data came from and where it will be used. Content collected from automobiles and from airplanes may look similar in a lot of ways to the novice. In the same way, data from the public Internet versus data collected from a private cloud may look almost identical. But making decisions for your intranet based on data collected from Bing or Google may prove to be a risk.

Vocabulary: What describes or defines the various items of the data. Ericsson has to know which bit of data represent a phone number and which represent the time of day. Without some idea of the schema or taxonomy, we’ll be hard pressed to reach reasonable decisions from Big Data.

Volatility: This may seem like velocity above, but volatility in Big Data really means how long is the data value, how long do you need to keep it around.  Healthcare companies may need to keep the data a lot longer than

Vagueness: This final one is credited to Venkat Krishnamurthy of YarcData just last month at the Big Data Innovation Summit here in Silicon Valley.  In a way, it addresses the confidence we can have in the results suggested by the data. Are we seeing real trends, or are we witnessing a black swan?

In the application of Big Data not all of these various V’s are as valid or valuable to the casual (or serious) observer. But as in so many things, interpreting the data is to the person making the call. Big Data is only a tool: use it wisely!

Some resources I used in collection data for this article include the follow web sites and blogs:

IBM’s Big Data & Analytics Hub 

MapR's Blog: Top 10 Big Data Challenges – A Serious Look at 10 Big Data V’s 

See also Dr. Kirk Borne’s Top 10 List on Data Science Central   

Bernard Marr’s LinkedIn post on The 5 Vs Everyone Must Know 


July 07, 2014

Data quality as critical to 'big data' as it is to search

For years, we've been preaching in the wilderness about the important of 'data quality' - the new name for 'garbage in, garbage out'. Maybe it gets so little respect because it was first cited on April Fools Day (back in 1963, according to Wikipedia). Bad content has caused enterprise search owners headaches for years - heck, one of our most popular posts, Sixty Guys named Sarah is really about a data quality problem.  

Last Monday, Fortune Magazine posted an article called Big data's dirty problem, telling us all that:

"Inaccuracies, misspellings, and obsolete information makes achieving the big data utopia a slog for businesses and researchers"

For those of us who have worked in search for a while, it comes as no surprise. (It's also a reason why you need great search along with a big data distro to succeed).

So many companies approach enterprise search with what I call a 'fire and forget' mentality. Google on the public web makes it look so easy - how hard can it be?

At so many companies, we've seen this vicious cycle: Pick the search platform that looks best, install it, ignore it, and repeat in two to four years. No - really, ask yourself: how long have you used your current enterprise search platform?

Now ask yourself "How often do we review top queries, top misspellings, zero hit query reports?"  In my experience, there is an inverse correlation between longevity of search platform and money invested in monitoring and maintaining the search platform. In search, big data, and life, you get what you pay for.

Looking to end the vicious circle of 'roll out, replace, repeat' with your enterprise OR big data apps? You might start with a data audit - we can help.

Miles Kehoe/July 6, 2014