13 posts categorized "Nutch"

July 29, 2014

Big data: Salvation for enterprise search?

Or just another data source?

With all the acquisitions we've seen in enterprise search over the last several years, it's no wonder the field looks boring to the casual observer. Most companies have gone through two or more complex, costly migrations to a new search platform, users still complain, and in some quarters there seems to be 'quality of search' fatigue. I acknowledge I'm biased, but I think enterprise search, implemented and managed properly, provides incredible value to corporations, their employees, and their customers/consumers. That said, a lot of companies seem to treat search as 'fire and forget': once it's installed, it never gets the resources it needs to get off the ground in a quality way.

It's no surprise then that the recent hype bubble in 'Big Data' has the attention of enterprise search companies as they see a way to convince an entirely new group of technologists that search is the way. 

It's certainly true that Hadoop's beginnings were related to search: it started as a repository for the Nutch web crawler, storing content in preparation for highly scalable indexing in Lucene/Solr. Hadoop and its zoo* of related tools are certainly designed for nerds. At best, it's a framework that sits on top of your physical disks; at worst, it's a file system that supports authentication but not really security (in the LDAP/AD sense). And to a data scientist, it's a great tool for writing 'jobs' that manipulate content in interesting ways. How is your Java? Python? Clojure? Better brush up.
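To make 'writing a job' concrete, here's a minimal sketch of a word-count job using Hadoop Streaming in Python. The file names and the streaming jar path are illustrative assumptions, not a recipe for any particular cluster.

    # mapper.py - reads raw text from stdin, emits "word<TAB>1" for each token
    import sys
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word.lower(), 1))

    # reducer.py - sums the counts for each word (Hadoop sorts mapper output by key,
    # so all lines for a given word arrive together)
    import sys
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").rsplit("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

    # Run it (paths and jar name are hypothetical):
    # hadoop jar hadoop-streaming.jar -input /logs -output /counts \
    #     -mapper mapper.py -reducer reducer.py

Trivial, yes - but every step up from this (joins, sessionization, feature extraction) is more of the same plumbing, which is why the 'better brush up' comment isn't entirely a joke.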

The enterprise search vendors of the world certainly see the tremendous interest in Hadoop and 'big data' as a great opportunity to grow their business. And for the right use cases, most enterprise search platforms can address the problem. But remember that, to enterprise search, the content you store in Hadoop is simply content in a different repository: a new data source on the menu.

Keep in mind, too, that big data apps come with all the same challenges enterprise search has faced for years, plus a few more. Users - even data scientists and researchers - think web-based Google is search; and even though this demographic may, as a group, be more intelligent than your average search users, they still expect your search to 'just know'. If you think babysitting your existing enterprise search solution is tough, wait until you see what billions of documents do for you.

And speaking of billions of records: how long does your current platform take to index new content? How long does a full re-index of your existing content take? Now extrapolate: how long will it take to index a few billion records? (Note: some vendors can provide a much faster 'pipe' for indexing massive content from Hadoop. Lucidworks and Cloudera are two of the companies I am familiar with; there may be others.)
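If you want to run that extrapolation yourself, here's a back-of-envelope sketch. The throughput and document counts are made-up assumptions; plug in your own numbers.

    # Rough estimate of full re-index time - all numbers here are hypothetical.
    docs_per_second = 2_000           # measured throughput of your current pipeline
    total_docs = 3_000_000_000        # "a few billion records"
    seconds = total_docs / docs_per_second
    print("full index: about %.1f days" % (seconds / 86_400))   # ~17 days at this rate

Even doubling or tripling the throughput still leaves you measuring the job in days rather than hours, which is exactly why those faster Hadoop-to-index pipes matter.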

A failure in search? Well, it depends what you want. If you are going to treat Hadoop as a 'Ha-Dump' with all of your log files, all of your customer transactional data, hundreds of Twitter feeds for ever and ever, and add your existing enterprise data, you're going to have some time on your hands while the data gets indexed.

On the other hand, if you're smart about where your data goes, break it into 'data lakes' of related content, and use the right tool for each type of data, you won't be using your enterprise search platform for use cases better served with analytics tools that are part of the Apache Zoo; and you’ll still be doing pretty well. And in that universe, Hadoop is just another data source for search - and not the slow pipe through which all of your data has to flow.

Do you agree?

 

*If you get the joke, chances are you know a bit about the Apache project and open source software. If not, you may want to hold off and research before you download your first Hadoop VM.   

 

July 21, 2014

What does it take to qualify as 'Big Data'?

If you've been on a deserted island for a couple of decades, you may not have heard the hot new buzz phrase: Big Data. And you may not have heard of "Hadoop", the application that accidentally solved the problem of Big Data.

Hadoop was originally designed as a way for the open source Nutch crawler to store its content prior to indexing. Nutch was fine for crawling sites; but if you wanted to crawl really massive data sets - say, the Internet - you needed a better way to store the content (thank goodness Doug Cutting didn't work at a database giant or we'd all be speaking SQL now!). GigaOm has a great series on the history of Hadoop (http://bit.ly/1jOMHiQ) that I recommend for anyone interested in how it all began and evolved.

After a number of false starts, brick walls, and subsequent successes, Hadoop is the technology that really enables what we now call 'big data' - usually written as "Big Data". But what does that mean? After all, there are companies with a lot of data, and there are companies with limited content that changes rapidly every day. But which of these really has data that meets the 'Big' definition?

Consider a company like AT&T or Xerox PARC, which licenses its technology to companies worldwide. As part of a license agreement, PARC agrees to defend its licensees if an intellectual property lawsuit ever crosses the transom. Both companies own tens of thousands of patents going back to their founding in the early 20th century. Just the digital content supporting those patents and inventions must number in the tens of millions of documents, much of it in formats no longer supported by any modern search platform. Heck, to Xerox, WordStar and Peachtext probably seem pretty recent! But about the only time they have to search that content is when they need to help defend a licensee against an IP claim. I don't know how often that is, but I'd bet it's less than a dozen times a year.

Now consider a retail giant like Amazon or Best Buy. In raw size, I'd bet Amazon has hundreds of millions of items to index: books, products, videos, tunes. Maybe more. But that's not what makes Amazon successful. I think it's the ability to execute billions of queries every day - again, maybe more - and return damn good results in well under a second, along with recommendations for related products. Best Buy actually has retail stores, so they have to keep not only purchase data but also buying patterns, so they know what products to stock in any given retail location.

A healthcare company like UnitedHealth must have its share of corporate intranet content. But unlike many corporations, a company like that must process millions of medical transactions every week: doctor visits, prescriptions, test results, and more. Not only do they need to process these transactions, they must also keep them around for legally defined durations.

Finally, consider a global telecom company like Ericsson or Verizon. They've got the usual corporate intranet, I'm sure. They have financial transactions like Amazon and UHG. But they also have telecom transaction records that must count in the billions a month: phone calls and more. And given the politics of the world, many of these transactions have to be maintained and searchable for months, if not years.

These four companies have a number of common traits with respect to search; but each has its own specific demands. Which ones count as 'big data' as it's usually defined? And which just have 'a bunch of content'?

As it turns out, that's a tough question. At one point there was a consensus that 'big data' required three things, known as the 'Three V's of Big Data'. This escalated to the '5 V's of Big Data', then the '7 V's' - and I've even seen some define the '10 V's of Big Data'. Wow... and growing!

Let's take a look at the various V's that are commonly used to define 'Big Data'. Depending on who you ask, there are four, five, seven, or more such 'requirements'. These are usually referred to as the "V's of Big Data", and they usually include:

Volume: The scale of your data - basically, how many 'entries' or 'items' you have. For Xerox, how many patents; for a telecom company, how many phone 'transactions' there have been.

Variety: Basically this means how many different types of data you have. Amazon has mouse clicks, product views, unique titles, subscribers, financial transactions and more. For UHG and Ericsson, I’d guess the majority of their content is transactional: phone call metadata (originating and receiving phone number, duration of the call, time of day, etc.). In the enterprise, variety can also mean data format and structure. Some claim that 90% of enterprise data is unstructured, which adds yet another challenge.

Veracity: This boils down to whether the data is trustworthy and meaningful. I remember a survey HP did years ago to find out which predictors were useful for knowing whether a person walking into a random electronics store would walk out with an HP PC. Using HP products at work or at home were the big predictors; but the fact that the most likely day was Tuesday was perhaps spurious and not very valuable.

Velocity: How fast is the data coming in and/or changing? Amazon has a pretty good idea on any given day how many transactions to expect, and Verizon knows how much call data to expect. But things change: a new product becomes available, or a major world event triggers many more phone calls than usual.

Viability: If you want to track trends, you need to know which data points are the most useful in predicting the future. A good friend of mine bought a router on Amazon, and Amazon reported that people who bought that router also bought... men's extra-large jeans. Now, he tells me he did think they were nice jeans, but that signal may not have had long viability.

Value: How useful or important is the data in making a prediction, or in improving business decisions. That was easy!

Variability: This often refers to how internally consistent the data is. For a data point to be an accurate predictor, it should ideally be consistent across the wide range of content. Blood pressure, for example, generally falls within a small range; and for a given patient it should be relatively consistent over time. When there is a change, UHG may want to understand the cause.

Visualization: Rows and columns of data can look pretty intimidating and it’s not easy to extract meaning from them. But as they say, ‘a picture is worth a thousand words’, so being able to see charts or graphs can help meaning and trends jump out at you.  I’d use Lucidworks’ SiLK product as an example of a great visualization tool for big data, but there are many others.

Validity: This seems like another way to say the data has veracity, but it may be a subtle point. If you’re recording click-thru data, or prescriptions, or intellectual property, you have to know that the data is accurate and internally consistent. In my HP anecdote above, is the fact that more people bought HP PCs on Tuesday a valid finding? Or is it simply noise? You’ll probably need a human researcher to make these kinds of calls.

Venue: With respect to Big Data, this means where the data came from and where it will be used. Content collected from automobiles and from airplanes may look similar in a lot of ways to the novice. In the same way, data from the public Internet versus data collected from a private cloud may look almost identical. But making decisions for your intranet based on data collected from Bing or Google may prove to be a risk.

Vocabulary: What describes or defines the various items of the data. Ericsson has to know which bits of data represent a phone number and which represent the time of day. Without some idea of the schema or taxonomy, we'll be hard pressed to reach reasonable decisions from Big Data.

Volatility: This may sound like velocity above, but volatility in Big Data really means how long the data stays valid and how long you need to keep it around. Healthcare companies may need to keep their data a lot longer than companies in other industries.

Vagueness: This final one is credited to Venkat Krishnamurthy of YarcData just last month at the Big Data Innovation Summit here in Silicon Valley.  In a way, it addresses the confidence we can have in the results suggested by the data. Are we seeing real trends, or are we witnessing a black swan?

In the application of Big Data, not all of these V's are equally valid or valuable to the casual (or serious) observer. But as in so many things, interpreting the data is up to the person making the call. Big Data is only a tool: use it wisely!

Some resources I used in collecting data for this article include the following web sites and blogs:

IBM’s Big Data & Analytics Hub 

MapR's Blog: Top 10 Big Data Challenges – A Serious Look at 10 Big Data V’s 

See also Dr. Kirk Borne’s Top 10 List on Data Science Central   

Bernard Marr’s LinkedIn post on The 5 Vs Everyone Must Know 

 

December 18, 2012

Last call for submitting papers to ESS NY

This Friday, December 21, is the last day for submitting papers and workshops to ESS NY, which runs May 21-22. See the information site at the Enterprise Search Summit Call for Speakers page.

If you work with enterprise search technologies (or supporting technologies), chances are the things you've learned would be valuable to other folks. If you have an in-depth topic, write it up as a 3-hour workshop; if you have a success story or lessons learned you can share, submit a talk for a 30-45 minute session.

I have to say, this conference has enjoyed a multi-year run of quality talks and excellent Spring weather... see you in May?

 

 

December 03, 2012

Why LucidWorks? And Why Now?

Big news for us here at New Idea Engineering. After 16 years as an independent search technology consulting company, we've become part of LucidWorks effective December 1, 2012.

For years we've focused on both the business and the technology of search. We've provided vendor-neutral consulting services to large and small organizations. We've worked with search platform companies to help tune their product capabilities and their message; we've helped companies implement enterprise search from 'the usual (vendor) suspects'. We've provided business best practices, data audits, and implementation oversight for dozens of companies over most of our time as an independent company.

As you know, the market for enterprise search has changed over the last several years. Verity, then Autonomy, FAST, Endeca, Exalead, ISYS, and more have been acquired by large companies, with varying levels of success. With these acquisitions the products have morphed to fit into the new owners' world view; we've politely referred to this shift in focus as the products being "distracted". Google, one of the few non-acquired engines, got into the market with a low-cost entry which has enjoyed great acceptance; but as the market changed, Google started raising its prices for the nifty yellow box. And while they pursue laudable offerings like phones, tablets, Google Glass, self-driving cars, and cloud computing, and simultaneously retool their ad model for the mobile world, it's fair to say that even their enterprise offerings are potentially distracted at times.

Sure, if your company has a typical use case for search, there's an engine or appliance for you. But so many of the complex projects we've seen are atypical, almost by definition. These high-end projects are no longer efficiently served by the commercial sector. Many have turned to open source offerings - not for the cost savings, as you might think, but out of a desire for extreme control and flexibility, free from vendor meddling and license nit-picking.

Over the same period, more and more people have realized that the need to understand and manage 'big data' is taking off. In fact, search is the interface of choice to find content in big data repositories. 

It's been about 10 years since we did our first project based on Lucene, the basis for nearly all modern open source search engines today. Since then, the capabilities of open source search have increased to the point where we honestly think Solr may be the best search platform available on the market today. 

We didn't call what we did with Lucene back then 'big data', but that's really what it was. Scalable, controllable, flexible, powerful... and open! And free for the taking - and modifying. Just add programmers.

A few years back, Lucid Imagination was started to provide that support, along with training and an easy-to-use interface that lets business owners - not just developers - use Solr search. We've called them "the RedHat of Open Source Search". Now Lucid Imagination has become LucidWorks, and it is set to be the best way to search web, file, and database content - with extreme control, and of course with big data.

A few months ago we spoke with Lucid CEO Paul Doscher about upping our contract with them, and about where they were going, and it just made sense to us at that time to join a bigger team.

While we're committed to success at LucidWorks, we'll continue to use our blog to discuss all aspects of enterprise search – vendors, tools, technologies, events, and trends.  Unlike our days at past search companies, this one is based on an open platform so we'll be able to share a lot more as we move forward.

We hope you'll find our posts interesting, helpful, and engaging. Let us know how we're doing.

 

April 30, 2012

Is Microsoft joining the Lucene/Solr dance?

Lucene Revolution is only 10 days away, and if you're not already planning on being in Boston, today's a great time to register.

Why be at the 3rd annual Lucene Revolution, Lucid Imagination's open source conference? Several reasons:

  • Open source search is hot, and Lucene/Solr is better than ever;
  • Lucid Imagination is just introducing their LucidWorks Enterprise 2.1 release;
  • Paul Doscher, recently of Exalead, is the new CEO and keynote speaker; and
  • Microsoft's Gianugo Rabellino is speaking about Lucene, Azure, and OSS.

Yes, you saw it here. A Microsoft Azure guy is speaking right after Paul Doscher Wednesday morning at Lucene Revolution. Has Microsoft caught the drift of the market towards Lucene/Solr in search, big data, and the cloud? Even search pundit Steven Arnold posted a few days back about Microsoft and Linux. Strange bedfellows perhaps, but there it is.

So yes, I think if you can find any way to get to Boston in a week, I'd say do it. See you there!

 

November 08, 2011

Are you spending too much on enterprise search?

If your organization uses enterprise search, or if you are in the market for a new search platform, you may want to attend our webinar next week "Are you spending too much for search?". The one hour session will address:

  • What do users expect?
  • Why not just use Google?
  • How much search do you need?
  • Is an RFI a waste of time?   

Date: Wednesday, November 16 2011

Time: 11AM Pacific Standard Time / 1900 UTC

Register today!

August 09, 2011

So how many machines does *your* vendor suggest for a 100,000,000+ document dataset?

We've been chatting with folks lately about really large data sets.  Clients who have a problem, and vendors who claim they can help.

But a basic question keeps coming up - not licensing, but "how many machines will we need?" Not everybody can put their data on a public cloud, and private clouds can't always spit out a dozen virtual machines to play with - plus duplicates of all that for dev and staging - so it's not quite as trivial as some folks think.

The Tier-1 vendors can handle hundreds of millions of docs, sure, but usually on quite a few machines, plus of course their premium licensing, and some non-trivial setup at that point.

And as much as we love Lucene, Solr, Nutch and Hadoop, our tests show you need a fair number of machines if you're going to turn around a half billion docs in less than a week.
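To give a feel for why the machine count climbs so quickly, here's a rough sizing sketch. The per-node indexing rate is a made-up assumption and the math ignores replication and the dev/staging copies mentioned above.

    # Rough sizing sketch - per-node throughput is hypothetical; adjust for your stack.
    docs_total = 500_000_000            # half a billion docs
    deadline_seconds = 7 * 86_400       # "less than a week"
    docs_per_node_per_sec = 50          # assumed sustained indexing rate per node
    nodes = docs_total / (deadline_seconds * docs_per_node_per_sec)
    print("roughly %.0f indexing nodes" % nodes)   # ~17 nodes, before any replicas

Double that for replicas, and again for dev and staging, and the hardware bill gets very real very quickly.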

And beyond indexing time, once you start doing 3 or 4 facet filters, you also hit another performance knee.

We've got 4 Tier-2 vendors on our "short list" that might be able to reduce machine counts by a factor of 10 or more over the Tier-1 and open source guys.  But we'd love to hear your experiences.

December 05, 2010

Share your successes at ESS East next May

Our friends over at InfoToday who run the successful Enterprise Search Summit conferences have asked us to announce that the date for submitting papers to their Spring show in New York in May 2011 has been extended until Wednesday, December 8. You can find out what they are looking for and how to submit your proposal online at http://www.enterprisesearchsummit.com/Spring2011/CallForSpeakers.aspx.

Michelle Manafy, who runs the program again next May, really likes to have speakers who have found creative and successful ways to select, deploy, or manage ongoing enterprise search operations. We've co-presented with several of our customers in the past, and trust me, it's great fun and not bad for your career! And - no promises - the weather at ESS East has been great just about every year, and we've been attending for nearly 6 years now!

A friend told me something years ago that I've always found helpful; I hope you'll take it to heart: 'Everything you know, someone else needs to know.' Don't worry if your search project isn't perfect, or that someone will find fault with what you've done. Trust me: there are many organizations newer to enterprise search than you are, and anything you found helpful will surely be valuable for them as well. And you get to attend all of the sessions, so you might learn something too! A 'win-win' situation if I've ever seen one!

See you in New York!

/s/Miles

 

 

July 22, 2010

Document filters webinar July 28 2010

ISYS is hosting a webinar on Wednesday, July 28 at 1 PM Eastern to talk about the role document filters play in successful search indexing and display. You can register now.

Of course, as a search technology company, ISYS has enjoyed great success, particularly among law enforcement where search has to work right at a reasonable price. We've always liked their technology and their approach.

But like every search platform, ISYS needed filters to convert so-called 'binary' formats like Microsoft Office, PDF, or even Photoshop files into a stream of text - after all, today's search platforms primarily operate on words in textual format. But when ISYS looked at the market at the time, they found that two of their competitors, Autonomy and Oracle, owned the best of the filter technologies.

Like any company, they made a 'make or buy' decision, and in their case, making their own filters was the right answer for them, and possibly for you. You see, ISYS decided to start selling their filter technology independent of their search platform, so now you can acquire some really great filtering and viewing technology for just about any search engine, 'off the shelf'. Their customers include other vendors with the need to extract text from various types of content, not just search vendors but also eDiscovery and eCompliance companies and many others who don’t want to pay excessive prices for technology - and who want really great filtering at a reasonable cost.

Then, a few years back, ISYS decided that open source platforms Lucene and Solr - which had no filters - needed them as well. So now you can buy a great filter pack 'off the shelf' with no huge volume commitment - no volume commitment at all! And you can get world class filtering for your open source search project.
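To illustrate what a document filter actually does - using the open source Apache Tika project purely as a sketch of the concept, not ISYS's own toolkit or API - extraction generally boils down to 'file in, text and metadata out':

    # Concept sketch using Apache Tika via the tika-python wrapper - the file name
    # is hypothetical; this is not ISYS's API.
    from tika import parser

    parsed = parser.from_file("quarterly_report.pdf")   # any 'binary' format
    text = parsed.get("content")      # the extracted stream of text for indexing
    meta = parsed.get("metadata")     # author, dates, content type, etc.
    print(meta.get("Content-Type"), len(text or ""))

Whatever filter you choose, that extracted text stream is what your search engine actually indexes, which is why filter quality matters so much to result quality.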

Come hear ISYS, the guys from Lucid Imagination, and us here at New Idea Engineering talk about the critical role of filters in your search applications. See you then!

/s/Miles

June 06, 2009

Impressions of first Lucene/Solr SF Meetup

Kudos to Carl, our NIE marketeer and de facto social director, for getting us to attend - well worth it, and it conveniently coincided with Gilbane.

The Good:

  • VERY entertaining, very informative.  Lots of good info about upcoming versions of Lucene and Solr, including additional performance tweaks.
  • A friendly, supportive bunch of like-minded nerds, and I mean this is the best possible way.
  • Also discussions of other related Apache projects.  We're all gonna need a cheat sheet pretty soon to keep track of it all.
  • Lucene/Solr will soon have implemented much of the core features of Autonomy IDOL, Endeca, FAST, etc.  They really ought to be spying.  :-)

Personally I think Otis & co. might wanna fly out for the next one.  I also think Dieselpoint ought to attend and talk about Open Pipeline.  If we get up enough energy maybe we could even volunteer to do that next time, we're on the board after all, but this is really Chris's baby.

The Not-so-Good:

  • About 50 terms that clients would not understand.  Don't get me wrong, we love the Map/Reduce, Bayesian, K-Means, SVD stuff, but most corporate clients would be lost.
  • Not much for Enterprise Packaging.  Ironically it's the mundane aspects of search, from a non-developer standpoint, that are still not on the horizon.  Not a criticism of the developers, they have what they need.
  • Not much about Nutch.  Nutch 1.0 is out, along with rumors of a revised admin GUI, but not much coverage here.

Impressions of Lucid Imagination:

This event was sponsored by Lucid, a company that recently got funding for bringing commercial packaging and services to the open source search world, and their senior staff includes quite a few of the core committers.

  • A very sincere bunch of guys.
  • They haven't sold their souls to corporate America; I think their "geek cred" is still well intact.
  • Probably will not be filling in enterprise packaging pot holes any time soon.
  • Do they understand the Enterprise Market?

Also a shout out to LinkedIn and IBM for giving back to open source community.

There was also an "open mic" segment, and I'd like to give a shout to Avi Rappaport - I agree 1,000%, "stop words bad!" (or at least the blind use of index time stop words)


Surprises:

  • Not much of a threat to Google Appliance, due to packaging.  Yes, Google scales with their Map/Reduce and relevancy algorithms, and the open source guys have responded, but that's not the stuff that makes Google tick these days.
  • And despite the impressive and rapidly evolving core technologies, also not a real threat to the other Tier One vendors like FAST and Autonomy.  More on this seeming contradiction in a bit.
  • The Tier 2 vendors of the world, Attivio, Exalead, Dieselpoint, etc. DO need to pay attention.  There is a place for Tier 2 vendors, but they need to mind what the open source products do and do not provide more carefully.
  • It's really cool to see IBM willing to contribute so aggressively to the open source search engines, even though they sell several of their own.  A naive person might think they are competing with themselves, sabotaging their own sales guys, but they're a lot smarter than that.  They aren't selling their commercial search products as pure search; those technologies are always part of a larger (and more expensive) grand business solution.  They know what they're doing!

For similar reasons, it's still not a huge threat to Autonomy, MS/FAST, Endeca, etc. on corporate services.  I said earlier that the Apache projects are implementing a lot of the "secret sauce" that launched Autonomy and Endeca, so you'd think this represents "a clear and present danger", but Mike Lynch's secret algorithms are not why people buy IDOL anymore.  Things like giant reference accounts, professional services, and commercial-grade spiders have a lot more to do with why big companies still pay six figures for search technology.

And speaking of surprises and Lucid Imagination, I wanna circle back to their PR a few months back when they got their funding and launched their company.  They talked about relevancy in their press releases!?  Wow... Yes, Lucene and Solr have some good traction there, but that specific competitive advantage has been used by almost every commercial search vendor in the past 15 years, including Verity, Autonomy and Google!

I would've expected them to say something like "we're gonna do for Lucene what RedHat did for Linux" - this would have been a very clear business-oriented proposition, though to be fair lots of companies have used that business model as well.  It wouldn't be original, but would be more business centric.  Then again, I'm not in Marketing, and their VC's obviously liked their pitch, so what do I know!

s/Mark