33 posts categorized "Open Source"

January 25, 2017

Lucidworks Fusion 3 Released!

Today Lucidworks announced the release of Fusion 3, packed with some very powerful capabilities that, in many ways, set a new standard in functionality and usability for enterprise search.

Fusion is tightly integrated with Solr 6, the newest version of the popular, powerful, and well-respected open source search platform. But the capabilities that really set Fusion 3 apart are the tools Lucidworks provides on top of Solr to reduce time-to-productivity.

It all starts at installation, which features a guided setup that allows staff who may not be familiar with enterprise search to get started quickly and to build quality, full-featured search applications.

Earlier versions of Fusion provided very powerful ‘pipelines’ that let users define a series of custom steps, or 'stages', to run during both indexing and searching. These pipelines let users add custom capabilities, but they generally required some programming and a deep understanding of search.

That knowledge still helps, but Fusion 3 comes with what Lucidworks calls the “Index Workbench” and the “Query Workbench”. These two GUI-driven applications let mere mortals set up capabilities that used to require a developer, and enable developers to create powerful pipelines in much less time.

What can a pipeline do? Let's look at two cases.

On a recent project, our client had a deep, well-developed taxonomy, and they wanted to tag each document with the appropriate taxonomy terms. In the Fusion 2.x Index Pipeline, we wrote code to evaluate each document, determine the relevant taxonomy terms, and then insert those terms into the document itself. This meant that at query time, no special effort was required to use the taxonomy terms in the query: they were part of the document.

Another common index time task is to identify and extract key terms, perhaps names and account numbers, to be used as facets.
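To make this concrete, here's a minimal sketch of the kind of logic an index-time enrichment stage performs. The document shape and the tiny taxonomy map are simplified assumptions for illustration, not Fusion's actual pipeline API:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of index-time enrichment: tag documents with taxonomy terms so
    // they are searchable and facetable later with no query-time effort.
    public class TaxonomyTagger {

        // Hypothetical taxonomy: text patterns mapped to taxonomy labels.
        static final Map<String, String> TAXONOMY = Map.of(
                "router", "Products/Networking",
                "invoice", "Finance/Accounts Payable");

        // A document here is just field name -> values; real pipelines use a richer type.
        static void tag(Map<String, List<String>> doc) {
            String body = String.join(" ", doc.getOrDefault("body", List.of())).toLowerCase();
            TAXONOMY.forEach((pattern, label) -> {
                if (body.contains(pattern)) {
                    doc.computeIfAbsent("taxonomy_terms", k -> new ArrayList<>()).add(label);
                }
            });
        }

        public static void main(String[] args) {
            Map<String, List<String>> doc = new HashMap<>();
            doc.put("body", List.of("Q3 invoice for wireless routers"));
            tag(doc);
            System.out.println(doc.get("taxonomy_terms")); // both labels match (order unspecified)
        }
    }

The same pattern covers the facet case: extract names or account numbers into a dedicated field, then declare that field a facet.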

The Index Workbench in Fusion 3 provides a powerful front-end to capabilities like these, which have long been part of Fusion but are now much easier for mere mortals to use.

The Query Workbench is similar, except that it operates at query time, making it easy to do what we’ve long called “query tuning”. Consider this: not every term a user enters for search is of equal importance. The Query Workbench lets a non-programmer tweak relevance using a point-and-click interface. In previous versions of Fusion, and in most search platforms, a developer needed to write code to do the same task.
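Under the covers, this kind of point-and-click relevance tweaking generally boils down to Solr query parameters such as eDisMax field boosts. Here's a hedged SolrJ sketch of doing the equivalent by hand; the collection name, fields, and boost values are assumptions:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class QueryTuningSketch {
        public static void main(String[] args) throws Exception {
            // Assumed local Solr instance with a 'products' collection.
            try (HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/products").build()) {
                SolrQuery q = new SolrQuery("wireless router");
                q.set("defType", "edismax");          // extended DisMax query parser
                q.set("qf", "title^5 brand^2 body");  // title matches count 5x more than body
                q.set("bq", "inStock:true^10");       // nudge in-stock items upward
                QueryResponse rsp = solr.query(q);
                System.out.println("Hits: " + rsp.getResults().getNumFound());
            }
        }
    }

The Query Workbench lets you adjust exactly these kinds of weights interactively and watch the result order change.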

Another capability in Fusion 3 addresses a problem everyone who has ever installed a search technology has faced: how to ensure that the production environment exactly mirrors the dev and QA servers. Doing so was a very detailed and tedious task; and any differences between QA and production could break something.

Fusion 3 has what Lucidworks calls Object Import/Export. This unique capability provides a way to export collection configurations, dashboards, and even pipeline stages and aggregations from a test or QA system; and reliably import those objects to a new production server. This makes it much easier to clone test systems; and more importantly, move search from Dev to QA and into production with high confidence that production exactly matches the test environment.
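In practice the round trip looks something like the sketch below. The endpoint paths and parameter names are illustrative assumptions rather than Fusion's documented API; the point is the shape of the workflow: export from QA to a file, then import that file on production:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Path;

    public class ObjectMigrationSketch {
        public static void main(String[] args) throws Exception {
            HttpClient http = HttpClient.newHttpClient();

            // 1. Export selected objects from the QA server to a local bundle.
            //    (Hypothetical endpoint and parameter names.)
            HttpRequest export = HttpRequest.newBuilder(URI.create(
                    "http://qa-host:8764/api/objects/export?collection.ids=products"))
                    .GET().build();
            http.send(export, HttpResponse.BodyHandlers.ofFile(Path.of("objects.zip")));

            // 2. Import the same bundle into production, so prod matches QA exactly.
            HttpRequest imp = HttpRequest.newBuilder(URI.create(
                    "http://prod-host:8764/api/objects/import"))
                    .POST(HttpRequest.BodyPublishers.ofFile(Path.of("objects.zip")))
                    .build();
            System.out.println("Import status: "
                    + http.send(imp, HttpResponse.BodyHandlers.ofString()).statusCode());
        }
    }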

Fusion 3 also extends the Graphical Administrative User Interface to manage pretty much everything your operations department will need to do with Fusion. Admin UIs are not new; but the Fusion 3 tool sets a new high bar in functionality.

There is one other capability in Fusion 3 enabled by a relatively new capability in Solr: SQL.

I know what you’re thinking: “Why do I want SQL in a full-text application?”

Shift your focus to the other end.

Have you ever wanted to generate a report that shows information about inventory or other content in the search index? Let’s say your business team needs inventory and product reports on content in your search-driven eCommerce data. The business team has tools they know and love for creating their own reports; but those tools operate on SQL databases.

This kind of reporting has always been tough in search, and typically required some custom programming to create the reports. With the SQL querying capabilities in Solr 6, and the security provided by Fusion 3, you may simply need to point your business team at the search index, verify their credentials, and connect via ODBC/JDBC, and their existing tools will work.
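If you're curious what that connection looks like, here's a minimal sketch using the JDBC driver that ships with SolrJ in Solr 6. The ZooKeeper address, collection name, and fields are assumptions; a BI tool would make the same connection through its ODBC/JDBC configuration instead of code:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SolrSqlReport {
        public static void main(String[] args) throws Exception {
            // Solr's JDBC driver connects through ZooKeeper to a SolrCloud collection.
            String url = "jdbc:solr://localhost:9983?collection=products";
            try (Connection con = DriverManager.getConnection(url);
                 Statement stmt = con.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT category, count(*) FROM products "
                         + "GROUP BY category ORDER BY count(*) DESC LIMIT 10")) {
                while (rs.next()) {
                    // Columns by position: 1 = category, 2 = the count
                    System.out.println(rs.getString(1) + ": " + rs.getLong(2));
                }
            }
        }
    }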

What Else?

Fusion 3 is an upgrade from earlier versions, so it includes Spark, an Apache tool with built-in modules for streaming, SQL, machine learning, and graph processing. It works fine on Solr Cloud, which enables massive indices and heavy query loads; not to mention failover in the event of hardware problems.

I expect that Fusion 3 documentation, and the ability to download and evaluate the product, will be on the Lucidworks site today at www.lucidworks.com. “Try it, you’ll like it”.

While we here at New Idea Engineering, a Lucidworks partner, can help you evaluate and implement Fusion 3, I’d also point out that our friends at MC+A, also Lucidworks partners, are hosting a webinar Thursday, January 26th. Use this link to register and attend the webinar: http://bit.ly/2joopQK.


Lucidworks CTO Grant Ingersoll will be hosting a webinar on Friday, February 1st. Read about it here.


/s/ Miles

November 01, 2016

One search to rule them all

(Originally published on LinkedIn)

Lucene was ‘born’ in 1999, created by Doug Cutting; and in 2005, it became a top-level Apache project. That year, Gartner announced that the ‘Leaders’ on their Enterprise Search Magic Quadrant included Autonomy, FAST, Endeca, IBM Omnifind, and Verity. The Google Search Appliance was right on the cusp between ‘Challengers’ and ‘Leaders’. Not many people knew about Lucene; and few who did saw it as much more than a quirky little project.

Just a year later, Yonik Seeley and his employer, CNET Networks, published and donated the Solr search server to the Apache Software Foundation, where it became an incubator project in 2006; the two projects soon merged into a single top-level Apache project. That same year, Gartner narrowed the ‘Leaders’ in their 2006 Magic Quadrant for Search to Autonomy (which acquired Verity the previous year), FAST, and Endeca.

Jump forward to the present. FAST is gone, acquired by Microsoft in 2008 and morphed into SharePoint Search. Hewlett-Packard acquired Autonomy in October of 2011, followed a few weeks later by Oracle’s acquisition of Endeca. Endeca is no longer available as a search platform; and Autonomy is mostly seen as a strategy to keep a large number of HP consultants fully employed, often on compliance applications.

Only a smattering of the commercial enterprise search platforms that flooded the market just a few years back still exist. While Gartner continues to list 14 or 15 products in their Enterprise Search Magic Quadrant, about the only pure commercial products we see any more are the Google Search Appliance and Recommind. And Google recently announced that the appliance is scheduled to go ‘end of life’ over the next few years. All of those bright yellow boxes become really nice Dell servers by the end of 2018.

A new crop of search platforms has grown to fill the void.

As an open source product, Solr has grown in its capabilities, and is now widely used for enterprise search and data applications in corporations and government projects. Solr Cloud extends it into a scalable, high-availability platform for demanding enterprise and data search applications.

Cloudera also bundles some interesting extra tools, including Solr, in their HUE bundle; free to download and free to use as long as you like. Cloudera ships a slightly older but stable release, 4.10; but with committers Yonik Seeley and Mark Miller on staff, I suspect they’re in a good position.

Hortonworks, a Cloudera competitor, also offers Solr/Solr Cloud in their releases, in partnership with Lucidworks - a company with a large number of committers on staff.

There are also three companies that have proprietary offerings based on open source technology.

Attivio, founded in 2007, is a “Leader” in the most recent Gartner Magic Quadrant for Enterprise Search. Their product, while not open source, nonetheless thrives by combining search, BI, data automation, analytics and more.

Elasticsearch has evolved into a strong platform for search and data analytics, and a number of organizations are finding it useful in some traditional enterprise search applications as well. Elastic has also integrated Kibana, a powerful graphical presentation tool that adds value for content analytics, not just search activity reporting.

Lucidworks Fusion is a relative newcomer to enterprise search. It includes many of the rich architectural features that enterprises expect, including a powerful crawler, connectors, and reporting. With its ‘Anda’ crawler and connectors, admin UI, and reporting, some people see it as a contender to replace the Google Search Appliance.

The one thing that all of these ‘proprietary’ products have in common? They are based on Apache Lucene to deliver critical functionality. And when you consider all of the web sites that use some form of Lucene for their site search, I think you'd agree that it really is a powerful little package. It’s available for virtually any operating system, and can be integrated using just about any programming language, from C/C++ to Java to Perl to Python to .NET.

Even more amazing is that these companies with commercial products based on Lucene – and who compete in the marketplace – actually cooperate when it comes time to fix bugs or add new capabilities to Lucene. Given all of the commercial players that have closed their doors – leaving their customers to find replacement platforms – we’ve reached the point where open-source-based software really is the safe choice. And universally, Lucene is the common element.

The quirky little search API Doug Cutting put together in 1999 has evolved into the platform that drives the leading search platforms used in big data, NoSQL, enterprise search, and search analytics. And it doesn’t seem likely to be phased out any time soon.

May 31, 2016

The Findwise Enterprise Search and Findability Survey 2016 is open for business

Would you find it helpful to benchmark your enterprise search operations against hundreds of corporations, organizations, and government agencies worldwide? Before you answer, would you find that information useful enough that you’d spend a few minutes answering a survey about your enterprise search practices? It seems like a pretty good deal to me: real-world data from people just like yourself worldwide.

The results of this survey are useful, insightful, and actionable for search managers everywhere, providing insight into many of the critical areas of search.

Findwise, the Swedish company with offices there and in Denmark, Norway, Poland, and London, is gathering data now for the 2016 version of their annual Enterprise Search and Findability Survey at http://bit.ly/1sY9qiE.

What sorts of things will you learn?

Past surveys give insight into the difference between companies with happy search users and those whose employees prefer to avoid using internal search. One particularly interesting finding last year was that there are three levels of ‘search maturity’, identifiable by how search is implemented across content.

The least mature search organizations, roughly 25% of respondents, have search for specific repositories (siloes), but they generally treat search as ‘fire and forget’, and once installed, there is no ongoing oversight.

More mature search organizations, representing about 60% of respondents, have one search for all silos; but maintaining and improving search technology gets very little staff attention.

The remaining 15% of organizations answering the survey invest in search technology and staff, and continuously attempt to improve search and findability. These organizations often have multiple search instances tailored for specific users and repositories.

One of my favorite findings a few years back was that a majority of enterprises have “one or less” full-time staff responsible for search; and yet a similar majority of employees reported that search just didn’t work. The good news? Subsequent surveys have shown that staffing search with as few as 2 FTEs improves overall search satisfaction; and 3 FTEs seem to strongly improve overall satisfaction. And even more good news: over the years, the trend in enterprise search shows that more and more organizations are taking search and findability seriously.

You can participate in the 2016 Findwise Enterprise Search and Findability Survey in just 10 or 15 minutes and you’ll be among the first to know what this year brings. Again, you’ll find the 2016 survey at http://bit.ly/1sY9qiE.

January 20, 2015

Your enterprise search is like your teenager

During a seminar a while back, I made this spontaneous claim. Recently, I made the comment again, and decided to back up my claim - which I’ll do here.

No, really – it’s true. Consider:

You can give your search platform detailed instructions, but it may or may not do things the way you meant:

Modern search platforms provide a console where you, as the one responsible for search, can enter all of the information needed to index content and serve up results. You tell it what repositories to index; what security applies to the various repositories; and how you want the results to look.  But did it? Does it give you a full report of what it did, what it was unable to do, and why?

You really have no idea what it’s doing – especially on weekends:

 Search platforms are notorious for the lack of operational information they provide.

Does your platform give you a useful report of which documents were indexed successfully and which were not – and why? And some platforms stop indexing files when they reach a certain size: do you know what content was not completely indexed?

When it does tell you, sometimes the information is incomplete: 

Your crawler tells you there were a bunch of ‘404’ errors because of a bad or missing URL; but will it tell you which page(s) had the bad link? Chances are it does not. 

They can be moody, and malfunction without any notice:

You schedule a full update of your index every weekend, and it has always worked flawlessly – as far as you know. Then, usually on a 3-day weekend, it fails. Why? See above.

When you talk to others who have search, theirs always sounds much better than yours:

As a conscientious search manager, you read about search, you attend webinars and conferences, and you always want to learn more. But you wonder why other search managers seem to describe their platforms in glowing terms, and never seem to have any of the behavioral issues you live with every day. It kind of makes you wonder what you’re doing wrong with yours.

It costs more to maintain than you thought and it always needs updates:

When you first got the platform you knew there were ongoing expenses you’d have to budget – support, training, updates, consulting. But just like your kid who needs books, a computer, soccer coaching, and tuition, it’s always more than you budgeted. Sometimes way more!

You can buy insurance, but it never seems to cover what you really need:

Bear with me here: you get insurance for your kids in case they get sick or cause an accident, and you buy support and maintenance for your search platform.  But in the same way that you end up surprised that orthodontics are not fully covered, you may find out that help tuning the search platform, or making it work better, isn’t covered by the plan you purchased – in fact, it wasn’t even offered. QED.

It speaks a different vocabulary:

You want to talk with your kid and understand what’s going on; you certainly don’t want to look uncool. But like your kid, your search platform has a vocabulary that only barely makes sense to you. You know rows and columns, and thought you understood ‘fields’; but the search platform uses words you know with definitions that don’t match what you’ve learned from databases or CMS systems.

It's hard for one person to manage, especially when it's new:

Many surveys show that most companies have one (or less) full-time staff responsible for running the search engine – while the same companies claim search is ‘critical’ to their mission.  Search is hard to run, especially in the first few years when everything needs attention. You can always get outside help – not unlike day care and babysitters – but it just seems so much better if you could have a team to help manage and maintain search to make it behave better.

How it behaves reflects on you:

You’re the search manager and you’ve got the job to make search work “just like Google”.  You spent more than $250K to get this search engine, and the fact that it just doesn’t work well reflects badly on you and your career. You may be worried about a divorce.

It doesn’t behave like the last one:

People tend to be nostalgic, as are many search managers I know. They learned how to take care of the previous one, but this new one – well, it’s NOTHING like the earlier one. You need to learn its habits and behaviors, and often adjust your own behavior to ensure peace at work.

You know if it messes up badly late at night, even on a weekend or a holiday, you’ll hear about it:

If customers or employees around the world use your search platform, there is no ‘down time’: when it’s having an issue, you’ll hear about it, and will be expected to solve the issue – NOW. You may even have IT staff monitoring the platform; but when it breaks in some odd and unanticipated way, you get the call. (And when does search ever fail in an expected way?)

 You may be legally responsible if it messes up:

Depending on what your search application is used for, you may find yourself legally responsible for a problem. Fortunately, the chances of you personally being at fault are slim, but if your company takes a hit for a problem that you hadn’t anticipated, you may have some ‘career risk’ of your own. Was secure content about the upcoming merger accidentally made public? Was content to be served only to your Swiss employees when they search from Switzerland exposed outside of the country? And you can’t even buy liability insurance for that kind of error.

When it’s good, you rarely hear about it; when it's bad, you’ll hear about it:

Seriously, how many of you have gotten a call from your CIO to tell you what a great experience he or she had on the new search platform? Do people want to take you to lunch because search works so well? If you answered ‘yes’ to either of these, I’d like to hear from you!

In my experience, people only go out of their way to give feedback on search when it’s not working well. It’s not “like Google”. Even though Google has hundreds of people and ‘bots’ examining every search query to try to make the results better, you have only yourself and an IT guy.

You’ll hear. 

The work of managing it is never done:

The wonderful southern writer Ferrol Sams wrote:

“He's a good boy… I just can't think of enough things to tell him not to do.” Sound like your search platform? It will misbehave (or fail outright) in ways you never considered, and your search vendor will tell you “We’ve never seen a problem like that before”. Who has to get it fixed? You have to ask?

Once it moves away, you sometimes feel nostalgic:

Either you toss it out, or a major upgrade from your vendor comes along and the old search platform gets replaced. Soon, you’re wishing for the “good old days”, remembering how cute and quirky the old one was, and you find yourself feeling nostalgic and wishing that it didn’t have to move out.

Do you agree with my premise? What have I missed?

September 18, 2014

Lucidworks ships Fusion 1.0 - Pretty exciting next gen platform.

OK, I've known about this coming for a while, just didn't know when until this afternoon - so I stayed up late to get the download started after midnight.

Fusion is more than an updated release of Lucidworks Search. It is Solr based, but it's a re-write from top to bottom. And it's not a bare bones search API only a developer can love. Connectors? Check. Security? Check. Analytics? Check. Entity extraction? Check. All included. 

But what it adds is where the real capabilities and contributions are. Machine learning? Check. Admin console? Check. Log analytics? Check. A document pre-processing pipeline? Check. Deep signal processing (think 'automated context processing')? Check.

Even if you think these new capabilities are not your style, you can buy Solr support and still get licenses for connectors, entity extraction, and a handful of other formerly 'premium' products. Want it all? License the full product at a per-node price I always thought was underpriced. I'm sure you'll be hearing a lot more in the coming days and weeks, but go - download - try - and see what it does for your sites. Your developers will love it, your business owners will love it, your users will love it, and I bet even your CFO will love it.

Full disclosure: I am a former employee of Lucidworks; but I'd be just as excited even if I were not. Go download it for sure and try it on your content. But be sure to check out the 'search as killer app' video on Lucid's home page, www.lucidworks.com.

/s/ Miles


September 09, 2014

Sometimes you're just wrong! (Maybe).

OK, this one falls into the 'eat your own words' category, so I have to come clean. Well, partly clean. Let me explain.

I was out of town last week, but just before I left I wrote an article asserting that Elasticsearch really isn't 'enterprise' search. The article drew a lot of attention and comments from both sides of the argument. I have to say I still think that's the case, but an announcement from Microsoft suggests otherwise, and ends up a net positive for Elasticsearch. Microsoft tells us that Elasticsearch is the platform under the covers of Microsoft's Azure Search offering. It looks like you have a couple of options - as long as you're on Azure:

a) You can download and use the open source Elasticsearch platform available on GitHub; or

b) Use Microsoft's managed service 'Facetflow Elasticsearch' which incorporates (some of) the open source code in various places.

Microsoft calls this "a fully-managed real-time search and analytics service" while, according to ZDNet, it is for 'web and mobile application developers looking to incorporate full-text search into their applications'. 

Either way, it's certainly yet another step forward for Elasticsearch, and a big step forward in visibility for the company. It's not clear what kind of revenue they will receive from the deal - Microsoft being relatively famous for being quite frugal. And after all, smart search folks like Kevin Green of Spantree Technology Group talk about its strengths and liabilities, saying it *is* fast ('wicked fast'), fault-tolerant, distributed, and more. But it is not a crawler, a machine learner, or a user-facing front end, and it is not secure.

So I'll agree a partial 'mea culpa' is in order; adding capabilities to an open source project can make it more enterprise ready. But I think the jury may still be out on the rest of my piece. Stay tuned!

July 21, 2014

What does it take to qualify as 'Big Data'?

If you've been on a deserted island for a couple of decades, you may not have heard the hot new buzz phrase: Big Data. And you may not have heard of "Hadoop", the application that accidentally solved the problem of Big Data.

Hadoop was originally designed as a way for the open source Nutch crawler to store its content prior to indexing. Nutch was fine for crawling sites; but if you wanted to crawl really massive data sets – say, the Internet – you needed a better way to store the content (thank goodness Doug Cutting didn’t work at a database giant or we’d all be speaking SQL now!). GigaOm has a great series on the history of Hadoop (http://bit.ly/1jOMHiQ) that I recommend for anyone interested in how it all began and evolved.

After a number of false starts, brick walls, and subsequent successes, Hadoop is a technology that really enables what we now call ‘big data’ – usually written as “Big Data”. But what does this mean? After all, there are companies with a lot of data – and there are companies with limited content that changes rapidly every day. But which of these really have data that meets the ‘Big’ definition?

Consider a company like AT&T or Xerox PARC, which licenses its technology to companies worldwide. As part of a license agreement, PARC agrees to defend its licensees if an intellectual property lawsuit ever crosses the transom. Both companies own tens of thousands of patents going back to their foundings in the early 20th century. Just the digital content to support these patents and inventions must number in the tens of millions of documents, much of it in formats no longer supported by any modern search platform. Heck, to Xerox, WordStar and Peachtext probably seem pretty recent! But about the only time they have to search that content is when they need to defend a licensee against an IP claim. I don’t know how often that is, but I’d bet less than a dozen times a year.

Now consider a retail giant like Amazon or Best Buy. In raw size, I’d bet Amazon has hundreds of millions of items to index: books, products, videos, tunes. Maybe more. But that’s not what makes Amazon successful. I think it’s the ability to execute billions of queries every day – again, maybe more – and return damn good results in well under a second, along with recommendations for related products. Best Buy actually has retail stores, so they have to keep purchase data, but also buying patterns, so they know what products to stock in any given retail location.

A healthcare company like UnitedHealth must have its share of corporate intranet content. But unlike many corporations, these companies must process millions of medical transactions every week: doctor visits, prescriptions, test results, and more. They need to process these transactions, but they also must keep these transactions around for legally defined durations.

Finally, consider a global telecom company like Ericsson or Verizon. They’ve got the usual corporate intranet, I’m sure. They have financial transactions like Amazon and UHG. But they also have telecom transaction records that must count in the billions a month: phone calls and more. And given the politics of the world, many of these transactions have to be maintained and searchable for months, if not years.

These four companies have a number of common traits with respect to search; but each has its own specific demands. Which ones count as ‘big data’ as it’s usually defined? And which just have ‘a bunch of content’?

As it turns out, that’s a tough question. At one point, there was a consensus that ‘big data’ required three things, known as the ‘Three V’s of Big Data’. This escalated to the ‘5 V’s of Big Data’, then the ‘7 V’s’ – and I’ve even seen some define the ‘10 V’s of Big Data’. Wow… and growing!

Let’s take a look at the various “V’s” that are commonly used to define ‘Big Data’.

Depending on who you ask, there are four, five, seven or more ‘requirements’ that define ‘big data’. These are usually referred to as the “V’s of Big Data”, and they usually include:

Volume: The scale of your data – basically, how many ‘entries’ or ‘items’, you have. For Xerox, how many patents; for a telecom company, how many phone ‘transactions’ have there been.   

Variety: Basically this means how many different types of data you have. Amazon has mouse clicks, product views, unique titles, subscribers, financial transactions and more. For UHG and Ericsson, I’d guess the majority of their content is transactional: phone call metadata (originating and receiving phone number, duration of the call, time of day, etc.). In the enterprise, variety can also mean data format and structure. Some claim that 90% of enterprise data is unstructured, which adds yet another challenge.

Veracity: This boils down to whether the data is trustworthy and meaningful. I remember a survey HP did years ago to find out what predictors were useful to know whether a person walking into a random electronics store would walk out with an HP PC. Using HP products at work or at home were the big predictors; but the fact that the most likely day was Tuesday was perhaps spurious and not very valuable.

Velocity: How fast is the data coming in and/or changing? Amazon has a pretty good idea on any given day how many transactions they can expect, and Verizon knows how much call data they can expect. But things change: a new product becomes available, or a major world event triggers many more phone calls than usual.

Viability: If you want to track trends, you need to know what data points are the most useful in predicting the future. A good friend of mine bought a router on Amazon; and Amazon reported that people who bought that router also bought… men’s extra large jeans. Now, he tells me he did think they were nice jeans, but that signal may not have had long viability.

Value: How useful or important is the data in making a prediction, or in improving business decisions. That was easy!

Variability: This often refers to how internally consistent the data is. For a data point to serve as an accurate predictor, it should ideally be consistent across the full range of content. Blood pressure, for example, is generally in a small range; and for a given patient, should be relatively consistent over time. When there is a change, UHG may want to understand the cause.

Visualization: Rows and columns of data can look pretty intimidating and it’s not easy to extract meaning from them. But as they say, ‘a picture is worth a thousand words’, so being able to see charts or graphs can help meaning and trends jump out at you.  I’d use Lucidworks’ SiLK product as an example of a great visualization tool for big data, but there are many others.

Validity: This seems like another way to say the data has veracity, but it may be a subtle point. If you’re recording click-thru data, or prescriptions, or intellectual property, you have to know that the data is accurate and internally consistent. In my HP anecdote above, is the fact that more people bought HP PCs on Tuesday a valid finding? Or is it simply noise? You’ll probably need a human researcher to make these kinds of calls.

Venue: With respect to Big Data, this means where the data came from and where it will be used. Content collected from automobiles and from airplanes may look similar in a lot of ways to the novice. In the same way, data from the public Internet versus data collected from a private cloud may look almost identical. But making decisions for your intranet based on data collected from Bing or Google may prove to be a risk.

Vocabulary: What describes or defines the various items of the data. Ericsson has to know which bits of data represent a phone number and which represent the time of day. Without some idea of the schema or taxonomy, we’ll be hard pressed to reach reasonable decisions from Big Data.

Volatility: This may seem like velocity above, but volatility in Big Data really means how long the data stays valuable and how long you need to keep it around. Healthcare companies may need to keep their data a lot longer than companies in other industries.

Vagueness: This final one is credited to Venkat Krishnamurthy of YarcData just last month at the Big Data Innovation Summit here in Silicon Valley.  In a way, it addresses the confidence we can have in the results suggested by the data. Are we seeing real trends, or are we witnessing a black swan?

In the application of Big Data, not all of these various V’s are equally valid or valuable to the casual (or serious) observer. But as in so many things, interpreting the data is up to the person making the call. Big Data is only a tool: use it wisely!

Some resources I used in collecting data for this article include the following web sites and blogs:

IBM’s Big Data & Analytics Hub 

MapR's Blog: Top 10 Big Data Challenges – A Serious Look at 10 Big Data V’s 

See also Dr. Kirk Borne’s Top 10 List on Data Science Central   

Bernard Marr’s LinkedIn post on The 5 Vs Everyone Must Know 


May 14, 2013

Open Source Search Myth 5 - Total Cost of Ownership

This is part of a series addressing the misconception that open source search is too risky for companies to use. You can find the introduction to the series here; and Part 4, Features and Capabilities, here.

Part 5: Total Cost of Ownership

Total cost of ownership, or TCO, is a big deal to large users of search technology. Usually, the component of TCO people focus on with search is the license fee; enterprise search was historically an expensive proposition. But in fact there are other major components of TCO: implementation and operations, hardware costs, and ongoing support come to mind.

Walter Underwood, one of the key developers at Ultraseek and later the guy who did the Netflix relevancy contest, once explained the difference between commercial and open source search. Let me paraphrase: 

"With commercial search, you spend a lot of money to license it; then you spend a lot of money to implement it.

With open source search, you download the software for free; then you spend a lot of money implementing it."

But there is another big element: how much iron do you need? A few years ago we helped a company switch search platforms. Their business was search-enabling small-town newspaper archives going back to the 1890s, via OCR'd content. They add tens of thousands of documents - historical newspaper articles - every day.

The commercial platform they replaced required major expense in new servers as their content grew. Every year.

As it turns out, the ROI for swapping out their old search engine was easy: they needed less new hardware every year than with the old engine. And so much less that the ROI period was less than a year.

A different project we did when we were still doing business as New Idea Engineering involved a comparison between search in Microsoft SharePoint 2010 and search with Solr. Our customer wanted to know if the switch would, indeed, require fewer servers to do the job. It turns out that it was quite reasonable to replace the 12 servers Microsoft FAST required with 6 or fewer servers running Solr. Half the cost of servers; half the cost of energy; half the cost of maintenance. Like the concept?

Now, I'll agree that LucidWorks - my employer - markets a proprietary search platform based on Solr. And we do not license the product for free. But compared to most commercial platforms, LucidWorks Search is pretty darned reasonable. And you still get the cost savings in energy, iron, and scalability.

Less hardware. Better search. How is the TCO of open source a liability compared to most commercial search platforms?


March 20, 2013

Open Source Search Myth 3: Skills Required In-House

This is part of a series addressing the misconception that open source search is too risky for companies to use. You can find the introduction to the series here; this is Part 3 of the series; for Part 2 click Potentially Expensive Customizations.

Part 3: Skills Required In-House

One of the hallmarks of enterprise software in general is that it is complex. People in large organizations who manage instances of enterprise search are no less likely than their non-technical peers to believe that "if Google can make search so good on the internet, enterprise search must be trivial". Sadly, that is the killer myth of search.

Google on the internet - or Bing or Baidu or whichever site you use and love - is good because of the supporting technology, NOT simply because of search. I'd wager that most of what people like about Google et al has very little to do with search and a great deal to do with constant monitoring and tweaking of the platform.

Consider: at the Google 'command line' (the search box), you can type in an arithmetic equation such as "2+3" and get 5. You can enter a FedEx tracking number and get a suggestion to link to FedEx for information. It's cool that Google provides those capabilities and others; but those features are there because Google has programs looking at search behavior for all of its users every day in order to understand user intent. When something unusual comes up, humans get involved and make judgments. When it makes sense, Google implements another capability - in front of the search engine, not within it.

Enterprise search is the same - except that very few companies invest money in managing and running their search; so no matter how well you tune it at the beginning, quality deteriorates over time. Enterprise search is not 'fire and forget'.

Any company that rolls out a mission-critical application and does NOT have their own skilled team in house is going to pay a consulting firm thousands of dollars a day forever. 'Nuff said.


March 18, 2013

Solr 4 Training 3/27 in Northern Virginia/DC area

Interrupting my series on whether open source search is a good idea in the enterprise to tell you about an opportunity to attend LucidWorks' Solr Bootcamp in Reston, Virginia on Wednesday March 27. Lucid staff and Lucene/Solr committers Erick Erickson and Erik Hatcher will be there, along with Solr pro Joel Bernstein. Heck, I'll even be there!

The link is here; readers of our blog can use discount code SOLR4VA-5OFF for a discount.

Course Outline:

  • What's new in Solr 4
  • Solr 4 Functional Overview
  • Solr Cloud Deep Dive
  • Solr 4 Expert Panel Case Studies
  • Workshop and Open lab

And ask the guys how you can get involved in Solr as a contributor or committer!