14 posts categorized "eCommerce"

January 06, 2020

It's a new year: Time for better metadata!

The new year is a time when most of us resolve to make changes in our personal lives: losing weight, exercising more, spending more time with a spouse and/or the kids. We start the year with great energy to meet our goals, but sadly many of us fall short through the year.

This often happens in the enterprise as well. Improving internal search is a common resolution at the time of the year. For eCommerce sites, January generally means fewer site visitors once the holiday rush is done; so making changes won’t have a great impact on sales. For corporations, it’s a time of new budgets and great expectations: and more than a few of the clients I’ve we’ve worked with over the years tell me how poorly their internal search performs compared to the public search sites like Google, Bing, and DuckDuckGo. Why do these search platforms work so well? And why can’t your site search match their success? It’s a numbers game. By definition, public search platforms index millions of sites; and many of these contain similar if not identical content. This makes is easy to find what you’re looking for because thousands of sites have relevant results for just about any query you may try.

Intranet sites are different, Usually, there is only one page with the information you are looking for. But often, content authors, who have read about how to promote consent on Google, will add keywords using Microsoft Word’s “Properties” field in an effort to promote their documents. This attempt to ‘game’ the internal search platform generally interferes with the platform’s relevance functions and results in poor result relevance. Even the Document Properties the Microsoft Word provides can interfere with search effectiveness.

Years ago, we were working with a client who was interested in knowing which employees were contributing to the intranet content. When the data was processed, it turned out that an Administrative Assistant in Marketing had authored more documents than anyone else in the corporation. After a quick review, we discovered why this one person was apparently more prolific than any other employee. That person had created all of the template forms used throughout the company, so the Word Document Properties listed that employee’s name as the author of virtually every standard template throughout the company.

So in the spirit of the new year, I’d suggest that you spend a day or two performing a data audit to discover where your content – or lack thereof – is negatively impacting your enterprise search results. And if you find any doozies – I’d love to hear about it!

 

 

December 10, 2019

A Working Vacation

The month of January is associated with the Roman god Janus who, with two heads, could look forward and back. That said, I find December a quiet time that provides the opportunity to review the current year and to plan the coming new year. As I tweeted yesterday at @miles_kehoe, this is the most stressful time of the year for most sites focused on eCommerce. Changes are generally 'off-limits' - even an hour offline can put a dent in sales.

But for those responsible for corporate internal and public-facing sites, this is the time to review content, identify potential changes, and even new content. And if planned well, the holidays are often a great time to update intranet sites: from late November through the new year, activity tends to slow for more corporate sites. Both IT and content staff should be using this quiet time to make changes, from updates to current content - the new vacation schedule is just one the comes to mind - to minor restructuring. (Note: while the holidays are a great time to roll out major changes, these should have been in planning months ago: it's a holiday, not a sabbatical!)

For the search team, this is time to review search activity: top queries, zero hits, misspellings, and synonyms come to mind as a minimum effort. It's also a good time to identify popular content, as well as content that was either never part of any search result or was included in result lists but never viewed.


So - December is nearly half over: take advantage of what is normally a quiet time for intranets and make that site better!

Happy Holidays!

 

January 20, 2015

Your enterprise search is like your teenager

During a seminar a while back, I made this spontaneous claim. Recently, I made the comment again, and decided to back up my claim - which I’ll do here.

No, really – it’s true. Consider:

You can give your search platform detailed instructions, but it may or may not do things the way you meant:

Modern search platforms provide a console where you, as the one responsible for search, can enter all of the information needed to index content and serve up results. You tell it what repositories to index; what security applies to the various repositories; and how you want the results to look.  But did it? Does it give you a full report of what it did, what it was unable to do, and why?

You really have no idea what it’s doing – especially on weekends:

 Search platforms are notorious for the lack of operational information they provide.

Does your platform give you a useful report of what content was indexed successfully, and which were not – and why? And some platforms stop indexing files when they reach a certain size: do you know what content was not completely indexed?

When it does tell you, sometimes the information is incomplete: 

Your crawler tells you there were a bunch of ‘404’ errors because of a bad or missing URL; but will it tell you which page(s) had the bad link? Chances are it does not. 

They can be moody, and malfunction without any notice:

You schedule a full update of you index every weekend, and it has always worked flawlessly – as far as you know. Then, usually on a 3-day weekend, it fails. Why? See above.

When you talk to others who have search, theirs always sounds much better than yours:

As a conscientious search manager, you read about search, you attend webinars and conferences, and you always want to learn more. But you wonder why other search mangers seem to describe their platform in glowing terms, and never seem to have any of the behavioral issues you live with every day. It kind of makes you wonder what you’re doing wrong with yours.

It costs more to maintain than you thought and it always needs updates:

When you first got the platform you knew there we ongoing expenses you’d have to budget – support, training, updates, consulting. But just like your kid who needs books, a computer, soccer coaching, and tuition, it’s always more than you budgeted. Sometimes way more!

You can buy insurance, but it never seems to cover what you really need:

Bear with me here: you get insurance for your kids in case they get sick or cause an accident, and you buy support and maintenance for your search platform.  But in the same way that you end up surprised that orthodontics are not fully covered, you may find out that help tuning the search platform, or making it work better, isn’t covered by the plan you purchased – in fact, it wasn’t even offered. QED.

It speaks a different vocabulary:

You want to talk with your kid and understand what’s going on; you certainly don’t want to look uncool. But like your kid, your search platform has a vocabulary that only barely makes sense to you. You know rows and columns, and thought you understood ‘fields’; but the search platform uses words you know but that don’t seem to be the same definition you’ve known from databases or CMS systems.

It's hard for one person to manage, especially when it's new:

Many surveys show that most companies have one (or less) full-time staff responsible for running the search engine – while the same companies claim search is ‘critical’ to their mission.  Search is hard to run, especially in the first few years when everything needs attention. You can always get outside help – not unlike day care and babysitters – but it just seems so much better if you could have a team to help manage and maintain search to make it behave better.

How it behaves reflects on you:

You’re the search manager and you’ve got the job to make search work “just like Google”.  You spent more than $250K to get this search engine, and the fact that it just doesn’t work well reflects badly on you and your career. You may be worried about a divorce.

It doesn’t behave like the last one:

People tend to be nostalgic, as are many search managers I know. They learned how to take care of the previous one, but this new one – well, it’s NOTHING like the earlier one. You need to learn its habits and behaviors, and often adjust your behavior to insure peace at work.

You know if it messes up badly late at night, even on a weekend or a holiday, you’ll hear about it:

If customers or employees around the world use your search platform, there is no ‘down time’: when it’s having an issue, you’ll hear about it, and will be expected to solve the issue – NOW. You may even have IT staff monitoring the platform; but when it breaks in some odd and unanticipated way, you get the call. (And when does search ever fail in an expected way?)

 You may be legally responsible if it messes up:

Depending on what your search application is used for, you may find yourself legally responsible for a problem. Fortunately, the chances of you personally being at fault are slim, but if your company takes a hit for a problem that you hadn’t anticipated, you may have some ‘career risk’ of your own. Was secure content about the upcoming merger accidentally made public? Was content to be served only to your Swiss employees when they search from Switzerland exposed outside of the country? And you can’t even buy liability insurance for that kind of error.

When it’s good, you rarely hear about it; when it's bad, you’ll hear about it:

Seriously, how many of you have gotten a call from your CIO to tell you what a great experience he or she had on the new search platform? Do people want to take you to lunch because search works so well? If you answered ‘yes’ to either of these, I’d like to hear from you!

In my experience, people only go out of their way to give feedback on search when it’s not working well. It’s not “like Google”. Even though Google has hundreds or people and ‘bots’ examining every search query to try to make the result better, and you have only yourself and an IT guy.

You’ll hear. 

The work of managing it is never done:

The wonderful southern writer Ferrol Sams wrote :

“He's a good boy… I just can't think of enough things to tell him not to do.” Sound like your search platform? It will misbehave (or fail outright) in ways you never considered, and your search vendor will tell you “We’ve never seen a problem like that before”. Who has to get it fixed? You have to ask?

Once it moves away, you sometimes feel nostalgic:

Either you toss it out, or a major upgrade from your vendor comes alone and the old search platform gets replaced. Soon, you’re wishing for the “Good old days” when you knew how cute and quirky the old one was, and you find yourself feeling nostalgic for it and wishing that it didn’t have to move out.

Do you agree with my premise? What  have I missed?

February 14, 2013

A paradigm shift in enterprise search

I've been involved in enterprise search since before the 'earthquake World Series' between the Giants and the A's in 1989. While our former company became part of LucidWorks last December, we still keep abreast of the market. But being a LucidWorks employee has brought me to a new realization: commercial enterprise search is pretty much dead.

Think back a few years: FAST ESP, Autonomy IDOL (including the then-recently acquired Verity), Exalead, and Endeca were the market. Now, every one of those companies has become part of a larger business. Some of the FAST technology lives on, buried in SharePoint 2013; Autonomy has suffered as part of HP because - well, because HP isn't what it was when Bill and Dave ran it. Current management doesn't know what they have in IDOL, and the awful deal they cut was probably based on optimistic sales numbers that may or may not have existed. Exalead, the engine I hoped would take the place of FAST ESP in the search market is now part of Dassault and is rarely heard of in search. And Endeca, the gem of a search platform optimized for the lucrative eCommerce market, has become one of three or four search-related companies in the Oracle stable. 

Microsoft is finally taking advantage of the technology acquired in the FAST acquisition for SharePoint 2013, but as long as it's tied to SharePoint - even with the ability to index external content - it's not going to be an enterprise-wide distribution - or a 'big data' solution. SharePoint Hadoop? Aslongf as you bring SQL Server. Mahout? Pig? I don't think so. There are too many companies that want or need Linux for their servers rather than Windows.

Then there is Google, the ultimate closed-box solution. As long as you use the Google search button/icon, users are happy – at least at first. If you have sixty guys named Sarah? Maybe not.

So what do we have? A few good options generally from small companies that tend to focus on hosted eCommerce - SLI Systems and Dieselpoint; and there’s Coveo, a strong Windows platform offering.

Solr is the enterprise search market now. My employer, LucidWorks, was the first, and remains the primary commercial driver to the open source Apache project. What's interesting is the number of commercial products based on Solr and it's underlying platform, Lucene.

Years ago, commercial search software was the 'safe choice'. Now I think things have changed: open source search is the safe choice for companies where search is mission. Do you agree?

I'll be writing more about why I believe this to be the case over the coming weeks and months: stay tuned.

/s/Miles

 

September 18, 2012

Better Handling of Model Numbers and Software Versions

In a recent post I talked about different ways to tokenizer your data.  Today I'll extend that by talking about tokenizing text that has a small amout of structure in it, and the relationship between tokenziation and Entity Extraction.

Although this post talks about eCommerce related items Model Numbers and Version Numbers, this same logic could also be applied to dates, amounts of money, social security numbers, phone numbers, ISBN numbers, patent references, legal citations, etc. 

Better handling of Model Numbers

Note: Parts suppliers often have product names that look more like model numbers, so they might benefit from this as well.

It would be possible use field specific tokenization rules, in conjunction with search time logic, to allow for more superior partial matches. In a manner somewhat analogous to the previous section (Improved Tokenization for Punctuation), structured product names could be broken down into components, and also maintained in their original form, and overload the tokens in the index.

Search time patterns could also possibly enhance this search logic.

Potential advantages:

  • Ability to rank more exact matches (if the user types the longer form)
  • More predictable partial matches
  • Could enable normalized sorting and field collapsing
  • Could link from more specific to less specific and vice versa
  • Possibly improve autosuggest searches
  • Avoid use of wildcards (although this isn't a problem in some search engines)

Normalizing Version Numbers

Technical websites have a great deal of software and drivers, with many version numbers. Similar to the methods suggested for model numbers, these special numbers could be recognized and normalized as they're added to the index. Potential advantages:
  • Allow for proper version number sorting within one software component or driver (there is no absolute scale that’s comparable across disparate software)
  • Allow for proper partial matches
  • Allow for proper range searches
  • Possibly add an additional sentinel tokens for “latest” Entity Extraction / Normalization

Depending on the search engine, there might not be much implementation difference between normalizing model and version numbers as mentioned previously, and doing full entity extraction. However, regardless of implementation similarity, designing for full Entity Extraction elicits a more complete functional spec and UI treatment.

Benefits of full Entity Extraction over simple normalized tokenization:

  • Usually includes using the extracted entities in Faceted Navigation. If some silos already have good metadata for Facets but other silos lack it, this might allow those other silos to have almost comparable values for the same data type (via extraction vs. well defined metadata) and have more consistence coverage for faceted search.
  • Encourages further thought as to the preferred canonical internal representation and display format for each type of entity.
There is one potential issue with the first point, using entity extraction for faceted search: the text of a document may reference many valid entities, while the document itself is only primarily related to one or two of them, so there may be a tendency towards “Facet Inflation”. This can sometimes be mitigated by having several classes of the same facet, but where the scope of one type is more heavily restricted by having it only pull values from key parts of the document, such as title or model number.

September 11, 2012

Are you Tracking MRR? - "Mean Reciprocal Rank" Trend Monitoring

MRR is a simple numerical technique to monitor the overall relevancy performance of search engines over time. It is based on click-throughs in the search results, where a click on the top document is scored as 100%, a click on the second document is 50%, 3rd document is 33%, etc. These numbers are collected and averaged over units of time.

The absolute value of MRR is not necessarily the important statistic because each site has different content, different classes of users, and different search technology. However, the trend of MRR over time can allow a site to spot changes quickly. It can also be used to score "A/B" testing.

There are certainly more elaborate algorithms that can be used, and MRR doesn’t account for whether a user liked the document once they opened it. But having even possibly imperfect performance data that can be trended over time is better than having nothing.

Reference: http://en.wikipedia.org/wiki/Mean_reciprocal_rank

Walter Underwood (of Ultraseek, Netflix, MarkLogic fame) gave a presentation (in PPT/PowerPoint) of this topic a couple years ago about NetFlix's use of MRR.

January 11, 2012

Webinar: What users want from enterprise search in 2012

If you ask the average enterprise user what he or she wants from their internal search platform, chances are good that they will tell you they want search 'just like Google'. After all, people are born with the ability to use Google; why should they need to learn how to use their internal search?

The problem is that web search works so well because, at the sheer scale of the internet, search can take advantage of methodologies that are not directly applicable to the intranet. Yet many of the things that make the public web experience so good can, in fact, be adapted in the enterprise. Our opinion is that, beyond a base level, the success of any enterprise search platform depends on how it is implemented and managed rather than on the core technology.

In this webinar we'll talk about what users want, and how you can address the specific challenges of enterprise content and still deliver a satisfying and successful enterprise search experience inside the firewall.

Register today for our first webinar of the new year scheduled for January 25 : What enterprise users want from search in 2012.

 

 

 

 

 

 

December 12, 2011

New Phrase for determining Sentiment Analysis / Customer Interest

If you lookup:

fedex "Package not due for delivery"

which is one of the status messages you can get when tracking a package, you'll see a lot of postings asking about it.

FYI: It means your new toy has arrived in the city you live in, but will NOT be delivered today, because they didn't promise to get it to you until tomorrow.  Whether this is to force customers into paying for express service, or simply a logistics issue, or a mix of the two, depends on your view of companies and I won't get into that here.

However, you'll notice a lot of the postings asking about it are from folks waiting for delivery of things they're very excited to get, often some big-ticket peice of shiny electronics.  They're dying for Fedex to deliver it - they're so anxious and upset about the delay that they motivated enough to go online and search, and make ranting posts - all because their "toy" is delayed.

So we have particular emotional response, often about an upscale product, with a reasonably distinct search phrase - cool!

Yes, yes, of course you could say that the customers are mad about the percieved injustice of it, the Occupy Wall Street spin, or that sometimes the package could be really important for other reasons, which are certainly valid points.  I'm not taking sides or passing judgement - and I found discovered this today looking for a friend's overdue toy - that's not the point.  I'm just saying that I bet there's a good statistical correlation, and of course it wouldn't apply 100% of the time - which would actually be quite rare in such things.

November 20, 2011

Dell.com Site Search acting weird today

FYI: The Dell Zino is (was?) a small form factor machine that could be used as a portable server, probably a bit larger than a Mac Mini, but still portable.

If you go to Dell.com and use their search and do a one word search for "zino" you get 8 results, but none of them are for that machine.  7 are for memory sims, and the bottom result is for a different dell machine, a small tower.  At first I was worried that perhaps they had discontinued the cute little guy.

I sent to Google and one of the suggested searches was "dell zino discontinued", aw... I was afraid of that.. But wait! - The first page didn't actually say it was discontinued, it just had those words in a long discussion thread.  And the second result goes to the Zino page on Dell.com, and it is still listed, though I wasn't able to actually buy it.  When you click the "choose" link you're asked to choose a market segment but the list was empty.  Maybe it's not for sale and their site search knows to know display it???

August 09, 2011

So how many machines does *your* vendor suggest for 100,000,000+ document dataset?

We've been chatting with folks lately about really large data sets.  Clients who have a problem, and vendors who claim they can help.

But a basic question keeps coming up - not licensing - but "how many machines will we need?"  And not everybody can put their data on a public cloud, and private clouds can't always spit out a dozen virtual machines to play with, plus duplicates of that for dev and staging, so not quite as trivial as some folks thing.

The Tier-1 vendors can handle hundreds of millions of dcs, sure, but usually on quite a few machines, plus of course their premium licensing, and some non trivial setup at that point.

And as much as we love Lucene, Solr, Nutch and Hadoop, our tests show you need a fair number of machines if you're going to turn around a half billion docs in less than a week.

And beyond indexing time, once you start doing 3 or 4 facet filters, you also hit another performance knee.

We've got 4 Tier-2 vendors on our "short list" that might be able to reduce machine counts by a factor of 10 or more over the Tier-1 and open source guys.  But we'd love to hear your experiences.