November 14, 2008

Market Trends in Embedded Search

Are you trying to find an embedded search solution that meets your users’ needs and your specific application requirements?  Have you tried to embed search into your application, but found it difficult and expensive to customize and integrate? Have you already embedded a solution only to find that it lacks the performance and functionality your customers are demanding? Would you like to learn about how you can cost-effectively give your customers access to search that has been architected for ISVs, offers value-added features, and scales?

View a brief video interview about the webinar with Miles Kehoe.

If so, we’d like to invite you to join our webinar, “12 Leading Insights on Embedded Search for ISVs.” Learn about:
•    Major Market Trends for Embedded Search
•    Key Challenges Facing ISVs with Embedded Search
•    5 Most Important Embedded Search Requirements
•    What Works and What Doesn’t
•    Overview of Exalead CloudView OEM Edition

Moderator
•    Eric Rogge, Senior Director of Marketing, Exalead

Featured Speakers
•    Ranjeet Vidwans, VP of OEM, Exalead
•    Miles Kehoe, President, New Idea Engineering

Date/Time
•    Friday, December 5, 2008 at 11am PST

Registration:
To register for the webinar, please click here.

To download the Exalead whitepaper "The ISV Challenge: Satisfying the Demand for Better Search,” 

October 27, 2008

Grep is not a search engine

I actually started out to write an entry on the weird search terms we've seen, but that will have to wait. As I was doing some research for that entry, I ran into yet another annoyance we often see: a 'search engine' that works just like grep.

For those of you who don't know the pleasures and utility of grep, I feel both regret and envy. After all, I've spent years relying on that bizarre Unix & Linux utility - so much so that I use the MKS Toolkit on all of my Windows PCs. But as useful as grep can be, it is not a search engine.

Consider Adobe Acrobat. I found a PDF on the web, and viewed it with the Adobe Acrobat Version 7.00 add-in for Firefox. I was looking for a phrase popular in the management service consulting business that describes a process: 'as is, to be'. Now, all of these are typically stop words, which is the point of my 'weird searches' entry to come.

When you search a PDF file with the built-in Search feature, the engine returns every instance of the character sequence you enter. Search for the word view and you'll see all of the instances of that term as well as the term views. Cool - stemming! But wait! Dig a bit further and you find it also returns review, interviews, viewpoint, and any other terms whose only similarity to the original query is that they contain the same letter sequence. How about a phrase? Adobe doesn't seem to support quoting a phrase, but when you enter multiple space-delimited terms it seems to assume you want a phrase search. Even in a phrase search, though, the last term is matched as a prefix. Thus, a search for switch vendors will find the phrase, but so will a search for switch v.

This capability can be cool - for example, if you want to find an instance of the string 30/60/90, you can do so. Heck, just type 30/ and you're there. And if you have really weird error numbers or status codes (0x00ffdd07) it works great!
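To make the difference concrete, here is a minimal Python sketch contrasting grep-style substring matching with the term matching most people expect from a search engine. The sample text, the function names, and the crude plural-stripping "stemmer" are all our own illustration; this is not how Acrobat or any particular engine is implemented.

```python
import re

TEXT = "Please review the interviews; each viewpoint offers new views. One view stands out."

def words(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

def substring_match(query, text):
    """Grep-style: keep every word that contains the query's character sequence."""
    return [w for w in words(text) if query.lower() in w]

def naive_stem(word):
    """Crude plural stripping, for illustration only; real engines use a proper stemmer."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def term_match(query, text):
    """Search-engine-style: keep only words whose stem equals the query's stem."""
    target = naive_stem(query.lower())
    return [w for w in words(text) if naive_stem(w) == target]

if __name__ == "__main__":
    print(substring_match("view", TEXT))
    print(term_match("view", TEXT))
```

With this sample text, the substring version returns review, interviews and viewpoint along with views and view, while the term version returns only views and view.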

In fairness, Adobe does let you specify whole words only and case-sensitive search. But often we see companies that provide grep-like search in their product or service and eagerly claim 'search included'. I guess companies for whom search is a check-box feature and not really seen as contributing to corporate success will accept such an attitude. 

And by the way - we don't think the SQL LIKE operator counts as a search engine either. But that's a rant for another day.

October 13, 2008

Reviewing OpenPipeline

OpenPipeline is an initiative proposed by search engine company Dieselpoint to begin development of standards in the enterprise and customer-facing search marketplace.

"Current solutions are proprietary and require that search administrators define and manage data source connectors, file filters, text analyzers, taxonomy, and dictionaries for each search engine technology," says Miles Kehoe, CEO of New Idea Engineering. "Defining once and maintaining a single source regardless of how many and which search engine you use is a big win for customers. We hope other search engine vendors will be adopting this strategy soon." 

"Enterprise search is not the same as web searching," Chris Cleveland, CEO of Dieselpoint, says, "because it entails all of the nitty-gritty preparation for search - that is, it requires doing all of those things you need to do to get a document and standardize it before indexing." OpenPipeline, he says, aims to streamline the preparation process through its innovative document-processing capabilities.

Additional information ... 2008 Enterprise Search Vendors: The New Fab 4 ... and 1/2. (http://www.ideaeng.com/pub/entsrch/2008/number_01/article01.html)

OpenPipeline was created by Chris and his team of developers at Dieselpoint, whose intranet and customer-facing search product is written in pure Java. Dieselpoint Search is a powerful product, and has many of what we call 'Enterprise Search 2.0' capabilities designed in from the start. For example, it has a web-based control panel for business and IT managers, and provides great support for features like dynamic facets, activity reporting, and powerful data crawling capabilities. It has an elegant and clean interface, and it is extremely scalable. Dieselpoint Search integrates OpenPipeline for crawling, parsing, analyzing, and routing documents.

About Dieselpoint
Founded in 1999, Dieselpoint provides high-performance search, navigation, and discovery/information retrieval software for structured and unstructured data. Every day, Dieselpoint customers search millions of items and terabytes of data. Customers like The Nielsen Company, Northrop Grumman, Porsche, HMV, McGraw-Hill, ITT, Waterstone’s Books, and British Telecom use Dieselpoint software for corporate portals, intranet search, product catalogs, and engineering databases. Dieselpoint has developed industry-leading advances in faceted search and scalability. Coupled with a new Open Pipeline architecture and outstanding ease of implementation, Dieselpoint is the platform of choice for corporate search needs.  Further information can be found online at www.dieselpoint.com.

September 10, 2008

Do You Plan to Attend ESS West 2008 in San Jose this month?

The Enterprise Search Summit - West starts Monday September 22 with pre-conference workshops, and the show kicks off Tuesday the 23rd. We'll be exhibiting once again - please stop by and say hello at Booth 229!

Early bird pricing ends Sept 3!

  You can register here and get a special rate through New Idea Engineering. Use promotion code VIPIDEA.

Don't miss our sessions.

  • Tuesday Sept 23 2008 at 11:45 - 12:15 pm
    A101:  The Nuts and Bolts of Selecting a Search Engine
    Companies often spend huge sums of money and months of work effort to replace an existing enterprise search engine only to find they still are not happy with the results. With a little planning you can avoid this disaster. Kehoe will outline a phased approach for selecting an enterprise search engine, verifying quality of results against your existing solution, and transitioning to your new infrastructure. This talk takes a hard look at the fix vs. buy decision by focusing on methodology as well as on technology.
  • Wednesday Sept 24 2008 at 3:00 - 3:45pm 
    B206: Search and the Virtual Machine
    Enterprise search is incredibly demanding on hardware resources. Virtualized solutions allow server consolidation and higher server utilization. Virtualization also allows the IT staff to better allocate resources—processors and memory—to optimize performance, yet there are trade-offs to be considered with any approach. This session will examine virtualized solutions in the context of real-world implementations to help attendees understand how this approach can impact operation and performance.

June 23, 2008

13 Powerful Entity Extraction Techniques

Modern Entity Extraction systems typically employ some combination of the following general techniques, listed roughly from smallest to largest scope and complexity:

  1. Simple Pattern Based:
    Examples: 5 digits in a row might be a US postal zip code, 1 - (nnn) - nnn-nnnn could be a US phone number, and nnn-nn-nnnn could be a government Social Security number.
  2. Simple Dictionary/Thesaurus Based:
    Examples: IBM, Apple and Microsoft are all US companies.  Bill Gates, George Bush and Paul Revere were all famous people.
    - - - (basic toolkits end here) - - -
  3. Hybrid Pattern plus Dictionary:
    Example: A pattern finds a sequence of words that all have capital letters, so this is likely to be a proper name.  But to distinguish San Francisco, George Washington and Oracle Corporation as a place, person and company respectively, the system needs to consult some dictionaries.  And notice that "Washington" can be used as both a place and a person name.  And if I see Flamingo Geodarney Foobazar, which mixes a common word with names that are not in the dictionary, it may be even more difficult to disambiguate.
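
The first three techniques can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the regular expressions are deliberately simplistic, the dictionary holds just the three names from the example above, and the function and label names are made up for this sketch.

```python
import re

# Technique 1: simple patterns (illustrative regexes, far from production grade).
PATTERNS = {
    "zip_code": re.compile(r"\b\d{5}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Technique 2: a tiny hand-made dictionary (hypothetical entries).
DICTIONARY = {
    "San Francisco": "place",
    "George Washington": "person",
    "Oracle Corporation": "company",
}

# Technique 3: a pattern finds runs of capitalized words,
# then the dictionary is consulted to assign a type.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")

def extract(text):
    """Return a list of (matched_text, label) pairs."""
    entities = []
    for label, pattern in PATTERNS.items():
        entities += [(m.group(), label) for m in pattern.finditer(text)]
    for m in CAPITALIZED_RUN.finditer(text):
        label = DICTIONARY.get(m.group(), "unknown proper name")
        entities.append((m.group(), label))
    return entities

if __name__ == "__main__":
    sample = ("George Washington never visited San Francisco, but "
              "Flamingo Geodarney Foobazar mailed 123-45-6789 to 94105.")
    for entity in extract(sample):
        print(entity)
```

Note that Flamingo Geodarney Foobazar is found by the pattern but comes back as an unknown proper name, which is exactly the disambiguation problem described above.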

Continue reading "13 Powerful Entity Extraction Techniques" »

June 18, 2008

Search Quality: You Can't Improve What You Don't Measure

In our latest survey of new newsletter subscribers we found that 29% had no formal metrics for measuring quality of search results.  Search metrics allow you to keep search on the right track and can be a powerful tool for managing your systems.  They are a wonderful source for insights and trends.  We thought we would share a couple that we think work well. Many of these are covered in greater depth in Interpreting Your Search Activity Reports in the Enterprise Search newsletter.

  • Count the number of people who use search  
  • Count the total number of searches  
  • Count the number of searches that return zero results  
  • User feedback on top 100 searches  
  • Track email complaints about search  
  • Measure number of clicks on navigators (navigation menu items)  
  • Business Goals  
    • Reduce call volume (normalized for growth in customer base) by enabling self-service from search: results are good enough to reduce calls.
    • Reduce e-mail volume (again adjusted for growth in customer base) by enabling self-service from search: results are good enough to reduce e-mails.
    • Revenue
    • Add-on revenue
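
The first few counts in the list are easy to compute once you have a search log. Here is a rough Python sketch, assuming a hypothetical log of (user, query, number of results) records; your search engine's actual log format and field names will differ.

```python
from collections import Counter

# Hypothetical log records: (user_id, query, number_of_results)
LOG = [
    ("u1", "vacation policy", 12),
    ("u2", "expense form", 0),
    ("u1", "expense report", 7),
    ("u3", "vacation policy", 12),
    ("u2", "401k", 0),
]

def search_metrics(log, top_n=100):
    """Compute basic search-quality counts from a list of log records."""
    total = len(log)
    zero = sum(1 for _, _, hits in log if hits == 0)
    return {
        "unique_users": len({user for user, _, _ in log}),
        "total_searches": total,
        "zero_result_searches": zero,
        "zero_result_rate": zero / total if total else 0.0,
        # The most frequent queries are the ones worth reviewing by hand.
        "top_queries": Counter(query for _, query, _ in log).most_common(top_n),
    }

if __name__ == "__main__":
    for name, value in search_metrics(LOG).items():
        print(name, value)
```

Tracking these numbers week over week matters more than any single snapshot.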

May 08, 2008

A proposed standard for enterprise search

Dieselpoint has announced support for a technology it calls OpenPipeline, which aims to improve a task virtually every enterprise search technology must perform: getting documents into the search index. They will be showing the pipeline at the upcoming Enterprise Search Summit on May 20-21 integrated with their new Dieselpoint Search 4.0, also on display.

The Dieselpoint press release claims:

OpenPipeline provides a common architecture for connectors to data sources, file filters, text analyzers and modules to distribute documents across a network. It is fully functional out of the box and includes an installer, a job scheduler, file scanner and crawlers, doc filters, and point and click interface with drag and drop module installation.

OpenPipeline is compatible with IBM's UIMA (Unstructured Information Management Architecture), and is designed to connect UIMA annotators to other systems.

Document processing can be centralized or parallelized as needed. The transport mechanism is simple, web-services XML over HTTP. RSS/Atom feeds are also possible.

The development philosophy behind OpenPipeline stresses simple, elegant design, and massive scalability. Minimal external dependencies and straightforward plug-in implementation ensure that the learning curve is low.

OpenPipeline can be downloaded without charge from http://www.OpenPipeline.org. It's available under the Apache License.


Making this technology open source makes sense. The core technology for an enterprise search company, their 'secret sauce', is optimizing the index and making search great, not creating new code to parse the latest version of Microsoft Office or of Documentum. By embracing OpenPipeline, presumably we will start to see pipeline stages created by a number of smaller companies and individuals, easing the burden on enterprise search companies. And companies that provide possible sources of data, like Content Management Systems, can create a single pipeline stage for their product that could work for every search technology, and be done with it.

To create a searchable index, all search technologies need to create a stream of text. If the source document is a binary file - Microsoft Word, for example - search vendors need to provide some way to read the format and convert it to text. The same is true of content stored in a relational database: each row represents a virtual document which needs to be extracted from the database and turned into a stream of text. This conversion is typically done as one stage of a pipeline. Other stages may include adding metadata, performing entity or sentiment extraction, or even enhanced language processing.
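
To illustrate the idea of stages, here is a generic Python sketch of a document pipeline. To be clear, this is not the OpenPipeline API: the stage functions, the dictionary-based document, and the capitalized-word "entity extraction" are all placeholders we made up to show the shape of the approach.

```python
def row_to_doc(row, source):
    """Treat one database row (a dict) as a virtual document with raw bytes."""
    raw = " ".join(str(value) for value in row.values()).encode("utf-8")
    return {"raw": raw, "source": source}

def extract_text(doc):
    """Stage 1: convert raw content to a stream of text (here, just decode bytes)."""
    doc["text"] = doc["raw"].decode("utf-8", errors="replace")
    return doc

def add_metadata(doc):
    """Stage 2: attach simple metadata."""
    doc["metadata"] = {"length": len(doc["text"]), "source": doc["source"]}
    return doc

def extract_entities(doc):
    """Stage 3: placeholder entity extraction (capitalized words only)."""
    doc["entities"] = [w for w in doc["text"].split() if w[:1].isupper()]
    return doc

STAGES = [extract_text, add_metadata, extract_entities]

def run_pipeline(doc, stages=STAGES):
    """Pass the document through each stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc

if __name__ == "__main__":
    row = {"title": "Dieselpoint Search", "body": "faceted search"}
    print(run_pipeline(row_to_doc(row, "db")))
```

Because every stage takes a document and returns a document, stages can be added, removed or reordered without touching the others, which is what makes a shared pipeline format attractive.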

The concept of a 'pipeline' applies directly to many existing search technologies, each with a proprietary method of accessing content. On top of that, no search technology companies have cooperated with competitors to create standards. In the relational database world, standards have made life much better: consider ODBC and JDBC. Because of these standards, developers can write code that can connect to just about any relational database. Not so in search. Maybe this effort will help break the ice. Stay tuned...

As enterprise search users, are you glad to see an open source solution for part of the search puzzle?

March 19, 2008

Advanced Duplicate Detection (also related to spam detection and clustering)

We need to do a dedicated article about this area, but I wanted to share some material here that we have written about it, and that will likely re-appear in a future article.

In our recent newsletter article, we covered the problem of generic duplicate detection in search, and then duplicate detection in federated search.

In a SearchDev posting, Mark talked more about why checksums aren't always enough for duplicate detection, in messages 485 and 490.
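
One common reason checksums fall short, though not necessarily the argument made in those messages, is that a checksum changes completely when a document differs by even one word. The Python sketch below compares an MD5 checksum with word shingling plus Jaccard similarity, a widely used near-duplicate technique; the function names and sample sentences are our own.

```python
import hashlib

def checksum(text):
    """Exact-match fingerprint: any change at all produces a different value."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def shingles(text, k=3):
    """Set of overlapping k-word sequences from the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b, k=3):
    """Share of shingles the two texts have in common, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

if __name__ == "__main__":
    original = "The quick brown fox jumps over the lazy dog near the river bank"
    near_copy = original + " (copy)"
    print(checksum(original) == checksum(near_copy))  # exact match fails
    print(jaccard(original, near_copy))               # similarity stays high
```

A system might treat two documents as near-duplicates when the similarity exceeds a threshold it has tuned for its own content.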

March 03, 2008

Deep Web proposes federation resource site

Sol Ledeman of Deep Web Technologies wants to create a one-stop demo center for federation technology and has invited all of the major vendors to participate.

Federated search is becoming increasingly popular as more corporate customers are looking for ways to deliver results from multiple enterprise search installations, often from many different vendors. Sometimes the issue is technical, sometimes political, but nearly all companies have three or more search vendor technologies running somewhere behind the firewall.

The one thing we'd like to have seen in Sol's challenge is security, since that's what we think separates the winners from the also-rans in federation. It's not always easy, but it is 'real world' in companies. Nonetheless, a demo site where users can compare vendor solutions 'apples to apples' on the same data sources would be nice.

By the way, we've seen some confusion among our customers and prospects on the subject, so we've taken a shot at defining 'federated search' in our Enterprise Search newsletter. We hope that helps some.

January 31, 2008

Notes: Sample data for indexing, searching, analysis

Updated Feb 26, 08, added ref. from Chris C.
http://www.datawrangling.com/some-datasets-available-on-the-web.html

Wikipedia database download
Many formats and versions:
http://en.wikipedia.org/wiki/Wikipedia:Database_download
Example: From January 10, 2008
http://download.wikimedia.org/enwiki/20080103/
file pages-articles.xml.bz2 = 3.2 GB (claimed to be the version most folks want)

Enron public emails:
Released in 2003
http://www.cs.cmu.edu/~enron/
UC Berkeley
http://bailando.sims.berkeley.edu/enron_email.html
Emails searchable with Lucene
http://orange.sims.berkeley.edu/~atf/enron/enron.cgi
"Enronic"
http://jheer.org/enron/v2/

Reuters Corpus Volume 1
http://trec.nist.gov/data/reuters/reuters.html

TREC and TREC Data
http://trec.nist.gov/
http://trec.nist.gov/data.html
Terabyte Data, $800 US or is it Uklb 600, 426 GB, 25 million docs GOV2 collection http://es.csiro.au/TRECWeb/gov2-summary.htm
http://ir.dcs.gla.ac.uk/test_collections/
http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html http://www.mccurley.org/trec/
http://trec.nist.gov/data/terabyte/04.guidelines.html
http://es.csiro.au/TRECWeb/

CIA World Factbook
Updated Feb-4-2007 Revised link here:
https://www.cia.gov/library/publications/the-world-factbook/index.html
Updated Feb-4-2007 In various Database and XML formats: (thanks Clint)
http://www.dbis.informatik.uni-goettingen.de/Mondial/
http://www.cia.gov/cia/publications/factbook/
http://www.cia.gov/search?NS-collection=World%20Factbook
Public domain, questions: (703) 482-0623
Easy download at: http://www.odci.gov/cia/download.html

Database: (though may have copyright restrictions, and not in generic format)
MS SQL: Northwind, pubs, AdventureWorks

Oracle: books database?

TODO: MySQL, PostgreSQL

Ideas for Electronic parts manufacturers: (not much luck so far)
TI, National Semi, AMD, Motorola, Cypress. Connectors: AMP and Molex

Sample data sets:
http://lib.stat.cmu.edu/datasets/ Boston housing prices (old)
http://lib.stat.cmu.edu/datasets/boston_corrected.txt
cars
http://lib.stat.cmu.edu/datasets/cars.data (a bit cryptic)
http://lib.stat.cmu.edu/datasets/cars.desc

colleges
http://lib.stat.cmu.edu/datasets/colleges/

Maybe ACM data
http://www.acm.org/

TODO: DMOZ category data, need to find link.

TODO: Bruce suggested US Patent Data, need to find link.

SOCNET data: Approx 10 M blog entries (via Sean)
Agreement http://www.blogpulse.com/www2006-workshop/datashare-agreement.pdf
Date: Fri, 16 Dec 2005 10:32:44 -0800
Sender: Social Networks Discussion Forum <SOCNET@LISTS.UFL.EDU>
From: Eytan Adar
Subject: [SOCNET] Announcing: data availability for the 3rd Annual Workshop on the Weblogging Ecosystem
We are happy to announce the public availability of a substantial collection of blog data for research purposes. The data is being made available by Intelliseek/BlogPulse in conjunction with the 3rd Annual Workshop on the Weblogging Ecosystem. A DVD containing full text from nearly 1 million blogs can be requested by filling out the form at the workshop homepage: http://www.blogpulse.com/www2006-workshop/
