
June 23, 2008

13 Powerful Entity Extraction Techniques

Modern Entity Extraction systems typically employ some combination of the following general techniques; the list is ordered roughly from smallest to largest in scope and complexity:

  1. Simple Pattern Based:
    Examples: 5 digits in a row might be a US postal zip code, 1 - (nnn) - nnn-nnnn could be a US phone number, and nnn-nn-nnnn could be a government Social Security number.
  2. Simple Dictionary/Thesaurus Based:
    Examples: IBM, Apple and Microsoft are all US companies.  Bill Gates, George Bush and Paul Revere were all famous people.
    - - - (basic toolkits end here) - - -
  3. Hybrid Pattern plus Dictionary:
    Example: A pattern finds a sequence of words that all have capital letters, so this is likely to be a proper name.  But to distinguish San Francisco, George Washington and Oracle Corporation as a place, person and company respectively, the system needs to consult some dictionaries.  And notice that "Washington" can be used as both a place name and a person's name.  And if I see Flamingo Geodarney Foobazar, which uses words or names that are not in the dictionary, it may be even more difficult to disambiguate.
  4. Corpus Level Rules:
    Looking at the actual URLs, filenames, etc.
    For example, the URL http://mycompany.com/marketing/file2.html obviously represents a Marketing document.
  5. Context/Proximity to other Identified Elements:
    Example: If I see a two letter abbreviation by itself, I might not be 100% sure what it is.  But if it's next to a known city's name, I can be pretty confident that it's a US state abbreviation.  Understanding the context of numbers can be greatly aided by this technique.
  6. Hierarchical Context:
    Identifying some elements can lead to discovering others, and then those can lead to even more.
    Example: If I see a city, state and zip code here, and there are some numbers and a possible street name above, this may be a postal address. Although an extension of the previous item, it's listed separately because not all systems allow for nesting of rules, or for rules that are relative to other matches.
  7. Page Structure and Document Type: Extreme Hierarchy
    Example: If the system sees a postal address, an account number, some line items and prices, this might be an invoice or shipping manifest. This is an extension of the previous element, but listed separately since some vendors specifically talk about this feature and ship predefined templates.  As an example, such rules can flag pages as "AIB", or "Active and In Business", which recognizes the web pages of company web sites, vs. personal web sites or blogs, etc.
    - - - (getting pretty high end here) - - -
  8. Fonts and other Formatting Cues:
    Fonts, bold formatting and CSS styles can indicate important info or contextual headings.
  9. Area Weighting, Row and Column Alignment, Fonts and other Visual Cues:
    Estimating the amount of space certain bits of text or graphics take up, and then making assumptions and assigning weights based on that. Elements found in tables can use row and column headings for additional context.
  10. Link and page ancestor/descendant/path based context:
    The path of links taken to reach this page, and the information on those ancestor pages and links, can provide additional context for the current page.
  11. Synthetic Tokens:
    Additional word-like tokens are fed into the search index based on other attributes of the document.  Common examples include Mime-Type, presumed document type (press release, email message, blog posting, etc.), language (English, Spanish, German, etc.), document size and user ratings.  More advanced synthetic tokens can include a "spam score", number of links, ratio of text to whitespace, ratio of text to overall document size, ratio of link text to overall text, etc.  To be compatible with some search engine indexes, variable attributes such as size and ratio can be expressed as textual tokens.  For example, "s" could represent a spam score of 0-5%, "ss" = 6%-25%, "sss" = 26-65%, "ssss" = 66-95%, and "sssss" = > 95% likelihood of spam.
  12. Statistical Inference (LSA, PCA, SVM, Bayesian, etc.):
    These long-established, mathematically based search techniques can be re-tooled to help with entity extraction as well, though this is not widely done at the moment.
    - - - (these last 2 methods are also useful in document clustering, deduping and spam detection) - - -
  13. Operator/User Based:
    Manual data entry and/or manual editing.  This is still used in some high end applications where accuracy is very important.  In a much simpler form, this is also how many modern tag-based web sites work, although the tags do not have the same fielded structure as the traditional applications had.
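To make the simple pattern-based level concrete, here is a minimal sketch in Python of the kind of regular-expression matcher the first item describes. The patterns (including the phone format) are illustrative assumptions, not production-grade rules; real systems need word-boundary care, international variants and checksum validation.

```python
import re

# Toy patterns mirroring the examples above; these are illustrative
# assumptions, not production-grade rules.
PATTERNS = {
    "zip":   re.compile(r"\b\d{5}\b"),                    # 5 digits in a row
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # nnn-nn-nnnn
    "phone": re.compile(r"\b1-\(\d{3}\)-\d{3}-\d{4}\b"),  # 1-(nnn)-nnn-nnnn
}

def extract_entities(text):
    """Return (entity_type, matched_text) pairs found in the text."""
    return [(name, m.group())
            for name, pattern in PATTERNS.items()
            for m in pattern.finditer(text)]
```

Note that a bare 5-digit match might equally well be a quantity or an ID; the later techniques in the list (context, hierarchy) are what resolve that ambiguity.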
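The hybrid pattern-plus-dictionary approach can be sketched the same way: a regular expression proposes capitalized word sequences as candidate names, and small gazetteers (toy stand-ins for real dictionaries) label them. The gazetteer contents here are assumptions for illustration only.

```python
import re

# Toy gazetteers standing in for real dictionaries.
PEOPLE    = {"George Washington", "Bill Gates"}
COMPANIES = {"Oracle Corporation", "IBM"}
PLACES    = {"San Francisco", "Washington"}

# Candidate entity: a run of words that start with a capital letter.
CANDIDATE = re.compile(r"\b[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*\b")

def label_candidates(text):
    labeled = []
    for m in CANDIDATE.finditer(text):
        name = m.group()
        if name in PEOPLE:
            labeled.append((name, "person"))
        elif name in COMPANIES:
            labeled.append((name, "company"))
        elif name in PLACES:
            labeled.append((name, "place"))
        else:
            # e.g. "Flamingo Geodarney Foobazar" -- not in any dictionary
            labeled.append((name, "unknown"))
    return labeled
```

Checking the longest match first is what lets "George Washington" come out as a person while a bare "Washington" would still fall through to the place dictionary; the ambiguity the article mentions never fully goes away.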
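The corpus-level rule from the list (a /marketing/ URL implies a Marketing document) might look like this; the department names are an assumed site convention, not anything standard.

```python
from urllib.parse import urlparse

# Assumed convention: the first URL path segment names the owning department.
DEPARTMENTS = {"marketing", "engineering", "legal", "support"}

def department_from_url(url):
    """Return the department implied by the URL path, or None."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if segments and segments[0].lower() in DEPARTMENTS:
        return segments[0].lower()
    return None
```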
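The context/proximity idea (a two-letter abbreviation next to a known city name is almost certainly a US state) reduces to a joint pattern plus a dictionary check. The city gazetteer below is a toy assumption.

```python
import re

KNOWN_CITIES = {"Boston", "Austin", "Portland", "Chicago"}  # toy gazetteer

# "City, XX" -- a capitalized word followed by a two-letter abbreviation.
CITY_STATE = re.compile(r"\b([A-Z][a-z]+),\s+([A-Z]{2})\b")

def find_state_abbreviations(text):
    """Keep a two-letter match only when its neighbor is a known city."""
    return [(m.group(1), m.group(2))
            for m in CITY_STATE.finditer(text)
            if m.group(1) in KNOWN_CITIES]
```

A bare "OR" on its own could be the English word or Oregon; it is the adjacent, independently identified city that tips the confidence.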
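The synthetic-token bucketing described under Synthetic Tokens is straightforward to implement; the bucket boundaries below come straight from the spam-score example in the list.

```python
def spam_score_token(score):
    """Map a spam probability (0.0-1.0) to a textual index token.

    Buckets follow the example in the article: "s" = 0-5%, "ss" = 6-25%,
    "sss" = 26-65%, "ssss" = 66-95%, "sssss" = over 95%.
    """
    if score <= 0.05:
        return "s"
    if score <= 0.25:
        return "ss"
    if score <= 0.65:
        return "sss"
    if score <= 0.95:
        return "ssss"
    return "sssss"
```

The resulting token can then be fed into a word-only search index alongside the document's ordinary terms, which is exactly why variable attributes get expressed as text in the first place.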

We haven't seen a vendor who does all of these yet.

If you'd like to chat about some of these techniques further, please do leave a comment or drop us an email.




14. A combination of (3) and (5)/(6), plus references to a semantic network which knows that "corporation" is a kind of business, and can operate with rules like "a capitalized noun" + "any kind of business enterprise", so it can find both "Oracle Corporation" and "Jones Partnership".
