November 04, 2009

PDF - The New Legacy Data

In the old days, companies referred to paper documents as "legacy data": boxes and boxes of important printed documents that were difficult to access.  If you've been in the industry for a while, you'll recall all the high-speed scanning / OCR companies that cropped up to solve this problem.

Today virtually all documents and manuals are created electronically, and thanks to high-quality formats like Adobe PDF and numerous electronic distribution channels, documents tend to stay in digital format.

To me, PDF has replaced paper as the new "legacy" format - there's a ton of technical data now being published in this format.  And to paraphrase an old commercial, "data checks in, but it don't check out".

Of course there are ways to get all that tabular technical data back out of PDF and into more usable forms, but getting this right is not trivial, and it's certainly not Adobe's priority.  We're not chiding them for this; their business model is clearly served by getting data INTO PDF.  Acrobat can now export to XML, and other solutions can help you get the content out as well.

A similar case could be made for HTML, Word, Excel and PowerPoint.  Each of these formats has problems of its own.

PDF has some particular details that can thwart enterprise search:

  • Not all PDF files have searchable text, and users are generally unaware of the difference.
  • PDF files come in many dialects.
  • Tabular data in PDF is sometimes difficult for software to infer; humans easily see the rows and columns, but unlike other document formats, there is no intrinsic hierarchical document structure, just pixels, lines and text snippets with various X,Y coordinates.
  • Older PDF formats were not as capable when dealing with other languages, such as Arabic.
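
To make the tabular data point concrete, here is a minimal Python sketch of how software might guess at rows when all it has is text snippets with X,Y coordinates. The Snippet type, the group_into_rows function and the y_tolerance value are all made up for illustration; they are not part of any PDF library, and a real extractor would also have to cope with column inference, wrapped cells and multi-page tables.

```python
from typing import NamedTuple


class Snippet(NamedTuple):
    """A positioned text fragment, as a PDF text extractor might report it.

    Hypothetical type for illustration only.
    """
    x: float   # left edge of the snippet
    y: float   # baseline of the snippet, same units as x
    text: str


def group_into_rows(snippets, y_tolerance=2.0):
    """Cluster snippets into rows by similar y, then order each row by x.

    y_tolerance is an assumed fudge factor: snippets whose baselines are
    within this distance of the first snippet in a row are treated as
    belonging to that row.
    """
    rows = []
    # PDF y coordinates grow upward, so larger y means higher on the page.
    for s in sorted(snippets, key=lambda s: -s.y):
        if rows and abs(rows[-1][0].y - s.y) <= y_tolerance:
            rows[-1].append(s)
        else:
            rows.append([s])
    # Within each row, read left to right.
    return [[s.text for s in sorted(row, key=lambda s: s.x)] for row in rows]


if __name__ == "__main__":
    snippets = [
        Snippet(200, 700.5, "Voltage"),
        Snippet(50, 700.0, "Part"),
        Snippet(50, 680.0, "A-100"),
        Snippet(200, 679.2, "5 V"),
    ]
    print(group_into_rows(snippets))
    # prints [['Part', 'Voltage'], ['A-100', '5 V']]
```

Even in this toy case the row boundaries depend entirely on a guessed tolerance, which is why getting tables back out of PDF reliably takes careful tool selection.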

All of these issues have solutions, but all of them require some thought and careful tool selection.

So the complexity of OCR has been replaced, on some level, with document filters, entity extraction, ETL and optimized full-text search.

Comments

PDF is where text goes to die.

PDF is not a text format. It is instructions for putting marks on paper. Those instructions move virtual rubber stamps around. Any standard relation between those stamps and actual text characters was minimal until after Acrobat 4, and optional even then.

At one point, I considered using cryptanalysis algorithms to figure out non-western characters in PDF, but decided it would be hell to export. Still, PDF is closer to bad cryptography than to a good text format.
