Exploring the Best File Formats for Document Management

1025 Monarch Street
Suite 170
Beaumont Centre
Lexington, KY 40513
tel. 859.252.6225
fax 859.252.6528
Hello, my name is Sarah Smith and I am part of the ISV Business Development team at Fujitsu Computer Products of America.
My primary role as an ISV BD manager is to work with ISVs like TrinSoft in order to ensure that we have a seamless solution
between our scanners and their software. I have learned quite a bit about document imaging and document management working in
this role over the years and found that sometimes my greatest value is sharing what I have learned. That being said, TrinSoft has
asked me to be a guest writer and proposed the topic of PDF vs. TIFF in terms of strengths and weaknesses.
Apparently there are a lot of questions out there as to which format is better for ECM (Enterprise Content Management) solutions. I
have done a little research and have also asked my ECM guru, Pam Doyle, in order to gather information and answer this question.
So, let’s start with the history of the two formats. TIFF was created in the 1980s to provide a standard file format for the storage of scanned images. It was originally created by Aldus Corporation, but is now controlled by Adobe. When we refer to TIFF format
in document imaging there are two formats being used. There is Group 4, which is the standard TIFF format used for bitonal (black
and white) images, and there is JPEG, which comes into play when dealing with color images. PDF was designed by Adobe in 1993
and became an open standard that was officially published in July of 2008 by the ISO (International Organization for Standardization). PDF is
your standard PDF file; PDF/A is a format approved by AIIM and ISO and was designed for the long-term archiving of electronic documents. What that means is that PDF/A documents are designed to be reproduced exactly the same way in years to come. The PDF/A
standard will not go away.
TIFF has been the favored format in document imaging as long as document imaging has existed, but it appears as though that might
be changing. It might be due to the fact that the PDF format was developed to improve on TIFF. It provides the ability to store both
image and text and has an easy migration path to long-term preservation via PDF/A. So getting back to the specifics, I picked out
what I think are the most significant features regarding the two file formats.
File Size– This is probably one of the biggest concerns, since some users are looking to store massive amounts of information in their
ECM or document management solution. In looking at file size we are going to consider TIFF Group 4 (standard TIFF format for
scanned, bitonal images), JPEG (standard TIFF file format for scanned color images), Image only PDFs and Searchable PDFs (a PDF that
has been run through an OCR process that takes the PDF from being just an image to being an image with recognizable text). Typically a
TIFF Group 4 image will be your smallest file type and a searchable PDF will be your largest. When it comes to color documents, a PDF
will be smaller than a JPEG. This is because scanning to PDF in color applies a compression function (JPEG 2000) that is not
available when scanning to JPEG.
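If you would rather check these size differences against your own documents than take the rule of thumb on faith, one option is to scan the same batch of pages into each format and compare the averages. The following is only a minimal sketch in Python; the function name and the "sample_scans" folder are my own placeholders, not part of any scanner or ECM product.

```python
from collections import defaultdict
from pathlib import Path


def average_size_by_extension(folder):
    """Return {extension: average file size in bytes} for all files under folder."""
    totals = defaultdict(lambda: [0, 0])  # extension -> [total bytes, file count]
    for path in Path(folder).rglob("*"):
        if path.is_file():
            entry = totals[path.suffix.lower()]  # treat .TIF and .tif as the same
            entry[0] += path.stat().st_size
            entry[1] += 1
    return {ext: total / count for ext, (total, count) in totals.items()}


if __name__ == "__main__":
    # "sample_scans" is a hypothetical folder holding the same pages saved in each format.
    sample_folder = Path("sample_scans")
    if sample_folder.is_dir():
        for ext, avg in sorted(average_size_by_extension(sample_folder).items()):
            print(f"{ext or '(no extension)'}: {avg / 1024:.1f} KB average")
```

The numbers only mean something if every format was produced from the same pages at the same resolution, so treat the result as a rough comparison rather than a benchmark.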
Searchability- When looking at searchability you have two options: do you want the entire document searchable or only keywords?
In an environment where scanning is a major part of the organization’s business process, the organization will usually want to store only key bits of information related to the document, in order to bring rhyme and reason to the way these documents will be searched and retrieved. If
this sounds like a method that your organization will like, then TIFF or PDF Image Only is going to be the way to go. The other option
is to OCR the entire document so that you can search and find the document based on any word that is on it. This would be a searchable PDF. Now of course that sounds like the easiest way to go, but understand that it is not a very organized, methodical way of
searching or retrieving documents on a larger scale. If you are looking at doing this for your personal records like bills and change of
address forms…go for it. But remember that searchable PDFs are large files so they are going to take up a lot of space.
Metadata- Think of metadata as being a very important aspect of searchability. Metadata is also known as index information and
it is key information related to the document that will be used to store and search for that document. Think of a folder that has a
bunch of documents in it all related to Sarah Smith and her employment with FCPA. On the top of that folder is likely to be a tab
with the words Sarah Smith HR. Those words, Sarah Smith HR, would be the metadata, or index information, for those documents. A more typical example would be an invoice from a vendor. When indexing an invoice one would typically choose to isolate or identify the date, the name of the vendor, the invoice number and maybe the dollar amount of the invoice. When it
comes to automatically indexing documents the way high-end capture solutions do, TIFF is the preferred format. If you plan on
manually indexing these documents, both TIFF and PDF will allow it. Just know that if you plan on making all of your documents
searchable PDFs (again, not a recommended method of document imaging), you can also assign metadata to that PDF but it is kind
of pointless and redundant.
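To make the invoice example a little more concrete, here is a rough sketch in Python of what an index record and a lookup against it might look like. The field names, the sample vendors and the find_invoices helper are all invented for illustration; a real capture or ECM product will have its own index schema and search tools.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class InvoiceIndex:
    """Index (metadata) fields for one scanned invoice; field names are illustrative."""
    file_name: str
    invoice_date: date
    vendor: str
    invoice_number: str
    amount: Decimal


def find_invoices(records, **criteria):
    """Return the records whose index fields equal every given criterion."""
    return [
        record for record in records
        if all(getattr(record, field) == value for field, value in criteria.items())
    ]


# Made-up sample records, for illustration only.
records = [
    InvoiceIndex("scan_0001.tif", date(2010, 3, 1), "Acme Supply", "INV-1001", Decimal("250.00")),
    InvoiceIndex("scan_0002.tif", date(2010, 3, 4), "Bluegrass Paper", "INV-2040", Decimal("89.50")),
    InvoiceIndex("scan_0003.tif", date(2010, 3, 9), "Acme Supply", "INV-1017", Decimal("410.25")),
]

matches = find_invoices(records, vendor="Acme Supply")
print([record.file_name for record in matches])  # ['scan_0001.tif', 'scan_0003.tif']
```

The point of the sketch is that the search runs against a handful of deliberately chosen fields rather than every word on the page, which is what keeps retrieval organized on a larger scale.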
Viewing- If the documents that are being scanned are going to be shared with people outside of your department or organization, you are going to want them to be in a file format the recipient can open. Both TIFF and PDF have widely available viewers, but
remember that your average person who has no experience when it comes to document imaging will know what a PDF is and
know exactly how to open it. They are less likely to know what a TIFF is or how to open it. Also, if you have documents that have multiple pages within them, you are less likely to run into issues viewing multi-page PDFs vs. multi-page TIFFs.
Below is a high-level overview of each file format we have discussed.
TIFF- This format is great for black and white documents, smaller file size and very specific metadata or index information. Users
that are scanning a lot of documents and have a lot of automated processes will usually go with the TIFF format. Also note that
most advanced capture solutions are optimized for the TIFF format.
JPEG- It is a standard color format that is widely known and used, but is a large format due to the color. The point here is to only
scan documents in color when really needed. If the color is not important, allow the scanner to scan in black and white.
PDF and PDF/A - It is becoming more popular to see organizations leverage this file format for their document management solution. It might be a slightly larger file size, but almost anyone will be able to open and view the file. PDF files support both color
and black and white. PDF also supports metadata/indexing and it can also be very secure. You can assign passwords to open it
and if any changes are ever made to the PDF, those changes will be noted in the document properties. (TIFF does not have this
capability.)
Searchable PDF– This is a great format if you want to edit the document or extract a lot of information beyond metadata. Just
remember that running an OCR engine takes time and can bog down a system. Also remember that searchable PDFs are large
files.
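For readers who like to see rules of thumb written down as logic, the overview above could be sketched roughly as follows in Python. This is only my shorthand for the summary, not a product feature: the function name, the question names and especially the order in which the questions are asked are assumptions made for the sake of the example.

```python
def suggest_format(needs_full_text=False, long_term_archive=False,
                   shared_externally=False, color=False):
    """Suggest a scan format from a few yes/no questions (illustrative rules only)."""
    if needs_full_text:
        # Editing or extracting text beyond index fields calls for OCR.
        return "Searchable PDF"
    if long_term_archive:
        # PDF/A was designed for long-term archiving.
        return "PDF/A"
    if shared_externally:
        # Almost anyone can open and view a PDF.
        return "PDF"
    if color:
        # Standard color format; scan in color only when it is really needed.
        return "JPEG"
    # Bitonal, high-volume, automated capture: smallest files.
    return "TIFF Group 4"


print(suggest_format(shared_externally=True))  # PDF
print(suggest_format())                        # TIFF Group 4
```

A real decision will weigh more than four questions, and your capture software may narrow the choices for you, so read this as a summary of the trade-offs rather than a rule.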
There really isn’t a right or wrong file format when it comes to scanning images; it is just a matter of understanding the strengths
and weaknesses of each format in order to understand which ones to use when.
If you have further questions on this matter or any questions related to scanning documents please feel free to contact me. We
have lots of experts here at Fujitsu and are happy to share our knowledge where we can. Thank you for taking the time to read
this article.
Sarah Smith
Fujitsu Computer Products of America
ISV Business Development
[email protected]
trindocs.com