When you’re looking to start your first digitization project, file type options can seem overwhelming, especially if you’re not familiar with their advantages and disadvantages. Backstage can help you understand the different types of deliverables common for digital imaging projects and make sure you are getting the most out of your files.
Choosing the best type of deliverable files for your project depends on a few factors. You should ask yourself two questions before you begin:
- How are your images going to be used?
- Where are you going to upload the images?
You may have a digital asset management system (DAM) that specifies a required or preferred format. You may also need more than one format: a compact file type for online access, and a different one for archival preservation, high-resolution reproduction, or other purposes.
Common image types for scanned content include:
- JPEG and JPEG2000
- PDF (single or multi-page)
- TIFF
JPEG and JPEG2000
A JPEG can be a great option for individual images. The JPEG format can easily be used for anything from thumbnail previews to large files displaying intricate details and fine text. The format is widely supported across web browsers and other common software applications. A JPEG is a compressed version of the original image, resulting in a smaller file size that makes these images useful for internet and local access.
The main drawback of JPEGs is the loss of data inherent in lossy compression. For example, areas of similar color may be consolidated to a single value, which can produce an effect called posterization in place of smooth color transitions. Editing a JPEG and saving it again may create additional compression artifacts, further reducing image quality, and this loss cannot be reversed. Simply put, a JPEG will not retain all of the data present in the original image file. A compressed image is often good enough for access on a screen, but the JPEG format is not the best choice for long-term preservation.
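To make posterization concrete, here is a minimal sketch in Python. It is a toy illustration of banding only, not JPEG's actual algorithm, which is considerably more involved. The `posterize` function and the sample gradient are hypothetical and exist only for this example.

```python
def posterize(values, levels):
    """Snap 8-bit tone values (0-255) onto a smaller number of evenly spaced levels."""
    step = 256 // levels
    # Every value in the same bucket collapses to that bucket's lowest tone.
    return [min(v // step, levels - 1) * step for v in values]

# A smooth gradient of 16 distinct tones from dark to light (sample data).
gradient = list(range(0, 256, 16))
posterized = posterize(gradient, levels=4)

print(gradient)
print(posterized)
print(len(set(gradient)), "distinct tones before,", len(set(posterized)), "after")
```

The 16 tones collapse into 4 flat bands. Once the neighboring tones share a value, nothing in the saved file records what they used to be, which is why the loss cannot be undone.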
A JPEG2000 is similar to a JPEG, but it offers higher image quality, built-in scaling, and the option to save a lossless file, which means the image is not compromised by the compression. In fact, the Federal Agencies Digital Guidelines Initiative (FADGI) guidelines consider JPEG2000 an acceptable master file format for many items, including printed materials, manuscripts, newspapers, and digitized microfilm.
JPEG2000 uses a different compression algorithm from a standard JPEG, which typically makes a JPEG2000 somewhat smaller than a JPEG of the same dimensions and comparable quality.
Unfortunately, most web browsers do not natively support JPEG2000. Online display typically requires additional software, such as a DAM, so it’s important to verify whether and how this format can be used for your project.
PDF (single or multi-page)
A PDF file is a good choice for multi-page items, especially since Backstage offers a bound PDF option, which can be used to digitally bind together the pages of a periodical issue or volume, book, archival folder, or microfilm reel.
PDFs have the advantage of gathering together images, text, and metadata in an easy-to-view and easy-to-print file. PDFs can be text searchable, with uncorrected text from optical character recognition (OCR) mapped to the same location as the corresponding image of that text. This OCR text can be enabled in both our single-page and multi-page PDF files.
Like a JPEG, a PDF deliverable typically contains compressed images, which again means it’s best for access purposes but not long-term preservation. Many web browsers support PDF viewing, but your users may need a dedicated application, such as Adobe Acrobat Reader, to fully enable the features of this format.
TIFF
The archival TIFF is an uncompressed preservation file. It is the highest quality image deliverable and the standard for archival storage. Conveniently, many image applications that support JPEGs also support TIFFs.
A TIFF maps the image data sequentially at the pixel level, with 8 or 16 bits per color channel. Saved without compression, the TIFF data structure is straightforward and less prone to catastrophic image corruption from bit loss than other formats.
TIFF images are larger than comparable JPEGs or PDFs, so they’re not ideal for web use. File size may also affect your planning for digital storage. You may need to ask whether your institution has storage space and bandwidth for TIFFs to be managed long-term as preservation files.
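For storage planning, a rough estimate is often enough. The sketch below assumes uncompressed pixel data and ignores file headers and embedded metadata. The function name, the page size, the resolution, and the page count are all hypothetical values chosen for illustration, so substitute your own project's numbers.

```python
def uncompressed_size_bytes(width_in, height_in, ppi, channels=3, bits_per_channel=8):
    """Estimate the pixel data size of one uncompressed image, excluding metadata."""
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    return width_px * height_px * channels * bits_per_channel // 8

# Hypothetical project: 10,000 letter-size pages at 400 ppi, 8 bits per RGB channel.
per_page = uncompressed_size_bytes(8.5, 11, ppi=400)
total = per_page * 10_000

print(f"Per page: {per_page / 1e6:.1f} MB")
print(f"Project:  {total / 1e12:.2f} TB")
```

Under these assumptions each page comes to roughly 45 MB and the full set to about 0.45 TB. Passing `bits_per_channel=16` doubles both figures, and any backup copies multiply the total again.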
The TIFF format is an excellent choice if you plan to reproduce high-resolution or large-format prints. Because no detail has been discarded by lossy compression, a TIFF holds up better than a compressed file when enlarged.
When you’re deciding on file formats, remember that we’re happy to provide more than one file for each image. For example, you may want both TIFFs and PDFs, or TIFFs paired with JPEGs may suit your project better. It all depends on what you need and how you plan to use the files.
We can include uncorrected OCR text with any of our image options. Optical character recognition performs best when the text being identified is from a print item like a book or a newspaper. OCR does not work well for handwriting. If you have handwritten items and would like the text for your files, please ask us about transcription options as you plan your project.
Uncorrected OCR can be embedded in your PDF files or extracted and delivered as a separate TXT or XML file.
In processing image files, we use OCR software to read the letters contained in the images and generate the searchable text. The uncorrected OCR is then embedded under the images in a PDF file, allowing you to search, highlight, copy, and paste the text visible on the page.
When delivered separately, page-level TXT files contain the uncorrected OCR text only. XML files contain the OCR text as well as the coordinates for each character in the image, so a compatible digital asset management system can highlight the text on the page image after upload.
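As a rough illustration of how a system might use those coordinates, the sketch below searches a small OCR XML sample for a term and returns the matching box on the page image. The sample is hypothetical and simplified: its attribute names are borrowed from ALTO, it records one box per word rather than per character, and it omits the namespaces a real file would carry. The schema of your actual deliverables may differ, and `find_boxes` is an illustrative helper, not part of any DAM.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; real OCR XML is richer and usually namespaced.
SAMPLE = """<Page>
  <TextLine>
    <String CONTENT="Annual" HPOS="120" VPOS="200" WIDTH="310" HEIGHT="48"/>
    <String CONTENT="Report" HPOS="450" VPOS="200" WIDTH="290" HEIGHT="48"/>
  </TextLine>
</Page>"""

def find_boxes(xml_text, term):
    """Return (left, top, width, height) for each word matching term, ignoring case."""
    root = ET.fromstring(xml_text)
    boxes = []
    for word in root.iter("String"):
        if word.get("CONTENT", "").lower() == term.lower():
            box = tuple(int(word.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
            boxes.append(box)
    return boxes

print(find_boxes(SAMPLE, "report"))
```

A viewer can draw a highlight rectangle at each returned box. A plain TXT file cannot support this, because it carries the words without their positions.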
Some DAMs require a specific file set to be properly ingested, which is a common reason for requesting OCR as XML. While planning, make sure you know what will work best with your DAM or repository and share any file specifications with us so we can match our output to your system’s requirements.
Depending on your choice of digital asset manager, Backstage can build your collection’s online presence so that users can access your unique items remotely.
If your library has CONTENTdm from OCLC, Backstage can build the collection for you and upload it into the CONTENTdm site. This process requires specific deliverables, so if you need a collection built for CONTENTdm, let us know, and we are happy to walk you through that planning.
Additionally, we have experience working with clients who utilize a variety of other DAM platforms, and we are available to discuss your individual requirements.
We also offer METS/ALTO XML, as well as Article Level Segmentation and Table of Contents Segmentation for any institution with software that supports those outputs.
The information given here is not an exhaustive comparison of the pros and cons of every deliverable choice, but rather an overview of the options you have when planning your digital project.
Once again, the main thing to consider with all of these options is making sure your deliverables work best for you and your needs. Don’t forget the two main questions when you start your project planning:
- How are your images going to be used?
- Where are you going to upload them?
We are happy to talk you through your project from start to finish. Reach out to us to get started.