General comments on digital reproductions of textual materials for American Memory
Introduction
Applicants for DLI-Phase II who would like to make use of the textual materials converted by
the National Digital Library Program (NDLP) for American Memory should be aware of the
heterogeneity in digital format.
The resource includes different genres of original material, including pamphlets, typed and
handwritten pages, theater programs, sheet music, and full-length books. They have been
converted at various times since 1990, during a period when scanning and delivery technology
advanced rapidly and the NDLP's understanding of what was desirable and feasible changed. The
earliest conversions were made with dissemination on CD-ROM in mind, since the World Wide
Web had not yet provided a network-based mechanism for accessing multimedia materials. The
NDLP believes that heterogeneity presents a challenge that digital libraries must be able to handle;
developing technology will continue to provide better solutions for capture, storage, and
representation, but materials captured in the 1990s must still be accessible in years to come.
Some documents are available online only as page-images, others only as searchable text
marked up in SGML (Standard Generalized Markup Language), and others in both forms. The
resolution and digital format of page-images vary. In contrast to some other digital library
projects, the page-images were all scanned without disbinding the source volumes; this constrains
the spatial resolution that can be achieved, and most page-images have been captured at between
150 and 300 dots per inch (dpi). In some cases, separate images have been created for illustrations
using different digital procedures or formats, to avoid the moiré effects obtained by
scanning printed half-tones and to provide images that can be printed acceptably on inexpensive
printers. The choice of approach has been affected by several factors: the state of technology
at the time of capture; characteristics of the original documents (such as number of pages and
degree of graphical presentation); physical characteristics of the original material or surrogate
(such as microfilm) which could be scanned; and cost.
More detailed information relating to the Library's current practices can be found in a 1996
request for proposals, Digital Images from Original Documents -- Text Conversion and
SGML-Encoding. Sections C (Description/Specification/Work Statement) and J (Attachments)
contain the technical details.
Page-images
Typography or line art is often successfully reproduced in a bitonal image, which
usually also provides adequate printed output. The bitonal images used for American Memory
have typically been captured at 200-300 dpi and are stored as TIFF images with ITU (formerly
CCITT) Group 4 compression. Some materials converted earlier use Group 3 compression. For
documents that have not also been converted to searchable text, there is usually also a
screen-sized GIF or JPEG image.
The Library has been experimenting with tonal (color and grayscale) reproduction
of manuscript and older printed documents, partly after noting shortcomings in some bitonal
(black and white) images produced during the American Memory pilot. The tonal images are
usually at 8 bits per pixel for grayscale and 24 bits for color, with a spatial resolution of 200-400
dpi. The highest quality image may be an uncompressed TIFF image or a lightly compressed
JPEG.
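To give a rough sense of what these bit depths and resolutions imply for storage, the following is a minimal sketch that estimates the uncompressed size of a single page-image. The 6 x 9 inch page size and the function name `uncompressed_bytes` are assumptions chosen for illustration and do not come from the Library's specifications; the estimate also ignores file headers and any row padding.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Estimate the uncompressed raster size, in bytes, of a scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Total bits divided by 8; ignores headers and row padding.
    return width_px * height_px * bits_per_pixel // 8


# Hypothetical 6 x 9 inch page scanned at 300 dpi.
for label, bits in [("bitonal", 1), ("grayscale", 8), ("color", 24)]:
    size = uncompressed_bytes(6, 9, 300, bits)
    print(f"{label:9s} {size / 1_000_000:6.2f} MB")
```

Under these assumptions a 24-bit color image is 24 times the size of a bitonal image of the same page before compression, which is one reason cost and the state of technology influence the choice of approach.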
For an illustration or a table in a book, there may be an additional image (either of the
illustration or of the page containing it). The illustrations may be available both in a
printing-quality image and as a GIF sized for presentation inline in an HTML version of the
searchable text.
Searchable text
The Library is converting a wide array of documents to searchable form, including books,
pamphlets, legal materials, serial
articles and manuscripts. The texts are encoded with SGML, using the American Memory document type definition
(DTD), which is based on the guidelines for humanities texts developed by the Text
Encoding Initiative (TEI).
The American Memory Document Type Definition (ammem.dtd) was developed to
accommodate a broad range of materials by conceptualizing a generalized humanities text, rather
than seeking to describe specific document types and subtypes, or text genres. Simple, streamlined
models and flexible structure are characteristic of the American Memory DTD. The Library's
approach to tagging identifies basic text structure, but little content-related matter. The American
Memory DTD is not optimized to accommodate "value-added" material such as
editorial comments and annotations, and there are no plans at this time to expand the Library's
approach in this direction.
The Library's transcription requirement for contractors has usually been 99.95 percent
accuracy compared to the original. [In future contracts, 99.995 percent accuracy will be required
for some items.] The Library does not stipulate the method to be used for text conversion, only
that the final product meet its accuracy requirements. To date, almost all documents have been
rekeyed; OCR has, so far, performed rather poorly with historical materials that contain a wide
variety of non-standard type fonts and typographical design.
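The practical difference between the two accuracy thresholds can be illustrated with a little arithmetic. The sketch below assumes that accuracy is measured per character and that a page holds about 2,000 characters; both are assumptions made for illustration, not figures stated by the Library.

```python
from decimal import Decimal


def expected_errors(characters, accuracy_percent):
    """Return the number of character errors permitted at a given accuracy.

    accuracy_percent is passed as a string so that Decimal keeps it exact.
    """
    error_rate = (Decimal(100) - Decimal(accuracy_percent)) / Decimal(100)
    return characters * error_rate


# Hypothetical page of 2,000 characters.
for accuracy in ("99.95", "99.995"):
    errors = expected_errors(2000, accuracy)
    print(f"{accuracy}% accuracy allows {errors.normalize()} error(s) per page")
```

On these assumptions, 99.95 percent accuracy permits about one error in every 2,000 characters, while 99.995 percent permits about one in every 20,000, a tenfold tightening.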
Links from the marked-up texts to page-images and illustrations use SGML entity references
to provide a level of indirection. Links are marked by one of three elements (corresponding to
pages, illustrations, and tables) and identified in an entity attribute that names the external entity
within which the graphic image is stored. Each document instance, stored in a file with the
extension ".sgm", has an associated file, with the extension ".ent", that
contains the entity declarations for the system entities containing the graphics; these entity
declarations, in effect, map the entity attribute values in the document instance to the filenames of
the referenced digital objects. For each document, the .sgm and .ent files and all image files are
stored in a single directory. A style-sheet (ammem.ssh) and navigator (amhead.nav) and
other support files appropriate (and necessary) for use with Panorama
(a viewer for SGML) are also available.
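The indirection described above can be sketched as a small lookup: read the entity declarations from an ".ent" file and map each entity name to its image filename. The sample declarations, entity names, filenames, and notation names below are hypothetical and follow only the general SGML syntax for external data entities; the actual contents of American Memory ".ent" files may differ.

```python
import re

# Matches declarations of the general form:
#   <!ENTITY name SYSTEM "filename" NDATA notation>
ENTITY_DECL = re.compile(
    r'<!ENTITY\s+(?P<name>[\w.\-]+)\s+SYSTEM\s+"(?P<file>[^"]+)"'
    r'(?:\s+NDATA\s+(?P<notation>[\w.\-]+))?\s*>',
    re.IGNORECASE,
)


def parse_entities(ent_text):
    """Map each declared entity name to the filename it refers to."""
    return {m.group("name"): m.group("file")
            for m in ENTITY_DECL.finditer(ent_text)}


# Hypothetical contents of an .ent file.
sample_ent = '''
<!ENTITY page0001 SYSTEM "0001.tif" NDATA TIFF>
<!ENTITY illus0001 SYSTEM "0001i.gif" NDATA GIF>
'''

entities = parse_entities(sample_ent)
# Resolve the entity attribute value found in the document instance.
print(entities["page0001"])
```

Because the document instance refers only to entity names, image files can be renamed or converted to another format by editing the ".ent" file alone, without touching the marked-up text.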