Text Quality Review Tools
 
  The Text_Check Report

A parser does not check element content, but that content is often regular enough to be checked by a program or script that finds the contents of an element and compares them with a standard expectation.
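
As an illustration of that idea only, here is a minimal Python sketch of a content check. It is not how text_check itself is written. The function names, the sample text, and the four-digit rule for <controlpgno> are all assumptions made for the example, and the sketch assumes every element has an explicit end tag.

```python
import re

def find_element_contents(text, tag):
    """Return the stripped content of every <tag>...</tag> pair in text.

    Assumes explicit end tags; SGML documents that omit them need a
    different approach.
    """
    pattern = re.compile(
        r"<%s\b[^>]*>(.*?)</%s\s*>" % (re.escape(tag), re.escape(tag)),
        re.IGNORECASE | re.DOTALL,
    )
    return [content.strip() for content in pattern.findall(text)]

def check_element(text, tag, expected):
    """Return the contents of <tag> that do not fully match the regex."""
    rule = re.compile(expected)
    return [c for c in find_element_contents(text, tag) if not rule.fullmatch(c)]

# Hypothetical sample; the four-digit rule is an assumption, not project policy.
sample = (
    "<amid>gw07</amid>\n"
    "<controlpgno>0001</controlpgno>\n"
    "<controlpgno>00x2</controlpgno>"
)
print(check_element(sample, "controlpgno", r"\d{4}"))
```

Returning only the non-matching contents keeps the report short: an empty list means every occurrence met the expectation.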

Text_check is a Perl script written by Jane Bossert. It is run on a UNIX server. Its purpose is to verify and/or analyze the content of some elements in documents encoded with the American Memory DTD (ammem.dtd).

The text_check script is located in Martha Anderson’s home directory on rs6. If you cd to /u/mande, you can copy text_check to your own home directory on rs6.

How to Run:

  1. Change to the directory where your copy of the script is stored
  2. Type:
    text_check directoryname batchname

    Example:

    text_check llac sigll001
  3. This checks all .sgm files found in the directories below directoryname and creates a .log file that is your report. Example: the command above checks all .sgm files it finds in the directories that are in llac.

    Warning: If there are ^M characters (carriage returns) in your files, the <controlpgno> count will not work.

  4. To read the report, open the .log file in any text editor.

    Sample report:

    /u/mande/gw07.sgm
    <amid>: 0 errors
    &ldquo; 109
    &rdquo; 107
    &lsquo; 2
    &rsquo; 2
    &apos; 717
    &mdash; 307
    <omit>: too many occurrences (6)
    2 character omits: 0
    <idinfo>: 1 occurrence
    <handwritten>: 0
    <stamped>: 3
    <hsep>: 505
    <controlpgno>: sequence correct
    Total number pages found: 537
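
Because ^M characters interfere with the <controlpgno> count, it can help to look for them before running text_check. The following Python sketch is one possible way to do that. The function name is hypothetical, and the recursive walk over .sgm files is an assumption about which files matter, based on the description in step 3.

```python
import os

def files_with_carriage_returns(root):
    """Return the paths of .sgm files under root that contain a ^M (CR) byte."""
    flagged = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".sgm"):
                continue
            path = os.path.join(dirpath, name)
            # Read as bytes so the carriage returns are not translated away.
            with open(path, "rb") as handle:
                if b"\r" in handle.read():
                    flagged.append(path)
    return flagged

# Example call, using the directory name from the example above:
#     for path in files_with_carriage_returns("llac"):
#         print(path)
```

Any file the function lists should have its carriage returns removed before the report is trusted.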

Text_Check Report Explanation

Each entry below gives the report line, what it reports, and the action to take.

Report Line:
    /u/mande/gw07.sgm
Reports:
    Path and file name.

Report Line:
    <amid>: 0 errors
Reports:
    That the item identifier part of the content of the <amid> element matches the file name.
Action:
    If it does not match, check the target and title page to verify the file name and amid values.

Report Line:
    &ldquo; 109
    &rdquo; 107
    &lsquo; 2
    &rsquo; 2
    &apos; 717
    &mdash; 307
Reports:
    The number of uses of defined character entities. [Note: Special entity sets may be defined for different projects.]
Action:
    If the counts for a pair vary greatly, this may indicate that an entity is misused. Example: &rsquo; used instead of an apostrophe (&apos;).

Report Line:
    <omit>: too many occurrences (6)
Reports:
    Number of uses of the <omit> element when it exceeds two.
Action:
    If numerous, may indicate poor quality page images. If fewer than 20, you may wish to decipher and complete the omitted text.

Report Line:
    2 character omits: 0
Reports:
    Number of occurrences of two or more consecutive question marks (??), which usually indicate that characters are omitted.
Action:
    If numerous, may indicate poor quality page images. If fewer than 20, you may wish to decipher and complete the omitted text.

Report Line:
    <idinfo>: 1 occurrence
Reports:
    Number of uses of the idinfo type attribute on the <div> element.
Action:
    If not what is expected for your project, investigate. May indicate two volumes bound together or a misinterpretation by the encoders.

Report Line:
    <handwritten>: 0
    <stamped>: 3
    <hsep>: 505
Reports:
    Number of uses of special text elements.
Action:
    If not what is expected for your project, investigate.

Report Line:
    <controlpgno>: sequence correct
Reports:
    That controlpgno values are in one-up sequence.
Action:
    If incorrect, scrutinize the text more closely; this could indicate a page that was keyed locally. The text should be returned for reordering and renumbering of controlpgno.

Report Line:
    Total number pages found: 537
Reports:
    The number of controlpgno elements.
Action:
    If not equal to or greater than the number of expected pages, investigate. May indicate that images were missing.
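
Two of the checks described above, the paired entity counts and the controlpgno sequence, can be sketched in Python as follows. This is an illustration under stated assumptions, not the text_check code: the function names and the tolerance parameter are hypothetical, and the sketch assumes the content of each <controlpgno> element begins with a plain number.

```python
import re

# Opening/closing entity pairs whose counts are expected to be close.
QUOTE_PAIRS = [("&ldquo;", "&rdquo;"), ("&lsquo;", "&rsquo;")]

def unbalanced_pairs(text, tolerance=0):
    """Return (open, close, open_count, close_count) for each pair whose
    counts differ by more than tolerance (a hypothetical threshold)."""
    problems = []
    for opener, closer in QUOTE_PAIRS:
        n_open, n_close = text.count(opener), text.count(closer)
        if abs(n_open - n_close) > tolerance:
            problems.append((opener, closer, n_open, n_close))
    return problems

def controlpgno_breaks(text):
    """Return (previous, current) pairs where the sequence does not step by one.

    Assumes each <controlpgno> start tag is followed by a number.
    """
    values = [
        int(v)
        for v in re.findall(r"<controlpgno[^>]*>\s*(\d+)", text, re.IGNORECASE)
    ]
    return [(a, b) for a, b in zip(values, values[1:]) if b != a + 1]

# Hypothetical sample with one unclosed quotation and one skipped page.
sample = (
    "<controlpgno>0001</controlpgno>&ldquo;Hi&rdquo; "
    "<controlpgno>0002</controlpgno>&ldquo;unclosed "
    "<controlpgno>0004</controlpgno>"
)
print(unbalanced_pairs(sample))
print(controlpgno_breaks(sample))
```

Reporting the exact pair of values at each break makes it easier to find the place in the text where pages need reordering or renumbering.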


NDLP Production Group -- Text_check Report Instruction Sheet -- ma 12/97