Parsing American Memory Documents
Parsing: Using NSGMLS Parser | Using OmniMark Parser | Sample Error Reports
Installing Parsers: NSGMLS Parser | OmniMark Parser | OmniMark Batch Parsing
American Memory documents are parsed with two parsers: NSGMLS, a freeware parser written by James Clark, and OmniMark, a commercial parser that is part of an SGML programming package. Both parsers are run from the command line. For the convenience of the quality review teams, batch files have been created to eliminate the tedious typing of commands.
For batch parsing to work correctly, copy all files that are to be parsed into the C:\parse directory. The files required for each item are its .sgm and .ent files.
Both parsers are invoked when you click on the desktop icon. NSGMLS runs first and creates a report called parse.err. OmniMark runs second and creates several files: errors.rpt, batch$.out, batch$.log and batch$.lst.
To view errors, look at parse.err and errors.rpt. batch$.out is a record of the messages generated by the OmniMark parser. The other files can be disregarded.
When you have finished parsing and viewing your report files, delete the .sgm, .ent, .log, .out, .err, .rpt and .lst files from the parse directory. <!--Do not delete any file with the .cat extension.-->
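The cleanup step can also be scripted. The following is a minimal Python sketch, not part of the original batch files; the function name clean_parse_dir is hypothetical. It removes only the extensions listed above, so nsgmls.cat and any other .cat file are left in place.

```python
from pathlib import Path

# Extensions named in the cleanup step above. ".cat" is deliberately absent,
# so catalog files such as nsgmls.cat are never touched.
DISPOSABLE = {".sgm", ".ent", ".log", ".out", ".err", ".rpt", ".lst"}


def clean_parse_dir(parse_dir):
    """Delete working files from parse_dir and return the names removed."""
    removed = []
    for path in sorted(Path(parse_dir).iterdir()):
        # Compare extensions case-insensitively (DOS names are often upper case).
        if path.is_file() and path.suffix.lower() in DISPOSABLE:
            path.unlink()
            removed.append(path.name)
    return removed
```

For example, clean_parse_dir(r"C:\parse") would clear the working directory after the reports have been reviewed.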

Using the NSGMLS Parser
Convenient method:
To parse American Memory documents copied into C:\parse:
- Copy .sgm and accompanying .ent files into C:\parse.
- Click on the desktop icon labeled "Parse". A DOS window will open and you will see the batch file processing on the screen. When the text stops and there is no blinking cursor visible, you can close the window and view the parsing report.
- To view the parsing report, open C:\parse\parse.err in Notespad (or another ASCII editor).
- If parse.err contains only a list of filenames, the parse is clean. If there are messages, the files have errors.
Command Line method:
To parse a file located in any directory on your hard drive or on a drive mounted from the server:
In the directory that contains the file, type:
nsgmls -s -g -ferror.log -cC:\parse\nsgmls.cat filename
where filename is the name of the .sgm file.
This parses the file and creates a report file called error.log.
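If you prefer to drive the parser from a script, the same command line can be assembled in Python. This is a sketch under stated assumptions: nsgmls is on the PATH, the catalog is at C:\parse\nsgmls.cat as described on this page, and the function names are hypothetical.

```python
import subprocess


def nsgmls_command(filename, catalog=r"C:\parse\nsgmls.cat", error_log="error.log"):
    """Build the argument list for the nsgmls command line shown above."""
    return ["nsgmls", "-s", "-g", "-f" + error_log, "-c" + catalog, filename]


def run_nsgmls(filename, workdir="."):
    """Run nsgmls from workdir; the report is written to error.log in workdir."""
    completed = subprocess.run(nsgmls_command(filename), cwd=workdir)
    return completed.returncode
```

Passing the arguments as a list avoids quoting problems when a file name contains spaces.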

Using the OmniMark Parser
Convenient method:
To parse American Memory documents copied into C:\parse:
- Copy .sgm and accompanying .ent files into C:\parse.
- Click on desktop icon labeled "Parse." The DOS window will open and you will see the batch file processing on the screen. When the text stops and there is no blinking cursor visible, you can close the window and view the parsing report.
- To view the parsing report, open C:\parse\errors.rpt in Notespad (or another ASCII editor).
- If errors.rpt contains only a list of filenames, the parse is clean. If there are messages, the files have errors. <!--If the file is blank, there were no errors found.-->
Command Line method:
To parse a file located in any directory on your hard drive or on a drive mounted from the server:
- Change to the directory that contains the SGML files to be parsed.
- Type:
sgmlomni sgmlfile
where sgmlfile is the name of the SGML file without the .sgm extension, e.g., sgmlomni x0030
- This parses x0030.sgm and creates the following output files:
- x0030.log - check this file for SGML errors
- x0030.out - a normalized version of the SGML file, without the doctype declaration
- Delete the .log and .out files after you have verified that the file is a valid SGML file.

Sample Error Reports
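The following text excerpt has a single error: the amid element has no end tag. The two parsers report it differently, as shown after the excerpt.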
<!doctype tei2 public "-//Library of Congress - Historical Collections (American Memory)//DTD ammem.dtd//EN" [<!entity % images system "gw07.ent"> %images;]>
<tei2>
<teiheader type="text" creator="National Digital Library Program, Library of Congress" status="new" date.created="1997/09/18">
<filedesc>
<titlestmt>
<amid type="aggitemid">
mgw-gw07
<title>
The Writings of George Washington from the Original Manuscript Sources, 1745-1799. John C. Fitzpatrick, Editor.
</title>
<amcol>
<amcolname>
The Papers of George Washington at the Library of Congress
</amcolname>
<amcolid type="aggid">
</amcolid>
</amcol>
<respstmt>
<resp>
Selected and converted.
</resp>
<name>
American Memory, Library of Congress.
</name>
</respstmt>
</titlestmt>
<publicationstmt>
<p>DC, YYYY.
</p>
<p>
Preceding element provides place and date of transcription only.
</p>
<p>
For more information about this text and this American Memory collection, refer to accompanying matter.
</p>
</publicationstmt>
<sourcedesc>
<lccn>
31-5736
</lccn>
<sourcecol>
Manuscript Division, Library of Congress.
</sourcecol>
<copyright>
Copyright status not determined; refer to accompanying matter.
</copyright>
</sourcedesc>
</filedesc>
<encodingdesc>
<projectdesc>
<p>
The National Digital Library Program at the Library of Congress makes digitized historical materials available for education and scholarship.
</p>
</projectdesc>
<editorialdecl>
<p>
This transcription is intended to have an accuracy rate of 99.95 percent or greater and is not intended to reproduce the appearance of the original work. The accompanying images provide a facsimile of this work and represent the appearance of the original.
</p>
</editorialdecl>
<encodingdate>
YYYY/MM/DD
</encodingdate>
<revdate>
</revdate>
</encodingdesc>
</teiheader>
Error message from NSGMLS:
---------- GW07SG~1.LOG
C:\PARSE\BIN\NSGMLS.EXE:GW07.SGM:9:6:E: document type does not allow element "TITLE" here
C:\PARSE\BIN\NSGMLS.EXE:GW07.SGM:9:6: open elements: TEI2 TEIHEADER[1] FILEDESC[1] TITLESTMT[1] AMID[1] (#PCDATA[1])
C:\PARSE\BIN\NSGMLS.EXE:GW07.SGM:12:6:E: document type does not allow element "AMCOL" here
C:\PARSE\BIN\NSGMLS.EXE:GW07.SGM:12:6: open elements: TEI2 TEIHEADER[1] FILEDESC[1] TITLESTMT[1] AMID[1] (#PCDATA[1])
C:\PARSE\BIN\NSGMLS.EXE:GW07.SGM:19:9:E: document type does not allow element "RESPSTMT" here
C:\PARSE\BIN\NSGMLS.EXE:GW07.SGM:19:9: open elements: TEI2 TEIHEADER[1] FILEDESC[1] TITLESTMT[1] AMID[1] (#PCDATA[1])
C:\PARSE\BIN\NSGMLS.EXE:GW07.SGM:27:11:E: end tag for "AMID" omitted, but OMITTAG NO was specified
C:\PARSE\BIN\NSGMLS.EXE:GW07.SGM:6:0: start tag was here
C:\PARSE\BIN\NSGMLS.EXE:GW07.SGM:27:11: open elements: TEI2 TEIHEADER[1] FILEDESC[1] TITLESTMT[1] AMID[1] (#PCDATA[1])
C:\PARSE\BIN\NSGMLS.EXE:GW07.SGM:27:11:E: end tag for "TITLESTMT" which is not finished
C:\PARSE\BIN\NSGMLS.EXE:GW07.SGM:27:11: open elements: TEI2 TEIHEADER[1] FILEDESC[1] TITLESTMT[1] (AMID[1])
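When a report is long, it can help to separate the lines marked E from their follow-up lines (open elements, start tag was here). The sketch below is a hypothetical helper, not part of the parser distribution. It assumes every message line has the layout seen in the sample above (program path, file name, line, column, optional one-letter marker, text) and that the file name itself contains no drive letter or backslash.

```python
import re

# program:file:line:col:[E:] text, as in the sample report above.
MESSAGE = re.compile(
    r"^(?P<program>.*?):(?P<file>[^:\\]+):(?P<line>\d+):(?P<col>\d+):"
    r"(?:(?P<kind>[A-Z]):)?\s*(?P<text>.*)$"
)


def parse_messages(log_text):
    """Return one dict per message line; header lines are skipped."""
    records = []
    for raw in log_text.splitlines():
        match = MESSAGE.match(raw.strip())
        if match:
            record = match.groupdict()
            record["line"] = int(record["line"])
            record["col"] = int(record["col"])
            records.append(record)
    return records


def errors_only(records):
    """Keep only the lines carrying the E marker."""
    return [r for r in records if r["kind"] == "E"]
```

Applied to the sample above, errors_only would keep the five lines marked E, even though all of them stem from the one missing end tag.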
Error message from OmniMark:
OmniMark V5.0
Copyright (c) 1988-1997 by OmniMark Technologies Corporation.
omnimark --
SGML Error (0104) on line 9 in file gw07.sgm:
An end tag must not be omitted when OMITTAG is NO.
The element is "AMID".
There was 1 SGML error detected.
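The OmniMark messages can be collected in a similar way. This sketch is based only on the single sample above, so the layout it expects (an "SGML Error" header line followed by detail lines, ended by a blank line or the summary line) is an assumption, and the function name is hypothetical.

```python
import re

# Header line as it appears in the sample: code, line number and file name.
HEADER = re.compile(
    r"^SGML Error \((?P<code>\d+)\) on line (?P<line>\d+) in file (?P<file>.+):$"
)


def parse_omnimark_report(text):
    """Return one dict per SGML error, with its detail lines attached."""
    errors = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            current = {
                "code": match["code"],
                "line": int(match["line"]),
                "file": match["file"],
                "detail": [],
            }
            errors.append(current)
        elif current is not None:
            # A blank line or the "There was/were ..." summary ends the entry.
            if not line or line.startswith("There "):
                current = None
            else:
                current["detail"].append(line)
    return errors
```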

Installation of Parsers
Installing the NSGMLS parser
- All files for parsing with NSGMLS can be found on the NDLP NT server that is usually designated as X:\ in your Windows Explorer.
- Make directory C:\parse.
- Copy file named sp1_1_1.zip from X:\prdction\parsers. This file is originally from www.jclark.com where you can find more detailed technical information about this parser.
- Extract files from sp1_1_1.zip to C:\parse. Copy file nsgmls.cat from X:\prdction\parsers\nsgmls\ to C:\parse.
- Modify autoexec.bat file by adding ;C:\parse\bin to the PATH line. (Note: You must restart your machine after modifying autoexec.bat.)
- Copy amparse.bat (RIGHT click, Save File As...) into C:\windows. Make a shortcut and drag it to the desktop. Name it whatever you like. (I named it "Parse Me")
- RESTART your machine, if you did not do so after editing autoexec.bat.
- Copy .sgm and accompanying .ent files into C:\parse.
- Run the batch file by clicking on the desktop shortcut "Parse".
- View the parse.err file in an ASCII editor, such as Notespad, Notepad, CodeWright or Rulesbuilder.
Assumptions:
- There is a directory on the hard drive called C:\sgmldtds\entities that contains all publicly declared entities.
- There is a directory on the hard drive called C:\sgmldtds\ammem2 that contains the ammem.dtd and amsgml.dec files. (Note: You may have to create this directory or modify the existing C:\sgmldtds\ammem - if so, RIGHT click on the ammem directory in Windows Explorer to rename the directory.)
- The batch file assumes that the nsgmls.cat file is in the same directory with the files to be parsed and that directory is named C:\parse.
All files needed to install and run the NSGMLS parser are available at X:\prdction\parsers.

Installing OmniMark and Using OmniMark Parsing Program
(Prepared by Marla Banks)
- Install the OmniMark executable:
- Create a directory named omnimark on the C:\ drive.
- Download the OmniMark 5.0 software from the NT server. (The NDLP NT server is usually designated as the X:\ drive in your Windows Explorer.) Copy X:\prdction\parsers\om_ide-s.exe to C:\temp or some other temporary location.
- Install OmniMark C/VM 5.0 by double-clicking on the file om_ide-s.exe.
- Use the InstallShield wizard to help you install the executable.
- When asked to Select Install Components, click on the third icon labeled "OmniMark C/VM." Its description states, "(allows you to compile and run OmniMark source programs from the command line)."
- Respond "yes" to the license agreement.
- When asked for a "Destination" for the OmniMark files, use the BROWSE button to locate your C:\omnimark directory. Select C:\omnimark as your destination directory.
- Complete the installation by selecting the default options until you see the message that the installation is completed.
- Modify autoexec.bat file by adding ;C:\OMNIMARK to the PATH line. (Note: You must restart your machine after modifying autoexec.bat.)
- If you have an earlier version of OmniMark, continue editing the autoexec.bat file. Delete the PATH entries for the earlier OmniMark programs; they are likely to be ;C:\OMCI40;C:\OMCI40\BIN or ;C:\OMNIV3R1. After the last line of text, delete the license statement:
set OMNIMARK_LICENSE_FILE=7476@rs6.loc.gov
You may delete any old OmniMark files and directories.
(Note: You must restart your machine after modifying autoexec.bat.)
- Copy all the files from X:\prdction\parsers\omnimark into C:\omnimark. These files are all the OmniMark programs and supporting files you will need for parsing.
- The following files are needed in the following directories on the C:\ drive to run the parsing program. They are available from the NT server in the X:\prdction\parsers directory:
- - \sgmldtds\ammem2\ammem.dtd
- - \sgmldtds\ammem2\amsgml.dec
- - \omnimark\omni.cat
- - \omnimark\catvalid.xom
- - \omnimark\socat.xin
- - \omnimark\sgmlomni.bat
- - ISO character entity files should be in C:\sgmldtds\entities.
- RESTART your machine, if you did not do so after editing autoexec.bat.
- Test with a couple of sample files.
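Before testing, the file checklist in the steps above can be verified with a short script. This is a minimal sketch; the names REQUIRED and missing_files are hypothetical. The ISO character entity files are not checked by name because this page does not list them.

```python
from pathlib import Path

# Files listed in the installation steps, relative to the root of the C:\ drive.
REQUIRED = [
    ("sgmldtds", "ammem2", "ammem.dtd"),
    ("sgmldtds", "ammem2", "amsgml.dec"),
    ("omnimark", "omni.cat"),
    ("omnimark", "catvalid.xom"),
    ("omnimark", "socat.xin"),
    ("omnimark", "sgmlomni.bat"),
]


def missing_files(root="C:\\"):
    """Return the required files that are not present under root."""
    base = Path(root)
    return [
        "\\".join(parts)
        for parts in REQUIRED
        if not base.joinpath(*parts).is_file()
    ]
```

An empty list from missing_files() means every file named in the checklist is in place.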

OmniMark Batch Parsing Program and DOS Batch Files:
The following files are needed for batch processing. All of these files are found on the NT server in X:\prdction\parsers. Copy the files listed below into the C:\omnimark directory.
- xombatch.xom - main OmniMark program that parses batches of SGML files
- dtdparse.xin - OmniMark code called by xombatch.xom; it processes the SGML Open catalog for Public Identifiers
- amombat.bat - DOS batch file for running xombatch.xom
- ombatch.bat - DOS batch file for running xombatch.xom for generic parsing; requires an argument for the SGML declaration
Output:
Two files are created for each run, errors.rpt and batch$.out.
errors.rpt lists each file in the batch and the number of SGML errors in each file.
batch$.out lists the SGML error messages for each file.
Assumptions:
- Both batch files assume that the OmniMark programs are in the C:\omnimark directory.
- Both batch files assume that the catalog file (omni.cat) is in the C:\omnimark directory.
- amombat.bat assumes that the American Memory SGML declaration is in the C:\sgmldtds\ammem2 directory (i.e., C:\sgmldtds\ammem2\amsgml.dec).
- ombatch.bat assumes that the beginning path name for the SGML declaration argument is C:\sgmldtds. <!--This is used for parsing with any DTD, not solely ammem.dtd -->
For the SGML declaration in C:\sgmldtds\html\html.dec, use html\html as the argument.
For the SGML declaration in C:\sgmldtds\ead\eadsgml.dec, use ead\eadsgml as the argument.
- C:\omnimark is on the path statement in autoexec.bat so that the batch files can be invoked from any directory.
Examples for amombat.bat
- amombat law*.sgm
- amombat *.sgm
- amombat t*.sgm
Examples for ombatch.bat
- ombatch ammem2\amsgml law*.sgm
- ombatch ammem\amsgml old*.sgm
- ombatch html\html *.htm
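The rule for the ombatch.bat declaration argument (the path below C:\sgmldtds, without the .dec extension) can be expressed as a small function. This is an illustrative sketch; declaration_argument is a hypothetical name and is not part of the batch files.

```python
from pathlib import PureWindowsPath


def declaration_argument(declaration, base=r"C:\sgmldtds"):
    """Turn a full path to an SGML declaration into the ombatch argument."""
    path = PureWindowsPath(declaration)
    if path.suffix.lower() != ".dec":
        raise ValueError("expected a .dec file: " + str(path))
    # relative_to raises ValueError if the file is not below the base directory.
    relative = path.relative_to(PureWindowsPath(base))
    return str(relative.with_suffix(""))


print(declaration_argument(r"C:\sgmldtds\html\html.dec"))    # html\html
print(declaration_argument(r"C:\sgmldtds\ead\eadsgml.dec"))  # ead\eadsgml
```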