Recovering and Preserving the Record

On several occasions in 1990 and 1991, "archiving" was done in the ICMS (Integrated Case Management System) database. An archiving run was expected to produce several things, among them:

A tape containing ASCII representations, as produced by SQL, of all tuples belonging to archived cases, from all relations where such tuples appear. This is the primary product of the archive run, as from it (and only from it) the original contents of the database with respect to an archived case can be recreated for any purpose.
A file containing "docket sheets" for the cases just archived. A docket sheet is a type of report produced from the contents of the database, and as such can (theoretically) be recreated from the information on the archive tape itself, by reloading it into a database and running the report. Because docket sheets are often referred to, the plan was to run these reports for all cases to be archived, and save them for reference after the case data were removed from the database.
A file containing docket sheets for all archived cases, to be produced by merging the file of docket sheets for the latest run into a cumulative file similarly produced for all previous runs. This file could be placed on tape and sent to a computer-output-microfiche service to support one form of ready reference to archived cases.
"Case Index" and "Party Index" files containing, respectively, one-line entries ordered by case number and giving title and summary information for a case, and one-line entries ordered by party name and listing associated cases.
An "inventory" file listing case numbers for archived cases and the labels of the corresponding archive tapes, allowing the proper tape to be located for restoration of a case's data.

A problem is first suspected

In October 1991, a new Assistant Systems Manager was appointed and began, among other duties, to try to locate all of the archiving end-products that should have existed.

By early 1992, it was clear that a cumulative, merged file of docket sheets was not in evidence. The only docket sheets to be found for archived cases would be in the separate files created during the several archive runs. The process of merging those collections retroactively into a single file would be complicated by the fact that numerous cases had appeared in an early archive, been later restored and modified, and later archived again. Thus, there would be some unquantified overlap of docket sheets contained in the several files. The merged file would need to contain the most-recently-generated docket sheet for each case, regardless of which file it appeared in. The existing software to merge docket sheet files was not equal to the task.

The need to merge several large docket files by independently comparing the dates of the contained docket sheets, to efficiently maintain and update the merged collection once it was created, and to store and manipulate the collection using commonly-supported formats and standard tools, led to the creation in May 1992 of the district's own docket-fiche software.
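The merge rule described above, keeping only the most recently generated docket sheet for each case across several overlapping files, can be sketched briefly. This is a minimal illustration in Python, not the court's docket-fiche software; the record layout (case number, generation date, text), the case-number format, and the name merge_latest are all assumptions made for the example.

```python
from datetime import date
from typing import Dict, Iterable, NamedTuple


class DocketSheet(NamedTuple):
    case_number: str   # hypothetical format, e.g. "90-12345"
    generated: date    # date the docket report was run
    text: str          # the printable docket sheet itself


def merge_latest(*files: Iterable[DocketSheet]) -> Dict[str, DocketSheet]:
    """Merge several docket files, keeping the newest sheet per case."""
    merged: Dict[str, DocketSheet] = {}
    for sheets in files:
        for sheet in sheets:
            kept = merged.get(sheet.case_number)
            # Replace only when this sheet is strictly newer than the
            # one already kept for the same case.
            if kept is None or sheet.generated > kept.generated:
                merged[sheet.case_number] = sheet
    return merged
```

Because the comparison is made on the generation date rather than on the order in which files are read, a case that was archived, restored, modified, and archived again ends up represented by its later docket sheet whichever file is processed first.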

Once this software was available, the Court could begin to merge the existing docket sheet files into a single, up-to-date collection. The next difficulty encountered was that, while some of the clearly-labelled archive tapes in the library contained docket sheet files presumably created during the corresponding archive run, other archive tapes did not contain docket sheet files at all. A search was conducted for other docket sheet files, and some were found on other tapes in the library which carried various more or less informative labels. Some of the files found appeared to be duplicates of files found earlier, and some did not.

In the summer of 1992, the locally-developed docket-fiche software was used to combine all of the docket-sheet files that had been unearthed in the tape library into a single, merged docket sheet collection. Because of the variety of tapes from which docket files had been retrieved, it was plausible but not proven that the merged collection just created actually contained all dockets for archived cases. In any event, the collection was used to create a new set of computer-output-microfiche for the court.

An index is needed

A new set of docket microfiche was a step in the right direction, but its utility to the public was limited by the lack of an accompanying set of index fiche. Furthermore, if a complete index of archived cases could be recovered in machine-readable form, it could be used easily to determine whether the docket-sheet collection just created was in fact complete. Work therefore began on retrieving the index files created during the previous archiving sessions.

This project was not unlike that of collecting the docket sheet files themselves; several tapes in the library, as well as on-line storage areas on the court's computers, were searched for index files. By the time the search was abandoned and duplicate copies of index files retrieved from different places had been identified and eliminated, five case index files had been located, but only three party index files.

Closer examination revealed that one of the five case index files was in fact a party index which had been mistakenly given a file name starting with cs instead of pt. This discovery brought the number of case and party index files recovered to four each, which was comforting, but did not prove that all index information had been recovered. The fact that one of the index files had been misnamed suggested that the procedures that had been used to archive cases had been minimally automated, with great opportunity for human error, casting some doubt on the reliability of the index files recovered. Finally, the recovered index files did not all contain the same set of fields; some lacked fields that were deemed desirable in an index. To have an index of known reliability, containing all the desired fields, it would be necessary to regenerate an index from the ultimate authority: the archived data tuples themselves, contained on the archive tapes. At the time, the court had no way to do so without restoring the entire contents of one or more archive tapes into disk files, which would have required prohibitive amounts of disk space.

Archiving and restoration revisited

In April 1993, the Case Management application was moved from the Unisys 5000 computer on which it had been running to a new Dell 466SE. The Dell system lacked a 9-track tape drive, and so could not read the court's existing archive tapes. This posed an immediate difficulty for the occasional need to restore an archived case to the database for further docketing. It posed a further difficulty in that future archive runs would not be able to produce the same tape format as the earlier runs. These difficulties suggested that the time was right to revisit the procedures used in the court to archive and restore cases.

Unlike archiving, an infrequent operation which had been performed in the past, restoration continued to be done periodically on request from users who needed to update archived cases. It was easily seen that restoration was a complex and error-prone process which required an operator to obtain database administrator privilege, read a number of files from a tape onto disk, edit some by hand, and invoke a script which would load them into the database. Tuples which could not be loaded provoked diagnostic messages which did not clearly identify a particular case, relation, or tuple. The tape from which to restore the files was determined by running a script which looked for the desired case number in the "inventory" file produced during the archive runs. However, the information from the inventory file was suspect; often a label listed in the file would appear on more than one tape in the library, and the correct tape on which to find the case would then be determined by lore in the head of an old-timer.

The earlier difficulty in finding docket sheet and index files, the discovery of the misnamed index file, and the unreliability of the inventory file all reinforced the impression that archiving must have been done as an ad-hoc and unsystematized process highly susceptible to human error. Because the new 466SE computer offered generous disk space, it was resolved to configure the database with much more space than it had enjoyed on the Unisys, to postpone the need to archive again. The time thus bought, while finite, would be used to develop reliable, systematic procedures for archiving and restoration.

Because of the amply-demonstrated possibility that index files and inventory files might become corrupted or lost and no longer correspond to the actual case archives on tape, the new archiving design called for an archive tape format sufficiently self-describing to permit accurate index and inventory files to be regenerated at any time directly from the contents of the tape. An improved restoration procedure could then also read directly from the tape and restore a case without the need to create intermediate files on disk and edit them by hand. Naturally, it would be necessary to convert all of the court's existing 9-track archive tapes into the new format, which would be written on 8mm tapes usable with the new system. For all of these tasks, it would be desirable to process information contained in files archived to tape by reading it off the tape directly rather than by restoring all of the files to a disk first. In the case of mass conversion of the existing archives, the disk space and time required would be completely prohibitive in the absence of such an ability.

Tool construction begins

May through November of 1993 were devoted to the development of software tools to provide the needed ability. A tool called pax (Portable Archive Exchange) was freely available from the USENIX Association, and a version of find was available from the Free Software Foundation. The find tool can select files and choose operations to perform on them based on characteristics of the files themselves. Between May and July of 1993, this tool was modified to function as a coprocess, running concurrently with, and communicating with, another piece of software. The pax tool was adapted to exploit find as a coprocess, sending it information about archive contents and getting instructions back. The result was a version of pax which can choose and apply any desired operation to data contained in archived files "on the fly," i.e., while the tape is being read, without first extracting files from the tape onto a disk.

From July through November of 1993, another Free Software Foundation tool was adapted to play a part in the court's archiving effort. The awk language (named for its developers, Aho, Weinberger, and Kernighan) is a general-purpose programming language with features that make it especially convenient for dealing with large collections of data in text format. This language can express manipulations that could not be easily expressed using only pax, find, and the shell. The FSF's implementation of awk, gawk, was extended with a generalized coprocess feature, an exec() builtin, the ability to perform many calculations with calendar dates and times, and the ability to exchange data in the format developed for the communication between pax and find. Completion of this tool gave the court the ability to perform rather sophisticated processing of information selectively read directly from an archive tape, in either (or a combination) of the two commonly-supported archive formats, cpio and ustar.

Making Courtran dockets accessible

Concurrently with the gawk enhancements, a project was underway in July and August of 1993 to convert a collection of Courtran dockets into the same format developed in May 1992 for the court's local dockets. Courtran was a centralized docketing system which had been provided by the Administrative Office until mid-1992. The court had abandoned Courtran several years earlier, for purposes of civil docketing, in favor of the ICMS software run locally, but criminal cases had continued to be maintained on Courtran until its decommissioning. As a result, the court did not have any local machine-readable record of criminal dockets except on tapes provided by the AO after Courtran's demise. The tapes contained dockets in a printable format rather different from that produced by ICMS. The project was to scan the tape, extract case numbers and dates, and produce a data stream with ICMS-like headers and trailers inserted between the original dockets. That data stream could be fed into the court's own docket maintenance software to merge the Courtran dockets into the existing collection, producing a single complete reference source.

The project was complicated by the fact that Courtran case numbers did not include a component identifying the divisional office where the case was filed, while the office code is an integral part of an ICMS case number and is expected by the court's local docket software. Conventions had for the most part been observed in which the first digit of the case sequence number had to do with the divisional office, but the conventions had varied from year to year. By trial, error, and repeated consultation with the Administrative Manager, a set of 15 rules was devised by which the proper divisional office could be inferred from the case year and first sequence-number digit, for all but five of the Courtran dockets. The five exceptions were handled by five more rules triggered by specific case numbers.
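The shape of such a rule set, general rules keyed on year range and first sequence digit with case-specific exceptions taking priority, might look like the following Python sketch. The court's actual rules are not reproduced in this account, so every year range, digit, office code, and case number below is a hypothetical placeholder, as is the name infer_office.

```python
from typing import Dict, List, Optional, Tuple

# HYPOTHETICAL rules for illustration only; the court's actual rules
# are not reproduced in this account.
# Each rule: (first year, last year, first sequence digit, office code).
GENERAL_RULES: List[Tuple[int, int, str, str]] = [
    (80, 84, "7", "2"),
    (85, 92, "7", "4"),   # same digit, different office in later years
    (80, 92, "4", "5"),
]

# HYPOTHETICAL exceptions keyed by exact (year, sequence number).
EXCEPTIONS: Dict[Tuple[int, str], str] = {
    (83, "70012"): "5",
}


def infer_office(year: int, sequence: str) -> Optional[str]:
    """Return the inferred office code, or None if no rule applies."""
    # Case-specific exceptions take priority over the general rules.
    if (year, sequence) in EXCEPTIONS:
        return EXCEPTIONS[(year, sequence)]
    for first, last, digit, office in GENERAL_RULES:
        if first <= year <= last and sequence.startswith(digit):
            return office
    return None
```

Returning None for an unmatched case, rather than guessing, leaves a list of leftovers that can be reviewed by hand, which is one plausible way the handful of exceptions would come to light.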

The rules were encoded in an awk program which implemented a little language for rules about case numbers. The stage was set to complete the conversion, but performance of the program, using the nawk interpreter supplied with the UNIX system, was unacceptable, rapidly degenerating to a few records per minute. After several hours, most of the first tape was still on the feed spool. The awk program was not complex; rather, it appeared that an algorithm used in the nawk interpreter had a pathological worst case somehow exercised by the associative array operations in the program.

The work on extending gawk was not far along at that point, but the sources had been obtained and it was possible to build a stock gawk from the unmodified copies. The same Courtran conversion was attempted again after merely replacing nawk with gawk, and the entire set of tapes converted successfully in 27 minutes, suggesting that gawk used a different algorithm.

Old archive tapes converted

By September of 1993, enough of the gawk extensions had been completed to build a tool, combining pax, find, and gawk, to begin the conversion of the Court's entire set of 9-track archive tapes to a newly-devised Self-Describing Archive format on 8mm tape. The conversion was completed in November of 1993. The SDA format stored the archived case information in such a way that it could be easily scanned by future tools--to recreate a complete, accurate inventory file for use in case restoration; to create full, consistent index files for public access; and to create a much more efficient case restore procedure. All of these tasks could be accomplished easily by reading the needed information directly off the tape, using the pax, find, and gawk tools just developed.

The advent of CHASER

In late 1993 and early 1994, archiving/public access projects were interrupted for the adaptation of CHASER (CHAmbers Access to Selected Electronic Records), an ICMS add-on released by the Administrative Office. In addition to numerous efficiency modifications enabling the CHASER report extractors to run in acceptable amounts of time on the court's large database, it was necessary to develop CHASER print options that did not depend on Mirror III (which the court no longer used) and, most significantly, to gut the CHASER docket sheet lookup facility and hook it instead to the court's locally-developed docket software. The result was a version of CHASER that provided seamless access not only to current ICMS dockets, but also to the archived ICMS dockets that had been recovered in 1992, and the Courtran criminal dockets just converted. Seamless access was provided when an exact case number was known; it was not yet possible to locate an archived or Courtran docket by a party name or partial case number, as the required index information was not available.

In April of 1994, using pax and the FSF tools, a program was completed capable of indexing any of the information contained in a Self-Describing Archive, under control of a simple specification naming the tables and fields to index. The program was given the ability to handle up to six index specifications in a single pass through a tape. Conferences with the Administrative Manager produced a set of specifications for case, party, and nature-of-suit index information desirable for public access. Using these specifications, the tool was used to extract the index information from the SDA tape.
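The idea of driving several indexes from simple specifications during one pass over a tape can be illustrated as follows. This Python sketch is not the court's program and does not read the SDA format, which is not described here; it assumes the tape contents arrive as a stream of (table name, row) pairs, and the table names, field names, and function name build_indexes are hypothetical.

```python
from typing import Dict, Iterable, List, Mapping, Tuple

# An index specification names the tables and fields to index.
# Table and field names used with this sketch are HYPOTHETICAL.
IndexSpec = Dict[str, List[str]]          # table name -> field names
Record = Tuple[str, Mapping[str, str]]    # (table name, row as a mapping)

MAX_SPECS = 6  # the account mentions up to six specifications per pass


def build_indexes(records: Iterable[Record],
                  specs: Mapping[str, IndexSpec]) -> Dict[str, List[tuple]]:
    """Apply every index specification during one pass over the records."""
    if len(specs) > MAX_SPECS:
        raise ValueError("at most %d specifications per pass" % MAX_SPECS)
    indexes: Dict[str, List[tuple]] = {name: [] for name in specs}
    for table, row in records:            # single sequential pass
        for name, spec in specs.items():
            fields = spec.get(table)
            if fields is not None:
                indexes[name].append(tuple(row.get(f, "") for f in fields))
    for entries in indexes.values():
        entries.sort()                    # order entries for lookup
    return indexes
```

Since a tape can only be read sequentially, handling all specifications inside the one loop is what allows case, party, and nature-of-suit indexes to be produced without rereading the tape for each.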

With only minor manipulation, easily done in awk, the index files so created were converted into the format used by CHASER and merged into that program's index structure. With this step, access to archived, as opposed to current, ICMS dockets in CHASER became completely transparent. The Courtran dockets, converted more recently, could still be found only by case number, as a corresponding index had yet to be created.

Work commenced immediately on creating a Courtran index in the same format as the newly-extracted ICMS data. Index tapes had been supplied along with the docket tapes after Courtran's decommissioning, but the tapes were quickly seen to contain less information than had been requested for a public index. The solution was to develop software to scan the Courtran dockets themselves, converted the previous August, and extract the needed information from the paginated reports. This already-unenticing task was complicated by the lack of regularity in Courtran docket entries. Party names in case titles often differed in spelling (or worse) from names appearing in the docket body, and even names of judges appeared with numerous variant spellings, requiring many alternative tests in the software.

What the index revealed

Meanwhile, the creation of a complete case index by extraction from the ICMS archive tape at last made it possible to determine if the set of archived dockets recovered in 1992 was in fact complete, something long hoped and supposed but never proven. Comparison of the recovered dockets to the case index revealed, disappointingly, that a good 40 percent of the court's archived dockets were still missing.
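The completeness check amounts to a set difference between the case numbers in the index and those for which a docket was recovered. A minimal sketch in Python, with a hypothetical function name and invented case numbers in the usage note, might read:

```python
from typing import Iterable, List, Tuple


def missing_dockets(indexed_cases: Iterable[str],
                    recovered_cases: Iterable[str]) -> Tuple[List[str], float]:
    """Return the cases lacking a docket and their share of the index."""
    indexed = set(indexed_cases)
    # Dockets for cases absent from the index are ignored here; only
    # indexed cases without a recovered docket count as missing.
    missing = sorted(indexed - set(recovered_cases))
    fraction = len(missing) / len(indexed) if indexed else 0.0
    return missing, fraction
```

Given an index of five cases of which three have recovered dockets, the function reports the two missing case numbers and a fraction of 0.4.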

The 1992 effort had searched all tapes with likely-looking labels for signs of docket files. Now every tape in that section of the library was searched, regardless of what the label said. Remembering the misnamed index file found earlier, files with unlikely names were examined to see if they might contain dockets. The search did unearth more docket files, some of which duplicated files found earlier, while some provided new dockets. Ultimately, however, it was clear that about 13,400 dockets--23 percent of the court's archived cases--had either never been retained, or had at some point been simply lost. The only way to obtain dockets for those cases would be to restore the case relations themselves from the archive tape into an ICMS database and run the docket report generator. The only problems were the lack of disk space for another 13,000-case ICMS database, and the existing case restoration procedure, which was cumbersome for restoring an individual case, and unusable to restore 13,000.

How to restore 13,000 cases

One goal of the SDA design had always been to make possible a tool to restore cases very efficiently by pulling the data directly off the tape. In addition to speed, such a procedure would make reduced demands on disk space. The court's invoke software made it simple for such a tool to perform privileged database operations without the need for special operator action; the gawk, pax, and find tools provided the needed off-the-tape capabilities, and the SDA tape itself was now available. Work began on building a case-restore tool that would turn the potential benefits into reality.

At the time, the court had a spare 800 MB disk drive that had been purchased as insurance for another machine. A new project, Operation Restore Faith, was conceived in which the spare drive would be attached to the ICMS machine and used to set up an empty ICMS database with capacity for the 13,000 missing cases. The mass loading of the "orefa" database would be the most demanding test of the new restoration software when it was completed.

Viewing dockets from anywhere

In July 1994, as work proceeded on Courtran indexing and a restoration tool, an important component of the court's public-access plan was completed and deployed. The CHASER and PACER systems distributed by the Administrative Office included a docket-lookup facility whereby the user could view a docket sheet located on the same machine where CHASER/PACER was running. When CHASER was installed at the court, that docket facility had been gutted and adapted to the local docket software, which still required the docket files to be on the same machine. This was adequate for CHASER.

However, PACER was intended to run on a dedicated machine, separate from ICMS. The court's own requirements for a public-access system had always included access to current docket sheets; the available PACER software only offered access to docket sheets updated the previous night. Ability to update individual dockets on the fly had been designed into the court's docket software in 1992, but incorporating that feature into a public access system might entail some overhead to update copies of the docket on both the ICMS machine and the separate PACER machine.

The solution was to design something resembling the CHASER/PACER docket lookup software, but split in half. A server component was to deal with the docket files, while a separate client would interact with the user, issue prompts, and display the docket. The server and client would communicate with each other on the user's behalf, using a simple docket lookup protocol (SDLP). Server and client could of course both reside on one machine, and provide the same functions available in stock CHASER or PACER. But the client could also be on another machine, such as the court's dedicated PACER platform. In fact, an SDLP client could reside on any machine on the judiciary-wide private internetwork. As other pieces of the court's public-access effort had yet to be completed, the proof-of-concept for SDLP was accomplished by sending an SDLP client to the District Court for the Western District of Michigan, which was then able to locate and view Eastern Michigan dockets just as though they were using a local CHASER system.
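The division of labor between the two halves can be shown in miniature. The sketch below is in Python rather than the court's extended gawk, and the request and response format is invented for the example; the actual SDLP messages are not described in this account. The names serve, lookup, and send are likewise hypothetical.

```python
from typing import Callable, Dict, List

# HYPOTHETICAL wire format, invented for this sketch:
#   request:  "GET <case-number>"
#   response: ["OK <line-count>", <docket lines>...] or ["ERR <reason>"]


def serve(request: str, dockets: Dict[str, str]) -> List[str]:
    """Server half: the only component that touches docket storage."""
    parts = request.split()
    if len(parts) != 2 or parts[0] != "GET":
        return ["ERR bad-request"]
    text = dockets.get(parts[1])
    if text is None:
        return ["ERR not-found"]
    lines = text.splitlines()
    return ["OK %d" % len(lines)] + lines


def lookup(case_number: str, send: Callable[[str], List[str]]) -> str:
    """Client half: formats the request and presents the reply.

    `send` stands in for the network; it could equally be a local call.
    """
    reply = send("GET " + case_number)
    if not reply or not reply[0].startswith("OK "):
        return "No docket found for " + case_number
    return "\n".join(reply[1:])
```

Because the client knows only how to send a request and read a reply, the same client code serves whether the server is on the same machine or reached across a network; only the function supplied as send changes.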

The SDLP server and client were both written in the court's locally-extended gawk language, using to good advantage the coprocess and other features added in 1993, and the extended find tool also developed at that time. As a result, the first versions of server and client were written in less than two days each. Subsequent enhancements, owing to the simple, interpreted language, have been quick and straightforward to make and to test. SDLP clients have been delivered to the probation and pretrial services agencies and to the Sixth Circuit Court of Appeals in Cincinnati, allowing them to look up the court's dockets as though on a local system. In addition to simplifying procedures for probation, pretrial, and circuit court personnel, the move eliminated a number of otherwise-unnecessary logins to the court's overloaded ICMS machine.

A problem is again suspected

Also in July 1994, reports began filtering in that CHASER was unable to find certain old civil case numbers, or parties to very old civil cases, in its index. Investigation of this disappointing news revealed a new vista of recovery and conversion work lying between the court and its integrated public-access goal.

At the time of Courtran's decommissioning in 1992, that system had been used only for criminal cases. Civil cases had been docketed using ICMS and local computing facilities for several years at that point. However, before the advent of ICMS, both civil and criminal docketing had been done on Courtran. 51,357 case records, and 167,513 party records, were not included in the recovered ICMS index data, because they had never been entered in an ICMS database. The court did not have dockets for these cases in any machine-readable form, but did have machine-readable index files, which would need to be converted into the same format as the court's other ICMS and criminal Courtran index data.

Conversion of the existing files as to record format was straightforward, but revealed serious problems with original data entry standards, as well as some thoroughly corrupted records probably produced by an (inferred) earlier attempt at automated conversion. The data would be clearly unusable without manual review and markup of the entire index.

Record recovery continues through first outside PACER tests

August 1994 saw the installation of the computer purchased to be the dedicated public-access host, and the start of development of the secure front-end software that would run on that machine, guiding public dial-in users to an index database and SDLP client. September saw the first outside users given access to the public-access system in development, for testing and evaluation. September also saw the completion of the Courtran criminal index conversion project begun in the spring, giving CHASER and public-access users seamless access, by case number or by party name, to current ICMS information, archived ICMS information, and Courtran criminal cases. Two public-access hurdles remaining were the 13,000 missing dockets to be addressed by Operation Restore Faith, and the 200,000 old civil Courtran records to be addressed by long, squint-eyed hours. A start was made on Operation Restore Faith by attaching the spare 800 MB disk drive to the ICMS machine. Operation Restore Faith was promptly rescheduled when the drive was found to be DOA and returned to the manufacturer.

In October 1994, the first version of the rapid case restoration software was completed and ready for the big test. A replacement disk drive had been obtained and, while apparently functional, did not work well with the court's equipment. A good deal of testing and some lengthy, unrewarding conversations with the manufacturer's support staff ultimately revealed that the replacement drive, though of the same model as the original, was a more recent revision; the revision had apparently introduced an incompatibility with the court's equipment, where the older revision--no longer available--was working perfectly in the public access machine.

In addition to the ICMS database in production use, the court had another database available for use, which would not require another disk. The test database was used only for training and for testing new software releases, and was configured for not much more than a thousand cases maximum. Operation Restore Faith Slowly proceeded by restoring a thousand cases at a time from the archive tape, running the docket report generator, deleting the cases, and repeating. Using the new restoration software, which clocked out at a couple of cases per minute, restoration of a thousand cases could begin mid-morning and complete in time for evening backups. After backups, the report-generating software could run overnight; deletion would begin early in the morning and complete in time for the next batch. The users' patience with the predictable effects on response time was much appreciated.
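The batch cycle and its timing can be checked with a little arithmetic. The Python sketch below splits a list of case numbers into groups and estimates the restore time for one group; the function names are hypothetical, and the rate of two cases per minute is an assumed reading of "a couple of cases per minute."

```python
from typing import Iterator, List, Sequence


def batches(cases: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield successive groups of at most `size` case numbers."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(cases), size):
        yield list(cases[start:start + size])


def restore_hours(batch_size: int, cases_per_minute: float) -> float:
    """Estimated hours to restore one batch at a given rate."""
    return batch_size / cases_per_minute / 60.0
```

At the assumed rate, restore_hours(1000, 2) comes to a little over eight hours, consistent with a batch started mid-morning finishing before evening backups, and 13,000 cases divide into 13 such batches.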

A number of incompatibilities were discovered between data entry conventions used by the older versions of ICMS from which the cases had been archived and those accepted by the current version into which they were being restored. Corresponding automatic edits were added to the case restoration software. Still, a few dozen cases were encountered whose archive files simply didn't contain all of the associated information. Investigation revealed that the missing tuples could still be found in the production ICMS database, from which they had been neither copied nor deleted during the earlier archiving process. Therefore, the remaining stragglers, with a few hand edits, could be successfully restored into the production database, which contained the rest of the required information. The number of such cases was small enough to make such a solution practical. They will probably meet the archiving criteria and be removed again from the database when archiving is next done.

Going public

In late December, 1994, the last of the 13,000 missing dockets was reconstructed using version 1.7 of the court's new restoration tool. On 3 January 1995, as announced in the media, the first public dial-in accounts were accepted for the court's PACER-like service, which now offered access to all current ICMS cases, all archived ICMS cases, and all Courtran criminal cases, lacking only the civil Courtran index records requiring manual review. A team was formed to begin the laborious process of cleaning up the civil index entries.

A tangible index

While electronic public access is certainly convenient, the Court maintained a prudent interest in a physical fallback form such as microfiche. As the various pieces of the Court record were reconstructed, cleaned up, and merged, it became thinkable that a comprehensive set of merged microfiche could be within reach. In the Fall of 1995, formats for new Case Index, Party Index, and Nature-of-Suit Index reports were devised, and scripts were written for the INGRES Report Writer to produce the fiche as soon as the data could be ready. Some energy was diverted from the project, however, to prepare for the installation of the new host, wilkins, and the threatened-but-postponed upgrade of operating systems to Solaris. At the same time, the Court's locally-purchased UNIX systems underwent a threatened-and-carried-out upgrade to SCO OpenServer 5.

Just the ticket for documentation

In the Spring of 1996, the Assistant Systems Manager began to prepare for a move out of state. The need for a way to make the existing systems documentation more readily available to successors was handily addressed when the OpenServer 5 upgrade was found to have replaced paper manuals with completely on-line documentation for the UNIX system. While not the first to go that route, SCO made the amazingly sensible decision to base their on-line system on the freely-available software used elsewhere to create the World-Wide Web. Because of this, it was possible to begin converting all of the Court's local documentation into the same form, to create a one-stop shop for (at first) documentation related to UNIX and the Court's central applications. It was also possible to obtain the free, compatible browsing software for any PC in the court, making documentation available wherever it is needed. It will also be possible to broaden the sense of documentation to include all kinds of reference information, including live queries of the Case Management database and immediate access to local program sources.

One more index hurdle

The replacement of the old Unisys computer, spmied, by the upstart wilkins, drew attention unavoidably to one remaining collection of case records that had yet to be merged into the comprehensive index. A "miscellaneous civil" database had been developed locally in the late 1980's using UNIFY 3.2, which would not be available for wilkins. Since the database could not be supported after the transition to wilkins, all of the records were extracted into a file in the form of a self-describing archive, to await development of software that could convert them into ICMS records to be loaded into the Case Management database. That conversion remains to be done, and it will be an interesting project, as the "misc" database schema was far from normalized. A case record includes repeated fields for a fixed number of parties, and at times the parties are entered without regard to field boundaries, as in:

  Party 1:  Adam Bosco, Charles
  Party 2:  Davis, and Eleazor Frenette

The prospects for a successful, purely-automated conversion seem slim; more promising might be a hybrid approach, with software analyzing each record and presenting a user several likely ways of reforming it, so that the user could choose the right one with less effort than actually retyping each entry. A CGI-based approach might be especially convenient.
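The analysis step of such a hybrid approach might look like the following Python sketch, which offers the fields as entered alongside one alternative reading that ignores field boundaries. It is an illustration only: the splitting rule (commas and a joining "and") is an assumption, it would mishandle names entered as "Last, First" or carrying suffixes such as "Jr.", and the function names are hypothetical.

```python
import re
from typing import List


def split_names(text: str) -> List[str]:
    """Split a run of names on commas and a joining 'and'."""
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", text)
    return [p.strip() for p in parts if p.strip()]


def candidate_readings(fields: List[str]) -> List[List[str]]:
    """Propose alternative party lists for a user to choose between."""
    as_entered = [f.strip() for f in fields if f.strip()]
    # Alternative: ignore field boundaries, then re-split the joined text.
    rejoined = split_names(" ".join(as_entered))
    candidates = [as_entered]
    if rejoined != as_entered:
        candidates.append(rejoined)
    return candidates
```

For the two fields shown earlier, the second candidate is the three-party list Adam Bosco, Charles Davis, and Eleazor Frenette. The software only proposes; the choice among readings stays with the user.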

Except for that

The records from the "misc" database (representing 6,548 cases) are for the time being suspended in a file /archival/CaseMgt/SDAmisc.gz. While this file is a self-describing archive, it is not an ICMS self-describing archive such as could be loaded into ICMS with rest_case. Rather, in describing itself it describes the old "misc" schema, and awaits a conversion effort to be usable in ICMS.

Meanwhile, all the other data sources identified to date have been merged to create a system where public-access users may query cases from the current day back to (thinly) 1932, and on 31 July 1996 a set of index microfiche tapes was created covering 147,677 cases and 338,135 persons.