Some UNIFY internal integrity tests

Starting in January, an extensive series of tests (and a few corrective procedures) has been run on the UNIFY database underlying Case Management to make sure everything is stored consistently.

Records in the database are stored in slots, of which a certain number are allocated at any given time. Slots in turn reside in segments, which live in volumes. When records are deleted, the slots they occupied are linked into a delete chain so they can be reused.

A UNIFY program called delch was used to check the integrity of each delete chain. In addition, software developed locally compared the delch results mathematically to the allocation statistics produced by the DBSTATS program.
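The comparison can be pictured as a simple accounting check. What follows is only a sketch of the idea, not the local software itself. It assumes the delch and DBSTATS figures have already been parsed into plain numbers, and it assumes that every allocated slot should be either a live record or on the delete chain. The field names (allocated, in_use, chain_length) are hypothetical and are not actual delch or DBSTATS output fields.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    name: str
    allocated: int     # slots reported as allocated (hypothetical field)
    in_use: int        # slots holding live records (hypothetical field)
    chain_length: int  # slots found on the delete chain (hypothetical field)


def slot_discrepancy(t: TableStats) -> int:
    """Allocated slots not accounted for by live records or the delete chain.

    Assumes allocated == in_use + chain_length for a healthy table.
    """
    return t.allocated - (t.in_use + t.chain_length)


def suspect_tables(stats):
    """Return (table name, discrepancy) for every table that does not balance."""
    return [(t.name, slot_discrepancy(t)) for t in stats if slot_discrepancy(t) != 0]
```

A table that balances drops out of the list; anything left over is a candidate for closer inspection.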

Some 17 tables were found to have one or another kind of problem in slot allocation. In many cases, slots beyond a table's high-water mark were incorrectly tagged as allocated. These problems were corrected by reconfiguring the database, altering segment sizes to exclude the erroneous slots. The few tables whose problems were not corrected by that action contained slots marked unallocated interspersed with allocated slots. Those slots were manually retagged as deleted slots and the delch program was run again to put them correctly on the delete chain.
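The two kinds of slot problem described above can be illustrated with a small sketch. The representation is invented for illustration: each slot carries a one-letter tag ("A" allocated, "D" deleted, "U" unallocated), slots are numbered from 1, and "interspersed" is taken to mean an unallocated slot at or below the high-water mark. None of this reflects how UNIFY actually stores slot tags.

```python
def find_slot_anomalies(tags, high_water):
    """Classify suspicious slots in one table.

    tags[i] is the (hypothetical) tag of slot i + 1.
    high_water is the highest slot number legitimately in use.
    Returns (beyond, holes):
      beyond - slots above the high-water mark tagged as allocated
      holes  - slots at or below the mark tagged as unallocated
    """
    beyond = []
    holes = []
    for slot, tag in enumerate(tags, start=1):
        if slot > high_water and tag == "A":
            beyond.append(slot)
        elif slot <= high_water and tag == "U":
            holes.append(slot)
    return beyond, holes
```

In terms of the fixes described, the first list corresponds to slots that reconfiguring the segment sizes would exclude, and the second to slots that would be retagged as deleted before running delch again.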

Another locally developed program was run to check for duplicate keys among all tables that declare a primary key. This program had been pivotal in solving serious earlier problems in January 1994 and June 1994, but when run in January 1996 it gave all our primary keys a clean bill of health.
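The idea behind a duplicate-key check is simple enough to sketch. This is an illustration under assumptions, not the local program: it supposes the records of one table have already been dumped as (slot number, primary key) pairs, and the function name is made up.

```python
from collections import defaultdict


def find_duplicate_keys(records):
    """Return {key: [slots...]} for every primary key held by more than one slot.

    records is an iterable of (slot_number, primary_key) pairs.
    """
    slots_by_key = defaultdict(list)
    for slot, key in records:
        slots_by_key[key].append(slot)
    return {key: slots for key, slots in slots_by_key.items() if len(slots) > 1}
```

An empty result for every table is the "clean bill of health" case.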

Easily the most time-consuming integrity test concerns "explicit relationships", a UNIFY feature where a record in one table can actually contain the slot number of a related record in another table. This allows UNIFY to find the related records very quickly, provided all such "links" are correct. If they're not, many strange consequences are possible.

A UNIFY program called relchk is used to check the integrity of these links. It goes exhaustively through every table involved, following links and examining the linked records to determine whether they should really be linked, and reporting records that are not linked but should be.
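For readers unfamiliar with this kind of check, here is a rough sketch of the logic. It is not relchk's actual algorithm, which is not documented here. It assumes each child record carries a foreign key plus the stored slot number of its parent (or nothing if unlinked), and that parent primary keys are unique. All names are hypothetical.

```python
def check_links(children, parents):
    """Compare stored links against what the foreign keys say they should be.

    children: {child_slot: (foreign_key, linked_parent_slot or None)}
    parents:  {parent_slot: primary_key}
    Returns (mislinked, unlinked) lists of child slot numbers:
      mislinked - linked to a slot that is not the parent the key names
      unlinked  - not linked, although a matching parent exists
    """
    slot_of_key = {key: slot for slot, key in parents.items()}
    mislinked = []
    unlinked = []
    for child_slot, (foreign_key, link) in sorted(children.items()):
        expected = slot_of_key.get(foreign_key)
        if link is None:
            if expected is not None:
                unlinked.append(child_slot)
        elif link != expected:
            # Covers both a wrong parent and a link whose key has no parent.
            mislinked.append(child_slot)
    return mislinked, unlinked
```

A child whose foreign key names no existing parent and which carries no link is treated as consistent in this sketch; whether that is acceptable depends on the table.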

The output produced by relchk can be overwhelming, so a filter script is quite handy: it reads through the unabridged relchk output file and produces a summary file on a more human scale. The summary includes line numbers and byte offsets into the unabridged file in case specific details need to be retrieved; often they don't.
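A filter of this sort might look roughly like the following. It is a sketch, not the script actually used. The marker text is a placeholder, since the real relchk message wording is not reproduced here. The input is read as bytes so that the recorded offsets can be used to seek straight back into the unabridged file.

```python
import io


def summarize(stream, marker=b"mislink"):
    """Yield (line_number, byte_offset, text) for each line containing marker.

    stream must yield lines as bytes (a file opened in "rb" mode, for example).
    marker is a placeholder; substitute whatever text flags a problem line.
    """
    offset = 0
    for line_number, raw in enumerate(stream, start=1):
        if marker in raw:
            yield line_number, offset, raw.decode("ascii", "replace").rstrip()
        offset += len(raw)


# Small in-memory stand-in for an unabridged output file (invented content).
sample = io.BytesIO(b"ok\nmislink at 5\nok\nmislink at 9\n")
summary = list(summarize(sample))
```

With a line's byte offset in hand, the full detail can be retrieved later by opening the unabridged file in binary mode and calling seek() with that offset.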

If a particular relationship is found to be mislinked, another UNIFY program called REPOINT can be used to relink it correctly, in many cases. (In past problems we have identified and documented certain types of mislink which REPOINT fails to detect and correct.)

There are 113 explicit relationships defined in ICMS, some involving only a handful of records, others involving a million or more. REPOINT and relchk run in comparable time, ranging from seconds for one of the small relationships to ten hours or more for each of the larger ones. Because it only looks and does not modify, relchk can be run at any convenient time and interrupted if need be without corrupting the database; it does, however, require that updates to the affected tables be prevented while it is running, so its results will be accurate.

REPOINT, on the other hand, modifies the database, and once started must run to completion. If it is interrupted for any reason whatever, the database must be restored from the backup. This implies a fresh backup must be on hand. If relchk reveals that something is mislinked, the time to run REPOINT on that relationship must be estimated (roughly the same as the time relchk required), and a block of that much uninterrupted time must be set aside for running REPOINT, with no other database activity.

The 113 explicit relationships were split into groups such that each group could be relchked in a weekend (including the Presidents Day weekend). After the entire round of checks, 18 of the 113 were found to have mislinks. The number of mislinks varied from one (in the hist->evlist relationship, and also one in histdp->hist) to over 400,000 (docproc->hist). Most mislinks involved the "zero" or "null" records that must exist in certain tables; their relatives in other tables were not correctly linked to them. The situation was probably created when additional zero/null records were added to the database on advice of the training center; existing database records were not automatically linked to these additions.
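The grouping was done by judgment, but the scheduling problem itself is a familiar one, and a sketch may help anyone repeating the exercise. The code below uses a first-fit-decreasing heuristic, which is my substitution and not the method actually used. The run-time estimates and the per-weekend hour budget are inputs the caller must supply; nothing here comes from measured relchk times.

```python
def plan_batches(estimates, budget_hours):
    """Group relationships into batches that each fit within budget_hours.

    estimates: {relationship_name: estimated_hours}
    Longest jobs are placed first, each into the first batch with room
    (first-fit decreasing). Returns a list of batches, each a list of names.
    """
    batches = []  # each entry is [remaining_hours, [names]]
    ordered = sorted(estimates.items(), key=lambda item: (-item[1], item[0]))
    for name, hours in ordered:
        if hours > budget_hours:
            raise ValueError(f"{name} needs {hours}h, more than one batch allows")
        for batch in batches:
            if batch[0] >= hours:
                batch[0] -= hours
                batch[1].append(name)
                break
        else:
            # No existing batch has room, so open a new one.
            batches.append([budget_hours - hours, [name]])
    return [names for _, names in batches]
```

The heuristic does not guarantee the fewest possible weekends, but it is usually close and is easy to rerun when an estimate changes.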

Sixteen of the 18 relationships with problems were corrected with REPOINT, also in stages over two recent weekends. Because the two relationships with one mislink apiece are among the largest, the single mislink in each of these was corrected by hand rather than blowing the many hours a REPOINT would require. (In SYS920 this can often be done by altering the foreign key field in the child record, then restoring the original value and using faccess to verify linkage. setsize can be used as a quick check for linkage in the other direction--the count should have incremented--though stepping through the set is more conclusive if its membership is manageably small.)

Only one problem was found that will require personal attention: the schedule record at location 50125 (schedule id 380527) lists as its terminator the nonexistent hist id 1120579. Either that hist id is wrong, and the correct id should be determined and placed in that record, or it's right and the corresponding hist record has vanished and must be reconstructed.

Because it has taken many weeks to complete this series of tests, problems may have crept in since the earlier tests in the series, so it is not possible to say for certain that the database is 100% consistent. It is also worth emphasizing that consistency from the UNIFY point of view is not the same as correctness of all the stored records. These tests have checked only that the records are stored in the proper structure for UNIFY to access and retrieve them. Many could still contain incorrect information from a legal or real-world standpoint.

Nevertheless, once the one remaining schedule-record problem is corrected, we will have a database probably as close to consistent as it's been in my time at the Court. Nobody sneeze.