On Thursday, 30 June 1994, a problem was reported in an attempt to update an attorney address late the afternoon before. An entry had been made in the Case Management error log; apparently the corresponding entry is this one:
Wed Jun 29 16:56:47 EDT 1994
Program: unknown
Calling function: error
Offending function: See notes.
Status: -4
Errno: 9
Notes: gett (tnum=80, offset=1, words=1)
Inspection of the error log revealed not only that entry, but many entries
of the form
Thu Jun 30 07:50:38 EDT 1994
Program: SQL
Calling function: error
Offending function: See notes.
Status: 0
Errno: 22
Notes: (dread) addr: 194094456
going back to Tuesday, 28 June. This message was considered the more
disturbing, from the explanation in the UNIFY Direct HLI Programmer's
Manual (p. E-39):
Possible causes and/or solutions:
1. The data base file doesn't exist, or it is not
readable by the current operating system user.
[not applicable to our situation]
2. Invalid starting address to read from, or nonsensical
number of bytes to read. [The starting address in the
log entry is clearly invalid.] Either the program is
overwriting itself, or the database is corrupted and
needs to be recovered.
The San Antonio Training and Support Center was called and was able on Friday, 1 July, to isolate 'hist' as the table involved when the 'dread' (pronounced 'dee read' except for dramatic effect) errors were logged. The problem was to isolate the offending records in that table.
Late Friday afternoon, on a whim, I ran the program I had written in January 1994 while working on the duplicate person ID problem. This program scans a table for duplication of primary keys, and returns all record locations sharing each key. For the 'hist' table, this output was produced:
31667037 200006 200011 200016 63617365 200002 200007 200012 200017 200022 200027 200032 643a2032 200003 200008 200013 200018 200023 200028 200033 656c6563 200005 200010 200015 200020 200025 200030 200035(The keys themselves are left in hexadecimal form, since the program only cares if two keys are equal and does not need to interpret them.)
The duplication of keys shows a striking regularity: for each key value that is duplicated, the records sharing that value occupy slots exactly five apart. However, the expected sequence 200004, 200009, ... does not occur, and the sequence 200006, 200011, ... is truncated.
Examination of the offending records in SYS920 showed them to be utter nonsense. The keys were enormous values completely out of range for hist IDs, dates such as 12/22/29 and 1/6/99 appeared, and employee names such as '1fp7', 'case', 'd: 2', and 'elec' were found. A curious note was that, if two records were compared from different primary-key groups, the same enormous value that was the primary key in one record was likely to turn up in some other field of the other record.
As these records contained no recoverable data, it appeared that the only solution was to delete them. Several other tables make explicit references to hist, and UNIFY will not delete a record until all other records referring to it have been removed. The typical procedure in such a case is to dump all related tables into ASCII files, empty them all, edit the ASCII files to eliminate the bad records, and load them back into the database. This procedure would be very time-consuming for tables the size of hist and the others that refer to it.
The same issue was involved in the duplicate Person ID problem back in January, and I had developed a method for deleting records with current references. On Sunday afternoon, 3 July, the weekend backups having been done, I used this method to try to delete the offending records. In this case, SYS920 was aborted with a memory fault without deleting the records. Apparently, the corruption affected the non-data, UNIFY-internal-use portions of the records as well as the data portions, and in such a way as to cause a memory fault in the deletion logic.
This would call into question the practicality of the "typical" procedure as well: after dumping the tables into ASCII files, it is still necessary to delete all entries from the tables; if the deletion can't be done, the procedure is unworkable. A typical procedure in that situation is to dump all tables in the entire database to ASCII files, create a new, empty database from scratch, and reload the files. That approach would probably make Case Management unavailable for a matter of days.
I experimented in the ICMS Training database with the idea of doing a reconfigure after deleting a table from the schema, and then another reconfigure after redefining the same table with the same table number. This appeared to be a workable method of emptying a table if delete will not work. However, as it would involve an ASCII dump, two reconfigures, and an ASCII load, it would still be quite time-consuming.
I turned my attention back to the strange regularities in the corrupted records. The hist ID is a numeric field, and when displayed that way, as by SYS920, did not suggest any pattern in particular. However, when seen in hexadecimal, as in the output of my duplicate-check program, a pattern emerges:
Hist Id ASCII
31667037 1fp7
63617365 case
643a2032 d: 2
656c6563 elec
Michelle Roberts of the San Antonio Training and Support Center was on call
on Sunday, and sent me a program called 'dbcheck' which reveals some of the
internal organization of a UNIFY database. Part of dbcheck's output follows:
Record 27 (hist), Free Space Pointer 1082197, Record Count 1082195
Records Per Segment 200001, Delete Chain Pointer 1082196
Data Segments: 0:162164, 1:1, 1:51564, 1:103127, 1:154690, 1:206253,
0:0, 0:0, 0:0, 0:0,
UNIFY divides a table into several segments, each containing up to a fixed maximum number of records. When a segment is filled, another segment is allocated for subsequent records. For the hist table, each segment contains 200001 records. The first corrupted record in our table was slot 200002-- the very first record of the second segment.
The ICMS database is kept in eight separate files or "volumes." Each segment of a table is located by the volume that contains it, and its starting block within that volume. The second segment of the hist table is identified as 1:1--the segment begins at block 1 of the volume icms.db1.
All of this suggested that some insight into the corruption might come from directly examining the first few blocks of icms.db1. The contents:
titlefile open returned an error case selected: 2:92fp75674 case selected: 2:92fp75684 case selected: 2:92fp75670 case selected: 2:92fp74851 case selected: 2:92fp73889 case selected: 2:92fp73343 case selected: 2:92fp73164 case selected: 2:92fp72714 case selected: 2:92fp72514 ...and so on into the tenth block of the file. This is recognizable as the progress output usually produced by the dkt.rpt program. At some point since January (because all tables were checked for duplicates in January), the first ten blocks of icms.db1 were overwritten with this text.
If the problem was caused by overwriting part of icms.db1, what's to lose by considering a similar radical step as part of the solution? Perhaps a good hist record from later in the file could be used to overwrite the corrupted slots at the beginning. The result would be a database that still contained many duplicate keys, but with the non-data portions uncorrupted, it might be possible to delete them.
Such a project would require knowing the total size (data and non-data) of a single hist entry, and how to calculate the position in the file of a given entry. That depends in turn on whether UNIFY numbers blocks starting with zero or one; i.e., does segment 1:1 actually begin at the first block in icms.db1, or the second?
The size of a hist entry is easily found using dbcheck:
Key Field Information
Record # Key Field # Record Length Key Length
27 233 132 4
Note that each entry " case selected: 2:92fp73343" in the overwriting text is 30 bytes long, and that 30/132 = 5/22. Every fifth hist entry will line up in just the same way with a 'case selected' line (each twenty-second). Since all of the text in the 'case selected' line repeats exactly but for the case number itself, every fifth hist entry appears to contain much the same information.
By pulling up several hist records in SYS920, converting the displayed values back into their internal formats, and comparing to the raw data in the file, I located the first record of the segment--200002--at byte offset 512 in the file, or the beginning of the second block. The first uncorrupted hist entry, 200036, lies at byte offset 5000.
By examining the first blocks of other ICMS volumes, I concluded that the first block of each volume (except the root) contains all zeros. This was enough information to try rewriting the first blocks of icms.db1.
First, the single block of zeros was extracted from another database volume
into a file named block0. Then, 132 bytes at offset 5000 into icms.db1 were
extracted into a file named rec200036.
The bs, iseek, and count options
of dd make such extractions
straightforward.
Finally, 'block0' and 34 repetitions
of 'rec200036' were catenated into a single 5000-byte file, 'newstuff'.
For historical interest, the first 5000 bytes of icms.db1 were extracted into a file named 'badstuff'. Then 'newstuff' was written over the first 5000 bytes of icms.db1.
bs, oseek, and count options
in a dd command with file descriptor 1
redirected in this manner.
Newer versions of dd often provide
a conv=notrunc option, which is more convenient.
Either way,
it is good to practice on a file that is not irreplaceable.
SYS920 then confirmed the presence of 35 identical hist entries, copies of the original entry at slot 200026. Using the method developed in January, 34 of them were deleted without incident, leaving only the original. The hash table was regenerated, B-tree index 13 was rebuilt, the explicit relationships on dh_cr, dh_trm, dc_trm, who_hid, hd_hi, hi_evid, and ch_id were repointed, and the database was backed up and placed back in service at 2:00 AM on Tuesday, 5 July.
In all, 34 hist entries (of about 1,082,000, or about 0.003%) have been lost. We may be able to recover these, as time permits, from an old backup. In any case, the report from 'repoint', which lists records in other tables which 'repoint' could not relink to the missing hist entries, should provide information to help reconstruct the missing entries.
It is curious that icms.db1 could have been overwritten by output from dkt.rpt. The file is on a file system reserved for ICMS volume files only, so even corruption of the file system could not have confused icms.db1 with the observed text. The most likely explanation seems to me that dkt.rpt may once have been exec(2)ed by a process that had closed its standard input and standard output. The first two files opened by dkt.rpt, likely to be icms.db0 and icms.db1, would then be assigned file descriptors 0 and 1. During dkt.rpt's execution, writes to standard output (fd 1) would be written on icms.db1.
There are no programs currently in use that close standard output before invoking dkt.rpt. It may that such a flaw existed at one time during testing of our docket fiche software. I do not expect this problem to recur.
Date: Mon, 4 Jul 94 19:58:25 -0400
From: Chap Flack <chap@delmie.detroit.6.us_courts.gov>
Message-Id: <9407042358.AA02927@delmie.detroit.6.us_courts.gov>
To: chap@delmie.detroit.6.us_courts.gov
The Case Management system is unavailable while corruption in the hist
table is investigated.
--------------------------------------------------------------------------------
CaseMgt: The status message above is dated Jul 3 16:55.
--------------------------------------------------------------------------------
Starting dh_cr at Mon Jul 4 03:10:32 EDT 1994
dh_cr
--WARNING --
All the records were not relinked.
You must manually relink them, following
the instructions given in the file "repoint.err".
real 42:00.1
user 2:54.8
sys 7:48.6
dh_cr done at Mon Jul 4 03:52:32 EDT 1994. repoint.err:
Mon Jul 4 03:52:31 1994
This file contains a list of the records REPOINT was unable to fix.
The values in the Reference Field do not exist in the Related Table
You can fix them by accessing the Table (use the 'setloc' function
with the Location), figuring out what value the Reference Field
should contain, and storing that value into the Reference Field.
Table Location Related Record Reference Field
schedule 86933 hist dh_cr
schedule 37740 hist dh_cr
schedule 30025 hist dh_cr
schedule 87150 hist dh_cr
schedule 58990 hist dh_cr
schedule 53697 hist dh_cr
schedule 97021 hist dh_cr
Starting dh_trm at Mon Jul 4 03:52:32 EDT 1994
dh_trm
--WARNING --
All the records were not relinked.
You must manually relink them, following
the instructions given in the file "repoint.err".
real 1:23:04.3
user 3:30.2
sys 8:08.7
dh_trm done at Mon Jul 4 05:15:37 EDT 1994. repoint.err:
Mon Jul 4 05:15:36 1994
This file contains a list of the records REPOINT was unable to fix.
The values in the Reference Field do not exist in the Related Table
You can fix them by accessing the Table (use the 'setloc' function
with the Location), figuring out what value the Reference Field
should contain, and storing that value into the Reference Field.
Table Location Related Record Reference Field
schedule 108574 hist dh_trm
schedule 112976 hist dh_trm
schedule 10152 hist dh_trm
schedule 48372 hist dh_trm
schedule 49403 hist dh_trm
Starting dc_trm at Mon Jul 4 05:15:37 EDT 1994
dc_trm
--WARNING --
All the records were not relinked.
You must manually relink them, following
the instructions given in the file "repoint.err".
real 1:54:22.9
user 14:33.3
sys 26:58.7
dc_trm done at Mon Jul 4 07:10:00 EDT 1994. repoint.err:
Mon Jul 4 07:09:57 1994
This file contains a list of the records REPOINT was unable to fix.
The values in the Reference Field do not exist in the Related Table
You can fix them by accessing the Table (use the 'setloc' function
with the Location), figuring out what value the Reference Field
should contain, and storing that value into the Reference Field.
Table Location Related Record Reference Field
docproc 2585 hist dc_trm
docproc 339415 hist dc_trm
docproc 355417 hist dc_trm
Starting who_hid at Mon Jul 4 07:10:00 EDT 1994
who_hid
--WARNING --
All the records were not relinked.
You must manually relink them, following
the instructions given in the file "repoint.err".
real 1:49:49.9
user 8:53.6
sys 14:20.0
who_hid done at Mon Jul 4 08:59:50 EDT 1994. repoint.err:
Mon Jul 4 08:59:48 1994
This file contains a list of the records REPOINT was unable to fix.
The values in the Reference Field do not exist in the Related Table
You can fix them by accessing the Table (use the 'setloc' function
with the Location), figuring out what value the Reference Field
should contain, and storing that value into the Reference Field.
Table Location Related Record Reference Field
who 189337 hist who_hid
who 189339 hist who_hid
who 189344 hist who_hid
who 210636 hist who_hid
who 210641 hist who_hid
who 210642 hist who_hid
who 210643 hist who_hid
who 210644 hist who_hid
who 210645 hist who_hid
who 189336 hist who_hid
who 189343 hist who_hid
who 189346 hist who_hid
who 210646 hist who_hid
who 32113 hist who_hid
who 51493 hist who_hid
who 189322 hist who_hid
who 189844 hist who_hid
who 193800 hist who_hid
who 57636 hist who_hid
who 208426 hist who_hid
who 189363 hist who_hid
who 189364 hist who_hid
who 189361 hist who_hid
who 189366 hist who_hid
who 179592 hist who_hid
who 12980 hist who_hid
who 147179 hist who_hid
who 129940 hist who_hid
who 189372 hist who_hid
who 178031 hist who_hid
who 189374 hist who_hid
who 189376 hist who_hid
Starting hd_hi at Mon Jul 4 08:59:50 EDT 1994
hd_hi
--WARNING --
All the records were not relinked.
You must manually relink them, following
the instructions given in the file "repoint.err".
real 3:52:41.1
user 14:36.4
sys 34:25.4
hd_hi done at Mon Jul 4 12:52:30 EDT 1994. repoint.err:
Mon Jul 4 12:52:26 1994
This file contains a list of the records REPOINT was unable to fix.
The values in the Reference Field do not exist in the Related Table
You can fix them by accessing the Table (use the 'setloc' function
with the Location), figuring out what value the Reference Field
should contain, and storing that value into the Reference Field.
Table Location Related Record Reference Field
histdp 427716 hist hd_hi
histdp 318582 hist hd_hi
histdp 318594 hist hd_hi
histdp 443190 hist hd_hi
histdp 428806 hist hd_hi
histdp 318586 hist hd_hi
histdp 318588 hist hd_hi
histdp 428808 hist hd_hi
histdp 499595 hist hd_hi
histdp 318592 hist hd_hi
histdp 318599 hist hd_hi
histdp 456011 hist hd_hi
histdp 318607 hist hd_hi
histdp 418076 hist hd_hi
histdp 418079 hist hd_hi
histdp 318596 hist hd_hi
histdp 318597 hist hd_hi
histdp 318598 hist hd_hi
histdp 192001 hist hd_hi
histdp 234182 hist hd_hi
histdp 20800 hist hd_hi
histdp 189429 hist hd_hi
histdp 389145 hist hd_hi
histdp 432905 hist hd_hi
histdp 337919 hist hd_hi
histdp 337936 hist hd_hi
histdp 337931 hist hd_hi
histdp 318614 hist hd_hi
histdp 318615 hist hd_hi
histdp 318616 hist hd_hi
histdp 349770 hist hd_hi
histdp 222583 hist hd_hi
histdp 336783 hist hd_hi
histdp 281930 hist hd_hi
histdp 281151 hist hd_hi
histdp 57931 hist hd_hi
histdp 61720 hist hd_hi
histdp 75156 hist hd_hi
histdp 318625 hist hd_hi
histdp 380295 hist hd_hi
histdp 381480 hist hd_hi
histdp 383389 hist hd_hi
histdp 318627 hist hd_hi
histdp 319164 hist hd_hi
histdp 318628 hist hd_hi
histdp 318631 hist hd_hi
histdp 318632 hist hd_hi
histdp 255531 hist hd_hi
histdp 194268 hist hd_hi
histdp 349774 hist hd_hi
histdp 434101 hist hd_hi
Starting hi_evid at Mon Jul 4 12:52:31 EDT 1994
hi_evid
real 5:14:30.2
user 11:22.0
sys 15:37.6
hi_evid done at Mon Jul 4 18:07:00 EDT 1994.
Starting ch_id at Mon Jul 4 18:07:00 EDT 1994
ch_id
--WARNING --
All the records were not relinked.
You must manually relink them, following
the instructions given in the file "repoint.err".
real 1:51:22.3
user 8:42.0
sys 13:17.4
ch_id done at Mon Jul 4 19:58:22 EDT 1994. repoint.err:
Mon Jul 4 19:58:21 1994
This file contains a list of the records REPOINT was unable to fix.
The values in the Reference Field do not exist in the Related Table
You can fix them by accessing the Table (use the 'setloc' function
with the Location), figuring out what value the Reference Field
should contain, and storing that value into the Reference Field.
Table Location Related Record Reference Field
cashist 222101 hist ch_id
cashist 203508 hist ch_id
cashist 203509 hist ch_id
cashist 222892 hist ch_id
cashist 224643 hist ch_id
cashist 203512 hist ch_id
cashist 226677 hist ch_id
cashist 101634 hist ch_id
cashist 203515 hist ch_id
cashist 231456 hist ch_id
cashist 203517 hist ch_id
cashist 218879 hist ch_id
cashist 203519 hist ch_id
cashist 252840 hist ch_id
cashist 344506 hist ch_id
cashist 254427 hist ch_id
cashist 202206 hist ch_id
cashist 210663 hist ch_id
cashist 203525 hist ch_id
cashist 203526 hist ch_id
cashist 331511 hist ch_id
cashist 212019 hist ch_id
cashist 130107 hist ch_id
cashist 292085 hist ch_id
cashist 172528 hist ch_id
cashist 176344 hist ch_id
cashist 203533 hist ch_id
cashist 319237 hist ch_id
cashist 203535 hist ch_id
cashist 206152 hist ch_id
cashist 203537 hist ch_id
cashist 309255 hist ch_id
cashist 258346 hist ch_id
cashist 331514 hist ch_id
*************************************************
Cron: The previous message is the standard output
and standard error of one of your cron commands.