Overwritten UNIFY volume (symptoms, recovery)

On Thursday, 30 June 1994, a problem was reported in an attempt to update an attorney address late the afternoon before. An entry had been made in the Case Management error log; apparently the corresponding entry is this one:

    Wed Jun 29 16:56:47 EDT 1994

    Program: unknown
    Calling function: error
    Offending function: See notes.
     Status: -4
    Errno: 9
    Notes: gett (tnum=80, offset=1, words=1)
Inspection of the error log revealed not only that entry, but many entries of the form
    Thu Jun 30 07:50:38 EDT 1994

    Program: SQL
    Calling function: error
    Offending function: See notes.
     Status: 0
    Errno: 22
    Notes: (dread) addr: 194094456
going back to Tuesday, 28 June. This message was considered the more disturbing, from the explanation in the UNIFY Direct HLI Programmer's Manual (p. E-39):
    Possible causes and/or solutions:

	1.  The data base file doesn't exist, or it is not
	    readable by the current operating system user.
	    [not applicable to our situation]
	
	2.  Invalid starting address to read from, or nonsensical
	    number of bytes to read.  [The starting address in the
	    log entry is clearly invalid.]  Either the program is
	    overwriting itself, or the database is corrupted and
	    needs to be recovered.

The San Antonio Training and Support Center was called and was able on Friday, 1 July, to isolate 'hist' as the table involved when the 'dread' (pronounced 'dee read' except for dramatic effect) errors were logged. The problem was to isolate the offending records in that table.

Late Friday afternoon, on a whim, I ran the program I had written in January 1994 while working on the duplicate person ID problem. This program scans a table for duplication of primary keys, and returns all record locations sharing each key. For the 'hist' table, this output was produced:

31667037 200006 200011 200016
63617365 200002 200007 200012 200017 200022 200027 200032
643a2032 200003 200008 200013 200018 200023 200028 200033
656c6563 200005 200010 200015 200020 200025 200030 200035
(The keys themselves are left in hexadecimal form, since the program only cares if two keys are equal and does not need to interpret them.)

The duplication of keys shows a striking regularity: for each key value that is duplicated, the records sharing that value occupy slots exactly five apart. However, the expected sequence 200004, 200009, ... does not occur, and the sequence 200006, 200011, ... is truncated.

Examination of the offending records in SYS920 showed them to be utter nonsense. The keys were enormous values completely out of range for hist IDs, dates such as 12/22/29 and 1/6/99 appeared, and employee names such as '1fp7', 'case', 'd: 2', and 'elec' were found. A curious note was that, if two records were compared from different primary-key groups, the same enormous value that was the primary key in one record was likely to turn up in some other field of the other record.

As these records contained no recoverable data, it appeared that the only solution was to delete them. Several other tables make explicit references to hist, and UNIFY will not delete a record until all other records referring to it have been removed. The typical procedure in such a case is to dump all related tables into ASCII files, empty them all, edit the ASCII files to eliminate the bad records, and load them back into the database. This procedure would be very time-consuming for tables the size of hist and the others that refer to it.

The same issue was involved in the duplicate Person ID problem back in January, and I had developed a method for deleting records with current references. On Sunday afternoon, 3 July, the weekend backups having been done, I used this method to try to delete the offending records. In this case, SYS920 was aborted with a memory fault without deleting the records. Apparently, the corruption affected the non-data, UNIFY-internal-use portions of the records as well as the data portions, and in such a way as to cause a memory fault in the deletion logic.

This would call into question the practicality of the "typical" procedure as well: after dumping the tables into ASCII files, it is still necessary to delete all entries from the tables; if the deletion can't be done, the procedure is unworkable. A typical procedure in that situation is to dump all tables in the entire database to ASCII files, create a new, empty database from scratch, and reload the files. That approach would probably make Case Management unavailable for a matter of days.

I experimented in the ICMS Training database with the idea of doing a reconfigure after deleting a table from the schema, and then another reconfigure after redefining the same table with the same table number. This appeared to be a workable method of emptying a table if delete will not work. However, as it would involve an ASCII dump, two reconfigures, and an ASCII load, it would still be quite time-consuming.

I turned my attention back to the strange regularities in the corrupted records. The hist ID is a numeric field, and when displayed that way, as by SYS920, did not suggest any pattern in particular. However, when seen in hexadecimal, as in the output of my duplicate-check program, a pattern emerges:

    Hist Id               ASCII

    31667037              1fp7
    63617365              case
    643a2032              d: 2
    656c6563              elec
Michelle Roberts of the San Antonio Training and Support Center was on call on Sunday, and sent me a program called 'dbcheck' which reveals some of the internal organization of a UNIFY database. Part of dbcheck's output follows:
    Record 27 (hist), Free Space Pointer 1082197, Record Count 1082195
    Records Per Segment 200001, Delete Chain Pointer 1082196
    Data Segments: 0:162164, 1:1, 1:51564, 1:103127, 1:154690, 1:206253,
		   0:0, 0:0, 0:0, 0:0,

UNIFY divides a table into several segments, each containing up to a fixed maximum number of records. When a segment is filled, another segment is allocated for subsequent records. For the hist table, each segment contains 200001 records. The first corrupted record in our table was slot 200002-- the very first record of the second segment.

The ICMS database is kept in eight separate files or "volumes." Each segment of a table is located by the volume that contains it, and its starting block within that volume. The second segment of the hist table is identified as 1:1--the segment begins at block 1 of the volume icms.db1.

All of this suggested that some insight into the corruption might come from directly examining the first few blocks of icms.db1. The contents:

titlefile open returned an error
  case selected: 2:92fp75674
  case selected: 2:92fp75684
  case selected: 2:92fp75670
  case selected: 2:92fp74851
  case selected: 2:92fp73889
  case selected: 2:92fp73343
  case selected: 2:92fp73164
  case selected: 2:92fp72714
  case selected: 2:92fp72514
  ...
and so on into the tenth block of the file. This is recognizable as the progress output usually produced by the dkt.rpt program. At some point since January (because all tables were checked for duplicates in January), the first ten blocks of icms.db1 were overwritten with this text.

If the problem was caused by overwriting part of icms.db1, what's to lose by considering a similar radical step as part of the solution? Perhaps a good hist record from later in the file could be used to overwrite the corrupted slots at the beginning. The result would be a database that still contained many duplicate keys, but with the non-data portions uncorrupted, it might be possible to delete them.

Such a project would require knowing the total size (data and non-data) of a single hist entry, and how to calculate the position in the file of a given entry. That depends in turn on whether UNIFY numbers blocks starting with zero or one; i.e., does segment 1:1 actually begin at the first block in icms.db1, or the second?

The size of a hist entry is easily found using dbcheck:

    Key Field Information
    Record #   Key Field #   Record Length   Key Length
	27            233           132               4

Note that each entry " case selected: 2:92fp73343" in the overwriting text is 30 bytes long, and that 30/132 = 5/22. Every fifth hist entry will line up in just the same way with a 'case selected' line (each twenty-second). Since all of the text in the 'case selected' line repeats exactly but for the case number itself, every fifth hist entry appears to contain much the same information.

By pulling up several hist records in SYS920, converting the displayed values back into their internal formats, and comparing to the raw data in the file, I located the first record of the segment--200002--at byte offset 512 in the file, or the beginning of the second block. The first uncorrupted hist entry, 200036, lies at byte offset 5000.

By examining the first blocks of other ICMS volumes, I concluded that the first block of each volume (except the root) contains all zeros. This was enough information to try rewriting the first blocks of icms.db1.

First, the single block of zeros was extracted from another database volume into a file named block0. Then, 132 bytes at offset 5000 into icms.db1 were extracted into a file named rec200036. The bs, iseek, and count options of dd make such extractions straightforward. Finally, 'block0' and 34 repetitions of 'rec200036' were catenated into a single 5000-byte file, 'newstuff'.

For historical interest, the first 5000 bytes of icms.db1 were extracted into a file named 'badstuff'. Then 'newstuff' was written over the first 5000 bytes of icms.db1.


The trick to overwriting a portion of a file without disturbing the rest is to open it in RDWR mode, or in WRONLY without CREAT or TRUNC. This ability is not offered by the Bourne shell. The Korn shell, however, does have a little-known <> operator that effects a RDWR open and can be used for this purpose. An arbitrary portion of a file can be overwritten by using the bs, oseek, and count options in a dd command with file descriptor 1 redirected in this manner.

Newer versions of dd often provide a conv=notrunc option, which is more convenient. Either way, it is good to practice on a file that is not irreplaceable.


SYS920 then confirmed the presence of 35 identical hist entries, copies of the original entry at slot 200026. Using the method developed in January, 34 of them were deleted without incident, leaving only the original. The hash table was regenerated, B-tree index 13 was rebuilt, the explicit relationships on dh_cr, dh_trm, dc_trm, who_hid, hd_hi, hi_evid, and ch_id were repointed, and the database was backed up and placed back in service at 2:00 AM on Tuesday, 5 July.

In all, 34 hist entries (of about 1,082,000, or about 0.003%) have been lost. We may be able to recover these, as time permits, from an old backup. In any case, the report from 'repoint', which lists records in other tables which 'repoint' could not relink to the missing hist entries, should provide information to help reconstruct the missing entries.

It is curious that icms.db1 could have been overwritten by output from dkt.rpt. The file is on a file system reserved for ICMS volume files only, so even corruption of the file system could not have confused icms.db1 with the observed text. The most likely explanation seems to me that dkt.rpt may once have been exec(2)ed by a process that had closed its standard input and standard output. The first two files opened by dkt.rpt, likely to be icms.db0 and icms.db1, would then be assigned file descriptors 0 and 1. During dkt.rpt's execution, writes to standard output (fd 1) would be written on icms.db1.

There are no programs currently in use that close standard output before invoking dkt.rpt. It may that such a flaw existed at one time during testing of our docket fiche software. I do not expect this problem to recur.


Attachment: REPOINT output:

Date: Mon, 4 Jul 94 19:58:25 -0400
From: Chap Flack <chap@delmie.detroit.6.us_courts.gov>
Message-Id: <9407042358.AA02927@delmie.detroit.6.us_courts.gov>
To: chap@delmie.detroit.6.us_courts.gov

The Case Management system is unavailable while corruption in the hist
table is investigated.
--------------------------------------------------------------------------------
CaseMgt: The status message above is dated Jul 3 16:55.
--------------------------------------------------------------------------------
Starting dh_cr at Mon Jul  4 03:10:32 EDT 1994
dh_cr
--WARNING --
All the records were not relinked.
You must manually relink them, following
the instructions given in the file "repoint.err".

real    42:00.1
user     2:54.8
sys      7:48.6
dh_cr done at Mon Jul  4 03:52:32 EDT 1994.  repoint.err:

Mon Jul  4 03:52:31 1994

This file contains a list of the records REPOINT was unable to fix.
The values in the Reference Field do not exist in the Related Table
You can fix them by accessing the Table (use the 'setloc' function
with the Location), figuring out what value the Reference Field
should contain, and storing that value into the Reference Field.

Table         Location   Related Record   Reference Field

schedule         86933   hist             dh_cr
schedule         37740   hist             dh_cr
schedule         30025   hist             dh_cr
schedule         87150   hist             dh_cr
schedule         58990   hist             dh_cr
schedule         53697   hist             dh_cr
schedule         97021   hist             dh_cr

Starting dh_trm at Mon Jul  4 03:52:32 EDT 1994
dh_trm
--WARNING --
All the records were not relinked.
You must manually relink them, following
the instructions given in the file "repoint.err".

real  1:23:04.3
user     3:30.2
sys      8:08.7
dh_trm done at Mon Jul  4 05:15:37 EDT 1994.  repoint.err:

Mon Jul  4 05:15:36 1994

This file contains a list of the records REPOINT was unable to fix.
The values in the Reference Field do not exist in the Related Table
You can fix them by accessing the Table (use the 'setloc' function
with the Location), figuring out what value the Reference Field
should contain, and storing that value into the Reference Field.

Table         Location   Related Record   Reference Field

schedule        108574   hist             dh_trm
schedule        112976   hist             dh_trm
schedule         10152   hist             dh_trm
schedule         48372   hist             dh_trm
schedule         49403   hist             dh_trm

Starting dc_trm at Mon Jul  4 05:15:37 EDT 1994
dc_trm
--WARNING --
All the records were not relinked.
You must manually relink them, following
the instructions given in the file "repoint.err".

real  1:54:22.9
user    14:33.3
sys     26:58.7
dc_trm done at Mon Jul  4 07:10:00 EDT 1994.  repoint.err:

Mon Jul  4 07:09:57 1994

This file contains a list of the records REPOINT was unable to fix.
The values in the Reference Field do not exist in the Related Table
You can fix them by accessing the Table (use the 'setloc' function
with the Location), figuring out what value the Reference Field
should contain, and storing that value into the Reference Field.

Table         Location   Related Record   Reference Field

docproc           2585   hist             dc_trm
docproc         339415   hist             dc_trm
docproc         355417   hist             dc_trm

Starting who_hid at Mon Jul  4 07:10:00 EDT 1994
who_hid
--WARNING --
All the records were not relinked.
You must manually relink them, following
the instructions given in the file "repoint.err".

real  1:49:49.9
user     8:53.6
sys     14:20.0
who_hid done at Mon Jul  4 08:59:50 EDT 1994.  repoint.err:

Mon Jul  4 08:59:48 1994

This file contains a list of the records REPOINT was unable to fix.
The values in the Reference Field do not exist in the Related Table
You can fix them by accessing the Table (use the 'setloc' function
with the Location), figuring out what value the Reference Field
should contain, and storing that value into the Reference Field.

Table         Location   Related Record   Reference Field

who             189337   hist             who_hid
who             189339   hist             who_hid
who             189344   hist             who_hid
who             210636   hist             who_hid
who             210641   hist             who_hid
who             210642   hist             who_hid
who             210643   hist             who_hid
who             210644   hist             who_hid
who             210645   hist             who_hid
who             189336   hist             who_hid
who             189343   hist             who_hid
who             189346   hist             who_hid
who             210646   hist             who_hid
who              32113   hist             who_hid
who              51493   hist             who_hid
who             189322   hist             who_hid
who             189844   hist             who_hid
who             193800   hist             who_hid
who              57636   hist             who_hid
who             208426   hist             who_hid
who             189363   hist             who_hid
who             189364   hist             who_hid
who             189361   hist             who_hid
who             189366   hist             who_hid
who             179592   hist             who_hid
who              12980   hist             who_hid
who             147179   hist             who_hid
who             129940   hist             who_hid
who             189372   hist             who_hid
who             178031   hist             who_hid
who             189374   hist             who_hid
who             189376   hist             who_hid

Starting hd_hi at Mon Jul  4 08:59:50 EDT 1994
hd_hi
--WARNING --
All the records were not relinked.
You must manually relink them, following
the instructions given in the file "repoint.err".

real  3:52:41.1
user    14:36.4
sys     34:25.4
hd_hi done at Mon Jul  4 12:52:30 EDT 1994.  repoint.err:

Mon Jul  4 12:52:26 1994

This file contains a list of the records REPOINT was unable to fix.
The values in the Reference Field do not exist in the Related Table
You can fix them by accessing the Table (use the 'setloc' function
with the Location), figuring out what value the Reference Field
should contain, and storing that value into the Reference Field.

Table         Location   Related Record   Reference Field

histdp          427716   hist             hd_hi
histdp          318582   hist             hd_hi
histdp          318594   hist             hd_hi
histdp          443190   hist             hd_hi
histdp          428806   hist             hd_hi
histdp          318586   hist             hd_hi
histdp          318588   hist             hd_hi
histdp          428808   hist             hd_hi
histdp          499595   hist             hd_hi
histdp          318592   hist             hd_hi
histdp          318599   hist             hd_hi
histdp          456011   hist             hd_hi
histdp          318607   hist             hd_hi
histdp          418076   hist             hd_hi
histdp          418079   hist             hd_hi
histdp          318596   hist             hd_hi
histdp          318597   hist             hd_hi
histdp          318598   hist             hd_hi
histdp          192001   hist             hd_hi
histdp          234182   hist             hd_hi
histdp           20800   hist             hd_hi
histdp          189429   hist             hd_hi
histdp          389145   hist             hd_hi
histdp          432905   hist             hd_hi
histdp          337919   hist             hd_hi
histdp          337936   hist             hd_hi
histdp          337931   hist             hd_hi
histdp          318614   hist             hd_hi
histdp          318615   hist             hd_hi
histdp          318616   hist             hd_hi
histdp          349770   hist             hd_hi
histdp          222583   hist             hd_hi
histdp          336783   hist             hd_hi
histdp          281930   hist             hd_hi
histdp          281151   hist             hd_hi
histdp           57931   hist             hd_hi
histdp           61720   hist             hd_hi
histdp           75156   hist             hd_hi
histdp          318625   hist             hd_hi
histdp          380295   hist             hd_hi
histdp          381480   hist             hd_hi
histdp          383389   hist             hd_hi
histdp          318627   hist             hd_hi
histdp          319164   hist             hd_hi
histdp          318628   hist             hd_hi
histdp          318631   hist             hd_hi
histdp          318632   hist             hd_hi
histdp          255531   hist             hd_hi
histdp          194268   hist             hd_hi
histdp          349774   hist             hd_hi
histdp          434101   hist             hd_hi

Starting hi_evid at Mon Jul  4 12:52:31 EDT 1994
hi_evid

real  5:14:30.2
user    11:22.0
sys     15:37.6
hi_evid done at Mon Jul  4 18:07:00 EDT 1994.
Starting ch_id at Mon Jul  4 18:07:00 EDT 1994
ch_id
--WARNING --
All the records were not relinked.
You must manually relink them, following
the instructions given in the file "repoint.err".

real  1:51:22.3
user     8:42.0
sys     13:17.4
ch_id done at Mon Jul  4 19:58:22 EDT 1994.  repoint.err:

Mon Jul  4 19:58:21 1994

This file contains a list of the records REPOINT was unable to fix.
The values in the Reference Field do not exist in the Related Table
You can fix them by accessing the Table (use the 'setloc' function
with the Location), figuring out what value the Reference Field
should contain, and storing that value into the Reference Field.

Table         Location   Related Record   Reference Field

cashist         222101   hist             ch_id
cashist         203508   hist             ch_id
cashist         203509   hist             ch_id
cashist         222892   hist             ch_id
cashist         224643   hist             ch_id
cashist         203512   hist             ch_id
cashist         226677   hist             ch_id
cashist         101634   hist             ch_id
cashist         203515   hist             ch_id
cashist         231456   hist             ch_id
cashist         203517   hist             ch_id
cashist         218879   hist             ch_id
cashist         203519   hist             ch_id
cashist         252840   hist             ch_id
cashist         344506   hist             ch_id
cashist         254427   hist             ch_id
cashist         202206   hist             ch_id
cashist         210663   hist             ch_id
cashist         203525   hist             ch_id
cashist         203526   hist             ch_id
cashist         331511   hist             ch_id
cashist         212019   hist             ch_id
cashist         130107   hist             ch_id
cashist         292085   hist             ch_id
cashist         172528   hist             ch_id
cashist         176344   hist             ch_id
cashist         203533   hist             ch_id
cashist         319237   hist             ch_id
cashist         203535   hist             ch_id
cashist         206152   hist             ch_id
cashist         203537   hist             ch_id
cashist         309255   hist             ch_id
cashist         258346   hist             ch_id
cashist         331514   hist             ch_id



*************************************************
Cron: The previous message is the standard output
      and standard error of one of your cron commands.