Duplicate Person Record IDs (generally, primary keys, corruption, effects)

In early December 1993, while installing the CHASER add-on to the ICMS application, I noticed that most of the 'person' records corresponding to judges did indeed have the person type and subtype (pr_type and pr_stype) set to <'jud', 'jud'> or <'jud', 'mag'> as appropriate. However, I noticed some with blank type and subtype. The sense of several discussions of this state of affairs was that there was no apparent reason any judge should not have the type and subtype filled in, and I could go ahead and add the missing information; this would make it possible for CHASER and other software to easily distinguish district judges from magistrate judges.

This promised to be helpful in development of the Magistrate Tilt report, so on the evening of Tuesday, 4 January, I used SQL update statements to add types and subtypes to several person records. The updates completed with apparent success, and the feedback messages ("n records updated") reflected exactly the number of records I had expected to update. When I repeated my initial query, some of the judges whose types and subtypes had previously been blank now showed complete information. However, some were still blank.

I repeated the SQL update for the records which were still blank. Again, the updates gave every indication of success and reported that the expected number of records had been updated. With the next query, one more judge now showed correct information, but three were still blank.

I started SYS920 (a UNIFY tool intended for exploring and correcting anomalous conditions in a database) and used 'acckey' to pull up the 'person' records for the three judges in question. The types and subtypes showed the updated information.

Once again in SQL, I did a simple query of the 'person' table alone, by the person record IDs of the affected judges. This query returned the updated information. However, my original query, which joined information from the 'person' and 'judge' tables by person record ID, still returned blank types and subtypes.

In SYS920, I found that while an 'acckey' of the 'person' table produced a record with updated type and subtype, an 'acckey' of the 'judge' table using the same ID number, followed by an 'faccess' of the 'person' table on that ID, produced a record with blank type and subtype. Use of the 'loc' command in both instances confirmed that these were separate records, in different physical locations, sharing the same person record ID number for each affected judge. One of the records in each pair was entered in the hash table (used by 'acckey'), while the other was linked to the corresponding 'judge' record by "explicit relationship" (used by 'faccess').

The person record ID number is set up as the "primary key" for the 'person' table. UNIFY is designed to prevent the addition, under any circumstances, of a record duplicating an existing primary key. Hence, it is curious that such a situation could have been created in the first place. It may have existed for some time; in at least one pair, both records bore entry dates in 1987.

My first attempt at correcting the situation was to REPOINT the judge/person relationship. REPOINT is a UNIFY tool intended to correct explicit relationships. According to the manual (Developer, 9.4), REPOINT "...zeros the existing pointers, then reads the database, relinking the related records." UNIFY requires every explicit relationship to involve a primary key, implying that REPOINT uses 'acckey' to locate the proper parent for each related record. It would seem, then, that REPOINT would leave the 'judge' records linked to the same 'person' records that were available to 'acckey'; 'acckey' and 'faccess' would then produce the same record, eliminating the inconsistent behavior; the other record of each duplicate pair would still exist, but not appear in the hash table or any explicit relationship, and could be deleted.

However, REPOINT had no apparent effect whatever. At its completion, those 'judge' records which had been linked to the non-hashed 'person' records were still linked to the same records. I considered a few explanations; in light of later observations, the most plausible seems to be that REPOINT does not, in fact, zero all pointers and then relink them all; rather, it only relinks any given record if its existing linkage "appears bad" to REPOINT. Since each judge record was already linked to a person record with the proper ID, REPOINT saw no reason to change it. It could be argued that a tool intended for correcting database corruption should not be assuming an existing linkage is correct; the UNIFY developers probably chose their approach to improve REPOINT's time performance.

My next approach was to try using SYS920 to delete one of each pair of person records by specifying its location. If these records could be deleted, they would leave a number of explicit relationships dangling, which would then need to be corrected by REPOINTs. (In addition to 'judge', several other tables, such as 'who', have explicit relationships to 'person'.)

However, SYS920 will decline to delete any record with existing explicit relationship links. Failing that, I tried to temporarily change the primary keys of the affected records, which might also have been enough to trigger REPOINT to relink the related records. SYS920 also declines to alter the primary key of a record with existing children linked to it.

The next approach seemed to be to extract and delete all of the related records, delete the offending 'person' records (now free of links and deletable), and restore the related records. Because this promised to be a tedious task involving some 94,000 'who's among other things, I chose to consult the Texas Training and Support Center before proceeding further.

On Wednesday morning, 5 January, I contacted Don Killian at TXTSC. I described the problem to Don and Dan Bennett, who set about replicating the problem on their system. This implies a theory of how such a situation could have been created in the first place; their method was to start and abort a 'REKEY' (hash table rebuild) leaving the hash table incomplete, then to add duplicate records while UNIFY was unable to locate the existing records with the same keys. This method worked, in that it allowed Don and Dan to create a similar situation in Texas; we can only speculate whether this is the way our own situation first arose.

While Dan and Don worked on replicating the situation in Texas, I left a message with UNIFY Corp. in Sacramento describing the problem and asking for a return call if they knew of solutions simpler than a mass extract and reload. When I first started in 1991 I had called UNIFY a couple of times and found them very helpful, but this is the first time I've called since and they seem to be less so. They have yet to return the call.

By noon, the situation had been replicated in Texas and they were experimenting with solutions. I spent the morning trying to determine the exact extent of our problem. Since I had only encountered 3 duplicated person IDs by happenstance while working on a different problem, I wrote a program to scan an entire table and dump all primary key values, along with a filter to sort that output and collect duplicates; these were used to scan the entire 'person' table and identify any other duplicated IDs. The scan found eleven (total), each appearing exactly twice: the 0 person ID ("Unspecified"), the 1 ID ("Everyone"), and nine judges, including the three I had already found. Each affected judge had one related 'judge' record, linked to one or the other 'person' record, and several thousand 'who' records arbitrarily split between the two. This situation would make it difficult for ICMS to reliably locate all 'who' records for a given judge; it would be more likely to find one group or the other.

The 0 and 1 person IDs also had many thousands of related 'schedule' records, arbitrarily split like the 'who's, as well as a handful of 'atyaka's and 'perplc's. Most of the results of my exploration are attached below.

By early afternoon, I reported my further findings to Texas, and they expanded their study for solutions to include the additional tables (beyond 'judge' and 'who') related to the 0 and 1 person IDs. We had discussed some questions about the actual operation of REPOINT, and they were beginning to lean toward an extract/reload approach. We estimated that such a solution would require 20 to 30 hours (elapsed) for processing, exclusive of set-up and verification.

We spent the remainder of the afternoon independently developing possible solutions. I spent some time reverse-engineering REPOINT hoping to answer the questions Texas had raised, and possibly to find a way to coerce REPOINT into relinking all records, even those with existing links that did not "appear bad." I also looked at SYS920, and by about 5:30 had devised a way to temporarily break SYS920 so it would delete any record I told it to, without performing the usual check for existing links. I called Texas to report this development and found they had left for the day.

In our test database, I experimented with using the "broken" SYS920 to delete a record with existing links, then using REPOINT to correct the links severed thereby. All worked smoothly, so I left and returned at 7 PM after Cedric had completed backups.

With the fresh backup on hand, I set about deleting one 'person' record from each duplicate pair. Using the timings we collected from our last full REPOINT, I chose three relationships to 'person' which would require only a few minutes apiece, REPOINTed them, and checked that they had been properly relinked. They had not; REPOINT had left them linked to the original (now deleted) 'person' records.

In the manual (HLI, 5.3) in the discussion of Figure 5-19, it becomes apparent that the pointer clusters in a record are not cleared when that record is deleted, but only when a new record is added in its place. It seems that a link only "appears bad" to REPOINT if it points to an invalid pointer cluster; as long as a link points to a valid cluster, even if that cluster is part of a deleted record, REPOINT will not correct it.

In SYS920, I used 'addrec' to create 11 new, uninteresting 'person' records. On the assumption that UNIFY maintains a delete chain in last-in-first-out order, these new records should have been placed in the same locations as the 11 I had earlier deleted; with the 'loc' command I determined this was so. Having overwritten the original pointer clusters, I deleted the new records and repeated the smaller REPOINTs. The results this time were satisfactory, so I started the repoint of 'schedule,' which had taken 35 minutes in May.

After the 'schedule' repoint (it took 46 minutes), I verified that the 'schedule' records had been properly relinked, and started the 'who' repoint (4 hours 56 in May).

At 4:30 AM on 6 January, I checked on the 'who' repoint. It had completed at 3:48 (5 hours 49) and I verified that the 'who' records were properly linked. All being satisfactory, I backed up the corrected database and returned the system to service at 8:00 AM. The results were shared with Dan Bennett once he returned to work.

Rather later, at leisure, a script was written to run the newly-developed key-dump and duplicate-checking tools systematically against all keyed tables in the database. All were given a clean bill of health. However, this tool did again prove useful in June 1994.

Exploring the extent of the problem

                    person
                   +---------++------------+---Z
                   |  pr_id  ||  pr_lname  |  ...
                   +---------++------------+---Z
                     ^  ^  ^
                     |  |  |                 judge
                     |  |  |                +---------++------------+---Z
                     |  |  |                |  jd_id  ||  jd_title  |  ...
                     |  |  |                +---------++------------+---Z
           /---->----/  |  |                     V
           |            |  |                     |
           |            |  \----------<----------/
           |            \__________
    who    ^                       \__________
   +----------+------------+---Z              \________
   |  who_id  |  who_case  |  ...              \   \   \ ...
   +----------+------------+---Z                  others

pr_id pr_lname      loc A              loc B
----- ------------- -----              -----

    0 Unspecified   23548         22W  54886  K     169W
    1               26387              58362  K
44885 Feikens       23526       3894W  23552 JK    5356W
44888 DeMascio      23529  K+   3882W  23555 J -    381W
44890 Taylor        23531  K+  13158W  23557 J -   2650W
44893 Hackett       23534  K    2372W  23560 J     8656W
44897 Churchill     23538       1786W  23564 JK    3225W
44902 Hooe          23543 JK+   7432W  23569   -   4375W
44903 Cooke         23544  K+   1905W  23570 J -   3622W
44904 Pepe          23545       3153W  23571 JK   12619W
44905 Morgan        23546 J +   3271W  23572  K-  11936W

J: this is the location that is picked up by an faccess() through the corresponding judge record.
K: this is the location that is picked up by an acckey() through the hash table.
+,-: if the two locations do not have identical contents, + indicates the one whose contents are more correct, - the other. If the contents are the same, + and - are not shown.
W: the number of who records linked to this location (setsize of makeset pr_id who_id).

Though more things than just jd_id and who_id have explicit relationships to person, I checked the set sizes for all other relationships at locations 23570, 23544, 23557, 23531, 23555, and 23529 and they were empty as of the evening of 4 January. This was tedious and I did not similarly check the remaining locations, assuming that who and judge records may be the only children expected for a person entry representing a judge.

The 0 (Unspecified) person (23548 and 54886) participates in (4,2) aa_prid2 relationships, (34890,23376) dh_prid relationships, (1,0) jd_id relationships, (1,3) pp_prid relationships, and (22,169) who_id relationships, respectively. The 1 person (26387 and 58362) participates in (15210,105476) dh_prid relationships.

Using SYS920 to delete a record regardless of links

Run SYS920 under the control of a debugging tool such as sdb.
Once a SYS920 prompt has been displayed, interrupt the program and examine the variable NRELT (an integer). This will be the number of explicit relationships defined in the database, currently 113 for ICMS.
Assign a new value of zero to the NRELT variable.
Continue the program.

From this point, SYS920 will behave as though no explicit relationship is defined in the database. Explicit relationship commands such as makeset and faccess will probably not work, and no checking for existing links will be done by delete or pfield.

To restore the original function at any time, interrupt the program, reassign the original value to the NRELT variable, and continue (or simply exit and start up a fresh copy of SYS920).

For further study

If an 'addrec' is done in SYS920 while NRELT is coerced to zero, will existing pointer clusters in the chosen location be cleared or not? If they are cleared, then to prepare several records for REPOINTing, one would need only to zero NRELT once, then delete-addrec-delete each location in turn. If not, NRELT will have to be zeroed before deleting each original record, but restored to its original value before the 'addrec's are done.

This question was not answered in this episode because my 'addrec's were done later, using a fresh SYS920 with NRELT unaltered.