Diagnosing ICMS faults: an example

Excerpts from some correspondence with the AO concerning a flaw discovered here in an AO-supplied ICMS executable, and how it was identified.

>   I was told that you made changes which caused EDITOR to abort; and that
>   later you went into the executable and altered it, thereby solving the
>   problem.  I'm an inquiring mind.  I want to hear the facts.

The 93CC02 version of EDITOR, just the way we received it off the tape, would hang (not abort) when the user tried to exit. It would work fine, just as pretty as you please, right up until the user hit ^K to get out...
...and then it would sit there...
...until somebody logged in from another terminal and killed it.

>   What was the problem?  What caused it in the first place?  How did you
>   know what to change in the executable?  And how did you do that?

Well, a "hang" generally indicates either that the program is blocking on some operation that will never complete, or that it's in an infinite loop executing the same code over and over.

If it's blocked, you'll see a consistent, unchanging value under WCHAN in a ps display, and the CPU time will not be increasing. If you're still an inquiring mind, you can fire up crash on the running system and do a kernel stack trace on the process, or use findslot on the WCHAN value, to get an idea of what system call is involved, and a full dump of the u-area will tell you what the arguments were. However, since crash reads physical memory, you need privilege to run it.

If the process is looping, on the other hand, you'll see CPU time increasing between consecutive ps displays. There's generally no point in bothering with the WCHAN value or crash with a looping process, because you'd only be looking at snapshots of a process that isn't nicely stopped in one place waiting for you to look at it.
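The blocked-versus-looping test in the last two paragraphs boils down to comparing two ps samples. Here is a minimal sketch of that decision, assuming the CPU time and WCHAN values have been copied by hand from two ps displays taken a few seconds apart. The function name and the sample values are made up for illustration.

```python
def classify_hang(first, second):
    """Classify a hung process from two ps samples.

    Each sample is a (cpu_seconds, wchan) pair read off a ps display.
    """
    cpu_before, wchan_before = first
    cpu_after, wchan_after = second
    if cpu_after > cpu_before:
        # CPU time is still accumulating: the process is executing code.
        return "looping"
    if wchan_before == wchan_after:
        # No CPU used and the same wait channel both times.
        return "blocked"
    return "unclear"


# Hypothetical samples: CPU time went from 41 to 47 seconds.
print(classify_hang((41, "c0124e80"), (47, "c0124e80")))  # looping
```

A result of "unclear" just means the two samples don't settle it and another pair is needed.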

In our case, it was obvious that EDITOR was looping rather than blocked, so the next question was naturally: where in the code is it looping?

My technique was to send a signal to the process causing it to dump a core image, and then examine the core image using sdb. This revealed that a function named endwin() was being executed at the time the signal was received. After that it's just a straightforward matter of running EDITOR under sdb, setting a breakpoint at endwin(), executing to the breakpoint, and then single-stepping until the loop is observed. If you don't have source (that's us!) and/or the program was not compiled and linked with the -g option, some familiarity with assembly language is handy.
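The text above doesn't say which signal was used. As a rough sketch, assuming SIGQUIT (whose default action is to terminate the process with a core dump), the following starts a deliberately looping child and stops it that way. The child program and the function name are stand-ins for illustration, not anything from ICMS, and whether a core file actually gets written depends on the system's core size limit.

```python
import signal
import subprocess
import sys
import tempfile

# A stand-in for the looping program: announces itself, then spins forever.
CHILD = (
    "import signal\n"
    "signal.signal(signal.SIGQUIT, signal.SIG_DFL)\n"
    "print('ready', flush=True)\n"
    "while True:\n"
    "    pass\n"
)


def quit_looping_child():
    """Start the looping child, send it SIGQUIT, and return its exit status."""
    workdir = tempfile.mkdtemp()  # any core file lands here, not in the cwd
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, cwd=workdir
    )
    try:
        child.stdout.readline()            # wait until the child is looping
        child.send_signal(signal.SIGQUIT)  # default action: terminate, dump core
        child.wait(timeout=30)
    finally:
        if child.poll() is None:           # safety net if the signal had no effect
            child.kill()
            child.wait()
        child.stdout.close()
    return child.returncode, workdir
```

A negative return code from subprocess means the child died from that signal number, so -signal.SIGQUIT confirms the signal is what stopped it.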

Since our conversations with SATSC suggested that EDITOR was not looping in other courts, it was reasonable to look for something in our execution environment that might be different and to which EDITOR was reacting badly. There are really only a few things that are much different in this court from the point of view of an executing program.

1. We're probably more careful about the ownerships and permissions on database and related files than some courts.
2. Our invocation front-end differs from the usual doappl in its handling of IDs; it changes only the effective ID for access to the database, where doappl changes both the effective and real ID. In other words, if I went in through doappl and did an id, I would get a display like uid=1000(dbown) ... whereas using our front-end I would get the display uid=1158(chap) euid=1000(dbown), so the system always knows who I really am.
3. The vast majority of our users connect to the ICMS host using TELNET over the DCN. The Lan Workplace TELNET client emulates a "dec-vt220" terminal, which we had to add to terminfo and termcap.

With new software releases, we sometimes have issues to iron out that stem from each of the above. The first two generally show up as "Permission denied" messages, though sometimes more obscure things happen. The second should not be a problem with any properly-designed program, and any problem that shows up almost always indicates that some programmer misunderstood what the access system call is intended for. (Many programmers may have learned about it from David Curry's book _Using C on the UNIX System_ (O'Reilly), which described it incorrectly, at least as of the fifth printing. The manual page and most UNIX security books get it right.)
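For anyone unsure what the misunderstanding looks like: access checks permissions against the real user and group IDs, while open is checked against the effective ones. Under a front-end that changes only the effective ID, the two can disagree, and a program that uses access as a "can I open this?" pre-check will refuse a file it could have opened. A minimal sketch of the two checks, with helper names that are purely illustrative:

```python
import os


def access_says_readable(path):
    # os.access wraps access(2): the answer is for the REAL uid/gid,
    # i.e. "could the invoking user read this?"
    return os.access(path, os.R_OK)


def open_says_readable(path):
    # open(2) is checked against the EFFECTIVE uid/gid, so simply trying
    # the open is the reliable way to learn whether it will succeed.
    try:
        with open(path, "rb"):
            return True
    except OSError:
        return False
```

When the real and effective IDs are the same, the two functions agree. They can differ only when a process runs with an effective ID that isn't its real one, which is exactly the case with our front-end.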

We have had very few permission- or access-related problems with ICMS releases; ICMS seems to be relatively clean where those issues are concerned. Those problems have cropped up more often with other nationally-supported applications which shall remain nameless.

We have had problems with past ICMS releases where certain executables would misbehave (they aborted, if memory serves) simply because the termcap descriptions for our terminal types were too long, apparently overflowing a fixed-size buffer. We went through several iterations of shortening the "descriptive name" part of the termcap to get the whole thing to fit in the buffer. I think our dec-vt220 long description is now something like "LnWkpTN", though I believe we avoided having to remove many actual capabilities from the description.

Oddly, it seemed as though different executables were using different buffer sizes for termcap descriptions. Somebody would report an executable not working right, we'd shorten the entry, and it would work fine--but later on somebody else would report another executable that still didn't work until we made the description even shorter. I don't know why that limitation isn't consistent; it might be an interesting thing to find out.
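The shortening exercise can be sketched as follows. This assumes the conventional 1024-byte buffer that tgetent() callers are expected to supply; as noted above, the ICMS executables may well have used smaller ones. The sample entry and both function names are made up for illustration, and the entry is not our real dec-vt220 description.

```python
TERMCAP_BUFSIZE = 1024  # conventional tgetent() buffer; an assumption here


def fits(entry, bufsize=TERMCAP_BUFSIZE):
    # A C program also needs one byte for the terminating NUL.
    return len(entry.encode("ascii")) + 1 <= bufsize


def shorten_long_name(entry, new_long_name):
    # The first field of a termcap entry is a list of names separated
    # by '|'; the last one is the long descriptive name.
    names, sep, rest = entry.partition(":")
    parts = names.split("|")
    if len(parts) > 1:
        parts[-1] = new_long_name
    return "|".join(parts) + sep + rest


# Hypothetical entry with a wordy descriptive name and three capabilities.
entry = "vt220|dec-vt220|DEC VT220 as emulated by the TELNET client:co#80:li#24:am:"
print(shorten_long_name(entry, "LnWkpTN"))
```

Only the descriptive name changes; the capabilities after the first colon are left alone, which is why this was the first thing we trimmed.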

But back to EDITOR.

Usually, when something we receive doesn't work right, I want to find out if the problem is related to any of the peculiarities of our environment. An easy way to check for access/ID sorts of problems is to run the executable with the real ID set to dbown (i.e. just the way doappl does it) and see if the problem goes away. With EDITOR, it did not.
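When trying this, it helps to confirm which IDs the process really ended up with. A small sketch that reports them in roughly the style of the id display described earlier; the function name is hypothetical, and the real id command prints more than this.

```python
import os
import pwd


def id_line():
    """Return the real (and, if different, effective) user ID of this process."""

    def label(uid):
        try:
            name = pwd.getpwuid(uid).pw_name
        except KeyError:
            name = "?"  # no passwd entry for this uid
        return "%d(%s)" % (uid, name)

    uid, euid = os.getuid(), os.geteuid()
    line = "uid=" + label(uid)
    if euid != uid:
        # Shown only when the effective ID differs, as under our front-end.
        line += " euid=" + label(euid)
    return line


print(id_line())
```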

Since we'd had problems with termcap length in the past, it seemed reasonable to check that out with EDITOR as well. Furthermore, endwin() sounds a lot like a terminal-related function, and stepping through it shows that it calls tputs, which makes it definitely a terminfo/termcap-related function. Aha!

I was almost certain the problem would go away when I tried running it on a vt100 or tvi922 with one of the old termcaps.

But it didn't.

It was aggravatingly consistent regardless of what terminal type or terminfo or termcap entry I tried.

So, if other courts have not experienced the problem, it's clear that something in our environment is contributing to it, but I'm at a loss to say what.

The offending loop in endwin() involves a tputs() (or was it tputso()?) call with an argument of GX. (It may be buried in a macro definition somewhere if it's not immediately obvious in the C source.) endwin() is apparently testing various flags encoded as bits at an offset 12 bytes into a structure named curscr, and using those flags to select which control strings such as GX need to be sent to the terminal. It then resets some flags and loops again until nothing is left to do.

Unfortunately, when I step through it here, after it tputses GX, it doesn't reset the flag bits involved in the preceding tests! So when it loops again, the tests still indicate that it should tputs GX. So it does. And loops. Again.
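The shape of the problem can be modeled in a few lines. This is only a sketch of the behavior as observed in the debugger, not a reconstruction of the endwin() source, which I don't have: the bit value, the function name, and the pass limit are all invented for illustration.

```python
NEEDS_GX = 0x0004  # hypothetical flag bit meaning "GX still has to be sent"


def endwin_model(flags, clear_mask, max_passes=5):
    """Model the loop: test a flag, send GX, AND the flags with clear_mask.

    Returns (strings_sent, finished). max_passes exists only so that the
    broken case stops instead of hanging like the real thing.
    """
    sent = []
    passes = 0
    while flags & NEEDS_GX:
        if passes == max_passes:
            return sent, False   # gave up: this would have looped forever
        sent.append("GX")        # stands in for the tputs of GX
        flags &= clear_mask      # the "reset some flags" step
        passes += 1
    return sent, True


# As received: the mask leaves the tested bit set, so the loop never ends.
print(endwin_model(NEEDS_GX, 0xFFFF))
# What the logic presumably intended: the mask clears the tested bit.
print(endwin_model(NEEDS_GX, 0xFFFF & ~NEEDS_GX))
```

The first call gives up after five passes with five GX strings sent; the second sends GX once and finishes.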

This is an obvious logic problem, even if other courts don't have whatever environment factor makes it show up.

What I changed to work around the problem is pretty much a slash-and-burn technique. Since I don't have endwin() source, I don't really have much idea what the different flags being tested signify and what the logic is supposed to do. On the other hand, it's a safe bet that endwin() is concerned only with sending the proper control sequences to leave the terminal in a reasonable state when the program exits, and the worst that could happen if I made a bad change is that the terminal would be left in a screwy state when the program exits--and if so I could try changing something else until it worked right. The downside risk was not too alarming.

The C code to reset the flags after the tputs() was compiled into an andw instruction with an immediate operand. Using the -w option to sdb, I just changed the immediate operand to exclude (n.b.) the bits that seemed to be involved with the previous tests.
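The arithmetic behind that change is simple: an AND with an immediate operand keeps only the bits that are set in the operand, so removing a bit from the operand makes the instruction clear that flag. A sketch, assuming a 16-bit operand purely for illustration; the bit values and the function name are invented, not the ones in the EDITOR executable.

```python
def patched_immediate(original, tested_bits, width=16):
    """Return the AND operand with the tested bits removed from it."""
    word_mask = (1 << width) - 1
    return original & ~tested_bits & word_mask


ORIGINAL = 0xFFF7     # hypothetical operand as shipped: clears only bit 3
TESTED = 0x0004       # hypothetical bit that the preceding tests look at
patched = patched_immediate(ORIGINAL, TESTED)

flags = 0x000C                 # both bits set on entry
print(hex(patched))            # 0xfff3
print(hex(flags & ORIGINAL))   # 0x4: tested bit survives, so the loop repeats
print(hex(flags & patched))    # 0x0: tested bit cleared, so the loop can end
```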

That turned out to be almost enough. It seems that different combinations of flags have been set in curscr by the time the program exits, depending on whether the user just hit ^K on the first screen, or went into other screens, did work, and then came back out. These result in slightly different execution paths through endwin(), though both paths wound up looping. My change to the andw operand solved the problem in one of those cases but not the other, since that part of the code was branched over.

So, I found that conditional branch and overwrote it with the appropriate number of nop instructions, causing the GX tputs and the andw to be executed in both cases. Very slash-and-burn, but my reasoning was that unconditionally sending a terminal-reset sort of control sequence, even when not absolutely necessary, might not be maximally efficient but wouldn't be likely to hurt anything. It solved the problem and didn't cause any apparent new ones, so that's the way we're running it now.

Hope this satisfies your inquiring mind.

-Chap