We believe that our needs can be met most effectively by capitalizing on software that has already been developed wherever possible. To that end, we make regular use in our operation of several pieces of freely-available software, mostly from the GNU Project of the Free Software Foundation. Because this software is not purchased, we see no conflict with the AO recommendation against local purchases of UNIX software. The free software we use, while free, is not "public-domain." The authors or their organizations do maintain copyright and, in the case of the FSF, explicitly use licensing terms to insist that the software remain freely available and that source code for it can always be obtained. These restrictions are fine with us.
The availability of source for the freeware we use has allowed us, in a few cases, to find a freeware utility close to what we need and make it exactly what we need, in a fraction of the time it would have taken us to develop it from scratch. We maintain our modifications in the form of patch files (see below) to the original freeware source. Where we have added more than one feature, each feature we have added is in its own patch file. We also create separate patch files of any platform-specific changes we made to get the utility to work under ISC or Unisys UNIX. What we will make available to other courts is the original, unmodified source code, plus our various patch files.
A patch file is a listing of the changes made to one or more source files, and it's easy for a human to read as well as for the machine to work from. We maintain the freeware source as "original + patch files" because the patches keep our changes cleanly separated from the original code, document exactly what we changed, and allow individual features to be applied or omitted independently.
The freeware utilities we have collected, and our reasons for using each one, are described below.
GNU diff does everything the standard diff command does, and also has a -c (context) option.
With this option, the diff listing produced includes a
few lines of context above and below every change.
It is this
added information that allows the patch program above to
apply our patches in various orders, or mixed and matched, when
the actual line numbers of changes could vary depending on the
other patches previously installed.
GNU diff also includes
other useful features not offered by the standard diff; see the
man page.
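The diff -c / patch round trip can be sketched like this (the file names are invented for the demonstration; note that diff exits nonzero whenever the files differ):

```shell
# Make an "original" and a locally-modified copy (names invented).
mkdir -p demo && cd demo
printf 'one\ntwo\nthree\n' > util.c.orig
printf 'one\nTWO\nthree\n' > util.c

# Record the modification as a context diff; the context lines are what
# let patch locate the change even when line numbers have shifted.
diff -c util.c.orig util.c > feature1.patch || true   # diff exits 1 on differences

# Re-apply the feature to a fresh copy of the original.
cp util.c.orig rebuilt.c
patch rebuilt.c < feature1.patch
cmp rebuilt.c util.c && echo "patch applied cleanly"
```

Because patch matches on the surrounding context rather than on absolute line numbers, per-feature patch files made this way can be applied in varying orders.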
Our only changes to diff are to accommodate the platform.
ddd is a freeware rewrite of dd, used mainly to copy data between various files and devices using specified block sizes.
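The basic job is the familiar one the stock dd does (the output path here is invented for the demonstration):

```shell
# Copy with an explicit block size: 4 blocks of 512 bytes each.
dd if=/dev/zero of=blocks.out bs=512 count=4 2>/dev/null
wc -c < blocks.out    # 2048 bytes written
```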
The author wanted a faster dd, and got it by using a
double-buffer scheme.
We were looking for a dd that could
handle running into the end of a tape and allow the data to
continue on a new volume.
Certain programs, like cpio,
have this ability built in, but by teaching ddd this trick, we can
use it with any other program whether designed for multiple
volumes or not.
Also, programs like cpio have a fixed method
of prompting for the next volume (typically a question at the
terminal).
Our version of ddd will use any given shell
procedure to indicate the need for a new volume, and determine
when to proceed.
This allows us to build procedures that run
in background or not directly attended.
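Such a procedure need be nothing more than a small script. This sketch waits for an operator on its standard input; the script name and the zero-exit-means-proceed convention are our assumptions for illustration, since the actual calling convention is local to our ddd:

```shell
# nextvol.sh -- a sketch of an end-of-volume procedure a ddd-like copier
# might invoke (hook name and exit convention assumed for this example).
cat > nextvol.sh <<'EOF'
#!/bin/sh
# Announce the volume change, then wait for the operator's go-ahead.
echo "End of volume reached; mount the next tape and press Return."
read reply
exit 0          # zero exit: proceed with the new volume
EOF
chmod +x nextvol.sh

echo "" | ./nextvol.sh    # simulate the operator pressing Return
```

Because the procedure reads its go-ahead from wherever it chooses, the same mechanism supports an unattended job that, say, waits on a stacker instead of a terminal.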
ddd does not include all the advanced features of dd. The author wanted simple and fast. We liked the simplicity since it was easy to add the end-of-volume checks we wanted. At present, ddd can be fast or multivolume but not both; if you use the ieof= or ofull= multivolume options, ddd reverts to single-buffering to sidestep the complexities of synchronizing two processes when one of them hits the end of a volume. The task is not impossible in theory and we will probably eliminate the restriction at some point. But at that time we will probably discard ddd and transplant our modifications to the GNU dd, which is fully-featured, better written, and part of a set of GNU utilities we're using anyway.
We got ddd before we started using patch,
and our
modifications
to ddd are not as carefully classified as the changes we've
made to other freeware.
pax
understands both the
tar and
cpio -c archive formats.
We were interested in extracting some summary information from our old civil/criminal archive tapes, which consisted of files stored within tar archives nested within cpio archives. We wanted to scan these files and extract information from the tape "on the fly," without first extracting the files to disk. To do so required modifying a program that understands cpio format, and one that understands tar format. Finding pax, which understands both, we accomplished our goal with one program and one modification. Our general-purpose modification to pax allows files within any tar or cpio archive to be processed on the fly by any desired combination of UNIX commands. The commands have access to the names and other status information concerning the files being processed.
We also added the complementary ability to pax; it can filter each individual file through an arbitrary bunch of UNIX commands on the fly while writing an archive. In this mode, pax reads standard input for not just filenames, but complete specifications of file name and status in a special format, which GNU find and gawk can provide. This gives complete control over what the headers in the archive will say about each file.
As a separate modification to pax, we added some privilege checks to be made if pax is running as the superuser for a user who is not. This allows us to make a separate, setuid copy of pax for use in our backup procedure, so an otherwise-nonprivileged operator can back up the system without logging in as the superuser. pax detects this situation and drops its privilege for certain actions the backup operator could otherwise exploit. The setuid copy of pax, called bakpax, is not executable by any user except through our backup front-end, which enforces further restrictions.
We also corrected some minor mistakes in the original code, and corrected the blocking of data to conform to the ISO 9945 standard.
GNU find is a replacement for the standard find tool, and provides a lot of added features,
including full
regular expression
matching, timestamp testing to the minute,
determining whether files have been read since last inode-change,
finer control over the portions of the directory tree to
search, and the ability to print any combination of desired
information
about matching files in any desired format.
This
last capability is used by pax to create customized
command sequences for processing archive files on the fly.
We added the
-abstract option allowing GNU find to apply any of
its tests and actions to "hypothetical" files as well as real
ones, allowing other programs like pax to take
advantage of find's capabilities.
We also added privilege checks allowing us to use a setuid copy of GNU find in our backup procedure. The setuid find allows the nonprivileged operator to scan the entire filesystem, passing the file names to the setuid pax described above, which can archive the files. Like pax, find recognizes this situation and drops its privilege for certain operations an operator could exploit. Also, like bakpax, the setuid copy of find (we call it gnufind) is executable only through our backup front-end program, which enforces further restrictions.
find / -whatever -print0 | xargs -0 something
will do the same thing as your usual find | xargs, but will
work right even if some file names contain newlines.
(We find
we have a surprising number of those in users' civ/crim report
directories.)
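The difference is easy to demonstrate (the directory and file name here are created just for the example; -print0 and xargs -0 are the GNU extensions discussed above):

```shell
dir=$(mktemp -d)
touch "$dir/bad
name"                 # a single file whose name contains a newline

find "$dir" -type f -print | wc -l     # reports 2 "files" -- fooled
find "$dir" -type f -print0 |
    xargs -0 sh -c 'echo $#' sh        # reports 1 -- correct
```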
gawk has a few features not found in the
standard System V awk,
but its greatest benefit to us may be
simply that it's different.
The AT&T awk, while ordinarily
quite adequate in speed, uses an algorithm somewhere with a
pathological worst case which can be exercised (we know!) by
certain programs doing repeated associative array operations.
On large files, these programs start fast and rapidly
deteriorate, sometimes processing only a few records per
minute.
All AT&T-derived awks I've seen do this.
Since the
FSF had to write gawk without reference to the AT&T source,
they picked a different algorithm or implementation that simply
doesn't have that problem.
An awk program we wrote to convert
our final Courtran docket tapes, which ran for hours without
making a dent in the first tape, did the entire set in 27
minutes running under gawk.
Happily, gawk also stacks up pretty well on speed for programs
that don't exercise the bug in awk.
gawk also tries to support
unlimited output redirections, which is a handy feature, and we
made two minor modifications so that works right.
We have also
found and
patched
several other flaws in gawk that caused it to
behave differently from nawk in certain situations, or to consume
excessive amounts of memory.
As far as adding brand-new features to the language--it's tempting. On the other hand, part of awk's appeal has always been its simplicity and high power-to-manual-weight ratio. If an everything-and-the-kitchen-sink-too language is desired, there's always perl. So we've settled on four new additions to the gawk language, which we feel are simple, general, applicable to a wide variety of problems, and consistent with awk's mission. We'll probably take two of them back out later, replacing them with a single more-general feature to cover them both.
The two most significant are probably
coprocesses and the exec() builtin.
With these, just about any other special requirement can be met by setting
up appropriate coprocesses rather than by further extending the awk
language.
For example, the original awk had no easy way to make a string
upper or lower case, so SysV.4 awk
extended the language with the toupper()
and tolower() builtins.
The extensions would probably not have been
necessary had it been easy to set up a coprocess that execs
tr, feed it
strings as desired, and read what comes back.
That's exactly the
kind of thing made possible
by our coprocess and exec() additions.
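Our coprocess and exec() syntax is local to our modified gawk, but the flavor of the idea can be shown with the one-way command pipe that standard awk already has; a true coprocess additionally lets the program read tr's answer back into a variable:

```shell
echo 'motion to dismiss' | awk '{
    cmd = "tr a-z A-Z"   # the helper command doing the case mapping
    print $0 | cmd       # feed the line to tr
    close(cmd)           # flush the pipe; tr output appears on stdout
}'
# prints: MOTION TO DISMISS
```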
Another addition we've been using widely is a cal() builtin for doing
calendar calculations.
We found a great piece of freeware by W.M. McKeeman
that converts dates from the Gregorian calendar to Julian day numbers and
back, valid from 1582 until way in the future, added the ability to deal
with the UNIX date/time structures, and built the whole thing in for any
gawk program to use as a built-in function.
It can be used to test the
validity of an entered date, calculate the interval between two dates,
ensure that one date precedes or follows another, etc.
All of our
locally-developed ICMS reports, among other things, now use this feature
of gawk to check date parameters in the requests.
The scripts are now
simpler than they were before, despite much more thorough validity tests.
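The cal() interface itself is local to our gawk, but the Gregorian-to-Julian-day-number arithmetic underneath can be sketched in portable awk (this is the standard Fliegel-Van Flandern method; it reproduces the day number 2449047 that egreg reports for 28 February 1993):

```shell
awk 'function jdn(y, m, d,    a) {
    a = int((m - 14) / 12)          # -1 for Jan/Feb, 0 for Mar-Dec
    return int(1461 * (y + 4800 + a) / 4) \
         + int(367 * (m - 2 - 12 * a) / 12) \
         - int(3 * int((y + 4900 + a) / 100) / 4) \
         + d - 32075
}
BEGIN {
    print jdn(1993, 2, 28)                    # 2449047
    # intervals become simple subtraction, with no month/year wrapping:
    print jdn(1993, 3, 1) - jdn(1993, 2, 28)  # 1
}'
```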
The last function we added to gawk
is for dealing with file information.
When we modified
pax and
GNU find to work with each other as coprocesses,
we did it by defining a compact way of representing
all the information known about a file,
and teaching both pax and find to exchange file
information in that format.
The statxr() builtin in gawk allows gawk
programs also to understand that format, and examine individual file
status fields by name without the usual bother of trying to parse the
output of
ls
or what not and wondering what happens if a file turns
up with newlines or special characters in its name.
To avoid starting to turn gawk into perl, we did not
add any facility for
gawk itself to stat a file or
change a file's status.
statxr() simply
provides a way to reliably and simply exchange complete file information
with other programs such as find and pax.
Even so, statxr scores lowest
of our four modifications on the scale of generality and usefulness as a
part of gawk.
We may at some future time find a way to move it back out
of gawk and
provide the same ability with some combination of a more
general-purpose builtin and a separate program that can be used as a
coprocess.
When we do that, we can probably handle cal() the same way.
In developing a significant body of code in gawk 2.13, we have discovered a few more bugs that we have not yet tracked down or corrected.
Version 2.13 is by now a pretty old gawk, and they've probably fixed many of the bugs in current versions. The catch is that porting our coprocess, etc., modifications to a newer gawk would take some time, so it hasn't been undertaken to date.
I
have recently been learning Tcl
which seems to be able to do a lot of what we've been using our modified
gawk to do, and possibly better. I got into it because it was
included with SCO OpenServer 5, but it is also freely available, so I obtained
the source and built it for the Interactive systems as well.
The CGI scripts for our www-based documentation are largely written in Tcl,
as are the scripts that merge our live and archived index information for
use by the public access clients.
It seems to be more solid than gawk, and might be a better choice for future
scripts.
The AT&T SysV.3
mkdir and
mv commands (that we know of)
contain flaws in the testing of permissions.
Both commands
were setuid in earlier versions of UNIX and needed to explicitly
check the access of the user's real ids.
Both commands are no longer setuid in this version of UNIX, which provides new system calls that eliminate the need; but the update was incomplete, and the commands still fail to base their access checks on the user's effective ids.
As a
result, these commands will often refuse (with a "Permission
denied" message) to create a directory or rename a file where
the real user could not, even though the effective user can.
The situation crops up frequently in applications front-ended
by invoke.
The GNU versions of these utilities simply use
the appropriate system calls and let the kernel take care of
access checking rather than trying to second-guess it; they
work fine and completely eliminate the problem.
When making a named pipe (FIFO) with the AT&T
mknod, the
FIFO gets created with the real rather than effective ids, and
can be inaccessible to the process that created it.
The GNU
mkfifo command uses the effective ids, in accordance with the
POSIX practice, eliminating the problem.
egreg converts dates to Julian day numbers and back. To find yesterday's date, for example, use egreg to get the Julian day, subtract 1, and use egreg again to get the corresponding date.
Your script
logic does not need to worry about wrapping around the beginning
of the month, week, or year.
McKeeman's original program is called greg; egreg
is our
"extended"
greg, which has one added feature in that it can do
a (simplistic but effective) search for the latest date satisfying
other specified fields; this simplifies finding the last day or week
of a month or year, etc.
This is triggered simply by putting a
too-big value in one of the fields; to find the last date in February
1993, say 'egreg 1993 2 32' and it will reply '1993 2 28 1 5 59 2449047'.
(It's a Sunday(1), week 5 of the year, day 59 of the year.)
The above is a cop-out approach to one of a wide variety of possible
date calculation problems.
Since McKeeman designed greg to act as a
logic programming predicate, it would be ideal to incorporate it into
an LP language and be able to say "find me a date satisfying these
arbitrary constraints." Anyone know of a freeware Prolog?
In the meantime, incorporating egreg into any programming language
would be more flexible than using it as a clumsy external command
from shell scripts.
My current thinking is to eventually graft it
into gawk.
greg's capabilities have since been grafted into gawk as the cal() builtin function.
We no longer use egreg in
any new development--gawk is used instead--
and egreg is still sitting around only to satisfy any scripts we
developed earlier and haven't gotten to updating yet.
Our biggest reason for building bash (confession time) was to compile cat into it as a builtin command.
We had a big old
hairy script
intended to process a ton o' files on the fly from a tape archive,
and building cat into the shell reduced its
fork/exec
overhead by a factor of about 11.
Caution: there are some reasons you might not want to make bash the only shell on your system. For starters, compare the size of the executable to the size of /bin/sh. Also, the : command in bash always has a 'success' exit status, whereas in sh and ksh it reflects the status of any command substitution in that simple-command. Many of our scripts depend on that and, therefore, break when run under bash.
less is a replacement for more.
This one is very feature-rich and
has been tested long and hard on many different terminal types.
Our
purpose in building it was to replace the page
program that comes with
CHASER, since that program hasn't worked well on most of the terminals and
emulators we have.
By using less, we can also provide a CHASER 'view
recent' option that allows paging back, and case-insensitive searches so
a user needn't worry about whether to search for order or ORDER.
All of the key mappings and prompts in less
can be completely redefined,
so we were able to give it a new personality pretty similar to stock CHASER.
So widely-used and mature is less that it compiled the first time with no local changes and not even a warning. (OK, for SCO it needed one #include adjustment.) The install script offers to build a version of less without shell or editor escapes, etc. For PACER we certainly don't want those escapes, but they can be disabled just as effectively by using -k with a key-definition file that invalidates those keys, and that way we can have both the original less functionality and the PACER personality with only one executable.

To simplify making up new less personalities, we divided our PACER key definitions into two files. The first, nokeys.less, simply invalidates every single keystroke sequence built into less (verified by checking the source where the default command table is created). If you invoke less with only this definition file, you will be able to do absolutely nothing, not even quit. Then another key-definition file can be created containing just the key definitions desired for the new personality. Multiple -k options are allowed, and later ones take precedence, so by invoking less -k nokeys.less -k mykeys.less you get a brand-new personality.

gzip is a freely-available replacement for compress. There are reports of a legal opinion that anyone who uses compress is infringing with each use. gunzip/zcat can uncompress files in gzip's own format, compress format, or pack format, since the uncompression algorithms are not patented, and it automatically detects which format it is trying to uncompress. The algorithm used by gzip often yields better compression than compress, and gzip also includes a CRC, allowing it to determine whether a file has been corrupted. Files compressed with gzip seem to consistently begin with the bytes 1f 8b 08 00, and we have used this observation to update the magic file on some of our systems so the file tool can identify such files. This program autoconfigures so well that it needed absolutely no local changes to compile and run on any of our systems.
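The magic-file entry involved looks something like this (the exact magic syntax varies a bit between file implementations, so treat this as a sketch; \037\213 is octal for the 1f 8b signature, and \010 is the deflate method byte):

```
0	string	\037\213\010	gzip compressed data
```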
We modified sz to read a file already open on the
specified handle;
the command line argument specifies only the name to be
passed to the receiving system.
This allows a script to control the name
under which a file will be received without having to worry about what the
file is named on the system of origin.