Reflections on developing the abcMIDI package

Introduction

The abc language is very simple on the surface. However, closer investigation reveals a vast number of subtleties, and there is a lot more to the language than you might first think. If you're considering writing your own abc program, rather than developing it from scratch, why not add to the abcMIDI package or base your program on some of its code? The abcMIDI code is distributed under the GNU General Public License, which is intended, among other things, to make the code available for re-use by other freeware developers.

I've written this page chiefly to offer some advice to anyone thinking of developing the abcMIDI package. The code has been written with the aims of being portable and easy to maintain, but working with such a large program is a daunting task, particularly if you haven't tried anything similar before. Also, the code is written in C, so you should be prepared to work in that language. If you plan on writing your own separate utility, then you have a free hand to do pretty much what you like in terms of re-organizing the code. However, if you are making changes in the code that you want to 'hand back' to other users of the original code, then you need to make sure your changes disrupt the code as little as possible, or else the handing back may become impossible. A simple change to the code can be handed on as a patch: a small file showing which lines have to be inserted, deleted or changed. The patch can be generated automatically using the diff utility. There is also a utility called patch which will apply such patches, but usually the patch can be applied by hand.

Here I describe the tools and techniques that I use for working with the code. This is part documentation, part essay on software engineering; I hope you find something interesting. You will probably find places where the code fails to live up to the high ideals I have set down.

How to Read the Code

The first approach that most people are likely to try with a large program is to open up the source files and read through the source code. This will give you a general overview of the code and a feel for the coding style, and maybe you'll find a few useful comments documenting bits of the code. However, for most big programs you can't gain a detailed understanding of the program that way; certainly not one that will allow you to go in and start altering code. There is just too much information. Normally you will be interested in just a small part of the code which you want to change.

I like to think of a complicated program as a big machine built out of "black boxes", where each function is a black box. A black box is a unit that does a well-defined job (described by a short comment at the start in the case of the function). If we open up the machine, we find it is made up of a small number of black boxes connected together in a fairly simple manner (at this level we are just examining the code for main()). Suppose we are trying to change some aspect of the program. There will be one of the black boxes that does not behave quite as we want. We will need to open up that new black box, see how it works and which component black box or boxes are not behaving as desired, open them up and so on down until we reach the code that does what we are interested in.

Naturally enough, the black boxes will not always be laid out next to each other and you will have to jump around in the code a bit to trace them. This is where an editor with a search facility becomes vital. However, I have tried to group related routines together.

What I've described is really just a top-down approach to reading a program. From the program maintainer's point of view, the moral is that by using a large number of short functions, each with a well-defined job, you can get away with reading only a small proportion of the code to understand it well enough to start modifying it. This brings me to the first 2 aims for the abcMIDI coding style:

Keep routines short.
Use local instead of global variables wherever practical. Careless use of global variables (particularly re-using them for another purpose) makes the code harder to maintain. If we change a local variable in a routine, it is obvious this will have no effect elsewhere. However, if we change the value of a global variable in one routine, we need to scan the whole of the code to make sure the change doesn't have an effect somewhere else.
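
As a minimal sketch of the second aim, the two routines below do the same job, once with a global variable and once with a local one. The names (note_total, count_notes_global, count_notes_local) are invented for illustration and are not taken from the abcMIDI source; the notion of a "note" here is simply any letter A-G or a-g.

```c
#include <string.h>

/* Hypothetical example; none of these names come from abcMIDI. */

/* Global version: the count is visible to, and can be changed by,
   every routine in the program. */
int note_total = 0;

void count_notes_global(const char *bar)
{
  note_total = 0;
  while (*bar != '\0') {
    if (strchr("ABCDEFGabcdefg", *bar) != NULL) {
      note_total = note_total + 1;
    }
    bar = bar + 1;
  }
}

/* Local version: the count exists only inside this routine, so changing
   how it is used cannot affect code anywhere else. */
int count_notes_local(const char *bar)
{
  int total = 0;

  while (*bar != '\0') {
    if (strchr("ABCDEFGabcdefg", *bar) != NULL) {
      total = total + 1;
    }
    bar = bar + 1;
  }
  return total;
}
```

To check that a change to count_notes_local is safe, you only need to read that one routine. To check a change to count_notes_global, you need to search the whole program for other uses of note_total.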

I have also tried to use a consistent indentation style. I use braces round if .. else .. clauses, even if they only contain one statement, so that it is easier to check for correct nesting visually. One quirk of my style is that case statements are not indented within a switch statement. If you do indent, it will look as if there is a missing brace at the end of the switch statement.
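
The layout described above might look something like the following sketch. The routine itself (apply_accidental) is hypothetical and exists only to show the braces and the unindented case labels; it is not part of abcMIDI.

```c
/* Hypothetical routine, used only to illustrate the layout. */
int apply_accidental(int pitch, char symbol)
{
  int shift;

  /* case labels line up with the switch, so the closing brace of the
     switch does not look as if it is missing a partner */
  switch (symbol) {
  case '^':
    shift = 1;
    break;
  case '_':
    shift = -1;
    break;
  default:
    shift = 0;
    break;
  }
  pitch = pitch + shift;
  /* braces are used even round single statements */
  if (pitch > 127) {
    pitch = 127;
  } else {
    if (pitch < 0) {
      pitch = 0;
    }
  }
  return pitch;
}
```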

Tools to use

This is partly a matter of taste. I use a set of free unix-like tools ported to the DOS environment. Obviously you will need a text editor with a global search facility (I use vim, a vi-clone which has been ported to a number of operating systems). I also find the grep file search program very useful.

Of course, a very important tool is your C compiler. I use and recommend DJGPP, a port of gcc, the GNU C compiler, to DOS/Windows. This not only performs compilation, but also comes with a number of useful utilities including make and split/merge, utilities for breaking up a large file into components small enough to go on a floppy and then combining them back into the original file afterwards. These latter two are needed to install DJGPP from floppy. The GNU C compiler also has options to do checks on C code that were traditionally done by lint in the past. There are also a number of other programs, including symify, an extremely useful post-mortem tool that can pinpoint where pointer errors are causing segmentation faults.

Of course, there are other compilers, many of which are smaller and easier to install. I have tried to make the code portable, which means that I have avoided features that are only provided by one particular compiler or operating system. PCC is a smaller compiler which will compile abcMIDI and is much simpler to install.

Program Quality Objectives

Beyond the obvious aims such as making the abc2midi program interpret abc reliably, I have aimed for a number of more general features which come under the heading of Program Quality:

Robustness. This means that the programs should never crash or start doing strange things. The key to this is good checking of the input data. Moreover, if I'm doing this, I might as well provide useful diagnostics to the user if I do find an error. abc2midi uses a central error handling routine which outputs a line number for any error. This does mean storing line numbers away in case an error is encountered not in the source abc but at a later stage of processing.
If I am adding a complicated unit to the code (for instance, the queue-handling procedures in abc2midi), I generally try to verify the operation of the unit separately before incorporating it into the code. I don't want to add a faulty sub-system which may cause subtle errors later.
Pointers tend to be a source of mysterious errors, though much less so once I discovered symify. I have tried to follow a fairly strict discipline with pointer variables. When they do not hold a valid pointer, they are assigned the value NULL. When they are dereferenced, I usually first check for a NULL value.
Extensibility. Generally, where an array can be of arbitrary size, I have tried to use an extensible data structure. There are two basic approaches I have used. The first is the traditional computer science linked list. The other is a less conventional expanding array. This is initially allocated with one fixed size. If the space gets used up, a new array twice as big is created and the data is copied across into the new array. Inevitably, these data structures make the code just a fraction less easy to follow, even when manipulation of the data structures is abstracted out to special functions.
For some things, providing this sort of extensibility is not worth doing because you can choose an upper bound which is only going to be exceeded by incorrect or very bizarre input. For example, the level of bracket nesting in a part specifier is unlikely to exceed the hard limit of 10.
Portability. I initially started writing the code on a Sun workstation, but with the aim of ultimately porting the code to a PC. I had previously used the netpbm graphics utilities and was impressed by their simple design. Thus I wrote the code to use vanilla C with no interactive interface and hence only rudimentary I/O requirements from the operating system.
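
The expanding array and the pointer discipline described above can be sketched together as follows. This is an illustration under my own assumptions, not the code used in abc2midi: the structure name, the function names and the initial size of 4 are all invented. The array starts at a fixed size, is replaced by one twice as big when it fills up, and the pointer is set back to NULL as soon as it no longer refers to valid storage.

```c
#include <stdlib.h>
#include <string.h>

#define INITIAL_SIZE 4   /* assumed starting size, chosen arbitrarily */

/* Hypothetical expanding array of ints; not taken from abcMIDI. */
struct intarray {
  int *data;      /* NULL whenever no valid storage is held */
  int used;       /* number of elements stored so far */
  int capacity;   /* number of elements there is room for */
};

/* Allocate the initial fixed-size array. Returns 1 on success, 0 on failure. */
int intarray_init(struct intarray *a)
{
  if (a == NULL) {
    return 0;
  }
  a->used = 0;
  a->capacity = 0;
  a->data = (int *) malloc(INITIAL_SIZE * sizeof(int));
  if (a->data == NULL) {
    return 0;
  }
  a->capacity = INITIAL_SIZE;
  return 1;
}

/* Add one value, doubling the array first if it is full.
   Returns 1 on success, 0 on failure. */
int intarray_append(struct intarray *a, int value)
{
  int *bigger;

  if (a == NULL || a->data == NULL) {
    return 0;
  }
  if (a->used == a->capacity) {
    bigger = (int *) malloc(2 * a->capacity * sizeof(int));
    if (bigger == NULL) {
      return 0;
    }
    /* copy the old contents across, then discard the old array */
    memcpy(bigger, a->data, a->used * sizeof(int));
    free(a->data);
    a->data = bigger;
    a->capacity = 2 * a->capacity;
  }
  a->data[a->used] = value;
  a->used = a->used + 1;
  return 1;
}

/* Release the storage and mark the pointer as invalid. */
void intarray_free(struct intarray *a)
{
  if (a != NULL) {
    free(a->data);
    a->data = NULL;
    a->used = 0;
    a->capacity = 0;
  }
}
```

Because the callers only ever go through these three functions, the doubling logic stays in one place, and a call made after intarray_free fails cleanly instead of dereferencing a stale pointer.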

Why Modules are a Good Thing

All the programs in the abcMIDI package are divided into modules. Each module is compiled separately and then they are all linked together. The parser part of the code is written as a separate module which can be linked in with any one of 3 other codes.

If you look at the source for abc2ps (not written by me), you will find it consists of many files, but they are all referenced using #include statements by one 'master' file. Therefore, what the compiler sees after pre-processing is one massive file. Doing it this way is perfectly valid C, but it does have some drawbacks when compared to using modules.

Perhaps the most obvious reason for using modules is to break a very large program into manageable chunks. Most editors have a limit on the size of file they can handle. Also, many compilers (including PCC) have a limit to the size of file they can compile before they overflow their internal tables. Using modules means that only the linker need deal with the whole thing (and linkers are usually written to be capable of this). An added fringe benefit is that by using a makefile to do the compilation, you only need to re-compile the modules that have changed since your last compilation.

Dividing the code into modules also breaks up the code into logical units, which makes it easier to read. The C language only allows access to variables in another module if they are declared with an extern statement. This enforces the logical separation of the modules since the programmer cannot inadvertently access global variables in other modules. Doing things the other way round, a global variable declared in one module can be made local to that module and invisible to the other modules by declaring it as static, reinforcing the "black box" approach.
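
Here is a small sketch of how extern and static work together. To keep it in one piece, both halves are shown in a single file; in a real program the first half would sit in a header and the second in its own source file. The file names tunes.h and tunes.c, and all the identifiers, are assumptions made up for this example and do not appear in abcMIDI.

```c
/* Hypothetical names throughout; this is not abcMIDI code. */

/* --- what a shared header (say, tunes.h) would contain --- */
extern int tunes_seen;        /* declaration only: the storage lives in tunes.c */
void record_tune(int line);
int last_tune_line(void);

/* --- what the module itself (say, tunes.c) would contain --- */
int tunes_seen = 0;           /* global, reachable from other modules via extern */
static int last_line = -1;    /* static: invisible outside this file */

/* Note that another tune has been found, starting at the given line. */
void record_tune(int line)
{
  tunes_seen = tunes_seen + 1;
  last_line = line;
}

/* The only way other modules can read last_line is through this routine. */
int last_tune_line(void)
{
  return last_line;
}
```

Another module that includes the header can read tunes_seen directly, but it can only find out about last_line by calling last_tune_line(), which keeps that variable inside the black box.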

Another reason why I used modules was to be able to keep the midifile code (not written by me) as a separate unit that couldn't be affected by my own coding changes. This meant that the code was modular right from the start. The midi2abc code has remained small, but the abc2midi code grew so large that I had to break up one of the large files into smaller modules to get it to compile with PCC.

If you wish to add a new body of code to abc2midi, one way you might consider doing it is to write a new module and link it into the main code with a small number of changes to the main code providing the interface to your module. This way, plugging in and unplugging your module becomes a simple matter.

To program with modules, you do need to understand how C handles module interaction (in particular the extern and static keywords), but it gives a number of advantages. A good way of thinking about a module is to think of it supplying a set of routines in the same way that a system library does.

Code Merging

From time to time I made changes to one version of the code and needed to incorporate the changes into another version. The essential tool to compare two versions of a file is the diff utility, which shows which lines have changed. Of course, if you've made too many changes, this may not be of much help. In a few cases, diff has helped me track down the problem when a few simple changes have caused the program to fail mysteriously. Naturally, I keep a copy of the most recent working version of the code to fall back on if something goes badly wrong during development.

There is a utility called diff3 which will merge together two variants of a program, but I usually find my own hand-editing is good enough to do the job.

One thing to be wary of is making lots of changes for no good reason; for example using indent or some other program to pretty-print the code in a style which is more to your personal taste. Doing this is likely to make it impossible to use diff to pinpoint new code and result in a variant abc2midi strain which cannot be merged back with the original.

Testing

Before releasing a new version of the code, I test it on some samples of abc code. These consist of test files of my own devising as well as a collection of tunes taken from the web. The testing is automatic, using DOS batch files and using diff and fc to compare the results against a reference file of previous good results.
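
The comparison step is done with diff and fc, but the idea behind it is simple enough to sketch in C. The routine below is my own illustration (the name first_difference and the 1024-character line limit are assumptions); it reads a result file and a reference file side by side and reports where they first disagree.

```c
#include <stdio.h>
#include <string.h>

#define MAXLINE 1024   /* assumed limit; longer lines are compared in pieces */

/* Hypothetical helper, not part of abcMIDI.
   Returns 0 if the two streams match, the 1-based number of the first
   line (or piece of a line) that differs, or -1 if either stream is NULL. */
long first_difference(FILE *result, FILE *reference)
{
  char a[MAXLINE];
  char b[MAXLINE];
  char *pa;
  char *pb;
  long lineno = 0;

  if (result == NULL || reference == NULL) {
    return -1;
  }
  for (;;) {
    pa = fgets(a, MAXLINE, result);
    pb = fgets(b, MAXLINE, reference);
    lineno = lineno + 1;
    if (pa == NULL && pb == NULL) {
      return 0;           /* both ended together: no difference */
    }
    if (pa == NULL || pb == NULL) {
      return lineno;      /* one file is shorter than the other */
    }
    if (strcmp(a, b) != 0) {
      return lineno;
    }
  }
}
```

A test passes when the routine returns 0 for the new output against the stored reference file; any other value points at the first line worth inspecting by hand.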

Ideally, when I add a new feature to one of the programs, I should add a new test to show whether that feature is working properly. However, I have been fairly lax about this. From time to time, I change the way error-handling is done and the tests show differences between the output and the reference file. As long as I can convince myself that the new output is correct, I update the reference file.

I always try to put out bug fixes for reported problems fairly quickly. However, there were a few times when I released bug-fix versions which had worse problems than the original bug. This is what convinced me of the need for a quick automatic test.

If you are working with the code and adding new functionality, it is better to release a series of small updates than to release your masterpiece after six months of work. This way, improvements in the code can be blended in relatively painlessly, and you are unlikely to be duplicating someone else's work. This is my philosophy at least and the reason why I need 3 numbers to specify the version. The automatic tests can be applied quickly and either expose problems or give me a lot of confidence in the current version.

Books

If you plan to do much coding, you will almost certainly find you need to buy a reference book on C. There are many good books available, so I won't try to recommend one. However, you should get one that is modern enough to explain K&R C (the original) and ANSI C (the modern standard).

And Finally...

I offer you 3 definitions to the speaker of British English:

Quality Insurance means paying out a sum of money if certain quality targets are not met.
Quality Assurance means convincing other people that quality targets have been met, whether or not they actually have.
Quality Ensurance means guaranteeing that quality targets actually are met.

Where possible go for number 3.