Software Archeology: Understanding Large Systems
Chet Hendrickson
My first programming job was with EDS, where I worked on a team of mostly rookie developers maintaining General Motors’ materials management system. The programs were around ten years old, written in PL/1 and IMS. After I gained some experience I was put in charge of the inventory update system. Its centerpiece was a job stream called VINMSPL, which processed information about incoming parts shipments, updating our central database and forwarding the information to the appropriate assembly plant. VINMSPL in its basic form was composed of about ninety PL/1 modules. These modules could be linked with an optional set of modules to form an IMS MPP program. By using a different set of I/O modules, you could build four IMS BMP programs that were run in the same job stream.
Some of the modules were at Panlevel 250, give or take, meaning that they had been promoted to production status 250 times in their ten-year life. In other words, they had been changed about once every two weeks. We had a couple of lines of comments for each of the last few changes. The original developers were long gone and there wasn’t even a binder of bad documentation. Now it was my responsibility.
The MPP version was used by about 150 people, working two shifts at headquarters and at each of GM's twenty-odd assembly plants. The batch version ran continuously for eighteen hours a day, five days a week, with daily executions running between sixty and one hundred twenty. These modules were free of 'golden code'.
It is very common among old, often-changed programs that certain routines are deemed 'golden code'. They are viewed as so important, so complicated, and in such a state of absolute perfection that they should never be changed.
VINMSPL did not have any 'golden code', but over the years it had picked up an interesting idiosyncrasy. During some modification, many years earlier, the online version abended on the production system. The problem could not be reproduced in the testing environment and visual inspection provided no solution. The only answer was to debug it in production. This was early-1980s IBM mainframe PL/1 and IMS; the debugging facilities were limited to writing messages to the system console. So that is what they did. The first message they inserted said 'in mainline'. The program was recompiled, the modules were relinked, and the whole thing was promoted to production. The screen was called up and some data entered; 'in mainline' scrolled up on the system console, but the program did not crash. Everyone stood around scratching their heads. They took the message out. The program crashed. They put it back in and it ran.
The best answer anyone could come up with was that the extra instructions required to put out the message had caused the program to shift around in memory, perhaps causing something to be aligned on a full-word boundary and preventing the crash. PL/1 uses a primitive form of garbage collection; memory is allocated and freed automatically by each module as it sees fit. Why it ran in test but not in production was never figured out. I never had the courage to take the message out and see if it would still fail. I wish I had.
When it came time to update these modules, I would talk the proposed changes over with my co-workers. We would discuss where and how to make the change. The only real tool we had was a known set of input data and database. The test data could be run through the programs and if all were well the output would match a reference file. The most important thing we did was determine how to change the input and predict how the new output would vary from the reference. If all went according to plan the reference would be changed to reflect the new behavior.
It is surprising that we did as little damage as we did.
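The reference-file approach described above can be sketched in a few lines of modern code. This is only an illustration under stated assumptions: `process_shipments`, the comma-separated input format, and the sample data are hypothetical stand-ins, not the actual VINMSPL logic. The point is the workflow: run known input, diff against the reference, and write down the predicted output before changing anything.

```python
import difflib


def process_shipments(lines):
    """Hypothetical stand-in for the legacy job: total quantity per part."""
    totals = {}
    for line in lines:
        part, qty = line.split(",")
        totals[part] = totals.get(part, 0) + int(qty)
    return [f"{part}={qty}" for part, qty in sorted(totals.items())]


def diff_against_reference(actual, reference):
    """Return unified-diff lines; an empty list means the output matches."""
    return list(
        difflib.unified_diff(
            reference, actual, fromfile="reference", tofile="actual", lineterm=""
        )
    )


# Known input and the reference output it is expected to produce.
KNOWN_INPUT = ["A100,5", "B200,3", "A100,2"]
REFERENCE = ["A100=7", "B200=3"]

# Step 1: the unmodified program must reproduce the reference exactly.
baseline_diff = diff_against_reference(process_shipments(KNOWN_INPUT), REFERENCE)

# Step 2: change the input and predict the new output before running it.
changed_input = KNOWN_INPUT + ["C300,4"]
predicted_output = REFERENCE + ["C300=4"]
prediction_diff = diff_against_reference(
    process_shipments(changed_input), predicted_output
)
```

If both diffs come back empty, the prediction held and the predicted output can become the new reference. A non-empty diff shows exactly where the program's behavior departed from what was expected.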
I have done a lot of software archeology since then. A great deal of making the C3 payroll system work was finding the handful of hard-coded Social Security Numbers that identified people who should have their union dues calculated differently. Our greatest challenge was reproducing the payroll master with 2500 fields, many of which had had their semantics changed but not their names.
In unearthing why a program is as it is, the most important tool is the collective memory of those who have come before us. At C3 we were able to make great use of the memories of the programmers working on the old payroll programs. Change logs will tell us when a change was made, by whom, and for what purpose. They seldom explain why the change was made in the way it was, and for a long-lived and volatile program the changes may overwhelm the original code. Unfortunately, those who have made the changes are not always available, and when they are, they most often do not remember why a thing was done as it was.
We are then left to understand, debug, or extend these programs on our own. The two techniques I have found useful have been active debugging and refactoring.
When I worked with monolithic procedural programs, such as those that made up VINMSPL, the best technique I found was to take a set of known input data and step through the program in question, using the best interactive debugger available. During my last few years of mainframe work, Micro Focus had available a very good PC-based COBOL toolkit. This workbench, as it was called, allowed you to simulate a mainframe environment, including databases and on-line subsystems. It had a very good symbolic debugger, which allowed you to step through the source code and examine and change the values in the program's memory. It was on a par with the Smalltalk debugger. This sort of tool makes it possible to see and understand how the program functions with a known set of input.
My current procedure for software archeology is to refactor the code. Source code is a form of communication, and one of the most powerful techniques for understanding someone's communication is to put the ideas into your own words and speak them back. This provides you with immediate feedback as to whether you have understood. Refactoring the code provides the same feedback. Having the unit tests show a green bar with your refactored code means that you have understood the code's intent well enough to successfully rewrite it.
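As a small illustration of refactoring as a check on understanding, consider the sketch below. Everything in it is hypothetical: the employee IDs, the rates, the cap, and both function names are invented for the example and are not taken from the C3 payroll system. The legacy routine is kept unchanged, the refactored version restates it "in our own words", and a unit test confirms the two agree on known inputs.

```python
import unittest


def dues_legacy(emp_id, gross):
    """Hypothetical legacy routine with special cases buried in the logic."""
    if emp_id == "E-0017" or emp_id == "E-0042":
        d = gross * 0.01
    else:
        d = gross * 0.02
    if d > 50:
        d = 50
    return round(d, 2)


# The refactoring names what the legacy code only implied.
REDUCED_RATE_IDS = frozenset({"E-0017", "E-0042"})  # hypothetical IDs
STANDARD_RATE = 0.02
REDUCED_RATE = 0.01
DUES_CAP = 50


def dues_refactored(emp_id, gross):
    """Same behavior as dues_legacy, restated with named concepts."""
    rate = REDUCED_RATE if emp_id in REDUCED_RATE_IDS else STANDARD_RATE
    return round(min(gross * rate, DUES_CAP), 2)


class DuesCharacterization(unittest.TestCase):
    """Pins the refactored code to the legacy behavior on known inputs."""

    CASES = [
        ("E-0017", 1000),  # reduced rate, under the cap
        ("E-0042", 9000),  # reduced rate, over the cap
        ("E-0001", 1000),  # standard rate, under the cap
        ("E-0001", 9000),  # standard rate, over the cap
        ("E-0001", 0),     # zero gross pay
    ]

    def test_refactored_matches_legacy(self):
        for emp_id, gross in self.CASES:
            with self.subTest(emp_id=emp_id, gross=gross):
                self.assertEqual(
                    dues_refactored(emp_id, gross), dues_legacy(emp_id, gross)
                )
```

Running this with `python -m unittest` gives the green bar only if the restatement is faithful. A red bar means the rewrite, and therefore the understanding behind it, missed something.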
Software source code is one of the most densely packed forms of communication we have. But it is still a form of human communication. Refactoring gives us a very powerful tool for improving our understanding of what someone else has written. Imagine how much easier Homer would have been if we could have put the Iliad into our own words and then run the unit test to see if it still meant the same thing!