Position Paper: A Software Dig
Author: Charles Weir
© Penrillian 30 August 2001
Project Background
Penrillian is a small software development company, specializing in Java for mobile communicator devices. Much of our development is cross-language: implementing functionality in Java using the C++ native functionality of the Symbian platform.
This paper describes a project we've been working on for a while, taking a Java runtime implementation written in Java and C++, and changing the user interface - and where necessary other parts as well - to support a different kind of device. The codebase is large: around 7000 source files in total, with a similar number generated by compilation. This is around the same number as the whole of the rest of the operating system with its standard applications.
For us software archeologists, this particular 'dig' has been long, painstaking, and hard. In this paper I'll examine some of the things that made it so.
The Problems
All software development is difficult. But which of our problems stemmed specifically from the size of the codebase and our unfamiliarity with it, rather than from the difficulty of the problem itself? I can identify several:
- Locating functionality: Finding each piece of functionality is difficult when you have many thousands of possible locations for it.
- Build procedures: The project used makefiles directly (in contrast to most Symbian source code, which uses a generator tool to create makefiles with correct dependencies). Given the large number of different components, these makefiles factored out shared functionality into 'include' files. However, more than half a dozen levels of nested includes for each makefile made them extremely difficult to analyze. There were also a surprising number of omitted dependency relationships, making it difficult to be sure that source changes were reflected in binaries.
- Deep directory hierarchies: Some source files were as many as 10 directory levels down, which made it tedious to get to files even when we knew where they were.
- Linked functionality: It was difficult to associate Java source code with its related C++ source code.
The remainder of this paper discusses how we tackled, or might tackle, each of these issues.
Locating Functionality
The main tool we use for locating functionality is a web-style search engine. We keep an up-to-date copy of the entire codebase in a directory available to all on the network, and run the search engine over this directory. We index every text file - particularly resource files and makefiles as well as compiler source code.
We use ISYS, as it's the only multi-user search engine we've found that's priced per user rather than per file indexed. Another option is the freeware version of AltaVista Discovery released several years ago; but this is intended to be single user only, and it's difficult to get it to work well with more than one.
It's then simple to locate a particular class or function: for classes, search for "class X". You can find the C++ function X::Foo just by searching for it. In Java it's usually easiest to find the file by searching for class X, and then locate the function within the file. Or if you're looking for, say, a function Foo whose first parameter is a string, just search for "Foo String". Similarly, it's easy to search for references to a particular function.
Search engines just produce a list of files. The key to productivity is being able to go from the search results to the file in your favourite source code editor with a minimum of keystrokes. For us this required setting up NT registry entries so that NT invokes Visual Studio as the default application for a large list of file types. In a UNIX environment the approach would probably be very different. It's worth spending the time for the improvement in productivity and usability this brings.
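As an illustration only, the "class X" query can be approximated without a search engine by a plain scan over the source tree. This is a minimal sketch, not how ISYS works: the function name find_class_definitions and the list of file suffixes are assumptions made for the example, and an unindexed scan like this would be far slower than an indexed search on a codebase of this size.

```python
import re
from pathlib import Path


def find_class_definitions(root, class_name, suffixes=(".java", ".h", ".cpp")):
    """Return sorted paths under root whose text contains 'class <class_name>'.

    Hypothetical helper for illustration. It also matches C++ forward
    declarations such as 'class X;', just as a text search would.
    """
    # \b keeps 'class Foo' from matching 'class FooBar'.
    pattern = re.compile(r"\bclass\s+" + re.escape(class_name) + r"\b")
    hits = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            if pattern.search(text):
                hits.append(path)
    return sorted(hits)
```

Calling find_class_definitions on the shared codebase directory with a class name returns the candidate files, which can then be opened in an editor.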
Build Procedures
Initially we thought the build procedures were an unimportant thing we just needed to kludge to work. After all, we thought, they worked for the team that provided us with the system, didn't they?
This proved a big mistake. It cost us a lot of time: we'd change some source, rebuild, and spend time debugging - only to find eventually that due to poor dependency control in the build system we were debugging binaries that didn't correspond to our source. We suspect our suppliers only ever did full builds (taking several hours); or used 'on-the-fly' scripts for individual components.
Eventually we took the hit of tackling the problem head on, applying the same kind of analytic thinking we apply to software design. The result has justified the effort; we can now make 'normal progress' on the project, and morale is much better.
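One cheap guard against debugging stale binaries is to check timestamps independently of the makefiles. The sketch below is an assumption-laden illustration, not what we actually built: the function name stale_targets is hypothetical, and it relies on a hand-maintained mapping from each binary to the sources it depends on, which is exactly the information the original makefiles omitted.

```python
import os


def stale_targets(dependencies):
    """Return the targets that are missing or older than any of their sources.

    dependencies maps a target path to a list of source paths. The mapping
    is assumed to be maintained by hand; all source paths must exist.
    """
    stale = []
    for target, sources in dependencies.items():
        if not os.path.exists(target):
            stale.append(target)
            continue
        built = os.path.getmtime(target)
        # A source modified after the target was built means a rebuild is due.
        if any(os.path.getmtime(src) > built for src in sources):
            stale.append(target)
    return sorted(stale)
```

Running such a check before a debugging session takes seconds, compared with the hours a full rebuild took us.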
Deep Directory Hierarchies
These directory hierarchies have proved a particular pain, since they make it very difficult to hold an intuitive picture of the software structure. In this case, one hierarchy might contain the Java code; another the C++; two others, apparently only minimally related, the makefiles to compile each.
The solution we've found to this is to use tools that map the hierarchy to a flat view. The search tool effectively does this. Surprisingly, Visual Studio's project view also lists files in a flat structure, no matter where they may be in practice. We use Visual Studio to bring together all the files related to the current task; it's easy to select each one without worrying where it may be (and Visual Studio's integration with Perforce also makes source code control easy).
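The same flattening idea can be sketched in a few lines for environments without such tools. The function name flat_index is hypothetical; the sketch simply maps each file name to every full path where it occurs, so that a file can be reached by name alone regardless of how deep it sits.

```python
import os
from collections import defaultdict


def flat_index(root):
    """Map each file name under root to the sorted list of its full paths."""
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            index[name].append(os.path.join(dirpath, name))
    # Names that appear in several hierarchies keep all their locations.
    return {name: sorted(paths) for name, paths in index.items()}
```

A name that maps to more than one path is itself useful information, since it shows which files are duplicated across the separate hierarchies.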
Linked Functionality
Typically one might be looking at a Java class. It calls a Java native function. The framework code implementing the native function mangles the Java function names in a non-intuitive way, and to implement the functionality, there's another level of indirection to C++ classes having a different naming convention again. This is a common problem where more than one programming language is working together; I've had similar problems following kernel calls too.
We have no current answer for this. One option we may consider is to establish the pattern used, or even just to make a spreadsheet with the mapping for each class (or even each class and function).
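If the pattern turned out to be the standard Java Native Interface (JNI) naming scheme, the first half of the mapping could be generated rather than typed in. That is an assumption: this paper does not establish which scheme the runtime uses. The sketch below implements the JNI short-name rules only (no overload signatures, and only characters in the Basic Multilingual Plane); the function names and the example class names are hypothetical. The second step, from native function to C++ class, would still need its own table.

```python
def jni_escape(name):
    """Escape a class or method name using the JNI name-mangling rules."""
    out = []
    for ch in name:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        elif ch in "./":
            out.append("_")    # package separator
        elif ch == "_":
            out.append("_1")
        elif ch == ";":
            out.append("_2")
        elif ch == "[":
            out.append("_3")
        else:
            out.append("_0%04x" % ord(ch))  # other characters as _0xxxx
    return "".join(out)


def jni_short_name(class_name, method_name):
    """Return the short native function name for a Java native method."""
    return "Java_" + jni_escape(class_name) + "_" + jni_escape(method_name)


# Example with hypothetical names: one row of the proposed spreadsheet.
print(jni_short_name("com.example.Device_Screen", "drawLine"))
```

Generating one such row per native method would give the spreadsheet suggested above, which could then be indexed by the search engine like any other text file.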