Software Archaeology: Understanding Large Systems
Andy Schneider
Introduction
This position paper contains a number of things:
- My general view about understanding and improving a large system.
- Lists of the sorts of things I do to help me both understand a system and measure my effect on it.
- Some brief musings on refactoring, tool support and how to make changes stick.
My Position
The efficacy of any particular technique varies depending on certain criteria.
The techniques used to understand a large system that you have not encountered before are affected by a vast number of factors. Some of the more important factors are:
- Degree of maturity of the development process associated with the system.
- Importance of the system to the business.
- Degree of knowledge of the system in-house.
- Age of the system.
- Health of the system.
- Your purpose.
- Degree of structure (monolith to highly distributed).
The software community should share archaeology experiences through a pattern language.
Learning which techniques to apply, and when, is a matter of experience as well as raw intelligence. The choice of technique is always influenced by the context in which it is to be applied. We (the workshop) could do far worse than build a pattern language to communicate best practice within the subject area.
The best tools are a structured approach and human interactions.
I find the best way to improve the speed at which I grok a system is to focus on a series of questions. These questions give me structure. My experience with developers who struggle to understand a system is that they approach the system in an unstructured manner. As with most development, the addition of structure improves performance rather than reduces it.
Large legacy systems often (unless there is high churn) have large amounts of knowledge buried in the minds of old hands. Successfully working on a large system relies on you locating and extracting this knowledge. This information is far more useful than the output of any reverse engineering visualisation tool, despite what marketing would have us believe.
The only way to know you’ve made an improvement and not made the system worse is to test.
In legacy systems there are always a hundred and one reasons why you can’t test. Listen to them and then test. Without testing, you can’t know the system works. If you don’t know the system works you can’t know if you’ve made an improvement. Test, even if you have to do it by hand through a text interface on a VT52 terminal. If you really can’t test anything then get off the project.
Be a historian
You will make far greater progress if you understand why something is the way it is. At first glance you may see something that looks completely mad. You’ll refactor it and find that in stress testing the system falls to pieces. If you’d asked Fred, he’d have told you that the code is like this because of some problem with the interaction between this system and another one. All this mad code actually ensures correct timing synchronisation between systems. Fred may go on to tell you that they can’t change the other system because they lost the source code (it happens). If you’d asked in the first place you wouldn’t have wasted your time. To reiterate, mad code can often be the result of compromises made for good reason. Of course, it can just be mad… then you need to shoot it.
My approach
I ask myself a series of questions. These questions form the structure for my work. I’ve listed some of these in no particular order.
What am I doing?
What you are doing has an important influence on how you approach a large system. You need to be precise about what you are doing. Large systems are associated with teams and an incumbent culture. You need to understand what it means to be doing activity X, since there may be many unspoken assumptions by the team of what the task entails. The first thing you do is to open up lines of communication with the team. Without access to team members your job is that much more difficult. You may have been asked to add feature X. The question is, if you are adding feature X are you supposed to minimise modifications to the system or actively attempt to improve the parts of the system you touch? If you are supposed to be doing the former then you will need to understand the system, but you probably shouldn’t be spending time improving the system unless the management think the investment is going to provide an adequate ROI.
You will need to understand the processes (if any) that the team or organisation requires you to go through to perform your task. This gives you some guide to how the team works and how development is done. This in turn gives you a better handle on who does what and where useful information is likely to be stored.
What are they saying?
Start to build a glossary. Never assume that a term you’ve heard used on a previous project means the same thing on your new one.
Who should I talk to?
One of the key success factors is to identify, early on, who in the team is the natural networker. The networker won’t know all your answers but you can bet your career that they’ll know that Bob, on floor 10 at the back of the office behind the stuffed giraffe, knows just what the 10 lines of SPARC assembler are doing in the code you’re looking at. Once you’ve identified the networker, you can ascertain who is responsible for what. Initially, I take all answers with a pinch of salt and try and find more than one person to answer my questions. This way I can start to work out who really knows about sub-system Z and who just thinks they do. Resolving conflicting viewpoints can also lead to a far better understanding of the system.
Often there are no developers who understand an area well. This can be for any number of reasons but the three most common ones I encounter are:
- “That was developed ages ago and has worked fine up until now”
- “Joe was the only person who worked on that and he left”
- “It’s complex and unreliable, we make tactical bug fixes to it but we don’t seem to understand really what is going on”
In this case it’s just you and the code.
What information is available about the area I’m working on and how accurate is it?
We’ve discussed people but we also need to understand what documentation is available and how accurate it is. Even if documentation isn’t up to date it can often be useful as a guide to the basic structure of the area being worked on.
What is the purpose of the system?
I like to be given a demonstration of the system. I like this to consist of a chat with a user about the overall structure of the system. If I’m to be working on a specific functional area it is very useful to be given an overview of that too. Getting this from the user is much better than getting a demo from a developer. A user can talk about how a feature is actually used, whereas a developer often talks about how they think it is used. When the system is a middle or back office system, the user may be the administrator, operator or a senior architect. For instance, if you are working on a real-time information feed, then the users may in fact be other systems that consume that data. In that case it may be developers, rather than some actual “normal” person, that you need to talk to.
How important/reliable is the system?
If a system is mission critical, then you’ll want to spend more time validating what you do. This affects everything, from the initial understanding process through to the testing phase. Whilst everyone will say that quality/time is not a zero-sum game, the fact is that you do not want to apply testing suitable for a safety critical system to a system that processes payments on a nightly basis.
Use bug statistics to work out which areas of the system are unreliable. If you are doing system-wide refactoring then you should always address the areas that yield the most support calls earliest. This provides the business with the most business value for their investment in your time. Areas that don’t change much and are reliable may well become less reliable after being refactored. This may not be because the refactoring is bad but simply because a lot of refactoring to a stable system is bound to introduce bugs, testing or not.
Find out which areas of the system are hard to change and are changed frequently. These are good places to look for potential refactoring sites. Refactoring these areas will also add significant business value because it should reduce the cost of each new release.
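As a rough illustration of combining the two signals discussed above (bug reports and change frequency), the following Python sketch ranks modules by a simple score. The module names, the input lists and the scoring formula are all assumptions made for this example; a real project would pull the data from its own bug tracker and version control history, and might well weight the two signals differently.

```python
from collections import Counter


def rank_hotspots(bug_modules, changed_modules, top=3):
    """Rank modules by a crude (bugs + 1) * (changes + 1) score.

    bug_modules: iterable of module names, one entry per bug report.
    changed_modules: iterable of module names, one entry per change.
    """
    bugs = Counter(bug_modules)
    changes = Counter(changed_modules)
    scores = {}
    for module in set(bugs) | set(changes):
        # Adding 1 keeps a module with bugs but no recorded changes
        # (or the reverse) in the ranking, just lower down.
        scores[module] = (bugs[module] + 1) * (changes[module] + 1)
    # Highest score first; ties are broken by name so the order is stable.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top]


# Hypothetical data standing in for bug tracker and change log exports.
bug_modules = ["billing", "billing", "feed", "billing", "ui"]
changed_modules = ["billing", "feed", "feed", "feed", "report"]

for module, score in rank_hotspots(bug_modules, changed_modules):
    print(module, score)
```

A score like this is only a starting point for a conversation with the team; it says nothing about why an area is painful.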
If you don’t have bug report data and data (however anecdotal) indicating which modules are frequently and painfully changed then you could also try using tools that generate complexity metrics. The upside to these tools is that they generate good data on complexity and are objective. The downside is that you need to understand the metrics to interpret them and you also still need to understand how often code is changed in areas identified as having high complexity. Without this last datum you may find refactoring is misdirected.
Where does the system fit into the organisation’s information systems architecture (if at all)?
I like to understand the overall IT architecture for the organisation, in particular any interactions the system I am involved with has with other systems. Whilst this may not directly relate to what I’m doing, an understanding of the context of the system can help understand why things are the way they are.
What is the overall structure of the system?
What I want to understand about the overall system is what “shape” it is. I ask myself what sorts of architectural types are buried in it, does it have layering, pipelines, shared blackboards etc? This gives me an overall feel for how the system works. The most valuable sources of this information are:
- Documentation (even if out of date, a system’s basic shape often remains intact over time, even if the code has “rotted”).
- Diagrams: When developers draw diagrams they often draw their descriptions in a manner that reflects the basic shape of the software. If you find people drawing stacked boxes then the system may be layered; if you find people drawing data flow type diagrams then the system may be using a pipeline type architecture.
- People: If a project’s churn rate is reasonably low, then people are the most important source of information. I always try and ask more than one person the same question, since, in large systems, two people can have different views of the same system facet. Understanding why the views are different can lead to a better understanding of the code, or lead you to work out just who really knows what is going on and who is sorely mistaken.
If documentation doesn’t exist I’ll generate (probably on an A3 pad rather than in a CASE tool) rough architectural diagrams, sequence diagrams and any useful key use-cases I can derive. Keeping the documentation on paper keeps it simple. As soon as documentation becomes more formal in a large project you can find yourself waylaid into a documentation activity rather than fulfilling your primary goal.
When is the code I am working on invoked and why is it invoked?
I look into the circumstances under which a particular section of code is invoked. You need to understand what it is doing and why.
What is the behavioural structure of the code I’m working on?
When I understand why the thing I’m working on is called, I then start to trace its dynamic behaviour. I’ll sketch out sequence diagrams and jot down what I think the key functions/objects are. I’ll then try and find a developer who has been on the project for a while and convince them to spend some time validating the understanding I’ve acquired. This is very important, because you’ll often find you didn’t understand the significance of a data structure. In practice, large legacy systems are always more complicated than they look.
What tests are available for the system and area I’m working on?
Find some tests for your area and run them. If you are lucky you’ll find some regression tests (probably in a commented out main () in a source file!). Getting these working is important. It is instructive since they probably don’t compile and you’ll have to spend time modifying them until they do. This process leads you to a better understanding of the area you are working on and the associated tests. You also want to look for a way of “smoke testing” the system. This is important because large old systems have all sorts of complicated dependencies and the most trivial change in one area can result in a failure in a supposedly unrelated area elsewhere. If there are no tests then you need to build some. This can involve:
- Learning to use the system so you can test by hand. You need a user or experienced developer and you need to understand what the results should be and what are considered boundary condition inputs.
- Writing your own test cases. Again, you need to understand what the system is doing and what the input/outputs are. Without this you will be unable to test the system.
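One way to write your own test cases when nobody can say what the “right” answers are is to record what the code does today and compare against that record after each change (sometimes called a characterisation or golden-master test). The sketch below is a minimal Python illustration of that idea; `legacy_fee` and `refactored_fee` are hypothetical stand-ins for a real routine, and the approach assumes the inputs can be serialised as JSON.

```python
import json


def record_golden(func, inputs):
    """Capture what the code does today, right or wrong, for each input."""
    return {json.dumps(args): func(*args) for args in inputs}


def check_against_golden(func, golden):
    """Return (args, expected, actual) for every result that has changed."""
    mismatches = []
    for key, expected in golden.items():
        args = json.loads(key)
        actual = func(*args)
        if actual != expected:
            mismatches.append((args, expected, actual))
    return mismatches


# Hypothetical legacy routine; the real one would be reached through
# whatever interface the system actually exposes.
def legacy_fee(amount, currency):
    if currency == "GBP":
        return round(amount * 0.02, 2)
    return round(amount * 0.03, 2)


# Hypothetical rewrite that is supposed to behave identically.
def refactored_fee(amount, currency):
    rate = {"GBP": 0.02}.get(currency, 0.03)
    return round(amount * rate, 2)


# Include boundary inputs (here, a zero amount) alongside typical ones.
inputs = [(100, "GBP"), (100, "USD"), (0, "GBP")]
golden = record_golden(legacy_fee, inputs)
print(check_against_golden(refactored_fee, golden))
```

An empty list means the rewrite matches the recorded behaviour for these inputs only; it says nothing about inputs that were never recorded, which is why a user or experienced developer is still needed to choose them.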
If I’m fixing a bug, how does the bug manifest itself?
I try and reproduce the bug. This may seem like a statement of the obvious, but I’ve met many people whose approach is to work out what the bug is by staring at the code for long periods of time.
What facilities are available for debugging?
I want to understand the code (or bug) by working through the system in a debugger. I need to know whether I can run a system or a component of that system under a debugger and how I go about building a part of the system with symbols. This is often harder than it seems, since developers on long-running projects may have their own personal makefiles and tools for doing this, which aren’t consistent across the project or even checked in to source control.
What is the lifetime for dynamically allocated resources?
It is very important to focus on the lifetimes and allocation/deallocation protocols for resources used within the area you are working. You need to fully understand the protocols used to allocate and deallocate objects and also transfer ownership of resources through the system. In large C and C++ systems this area can be a minefield of inconsistent protocols and errors.
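If the system can be persuaded to log its allocations, ownership transfers and deallocations, even a small script can help check a suspected protocol. The Python sketch below assumes a purely hypothetical trace of `(action, resource_id, owner)` tuples; nothing about this format comes from a real tool, and a real system would need its own way of producing such events.

```python
def audit_lifetimes(events):
    """Check a trace of resource events against a simple ownership protocol.

    events: iterable of (action, resource_id, owner) tuples, where action
    is "alloc", "transfer" or "free" (a hypothetical format).
    Returns (leaked, problems): resources never freed, and rule violations.
    """
    owner_of = {}   # resources currently live, mapped to their owner
    problems = []
    for action, rid, owner in events:
        if action == "alloc":
            if rid in owner_of:
                problems.append(f"{rid}: allocated twice")
            owner_of[rid] = owner
        elif action == "transfer":
            if rid not in owner_of:
                problems.append(f"{rid}: transferred but not live")
            else:
                owner_of[rid] = owner
        elif action == "free":
            if rid not in owner_of:
                problems.append(f"{rid}: freed but not live")
            else:
                if owner_of[rid] != owner:
                    problems.append(
                        f"{rid}: freed by {owner}, owned by {owner_of[rid]}"
                    )
                del owner_of[rid]
    return sorted(owner_of), problems


# Hypothetical trace: buf1 follows the protocol and is then freed again,
# buf2 is freed by a component that does not own it, buf3 is never freed.
trace = [
    ("alloc", "buf1", "parser"),
    ("transfer", "buf1", "writer"),
    ("free", "buf1", "writer"),
    ("alloc", "buf2", "parser"),
    ("free", "buf2", "writer"),
    ("alloc", "buf3", "parser"),
    ("free", "buf1", "writer"),
]
leaked, problems = audit_lifetimes(trace)
print(leaked)
print(problems)
```

The rule that only the current owner may free a resource is itself an assumption; part of the archaeology is finding out which rules the code actually follows.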
What is enough?
All the above activities can be applied to the whole system or an individual area. It is important to apply them at the appropriate scale. When I’m working at the system level I identify key functions and use them as a guide to what is important to understand. I look for major interfaces and major interactions but I don’t get too detailed. If I’m working at a sub-system level, I’ll trace from entry points into the sub-system and then from the same entry points out into the rest of the system. I’ll be far more detailed and thorough than I would be at higher levels.
Brief Musings
Refactoring
Many people’s response to a discussion on large systems is to point to refactoring as a significant technique.
Refactoring is a fashion[1]. Like all fashions you should not follow it slavishly. Refactoring is most successful when people consistently make small testable changes as they work. Often large systems are a mess. For refactoring to have any impact you need to either make large changes or start making small changes and convince everyone else to do the same. The first approach requires you to have a good understanding of the entire system and know all the subtle dependencies, such as memory layout issues, that always lurk in old code. The second approach requires you to sell what may be a foreign approach to an existing team. This can be hard, can provoke bad feeling (“who is this know it all on our team?”) and may be outside the scope of what you have been asked to do. Successful refactoring requires tests. It is dangerous to refactor without tests. In particular, the lack of tests can restrict the scope of refactoring you can successfully apply. Often the best thing you can do is make the smallest possible change to achieve your end goal and get out of there ;-) Too many times I’ve seen developers refactor code into a nice structure that works in isolation but results in code elsewhere in the system failing. This is always the result of ignoring the old adage “look before you leap”.
Refactoring is good. Good developers have been doing it for years. I tend to refactor in large systems only when I’ve fully grokked the code and am confident I know what each statement does. I’ve found two approaches work well:
- Work from the inside out. Re-organise one function or method. Remove duplication, change poorly named types and variables, move code around to better reflect what should be happening.
- Isolate yourself. Locate the entry points into a particular area and build a clean interface to encapsulate them. Fix the calling code to use this interface (making the minimum of changes). Now, if you’ve understood the code properly, you can start to change things with greater abandon than otherwise. With the interface in place, you can start to test in isolation and build tests with greater speed. This will improve the likelihood of success in refactoring.
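The “isolate yourself” approach can be sketched as a thin wrapper around the existing entry points. In the Python example below, `ld_rt` and `cnv` are invented stand-ins for terse legacy functions, and `CurrencyConverter` is a hypothetical clean interface; the point is only the shape of the technique, not any particular system.

```python
# Hypothetical legacy entry points with terse names and shared state.
_RATES = {}


def ld_rt(ccy, val):
    _RATES[ccy] = val


def cnv(amt, ccy, flag):
    # flag == 1 means "round to two places"; anything else returns raw.
    result = amt * _RATES[ccy]
    return round(result, 2) if flag == 1 else result


class CurrencyConverter:
    """Single, clearly named way into the legacy conversion code."""

    def __init__(self, load_rate=ld_rt, convert=cnv):
        # The legacy functions are injected so tests can swap in fakes.
        self._load_rate = load_rate
        self._convert = convert

    def set_rate(self, currency, rate):
        self._load_rate(currency, rate)

    def to_base(self, amount, currency):
        # Callers no longer need to know about the legacy rounding flag.
        return self._convert(amount, currency, 1)


converter = CurrencyConverter()
converter.set_rate("USD", 0.5)
print(converter.to_base(10, "USD"))
```

Once callers go through the wrapper, the code behind it can be exercised with fakes passed to the constructor, which is what makes testing in isolation quicker.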
Tool support
Tool support is a good way of speeding up the process of understanding the system. It also helps you understand whether you’ve improved the system.
- Reverse Engineering
: I personally find reverse engineering tools useless. They tend to take a long time to set up (if they aren’t already set up) and the output can be complex and misleading. You normally end up with a mass of poorly laid out lines and rectangles and you don’t have the important thing, the semantics of these entities.
- Program execution visualisation tools
: (I’m not talking about normal debuggers here). Whilst the ability to see resources allocated, deallocated and interacting visually is appealing, our displays aren’t large enough to accommodate enough detail and they normally perform too poorly to be of use.
- Source navigation tools
: These can be immensely helpful. Tools that will build include/included-by trees, calls/called-by trees and references/referenced-by (often misleading due to aliasing), are effective for assisting in understanding the structure of the system.
- Scripting languages
: The first thing I do when I get a development environment is find or download Perl. I use Perl to build scripts that generate custom reports for me or allow me to make regular changes across a large codebase.
- Profiling tools
: Tools that let me profile performance (if I’m supposed to be making the system run faster) and show me memory access errors (such as Purify) can be useful. Often tools such as Purify can be difficult to apply across a large system, because of their memory requirements, so you need to be able to work out how to apply that tool to a particular section of the code. If I’ve been asked to improve the performance of part of the system and I don’t have access to a profiler then I’m doomed.
Of course, a fool with a tool is still a fool.
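To give a flavour of the kind of custom report a scripting language makes cheap, here is a minimal sketch of an included-by map for C or C++ sources. It is written in Python rather than Perl purely for illustration. The file names and contents are made up, and the sources are held in a dictionary so the example is self-contained; a real script would read the files from disk.

```python
import re
from collections import defaultdict

# Matches lines such as '#include "util.h"' or '  #  include <stdio.h>'.
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]')


def included_by(sources):
    """Map each included header to the sorted list of files that include it.

    sources: mapping of file name -> file text.
    Only direct includes are reported; conditional compilation and
    block comments are ignored, so the result is an approximation.
    """
    users = defaultdict(set)
    for name, text in sources.items():
        for line in text.splitlines():
            match = INCLUDE_RE.match(line)
            if match:
                users[match.group(1)].add(name)
    return {header: sorted(names) for header, names in users.items()}


# Hypothetical source files.
sources = {
    "feed.c": '#include "util.h"\n#include <stdio.h>\nint main(void) { return 0; }\n',
    "report.c": '  #  include "util.h"\n',
    "old.c": '// #include "ignored.h"\n',
}

for header, names in sorted(included_by(sources).items()):
    print(header, "<-", ", ".join(names))
```

Because the script does not run the preprocessor, its output is best treated as a map of where to look rather than a statement of what is actually compiled.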
Make the value you’ve added permanent
- Comment code that you are working on so the next person doesn’t have to go through the same pain.
- Find design documents to update and update them.
- Document how to run any tests you’ve developed.
- Give any useful scripts you’ve developed to the local toolsmith (there is always one).
- Become influential. Be humble, ask questions, provide good suggestions and produce changes in the code base that add value. If you do all these you’ll be on your way to becoming an influential member of the team. This will allow you to start to improve processes that need improving and to introduce new ideas that can add value but that are often lacking in a team that’s bogged down in keeping a mission critical system running on a day-by-day basis.
[1] Though good engineers have been improving existing code for years, the software community seems to have discovered this and named it “refactoring”. Refactoring is now seen by some engineers as a silver bullet.