Statistics, a vision into your complex application.
By John M McIntosh
Corporate Smalltalk Consulting Ltd.
www.smalltalkconsulting.com
johnmci@smalltalkconsulting.com
Many years ago I was involved in tuning IBM mainframe operating systems. Much of what we faced was similar to what we face with large systems today. MVS 370 was a very large program: you didn't have access to the source code, and how it worked in its entirety was understood by only a few. At any one point no one could really say what was happening, or whether what was going on was good or bad. This might be somewhat of a generalization, but it is similar to what you face today when you are greeted by a million-line system and told to improve its performance within just a few days.
Back then much of the work I did centered around collecting statistical information about how the system behaved. By examining numbers on a daily basis, one could form an opinion on how the system was behaving and how its users would experience throughput and performance. By applying a bit of understanding of how the system worked, and by collecting the right numbers, you could determine what type of tuning was required to make the system more responsive or capable of higher throughput.
The key issue here is measurement. Today we talk about having test cases to ensure a system is built correctly, but I would suggest we also make statistical collection points available so you can infer what the system is doing. If, for example, your system does transaction processing, should you not have a facility to monitor how many transactions a second it is actually processing? And if your system only needs to do so many transactions per second and it is much faster, could you not throttle response times and give back CPU time for something else to use? As an example, here are a number of charts from a system I wrote that fed back information on what it was doing, in real time, to a monitor as it was running.
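As a rough illustration of such a collection point, here is a minimal sketch in Python of a transactions-per-second counter over a sliding time window. It is not the system described in this article; the class name `ThroughputMonitor`, its methods, and the ten-second window are all assumptions made for the example.

```python
import time
from collections import deque


class ThroughputMonitor:
    """Hypothetical collection point: counts events in a sliding time window."""

    def __init__(self, window_seconds=10.0, clock=time.monotonic):
        self.window_seconds = window_seconds  # assumed window size, tune to taste
        self._clock = clock                   # injectable so the monitor can be tested
        self._timestamps = deque()            # completion time of each recent transaction

    def record(self):
        """Call once for every completed transaction."""
        now = self._clock()
        self._timestamps.append(now)
        self._expire(now)

    def rate(self):
        """Transactions per second, averaged over the window."""
        self._expire(self._clock())
        return len(self._timestamps) / self.window_seconds

    def _expire(self, now):
        # Drop timestamps that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
```

A server would call `record()` at the end of each transaction and have a monitoring thread sample `rate()` periodically. Comparing that figure with the rate the business actually requires is what tells you whether there is spare capacity to give back.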
In the first chart, the colored squiggly lines told us the server was behaving normally. In fact, if you understood the application, you knew what the slope of the colored lines really represented.

In the second chart, the view was the result of a mainframe connectivity failure. In both cases it was very evident from the visual information what the server was doing and what was really happening internally. To support this application, we gave operations a console to observe this chart, pictures of normal and abnormal behavior, and a warning bell! When the bell rang they could look at the chart and, based on what it looked like, take action.
Over the years I have had the chance to work on a number of large systems that have had garbage collection problems. In reviewing how they work I collect statistics that pertain to object creation, usage, and death. In Cincom's VisualWorks most of the garbage collection logic is exposed and accessible, so I can easily instrument the GC and collect megabytes of information.
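The same style of instrumentation can be sketched in other runtimes. The example below is not VisualWorks code; it swaps in CPython's `gc.callbacks` hook, which calls each registered callable with a phase (`"start"` or `"stop"`) and an info dictionary. The `GCRecorder` class and its field names are assumptions made for the illustration.

```python
import gc
import time


class GCRecorder:
    """Hypothetical recorder: one sample per collection, via gc.callbacks."""

    def __init__(self):
        self.samples = []      # one dict per completed collection
        self._started = None   # perf_counter value at the last "start" phase

    def __call__(self, phase, info):
        if phase == "start":
            self._started = time.perf_counter()
        elif phase == "stop" and self._started is not None:
            self.samples.append({
                "generation": info["generation"],        # which generation was collected
                "collected": info["collected"],          # objects freed
                "uncollectable": info["uncollectable"],  # objects that could not be freed
                "pause_seconds": time.perf_counter() - self._started,
            })
            self._started = None

    def install(self):
        gc.callbacks.append(self)

    def uninstall(self):
        gc.callbacks.remove(self)
```

Leaving a recorder like this installed while the application runs yields a time series of collections and pause lengths that can be charted and then discussed with the developers.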
By reviewing this data I can form an opinion of how the system is working. Then I examine the code by sitting with one of the programmers of the system and asking them to explain or show me what the system is doing at different points in the timeline, based on interesting things I am seeing in the charts or statistical data. From this viewpoint I can identify problem points within a large server-based system that are affecting its performance. This is quite a different way to tackle code exploration.
Example 1

In this case the real problem was that the system would crash at the end of this chart, but in reviewing the chart you can see there are some interesting data points. One of these was the spikes in the pink lines. In discussion with the developer we discovered that a certain task he was performing at the start of each transaction would stress the garbage collector and run into a hard implementation limit, which would result in a VM failure. The highest peak at the right occurred just before the failure. From the timeline of each spike we determined the problem code and rewrote it to eliminate the situation. Before doing this charting, the only information we had was that the system would run for a time and then core dump for unknown reasons.
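Picking spikes out of a collected series can also be done mechanically, so that their timestamps can be matched against application logs. The sketch below is an assumption-laden illustration, not the analysis used on this system: the function name `find_spikes` and the threshold of three times the median are arbitrary choices for the example.

```python
from statistics import median


def find_spikes(samples, factor=3.0):
    """Return the (timestamp, value) pairs whose value exceeds factor * median.

    `samples` is a list of (timestamp, value) pairs. `factor` is an arbitrary
    illustrative threshold, not a figure taken from the article.
    """
    if not samples:
        return []
    baseline = median(value for _, value in samples)
    return [(t, v) for t, v in samples if v > factor * baseline]
```

The timestamps it returns are the starting point for the conversation with the developer: what was the application doing at each of those moments?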
Example 2

In this case the observed behavior was that the clients of the server application would start dropping off the internet when the top of the staircase was reached. We determined that at this point a full GC event would occur and would take a very long time to complete at a high priority, which caused people's connections to fail. However, to solve this we were more interested in the staircase effect and in determining why we had such an interesting pattern. In fact the developers could not understand why such a perfect pattern existed, because the application in theory was processing variable numbers of transactions per minute. Yet the chart implied a very steady state.
In the end it was determined that the pattern was the result of a periodic task that generated lots of garbage, which the incremental collector was unable to collect fast enough. To fix this, we rewrote the code that ran at each vertical bar to reduce the amount of garbage it generated, then altered the GC tuning parameters to collect garbage faster so the curve would not rise and trigger a full collection.
So as you can see, visualization of a system can be another tool for determining its behavior and the impact of changes, or for indicating exactly where to change it.