Dramatic Gains in Defect Insertion (How to avoid it, that is)

Authors:


INTRODUCTION

In an environment where programmers are five to ten times more productive, both features and defects can be added to the product much more quickly than would otherwise be possible. Dramatic productivity improvements will only be sustained when testing is an integral part of the Smalltalk development process.

In this paper we discuss the issues involved in testing Smalltalk systems, based on our experiences in the testing and release engineering of ENVY/Smalltalk, a commercial Smalltalk implementation.

BACKGROUND

ENVY/Smalltalk is a multi-platform Smalltalk implementation designed around the notion of portable APIs (Application Programming Interfaces). While the implementations of certain APIs may vary, applications programmed in accordance with the APIs will work on all supported platforms. In addition, such programs adopt the native look and feel of the particular platform. To maximize performance and platform integration, we implement the portable APIs using the corresponding platform capabilities directly wherever possible.
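
The following is a minimal sketch of this idea; PortableScreen, WinScreen, MotifScreen and the messages shown are purely illustrative names, not the actual ENVY/Smalltalk classes. Application code sends only the portable message, and the object that receives it is the platform-specific implementation selected when the image starts up on that platform.

    | screen |
    "PortableScreen default is assumed to answer a platform-specific instance,
     e.g. a WinScreen under MS Windows or a MotifScreen under OSF/Motif."
    screen := PortableScreen default.
    screen beep.                                   "same application code on every platform"
    screen message: 'Backup complete' title: 'Status'.   "shown with the native look and feel"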

Software quality obviously depends on the quality of the specification, so this is where it all starts. Detailed specifications exist for all ENVY/Smalltalk API calls. Here, however, we will focus on implementation quality issues, leaving specification issues for another discussion. We do feel that good specifications are essential, because they provide the criteria for deciding when the software works correctly. The purpose of testing, then, is to attempt to ensure that the software performs in accordance with the specification.

First, some terminology [Liskov 86]: validation is the process of increasing our confidence that a program works as intended; verification is a formal or informal argument that it works on all possible inputs; testing is running the program on selected inputs and checking the results.

Assuming we have done a good job of understanding the customer requirements, and a good job of developing a specification and design, we are interested in what techniques are available to us to validate the software. The point of validation is to discover (and fix) as many defects as possible in the software before the customer has a chance to see it. To do this effectively, a combination of verification and testing techniques is required.

APPROACHES TO SMALLTALK PROGRAM VALIDATION

Verification and Layered Systems

Formal verification is (arguably) the only way to ensure a program is correct, but it is unworkable except in simple cases, and practically useless when we depend on unverifiable software from third parties such as MS Windows or OSF/Motif!

Even if we could formally verify our program, it may still crash or behave incorrectly because of a bug in some third-party API we call. So verification alone, even if it were practical, is not sufficient: to ensure the quality of our software system we must test both it and everything it depends on. Not only must we ensure our code is written correctly, we must also avoid writing it in such a way that it exposes problems in the underlying third-party APIs that we call.

On the other hand, our code may use the APIs incorrectly yet appear to work, at least initially. To isolate and correct these kinds of defects, resource monitors and debug versions of the APIs are essential.
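
The sketch below shows the flavour of such a resource monitor. ResourceMonitor is a hypothetical class introduced only for illustration: debug wrappers around the third-party calls would tell it about every handle they allocate or free, and the test harness would ask it to report anything left outstanding at the end of a run.

    "Instance methods of the hypothetical ResourceMonitor; 'handles' is an
     instance variable initialized to an empty Set."

    noteAllocated: aHandle
        handles add: aHandle

    noteFreed: aHandle
        (handles includes: aHandle)
            ifTrue: [handles remove: aHandle]
            ifFalse: [Transcript show: 'freed but never allocated: ', aHandle printString; cr]

    reportLeaks
        handles isEmpty
            ifFalse: [Transcript show: handles size printString, ' handle(s) allocated but never freed'; cr]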

Code Inspection

Studies [Jones 81] have shown that informal verification techniques such as code inspection are far more effective at finding defects than testing is! Code inspection is a kind of code review in which the reviewer is deliberately suspicious and asks the author many questions about the logical operation of the program, and especially about the assumptions made concerning the other software components relied on, such as the MS Windows API. In our experience this kind of review is very effective both in finding defects and in allowing the reviewer and the author to learn from each other's code and expertise.

Double Checks

People make mistakes -- lots of them. We have found that a three-person approach works well during the last stages of a release, when introducing a new problem could mean missing the ship date: each change is made by one developer and then independently double-checked by two others before it is accepted.

Regression Testing

Regression Testing is the re-running of test cases to ensure that the software still passes all of the tests it used to pass. This serves to increase our confidence that we have not introduced any new defects, but does not prove it.

As part of the development process, we incorporate tests for new features into the standard regression test set. In addition, new tests are designed specifically to detect known problems that the regression tests did not catch.

Wherever possible, the regression test set is fully automated so that it can be run and rerun at any time; a seemingly innocuous change has the potential to introduce a new defect in what was thought to be an unrelated section of the code. The regression tests consist of a combination of black box and white box tests.
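
As a concrete illustration (a minimal sketch only, not the harness actually used for ENVY/Smalltalk), an automated regression run can be as simple as a named collection of test blocks, each answering true on success, with the failures logged at the end:

    | tests failures |
    tests := Dictionary new.
    tests at: 'new collection is empty'
          put: [OrderedCollection new isEmpty].
    tests at: 'add then removeFirst'
          put: [| c |
                c := OrderedCollection new.
                c add: 3.
                c removeFirst = 3 and: [c isEmpty]].
    failures := OrderedCollection new.
    tests keysAndValuesDo: [:name :test |
        test value ifFalse: [failures add: name]].
    Transcript show: failures size printString, ' of ',
        tests size printString, ' tests failed'; cr.
    failures do: [:name | Transcript show: '  failed: ', name; cr].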

Black Box Tests: Written according to the specification, without knowledge of the implementation. They test the externally visible interface and do not need to be recoded when the implementation changes, as long as the specification stays the same. Black box tests also have the advantage of being portable across different implementations of the same API on different platforms.
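
For example, a black box test of a dictionary API (sketched here against the standard Dictionary class) relies only on behaviour promised by the specification -- a stored value can be retrieved, and a missing key answers the absent-value block -- and never looks at the hash table inside:

    | d passed |
    d := Dictionary new.
    d at: #colour put: 'blue'.
    passed := (d at: #colour) = 'blue'
        and: [(d at: #size ifAbsent: ['none']) = 'none'].
    Transcript show: 'black box Dictionary test: ',
        (passed ifTrue: ['pass'] ifFalse: ['fail']); cr.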

White Box Tests: Written using 'insider' knowledge of the implementation in order to test known special cases, boundary cases, code paths, and limitations. They are deviously designed to force the implementation to fail while still using it in accordance with the specification. White box tests may or may not be fully portable, because they are often permitted to call private methods in order to examine implementation state.
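
A white box counterpart (again only a sketch) exploits the implementation knowledge that OrderedCollection keeps its elements in an internal array that must grow as elements are added: it deliberately starts with a tiny capacity and adds enough elements to force repeated growing, then checks that nothing was lost or reordered:

    | c passed |
    c := OrderedCollection new: 4.        "deliberately small initial capacity"
    1 to: 1000 do: [:i | c addLast: i].   "forces the internal array to grow several times"
    passed := c size = 1000.
    1 to: 1000 do: [:i | (c at: i) = i ifFalse: [passed := false]].
    Transcript show: 'white box OrderedCollection test: ',
        (passed ifTrue: ['pass'] ifFalse: ['fail']); cr.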

We have found that the best arrangement for test development is to pair up the test case developer with the API developer in complementary but somewhat adversarial roles. Both are interested in ensuring the quality of the software, but they approach the problem from different angles. The test developer keeps the API developer "honest" by trying to find as many problems as possible.

Stability Testing

Regression tests should be repeatable and should not include random elements, because by their very nature they are designed to ensure the same behavior every time. Stability tests, however, are often useful for increasing confidence in the robustness of the software. They complement regression tests by exercising the software heavily, often in a pseudo-random manner, whereas regression tests are algorithmic and assume that if a test passes once, it will pass again. Of course, if a stability test fails, it can be very difficult to determine what happened and why. Typically, stability tests run for a very long time and push the hardware and underlying software very hard. Interestingly, these kinds of tests often uncover leaks or memory management problems in the underlying operating systems and user interface libraries.
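
The sketch below shows the flavour of such a test. It drives a Dictionary with pseudo-random adds and removes for a very large number of iterations, checking a simple invariant every time; Random is assumed to answer floats between 0 and 1 via #next, and the seeding needed to replay a failing run varies by dialect, so it is omitted here.

    | gen d keys |
    gen := Random new.
    d := Dictionary new.
    keys := Set new.
    1 to: 1000000 do: [:i |
        | k |
        k := (gen next * 500) truncated.            "pick a pseudo-random key"
        gen next < 0.5
            ifTrue: [d at: k put: i. keys add: k]
            ifFalse: [d removeKey: k ifAbsent: []. keys remove: k ifAbsent: []].
        d size = keys size                          "invariant: the mirror set tracks the keys"
            ifFalse: [self error: 'invariant broken at iteration ', i printString]].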

TESTING AND DEFECT CONVERGENCE

The fact that Smalltalk can make a software team highly productive can also be the team's downfall. Sure, maybe you can develop a system five times faster than in C, but you can also get into trouble five times faster. This is compounded by the fact that traditionally many Smalltalk programmers have been cavalier and undisciplined. [Barry 95]

One of the problems that can befall a software team is that the team keeps fixing defects, and the fixes introduce new problems that previously did not exist. In many cases, a developer may fix a low severity problem that would not have prevented the software from shipping, only to introduce a high severity problem that delays the ship date.

We have evolved a release process that we have found to be very effective in achieving "convergence" on schedule. Three freeze-and-test cycles are done, each of which consists of the following steps:

  1. Freeze all code.
  2. Run all regression and stability tests and log all problems.
  3. Entire team meets in a "war meeting" to review all outstanding problems and decide the fix/do-not-fix list.
  4. Developers fix all agreed-upon problems.
  5. Repeat the cycle.

At each successive cycle the severity threshold that a problem must meet in order to make the fix list is raised, until at the last cycle only stop-ship problems are fixed. In addition, during the last cycle, the three-person double check described earlier is used on every change. The combination of these approaches ensures that we are focusing on the most important problems at each stage, and it also reduces risk by eliminating non-critical code changes.

CONCLUSIONS

Although Smalltalk provides the potential for dramatic gains in productivity, these benefits cannot be sustained unless testing is made an integral part of the software development process. Ironically, without a disciplined approach to testing and release engineering, Smalltalk's ability to facilitate rapid application development may instead turn out to be a major source of risk.

REFERENCES

[Barry 95] Barry, B., Personal Communication, 1995

[Jones 81] Jones, T.C., Programming Productivity: Issues for the Eighties (IEEE Catalog No. EHO 186-7), 1981

[Liskov 86] Liskov, B., Guttag, J., Abstraction and Specification in Program Development, McGraw-Hill, 1986