
   A Safe, Efficient Regression Test Selection Technique1
   
   
   Abstract
   Regression testing is an expensive but necessary maintenance activity
   performed on modified software to provide confidence that changes are
   correct and do not adversely affect other portions of the software. A
   regression test selection technique chooses, from an existing test
   set, tests that are deemed necessary to validate modified software.
   Most regression test selection techniques depend on a particular test
   adequacy criterion or require prior knowledge of where code has been
   modified. We present a new technique for regression test selection
   that is neither adequacy-based, nor requires prior knowledge of
   modifications. Our algorithms construct control flow graphs for a
   procedure or program and its modified version, and use these graphs to
   select tests, from the original test set, that execute changed code.
   We prove that under certain conditions, the set of tests our
   algorithms select includes every test, from the original test suite,
   that can expose faults in the modified procedure or program. Thus,
   under these conditions, the algorithms are safe. Moreover, although
   our algorithms may select some tests that cannot expose faults, they
   are at least as precise as other safe regression test selection
   algorithms. Unlike many other regression test selection algorithms,
   our algorithms handle all language constructs, and all types of
   program modifications. We have implemented our algorithms; initial
   empirical studies indicate that our technique can significantly reduce
   the cost of regression testing modified software.
   
   1 Introduction
   
   Software maintenance activities may account for as much as two-thirds
   of the overall cost of software production[38, 43]. One necessary
   maintenance activity, regression testing, is performed on modified
   software to provide confidence that the software behaves correctly,
   and that modifications have not adversely impacted the software's
   quality. Regression testing is expensive; it can account for as much
   as one-half of the cost of software maintenance[30].
   
   An important difference between regression testing and development
   testing is that during regression testing, we may have an established
   suite of tests available for reuse. One regression testing strategy
   reruns all such tests, but this retest all approach may consume
   inordinate time and resources. Selective retest techniques, in
   contrast, attempt to reduce the time required to retest a modified
   program by selectively reusing tests, and selectively retesting the
   modified program.
   
   Although many selective retest techniques have been proposed, most of
   these techniques emphasize the use of structural coverage criteria[4,
   7, 9, 13, 14, 15, 18, 19, 20, 21, 23, 22, 29, 37, 44, 45, 54]. These
   1A preliminary version of this work appeared in Reference [40]. This
   work was partially supported by a grant from Microsoft, Inc., and by
   NSF under Grant CCR-9357811 to Clemson University and Ohio State
   University.
   
   techniques identify changed or affected program components, such as
   statements or paths, and attempt to select tests that exercise these
   components. Coverage criteria are useful because they offer a way to
   measure test adequacy. However, another goal of regression testing is
   to provide confidence that program functionality has not been
   adversely affected by modifications. In practice, most test suites
   contain tests designed to exercise functionality, rather than tests
   designed to cover components; this is particularly true for test
   suites designed for integration and system testing. Where selection of
   these tests from an existing test suite is concerned, we are often
   more concerned with finding tests that reveal faults in the modified
   program, than we are with finding tests that cover modified
   components. The coverage-based selective retest techniques cited above
   can omit such \fault-revealing" tests in significant situations[42].
   
   Some selective retest techniques place less emphasis on coverage
   criteria[1, 11, 28, 32, 31, 48, 49]. One drawback of these techniques
   is that they require prior knowledge of where the modified program has
   been changed. This knowledge must be obtained either by a programming
   environment that tracks changes as they are made, or by algorithms[28,
   52] that determine corresponding program components. The assumption of
   a programming environment that tracks changes may be unreasonable
   given current practice. The algorithms for determining corresponding
   components may be unnecessarily costly for this application.
   
   Selective retest techniques address two problems: (1) the problem of
   selecting tests from an existing test suite, and (2) the problem of
   determining where additional tests may be required. Both of these
   problems are important; however, in this paper, we address the first
   problem: the regression test selection problem. In this paper, we
   present a new technique for regression test selection that is neither
   coverage-criteria based nor requires complete information on
   corresponding program components. Our algorithms construct control
   flow graphs for a procedure or program and its modified version, and
   use these graphs to select tests from the original test suite that
   execute changed code.
   
   Our technique has several advantages over other regression test
   selection techniques. Unlike many techniques, our algorithms select
   tests that may now execute new or modified statements, and tests that
   formerly executed statements that have been deleted from the original
   program. We prove that under certain conditions, our technique selects
   every test, from the original test suite, that can expose faults in
   the modified program. Thus, under those conditions, our algorithms are
   safe. Moreover, although our algorithms may select some tests that
   cannot reveal faults, they are more precise than other safe
   algorithms, because they select fewer tests that cannot execute
   changed code. Our algorithms are also more efficient than most
   existing algorithms. The algorithms are easy to implement, and they
   automate an important portion of the regression testing process.
   Finally, our algorithms are more general than most previous
   techniques. They handle both interprocedural and intraprocedural
   regression test selection. Also, they handle all language constructs
   and all types of program modifications for procedural languages.
   
   We have implemented our algorithms, and conducted empirical studies on
   several subject programs and modified versions. Our results suggest
   that in practice, our algorithms can significantly reduce the cost of
   regression testing a modified program.
   In the next section, we provide background information. We present our
   test selection algorithms in Section 3, and we analyze and evaluate
   our algorithms in Section 4. Section 5 presents our conclusions.
   
   2 Background
   
   We adopt the following notation throughout the rest of this paper. Let
   P be a procedure or program, let P be a modified version of P , and
   let S and S0 be the specifications for P and P , respectively. P (i)
   refers to
   the output of P on input i, P (i) refers to the output of P on input
   i, S(i) refers to the specified output for
   P on input i, and S0(i) refers to the specified output for P on input
   i. Let T be a set of tests (a test suite)
   created to test P . A test is a 3-tuple, <identifier, input, output>,
   in which identifier identifies the test, input is the input for that
   execution of the program, and output is the specified output,
   S(input), for this input. For simplicity, in the sequel we refer to a
   test ht; i; S(i)i by its identifier t, and refer to the outputs P (i)
   and S(i) of test t for input i as P (t) and S(t), respectively.
   We assume that all statements in P are statically reachable. Because
   it is possible to eliminate code that is not statically reachable from
   a program, this is not an unreasonable restriction.
   
   2.1 Control Flow Graphs
   
   A control flow graph (CFG) for procedure P contains a node for each
   simple or conditional statement in P ; edges between nodes represent
   the flow of control between statements. Figure 1 shows procedure avg
   and its CFG. In the figure, statement nodes, shown as ellipses,
   represent simple statements. Predicate nodes, shown as squares, stand
   for conditional statements. Labeled edges leaving predicate nodes are
   called branches { they represent control paths taken when the
   predicate evaluates to the value of the edge label. Statement and
   predicate nodes are labeled to indicate the statements in P to which
   they correspond. In the figure, for simplicity, we use statement
   numbers as labels, but we could instead use the actual code of the
   associated statements as labels. A unique entry node and a unique exit
   node represent entry to and exit from P . The CFG for a procedure P
   has size, and can be constructed in time, linear in the number of
   simple and conditional statements in P [2].
   
   To facilitate the presentation and discussion of our algorithms, we
   think of unlabeled CFG edges as having the label \ffl"; we also assume
   initially that case statements are rendered as nested if-else
   statements. Given these assumptions, every CFG node has either one
   out-edge labeled \ffl", or two out-edges, labeled \T" and \F",
   respectively. Assume further that we associate all declaration and
   nonexecutable initialization statements that pertain to a procedure
   with a declaration node that we insert into the CFG as the immediate
   successor of the entry node.2 We discuss other methods for handling
   case statements, declarations, and other types of nonexecutable
   initialization statements in Section 3.1.4.
   
   Code Instrumentation
   
   We can instrument a program P , such that when we execute the
   instrumented version of P with test t, the program reports the
   branches taken in the program for t. Using this branch trace
   information, we can easily determine, for t, which edges in the CFG G
   for P were traversed when t was executed.3 We call the 2For simplicity
   we omit these nodes from our figures.
   3We say that an edge (n1; n2) in G was traversed by test t if and only
   if, when t was run on P , the statements associated with n1 and n2
   were executed sequentially at least once during the execution.
   
  
   information thus gathered an edge trace for t on P . An edge trace for
   t on P has size linear in the number of edges in the CFG for P ; we
   can use a bit vector to represent such a trace.
   
   Given program P and test suite T , we construct a test history for P
   with respect to T by gathering edge trace information for each test in
   T , and representing it such that for each edge (n1; n2) in the CFG
   for P, we know which tests traversed that edge. This representation
   requires O(ejT j) bits, where e is the number of edges in G, and jT j
   is the number of tests in T . For CFGs of the form we employ, e is no
   greater than twice the number of nodes in G; thus, e is linear in the
   size of P . Figure 2 gives a table that reports test information,
   along with the corresponding test history, for program avg of Figure
   1.
   
   For convenience, in the remainder of this paper we assume the
   existence of a function, TestsOnEdge(n1; n2), that returns a bit
   vector v of size jT j bits, such that the ith bit in v is set if and
   only if test i in T traversed edge (n1; n2) in G.
   
   2.2 Regression Testing
   
   Research on regression testing spans a wide variety of topics,
   including test environments and automation[10, 12, 25, 24, 55],
   capture-playback mechanisms[34], test suite management[16, 23, 34, 45,
   51], program size reduction[8], and regression testability[30]. Most
   recent research on regression testing, however, concerns selective
   retest techniques.
   
   Selective retest techniques reduce the cost of regression testing by
   reusing existing tests, and identifying portions of the modified
   program or its specification that should be tested. Selective retest
   techniques differ from the retest-all technique, which runs all tests
   in the existing test suite. Leung and White[33] show that a selective
   retest technique is more economical than the retest-all technique if
   the cost of selecting a reduced subset of tests to run is less than the cost of running the tests that
   the selective retest technique lets us omit. A typical selective
   retest technique proceeds as follows:
   1. Select T ? T , a set of tests to execute on P .
   2. Test P with T , establishing P 's correctness with respect to T .
   3. If necessary, create T 00, a set of new functional or structural
   tests for P .
   4. Test P with T 00, establishing P 's correctness with respect to T
   00.
   5. Create T 000, a new test suite and test history for P , from T , T
   , and T 00.
   
   In performing these steps, a selective retest technique addresses
   several problems. Step 1 involves the regression test selection
   problem: the problem of selecting a subset T of T with which to test P
   . This
   problem includes the subproblem of identifying tests in T that are
   obsolete for P . Test t is obsolete for
   program P if and only if t specifies an input to P that, according to
   S0, is invalid for P , or t specifies an
   invalid input-output relation for P . Step 3 addresses the coverage
   identification problem: the problem of
   identifying portions of P or S0 that require additional testing. Steps
   2 and 4 address the test suite execution
   problem: the problem of efficiently executing tests and checking test
   results for correctness. Step 5 addresses the test suite maintenance
   problem: the problem of updating and storing test information.
   Although each of these problems is significant, we restrict our
   attention to the regression test selection problem. We further restrict our attention to code-based regression test selection
   techniques, which rely on analysis of P and P to select tests.4
   
   We distinguish two phases of regression testing: a preliminary phase
   and a critical phase. The preliminary phase of regression testing
   begins after the release of some version of the software; during this
   phase, 4For examples of specification-based techniques, see References
   [32] and [46].
   
   programmers enhance and correct the software. When corrections are
   complete, the critical phase of regression testing begins; during this
   phase regression testing is the dominating activity, and its time is
   limited by the deadline for product release. It is in the critical
   phase that cost minimization is most important for regression testing.
   Regression test selection techniques can exploit these phases. For
   example, a technique that requires test history and program analysis
   information during the critical phase can achieve a lower critical
   phase cost by gathering that information during the preliminary
   phase.5
   
   3 Regression test selection algorithms
   
   For reasons that shall become clear, our goal is to identify
   nonobsolete tests in T that execute changed code with respect to P and
   P . In other words, we wish to identify tests in T that (1) execute
   code that is new
   or modified for P , or (2) used to execute code in P that is no longer
   present in P . We call such tests
   \modification-traversing for P and P ". To capture the notion of these
   tests more formally, we define an
   execution trace ET (P (t)) for t on P to consist of the sequence of
   statements in P that are executed when P is tested with t. We say that
   two execution traces ET (P (t)) and ET (P (t)) are equivalent if they
   have
   different lengths, or if, when we compare their elements from first to
   last, we find some pair of elements that are lexically nonidentical.
   We say that t is modification-traversing for P and P (or for
   simplicity, that t is
   modification-traversing) if and only if ET (P (i)) is not equivalent
   to ET (P (i)).
   To identify the modification-traversing tests in T we must identify
   nonobsolete tests in T that have different execution traces in P and P
   . Assume henceforth that T contains no obsolete tests, either because
   it contained none initially, or because we have removed them.6 In
   addition, assume that for each test t 2 T , P terminated and produced
   its specified output when we executed it with t. Our task is to
   partition T as nearly as possible into tests that have different
   execution traces in P and P , and tests that do not.
   We next present an algorithm that identifies modification-traversing
   tests for intraprocedural test selection. Section 3.2 presents our
   interprocedural test selection algorithm.
   
   3.1 Intraprocedural test selection
   
   We begin the presentation of our intraprocedural algorithm by
   informally motivating our approach.
   
   Given an execution trace, there is a unique mapping between that trace
   and the nodes of a CFG. We obtain this mapping by replacing each
   statement in the execution trace by its corresponding CFG node, or
   equivalently, by the label of that node. We call such a mapping a
   traversal trace and say that for test t with execution trace ET (P
   (t)), the traversal trace for t on G is TR(P (t)). If N is a node in
   traversal trace TR(P (t)), we define the traversal trace prefix for
   TR(P (t)) with respect to N to be the portion of TR(P (t)) beginning
   with the first node in the trace and ending at N .
   5There are various ways in which this two-phase process may fit into
   the overall software maintenance process. A big bang process performs
   all modifications, and when these are complete, turns to regression
   testing. An incremental process performs regression testing at
   intervals throughout the maintenance life cycle, with each testing
   session aimed at the product in its current state of evolution.
   Preliminary phases are typically shorter for the incremental model
   than with the big bang model; however, for both models, both phases
   exist, and can be exploited.
   6If we cannot effectively determine test obsolesence, we cannot
   effectively judge test correctness. Thus, this assumption is necessary
   if we intend to reuse tests at all, whether selectively or not.
   
   Given two execution traces, we can perform a pairwise comparison of
   the traces by comparing the first statements in each trace to each
   other, then comparing the second statements in each trace to each
   other, and so forth. If a test t has nonequivalent execution traces in
   P and P , a pairwise comparison of the
   execution traces for t in P and P reaches a first pair of statements
   that are lexically different. Assume
   that CFG nodes contain, as labels, the text of the statements to which
   they correspond. In this case, we can perform a pairwise comparison of
   the nodes in two traversal traces by comparing node labels. If t has
   different traversal traces in P and P we shall reach a first pair of
   nodes that have lexically different labels.
   
   Suppose t has nonequivalent execution traces ET (P (t)) and ET (P (t))
   in P and P , and let N and N be the first pair of nodes found to
   differ during a pairwise comparison of the corresponding traversal
   traces, TR(P (t)) and TR(P (t)). The traversal trace prefixes of TR(P
   (t)) and TR(P (t)) with respect to N and N ,
   respectively, are equivalent up to, but not including, N and N . In
   other words, if t is modification-traversing
   
   for P and P , there is some pair of nodes N and N in G and G0, such
   that N and N differ, and N and N are endpoints of traversal trace
   prefixes that are identical up to, but not including, N and N . To
   find tests
   that are modification-traversing for P and P , we can synchronously
   traverse CFG paths that begin with
   the entry nodes of G and G0, looking for pairs of nodes N and N that
   differ. When we find such a pair, we
   use TestsOnEdge to select all tests known to have reached N .
   
   On first view, it may appear expensive to compare traversal traces,
   because such traces can be lengthy { particularly when procedures
   contain loops. However, three facts combine to render the approach
   efficient. First, when we walk a pair of traversal trace prefixes p
   and p0 in G and G0, respectively, reach a pair of nodes N and N that
   differ, and select the tests that reach N , we need not walk beyond N
   and N : we
   have selected all tests that have traversal trace prefixes p and p0
   and reach additional modifications beyond N and N . Tests that reach
   modifications accessible along other traversal trace prefixes are
   selected when we traverse those other prefixes. Second, when we reach,
   in our walk, a pair of nodes N and N that have
   already been pairwise compared, we do not need to compare N and N
   again. Third, when we attach test
   information to edges in G, a single walk of G simultaneously accounts
   for all tests attached to each edge as it crosses the edge; we do not
   require a separate graph walk for each test in T . For these reasons
   the graph walks required to select modification-traversing tests on
   CFGs require worst-case time quadratic in the number of nodes in the
   graphs.
   
   3.1.1 The basic test selection algorithm
   
   Figure 3 presents SelectTests, our intraprocedural regression test
   selection algorithm. SelectTests takes a procedure P , its modified
   version P , and the test suite T for P , and returns T , a set that
   contains tests
   that are modification-traversing for P and P . SelectTests first
   initializes T to OE, and then constructs
   CFGs G (with entry node E) and G0 (with entry node E0) for P and P ,
   respectively. Next, the algorithm
   
   calls Compare with E and E0. Compare ultimately places tests that are
   modification-traversing for P and P into T . SelectTests returns these
   tests.7
   
   7An earlier version of this algorithm[40] was based on control
   dependence graphs for P and P . The possibility of performing
   that algorithm on CFGs was suggested by Weibao Wu (personal
   communication). Both approaches select safe test sets, but the
   CFG-based approach is more efficient, and easier to implement, than
   the earlier approach.
   
  
   
   Compare is a recursive procedure. Compare is called with pairs of
   nodes N and N , from G and G0,
   respectively, that are reached simultaneously during our algorithm's
   synchronous walks of traversal trace prefixes. Given two such nodes N
   and N , Compare determines whether N and N have successors that
   are syntactically nonequivalent along pairs of identically labeled
   edges. If N and N have syntactically
   nonequivalent successors along some pair of identically labeled edges,
   tests that traverse the edges are modification-traversing due to
   changes in the code associated with those successors. In this case,
   Compare selects those tests. If N and N have syntactically equivalent
   successors along a pair of identically labeled
   edges, Compare walks forward along the edges in G and G0, by invoking
   itself on those successors. Lines 7-20 of Figure 3 describe Compare's
   actions more precisely. When Compare is called with CFG nodes N and N
   , Compare first marks node N \N -visited" (line 8). After Compare has
   been called once
   with N and N , it does not need to consider them again { this marking
   step lets Compare avoid revisiting a pair of nodes. Next, in the for
   loop of lines 9-19, Compare considers each control flow successor of N
   . For each successor C, Compare locates the label L on the edge from N
   to C, then seeks the node C in C such
   that (N ,C ) has label L. (Such a C necessarily exists, because
   Compare is called only with pairs of nodes
   N and N that are syntactically equivalent; thus N and N are either
   both predicate nodes with out-edges
   labeled \T" and \F", or N and N are both statement nodes with
   out-edges labeled \ffl".) Next, Compare
   considers C and C . If C is marked \C -visited", Compare has already
   been called with C and C , so Compare
   does not take any action with C and C . If C is not marked \C
   -visited", Compare calls Equivalent with C
   
  
   and C . If Equivalent(C,C ) is false, then tests that traverse edge
   (N; C) are modification-traversing for P
   and P , so Compare uses TestsOnEdge to identify these tests, and adds
   them to T . If Equivalent(C,C ) is
   true, Compare invokes itself on C and C , to continue the graph
   traversals beyond these nodes.
   The Equivalent function takes a pair of nodes N and N , and determines
   whether the statements S
   and S0 associated with N and N may be semantically nonequivalent.8 A
   simple version of Equivalent
   answers conservatively the question of whether S and S0 are
   semantically nonequivalent, by comparing the two statements
   lexicographically, ignoring extra white-space characters when not
   included in character constants. We discuss more precise, albeit more
   complex, versions of Equivalent in Section 3.1.4.
   
   We next consider several examples that illustrate how SelectTests
   works. Figure 4 presents procedure avg2 and the CFG for avg2 { avg2 is
   a modified version of procedure avg, shown in Figure 1. In avg2,
   statement S7 has erroneously been deleted, and statement S5a has been
   added. When called with avg and avg2, and with test suite T (shown in
   Figure 2), SelectTests initializes T to OE, constructs the CFGs for
   the two procedures, and calls Compare with entry and entry0. Compare
   marks entry \entry0-visited", and then considers the successor of
   entry, S1. Compare finds that S10 is the corresponding successor of
   entry0, and because S1 is not marked \S10-visited", calls Equivalent
   with S1 and S10. Because S1 and S10 are syntactically equivalent,
   Equivalent returns true, and Compare invokes itself on S1 and S10
   (invocation 2). Recursive calls continue in this manner on nodes S2
   and S20 (invocation 3), and P3 and P30 (invocation 4); in each case
   the successors of the nodes are equivalent. On invocation 4, Compare
   must consider two successors of P 3: P4 and S9. When Compare considers
   S9, it calls itself with S10 and S100, and subsequently with exit and
   exit0, and selects no tests. When Compare considers P4 and P40, it
   first seeks a true child 8The phrase \may be" is important here,
   because the problem of determining semantic equivalence of statements
   is, in general, undecidable[26].
   
   of P40 to compare with S5; it finds S5a and calls Equivalent with S5
   and S5a. The statement associated with S5 and the statement associated
   with S5a are not syntactically equivalent, so Equivalent returns
   false; Compare uses TestsOnEdge(P4,S5) to locate the set of tests
   (ft2g) that reach S5 in avg and adds these tests to T . When Compare
   seeks a false successor of P40, it finds S60, and calls Equivalent
   with S6 and S60; Equivalent returns true for these nodes, so Compare
   invokes itself on S6 and S60. Compare finds the successors of these
   nodes, S7 and S80, nonequivalent, and adds to T the set of tests that
   traverse edge
   (S7,S8); T is now ft2,t3g. At this point the algorithm has walked all
   traversal trace prefixes either up to modifications or up to the exit
   node, so no further traversal is necessary; recursive Compare calls
   return to the main program, and the algorithm returns set T = ft2,t3g,
   in which both tests are modification-traversing.
   Many other regression test selection techniques[4, 7, 14, 18, 20, 23,
   37, 45] omit t2 or t3.
   
   If, for avg and avg2, the deletion of S7 had been the only change,
   SelectTests would have returned only ft3g. If the addition of S5a had
   been the only change, SelectTests would have returned only ft2g. In
   this latter case, observe that Compare would eventually invoke itself
   with S8 and S80, but find the successor of S8, P 3, already marked
   \P30-visited"; thus, Compare would not reinvoke itself with P3 and
   P30.
   
   To see how SelectTests handles changes in predicate statements,
   consider the result when line P4 in procedure avg is also changed
   (erroneously), to \n>0". This change alters only the text associated
   with node P40 in avg2's CFG. In this case, called with the CFGs for
   avg and avg2, SelectTests proceeds as in the previous example until it
   reaches P3 and P30. Here it finds nonequivalent successors P4 and P40,
   and selects ft2,t3g. The procedure does not need to analyze successors
   of P4 and P40.
   
   To see how SelectTests handles large-scale structural changes,
   consider the result when new error handling code is inserted into avg,
   such that the procedure checks fileptr as its first action, and
   executes the rest of its statements only if fileptr is not NULL.
   SelectTests detects this change on the initial invocation of Compare,
   when Compare detects the differences between successors of entry and
   entry0. The procedure does not need to analyze successors of entry and
   entry0: it returns the entire test set T because all tests in T are
   modification-traversing.
   To understand why, at line 12, SelectTests marks C \C -visited" rather
   than just \visited", consider
   Figure 5. The figure contains procedure twovisits (far left), a
   modified version of that procedure, twovisits0 (far right), and the
   CFGs for the two procedures (next to their respective procedures). The
   two versions produce identical output for all values of x other than
   \0". When x = 0, the execution trace for twovisits is h entry, P 1,
   S2, S3, S4, exit i, and the procedure prints \1". For the same input,
   the execution trace for twovisits0 is h entry0, P10, S20, S50, exit0
   i, and the procedure prints \2".
   
   When SelectTests runs on the CFGs for twovisits and twovisits0, it
   considers entry and entry0
   
   first, and then invokes itself on P1 and P10. Suppose SelectTests next
   invokes itself on S3 and S30 (we have not required the algorithm to
   visit the successors of a pair of nodes in any particular order; if we
   had, we could reverse the contents of the if and else clauses in this
   example and still make the point that we are about to make).
   SelectTests marks S3 \S30-visited", then continues with invocations on
   S4 and S40, and exit and exit0, selecting no tests, because tests that
   take these paths in the two versions are non-modification-traversing.
   Now, SelectTests resumes its consideration of successors of P1 and P10
   and
   
   invokes itself on S2 and S20. SelectTests marks S2 \S20-visited" and
   considers the successors of S2. S2 has only one successor, S3; the
   corresponding successor of S20 is S50. Here is the point we wish to
   make. If on visiting S3 SelectTests had simply marked S3 \visited",
   and now seeing that S3 had been visited declined to visit it again,
   SelectTests would not compare S3 and S50 and would not detect the need
   to select tests through S3. However, as the algorithm is written,
   SelectTests sees that S3 is only marked S30-visited, not S50-visited,
   and thus, proceeds to check for equivalence of S3 and S50, and select
   the necessary tests. In certain cases, it is possible for SelectTests
   to select tests that are not modification-traversing for P and P .9
   Figure 6, which illustrates this possibility, depicts a C function,
   pathological, and a modified version of that function, pathological0,
   with the CFGs for the versions. Each while construct in the versions
   first increments the value of x, and then tests the incremented value,
   to determine whether to enter or exit its loop. Suppose test suite T
   for pathological contains tests t1 and t2 that use input values \0"
   and \?2", respectively, to exercise the function. Figure 7 shows the
   inputs, outputs, and traversal traces that result when we run
   pathological and pathological0 on these tests. When we run
   pathological on tests t1 and t2, it outputs \1" for both. When we run
   pathological0 on test t1, it also outputs \1"; however, when we run
   pathological0 on test t2, it outputs \3". Tests t1 and t2 both
   traverse edge (P1; S4) in the CFG for pathological.
   
   Consider the actions of SelectTests, invoked on pathological and
   pathological0. Called with entry and entry0, Compare invokes itself
   with P1 and P10, then with P2 and P20. When invoked with P2 and P20
   Compare considers their successors, P1 and P30, respectively, finds
   them lexically identical, and invokes itself with them. Compare finds
   their successors, S4 and P40, nonequivalent, and selects the tests on
   edge (P1; S4), that is, tests t1 and t2.
   
   The problem with this test selection is that, whereas t2 is
   modification-traversing for the two versions of pathological, t1 is
   not modification-traversing for the two versions. Figure 7 shows the
   traversal traces, 9This is not surprising, because the problem of
   precisely identifying these tests in general is PSPACE-hard; thus
   unless P=NP, no efficient algorithm will always identify precisely the
   tests that are modification-traversing for P and P [39].
   
  
   
   which correspond to the execution traces, for the tests on the two
   versions. Test t1 has equivalent execution traces for the two
   versions, whereas test t2 does not: the traces for t2 differ in their
   fifth terms. Thus, t1 is not modification-traversing for the two
   versions, and SelectTests chooses it unnecessarily.
   
   3.1.2 Efficiency, effectiveness, and correctness of the algorithm
   
   In this section, we summarize results that are presented in detail in
   Reference [39].
   
   We define controlled regression testing 10 as the practice of testing
   P under conditions equivalent to those
   that were used to test P . As such, controlled regression testing
   applies scientific method to regression testing: to determine whether
   code modifications cause errors, we test the new code, holding all
   other factors that might affect program behavior constant.11
   
   For the purpose of regression test selection, we would like to
   identify all tests t 2 T that cause P to
   produce output that differs from its specified output and thus, reveal
   faults in P . We call such tests faultrevealing,
   and we call an algorithm that selects every fault-revealing test in T
   safe. There is no effective procedure that, in general, precisely
   identifies the fault-revealing tests in T . However, under controlled
   regression testing, the modification-traversing tests are a superset
   of the fault-revealing tests. Thus, for cases where controlled
   regression testing is possible, a regression test selection algorithm
   that selects all modification-traversing tests necessarily selects all
   fault-revealing tests, and is safe. This result is significant,
   because we can show that for cases where controlled regression testing
   is practiced, if t 2 T is modificationtraversing, then SelectTests
   selects t. Thus, the following theorem holds:
   
   Theorem 1: SelectTests is safe for controlled regression testing.
   Proof: See Reference [39], pages 77-81.
   Theorem 1 is important, because although we cannot find an algorithm
   that is safe in general, we know from the theorem that SelectTests is
   safe in situations in which controlled regression testing is possible.
   
   Theorem 1 lets us draw conclusions about the ability of our technique
   to select tests that may reveal faults. However, we are also
   interested in how well our technique does at omitting tests that
   cannot reveal faults. For controlled regression testing, tests that
   are non-modification-traversing cannot be fault-revealing; ideally, we
   would like to omit all such . In the previous section, however, we
   showed that there are cases in which SelectTests selects tests that
   are non-modification-traversing. The theorem identifies a necessary
   condition for SelectTests to select such tests: the
   multiply-visited-node condition. The multiply-visitednode condition
   holds for P and P if and only if SelectTests, run on P and P , marks
   some node in P
   \X-visited" for more than one node X in the CFG for P. Our next
   theorem is as follows: Theorem 2: Given procedure P , modified version
   P and test suite T for P , if the multiplyvisited-node
   condition does not hold for P and P , and SelectTests selects test set
   T for P ,
   then every test in T is modification-traversing for P and P .
   Proof: see Reference [39], pages 82-85.
   Theorem 2 gives us a way to characterize the class of programs and
   modified versions for which SelectTests selects
   non-modification-traversing tests. The theorem is significant for two
   reasons. First, procedures like pathological and pathological0, which
   cause the multiply-visited-node condition to hold, are atypical. With
   some work, we can construct other examples that illustrate similar
   effects; however, all examples located to date have been contrived.
   Second, in our empirical studies using \real" programs, we have never
   found a case in which the multiply-visited-node condition holds. Thus,
   we conjecture that in practice, given P , P , and T , SelectTests
   selects exactly the tests in T that are modification-traversing for P
   and P .
   10In earlier work[39, 42] we used the phrase \proper regression
   testing" in this context; however, the term \proper" has a pejorative
   connotation that we do not intend to imply. Hence, in this work, we
   substitute the more neutral term \controlled". 11The importance of
   controlled regression testing is stated well by Beizer[5], who writes:
   \It must be possible to precisely recreate the entire test situation
   or else it may be impossible to resolve some of the nastiest
   configuration dependent bugs that show up in the field."
   
   Our final theorem is as follows:
   
   Theorem 3: SelectTests terminates.
   Proof: The proof proceeds by showing that (1) the number of recursive
   calls made to Compare is bounded, and (2) the work required by a call
   to Compare is bounded. See Reference [39], page 81, for details.
   
   3.1.3 Complexity of the algorithm
   
   The running time of SelectTests is bounded by the time required to
   construct CFGs for P and P , plus
   the number and cost of calls to Compare. Let n be the number of
   statements in P , and n0 the number of statements in P . CFG
   construction is an O(n) operation[2]. We obtain an upper bound on the
   number of calls to Compare by assuming that Compare can be called with
   each pair of nodes N and N in G and G0,
   respectively. Under this assumption, the overall cost of SelectTests
   is O(n + n0 +m(nn0)), where m is the cost of a call to Compare.
   
   Each call to Compare results in an examination of at most two edges
   (\T" and \F") at line 9, and thus, two calls to Equivalent (line 13).
   Depending on the results of the Equivalent operation, the call to
   Compare results in either a set union operation (line 14) or an
   examination of at most two successors of N (line 16). The set union
   task, implemented as a bit vector operation, has a worst-case cost
   proportional to the number of tests in T . The Equivalent procedure
   has a cost that is linear in the number of characters in the
   statements compared; for practical purposes this size is bounded by a
   constant (the maximum line length present in the procedure). Thus, m
   in the above equation is bounded by kjT j for some constant k. It
   follows that given a pair of procedures for which CFGs G and G0
   contain n and n0 nodes, respectively, and given a test suite of t
   tests, if Compare is called for each pair of nodes (N ,N ) (N 2 G and
   N 2 G0),
   the running time of SelectTests is O(tnn0).
   The assumption that Compare may be called for each pair of nodes N and
   N from G and G0 applies
   only to procedures P and P for which the multiply-visited-node
   condition holds. Program pathological
   of Figure 6 illustrates a case in which that condition holds: in that
   example, for graphs of 6 and 7 nodes, respectively, the algorithm
   makes 16 calls to Compare. When the multiply-visited-node condition
   does not hold, however, Compare is called at most minfn; n0g times.
   Thus, if our conjecture that the multiply-visitednode condition does
   not hold in practice is correct, then for practical purposes,
   SelectTests runs in time O(t(minfn; n0g)).
   
   3.1.4 Improvements to the basic algorithm
   
   In the preceding sections, to facilitate the presentation, we
   presented a deliberately simplified version of SelectTests. There are
   several ways in which to increase the efficiency or precision of the
   basic algorithm. We discuss three improvements here; Reference [39]
   discusses additional improvements.
   
   Determining statement nonequivalence more precisely
   
   The Compare procedure that we presented in Figure 3 relies on a
   function, Equivalent, that takes a pair of nodes N and N , and
   determines whether the statements S and S0 associated with N and N may
   be
   
   semantically nonequivalent. To retain safety, Equivalent must
   conservatively estimate whether pairs of statements are semantically
   nonequivalent, erring only on the side of claiming nonequivalence when
   the statements are in fact equivalent.
   
   A simple version of Equivalent performs a lexicographic comparison of
   S and S0, ignoring white-space characters except when they are part of
   character constants. This approach is safe, and easy to implement, but
   also imprecise for several obvious cases. Consider, for example, the
   statements \if (x<0)" and \if (0>x)". Clearly, the two statements are
   lexicographically nonidentical but semantically equivalent. The same
   applies to the statements \x=(n*n)-(2*n)+1" and \x=(n-1)*(n-1)", and
   to the statements \x=1; /* comment */" and \x=1; /* different comment
   */".
   
   A more sophisticated implementation of Equivalent can detect, for
   cases like these, whether statements are semantically nonequivalent.
   For example, Equivalent can easily ignore comments when it performs
   lexicographic comparisons of statements. Additionally, for certain
   classes of expressions such as assignment statements, Equivalent can
   determine equivalence using techniques such as polynomial
   interpolation or symbolic computation[50]. At further expense,
   Equivalent can use algorithms, such as those described in References
   [4, 8, 26, 53], that conservatively determine statement equivalence.
   
   Improving Equivalent increases the precision of our test selection
   algorithms, but also increases the cost of the algorithms. In
   practice, we expect that some improvements to Equivalent will produce
   precision improvements sufficient to justify this added cost, while
   others will not.
   
   Handling variable and type declarations
   
   A second test selection issue involves variable and type declarations.
   A change in a variable or type declaration may render a test
   fault-revealing, even though that test executes no changed program
   statements other than the declaration. For example, in a Fortran
   program, changing the type of a variable from \Real*16" to \Real*8"
   can cause the program to fail even in the absence of direct
   alterations to executable code. A simple way to handle variable and
   type declarations in the context of our intraprocedural test selection
   algorithm is to associate each declaration with a CFG node, and to
   attach every test that enters the procedure to the edge that enters
   that node. An analogous technique handles the interprocedural case. In
   this case, our algorithms detect syntactic differences between
   variable and type declarations, and flag tests that encounter these
   differences as modification-traversing.
   
   This approach has drawbacks. Just as a modified variable
   initialization statement may unnecessarily force selection of all
   tests through a procedure, so a new, modified, or deleted variable or
   type declaration may unnecessarily force selection of all tests. If a
   declaration of variable v changes, the only tests that can be
   fault-revealing due to such a change (for controlled regression
   testing) are tests that reach some statement, in the executable
   portion of the procedure or program, that contains a reference to the
   memory associated with v. If a declaration of type t changes, the only
   tests that can be fault-revealing due to such a change (for controlled
   regression testing) are tests that reach some statement in the
   executable portion of the procedure or program that contains a
   reference to the memory associated with some variable whose type is
   based on t. Our test selection algorithms can be modified to reduce
   imprecise test selection at declaration changes.
   
   One approach requires the algorithm to determine affected variables,
   which are variables whose declarations have changed or whose
   declarations are dependent on changed type definitions, and keep a
   list of these variables. The Equivalent procedure uses this list to
   detect occurrences of affected variables, and report statements that
   contain references to the memory locations associated with those
   variables as modified. The modified algorithm postpones test selection
   until it locates statements that contain affected references; it then
   selects only tests that reach those statements.
   
   An alternative representation for case statements
   
   Instead of representing case statements as a series of nested if-else
   statements, we can represent them as a set of choices incident on a
   single predicate node that has a labeled out-edge for each separate
   case. This representation can yield more precise test selection than
   the nested if-else representation. A principal difference in this
   representation is that for switch predicates, the set of labeled
   out-edges may vary from P to P if cases are added or removed. If a
   case is added to a switch, then all tests in T that took the default
   case edge in P can conceivably take the branch to the new case in P ,
   thus Compare looks for
   new edges, and when it detects them, it adds tests that formerly
   traversed the default edge to T . If a case
   is removed from a switch, Compare detects the missing labeled edge in
   G0, and selects all tests that took the edge in G. With minor
   modifications, our Compare routine copes with these differences, and
   handles the alternative representation of switches.
   
   3.2 Interprocedural test selection
   
   Thus far we have considered our technique in an intraprocedural
   context. Our examples demonstrate that the technique reduces the
   number of tests that must be rerun on a single procedure. However, the
   examples also suggest that when procedures are small and
   uncomplicated, and their test sets are small, it may be more cost
   effective to rerun all tests.
   
   This objection is mitigated when we consider interprocedural testing.
   Test sets for subsystems and complete programs are typically much
   larger than test sets for single procedures. In this context, the
   savings that can result from selective retest increases. Moreover, in
   the interprocedural context, the problem of regression test selection
   is even more differentiable from the problem of coverage
   identification than in the intraprocedural context. Structural
   coverage criteria, if employed at all, are typically applied only
   during intraprocedural testing. Although integration and system-level
   coverage criteria have been proposed[35], in practice, most system and
   integration testing is functional or specification-based. It is
   precisely with functional and specification-based tests that we are
   most interested in running all tests that may reveal faults, without
   regard to coverage considerations. It is precisely with functional and
   specification-based tests that we can benefit most from an automatable
   regression test selection technique that selects tests based on
   program structure, because we cannot in general determine tests that
   are not fault-revealing simply by referring to program specifications.
   Assume that we can obtain a mapping of procedure names in P to
   procedure names in P , that tells us, for
   each procedure P 2 P , the name P of its counterpart in P . A simple
   but naive approach to interprocedural regression test selection runs the intraprocedural SelectTests
   algorithm on every pair of procedures (P ; P )
   such that P is the counterpart of P . With this simple approach, if a
   procedure P 2 P is no longer present in P , tests that used to enter P
   are selected at former call sites to P . Similarly, if a procedure P
   is inserted
   into P , tests that enter P are selected at the call sites to P .
   Notice that the number of times P is called
   from within P is immaterial: trace information tells us precisely
   which tests reach which edges in P , and a single traversal of P and P
   suffices to find tests that are modification-traversing for P and P .
   If we run SelectTests on all pairs of corresponding procedures in P
   and P , we obtain a safe test set
   T , but we fail to take advantage of several opportunities for
   improvements in efficiency. To describe these opportunities, we refer
   to the CFGs for the procedures in a program sys, shown in Figure 8.
   Program sys contains four procedures: main is the entry point to the
   program, and A, B, and C are invoked when the program runs. Notice
   that A is recursive.
   
   Suppose statement S1 in sys is modified, creating a new version of
   sys, sys0. In this case, all tests in T are modification-traversing
   for sys and sys0, and are selected when SelectTests, called with main
   and main0, reaches S1 in main. There is no need in this case to
   compare procedures A, B, and C to their counterparts in sys0: we have
   already selected all tests that could reach those procedures and
   become modification-traversing within them. The naive algorithm does
   unnecessary work for this example.
   
   Alternatively, suppose statement S6 (and no other statement) in sys is
   modified. In this case, because every test of sys that enters main
   necessarily reaches the call to A in main, then reaches the call to C
   in A, every test of sys becomes modification-traversing in C and C0.
   There is no need in this case to compare B with its counterpart in
   sys0, and no need to traverse portions of the CFG for main beyond the
   call to A, or portions of the CFG for A beyond the call to C. Here,
   too, the naive algorithm does unnecessary work. These observations
   motivate an algorithm for interprocedural test selection by CFG
   traversal that begins
   
   by examining the entry procedures for P and P . When the algorithm
   reaches call nodes, it immediately
   enters the graphs for the called procedures if their entry nodes have
   not previously been visited. If the algorithm has previously completed
   its walk of a procedure, it uses the knowledge, gained during that
   walk, of whether tests that enter the procedure can exit it without
   becoming modification-traversing: the algorithm continues its graph
   traversal beyond call nodes if and only if tests can pass through
   called procedures without becoming modification-traversing.
   
   3.2.1 The basic interprocedural test selection algorithm
   
   Figure 9 gives our algorithm, SelectInterTests, for selecting tests
   for subsystems or programs. The algorithm uses procedures SelectTests2
   and Compare2, which are similar to the intraprocedural test selection
   procedures SelectTests and Compare, respectively. The algorithm also
   keeps data structure proctable, that records the name of each
   procedure encountered in the traversal of the graph for P in a name
   field, and keeps a status flag for each procedure that, if defined,
   can have value \visited" or \selectsall". SelectTests first
   initializes T and proctable to OE, and invokes SelectTests2 on the
   entry procedures,
   PE and P E , of the two programs. Like SelectTests, SelectTests2 takes
   two procedures P and P as input,
   and locates tests that are modification-traversing for those
   procedures. However, SelectTests2 begins by inserting P into
   proctable, and setting the status flag for P to \visited", to indicate
   that a traversal of P has begun. The procedure then creates CFGs for P
   and P , and calls Compare2 with the entry nodes of those
   CFGs. When control returns from Compare2 to SelectTests2, SelectTests2
   checks to see whether the exit node of P was reached during the
   traversal of Compare2. If not, tests that enter P become
   modificationtraversing on every path through P ; thus, there is no
   point in visiting nodes beyond calls to P . To note this fact,
   SelectTests2 sets the status flag for P in proctable to \selectsall".
   Compare2 is similar to Compare, except that when Compare2 finds two
   nodes C and C equivalent, before it
   invokes itself on those nodes it checks to see if C contains any
   calls. If C contains calls, it may be appropriate to invoke
   SelectTests2 on the called procedures. (It is not necessary to check C
   for calls; at this point
   in the algorithm, C and C have already been compared and found
   lexically identical; thus, C and C are
   equivalent in terms of calls.) Compare examines the status flag for
   each procedure O called in C; if status is \visited" or \selectsall",
   then O has been (or in the former case may, in the case of recursive
   calls, be in the process of being) traversed, and there is no need to
   invoke SelectTests2 on O and O0 again. Finally, if any procedure
   called in C has its status flag set to \selectsall", then all tests
   through that procedure (and thus, all tests through C) have been
   selected, and there is no need to compare successors of C and C .
   In this fashion, SelectInterTests processes pairs of procedures from
   base and modified programs. By beginning with the entry procedures,
   and processing called procedures only when it reaches calls to those
   procedures, the algorithm avoids analyzing procedures when calls to
   those procedures occur only subsequent to code changes. Furthermore,
   the algorithm avoids traversing portions of graphs that lie beyond
   calls to procedures through which all tests are
   modification-traversing.
   Unlike the naive algorithm that processes every pair of procedures in
   P and P , SelectInterTests
   requires no mapping between procedure names in P and P (provided P and
   P compile and link). If
   
   algorithm SelectInterTests( P, P , PE, P E , T ) : T input P,P : base
   and modified versions of a program or subsystem procedure foo in P is deleted from P , statements in P that contain
   calls to foo must be changed for P .
   In this case, when Compare2 reaches a node in G that corresponds to a
   statement in which foo is called, Compare2 selects all tests that
   reach the node, and does not invoke SelectTests2 on foo. Thus, it is
   not possible for Compare2, in line 26, to be unable to find a
   counterpart for foo in P . The cases where foo is
   not present in P , but is added to P , and where foo is renamed for P
   , are handled similarly.
   
   The improvements to our basic intraprocedural algorithm, discussed in
   Section 3.1.4, also apply to our interprocedural algorithm.
   
   To illustrate the use of SelectInterTests, we offer the following
   example. Suppose we change program sys of Figure 8 to sys0, by
   modifying the code associated with node S5 in procedure B. (We do not
   show the modified graphs; to discuss the example we distinguish nodes
   in the CFG for sys from nodes in the CFG for sys0 by adding primes to
   them). Initially, SelectInterTests calls SelectTests2 with main and
   main0. SelectTests2 adds main to proctable with status \visited",
   creates the CFGs for the two procedures, then invokes Compare2 with
   the entry nodes of those graphs. Compare2 begins traversing the
   graphs, and on reaching nodes call A and call A0, because A is not
   listed in proctable, invokes SelectTests2 on A and A0. SelectTests2
   adds A to proctable with status \visited", then builds the CFGs for
   the two procedures, and begins traversing them. On reaching calls to
   C, the algorithm adds C to to proctable with status \visited", makes
   CFGs for C and C0, and begins traversing them. The algorithm finds no
   calls or differences in C and C0 and thus, when the call to Compare
   with the entry nodes of C and C0 terminates, the algorithm marks the
   exit node of C \exit0-visited". This marking means that tests entering
   C and C0 pass through unaffected, so SelectTests2 does not set the
   status flag for C to \selectsall".
   
   On returning from the call to SelectTests2 with C and C0, Compare
   resumes at line 29, finds that the status flag is not set, and
   continues traversing A and A0 by invoking itself with the call C node
   and its counterpart in the graph for sys0. The traversal eventually
   reaches the call to A in the if predicate: here Compare2 finds that A
   has status \visited" and does not reinvoke SelectTests2 on A, thus
   handling the recursion. On reaching the call to B in the else clause
   of the predicate, Compare2 invokes SelectTests2 with B and B0. This
   invocation ultimately identifies the differences in B and B0, and
   selects tests that reach the modified code. Furthermore, because the
   code difference prevents Compare2 from reaching the exit B node, when
   the call to Compare with the entry nodes of B and B0 returns,
   SelectTests2 sets the Status flag for B to \selectsall": all tests
   that enter B are modification-traversing.
   
   On returning from the call to SelectTests2 for A and A0, Compare2
   resumes, at line 29, with the call A node in main and its
   corresponding node in main0. The algorithm traverses the graphs
   through node S3 and its counterpart in main0. Here, when it examines
   successors of the nodes, Compare notes that the status flag for B is
   set to \selectsall", and thus, does not reinvoke SelectTests2 on B and
   B0. Furthermore, Compare2 sees that all procedures called in the call
   B node have status \selectsall", and thus does not traverse the graph
   for main any further.
   
   3.2.2 Complexity of the algorithm
   
   SelectTests and SelectInterTests are of comparable complexity. To see
   this, suppose P contains p procedures, and n statements of which c
   contain procedure calls, and suppose P contains n0 statements.
   Assume that the number of procedure calls in a single statement, and
   the length of a statement, are bounded by constants k1 and k2,
   respectively. We obtain an upper bound on the running time of
   SelectInterTests by considering the case where P and P are identical:
   in this case the algorithm builds and walks CFGs for
   every procedure in P and its corresponding procedure in P .
   
   In this worst case, regardless of the value of p, the time required to
   build CFGs for all procedures in P is O(n), and the time required to
   build the CFGs for all procedures in P is O(n0). We obtain an upper
   bound
   on the number of calls to Compare2 by assuming that every node in P
   must be compared to every node in P , as might happen if P and P
   contain single procedures; in this case, the number of Compare2 calls
   is
   O(nn0). Excluding, for the moment, the cost of the P rocT able lookups
   in lines 24-31, a call to Compare2 requires the same amount of work as
   a call to Compare, namely, k2jT j. Regardless of the number of calls
   to Compare, lines 24-31, over the course of a complete execution of
   SelectInterTests, require at most k1c table lookups, on a table that
   contains at most p entries; using a naive table lookup algorithm, the
   lines require O(cp) string comparisons, where the time for each
   comparison is bounded by k2. Thus, the time required by
   SelectInterTests, in the worst case, is O(n +n0 + jT jnn0 + cp), where
   cp is bounded above by k1n2. In practice, however, we expect a lower
   bound on execution time. When the multiply-visited-node condition does
   not hold, the term jT jnn0 becomes jT j(minfn; n0g). Furthermore, if
   we use an efficient hashing scheme to implement P rocT able, we reduce
   the O(cp) table lookup time to (expected time) O(c), which is O(n).
   
   4 Evaluations of the Algorithms
   
   In this section, we evaluate our algorithms, and compare them to
   existing test selection techniques. Section 4.1 presents an analytical
   evaluation and comparison; Section 4.2 presents empirical results.
   
   4.1 Analytical Evaluation
   
   In Reference [42], we present a framework for analyzing regression
   test selection techniques, that consists of four categories:
   inclusiveness, precision, efficiency, and generality. Inclusiveness
   measures the extent to which a technique selects tests that reveal
   faults in a modified program; a 100% inclusive technique is safe.
   Precision measures the extend to which a technique omits tests that
   cannot reveal faults in a modified program. Efficiency measures the
   space and time requirements of a technique, focusing in particular on
   critical phase costs. Generality measures the ability of a technique
   to function in some practical, and sufficiently wide, range of
   situations. We use our framework to compare and evaluate all
   code-based regression test selection techniques that we have found in
   the literature. We also use our framework to evaluate our technique
   and compare it to existing techniques. Here, we summarize the results
   reported in that work.
   
   Inclusiveness.
   
   Our test selection algorithms are safe for controlled regression
   testing. Only three other techniques[11, 23, 28] can also make this
   claim. Moreover, all of these safe techniques depend, for their
   safety, upon the same assumptions on which our algorithms depend.
   Thus, at present, with existing regression test selection techniques,
   safe test selection is possible only for controlled regression
   testing.
   
   Precision.
   
   Our test selection algorithms are not 100% precise. However, because
   the problem of precisely identifying the tests that are
   fault-revealing for a program and its modified version is undecidable,
   we know that we
   
   cannot have an efficient algorithm that is both safe and 100% precise.
   Thus, because our goal is safety, some imprecision is unavoidable.
   
   Nevertheless, our algorithms are the most precise safe algorithms
   currently available. For cases where the multiply-visited-node
   condition does not hold (which we expect includes all practical
   cases), our technique selects exactly the modification-traversing
   tests, whereas other safe techniques select the
   modification-traversing tests, along with tests that are not
   modification-traversing. In cases where the multiply-visited-node
   condition does hold, we can prove that SelectTests and
   SelectInterTests are more precise than two of the other three test
   selection techniques, and we have strong evidence to suggest that our
   algorithms are more precise than the third technique[39].
   
   Efficiency.
   
   As we discussed previously, our algorithms run in time O(jT jn2) for
   procedures or programs of n statements, and test set size jT j. This
   is an improvement over the efficiency of two of the other safe
   techniques. Moreover, in practice, we expect the basic algorithms
   (without added precision) to run in time O(jT jn); a bound comparable
   to the worst-case run time of the third safe technique. Our algorithms
   are fully automatable. Furthermore, much of the work required by our
   technique, such as construction of CFGs for P and collection of test
   history information, can be completed during the preliminary
   regression testing phase. Unlike other safe algorithms, our algorithms
   do not require prior computation of a mapping between components of
   programs or procedures and their modified versions; instead, they
   locate changed code as they proceed, and in the presence of
   significant changes avoid unnecessary comparison.
   
   Generality.
   
   Our algorithms apply to procedural languages generally, because we can
   obtain the required graphs and test history information for all such
   languages. Our technique supports both intraprocedural and
   interprocedural test selection. Our technique handles all types of
   program modifications, and handles multiple modifications in a single
   application of the algorithms.
   
   4.2 Empirical Evaluation
   
   Researchers wishing to experiment with software testing techniques
   face several difficulties { among them the problem of locating
   suitable experimental subjects. The subjects for testing
   experimentation include both software and test suites; for regression
   testing experimentation, we also need multiple versions of the
   software. Obtaining such subjects is a non-trivial task. Free
   software, often in multiple versions, is readily accessible, but free
   software is not typically equipped with test suites. Commercial
   software vendors, who are more likely to maintain established test
   suites, are often reluctant to release their source code and test
   suites to researchers. Even when suitable experimental subjects are
   available, prototype testing tools may not be robust enough to operate
   on those subjects, and the time required to ensure adequate robustness
   may be prohibitive.
   
   Given adequate experimental subjects and sufficiently robust
   prototypes, we may still question the generalizability of experimental
   results derived using those subjects and prototypes. Experimental
   results obtained in the medical sciences generalize due to the fact
   that a carefully chosen subset of a population of subjects typically
   represents a fair (i.e., normally distributed) cross-section of that
   population. As Weyuker states, however, when our subject population is
   the universe of software systems, we do not know what it means to
   select a fair cross-section of that population[47]. Nor do we know
   what it means to select a fair cross-section of the universe of
   modified versions or test suites for software. Weyuker concludes that
   software engineers typically perform \empirical studies," rather than
   experiments. She insists, however, that such studies offer insight,
   and are valuable tools in understanding the topic studied. We agree
   with Weyuker; hence, this section outlines the results of empirical
   studies.
   
   To empirically evaluate our regression test selection technique, we
   implemented the SelectTests and SelectInterTests algorithms as tools,
   which we call \DejaVu1" and \DejaVu2", respectively. Our
   implementations select tests for programs written in C.12 We
   implemented our tools and conducted our empirical studies on a Sun
   Microsystems SPARCstation 10 with 128MB of virtual memory.13
   
   4.2.1 Study 1: Intraprocedural and Interprocedural Test Selection
   
   Our first study investigated the efficacy of DejaVu1 and DejaVu2 on a
   set of small, but non-trivial, real subject programs.
   
   Subjects
   
   Hutchins, Foster, Goradia, and Ostrand[27] report the results of an
   experiment on the effectiveness of dataflow- and controlflow-based
   test adequacy criteria. To conduct their study, the authors obtained
   seven C programs, that ranged in size from 141 to 512 lines of code,
   and 8 to 21 procedures. They manufactured 132 versions of these
   programs, and created large test pools for the programs. The authors
   made these programs, versions, and test suites available to us. We
   refer to their experiment as the \Siemens study", and to the
   experimental programs as the \Siemens programs". Table 1 describes the
   Siemens programs.14
   
   To study our intraprocedural test selection algorithms, we considered
   each procedure in the Siemens programs that had been modified for one
   or more versions of a program. Table 2 lists these procedures.
   
   Because the Siemens study addressed error detection capability, the
   study employed faulty versions of base programs. For our purposes, we
   think of these faulty versions as ill-fated attempts to create
   modified versions of the base programs. The use of faulty versions
   also lets us make observations about error detection during regression
   testing.
   Hutchins et al.[27] describe the process used by the Siemens
   researchers to manufacture test suites and 12We describe the
   requirements for our tools, and their design and implementation, in
   Reference [39]. 13SPARCstation is a trademark of Sun Microsystems,
   Inc.
   14There are a few differences between the numbers reported in Table 1
   and the numbers reported in Reference [27]. Hutchins et al. report 39
   versions of tcas; their distribution to us contained 41. Also, the
   numbers of tests in the test pools we obtained from their distribution
   differed slightly from the numbers they reported: in the two most
   extreme cases, for example, we found 16 more tests (tcas), and 36
   fewer tests (usl.123). In some cases, we can attribute the difference
   in numbers of tests to repetitions of equivalent tests in their test
   plan. In any event, the difference amounts to less than 1% of the
   total test pool size and does not affect the results of this study.
   
 
   faulty program versions { we paraphrase that description here. The
   Siemens researchers created faulty versions of base programs by
   manually seeding faults into those programs. Most faults involve
   single line changes; a few involve multiple changes. The researchers
   required that the faults be neither too easy nor too difficult to
   detect (a requirement that was quantified by insisting that each fault
   be detectable by at least 3, and at most 350, tests in the test pool),
   and that the faults model \realistic" faults. Ten people performed the
   fault seeding, working for the most part without knowledge of each
   other's work.
   
   The Siemens researchers created test pools \according to good testing
   practices, based on the tester's understanding of the program's
   functionality and knowledge of : : : the code." The researchers
   initially generated tests using the category partition method and the
   Siemens TSL (Test Specification Language) tool[3, 36]; they then added
   additional tests to the test suites, to ensure that each coverage unit
   (statement,
   
   edge, and du-pair) in the base program and versions was exercised by
   at least 30 tests.
   
   The Siemens subjects present some disadvantages for our study, because
   they employ only faulty modifications, use manufactured faults rather
   than real ones, and use only faults that yield meaningful detection
   rates. However, the Siemens subjects also have considerable
   advantages. The fact that the Siemens researchers made the subjects
   available to us is an obvious advantage. Also, the Siemens test suites
   are enormous, and the seeded faults do model real faults. Furthermore,
   the source code for the base programs and versions is standard C,
   amenable to analysis and instrumentation by our prototype tools.
   Finally, the Siemens subjects have served previously as a basis for
   published empirical results.
   
   Empirical Procedure
   
   To obtain our empirical results, we initially used an analysis
   tool[17] on the base programs and modified versions to create control
   flow graphs for those versions. We then ran a code instrumentation
   tool to generate instrumented versions of the base programs. For each
   base program, we ran all of the tests for that program on the
   instrumented version of the program, and collected test history
   information for those tests. We then ran Dejavu1, our implementation
   of SelectTests, on each procedure from a Siemens program that had been
   modified in one or more modified versions, with each modified version
   of the base version of that procedure. We also ran Dejavu2, our
   implementation of SelectInterTests, on each Siemens base program, with
   each modified version of that base program. We obtained execution
   timings during off-peak hours on a restricted machine; our testing
   processes were the only user processes active on the machine. We
   repeated each experiment five times for each (base program, modified
   program) or (base procedure, modified procedure) pair, and averaged
   our results over these runs; all timings that we report for this study
   list these average results. In our experimentation, we practiced
   controlled regression testing.
   
   Results
   
   Figure 10 depicts the test selection results for our intraprocedural
   test selection tool, DejaVu1, in Study 1. The graph depicts, for each
   of the 41 base versions of Siemens procedures, the percentage of tests
   selected by DejaVu1, on average, over the set of modified versions of
   that base procedure. As the graph shows, for this study,
   intraprocedural test selection reduced the size of selected test sets
   in some cases, but the overall savings was not dramatic. In fact, for
   21 of the 41 subject procedures, DejaVu1 always selected 100% of the
   tests for those procedures. DejaVu1 reduced test sets by more than 50%
   on average in only five cases. We discuss these results in greater
   detail later in this section.
   
   Figure 11 depicts the test selection results for our interprocedural
   test selection tool, DejaVu2, in Study 1. The graph depicts, for each
   of the 7 base versions of Siemens programs, the percentage of tests
   selected by DejaVu2, on average, over the set of modified versions of
   that base program. As the graph shows, in Study 1, the average test
   set selected by DejaVu2 for a modified version was 55.6% as large as
   the test set required by the retest all approach. In other words,
   DejaVu2 averaged a savings, in test set size, of 44.4%. Over the
   various base programs, the test sets selected by DejaVu2 ranged from
   43.29% (on replace) to 93.58% (on schedule2) of the size of the total
   test sets for those programs.
   
  
   The fact that our test selection algorithms reduce the number of tests
   required to retest modified programs does not by itself indicate the
   possible worth of our algorithms. If we required ten hours of analysis
   to determine that we could save one hour of testing, this might not be
   of benefit { unless the ten hours were fully automated, the hour saved
   was an hour of human time, and we could spare the ten hours. We would
   like to show that the time we save in not having to rerun tests
   exceeds the time we spend analyzing programs and selecting reduced
   test suites. Toward this end, Figure 12 depicts some timings. For each
   of the seven base programs, the figure shows the average time required
   to run all of the tests on the modified version of the program
   (darkest column), the time required to perform analysis, on average,
   of the base and modified versions of the program (lightest column),
   and the time required, on average, to run the selected tests on the
   modified version of the program. Times are shown in minutes and
   seconds. Note that in this study, we were able to fully automate both
   the execution of tests, and the validation of test results.
   
   As the figure indicates, the cost of our safe test selection
   algorithms, in terms of time, is negligible: never exceeding 21.8
   seconds. Our measurements include the cost of building CFGs for both
   the base and modified program version; however, the CFG for the base
   version could have been computed and stored, like test history
   information, during the preliminary phase of testing, reducing the
   critical period cost of the algorithm. The figure also shows that in
   all cases, DejaVu2 produced a savings in overall regression testing
   time. In the worst case, for schedule2, DejaVu2 saved only 28 seconds,
   or 4%, of total effort. In the best case, for replace, using DejaVu2
   saved 9 minutes and 17 seconds, or 53% of total effort.
   
   By absolute measures, savings of a few minutes and seconds, such as
   this study illustrates, may be relatively unimportant. However, in
   practice, test suites may take hours, days, or even weeks to run, and
   much of this effort may be human-intensive. Moreover, lengthy testing
   times may occur despite efforts to
   

   automate test execution and results validation. If results such as
   those depicted by this study scale up to such larger-scale cases, a
   savings of even 10% may matter, and a savings of 50% may offer a big
   win. In fact, we conjecture that the savings obtainable from DejaVu2
   increase, on average, as larger programs are used as subjects. Our
   second study investigates this conjecture.
   
   4.2.2 Study 2: Interprocedural Test Selection on a Larger Scale
   
   Our second study investigated the efficacy of our interprocedural test
   selection algorithm on a larger subject.
   
   Subjects
   
   For our second study, we obtained a program, player, that is one of
   the executable components of an internet-based game called Empire. The
   player executable is essentially a transaction manager; its main
   routine contains initialization code, followed by a five-statement
   event loop that waits for receipt of a user command, and on receiving
   one, calls a routine that processes the command (possibly invoking
   many more routines to do so), then waits to receive the next command.
   The loop, and the program, terminate when a user issues a quit
   command. Since its initial encoding in 1986 as an internet-based game,
   Empire has been rewritten many times; several modified versions of the
   program have been created. Most of these versions involve
   modifications to player.
   
   For our study, we located a base version of player for which five
   distinct modified versions were available. Table 3 presents some
   statistics about the base version. As the table indicates, the version
   contains 766 C functions and 49,316 lines of code, excluding blank
   lines and lines that contain only comments. The CFGs for the functions
   contained approximately 35,000 nodes and 41,000 edges in total (these
   numbers are approximate for reasons explained later). Table 4
   describes the versions of player that we used for our study.
   
   There were no test suites available for player. To construct a
   realistic test suite, we used the Empire information files, which
   describe the commands that are recognized by the player executable and
   discuss parameters and special side-effects for each command. We
   treated the information files as informal specifica-
   
   tions; for each command, we used its information file to construct
   versions of the command that exercise all parameters and special
   features, and test erroneous parameters and conditions. This process
   yielded a test suite of 1035 functional tests. We believe that this
   test suite is typical of the sorts of functional test suites designed
   in practice for large software systems, for which test suites often
   contain only functional tests.
   
   The player program is a reasonable subject for several reasons. First,
   the program is part of an existing software system that has a long
   history of maintenance at the hands of numerous coders; in this
   respect, the system is similar to many existing commercial software
   systems. Second, as a transaction manager, the player program is
   representative of a large class of software systems that receive and
   process a variety of user commands.15 Third, we were able to locate
   several real modified versions of one base version of the program.
   Fourth, although the absence of established test suites for the
   program was a disadvantage, the user documentation provided a
   code-independent means for generating functional tests in a realistic
   fashion. Finally, although not huge, the program is reasonably large.
   
   Empirical Procedure
   
   Due to limitations in our prototype analysis tools, we could analyze
   only 85% of the procedures in the player program; thus, we could not
   instrument or run our DejaVu implementations on 15% of the procedures.
   However, we were able to simulate the test selection effects of DejaVu
   on player. Our simulation determines exactly the numbers of tests
   selected and omitted by our algorithm in practice; we were also able
   to determine exactly the time required to run all tests, or all
   selected tests. We could not obtain precise results of the time
   required to build CFGs for the versions and perform test selection on
   those graphs, but we were able to obtain close estimates of those
   times.16
   
   Results
   
   Figure 13 depicts the test selection results for our interprocedural
   test selection algorithm for the modified versions of player. The
   graph depicts, for each of the five modified versions, the percentage
   of tests that our algorithm selects. As the results indicate, on
   average over the five versions, our algorithm selects 4.83% of the
   tests in the existing test suite. In other words, on average, the
   algorithm reduces the number of tests that must be rerun by over 95%.
   
   Figure 14 depicts timings for this study. The graph depicts, for each
   of the five modified versions, the time required to run all of the
   tests on the modified version of the program (darkest column), the
   estimated time required to perform analysis of the base and modified
   versions of the program (lightest column), and the time required to
   run the selected tests on the modified version of the program. Times
   are shown in hours, minutes, and seconds. The graph shows that in all
   cases, our algorithm produced a savings in overall regression testing
   time. This savings ranged from 4 hours and 39 minutes (82% of total
   effort) to 5 hours and 37 minutes (93% of total effort).
   15Other examples of systems in this class include database management
   systems, operating systems, menu-driven systems, and computer-aided
   drafting systems, to name just a few.
   16We determined the time required to build CFGs and run DejaVu1 on 85%
   of the code. We report that time scaled up by 1.15. Work is ongoing to
   improve the robustness of our analysis tools.
   
  
   Our estimate of analysis time projects a cost of at most 25 minutes;
   however, this estimate computes the cost of building CFGs for every
   procedure in both the base and modified program version during the
   critical period, and walking all of those CFGs completely. In
   practice, we could build the CFGs for procedures in the base version
   during the preliminary phase, and build CFGs for procedures in
   modified versions on demand, lowering the analysis cost. Furthermore,
   our test timings consider only the cost of running the tests, because
   we were not able to automate the validation of results for these
   tests. In practice, then, the time required to run tests would be much
   larger than the times shown, and the resulting savings would increase.
   
   4.2.3 Additional discoveries and discussion
   
   Our studies yielded several additional discoveries.
   First, in Section 3 we saw that our algorithms can select tests that
   are not modification-traversing for P and P , but only when the
   multiply-visited-node condition holds. In our experiments, we never
   encountered a case where that condition held. The results support our
   conjecture that in practice, our algorithms will not select
   non-modification-traversing tests.
   
   Second, although we have reported our results as averages over sets of
   modified programs, it is interesting to examine the behavior of our
   algorithms for individual cases.17 Consider, for example, the results
   for program replace. The test pool for replace contains 5542 tests, of
   which DejaVu2 selects, on average, 2399 (43.29%). However, over the 32
   modified versions of replace, the selected test sets ranged in size
   from 52 to 5542 tests, with no size range predominant; the standard
   deviation in the sizes of the selected test sets was 1611.48. In
   contrast, for schedule2, DejaVu2 selects, on average, 2508 (93.58%) of
   the program's pool of 2680 tests, with a standard deviation of 253.49.
   On eight of the ten modified versions of schedule2, DejaVu2 selects at
   least 90% of the existing tests.
   
   Given this range of variance, we would like to be able to determine
   the factors that influence the success of test selection. This could
   help us determine when test selection is likely to be successful, and
   also help us draw conclusions about ways in which to build programs
   and test suites that are \regression testable." On examining our
   experimental subjects and test suites, we determined that the
   effectiveness of test selection was influenced by three factors: the
   structure of P , the location of the modifications in P , and the
   extent
   and type of code coverage achieved by tests in T . Although these
   factors can interact, they may also operate independently.
   
   Finally, as an interesting side-effect of this study, our results
   support some hypotheses about the faultrevealing capabilities of
   regression test selection techniques. Study 1 involved faulty modified
   program versions; for these versions, a very small percentage of tests
   are fault-revealing. For example, consider program replace, with a
   test suite of 5542 tests. For version 26 of replace, 302 of the 5542
   tests are fault-revealing. DejaVu2 finds that 1012 of the 5542 tests
   are modification-traversing and selects them. A minimization test
   selection technique that selects one test from the set of tests that
   cover modified code has only a 29.8% chance of selecting one of the
   302 fault-revealing tests. Next, consider version 19 of replace.
   Although 4658 of the 5542 tests of replace are modification-traversing
   for this version, only 3 of these tests 17Reference [39] lists results
   for all programs and modified versions individually.
   
   are fault-revealing for the version. A minimization test selection
   technique that selects only one of the 4658 tests that cover this
   modification has only a .00064% chance of selecting a test that
   exposes the fault. In either case, DejaVu2 guarantees that the fault
   is exposed. Every case considered in Study 1 supports a comparable
   conclusion.
   
   It would not be fair, on the basis of Study 1 alone, to draw general
   conclusions about the relative effectiveness of minimization and safe
   test selection techniques, because the Siemens study deliberately
   restricted program modifications to those that contained faults that
   were neither too easy nor too difficult to detect. Nevertheless, the
   Siemens study did employ faults that are representative of real
   faults, so we expect that cases such as the two discussed above, and
   such as those found in the Siemens study in general, arise in
   practice. Our results thus give us good reason to question the
   efficacy of minimization test selection techniques where fault
   detection is concerned.
   
   4.2.4 Summary of empirical results, and limitations of the studies
   
   We summarize the major conclusions derived from our empirical studies
   about the use of our algorithms in practice as follows:
   
   ffl Our algorithms can reduce the time required to regression test
   modified software, even when the cost of the analysis performed to
   select tests is considered.
   ffl We expect interprocedural test selection to offer greater savings
   than intraprocedural test selection. ffl Regression test selection
   algorithms may yield greater savings when applied to large, complex
   programs than when applied to small, simple programs.
   ffl There exist programs, modified versions, and test suites for which
   test selection offers little in the way of savings.
   ffl The factors that affect the effectiveness of test selection
   techniques include the structure of programs, the nature of the
   modifications made to those programs, and the type of coverage
   attained by tests.
   ffl Our conjecture that in practice, programs contain no
   multiply-visited-nodes, remains plausible. ffl Our results support a
   conjecture that minimization techniques for test selection may be
   ineffective at revealing faults.
   
   For the sake of fairness, we mention the following limitations on our
   studies:
   
   ffl We have studied only a small sample of the universe of possible
   programs, modified programs, and test suites. We have no hard data on
   which to base a claim that this sample is normally distributed. We
   believe, however, that the subjects of our studies are representative
   of significant classes of programs that occur in practice.
   ffl Both of our studies required some manufactured artifacts: study 1
   used manufactured modified versions and tests, and study two used
   manufactured tests. In both cases, however, efforts were made to
   ensure that manufactured artifacts were representative of real
   counterparts.
   ffl Due to limitations in our program analysis tools, our second study
   required an estimation of analysis times. However, we were careful to
   estimate fairly and conservatively. We believe that our estimates
   understate the analysis time required by our technique.
   
   5 Conclusions and Future Work
   
   We have presented a technique for selecting tests from an existing
   test suite for use in regression testing a modified procedure or
   program. We have shown that for controlled regression testing, our
   algorithms are safe: they select every test in the existing test suite
   that may reveal faults in the modified program. Our algorithms are
   also the most precise safe test selection algorithms currently
   available. The algorithms are efficient, and handle both
   intraprocedural and interprocedural regression test selection. Our
   empirical studies suggest that in practice, our technique can reduce
   the cost of regression testing a modified program. This work is
   important for two reasons. According to Pressman[38], the cost of
   software maintenance dominates the overall cost of software. Moreover,
   the cost of maintenance, measured in terms of the percentage of
   software budget spent on maintenance, is increasing. In the 1970s,
   typical industry expenditures on software maintenance comprised 35 to
   40 percent of total software budget. In the 1980s, expenditures rose
   to between 40 and 60 percent of total budget, and in the 1990s the
   figure exceeds 70 percent. Beizer presents similar statistics[6].
   Because regression testing constitutes a significant percentage of
   maintenance costs, there are good reasons to seek improvements in
   regression testing processes.
   
   Both development testing and regression testing are expensive.
   However, the cost of development testing is incurred once, when a
   software product is created; this cost is then amortized over the
   lifetime of the product. In contrast, a software product is regression
   tested every time a new version of that product is released. The cost
   of regression testing is thus compounded over the lifetime of the
   product. Thus, improvements in regression testing may have greater
   potential to reduce the overall cost of software than improvements in
   development testing.
   
   The motivation for finding better regression testing techniques is not
   just economic. Both developers and users of software want to believe
   that their software works correctly. Both developers and users of
   software want the probability that their software will fail to be
   reduced. Testing is the method of choice both for building confidence
   in software, and for increasing its reliability. Regression testing
   plays an important role in both of these tasks. At present, however,
   according to Beizer, regression testing is often overlooked or
   performed inadequately: either the testing of new features, or the
   revalidation of old features, or both, are sacrificed. As a result,
   software reliability decreases over the software's lifetime[6].
   Practical, effective selective retest techniques promote software
   quality.
   
   We have discovered several promising directions for future work in
   this area. First, while the empirical results reported in this paper
   are encouraging, they are also preliminary. Additional implementations
   of the algorithms and enhancements described in Section 3, and further
   empirical studies with those implementations, would be useful.
   Empirical work to examine the cost and the fault-revealing
   capabilities of these and other regression test selection techniques
   would be particularly valuable.
   
   Second, our work has focused on the problem of selecting tests from an
   existing test suite. An equally important problem, however, is that of
   ensuring that code modified or affected by modifications is adequately
   tested. The tests that we select from an existing test suite can help
   establish adequacy, but new tests may also be required. Future work
   should consider the extension of this technique to help identify the
   need for new tests.
   
   Third, our work considers the problem of code-based test selection;
   future work should consider the application of the technique to the
   problem of specification-based selection.
   
   Fourth, our research has revealed that the size of the test sets our
   algorithms select may vary significantly. This variance is a function
   of program structure, modifications, and test suite design. Further
   research could investigate correlations between these three factors
   and the related issues of regression testability and test suite
   design.
   
   Fifth, our technique is safe only for controlled regression testing.
   In this respect, all existing safe regression test selection
   techniques are alike. In the absence of controlled regression testing,
   techniques that are otherwise safe may omit fault-revealing tests.
   This raises the question: can we find ways to increase the conditions
   under which safe test selection is possible? There are several ways in
   which to address this question. We might try to find ways to make
   controlled regression testing possible in situations wherein it is
   currently difficult to attain. Or, we might try to find ways to
   identify the types of fault-revealing tests that techniques like ours
   might omit in the absence of controlled regression testing, and extend
   the techniques to select these tests. Or finally, we might find ways
   to otherwise identify the errors that could be exposed by tests
   omitted by these techniques.
   
   Finally, even when we cannot employ controlled regression testing, and
   thus cannot guarantee safety, our algorithms may still select test
   suites that are useful. It is not the case that our algorithms
   function only in the presence of controlled regression testing: it is
   simply that they, like all other regression test selection algorithms,
   are not safe in the absence of controlled regression testing. Viewed
   differently, our algorithms select all tests in T that, if we had
   practiced controlled regression testing, could have exposed faults in
   P .
   When the testing budget is limited, and we must choose a subset of T,
   modification-traversing tests such as those selected by our algorithm
   may be better candidates for execution than
   non-modification-traversing tests. Thus, even in the absence of
   controlled regression testing, we may find our algorithm useful.
   Empirical studies could investigate this conjecture further.
   
   References
   
   [1] H. Agrawal, J. Horgan, E. Krauser, and S. London. Incremental
   regression testing. In Proceedings of the Conference on Software
   Maintenance - 1993, pages 348{357, September 1993.
   
   [2] A.V. Aho, R. Sethi, and J.D. Ullman. Compilers: Principles,
   Techniques, and Tools. Addison-Wesley, Reading, MA, 1986.
   
   [3] M. Balcer, W. Hasling, and T. Ostrand. Automatic generation of
   test scripts from formal test specifications. In Proceedings of the
   Third Symposium on Software Testing, Analysis, and Verification, pages
   210{218, December 1989.
   
   [4] S. Bates and S. Horwitz. Incremental program testing using program
   dependence graphs. In Proceedings of the 20th ACM Symposium on
   Principles of Programming Languages, January 1993.
   
   [5] B. Beizer. Software Testing Techniques. Van Nostrand Reinhold, New
   York, NY, 1990.
   
   [6] B. Beizer. Black-Box Testing. John Wiley and Sons, New York, NY,
   1995.
   
   [7] P. Benedusi, A. Cimitile, and U. De Carlini. Post-maintenance
   testing based on path change analysis. In Proceedings of the
   Conference on Software Maintenance - 1988, pages 352{361, October
   1988.
   
   [8] D. Binkley. Using semantic differencing to reduce the cost of
   regression testing. In Proceedings of the Conference on Software
   Maintenance - 1992, pages 41{50, November 1992.
   
   [9] D. Binkley. Reducing the cost of regression testing by semantics
   guided test case selection. In Proceedings of the Conference on
   Software Maintenance - 1995, October 1995.
   
   [10] P.A. Brown and D. Hoffman. The application of module regression
   testing at TRIUMF. Nuclear Instruments and Methods in Physics
   Research, Section A, .A293(1-2):377{381, August 1990.
   
   [11] Y.F. Chen, D.S. Rosenblum, and K.P. Vo. TestTube: A system for
   selective regression testing. In Proceedings of the 16th International
   Conference on Software Engineering, pages 211{222, May 1994.
   
   [12] T. Dogsa and I. Rozman. CAMOTE - computer aided module testing
   and design environment. In Proceedings of the Conference on Software
   Maintenance - 1988, pages 404{408, October 1988.
   
   [13] K.F. Fischer. A test case selection method for the validation of
   software maintenance modifications. In Proceedings of COMPSAC '77,
   pages 421{426, November 1977.
   
   [14] K.F. Fischer, F. Raji, and A. Chruscicki. A methodology for
   retesting modified software. In Proceedings of the National
   Telecommunications Conference B-6-3, pages 1{6, November 1981.
   
   [15] R. Gupta, M.J. Harrold, and M.L. Soffa. An approach to regression
   testing using slicing. In Proceedings of the Conference on Software
   Maintenance - 1992, pages 299{308, November 1992.
   
   [16] M.J. Harrold, R. Gupta, and M.L. Soffa. A methodology for
   controlling the size of a test suite. ACM Transactions on Software
   Engineering and Methodology, 2(3):270{285, July 1993.
   
   [17] M.J. Harrold, L. Larsen, J. Lloyd, D. Nedved, M. Page, G.
   Rothermel, M. Singh, and M. Smith. Aristotle: a system for the
   development of program-analysis-based tools. In Proceedings of the
   33rd Annual Southeast Conference, pages 110{119, March 1995.
   
   [18] M.J. Harrold and M.L. Soffa. An incremental approach to unit
   testing during maintenance. In Proceedings of the Conference on
   Software Maintenance - 1988, pages 362{367, October 1988.
   
   [19] M.J. Harrold and M.L. Soffa. An incremental data flow testing
   tool. In Proceedings of the Sixth International Conference on Testing
   Computer Software, May 1989.
   
   [20] M.J. Harrold and M.L. Soffa. Interprocedural data flow testing.
   In Proceedings of the Third Testing, Analysis, and Verification
   Symposium, pages 158{167, December 1989.
   
   [21] J. Hartmann and D.J. Robson. Revalidation during the software
   maintenance phase. In Proceedings of the Conference on Software
   Maintenance - 1989, pages 70{79, October 1989.
   
   [22] J. Hartmann and D.J. Robson. RETEST - development of a selective
   revalidation prototype environment for use in software maintenance. In
   Proceedings of the Twenty-Third Hawaii International Conference on
   System Sciences, pages 92{101, January 1990.
   
   [23] J. Hartmann and D.J. Robson. Techniques for selective
   revalidation. IEEE Software, 16(1):31{38, January 1990.
   
   [24] D. Hoffman. A CASE study in module testing. In Proceedings of the
   Conference on Software Maintenance - 1989, pages 100{105, October
   1989.
   
   [25] D. Hoffman and C. Brealey. Module test case generation. In
   Proceedings of the Third Workshop on Software Testing, Analysis, and
   Verification, pages 97{102, December 1989.
   
   [26] S. Horwitz and T. Reps. The use of program dependence graphs in
   software engineering. In Proceedings of the 14th International
   Conference on Software Enginnering, pages 392{411, May 1992.
   
   [27] M. Hutchins, H. Foster, T. Goradia, and T. Ostrand. Experiments
   on the effectiveness of dataflow- and controlflow-based test adequacy
   criteria. In Proceedings of the 16th International Conference on
   Software Engineering, pages 191{200, May 1994.
   
   [28] J. Laski and W. Szermer. Identification of program modifications
   and its applications in software maintentance. In Proceedings of the
   Conference on Software Maintenance - 1992, pages 282{290, November
   1992.
   
   [29] J.A.N. Lee and X. He. A Methodology for Test Selection. The
   Journal of Systems and Software, 13(1):177{185, September 1990.
   
   [30] H.K.N. Leung and L. White. Insights Into Regression Testing. In
   Proceedings of the Conference on Software Maintenance - 1989, pages
   60{69, October 1989.
   
   [31] H.K.N. Leung and L. White. Insights into testing and regression
   testing global variables. Journal of Software Maintenance, 2:209{222,
   December 1990.
   
   [32] H.K.N. Leung and L.J. White. A study of integration testing and
   software regression at the integration level. In Proceedings of the
   Conference on Software Maintenance - 1990, pages 290{300, November
   1990.
   
   [33] H.K.N. Leung and L.J. White. A cost model to compare regression
   test strategies. In Proceedings of the Conference on Software
   Maintenance - 1991, pages 201{208, October 1991.
   
   [34] R. Lewis, D.W. Beck, and J. Hartmann. Assay - a tool to support
   regression testing. In ESEC '89. 2nd European Software Engineering
   Conference Proceedings, pages 487{496, September 1989.
   
   [35] U. Linnenkugel and M. Mullerburg. Test data selection criteria
   for (software) integration testing. In Systems Integration '90.
   Proceedings of the First International Conference on Systems
   Integration, pages 709{717, April 1990.
   
   [36] T.J. Ostrand and M.J. Balcer. The category-partition method for
   specifying and generating functional tests. Communications of the ACM,
   31(6), June 1988.
   
   [37] T.J. Ostrand and E.J. Weyuker. Using dataflow analysis for
   regression testing. In Sixth Annual Pacific Northwest Software Quality
   Conference, pages 233{247, September 1988.
   
   [38] R. Pressman. Software Engineering: A Practitioner's Approach.
   McGraw-Hill, New York, NY, 1987.
   
   [39] G. Rothermel. Efficient, Effective Regression Testing Using Safe
   Test Selection Techniques. Ph.D. dissertation, Clemson University, May
   1996.
   
   [40] G. Rothermel and M.J. Harrold. A safe, efficient algorithm for
   regression test selection. In Proceedings of the Conference on
   Software Maintenance - 1993, pages 358{367, September 1993.
   
   [41] G. Rothermel and M.J. Harrold. Selecting regression tests for
   object-oriented software. In Proceedings of the Conference on Software
   Maintenance - 1994, pages 14{25, September 1994.
   
   [42] G. Rothermel and M.J. Harrold. Analyzing regression test
   selection techniques. Technical Report OSU-CISRC- 4/96-TR23, The Ohio
   State University, February 1996.
   
   [43] S. Schach. Software Engineering. Aksen Associates, Boston, MA,
   1992.
   
   [44] B. Sherlund and B. Korel. Modification oriented software testing.
   In Conference Proceedings: Quality Week 1991, pages 1{17, 1991.
   
   [45] A.B. Taha, S.M. Thebaut, and S.S. Liu. An approach to software
   fault localization and revalidation based on incremental data flow
   analysis. In Proceedings of the 13th Annual International Computer
   Software and Applications Conference, pages 527{534, September 1989.
   
   [46] A. von Mayrhauser, R.T. Mraz, and J. Walls. Domain based
   regression testing. In Proceedings of the Conference on Software
   Maintenance - 1994, pages 26{35, September 1994.
   
   [47] E.J. Weyuker. Empirical techniques for assessing testing
   strategies,. (Panel discussion at the International Symposium on
   Software Testing and Analysis), August 1994.
   
   [48] L.J. White and H.K.N. Leung. A firewall concept for both
   control-flow and data-flow in regression integration testing. In
   Proceedings of the Conference on Software Maintenance - 1992, pages
   262{270, November 1992.
   
   [49] L.J. White, V. Narayanswamy, T. Friedman, M. Kirschenbaum, P.
   Piwowarski, and M. Oha. Test Manager: a regression testing tool. In
   Proceedings of the Conference on Software Maintenance - 1993, pages
   338{347, September 1993.
   
   [50] S. Wolfram. Mathematica: A System for Doing Mathematics on a
   Computer. Addison-Wesley, Reading, MA, second edition, 1991.
   
   [51] W. E. Wong, J. R. Horgan, S. London, and A. P. Mathur. Effect of
   test set minimization on fault detection effectiveness. In 17th
   International Conference on Software Engineering, pages 41{50, April
   1995.
   
   [52] W. Yang. Identifying syntactic differences between two programs.
   Software|Practice and Experience, 21(7):739{ 755, July 1991.
   
   [53] W. Yang, S. Horwitz, and T. Reps. A program integration algorithm
   that accomodates semantics-preserving transformations. ACM
   Transactions on Software Engineering and Methodology, 1(3):311{54,
   July 1992.
   
   [54] S.S. Yau and Z. Kishimoto. A method for revalidating modified
   programs in the maintenance phase. In COMP- SAC '87: The Eleventh
   Annual International Computer Software and Applications Conference,
   pages 272{277, October 1987.
   
   [55] J. Ziegler, J.M. Grasso, and L.G. Burgermeister. An Ada based
   real-time closed-loop integration and regression test tool. In
   Proceedings of the Conference on Software Maintenance - 1989, pages
   81{90, October 1989.
