Mutation Analysis vs. Code Coverage in Automated Assessment of Students' Testing Skills

Kalle Aaltonen
kalle.aaltonen@gmail.com

Petri Ihantola, Otto Seppälä
Aalto University, Finland
{petri,oseppala}@cs.hut.fi
Abstract
Learning to program should include learning about proper software testing. Some automatic assessment systems, e.g. Web-CAT, allow assessing student-generated test suites using coverage metrics. While this encourages testing, we have observed that students can sometimes get rewarded for high coverage although their tests are of poor quality. Exploring alternative methods of assessment, we have tested mutation analysis to evaluate students' solutions. Initial results from applying mutation analysis to real course submissions indicate that mutation analysis could be used to fix some problems of code coverage in the assessment. Combining both metrics is likely to give more accurate feedback.
Categories and Subject Descriptors: K.3.2 [Computer and Information Science Education]: Computer science education

General Terms: Experimentation, Measurement, Human Factors

Keywords: automated assessment, testing, programming assignments, test coverage, mutation analysis, mutation testing
1. Introduction
Students taking introductory programming classes are not usually accustomed to performing their own testing. As a result, they tend to focus on the correctness of the output as specified in the assignment and little else. If the program performs unexpectedly, some manual testing is often done to locate the bug instead of using more systematic approaches. We have observed this effect especially when feedback from automated assessment is available. Spacco and Pugh made similar
observations and suggest giving detailed feedback only after students have also tested the code themselves [14].
Some automated assessment tools allow grading of student tests, making it worthwhile for the students to test. At Aalto University (formerly Helsinki University of Technology) we have used automated assessment of programming assignments at least since 1994 and have assessed students' self-written unit tests with Web-CAT¹ [2] since 2006. In Web-CAT, the assessment of student-provided tests is based on the percentage of the student's self-defined tests passing and the structural code coverage (i.e. statement or branch coverage) of these tests. While coverage provides information for the tester about possible places for improvement, it might not tell the whole story, because good code coverage does not automatically guarantee proper test adequacy. It is well known in industry that developers can misuse code-coverage-based test adequacy metrics to create a false sense of well-tested software [10]. Not too surprisingly, we have observed some students doing just the same to please the automated assessment system.
Although Web-CAT performs static analysis to ensure that tests include assertions, good code coverage can be achieved without sufficiently strong assertions, or even by checking the assertions before running the code that was to be tested. For example,

    assertTrue(1 < 2); fibonacci(6);
    assertTrue(fibonacci(6) >= 0);
    assertEquals(8, fibonacci(6));

all achieve the same code coverage in automated coverage analysis, although their ability to tell how well the fibonacci method works is quite different. It is even possible that some students do not see a problem in the first and second examples, which would be even more worrying. In any case, there are students who follow approaches similar to all of the examples above. When students are rewarded for the code coverage of their tests, they seem to forget the true reason why tests are written.
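Written out as full JUnit test methods, the difference becomes easier to see. The following is a hypothetical sketch: the class name, the test names, and the stand-in fibonacci implementation are ours and are not taken from the course material. All three tests earn the same statement and branch coverage, but only the last one actually constrains the result of fibonacci(6).

    import static org.junit.Assert.assertEquals;
    import static org.junit.Assert.assertTrue;
    import org.junit.Test;

    public class FibonacciCoverageTest {

        @Test
        public void trivialAssertionThenCall() {
            assertTrue(1 < 2);              // always true, checks nothing about the code
            fibonacci(6);                   // executed only to gain coverage
        }

        @Test
        public void weakAssertion() {
            assertTrue(fibonacci(6) >= 0);  // passes for many incorrect implementations
        }

        @Test
        public void strongAssertion() {
            assertEquals(8, fibonacci(6));  // pins down the expected value
        }

        // Stand-in implementation so that the sketch compiles on its own.
        private static int fibonacci(int n) {
            int curr = 1, prev = 0;
            for (int i = 0; i < n; i++) {
                int temp = curr;
                curr = curr + prev;
                prev = temp;
            }
            return prev;
        }
    }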
¹ http://web-cat.cs.vt.edu/
2. Research Problem and Method
To tackle the problem of poor quality in students' written tests, we decided to seek alternative metrics to evaluate test adequacy.
Mutation analysis is a well-known technique performed on a set of unit tests by seeding simple programming errors into the program code to be tested. Each combination of errors applied to the code creates what is called a mutant. These "mutants" are generated systematically in large quantities, and the examined test suite is run on each of them. The theory is that a test suite that detects more of the generated defective programs is better than one that detects fewer [1]. This makes mutation testing an interesting candidate for use in automatic assessment of student tests. In this paper we apply mutation analysis to real course data to study how this method would perform in an educational setting.
The exact research questions we address are:

Q1: What are the possible strengths and weaknesses of mutation analysis when compared to code coverage based metrics?

Q2: Can mutation analysis be used to give meaningful grading on student-provided test suites requested in programming assignments?
The suitability of mutation testing for test suite assessment was evaluated with data from the Helsinki University of Technology Intermediate Courses in Programming L1 and T1, held in Fall 2008. These identical courses both teach object-oriented programming in Java and are worth 6 ECTS credits². They have a 5 ECTS-credit CS1 course as their prerequisite. Automatic assessment is used in all of these courses, but students are not required to do unit testing in the prerequisite course. The exercises count for 30 percent of the course grade. All exercises were originally assessed with Web-CAT. The number of resubmissions was not limited. No mutation testing was used when the course was given.
The research method we applied was to compare test coverage to mutation scores. Both scores were calculated from existing student solutions to three programming assignments. Mutation scores were calculated using Javalanche [12]. During the course the students were "traditionally" awarded points for test coverage. We further investigated submissions that got full points from the coverage but performed poorly in the mutation analysis. In addition, for test set A, we evaluated the effect of different solutions by calculating the mutation score of each submitted test suite against all submitted solutions.
Generating and testing mutants requires processing time. Providing instant feedback using mutation analysis in assessment requires keeping the number of mutants at a reasonable level. Minimum, maximum and average counts of mutants are given for each of the test sets.

² European Credit Transfer and Accumulation System
3. Related Research
3.1 Automated Assessment of Testing Skills
Both code coverage based metrics and the ability to find faulty implementations are used to automatically evaluate students' tests. Some of the related research conducted before 2006 is summarized in [14].
Assessing the Code Coverage
ASSYST [8] and Web-CAT [2] are perhaps the most widely used tools that combine automated assessment of correctness and tests. Grading in Web-CAT is based on three factors: the percentage of the teacher's tests passing, the percentage of the student's own tests passing, and the code coverage percentage of the student's tests. In ASSYST, statement coverage can affect a grade that is originally based on testing correctness with the teacher's tests. Still another system, Marmoset, has been modified to take students' tests into account when grading and giving feedback [14]. By default, Marmoset has the tests grouped into two sets. Feedback from public tests, including the test definitions, is given immediately after submission. After the public tests pass, the students can ask for the release tests to be executed. Feedback from the release tests is both limited and delayed. An enhanced version of Marmoset investigates both the code coverage of the release tests and the student's own tests. As an incentive to test thoroughly, information about a release test is provided only if the student's own tests cover the same code as the release test covers.
Assessing the Ability to Find Bugs
Goldwasser [4] describes an idea where each student on a course provides both a program and a test set, all combinations of which are tried together. Test sets that reveal a lot of bugs and programs that pass a lot of test sets are both rewarded. It is also possible for the staff to seed a faulty implementation into the competition. This also makes it possible to give immediate feedback, as students do not need to wait until the exercise deadline when the competition can be run. Moreover, this allows students to learn from their mistakes. Other papers with a similar competition concept include e.g. [6, 11]. Elbaum et al. have created BugHunt [3], a web-based tutorial where students write unit tests to reveal problems in given programs.
3.2 Mutation Analysis
Figure 1 explains the process of mutation analysis. The inputs of the process are the program to be tested and the test suite to be evaluated. In the next phase mutants are generated from the program, and each mutant is tested with the test set. If a mutant fails the testing, it is killed. If it passes all tests, it is called a live mutant. Live mutants are examined by hand in the next phase and split into equivalent and non-equivalent mutants. For example, Listings 2 and 3 are mutants of Listing 1. Note that Listing 2 is functionally identical to the original (i.e. equivalent) whereas Listing 3 is not.
    int normFib(int N) {
        int curr = 1, prev = 0;
        for (int i = 0; i < N; i++) {
            int temp = curr;
            curr = curr + prev;
            prev = temp;
        }
        return prev;
    }

Listing 1. Original

    int equalFib(int N) {
        int curr = 1, prev = 0;
        for (int i = 1; i <= N; i++) {
            int temp = curr;
            curr = curr + prev;
            prev = temp;
        }
        return prev;
    }

Listing 2. Equivalent mutant

    int mutFib(int N) {
        int curr = 1, prev = 0;
        for (int i = 0; i <= N; i++) {
            int temp = curr;
            curr = curr + prev;
            prev = temp;
        }
        return prev;
    }

Listing 3. Non-equivalent mutant
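As a minimal illustration of how a test interacts with these listings, the following JUnit sketch kills the non-equivalent mutant of Listing 3 with a single assertion, while no test can ever kill the equivalent mutant of Listing 2. The class and test names are ours, and Listing 1 is copied into the test class only to keep the sketch self-contained.

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    public class FibonacciTest {

        // Copy of Listing 1; in the real setup the method under test lives in
        // the student's solution class.
        static int normFib(int N) {
            int curr = 1, prev = 0;
            for (int i = 0; i < N; i++) {
                int temp = curr;
                curr = curr + prev;
                prev = temp;
            }
            return prev;
        }

        // For N = 3, the original and the equivalent mutant (Listing 2) both
        // return 2, while the off-by-one mutant (Listing 3) returns 3. This
        // assertion therefore kills the non-equivalent mutant but can never
        // distinguish the equivalent one from the original.
        @Test
        public void returnsThirdFibonacciNumber() {
            assertEquals(2, normFib(3));
        }
    }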
[Figure 1. Mutation analysis process: from the input program P and test suite T, mutants M are created from P. If T(M) fails, the mutant M is killed. If T(M) passes, M is either marked as equivalent to P or T is amended to detect M.]
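The automated part of this process can be sketched roughly as the following loop. This is our own illustrative Java, not the API of Javalanche or any other tool; the Mutant and TestSuite types are hypothetical placeholders, and the manual equivalence classification is left out because an automated tool only sees killed versus live mutants.

    import java.util.List;

    // Minimal sketch of the automated part of Figure 1.
    class MutationAnalysisSketch {
        static double estimatedScore(List<Mutant> mutants, TestSuite tests) {
            int killed = 0;
            for (Mutant m : mutants) {
                // A mutant is killed as soon as at least one test fails on it.
                if (!tests.allPass(m)) {
                    killed++;
                }
            }
            // Live mutants would still need manual inspection to separate the
            // equivalent ones; the estimate simply divides by all mutants.
            return (double) killed / mutants.size();
        }
    }

    // Hypothetical helper types, present only so the sketch compiles.
    interface Mutant {}
    interface TestSuite { boolean allPass(Mutant m); }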
The effectiveness of a test set in mutation analysis is measured by a mutation score. This is normally defined as the percentage of non-equivalent mutants killed by the test set. Automated tools often estimate it as the percentage of killed mutants out of all generated mutants, i.e. treating every live mutant as non-equivalent. The latter estimate is provided by most mutation tools and is also what we have used in this paper.
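Restated as formulas, with killed, mutants and equivalent denoting the corresponding counts (the notation is ours, not from the tools):

    \text{mutation score (exact)} = \frac{\text{killed}}{\text{mutants} - \text{equivalent}},
    \qquad
    \text{mutation score (estimate)} = \frac{\text{killed}}{\text{mutants}}

Since the denominator of the estimate is never smaller, the estimate never exceeds the exact score. With the numbers reported later for the best submission in test set A (48 of 49 mutants killed, the single live mutant being equivalent), the estimate is 48/49 ≈ 97.96% while the exact score would be 48/48 = 100%.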
Mutating Java Programs
Mutants can be generated by modifying a program on different levels, from machine code to interpreted languages with a high abstraction level. Current mutation analysis tools for the Java language generate mutants either from the Java source code (e.g. µJava [9]) or from the intermediary bytecode executed by the Java Virtual Machine (e.g. Javalanche [12]), as illustrated in Figure 2. There are pros and cons to both source-code-level and bytecode-level mutants:

- Each examined source code mutant has to be compiled, which is slow.
- Bytecode mutants are difficult to examine afterwards, as it is not always possible or straightforward to generate Java source for the mutated bytecode.
- The compiler can eliminate dead code, which in theory can result in fewer equivalent mutants among source code mutants: if the mutation operation targets a part of the code that is deemed dead by the compiler, the resulting bytecode will be identical to the original.
- Some more advanced operators are significantly easier to implement on the Java source level than in bytecode.
[Figure 2. Mutant generation on the Java architecture: source-level mutation operates before the Java compiler (e.g. if (i > 5) becomes if (i < 5)), whereas bytecode mutation operates on the compiled code executed by the Java Virtual Machine (e.g. if_icmplt becomes if_icmpgt).]
µJava³ is a well-known mutation tool for Java, developed since 2003. µJava's mutation operations fall into two distinct classes: method-level mutation operators and class mutation operators. Class-level operations are related to encapsulation, inheritance, polymorphism, and some Java-specific features (e.g. adding or removing keywords like this and static). Method-level operations, presented in Table 1, are very generic and can be applied to other languages.
Javalanche⁴ is a simple and effective bytecode-level mutation analysis tool. It replaces numerical constants (x → x + 1 | x − 1 | 0 | 1), negates jump conditions, omits method calls and replaces arithmetic operators. It has no advanced mutation operators related to visibility, inheritance and polymorphism of the kind µJava has.
³ http://cs.gmu.edu/~offutt/mujava/
⁴ http://www.st.cs.uni-saarland.de/~schuler/javalanche/
Arithmetic operators: replace, add, and remove unary and binary arithmetic operators (+, -, /, *, ++, --) for both integer and floating point operands. Example: x+1, with possible mutants x-1, x*1, x/1.

Relational operators: replace the comparison operators (>, >=, <, <=, ==, !=) within the program. Example: x==1, with possible mutants x!=1, x>=1, ...

Conditional operators: replace, insert and remove conditional operators (&&, ||, !). The bitwise operators &, |, and ^ are also used as replacements for these operators, as they are very common mistakes. Example: x||!y, with possible mutants x|!y, x&&!y, x||y, ...

Shift operators: replace the bit-wise shifting operators (<<, >>, >>>). Example: x>>1, with possible mutants x<<1, x>>>1.

Bitwise operators: replace, add and remove the four bitwise operators (&, |, ^, ~). Example: x&~y, with possible mutants x|~y, x&y, x^~y.

Assignment operators: replace the convenience assignment operators provided by Java with another (+=, -=, *=, /=, %=, &=, |=, ^=, <<=, >>=, >>>=). Example: x+=2, with possible mutants x-=2, x*=2, x/=2, ...

Table 1. Method-level mutation operators in µJava. Each entry lists an original expression followed by possible mutants generated by applying the operator to it.
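To make the operators concrete, the sketch below shows one small boolean method and three hand-written mutants in the style of Table 1. These were written by us for illustration and are not actual µJava output.

    // Illustrative method-level mutants; class and method names are ours.
    class MutantExamples {
        // Original method.
        static boolean inRange(int x) {
            return x >= 0 && x < 10;
        }

        // Relational-operator mutant: >= replaced by >.
        static boolean inRangeMutant1(int x) {
            return x > 0 && x < 10;
        }

        // Conditional-operator mutant: && replaced by ||.
        static boolean inRangeMutant2(int x) {
            return x >= 0 || x < 10;
        }

        // Conditional-to-bitwise mutant: && replaced by &. Because neither
        // operand has side effects, this mutant happens to be equivalent to
        // the original, illustrating how equivalent mutants arise.
        static boolean inRangeMutant3(int x) {
            return x >= 0 & x < 10;
        }
    }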
Javalanche has been successfully used to run mutation analysis on AspectJ, a large open source Java project with almost 100 thousand lines of code, in under six hours on a single workstation [13].
4. Test Coverage vs. Mutation Score
We analyzed the final submission from each student to three assignments, called test sets A, B, and C later on. We failed to apply mutation analysis successfully to some submissions because:
1. The submission did not compile successfully.
2. The test suite did not pass on the unmutated program.
3. Individual tests were not repeatable and independent of the execution order (a sketch of such a test suite follows below).
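The third case typically looks like the following hypothetical JUnit example (our own construction, not taken from the course data), where two tests share static state and therefore only pass in one execution order:

    import static org.junit.Assert.assertEquals;
    import java.util.ArrayList;
    import java.util.List;
    import org.junit.Test;

    public class OrderDependentTest {
        // Shared fixture that is never reset between tests.
        private static final List<Integer> shared = new ArrayList<Integer>();

        @Test
        public void addsOneElement() {
            shared.add(42);
            assertEquals(1, shared.size());
        }

        @Test
        public void assumesPreviousTestRan() {
            // Fails when run alone or before addsOneElement(), because the
            // shared list is still empty at that point.
            assertEquals(42, (int) shared.get(0));
        }
    }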
This implies that only submissions where all the student's tests passed are analyzed. Table 2 summarizes how many of the submissions were successfully analyzed and how many mutants, on average, were generated from each assignment.
                Mutation analysis                      Generated mutants (per submission)
Name    All     Applicable      Avg. mutation score    min     max     avg
Set A   158     131 (83.0%)     80.5%                  22      90      44.4
Set B   187     174 (90.0%)     73.7%                  79      439     106.9
Set C   193     169 (87.6%)     84.9%                  12      93      26.4

Table 2. Summary of the test sets used.
4.1 Test set A - Binary search trees
In this exercise the students were instructed to implement a binary search tree by extending an existing binary tree implementation through inheritance. As with all our exercises, unit-testing the solution was required and code coverage was used as the measure of completeness.
Of the 131 analyzed submissions, 125 achieved perfect code coverage. This is an expected result, as the students were rewarded for reaching perfect coverage, and in the case of this assignment it was relatively easy to achieve.
The mutation analysis yielded an average of 44 mutations per sample, and it took on average about 12 seconds to run per sample. The best work managed to kill 48 of its 49 mutants, resulting in a 97.96% mutation score. The one remaining live mutant was identical to the original on the Java source code level and thus unkillable, so this can be considered a perfect score. On average the mutation score was 80.48%, and the worst was 40%. The worst mutation score among submissions that had reached perfect code coverage was 54.76%. There were several samples where the tested method could be completely commented out and the test suite would fail to detect this. This would seem to support our assumption that mutation analysis offers better capabilities for identifying weak test sets.
Figure 3 illustrates the relationship between mutation score and test coverage in this test set as a scatter plot. Histograms on each axis show the distribution of the respective variables. Code coverage (on the X-axis) was 100% for most submissions. This also explains why the correlation between the variables is small (ρ ≈ 0.1628).
Effect of the Implementation on the Generation of Mutants
The binary tree assignment was more restrictive than any of the others. Not only were the students told which methods to implement, but they were also not allowed to declare any additional instance variables. This was checked automatically. These restrictions made it possible to combine implementations and tests of different students.
In order to show how much variation the implementation itself (excluding the tests) causes in the mutation score, we selected four example test suites from the ones that had achieved 100% test coverage: the worst, the best, and two random ones. We ran mutation analysis on all the previously analyzed submissions with each of these test suites, and the results are shown in Figure 4. Each box plot presents one of the four selected test suites. Labels on the X-axis are the original mutation scores, and the box plot visualizes how the mutation score varied when the test suite was executed against the implementations of all the other students.
If we are to use the mutation score as an indication of the adequacy of a test set, the score should not be affected by the implementation. However, the implementation does affect the number of equivalent mutants created, which makes the mutation scores of two test suites on two different implementations incomparable. It should be noted that in Figure 4 the distributions of the first two test suites seem to be very similar even though their original scores are very different. This is something that needs to be taken into account if the students are ever rewarded on the basis of their mutation score.
4.2 Test set B - Hashing
This was the first exercise on the course and its main idea was to acquaint the students with unit testing. The code to be tested was a pre-implemented hash table using double hashing. The students only had to add to it a method for finding prime numbers; other than that, the exercise was all about testing.
[Figure 3. Scatter plot of code coverage (X-axis) and mutation score (Y-axis) of the test set A.]
[Figure 4. Distribution of mutation scores for the selected test suites as box plots, showing the minimum, maximum, and the 90th and 10th percentiles. X-axis labels are the mutation scores achieved when running the mutation analysis on each suite's own implementation: Best Suite 98.0%, Random Suite 1 85.4%, Random Suite 2 72.0%, Worst Suite 54.8%.]
Of the 174 analyzed submissions, 135 achieved perfect code coverage. This exercise was more complex than test set A, as it yielded on average 106 mutations per sample. Mutation scores of the submissions that achieved a perfect coverage score ranged from 42.7% to 89.39%, with 73.73% being the average. Mutation scores of the submissions that did not reach perfect coverage ranged from 24.44% to 85.96%, with 63.52% being the average.
The distribution and the relationship between code coverages and mutation scores can be seen in Figure 5.
The best sample reached 100% code coverage and managed to kill 100 out of 112 mutants.
We also examined the sample with the worst mutation score that had reached perfect code coverage: the sample managed to kill 38 of the 89 total mutations, reaching a mutation score of 42.70%. The student-generated portion of the test suite had only a single assertion, and half of the unit tests did not contain any assertions. The test suite is clearly inadequate despite having perfect branch coverage.
[Figure 5. Scatter plot of code coverage and mutation score of the test set B.]
The distribution seems to be very similar to the distribution seen in test set A (Figure 3), except that the number of samples is higher and there is more variance in the code coverages (σ ≈ 0.064). The Pearson product-moment correlation coefficient for the dataset is ρ ≈ 0.669, indicating a clear positive correlation, unlike in test set A. A major difference from test set A is the maximum mutation score, which was under 90% compared to the practically perfect mutation score achieved in A.
From some submissions we analyzed all mutants that were not killed. Not even the best student-generated test suite managed to kill all the non-equivalent mutants.
4.3 Test set C - Disjoint sets
In this exercise the students were instructed to build a simple union-find structure. The exercise allows extracting code shared by different methods into helper methods and has fairly simple recursive and iterative solutions.
Mutation analysis was successfully performed on 169 submissions, of which 144 achieved perfect coverage. Each submission yielded on average 26 mutations. The number of generated mutants ranged from 12 to 93. Mutation scores of the submissions that achieved a perfect coverage score ranged from 38.46% to 95.00%, with 84.88% being the average, which is the highest of all the data sets. Mutation scores of the submissions that did not reach perfect coverage ranged from 21.43% to 90.48%, with 67.84% being the average.
The distribution and the relationship between code coverages and mutation scores can be seen in Figure 6.
[Figure 6. Scatter plot of code coverage (X-axis) and mutation score (Y-axis) of the test set C.]
The best sample had 100% code coverage and killed 19 of its 20 mutations, reaching a mutation score of 95.00%; the remaining mutant was equivalent, so this sample should be considered mutation adequate.
The submission with perfect code coverage and the worst mutation score managed to kill 10 of its 26 total mutations, reaching a mutation score of 38.46%.
The distribution seems to be very similar to the distributions seen in the previous test sets (Figures 3 and 5). The Pearson product-moment correlation coefficient for the dataset is ρ ≈ 0.6034, indicating a clear positive correlation.
5. Analysis of Weak Test Sets
We manually examined the bottom four samples by mutation score among the samples that had reached perfect coverage. Our initial assumption was that these should be of poor quality, having badly tested or untested functionality. We were not investigating other aspects of test quality, such as style and structural considerations. We also examined the percentage of the teacher's tests passing. It should be noted that all students' tests had to pass on their own implementation for us to be able to run the mutation analysis.
5.1 Test set A
All the examined samples had significant problems. For example, none of them tested printInorder at all, although it was fully covered by the tests. Table 3 summarizes our findings from test set A.
Name     Mutants generated    Mutants detected    Score      Teacher's tests passing
weak 1   42                   23                  54.76%     57%
weak 2   54                   30                  55.56%     86%
weak 3   45                   26                  57.78%     71%
weak 4   49                   29                  59.18%     100%

Table 3. Samples with perfect code coverage and bad mutation score of the test set A
5.2 Test set B
All the samples had plenty of untested functionality. Three of the analyzed test sets had only one trivial (although meaningful) assertion generated by a student. Numerical results from this test set are in Table 4. It should be noted that the most important part of this assignment was to test code that was given; this is why all of the teacher's tests pass.
Name     Mutants generated    Mutants detected    Score      Teacher's tests passing
weak 1   89                   38                  42.70%     100%
weak 2   104                  49                  47.12%     100%
weak 3   107                  52                  48.60%     100%
weak 4   90                   53                  58.89%     100%

Table 4. Samples with perfect code coverage and bad mutation score of the test set B
5.3 Test set C
Unlike in the other test sets, the samples in test set C were not all of bad quality. The quality of the tests in samples 1 and 2 was poor and comparable to test sets A and B. However, the tests of samples 3 and 4 were significantly better; it can be argued that they only left some corner cases untested, but redundant code caused a large number of mutants to be equivalent to the original. Quantitative data from the samples are presented in Table 5.
Name     Mutants generated    Mutants detected    Score      Teacher's tests passing
weak 1   26                   10                  38.46%     100%
weak 2   23                   11                  47.83%     100%
weak 3   25                   15                  60.00%     86%
weak 4   21                   13                  61.76%     100%

Table 5. Samples with perfect code coverage and bad mutation score of the test set C
6. Discussion and Conclusions
In the following two subsections we answer our first research question: what are the possible strengths (Section 6.1) and weaknesses (Section 6.2) of mutation analysis when compared to code coverage based metrics? In Section 6.4 we answer our second question: can mutation analysis be used to give meaningful grading on student-provided test suites requested in programming assignments?
6.1 Strengths
Automatically assessed exercises are often criticized for not being creative enough. Assessing the functionality of the solution with unit tests written by the teacher implies exercises where students are given the structure of the code. Greening, for example, argues [5, pp. 53–54]:

    Usually, however, the tasks required of the student are highly structured and meticulously synchronized with lectures, and are of the form that asks the student to write a piece of code that satisfies a precise set of specifications created by the instructor. [...] Although some practical skills are certainly gained, the exercise is essentially one of reproduction.

Mutation analysis combines the correctness of the program to be tested (i.e. it can only be applied when the tests pass) and the adequacy of the tests. This lessens the need for unit tests written by the teacher and allows more open-ended assignments.
In Section 3.1, we described other approaches and assessment tools that also evaluate a test set's ability to detect faulty programs. The benefit of mutation analysis over competitions where students' assignments are executed against each other is the ability to give immediate feedback. Immediate feedback would also be possible if faulty programs were generated beforehand by the teacher, as discussed in Section 3.1. However, manual generation of the mutants would prevent automatically assessing more open-ended assignments, which mutation analysis could handle. The qualitative analysis in Section 5 implies that mutation analysis is an effective approach for semi-automatic assessment. It could be used with systems like Web-CAT and ASSYST to post-process the submissions and to identify students that may be trying to fool the assessment system. These submissions could then be manually assessed to ensure this is not the case.
6.2 Weaknesses
One weakness of mutation analysis is that coverage results are easier to interpret and are therefore simpler to use as feedback and assessment criteria with students. While approaches where mutants are used as counterexamples of weak test sets exist, they should be tested in a real course setting to find the best way to apply them.
Complex Solutions can be Over-weighted
Complex code creates many mutants, and redundant code can cause large numbers of equivalent mutants. This can cause unfairly low mutation scores, but it can also be used to cheat automatic assessment based on mutation analysis.
When methods contain redundant code or are very complicated, the number of mutants blows up. This can lead to a situation where a significant portion of the mutants come from a small piece of untested functionality. This implies that the penalty for not testing that specific functionality gets too high, as demonstrated in Section 5.3.
If students realize that complex code creates many mutants, they may try to fool the mutation analysis system by seeding irrelevant code into their submissions. For example, Listing 4 simply computes the function f(x) = x + 63, but the way it is written blows up the number of mutants. This will distort the mutation score. The large number of analyzed mutants can also result in the grading system performing poorly.
    public static int dummy(int x) {
        x+=3; x+=3; x+=3; x+=3; x+=3; x+=3; x+=3;
        x+=3; x+=3; x+=3; x+=3; x+=3; x+=3; x+=3;
        x+=3; x+=3; x+=3; x+=3; x+=3; x+=3; x+=3;
        return x;
    }

Listing 4. Sample of an easily testable dummy method that yields a large number of mutants and can be thoroughly tested with assertEquals(63, dummy(0));.
The number of mutants generated per submission should also be monitored, as it can indicate this kind of cheating, malicious intent in trying to cause the system to perform poorly, or simply an over-complicated solution on which the student should get feedback.
6.3 Testing Unspecified Behavior
Even with assignments where an exact interface to implement is provided, some details of the implementation can be left unspecified. For example, how to use the return values of methods can be left to the students. For students, leaving such unspecified behavior untested is natural. However, mutation analysis penalizes this, as mutants are also generated from the unspecified behavior. This forces students to specify the otherwise unspecified features through their tests.
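As a hypothetical illustration (our own example, not one of the course assignments), consider an insert method whose boolean return value for duplicate insertions is left unspecified by the assignment; a mutant that flips that value stays alive until a test pins the value down:

    import java.util.HashSet;
    import java.util.Set;

    // The assignment asks for insert(value) but never fixes what it should
    // return when the value is already present.
    class IntSet {
        private final Set<Integer> values = new HashSet<Integer>();

        public boolean insert(int value) {
            if (values.contains(value)) {
                return false; // unspecified case
            }
            values.add(value);
            return true;
        }
    }

    // A mutant that changes the unspecified branch to "return true" survives
    // unless a student's test asserts something like
    //     IntSet s = new IntSet();
    //     s.insert(5);
    //     assertFalse(s.insert(5));
    // i.e. the test has to pin down behavior the assignment never specified.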
6.4 Mutation Analysis in Grading
We conclude that mutation analysis can reveal tests that were created to fool the assessment system. Preliminary results indicate that mutation analysis can provide valuable feedback on how well students have tested their software. While the information is most easily interpreted and used by a teacher, the results could be valuable to the students as well. However, to verify this, a follow-up study where students get feedback based on the mutation analysis is needed. An interesting question is whether students will fool the mutation analysis just like they do the coverage.
We should also keep in mind that the mutation score is not independent of the implementation. Thus, if the objective is to give separate grades for tests and implementation, raw mutation scores are not the best option, as they are not commensurable. We assume that it would be possible to set an exercise-specific threshold to identify certainly poor or suspicious work. However, this is where more research is needed.
Although many students submit their work just before the deadline, we expect mutation analysis to scale up and not to be computationally too expensive. For example, analysing a single submission in test set A took 12 seconds on average (see Section 4.1).
7. Future Research
In the future, we would like to see mutation analysis being used to provide formative feedback, i.e. feedback for learning. For example, if mutants are generated on the source code level, mutated programs that the student's tests failed to detect could be provided as feedback. However, how to select which of the live mutants to show is an interesting research problem.
Testing can be made more effective by writing code that is easy to test. There are rules for writing testable code and metrics for measuring testability. One such metric is provided by a tool called TestabilityExplorer⁵ [7]. In the future, we plan to take our data set and apply traditional code coverage, mutation score and TestabilityExplorer to understand how these three different metrics are related to each other. Follow-up studies to see how students behave when the immediate feedback they get is based on mutation analysis and/or testability are needed.

⁵ http://code.google.com/p/testability-explorer/
References
[1] R. A. DeMillo, R. J. Lipton, and F. G. Sayward. Hints on
test data selection: Help for the practicing programmer. IEEE
Computer, 11:34–41, 1978.
[2] S. H. Edwards. Rethinking computer science education from
a test-first perspective. In Companion of the 18th annual
ACM SIGPLAN conference on Object-oriented programming,
systems, languages, and applications, Anaheim, California,
USA, 26–30 October, pages 148–155. ACM, New York, NY,
USA, 2003. ISBN 1-58113-751-6.
[3] S. Elbaum, S. Person, J. Dokulil, and M. Jorde. Bug hunt:
Making early software testing lessons engaging and afford-
able. In ICSE ’07: Proceedings of the 29th international con-
ference on Software Engineering, pages 688–697, Washing-
ton, DC, USA, 2007. IEEE Computer Society.
[4] M. H. Goldwasser. A gimmick to integrate software testing
throughout the curriculum. SIGCSE Bull., 34(1):271–275,
2002. ISSN 0097-8418.
[5] T. Greening. Emerging constructivist forces in computer sci-
ence education: Shaping a new future. In T. Greening, editor,
Computer science education in the 21st century, pages 47–80.
Springer Verlag, 1999.
[6] M. Hauswirth, D. Zaparanuks, A. Malekpour, and M. Keikha.
The javafest: a collaborative learning technique for java pro-
gramming courses. In PPPJ ’08: Proceedings of the 6th inter-
national symposium on Principles and practice of program-
ming in Java, pages 3–12, New York, NY, USA, 2008. ACM.
ISBN 978-1-60558-223-8.
[7] M. Hevery. Testability explorer: using byte-code analysis to
engineer lasting social changes in an organization’s software
development process. In OOPSLA Companion ’08: Com-
panion to the 23rd ACM SIGPLAN conference on Object-
oriented programming systems languages and applications,
pages 747–748, New York, NY, USA, 2008. ACM.
[8] D. Jackson and M. Usher. Grading student programs using
ASSYST. In Proceedings of 28th ACM SIGCSE Symposium
on Computer Science Education, pages 335–339, 1997.
[9] Y.-S. Ma, J. Offutt, and Y. R. Kwon. Mujava: an automated
class mutation system: Research articles. Softw. Test. Verif.
Reliab., 15(2):97–133, 2005. ISSN 0960-0833.
[10] B. Marick. How to misuse code coverage. In Proceedings
of the 16th International Conference on Testing Computer
Software, pages 16–18, 1999.
[11] W. Marrero and A. Settle. Testing first: emphasizing testing
in early programming courses. In ITiCSE ’05: Proceedings
of the 10th annual SIGCSE conference on Innovation and
technology in computer science education, pages 4–8, New
York, NY, USA, 2005. ACM. ISBN 1-59593-024-8.
[12] D. Schuler and A. Zeller. Javalanche: efficient mutation test-
ing for java. In ESEC/FSE ’09: Proceedings of the 7th joint
meeting of the European software engineering conference and
the ACM SIGSOFT symposium on The foundations of soft-
ware engineering on European software engineering confer-
ence and foundations of software engineering symposium,
pages 297–298, New York, NY, USA, 2009. ACM. ISBN 978-
1-60558-001-2.
[13] D. Schuler, V. Dallmeier, and A. Zeller. Efficient mutation
testing by checking invariant violations. In ISSTA ’09: Pro-
ceedings of the eighteenth international symposium on Soft-
ware testing and analysis, pages 69–80, New York, NY, USA,
2009. ACM. ISBN 978-1-60558-338-9.
[14] J. Spacco and W. Pugh. Helping students appreciate test-
driven development (tdd). In OOPSLA ’06: Companion to the
21st ACM SIGPLAN symposium on Object-oriented program-
ming systems, languages, and applications, pages 907–913,
New York, NY, USA, 2006. ACM. ISBN 1-59593-491-X.