Mutation Analysis vs. Code Coverage in Automated Assessment of Students' Testing Skills

Kalle Aaltonen
kalle.aaltonen@gmail.com

Petri Ihantola, Otto Seppälä
Aalto University, Finland
{petri,oseppala}@cs.hut.fi
Abstract
Learning to program should include learning about proper software testing. Some automatic assessment systems, e.g. Web-CAT, allow assessing student-generated test suites using coverage metrics. While this encourages testing, we have observed that students can sometimes get rewarded for high coverage although their tests are of poor quality. Exploring alternative methods of assessment, we have tested mutation analysis to evaluate students' solutions. Initial results from applying mutation analysis to real course submissions indicate that mutation analysis could be used to fix some problems of code coverage in the assessment. Combining both metrics is likely to give more accurate feedback.
Categories and Subject Descriptors: K.3.2 [Computer and Information Science Education]: Computer science education

General Terms: Experimentation, Measurement, Human Factors

Keywords: automated assessment, testing, programming assignments, test coverage, mutation analysis, mutation testing
1. Introduction
Students taking introductory programming classes are not usually accustomed to performing their own testing. As a result, they tend to focus on the correctness of the output as specified in the assignment and little else. If the program performs unexpectedly, some manual testing is often done to locate the bug instead of using more systematic approaches. We have observed this effect especially when feedback from automated assessment is available. Spacco and Pugh made similar
observations and suggest giving detailed feedback only after students have also tested the code themselves [14].
Some automated assessment tools allow grading of student tests, making it worthwhile for the students to test. At Aalto University (formerly Helsinki University of Technology) we have used automated assessment of programming assignments at least since 1994 and have assessed students' self-written unit tests with Web-CAT¹ [2] since 2006. In Web-CAT, the assessment of student-provided tests is based on the percentage of the student's self-defined tests passing and the structural code coverage (i.e. statement or branch coverage) of these tests. While coverage provides information for the tester about possible places for improvement, it might not tell the whole story, because good code coverage does not automatically guarantee proper test adequacy. It is well known in industry that developers can misuse code-coverage-based test adequacy metrics to create a false sense of well-tested software [10]. Not too surprisingly, we have observed some students doing just the same to please the automated assessment system.
Although Web-CAT performs static analysis to ensure that tests include assertions, good code coverage can be achieved without sufficiently strong assertions, or even by checking the assertions before running the code that was to be tested. For example,

    assertTrue(1 < 2); fibonacci(6);
    assertTrue(fibonacci(6) >= 0);
    assertEquals(8, fibonacci(6));

all achieve the same code coverage in automated coverage analysis, although their ability to tell how well the fibonacci method works is quite different. It is even possible that some students do not see a problem in the first and second examples, which would be even more worrying. In any case, there are students who follow approaches similar to all of the examples above. When students are rewarded for the code coverage of their tests, they seem to forget the true reason why tests are written.
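Written out as full JUnit test methods, the difference becomes easier to see. The following is a hypothetical sketch: the class name, the test names, and the stand-in fibonacci implementation are ours and are not taken from the course material. All three tests earn the same statement and branch coverage, but only the last one actually constrains the result of fibonacci(6).

    import static org.junit.Assert.assertEquals;
    import static org.junit.Assert.assertTrue;
    import org.junit.Test;

    public class FibonacciCoverageTest {

        @Test
        public void trivialAssertionThenCall() {
            assertTrue(1 < 2);              // always true, checks nothing about the code
            fibonacci(6);                   // executed only to gain coverage
        }

        @Test
        public void weakAssertion() {
            assertTrue(fibonacci(6) >= 0);  // passes for many incorrect implementations
        }

        @Test
        public void strongAssertion() {
            assertEquals(8, fibonacci(6));  // pins down the expected value
        }

        // Stand-in implementation so that the sketch compiles on its own.
        private static int fibonacci(int n) {
            int curr = 1, prev = 0;
            for (int i = 0; i < n; i++) {
                int temp = curr;
                curr = curr + prev;
                prev = temp;
            }
            return prev;
        }
    }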
¹ http://web-cat.cs.vt.edu/
2. Research Problem and Method
To tackle the problem of poor quality in students' written tests, we decided to seek alternative metrics to evaluate test adequacy.
Mutation analysis is a well-known technique performed on a set of unit tests by seeding simple programming errors into the program code to be tested. Each combination of errors applied to the code creates what is called a mutant. These "mutants" are generated systematically in large quantities, and the examined test suite is run on each of them. The theory is that a test suite that detects more of the generated defective programs is better than one that detects fewer [1]. This makes mutation testing an interesting candidate for use in automatic assessment of student tests. In this paper we apply mutation analysis to real course data to study how this method would perform in an educational setting.
The exact research questions we address are:

Q1: What are the possible strengths and weaknesses of mutation analysis when compared to code coverage based metrics?

Q2: Can mutation analysis be used to give meaningful grading on student-provided test suites requested in programming assignments?
The suitability of mutation testing for test suite assessment was evaluated with data from the Helsinki University of Technology Intermediate Courses in Programming L1 and T1, held in Fall 2008. These identical courses both teach object-oriented programming in Java and are worth 6 ECTS credits². They have a 5 ECTS-credit CS1 course as their prerequisite. Automatic assessment is used in all of these courses, but students are not required to do unit testing in the prerequisite course. The exercises count for 30 percent of the course grade. All exercises were originally assessed with Web-CAT. The number of resubmissions was not limited. No mutation testing was used when the course was given.
The research method we applied was to compare test coverage to mutation scores. Both scores were calculated from existing student solutions to three programming assignments. Mutation scores were calculated using Javalanche [12]. During the course the students were "traditionally" awarded points for test coverage. We further investigated submissions that got full points from the coverage but performed poorly in the mutation analysis. In addition, for test set A, we evaluated the effect of different solutions by calculating the mutation score of each submitted test suite against all submitted solutions.
Generating and testing mutants requires processing time. Providing instant feedback using mutation analysis in assessment requires keeping the number of mutants at a reasonable level. Minimum, maximum and average counts of mutants are given for each of the test sets.

² European Credit Transfer and Accumulation System
3. Related Research
3.1 Automated Assessment of Testing Skills
Both code coverage based metrics and the ability to find faulty implementations are used to automatically evaluate students' tests. Some of the related research conducted before 2006 is summarized in [14].
Assessing the Code Coverage
ASSYST [8] and Web-CAT [2] are perhaps the most widely used tools that combine automated assessment of correctness and tests. Grading in Web-CAT is based on three factors: the percentage of the teacher's tests passing, the percentage of the student's own tests passing, and the code coverage percentage of the student's tests. In ASSYST, statement coverage can affect a grade that is originally based on testing correctness with the teacher's tests. Still another system, Marmoset, has been modified to take students' tests into account when grading and giving feedback [14]. By default, Marmoset has the tests grouped into two sets. Feedback from public tests, including the test definitions, is given immediately after submission. After the public tests pass, the students can ask for the release tests to be executed. Feedback from the release tests is both limited and delayed. An enhanced version of Marmoset investigates both the code coverage of the release tests and the student's own tests. As an incentive to test thoroughly, information about a release test is provided only if the student's own tests cover the same code as the release test covers.
Assessing the Ability to Find Bugs
Goldwasser [4] describes an idea where each student on a course provides both a program and a test set, all combinations of which are tried together. Test sets that reveal a lot of bugs and programs that pass a lot of test sets are both rewarded. It is also possible for the staff to seed a faulty implementation into the competition. This also makes it possible to give immediate feedback, as students do not need to wait until the exercise deadline when the competition can be run. Moreover, this allows students to learn from their mistakes. Other papers with a similar competition concept include e.g. [6, 11]. Elbaum et al. have created BugHunt [3], a web-based tutorial where students write unit tests to reveal problems in given programs.
3.2 Mutation Analysis
Figure 1 explains the process of mutation analysis. The inputs of the process are the program to be tested and the test suite to be evaluated. In the next phase mutants are generated from the program, and each mutant is tested with the test set. If a mutant fails the testing, it is killed. If it passes all tests, it is called a live mutant. Live mutants are examined by hand in the next phase and split into equivalent and non-equivalent mutants. For example, Listings 2 and 3 are mutants of Listing 1. Note that Listing 2 is functionally identical to the original (i.e. equivalent) whereas Listing 3 is not.
    int normFib(int N) {
        int curr = 1, prev = 0;
        for (int i = 0; i < N; i++) {
            int temp = curr;
            curr = curr + prev;
            prev = temp;
        }
        return prev;
    }

Listing 1. Original

    int equalFib(int N) {
        int curr = 1, prev = 0;
        for (int i = 1; i <= N; i++) {
            int temp = curr;
            curr = curr + prev;
            prev = temp;
        }
        return prev;
    }

Listing 2. Equivalent mutant

    int mutFib(int N) {
        int curr = 1, prev = 0;
        for (int i = 0; i <= N; i++) {
            int temp = curr;
            curr = curr + prev;
            prev = temp;
        }
        return prev;
    }

Listing 3. Non-equivalent mutant
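As a minimal illustration of how a test interacts with these listings, the following JUnit sketch kills the non-equivalent mutant of Listing 3 with a single assertion, while no test can ever kill the equivalent mutant of Listing 2. The class and test names are ours, and Listing 1 is copied into the test class only to keep the sketch self-contained.

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    public class FibonacciTest {

        // Copy of Listing 1; in the real setup the method under test lives in
        // the student's solution class.
        static int normFib(int N) {
            int curr = 1, prev = 0;
            for (int i = 0; i < N; i++) {
                int temp = curr;
                curr = curr + prev;
                prev = temp;
            }
            return prev;
        }

        // For N = 3, the original and the equivalent mutant (Listing 2) both
        // return 2, while the off-by-one mutant (Listing 3) returns 3. This
        // assertion therefore kills the non-equivalent mutant but can never
        // distinguish the equivalent one from the original.
        @Test
        public void returnsThirdFibonacciNumber() {
            assertEquals(2, normFib(3));
        }
    }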
[Figure 1. Mutation analysis process: from the input program P and test suite T, mutants M are created from P. If T(M) fails, the mutant M is killed. If T(M) passes, M is either marked as equivalent to P or T is amended to detect M.]
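The automated part of this process can be sketched roughly as the following loop. This is our own illustrative Java, not the API of Javalanche or any other tool; the Mutant and TestSuite types are hypothetical placeholders, and the manual equivalence classification is left out because an automated tool only sees killed versus live mutants.

    import java.util.List;

    // Minimal sketch of the automated part of Figure 1.
    class MutationAnalysisSketch {
        static double estimatedScore(List<Mutant> mutants, TestSuite tests) {
            int killed = 0;
            for (Mutant m : mutants) {
                // A mutant is killed as soon as at least one test fails on it.
                if (!tests.allPass(m)) {
                    killed++;
                }
            }
            // Live mutants would still need manual inspection to separate the
            // equivalent ones; the estimate simply divides by all mutants.
            return (double) killed / mutants.size();
        }
    }

    // Hypothetical helper types, present only so the sketch compiles.
    interface Mutant {}
    interface TestSuite { boolean allPass(Mutant m); }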
The effectiveness of a test set in mutation analysis is measured by a mutation score. This is normally defined as the percentage of non-equivalent mutants killed by the test set. Automated tools often estimate it as the percentage of killed mutants out of all generated mutants, i.e. treating every live mutant as non-equivalent. The latter estimate is provided by most mutation tools and is also what we have used in this paper.
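Restated as formulas, with killed, mutants and equivalent denoting the corresponding counts (the notation is ours, not from the tools):

    \text{mutation score (exact)} = \frac{\text{killed}}{\text{mutants} - \text{equivalent}},
    \qquad
    \text{mutation score (estimate)} = \frac{\text{killed}}{\text{mutants}}

Since the denominator of the estimate is never smaller, the estimate never exceeds the exact score. With the numbers reported later for the best submission in test set A (48 of 49 mutants killed, the single live mutant being equivalent), the estimate is 48/49 ≈ 97.96% while the exact score would be 48/48 = 100%.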
Mutating Java Programs
Mutants can be generated by modifying a program on different levels, from machine code to interpreted languages with a high abstraction level. Current mutation analysis tools for the Java language generate mutants either from the Java source code (e.g. µJava [9]) or from the intermediary bytecode executed by the Java Virtual Machine (e.g. Javalanche [12]), as illustrated in Figure 2. There are pros and cons to both source-code-level and bytecode-level mutants:

- Each examined source code mutant has to be compiled, which is slow.
- Bytecode mutants are difficult to examine afterwards, as it is not always possible or straightforward to generate Java source for the mutated bytecode.
- The compiler can eliminate dead code, which in theory can result in fewer equivalent mutants among source code mutants: if the mutation operation targets a part of the code that is deemed dead by the compiler, the resulting bytecode will be identical to the original.
- Some more advanced operators are significantly easier to implement on the Java source level than in bytecode.
[Figure 2. Mutant generation on the Java architecture: source-level mutation operates before the Java compiler (e.g. if (i > 5) becomes if (i < 5)), whereas bytecode mutation operates on the compiled code executed by the Java Virtual Machine (e.g. if_icmplt becomes if_icmpgt).]
µJava³ is a well-known mutation tool for Java, developed since 2003. µJava's mutation operations fall into two distinct classes: method-level mutation operators and class mutation operators. Class-level operations are related to encapsulation, inheritance, polymorphism, and some Java-specific features (e.g. adding or removing keywords like this and static). Method-level operations, presented in Table 1, are very generic and can be applied to other languages.
Javalanche⁴ is a simple and effective bytecode-level mutation analysis tool. It replaces numerical constants (x → x + 1 | x − 1 | 0 | 1), negates jump conditions, omits method calls and replaces arithmetic operators. It has no advanced mutation operators related to visibility, inheritance and polymorphism of the kind µJava has.
³ http://cs.gmu.edu/~offutt/mujava/
⁴ http://www.st.cs.uni-saarland.de/~schuler/javalanche/
Arithmetic operators: replace, add, and remove unary and binary arithmetic operators (+, -, /, *, ++, --) for both integer and floating point operands. Example: x+1, with possible mutants x-1, x*1, x/1.

Relational operators: replace the comparison operators (>, >=, <, <=, ==, !=) within the program. Example: x==1, with possible mutants x!=1, x>=1, ...

Conditional operators: replace, insert and remove conditional operators (&&, ||, !). The bitwise operators &, |, and ^ are also used as replacements for these operators, as they are very common mistakes. Example: x||!y, with possible mutants x|!y, x&&!y, x||y, ...

Shift operators: replace the bit-wise shifting operators (<<, >>, >>>). Example: x>>1, with possible mutants x<<1, x>>>1.

Bitwise operators: replace, add and remove the four bitwise operators (&, |, ^, ~). Example: x&~y, with possible mutants x|~y, x&y, x^~y.

Assignment operators: replace the convenience assignment operators provided by Java with another (+=, -=, *=, /=, %=, &=, |=, ^=, <<=, >>=, >>>=). Example: x+=2, with possible mutants x-=2, x*=2, x/=2, ...

Table 1. Method-level mutation operators in µJava. Each entry lists an original expression followed by possible mutants generated by applying the operator to it.
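To make the operators concrete, the sketch below shows one small boolean method and three hand-written mutants in the style of Table 1. These were written by us for illustration and are not actual µJava output.

    // Illustrative method-level mutants; class and method names are ours.
    class MutantExamples {
        // Original method.
        static boolean inRange(int x) {
            return x >= 0 && x < 10;
        }

        // Relational-operator mutant: >= replaced by >.
        static boolean inRangeMutant1(int x) {
            return x > 0 && x < 10;
        }

        // Conditional-operator mutant: && replaced by ||.
        static boolean inRangeMutant2(int x) {
            return x >= 0 || x < 10;
        }

        // Conditional-to-bitwise mutant: && replaced by &. Because neither
        // operand has side effects, this mutant happens to be equivalent to
        // the original, illustrating how equivalent mutants arise.
        static boolean inRangeMutant3(int x) {
            return x >= 0 & x < 10;
        }
    }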
Javalanche has been successfully used to run mutation analysis on AspectJ, a large open source Java project with almost 100 thousand lines of code, in under six hours on a single workstation [13].
4. Test Coverage vs. Mutation Score
We analyzed the final submission from each student to three assignments, called test sets A, B, and C later on. We failed to apply mutation analysis successfully to some submissions because:
1. The submission did not compile successfully.
2. The test suite did not pass on the unmutated program.
3. Individual tests were not repeatable and independent of the execution order (a sketch of such a test suite follows below).
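The third case typically looks like the following hypothetical JUnit example (our own construction, not taken from the course data), where two tests share static state and therefore only pass in one execution order:

    import static org.junit.Assert.assertEquals;
    import java.util.ArrayList;
    import java.util.List;
    import org.junit.Test;

    public class OrderDependentTest {
        // Shared fixture that is never reset between tests.
        private static final List<Integer> shared = new ArrayList<Integer>();

        @Test
        public void addsOneElement() {
            shared.add(42);
            assertEquals(1, shared.size());
        }

        @Test
        public void assumesPreviousTestRan() {
            // Fails when run alone or before addsOneElement(), because the
            // shared list is still empty at that point.
            assertEquals(42, (int) shared.get(0));
        }
    }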
This implies that only submissions where all the student's tests passed are analyzed. Table 2 summarizes how many of the submissions were successfully analyzed and how many mutants, on average, were generated from each assignment.
                Mutation analysis                      Generated mutants (per submission)
Name    All     Applicable      Avg. mutation score    min     max     avg
Set A   158     131 (83.0%)     80.5%                  22      90      44.4
Set B   187     174 (90.0%)     73.7%                  79      439     106.9
Set C   193     169 (87.6%)     84.9%                  12      93      26.4

Table 2. Summary of the test sets used.
4.1 Test set A - Binary search trees
In this exercise the students were instructed to implement a binary search tree by extending an existing binary tree implementation through inheritance. As with all our exercises, unit-testing the solution was required and code coverage was used as the measure of completeness.
Of the 131 analyzed submissions, 125 achieved perfect code coverage. This is an expected result, as the students were rewarded for reaching perfect coverage, and in the case of this assignment it was relatively easy to achieve.
The mutation analysis yielded an average of 44 mutations per sample, and it took on average about 12 seconds to run per sample. The best work managed to kill 48 of its 49 mutants, resulting in a 97.96% mutation score. The one remaining live mutant was identical to the original on the Java source code level and thus unkillable, so this can be considered a perfect score. On average the mutation score was 80.48%, and the worst was 40%. The worst mutation score among submissions that had reached perfect code coverage was 54.76%. There were several samples where the tested method could be completely commented out and the test suite would fail to detect this. This would seem to support our assumption that mutation analysis offers better capabilities for identifying weak test sets.
Figure 3 illustrates the relationship between mutation score and test coverage in this test set as a scatter plot. Histograms on each axis show the distribution of the respective variables. Code coverage (on the X-axis) was 100% for most submissions. This also explains why the correlation between the variables is small (ρ ≈ 0.1628).
Effect of the Implementation on the Generation of Mutants
The binary tree assignment was more restrictive than any of the others. Not only were the students told which methods to implement, but they were also not allowed to declare any additional instance variables. This was checked automatically. These restrictions made it possible to combine implementations and tests of different students.
In order to show how much variation the implementation itself (excluding the tests) causes in the mutation score, we selected four example test suites from the ones that had achieved 100% test coverage: the worst, the best, and two random ones. We ran mutation analysis on all the previously analyzed submissions with each of these test suites, and the results are shown in Figure 4. Each box plot presents one of the four selected test suites. Labels on the X-axis are the original mutation scores, and the box plot visualizes how the mutation score varied when the test suite was executed against the implementations of all the other students.
If we are to use the mutation score as an indication of the adequacy of a test set, the score should not be affected by the implementation. However, the implementation does affect the number of equivalent mutants created, which makes the mutation scores of two test suites on two different implementations incomparable. It should be noted that in Figure 4 the distributions of the first two test suites seem to be very similar even though their original scores are very different. This is something that needs to be taken into account if the students are ever rewarded on the basis of their mutation score.
4.2 Test set B - Hashing
This was the first exercise on the course and its main idea was to acquaint the students with unit testing. The code to be tested was a pre-implemented hash table using double hashing. The students only had to add to it a method for finding prime numbers; other than that, the exercise was all about testing.
[Figure 3. Scatter plot of code coverage (X-axis) and mutation score (Y-axis) of the test set A.]
[Figure 4. Distribution of mutation scores for the selected test suites as box plots, showing the minimum, maximum, and the 90th and 10th percentiles. X-axis labels are the mutation scores achieved when running the mutation analysis on each suite's own implementation: Best Suite 98.0%, Random Suite 1 85.4%, Random Suite 2 72.0%, Worst Suite 54.8%.]
Of the 174 analyzed submissions, 135 achieved perfect code coverage. This exercise was more complex than test set A, as it yielded on average 106 mutations per sample. Mutation scores of the submissions that achieved a perfect coverage score ranged from 42.7% to 89.39%, with 73.73% being the average. Mutation scores of the submissions that did not reach perfect coverage ranged from 24.44% to 85.96%, with 63.52% being the average.
The distribution and the relationship between code coverages and mutation scores can be seen in Figure 5.
The best sample reached 100% code coverage and managed to kill 100 out of 112 mutants.
We also examined the sample with the worst mutation score that had reached perfect code coverage: the sample managed to kill 38 of the 89 total mutations, reaching a mutation score of 42.70%. The student-generated portion of the test suite had only a single assertion, and half of the unit tests did not contain any assertions. The test suite is clearly inadequate despite having perfect branch coverage.
[Figure 5. Scatter plot of code coverage and mutation score of the test set B.]
The distribution seems to be very similar to the distribution seen in test set A (Figure 3), except that the number of samples is higher and there is more variance in the code coverages (σ ≈ 0.064). The Pearson product-moment correlation coefficient for the dataset is ρ ≈ 0.669, indicating a clear positive correlation, unlike in test set A. A major difference from test set A is the maximum mutation score, which was under 90% compared to the practically perfect mutation score achieved in A.
From some submissions we analyzed all mutants that were not killed. Not even the best student-generated test suite managed to kill all the non-equivalent mutants.
4.3 Test set C - Disjoint sets
In this exercise the students were instructed to build a simple union-find structure. The exercise allows extracting code shared by different methods into helper methods and has fairly simple recursive and iterative solutions.
Mutation analysis was successfully performed on 169 submissions, of which 144 achieved perfect coverage. Each submission yielded on average 26 mutations. The number of generated mutants ranged from 12 to 93. Mutation scores of the submissions that achieved a perfect coverage score ranged from 38.46% to 95.00%, with 84.88% being the average, which is the highest of all the data sets. Mutation scores of the submissions that did not reach perfect coverage ranged from 21.43% to 90.48%, with 67.84% being the average.
The distribution and the relationship between code coverages and mutation scores can be seen in Figure 6.
[Figure 6. Scatter plot of code coverage (X-axis) and mutation score (Y-axis) of the test set C.]
The best sample had 100% code coverage and killed 19 of its 20 mutations, reaching a mutation score of 95.00%; the remaining mutant was equivalent, so this sample should be considered mutation adequate.
The submission with perfect code coverage and the worst mutation score managed to kill 10 of its 26 total mutations, reaching a mutation score of 38.46%.
The distribution seems to be very similar to the distributions seen in the previous test sets (Figures 3 and 5). The Pearson product-moment correlation coefficient for the dataset is ρ ≈ 0.6034, indicating a clear positive correlation.
5. Analysis of Weak Test Sets
We manually examined the bottom four samples by mutation score among the samples that had reached perfect coverage. Our initial assumption was that these should be of poor quality, having badly tested or untested functionality. We were not investigating other aspects of test quality, such as style and structural considerations. We also examined the percentage of the teacher's tests passing. It should be noted that all students' tests had to pass on their own implementation for us to be able to run the mutation analysis.
5.1 Test set A
All the examined samples had significant problems. For example, none of them tested printInorder at all, although it was fully covered by the tests. Table 3 summarizes our findings from test set A.
Name     Mutants generated    Mutants detected    Score      Teacher's tests passing
weak 1   42                   23                  54.76%     57%
weak 2   54                   30                  55.56%     86%
weak 3   45                   26                  57.78%     71%
weak 4   49                   29                  59.18%     100%

Table 3. Samples with perfect code coverage and bad mutation score of the test set A
5.2 Test set B
All the samples had plenty of untested functionality. Three of the analyzed test sets had only one trivial (although meaningful) assertion generated by a student. Numerical results from this test set are in Table 4. It should be noted that the most important part of this assignment was to test code that was given; this is why all of the teacher's tests pass.
Name     Mutants generated    Mutants detected    Score      Teacher's tests passing
weak 1   89                   38                  42.70%     100%
weak 2   104                  49                  47.12%     100%
weak 3   107                  52                  48.60%     100%
weak 4   90                   53                  58.89%     100%

Table 4. Samples with perfect code coverage and bad mutation score of the test set B
5.3 Test set C
Unlike in the other test sets, the samples in test set C were not all of bad quality. The quality of the tests in samples 1 and 2 was poor and comparable to test sets A and B. However, the tests of samples 3 and 4 were significantly better; it can be argued that they only left some corner cases untested, but redundant code caused a large number of mutants to be equivalent to the original. Quantitative data from the samples are presented in Table 5.
Name     Mutants generated    Mutants detected    Score      Teacher's tests passing
weak 1   26                   10                  38.46%     100%
weak 2   23                   11                  47.83%     100%
weak 3   25                   15                  60.00%     86%
weak 4   21                   13                  61.76%     100%

Table 5. Samples with perfect code coverage and bad mutation score of the test set C
6. Discussion and Conclusions
In the following two subsections we answer our first research question: what are the possible strengths (Section 6.1) and weaknesses (Section 6.2) of mutation analysis when compared to code coverage based metrics? In Section 6.4 we answer our second question: can mutation analysis be used to give meaningful grading on student-provided test suites requested in programming assignments?
6.1 Strengths
Automatically assessed exercises are often criticized for not being creative enough. Assessing the functionality of the solution with unit tests written by the teacher implies exercises where students are given the structure of the code. Greening, for example, argues [5, pp. 53–54]:

    Usually, however, the tasks required of the student are highly structured and meticulously synchronized with lectures, and are of the form that asks the student to write a piece of code that satisfies a precise set of specifications created by the instructor. [...] Although some practical skills are certainly gained, the exercise is essentially one of reproduction.

Mutation analysis combines the correctness of the program to be tested (i.e. it can only be applied when the tests pass) and the adequacy of the tests. This lessens the need for unit tests written by the teacher and allows more open-ended assignments.
In Section 3.1, we described other approaches and assessment tools that also evaluate a test set's ability to detect faulty programs. The benefit of mutation analysis over competitions where students' assignments are executed against each other is the ability to give immediate feedback. Immediate feedback would also be possible if faulty programs were generated beforehand by the teacher, as discussed in Section 3.1. However, manual generation of the mutants would prevent automatically assessing more open-ended assignments, which mutation analysis could handle. The qualitative analysis in Section 5 implies that mutation analysis is an effective approach for semi-automatic assessment. It could be used with systems like Web-CAT and ASSYST to post-process the submissions and to identify students that may be trying to fool the assessment system. These submissions could then be manually assessed to ensure this is not the case.
6.2 Weaknesses
One weakness of mutation analysis is that coverage results are easier to interpret and are therefore simpler to use as feedback and assessment criteria with students. While approaches where mutants are used as counterexamples of weak test sets exist, they should be tested in a real course setting to find the best way to apply them.
Complex Solutions can be Over-weighted
Complex code creates many mutants, and redundant code can cause large numbers of equivalent mutants. This can cause unfairly low mutation scores, but it can also be used to cheat automatic assessment based on mutation analysis.
When methods contain redundant code or are very complicated, the number of mutants blows up. This can lead to a situation where a significant portion of the mutants come from a small piece of untested functionality. This implies that the penalty for not testing that specific functionality gets too high, as demonstrated in Section 5.3.
If students realize that complex code creates many mutants, they may try to fool the mutation analysis system by seeding irrelevant code into their submissions. For example, Listing 4 simply computes the function f(x) = x + 63, but the way it is written blows up the number of mutants. This will distort the mutation score. The large number of analyzed mutants can also result in the grading system performing poorly.
    public static int dummy(int x) {
        x+=3; x+=3; x+=3; x+=3; x+=3; x+=3; x+=3;
        x+=3; x+=3; x+=3; x+=3; x+=3; x+=3; x+=3;
        x+=3; x+=3; x+=3; x+=3; x+=3; x+=3; x+=3;
        return x;
    }

Listing 4. Sample of an easily testable dummy method that yields a large number of mutants and can be thoroughly tested with assertEquals(63, dummy(0));.
The number of mutants generated per submission should also be monitored, as it can indicate this kind of cheating, malicious intent in trying to cause the system to perform poorly, or simply an over-complicated solution on which the student should get feedback.
6.3 Testing Unspecified Behavior
Even with assignments where an exact interface to implement is provided, some details of the implementation can be left unspecified. For example, how to use the return values of methods can be left to the students. For students, leaving such unspecified behavior untested is natural. However, mutation analysis penalizes this, as mutants are also generated from the unspecified behavior. This forces students to specify the otherwise unspecified features through their tests.
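As a hypothetical illustration (our own example, not one of the course assignments), consider an insert method whose boolean return value for duplicate insertions is left unspecified by the assignment; a mutant that flips that value stays alive until a test pins the value down:

    import java.util.HashSet;
    import java.util.Set;

    // The assignment asks for insert(value) but never fixes what it should
    // return when the value is already present.
    class IntSet {
        private final Set<Integer> values = new HashSet<Integer>();

        public boolean insert(int value) {
            if (values.contains(value)) {
                return false; // unspecified case
            }
            values.add(value);
            return true;
        }
    }

    // A mutant that changes the unspecified branch to "return true" survives
    // unless a student's test asserts something like
    //     IntSet s = new IntSet();
    //     s.insert(5);
    //     assertFalse(s.insert(5));
    // i.e. the test has to pin down behavior the assignment never specified.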
6.4 Mutation Analysis in Grading
We conclude that mutation analysis can reveal tests that were created to fool the assessment system. Preliminary results indicate that mutation analysis can provide valuable feedback on how well students have tested their software. While the information is most easily interpreted and used by a teacher, the results could be valuable to the students as well. However, to verify this, a follow-up study where students get feedback based on the mutation analysis is needed. An interesting question is whether students will fool the mutation analysis just like they do the coverage.
We should also keep in mind that the mutation score is not independent of the implementation. Thus, if the objective is to give separate grades for tests and implementation, raw mutation scores are not the best option, as they are not commensurable. We assume that it would be possible to set an exercise-specific threshold to identify certainly poor or suspicious work. However, this is where more research is needed.
Although many students submit their work just before the deadline, we expect mutation analysis to scale up and not to be computationally too expensive. For example, analysing a single submission in test set A took 12 seconds on average (see Section 4.1).
7. Future Research
In the future, we would like to see mutation analysis being used to provide formative feedback, i.e. feedback for learning. For example, if mutants are generated on the source code level, mutated programs that the student's tests failed to detect could be provided as feedback. However, how to select which of the live mutants to show is an interesting research problem.
Testing can be made more effective by writing code that is easy to test. There are rules for writing testable code and metrics for measuring testability. One such metric is provided by a tool called TestabilityExplorer⁵ [7]. In the future, we plan to take our data set and apply traditional code coverage, mutation score and TestabilityExplorer to understand how these three different metrics are related to each other. Follow-up studies to see how students behave when the immediate feedback they get is based on mutation analysis and/or testability are needed.

⁵ http://code.google.com/p/testability-explorer/
References
[1] R. A. DeMillo, R. J. Lipton, and F. G. Sayward. Hints on
test data selection: Help for the practicing programmer. IEEE
Computer, 11:34–41, 1978.
[2] S. H. Edwards. Rethinking computer science education from
a test-first perspective. In Companion of the 18th annual
ACM SIGPLAN conference on Object-oriented programming,
systems, languages, and applications, Anaheim, California,
USA, 26–30 October, pages 148–155. ACM, New York, NY,
USA, 2003. ISBN 1-58113-751-6.
[3] S. Elbaum, S. Person, J. Dokulil, and M. Jorde. Bug hunt:
Making early software testing lessons engaging and afford-
able. In ICSE ’07: Proceedings of the 29th international con-
ference on Software Engineering, pages 688–697, Washing-
ton, DC, USA, 2007. IEEE Computer Society.
[4] M. H. Goldwasser. A gimmick to integrate software testing
throughout the curriculum. SIGCSE Bull., 34(1):271–275,
2002. ISSN 0097-8418.
[5] T. Greening. Emerging constructivist forces in computer sci-
ence education: Shaping a new future. In T. Greening, editor,
Computer science education in the 21st century, pages 47–80.
Springer Verlag, 1999.
[6] M. Hauswirth, D. Zaparanuks, A. Malekpour, and M. Keikha.
The javafest: a collaborative learning technique for java pro-
gramming courses. In PPPJ ’08: Proceedings of the 6th inter-
national symposium on Principles and practice of program-
ming in Java, pages 3–12, New York, NY, USA, 2008. ACM.
ISBN 978-1-60558-223-8.
[7] M. Hevery. Testability explorer: using byte-code analysis to
engineer lasting social changes in an organization’s software
development process. In OOPSLA Companion ’08: Com-
panion to the 23rd ACM SIGPLAN conference on Object-
oriented programming systems languages and applications,
pages 747–748, New York, NY, USA, 2008. ACM.
[8] D. Jackson and M. Usher. Grading student programs using
ASSYST. In Proceedings of 28th ACM SIGCSE Symposium
on Computer Science Education, pages 335–339, 1997.
[9] Y.-S. Ma, J. Offutt, and Y. R. Kwon. Mujava: an automated
class mutation system: Research articles. Softw. Test. Verif.
Reliab., 15(2):97–133, 2005. ISSN 0960-0833.
[10] B. Marick. How to misuse code coverage. In Proceedings
of the 16th International Conference on Testing Computer
Software, pages 16–18, 1999.
[11] W. Marrero and A. Settle. Testing first: emphasizing testing
in early programming courses. In ITiCSE ’05: Proceedings
of the 10th annual SIGCSE conference on Innovation and
technology in computer science education, pages 4–8, New
York, NY, USA, 2005. ACM. ISBN 1-59593-024-8.
[12] D. Schuler and A. Zeller. Javalanche: efficient mutation test-
ing for java. In ESEC/FSE ’09: Proceedings of the 7th joint
meeting of the European software engineering conference and
the ACM SIGSOFT symposium on The foundations of soft-
ware engineering on European software engineering confer-
ence and foundations of software engineering symposium,
pages 297–298, New York, NY, USA, 2009. ACM. ISBN 978-
1-60558-001-2.
[13] D. Schuler, V. Dallmeier, and A. Zeller. Efficient mutation
testing by checking invariant violations. In ISSTA ’09: Pro-
ceedings of the eighteenth international symposium on Soft-
ware testing and analysis, pages 69–80, New York, NY, USA,
2009. ACM. ISBN 978-1-60558-338-9.
[14] J. Spacco and W. Pugh. Helping students appreciate test-
driven development (tdd). In OOPSLA ’06: Companion to the
21st ACM SIGPLAN symposium on Object-oriented program-
ming systems, languages, and applications, pages 907–913,
New York, NY, USA, 2006. ACM. ISBN 1-59593-491-X.