
Preserving Knowledge in Software Projects

October 2012
Journal of Systems and Software 85(10)


Authors:

Omar Alam

Trent University

Bram Adams

Queen's University



Preserving Knowledge in Software Projects

Omar Alam (a), Bram Adams (b,∗), Ahmed E. Hassan (c)

(a) SEL, School of Computer Science, McGill University, Canada

(b) MCIS, Département de Génie Informatique et Génie Logiciel, École Polytechnique de Montréal, Canada

(c) SAIL, School of Computing, Queen’s University, Canada

Abstract

Up-to-date preservation of project knowledge like developer communication and de-

sign documents is essential for the successful evolution of software systems. Ideally,

all knowledge should be preserved, but since projects only have limited resources, and

software systems continuously grow in scope and complexity, one needs to prioritize

the subsystems and development periods for which knowledge preservation is more

urgent. For example, core subsystems on which the majority of other subsystems build are obvious prime candidates for preservation, yet if these subsystems change continuously, it becomes very hard to pick a development period in which to start preserving knowledge and to keep that knowledge up to date over time. This paper exploits the time dependence

between code changes to automatically determine for which subsystems and develop-

ment periods of a software project knowledge preservation would be most valuable. A

case study on two large open source projects (PostgreSQL and FreeBSD) shows that

the most valuable subsystems to preserve knowledge for are large core subsystems.

However, the majority of these subsystems (1) are continuously foundational, i.e., ide-

ally for each development period knowledge should be preserved, and (2) experience

substantial changes, i.e., preserving knowledge requires substantial effort.

Keywords: software maintenance; documentation; knowledge preservation; empirical

analysis; mining software repositories

1. Introduction

The global scale of today’s software development makes it very easy for project

teams to lose track of the context and knowledge about their systems, such as best

practices and design rationale [1, 2, 3]. Even for those projects that have preserved

system knowledge in the form of code comments [1], design documentation, manuals,

tutorials, comprehensive test suites, dedicated system experts, developer training [4]

and/or archives of relevant development artifacts, it is very challenging to keep this knowledge up-to-date [3]. This is problematic, since lack of up-to-date system knowledge has been identified as the second largest root cause of defects in software systems [4]. Preservation of knowledge is also crucial for decision processes and program understanding [5].

∗Corresponding author. Email addresses: omar.alam@mail.mcgill.ca (Omar Alam), bram.adams@polymtl.ca (Bram Adams), ahmed@cs.queensu.ca (Ahmed E. Hassan)

Manuscript accepted by Journal of Systems and Software, April 2, 2012

The reason why up-to-date system knowledge is often lacking is not just ignorance,

but rather that preserving all knowledge simply is infeasible because of the abundance

of knowledge that could be preserved. Software systems keep on growing in size and

complexity over time [6, 7], and this growth is typically accompanied by a growth in

the number of contributors, mailing list discussions and bug reports [8, 9]. Figuring out

for which subsystems preservation of knowledge would be the most valuable and sus-

tainable is complicated. Ideally, so-called “foundational” subsystems, i.e., subsystems

that have many other subsystems building and depending on their APIs (Application

Programming Interfaces), are prime candidates for preservation. Yet, blindly preserving knowledge of every development period of a foundational subsystem is neither feasible nor efficient, since for some subsystems all periods contain major changes that force

dependent subsystems to change, while for others hardly any period contains major

changes (e.g., only internal bug ﬁxes).

To support practitioners in prioritizing subsystems and development periods for

which knowledge should be preserved the most, it is important to consider the trade-off

between space, time and effort. For example, since HTML rendering is the core busi-

ness of web browsers, the HTML rendering subsystem is a foundational subsystem

on which many other subsystems build (space) and that continuously evolves (time)

because of major performance and functionality changes (effort). Other browser subsystems, like the SSL/TLS subsystem, have fewer subsystems building on them (space), and change less frequently (time) and less significantly (effort). The space and time dimensions determine the subsystem and its development periods that are most valuable

to preserve knowledge for, whereas the effort dimension determines how much new

knowledge potentially needs to be preserved. The HTML rendering subsystem is more

valuable to preserve knowledge for than the SSL/TLS subsystem, but requires preser-

vation of the knowledge of almost every development period, which takes signiﬁcant

effort given the many changes. Hence, in some cases practitioners could opt to preserve

knowledge for SSL/TLS instead.

To support the analysis of the space, time and effort dimensions of knowledge

preservation in real-life software systems, we propose an automated approach to iden-

tify foundational subsystems and periods for a project. The approach is based on our

earlier work on the time dependence of changes [10, 11]. Time dependence of changes

captures for each source code entity E in a particular revision of a software system the specific revision of all source code entities (such as API methods) on which E builds.

In this paper, we lift time dependence of changes up from revisions to development

periods and from entities to subsystems. This lifted version of time dependence allows

us to calculate for each subsystem in each development period:

• the “foundationality” (1), i.e., the degree to which other subsystems build on the subsystem;

• the “sporadicity”, i.e., how irregularly the foundationality of the subsystem is distributed across development periods;

• the number of source code changes to a subsystem, i.e., how much new knowledge is added (and potentially should be preserved).

(1) Although the word foundationality does not exist in the English dictionary, it has been used by philosophers and even by some computer scientists.

We apply our approach on two large open source systems (PostgreSQL and FreeBSD)

to address the following three research questions:

Q1 Which subsystems should have a higher priority for knowledge preservation?

A project develops few highly foundational subsystems that provide the project’s

core structure and hence should be preserved ﬁrst.

Q2 Which development periods should have a higher priority for knowledge preserva-

tion?

Most foundational subsystems are continuously foundational, which means that

ideally knowledge should be preserved for every development period.

Q3 How much effort is involved in preserving knowledge of foundational subsystems?

Foundationality of a subsystem in a development period correlates with the num-

ber of changes to the subsystem in that period. In other words, a lot of new

knowledge is added in foundational development periods, which requires more

effort to preserve.

Our work provides practitioners with an automatic approach to help them prioritize

which subsystems to preserve knowledge for. However, since the most foundational subsystems turn out to require substantial preservation effort (they experience most of the changes), more work is needed on prioritizing among the foundational subsystems for which knowledge should be preserved.

Organization of the Paper. The paper is organized as follows. Section 2 presents

our methodology based on the time dependence of changes. Section 3 presents the

three research questions that we study using our methodology. Section 4 explains the

setup of the two case studies that we performed, and discusses the case study results

for each research question. Section 5 discusses threats to validity, whereas Section 6

discusses related work. Section 7 summarizes our ﬁndings and concludes the paper.

2. Methodology

This section introduces the concepts used to measure the space, time and effort

dimensions of knowledge preservation for subsystems of a software project. Similar to

other work [6], a subsystem can be any logical (e.g., all functions collaborating on a

major feature) or physical (e.g., ﬁle system directories) collection of source code ﬁles.

Two of our measures, i.e., foundationality and sporadicity, are based on the concept

of time dependence between source code entities. This concept was introduced in our

previous work to track the progress of projects [10] and to detect foundational periods of software systems [11]. Here, we refine the concept in space by allowing specific subsystems to be foundational at different points in time, instead of the whole system at once. The remainder of this section first outlines the necessary background information on time dependence, then presents all three measures used in this paper as well as how we implemented them.

(a)

void sub1.f1(void){
sub2.f2();
sub3.f3();
}

(b)

void sub1.f1(void){
sub2.f2();
sub1.f4();
}

(c) [Diagram: change-level time dependence relations between Changes 1–7 across Periods 1–4.]

(d) [Diagram: the same relations lifted to subsystems Sub1, Sub2 and Sub3 across Periods 1–4.]

Figure 1: (a) and (b) show a source code snippet before and after source code change 7, respectively. The corresponding change-level time dependence relations are shown in (c). (d) lifts up the change-level time dependence relations to the subsystem level.

2.1. Time Dependence between Entities

Entity-level time dependence establishes time dependence relations between different revisions of source code entities (functions and variables). An entity E that is changed at time T builds (depends) on the most recent revision of all entities to which accesses or calls existed (possibly removed now) or were added at time T. Figure 1 illustrates entity-level time dependence using a small example that runs across four time

periods. Change 7 modiﬁes function f1() in Figure 1a by removing the call to f3()

and adding a call to f4(), resulting in Figure 1b. Hence, f1() in change 7 builds on

the last revision of f1() itself (Change 4), the last revision of all entities called before

Change 7 (f2() and f3()), and the last revision of newly called entities (f4()).

In this paper, we focus on time dependence in-the-large instead of on time dependence in-the-small. More specifically, we lift time dependence from the level of individual changes/revisions to the changes’ encompassing periods (such as quarters or years), and, more importantly, from time dependence between individual entities to time dependence between the subsystems to which those entities belong. Hence, time dependence has a notion of space (subsystems) and time (periods). Figure 1d shows how all change-level dependencies between entities in Figure 1c were lifted up to period-level dependencies between subsystems. Note that the “add” edge between “sub1.f1()” and “sub1.f4()” is lifted up into a self-edge. We can now define the concepts needed to study knowledge preservation.
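The lifting step can be illustrated with a minimal Python sketch. It is not the authors' tooling: the mapping of changes to periods (`CHANGE_PERIOD`) and the change numbers attached to `f2`, `f3` and `f4` are hypothetical values chosen only to mirror the narrative around change 7, and the helper names `subsystem_of` and `lift` are introduced here for illustration.

```python
from collections import Counter

# Hypothetical mapping of change id -> development period (assumed, not read from Figure 1c).
CHANGE_PERIOD = {1: 1, 4: 2, 5: 3, 6: 4, 7: 4}

# Change-level time dependence edges for change 7 of sub1.f1():
# ((dependent entity, its change), (entity built upon, change of its last revision)).
CHANGE_EDGES = [
    (("sub1.f1", 7), ("sub1.f1", 4)),  # previous revision of f1 itself
    (("sub1.f1", 7), ("sub2.f2", 1)),  # call that existed before change 7
    (("sub1.f1", 7), ("sub3.f3", 5)),  # call removed by change 7
    (("sub1.f1", 7), ("sub1.f4", 6)),  # call added by change 7
]


def subsystem_of(entity):
    """Return the subsystem prefix, assuming 'subsystem.function' names."""
    return entity.split(".", 1)[0]


def lift(change_edges, change_period):
    """Lift change-level edges to ((subsystem, period), (subsystem, period)) edges."""
    lifted = Counter()
    for (src_entity, src_change), (dst_entity, dst_change) in change_edges:
        src = (subsystem_of(src_entity), change_period[src_change])
        dst = (subsystem_of(dst_entity), change_period[dst_change])
        lifted[(src, dst)] += 1  # keep multiplicity of lifted edges
    return lifted


LIFTED = lift(CHANGE_EDGES, CHANGE_PERIOD)
```

Under these assumed periods, the edge from `sub1.f1` (change 7) to `sub1.f4` (change 6) becomes a self-edge on `("sub1", 4)`, matching the self-edge described in the text.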

2.2. Measures for Knowledge Preservation

Space Dimension: Foundationality. Foundational subsystems and development peri-

ods are subsystems and periods that have a large impact on the development of other

subsystems. The latter subsystems heavily access or call variables and functions devel-

oped or changed by a foundational subsystem in a foundational period. In other words,

the changes to the foundational subsystem in a foundational development period trig-

gered changes in many, possibly all dependent subsystems. As such, it is important to

preserve knowledge of the speciﬁc development that happened during this period. If

the subsystem’s change in this development period would not have triggered changes

in so many other subsystems, that development period would not be foundational for

the subsystem and would seem less critical to preserve knowledge for.

The foundationality of a subsystem S in a particular development period D is formally defined as the number of incoming time dependence relations for S in period D originating from subsystems in period D or later. A higher foundationality means that more subsystems changed later on because of the changes to S in D. For example, the foundationality of “Sub2” in “Period 1” is 1 (edge from “Sub1” in “Period 4”), because to make the changes to “Sub1” in “Period 4”, one needs knowledge about “Sub2” in “Period 1”. The foundationality of “Sub1” in “Period 4” is also 1 (self-edge).

The total foundationality of a subsystem is the sum of the subsystem’s founda-

tionalities across all development periods [12]. A higher total foundationality means

that more subsystems changed later on because of changes to S. For example, “Sub1”

has a total foundationality of 3, whereas for “Sub2” it is 1, meaning that “Sub1” has

played a more foundational role in the lifetime of the project (its changes forced more

dependent subsystems to change).
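As a rough illustration of these two definitions, the following Python sketch counts incoming period-level edges. The edge list `EDGES` is an assumption: it is not a transcription of Figure 1d, only a small set of edges chosen so that the counts agree with the numbers quoted above (foundationality 1 for “Sub2” in “Period 1” and for “Sub1” in “Period 4”, totals of 3 and 1).

```python
from collections import defaultdict

# Hypothetical period-level edges: ((dependent subsystem, period), (target subsystem, period)).
EDGES = [
    (("Sub1", 4), ("Sub2", 1)),
    (("Sub1", 4), ("Sub1", 4)),  # self-edge
    (("Sub1", 4), ("Sub1", 2)),
    (("Sub1", 4), ("Sub3", 3)),
    (("Sub1", 2), ("Sub1", 1)),
]


def foundationality(edges):
    """Count incoming edges per (subsystem, period) from the same or a later period."""
    counts = defaultdict(int)
    for (_src_sub, src_period), (dst_sub, dst_period) in edges:
        if src_period >= dst_period:  # only dependents in period D or later count
            counts[(dst_sub, dst_period)] += 1
    return dict(counts)


def total_foundationality(per_period):
    """Sum a subsystem's foundationality over all development periods."""
    totals = defaultdict(int)
    for (subsystem, _period), value in per_period.items():
        totals[subsystem] += value
    return dict(totals)


PER_PERIOD = foundationality(EDGES)
TOTALS = total_foundationality(PER_PERIOD)
```

Self-edges are counted like any other incoming edge here, which follows the “Sub1” in “Period 4” example in the text.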

Time Dimension: Sporadicity. A second important concept is the sporadicity of a sub-

system. For example, Figure 2a plots the foundationality of PostgreSQL’s odbc sub-

system in each period (PostgreSQL is one of the case study subject systems). odbc

is a “sporadically foundational” subsystem, since its foundationality is concentrated

in only two, clearly distinct development periods. In contrast, FreeBSD’s kern sub-

system (FreeBSD is the second case study subject system) in Figure 2b experiences

dramatic variation in foundationality, almost continuously throughout all development

periods. Hence, kern can be considered a “continuously foundational” subsystem. For

such subsystems, it can be very hard to prioritize development periods for knowledge preservation, since dependent subsystems are impacted by changes to the foundational subsystem almost all the time.

[Figure 2 panels: (a) PostgreSQL, (b) FreeBSD; x-axis: Time, y-axis: Foundationality.]

Figure 2: The foundationality of the odbc subsystem in PostgreSQL and the kern subsystem in FreeBSD across development periods. odbc is more sporadically foundational than kern.

We capture sporadicity of a subsystem in terms of the normalized entropy of foundationality over time [13], i.e.:

\[
\mathrm{sporadicity}(s) = 1 - \mathrm{normalized\ entropy}(s) = 1 + \frac{1}{\log_2(n)} \times \sum_{i=1}^{n} p_i(s) \times \log_2(p_i(s)), \quad \forall \text{ subsystems } s
\]

with

\[
p_i(s) = \frac{\text{foundationality of } s \text{ in period } i}{\text{total foundationality of } s}, \qquad \sum_{i=1}^{n} p_i(s) = 1, \qquad n = \text{total number of development periods}
\]

[Figure 3 layout: a two-by-two grid over Sporadicity (low to high) and Foundationality, with zone 1 (sporadic, unneeded), zone 2 (sporadic, needed), zone 3 (continuous, unneeded) and zone 4 (continuous, needed).]

Figure 3: Comparison of the concepts of foundationality and sporadicity.

This entropy expresses how uniform the distribution of foundationality is across

time. A normalized entropy of 1 means that foundationality is uniformly distributed

across time (“continuous”), whereas a normalized entropy of 0 means that all founda-

tionality is concentrated in one development period (“sporadic”). Since we do not want

to measure uniformity, but sporadicity, we subtract the normalized entropy from 1. In

other words, if each development period experiences the same foundationality, the sporadicity is 0 (p_i(s) = 1/n for every period i), i.e., the foundationality is continuous. If, on the other hand, all foundationality is focused in one development period j, then sporadicity is 1 (p_j(s) = 1 and p_i(s) = 0, ∀ i ≠ j). The sporadicity of odbc is 0.63, whereas the sporadicity of

kern is 0.07, i.e., the foundationality of odbc is much more sporadic. This is because

odbc was migrated to a different source control repository in 2002 [11], causing the

extremely low foundationality afterwards (Figure 2a).
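The sporadicity measure can be sketched in a few lines of Python. This is a minimal illustration of the formula above, not the authors' scripts; the function name and the convention that periods with zero foundationality contribute nothing to the entropy (0 × log 0 = 0) are assumptions made here.

```python
import math


def sporadicity(per_period):
    """1 - normalized entropy of a subsystem's foundationality over n periods.

    per_period: list with the foundationality of the subsystem in each period.
    """
    n = len(per_period)
    total = sum(per_period)
    if n < 2 or total <= 0:
        raise ValueError("need at least two periods and a positive total")
    # Periods with zero foundationality are skipped (0 * log2(0) is taken as 0).
    entropy = -sum(
        (value / total) * math.log2(value / total)
        for value in per_period
        if value > 0
    )
    return 1.0 - entropy / math.log2(n)


UNIFORM = sporadicity([5, 5, 5, 5])        # same foundationality in every period
CONCENTRATED = sporadicity([0, 0, 7, 0])   # all foundationality in one period
```

The two sample inputs correspond to the two extremes discussed in the text: a uniform distribution yields a sporadicity of 0 and a fully concentrated one yields 1.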

Effort Dimension: Number of Changes. The effort involved with preserving the knowl-

edge of a subsystem in a particular development period follows from the amount of

new knowledge about a subsystem added in that period. To approximate this amount

of new knowledge, we use the number of source code changes to the subsystem in that

period. Alternative measures could be the size of source code changes or the volume

of messages on mailing lists, but a 100% accurate measure is hard to achieve.

Discussion. The relation between foundationality and sporadicity is illustrated in Fig-

ure 3. Subsystems in zones 1 and 3 are not that foundational (we call them “hardly

foundational”), and hence have a lower need (priority) for up-to-date preservation of

project knowledge. If one would be interested in the subsystems in zone 3, typically

knowledge would need to be preserved for most of the subsystems’ development pe-

riods. Zones 2 and 4 theoretically correspond to the most foundational subsystems

(“highly foundational”). Preservation of knowledge is recommended for both zones,

but for subsystems in zone 2 it is easier to prioritize knowledge preservation, since

there are only a limited number of development periods to consider.
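The four zones of Figure 3 can be expressed as a small classification function. The sketch below is only illustrative: the sporadicity cut-off of 0.5 is a hypothetical value, not one given in this section, and the foundationality cut-offs passed in the examples are the thresholds reported in Figure 5 (47208 for PostgreSQL, 170385 for FreeBSD).

```python
def zone(total_foundationality, sporadicity, found_threshold, spor_threshold=0.5):
    """Map a subsystem to zones 1-4 of Figure 3.

    spor_threshold is a hypothetical cut-off chosen for illustration only.
    """
    highly_foundational = total_foundationality >= found_threshold
    sporadic = sporadicity >= spor_threshold
    if sporadic:
        return 2 if highly_foundational else 1  # sporadic: needed / unneeded
    return 4 if highly_foundational else 3      # continuous: needed / unneeded


ODBC_ZONE = zone(74289, 0.63, found_threshold=47208)    # PostgreSQL odbc
KERN_ZONE = zone(838289, 0.07, found_threshold=170385)  # FreeBSD kern
```

With these assumed cut-offs, odbc lands in zone 2 (few periods to consider) and kern in zone 4 (knowledge needed for almost every period), in line with the discussion of Figure 2.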

The number of changes and foundationality are two different measures, since it is

possible for development periods to be foundational based on only one source code

change, for example a new version of a library that is imported into a source code

repository. All dependent subsystems will need to update their dependency on this li-

brary, introducing a substantial number of time dependency edges, and hence increas-

ing foundationality.

The opposite is also possible, i.e., a non-foundational development period with

thousands of source code changes, for example to the “main” function of a system.

Although our deﬁnition of foundationality takes these changes into account via time

dependence self-edges, the absence of any incoming relation from other subsystems

generally leads to a low foundationality. Hence, such changes typically end up with

a low priority for knowledge preservation, unless other dependencies would be taken

into account in addition to time dependence. We consider this to be future work.

2.3. Implementation

The source control repository of a project (e.g., CVS or SVN) contains all the in-

formation required to calculate foundationality, sporadicity and the number of changes.

The main issue is that such repositories typically contain rather low-level information.

Instead of subsystem-level information like “subsystem 1 now depends on subsystem

2, which was last changed 1 quarter ago” or at least source code entity-level informa-

tion like “a function call to g was added to function f, and g was last changed 5 weeks

ago”, repositories typically contain line-level information like “line 5 was changed on

the 8th of December 2011”.

In order to lift up the line-level information to the subsystem- and period-level re-

quired for our purposes, one can use evolutionary extractors like C-REX [14]. Such ex-

tractors statically analyze all change transactions that happened over time. They parse

the source code changes to identify added and removed function calls and variable

accesses, then link these calls and accesses to the ﬁles containing the corresponding

function and variable deﬁnitions. Since static analysis is used, this linking is not 100%

accurate. For example, two ﬁles could both contain a function with the same name.

Ideally, the actual build conﬁguration should be considered to know which of the two

ﬁles is really used in a particular release of the product, however we did not do this for

this paper (nor for our previous work [10, 11]).

In this paper, we use the C-REX evolutionary extractor and some scripts that lift

up function- and week-level information to subsystem- and quarter-level, and calculate

the metrics discussed in this section. C-REX ignores changes done for indentation or

code changes, not a full parser. This allows processing uncompilable code changes, but

(as mentioned above) slightly reduces the accuracy of the linking. To address this, we use heuristics based on the most specific common super-folder. For example, if files “/a/b/c/d.c” and “/a/b/e/f.c” both contain a function named “f” and a third file “/a/b/c/g.c” calls “f”, the most specific common super-folder of “/a/b/c/d.c” and “/a/b/c/g.c” is “/a/b/c”, whereas for “/a/b/e/f.c” and “/a/b/c/g.c” it is “/a/b”. Because “/a/b/c” is longer than “/a/b”, we resolve the call to “f” to “/a/b/c/d.c”, which is likely more closely related to “/a/b/c/g.c”.
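The super-folder heuristic from the example above can be sketched as follows. This is an illustrative Python reading of the description, not C-REX code; the function names are hypothetical, and breaking ties by taking the first candidate is an assumption.

```python
import posixpath


def common_folder_depth(path_a, path_b):
    """Number of leading directory components shared by the two files' folders."""
    dirs_a = posixpath.dirname(path_a).strip("/").split("/")
    dirs_b = posixpath.dirname(path_b).strip("/").split("/")
    depth = 0
    for a, b in zip(dirs_a, dirs_b):
        if a != b:
            break
        depth += 1
    return depth


def resolve_call(caller, candidates):
    """Pick the candidate definition sharing the most specific super-folder with the caller.

    Ties go to the first candidate (an arbitrary choice made for this sketch).
    """
    return max(candidates, key=lambda c: common_folder_depth(caller, c))


RESOLVED = resolve_call("/a/b/c/g.c", ["/a/b/c/d.c", "/a/b/e/f.c"])
```

For the paths used in the text, the call from “/a/b/c/g.c” resolves to “/a/b/c/d.c”, because the shared folder “/a/b/c” has three components versus two for “/a/b”.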

3. Research Questions

Using our methodology, we study three research questions, one for each dimension

of knowledge preservation:

Q1 Which subsystems should have a higher priority for knowledge preservation?

According to the previous section, foundational subsystems are the subsystems

that we would give a high priority for knowledge preservation. Hence, we are

interested in which subsystems are foundational for a given software project. Do

foundational subsystems provide core functionality (i.e., system libraries or com-

ponents with crucial APIs that provide the essential structure for other subsys-

tems), or can non-core end user subsystems be foundational as well?

Q2 Which development periods should have a higher priority for knowledge preserva-

tion?

Does a subsystem typically exhibit short bursts of foundationality (sporadically

foundational subsystem), or is it uniformly foundational throughout the lifetime

of the project (continuously foundational subsystem)? For sporadically foundational subsystems, it is easier to determine the development periods that should have the highest priority for knowledge preservation.

Q3 How much effort is involved in preserving knowledge of foundational subsystems?

In previous work [11], we found that foundational periods typically experienced

a high number of source code changes. If the same holds at the level of foun-

dational subsystems, preserving knowledge for foundational subsystems will re-

quire substantial effort. Hence, in this question we study if there is a correlation

between the foundationality of subsystems in development periods and the num-

ber of changes performed to these subsystems.
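One way to quantify such a correlation is a rank correlation such as Spearman's, which is the coefficient reported in Table 2. The sketch below is a plain-Python illustration (average ranks for ties, then Pearson correlation on the ranks); the per-quarter numbers in `FOUND` and `CHANGES` are made-up sample data, not values from the case study.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values receiving the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up per-quarter values for one subsystem (illustration only).
FOUND = [10, 200, 3000, 40]
CHANGES = [1, 20, 90, 15]
RHO = spearman(FOUND, CHANGES)
```

Because the sample quarters are ordered identically by both measures, the coefficient is 1 here; real data would typically give a value between -1 and 1.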

4. Case Study

To explore the three research questions, we performed a case study on two large,

long-lived open source projects. We ﬁrst present the two studied systems, then present

the results for our three questions.

Studied Systems

For our case study, we used the source code histories of the open source PostgreSQL (1996–2008) and FreeBSD (1993–2009) projects, as explained by Table 1. PostgreSQL is a relational database system of which the original design goes back to the 1980s [18], whereas FreeBSD is an operating system distribution derived from the Berkeley flavour of UNIX [19]. We studied the FreeBSD system including the kernel. We used quarters (3 sequential months) as “period”, since it is a common time period for project planning (other time periods could easily be explored using our approach) [20]. We picked both systems due to their long and archived history of changes (Table 1), and our experience with them from our prior work [10, 11]. The two systems being from two different domains (databases and operating systems) helps us validate the generality of our findings across different domains.

Table 1: Characteristics of the studied systems.

                PostgreSQL         FreeBSD
type            DBMS               Operating System
CVS module      pgsql/ [16]        src/ [17]
period          10/1996–06/2008    07/1993–12/2009
#quarters       47                 66
#changes        84,311             1,074,858
#entities       31,863             617,000
#files          2,053              37,724
#bug fixes      22,913             144,582
#subsystems     64                 957

Before starting our case studies, we studied the available documentation for Post-

greSQL and FreeBSD. On the one hand, since both projects have academic roots and

later gathered a large developer and user community, many books, papers and tutori-

als have been written on PostgreSQL and FreeBSD. On the other hand, keeping this

documentation up-to-date requires substantial effort.

For example, both projects dedicate speciﬁc developers to documentation and use

collaborative media like wikis to actively involve users in the documentation process

(“Consider contributing your knowledge back.” [21]). Furthermore, FreeBSD has a

dedicated “documentation project” with explicit todo lists [22]. The FreeBSD bug

report system also lists numerous entries for outdated documentation [23]. At the time

of writing (November 2010), there were 37 critical documentation bug reports, and 293

non-critical ones. For example, there was no documentation for “The New SCSI layer

for FreeBSD (CAM)”, and large parts of the architecture handbook and USB audio

support were outdated. PostgreSQL has a smaller list (6 entries) of open problem

reports related to documentation [24], which mainly contains more technical issues

such as migration to other documentation formats.

To address our research questions, we applied the C-REX-based approach outlined in Section 2.3 to the CVS repositories of PostgreSQL and FreeBSD (Table 1). As proposed by prior work [6], this paper considers the second level directories as subsystems of PostgreSQL (e.g., odbc in /interfaces/odbc/Attic/) and the fourth level directories as subsystems of FreeBSD (e.g., dev in /freebsd/src/sys/dev/cxgb/). As mentioned earlier, we wrote scripts to calculate the three metrics from Section 2.2.
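Mapping a repository path to its subsystem at a fixed directory level could look like the following Python sketch. The function name and the choice to return `None` for paths that are shallower than the requested level are assumptions made for illustration; the two example paths are the ones quoted above.

```python
def subsystem_at_level(path, level):
    """Return the directory name at the given 1-based level of a repository path.

    Returns None when the path has fewer than `level` components
    (an assumption of this sketch, not a rule stated in the paper).
    """
    parts = [part for part in path.split("/") if part]
    if len(parts) < level:
        return None
    return parts[level - 1]


PG_SUBSYSTEM = subsystem_at_level("/interfaces/odbc/Attic/", level=2)
BSD_SUBSYSTEM = subsystem_at_level("/freebsd/src/sys/dev/cxgb/", level=4)
```

With the second level for PostgreSQL and the fourth level for FreeBSD, the example paths map to `odbc` and `dev`, respectively.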

[Figure 4 panels: (a) PostgreSQL, (b) FreeBSD; x-axis: subsystems, y-axis: Foundationality.]

Figure 4: Distribution of total foundationality across the subsystems of (a) PostgreSQL and (b) FreeBSD. The dashed line represents the border between highly and hardly foundational subsystems based on our threshold.

Q1. Which subsystems should have a higher priority for knowledge preservation?

To facilitate our discussion, we picked a meaningful threshold to distinguish between hardly and highly foundational subsystems. This is by no means the only possible threshold, since the right threshold to use depends on the resources (such as time and

personnel) available to an organization for preserving knowledge. If more resources

are available, a lower threshold should be used to consider more subsystems as highly

foundational. Figure 5 plots the distribution of the number of highly foundational sub-

systems for the range of foundationality values of the subsystems in PostgreSQL and

FreeBSD. We can see that this number changes slowly for high foundationality values,

yet for small values the number increases rapidly. This suggests diminishing returns

when picking a threshold that is too low.

[Figure 5 panels: (a) PostgreSQL, foundationality threshold = 47208; (b) FreeBSD, foundationality threshold = 170385; x-axis: foundationality, y-axis: #subsystems with higher foundationality.]

Figure 5: Plot showing, for all possible choices of foundationality threshold in (a) PostgreSQL and (b) FreeBSD, the number of subsystems with higher foundationality, i.e., the number of highly foundational subsystems. The dashed line shows the threshold that we used. The delta in foundationality of a subsystem corresponds to the length of the horizontal line ending at the subsystem’s foundationality value.

[Figure 6 panels: (a) PostgreSQL, (b) FreeBSD; x-axis: subsystems, y-axis: #files.]

Figure 6: Distribution of the #files in each subsystem in the last studied period of (a) PostgreSQL and (b) FreeBSD. The subsystems are sorted by decreasing total foundationality (cf. Figure 4). Subsystems that no longer exist in the last studied period were awarded the median #files across all subsystems.

To focus our discussion, we selected a threshold based on the differences (deltas) in foundationality between neighbouring subsystems, as visualized by the horizontal lines in Figure 5. These differences are roughly decreasing towards less foundational subsystems. Hence, we selected as highly foundational all subsystems starting from the subsystem with the highest foundationality down to the first subsystem with a delta larger than a particular value (one tenth of the maximum delta, i.e., 8,452 for PostgreSQL and 28,398 for FreeBSD). This threshold corresponds to the dashed vertical line on Figures 4a, 4b and 5, which separates hardly (left) and highly (right) foundational subsystems.
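The delta-based threshold rule can be sketched in Python as follows. The sketch rests on one interpretation, stated here as an assumption: the cut is placed at the lowest-ranked subsystem whose gap to its next, less foundational neighbour still exceeds one tenth of the maximum gap. The input lists hold only the total foundationality values listed in Table 2; the remaining subsystems are omitted, which assumes that none of their gaps exceeds the cut-off.

```python
def foundationality_threshold(totals, fraction=0.1):
    """Return (threshold, number of highly foundational subsystems).

    Assumed reading of the rule: cut at the lowest-ranked subsystem whose
    delta to the next subsystem exceeds `fraction` times the maximum delta.
    """
    ranked = sorted(totals, reverse=True)
    deltas = [high - low for high, low in zip(ranked, ranked[1:])]
    cutoff = fraction * max(deltas)
    last = max(i for i, delta in enumerate(deltas) if delta > cutoff)
    return ranked[last], last + 1


# Total foundationality values as listed in Table 2 (other subsystems omitted).
POSTGRESQL = [193533, 109014, 105867, 74289, 67762, 47208, 37798, 33798, 29798,
              29269, 25506, 24073, 21782, 19597, 19357, 18708, 16210, 11314, 6564]
FREEBSD = [918375, 838289, 554306, 444248, 354610, 299445, 291723, 244350, 242008,
           206489, 173414, 171629, 170385, 141303, 139107, 138220, 133159, 113946,
           105528]

PG_RESULT = foundationality_threshold(POSTGRESQL)
BSD_RESULT = foundationality_threshold(FREEBSD)
```

On the Table 2 values, this reading yields a threshold of 47208 with 6 subsystems for PostgreSQL and 170385 with 13 subsystems for FreeBSD, which agrees with the thresholds shown in Figure 5.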

Figures 4a and 4b plot the distribution of total foundationality across the subsys-

tems of PostgreSQL and FreeBSD. We observe that only a small percentage of sub-

systems have a high total foundationality. Table 2 lists the top 20 most foundational

subsystems of PostgreSQL and FreeBSD, including all highly foundational ones (based

on our threshold). Only 9.4% of the subsystems in PostgreSQL (6 out of 64) and 1.4%

of the subsystems in FreeBSD (13 out of 957) are highly foundational.

The highly foundational subsystems all provide core functionalities. For example, the top 5 subsystems in PostgreSQL are utils (built-in data types and routines for memory management, database transactions and text encoding), nodes (structure for storing SQL queries), access (query algorithms based on b-trees and r-trees), odbc (API for accessing PostgreSQL on the Windows platform [25]), and storage (manages the PostgreSQL storage system). In FreeBSD, the top 5 subsystems are dev (device drivers), kern (kernel implementation), i386 (architecture-specific kernel implementation for the i386 platform), sys (kernel header files), and gcc (GCC compiler).

Table 2: The top 20 most foundational subsystems for PostgreSQL (out of 64) and for FreeBSD (out of 957), with (1) total foundationality (Found.), (2) sporadicity (Spor.), and (3) the Spearman correlation (Corr.) between each quarter’s foundationality and total number of changes. An asterisk on a Found. value marks a highly foundational subsystem, whereas an asterisk on a Spor. value marks a sporadically foundational subsystem.

PostgreSQL                           FreeBSD
Subsystem   Found.    Spor.  Corr.   Subsystem  Found.    Spor.  Corr.
utils       193,533*  0.11   0.68    dev        918,375*  0.05   0.79
nodes       109,014*  0.37   0.86    kern       838,289*  0.07   0.67
access      105,867*  0.15   0.61    i386       554,306*  0.18   0.94
odbc        74,289*   0.63*  0.99    sys        444,248*  0.19   0.40
storage     67,762*   0.18   0.75    gcc        354,610*  0.61*  0.98
libpq       47,208*   0.14   0.63    user.bin   299,445*  0.49*  0.92
commands    37,798    0.05   0.79    libc       291,723*  0.23   0.78
port        33,798    0.51*  0.82    netinet    244,350*  0.12   0.67
optimizer   29,798    0.16   0.91    boot       242,008*  0.30   0.91
parser      29,269    0.18   0.91    net        206,489*  0.18   0.68
catalog     25,506    0.16   0.87    gdb        173,414*  0.73*  0.98
executor    24,073    0.17   0.82    contrib    171,629*  0.35   0.95
postmaster  21,782    0.15   0.79    binutils   170,385*  0.68*  1.00
tcop        19,597    0.20   0.75    pc98       141,303   0.45*  0.97
pg_dump     19,357    0.21   0.85    vm         139,107   0.20   0.90
ecpg        18,708    0.30   0.91    amd64      138,220   0.15   0.84
thread      16,210    0.80*  1.00    perl5      133,159   0.96*  1.00
psql        11,314    0.27   0.96    libstdc++  113,946   0.76*  1.00
libpgeasy   6,564     0.99*  0.88    openssh    105,528   0.49*  1.00
initdb      5,461     0.55*  1.00    openssl    105,429   0.68*  1.00

Hardly foundational subsystems turn out to be subsystems that either do not provide

essential functionality, or represent “consumer” subsystems like end user applications

and scripting engines. A consumer subsystem only builds on (consumes) other subsys-

tems, without providing functionality to other subsystems in return. Such subsystems

are less important to preserve knowledge for, since they are not of interest to developers

of other subsystems.

Examples of hardly foundational subsystems in PostgreSQL are soundex (user-defined function for matching based on similar sounding names), pg_encoding (utility to check encoding of data) and pg_id (id utility for shell scripts), whereas examples of consumer subsystems are cli (command line interface), main (main module of PostgreSQL) and python (Python interface). Examples of hardly foundational subsystems in FreeBSD are scrshot (screenshot utility), setpmac (run command with different MAC process label) and fib (Fibonacci heap library), whereas examples of consumer subsystems are perl (Perl interpreter), dnsquery (DNS query utility) and keylogin (decryption tool).

Are foundational subsystems by definition larger than non-foundational ones? Figures 6a and 6b plot the number of files for each subsystem (in the last studied period), ordered by decreasing total foundationality (cf. Figure 4). For subsystems that did not exist anymore in the last studied period, we used the median number of files across all subsystems.

Highly foundational subsystems tend to be larger than hardly foundational ones. In PostgreSQL, the median number of files for highly foundational subsystems is 139.5 files compared to 19 for hardly foundational subsystems, whereas in FreeBSD it is 1,469 compared to 6. Overall, the Spearman correlation between foundationality and number of files is 0.73 for PostgreSQL and 0.69 for FreeBSD, i.e., moderately high. Hence, highly foundational subsystems tend to be larger.

Of the four largest subsystems of PostgreSQL, utils is highly foundational, whereas

(from left to right) port (substitute system APIs for the Windows platform), regress

(regression test and infrastructure) and include (shared interfaces across all subsys-

tems) are hardly foundational. For FreeBSD, user.bin (UNIX system utilities), libc (C

standard library) and dev are highly foundational. openssl (SSL/TLS library) on the

other hand is hardly foundational (#20 in Table 2).









Highly foundational subsystems, i.e., subsystems of which knowledge should be

preserved ﬁrst, correspond to large, core subsystems. Subsystems with lower pri-

ority to preserve knowledge for either provide less essential functionality, or rep-

resent consumer subsystems like end user applications.

Q2. Which development periods should have a higher priority for knowledge preser-

vation?

In the previous research question, we prioritized foundational subsystems from a

knowledge preservation point of view. However, do such subsystems exhibit sporadic

periods of foundationality, or are they continuously foundational throughout the life-

time of a project? A subsystem that exhibits only sporadic periods of foundationality

intuitively should be easier to preserve knowledge for, since only a relatively limited

number of development periods is foundational. On the other hand, subsystems that

are foundational throughout the lifetime of a project continuously undergo restructur-

ing and refactoring that impact hundreds of other subsystems. Such continuously foun-

dational subsystems require knowledge preservation of most (if not all) development

periods, since there is no single most foundational development period with highest

priority.

Figures 7a and 7b plot the distribution of sporadicity across subsystems, sorted in

descending order. These curves are clearly different from each other. Whereas the

distribution of sporadicity follows a concave trend for FreeBSD, PostgreSQL follows

a much more accidental trend, with some plateaus.

[Figure 7, panels (a) and (b): sporadicity (y-axis) per subsystem (x-axis).]

Figure 7: Distribution of sporadicity across (a) PostgreSQL and (b) FreeBSD subsystems. The dashed line represents the border between sporadically and continuously foundational subsystems.

Similar to foundationality, the right sporadicity threshold to use depends on the number and kind of resources available. To determine a sporadicity threshold for our discussion, we use essentially the same delta-based methodology as for the foundationality threshold in Q1. This time, Figure 8 plots the number of subsystems with lower sporadicity for the range of sporadicity values of the subsystems in PostgreSQL and FreeBSD (sorted from low to high). Since the deltas in sporadicity in Figure 8 (length of horizontal lines) do not follow the steady downward trend of, for example, Figure 5, we manually picked the threshold between both groups by looking for an out-of-place long delta (this is clearest for PostgreSQL). With the dashed line thresholds, 35.9% of the PostgreSQL subsystems (23 out of 64) and 5.1% of the FreeBSD subsystems (49 out of 957) would be continuously foundational.
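The search for an out-of-place long delta can be approximated in code. The following Python sketch sorts the sporadicity values, computes the delta between consecutive values, and proposes the midpoint of the largest delta as a candidate threshold. The midpoint rule and the sample values are assumptions made for illustration; the thresholds in this paper were picked manually.

```python
def deltas(values):
    """Return (delta, lower, upper) for each pair of consecutive sorted values."""
    ordered = sorted(values)
    return [(hi - lo, lo, hi) for lo, hi in zip(ordered, ordered[1:])]


def largest_delta_threshold(values):
    """Propose the midpoint of the largest delta as a candidate threshold.

    Assumption: this stands in for the manual inspection described in the text.
    """
    _, lo, hi = max(deltas(values))
    return (lo + hi) / 2


# Hypothetical sporadicity values, not taken from the case study.
sporadicity = [0.05, 0.11, 0.14, 0.15, 0.18, 0.55, 0.63, 0.80, 0.99]

threshold = largest_delta_threshold(sporadicity)
# Subsystems below the threshold are treated as continuously foundational.
continuous = [s for s in sporadicity if s < threshold]
print(threshold, len(continuous))
```

A candidate produced this way would still need a visual check against the plot, since several deltas of similar length can make the largest one arbitrary.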

Using these thresholds, we now analyze sporadically and continuously foundational

subsystems in PostgreSQL and FreeBSD, then compare them to highly and hardly

foundational subsystems.

[Figure 8, panels (a) and (b): x-axis: sporadicity; y-axis: #subsystems with lower sporadicity; dashed line: sporadicity threshold = 0.416630344 for (a) and 0.421261499 for (b).]

Figure 8: For each possible choice of sporadicity threshold in (a) PostgreSQL and (b) FreeBSD, the number of subsystems with lower sporadicity, i.e., the number of continuously foundational subsystems. The dashed line shows the threshold that we used. The delta in sporadicity of a subsystem corresponds to the length of the horizontal line ending at the subsystem’s sporadicity value.

[Figure 9, panels (a) and (b): x-axis: foundationality; y-axis: sporadicity; the plot area is divided into zones 1 to 4. Labeled subsystems in (a): odbc, port, initdb, libpq, storage, access, nodes, commands, optimizer, catalog/executor, thread, libpgeasy, utils. Labeled subsystems in (b): dev, netinet, kern, gcc, gdb, libstdc++, perl5, user.bin, openssh, openssl, binutils, contrib, pc98.]

Figure 9: Scatterplot of sporadicity vs. foundationality for (a) PostgreSQL and (b) FreeBSD.

Sporadically Foundational Subsystems in PostgreSQL

The most continuously foundational subsystems of PostgreSQL are commands

(random collection of portal and utility support code), utils (see question Q1) and

libpq (PostgreSQL front-end library). Other core components of PostgreSQL, such

as postmaster (dispatcher of front-end queries to back-end), optimizer (query plan

generation) and catalog (system catalog manipulation) are also highly continuously

foundational.

The most sporadically foundational PostgreSQL subsystems are those subsystems

with the lowest foundationality. Many of these subsystems were introduced in the very

ﬁrst quarter of the PostgreSQL project, but were never foundational again afterwards

(no changes triggered other subsystems to change). One exception is libpgeasy (sim-

pliﬁed version of libpq), which is the 19th most foundational subsystem in Table 2.

Four other subsystems among the 20 most foundational are sporadically foundational as well, i.e.,

thread (threading API), odbc (see question Q1), initdb (database initialization) and

port (see question Q1).

Sporadically Foundational Subsystems in FreeBSD

The most continuously foundational subsystems of FreeBSD are dev (see question Q1), kern (kernel implementation) and netinet (TCP/IP protocol stack). Similar

to PostgreSQL, most of the continuously foundational subsystems are core subsys-

tems. The ﬁndings for sporadically foundational subsystems are also similar to those

of PostgreSQL. There are nine sporadically foundational subsystems among the 20

most foundational subsystems in Table 2, i.e., perl5 (Perl), libstdc++ (C++ runtime

support), gdb (debugger), binutils (linking and assembly tools), openssl (see question

Q1), gcc (compiler), user.bin (see question Q1), openssh (secure network protocol)

and pc98 (port of FreeBSD for the NEC PC-98x1 architecture).

Most of these subsystems are core development tools or libraries that are not maintained by the FreeBSD development team. Instead, they are imported at very specific

moments in time for customization [11], which can trigger signiﬁcant changes in other

subsystems. This explains why these subsystems are sporadically foundational. Of

course, this does not mean that the imported subsystems do not evolve in between

the import times. The full development history is maintained in the subsystems’ own

source code repository where all regular development occurs. FreeBSD only sees the

major snapshots.

Sporadicity vs. Foundationality

Up until now, we have found that the top foundational subsystems generally tend to

be continuously foundational, which is relatively bad news from the point of view of knowledge preservation (it is hard to prioritize a specific period to preserve knowledge for). Here, we

are interested in better understanding the speciﬁc relation between foundationality and

sporadicity. For this, we use a scatter plot visualization similar to Figure 3.

Figures 9a and 9b provide interesting insights into the differences between Post-

greSQL and FreeBSD. We demarcated the four zones of Figure 3 based on the thresh-

olds used for foundationality in question Q1 and sporadicity earlier in this section.

Although other thresholds obviously will shift subsystems around between the four

zones, the major trends discussed below remain similar.
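The zone assignment can be expressed as a small classification function. The Python sketch below uses the FreeBSD thresholds shown with Figures 5 and 8 (170385 and 0.421261499) and four rows of Table 2. The zone numbering is inferred from the discussion in this section (zones 2 and 4 are highly foundational, zones 3 and 4 are continuously foundational), and treating the foundationality threshold as inclusive is an assumption.

```python
from typing import NamedTuple


class Subsystem(NamedTuple):
    name: str
    foundationality: float
    sporadicity: float


def zone(s, found_threshold, spor_threshold):
    """Return the zone (1 to 4) of a subsystem in the scatter plot."""
    # Assumption: a subsystem exactly at the threshold counts as highly foundational.
    highly = s.foundationality >= found_threshold
    sporadic = s.sporadicity > spor_threshold
    if sporadic:
        return 2 if highly else 1
    return 4 if highly else 3


# FreeBSD thresholds as shown with Figures 5 and 8.
FOUND_T = 170385
SPOR_T = 0.421261499

# A few FreeBSD rows from Table 2.
freebsd = [
    Subsystem("dev", 918375, 0.05),
    Subsystem("gcc", 354610, 0.61),
    Subsystem("pc98", 141303, 0.45),
    Subsystem("vm", 139107, 0.20),
]

zones = {s.name: zone(s, FOUND_T, SPOR_T) for s in freebsd}
print(zones)
```

Changing either threshold only moves subsystems across one boundary at a time, which is why the major trends are fairly stable under small threshold changes.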

In PostgreSQL, most subsystems are in zone 1, with some subsystems in zones 3

and 4, and even fewer in zone 2. This means that most subsystems do not urgently

require knowledge preservation (zones 1 and 3). Of those subsystems that should have

a higher priority for knowledge preservation (zones 2 and 4), most are continuously

foundational (zone 4), i.e., most of their development periods should get a high priority

for knowledge preservation. odbc, port and thread are the sporadically foundational

subsystems with highest foundationality.

In FreeBSD, the situation is different. The majority of subsystems belong to zones 1 and 3, followed by zones 4 and 2. The two most foundational subsystems, i.e., dev and kern, clearly require knowledge preservation in almost every development period. The highly foundational subsystems that are sporadically foundational (i.e., gdb, binutils, gcc and user.bin) are mostly developed upstream (gdb, binutils and gcc) or are already mature subsystems (user.bin). In the former case, each period in which an upstream project is imported potentially could have a high impact on other subsystems and hence warrant knowledge preservation to answer questions like “which APIs were changed?” and “how to update to the new API version?”.

[Figure 10: x-axis: foundationality; y-axis: #changes.]

Figure 10: Scatterplot of the #changes versus the foundationality of the development periods of the sys subsystem of FreeBSD.









Most highly foundational subsystems are continuously foundational throughout

the lifetime of a project, whereas most hardly foundational subsystems are spo-

radically foundational.

Q3. How much effort is involved in preserving knowledge of foundational subsystems?

To study the effort involved with preserving knowledge for subsystems, we use the

number of changes to a subsystem in a period as a proxy. We calculate two types of

Spearman rank correlations (we use a non-parametric correlation, since the data is not

normally distributed):

Global correlation We calculate the Spearman rank correlation between the total num-

ber of changes and the total foundationality of each subsystem, to get a rough

indication of whether foundationality and number of changes correlate across all

subsystems.

Subsystem-level correlation Per subsystem S, we calculate the Spearman rank correlation between the number of changes to S and the foundationality of S, across

all quarters. This tells us for individual subsystems whether or not development

periods of subsystems for which knowledge should be preserved are typically

associated with substantial effort.
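Both correlations can be computed with a few lines of code. The following Python sketch implements the Spearman rank correlation as the Pearson correlation of average ranks and applies it at the global and the subsystem level. The subsystem names and per-quarter numbers are hypothetical, and the sketch assumes that neither input series is constant.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes both series have non-zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical history: subsystem -> one (changes, foundationality) pair per quarter.
history = {
    "core": [(120, 900), (80, 400), (200, 1500), (50, 300)],
    "headers": [(10, 800), (40, 200), (5, 100), (30, 600)],
    "tool": [(3, 0), (7, 10), (2, 5), (9, 20)],
}

# Subsystem-level correlation: across the quarters of each subsystem.
per_subsystem = {
    name: spearman([c for c, _ in quarters], [f for _, f in quarters])
    for name, quarters in history.items()
}

# Global correlation: across subsystems, using the totals over all quarters.
total_changes = [sum(c for c, _ in q) for q in history.values()]
total_found = [sum(f for _, f in q) for q in history.values()]
global_corr = spearman(total_changes, total_found)
print(per_subsystem, global_corr)
```

In this made-up data, "headers" shows the pattern discussed for header-like subsystems: quarters with few changes can still be highly foundational, which lowers the subsystem-level correlation.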

We ﬁnd that the global correlation is very high: 0.87 for PostgreSQL and 0.91 for

FreeBSD. At the subsystem-level (see Table 2), the correlation is very high for most

of the subsystems. This suggests that highly foundational subsystems are not only

more continuously foundational, but also that preserving the knowledge of founda-

tional development periods and keeping the knowledge up-to-date needs to consider a

signiﬁcantly larger amount of changes (i.e., effort).

These observations are conﬁrmed when analyzing the median total number of changes

for highly foundational subsystems and for hardly foundational subsystems. Highly

foundational subsystems have a signiﬁcantly higher median number, i.e., 4,451.5 vs.

87 for PostgreSQL and 22,575 vs. 69 for FreeBSD.

There is only one highly foundational subsystem that has a low subsystem-level

correlation. sys (see question Q1) has a correlation of only 0.40. This follows from

the fact that some less foundational quarters saw more changes, as shown in Figure 10.

Since sys consists of kernel header ﬁles, it is indeed obvious that small changes might

impact a large number of subsystems. The top right data point corresponds to the oldest

development period that we considered.









Most foundational subsystems are not only continuously foundational, they also

experience substantially more code changes, hence requiring more substantial

effort to preserve knowledge and keep this knowledge up-to-date.

5. Threats to Validity

Threats to construct validity [26] relate to whether our measurements quantify what

we intended them to. Our calculation of time dependence is primarily based on quarters. Other periods like months, years, or releases could be used and might lead to different findings. Also, our time dependence relations are derived from static call graph

dependencies. Implicit dynamic dependencies are not captured. Therefore, we miss

some of the time dependence relations between changes.
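The choice of period can be made explicit as a parameter. The Python sketch below groups change dates by quarter, month, or year; the calendar-aligned quarters and the sample dates are assumptions for illustration and are not the study’s data.

```python
from collections import Counter
from datetime import date


def period_key(d, granularity="quarter"):
    """Map a change date to the development period that contains it."""
    if granularity == "year":
        return (d.year,)
    if granularity == "month":
        return (d.year, d.month)
    if granularity == "quarter":
        # Assumption: calendar quarters (January to March is quarter 1, and so on).
        return (d.year, (d.month - 1) // 3 + 1)
    raise ValueError(f"unknown granularity: {granularity}")


# Hypothetical change dates.
changes = [date(2001, 1, 15), date(2001, 3, 31), date(2001, 4, 1), date(2002, 12, 25)]

changes_per_quarter = Counter(period_key(d) for d in changes)
changes_per_year = Counter(period_key(d, "year") for d in changes)
print(changes_per_quarter, changes_per_year)
```

Release-based periods would need a list of release dates instead of a fixed calendar rule, which this sketch does not cover.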

Since our technique is based on the historical source code changes archived in

source control repositories, the most recent development periods and subsystems typi-

cally will have lower foundationality than older periods. This does not mean that it is

not important to preserve knowledge for the former periods and subsystems. We are

currently experimenting with a weighting system to eliminate the skew towards older

development periods and subsystems.

We deﬁne subsystems as a ﬁxed level in the ﬁle system hierarchy of PostgreSQL

(second level) and FreeBSD (fourth level), as explained in Section 4. Although this

approach is well-known (e.g., [6]), and provides us with subsystems for which concrete

project documentation is available, other deﬁnitions of subsystem should be explored.

For example, subsystems might be deﬁned at different levels in the future: a subsystem

S might consist of directories C and Y in /A/B/C/ and /X/Y/Z/.
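Both ways of defining a subsystem can be sketched as path-mapping functions. In the Python sketch below, subsystem_of picks the directory at a fixed depth, and subsystem_by_prefix maps a set of directories to one named subsystem. The function names, the 1-based depth convention for relative paths, and all paths are hypothetical.

```python
from pathlib import PurePosixPath


def subsystem_of(path, level):
    """Return the directory name at the given 1-based depth of a relative path.

    Returns None when the file sits above that depth.
    """
    dirs = PurePosixPath(path).parts[:-1]  # drop the file name
    return dirs[level - 1] if len(dirs) >= level else None


def subsystem_by_prefix(path, prefixes):
    """Return the subsystem whose directory prefix contains the file, if any."""
    parents = PurePosixPath(path).parents
    for prefix, name in prefixes.items():
        if PurePosixPath(prefix) in parents:
            return name
    return None


# Fixed-level definition (hypothetical relative paths).
print(subsystem_of("src/utils/cache/entry.c", 2))  # utils
print(subsystem_of("README", 2))                   # None

# Subsystem S made up of directories C and Y at different levels.
S = {"/A/B/C/": "S", "/X/Y/": "S"}
print(subsystem_by_prefix("/A/B/C/x.c", S))    # S
print(subsystem_by_prefix("/X/Y/Z/y.c", S))    # S
print(subsystem_by_prefix("/A/B/other.c", S))  # None
```

The prefix-based variant trades the simplicity of a single depth for the ability to follow project documentation that groups directories from different parts of the tree.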

We used the number of changes in a development period as a proxy for the amount

of knowledge that should be preserved, i.e., the effort involved with preservation. This

is a simplification, since a small change like switching the implementation of, for example, “malloc” requires careful documentation of the rationale of the switch and the implementation details of the new version. Conversely, renaming a method that is called throughout the system touches many call sites, yet is a trivial change that does not require major preservation.

Our analysis is based on two thresholds, i.e., one for foundationality and one for

sporadicity. Although we documented our methodology for determining the thresh-

olds, this methodology by no means is the only acceptable one. Different thresholds

can be used according to the needs and resources of the practitioner. If sufficient people and funding are available, lower thresholds can be used (more highly foundational and continuously foundational subsystems). Otherwise, higher thresholds should be used (fewer highly foundational and continuously foundational subsystems). In our case studies, we showed how the number of highly foundational and sporadically foundational

subsystems evolves for all possible threshold values.

External validity relates to the generalization of our study results. Our case studies

are based on two open source C systems. Although they come from different domains

(database and operating system), our results may not generalize to systems of other

domains or commercial systems. Also, systems in different programming languages

or even paradigms (OO) might show different results. Additional studies are needed to

investigate this.

Finally, our approach focuses on supporting practitioners in prioritizing subsys-

tems for which knowledge should be preserved. This approach was used in this paper

to retro-actively study which development periods of which subsystems are most im-

portant to preserve knowledge for. However, to prove the effectiveness of our approach

in pro-actively guiding practitioners, we need to perform a user study.

6. Related Work

In this section, we discuss the closest related work on the different dimensions of

knowledge preservation. We also relate this paper to our earlier work on time depen-

dence. Brudaru and Zeller propose to measure the genealogy of changes, which uses a

directed acyclic graph to model the impact of changes on defects [27]. They build this

genealogy by iteratively establishing change dependencies at the level of lines of code.

During this process, the changes that break the system are observed and the impact

of changes on future defects is analyzed. Although their approach was not validated

in practice, Brudaru and Zeller’s concept of change dependency has some similarity to our

approach. Our approach considers information from the source control repositories at

subsystem-level and across time instead of at line-level.

German et al. [28] introduce the concept of Change Impact Graph (CIG) to detect

the impact of a change on its dependent changes when changing a source code en-

tity. They iteratively visualize the call graph of a function and call graphs of its called

entities within a time window. The approach aims primarily at locating bugs of a func-

tion. Our approach aims to assist practitioners to identify subsystems and development

periods for which knowledge preservation is needed.

Existing software evolution research does not explore the temporal dependence

between changes. Usually, software metrics such as LOC [29, 30] are measured to

monitor and detect the development periods with rapid or slow growth. Tilley et al. [5]

reverse-engineer the subsystem dependencies in one snapshot of a software system for

re-documentation purposes. They do not consider the time and effort dimension.

The work on change coupling between software entities, like classes and methods,

analyzes what other parts of the code need to be changed if a given piece of code is

changed [31]. However, the dependencies studied in that line of work relate entities to

other entities in the same version of the code base, whereas time dependence relates a

change of an entity to past changes of itself and other called entities.

Kothari et al. [32] introduce the concept of canonical changes to categorize the

change clusters of a project into different areas or activities, like maintenance and new

development. Our approach is somewhat similar in intent, but focuses on the effort to

preserve and maintain knowledge.

Other time-related research analyzes historical data to better understand large, long-

lived software systems. Mockus et al. [33] use historical data from version systems to

identify code experts. Chen et al. [34] developed a tool called CVSSearch, which uses

the CVS comments to track source code fragments. Hassan and Holt [35] introduce

the idea of attaching Source Sticky Notes to static dependency graphs, which assist in

better understanding the software architecture. Our approach leverages time depen-

dence between changes to identify the foundational subsystems. Kim et al. [36] trace

the evolution of software clones in a clone group across time, i.e., they identify time

dependence of clones.

There is a substantial body of effort-related work on bug prediction techniques. Most recent

techniques are based on process metrics derived from the change history of a sys-

tem [37, 38, 4, 39]. These techniques typically look at code churn [39] or the number

of changes to a ﬁle. Bernstein et al. [37], for example, use the number of revisions

and reported issues in the last quarter to predict the location and number of bugs in

the next month. More recently, different techniques have been proposed to factor bug

ﬁxing effort into bug prediction models [40]. In future work, we plan to compare the

performance of our approach against approaches that use other types of historical data.

Finally, in our previous work [10], we used the concept of time dependence to assist

project managers to track the progress of their project. We also studied the impact of

the most recent time dependence on the appearance of bugs. To study the foundational

periods of a software project [11], we considered time dependence relations at the entity

level instead of at the subsystem level. This paper lifts time dependence between

source code entities to subsystems and studies the characteristics of these subsystems

over their evolution, i.e., we consider both space and time.

7. Conclusion

To keep track of the software development process and to support future evolution

of a software system, preserving and maintaining related software artifacts and their

metadata is indispensable. However, at the same time such knowledge preservation

is hard to achieve because of the continuous evolution of software projects in size and

complexity, whereas only a limited amount of project resources are available for preser-

vation. This paper proposes an automated technique to analyze which subsystems to

prioritize for knowledge preservation without having to spend too much effort. Our ap-

proach is based on the foundationality and sporadicity metrics that can be derived from

the time dependence relations between the subsystems of a project, and on the number

of source code changes stored in the source control repository. Basically, highly foun-

dational subsystems are subsystems whose changes in a particular development period

typically trigger a massive amount of changes in dependent subsystems.

Through a case study on two large open source systems, we ﬁnd that, as could be

expected, highly foundational subsystems mostly correspond to relatively large, core

subsystems. These subsystems deﬁnitely should get a high priority when preserving

and updating knowledge. However, most of those subsystems are continuously foun-

dational, i.e., almost all development periods are important to preserve knowledge for.

In addition, the high number of changes in those periods means that knowledge preser-

vation requires substantial effort.

Although our technique can support practitioners by automatically recommending valuable subsystems for which knowledge preservation is feasible with reasonable effort,

there is deﬁnitely room for future work. In particular, what foundationality and spo-

radicity thresholds should be used in what context? How should one prioritize the

subsystems in a speciﬁc zone in the foundationality-sporadicity scatter plot? How

should one deal with the large group of continuously foundational subsystems for

which knowledge preservation has a high priority (zone 4)? Who should preserve the

knowledge of a particular subsystem, and who has the required knowledge? Finally,

does failing to preserve important knowledge really lead to significantly more bugs

and wasted development effort?

Acknowledgments. The authors want to thank Weiyi Shang and the anonymous reviewers for their insightful suggestions and comments.

References

[1] U. Dekel, J. D. Herbsleb, Improving API documentation usability with knowl-

edge pushing, in: Proc. of the 31st Intl. Conf. on Software Engineering (ICSE),

Vancouver, BC, Canada, 2009, pp. 320–330.

[2] T. Fritz, G. C. Murphy, Using information fragments to answer the questions

developers ask, in: Proc. of the 32nd ACM/IEEE Intl. Conf. on Software Engi-

neering - Volume 1 (ICSE), Cape Town, South Africa, 2010, pp. 175–184.

[3] T. D. LaToza, G. Venolia, R. DeLine, Maintaining mental models: a study of

developer work habits, in: Proc. of the 28th Intl. Conf. on Software Engineering

(ICSE), Shanghai, China, 2006, pp. 492–501.

[4] M. Leszak, D. E. Perry, D. Stoll, Classiﬁcation and evaluation of defects in a

project retrospective, J. Syst. Softw. 61 (2002) 173–187.

[5] S. R. Tilley, H. A. Müller, M. A. Orgun, Documenting software systems with

views, in: Proc. of the 10th annual international conf. on Systems documentation

(SIGDOC), Ottawa, ON, Canada, 1992, pp. 211–219.

[6] M. W. Godfrey, Q. Tu, Evolution in open source software: A case study, in: Proc.

of the Intl. Conf. on Software Maintenance (ICSM), San Jose, CA, US, 2000, pp.

131–142.

[7] S. Koch, Software evolution in open source projects—a large-scale investigation,

Journal of Software Maintenance and Evolution 19 (6) (2007) 361–382.

[8] G. von Krogh, S. Spaeth, K. R. Lakhani, Community, joining, and specialization

in open source software innovation: a case study, Research Policy 32 (7) (2003) 1217–1241.

[9] Y. Wang, D. Guo, H. Shi, Measuring the evolution of open source software

systems with their communities, SIGSOFT Software Engineering Notes 32 (6)

(2007) 7–13.

[10] O. Alam, B. Adams, A. E. Hassan, Measuring the progress of projects using

the time dependence of code changes, in: Proc. of the 25th IEEE Intl. Conf. on

Software Maintenance (ICSM), Edmonton, AB, Canada, 2009, pp. 329–338.

[11] O. Alam, B. Adams, A. E. Hassan, A study of the time dependence of code

changes, in: Proc. of the 16th Working Conf. on Reverse Engineering (WCRE),

Lille, France, 2009, pp. 21–30.

[12] R. C. Holt, Structural manipulations of software architecture using Tarski relational algebra, in: Proc. of the Working Conf. on Reverse Engineering (WCRE),

Honolulu, HI, US, 1998, pp. 210–219.

[13] C. E. Shannon, Prediction and entropy of printed English, Bell System Technical

Journal 3 (1951) 53–64.

[14] A. E. Hassan, Mining software repositories to assist developers and support man-

agers, Ph.D. thesis, University of Waterloo, Waterloo, ON, Canada (2004).

[15] A. E. Hassan, Automated classiﬁcation of change messages in open source

projects, in: Proc. of the 2008 ACM Symposium on Applied Computing (SAC),

Fortaleza, Ceara, Brazil, 2008, pp. 837–841.

[16] PostgreSQL, CVS repository (pgsql/ module), :pserver:anoncvs@postgresql.org:/usr/local/cvsroot.

[17] FreeBSD, CVS repository (src/ module), anoncvs@anoncvs1.FreeBSD.org:/home/ncvs.

[18] http://www.postgresql.org/.

[19] http://www.freebsd.org/.

[20] A. E. Hassan, R. C. Holt, The chaos of software development, in: Proc. of the

6th Intl. Wrksh. on Principles of Software Evolution (IWPSE), Helsinki, Finland,

2003, pp. 84–94.

[21] http://developer.postgresql.org/pgdocs/postgres/resources.html.

[22] http://www.freebsd.org/docproj/todo.html.

[23] http://www.FreeBSD.org/cgi/query-pr-summary.cgi?category=docs&responsible=.

[24] http://wiki.postgresql.org/wiki/Todo#Source_Code.

[25] http://www.postgresql.org/developer/ext.backend_dirs.html.

[26] R. K. Yin, Case Study Research: Design and Methods - Third Edition, SAGE

Publications, London, 2002.

[27] I. I. Brudaru, A. Zeller, What is the long-term impact of changes?, in: Proc. of

the intl. wrksh. on Recommendation Systems for Software Engineering (RSSE),

Atlanta, Georgia, 2008, pp. 30–32.

[28] D. M. German, A. E. Hassan, G. Robles, Change impact graphs: Determining the

impact of prior code changes, Inf. Softw. Technol. 51 (2009) 1394–1408.

[29] H. Gall, M. Jazayeri, J. Krajewski, CVS release history data for detecting logical couplings, in: Proc. of the 6th Intl. Wrksh. on Principles of Software Evolution (IWPSE), Helsinki, Finland, 2003, pp. 13–23.

[30] M. M. Lehman, J. F. Ramil, P. D. Wernick, D. E. Perry, W. M. Turski, Metrics and laws of software evolution - the nineties view, in: Proc. of the 4th Intl. Symposium on Software Metrics (METRICS), Albuquerque, NM, US, 1997, pp. 20–32.

[31] S. Mirarab, A. Hassouna, L. Tahvildari, Using Bayesian belief networks to predict change propagation in software systems, in: Proc. of the 15th IEEE Intl. Conf. on Program Comprehension (ICPC), Banff, AB, Canada, 2007, pp. 177–188.

[32] J. Kothari, A. Shokoufandeh, S. Mancoridis, A. E. Hassan, Studying the evolution of software systems using change clusters, in: Proc. of the 14th IEEE Intl. Conf. on Program Comprehension (ICPC), Athens, Greece, 2006, pp. 46–55.

[33] A. Mockus, L. G. Votta, Identifying reasons for software changes using historic databases, in: Proc. of the Intl. Conf. on Software Maintenance (ICSM), San Jose, CA, US, 2000, pp. 120–130.

[34] A. Chen, E. Chou, J. Wong, A. Y. Yao, Q. Zhang, S. Zhang, A. Michail, CVSSearch: Searching through source code using CVS comments, in: Proc. of the IEEE Intl. Conf. on Software Maintenance (ICSM), 2001, pp. 364–373.

[35] A. E. Hassan, R. C. Holt, Using development history sticky notes to understand software architecture, in: Proc. of the 12th IEEE Intl. Wrksh. on Program Comprehension (IWPC), Bari, Italy, 2004, pp. 183–192.

[36] M. Kim, V. Sazawal, D. Notkin, G. Murphy, An empirical study of code clone genealogies, in: Proc. of the 10th European Software Engineering Conf. held jointly with the 13th ACM SIGSOFT Intl. Symp. on Foundations of Software Engineering (ESEC/FSE-13), Lisbon, Portugal, 2005, pp. 187–196.

[37] A. Bernstein, J. Ekanayake, M. Pinzger, Improving defect prediction using temporal features and non linear models, in: Proc. of the 9th Intl. Wrksh. on Principles of Software Evolution (IWPSE), Dubrovnik, Croatia, 2007, pp. 11–18.

[38] T. L. Graves, A. F. Karr, J. S. Marron, H. Siy, Predicting fault incidence using software change history, IEEE Transactions on Software Engineering 26 (7) (2000) 653–661.

[39] N. Nagappan, T. Ball, Use of relative code churn measures to predict system defect density, in: Proc. of the 27th Intl. Conf. on Software Engineering (ICSE), St. Louis, MO, US, 2005, pp. 284–292.

[40] T. Mende, R. Koschke, M. Leszak, Evaluating defect prediction models for a large evolving software system, in: Proc. of the European Conf. on Software Maintenance and Reengineering (CSMR), Madrid, Spain, 2009, pp. 247–250.
