Tell Them Apart: Distilling Technology Differences from Crowd-Scale Comparison Discussions
Yi Huang, Australian National University, Australia (u6039034@anu.edu.au)
Chunyang Chen, Faculty of Information Technology, Monash University, Australia (chunyang.chen@monash.edu)
Zhenchang Xing, Australian National University, Australia (zhenchang.xing@anu.edu.au)
Tian Lin, Nanyang Technological University, Singapore
Yang Liu, Nanyang Technological University, Singapore (yangliu@ntu.edu.sg)
ABSTRACT
Developers can use different technologies for many software development tasks in their work. However, when faced with several technologies with comparable functionalities, it is not easy for developers to select the most appropriate one, as comparing technologies by trial and error is time-consuming. Instead, developers can resort to expert articles, read official documents or ask questions on Q&A sites for technology comparison, but it is hit-or-miss to get a comprehensive comparison as online information is often fragmented or contradictory. To overcome these limitations, we propose the diffTech system, which exploits the crowdsourced discussions from Stack Overflow and assists technology comparison with an informative summary of different comparison aspects. We first build a large database of comparable software technologies by mining tags in Stack Overflow, and locate comparative sentences about comparable technologies with NLP methods. We further mine prominent comparison aspects by clustering similar comparative sentences and representing each cluster with its keywords. The evaluation demonstrates both the accuracy and usefulness of our model, and we implement a practical website for public use.
CCS CONCEPTS
• Information systems → Data mining; • Software and its engineering → Software libraries and repositories;
KEYWORDS
differencing similar technology, Stack Overflow, NLP
ACM Reference Format:
Yi Huang, Chunyang Chen, Zhenchang Xing, Tian Lin, and Yang Liu. 2018. Tell Them Apart: Distilling Technology Differences from Crowd-Scale Comparison Discussions. In Proceedings of the 2018 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE '18), September 3–7, 2018, Montpellier, France. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3238147.3238208

Co-first and corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ASE '18, September 3–7, 2018, Montpellier, France
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5937-5/18/09...$15.00
https://doi.org/10.1145/3238147.3238208
Figure 1: A comparative sentence in a post (#1008671) that is
not explicitly for technology comparison
1 INTRODUCTION
A diverse set of technologies (e.g., algorithms, programming languages, platforms, libraries/frameworks, concepts for software engineering) [12, 15] is available for use by developers, and that set continues growing. Adopting suitable technologies can significantly accelerate the software development process and also enhance software quality. But when developers are looking for proper technologies for their tasks, they are likely to find several comparable candidates. For example, they will find bubble sort and quick sort algorithms for sorting, nltk and opennlp libraries for NLP, and Eclipse and Intellij for developing Java applications.
Faced with so many candidates, developers are expected to have a good understanding of different technologies in order to make a proper choice for their work. However, even for experienced developers, it can be difficult to keep pace with the rapid evolution of technologies. Developers can try each of the candidates in their work for comparison, but such trial-and-error assessment is time-consuming and labor-intensive. Instead, we find that the perceptions developers have of comparable technologies, and the choices they make about which technology to use, are very likely to be influenced by how other developers see and evaluate the technologies. So developers often turn to two information sources on the Web [8] to learn more about comparable technologies.
First, they read experts' articles about technology comparison like "Intellij vs. Eclipse: Why IDEA is Better". Second, developers can seek answers on Q&A websites such as Stack Overflow or Quora (e.g., "Apache OpenNLP vs NLTK"). These expert articles and community answers are indexable by search engines, thus enabling developers to find answers to their technology comparison inquiries.
However, there are two limitations with expert articles and community answers.
• Fragmented view: An expert article or community answer usually focuses on a specific aspect of some comparable technologies, and developers have to aggregate the fragmented information into a complete comparison across different aspects. For example, to compare mysql and postgresql, one article [2] contrasts their speed, while another [3] compares their reliability. Only after reading both articles can developers have a relatively comprehensive overview of these two comparable technologies.
• Diverse opinions: One expert article or community answer is based on the author's knowledge and experience. However, the knowledge and experience of developers vary greatly. For example, one developer may prefer Eclipse over Intellij because Eclipse fits his project setting better. But that setting may not be extensible to other developers. At the same time, some developers may prefer Intellij over Eclipse for other reasons. Such contradictory preferences among different opinions may confuse developers.
The above two limitations create a high barrier for developers to effectively gather useful information about technology differences on the Web in order to tell apart comparable technologies. Although developers may manually aggregate relevant information by searching and reading many web pages, that would be very hit-or-miss and time-consuming. To overcome the above limitations, we present the diffTech system that automatically distills and aggregates fragmented and trustworthy technology comparison information from the crowd-scale Q&A discussions in Stack Overflow, and assists technology comparison with an informative summary of different aspects of comparison information.
Our system is motivated by the fact that a wide range of technologies have been discussed by millions of users in Stack Overflow [14], and users often express their preferences toward a technology and compare one technology with the others in the discussions. Apart from posts explicitly about the comparison of some technologies, many comparative sentences hide in posts that are only implicitly about technology comparison. Fig. 1 shows such an example: the answer "accidentally" compares the security of POST and GET, while the question "How secure is a HTTP post?" does not explicitly ask for this comparison. Inspired by this phenomenon, we propose our system to mine and aggregate the comparative sentences in Stack Overflow discussions.
As shown in Fig. 2, we consider Stack Overflow tags as a collection of technology terms and first find comparable technologies by analyzing tag embeddings and categories. Then, our system distills and clusters comparative sentences from Q&A discussions, which very likely contain detailed comparisons between comparable technologies and sometimes even explain why users like or dislike a particular technology. Finally, we use word mover's distance [28] and community detection [21] to cluster comparative sentences into prominent aspects by which users compare the two technologies, and present the mined clusters of comparative sentences for user inspection.
As there is no ground truth for technology comparison, we manually validate the performance of each step of our approach. The experiment results confirm the accuracy of comparable technology identification (90.7%) and of distilling comparative sentences (83.7%) from Q&A discussions. By manually building the ground truth, we show that our clustering method (word mover's distance and community detection) for comparative sentences significantly outperforms the two baselines (TF-IDF with K-means and Doc2vec with K-means). Finally, we further demonstrate the usefulness of our system for answering questions about technology comparison in Stack Overflow. The results show that our system can cover the semantics of 72% of the comparative sentences in five randomly selected technology comparison questions, and also include some unique comparisons from other aspects which are not discussed in the original answers.
Figure 2: The overview of our approach
Our contributions in this work are four-fold:
• This is the first work to systematically identify comparable software-engineering technologies and distill crowd-scale comparative sentences for these technologies.
• Our method automatically distills and aggregates crowd opinions into different comparison aspects so that developers can understand technology comparison more easily.
• Our experiments demonstrate the effectiveness of our method by checking the accuracy and usefulness of each step of our approach.
• We implement our results into a practical tool and make it public to the community. Developers can benefit from the technology comparison knowledge on our website¹.
¹https://difftech.herokuapp.com/
Figure 3: The architecture of the two word embedding models: (a) the continuous skip-gram model predicts the surrounding words given the central word, and (b) the continuous bag-of-words (CBOW) model predicts the central word based on the context words. Note the differences in arrow direction between the two models.
2 MINING SIMILAR TECHNOLOGY
Studies [9, 11, 44] show that Stack Overflow tags identify the computer programming technologies that questions and answers revolve around. They cover a wide range of technologies, from algorithms (e.g., binary search, merge sort) and programming languages (e.g., python, java) to libraries and frameworks (e.g., tensorflow, django) and development tools (e.g., vim, git). In this work, we regard Stack Overflow tags as a collection of technologies that developers would like to compare. We leverage word embedding techniques to infer semantically related tags, and develop natural language methods to analyze each tag's TagWiki to determine the corresponding technology's category (e.g., algorithm, library, IDE). Finally, we build a knowledge base of comparable technologies by filtering the same-category, semantically-related tags.
2.1 Learning Tag Embeddings
Word embeddings are dense low-dimensional vector representations of words, built on the assumption that words with similar meanings tend to appear in similar contexts. Studies [11, 35] show that word embeddings are able to capture rich semantic and syntactic properties of words for measuring word similarity. In our approach, given a corpus of tag sentences, we use word embedding methods to learn the vector representation of each tag from the surrounding context of the tag in the corpus.
There are two kinds of widely-used word embedding methods [35]: the continuous skip-gram model [36] and the continuous bag-of-words (CBOW) model. As illustrated in Fig. 3, the objective of the continuous skip-gram model is to learn a word representation that is good at predicting the co-occurring words in the same sentence (Fig. 3(a)), while the CBOW model is the opposite, predicting the central word from the context words (Fig. 3(b)). Note that word order within the context window is not important for learning word embeddings.
Specically, given a sequence of training text stream
t1,t2, ..., tk
,
the objective of the continuous skip-gram model is to maximize the
following average log probability:
L=1
K
K
Õ
k=1
Õ
NjN,j,0
logp(tk+j|tk)(1)
Figure 4: POS tagging of the definition sentence of the tag Matplotlib. Tag Wiki: "Matplotlib is a plotting library for Python"; Part of Speech: NNP VBZ DT JJ NN IN NNP.
while the objective of the CBOW model is:

$$L = \frac{1}{K}\sum_{k=1}^{K} \log p(t_k \mid t_{k-N}, t_{k-N+1}, \dots, t_{k+N}) \qquad (2)$$

where $t_k$ is the central word, $t_{k+j}$ is its surrounding word at distance $j$, and $N$ indicates the window size. In our application of word embedding, a tag sentence is a training text stream, and each tag is a word. As a tag sentence is short (it has at most 5 tags), we set $N$ to 5 in our approach so that the context of one tag is all other tags in the current sentence. That is, the context window contains all other tags as the surrounding words for a given tag. Therefore, tag order does not matter in this work for learning tag embeddings.
To determine which word embedding model performs better in our comparable technology reasoning task, we carry out a comparison experiment; the details are discussed in Section 5.1.3.
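As a concrete illustration of this step, the sketch below trains skip-gram tag embeddings with gensim (4.x); it is a minimal example under our own assumptions (toy tag sentences, hypothetical variable names), not the authors' implementation.

# Minimal sketch: learn tag embeddings from tag sentences with gensim's Word2Vec.
# `tag_sentences` is a toy stand-in for the real corpus of Stack Overflow tag lists.
from gensim.models import Word2Vec

tag_sentences = [
    ["python", "nltk", "tokenize"],
    ["python", "matplotlib", "plotting"],
    ["java", "swing", "awt"],
]

model = Word2Vec(
    sentences=tag_sentences,
    vector_size=800,  # the dimension that performed best in Section 5.1.3
    window=5,         # a question has at most 5 tags, so this covers all co-occurring tags
    sg=1,             # sg=1 selects the skip-gram model (sg=0 would be CBOW)
    min_count=1,
)

print(model.wv["nltk"][:5])  # first 5 dimensions of the tag vector for "nltk"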
2.2 Mining Categorical Knowledge
In Stack Overflow, tags can be of different categories, such as programming language, library, framework, tool, API, algorithm, etc. To determine the category of a tag, we resort to the tag definition in the tag's TagWiki. The TagWiki of a tag is collaboratively edited by the Stack Overflow community. Although there are no strict formatting rules in Stack Overflow, the TagWiki description usually starts with a short sentence that defines the tag. For example, the TagWiki of the tag Matplotlib starts with the sentence "Matplotlib is a plotting library for Python". Typically, the first noun just after the be verb defines the category of the tag. For example, from the tag definition of Matplotlib, we can learn that the category of Matplotlib is library.
Based on the above observation of tag definitions, we use NLP methods [11, 27] to extract such a noun from the tag definition sentence as the category of a tag. Given the TagWiki of a tag in Stack Overflow, we extract the first sentence of the TagWiki description, and clean up the sentence by removing hyperlinks and brackets such as "{}", "()". Then, we apply Part-of-Speech (POS) tagging to the extracted sentence. POS tagging is the process of marking up a word in a text as corresponding to a particular part of speech, such as noun, verb or adjective. NLP tools usually agree on the POS tags of nouns, and we find that the POS tagger in NLTK [10] is especially suitable for our task. In NLTK, nouns are annotated by different POS tags [1] including NN (noun, singular or mass), NNS (noun, plural), NNP (proper noun, singular), and NNPS (proper noun, plural). Fig. 4 shows the results for the tag definition sentence of Matplotlib. Based on the POS tagging results, we extract the first noun (library in this example) after the be verb (is in this example) as the category of the tag. That is, the category of Matplotlib is library. Note that if the noun is some specific word such as system or development, we further check its neighborhood words to see if it is operating system or integrated development environment.
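The following sketch illustrates this category-extraction heuristic with NLTK; the helper function and its simple be-verb check are our own illustration under stated assumptions, not the authors' exact implementation.

# Sketch: extract a tag's category as the first noun after the "be" verb in the first
# TagWiki sentence. Requires the NLTK tokenizer and tagger data, e.g.
# nltk.download('punkt') and nltk.download('averaged_perceptron_tagger').
import nltk

def extract_category(definition_sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(definition_sentence))
    seen_be_verb = False
    for word, pos in tagged:
        if word.lower() in ("is", "are", "was", "were"):
            seen_be_verb = True
        elif seen_be_verb and pos in ("NN", "NNS", "NNP", "NNPS"):
            return word.lower()
    return None  # no "tag be noun phrase" form; fall back to the dictionary look-up

print(extract_category("Matplotlib is a plotting library for Python"))  # -> "library" (cf. Fig. 4)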
With this method, we obtain 318 categories for 23,658 tags (about 67% of all the tags that have a TagWiki). We manually normalize these 318 category labels, for example merging app and applications into application, libraries and lib into library, and normalizing uppercase and lowercase (e.g., API and api). As a result, we obtain 167 categories. Furthermore, we manually group these 167 categories into five general categories: programming language, platform, library, API, and concept/standard [48]. This is because the meanings of the fine-grained categories often overlap, and there is no consistent rule for the usage of these terms in the TagWiki. This generalization step is necessary, especially for the library tags, which broadly refer to tags whose fine-grained categories can be library, framework, api, toolkit, wrapper, and so on. For example, in Stack Overflow's TagWiki, junit is defined as a framework, google-visualization is defined as an API, and wxpython is defined as a wrapper. All these tags are referred to as library tags in our approach.
Although the above method obtains the tag category for the majority of the tags, the first sentence of the TagWiki of some tags is not formatted in the standard "tag be noun phrase" form. For example, the first sentence of the TagWiki of the tag itext is "Library to create and manipulate PDF documents in Java", for markermanager the tag definition sentence is "A Google Maps tool", and for ghc-pkg the tag definition sentence is "The command ghc-pkg can be used to handle GHC packages". As there is no be verb in such sentences, the above NLP method cannot return a noun phrase as the tag category. According to our observation, in most of such cases the category of the tag is still present in the sentence, but often expressed in many different ways. It is very likely that the category word appears as the first noun phrase that matches one of the existing category words in the definition sentence. Therefore, we use a dictionary look-up method to determine the category of such tags. Specifically, we use the 167 categories obtained using the above NLP method as a dictionary to recognize the category of the tags that have not been categorized by the NLP method. Given an uncategorized tag, we scan the first sentence of the tag's TagWiki from the beginning, and search for the first match of a category label in the sentence. If a match is found, the tag is categorized under the matched category. For example, the tag itext is categorized as library using this dictionary look-up method. Using the dictionary look-up method, we obtain the category for 9,648 more tags.
Note that we cannot categorize some (less than 15%) of the tags using the above NLP method and the dictionary look-up method. This is because these tags do not have a clear tag definition sentence; for example, the TagWiki of the tag richtextbox states that "The RichTextBox control enables you to display or edit RTF content", which is not a clear definition of what richtextbox is. Or no category match can be found in the tag definition sentence of some tags. For example, the TagWiki of the tag carousel states that "A rotating display of content that can house a variety of content", and we do not have the category "display" among the 167 categories collected using the NLP method. When building the comparable-technologies knowledge base, we exclude these uncategorized tags as potential candidates.
Table 1: Examples of filtering results by categorical knowledge (filtered tags shown in red in the original)

Source      | Top-5 recommendations from word embedding
nltk        | nlp, opennlp, gate, language-model, stanford-nlp
tcp         | tcp-ip, network-programming, udp, packets, tcpserver
vim         | sublimetext, vim-plugin, emacs, nano, gedit
swift       | objective-c, cocoa-touch, storyboard, launch-screen
bubble-sort | insertion-sort, selection-sort, mergesort, timsort, heapsort
2.3 Building Similar-technology Knowledge Base
Given a technology tag $t_1$ with its vector $vec(t_1)$, we first find the most similar technology $t_2$ whose vector $vec(t_2)$ is closest to it, i.e.,

$$\underset{t_2 \in T}{\arg\max}\ \cos(vec(t_1), vec(t_2)) \qquad (3)$$

where $T$ is the set of technology tags excluding $t_1$, and $\cos(u, v)$ is the cosine similarity of the two vectors.
Note that tags whose tag embedding is similar to the vector $vec(t_1)$ may not always be in the same category. For example, the tag embeddings of the tags nlp and language-model are similar to the vector $vec(nltk)$. These tags are relevant to the nltk library as they refer to some NLP concepts and tasks, but they are not libraries comparable to nltk. In our approach, we rely on the category of tags (i.e., categorical knowledge) to return only tags within the same category as candidates. Some examples can be seen in Table 1.
In practice, there could be several technologies $t_2$ comparable to the technology $t_1$. Thus, we select tags $t_2$ whose cosine similarity in Eq. 3 is above a threshold $Thresh$. Take the library nltk (an NLP library in Python) as an example: we will preserve several candidate libraries such as textblob and stanford-nlp.
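The sketch below shows how this selection could look in code; the threshold value, data structures and helper names are our own illustrative assumptions rather than the paper's implementation.

# Sketch: candidate comparable technologies = same-category tags whose tag-embedding
# cosine similarity to the source tag exceeds a threshold.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def comparable_technologies(tag, tag_vectors, tag_category, thresh=0.4):
    """tag_vectors: dict tag -> vector; tag_category: dict tag -> general category.
    The threshold 0.4 is a placeholder; the paper only states that a threshold is used."""
    source_vec, source_cat = tag_vectors[tag], tag_category.get(tag)
    candidates = []
    for other, vec in tag_vectors.items():
        if other == tag or tag_category.get(other) != source_cat:
            continue  # categorical filtering removes related-but-not-comparable tags
        similarity = cosine(source_vec, vec)
        if similarity >= thresh:
            candidates.append((other, similarity))
    return sorted(candidates, key=lambda pair: -pair[1])

# Toy example: only same-category tags survive the filter.
vectors = {"nltk": np.array([1.0, 0.0]), "opennlp": np.array([0.9, 0.1]), "nlp": np.array([0.95, 0.05])}
categories = {"nltk": "library", "opennlp": "library", "nlp": "concept"}
print(comparable_technologies("nltk", vectors, categories))  # [('opennlp', ...)]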
3 MINING COMPARATIVE OPINIONS
For each pair of comparable technologies in the knowledge base, we analyze the Q&A discussions in Stack Overflow to extract plausible comparative sentences in which Stack Overflow users express their opinions on the comparable technologies. We may obtain many comparative sentences for each pair of comparable technologies. Displaying all these sentences as a whole may make it difficult for developers to read and digest the comparison information. Therefore, we measure the similarity among the comparative sentences, and then cluster them into several groups, each of which may identify a prominent aspect of technology comparison that users are concerned with.
3.1 Extracting Comparative Sentences
There are three steps to extract comparative sentences about two technologies. We first carry out some preprocessing of the Stack Overflow post content. Then we locate the sentences that contain the names of the two technologies, and further select the comparative sentences that satisfy a set of comparative sentence patterns.
3.1.1 Preprocessing. To extract trustworthy opinions about the comparison of technologies, we consider only answer posts with positive score points. Then we split the textual content of such answer posts into individual sentences by punctuation marks such as ".", "!" and "?". We remove all sentences ending with a question mark, as we want
to extract facts instead of doubts. We lowercase all sentences to make the sentence tokens consistent with the technology names, because all tags are in lowercase.

Table 2: The 6 comparative sentence patterns

No. | Pattern            | Sequence example            | Original sentence
1   | TECH * VBZ * JJR   | innodb has 30 higher        | InnoDB has 30% higher performance than MyISAM on average.
2   | TECH * VBZ * RBR   | postgresql is a more        | Postgresql is a more correct database implementation while mysql is less compliant.
3   | JJR * CIN * TECH   | faster than coalesce        | Isnull is faster than coalesce.
4   | RBR JJ * CIN TECH  | more powerful than velocity | Freemarker is more powerful than velocity.
5   | CV * CIN TECH      | prefer ant over maven       | I prefer ant over maven personally.
6   | CV VBG TECH        | recommend using html5lib    | I strongly recommend using html5lib instead of beautifulsoup.

Table 3: Examples of aliases

Tech term          | Synonyms                                                   | Abbreviation
visual studio      | visualstudio, visual studios, visual-studio                | msvs
beautifulsoup      | beautiful soup                                             | bs4
objective-c        | objectivec, objective c                                    | objc, obj-c
depth-first search | deep first search, depth first search, depth-first-search  | dfs
postgresql         | postgre sql, posgresq, postgesql                           | pgsql
3.1.2 Locating Candidate Sentences. To locate sentences mentioning a pair of comparable technologies, using only the tag names is not enough. As posts in Stack Overflow are informal discussions about programming-related issues, users often use aliases to refer to the same technology [16]. Aliases of technologies can be abbreviations, synonyms and some frequent misspellings. For example, "javascript" is often written in many forms such as "js" (abbreviation), "java-script" (synonym) or "javascrip" (misspelling) in the discussions.
The presence of such aliases would lead to significant misses of comparative sentences if we matched technology mentions in a sentence with only tag names. Chen et al.'s work [17] builds a large thesaurus of morphological forms of software-specific terms, including abbreviations, synonyms and misspellings. Table 3 shows some examples of technology aliases in this thesaurus. Based on this thesaurus, we find 7,310 different aliases for 3,731 software technologies. These aliases help to locate more candidate comparative sentences that mention certain technologies.
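As a toy illustration (the alias table below is a hand-picked excerpt, not the full thesaurus), candidate location can be reduced to a membership test over a tag's known forms:

# Sketch: a sentence is a candidate if it mentions a technology by tag name or alias.
# Real matching should respect word boundaries; the substring check keeps the sketch short.
ALIASES = {
    "javascript": {"javascript", "js", "java-script", "javascrip"},
    "beautifulsoup": {"beautifulsoup", "beautiful soup", "bs4"},
}

def mentions(sentence, tech):
    text = sentence.lower()
    return any(alias in text for alias in ALIASES.get(tech, {tech}))

print(mentions("I would just use bs4 to parse the page", "beautifulsoup"))  # True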
3.1.3 Selecting Comparative Sentences. To identify comparative sentences among the candidate sentences, we develop a set of comparative sentence patterns. Each comparative sentence pattern is a sequence of POS tags. For example, the sequence of POS tags "RBR JJ IN" is a pattern that consists of a comparative adverb (RBR), an adjective (JJ) and subsequently a preposition (IN), such as "more efficient than", "less friendly than", etc. We extend the list of common POS tags to enhance the identification of comparative sentences. More specifically, we create three comparative POS tags: CV (comparative verbs, e.g., prefer, compare, beat), CIN (comparative prepositions, e.g., than, over), and TECH (technology reference, including the name and aliases of a technology, e.g., python, eclipse).
Based on our observations of comparative sentences, we summarise six comparative patterns. Table 2 shows these patterns and the corresponding examples of comparative sentences. To make the patterns more flexible, we use a wildcard character to represent a list of arbitrary words within the pattern. For each sentence mentioning the two comparable technologies, we obtain its POS tags and check if it matches any of the six patterns. If so, the sentence is selected as a comparative sentence.
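To make the matching step concrete, the sketch below checks one pattern from Table 2 (pattern 3, "JJR * CIN * TECH") against a candidate sentence; the tag rewriting, word lists and regular expression are our own simplified illustration, not the paper's implementation.

# Sketch: rewrite a sentence into a POS-tag sequence with the custom CIN/TECH tags,
# then regex-match the sequence against a comparative pattern.
# Requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger').
import re
import nltk

COMPARATIVE_PREPOSITIONS = {"than", "over"}                 # CIN
TECH_TERMS = {"isnull", "coalesce", "mysql", "postgresql"}  # TECH: tag names + aliases

def pos_sequence(sentence):
    tags = []
    for word, pos in nltk.pos_tag(nltk.word_tokenize(sentence.lower())):
        if word in TECH_TERMS:
            tags.append("TECH")
        elif word in COMPARATIVE_PREPOSITIONS:
            tags.append("CIN")
        else:
            tags.append(pos)
    return " ".join(tags)

# Pattern 3 from Table 2: JJR * CIN * TECH ("*" stands for any words in between).
PATTERN_3 = re.compile(r"\bJJR\b(?: \S+)*? \bCIN\b(?: \S+)*? \bTECH\b")

sequence = pos_sequence("Isnull is faster than coalesce.")
print(sequence)                          # e.g. "TECH VBZ JJR CIN TECH ."
print(bool(PATTERN_3.search(sequence)))  # expected: True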
Figure 5: An illustration of measuring the similarity of two comparative sentences
3.2 Measuring Sentence Similarity
To measure the similarity of two comparative sentences, we adopt the Word Mover's Distance [28], which is especially useful for short-text comparison. Given two sentences $S_1$ and $S_2$, we take one word $i$ from $S_1$ and one word $j$ from $S_2$, and let their word vectors be $v_i$ and $v_j$. The distance between word $i$ and word $j$ is the Euclidean distance between their vectors, $c(i, j) = \lVert v_i - v_j \rVert_2$. To avoid confusion between word and sentence distance, we refer to $c(i, j)$ as the cost associated with "traveling" from one word to another. One word $i$ in $S_1$ may move to several different words in $S_2$, but its total weight is 1, so we use $T_{ij} \ge 0$ to denote how much of word $i$ in $S_1$ travels to word $j$ in $S_2$. It costs $\sum_j T_{ij}\, c(i, j)$ to move one word $i$ entirely into $S_2$. We define the distance between the two sentences as the minimum (weighted) cumulative cost required to move all words from $S_1$ to $S_2$, i.e., $D(S_1, S_2) = \sum_{i,j} T_{ij}\, c(i, j)$.
This problem is very similar to the transportation problem, i.e., how to spend the least to transport all goods from source cities $A_1, A_2, \dots$ to target cities $B_1, B_2, \dots$. Finding such a minimum cost is a well-studied optimization problem known as the earth mover's distance [32, 38].
To use the word mover's distance in our approach, we first train a word embedding model on the post content of Stack Overflow so that we obtain a dense vector representation for each word in Stack Overflow. Word embeddings have been shown to capture rich semantic and syntactic information of words. Our approach does not compute the word mover's distance over all words in a sentence. Instead, for each comparative sentence, we extract only the keywords with POS tags that are most relevant to the comparison, including adjectives (JJ), comparative adjectives (JJR) and nouns (NN, NNS, NNP and NNPS), excluding the technologies under comparison. Then, we compute the minimal word mover's distance between the keywords in one sentence and those in the other sentence. Based on the distance, we further compute the similarity score of the two sentences by

$$similarity\_score(S_1, S_2) = \frac{1}{1 + D(S_1, S_2)}$$
Figure 6: Communities in the graph of comparative sentences
The similarity score is in the range (0, 1), and the higher the score, the more similar the two sentences. If the similarity score between two sentences is larger than a threshold, we regard them as similar. The threshold is 0.55 in this work, determined heuristically by a small-scale pilot study. We show some similar comparative sentences found by word mover's distance in Table 4.
To help readers understand the word mover's distance, Figure 5 shows an example with two comparative sentences for comparing postgresql and mysql: "Postgresql offers more security functionality than mysql" and "Mysql provides less safety features than postgresql". The keywords in the two sentences that are most relevant to the comparison are highlighted in bold. We see that the minimum distance between the two sentences is mainly the accumulation of word distances between pairs of similar words: (offers, provides), (more, less), (security, safety), and (functionality, features). As the distance between the two sentences is small, the similarity score is high even though the two sentences use rather different words and express the comparison in opposite directions.
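A minimal sketch of this scoring step is shown below, using gensim's wmdistance on pre-extracted keyword lists; the model file name and keyword lists are illustrative assumptions, and wmdistance additionally requires an optimal-transport backend such as POT.

# Sketch: similarity score of two comparative sentences from the Word Mover's Distance
# between their comparison keywords (technology names already removed).
from gensim.models import KeyedVectors

# Assumed: word vectors trained on Stack Overflow post text (hypothetical file name).
# wv = KeyedVectors.load("stackoverflow_word_vectors.kv")

def similarity_score(keywords1, keywords2, wv):
    distance = wv.wmdistance(keywords1, keywords2)  # minimum cumulative travel cost D(S1, S2)
    return 1.0 / (1.0 + distance)

s1_keywords = ["offers", "more", "security", "functionality"]
s2_keywords = ["provides", "less", "safety", "features"]
# score = similarity_score(s1_keywords, s2_keywords, wv)
# Sentence pairs with score > 0.55 are treated as similar.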
3.3 Clustering Representative Comparison Aspects
For each pair of comparable technologies, we collect a set of comparative sentences about their comparison in Section 3.1. Within these comparative sentences, we find pairs of similar sentences in Section 3.2. We take each comparative sentence as one node in a graph, and if two sentences are determined to be similar, we add an edge between them. In this way, we obtain a graph of comparative sentences for a given pair of comparable technologies.
Although some comparative sentences are very different in wording or comparison direction (examples shown in Fig. 5 and Table 4), they may still share the same comparison opinion. In graph theory, a set of highly correlated nodes is referred to as a community (cluster) in the network. Based on the sentence similarity, we cluster similar opinions by applying a community detection algorithm to the graph of comparative sentences. In this work, we use the Girvan-Newman algorithm [21], a hierarchical community detection method. It uses an iterative modularity maximization method to partition the network into a finite number of disjoint clusters that are considered as communities; each node is assigned to exactly one community. Fig. 6 shows the graph of comparative sentences for the comparison of TCP and UDP (two network protocols), in which each node is a comparative sentence and the detected communities are visualized in the same color.
As seen in Fig. 6, each community may represent a prominent comparison aspect of the two comparable technologies. But some communities may contain too many comparative sentences to understand easily. Therefore, we use TF-IDF (Term Frequency-Inverse Document Frequency) to extract keywords from the comparative sentences in one community to represent the comparison aspect of that community. TF-IDF is a statistical measure of the importance of a word to a document in a collection. It consists of two parts: term frequency (TF, the number of occurrences of a term in a document) and inverse document frequency (IDF, the logarithm of the total number of documents in the collection divided by the number of documents that contain the specific term). For each community, we remove stop words in the sentences, and regard each community as a document. We take the top-3 words with the largest TF-IDF scores as the representative aspect of the community. Table 5 shows the comparison aspects of four communities for comparing postgresql with mysql. The representative keywords directly show that the comparison between postgresql and mysql mainly focuses on four aspects: speed, security, popularity, and usability.
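The sketch below strings the two steps together on a toy graph: Girvan-Newman community detection over the sentence-similarity graph (here via networkx), then TF-IDF keywords per community (via scikit-learn). The sentences, edges and stop-word handling are illustrative assumptions, and we simply take the first partition the algorithm yields rather than the paper's exact stopping criterion.

# Sketch: cluster similar comparative sentences with Girvan-Newman, then label each
# community with its top TF-IDF keywords.
import networkx as nx
from networkx.algorithms.community import girvan_newman
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

sentences = [
    "postgresql is slower than mysql",                           # 0: speed
    "postgresql run much faster than mysql",                     # 1: speed
    "postgresql seem to better than mysql in terms of speed",    # 2: speed
    "postgresql has fewer security issues than mysql",           # 3: security
    "mysql provides less safety features than postgresql",       # 4: security
    "postgresql offers more security functionality than mysql",  # 5: security
]
# Edges connect sentence pairs whose similarity score exceeds 0.55 (toy values here).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

graph = nx.Graph()
graph.add_nodes_from(range(len(sentences)))
graph.add_edges_from(edges)

communities = next(girvan_newman(graph))  # first split removes the weak cross-cluster edge

# Exclude the technology pair itself so that aspect keywords dominate.
stop_words = list(ENGLISH_STOP_WORDS) + ["postgresql", "mysql"]
vectorizer = TfidfVectorizer(stop_words=stop_words)
documents = [" ".join(sentences[i] for i in sorted(c)) for c in communities]
tfidf = vectorizer.fit_transform(documents).toarray()
terms = vectorizer.get_feature_names_out()

for community, row in zip(communities, tfidf):
    top3 = [terms[i] for i in row.argsort()[::-1][:3]]
    print(sorted(community), top3)  # e.g. [0, 1, 2] with keywords like 'speed', 'faster', 'slower'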
4 IMPLEMENTATION
4.1 Dataset
We take the latest Stack Overflow data dump (released on 13 March 2018) as the data source. It contains 14,995,834 questions, 23,399,083 answers, and 50,812 unique tags. With the approach in Section 2, we collect in total 14,876 pairs of comparable technologies. Among these technologies, we extract 14,552 comparative sentences for 2,074 pairs of comparable technologies. We use these technologies and comparative sentences to build a knowledge base for technology comparison.
4.2 Tool Support
In addition to the approach itself, we also implement a practical tool² for developers. With the knowledge base of comparable technologies and their comparative sentences mined from Stack Overflow, our site can return an informative and aggregated view of comparative sentences in different comparison aspects for comparable-technology queries. In addition, the tool links each comparative sentence to its corresponding Stack Overflow post so that users can easily find more detailed content.
5 EXPERIMENT
In this section, we evaluate each step of our approach. As there is no ground truth for technology comparison, we manually check the results of each step or build the ground truth ourselves. As it is straightforward to judge whether a tag is of a certain category from its tag description, whether two technologies are comparable, and whether a sentence is a comparative sentence, we recruit two Master students to manually check the results of these three steps. Only results on which they both agree are regarded as ground truth for computing the relevant accuracy metrics; results without consensus are given to a third judge, a PhD student with more experience. All three students are majoring in computer science and computer engineering in our school, and they have diverse research and engineering backgrounds with different software tools and programming languages in their work. In addition, we release all experiment data and results on our website³.
²https://difftech.herokuapp.com/

Table 4: Examples of similar comparative sentences by Word Mover's Distance

vmware & virtualbox:
  "Virtualbox is slower than vmware."
  "In my experience I've found that vmware seems to be faster than virtualbox."
strncpy & strcpy:
  "In general strncpy is a safer alternative to strcpy."
  "So that the strncpy is more secure than strcpy."
google-chrome & safari:
  "Safari still uses the older Webkit while Chrome uses a more current one."
  "Google Chrome also uses an earlier version of Webkit than Safari."
quicksort & mergesort:
  "Mergesort would use more space than quicksort."
  "Quicksort is done in place and doesn't require allocating memory, unlike mergesort."
nginx & apache:
  "Serving static files with nginx is much more efficient than with apache."
  "There seems to be a consensus that nginx serves static content faster than apache."

Table 5: The representative keywords for clusters of postgresql and mysql

speed, slower, faster:
  "In most regards, postgresql is slower than mysql especially when it comes to fine tuning in the end."
  "I did a simple performance test and I noticed postgresql is slower than mysql."
  "According to my own experience, postgresql run much faster than mysql."
  "Postgresql seem to better than mysql in terms of speed."
security, safety, functionality:
  "Traditionally postgresql has fewer security issues than mysql."
  "Postgresql offers more security functionality than mysql."
  "Mysql provides less safety features than postgresql."
popular:
  "While postgresql is less popular than mysql, most of the serious web hosting supports it."
  "Though mysql is more popular than postgresql but instagram is using postgresql maybe due to these reasons."
  "It's a shame postgresql isn't more popular than mysql, since it supports exactly this feature out-of-the-box."
easier, simplicity:
  "Mysql is more widely supported and a little easier to use than postgresql."
  "Postgresql specifically has gotten easier to manage while mysql has lost some of the simplicity."
  "However, people often argue that postgresql is easier to use than mysql."
5.1 Accuracy of Extracting Comparable Technologies
This section reports our evaluation of the accuracy of tag category identification, the importance of the tag category for filtering out irrelevant technologies, and the impact of the word embedding models and their hyperparameters.
5.1.1 The Accuracy of Tag Category. From the 33,306 tags whose category is extracted by our method, we randomly sample 1,000 tags whose categories are determined using the NLP method, and another 1,000 tags whose categories are determined by the dictionary look-up method (see Section 2.2). Among the 1,000 tags sampled from the NLP method, the categories of 838 (83.8%) tags are correctly extracted by the proposed method. For the 1,000 tags sampled from the dictionary look-up method, the categories of 788 (78.8%) tags are correct.
³https://sites.google.com/view/difftech/
According to our observation, two reasons lead to the erroneous tag categories. First, some tag definition sentences are complex, which can lead to erroneous POS tagging results. For example, the TagWiki of the tag rpy2 states that "RPy is a very simple, yet robust, Python interface to the R Programming Language". The default POS tagging recognizes simple as a noun, which is then regarded as the category by our method. Second, the dictionary look-up method sometimes makes mistakes, as the matched category may not be the real category. For example, the TagWiki of the tag honeypot states "A trap set to detect or deflect attempts to hack a site or system", and our approach matches system as the category of honeypot.
5.1.2 The Importance of Tag Category. To check the importance of the tag category for accurate comparable technology extraction, we set up two methods: one uses word embedding with tag category filtering, and the other uses only word embedding. The word embedding model in both methods is the skip-gram model with an embedding dimension of 800. We randomly sample 150 technology pairs extracted by each method, and manually check whether each extracted technology pair is comparable or not. The results show that the performance of the model with tag category filtering (90.7%) is much better than that without it (29.3%).
5.1.3 The Impact of the Parameters of Word Embedding. There are two important parameters for the word embedding model, and we test their impact on the performance of our method.
First, we compare the performance of CBOW and skip-gram mentioned in Section 2.1 by randomly sampling 150 technology pairs extracted by each model under the same parameter setting (word embedding dimension of 400). The results show that skip-gram (90.7%) outperforms CBOW (88.7%), but the difference is marginal. Second, we randomly sample 150 technology pairs extracted by the skip-gram model with different word embedding dimensions, and manually check the accuracy. From dimension 200 to 1000 with a step of 200, the accuracy is 70.7%, 72.7%, 81.3%, 90.7%, and 87.3%, respectively. We can see that the model with a word embedding dimension of 800 achieves the best performance. Therefore, we take the skip-gram model with a word embedding dimension of 800 as the word embedding model to obtain the comparable technologies in this work.
5.2 Accuracy and Coverage of Comparative Sentences
We evaluate the accuracy and coverage of our approach in finding comparative sentences from the corpus. We first randomly sample 300 sentences (50 sentences for each comparative sentence pattern in Table 2) extracted by our model. We manually check the accuracy of the sampled sentences, and Table 6 shows the results. The overall accuracy of comparative sentence extraction is 83.7%, and our approach is especially accurate for the first four patterns. The last two patterns do not work as well due to their relatively loose conditions.

Table 6: The accuracy of comparative sentence extraction

No. | Pattern            | #right | #wrong | Accuracy
1   | TECH * VBZ * JJR   | 44     | 6      | 88%
2   | TECH * VBZ * RBR   | 45     | 5      | 90%
3   | JJR * CIN * TECH   | 43     | 7      | 86%
4   | RBR JJ * CIN TECH  | 47     | 3      | 94%
5   | CV * CIN TECH      | 37     | 13     | 74%
6   | CV VBG TECH        | 35     | 15     | 70%
Total                    | 251    | 49     | 83.7%
We further examine the wrongly extracted comparative sentences and find that most errors are caused by incorrect comparable technologies extracted in Section 2. For example, implode and explode are not comparable technologies, but they are mentioned in the sentence "I'm not sure why you'd serialize it in php either because implode and explode would be more appropriate". In addition, although some sentences do not contain a question mark, they are actually interrogative sentences, such as "I also wonder if postgresql will be a win over mysql".
5.3 Accuracy of Clustering Comparative Sentences
We evaluate the performance of our opinion clustering method by comparing it with baseline methods.
5.3.1 Baseline. We set up two baselines to compare with our comparative sentence clustering method. The first baseline is the traditional TF-IDF [40] with K-means [23], and the second baseline is based on the document-to-vector deep learning model (i.e., Doc2vec [29]) with K-means. Both methods first convert the comparative sentences for a pair of comparable technologies into vectors, by TF-IDF and Doc2vec respectively. Then both methods run the K-means algorithm to cluster the sentence vectors into N clusters. To make the baselines as competitive as possible, we set N to the cluster number of the ground truth. In contrast, our method determines its cluster number by community detection, which may differ from the cluster number of the ground truth.
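For reference, a minimal sketch of the TF-IDF + K-means baseline is shown below (the sentences and cluster count are toy values); the Doc2vec baseline is analogous, with Doc2vec vectors in place of the TF-IDF matrix.

# Sketch: TF-IDF + K-means baseline with N fixed to the ground-truth cluster number.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "ant is slower than maven",
    "maven builds are faster than ant",
    "I prefer ant over maven for small projects",
    "maven is easier to configure than ant",
]
N = 2  # taken from the ground truth for this technology pair

tfidf_matrix = TfidfVectorizer().fit_transform(sentences)
labels = KMeans(n_clusters=N, n_init=10, random_state=0).fit_predict(tfidf_matrix)
print(labels)  # cluster id assigned to each comparative sentence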
5.3.2 Ground Truth. As there is no ground truth for clustering comparative sentences, we ask the two Master students mentioned before to manually build a small-scale ground truth. We randomly sample 15 pairs of comparable technologies with different numbers of comparative sentences. For each technology pair, the two students read each comparative sentence and individually create several clusters for these comparative sentences. Note that some comparative sentences are unique, without any similar comparative sentence, and we put all those sentences into one cluster. The two students then discuss the clustering results with the Ph.D. student and change the clusters accordingly. Finally, they reach an agreement for 12 pairs of comparable technologies. We take these 12 pairs as the ground truth, whose details can be seen in Table 7.

Table 7: Ground truth for evaluating clustering results

No. | Technology pair                 | #comparative sentences | #clusters
1   | compiled & interpreted language | 27 | 5
2   | sortedlist & sorteddictionary   | 11 | 4
3   | ant & maven                     | 47 | 7
4   | pypy & cpython                  | 51 | 3
5   | google-chrome & safari          | 35 | 6
6   | quicksort & mergesort           | 54 | 4
7   | lxml & beautifulsoup            | 32 | 4
8   | awt & swing                     | 30 | 3
9   | jackson & gson                  | 31 | 3
10  | swift & objective-c             | 72 | 10
11  | jruby & mri                     | 19 | 3
12  | memmove & memcpy                | 21 | 3
5.3.3 Evaluation Metrics. Given the ground-truth clusters, many metrics have been proposed in the literature to evaluate clustering performance. In this work, we adopt the Adjusted Rand Index (ARI) [25], Normalized Mutual Information (NMI) [47], homogeneity, completeness, V-measure [39], and the Fowlkes-Mallows Index (FMI) [20]. For all six metrics, a higher value represents better clustering performance. For each pair of comparable technologies, we take all comparative sentences as a fixed list, with $G$ as the ground-truth cluster assignment and $C$ as the algorithm's cluster assignment.
Adjusted Rand Index (ARI) measures the similarity between two partitions in a statistical way. It first calculates the raw Rand Index (RI) by

$$RI = \frac{a + b}{C_N^2}$$

where $a$ is the number of pairs of elements that are in the same cluster in $G$ and also in the same cluster in $C$, $b$ is the number of pairs of elements that are in different clusters in $G$ and also in different clusters in $C$, and $C_N^2$ is the total number of possible pairs in the dataset (without ordering), where $N$ is the number of comparative sentences. To guarantee that random label assignments get a value close to zero, ARI is defined as

$$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}$$

where $E[RI]$ is the expected value of $RI$.
Normalized Mutual Information (NMI) measures the mutual information between the ground-truth labels $G$ and the algorithm's clustering labels $C$, followed by a normalization:

$$NMI(G, C) = \frac{MI(G, C)}{\sqrt{H(G)\, H(C)}}$$

where $H(G)$ is the entropy of $G$, i.e., $H(G) = -\sum_{i=1}^{|G|} P(i) \log(P(i))$ with $P(i) = |G_i| / N$ the probability that an object picked at random falls into class $G_i$, and $MI(G, C)$ is the mutual information between $G$ and $C$:

$$MI(G, C) = \sum_{i=1}^{|G|} \sum_{j=1}^{|C|} P(i, j) \log\!\left(\frac{P(i, j)}{P(i)\, P(j)}\right)$$
Homogeneity (HOM) measures the extent to which each cluster contains only members of a single class:

$$h = 1 - \frac{H(G \mid C)}{H(G)}$$

Completeness (COM) measures the extent to which all members of a given class are assigned to the same cluster:

$$c = 1 - \frac{H(C \mid G)}{H(C)}$$

where $H(G \mid C)$ is the conditional entropy of the ground-truth classes given the algorithm's cluster assignments.

V-measure (V-M) is the harmonic mean of homogeneity and completeness:

$$v = \frac{2 \times h \times c}{h + c}$$

Fowlkes-Mallows Index (FMI) is defined as the geometric mean of the pairwise precision and recall:

$$FMI = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}}$$

where $TP$ is the number of true positives (i.e., pairs of comparative sentences that belong to the same cluster in both the ground truth and the algorithm's prediction), $FP$ is the number of false positives (i.e., pairs that belong to the same cluster in the algorithm's prediction but not in the ground truth), and $FN$ is the number of false negatives (i.e., pairs that belong to the same cluster in the ground truth but not in the algorithm's prediction).
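All six metrics are available in scikit-learn, so a sketch of the evaluation step could look as follows (the label vectors here are toy values standing in for the ground-truth and predicted cluster assignments of one technology pair):

# Sketch: compute ARI, NMI, homogeneity, completeness, V-measure and FMI with scikit-learn.
from sklearn import metrics

ground_truth = [0, 0, 0, 1, 1, 2, 2, 2]   # G: ground-truth cluster of each comparative sentence
predicted    = [0, 0, 1, 1, 1, 2, 2, 2]   # C: cluster assigned by the algorithm

ari = metrics.adjusted_rand_score(ground_truth, predicted)
nmi = metrics.normalized_mutual_info_score(ground_truth, predicted)
hom, com, v_measure = metrics.homogeneity_completeness_v_measure(ground_truth, predicted)
fmi = metrics.fowlkes_mallows_score(ground_truth, predicted)
print(f"ARI={ari:.2f} NMI={nmi:.2f} HOM={hom:.2f} COM={com:.2f} V-M={v_measure:.2f} FMI={fmi:.2f}")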
5.3.4 Overall Performance. Table 8 shows the evaluation results. TF-IDF with K-means has similar performance to Doc2vec with K-means, but our model significantly outperforms both baselines on all six metrics.

Table 8: Clustering performance

Method          | ARI   | NMI  | HOM  | COM  | V-M  | FMI
TF-IDF+Kmeans   | 0.12  | 0.28 | 0.29 | 0.27 | 0.28 | 0.41
Doc2vec+Kmeans  | -0.01 | 0.11 | 0.10 | 0.14 | 0.11 | 0.43
Our model       | 0.66  | 0.73 | 0.75 | 0.72 | 0.73 | 0.79

According to our inspection of the detailed results, there are two reasons why our model outperforms the two baselines. First, our model can capture the semantic meaning of comparative sentences. TF-IDF can only find similar sentences that use the same words and counts similar words like "secure" and "safe" as unrelated, while the sentence vector from Doc2vec is easily influenced by noise as it takes all words in the sentence into consideration. Second, constructing a graph of similar sentences in our model explicitly encodes the sentence relationships, and community detection on the graph can then easily group similar sentences into clusters. In contrast, for the two baselines, the error introduced by TF-IDF and Doc2vec is accumulated and amplified by K-means in the clustering phase.
6 USEFULNESS EVALUATION
The experiments in Section 5 have shown the accuracy of our approach. In this section, we further demonstrate its usefulness. According to our observation of Stack Overflow, there are some questions discussing comparable technologies, such as "What is the difference between Swing and AWT?". We demonstrate the usefulness of the technology-comparison knowledge our approach distills from Stack Overflow discussions by checking how well the distilled knowledge can answer such questions.
6.1 Evaluation Procedures
We use the names of comparable technologies together with several keywords such as compare, vs, and difference to search for questions in Stack Overflow. We then manually check which of them are truly about comparable technology comparison, and randomly sample five questions that discuss comparable technologies in different categories and have at least five answers. The testing dataset can be seen in Table 9.
We then ask the two Master students to read each sentence in all answers and cluster all sentences into several clusters which represent developers' opinions on different aspects. To make the data as valid as possible, they again first carry out the clustering individually and then reach an agreement after discussion. For each comparative opinion in the answers, we manually check whether that opinion also appears in the knowledge base of comparative sentences extracted by our method. To make this study fair, our method does not extract comparative sentences from the answers of the questions used in this experiment.
6.2 Results
Table 10 shows the evaluation results. We can see that most comparison aspects (72%) can be covered by our knowledge base. For two questions (#5970383 and #46585), the technology-comparison knowledge distilled by our method covers all of the comparison aspects in the original answers, such as speed, reliability and data size for comparing POST and GET. For the other three questions, our model can still cover more than half of the comparison aspects. We miss some comparison aspects for these three questions, such as "One psychological reason that has not been given is simply that Quicksort is more cleverly named, i.e. good marketing.", "The VMWare Workstation client provides a nicer end-user experience (subjective, I know...)" and "Another statement which I saw is that swing is MVC based and awt is not.". Such opinions are either too subjective or too detailed and rarely appear again in other Stack Overflow discussions, which is why they are not in our knowledge base.
Table 9: Comparative questions

Question ID | Question title                                | Tech pair             | Tech category | #answers
70402       | Why is quicksort better than mergesort?       | quicksort & mergesort | Algorithm     | 29
5970383     | Difference between TCP and UDP                | tcp & udp             | Protocol      | 9
630179      | Benchmark: VMware vs Virtualbox               | vmware & virtualbox   | IDE           | 13
408820      | What is the difference between Swing and AWT? | swing & awt           | Library       | 8
46585       | When do you use POST and when do you use GET? | post & get            | Method        | 28

Table 10: Distilled knowledge by our approach versus original answers

Question ID | #Aspects | #Covered   | #Unique in our model
70402       | 6        | 4 (66.67%) | 2
5970383     | 3        | 3 (100%)   | 5
630179      | 7        | 4 (57.1%)  | 1
408820      | 5        | 3 (60%)    | 4
46585       | 4        | 4 (100%)   | 2
Total       | 25       | 18 (72%)   | 14

Apart from the comparison aspects that appear in the original answers, our tool can provide some unique opinions from other aspects, such
as "In my experience, udp based code is generally less complex than tcp based code" for comparing tcp and udp, "however I found that vmware is much more stable in full screen resolution to handle the iphone connection via usb" for comparing vmware and virtualbox, and "GET would obviously allow for a user to change the value a lot easier than POST" for comparing post and get. As seen in Table 10, our model provides more than one unique comparison aspect that is not in the existing answers for each technology pair. Therefore, our knowledge base can be a good complement to these existing technology-comparison questions and answers. Furthermore, our knowledge base contains the comparison knowledge of 2,074 pairs of comparable technologies, many of which have not been explicitly asked about and discussed in Stack Overflow, such as swift and objective-c, or nginx and apache.
7 RELATED WORK
Finding similar software artefacts can help developers migrate from one tool to another that is more suitable for their requirements. But it is a challenging task to identify similar software artefacts from the existing large pool of candidates, and much research effort has been put into this domain. Different methods have been adopted to mine similar artefacts, ranging from high-level software [33, 43], mobile applications [19, 31] and GitHub projects [49] to lower-level third-party libraries [11, 13, 42], APIs [22, 37], code snippets [41], and Q&A questions [18]. Compared with these research studies, the software technologies mined in this work have a much broader scope, including not only software-specific artefacts but also general software concepts (e.g., algorithms, protocols) and tools (e.g., IDEs).
Given a list of similar technologies, developers may further compare and contrast them for the final selection. Some researchers investigate such comparisons, but the comparison is highly domain-specific, such as software for traffic simulation [26], regression models [24], and x86 virtualization [7]. Michail and Notkin [34] assess different third-party libraries by matching similar components (such as classes and functions) across similar libraries, but this only works for library comparison and cannot easily be extended to other higher- or lower-level technologies in software engineering. Instead, we find that developers' preference for certain software technologies depends highly on other developers' usage experience and reports of similar technology comparisons. Along these lines, Uddin and Khomh [45, 46] extract API opinion sentences in different aspects to show developers' sentiment toward an API, and Li et al. [30] adopt NLP methods to distill comparative user reviews about similar mobile apps. Different from their work, we first explicitly extract a large pool of comparable technologies. In addition, apart from extracting comparative sentences, we further organize them into different clusters and represent each cluster with keywords to help developers understand comparative opinions more easily.
Finally, it is worth mentioning some related practical projects. SimilarWeb [6] is a website that provides both user engagement statistics and similar competitors for websites and mobile applications. AlternativeTo [4] is a social software recommendation website in which users can find alternatives to a given piece of software based on user recommendations. SimilarTech [5] is a site that recommends analogical third-party libraries across different programming languages. These websites can help users find similar or alternative websites or software applications, but without detailed comparison.
8 CONCLUSION AND FUTURE WORK
In this paper, we present an automatic approach to distill and aggregate comparative opinions about comparable technologies from Q&A websites. We first obtain a large pool of comparable technologies by incorporating categorical knowledge into the word embedding of tags in Stack Overflow, then locate comparative sentences about these technologies by POS-tag based pattern matching, and finally organize the comparative sentences into clusters for easier understanding. The evaluation shows that our system covers a large set of comparable technologies and their corresponding comparative sentences with high accuracy. We also demonstrate the potential of our system to answer questions about comparing comparable technologies, because the technology comparison knowledge mined using our system largely overlaps with the original answers in Stack Overflow.
Apart from comparative sentences that explicitly mention both comparable technologies, some comparative opinions are expressed less directly. For example, a developer may express opinions about one technology in one paragraph while discussing the other technology in the next paragraph. Therefore, we will improve our system to distill technology comparison knowledge at the post level rather than only at the sentence level. In addition, we plan to summarize higher-level opinions or preferences from individual comparative sentences for easier understanding.
REFERENCES
[1] 2003. Alphabetical list of part-of-speech tags used in the Penn Treebank Project. https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html. Accessed: 2018-02-02.
[2] 2017. Millions of Queries per Second: PostgreSQL and MySQL's Peaceful Battle at Today's Demanding Workloads. https://goo.gl/RXVjkB/. Accessed: 2018-04-05.
[3] 2017. MySQL vs Postgres. https://www.upguard.com/articles/postgres-vs-mysql/. Accessed: 2018-04-05.
[4] 2018. AlternativeTo - Crowdsourced software recommendations. https://alternativeto.net/. Accessed: 2018-04-05.
[5] 2018. SimilarTech: Find alternative libraries across languages. https://graphofknowledge.appspot.com/similartech/. Accessed: 2018-04-05.
[6] 2018. SimilarWeb. https://www.similarweb.com/. Accessed: 2018-04-05.
[7] Keith Adams and Ole Agesen. 2006. A comparison of software and hardware techniques for x86 virtualization. ACM SIGARCH Computer Architecture News 34, 5 (2006), 2–13.
[8] Lingfeng Bao, Jing Li, Zhenchang Xing, Xinyu Wang, Xin Xia, and Bo Zhou. 2017. Extracting and analyzing time-series HCI data from screen-captured task videos. Empirical Software Engineering 22, 1 (2017), 134–174.
[9] Anton Barua, Stephen W Thomas, and Ahmed E Hassan. 2014. What are developers talking about? An analysis of topics and trends in Stack Overflow. Empirical Software Engineering 19, 3 (2014), 619–654.
[10] Steven Bird and Edward Loper. 2004. NLTK: the natural language toolkit. In Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, 31.
[11] Chunyang Chen, Sa Gao, and Zhenchang Xing. 2016. Mining analogical libraries in Q&A discussions—incorporating relational and categorical knowledge into word embedding. In Software Analysis, Evolution, and Reengineering (SANER), 2016 IEEE 23rd International Conference on, Vol. 1. IEEE, 338–348.
[12] Chunyang Chen and Zhenchang Xing. 2016. Mining technology landscape from Stack Overflow. In Proceedings of the 10th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. ACM, 14.
[13] Chunyang Chen and Zhenchang Xing. 2016. SimilarTech: automatically recommend analogical libraries across different programming languages. In Automated Software Engineering (ASE), 2016 31st IEEE/ACM International Conference on. IEEE, 834–839.
[14] Chunyang Chen and Zhenchang Xing. 2016. Towards correlating search on Google and asking on Stack Overflow. In Computer Software and Applications Conference (COMPSAC), 2016 IEEE 40th Annual, Vol. 1. IEEE, 83–92.
[15] Chunyang Chen, Zhenchang Xing, and Lei Han. 2016. TechLand: Assisting technology landscape inquiries with insights from Stack Overflow. In Software Maintenance and Evolution (ICSME), 2016 IEEE International Conference on. IEEE, 356–366.
[16] Chunyang Chen, Zhenchang Xing, and Yang Liu. 2017. By the Community & For the Community: A Deep Learning Approach to Assist Collaborative Editing in Q&A Sites. Proceedings of the ACM on Human-Computer Interaction 1, 32 (2017), 1–32.
[17] Chunyang Chen, Zhenchang Xing, and Ximing Wang. 2017. Unsupervised software-specific morphological forms inference from informal discussions. In Proceedings of the 39th International Conference on Software Engineering. IEEE Press, 450–461.
[18] Guibin Chen, Chunyang Chen, Zhenchang Xing, and Bowen Xu. 2016. Learning a dual-language vector space for domain-specific cross-lingual question retrieval. In Automated Software Engineering (ASE), 2016 31st IEEE/ACM International Conference on. IEEE, 744–755.
[19] Ning Chen, Steven CH Hoi, Shaohua Li, and Xiaokui Xiao. 2015. SimApp: A framework for detecting similar mobile applications by online kernel learning. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining. ACM, 305–314.
[20] Edward B Fowlkes and Colin L Mallows. 1983. A method for comparing two hierarchical clusterings. Journal of the American Statistical Association 78, 383 (1983), 553–569.
[21] Michelle Girvan and Mark EJ Newman. 2002. Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99, 12 (2002), 7821–7826.
[22] Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim. 2017. DeepAM: Migrate APIs with multi-modal sequence to sequence learning. arXiv preprint arXiv:1704.07734 (2017).
[23] John A Hartigan and Manchek A Wong. 1979. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics) 28, 1 (1979), 100–108.
[24] Nicholas J Horton and Stuart R Lipsitz. 2001. Multiple imputation in practice: comparison of software packages for regression models with missing variables. The American Statistician 55, 3 (2001), 244–254.
[25] Lawrence Hubert and Phipps Arabie. 1985. Comparing partitions. Journal of Classification 2, 1 (1985), 193–218.
[26] Steven L Jones, Andrew J Sullivan, Naveen Cheekoti, Michael D Anderson, and D Malave. 2004. Traffic simulation software comparison study. UTCA Report 2217 (2004).
[27] Jun'ichi Kazama and Kentaro Torisawa. 2007. Exploiting Wikipedia as external knowledge for named entity recognition. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). 698–707.
[28] Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In International Conference on Machine Learning. 957–966.
[29] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning. 1188–1196.
[30] Yuanchun Li, Baoxiong Jia, Yao Guo, and Xiangqun Chen. 2017. Mining User Reviews for Mobile App Comparisons. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1, 3 (2017), 75.
[31] Mario Linares-Vásquez, Andrew Holtzhauer, and Denys Poshyvanyk. 2016. On automatically detecting similar Android apps. In Program Comprehension (ICPC), 2016 IEEE 24th International Conference on. IEEE, 1–10.
[32] Haibin Ling and Kazunori Okada. 2007. An efficient earth mover's distance algorithm for robust histogram comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 5 (2007), 840–853.
[33] Collin McMillan, Mark Grechanik, and Denys Poshyvanyk. 2012. Detecting similar software applications. In Proceedings of the 34th International Conference on Software Engineering. IEEE Press, 364–374.
[34] Amir Michail and David Notkin. 1999. Assessing software libraries by browsing similar classes, functions and relationships. In Proceedings of the 21st International Conference on Software Engineering. ACM, 463–472.
[35] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[36] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[37] Trong Duc Nguyen, Anh Tuan Nguyen, Hung Dang Phan, and Tien N Nguyen. 2017. Exploring API embedding for API usages and applications. In Software Engineering (ICSE), 2017 IEEE/ACM 39th International Conference on. IEEE, 438–449.
[38] Or Pele and Michael Werman. 2009. Fast and robust earth mover's distances. In Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 460–467.
[39] Andrew Rosenberg and Julia Hirschberg. 2007. V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
[40] Karen Sparck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28, 1 (1972), 11–21.
[41] Fang-Hsiang Su, Jonathan Bell, Gail Kaiser, and Simha Sethumadhavan. 2016. Identifying functionally similar code in complex codebases. In Program Comprehension (ICPC), 2016 IEEE 24th International Conference on. IEEE, 1–10.
[42] Cédric Teyton, Jean-Rémy Falleri, and Xavier Blanc. 2013. Automatic discovery of function mappings between similar libraries. In Reverse Engineering (WCRE), 2013 20th Working Conference on. IEEE, 192–201.
[43] Ferdian Thung, David Lo, and Lingxiao Jiang. 2012. Detecting similar applications with collaborative tagging. In Software Maintenance (ICSM), 2012 28th IEEE International Conference on. IEEE, 600–603.
[44] Christoph Treude, Ohad Barzilay, and Margaret-Anne Storey. 2011. How do programmers ask and answer questions on the web?: NIER track. In Software Engineering (ICSE), 2011 33rd International Conference on. IEEE, 804–807.
[45] Gias Uddin and Foutse Khomh. 2017. Automatic summarization of API reviews. In Automated Software Engineering (ASE), 2017 32nd IEEE/ACM International Conference on. IEEE, 159–170.
[46] Gias Uddin and Foutse Khomh. 2017. Opiner: an opinion search and summarization engine for APIs. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. IEEE Press, 978–983.
[47] Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2010. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research 11, Oct (2010), 2837–2854.
[48] Deheng Ye, Zhenchang Xing, Chee Yong Foo, Zi Qun Ang, Jing Li, and Nachiket Kapre. 2016. Software-specific named entity recognition in software engineering social content. In Software Analysis, Evolution, and Reengineering (SANER), 2016 IEEE 23rd International Conference on, Vol. 1. IEEE, 90–101.
[49] Yun Zhang, David Lo, Pavneet Singh Kochhar, Xin Xia, Quanlai Li, and Jianling Sun. 2017. Detecting similar repositories on GitHub. In Software Analysis, Evolution and Reengineering (SANER), 2017 IEEE 24th International Conference on. IEEE, 13–23.