Tell Them Apart: Distilling Technology Differences from Crowd-Scale Comparison Discussions
Yi Huang, Australian National University, Australia (u6039034@anu.edu.au)
Chunyang Chen, Faculty of Information Technology, Monash University, Australia (chunyang.chen@monash.edu)
Zhenchang Xing, Australian National University, Australia (zhenchang.xing@anu.edu.au)
Tian Lin, Nanyang Technological University, Singapore
Yang Liu, Nanyang Technological University, Singapore (yangliu@ntu.edu.sg)
ABSTRACT
Developers can use different technologies for many software development tasks in their work. However, when faced with several technologies with comparable functionalities, it is not easy for developers to select the most appropriate one, as comparing technologies by trial and error is time-consuming. Instead, developers can resort to expert articles, read official documents or ask questions on Q&A sites for technology comparison, but it is hit-or-miss to get a comprehensive comparison as online information is often fragmented or contradictory. To overcome these limitations, we propose the diffTech system, which exploits the crowdsourced discussions from Stack Overflow and assists technology comparison with an informative summary of different comparison aspects. We first build a large database of comparable software technologies by mining tags in Stack Overflow, and locate comparative sentences about comparable technologies with NLP methods. We further mine prominent comparison aspects by clustering similar comparative sentences and representing each cluster with its keywords. The evaluation demonstrates both the accuracy and usefulness of our model, and we implement a practical website for public use.
CCS CONCEPTS
• Information systems → Data mining; • Software and its engineering → Software libraries and repositories;
KEYWORDS
differencing similar technology, Stack Overflow, NLP
ACM Reference Format:
Yi Huang, Chunyang Chen, Zhenchang Xing, Tian Lin, and Yang Liu. 2018. Tell Them Apart: Distilling Technology Differences from Crowd-Scale Comparison Discussions. In Proceedings of the 2018 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE '18), September 3–7, 2018, Montpellier, France. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3238147.3238208

Co-first and corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ASE '18, September 3–7, 2018, Montpellier, France
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5937-5/18/09...$15.00
https://doi.org/10.1145/3238147.3238208
Figure 1: A comparative sentence in a post (#1008671) that is
not explicitly for technology comparison
1 INTRODUCTION
A diverse set of technologies (e.g., algorithms, programming languages, platforms, libraries/frameworks, concepts for software engineering) [12, 15] is available for use by developers, and that set continues growing. Adopting suitable technologies can significantly accelerate the software development process and also enhance software quality. But when developers are looking for proper technologies for their tasks, they are likely to find several comparable candidates. For example, they will find bubble sort and quick sort algorithms for sorting, nltk and opennlp libraries for NLP, and Eclipse and Intellij for developing Java applications.
Faced with so many candidates, developers are expected to have a good understanding of different technologies in order to make a proper choice for their work. However, even for experienced developers, it can be difficult to keep pace with the rapid evolution of technologies. Developers can try each of the candidates in their work for comparison, but such trial-and-error assessment is time-consuming and labor-intensive. Instead, we find that the perceptions developers have of comparable technologies, and the choices they make about which technology to use, are very likely to be influenced by how other developers see and evaluate the technologies. So developers often turn to two information sources on the Web [8] to learn more about comparable technologies.
First, they read experts' articles about technology comparison like "Intellij vs. Eclipse: Why IDEA is Better". Second, developers can seek answers on Q&A websites such as Stack Overflow or Quora (e.g., "Apache OpenNLP vs NLTK"). These expert articles and community answers are indexable by search engines, thus enabling developers to find answers to their technology comparison inquiries.
However, there are two limitations with expert articles and community answers.
• Fragmented view: An expert article or community answer usually focuses on a specific aspect of some comparable technologies, and developers have to aggregate the fragmented information into a complete comparison across different aspects. For example, to compare mysql and postgresql, one article [2] contrasts their speed, while another [3] compares their reliability. Only after reading both articles can developers have a relatively comprehensive overview of these two comparable technologies.
• Diverse opinions: One expert article or community answer is based on the author's knowledge and experience. However, the knowledge and experience of developers vary greatly. For example, one developer may prefer Eclipse over Intellij because Eclipse fits his project setting better. But that setting may not be extensible to other developers. At the same time, some developers may prefer Intellij over Eclipse for other reasons. Such contradictory preferences among different opinions may confuse developers.
The above two limitations create a high barrier for developers to effectively gather useful information about technology differences on the Web in order to tell apart comparable technologies. Although developers may manually aggregate relevant information by searching and reading many web pages, that would be very hit-or-miss and time-consuming. To overcome the above limitations, we present the diffTech system that automatically distills and aggregates fragmented and trustworthy technology comparison information from the crowd-scale Q&A discussions in Stack Overflow, and assists technology comparison with an informative summary of different aspects of comparison information.
Our system is motivated by the fact that a wide range of technologies have been discussed by millions of users in Stack Overflow [14], and users often express their preferences toward a technology and compare one technology with the others in the discussions. Apart from posts explicitly about the comparison of some technologies, many comparative sentences hide in posts that are only implicitly about technology comparison. Fig. 1 shows such an example: the answer "accidentally" compares the security of POST and GET, while the question "How secure is a HTTP post?" does not explicitly ask for this comparison. Inspired by this phenomenon, we propose our system to mine and aggregate the comparative sentences in Stack Overflow discussions.
As shown in Fig. 2, we consider Stack Overflow tags as a collection of technology terms and first find comparable technologies by analyzing tag embeddings and categories. Then, our system distills and clusters comparative sentences from Q&A discussions, which very likely contain detailed comparisons between comparable technologies and sometimes even explain why users like or dislike a particular technology. Finally, we use word mover's distance [28] and community detection [21] to cluster comparative sentences into prominent aspects by which users compare the two technologies, and present the mined clusters of comparative sentences for user inspection.
As there is no ground truth for technology comparison, we manually validate the performance of each step of our approach. The experiment results confirm the accuracy of comparable technology identification (90.7%) and of distilling comparative sentences (83.7%) from Q&A discussions. By manually building the ground truth, we show that our clustering method (word mover's distance and community detection) for comparative sentences significantly outperforms the two baselines (TF-IDF with K-means and Doc2vec with K-means). Finally, we further demonstrate the usefulness of our system for answering questions about technology comparison in Stack Overflow. The results show that our system can cover the semantics of 72% of the comparative sentences in five randomly selected technology comparison questions, and also include some unique comparisons from other aspects which are not discussed in the original answers.
Figure 2: The overview of our approach
Our contributions in this work are four-fold:
• This is the first work to systematically identify comparable software-engineering technologies and distill crowd-scale comparative sentences for these technologies.
• Our method automatically distills and aggregates crowd opinions into different comparison aspects so that developers can understand technology comparison more easily.
• Our experiments demonstrate the effectiveness of our method by checking the accuracy and usefulness of each step of our approach.
• We implement our results into a practical tool and make it public to the community. Developers can benefit from the technology comparison knowledge on our website¹.
¹https://difftech.herokuapp.com/
Figure 3: The architecture of the two word embedding models: (a) the continuous skip-gram model predicts the surrounding words given the central word, and (b) the continuous bag-of-words (CBOW) model predicts the central word based on the context words. Note the differences in arrow direction between the two models.
2 MINING SIMILAR TECHNOLOGY
Studies [9, 11, 44] show that Stack Overflow tags identify the computer programming technologies that questions and answers revolve around. They cover a wide range of technologies, from algorithms (e.g., binary search, merge sort) and programming languages (e.g., python, java) to libraries and frameworks (e.g., tensorflow, django) and development tools (e.g., vim, git). In this work, we regard Stack Overflow tags as a collection of technologies that developers would like to compare. We leverage word embedding techniques to infer semantically related tags, and develop natural language methods to analyze each tag's TagWiki to determine the corresponding technology's category (e.g., algorithm, library, IDE). Finally, we build a knowledge base of comparable technologies by filtering the same-category, semantically-related tags.
2.1 Learning Tag Embeddings
Word embeddings are dense low-dimensional vector representations of words, built on the assumption that words with similar meanings tend to appear in similar contexts. Studies [11, 35] show that word embeddings are able to capture rich semantic and syntactic properties of words for measuring word similarity. In our approach, given a corpus of tag sentences, we use word embedding methods to learn the vector representation of each tag from the surrounding context of the tag in the corpus.
There are two kinds of widely-used word embedding methods [35]: the continuous skip-gram model [36] and the continuous bag-of-words (CBOW) model. As illustrated in Fig. 3, the objective of the continuous skip-gram model is to learn a word representation that is good at predicting the co-occurring words in the same sentence (Fig. 3(a)), while the CBOW model is the opposite, predicting the central word from the context words (Fig. 3(b)). Note that word order within the context window is not important for learning word embeddings.
Specically, given a sequence of training text stream
t1,t2, ..., tk
,
the objective of the continuous skip-gram model is to maximize the
following average log probability:
L=1
K
K
Õ
k=1
Õ
NjN,j,0
logp(tk+j|tk)(1)
Figure 4: POS tagging of the definition sentence of the tag Matplotlib. Tag Wiki: "Matplotlib is a plotting library for Python"; Part of Speech: NNP VBZ DT JJ NN IN NNP.
while the objective of the CBOW model is:

$$L = \frac{1}{K}\sum_{k=1}^{K} \log p(t_k \mid t_{k-N}, t_{k-N+1}, \dots, t_{k+N}) \qquad (2)$$

where $t_k$ is the central word, $t_{k+j}$ is its surrounding word at distance $j$, and $N$ indicates the window size. In our application of word embedding, a tag sentence is a training text stream, and each tag is a word. As a tag sentence is short (it has at most 5 tags), we set $N$ to 5 in our approach so that the context of one tag is all other tags in the current sentence. That is, the context window contains all other tags as the surrounding words for a given tag. Therefore, tag order does not matter in this work for learning tag embeddings.
To determine which word embedding model performs better in our comparable technology reasoning task, we carry out a comparison experiment; the details are discussed in Section 5.1.3.
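As a concrete illustration of this step, the sketch below trains skip-gram tag embeddings with gensim (4.x); it is a minimal example under our own assumptions (toy tag sentences, hypothetical variable names), not the authors' implementation.

# Minimal sketch: learn tag embeddings from tag sentences with gensim's Word2Vec.
# `tag_sentences` is a toy stand-in for the real corpus of Stack Overflow tag lists.
from gensim.models import Word2Vec

tag_sentences = [
    ["python", "nltk", "tokenize"],
    ["python", "matplotlib", "plotting"],
    ["java", "swing", "awt"],
]

model = Word2Vec(
    sentences=tag_sentences,
    vector_size=800,  # the dimension that performed best in Section 5.1.3
    window=5,         # a question has at most 5 tags, so this covers all co-occurring tags
    sg=1,             # sg=1 selects the skip-gram model (sg=0 would be CBOW)
    min_count=1,
)

print(model.wv["nltk"][:5])  # first 5 dimensions of the tag vector for "nltk"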
2.2 Mining Categorical Knowledge
In Stack Overflow, tags can be of different categories, such as programming language, library, framework, tool, API, algorithm, etc. To determine the category of a tag, we resort to the tag definition in the tag's TagWiki. The TagWiki of a tag is collaboratively edited by the Stack Overflow community. Although there are no strict formatting rules in Stack Overflow, the TagWiki description usually starts with a short sentence that defines the tag. For example, the TagWiki of the tag Matplotlib starts with the sentence "Matplotlib is a plotting library for Python". Typically, the first noun just after the be verb defines the category of the tag. For example, from the tag definition of Matplotlib, we can learn that the category of Matplotlib is library.
Based on the above observation of tag definitions, we use NLP methods [11, 27] to extract such a noun from the tag definition sentence as the category of a tag. Given the TagWiki of a tag in Stack Overflow, we extract the first sentence of the TagWiki description, and clean up the sentence by removing hyperlinks and brackets such as "{}", "()". Then, we apply Part-of-Speech (POS) tagging to the extracted sentence. POS tagging is the process of marking up a word in a text as corresponding to a particular part of speech, such as noun, verb or adjective. NLP tools usually agree on the POS tags of nouns, and we find that the POS tagger in NLTK [10] is especially suitable for our task. In NLTK, nouns are annotated by different POS tags [1] including NN (noun, singular or mass), NNS (noun, plural), NNP (proper noun, singular), and NNPS (proper noun, plural). Fig. 4 shows the results for the tag definition sentence of Matplotlib. Based on the POS tagging results, we extract the first noun (library in this example) after the be verb (is in this example) as the category of the tag. That is, the category of Matplotlib is library. Note that if the noun is some specific word such as system or development, we further check its neighborhood words to see if it is operating system or integrated development environment.
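The following sketch illustrates this category-extraction heuristic with NLTK; the helper function and its simple be-verb check are our own illustration under stated assumptions, not the authors' exact implementation.

# Sketch: extract a tag's category as the first noun after the "be" verb in the first
# TagWiki sentence. Requires the NLTK tokenizer and tagger data, e.g.
# nltk.download('punkt') and nltk.download('averaged_perceptron_tagger').
import nltk

def extract_category(definition_sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(definition_sentence))
    seen_be_verb = False
    for word, pos in tagged:
        if word.lower() in ("is", "are", "was", "were"):
            seen_be_verb = True
        elif seen_be_verb and pos in ("NN", "NNS", "NNP", "NNPS"):
            return word.lower()
    return None  # no "tag be noun phrase" form; fall back to the dictionary look-up

print(extract_category("Matplotlib is a plotting library for Python"))  # -> "library" (cf. Fig. 4)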
With this method, we obtain 318 categories for 23,658 tags (about 67% of all the tags that have a TagWiki). We manually normalize these 318 category labels, for example merging app and applications into application, libraries and lib into library, and normalizing uppercase and lowercase (e.g., API and api). As a result, we obtain 167 categories. Furthermore, we manually group these 167 categories into five general categories: programming language, platform, library, API, and concept/standard [48]. This is because the meanings of the fine-grained categories often overlap, and there is no consistent rule for the usage of these terms in the TagWiki. This generalization step is necessary, especially for the library tags, which broadly refer to tags whose fine-grained categories can be library, framework, api, toolkit, wrapper, and so on. For example, in Stack Overflow's TagWiki, junit is defined as a framework, google-visualization is defined as an API, and wxpython is defined as a wrapper. All these tags are referred to as library tags in our approach.
Although the above method obtains the tag category for the majority of the tags, the first sentence of the TagWiki of some tags is not formatted in the standard "tag be noun phrase" form. For example, the first sentence of the TagWiki of the tag itext is "Library to create and manipulate PDF documents in Java", for markermanager the tag definition sentence is "A Google Maps tool", and for ghc-pkg the tag definition sentence is "The command ghc-pkg can be used to handle GHC packages". As there is no be verb in such sentences, the above NLP method cannot return a noun phrase as the tag category. According to our observation, in most of such cases the category of the tag is still present in the sentence, but often expressed in many different ways. It is very likely that the category word appears as the first noun phrase that matches one of the existing category words in the definition sentence. Therefore, we use a dictionary look-up method to determine the category of such tags. Specifically, we use the 167 categories obtained using the above NLP method as a dictionary to recognize the category of the tags that have not been categorized by the NLP method. Given an uncategorized tag, we scan the first sentence of the tag's TagWiki from the beginning, and search for the first match of a category label in the sentence. If a match is found, the tag is categorized under the matched category. For example, the tag itext is categorized as library using this dictionary look-up method. Using the dictionary look-up method, we obtain the category for 9,648 more tags.
Note that we cannot categorize some (less than 15%) of the tags using the above NLP method and the dictionary look-up method. This is because these tags do not have a clear tag definition sentence; for example, the TagWiki of the tag richtextbox states that "The RichTextBox control enables you to display or edit RTF content", which is not a clear definition of what richtextbox is. Or no category match can be found in the tag definition sentence of some tags. For example, the TagWiki of the tag carousel states that "A rotating display of content that can house a variety of content", and we do not have the category "display" among the 167 categories collected using the NLP method. When building the comparable-technologies knowledge base, we exclude these uncategorized tags as potential candidates.
Table 1: Examples of filtering results by categorical knowledge (filtered tags shown in red in the original)

Source      | Top-5 recommendations from word embedding
nltk        | nlp, opennlp, gate, language-model, stanford-nlp
tcp         | tcp-ip, network-programming, udp, packets, tcpserver
vim         | sublimetext, vim-plugin, emacs, nano, gedit
swift       | objective-c, cocoa-touch, storyboard, launch-screen
bubble-sort | insertion-sort, selection-sort, mergesort, timsort, heapsort
2.3 Building Similar-technology Knowledge Base
Given a technology tag $t_1$ with its vector $vec(t_1)$, we first find the most similar technology $t_2$ whose vector $vec(t_2)$ is closest to it, i.e.,

$$\underset{t_2 \in T}{\arg\max}\ \cos(vec(t_1), vec(t_2)) \qquad (3)$$

where $T$ is the set of technology tags excluding $t_1$, and $\cos(u, v)$ is the cosine similarity of the two vectors.
Note that tags whose tag embedding is similar to the vector $vec(t_1)$ may not always be in the same category. For example, the tag embeddings of the tags nlp and language-model are similar to the vector $vec(nltk)$. These tags are relevant to the nltk library as they refer to some NLP concepts and tasks, but they are not libraries comparable to nltk. In our approach, we rely on the category of tags (i.e., categorical knowledge) to return only tags within the same category as candidates. Some examples can be seen in Table 1.
In practice, there could be several technologies $t_2$ comparable to the technology $t_1$. Thus, we select tags $t_2$ whose cosine similarity in Eq. 3 is above a threshold $Thresh$. Take the library nltk (an NLP library in Python) as an example: we will preserve several candidate libraries such as textblob and stanford-nlp.
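The sketch below shows how this selection could look in code; the threshold value, data structures and helper names are our own illustrative assumptions rather than the paper's implementation.

# Sketch: candidate comparable technologies = same-category tags whose tag-embedding
# cosine similarity to the source tag exceeds a threshold.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def comparable_technologies(tag, tag_vectors, tag_category, thresh=0.4):
    """tag_vectors: dict tag -> vector; tag_category: dict tag -> general category.
    The threshold 0.4 is a placeholder; the paper only states that a threshold is used."""
    source_vec, source_cat = tag_vectors[tag], tag_category.get(tag)
    candidates = []
    for other, vec in tag_vectors.items():
        if other == tag or tag_category.get(other) != source_cat:
            continue  # categorical filtering removes related-but-not-comparable tags
        similarity = cosine(source_vec, vec)
        if similarity >= thresh:
            candidates.append((other, similarity))
    return sorted(candidates, key=lambda pair: -pair[1])

# Toy example: only same-category tags survive the filter.
vectors = {"nltk": np.array([1.0, 0.0]), "opennlp": np.array([0.9, 0.1]), "nlp": np.array([0.95, 0.05])}
categories = {"nltk": "library", "opennlp": "library", "nlp": "concept"}
print(comparable_technologies("nltk", vectors, categories))  # [('opennlp', ...)]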
3 MINING COMPARATIVE OPINIONS
For each pair of comparable technologies in the knowledge base, we analyze the Q&A discussions in Stack Overflow to extract plausible comparative sentences in which Stack Overflow users express their opinions on the comparable technologies. We may obtain many comparative sentences for each pair of comparable technologies. Displaying all these sentences as a whole may make it difficult for developers to read and digest the comparison information. Therefore, we measure the similarity among the comparative sentences, and then cluster them into several groups, each of which may identify a prominent aspect of technology comparison that users are concerned with.
3.1 Extracting Comparative Sentences
There are three steps to extract comparative sentences about two technologies. We first carry out some preprocessing of the Stack Overflow post content. Then we locate the sentences that contain the names of the two technologies, and further select the comparative sentences that satisfy a set of comparative sentence patterns.
3.1.1 Preprocessing. To extract trustworthy opinions about the comparison of technologies, we consider only answer posts with positive score points. Then we split the textual content of such answer posts into individual sentences by punctuation marks such as ".", "!" and "?". We remove all sentences ending with a question mark, as we want
to extract facts instead of doubts. We lowercase all sentences to make the sentence tokens consistent with the technology names, because all tags are in lowercase.

Table 2: The 6 comparative sentence patterns

No. | Pattern            | Sequence example            | Original sentence
1   | TECH * VBZ * JJR   | innodb has 30 higher        | InnoDB has 30% higher performance than MyISAM on average.
2   | TECH * VBZ * RBR   | postgresql is a more        | Postgresql is a more correct database implementation while mysql is less compliant.
3   | JJR * CIN * TECH   | faster than coalesce        | Isnull is faster than coalesce.
4   | RBR JJ * CIN TECH  | more powerful than velocity | Freemarker is more powerful than velocity.
5   | CV * CIN TECH      | prefer ant over maven       | I prefer ant over maven personally.
6   | CV VBG TECH        | recommend using html5lib    | I strongly recommend using html5lib instead of beautifulsoup.

Table 3: Examples of aliases

Tech term          | Synonyms                                                   | Abbreviation
visual studio      | visualstudio, visual studios, visual-studio                | msvs
beautifulsoup      | beautiful soup                                             | bs4
objective-c        | objectivec, objective c                                    | objc, obj-c
depth-first search | deep first search, depth first search, depth-first-search  | dfs
postgresql         | postgre sql, posgresq, postgesql                           | pgsql
3.1.2 Locating Candidate Sentences. To locate sentences mentioning a pair of comparable technologies, using only the tag names is not enough. As posts in Stack Overflow are informal discussions about programming-related issues, users often use aliases to refer to the same technology [16]. Aliases of technologies can be abbreviations, synonyms and some frequent misspellings. For example, "javascript" is often written in many forms such as "js" (abbreviation), "java-script" (synonym) or "javascrip" (misspelling) in the discussions.
The presence of such aliases would lead to significant misses of comparative sentences if we matched technology mentions in a sentence with only tag names. Chen et al.'s work [17] builds a large thesaurus of morphological forms of software-specific terms, including abbreviations, synonyms and misspellings. Table 3 shows some examples of technology aliases in this thesaurus. Based on this thesaurus, we find 7,310 different aliases for 3,731 software technologies. These aliases help to locate more candidate comparative sentences that mention certain technologies.
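As a toy illustration (the alias table below is a hand-picked excerpt, not the full thesaurus), candidate location can be reduced to a membership test over a tag's known forms:

# Sketch: a sentence is a candidate if it mentions a technology by tag name or alias.
# Real matching should respect word boundaries; the substring check keeps the sketch short.
ALIASES = {
    "javascript": {"javascript", "js", "java-script", "javascrip"},
    "beautifulsoup": {"beautifulsoup", "beautiful soup", "bs4"},
}

def mentions(sentence, tech):
    text = sentence.lower()
    return any(alias in text for alias in ALIASES.get(tech, {tech}))

print(mentions("I would just use bs4 to parse the page", "beautifulsoup"))  # True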
3.1.3 Selecting Comparative Sentences. To identify comparative sentences among the candidate sentences, we develop a set of comparative sentence patterns. Each comparative sentence pattern is a sequence of POS tags. For example, the sequence of POS tags "RBR JJ IN" is a pattern that consists of a comparative adverb (RBR), an adjective (JJ) and subsequently a preposition (IN), such as "more efficient than", "less friendly than", etc. We extend the list of common POS tags to enhance the identification of comparative sentences. More specifically, we create three comparative POS tags: CV (comparative verbs, e.g., prefer, compare, beat), CIN (comparative prepositions, e.g., than, over), and TECH (technology reference, including the name and aliases of a technology, e.g., python, eclipse).
Based on our observations of comparative sentences, we summarise six comparative patterns. Table 2 shows these patterns and the corresponding examples of comparative sentences. To make the patterns more flexible, we use a wildcard character to represent a list of arbitrary words within the pattern. For each sentence mentioning the two comparable technologies, we obtain its POS tags and check if it matches any of the six patterns. If so, the sentence is selected as a comparative sentence.
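To make the matching step concrete, the sketch below checks one pattern from Table 2 (pattern 3, "JJR * CIN * TECH") against a candidate sentence; the tag rewriting, word lists and regular expression are our own simplified illustration, not the paper's implementation.

# Sketch: rewrite a sentence into a POS-tag sequence with the custom CIN/TECH tags,
# then regex-match the sequence against a comparative pattern.
# Requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger').
import re
import nltk

COMPARATIVE_PREPOSITIONS = {"than", "over"}                 # CIN
TECH_TERMS = {"isnull", "coalesce", "mysql", "postgresql"}  # TECH: tag names + aliases

def pos_sequence(sentence):
    tags = []
    for word, pos in nltk.pos_tag(nltk.word_tokenize(sentence.lower())):
        if word in TECH_TERMS:
            tags.append("TECH")
        elif word in COMPARATIVE_PREPOSITIONS:
            tags.append("CIN")
        else:
            tags.append(pos)
    return " ".join(tags)

# Pattern 3 from Table 2: JJR * CIN * TECH ("*" stands for any words in between).
PATTERN_3 = re.compile(r"\bJJR\b(?: \S+)*? \bCIN\b(?: \S+)*? \bTECH\b")

sequence = pos_sequence("Isnull is faster than coalesce.")
print(sequence)                          # e.g. "TECH VBZ JJR CIN TECH ."
print(bool(PATTERN_3.search(sequence)))  # expected: True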
Figure 5: An illustration of measuring the similarity of two comparative sentences
3.2 Measuring Sentence Similarity
To measure the similarity of two comparative sentences, we adopt the Word Mover's Distance [28], which is especially useful for short-text comparison. Given two sentences $S_1$ and $S_2$, we take one word $i$ from $S_1$ and one word $j$ from $S_2$, and let their word vectors be $v_i$ and $v_j$. The distance between word $i$ and word $j$ is the Euclidean distance between their vectors, $c(i, j) = \lVert v_i - v_j \rVert_2$. To avoid confusion between word and sentence distance, we refer to $c(i, j)$ as the cost associated with "traveling" from one word to another. One word $i$ in $S_1$ may move to several different words in $S_2$, but its total weight is 1, so we use $T_{ij} \ge 0$ to denote how much of word $i$ in $S_1$ travels to word $j$ in $S_2$. It costs $\sum_j T_{ij}\, c(i, j)$ to move one word $i$ entirely into $S_2$. We define the distance between the two sentences as the minimum (weighted) cumulative cost required to move all words from $S_1$ to $S_2$, i.e., $D(S_1, S_2) = \sum_{i,j} T_{ij}\, c(i, j)$.
This problem is very similar to the transportation problem, i.e., how to spend the least to transport all goods from source cities $A_1, A_2, \dots$ to target cities $B_1, B_2, \dots$. Finding such a minimum cost is a well-studied optimization problem known as the earth mover's distance [32, 38].
To use the word mover's distance in our approach, we first train a word embedding model on the post content of Stack Overflow so that we obtain a dense vector representation for each word in Stack Overflow. Word embeddings have been shown to capture rich semantic and syntactic information of words. Our approach does not compute the word mover's distance over all words in a sentence. Instead, for each comparative sentence, we extract only the keywords with POS tags that are most relevant to the comparison, including adjectives (JJ), comparative adjectives (JJR) and nouns (NN, NNS, NNP and NNPS), excluding the technologies under comparison. Then, we compute the minimal word mover's distance between the keywords in one sentence and those in the other sentence. Based on the distance, we further compute the similarity score of the two sentences by

$$similarity\_score(S_1, S_2) = \frac{1}{1 + D(S_1, S_2)}$$
Figure 6: Communities in the graph of comparative sentences
The similarity score is in the range (0, 1), and the higher the score, the more similar the two sentences. If the similarity score between two sentences is larger than a threshold, we regard them as similar. The threshold is 0.55 in this work, determined heuristically by a small-scale pilot study. We show some similar comparative sentences found by word mover's distance in Table 4.
To help readers understand the word mover's distance, Figure 5 shows an example with two comparative sentences for comparing postgresql and mysql: "Postgresql offers more security functionality than mysql" and "Mysql provides less safety features than postgresql". The keywords in the two sentences that are most relevant to the comparison are highlighted in bold. We see that the minimum distance between the two sentences is mainly the accumulation of word distances between pairs of similar words: (offers, provides), (more, less), (security, safety), and (functionality, features). As the distance between the two sentences is small, the similarity score is high even though the two sentences use rather different words and express the comparison in opposite directions.
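A minimal sketch of this scoring step is shown below, using gensim's wmdistance on pre-extracted keyword lists; the model file name and keyword lists are illustrative assumptions, and wmdistance additionally requires an optimal-transport backend such as POT.

# Sketch: similarity score of two comparative sentences from the Word Mover's Distance
# between their comparison keywords (technology names already removed).
from gensim.models import KeyedVectors

# Assumed: word vectors trained on Stack Overflow post text (hypothetical file name).
# wv = KeyedVectors.load("stackoverflow_word_vectors.kv")

def similarity_score(keywords1, keywords2, wv):
    distance = wv.wmdistance(keywords1, keywords2)  # minimum cumulative travel cost D(S1, S2)
    return 1.0 / (1.0 + distance)

s1_keywords = ["offers", "more", "security", "functionality"]
s2_keywords = ["provides", "less", "safety", "features"]
# score = similarity_score(s1_keywords, s2_keywords, wv)
# Sentence pairs with score > 0.55 are treated as similar.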
3.3 Clustering Representative Comparison Aspects
For each pair of comparable technologies, we collect a set of comparative sentences about their comparison in Section 3.1. Within these comparative sentences, we find pairs of similar sentences in Section 3.2. We take each comparative sentence as one node in a graph, and if two sentences are determined to be similar, we add an edge between them. In this way, we obtain a graph of comparative sentences for a given pair of comparable technologies.
Although some comparative sentences are very different in wording or comparison direction (examples shown in Fig. 5 and Table 4), they may still share the same comparison opinion. In graph theory, a set of highly correlated nodes is referred to as a community (cluster) in the network. Based on the sentence similarity, we cluster similar opinions by applying a community detection algorithm to the graph of comparative sentences. In this work, we use the Girvan-Newman algorithm [21], a hierarchical community detection method. It uses an iterative modularity maximization method to partition the network into a finite number of disjoint clusters that are considered as communities; each node is assigned to exactly one community. Fig. 6 shows the graph of comparative sentences for the comparison of TCP and UDP (two network protocols), in which each node is a comparative sentence and the detected communities are visualized in the same color.
As seen in Fig. 6, each community may represent a prominent comparison aspect of the two comparable technologies. But some communities may contain too many comparative sentences to understand easily. Therefore, we use TF-IDF (Term Frequency-Inverse Document Frequency) to extract keywords from the comparative sentences in one community to represent the comparison aspect of that community. TF-IDF is a statistical measure of the importance of a word to a document in a collection. It consists of two parts: term frequency (TF, the number of occurrences of a term in a document) and inverse document frequency (IDF, the logarithm of the total number of documents in the collection divided by the number of documents that contain the specific term). For each community, we remove stop words in the sentences, and regard each community as a document. We take the top-3 words with the largest TF-IDF scores as the representative aspect of the community. Table 5 shows the comparison aspects of four communities for comparing postgresql with mysql. The representative keywords directly show that the comparison between postgresql and mysql mainly focuses on four aspects: speed, security, popularity, and usability.
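The sketch below strings the two steps together on a toy graph: Girvan-Newman community detection over the sentence-similarity graph (here via networkx), then TF-IDF keywords per community (via scikit-learn). The sentences, edges and stop-word handling are illustrative assumptions, and we simply take the first partition the algorithm yields rather than the paper's exact stopping criterion.

# Sketch: cluster similar comparative sentences with Girvan-Newman, then label each
# community with its top TF-IDF keywords.
import networkx as nx
from networkx.algorithms.community import girvan_newman
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

sentences = [
    "postgresql is slower than mysql",                           # 0: speed
    "postgresql run much faster than mysql",                     # 1: speed
    "postgresql seem to better than mysql in terms of speed",    # 2: speed
    "postgresql has fewer security issues than mysql",           # 3: security
    "mysql provides less safety features than postgresql",       # 4: security
    "postgresql offers more security functionality than mysql",  # 5: security
]
# Edges connect sentence pairs whose similarity score exceeds 0.55 (toy values here).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

graph = nx.Graph()
graph.add_nodes_from(range(len(sentences)))
graph.add_edges_from(edges)

communities = next(girvan_newman(graph))  # first split removes the weak cross-cluster edge

# Exclude the technology pair itself so that aspect keywords dominate.
stop_words = list(ENGLISH_STOP_WORDS) + ["postgresql", "mysql"]
vectorizer = TfidfVectorizer(stop_words=stop_words)
documents = [" ".join(sentences[i] for i in sorted(c)) for c in communities]
tfidf = vectorizer.fit_transform(documents).toarray()
terms = vectorizer.get_feature_names_out()

for community, row in zip(communities, tfidf):
    top3 = [terms[i] for i in row.argsort()[::-1][:3]]
    print(sorted(community), top3)  # e.g. [0, 1, 2] with keywords like 'speed', 'faster', 'slower'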
4 IMPLEMENTATION
4.1 Dataset
We take the latest Stack Overflow data dump (released on 13 March 2018) as the data source. It contains 14,995,834 questions, 23,399,083 answers, and 50,812 unique tags. With the approach in Section 2, we collect in total 14,876 pairs of comparable technologies. Among these technologies, we extract 14,552 comparative sentences for 2,074 pairs of comparable technologies. We use these technologies and comparative sentences to build a knowledge base for technology comparison.
4.2 Tool Support
In addition to the approach itself, we also implement a practical tool² for developers. With the knowledge base of comparable technologies and their comparative sentences mined from Stack Overflow, our site can return an informative and aggregated view of comparative sentences in different comparison aspects for comparable-technology queries. In addition, the tool links each comparative sentence to its corresponding Stack Overflow post so that users can easily find more detailed content.
5 EXPERIMENT
In this section, we evaluate each step of our approach. As there is no ground truth for technology comparison, we manually check the results of each step or build the ground truth ourselves. As it is straightforward to judge whether a tag is of a certain category from its tag description, whether two technologies are comparable, and whether a sentence is a comparative sentence, we recruit two Master students to manually check the results of these three steps. Only results on which they both agree are regarded as ground truth for computing the relevant accuracy metrics; results without consensus are given to a third judge, a PhD student with more experience. All three students are majoring in computer science and computer engineering in our school, and they have diverse research and engineering backgrounds with different software tools and programming languages in their work. In addition, we release all experiment data and results on our website³.
²https://difftech.herokuapp.com/

Table 4: Examples of similar comparative sentences by Word Mover's Distance

vmware & virtualbox:
  "Virtualbox is slower than vmware."
  "In my experience I've found that vmware seems to be faster than virtualbox."
strncpy & strcpy:
  "In general strncpy is a safer alternative to strcpy."
  "So that the strncpy is more secure than strcpy."
google-chrome & safari:
  "Safari still uses the older Webkit while Chrome uses a more current one."
  "Google Chrome also uses an earlier version of Webkit than Safari."
quicksort & mergesort:
  "Mergesort would use more space than quicksort."
  "Quicksort is done in place and doesn't require allocating memory, unlike mergesort."
nginx & apache:
  "Serving static files with nginx is much more efficient than with apache."
  "There seems to be a consensus that nginx serves static content faster than apache."

Table 5: The representative keywords for clusters of postgresql and mysql

speed, slower, faster:
  "In most regards, postgresql is slower than mysql especially when it comes to fine tuning in the end."
  "I did a simple performance test and I noticed postgresql is slower than mysql."
  "According to my own experience, postgresql run much faster than mysql."
  "Postgresql seem to better than mysql in terms of speed."
security, safety, functionality:
  "Traditionally postgresql has fewer security issues than mysql."
  "Postgresql offers more security functionality than mysql."
  "Mysql provides less safety features than postgresql."
popular:
  "While postgresql is less popular than mysql, most of the serious web hosting supports it."
  "Though mysql is more popular than postgresql but instagram is using postgresql maybe due to these reasons."
  "It's a shame postgresql isn't more popular than mysql, since it supports exactly this feature out-of-the-box."
easier, simplicity:
  "Mysql is more widely supported and a little easier to use than postgresql."
  "Postgresql specifically has gotten easier to manage while mysql has lost some of the simplicity."
  "However, people often argue that postgresql is easier to use than mysql."
5.1 Accuracy of Extracting Comparable Technologies
This section reports our evaluation of the accuracy of tag category identification, the importance of the tag category for filtering out irrelevant technologies, and the impact of the word embedding models and their hyperparameters.
5.1.1 The Accuracy of Tag Category. From the 33,306 tags whose category is extracted by our method, we randomly sample 1,000 tags whose categories are determined using the NLP method, and another 1,000 tags whose categories are determined by the dictionary look-up method (see Section 2.2). Among the 1,000 tags sampled from the NLP method, the categories of 838 (83.8%) tags are correctly extracted by the proposed method. For the 1,000 tags sampled from the dictionary look-up method, the categories of 788 (78.8%) tags are correct.
³https://sites.google.com/view/difftech/
According to our observation, two reasons lead to the erroneous tag categories. First, some tag definition sentences are complex, which can lead to erroneous POS tagging results. For example, the TagWiki of the tag rpy2 states that "RPy is a very simple, yet robust, Python interface to the R Programming Language". The default POS tagging recognizes simple as a noun, which is then regarded as the category by our method. Second, the dictionary look-up method sometimes makes mistakes, as the matched category may not be the real category. For example, the TagWiki of the tag honeypot states "A trap set to detect or deflect attempts to hack a site or system", and our approach matches system as the category of honeypot.
5.1.2 The Importance of Tag Category. To check the importance of the tag category for accurate comparable technology extraction, we set up two methods: one uses word embedding with tag category filtering, and the other uses only word embedding. The word embedding model in both methods is the skip-gram model with an embedding dimension of 800. We randomly sample 150 technology pairs extracted by each method, and manually check whether each extracted technology pair is comparable or not. The results show that the performance of the model with tag category filtering (90.7%) is much better than that without it (29.3%).
5.1.3 The Impact of the Parameters of Word Embedding. There are two important parameters for the word embedding model, and we test their impact on the performance of our method.
First, we compare the performance of CBOW and skip-gram mentioned in Section 2.1 by randomly sampling 150 technology pairs extracted by each model under the same parameter setting (word embedding dimension of 400). The results show that skip-gram (90.7%) outperforms CBOW (88.7%), but the difference is marginal. Second, we randomly sample 150 technology pairs extracted by the skip-gram model with different word embedding dimensions, and manually check the accuracy. From dimension 200 to 1000 with a step of 200, the accuracy is 70.7%, 72.7%, 81.3%, 90.7%, and 87.3%, respectively. We can see that the model with a word embedding dimension of 800 achieves the best performance. Therefore, we take the skip-gram model with a word embedding dimension of 800 as the word embedding model to obtain the comparable technologies in this work.
5.2 Accuracy and Coverage of Comparative Sentences
We evaluate the accuracy and coverage of our approach in finding comparative sentences from the corpus. We first randomly sample 300 sentences (50 sentences for each comparative sentence pattern in Table 2) extracted by our model. We manually check the accuracy of the sampled sentences, and Table 6 shows the results. The overall accuracy of comparative sentence extraction is 83.7%, and our approach is especially accurate for the first four patterns. The last two patterns do not work as well due to their relatively loose conditions.

Table 6: The accuracy of comparative sentence extraction

No. | Pattern            | #right | #wrong | Accuracy
1   | TECH * VBZ * JJR   | 44     | 6      | 88%
2   | TECH * VBZ * RBR   | 45     | 5      | 90%
3   | JJR * CIN * TECH   | 43     | 7      | 86%
4   | RBR JJ * CIN TECH  | 47     | 3      | 94%
5   | CV * CIN TECH      | 37     | 13     | 74%
6   | CV VBG TECH        | 35     | 15     | 70%
Total                    | 251    | 49     | 83.7%
We further examine the wrongly extracted comparative sentences and find that most errors are caused by incorrect comparable technologies extracted in Section 2. For example, implode and explode are not comparable technologies, but they are mentioned in the sentence "I'm not sure why you'd serialize it in php either because implode and explode would be more appropriate". In addition, although some sentences do not contain a question mark, they are actually interrogative sentences, such as "I also wonder if postgresql will be a win over mysql".
5.3 Accuracy of Clustering Comparative Sentences
We evaluate the performance of our opinion clustering method by comparing it with baseline methods.
5.3.1 Baseline. We set up two baselines to compare with our comparative sentence clustering method. The first baseline is the traditional TF-IDF [40] with K-means [23], and the second baseline is based on the document-to-vector deep learning model (i.e., Doc2vec [29]) with K-means. Both methods first convert the comparative sentences for a pair of comparable technologies into vectors, by TF-IDF and Doc2vec respectively. Then both methods run the K-means algorithm to cluster the sentence vectors into N clusters. To make the baselines as competitive as possible, we set N to the cluster number of the ground truth. In contrast, our method determines its cluster number by community detection, which may differ from the cluster number of the ground truth.
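For reference, a minimal sketch of the TF-IDF + K-means baseline is shown below (the sentences and cluster count are toy values); the Doc2vec baseline is analogous, with Doc2vec vectors in place of the TF-IDF matrix.

# Sketch: TF-IDF + K-means baseline with N fixed to the ground-truth cluster number.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "ant is slower than maven",
    "maven builds are faster than ant",
    "I prefer ant over maven for small projects",
    "maven is easier to configure than ant",
]
N = 2  # taken from the ground truth for this technology pair

tfidf_matrix = TfidfVectorizer().fit_transform(sentences)
labels = KMeans(n_clusters=N, n_init=10, random_state=0).fit_predict(tfidf_matrix)
print(labels)  # cluster id assigned to each comparative sentence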
5.3.2 Ground Truth. As there is no ground truth for clustering comparative sentences, we ask the two Master students mentioned before to manually build a small-scale ground truth. We randomly sample 15 pairs of comparable technologies with different numbers of comparative sentences. For each technology pair, the two students read each comparative sentence and individually create several clusters for these comparative sentences. Note that some comparative sentences are unique, without any similar comparative sentence, and we put all those sentences into one cluster. The two students then discuss the clustering results with the Ph.D. student and change the clusters accordingly. Finally, they reach an agreement for 12 pairs of comparable technologies. We take these 12 pairs as the ground truth, whose details can be seen in Table 7.

Table 7: Ground truth for evaluating clustering results

No. | Technology pair                 | #comparative sentences | #clusters
1   | compiled & interpreted language | 27 | 5
2   | sortedlist & sorteddictionary   | 11 | 4
3   | ant & maven                     | 47 | 7
4   | pypy & cpython                  | 51 | 3
5   | google-chrome & safari          | 35 | 6
6   | quicksort & mergesort           | 54 | 4
7   | lxml & beautifulsoup            | 32 | 4
8   | awt & swing                     | 30 | 3
9   | jackson & gson                  | 31 | 3
10  | swift & objective-c             | 72 | 10
11  | jruby & mri                     | 19 | 3
12  | memmove & memcpy                | 21 | 3
5.3.3 Evaluation Metrics. Given the ground-truth clusters, many metrics have been proposed in the literature to evaluate clustering performance. In this work, we adopt the Adjusted Rand Index (ARI) [25], Normalized Mutual Information (NMI) [47], homogeneity, completeness, V-measure [39], and the Fowlkes-Mallows Index (FMI) [20]. For all six metrics, a higher value represents better clustering performance. For each pair of comparable technologies, we take all comparative sentences as a fixed list, with $G$ as the ground-truth cluster assignment and $C$ as the algorithm's cluster assignment.
Adjusted Rand Index (ARI) measures the similarity between two partitions in a statistical way. It first calculates the raw Rand Index (RI) by

$$RI = \frac{a + b}{C_N^2}$$

where $a$ is the number of pairs of elements that are in the same cluster in $G$ and also in the same cluster in $C$, $b$ is the number of pairs of elements that are in different clusters in $G$ and also in different clusters in $C$, and $C_N^2$ is the total number of possible pairs in the dataset (without ordering), where $N$ is the number of comparative sentences. To guarantee that random label assignments get a value close to zero, ARI is defined as

$$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}$$

where $E[RI]$ is the expected value of $RI$.
Normalized Mutual Information (NMI) measures the mutual information between the ground-truth labels $G$ and the algorithm's clustering labels $C$, followed by a normalization:

$$NMI(G, C) = \frac{MI(G, C)}{\sqrt{H(G)\, H(C)}}$$

where $H(G)$ is the entropy of $G$, i.e., $H(G) = -\sum_{i=1}^{|G|} P(i) \log(P(i))$ with $P(i) = |G_i| / N$ the probability that an object picked at random falls into class $G_i$, and $MI(G, C)$ is the mutual information between $G$ and $C$:

$$MI(G, C) = \sum_{i=1}^{|G|} \sum_{j=1}^{|C|} P(i, j) \log\!\left(\frac{P(i, j)}{P(i)\, P(j)}\right)$$
Homogeneity (HOM) measures the extent to which each cluster contains only members of a single class:

$$h = 1 - \frac{H(G \mid C)}{H(G)}$$

Completeness (COM) measures the extent to which all members of a given class are assigned to the same cluster:

$$c = 1 - \frac{H(C \mid G)}{H(C)}$$

where $H(G \mid C)$ is the conditional entropy of the ground-truth classes given the algorithm's cluster assignments.

V-measure (V-M) is the harmonic mean of homogeneity and completeness:

$$v = \frac{2 \times h \times c}{h + c}$$

Fowlkes-Mallows Index (FMI) is defined as the geometric mean of the pairwise precision and recall:

$$FMI = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}}$$

where $TP$ is the number of true positives (i.e., pairs of comparative sentences that belong to the same cluster in both the ground truth and the algorithm's prediction), $FP$ is the number of false positives (i.e., pairs that belong to the same cluster in the algorithm's prediction but not in the ground truth), and $FN$ is the number of false negatives (i.e., pairs that belong to the same cluster in the ground truth but not in the algorithm's prediction).
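All six metrics are available in scikit-learn, so a sketch of the evaluation step could look as follows (the label vectors here are toy values standing in for the ground-truth and predicted cluster assignments of one technology pair):

# Sketch: compute ARI, NMI, homogeneity, completeness, V-measure and FMI with scikit-learn.
from sklearn import metrics

ground_truth = [0, 0, 0, 1, 1, 2, 2, 2]   # G: ground-truth cluster of each comparative sentence
predicted    = [0, 0, 1, 1, 1, 2, 2, 2]   # C: cluster assigned by the algorithm

ari = metrics.adjusted_rand_score(ground_truth, predicted)
nmi = metrics.normalized_mutual_info_score(ground_truth, predicted)
hom, com, v_measure = metrics.homogeneity_completeness_v_measure(ground_truth, predicted)
fmi = metrics.fowlkes_mallows_score(ground_truth, predicted)
print(f"ARI={ari:.2f} NMI={nmi:.2f} HOM={hom:.2f} COM={com:.2f} V-M={v_measure:.2f} FMI={fmi:.2f}")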
5.3.4 Overall Performance. Table 8 shows the evaluation results. TF-IDF with K-means has similar performance to Doc2vec with K-means, but our model significantly outperforms both baselines on all six metrics.

Table 8: Clustering performance

Method          | ARI   | NMI  | HOM  | COM  | V-M  | FMI
TF-IDF+Kmeans   | 0.12  | 0.28 | 0.29 | 0.27 | 0.28 | 0.41
Doc2vec+Kmeans  | -0.01 | 0.11 | 0.10 | 0.14 | 0.11 | 0.43
Our model       | 0.66  | 0.73 | 0.75 | 0.72 | 0.73 | 0.79

According to our inspection of the detailed results, there are two reasons why our model outperforms the two baselines. First, our model can capture the semantic meaning of comparative sentences. TF-IDF can only find similar sentences that use the same words and counts similar words like "secure" and "safe" as unrelated, while the sentence vector from Doc2vec is easily influenced by noise as it takes all words in the sentence into consideration. Second, constructing a graph of similar sentences in our model explicitly encodes the sentence relationships, and community detection on the graph can then easily group similar sentences into clusters. In contrast, for the two baselines, the error introduced by TF-IDF and Doc2vec is accumulated and amplified by K-means in the clustering phase.
6 USEFULNESS EVALUATION
The experiments in Section 5 have shown the accuracy of our approach. In this section, we further demonstrate its usefulness. According to our observation of Stack Overflow, there are some questions discussing comparable technologies, such as "What is the difference between Swing and AWT?". We demonstrate the usefulness of the technology-comparison knowledge our approach distills from Stack Overflow discussions by checking how well the distilled knowledge can answer such questions.
6.1 Evaluation Procedures
We use the names of comparable technologies together with several keywords such as compare, vs, and difference to search for questions in Stack Overflow. We then manually check which of them are truly about comparable technology comparison, and randomly sample five questions that discuss comparable technologies in different categories and have at least five answers. The testing dataset can be seen in Table 9.
We then ask the two Master students to read each sentence in all answers and cluster all sentences into several clusters which represent developers' opinions on different aspects. To make the data as valid as possible, they again first carry out the clustering individually and then reach an agreement after discussion. For each comparative opinion in the answers, we manually check whether that opinion also appears in the knowledge base of comparative sentences extracted by our method. To make this study fair, our method does not extract comparative sentences from the answers of the questions used in this experiment.
6.2 Results
Table 10 shows the evaluation results. We can see that most comparison aspects (72%) can be covered by our knowledge base. For two questions (#5970383 and #46585), the technology-comparison knowledge distilled by our method covers all of the comparison aspects in the original answers, such as speed, reliability and data size for comparing POST and GET. For the other three questions, our model can still cover more than half of the comparison aspects. We miss some comparison aspects for these three questions, such as "One psychological reason that has not been given is simply that Quicksort is more cleverly named, i.e. good marketing.", "The VMWare Workstation client provides a nicer end-user experience (subjective, I know...)" and "Another statement which I saw is that swing is MVC based and awt is not.". Such opinions are either too subjective or too detailed and rarely appear again in other Stack Overflow discussions, which is why they are not in our knowledge base.
Table 9: Comparative questions

Question ID | Question title                                | Tech pair             | Tech category | #answers
70402       | Why is quicksort better than mergesort?       | quicksort & mergesort | Algorithm     | 29
5970383     | Difference between TCP and UDP                | tcp & udp             | Protocol      | 9
630179      | Benchmark: VMware vs Virtualbox               | vmware & virtualbox   | IDE           | 13
408820      | What is the difference between Swing and AWT? | swing & awt           | Library       | 8
46585       | When do you use POST and when do you use GET? | post & get            | Method        | 28

Table 10: Distilled knowledge by our approach versus original answers

Question ID | #Aspects | #Covered   | #Unique in our model
70402       | 6        | 4 (66.67%) | 2
5970383     | 3        | 3 (100%)   | 5
630179      | 7        | 4 (57.1%)  | 1
408820      | 5        | 3 (60%)    | 4
46585       | 4        | 4 (100%)   | 2
Total       | 25       | 18 (72%)   | 14

Apart from the comparison aspects that appear in the original answers, our tool can provide some unique opinions from other aspects, such
as "In my experience, udp based code is generally less complex than tcp based code" for comparing tcp and udp, "however I found that vmware is much more stable in full screen resolution to handle the iphone connection via usb" for comparing vmware and virtualbox, and "GET would obviously allow for a user to change the value a lot easier than POST" for comparing post and get. As seen in Table 10, our model provides more than one unique comparison aspect that is not in the existing answers for each technology pair. Therefore, our knowledge base can be a good complement to these existing technology-comparison questions and answers. Furthermore, our knowledge base contains the comparison knowledge of 2,074 pairs of comparable technologies, many of which have not been explicitly asked about and discussed in Stack Overflow, such as swift and objective-c, or nginx and apache.
7 RELATED WORK
Finding similar software artefacts can help developers migrate from one tool to another that is more suitable for their requirements. But it is a challenging task to identify similar software artefacts from the existing large pool of candidates, and much research effort has been put into this domain. Different methods have been adopted to mine similar artefacts, ranging from high-level software [33, 43], mobile applications [19, 31] and GitHub projects [49] to lower-level third-party libraries [11, 13, 42], APIs [22, 37], code snippets [41], and Q&A questions [18]. Compared with these research studies, the software technologies mined in this work have a much broader scope, including not only software-specific artefacts but also general software concepts (e.g., algorithms, protocols) and tools (e.g., IDEs).
Given a list of similar technologies, developers may further compare and contrast them for the final selection. Some researchers investigate such comparisons, but the comparison is highly domain-specific, such as software for traffic simulation [26], regression models [24], and x86 virtualization [7]. Michail and Notkin [34] assess different third-party libraries by matching similar components (such as classes and functions) across similar libraries, but this only works for library comparison and cannot easily be extended to other higher- or lower-level technologies in software engineering. Instead, we find that developers' preference for certain software technologies depends highly on other developers' usage experience and reports of similar technology comparisons. Along these lines, Uddin and Khomh [45, 46] extract API opinion sentences in different aspects to show developers' sentiment toward an API, and Li et al. [30] adopt NLP methods to distill comparative user reviews about similar mobile apps. Different from their work, we first explicitly extract a large pool of comparable technologies. In addition, apart from extracting comparative sentences, we further organize them into different clusters and represent each cluster with keywords to help developers understand comparative opinions more easily.
Finally, it is worth mentioning some related practical projects. SimilarWeb [6] is a website that provides both user engagement statistics and similar competitors for websites and mobile applications. AlternativeTo [4] is a social software recommendation website in which users can find alternatives to a given piece of software based on user recommendations. SimilarTech [5] is a site that recommends analogical third-party libraries across different programming languages. These websites can help users find similar or alternative websites or software applications, but without detailed comparison.
8 CONCLUSION AND FUTURE WORK
In this paper, we present an automatic approach to distill and aggregate comparative opinions about comparable technologies from Q&A websites. We first obtain a large pool of comparable technologies by incorporating categorical knowledge into the word embedding of tags in Stack Overflow, then locate comparative sentences about these technologies by POS-tag based pattern matching, and finally organize the comparative sentences into clusters for easier understanding. The evaluation shows that our system covers a large set of comparable technologies and their corresponding comparative sentences with high accuracy. We also demonstrate the potential of our system to answer questions about comparing comparable technologies, because the technology comparison knowledge mined using our system largely overlaps with the original answers in Stack Overflow.
Apart from comparative sentences that explicitly mention both comparable technologies, some comparative opinions are expressed less directly. For example, a developer may express opinions about one technology in one paragraph while discussing the other technology in the next paragraph. Therefore, we will improve our system to distill technology comparison knowledge at the post level rather than only at the sentence level. In addition, we plan to summarize higher-level opinions or preferences from individual comparative sentences for easier understanding.
REFERENCES
[1] 2003. Alphabetical list of part-of-speech tags used in the Penn Treebank Project. https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html. Accessed: 2018-02-02.
[2] 2017. Millions of Queries per Second: PostgreSQL and MySQL's Peaceful Battle at Today's Demanding Workloads. https://goo.gl/RXVjkB/. Accessed: 2018-04-05.
[3] 2017. MySQL vs Postgres. https://www.upguard.com/articles/postgres-vs-mysql/. Accessed: 2018-04-05.
[4] 2018. AlternativeTo - Crowdsourced software recommendations. https://alternativeto.net/. Accessed: 2018-04-05.
[5] 2018. SimilarTech: Find alternative libraries across languages. https://graphofknowledge.appspot.com/similartech/. Accessed: 2018-04-05.
[6] 2018. SimilarWeb. https://www.similarweb.com/. Accessed: 2018-04-05.
[7] Keith Adams and Ole Agesen. 2006. A comparison of software and hardware techniques for x86 virtualization. ACM SIGARCH Computer Architecture News 34, 5 (2006), 2–13.
[8] Lingfeng Bao, Jing Li, Zhenchang Xing, Xinyu Wang, Xin Xia, and Bo Zhou. 2017. Extracting and analyzing time-series HCI data from screen-captured task videos. Empirical Software Engineering 22, 1 (2017), 134–174.
[9] Anton Barua, Stephen W Thomas, and Ahmed E Hassan. 2014. What are developers talking about? An analysis of topics and trends in Stack Overflow. Empirical Software Engineering 19, 3 (2014), 619–654.
[10] Steven Bird and Edward Loper. 2004. NLTK: the natural language toolkit. In Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, 31.
[11] Chunyang Chen, Sa Gao, and Zhenchang Xing. 2016. Mining analogical libraries in Q&A discussions—incorporating relational and categorical knowledge into word embedding. In Software Analysis, Evolution, and Reengineering (SANER), 2016 IEEE 23rd International Conference on, Vol. 1. IEEE, 338–348.
[12] Chunyang Chen and Zhenchang Xing. 2016. Mining technology landscape from Stack Overflow. In Proceedings of the 10th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. ACM, 14.
[13] Chunyang Chen and Zhenchang Xing. 2016. SimilarTech: automatically recommend analogical libraries across different programming languages. In Automated Software Engineering (ASE), 2016 31st IEEE/ACM International Conference on. IEEE, 834–839.
[14] Chunyang Chen and Zhenchang Xing. 2016. Towards correlating search on Google and asking on Stack Overflow. In Computer Software and Applications Conference (COMPSAC), 2016 IEEE 40th Annual, Vol. 1. IEEE, 83–92.
[15] Chunyang Chen, Zhenchang Xing, and Lei Han. 2016. TechLand: Assisting technology landscape inquiries with insights from Stack Overflow. In Software Maintenance and Evolution (ICSME), 2016 IEEE International Conference on. IEEE, 356–366.
[16] Chunyang Chen, Zhenchang Xing, and Yang Liu. 2017. By the Community & For the Community: A Deep Learning Approach to Assist Collaborative Editing in Q&A Sites. Proceedings of the ACM on Human-Computer Interaction 1, 32 (2017), 1–32.
[17] Chunyang Chen, Zhenchang Xing, and Ximing Wang. 2017. Unsupervised software-specific morphological forms inference from informal discussions. In Proceedings of the 39th International Conference on Software Engineering. IEEE Press, 450–461.
[18] Guibin Chen, Chunyang Chen, Zhenchang Xing, and Bowen Xu. 2016. Learning a dual-language vector space for domain-specific cross-lingual question retrieval. In Automated Software Engineering (ASE), 2016 31st IEEE/ACM International Conference on. IEEE, 744–755.
[19] Ning Chen, Steven CH Hoi, Shaohua Li, and Xiaokui Xiao. 2015. SimApp: A framework for detecting similar mobile applications by online kernel learning. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining. ACM, 305–314.
[20] Edward B Fowlkes and Colin L Mallows. 1983. A method for comparing two hierarchical clusterings. Journal of the American Statistical Association 78, 383 (1983), 553–569.
[21] Michelle Girvan and Mark EJ Newman. 2002. Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99, 12 (2002), 7821–7826.
[22] Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim. 2017. DeepAM: Migrate APIs with multi-modal sequence to sequence learning. arXiv preprint arXiv:1704.07734 (2017).
[23] John A Hartigan and Manchek A Wong. 1979. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics) 28, 1 (1979), 100–108.
[24] Nicholas J Horton and Stuart R Lipsitz. 2001. Multiple imputation in practice: comparison of software packages for regression models with missing variables. The American Statistician 55, 3 (2001), 244–254.
[25] Lawrence Hubert and Phipps Arabie. 1985. Comparing partitions. Journal of Classification 2, 1 (1985), 193–218.
[26] Steven L Jones, Andrew J Sullivan, Naveen Cheekoti, Michael D Anderson, and D Malave. 2004. Traffic simulation software comparison study. UTCA Report 2217 (2004).
[27] Jun'ichi Kazama and Kentaro Torisawa. 2007. Exploiting Wikipedia as external knowledge for named entity recognition. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). 698–707.
[28] Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In International Conference on Machine Learning. 957–966.
[29] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning. 1188–1196.
[30] Yuanchun Li, Baoxiong Jia, Yao Guo, and Xiangqun Chen. 2017. Mining User Reviews for Mobile App Comparisons. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1, 3 (2017), 75.
[31] Mario Linares-Vásquez, Andrew Holtzhauer, and Denys Poshyvanyk. 2016. On automatically detecting similar Android apps. In Program Comprehension (ICPC), 2016 IEEE 24th International Conference on. IEEE, 1–10.
[32] Haibin Ling and Kazunori Okada. 2007. An efficient earth mover's distance algorithm for robust histogram comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 5 (2007), 840–853.
[33] Collin McMillan, Mark Grechanik, and Denys Poshyvanyk. 2012. Detecting similar software applications. In Proceedings of the 34th International Conference on Software Engineering. IEEE Press, 364–374.
[34] Amir Michail and David Notkin. 1999. Assessing software libraries by browsing similar classes, functions and relationships. In Proceedings of the 21st International Conference on Software Engineering. ACM, 463–472.
[35] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[36] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[37] Trong Duc Nguyen, Anh Tuan Nguyen, Hung Dang Phan, and Tien N Nguyen. 2017. Exploring API embedding for API usages and applications. In Software Engineering (ICSE), 2017 IEEE/ACM 39th International Conference on. IEEE, 438–449.
[38] Or Pele and Michael Werman. 2009. Fast and robust earth mover's distances. In Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 460–467.
[39] Andrew Rosenberg and Julia Hirschberg. 2007. V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
[40] Karen Sparck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28, 1 (1972), 11–21.
[41] Fang-Hsiang Su, Jonathan Bell, Gail Kaiser, and Simha Sethumadhavan. 2016. Identifying functionally similar code in complex codebases. In Program Comprehension (ICPC), 2016 IEEE 24th International Conference on. IEEE, 1–10.
[42] Cédric Teyton, Jean-Rémy Falleri, and Xavier Blanc. 2013. Automatic discovery of function mappings between similar libraries. In Reverse Engineering (WCRE), 2013 20th Working Conference on. IEEE, 192–201.
[43] Ferdian Thung, David Lo, and Lingxiao Jiang. 2012. Detecting similar applications with collaborative tagging. In Software Maintenance (ICSM), 2012 28th IEEE International Conference on. IEEE, 600–603.
[44] Christoph Treude, Ohad Barzilay, and Margaret-Anne Storey. 2011. How do programmers ask and answer questions on the web?: NIER track. In Software Engineering (ICSE), 2011 33rd International Conference on. IEEE, 804–807.
[45] Gias Uddin and Foutse Khomh. 2017. Automatic summarization of API reviews. In Automated Software Engineering (ASE), 2017 32nd IEEE/ACM International Conference on. IEEE, 159–170.
[46] Gias Uddin and Foutse Khomh. 2017. Opiner: an opinion search and summarization engine for APIs. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. IEEE Press, 978–983.
[47] Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2010. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research 11, Oct (2010), 2837–2854.
[48] Deheng Ye, Zhenchang Xing, Chee Yong Foo, Zi Qun Ang, Jing Li, and Nachiket Kapre. 2016. Software-specific named entity recognition in software engineering social content. In Software Analysis, Evolution, and Reengineering (SANER), 2016 IEEE 23rd International Conference on, Vol. 1. IEEE, 90–101.
[49] Yun Zhang, David Lo, Pavneet Singh Kochhar, Xin Xia, Quanlai Li, and Jianling Sun. 2017. Detecting similar repositories on GitHub. In Software Analysis, Evolution and Reengineering (SANER), 2017 IEEE 24th International Conference on. IEEE, 13–23.