Data Labeling: An Empirical Investigation into
Industrial Challenges and Mitigation Strategies
Teodor Fredriksson1, David Issa Mattos1[0000-0002-2501-9926], Jan
Bosch1[0000-0003-2854-722X], and Helena Holmström Olsson2[0000-0002-7700-1816]
1 Chalmers University of Technology, Hörselgången 11, 417 56 Gothenburg, Sweden
{teodorf,davidis,jan.bosch}@chalmers.se
2 Malmö University, Nordenskiöldsgatan 1, 211 19 Malmö, Sweden
helena.holmstrom.olsson@mau.se
Abstract. Labeling is a cornerstone of supervised machine learning. However, in industrial applications data is often not labeled, which complicates the use of this data for machine learning. Although there are well-established labeling techniques such as crowdsourcing, active learning, and semi-supervised learning, these still do not provide accurate and reliable labels for every machine learning use case in industry. In this context, industry still relies heavily on manually annotating and labeling its own data. This study investigates the challenges that companies experience when annotating and labeling their data. We performed a case study using semi-structured interviews with data scientists at two companies to explore what problems they experience when labeling and annotating their data. This paper provides two contributions. We identify industry challenges in the labeling process, and then we propose mitigation strategies for these challenges.
Keywords: Data Labeling · Machine Learning · Case Study
1 Introduction
Current research estimates that over 80% of the engineering tasks in a machine learning (ML) project concern data preparation and labeling, and that the third-party data labeling market is expected to almost triple by 2024 [1, 2]. This large effort spent on data preparation and labeling often arises because, in industry, datasets are often incomplete in the sense that some or all instances are missing labels. In addition, in some cases the labels that are available are of poor quality, meaning that the label associated with a data entry is incorrect or only partially correct. Labels of sufficient quality are a prerequisite for supervised machine learning, as the performance of the model in operation is directly influenced by the quality of the training data [3].
To overcome the lack of labels in both quantity and quality, crowdsourcing has been a common strategy for acquiring quality labels with human supervision [4, 5], in particular for computer vision and natural language processing
applications. However, for other industry applications, crowdsourcing has several limitations, such as allowing unknown third parties access to company data and a lack of people with an in-depth understanding of the problem or the business to create quality labels. In-house labeling can be half as expensive as crowdsourced labels while providing higher quality [6]. Due to these factors, companies still perform in-house labeling. Despite the large body of research on crowdsourcing and machine learning systems that can overcome different label quality problems, to the best of our knowledge, there is no research that investigates the challenges faced and strategies adopted by data scientists and human labelers in the labeling process of company-specific applications. In particular, we focus on the problems seen in applications where labeling is non-trivial and requires understanding of the problem domain.
Utilizing case study research based on semi-structured interviews with practitioners in two companies, one of which has extensive labeling experience, we study the challenges and the adopted mitigation strategies in the data labeling process that these companies employ. The contribution of this paper is twofold. First, we identify the key challenges that these companies experience in relation to labeling data. Second, we present an overview of the mitigation strategies that companies employ regularly, or potential solutions, to address these challenges.
The remainder of the paper is organized as follows. In the next section, we provide a more in-depth overview of the background of our research. Subsequently, in Section 3 we present the research method that we employed in the paper as well as an overview of the case companies. Section 4 presents the challenges that we identified during the case study, observations and interviews at the companies, the results from the expert interviews to validate the challenges, as well as the mitigation strategies. Section 5 discusses our findings, and the paper is concluded in Section 6.
2 Background
Crowdsourcing is defined as the process of acquiring required information or results by requesting assistance from a group of many people available through online communities. It is a way of dividing and distributing a large project among people, and after each task is completed, the people involved are rewarded [7]. According to [2], crowdsourcing is the primary way of acquiring labels. In the context of machine learning, however, crowdsourcing has its own set of problems. The primary problem is annotators that produce bad labels. An annotator might not be able to label instances correctly, and even if an annotator is an expert, the quality of the labels will potentially decrease over time due to the human factor [3]. Examples of crowdsourcing platforms are Amazon Mechanical Turk and Lionbridge AI [8].
Allowing a third-party company to label your data has its benefits, such as not having to develop your own annotation tools and labeling infrastructure. In-house labeling also requires investing time in training your annotators, which is not optimal if you do not have enough time and resources. A downside is that sensitive and confidential company data has to be shared with the crowdsourcing platforms. Therefore, there are essential factors to consider before selecting a crowdsourcing platform, such as: How many and what kind of projects has the platform been successful with previously? Does the platform have high-quality labeling technologies so that high-quality labels can be obtained? How does the platform ensure that the annotators can produce labels of sufficient quality?
What security measures are taken to ensure the safety of your data?
A tool to be used in crowdsourcing when noisy labels are cheap to obtain is repeated labeling. According to [9], repeated labeling should be exercised if labeling can be repeated and the labels are noisy. This approach can improve the quality of the labels, which leads to improved quality of the machine learning model. It seems to work especially well when the repeated labeling is done selectively, taking into account label uncertainty and machine learning model uncertainty. However, this approach does not guarantee that the quality is improved. Sheshadri and Lease [10] provide an empirical evaluation study that compares different algorithms that compute the crowd consensus on benchmark crowdsourced data sets using the Statistical Quality Assurance Robustness Evaluation (SQUARE) benchmark [10]. The conclusion of [10] is that no matter what algorithm you choose, there is no significant difference in accuracy. These algorithms include majority voting (MV), ZenCrowd (ZC), and Dawid and Skene (DS)/Naive Bayes (NB) [9]. There are also other ways to handle noisy labels; for example, in [11] the accuracy when training a deep neural network with noisy labels is improved by incorporating a noise layer. So rather than correcting noisy labels, there are ways to change the machine learning models so that they can handle noisy labels. The downside to this approach is that you need to know which instances are clean and which instances are noisy, which can be difficult with industrial data. Another strategy to detect noisy labels is confident learning, which can be used to identify noisy labels as well as learn from noisy labels [12].
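To make the crowd-consensus idea concrete, the following is a minimal sketch of majority voting (MV) over repeated labels, assuming labels are collected as a mapping from instance identifiers to lists of annotator votes; the function and variable names are illustrative and not taken from any specific platform.

```python
from collections import Counter

def majority_vote(votes_per_instance):
    """Aggregate repeated labels by majority vote, a simple crowd-consensus baseline.

    votes_per_instance: dict mapping an instance id to the labels given by
    different annotators, e.g. {"x1": ["Yes", "No", "Yes"]}.
    """
    consensus = {}
    for instance_id, votes in votes_per_instance.items():
        # most_common(1) returns the label with the highest vote count;
        # ties are broken arbitrarily, a known limitation of plain MV.
        consensus[instance_id] = Counter(votes).most_common(1)[0][0]
    return consensus

# Three annotators labeled two instances.
print(majority_vote({"x1": ["Yes", "No", "Yes"], "x2": ["No", "No", "Yes"]}))
# {'x1': 'Yes', 'x2': 'No'}
```

Algorithms such as ZenCrowd or Dawid and Skene additionally estimate per-annotator reliability, but as noted above, [10] found no significant accuracy difference between these and plain majority voting.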
3 Research Method
In this paper, we report on case study research in which we explored the challenges related to the labeling of data for machine learning and what strategies can be employed to mitigate them. In this section we present the data we collected and how we analyzed it to identify the challenges.
A case study is a research method that investigates real-world phenomena through empirical investigation. The aim of such studies is to identify challenges and find mitigation strategies through action, reflection, theory, and practice [13–15]. A case study suits our purpose well because of its exploratory nature and because we are trying to learn more about certain processes at Companies A and B. The two main research questions are:
RQ1: What are the key challenges that practitioners face in the process of labeling data?
RQ2: What are the mitigation strategies that practitioners use to overcome these challenges?
3.1 Data Collection
Our case study was conducted in collaboration with two companies. Company A is a worldwide telecommunication provider and one of the leading providers in Information and Communication Technology (ICT). Company B is a company specialized in labeling. They have developed an annotation platform in order to provide the autonomous vehicle industry with labeled training data of top quality. Their clients include software companies and research institutes.
– Phase I: Exploration - The empirical data collected during this phase is based on an internship from November 18, 2019 to February 28, 2020, during which the first author spent time at Company A's office two to three days a week. The data was collected from the data scientists by observing how they were working with machine learning and how they deal with data where labels are missing, as well as by having access to data sets. We held discussions with the data scientist working with each particular dataset to collect information regarding the origin of the data, what they wish to use it for in the future, and how often it is updated. Using Python, we investigated how skewed the label distribution is and examined the data to potentially find any clustering structure in the labels (a sketch of these checks is given at the end of this subsection). The datasets studied in Phase I came from participants I and II.
– Phase II: Validation - After the challenges had been identified during Phase I, both internal and external confirmation interviews were conducted to validate whether the challenges found in the previous phase were general. Four participants in the interviews were from Company A and one participant was from Company B. Company A had several data scientists, but we included only those that had issues with labeling. Each participant was interviewed separately and the interviews lasted between 25 and 55 minutes. All but one interview were conducted in English; the remaining interview was conducted in Swedish and then translated to English by the first author. During the interviews we asked questions such as What is the purpose of your labels?, How do you get annotated data?, and How do you assess the quality of the data/labels?
Based on the meetings and interviews, we were able to evaluate and plan strategies to mitigate the challenges that we observed during our study.
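The label-distribution and clustering checks mentioned in Phase I can be sketched as follows. This is a minimal illustration: the file name, the column name label, and the choice of two clusters are hypothetical, and the sketch assumes the remaining columns are numeric features.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset with a "label" column and numeric feature columns.
df = pd.read_csv("dataset.csv")

# How skewed is the label distribution?
print(df["label"].value_counts(normalize=True))

# Is there any cluster structure, and does it line up with the labels?
features = StandardScaler().fit_transform(df.drop(columns=["label"]))
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(pd.crosstab(clusters, df["label"]))
```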
3.2 Data analysis
The interviews were analyzed by taking notes during the interviews and the internship. We then performed a thematic analysis [16]. A thematic analysis is defined as "a method for identifying, analyzing and reporting patterns" and was used to identify the different themes and patterns in the data we collected. From the analysis we were able to identify themes and define the industrial challenges based on the notes. For each interview we identified different themes, such as topics that came up during the interview. Several of these themes were present in more than one interview, so we combined the data from the interviews and, based on that, drew conclusions from the information on the same theme.

Table 1. List of the interview participants of Phase II

Company  Participant Nr  Title/role             Experience
A        I               Data Scientist         4 years
A        II              Senior Data Scientist  8 years
A        III             Data Scientist         3 years
A        IV              Senior Data Scientist  2 years
B        V               Senior Data Scientist  7 years
3.3 Threats to Validity
According to [14] there are four different concepts of validity to consider: construct validity, internal validity, external validity, and reliability. To achieve construct validity, we provided every participant from Company A with an e-mail containing definitions of all the concepts and some sample questions to be asked during the interview. We also gave a lecture on how to use machine learning to label data before the interviews so that the participants could reflect and prepare for the interview. We argue that we achieved internal validity through data triangulation, since we interviewed every person at Company A that had experience with labels. Therefore it is unlikely that we missed any necessary information when collecting data.
4 Results
In this section we present the results from our study. We begin by listing the key problems that we found in Phase I of the study. Next, we state the problems we found in Phase II. The interview we held with participant V was then used as inspiration for formulating mitigation strategies for the problems faced by the data scientists at Company A.
4.1 Phase I: Exploration
Here we list the problems that we found during Phase I of the case study.
1. Lack of a systematic approach to labeling data for specific features: It was clear that automated labeling processes were needed. The data scientists working at Company A had all kinds of needs for automated labeling. Currently, they have no clear idea of how to approach the problem.
2. Unclear responsibility for labeling: Data scientists do not have the time to label instances manually. Their stakeholders could label the data by hand, but they do not want to do it either. Thus the data scientists are expected to come up with a way to do the labeling.
3. Noisy labels: Participant I has a small subset of his data labeled. These labels come from experiments conducted in a lab. The label noise seems negligible, but that is not the case: there is a difference between the generated data and the true data, since the generated data has features that are continuous while the true data is discrete. Participant II works on a data set that contains tens of thousands of rows and columns. The column of interest contains two class labels, "Yes" and "No". The first problem with the labels is that they are noisy. A "Yes" can be due to two errors, I and II, and only a "Yes" based on error I is of interest; if the "Yes" is based on error II, it should be relabeled as a "No". Furthermore, the stakeholders do not know whether the "Yes" instances are due to error I or error II.
4. Difficulty to find a correlation between labels and features: Participant I works with a dataset whose label distribution contains five classes that describe grades from "best" to "worst", where 1 is "best" and 5 is "worst". Cluster analysis reveals that there is no particular cluster structure for some of the labels: labels of grade 5 seem to form one cluster, but grades 1-4 seem to be randomly scattered within another cluster. Analysis of the data from participant II reveals that there is no way of telling whether a "Yes" is based on error I or error II. This means that many of the "Yes" instances are labeled incorrectly.
5. Skewed label distributions: The label distribution of both datasets is highly skewed. The dataset from participant I has fewer instances with a high grade compared to low grades. For participant II, the number of instances labeled "No" is greater than the number labeled "Yes". When training a model on this data, it will overfit.
6. Time dependence: Due to the nature of participant II's data, it is possible that some of the "No" instances can become "Yes" in the future, and so the "No" labels are possibly incorrect too.
7. Difficulty to predict future uses for datasets: The purpose of the labels in both datasets was to predict labels for future instances provided by the stakeholder on an irregular basis. For participant I, the labels might be used for other purposes later, but there are no current plans to use the labels for other machine learning purposes.
4.2 Phase II: Validation
The problems that appeared during the interviews can be categorized as follows:
1. Label distribution related. Questions regarding the distribution.
2. Multiple-task related. Questions regarding the purpose of the labels.
3. Annotation related. Questions regarding the oracle and noisy labels.
4. Model and data reuse related. Questions regarding reuse of a trained model on new data.
Below we discuss each category in more detail.
1. Label Distribution: We found several issues related to the label distribution. Participant I's data has a label distribution that is unknown. The current labels are measured in percentages and need to be translated into at least two classes, but if more labels are needed, that can be done as well. Participant II has a label distribution that contains two classes, "Yes" and "No". Participant III's data has a label distribution that contains at least three labels. Participant IV has more than three thousand labels, so it is hard to get a clear picture of what the distribution is. Participants I-III all have skewed label distributions. If a dataset has a skewed label distribution, then the machine learning model will overfit. This means that if you have a binary classification problem with 80% of class A and 20% of class B, the model might just predict A the majority of the time, even when an actual case is labeled as B [17] (a small illustration is given after this list).
2. Multiple tasks: Participants I, II, and III say that, for now, the only purpose of their labels is to find labels for new data, but chances are that they will be reused for something else later on. Participant IV does not use the labels for machine learning purposes but for other practical reasons. The problem here is that if you do not plan ahead and only train a model with respect to one task, then if you need to use the labels for something else later, you will have to re-label the instances for each new task.
3. Annotation: Participant I has some labeled data that comes from laboratory experiments. However, these labels are only used to help label new instances that are to be labeled manually. Participant II gets its labels from the stakeholders, but since these are noisy, the instances need to be re-labeled. Participant III has labeled data coming from stakeholders, and these are expected to be 100% correct. Participant IV defines all labels on their own and does not consult the stakeholders at all. The problem here is that the data scientists are often tasked with doing the labeling on their own. Even if the data scientists get instances from the stakeholders, the labels are often of insufficient quantity and/or quality.
4. Data Reuse: Participant III has had problems with reusing a model. First, the data was labeled into two classes, "Yes" and "No". Later, the "Yes" category was divided into sub-categories "YesA" and "YesB". When running the model on this new data, it would predict the old "Yes" instances as "No" instances. Participant III has no idea why this happens.
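As referenced in point 1 above, the following minimal sketch (on synthetic data, using scikit-learn's DummyClassifier) shows why a skewed label distribution is problematic: a model that always predicts the majority class already reaches roughly 80% accuracy on an 80/20 split while never detecting the minority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic 80/20 binary labels: "A" is the majority class, "B" the minority.
rng = np.random.default_rng(0)
y = np.where(rng.random(1000) < 0.8, "A", "B")
X = np.zeros((1000, 1))  # features are irrelevant for this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)
print("accuracy:", accuracy_score(y, pred))                  # about 0.80
print("recall on B:", recall_score(y, pred, pos_label="B"))  # 0.0
```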
4.3 Summary from Company B
Participant V of Company B has prior experience with automatic labeling. Therefore, interview V was used to verify some actual labeling issues from industry. According to participant V, Company B has worked on and studied automatic labeling for at least seven years. Company B uses crowdsourcing to label data, involving about 1000 people. Participant V confirms that, thanks to active learning, the labeling task takes 200 times less time than if active learning were not used. The main problem Company B has with labeling is that it is hard to evaluate the quality of the labels and to assess the quality of the human annotators. A final remark from participant V is that they have experienced a correlation between automation and quality: the more automation included in the process, the less accurate the labels will be. Three of the authors of this paper performed a systematic literature review on automated labeling using machine learning [18]. Thanks to that paper, we can draw the conclusion that active learning and semi-supervised learning can be used to label instances.
4.4 Machine Learning methods for Data Labeling
Here we present and discuss active learning and semi-supervised learning methods in terms of how they can be used in practice with labeling problems.
Active Learning: Traditionally, instances to be labeled for use in machine learning would be chosen at random. However, choosing instances to be labeled randomly can lead to a model with low predictive accuracy, since non-informative instances may be selected for labeling. To mitigate the issue of choosing non-informative instances, active learning (AL) has been proposed. Active learning queries instances by informativeness and then labels them. The different methods used to pose queries are known as query strategies [19]. According to [18], the most commonly used query strategies are uncertainty sampling, error/variance reduction, query-by-committee (QBC), and query-by-disagreement (QBD). After instances are queried and labeled, they are added to the training set. A machine learning algorithm is then trained and evaluated. If the learner is not satisfied with the results, more instances will be queried and the model will be retrained and evaluated. This iterative procedure proceeds until the learner decides it is time to stop learning. Active learning has been shown to outperform passive learning if the query strategy is properly selected based on the learning algorithm [19]. Most importantly, active learning is a great way to make sure that time is not wasted on labeling non-informative instances, thus saving both time and money in crowdsourcing [2].
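The iterative procedure described above can be sketched as a pool-based active learning loop with least-confidence uncertainty sampling. This is a minimal illustration, assuming a scikit-learn-style probabilistic classifier, numpy arrays, a seed set containing at least two classes, and an oracle function supplied by the caller; the budget and the choice of logistic regression are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling_loop(X_seed, y_seed, X_pool, label_fn, budget=50):
    """Pool-based active learning with least-confidence uncertainty sampling.

    X_seed, y_seed: small initial labeled set; X_pool: unlabeled candidates;
    label_fn: the oracle (e.g. a human annotator) returning a label for one instance.
    """
    X_l, y_l = list(X_seed), list(y_seed)
    pool = list(range(len(X_pool)))
    model = LogisticRegression(max_iter=1000)
    for _ in range(min(budget, len(pool))):
        model.fit(np.array(X_l), np.array(y_l))
        # Query the pool instance whose highest predicted class probability
        # is lowest, i.e. where the model is most uncertain.
        probs = model.predict_proba(X_pool[pool])
        query = pool[int(np.argmin(probs.max(axis=1)))]
        X_l.append(X_pool[query])
        y_l.append(label_fn(X_pool[query]))  # ask the oracle for the label
        pool.remove(query)
    model.fit(np.array(X_l), np.array(y_l))
    return model
```

Other query strategies, such as query-by-committee, only change how the query index is chosen; the surrounding loop stays the same.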
Semi-supervised learning: Semi-supervised learning (SSL) is concerned with a set of algorithms that can be used in the scenario where most of the data is unlabeled but a small subset of it is labeled. Semi-supervised learning is mainly divided into semi-supervised classification and constrained clustering [20].
Semi-supervised classification is when a classifier is trained on training data that contains both labeled and unlabeled instances. Sometimes semi-supervised learning outperforms supervised classification [21].
Constrained clustering is an extension of unsupervised clustering. Constrained clustering requires unlabeled instances as well as some supervised information about the clusters. The objective of constrained clustering is to improve upon unsupervised clustering [22]. The most popular semi-supervised classification methods are mixture models using the EM algorithm, co-training/multi-view learning, graph-based SSL, and semi-supervised support vector machines (S3VM) [18].
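A minimal semi-supervised classification sketch using scikit-learn's SelfTrainingClassifier (available since scikit-learn 0.24), where unlabeled instances are marked with -1, is given below; the synthetic data, the 10% labeled fraction, the base estimator, and the confidence threshold are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic data where only about 10% of the instances keep their labels;
# the rest are marked as unlabeled with -1 (scikit-learn's convention).
X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) > 0.1] = -1

# The base classifier is retrained iteratively, adding its own confident
# predictions (probability >= threshold) to the training set as pseudo-labels.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
model.fit(X, y_partial)

added = int((model.transduction_ != -1).sum() - (y_partial != -1).sum())
print("pseudo-labels added by self-training:", added)
```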
Below we list eight practical considerations of Active Learning.
1. Data exploration to determine which algorithm is best: When starting a new project involving machine learning, it is hard to know which algorithm will yield the best result. Often there is no way of knowing beforehand what the best choice is. There are empirical studies on which one to choose, but the results are fairly mixed [23–25]. Since the selection of algorithm varies so much, it is essential to understand the problem beforehand. If it is interesting to reduce the error, then expected error or variance reduction are the best query strategies to choose from [19]. If the density of the sample is easy to use and there is strong evidence that supports a correlation between the cluster structure and the labels, then density-weighted methods should be used [19]. If using large probabilistic models, uncertainty sampling is the only viable option [19]. If there is no time to test different query strategies, it is best to use the simpler approaches based on uncertainty [19]. From our investigation it is clear that Company A is in need of labels in their projects. However, since they have never implemented an automatic labeling process before, it is important to do it right from the beginning. That is, the data scientists must carefully examine the distribution of the data set, check whether there are any cluster structures, and check whether there are any relationships between the clusters and the labels. If the data exploration is done in a detailed and correct manner, then finding the correct machine learning approach is easy and no time needs to be spent on testing different machine learning algorithms.
2. Alternative query types: A traditional active learner queries instances to be labeled by an oracle. However, there are other ways of querying; for example, human domain knowledge has been incorporated into machine learning algorithms. This means the learner builds models based on human advice, such as rules and constraints, as well as labeled and unlabeled data.
An example of domain knowledge with active learning is to use information about the features. This approach is referred to as tandem learning and incorporates feature feedback in traditional classification problems. Active dual supervision is an area of active learning where features are labeled: oracles label features that are judged to be good predictors of one or more classes. The big question is how to actively query these feature labels.
3. Multi-task active learning: From our interviews we can see that there are cases where labels are needed to predict labels for future instances. In other cases the labels are not even needed for machine learning. In one case the data scientist thinks that the labels will be used for other prediction tasks but is unsure. The most basic way in which active learning operates is that a machine learner is trying to solve a single task. From the interviews it is clear that the same data needs to be annotated in several ways for several future tasks. This means that the data scientists would have to spend even more time annotating, at least once for each task. It would be more economical to label a single instance for all sub-tasks simultaneously. This can be done with the help of multi-task active learning [26].
4. Data reuse and the unknown model class: The labeled training set collected after performing active learning always has a biased distribution. The bias is connected to the class of model used to select the queries. If it is necessary to switch to an improved learner, then it might be troublesome to reuse the training data with models of a different class. This is an important issue in practical uses of active learning. If you know the best model class and feature set beforehand, then active learning can safely be used. Otherwise, active learning may be outperformed by passive learning.
5. Unreliable oracles: It is important to have access to top quality labeled data. If the labels come from some experiment, there is almost always some noise present. In one of the data sets from Company A, a small subset of the data was labeled. The labels of that particular data set come from experiments conducted in a lab. The label noise seems negligible, but that is not the case: there is a difference between the generated data and the true data, since the generated data has features that are continuous while the true data is discrete. Another data set that we studied has labels that came from customer data. The labels were coded "Yes" and "No". However, the "Yes" labels were due to factors A and B. So the problem here is to find a model that can predict the labels, but we are only interested in the "Yes" instances that are due to factor A. The "Yes" instances that are due to factor B need to be relabeled to "No", yet the customer data does not indicate whether a "Yes" is due to factor A or B. The second problem was that some of the "No" instances could develop into a "Yes" over time. It was up to the data scientist to find a way to relabel the data correctly. The data scientist had a solution to the problem but realized that it was faulty and therefore asked us for help. We took a look at the data and the current solution. We saw two large clusters, but there was no noteworthy relationship between the different labels and the features; both clusters contained almost equally many "Yes" and "No" instances. Let us say that the first cluster contained about 60% "Yes" and 40% "No", and the second cluster 60% "No" and 40% "Yes". In the current solution, all of the instances in the first cluster were re-labeled as "Yes" and all instances in the second cluster were re-labeled as "No". We conclude that this is an approach that will yield noisy labels. The same goes if the labels come from a human annotator, because some of the instances might be difficult to label, and people can easily be distracted and tired over time, so the quality of the labels will vary over time. Thanks to crowdsourcing, one can let several people annotate the same data; that way it is easier to determine which label is the correct one and to produce "gold-standard quality training sets". This approach can also be used to evaluate learning algorithms on training sets that are not of gold-standard quality. The big questions are: How do we use noisy oracles in active learning? When should the learner query new unlabeled instances rather than update currently labeled instances in case we suspect an error? Studies where estimates of both oracle and model uncertainty were taken into account show that data can be improved by selectively repeated labeling. How do we evaluate the annotators? How might payment influence annotation quality? What should be done if some instances are noisy no matter which oracle is used and repeated labeling does not improve the situation?
6. Skewed label distributions: In two of the data sets we studied, the distributions of the labels are skewed; that is, there are more instances of one label than of another. In the "Yes" and "No" labeled example, there are far more "No" instances. When the label distribution is skewed, active learning might not give much better results than passive learning. This is because, if the labels are not balanced, active learning might query more of one label than another. Not only is the skewed distribution a problem, but the lack of labeled data is also a problem. This is the case in the data set where instances are labeled from an experiment: very few instances are labeled from the beginning, and new unlabeled data arrives every fifteen minutes. "Guided learning" has been proposed to mitigate this slowness problem. Guided learning allows the human annotator to search for class-representative instances in addition to just querying for labels. Empirical studies indicate that guided learning performs better than active learning as long as its annotation costs are less than eight times those of label queries. A minimal sketch of compensating for a skewed label distribution when training the model is given after this list.
7. Real labeling costs and cost reduction: From observing the data scientists at Company A, we estimate that they spend about 80% of their data science time on preprocessing the data. Therefore, we recognize that they do not have time to label too many instances, and it is crucial to reduce the time it takes to label things manually. If possible, manual labeling should be avoided.
Assume that the cost of labeling is uniform. The smaller the training set used, the lower the associated costs. However, in some applications the cost may vary, so simply reducing the number of labeled instances in the training data does not necessarily reduce the cost. This problem is studied within cost-sensitive active learning. To reduce effort in active learning, automatic pre-annotation can help. In automatic pre-annotation the current model's predictions are used to pre-label the queried instances [27, 28]. This can often reduce the labeling effort. If the model makes many classification mistakes, there will be extra work for the human annotator to correct them. To mitigate this, correction propagation can be used, in which the local edits are used to interactively update the predictions. In general, automatic pre-annotation and correction propagation do not deal with the labeling costs themselves; however, they do try to reduce the costs indirectly by minimizing the number of labeling actions performed by the human oracle.
Other cost-sensitive active learning methods take varying labeling costs into account. Both current labeling costs and expected future classification error costs can be incorporated [29]. The costs might not even be deterministic but stochastic.
In many applications the costs are not known beforehand; however, they might be describable as a function of annotation time [30]. To find such a function, a regression cost model can be trained to predict the annotation costs. Studies involving real human annotation costs show the following:
– Annotation costs are not constant across instances [31–34].
– Active learners that ignore costs might not perform better than passive learners [19].
– The annotation costs may vary depending on the person doing the annotation [31, 35].
– The annotation costs can include stochastic components; jitter and pause are two types of noise that affect the annotation speed.
– Annotation costs can be predicted after seeing only a few labeled instances [33, 34].
8. Stopping criteria: This is related to cost reduction. Since active learning is an iterative process, it is relevant to know when to stop learning. Based on our empirical findings, the data scientists have no interest in doing any manual labeling, and if they have to, they want to do as little as possible. So, when the cost of gathering more training data is higher than the cost of the errors made by the current system, it is time to stop extending the training set and hence stop training the machine learning algorithm. From our experience at Company A, the data scientists have so little time left over from other data preprocessing that time is the most common stopping factor.
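As referenced in consideration 6, one minimal way to compensate for a skewed label distribution when training the underlying classifier is class weighting, sketched below on synthetic data with scikit-learn; resampling the training set or guided learning would be alternatives, and the roughly 90/10 split and logistic regression are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 90% of one class and 10% of the other.
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" re-weights the loss inversely to class frequency,
# so the minority class is not simply ignored by the model.
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```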
4.5 Challenges and Mitigation Strategies
Many of the problems identified during Phase I and Phase II overlap to a certain degree. Therefore, we summarized all the problems into three challenges (C1-C3) that were later mapped to three mitigation strategies (MS1-MS3). These mitigation strategies are derived from the practical considerations above. Finally, we map MS1 to C1, MS2 to C2, and MS3 to C3.
C1: Pre-processing: This challenge represents all that needs to be done during the planning stage of the labeling procedure. This includes creating a systematic approach for labeling (problem 1 of Phase I), doing an exploratory data analysis to find correlations between labels and features (problem 4 of Phase I), as well as choosing a model that can be reused on new data (problem 6 of Phase I) and labeling instances with respect to multiple tasks (problem 7 of Phase I, problem 4 of Phase II).
MS1: Planning: This strategy contains all the solution frameworks from practical considerations 1, 2, 3, 4, 7, and 8. This is because they all involve the steps necessary to plan an active learning strategy for labeling.
C2: Annotation: This challenge represents the problems concerning choosing an annotator as well as evaluating and reducing the label noise (problems 2 and 3 of Phase I and problem 3 of Phase II).
MS2: Oracle selection: This strategy contains only the solution frameworks from practical consideration 5. It describes how we can choose oracles to produce top quality labels.
C3: Label Distribution: This challenge represents all the problems concerning the symmetry of the label distributions, such as learning with a skewed label distribution (problem 5 of Phase I and problem 1 of Phase II).
MS3: Label distribution: This strategy contains the solution frameworks from practical consideration 6. It describes how we can do labeling when the label distribution is skewed.
5 Discussion
From our verification interview with Company B, we learned that active learning is a popular tool for acquiring labels. Thanks to active learning, the labeling task takes 200 times less time than if active learning were not used.
In the background we presented some current practices that can help with labeling, the most popular being crowdsourcing. However, crowdsourcing has its own set of problems. The primary concern is that bad annotators will produce noisy labels due to inexperience or the human factor. Second, the benefit of allowing a third-party company to label data is that you do not have to spend time training your employees to do the job, nor do you need to develop your own annotation tools and infrastructure. The big downside is that you have to share confidential company data with the crowdsourcing platform. Repeated labeling can be used to improve the quality of the labels, but there is no guarantee that the quality will actually improve. Rather than correcting noisy labels, there are ways in which the machine learning models can be changed so that they can handle noisy labels. The downside to this is that you need to know beforehand which instances are noisy, which can be difficult in an industrial setting.
None of the techniques discussed in the background utilize automated labeling using machine learning. Through our efforts, we managed to formulate three labeling challenges and provide mitigation strategies based on active machine learning. These challenges are related to questions such as: How can a labeling process be structured? Who labels the instances, and how? Can a correlation between labels and features be found, so that labels can be determined from the features? Both manual and automatic labeling involve some noise in the labels, so how should these noisy labels be used? What do we do if the distribution of the labels is skewed? How do we take into account the fact that some of the labels might change over time, due to the nature of the data? How do we label instances so that the labels can be useful for several future tasks?
Three mitigation strategies that could possibly solve the three challenges
were presented.
6 Conclusion
The goal of this study is to provide a detailed overview of the challenges that the industry faces with labeling, as well as to outline mitigation strategies for these challenges.
To the best of our knowledge, 95% of all the machine learning algorithms deployed in industry are supervised. Therefore, it is important that every dataset is complete with labeled instances. Otherwise, the data would be insufficient and supervised learning would not be possible.
It proves challenging to find and structure a labeling process. You need to define a systematic approach for labeling and examine the data to choose the optimal model. Finally, you need to choose an oracle to produce top-quality labels as well as plan how to handle skewed label distributions.
The contribution of this paper is twofold. First, based on a case study involving two companies, we identified problems that companies experience in relation to labeling data. We validated these problems using interviews at both companies and summarized all problems into challenges. Second, we presented an overview of the mitigation strategies that companies employ (or could employ) to address the challenges.
In our future work, we aim to further verify the challenges as well as the mitigation strategies with more companies. In addition, we intend to develop solutions to simplify the use of automated labeling in industrial contexts.
Acknowledgment
This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP), funded by the Knut and Alice Wallenberg Foundation.
References
1. Cognilytica Research, “Data Preparation & Labeling for AI 2020,” tech. rep., Cog-
nilytica Research, 2020.
2. Y. Roh, G. Heo, and S. E. Whang, “A survey on data collection for machine
learning: a big data-ai integration perspective,” IEEE Transactions on Knowledge
and Data Engineering, 2019.
3. AzatiSoftware, Automated Data Labeling with Machine Learning, 2019.
https://azati.ai/automated-data-labeling-with-machine-learning.
4. J. C. Chang, S. Amershi, and E. Kamar, “Revolt: Collaborative crowdsourcing for
labeling machine learning datasets,” in Proceedings of the 2017 CHI Conference
on Human Factors in Computing Systems, pp. 2334–2346, 2017.
5. J. Zhang, V. S. Sheng, T. Li, and X. Wu, “Improving crowdsourced label qual-
ity using noise correction,” IEEE transactions on neural networks and learning
systems, vol. 29, no. 5, pp. 1675–1688, 2017.
6. H. Cloud Factory, Crowd vs. Managed Team: A Study on Quality Data Processing
at Scale, 2020. https://go.cloudfactory.com/hubfs/02-Contents/3-Reports/Crowd-
vs-Managed-Team-Hivemind-Study.pdf.
7. J. Zhang, X. Wu, and V. S. Sheng, “Learning from crowdsourced labeled data: a
survey,” Artificial Intelligence Review, vol. 46, no. 4, pp. 543–576, 2016.
8. hackernoon.com, Crowdsourcing Data Labeling for Machine Learning Projects,
2020. https://hackernoon.com/crowdsourcing-data-labeling-for-machine-learning-
projects-a-how-to-guide-cp6h32nd.
9. P. G. Ipeirotis, F. Provost, V. S. Sheng, and J. Wang, “Repeated labeling using
multiple noisy labelers,” Data Mining and Knowledge Discovery, vol. 28, no. 2,
pp. 402–441, 2014.
10. A. Sheshadri and M. Lease, “Square: A benchmark for research on computing
crowd consensus,” in First AAAI conference on human computation and crowd-
sourcing, 2013.
11. S. Sukhbaatar and R. Fergus, “Learning from noisy labels with deep neural net-
works,” arXiv preprint arXiv:1406.2080, vol. 2, no. 3, p. 4, 2014.
12. C. G. Northcutt, L. Jiang, and I. L. Chuang, “Confident learning: Estimating
uncertainty in dataset labels,” arXiv preprint arXiv:1911.00068, 2019.
13. P. Reason and H. Bradbury, Handbook of action research: Participative inquiry and
practice. Sage, 2001.
14. P. Runeson and M. Höst, "Guidelines for conducting and reporting case study
research in software engineering," Empirical Software Engineering, vol. 14, no. 2,
p. 131, 2009.
15. M. Staron, Action Research in Software Engineering: Theory and Applications.
Springer Nature, 2019.
16. V. Braun and V. Clarke, “Using thematic analysis in psychology,” Qualitative
research in psychology, vol. 3, no. 2, pp. 77–101, 2006.
17. Towards Data Science, What To Do When Your Classification Data is Imbalanced?, 2019.
https://towardsdatascience.com/what-to-do-when-your-classification-dataset-is-
imbalanced-6af031b12a36.
18. T. Fredriksson, J. Bosch, and H. Holmström-Olsson, "Machine learning models for
automatic labeling: A systematic literature review," 2020.
19. B. Settles, Active Learning. Synthesis Lectures on Artificial Intelligence and
Machine Learning, Morgan & Claypool, 2012.
20. X. J. Zhu, “Semi-supervised learning literature survey,” tech. rep., University of
Wisconsin-Madison Department of Computer Sciences, 2005.
21. N. N. Pise and P. Kulkarni, "A survey of semi-supervised learning methods," in
2008 International Conference on Computational Intelligence and Security, vol. 2,
pp. 30–34, IEEE, 2008.
22. E. Bair, “Semi-supervised clustering methods,” Wiley Interdisciplinary Reviews:
Computational Statistics, vol. 5, no. 5, pp. 349–361, 2013.
23. C. Körner and S. Wrobel, "Multi-class ensemble-based active learning," in Euro-
pean conference on machine learning, pp. 687–694, Springer, 2006.
24. A. I. Schein and L. H. Ungar, “Active learning for logistic regression: an evalua-
tion,” Machine Learning, vol. 68, no. 3, pp. 235–265, 2007.
25. B. Settles and M. Craven, “An analysis of active learning strategies for sequence
labeling tasks,” in Proceedings of the 2008 Conference on Empirical Methods in
Natural Language Processing, pp. 1070–1079, 2008.
26. A. Harpale, Multi-task active learning. PhD thesis, Carnegie Mellon University,
2012.
27. J. Baldridge and M. Osborne, “Active learning and the total cost of annotation,”
in Proceedings of the 2004 Conference on Empirical Methods in Natural Language
Processing, pp. 9–16, 2004.
28. A. Culotta and A. McCallum, “Reducing labeling effort for structured prediction
tasks,” in AAAI, vol. 5, pp. 746–751, 2005.
29. A. Kapoor, E. Horvitz, and S. Basu, “Selective supervision: Guiding supervised
learning with decision-theoretic active learning.,” in IJCAI, vol. 7, pp. 877–882,
2007.
30. B. Settles, M. Craven, and L. Friedland, "Active learning with real annotation
costs," in Proceedings of the NIPS Workshop on Cost-Sensitive Learning, pp. 1–10,
Vancouver, CA, 2008.
31. S. Arora, E. Nyberg, and C. Rose, “Estimating annotation cost for active learn-
ing in a multi-annotator environment,” in Proceedings of the NAACL HLT 2009
Workshop on Active Learning for Natural Language Processing, pp. 18–26, 2009.
32. E. K. Ringger, M. Carmen, R. Haertel, K. D. Seppi, D. Lonsdale, P. McClana-
han, J. L. Carroll, and N. Ellison, “Assessing the costs of machine-assisted corpus
annotation through a user study.,” in LREC, vol. 8, pp. 3318–3324, 2008.
33. S. Vijayanarasimhan and K. Grauman, “What’s it going to cost you?: Predict-
ing effort vs. informativeness for multi-label image annotations,” in 2009 IEEE
Conference on Computer Vision and Pattern Recognition, pp. 2262–2269, IEEE,
2009.
34. B. C. Wallace, K. Small, C. E. Brodley, J. Lau, and T. A. Trikalinos, “Modeling
annotation time to reduce workload in comparative effectiveness reviews,” in Pro-
ceedings of the 1st ACM International Health Informatics Symposium, pp. 28–35,
2010.
35. R. A. Haertel, K. D. Seppi, E. K. Ringger, and J. L. Carroll, “Return on invest-
ment for active learning,” in Proceedings of the NIPS Workshop on Cost-Sensitive
Learning, vol. 72, 2008.
... The general ML workflow (e.g., Chapman et al., 1999;Kelleher & Brendan, 2018;Schröer et al., 2021) begins with the creation of a training dataset from which a machine can learn something (Figure 1). Most applications today are based on supervised learning procedures through which a machine learns from labeled data, e.g., text describing an image, such as a photo or drawing of a dog or cat (Fredriksson et al., 2020). Then the training dataset is processed by an algorithm that "trains" the machine to recognize corresponding patterns. ...
... This is important for two reasons. First, collecting and annotating data are crucial but time-consuming activities that take most of the time spent during ML development (Fredriksson et al., 2020). Second, if this element is neglected or poorly done, the resulting ML models will perform poorly and generate inaccurate, irrelevant, or even harmful results (Sambasivan et al., 2021). ...
Article
Full-text available
With recent advances in artificial intelligence (AI), machine learning (ML) has been identified as particularly useful for organizations seeking to create value from data. However, as ML is commonly associated with technical professions, such as computer science and engineering, incorporating training in the use of ML into non-technical educational programs, such as social sciences courses, is challenging. Here, we present an approach to address this challenge by using no-code AI in a course for university students with diverse educational backgrounds. This approach was tested in an empirical, case-based educational setting, in which students engaged in data collection and trained ML models using a no-code AI platform. In addition, a framework consisting of five principles of instruction (problem-centered learning, activation, demonstration, application, and integration) was applied. This paper contributes to the literature on IS education by providing information for instructors on how to incorporate no-code AI in their courses and insights into the benefits and challenges of using no-code AI tools to support the ML workflow in educational settings.
... Handling unlabeled data. While pretraining DL models using large-scale public dataset have demonstrated the potential in generalizing manufacturing settings via model refinement through the use of only a small amount of labeled manufacturing data to alleviate the challenge of scarcity of labeled data (Fredriksson, Mattos, Bosch, & Olsson, 2020), the method can be challenging as pretrained DL models are not available for many manufacturing applications. Recent advances of unsupervised and semisupervised learning (Okaro et al., 2019;Zeiser, Ö zcan, van Stein, & Bäck, 2023) have shown solutions to tackle the challenge. ...
Chapter
This chapter is arranged as follows: Section 10.2 will delve deeper into the latest advances in data denoising, data annotation, and data balancing facilitated by DL techniques. Section 10.3 will present the manufacturing applications of these methodologies, followed by a discussion on the remaining challenges and opportunities in Section 10.4, and conclusions in Section 10.5.
... However, adapting an LLM for text retrieval requires labeled datasets, with numerous example queries and documents both related and unrelated to forming a helpful response. This poses a significant challenge to improving these systems, as data annotation or synthetic generation can be too expensive, difficult, and errorprone (Fredriksson et al., 2020;Desmond et al., 2021). Making matters worse, fine-tuning an existing LLM on new domains can cause forgetting, a decreased performance on previously capable tasks (McCloskey and Cohen, 1989;Kotha et al., 2024). ...
Preprint
Full-text available
Large language models (LLMs) fine-tuned for text-retrieval have demonstrated state-of-the-art results across several information retrieval (IR) benchmarks. However, supervised training for improving these models requires numerous labeled examples, which are generally unavailable or expensive to acquire. In this work, we explore the effectiveness of extending reverse engineered adaptation to the context of information retrieval (RE-AdaptIR). We use RE-AdaptIR to improve LLM-based IR models using only unlabeled data. We demonstrate improved performance both in training domains as well as zero-shot in domains where the models have seen no queries. We analyze performance changes in various fine-tuning scenarios and offer findings of immediate use to practitioners.
... The initial prompt version primarily focused on enumerating the types of data to be identified. However, given the inherent complexity of data categorization -a challenge even for human annotators as substantiated in related literature [41]-the prompt was augmented to include the internal definitions used for manual annotations in the MAPP dataset. While this expansion resulted in a lengthier prompt, it significantly enhanced all metrics except for recall, with an increase of +4.58% in accuracy, a decrease of -0.79% in recall, +5.96% in precision, and +2.72% in F1 score. ...
Preprint
Full-text available
The number and dynamic nature of web and mobile applications presents significant challenges for assessing their compliance with data protection laws. In this context, symbolic and statistical Natural Language Processing (NLP) techniques have been employed for the automated analysis of these systems' privacy policies. However, these techniques typically require labor-intensive and potentially error-prone manually annotated datasets for training and validation. This research proposes the application of Large Language Models (LLMs) as an alternative for effectively and efficiently extracting privacy practices from privacy policies at scale. Particularly, we leverage well-known LLMs such as ChatGPT and Llama 2, and offer guidance on the optimal design of prompts, parameters, and models, incorporating advanced strategies such as few-shot learning. We further illustrate its capability to detect detailed and varied privacy practices accurately. Using several renowned datasets in the domain as a benchmark, our evaluation validates its exceptional performance, achieving an F1 score exceeding 93%. Besides, it does so with reduced costs, faster processing times, and fewer technical knowledge requirements. Consequently, we advocate for LLM-based solutions as a sound alternative to traditional NLP techniques for the automated analysis of privacy policies at scale.
... While an instruction-tuned model is generally more capable for popular tasks, the majority of data available for additional fine-tuning is unlabeled, lacking the annotations expected from instruct models. This poses a significant problem as annotation by the downstream organization can be too difficult, expensive, or error-prone (Fredriksson et al., 2020;Desmond et al., 2021). Additional fine-tuning can also degrade the performance of the instruction-tuned model outside of the new fine-tuning distribution (Kotha et al., 2024). ...
Preprint
Full-text available
We introduce RE-Adapt, an approach to fine-tuning large language models on new domains without degrading any pre-existing instruction-tuning. We reverse engineer an adapter which isolates what an instruction-tuned model has learned beyond its corresponding pretrained base model. Importantly, this requires no additional data or training. We can then fine-tune the base model on a new domain and readapt it to instruction following with the reverse engineered adapter. RE-Adapt and our low-rank variant LoRE-Adapt both outperform other methods of fine-tuning, across multiple popular LLMs and datasets, even when the models are used in conjunction with retrieval-augmented generation.
... First, a huge amount of labeled data is required to train the state of the art deep learning and other supervised machine learning methods. Such labeled data are often hard to obtain, since they require huge manual effort coming from domain experts and crowdworkers (Fredriksson et al. 2020;Sheng and Zhang 2019;Weld, Dai et al. 2011). Different applications/experiments may have unique characteristics, which makes it very difficult to use labeled data from one particular application for other applications (Zhuang et al. 2020). ...
Article
The availability of tera-byte scale experiment data calls for AI driven approaches which automatically discover scientific models from data. Nonetheless, significant challenges present in AI-driven scientific discovery: (i) The annotation of large scale datasets requires fundamental re-thinking in developing scalable crowdsourcing tools. (ii) The learning of scientific models from data calls for innovations beyond black-box neural nets. (iii) Novel visualization & diagnosis tools are needed for the collaboration of experimental and theoretical physicists, and computer scientists. We present Phase-Field-Lab platform for end-to-end phase field model discovery, which automatically discovers phase field physics models from experiment data, integrating experimentation, crowdsourcing, simulation and learning. Phase-Field-Lab combines (i) a streamlined annotation tool which reduces the annotation time (by ~50-75%), while increasing annotation accuracy compared to baseline; (ii) an end-to-end neural model which automatically learns phase field models from data by embedding phase field simulation and existing domain knowledge into learning; and (iii) novel interfaces and visualizations to integrate our platform into the scientific discovery cycle of domain scientists. Our platform is deployed in the analysis of nano-structure evolution in materials under extreme conditions (high temperature and irradiation). Our approach reveals new properties of nano-void defects, which otherwise cannot be detected via manual analysis.
... In this context, there is a growing need for dependable labeled data to feed the models, which is commonly supplied by human effort through the manual annotation and categorization of texts (Fredriksson et al. 2020). Even though this human participation improves model performance, in many projects the usual process of reading, searching, identifying, circumscribing, and reviewing can be costly in terms of time, money, and effort. Our paper aims to explore alternative strategies to this traditional NER approach, introducing solutions using weak labeling, machine learning, and other computational methods to reduce cost and effort. ...
Article
Named entity recognition (NER) is a highly relevant task for text information retrieval in natural language processing (NLP) problems. Most recent state-of-the-art NER methods require humans to annotate and provide useful data for model training. However, using human power to identify, circumscribe and label entities manually can be very expensive in terms of time, money, and effort. This paper investigates the use of prompt-based language models (OpenAI's GPT-3) and weak supervision in the legal domain. We apply both strategies as alternative approaches to the traditional human-based annotation method, relying on computer power instead of human effort for labeling, and subsequently compare model performance between computer- and human-generated data. We also introduce combinations of all three mentioned methods (prompt-based, weak supervision, and human annotation), aiming to find ways to maintain high model efficiency and low annotation costs. We show that, although human labeling still yields better overall performance, the alternative strategies and their combinations are valid options, displaying positive results and similar model scores at lower cost. Final results demonstrate preservation of the human-trained model scores, averaging 74.0% for GPT-3, 95.6% for weak supervision, 90.7% for the GPT + weak supervision combination, and 83.9% for the GPT + 30% human-labeling combination.
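To give a minimal flavour of the weak-supervision side of such a pipeline, the sketch below lets simple heuristic labeling functions vote on each token, with the majority vote becoming a noisy training label. The patterns, gazetteers, and tag set are illustrative assumptions rather than the rules used in the cited study.

# Toy weak-labeling sketch: heuristic labeling functions vote per token.
import re
from collections import Counter

def lf_statute(token):      # crude pattern for statute references
    return "LAW" if re.fullmatch(r"(Art\.|Article|§)\d*", token) else None

def lf_court(token):        # crude gazetteer lookup
    return "ORG" if token in {"Court", "Tribunal", "Supreme"} else None

def lf_person_title(token):
    return "PER" if token in {"Judge", "Justice", "Mr.", "Ms."} else None

LABELING_FUNCTIONS = [lf_statute, lf_court, lf_person_title]

def weak_label(tokens):
    labels = []
    for tok in tokens:
        votes = [lf(tok) for lf in LABELING_FUNCTIONS if lf(tok) is not None]
        labels.append(Counter(votes).most_common(1)[0][0] if votes else "O")
    return labels

print(weak_label("Judge Smith of the Supreme Court cited Article 7".split()))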
Article
The evolution of manufacturing systems toward Industry 4.0 and 5.0 paradigms has pushed the diffusion of Machine Learning (ML) in this field. As the number of articles using ML to support manufacturing functions is expanding tremendously, the main objective of this review article is to provide a comprehensive and updated overview of these applications. 114 journal articles have been collected, analysed, and classified in terms of supervision approaches, function, ML algorithm, data inputs and outputs, and application domain. The findings show the fragmentation of the field and that most of the ML-based systems address limited objectives. Some inputs and outputs of the analysed support tools are shared across the reviewed contributions, and their possible combinations have been outlined. The advantages, limitations, and research opportunities of ML support in manufacturing are discussed. The paper outlines that the excessive specialization of the reviewed applications could be overcome by increasing the diffusion of transfer learning in the manufacturing domain.
Article
With the rapid growth of crowdsourcing systems, quite a few applications based on a supervised learning paradigm can easily obtain massive labeled data at relatively low cost. However, due to the varying and uncertain reliability of crowdsourced labelers, learning procedures face great challenges. Thus, improving the quality of labels and learning models plays a key role in learning from crowdsourced labeled data. In this survey, we first introduce the basic concepts of label quality and learning model quality. Then, by reviewing recently proposed models and algorithms for ground truth inference and learning models, we analyze connections and distinctions among these techniques and clarify the level of progress of related research. To facilitate studies in this field, we also introduce openly accessible real-world data sets collected from crowdsourcing systems as well as open-source libraries and tools. Finally, some potential issues for future studies are discussed.
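The simplest baseline in this line of work, majority voting combined with a crude per-labeler accuracy estimate, can be sketched as follows. Established inference methods (for example, EM-based ones) iterate between these two steps and weight labelers accordingly, which this toy version deliberately omits; the data are invented.

# Majority-vote ground-truth inference with a naive labeler-quality estimate.
from collections import Counter, defaultdict

# annotations[item] = list of (labeler_id, label)
annotations = {
    "item1": [("a", "spam"), ("b", "spam"), ("c", "ham")],
    "item2": [("a", "ham"),  ("b", "ham"),  ("c", "ham")],
}

consensus = {item: Counter(l for _, l in votes).most_common(1)[0][0]
             for item, votes in annotations.items()}

agree, total = defaultdict(int), defaultdict(int)
for item, votes in annotations.items():
    for labeler, label in votes:
        total[labeler] += 1
        agree[labeler] += int(label == consensus[item])

quality = {labeler: agree[labeler] / total[labeler] for labeler in total}
print(consensus, quality)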
Book
This book addresses action research (AR), one of the main research methodologies used for academia-industry research collaborations. It elaborates on how to find the right research activities and how to distinguish them from non-significant ones. Further, it details how to glean lessons from the research results, no matter whether they are positive or negative. Lastly, it shows how companies can evolve and build talent while expanding their product portfolio. The book's structure is based on that of AR projects; it sequentially covers and discusses each phase of the project. Each chapter shares new insights into AR and provides the reader with a better understanding of how to apply it. In addition, each chapter includes a number of practical use cases or examples. Taken together, the chapters cover the entire software lifecycle: from problem diagnosis to project (or action) planning and execution, to documenting and disseminating results, including validity assessments for AR studies. The goal of this book is to help everyone interested in industry-academia collaborations to conduct joint research. It is for students of software engineering who need to learn how to set up an evaluation, how to run a project, and how to document the results. It is for all academics who aren't afraid to step out of their comfort zone and enter industry. It is for industrial researchers who know that they want to do more than just develop software blindly. And finally, it is for stakeholders who want to learn how to manage industrial research projects and how to set up guidelines for their own role and expectations.
Article
Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are largely two reasons data collection has recently become a critical issue. First, as machine learning is becoming more widely-used, we are seeing new applications that do not necessarily have enough labeled data. Second, unlike traditional machine learning, deep learning techniques automatically generate features, which saves feature engineering costs, but in return may require larger amounts of labeled data. Interestingly, recent research in data collection comes not only from the machine learning, natural language, and computer vision communities, but also from the data management community due to the importance of handling large amounts of data. In this survey, we perform a comprehensive study of data collection from a data management point of view. Data collection largely consists of data acquisition, data labeling, and improvement of existing data or models. We provide a research landscape of these operations, provide guidelines on which technique to use when, and identify interesting research challenges. The integration of machine learning and data management for data collection is part of a larger trend of Big data and Artificial Intelligence (AI) integration and opens many opportunities for new research.
Conference Paper
Crowdsourcing provides a scalable and efficient way to construct labeled datasets for training machine learning systems. However, creating comprehensive label guidelines for crowdworkers is often prohibitive even for seemingly simple concepts. Incomplete or ambiguous label guidelines can then result in differing interpretations of concepts and inconsistent labels. Existing approaches for improving label quality, such as worker screening or detection of poor work, are ineffective for this problem and can lead to rejection of honest work and a missed opportunity to capture rich interpretations about data. We introduce Revolt, a collaborative approach that brings ideas from expert annotation workflows to crowd-based labeling. Revolt eliminates the burden of creating detailed label guidelines by harnessing crowd disagreements to identify ambiguous concepts and create rich structures (groups of semantically related items) for post-hoc label decisions. Experiments comparing Revolt to traditional crowdsourced labeling show that Revolt produces high-quality labels without requiring label guidelines, in exchange for an increase in monetary cost. This up-front cost, however, is mitigated by Revolt's ability to produce reusable structures that can accommodate a variety of label boundaries without requiring new data to be collected. Further comparisons of Revolt's collaborative and non-collaborative variants show that collaboration reaches higher label accuracy with lower monetary cost.
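One simple way to operationalise the idea of harnessing crowd disagreement, sketched below, is to compute the entropy of each item's label distribution and route high-entropy items to a post-hoc decision step. The threshold and data are illustrative, and this is not Revolt's actual workflow.

# Flag ambiguous items by the entropy of their crowd-label distribution.
import math
from collections import Counter

def label_entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

crowd_labels = {
    "post_17": ["toxic", "toxic", "toxic"],
    "post_42": ["toxic", "ok", "ok", "toxic"],   # disagreement -> ambiguous
}

ambiguous = [item for item, labels in crowd_labels.items()
             if label_entropy(labels) > 0.9]
print(ambiguous)   # ['post_42']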
Article
Crowdsourcing systems provide a cost-effective and convenient way to collect labels, but they often fail to guarantee label quality. This paper proposes a novel framework that introduces noise-correction techniques to further improve the quality of integrated labels that are inferred from the multiple noisy labels of objects. In the proposed general framework, information about the quality of labelers estimated by a front-end ground truth inference algorithm is utilized to supervise subsequent label-noise filtering and correction. The framework uses a novel algorithm termed adaptive voting noise correction (AVNC) to precisely identify and correct potential noisy labels. After filtering out the instances with noisy labels, the remaining cleansed data set is used to create multiple weak classifiers, from which a powerful ensemble classifier is induced to correct these noisy labels. Experimental results on eight simulated data sets with different kinds of features and two real-world crowdsourcing data sets in different domains consistently show that: 1) the proposed framework can improve label quality regardless of the inference algorithm, especially when each instance has only a few repeated labels; and 2) since the proposed AVNC algorithm considers both the number and the probability of potential label noises, it outperforms state-of-the-art noise-correction algorithms.
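A much-simplified sketch of the general pattern, quality-weighted label integration with flagging of low-margin instances as candidate noise, is shown below. The labeler qualities and margin threshold are invented for illustration, and the code is not the AVNC algorithm itself; in a fuller pipeline the flagged instances would be corrected by classifiers trained on the cleaner remainder.

# Quality-weighted voting with a margin-based "suspicious label" flag.
from collections import defaultdict

labeler_quality = {"a": 0.9, "b": 0.6, "c": 0.55}   # assumed front-end estimates

def integrate(votes, margin=0.2):
    # votes: list of (labeler_id, label)
    scores = defaultdict(float)
    for labeler, label in votes:
        scores[label] += labeler_quality[labeler]
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best = ranked[0][1]
    second = ranked[1][1] if len(ranked) > 1 else 0.0
    suspicious = (best - second) / sum(scores.values()) < margin
    return ranked[0][0], suspicious   # integrated label, noise-candidate flag

print(integrate([("a", "cat"), ("b", "dog"), ("c", "dog")]))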
Article
While many statistical consensus methods now exist, the relative lack of comparative benchmarking and integration of techniques has made it increasingly difficult to determine the current state of the art, to evaluate the relative benefit of new methods, to understand where specific problems merit greater attention, and to measure field progress over time. To make such comparative evaluation easier for everyone, we present SQUARE, an open-source shared-task framework including benchmark datasets, defined tasks, standard metrics, and reference implementations with empirical results for several popular methods. In addition to measuring performance on a variety of public, real crowd datasets, the benchmark also varies supervision and noise by manipulating training size and labeling error. We envision SQUARE as dynamic and continually evolving, with new datasets and reference implementations being added according to community needs and interest. We invite community contributions and participation.
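A toy version of the kind of controlled experiment such a benchmark supports, varying the labeling error rate and measuring majority-vote accuracy, might look as follows. Everything here is synthetic and unrelated to the SQUARE datasets themselves.

# Simulate labelers with a given error rate and score majority voting.
import random
from collections import Counter

def simulate(n_items=1000, n_labelers=5, error_rate=0.3, n_classes=2, seed=0):
    rnd = random.Random(seed)
    correct = 0
    for _ in range(n_items):
        truth = rnd.randrange(n_classes)
        votes = [truth if rnd.random() > error_rate
                 else rnd.randrange(n_classes) for _ in range(n_labelers)]
        if Counter(votes).most_common(1)[0][0] == truth:
            correct += 1
    return correct / n_items

for err in (0.1, 0.3, 0.5):
    print(err, simulate(error_rate=err))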
Article
Cluster analysis methods seek to partition a data set into homogeneous subgroups. Cluster analysis is useful in a wide variety of applications, including document processing and modern genetics. Conventional clustering methods are unsupervised, meaning that there is no outcome variable nor is anything known about the relationship between the observations in the data set. In many situations, however, information about the clusters is available in addition to the values of the features. For example, the cluster labels of some observations may be known, or certain observations may be known to belong to the same cluster. In other cases, one may wish to identify clusters that are associated with a particular outcome variable. This review describes several clustering algorithms (known as 'semi-supervised clustering' methods) that can be applied in these situations. The majority of these methods are modifications of the popular k-means clustering method, and several of them will be described in detail. A brief description of some other semi-supervised clustering algorithms is also provided.
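As one concrete example of the k-means modifications such a review describes, a "seeded" variant initializes centroids from the few labeled observations before clustering the unlabeled data. The sketch below uses scikit-learn and synthetic data and is only one simple variant among those surveyed.

# Seeded k-means: labeled points define the initial centroids.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(200, 2))
X_unlabeled[:, 0] += rng.choice([-3.0, 3.0], size=200)   # two latent clusters

X_seed = np.array([[-3.0, 0.0], [3.0, 0.0]])   # one labeled point per cluster
y_seed = np.array([0, 1])

# centroids initialized from the labeled seeds
init = np.vstack([X_seed[y_seed == k].mean(axis=0) for k in (0, 1)])
km = KMeans(n_clusters=2, init=init, n_init=1, random_state=0).fit(X_unlabeled)
print(km.cluster_centers_)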