Data Labeling: An Empirical Investigation into
Industrial Challenges and Mitigation Strategies
Teodor Fredriksson1, David Issa Mattos1[0000-0002-2501-9926], Jan
Bosch1[0000-0003-2854-722X], and Helena Holmström Olsson2[0000-0002-7700-1816]
1 Chalmers University of Technology, Hörselgången 11, 417 56 Gothenburg, Sweden
{teodorf,davidis,jan.bosch}@chalmers.se
2 Malmö University, Nordenskiöldsgatan 1, 211 19 Malmö, Sweden
helena.holmstrom.olsson@mau.se
Abstract. Labeling is a cornerstone of supervised machine learning. However, in industrial applications data is often not labeled, which complicates the use of this data for machine learning. Although there are well-established labeling techniques such as crowdsourcing, active learning, and semi-supervised learning, these still do not provide accurate and reliable labels for every machine learning use case in industry. In this context, industry still relies heavily on manually annotating and labeling its own data. This study investigates the challenges that companies experience when annotating and labeling their data. We performed a case study using semi-structured interviews with data scientists at two companies to explore what problems they experience when labeling and annotating their data. This paper provides two contributions. We identify industry challenges in the labeling process, and then we propose mitigation strategies for these challenges.
Keywords: Data Labeling · Machine Learning · Case Study
1 Introduction
Current research estimates that over 80% of the engineering tasks in a machine learning (ML) project concern data preparation and labeling, and that the third-party data labeling market is expected to almost triple by 2024 [1, 2]. This large effort spent on data preparation and labeling often arises because, in industry, datasets are often incomplete in the sense that some or all instances are missing labels. In addition, in some cases the labels that are available are of poor quality, meaning that the label associated with a data entry is incorrect or only partially correct. Labels of sufficient quality are a prerequisite for supervised machine learning, as the performance of the model in operation is directly influenced by the quality of the training data [3].
To overcome the lack of labels in both quantity and quality, crowdsourcing has been a common strategy for acquiring quality labels with human supervision [4, 5], in particular for computer vision and natural language processing
applications. However, for other industry applications, crowdsourcing has several limitations, such as allowing unknown third parties access to company data and a lack of people with an in-depth understanding of the problem or the business to create quality labels. In-house labeling can be half as expensive as crowdsourced labels while providing higher quality [6]. Due to these factors, companies still perform in-house labeling. Despite the large body of research on crowdsourcing and machine learning systems that can overcome different label quality problems, to the best of our knowledge, there is no research that investigates the challenges faced and strategies adopted by data scientists and human labelers in the labeling process of company-specific applications. In particular, we focus on the problems seen in applications where labeling is non-trivial and requires understanding of the problem domain.
Utilizing case study research based on semi-structured interviews with practitioners in two companies, one of which has extensive labeling experience, we study the challenges and the adopted mitigation strategies in the data labeling process that these companies employ. The contribution of this paper is twofold. First, we identify the key challenges that these companies experience in relation to labeling data. Second, we present an overview of the mitigation strategies that companies employ regularly, or potential solutions, to address these challenges.
The remainder of the paper is organized as follows. In the next section, we provide a more in-depth overview of the background of our research. Subsequently, in Section 3 we present the research method that we employed in the paper as well as an overview of the case companies. Section 4 presents the challenges that we identified during the case study, observations and interviews at the companies, the results from the expert interviews to validate the challenges, as well as the mitigation strategies. Section 5 discusses our findings, and the paper is concluded in Section 6.
2 Background
Crowdsourcing is defined as the process of acquiring required information or results by requesting assistance from a group of many people available through online communities. It is a way of dividing and distributing a large project among people, and after each task is completed, the people involved are rewarded [7]. According to [2], crowdsourcing is the primary way of acquiring labels. In the context of machine learning, however, crowdsourcing has its own set of problems. The primary problem is annotators that produce bad labels. An annotator might not be able to label instances correctly, and even if an annotator is an expert, the quality of the labels will potentially decrease over time due to the human factor [3]. Examples of crowdsourcing platforms are Amazon Mechanical Turk and Lionbridge AI [8].
Allowing a third-party company to label your data has its benefits, such as not having to develop your own annotation tools and labeling infrastructure. In-house labeling also requires investing time in training your annotators, which is not optimal if you do not have enough time and resources. A downside is that sensitive and confidential company data has to be shared with the crowdsourcing platforms. Therefore, there are essential factors to consider before selecting a crowdsourcing platform, such as: How many and what kind of projects has the platform been successful with previously? Does the platform have high-quality labeling technologies so that high-quality labels can be obtained? How does the platform ensure that the annotators can produce labels of sufficient quality?
What security measures are taken to ensure the safety of your data?
A tool to be used in crowdsourcing when noisy labels are cheap to obtain is repeated labeling. According to [9], repeated labeling should be exercised if labeling can be repeated and the labels are noisy. This approach can improve the quality of the labels, which leads to improved quality of the machine learning model. It seems to work especially well when the repeated labeling is done selectively, taking into account label uncertainty and machine learning model uncertainty. However, this approach does not guarantee that the quality is improved. Sheshadri and Lease [10] provide an empirical evaluation study that compares different algorithms that compute the crowd consensus on benchmark crowdsourced data sets using the Statistical Quality Assurance Robustness Evaluation (SQUARE) benchmark [10]. The conclusion of [10] is that no matter what algorithm you choose, there is no significant difference in accuracy. These algorithms include majority voting (MV), ZenCrowd (ZC), and Dawid and Skene (DS)/Naive Bayes (NB) [9]. There are also other ways to handle noisy labels; for example, in [11] the accuracy when training a deep neural network with noisy labels is improved by incorporating a noise layer. So rather than correcting noisy labels, there are ways to change the machine learning models so that they can handle noisy labels. The downside to this approach is that you need to know which instances are clean and which instances are noisy, which can be difficult with industrial data. Another strategy to detect noisy labels is confident learning, which can be used to identify noisy labels as well as learn from noisy labels [12].
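To make the crowd-consensus idea concrete, the following is a minimal sketch of majority voting (MV) over repeated labels, assuming labels are collected as a mapping from instance identifiers to lists of annotator votes; the function and variable names are illustrative and not taken from any specific platform.

```python
from collections import Counter

def majority_vote(votes_per_instance):
    """Aggregate repeated labels by majority vote, a simple crowd-consensus baseline.

    votes_per_instance: dict mapping an instance id to the labels given by
    different annotators, e.g. {"x1": ["Yes", "No", "Yes"]}.
    """
    consensus = {}
    for instance_id, votes in votes_per_instance.items():
        # most_common(1) returns the label with the highest vote count;
        # ties are broken arbitrarily, a known limitation of plain MV.
        consensus[instance_id] = Counter(votes).most_common(1)[0][0]
    return consensus

# Three annotators labeled two instances.
print(majority_vote({"x1": ["Yes", "No", "Yes"], "x2": ["No", "No", "Yes"]}))
# {'x1': 'Yes', 'x2': 'No'}
```

Algorithms such as ZenCrowd or Dawid and Skene additionally estimate per-annotator reliability, but as noted above, [10] found no significant accuracy difference between these and plain majority voting.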
3 Research Method
In this paper, we report on case study research in which we explored the challenges related to the labeling of data for machine learning and what strategies can be employed to mitigate them. In this section we present the data we collected and how we analyzed it to identify the challenges.
A case study is a research method that investigates real-world phenomena through empirical investigation. The aim of such studies is to identify challenges and find mitigation strategies through action, reflection, theory, and practice [13–15]. A case study suits our purpose well because of its exploratory nature and because we are trying to learn more about certain processes at Companies A and B. The two main research questions are:
RQ1: What are the key challenges that practitioners face in the process of labeling data?
RQ2: What are the mitigation strategies that practitioners use to overcome these challenges?
3.1 Data Collection
Our case study was conducted in collaboration with two companies. Company A is a worldwide telecommunication provider and one of the leading providers in Information and Communication Technology (ICT). Company B is a company specialized in labeling. They have developed an annotation platform in order to provide the autonomous vehicle industry with labeled training data of top quality. Their clients include software companies and research institutes.
– Phase I: Exploration - The empirical data collected during this phase is based on an internship from November 18, 2019 to February 28, 2020, during which the first author spent time at Company A's office two to three days a week. The data was collected from the data scientists by observing how they were working with machine learning and how they deal with data where labels are missing, as well as by having access to data sets. We held discussions with the data scientist working with each particular dataset to collect information regarding the origin of the data, what they wish to use it for in the future, and how often it is updated. Using Python, we investigated how skewed the label distribution is and examined the data to potentially find any clustering structure in the labels (a sketch of these checks is given at the end of this subsection). The datasets studied in Phase I came from participants I and II.
– Phase II: Validation - After the challenges had been identified during Phase I, both internal and external confirmation interviews were conducted to validate whether the challenges found in the previous phase were general. Four participants in the interviews were from Company A and one participant was from Company B. Company A had several data scientists, but we included only those that had issues with labeling. Each participant was interviewed separately and the interviews lasted between 25 and 55 minutes. All but one interview were conducted in English; the remaining interview was conducted in Swedish and then translated to English by the first author. During the interviews we asked questions such as What is the purpose of your labels?, How do you get annotated data?, and How do you assess the quality of the data/labels?
Based on the meetings and interviews, we were able to evaluate and plan strategies to mitigate the challenges that we observed during our study.
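The label-distribution and clustering checks mentioned in Phase I can be sketched as follows. This is a minimal illustration: the file name, the column name label, and the choice of two clusters are hypothetical, and the sketch assumes the remaining columns are numeric features.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset with a "label" column and numeric feature columns.
df = pd.read_csv("dataset.csv")

# How skewed is the label distribution?
print(df["label"].value_counts(normalize=True))

# Is there any cluster structure, and does it line up with the labels?
features = StandardScaler().fit_transform(df.drop(columns=["label"]))
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(pd.crosstab(clusters, df["label"]))
```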
3.2 Data analysis
The interviews were analyzed by taking notes during the interviews and the internship. We then performed a thematic analysis [16]. A thematic analysis is defined as "a method for identifying, analyzing and reporting patterns" and was used to identify the different themes and patterns in the data we collected. From the analysis we were able to identify themes and define the industrial challenges based on the notes. For each interview we identified different themes, such as topics that came up during the interview. Several of these themes were present in more than one interview, so we combined the data from the interviews and, based on that, drew conclusions from the information on the same theme.

Table 1. List of the interview participants of Phase II

Company  Participant Nr  Title/role             Experience
A        I               Data Scientist         4 years
A        II              Senior Data Scientist  8 years
A        III             Data Scientist         3 years
A        IV              Senior Data Scientist  2 years
B        V               Senior Data Scientist  7 years
3.3 Threats to Validity
According to [14] there are four different concepts of validity to consider: construct validity, internal validity, external validity, and reliability. To achieve construct validity, we provided every participant from Company A with an e-mail containing definitions of all the concepts and some sample questions to be asked during the interview. We also gave a lecture on how to use machine learning to label data before the interviews so that the participants could reflect and prepare for the interview. We argue that we achieved internal validity through data triangulation, since we interviewed every person at Company A that had experience with labels. Therefore it is unlikely that we missed any necessary information when collecting data.
4 Results
In this section we present the results from our study. We begin by listing the key problems that we found in Phase I of the study. Next, we state the problems we found in Phase II. The interview we held with participant V was then used as inspiration for formulating mitigation strategies for the problems faced by the data scientists at Company A.
4.1 Phase I: Exploration
Here we list the problems that we found during Phase I of the case study.
1. Lack of a systematic approach to labeling data for specific features: It was clear that automated labeling processes were needed. The data scientists working at Company A had all kinds of needs for automated labeling. Currently, they have no clear idea of how to approach the problem.
2. Unclear responsibility for labeling: Data scientists do not have the time to label instances manually. Their stakeholders could label the data by hand, but they do not want to do it either. Thus the data scientists are expected to come up with a way to do the labeling.
3. Noisy labels: Participant I has a small subset of his data labeled. These labels come from experiments conducted in a lab. The label noise seems negligible, but that is not the case: there is a difference between the generated data and the true data, since the generated data has features that are continuous while the true data is discrete. Participant II works on a data set that contains tens of thousands of rows and columns. The column of interest contains two class labels, "Yes" and "No". The first problem with the labels is that they are noisy. A "Yes" can be due to two errors, I and II, and only a "Yes" based on error I is of interest; if the "Yes" is based on error II, it should be relabeled as a "No". Furthermore, the stakeholders do not know whether the "Yes" instances are due to error I or error II.
4. Difficulty to find a correlation between labels and features: Participant I works with a dataset whose label distribution contains five classes that describe grades from "best" to "worst", where 1 is "best" and 5 is "worst". Cluster analysis reveals that there is no particular cluster structure for some of the labels: labels of grade 5 seem to form one cluster, but grades 1-4 seem to be randomly scattered within another cluster. Analysis of the data from participant II reveals that there is no way of telling whether a "Yes" is based on error I or error II. This means that many of the "Yes" instances are labeled incorrectly.
5. Skewed label distributions: The label distribution of both datasets is highly skewed. The dataset from participant I has fewer instances with a high grade compared to low grades. For participant II, the number of instances labeled "No" is greater than the number labeled "Yes". When training a model on this data, it will overfit.
6. Time dependence: Due to the nature of participant II's data, it is possible that some of the "No" instances can become "Yes" in the future, and so the "No" labels are possibly incorrect too.
7. Difficulty to predict future uses for datasets: The purpose of the labels in both datasets was to predict labels for future instances provided by the stakeholder on an irregular basis. For participant I, the labels might be used for other purposes later, but there are no current plans to use the labels for other machine learning purposes.
4.2 Phase II: Validation
The problems that appeared during the interviews can be categorized as follows:
1. Label distribution related. Questions regarding the distribution.
2. Multiple-task related. Questions regarding the purpose of the labels.
3. Annotation related. Questions regarding the oracle and noisy labels.
4. Model and data reuse related. Questions regarding reuse of a trained model on new data.
Below we discuss each category in more detail.
1. Label Distribution: We found several issues related to the label distribution. Participant I's data has a label distribution that is unknown. The current labels are measured in percentages and need to be translated into at least two classes, but if more labels are needed, that can be done as well. Participant II has a label distribution that contains two classes, "Yes" and "No". Participant III's data has a label distribution that contains at least three labels. Participant IV has more than three thousand labels, so it is hard to get a clear picture of what the distribution is. Participants I-III all have skewed label distributions. If a dataset has a skewed label distribution, then the machine learning model will overfit. This means that if you have a binary classification problem with 80% of class A and 20% of class B, the model might just predict A the majority of the time, even when an actual case is labeled as B [17] (a small illustration is given after this list).
2. Multiple tasks: Participants I, II, and III say that, for now, the only purpose of their labels is to find labels for new data, but chances are that they will be reused for something else later on. Participant IV does not use the labels for machine learning purposes but for other practical reasons. The problem here is that if you do not plan ahead and only train a model with respect to one task, then if you need to use the labels for something else later, you will have to re-label the instances for each new task.
3. Annotation: Participant I has some labeled data that comes from laboratory experiments. However, these labels are only used to help label new instances that are to be labeled manually. Participant II gets its labels from the stakeholders, but since these are noisy, the instances need to be re-labeled. Participant III has labeled data coming from stakeholders, and these are expected to be 100% correct. Participant IV defines all labels on their own and does not consult the stakeholders at all. The problem here is that the data scientists are often tasked with doing the labeling on their own. Even if the data scientists get instances from the stakeholders, the labels are often of insufficient quantity and/or quality.
4. Data Reuse: Participant III has had problems with reusing a model. First, the data was labeled into two classes, "Yes" and "No". Later, the "Yes" category was divided into sub-categories "YesA" and "YesB". When running the model on this new data, it would predict the old "Yes" instances as "No" instances. Participant III has no idea why this happens.
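As referenced in point 1 above, the following minimal sketch (on synthetic data, using scikit-learn's DummyClassifier) shows why a skewed label distribution is problematic: a model that always predicts the majority class already reaches roughly 80% accuracy on an 80/20 split while never detecting the minority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic 80/20 binary labels: "A" is the majority class, "B" the minority.
rng = np.random.default_rng(0)
y = np.where(rng.random(1000) < 0.8, "A", "B")
X = np.zeros((1000, 1))  # features are irrelevant for this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)
print("accuracy:", accuracy_score(y, pred))                  # about 0.80
print("recall on B:", recall_score(y, pred, pos_label="B"))  # 0.0
```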
4.3 Summary from Company B
Participant V of Company B has prior experience with automatic labeling. Therefore, interview V was used to verify some actual labeling issues from industry. According to participant V, Company B has worked on and studied automatic labeling for at least seven years. Company B uses crowdsourcing to label data, involving about 1000 people. Participant V confirms that, thanks to active learning, the labeling task takes 200 times less time than if active learning were not used. The main problem Company B has with labeling is that it is hard to evaluate the quality of the labels and to assess the quality of the human annotators. A final remark from participant V is that they have experienced a correlation between automation and quality: the more automation included in the process, the less accurate the labels will be. Three of the authors of this paper performed a systematic literature review on automated labeling using machine learning [18]. Thanks to that paper, we can draw the conclusion that active learning and semi-supervised learning can be used to label instances.
4.4 Machine Learning methods for Data Labeling
Here we present and discuss active learning and semi-supervised learning methods in terms of how they can be used in practice with labeling problems.
Active Learning: Traditionally, instances to be labeled for use in machine learning would be chosen at random. However, choosing instances to be labeled randomly can lead to a model with low predictive accuracy, since non-informative instances may be selected for labeling. To mitigate the issue of choosing non-informative instances, active learning (AL) has been proposed. Active learning queries instances by informativeness and then labels them. The different methods used to pose queries are known as query strategies [19]. According to [18], the most commonly used query strategies are uncertainty sampling, error/variance reduction, query-by-committee (QBC), and query-by-disagreement (QBD). After instances are queried and labeled, they are added to the training set. A machine learning algorithm is then trained and evaluated. If the learner is not satisfied with the results, more instances will be queried and the model will be retrained and evaluated. This iterative procedure proceeds until the learner decides it is time to stop learning. Active learning has been shown to outperform passive learning if the query strategy is properly selected based on the learning algorithm [19]. Most importantly, active learning is a great way to make sure that time is not wasted on labeling non-informative instances, thus saving both time and money in crowdsourcing [2].
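The iterative procedure described above can be sketched as a pool-based active learning loop with least-confidence uncertainty sampling. This is a minimal illustration, assuming a scikit-learn-style probabilistic classifier, numpy arrays, a seed set containing at least two classes, and an oracle function supplied by the caller; the budget and the choice of logistic regression are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling_loop(X_seed, y_seed, X_pool, label_fn, budget=50):
    """Pool-based active learning with least-confidence uncertainty sampling.

    X_seed, y_seed: small initial labeled set; X_pool: unlabeled candidates;
    label_fn: the oracle (e.g. a human annotator) returning a label for one instance.
    """
    X_l, y_l = list(X_seed), list(y_seed)
    pool = list(range(len(X_pool)))
    model = LogisticRegression(max_iter=1000)
    for _ in range(min(budget, len(pool))):
        model.fit(np.array(X_l), np.array(y_l))
        # Query the pool instance whose highest predicted class probability
        # is lowest, i.e. where the model is most uncertain.
        probs = model.predict_proba(X_pool[pool])
        query = pool[int(np.argmin(probs.max(axis=1)))]
        X_l.append(X_pool[query])
        y_l.append(label_fn(X_pool[query]))  # ask the oracle for the label
        pool.remove(query)
    model.fit(np.array(X_l), np.array(y_l))
    return model
```

Other query strategies, such as query-by-committee, only change how the query index is chosen; the surrounding loop stays the same.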
Semi-supervised learning: Semi-supervised learning (SSL) is concerned with a set of algorithms that can be used in the scenario where most of the data is unlabeled but a small subset of it is labeled. Semi-supervised learning is mainly divided into semi-supervised classification and constrained clustering [20].
Semi-supervised classification is when a classifier is trained on training data that contains both labeled and unlabeled instances. Sometimes semi-supervised learning outperforms supervised classification [21].
Constrained clustering is an extension of unsupervised clustering. Constrained clustering requires unlabeled instances as well as some supervised information about the clusters. The objective of constrained clustering is to improve upon unsupervised clustering [22]. The most popular semi-supervised classification methods are mixture models using the EM algorithm, co-training/multi-view learning, graph-based SSL, and semi-supervised support vector machines (S3VM) [18].
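A minimal semi-supervised classification sketch using scikit-learn's SelfTrainingClassifier (available since scikit-learn 0.24), where unlabeled instances are marked with -1, is given below; the synthetic data, the 10% labeled fraction, the base estimator, and the confidence threshold are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic data where only about 10% of the instances keep their labels;
# the rest are marked as unlabeled with -1 (scikit-learn's convention).
X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) > 0.1] = -1

# The base classifier is retrained iteratively, adding its own confident
# predictions (probability >= threshold) to the training set as pseudo-labels.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
model.fit(X, y_partial)

added = int((model.transduction_ != -1).sum() - (y_partial != -1).sum())
print("pseudo-labels added by self-training:", added)
```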
Below we list eight practical considerations of Active Learning.
1. Data exploration to determine which algorithm is best: When starting a new project involving machine learning, it is hard to know which algorithm will yield the best result. Often there is no way of knowing beforehand what the best choice is. There are empirical studies on which one to choose, but the results are fairly mixed [23–25]. Since the selection of algorithm varies so much, it is essential to understand the problem beforehand. If it is interesting to reduce the error, then expected error or variance reduction are the best query strategies to choose from [19]. If the density of the sample is easy to use and there is strong evidence that supports a correlation between the cluster structure and the labels, then density-weighted methods should be used [19]. If using large probabilistic models, uncertainty sampling is the only viable option [19]. If there is no time to test different query strategies, it is best to use the simpler approaches based on uncertainty [19]. From our investigation it is clear that Company A is in need of labels in their projects. However, since they have never implemented an automatic labeling process before, it is important to do it right from the beginning. That is, the data scientists must carefully examine the distribution of the data set, check whether there are any cluster structures, and check whether there are any relationships between the clusters and the labels. If the data exploration is done in a detailed and correct manner, then finding the correct machine learning approach is easy and no time needs to be spent on testing different machine learning algorithms.
2. Alternative query types: A traditional active learner queries instances to be labeled by an oracle. However, there are other ways of querying; for example, human domain knowledge has been incorporated into machine learning algorithms. This means the learner builds models based on human advice, such as rules and constraints, as well as labeled and unlabeled data.
An example of domain knowledge with active learning is to use information about the features. This approach is referred to as tandem learning and incorporates feature feedback in traditional classification problems. Active dual supervision is an area of active learning where features are labeled: oracles label features that are judged to be good predictors of one or more classes. The big question is how to actively query these feature labels.
3. Multi-task active learning: From our interviews we can see that there are cases where labels are needed to predict labels for future instances. In other cases the labels are not even needed for machine learning. In one case the data scientist thinks that the labels will be used for other prediction tasks but is unsure. The most basic way in which active learning operates is that a machine learner is trying to solve a single task. From the interviews it is clear that the same data needs to be annotated in several ways for several future tasks. This means that the data scientists would have to spend even more time annotating, at least once for each task. It would be more economical to label a single instance for all sub-tasks simultaneously. This can be done with the help of multi-task active learning [26].
4. Data reuse and the unknown model class: The labeled training set collected after performing active learning always has a biased distribution. The bias is connected to the class of model used to select the queries. If it is necessary to switch to an improved learner, then it might be troublesome to reuse the training data with models of a different class. This is an important issue in practical uses of active learning. If you know the best model class and feature set beforehand, then active learning can safely be used. Otherwise, active learning may be outperformed by passive learning.
5. Unreliable oracles: It is important to have access to top quality labeled data. If the labels come from some experiment, there is almost always some noise present. In one of the data sets from Company A, a small subset of the data was labeled. The labels of that particular data set come from experiments conducted in a lab. The label noise seems negligible, but that is not the case: there is a difference between the generated data and the true data, since the generated data has features that are continuous while the true data is discrete. Another data set that we studied has labels that came from customer data. The labels were coded "Yes" and "No". However, the "Yes" labels were due to factors A and B. So the problem here is to find a model that can predict the labels, but we are only interested in the "Yes" instances that are due to factor A. The "Yes" instances that are due to factor B need to be relabeled to "No", yet the customer data does not indicate whether a "Yes" is due to factor A or B. The second problem was that some of the "No" instances could develop into a "Yes" over time. It was up to the data scientist to find a way to relabel the data correctly. The data scientist had a solution to the problem but realized that it was faulty and therefore asked us for help. We took a look at the data and the current solution. We saw two large clusters, but there was no noteworthy relationship between the different labels and the features; both clusters contained almost equally many "Yes" and "No" instances. Let us say that the first cluster contained about 60% "Yes" and 40% "No", and the second cluster 60% "No" and 40% "Yes". In the current solution, all of the instances in the first cluster were re-labeled as "Yes" and all instances in the second cluster were re-labeled as "No". We conclude that this is an approach that will yield noisy labels. The same goes if the labels come from a human annotator, because some of the instances might be difficult to label, and people can easily be distracted and tired over time, so the quality of the labels will vary over time. Thanks to crowdsourcing, one can let several people annotate the same data; that way it is easier to determine which label is the correct one and to produce "gold-standard quality training sets". This approach can also be used to evaluate learning algorithms on training sets that are not of gold-standard quality. The big questions are: How do we use noisy oracles in active learning? When should the learner query new unlabeled instances rather than update currently labeled instances in case we suspect an error? Studies where estimates of both oracle and model uncertainty were taken into account show that data can be improved by selectively repeated labeling. How do we evaluate the annotators? How might payment influence annotation quality? What should be done if some instances are noisy no matter which oracle is used and repeated labeling does not improve the situation?
6. Skewed label distributions: In two of the data sets we studied, the distributions of the labels are skewed; that is, there are more instances of one label than of another. In the "Yes" and "No" labeled example, there are far more "No" instances. When the label distribution is skewed, active learning might not give much better results than passive learning. This is because, if the labels are not balanced, active learning might query more of one label than another. Not only is the skewed distribution a problem, but the lack of labeled data is also a problem. This is the case in the data set where instances are labeled from an experiment: very few instances are labeled from the beginning, and new unlabeled data arrives every fifteen minutes. "Guided learning" has been proposed to mitigate this slowness problem. Guided learning allows the human annotator to search for class-representative instances in addition to just querying for labels. Empirical studies indicate that guided learning performs better than active learning as long as its annotation costs are less than eight times those of label queries. A minimal sketch of compensating for a skewed label distribution when training the model is given after this list.
7. Real labeling costs and cost reduction: From observing the data scientists at Company A, we estimate that they spend about 80% of their data science time on preprocessing the data. Therefore, we recognize that they do not have time to label too many instances, and it is crucial to reduce the time it takes to label things manually. If possible, manual labeling should be avoided.
Assume that the cost of labeling is uniform. The smaller the training set used, the lower the associated costs. However, in some applications the cost may vary, so simply reducing the number of labeled instances in the training data does not necessarily reduce the cost. This problem is studied within cost-sensitive active learning. To reduce effort in active learning, automatic pre-annotation can help. In automatic pre-annotation the current model's predictions are used to pre-label the queried instances [27, 28]. This can often reduce the labeling effort. If the model makes many classification mistakes, there will be extra work for the human annotator to correct them. To mitigate this, correction propagation can be used, in which the local edits are used to interactively update the predictions. In general, automatic pre-annotation and correction propagation do not deal with the labeling costs themselves; however, they do try to reduce the costs indirectly by minimizing the number of labeling actions performed by the human oracle.
Other cost-sensitive active learning methods take varying labeling costs into account. Both current labeling costs and expected future classification error costs can be incorporated [29]. The costs might not even be deterministic but stochastic.
In many applications the costs are not known beforehand; however, they might be describable as a function of annotation time [30]. To find such a function, a regression cost model can be trained to predict the annotation costs. Studies involving real human annotation costs show the following:
– Annotation costs are not constant across instances [31–34].
– Active learners that ignore costs might not perform better than passive learners [19].
– The annotation costs may vary depending on the person doing the annotation [31, 35].
– The annotation costs can include stochastic components; jitter and pause are two types of noise that affect the annotation speed.
– Annotation costs can be predicted after seeing only a few labeled instances [33, 34].
8. Stopping criteria: This is related to cost reduction. Since active learning is an iterative process, it is relevant to know when to stop learning. Based on our empirical findings, the data scientists have no interest in doing any manual labeling, and if they have to, they want to do as little as possible. So, when the cost of gathering more training data is higher than the cost of the errors made by the current system, it is time to stop extending the training set and hence stop training the machine learning algorithm. From our experience at Company A, the data scientists have so little time left over from other data preprocessing that time is the most common stopping factor.
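As referenced in consideration 6, one minimal way to compensate for a skewed label distribution when training the underlying classifier is class weighting, sketched below on synthetic data with scikit-learn; resampling the training set or guided learning would be alternatives, and the roughly 90/10 split and logistic regression are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 90% of one class and 10% of the other.
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" re-weights the loss inversely to class frequency,
# so the minority class is not simply ignored by the model.
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```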
4.5 Challenges and Mitigation Strategies
Many of the problems identified during Phase I and Phase II overlap to a certain degree. Therefore, we summarized all the problems into three challenges (C1-C3) that were later mapped to three mitigation strategies (MS1-MS3). These mitigation strategies are derived from the practical considerations above. Finally, we map MS1 to C1, MS2 to C2, and MS3 to C3.
C1: Pre-processing: This challenge represents all that needs to be done during the planning stage of the labeling procedure. This includes creating a systematic approach for labeling (problem 1 of Phase I), doing an exploratory data analysis to find correlations between labels and features (problem 4 of Phase I), as well as choosing a model that can be reused on new data (problem 6 of Phase I) and labeling instances with respect to multiple tasks (problem 7 of Phase I, problem 4 of Phase II).
MS1: Planning: This strategy contains all the solution frameworks from practical considerations 1, 2, 3, 4, 7, and 8. This is because they all involve the steps necessary to plan an active learning strategy for labeling.
C2: Annotation: This challenge represents the problems concerning choosing an annotator as well as evaluating and reducing the label noise (problems 2 and 3 of Phase I and problem 3 of Phase II).
MS2: Oracle selection: This strategy contains only the solution frameworks from practical consideration 5. It describes how we can choose oracles to produce top quality labels.
C3: Label Distribution: This challenge represents all the problems concerning the symmetry of the label distributions, such as learning with a skewed label distribution (problem 5 of Phase I and problem 1 of Phase II).
MS3: Label distribution: This strategy contains the solution frameworks from practical consideration 6. It describes how we can do labeling when the label distribution is skewed.
5 Discussion
From our verification interview with Company B, we learned that active learning is a popular tool for acquiring labels. Thanks to active learning, the labeling task takes 200 times less time than if active learning were not used.
In the background we presented some current practices that can help with labeling, the most popular being crowdsourcing. However, crowdsourcing has its own set of problems. The primary concern is that bad annotators will produce noisy labels due to inexperience or the human factor. Second, the benefit of allowing a third-party company to label data is that you do not have to spend time training your employees to do the job, nor do you need to develop your own annotation tools and infrastructure. The big downside is that you have to share confidential company data with the crowdsourcing platform. Repeated labeling can be used to improve the quality of the labels, but there is no guarantee that the quality will actually improve. Rather than correcting noisy labels, there are ways in which the machine learning models can be changed so that they can handle noisy labels. The downside to this is that you need to know beforehand which instances are noisy, which can be difficult in an industrial setting.
None of the techniques discussed in the background utilize automated labeling using machine learning. Through our efforts, we managed to formulate three labeling challenges and provide mitigation strategies based on active machine learning. These challenges are related to questions such as: How can a labeling process be structured? Who labels the instances, and how? Can a correlation between labels and features be found, so that labels can be determined from the features? Both manual and automatic labeling involve some noise in the labels, so how should these noisy labels be used? What do we do if the distribution of the labels is skewed? How do we take into account the fact that some of the labels might change over time, due to the nature of the data? How do we label instances so that the labels can be useful for several future tasks?
Three mitigation strategies that could possibly solve the three challenges
were presented.
6 Conclusion
The goal of this study is to provide a detailed overview of the challenges that the industry faces with labeling, as well as to outline mitigation strategies for these challenges.
To the best of our knowledge, 95% of all the machine learning algorithms deployed in industry are supervised. Therefore, it is important that every dataset is complete with labeled instances. Otherwise, the data would be insufficient and supervised learning would not be possible.
It proves challenging to find and structure a labeling process. You need to define a systematic approach for labeling and examine the data to choose the optimal model. Finally, you need to choose an oracle to produce top-quality labels as well as plan how to handle skewed label distributions.
The contribution of this paper is twofold. First, based on a case study involving two companies, we identified problems that companies experience in relation to labeling data. We validated these problems using interviews at both companies and summarized all problems into challenges. Second, we presented an overview of the mitigation strategies that companies employ (or could employ) to address the challenges.
In our future work, we aim to further verify the challenges as well as the mitigation strategies with more companies. In addition, we intend to develop solutions to simplify the use of automated labeling in industrial contexts.
Acknowledgment
This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP), funded by the Knut and Alice Wallenberg Foundation.
References
1. Cognilytica Research, “Data Preparation & Labeling for AI 2020,” tech. rep., Cog-
nilytica Research, 2020.
2. Y. Roh, G. Heo, and S. E. Whang, “A survey on data collection for machine
learning: a big data-ai integration perspective,” IEEE Transactions on Knowledge
and Data Engineering, 2019.
3. AzatiSoftware, Automated Data Labeling with Machine Learning, 2019.
https://azati.ai/automated-data-labeling-with-machine-learning.
4. J. C. Chang, S. Amershi, and E. Kamar, “Revolt: Collaborative crowdsourcing for
labeling machine learning datasets,” in Proceedings of the 2017 CHI Conference
on Human Factors in Computing Systems, pp. 2334–2346, 2017.
5. J. Zhang, V. S. Sheng, T. Li, and X. Wu, “Improving crowdsourced label qual-
ity using noise correction,” IEEE transactions on neural networks and learning
systems, vol. 29, no. 5, pp. 1675–1688, 2017.
6. H. Cloud Factory, Crowd vs. Managed Team: A Study on Quality Data Processing
at Scale, 2020. https://go.cloudfactory.com/hubfs/02-Contents/3-Reports/Crowd-
vs-Managed-Team-Hivemind-Study.pdf.
7. J. Zhang, X. Wu, and V. S. Sheng, “Learning from crowdsourced labeled data: a
survey,” Artificial Intelligence Review, vol. 46, no. 4, pp. 543–576, 2016.
8. hackernoon.com, Crowdsourcing Data Labeling for Machine Learning Projects,
2020. https://hackernoon.com/crowdsourcing-data-labeling-for-machine-learning-
projects-a-how-to-guide-cp6h32nd.
9. P. G. Ipeirotis, F. Provost, V. S. Sheng, and J. Wang, “Repeated labeling using
multiple noisy labelers,” Data Mining and Knowledge Discovery, vol. 28, no. 2,
pp. 402–441, 2014.
10. A. Sheshadri and M. Lease, “Square: A benchmark for research on computing
crowd consensus,” in First AAAI conference on human computation and crowd-
sourcing, 2013.
11. S. Sukhbaatar and R. Fergus, “Learning from noisy labels with deep neural net-
works,” arXiv preprint arXiv:1406.2080, vol. 2, no. 3, p. 4, 2014.
12. C. G. Northcutt, L. Jiang, and I. L. Chuang, “Confident learning: Estimating
uncertainty in dataset labels,” arXiv preprint arXiv:1911.00068, 2019.
13. P. Reason and H. Bradbury, Handbook of action research: Participative inquiry and
practice. Sage, 2001.
14. P. Runeson and M. Höst, "Guidelines for conducting and reporting case study
research in software engineering," Empirical Software Engineering, vol. 14, no. 2,
p. 131, 2009.
15. M. Staron, Action Research in Software Engineering: Theory and Applications.
Springer Nature, 2019.
16. V. Braun and V. Clarke, “Using thematic analysis in psychology,” Qualitative
research in psychology, vol. 3, no. 2, pp. 77–101, 2006.
17. Towards Data Science, What To Do When Your Classification Data is Imbalanced?, 2019.
https://towardsdatascience.com/what-to-do-when-your-classification-dataset-is-
imbalanced-6af031b12a36.
18. T. Fredriksson, J. Bosch, and H. Holmström-Olsson, "Machine learning models for
automatic labeling: A systematic literature review," 2020.
19. B. Settles, Active Learning. Synthesis Lectures on Artificial Intelligence and
Machine Learning, Morgan & Claypool, 2012.
20. X. J. Zhu, “Semi-supervised learning literature survey,” tech. rep., University of
Wisconsin-Madison Department of Computer Sciences, 2005.
21. N. N. Pise and P. Kulkarni, "A survey of semi-supervised learning methods," in
2008 International Conference on Computational Intelligence and Security, vol. 2,
pp. 30–34, IEEE, 2008.
22. E. Bair, “Semi-supervised clustering methods,” Wiley Interdisciplinary Reviews:
Computational Statistics, vol. 5, no. 5, pp. 349–361, 2013.
23. C. Körner and S. Wrobel, "Multi-class ensemble-based active learning," in Euro-
pean conference on machine learning, pp. 687–694, Springer, 2006.
24. A. I. Schein and L. H. Ungar, “Active learning for logistic regression: an evalua-
tion,” Machine Learning, vol. 68, no. 3, pp. 235–265, 2007.
25. B. Settles and M. Craven, “An analysis of active learning strategies for sequence
labeling tasks,” in Proceedings of the 2008 Conference on Empirical Methods in
Natural Language Processing, pp. 1070–1079, 2008.
26. A. Harpale, Multi-task active learning. PhD thesis, Carnegie Mellon University,
2012.
27. J. Baldridge and M. Osborne, “Active learning and the total cost of annotation,”
in Proceedings of the 2004 Conference on Empirical Methods in Natural Language
Processing, pp. 9–16, 2004.
28. A. Culotta and A. McCallum, “Reducing labeling effort for structured prediction
tasks,” in AAAI, vol. 5, pp. 746–751, 2005.
29. A. Kapoor, E. Horvitz, and S. Basu, “Selective supervision: Guiding supervised
learning with decision-theoretic active learning.,” in IJCAI, vol. 7, pp. 877–882,
2007.
30. B. Settles, M. Craven, and L. Friedland, "Active learning with real annotation
costs," in Proceedings of the NIPS Workshop on Cost-Sensitive Learning, pp. 1–10,
Vancouver, CA, 2008.
31. S. Arora, E. Nyberg, and C. Rose, “Estimating annotation cost for active learn-
ing in a multi-annotator environment,” in Proceedings of the NAACL HLT 2009
Workshop on Active Learning for Natural Language Processing, pp. 18–26, 2009.
32. E. K. Ringger, M. Carmen, R. Haertel, K. D. Seppi, D. Lonsdale, P. McClana-
han, J. L. Carroll, and N. Ellison, “Assessing the costs of machine-assisted corpus
annotation through a user study.,” in LREC, vol. 8, pp. 3318–3324, 2008.
33. S. Vijayanarasimhan and K. Grauman, “What’s it going to cost you?: Predict-
ing effort vs. informativeness for multi-label image annotations,” in 2009 IEEE
Conference on Computer Vision and Pattern Recognition, pp. 2262–2269, IEEE,
2009.
34. B. C. Wallace, K. Small, C. E. Brodley, J. Lau, and T. A. Trikalinos, “Modeling
annotation time to reduce workload in comparative effectiveness reviews,” in Pro-
ceedings of the 1st ACM International Health Informatics Symposium, pp. 28–35,
2010.
35. R. A. Haertel, K. D. Seppi, E. K. Ringger, and J. L. Carroll, “Return on invest-
ment for active learning,” in Proceedings of the NIPS Workshop on Cost-Sensitive
Learning, vol. 72, 2008.
... The general ML workflow (e.g., Chapman et al., 1999;Kelleher & Brendan, 2018;Schröer et al., 2021) begins with the creation of a training dataset from which a machine can learn something (Figure 1). Most applications today are based on supervised learning procedures through which a machine learns from labeled data, e.g., text describing an image, such as a photo or drawing of a dog or cat (Fredriksson et al., 2020). Then the training dataset is processed by an algorithm that "trains" the machine to recognize corresponding patterns. ...
... This is important for two reasons. First, collecting and annotating data are crucial but time-consuming activities that take most of the time spent during ML development (Fredriksson et al., 2020). Second, if this element is neglected or poorly done, the resulting ML models will perform poorly and generate inaccurate, irrelevant, or even harmful results (Sambasivan et al., 2021). ...
Article
Full-text available
With recent advances in artificial intelligence (AI), machine learning (ML) has been identified as particularly useful for organizations seeking to create value from data. However, as ML is commonly associated with technical professions, such as computer science and engineering, incorporating training in the use of ML into non-technical educational programs, such as social sciences courses, is challenging. Here, we present an approach to address this challenge by using no-code AI in a course for university students with diverse educational backgrounds. This approach was tested in an empirical, case-based educational setting, in which students engaged in data collection and trained ML models using a no-code AI platform. In addition, a framework consisting of five principles of instruction (problem-centered learning, activation, demonstration, application, and integration) was applied. This paper contributes to the literature on IS education by providing information for instructors on how to incorporate no-code AI in their courses and insights into the benefits and challenges of using no-code AI tools to support the ML workflow in educational settings.
... Handling unlabeled data. While pretraining DL models using large-scale public dataset have demonstrated the potential in generalizing manufacturing settings via model refinement through the use of only a small amount of labeled manufacturing data to alleviate the challenge of scarcity of labeled data (Fredriksson, Mattos, Bosch, & Olsson, 2020), the method can be challenging as pretrained DL models are not available for many manufacturing applications. Recent advances of unsupervised and semisupervised learning (Okaro et al., 2019;Zeiser, Ö zcan, van Stein, & Bäck, 2023) have shown solutions to tackle the challenge. ...
Chapter
This chapter is arranged as follows: Section 10.2 will delve deeper into the latest advances in data denoising, data annotation, and data balancing facilitated by DL techniques. Section 10.3 will present the manufacturing applications of these methodologies, followed by a discussion on the remaining challenges and opportunities in Section 10.4, and conclusions in Section 10.5.
... However, adapting an LLM for text retrieval requires labeled datasets, with numerous example queries and documents both related and unrelated to forming a helpful response. This poses a significant challenge to improving these systems, as data annotation or synthetic generation can be too expensive, difficult, and errorprone (Fredriksson et al., 2020;Desmond et al., 2021). Making matters worse, fine-tuning an existing LLM on new domains can cause forgetting, a decreased performance on previously capable tasks (McCloskey and Cohen, 1989;Kotha et al., 2024). ...
Preprint
Full-text available
Large language models (LLMs) fine-tuned for text-retrieval have demonstrated state-of-the-art results across several information retrieval (IR) benchmarks. However, supervised training for improving these models requires numerous labeled examples, which are generally unavailable or expensive to acquire. In this work, we explore the effectiveness of extending reverse engineered adaptation to the context of information retrieval (RE-AdaptIR). We use RE-AdaptIR to improve LLM-based IR models using only unlabeled data. We demonstrate improved performance both in training domains as well as zero-shot in domains where the models have seen no queries. We analyze performance changes in various fine-tuning scenarios and offer findings of immediate use to practitioners.
... The initial prompt version primarily focused on enumerating the types of data to be identified. However, given the inherent complexity of data categorization -a challenge even for human annotators as substantiated in related literature [41]-the prompt was augmented to include the internal definitions used for manual annotations in the MAPP dataset. While this expansion resulted in a lengthier prompt, it significantly enhanced all metrics except for recall, with an increase of +4.58% in accuracy, a decrease of -0.79% in recall, +5.96% in precision, and +2.72% in F1 score. ...
Preprint
Full-text available
The number and dynamic nature of web and mobile applications presents significant challenges for assessing their compliance with data protection laws. In this context, symbolic and statistical Natural Language Processing (NLP) techniques have been employed for the automated analysis of these systems' privacy policies. However, these techniques typically require labor-intensive and potentially error-prone manually annotated datasets for training and validation. This research proposes the application of Large Language Models (LLMs) as an alternative for effectively and efficiently extracting privacy practices from privacy policies at scale. Particularly, we leverage well-known LLMs such as ChatGPT and Llama 2, and offer guidance on the optimal design of prompts, parameters, and models, incorporating advanced strategies such as few-shot learning. We further illustrate its capability to detect detailed and varied privacy practices accurately. Using several renowned datasets in the domain as a benchmark, our evaluation validates its exceptional performance, achieving an F1 score exceeding 93%. Besides, it does so with reduced costs, faster processing times, and fewer technical knowledge requirements. Consequently, we advocate for LLM-based solutions as a sound alternative to traditional NLP techniques for the automated analysis of privacy policies at scale.
... While an instruction-tuned model is generally more capable for popular tasks, the majority of data available for additional fine-tuning is unlabeled, lacking the annotations expected from instruct models. This poses a significant problem as annotation by the downstream organization can be too difficult, expensive, or error-prone (Fredriksson et al., 2020;Desmond et al., 2021). Additional fine-tuning can also degrade the performance of the instruction-tuned model outside of the new fine-tuning distribution (Kotha et al., 2024). ...
Preprint
Full-text available
We introduce RE-Adapt, an approach to fine-tuning large language models on new domains without degrading any pre-existing instruction-tuning. We reverse engineer an adapter which isolates what an instruction-tuned model has learned beyond its corresponding pretrained base model. Importantly, this requires no additional data or training. We can then fine-tune the base model on a new domain and readapt it to instruction following with the reverse engineered adapter. RE-Adapt and our low-rank variant LoRE-Adapt both outperform other methods of fine-tuning, across multiple popular LLMs and datasets, even when the models are used in conjunction with retrieval-augmented generation.
... First, a huge amount of labeled data is required to train the state of the art deep learning and other supervised machine learning methods. Such labeled data are often hard to obtain, since they require huge manual effort coming from domain experts and crowdworkers (Fredriksson et al. 2020;Sheng and Zhang 2019;Weld, Dai et al. 2011). Different applications/experiments may have unique characteristics, which makes it very difficult to use labeled data from one particular application for other applications (Zhuang et al. 2020). ...
Article
The availability of tera-byte scale experiment data calls for AI driven approaches which automatically discover scientific models from data. Nonetheless, significant challenges present in AI-driven scientific discovery: (i) The annotation of large scale datasets requires fundamental re-thinking in developing scalable crowdsourcing tools. (ii) The learning of scientific models from data calls for innovations beyond black-box neural nets. (iii) Novel visualization & diagnosis tools are needed for the collaboration of experimental and theoretical physicists, and computer scientists. We present Phase-Field-Lab platform for end-to-end phase field model discovery, which automatically discovers phase field physics models from experiment data, integrating experimentation, crowdsourcing, simulation and learning. Phase-Field-Lab combines (i) a streamlined annotation tool which reduces the annotation time (by ~50-75%), while increasing annotation accuracy compared to baseline; (ii) an end-to-end neural model which automatically learns phase field models from data by embedding phase field simulation and existing domain knowledge into learning; and (iii) novel interfaces and visualizations to integrate our platform into the scientific discovery cycle of domain scientists. Our platform is deployed in the analysis of nano-structure evolution in materials under extreme conditions (high temperature and irradiation). Our approach reveals new properties of nano-void defects, which otherwise cannot be detected via manual analysis.
... In this context, there is a growing need for dependable labeled data to feed the models, which is commonly supplied by human effort through the manual annotation and categorization of texts (Fredriksson et al. 2020). Even though this human participation improves model performance, in many projects the usual process of reading, searching, identifying, circumscribing, and reviewing can be costly in terms of time, money, and effort. Our paper aims to explore alternative strategies to this traditional NER approach, introducing solutions using weak labeling, machine learning, and other computational methods to reduce cost and effort. ...
Article
Named entity recognition (NER) is a highly relevant task for text information retrieval in natural language processing (NLP) problems. Most recent state-of-the-art NER methods require humans to annotate and provide useful data for model training. However, using human power to identify, circumscribe and label entities manually can be very expensive in terms of time, money, and effort. This paper investigates the use of prompt-based language models (OpenAI's GPT-3) and weak supervision in the legal domain. We apply both strategies as alternative approaches to the traditional human-based annotation method, relying on computer power instead of human effort for labeling, and subsequently compare model performance between computer- and human-generated data. We also introduce combinations of all three mentioned methods (prompt-based, weak supervision, and human annotation), aiming to find ways to maintain high model efficiency and low annotation costs. We show that, although human labeling still yields better overall performance, the alternative strategies and their combinations are valid options, displaying positive results and similar model scores at lower cost. Final results demonstrate preservation of the human-trained model scores, averaging 74.0% for GPT-3, 95.6% for weak supervision, 90.7% for the GPT + weak supervision combination, and 83.9% for the GPT + 30% human-labeling combination.
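To give a minimal flavour of the weak-supervision side of such a pipeline, the sketch below lets simple heuristic labeling functions vote on each token, with the majority vote becoming a noisy training label. The patterns, gazetteers, and tag set are illustrative assumptions rather than the rules used in the cited study.

# Toy weak-labeling sketch: heuristic labeling functions vote per token.
import re
from collections import Counter

def lf_statute(token):      # crude pattern for statute references
    return "LAW" if re.fullmatch(r"(Art\.|Article|§)\d*", token) else None

def lf_court(token):        # crude gazetteer lookup
    return "ORG" if token in {"Court", "Tribunal", "Supreme"} else None

def lf_person_title(token):
    return "PER" if token in {"Judge", "Justice", "Mr.", "Ms."} else None

LABELING_FUNCTIONS = [lf_statute, lf_court, lf_person_title]

def weak_label(tokens):
    labels = []
    for tok in tokens:
        votes = [lf(tok) for lf in LABELING_FUNCTIONS if lf(tok) is not None]
        labels.append(Counter(votes).most_common(1)[0][0] if votes else "O")
    return labels

print(weak_label("Judge Smith of the Supreme Court cited Article 7".split()))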
Article
The evolution of manufacturing systems toward Industry 4.0 and 5.0 paradigms has pushed the diffusion of Machine Learning (ML) in this field. As the number of articles using ML to support manufacturing functions is expanding tremendously, the main objective of this review article is to provide a comprehensive and updated overview of these applications. 114 journal articles have been collected, analysed, and classified in terms of supervision approaches, function, ML algorithm, data inputs and outputs, and application domain. The findings show the fragmentation of the field and that most of the ML-based systems address limited objectives. Some inputs and outputs of the analysed support tools are shared across the reviewed contributions, and their possible combinations have been outlined. The advantages, limitations, and research opportunities of ML support in manufacturing are discussed. The paper outlines that the excessive specialization of the reviewed applications could be overcome by increasing the diffusion of transfer learning in the manufacturing domain.
Article
With the rapid growth of crowdsourcing systems, quite a few applications based on a supervised learning paradigm can easily obtain massive labeled data at relatively low cost. However, due to the varying and uncertain reliability of crowdsourced labelers, learning procedures face great challenges. Thus, improving the quality of labels and learning models plays a key role in learning from crowdsourced labeled data. In this survey, we first introduce the basic concepts of label quality and learning model quality. Then, by reviewing recently proposed models and algorithms for ground truth inference and learning models, we analyze connections and distinctions among these techniques and clarify the level of progress of related research. To facilitate studies in this field, we also introduce openly accessible real-world data sets collected from crowdsourcing systems as well as open-source libraries and tools. Finally, some potential issues for future studies are discussed.
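The simplest baseline in this line of work, majority voting combined with a crude per-labeler accuracy estimate, can be sketched as follows. Established inference methods (for example, EM-based ones) iterate between these two steps and weight labelers accordingly, which this toy version deliberately omits; the data are invented.

# Majority-vote ground-truth inference with a naive labeler-quality estimate.
from collections import Counter, defaultdict

# annotations[item] = list of (labeler_id, label)
annotations = {
    "item1": [("a", "spam"), ("b", "spam"), ("c", "ham")],
    "item2": [("a", "ham"),  ("b", "ham"),  ("c", "ham")],
}

consensus = {item: Counter(l for _, l in votes).most_common(1)[0][0]
             for item, votes in annotations.items()}

agree, total = defaultdict(int), defaultdict(int)
for item, votes in annotations.items():
    for labeler, label in votes:
        total[labeler] += 1
        agree[labeler] += int(label == consensus[item])

quality = {labeler: agree[labeler] / total[labeler] for labeler in total}
print(consensus, quality)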
Book
This book addresses action research (AR), one of the main research methodologies used for academia-industry research collaborations. It elaborates on how to find the right research activities and how to distinguish them from non-significant ones. Further, it details how to glean lessons from the research results, no matter whether they are positive or negative. Lastly, it shows how companies can evolve and build talent while expanding their product portfolio. The book's structure is based on that of AR projects; it sequentially covers and discusses each phase of the project. Each chapter shares new insights into AR and provides the reader with a better understanding of how to apply it. In addition, each chapter includes a number of practical use cases or examples. Taken together, the chapters cover the entire software lifecycle: from problem diagnosis to project (or action) planning and execution, to documenting and disseminating results, including validity assessments for AR studies. The goal of this book is to help everyone interested in industry-academia collaborations to conduct joint research. It is for students of software engineering who need to learn how to set up an evaluation, how to run a project, and how to document the results. It is for all academics who aren't afraid to step out of their comfort zone and enter industry. It is for industrial researchers who know that they want to do more than just develop software blindly. And finally, it is for stakeholders who want to learn how to manage industrial research projects and how to set up guidelines for their own role and expectations.
Article
Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are largely two reasons data collection has recently become a critical issue. First, as machine learning is becoming more widely-used, we are seeing new applications that do not necessarily have enough labeled data. Second, unlike traditional machine learning, deep learning techniques automatically generate features, which saves feature engineering costs, but in return may require larger amounts of labeled data. Interestingly, recent research in data collection comes not only from the machine learning, natural language, and computer vision communities, but also from the data management community due to the importance of handling large amounts of data. In this survey, we perform a comprehensive study of data collection from a data management point of view. Data collection largely consists of data acquisition, data labeling, and improvement of existing data or models. We provide a research landscape of these operations, provide guidelines on which technique to use when, and identify interesting research challenges. The integration of machine learning and data management for data collection is part of a larger trend of Big data and Artificial Intelligence (AI) integration and opens many opportunities for new research.
Conference Paper
Crowdsourcing provides a scalable and efficient way to construct labeled datasets for training machine learning systems. However, creating comprehensive label guidelines for crowdworkers is often prohibitive even for seemingly simple concepts. Incomplete or ambiguous label guidelines can then result in differing interpretations of concepts and inconsistent labels. Existing approaches for improving label quality, such as worker screening or detection of poor work, are ineffective for this problem and can lead to rejection of honest work and a missed opportunity to capture rich interpretations about data. We introduce Revolt, a collaborative approach that brings ideas from expert annotation workflows to crowd-based labeling. Revolt eliminates the burden of creating detailed label guidelines by harnessing crowd disagreements to identify ambiguous concepts and create rich structures (groups of semantically related items) for post-hoc label decisions. Experiments comparing Revolt to traditional crowdsourced labeling show that Revolt produces high-quality labels without requiring label guidelines, in exchange for an increase in monetary cost. This up-front cost, however, is mitigated by Revolt's ability to produce reusable structures that can accommodate a variety of label boundaries without requiring new data to be collected. Further comparisons of Revolt's collaborative and non-collaborative variants show that collaboration reaches higher label accuracy with lower monetary cost.
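One simple way to operationalise the idea of harnessing crowd disagreement, sketched below, is to compute the entropy of each item's label distribution and route high-entropy items to a post-hoc decision step. The threshold and data are illustrative, and this is not Revolt's actual workflow.

# Flag ambiguous items by the entropy of their crowd-label distribution.
import math
from collections import Counter

def label_entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

crowd_labels = {
    "post_17": ["toxic", "toxic", "toxic"],
    "post_42": ["toxic", "ok", "ok", "toxic"],   # disagreement -> ambiguous
}

ambiguous = [item for item, labels in crowd_labels.items()
             if label_entropy(labels) > 0.9]
print(ambiguous)   # ['post_42']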
Article
Crowdsourcing systems provide a cost-effective and convenient way to collect labels, but they often fail to guarantee label quality. This paper proposes a novel framework that introduces noise-correction techniques to further improve the quality of integrated labels that are inferred from the multiple noisy labels of objects. In the proposed general framework, information about the quality of labelers estimated by a front-end ground truth inference algorithm is utilized to supervise subsequent label-noise filtering and correction. The framework uses a novel algorithm termed adaptive voting noise correction (AVNC) to precisely identify and correct potential noisy labels. After filtering out the instances with noisy labels, the remaining cleansed data set is used to create multiple weak classifiers, from which a powerful ensemble classifier is induced to correct these noisy labels. Experimental results on eight simulated data sets with different kinds of features and two real-world crowdsourcing data sets in different domains consistently show that: 1) the proposed framework can improve label quality regardless of the inference algorithm, especially when each instance has only a few repeated labels; and 2) since the proposed AVNC algorithm considers both the number and the probability of potential label noises, it outperforms state-of-the-art noise-correction algorithms.
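A much-simplified sketch of the general pattern, quality-weighted label integration with flagging of low-margin instances as candidate noise, is shown below. The labeler qualities and margin threshold are invented for illustration, and the code is not the AVNC algorithm itself; in a fuller pipeline the flagged instances would be corrected by classifiers trained on the cleaner remainder.

# Quality-weighted voting with a margin-based "suspicious label" flag.
from collections import defaultdict

labeler_quality = {"a": 0.9, "b": 0.6, "c": 0.55}   # assumed front-end estimates

def integrate(votes, margin=0.2):
    # votes: list of (labeler_id, label)
    scores = defaultdict(float)
    for labeler, label in votes:
        scores[label] += labeler_quality[labeler]
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best = ranked[0][1]
    second = ranked[1][1] if len(ranked) > 1 else 0.0
    suspicious = (best - second) / sum(scores.values()) < margin
    return ranked[0][0], suspicious   # integrated label, noise-candidate flag

print(integrate([("a", "cat"), ("b", "dog"), ("c", "dog")]))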
Article
While many statistical consensus methods now exist, the relative lack of comparative benchmarking and integration of techniques has made it increasingly difficult to determine the current state of the art, to evaluate the relative benefit of new methods, to understand where specific problems merit greater attention, and to measure field progress over time. To make such comparative evaluation easier for everyone, we present SQUARE, an open-source shared-task framework including benchmark datasets, defined tasks, standard metrics, and reference implementations with empirical results for several popular methods. In addition to measuring performance on a variety of public, real crowd datasets, the benchmark also varies supervision and noise by manipulating training size and labeling error. We envision SQUARE as dynamic and continually evolving, with new datasets and reference implementations being added according to community needs and interest. We invite community contributions and participation.
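A toy version of the kind of controlled experiment such a benchmark supports, varying the labeling error rate and measuring majority-vote accuracy, might look as follows. Everything here is synthetic and unrelated to the SQUARE datasets themselves.

# Simulate labelers with a given error rate and score majority voting.
import random
from collections import Counter

def simulate(n_items=1000, n_labelers=5, error_rate=0.3, n_classes=2, seed=0):
    rnd = random.Random(seed)
    correct = 0
    for _ in range(n_items):
        truth = rnd.randrange(n_classes)
        votes = [truth if rnd.random() > error_rate
                 else rnd.randrange(n_classes) for _ in range(n_labelers)]
        if Counter(votes).most_common(1)[0][0] == truth:
            correct += 1
    return correct / n_items

for err in (0.1, 0.3, 0.5):
    print(err, simulate(error_rate=err))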
Article
Cluster analysis methods seek to partition a data set into homogeneous subgroups. Cluster analysis is useful in a wide variety of applications, including document processing and modern genetics. Conventional clustering methods are unsupervised, meaning that there is no outcome variable nor is anything known about the relationship between the observations in the data set. In many situations, however, information about the clusters is available in addition to the values of the features. For example, the cluster labels of some observations may be known, or certain observations may be known to belong to the same cluster. In other cases, one may wish to identify clusters that are associated with a particular outcome variable. This review describes several clustering algorithms (known as 'semi-supervised clustering' methods) that can be applied in these situations. The majority of these methods are modifications of the popular k-means clustering method, and several of them will be described in detail. A brief description of some other semi-supervised clustering algorithms is also provided.
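As one concrete example of the k-means modifications such a review describes, a "seeded" variant initializes centroids from the few labeled observations before clustering the unlabeled data. The sketch below uses scikit-learn and synthetic data and is only one simple variant among those surveyed.

# Seeded k-means: labeled points define the initial centroids.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(200, 2))
X_unlabeled[:, 0] += rng.choice([-3.0, 3.0], size=200)   # two latent clusters

X_seed = np.array([[-3.0, 0.0], [3.0, 0.0]])   # one labeled point per cluster
y_seed = np.array([0, 1])

# centroids initialized from the labeled seeds
init = np.vstack([X_seed[y_seed == k].mean(axis=0) for k in (0, 1)])
km = KMeans(n_clusters=2, init=init, n_init=1, random_state=0).fit(X_unlabeled)
print(km.cluster_centers_)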