Defending active learning against adversarial inputs in automated document classification

December 2016

DOI:10.1109/GlobalSIP.2016.7905843

Conference: 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP)

Authors:

Yalin E. Sagduyu

Virginia Tech (Virginia Polytechnic Institute and State University)

DEFENDING ACTIVE LEARNING AGAINST ADVERSARIAL INPUTS IN AUTOMATED DOCUMENT CLASSIFICATION

Lei Pi†, Zhuo Lu‡, Yalin Sagduyu∗, Su Chen†

†University of Memphis, TN 38152. Emails: {lpi,schen4}@memphis.edu

‡University of South Florida, Tampa FL 33620. Email: zhuolu@usf.edu

∗Intelligent Automation Inc., Rockville, MD 20855, Email: ysagduyu@i-a-i.com

ABSTRACT

Business and government operations generate large volumes of documents to be categorized through machine learning techniques before dissemination and storage. One prerequisite in such classification is to properly choose training documents. Active learning emerges as a technique to achieve better accuracy with fewer training documents by choosing data to learn and querying oracles for unknown labels. In practice, such oracles are usually human analysts who are likely to make mistakes or, in some cases, even intentionally introduce erroneous labels for malicious purposes. We propose a risk-factor based strategy to defend active-learning-based document classification against human mistakes or adversarial inputs. We show that the proposed strategy can substantially alleviate the damage caused by malicious labeling. Our experimental results demonstrate the effectiveness of our defense strategy in terms of maintaining accuracy against adversaries.

Index Terms: active learning, document classification, security and attacks, malicious inputs.

1. INTRODUCTION

Daily routine operations in business and government produce large numbers of documents, which must be properly categorized or labeled, then disseminated to authorized personnel and stored in appropriate places. For example, documents in government operations may be labeled as public information or assigned a classification level according to national security requirements. Machine learning techniques, such as the Naive Bayes classifier and the Support Vector Machine (SVM) [1], have been used extensively to assist automated document classification [2, 3].

To facilitate the processing of training data sets, active learning [4] has been used to achieve better accuracy with smaller training sets for document classification. The essential idea behind active learning is to let the learning system choose the data to learn from and query an oracle for a label. In practice, such an oracle is usually a human analyst who is tasked with identifying and classifying given documents. For example, in government operations, security analysts are assigned to place each document into a security classification level for proper information control and dissemination.

On the one hand, active learning can significantly reduce the number of training documents needed to train an underlying machine learning model [4, 5]. On the other hand, it also introduces risks that could lead to less accurate classification. Specifically, active learning usually involves inputs from human analysts, who can sometimes make mistakes. More severely, due to insider threats or account hacking, such inputs can even be malicious, with intent to sabotage the entire active learning process. Many potential vulnerabilities in active learning make such attacks possible [6]: 1) the attacker (i.e., a human analyst with malicious intent) can fabricate data that is less significant but appealing for the learner to choose; 2) the attacker can leverage existing machine learning vulnerabilities inherited by active learning; 3) the attacker can provide incorrect results when the learner queries for labels. Therefore, it should never be taken for granted that the inputs from human analysts are always correct, and it is critical to make active learning resilient to erroneous inputs due to human errors or malicious attacks.

In this paper, we aim to design a robust defense strategy for active learning. In particular, we focus on the scenario of SVM-based active learning under a malicious attacker who gives erroneous inputs during learning queries, because SVM is an extensively used method in classification and active learning [4, 5, 7–9]. Our defense strategy is a risk-factor based mechanism that guides whether we should accept or reject an input to active learning. By examining the distance of a newly labeled document to the current separating hyperplane of the SVM model, the mechanism decides whether the input is too risky to accept, depending on whether the distance is larger than a given threshold. Our method is shown to substantially alleviate the damage caused by malicious attacks.

2. BACKGROUND AND RELATED WORK

In this section, we briefly introduce SVM and active learning.

2.1. SVM and Active Learning

SVM is a widely used classification method [1] that finds a hyperplane separating the training data into subsets with different categories/labels based on support vectors, which are the instances from the training data closest to the hyperplane. In SVM-based document classification, an instance is a feature vector representing the counts of words extracted from a document.
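As a minimal sketch of the feature representation described above, the following Python snippet maps a document to a vector of word counts over a fixed vocabulary. The tokenization rule, the example vocabulary, and the helper name `count_features` are illustrative assumptions; the paper does not specify its preprocessing.

```python
import re
from collections import Counter


def count_features(document, vocabulary):
    """Return the word-count vector of `document` over `vocabulary`.

    Hypothetical helper: tokens are assumed to be lowercase alphabetic runs.
    """
    tokens = re.findall(r"[a-z]+", document.lower())
    counts = Counter(tokens)
    # A Counter returns 0 for words that never occur in the document.
    return [counts[word] for word in vocabulary]


vocabulary = ["oil", "price", "wheat"]  # assumed toy vocabulary
vector = count_features("Oil price rises as oil supply falls", vocabulary)
```

Each document then becomes one instance (one point in feature space) that an SVM can place on either side of its separating hyperplane.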

To perform accurate classification, SVM requires training on a substantial number of instances with labels already given as the ground truth. However, labeling many instances for training a classifier can be cumbersome in practice. Hence, active learning [4, 10] has been designed as an advanced process in which only a subset of unlabeled instances is chosen to be labeled and added to the training set.

Active learning involves two parties: the learner (usually a machine that builds an accurate classifier) and the oracle (usually a human analyst in practice). It consists of three components [4], $(f, q, X)$, where $f$ is a classifier mapping a document to a label, $X$ is the training set, and $q$ is the query function, which chooses the next instance from all unlabeled instances and queries the oracle for the corresponding label. After each query, the learner updates $X$ and returns a new classifier.
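The interplay of the three components can be sketched as a short loop. This is only an illustration under stated assumptions: the classifier is a toy one-dimensional threshold model rather than an SVM, and the names `train`, `active_learning`, `random_query`, and `honest_oracle` are hypothetical.

```python
import random


def train(X):
    """Fit a toy 1-D threshold classifier f from labeled pairs (value, label).

    Assumes both labels are present and that label-1 values are larger.
    """
    zeros = [v for v, y in X if y == 0]
    ones = [v for v, y in X if y == 1]
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2.0
    return lambda v: 1 if v >= threshold else 0


def active_learning(X, pool, oracle, q, num_queries):
    """Run the (f, q, X) loop: pick an instance, ask the oracle, update X, retrain."""
    X, pool = list(X), list(pool)
    f = train(X)
    for _ in range(min(num_queries, len(pool))):
        instance = q(f, pool)                   # q chooses the next unlabeled instance
        pool.remove(instance)
        X.append((instance, oracle(instance)))  # the oracle supplies its label
        f = train(X)                            # the learner returns a new classifier
    return f, X


rng = random.Random(0)


def random_query(f, pool):
    """Random sampling query strategy."""
    return rng.choice(pool)


def honest_oracle(v):
    """Assumed ground truth for this toy example."""
    return 1 if v >= 5.0 else 0


f, X = active_learning([(1.0, 0), (9.0, 1)], [2.0, 4.0, 6.0, 8.0],
                       honest_oracle, random_query, 4)
```

Replacing `honest_oracle` with an analyst who sometimes returns a wrong label gives the adversarial setting discussed in the rest of the paper.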

2.2. Adversarial Active Learning

Since active learning relies on oracles that are usually human analysts in practice, it is subject to common security vulnerabilities and exposed to potential risks associated with or due to human analysts. A list of possible vulnerabilities was summarized in [6] with a focus on the query strategies, leaving the risks due to human analysts less discussed.

Because the human analyst is an essential part of active learning, we have to consider the active learning scenario from a security perspective: the inputs from a human expert should not be trusted, but carefully examined. During document classification, an analyst can maliciously label a document, which can in fact be hard to detect. When there is a fairly large number of malicious labels, the inaccuracy introduced to the resulting classifier becomes significant enough to reduce or undermine the usability of an application. The work in [11] showed that an adversary does not even need perfect knowledge of the classifier to launch such attacks against active learning. It is imperative to at least alleviate, if not completely eliminate, the damage due to malicious labeling in active learning for document classification.

3. MODELS AND DESIGN

In this section, we first present the models and then describe our design to protect active learning from malicious inputs.

To maintain simplicity without loss of generality, if a document set $D$ can be separated into two disjoint subsets $D_0$ and $D_1$, we say a document $d \in D$ is labeled 0 if $d \in D_0$, and labeled 1 if $d \in D_1$.

3.1. Active Learning under Attacks

As mentioned above, the effectiveness of active learning relies on outside inputs that may be manipulated by an adversary. Therefore, we focus on providing a defense strategy that protects active learning from accepting malicious inputs. Specifically, as Fig. 1 shows, we consider an active learning process for document classification that includes a learner (i.e., the machine that performs active learning), a malicious human analyst who randomly gives malicious inputs, and queries from the learner to human analysts.

[Figure 1 diagram. Elements: Unlabeled Documents; select document ($d_i$); query ($q_i$) to the Attacker; label ($l_i$), which might be malicious; Labeled Documents ($D_i$); train classifier; SVM Classifier.]


Fig. 1. The scenario of active learning under attacks.

As shown in Fig. 1, in the $i$-th query, the learner already has a labeled document set $D_{i-1}$ and sends a query $q_i$ containing document $d_i$ to the analyst, who then gives a (potentially wrong) label $l_i$ to the learner. The problem is whether the learner should accept the label to form a new labeled document set $D_i = D_{i-1} \cup \{d_i\}$, or reject or even revert the label to keep the original labeled document set $D_i = D_{i-1}$.

3.2. Risk Factor based Defense Strategy

In what follows, we design a risk-factor based defense strategy to protect active learning. The intuition behind the risk model is as follows. In SVM, a data point close to the hyperplane is more likely to be misclassified. If an attacker has no knowledge of or access to the entire training data set, the attacker has no way of knowing exactly where the hyperplane is. Therefore, mislabeled data may lie at a larger distance from the hyperplane.

Consequently, if a document $d_i$ comes from the analyst with a label $l_i$, we define the risk factor $r_i$ for this document in active learning as

$$r_i = a \, \Delta_i^{\max}, \tag{1}$$

where $\Delta_i^{\max}$ is the maximum distance from the current support vectors to the separating hyperplane based upon the existing document set $D_{i-1}$, and $a > 0$ is a constant threshold.

Our method then works as follows. When a query $q_i$ containing document $d_i$ is made, the learner is offered a label $l_i$ by the analyst. We first use the current model built upon document set $D_{i-1}$ to predict the label of document $d_i$ and obtain the predicted label $l'_i$. If $l'_i \neq l_i$, we calculate the distance $\Delta_i$ between $d_i$ and the current separating hyperplane and compare it to the risk factor $r_i$. If $\Delta_i > r_i$, we consider the label to be mistakenly provided and reject it.

Algorithm 1 The risk-factor based defense algorithm
Given: current set $D_{i-1}$, query document $d_i$, input label $l_i$
$l'_i$ = predict_using_current_set($D_{i-1}$, $d_i$)
if $l'_i \neq l_i$:
    $\Delta_i$ = compute_distance($d_i$)
    if $\Delta_i > r_i$:
        return FALSE_LABEL
return TRUE_LABEL

The defense process is given in Algorithm 1. In Algorithm 1, the function predict_using_current_set accepts the current document set $D_{i-1}$ and a document $d_i$ as parameters and outputs the predicted label of $d_i$; the function compute_distance accepts the document $d_i$ as a parameter and calculates the distance between the corresponding feature vector and the current separating hyperplane of the SVM model.
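The check in Algorithm 1 can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it assumes a linear hyperplane $w \cdot x + b = 0$ whose parameters and support vectors are already known, whereas the experiments in this paper use an RBF kernel. The names `signed_distance`, `risk_factor`, and `accept_label` are hypothetical.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def risk_factor(w, b, support_vectors, a):
    """Eq. (1): a times the largest support-vector distance to the hyperplane."""
    return a * max(abs(signed_distance(w, b, sv)) for sv in support_vectors)


def accept_label(w, b, support_vectors, x, label, a):
    """Return False (reject) when the offered label contradicts the current
    model AND the instance lies farther from the hyperplane than the risk factor."""
    d = signed_distance(w, b, x)
    predicted = 1 if d >= 0 else 0  # assumed convention: positive side is label 1
    if predicted != label and abs(d) > risk_factor(w, b, support_vectors, a):
        return False
    return True


# Toy model (assumption): hyperplane x1 = 0 with support vectors at distance 1.
w, b = (1.0, 0.0), 0.0
svs = [(1.0, 0.0), (-1.0, 0.0)]
far_conflict = accept_label(w, b, svs, (3.0, 0.0), 0, a=1.0)
near_conflict = accept_label(w, b, svs, (0.5, 0.0), 0, a=1.0)
agreeing = accept_label(w, b, svs, (3.0, 0.0), 1, a=1.0)
```

In this toy setting the conflicting label far from the hyperplane is rejected, while a conflicting label close to the hyperplane and a label that agrees with the model are both accepted.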

This approach relies on the correctness of the initial training set $D_0$, which is assumed to be accurate in this paper. By design, the strategy leverages the statistical properties initially derived from $D_0$; therefore, it is not designed to prevent all attacks, but only to identify and correct a subset of mislabels based on the initial and inherited statistical properties during the active learning process.

3.3. Choosing the Risk Factor

We propose to design benchmark tests on the initial training set $D_0$ to choose the risk factor adequately. We use the following accuracy score $S$ to evaluate and compare the effectiveness of classification:

$$S = \frac{\text{total number of accurate classifications}}{\text{total number of classifications}}. \tag{2}$$
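For concreteness, the accuracy score in Eq. (2) can be computed as in the short sketch below; the function name and the example label lists are illustrative assumptions.

```python
def accuracy_score(predicted, actual):
    """Eq. (2): fraction of test instances whose predicted label is correct."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for p, t in zip(predicted, actual) if p == t)
    return correct / len(actual)


# Three of the four predictions match the true labels.
score = accuracy_score([1, 0, 1, 1], [1, 0, 0, 1])
```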

In each benchmark test (i.e., the function benchmark_test in Algorithm 2), we train the classifier using active learning with a given set of parameters, including the risk factor $r$ and the defense strategy, and record the accuracy score on the testing data set as the number of queries increases. When the training set is mixed with malicious labels without any defense, the score is called the affected score $S_a$. When the querying is protected by our defense strategy, the corresponding score is called $S_d$. We evaluate the effectiveness of our defense strategy by comparing $S_d$ with $S_a$. Our goal is to choose a risk factor that makes the defense strategy effective, i.e., $S_d \gg S_a$. With the benchmark tests, our heuristic approach to searching for the risk factor is shown in Algorithm 2.

In Algorithm 2, the labeling error rate $r_e$ is the probability that an input is wrongly labeled in benchmark tests. In practice, it should be small, as a large value is likely to be noticeable and raise suspicion. For example, when the government uses document analysts to classify documents, administrative approaches such as internal review and sample checking can be effective in detecting such errors.

Algorithm 2 Risk Factor Search Algorithm
Given: risk factor $r$, search step $\Delta$, labeling error rate $r_e$
1: $S_a$ = benchmark_test($r$, $r_e$, defense=False)
2: $S_d$ = benchmark_test($r$, $r_e$, defense=True)
3: if $S_d \gg S_a$:
4:     return $r$
5: else
6:     $r = r + \Delta$
7:     goto 1
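The search in Algorithm 2 can be written as a loop, as in the hedged sketch below. Two details are assumptions that the paper does not specify: the condition $S_d \gg S_a$ is interpreted as a minimum `margin` between the two scores, and the search is capped at `max_steps` iterations so that it always terminates. The `benchmark_test` callable is supplied by the caller; the stand-in used here is a made-up function for illustration, not measured data.

```python
def search_risk_factor(benchmark_test, r, step, error_rate,
                       margin=0.01, max_steps=100):
    """Increase r by `step` until the defended score beats the undefended
    score by at least `margin`; return None if no such r is found."""
    for _ in range(max_steps):
        s_a = benchmark_test(r, error_rate, defense=False)
        s_d = benchmark_test(r, error_rate, defense=True)
        if s_d - s_a >= margin:  # assumed reading of "S_d >> S_a"
            return r
        r += step
    return None


def fake_benchmark(r, error_rate, defense):
    """Hypothetical stand-in for a real benchmark run (illustrative values only)."""
    if not defense:
        return 0.67
    return 0.69 if r >= 1.0 else 0.66


found = search_risk_factor(fake_benchmark, r=0.5, step=0.25, error_rate=0.25)
```

Because the loop stops at the first risk factor that satisfies the condition, the result depends on the starting value and the step size, which matches the heuristic, locally optimal nature of the search.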

4. EXPERIMENTS AND RESULTS

In this section, we present experimental setups and results.

4.1. Experiment Setups and Parameters

We build a data set of 1264 instances with 10233 features extracted from real documents in the Reuters-21578 data set [12]. Instances in the data set are evenly distributed between two categories. Three fourths of the data set is used for training and the rest for testing. Within the training set, four fifths is labeled data and the rest forms the query pool.

For the SVM algorithm, we use the radial basis function (RBF) kernel with parameters $\gamma = 1.0/1264$ and $C = 1.0$. We consider three test cases: 1) No attack: there is no erroneous labeling and no defense strategy; 2) Attack without defense: 25% of the labels are erroneous due to the attack, and no defense strategy is used; 3) Attack with defense: 25% of the labels are erroneous due to the attack, and the defense strategy is used. During active learning, queries are all made randomly in the three sets of tests. We first use the random sampling strategy and then the uncertainty sampling strategy in the experiments [10].
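The attack in test cases 2 and 3 can be simulated with an oracle that flips a binary label with a fixed probability. The sketch below is an assumption about how such an attacker could be modeled, not the authors' test harness; the function name `attacked_label` and the random seed are illustrative.

```python
import random


def attacked_label(true_label, error_rate, rng):
    """Return the label reported by the analyst for a binary (0/1) task.

    With probability `error_rate` the true label is flipped.
    """
    if rng.random() < error_rate:
        return 1 - true_label
    return true_label


rng = random.Random(42)  # fixed seed so the simulation is repeatable
reported = [attacked_label(1, 0.25, rng) for _ in range(10000)]
flipped_fraction = reported.count(0) / len(reported)
```

With an error rate of 0.25, roughly one in four reported labels is wrong, so correct inputs outnumber malicious ones by about three to one.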

4.2. Experimental Results

We first consider scenarios in which the risk factor is not carefully chosen. The query strategy in active learning is random sampling in these experiments.

Fig. 2 shows the accuracy scores of the three test cases when the risk factor is too small. We can observe that, under a small risk factor, the performance of the attack with defense case is even worse than that of the attack without defense case, although correct inputs are three times as numerous as malicious ones. This is because the defense strategy cannot really distinguish which input is due to the attacker; it can only detect which label may be malicious by comparing the distance of its instance in the SVM model with the risk factor. When the risk factor is too small, the defense strategy has a very small tolerance for accepting new inputs, and it errs by rejecting many inputs with correct labels.

[Figure 2 plot: accuracy score versus number of queries for the No Attack, Attack With Defense, and Attack Without Defense cases.]

Fig. 2. Comparison of accuracy scores in three cases when $a = 0.5$.

[Figure 3 plot: accuracy score versus number of queries for the No Attack, Attack With Defense, and Attack Without Defense cases.]

Fig. 3. Comparison of accuracy scores in three cases when $a = 1.5$.

[Figure 4 plot: accuracy score versus number of queries for the No Attack, Attack With Defense, and Attack Without Defense cases.]

Fig. 4. Comparison of accuracy scores with optimal risk factor.

Fig. 3 shows the accuracy scores of the three test cases when the risk factor has a large value, i.e., the threshold in the risk factor is $a = 1.5$. As the figure depicts, the attack with defense case yields the same performance as the attack without defense case, which is substantially worse than the no attack case. This means that the defense strategy neither improves nor degrades the performance of the classifier compared with the attack without defense case. This is because the risk factor is chosen too large, so the distance of each instance with an erroneous label to the separating hyperplane in the SVM classifier is considered acceptable by the defense strategy.

Figs. 2 and 3 clearly show how strongly malicious inputs can affect the accuracy of document classification. The two figures also show how the value of the risk factor affects the effectiveness of the defense strategy: a very small risk factor yields worse performance than the attack without defense case, and a very large risk factor leads to performance equal to that of the attack without defense case.

Then we use a locally optimal risk factor found by Algorithm 2. Fig. 4 shows that, when the risk factor is optimal, the attack with defense case achieves performance close to that of the no attack case and outperforms the attack without defense case. Admittedly, there are errors that the defense strategy cannot detect. First, it ignores mistakes where an erroneous label is the same as the prediction of the current classifier. Second, it misses the cases in which the instance with an erroneous label is within the distance margin allowed by the risk factor. This explains why the attack with defense performance is overall worse than that of the no attack case. From Figs. 3 and 4, we can conclude that the defense strategy is effective when the risk factor is carefully chosen.

Finally, we evaluate the effectiveness of the defense when the query strategy is uncertainty sampling and compare the result with random sampling. We decrease the size of the training set to half of the entire data set and the labeled set to half of the training set, and increase the input error ratio to 1/2. Fig. 5 shows that under uncertainty sampling, where each queried sample is closer to the current separating hyperplane than the others, our defense is still effective against erroneous labeling. With uncertainty sampling, the classifier should achieve a higher accuracy with fewer queries than with random sampling. Under a heavy attack, however, Fig. 5 shows that the affected classifier degrades to the random sampling case. With the proposed defense strategy, the damage is largely reduced and the classification accuracy is approximately equal to that of the original classifier without attack.

[Figure 5 plot: accuracy score versus number of queries for the No Attack, Attack Without Defense, and Attack With Defense cases under uncertainty sampling and random sampling.]

Fig. 5. Comparison of different sampling strategies.
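Uncertainty sampling, as used in this comparison, queries the unlabeled instance closest to the current separating hyperplane. The following sketch illustrates the selection rule under the simplifying assumption of a known linear hyperplane $w \cdot x + b = 0$ (the experiments use an RBF kernel); the name `uncertainty_query` and the toy pool are hypothetical.

```python
def uncertainty_query(w, b, pool):
    """Return the pool instance with the smallest |w.x + b|.

    For a fixed w this is the instance closest to the hyperplane,
    because dividing by the norm of w does not change the ordering.
    """
    def margin(x):
        return abs(sum(wi * xi for wi, xi in zip(w, x)) + b)

    return min(pool, key=margin)


pool = [(3.0, 1.0), (0.2, 5.0), (-2.0, 0.0)]  # assumed toy feature vectors
chosen = uncertainty_query((1.0, 0.0), 0.0, pool)
```

Because the queried instances sit near the hyperplane, a wrong label on them is less likely to exceed the risk factor, which is one reason the undetected errors discussed earlier still matter under this strategy.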

5. CONCLUSION

In this paper, we considered the scenario of protecting active learning in document classification against adversarial inputs. We proposed a risk-factor based defense strategy. Using real data sets and experiments, we showed that, with an adequately adjusted risk factor, the proposed defense strategy can improve the classification accuracy and is therefore effective in defending active-learning-based document classification against adversarial inputs.

6. REFERENCES

[1] J. Huang, J. Lu, and C. X. Ling, “Comparing naive Bayes, decision trees, and SVM with AUC and accuracy,” in Proc. of IEEE International Conference on Data Mining (ICDM), 2003, pp. 553–556.

[2] F. Sebastiani, “Machine learning in automated text categorization,” ACM Computing Surveys, vol. 34, no. 1, pp. 1–47, 2002.

[3] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell, “Text classification from labeled and unlabeled documents using EM,” Machine Learning, vol. 39, no. 2-3, pp. 103–134, 2000.

[4] S. Tong and D. Koller, “Support vector machine active learning with applications to text classification,” The Journal of Machine Learning Research, vol. 2, pp. 45–66, 2002.

[5] E. Pasolli, F. Melgani, D. Tuia, F. Pacifici, and W. J. Emery, “SVM active learning approach for image classification using spatial information,” IEEE Trans. Geoscience and Remote Sensing, vol. 52, no. 4, pp. 2217–2233, 2014.

[6] B. Miller, A. Kantchelian, S. Afroz, R. Bachwani, E. Dauber, L. Huang, M. C. Tschantz, A. D. Joseph, and J. D. Tygar, “Adversarial active learning,” in Proc. of Workshop on Artificial Intelligence and Security, 2014, pp. 3–14.

[7] C. S. Leslie, E. Eskin, and W. S. Noble, “The spectrum kernel: A string kernel for SVM protein classification,” in Pacific Symposium on Biocomputing, vol. 7, 2002, pp. 566–575.

[8] C.-W. Hsu, C.-C. Chang, C.-J. Lin et al., “A practical guide to support vector classification,” 2003.

[9] S. Tong and E. Chang, “Support vector machine active learning for image retrieval,” in Proc. of ACM International Conference on Multimedia, 2001, pp. 107–118.

[10] B. Settles, “Active learning literature survey,” University of Wisconsin, Madison, vol. 52, no. 55-66, p. 11, 2010.

[11] D. Lowd and C. Meek, “Adversarial learning,” in Proc. of ACM International Conference on Knowledge Discovery in Data Mining (SIGKDD), 2005, pp. 641–647.

[12] W.-N. Hsu and H.-T. Lin, “Active learning by learning,” 2015.

TXAI-ADV: Trustworthy XAI for Defending AI Models against Adversarial Attacks in Realistic CIoT

Article

Full-text available

May 2024

Adversarial attacks are more prevalent in Consumer Internet of Things (CIoT) devices (i.e., smart home devices, cameras, actuators, sensors, and micro-controllers) because of their growing integration into daily activities, which brings attention to their possible shortcomings and usefulness. Keeping protection in the CIoT and countering emerging risks require constant updates and monitoring of these devices. Machine learning (ML), in combination with Explainable Artificial Intelligence (XAI), has become an essential component of the CIoT ecosystem due to its rapid advancement and impressive results across several application domains for attack detection, prevention, mitigation, and providing explanations of such decisions. These attacks exploit and steal sensitive data, disrupt the devices’ functionality, or gain unauthorized access to connected networks. This research generates a novel dataset by injecting adversarial attacks into the CICIoT2023 dataset. It presents an adversarial attack detection approach named TXAI-ADV that utilizes deep learning (Mutli-Layer Perceptron (MLP) and Deep Neural Network (DNN)) and machine learning classifiers (K-Nearest Neighbor (KNN), Support Vector Classifier (SVC), Gaussian Naive Bayes (GNB), ensemble voting, and Meta Classifier) to detect attacks and avert such situations rapidly in a CIoT. This study utilized Shapley Additive Explanations (SHAP) techniques, an XAI technique, to analyze the average impact of each class feature on the proposed models and select optimal features for the adversarial attacks dataset. The results revealed that, with a 96% accuracy rate, the proposed approach effectively detects adversarial attacks in a CIoT.

Adversarial Machine Learning in Wireless Communications Using RF Data: A Review

Article

Full-text available

Jan 2022

Machine learning (ML) provides effective means to learn from spectrum data and solve complex tasks involved in wireless communications. Supported by recent advances in computational resources and algorithmic designs, deep learning (DL) has found success in performing various wireless communication tasks such as signal recognition, spectrum sensing and waveform design. However, ML in general and DL in particular have been found vulnerable to manipulations thus giving rise to a field of study called adversarial machine learning (AML). Although AML has been extensively studied in other data domains such as computer vision and natural language processing, research for AML in the wireless communications domain is still in its early stage. This paper presents a comprehensive review of the latest research efforts focused on AML in wireless communications while accounting for the unique characteristics of wireless systems. First, the background of AML attacks on deep neural networks is discussed and a taxonomy of AML attack types is provided. Various methods of generating adversarial examples and attack mechanisms are also described. In addition, an holistic survey of existing research on AML attacks for various wireless communication problems as well as the corresponding defense mechanisms in the wireless domain are presented. Finally, as new attacks and defense techniques are developed, recent research trends and the overarching future outlook for AML in next-generation wireless communications are discussed.

Adversarial Machine Learning for Cybersecurity and Computer Vision: Current Developments and Challenges

Preprint

Full-text available

Jun 2021

Bowei Xi

We provide a comprehensive overview of adversarial machine learning focusing on two application domains, i.e., cybersecurity and computer vision. Research in adversarial machine learning addresses a significant threat to the wide application of machine learning techniques -- they are vulnerable to carefully crafted attacks from malicious adversaries. For example, deep neural networks fail to correctly classify adversarial images, which are generated by adding imperceptible perturbations to clean images.We first discuss three main categories of attacks against machine learning techniques -- poisoning attacks, evasion attacks, and privacy attacks. Then the corresponding defense approaches are introduced along with the weakness and limitations of the existing defense approaches. We notice adversarial samples in cybersecurity and computer vision are fundamentally different. While adversarial samples in cybersecurity often have different properties/distributions compared with training data, adversarial images in computer vision are created with minor input perturbations. This further complicates the development of robust learning techniques, because a robust learning technique must withstand different types of attacks.

Q-learning and LSTM based deep active learning strategy for malware defense in industrial IoT applications

Article

Full-text available

Apr 2021
MULTIMED TOOLS APPL

Edge devices are extensively used as intermediaries between the device and the service layer in an industrial Internet of things (IIoT) environment. These devices are quite vulnerable to malware attacks. Existing studies have worked on designing complex learning algorithms or deep architectures to accurately classify malware assuming that a sufficient number of labeled examples are provided. In the real world, getting labeled examples is one of the major issues for training any classification algorithm. Recent advances have allowed researchers to use active learning strategies that are trained on a handful of labeled examples to perform the classification task, but they are based on the selection of informative instances. This study integrates the Q-learning characteristics into an active learning framework, which allows the network to either request or predict a label during the training process. We proposed the use of phase space embedding, sparse autoencoder, and LSTM with the action-value function to classify malware applications while using a handful of labeled examples. The network relies on its uncertainty to either request or predict a label. The experimental results show that the proposed method can achieve better accuracy than the supervised learning strategy while using few labeled requests. The results also show that the trained network is resilient to the adversarial attacks, which proves the robustness of the proposed method. Additionally, this study explores the tradeoff between classification accuracy and number of label requests via the choice of rewards and the use of decision-level fusion strategies to boost the classification performance. Furthermore, we also provide a hypothetical framework as an implication of the proposed method.

Deep Learning for Wireless Communications

Preprint

May 2020

Existing communication systems exhibit inherent limitations in translating theory to practice when handling the complexity of optimization for emerging wireless applications with high degrees of freedom. Deep learning has a strong potential to overcome this challenge via data-driven solutions and improve the performance of wireless systems in utilizing limited spectrum resources. In this chapter, we first describe how deep learning is used to design an end-to-end communication system using autoencoders. This flexible design effectively captures channel impairments and optimizes transmitter and receiver operations jointly in single-antenna, multiple-antenna, and multiuser communications. Next, we present the benefits of deep learning in spectrum situation awareness ranging from channel modeling and estimation to signal detection and classification tasks. Deep learning improves the performance when the model-based methods fail. Finally, we discuss how deep learning applies to wireless communication security. In this context, adversarial machine learning provides novel means to launch and defend against wireless attacks. These applications demonstrate the power of deep learning in providing novel means to design, optimize, adapt, and secure wireless communications.

Adversarial Machine Learning for 5G Communications Security

Chapter

Sep 2021

Machine learning provides automated means to capture complex dynamics of wireless spectrum and support better understanding of spectrum resources and their efficient utilization. As communication systems become smarter with cognitive radio capabilities empowered by machine learning to perform critical tasks such as spectrum awareness and spectrum sharing, they also become susceptible to new vulnerabilities due to the attacks that target the machine learning applications. This chapter identifies the emerging attack surface of adversarial machine learning and corresponding attacks launched against wireless communications in the context of 5G systems. The focus is on attacks against (i) spectrum sharing of 5G communications with incumbent users such as in the Citizens Broadband Radio Service (CBRS) band and (ii) physical layer authentication of 5G User Equipment (UE) to support network slicing. For the first attack, the adversary transmits during data transmission or spectrum sensing periods to manipulate the signal-level inputs to the deep learning classifier that is deployed at the Environmental Sensing Capability (ESC) to support the 5G system. For the second attack, the adversary spoofs wireless signals with the generative adversarial network (GAN) to infiltrate the physical layer authentication mechanism based on a deep learning classifier that is deployed at the 5G base station. Results indicate major vulnerabilities of 5G systems to adversarial machine learning. To sustain the 5G system operations in the presence of adversaries, a defense mechanism is presented to increase the uncertainty of the adversary in training the surrogate model used for launching its subsequent attacks.

Adversarial machine learning for cybersecurity and computer vision: Current developments and challenges

Article

Apr 2020

Bowei Xi

We provide a comprehensive overview of adversarial machine learning focusing on two application domains, that is, cybersecurity and computer vision. Research in adversarial machine learning addresses a significant threat to the wide application of machine learning techniques: they are vulnerable to carefully crafted attacks from malicious adversaries. For example, deep neural networks fail to correctly classify adversarial images, which are generated by adding imperceptible perturbations to clean images. We first discuss three main categories of attacks against machine learning techniques: poisoning attacks, evasion attacks, and privacy attacks. Then the corresponding defense approaches are introduced along with the weaknesses and limitations of the existing defense approaches. We notice adversarial samples in cybersecurity and computer vision are fundamentally different. While adversarial samples in cybersecurity often have different properties/distributions compared with training data, adversarial images in computer vision are created with minor input perturbations. This further complicates the development of robust learning techniques, because a robust learning technique must withstand different types of attacks.

Detection of Causative Attack and Prevention Using CAP Algorithm on Training Datasets

Chapter

Jan 2020

Machine learning is the scientific study of algorithms and has been widely used for making automated decisions. Attackers use their knowledge to alter the training datasets, which leads to malicious results and models. A causative attack in adversarial machine learning is a security threat in which carefully crafted poisonous data points are injected into the training datasets. This type of attack occurs when malicious data is injected into the training data used to train the model. Defense techniques leverage robust training datasets to preserve accuracy when evaluating machine learning algorithms. The novel CAP algorithm is explained, which substitutes trusted data for untrusted data and thereby improves the reliability of machine learning algorithms.

Deep Learning for Wireless Communications

Chapter

Jan 2020

IoT Network Security from the Perspective of Adversarial Deep Learning

Conference Paper

Jun 2019

Adversarial Active Learning

Conference Paper

Nov 2014

Active learning is an area of machine learning examining strategies for allocation of finite resources, particularly human labeling efforts and to an extent feature extraction, in situations where available data exceeds available resources. In this open problem paper, we motivate the necessity of active learning in the security domain, identify problems caused by the application of present active learning techniques in adversarial settings, and propose a framework for experimentation and implementation of active learning systems in adversarial contexts. More than other contexts, adversarial contexts particularly need active learning as ongoing attempts to evade and confuse classifiers necessitate constant generation of labels for new content to keep pace with adversarial activity. Just as traditional machine learning algorithms are vulnerable to adversarial manipulation, we discuss assumptions specific to active learning that introduce additional vulnerabilities, as well as present vulnerabilities that are amplified in the active learning setting. Lastly, we present a software architecture, Security-oriented Active Learning Testbed (SALT), for the research and implementation of active learning applications in adversarial contexts.

SVM Active Learning Approach for Image Classification Using Spatial Information

Article

Apr 2014

In the last few years, active learning has gained growing interest in the remote sensing community to optimize the process of training sample collection for supervised image classification. Current strategies formulate the active learning problem in the spectral domain only. However, remote sensing images are intrinsically defined both in the spectral and spatial domains. In this paper, we aim at exploring this fact by proposing a new active learning approach for support vector machine (SVM) classification. In particular, we suggest combining spectral and spatial information directly in the iterative process of sample selection. For this purpose, three criteria are proposed in order to favour the selection of samples distant from the samples already composing the current training set. In the first strategy, the Euclidean distances in the spatial domain from the training samples are explicitly computed, while the second one is based on the Parzen window method in the spatial domain. Finally, the last criterion involves the concept of spatial entropy. Experiments on two very high resolution (VHR) images show the effectiveness of regularization in spatial domain for active learning purposes.

Machine Learning in Automated Text Categorization

Article

Apr 2001

Fabrizio Sebastiani

The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.

Adversarial learning

Conference Paper

Aug 2005

Many classification tasks, such as spam filtering, intrusion detection, and terrorism detection, are complicated by an adversary who wishes to avoid detection. Previous work on adversarial classification has made the unrealistic assumption that the attacker has perfect knowledge of the classifier [2]. In this paper, we introduce the adversarial classifier reverse engineering (ACRE) learning problem, the task of learning sufficient information about a classifier to construct adversarial attacks. We present efficient algorithms for reverse engineering linear classifiers with either continuous or Boolean features and demonstrate their effectiveness using real data from the domain of spam filtering.

Mismatch String Kernels for SVM Protein Classification

Conference Paper

Jan 2002
Adv Neural Inform Process Syst

We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. These kernels measure sequence similarity based on shared occurrences of k-length subsequences, counted with up to m mismatches, and do not rely on any generative model for the positive training sequences. We compute the kernels efficiently using a mismatch tree data structure and report experiments on a benchmark SCOP dataset, where we show that the mismatch kernel used with an SVM classifier performs as well as the Fisher kernel, the most successful method for remote homology detection, while achieving considerable computational savings.
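
As a rough illustration of the similarity measure described in this abstract, the sketch below counts, for every candidate subsequence of length k, how many length-k windows of a sequence lie within m mismatches of it, and takes the inner product of two such count vectors. It is a brute-force toy that enumerates the whole alphabet, not the mismatch tree computation the abstract refers to; the function names and the two-letter alphabet in the usage lines are assumptions made for the example.

```python
from collections import Counter
from itertools import product


def mismatch_features(seq, k, m, alphabet):
    """Map `seq` to counts over all k-mers of `alphabet`.

    Each length-k window of `seq` adds 1 to every k-mer that differs
    from it in at most m positions (brute force, for illustration only).
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        for kmer in kmers:
            mismatches = sum(a != b for a, b in zip(window, kmer))
            if mismatches <= m:
                counts[kmer] += 1
    return counts


def mismatch_kernel(x, y, k, m, alphabet):
    """Inner product of the two mismatch feature vectors."""
    fx = mismatch_features(x, k, m, alphabet)
    fy = mismatch_features(y, k, m, alphabet)
    # Counter returns 0 for k-mers that never occur in y.
    return sum(count * fy[kmer] for kmer, count in fx.items())


# Hypothetical toy sequences over a two-letter alphabet.
exact = mismatch_kernel("ABAB", "AB", k=2, m=0, alphabet="AB")
fuzzy = mismatch_kernel("ABAB", "AB", k=2, m=1, alphabet="AB")
```

With m = 0 the function reduces to counting exactly shared k-mers; allowing m = 1 lets near-matching windows contribute as well, which is why `fuzzy` is larger than `exact` here.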

Active Learning by Learning

Article

Feb 2015

Pool-based active learning is an important technique that helps reduce labeling efforts within a pool of unlabeled instances. Currently, most pool-based active learning strategies are constructed based on some human-designed philosophy; that is, they reflect what human beings assume to be “good labeling questions.” However, while such human-designed philosophies can be useful on specific data sets, it is often difficult to establish the theoretical connection of those philosophies to the true learning performance of interest. In addition, given that a single human-designed philosophy is unlikely to work on all scenarios, choosing and blending those strategies under different scenarios is an important but challenging practical task. This paper tackles this task by letting the machines adaptively “learn” from the performance of a set of given strategies on a particular data set. More specifically, we design a learning algorithm that connects active learning with the well-known multi-armed bandit problem. Further, we postulate that, given an appropriate choice for the multi-armed bandit learner, it is possible to estimate the performance of different strategies on the fly. Extensive empirical studies of the resulting ALBL algorithm confirm that it performs better than state-of-the-art strategies and a leading blending algorithm for active learning, all of which are based on human-designed philosophy.

A Practical Guide to Support Vector Classification

Article

Jan 2003

Support Vector Machine Active Learning With Applications To Text Classification

Article

Dec 2001
J MACH LEARN RES

Support vector machines have met with significant success in numerous real-world learning tasks. However, like most machine learning algorithms, they are generally applied using a randomly selected training set classified in advance. In many settings, we also have the option of using pool-based active learning. Instead of using a randomly selected training set, the learner has access to a pool of unlabeled instances and can request the labels for some number of them. We introduce a new algorithm for performing active learning with support vector machines, i.e., an algorithm for choosing which instances to request next. We provide a theoretical motivation for the algorithm using the notion of a version space. We present experimental results showing that employing our active learning method can significantly reduce the need for labeled training instances in both the standard inductive and transductive settings.

Active Learning Literature Survey

Article

Jul 2010

Burr Settles

The key idea behind active learning is that a machine learning algorithm can achieve greater accuracy with fewer labeled training instances if it is allowed to choose the data from which it learns. An active learner may ask queries in the form of unlabeled instances to be labeled by an oracle (e.g., a human annotator). Active learning is well-motivated in many modern machine learning problems, where unlabeled data may be abundant but labels are difficult, time-consuming, or expensive to obtain. This report provides a general introduction to active learning and a survey of the literature. This includes a discussion of the scenarios in which queries can be formulated, and an overview of the query strategy frameworks proposed in the literature to date. An analysis of the empirical and theoretical evidence for active learning, a summary of several problem setting variants, and a discussion of related topics in machine learning research are also presented.

Text Classification from Labeled and Unlabeled Documents using EM

Article

May 2000

This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However, these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve classification accuracy under these conditions: (1) a weighting factor to modulate the contribution of the unlabeled data, and (2) the use of multiple mixture components per class. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.
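
The basic loop described in this abstract (train on the labeled documents, probabilistically label the unlabeled ones, retrain on everything, repeat) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a multinomial naive Bayes model over word counts with add-one smoothing, omits both extensions mentioned in the abstract, and uses hypothetical function names and toy data.

```python
import numpy as np


def fit_nb(X, R, alpha=1.0):
    """M-step: fit multinomial naive Bayes from word counts X
    (docs x vocab) and soft labels R (docs x classes).
    `alpha` is an assumed additive smoothing constant."""
    class_mass = R.sum(axis=0)
    log_prior = np.log((class_mass + alpha) /
                       (class_mass.sum() + alpha * R.shape[1]))
    word_counts = R.T @ X + alpha                      # classes x vocab
    log_word = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return log_prior, log_word


def posterior(X, log_prior, log_word):
    """E-step: class probabilities for each document."""
    log_joint = X @ log_word.T + log_prior
    log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(log_joint)
    return p / p.sum(axis=1, keepdims=True)


def em_naive_bayes(X_lab, y_lab, X_unl, n_classes, n_iter=20, tol=1e-6):
    R_lab = np.eye(n_classes)[y_lab]        # labeled docs keep hard labels
    params = fit_nb(X_lab, R_lab)           # initial classifier: labeled only
    X_all = np.vstack([X_lab, X_unl])
    prev = None
    for _ in range(n_iter):
        R_unl = posterior(X_unl, *params)   # probabilistic labels
        params = fit_nb(X_all, np.vstack([R_lab, R_unl]))
        if prev is not None and np.abs(R_unl - prev).max() < tol:
            break                           # soft labels stopped changing
        prev = R_unl
    return params


# Hypothetical toy corpus: 4-word vocabulary, 2 classes.
X_lab = np.array([[3, 1, 0, 0], [0, 0, 2, 3]], dtype=float)
y_lab = np.array([0, 1])
X_unl = np.array([[2, 2, 0, 0], [0, 1, 3, 2], [4, 0, 0, 1]], dtype=float)

params = em_naive_bayes(X_lab, y_lab, X_unl, n_classes=2)
probs = posterior(X_unl, *params)
pred = probs.argmax(axis=1)
```

In this sketch the unlabeled documents contribute with full weight; the weighting factor named in the abstract would correspond to scaling their soft labels before the M-step.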
