Music Instrument Classification Reprogrammed
Hsin-Hung Chen [0000-0002-3406-741X] and Alexander Lerch [0000-0001-6319-578X]
Music Informatics Group, Georgia Institute of Technology
Abstract. The performance of approaches to Music Instrument Classification, a popular task in Music Information Retrieval, is often impacted and limited by the lack of availability of annotated data for training. We propose to address this issue with “reprogramming,” a technique that utilizes pre-trained deep and complex neural networks originally targeting a different task by modifying and mapping both the input and output of the pre-trained model. We demonstrate that reprogramming can effectively leverage the power of the representation learned for a different task and that the resulting reprogrammed system can perform on par with or even outperform state-of-the-art systems at a fraction of the training parameters. Our results, therefore, indicate that reprogramming is a promising technique potentially applicable to other tasks impeded by data scarcity.
Keywords: Reprogramming · Instrument Classification
1 Introduction
The task of Music Instrument Classification (MIC) aims at automatically recognizing the musical instruments playing in a music recording. MIC can provide
important information for a variety of applications such as music recommendation,
music discovery, and automatic mixing. In recent years, Deep Learning (DL)
models have shown superior performance in practically all Music Information
Retrieval (MIR) tasks including MIC. However, the lack of large-scale annotated
data remains a major problem for data-hungry supervised machine learning
algorithms in this field [7, 21, 16]. Beyond small-scale expert-annotated datasets, larger datasets are often collected by crowd-sourcing annotations, which leads to noisy and sometimes incomplete labels. For example, the majority of labels in the OpenMIC dataset [21], a popular dataset for polyphonic MIC, are missing;
this data scarcity can negatively impact the training of complex classifiers for
this multi-label task.
One established approach to address this data challenge is transfer learning.
In this approach, the knowledge of a source domain with sufficient training
data is transferred to a related but different target domain with insufficient
training data. This knowledge transfer is often achieved by either directly using
a pre-learned representation as classifier input or by fine-tuning a pre-trained
source-domain model with the target-domain data. For instance, the VGGish representation [19], trained on a wide variety of audio data, has been successfully utilized for MIC [16].

Fig. 1. The concept of model reprogramming: the target input is transformed by a trainable reprogramming stage into a reprogrammed input for the frozen pre-trained model, whose source output is then mapped to the target output by a trainable output mapping.

Model reprogramming aims at expanding the transfer
learning paradigm by treating the source-domain models as unmodified black-box
machine learning models extended only by input pre-processing and output
post-processing. Model reprogramming was first introduced in 2018 by Elsayed
et al. [8], who showed that a trainable universal input transformation function
can reprogram a pre-trained ImageNet model (without changing the model
weights) to solve the MNIST/CIFAR-10 image classification task with high
accuracy. Figure 1 illustrates the basic concept of model reprogramming: a
trainable model for input reprogramming modifies the input data to be fed into
the frozen black-box model pre-trained on the source task, followed by an output
transformation that maps the outputs of the pre-trained model to the target
categories. Thus, the reprogramming layer serves to “reprogram” the pre-trained
model to work with a new target task with different input data and different
target classes. Since the complexity of the input and output transformation
can be low, model reprogramming can combine the advantage of leveraging a
powerful deep pre-trained representation with the advantages of (i) reduced
training complexity and (ii) reduced data requirements. Reprogramming methods
have been successfully applied to various tasks such as medical image classification
[33], time-series classification [35], and language processing [28]. Results show that reprogramming methods can perform on par with or better than state-of-the-art methods, thus demonstrating the feasibility of reprogramming pre-trained models and showing the potential of this method for improving performance on other tasks with little data.
In this work, we investigate reprogramming for the task of MIC with the
incompletely labeled OpenMIC dataset. We choose a pre-trained state-of-the-art
audio classification model and extend it by input pre-processing and label-
mapping. We provide extensive results on various input processing configurations.
The main contributions of the paper are (i) the presentation of a low-complexity system that is able to outperform state-of-the-art MIC systems, and (ii) the introduction of the reprogramming paradigm to the field of MIR, where a multitude of applications suffer from insufficient training data.
The remainder of the paper is structured as follows: Sect. 2 presents a brief overview of relevant work. The pre-trained model and the proposed reprogramming methods are introduced in Sect. 3. The evaluation and analysis are presented in Sect. 4. We conclude with a brief summary and directions for future work.
2 Related Work
This related work is structured into three main parts: an overview of MIC, a
short survey of transfer learning, and recent work on reprogramming.
While earlier research on MIC has focused on the detection of instruments
from audio containing only one instrument [1, 10, 9, 26] or on the detection of the predominant instrument in a mixture [4], current research has focused
on recognizing instruments in polyphonic and poly-timbral audio recordings
containing multiple instruments playing multiple voices simultaneously. Similar to
other audio classification tasks, earlier systems tended to use traditional machine
learning approaches with low-level audio features at the input [9, 27], while modern approaches are dominated by neural networks. Li et al. [25] proposed to learn
features for instrument recognition with a Convolutional Neural Network (CNN)
using the MedleyDB dataset [3]. Hung et al. [22] proposed to detect instrument activity at a high time resolution and showed the advantage of pitch-conditioning on instrument recognition performance. Gururani et al. [16] introduced attention-based models to the task for enhanced accuracy and implemented a partial binary cross-entropy loss to ignore missing labels in the OpenMIC dataset [21]. Gururani and Lerch [15] showed that a semi-supervised approach based on a consistency loss, which adapts to and leverages the data with missing labels, outperforms other systems. Despite previous efforts on data curation and annotation,
the access to fully annotated data on a large scale remains a challenge. The
IRMAS dataset [4] is a polyphonic dataset with mixed genres; however, it targets predominant instrument recognition and is therefore not suitable for multi-instrument classification. The MedleyDB [3] and Mixing Secrets [14] datasets are both multi-track datasets with strong annotations of instrument activity. However, only a few hundred songs are available, creating potential issues not only with respect to the dataset size itself but also regarding data distribution and diversity. The OpenMIC dataset for polyphonic instrument recognition, published by Humphrey et al. [21], presents a reasonably large sample size across various genres. Unfortunately, a considerable number of labels are missing, as not all clips are labeled with all 20 instruments due to the crowd-sourced annotation process. Slakh2100 is another large music dataset, containing mixed tracks of 34 instrument categories with complete annotations. However, it is synthesized and rendered
from MIDI files instead of real recordings.

Transfer learning is an important tool
in machine learning when facing the fundamental problem of insufficient training
data. It aims to transfer the knowledge from a source domain to the target domain
by relaxing the assumption that the training data and the test data must be
independent and identically distributed [32]. It is based on the idea that a powerful
representation, learned for a source task with large datasets, can be adapted
to a related but not identical target task that lacks training data. For example,
Kong et al. [24] present the adaptation of pre-trained models such as ResNet [18] and MobileNet [20] to audio tagging tasks, showing the generalizability of systems pre-trained on large-scale datasets to audio pattern recognition. Choi et al. [5] show that representations pre-trained on the music tagging task can
be successfully transferred to various music classification tasks and can lead
to competitive results. Pons and Serra [30] introduced the “musicnn” representations, featuring a set of CNNs pre-trained on the music audio tagging task. In the context of MIC, Gururani et al. [16] successfully adopt the VGGish pre-trained
representation [19] as the input features for their attention-based model.
The promising results of transfer learning outlined above led to Tsai et al.
framing two new research questions [33]: (i) “is finetuning a pre-trained model necessary for learning a new task?” and (ii) “can transfer learning be expanded where nothing but only the input-output model responses are observable?” The attempt to answer these questions, inspired by ideas from adversarial approaches, led to the idea of “reprogramming.” Adversarial machine learning aims at manipulating the prediction of a well-trained deep learning model by designing and learning perturbations to the data inputs without changing the target model [2, 8, 33]. The success of these approaches shows that the classifier output can be
changed just by modifying its input, and thus suggests that such methods might
be applicable in a non-adversarial context by modifying the input of a pre-trained
model to “adapt” the model to a target task. This leads to the concept of model
reprogramming, also referred to as adversarial reprogramming [8], which is a technique that aims at leveraging the knowledge of a model pre-trained for a different task by pre-processing the input and post-processing the output. Elsayed et al. [8] showed that pre-trained ImageNet models can be reprogrammed to classify MNIST and CIFAR-10 by adding learnable parameters around the input image. Tsai et al. [33] demonstrate the advantage of reprogramming on label-limited data such as biomedical image classification, and combine reprogramming
with multi-label mapping. Model reprogramming has also been used in tasks
other than image classification such as natural language processing. For example,
Hambardzumyan et al. [17] propose Word-level Adversarial ReProgramming (WARP) for language understanding by adding learnable tokens to the input sequence. The evaluation shows that WARP outperforms other models while using fewer parameters. Neekhara et al. [28] demonstrate re-purposing character-level classification models for sentiment classification tasks by implementing a trainable adversarial sequence around the input tokens. Yang et al. [35] applied reprogramming to acoustic models for time-series classification on the UCR Archive benchmark [6]. The target time series is padded and reprogrammed to serve as input to the acoustic model.
The model achieves state-of-the-art accuracy on 20 out of 30 datasets with
considerably fewer trainable parameters than the pre-trained models.
The presented reprogramming methods achieve promising results with simple
learnable reprogramming operations. Therefore, we can conclude that reprogramming is an effective new transfer learning approach inspired by adversarial methods that could address data insufficiency problems while requiring lower training complexity.
3 Proposed Method
3.1 Pre-trained Model
The criteria for choosing the pre-trained model to be used as the black-box model in our reprogramming setup were that the model (i) offers state-of-the-art performance in audio classification, (ii) is trained on a comparable but different
task, (iii) is preferably an attention-based model to make it better suited to work
on the weakly labeled OpenMIC dataset (see below), (iv) is of sufficiently high
complexity to learn a powerful representation, and (v) has been trained on a
large number of data points.
Given these criteria, the Audio Spectrogram Transformer (AST) [12, 13]
was selected. AST is a convolution-free, purely attention-based model with an
audio spectrogram input, achieving state-of-the-art results on AudioSet [11], ESC-50 [29], and Speech Commands V2 [34]. Choosing the AST model trained
on AudioSet provides us with a pre-trained model that should be suitable for
music audio. The input audio is pre-processed into 128-dimensional Log-Mel
spectrogram features computed with a 25 ms von-Hann window every 10 ms. The
spectrogram is split into a sequence of 16×16 patches with overlap, and then
linearly projected and added to a learnable positional embedding as the input of
the transformer encoder. For the AST pre-trained on AudioSet, the number of
output classes is 527.
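For orientation, this front end can be sketched with torchaudio's Kaldi-compatible filterbank (a rough approximation of the AST preprocessing; normalization, padding to a fixed number of frames, and the patching itself are omitted here):

```python
import torch
import torchaudio

def log_mel_features(waveform: torch.Tensor, sr: int = 16000) -> torch.Tensor:
    """Log-mel features in the spirit of the AST front end: 128 mel bins,
    25 ms window, 10 ms hop (a sketch, not the exact AST pipeline)."""
    return torchaudio.compliance.kaldi.fbank(
        waveform,                # (1, num_samples), mono
        sample_frequency=sr,
        num_mel_bins=128,        # F mel bands
        frame_length=25.0,       # window length in ms
        frame_shift=10.0,        # hop size in ms
        window_type="hanning",   # von-Hann window
        htk_compat=True,
        use_energy=False,
        dither=0.0,
    )                            # (T, 128): time frames x mel bands
```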
3.2 Reprogramming
The overall structure follows the flow-chart presented in Figure 1 with the
reprogramming stage, the pre-trained model, and the output label mapping stage.
To explore the potential and performance of reprogramming, we investigate
various forms of input and output reprogramming in this work.
Input Reprogramming
The input reprogramming step aims to find a trainable
input modification that can be applied to inputs universally to transform them
into an input representation useful for the pre-trained model. Previously proposed
methods include the superposition of noise on the input, also known as adversarial reprogramming [8, 33]. We investigate this approach, but we also propose a novel
method to extend this simple superposition by transforming the input spectrogram
by means of a neural network. In this way, a well-crafted perturbation of the
input might improve “compatibility” with the pre-trained model.
Noise Reprogramming: Previous reprogramming methods add a learnable
noise component to the input to translate the target data to the source domain of the pre-trained model. This is similar to what many approaches to adversarial attacks do [8, 33]. Unlike previous methods, we choose to add the noise to the spectrogram and not to the time-domain signal. We hypothesize that a spectrogram is a more fitting representation because of the higher complexity of music audio compared to other signals reprogramming has been applied to. The operation can be formulated as $\hat{X} = X + N$, where $X$ is the input spectrogram of size $T \times F$, $T$ is the number of time bins, and $F$ is the number of mel-band filters. $\hat{X}$ is the input of the pre-trained model. The learnable noise $N$ is universal to all target data and independent of $X$; its dimension is identical to that of $X$, the size of the input spectrogram.
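A minimal PyTorch sketch of this operation could look as follows (the module name and the default tensor sizes are illustrative assumptions, not the exact implementation):

```python
import torch
import torch.nn as nn

class NoiseReprogramming(nn.Module):
    """Sketch of additive noise reprogramming: one learnable T x F perturbation
    shared by all inputs (shapes are illustrative assumptions)."""
    def __init__(self, t_bins: int = 1024, f_bins: int = 128):
        super().__init__()
        # universal, input-independent noise N with the same shape as X
        self.noise = nn.Parameter(torch.zeros(t_bins, f_bins))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, F) log-mel spectrogram; returns X_hat = X + N
        return x + self.noise
```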
CNN Reprogramming: The application of CNNs to audio spectrograms is considered a standard baseline in many audio classification tasks. Therefore, we propose to use trainable CNN layers as a transformation of the input data. These layers replace the noise superposition as input processing. The transformation can be formally described by $\hat{X} = \mathcal{F}(X)$, in which $\mathcal{F}$ represents the CNN, consisting of two 2D convolutional layers with a receptive field of $3 \times 3$, a stride of $1 \times 1$, and a padding size of $1 \times 1$. Note that in this special case, no max-pooling is applied, as the output dimension matches the input dimension of the pre-trained model. The difference between Noise Reprogramming and CNN Reprogramming is that the CNN layers apply a learnable transformation to the input itself instead of simply adding learnable noise that is independent of the input. Due to weight sharing in the CNN, the only training parameters needed are the CNN kernels; hence, the number of parameters is considerably smaller than for Noise Reprogramming.
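The corresponding transformation could be sketched as follows (the number of hidden channels and the non-linearity between the two layers are assumptions; the text above only specifies two 3×3 convolutions with stride 1 and padding 1):

```python
import torch
import torch.nn as nn

class CNNReprogramming(nn.Module):
    """Sketch of CNN input reprogramming: two 3x3 convolutions (stride 1,
    padding 1) that keep the T x F spectrogram size unchanged."""
    def __init__(self, hidden_channels: int = 8):  # channel width is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, hidden_channels, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),  # assumed non-linearity, not specified in the text
            nn.Conv2d(hidden_channels, 1, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, F) -> (batch, 1, T, F) -> transform -> (batch, T, F)
        return self.net(x.unsqueeze(1)).squeeze(1)  # X_hat = F(X)
```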
U-Net Reprogramming: The idea behind using CNN Reprogramming is that
the input audio spectrogram can be transformed into a suitable and compatible
spectrogram for the pre-trained model. In order to provide more flexibility for
the reprogramming learning, it is reasonable to consider features at different time resolutions. Therefore, in addition to CNN reprogramming, we further propose U-Net reprogramming for MIC. U-Net is a CNN structure first developed for biomedical image segmentation [31] that has become popular in both speech and music separation tasks. The architecture consists of a contraction path to capture context and a symmetric expansion path to reconstruct the extracted features back to the input resolution [31]. With convolutional layers and skip connections, U-Net is able to represent the input both with high-level features on coarser time-frequency scales and with detailed features in deep CNN layers. These features are combined using bilinear upsampling blocks, yielding multi-scale features with both high-level and deep representations [23]. The U-Net’s success in music source separation tasks might also imply that U-Net processing of a spectrogram is effective at differentiating instrument content. To the best of our knowledge, this is the first time a U-Net structure has been proposed for reprogramming.
The proposed U-Net structure for reprogramming is shown in Figure 2. The
formulation is identical to the CNN reprogramming described above, as the transform function is applied to the input. The network consists of three convolutional layers in both the contraction path and the expansion path, and each convolutional layer is followed by a batch normalization layer and a ReLU activation. A 2×2 max-pooling is applied for the convolutional layers in the contraction path, while upsampling is applied for those in the expansion path.
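A compact sketch of such a U-Net reprogramming module is given below; the channel widths, the exact placement of the skip connections, and the final single-channel projection are assumptions, since only three convolutional blocks per path, batch normalization, ReLU, 2×2 max-pooling, and bilinear upsampling are specified above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    # 3x3 convolution followed by batch normalization and ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
    )

class UNetReprogramming(nn.Module):
    """Sketch of U-Net input reprogramming: three conv blocks in the
    contraction path, three in the expansion path, 2x2 max-pooling,
    bilinear upsampling, and skip connections."""
    def __init__(self, ch: int = 4):  # channel width is an assumption
        super().__init__()
        self.enc1 = conv_block(1, ch)
        self.enc2 = conv_block(ch, 2 * ch)
        self.enc3 = conv_block(2 * ch, 4 * ch)
        self.dec1 = conv_block(4 * ch + 2 * ch, 2 * ch)
        self.dec2 = conv_block(2 * ch + ch, ch)
        self.dec3 = nn.Conv2d(ch, 1, kernel_size=3, padding=1)  # back to one channel

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.unsqueeze(1)                                  # (batch, 1, T, F)
        e1 = self.enc1(x)                                   # full resolution
        e2 = self.enc2(F.max_pool2d(e1, 2))                 # 1/2 resolution
        e3 = self.enc3(F.max_pool2d(e2, 2))                 # 1/4 resolution
        d1 = self.dec1(torch.cat(
            [F.interpolate(e3, size=e2.shape[-2:], mode="bilinear",
                           align_corners=False), e2], dim=1))
        d2 = self.dec2(torch.cat(
            [F.interpolate(d1, size=e1.shape[-2:], mode="bilinear",
                           align_corners=False), e1], dim=1))
        return self.dec3(d2).squeeze(1)                     # X_hat, same T x F
```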
Output Reprogramming
The output categories of the pre-trained system
obviously do not match the music instrument classes to be classified. Therefore,
the outputs of the pre-trained model have to be mapped to the target labels.

Fig. 2. U-Net structure for input reprogramming: a contraction path and an expansion path connected by skip connections.

For example, Tsai et al. [33] and Yang et al. [35] propose to use a many-to-one label mapping. For
each target label, its class prediction will be the averaged class predictions over
the set of source labels assigned to it. We investigate this approach, but also
propose a new output mapping utilizing fully-connected (FC) layers to fit the
targets. Here, the output probabilities are mapped to the target labels by the FC
layer (FCL) and a sigmoid activation function. The mapping can thus be learned
during the training phase. Note that the original last-layer activation of the AST is removed in our model.
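A minimal sketch of this learned output mapping (the class counts correspond to AudioSet and OpenMIC; the module name is illustrative):

```python
import torch
import torch.nn as nn

class OutputMapping(nn.Module):
    """Sketch of the learned label mapping: a fully-connected layer from the
    527 AudioSet outputs of the AST to the 20 OpenMIC instruments, followed
    by a sigmoid for multi-label prediction."""
    def __init__(self, num_source: int = 527, num_target: int = 20):
        super().__init__()
        self.fc = nn.Linear(num_source, num_target)

    def forward(self, source_logits: torch.Tensor) -> torch.Tensor:
        # source_logits: (batch, 527) raw AST outputs (final activation removed)
        return torch.sigmoid(self.fc(source_logits))  # (batch, 20) probabilities
```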
4 Experimental Setup
To evaluate the impact of different input and output reprogramming options, the
results for the following systems will be reported:
- Noise Reprogramming (AST-NRP): pre-trained model with added noise at the input and with output FCL label mapping,
- CNN Reprogramming (AST-CNNRP): pre-trained model with CNN input processing and with output FCL label mapping, and
- U-Net Reprogramming (AST-URP): pre-trained model with U-Net input processing and with output FCL label mapping.
In addition to the three methods introduced above (AST-NRP, AST-CNNRP,
AST-URP), we also add the following systems for comparison: (i) the pre-trained
AST (AST-BS) without input reprogramming as a baseline to evaluate the power
of the AST representation for MIC, (ii) a CNN baseline (CNN-BS) with roughly
the same number of training parameters as our proposed input transformation
methods AST-CNNRP and AST-URP, (iii) a transfer learning approach by
fine-tuning the AST system with the target data (AST-TL), (iv) the previous
state-of-the-art Mean Teacher (MT) model by Gururani and Lerch [15], and (v) the Random Forest (RF) baseline released with the OpenMIC dataset [21].
The implementation of the proposed methods is publicly available at github.com/hchen605/ast_inst_cls (last accessed: Nov 9, 2022).
4.1 Dataset
We use the OpenMIC dataset for the experiments in this paper. OpenMIC is the
first open multi-instrument music dataset with a comparably diverse set of musical
instruments and genres, addressing issues in other previously existing datasets
for MIC [21].

Fig. 3. OpenMIC: positive vs. negative label counts for each of the 20 instruments.

It consists of 20,000 audio clips, each of 10 s length. Every clip is labeled with the presence (positive) or absence (negative) of at least one of the 20 musical instruments, and each instrument class has at least 500 confirmed positives and at least 1,500 confirmed labels in total. Note that if the dataset were fully labeled, it would come with 20,000 clips · 20 instruments = 400,000 labels; however, the actual number of labels in the dataset is 41,268, meaning that approximately 90% of the labels are missing. Moreover, each 10 s clip has instrument presence or
absence tags without specifying onset and offset times, also referred to as weak
labels. Therefore, models cannot be trained using fine-grained instrument activity
annotation. Figure 3 visualizes the overall label distribution of OpenMIC. We can
observe the unbalanced nature; commonly seen instruments, such as piano, voice,
and violin, generally have more positive labels than the others. Another possible reason for the larger number of positive labels for these common instruments might be that crowd-sourced annotators were more familiar with these common timbres. The models investigated in this study are trained to identify the presence or absence of musical instruments in the OpenMIC dataset. The publicly available data splits are used for training and testing: approx. 25% of the data are used for testing,
and 15% of the training data is sampled randomly to form the validation set.
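Since roughly 90% of the clip-instrument labels are unknown, one common way to train on such data is to restrict the binary cross-entropy loss to the observed labels, in the spirit of the partial BCE of [16]. A sketch under the assumption that label observations are encoded as a binary mask:

```python
import torch
import torch.nn.functional as F

def masked_bce(pred: torch.Tensor, target: torch.Tensor,
               label_mask: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy that ignores missing labels (a sketch; the mask
    encoding is an assumption, not the paper's exact implementation).

    pred:       (batch, 20) predicted probabilities
    target:     (batch, 20) 1 = instrument present, 0 = absent
    label_mask: (batch, 20) 1 = label observed, 0 = label missing
    """
    loss = F.binary_cross_entropy(pred, target.float(), reduction="none")
    mask = label_mask.float()
    # average only over the observed labels
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```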
4.2 Training Procedure and Evaluation Metrics
The reprogramming model is trained with a batch size of 8, and the Adam
optimizer with binary cross-entropy loss is used. We apply an initial learning
rate of 5e-5 for the first 10 epochs; after that, the learning rate is halved every 5 epochs until reaching 50 epochs.
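One plausible reading of this schedule in PyTorch is sketched below (the optimizer is shown with a placeholder parameter; model, data loading, and the loss computation are omitted):

```python
import torch
import torch.nn as nn

# Sketch of the learning-rate schedule: 5e-5 for the first 10 epochs, then
# halved every 5 epochs until epoch 50 (one plausible reading, not necessarily
# the exact implementation).
params = [nn.Parameter(torch.zeros(1))]   # placeholder for the trainable
                                          # reprogramming + mapping parameters
optimizer = torch.optim.Adam(params, lr=5e-5)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=list(range(10, 50, 5)), gamma=0.5)

for epoch in range(50):
    # ... one training epoch with batch size 8 and (masked) BCE loss ...
    optimizer.step()      # placeholder step so the scheduler follows an optimizer step
    scheduler.step()
```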
We are interested in both the classification performance and the model
complexity of each investigated system. To be comparable to previously reported
results on the MIC task, we report the macro F1 score, calculated as the unweighted average of the per-instrument F1 scores (i.e., ignoring the per-class number of examples). The final result for each setup is the macro F1 score averaged over 10 experiments. In addition to this classification metric, the number
of training parameters of each of the evaluated systems will be reported.
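As a reference for the metric, the macro F1 score is the unweighted mean of the per-instrument F1 scores; a toy example with made-up multi-label predictions:

```python
import numpy as np
from sklearn.metrics import f1_score

# Macro F1: unweighted mean of the per-instrument F1 scores
# (3 clips x 3 instruments, illustrative values only).
y_true = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0]])
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"macro F1: {100 * macro_f1:.1f}%")
```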
Fig. 4. Macro F1 scores of the evaluated methods (reported values range from 60.7% to 81.6%).
4.3 Results and Discussion
Input Reprogramming
The overall average macro F1 scores are shown in
Figure 4. We note that the previous state-of-the-art MT [15] is reported with the result 81.3% as stated in its paper, and the RF baseline [21] is reported with the
result 78.3% calculated from the released Python Notebook. We can make the
following observations. First, simply using the pre-trained AST without input
processing does not lead to convincing results, although at around 62% the result
is considerably higher than guessing (50%). The performance of the un-tuned
system roughly matches the performance of the simple CNN-BS trained on data
for the task, indicating that AST has learned a powerful and useful representation.
Second, a trained noise signal that is simply added to the input can improve this
baseline performance by more than 6%, but the resulting system remains far from
the performance of state-of-the-art systems. This implies that while traditional reprogramming approaches can work to a certain degree, the complexity and variability of music signals require an input-adaptive transform as opposed to the addition of a constant signal. Third, the results for AST-TL show the effectiveness of transfer learning, with performance roughly on par with the state of the art.
This result emphasizes that transfer learning is a powerful tool with competitive
results for weakly-labeled data. Fourth, both the AST-CNNRP and AST-URP
pre-processing steps dramatically improve classification performance over the
AST-NRP, with the U-Net-based reprogramming performing about 4% better than the CNN. We also note that transfer learning with fine-tuning results in only a limited improvement over the reprogramming methods, implying that the AST representation is suitable for this task even without fine-tuning. Fifth, AST-URP slightly outperforms all presented systems and even the state-of-the-art MT system.
Table 1. Comparison of model training parameters.

Method       AST-BS  CNN-BS  AST-TL  AST-NRP  AST-CNNRP  AST-URP  MT
#Param. (M)   0.017   0.017  87.873    0.148      0.017    0.018  0.111

Fig. 5. Instrument-wise F1 score comparison between the reprogramming methods (AST-BS, AST-NRP, AST-CNNRP, AST-URP).

Diving deeper into the detailed results, Figure 5 displays the instrument-wise
F1 scores. We can observe a lot of variation in classification performance over
instruments. These variations are related to the number of positive labels as shown in Figure 3, a clear indicator that the number of (positive) labels directly influences the classification performance. In fact, the correlation coefficient between the number of positive labels per instrument and the AST-BS F1 score is 0.75, indicating a high correlation. We can see that the U-Net reprogramming model AST-URP
outperforms all other models for all instruments, and we observe considerable
performance gains especially for the instrument classes with few positive labels.
For example, accordion has the lowest number of positive labels at around 500, and the AST-URP improvement is over 25%. This demonstrates the capability of the proposed reprogramming method to mitigate the data scarcity issue by leveraging powerful representations learned for other tasks.
The results support our assumption that an input transformation utilizing
both high-level and low-level features benefits reprogramming. Overall, we can
see that both AST-CNNRP and AST-URP are effective approaches to adapt a
pre-trained model to a new task. This encourages future research on reprogramming
pre-trained models with other methods of transformation.
Complexity Analysis
One of the main advantages of reprogramming is reduced
training complexity as the pre-trained model remains unchanged. As an indicator
of model complexity, Table 1 reports the number of training parameters along
with the F1 score previously visualized in Figure 4. In terms of complexity, it is
clear that the training complexity of reprogramming is very low. We can observe that MT, the previous state-of-the-art system using a Mean Teacher approach with a consistency loss, has a complexity one order of magnitude higher than either our CNN-based approach or the high-performing U-Net approach, even though it uses the VGGish representation as input to its system. Compared to the number of AST training parameters, the proposed AST-URP model has only a fraction of the training parameters (0.02%) but is still the best-performing system.
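The complexity measure in Table 1 is the number of trainable parameters, i.e., the parameters of the input transformation plus the output mapping; a helper along these lines could be used to reproduce such counts (the instantiation in the comment uses the illustrative sketches from Sect. 3.2, not the exact configurations behind Table 1):

```python
import torch.nn as nn

def count_trainable(module: nn.Module) -> int:
    """Count the trainable parameters of a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Illustrative use with the earlier sketches (sizes are assumptions):
# count_trainable(UNetReprogramming()) + count_trainable(OutputMapping())
```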
5 Conclusion
In this work, we propose to apply model reprogramming to the task of music
instrument classification. We extend the existing reprogramming approaches
by utilizing a novel U-Net-based input reprogramming method. By leveraging
the power of a pre-trained audio spectrogram transformer model, we show that our model can achieve state-of-the-art performance at a fraction of the training complexity of other models. We provide detailed results on the impact of
different input and output reprogramming approaches. Without applying data
augmentation or advanced training schemes such as semi-supervised MT learning,
reprogramming a pre-trained model is a low-complexity approach to achieving
state-of-the-art performance. Although the workload during the inference stage
remains unchanged, the low training complexity of this new transfer learning
approach in combination with our promising results opens up a multitude of possible use cases in MIR tasks and in other fields with insufficient data.
In future work, we plan to explore variations of the input reprogramming
stages beyond CNN or U-Net structures. Furthermore, we plan to test the
reprogramming algorithm with various other pre-trained models from other audio
and non-audio tasks. We believe that reprogramming is a promising approach
with the potential to be used in many other MIR tasks.
References
1. Benetos, E., Kotti, M., Kotropoulos, C.: Musical instrument classification using non-negative matrix factorization algorithms and subset feature selection. In: ICASSP (2006)
2. Biggio, B., Roli, F.: Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognit. 84, 317–331 (2018)
3. Bittner, R.M., Salamon, J., Tierney, M., Mauch, M., Cannam, C., Bello, J.P.: MedleyDB: A multitrack dataset for annotation-intensive MIR research. In: ISMIR (2014)
4. Bosch, J.J., Janer, J., Fuhrmann, F., Herrera, P.: A comparison of sound segregation techniques for predominant instrument recognition in musical audio signals. In: ISMIR (2012)
5. Choi, K., Fazekas, G., Sandler, M.B., Cho, K.: Transfer learning for music classification and regression tasks. In: ISMIR (2017)
6. Dau, H.A., Bagnall, A., Kamgar, K., Yeh, C.C.M., Zhu, Y., Gharghabi, S., Ratanamahatana, C.A., Keogh, E.: The UCR time series archive. IEEE/CAA Journal of Automatica Sinica 6(6), 1293–1305 (2019)
7. Defferrard, M., Benzi, K., Vandergheynst, P., Bresson, X.: FMA: A dataset for music analysis. In: ISMIR (2017)
8. Elsayed, G.F., Goodfellow, I., Sohl-Dickstein, J.: Adversarial reprogramming of neural networks. In: ICLR (2019)
9. Eronen, A., Klapuri, A.: Musical instrument recognition using cepstral coefficients and temporal features. In: ICASSP (2000)
10. Essid, S., Richard, G., David, B.: Hierarchical classification of musical instruments on solo recordings. In: ICASSP (2006)
11. Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: ICASSP (2017)
12. Gong, Y., Chung, Y.A., Glass, J.R.: AST: Audio Spectrogram Transformer. In: Interspeech (2021)
13. Gong, Y., Lai, C.I., Chung, Y.A., Glass, J.R.: SSAST: Self-supervised Audio Spectrogram Transformer. In: AAAI (2021)
14. Gururani, S., Lerch, A.: Mixing Secrets: A multi-track dataset for instrument recognition in polyphonic music. In: ISMIR (2017)
15. Gururani, S., Lerch, A.: Semi-supervised audio classification with partially labeled data. In: IEEE ISM (2021)
16. Gururani, S., Sharma, M., Lerch, A.: An attention mechanism for musical instrument recognition. In: ISMIR (2019)
17. Hambardzumyan, K., Khachatrian, H., May, J.: WARP: Word-level Adversarial ReProgramming. In: IJCNLP (2021)
18. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
19. Hershey, S., Chaudhuri, S., Ellis, D.P., Gemmeke, J.F., Jansen, A., Moore, R.C., Plakal, M., Platt, D., Saurous, R.A., Seybold, B., et al.: CNN architectures for large-scale audio classification. In: ICASSP (2017)
20. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. CVPR (2017)
21. Humphrey, E., Durand, S., McFee, B.: OpenMIC-2018: An open data-set for multiple instrument recognition. In: ISMIR (2018)
22. Hung, Y., Yang, Y.: Frame-level instrument recognition by timbre and pitch. In: ISMIR (2018)
23. Jansson, A., Humphrey, E.J., Montecchio, N., Bittner, R.M., Kumar, A., Weyde, T.: Singing voice separation with deep U-Net convolutional networks. In: ISMIR (2017)
24. Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing (2020)
25. Li, P.Q., Qian, J., Wang, T.: Automatic instrument recognition in polyphonic music using convolutional neural networks. CoRR abs/1511.05520 (2015)
26. Lostanlen, V., Cella, C.E.: Deep convolutional networks on the pitch spiral for music instrument recognition. In: ISMIR (2016)
27. Nagawade, M.S., Ratnaparkhe, V.R.: Musical instrument identification using MFCC. In: RTEICT (2017)
28. Neekhara, P., Hussain, S., Dubnov, S., Koushanfar, F.: Adversarial reprogramming of text classification neural networks. In: EMNLP-IJCNLP (2019)
29. Piczak, K.J.: ESC: Dataset for environmental sound classification. In: ACM MM, pp. 1015–1018. ACM (2015)
30. Pons, J., Serra, X.: musicnn: Pre-trained convolutional neural networks for music audio tagging. ISMIR (2019)
31. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: MICCAI (2015)
32. Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., Liu, C.: A survey on deep transfer learning. In: ICANN. Springer (2018)
33. Tsai, Y.Y., Chen, P.Y., Ho, T.Y.: Transfer learning without knowing: Reprogramming black-box machine learning models with scarce data and limited resources. In: ICML. PMLR (2020)
34. Warden, P.: Speech Commands: A dataset for limited-vocabulary speech recognition. CoRR abs/1804.03209 (2018)
35. Yang, C.H.H., Tsai, Y.Y., Chen, P.Y.: Voice2Series: Reprogramming acoustic models for time series classification. In: ICML. PMLR (2021)