Five types of visual representation features.

Source publication
Article
Full-text available
Music genre recognition (MGR) plays a fundamental role in the context of music indexing and retrieval. Unlike images, music genres consist of immediate characteristics that are highly diversified, with abstractions at different levels. However, most representation learning methods for MGR focus on global features and make decisions from features in...

Contexts in source publication

Context 1
... feature is a learned feature extracted by a CNN with transfer learning. Figure 1 shows five types of time-frequency representations. As in other works, we sample at 22,050 samples per second for feature extraction. ...
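For concreteness, the sampling and feature-extraction step described in this context could look like the librosa sketch below. This is an illustrative reconstruction, not the authors' code: the file name is a placeholder, default analysis parameters are assumed, and the scatter transform and transfer features are omitted since they require extra tooling.

```python
# A minimal sketch (assumed parameters) of extracting several of the
# time-frequency representations named above with librosa.
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=22050)        # 22,050 samples per second, as in the text
mel = librosa.feature.melspectrogram(y=y, sr=sr)  # Mel-spectrogram
cqt = np.abs(librosa.cqt(y, sr=sr))               # constant-Q spectrogram
D = librosa.stft(y)
H, P = librosa.decompose.hpss(D)                  # harmonic/percussive separation
harmonic, percussive = np.abs(H), np.abs(P)       # harmonic and percussive spectrograms
```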
Context 2
... is the most loosely distributed. As shown in Figure 10, the distributions of "Wcswing", "Pasodoble", and "Quickstep" are the most compact for the Extended Ballroom dataset, while the distributions of "Salsa" and "Slowwaltz" are the loosest. Overall, the overlap between classes is insignificant for all three datasets. ...
Context 3
... show that our deep model does not overfit on these small datasets. Furthermore, we feed the best feature combination to the meta-classifier and visualize the training process of the model on these three datasets in Figure 11. Figure 11 shows that the proposed model does not overfit and yields good test accuracies (i.e. ...
Context 4
... we feed the best feature combination to the meta-classifier and visualize the training process of the model on these three datasets in Figure 11. Figure 11 shows that the proposed model does not overfit and yields good test accuracies (i.e., good generalization capability). ...

Citations

... Actions using this method are restricted to the same monotonous level. Hence, W.Y. Ng et al. [40] proposed a novel method that fuses the Network Vector of Locally Aggregated Descriptors (NetVLAD) model, a technique for local representation, with self-attention. The former is used to capture data across various levels, and the latter is used to learn their long-term dependencies. ...
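For readers unfamiliar with NetVLAD, a minimal PyTorch sketch of its soft-assignment-plus-residual aggregation is given below; it follows the original NetVLAD formulation, and the cluster count and descriptor dimension are placeholders rather than values from the cited model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Minimal NetVLAD layer: soft-assign local descriptors to clusters and
    aggregate their residuals. A generic sketch, not the cited model's code."""
    def __init__(self, num_clusters=32, dim=128):
        super().__init__()
        self.assign = nn.Conv2d(dim, num_clusters, kernel_size=1)  # soft assignment
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x):                                   # x: (B, D, H, W)
        soft = F.softmax(self.assign(x).flatten(2), dim=1)  # (B, K, N)
        feats = x.flatten(2)                                # (B, D, N)
        # sum_n a_k(x_n) * (x_n - c_k), expanded into two terms
        vlad = torch.einsum("bkn,bdn->bkd", soft, feats) \
             - soft.sum(-1).unsqueeze(-1) * self.centroids
        vlad = F.normalize(vlad, dim=2).flatten(1)          # intra-normalize, flatten
        return F.normalize(vlad, dim=1)                     # final L2 normalization
```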
Article
Full-text available
Over the past decade, the advent of streaming services has led to rapid growth of the music industry. With a plethora of available song choices, there is a dire need for recommendation techniques to help listeners discover music genres complementing their palate, creating a vital need for automatic music genre categorization systems. With this objective, this work introduces a fusion of direct and indirect features for the automatic categorization of music genres. In direct Feature Extraction (FE), the physical characteristics of music genres are assessed by timbral, chroma, and source separation-based features. In indirect FE, the tunable Q-wavelet transform and the Teager energy operator are used to explore the non-linear characteristics of music signals. The proposed algorithm is examined on the GTZAN dataset, primarily focusing on the four-class classification problem. The introduced features are tested with multiple machine learning techniques to find the best for music genre categorization. A wide neural network classifier with a single fully connected layer delivered the best performance, achieving an overall accuracy of 95.8% and an F1 score of 95.82%. The proposed algorithm also outperforms most state-of-the-art techniques on the given dataset.
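The indirect features above rest on the Teager energy operator, whose standard discrete form is simple enough to state as code. A minimal sketch follows (applying it to TQWT subbands, as the abstract describes, would be a separate preceding step).

```python
import numpy as np

def teager_energy(x: np.ndarray) -> np.ndarray:
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1].
    Standard definition; the paper's exact pipeline may differ."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]
```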
... A novel combination model that fuses harmonic and instrumental information was also proposed [45], showing that jointly processing both types of information improves accuracy. A multi-level feature coding network using a CNN with self-attention and NetVLAD learned high-level features for each low-level feature [46]. The proposed model achieved a high classification accuracy of 96.50% on the GTZAN dataset. ...
... While recent CNN-based methods [17], [31], [41], [46], [47] have exhibited state-of-the-art performance on MGC datasets, all except one [47] employ an early fusion strategy [66], wherein all features are combined and fed into the models as input data. Meanwhile, in the studies by [42], [47], a different approach is used, where input data is separated into visual, spectral, and acoustic features. ...
Article
Full-text available
The goal of music genre classification is to identify the genre of given feature vectors representing certain characteristics of music clips. In addition, to improve the accuracy of music genre classification, considerable research has been conducted on extracting spectral features, which contain critical information for genre classification, from music clips and feeding these features into training models. In particular, recent studies argue that classification accuracy can be enhanced by employing multiple spectral features simultaneously. Consequently, fusing information from multiple spectral features is a critical consideration in designing music genre classification models. Hence, this paper provides a short survey of recent studies on music genre classification and compares the performance of the most recent CNN-based models with a newly devised model that employs a late fusion strategy for the multiple spectral features. Our empirical study of 12 public datasets, including Ballroom, ISMIR04, and GTZAN, showed that the late fusion CNN model outperforms other compared methods. Additionally, we performed an in-depth analysis to validate the effectiveness of the late fusion strategy in music genre classification.
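To make the late fusion strategy concrete, here is a minimal PyTorch sketch in which each spectral feature gets its own small CNN branch and the per-branch predictions are averaged at the end; the branch architecture and the averaging rule are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

def make_branch(num_classes: int) -> nn.Module:
    # Placeholder CNN for one 2D spectral feature (assumed architecture).
    return nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(16, num_classes),
    )

class LateFusion(nn.Module):
    """Late fusion: one branch per spectral feature, merged at the decision level."""
    def __init__(self, n_features: int, num_classes: int):
        super().__init__()
        self.branches = nn.ModuleList(make_branch(num_classes) for _ in range(n_features))

    def forward(self, feats):                 # feats: list of (B, 1, F, T) tensors
        logits = [b(f) for b, f in zip(self.branches, feats)]
        return torch.stack(logits).mean(0)    # average the per-branch predictions
```

An early-fusion baseline, by contrast, would stack the features channel-wise at the input and feed them through a single network.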
... A spectrogram is a two-dimensional image that reflects the essence of a sound and fully captures its time-domain and frequency-domain information [23]. The spectrograms of different sounds have different characteristics, which is the premise of sound recognition using deep learning image processing methods. ...
Article
Full-text available
National music is a treasure of Chinese traditional culture. It contains the cultural characteristics of various regions and reflects the core values of Chinese traditional culture. Classification technology organizes large numbers of unlabeled, unorganized drama documents and, to some extent, helps folk music better enter the lives of ordinary people. Folk music of different spectra was simulated and the corresponding audio recorded under laboratory conditions. Through Fourier transform and other methods, the music audio was converted into spectrograms, yielding a total of 2,608 two-dimensional spectrogram images as the dataset. The spectrogram dataset was imported into the deep convolutional neural network GoogLeNet for music type recognition, reaching a test accuracy of 99.6%. In addition, a parallel GoogLeNet technique based on inverse autoregressive flow was used; its distinctive improvement is that acoustic features can be quickly converted into the corresponding speech time-domain waveforms at a real-time level, improving the efficiency of model training and loading and outputting speech with higher naturalness. To further verify the reliability of the experimental results, the spectrogram datasets were imported into ResNet18 and ShuffleNet for training, obtaining a test accuracy of 99.2%. The results show that this method can classify and recognize music effectively and accurately. The research realizes the recognition of national music through deep-learning spectrogram classification for the first time, offering an intelligent and fast new method of classification and recognition.
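A minimal sketch of the spectrogram-image setup described above, using torchvision's GoogLeNet with a replaced classification head; the class count, pretrained weights, and fine-tuning details are assumptions.

```python
import torch.nn as nn
from torchvision.models import googlenet

n_classes = 10                                          # placeholder; not stated above
model = googlenet(weights="DEFAULT")                    # ImageNet-pretrained backbone
model.fc = nn.Linear(model.fc.in_features, n_classes)   # new head for genre labels
# Fine-tune on spectrogram images resized to 224x224, as with any ImageNet model.
```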
... Ng et al. [78] integrated a CNN and self-attention to capture the local information across levels of songs and learn their long-term dependencies. A meta-classifier was then used to make the final music genre classification by learning from high-level features aggregated from different local feature coding networks. ...
Article
Music genres can reveal our preferences and are one of the main tools for retailers, libraries, and people to organize music. In addition, the music industry uses genres as a key method to define and target different markets; thus, being able to categorize genres is an asset for marketing and music production. Considerable research has been done on classifying Western music genres, yet nothing has been done so far to classify Persian music genres. In this research, a tailored deep neural network-based method, termed PMG-Net, is introduced to automatically classify Persian music genres. To assess PMG-Net, a dataset named PMG-Data, consisting of 500 music tracks from the genres Pop, Rap, Traditional, Rock, and Monody, is collected, labeled, and made publicly available for researchers. The accuracy obtained by PMG-Net on PMG-Data is 86%, indicating acceptable performance compared with existing deep neural network-based approaches.
... The GTZAN Genre dataset mainly divides music genres into 10 categories: blues, country, hip-hop, jazz, pop, disco, classical, rock, reggae, and metal. The ISMIR2004 Genre dataset mainly divides music into six genres: classical, electronic, jazz/blues, metal/punk, rock/pop, and world (Ng et al., 2020). ...
Article
Full-text available
This research explores the application of intelligent music recognition technology in music teaching. Based on Long Short-Term Memory (LSTM) networks, an algorithmic model that can distinguish various music signals and generate music of various genres is designed and implemented. First, by analyzing the application of machine learning and deep learning in the field of music, the algorithmic model is designed to realize intelligent music generation, which provides a theoretical basis for related research. Then, the music style discrimination and generation model is tested on a large amount of music data. The experimental results show that when the model has 4 hidden layers with 1,024, 512, 256, and 128 neurons per layer, the variation in training results is smallest. With the designed model, the classification accuracy for jazz, classical, rock, country, and disco exceeds 60%; jazz is classified best, at 77.5%. Moreover, compared with the traditional algorithm, the frequency distribution of music scores generated by the designed algorithm is almost consistent with the spectrum of the original music. Therefore, the proposed methods and models can distinguish music signals and generate different music, and their discrimination accuracy on different music signals is higher, outperforming the traditional restricted Boltzmann machine method.
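The abstract pins down only the layer sizes (four hidden layers of 1,024, 512, 256, and 128 units); a minimal PyTorch sketch of such a stacked-LSTM classifier follows, with everything else (input features, class count, pooling) assumed.

```python
import torch
import torch.nn as nn

class GenreLSTM(nn.Module):
    """Stacked LSTM with the hidden sizes reported above; otherwise a sketch."""
    def __init__(self, n_features: int, n_classes: int = 5):
        super().__init__()
        sizes = [1024, 512, 256, 128]
        self.lstms = nn.ModuleList()
        in_dim = n_features
        for h in sizes:
            self.lstms.append(nn.LSTM(in_dim, h, batch_first=True))
            in_dim = h
        self.head = nn.Linear(sizes[-1], n_classes)

    def forward(self, x):            # x: (batch, time, n_features)
        for lstm in self.lstms:
            x, _ = lstm(x)
        return self.head(x[:, -1])   # classify from the final time step
```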
... The recognition accuracy is improved by using the information obtained from the offline dataset, including RGB information and the available depth and skeleton information. Ng et al. proposed an optimization method based on machine learning to match dance technical movements and music in music arrangement [19]. First, combining machine learning theory, this method constructs the mapping relationship between dance action and music from a historical sample dataset and obtains an evaluation function for the harmonious relationship between dance action and music. ...
... The experimental data are from the dance motion capture database provided by the Carnegie Mellon University laboratory, and each music segment is about 5 seconds long. Using the method of this study and those of references [17] and [19], the optimization experiment on dance action matching in music choreography is carried out, and the matching degrees (%) of the three methods are compared. Figure 7 shows the comparison results for dance movements in music arrangement. ...
... This aggressive and comprehensive matching approach is more universal and scalable, and it can provide a platform for additional distinct songs and dances besides chime music and dance. ...
Article
Full-text available
Music and dance videos have been popular among researchers in recent years. Music is one of the most important forms of human communication; it carries a wealth of emotional information, and it is studied using computer tools. Despite the relevance of computer interfaces and multimedia technologies to sound and music matching tasks, most current machine learning approaches suffer from information loss or insufficiently extracted features during feature engineering. Multi-feature fusion is widely utilized in education, aerospace, intelligent transportation, biomedicine, and other fields, and it plays a critical part in how humans obtain information. In this research, we offer an effective simulation method for matching dance technique movements with music based on multi-feature fusion. The initial step is to use music beat extraction theory to segment the synchronized dance movements and music data, then locate mutation points in the music and dynamically update the pheromones based on the merits of the dance motions. The audio feature sequence is obtained by extracting audio features from the dance video's accompanying music. We then combine the two sequences to create an entropy value sequence based on audio variations. In a comparison of the consistency of several approaches on dance movement simulation trials, the optimized simulation method described in this research achieves an average consistency of 87%, which is high. As a result, even when the background and the subject are easily confused, the algorithm in this research maintains a consistent recognition rate for more complicated dance background music while still guaranteeing a certain accuracy rate.
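The pipeline above starts from music beat extraction to segment the synchronized dance and music data. A minimal sketch of that first step follows, with librosa's beat tracker standing in for the paper's unspecified extraction method and a placeholder file name.

```python
import librosa

y, sr = librosa.load("dance_music.wav")                   # placeholder input file
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)  # estimate tempo and beat frames
beat_times = librosa.frames_to_time(beat_frames, sr=sr)   # candidate segment boundaries
```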
... However, most of MGR's representation learning methods focus on global features and make decisions from the same set of features. To address these shortcomings, Ng et al. (2020) integrated a convolutional neural network with NetVLAD and self-attention to capture local information across levels and learn their long-term dependencies. A meta-classifier was used for the final MGR classification by learning from high-level features aggregated from different local feature coding networks. ...
Article
Full-text available
To study the application of deep learning (DL) to music genre recognition, this study introduces a music feature extraction method and the deep belief network (DBN), and proposes a DBN-based feature extraction and recognition/classification method for ethnic music genres, with five kinds of ethnic musical instruments as the experimental objects. A national musical instrument recognition and classification network structure based on the DBN is proposed. On this basis, a music library classification and retrieval learning platform is established and tested. The results show that when the DBN contains only one hidden layer with 117 neural nodes, the convergence accuracy is approximately 98%. The first hidden layer has the greatest impact on the prediction results. When the input sample feature size is one-third of the number of nodes in the first hidden layer, the network performance essentially converges. The DBN with a softmax classifier identifies and classifies national musical instruments best, with an accuracy of 99.2%. Therefore, the proposed DL algorithm performs better at identifying music genres.
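As a rough sketch of the single-hidden-layer configuration reported above (117 hidden nodes with a softmax-style classifier on top), one could stack an RBM and a multinomial logistic regression in scikit-learn; the learning rate and iteration count are placeholders, and inputs are assumed scaled to [0, 1] as BernoulliRBM expects.

```python
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

model = Pipeline([
    ("rbm", BernoulliRBM(n_components=117, learning_rate=0.05, n_iter=20)),
    ("clf", LogisticRegression(max_iter=1000)),  # multinomial logistic ~ softmax
])
# model.fit(X_train, y_train); model.score(X_test, y_test)
```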
... A music genre recognition system is developed on the GTZAN dataset, which includes ten different music genres, such as rock, pop, and classical. Ng et al. [6] proposed multi-level local feature coding fusion for music genre recognition, which plays a fundamental role in music indexing and retrieval. ...
Article
Full-text available
When current methods are used to recognize music genre style, the extracted features are not fused, which leads to poor recognition effectiveness. Therefore, application research based on multi-level local feature coding in music genre recognition is proposed. Music features are extracted from timbre, rhythm, and pitch, and the extracted features are fused based on D-S evidence theory. The fused music features are input into an improved deep learning network, and the storage system structure is designed around the availability, manageability, and extensibility advantages of cloud storage. It is divided into four modules: a storage layer, a management layer, a structure layer, and an access layer. A model of music genre style recognition is constructed to realize the application of multi-level local feature coding in music genre recognition. The experimental results show that the recognition accuracy of the proposed method remains at a high level and that the mean square error is positively correlated with the number of beats. After segmentation, the waveform is denser, which indicates a good application effect.
... Their work shows that the proposed approach provides higher accuracies than other state-of-the-art models on the GTZAN, ISMIR2004, and Extended Ballroom datasets [8]. ...
... where f_k is the k-th frequency, N is the number of frequency bins, and P(f_k) is the spectral amplitude at the k-th frequency [8]. ...
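The equation these symbols belong to was lost in extraction; symbols of this form typically define the spectral centroid, so under that assumption a sketch is:

```python
import numpy as np

def spectral_centroid(P: np.ndarray, f: np.ndarray) -> float:
    """Assumed formula: sum_k f_k * P(f_k) / sum_k P(f_k) over N frequency bins.
    P holds the spectral amplitudes P(f_k); f holds the bin frequencies f_k."""
    return float(np.sum(f * P) / (np.sum(P) + 1e-12))  # epsilon guards silent frames
```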
Article
Full-text available
Music genre classification with a neural network is presented in this article. The work uses spectrogram images generated from song time slices as input to a neural network to classify songs into their respective musical genres, and focuses on analyzing the parameters of the model. Using two different datasets and a neural network technique, we achieved an optimized result. The convolutional neural network model presented in this article classifies 10 classes of music genres with improved accuracy.
... In recent years, with the remarkable success of deep learning techniques in computer vision applications, deep neural networks (DNNs) have also shown great success in speech/music classification and recognition tasks, such as speaker recognition [36,43], music genre classification [6,39], and speech emotion recognition [49]. In these tasks, deep learning provides a new way to extract discriminative embeddings from well-known hand-crafted acoustic features, called i-vector content, for classification/recognition ...
... purposes. To this end, deep learning methods based on convolutional neural networks (CNNs) are the most widely used approach for obtaining embeddings from that i-vector content, such as MFCC [46,47,53], OSC coefficients [54], and 2D representations like the audio spectrogram or chromagram [6,39]. Bisharad et al. proposed a music genre classification system using a residual neural network (ResNet)-based model [8]. Specifically, ResNet-18 is used to extract time-frequency features from the Mel-spectrogram of each 3-second music clip. ...
... Ng et al. [39] proposed FusionNet to combine the classification results obtained from a set of hand-crafted features, including timbre, rhythm, Mel-spectrogram, constant-Q spectrogram [17], harmonic spectrogram [12], percussive spectrogram [12], scatter transform spectrogram [1], and transfer feature [10]. They fed each feature into an individual feature coding network with NetVLAD [2] and self-attention to obtain the classification results. ...
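The self-attention component paired with NetVLAD in [39] can be illustrated with a generic scaled dot-product module over the local feature vectors; this is a textbook sketch, not the module from the paper.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Scaled dot-product self-attention over a sequence of local features."""
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)  # joint query/key/value projection
        self.scale = dim ** -0.5

    def forward(self, x):                           # x: (batch, n_locals, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        return attn.softmax(dim=-1) @ v             # dependencies across all locals
```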
Preprint
Full-text available
In this study, we propose a new end-to-end convolutional neural network, called MS-SincResNet, for music genre classification. MS-SincResNet appends 1D multi-scale SincNet (MS-SincNet) to 2D ResNet as the first convolutional layer in an attempt to jointly learn 1D kernels and 2D kernels during the training stage. First, an input music signal is divided into a number of fixed-duration (3 seconds in this study) music clips, and the raw waveform of each music clip is fed into the 1D MS-SincNet filter learning module to obtain three-channel 2D representations. The learned representations carry rich timbral, harmonic, and percussive characteristics compared with spectrograms, harmonic spectrograms, percussive spectrograms, and Mel-spectrograms. ResNet is then used to extract discriminative embeddings from these 2D representations. The spatial pyramid pooling (SPP) module is further used to enhance feature discriminability, in both the time and frequency aspects, to obtain the classification label of each music clip. Finally, a voting strategy is applied to summarize the classification results from all 3-second music clips. Our experimental results demonstrate that the proposed MS-SincResNet outperforms the baseline SincNet and many well-known hand-crafted features. Considering individual 2D representations, MS-SincResNet also yields competitive results with state-of-the-art methods on the GTZAN dataset and the ISMIR2004 dataset. The code is available at https://github.com/PeiChunChang/MS-SincResNet
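The core idea of the SincNet front end described above is a band-pass filter parameterized only by learnable cutoff frequencies. The simplified single-filter sketch below omits the windowing and the multi-scale kernels that MS-SincResNet adds; kernel size and initial frequencies are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincFilter(nn.Module):
    """One learnable band-pass sinc kernel in the spirit of SincNet (simplified)."""
    def __init__(self, kernel_size: int = 251, sr: int = 22050):
        super().__init__()
        self.f_low = nn.Parameter(torch.tensor(100.0))    # lower cutoff (Hz)
        self.f_band = nn.Parameter(torch.tensor(1000.0))  # bandwidth (Hz)
        n = torch.arange(kernel_size) - kernel_size // 2
        self.register_buffer("t", n / sr)                 # time axis in seconds

    def forward(self, x):                                 # x: (batch, 1, samples)
        f1 = torch.abs(self.f_low)
        f2 = f1 + torch.abs(self.f_band)
        # band-pass kernel = difference of two low-pass sinc filters
        kernel = 2 * f2 * torch.sinc(2 * f2 * self.t) - 2 * f1 * torch.sinc(2 * f1 * self.t)
        return F.conv1d(x, kernel.view(1, 1, -1), padding="same")
```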