FIG 2 - uploaded by Johar M. Ashfaque
Diagram of k-fold cross-validation with k = 10. Image from Karl Rosaen Log http://karlrosaen.com/ml/learning-log/2016-06-20/
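For readers who want to reproduce the scheme in the diagram, the sketch below illustrates 10-fold cross-validation with scikit-learn. The dataset and classifier are placeholders chosen only for illustration; they are not taken from the figure or the reviewed article.

```python
# Minimal sketch of 10-fold cross-validation, mirroring the diagram above:
# the data are split into 10 folds, each fold serves once as the test set,
# and the reported score is the average over the 10 held-out folds.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)        # placeholder dataset
kf = KFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print(f"mean accuracy over 10 folds: {np.mean(scores):.3f}")
```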


Source publication
Article
Full-text available
We explain the support vector machine algorithm, and its extension the kernel method, for machine learning using small datasets. We also briefly discuss the Vapnik-Chervonenkis theory which forms the theoretical foundation of machine learning. This review is based on lectures given by the second author.
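As a companion to the abstract, here is a minimal sketch of the two ideas it reviews: a soft-margin SVM trained on a small dataset, first with scikit-learn's built-in RBF kernel and then with an explicitly precomputed Gaussian kernel matrix, which makes the kernel method visible. The toy dataset and hyperparameters are assumptions for illustration and do not come from the article.

```python
# Minimal sketch: SVM with a kernel on a small dataset (scikit-learn).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # small toy dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 1) Built-in RBF kernel: k(x, x') = exp(-gamma * ||x - x'||^2)
clf = SVC(kernel="rbf", C=1.0, gamma=1.0)
clf.fit(X_tr, y_tr)
print("RBF-kernel SVM accuracy:", clf.score(X_te, y_te))

# 2) The same model with an explicitly precomputed Gram (kernel) matrix,
#    which is how the kernel method is usually presented in the theory.
def rbf_gram(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

clf_pre = SVC(kernel="precomputed", C=1.0)
clf_pre.fit(rbf_gram(X_tr, X_tr), y_tr)
print("Precomputed-kernel SVM accuracy:", clf_pre.score(rbf_gram(X_te, X_tr), y_te))
```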

Similar publications

Preprint
Full-text available
As quantum computers become increasingly practical, so does the prospect of using quantum computation to improve upon traditional algorithms. Kernel methods in machine learning is one area where such improvements could be realized in the near future. Paired with kernel methods like support-vector machines, small and noisy quantum computers can eval...
Article
Full-text available
A method for analyzing the feature map for the kernel-based quantum classifier is developed; that is, we give a general formula for computing a lower bound of the exact training accuracy, which helps us to see whether the selected feature map is suitable for linearly separating the dataset. We show a proof of concept demonstration of this method fo...
Article
Full-text available
In this paper, we propose a stochastic gradient descent algorithm, called stochastic gradient descent method-based generalized pinball support vector machine (SG-GPSVM), to solve data classification problems. This approach was developed by replacing the hinge loss function in the conventional support vector machine (SVM) with a generalized pinball...
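The abstract does not state the exact form of the generalized pinball loss, so the sketch below uses the standard pinball loss L_tau(u) = max(u, -tau*u), applied to u = 1 - y(w·x + b), purely as a stand-in, and minimizes the regularized objective with stochastic subgradient descent on a linear model. All names and hyperparameters are illustrative assumptions, not the SG-GPSVM algorithm itself.

```python
# Hedged sketch of a stochastic (sub)gradient descent solver for a linear SVM
# with a pinball-type loss. The "generalized pinball" loss of the cited paper
# is not specified in the abstract, so the standard pinball loss
#   L_tau(u) = max(u, -tau * u),  with u = 1 - y * (w.x + b),
# is used here purely as a stand-in.
import numpy as np

def sgd_pinball_svm(X, y, tau=0.5, lam=0.01, lr=0.01, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            u = 1.0 - y[i] * (X[i] @ w + b)      # margin violation
            g_u = 1.0 if u >= 0 else -tau        # subgradient of the pinball loss w.r.t. u
            # Chain rule: du/dw = -y*x, du/db = -y; plus L2 regularization on w
            w -= lr * (lam * w - g_u * y[i] * X[i])
            b -= lr * (-g_u * y[i])
    return w, b

# Toy usage with labels in {-1, +1}
X = np.vstack([np.random.randn(50, 2) + 2, np.random.randn(50, 2) - 2])
y = np.hstack([np.ones(50), -np.ones(50)])
w, b = sgd_pinball_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```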
Article
Full-text available
We investigated the potential application of quantum computing using the Kronecker kernel to pairwise classification and have devised a way to apply the Harrow-Hassidim-Lloyd (HHL)-based quantum support vector machine algorithm. Pairwise classification can be used to predict relationships among data and is used for problems such as link prediction...
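The quantum (HHL-based) solver cannot be reproduced in a few lines, but the Kronecker pairwise kernel it builds on can: for pairs (a, b) and (c, d), K((a, b), (c, d)) = K1(a, c) · K2(b, d), so the Gram matrix over all ordered pairs is the Kronecker product of the two base Gram matrices. The sketch below checks this classically with NumPy on placeholder data.

```python
# Classical sketch of the Kronecker (tensor-product) pairwise kernel:
#   K((a, b), (c, d)) = K1(a, c) * K2(b, d)
# Over all ordered pairs, the pairwise Gram matrix is the Kronecker product
# of the two base Gram matrices. The quantum (HHL-based) solver of the cited
# work is not reproduced here.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

A = np.random.randn(5, 3)    # objects of the first kind
B = np.random.randn(4, 3)    # objects of the second kind

K1 = rbf_kernel(A, A, gamma=0.5)      # 5 x 5 Gram matrix
K2 = rbf_kernel(B, B, gamma=0.5)      # 4 x 4 Gram matrix
K_pairs = np.kron(K1, K2)             # 20 x 20 Gram matrix over all (a, b) pairs

# Entry for pair (a_i, b_j) vs pair (a_k, b_l) sits at row i*4+j, column k*4+l:
i, j, k, l = 1, 2, 3, 0
assert np.isclose(K_pairs[i * 4 + j, k * 4 + l], K1[i, k] * K2[j, l])
print(K_pairs.shape)
```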

Citations

... K-fold cross-validation technique [37]. ...
Article
Full-text available
Automatic dependent surveillance-broadcast (ADS-B) is the future of aviation surveillance and traffic control, allowing different aircraft types to exchange information periodically. Despite this protocol's advantages, it is vulnerable to flooding, denial-of-service, and injection attacks. In this paper, we join the initiative of securing this protocol and propose an efficient detection method to help detect any attempts to exploit it by injecting messages containing wrong information. This paper focuses mainly on three attacks: path modification, ghost aircraft injection, and velocity drift attacks. It aims to provide a methodology that, even in the face of new attacks (zero-day attacks), can successfully detect injected messages. The main advantage was utilizing a recent dataset to create more reliable and adaptive training and testing material, which was then preprocessed before using different machine learning algorithms to create the most accurate and time-efficient model. The best outcomes of the binary classification were obtained with 99.14% accuracy, an F1-score of 99.14%, and a Matthews correlation coefficient (MCC) of 0.982, while the best outcomes of the multiclass classification were obtained with 99.41% accuracy, an F1-score of 99.37%, and an MCC of 0.988. Our best outcomes outdo existing models, but we believe the model would benefit from testing against other types of attacks and a bigger dataset.
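The metrics quoted in this abstract (accuracy, F1-score, and the Matthews correlation coefficient) can be computed with scikit-learn as in the sketch below; the labels are placeholders, not the paper's ADS-B data or model outputs.

```python
# Hedged sketch: computing the metrics reported in the abstract (accuracy,
# F1-score, Matthews correlation coefficient) with scikit-learn. The labels
# below are placeholders, not the ADS-B data or models from the paper.
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Binary case: 0 = benign message, 1 = injected message
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]
print("accuracy:", accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
print("MCC:     ", matthews_corrcoef(y_true, y_pred))

# Multiclass case (e.g., benign / path modification / ghost injection / velocity drift)
y_true_mc = [0, 1, 2, 3, 1, 2, 0, 3]
y_pred_mc = [0, 1, 2, 3, 1, 0, 0, 3]
print("multiclass F1 (weighted):", f1_score(y_true_mc, y_pred_mc, average="weighted"))
print("multiclass MCC:          ", matthews_corrcoef(y_true_mc, y_pred_mc))
```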
... In each iteration, a different test set is used. The final model performance is the average of the scores obtained in each iteration, as shown in Figure 4 [23]. ...
Preprint
Full-text available
Breast cancer is a significant health problem, with about 2 million new cases and 600,000 deaths annually. Early detection and accurate diagnosis are critical to patient prognosis. Machine learning (ML) models show promising results for accurate and efficient diagnosis. In the present work, the performance of different ML models is studied on the publicly accessible Wisconsin Breast Cancer Dataset. The models are based on logistic regression, Random Forest, Naïve Bayes, and Support Vector Machine algorithms, with the SVM performing best. An ensemble model combining the best-performing models is then implemented: an SVM model on the standardized dataset, a logistic regression model on the standardized dataset with a 10-component PCA analysis, and a Random Forest model on the standardized dataset with 60 estimators. All models use a test set formed by 30% of the original dataset. The models are combined using a majority weighted voting system, in which the SVM model has a weight of 0.5 while the logistic regression and Random Forest models have weights of 0.25 each. The ensemble voting model improves on the results of the individual models, with an accuracy of 98%, precision of 97%, recall of 99%, and F1 score of 98%.
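The ensemble described above maps fairly directly onto scikit-learn. The sketch below is a hedged reconstruction based only on the abstract's description (standardized SVM, standardized logistic regression with 10-component PCA, a 60-estimator Random Forest, a 30% test split, and hard voting weighted 0.5/0.25/0.25); the authors' actual preprocessing and hyperparameters may differ.

```python
# Hedged reconstruction of the ensemble described above, using scikit-learn.
# The pipeline choices follow the abstract's description; the exact
# preprocessing and hyperparameters of the original work may differ.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)                       # Wisconsin dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

svm = make_pipeline(StandardScaler(), SVC())                      # standardized SVM
logreg = make_pipeline(StandardScaler(), PCA(n_components=10),    # standardized + 10-PC PCA
                       LogisticRegression(max_iter=1000))
forest = RandomForestClassifier(n_estimators=60, random_state=42)

# Majority (hard) voting with weights 0.5 / 0.25 / 0.25, as stated in the abstract.
ensemble = VotingClassifier(
    estimators=[("svm", svm), ("logreg", logreg), ("rf", forest)],
    voting="hard",
    weights=[0.5, 0.25, 0.25],
)
ensemble.fit(X_tr, y_tr)
print("test accuracy:", ensemble.score(X_te, y_te))
```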
... The GPR model was trained with 10-fold cross-validation, which was reported as the most commonly used value according to Singh et al. (2011). This is to avoid overfitting or underfitting a model with algorithms that are too simple or too complex, and to assist the selection of the best-performing model by using the validation dataset to calculate the error (Refaeilzadeh et al. 2016; Ashfaque and Iqbal 2019). ...
Article
Full-text available
Palm oil mill effluent (POME) contributes 23.7% of the methane emissions in Malaysia. Developing a methane emission prediction tool using machine learning (ML) enables the volume of methane released to be estimated. In this study, Gaussian Process Regression (GPR) along with its respective kernels was explored for the development of the prediction tool. The synthetic minority oversampling technique (SMOTE) was also implemented to study the effect of the training sample size on model validation. The GPR model was trained using synthetic data created with SMOTE, while the measured data from the plant was used to test the reliability of the trained model. The application of SMOTE was capable of producing high model validation performance (R² = 0.98, RMSE = 0.133, MSE = 0.018 and MAE = 0.08) using the common squared exponential kernel GPR model. However, the Matern 5/2 and rational quadratic kernel GPR models had the best model validation performance (R² = 0.98, RMSE = 0.131, MSE = 0.017 and MAE = 0.083). In terms of model testing performance, the rational quadratic kernel had the best performance, with R² = 0.99, RMSE = 0.061, MSE = 0.0037 and MAE = 0.044. The results of this study indicate that the prediction tool developed using the SMOTE-based rational quadratic kernel GPR model can predict methane emissions with high accuracy. The methane emission prediction tool developed is a cost-friendly and reliable alternative to existing methods.
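The kernel comparison described above can be set up with scikit-learn's Gaussian process tools, as in the hedged sketch below. The data are synthetic placeholders, and the SMOTE-based synthetic training set of the original study is not reproduced, since the abstract does not specify how SMOTE was adapted to a regression target.

```python
# Hedged sketch: comparing GPR kernels (squared exponential / RBF, Matern 5/2,
# rational quadratic) with scikit-learn, on placeholder data. The SMOTE-based
# synthetic training data of the original study is not reproduced here.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, RationalQuadratic
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))                          # placeholder POME features
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, 200)  # placeholder methane target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

kernels = {
    "squared exponential": RBF(),
    "Matern 5/2": Matern(nu=2.5),
    "rational quadratic": RationalQuadratic(),
}
for name, kernel in kernels.items():
    gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-2, normalize_y=True)
    gpr.fit(X_tr, y_tr)
    pred = gpr.predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    print(f"{name:20s}  R2={r2_score(y_te, pred):.3f}  RMSE={rmse:.3f}  "
          f"MAE={mean_absolute_error(y_te, pred):.3f}")
```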
... K-fold cross-validation technique [40] ...
Preprint
Full-text available
Automatic Dependent Surveillance-Broadcast (ADS-B) is considered the future of aviation surveillance and traffic control, as it allows different types of aircraft to periodically transmit and receive information about their own and other nearby aircraft's positions, velocity, and various other variables. However, as this protocol still lacks security and researchers are still developing methods and frameworks to secure this technology, we decided to join the initiative and propose an efficient detection method to help detect any attempts at injecting these messages, which could pose multiple risks to aircraft such as causing collision avoidance system failure, reporting the wrong status of an aircraft, or even enabling its theft. This paper focuses mainly on three different attacks: path modification, ghost aircraft injection, and velocity drift attacks. The dataset we utilized consisted of authentic messages captured from the OpenSky Network and injected messages generated using PyCharm. This study aims to provide a methodology that, even in the face of new attacks (zero-day attacks), can successfully detect injected messages. The main advantage was utilizing a recent dataset to create more reliable and adaptive training and testing material, which was then preprocessed before using different machine learning algorithms to create the most accurate and time-efficient model. The best outcomes of the binary classification were obtained with 99.14% accuracy, an F1-score of 99.14%, and a Matthews correlation coefficient (MCC) of 0.982, while the best outcomes of the multiclass classification were obtained with 99.41% accuracy, an F1-score of 99.37%, and an MCC of 0.988. The dataset is thought to offer good outcomes, but the model still requires more testing against other types of attacks and a bigger dataset.
... In general, a validation dataset, that is independent of both training and testing sets, is used to validate the GCN model. Then, the test dataset is finally used to evaluate the trained model [55]. Unfortunately, in the case of analog circuits, the luxury of a big dataset is not always available. ...
Article
Full-text available
Analog mixed-signal (AMS) verification is one of the essential tasks in the development process of modern systems-on-chip (SoC). Most parts of the AMS verification flow are already automated, except for stimuli generation, which has been performed manually; it is thus challenging and time-consuming, and automation is a necessity. To generate stimuli, sub-circuits or sub-blocks of a given analog circuit module should be identified/classified. However, there is currently no reliable industrial tool that can automatically identify/classify analog sub-circuits (eventually in the frame of a circuit design process) or automatically classify a given analog circuit at hand. Besides verification, several other processes would profit enormously from the availability of a robust and reliable automated classification model for analog circuit modules (which may belong to different levels). This paper presents how to use a Graph Convolutional Network (GCN) model and proposes a novel data augmentation strategy to automatically classify analog circuits of a given level. Eventually, it can be upscaled or integrated within a more complex functional module (for structure recognition of complex analog circuits), targeting the identification of sub-circuits within a more complex analog circuit module. An integrated novel data augmentation technique is particularly crucial because, in practical settings, generally only a relatively limited dataset of analog circuit schematics (i.e., sample architectures) is available. Through a comprehensive ontology, we first introduce a graph representation framework of the circuit schematics, which consists of converting the circuits' netlists into graphs. Then, we use a robust classifier consisting of a GCN processor to determine the label corresponding to the given input analog circuit schematic. Furthermore, the classification performance is improved and made robust by a novel data augmentation technique. The classification accuracy was enhanced from 48.2% to 76.6% using feature matrix augmentation, and from 72% to 92% using dataset augmentation by flipping. A 100% accuracy was achieved after applying either multi-stage augmentation or hyperphysical augmentation. Overall, extensive tests of the concept were developed to demonstrate high accuracy for the analog circuit classification endeavor. This is solid support for a future up-scaling towards automated analog circuit structure detection, which is one of the prerequisites not only for stimuli generation in the frame of analog mixed-signal verification but also for other critical endeavors related to the engineering of AMS circuits.
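As background for how such a classifier processes a circuit graph, the sketch below implements a single GCN layer with the standard propagation rule H' = ReLU(D^{-1/2}(A + I)D^{-1/2} H W) of Kipf and Welling in plain NumPy, followed by a mean readout that would feed a softmax classifier. The toy graph, features, and weights are placeholders, not the paper's netlist graphs or trained model.

```python
# Background sketch of a single GCN layer (Kipf & Welling propagation rule)
# in plain NumPy:  H' = ReLU( D^{-1/2} (A + I) D^{-1/2} H W ).
# The adjacency matrix, node features, and weights below are placeholders,
# not the circuit graphs or trained model of the cited paper.
import numpy as np

def gcn_layer(A, H, W):
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))         # symmetric normalization
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# Toy "circuit" graph: 4 nodes (e.g., devices/nets), undirected edges.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
H = np.random.randn(4, 8)                          # per-node feature vectors
W1 = np.random.randn(8, 16)
W2 = np.random.randn(16, 4)

H1 = gcn_layer(A, H, W1)                           # first graph convolution
H2 = gcn_layer(A, H1, W2)                          # second graph convolution
graph_embedding = H2.mean(axis=0)                  # simple readout for graph-level classification
print(graph_embedding.shape)                       # would feed a softmax classifier
```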
... Diagram of K-fold cross-validation (modified from Ashfaque, 2018) [17] ...
Chapter
Full-text available
Epilepsy is a type of neurological brain disorder caused by temporary changes in the brain's electrical activity. If diagnosed and treated, seizures can be prevented. Electroencephalography (EEG) is the most common technique used in diagnosing epilepsy to avoid danger and take preventive precautions. This paper applies deep learning and machine learning techniques to detect epileptic seizures, identifies whether machine learning or deep learning classifiers are more pertinent for the purpose, and then tries to improve the present techniques for seizure detection. The best performance of the deep learning models was achieved by implementing the convolutional neural network (CNN) algorithm on the EEG signal dataset, with the following results: accuracy 99.2%, specificity 99.3% and sensitivity 98.7%. For the hybrid deep neural network combining a CNN with long short-term memory (LSTM), the accuracy reached 98.7%. Keywords: Convolutional neural network; Epilepsy; Seizures; Electroencephalography; Long short-term memory
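The chapter reports results for a CNN and a hybrid CNN-LSTM but does not give the layer configuration, so the Keras sketch below is only a generic example of a 1D CNN-LSTM binary seizure detector. The input length of 178 samples (as in the common UCI epileptic seizure dataset) and all layer sizes are assumptions for illustration.

```python
# Hedged, generic sketch of a 1D CNN-LSTM binary seizure detector in Keras.
# The architecture and input length (178 samples per EEG segment) are
# assumptions for illustration; the chapter does not specify its layers.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(178, 1)),                   # one-channel EEG segment
    layers.Conv1D(32, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.LSTM(64),                               # temporal modelling on CNN features
    layers.Dense(1, activation="sigmoid"),         # seizure / non-seizure
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Placeholder data just to show the expected shapes.
X = np.random.randn(32, 178, 1).astype("float32")
y = np.random.randint(0, 2, size=(32, 1))
model.fit(X, y, epochs=1, batch_size=8, verbose=0)
model.summary()
```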
... Overall, the proposed methods from various researchers are summarized in the benchmark table above [25][26][27]. Table 1: Benchmark table for all existing research work on mushrooms. ...
... In (2), C > 0 is the regularization constant that controls the trade-off between the minimization of the training errors and the maximization of the margin [21], and ξ_i is the slack variable which indicates the degree to which a data point may lie within the margin [27], [28]. In (2), the ":" should be read as "such that". ...
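For context, the optimization problem "(2)" referred to in this excerpt is presumably the standard soft-margin SVM primal (an assumption about the citing paper's numbering), which reads:

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_{i}
\quad:\quad
y_{i}\left(w^{\top}x_{i} + b\right) \ge 1 - \xi_{i},\qquad \xi_{i}\ge 0,\qquad i = 1,\dots,n
```

Here the ":" is read as "such that", C > 0 trades off margin maximization against training error, and ξ_i measures how far point i is allowed to lie inside the margin.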
... A representation of 10-fold cross-validation [83] ...
Thesis
The main cause of blindness in the population may above all be the deterioration of the retina caused by diabetes-related problems and complications of aging. Diabetic retinopathy (DR) and diabetic macular edema (DME) are the main direct causes of vision problems among working-age citizens in most advanced countries. The high number of people with diabetes worldwide indicates that DME and DR will remain the main factors of partial or total vision loss, which affects patients' quality of life for many years and threatens their lives. Therefore, early detection followed by prompt treatment of people with diabetes-related diseases is important to prevent optical problems and can reduce the risk of blindness. In addition, people over 50 are exposed to age-related macular degeneration (AMD), which attacks the retina. Consequently, researchers around the world are drawn to the differences associated with several retinal diseases. Several automated methods using AI have been applied to the detection and testing of retinal diseases. Unfortunately, these models can be hampered by computational limitations, which requires additional intervention from specialists. This thesis presents an automatic method, based on deep learning neural network algorithms, to detect DME and DR, which makes it possible to go beyond the subjective practical assessment of ophthalmologists. Based on a convolutional neural network, a proposed model is presented with a soft-max classifier and trained end-to-end for the automatic classification of optical coherence tomography (OCT) retinal images. This model has the ability to detect features for identifying DR and DME in these retinal images with improved accuracy and sensitivity. In addition, a pre-trained model was fine-tuned and retrained using a dataset enriched with Generative Adversarial Networks (GANs). Unlike manual diagnosis of retinal disease based on personal clinical examination and analysis of OCT images, this method has shown the ability to automatically predict cases with DME versus healthy cases. The experiments were evaluated on several datasets provided by different institutions. The model, compared to other CNN models trained end-to-end or pre-trained and fine-tuned, shows efficient feature extraction, in less time, based on an effective data preprocessing step. The experimental results showed higher classification accuracy, which is promising for the early detection of diabetic diseases to assist ophthalmologists through biomedical technologies.
... The latter is more flexible but also requires more computational resources and becomes less straightforward to explain. Readers can refer to (Ng, 2000) for more details. We build the support vector machine model in Matlab using the function fitcsvm. ...
Article
Full-text available
Urban pluvial flooding is a threatening natural hazard in urban areas all over the world, especially in recent years given its increasing frequency of occurrence. In order to prevent flood occurrence and mitigate the subsequent aftermath, urban water managers aim to predict precipitation characteristics, including peak intensity, arrival time and duration, so that they can further warn inhabitants in risky areas and take emergency actions when forecasting a pluvial flood. Previous studies that dealt with the prediction of urban pluvial flooding are mainly based on hydrological or hydraulic models, requiring a large volume of data for simulation accuracy. These methods are computationally expensive. Using a rainfall threshold to predict flooding based on a data-driven approach can decrease the computational complexity to a great extent. In order to prepare cities for frequent pluvial flood events – especially in the future climate – this paper uses a rainfall threshold for classifying flood vs. non-flood events, based on machine learning (ML) approaches, applied to a case study of Shenzhen city in China. In doing so, ML models can determine several rainfall threshold lines projected in a plane spanned by two principal components, which provides a binary result (flood or no flood). Compared to the conventional critical rainfall curve, the proposed models, especially the subspace discriminant analysis, can classify flooding and non-flooding by different combinations of multiple-resolution rainfall intensities, greatly raising the accuracy to 96.5% and lowering the false alert rate to 25%. Compared to the conventional model, the critical indices of accuracy and true positive rate (TPR) were 5%-15% higher in ML models. Such models are applicable to other urban catchments as well. The results are expected to be used to assist early warning systems and provide rational information for contingency and emergency planning.
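As a rough, hedged approximation of the approach described above, the sketch below projects multi-resolution rainfall features onto two principal components and fits a linear discriminant, which yields a linear threshold line in the principal-component plane for classifying flood versus non-flood events. The paper's best model, subspace discriminant analysis, is an ensemble variant not reproduced here, and the data are synthetic placeholders.

```python
# Hedged approximation of the rainfall-threshold classifier: project
# multi-resolution rainfall features onto two principal components and fit a
# linear discriminant, giving a linear "threshold line" in the PC plane.
# (Plain LDA is used as a simpler stand-in for subspace discriminant analysis;
# the data are synthetic.)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Placeholder features: rainfall intensities aggregated at several resolutions.
X_no_flood = rng.gamma(shape=2.0, scale=2.0, size=(300, 6))
X_flood = rng.gamma(shape=2.0, scale=2.0, size=(100, 6)) + 6.0
X = np.vstack([X_no_flood, X_flood])
y = np.hstack([np.zeros(300, dtype=int), np.ones(100, dtype=int)])   # 1 = flood event
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

clf = make_pipeline(StandardScaler(), PCA(n_components=2),
                    LinearDiscriminantAnalysis())
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("accuracy:", accuracy_score(y_te, pred))
print("TPR (recall on flood class):", recall_score(y_te, pred))
```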