Fig 1 - uploaded by Walter de Donato
Diagram of the Continuous Training Traffic Classification system based on TIE 

Source publication
Conference Paper
The network measurement community has proposed multiple machine learning (ML) methods for traffic classification during the last years. Although several research works have reported accuracies over 90%, most network operators still use either obsolete (e.g., port-based) or extremely expensive (e.g., pattern matching) methods for traffic classificat...

Context in source publication

Context 1
... this section, we show the interaction of our KD-Tree plugin with the rest of the TIE architecture, and describe the modifications done in TIE to allow our plugin to continuously retrain itself. Figure 1 shows the data flow of our continuous training system based on TIE. The first three modules are used without any modification as found in the original version of TIE. ...

Similar publications

Conference Paper
The correct selection of performance metrics is one of the key issues in evaluating a classifier's performance. Although many performance metrics have been proposed and used in the machine learning community, there is no common conclusion among practitioners regarding which metric to choose for evaluating a classifier's performance. In this pa...
Conference Paper
Classification problems are of profound interest for the machine learning community as well as to an array of application fields. However, multi-class classification problems can be very complex, in particular when the number of classes is high. Although very successful in so many applications, GP was never regarded as a good method to perform mult...
Article
Transfer Learning (TL) aims to transfer knowledge acquired in one problem, the source problem, onto another problem, the target problem, dispensing with the bottom-up construction of the target model. Due to its relevance, TL has gained significant interest in the Machine Learning community since it paves the way to devise intelligent learning mode...
Article
Probabilistic classifiers are considered to be among the most popular classifiers for the machine learning community and are used in many applications. Although popular probabilistic classifiers exhibit very good performance when used individually in a specific classification task, very little work has been done on assessing the performance of two...

Citations

... Port-based traffic classification uses transport-layer (UDP/TCP) port numbers for classification. This technique was particularly successful in the early phases of classifying network traffic [2]. However, the classification accuracy of this method has fallen significantly to 50%-70% due to the continual increase in the number of applications and the advent of port number disguising techniques [3]. ...
... The paper by Wang et al. successfully applied the transformation concept by training a CNN with the TensorFlow framework [2,15]. The CNN received a 28×28-byte input matrix and produced an average F1 score of 98.5%. ...
Article
The classification, detection, and analysis of routine network traffic has been a hot topic for businesses and research institutions due to the proliferation of Internet of Things devices and the explosive development of networks. Traditional methods for categorizing network traffic primarily employ common machine learning algorithms, e.g., decision trees and naive Bayes, but as deep learning technology advances, it is increasingly being applied to traffic classification with success. This study examines existing deep learning-based network traffic classification techniques and focuses on the categorization of computer network traffic. First, the research background of the topic is introduced. Then, traffic classification based on deep learning is described, covering approaches based on Stacked Autoencoders, Convolutional Neural Networks, and Recurrent Neural Networks. Following this investigation, the paper concludes that Long Short-Term Memory and Convolutional Neural Network models are the best deep learning models for traffic classification, with the three-dimensional Convolutional Neural Network outperforming the others.
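
The port-based approach described in the excerpts above can be illustrated with a minimal sketch. The port table and the function name `classify_by_port` are assumptions made for illustration only; they are not taken from any of the cited papers. The sketch also shows why accuracy degrades: any application on a non-standard or disguised port falls through to "unknown".

```python
# Hypothetical, deliberately tiny port-to-application table (illustration only).
WELL_KNOWN_PORTS = {
    ("tcp", 22): "ssh",
    ("tcp", 25): "smtp",
    ("tcp", 80): "http",
    ("tcp", 443): "https",
    ("udp", 53): "dns",
}


def classify_by_port(protocol, src_port, dst_port):
    """Return an application label from transport-layer ports, or 'unknown'."""
    proto = protocol.lower()
    # Check the destination port first, then fall back to the source port,
    # since the server side of a flow may appear in either position.
    for port in (dst_port, src_port):
        label = WELL_KNOWN_PORTS.get((proto, port))
        if label is not None:
            return label
    # Applications on non-standard or disguised ports end up here.
    return "unknown"
```

For example, `classify_by_port("TCP", 51000, 443)` yields `"https"`, while a peer-to-peer flow between two ephemeral ports is labeled `"unknown"`.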
... However, their use relies on obtaining handcrafted (domain-expert driven) features, which in TC context usually correspond to packet-sequence statistics. Such feature engineering process is unable to cope with modern network-traffic evolution, and impairs the design of both accurate and up-to-date traffic classifiers [9] using "traditional" ML approaches [10,11,12]. ...
Article
Traffic classification, i.e. the inference of applications and/or services from their network traffic, represents the workhorse for service management and the enabler for valuable profiling information. The growing trend toward encrypted protocols and the fast-evolving nature of network traffic are obsoleting the traffic-classification design solutions based on payload-inspection or machine learning. Conversely, deep learning is currently foreseen as a viable means to design traffic classifiers based on automatically-extracted features. These reflect the complex patterns distilled from the multifaceted (encrypted) traffic, that implicitly carries information in "multimodal" fashion, and can be also used in application scenarios with diversified network visibility for (simultaneously) tackling multiple classification tasks. To this end, in this paper a novel multimodal multitask deep learning approach for traffic classification is proposed, leading to the Distiller classifier. The latter is able to capitalize traffic-data heterogeneity (by learning both intra- and inter-modality dependencies), overcome performance limitations of existing (myopic) single-modal deep learning-based traffic classification proposals, and simultaneously solve different traffic categorization problems associated to different providers' desiderata. Based on a public dataset of encrypted traffic, we evaluate Distiller in a fair comparison with state-of-the-art deep learning architectures proposed for encrypted traffic classification (and based on single-modality philosophy). Results show the gains of our proposal over both multitask extensions of single-task baselines and native multitask architectures.
... Such process is impractical when facing the fast-paced mobile traffic evolution, because it can be neither automated nor crowdsourced to non-experts (due to the high specialization required). After a large number of ML-based approaches [3,4,5,6], recently deep learning (DL) [7,8], a cutting-edge subset of ML techniques, has emerged as the disruptive breakthrough toward the automatic design of accurate inference systems able to capture complex dependencies among data, thus limiting human expert intervention. ...
Article
Traffic Classification (TC), consisting in how to infer applications generating network traffic, is currently the enabler for valuable profiling information, other than being the workhorse for service differentiation/blocking. Further, TC is fostered by the blooming of mobile (mostly encrypted) traffic volumes, fueled by the huge adoption of hand-held devices. While researchers and network operators still rely on machine learning to pursue accurate inference, we envision Deep Learning (DL) paradigm as the stepping stone toward the design of practical (and effective) mobile traffic classifiers based on automatically-extracted features, able to operate with encrypted traffic, and reflecting complex traffic patterns. In this context, the paper contribution is four-fold. First, it provides a taxonomy of the key network traffic analysis subjects where DL is foreseen as attractive. Secondly, it delves into the non-trivial adoption of DL to mobile TC, surfacing potential gains. Thirdly, to capitalize such gains, it proposes and validates a general framework for DL-based encrypted TC. Two concrete instances originating from our framework are then experimentally evaluated on three mobile datasets of human users' activity. Lastly, our framework is leveraged to point to future research perspectives.
... But to some extent, the non-weight distance calculation lacks adaptability to different tasks of encrypted traffic classification. Carela-Español et al. [54] used KD-Tree (k-dimension tree) to improve the efficiency of KNN in traffic classification. Bar-Yanai et al. [55] combined KNN and K-means to improve the efficiency for real-time traffic classification. ...
Article
The fine-grained classification of encrypted traffic is important for network security analysis. Malicious attacks are usually encrypted and simulated as normal application or content traffic. Supervised machine learning methods are widely used for traffic classification and show good performances. However, they need a large amount of labeled data to train a model, while labeled data is hard to obtain. Aiming at solving this problem, this paper proposes a method to train a model based on the K-nearest neighbor (KNN) algorithm, which only needs a small amount of data. Due to the fact that the importance of different traffic features varies, and traditional KNN does not highlight the importance of different features, this study introduces the concept of feature weight and proposes the weighted feature KNN (WKNN) algorithm. Furthermore, to obtain the optimal feature set and the corresponding feature weight set, a feature selection and feature weight self-adaptive algorithm for WKNN is proposed. In addition, a three-layer classification framework for encrypted network flows is established. Based on the improved KNN and the framework, this study finally presents a method for fine-grained classification of encrypted network flows, which can identify the encryption status, application type and content type of encrypted network flows with high accuracies of 99.3%, 92.4%, and 97.0%, respectively.
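
The idea of feature weighting in KNN, as summarized in the abstract above, can be sketched as follows. This is only an illustration under stated assumptions: it uses a weighted Euclidean distance with a plain majority vote, and the names `weighted_distance` and `wknn_predict`, the feature layout, and the weight values are all hypothetical. The abstract does not specify the actual distance function or the weight-selection algorithm of WKNN, so neither is reproduced here.

```python
import math
from collections import Counter


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance (an assumed form, not the paper's)."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def wknn_predict(train, query, weights, k=3):
    """Predict a label for `query` by majority vote among the k nearest samples.

    train   -- list of (feature_tuple, label) pairs
    weights -- one non-negative weight per feature; 0 disables a feature
    """
    neighbors = sorted(
        train, key=lambda item: weighted_distance(item[0], query, weights)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature samples: the first feature separates the classes,
# the second one is noise.
train = [
    ((1.0, 100.0), "web"),
    ((1.2, 5.0), "web"),
    ((5.0, 90.0), "video"),
    ((5.3, 10.0), "video"),
]
```

With these samples, weighting only the first feature, `wknn_predict(train, (1.1, 92.0), (1.0, 0.0))`, returns `"web"`, whereas weighting only the noisy second feature with `k=1` returns `"video"`. This shows how the choice of weights changes the neighborhood.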
... Table IIb reports the per-flow features F_i extracted. Previous works in the field of traffic classification via machine learning successfully leveraged these features to feed the classification algorithms they devised [16, 21, 22, 23]. ...
Conference Paper
Network traffic analysis, i.e. the umbrella of procedures for distilling information from network traffic, represents the enabler for highly-valuable profiling information, other than being the workhorse for several key network management tasks. While it is currently being revolutionized in its nature by the rising share of traffic generated by mobile and hand-held devices, existing design solutions are mainly evaluated on private traffic traces, and only a few public datasets are available, thus clearly limiting repeatability and further advances on the topic. To this end, this paper introduces and describes MIRAGE, a reproducible architecture for mobile-app traffic capture and ground-truth creation. The outcome of this system is MIRAGE-2019, a human-generated dataset for mobile traffic analysis (with associated ground-truth) having the goal of advancing the state-of-the-art in mobile app traffic analysis. A first statistical characterization of the mobile-app traffic in the dataset is provided in this paper. Still, MIRAGE is expected to be capitalized by the networking community for different tasks related to mobile traffic analysis.
... Thus, these training samples were sampled randomly based on different weights of tissue types. And then, a fast k-means algorithm [31] was first performed to obtain C cluster anchors within the training CT patch samples and the corresponding voxel-wise MR descriptors. The same neighbor samples can be shared by different anchors. ...
... 3) Hierarchical search of nearest neighbor: Motivated by the property of k-d tree [31] that can decrease the execution time, a hierarchical search (two-layer) approach was added to the INAR method to further accelerate the synthesis of the pCT images. The C MR anchors were first clustered into c_1 groups using k-means, each with an l2-normalized centroid. ...
Article
Given the complicated relationship between the Magnetic Resonance Imaging (MRI) signals and the attenuation values, the attenuation correction in hybrid Positron Emission Tomography (PET)/MRI systems remains a challenging task. Currently, existing methods are either time-consuming or require sufficient samples to train the models. In this work, an efficient approach for predicting pseudo computed tomography (CT) images from T1- and T2-weighted MRI data with limited data is proposed. The proposed approach uses improved neighborhood anchor regression (INAR) as a baseline method to pre-calculate projected matrices to flexibly predict the pseudo CT patches. Techniques, including the augmentation of the MR/CT dataset, learning of the nonlinear descriptors of MR images, hierarchical search for nearest neighbors, data-driven optimization, and multi-regressor ensemble, are adopted to improve the effectiveness of the proposed approach. In total, 22 healthy subjects were enrolled in the study. The pseudo CT images obtained using INAR with multi-regressor ensemble yielded mean absolute error (MAE) of 92.73 $\pm$ 14.86 HU, peak signal-to-noise ratio of 29.77 $\pm$ 1.63 dB, Pearson linear correlation coefficient of 0.82 $\pm$ 0.05, dice similarity coefficient of 0.81 $\pm$ 0.03, and the relative mean absolute error (rMAE) in PET attenuation correction of 1.30 $\pm$ 0.20% compared with true CT images. Moreover, our proposed INAR method, without any refinement strategies, can achieve considerable results with only seven subjects (MAE 106.89 $\pm$ 14.43, rMAE 1.51 $\pm$ 0.21%). The experiments prove the superior performance of the proposed method over the six innovative methods. Moreover, the proposed method can rapidly generate the pseudo CT images that are suitable for PET attenuation correction.
... Hence, classifiers based on Machine Learning (ML) are deemed the most appropriate, especially in this context, since they suit also ET while not necessarily relying on port information [9,10,11], and they are also able to discriminate traffic generated from several apps. However, the successful use of standard ML classifiers relies on obtaining handcrafted (domain-expert driven) features, which in TC context usually correspond to statistics extracted from the sequence of packets [9,13] or message sizes [14,15]. Sadly, such process is time-consuming, unsuited to automation, and it is becoming rapidly outdated when compared to the evolution and mix of mobile traffic, being a constantly moving target, and precluding the design of accurate and up-to-date mobile-traffic classifiers [10,13,16] with "traditional" ML approaches. ...
... Confusion Matrices: Turning to the details of classifiers' behavior, Fig. 3 shows the confusion matrices of best-performing DL-approaches in the three datasets, so as to investigate noteworthy error-patterns. From inspection of the results, the 1D-CNN (L7-784) (in Android and FB/FBM datasets) and 2D-CNN (L7-784) (in iOS dataset) achieve almost-uniform error patterns. The FB/FBM matrix contrasts, only at a first look, the earlier result shown in [18], referring to an older (smaller and class-imbalanced) version of the dataset. ...
Article
The massive adoption of hand-held devices has led to the explosion of mobile traffic volumes traversing home and enterprise networks, as well as the Internet. Traffic Classification (TC), i.e. the set of procedures for inferring (mobile) applications generating such traffic, has become nowadays the enabler for highly-valuable profiling information (with certain privacy downsides), other than being the workhorse for service differentiation/blocking. Nonetheless, the design of accurate classifiers is exacerbated by the rising adoption of encrypted protocols (such as TLS), hindering the suitability of (effective) deep packet inspection approaches. Also, the fast-expanding set of apps and the moving-target nature of mobile traffic make design solutions with usual machine learning, based on manually- and expert-originated features, outdated and unable to keep the pace. For these reasons Deep Learning (DL) is here proposed, for the first time, as a viable strategy to design practical mobile traffic classifiers based on automatically-extracted features, able to cope with encrypted traffic, and reflecting their complex traffic patterns. To this end, different state-of-the-art DL techniques from (standard) TC are here reproduced, dissected (highlighting critical choices), and set into a systematic framework for comparison, including also a performance evaluation workbench. The latter outcome, although declined in the mobile context, has the applicability appeal to the wider umbrella of encrypted TC tasks. Finally, the performance of these DL classifiers is critically investigated based on an exhaustive experimental validation (based on three mobile datasets of real human users' activity), highlighting the related pitfalls, design guidelines, and challenges.
... To overcome the above issues, recognizing applications based on flow statistical fingerprints by means of Machine Learning (ML) has been the dominant trend in the field of NTC [5], [6]. ML classifiers automatically build a model to map flow-level statistical features to applications during a training phase and then use such model to classify new flows. ...
... In terms of ML-based approaches for classifying anomalies, the field of automatic traffic analysis and classification through ML techniques has been extensively studied during the last half-decade. A standard non-exhaustive list of supervised ML-based approaches includes the use of Bayesian classifiers [19], linear discriminant analysis and k-nearest-neighbors [20], decision trees [23] and feature selection techniques [21], and support vector machines [22]. Unsupervised and semi-supervised learning techniques have also been used before for traffic analysis and classification, including the use of k-means, DBSCAN, and AutoClass clustering [24], sub-space clustering techniques [26], [28], and a combination of k-means and maximum-likelihood clusters labeling [25]. ...
... Additionally, collective traffic statistics from multiple flows were used to achieve greater classification accuracy. Similarly, Carela-Español et al. [34] used k-dimensional trees to implement an online real-time classifier using only initial packets from flows and destination port numbers for classification. de Donato et al. [35] introduced a comprehensive traffic identification engine (TIE) incorporating several modular classifier plug-ins, using the available input traffic features to select the classifier(s), merging the obtained results from each, and giving the final classification output. ...
Article
Traffic classification utilizing flow measurement enables operators to perform essential network management. Flow accounting methods such as NetFlow are, however, considered inadequate for classification requiring additional packet-level information, host behaviour analysis, and specialized hardware limiting their practical adoption. This paper aims to overcome these challenges by proposing a two-phased machine learning classification mechanism with NetFlow as input. The individual flow classes are derived per application through k-means and are further used to train a C5.0 decision tree classifier. As part of validation, the initial unsupervised phase used flow records of fifteen popular Internet applications that were collected and independently subjected to k-means clustering to determine unique flow classes generated per application. The derived flow classes were afterwards used to train and test a supervised C5.0-based decision tree. The resulting classifier reported an average accuracy of 92.37% on approximately 3.4 million test cases, increasing to 96.67% with adaptive boosting. The classifier specificity factor, which accounted for differentiating content-specific from supplementary flows, ranged between 98.37% and 99.57%. Furthermore, the computational performance and accuracy of the proposed methodology in comparison with similar machine learning techniques lead us to recommend its extension to other applications in achieving highly granular real-time traffic classification.
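
Several excerpts above mention k-dimensional (k-d) trees as a way to speed up nearest-neighbor lookups in traffic classification. The following is a minimal, generic sketch of that data structure, not the implementation from any cited paper. The function names and the two-feature flow vectors (a destination port and a mean packet size) are hypothetical choices made for illustration.

```python
def build_kdtree(points, depth=0):
    """Build a k-d tree from a list of (feature_tuple, label) pairs."""
    if not points:
        return None
    axis = depth % len(points[0][0])  # cycle through the feature axes
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2  # the median sample becomes the splitting node
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, query, best=None):
    """Return (squared_distance, (features, label)) of the nearest sample."""
    if node is None:
        return best
    features, _ = node["point"]
    d = sq_dist(features, query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - features[node["axis"]]
    near, far = (
        (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    )
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if diff ** 2 < best[0]:
        best = nearest(far, query, best)
    return best


# Hypothetical training flows: (destination port, mean packet size) -> label.
flows = [
    ((80.0, 500.0), "web"),
    ((53.0, 80.0), "dns"),
    ((443.0, 1200.0), "web"),
    ((22.0, 200.0), "ssh"),
]
tree = build_kdtree(flows)
```

A query such as `nearest(tree, (50.0, 90.0))` returns the stored flow closest to the query together with its squared distance, so the predicted label is the second element of the returned sample. The pruning step is what lets the search skip whole subtrees, which is the efficiency gain the excerpts refer to.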