Conference Paper

A flow trace generator using graph-based traffic classification techniques

Abstract and Figures

We propose a novel methodology to generate realistic network flow traces to enable systematic evaluation of network monitoring systems in various traffic conditions. Our technique uses a graph-based approach to model the communication structure observed in real-world traces and to extract traffic templates. By combining extracted and user-defined traffic templates, realistic network flow traces that comprise normal traffic and customized conditions are generated in a scalable manner. A proof-of-concept implementation demonstrates the utility and simplicity of our method to produce a variety of evaluation scenarios. We show that the extraction of templates from real-world traffic leads to a manageable number of templates that still enable accurate re-creation of the original communication properties on the network flow level.
... By merely altering the combination of the profiles, it is possible to control the composition (e.g., protocols) and statistical characteristics (e.g., request times, packet arrival times, burst rates, volume) of the resulting data set. Siska et al. (2010) propose a template-based approach using graph-theoretic metrics to define behavioral profiles present in traffic scenarios such as attacks, anomalies, and borderline cases. At first, a directed graph is created by analyzing the interactions between hosts on a given service port. ...
Article
In contrast to previous surveys, the present work is not focused on reviewing the datasets used in the network security field. Many of the available public labeled datasets represent the network behavior just for a particular time period. Given the rate of change in malicious behavior and the serious challenge of labeling and maintaining these datasets, they quickly become obsolete. Therefore, this work focuses on the analysis of current labeling methodologies applied to network-based data. In the field of network security, the process of labeling a representative network traffic dataset is particularly challenging and costly, since very specialized knowledge is required to classify network traces. Consequently, most of the current traffic labeling methods are based on the automatic generation of synthetic network traces, which hides many of the essential aspects necessary for a correct differentiation between normal and malicious behavior. Alternatively, a few other methods incorporate non-expert users in the labeling process of real traffic with the help of visual and statistical tools. However, after conducting an in-depth analysis, it seems that all current methods for labeling suffer from fundamental drawbacks regarding the quality, volume, and speed of the resulting dataset. This lack of consistent methods for continuously generating a representative dataset with an accurate and validated methodology must be addressed by the network security research community. Moreover, a consistent labeling methodology is a fundamental condition for the acceptance of novel detection approaches based on statistical and machine learning techniques.
... This input traffic should serve as a baseline for normal user behavior. Then, FLAME and ID2T add malicious network traffic by editing ... [81] present a graph-based flow generator which extracts traffic templates from real network traffic. Then, their generator uses these traffic templates in order to create new synthetic flow-based network traffic. ...
Article
Labeled data sets are necessary to train and evaluate anomaly-based network intrusion detection systems. This work provides a focused literature survey of data sets for network-based intrusion detection and describes the underlying packet- and flow-based network data in detail. The paper identifies 15 different properties to assess the suitability of individual data sets for specific evaluation scenarios. These properties cover a wide range of criteria and are grouped into five categories such as data volume or recording environment for offering a structured search. Based on these properties, a comprehensive overview of existing data sets is given. This overview also highlights the peculiarities of each data set. Furthermore, this work briefly touches upon other sources for network-based data such as traffic generators and data repositories. Finally, we discuss our observations and provide some recommendations for the use and the creation of network-based data sets.
... Then, FLAME and ID2T add malicious network traffic by editing values of input traffic or by injecting synthetic flows under consideration of typical attack patterns. Siska et al. [82] propose another approach for generating synthetic network traffic. The authors present a graph-based flow generator which extracts traffic templates from real network traffic. ...
Preprint
Labeled data sets are necessary to train and evaluate anomaly-based network intrusion detection systems. This work provides a focused literature survey of data sets for network-based intrusion detection and describes the underlying packet- and flow-based network data in detail. The paper identifies 15 different properties to assess the suitability of individual data sets for specific evaluation scenarios. These properties cover a wide range of criteria and are grouped into five categories such as data volume or recording environment for offering a structured search. Based on these properties, a comprehensive overview of existing data sets is given. This overview also highlights the peculiarities of each data set. Furthermore, this work briefly touches upon other sources for network-based data such as traffic generators and traffic repositories. Finally, we discuss our observations and provide some recommendations for the use and creation of network-based data sets.
... Stiborek et al. [30] use an anomaly score to evaluate their generated data. Siska et al. [31] and Iannucci et al. [32] build graphs and evaluate the diversity of the generated traffic by comparing the number of nodes and edges between generated and real network traffic. Other flow-based network traffic generators often focus on specific aspects in their evaluation, e.g. ...
Article
Flow-based data sets are necessary for evaluating network-based intrusion detection systems (NIDS). In this work, we propose a novel methodology for generating realistic flow-based network traffic. Our approach is based on Generative Adversarial Networks (GANs) which achieve good results for image generation. A major challenge lies in the fact that GANs can only process continuous attributes. However, flow-based data inevitably contain categorical attributes such as IP addresses or port numbers. Therefore, we propose three different preprocessing approaches for flow-based data in order to transform them into continuous values. Further, we present a new method for evaluating the generated flow-based network traffic which uses domain knowledge to define quality tests. We use the three approaches for generating flow-based network traffic based on the CIDDS-001 data set. Experiments indicate that two of the three approaches are able to generate high quality data.
... Stiborek et al. [30] use an anomaly score to evaluate their generated data. Siska et al. [31] and Iannucci et al. [32] build graphs and evaluate the diversity of the generated traffic by comparing the number of nodes and edges between generated and real network traffic. Other flow-based network traffic generators often focus on specific aspects in their evaluation, e.g. ...
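The node and edge comparison mentioned in this excerpt can be illustrated with a minimal Python sketch. It is not the evaluation code of any cited work: the `Flow` tuple, the helper names `graph_size` and `relative_difference`, and the sample addresses are all assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

# Hypothetical minimal flow record: (source IP, destination IP).
Flow = Tuple[str, str]


def graph_size(flows: Iterable[Flow]) -> Tuple[int, int]:
    """Return (number of nodes, number of distinct directed edges)."""
    nodes: Set[str] = set()
    edges: Set[Flow] = set()
    for src, dst in flows:
        nodes.update((src, dst))
        edges.add((src, dst))
    return len(nodes), len(edges)


def relative_difference(real: int, generated: int) -> float:
    """Relative deviation of the generated count from the real count."""
    if real == 0:
        return 0.0 if generated == 0 else float("inf")
    return abs(generated - real) / real


# Illustrative data only; not taken from any data set.
real_flows = [
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.1", "10.0.0.3"),
    ("10.0.0.2", "10.0.0.3"),
]
generated_flows = [
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.1", "10.0.0.3"),
]

real_nodes, real_edges = graph_size(real_flows)
gen_nodes, gen_edges = graph_size(generated_flows)
print("node deviation:", relative_difference(real_nodes, gen_nodes))
print("edge deviation:", relative_difference(real_edges, gen_edges))
```

A small deviation in both counts suggests that the generated traffic reproduces the communication structure at this coarse level; it says nothing about timing or volume.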
Preprint
Flow-based data sets are necessary for evaluating network-based intrusion detection systems (NIDS). In this work, we propose a novel methodology for generating realistic flow-based network traffic. Our approach is based on Generative Adversarial Networks (GANs) which achieve good results for image generation. A major challenge lies in the fact that GANs can only process continuous attributes. However, flow-based data inevitably contain categorical attributes such as IP addresses or port numbers. Therefore, we propose three different preprocessing approaches for flow-based data in order to transform them into continuous values. Further, we present a new method for evaluating the generated flow-based network traffic which uses domain knowledge to define quality tests. We use the three approaches for generating flow-based network traffic based on the CIDDS-001 data set. Experiments indicate that two of the three approaches are able to generate high quality data.
Article
Network traffic workloads are widely utilized in applied research to verify correctness and to measure the impact of novel algorithms, protocols, and network functions. We provide a comprehensive survey of traffic generators referenced by researchers over the last 13 years, providing in-depth classification of the functional behaviors of the most frequently cited generators. These classifications are then used as a critical component of a methodology presented to aid in the selection of generators derived from the workload requirements of future research.
Conference Paper
Generating realistic traffic in a network is a complex problem that is usually approached by first establishing a traffic model. A traffic model is a representation, as close to reality as possible, of the behavior of the packets and messages that traverse a network in a specific scenario. This article therefore starts from an analysis of the state of the art of the technologies that enable traffic generation, and then introduces the design of a system whose goal is to generate realistic traffic based on user behavior by means of simulated users (NPC, Non-Playable Characters). The system is fully automated and self-adapting to virtualized scenarios, for use in Cyber Range platforms employed in cybersecurity training.
Chapter
This chapter takes you on a journey through Internet traffic, from understanding its profile (i.e., by modeling and analysis) to generating packets or flows (either real or synthetic) in diverse environments. It covers the usual decisions that network engineers and researchers face when designing performance evaluation experimental plans that are concerned with traffic generation. Suppose that you have measured and collected sufficient Internet traffic in your core network to derive a statistical model of the aggregate traffic. Now you want to use this analytical model for traffic prediction or capacity planning purposes in other what-if (a.k.a. sensitivity) analysis scenarios [14], via Systems Operational Dependency Analysis (SODA), for example [13]. If your further analysis will be conducted in a simulation environment, you either need to use the available models or to bring your traffic model into the environment as accurately as possible. If your sensitivity analysis will be done in a test bed, you have to assess the adequacy of your hardware- or software-based traffic generator. Sections 4.1 and 4.2 provide an overview of traffic analysis by looking at recent advances in traffic identification and classification and then discussing techniques and tools to effectively profile network traffic in a scalable fashion. Section 4.3 provides some examples of models that can be used to generate traffic. It is worth emphasizing that both traffic analysis and traffic modeling are very broad fields of investigation. Section 4.3 also deals with workload generation. There is a particular interest in methods that effectively and efficiently mimic network traffic in a certain layer of the Internet protocol stack. Last but not least, Sect. 4.4 discusses the world of simulation and emulation of computer network protocols and services. There is a massive amount of material on these topics, which makes it impossible to condense them into a single book chapter. However, there are many references, so the interested reader can delve deeper.
Article
Little is publicly known about traffic in non-educational data centers. Recent studies made some knowledge available, which gives us the opportunity to create more realistic traffic models for data-center research. We used this knowledge to create the first publicly available traffic generator that produces realistic traffic between hosts in data centers of arbitrary size. We characterize traffic by using six probability distribution functions and concentrate on the generation of traffic at the flow level. The distribution functions are easily exchangeable, so that up-to-date traffic characteristics can be used whenever new data is available from publications or own experiments. Moreover, in data centers, traffic between hosts in the same rack and traffic between hosts in different racks have different properties. We model this phenomenon, making our generated traffic very realistic. We carefully evaluated our approach and conclude that it reproduces these characteristics with accuracy.
Article
This paper presents Swing, a closed-loop, network-responsive traffic generator that accurately captures the packet interactions of a range of applications using a simple structural model. Starting from observed traffic at a single point in the network, Swing automatically extracts distributions for user, application, and network behavior. It then generates live traffic corresponding to the underlying models in a network emulation environment running commodity network protocol stacks. We find that the generated traffic is statistically similar to the original traffic. Furthermore, to the best of our knowledge, we are the first to reproduce burstiness in traffic across a range of time-scales using a model applicable to a variety of network settings. An initial sensitivity analysis reveals the importance of our individual model parameters to accurately reproduce such burstiness. Finally, we explore Swing's ability to vary user characteristics, application properties, and wide-area network conditions to project traffic characteristics into alternate scenarios.
Conference Paper
Monitoring network traffic and detecting unwanted applications has become a challenging problem, since many applications obfuscate their traffic using unregistered port numbers or payload encryption. Apart from some notable exceptions, most traffic monitoring tools use two types of approaches: (a) keeping traffic statistics such as packet sizes and inter-arrivals, flow counts, byte volumes, etc., or (b) analyzing packet content. In this paper, we propose the use of Traffic Dispersion Graphs (TDGs) as a way to monitor, analyze, and visualize network traffic. TDGs model the social behavior of hosts ("who talks to whom"), where the edges can be defined to represent different interactions (e.g. the exchange of a certain number or type of packets). With the introduction of TDGs, we are able to harness a wealth of tools and graph modeling techniques from a diverse set of disciplines.
Conference Paper
Network traffic can be represented by a Traffic Dispersion Graph (TDG) that contains an edge between two nodes that send a particular type of traffic (e.g., DNS) to one another. TDGs have recently been proposed as an alternative way to interpret and visualize network traffic. Previous studies have focused on static properties of TDGs using graph snapshots in isolation. In this work, we represent network traffic with a series of related graph instances that change over time. This representation facilitates the analysis of the dynamic nature of network traffic, providing additional descriptive power. For example, DNS and P2P graph instances can appear similar when compared in isolation, but the way the DNS and P2P TDGs change over time differs significantly. To quantify the changes over time, we introduce a series of novel metrics that capture changes both in the graph structure (e.g., the average degree) and the participants (i.e., IP addresses) of a TDG. We apply our new methodologies to improve graph-based traffic classification and to detect changes in the profile of legacy applications (e.g., e-mail).
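As a rough illustration of the idea, the following Python sketch builds one TDG per time window and tracks the average degree across windows. It makes several assumptions that are not taken from the paper: the destination port stands in for the traffic type, edges are treated as undirected, the windows are fixed and non-overlapping, and the `FlowRecord` fields and function names are hypothetical.

```python
from collections import defaultdict
from typing import FrozenSet, Iterable, List, NamedTuple, Set, Tuple


class FlowRecord(NamedTuple):
    start: float   # flow start time in seconds (assumed field)
    src: str       # source host
    dst: str       # destination host
    dst_port: int  # used here as a stand-in for the traffic type


Edge = FrozenSet[str]


def build_tdg(flows: Iterable[FlowRecord], dst_port: int) -> Set[Edge]:
    """Undirected edge set for all flows on the given destination port."""
    return {
        frozenset((f.src, f.dst))
        for f in flows
        if f.dst_port == dst_port and f.src != f.dst
    }


def average_degree(edges: Set[Edge]) -> float:
    """Average degree 2|E| / |V| over the nodes that appear in the graph."""
    nodes: Set[str] = set().union(*edges) if edges else set()
    return 2 * len(edges) / len(nodes) if nodes else 0.0


def average_degree_series(
    flows: Iterable[FlowRecord], dst_port: int, window: float
) -> List[Tuple[int, float]]:
    """Average degree of the TDG in each non-overlapping time window."""
    buckets = defaultdict(list)
    for f in flows:
        buckets[int(f.start // window)].append(f)
    return [
        (idx, average_degree(build_tdg(buckets[idx], dst_port)))
        for idx in sorted(buckets)
    ]


# Illustrative flows only.
flows = [
    FlowRecord(0.0, "a", "dns", 53),
    FlowRecord(1.0, "b", "dns", 53),
    FlowRecord(2.0, "a", "web", 80),
    FlowRecord(61.0, "c", "dns", 53),
    FlowRecord(62.0, "a", "b", 53),
]
series = average_degree_series(flows, dst_port=53, window=60.0)
print(series)
```

Comparing such a series for two applications is one simple way to see differences that a single snapshot would hide.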
Conference Paper
There are several remaining open questions in the area of flow-based anomaly detection, e.g., how to do meaningful evaluations of anomaly detection mechanisms; how to get conclusive information about the origin and nature of an anomaly; or how to detect low intensity attacks. In order to answer these questions, network traffic traces that are representative for a specific test environment, and that contain anomalies with selected characteristics are a prerequisite. In this work, we present FLAME, a tool for injection of hand-crafted anomalies into a given background traffic trace. This tool combines the controllability offered by simulation with the realism provided by captured traffic traces. We present the design and prototype implementation of FLAME, and show how it is applied to inject three example anomalies into a given flow trace. We believe that FLAME can contribute significantly to the development and evaluation of advanced anomaly detection mechanisms.
Conference Paper
LiTGen is an easy to use and tune open-loop traffic generator that statistically models wireless traffic on a per user and application basis. We first show how to calibrate the underlying hierarchical model, from packet level capture originating in an ISP wireless network. Using wavelet and semi-experiments analysis, we then prove LiTGen’s ability to reproduce accurately the captured traffic burstiness and internal properties over a wide range of timescales. In addition, the flexibility of LiTGen enables us to investigate the sensitivity of the traffic structure with respect to the possible distributions of the random variables involved in the model. Finally, this study helps in understanding the traffic scaling behaviors and their corresponding internal structure. Key words: traffic generator, scaling behaviors, energy plot, semi-experiments
Conference Paper
Evaluating network components such as network intrusion detection systems, firewalls, routers, or switches suffers from the lack of available network traffic traces that on the one hand are appropriate for a specific test environment but on the other hand have the same characteristics as actual traffic. Instead of just capturing traffic and replaying the trace, we identify a set of packet trace manipulation operations that enable us to generate a trace bottom-up: our trace primitives can be traces from different environments or artificially generated ones; our basic operations include merging of two traces, moving a flow across time, duplicating a flow, and stretching a flow's time-scale. After discussing the potential as well as the dangers of each operation with respect to analysis at different protocol layers, we present a framework within which these operations can be realized and show an example configuration for our prototype.
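The four basic operations named above can be sketched on a simplified in-memory trace. This is only an illustration under stated assumptions, not the framework described in the paper: the `Packet` fields, the function names, and the sample values are hypothetical, and the sketch changes timestamps and flow identifiers only, without rewriting headers or payloads.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Packet:
    ts: float      # timestamp in seconds
    flow_id: str   # hypothetical flow identifier
    size: int      # packet size in bytes


Trace = List[Packet]


def by_time(trace: Trace) -> Trace:
    """Return the packets ordered by timestamp."""
    return sorted(trace, key=lambda p: p.ts)


def merge(a: Trace, b: Trace) -> Trace:
    """Merge two traces into one time-ordered trace."""
    return by_time(a + b)


def move_flow(trace: Trace, flow_id: str, offset: float) -> Trace:
    """Shift every packet of one flow by a fixed time offset."""
    return by_time(
        [replace(p, ts=p.ts + offset) if p.flow_id == flow_id else p for p in trace]
    )


def duplicate_flow(trace: Trace, flow_id: str, new_id: str, offset: float) -> Trace:
    """Add a time-shifted copy of one flow under a new identifier."""
    copies = [
        replace(p, ts=p.ts + offset, flow_id=new_id)
        for p in trace
        if p.flow_id == flow_id
    ]
    return merge(trace, copies)


def stretch_flow(trace: Trace, flow_id: str, factor: float) -> Trace:
    """Scale the inter-packet gaps of one flow relative to its first packet."""
    times = [p.ts for p in trace if p.flow_id == flow_id]
    if not times:
        return by_time(trace)
    start = min(times)
    return by_time(
        [
            replace(p, ts=start + (p.ts - start) * factor) if p.flow_id == flow_id else p
            for p in trace
        ]
    )


# Illustrative traces only.
trace_a = [Packet(0.0, "f1", 100), Packet(1.0, "f1", 200)]
trace_b = [Packet(0.5, "f2", 50)]

merged = merge(trace_a, trace_b)
moved = move_flow(merged, "f2", 1.0)
duplicated = duplicate_flow(merged, "f1", "f1-copy", 10.0)
stretched = stretch_flow(merged, "f1", 2.0)
print([p.ts for p in stretched])
```

Each function returns a new list and leaves its input untouched, so operations can be chained without side effects. A real implementation would also have to keep protocol state consistent, which is the kind of risk the abstract refers to.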
Conference Paper
The ability to generate repeatable, realistic network traffic is critical in both simulation and testbed environments. Traffic generation capabilities to date have been limited to either simple sequenced packet streams typically aimed at throughput testing, or to application-specific tools focused on, for example, recreating representative HTTP requests. In this paper we describe Harpoon, a new application-independent tool for generating representative packet traffic at the IP flow level. Harpoon generates TCP and UDP packet flows that have the same byte, packet, temporal and spatial characteristics as measured at routers in live environments. Harpoon is distinguished from other tools that generate statistically representative traffic in that it can self-configure by automatically extracting parameters from standard Netflow logs or packet traces. We provide details on Harpoon's architecture and implementation, and validate its capabilities in controlled laboratory experiments using configurations derived from flow and packet traces gathered in live environments. We then demonstrate Harpoon's capabilities in a router benchmarking experiment that compares Harpoon with commonly used throughput test methods. Our results show that the router subsystem load generated by Harpoon is significantly different, suggesting that this kind of test can provide important insights into how routers might behave under actual operating conditions.
Conference Paper
We describe Harpoon, a new application-independent tool for generating representative packet traffic at the IP flow level. Harpoon is a configurable tool for creating TCP and UDP packet flows that have the same byte, packet, temporal, and spatial characteristics as measured at routers in live environments. We validate Harpoon using traces collected from a live router and then demonstrate its capabilities in a series of router performance benchmark tests.
Conference Paper
This paper presents Swing, a closed-loop, network-responsive traffic generator that accurately captures the packet interactions of a range of applications using a simple structural model. Starting from observed traffic at a single point in the network, Swing automatically extracts distributions for user, application, and network behavior. It then generates live traffic corresponding to the underlying models in a network emulation environment running commodity network protocol stacks. We find that the generated traces are statistically similar to the original traces. Further, to the best of our knowledge, we are the first to reproduce burstiness in traffic across a range of timescales using a model applicable to a variety of network settings. An initial sensitivity analysis reveals the importance of capturing and recreating user, application, and network characteristics to accurately reproduce such burstiness. Finally, we explore Swing's ability to vary user characteristics, application properties, and wide-area network conditions to project traffic characteristics into alternate scenarios.